First off, this is my first OpenCL program, so please be gentle.
I use the AMD OpenCL implementation, with the command queue on my CPU (host == device).
OpenCL time: 978ms
OpenMP time: 266ms
One thread time: 279ms
Why is performance this bad on the same CPU?
I understand this is memory-bound rather than compute-bound work, but on the same device I would expect at least comparable results.
As far as I can tell there is no buffer copy, since I use CL_MEM_USE_HOST_PTR at buffer creation.
The GPU also has restrictions on memory allocation. I want to do very large sparse matrix-vector multiplications for Finite Element Analysis, and the matrix and vector cannot fit in the GPU's limited memory (my ATI Radeon only allows OpenCL buffer allocations of about 100 MB). If I constantly write and read small pieces of this big matrix and vector to and from the GPU, won't that cost performance?
OpenMP code:
#pragma omp parallel for
for(size_t z = 0; z < SIZE; z++)
c[z] = a[z] + b[z];
OpenCL code:
status  = clEnqueueWriteBuffer(*cqueue, *ba, CL_FALSE, 0, SIZE * sizeof(float), a, 0, NULL, NULL);
status |= clEnqueueWriteBuffer(*cqueue, *bb, CL_FALSE, 0, SIZE * sizeof(float), b, 0, NULL, NULL);
status |= clSetKernelArg(*kernel, 0, sizeof(cl_mem), ba);
status |= clSetKernelArg(*kernel, 1, sizeof(cl_mem), bb);
status |= clSetKernelArg(*kernel, 2, sizeof(cl_mem), bc);
size_t dim[1] = { SIZE };
status |= clEnqueueNDRangeKernel(*cqueue, *kernel, 1, NULL, dim, NULL, 0, NULL, NULL);
status |= clEnqueueReadBuffer(*cqueue, *bc, CL_TRUE, 0, SIZE * sizeof(float), c, 0, NULL, NULL);
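One thing I have been wondering about: with CL_MEM_USE_HOST_PTR, do the explicit clEnqueueWriteBuffer/clEnqueueReadBuffer calls still force copies? My understanding is that on a CPU device the zero-copy path is map/unmap instead. A sketch of what I mean (untested, error handling elided), assuming bc was created with CL_MEM_USE_HOST_PTR over c:

```c
cl_int err;
/* Map instead of read: on a CPU device the runtime can hand back the
   original host pointer rather than copying SIZE floats. */
float *mapped = (float *)clEnqueueMapBuffer(*cqueue, *bc, CL_TRUE, CL_MAP_READ,
                                            0, SIZE * sizeof(float),
                                            0, NULL, NULL, &err);
/* ... read results through mapped[0 .. SIZE-1] ... */
clEnqueueUnmapMemObject(*cqueue, *bc, mapped, 0, NULL, NULL);
```

Would switching to this pattern remove the transfer cost, or is the overhead elsewhere?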
Kernel code:
__kernel void vector_add(__global float *A, __global float *B, __global float *C)
{
size_t idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
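I also tried to think about whether launching SIZE tiny work-items is itself the problem on a CPU device. One variant I am considering (hypothetical, `vector_add4` is my own naming) gives each work-item four elements via float4, so the runtime schedules SIZE / 4 work-items instead of SIZE (assuming SIZE is a multiple of 4):

```c
/* Sketch: one float4 (4 elements) per work-item.
   Launch with global size SIZE / 4. */
__kernel void vector_add4(__global const float4 *A,
                          __global const float4 *B,
                          __global float4 *C)
{
    size_t idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
```

Is per-work-item scheduling overhead a plausible explanation for the 978 ms, or am I looking in the wrong place?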