Performance on CPU

Before anything else: this is my first OpenCL program, so please be gentle.

I use the AMD OpenCL implementation, with the command queue on my CPU (host == device).

OpenCL time: 978ms
OpenMP time: 266ms
One thread time: 279ms

  1. Why is performance this bad on the same CPU?
    I understand that this is memory-bound rather than compute-bound work, but on the same device it should at least match the single-threaded result.
    There should be no buffer copy (I believe), since I use CL_MEM_USE_HOST_PTR when creating the buffers.

  2. The GPU has restrictions on memory allocation, and I want to do very large sparse matrix-vector multiplications for Finite Element Analysis. If I keep writing and reading small pieces of these big matrices and vectors to and from the GPU, won't that cost me performance? (The matrix and vector cannot fit in the small GPU memory: my ATI Radeon only allows OpenCL buffer allocations of about 100 MB. A sketch of how to query that limit follows this list.)
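
As an aside, here is a hedged sketch of how those limits can be queried with clGetDeviceInfo; the function name print_mem_limits and the variable device are illustrative, not from the original post.

#include <CL/cl.h>
#include <stdio.h>

/* Print the two limits that matter here: the largest single buffer the device
   accepts (the ~100 MB figure above) and its total global memory. */
void print_mem_limits(cl_device_id device)
{
	cl_ulong max_alloc = 0, global_mem = 0;

	clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
	                sizeof(max_alloc), &max_alloc, NULL);
	clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
	                sizeof(global_mem), &global_mem, NULL);

	printf("max single allocation: %llu MB, global memory: %llu MB\n",
	       (unsigned long long)(max_alloc >> 20),
	       (unsigned long long)(global_mem >> 20));
}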

OpenMP code:

#pragma omp parallel for
for(size_t z = 0; z < SIZE; z++)
	c[z] = a[z] + b[z];

OpenCL code:

/* non-blocking writes of the two input vectors */
status  = clEnqueueWriteBuffer(*cqueue, *ba, CL_FALSE, 0, SIZE * sizeof(float), a, 0, NULL, NULL);
status |= clEnqueueWriteBuffer(*cqueue, *bb, CL_FALSE, 0, SIZE * sizeof(float), b, 0, NULL, NULL);

status |= clSetKernelArg(*kernel, 0, sizeof(cl_mem), ba);
status |= clSetKernelArg(*kernel, 1, sizeof(cl_mem), bb);
status |= clSetKernelArg(*kernel, 2, sizeof(cl_mem), bc);

size_t dim[1] = { SIZE };
status |= clEnqueueNDRangeKernel(*cqueue, *kernel, 1, NULL, dim, NULL, 0, NULL, NULL);

/* blocking read of the result */
status |= clEnqueueReadBuffer(*cqueue, *bc, CL_TRUE, 0, SIZE * sizeof(float), c, 0, NULL, NULL);

Kernel code:

__kernel void vector_add(__global float *A, __global float *B, __global float *C)
{
	size_t idx = get_global_id(0);
	C[idx] = A[idx] + B[idx];
}

I’d probably try:
a) using float4 as the kernel argument type; although OpenMP can probably vectorise that loop, I don’t think AMD’s compiler will do it by itself.
b) using clEnqueueMapBuffer/clEnqueueUnmapMemObject rather than clEnqueueWriteBuffer: a write WILL copy the data, because that is what you told it to do. CL_MEM_USE_HOST_PTR is only useful if you already have the data set up and, for a CPU device, aligned appropriately. (See the sketch after this list.)
c) or just not timing the copies.
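
A hedged sketch of suggestion (b), assuming buffers created with CL_MEM_USE_HOST_PTR over the existing arrays a, b and c (names follow the question, ctx is an assumed cl_context, and error handling is omitted). The explicit write disappears, and the result is mapped instead of read:

cl_int err;
cl_mem ba = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                           SIZE * sizeof(float), a, &err);
cl_mem bb = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                           SIZE * sizeof(float), b, &err);
cl_mem bc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                           SIZE * sizeof(float), c, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &ba);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bb);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bc);

size_t dim[1] = { SIZE };
clEnqueueNDRangeKernel(cqueue, kernel, 1, NULL, dim, NULL, 0, NULL, NULL);

/* Map the result for reading; on a CPU device this can be zero-copy. */
float *result = (float *)clEnqueueMapBuffer(cqueue, bc, CL_TRUE, CL_MAP_READ,
                                            0, SIZE * sizeof(float), 0, NULL, NULL, &err);
/* ... use result, then unmap before the buffer is used again ... */
clEnqueueUnmapMemObject(cqueue, bc, result, 0, NULL, NULL);

The point is simply that with CL_MEM_USE_HOST_PTR the explicit write is redundant, and mapping gives the runtime a chance to avoid the copy where the hardware allows it.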

Still, remember it's only executing on a CPU, so it isn't magic. OpenMP already spreads the work across cores, and the compiler is probably already vectorising such a simple loop.

TBH, if you're only doing OpenCL on a CPU and will never move to other hardware, it's hardly worth the hassle. It's more useful as a debugging tool at this point, although looking at where CPU and GPU designs are heading, they are converging rapidly.

If you only have a tiny GPU card, you can't expect to be able to solve huge problems. A GPU kernel can only work on data that is in physical memory on the card at the time it executes. A whole programme can allocate more memory than that, but each kernel must be able to access its data when it runs. Go buy another card, or be content with investigating smaller problems: the programming techniques are the same either way.
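
To sketch the "small pieces" approach from question 2 concretely (a hypothetical example: process_in_chunks, chunk_buf and CHUNK are made-up names, the kernel is assumed to take the buffer as its only argument, and error handling is omitted), a large host array can be streamed through one reusable device buffer, with one kernel launch per piece:

#define CHUNK (16 * 1024 * 1024) /* elements per piece; must respect the device's allocation limit */

void process_in_chunks(cl_context ctx, cl_command_queue cqueue, cl_kernel kernel,
                       float *host_data, size_t total)
{
	cl_mem chunk_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
	                                  CHUNK * sizeof(float), NULL, NULL);

	for (size_t offset = 0; offset < total; offset += CHUNK) {
		size_t n = (total - offset < CHUNK) ? (total - offset) : CHUNK;

		/* copy one piece up, run the kernel on it, copy the result back */
		clEnqueueWriteBuffer(cqueue, chunk_buf, CL_TRUE, 0, n * sizeof(float),
		                     host_data + offset, 0, NULL, NULL);
		clSetKernelArg(kernel, 0, sizeof(cl_mem), &chunk_buf);
		clEnqueueNDRangeKernel(cqueue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
		clEnqueueReadBuffer(cqueue, chunk_buf, CL_TRUE, 0, n * sizeof(float),
		                    host_data + offset, 0, NULL, NULL);
	}
	clReleaseMemObject(chunk_buf);
}

This is exactly the PCIe traffic the question worries about; using two buffers and non-blocking transfers so that copies overlap with kernel execution is the usual way to hide some of that cost.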

My observations:

  1. Definitely try float4s (or float8s, or float16s) … I achieved a >50% speedup on an AMD 5870 by going from floats to float4s. (A float4 kernel sketch follows this list.)

  2. I tried mapping buffers, and for me, for the most part, it only increased throughput where it was already deficient due to variations in other parameters in my tuning process. Still, where my tuning curve is best, it gives 1 to 3 percent. So I use it, but it's not a miracle for me. CL/GL interop may be the miracle I still seek; yet to be investigated…

  3. I don't know OpenMP, but before going to OpenCL I went multi-CPU-core with Xcode's Grand Central Dispatch. It did give me increased throughput over single-threaded code, but with OpenCL I get (I'd guess) at least double the throughput (CPU only) that I do under GCD. So I'd say that OpenCL may be well worth your while even if you're only using the CPU.

  4. Good luck!
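
To make point 1 concrete, here is a minimal float4 variant of the vector-add kernel from the question (a sketch, not the poster's code); the host would launch it with a global size of SIZE/4, so SIZE must be a multiple of 4:

__kernel void vector_add4(__global float4 *A, __global float4 *B, __global float4 *C)
{
	size_t idx = get_global_id(0);

	/* each work-item now adds four floats at once */
	C[idx] = A[idx] + B[idx];
}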

== Dave

Remember he's talking about a CPU-only speed test, comparing against a CPU loop. It's just timing memory copies, so extra copies are going to add up.

On a discrete GPU it wouldn't make as much difference, since the data needs to be copied one way or another anyway.