I’m having a really hard time getting good performance out of OpenCL on GPU devices. The best I’ve got so far is about 2x the performance of a single CPU core.
After some really hackish profiling (AFAIK there’s no GPU OpenCL time profiler for the Mac, which is the platform I use, and IMHO this makes OpenCL a very difficult API to optimize for), I came to the conclusion that my bottleneck is GPU global memory access: each work item reads 6 floats and 3 ints from global memory and writes back 1 int (10 global memory accesses in total, each one 4 bytes wide). Apart from the memory accesses, each work item performs the following computations: 6 dot products, 8 float subtractions, 4 int subtractions, and 5 int bitwise ANDs.
Reading the NVIDIA OpenCL optimization docs, it’s obvious that 10 global memory accesses per work item have a severe impact on performance: if I’m reading the docs correctly, there’s a latency of about 400 to 600 cycles (whoa!) for each global memory access. I suppose my accesses are coalesced, because they’re done as array reads indexed by the work item’s global ID, which I believe meets the coalescing requirements. But, as I said, there’s no GPU OpenCL profiler for the Mac, so I can’t check whether that’s actually the case.
My first attempt was to use constant memory for the kernel input data. It didn’t help: performance was the same.
The last resort I have is to try local memory.
If I’m understanding local memory correctly, the 16KB of local memory that my NVIDIA GPUs report isn’t per compute unit, but a total. So, I understand that if I have 6 compute units, each workgroup will have, in the best case, 16KB/6, which is about 2.7KB. It’s small, but some of the input data can fit there, cutting those 10 global memory accesses almost in half.
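The arithmetic behind that estimate can be written out as a small C helper. This is only a sketch: it takes the reading above (the reported 16KB is a device-wide total, split evenly across compute units) as an assumption and does not verify it, and the function names are hypothetical.

```c
#include <stddef.h>

/* Bytes of local memory per compute unit IF the reported size is a
   device-wide total split evenly across units. That split is the
   assumption made in the post, not a verified fact. */
size_t local_bytes_per_unit(size_t reported_bytes, size_t compute_units)
{
    return reported_bytes / compute_units;   /* integer division, rounds down */
}

/* Number of 4-byte values (float or int) that fit in a given byte budget. */
size_t values_that_fit(size_t bytes)
{
    return bytes / 4;
}
```

With 16KB and 6 compute units this gives 2730 bytes per unit, enough for 682 floats or ints, so a 30-int array (120 bytes) would fit comfortably even under this pessimistic reading.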
Now, my main question:
How can I transfer data from global memory to local memory? I think it’s done with async_work_group_copy(), but I couldn’t find any understandable code snippet.
For example, imagine I want to transfer an array of 30 ints from global memory to local memory. I want this to happen only in the first workgroup of each compute unit, so that all other workgroups can access the already-initialized local memory.
How would I code the starting lines of my kernel, so that async_work_group_copy() is executed only by the first group in each compute unit?
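For reference, here is a minimal sketch of what such a kernel opening might look like for the 30-int case. One caveat: in the OpenCL memory model, __local memory belongs to a single workgroup and does not carry over to later workgroups on the same compute unit, and async_work_group_copy() has to be reached by every work item of a group with the same arguments. The sketch therefore has every workgroup perform its own copy rather than only the first one. The kernel name `lookup`, the `keys` buffer, and the plain-C fallback section at the top are hypothetical, added only so the sketch is self-contained and can also be compiled on the CPU.

```c
/* Plain-C fallback (hypothetical): a serial emulation so the sketch can be
   compiled and stepped through on the CPU. A real OpenCL compiler defines
   __OPENCL_VERSION__ and skips this whole section. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#include <string.h>
#define __kernel
#define __global
#define __local
typedef int event_t;
static size_t emu_global_id;   /* set by the host loop before each call */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_global_id; }
static event_t async_work_group_copy(int *dst, const int *src, size_t n, event_t ev)
{
    memcpy(dst, src, n * sizeof *dst);
    return ev;
}
static void wait_group_events(int n, event_t *evs) { (void)n; (void)evs; }
#endif

#define TABLE_LEN 30   /* the 30 ints from the example above */

__kernel void lookup(__global const int *table,          /* TABLE_LEN ints */
                     __global const unsigned int *keys,  /* one per work item */
                     __global int *out)                  /* one per work item */
{
    /* One instance of this array exists per workgroup, for that group's lifetime. */
    __local int cached[TABLE_LEN];

    /* Every work item in the group must reach this call with the same arguments. */
    event_t ev = async_work_group_copy(cached, table, TABLE_LEN, 0);
    wait_group_events(1, &ev);   /* the copied data is usable after this returns */

    size_t gid = get_global_id(0);
    out[gid] = cached[keys[gid] % TABLE_LEN];   /* reads now hit local memory */
}
```

Whether this is actually faster than reading the 30 ints directly from global or constant memory depends on how often each work item reuses the table, which the sketch does not establish.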
I chose OpenCL about a year ago because I prefer multiplatform APIs and compatibility. But I miss a convenient path to proper optimization so much that I think I’m going to give CUDA a try. I prefer the OpenCL concept, but I feel really lost here.
Thanks.