CPU vs GPU optimizations

Hello

I have implemented a straightforward naive matrix multiplication in OpenCL with the AMD SDK. I get a speedup of around 16× on just an 8-core CPU system, running only on the CPU. I then applied some popular optimizations, namely private memory and local memory, and grouped my work-items in one dimension so that I use both global and local work sizes. Now I get a speedup of around 24× on the same 8-core CPU.
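For reference, the naive version is essentially the textbook one-work-item-per-output-element kernel, along these lines (a simplified sketch, not my exact code):

[code]
/* Naive matrix multiply: one work-item computes one element of C,
   reading A and B straight from global memory on every iteration. */
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}
[/code]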
First, I wonder how this much speedup is possible, because on 8 cores I normally get a speedup of around 8× or less with OpenMP, for example. These figures of 16× and 24× amaze me.
Second, local + private memory and work-item grouping are optimizations that I have heard are only for GPUs, not CPUs, so I again wonder how I get such a boost in speedup when I run only on the CPU.
Third, I wonder how local memory, private memory, and grouping are handled on CPUs, given that they clearly cause a speedup. Is it caches, processor registers, or something else? Getting this much speedup feels like magic…

Please help me clarify this, because I am new to OpenCL and it is giving me such big performance gains that I can hardly believe it. I have verified the results and they are perfectly accurate.
Thanks in advance

SIMD instructions such as SSE + multithreading.

Maybe it’s how you measure things? Using local memory on a CPU should give you no performance increase, as it is the same as global (host) memory.

Of course, registers are used as much as possible on CPUs, but besides that you are only left with multithreading and vectorized instructions. As I said before, local and global memory are no different there. You can verify this by querying the device’s local memory type.
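For example, a quick query like this (a minimal sketch; device is assumed to be a valid cl_device_id) will report CL_GLOBAL on typical CPU implementations:

[code]
#include <stdio.h>
#include <CL/cl.h>

/* Report whether the device has dedicated local memory (CL_LOCAL, typical
   for GPUs) or emulates it on top of global memory (CL_GLOBAL, typical
   for CPUs). */
void print_local_mem_type(cl_device_id device)
{
    cl_device_local_mem_type type;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(type), &type, NULL);
    printf("Local memory type: %s\n",
           type == CL_LOCAL ? "CL_LOCAL (dedicated)"
                            : "CL_GLOBAL (emulated)");
}
[/code]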

[quote=“matthiasv”]

SIMD instructions such as SSE + multithreading.

Maybe it’s how you measure things? Using local memory on a CPU should give you no performance increase, as it is the same as global (host) memory.

Of course, registers are used as much as possible on CPUs, but besides that you are only left with multithreading and vectorized instructions. As I said before, local and global memory are no different there. You can verify this by querying the device’s local memory type.[/quote]

I measure it by passing the CL_QUEUE_PROFILING_ENABLE flag when creating the command queue; then, after enqueuing each kernel, I call clFinish() on the queue and measure the time with the corresponding event. That is pretty much the standard way of measuring kernel execution time. I even run it multiple times and average the results, so there is no problem with my time measurements. Then why do I get a many-times speedup from private/local memory plus work-item grouping, compared with the simple kernel that only uses global memory and single work-items, if, as you said, these optimizations aren’t for CPUs? I really wonder why.
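In other words, my timing code follows the usual event-profiling pattern, roughly like this (simplified; queue, kernel, global, and local stand in for my actual variables, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

[code]
cl_event evt;
cl_ulong start, end;

/* Enqueue the kernel and wait for the queue to drain;
   note that clFinish() takes the queue, not the kernel. */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clFinish(queue);

/* Device timestamps are in nanoseconds. */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("Kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);
[/code]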

There might be some cache effects due to better alignment of memory accesses, but from my point of view this is just speculation. Are you on AMD or Intel OpenCL? If the latter, you can inspect the compilation result with the Intel Offline Compiler and see what it generates for the simple kernel versus the more advanced one.

I am using the AMD OpenCL implementation…
Also, is SIMD utilization or auto-vectorization possible even if I haven’t used OpenCL vector types? And can local/private memory boost performance on CPUs? I am confused, because someone told me that for CPU devices there is no local memory in OpenCL, so there is no benefit, and that it only gives performance gains on GPUs…

[quote]Also, is SIMD utilization or auto-vectorization possible even if I haven’t used OpenCL vector types?[/quote]

With a good compiler, yes.
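For instance, a CPU implementation may pack several work-items of a purely scalar kernel into one SSE/AVX register (so-called implicit vectorization), even with no vector types in the source. A hypothetical example:

[code]
/* Plain scalar kernel: no float4, no explicit SIMD. A vectorizing CPU
   compiler can still process several work-items per SSE/AVX instruction. */
__kernel void scale(__global const float *in,
                    __global float *out,
                    float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}
[/code]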

[quote]And can local/private memory boost performance on CPUs? I am confused, because someone told me that for CPU devices there is no local memory in OpenCL, so there is no benefit[/quote]

While CPUs do not have dedicated local memory, writing your algorithm in a way that takes advantage of local memory will often improve cache performance.
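For illustration, here is a minimal sketch of such a tiled matrix multiply (assuming square N×N matrices with N a multiple of TILE, and a TILE×TILE work-group size). On a GPU the __local tiles live in on-chip memory; on a CPU they are ordinary memory, but the blocked access pattern keeps each tile hot in the cache:

[code]
#define TILE 16

__kernel void matmul_tiled(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           int N)
{
    __local float Atile[TILE][TILE];
    __local float Btile[TILE][TILE];

    int row = get_global_id(1);
    int col = get_global_id(0);
    int lr  = get_local_id(1);
    int lc  = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        /* Each work-item stages one element of each tile. */
        Atile[lr][lc] = A[row * N + t * TILE + lc];
        Btile[lr][lc] = B[(t * TILE + lr) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Partial dot product over this tile. */
        for (int k = 0; k < TILE; ++k)
            acc += Atile[lr][k] * Btile[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
[/code]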

Thank you so much for this helpful information…