CPU vs GPU performance

Hi :)
Recently I’ve been experimenting with OpenCL on a laptop with an i5-3210M processor and an HD 4000 GPU. I am trying to build up a picture of which algorithms run faster on which device and why, but some of the results I am getting seem quite strange to me:

I am testing a very, very simple kernel: just one floating-point operation and no global/local memory access at all, only registers:
__kernel
void dummy() {
    float f = 2.0f * 2.0f;
}

However, the CPU totally wins over the GPU. Here are the results as reported by the Analyze tool in Intel’s Kernel Builder for different global work sizes:

Work size   CPU        GPU
10^3        0.049036   0.418
10^4        0.079028   0.37
10^5        0.11       0.4
10^6        0.66       1
10^7        1.81       7.589
10^8        5.6        68.18

The local work-group size is set to Auto, so the tool tries several sizes over a few iterations and picks the optimal one.
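(For context, I believe the closest plain host-code analogue of Auto is passing NULL for local_work_size in clEnqueueNDRangeKernel, which lets the runtime pick the work-group size itself; a rough sketch, assuming queue and kernel are already set up:)

size_t global_size = 1000000;  // e.g. the 10^6 case
cl_int err = clEnqueueNDRangeKernel(
    queue,         // assumed: a valid cl_command_queue
    kernel,        // assumed: the built "dummy" kernel
    1,             // one-dimensional NDRange
    NULL,          // no global offset
    &global_size,  // total number of work items
    NULL,          // NULL local size = runtime chooses ("Auto")
    0, NULL, NULL);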

My first guess was that the work per work item is too small, so the thread creation overhead (however cheap it may be) outweighs the performance gains.
My next attempt was to increase the work per work item by adding 50 more floating-point operations, but that did not change the results significantly.
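It was something along these lines (not the exact kernel, just the idea):

__kernel
void dummy() {
    float f = 2.0f * 2.0f;
    // ... about 50 more operations like these,
    // all on constants, with f never read back:
    f = f + 1.0f;
    f = f * 3.0f;
    f = f - 2.0f;
}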

Could someone explain what I am missing? Thanks :)

Your example is far too simple. The compiler will first replace the constant expression 2.0f * 2.0f with the constant value 4.0f, then detect that f is never used and optimize it away.

As a result, your kernel probably does nothing at all.

Your kernel should at least write something to an output buffer; that makes the computed values observable, so they cannot be optimized away.
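For example (a minimal sketch; the in/out buffer arguments are just placeholders):

__kernel
void dummy(__global const float* in, __global float* out) {
    int i = get_global_id(0);
    // The operands are only known at run time, so the compiler
    // cannot constant-fold the multiplication, and the store to
    // global memory keeps the result from being eliminated as
    // dead code.
    out[i] = in[i] * 2.0f;
}

On the host side, pass buffers of at least the global work size, and ideally read the output back so the whole pipeline has an observable effect.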

[QUOTE=utnapishtim;29745]Your kernel should at least write something to an output buffer; that makes the computed values observable, so they cannot be optimized away.[/QUOTE]

Yes, apparently that is the case.
The main reason the CPU comes out ahead seems to be that there has to be some meaningful amount of work per work item, perhaps on the order of 100 floating-point operations, to hide the thread creation latency.
I wasn’t seeing this behavior for exactly the reason you pointed out: my ‘slightly more complicated’ kernel was being optimized by the compiler down to the same trivial one. Thanks!
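For anyone reproducing this, a kernel along these lines keeps the extra work from being optimized away (just a sketch; the name and the exact arithmetic are arbitrary):

__kernel
void busy(__global const float* in, __global float* out) {
    int i = get_global_id(0);
    float f = in[i];
    // Each iteration depends on the previous result and on
    // run-time data, so the compiler cannot fold the loop
    // into a single constant.
    for (int k = 0; k < 100; ++k)
        f = f * 1.000001f + 0.5f;
    // The store makes the result observable.
    out[i] = f;
}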