Recently I've been experimenting with OpenCL on a laptop with an Intel Core i5-3210M CPU and its integrated HD Graphics 4000 GPU. I am trying to build a picture of which algorithms are faster on which device and why, but some of the results I am getting seem quite strange to me:

I am testing a very simple kernel: just one floating-point operation, no global or local memory access at all, registers only:
__kernel void dummy() {
    float f = 2.0f * 2.0f;
}

However, the CPU wins over the GPU by a wide margin. Here are the execution times reported by the Analyze tool in Intel's Kernel Builder for different global work sizes (CPU first, then GPU):
Global size    CPU time    GPU time
10^3           0.049036    0.418
10^4           0.079028    0.37
10^5           0.11        0.4
10^6           0.66        1
10^7           1.81        7.589
10^8           5.6         68.18

The local work-group size is set to Auto, so the tool tries several configurations over a few iterations and picks the optimal one.

My first hypothesis was that the work per work-item is too small, so the thread-creation overhead (however cheap it may be) outweighs any gains from parallelism.
My next attempt was therefore to increase the work per work-item by adding about 50 more floating-point operations to the kernel body, but that did not change the results significantly.

Could someone explain what I am missing? Thanks.