I'm working on Mac OS 10.7 with an AMD Radeon 6750M.
I wrote an OpenCL kernel, declared with the following attributes:

Code :
__kernel
__attribute__((vec_type_hint(float4)))
__attribute__((reqd_work_group_size(1, WG_SIZE, 1)))
void my_kernel(...)
{
  // do something with float4 pixels
  local float shared_res;
  local float tmp[WG_SIZE];
  for (int i = 1; i < N; i++) {
     float4 v = read_imagef(...);
     tmp[get_local_id(1)] = foo(v);
     barrier(CLK_LOCAL_MEM_FENCE); // local barrier
     sum(tmp, &shared_res); // sum tmp and write the result to shared_res
     if (shared_res > SOME_VALUE) break;
  }
}

As far as I understand, each work-group runs on one warp (wavefront).
On AMD the wavefront size is 64. Hence, there should generally be no benefit from having more than 16 work-items in each work-group if the vec_type_hint is float4 (and the compiler actually uses this hint).

However, it seems that setting WG_SIZE to 64 rather than 16 gives a ~4x boost in the kernel's running time.
I suspect that the compiler ignores the vec_type_hint(float4) hint and compiles the code without vectorizing the float4 operations (i.e. running them one by one, leaving 75% of the wavefront's lanes idle).

In my specific case, I would like to use a minimal but efficient work-group size, as I have a branch in the kernel that allows me to stop the work-group's job early and save some time (it saves ~80% of the running time in my CPU implementation). Since the break happens in all work-items of the group together, it should not hurt performance (am I right?).

How can I check my hypothesis, or otherwise understand what's going on here and why a larger work-group size gives better performance?
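
One check I'm considering (a sketch; it assumes you already have the built cl_kernel and cl_device_id from the usual host setup) is to ask the runtime for the preferred work-group size multiple, which reflects how it maps work-items to hardware lanes:

```c
/* Sketch: query how the runtime maps work-items to hardware lanes.
   `kernel` and `device` are assumed to come from the usual
   clCreateKernel / clGetDeviceIDs host-side setup. */
#include <stdio.h>
#include <OpenCL/opencl.h>  /* <CL/cl.h> on non-Apple platforms */

void print_wg_info(cl_kernel kernel, cl_device_id device)
{
    size_t preferred_multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple),
                             &preferred_multiple, NULL);
    /* On AMD, a value of 64 would mean each work-item occupies one scalar
       lane of the wavefront, i.e. the float4 work was not packed across
       lanes the way I hoped. */
    printf("preferred work-group size multiple: %zu\n", preferred_multiple);
}
```

If this reports 64, that would match the observed ~4x gap between WG_SIZE 16 and 64. Retrieving the compiled binary via clGetProgramInfo with CL_PROGRAM_BINARIES and inspecting it might confirm this too, though on Apple's implementation the binary may not be readable ISA.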

Thanks in advance,