Vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

I can’t quite understand the relationship, if any, between vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (let’s shorten that to PWGM). As I understand it, PWGM is related to the number of work-items that can execute together on the processing elements of a compute unit. For a scalar data type this makes sense: say a GPU with 16 VLIW4 SIMD units can execute 16 work-items per clock, pipelined over four clocks, which gives 16 × 4 = 64 work-items. What happens with vector types such as float4 or double2? My intuition says that PWGM should decrease by a factor equal to the ILP explicitly expressed through the vector type (so if float gave 64 work-items, float4 would give 16). However, every time I query PWGM it returns the same result (64 work-items in this example).
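The arithmetic in that example can be written out as a short C sketch. It only models the numbers quoted above (16 work-items per clock, four clocks, a vector width of 4) and does not query any device; both function names are hypothetical and exist purely for illustration.

```c
#include <stddef.h>

/* Work-items that execute together in the example above:
 * work-items issued per clock times the number of clocks an
 * instruction is pipelined over.
 * Illustrative model only, not an OpenCL API call. */
size_t work_items_in_flight(size_t items_per_clock, size_t clocks)
{
    return items_per_clock * clocks;
}

/* The intuition from the question: divide the scalar figure by the
 * width of the vector type (4 for float4, 2 for double2).
 * This is the expectation being questioned, not what the runtime reports. */
size_t intuition_scaled_by_vector_width(size_t scalar_items, size_t vector_width)
{
    return scalar_items / vector_width;
}
```

With the numbers from the example, work_items_in_flight(16, 4) gives 64 and intuition_scaled_by_vector_width(64, 4) gives 16, whereas the actual query keeps returning 64.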

This also leads me to wonder: if PWGM is independent of the data type, why must I query it through clGetKernelWorkGroupInfo, which is only available after building the program and creating the kernel? Shouldn’t this query be available from clGetDeviceInfo?

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is most closely related to the warp size (the number of work-items in a warp or wavefront), although it’s not the same thing.

“This leads me to also wonder that if PWGM is independent of the data type then why must I query it from clGetKernelWorkGroupInfo, which is only available after building the program and kernel? Shouldn’t this query be available from clGetDeviceInfo?”

Because, in principle, it depends on the kernel. Let’s say that your kernel uses float16 everywhere and has little or no flow control. If your hardware has a native SIMD width of 16, a smart compiler may decide that you have already done all the work of vectorizing the code and that the hardware can run your kernel as-is, with one work-item per SIMD unit. In that case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE would be 1.

The kernel described above is uncommon. Most kernels you see are written in scalar form, and if your hardware natively runs 16-float-wide SIMD instructions it makes more sense to map 16 work-items to one SIMD unit. In that case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be 16.
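One practical use of the per-kernel value is sizing the launch. The sketch below is a minimal illustration in plain C, assuming the multiple has already been obtained for the specific kernel through clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. It is passed in as an ordinary parameter so the code compiles without the OpenCL headers, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (multiple must be > 0). */
size_t round_up_to_multiple(size_t n, size_t multiple)
{
    size_t remainder = n % multiple;
    return remainder == 0 ? n : n + (multiple - remainder);
}

/* Pick a local (work-group) size: the largest multiple of the preferred
 * multiple that does not exceed max_work_group_size. Falls back to
 * max_work_group_size if not even one multiple fits (or if it is 0). */
size_t pick_local_size(size_t preferred_multiple, size_t max_work_group_size)
{
    if (preferred_multiple == 0 || preferred_multiple > max_work_group_size)
        return max_work_group_size;
    return (max_work_group_size / preferred_multiple) * preferred_multiple;
}
```

For the scalar kernel above (multiple of 16) and a problem of 1000 elements, round_up_to_multiple(1000, 16) pads the global size to 1008, and the kernel would then need to skip the extra work-items. For the float16 kernel (multiple of 1) the size stays at 1000. Because the two kernels can report different values on the same device, the helpers take the multiple per call rather than once per device.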

I hope this sheds some light on why things work the way they do.