Built-in Functions: Work-Item Functions

Please add a built-in function that returns the preferred work-group size multiple from within a kernel, similar to get_global_size(0), get_local_size(0), etc.

The only workaround I see right now is querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the host and passing the value to the kernel as a preprocessor definition when building the program from source.
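
For reference, a minimal host-side sketch of that workaround, assuming the program has already been built once so a kernel object exists to query; the function and variable names, the 128-byte options buffer, and the lack of error checking are only illustrative:

#include <stdio.h>
#include <CL/cl.h>

/* Query the per-kernel preferred multiple from a first, untuned build, then
   build a second program object with both values baked in as macros.
   Error checking is omitted for brevity. */
static cl_program build_tuned_program(cl_context context, cl_device_id device,
                                      const char *source, cl_kernel untuned_kernel,
                                      size_t local_size)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(untuned_kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);

    char options[128];
    snprintf(options, sizeof(options),
             "-DWORK_GROUP_SIZE=%zu -DPREFERRED_WORK_GROUP_MULTIPLE=%zu",
             local_size, multiple);

    cl_program tuned = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(tuned, 1, &device, options, NULL, NULL);
    return tuned; /* the caller creates the real kernel from this tuned program */
}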

Sean, what can you do with that information inside the kernel? Once an NDRange has been enqueued and a kernel is executing, it has a fixed work-group size, so querying the preferred size at that point cannot make a difference.

The example that came to mind was a parallel reduction. If the work-group size and preferred work-group multiple are passed as preprocessor options, the loops can be fully unrolled, which improves performance. The preferred work-group multiple also seems useful inside the kernel because you can potentially take advantage of the SIMD nature of work-items running in lock step within a wavefront, so the now-unnecessary barriers can be removed for a further gain. Here’s a code snippet of what I had in mind.

#if !defined(WORK_GROUP_SIZE) || !defined(PREFERRED_WORK_GROUP_MULTIPLE)
for (size_t i = local_size; i >= local_multiple; i >>= 1)
#else
#pragma unroll
for (size_t i = WORK_GROUP_SIZE; i >= PREFERRED_WORK_GROUP_MULTIPLE; i >>= 1)
#endif
{ /* reduction step with a barrier to sync multiple wavefronts */ }

#if !defined(WORK_GROUP_SIZE) || !defined(PREFERRED_WORK_GROUP_MULTIPLE)
for (size_t i = local_multiple >> 1; i >= 1; i >>= 1)
#else
#pragma unroll
for (size_t i = PREFERRED_WORK_GROUP_MULTIPLE >> 1; i >= 1; i >>= 1)
#endif
{ /* reduction step without a barrier, only one wavefront */ }
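
For completeness, here is a fuller sketch of the kind of kernel I have in mind, assuming WORK_GROUP_SIZE matches the actual local size at enqueue time; the argument names, the float type, and the volatile pointer in the barrier-free part are only illustrative:

__kernel void reduce(__global const float *in,
                     __global float *out,
                     __local float *scratch)
{
    const size_t lid = get_local_id(0);
    const size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

#if defined(WORK_GROUP_SIZE) && defined(PREFERRED_WORK_GROUP_MULTIPLE)
    /* Tree reduction with barriers while more than one wavefront participates. */
    #pragma unroll
    for (size_t i = WORK_GROUP_SIZE >> 1; i >= PREFERRED_WORK_GROUP_MULTIPLE; i >>= 1)
    {
        if (lid < i)
            scratch[lid] += scratch[lid + i];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Final steps within a single wavefront: relies on lock-step execution,
       and volatile keeps the partial sums from being cached in registers. */
    volatile __local float *v = scratch;
    #pragma unroll
    for (size_t i = PREFERRED_WORK_GROUP_MULTIPLE >> 1; i >= 1; i >>= 1)
    {
        if (lid < i)
            v[lid] += v[lid + i];
    }
#else
    /* Fallback when the values are unknown at compile time: barrier every step. */
    for (size_t i = lsz >> 1; i > 0; i >>= 1)
    {
        if (lid < i)
            scratch[lid] += scratch[lid + i];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
#endif

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}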

Okay, I understand your use case and it does have some appeal. The issue I see is that your code is making one big assumption: that you can skip the barrier once you reach the work-group-size-multiple boundary. That is only true on certain hardware, and only if the returned work-group size multiple is exactly one warp.
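
To make that assumption explicit, the barrier-free path would have to be guarded by something the host has actually verified for the device at hand. As a sketch, with LOCKSTEP_WIDTH being a hypothetical macro the host would pass via -D after a vendor-specific query (it is not anything in the spec):

#if defined(PREFERRED_WORK_GROUP_MULTIPLE) && defined(LOCKSTEP_WIDTH) && \
    (PREFERRED_WORK_GROUP_MULTIPLE == LOCKSTEP_WIDTH)
    /* The multiple is known to be exactly one warp/wavefront on this device. */
    #define CAN_SKIP_BARRIER 1
#else
    #define CAN_SKIP_BARRIER 0
#endif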

We would need to take a step back and see whether something like this can be exposed in a more hardware-agnostic way.

I’ll forward this to the committee to see what they think.

Hi David,

Do you happen to have any feedback from the committee on this suggested function?

I could get in trouble if I share what the committee has discussed. Let’s say that it didn’t make it to CL 1.2. :)

I’ll ask the spec editor if he wants to make an official comment.

Thanks David, and I definitely don’t want to get you into any trouble.

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is intended as a performance hint: the local work-group size specified to clEnqueueNDRangeKernel should be a multiple of this value.
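
In other words, the intended host-side use is simply to choose the enqueue sizes around the queried value, along these lines (error checking omitted, sizes purely illustrative):

#include <CL/cl.h>

/* Pick a local size that is a multiple of the hint and round the global
   size up to it before enqueueing. */
static cl_int enqueue_rounded(cl_command_queue queue, cl_kernel kernel,
                              cl_device_id device, size_t n)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);

    size_t local  = multiple;                           /* or a small multiple of it */
    size_t global = ((n + local - 1) / local) * local;  /* round n up to local */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}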

To do what you are suggesting, the definition of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE would need to be extended to state that it refers to the SIMD or SIMT execution size. If we did that, then it would indeed be correct for the barrier to become a no-op. Another option would be to add a new query that indicates the device’s SIMD execution size.

I don’t believe this specific modification to CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE was discussed by the working group. I’ll bring it up with the group for discussion.