Best practice for choosing the right work group size

After reading several books and searching online, I still don’t fully understand the right way to choose the dimensions of the work items.

What I understood is that we can either let OpenCL do it automatically or do it “manually” ourselves.
Take a 1024×1024 image as an example:

size_t globalThreads[] = { 1024, 1024 };

  • Automatically

    status = clEnqueueNDRangeKernel( commandQueue, kernl, 2, NULL, globalThreads, NULL, 0, NULL, NULL);

    Setting the work group size to NULL.

  • Manually

    The second way is to take the max work item size from the device info and fill it up with as much data as possible. I want as few work groups as possible, because parallelism is not guaranteed among work groups, whereas among the work items within a group it is. So the main goal is to have as few work groups as possible and, on the other hand, to maximize the work items in each work group.


    for (i = (int)deviceInfo.maxWorkGroupSize; i > 0; i--) {
        if (1024 % i == 0) { res2 = i; break; }
    }
    for (i = ((int)deviceInfo.maxWorkGroupSize) / res2; i > 0; i--) {
        if (1024 % i == 0) { res1 = i; break; }
    }
    size_t globalThreads[] = { 1024, 1024 };
    size_t localThreads[] = { res2, res1 };
    cl_event ndrEvt;
    status = clEnqueueNDRangeKernel( commandQueue, kernl, 2, NULL, globalThreads, localThreads, 0, NULL, &ndrEvt);


    The other way is to use CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which I’m not sure how it really works. I’ve implemented it this way, but I’m still not sure about it:



    clGetKernelWorkGroupInfo(kernl, devices[0], CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(size_t), &preferredGroupSize, NULL);
    size_t globalThreads[] = { 1024, 1024 };
    size_t localThreads[] = { maxItems / preferredGroupSize, preferredGroupSize };
    cl_event ndrEvt;
    status = clEnqueueNDRangeKernel( commandQueue, kernl, 2, NULL, globalThreads, localThreads, 0, NULL, &ndrEvt);

https://software.intel.com/sites/products/documentation/ioclsdk/2013/OG/Work-Group_Size_Recommendations_Summary.htm

I’d love to be proven wrong, but in my opinion and based on my experience, it’s a black art.

It varies by hardware vendor, and I’ve even seen where non-multiples of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE are faster.

The only consistent way I’ve seen is to try them all and benchmark.

These days, using NULL is usually not a ton worse than the optimal size (a few years ago it could be much worse).

So you might consider only hardcoding the work group size for kernels that are designed for a particular size (e.g., use shared local memory for caching).

My intention is to assign work items in a hardware-independent way… Could using NULL be the best solution?

For many applications, yes. You can certainly try to write a function that calculates an optimal work group size, but it will be a challenge. Alternatively, you can benchmark all sizes on the user’s machine and remember it (but run the test again if the hardware or driver changes).