After reading several books and googling, I still don’t fully understand the right way to choose the work-item dimensions.
What I understood is that we can either let OpenCL choose them automatically or set them “manually” ourselves.
Taking a 1024*1024 image as an example:
size_t globalThreads[] = { 1024, 1024 };
- Automatically
status = clEnqueueNDRangeKernel( commandQueue, kernl, 2, NULL, globalThreads, NULL, 0, NULL, NULL);
Here the work-group size (the local_work_size argument) is set to NULL, so the OpenCL implementation chooses it.
- Manually
The second way is to take the maximum work-group size from the device info (deviceInfo.maxWorkGroupSize) and fill each work group with as many work items as possible. I want as few work groups as possible because, as I understand it, parallelism is not guaranteed among work groups, whereas it is guaranteed among the work items within a group. So the main goal is to have fewer work groups and, on the other hand, to maximize the number of work items per work group.
/* Largest divisor of 1024 that fits in the maximum work-group size. */
for (i = (int)deviceInfo.maxWorkGroupSize; i > 0; i--) {
    if (1024 % i == 0) { res2 = i; break; }
}
/* Largest divisor of 1024 that fits in the remaining budget. */
for (i = ((int)deviceInfo.maxWorkGroupSize) / res2; i > 0; i--) {
    if (1024 % i == 0) { res1 = i; break; }
}
size_t globalThreads[] = { 1024, 1024 };
size_t localThreads[] = { res2, res1 };
cl_event ndrEvt;
status = clEnqueueNDRangeKernel(commandQueue, kernl, 2, NULL, globalThreads, localThreads, 0, NULL, &ndrEvt);
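The divisor search above can be sketched as a small helper in plain C, with no OpenCL calls, so it can be tested on its own. The names `largest_divisor_at_most` and `pick_local_size_2d` are hypothetical, and `max_group_size` is assumed to be at least 1 and to come from a query such as CL_DEVICE_MAX_WORK_GROUP_SIZE or CL_KERNEL_WORK_GROUP_SIZE. This is only one possible way to pick the sizes, not a recommended policy.

```c
#include <stddef.h>

/* Largest divisor of `global` that does not exceed `limit`.
   Falls back to 1, which divides every size. */
static size_t largest_divisor_at_most(size_t global, size_t limit)
{
    if (limit > global)
        limit = global;
    for (size_t d = limit; d > 1; --d) {
        if (global % d == 0)
            return d;
    }
    return 1;
}

/* Fill local[0..1] so that each local[i] divides global[i] and
   local[0] * local[1] does not exceed max_group_size (assumed >= 1). */
static void pick_local_size_2d(const size_t global[2],
                               size_t max_group_size,
                               size_t local[2])
{
    /* Dimension 0 takes as much of the budget as it can. */
    local[0] = largest_divisor_at_most(global[0], max_group_size);
    /* Dimension 1 gets whatever budget is left over. */
    local[1] = largest_divisor_at_most(global[1], max_group_size / local[0]);
}
```

For a 1024*1024 range and a maximum of 256 this yields a 256 x 1 group. Two caveats: the result should also respect the per-dimension limits reported by CL_DEVICE_MAX_WORK_ITEM_SIZES, and whether a row-shaped group or a squarer tile (for example 16 x 16) performs better depends on the kernel's memory access pattern, so it is worth measuring both.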
Another way is to use CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, though I’m not sure how it really works. I’ve implemented it this way, but I’m still not sure about it:
/* Query the kernel's preferred work-group size multiple for this device. */
clGetKernelWorkGroupInfo(kernl, devices[0], CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferredGroupSize, NULL);
size_t globalThreads[] = { 1024, 1024 };
size_t localThreads[] = { maxItems / preferredGroupSize, preferredGroupSize };
cl_event ndrEvt;
status = clEnqueueNDRangeKernel(commandQueue, kernl, 2, NULL, globalThreads, localThreads, 0, NULL, &ndrEvt);
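For context, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is a performance hint: the total work-group size should ideally be a multiple of the returned value, but a size that is not a multiple is still valid. Below is a minimal sketch of one way to use it, again in plain C without OpenCL calls. The function names are hypothetical, and `multiple` and `max_group_size` are assumed to hold the values already queried from the kernel and device. Making dimension 0 a multiple of the hint is enough to make the whole group size a multiple of it.

```c
#include <stddef.h>

/* Largest d <= limit such that d is a multiple of `multiple`
   and d divides `global`. Returns 0 if no such d exists. */
static size_t largest_aligned_divisor(size_t global, size_t limit,
                                      size_t multiple)
{
    if (multiple == 0)
        return 0;
    if (limit > global)
        limit = global;
    /* Start at the largest multiple that fits, then step down by `multiple`. */
    for (size_t d = limit - limit % multiple; d >= multiple; d -= multiple) {
        if (global % d == 0)
            return d;
    }
    return 0;
}

/* Returns 1 and fills local[0..1] on success. Returns 0 if no aligned size
   divides global[0]; the caller can then fall back to another strategy,
   such as passing NULL as the local size. */
static int pick_aligned_local_size_2d(const size_t global[2],
                                      size_t max_group_size,
                                      size_t multiple,
                                      size_t local[2])
{
    size_t x = largest_aligned_divisor(global[0], max_group_size, multiple);
    if (x == 0)
        return 0;
    local[0] = x;

    /* Spend the remaining budget on dimension 1, keeping it a divisor. */
    size_t y = max_group_size / x;
    if (y > global[1])
        y = global[1];
    while (y > 1 && global[1] % y != 0)
        --y;
    local[1] = (y > 0) ? y : 1;
    return 1;
}
```

With a 1024*1024 range, a maximum of 256 and a hint of 64, this gives 256 x 1. For a range such as 1000*1000 no multiple of 64 divides 1000, so the function reports failure instead of guessing.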