Work Group and Work Item sizes

Hello all,
I’m having a bit of trouble understanding what my work group size and number of work items should be. Beyond that, I’m having trouble finding out how large these can be on the hardware I have.

The problem I’m trying to parallelize boils down to factoring a very large number which has only two factors (other than 1 and itself). The kernel will only ever have one number at a time to factor. Does this mean that I can use a single 1D work group, or is there some benefit to making this 2D or 3D?

I have been able to query the max work group size via CL_DEVICE_MAX_WORK_GROUP_SIZE. However, if I use that as the work group size and 1 as the number of work groups, I get a CL_INVALID_WORK_GROUP_SIZE error from enqueueNDRangeKernel(). Here is what I tried:


size_t groupSize;
devices[0].getInfo((cl_device_info)CL_DEVICE_MAX_WORK_GROUP_SIZE, &groupSize);

...

err = queue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,
        cl::NDRange(1),
        cl::NDRange(groupSize),
        NULL,
        &event);

The code I am writing will be run on multiple computers, so I need some way of dynamically calculating both the work group size and the number of work groups. Also, am I correct in thinking that the total number of concurrently running “threads” = the work group size * the number of work groups? I’m not sure “thread” is the correct word there, but I mean the number of instances of the specified kernel that are executing.

I really appreciate any help.

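The third parameter to enqueueNDRangeKernel is the global size, i.e. the total number of work items, not the number of work groups. The fourth parameter is the local size, i.e. the number of work items per work group. In OpenCL 1.x the global size must be evenly divisible by the local size, so a global size of 1 with a local size of groupSize is rejected with CL_INVALID_WORK_GROUP_SIZE. If you want one work group, pass groupSize as the third argument as well.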
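Here is a minimal sketch of why the call in the question fails, assuming the OpenCL 1.x rule that the global size must be a multiple of the local size. isValidLaunch is a hypothetical helper written in plain C++, not an OpenCL function, and it only mirrors two of the checks the runtime performs; a real launch can still be rejected for other reasons (for example a lower per-kernel limit reported by CL_KERNEL_WORK_GROUP_SIZE).

```cpp
#include <cstddef>

// Hypothetical helper (not part of OpenCL). It mirrors two OpenCL 1.x checks:
//   1. the local size must be non-zero and no larger than the device limit
//   2. the global size must be evenly divisible by the local size
bool isValidLaunch(std::size_t globalSize,
                   std::size_t localSize,
                   std::size_t maxGroupSize) {
    if (localSize == 0 || localSize > maxGroupSize) {
        return false;
    }
    return globalSize % localSize == 0;
}
```

With groupSize = 256 (an assumed value), isValidLaunch(1, 256, 256) is false, which matches the call in the question. isValidLaunch(256, 256, 256) is true, which corresponds to passing cl::NDRange(groupSize) as both the third and the fourth argument.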

Also, I’m a bit worried that launching one work group won’t make the best use of the machine. A single work group runs on one compute unit, so the rest of the device sits idle. You will probably need to launch many more work groups to use the machine efficiently.
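Since the code has to run on different machines, one way to pick the sizes at run time is sketched below. Both functions are hypothetical helpers in plain C++, and the "groups per compute unit" factor is an assumed tuning knob rather than anything OpenCL prescribes. The inputs would come from device queries such as CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_MAX_WORK_GROUP_SIZE.

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of localSize that is >= workItems.
// Useful when the amount of work is not already a multiple of the group size.
// Assumes localSize > 0.
std::size_t roundUpToMultiple(std::size_t workItems, std::size_t localSize) {
    return ((workItems + localSize - 1) / localSize) * localSize;
}

// Hypothetical heuristic: launch a few work groups per compute unit so that
// every compute unit has something to do. groupsPerUnit is a guess to tune.
std::size_t chooseGlobalSize(std::size_t computeUnits,
                             std::size_t groupsPerUnit,
                             std::size_t localSize) {
    return computeUnits * groupsPerUnit * localSize;
}
```

For example, with 8 compute units, 4 groups per unit and a local size of 256 (all assumed numbers), chooseGlobalSize gives 8192 work items, i.e. 32 work groups. roundUpToMultiple(1000, 256) gives 1024; if you pad the global size like this, the kernel should return early for any get_global_id(0) beyond the real amount of work.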