Specify the number of compute units for execution of kernel.

Sayantan · June 3, 2011, 11:41pm

Hello,
My GPU has 10 compute units. Using all of them at once makes the system unresponsive during execution, although everything returns to normal afterwards. I would like to use only 5 compute units so that the system stays responsive while the kernel runs. How do I specify the number of compute units to use?

david.garcia · June 4, 2011, 4:52am

Try submitting less work each time so that overall the machine is still responsive. For example, instead of executing 100,000 work-groups in one call to clEnqueueNDRangeKernel(), make ten calls and run only 10,000 work-groups in each of them. The “global_work_offset” parameter of clEnqueueNDRangeKernel() should come in handy.

The method above will work well even on older hardware.

Sayantan · June 4, 2011, 6:24am

Thanks for your help. I understand what you are saying, but can you give me an example of how to use global_work_offset in clEnqueueNDRangeKernel()? Suppose I have 100 work-items and I want to split them into two halves, 0-49 and 50-99. Assume work_dim=1.

ilektrik · June 4, 2011, 12:52pm

You can try something like this:

int workSize = 100;  // total number of work-items
int chunkSize = 50;  // work-items per pass
int passes = 2; // workSize / chunkSize, obvious in this example
size_t globalWorkSize[1] = {chunkSize};
size_t globalWorkOffset[1] = {0};

for (int i = 0; i < passes; i++)
{
    clEnqueueNDRangeKernel(GPUCommandQueue, OpenCL, 1, globalWorkOffset, globalWorkSize, NULL, 0, NULL, NULL);
    // read results by clEnqueueReadBuffer() with blocking set to CL_TRUE
    globalWorkOffset[0] += globalWorkSize[0]; // advance to the next chunk
}

Hope code is OK

Of course take a look at:
http://www.khronos.org/registry/cl/sdk/ … ernel.html
http://www.khronos.org/registry/cl/sdk/ … uffer.html

david.garcia · June 4, 2011, 2:18pm

Yeah, something like what ilektrik suggests. If I may make a couple of little changes,


int chunkSize = 50; // work-items per pass
int passes = 2; // this value is obvious in this example
size_t globalWorkSize = chunkSize;
size_t globalWorkOffset = 0;

for (int i = 0; i < passes; i++)
{
    clEnqueueNDRangeKernel(GPUCommandQueue, OpenCL, 1, &globalWorkOffset, &globalWorkSize, NULL, 0, NULL, NULL);
    globalWorkOffset += globalWorkSize; // advance to the next chunk
}
// Here you can read results by clEnqueueReadBuffer()
// with blocking set to CL_TRUE

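Just keep in mind that within each pass get_global_size(0) now returns 50 instead of 100. get_global_id(0) already includes the offset, so it runs from 0 to 49 in the first pass and from 50 to 99 in the second. Call get_global_offset(0) if your kernel needs the offset itself to know which portion of the computation it is executing.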
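
When the total number of work-items is not an exact multiple of the chunk size, the last pass has to be shorter. Here is a minimal sketch of that arithmetic in plain C, with no OpenCL dependency. The Chunk struct and the plan_chunks() helper are hypothetical names made up for illustration and are not part of the OpenCL API. Each entry holds the values you would pass as global_work_offset[0] and global_work_size[0] for one call to clEnqueueNDRangeKernel().

```c
#include <stddef.h>

/* Hypothetical helper type: one pass of the split launch. */
typedef struct {
    size_t offset; /* value to pass as global_work_offset[0] */
    size_t size;   /* value to pass as global_work_size[0]   */
} Chunk;

/* Split `total` work-items into passes of at most `chunk_size` items.
 * Stores at most `max_chunks` entries in `out` and returns the number
 * of passes needed (0 if `total` or `chunk_size` is 0). */
static size_t plan_chunks(size_t total, size_t chunk_size,
                          Chunk *out, size_t max_chunks)
{
    size_t count = 0;
    size_t offset = 0;

    if (chunk_size == 0)
        return 0;

    while (offset < total) {
        size_t remaining = total - offset;
        /* The last pass may be shorter than chunk_size. */
        size_t size = remaining < chunk_size ? remaining : chunk_size;

        if (count < max_chunks) {
            out[count].offset = offset;
            out[count].size = size;
        }
        count++;
        offset += size;
    }
    return count;
}
```

For 100 work-items and a chunk size of 50 this gives two passes, (offset 0, size 50) and (offset 50, size 50). For 105 work-items it gives a third pass with offset 100 and size 5. If you pass a non-NULL local_work_size in OpenCL 1.x, every global_work_size must be a multiple of it, so a short last pass would have to be rounded up and the extra work-items guarded against inside the kernel.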

Sayantan · June 4, 2011, 8:06pm

Thanks a lot, guys. The technique works great.