Out of resources problem

Hello everyone.
I have a problem executing a kernel with a large number of threads.
I launch the kernel like this:


size_t count = 200;  /* work sizes are passed as size_t, not int */
clEnqueueNDRangeKernel(cmd_queue, my_kernel, 1, NULL, &count, &count, 0, NULL, NULL);

Everything works fine with count=200, but when I change it to 300 or more I receive a CL_OUT_OF_RESOURCES error immediately after calling clEnqueueNDRangeKernel. The kernel code is below; it simply computes a Cholesky decomposition.


kernel void cholesky_decomposition(global float* A, int n)
{
	unsigned int p = get_local_size(0);
	unsigned int u = get_local_id(0);
	for (unsigned int k=0; k<n; k++) {
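		float s = sqrt(A[k*n + k]);
		// Every work-item must read the pivot before the work-item with i == k overwrites it below.
		barrier(CLK_GLOBAL_MEM_FENCE);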
		for (unsigned int i=k+u; i<n; i+=p) {
			A[i*n + k] /= s;
		}
		barrier(CLK_GLOBAL_MEM_FENCE);
		for (unsigned int j=k+1+u; j<n; j+=p) {
			for (unsigned int i=j; i<n; i++) {
				A[i*n + j] -= A[j*n+k]*A[i*n+k];
			}
		}
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
}

I can’t figure out what is causing this error. The max work-group size of my device (GeForce 8800GT) is 512; it supports OpenCL 1.0, I have the latest driver from the NVIDIA site, and the OS is WinXP SP3. The same problem occurs with different kernels.
I’d appreciate any help with this problem, thanks in advance =)

“local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.” – clEnqueueNDRangeKernel

Maybe try this to see if the problem is your local work size.

The global and local work sizes have to be equal in my algorithm so that I can use barriers. The problem occurs only with large work sizes.

The fact it works for small workgroup sizes leads me to believe you’re exceeding your device’s capabilities with high values.

On Page 107 of the 1.0.48 Specification, Khronos lists the enumerations you can pass to clGetKernelWorkGroupInfo(…) to get specific details about what your device can support for a given kernel.

Since local_size == size of the workgroup, you need to find out the maximum your device supports for this kernel by passing CL_KERNEL_WORK_GROUP_SIZE as param_name to:
cl_int clGetKernelWorkGroupInfo(cl_kernel kernel, cl_device_id device, cl_kernel_work_group_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

If you have to exceed the value returned by that call, you will probably have to tile your task, similar to the example on Page 22 of the same specification.
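
One way to read the tiling suggestion, as a rough sketch only: the loops in the kernel from the first post already step by get_local_size(0), so a work-group smaller than n should still visit every row. The plain C below simulates that strided loop on the host to check the coverage. It assumes n is the matrix dimension and is independent of the work-group size; the helper names count_row_visits and covers_exactly_once are made up for illustration and are not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: simulate the kernel's strided loop
 *     for (i = k + u; i < n; i += p)
 * for every work-item u of a work-group of size p, counting how many
 * times each row index i is visited. */
static void count_row_visits(size_t n, size_t k, size_t p, unsigned *visits)
{
    assert(p > 0);
    memset(visits, 0, n * sizeof *visits);
    for (size_t u = 0; u < p; u++) {            /* u stands in for get_local_id(0)   */
        for (size_t i = k + u; i < n; i += p) { /* p stands in for get_local_size(0) */
            visits[i]++;
        }
    }
}

/* Returns 1 if rows k..n-1 are each visited exactly once and rows
 * below k are never visited, 0 otherwise. Limited to n <= 1024. */
static int covers_exactly_once(size_t n, size_t k, size_t p)
{
    unsigned visits[1024];
    assert(n <= 1024);
    count_row_visits(n, k, p, visits);
    for (size_t i = 0; i < n; i++) {
        unsigned expected = (i >= k) ? 1u : 0u;
        if (visits[i] != expected) {
            return 0;
        }
    }
    return 1;
}
```

For instance, covers_exactly_once(300, 0, 64) checks a 300-row column handled by only 64 work-items. The values 300 and 64 are arbitrary; the real limit has to come from the device.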

Thank you very much, AlexAtOSU! OpenCL limits the work-group size for specific kernels (honestly, I didn’t know that). Although the max work-group size of my device is 512, the max work-group size for the kernel in the first message is much smaller. That was the cause of the error.
Have a good day =)

CL_DEVICE_MAX_WORK_GROUP_SIZE specifies the max. number of work-items in a work-group that the device can support. However, the usable value can vary from kernel to kernel, as it depends on factors such as the number of registers used by the kernel, the amount of local memory, etc. You should always query clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, …) to determine the max. work-group size that can be used for a given kernel.
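
As a minimal sketch of that advice, the host code could clamp the requested size to the per-kernel limit before launching. The helper pick_work_group_size below is hypothetical, and the limit is passed in as a plain number so the logic can be tried without a GPU; the comment shows where the real clGetKernelWorkGroupInfo query would go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: choose a work-group size that does not exceed
 * the per-kernel limit. On a real device kernel_max would be obtained with
 *
 *     size_t kernel_max = 0;
 *     clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
 *                              sizeof(kernel_max), &kernel_max, NULL);
 *
 * where kernel and device are the caller's cl_kernel and cl_device_id. */
static size_t pick_work_group_size(size_t requested, size_t kernel_max)
{
    assert(kernel_max > 0);
    return requested < kernel_max ? requested : kernel_max;
}
```

The result would then be passed as both global_work_size and local_work_size for a single-work-group launch like the one in the first post. This only helps if the kernel's loops stride by get_local_size(0), as they do there.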