command queue goes dead after clEnqueueNDRangeKernel?

I’ve been trying to triangulate a really stubborn bug for the past couple of days. I’m testing an algorithm that computes high-dimensional integrals using a parallel “divide and conquer” approach. The outline is fairly simple: starting from two input arrays holding all of the boundary data, each integration work item reads the boundary data for its subregion, computes integral and error estimates, and writes those values into the respective output arrays.
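To make the data layout concrete, each work item behaves roughly like the sketch below. The slice layout a[gid*dim + i] and the kernel body are purely illustrative stand-ins, not my actual code:


__kernel void
integrate_sketch(int dim,
	__global const float* a, __global const float* b,
	__global float* result, __global float* rawerr)
{
	// Illustrative assumption: work item gid owns the boundary slice
	// a[gid*dim .. gid*dim + dim - 1] (and likewise for b).
	size_t gid = get_global_id(0);
	float res = 0.0f, err = 0.0f;
	for (int i = 0; i < dim; ++i) {
		float lo = a[gid * dim + i];   // lower boundary in dimension i
		float hi = b[gid * dim + i];   // upper boundary in dimension i
		res += hi - lo;                // stand-in for the real quadrature rule
		err += 1e-6f * (hi - lo);      // stand-in error estimate
	}
	result[gid] = res;
	rawerr[gid] = err;
}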

I’m building everything on a MacBook Air with an NVIDIA GeForce 320M using Apple’s distribution of OpenCL 1.0.

I want to emphasize that when I run the OpenCL code with CL_DEVICE_TYPE_CPU, the algorithm works perfectly: the computed values agree with the mathematically exact ones up to numerical tolerance.

It is when I run the code with CL_DEVICE_TYPE_GPU that the problems begin.

Instead of posting all of my code, I’ll try to highlight what I think is relevant. The prototype of my kernel looks like this:


__kernel void
mgk13s(int dim, __global void* params, 
	__global const float* a, __global const float* b,
	__global float* result, __global float* rawerr, __global float* resabs)

In my main routine, I denote the number of work items by global_ws. The dimension of the domain (and hence of the boundary data) is denoted by dim. Allocating host memory, creating buffers, and setting the kernel arguments looks like this:


	size_t out_size = global_ws * sizeof(float);
	size_t in_size =  dim * out_size;
	float* a = (float*) malloc(in_size);
	float* b = (float*) malloc(in_size);
	float* result = (float*) malloc(out_size);
	float* rawerr = (float*) malloc(out_size);
	float* resabs = (float*) malloc(out_size);
	
	cl_mem par_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(params), NULL, NULL);
	cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, in_size, NULL, NULL);
	cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, in_size, NULL, NULL);
	cl_mem res_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
	cl_mem err_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
	cl_mem abs_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, out_size, NULL, NULL);
	
	err  = clSetKernelArg(kernel, 0, sizeof(int), &dim);
	err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &par_buf);
	err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &a_buf);
	err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &b_buf);
	err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &res_buf);
	err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &err_buf);
	err |= clSetKernelArg(kernel, 6, sizeof(cl_mem), &abs_buf);

The code that executes the computation outlined above looks like this:


	err = clEnqueueWriteBuffer(queue, par_buf, CL_TRUE, 0, sizeof(params), params, 0, NULL, NULL);
	err |= clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0, in_size, a, 0, NULL, NULL);
	err |= clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0, in_size, b, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
	
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
	err = clEnqueueBarrier(queue);
	if (err != CL_SUCCESS) ABORT(err);

	err = clEnqueueReadBuffer(queue, res_buf, CL_TRUE, 0, out_size, result, 0, NULL, NULL);
	if (err != CL_SUCCESS) ABORT(err);
	err |= clEnqueueReadBuffer(queue, err_buf, CL_TRUE, 0, out_size, rawerr, 0, NULL, NULL);
	err |= clEnqueueReadBuffer(queue, abs_buf, CL_TRUE, 0, out_size, resabs, 0, NULL, NULL);

The ABORT macro just displays the line number and the CL error information, which is how I’m isolating the problem.
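For completeness, ABORT is nothing fancy; it is essentially the following sketch (the real macro may differ slightly):


	// Sketch: report the failing source line and the CL error code, then exit.
	#define ABORT(e) do { \
		fprintf(stderr, "line %d: CL error %d\n", __LINE__, (int)(e)); \
		exit(EXIT_FAILURE); \
	} while (0)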

For certain parameter values, the first clEnqueueReadBuffer call returns an error of type CL_INVALID_COMMAND_QUEUE, even though the clEnqueueNDRangeKernel call returns CL_SUCCESS.

Any ideas here?

Incidentally, the global and local work sizes for this example are local_ws = 64 and global_ws = 4096. These are set at the beginning, using an algorithm that distinguishes between the CPU and GPU:


	size_t local_ws = (DEVICE == CL_DEVICE_TYPE_GPU) ? 64 : 1;
	size_t factor = N / local_ws;
	// Round N up to the nearest multiple of local_ws, as the spec requires.
	size_t global_ws = (N % local_ws) ? ((factor + 1) * local_ws) : N;

For certain parameter values, the first clEnqueueReadBuffer call returns an error of type CL_INVALID_COMMAND_QUEUE, even though the clEnqueueNDRangeKernel call returns CL_SUCCESS.

Any ideas here?

That sounds like enqueuing the kernel works fine, but actually executing it fails for one reason or another. The kernel could be causing an invalid memory access (a page fault), or it could be something else.

Have you passed a callback function “pfn_notify” to clCreateContext()? The callback typically provides more information than the error codes alone.
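Roughly like this (a minimal sketch; the implementation passes a human-readable message in errinfo):


void pfn_notify(const char *errinfo, const void *private_info, size_t cb, void *user_data)
{
	fprintf(stderr, "OpenCL error info: %s\n", errinfo);
}

	/* ... */
	context = clCreateContext(NULL, 1, &device, pfn_notify, NULL, &err);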

No, I haven’t tried that yet. If you have some expertise in this area, what kind of information should I be probing with the callback function?

This really is a frustrating bug. I’ve downloaded other source trees for parallel GPU computations that demand even more resources and follow the same “read in, compute, write back” paradigm, and they compile and run on my machine without issue.

If new information is helpful, I created the context with the following simple callback function:


void pfn_notify(const char *errinfo, const void *private_info, size_t cb, void *user_data)
{
	fprintf(stderr, "%s\n", errinfo);
}

And I edited the enqueuing and read-back part of my code as follows:


	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
	fprintf(stderr, "line %d: err %d\n", __LINE__, err);
	err = clEnqueueBarrier(queue);
	fprintf(stderr, "line %d: err %d\n", __LINE__, err);

	err = clEnqueueReadBuffer(queue, res_buf, CL_TRUE, 0, out_size, result, 0, NULL, NULL);
	fprintf(stderr, "line %d:  err %d\n", __LINE__, err);
	err |= clEnqueueReadBuffer(queue, err_buf, CL_TRUE, 0, out_size, rawerr, 0, NULL, NULL);
...

Passing through the “fprintf(stderr” traps, the output is:


line 171: err 0
line 173: err 0
[CL_INVALID_COMMAND_QUEUE] : OpenCL Fatal Error : Read caused an error that invalidated the queue (0x100403bd0). This may be  due to a resource allocation failure at execution time.
[CL_INVALID_COMMAND_QUEUE] : OpenCL Error : clEnqueueReadBuffer failed: Invalid command queue
line 177:  err -36

The error value -36 is just CL_INVALID_COMMAND_QUEUE. That doesn’t tell me much more than I already knew, but maybe someone else can parse the log better than I can.

Here’s where it gets really weird:
- the identical code executes perfectly when the device CL_DEVICE_TYPE_CPU is specified instead of _GPU, and
- when executing on the device CL_DEVICE_TYPE_GPU but with smaller parameter values, there is no CL_INVALID_COMMAND_QUEUE error; however, the values in result, rawerr and resabs returned from res_buf, err_buf and abs_buf respectively are just zeros.
Thanks to everyone who has taken the time to read this through and think about it for a bit!

OpenCL Fatal Error : Read caused an error that invalidated the queue

In other words, a work-item is dereferencing a bad pointer, and that is causing a page fault. Could it have anything to do with the different local sizes you’ve chosen for GPU and CPU?

Hi David, you might be on to something. I really didn’t know how to select the global and local work sizes, except that the spec requires the global to be evenly divisible by the local.

http://www.khronos.org/registry/cl/sdk/ … ernel.html

For my machine,

- CL_DEVICE_MAX_WORK_ITEM_SIZES = 512 for the GPU, and
- CL_DEVICE_MAX_WORK_ITEM_SIZES = 1 for the CPU (as expected).

I got the numbers local_ws = 64 and global_ws = n * 64 for the GPU from another project that compiles and runs on my machine.
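For reference, here is roughly how I’m reading those limits (a sketch; note that CL_DEVICE_MAX_WORK_ITEM_SIZES is actually a per-dimension array of size_t, while CL_DEVICE_MAX_WORK_GROUP_SIZE is the overall work-group cap):


	size_t max_wg_size;
	size_t max_item_sizes[3];
	clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
		sizeof(max_wg_size), &max_wg_size, NULL);
	clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
		sizeof(max_item_sizes), max_item_sizes, NULL);
	printf("max work-group size: %zu; max item sizes: %zu x %zu x %zu\n",
		max_wg_size, max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);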

Any additional advice on this matter is most welcome. I’ll play around with variations on these values when I get some time later this evening.

If you don’t know which values to pass for the local size, I would simply pass NULL. That tells the OpenCL implementation to choose a suitable value, and you no longer need to worry about the global size being evenly divisible.

That said, it may still trigger the same bug you are seeing today. Why don’t you try a local work size of 1, which is the same value you used for the CPU, where everything seems to work? Although it will run slowly, it will give us more information about what’s going wrong.
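Concretely, both variants are one-line changes (a sketch):


	/* Let the implementation choose the local size: */
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, NULL, 0, NULL, NULL);

	/* ...or force a local size of 1, matching the CPU configuration: */
	size_t local_one = 1;
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, &local_one, 0, NULL, NULL);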

Finally, could you take your kernel code and try to simplify it as much as possible while maintaining the error you are seeing? After you’ve simplified it as much as you can please post the code here and we can give it a look.

Okay. I had a lot of concurrent bugs in my code. The only one that I still can’t eradicate is the CL_INVALID_COMMAND_QUEUE error.

Here is the kernel that I’m using; the source file is test4.cl.


__kernel void
test4(size_t dim, __global float* out)
{
	size_t offset = get_global_id(0);
	
	// set M = n^dim
	size_t M = 1, n = 10;
	for (size_t i = 0; i < dim; ++i) 
		M *= n;
	
	// enumerate multi-indices k = (k_1,..., k_d), 0 <= k_i <= n-1,
	// but don't do anything else!
	for (size_t j = 0; j < M; ++j) {
		size_t m = 1;
		for (size_t i = 0; i < dim; ++i) {
			size_t ki = (j / m) % n;
			m *= n; 
		}
	}
}

If you noticed that this kernel doesn’t do anything except count to 10^dim and do some trivial arithmetic, you’re right!

Here is the host code:


/* test4.c */

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/stat.h>
#include <OpenCL/OpenCL.h>

#define CL_CPPFLAGS ""

#ifdef GPU
#define DEVICE CL_DEVICE_TYPE_GPU
#define DEVICE_STR "test4.cl compiled on GPU"
#else
#define DEVICE CL_DEVICE_TYPE_CPU
#define DEVICE_STR "test4.cl compiled on CPU"
#endif

char* load_program_source(const char *filename);

int main(int argc, char** argv)
{
	size_t dim = (argc > 1) ? atoi(argv[1]) : 8; // dimension of the domain
	size_t global_ws = (argc > 2) ? atoi(argv[2]) : 1; // number of work items
	
	printf("%s
", DEVICE_STR);
		
	const char* quad_kernel = "test4.cl";
	const char* kernel_name = "test4";
	
	int err;
	cl_context context;	
	cl_device_id device;
	cl_command_queue queue;
	cl_program program;
	cl_kernel kernel;
	
	err = clGetDeviceIDs(NULL, DEVICE, 1, &device, NULL);
	
#ifdef __CL_EXT_H  // Apple extension for error logging.
	context = clCreateContext(0, 1, &device, &clLogMessagesToStderrAPPLE, NULL, &err);
#else
	context = clCreateContext(0, 1, &device, NULL, NULL, &err);
#endif
	queue = clCreateCommandQueue(context, device, 0, NULL);
	
	char scratch[2048];
	char* source = load_program_source(quad_kernel);
	
	program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, &err);
	err = clBuildProgram(program, 1, &device, CL_CPPFLAGS, NULL, NULL);
	err |= clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 2048, scratch, NULL);
	if (err != CL_SUCCESS) {
		fprintf(stderr, "[BUILD LOG]
%s
", scratch);
		return err; 
	}
	kernel = clCreateKernel(program, kernel_name, &err);
	
	size_t out_size = global_ws * sizeof(float);
	
	float* out_loc = (float*) malloc(out_size);
	cl_mem out_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY, out_size, NULL, NULL);
	
	err = clSetKernelArg(kernel, 0, sizeof(size_t), &dim);
	err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
	assert(err == CL_SUCCESS);
	
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_ws, NULL, 0, NULL, NULL);
	err |= clEnqueueBarrier(queue);
	assert(err == CL_SUCCESS);
	
	err = clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, out_size, out_loc, 0, NULL, NULL);
	assert(err == CL_SUCCESS);
	
	free(out_loc);
	clReleaseMemObject(out_buf);
	
	clReleaseKernel(kernel);
	clReleaseProgram(program);
	clReleaseCommandQueue(queue);
	clReleaseContext(context);
	
	return 0;
}

char* load_program_source(const char *filename)
{ 
	struct stat statbuf;
	FILE *fh; 
	char *source; 
	
	fh = fopen(filename, "r");
	if (fh == 0)
		return 0; 
	
	stat(filename, &statbuf);
	source = (char *) malloc(statbuf.st_size + 1);
	fread(source, statbuf.st_size, 1, fh);
	fclose(fh);
	source[statbuf.st_size] = '\0'; 
	
	return source; 
} 

One should compile two executables, one for the GPU using the flag -DGPU and the other for the CPU without any flags. Here are the detailed results on my machine, a MacBook Air with an NVIDIA GeForce 320M…


$ clang -framework OpenCL test4.c -DGPU -o gpu
$ clang -framework OpenCL test4.c -o cpu
$ ./cpu && ./gpu 
test4.cl compiled on CPU
test4.cl compiled on GPU
[CL_INVALID_COMMAND_QUEUE] : OpenCL Fatal Error : Read caused an error that invalidated the queue (0x100107010). This may be  due to a resource allocation failure at execution time.
Break on OpenCLFatalBreak to debug.
[CL_INVALID_COMMAND_QUEUE] : OpenCL Error : clEnqueueReadBuffer failed: Invalid command queue
Break on OpenCLErrorBreak to debug.
Assertion failed: (err == CL_SUCCESS), function main, file test4.c, line 103.
...

Any ideas?

I neglected to mention: it’s not some simple arithmetic overflow error in the kernel. Using the parameters set at the beginning of the host file, i.e., dim = 8, the double loop in the kernel counts to 10^8 = 100 million. According to the OpenCL spec, size_t is at least a 32-bit unsigned integer, which doesn’t overflow until about 4.2 billion.
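Here is a throwaway host-side check of that arithmetic, just to rule overflow out:


#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t M = 1;
	for (int i = 0; i < 8; ++i)   /* dim = 8 */
		M *= 10;              /* n = 10  */
	/* Prints M = 100000000, far below UINT32_MAX = 4294967295. */
	printf("M = %llu, UINT32_MAX = %u\n", (unsigned long long)M, UINT32_MAX);
	return 0;
}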

What happens if you replace the kernel with an empty kernel that takes the same parameters and does nothing?
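That is, something with the same signature but an empty body:


__kernel void
test4(size_t dim, __global float* out)
{
	/* intentionally empty */
}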

If I replace the kernel with an empty one, it executes without issue on the GPU. In fact, if I comment out the double loop, i.e., change the code to


//	for (size_t j = 0; j < M; ++j) {
//		size_t m = 1;
//		for (size_t i = 0; i < dim; ++i) {
//			size_t ki = (j / m) % n;
//			m *= n; 
//		}
//	}

but leave everything else the same, it also executes without issue.

It is this simple double loop that is triggering some strange memory error. Incidentally, there is nothing special about the base n = 10. This loop was embedded in an algorithm for which I used several different values: n = 13, 9, 7, and 5. It is the key piece of index arithmetic needed to compute parameters for a tensor-product approximation, so I can’t really get rid of it. What I find most puzzling is that the same loop executes on the CPU without issue, even with lots of other floating-point math going on inside the inner loop!
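To illustrate what that index arithmetic computes: the base-n digits of the linear index j are exactly the components of the multi-index. A quick standalone C demo (the values of dim, n, and j here are arbitrary examples):


#include <stdio.h>

int main(void)
{
	unsigned dim = 3, n = 10, j = 425, m = 1;
	for (unsigned i = 0; i < dim; ++i) {
		unsigned ki = (j / m) % n;        /* i-th base-n digit of j */
		printf("k_%u = %u\n", i + 1, ki); /* prints 5, then 2, then 4 */
		m *= n;
	}
	return 0;
}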

If I replace the kernel with an empty one, it executes without issue on the GPU.

This is odd. A decent compiler would remove that loop altogether (since it doesn’t affect the kernel’s output), and it would never be executed.

It’s possible that Apple is doing just-in-time compilation and the compiler crashes when it tries to JIT that loop. That could make the driver invalidate the whole queue.

File a bug with Apple, since this is quite clearly not an issue with your application.

Thanks David, my intuition had told me pretty much the same thing – of course I didn’t express it as concisely as you did. I had actually already filed a bug report with Apple before this last post, but wanted to follow up here to get input from other members of the OpenCL community.

drjerry: have you been able to fix this bug? I just recently encountered the same thing, but I can work around it by running in DEBUG mode or by getting rid of one value in my code, neither of which is feasible for the application.

Hi dlw, I can’t recall the exact workaround that I used in that project. However, I filed a bug report with Apple, and the engineers got back to me. They basically pointed out that declaring the kernel with a size_t argument is invalid. Specifically: “Refer to section 6.8, item k in the 1.1 specification for a list of types that are not valid kernel argument types.”

That was when I was compiling kernels on a machine that only supported OpenCL 1.0. OS X 10.7 ships with OpenCL 1.1 support (e.g., for Nvidia graphics cards). The Apple engineers later asked me to check whether I could replicate this problem with OpenCL 1.1, and I couldn’t.
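For anyone who lands here with the same error, a sketch of the fix: replace the size_t kernel argument with a type from the spec’s list of valid argument types (uint, say), and pass a matching cl_uint from the host:


/* Kernel side (test4.cl): size_t replaced with uint. */
__kernel void
test4(uint dim, __global float* out)
{
	/* ... body unchanged ... */
}

	/* Host side: pass a matching cl_uint. */
	cl_uint dim_arg = (cl_uint) dim;
	err = clSetKernelArg(kernel, 0, sizeof(cl_uint), &dim_arg);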

Just tried the code. Thanks!