OpenCL newb - OpenCL crashing :(

Hey,

I’m very new to OpenCL and still trying to get a handle on all the boilerplate for setting up the context and kicking off the kernel. What I’m working on is a parallel ray tracer which serves as the core of my MS project. I’m using a primer tutorial to get started. I don’t particularly have time to read through 400-page books at the moment, as the parallel nature isn’t the core of this project (it just needs to be real time).

I’m following this primer tutorial here.

Setup: Win 7 x64, i7-3820, no OC, 16GB memory. The target OpenCL device is an AMD card, but I’m not sure which one (this isn’t my machine). Device Manager says it’s a Radeon HD 7900 series without giving a specific model, and Catalyst Control Center says the same thing. The device ID is 6798, if that means anything, and it seems to have 3GB of memory @ 1.3 GHz. Anyway, it IS OpenCL capable: I know somebody who did an OpenCL project on this system.

So I’m following this guy’s primer, basically following it for the main part, but changing some stuff around (making it less OO), and I have this line of code, which it eats itself on:

status = clEnqueueNDRangeKernel(clCommandQueue, programKernel, 1, NULL, workGroupLength, NULL, 0, NULL, &event);

The kernel function it’s trying to run is defined by this function:

__kernel void adder(__global float* a, __global float* b, __global float* c)
{
	unsigned int i = get_global_id(0);
	c[i] = a[i] + b[i];
}

The command queue is created with this line:

cl_command_queue clCommandQueue = clCreateCommandQueue(clContext, gpuDevice, NULL, &status);

Where I believe the context and device were selected correctly.

The kernel was created with this line:

cl_kernel programKernel = clCreateKernel(clProgramObj, "adder", &status);

Where the program object was created with this line:

cl_program clProgramObj = clCreateProgramWithSource(clContext, 1, &programName, &programLength, &status);

Where programName is a misleadingly named variable: it is a pointer to char that actually points to the kernel source text loaded from the CL source file.

EDIT:
workGroupLength is a pointer to a size_t variable holding the global work size, which is the length of the two 1D buffers I filled to be added together. I filled each buffer with 5 numbers, so the value workGroupLength points to is 5.
/EDIT

EDIT 2:
I forgot to mention, after EVERY OpenCL call, I have a print statement which prints the return value of the call just executed. EVERY call results in CL_SUCCESS.
/EDIT 2
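
For anyone wanting to reproduce that per-call check, here is a minimal sketch of the idea. It is self-contained, so the numeric codes are written out by hand; the assumption is that they match the CL_* macros in CL/cl.h, and with the real header you would switch on the macros instead. clStatusName and checkStatus are hypothetical helper names, not part of OpenCL.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Translate an OpenCL status code into its symbolic name.
// Assumption: these literals mirror the CL_* macros in CL/cl.h.
const char* clStatusName(int status)
{
    switch (status) {
        case   0: return "CL_SUCCESS";
        case  -5: return "CL_OUT_OF_RESOURCES";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -50: return "CL_INVALID_ARG_VALUE";
        case -51: return "CL_INVALID_ARG_SIZE";
        case -52: return "CL_INVALID_KERNEL_ARGS";
        default:  return "unknown OpenCL status";
    }
}

// Hypothetical helper: log which call produced which status and
// report whether that call succeeded.
bool checkStatus(int status, const char* call)
{
    assert(call != nullptr);
    std::fprintf(stderr, "%s -> %s (%d)\n", call, clStatusName(status), status);
    return status == 0;
}
```

Usage would look like checkStatus(status, "clCreateKernel") after each call. Note that a run of CL_SUCCESS results only says the runtime did not detect a problem; it does not prove every argument was what the API expected.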

Anyways, when it hits the first line I listed, I just get a “<program name> has stopped working”, apparently a seg fault. If I choose to debug, it drops me in msvcr100d.dll!_unlock(int), where it appears to be crashing on the only line of code in there (LeaveCriticalSection()), which I did not write. The stack trace goes through like 40 system DLLs, so I can’t really find where shit went bad.

Any help is appreciated.

TIA!

I just downloaded the tutorial’s source and it crashes at the very end, after the numbers have all been added. It crashes on clEnqueueReadBuffer, which reads from the memory object for c in the CL code I listed above (write only for the kernel).

So his is crashing further down than mine…awesome…

This worries me heavily about OpenCL. If this trivial example already has multiple fail points, I hope these fail points don’t go up exponentially with the complexity…

I sure hope these issues are connected…

The fail points stay the same, so it doesn’t go up with complexity. The fail points here are just bad API usage which is part of the learning curve.

I suggest you start with the vendor examples or find another tutorial. If this guy’s code doesn’t even work, it’s just not worth worrying about and is probably leading you down the wrong path.

Hrm, you’re probably right, but it’s like the first tutorial on Google, so it probably got some good thumb-ups :-.

I’m thinking it’s OpenCL version usage?

I’ll post the full code tomorrow…it’s like 10-15 OpenCL calls, if there’s bad usage, anybody with a decent working knowledge of OpenCL should be able to pick it up in a heart beat.

It’s literally adding two 1D buffers (of length 5) and pulling the output out of a third. It doesn’t get much simpler :(. I’ll post it up tomorrow in whole…I just kind of realized now that with only the random snippets above, somebody will have a difficult time following it.

Here’s all the code:

	cl_platform_id platID;
	cl_int status = oclGetPlatformID(&platID);
	status = clGetDeviceIDs(platID, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);

	cl_device_id gpuDevice;
	status = clGetDeviceIDs(platID, CL_DEVICE_TYPE_GPU, numDevices, &gpuDevice, NULL);

	cl_context clContext = clCreateContext(NULL, numDevices, &gpuDevice, clErrorHandler, NULL, &status);

	cl_command_queue clCommandQueue = clCreateCommandQueue(clContext, gpuDevice, NULL, &status);

	const char *programName = loadCLSource("add.cl");

	const size_t programLength = strlen(programName);
	cl_program clProgramObj = clCreateProgramWithSource(clContext, 1, &programName, &programLength, &status);

	delete [] programName;

	status = clBuildProgram(clProgramObj, numDevices, &gpuDevice, NULL, clBuildError, gpuDevice);

	cl_kernel programKernel = clCreateKernel(clProgramObj, "adder", &status);

	int buff_len = 5;
	float *a = new float[buff_len];
	float *b = new float[buff_len];
	for (int i = 0 ; i < buff_len ; i++)
	{
		a[i] = (float)i * 2.0f;
		b[i] = (float)i * 2.0f;
	}
	size_t *workGroupLength = new size_t();
	*workGroupLength = buff_len;

	cl_mem Amem = clCreateBuffer(clContext, CL_MEM_READ_ONLY, sizeof(float) * buff_len, NULL, &status);
	cl_mem Bmem = clCreateBuffer(clContext, CL_MEM_READ_ONLY, sizeof(float) * buff_len, NULL, &status);
	cl_mem Cmem = clCreateBuffer(clContext, CL_MEM_WRITE_ONLY, sizeof(float) * buff_len, NULL, &status);

	status = clEnqueueWriteBuffer(clCommandQueue, Amem, CL_TRUE, 0, sizeof(float) * buff_len, a, 0, NULL, NULL);
	status = clEnqueueWriteBuffer(clCommandQueue, Bmem, CL_TRUE, 0, sizeof(float) * buff_len, b, 0, NULL, NULL);

	status = clSetKernelArg(programKernel, 0, sizeof(cl_mem), Amem);
	status = clSetKernelArg(programKernel, 1, sizeof(cl_mem), Bmem);
	status = clSetKernelArg(programKernel, 2, sizeof(cl_mem), Cmem);

	status = clFinish(clCommandQueue);

	cl_event event;
	status = clEnqueueNDRangeKernel(clCommandQueue, programKernel, 1, NULL, workGroupLength, NULL, 0, NULL, &event); // We die here :(

loadCLSource just opens the source and returns a buffer with it contained.

numDevices is checked to be equal to 1.

Every call’s status is CL_SUCCESS.

Hopefully somebody has some good info on why it’s dying at the end there when I launch the kernel :(.

The add.cl source was posted above, but for convenience sake:

__kernel void adder(__global float* a, __global float* b, __global float* c)
{
	unsigned int i = get_global_id(0);
	c[i] = a[i] + b[i];
}
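
To know what the GPU result should look like once the kernel does run, a plain CPU reference helps. This is only a sketch: adderReference and makeInput are hypothetical helper names, the loop index stands in for get_global_id(0), and the inputs use the same fill pattern as the host code above (a[i] = b[i] = i * 2.0f).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the adder kernel: each loop iteration plays the
// role of one work-item, with i standing in for get_global_id(0).
std::vector<float> adderReference(const std::vector<float>& a,
                                  const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}

// Same fill pattern as the host code: element i holds i * 2.0f.
std::vector<float> makeInput(std::size_t n)
{
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<float>(i) * 2.0f;
    return v;
}
```

With 5 elements the inputs are 0, 2, 4, 6, 8, so the buffer read back from the device should hold 0, 4, 8, 12, 16.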

EDIT:
programName is misleading, it’s actually program source.

Also, all functions passed in as callback function pointers are valid functions.

Also, the build log is printed in the full version and it reports no errors; the return code is CL_SUCCESS.

No ideas? It’s only like 20 OpenCL calls, I feel this stuff should be pretty boilerplate :(.

Yeah but on principle I myself just don’t do C++ because I only do this for fun and C++ ain’t fun. But since you’ve been polite and I’m on leave and a bit bored and nobody else has piped up …

It’s probably this line and those like it:

status = clSetKernelArg(programKernel, 0, sizeof(cl_mem), Amem);

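As with every argument, you need to pass it the address of the memory containing the value, because clSetKernelArg copies the given number of bytes from that address. The value here is the cl_mem handle itself, so pass a pointer to it: &Amem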

(this was in the tutorial you referenced, but as that is spread over many redundant files, is hard to follow)
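
To make the difference concrete without needing an OpenCL install, here is a small stand-in. It is a sketch only: FakeMemObject, fake_mem and fakeSetKernelArg are made-up names that imitate the copy-the-bytes contract described above, not the real implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the opaque OpenCL types; not the real API.
struct FakeMemObject { unsigned char internals[16]; };
typedef FakeMemObject* fake_mem;  // same shape as "typedef struct _cl_mem* cl_mem"

// Imitates the contract: copy arg_size bytes starting at arg_value
// into the kernel's argument slot, and return what ended up in the slot.
std::uintptr_t fakeSetKernelArg(std::size_t arg_size, const void* arg_value)
{
    std::uintptr_t slot = 0;
    assert(arg_size <= sizeof(slot));
    std::memcpy(&slot, arg_value, arg_size);
    return slot;
}

// Correct: the bytes copied are the handle itself.
std::uintptr_t bindByAddress(fake_mem handle)
{
    return fakeSetKernelArg(sizeof(fake_mem), &handle);
}

// Wrong, but it compiles: any object pointer converts to const void*,
// so the bytes copied are the first bytes of the object's internals.
std::uintptr_t bindByValue(fake_mem handle)
{
    return fakeSetKernelArg(sizeof(fake_mem), handle);
}
```

In the first case the slot ends up holding the handle. In the second it holds whatever happens to be at the start of the object, which the runtime would later treat as a handle. That would fit a crash that only shows up at kernel launch.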

I use JOCL myself (www.jogamp.org), and that hides some of these fiddly details and a bit more, so I usually don’t even see that boilerplate stuff. Error checking in particular is a pain otherwise: JOCL gives you a symbolic error name and not a number, and dumps the source automatically if a compile fails. For me it also provides a way to write cross-platform stuff more easily whilst still sticking to a C-like language I prefer.

Ha, thanks for the response!

I would like to do C, but I’m doing this in VS (I only have limited choice in the matter of tools :() and Microsoft pretty much deprecated it on their system, so C++ is about as close as you can get, unfortunately.

I’ll make sure all the addresses reside on the heap and try it again. It’s hard because things like cl_kernel are typedefs to _cl_kernel*, so you actually are working with their addresses even if it seems like you’re passing the struct by copy. Opaque pointers, is the term, I believe.

JOCL is a good answer, but I feel the Java code on the CPU may damage the run-time of the application. Since this is real time (which is why it needs to be done on the GPU), I’d be afraid of that interfering.

I’ll check, though, and make sure any memory I have control over is on the heap, though for the love of God I’d hope that doesn’t make a difference. These functions, when invoked, should still be on the same call stack, so the stack addresses should work fine. If the libraries behind these calls are doing any threaded stuff, though, I can chuck that assumption out the window.

Not sure, I’ll take a look at the memory locations tomorrow.

Thanks for your input.

Okay, I fixed it.

Wow, spent a whole week in the labs jerking off with this simple code and 20 minutes on my Ubuntu box with a real compiler and I have the issue flushed out.

Thank God for g++ 4.6.3, it found a bunch of issues with the program prior to even running it, which the Microsoft C++ compiler didn’t give a shit about on the strictest settings.

Errors are as follows.

First, maybe not an error, but an ‘issue’:

cl_command_queue clCommandQueue = clCreateCommandQueue(clContext, gpuDevice, NULL, &status);

The third argument, which I passed as NULL, is a bit-field according to the OpenCL documentation. I always learned NULL in C/C++ as being defined as 0, but in later years it has more often been implemented as a null pointer constant (a void pointer to location 0 in C) rather than a plain integer. This was probably not the root cause, but the argument should have been 0 regardless, as the documentation explicitly lists the properties as a bit-field.
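
A small sketch of why an integer 0 is the natural “no options” value here. The types and constants are stand-ins, on the assumption that the properties argument is a 64-bit bit-field and that the two flags use bits 0 and 1 as in CL/cl.h; profilingEnabled and outOfOrderEnabled are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for cl_command_queue_properties (assumed to be a 64-bit bit-field).
typedef std::uint64_t queue_props;

// Assumed to mirror CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE and
// CL_QUEUE_PROFILING_ENABLE from CL/cl.h.
const queue_props QUEUE_OUT_OF_ORDER = 1u << 0;
const queue_props QUEUE_PROFILING    = 1u << 1;

// Each option is one bit, so "nothing requested" is simply the integer 0.
bool profilingEnabled(queue_props props)
{
    return (props & QUEUE_PROFILING) != 0;
}

bool outOfOrderEnabled(queue_props props)
{
    return (props & QUEUE_OUT_OF_ORDER) != 0;
}
```

Options are combined with bitwise OR, for example QUEUE_PROFILING | QUEUE_OUT_OF_ORDER. Passing NULL to a parameter like this still compiles, but as far as I can tell that is what g++ was warning about, since it treats NULL as a pointer constant being handed to a non-pointer argument.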

The error which was killing the program was, as notzed predicted, a memory issue, though I think I should be rather annoyed as to the cause. It is in how I bind the arguments in the kernel:

clSetKernelArg(programKernel, 0, sizeof(cl_mem), Amem);

The reason I had it listed as Amem instead of &Amem is that when I moused over the definition of cl_mem, it was implemented as: typedef _cl_mem *cl_mem, or something to that effect if I remember correctly. Because of the apparent use of opaque pointers, I believed the pointer I passed in was already a valid reference to the OpenCL memory object…I was wrong. Passing the address of the handle (&Amem) fixed the issue.

I guess the lesson learned is that opaque pointers can cause confusion amongst devs not used to a particular API, especially when working in an IDE, because the only definition you get to see is the typedef in a header file.

I’m going to try these changes on the Windows machine in the labs and see what the result is. I’m not sure if I’m going to continue development there, as development under Windows is painful to say the least.