Ok, I found out what was going on, and I’ll write up a little description to help anybody else.
My kernel is structured so that a number of float arrays are read, a calculation is performed and then the results are written out to another buffer. It was this output buffer that I was reading back, and that read was causing the failure.
Previously my code looked something like this:
kernel void foobar(float * in0, float * in1, float * out);
Notice the lack of global, local, constant, etc. I had assumed that the system would make an informed choice as to which memory space these arrays needed to be in, but that didn’t seem to be the case when targeting the GPU (the CPU worked fine with this). I would have at least expected a default of global.
Defining the parameters like this (because the inputs are read-only):
kernel void foobar(constant float * in0, constant float * in1, global float * out);
brought this error:
[CL_UNKNOWN_ERROR] : OpenCL Error : clBuildProgram failed: could not build program for device 0 (0x7365522f) (-44)
[CL_BUILD_ERROR] : OpenCL Build Error : Compiler build log:
Error while compiling the ptx module: CLH_ERROR_NO_BINARY_FOR_GPU
PTX Info log:
PTX Error log:
… which I still don’t understand, but it does seem to be related to how much data is tagged constant. When only a couple of arguments were tagged constant, but the data was still too big, I got a sensible error from clEnqueueNDRangeKernel instead (i.e. after I had set the kernel arguments, and therefore the size of the constant data).
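For anyone who wants to check up front whether their inputs are small enough to be tagged constant, here is a rough sketch of the test I would do on the host. The helper name constant_args_fit is made up for illustration. The two limits are assumed to have been read with clGetDeviceInfo using CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE and CL_DEVICE_MAX_CONSTANT_ARGS. Comparing the combined size against the buffer limit is my own conservative assumption, not something the spec demands.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a set of buffers can safely be
 * passed as `constant` kernel arguments.
 *
 * max_buffer_bytes is assumed to come from clGetDeviceInfo() with
 * CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (a cl_ulong), and max_args from
 * CL_DEVICE_MAX_CONSTANT_ARGS (a cl_uint).
 */
bool constant_args_fit(const size_t *arg_bytes, size_t n_args,
                       unsigned long long max_buffer_bytes,
                       unsigned int max_args)
{
    unsigned long long total = 0;

    if (n_args > max_args)
        return false;               /* too many constant arguments */

    for (size_t i = 0; i < n_args; ++i) {
        if (arg_bytes[i] > max_buffer_bytes)
            return false;           /* one buffer alone is over the limit */
        total += arg_bytes[i];
    }

    /* Conservative assumption: treat the limit as a shared pool as well. */
    return total <= max_buffer_bytes;
}
```

If the check fails, the buffers can simply be passed as global instead. As far as I know the spec only guarantees 64 KB and 8 constant arguments as minimums, so it is worth querying the real values for your device.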
Tagging everything as global solves the issue, and gets everything running. That probably means that the output buffer wasn’t in global space when I’d left it undefined, and so when I read it things went bang.
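For reference, this is roughly what the working version looks like with every pointer tagged explicitly. It is only a sketch: the addition in the body is a placeholder for the real calculation, the const on the inputs is my own addition to mark them read-only, and the block at the top is a hypothetical shim (including the fake_gid variable) that is there purely so the same source can be compiled as plain C for a quick check on the host. __kernel and __global are the same qualifiers as kernel and global. As far as I can tell from the OpenCL C spec, pointer arguments to a kernel have to name global, constant or local, so leaving the qualifier off is not something to rely on.

```c
/* Host-side shim (illustration only): when this is not being compiled by
   an OpenCL compiler, stub out the qualifiers and the work-item id. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
size_t fake_gid = 0;                      /* hypothetical stand-in for the work-item id */
size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
#endif

/* Every pointer argument states its address space explicitly. */
__kernel void foobar(__global const float *in0,
                     __global const float *in1,
                     __global float *out)
{
    size_t i = get_global_id(0);          /* one work-item per element */
    out[i] = in0[i] + in1[i];             /* placeholder calculation */
}
```

When built by the OpenCL compiler the shim disappears, because __OPENCL_VERSION__ is defined there, and only the kernel itself is left.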
I’m now off to try some optimisations, as things are running slower than the CPU at the moment, but at least it’s running.