I’ve encountered a strange bug. I have a kernel that sometimes seems not to get executed by clEnqueueNDRangeKernel() depending on the value for global_work_size. Stranger still, the value of global_work_size that causes the kernel not to get executed varies from time to time. Even stranger still, making seemingly insignificant changes to the kernel causes the kernel to execute reliably.
// This kernel doesn't execute if global_work_size is
// larger than some (inconsistent) number. The number
// used to be > 2M, then 491904 was too big, then
// 524288 was too big; now starting at 628864 and
// incrementing by 128 (999 times) up through 756736 is
// fine, but starting at 628992 causes all 999 calls to fail.
//
// iMac with Radeon HD 4670, Mac OS X 10.6.6
__kernel void ComputeResults(__global uint4 * results,
const uint count)
{
uint id = get_global_id(0);
if (id == 0)
{
results[0].x = count;
}
uint X[10];
// removing the for() loop and leaving X[0] = 0 fixes it
for (int i = 0; i < 1; ++i)
{
X[0] = 0;
}
// removing this assignment fixes it
// replacing count with a constant fixes it
X[count] = 0;
}
When I say “fail”, I mean that the kernel never seems to execute, tested by checking whether results[0] has been set. (I clear it with a blocking clEnqueueWriteBuffer() before each call and read it with a blocking clEnqueueReadBuffer after each call.)
But none of the function calls return any errors, emit any warnings, etc.
For reference, I’m setting work_dim to 1 and local_work_size[0] to 128. I’m always using a multiple of 128 for the global_work_size.
CL_DEVICE_MAX_WORK_GROUP_SIZE returns 1024
CL_KERNEL_WORK_GROUP_SIZE returns 128
CL_DEVICE_MAX_WORK_ITEM_SIZES[0] is 1024
Any idea what’s going on? Is there some other crucial limit that I’m missing? Or is this likely a bug in the driver?