clEnqueueNDRangeKernel silently failing?

I’ve encountered a strange bug. I have a kernel that sometimes seems not to get executed by clEnqueueNDRangeKernel() depending on the value for global_work_size. Stranger still, the value of global_work_size that causes the kernel not to get executed varies from time to time. Even stranger still, making seemingly insignificant changes to the kernel causes the kernel to execute reliably.


// This kernel doesn't execute if global_work_size is
// larger than some (inconsistent) number.  The number
// used to be > 2M, then 491904 was too big, then
// 524288 was too big; now starting at 628864 and
// incrementing by 128 (999 times) up through 756736 is
// fine, but starting at 628992 causes all 999 calls to fail.
//
// iMac with Radeon HD 4670, Mac OS X 10.6.6
__kernel void ComputeResults(__global uint4 * results,
                             const uint count)
{
    uint id = get_global_id(0);
    if (id == 0)
    {
        results[0].x = count;
    }

    uint X[10];

    // removing the for() loop and leaving X[0] = 0 fixes it
    for (int i = 0; i < 1; ++i)
    {
        X[0] = 0;
    }

    // removing this assignment fixes it
    // replacing count with a constant fixes it
    X[count] = 0;
}

When I say “fail”, I mean that the kernel never seems to execute, tested by checking whether results[0] has been set. (I clear it with a blocking clEnqueueWriteBuffer() before each call and read it with a blocking clEnqueueReadBuffer after each call.)

But none of the function calls return any errors, emit any warnings, etc.

For reference, I’m setting work_dim to 1 and local_work_size[0] to 128. I’m always using a multiple of 128 for the global_work_size.

CL_DEVICE_MAX_WORK_GROUP_SIZE returns 1024
CL_KERNEL_WORK_GROUP_SIZE returns 128
CL_DEVICE_MAX_WORK_ITEM_SIZES[0] is 1024

Any idea what’s going on? Is there some other crucial limit that I’m missing? Or is this likely a bug in the driver?

Interesting. What is the value of “count”? Could the last assignment in the kernel be writing out of bounds of the array X?

Are you executing this on the CPU or GPU?

Is your queue in-order or out-of-order? If it’s out-of-order, have you set up the dependencies between write-ndrange-read correctly?

Nope. I’m setting count to 5. And note the unrelated changes that do fix the problem (in the comments).

Are you executing this on the CPU or GPU?

GPU. Works fine on the CPU.

Is your queue in-order or out-of-order? If it’s out-of-order, have you set up the dependencies between write-ndrange-read correctly?

I’m passing 0 as the properties to clCreateCommandQueue (so CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not set, which I believe means in-order).

I’m at least encouraged by your response that it’s not something obvious. :slight_smile:

This is bizarre indeed. I would suspect a bug in the compiler. There’s nothing apparently wrong with the code.

Since you’ve been able to reproduce it with such a short kernel and you’ve experimented so much with what small changes make the bug go away it would be great if you sent this to Apple (I think the link is http://developer.apple.com/contact/).

Good catch :slight_smile: