Mysterious kernel behavior. Can anyone explain?

The behavior of the test kernel below makes absolutely no sense to me. Any clues as to what’s going on would be much appreciated!

The kernel basically initializes an output array, C, with 1’s. It also contains a dummy for-loop that does nothing but initialize two __local arrays. With only a few iterations of the for-loop, C is correctly initialized with 1’s, but with many iterations (e.g. 1000) something goes horribly wrong and C is not initialized with 1’s. Why does the 1000-iteration loop have this odd effect?

Bonus info: for some reason the problem does not occur when the input arrays are small. The problem does occur in a situation with
C: array of size 11,182,336
A: array of size 3,101,104
rowWidths and rowStartIdxs: arrays of size 3,344
I don’t understand why the array sizes should have anything to do with this, but they do.

I would welcome an explanation of the above mysterious phenomenon.

(FYI: the kernel is a boiled-down version of a larger and more meaningful kernel that suffers from the same “bug”.)


#define BLOCK_SIZE 16
__kernel void test(__global int* C, int CSize,  __global int* A, __global int* rowWidths, __global int* rowStartIdxs)
{
    // A, rowWidths and rowStartIdxs are unused in this boiled-down kernel
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);
    int rowAIdx =  bi * BLOCK_SIZE + ti;
    int rowBIdx =  bj * BLOCK_SIZE + tj;

    int cOut = 1;
    // dummy loop: only writes two __local arrays; the result is never used
    for(int x=0; x<1000; x++) {
      __local int As[BLOCK_SIZE][BLOCK_SIZE];
      __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
      As[ti][tj] = 1;
      Bs[ti][tj] = 1;
      barrier(CLK_LOCAL_MEM_FENCE);
    }
    int c = rowBIdx * CSize + rowAIdx;
    C[c] = cOut;
}

It does look like a bug in the implementation. Can you tell us what OpenCL implementation you are using?

Did you double- and triple-check that the work-group size is equal to BLOCK_SIZE x BLOCK_SIZE?

The difference between a few (how many?) and many loop iterations could be that the implementation is doing some aggressive loop unrolling in the former case.

I access the GPU through the Python interface pyopencl 0.91.4. I’m not sure where to find the OpenCL SDK version, but according to the release notes in the doc/ folder it is “OpenCL R190 Update Release”, including “SDK Version 1.2.0.15”. Can I somehow verify that this is also the version that pyopencl uses?
I work on a Mac with Snow Leopard 10.6.3.
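
For what it’s worth, here is a small sketch of how I could at least print the version strings that pyopencl itself reports (assuming the standard get_info queries; picking the first GPU is just my setup):

import pyopencl as cl

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
# print what pyopencl actually sees, to compare against the SDK release notes
print platform.get_info(cl.platform_info.NAME)
print platform.get_info(cl.platform_info.VERSION)
print device.get_info(cl.device_info.NAME)
print device.get_info(cl.device_info.VERSION)
print device.get_info(cl.device_info.DRIVER_VERSION)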

I forgot: yes, I have double-, triple- and quadruple-checked that the group size really is BLOCK_SIZE x BLOCK_SIZE. :slight_smile:

Below is a complete boiled-down example that can be run directly.

To see if the code works, verify that it outputs (prints) a number larger than zero.
You can make it work in two ways:

  1. lower the number of for-loop iterations in the kernel from 1000 to e.g. 10, or
  2. lower the “rows” variable in the Python code from 3344 to e.g. 144.

import sys
import struct
import pyopencl as cl
import numpy

block_size = 16
matrixLength = 3101104
rows = 3344

row2width = numpy.zeros(rows, numpy.int32)
row2startIdx = numpy.zeros(rows, numpy.int32)
matrix = numpy.zeros(matrixLength, numpy.int32)

pl = cl.get_platforms()
devs = pl[0].get_devices(cl.device_type.GPU)
# a 16x16 work-group holds block_size*block_size work items, which must fit in MAX_WORK_GROUP_SIZE
if(block_size * block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE)):
   print "Error: block_size^2 is larger than MAX_WORK_GROUP_SIZE..."
   exit(1)
ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
// Thread block size
#define BLOCK_SIZE 16
  
__kernel void test(__global int* C, int CSize, __global int* A, __global int* rowWidths, __global int* rowStartIdxs)
{
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);

    int rowAIdx =  bi * BLOCK_SIZE + ti;
    int rowBIdx =  bj * BLOCK_SIZE + tj;

    int cOut = 1;
    for(int x=0; x<1000; x++) {
      __local int As[BLOCK_SIZE][BLOCK_SIZE];
      __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
      As[ti][tj] = 1;
      Bs[ti][tj] = 1;
      barrier(CLK_LOCAL_MEM_FENCE);
    }
    
    C[rowBIdx * CSize + rowAIdx] = cOut;
}
""";
prg = cl.Program(ctx, src).build();
matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(matrix).astype(numpy.int32))
row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2width).astype(numpy.int32))
row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2startIdx).astype(numpy.int32))
o = numpy.zeros(rows * rows).astype(numpy.int32)   # output array C of size rows*rows
o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)
w_o_buf = struct.pack("i", rows)   # the CSize scalar argument, packed as a 32-bit int

# global size = rows x rows work items, local size = 16 x 16
prg.test(queue, [rows, rows], o_buf, w_o_buf, matrix_buf, row2width_buf, row2startIdx_buf, local_size=(block_size, block_size))
cl.enqueue_read_buffer(queue, o_buf, o).wait()

i = numpy.nonzero(o)
print len(i[0])
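
(Regarding the work-group-size question above: one extra sanity check, run right after building the program, could be something like the sketch below, assuming pyopencl exposes the standard kernel work-group query.)

# sketch: confirm that a 16x16 local size is actually legal for this kernel on this device
max_wg = prg.test.get_work_group_info(cl.kernel_work_group_info.WORK_GROUP_SIZE, devs[0])
if block_size * block_size > max_wg:
    print "Error: a 16x16 work-group is too large for this kernel/device"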

If I were you, the first thing I would do is try to recreate the bug in C instead of using PyOpenCL. That would eliminate the possibility of a problem in the Python bindings.

Sorry, I don’t have my Mac at hand to try this myself :frowning:

Problem solved! It seems that NVIDIA hardware that is also used to drive a display has a hard limit of 5 seconds on kernel execution time (the display watchdog). After that, the kernel is aborted and an error is returned.
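
For anyone who wants to confirm this on their own setup, here is a sketch of how the kernel time can be measured with OpenCL profiling events (assuming a profiling-enabled queue, and that the kernel call returns an event, as it does in my pyopencl version):

# sketch: time the kernel with profiling events
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
evt = prg.test(queue, [rows, rows], o_buf, w_o_buf, matrix_buf, row2width_buf, row2startIdx_buf, local_size=(block_size, block_size))
evt.wait()   # if the watchdog kills the kernel, this is where the error shows up
start = evt.get_profiling_info(cl.profiling_info.START)
end = evt.get_profiling_info(cl.profiling_info.END)
print "kernel time: %.2f s" % ((end - start) * 1e-9)   # anything close to 5 s is watchdog territory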

I solved the problem by splitting the computation into smaller sub-problems and constructing the final solution from the partial results, roughly as sketched below.
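
Roughly what the splitting looks like (just a sketch, not my exact code: it assumes the kernel gets an extra rowOffset argument that is added to rowBIdx, and that the chunk size is a multiple of BLOCK_SIZE):

chunk = 512   # rows per launch; small enough to stay well under the 5-second limit
for start_row in range(0, rows, chunk):
    n = min(chunk, rows - start_row)               # 3344 = 6*512 + 272, all multiples of 16
    prg.test(queue, [rows, n],                     # only a slice of the second dimension per launch
             o_buf, w_o_buf, matrix_buf, row2width_buf, row2startIdx_buf,
             struct.pack("i", start_row),          # hypothetical rowOffset kernel argument
             local_size=(block_size, block_size))
    queue.finish()                                 # let each slice finish before enqueuing the next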

Since this 5-second hard limit really matters for computation-intensive applications (most OpenCL programs, I guess), I think it should be made more visible in the documentation. IMHO.

I experienced a similar problem some time ago when executing nested for-loops in a kernel, but couldn’t find an exact explanation for it at the time.