The behavior of the test kernel below makes absolutely no sense to me. Any clues to what’s going on will be much appreciated!
The kernel basically initializes an output array, C, with 1's. It also contains a dummy for-loop that does nothing but initialize two local data structures. If I make only a few iterations in the for-loop, C is correctly initialized with 1's, but with many iterations (e.g. 1000) something goes horribly wrong and C is not initialized with 1's. Why does the 1000-iteration loop have this odd effect?
Bonus info: for some reason the problem does not occur when the input arrays are small. The problem does occur in a situation with
C: array of size 11,182,336
A: array of size 3,101,104
rowWidths and rowStartIdxs: arrays of size 3,344
I don't understand why the array sizes have anything to do with this, but they do.
I would welcome an explanation of the above mysterious phenomenon.
(FYI: the kernel is a boil-down of a larger and more meaningful kernel that suffers from the same “bug”)
#define BLOCK_SIZE 16

__kernel void test(__global int* C, int CSize, __global int* A,
                   __global int* rowWidths, __global int* rowStartIdxs)
{
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);

    int rowAIdx = bi * BLOCK_SIZE + ti;
    int rowBIdx = bj * BLOCK_SIZE + tj;

    int cOut = 1;
    for (int x = 0; x < 1000; x++) {
        __local int As[BLOCK_SIZE][BLOCK_SIZE];
        __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ti][tj] = 1;
        Bs[ti][tj] = 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    int c = rowBIdx * CSize + rowAIdx;
    C[c] = cOut;
}
It does look like a bug in the implementation. Can you tell us which OpenCL implementation you are using?
Did you double and triple-check that the work group size is equal to BLOCK_SIZE x BLOCK_SIZE?
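One quick way to sanity-check the launch geometry is with a few lines of host-side arithmetic. This is just an illustrative sketch in plain Python (the `check_launch` helper and the sample sizes are made up for the example): the local size must be exactly BLOCK_SIZE x BLOCK_SIZE for the As[ti][tj] indexing to be valid, and each global dimension must be an exact multiple of the corresponding local dimension or the enqueue fails with CL_INVALID_WORK_GROUP_SIZE.

```python
BLOCK_SIZE = 16  # must match the #define in the kernel

def check_launch(global_size, local_size, block_size=BLOCK_SIZE):
    """Return True if the NDRange is compatible with the kernel's tiling."""
    if local_size != (block_size, block_size):
        # As[ti][tj] / Bs[ti][tj] assume a block_size x block_size work group
        return False
    # Each global dimension must divide evenly by the local dimension,
    # otherwise clEnqueueNDRangeKernel rejects the launch.
    return all(g % l == 0 for g, l in zip(global_size, local_size))

# Example with a hypothetical 3344 x 3344 problem: 3344 = 209 * 16
print(check_launch((3344, 3344), (16, 16)))  # True
print(check_launch((3344, 3344), (16, 8)))   # False: wrong local shape
```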
The difference between a few (how many?) and many loop iterations could be that the implementation does some aggressive loop unrolling in the former case.
I access the GPU using the Python interface pyopencl-0.91.4. I'm not sure where to find the OpenCL SDK version, but according to the release notes in the doc/ folder, it is "OpenCL R190 Update Release", including "SDK Version 1.2.0.15". Can I somehow verify that this is also the version that pyopencl uses?
I work on a mac with Snow Leopard 10.6.3.
If I were you, the first thing I would do is try to recreate the bug in C instead of using PyOpenCL. That would eliminate the possibility of a problem in the Python bindings.
Sorry, I don't have my Mac at hand to try and do this myself.
Problem solved! It seems that NVIDIA hardware that is also used for display has a hard (watchdog) limit of about 5 seconds of kernel execution time. After that, the kernel is aborted and an error is returned.
I solved the problem by splitting it into smaller sub problems and constructing the final solution from the sub results.
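In case it helps anyone else hitting the watchdog: the splitting can be as simple as tiling the 2D range and enqueuing one kernel launch per tile, so that each launch finishes well under the ~5-second limit. A minimal sketch in plain Python (names and tile sizes are hypothetical; with pyopencl you would pass each tile's origin as a global offset or as kernel arguments):

```python
def tile_range(width, height, tile_w, tile_h):
    """Yield (x0, y0, w, h) tiles that together cover a width x height range.

    The host loop then enqueues one kernel launch per tile instead of a
    single huge NDRange, keeping each launch below the display watchdog.
    """
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            yield (x0, y0, min(tile_w, width - x0), min(tile_h, height - y0))

# Example: a hypothetical 3344 x 3344 problem split into 1024 x 1024 tiles
tiles = list(tile_range(3344, 3344, 1024, 1024))
print(len(tiles))  # 16 launches (4 tiles per axis; edge tiles are 272 wide)
```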
Since knowing about this 5-second hard limit is pretty essential for computation-intensive applications (most OpenCL programs, I'd guess), I think it should be more visible in the documentation. IMHO.
I experienced a similar problem some time ago when executing nested for-loops in a kernel, but couldn't find an exact explanation for it at the time.