Maximum array index limited to 16 bits ?

I’m doing a standard sum-of-ints algorithm. The minor twist is that I do this on an 2 dimensional matrix in which I sum each column individually.

The matrix is laid out in global memory in a column after column fashion.

__kernel void reduce(__global ushort *g_matrix,__global uint *g_sums,__local uint *s_local)

I use something like : g_matrix[col*number_of_rows+row] to index the g_matrix.

If I run my algorithm with a small input matrix of say 3232 ushorts then it works just fine : I get an array with 32 sums, one for each column. If I scale things up though, e.g. 512512 ushorts, then I seem to get a wrap-around after column 127. Now is 128 columns * 512 rows per column of course equal to 64 k, and as usual in any computing environment strange things happen if you cross the 16 bit boundary (and usually 32 bit as well) :smiley:

I’ve tried to find a limit in the OpenCL definition which would explain this behavior but couldn’t find anything I considered related. What did I overlook ?

Thanks in advance

What is your local group size?
How do you calculate your local memory index?

My global workgroup size is 256512 and my local workgroup size is 2562. The problem doesn’t seem to be in indexing the local memory, but in global memory. When I try to index the g_matrix with anything over 65536 it seems to wrap around to reading at index 0 again. I’ve included the kernel code below.


__kernel void reduce(__global ushort *g_matrix,__global uint *g_line,__local uint *s_column)
{
  // g_matrix  : 0..511,0..511
  // g_line    : 0..511
  // s_column  : 0..255,0..1

  size_t g_row=get_global_id(0);                  // 0..255
  size_t g_row_work_size=get_global_size(0);      // 256
  size_t g_column=get_global_id(1);               // 0..511
  size_t g_column_work_size=get_global_size(1);   // 512

  size_t l_row=get_local_id(0);                   // 0..255
  size_t l_row_work_size=get_local_size(0);       // 256
  size_t l_column=get_local_id(1);                // 0..1
  size_t l_column_work_size=get_local_size(1);    // 2

  ushort offset=l_row_work_size;                  // 256,128,64,32,16,8,4,2,1,0

  // copy current columns from global to shared memory for faster access and
  // do the first addition within a column in a single stroke
  s_column[l_column*l_row_work_size+l_row]=
      g_matrix[g_column*g_row_work_size*2+g_row]+
      g_matrix[g_column*g_row_work_size*2+g_row+offset];

  offset>>=1;

  barrier(CLK_LOCAL_MEM_FENCE);

  // do the remaining additions in the columns in shared memory
  while(offset>0)
  {
    if(l_row<offset)
    {
      s_column[l_column*l_row_work_size+l_row]+=
          s_column[l_column*l_row_work_size+l_row+offset];
    }

    offset>>=1;

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  // copy sum to global result memory
  if(l_row==0)
  {
    g_line[g_column]=s_column[l_column*l_row_work_size];
  }
}

Everything works fine for the first 128 columns, but after that things seem to wrap back again to column 0 : the results are identical although the matrix data at column 128 and beyond is different from that at column 0 (for testing I’m providing data which starts with 0 at (0,0) and increments by 1 for each next cell, row first).

// copy current columns from global to shared memory for faster access and
// do the first addition within a column in a single stroke
s_column[l_columnl_row_work_size+l_row]=
g_matrix[g_column
g_row_work_size2+g_row]+
g_matrix[g_column
g_row_work_size*2+g_row+offset];

Are you sure you have to multiply by 2 here? With buffer size 256x512 (columns by rows I think) jump out of memory after 128.

Hi,

g_matrix is not 256512, but 512512 in size, but I’m throwing 256 threads at a time at each column of it, each of which copies two elements from global memory, 256 cells apart, adds the two values and stores the result in local memory.

So g_row_work_size is half the size of a column, and therefore I have to multiply by two to get the column size.

You can find the sizes of the arguments right after the kernel function definition.