How to index local accrays for best performance?

I have a problem understanding the memory access patterns for kernels in OpenCL. Consider the kernel snippet below. It runs in a two-dimensional 16x16 work-group, and the 16x16 threads collaborate on initializing a 16x16 local array (As) with elements from a larger global array.


   ...
   int ti = get_local_id(0);
   int tj = get_local_id(1);
   __local int As[16][16];
   As[ti][tj] = someGlobalArray[...];
   barrier(CLK_LOCAL_MEM_FENCE);
   ...

The strange thing is that if I index As as As[tj][ti] instead of As[ti][tj], the code runs much faster. Can anyone explain why?

Oops, the title should of course say “arrays”, not “accrays” (whatever that is)… :)

Do you only change the indices for the access to As or do you also change the indices for your global array?

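If you’re not changing the access to global memory, then the problem could be bank conflicts.
With As[ti][tj], work-items that are adjacent in dimension 0 differ only in ti, so their addresses are 16 ints apart. They all fall in the same column of your 16-wide array, and on hardware where local memory has 16 banks that means the same bank, so the accesses are serialized. With As[tj][ti], adjacent work-items access consecutive elements of the same row, which sit in different banks, so the accesses can be served in parallel. This is why it is so much faster.
You can try padding your array (e.g. As[16][17]) to avoid this.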
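
To make the bank argument concrete, here is a small host-side sketch in plain C (not kernel code) that counts how many adjacent work-items end up in the same bank. It rests on assumptions that are not stated in the thread: local memory is split into banks that are one 32-bit word wide, and word address a lives in bank a % banks. The function name max_bank_conflict and its parameters are made up for illustration; real hardware may map addresses differently.

```c
/* Hypothetical model (an assumption, not a hardware guarantee):
   local memory has `banks` banks, each one 32-bit word wide,
   and word address a is stored in bank a % banks. */

/* Returns the worst-case number of work-items that hit the same bank
   when `group` adjacent work-items (ti = 0 .. group-1, fixed tj) each
   access one int of a 2D local array whose rows are `pitch` ints long.
   ti_is_row != 0 models As[ti][tj]; ti_is_row == 0 models As[tj][ti]. */
int max_bank_conflict(int group, int pitch, int banks, int ti_is_row, int tj)
{
    int hits[64] = {0};   /* per-bank access counters; supports up to 64 banks */
    int worst = 0;

    for (int ti = 0; ti < group; ++ti) {
        /* Row-major word address of the element this work-item touches. */
        int addr = ti_is_row ? ti * pitch + tj : tj * pitch + ti;
        int n = ++hits[addr % banks];
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

Under these assumptions, max_bank_conflict(16, 16, 16, 1, 0) returns 16 for As[ti][tj] (all 16 work-items in one bank), max_bank_conflict(16, 16, 16, 0, 0) returns 1 for As[tj][ti], and max_bank_conflict(16, 17, 16, 1, 0) returns 1 for the padded As[16][17] case, because a row pitch of 17 shifts each row by one bank.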