How to index local accrays for best performance?

tmp · June 9, 2010, 1:49am

I have a problem understanding the memory access patterns for kernels in OpenCL. Consider the kernel snippet below. It runs in a two-dimensional 16x16 work-group, and the 16x16 threads collaborate on initializing a 16x16 local array (As) with elements from a larger global array.


   ...
   int ti = get_local_id(0);
   int tj = get_local_id(1);
   __local int As[16][16];
   As[ti][tj] = someGlobalArray[...];
   barrier(CLK_LOCAL_MEM_FENCE);
   ...

The strange thing is that if I access As with As[tj][ti] instead of As[ti][tj], the code runs much faster. Can anyone explain why?

tmp · June 9, 2010, 1:53am

Oops, the title should of course say “arrays”, not “accrays” (whatever that is)…

dominik · June 9, 2010, 5:44am

Do you only change the indices for the access to As or do you also change the indices for your global array?

dominik · June 9, 2010, 6:18am

If you’re not changing the access to global memory, then the problem could be bank conflicts in local memory.
With As[ti][tj], adjacent work-items (consecutive ti) access the same column, so their addresses are 16 ints apart. On hardware with 16 local-memory banks they all land in the same bank and the accesses are serialized. With As[tj][ti], adjacent work-items access consecutive elements of the same row and thus different banks. This is why it is so much faster.
You can try padding your array (e.g. As[16][17]) to avoid this.