I have a problem understanding the memory access patterns for kernels in OpenCL. Consider the snippet of a kernel below. It runs in a two-dimensional 16x16 work-group, and the 16x16 work-items collaborate on initializing a 16x16 local array (As) with elements from a larger global array.
...
int ti = get_local_id(0);
int tj = get_local_id(1);
__local int As[16][16];
As[ti][tj] = someGlobalArray[...];
barrier(CLK_LOCAL_MEM_FENCE);
...
The strange thing is that if I access As with As[tj][ti] instead of As[ti][tj] the code runs much faster. Can anyone explain why?
If you're not changing the access to global memory, then the problem could be bank conflicts in local memory.
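Using As[ti][tj] means that adjacent work-items (consecutive values of get_local_id(0)) access the same column. Their addresses are then 16 ints apart, and because your array width is 16 they typically land in the same bank (the exact number of banks depends on the hardware), so the accesses are serialized. With As[tj][ti], adjacent work-items access consecutive elements of the same row and thus different banks, so the accesses can be served in parallel. This is why it is so much faster.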
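You can try padding your array (e.g. As[16][17]) to avoid this. With a row width of 17, consecutive rows start in different banks, so even column-wise accesses are spread across banks.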
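To make the bank argument concrete, here is a small host-side sketch in plain C that counts how many of 16 adjacent work-items hit the same bank for each indexing order. It is only a model under assumptions that are not from your post: 16 banks, each one 32-bit word wide, with consecutive words in consecutive banks. NUM_BANKS, GROUP and worst_conflict are names made up for this illustration; check your device's documentation for the real bank layout.

```c
/* Assumed hardware model (hypothetical, not tied to a specific device):
   16 banks, one 32-bit word each, consecutive words in consecutive banks. */
#define NUM_BANKS 16
#define GROUP     16  /* adjacent work-items assumed to access local memory together */

/* Returns the largest number of work-items in the group that hit one bank.
   width     : row length of the local array in ints (16 = unpadded, 17 = padded)
   ti_is_row : 1 models As[ti][tj], 0 models As[tj][ti]
   tj        : the fixed get_local_id(1) shared by the adjacent work-items */
static int worst_conflict(int width, int ti_is_row, int tj)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    for (int ti = 0; ti < GROUP; ++ti) {
        /* Linear word offset of the element this work-item touches. */
        int word = ti_is_row ? ti * width + tj : tj * width + ti;
        int bank = word % NUM_BANKS;

        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model, worst_conflict(16, 1, 0) returns 16 (all 16 work-items hit one bank, the As[ti][tj] case), while worst_conflict(16, 0, 0) and worst_conflict(17, 1, 0) both return 1 (no conflict, the As[tj][ti] case and the padded As[16][17] case respectively).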