I'm trying to optimize an algorithm that performs some transformations on a matrix in OpenCL. I'm running it on an AMD GPU. The current version uses a local (workgroup) size of 1 x 256 threads, so "1" on the rows-dimension and "256" on the columns-dimension. The global size is 512 x 4096, so 512 on the rows-dimension and 4096 on the columns-dimension.

This means that there is one thread per column (4096 threads on the columns-dimension) and one thread per 8 rows (4096 matrix rows / 512 threads). Each workgroup uses a shared (local) memory buffer of size 8 x 256 (8 rows and 256 columns) into which its data is loaded. After the threads of a workgroup have loaded their data, there is a barrier to make sure that all of them have reached this point; after the barrier they perform the transformations on their chunk. The code executes correctly, without problems.
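To make the row mapping concrete, here is a minimal sketch in plain C (not the actual kernel) of which matrix rows each of the 512 row-threads covers. The names `first_row`, `last_row`, `MATRIX_ROWS` and `ROW_THREADS` are made up for illustration, and it assumes each thread handles 8 consecutive rows:

```c
#include <assert.h>

enum { MATRIX_ROWS = 4096, ROW_THREADS = 512 };
enum { ROWS_PER_THREAD = MATRIX_ROWS / ROW_THREADS }; /* 4096 / 512 = 8 */

/* First matrix row handled by the thread whose index on the
   rows-dimension is gid_row (assumed: 8 consecutive rows per thread). */
int first_row(int gid_row)
{
    assert(gid_row >= 0 && gid_row < ROW_THREADS);
    return gid_row * ROWS_PER_THREAD;
}

/* Last matrix row (inclusive) handled by the same thread. */
int last_row(int gid_row)
{
    return first_row(gid_row) + ROWS_PER_THREAD - 1;
}
```

Under this assumption, thread 0 covers rows 0 to 7 and thread 511 covers rows 4088 to 4095, so the 512 threads together cover all 4096 rows.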

Now I want to reduce the number of threads on the rows-dimension from 512 to 1, so the global size becomes 1 x 4096 (instead of 512 x 4096). Each column is still processed by one thread, as before, but because the rows-dimension now has size "1", each thread iterates over the rows in order to do the work that was previously done by the 512 threads. So now I have:

Code :
for (k=0;k<511;k++)
     load_a_chunk_of(8 x 512)_in_shared_memory;
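One thing that may be worth checking in the loop above is its bound. The following is only a sketch in plain C (not the kernel itself) that counts how many rows a loop `for (k = 0; k < iterations; k++)` visits when each iteration handles one chunk of 8 rows; `rows_covered`, `TOTAL_ROWS` and `CHUNK_ROWS` are hypothetical names. The comment about barriers is an assumption about a kernel that is not shown here:

```c
#include <assert.h>

enum { TOTAL_ROWS = 4096, CHUNK_ROWS = 8 };

/* Number of matrix rows visited by `for (k = 0; k < iterations; k++)`
   when every iteration processes one chunk of CHUNK_ROWS rows. */
int rows_covered(int iterations)
{
    assert(iterations >= 0);
    int covered = 0;
    for (int k = 0; k < iterations; k++) {
        int first = k * CHUNK_ROWS;        /* first row of chunk k */
        int last = first + CHUNK_ROWS - 1; /* last row of chunk k  */
        if (last < TOTAL_ROWS)
            covered += CHUNK_ROWS;
        /* In the real kernel (assumption): load rows first..last into
           local memory, barrier(CLK_LOCAL_MEM_FENCE), transform, then
           a second barrier so that no thread overwrites the local
           buffer for chunk k + 1 while others still read chunk k. */
    }
    return covered;
}
```

With `k < 511` this counts 4088 rows, so the last 8 rows are never processed; with `k < 512` it counts all 4096. If the real kernel uses the bound shown in the snippet, that alone could explain results that are close but not exact. A missing barrier at the end of each iteration is another possible cause.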

With this version I get results that are close to the correct ones, but not 100% correct. Do you have any idea how this change could affect the correctness of my algorithm? Instead of using 512 threads on the rows-dimension, I use 1 and loop from 0 to 511.

Thank you,