Reduce the number of threads and use a loop in the kernel to cover the work

Hello,

I’m trying to optimize an OpenCL algorithm that performs some transformations on a matrix. I’m running it on an AMD GPU. The current version uses a local (workgroup) size of 1 x 256: size 1 in the rows dimension and 256 in the columns dimension. The global size is 512 x 4096: size 512 in the rows dimension and 4096 in the columns dimension.

This means there is one thread per column (4096 threads in the columns dimension) and one thread per 8 rows (4096 matrix rows / 512 threads). Each workgroup loads an 8 x 256 tile of data (8 rows and 256 columns) into shared memory. After all workgroups load their data there is a barrier to make sure all of them have reached this point; after the barrier they perform the transformations on their chunks. The code executes correctly, without problems.
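
To make that arithmetic concrete, here is a rough sketch in plain C rather than OpenCL. The sizes come from the description above; the function names are made up for illustration, and it assumes each row-thread covers 8 contiguous rows, which is an assumption rather than something stated above.

```c
/* Sizes from the description; all function names are hypothetical. */
enum { MATRIX_ROWS = 4096, MATRIX_COLS = 4096 };
enum { GLOBAL_ROWS = 512, LOCAL_COLS = 256 };
enum { ROWS_PER_THREAD = MATRIX_ROWS / GLOBAL_ROWS }; /* 4096 / 512 */

/* First matrix row handled by the thread with this row id
   (assumes contiguous rows per thread). */
int first_row_of(int global_row_id)
{
    return global_row_id * ROWS_PER_THREAD;
}

/* Last matrix row handled by the same thread. */
int last_row_of(int global_row_id)
{
    return first_row_of(global_row_id) + ROWS_PER_THREAD - 1;
}

/* Number of elements in one workgroup's shared-memory tile. */
int tile_elements(void)
{
    return ROWS_PER_THREAD * LOCAL_COLS;
}

/* Number of workgroups: 512 along rows times 4096/256 along columns. */
int workgroup_count(void)
{
    return GLOBAL_ROWS * (MATRIX_COLS / LOCAL_COLS);
}
```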

Now I want to reduce the number of threads in the rows dimension from 512 to 1, so the global size becomes 1 x 4096 (instead of 512 x 4096). Each column is still processed by one thread, as before, but because the rows dimension now has size 1, each thread iterates over the rows to do the work that was previously done by the 512 threads. So now I have:

for (k = 0; k < 511; k++)
{
    load_a_chunk_of(8 x 512)_in_shared_memory;

    barrier();

    perform_operations_on_chunk();

    barrier();
}

With this version I get results that are close to the correct ones, but not 100% correct. Do you have any idea how this change can affect the correctness of my algorithm? Instead of using 512 threads in the rows dimension, I used 1 and created a loop from 0 to 511…

Thank you,
Cristina

On a GPU, make sure you use dimension 0 for the columns and dimension 1 for the rows, otherwise you can get very low performance - but without more detail I can’t tell whether your algorithm is affected. Also, one normally wants to increase the parallelism, not reduce it, unless your tasks are interdependent, which you imply they are not.
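
The reason for the dimension advice, roughly: work-items that are neighbours in dimension 0 usually run side by side in the same wavefront, so you want them to touch neighbouring addresses. The plain-C sketch below only illustrates the address arithmetic. It assumes a row-major matrix, and the helper names are hypothetical, not part of any OpenCL API.

```c
#include <stddef.h>

/* Linear offset of element (row, col) in a row-major matrix
   with `width` columns. Row-major storage is an assumption. */
size_t offset_of(size_t row, size_t col, size_t width)
{
    return row * width + col;
}

/* Distance, in elements, between the addresses touched by two
   work-items whose dimension-0 ids differ by one. */
size_t neighbour_stride(int dim0_is_column, size_t width)
{
    if (dim0_is_column) {
        /* Neighbours differ in column: adjacent elements. */
        return offset_of(0, 1, width) - offset_of(0, 0, width);
    }
    /* Neighbours differ in row: a whole matrix row apart. */
    return offset_of(1, 0, width) - offset_of(0, 0, width);
}
```

With a 4096-wide matrix, mapping dimension 0 to columns gives neighbours one element apart, while mapping it to rows puts them a full row apart, which is the access pattern that tends to be slow.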

The short answer to your question is that you added a bug during the conversion process.

If you had done it correctly, there should be no difference apart from a very likely much slower execution time.

You probably just forgot to re-initialise variables within the loop.
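
To illustrate what I mean, here is a small plain-C emulation of one column (not your kernel, which I haven’t seen). The per-chunk transformation is invented purely for the demonstration, and every name in it is hypothetical. It compares the "one thread per 8 rows" layout with the looped layout, and lets you switch on two typical conversion mistakes: state that is not reset at the top of each iteration, and a loop that runs one iteration short. Note that the posted loop stops while k is below 511, which gives 511 iterations rather than 512, so that is worth checking as well.

```c
enum { ROWS = 4096, CHUNK_ROWS = 8, NUM_CHUNKS = ROWS / CHUNK_ROWS };

/* Invented transformation: out = 8 * in - (sum of the chunk).
   `acc` stands in for any private per-thread state. */
static void process_chunk(const int *in, int *out, int chunk, int *acc)
{
    int base = chunk * CHUNK_ROWS;
    for (int r = 0; r < CHUNK_ROWS; ++r)
        *acc += in[base + r];
    for (int r = 0; r < CHUNK_ROWS; ++r)
        out[base + r] = CHUNK_ROWS * in[base + r] - *acc;
}

/* Original layout: every chunk has its own thread, so every
   chunk starts from freshly initialised private state. */
void run_one_thread_per_chunk(const int *in, int *out)
{
    for (int t = 0; t < NUM_CHUNKS; ++t) {
        int acc = 0;
        process_chunk(in, out, t, &acc);
    }
}

/* Looped layout: one thread walks over the chunks. */
void run_looped(const int *in, int *out, int iterations, int reset_state)
{
    int acc = 0;
    for (int k = 0; k < iterations; ++k) {
        if (reset_state)
            acc = 0; /* the re-initialisation that is easy to forget */
        process_chunk(in, out, k, &acc);
    }
}

/* Runs both layouts on the same made-up input and returns how
   many output elements differ. */
int demo_mismatches(int iterations, int reset_state)
{
    static int in[ROWS], expected[ROWS], actual[ROWS];
    for (int i = 0; i < ROWS; ++i) {
        in[i] = i % 7 + 1; /* arbitrary test data */
        expected[i] = 0;
        actual[i] = 0;
    }
    run_one_thread_per_chunk(in, expected);
    run_looped(in, actual, iterations, reset_state);

    int mismatches = 0;
    for (int i = 0; i < ROWS; ++i)
        if (expected[i] != actual[i])
            ++mismatches;
    return mismatches;
}
```

Calling demo_mismatches(NUM_CHUNKS, 1) should report no differences, while dropping the reset or passing 511 iterations should report some. In a real kernel the same check applies to anything carried across iterations, including the contents of shared memory, which is why the second barrier at the end of the loop body has to stay.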