Multiple access to global memory

Dear all,

In my code, several threads need to read a lot of variables from global memory at the same addresses. Unfortunately, the variables are too large to all fit in local memory. As a consequence, reading these variables takes 80% of the time, even though it accounts for less than 5% of the instructions.
Can anyone suggest a way to speed up the access to these shared variables?

(My procedure is somewhat similar to the multiplication of two matrices.)

Thank you

Optimization is very specific to the hardware you’re targeting, and also to the problem. Without much more detail you’re only going to get vague answers. Some of the things that are generally a good idea on a GPU when accessing global memory:
- Ensure that memory accesses are coalesced. That means that consecutive threads should access consecutive addresses: each thread reads the memory that immediately follows that of the previous thread.
- If the GPU doesn’t have an L1 cache (e.g. NVIDIA prior to Fermi), copy a chunk of data into shared memory (called local memory in OpenCL) and then work on it before loading another chunk. This is particularly useful if the data is reused, as occurs in matrix multiplication.
- Put the data in an image and access it through a sampler.

Depending on how similar your operation is to matrix multiply, try reading some of the papers on it, e.g. Google for "Volkov matrix multiply" or "Nakasato matrix multiply".
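
To make the coalescing point a bit more concrete, here is a rough host-side sketch in plain C (not kernel code) of two ways a kernel might map a thread id to an array index. The function names and parameters are made up for illustration. In the first pattern, neighbouring threads touch neighbouring elements on every iteration. In the second, each thread walks its own contiguous chunk, so neighbours are `chunk` elements apart, which is the kind of access that usually does not coalesce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element index read by thread `gid` on loop
   iteration `k` when consecutive threads read consecutive elements.
   Neighbouring threads are exactly 1 element apart (coalesced). */
size_t coalesced_index(size_t gid, size_t k, size_t num_threads)
{
    assert(gid < num_threads);
    return k * num_threads + gid;
}

/* Hypothetical helper: element index when each thread walks its own
   contiguous chunk of `chunk` elements. Neighbouring threads are
   `chunk` elements apart on every iteration (strided). */
size_t strided_index(size_t gid, size_t k, size_t chunk)
{
    assert(k < chunk);
    return gid * chunk + k;
}
```

With 64 threads, `coalesced_index(1, 0, 64) - coalesced_index(0, 0, 64)` is 1, while `strided_index(1, 0, 16) - strided_index(0, 0, 16)` is 16. Whether the strided form actually hurts depends on the hardware, so treat this only as a way to check your own indexing.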

bmerry is correct; this will take some work.

First (and it seems you’ve done this), code it to use global memory, to work out the algorithm.

Then, figure out how to use shared memory, up to its limited size.

If you can’t fit everything you need, figure out some subset that will be useful.
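
As a rough way to pick that subset for a matrix-multiply-like problem, you can work out how large a square tile fits in shared memory. The sketch below is plain C with a hypothetical helper name. The 16 KiB used in the note after it is only an example value; in OpenCL the real figure can be queried with `clGetDeviceInfo` and `CL_DEVICE_LOCAL_MEM_SIZE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest tile edge T such that `num_tiles`
   square T x T tiles of `elem_size`-byte elements fit in
   `local_bytes` bytes of shared/local memory. */
size_t max_tile_edge(size_t local_bytes, size_t elem_size, size_t num_tiles)
{
    assert(elem_size > 0 && num_tiles > 0);
    size_t t = 0;
    /* Grow the edge while the next size still fits. */
    while ((t + 1) * (t + 1) * elem_size * num_tiles <= local_bytes)
        ++t;
    return t;
}
```

For example, `max_tile_edge(16384, 4, 2)` gives 45 for two tiles of 4-byte floats. In practice you would probably round that down to something like 32 so the tile edge matches the work-group size and leaves some room for other uses.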

There are great examples of using shared memory for array multiplies. Find them and study them to figure out how to make the best use of shared memory.
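
In case it helps while reading those examples, here is a minimal CPU-side sketch in plain C of the loop structure they tend to share. It is not a kernel and not taken from any particular paper. `TILE`, `ta` and `tb` are assumptions for illustration: the two small arrays stand in for shared memory, and each (bi, bj) pair plays the role of one work-group. On a real GPU the staging loop would be spread across the threads of the group, with a barrier (in OpenCL, `barrier(CLK_LOCAL_MEM_FENCE)`) before and after the compute step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4  /* hypothetical tile edge, chosen small for illustration */

/* C = A * B for n x n row-major matrices; n must be a multiple of TILE. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(n % TILE == 0);
    memset(c, 0, n * n * sizeof(float));

    for (size_t bi = 0; bi < n; bi += TILE)        /* one "work-group" ... */
        for (size_t bj = 0; bj < n; bj += TILE)    /* ... per output tile  */
            for (size_t bk = 0; bk < n; bk += TILE) {
                /* Stand-ins for shared/local memory. */
                float ta[TILE][TILE], tb[TILE][TILE];

                /* Stage: read each needed element from "global" memory once. */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        ta[i][j] = a[(bi + i) * n + (bk + j)];
                        tb[i][j] = b[(bk + i) * n + (bj + j)];
                    }

                /* Compute: every staged value is reused TILE times. */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        float acc = 0.0f;
                        for (size_t k = 0; k < TILE; ++k)
                            acc += ta[i][k] * tb[k][j];
                        c[(bi + i) * n + (bj + j)] += acc;
                    }
            }
}
```

The point of the structure is that each element of A and B is fetched from global memory once per tile rather than once per multiply, which is where the saving comes from when global reads dominate the run time.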