Does NDRange affect performance?

Hello.

I thought that the grid dimensions used for the NDRange when launching a kernel was just a practical issue, but that it did not affect to the kernel execution time. Nevertheless I performed some tests on a NVIDIA GTX285 that do not agree with this. I implemented a trivial kernel that reads a vector of 2**20 items from global memory.

Code for one-dimensional NDRange:


unsigned int tid = get_global_id(0);
register x = data[tid];


unsigned int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
register x = data[tid];

I compiled the code with -cl-opt-disable option. I launched the kernel many times, always with 1x512 threads in each work-group and different NDRange dimensions, but always the same global size (1x220, 2x219 4x218,…,2048x29). The execution time is lower for the extreme values (217 ms.) and higher for the central ones (232 ms. for 16x2**16) (each time the timing is averaged over 256 executions and computed with opencl events).

Moreover, the execution of a one-dimensional NDRange of 2048 work-groups of 512 threads each is much faster (101 ms.) than a two-dimensional one of dimensions 1x2**20.

If I read the data many times from the kernel (adding a loop) the differences between the two-dimensional versions are even higher, but the difference between the one-dimensional and the two-dimensional is neglible.

Does it make sense? Am I ignoring something?

Thank you for your help.

The local range has a direct affect on execution, if you’re not specifying it explicitly then the opencl runtime is just making one up to fit the problem, and since it has less information about what your kernel is doing, it might not make a good guess.

But I suspect your main issue is your column-major matrix access.

In all current cards, work items are assigned to hardware threads/wavefronts in row-major order: i…e x first, then y, etc. If you’re using column-major order you’re threads will be accessing memory in an very much less than ideal fashion.

i…e any indexing for global memory should be global_id(0) + global_id(1) * global_size(0), etc. local store can sometimes be used to re-arrange the data if that isn’t what you need, or if you need both row and column accessing one can use images.

Sorry if I’m not clear, but I’m used to CUDA and sometimes I mix both terminologies. In my test I fixed the local range to 1x512, so I think it should not affect the results.

I agree that the accessing pattern (row or column major order) must affect the results, but I actually tried both versions and the results were better for the pattern I showed. If I change the access pattern as you suggested [global_id(0) + global_id(1) * global_size(0)], the results do not change for 1x220, but are much worse in other cases. The worst case is for 16x216 with 554 ms.

For the version that performs many reads (50) in a loop in the kernel, the same fact occurs: The access time with your suggested pattern is much greater.

I think that the differences for the one-dimensional and two-dimensional cases can come from the multiplication and the sum that must be performed to compute the ‘tid’ variable. This would explain why the difference disappears when I perform a loop of 50 memory reads in the kernel.

In any case, I cannot explain why a 16x216 NDRange performs much worse than a 1x220 one (e.g. 3300 ms. vs. 5850 ms. if I do 50 iterations in the kernel).

Does anyone have any clue? Can anyone confirm how the threads are arranged in warp/wavefronts on a two-dimensional NDRange?

Thanks again for your answers.

Hello.

I think I know which the problem is. As I described in the previous posts I defined column-wise work-groups of fixed size (1x512). I have now tried with a row-wise workgroup (512x1) and also inverted the NDRange’s dimension sizes. Moreover, I also computed the tid of each thread according to notzed’s proposal: global_id(0) + global_id(1) * global_size(0). With this layout, as it should be expected, the time does not significantly vary if I change the dimensions’ size of the NDRange (assuming I always keep the same “area”, get_global_size(0) * get_global_size(1)).

I think the reason is that for the column-wise layout the actual thread IDs did not equal the ‘tid’ variables and therefore the memory accesses were not always fully coalesced. I checked that for this layout computing tid as get_group_id(1) * get_global_size(0) * get_local_size(1) + get_local_id(1) leads to the correct ‘thread id’/‘tid’ mapping and the timing confirms it. A picture where the work-groups are drawn can help to understand the problem.

Although I resolved my doubt (or that’s what I hope), all further comments are welcome.

Sorry, the bolded formula in the previous post should be:
get_group_id(1) * get_global_size(0) * get_local_size(1) + get_group_id(0) * get_local_size(1) + get_local_id(1)

It is a specific case for a more general formula when the work-group are column-wise (i.e. get_local_size(0) == 1). If I didn’t make another mistake the more general formula should be:
get_group_id(1) * get_global_size(0) * get_local_size(1) + get_group_id(0) * get_local_size(1) * get_local_size(0) + get_local_id(1) * get_local_size(0) + get_local_id(0)

Ahh sorry, my mind just read ‘512x1’ when you wrote ‘1x512’ since I wouldn’t ever consider using such a layout.

My guess is that in that case the different global work arrangements are causing different global memory bank conflicts, but I’m too lazy to do all the maths to prove it.

But as a starter, even if your workgroup memory access is coalesced and accessing consecutive memory cells within the workgroup - which it should be with 512 work items and the indexing you’re using - each workgroup will not necessarily be accessing consecutive blocks of memory. Because even though you are using y followed by x for the local work-group index, opencl will always use x followed by y for breaking up the global work-size into work-groups and assigning them to hardware CUs/SMs and threads.

So it would be possible to hit some values of the global size x/local size x ratio where consecutive (or at least, concurrently executing) compute units are working on the same banks of memory and causing problems, and others where there might be less overlap. One presumes the hardware has been optimised for consecutive memory accesses globally, rather than any other pattern.

OTOH with an x-ordered layout, where each work item maps to each consecutively increasing memory address globally, i.e. each work group also access consecutive memory blocks as well: one will always get the same bank conflicts for the same total global work size.

i.e. if you always just use row-major-order, you just don’t have to worry about it for such problems and the only limitations you will hit are the ones the hardware designer left in and cannot be avoided.

(Well, the local work-size and data element size will also affect this, and although the first can be tweaked, it is still somewhat constrained)