I thought that the grid dimensions used for the NDRange when launching a kernel were just a practical matter and did not affect kernel execution time. However, some tests I ran on an NVIDIA GTX 285 disagree with this. I implemented a trivial kernel that reads a vector of 2**20 items from global memory.

Code for the one-dimensional NDRange:
Code :
unsigned int tid = get_global_id(0);
float x = data[tid]; /* assuming a float buffer; bare "register" is not a valid type in OpenCL C */
Code for the two-dimensional NDRange:
Code :
unsigned int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
float x = data[tid];
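Just to be explicit about what the two-dimensional index computation does (a sketch in Python, not part of the actual kernels): for any grid split dim0 x dim1 with dim0 * dim1 == N, the expression get_global_id(0) * get_global_size(1) + get_global_id(1) enumerates exactly the linear indices 0 .. N-1, so every configuration below reads the same elements and only the access pattern per work-group changes.

```python
def flatten_indices(dim0, dim1):
    """Linear indices tid = g0 * dim1 + g1 produced by the 2D kernel
    over a dim0 x dim1 NDRange, in row-major launch order."""
    return [g0 * dim1 + g1 for g0 in range(dim0) for g1 in range(dim1)]

N = 2 ** 20
for dim0 in (1, 2, 16, 2048):
    dim1 = N // dim0
    # Every grid shape covers the indices 0 .. N-1 exactly once.
    assert flatten_indices(dim0, dim1) == list(range(N))
```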

I compiled the code with the -cl-opt-disable option. I launched the kernel many times, always with 1x512 work-items per work-group and different NDRange dimensions, but always the same global size (1x2**20, 2x2**19, 4x2**18, ..., 2048x2**9). The execution time is lowest for the extreme shapes (217 ms) and highest for the central ones (232 ms for 16x2**16); each timing is averaged over 256 executions and measured with OpenCL events.

Moreover, a one-dimensional NDRange of 2048 work-groups of 512 work-items each executes much faster (101 ms) than a two-dimensional one of dimensions 1x2**20.

If I read the data many times from the kernel (by adding a loop), the differences between the two-dimensional versions grow even larger, but the difference between the one-dimensional and two-dimensional versions becomes negligible.

Does this make sense? Am I overlooking something?

Thank you for your help.