Hi,
In my code, I have an private array:
__private float foo[10][2];
To make sure it stays in the registers and out of high-latency local memory, I use array offsets that are computed at compile-time. When I’m finished filling it with computed data, I want to transfer the array to thread-specific offset in global memory.
If I do the transfer this way:
__local float bar = foo[0][0];
*(output_data + get_global_id(0)) = test;
then it’s nice and fast. But if I do the transfer this way:
*(output_data + get_global_id(0)) = foo[0][0];
then it’s horribly slow. Really s-l-o-w.
I’ve gone through the various docs multiple times, but I still can’t figure out why this is. If it was a problem with global memory coalescing, wouldn’t it manifest itself in both examples? Can anyone enlighten me?
Thanks,
Polly