register to global memory performance mystery

Hi,

In my code, I have an private array:


__private float foo[10][2];

To make sure it stays in the registers and out of high-latency local memory, I use array offsets that are computed at compile-time. When I’m finished filling it with computed data, I want to transfer the array to thread-specific offset in global memory.

If I do the transfer this way:


__local float bar = foo[0][0];
*(output_data + get_global_id(0)) = test;

then it’s nice and fast. But if I do the transfer this way:


*(output_data + get_global_id(0)) = foo[0][0];

then it’s horribly slow. Really s-l-o-w.

I’ve gone through the various docs multiple times, but I still can’t figure out why this is. If it was a problem with global memory coalescing, wouldn’t it manifest itself in both examples? Can anyone enlighten me?

Thanks,

Polly