global memory coalescing question

Hi,

If I run this test kernel, where input_data and output_data are pointers to global floats:


float test;
for ( int i = 0 ; i < 1024; i++ )
{
  barrier( CLK_GLOBAL_MEM_FENCE);
  float f = *(input_data + i);
  test = (float)get_local_id(0);
}
barrier( CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;

then openclprof tells me that my global memory accesses are coalesced.

But if I run this kernel:


float test;
for ( int i = 0 ; i < 1024; i++ )
{
  barrier( CLK_GLOBAL_MEM_FENCE);
  float f = *(input_data + i);
  //test = (float)get_local_id(0);
  test = f;
}
barrier( CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;

then my global accesses aren’t coalesced. This has me stumped: as far as I can tell, on every iteration all my threads read the same input_data address at the same time, and at the end all my threads write to adjacent output_data addresses at the same time. I think my addresses are aligned correctly in both cases as well. So why does assigning f to test inside the loop make my accesses uncoalesced?

What am I missing here?

Whoops, I forgot to add: I’m running this code on a compute capability 1.1 board.

Hmmm, if I change the code to:


float test;
for ( int i = 0 ; i < 1024; i++ )
{
  barrier( CLK_GLOBAL_MEM_FENCE);
  float f = *(input_data + get_local_id(0)); // indexing off tid instead of loop counter
  //test = (float)get_local_id(0);
  test = f;
}
barrier( CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;

then the memory accesses are coalesced. I had thought that all threads reading the same address was a special case of coalesced access, but openclprof tells me I’m wrong. Maybe what happened in my first (“coalesced”) version was that the compiler saw I never used the value loaded from input_data and so removed the load altogether?
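For what it's worth, this matches the coalescing rule as I understand it from the CUDA programming guide for compute capability 1.0/1.1 devices: a half-warp of 16 threads gets a single transaction only if thread k reads the k-th 32-bit word of a 64-byte-aligned segment. Anything else, including every thread reading the same word, falls back to 16 separate transactions. Here is a minimal host-side sketch of that rule in plain C (not kernel code). The names transactions_cc11, fill_broadcast and fill_per_thread are made up for illustration, and the model assumes all 16 threads are active and only handles 4-byte reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP     16                        /* threads serviced together on CC 1.x */
#define WORD_BYTES    4                         /* sizeof(float) on the device */
#define SEGMENT_BYTES (HALF_WARP * WORD_BYTES)  /* 64-byte segment for 4-byte words */

/* Simplified CC 1.0/1.1 rule (an assumption-laden model, not the hardware):
 * returns 1 if thread k reads word k of one 64-byte-aligned segment,
 * otherwise HALF_WARP, i.e. one transaction per thread.
 * byte_addr[k] is the byte address read by thread k of the half-warp. */
int transactions_cc11(const uint64_t byte_addr[HALF_WARP])
{
    assert(byte_addr != NULL);

    uint64_t base = byte_addr[0];
    if (base % SEGMENT_BYTES != 0)
        return HALF_WARP;                       /* segment is not aligned */

    for (int k = 1; k < HALF_WARP; k++) {
        if (byte_addr[k] != base + (uint64_t)k * WORD_BYTES)
            return HALF_WARP;                   /* thread k is not on word k */
    }
    return 1;
}

/* Access pattern of `*(input_data + i)`: every thread reads element i. */
void fill_broadcast(uint64_t byte_addr[HALF_WARP], uint64_t base, uint64_t i)
{
    for (int k = 0; k < HALF_WARP; k++)
        byte_addr[k] = base + i * WORD_BYTES;
}

/* Access pattern of `*(input_data + get_local_id(0))`: thread k reads element k. */
void fill_per_thread(uint64_t byte_addr[HALF_WARP], uint64_t base)
{
    for (int k = 0; k < HALF_WARP; k++)
        byte_addr[k] = base + (uint64_t)k * WORD_BYTES;
}
```

Under this model the input_data + i pattern costs 16 transactions per half-warp on every loop iteration, while the input_data + get_local_id(0) pattern costs 1 as long as the buffer starts on a 64-byte boundary.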

That was my guess when I saw the example.