Hi,
If I run this test kernel, where input_data and output_data are pointers to global floats:
float test;
for (int i = 0; i < 1024; i++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);
    float f = *(input_data + i);
    test = (float)get_local_id(0);
}
barrier(CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;
then openclprof tells me that my global memory accesses are coalesced.
But if I run this kernel:
float test;
for (int i = 0; i < 1024; i++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);
    float f = *(input_data + i);
    //test = (float)get_local_id(0);
    test = f;
}
barrier(CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;
then openclprof reports that my global accesses aren’t coalesced. This has me stumped. As far as I can tell, in both kernels all my threads read the same input_data address at the same time, and all my threads write to adjacent output_data addresses at the same time. I think my addresses are aligned correctly in both cases as well. So why does assigning f to test inside the loop, instead of the local ID, make my accesses uncoalesced?
What am I missing here?