Global Memory - when is it changed?

I have the following kernel. When I copy the local variable “data” into “retval[offset]”, do I really initiate a write to global memory every time? Or is the compiler intelligent enough to wait until the end of the thread and only then write the complete retval to global memory in one go?

__kernel void some_kernel(__global float2* retval){

	// declarations of bits and offset omitted
	float2 data;
	for(int i=0;i<16;i++,offset++){
		data.x=bits.x>>=1;
		data.y=bits.x>>=1;
		retval[offset]=data;
	}

	for(int i=0;i<16;i++,offset++){
		data.x=bits.y>>=1;
		data.y=bits.y>>=1;
		retval[offset]=data; // the write in question
	}

}

I strongly doubt that the compiler will optimize this. If the writes went to the same location in each iteration (with += for example), I could imagine the compiler optimizing that. But since the writes go to different locations, there’ll probably be a write in each iteration.

Why don’t you just apply the optimization by hand and see if it makes any difference? I guess you would have to use vector types though, because otherwise you’ll have individual writes just as before.
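To make the "by hand" suggestion concrete, here is a minimal sketch that packs two float2 results into one float4 and stores it with vstore4, so each loop issues 8 stores instead of 16. The kernel name some_kernel_packed and the arguments bits_x, bits_y and offset are hypothetical, since the original post does not show where bits and offset come from. The #ifndef __OPENCL_VERSION__ shim is only an assumption-laden convenience so the same logic can be checked on the host with a plain C compiler. Whether each vstore4 really becomes a single wide memory transaction depends on the device.

```c
/* Host-side shim (assumption): lets a plain C compiler check the logic.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
typedef unsigned int uint;
typedef struct { float x, y; } float2;
typedef struct { float x, y, z, w; } float4;
static void vstore4(float4 v, size_t i, float *p)
{
    p[4 * i + 0] = v.x; p[4 * i + 1] = v.y;
    p[4 * i + 2] = v.z; p[4 * i + 3] = v.w;
}
#endif

/* Hypothetical variant of some_kernel: bits_x, bits_y and offset are
   passed in as arguments because the original does not show their origin. */
__kernel void some_kernel_packed(__global float2 *retval,
                                 uint bits_x, uint bits_y, uint offset)
{
    /* View the output as a flat float array starting at retval[offset]. */
    __global float *out = (__global float *)(retval + offset);

    /* 16 float2 values from bits_x, packed two per float4. */
    for (int i = 0; i < 8; i++) {
        float4 v;
        v.x = (float)(bits_x >>= 1);
        v.y = (float)(bits_x >>= 1);
        v.z = (float)(bits_x >>= 1);
        v.w = (float)(bits_x >>= 1);
        vstore4(v, i, out);            /* writes floats 4*i .. 4*i+3 */
    }

    /* 16 float2 values from bits_y, continuing right after the first 16. */
    for (int i = 8; i < 16; i++) {
        float4 v;
        v.x = (float)(bits_y >>= 1);
        v.y = (float)(bits_y >>= 1);
        v.z = (float)(bits_y >>= 1);
        v.w = (float)(bits_y >>= 1);
        vstore4(v, i, out);
    }
}
```

The output layout is the same as in the original kernel, so you can time both versions on your device and compare. vstore4 only needs float alignment, which is why I used it here instead of casting retval to a float4 pointer.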

This will be completely dependent on the particular compiler/hardware it is run on. I wouldn’t expect most compilers to optimize this by accumulating your 32 float2 writes into a buffer and then transferring that buffer en masse unless the device’s hardware is directly conducive to this sort of memory model (I can think of one device where that might be exactly what they do). Other devices may have “write gathering hardware” which batches up sequential writes and sends them efficiently to memory. Yet other devices cache writes and efficiently flush lines back to global memory. And some devices will issue separate writes just like you’ve written, which may or may not be an efficiency issue.

As dominik suggested, if you think this code isn’t efficient on the device(s) you care about, then you should try doing this manually. If your devices have __local memory, then you should look at whether it helps to write there and then use the async copy built-ins (async_work_group_copy) to transfer the data from local to global memory.
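A rough sketch of the __local staging idea, under several assumptions: the kernel name some_kernel_staged, the WG_SIZE and PER_ITEM constants, and the bits_x/bits_y arguments are all hypothetical, and the staging buffer (16 KiB here) has to fit in your device's local memory. Each work-item writes its 32 float2 results into its own slice of local memory, and the work-group then copies the whole block to global memory in one go.

```c
/* Hypothetical sketch: WG_SIZE, PER_ITEM and the bits_* arguments are
   assumptions, not taken from the original kernel. */
#define WG_SIZE  64   /* must match the local work size used at enqueue */
#define PER_ITEM 32   /* float2 values produced by each work-item */

__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void some_kernel_staged(__global float2 *retval, uint bits_x, uint bits_y)
{
    /* 64 * 32 * sizeof(float2) = 16 KiB of local memory. */
    __local float2 stage[WG_SIZE * PER_ITEM];

    /* Each work-item fills its own slice of the staging buffer. */
    __local float2 *mine = stage + get_local_id(0) * PER_ITEM;

    for (int i = 0; i < 16; i++) {
        float2 data;
        data.x = (float)(bits_x >>= 1);
        data.y = (float)(bits_x >>= 1);
        mine[i] = data;
    }
    for (int i = 16; i < 32; i++) {
        float2 data;
        data.x = (float)(bits_y >>= 1);
        data.y = (float)(bits_y >>= 1);
        mine[i] = data;
    }

    /* The async copy does not synchronize the source data itself, so make
       sure every work-item has finished writing to local memory first. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* All work-items in the group must make this call with the same
       arguments; the group's block lands contiguously in retval. */
    event_t ev = async_work_group_copy(
        retval + get_group_id(0) * WG_SIZE * PER_ITEM,
        stage, WG_SIZE * PER_ITEM, 0);
    wait_group_events(1, &ev);
}
```

Whether this beats the direct writes depends on the device, so it is worth measuring both. On hardware that already caches or gathers writes, the extra trip through local memory may not help at all.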