Can somebody please clarify for me how clEnqueueWriteBuffer() operates at a hardware/OS level, and how much of its behaviour is defined by the specification versus left to the implementation?

The reason I ask is that I'm writing a biophysics simulator on my macOS system, and it seems that I only need to call clEnqueueWriteBuffer() once, at the very beginning. After that, I read results back out to that same memory address (in order to perform inter-cell communication), and whatever is there, even though it's modified between timesteps, gets sent back down the pipeline to the GPU on the next call to clEnqueueNDRangeKernel(). I don't need to call clEnqueueWriteBuffer() again.

This doesn't seem to me like it's really according to the spec. It looks more like a lucky coincidence of the implementation, e.g. the same memory locations happening to be preserved between calls. But I'd really like to know the official stance. Ultimately I will be running this on other systems, so I want something that's as independent of the OpenCL implementation as possible.

Furthermore, skipping the repeated writes doubles the speed of my code (it takes half the time). So if this is against the spec, is there an alternative way of specifying my memory and accesses such that I can guarantee this behaviour (by mapping the buffer, perhaps)?

I would definitely appreciate some feedback on this one, especially if some of the people working on the official spec are still around. It's something I've been wondering about for a few months now, but nobody I know seems to be able to answer it. Feel free to be as technical as you like in your response; it shouldn't scare me too much.