Kernel execution time drops when not initializing buffers

Hello,

I am experiencing strange behaviour when measuring the execution time of an OpenCL kernel. The kernel expects three buffers as input. I create those buffers in the host code and initialize them by using CL_MEM_COPY_HOST_PTR. I then measure the kernel execution time via OpenCL events. However, when I omit CL_MEM_COPY_HOST_PTR, the kernel execution time drops to a third. Is there an explanation for this behaviour and can it be changed so that not initializing the buffers has no impact on the kernel execution time?
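
For reference, event-based timing of this kind usually comes down to subtracting two device timestamps. The sketch below assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the two values were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, both reported in nanoseconds). The helper name elapsed_ms is hypothetical, and the OpenCL calls appear only in comments so the snippet compiles without an OpenCL SDK.

```c
#include <stdint.h>

/* Hypothetical helper: convert two profiling timestamps to milliseconds.
 * In real host code the inputs would be cl_ulong values obtained, after
 * waiting on the kernel's event, via
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &start_ns, NULL);
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
 *                           sizeof(cl_ulong), &end_ns, NULL);
 */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    /* Guard against swapped or invalid timestamps. */
    if (end_ns < start_ns)
        return 0.0;
    return (double)(end_ns - start_ns) / 1.0e6; /* ns -> ms */
}
```

Using the same helper for both the initialized and the uninitialized run keeps the measurement method identical across the two cases.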

Thanks in advance!

Could your kernel execution time be dependent on the input data?

Try running an empty kernel with those buffers attached first:
clEnqueueTask(…)
clFinish(…)
//Do your thing
Are any of your buffers flagged as “CL_MEM_USE_HOST_PTR”?

Thanks for your suggestions.

Could your kernel execution time be dependent on the input data?

It is an implementation of matrix vector multiplication so the execution time should not be dependent on the input data.
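
A plain C reference version illustrates why the amount of work is fixed by the dimensions alone. This is only a sketch, not the kernel under discussion: the function name matvec and the row-major layout are assumptions made for the example.

```c
#include <stddef.h>

/* Reference y = A * x for a rows-by-cols matrix stored row-major
 * (assumed layout). The loop bounds depend only on rows and cols,
 * never on the values stored in A or x. */
static void matvec(const float *A, const float *x, float *y,
                   size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c)
            acc += A[r * cols + c] * x[c];
        y[r] = acc;
    }
}
```

Every run performs rows * cols multiply-adds regardless of the data, so any timing difference would have to come from the memory system or the runtime rather than from the arithmetic itself.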

Try running an empty kernel with those buffers attached first:
clEnqueueTask(…)
clFinish(…)
//Do your thing
Are any of your buffers flagged as “CL_MEM_USE_HOST_PTR”?

No, the buffers are flagged as CL_MEM_READ_WRITE. Running an empty kernel first doesn’t change anything.

Another thing I noticed:
When I print the buffers from my kernel, I can see that all elements are 0. However, when I explicitly fill the buffers with 0s via enqueueWriteBuffer the execution time does not get reduced. So the execution time is definitely not dependent on the input data. It seems as if the OpenCL compiler makes some optimizations because it notices that the buffers are not getting initialized. Is there a way to disable optimizations to test if that’s the case?

When I print the buffers from my kernel, I can see that all elements are 0. However, when I explicitly fill the buffers with 0s via enqueueWriteBuffer the execution time does not get reduced.

It may be that a newly created buffer is encoded as a data structure that the memory controller interprets as “all zeroes”, so no actual memory reads are performed. I’m not sure what this kind of “optimization” actually achieves, but filling a single element of the buffer is probably enough to work around it.

I’m not sure what this kind of “optimization” actually achieves, but filling a single element of the buffer is probably enough to work around it.

I tried it. Writing a single element is not enough. Surprisingly, the execution time rises linearly with the amount of data written into the buffer.

This behavior makes less sense with every second. Try writing only the first and the last element. If that still doesn’t work, you may try writing to N randomly distributed positions, though that is likely to take even more time than filling the whole buffer.
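
A minimal sketch of how such write positions could be chosen on the host. The helper name pick_write_offsets and its parameters are hypothetical. It only computes byte offsets: the first and last element first, then pseudo-random positions. Each offset would then be passed as the offset argument of a clEnqueueWriteBuffer call that writes elem_size bytes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill `offsets` with up to n byte offsets into a
 * buffer holding n_elems elements of elem_size bytes each.
 * Entry 0 is the first element, entry 1 the last element, and the
 * remaining entries are pseudo-random element positions (duplicates
 * are possible). Returns the number of offsets written. */
static size_t pick_write_offsets(size_t *offsets, size_t n, size_t n_elems,
                                 size_t elem_size, unsigned seed)
{
    size_t count = 0;

    if (n == 0 || n_elems == 0)
        return 0;

    offsets[count++] = 0;                               /* first element */
    if (count < n)
        offsets[count++] = (n_elems - 1) * elem_size;   /* last element */

    srand(seed);                                        /* reproducible runs */
    while (count < n)
        offsets[count++] = ((size_t)rand() % n_elems) * elem_size;

    return count;
}
```

Calling it with n = 2 gives the first-and-last test; larger n gives the scattered variant, so the kernel time can be compared against the number of positions written.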