Memory error occurs after releasing a memory object

Hello.

I have two kernel functions (A and B).
Kernel B uses the output of kernel A, so they are executed in order.
Both kernels are run repeatedly in a for loop.

The pseudocode looks like this:


for each iteration
{
	Set kernel arguments for A
	Run kernel A
	Read the result from A

	Create a memory object m1
	Write data to the memory object m1
	Set kernel arguments for B (including m1)

	Run kernel B
	Read the result from B
	Release the memory object m1
}

In the first iteration both kernels work well,
but a memory access error occurs when “Run kernel A” is called in the second iteration.
The error seems to be related to a memory write, because it is raised in the file ‘write.c’ with the message:

Unhandled exception at 0x0252778d in MyOpenCL.exe: 0xC0000005: Access violation reading location 0xfeeeff5e.

I couldn’t track it down, because the error occurs when I call ‘clEnqueueNDRangeKernel’ to execute kernel A in the second iteration.

Interestingly, if I do not release the memory object m1, no error occurs.
But m1 has nothing to do with kernel A. It is used only by B.

Can anybody help me find where the error comes from?
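
One detail in the exception message may be worth checking: the faulting address 0xfeeeff5e lies just above 0xFEEEFEEE, the fill pattern that the Windows debug heap writes over freed memory. That would be consistent with some code reading through a pointer stored in an already-freed block, although it does not say which block. A minimal C sketch of that check follows; the helper name near_freed_fill and the 4096-byte window are assumptions made for illustration, not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Fill pattern the Windows debug heap writes over freed blocks. */
#define FREED_FILL 0xFEEEFEEEu

/* Hypothetical helper: returns how far addr lies above the fill pattern,
   or -1 if addr is not within `window` bytes above it. */
long near_freed_fill(uint32_t addr, uint32_t window)
{
    if (addr < FREED_FILL || addr - FREED_FILL >= window)
        return -1;
    return (long)(addr - FREED_FILL);
}
```

For the address in the message, near_freed_fill(0xfeeeff5e, 4096) gives 0x70, which looks like a read at offset 0x70 from a pointer whose value was 0xFEEEFEEE. This is a hint rather than a diagnosis.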

What is your OpenCL implementation?
Do you really have two different cl_kernel objects?

Yes, I have two kernel functions in a file ‘my_kernel.cl’ and both of them are built without errors.
Each of them has a different kernel id, and they share the same command queue.

As I wrote, running them once causes no errors. The problem is that an error occurs when I run kernel A in the second iteration.

Why do you have to release the memory object?

Just leave it there. Copy the data from A to m1, set m1 as B’s argument, then repeat.

I release m1 because the size of the data written to m1 changes in each iteration.

If creating / releasing the memory buffer object is costly, maybe I have to change my code…
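
If the size changes in every iteration, one option that avoids a create/release pair per iteration is to keep a single buffer and reallocate it only when the data outgrows it. Below is a minimal sketch in plain C, assuming the allocation and release steps are passed in as callbacks. The names reusable_buf, ensure_capacity, destroy_buf, and the callback types are hypothetical; in an OpenCL host program the callbacks would wrap clCreateBuffer and clReleaseMemObject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t bytes);    /* e.g. a wrapper around clCreateBuffer */
typedef void  (*release_fn)(void *handle);  /* e.g. a wrapper around clReleaseMemObject */

typedef struct {
    void      *handle;       /* current buffer, NULL if none yet */
    size_t     capacity;     /* bytes allocated for handle */
    alloc_fn   alloc;
    release_fn release;
    unsigned   allocations;  /* number of successful allocations so far */
} reusable_buf;

/* Make sure the buffer can hold `needed` bytes.
   Reallocates only when `needed` exceeds the current capacity.
   Returns 0 on success, -1 if allocation failed (the old buffer is kept). */
int ensure_capacity(reusable_buf *b, size_t needed)
{
    if (b->handle != NULL && needed <= b->capacity)
        return 0;                      /* reuse the existing buffer */

    void *fresh = b->alloc(needed);
    if (fresh == NULL)
        return -1;

    if (b->handle != NULL)
        b->release(b->handle);         /* only once no queued work still uses it */
    b->handle = fresh;
    b->capacity = needed;
    b->allocations++;
    return 0;
}

/* Release the buffer once, after the loop has finished. */
void destroy_buf(reusable_buf *b)
{
    if (b->handle != NULL)
        b->release(b->handle);
    b->handle = NULL;
    b->capacity = 0;
}
```

In the loop, ensure_capacity would be called before each write, and only the bytes actually needed would be written. Whenever the handle changes, kernel B’s argument would have to be set again.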

Creating and releasing memory buffers can indeed be costly, because it involves an allocation on the device. However, writing data to the memory object can be far more costly. If you transfer data in every iteration, you need to do a large amount of computation per iteration to amortize that cost. Otherwise the time spent sending the data will outweigh any benefit you get from the device acceleration.

This is one of the major issues with using GPUs. If you can’t keep your data on the card you will spend most of your time moving data rather than computing. I would highly encourage you to figure out how to avoid having to write data to the memory object from the host on every iteration.

(Each transfer has a fixed setup overhead plus the transfer time itself, and the transfer rate is slow: ~5 GB/s on PCIe x16.)
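
As a rough illustration of this cost, a transfer can be modelled as a fixed overhead plus the size divided by the bandwidth. The sketch below is a back-of-the-envelope model, not a measurement; the function names are hypothetical, and the 10 µs overhead used in the note after the code is an assumed placeholder value.

```c
#include <assert.h>

/* Estimated host-to-device transfer time in seconds:
   fixed per-transfer overhead plus bytes / bandwidth. */
double transfer_seconds(double bytes, double overhead_s, double bytes_per_s)
{
    return overhead_s + bytes / bytes_per_s;
}

/* Fraction of one iteration spent on the transfer, assuming the
   transfer and the kernel run one after the other (no overlap). */
double transfer_fraction(double transfer_s, double compute_s)
{
    return transfer_s / (transfer_s + compute_s);
}
```

With 5 GB/s and the assumed 10 µs overhead, transfer_seconds(100e6, 10e-6, 5e9) comes to about 20 ms for 100 MB. A kernel that also runs for about 20 ms per iteration would then spend roughly half of each iteration waiting on the transfer.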