Time of clReleaseMemObject: strange behaviour

Hi,
I'm seeing strange behaviour in the time taken by the function clReleaseMemObject.
First example


float *pfTest = (float *)calloc(50 * 500000, sizeof(float));
cl_mem clTest = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
clReleaseMemObject(clTest);

In this case clReleaseMemObject takes 0.01 s.
Second example


float *pfTest = (float *)calloc(50 * 500000, sizeof(float));
cl_mem clTest = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
cl_mem clTest2 = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
Here I run a kernel with clTest and clTest2 as arguments (for example clTest[iID] = clTest2[iID]; a sketch of such a kernel follows this listing).
clReleaseMemObject(clTest);
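
For reference, the kernel described above is just an element-wise copy. A minimal sketch of what it might look like (the kernel name and the use of get_global_id for iID are my assumptions, not code from this thread):

// Hypothetical copy kernel matching the description above:
// each work-item copies one float from clTest2 to clTest.
__kernel void copy_test(__global float *clTest, __global const float *clTest2)
{
    int iID = get_global_id(0);
    clTest[iID] = clTest2[iID];
}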

In the second case clReleaseMemObject takes 0.2 s.
Why does running a kernel on clTest make clReleaseMemObject take so much longer?
Thx
J

Performance issues like this depend entirely on the implementation. You’ll have to ask the vendor who provided your SDK.

However, I can imagine several things that might cause this. If you just create a buffer, the implementation doesn't have to move it to the device; if you execute a kernel, it does. So if you execute a kernel and then release the buffer, the implementation may have to clean up in two places (on the host and on the device), whereas if you never execute a kernel it might only have to clean up on the host. Of course this is completely dependent on the vendor's implementation.

Hi,
If I follow your idea that the memory has to be released both on the host and on the device:
why does freeing the device copy take 0.19 s (0.20 - 0.01) once a kernel has been launched? That's huge.
I use NVIDIA's OpenCL 1.0 release candidate (GTX 285).
Thx
J

This is implementation defined, isn't it? I was thinking about the time spent releasing buffer objects. When you are in a loop and have to call the kernel multiple times, isn't it better to create the buffer object once outside the loop and then call WriteBuffer on the same buffer with new data on every iteration (see the sketch below)? The only issue is if you want to change the buffer size across iterations: since a buffer can't be resized, you would have to release it and recreate it at the new size on every iteration.
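
A rough sketch of that reuse pattern; Context, Queue, Kernel, the sizes and the host data here are placeholders, not code from this thread:

// Create the buffer once, outside the loop (its size stays fixed across iterations).
cl_int err;
cl_mem buf = clCreateBuffer(Context, CL_MEM_READ_WRITE, nBytes, NULL, &err);

for (int i = 0; i < nIterations; ++i) {
    // Refill the same buffer with this iteration's data instead of
    // creating and releasing a new buffer each time round the loop.
    clEnqueueWriteBuffer(Queue, buf, CL_TRUE, 0, nBytes, pHostData[i], 0, NULL, NULL);
    clSetKernelArg(Kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(Queue, Kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
}
clFinish(Queue);           // make sure all queued work has completed
clReleaseMemObject(buf);   // release once, at the end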

What's the link with my question?
I create the buffer + 1 kernel + 1 release.
The release takes 0.2 s (size: 50 * 500 000 * sizeof(float)) with the kernel, whereas it takes 0.01 s without it.
Run the test with your card and you will see this bug.
Edit: maybe your answer is related to viewtopic.php?f=28&t=1978; it looks like you replied to the wrong thread…

Hi,
Does nobody else find that clReleaseMemObject takes a lot of computation time?
It's strange that I'm the only person with this problem.
Example:
my code on one core of an Intel Xeon CPU: 90 s
with a GTX 285 and OpenCL: 2.5 s (but 1.2 s of those 2.5 s are spent in clReleaseMemObject, which is very strange)
Help
Thx
J

Jonathan,

I do find it strange that clReleaseMemObject() takes so long, but I doubt this is a function of OpenCL per se. This sounds very much like a performance bug with the specific OpenCL implementation you are using. I would suggest filing a report with the vendor who provided your OpenCL implementation to see if they can reproduce it and resolve it.

You might also try to narrow it down by trying another device (for example try running on the CPU, if the vendor supports it) or another vendor’s implementation and seeing if you encounter the same problem. This would help you narrow it down to the vendor, the device, or the machine. I’m sorry I can’t be of more help.

Hi dbs2,
Thanks for your help
I use NVIDIA's OpenCL 1.0 SDK.
However, this SDK does not currently support CPU computation.
I will try the AMD SDK on the CPU as soon as it is available.
I think it's a problem with the NVIDIA SDK.
I just remembered that an OpenCL SDK is currently only provided by NVIDIA, so switching SDKs is complicated…
I have tried with another card and it's the same problem.
Moreover, most people on this forum use the NVIDIA SDK, so most of them should be hitting the same problem with clReleaseMemObject.
And on the NVIDIA OpenCL forum it's impossible to get an answer from someone who actually works on the OpenCL SDK implementation…
Maybe you should split this forum into three parts (AMD, NVIDIA and Mac OS X)…

Jonathan,
I would suggest trying this on a Mac with Snow Leopard and a GTX 285 if you can. That will give you a good data point for determining whether this is a problem with NVIDIA's CL implementation or with your code. (I realize this may be difficult…)

Hi,
In fact, you will see the same problem with the vectorAdd example in the SDK.
If you look at the profiler you will see many memcpyDtoH transfers during the release of the memory.
Moreover, with large vectors the time to release the command queue is huge!
Kernel time: 0.18 s
Release command queue: 0.6 s
I think it's a bug in the NVIDIA driver.
Thx
J

Can you post the source (host + kernel code) so I can test the times on my OS X / GT 9600?

It's just the vectorAdd example, without the CPU computation and with a data size of 512 * 100 000.
Just look at the time spent in the function clReleaseCommandQueue.
Thx
J

Apologies if I've missed something, but isn't this just that the memory object can't be released until the asynchronous computation is complete? As soon as you add the kernel to the mix the work goes from:

- Define memory objects A & B
- Release unused objects

which will probably be optimised to nothing, to:

- Define memory objects A & B
- Upload them to the card
- Execute the kernel
- Wait for the kernel to finish, then release the host and GPU copies of the memory objects

I.e. it's not the release that's taking the time, but the upload + kernel + release. It's just that the waiting time is spent inside the clRelease call.
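
If that is what's happening, adding a clFinish() between the kernel and the release should move the waiting out of clReleaseMemObject. A sketch reusing the names from the original example (Queue and globalSize are placeholders):

clEnqueueNDRangeKernel(Queue, Kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFinish(Queue);                 // all pending work (upload + kernel) completes here
clReleaseMemObject(clTest);      // if the wait was the cost, these should now be cheap
clReleaseMemObject(clTest2);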

That certainly could be the issue.
In general to time CL you need to do:

start = gettime()

loop(100) {
do work
}
clFinish();

end = gettime();
time = (end-start)/100;
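
A concrete version of that pattern for a POSIX host might look like this (Queue, Kernel and globalSize are placeholders for an already set-up program):

#include <sys/time.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* ... after the context, queue, kernel and buffers have been created ... */
double start = now_sec();
for (int i = 0; i < 100; ++i) {
    clEnqueueNDRangeKernel(Queue, Kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
}
clFinish(Queue);   /* drain the queue before stopping the clock */
double elapsed = (now_sec() - start) / 100.0;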

If you don’t call finish then stuff could still be executing. In particular, if you have:

clEnqueue()
clRead()
clRelease()

it’s possible the release will wait for the enqueue and read to finish. However, I’d still consider this a performance bug since the release should just decrement the internal retain count on the memory object and the runtime should have incremented it if it needs to keep it around for the read. So the actual release might not happen until later, but I’d expect clRelease() to execute really quickly all the time.
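
To illustrate the reference-counting idea: the runtime can take its own retain on the buffer while asynchronous work is in flight, so the user's release only drops a count and returns immediately. This is purely a conceptual sketch, not NVIDIA's actual code; buf, Queue, pHost and nBytes are placeholders:

cl_event readDone;
clEnqueueReadBuffer(Queue, buf, CL_FALSE, 0, nBytes, pHost, 0, NULL, &readDone);
/* Internally the runtime would effectively do a clRetainMemObject(buf) here so the
   buffer survives until the asynchronous read has finished.                        */
clReleaseMemObject(buf);         /* should just decrement the count and return quickly */
clWaitForEvents(1, &readDone);   /* the buffer is actually destroyed once the runtime
                                    drops its own reference after the read completes   */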

There is a subtle issue where if you release a memory object allocated with a host pointer you don’t know when the runtime is done with the object so you can’t free your host pointer very reliably. Apple has an extension “clSetMemObjectDestructorAPPLE” that allows you to get a callback when it is safe to do the free. I’d expect something like this to make it into the standard in the future.
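
For illustration, the usage pattern is roughly as follows. I'm assuming the APPLE extension takes the same arguments as the clSetMemObjectDestructorCallback that later appeared in OpenCL 1.1, so treat the exact declaration as an assumption:

/* Called by the runtime once it is completely done with the buffer,
   so the host allocation backing it can be freed safely.            */
static void free_host_ptr(cl_mem memobj, void *user_data)
{
    free(user_data);   /* user_data is the host pointer passed at buffer creation */
}

float *pfTest = (float *)calloc(50 * 500000, sizeof(float));
cl_mem clTest = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                               sizeof(cl_float) * 500000 * 50, pfTest, NULL);
clSetMemObjectDestructorAPPLE(clTest, free_host_ptr, pfTest);
/* ... use the buffer ... */
clReleaseMemObject(clTest);   /* free(pfTest) happens in the callback, not here */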

@PaulS:
I do call clFinish in my case. Just run the profiler on the vectorAdd example and you will see that when you release your data or the command queue there are several memcpyDtoH transfers, which should be unnecessary for releasing the memory because I don't use the CL_MEM_USE_HOST_PTR option…
It's a bug in the NVIDIA driver.
Thx
J