Does Nvidia API grow an unbounded command queue unless clFinish() is called?

Hi all,

I have a simulator which runs under all of the major OpenCL APIs, but if I run it for a long time under Nvidia it eventually runs out of memory. My code involves repeatedly calling the same kernel function. After about 10^8 calls the Nvidia machine kills the process (a self-protective measure) because it has run out of memory. I suspect that the API is maintaining some kind of handle to each of my previous calls to the kernel, either in the command queue (as completed tasks) or as events returned when writing buffers and running kernels. I’m thinking it’s probably the command queue; does anybody have any experience of this? I am not currently calling clFinish() anywhere in my code, as I only do blocking writes and reads, performed before and after each kernel run, so under the other APIs I definitely don’t need it. But I’ve read elsewhere that Nvidia interprets that part of the API spec differently from everyone else.

I’m going to add the calls to clFinish() to my code now and perhaps that will solve my problem. But my code will take 36 hours to reach a crash point and I’d also really like a definitive answer on this one if anyone has one.

Thanks very much,
Dave.

The NVIDIA driver doesn’t grow the command queue indefinitely; it flushes it to the hardware occasionally, probably based on time or size. You probably have a memory leak.

If Task Manager (or equiv.) is open does memory grow as your program runs?

If you are using events (i.e., you get them back from various enqueue calls) you are responsible for releasing them afterwards.

Hi Dithermaster, thanks for the response,

I don’t have access to a task manager type program as this simulation is running on a university provided compute cluster. I submit batch jobs and they eventually get run. The batch control software kills my process eventually as it exceeds the memory limit; an unfortunate side effect of which is that I don’t get access to some of the output if it hasn’t been written to disk before the kill occurs.

I don’t use events; I just pass NULL in place of the event argument. But I read this discussion, which hints (but doesn’t prove) that Nvidia may be storing those event references somewhere, and that perhaps I need to release them even though I never stored them myself.

What do you mean by “it flushes to the hardware occasionally”? Do you mean that even if I keep adding to the end of the queue, it will begin processing commands and removing them from the front? My situation is even simpler than that: I do a blocking write of my buffers to the device, I enqueue an NDRange kernel to simulate my data, then I enqueue a blocking read from the device back to host memory. Then I do some serial transformations of the data on the CPU, re-enqueue the write to the device, and so on. So I’ve always assumed that the blocking reads and writes would clear the queue out regularly without my having to call clFinish().
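To be concrete, the structure of the loop is essentially the following (a simplified sketch in the C API with placeholder names and no error checking, not my actual code):

    #include <CL/cl.h>

    /* Simplified sketch of the loop described above -- placeholder names,
       no error checking; not the actual simulator code. */
    void simulation_loop(cl_command_queue queue, cl_kernel kernel,
                         cl_mem d_in, cl_mem d_out,
                         void *h_in, void *h_out,
                         size_t bytes, size_t global, long iterations)
    {
        for (long i = 0; i < iterations; ++i) {
            /* blocking write: returns once the host data has been copied */
            clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, bytes, h_in,
                                 0, NULL, NULL);

            clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
            clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                   0, NULL, NULL);

            /* blocking read: on an in-order queue this cannot return until
               the kernel has finished */
            clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, h_out,
                                0, NULL, NULL);

            /* ... serial CPU transformation of h_out into the next h_in ... */
        }
    }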

Since I wrote my original post I’ve implemented regular calls to clFinish() and I still get the same behaviour: a crash after about 250,000,000 runs of the kernel. Clearly it’s a memory leak, but I never had this problem with the other APIs. I’ve run sims of similar and longer duration using the AMD API and there was no visible memory leak (no overflow anyway). And now I really need it to work under Nvidia…

One thing I’ve changed relatively recently is that I’ve switched from regular buffers to mapped (pinned) memory buffers. My code had no problem under Nvidia before this change, and it still has no problem under the other APIs since the change.
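For anyone unfamiliar, the mapped/pinned pattern I mean is roughly the following (a generic sketch in the C API with placeholder names, not my exact code); the obvious thing to check in a situation like mine is that every map has a matching unmap:

    #include <CL/cl.h>

    /* Generic pinned-buffer round trip (placeholder names, no error checks).
       The buffer is assumed to have been created with CL_MEM_ALLOC_HOST_PTR. */
    void pinned_round_trip(cl_command_queue queue, cl_mem pinned_buf, size_t bytes)
    {
        cl_int err;

        /* blocking map: returns a host pointer into the pinned allocation */
        void *ptr = clEnqueueMapBuffer(queue, pinned_buf, CL_TRUE,
                                       CL_MAP_READ | CL_MAP_WRITE,
                                       0, bytes, 0, NULL, NULL, &err);

        /* ... read or fill the data through ptr ... */

        /* every map needs a matching unmap, otherwise the mapping (and
           whatever bookkeeping the driver keeps for it) accumulates */
        clEnqueueUnmapMemObject(queue, pinned_buf, ptr, 0, NULL, NULL);
    }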

Dave.

By the way, this guy appears to be having the exact same problem as me.

Dave

Yes, blocking reads and blocking writes are equivalent to calling clFinish. So your queue growing is definitely not the problem, but there does appear to be a memory leak somewhere, either in your code or in NVIDIA’s driver. Since your stuff works on other platforms, I’d suspect the NVIDIA driver like you do. I’ve seen leaks in it before, for the record.

The best way to debug this is to get on an NVIDIA-equipped machine where you can see what is going on and make lots of attempts. A batch environment with no task manager and a limited number of runs per unit of time complicates the debugging.

To find the leak, try to get a handle on how much memory is leaking per <something>, and identify the <something>. Is it (for example) 48 bytes per clEnqueueNDRangeKernel, or 32 bytes per clEnqueueReadBuffer, or what?
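If the cluster nodes run Linux (an assumption on my part), even something crude like this gives you a leak rate: sample the process’s resident set size every N kernel launches and divide the growth by N.

    #include <stdio.h>

    /* Crude leak-rate probe, assuming a Linux host: returns the process's
       resident set size (VmRSS) in kB from /proc/self/status, or -1 on error.
       Call it every N iterations and divide the growth by N. */
    static long vm_rss_kb(void)
    {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];
        long kb = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "VmRSS: %ld", &kb) == 1)
                break;
        }
        fclose(f);
        return kb;
    }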

Purposely double or triple up on one OpenCL call at a time to see which one causes memory to be used up (for example, upload the buffer twice – does it get killed in half as many iterations? If not, run the kernel twice and see if it gets killed in half as many iterations, etc.)

If the driver is somehow leaking an event even when you pass NULL for the return event, try passing the address of a cl_event and then calling clReleaseEvent on it (to work around this possible driver bug).
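In other words, something along these lines around each enqueue (a sketch only; the queue, kernel and size variables are placeholders from your own code):

    /* Workaround sketch: request the event even though it isn't needed,
       then release it straight away so the driver can't keep a reference
       on our behalf. clReleaseEvent only drops our reference; the runtime
       keeps the event alive internally until the command completes. */
    cl_event evt = NULL;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                        0, NULL, &evt);
    if (err == CL_SUCCESS && evt != NULL)
        clReleaseEvent(evt);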

Good luck.

Thanks for the follow-up Dithermaster. Sorry I didn’t get an email alerting me to the reply so I’m only checking back now.

Your advice is all perfect; unfortunately at present I don’t have much access to the machines in question. I’m meeting the guys who run the computing centre next week, so maybe I’ll get somewhere after that. I’ll keep people informed here if I find a solution. My take on it is the same as yours: it’s a memory leak in the Nvidia API, most likely tied directly to the number of calls to enqueueNDRangeKernel(). I’ve managed to exclude a hidden cl_event queue as the source of the problem by explicitly requesting the event handles and releasing them, and I still see the same problem. In some other threads on the Nvidia fora, people with a similar problem found a partial workaround by disabling some thermal monitors on the cards, but even then their calculations crash randomly.

As I said, I’ll post back if I find a solution. But it does seem to be the API. For now, I have to switch my code to running on multi-CPU machines using the AMD API. It’s slower, but it works.

Dave.