Non-blocking Write Always Completes Without Error

Hi

This is my first post in this forum. Hence, if I chose the wrong category, please be so kind as to move this post to where it belongs.

I am playing around with the non-blocking use of the various enqueue calls. I must say I really like the way the OpenCL Wait Objects are set up.

I came across one issue, though, that I couldn’t explain. I don’t think it makes sense to show my code here because it uses a self-made, Cloo-like .NET wrapper around the OpenCL functions.

I logically do the following:
- Create a context for one device
- Create a command queue
- In an endless loop (sketched below), repeat the following until either OpenCL returns an error (synchronously) or I run out of host memory:
  - Allocate a large chunk of memory on the host
  - Create a new buffer of the same size
  - Enqueue a non-blocking copy from the host memory to the buffer (clEnqueueWriteBuffer)
  - Put the host memory, the buffer and the returned event into a list so I can still reference them afterwards
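Expressed against the plain C API (my wrapper just forwards to these calls), one iteration looks roughly like the sketch below. All names, the chunk size and the iteration cap are placeholders, not my actual code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CHUNK_SIZE ((size_t)256 * 1024 * 1024)  /* arbitrary size for illustration */
#define MAX_CHUNKS 64                           /* cap so the example terminates */

static void fill_device(cl_context ctx, cl_command_queue queue)
{
    void    *host_ptrs[MAX_CHUNKS];
    cl_mem   buffers[MAX_CHUNKS];
    cl_event events[MAX_CHUNKS];

    for (int i = 0; i < MAX_CHUNKS; ++i) {
        cl_int err;

        /* 1. Allocate a large chunk of memory on the host. */
        host_ptrs[i] = malloc(CHUNK_SIZE);
        if (host_ptrs[i] == NULL) {
            printf("out of host memory after %d iterations\n", i);
            return;
        }

        /* 2. Create a buffer of the same size. */
        buffers[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK_SIZE, NULL, &err);
        if (err != CL_SUCCESS) {
            printf("clCreateBuffer failed with %d after %d iterations\n", err, i);
            return;
        }

        /* 3. Enqueue a non-blocking write and keep the event for later inspection. */
        err = clEnqueueWriteBuffer(queue, buffers[i], CL_FALSE, 0, CHUNK_SIZE,
                                   host_ptrs[i], 0, NULL, &events[i]);
        if (err != CL_SUCCESS) {
            printf("clEnqueueWriteBuffer failed with %d after %d iterations\n", err, i);
            return;
        }
    }
}
```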

When I execute this, I always run out of host memory first. It seems I can move much more data to the device (a GPU) than it actually has memory to store it.

Up to this point, everything is fine. This was expected: these are non-blocking calls, so they won’t fail immediately and may simply wait until enough space is available. And now comes the issue: when I check the execution status (clGetEventInfo), all of the events report CL_COMPLETE.
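The check itself is roughly the following (again just a sketch; `events` and `num_events` are the placeholders from the loop above):

```c
/* Sketch: query the execution status of each stored event. A non-negative
 * value is CL_QUEUED, CL_SUBMITTED, CL_RUNNING or CL_COMPLETE; a negative
 * value is the error code of the command that failed. */
for (int i = 0; i < num_events; ++i) {
    cl_int status;
    clGetEventInfo(events[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
    if (status < 0)
        printf("event %d failed with error %d\n", i, status);
    else
        printf("event %d status: %d (CL_COMPLETE is %d)\n", i, status, CL_COMPLETE);
}
```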

From the spec I would have expected the execution status of the later events to be one of two options:
- CL_QUEUED, to indicate that the operation is queued but cannot run because there is not enough space available on the GPU
- an out-of-memory error code, to indicate that there is not enough space available on the device
Is what I observe the correct behaviour? Where does the overflowing data live, then?

Any explanation is highly appreciated.

Wow, this is interesting. What implementation of OpenCL are you using (AMD/Nvidia)? What’s your hardware?

A sufficiently clever driver running on hardware that supports mapping host memory into the device’s address space may simply be reusing the pointer you passed to clEnqueueWriteBuffer as device memory. Even though this should be possible (at least in some cases) it would require some trickery with the OS to make it all work.

Have you verified that the data stored in the buffer objects in fact has the same contents as the host memory?
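A check can be as simple as a blocking read-back followed by a memcmp, along these lines (just a sketch; the helper name and its parameters are made up):

```c
#include <string.h>  /* memcmp; plus the usual CL/cl.h, stdlib.h, stdio.h */

/* Hypothetical helper: read one buffer back and compare it against the host
 * memory it was written from. Returns 1 on a match, 0 otherwise. */
static int verify_buffer(cl_command_queue queue, cl_mem buffer,
                         const void *host_ptr, size_t size)
{
    void  *readback = malloc(size);
    cl_int err = clEnqueueReadBuffer(queue, buffer, CL_TRUE /* blocking */,
                                     0, size, readback, 0, NULL, NULL);
    int match = (err == CL_SUCCESS) && (memcmp(readback, host_ptr, size) == 0);
    free(readback);
    return match;
}
```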

When you call clCreateBuffer(), what arguments do you pass as read/write flags? Are they read-only by any chance?

My hat goes off to the folks that wrote that driver if this is all true.

Hey, thanks for your answer!

I am using NVidia’s implementation on my laptop, which features a Quadro FX 880M.

I have not verified data integrity yet; I will do so later today.

What would you have expected: out of memory or CL_QUEUED? Depending on this information I will double-check my wrapper.

I do pass ReadWrite as flags to the buffer creation.

Thanks again.

What would you have expected: out of memory or CL_QUEUED? Depending on this information I will double-check my wrapper.

I would have expected out of memory when clEnqueueWriteBuffer() is called since NVidia’s implementation defers memory allocations until the point where a buffer is used.

Also make sure to pass a pfn_notify function when you call clCreateContext(); some errors can only be returned through that callback.
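In case it helps, installing the callback looks roughly like this (a sketch; `device` is assumed to have been obtained via clGetDeviceIDs):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Called asynchronously by the implementation for errors that have no other
 * way of reaching the application; errinfo is a human-readable description. */
static void CL_CALLBACK context_notify(const char *errinfo,
                                       const void *private_info,
                                       size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL context error: %s\n", errinfo);
}

static cl_context create_context_with_notify(cl_device_id device)
{
    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, context_notify, NULL, &err);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clCreateContext failed with %d\n", err);
    return ctx;
}
```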

Isn’t lazy memory allocation a product of the fact that buffers do not live on a device by default? At least I don’t see any way to specify which device the buffer is intended to live on when creating it.

How does ATI differ here? I only develop on my laptop occasionally; my desktop PC has two powerful ATI cards.

That is a good idea. I’ll try this.

By the way, when switching from non-blocking to blocking write I get an out of memory return code after a predictable number of iterations. I suspect a bug.

Assuming that NVidia really does some trickery and uses the host pointer: I strongly dislike it. OpenCL is a very low level abstraction around compute platforms. Not being able to precisely specify when allocations occur on the device and more importantly when data is being copied will lead to decreased performance in specific situations, and there is no way for the implementation to overcome this without introducing overhead again.

Isn’t lazy memory allocation a product of the fact that buffers do not live on a device by default?

Yes, you got that right :)

At least I don’t see any way to specify which device the buffer is intended to live on when creating it.

Correct.

How does ATI differ here? I only develop on my laptop occasionally; my desktop PC has two powerful ATI cards.

I imagine they do the same, but I haven’t checked to be honest.

By the way, when switching from non-blocking to blocking write I get an out of memory return code after a predictable number of iterations. I suspect a bug.

Interesting. What are these iterations doing? Couldn’t this error simply be a consequence of lazy memory allocation? After all, if the command is non-blocking and there’s not enough memory, the driver could simply wait a bit and try later. This sounds much more feasible than being able to reuse host memory, particularly since you mentioned your buffers are read/write.

Assuming that NVidia really does some trickery and uses the host pointer: I strongly dislike it. OpenCL is a very low level abstraction around compute platforms. Not being able to precisely specify when allocations occur on the device and more importantly when data is being copied will lead to decreased performance in specific situations, and there is no way for the implementation to overcome this without introducing overhead again.

OpenCL is not that low level, whether for good or for bad. Specifically, the driver handles all necessary memory transfers between devices and between host and devices transparently to the application. Some have argued that buffers should be explicitly associated with specific devices and applications should be required to explicitly transfer buffer ownership from one device to another. I don’t remember why the committee chose the alternative we have today – I’m sure there are compelling reasons for it. In any case, at this point we have to work with what we have.

Please refer to my original post. One iteration consists of allocating memory on the host, creating a buffer, and moving data from the allocated host memory to the created buffer.

For the blocking call, sure. The error is totally expected. In the case of the non-blocking call, however, there is no error. Each of the event objects’ statuses is set to CL_COMPLETE; not a single one of them fails. And that very fact is what concerns me.

Exactly. Queue the operation and only carry it out when it’s possible, that is, when there’s enough memory available to actually do it.

I have no idea how using the host pointer actually works, so I try to avoid it in my work.

Even though OpenCL tries to be on a higher level, it certainly is not. Just look at stuff like host memory pointers, mappings and the like.

In my opinion, OpenCL combines the disadvantages of both worlds here: Memory spaces are explicit, while data movement is somehow hidden and up to the driver and therefore too transparent.

Hey David

I implemented the context notification callback in my wrapper now and it gets called: “CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_WRITE_BUFFER on Quadro FX 880M (Device 0).”

At least there is some sort of notification now. But there is no way to actually relate the message to a specific operation/wait object.

I then tried executing the very same code on my desktop machine (2x ATI HD 6970) and there it behaves more as expected: clEnqueueWriteBuffer synchronously returns out of memory after the expected number of iterations. I would have preferred the call to succeed and the wait handle’s execution status to contain the error, but it’s still much better than what NVidia seems to be doing.

On a further note, I realized that my laptop GPU only supports OpenCL 1.0. Maybe what I observed only holds for their 1.0 implementation. I don’t have an NVidia GPU capable of 1.1 lying around; otherwise I’d have tested and possibly reported the issue.

There is one remaining issue, though. I can now correctly allocate and write to 4 buffers of 256 MB each on one device. But the last write takes tremendously longer to finish. Using the built-in profiling facility, the difference between finish and submit time is nearly constant at around 0.25 seconds for the first 3 calls, and then increases to 1.5 seconds for the fourth write operation. It then fails when trying to allocate and write to a fifth buffer, which is expected.

Do you have any explanation for the decrease in performance for the last successful operation?
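For reference, this is roughly how I read the timings (a sketch; the queue was created with CL_QUEUE_PROFILING_ENABLE and `event` is the one returned by the corresponding clEnqueueWriteBuffer call):

```c
/* Sketch: submit-to-finish time of one write, in seconds. Both counters are
 * reported by the runtime in nanoseconds. */
cl_ulong submit = 0, end = 0;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
                        sizeof(submit), &submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("submit -> finish: %.3f s\n", (double)(end - submit) * 1e-9);
```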

I can now correctly allocate and write to 4 buffers of 256 MB each on one device. But the last write takes tremendously longer to finish. Using the built-in profiling facility, the difference between finish and submit time is nearly constant at around 0.25 seconds for the first 3 calls, and then increases to 1.5 seconds for the fourth write operation. It then fails when trying to allocate and write to a fifth buffer, which is expected.

Your GPU has 1GB of graphics memory, right? 4x256MB = 1GB. These buffers would consume all of the physically available memory. It looks like the driver is doing all it can to fit them in there, including swapping out whatever other internal buffers they have. I’m actually surprised that the fourth allocation succeeded, even if it took a bit longer.

It has 2GB of memory, of which OpenCL exposes 1GB as global memory.

I am programming a part of a compiler that is responsible for offloading some computations to the GPU when an operation is expected to run faster there. I therefore depend on predictions of how long data movement will take.

Is there any way to check how much space is left on a GPU? Or tell OpenCL not to offload memory already allocated for other purposes?

It might be helpful to think of device memory as a cache of host memory, instead of a separate memory space. Whenever the OpenCL runtime executes a command on the device, the runtime and operating system are responsible for making sure the operands of the command are available in the device memory before they are needed by the command.

In the above loop, each buffer would not need to be available in device memory until the copy command actually executes. There are no commands in the loop which cause the main thread to wait for the device or the runtime (i.e. flush, a blocking command, wait, or finish), so it is possible that the copy commands are never submitted and the buffers are never made resident on the device.

In order to make forward progress in the application, eventually the commands should be submitted to the device, e.g. by a blocking command or a wait. Since the copy command only takes two arguments, only two buffers would need to be available on the device at a time.
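As a rough illustration, using the placeholder names from the loop sketch earlier in the thread, forcing the queued writes to be submitted and executed could look like this:

```c
/* Sketch: make sure the enqueued writes are actually submitted and finished. */
clFlush(queue);                          /* hand all queued commands to the device */
clWaitForEvents(num_events, events);     /* block until every write has completed */
/* clFinish(queue) would achieve both steps in one call. */
```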

Is there any way to check how much space is left on a GPU? Or tell OpenCL not to offload memory already allocated for other purposes?

I’m afraid not and I don’t think such a thing would be useful in reality. Multiple processes could be running OpenCL simultaneously and as soon as you query how much physical memory is free another process could consume all of it.
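What you can query portably are static limits rather than the current occupancy, for example (a sketch; `device` is assumed to come from clGetDeviceIDs):

```c
/* Sketch: the memory-related limits core OpenCL exposes. Neither value tells
 * you how much device memory is currently free. */
cl_ulong global_mem = 0, max_alloc = 0;
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem), &global_mem, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
printf("global memory: %llu MB, max single allocation: %llu MB\n",
       (unsigned long long)(global_mem >> 20),
       (unsigned long long)(max_alloc >> 20));
```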

This was not about the loop. And the loop was only a test to see how well OpenCL behaves.

Of course, any application needs to make forward progress. But when moving a buffer to the GPU causes so much overhead in this specific situation, it would be good to know this up front and circumvent it altogether. Assume that some element-wise operation is faster on the GPU because the data set is simply huge and there are many more cores on the GPU. Then it makes sense to run it on the GPU because it’s simply faster. But if it cannot be detected that the device would incur a lot of overhead from swapping other memory out, then this is a severe danger for throughput.

I do consider this useful. Plus, I will know that my application will be the only one that consumes a significant amount of GPU memory. The only thing I’d need to know is how to determine how much memory is used by, e.g., the display driver.