Device affinity for command queues and buffers seems at odds

Hi All,

To create a cl_mem object, one calls clCreateBuffer(), which takes a cl_context as an argument. I assume that this means that the cl_mem object has affinity with the cl_context used to create it and that it is an error to use it in any other context (the standard does not seem to state this explicitly). Since the cl_context was created with a set of cl_device_ids, I assume that it is valid to use cl_mem object with any of the devices used to create the cl_context that was passed to clCreateBuffer().

To read data out of a cl_mem object, one uses the clEnqueueReadBuffer() method, which takes a cl_command_queue as an argument. A cl_command_queue is created for a specific cl_device. It seems very strange that I need to specify a device when reading from a cl_mem object, as it does not have device affinity.

This certainly lacks symmetry with creating a buffer with the CL_MEM_COPY_HOST_PTR flag, as no device is passed to the clCreateBuffer() method. I’ve seen it said in other posts that the following are equivalent:


cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, ptr, 0 );


cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, 0, 0 );
clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, size, ptr, 0, 0, 0 );

However, there is one distinction - the second case requires you to nominate a device (needed to create the queue) while the first does not!

Can someone clarify what is going on here? I am working within a context that has multiple devices and want to read data out of a cl_mem object using clEnqueueReadBuffer() - what device should the cl_command_queue that I use be associated with? Does it not matter?

Thanks in advance,

Dan

CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it’s raised. For example, for the clEnqueueReadBuffer command:

Memory objects are often cached on a device.

For me there are several differences between only creating the buffer (with copy) and create + enqueue write:

  • the first is synchronous; the second can be asynchronous;
  • the second permits the driver to cache memory sooner on the right device; with the first example, the caching is more likely to happen only when the clEnqueueNDRangeKernel command executes.

So the second method gives the developer more liberty to optimize caching time, and permits doing something else while the write is happening.
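For example, something along these lines (a rough sketch; variable names are assumed) lets the host keep working while the write happens:

cl_event write_done;
cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, 0, 0 );
clEnqueueWriteBuffer( queue, buf, CL_FALSE, 0, size, ptr, 0, 0, &write_done ); /* non-blocking */
/* ... do other host work here, but don't touch ptr yet ... */
clWaitForEvents( 1, &write_done ); /* transfer complete; ptr may be reused */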

CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it’s raised. For example, for the clEnqueueReadBuffer command:


My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn’t seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity). And yet, to copy data to/from the buffer I need to talk about a specific device.

I’m sure that in practice you generally queue up commands on a given device following a pattern along the lines of:

  1. Copy from host
  2. Execute kernel
  3. Copy to host

And it just works out nicely. However, I still find it very strange that you cannot copy data to the host without queuing a command for a particular device.
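In code, the pattern on a single queue looks something like this (a sketch with assumed names; error checking omitted):

clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, size, input, 0, 0, 0 );     /* 1. copy from host */
clEnqueueNDRangeKernel( queue, kernel, 1, 0, &global_size, 0, 0, 0, 0 );  /* 2. execute kernel */
clEnqueueReadBuffer( queue, buf, CL_TRUE, 0, size, output, 0, 0, 0 );     /* 3. copy to host */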

My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn’t seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity)

See the glossary on page 14:

Context: The environment within which the kernels execute and the domain in which
synchronization and memory management is defined.

See also Appendix A:

OpenCL memory objects, program objects and kernel objects are created using a context and can
be shared across multiple command-queues created using the same context. Event objects can be
created when a command is queued to a command-queue. These event objects can be shared
across multiple command-queues created using the same context.

Fine, but this seems irrelevant. Consider the concept of thread-local storage - the allocations still happen in the context of the process’s heap, but the memory has thread affinity.

To say “can be shared” is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?

I’m not impressed that you had to quote from the glossary and appendix, rather than the standard proper, to try to answer my question. It seems that detail is being buried in the wrong places.

Cheers,

Dan

I understand your frustration.

To say “can be shared” is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

That is defined in Appendix A. The quote I provided is only an excerpt.

Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?

Some device has to perform the data copy. OpenCL allows the application to choose any of the devices in the context to do the operation. Arguably this is better than leaving it up to the driver to decide which of the devices to use.

I’m not impressed that you had to quote from the glossary and appendix, rather than the standard proper

While the glossary is not normative, the appendix is.

It is neither defined nor discussed in any great detail in Appendix A. Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, “The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined.”

I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.

I can imagine that providing a target device for a copy from host to device is useful as a hint as to which device is going to use the buffer, so preemptive caching in the device’s physical memory may occur.

If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn’t right with this API.

Cheers,

Dan

Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, “The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined.”

That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble. The way to avoid any problems is by establishing dependencies between commands appropriately and by using clFlush() when there are dependencies across command queues.

As long as you use dependencies correctly, sharing resources between different command queues inside the same context is straightforward. I suggest searching for the term “synchronization point” in the spec.
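As a sketch (queue and kernel names assumed), a cross-queue dependency within one context looks like this:

cl_event produced;
clEnqueueNDRangeKernel( queueA, producerKernel, 1, 0, &global, 0, 0, 0, &produced );
clFlush( queueA );  /* ensure the command is actually submitted to device A */
clEnqueueNDRangeKernel( queueB, consumerKernel, 1, 0, &global, 0, 1, &produced, 0 );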

I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.

That will depend on each particular implementation. Remember that OpenCL serves a very wide range of computing devices.

If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn’t right with this API.

Any standard API will be some sort of compromise of the alternatives suggested by multiple people from different companies. It is not possible to design an API or a language that will satisfy everybody.

Generally speaking, mapping memory objects into the host’s address space and writing the data directly into the given pointer instead of copying it around will give better performance than using CL_MEM_COPY_HOST_PTR. This is only a general rule. YMMV.
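A sketch of that approach (queue and host data names assumed; includes and error checking omitted):

cl_mem buf = clCreateBuffer( queue_context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, size, 0, 0 );
void *p = clEnqueueMapBuffer( queue, buf, CL_TRUE, CL_MAP_WRITE, 0, size, 0, 0, 0, 0 );
memcpy( p, host_data, size );  /* write directly into the mapped region */
clEnqueueUnmapMemObject( queue, buf, p, 0, 0, 0 );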

Should be expected? Why is that? The spec could make this as tight as it likes.

The OpenCL spec gives a fair amount of detail on memory fences and barriers so it is well defined what happens when a memory object is concurrently accessed and mutated by multiple compute units. However, they decide to stop there and just leave cross-command queue synchronisation very loose. The best you can do for synchronisation across command queues is to stop and wait just in case. That being said, this has nothing to do with my original query.

Agreed, this is indeed straightforward.

I’m not looking for satisfaction, merely an explanation of why it is like it is. Who knows, my tirade on the non-symmetry in the API regarding reading and writing memory objects may lead to changes in the spec, or I might just be rehashing an age-old argument, or there might be a very good reason for why it is like it is.

I agree that design-by-committee is a less than ideal way to work, but every decision made should be justified and I would hope that those making the decisions are involved or represented in this community and would be willing to share such justifications here.

Cheers,

Dan

That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble.

Should be expected? Why is that?

It should be expected because the same requirement exists when you modify resources within a single command queue.

The OpenCL spec gives a fair amount of detail on memory fences and barriers so it is well defined what happens when a memory object is concurrently accessed and mutated by multiple compute units. However, they decide to stop there and just leave cross-command queue synchronisation very loose.

The only difference between inter-queue synchronization and intra-queue synchronization is that when you switch between queues you must use clFlush()/clFinish() to make sure that all commands are flushed. I agree that the spec could explain this in more detail.

I’m not looking for satisfaction, merely an explanation of why it is like it is.

To be honest I don’t remember this particular topic being discussed in the meetings; I don’t think anybody saw this asymmetry as a big problem. Leaving aesthetics aside, do you see this being a real-world performance bottleneck? If performance is a concern I would suggest first looking at redesigning the application to avoid using CL_MEM_COPY_HOST_PTR and using clEnqueueMapBuffer() instead.


What I was doing was wrapping OpenCL to some degree to fit the way I want to use it. When writing my wrapper of the buffer, I first wrote various constructors to match the methods of creation supported by OpenCL. Next I tried to write a method to retrieve buffer contents so I could do some simple testing. I was surprised to find that I could set the contents of the buffer without my class being coupled to a command queue (and, transitively, to a specific device), but could not read the contents. Further reading and reasoning just led to confusion. For example, with the only cross-device synchronisation available being mutual exclusion, it is difficult to think of a buffer as “shared” across devices, except in the case of a read-only buffer.

I am not currently using CL_MEM_COPY_HOST_PTR in my application and do not know if it represents a performance bottleneck (though I have some anecdotal evidence that CL_MEM_COPY_HOST_PTR is slow on NVidia hardware). I would, however, be in favor of removing CL_MEM_COPY_HOST_PTR from the spec and providing a utility method to create a buffer and perform a blocking copy from host memory instead. I think OpenCL needs something akin to GLU for OpenGL.

Personally, I feel that the real problem is with the command queue. What is the reasoning for a command queue being bound to a single device? I think that a command queue that is bound to the context (and hence, can contain commands that are bound to different devices) is much more flexible, simpler to use and would probably address my original issue as commands to copy memory to/from the host would not need to nominate a device.

Cheers,

Dan

For example, with the only cross-device synchronisation available being mutual exclusion, it is difficult to think of a buffer as “shared” across devices, except in the case of a read-only buffer.

True, but you have the same problem trying to share a buffer between multiple NDRange kernel executions inside the same device. I.e., you can’t. Different kernels must work on the same buffer with mutual exclusion. It would be very odd if multiple devices were allowed to modify a buffer simultaneously when multiple kernels on the same device are not allowed to do the same.

What is the reasoning for a command queue being bound to a single device?

Because it maps very well onto how GPUs work, and OpenCL is a low-level API. GPUs are asynchronous devices: there’s a command FIFO between the host CPU and the GPU. The host puts commands in it and the GPU fetches them. That’s basically what a CL command queue represents. For CPUs the story is different, but it’s preferable to have the same abstraction for both CPUs and GPUs.

Having a single queue sending commands to multiple devices would require the driver to do some kind of dynamic load balancing, which is nontrivial. In addition, multiple devices in the same context may have different capabilities, such as support for images, and kernels that run on one device may not run on another.

OpenCL is close to the metal and, yes, that makes it a bit hard to use at times. The benefit is that it takes a lot of guesswork away from the driver and puts most of the control in the hands of the application. It will take some time for third party developers to create higher-level languages and abstractions on top of OpenCL that will make it easier to write portable and reasonably performant code.

Finally, in case there’s any doubt: I am not speaking on behalf of the OpenCL working group. I don’t even participate in the meetings nowadays.

I hope that you do not think that I am advocating change at this point, rather I am merely pointing out things that have either helped or hindered my understanding of the OpenCL API.

One question on what you’ve stated. Let’s say that I have an NVidia Fermi card which, to my understanding, can concurrently execute multiple kernels on the one device. Are you suggesting that memory fences using CLK_GLOBAL_MEM_FENCE are not guaranteed to order reads and/or writes across the concurrently executing kernels?

Sounds like a leaky abstraction to me!


I had not assumed that OpenCL would do any load balancing. Specifying which device a kernel would execute on would be part of the function call to enqueue a kernel (same goes for any other commands that require the programmer to nominate a device). Given that, wouldn’t you agree that a command queue should be device agnostic? If a device-specific command queue is indeed a useful concept, then this could be written as a layer on top of the device-agnostic command queue. Again, a utility library akin to GLU would be useful here.

Cheers,

Dan

Are you suggesting that memory fences using CLK_GLOBAL_MEM_FENCE are not guaranteed to order reads and/or writes across the concurrently executing kernels?

I cannot answer questions regarding hardware from other vendors. If the question is about standard OpenCL in general, the answer is: good question! I don’t see any language in the spec talking explicitly about this scenario. Instead of guessing I’ll try to bring this up with the group, but be aware that it will take time to get an answer.

Given that, wouldn’t you agree that a command queue should be device agnostic?

I have trouble following you. What is a device-agnostic queue? One that is not bound to a particular device? Then, when you enqueue a command in this queue, what device runs it? Who chooses that device? How is explicitly selecting a device to run the command any different from having to explicitly select a queue, given that each queue is only associated with a single device?

Now I realize I had misread section 3.3.1. I think it answers your question:

OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.

Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

Memory consistency for memory objects shared between enqueued commands is enforced at a
synchronization point.

I will try to get some extra assurance from the group anyway since the text above only deals with execution barriers and not explicitly with memory fences.

Yes, by device agnostic I mean a queue that is not bound to one (and only one) device. I could have gone the other way and called it promiscuous as you could say that it services many devices.

I don’t think the idea that a “device runs a queue” is right. Does the spec talk about who “pumps the queue”? I had expected that the OpenCL runtime, possibly executing on the host, actually enqueues, dequeues and executes commands. It just so happens that some commands execute code on the OpenCL devices’ compute units. Perhaps this view of the world is completely incorrect? What I mean is that I would not expect the compute units on the device to be running code to pump the queue, yet OpenCL defines a Device to be nothing more than a collection of compute units, so what does it mean for a device to “run a queue”?

It should be noted that some commands have nothing to do with the OpenCL device. For example, clEnqueueWaitForEvents() is an interesting animal. All it does is wait for a list of events to be set before it sets its own event. This requires no interaction with the device associated with the queue in which the “wait for events” command is enqueued. In fact, the list of events that it waits on could be events of commands enqueued for other devices.

If the command queue were not bound to a device, then the signature of some API functions, such as clEnqueueNDRangeKernel(), would have to change to include the target device to execute the kernel on, as this is no longer implied by the queue. Relating this back to the original question - I wonder if the commands to copy memory would require a device to be specified when being enqueued…
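To illustrate (purely hypothetical - this signature exists in no spec), such a call might look like:

cl_int clEnqueueNDRangeKernel( cl_command_queue queue,  /* hypothetical context-wide queue */
                               cl_device_id     device, /* explicit target, no longer implied */
                               cl_kernel        kernel,
                               cl_uint          work_dim,
                               const size_t    *global_work_offset,
                               const size_t    *global_work_size,
                               const size_t    *local_work_size,
                               cl_uint          num_events_in_wait_list,
                               const cl_event  *event_wait_list,
                               cl_event        *event );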

I think I have a very different view of what the command queue represents. I feel that the command and events concept is very good, not just as a synchronisation mechanism, but rather, to capture dependencies between commands. I see enqueued OpenCL commands as forming a graph (well, strictly a DAG) through these dependencies and the OpenCL runtime is able to execute only those commands that have all of their dependencies satisfied - that is, all of the events that it waits on have been set. I find the declarative nature of merely stating dependencies to be a better approach than thinking procedurally as you must do with an in-order queue.

I see no need for a “queue” concept at this fundamental level. There are just commands and dependencies. However, an in-order queue can be implemented trivially by making a linear list of commands, irrespective of their real dependencies. Other queues, such as a “priority queue”, could also be layered on top of the DAG. The in-order and priority queues could be implemented in a utility library rather than in the OpenCL core API.
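For what it’s worth, the existing API can already express such a DAG within a single device via an out-of-order queue (a sketch; kernel and size names are assumed):

cl_int err;
cl_command_queue q = clCreateCommandQueue( context, device,
                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err );
cl_event deps[2];
clEnqueueNDRangeKernel( q, kernelA, 1, 0, &global, 0, 0, 0, &deps[0] );
clEnqueueNDRangeKernel( q, kernelB, 1, 0, &global, 0, 0, 0, &deps[1] );
/* kernelC may run only after both A and B; A and B may run in any order */
clEnqueueNDRangeKernel( q, kernelC, 1, 0, &global, 0, 2, deps, 0 );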

On a side note, the whole idea of out-of-order execution falls apart (at least on the current NVidia OpenCL implementation on Windows) because clEnqueueNDRangeKernel() is a blocking call. This makes it impossible for a single host thread to enqueue kernels for concurrent execution on multiple devices! I have resorted to having a host thread per device and my own work queues in the host to work around this. I would encourage the powers that be to change the OpenCL spec to state that clEnqueueNDRangeKernel() must not block (or add a “blocking” flag as per the memory transfer functions).

This discussion has been very interesting and has made me go back to the OpenCL spec a number of times, and each time I seem to be getting more and more confused about the details. I guess some examples (just for reference - no need to comment unless you feel compelled to) include:

  • In 3.2.1 it is stated that, “The command-queue schedules commands for execution on a device.” But earlier it defined the command queue as a data structure, which is passive - how can the command queue schedule anything? This leads to the question, who pumps the command queue?
  • Still in 3.2.1 it is stated that, “These (commands) execute asynchronously between the host and the device.” Yet clEnqueueNDRangeKernel() blocks on the NVidia platform and I have read in other threads that this behavior is not considered to be an error with respect to the spec.
  • Again in 3.2.1 it is stated that, “It is possible to associate multiple queues with a single context. These queues run concurrently and independently with no explicit mechanisms within OpenCL to synchronize between them.” Yet A.1 states that event objects can be shared across command queues - isn’t this an “explicit mechanism” to synchronise between command queues?
On that last point, where the spec states, “It is possible to associate multiple queues with a single context”, you get the feeling that the spec is leading you to not create multiple command queues; however, you must use multiple command queues if you have multiple devices in the context!

Cheers,

Dan

I don’t think the idea that a “device runs a queue” is right.

I should have said “what device runs the commands in the queue” (my bad).

Re. clEnqueueWaitForEvents() and related examples where the command doesn’t really have any tie to a particular device, I think I agree with you that those APIs should probably apply to context objects instead of queue objects (or device-agnostic queues if you prefer).

Your reflection about the DAG of dependencies is spot-on :)

I am extremely surprised to hear that clEnqueueNDRangeKernel() is blocking on some implementations. I doubt that it will stay that way for long. The spec already says that certain calls are non-blocking, but there’s no way to enforce it, so in practice it’s like the spec didn’t say anything. Market forces hopefully will take care of the problem.

In summary I think you make some very good points. I wish the group had your input two years ago.

I wasn’t trying to nitpick - your language actually aligns quite well with what the spec says. In this regard, I think that the spec is not specific or detailed (or maybe consistent?) enough to form an understanding of the OpenCL design without having been involved in its inception.

Excellent - if the powers that be agree, then perhaps we can push OpenCL in that direction for future releases.

I was extremely surprised by it too, and a little caught out; I had quite a clear picture of how I was going to use OpenCL (I am undertaking my first OpenCL project currently) but this threw a spanner in my works. I’m just about to post in another thread about this, so tune in if you’re interested.

Thanks David - that’s awfully flattering. I’d be happy to contribute more formally than just venting in these forums, so perhaps pull me in if/when appropriate.

Cheers,

Dan

Excellent - if the powers that be agree, then perhaps we can push OpenCL in that direction for future releases.

Sorry, but you lost me here. What do you mean by that? The dependencies already form a DAG and if your device supports out-of-order execution it may take advantage of the structure of the DAG to improve performance.

Thanks David - that’s awfully flattering. I’d be happy to contribute more formally than just venting in these forums, so perhaps pull me in if/when appropriate.

Unfortunately that’s not in my hands. You need to be a member of Khronos in order to contribute to the discussions for future specs. See Khronos Membership - The Khronos Group Inc. Look in particular at the academic membership if you think you might apply.