memory allocation / deallocation / copying timing questions

Hi,

I am trying to figure out memory interaction issues (allocation, lifetime, copy operations, etc.) between host and devices when using the C++ API (whens, hows, guarantees, no-gos, etc.). So assume there is a context, device etc. (all initialization stuff omitted):

cl::Context someContext;
cl::Device someDevice;
cl::CommandQueue someQueue;
cl::Kernel someKernel;

So, #1, I create a buffer object and initialize it with memory from the host application:
cl::Buffer buf(someContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, …);

Next, #2, I define this memory to be a kernel argument:
someKernel.setArg(0, buf);

And #3 I invoke the kernel:
someQueue.enqueueNDRangeKernel(someKernel, …);

First question:
From which stage on (assume each command has been waited on to fully complete) i) may memory be allocated on the device side, ii) is memory guaranteed to have been allocated on the device side, iii) is the data in the buffer guaranteed to have been copied?
The only place where a command queue / device comes into play is #3. So I assume at this stage? Or is this unspecified?

Second question: does cl::CommandQueue::enqueueUnmapMemObject guarantee that memory is released? For example, will
someQueue.enqueueUnmapMemObject(buf, …)
[and waiting for that to have finished] ensure that the device has freed the memory again?
If I don’t do anything explicitly, when will the device release the memory automatically? Upon destructor invocation of the Buffer?

Third question: up to which stage may I still write new values to the buffer so that the kernel is guaranteed to see them? For example, presume I do this:

cl::Buffer buf(someContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, …);
someKernel.setArg(0, buf);
someQueue.enqueueWriteBuffer(buf, …);
someQueue.enqueueNDRangeKernel(someKernel, …);

Are there any guarantees by the OpenCL standard what memory content the kernel will see?

Fourth question: presume that for some kernels there is a mix of program-wide constant data, i.e. data that never changes during the program lifetime, and “dynamic” data that does change between kernel invocations. Other kernels (invoked in between) have different layouts. Ideally I want to copy the constant data for the given kernels only once to the device and then keep it there permanently. What do I have to do to ensure that, if it’s possible at all? Intuitively I would say that keeping the buffer valid over the whole program lifetime and never explicitly releasing the memory should make the device copy the data only once upon the first kernel invocation and then reuse it on further invocations. But I really don’t know. If there is no general guarantee, what’s the “best-chance” approach, practically speaking?

thanks for your help!

To answer the first question:
i) Memory can be allocated on the device once you have set up someContext. It is allocated by the code you give in #1.
ii) I don’t see anything in the specification that says the memory will definitely be allocated on the device immediately, but I don’t see how an OpenCL implementation could keep track of the available device memory without allocating the memory in the process of creating a buffer.
iii) The 1.0 spec states that the host memory will be copied to a new allocation in host memory, and that this copy can be used as a cache for transfers to/from the device. There is no guarantee of when the data will be copied to the device, AFAIK, except that it will be there when a kernel needs it.

For the second question: mapping and unmapping a buffer only gives you a pointer to host memory from which data will be copied to the device upon unmapping. You are mapping the buffer into host memory, modifying it there, then unmapping to send the data back to the device. This does not release the device memory managed by buf.
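To make the map / modify / unmap cycle concrete, here is a minimal sketch using the C++ wrapper. It assumes a queue and buffer that already exist; the function name and data layout are illustrative, not from the original post:

```cpp
// Sketch of the map / modify / unmap cycle described above. Assumes the
// queue and buffer already exist; the function name is illustrative.
#include <CL/cl.hpp>
#include <cstddef>
#include <cstring>

void overwriteBuffer(cl::CommandQueue& someQueue, cl::Buffer& buf,
                     const float* newData, std::size_t bytes) {
    // Blocking map: the pointer is valid as soon as the call returns.
    void* hostPtr = someQueue.enqueueMapBuffer(
        buf, CL_TRUE /* blocking */, CL_MAP_WRITE, 0, bytes);

    std::memcpy(hostPtr, newData, bytes);  // modify in host memory

    // Unmapping hands the data back to the device. It does NOT free the
    // device allocation owned by buf; that happens only on release.
    someQueue.enqueueUnmapMemObject(buf, hostPtr);
    someQueue.finish();
}
```

Note the blocking map: with a non-blocking map you would have to wait on the returned event before touching the pointer.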

Question 3: that workflow is just fine. The kernel argument is a handle to the buffer (essentially the memory object’s address), which stays the same when you write new data into the buffer. So setting the kernel argument before writing data to the buffer works: you can set the argument once and then repeatedly modify the buffer’s data and call the kernel.

Question 4: the buffer will remain on the device until explicitly released. Note that you must keep a reference to the cl::Buffer object alive because, once it goes out of scope, the destructor could call clReleaseMemObject, removing the buffer from device memory. I do exactly this, albeit using JavaCL, but the procedure is basically the same.

Thanks for the reply!

For the first question: at #1 no device has been specified yet. Assume one context is attached to several GPU devices of the same kind, but the kernel finally runs on only a portion of them. Would OpenCL still be allowed to allocate memory on all devices even if they never come to use the buffer? That’s why I had guessed the true allocation takes place at #3, but of course for optimization reasons this might deliberately be left open.
[The practical reason behind this question is that allocating memory on all devices, even those never invoked for the given kernel, might cause an out-of-memory issue!]

For the second question: will cl::Buffer’s destructor actually call clReleaseMemObject? The C++ API says I can specify my own function to be invoked upon destruction, but it doesn’t say what the default is if none is given (clReleaseMemObject would be a sensible default, of course). At least I can’t find anything.

With the constant data, which steps do I have to take so the data are copied only once to the device?

So at program initialization I create a buffer:

cl::Buffer constBuf(someContext, …);

and ensure that this object remains alive for the entire host program lifetime. Fine. At some stage I define this buffer as a kernel argument and invoke the kernel:

someKernel.setArg(0, constBuf);
someKernel.setArg(1, someOtherBuff);
someQueue.enqueueNDRangeKernel(someKernel, …);

Good, now let’s say several other kernels are invoked on the same device, but then I come back to my good old someKernel. Do I still need to specify kernel argument 0, or has it been permanently associated inside someKernel until I explicitly overwrite it with something else? Specifically, do I have to write again:

someKernel.setArg(0, constBuf);
someKernel.setArg(1, someOtherBuff);
someQueue.enqueueNDRangeKernel(someKernel, …);

or can I omit the first line and just reduce it to:

someKernel.setArg(1, someOtherBuff);
someQueue.enqueueNDRangeKernel(someKernel, …);

and rely on argument 0 being the same that I had once specified?

To extend this question, presume that the “dynamic” data is kept in the same memory object (the same cl::Buffer, which has never been destroyed). So the memory object itself has not changed, merely the values in the memory. Do I still have to call the line

someKernel.setArg(1, someOtherBuff);

or can I omit that as well, again relying on argument 1 of the given kernel still being attached to the same memory object?

The reason behind all these questions is efficient memory usage in a memory-intensive application: copy and allocate memory only when it is really needed, to save valuable time and to avoid out-of-memory issues by splitting different work tasks across devices.

thanks!

You should get a CL_MEM_OBJECT_ALLOCATION_FAILURE when creating the buffer if there isn’t enough free memory on one of the devices. Creating a buffer is supposed to allocate memory on all devices in a context, and I would expect that memory to be allocated on the device during the call to clCreateBuffer.
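If the wrapper was compiled with exceptions enabled (__CL_ENABLE_EXCEPTIONS), a failed allocation at buffer creation can be caught like this; the helper name and data are made up for illustration:

```cpp
// Sketch: detecting an out-of-memory condition at buffer creation.
// Assumes the C++ wrapper was compiled with exceptions enabled; the
// helper name is illustrative.
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

cl::Buffer createOrReport(cl::Context& someContext,
                          const std::vector<float>& hostData) {
    try {
        return cl::Buffer(someContext,
                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          hostData.size() * sizeof(float),
                          const_cast<float*>(hostData.data()));
    } catch (const cl::Error& e) {
        // Allocation failed on at least one device in the context.
        if (e.err() == CL_MEM_OBJECT_ALLOCATION_FAILURE)
            std::cerr << "not enough free memory on a device in the context\n";
        throw;
    }
}
```

Without exceptions enabled, the same check can be done through the cl_int error output parameter of the cl::Buffer constructor.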

Consider the following bug: you allocate a number of buffers on your device and all allocations succeed; your kernels use some of those buffers; then you allocate more buffers and use them; finally you use some of the original buffers that hadn’t been touched yet. At the last stage you could get an out-of-memory error if the implementation hadn’t already reserved memory for those buffers. An implementation that defers allocation like that sounds like a debugging nightmare.

You are responsible for allocating only those buffers that you need on the devices that are actually going to use them, unless you are willing to waste memory.

Looking at cl.hpp, it seems that cl::Buffer inherits from cl::Wrapper, whose destructor calls clReleaseMemObject. It appears that this destructor will always be called, in addition to any other callback functions you specify.

As for the constant data, you can omit the line “someKernel.setArg(0, constBuf);”. If you only change the values in someOtherBuff but not the actual cl::Buffer object, then “someKernel.setArg(1, someOtherBuff);” can also be removed. You only need to set a kernel argument that refers to a memory object when the actual memory object (i.e. region in device memory) changes.
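Put together, the “set once, invoke many times” pattern might look like the following sketch; the function and parameter names are assumptions for illustration, and error handling is omitted:

```cpp
// Sketch: set kernel arguments once, then reuse them across invocations.
// Only the buffer *contents* change, so no further setArg calls are
// needed. Function and parameter names are illustrative.
#include <CL/cl.hpp>
#include <cstddef>

void runMany(cl::CommandQueue& someQueue, cl::Kernel& someKernel,
             cl::Buffer& constBuf, cl::Buffer& someOtherBuff,
             const float* frames, std::size_t elemsPerFrame, int numFrames) {
    someKernel.setArg(0, constBuf);      // constant data: set once
    someKernel.setArg(1, someOtherBuff); // dynamic data: also set once

    for (int i = 0; i < numFrames; ++i) {
        // New values, same memory object: the arguments stay valid.
        someQueue.enqueueWriteBuffer(someOtherBuff, CL_TRUE, 0,
                                     elemsPerFrame * sizeof(float),
                                     frames + i * elemsPerFrame);
        someQueue.enqueueNDRangeKernel(someKernel, cl::NullRange,
                                       cl::NDRange(elemsPerFrame),
                                       cl::NullRange);
    }
    someQueue.finish();
}
```

The blocking write keeps the sketch simple; a real version would likely use events or double buffering to overlap transfers with kernel execution.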

OK, so for allocation I assume that all devices attached to a context must have sufficient memory available, even if never invoked for the given kernel.
Sure, I am responsible for allocating memory appropriately. I suppose that means, practically speaking and from a technical perspective, attaching different devices (even of the same kind) to different contexts and deciding per context what memory to allocate and what workload to process.

Let’s turn this into a practical example: say there are 4 GB of total data (all input and output memory combined) to be processed, and two GPUs available with 3 GB each. Assume each ‘output row’ only needs the same row of input for the kernel calculation. The naive approach would be to use one context, attach it to both devices, allocate input and output buffers, set kernel args, and then give the first device the first half and the second device the second half of the data to process (through two calls to enqueueNDRangeKernel, one per device). ‘Naive’ here means it wouldn’t work at all, as I’d get an out-of-memory error when the buffers are created.
The correct approach would be two contexts, each attached to a single device; on each device I separately allocate the part of memory that is required (the first or second half, respectively) through separate buffers, and then invoke the kernel on each device for all data items in that device’s memory. Correct?

As for ~cl::Buffer, I also checked the headers of my implementation, and it does call clReleaseMemObject. However, I’d suggest that the C++ wrapper API should explicitly state this RAII behaviour as the default; after all, it seems to be the intended behaviour, but formally there is no guarantee.

As for constant data and setting kernel arguments, it seems I got it :). So I may assume copying between host and device memory only takes place when a cl::Buffer is created and initialized with some host memory, or when the values are overwritten with a call to CommandQueue::enqueueWriteBuffer (or enqueueCopyBuffer or a related function), i.e. only if I explicitly order a buffer copy operation. Correct?

thanks!

You are correct: you will have to create separate contexts, allocate buffers in each context for half of the input and output data, and finally write the appropriate parts of the input data to each device and read back the appropriate halves of the results.
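A hedged sketch of that two-context layout, with the kernel build and launch elided (all names are illustrative):

```cpp
// Sketch of the two-context split: one context per GPU, so each device
// only ever allocates its own half of the data. Kernel build and launch
// are elided; names are illustrative.
#include <CL/cl.hpp>
#include <cstddef>
#include <vector>

void splitAcrossTwoGpus(const std::vector<cl::Device>& gpus, // exactly two
                        std::vector<float>& input,
                        std::vector<float>& output) {
    const std::size_t half = input.size() / 2;

    for (int d = 0; d < 2; ++d) {
        std::vector<cl::Device> one(1, gpus[d]);
        cl::Context ctx(one);                 // context sees only this device
        cl::CommandQueue queue(ctx, gpus[d]);

        // Each device allocates only its half of input and output.
        cl::Buffer in(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      half * sizeof(float), input.data() + d * half);
        cl::Buffer out(ctx, CL_MEM_WRITE_ONLY, half * sizeof(float));

        // ... build the program for ctx, set kernel args,
        //     enqueueNDRangeKernel on queue ...

        queue.enqueueReadBuffer(out, CL_TRUE, 0, half * sizeof(float),
                                output.data() + d * half);
    }
}
```

For brevity this processes the devices one after the other; a real version would create both contexts up front and enqueue work on both queues so the GPUs run concurrently.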

You are also right about the memory transfers. Glad to hear you are making progress.