How to avoid double allocation on CPU

I am a newbie to OpenCL, using my CPU with OpenCL to process remote sensing images. According to the OpenCL documentation, we need to allocate a host buffer and transfer the data to device memory. But when using OpenCL on a CPU, host and device memory are the same. Is there any way to avoid this double allocation, so that the buffer allocated on the host can be used directly in an OpenCL kernel?

Thanks in advance.

CL_MEM_USE_HOST_PTR is often used in this case: it wraps an application-created memory allocation in a CL mem object.
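
A minimal sketch of what that looks like (variable names are illustrative; whether the driver can actually avoid a copy also depends on the implementation and on the alignment of the host allocation):

cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            nbytes,     /* size of the existing host allocation */
                            host_ptr,   /* pointer the application already owns */
                            &err);
if (err != CL_SUCCESS) { /* handle the error */ }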

I have a similar problem. I am using an Intel CPU with the Intel OpenCL SDK (version 1.5). I am trying to avoid CPU-to-CPU memory copies, since they would be a waste of resources. In my case the data is allocated in a different part of the program, which I don’t want to modify.

What I currently have is found below:

For each input and output array, I create a memory object using CL_MEM_USE_HOST_PTR as follows:

cl_mem device_arrayname = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, arraysize, original_arrayname, NULL);

After the call to ‘clCreateBuffer’ I set up the kernel and run it. Afterwards, the results are not correct unless I make a call to ‘clEnqueueReadBuffer’ as follows:

clEnqueueReadBuffer(queue, device_arrayname, CL_TRUE, 0, arraysize, original_arrayname, 0, NULL, NULL);

Both of these calls seem to take significant time and it appears some actual copying is involved.

Furthermore, I’ve tried using ‘clEnqueueMapBuffer’ and ‘clEnqueueUnmapMemObject’ instead of ‘clEnqueueReadBuffer’, but that doesn’t seem to work (or I’m not calling them in the right place or with the right arguments).

I’ve also read something about alignment of the original memory allocation, but I don’t want to modify that part. Is it possible at all to omit the CPU-to-CPU copy without modifying the original memory allocation scheme?

You have to use either a read buffer or a map/unmap to get the data back. ReadBuffer obviously has to copy the data, since you provide it with a pointer and the API must honour the same semantics that make it work on all devices.

Try using enqueueMapBuffer instead.

Note that the API presents a certain view of the world, but internally even the CPU driver could do things very differently, so you can’t really expect to be able to process in place.

And what do you mean by ‘significant time’? If copying the data twice takes a significant amount of time versus the algorithm execution, then OpenCL simply won’t provide you any benefit and worrying about it is pointless.

Do not base your timing on a single run; the first-run time (which cannot be avoided, as it depends on the operating system) will probably be meaninglessly large.
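
If you want meaningful numbers, do a warm-up run first and then time with OpenCL events - roughly like this (a sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);

/* Warm-up run: absorbs one-time driver/OS costs, result is discarded. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);

/* Timed run. */
cl_event event;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &event);
clWaitForEvents(1, &event);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);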

Thank you for your response.

As I’ve said in the earlier message, I have tried the map/unmap scheme, but it doesn’t seem to work for me. I have one input array and one output array, both of the same size. My complete code now looks like this (using map/unmap):


cl_mem device_input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_input, NULL);
cl_mem device_output = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_output, NULL);
<<< run the kernel >>>
void* pointer = clEnqueueMapBuffer(queue, device_output, CL_TRUE, CL_MAP_READ, size, 0, 0, NULL, NULL, NULL);
clFinish(queue);
clEnqueueUnmapMemObject(queue, device_output, pointer, 0, NULL, NULL);
clReleaseMemObject(device_input);
clReleaseMemObject(device_output);
<<< test the results of 'original_output' >>>

The above code does not seem to work though. It does decrease the total memory copy time by half, but the results are incorrect.

What does work for me is the following code (using ‘clEnqueueReadBuffer’):


cl_mem device_input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_input, NULL);
cl_mem device_output = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_output, NULL);
<<< run the kernel >>>
clEnqueueReadBuffer(queue, device_output, CL_TRUE, 0, size, original_output, 0, NULL, NULL);
clReleaseMemObject(device_input);
clReleaseMemObject(device_output);
<<< test the results of 'original_output' >>>

So what you are saying is that with the map/unmap scheme and ‘CL_MEM_USE_HOST_PTR’ I might be able to omit the memory copies, but it is not certain that it would work for every case / every machine / every OpenCL driver instance. If that is true, though, I should still get correct results with the above code, right? It might be that the allocated memory is not properly aligned to 128-byte boundaries, for example (as Intel states in their documents).

I understand the problem with memory-copy versus kernel execution time. However, what I’m working on is just a toy example, in which I have 16M integers in an input array and the same amount in an output array. My kernel runs with 16M threads, each incrementing the input by 3. What I’m trying to do here is to omit the CPU-to-CPU memory copies, because I believe that should be possible somehow!
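
The kernel itself is essentially just this (an illustrative sketch, not the original source):

__kernel void increment_by_three(__global const int* input,
                                 __global int* output)
{
  size_t i = get_global_id(0);
  output[i] = input[i] + 3;
}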

What I mean by significant time is that it takes more time than the kernel (which also reads from and writes to the same memory), while in the ideal case it should take only a small number of CPU cycles, since it should simply create a new pointer and point it to the correct place in memory!

I understand this. I run some memory copy and a kernel before I start the useful part (and the timers).

cl_mem device_input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_input, NULL);
cl_mem device_output = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_output, NULL);
<<< run the kernel >>>
void* pointer = clEnqueueMapBuffer(queue, device_output, CL_TRUE, CL_MAP_READ, size, 0, 0, NULL, NULL, NULL);
clFinish(queue);
clEnqueueUnmapMemObject(queue, device_output, pointer, 0, NULL, NULL);
clReleaseMemObject(device_input);
clReleaseMemObject(device_output);
<<< test the results of 'original_output' >>>

The problem is in the last line. There’s no guarantee that the most up-to-date data will be in original_output once you have destroyed device_output. OpenCL only guarantees that you will see the most up-to-date bits in the period when the buffer is mapped.

In other words, you would need to do this:

cl_mem device_input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_input, NULL);
cl_mem device_output = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, original_output, NULL);
<<< run the kernel >>>
void* pointer = clEnqueueMapBuffer(queue, device_output, CL_TRUE, CL_MAP_READ, 0, size, 0, NULL, NULL, NULL); /* offset 0, then size */
<<< test the results of 'original_output' >>>
clEnqueueUnmapMemObject(queue, device_output, pointer, 0, NULL, NULL);
clReleaseMemObject(device_input);
clReleaseMemObject(device_output);

OK. That would mean I have to call ‘clEnqueueUnmapMemObject’ and ‘clReleaseMemObject’ at the very end of my program, just before I call ‘free’ on the original arrays (if they were malloc’ed). What will OpenCL do when the unmap function is called on the array? It will not free up the memory space allocated for the original array, will it? This also constrains larger projects that are split over multiple files. It would be nice to have the OpenCL part in a single file and not have to use OpenCL API calls somewhere else to clean up a pointer.

Also, as far as I understand, I don’t have to map/unmap the input array, right?

I’ve tried your suggestion by simply printf-ing a value from the ‘original_output’ array, but it gives 0 (which it shouldn’t). Only after re-enabling the ‘clEnqueueReadBuffer’ does it print the correct value. I am printing straight after the kernel has ended.

So what you are saying is that with the map/unmap scheme and ‘CL_MEM_USE_HOST_PTR’ I might be able to omit the memory copies, but it is not certain that it would work for every case / every machine / every OpenCL driver instance. If that is true, though, I should still get correct results with the above code, right? It might be that the allocated memory is not properly aligned to

Obviously a discrete GPU will need to copy the data across the PCI bus anyway: either on an as-needed basis as the CPU requests it, or at map/unmap time via DMA. Which options are available depends on the hardware, driver and OS.

The specification clearly states that ‘implementations are allowed to cache the buffer contents’ - i.e. they may be copied anyway. You need to read the specific vendor documentation for details of their implementation - I’ve not read the Intel material, but the AMD documentation has a fair bit about their GPUs and how to get ‘zero copy’ access.

And no, you shouldn’t expect correct results above: you’re using the API incorrectly, as David followed up on. More on that in another post.

128-byte boundaries, for example (as Intel states in their documents).

If the Intel SDK states that buffers must be aligned (for zero copy?), and you’re not aligning them, you have to expect that the driver will need to copy the data. It’s as simple as that.

If their CPUs work much faster on aligned data, then it would usually be a win to copy the whole buffer to align it, under the assumption that you’re going to be doing more than a trivial amount of work and accessing that data more than once on the ‘device side’. For all you know it could be copying it regardless: e.g. to match the workgroup topology more closely in an SMP context.

If you are doing a trivial amount of work and accessing the data only once to/from the host (i.e. outside OpenCL), it will be faster just using the CPU in all cases, so such tests should not be used as a guide for OpenCL usage.

What I mean by significant time is that it takes more time than the kernel (which also reads from and writes to the same memory), while in the ideal case it should take only a small number of CPU cycles, since it should simply create a new pointer and point it to the correct place in memory!

Well, ‘significant time’ has no meaning if you’re not doing any work - i.e. it isn’t significant no matter its absolute value. OpenCL has a lot more overhead than a simple function call: that overhead is worth it (and insignificant) when you’re doing a lot of work, but it will still be there when you’re doing simple work.

Don’t get caught up on worrying about memory copies before you’ve even started …

On your first question: Not really, unless you’re only generating one result for the whole application.

It must be mapped/unmapped every time you access the data. This makes sure you do not have stale data on either the host or the device when either side needs it (it’s pretty simple to understand why this is needed). You cannot enqueue tasks which write to a mapped buffer anyway: it needs to be unmapped before being used. I.e. your proposed suggestion is invalid, and when you take this bracketing into account the OpenCL code will be grouped together and not scattered all over the place.

OpenCL 1.1 spec 5.4.2.1, “Behaviour of OpenCL commands that access mapped regions of a memory object”, has the (simple) rules.
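
Roughly, every host-side access gets bracketed like this (a sketch; buffer and kernel names are illustrative):

cl_int err;
/* Host wants to update the buffer between two kernel launches. */
int *p = (int *)clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);
/* ... read/write p[] on the host here ... */
clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);
/* Only after the unmap is it safe to enqueue a kernel that uses 'buffer' again. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);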

Also, as far as I understand, I don’t have to map/unmap the input array, right?

You can do whatever you want. But if it’s only being initialised with one set of fixed data, either USE_HOST_PTR or COPY_HOST_PTR (if you want to free the pointer straight away) is probably the easiest.
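
For example (a sketch based on the snippets above, assuming original_input was malloc’ed and is not needed again on the host):

cl_mem device_input = clCreateBuffer(context,
                                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     size, original_input, NULL);
free(original_input);   /* OpenCL has taken its own copy of the data */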

I’ve tried your suggestion by simply printf-ing a value from the ‘original_output’ array, but it gives 0 (which it shouldn’t). Only after re-enabling the ‘clEnqueueReadBuffer’ does it print the correct value. I am printing straight after the kernel has ended.

Inside the map buffer/unmap buffer bracket?

I noticed in the spec that when using MapBuffer with a CL_MEM_USE_HOST_PTR buffer, it will make sure the data is in the buffer you provided. So this will probably mean a copy: for a GPU it will, because you cannot map device memory into the application heap, and for a CPU device it might, because of alignment restrictions (depending on what the specific vendor documentation says).

What you possibly want (I’m speculating here, as I haven’t done this so far): just let OpenCL do the allocations; you then know that whatever it chooses will be optimal for the processing stages (which is the important bit). Then use map to access the area it has allocated and write directly to that - and hope that it does the best it can, i.e. zero copy. If you have control over the allocations it just changes the way you allocate/access the memory.
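
A sketch of that approach (speculative, as said; whether you actually get zero copy still depends on the vendor implementation):

cl_int err;
/* Let the implementation pick the allocation; CL_MEM_ALLOC_HOST_PTR asks for
   host-accessible memory. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size, NULL, &err);

/* Write the input data directly through a mapping of the buffer. */
int *p = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
/* ... fill p[] with the input data ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* ... set the kernel argument to 'buf' and enqueue the kernel ... */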

But if you need to interact with another library in which you cannot control the allocation, it’s probably going to have to be copied anyway (in general at least), so it’s up to you whether you want to do it yourself or let OpenCL do it.

First of all, thank you for your very long and thorough responses, it is greatly appreciated.

If the Intel SDK states that buffers must be aligned (for zero copy?), and you’re not aligning them, you have to expect that the driver will need to copy the data. It’s as simple as that.

I understand. I’m targeting an Intel CPU only, no GPUs involved at any point. I’m therefore also reading through the Intel OpenCL documentation, in particular the guide found here: http://software.intel.com/file/39189. Specifically, I am trying to follow section 3.1 of that document, where they give the programmer two choices: either you use OpenCL’s memory allocation system and map/unmap, or you make use of the ‘CL_MEM_USE_HOST_PTR’ flag.

Although my originally allocated arrays might not be properly aligned, it should then perform the copies and still give me correct results.

I understand all that, I do have some experience programming CUDA and OpenCL for GPUs. The only thing I’m trying to do now is to omit the memory copies on the CPU.


This is what I understand, please correct me if I’m wrong:

  • If I have a memory object which is created using ‘CL_MEM_USE_HOST_PTR’, it is meant to be accessed by the accelerator only (read/write).
  • After I’ve created such an object, I should not access the host version of it, as it contains undefined data.
  • If I map the memory object, it is accessible by the host from that point on (either for read or write, specified as a flag to the API call), but the accelerator should not access it anymore, as it contains undefined data from accelerator perspective.
  • If I unmap the memory object, it goes back to the state it previously was (accessible by the accelerator, not by the host).

Therefore, I print inside this map/unmap region (and at various other places), but it does not seem to work. I’ve made a link to the full version of the code here: http://dl.dropbox.com/u/26669157/opencl-cpu.tar.gz (I’m not asking you to go through the code, but maybe somebody is interested anyway - the printf is in line 247 of the example6_host.c file).

I do understand that writing everything (including the mallocs) using OpenCL is the best choice, but I’m not allowed to touch the existing part. At first I wanted to get zero-copy anyway, but that doesn’t seem to be working, does it? Now, I’m just trying to get the map/unmap part working, so that if the original mallocs were to meet certain requirements (e.g. 128-byte alignment for the Intel SDK) the copy would be omitted. However, that just doesn’t seem to work right now.

A small question to end with: why would I want to ‘unmap’ the object? After my OpenCL kernel has run I will do a lot of computations on the resulting data outside of OpenCL, no kernels anymore. Ideally I just want to ‘map’ the object directly after the kernel has ended to give it back to the CPU and never ‘unmap’ it. That doesn’t seem to be possible, as the ‘map’ requires you to specify either read or write.

Thanks again for the help!

Ok. I just find that puzzling :) There are probably easier ways to parallelise code if you only have a CPU to work on.

I understand all that, I do have some experience programming CUDA and OpenCL for GPUs. The only thing I’m trying to do now is to omit the memory copies on the CPU.

Just as an intellectual exercise?

This is what I understand, please correct me if I’m wrong:

  • If I have a memory object which is created using ‘CL_MEM_USE_HOST_PTR’, it is meant to be accessed by the accelerator only (read/write).
  • After I’ve created such an object, I should not access the host version of it, as it contains undefined data.

Well it depends on who is writing to it and the read/write flags, but in general where both sides are writing, then yes.

  • If I map the memory object, it is accessible by the host from that point on (either for read or write, specified as a flag to the API call), but the accelerator should not access it anymore, as it contains undefined data from accelerator perspective.

Again, it depends on who is writing it. If you’re only mapping it for read then both sides can still read it.

  • If I unmap the memory object, it goes back to the state it previously was (accessible by the accelerator, not by the host).

Well, the heap memory won’t be unmapped from the process: you will still have access to it. It’s just that if you subsequently invoke a kernel, and have written to it in the meantime, there’s no guarantee the kernel will get any of those writes.

If you’re only reading a result or never use it for a kernel, the data will stay around and be valid after you unmap it.

Therefore, I print inside this map/unmap region (and at various other places), but it does not seem to work. I’ve made a link to the full version of the code here: http://dl.dropbox.com/u/26669157/opencl-cpu.tar.gz (I’m not asking you to go through the code, but maybe somebody is interested anyway - the printf is in line 247 of the example6_host.c file).

245   void* pointer_to_B = clEnqueueMapBuffer(bones_queue,device_B,CL_TRUE,CL_MAP_READ,(N * 1)*sizeof(int),0,0,NULL,NULL,&bones_errors); error_check(bones_errors);

This looks wrong: you’re passing the size as the offset, and mapping 0 bytes. I.e. pointer_to_B will end up pointing N*4 bytes past the start of B, not at &B[0].
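
I.e. with the offset and size in the right order it would presumably be:

void* pointer_to_B = clEnqueueMapBuffer(bones_queue, device_B, CL_TRUE, CL_MAP_READ,
                                        0,                       /* offset */
                                        (N * 1) * sizeof(int),   /* size   */
                                        0, NULL, NULL, &bones_errors);
error_check(bones_errors);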

(another example of why actual source is much better than fragments/discussions).

A small question to end with: why would I want to ‘unmap’ the object? After my OpenCL kernel has run I will do a lot of computations on the resulting data outside of OpenCL, no kernels anymore. Ideally I just want to ‘map’ the object directly after the kernel has ended to give it back to the CPU and never ‘unmap’ it. That doesn’t seem to be possible, as the ‘map’ requires you to specify either read or write.


Because it’s part of the API? You’ve effectively allocated a resource, and it’s just a resource management issue.

But anyway unmapping will work if you are either only reading it, or never using that same buffer ever again in a kernel. If you need to do some processing and subsequently invoke another kernel on it, you will either need to keep the map around during the whole host-side update, or alternatively release the buffer and create a new one when you need it again.

BTW I know you’re only investigating but you really don’t need all those clFinish() calls.

If you do a blocking map (or any blocking operation), it will always ensure that particular operation is completely finished implicitly (and by extension, all previous operations on a single, serial queue).

clFinish() is handy for timing and debugging, but again it’s only needed when you don’t end the sequence on a blocking call.
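
For example (a sketch), this blocking map on its own already guarantees that the kernel and everything queued before it has completed:

void* pointer = clEnqueueMapBuffer(queue, device_output, CL_TRUE, CL_MAP_READ,
                                   0, size, 0, NULL, NULL, NULL);
/* safe to read the results through 'pointer' here - no clFinish() needed */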

Ok. I just find that puzzling :) There are probably easier ways to parallelise code if you only have a CPU to work on.

True, true. But you only figure that out after you try, right? I figured that since I am quite familiar with CUDA and have a little OpenCL GPU experience, I should be able to write some OpenCL CPU code. The good thing about the Intel OpenCL compiler is that it does the vectorization for you as well as using multiple CPU threads.


I think I understand the map/unmap procedure now. And the bug indeed turned out to be a small mistake: the offset and the size were swapped. I’ve fixed that and the code behaves correctly again!

I’ve removed some of the clFinish calls. I just put them there to be sure they were not causing the problem :)

Anyway, many thanks for your help and explanations! It works as expected now! The only thing left is to try to make it actually perform zero-copy, because the execution time still hints that a copy is being made ;)

For those who are interested, the malloc/free implementations that I use to allocate on a 128-byte boundary are as follows:


// Allocate a 128-byte aligned pointer
void *malloc128(size_t size) {
  char *pointer;
  char *pointer2;
  char *aligned_pointer;
  
  // Allocate the memory plus a little bit extra
  pointer = (char *)malloc(size + 128 + sizeof(int));
  if(pointer==NULL) { return(NULL); }
  
  // Create the aligned pointer
  pointer2 = pointer + sizeof(int);
  aligned_pointer = pointer2 + (128 - ((size_t)pointer2 & 127));
  
  // Set the padding size
  pointer2 = aligned_pointer - sizeof(int);
  *((int *)pointer2) = (int)(aligned_pointer - pointer);
  
  // Return the 128-byte aligned pointer
  return (aligned_pointer);
}


// Free the 128-byte aligned pointer
void free128(void *pointer) {
  // Read back the padding size stored just before the aligned pointer
  int *pointer2 = (int *)pointer - 1;
  // Arithmetic on void* is non-standard, so go through char*
  free((char *)pointer - *pointer2);
}

These allocators, in combination with ‘CL_MEM_USE_HOST_PTR’, ‘clEnqueueMapBuffer’ and ‘clEnqueueUnmapMemObject’, give me zero-copy behaviour on an Intel CPU using Intel’s OpenCL SDK.

What about posix_memalign()?
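
For reference, a sketch of the same helper using posix_memalign (POSIX systems only; the alignment must be a power of two and a multiple of sizeof(void *)):

#include <stdlib.h>

// Allocate a 128-byte aligned pointer via posix_memalign
void *malloc128(size_t size) {
  void *pointer = NULL;
  if (posix_memalign(&pointer, 128, size) != 0) {
    return NULL;
  }
  return pointer;
}

// Memory obtained this way is released with a plain free(); no free128() counterpart is needed.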