Sharing host memory with clSetKernelArg!

Hi!

For CPU and AMD Fusion devices, which share the same (host) memory, there is no point in relying on clCreateBuffer to copy data to the device and back. In such cases it would make sense for clSetKernelArg to accept a (properly aligned) pointer to host memory directly. clSetKernelArg also has a very small overhead compared to the “buffer” functions, which were designed especially with “split” memory scenarios in mind.
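To make the request concrete, here is a sketch of the wished-for usage (this is not valid OpenCL as specified; the names are made up for illustration):

```
/* HYPOTHETICAL - not valid OpenCL today. The idea: on a host-unified
   device, hand the kernel an aligned host pointer directly, with no
   buffer object in between. */
float *host_data = (float *)malloc(1024 * 1024 * sizeof(float));
err = clSetKernelArg(kernel, 0, sizeof(float *), &host_data);
```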

Thanks!
Atmapuri

Your phrase “share the same (host) memory” is also known as host-unified memory. However, you still need to use clCreateBuffer. You seem to assume that an implementation must always use a “split” memory scenario, but that is not the case on a host-unified memory device. Naturally it all depends upon the implementation and the device, but the ones I’m familiar with do not make any extra copies or splits in this case. When clSetKernelArg is called, it simply uses the host memory referenced by the buffer. As a result there is no need for clSetKernelArg “to accept a (properly aligned) pointer to host memory directly”.
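To illustrate, a minimal sketch (variable names assumed): on a host-unified device a conforming implementation can back the buffer with the application's own memory, so nothing is copied.

```
/* Wrap existing host memory in a buffer object; on a host-unified device
   the implementation need not make a copy. */
cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                            n * sizeof(float), host_data, &err);
/* The kernel argument is the buffer object, which the runtime can back
   directly with the same host memory. */
err = clSetKernelArg(kernel, 0, sizeof(buf), &buf);
```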

I measured an overhead between 50 and 2000us, and you say it is not there?
Going through clCreateBuffer or clEnqueueRead/Write, even for CPU devices, adds overhead considerably (1000x) above the optimum (a pointer copy).

Thanks!
Atmapuri

What parameters are you passing into clCreateBuffer? Can you cut & paste the line of code?

I can’t post the complete code as it is scattered across lots of other code. I call clCreateBuffer with:

CL_MEM_READ_WRITE

clEnqueueMapBuffer has CL_TRUE for blocking, with CL_MAP_READ when reading and CL_MAP_WRITE when writing.

I additionally tried using:

CL_MEM_ALLOC_HOST_PTR

(in clCreateBuffer), but the ATI GPU device appears to constantly mirror (copy) all the changes from host to GPU and back in the background when this flag is specified, thus slowing down the computation. When this flag is specified, the time to copy data from the GPU using Map/Unmap is the same as for the CPU device (1.5ms for 4MBytes of data). When this flag is not specified, the time to copy the data is 17ms (which makes sense).
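For reference, a minimal sketch of the pattern I am describing (queue and size names assumed):

```
/* Let the runtime allocate host-accessible memory, then map it to fill
   or read it from the host. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            4 * 1024 * 1024, NULL, &err);
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                             0, 4 * 1024 * 1024, 0, NULL, NULL, &err);
/* ... write the 4MBytes of input through p ... */
err = clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```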

To copy 4MBytes of memory in C++ takes 200us on my machine, so with a 1.5ms overhead for the CPU device that is not “zero cost”.

Is there some special reason why I couldn’t use CL_MEM_USE_HOST_PTR with clCreateBuffer and then do a clFinish before copying memory with C++ code (when the device is the CPU), directly referencing the host pointer passed to clCreateBuffer?
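That is, something like this sketch (the shortcut being asked about, not spec-sanctioned practice; the portable route is the map/unmap pair):

```
/* Proposed shortcut for a CPU device: wait for the queue, then touch the
   CL_MEM_USE_HOST_PTR region directly from the host. */
err = clFinish(cpucommandqueue);  /* all enqueued kernels have completed */
memcpy(dst, array, nbytes);       /* read the original host pointer directly */
```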

I create the buffer in a context shared by both CPU and GPU, but kernels are currently always enqueued to only one device during the lifetime of the buffer.

Thanks!
Atmapuri

No need to “post complete code”; the one line of the clCreateBuffer call was all I wanted to see. However, you have explained much more about your application and your use of OpenCL - thanks. I now understand that you are trying to use both the CPU and GPU, which operate on one or more buffers that are shared between them.

Background: The only valid choices of the clCreateBuffer “HOST_PTR” flags are the following combinations (ignoring the “READ/WRITE” flags, which are orthogonal):

- none
- CL_MEM_COPY_HOST_PTR
- CL_MEM_ALLOC_HOST_PTR
- CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR
- CL_MEM_USE_HOST_PTR
The use of each combination depends upon a number of factors or strategies. For example, here is one (there are more; if you are interested, please ask and I’ll write more).

- Your host application has already allocated and computed some data outside of OpenCL, for example, 1M floats, that is,

```
float array[1024*1024];
```

or

```
float *array = (float *)malloc(1024*1024*sizeof(float));
```

I think this might be what you are doing.

- Now you wish to access it from your OpenCL kernel, so you should issue a clCreateBuffer with CL_MEM_USE_HOST_PTR. This call takes a pointer to your existing application data, array. For example,

```
cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                               1024*1024*sizeof(float), array, &error);
```

This is preferred when the application has already allocated the data; any other flag choice causes an allocation. Once the buffer is created you should assume that the host application NO longer has access to the data, that is, only the OpenCL devices can access the data until you release the buffer.

- For the CPU device the runtime uses the data directly in array and invokes the kernel passing a pointer to this data. There should be no need to move the data or make copies of it. This is what I think you’re trying to accomplish. For example,

```
error = clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
```

and

```
error = clEnqueueTask(cpucommandqueue, kernel, 0, NULL, NULL);
```

- For the GPU, however, the runtime MUST transfer the array from host memory to device memory and invoke the kernel using the data that is now in device memory. Naturally this transfer takes time, depending upon how much data there is. For example,

```
error = clEnqueueTask(gpucommandqueue, kernel, 0, NULL, NULL);
```

- If another GPU device kernel is enqueued for this data, the runtime knows that the data is already on the device, so no data transfer should happen. For example,

```
error = clEnqueueTask(gpucommandqueue, kernel2, 0, NULL, NULL);
```

- After the GPU device kernel completes execution, the runtime can transfer the data back to host memory when requested by either the host application or the CPU device.

- If the CPU device kernel needs this data, the runtime must transfer it from device memory to host memory, incurring the transfer time, and invoke the kernel passing a pointer to the data. For example,

```
error = clEnqueueTask(cpucommandqueue, kernel2, 0, NULL, NULL);
```

- If the host application needs this data, the runtime may or may not transfer it, depending on whether it is already in host memory. In general, for buffers created with CL_MEM_USE_HOST_PTR it is best to use clEnqueueMapBuffer, because if the data is already in host memory there is no need to transfer it back. For example,

```
void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, CL_MAP_READ,
                                   0, 1024*1024*sizeof(float), 0, NULL, NULL, &error);
```

access the data at mapaddr, then

```
error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
```

- If you are done using OpenCL, release the buffer to regain access to the data. For example,

```
error = clReleaseMemObject(buffer);
```
Note: I have not compiled any of this code, so there might be some typos in it.

I appreciate the detailed and well-put answer. Here are some timings that I performed for the CPU device:

cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);
error = clReleaseMemObject(buffer);

AMD driver: 25us
Intel driver: 40us

Time to copy the array in C++ (not in cache / cached):

Cold: 750us
Warm: 180us

Timing with mapping:

cl_mem buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 1024*1024*2, array, &error);
void *mapaddr = clEnqueueMapBuffer(cpucommandqueue, buffer, CL_TRUE, 0, 0, 1024*1024, 0, NULL, NULL, &error);
error = clEnqueueUnmapMemObject(cpucommandqueue, buffer, mapaddr, 0, NULL, NULL);
error = clReleaseMemObject(buffer);

AMD driver: 37us
Intel driver (release version): ~70us

What are the possible side-effects if I use the array pointer directly, without calling the map/unmap pair, to obtain back the already-known values? Instead I only make sure that the queue has finished (clFinish).

Even though 37us does not seem like much for AMD, it is still 37x more than clSetKernelArg (not to mention the utter simplicity of one function call versus four that must be properly configured out of many options). In terms of computational power, 37us is enough to compute 4x 1024-point FFTs (on one core). 12us is what the map/unmap alone costs, and that is still enough for one 1024-point FFT.

It may be that in the world of GPUs these numbers are “small”, but the CPU device is a different ballgame.

Thanks!
Atmapuri

P.S.
The Intel driver requires 1024-byte array alignment in order not to copy the memory.
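For what it’s worth, a minimal sketch of allocating with that alignment on POSIX systems (on Windows, _aligned_malloc plays the same role):

```
#include <stdlib.h>

/* Allocate the host array 1024-byte aligned so that, per the observation
   above, CL_MEM_USE_HOST_PTR can be honored without an internal copy. */
void *array = NULL;
if (posix_memalign(&array, 1024, 1024 * 1024 * sizeof(float)) != 0) {
    /* handle allocation failure */
}
```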

On the hgpu.org machine whose platform parameters are:

- OS: OpenSUSE 11.4
- SDK: AMD Accelerated Parallel Processing (APP) SDK 2.4
- GPU device 0: ATI Radeon HD 5870 2GB, 850MHz
- GPU device 1: ATI Radeon HD 6970 2GB, 880MHz
- CPU: AMD Phenom II X6 1055T @ 2.8GHz
- RAM: 12GB
- HDD: 2TB, RAID-0

With the following program

#define _BSD_SOURCE
#include <sys/time.h>
#include <CL/cl.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>

#define CHECK(function_call) \
do { \
  /* printf("CHECK(" #function_call ") in: %s, line: %d
", __FILE__, __LINE__); */ \
  int _rc = (function_call); \
  if (_rc != 0) { \
	printf("ERROR:  function rc = %d
", _rc); \
	fflush(stdout); \
	exit(_rc); \
  } \
} while(0);

#define CHECK_ERR(function_call, _rc) \
do { \
  /* printf("CHECK(" #function_call ") in: %s, line: %d
", __FILE__, __LINE__); */ \
  (function_call); \
  if (_rc != 0) { \
	printf("ERROR:  function rc = %d
", _rc); \
	fflush(stdout); \
	exit(_rc); \
  } \
} while(0);

int main(int argc, char **argv) {

	int err;
	struct timeval start, end, diff;
	cl_uint num_platforms;
	cl_platform_id *platforms;
	cl_uint num_devices;
	cl_device_id *devices;
	cl_context context;
	cl_mem buffer;

	// Get platforms
	CHECK(clGetPlatformIDs(0, NULL, &num_platforms));
	platforms = (cl_platform_id *) malloc(
			num_platforms * sizeof(cl_platform_id));
	CHECK(clGetPlatformIDs(num_platforms, platforms, NULL));

	// Loop through all platforms
	unsigned int p;
	for (p = 0; p < num_platforms; p++) {

		// Output platform name
		size_t platform_name_size;
		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, 0, NULL, &platform_name_size));
		char *platform_name = (char *) malloc(platform_name_size);
		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, platform_name_size, platform_name, NULL));
		printf("platform[%u]=%s
", p, platform_name);
		free(platform_name);

		// Get devices
		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices));
		devices = (cl_device_id *) malloc(num_devices * sizeof(cl_device_id));
		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL));

		// Loop through all devices
		unsigned int d;
		for (d = 0; d < num_devices; d++) {

			// Output device name
			size_t device_name_size;
			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 0, NULL, &device_name_size));
			char *device_name = (char *) malloc(device_name_size);
			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, device_name_size, device_name, NULL));
			printf("device[%u]=%s
", d, device_name);
			free(device_name);

			// Create Context
			cl_context_properties context_properties[3] = {
					CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[p],
					0 };
			CHECK_ERR(context = clCreateContext(context_properties, num_devices, devices,
					NULL, NULL, &err), err);

			// Start timing
			err = gettimeofday(&start, NULL);
			if (err != 0) {
				printf("gettimeofday(start, NULL) failed err=%d
", err);
				exit(err);
			}

			// Allocate Buffer
			int *array = (int *) malloc(1024 * 1024 * sizeof(int));

			// Create Buffer
			CHECK_ERR(buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
					1024 * 1024 * sizeof(int), array, &err), err);

			// Release Buffer
			CHECK(clReleaseMemObject(buffer));

			// End timing
			err = gettimeofday(&end, NULL);
			if (err != 0) {
				printf("gettimeofday(end, NULL) failed err=%d
", err);
				exit(err);
			}

			// Get end-start difference and print it
			timersub(&end, &start, &diff);
			float time = ((float) diff.tv_sec * 1000000)
					+ ((float) diff.tv_usec);
			printf("end-start time %f usec
", time);

			CHECK(clReleaseContext(context));

		}

		free(devices);

	}

	free(platforms);

	return 0;
}

I get the following timing results

platform[0]=AMD Accelerated Parallel Processing
device[0]=Cypress
end-start time 11.000000 usec
device[1]=Cayman
end-start time 4.000000 usec
device[2]=AMD Phenom(tm) II X6 1055T Processor
end-start time 4.000000 usec

This is faster than what you have. Does this program match what you have used? What do you get when you run this program on your system? If you wish, please update this program to include more timings and repost it here.
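For instance, one possible refinement (a sketch I have not run): repeat the measured region many times and average, since gettimeofday resolves only about a microsecond and single runs of a few microseconds are noisy.

```
/* Average over many iterations to smooth out timer granularity;
   assumes array, context and err are as in the program above. */
int i;
const int iters = 1000;
CHECK(gettimeofday(&start, NULL));
for (i = 0; i < iters; i++) {
	CHECK_ERR(buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
			1024 * 1024 * sizeof(int), array, &err), err);
	CHECK(clReleaseMemObject(buffer));
}
CHECK(gettimeofday(&end, NULL));
/* divide the end-start difference by iters for the per-call cost */
```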

What are the possible side-effects if I use the array pointer directly, without calling the map/unmap pair, to obtain back the already-known values? Instead I only make sure that the queue has finished (clFinish).

Your host memory will be stale (out of date) if you use a GPU device, because the values in device memory will not be read back (or mapped) into host memory.
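The safe pattern after a GPU kernel has written the buffer is therefore a map before any host access; a sketch (queue and size names assumed):

```
/* Map before touching a CL_MEM_USE_HOST_PTR region from the host, so the
   runtime can transfer any device-side updates back first. */
void *p = clEnqueueMapBuffer(gpucommandqueue, buffer, CL_TRUE, CL_MAP_READ,
                             0, nbytes, 0, NULL, NULL, &err);
/* ... read the up-to-date data through p ... */
err = clEnqueueUnmapMemObject(gpucommandqueue, buffer, p, 0, NULL, NULL);
```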

It may be that in the world of GPUs these numbers are “small”, but the CPU device is a different ballgame.

I’m not sure what you are saying; please explain. If you are saying that using a GPU incurs more overhead than dispatching a simple call from a CPU host application to a CPU compute function, then yes, you are right. A call on the CPU is typically only a few instructions, whereas the GPU is an I/O-attached device that requires much more overhead to transfer data to it and dispatch a function on it. However, a GPU has tremendous parallel processing capability, so it is all a game of hiding the bandwidth and latency by doing enough work to make the overhead worthwhile.

Continuing… adding in a command queue and a map/unmap, I get the following results:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Cypress
end-start time 122.000000 usec
device[1]=Cayman
end-start time 114.000000 usec
device[2]=AMD Phenom(tm) II X6 1055T Processor
end-start time 165.000000 usec

The program is now

#define _BSD_SOURCE
#include <sys/time.h>
#include <CL/cl.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>

#define CHECK(function_call) \
do { \
  /* printf("CHECK(" #function_call ") in: %s, line: %d
", __FILE__, __LINE__); */ \
  int _rc = (function_call); \
  if (_rc != 0) { \
	printf("ERROR:  function rc = %d
", _rc); \
	fflush(stdout); \
	exit(_rc); \
  } \
} while(0);

#define CHECK_ERR(function_call, _rc) \
do { \
  /* printf("CHECK(" #function_call ") in: %s, line: %d
", __FILE__, __LINE__); */ \
  (function_call); \
  if (_rc != 0) { \
	printf("ERROR:  function rc = %d
", _rc); \
	fflush(stdout); \
	exit(_rc); \
  } \
} while(0);

int main(int argc, char **argv) {

	int err;
	struct timeval start, end, diff;
	cl_uint num_platforms;
	cl_platform_id *platforms;
	cl_uint num_devices;
	cl_device_id *devices;
	cl_context context;
	cl_command_queue commandqueue;
	cl_mem buffer;

	// Get platforms
	CHECK(clGetPlatformIDs(0, NULL, &num_platforms));
	platforms = (cl_platform_id *) malloc(
			num_platforms * sizeof(cl_platform_id));
	CHECK(clGetPlatformIDs(num_platforms, platforms, NULL));

	// Loop through all platforms
	unsigned int p;
	for (p = 0; p < num_platforms; p++) {

		// Output platform name
		size_t platform_name_size;
		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, 0, NULL, &platform_name_size));
		char *platform_name = (char *) malloc(platform_name_size);
		CHECK(clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, platform_name_size, platform_name, NULL));
		printf("platform[%u]=%s
", p, platform_name);
		free(platform_name);

		// Get devices
		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices));
		devices = (cl_device_id *) malloc(num_devices * sizeof(cl_device_id));
		CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL));

		// Create Context with all devices
		cl_context_properties context_properties[3] = { CL_CONTEXT_PLATFORM,
				(cl_context_properties) platforms[p], 0 };
		CHECK_ERR(context = clCreateContext(context_properties, num_devices, devices,
						NULL, NULL, &err), err);

		// Loop through all devices
		unsigned int d;
		for (d = 0; d < num_devices; d++) {

			// Output device name
			size_t device_name_size;
			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 0, NULL, &device_name_size));
			char *device_name = (char *) malloc(device_name_size);
			CHECK(clGetDeviceInfo(devices[d], CL_DEVICE_NAME, device_name_size, device_name, NULL));
			printf("device[%u]=%s
", d, device_name);
			free(device_name);

			// Create command queue
			CHECK_ERR(commandqueue = clCreateCommandQueue(context, devices[d], 0, &err), err);

			// Start timing
			CHECK(gettimeofday(&start, NULL));

			// Allocate Buffer
			int *array = (int *) malloc(1024 * 1024 * sizeof(int));

			// Create Buffer
			CHECK_ERR(buffer = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
							1024 * 1024 * sizeof(int), array, &err), err);

			// Map Buffer
			void *mapaddr;
			CHECK_ERR(mapaddr = clEnqueueMapBuffer(commandqueue, buffer, CL_TRUE, CL_MAP_WRITE, 0, 1024 * 1024 * sizeof(int), 0, NULL, NULL, &err), err);

			// Unmap Memory Object
			CHECK(clEnqueueUnmapMemObject(commandqueue, buffer, mapaddr, 0, NULL, NULL));

			// Release Buffer
			CHECK(clReleaseMemObject(buffer));

			// End timing
			CHECK(gettimeofday(&end, NULL));

			// Compute end-start timing difference and print it
			timersub(&end, &start, &diff);
			float time = ((float) diff.tv_sec * 1000000)
					+ ((float) diff.tv_usec);
			printf("end-start time %f usec
", time);

			// Release command queue
			CHECK(clReleaseCommandQueue(commandqueue));

		}

		// Release context
		CHECK(clReleaseContext(context));

		free(devices);

	}

	free(platforms);

	return 0;
}

Changing the above program to specify only CL_DEVICE_TYPE_CPU (no GPU usage) gives the following results with map/unmap:

platform[0]=AMD Accelerated Parallel Processing
device[0]=AMD Phenom(tm) II X6 1055T Processor
end-start time 71.000000 usec
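For reference, selecting only the CPU is just a change to the device query in the program above:

```
/* Enumerate CPU devices only instead of CL_DEVICE_TYPE_ALL. */
CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, 0, NULL, &num_devices));
devices = (cl_device_id *) malloc(num_devices * sizeof(cl_device_id));
CHECK(clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, num_devices, devices, NULL));
```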

1.) Using only create/free buffer:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 15.666632 usec

device[1]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 4.736423 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 8.015486 usec

platform[2]=Intel® OpenCL
device[0]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 24.046458 usec

2.) Create/free buffer and queue:

platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 12.023229 usec

device[1]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 10.201528 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 13.844930 usec

platform[2]=Intel® OpenCL
device[0]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 23.317777 usec

3.) Create/free buffer and queue, and do map/unmap:
platform[0]=AMD Accelerated Parallel Processing
device[0]=Juniper
end-start time 740.339427 usec

device[1]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 22.224756 usec

platform[1]=NVIDIA CUDA
device[0]=GeForce 8600 GT
end-start time 9735.171989 usec

platform[2]=Intel® OpenCL
device[0]=Intel® Core™ i7 CPU 860 @ 2.80GHz
end-start time 101.650935 usec

I moved the malloc outside of the loop and added 128-byte alignment. I also found some timing overhead in my own code, thanks to your example. AMD does show a relatively low overhead for the CPU device, but it is still a lot more than a pointer copy or a call to clSetKernelArg. Anyhow, you did say that using the pointer without map/unmap for the CPU device is fine, so I guess that solves the (overhead) problem. Usually when you copy data you need to wait for the queue to finish anyway.

Thanks!
Atmapuri

Anyhow, you did say that using the pointer without map/unmap for the CPU device is fine.

Actually I didn’t say that. But good luck in your efforts!