mapping buffer host on GPU device

hi, I’m trying to perform calculations on a sequence of images. The images are written on a single buffer at intervals of 10 seconds which I apply the Sobel filter.
Instead of changing the buffer on the host, you can edit the buffer of the GPU? In doing so, I would always avoid creating host buffer and then write them on the device.

First, create your buffer using clCreateImage2D() and pass CL_MEM_ALLOC_HOST_PTR as the “flags” argument.

Now you can use clEnqueueMapBuffer()/clEnqueueUnmapBuffer() get a pointer to the image data. With it you can overwrite the contents of the image as easily as if it was a buffer in the host. That way you can avoid a memory copy because you are writing directly into device memory.

thank you david…
i created the buffer using clCreateBuffer. if i want use clEnqueueMapBuffer() / clEnqueueUnmapBuffer() , i need again to create two buffer of clCreateBuffer type for host input e output?
This is my code


cl_mem bufferIn = clCreateBuffer (context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, imageW * imageH * 4, NULL, &ciErr);
cl_mem bufferOut = clCreateBuffer (context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, imageW * imageH * 4, NULL, &ciErr);
unsigned char * dataIn;
dataIn = (BYTE *) clEnqueueMapBuffer(clComQueue, bufferIn, CL_TRUE, CL_MAP_WRITE, 0,  imageW * imageH * 4, 0, NULL, NULL, &ciErr)

After, i have this function where i pass the pointer of image buffer



BYTE * imageIn;
unsigned int * imageOut;
unsigned int imageW;
unsigned int imageH;

imageIn = (BYTE *) display1.bitmap.GetBits();
display1.bitmap.ReleaseBits();
imageW = display1.bitmap.myBmpWidth;
imageH = display1.bitmap.myBmpHeight;
szOut = imageW * imageH * sizeof (unsigned int);
imageOut = (unsigned int * )malloc(szOut);

ExecuteSobel ((BYTE*)imageIn,(BYTE*)imageOut);

void ExecuteSobel ( BYTE* pImgIn, BYTE* pImgOut)
{
ciErr = clEnqueueWriteBuffer(clComQueue, bufferIn, CL_TRUE,  imageW * imageH * 4, imageIn, 0, NULL, NULL);
ciErr = clEnqueueNDRangeKernel(clComQueue, clKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);
ciErr = clEnqueueReadBuffer(clComQueue, bufferOut, CL_TRUE,  imageW * imageH * 4, pImgOut, 0, NULL, NULL);

clEnqueueUnmapMemObject(clComQueue, bufferIn, imageIn, 0, NULL, NULL);
}
 

i know that in some part i’m wrong.
Please help me!!

The reason for using clEnqueueMapBuffer() is that you no longer need to use clEnqueueReadBuffer/clEnqueueWriteBuffer any more, but it requires that whoever is writing the image data into the buffer needs to let you pass a pointer where you want the data to be written. In your example, the input data comes from display1.bitmap.GetBits(), which allocates its own memory and gives you a pointer to it.

At least you can benefit when you read back the output data this way:


[...]
bufferOut = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, imageW * imageH * sizeof (cl_int), NULL, &errCode);

ciErr = clEnqueueNDRangeKernel(clComQueue, clKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);

imageOut = clEnqueueMapBuffer(queue, bufferOut, CL_TRUE, CL_MAP_READ, 0, szOut, 0, NULL, NULL, &errCode);

// here you can read the data from imageOut

errCode = clEnqueueUnmapMemObject(queue, bufferOut, imageOut, 0, NULL, NULL);

But in this way i must change also the kernel arguments?
Before using clEnqueueMapBuffer i have used two buffer, one to pass the image to the kernel function, and one to take the image in output.


ciErr  = clSetkernelArg(clKernel, 0 , sizeof(cl_mem) , (void*)&bufferIn);
ciErr |= clSetkernelArg(clKernel, 1 , sizeof(cl_mem) , (void*)&bufferOut);
ciErr |= clSetkernelArg(clKernel, 2 , sizeof(cl_uint) ,  (void*)&imageW);
ciErr |= clSetkernelArg(clKernel, 3 , sizeof(cl_uint) ,  (void*)&imageH);

The kernel arguments do not change. The same buffers are used: one for input and one for output. The only difference is that now instead of calling clEnqueueReadBuffer to read the output buffer, now you will use a mapbuffer/unmapbuffer operation, which can save you a data transfer.

i tryed to use clenqueueMapBuffer/clenqueueUnmapMemObj and the test was good. i have compared the result and if i use clenqueueMapBuffer/clenqueueUnmapMemObj the time execution is reduced to about 50% (create input and output buffer,map the input,execute,map the output,release map output).

Now i would like know if is possibile improve more the time execution. this is my code.



...

if (memMode == PINNED)
	{
		// memMode == PINNED
		bufferIn  = clCreateBuffer(context, CL_MEM_READ_WRITE, szBufferIn , NULL, &ciErr);
		bufferOut = clCreateBuffer(context, CL_MEM_READ_WRITE, szBufferOut, NULL, &ciErr);
	}
	else 
	{	
		// Standard host allocation (malloc)
		// memMode == PAGEABLE	
		bufferIn  = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBufferIn , NULL, &ciErr);
		bufferOut = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBufferOut, NULL, &ciErr);
	}
	
	if (accMode == MAPPED)
	{
		// Get mapped pointers for writing to pinned input and output host image pointers 
		// accMode = MAPPED
		cDataIn  = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferIn , CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, szBufferIn , 0, NULL, NULL, &ciErr);
		//cDataOut = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferOut, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, szBufferOut, 0, NULL, NULL, &ciErr);
	}
	else 
	{
		// DIRECT: API access to device buffer
		// accMode = DIRECT
	}

...

Execute(BYTE* pImgIn, BYTE* pImageOut)
{
	
	if (!pImageOut) 
	{
		char log[10240] = "ERRORE BUFFER";
		MessageBox(NULL,log,"pImageOut NULL",MB_OK);
		return;
	}

	if (accMode == MAPPED)
	{
		//cDataIn  = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferIn , CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, szBufferIn , 0, NULL, NULL, &ciErr);
		dt.Start();
		memcpy(cDataIn, pImgIn, szBufferIn);
		writeBuffer = dt.Check();
		//ciErr = clEnqueueUnmapMemObject(clCommandQueue, bufferIn , cDataIn,NULL,NULL,NULL );
		
		// Launch kernel
		dt.Start();
		ciErr = clEnqueueNDRangeKernel(clCommandQueue, clKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, &e_runCode);
		clWaitForEvents(1, &e_runCode);
		runCode = dt.Check();

		cDataOut = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferOut, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, szBufferOut, 0, NULL, NULL, &ciErr);
		dt.Start();
		memcpy(pImageOut, cDataOut, szBufferOut);
		readBuffer = dt.Check();
		ciErr = clEnqueueUnmapMemObject(clCommandQueue, bufferOut, cDataOut, NULL,NULL,NULL );
	}
	else 
	{
		// Copy input data from host to device
		dt.Start();
		ciErr = clEnqueueWriteBuffer(clCommandQueue, bufferIn, CL_TRUE, 0, szBufferIn, pImgIn, 0, NULL, &e_writeBuffer);
		clWaitForEvents(1, &e_writeBuffer);
		writeBuffer = dt.Check();
		
		// Launch kernel
		dt.Start();
		ciErr = clEnqueueNDRangeKernel(clCommandQueue, clKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, &e_runCode);
		clWaitForEvents(1, &e_runCode);
		runCode = dt.Check();
		
		// Copy results back to host, block until complete
		dt.Start();
		ciErr = clEnqueueReadBuffer(clCommandQueue, bufferOut, CL_TRUE, 0, szBufferOut, pImageOut, 0, NULL, &e_readBuffer);
		clWaitForEvents(1, &e_readBuffer);
		readBuffer = dt.Check();
	}
}


This code functions but if i create the map of the output buffer in the same moment when i create the map for the input buffer the kernel code doesn’t function. Why?

If I understood correctly your code, then the reason you can’t map the output buffer before the execution of the kernel is because map/unmap operations are synchronization points. When you map/unmap a pointer, the OpenCL driver may implicitly perform a memory copy from the host to the device (or vice-versa) through PCIe.

This is why if you map the output buffer before the GPU has written into it you will get invalid data out of it. For the same reason you should unmap the input buffer before you enqueue the NDRange. That will ensure that the input data is copied to the GPU before the range is executed.

Also I have another suggestion: be careful with the read/write flags you pass to clEnqueueMapBuffer(). It can save you a lot of performance.

Try this for example:

cDataIn  = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferIn , CL_TRUE, CL_MAP_WRITE, 0, szBufferIn , 0, NULL, NULL, &ciErr);
cDataOut = (BYTE *)clEnqueueMapBuffer(clCommandQueue, bufferOut, CL_TRUE, CL_MAP_READ, 0, szBufferOut, 0, NULL, NULL, &ciErr);

By marking the input buffer as write-only you avoid making a copy from the device to the host when the buffer is mapped. By marking the input buffer as read-only you avoid a copy from the host to the device when the buffer is unmapped.

I hope this all makes sense to you. Feel free to ask any other questions and I will try to answer.