How to reduce the read buffer time

Hello,

I am working on convolution of an image with a mask of size 3x3.

I am facing a problem with reading the data back from GPU to CPU. The time required to execute kernel is just 30 milli sec. But to read data back it is taking more time nearly more than a second for a 6000 x 6000 image.

I am using clEnqueReadBuffer() to get data back. I have tried using pinned memory also, but I didnt find any improvement.

I tried the synchronous mode (CL_TRUE) reading. It is taking more time. I tried asynchronous mode (CL_FALSE). It is very fast. But, I am not getting the full image back. If I use clFinish(cqCommandQueue) after asynchronous mode then it is taking the same time as synchronous mode but I am getting full image.

I cant able to make GPU is better than CPU. The time take by convolution on CPU is half the time the time taken for convolution on GPU. But the problem is in reading the data back. Reaming all is ok.

Please help me if you know how to reduce it.

Thanks in advance.

The time required to execute kernel is just 30 milli sec. But to read data back it is taking more time nearly more than a second for a 6000 x 6000 image.

How are you measuring this time? Are you using the command queue’s profiling mode?

I tried asynchronous mode (CL_FALSE). It is very fast. But, I am not getting the full image back.

Did you wait for the enqueue event that is returned by clEnqueueMapImage2D() to complete or did you use the pointer right away without waiting? By waiting I mean calling clWaitForEvents().

Generally speaking, using OpenCL on a GPU efficiently requires that the computation time must be long enough to compensate for the data transfers from and to the GPU. A 3x3 convolution matrix is not very intense computationally.

Also, what is the pixel format of the input image and the output image? If you are using float4 it will be worth considering something smaller, such as char4.

Thanks Mr. david.garcia

I am measuring the kernel execution time using event profiling method.

The execution time is in nano seconds. I also used CPU timer that is GetTickCount() also in Visual studio. Then also I got the same.

I didnt use the command clEnqueueMapImage2D().

If I use clWaitForEvents() then also the read buffer time is taking more even in asynchronous mode(CL_FALSE).

As you said I am using float4 datatype. I will change it. Can I use int instead of char.

What is the suggestible kernel size? Shall I have to use 5x5 or 7x7?

If the image is 6000 x 6000 x float4, it means that its size is at least 550MB. One second to transfer 550MB sounds like a lot, but the OpenCL driver may be doing more than simply copying the data.

Have you measured how long it takes to read the image from the CPU to the GPU? Can you show us the source code of your program? I would like to see the API calls used to copy memory and to measure time.

#include <stdio.h>
#include <conio.h>
#include <windows.h>
#include <assert.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
#include <oclUtils.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>

#define MAX_SOURCE_SIZE (0x100000)

float *srcA, *srcB; // Host buffers for OpenCL test
float *CPUoutput, *GPUoutput; // Host buffer for GPU and CPU processing

// OpenCL Vars
cl_platform_id cpPlatform; // OpenCL platform
cl_device_id cdDevice = NULL,CPU = NULL; // OpenCL device
cl_context cxGPUContext; // OpenCL context
cl_command_queue cqCommandQueue; // OpenCL command que
cl_program cpProgram; // OpenCL program
cl_kernel ckKernel; // OpenCL kernel
cl_mem cmDevSrcA; // OpenCL device source buffer A
cl_mem cmDevSrcB; // OpenCL device source buffer B
cl_mem cmDevDst; // OpenCL device destination buffer
cl_uint ret_num_platforms;
size_t szGlobalWorkSize[2]; // Total # of work items in the 2D range
size_t szLocalWorkSize[2]; // # of work items in the 2D work group
cl_event GPUevent,transferevent; // OpenCL event
cl_ulong start; // To store start clock time
cl_ulong end; // To store end clock time
//cl_event ev;
//size_t szParmDataBytes; // Byte size of context information
//size_t szKernelLength; // Byte size of kernel code
cl_int ciErrNum = 0; // Error check variable
char* cPathAndName = NULL; // var for full paths to data, src, etc.
char* cSourceCL = NULL; // Buffer to hold source for compilation
size_t source_size;
char *source_str;
size_t buffer_size;
float *img,*img1, *mask;

// demo config variables
int iNumElements = 0; // Length of entire array to be created (Image size)
int iFilterWidth = 3; // Specifies the width of the Filter for convolution
int iWidth = 0; // Image Width Holding variable
int iHeight = 0; // Image Height Holding variable
shrBOOL bNoPrompt = shrFALSE;
IplImage *srcimg, *tempFrame; //Image storing elements

//Elements helps in creating the buffers
size_t mem_sizeImage;
unsigned int sizeKernel;
size_t mem_sizeKernel;

// Forward Declarations
// *************************************************************************************
float* CPUConvolution(float* pfData1, float* pfData2);
void GPUDevicePreparation();
void GPUConvolution(float *inimage, float *mask,float *outimage);
void Cleanup (int iExitCode);
double getclock();
void (*pCleanup)(int) = &Cleanup;
void readsourceimage();

//Function Definition

//=====================================================================================
// Function to read an image and make it to an two dimensional array
//=====================================================================================
void readsourceimage()
{
srcimg = cvLoadImage(“road_6000_800.jpg”,0);
tempFrame = cvLoadImage(“road_6000_800.jpg”,0);

iWidth = srcimg-&gt;width;
iHeight = srcimg-&gt;height;
iNumElements = iWidth * iHeight;

IplImage *dstimg = cvCreateImageHeader(cvSize(iWidth,iHeight),IPL_DEPTH_32F,1);

cvNamedWindow ("Inputimage",1);
cvShowImage ("Inputimage",srcimg);
cvWaitKey();

BYTE *inimg = (BYTE *) srcimg-&gt;imageData;


// -----------------------  Dynamic Mem in C++ -----------------------
img = new float [iNumElements];
img1 = new float [iNumElements];

for (int i=0; i&lt; iNumElements; i++)
{
		img[i] = (float) inimg[i];
}

for (int i=0; i&lt; iNumElements; i++)
{
		img1[i] = (float) inimg[i];
}

}

//$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
// Reading the kernel source file to access it
//$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
void loadprogramsource(const char *filename)
{
FILE *fp;
fp = fopen(filename,“r”);

if(!fp)
{
fprintf(stderr,"Failed to load kernel

");
exit(1);
}

source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str,1,MAX_SOURCE_SIZE,fp);

fclose(fp);

}

//-------------------------------------------------------------------------------------------
// GPU Device Preparation and loading all the necessary variable and building the profgram
//-------------------------------------------------------------------------------------------

void GPUDevicePreparation()
{
shrLog("GPU Device Preparation for building and execution…
");

// Get the NVIDIA platform
shrLog("Get the NVIDIA platform...

");
ciErrNum = oclGetPlatformID(&cpPlatform);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

//Find GPU CL device, which is really needed for processing
shrLog("Getting the GPU device...

");
ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

// Create the context
shrLog("clCreateContext...

");
cxGPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

// Create a command-queue
shrLog("clCreateCommandQueue...

");
cqCommandQueue = clCreateCommandQueue(cxGPUContext, cdDevice, CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

// Place here some thing
const char fileName[] = "./Convolution.cl";
loadprogramsource(fileName);

// Create the program
shrLog("clCreateProgramWithSource...

");
cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char *)&source_str, (const size_t)&source_size, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

// Build the program with 'mad' Optimization option
//#ifdef MAC
//    char* flags = "-cl-fast-relaxed-math -DMAC";
//#else
	char* flags = "-cl-fast-relaxed-math";
//#endif

shrLog("clBuildProgram...

");
ciErrNum = clBuildProgram(cpProgram, 1, &cdDevice, flags, NULL, NULL);

if (ciErrNum != CL_SUCCESS)
{	
	// write out standard error, Build Log and PTX, then cleanup and exit
	shrLogEx(LOGBOTH | ERRORMSG, ciErrNum, STDERROR);
	oclLogBuildInfo(cpProgram, oclGetFirstDev(cxGPUContext));
	oclLogPtx(cpProgram, oclGetFirstDev(cxGPUContext), "oclConvolution.ptx");
	Cleanup(EXIT_FAILURE); 
}

	// Create the kernel
shrLog("clCreateKernel (Convolution)...

");
ckKernel = clCreateKernel(cpProgram, “Convolution”, &ciErrNum);

}

//§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§
// GPU Convolution starts in this function
//§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§§
void GPUConvolution(float *inimage, float *mask, float *outimage)
{
long ts,te,tst,tet;
mem_sizeImage = sizeof(float) * iNumElements;
sizeKernel = iFilterWidth * iFilterWidth;
mem_sizeKernel = sizeof(float) * sizeKernel;
unsigned char * out1;

//ts = GetTickCount();
// Allocate the OpenCL buffer memory objects for source and result on the device GMEM
//shrLog("clCreateBuffer (SrcA, SrcB and GPUoutput in Device GMEM)...

");
cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, mem_sizeImage, NULL, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
cmDevSrcB = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, mem_sizeKernel, NULL, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
cmDevDst = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, mem_sizeImage, NULL, &ciErrNum);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
//te = GetTickCount();
//printf("Create Buffer: %d [Milli sec]
",(te - ts));

//ts = GetTickCount();
// Asynchronous write of data to GPU device
//shrLog("clEnqueueWriteBuffer (SrcA and SrcB)...

");
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &GPUevent);
ciErrNum |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, mem_sizeKernel,(void*) mask, 0, NULL, &GPUevent);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
clFinish(cqCommandQueue);
//te = GetTickCount();
//printf("Write Buffer: %d [Milli sec]
",(te - ts));

//ts = GetTickCount();
// Set the Argument values
//shrLog("clSetKernelArg 0 - 4...

");
ciErrNum = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &cmDevSrcA);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
ciErrNum = clSetKernelArg(ckKernel, 1, sizeof(cl_mem), &cmDevSrcB);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
ciErrNum = clSetKernelArg(ckKernel, 2, sizeof(cl_mem), &cmDevDst);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
ciErrNum = clSetKernelArg(ckKernel, 3, sizeof(cl_int), &iWidth);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_int), &iFilterWidth);
oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
//te = GetTickCount();
//printf("Set Kernel Arguments: %d [Milli sec]
",(te - ts));

//Setting the local and global parameters
szGlobalWorkSize[0] = iWidth;
szGlobalWorkSize[1] = iHeight;
szLocalWorkSize[0] = 1;
szLocalWorkSize[1] = 1;

//ts = GetTickCount();
//clGetEventProfilingInfo(GPUevent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start,NULL);

// Launch kernel
//shrLog("clEnqueueNDRangeKernel (Convolution)...

");
//ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, &GPUevent);
ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);
oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup); //CL_INVALID_KERNEL,CL_SUCCESS
//tst = GetTickCount();
//clWaitForEvents(1,&GPUevent);
//tet = GetTickCount();
//clFinish(cqCommandQueue);

//clGetEventProfilingInfo(GPUevent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end,NULL);
//te = GetTickCount();
//printf("Launch Kernel: %d [Milli sec] 

",(te - ts));
//printf("Event Wait Time for Kernel Execution: %d [Milli sec]
",(tet - tst));
//printf("Kernel Execution time: %ld [nano sec]
",(end-start));

//ts = GetTickCount();
//clGetEventProfilingInfo(transferevent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start,NULL);
// Read back results and check accumulated errors
//shrLog("clEnqueueReadBuffer (GPU Output)...

");
//ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_FALSE, 0, mem_sizeImage, outimage, 0, NULL, &transferevent);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_TRUE, 0, mem_sizeImage, outimage, 0, NULL,NULL);

//Sleep(50);

//oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
//clWaitForEvents(1,&transferevent);

//clGetEventProfilingInfo(transferevent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end,NULL);
//te = GetTickCount();
//printf("Read Buffer: %d [Milli sec] 

",(te - ts));
//printf("Data copying time back from GPU: %ld [nano sec]
",(end-start));

oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
//clFinish(cqCommandQueue);

}

//**********************************************************************
// “CPU” Host processing Convolution function for comparison purposes
// *********************************************************************
float* CPUConvolution(float *inimage, float *mask)
{
//printf("
Entered into the CPU Convolution Function
");
int k = 0;
float *pfResult;

// -----------------------  Dynamic Mem in C++ -----------------------
pfResult = new float [iNumElements];

int idx = 0;
int jdx = 0;
 for (int i = 0; i &lt; iHeight; i++)
 {
	 for (int j = 0; j &lt; iWidth; j++)
	 {
		float sum = 0;
			for (int m = 0; m &lt; iFilterWidth; m++)
			{
				for (int n = 0; n &lt; iFilterWidth; n++)
				{
            		idx = i - m;
					jdx = j - n;
					if (idx &gt;= 0 && jdx &gt;= 0)
                 			sum += img[(i * iWidth) + j + n] * mask[(m * iFilterWidth) + n];
				}
			}
                    if (sum &gt; 255) 
						sum = 255;
                    if (sum &lt; 0) 
						sum = 0;
					pfResult[k] = sum;
					k++;

	}
}

return(pfResult);

}

// **********************************************************************************************
// Main function
// **********************************************************************************************
int main(int argc, char **argv)
{
printf("The program execution has started(Entered into main)…
");
//float kernelconv[] = {1,1,1,1,1,1,0,1,0,1,1,1,-4,1,1,1,0,1,0,1,1,1,1,1,1};
float kernelconv[] = {0,1,0,1,-4,1,0,1,0,};
long ts,te;
mask = kernelconv; //Creating the mask to execute the kernel

// Calling the function to load image for processing
readsourceimage();

// Creating a memory location to store the output of the GPU
GPUoutput = (float *)malloc(sizeof(cl_float) * iNumElements);

// Calling of GPU Device Preparation function
GPUDevicePreparation();

ts = GetTickCount();
// Calling of GPU convolution
//for(int i = 0; i &lt; 10; i++)
//{
	GPUConvolution(img,mask,GPUoutput);
//}
te = GetTickCount();
printf("

Execution time on GPU: %d [Milli sec]
",(te - ts));

ts = GetTickCount();
//Calling of CPU convolution
//for(int i = 0; i &lt; 10; i++)
//{
	CPUoutput = CPUConvolution(img,mask);
//}
te = GetTickCount();
printf("

Execution time on CPU: %d [Milli sec]
",(te - ts));

byte *tmp1, *tmp2;

tmp1 = new byte[iNumElements];
tmp2 = new byte[iNumElements];

//Retriving the results back from GPU
for(int i = 0;i &lt; iNumElements; i++)
{
	// printf("%f	",GPUoutput[i]);

	tmp1[i] = (byte) floor(abs(GPUoutput[i]));	
}

srcimg-&gt;imageData = (char *) tmp1;

srcimg-&gt;imageDataOrigin = srcimg-&gt;imageData;

//Retriving the results back from CPU

for(int i = 0;i &lt; iNumElements; i++)
{
	//printf("%f	",CPUoutput[i]);

	tmp2[i] = (byte) floor(abs(CPUoutput[i]));	
}

tempFrame-&gt;imageData = (char *) tmp2;

tempFrame-&gt;imageDataOrigin = tempFrame-&gt;imageData;


cvNamedWindow("GPU_output",1);
cvNamedWindow("CPU_output",1);
cvShowImage("GPU_output",srcimg);
cvShowImage("CPU_output",tempFrame);
	
clFinish(cqCommandQueue);
shrLog("

Press any Key to EXIT
");
cvWaitKey();
//cvReleaseImage(&srcimg);
//cvReleaseImage(&tempFrame);

//shrLog(LOGFILE, 0, "%f	", GPUoutput[i]);

} // End of MAIN

// Cleanup and exit code
// *********************************************************************
void Cleanup(int iExitCode)
{

// Cleanup allocated objects
shrLog("Starting Cleanup...

");
getchar();
if(cPathAndName)free(cPathAndName);
if(cSourceCL)free(cSourceCL);
if(ckKernel)clReleaseKernel(ckKernel);
if(cpProgram)clReleaseProgram(cpProgram);
if(cqCommandQueue)clReleaseCommandQueue(cqCommandQueue);
if(cxGPUContext)clReleaseContext(cxGPUContext);
if (cmDevSrcA)clReleaseMemObject(cmDevSrcA);
if (cmDevSrcB)clReleaseMemObject(cmDevSrcB);
if (cmDevDst)clReleaseMemObject(cmDevDst);

// Free host memory
free(srcA); 
free(srcB);
free(GPUoutput);
free(CPUoutput);

// finalize logs and leave
if (bNoPrompt)
{
    shrLogEx(LOGBOTH | CLOSELOG, 0, "oclConvolution.exe Exiting...

");
}
else
{
shrLogEx(LOGBOTH | CLOSELOG, 0, "oclConvolution.exe Exiting…
Press <Enter> to Quit
");
getchar();
}
//cvWaitKey();
getchar();
exit (iExitCode);
}

Here is the code I have used to execute the convolution on Image

Okay. I have a few comments about the code.

First, calling GetTickCount() before and after doing some OpenCL operations is going to give you incorrect times in some cases. You will find, for example, that the time it takes to execute clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &GPUevent) is very little. In fact, most clEnqueueXXX functions will be very fast. Why? Because they are asynchronous. What this means is that these functions return before the work has finished. The best way to measure how long it really takes to execute a kernel or to do any of these clEnqueueXXX operations is using the profiling API that you already know how to use.

Second, if you are not sure of what to do with the “local_size” argument to clEnqueueNDRangeKernel, simply pass NULL. Forcing the local_size to be 1 will reduce the performance of your kernel a lot. So try this instead:

ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, &NDRangeEvent);

Third, you are not checking the error code returned by clGetEventProfilingInfo. If the information you are requesting is not ready yet, clGetEventProfilingInfo will return CL_PROFILING_INFO_NOT_AVAILABLE. My recommendation is this: first, use a different cl_event variable for each function call you want to measure; second, don’t call clGetEventProfilingInfo in the middle of the program. Instead, make all of your OpenCL calls, then call clFinish() or clWaitForEvents() on the last call and only after all of this is done then query the profiling information with clGetEventProfilingInfo. So try something like this:


ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_TRUE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &writeSrcABufferEvent);
[...]
ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, &NDRangeEvent);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_TRUE, 0, mem_sizeImage, outimage, 0, NULL, &readDstEvent);
// This clFinish here is to make sure that our calls to clGetEventProfilingInfo will not return an error.
// OpenCL experts: yes, I know this is not actually necessary in this case; bear with me :)
clFinish(cqCommandQueue);

// After the clFinish we can finally call clGetEventProfilingInfo
ciErrNum  = clGetEventProfilingInfo(writeSrcABufferEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start, NULL);
oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup);

ciErrNum  = clGetEventProfilingInfo(writeSrcABufferEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end, NULL);
oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup);

// Print "end - start" here if you want.

// Use clGetEventProfilingInfo on the events NDRangeEvent and readDstEvent here if you want.


Try this and let us know what results you get :slight_smile:

Thanks for ur support…
I will try this and say you the results…

Dear david garcia,

As you said I have used the event profiling and measured the execution times for different size of images

Image -> GPU => Writing
Kernel -> GPU => Writing
GPU -> Image => Reading

Observation 1:

Image Size : 375 x 565
Image -> GPU : 2596800 nano sec
Kernel -> GPU : 4352 nano sec
Execution of Kernel : 6994400 nano sec
GPU -> Image : 1764992 nano sec
Execution on CPU : 32 milli sec (Using GetTickCount())

GPU in total takes 11.37 milli sec

Observation 2:

Image Size : 2560 x 1024
Image -> GPU : 10595680 nano sec
Kernel -> GPU : 5088 nano sec
Execution of Kernel : 58760128 nano sec
GPU -> Image : 14339776 nano sec
Execution on CPU : 359 milli sec (Using GetTickCount())

GPU in total takes 83.85milli sec

Observation 3:

Image Size : 640 x 480
Image -> GPU : 3803072 nano sec
Kernel -> GPU : 5056 nano sec
Execution of Kernel : 6881120 nano sec
GPU -> Image : 1447264 nano sec
Execution on CPU : 47 milli sec (Using GetTickCount())

GPU in total takes 12.3 milli sec

Observation 4:

Image Size : 8192 x 8192
Image -> GPU : 249863520 nano sec
Kernel -> GPU : 5120 nano sec
Execution of Kernel : 1508864704 nano sec
GPU -> Image : 288091776 nano sec
Execution on CPU : 8034 milli sec (Using GetTickCount())

GPU in total takes 2046 milli sec

Observation 5:

Image Size : 5000 x 5000
Image -> GPU : 96716384 nano sec
Kernel -> GPU : 5152 nano sec
Execution of Kernel : 792512800 nano sec
GPU -> Image : 151397632 nano sec
Execution on CPU : 2980 milli sec (Using GetTickCount())

GPU in total takes 1039 milli sec

Observation 6:

Image Size : 6000 x 6000
Image -> GPU : 138521568 nano sec
Kernel -> GPU : 5152 nano sec
Execution of Kernel : 1141955072 nano sec
GPU -> Image : 154220928 nano sec
Execution on CPU : 4181 milli sec (Using GetTickCount())

GPU in total takes 1434 milli sec

Now I have some questions

  1. Is there any method to compare the execution time between GPU and CPU?
    (As we are using event profiling for GPU and GetTickCount for CPU)

  2. The command clBuildProgram() is taking so much time. Why it happens?

Is there any method to compare the execution time between GPU and CPU?

I don’t understand very well. There are two things you may want to measure in the CPU. One of them is “how much time does it take OpenCL to enqueue all these commands”; the other one is “how much time does it take since I enqueue the first command until the GPU has completely finished doing all the work and has written the image back”.

The first one, “how much it takes to enqueue these commands” doesn’t matter. The second one measures the total amount of time that it takes to send all the data to the GPU, then run the kernel and then read it back to the CPU. The way to measure this one is simple: call GetTickCount() once before you submit the data to the GPU and call it again after clFinish(cqCommandQueue) returns.

The command clBuildProgram() is taking so much time. Why it happens?

clBuildProgram is the function that compiles your kernel into assembly for the GPU. It is true that it’s usually slow (sometimes it takes seconds). The good thing is that you only need to call this function only one time. After the first time you can use clGetProgramInfo() to read back the GPU program after it is compiled. Look in the spec for CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. Your application can then store the program binary into a file and the next time that you run the application you can use clCreateProgramWithBinary() instead of clCreateProgramWithSource(). When you create a program from a binary then clBuildProgram() is much faster.

I am facing a problem for a 10000 x 10000 image.

While I am passing an 10000 x 10000 image on to the GPU for execution I am getting a problem: It is showing an error at line ____ (CL_PROFILING_INFO_NOT_AVAILABLE)

The line number is where i am calling the event profiling of the NDrange event.

My system monitor is getting completely black and after that windows is showing a warning that
!!! Display driver NVIDIA windows kernel mode driver version 258.19 stopped responding and has sucessfully recovered !!!

After this message the command prompt is showing the above error.

Why this happens? I previously faced the same problem while trying to pass a 7 x 7, 5 x 5, 9 x9 kernel also.

Then also I used to get the warning message from windows.

I would bet that your kernel is taking more than 5 seconds to run. AFAIK NVidia imposes a 5-second limit on kernels that run on the same GPU that is used to display the screen. Either you can buy a separate GPU to do OpenCL or you can divide your NDRange into several pieces.

You can use the global_work_offset argument to clEnqueueNDRangeKernel() to process different parts of the image. For example, you can divide your 10000x10000 image into four 5000x5000 NDRanges.

You can disable the timeout:
http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx