Different execution time for CPU and GPU

I am using an NVIDIA GPU with OpenCL. When I run the program with Ctrl+F5 (Start Without Debugging), the GPU takes more time than the CPU, but when I run it under the debugger the CPU takes more time than the GPU. The results are:

Start Without Debugging -> CPU time = 6127 ms, GPU time = 6240 ms
Start With Debugging -> CPU time = 18354 ms, GPU time = 9125 ms

What is the reason for this difference? I am using Visual Studio 2010.
The code is below. What is going wrong? Thanks.

// Hello.cpp : Defines the entry point for the console application.
//

//#include <stdafx.h>
#include<stdio.h>
#include<stdlib.h>
#include<conio.h>
#include<time.h>
#include "CL/cl.h"
#define DATA_SIZE 100000
const char *KernelSource =
"kernel void hello(global float *input , global float *output)
"
"{
"
" size_t id =get_global_id(0);
"
"output[id] =input[id]*input[id];
"
"} "
"
"
"
";
//float start_time,end_time;

int main(void)
{
double start_time,end_time;
start_time=clock();
cl_context context;
cl_context_properties properties[3];
cl_kernel kernel;
cl_command_queue command_queue;
cl_program program;
cl_int err;
cl_uint num_of_platforms=0;
cl_platform_id platform_id;
cl_device_id device_id;
cl_uint num_of_devices=0;
cl_mem input,output;
size_t global;
float inputData[DATA_SIZE];
for(int j=0;j<DATA_SIZE;j++)
{
inputData[j]=(float)j;
}

float results[DATA_SIZE];//={0};

// int i;

//retrieve a list of platform variable
if(clGetPlatformIDs(1,&platform_id,&num_of_platforms)!=CL_SUCCESS)
{
	printf("Unable to get platform_id

");
return 1;
}

//try to get a supported device
if(clGetDeviceIDs(platform_id,CL_DEVICE_TYPE_CPU,1,&device_id,
&num_of_devices)!=CL_SUCCESS)
{
	printf("unable to get device_id

");
return 1;
}

//context properties list -must be terminated with 0
properties[0]=CL_CONTEXT_PLATFORM;
properties[1]=(cl_context_properties) platform_id;
properties[2]=0;

//create  a context with the GPU device
context=clCreateContext(properties,1,&device_id,NULL,NULL,&err);

//create command queue using the context and device
command_queue=clCreateCommandQueue(context,device_id,0,&err);

//create a program from the kernel source code 
program=clCreateProgramWithSource(context,1,(const char**)
&KernelSource,NULL,&err);

//compile the program
err=clBuildProgram(program,0,NULL,NULL,NULL,NULL);
if((err!=CL_SUCCESS))
{
	printf("build error  

“,err);
size_t len;
char buffer[4096];
//get the build log
clGetProgramBuildInfo(program,device_id,CL_PROGRAM_BUILD_LOG,sizeof(buffer),buffer,&len);
printf("----build Log----\n%s\n",buffer);
exit(1);

// return 1;
}

//specify which kernel from the program to execute
kernel=clCreateKernel(program,"hello",&err);

//create buffers for the input and output
input=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

output=clCreateBuffer(context,CL_MEM_WRITE_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

//load data into the input buffer

clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,
       sizeof(float)*DATA_SIZE,inputData,0,NULL,NULL);

//set the argument list for the kernel command
clSetKernelArg(kernel,0,sizeof(cl_mem),&input);
clSetKernelArg(kernel,1,sizeof(cl_mem),&output);
global=DATA_SIZE;

//enqueue the kernel command for execution 
clEnqueueNDRangeKernel(command_queue,kernel,1,NULL,&global,NULL,0,NULL,NULL);
clFinish(command_queue);

//copy the results from out of the buffer
clEnqueueReadBuffer(command_queue,output,CL_TRUE,0,sizeof(float)*DATA_SIZE,results,0,
	NULL,NULL);

//print the results
printf("output:");
for(int i=0;i<DATA_SIZE;i++)
{
	printf("%f\n",results[i]);
	//printf("no. of times loop run %d\n",count);
}

//cleanup-release OpenCL resources 

clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
end_time=clock();
printf("execution  time is%f",end_time-start_time); 
_getch();
return 0;

}

When you run your code without debugging, it should execute faster since the debugger doesn’t have to sit in the background waiting for exceptions or allowing you to pause the program. Your kernel is quite simple and won’t stretch the processing power of your GPU, hence the CPU should be faster when not debugging.

The reason for your kernel not being faster on the GPU is that for every two pieces of data you transfer, you do one maths operation. It is faster for the CPU to read those elements from RAM than it is to transfer them from RAM to GPU because the PCIe bus is slow. Ideally, you want to do a lot of operations on the GPU for each element of data that gets sent to the GPU.

I also just noticed that you are timing the entire program rather than just the kernel and data transfers. This is unfair to OpenCL as you are also timing how long it takes to compile your kernel.
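If you want to keep using clock(), one option is to take the readings only around the data transfer and kernel calls, after clBuildProgram has finished. A minimal sketch against your listing (reusing your command_queue, kernel, input, output, global, inputData and results; on Windows, clock() differences are in milliseconds):

// sketch: time only the transfers + kernel, not context creation or compilation
double t0 = clock();
clEnqueueWriteBuffer(command_queue, input, CL_TRUE, 0,
        sizeof(float)*DATA_SIZE, inputData, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
global = DATA_SIZE;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(command_queue);
clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0,
        sizeof(float)*DATA_SIZE, results, 0, NULL, NULL);
double t1 = clock();
printf("OpenCL transfer + kernel time: %f ms\n", t1 - t0);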

Please help me, I am new to OpenCL. Whenever I increase the value of DATA_SIZE I get a stack overflow, so I am not able to show that the GPU performs better than the CPU for larger data. I timed the whole program because we read both results back on the CPU, which is why I put the timer there. Any modification you can suggest would be valuable. Please help.

If you continue using element-wise multiplication of two vectors to demonstrate the speed of a GPU vs. a CPU then the CPU will always do well. You should try to demonstrate a different problem.

Taking an example from linear algebra, the classic operation is matrix multiplication. If you multiply two NxN matrices, the amount of data sent to the GPU is 2N^2 elements, but each of the N^2 output elements needs N multiplications and N-1 additions, so the total work is N^2(2N-1), roughly 2N^3 floating point operations. The ratio of operations to transferred elements is therefore about 2N^3 / (2N^2) = N. For a 1000x1000 matrix that means on the order of a thousand floating point operations are performed for every element transferred to the GPU, compared with one operation per two elements in your current kernel. In such a situation the GPU excels, because you send the data once and then sit back while a long calculation runs on the device.
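To make that concrete, a naive (unoptimized) OpenCL kernel for C = A*B with NxN matrices could look like the sketch below; the kernel and argument names are just examples, and it would be enqueued with a 2D global size of {N, N}:

kernel void matmul(global const float *A, global const float *B,
                   global float *C, const int N)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;
    // N multiplications and N-1 additions per output element
    for (int k = 0; k < N; k++)
        sum += A[row*N + k] * B[k*N + col];
    C[row*N + col] = sum;
}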

As for working around your stack overflow when increasing DATA_SIZE, the mistake you have made is to declare inputData as an array of float inside the main function. Arrays declared there live on the stack, which is a comparatively small region of memory. You should instead work with a pointer to an array of float, i.e. float* inputData = new float[DATA_SIZE]; <<lines of code>> delete[] inputData; That way the array is allocated on the heap, which can be multiple gigabytes if you are using a 64-bit operating system. You would have to do the same thing for results, i.e. float* results …, as sketched below.
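A minimal sketch of that change against your listing (only the allocation and cleanup differ; everything in between stays the same):

// allocate the host arrays on the heap instead of the stack
float* inputData = new float[DATA_SIZE];
float* results   = new float[DATA_SIZE];
for (int j = 0; j < DATA_SIZE; j++)
    inputData[j] = (float)j;

// ... create the buffers, enqueue the write, kernel and read exactly as before ...

// free the heap arrays once you are done with them
delete[] inputData;
delete[] results;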

Thanks for the quick and valuable reply :slight_smile:
thanks

As you told me previously, it is unfair to time the whole program, so I used clGetEventProfilingInfo to measure the kernel time on the CPU and the GPU, but now I am getting an error. Can you help?
My program is:
// Hello.cpp : Defines the entry point for the console application.
//

//#include <stdafx.h>
#include<iostream>
#include<stdio.h>
#include<stdlib.h>
#include<conio.h>
#include<time.h>
#include "CL/cl.h"
#define DATA_SIZE 100000
using namespace std;
const char *KernelSource =
"kernel void hello(global float *input , global float *output)
"
"{
"
" size_t id =get_global_id(0);
"
"output[id] =input[id]*input[id];
"
"} "
"
"
"
";
//float start_time,end_time;

int main(void)
{
cl_ulong start_time,end_time,elapsed_time;
//start_time=clock();
cl_context context;
cl_context_properties properties[3];
cl_kernel kernel;
cl_command_queue command_queue;
cl_program program;
cl_int err;
cl_uint num_of_platforms=0;
cl_platform_id platform_id;
cl_device_id device_id;
cl_uint num_of_devices=0;
cl_mem input,output;
cl_event gpuExec;
size_t global;
float executionTimeInSeconds;

float inputData[DATA_SIZE];
for(int j=0;j<DATA_SIZE;j++)
{
 inputData[j]=(float)j;
}

float results[DATA_SIZE];//={0};

// int i;

//retrieve a list of platform variable
if(clGetPlatformIDs(1,&platform_id,&num_of_platforms)!=CL_SUCCESS)
{
	printf("Unable to get platform_id

");
return 1;
}

//try to get a supported GPU device
if(clGetDeviceIDs(platform_id,CL_DEVICE_TYPE_GPU,1,&device_id,
&num_of_devices)!=CL_SUCCESS)
{
	printf("unable to get device_id

");
return 1;
}

//context properties list -must be terminated with 0
properties[0]=CL_CONTEXT_PLATFORM;
properties[1]=(cl_context_properties) platform_id;
properties[2]=0;

//create  a context with the GPU device
context=clCreateContext(properties,1,&device_id,NULL,NULL,&err);

//create command queue using the context and device
command_queue=clCreateCommandQueue(context,device_id,0,&err);

//create a program from the kernel source code 
program=clCreateProgramWithSource(context,1,(const char**)
&KernelSource,NULL,&err);

//compile the program
err=clBuildProgram(program,0,NULL,NULL,NULL,NULL);
if((err!=CL_SUCCESS))
{
	printf("build error  

“,err);
size_t len;
char buffer[4096];
//get the build log
clGetProgramBuildInfo(program,device_id,CL_PROGRAM_BUILD_LOG,sizeof(buffer),buffer,&len);
printf("----build Log----\n%s\n",buffer);
exit(1);

// return 1;
}

//specify which kernel from the program to execute
kernel=clCreateKernel(program,"hello",&err);

//create buffers for the input and output
input=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

output=clCreateBuffer(context,CL_MEM_WRITE_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

//load data into the input buffer

clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,
       sizeof(float)*DATA_SIZE,inputData,0,NULL,NULL);

//set the argument list for the kernel command
clSetKernelArg(kernel,0,sizeof(cl_mem),&input);
clSetKernelArg(kernel,1,sizeof(cl_mem),&output);
global=DATA_SIZE;

//enqueue the kernel command for execution 
clEnqueueNDRangeKernel(command_queue,kernel,1,NULL,&global,NULL,0,NULL,&gpuExec);
clFinish(command_queue);

//copy the results from out of the buffer
clEnqueueReadBuffer(command_queue,output,CL_TRUE,0,sizeof(float)*DATA_SIZE,results,0,
	NULL,NULL);

//print the results
printf("output:");
for(int i=0;i<DATA_SIZE;i++)
{
	printf("%f\n",results[i]);
	//printf("no. of times loop run %d\n",count);
}
//Calculating the time…

clGetEventProfilingInfo(gpuExec, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL);
    clGetEventProfilingInfo(gpuExec, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL);

	/*calculate total elapsed time*/
elapsed_time = end_time-start_time;
executionTimeInSeconds = (float)(1.0e-9 *elapsed_time);
//printf("%f",&executionTimeInSeconds);
cout&lt;&lt;"execution time"&lt;&lt;executionTimeInSeconds;
_getch();
/*end_time=clock();
printf("execution  time is%f",end_time-start_time); 
_getch();*/
//cleanup-release OpenCL resources 

clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);

return 0;

}

When I run this program on Windows 7 64-bit with VS 2010, I get this error:
Problem signature:
Problem Event Name: APPCRASH
Application Name: Hello.exe
Application Version: 0.0.0.0
Application Timestamp: 50a46359
Fault Module Name: igdrcl64.dll
Fault Module Version: 8.15.10.2712
Fault Module Timestamp: 4f7119e9
Exception Code: c0000005
Exception Offset: 000000000001b5e9
OS Version: 6.1.7601.2.1.0.768.2
Locale ID: 16393
Additional Information 1: a493
Additional Information 2: a493a1183067b5107213879be06ab3eb
Additional Information 3: c660
Additional Information 4: c660a08c8ef8ab0df12390ba7124949d

thanks in advance

To use the profiling commands, you first need to enable profiling for the specific command queue. The line

command_queue=clCreateCommandQueue(context,device_id,0,&err);

Should be changed to

command_queue=clCreateCommandQueue(context,device_id,CL_QUEUE_PROFILING_ENABLE,&err);

That enables profiling.
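Putting it together, a sketch of the whole pattern against your listing (reusing your gpuExec event and your start_time/end_time variables; CL_PROFILING_COMMAND_START/END return device timestamps in nanoseconds):

// queue must be created with profiling enabled
command_queue = clCreateCommandQueue(context, device_id,
                                     CL_QUEUE_PROFILING_ENABLE, &err);

// attach an event to the command you want to time
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL,
                       0, NULL, &gpuExec);
clFinish(command_queue);

// read the device timestamps after the command has completed
clGetEventProfilingInfo(gpuExec, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start_time, NULL);
clGetEventProfilingInfo(gpuExec, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end_time, NULL);
printf("kernel time: %f s\n", (end_time - start_time) * 1.0e-9);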

Thanks a lot …
this forum rocks :slight_smile:
But one more question: when I use clGetEventProfilingInfo, which time does it measure?
Is it

(data transfer from CPU to GPU + GPU processing time + transfer of the results back from GPU to CPU)
or only (GPU processing time)?

thanks

CL_PROFILING_COMMAND_START (cl_ulong): A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event starts execution on the device.

CL_PROFILING_COMMAND_END (cl_ulong): A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event has finished execution on the device.

It's the beginning and end of your profiled event. So if you profile the kernel execution, it will be the kernel time; if you profile the writeBuffer, it will be your CPU-to-GPU transfer time.

Please reply, forum…
thanks

clint3112 is right, you are just recording the time taken by the GPU to execute the kernel. If you want the time taken to transfer data, you need to time the clEnqueueWriteBuffer command using CPU timers, since you have chosen a blocking (synchronous) transfer. Use events to profile data transfers when using asynchronous transfers.
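For the blocking write in your listing, a CPU-timer sketch (again relying on clock() differences being milliseconds on Windows) would be:

// time a blocking (CL_TRUE) host-to-device transfer with a CPU timer
double t0 = clock();
clEnqueueWriteBuffer(command_queue, input, CL_TRUE, 0,
                     sizeof(float)*DATA_SIZE, inputData, 0, NULL, NULL);
double t1 = clock();
printf("host-to-device transfer: %f ms\n", t1 - t0);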

OK, but when we use a dual-core CPU as the OpenCL device, what will it measure? There is no transfer over PCIe to another device, so what will it show in the CPU case? The transfer only happens for the GPU, so how do we compare the CPU with the GPU?

"In your experience, what do people use to compare CPU and GPU? Do they include this transfer time in the comparison? Kindly give a suggestion."

thanks

On CPUs, I'd expect that profiling the data transfers will give you very small times. For comparing CPUs and GPUs, I think you should record both the time taken to transfer data and the kernel execution time. You only need to consider the OpenCL kernel compile time in certain cases. Perhaps report the times separately, i.e. time spent compiling, time spent transferring data, and time spent executing the kernel. Then your reader can decide what is important.

OK, thanks for the reply.
Please correct me if I am wrong:

1. data transfer from CPU to GPU is done by clEnqueueReadBuffer()?
2. data transfer from GPU to CPU is done by clEnqueueWriteBuffer()?

If that is correct, how can I time clEnqueueReadBuffer() and clEnqueueWriteBuffer() separately, and do I also have to include the time for creating the buffer objects with clCreateBuffer for input and output?
Please support.
thanks

Hi,

Write buffer will write to your OpenCL device, read buffer will read from your device.
Just create a cl_event pEvent, pass it to the write/read call, and after it has finished you get the timing with

clGetEventProfilingInfo(pEvent, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &ullEnd,   nullptr);
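For example, a sketch of timing the write with an event (pEvent, ullStart and ullEnd are just example names, and the queue must have been created with CL_QUEUE_PROFILING_ENABLE):

cl_event pEvent;
cl_ulong ullStart, ullEnd;

// non-blocking write with an event attached
clEnqueueWriteBuffer(command_queue, input, CL_FALSE, 0,
                     sizeof(float)*DATA_SIZE, inputData, 0, NULL, &pEvent);
clWaitForEvents(1, &pEvent);

clGetEventProfilingInfo(pEvent, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &ullStart, NULL);
clGetEventProfilingInfo(pEvent, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &ullEnd, NULL);
printf("write buffer time: %f ms\n", (ullEnd - ullStart) * 1.0e-6);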

I am asking about the data transfer: from which command does it actually happen?
Please reply, forum…
thanks