Why is the CPU execution time less than the GPU time?

// Hello.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "CL/cl.h"
#define DATA_SIZE 10
const char *KernelSource =
"kernel void hello(global float *input, global float *output)\n"
"{\n"
"    size_t id = get_global_id(0);\n"
"    output[id] = input[id] * input[id];\n"
"}\n";
//float start_time,end_time;

int main(void)
{
clock_t start_time,end_time;
start_time=clock();
cl_context context;
cl_context_properties properties[3];
cl_kernel kernel;
cl_command_queue command_queue;
cl_program program;
cl_int err;
cl_uint num_of_platforms=0;
cl_platform_id platform_id;
cl_device_id device_id;
cl_uint num_of_devices=0;
cl_mem input,output;
size_t global;

float inputData[DATA_SIZE]={0,1,2,3,4,5,6,7,8,9};
float results[DATA_SIZE]={0};

int i;

//retrieve a list of platform variable
if(clGetPlatformIDs(1,&platform_id,&num_of_platforms)!=CL_SUCCESS)
{
	printf("Unable to get platform_id

");
return 1;
}

//try to get a supported GPU device
if(clGetDeviceIDs(platform_id,CL_DEVICE_TYPE_GPU,1,&device_id,
&num_of_devices)!=CL_SUCCESS)
{
	printf("unable to get device_id

");
return 1;
}

//context properties list -must be terminated with 0
properties[0]=CL_CONTEXT_PLATFORM;
properties[1]=(cl_context_properties) platform_id;
properties[2]=0;

//create  a context with the GPU device
context=clCreateContext(properties,1,&device_id,NULL,NULL,&err);

//create command queue using the context and device
command_queue=clCreateCommandQueue(context,device_id,0,&err);

//create a program from the kernel source code 
program=clCreateProgramWithSource(context,1,(const char**)
&KernelSource,NULL,&err);

//compile the program
err=clBuildProgram(program,0,NULL,NULL,NULL,NULL);
if(err!=CL_SUCCESS)
{
	printf("build error %d\n",err);
size_t len;
char buffer[4096];
//get the build log
clGetProgramBuildInfo(program,device_id,CL_PROGRAM_BUILD_LOG,sizeof(buffer),buffer,&len);
printf("----build Log----\n%s\n",buffer);
exit(1);

// return 1;
}

//specify which kernel from the program to execute
kernel=clCreateKernel(program,"hello",&err);

//create buffers for the input and output
input=clCreateBuffer(context,CL_MEM_READ_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

output=clCreateBuffer(context,CL_MEM_WRITE_ONLY,sizeof(float)*DATA_SIZE,NULL,NULL);

//load data into the input buffer

clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,
       sizeof(float)*DATA_SIZE,inputData,0,NULL,NULL);

//set the argument list for the kernel command
clSetKernelArg(kernel,0,sizeof(cl_mem),&input);
clSetKernelArg(kernel,1,sizeof(cl_mem),&output);
global=DATA_SIZE;

//enqueue the kernel command for execution 
clEnqueueNDRangeKernel(command_queue,kernel,1,NULL,&global,NULL,0,NULL,NULL);
clFinish(command_queue);

//copy the results out of the buffer
clEnqueueReadBuffer(command_queue,output,CL_TRUE,0,sizeof(float)*DATA_SIZE,results,0,
	NULL,NULL);

//print the results
printf("output:");
for(i=0;i&lt;DATA_SIZE;i++)
{
	printf("%f",results[i]);
}

//cleanup-release OpenCL resources 

clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(command_queue);
clReleaseContext(context);
end_time=clock();
printf("execution time is %f ms\n",1000.0*(end_time-start_time)/CLOCKS_PER_SEC);
return 0;

}

In this program, the CPU (3rd-generation Intel i5) execution time is 750 ms, while the Nvidia GPU takes 900 ms.

Why is this happening, when the GPU should take less time than the CPU?

The problem you are running into is that GPUs are designed to handle huge amounts of work, not small amounts. To give some more detail: there is overhead in copying data from host memory to device memory, overhead in launching a kernel, and finally overhead in copying the results back. The kernel launch overhead is fairly constant, while the transfer overheads depend on the size of the data plus a constant cost in the driver.
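
For example, you can separate the kernel's own execution time from the transfer and launch overheads with OpenCL event profiling. This is only a sketch layered on top of the variables already in your program (context, device_id, kernel, global); the profiling flag on the queue and the event queries are the additions:

cl_int perr;
/* the queue must be created with profiling enabled */
cl_command_queue profq = clCreateCommandQueue(context, device_id,
                                              CL_QUEUE_PROFILING_ENABLE, &perr);
cl_event evt;
clEnqueueNDRangeKernel(profq, kernel, 1, NULL, &global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
printf("kernel alone: %f ms\n", (t_end - t_start) * 1e-6); /* timestamps are in nanoseconds */
clReleaseEvent(evt);
clReleaseCommandQueue(profq);

Timing the clEnqueueWriteBuffer and clEnqueueReadBuffer calls the same way will show how much of your 900 ms is spent moving 40 bytes across the bus and waiting for the driver, rather than squaring numbers.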

If you send a lot of work to the GPU then these overheads account for a proportionally smaller part of the processing time.

The other problem is that your problem size is so small it does not even fill a single streaming multiprocessor. To fully utilise a GPU, each work group should contain a few hundred threads and there should be a few hundred work groups.
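
As a rough sketch of what a better-shaped launch looks like (the 256 here is an assumed work-group size, so check CL_KERNEL_WORK_GROUP_SIZE or CL_DEVICE_MAX_WORK_GROUP_SIZE first, and the kernel would need an if (id < n) guard, with n passed as an extra argument, once global is rounded up past the real element count):

size_t local  = 256;                                    /* assumed work-group size */
size_t count  = 1 << 20;                                /* a problem size worth sending to a GPU */
size_t global = ((count + local - 1) / local) * local;  /* round up to a multiple of local */
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);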

With regard to our previous discussion on how many threads run in parallel, I should probably elaborate further. Each streaming multiprocessor executes a certain number of threads in parallel, 48 in your case in the best-case scenario. Let's say these threads reach a memory access instruction. It doesn't matter whether it is global or local memory; both take a certain amount of time to return the data to the thread. During that time, these threads all block, waiting for the data from memory. Rather than letting the hardware sit idle, the thread scheduler switches to threads from a different thread group, called a warp in Nvidia terminology. Each multiprocessor can keep track of the execution status of several hundred threads - up to 1536 on your GPU, if memory serves. That is why you need so many threads to make sure that the GPU does not sit idle.
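
To put those numbers against your launch, here is a standalone back-of-envelope check (the multiprocessor count is an assumption about your card; the 1536 figure is the one quoted above):

#include <stdio.h>

int main(void)
{
    const int sm_count        = 2;    /* assumed number of streaming multiprocessors */
    const int resident_per_sm = 1536; /* threads each SM can track, as quoted above */
    const int launched        = 10;   /* DATA_SIZE in your program */

    printf("work items needed to keep the GPU busy: %d\n", sm_count * resident_per_sm);
    printf("work items actually launched:           %d\n", launched);
    return 0;
}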

Now for a comment on your chosen problem, squaring each element of a vector. This is not a good problem for a GPU, because the amount of maths done per element is less than the number of memory access operations. PCIe bandwidth also hampers you here; it is actually the biggest shortcoming of your code. Using plain reads and writes, your maximum observed PCIe bandwidth might reach about 5 GB/s. Since a float is 4 bytes, you can send about 10^9 floats to the GPU each second. Since each element results in one floating point operation, that means you will perform about 10^9 floating point operations per second, or 1 GFLOPS. Even a single core of your CPU can beat that rate, and it has slightly lower overheads, because the transfer from host memory to device memory is just a copy from one location in RAM to another, which is faster than sending data over the PCIe bus. You need many more operations per element of data before a GPU becomes worthwhile.
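
The arithmetic above, written out (the 5 GB/s figure is the assumption from the text, not a measurement):

#include <stdio.h>

int main(void)
{
    const double pcie_bytes_per_sec = 5e9;  /* assumed achievable PCIe bandwidth */
    const double bytes_per_float    = 4.0;
    const double flops_per_element  = 1.0;  /* one multiply for x*x */

    double floats_per_sec = pcie_bytes_per_sec / bytes_per_float;  /* ~1.25e9 floats/s */
    double flops          = floats_per_sec * flops_per_element;    /* ~1 GFLOPS, PCIe-bound */
    printf("transfer-limited rate: %.2e floats/s -> %.2e FLOPS\n", floats_per_sec, flops);
    return 0;
}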