How to measure GPU performance

zvivered · November 21, 2013, 7:57pm

Hello,

I want to measure the performance of the GPU on an NVIDIA GeForce 9400 GT.

The steps in the host code are:

clSetKernelArg

clCreateCommandQueue

*start measure
clEnqueueNDRangeKernel

clEnqueueReadBuffer
*end measure

To compare, I did the same calculation on the host without the GPU.
It seems that even when the kernel does nothing, the GPU works 5 times faster than the host.

This does not make sense. It should work much faster. The NVIDIA card has 16 cores, each running at 1.4 GHz. The host is a Core 2 Duo running at 3 GHz.

What is wrong with my measurement?

Thanks,
Zvika

haibo031031 · November 22, 2013, 12:45am

Hi, my understanding is that the GPU and the CPU differ in their startup overhead. But you might show more timing information …
Jianbin

clint3112 · November 22, 2013, 4:30am

Timing from outside OpenCL only makes sense when you use blocking calls. If you only want to see how long your kernel runs, have a look at the kernel's timing events.
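
As a rough illustration of the host-side approach, here is a minimal sketch of a wall-clock timer in C. It assumes C11 (`timespec_get`); the names `elapsed_ms`, `time_blocking_call` and `spin` are made up for this example, and `spin` is only a stand-in workload so the sketch compiles without OpenCL. In real code the timed callback would enqueue the kernel and end with a blocking read or `clFinish`.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps. */
static double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1.0e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Hypothetical helper: times one call of work(arg) on the host clock.
 * The result is only meaningful if work() returns after the GPU has
 * finished, e.g. because it ends with a blocking read or clFinish(). */
static double time_blocking_call(void (*work)(void *), void *arg)
{
    struct timespec t0, t1;
    assert(work != NULL);
    timespec_get(&t0, TIME_UTC);   /* C11 wall-clock timestamp */
    work(arg);
    timespec_get(&t1, TIME_UTC);
    return elapsed_ms(t0, t1);
}

/* Stand-in workload (assumption): replaces the enqueue + blocking read. */
static void spin(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}
```

Calling `time_blocking_call(spin, &n)` returns the elapsed host time in milliseconds. If the timed call returns before the device is done, the number only reflects how long the enqueue took.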

zvivered · November 22, 2013, 11:01am

Our goal is to use GPGPU instead of regular CPU cores.
We want to measure the time required for the host to complete a calculation using GPGPU.
I think measuring the time inside the kernel (running on one core) will not give us a complete picture.
The kernel runs on 16 cores simultaneously.

Thanks,
Zvika

zvivered · November 22, 2013, 12:39pm

This is the way to do it according to “OpenCL in Action” by Matthew Scarpino:

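cl_event prof_event;
cl_ulong time_start, time_end, total_time;
…
/* Profiling must be enabled when the queue is created. */
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
…
/* Attach prof_event to the kernel launch (no local size, no wait list). */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset,
                             global_size, NULL, 0, NULL, &prof_event);
/* Wait until the kernel has finished before reading its timestamps. */
clFinish(queue);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
                        sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
                        sizeof(time_end), &time_end, NULL);

/* Kernel execution time on the device, in nanoseconds. */
total_time = time_end - time_start;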
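
The two timestamps returned by clGetEventProfilingInfo are device counter values in nanoseconds, so the difference usually needs converting before it can be compared with a host timer. A minimal sketch of that conversion follows; the helper name `profiling_ns_to_ms` is made up here, and `uint64_t` is used as a stand-in for `cl_ulong` so the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a pair of profiling timestamps (nanoseconds) into a duration
 * in milliseconds. uint64_t stands in for cl_ulong (assumption made so
 * that this sketch does not need the OpenCL headers). */
static double profiling_ns_to_ms(uint64_t time_start, uint64_t time_end)
{
    assert(time_end >= time_start);   /* END must not precede START */
    return (double)(time_end - time_start) / 1.0e6;
}
```

With the variables from the snippet above, the call would be `profiling_ns_to_ms(time_start, time_end)`. This covers the whole NDRange launch on the device and does not include the time spent in clEnqueueReadBuffer.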

Dithermaster · November 24, 2013, 8:19pm

zvivered is correct: you can measure kernel execution using events. Putting timers around the enqueue calls only measures the enqueue speed (although I guess if your read is blocking you’ll get something).

You can also use NVIDIA Parallel Nsight or AMD APP Profiler to see timeline traces of memory transfers and kernel execution times, as well as summaries showing min/max/averages.
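
If no profiler is at hand, a similar min/max/average summary can be approximated by running the kernel several times and collecting the per-run durations yourself. The following is only a sketch under that assumption; `timing_summary` and `summarize_ms` are hypothetical names, and the samples are expected to be durations in milliseconds gathered by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum, maximum and average of a set of durations in milliseconds. */
struct timing_summary {
    double min_ms;
    double max_ms;
    double avg_ms;
};

/* Summarizes `count` duration samples; `count` must be at least 1. */
static struct timing_summary summarize_ms(const double *samples, size_t count)
{
    assert(samples != NULL && count > 0);
    struct timing_summary s = { samples[0], samples[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        if (samples[i] < s.min_ms) s.min_ms = samples[i];
        if (samples[i] > s.max_ms) s.max_ms = samples[i];
        sum += samples[i];
    }
    s.avg_ms = sum / (double)count;
    return s;
}
```

The first run is often slower than the rest because of startup overhead, so looking at the minimum and the average side by side tends to be more informative than a single measurement.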