Hello,
I want to measure the performance of the GPU in NVIDIA’s GeForce 9400 GT.
The steps in the host code are:
clSetKernelArg
clCreateCommandQueue
*start measure
clEnqueueNDRangeKernel
clEnqueueReadBuffer
*end measure
In order to compare, I did the same calculation on the host without the GPU.
It seems that even when the kernel does nothing, the GPU works 5 times faster than the host.
This does not make sense. It should work much faster. The NVIDIA GPU has 16 cores, each running at 1.4 GHz. The host is a Core 2 Duo running at 3 GHz.
What is wrong with my measurement?
Thanks,
Zvika
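As a rough sanity check on the figures quoted in the post above, the sketch below compares the two processors by cores multiplied by clock rate only. This is a deliberately crude assumption: it ignores instructions per clock, SIMD width, memory bandwidth, and transfer costs. The helper names aggregate_ghz and clock_ratio are hypothetical.

```c
#include <assert.h>

/* Crude aggregate clock in GHz: cores times clock rate.
 * Assumption: every core retires the same work per cycle, which is
 * not true in practice; this is only a back-of-envelope figure. */
static double aggregate_ghz(int cores, double ghz_per_core)
{
    assert(cores > 0 && ghz_per_core > 0.0);
    return cores * ghz_per_core;
}

/* Ratio of the aggregate clocks of processor A over processor B. */
static double clock_ratio(int cores_a, double ghz_a,
                          int cores_b, double ghz_b)
{
    return aggregate_ghz(cores_a, ghz_a) / aggregate_ghz(cores_b, ghz_b);
}
```

With the numbers from the post, clock_ratio(16, 1.4, 2, 3.0) divides 22.4 by 6.0, which is roughly 3.7. On this crude measure alone, a 5x speedup is not obviously too low, although real results depend heavily on the kernel and on data transfers.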
Hi, my understanding is that the GPU and the CPU differ in their startup overhead. But you might show more timing information …
Jianbin
Timing from outside OpenCL only makes sense when you have blocking calls. If you only want to see how long your kernel runs, have a look at the kernel's profiling events.
Our goal is to use GPGPU instead of regular CPU cores.
We want to measure the time required for the host to complete a calculation using GPGPU.
I think measuring the time inside the kernel (running on one core) will not give us a complete picture.
The kernel runs on 16 cores simultaneously.
Thanks,
Zvika
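For the end-to-end host time asked about above, one possible approach is a wall-clock timer around work that is guaranteed to block until it completes. The sketch below is a minimal illustration, assuming a POSIX system with clock_gettime; the callback type blocking_work_fn and the functions time_blocking_ms and run_once are hypothetical names. In a real OpenCL program the callback would enqueue the kernel and then either call clEnqueueReadBuffer with blocking_read set to CL_TRUE or call clFinish on the queue.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: the work to be timed. It must not return
 * until the work has really finished, otherwise only the enqueue cost
 * is measured. */
typedef void (*blocking_work_fn)(void *user_data);

/* Returns the elapsed wall-clock time, in milliseconds, of one call
 * to work(user_data). */
static double time_blocking_ms(blocking_work_fn work, void *user_data)
{
    struct timespec t0, t1;
    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work(user_data);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Stand-in for the GPU path so the sketch runs without OpenCL.
 * It only counts how many times it was called. */
static void run_once(void *user_data)
{
    int *calls = (int *)user_data;
    ++*calls;
}
```

The same timer can wrap the CPU-only version of the calculation, so both paths are measured the same way. Running each path several times and discarding the first run may help separate one-time startup overhead from steady-state cost.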
This is the way to do it according to “OpenCL in Action” by Matthew Scarpino:
cl_event prof_event;
cl_ulong time_start, time_end, total_time;
…
/* Profiling must be enabled when the queue is created. */
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
…
/* NULL local size lets the implementation choose; no wait list. */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset,
                             global_size, NULL, 0, NULL, &prof_event);
/* Wait for the kernel to complete before reading its timestamps. */
clFinish(queue);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
                        sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
                        sizeof(time_end), &time_end, NULL);
total_time = time_end - time_start;  /* device time in nanoseconds */
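The profiling counters returned by clGetEventProfilingInfo are in nanoseconds. Besides CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, an event also exposes CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT, which can help show how much time is spent before the kernel starts. The sketch below only does the arithmetic on those four values; the struct prof_times and the helper names are hypothetical, and uint64_t stands in for cl_ulong so that it compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for cl_ulong (a 64-bit unsigned integer). */
typedef uint64_t prof_ns_t;

/* Hypothetical container for the four profiling timestamps of one
 * event, all in nanoseconds. */
struct prof_times {
    prof_ns_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    prof_ns_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    prof_ns_t start;   /* CL_PROFILING_COMMAND_START  */
    prof_ns_t end;     /* CL_PROFILING_COMMAND_END    */
};

/* Converts nanoseconds to milliseconds. */
static double ns_to_ms(prof_ns_t ns)
{
    return (double)ns / 1.0e6;
}

/* Time the command spent executing on the device. */
static double exec_ms(const struct prof_times *t)
{
    assert(t->end >= t->start);
    return ns_to_ms(t->end - t->start);
}

/* Time between the host enqueuing the command and the device
 * starting it. */
static double wait_ms(const struct prof_times *t)
{
    assert(t->start >= t->queued);
    return ns_to_ms(t->start - t->queued);
}
```

If wait_ms is large compared with exec_ms for a small kernel, launch overhead rather than computation probably dominates the measurement.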
zvivered is correct: you can measure kernel execution time using events. Putting timers around the enqueue calls only measures how long the enqueue itself takes (although I guess if your read is blocking, you’ll get something).
You can also use NVIDIA Parallel Nsight or AMD APP Profiler to see timeline traces of memory transfers and kernel execution times, as well as summaries showing min/max/averages.