I just got 2 new GPUs yesterday; they are both NVIDIA Tesla C2070s. I wrote a simple program to compare the runtime of using 1 GPU versus 2 GPUs. Surprisingly, 2 GPUs don't give me any speedup. Basically, I have 2 kernels, each with its own independent inputs and outputs. I ran several variations of the number of contexts and command queues; the command queues are always in-order. These are the results:
1 command queue on 1 device
total time: 558,866 microseconds
2 command queues on 1 context on 1 device
(run kernel A on command queue A; run kernel B on command queue B)
total time: 717,828 microseconds
2 command queues on 1 context on 2 devices
total time: 826,846 microseconds
2 command queues on 2 contexts on 2 devices
(run kernel A on command queue A, which is on context A containing only device A; run kernel B on command queue B, which is on context B containing only device B)
total time: 519,748 microseconds
Running 1 kernel by itself takes 198,018 microseconds (this is the time from when the kernel starts running on the GPU until it finishes; the CPU side is not involved).
Can anyone explain what's going on? I expected some speedup from using 2 GPUs, but apparently there is none.
How are you measuring the time? Are you executing the same amount of work in the 1-device case and the 2-device case? I.e., if you are running 100 work-items in the 1-device example, are you then running 50 work-items per device in the 2-device example? Is it possible that your execution time is bandwidth bound rather than ALU bound?
For the overall time, I use gettimeofday(&time, NULL). I make one call before creating the command queue and another after the output buffers have been read. For the kernel runtime on the GPU, I use clGetEventProfilingInfo().
I execute 2 kernels, each with 100 work-items. When I use 1 device, I run both kernels on that device (100 + 100 work-items). When I use 2 devices, I run each kernel on its own device (100 work-items on one and 100 on the other).
I expect the buffer reads and writes to be bandwidth bound, but the kernel execution shouldn't be. These are the runtimes of the buffer write, kernel run, and buffer read, measured with clGetEventProfilingInfo:
write: 1 microsecond
run: 198026 microseconds
read: 80 microseconds
These numbers are for kernel A; kernel B takes about the same time, and the results are similar across all variations. You can see that most of the time is spent running the kernel. In the last variation, with 2 command queues on 2 different contexts, each associated with its own device, the 2 kernels should run concurrently: I call clEnqueueNDRangeKernel for one right after the other and then clFlush both queues, so I expect the two 198,026-microsecond kernel runtimes to overlap.
It is also weird that the buffer write takes only 1 microsecond.
These are the runtime of read buffer, run kernel, and write buffer using clGetEventProfilingInfo:
write: 1 microseconds
run: 198026 microseconds
read: 80 microseconds
You can see that I spend most of the time on running kernel.
That’s not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it’s the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.
clGetEventProfilingInfo() is the right way to do all time measurements.
In the last variation that I have 2 command queues on 2 different contexts associated to 2 different devices, the 2 kernels should run concurrently since I put clEnqueueNDRangeKernel one right after another with clFlush for both of them
It would be great if you could show us the whole source code to understand what’s going on.
Can I suggest doing all the clCreateXxx() calls, as well as clBuildProgram(), at the beginning of the code, and only then issuing the actual clEnqueueXxx() calls? clBuildProgram() in particular is notoriously expensive, and executing it between your first and second calls to clEnqueueNDRangeKernel() may be eliminating any possibility of concurrency between the two devices.