No speedup from using 2 GPUs

I just got 2 new GPUs yesterday. They are both NVIDIA C2070s. I wrote a simple program to compare the runtime of using 1 GPU versus 2 GPUs. Surprisingly, 2 GPUs don't give me any speedup. Basically, I have 2 kernels, each with its own independent inputs and outputs. I ran several variations of the number of contexts and command queues, and the command queues always use in-order execution. These are the results:

1 command queue on 1 device
total time: 558,866 microseconds

2 command queues on 1 context on 1 device
(run kernel A on command queue A; run kernel B on command queue B)
total time: 717,828 microseconds

2 command queues on 1 context on 2 devices
total time: 826,846 microseconds

2 command queues on 2 contexts on 2 devices
(run kernel A on command queue A, which is on context A that includes only device A; run kernel B on command queue B, which is on context B that includes only device B)
total time: 519,748 microseconds

Running 1 kernel by itself takes 198,018 microseconds (this is the time from when the kernel starts running on the GPU until it finishes; it has nothing to do with the CPU side).

Can anyone explain what's going on? I expected to get some speedup from using 2 GPUs, but apparently I don't.

How are you measuring the time? Are you executing the same amount of work in the 1-device case and the 2-device case? I.e., if you are running 100 work-items in the 1-device example, are you then running 50 work-items per device in the two-device example? Is it possible that your execution time is bandwidth bound rather than ALU bound?

For the overall time, I use gettimeofday(&time, NULL). I make one call before creating the command queue and another after the output buffers have been read. For the kernel runtime on the GPU, I use:

clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event);
clWaitForEvents(1, &event); // profiling info is only valid once the command has completed
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(gpu1), &gpu1, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(gpu2), &gpu2, NULL);
time = gpu2 - gpu1; // profiling timestamps are in nanoseconds

I execute 2 kernels, each with 100 work-items. When I use 1 device, I run both kernels on that device (100 + 100 work-items). When I use 2 devices, I run each kernel on its own device (100 work-items on one and 100 on the other).

I expect the buffer reads and writes to be bandwidth bound, but the kernel run shouldn't be. These are the runtimes of the buffer write, kernel run, and buffer read, measured with clGetEventProfilingInfo:
write: 1 microsecond
run: 198026 microseconds
read: 80 microseconds

These numbers are for kernel A; kernel B takes about the same time, and these runtimes hold across all the variations. You can see that I spend most of the time running the kernel. In the last variation, where I have 2 command queues on 2 different contexts associated with 2 different devices, the 2 kernels should run concurrently, since I call clEnqueueNDRangeKernel on one queue right after the other and clFlush on both of them, so I expect the 198,026-microsecond kernel runtimes of the 2 kernels to overlap.

It is also weird that the buffer write takes only 1 microsecond.

These are the runtimes of the buffer write, kernel run, and buffer read, measured with clGetEventProfilingInfo:
write: 1 microsecond
run: 198026 microseconds
read: 80 microseconds

You can see that I spend most of the time running the kernel.

That’s not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it’s the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.

clGetEventProfilingInfo() is the right way to do all time measurements.
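
For example, here is a minimal sketch of the difference (illustrative only, not your exact code; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and reuses the buffer names from your snippets):

    struct timeval t0, t1;
    cl_event ev;
    cl_ulong start_ns, end_ns;

    gettimeofday(&t0, NULL);
    // Non-blocking: this call returns as soon as the command is queued ("ordering the pizza").
    clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, DATA_SIZE * sizeof(double),
                         inputData, 0, NULL, &ev);
    gettimeofday(&t1, NULL);
    long enqueue_us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);

    // Wait until the transfer has actually happened ("baking the pizza"),
    // then read the device timestamps, which are reported in nanoseconds.
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start_ns), &start_ns, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end_ns), &end_ns, NULL);
    printf("enqueue call: %ld us, actual transfer: %llu us\n",
           enqueue_us, (unsigned long long)((end_ns - start_ns) / 1000));

The two numbers can differ by orders of magnitude, which is why timing the enqueue call alone can make a transfer look nearly free.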

In the last variation, where I have 2 command queues on 2 different contexts associated with 2 different devices, the 2 kernels should run concurrently, since I call clEnqueueNDRangeKernel on one queue right after the other and clFlush on both of them

It would be great if you could show us the whole source code to understand what’s going on.

I did use clGetEventProfilingInfo() to measure those times; only for the total time did I use gettimeofday().

Here is my code.
1 command queue on 1 device:


    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    cxGPUContext = clCreateContext(0, 1, &device_id, NULL, NULL, &ciErrNum);

    gettimeofday(&start, NULL);

    commandQueue = clCreateCommandQueue(cxGPUContext, device_id, CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // profiling needed for clGetEventProfilingInfo

    // build and run kernel A
    cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&ProgramSource, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
    input = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL, NULL);
    output = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    global = DATA_SIZE;
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event3);

    // build and run kernel B
    cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&ProgramSource2, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
    input2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL, NULL);
    output2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, NULL);
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
    clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
    clEnqueueNDRangeKernel(commandQueue, kernel2, 1, NULL, &global, NULL, 0, NULL, NULL);
    // blocking read on the last command also waits for everything earlier in the in-order queue
    clEnqueueReadBuffer(commandQueue, output2, CL_TRUE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event6);
    gettimeofday(&end, NULL);

2 command queues on 1 context on 2 devices:


    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &device_count);
    cl_device_id* device_ids = new cl_device_id[device_count];
    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, device_count, device_ids, &device_count);
    cxGPUContext = clCreateContext(0, device_count, device_ids, NULL, NULL, &ciErrNum); // 1 context on all devices

    gettimeofday(&start, NULL);

    // queue1 for kernel1
    commandQueue = clCreateCommandQueue(cxGPUContext, device_ids[0], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue1 on device1
    cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&ProgramSource, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
    input = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    output = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event5);
    clFlush(commandQueue);

    // queue2 for kernel2
    commandQueue2 = clCreateCommandQueue(cxGPUContext, device_ids[1], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue2 on device2
    cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&ProgramSource2, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
    input2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    output2 = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue2, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event6);
    clFlush(commandQueue2);

    // run kernel1
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event3);
    clFlush(commandQueue);

    // run kernel2
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
    clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
    clEnqueueNDRangeKernel(commandQueue2, kernel2, 1, NULL, &global, NULL, 0, NULL, &event4);
    clFlush(commandQueue2);

    // read buffer from kernel1
    clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event1);
    clFlush(commandQueue);

    // read buffer from kernel2
    clEnqueueReadBuffer(commandQueue2, output2, CL_FALSE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event2);
    clFlush(commandQueue2);

    clFinish(commandQueue);
    clFinish(commandQueue2);

2 command queues on 2 contexts on 2 devices:


    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &device_count);
    cl_device_id* device_ids = new cl_device_id[device_count];
    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, device_count, device_ids, &device_count);
    cxGPUContext = clCreateContext(0, 1, &device_ids[0], NULL, NULL, &ciErrNum); // context1 on device1
    cxGPUContext2 = clCreateContext(0, 1, &device_ids[1], NULL, NULL, &ciErrNum); // context2 on device2

    gettimeofday(&start, NULL);

    // queue1 for kernel1
    commandQueue = clCreateCommandQueue(cxGPUContext, device_ids[0], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue1 on context1 on device1
    cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&ProgramSource, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
    input = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    output = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event5);
    clFlush(commandQueue);

    // queue2 for kernel2; everything kernel2 touches must be created in cxGPUContext2
    commandQueue2 = clCreateCommandQueue(cxGPUContext2, device_ids[1], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue2 on context2 on device2
    cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext2, 1, (const char **)&ProgramSource2, NULL, &ciErrNum);
    ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
    input2 = clCreateBuffer(cxGPUContext2, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    output2 = clCreateBuffer(cxGPUContext2, CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
    clEnqueueWriteBuffer(commandQueue2, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event6);
    clFlush(commandQueue2);

    // run kernel1
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event3);
    clFlush(commandQueue);

    // run kernel2
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
    clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
    clEnqueueNDRangeKernel(commandQueue2, kernel2, 1, NULL, &global, NULL, 0, NULL, &event4);
    clFlush(commandQueue2);

    // read buffer from kernel1
    clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event1);
    clFlush(commandQueue);

    // read buffer from kernel2
    clEnqueueReadBuffer(commandQueue2, output2, CL_FALSE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event2);
    clFlush(commandQueue2);

    clFinish(commandQueue);
    clFinish(commandQueue2);

For measuring the buffer read, kernel run, and buffer write times, I do something like this:


    clGetEventProfilingInfo(eventX, CL_PROFILING_COMMAND_START, sizeof(time1), &time1, NULL);
    clGetEventProfilingInfo(eventX, CL_PROFILING_COMMAND_END, sizeof(time2), &time2, NULL);
    printf("eventX: %lld microseconds\n", (time2 - time1) / 1000);

Please help! Let me know if you need more information.

Can I suggest doing all the clCreateXxx() calls as well as clBuildProgram() at the beginning of the code, then doing the actual clEnqueueXxx() calls? clBuildProgram() in particular is notoriously expensive and executing it between your first and your second call to clEnqueueNDRangeKernel() may be eliminating all possibility of concurrency between the two devices.
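
Something like this (a rough sketch based on your two-context snippets, with buffer creation, kernel arguments, and error checking elided):

    // setup phase: everything expensive happens before timing starts
    cxGPUContext  = clCreateContext(0, 1, &device_ids[0], NULL, NULL, &ciErrNum);
    cxGPUContext2 = clCreateContext(0, 1, &device_ids[1], NULL, NULL, &ciErrNum);
    commandQueue  = clCreateCommandQueue(cxGPUContext,  device_ids[0], CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
    commandQueue2 = clCreateCommandQueue(cxGPUContext2, device_ids[1], CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
    cl_program cpProgram  = clCreateProgramWithSource(cxGPUContext,  1, (const char **)&ProgramSource,  NULL, &ciErrNum);
    cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext2, 1, (const char **)&ProgramSource2, NULL, &ciErrNum);
    clBuildProgram(cpProgram,  0, NULL, "-cl-fast-relaxed-math", NULL, NULL); // expensive: keep out of the timed region
    clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    kernel  = clCreateKernel(cpProgram,  "hello", &ciErrNum);
    kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
    // ... create buffers in their respective contexts and set kernel args here ...

    // timed phase: only enqueue/flush/finish calls
    gettimeofday(&start, NULL);
    clEnqueueWriteBuffer(commandQueue,  input,  CL_FALSE, 0, DATA_SIZE * sizeof(double), inputData, 0, NULL, NULL);
    clEnqueueWriteBuffer(commandQueue2, input2, CL_FALSE, 0, DATA_SIZE * sizeof(double), inputData, 0, NULL, NULL);
    clEnqueueNDRangeKernel(commandQueue,  kernel,  1, NULL, &global, NULL, 0, NULL, NULL);
    clFlush(commandQueue);  // start device1 working immediately
    clEnqueueNDRangeKernel(commandQueue2, kernel2, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFlush(commandQueue2); // both devices should now run concurrently
    clEnqueueReadBuffer(commandQueue,  output,  CL_FALSE, 0, DATA_SIZE * sizeof(double), results,  0, NULL, NULL);
    clEnqueueReadBuffer(commandQueue2, output2, CL_FALSE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, NULL);
    clFinish(commandQueue);
    clFinish(commandQueue2);
    gettimeofday(&end, NULL);

This way the timed region contains nothing that serializes on the host between the two kernel launches.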

Okay. I’ll try that, and I’ll let you know the result.

That works! Now it runs faster on 2 GPUs. Thank you so much.