Benchmarking vector addition (with bizarre results!)

Hi all,

I’m currently working with OpenCL in the hope of doing a bit of research on computational finance methods using GPUs. Things are going okay, but I’ve run into a very strange “bug”: my code compiles and runs, but the timing results are incredibly confusing. I’m trying to do simple vector addition. The details are:

[ul]
[li]the vectors are 131072 (8 * 32 * 512) elements long
[/li][li]the vectors are added together 1000 times
[/li][/ul]

There are five different but similar ways I’ve gone about this.

Version 1
16384 (32*512) is the number of global work items, and 32 is the number of local work items. Since the vectors are 131072 elements long, each work-item does the calculations for 8 elements. The vectors are copied to the GPU once, added together 1000 times, and then copied back.

To determine which elements it must work on, each work-item computes the element to start at and the one to stop at. Easy, but a teeny bit of overhead.
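Roughly, each work-item figures out its range like this before it loops (n is the vector length):

int ID    = get_global_id(0);
int ratio = n / get_global_size(0);   // elements per work-item (8 here)
int start = ratio * ID;               // first element this work-item handles
int stop  = ratio * (ID + 1);         // one past the last element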

Version 2
Same as #1, but the vectors are written to the GPU every addition (1000 times), and copied back once. This is just to demonstrate overhead.

Version 3
131072 is the number of global work items, and 32 is the number of local work items. Thus, one work-item gets one element to calculate. The vectors are copied to the GPU once, added together 1000 times, copied back.
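The heart of that kernel is just the following (roughly; the 1000-times repetition is handled the same way as in Version 1):

int i = get_global_id(0);
v3[i] = v1[i] + v2[i];   // exactly one element per work-item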

Version 4
Same as #3, but with the overhead mentioned in #1, just for the sake of it.

Version 5
Same kernel as #1, but with 131072 as the number of global work items instead of 16384. Since it’s the same kernel as #1, there is a for-loop that iterates over some elements of the vectors, but this for-loop only iterates once since each work-item gets one element!

I also have some CPU code doing the same thing.

#1 and #2 are slower than the CPU code. #2 I can believe, but #1 being slower is weird. #3, #4 and #5 are all around the same speed. #5 surprises me because it uses the same kernel as #1; maybe the compiler gets rid of the for-loop since it only covers one element per work-item. I’m not sure.

The weird part
These versions are executed in the order listed above. Several weird things happen, though:

[ul]
[li]when V5’s number of global work items is changed to the same as V1’s, its speed is still comparable to V3 and V4
[/li][li]when V5 is put BEFORE V1, its speed drops dramatically (from around 180x faster than the CPU code to only about 5x faster)
[/li][li]when V1 is put AFTER V5, its speed is then comparable to V3, V4 and V5
[/li][/ul]

I’m really unsure of why this happens. My code is hosted here: https://github.com/SaintDako/OpenCL-examples/blob/master/example01/main.cpp

Maybe this is just the GPU being crazy. If anyone else can produce these results or explain what in the world is going on, PLEASE let me know!

I cannot answer your question, but here is my opinion from looking at the code.

It seems useless to call enqueueBarrier on the in-order queue that you have. If you want to wait until all commands have completed, the most reliable way is to call clFinish on the queue (or its C++ equivalent).
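For example (a minimal sketch, assuming your cl::CommandQueue object is called queue):

queue.finish();          // blocks until everything enqueued so far has completed
// C API equivalent: clFinish(queue());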

It is not a good idea to pass one or two ints via a buffer (the “constants” argument of your kernels). You can pass them to the kernel by value; that is the usual way of passing constants.
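For example, something like this on the host (just a sketch; the argument indices and names depend on your kernel signature):

int n = 131072;          // vector length
int k = 1000;            // repeat count
kernel.setArg(3, n);     // passed by value -- no cl::Buffer needed
kernel.setArg(4, k);

and on the device side the kernel simply declares const int n, const int k instead of taking a global int* buffer.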

Adding two floats is a VERY light-weight kernel. In fact, I suspect that loading the data from memory takes more time than adding the numbers. In that case a simple kernel that processes one element (instead of eight consecutive ones) may be faster due to better coalescing of memory loads.

What is more important, each of your kernels performs only 128K floating point additions. In my opinion that is like giving a roach to a tiger: far too little work for a single GPU kernel launch. It may turn out that most of the time is spent on various overheads: waiting for transfers to complete, buffer-transfer overhead, kernel-launch overhead, and so on.
It is not very useful to measure computational speed with a benchmark that spends most of its time on overhead. Speaking of bizarre results, overhead often tends to be quite unpredictable =)
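To put rough numbers on it (back-of-the-envelope, assuming 4-byte floats): one pass over the vectors does 131072 additions (~128K FLOPs) while touching 3 * 131072 * 4 bytes = 1.5 MiB of memory, i.e. about one floating point operation per 12 bytes moved. At that ratio the kernel is completely memory-bound, and with kernels this short the fixed launch and transfer costs can easily dominate the timings.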

There’s something I noticed in your source code that may produce poor performance. The idea is this: GPUs want memory access patterns that are “coalesced”, meaning that neighboring work items access neighboring data values. Your code, however, has each work item accessing memory in a very strided fashion. Consider this code from add_looped:


start = ratio * ID;          // ID = get_global_id(0), ratio = elements per work-item
stop  = ratio * (ID+1);

int i, j; // will the compiler optimize this anyway? probably.
for (i=0; i<k; i++) {        // k = number of repeated additions
    for (j=start; j<stop; j++)
        v3[j] = v1[j] + v2[j];
}

Work item #0 will be using index 0 in the first iteration of the inner loop, work item #1 will be using index ratio, and work item #2 will be using 2 * ratio. Everything is strided by ratio.

In fact, it looks like the cases that perform well are the ones where the accesses are strided by 1, so this definitely looks like something worth investigating.
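If you want to keep several elements per work item, one common alternative (a sketch, not tested against your code) is to stride by the total number of work items, so that neighbouring work items always touch neighbouring addresses:

kernel void add_gridstride(global const float* v1, global const float* v2,
                           global float* v3, const int n) {
    int id  = get_global_id(0);
    int num = get_global_size(0);
    // pass 0: work items 0,1,2,... touch elements 0,1,2,...
    // pass 1: they touch elements num, num+1, num+2, ... and so on
    for (int j = id; j < n; j += num)
        v3[j] = v1[j] + v2[j];
}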

It seems useless to call enqueueBarrier on the in-order queue that you have. If you want to wait until all commands have completed, the most reliable way is to call clFinish on the queue (or its C++ equivalent).

I did not know about “clFinish” (or I did, and I forgot). I will look into this.

It is not a good idea to pass one or two ints via a buffer (the “constants” argument of your kernels). You can pass them to the kernel by value; that is the usual way of passing constants.

I was under the impression (for some reason) that only arrays could be passed to kernels. I have changed this.

Adding two floats is a VERY light-weight kernel. In fact, I suspect that loading the data from memory takes more time than adding the numbers. In that case a simple kernel that processes one element (instead of eight consecutive ones) may be faster due to better coalescing of memory loads.

That makes sense to me. I thought that a problem of that scale would be enough to benchmark, but I guess not.

Work item #0 will be using index 0 in the first iteration of the inner loop, work item #1 will be using index ratio, and work item #2 will be using 2 * ratio. Everything is strided by ratio.

I watched a video on CUDA programming today (not long after reading your comment, actually) and it talked about striding and coalescing. I thought that my loops weren’t actually striding, since a single thread accesses multiple elements in sequence. However, as you pointed out, that isn’t what matters: coalescing is about multiple threads accessing consecutive elements at the same time. I’ll try to figure this out.

I have another question, though: is there an ‘easy’ way to determine the optimal work-group / work-item configuration? Should I be using as many work-items as there are elements?

Hi Dakkerst,

I tried to run your code on Ubuntu 14.04, but I get the following output:

clang++ -I/usr/local/cuda-7.0/include/ -I/usr/include/boost/ -L/usr/local/cuda-7.0/lib64/ -o hello *.cpp -lOpenCL -pthread -lboost_program_options

[i]In file included from main.cpp:3:
In file included from /usr/local/cuda-7.0/include/CL/cl.hpp:170:
In file included from /usr/local/cuda-7.0/include/CL/opencl.h:44:
/usr/local/cuda-7.0/include/CL/cl_gl_ext.h:44:4: warning: ‘/*’ within block comment [-Wcomment]
 * /* cl_VEN_extname extension */
   ^
main.cpp:78:25: error: expected expression
    cl::Context context({default_device});
                        ^
main.cpp:142:23: error: expected expression
    sources.push_back({kernel_code.c_str(), kernel_code.length()});
                      ^
main.cpp:145:23: error: expected expression
    if (program.build({default_device}) != CL_SUCCESS) {
                      ^
1 warning and 3 errors generated.[/i]

Hey masab_ahmad,

The syntax in that code requires a newer version of C++ (the brace-initializer lists are C++11). You can enable it by adding the flag:

-std=c++0x

so the full command becomes:

clang++ -I/usr/local/cuda-7.0/include/ -I/usr/include/boost/ -L/usr/local/cuda-7.0/lib64/ -std=c++0x -o hello *.cpp -lOpenCL -pthread -lboost_program_options