Hi all,
I’m currently working with OpenCL in the hope of doing a bit of research on computational finance methods using GPUs. Things are going okay, but I’ve run into a very strange “bug”: my code compiles and runs, but the results are incredibly confusing. I’m trying to do simple vector addition. The details are:
[ul]
[li]the vectors are 131072 (8 * 32 * 512) elements long
[/li][li]the vectors are added together 1000 times
[/li][/ul]
There are five different but similar ways I’ve gone about this.
Version 1
16384 (32*512) is the number of global work items, and 32 is the number of local work items. Since the vectors are 131072 elements long, each work-item does the calculations for 8 elements. The vectors are copied to the GPU once, added together 1000 times, and then copied back.
To determine which elements it must work on, each work-item has to compute the element to start at and the one to stop at. Easy, but a teeny bit of overhead.
Version 2
Same as #1, but the vectors are written to the GPU every addition (1000 times), and copied back once. This is just to demonstrate overhead.
Version 3
131072 is the number of global work items, and 32 is the number of local work items. Thus, one work-item gets one element to calculate. The vectors are copied to the GPU once, added together 1000 times, copied back.
Version 4
Same as #3, but with the overhead mentioned in #1, just for the sake of it.
Version 5
Same kernel as #1, but with 131072 as the number of global work items instead of 16384. Since it’s the same kernel as #1, there is a for-loop that iterates over some elements of the vectors, but this for-loop only iterates once since each work-item gets one element!
I also have some CPU code doing the same thing.
#1 and #2 are slower than the CPU code. #2 I can believe, but #1 being slower is weird. #3, #4 and #5 are all around the same speed. #5 surprises me, because it’s the same kernel as #1; maybe the compiler eliminates the for-loop since it only ever iterates once. I’m not sure.
The weird part
The versions are executed in the order listed above, and several weird things happen:
[ul]
[li]when V5’s number of global work items is changed to the same as V1’s, its speed is still comparable to V3 and V4
[/li][li]when V5 is put BEFORE V1, its speed drops dramatically (from around 180x faster than the CPU code to only 5x faster)
[/li][li]when V1 is put AFTER V5, its speed is then comparable to V3, V4 and V5
[/li][/ul]
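One thing I could try to rule out is one-time setup costs (lazy buffer allocation, kernel compilation, etc.) landing on whichever version happens to run first: time each version with an untimed warm-up launch before the measured loop. A sketch of what I mean, with runVersion standing in for the real enqueue-and-finish sequence (names are my own, and I haven’t confirmed this explains the ordering effect):

```cpp
#include <chrono>
#include <functional>

// Time a "version" with one untimed warm-up call first, so any
// first-launch costs don't get charged to whichever version runs first.
// runVersion is a stand-in for the real kernel enqueue + clFinish.
double timeWithWarmup(const std::function<void()>& runVersion, int iters) {
    runVersion();  // warm-up: deliberately not timed
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        runVersion();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

If the slowdowns move with whichever version runs first, that would point at warm-up rather than the kernels themselves.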
I’m really unsure of why this happens. My code is hosted here: https://github.com/SaintDako/OpenCL-examples/blob/master/example01/main.cpp
Maybe this is just the GPU being crazy. If anyone else can reproduce these results or explain what in the world is going on, PLEASE let me know!