Local work size!

Hi!

I have been playing with various settings to local_work_size and looking at this kernel:

 for (unsigned int i = get_global_id(0); i < Size; i += get_global_size(0))
         Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];

I noticed a few strange things and I hope somebody can help explain.
The Size parameter was set to Exp2(18).

1.) Local size was fixed at 128 because the performance was best with this setting.
2.) I started with Global_Size parameter equal to Size and reduced in steps by 2 measuring the speed. The speed remained constant down to global_size being 32x less than vector Size. Specifically 8192. Further decreasing dramatically increased the time of processing. Why?
3.) Relating to the question 2#. Why is that a stride step of 8192 within a loop executing 32x works so fast in compare to a stride step of 1? I simply couldn’t measure any benefits, but rather penalty, when the stride step was one in the following kernel:

int Offset = get_global_id(0)*BlockLen;
int Len = BlockLen;
if ((Offset + BlockLen) > Size)
{
    Len = Size - Offset;
    if (Offset > Size) return;
}

for (int i = Offset; i < (Offset + Len); i ++)
{
    Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];
}

The time of execution in this case was proportional to the value of Len parameter on GPU.

On CPU, one would expect exactly the opposite. Is there no vectorization optimization possible for loops inside kernels?
4.) The first Kernel ran 20% slower than without the for-loop, but allows arbitrary setting of the local_work_size. Are NVidia Open CL drivers still sensitive to local_work_size and require that to be set by the user? I tried with local_work_size of 1 on AMD and speed degraded sharply, but leaving it to automatic works fine (fastest). Intel also recommends local_work_size to be set to auto with their drivers.

Thanks!
Atmapuri

I have some question to understand what’s the code doing. What is “Size”? What is “vector Size”? What is "BlockLen "? Why does “i” increase in steps of “get_global_size(0)”?

Size is equal to length of the vectors or arrays. BlockLen describes how the long vector is broken down in to short vector. (in to many pieces each of BlockLen size). The idea with BlockLen is (or was) that thread internal for-loop would have lower overhead then if the for-loop is thread based (one thread for each item).

>Why does i increase in steps of get_global_size

It is a way to achieve two things:
a.) User controlled distance between items being read regardless of the amount of data that needs to be processed (Size).
b.) It allows the user to set local_work_size and global_work_size independently from the Size.

The two kernels compare two approaches:
1.) Strided address reads within each thread. (step larger than 1).
2.) Consecutive address reads within each thread

I was expecting that kernel2 would work fast on GPU as well. Instead kernel2 works fastest on CPU and kernel1 is fastest on GPU even when the for loop in each kernel does the same number of iterations in both cases. (and the only difference being is the stride in the for loop). (Differences are in the range of 10x).

Note that you cant test this with kernel #1 alone, because if the stride would be set to 1, (global_work_size = 1) the local_work_size would also be 1 and the OpenCL can launch only one thread per compute unit. (instead of 32 or 64 or 128).

Why are large strides (bigger than 8192) so beneficial to GPU?

Sorry, I’m still confused about what the code is doing. Can you post all the kernels and the value of the kernel arguments you used? Is this some sort of micro-benchmark to see how strides affect performance?

I started with Global_Size parameter equal to Size and reduced in steps by 2 measuring the speed. The speed remained constant down to global_size being 32x less than vector Size. Specifically 8192. Further decreasing dramatically increased the time of processing. Why?

Surely if you reduce the global size so much the amount of work that can potentially be executed in parallel is also reduced accordingly? That’s what the global size represents, which is why I find it so hard to understand why are you intentionally reducing it.

__kernel void ippsAddd_Idx(__global const float *Src1, const int Src1Idx,
                          __global const float *Src2, const int Src2Idx,
                          __global float *Dst, const int DstIdx, const int Size)
{
     for (unsigned int i = get_global_id(0); i < Size; i += get_global_size(0))
     {
             Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];
     }
}

VectorLen = 262144; // == Size
local_work_size = 128;
global_work_size = VectorLen / 2; // and also /4, /8, /16 , /32 , /64

Result: speed falls of dramatically when VectorLen is divided by more than 32.
(internal for loop does more than 32 iterations)

__kernel void ippsAddd_Idx(__global const float *Src1, const int Src1Idx,
                          __global const float *Src2, const int Src2Idx,
                          __global float *Dst, const int DstIdx, const int BlockLen, const int Size)
{
    int Offset = get_global_id(0)*BlockLen;
    int Len = BlockLen;
    if ((Offset + BlockLen) > Size)
    {
        Len = Size - Offset;
        if (Offset > Size) return;
    }

    for (int i = Offset; i < (Offset + Len); i ++)
    {
        Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];
    }
}

VectorLen = 262144; // == Size
local_work_size = 128;
BlockLen = 1; // or 2, 4, 8, 16, 32
global_work_size = ((VectorLen / BlockLen) / LocalWorkSize)*(LocalWorkSize+1);

Result: The speed is highest for BlockLen = 1 and kernel1 is always faster than kernel2.
The only real difference between the two approaches: the stride between accessed array elements. The second example has stride of 1 and the first example has stride of at least 8192. Tests were run on AMD Juniper.

The only real difference between the two approaches: the stride between accessed array elements.

I must be misunderstanding something. The global size changes dramatically in the two examples. If BlockLen = 1, then get_global_size(0) is equal to Size, the size of the vector. That naturally has a performance benefit.

As for kernel #1 always being faster than kernel #2, notice that kernel #1 only has a loop while kernel #2 has some other code to handle extremes. Given that the kernel is extremely simple, with only three memory operations and one ALU operation, the code outside of the loop may be more significant than you might think at first. There could be something else going on as well that I haven’t thought of.

How does the performance of these kernels compare with a naïve version like this?


__kernel void ippsAddd_Idx(__global const float *Src1, const int Src1Idx,
                          __global const float *Src2, const int Src2Idx,
                          __global float *Dst, const int DstIdx)
{
      size_t i = get_global_id(0);

      Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];
}

In compare the to the naive implementation, the kernel1 is 20 percent slower and the kernel2 is 50 percent slower each with its optimal settings giving fastest run.

>Surely if you reduce the global size so much the amount of work that can potentially be >executed in parallel is also reduced accordingly?

No. When BlockLen = 32 for Kernel2 and divisor is 32 for Kernel1, the global_size is equal in both cases. Kernel2 however runs 10x slower.

Note that kernel2 is fastest when blockLen is 1. (being only 50 percent slower than naive implementation) but only keeps getting slower as BlockLen is increased.