I have been experimenting with various settings for local_work_size while looking at this kernel:

for (unsigned int i = get_global_id(0); i < Size; i += get_global_size(0))
    Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];

I noticed a few strange things and I hope somebody can help explain.
The Size parameter was set to Exp2(1.

1.) Local size was fixed at 128 because the performance was best with this setting.
2.) I started with the global_size parameter equal to Size and reduced it in steps of a factor of 2, measuring the speed each time. The speed remained constant until global_size was 32x smaller than the vector Size (specifically 8192). Decreasing it further dramatically increased the processing time. Why?
3.) Related to question 2: why is a stride of 8192 within a loop that executes 32 times so fast compared to a stride of 1? I could not measure any benefit, only a penalty, when the stride was 1, as in the following kernel:

int Offset = get_global_id(0)*BlockLen;
int Len = BlockLen;
if ((Offset + BlockLen) > Size)
    Len = Size - Offset;
if (Offset > Size) return;

for (int i = Offset; i < (Offset + Len); i++)
    Dst[DstIdx + i] = Src1[Src1Idx + i] + Src2[Src2Idx + i];

On the GPU, the execution time in this case was proportional to the value of the Len parameter.

On CPU, one would expect exactly the opposite. Is there no vectorization optimization possible for loops inside kernels?
4.) The first kernel ran 20% slower than the same kernel without the for-loop, but it allows an arbitrary local_work_size. Are NVIDIA OpenCL drivers still sensitive to local_work_size, requiring the user to set it? I tried a local_work_size of 1 on AMD and the speed degraded sharply, but leaving it on automatic works fine (fastest). Intel also recommends leaving local_work_size on auto with their drivers.
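
To make the comparison in 2.) and 3.) concrete, here is a small host-side C sketch (plain C, not OpenCL) of which element each work-item touches on each loop iteration in the two kernels. The function names are my own, made up for illustration. The numbers 8192 and 32 are the ones from question 2. My guess is that the gap between neighbouring work-items is what matters on the GPU (memory coalescing), but that is an assumption I have not verified.

```c
#include <assert.h>
#include <stddef.h>

/* First kernel (stride = get_global_size(0)): element touched by work-item
   gid on loop iteration iter. Hypothetical helper, not an OpenCL call. */
size_t strided_index(size_t gid, size_t iter, size_t global_size)
{
    assert(global_size > 0);
    return gid + iter * global_size;
}

/* Second kernel (one contiguous block per work-item, stride = 1):
   element touched by work-item gid on loop iteration iter. */
size_t blocked_index(size_t gid, size_t iter, size_t block_len)
{
    assert(iter < block_len);
    return gid * block_len + iter;
}

/* Loop iterations each work-item runs in the first kernel
   when size is a multiple of global_size. */
size_t iterations_per_item(size_t size, size_t global_size)
{
    assert(global_size > 0 && size % global_size == 0);
    return size / global_size;
}

/* Distance between the elements that work-items 0 and 1 touch
   on the same iteration, for each of the two schemes. */
size_t strided_neighbour_gap(size_t global_size)
{
    return strided_index(1, 0, global_size) - strided_index(0, 0, global_size);
}

size_t blocked_neighbour_gap(size_t block_len)
{
    return blocked_index(1, 0, block_len) - blocked_index(0, 0, block_len);
}
```

With global_size = 8192 and 32 iterations per work-item, neighbouring work-items in the first kernel are always 1 element apart on any given iteration, even though each individual work-item jumps by 8192. In the second kernel with BlockLen = 32, each work-item walks with a stride of 1, but neighbouring work-items are 32 elements apart on every iteration.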