Work-Group Size estimation at the edge of an array

Hi,
I have a kernel that calculates the max and min values of a 3D array.
I am using ITK (4.3.1) with GPU support.

Here’s the kernel:
http://pastebin.com/sJWkaxfF

On an Nvidia Tesla C2075 the kernel returns the right values. On my Nvidia GeForce GTS 450 the kernel doesn’t work: clFinish() returns a CL_INVALID_COMMAND_QUEUE error. I believe that the drivers for the Tesla C2075 are able to catch certain problems, while the drivers for my non-professional card are not.

I believe I am accessing memory outside of my working array, and I presume the problem is located at the edge of the array. The size of the array is 160 x 256 x 173. My work-group size is 4 x 4 x 4 (64 work-items).
Therefore the global size of my kernel is 160 x 256 x 176, since 173 is rounded up to the next multiple of 4. Of course I check whether each work-item lies within the range of the array, but this is not possible for the local cache. I don’t know how to fix this issue.
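
For reference, the padded global size comes from rounding each array extent up to the next multiple of the work-group size in that dimension. A minimal host-side sketch in plain C; the helper names round_up_to_multiple and padded_global_size are made up for illustration and are not part of OpenCL or ITK:

```c
#include <assert.h>
#include <stddef.h>

/* Round `extent` up to the next multiple of `local`, the work-group
   size along one dimension. Hypothetical helper for illustration. */
size_t round_up_to_multiple(size_t extent, size_t local)
{
    assert(local > 0);
    return ((extent + local - 1) / local) * local;
}

/* Compute the padded global size for a 3D NDRange from the array
   extents and the work-group size. Hypothetical helper. */
void padded_global_size(const size_t extent[3], const size_t local[3],
                        size_t global[3])
{
    for (int d = 0; d < 3; ++d)
        global[d] = round_up_to_multiple(extent[d], local[d]);
}
```

With extents 160, 256, 173 and a work-group size of 4, 4, 4 this gives 160, 256, 176, so the last layer of work-groups along the third dimension contains work-items with no array element behind them.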


__local float Minvals[ WORKGROUPSIZE ];
__local float Maxvals[ WORKGROUPSIZE ];
Minvals[gid_local] = input[gid];
Maxvals[gid_local] = input[gid];
barrier(CLK_LOCAL_MEM_FENCE);
int offset = local_size/2;
// Local reduction
for (; gid_local < offset; offset /= 2)
{
    Minvals[gid_local] = fmin(Minvals[gid_local], Minvals[gid_local + offset]);
    Maxvals[gid_local] = fmax(Maxvals[gid_local], Maxvals[gid_local + offset]);
}

With
Minvals[gid_local] = input[gid];
Maxvals[gid_local] = input[gid];
I am copying the values from global memory to local memory. Of course, if the work-group is at the edge of the array, the local memory is not filled completely.
So the local reduction is pretty worthless there, because it reads entries that were never written (what do the local buffers contain initially?).
I am also unsure about get_local_size(0…2). Does it always return the work-group size that was specified through clEnqueueNDRangeKernel(…)?
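
One common way to deal with a partially filled work-group is to let the out-of-range work-items store neutral values (FLT_MAX for the minimum, -FLT_MAX for the maximum) instead of skipping the store, so the reduction never reads local memory that was not written. The sketch below emulates a single work-group in plain C on the host to illustrate the idea. It is not the kernel from the pastebin: it assumes a flat 1D index instead of the 3D indexing used there, and group_min_max and its parameters are hypothetical names.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define WORKGROUPSIZE 64 /* 4 * 4 * 4, as in the post */

/* Emulates one work-group: each loop iteration over `lid` stands for one
   work-item. `n` is the number of valid elements in `input`, `group_start`
   the flat global index of the group's first work-item (an assumption). */
void group_min_max(const float *input, size_t n, size_t group_start,
                   float *out_min, float *out_max)
{
    float minvals[WORKGROUPSIZE]; /* stands in for __local Minvals */
    float maxvals[WORKGROUPSIZE]; /* stands in for __local Maxvals */

    assert(input != NULL && out_min != NULL && out_max != NULL);

    /* Load phase: every work-item writes its slot. Out-of-range items
       write neutral values that cannot change the result. */
    for (size_t lid = 0; lid < WORKGROUPSIZE; ++lid) {
        size_t gid = group_start + lid;
        int in_range = gid < n;
        minvals[lid] = in_range ? input[gid] : FLT_MAX;
        maxvals[lid] = in_range ? input[gid] : -FLT_MAX;
    }
    /* In a kernel: barrier(CLK_LOCAL_MEM_FENCE) here. */

    /* Tree reduction: the loop runs over offsets, so in a kernel every
       work-item would reach the barrier at the end of each step. */
    for (size_t offset = WORKGROUPSIZE / 2; offset > 0; offset /= 2) {
        for (size_t lid = 0; lid < offset; ++lid) {
            /* A kernel would use fmin/fmax here. */
            if (minvals[lid + offset] < minvals[lid])
                minvals[lid] = minvals[lid + offset];
            if (maxvals[lid + offset] > maxvals[lid])
                maxvals[lid] = maxvals[lid + offset];
        }
        /* In a kernel: barrier(CLK_LOCAL_MEM_FENCE) here. */
    }

    *out_min = minvals[0];
    *out_max = maxvals[0];
}
```

Calling group_min_max with a group_start whose group extends past n gives the min and max of only the valid elements, because the padded slots hold FLT_MAX and -FLT_MAX. The range check then guards only the global read, not the store to local memory.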

What does CL_INVALID_COMMAND_QUEUE indicate?

Many thanks in advance…