Vectorized reduction on embedded device fails at certain data size

Hi all,

working on my Master’s Thesis, I encountered a strange problem regarding a simple openCL reduction example. The code runs on an embedded GPU, so this is why I vectorized it. The algorithm is taken from some nVidia example slides, so I consider it do be a reasonable choice for desktop GPUs. Because I want to show, how much negative impact the lack of shared memory on my embedded device has, I ported the kernel and exchanged local with global memory, leaving the rest ot the algorithm besides vectorization untouched. I am aware of the severe performance decrease connected with that, but that is not the point. On a GeForce GTX 590, I can run my code with vectors up to 130048 elements without any problems. Increasing the element count by another 256 however, I get an CL_INVALID_COMMAND_QUEUE error at clFinish and the kernel produces faulthy results.

After a bit of investigation I found out, that this behaviour occurs if something before the clFinish call went went wrong and is not necessarily related to this call. This is why I check all returned error values of all calls, but no luck so far. Any suggestions, what could go wrong?

My kernel looks like the following:


__kernel void reduce_vectorized(__global float *in, __global float *out, __global float *cache, int vectorLength)
{
	size_t id_global = get_global_id(0);
	size_t wg_dim = get_local_size(0);
	size_t num_wgs = get_num_groups(0);
	size_t id_local = get_local_id(0);
	size_t id_group = get_group_id(0);
		
	if((id_global + 1) * 4 <= vectorLength)
	{
		float4 copy2cache = vload4(id_global, in);
		vstore4(copy2cache, id_global, cache);
	}
	else if(id_global * 4 < vectorLength)
	{
		int overhang = vectorLength - id_global * 4;
		float arr[4] = {0, 0, 0, 0};
		for(int i = 0; i < overhang; ++i) {
			arr[i] = in[id_global * 4 + i];
		}
		vstore4((float4)(arr[0], arr[1], arr[2], arr[3]), id_global, cache);	
	}
	else 
	{
		vstore4((float4)(0.f), id_global, cache);
	}

	barrier(CLK_GLOBAL_MEM_FENCE);

	for(int n = wg_dim / 2; n >= 1; n >>= 1) {
		if(id_local < n){
			float4 tmp = vload4(id_global, cache) + vload4(id_global + n, cache);
			vstore4(tmp, id_global, cache);
		}
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
		
	if(id_local == 0) {
		float4 copy2input = vload4(id_global, cache);
		out[id_group] = dot(copy2input, (float4)(1.f));
	}
}

The kernel is called in a loop with localSize = 256 and globalSize = DATA_SIZE/4, whereas DATA_SIZE = (127 * LOCAL_SIZE * 4) for the last working vector size and DATA_SIZE = (128 * LOCAL_SIZE * 4) for the first failing. After each execution, in- and output buffers are switched. This is done until the reduction is finished. If anyone is interested, I can post the full working minimal example code as well. I hope to get some help with this, because this really confuses me a lot.

Best,
rschiewer