Out of resources error in a for-loop on an NVIDIA card

Hi,

When I run this loop on an AMD card, it works fine. On an NVIDIA card, however, the maximum usable value of max_hours is around 500,000; beyond that, the OpenCL driver crashes with an Out of resources error message. The driver actually dies and the application needs to be restarted. The error shows up on the first function called after clEnqueueNDRangeKernel.

__kernel void vec_custom_f(__global const float *qw, const int gw_Idx,
                           __global const float *gV, const int gV_Idx,
                           __global float *tfluid, const int tfluid_Idx,
                           const int gLen,
                           const int istart,
                           const int my_hour_start,
                           const int max_hours)
{
    double aSum;

    for (size_t my_hour = my_hour_start + get_global_id(0); my_hour < max_hours; my_hour += get_global_size(0))
    {
        aSum = 0;
        for (size_t i = istart; i < my_hour; i++)
        {
            aSum += qw[gw_Idx + i - 1] * gV[gV_Idx + gLen - my_hour + i];
        }
        tfluid[tfluid_Idx + my_hour] = aSum;
    }
}
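
For comparing results between the two cards, a serial reference can help. The following is a minimal C sketch of what the kernel above computes, with the work-item stride replaced by a plain loop over my_hour. The function name vec_custom_f_ref is hypothetical, and the sketch assumes gw_Idx + istart >= 1 so that the qw index never goes negative.

```c
#include <stddef.h>

/* Serial reference for vec_custom_f (hypothetical helper, not part of the
   original host code). Uses the same indexing as the kernel.
   Assumes gw_Idx + istart >= 1 and that gV holds at least gV_Idx + gLen
   elements. */
void vec_custom_f_ref(const float *qw, int gw_Idx,
                      const float *gV, int gV_Idx,
                      float *tfluid, int tfluid_Idx,
                      int gLen, int istart,
                      int my_hour_start, int max_hours)
{
    for (int my_hour = my_hour_start; my_hour < max_hours; my_hour++) {
        double aSum = 0.0;  /* accumulate in double, as the kernel does */
        for (int i = istart; i < my_hour; i++) {
            aSum += qw[gw_Idx + i - 1] * gV[gV_Idx + gLen - my_hour + i];
        }
        tfluid[tfluid_Idx + my_hour] = (float)aSum;
    }
}
```

Note that the inner loop runs my_hour - istart times for each my_hour, so the total work grows roughly with the square of max_hours.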

I have double-checked everything, and my conclusion is that the code somehow hits a hardware resource limit inside the NVIDIA GPU once max_hours is big enough. The global size I specify is the largest power of two less than max_hours. Is there a standard pattern for processing such “dot-product” loops that makes them independent of the data length and the hardware?
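
To make the global size rule concrete, here is a small C sketch. It assumes "power of two less than max_hours" means the largest power of two strictly below max_hours; pow2_below is a hypothetical helper name, not something from the actual host code.

```c
#include <stddef.h>

/* Largest power of two strictly less than n (hypothetical helper).
   Returns 1 for n < 2, since a global size of 0 would be invalid. */
size_t pow2_below(size_t n)
{
    if (n < 2) {
        return 1;
    }
    size_t p = 1;
    /* p * 2 < n, written without the multiplication to avoid overflow */
    while (p <= (n - 1) / 2) {
        p *= 2;
    }
    return p;
}
```

With max_hours around 500,000 this gives a global size of 262144, so some work-items go through the outer loop twice.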

Thanks!
Atmapuri

More info:

If I set a fixed iteration count on the inner for-loop and break the kernel into multiple consecutive calls, the same error happens after about 2.5 s. I have tried queuing other kernels, and there is no fixed time-out.
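
A minimal serial C sketch of that kind of split, under stated assumptions: each pass covers at most iter_count values of the inner index and adds its partial sums into a double accumulator that the caller has zeroed. The names vec_custom_f_pass, vec_custom_f_split, acc, i_begin and iter_count are hypothetical; in the real host code each pass would be one clEnqueueNDRangeKernel call.

```c
#include <stddef.h>

/* One pass: for every my_hour in [my_hour_start, max_hours), add the terms
   with i in [i_begin, i_begin + iter_count) to acc[my_hour].
   Terms with i >= my_hour are skipped, matching the original inner loop. */
void vec_custom_f_pass(const float *qw, int gw_Idx,
                       const float *gV, int gV_Idx,
                       double *acc, int gLen,
                       int my_hour_start, int max_hours,
                       int i_begin, int iter_count)
{
    for (int my_hour = my_hour_start; my_hour < max_hours; my_hour++) {
        int i_end = i_begin + iter_count;
        if (i_end > my_hour) {
            i_end = my_hour;
        }
        for (int i = i_begin; i < i_end; i++) {
            acc[my_hour] += qw[gw_Idx + i - 1] * gV[gV_Idx + gLen - my_hour + i];
        }
    }
}

/* Driver: repeat the pass until every i in [istart, max_hours - 1) is
   covered. acc must hold max_hours elements, all zero on entry. */
void vec_custom_f_split(const float *qw, int gw_Idx,
                        const float *gV, int gV_Idx,
                        double *acc, int gLen, int istart,
                        int my_hour_start, int max_hours,
                        int iter_count)
{
    for (int i_begin = istart; i_begin < max_hours - 1; i_begin += iter_count) {
        vec_custom_f_pass(qw, gw_Idx, gV, gV_Idx, acc, gLen,
                          my_hour_start, max_hours, i_begin, iter_count);
    }
}
```

Each pass does a bounded amount of work per my_hour, but the number of passes grows with max_hours, so the total work is unchanged.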