Iterations And GlobalRange Difficulty (Related?)

wilson · April 3, 2010, 6:55pm

Hi everyone!

I’m benchmarking a parallelized algorithm and I wanted to run 1000 iterations of it.

The thing is, my global range for a single iteration is 82369. To run those 1000 iterations, I decided to multiply 82369 by 1000 and then compute the offset inside the kernel. The problem is that this raises CL_OUT_OF_RESOURCES at enqueueReadBuffer.

Then I tried a for loop inside the kernel running 1000 times, keeping the global range at 82369, but the same CL_OUT_OF_RESOURCES came out, also at enqueueReadBuffer.

Then I tried some variations, like a global range of 82369*20 with a for loop of 50 iterations, but CL_OUT_OF_RESOURCES always came out.

I couldn’t find any explanation for this, nor any information on how the global range relates to the number of loop iterations inside the kernel.

If someone has an idea why this happens, I would appreciate it (if this is even possible).

This is the kernel code (most of it):


__kernel void square(__constant unsigned long NBTS, 
					 __constant unsigned long GRID_SIZE_X, 
					 __constant unsigned long GRID_SIZE_Y, 
					 __constant unsigned long radix, 
					 __global unsigned long *BTSET, 					 
					 __global long *fitness,
					 __global int *covered_points
					
					 ) 
{	

	long globaID = get_global_id(0);

	__local long GRID_SIZE;
	
	if (globaID==0)
	 GRID_SIZE = GRID_SIZE_X*GRID_SIZE_Y;
	
	barrier(CLK_LOCAL_MEM_FENCE);

	int contador;
	long x, y;
	long x_k, y_k;
	long x1, y1,rx,ry;

	float cover_rate, fit;	
	int check=0,j,i,m;

	// offset
	
	long idy = globaID / GRID_SIZE;
	long idx = globaID - idy * GRID_SIZE;

	if (idx==0)
	{ 
		*covered_points = 0; //
	}

	for (j=0; j<1; j++) // <--- HERE'S WHERE I PUT THE 1000 ITERATIONS
	{

		contador = 0;

		for(m=0; m<NBTS; m++) 
		{
			if(BTSET[m]==idx) 
			{
				...
			}
		}

			if (check!=1) 
			{ 
					
					for (i=0; i<NBTS; i++) 
					{							
					...
					}

			}

	}

}

If you would like some more information, please just ask.

Thanks in advance …

dbs2 · April 4, 2010, 5:03am

Two questions:

1. What is your total global size? (I believe the Nvidia driver currently has a terribly small global size limit of 65k or something.)
2. What is GRID_SIZE set to if you’re not work-item 0? It seems to be uninitialized.

Rui · April 4, 2010, 5:25pm

I’m experiencing the same problem executing two nested for loops. Given the following code (launching 44100 work-items, the size of the addresses array), the card driver eventually crashes. It only allows me to run the outer for loop about 100 times. I don’t know if it is due to memory usage…


		long idx = get_global_id(0);
		long x_p, y_p;
		long x_a, y_a, rx, ry;

		y_p = addresses[idx] / 287;
		x_p = addresses[idx] - y_p * 287;

		
		for (int i=0; i<1000; i++) {

			for (int j=0; j<49; j++) {

				y_a = bt[j] /287;
				x_a = bt[j] - y_a * 287;

				rx = abs_diff(x_a, x_p);
				ry = abs_diff(y_a, y_p);
				
				
			        if (rx<=30 && ry<=30) 
				addresses[idx] = 1;	

			}
		}

What could be preventing it from running as it is supposed to?

Thanks

wilson · April 4, 2010, 5:55pm

Hi!
Thanks for the reply!
About question no. 2: I did it that way as an attempted optimization, but now I realize that I only initialize it for the first work-group.

But even if I use it as a global, the problem persists.

The global size is at minimum 82369. But with this value I need to do 1000 iterations per kernel. And this is the problem, because it returns CL_OUT_OF_RESOURCES and I can’t find any reason why this happens.

The reason I talked about multiplying 82369 by 1000 was to try to replace the loop inside the kernel with extra work-items, if you know what I mean!

Thanks in advance!
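
For reference, the index arithmetic behind that idea (replacing the in-kernel loop with 1000 times as many work-items) might look like the sketch below, written in plain C so it can be checked on the host. The names `flat_index` and `split_global_id` are hypothetical and not from the posted kernel, and 82369 is simply the single-iteration global range quoted above. Inside a real kernel the input would come from get_global_id(0).

```c
#include <stddef.h>

/* Assumption: the single-iteration global range quoted in the thread. */
#define POINTS_PER_ITERATION 82369u

/* Hypothetical helper type: a flattened global ID split into two parts. */
typedef struct {
    size_t iteration; /* which repeated run this work-item belongs to */
    size_t point;     /* index within one run, 0 .. POINTS_PER_ITERATION - 1 */
} flat_index;

/* Split a flattened global ID (0 .. 82369 * 1000 - 1) into
   the iteration number and the per-iteration point index. */
flat_index split_global_id(size_t global_id)
{
    flat_index r;
    r.iteration = global_id / POINTS_PER_ITERATION;
    r.point     = global_id % POINTS_PER_ITERATION;
    return r;
}
```

The largest flattened ID here is 82369 * 1000 - 1 = 82368999, which still fits in a 32-bit integer, so the arithmetic itself is not where the range would overflow.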

dbs2 · April 5, 2010, 8:00am

Are you using Nvidia’s drivers? If so, check their release notes because I don’t think they support a global size > 65,535.

Also, if your kernel is taking too long (say longer than 5 seconds), the system watchdog timer on most machines will kill your program. This is true on Mac OS X and, I believe, Windows/Linux, unless you are using a dedicated (i.e., non-display) card for computing.
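
If the watchdog is the cause, one possible workaround is to split the 1000 iterations across several shorter kernel launches instead of one long one. The sketch below only plans the batches on the host in plain C. The names `batch` and `plan_batches` are hypothetical, and the batch size of 100 is an assumption based on the roughly 100 outer iterations reported earlier in the thread, not a measured limit.

```c
#include <stddef.h>

/* Hypothetical helper type: describes one kernel launch. */
typedef struct {
    unsigned first_iteration; /* first iteration handled by this launch */
    unsigned count;           /* number of iterations in this launch */
} batch;

/* Split total_iterations into launches of at most per_launch iterations.
   Writes at most max_batches entries to out and returns the number of
   batches needed (0 if per_launch is 0). */
size_t plan_batches(unsigned total_iterations, unsigned per_launch,
                    batch *out, size_t max_batches)
{
    size_t n = 0;
    unsigned done = 0;

    if (per_launch == 0)
        return 0;

    while (done < total_iterations) {
        unsigned remaining = total_iterations - done;
        unsigned count = remaining < per_launch ? remaining : per_launch;

        if (n < max_batches) {
            out[n].first_iteration = done;
            out[n].count = count;
        }
        n++;
        done += count;
    }
    return n;
}
```

Each planned batch would then correspond to one clEnqueueNDRangeKernel call, with `first_iteration` and `count` passed as kernel arguments and a clFinish between launches so that no single launch runs long enough to trip the timer.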

Rui · April 5, 2010, 8:03am

Yes, Nvidia drivers here…

Kratzy974 · April 5, 2010, 11:27pm

A global size of something like 140 MB is allowed. Constant memory is limited to 64k, and local memory, depending on the chip, to between 16 and 32k.

I had a similar problem when using loops that were too large. So my outer loop isn’t in the kernel anymore; instead I use more work-groups (several thousand). If you use the same data for all work-groups, you should still get a good time measurement.

dbs2 · April 6, 2010, 5:52am

I think there’s some confusion here. I’m not talking about a global memory size when I say there is (was?) a limitation in the Nvidia drivers. I have heard several people say that global_x * global_y * global_z must be <= 65535 to run on Nvidia’s drivers. Can anyone confirm if this is still the case? I know it’s not a hardware limitation since the same card will happily run arbitrarily large global sizes under Mac OS X.

wilson · April 7, 2010, 2:56pm

Hello everyone!

Just writing to say that my problem was solved.

The problem wasn’t the global range size, but bad management of resources in the kernel. It went away once I optimized my kernel, and by this I mean defining a suitable local range and declaring only the variables I need, each with the type it needs.

And that was it.

So, answering dbs2, I believe that the Nvidia limit of 65535 doesn’t exist, at least with the latest drivers available!

Thank you all for the help!

Take care!
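
On choosing a local range for a global range like 82369: in OpenCL 1.x, when an explicit local work size is passed to clEnqueueNDRangeKernel, the global work size must be evenly divisible by it. 82369 is odd, so no power-of-two local size divides it. A common approach is to round the global size up and have the extra work-items exit early. The sketch below shows that arithmetic in plain C. The helper names are hypothetical, and the local size of 64 in the usage note is only an illustrative assumption, not the value used in the thread.

```c
#include <stddef.h>

/* Round global_size up to the next multiple of local_size.
   Returns 0 if local_size is 0. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    size_t remainder;

    if (local_size == 0)
        return 0;

    remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Guard for the padded work-items: returns 1 if global_id maps to
   real data, 0 if it belongs to the padding and should do nothing. */
int is_active_work_item(size_t global_id, size_t real_size)
{
    return global_id < real_size;
}
```

With a local size of 64, a global range of 82369 would be padded to 82432, and the kernel would start with a check equivalent to `is_active_work_item(get_global_id(0), 82369)` before touching any buffer.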