Local and global work size limits

Hello everyone,

my problem comes up regularly on OpenCL forums, but unfortunately I have not managed to solve mine.
Firstly, my graphics card is an Nvidia Quadro K620, which reports a MAX_WORK_ITEM_SIZES of 1024 / 1024 / 64 and a DEVICE_REGISTERS_PER_BLOCK_NV of 65536.

Naively (maybe), I would like to maximize the global_work_size of my program in order to parallelize as much kernel computation as possible.
I have read that the global_work_size can be “as big as I want”, as long as it is a multiple of local_work_size, and that local_work_size cannot exceed MAX_WORK_ITEM_SIZES (ref : opencl - Questions about global and local work size - Stack Overflow ).

However, I cannot go beyond this configuration without errors:
size_t local_work_size[1] = {512};
size_t global_work_size[1] = {32768};

And my question is: why?
Why can I not define a global_work_size as big as I want?
When I exceed the DEVICE_REGISTERS_PER_BLOCK_NV value, my program fails with error -5: "CL_OUT_OF_RESOURCES". I obtain the same error with local_work_size = 1024.
So, why can I not define local_work_size[1] = {1024} and global_work_size[1] = {1048576}, for example?

Thank you in advance!

It is hard to tell without extra code, but most likely your kernel uses so many resources (local memory, registers…) per work-item that a local work size of 1024 is not workable.

There are actually multiple hardware limits that restrict the scalability of GPU programs in terms of local work size. The most common are:

[ul]
[li]A hard limit on the local work size, for any kernel, both overall (CL_DEVICE_MAX_WORK_GROUP_SIZE) and per dimension (CL_DEVICE_MAX_WORK_ITEM_SIZES).[/li]
[li]Local memory usage per work-group (CL_DEVICE_LOCAL_MEM_SIZE), which usually increases with the local work size because most programs consume a fixed amount of local memory per work-item.[/li]
[li]Registers per work-group (not specified by standard OpenCL, though apparently you found an NVidia extension which reports it); again, programs consume a fixed number of registers per work-item.[/li]
[/ul]
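
In practice you can query most of these limits at runtime. Here is a minimal host-side sketch (assuming “device” and “kernel” are your already-created cl_device_id and cl_kernel); CL_KERNEL_WORK_GROUP_SIZE is especially handy because it already accounts for your kernel’s register and local memory usage:


size_t max_wg_size, kernel_wg_size, max_item_sizes[3];
cl_ulong local_mem_size;

/* Device-wide limits */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg_size), &max_wg_size, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(max_item_sizes), max_item_sizes, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem_size), &local_mem_size, NULL);

/* Per-kernel limit, taking this kernel's register/local memory usage into account */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(kernel_wg_size), &kernel_wg_size, NULL);

printf("Device max work-group size : %zu\n", max_wg_size);
printf("Kernel max work-group size : %zu\n", kernel_wg_size);
printf("Local memory per work-group: %llu bytes\n", (unsigned long long)local_mem_size);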

Also note that, beyond the above, it can sometimes be good for performance to ensure that a given compute unit runs as many work-groups as possible. This is, again, dictated by work-group resource consumption.

Summary: if you want to use a local work size of 1024 (which may or may not matter; most programs run better with a local work size somewhat smaller than the maximum allowed by the device), it seems you will need to find a way to use either less shared memory (this is something YOU control directly) or fewer registers (this is more compiler-dependent, but usually the more complex the kernel and the more private variables it has, the more registers it will need). The NVidia profiler can tell you which one is the bottleneck, but knowing NVidia’s usual track record when it comes to OpenCL, I wonder if there is a way to make it work for you here, or if it will tell you to get lost because you’re not using CUDA.

Note that this does not, of course, limit the global work size you can use in and of itself. You can very well have a global work size of 512 000 000 with a local work size of 512, as long as you do not exceed one of the device’s global resource consumption limits (e.g. global memory). Note also that changing the global work size does not, in and of itself, increase program parallelism: as soon as you fully occupy all of your device’s compute units, your program is as parallel as it can be, and extra performance is usually achieved by things like loop unrolling, reducing parallelism in order to more closely match your device capabilities, resolving local memory bank conflicts, etc.

Small correction in the last line : “reducing parallelism in order to more closely match your device capabilities” => “reducing global work size in order to more closely match your device capabilities”. I fell victim to the very point I was trying to make : once all compute units are busy, adding more work does not make a program more parallel.

Thank you HadrienG for your extensive reply. I have tried different configurations but I am still unable to increase the global_work_size beyond 2^16, whatever the local_work_size value (maybe due to significant shared memory usage inside my program, as you suggested?).
However, I can parallelize 2^16 - 1 threads. So concretely, what are my alternatives to reach 2^24 threads, for example? I understand that it cannot be performed directly in one execution of 2^24 kernels.
I tried to wrap the “EnqueueNDRangeKernel()” command in a for loop to reach 2^24 threads, but my results were not independent (the same result was just multiplied by the loop count).

It’s quite strange that your global work size is limited in such a drastic way. Could you post your OpenCL host and kernel code so that we can have a closer look?

Ok, I have localized (I think) my memory consumer!
I am using an external library named clRNG, which allows generating random numbers inside a kernel. When I remove all calls to the clRNG library (such as the function “clrngMrg32k3aRandomU01()”), I can raise the global_work_size to 2^21 kernels with a local_work_size of 512.
But now… what can I do? I have to generate these random numbers.

Generate them up front and write them into a buffer, if you know an upper bound on how many random numbers you need. If you do not require strong randomness, implement a trivial RNG yourself, using the global thread id to generate a seed. Or replace the single kernel invocation with multiple invocations of the size clRNG allows. Finally, you can alter your kernel so that each work-item computes several values. Example:


__kernel void test(__global int* out){
int i = get_global_id(0);
out[i] = i;
}

is replaced by


__kernel void test(__global int* out, int iterations){
int k = get_global_id(0);
for (int j = 0; j < iterations; ++j){
int i = k + j * get_global_size(0);
out[i] = i;
}
}
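
On the host side, that would look roughly like this (a sketch only; buffer setup and error checking are omitted, and the argument indices and the names out_buffer / total_values are just assumptions):


int iterations = 16;
size_t local_work_size = 512;
size_t global_work_size = total_values / iterations; /* assumed to stay a multiple of local_work_size */

/* each work-item now writes `iterations` values, so the NDRange shrinks accordingly */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &iterations);
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);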

thank you Salabar for your help!

So, first, I am trying to replace the clRNG library with another algorithm from “Numerical Recipes in C, The Art of Scientific Computing, Second Edition” (http://www2.units.it/ipl/students_area/imm2/files/Numerical_Recipes.pdf), page 283.


#define MBIG 1000000000
#define MSEED 161803398
#define MZ 0
#define FAC 1.0E-9

double RandomGen(__private char Type,__private long Seed)
{
	int inext, inextp;
	long ma[56]; // ma[0] is not used.
	long mj, mk;
	short i, ii, k;

	if (Type == 0)
	{ // set seed.
		mj = abs(MSEED - abs(Seed));
		mj %= MBIG;
		ma[55] = mj;
		mk = 1;
		for (i = 1; i <= 54; i++)
		{
			ii = (21 * i) % 55;
			ma[ii] = mk;
			mk = mj - mk;
			if (mk < MZ)
				mk += MBIG;
			mj = ma[ii];
		}
		for (k = 1; k <= 4; k++)
			for (i = 1; i <= 55; i++)
			{
				ma[i] -= ma[1 + (i + 30) % 55];
				if (ma[i] < MZ)
					ma[i] += MBIG;
			}
		inext = 0;
		inextp = 31;
	}
	else if (Type == 1)
	{ // get a number.
		if (++inext == 56)
			inext = 1;
		if (++inextp == 56)
			inextp = 1;
		mj = ma[inext] - ma[inextp];
		if (mj < MZ)
			mj += MBIG;
		ma[inext] = mj;
		return (mj * FAC);
	}
}

#undef MBIG
#undef MSEED
#undef MZ
#undef FAC

__kernel void TEST( __global float* F)
{
RandomGen(0,get_global_id(0));
F[get_global_id(0)] = RandomGen(1,0);
}

But unfortunately, it produces only 0 values in output.
I can’t see where the problem is… Perhaps it is due to where the memory lives? :roll:

Correct me if I am wrong, but it seems to me that the RandomGen function that you have posted is entirely stateless: it mutates no global state, and its input is passed by value and so cannot be modified. Thus, for a given set of parameters, I would expect it to always produce the same value as an output.

In this case, I expect the first function call of your kernel (“RandomGen(0,get_global_id(0));”) to have no visible effect, and the second vectorized call (“F[get_global_id(0)] = RandomGen(1,0);”) to produce the same value for all threads since its parameters never change.

I do not know about the algorithm you are using, but I think you might have misunderstood something about it. From my understanding of the code, in order for this function to produce a different output on each run, you likely need to preserve at least the values of inext, inextp and ma[] across calls (since these are required by RandomGen but not initialized for Type=1). Note also that you might need to think a bit about how to make this algorithm both thread-safe and memory-efficient: since ma[] is relatively large, keeping it private per-thread might be exceedingly costly in terms of private memory use.

I agree with HadrienG, this cannot really work out too well in the first place. Something like this might be more fitting: Xorshift - Wikipedia
Here is the idea of how this could work on a GPU:


ulong genRand(ulong* i){
	(*i) ^= (*i) >> 12; // a
	(*i) ^= (*i) << 25; // b
	(*i) ^= (*i) >> 27; // c
	return (*i) * 2685821657736338717UL;
}
__kernel void whatever(){
   ulong seed = (get_global_id(0) + 1111) * (get_global_id(0) + 1011) * (get_global_id(0) + 1012); // You should play around with ways to generate an initial value, a raw id may result in boring patterns
  printf("%lu ", genRand(&seed));
  printf("%lu ", genRand(&seed));
  printf("%lu ", genRand(&seed));
}

In order to use functions in your kernel, you’d need to propagate this seed pointer around. This can become a complete PITA, but in terms of compute and register usage, this is the most efficient implementation I can think of.
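
If you need uniform values in [0, 1) like clrngMrg32k3aRandomU01() produces, a possible helper on top of the generator above could look like this (a sketch; the seed mixing and the kernel/buffer names fill_random / F are just illustrative):


float genRandU01(ulong* state){
	// keep the top 24 bits of the 64-bit output and scale to [0, 1)
	return (float)(genRand(state) >> 40) / 16777216.0f;
}

__kernel void fill_random(__global float* F){
	ulong seed = (get_global_id(0) + 1111) * 2685821657736338717UL;
	F[get_global_id(0)] = genRandU01(&seed);
}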

thank you for your help HadrienG and Salabar!

yesterday, I implemented the “Tiny Mersenne Twister” ( http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/TINYMT/ ) PRNG in my kernel. I was able to increase the global_work_size up to 2^21 threads, which could be satisfactory.
I will try to replace it with the code suggested by Salabar and see if I can increase the global_work_size further.

Hello again everyone,

I have recently discovered an unfortunate mistake in a big while-loop of my program. My computation limits are now:
size_t local_work_size[1] = {512};
size_t global_work_size[1] = {8192};

I thought of adding a for-loop around my EnqueueNDRangeKernel call in order to obtain a higher number of threads. For example:

for (int repeat = 0; repeat<10; repeat++)
{
clError = EnqueueNDRangeKernel(command_queue, kernel, 1, global_work_size, local_work_size);
}

Firstly, is it a good idea?:confused:
and secondly, I think I have to free the private memory of all my kernels after the EnqueueNDRangeKernel call (my result is saved in global memory). However, I haven’t found anything on this subject. What can I do?

Thank you in advance.

Sorry, I did not copy the right code:

short REPEAT_MAX = 3;

	for (int repeat = 0; repeat<REPEAT_MAX; repeat++)
	{
		clError = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
	}

I saw on some forum that I had to add an event to the clEnqueueNDRangeKernel call and release it right after, but it does not change anything…

	
short REPEAT_MAX = 3;
cl_event event;
	for (int repeat = 0; repeat<REPEAT_MAX; repeat++)
	{
		clError = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, &event);
		clReleaseEvent(event);
	}

The private memory of your kernel is private to each of its work-items, and automatically freed once a work-item completes. I think you probably meant something else.

You do not need to create an event if you don’t want one. It is legal to pass a nullptr to the “event” parameter of a clEnqueue call.

Regarding whether calling the kernel N times instead of doing one single clEnqueueNDRangeKernel would be appropriate, I have a question: if you call the same kernel N times, with the same parameters, how do you prevent each iteration from 1) leading to the same result as the previous one and 2) overwriting the results of the previous one?

Thank you HadrienG for your reply.

I think you probably meant something else.

I don’t know how to execute more than 8192 kernels; this is my main issue. So I just tried to multiply the parallelism with a for-loop. When I increase this number, my computer freezes and the well-known error -5: "CL_OUT_OF_RESOURCES" appears.

1) leading to the same result as the previous one and 2) overwriting the results of the previous one?

Really pertinent questions, thank you! The random number generator implemented in the kernels is initialized/seeded with the call/execution time, so I expect to obtain different results on each thread.
Concerning the overwriting, my final result is the sum over all kernels (atomicAdd). Thus, I think that calling the same code several times would not overwrite, but rather add each thread’s results.
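
Just to illustrate what I mean by summing with atomicAdd (a sketch rather than my exact code; the names accumulate_int, atomic_add_float and partial are made up): the built-in atomic_add works for integers, and for a float accumulator the usual compare-and-swap workaround is needed since OpenCL 1.x has no float atomic_add.


__kernel void accumulate_int(__global int* sum, __global const int* partial){
	// each work-item adds its contribution to a single global counter
	atomic_add(sum, partial[get_global_id(0)]);
}

// float version: emulate atomic addition with atomic_cmpxchg
inline void atomic_add_float(volatile __global float* addr, float val){
	union { uint u; float f; } old_val, new_val;
	do {
		old_val.f = *addr;
		new_val.f = old_val.f + val;
	} while (atomic_cmpxchg((volatile __global uint*)addr, old_val.u, new_val.u) != old_val.u);
}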

Well… I have tested this code and it works. (!):

	size_t local_work_size[1] = {512};//{NUMBER_OF_PHOTONS/NUMBER_OF_WORK_GROUP}; // < 1024 because 32*32 = 1024 = the device's CL_DEVICE_MAX_WORK_GROUP_SIZE!
	size_t global_work_size[1] = {NUMBER_OF_THREADS};//{NUMBER_OF_PHOTONS}; //2097152 // 1048576 // NUMBER OF SIMULATED PHOTONS // MUST BE A MULTIPLE OF local_work_size
	
	short REPEAT_MAX = 128;
	for (int repeat = 0; repeat < REPEAT_MAX; repeat++)
	{
		randomtime++;
		error = clSetKernelArg(kernel, 11, sizeof(char), &randomtime);
		clError = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
		clFinish(command_queue);
	}

My acceleration is about x10 (40 minutes on CPU down to 4 minutes on GPU).
Of course, if someone has a better idea than mine, I’ll take it :smiley:

[QUOTE=axisse;39778]Well… I have tested this code and it works. (!):

	size_t local_work_size[1] = {512};//{NUMBER_OF_PHOTONS/NUMBER_OF_WORK_GROUP}; // < 1024 because 32*32 = 1024 = the device's CL_DEVICE_MAX_WORK_GROUP_SIZE!
	size_t global_work_size[1] = {NUMBER_OF_THREADS};//{NUMBER_OF_PHOTONS}; //2097152 // 1048576 // NUMBER OF SIMULATED PHOTONS // MUST BE A MULTIPLE OF local_work_size
	
	short REPEAT_MAX = 128;
	for (int repeat = 0; repeat < REPEAT_MAX; repeat++)
	{
		randomtime++;
		error = clSetKernelArg(kernel, 11, sizeof(char), &randomtime);
		clError = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
		clFinish(command_queue);
	}

My acceleration is about x10 (40 minutes on CPU down to 4 minutes on GPU).
Of course, if someone has a better idea than mine, I’ll take it :-D[/QUOTE]

Would you share the complete source code of your program? I did some SPH simulations in OpenCL, and a speedup factor of 10 seems rather low for a compute-bound task.