Local memory simple counter

Hi,
I am trying to get local memory working efficiently. Even though it is useless in itself, I am counting the number of work items using a single local variable.

I am implementing something similar to the AMD reduction tutorial or the AMD TransferOverlap_Kernels.cl tutorial.

AMD Reduction tutorial


__kernel void clTestCounter( __global float* out_fData )
{
	__local int iCounterTest;

	// One work item per work-group zeroes the shared counter.
	if( get_local_id(0) == 0 )
	{
		iCounterTest = 0;
	}
	barrier(CLK_LOCAL_MEM_FENCE);

	// Serially: each work item takes its turn incrementing the counter,
	// with a barrier after every step so the increments do not race.
	for( uint i = 0; i < get_global_size(0); ++i )
	{
		if( i == get_local_id(0) )
		{
			iCounterTest += 1;
		}

		barrier(CLK_LOCAL_MEM_FENCE);
	}

	// Work item 0 writes out the final count.
	if( 0 == get_local_id(0) )
	{
		out_fData[0] = iCounterTest;
	}
}

Is there a more efficient way to do this?

Thanks.

In fact, I am wondering whether I should use GLSL or OpenCL to compute the reduction algorithm. With GLSL, I can read, for instance, 16 pixels per iteration from the global memory input, compare them in registers, and write the result directly to the global memory output. I then have to ping-pong between two buffers, swapping recursively until the size is reduced to 1.

In the AMD article, they first copy the data to local memory, but since local memory is so small, we have to execute multiple passes anyway. Why don't they have the work items read only from global memory, saving the results in a smaller global buffer that is recursively reduced until it has a size of 1? Isn't the synchronization code, with its iterations and conditions, doing more work than having each work item work independently?
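To make the idea concrete, here is a very rough sketch of the kind of pass I have in mind (the kernel name and the choice of fmax as the comparison are placeholders of mine, not code from the AMD sample; it assumes the input length is a multiple of 16):

__kernel void clReducePass( __global const float* in_fData, __global float* out_fData )
{
	const size_t base = get_global_id(0) * 16;

	// Each work item reduces 16 consecutive elements using only registers.
	float fMax = in_fData[base];
	for( int i = 1; i < 16; ++i )
	{
		fMax = fmax( fMax, in_fData[base + i] );
	}

	// One result per work item; the host swaps the in/out buffers and
	// re-enqueues this kernel on the 16x smaller output until one value is left.
	out_fData[get_global_id(0)] = fMax;
}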

if (lx == 0)
	result[0] = get_local_size(0);
?

Well … there’s not much point creating a nonsensical example because the best solution will depend entirely on the problem at hand.

Your example is serial anyway and doesn’t demonstrate how parallel reduction works. A parallel example would initialise an array of counters to counters[get_local_id(0)] = 1, and then add them up using a parallel prefix sum. Each thread should be doing work concurrently.
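Something along these lines, for instance - a minimal sketch using a tree-style sum rather than a full prefix scan, assuming the local work size is a power of two no bigger than 256:

__kernel void clParallelCount( __global float* out_fData )
{
	__local int counters[256];	// assumes local work size <= 256
	const uint lx = get_local_id(0);

	// Every work item contributes its own count concurrently.
	counters[lx] = 1;
	barrier(CLK_LOCAL_MEM_FENCE);

	// Tree reduction: halve the number of active work items each step
	// (assumes the local work size is a power of two).
	for( uint stride = get_local_size(0) / 2; stride > 0; stride >>= 1 )
	{
		if( lx < stride )
		{
			counters[lx] += counters[lx + stride];
		}
		barrier(CLK_LOCAL_MEM_FENCE);
	}

	// One partial result per work-group.
	if( 0 == lx )
	{
		out_fData[get_group_id(0)] = counters[0];
	}
}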

For a big reduction, using local memory is a big win over global memory because every single stage needs full communication between all the threads. You can also keep some data in registers to multiply the amount of data processed per work item.

The only way to communicate between work-groups through global memory is to (re)invoke a kernel.

I saw yesterday that my example only works with a very small work size.

If my work size is greater than CL_DEVICE_MAX_WORK_ITEM_SIZES, does it mean that calling clEnqueueNDRangeKernel will automatically execute my kernel many times?

My max work item sizes are 512/512/64. If I run it on a 1024x1024 work size, the result is fully computed, but the local memory counter never reaches 1024 * 1024, however serially inefficient that would be. My ultimate goal is to use local memory as an accumulator, thus saving many kernel passes, so I am beginning with simple and inefficient code. I must understand the basics before going further!

Is the local memory automatically reset between automatic kernel executions?

I did change the last condition of the example (local changed to global) to:


if( 0 == get_global_id(0) )

Well, welcome to the fun world of OpenCL hacking … “life wasn’t meant to be easy”, but that’s what makes it interesting.

I think you’re missing a fundamental issue here: it’s not that the local memory is reset between invocations, or that the job is executed many times - it’s that the hardware can have MANY concurrent work-groups executing in parallel on different physical parts of a device (or devices). In fact, that’s the whole point of OpenCL existing in the first place. There isn’t just one bit of local memory for up to 1000 threads; there are potentially dozens of chunks of local memory for a hundred threads each, and they all execute at the same time. Hence your code can’t possibly work unless you restrict it to a single sub-set of a single device.

A GPU-based implementation takes the global work size, splits it into chunks that fit on the device, and executes each in parallel. There is also a physical limit to how many of these can run (which also depends on the code, e.g. register usage), so for big problems, batches of jobs are executed serially until the work is complete (the limit is in the thousands on current devices).

One aspect of this is that local memory is necessarily only local to the work-group, so you cannot infer anything about its contents; another is that there is no coherency with global memory accesses, so you can’t communicate between work-groups using global memory (outside of atomic operations, some of which are prohibitively expensive on some devices). All you can say is that the local work-group executes together, but other than that the implementation is free to execute the task however it likes.
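As an aside: if all you really want is a global count, the atomic route mentioned above looks roughly like this (a sketch only; it needs a device with 32-bit global atomics, the host must zero the counter first, and it serialises on that one memory location, so it is usually a poor substitute for a real reduction):

__kernel void clAtomicCount( volatile __global int* out_iCount )
{
	// Every work item, in every work-group, bumps the same global counter.
	atomic_inc( out_iCount );
}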

This has (good) performance implications, e.g. you don’t need to worry about a globally coherent cache amongst 10+ devices since the programming model doesn’t guarantee anything outside of work-group granularity. And for local memory it means you get better-than-cache performance with far less power and fewer transistors.

So you have to solve the problem using what you’re given … which for this type of problem involves doing as much parallel work as you can in one or more passes - using local memory to divide the problem by a factor of 256 or more, say - and then completing it in a smaller pass which accumulates the results within a single work-group (which may make less efficient use of GPU resources, but is short enough not to matter).
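On the host that two-stage structure looks roughly like this (a sketch with made-up names and no error checking; it assumes N is 65536 so that the first pass produces exactly 256 partial results for the second pass to finish, and that the kernel reduces one work-group's worth of input into one output value):

size_t local  = 256;
size_t global = N;                                   /* pass 1: N items -> N/256 partials */
clSetKernelArg(reduceKernel, 0, sizeof(cl_mem), &inBuf);
clSetKernelArg(reduceKernel, 1, sizeof(cl_mem), &partialBuf);
clSetKernelArg(reduceKernel, 2, local * sizeof(cl_float), NULL);  /* __local scratch */
clEnqueueNDRangeKernel(queue, reduceKernel, 1, NULL, &global, &local, 0, NULL, NULL);

global = 256;                                        /* pass 2: 256 partials -> 1 result */
clSetKernelArg(reduceKernel, 0, sizeof(cl_mem), &partialBuf);
clSetKernelArg(reduceKernel, 1, sizeof(cl_mem), &resultBuf);
clEnqueueNDRangeKernel(queue, reduceKernel, 1, NULL, &global, &local, 0, NULL, NULL);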

It’s impossible to avoid multiple passes if you want to utilise the hardware fully, and how one breaks the problem up impacts performance in a device-dependent way - e.g. if your passes are too wide, then fewer can run concurrently, synchronisation might be more costly, and so on (e.g. on AMD hardware, using a local work size whose total x*y*z is 64 and setting the function attribute reqd_work_group_size to the same results in barriers being compiled out). There are also other tricks for parallel associative operations, such as using a local buffer twice as wide as necessary and padding it with zeros (so that all threads execute the same code and no step-sensitive branching is required), but that is also a trade-off since using more local memory may reduce device concurrency. And after all that, sometimes it’s just faster to use a pair of sqrt(N) loops, where say 16 threads process 16 items each, and then one thread does the 16 results - i.e. 32 very simple steps versus 7 quite complex ones for 256 items.
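For reference, the attribute mentioned above is written like this (the kernel name is made up, and whether the barriers really disappear depends on the device and driver):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void clReduce64( __global const float* in_fData, __global float* out_fData )
{
	/* ... reduction body as before, now compiled for a fixed 64-item work-group ... */
}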

Try looking for the paper “Scan Primitives for GPU Computing” (although I found that rather heavy going), or searching on “parallel sum”, “parallel prefix sum” and so on. All the SDKs have examples of this problem as well, since it is a very fundamental algorithm for GPU computing. There’s quite a bit of info about it out there to point you in the right direction.

Thanks a lot for the detailed information.

So, to make sure I understand: when the work size is bigger than CL_DEVICE_MAX_WORK_ITEM_SIZES, calling clEnqueueNDRangeKernel will generate many work-groups. And, as specified, local memory is shared only within one work-group. So my example will count the number of work items of the last work-group to write the result, which is of unpredictable size if my work size is not a multiple of the max work item size.

Here is a link to the Scan Primitives for GPU Computing paper.

So, to make sure I understand: when the work size is bigger than CL_DEVICE_MAX_WORK_ITEM_SIZES, calling clEnqueueNDRangeKernel will generate many work-groups.

Not quite. The number of work-groups that are executed is given by the global work size divided by the local work size that you pass to clEnqueueNDRangeKernel. If you are not passing a local work size, then the number of work-groups will be decided by the OpenCL implementation.
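For example, with an explicit local work size (the 16x16 choice here is just an illustration):

/* 1024x1024 global size with a 16x16 local size:
 * (1024/16) * (1024/16) = 64 * 64 = 4096 work-groups in total. */
size_t global[2] = { 1024, 1024 };
size_t local[2]  = { 16, 16 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);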

Also note that the number of work-groups that are actually executed in parallel depends entirely on your hardware. In particular, it will typically be a function of the number of compute units returned by CL_DEVICE_MAX_COMPUTE_UNITS.