Initializing __local variables

In my kernel I’m processing NxN chunks of an image stored in a buffer. Each chunk produces a single result, so each work group handles one chunk, and every work item in the group modifies a single __local variable with atomic updates. To get the right value, I need to initialize that variable to zero.

This is my approach:


	local uint result;
	result = 0;
	barrier(CLK_LOCAL_MEM_FENCE);

	... work ... (atomic_xyz(&result, ...))

	barrier(CLK_LOCAL_MEM_FENCE);
	if (get_local_id(0) == 0 && get_local_id(1) == 0) {
		global_chunk_results[...] = result;
	}

My question is whether this is the most efficient way to initialize a single local variable used by all work items within a group.

It’s my understanding that using a barrier when initializing a local variable to 0 is unnecessary and causes a pretty decent slowdown. I think a barrier is only required when loading things from global memory, or when writing things to global memory if you’re going to replace the data in that variable afterwards. So in this example the second barrier is required but the first is not, assuming you are syncing some data to or from a variable in your … work … section.

It is imperative that no work item starts “work” until result has been initialized, because they all share the result variable through atomic updates, and the only way to guarantee that is with a barrier. (AFAIK there is no alternative, since a local variable cannot be given an initializer in its declaration.)

Experimentally, it’s easy to demonstrate that the results are incorrect without the barrier. Also, the barrier at the beginning of the work item turns out to have a negligible impact on performance.
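
For reference, the initialize-then-barrier pattern discussed above could be sketched as a complete kernel roughly like this. This is only a sketch under assumptions: the kernel name chunk_sum, the image and width arguments, the use of atomic_add as the "work", and the group indexing of global_chunk_results are all made up for illustration and are not from the original code. It also assumes OpenCL 1.1 or later (where atomic_add on __local memory is a core built-in) and a 2D global size that exactly matches the image, so there is no bounds check. The one change from the code in the question is that a single work item stores the zero instead of all of them; whether that makes any measurable difference will depend on the device.

```c
// Sketch only: kernel name, arguments and the per-pixel "work" are hypothetical.
// Assumes a 2D NDRange whose global size equals the image size, with each
// work group covering one NxN chunk.
__kernel void chunk_sum(__global const uint *image,
                        const uint width,
                        __global uint *global_chunk_results)
{
	// One accumulator per work group; set by assignment below.
	__local uint result;

	const bool is_first = (get_local_id(0) == 0 && get_local_id(1) == 0);

	// A single work item writes the initial value...
	if (is_first) {
		result = 0;
	}
	// ...and every work item waits here until that store is visible.
	barrier(CLK_LOCAL_MEM_FENCE);

	// Hypothetical "work": each work item adds its own pixel to the shared result.
	const uint x = get_global_id(0);
	const uint y = get_global_id(1);
	atomic_add(&result, image[y * width + x]);

	// Wait until all work items in the group have contributed.
	barrier(CLK_LOCAL_MEM_FENCE);

	// One work item publishes the group's result (hypothetical row-major group index).
	if (is_first) {
		const uint group = get_group_id(1) * get_num_groups(0) + get_group_id(0);
		global_chunk_results[group] = result;
	}
}
```

Both barriers are reached by every work item in the group, outside the if blocks, which is what barrier() requires.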