Problem synchronizing between multiple WorkGroups

Hey,

I am trying to synchronize all workgroups using a global variable as a semaphore. My barrier function inside the kernel is as follows:


#define WORKGROUP_COUNT 15
#define THREAD0_LOCAL (idx_Local == 0)

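// BARRIER_INCREASE and BARRIER_DECREASE are defined elsewhere (not shown here).
inline void barrierGlobalRamp(__global volatile int* volatile synch, int idx_Local, int barrierIdx, char *direction)
{
	// Order this work-item's earlier memory accesses before the barrier logic.
	mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
	// Only the first work-item of each workgroup touches the global counter.
	if (THREAD0_LOCAL)
	{
		switch (*direction)
		{
			case BARRIER_INCREASE:
				// Announce arrival, then spin until every workgroup has arrived.
				atomic_inc(&synch[barrierIdx]);
				while (synch[barrierIdx] < WORKGROUP_COUNT)
					;
				*direction = BARRIER_DECREASE;
				break;
			case BARRIER_DECREASE:
				// Same idea, ramping the counter back down to 0.
				atomic_dec(&synch[barrierIdx]);
				while (synch[barrierIdx] > 0)
					;
				*direction = BARRIER_INCREASE;
				break;
		}
	}
	// The other work-items of the workgroup wait here for the first one.
	barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
	return;
}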

The first thread of each workgroup increments (or decrements, depending on the direction variable) synch[barrierIdx], then spins until the value reaches the total number of workgroups (or drops back to 0) before leaving the function.

I am using a GTX 570 card, which has 15 SMs, and this code works if the number of workgroups (WORKGROUP_COUNT) is 15 or fewer.

The problem, however, is that at least some workgroups never leave the function if the number of workgroups is set to 16 or higher. Does anyone have any idea how this might happen?

My initial guess is that one WG is starved by its rival WG on the same SM and never enters the function, but I’m pretty sure there is more to it!

Any hint is appreciated.

Any chance that any of the while loops gets stuck?

I’m sure you’re aware that the rule for barriers is that every work item must hit them. If one of your work items is stuck in a loop, it will never hit the barrier and the rest of the work items will park there waiting.

This cannot work.
Threads on a GPU are not logical threads that share computation time through a multitasking mechanism. They are physical threads running on a limited number of processing elements.
Work-items run concurrently in batches on the processing elements, but the batches themselves run sequentially.
Put simply: you have no guarantee that all the work-items of all the work-groups run concurrently.
As a result, once the number of work-groups exceeds a certain threshold, the resident work-items spin forever, because the remaining work-items have not even started and never will.
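
To make the argument above concrete, here is a rough host-side model in plain C (not OpenCL, and not how any real scheduler is implemented). It assumes a hypothetical device that keeps at most resident_slots work-groups resident at once and never swaps out a work-group that is spinning. The function name barrier_completes and both parameters are made up for illustration; the figure of 15 slots used below simply mirrors the behaviour reported in the question, it is not a documented hardware limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the counter barrier from the question.
 * total_groups   : number of work-groups in the launch (WORKGROUP_COUNT)
 * resident_slots : ASSUMED number of work-groups the device can keep
 *                  resident at the same time, with no preemption
 * Returns true if the counter can reach total_groups, false on deadlock. */
static bool barrier_completes(int total_groups, int resident_slots)
{
    assert(total_groups >= 0 && resident_slots >= 0);

    int counter = 0;   /* plays the role of synch[barrierIdx] */
    int launched = 0;  /* work-groups that have been given a slot */

    /* The scheduler fills the free slots; the first work-item of each
     * launched work-group increments the counter exactly once. */
    while (launched < total_groups && launched < resident_slots) {
        launched++;
        counter++;
    }

    /* Every resident work-group now spins until counter >= total_groups.
     * A spinning work-group never finishes, so it never frees its slot,
     * and the work-groups still waiting to be launched never increment
     * the counter. The barrier opens only if everyone was launched. */
    return counter >= total_groups;
}
```

Under those assumptions, barrier_completes(15, 15) returns true and barrier_completes(16, 15) returns false, which matches the "15 works, 16 hangs" symptom: the sixteenth work-group can only start after one of the first fifteen finishes, and none of them can finish before the sixteenth arrives.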

utnapishtim, very well put. My sleepy brain did not catch that, but it does appear to be the problem.