Synchronize multiple devices

Hi!

I have a shared context with 4 devices and 4 queues. I want to start these queues synchronized, repeatedly.

My idea is as follows…

I have an active wait list of events that, for example, contains the last memory transaction.

Then…

foreach queue
    clEnqueueWaitForEvents(queue, the active wait list)
    clEnqueueMarker(queue, the next wait list)
    active wait list = next wait list

In my mind, this snippet creates a barrier in each queue that waits for the previous queue, so once it has executed, the last event becomes active only when all queues have finished their current work.
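
To make that reasoning concrete, here is a host-only model of the chain in plain C, with no OpenCL calls. It assumes in-order queues and zero enqueue overhead; chained_marker_time and the finish times are hypothetical names used only for illustration. Under those assumptions the last marker fires at the maximum of all queues' finish times, which is the barrier I am after:

```c
#include <stddef.h>

/* Host-only model, no OpenCL calls. finish[i] is the (hypothetical) time at
   which queue i completes the work already enqueued on it. In an in-order
   queue, the marker of queue i fires once queue i's own work is done AND the
   marker of queue i-1 (the event it waits on) has fired. */
double chained_marker_time(const double *finish, size_t num_queues,
                           double initial_event_time)
{
    double t = initial_event_time;      /* the initial active wait list */
    for (size_t i = 0; i < num_queues; ++i) {
        if (finish[i] > t)
            t = finish[i];              /* marker i = max(own work, previous marker) */
    }
    return t;                           /* completion time of the last marker */
}
```

For example, with finish times 3, 7, 2 and 5 and an initial event at time 1, the last marker in this model fires at time 7.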

The next snippet creates as many events in the first queue as there are queues. The first one waits for the last event from the preceding snippet, so the first queue is filled with:
clEnqueueWaitForEvents(0, the active wait list)
clEnqueueMarker(0, event 0)
clEnqueueMarker(0, event 1)
clEnqueueMarker(0, event 2)
clEnqueueMarker(0, event 3)

The last snippet starts the NDRange work, where the work in queue 0 waits for event 0, the work in queue 1 waits for event 1, and so on.

However, this just doesn't work: it is very slow and gives the wrong result.

Any clues on how to do this sync better?

Just to clarify: I first tried building a wait list from the NDRange events of all queues in the previous pass and using that event list for each NDRange in the next pass, but somehow using the same events makes the event wait go up to seconds.

Could someone perhaps clarify whether waiting for the same event in multiple calls is not allowed?

To clarify further: the large slowdown isn't on the GPU but in the call to clEnqueueNDRangeKernel or in the argument setting. This is the code snippet; it runs very fast for m_gpus = 1 but takes seconds to complete when m_gpus is 2 or more.


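// Enqueue one slice of the range on each device's queue.
for (gzUInt32 j = 0; j < m_gpus; j++)
{
	// Argument 0 is the start offset of this device's slice.
	status |= clSetKernelArg(m_kernel, 0, sizeof(offset), &offset);

	// Wait on every event from the previous pass; store this pass's event for queue j.
	status |= clEnqueueNDRangeKernel(m_common_info->m_commandQueues[j], m_kernel, 1, NULL, &length, &m_localWorkSizes[i], m_common_info->m_currentWaitList.getSize(), &m_common_info->m_currentWaitList[0], &m_common_info->m_nextWaitList[j]);

	offset += length;
}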
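
For reference, the loop gives each device a contiguous slice of equal length. Here is a minimal sketch of how the slices could be computed when the total does not divide evenly. device_slice is a hypothetical helper, not part of my code, and the sketch ignores any requirement that each slice be a multiple of the local work size:

```c
#include <stddef.h>

/* Hypothetical helper: compute the slice [offset, offset + length) of a 1-D
   range of `total` work items assigned to device `index` out of
   `num_devices` (num_devices must be > 0). The first (total % num_devices)
   devices each get one extra item, so the slices cover the range exactly. */
void device_slice(size_t total, size_t num_devices, size_t index,
                  size_t *offset, size_t *length)
{
    size_t base  = total / num_devices;   /* items every device gets */
    size_t extra = total % num_devices;   /* leftover items to spread */

    *length = base + (index < extra ? 1 : 0);
    *offset = index * base + (index < extra ? index : extra);
}
```

With 10 work items on 4 devices this yields lengths 3, 3, 2, 2 at offsets 0, 3, 6, 8.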