Global Barriers?

guillona · February 14, 2010, 8:28pm

Currently I’m writing an algorithm where I need a single (very quick) global barrier, and then processing can resume in parallel as it was… so basically I have a large amount of parallel work, then all work_items should hit a barrier… one work item proceeds past and does some very quick work… then all work_items resume past the barrier.

I don’t see that this is possible with OpenCL. The barrier() instruction specifies that it only applies to work groups. This isn’t good enough, because I want to work at the global_id level.

The other thing to do is to break my kernels into three kernels… kernel_1 does everything in parallel up to the barrier… kernel_2 does a single_work item and very little work (a huge waste of time to spawn, but required for the algorithm), and finally kernel_3 again works in parallel. Obviously I want to avoid the CPU management where I can, because it will add a bit of overhead that isn’t required.

Normally I wouldn’t care… but this is part of a very time-critical algorithm, and I want to ensure this part is as fast as possible.

Thanks!

dominik · February 16, 2010, 2:20am

OpenCL only supports synchronization within workgroups. The official way of a global synchronization is to have multiple kernels as you pointed out. But rather than having 3 kernels you would only need 2 I think: In the first kernel you do all the work up to the barrier and only one workitem (say the one with global_id 0) does the sequential work. Then in the second kernel you do the remaining parallel work.

There’s a paper a this year’s CC conference called “Automatic C-to-CUDA Code Generation for Affine Programs”. They say they use

a “single-writer multiple-reader” technique to achieve synchronization across thread blocks using the global memory space

They don’t discuss the performance of this technique though…

dbs2 · February 20, 2010, 3:58am

The “single-writer multiple-reader” thing sounds a lot like one work-item writes and the others spin-lock on it. That may work, but without assurances as to how the hardware schedules work-groups it might also never complete. (I’ve heard that it tends to work on Nvidia hardware.)