How to have one kernel signal all other kernels to return

James78613 · October 25, 2010, 11:08am

Hi, I'm using OpenCL on a GPU (data parallel). The kernel that I run has a very long for loop (i <= 99999999). All kernels will eventually find the answer and return (exit) from the loop, but one kernel will always find the answer sooner than the others, so I want a way to have that kernel signal the other kernels to stop (break from the loop) and return. I currently do this using a global variable that all kernels check from within the loop: the first kernel that finds the answer sets the global flag to true, and all the other kernels read the flag, break from the loop, and return. This gave me a BIG speed increase. However, I would like a direct way for one kernel to signal all the other kernels that something has happened, rather than having them poll a flag, so I can use it to make the others break and return. Is there a way that one kernel can signal all other kernels?
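
For reference, here is a minimal sketch of the flag-polling approach described above, not the actual kernel. What the post calls "kernels" are work-items in OpenCL terms. The names `search`, `found`, `result`, `target`, `limit` and `is_answer` are all hypothetical, as is the way candidates are split across work-items. The shim at the top is only an assumption that lets the same source build as plain C for testing on the host; a real OpenCL compiler skips it.

```c
/* Host-only shim (an assumption, for testing off-device): an OpenCL compiler
   defines __OPENCL_VERSION__ and skips this part entirely. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0, fake_gsize = 1;
static size_t get_global_id(unsigned int dim)   { (void)dim; return fake_gid; }
static size_t get_global_size(unsigned int dim) { (void)dim; return fake_gsize; }
#endif

/* Hypothetical stand-in for the real, expensive per-iteration test. */
int is_answer(unsigned int candidate, unsigned int target)
{
    return candidate == target;
}

__kernel void search(volatile __global int *found,      /* shared flag, 0 at launch */
                     __global unsigned int *result,     /* where the answer is stored */
                     unsigned int target,
                     unsigned int limit)
{
    unsigned int gid    = (unsigned int)get_global_id(0);
    unsigned int stride = (unsigned int)get_global_size(0);

    /* Each work-item tests candidates gid, gid + stride, gid + 2*stride, ... */
    for (unsigned int i = gid; i <= limit; i += stride) {
        if (*found)               /* poll the flag on every iteration */
            break;                /* another work-item already has the answer */
        if (is_answer(i, target)) {
            *result = i;
            *found  = 1;          /* tell the other work-items to stop */
            break;
        }
    }
}
```

The flag is declared `volatile` so the read is not hoisted out of the loop. As far as I understand it, OpenCL 1.x does not guarantee when a write becomes visible to other work-groups, so the flag is best treated as an optimization only: every work-item must still terminate on its own if it never sees the flag change.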

Thanks

david.garcia · October 25, 2010, 4:58pm

Is there a way that one kernel can signal all other kernels?

No, there isn't. The current method you use (polling) is the only viable one.

James78613 · October 27, 2010, 7:27am

Thank you for the reply. Since I have to poll global memory, is there a better way to poll? Currently I check the global memory flag in each iteration of the loop. Is there a timer feature in OpenCL that would allow my kernel to poll global memory at spaced-out intervals, to reduce accesses to global memory?

Thank you

david.garcia · October 27, 2010, 3:36pm

Is there a timer feature in OpenCL that would allow my kernel to poll global memory at spaced-out intervals, to reduce accesses to global memory?

There may be some vendor extensions to query a timer inside a kernel. The standard doesn't have this, though.

If I were you, I would consider whether the algorithm can be transformed so that you don't need this signalling mechanism.

andrew.brownsword · October 27, 2010, 5:23pm

It’s not clear to me that you’d want a fixed time slice anyhow, since that would mean you’d be wasting more cycles on faster devices. Better to poll every n’th iteration or work-item, or to use some other counter.
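
A rough sketch of the "every n’th iteration" idea, under assumptions: `search_sparse`, `is_answer`, `POLL_INTERVAL` and its value of 256 are hypothetical names and numbers chosen for illustration, and the best interval would have to be measured on the target device. The shim at the top only exists so the same source also builds as plain C on the host; an OpenCL compiler skips it.

```c
/* Host-only shim (an assumption, for testing off-device): an OpenCL compiler
   defines __OPENCL_VERSION__ and skips this part entirely. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0, fake_gsize = 1;
static size_t get_global_id(unsigned int dim)   { (void)dim; return fake_gid; }
static size_t get_global_size(unsigned int dim) { (void)dim; return fake_gsize; }
#endif

#define POLL_INTERVAL 256u   /* hypothetical value; tune per device */

/* Hypothetical stand-in for the real, expensive per-iteration test. */
int is_answer(unsigned int candidate, unsigned int target)
{
    return candidate == target;
}

__kernel void search_sparse(volatile __global int *found,   /* shared flag, 0 at launch */
                            __global unsigned int *result,
                            unsigned int target,
                            unsigned int limit)
{
    unsigned int gid    = (unsigned int)get_global_id(0);
    unsigned int stride = (unsigned int)get_global_size(0);

    /* n counts this work-item's own iterations; i is the candidate tested. */
    for (unsigned int i = gid, n = 0u; i <= limit; i += stride, ++n) {
        /* Read global memory only once every POLL_INTERVAL iterations. */
        if ((n % POLL_INTERVAL) == 0u && *found)
            break;
        if (is_answer(i, target)) {
            *result = i;
            *found  = 1;
            break;
        }
    }
}
```

The trade-off is that a work-item may run up to `POLL_INTERVAL - 1` extra iterations after the flag is set, in exchange for far fewer global memory reads. Because the counter is per work-item rather than time-based, the amount of wasted work stays the same across faster and slower devices.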

As David says, though, it’s worth putting considerable thought into figuring out a way to refactor the algorithm to avoid the requirement.