Let’s suppose you’ve a (large) collection of 2D points, and you’ve to find the “best” point among them, where “best” means evaluating the cost of each point, and finding which one has the smallest cost.
The function for evaluating the cost is relatively complex (it needs the coordinates of all the points, because the cost of a single point depends on all the other points too), but despite of this it fits in a kernel, and I successfully compiled it into a kernel both on my NVIDIA and AMD devices.
But now… since it’s not possible to implement spin-locks in OpenCL, how should I find the “best” point? Allocating the cost of all points in global memory, returning them all, and letting the host find the smallest one isn’t an option, because this would mean sending hundreds of MB from the device to the host…