Looping kernel based on change

Sorry for the vague title.

So I had an idea where each work item reports whether it changed anything and writes that down somewhere; if there was a change, the kernel is run again, until no change is reported. What would be the best implementation of this when speed is important?

  1. Let the kernel run through once, then check for change on the host side by running through the buffer where changes are reported, and run again if there was a change.

  2. Run the kernel X times before checking for change on the host side by running through the buffer where changes are reported, and run again if there was a change.

  3. Use an atomic integer that is incremented during kernel execution, and a while(change) loop to decide whether to run again.

The first one might be bad since it requires a host check before the scanning can start again, and that buffer read could take some time.

The second one is the same but would require fewer host checks, though it might do extra work for nothing. Those extra runs would not cause any problems.

I don’t know how atomics work exactly. Does using one freeze the other work items until every one of them has incremented it? Should I instead use a single non-atomic unsigned integer and let all the work items race on it when incrementing? Could it actually end up being 0 even though there was a change?

All I need is one reported change to continue the loop.

Implementation suggestions would be nice, no code needed. Any thoughts?

[QUOTE=EmJayJay;31259]So I had an idea where each work item reports whether it changed anything and writes that down somewhere; if there was a change, the kernel is run again, until no change is reported. What would be the best implementation of this when speed is important?

  1. Let the kernel run through once, then check for change on the host side by running through the buffer where changes are reported, and run again if there was a change.

  2. Run the kernel X times before checking for change on the host side by running through the buffer where changes are reported, and run again if there was a change.

  3. Use an atomic integer that is incremented during kernel execution, and a while(change) loop to decide whether to run again.
[/QUOTE]

Most likely none of those. One of the good rules of good data processing is: know your data!
This is especially true for CL devices, GPUs in particular, as they often benefit considerably from running coherent work items.
So if I had to do that, I would have the kernels build a queue using atomics, so that a further host call can enqueue more processing in blocks. But you would need readbacks… or maybe CL2 pipes, or a hypothetical clEnqueueNDRangeIndirect, which does not exist.

Keep in mind you can just give your kernels an early-out path. If the work items are coherent, this would not even require a sync.
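
To make the rerun-until-stable loop concrete, here is a minimal sketch in plain C rather than OpenCL, so it compiles anywhere. It is an illustration under assumptions, not a recommended implementation: `run_pass` stands in for one kernel launch, `run_until_stable` for the host loop, and the minimum-of-neighbours update, the size `N` and both function names are made up for the example. Comments mark where a real kernel would use `get_global_id` and an atomic flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define N 8 /* hypothetical maximum problem size */

/* Stand-in for one kernel launch: every "work item" i takes the minimum of
   itself and its two neighbours. Returns true if any item changed. */
bool run_pass(const int *in, int *out, size_t n)
{
    bool changed = false;            /* on a device: one int in global memory */
    for (size_t i = 0; i < n; ++i) { /* on a device: i = get_global_id(0) */
        int best = in[i];
        if (i > 0 && in[i - 1] < best) best = in[i - 1];
        if (i + 1 < n && in[i + 1] < best) best = in[i + 1];
        out[i] = best;
        if (best == in[i]) continue; /* early-out: nothing to report */
        changed = true;              /* on a device: atomic_or(flag, 1) */
    }
    return changed;
}

/* Host loop: relaunch until a pass reports no change.
   Returns the number of passes that were run. */
int run_until_stable(int *data, size_t n)
{
    int scratch[N];
    int passes = 0;
    bool changed;
    assert(n <= N);
    do {
        changed = run_pass(data, scratch, n);
        memcpy(data, scratch, n * sizeof *data);
        ++passes;
    } while (changed);
    return passes;
}
```

With the input {3, 3, 3, 0} this takes four passes: three that move the 0 leftwards and a final one that reports no change. That last pass is the cost of checking after every launch.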

[QUOTE=EmJayJay;31259]
I don’t know how atomics work exactly. Does using one freeze the other work items until every one of them has incremented it? Should I instead use a single non-atomic unsigned integer and let all the work items race on it when incrementing? Could it actually end up being 0 even though there was a change?[/QUOTE]

Atomics map directly to hardware constructs that serialize operations on a variable. Those hardware features don’t enforce a specific order, but they guarantee the end result will be the same as if the variable had been modified in some order.

That is, if you write 0x0, 0x0123, 0xabcd and 0xf000, you can be sure to read one of those values; you just don’t know which one. A non-atomic path might read (for example) 0xab23 or just garbage.
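
For the incrementing counter from the question, the usual failure is a lost update rather than a torn value. The sketch below replays one unlucky interleaving by hand in plain, single-threaded C. The function names are hypothetical, and it shows only this one failure mode; a real data race comes with no guarantees about what you read.

```c
#include <assert.h>

/* Two work items do a plain, non-atomic counter++ and their read and
   write steps interleave in the worst way. */
unsigned racy_increment_twice(unsigned counter)
{
    unsigned seen_by_a = counter; /* item A reads */
    unsigned seen_by_b = counter; /* item B reads the same old value */
    counter = seen_by_a + 1;      /* item A writes */
    counter = seen_by_b + 1;      /* item B overwrites: one update is lost */
    return counter;
}

/* With an atomic read-modify-write the read and the write cannot be
   separated, so the two increments happen one after the other. */
unsigned serialized_increment_twice(unsigned counter)
{
    counter = counter + 1; /* one item, whichever gets there first */
    counter = counter + 1; /* the other item sees the updated value */
    return counter;
}
```

Starting from 0, the racy version ends at 1 instead of 2. In this particular interleaving the count is too low but not zero, so a lost update alone would not hide a change; the atomic version is still the only one with a defined result.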

Atomics are not standard variables. You access them only through pointers and manipulate them through specific instructions (see CL1.2 - section 6.12.11 Atomic Functions for the baseline you can reasonably expect these days). It is my understanding that those pointers should be declared volatile, but they work for me without it.
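
The pointer-plus-function style looks much the same in C11's `<stdatomic.h>`, which is used here as a stand-in because it runs without an OpenCL device. This is a rough sketch: `report_change` and `any_change` are hypothetical names, the loop plays the work items one after another, and the OpenCL 1.2 counterpart named in the comment is `atomic_or` on a `volatile __global int *`.

```c
#include <assert.h>
#include <stdatomic.h>

/* One call plays one work item; flag plays the shared change flag.
   The flag is only ever touched through a pointer and an atomic function. */
void report_change(volatile atomic_int *flag, int item_changed)
{
    if (item_changed)
        atomic_fetch_or(flag, 1); /* OpenCL 1.2 counterpart: atomic_or(flag, 1) */
}

/* Returns 1 if at least one item reported a change, 0 otherwise. */
int any_change(const int *item_changed, int n)
{
    atomic_int flag;
    atomic_init(&flag, 0);
    for (int i = 0; i < n; ++i)
        report_change(&flag, item_changed[i]);
    return atomic_load(&flag);
}
```

Since only one reported change is needed to continue, an OR into a single flag is enough here and no count has to be kept.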