Thread: Bool AND reduction

  1. #1
    Junior Member
    Join Date
    Feb 2018
    Posts
    7

    Hello everyone.

    I'm working on an algorithm that has to iterate, running some computation (a specific kernel) on an image until no more changes are made (it always converges, no worries). In OpenCL, I'd need each thread to report whether it changed anything, so I know whether I need to run the kernel again.

    The most trivial way I can think of to solve this is by using a global boolean. If a thread changes anything, it would set that bool to true, and the host would simply check that value. However, this sounds quite costly. I know I could split this reduction using local memory to optimize it, but there must be an easier and better way to do this.

    Is there a way to send some sort of signal from a thread to the host? There's no extra data I need to transfer, I just need to be notified with a signal or anything.

    Thanks everyone!

  2. #2
    Senior Member
    Join Date
    Dec 2011
    Posts
    252
    Why do you think it would be costly?

  3. #3
    Junior Member
    Join Date
    Feb 2018
    Posts
    7
    Mainly for two reasons:

    1. Many threads could have made changes, so that's many accesses to global memory.
    2. I need to transfer that boolean back and forth every time.

    But maybe I'm wrong.
    PS: I think you're answering both this and my SO question. You are awesome.
    Last edited by naicolas12; 03-20-2018 at 04:39 PM. Reason: Typo

  4. #4
    Senior Member
    Join Date
    Apr 2015
    Posts
    316
    Code :
    if (*global_bool == 0) { // plain read; skips the atomic once the flag is set
        atomic_cmpxchg(global_bool, 0, 1); // global_bool must be volatile __global int *, since atomics take int, not bool
    }

    This ensures that, on average, only one atomic operation is performed. And since every thread is going to read this bool, it is likely to be cached most of the time. That's the theory, anyway; you should profile it yourself. And remember you can do a work-group-wide reduction without atomics, though it might be a premature optimization. It's probably a good idea to use host memory to store this bool. Or better yet, use device-side enqueue if OpenCL 2.0 is available.
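
    For the work-group-wide reduction mentioned above, here is a rough sketch in plain C that simulates the idea on the CPU; it is not real kernel code. GROUP_SIZE, launch_with_group_reduction and the changed array are hypothetical names used only for illustration. On a device, the inner loop would be a local-memory reduction (or work_group_any in OpenCL 2.0) followed by a barrier.

```c
#include <stddef.h>

#define GROUP_SIZE 4   /* hypothetical work-group size */

/* Simulates one launch over n work-items split into work-groups.
   changed[i] is 1 if work-item i modified its pixel (hypothetical input).
   Returns the number of writes that reached the global flag. */
static int launch_with_group_reduction(const int *changed, size_t n,
                                       int *global_flag)
{
    int global_writes = 0;
    for (size_t base = 0; base < n; base += GROUP_SIZE) {
        /* Per-group result; on a device this would sit in __local memory. */
        int local_flag = 0;
        for (size_t lid = 0; lid < GROUP_SIZE && base + lid < n; ++lid) {
            if (changed[base + lid])
                local_flag = 1;   /* sequential stand-in for the group reduction */
        }
        /* On a device: barrier(CLK_LOCAL_MEM_FENCE), then one work-item
           per group performs the global write below. */
        if (local_flag) {
            *global_flag = 1;
            ++global_writes;
        }
    }
    return global_writes;
}
```

    The point is that at most one global write happens per work-group instead of one per work-item that changed something.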
    Last edited by Salabar; 03-21-2018 at 06:51 AM.

  5. #5
    Junior Member
    Join Date
    Feb 2018
    Posts
    7
    That looks like a good idea. I see how keeping the bool in cache helps with the conditional, but there are a few things I don't understand: Why does it have to be atomic when it's only a write? How does the 'cmpxchg' part help here? Thanks for your answer!

  6. #6
    Senior Member
    Join Date
    Apr 2015
    Posts
    316
    Making the write atomic ensures every work-item sees the change immediately. I'm not entirely sure cmpxchg achieves anything when atomic_store exists, but what happens without an atomic is undefined behavior and may vary from device to device. Maybe it will work, maybe it won't, maybe it works now and will break after a driver update. There is literally no reason to waste your time finding out what will happen.
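
    To illustrate what the check in front of the atomic buys you, here is a small host-side analogue using C11 atomics, which the OpenCL 2.0 atomics are modeled on. It is a sketch under those assumptions, not kernel code, and flag and report_change are hypothetical names.

```c
#include <stdatomic.h>

/* Stands in for the global int that every work-item can see.
   Static storage, so it starts at 0. */
static atomic_int flag;

/* Called by a work-item that changed something.
   Returns 1 if this call issued the atomic store, 0 if the check skipped it. */
static int report_change(void)
{
    /* Cheap check first: once the flag is 1, nobody needs to write again. */
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 0) {
        /* Every writer stores the same value, so no compare-exchange is needed. */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return 1;
    }
    return 0;
}
```

    When called one after another, only the first call pays for the store. On a real device several work-items may read 0 at the same moment and all store, which is harmless because they write the same value.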

  7. #7
    Senior Member
    Join Date
    Oct 2012
    Posts
    153
    If your work-items only make a transition from false to true (or 0 to 1), a simple write to global memory is enough as long as you always write the same value:

    __global int *flag;

    if (<my_condition>) *flag = 1;

    In that case, you are guaranteed that *flag will be set to 1 at the end of the kernel execution if at least one work-item has fulfilled <my_condition>.
    This means that you have to set *flag to zero before executing your kernel (with a host write or a device task), and that your work-items cannot rely on reading an up-to-date value of *flag.

    However, this works well to signal that at least one work-item has done something. I have already used this pattern successfully for iterative picture analysis.
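
    The host side of this pattern can be sketched as follows. This is plain C with the kernel replaced by an ordinary loop, so run_kernel, iterate_until_stable, N and the max-propagation step are hypothetical stand-ins; a real host would enqueue the kernel and read the flag buffer back instead.

```c
#define N 8   /* hypothetical image size */

/* Stand-in for one kernel launch: each index i plays the role of a
   work-item (get_global_id(0)). The computation is a made-up example
   that propagates larger values to the right. */
static void run_kernel(int *image, int *flag)
{
    for (int i = 1; i < N; ++i) {
        if (image[i] < image[i - 1]) {   /* <my_condition> */
            image[i] = image[i - 1];
            *flag = 1;                   /* every writer stores the same value */
        }
    }
}

/* Host loop: reset the flag, launch, check the flag, repeat.
   Returns the number of launches, including the final one that
   changed nothing. */
static int iterate_until_stable(int *image)
{
    int launches = 0;
    int flag;
    do {
        flag = 0;                  /* host write before each launch */
        run_kernel(image, &flag);  /* on a device: enqueue, then read flag back */
        ++launches;
    } while (flag != 0);
    return launches;
}
```

    Note that the last launch is always a "wasted" one that only confirms nothing changed; that cost is inherent to the pattern, not to how the flag is written.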

  8. #8
    Senior Member
    Join Date
    Apr 2015
    Posts
    316
    Find the exact line in the OpenCL standard that states "guaranteed". Writing to the same memory location without synchronisation is undefined behavior, and driver developers love to screw with undefined behavior. I'm well aware that there is no reason to believe it won't work on any current or future hardware, but when the difference is so minuscule, why take risks? Too much software and too many games have been lost forever because of "I dunno, it works for me", and this adds more potential corpses to the pile.

  9. #9
    Senior Member
    Join Date
    Oct 2012
    Posts
    153
    Quote Originally Posted by Salabar
    Find the exact line in the OpenCL standard that states "guaranteed". Writing to the same memory location without synchronisation is undefined behavior, and driver developers love to screw with undefined behavior. I'm well aware that there is no reason to believe it won't work on any current or future hardware, but when the difference is so minuscule, why take risks? Too much software and too many games have been lost forever because of "I dunno, it works for me", and this adds more potential corpses to the pile.
    Don't confuse memory state with memory operation. Writing some data to a memory location is a memory operation that will be carried out whether there is synchronization or not. Without synchronization, you cannot guarantee ordering or consistency among reads and writes. But if the only action on a memory location is exactly the same write by one or several work-items, you are assured that this memory location will contain the result of the write once memory has been flushed, whether by an explicit flush or an implicit one (such as the one at the end of the kernel execution).
