Memory consistency model

The spec says this about memory consistency:

Within a work-item memory has load / store consistency. Local memory is consistent across
work-items in a single work-group at a work-group barrier. Global memory is consistent across
work-items in a single work-group at a work-group barrier, but there are no guarantees of
memory consistency between different work-groups executing a kernel.

To ensure load/store consistency within a work-item, do I need to use atomic operations?

The spec says that atomic operations are atomic for a device. For a GPU with multiple SIMD streams, does this mean that atomics are actually consistent across work-groups, not just within a work-group?

For non-atomic read/write accesses across workgroups, is the result merely undefined, or can it cause a crash (other than due to software not prepared to handle the inconsistency)?

This is my understanding:

No, atomic operations are not needed for consistency within a work-item. Load/store consistency there is guaranteed by the programming model. Atomic operations are only necessary when multiple work-items may update the same location in global memory concurrently.

Yes, atomic operations are consistent across work-groups, i.e. no work-item will see an “intermediate state” of the update. For example, if two work-items increment a variable in global memory (initially set to 0) using atom_inc(), it is guaranteed that both increments take effect, i.e. the variable ends up with the value 2, no matter which work-groups the work-items belong to.
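As a rough host-side analogue (plain C11, not OpenCL C), the sketch below uses atomic_fetch_add() in place of the OpenCL atomic increment atom_inc(). Both return the old value and perform the read-modify-write as one indivisible step. The names counter, increment and current_value are made up for illustration, and the two calls here run one after the other rather than on two real work-items.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a counter that lives in global memory.
   Static storage, so it starts at 0. */
static atomic_int counter;

/* Analogue of atom_inc(): adds 1 as a single indivisible
   read-modify-write and returns the value seen before the increment. */
int increment(void) {
    return atomic_fetch_add(&counter, 1);
}

/* Reads the current value of the counter. */
int current_value(void) {
    return atomic_load(&counter);
}
```

Because each call sees a distinct old value, no increment can be lost: two calls leave the counter at 2, whatever order they happen in.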

I think the worst thing that can happen is that you get wrong results. To come back to my example: if two work-items increment a global variable (initially set to 0) without atomic operations, they may both read the value 0, increment it, and store 1 back to memory. In that case the variable ends up with the value 1, i.e. it has effectively been incremented only once.
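The lost update described here can be replayed deterministically. The sketch below is plain C, not a kernel: it spells out one possible interleaving of two work-items by hand and compares it with the case where each read-modify-write completes before the next one starts, which is what an atomic increment enforces. The function names are made up for illustration.

```c
/* Replays the problematic interleaving: both work-items load the
   variable before either of them stores its result. */
int racy_increment_twice(int initial) {
    int global_var = initial;
    int a = global_var;   /* work-item A loads the value */
    int b = global_var;   /* work-item B loads the same, now stale, value */
    global_var = a + 1;   /* A stores its result */
    global_var = b + 1;   /* B overwrites it with the same result */
    return global_var;
}

/* Each read-modify-write finishes before the next one begins,
   as it would with an atomic increment. */
int serialized_increment_twice(int initial) {
    int global_var = initial;
    global_var = global_var + 1;   /* first work-item */
    global_var = global_var + 1;   /* second work-item */
    return global_var;
}
```

Starting from 0, the first function returns 1 and the second returns 2. On a real device the racy interleaving is only one of several possible outcomes, so the non-atomic version may sometimes appear to work.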