work-group to work-group direct data transfer (DMA)

There is currently no way to send data directly between work-groups, and using async_work_group_copy for this is far from optimal. My suggestion is to have something like this:

event_t async_direct_work_group_copy (
    wgtypen dst_work_group,
    __local gentype *dst,
    const __global gentype *src,
    size_t num_gentypes,
    event_t event);

dst_work_group - the destination work-group number in the ND-Range, of type wgtypen, where n = 1, 2, …, MAX_DIM

Let’s see if I understand what you are proposing. You want a work-group to write data into the local memory of another work-group?

Do you realize that work-groups execute asynchronously from each other? How do you know that the destination work-group has not already finished executing? Or what if it is in the middle of the execution?

The current OpenCL standard is too limited when it comes to communication between threads in different work-groups: it is possible only through global memory. As a result, there is a bottleneck. This is a well-known problem for algorithms with heavy data flow, and the methods for improving it are also well known.
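To make the global-memory route concrete, here is a minimal host-side sketch in plain C. It is not OpenCL code, only a model of the pattern under stated assumptions: each work-group has a private tile standing in for __local memory, one shared array stands in for a __global buffer, and the boundary between the two loops stands in for the boundary between two kernel launches. NUM_GROUPS, GROUP_SIZE, produce_pass, consume_pass and run_exchange are hypothetical names chosen for illustration. In a real kernel the copies would roughly correspond to async_work_group_copy followed by wait_group_events.

```c
#include <assert.h>
#include <string.h>

#define NUM_GROUPS 4  /* hypothetical number of work-groups */
#define GROUP_SIZE 8  /* hypothetical number of work-items per group */

/* Stand-in for a __global buffer: the only storage every group can see. */
static int global_buf[NUM_GROUPS * GROUP_SIZE];

/* Pass 1: a group fills its own tile (stand-in for __local memory)
 * and publishes it to its slot in the global buffer. */
static void produce_pass(int group)
{
    int local_tile[GROUP_SIZE];
    for (int i = 0; i < GROUP_SIZE; ++i)
        local_tile[i] = group * 100 + i;
    memcpy(&global_buf[group * GROUP_SIZE], local_tile, sizeof local_tile);
}

/* Pass 2: a group copies its neighbour's published tile into its own
 * tile, then reduces it. Returns the sum of the neighbour's values. */
static int consume_pass(int group)
{
    int local_tile[GROUP_SIZE];
    int src_group = (group + 1) % NUM_GROUPS;
    int sum = 0;
    memcpy(local_tile, &global_buf[src_group * GROUP_SIZE], sizeof local_tile);
    for (int i = 0; i < GROUP_SIZE; ++i)
        sum += local_tile[i];
    return sum;
}

/* Runs both passes. When `reversed` is nonzero the groups are visited in
 * the opposite order, mimicking the fact that the scheduling order of
 * work-groups is not guaranteed. */
void run_exchange(int reversed, int sums[NUM_GROUPS])
{
    assert(sums != NULL);
    for (int k = 0; k < NUM_GROUPS; ++k)
        produce_pass(reversed ? NUM_GROUPS - 1 - k : k);
    /* Every tile is published before any group starts pass 2. */
    for (int k = 0; k < NUM_GROUPS; ++k) {
        int g = reversed ? NUM_GROUPS - 1 - k : k;
        sums[g] = consume_pass(g);
    }
}
```

Calling run_exchange(0, sums) and run_exchange(1, sums) fills sums with the same values, because no group reads a neighbour's data until every group has finished publishing. The cost is the extra round trip through the shared buffer, which is the bottleneck described above.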

Moreover, we need direct communication between different kernels that is more efficient than going through global memory. For instance:

event_t async_global_direct_work_group_copy (
    kerneltype dst_kernel,
    wgtypen dst_work_group,
    __local gentype *dst,
    const __local gentype *src,
    size_t num_gentypes,
    event_t event);

Synchronizing threads in different work-groups is not covered directly in the standard, but it would not be a problem to support it.

Local memory optimizations can give significant performance improvements if these features are supported at the hardware level.

This is a little different, but it follows on from the idea of work-groups with local memory running asynchronously.

From what I understand, work-items within a work-group can be swapped in and out, and similarly work-groups can be swapped in and out? When work-items are swapped, local memory and registers are left intact until they completely finish. Is this true for work-groups as well? And the local memory is non-overlapping from each work-group, even if they’re running on the same compute unit?

From what I understand, work-items within a work-group can be swapped in and out, and similarly work-groups can be swapped in and out?

The answer to both is very implementation-dependent.

And the local memory is non-overlapping from each work-group, even if they’re running on the same compute unit?

The local memory of each work-group is independent from all the other work-groups.

David, thanks for the great answers!