async_work_group_strided_copy

Is anyone using this function, or does anyone understand exactly what it does?

I have a matrix in global memory:


ooooooooooo
ooooooooooo
ooooXXXXXoo
ooooXXXXXoo
ooooXXXXXoo
ooooooooooo

And I need to put a subregion of the matrix into local memory:


XXXXX
XXXXX
XXXXX

For the time being I manually calculate how many element copy operations each work-item within a work-group should do. The code is not very simple, and it will become much more complex once the dimension count of the “matrix” becomes variable (more than 2).

But I know the initial offset, the number of contiguous regions I need to copy, and the “distance” in the global buffer between these regions. Can I somehow use the async_work_group_strided_copy function efficiently here instead of the manual calculations?

async_work_group_strided_copy is useful when you have an array of structures (AoS) and you want to transform it into a structure of arrays (SoA), or more specifically, when you have an array of structures and want to extract one of the struct fields.

In your example, the “width” of your sub-matrix would need to be a builtin CL type, like an int, or a float4.

If you want to do a rectangular copy, I recommend executing async_work_group_copy() in a loop. Each iteration of the loop would copy one row of the sub-matrix into local memory. The number of iterations of the loop would match the height of the sub-matrix.
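As a rough illustration of the row-by-row idea, here is a host-side C model rather than device code. The function name copy_rows and its parameters (pitch, row0, col0) are hypothetical, and a row-major layout is assumed. Each memcpy stands in for one async_work_group_copy call of width elements, which in a kernel would be followed by wait_group_events before local memory is read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model (assumption: row-major matrix with `pitch` elements per row).
 * Copies a width x height sub-matrix starting at (row0, col0) into a dense
 * destination buffer, one row per loop iteration. */
static void copy_rows(float *dst, const float *src, size_t pitch,
                      size_t row0, size_t col0, size_t width, size_t height)
{
    assert(col0 + width <= pitch);  /* the sub-matrix must fit inside a row */
    for (size_t r = 0; r < height; ++r) {
        /* In a kernel, this line would be one async_work_group_copy of
         * `width` elements from global to local memory. */
        memcpy(dst + r * width,
               src + (row0 + r) * pitch + col0,
               width * sizeof(float));
    }
}
```

For the 6 x 11 matrix in the question, the call would be copy_rows(dst, src, 11, 2, 4, 5, 3): three copies of five elements each.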

Thanks a lot! My submatrix width is variable: right now it is 5 in one kernel and 6 in another. I don’t think there are built-in types with such a width.

I already tried using async_work_group_copy in a loop. It is slow. I guess that is because the width is much smaller than the local work size, so a lot of work-items are just doing nothing. I end up with several times more wavefront memory requests than when I organize the load manually.
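One way such a manual load might be organized for any number of dimensions is to number the elements of the subregion with a flat index and let the work-items stride over that index, so the amount of work per work-item no longer depends on the row width. The sketch below is a host-side C model, not the poster's actual code. The names region_offset and cooperative_load, and the extent/pitch arrays, are hypothetical. The outer loop over lid simulates the work-items; in a kernel it would disappear and lid would come from get_local_id(0). Whether this is faster on a given device is not something the sketch can show.

```c
#include <assert.h>
#include <stddef.h>

/* Maps flat index i inside the sub-region to an element offset in the source.
 * extent[d]: sub-region size along dimension d (innermost dimension last).
 * pitch[d]:  source stride, in elements, along dimension d.
 * base:      source offset of the sub-region's first element. */
static size_t region_offset(size_t i, size_t ndim, const size_t *extent,
                            const size_t *pitch, size_t base)
{
    size_t off = base;
    for (size_t d = ndim; d-- > 0; ) {
        off += (i % extent[d]) * pitch[d];  /* coordinate along dimension d */
        i /= extent[d];
    }
    return off;
}

/* Simulates `local_size` work-items filling a dense destination buffer. */
static void cooperative_load(float *dst, const float *src, size_t ndim,
                             const size_t *extent, const size_t *pitch,
                             size_t base, size_t local_size)
{
    assert(local_size >= 1);
    size_t total = 1;
    for (size_t d = 0; d < ndim; ++d)
        total *= extent[d];

    for (size_t lid = 0; lid < local_size; ++lid)       /* one pass per work-item */
        for (size_t i = lid; i < total; i += local_size)
            dst[i] = src[region_offset(i, ndim, extent, pitch, base)];
}
```

For the 2D case in the question, extent would be {3, 5}, pitch {11, 1}, and base 2 * 11 + 4. Adding a dimension only adds one entry to each array.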

Thanks again, I am now confident that I am using the best approach :)

Sorry David, but I have to quibble. The async_work_group_strided_copy is not especially useful for an AoS <-> SoA transformation. If it were useful for that, it would take this:


********XYZW********
********XYZW********
********XYZW********
********XYZW********

and transform it into


XXXX YYYY ZZZZ WWWW

But it does not… it instead produces:


XYZW XYZW XYZW XYZW

It can’t even claim to be transposing across a work-group, because it is a global <-> local memory copy (thus just deferring the transpose to when the read from local memory happens), as opposed to a copy into private kernel variables. Unfortunately it also isn’t especially useful for rectangular copies, because it can only extract one gentype per row, so at most your rectangle can be 16 elements wide. This is a fixed-stride gather/scatter function (perhaps better called a pack/unpack function?), which limits its utility.
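To make the behaviour described above concrete, here is a small host-side C model of the global-to-local variant, where the stride is counted in gentype elements: element i of the destination comes from element i * src_stride of the source. The type char4_t is a stand-in for OpenCL's char4, and strided_copy_model is a hypothetical name; the event handling of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's char4 (assumption: four packed chars). */
typedef struct { char s[4]; } char4_t;

/* Models the gather form of the strided copy:
 * dst[i] = src[i * src_stride] for i in [0, num_gentypes). */
static void strided_copy_model(char4_t *dst, const char4_t *src,
                               size_t num_gentypes, size_t src_stride)
{
    assert(src_stride >= 1);
    for (size_t i = 0; i < num_gentypes; ++i)
        dst[i] = src[i * src_stride];
}
```

With the four 20-character rows from the example viewed as char4_t elements, each row is 5 elements long and XYZW sits at element 2. Calling strided_copy_model(dst, src + 2, 4, 5) then packs the four XYZW groups back to back, which matches the XYZW XYZW XYZW XYZW result rather than the SoA layout.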

Andrew is right. Thanks for the correction.