Has anyone measured a performance advantage from using async_work_group_copy() over a regular coalesced
read of global memory?
I tested this on an AMD GCN 1.0 card and found no difference.
See this AMD thread for more details.
Judging from the intermediate representation of the Fiji kernel, the generated ISA should be nearly identical. That might not be the case for global-memory-to-global-memory transfers, though.
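#define ASYNC_COPY
#define DATATYPE float
#define IDXTYPE int
#define OFFSET 1

// Compares a plain coalesced load with async_work_group_copy() into a
// double-buffered local array. The loop reads up to 31*OFFSET elements past
// each work-group's 256-element window, so the input buffer needs that padding.
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void read_linear_uncached(__global DATATYPE *input, __global DATATYPE *output)
{
    IDXTYPE gid = get_global_id(0);
    IDXTYPE lid = get_local_id(0);
    IDXTYPE index = gid;
    __local DATATYPE scratch[256*2];

    // Initialise both halves so the first compute pass reads defined values.
    scratch[lid] = (DATATYPE)(0.0f);
    scratch[lid + 256] = (DATATYPE)(0.0f);

    for (int i = 0; i < 32; ++i) {
        uint flipBuffer = i & 1; // alternates 0, 1: selects the half being loaded
#ifndef ASYNC_COPY
        scratch[lid + flipBuffer*256] = input[index];
#else
        // All work-items must pass identical arguments, so the source is the
        // work-group's base address (index - lid), not a per-item address.
        event_t evt = async_work_group_copy(scratch + flipBuffer*256,
                                            input + (index - lid),
                                            256,
                                            0);
#endif
        index += OFFSET;
        // Arithmetic on the other half, meant to overlap with the load.
        for (int k = 0; k < 100; ++k)
            scratch[lid + (flipBuffer^1)*256] += pow(scratch[lid + (flipBuffer^1)*256], (DATATYPE)2);
#ifdef ASYNC_COPY
        wait_group_events(1, &evt); // waits until the copy has finished.
#endif
    }
    output[gid] = scratch[lid];
}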
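For a two-half scratch array like the one above, the buffer selector has to alternate between 0 and 1 (for example `i & 1`); a selector that can reach 2 would step past the 512 elements. Here is a small host-side check of that indexing in plain C. It is only a sketch: `GROUP_SIZE`, `NUM_HALVES`, `load_offset`, `compute_offset` and `ping_pong_is_valid` are names made up for this illustration and are not part of the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 256   /* work-group size used by the kernel */
#define NUM_HALVES 2     /* scratch holds two halves of GROUP_SIZE elements */

/* Offset of the half that iteration i loads into (assumed selector: i & 1). */
size_t load_offset(int i)
{
    return (size_t)(i & 1) * GROUP_SIZE;
}

/* Offset of the half that iteration i computes on: the other one. */
size_t compute_offset(int i)
{
    return (size_t)((i & 1) ^ 1) * GROUP_SIZE;
}

/* Returns 1 if, for every iteration, both halves lie inside scratch and
   differ from each other; returns 0 otherwise. */
int ping_pong_is_valid(int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        size_t load = load_offset(i);
        size_t compute = compute_offset(i);
        if (load + GROUP_SIZE > (size_t)GROUP_SIZE * NUM_HALVES) return 0;
        if (compute + GROUP_SIZE > (size_t)GROUP_SIZE * NUM_HALVES) return 0;
        if (load == compute) return 0;
    }
    return 1;
}
```

Calling `ping_pong_is_valid(32)` covers the same 32 iterations as the kernel loop. Swapping the mask inside `load_offset` for a different one is a quick way to see whether a selector stays inside the array.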
Thanks for looking at the ISA, Salabar. So it looks like this is just a convenience method on AMD GCN. Good to know.
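To make the "convenience method" reading concrete, here is a host-side emulation in plain C of what a work-group copy could boil down to if it is lowered to ordinary per-work-item loads. This is an assumption for illustration, not AMD's actual implementation, and `emulated_group_copy` is a made-up name. Each hypothetical work-item `lid` copies elements `lid`, `lid + group_size`, and so on, so neighbouring work-items touch neighbouring addresses, which is the same coalesced pattern as the hand-written load.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of a strided work-group copy.
   The outer loop stands in for the work-items of one work-group; on a device
   they would run in parallel and a barrier would follow before dst is read. */
void emulated_group_copy(float *dst, const float *src,
                         size_t num_elements, size_t group_size)
{
    assert(group_size > 0);
    for (size_t lid = 0; lid < group_size; ++lid) {
        /* Work-item lid handles elements lid, lid + group_size, ... */
        for (size_t e = lid; e < num_elements; e += group_size) {
            dst[e] = src[e];
        }
    }
}
```

With `num_elements` equal to `group_size` (256 in the kernel), every work-item copies exactly one element, which matches the manual `scratch[...] = input[index]` path.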