vload4 vs four buffer accesses for local memory buffer

boxerab · August 8, 2014, 6:07am

Does vload4 have any advantage over four individual buffer accesses for a local memory buffer?

i.e.

////////////////////////////////////////////////////////////
__local int FOO[256];

// case 1
int4 pixel = vload4(0, FOO);

// case 2
pixel.x = FOO[0];
pixel.y = FOO[1];
pixel.z = FOO[2];
pixel.w = FOO[3];

/////////////////////////////////////////////////

Also, does vload4 execute in one kernel clock cycle (assuming no bank conflicts)?

Thanks!
Aaron

kunze · August 8, 2014, 2:04pm

A compiler could theoretically tell that case 1 and case 2 are essentially the same. I have seen compilers do this in similar cases, but I can’t speak for all compilers. As such, I typically prefer the vload over separate loads so that I’m not relying on compiler tricks.

As to your second question, nothing in the spec makes clock-level performance guarantees about any operation. Implementation by carrier pigeon would be completely legal. If you have questions about the behavior on a specific platform, I suggest you talk to the hardware vendor of the device you are using.

boxerab · August 11, 2014, 7:14pm

Thanks kunze. Now, what about bank conflicts? If one work item issues memory reads from address 0 to address 4, and the next work item reads from address 1 to address 5, then the individual reads would not exhibit a bank conflict. However, if vload is used, then it is possible that vload #1 would conflict with vload #2.

kunze · August 11, 2014, 9:54pm

Again, the answer here would be architecture dependent. But for the architecture I use, one memory access with four lanes trying to access the same bank is no worse than four memory accesses with no bank conflicts. But this should be something that’s pretty easy to verify empirically on whatever you’re using.
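
One way to check this empirically is a small micro-benchmark along the following lines. This is only a sketch under stated assumptions: the kernel name `bench_local_loads`, the `USE_VLOAD` switch, and the `WG_SIZE` and `ITERS` values are hypothetical and not from this thread, and the kernel assumes a 1-D launch with a local size of exactly `WG_SIZE`. Each work item reads a 4-wide window starting at its own index, so neighbouring work items overlap by three elements, as in the pattern described above.

```c
// Illustrative micro-benchmark; the kernel name, USE_VLOAD, WG_SIZE and ITERS
// are assumptions, not taken from the thread.
// Launch 1-D with a local size of exactly WG_SIZE.
#define WG_SIZE 64   // must be a power of two for the index mask below
#define ITERS   1024

__kernel void bench_local_loads(__global int *out)
{
    // Three extra elements keep the last 4-wide window in range.
    __local int FOO[WG_SIZE + 3];
    const int lid = (int)get_local_id(0);

    // Fill local memory, including the 3-element tail.
    FOO[lid] = lid;
    if (lid < 3)
        FOO[WG_SIZE + lid] = WG_SIZE + lid;
    barrier(CLK_LOCAL_MEM_FENCE);

    int4 acc = (int4)(0);
    for (int i = 0; i < ITERS; ++i) {
        // Vary the start index so the loads are not loop-invariant.
        const int base = (lid + i) & (WG_SIZE - 1);
#if USE_VLOAD
        // Case 1: one vector load of FOO[base .. base+3].
        acc += vload4(0, FOO + base);
#else
        // Case 2: four scalar loads of the same elements.
        int4 v;
        v.x = FOO[base];
        v.y = FOO[base + 1];
        v.z = FOO[base + 2];
        v.w = FOO[base + 3];
        acc += v;
#endif
    }

    // Store a result so the loads have an observable effect.
    out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;
}
```

Build the program twice, passing `-D USE_VLOAD=1` and `-D USE_VLOAD=0` in the `clBuildProgram` options, and time each run with event profiling (a queue created with `CL_QUEUE_PROFILING_ENABLE`, then `clGetEventProfilingInfo`). Keep in mind that the compiler may generate the same code for both variants, in which case the timings will not differ.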

boxerab · September 9, 2014, 1:19pm

Tried this out on an HD 7700 series GPU: best perf was from individual loads, not vloadn.

Dithermaster · September 9, 2014, 3:06pm

With that amount of overlapped reads (work items re-reading the same memory other work items just read) this is a good candidate for workgroup shared local memory. Make those global memory reads just once, then read them as much as you need inside the work items. That will be faster than either individual loads or vloadn. You can code this yourself or use async_work_group_copy.
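
A minimal sketch of that staging approach, using async_work_group_copy, might look like the following. The kernel name `window_sum`, the `WG_SIZE` and `HALO` values, and the buffer layout are assumptions for illustration: the input buffer is assumed to hold at least the global size plus `HALO` elements, and the kernel assumes a 1-D launch with a local size of exactly `WG_SIZE` and no global offset.

```c
// Illustrative only; names and sizes are assumptions, not from the thread.
// Assumes `in` holds at least (global size + HALO) ints.
#define WG_SIZE 64
#define HALO    3   // extra elements so the last work item's window stays in range

__kernel void window_sum(__global const int *in, __global int *out)
{
    __local int tile[WG_SIZE + HALO];
    const int lid = (int)get_local_id(0);
    const size_t group_start = get_group_id(0) * WG_SIZE;

    // Every work item in the group must reach this call with the same arguments.
    event_t e = async_work_group_copy(tile, in + group_start, WG_SIZE + HALO, 0);
    wait_group_events(1, &e);

    // Each global element is fetched once per group; the overlapping
    // 4-wide windows are then served from local memory.
    int4 p = vload4(0, tile + lid);
    out[get_global_id(0)] = p.x + p.y + p.z + p.w;
}
```

The vload4 here could equally be four scalar reads of `tile[lid]` through `tile[lid + 3]`; which one is faster is the device-dependent question discussed above.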