Too many operations in a work item?

I’m having a strange performance problem with my program. It’s a text processing, EM-like algorithm running on an AMD Radeon HD 7970. This card has 32 compute units of size 32. It’s supposed to have a double-precision performance of nearly 1 TFLOP. I execute 1024 work items that each process a segment of the text data. There is no synchronization between the work items.
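
For reference, each iteration is launched more or less like this (an illustrative sketch rather than my actual host code; the variable names are placeholders):

[CODE]
#include <CL/cl.h>

/* Illustrative launch of one iteration: 1024 independent work-items,
 * one per text segment. The local size is left NULL so the runtime
 * picks the work-group size. Variable names are placeholders. */
static cl_int launch_iteration(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size = 1024;
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,             /* one work dimension        */
                                  NULL,          /* no global offset          */
                                  &global_size,  /* 1024 work-items in total  */
                                  NULL,          /* runtime-chosen local size */
                                  0, NULL, NULL);
}
[/CODE]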

The kernel can be thought of as having two distinct parts: one that calculates several arrays in private memory using data from global memory (let’s call this part A) and the other that uses those arrays to perform another calculation in private memory and updates data in global memory (let’s call it part B). At the start of the program the data in global memory is randomly initialized and then this kernel is executed a couple of thousand times.
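
To give an idea of the shape of the kernel (a heavily simplified sketch; the array names, sizes and math here are placeholders, not the real algorithm):

[CODE]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Heavily simplified sketch of the kernel structure (not the real math).
 * Array names and sizes are placeholders. */
__kernel void em_step(__global const double *in_data,
                      __global double       *out_data)
{
    int gid = get_global_id(0);

    /* Part A: build private arrays from data in global memory. */
    double cache[8];
    for (int i = 0; i < 8; ++i)
        cache[i] = in_data[gid * 8 + i];

    /* Part B: compute with the private arrays and update global memory. */
    double acc = 0.0;
    for (int i = 0; i < 8; ++i)
        acc += cache[i] * cache[i];
    out_data[gid] = acc;
}
[/CODE]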

If I comment out part A and just create some arbitrary values for the arrays then each iteration zips through really fast - about 2 minutes for 100 iterations. If I leave part A but comment out part B then it zips through even faster - about 30 seconds for 100 iterations. But if I leave them both in then it takes about 6 minutes for 100 iterations. This does not make sense.

I tried commenting out one of the parts and then gradually uncommenting code one line at a time to see if there is any specific operation that is causing the problem. What I found was rather strange. At a certain point, when I uncomment a line of code it goes from a run time of 2m 30s to 6m but what’s strange is that it doesn’t matter what line of code that is. It can be anything, even something as simple as a value assignment in private memory.

What’s even more bothersome is that the code executes at about the same speed as when I run it on the CPU, which is an AMD Phenom II X4 at 2.8 GHz. And if I run it on another computer with a 3.4 GHz Core i7, it’s even faster - almost twice as fast, actually.

It seems to me that I’m not doing something right here. It’s almost as if there are too many operations in the kernel for the GPU to handle. Is that possible? Has anyone encountered something like this?

This might be caused by register spilling. The full code may need more registers than each isolated part of the algorithm. In that case, values have to be temporarily stored to and read back from global memory, which is of course less efficient than using registers.

You should analyze your kernel with the AMD GPU profiler to pinpoint the origin of the problem.
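
As a quick check outside the profiler, you can also ask the runtime how much private (scratch) memory the compiled kernel needs per work-item via clGetKernelWorkGroupInfo; a large value there is usually a sign of spilling, though exactly what gets reported varies by driver. Something along these lines:

[CODE]
#include <stdio.h>
#include <CL/cl.h>

/* Query how much private (scratch) memory the compiled kernel uses per
 * work-item. A large value here usually means registers are being spilled.
 * (Exactly what the driver reports varies.) */
static void print_private_mem_usage(cl_kernel kernel, cl_device_id device)
{
    cl_ulong private_bytes = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_bytes), &private_bytes, NULL);
    printf("Private memory per work-item: %llu bytes\n",
           (unsigned long long)private_bytes);
}
[/CODE]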

[QUOTE=utnapishtim;30992]This might be caused by register spilling. The full code may need more registers than each isolated part of the algorithm. In that case, values have to be temporarily stored to and read back from global memory, which is of course less efficient than using registers.

You should analyze your kernel with the AMD GPU profiler to pinpoint the origin of the problem.[/QUOTE]

Thank you! I didn’t even know such a tool existed.

The profiler did help me find one performance issue, but it was unrelated to the issue I’m talking about here. I am still seeing much shorter execution times for the individual parts than for the whole kernel.

There is one thing the profiler showed that I’m not understanding (the documentation is rather poor). The graph called “Number of waves limited by VGPRs” is red, and it shows the “Number of active wavefronts” decreasing from 20 to 4 as the “Number of VGPRs” increases. The device limit for this is shown as 40. As well, at the bottom there is an item called “Limiting factor(s)” in bold, red letters that says “VGPR”. I’m not following this. What’s a wave? I’m guessing VGPR means “vector general-purpose register”, but the “Vector GPR usage per work-item” is well under the limit (171 vs. a limit of 256). Can you help me understand this?

A wavefront is more or less the hardware counterpart of a work-group. Each work-group is split into blocks of 64 work-items; each block is executed as a wavefront by a compute unit. Several wavefronts can be resident on a compute unit at once: whenever a wavefront stalls on a pending memory access, another wavefront starts executing. This ensures that a compute unit does not stay idle when long memory latencies occur.

A compute unit can have up to 40 wavefronts scheduled at a time. However, the register budget a compute unit can allocate across all its wavefronts is 1024 registers per work-item, and wavefronts are granted in blocks of 4, so each block draws on a budget of 256 registers per work-item. The maximum number of wavefronts that can execute concurrently is therefore floor(256 / registers per work-item) × 4.

Since your kernel uses 171 registers, the number of wavefronts running concurrently is only 4 per compute unit. Note that this number starts to go down as soon as the number of registers per work-item exceeds 25.
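
Written out as a rough model (ignoring other limiting factors such as local memory and work-group size):

[CODE]
/* Rough occupancy model for an HD 7970 compute unit: a budget of 256
 * registers per work-item per block of wavefronts, wavefronts granted in
 * blocks of 4, and a hardware cap of 40 wavefronts per compute unit. */
static int wavefronts_per_cu(int vgprs_per_work_item)
{
    int waves = (256 / vgprs_per_work_item) * 4;
    return waves > 40 ? 40 : waves;
}

/* wavefronts_per_cu(25)  -> 40  (the hardware maximum)
 * wavefronts_per_cu(26)  -> 36
 * wavefronts_per_cu(171) -> 4   (your kernel)                          */
[/CODE]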

In fact, 171 registers is a lot for a kernel. Do you use arrays allocated in private memory? That could explain such a high value.

Yes, I use 7 arrays in private memory.

The first 4 are copies of arrays in global memory and are only read from. I copied them there specifically to increase performance because private memory is supposed to be faster to access than global. In fact, if I leave them in global memory and access them from there that number increases and the performance decreases.

The other 3 are written to and read from very often.

None of them are really that big. The first 4 are all single-dimension arrays of sizes 8, 17, 2 and 2 and the remaining 3 are 8x6x2, 8x6 and 11.
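
Roughly how they are declared in the kernel (names changed, everything is double precision):

[CODE]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Roughly how the private arrays are declared (names changed;
 * everything is double precision). The real kernel fills and uses them. */
__kernel void em_step(__global const double *in_data,
                      __global double       *out_data)
{
    /* read-only copies of small global arrays */
    double ro_a[8], ro_b[17], ro_c[2], ro_d[2];

    /* working arrays, read and written all the time */
    double wk_a[8][6][2];
    double wk_b[8][6];
    double wk_c[11];

    /* ... part A fills them from in_data, part B updates out_data ... */
}
[/CODE]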

Is there anything I can do about this or am I just stuck with this due to the nature of the algorithm?

Private memory is a lot faster than global memory (roughly 100x faster). However, you must treat it as a scarce resource. As I stated earlier, the optimal maximum number of registers for a kernel is 25 on an HD 7970.

In your case, you use far more registers than is reasonable, and your compute units probably end up stalling. Instead of using private memory for caching, you should consider using local memory. It is roughly 10x faster than global memory, and it will free all those precious registers for more concurrent wavefronts.
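
Here is a rough sketch of the idea for the read-only arrays (the names are placeholders and the sizes are taken from your description; note that the cooperative copy needs a barrier, and this assumes a work-group size of at least 17):

[CODE]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Sketch: cache the small read-only arrays in local memory once per
 * work-group instead of giving every work-item a private copy.
 * Sizes follow your description; names are placeholders. Assumes the
 * work-group size is at least 17 (the largest of these arrays). */
__kernel void em_step(__global const double *ro_a_g,  /* 8 elements  */
                      __global const double *ro_b_g,  /* 17 elements */
                      __global double       *out_data)
{
    __local double ro_a[8];
    __local double ro_b[17];

    /* Cooperative copy: a few work-items fill the local arrays, then
     * everyone waits at the barrier before reading them. */
    int lid = get_local_id(0);
    if (lid < 8)  ro_a[lid] = ro_a_g[lid];
    if (lid < 17) ro_b[lid] = ro_b_g[lid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... parts A and B read ro_a / ro_b from local memory here ... */
    out_data[get_global_id(0)] = ro_a[0] + ro_b[0]; /* placeholder use */
}
[/CODE]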

Ok, I’ll try that for the first 4 arrays. I’m not sure I can do anything about the other 3; the whole algorithm centers around them. There’s not enough local memory to move them there for all work-items, so I would have to revise the whole approach if I wanted to do that.

Thanks a lot for your help!

One more question, if you don’t mind. Do you know how big these registers are? Like I said, I’m dealing with doubles which are 64 bits.

VGPRs are 32 bits wide, so a double value occupies two VGPRs.