minimal efficient workgroup size

Hello,

I’m working on Mac OS 10.7, with AMD Radeon 6750M.
I wrote an OpenCL kernel, declared with the following attributes:

kernel 
__attribute__((vec_type_hint(float4)))
__attribute__((reqd_work_group_size(1, WG_SIZE, 1)))
void my_kernel(...) 
{
  // do something with float4 pixels
  local float shared_res; // group-wide result written by sum()
  local float tmp[WG_SIZE];
  for (int i = 1; i < N; i++)
  {
     float4 v = read_imagef(...);
     tmp[get_local_id(1)] = foo(v);
     barrier(CLK_LOCAL_MEM_FENCE); // local barrier
     sum(tmp, &shared_res); // sum tmp and write the result to shared_res
     if (shared_res > SOME_VALUE) break;
  }
}

As far as I understand, each work-group runs on one warp (wavefront).
In AMD the wavefront size is 64. Hence, there will generally be no benefit from having more than 16 work-items in each workgroup if the vec_type_hint is float4 (and the compiler uses this hint).

However, it seems that setting WG_SIZE to 64 rather than 16 gives a ~4x improvement in the running time of the kernel.
I suspect that the compiler ignores the vec_type_hint(float4) hint and compiles the code without vectorizing the float4 operations (i.e. running them one by one, leaving 75% of the warp empty).

In my specific case, I would like to use a minimal but efficient work-group size, as I have a branch in the kernel that allows me to stop the workgroup’s job early and save some time (it saves ~80% of the time in my CPU implementation). As the break happens in all work-items of the group together, this should not make the performance worse (am I right?).

How can I check my hypothesis, or understand what’s going on there and why a larger workgroup size gives better performance?

Thanks in advance,
Yoav

As far as I understand, each work-group runs on one warp (wavefront).

You may want to look again at your vendor’s OpenCL programming guide. One work-group typically contains multiple warps/wavefronts/waves.

Right, but does the GPU run more than WARPSIZE work-items from the same workgroup at a time?

The work-group size and the underlying compute unit size do not have to match. That said, you can write kernels that take advantage of knowledge of the underlying architecture. In this case, you can have a macro that gets passed to the kernel compiler to indicate the warp/wavefront size. Then you can use local memory to do some work across the workgroup. If your workgroup size matches the compute unit size, the compiler could optimize away things like barriers.
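As a minimal host-side sketch of that macro approach (the helper and the macro name WAVE_SIZE are illustrative; `program` and `device` are assumed to exist already, and error handling is omitted):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Bake the wavefront size into the kernel source as a macro at build
   time, so the kernel can size local buffers or special-case the
   one-wave-per-group path. 64 fits AMD wavefronts, 32 NVIDIA warps. */
cl_int build_with_wave_size(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device, "-DWAVE_SIZE=64", NULL, NULL);
}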

BTW, the workgroup and compute unit sizes have nothing to do with the size of the memory buffer you pass into the kernel. They simply indicate how many cores will be used concurrently. Each core could operate on 1, 2, 3, … etc. bytes from the buffer.
The exception is vector core architectures, like AMD’s VLIW or Intel/AMD’s SSE/AVX: there, passing in float4, for example, will help the compiler vectorize the work.
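For illustration, a hypothetical kernel written on float4, which a VLIW or SSE/AVX back-end can map straight onto its 4-wide lanes:

/* One work-item processes four floats at once; the float4 arithmetic
   maps naturally onto 4-wide VLIW or SSE lanes. */
__kernel void saxpy4(__global const float4 *x,
                     __global const float4 *y,
                     float a,
                     __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = a * x[i] + y[i];
}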

Right, but does the GPU run more than WARPSIZE work-items from the same workgroup at a time?

Reading your hardware vendor’s programming guide, you may have noticed that GPU hardware typically supports multiple warps/wavefronts/waves in each compute unit. Hardware vendors spend silicon supporting this because it’s highly beneficial to performance, even if at any point in time the compute unit is only executing a single wave. The performance boost comes from the fact that if a wave is executing an expensive operation, such as a global memory load, the hardware will switch to a different wave instead of having to wait for the memory transaction to complete.

In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.
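To put rough numbers on it (illustrative only): a global memory load can take on the order of hundreds of cycles while an ALU instruction takes only a few, so a compute unit with several resident waves can keep its ALUs busy with other waves while one wave waits on memory.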

Thanks for the detailed replies!

In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.

Can’t the GPU run another workgroup in parallel on the same compute unit to hide latency?

I believe that is the case, but typically a workgroup will have many times more threads than would fill a single warp/wavefront. This may be because the hardware scheduler is more efficient at managing many threads within a workgroup as opposed to managing many blocks; I’m not sure of the details.


jason

Answer to my question:
The GPU can run other workgroups in parallel to hide latency, but only if (the kernel requirements)*(# active workgroups) do not exceed the GPU resources.

If the kernel requirements are high, then the number of active workgroups will be low, and the GPU will not be able to hide latency well. This is measured by “occupancy”. NVIDIA has an occupancy calculator for their devices.
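As a made-up example of that arithmetic: if a compute unit has 32 KB of local memory and each 64-item workgroup needs 8 KB of LDS, at most 4 workgroups can be resident on that CU at once; if the kernel’s register usage only allows 2, then registers, not LDS, are the occupancy limiter.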

The profiler found in the AMD APP SDK offers an occupancy calculator as well. By launching something like

sprofile -o results.csv -O ./yourapp

you will get the performance counters in results.csv and the occupancy analysis in results.occupancy. The occupancy analysis from the AMD profiler tells you how many wavefronts per compute unit you can get, what limits your wavefronts/CU count (workgroup size, kernel requirements for registers or LDS, etc.), and the % occupancy of the CUs.

The profiler found in the AMD APP SDK offers an occupancy calculator as well. By launching something like

sprofile -o results.csv -O ./yourapp

you will get the performance counters in results.csv and the occupancy analysis in results.occupancy. The occupancy analysis from the AMD profiler tells you how many wavefronts per compute unit you can get, what limits your wavefronts/CU count (workgroup size, kernel requirements for registers or LDS, etc.), and the % occupancy of the CUs.

Is there a Mac version of the AMD profiler?

There is a Linux version that is command-line only. It’s worth a try.

I know it’s a bit late, but here’s some info on specific questions.

Yes, AFAIK all the GPU implementations ignore vec_type_hint, and the programming model as far as OpenCL is concerned is entirely scalar-per-thread. On AMD hardware until the latest iteration, it is implemented using a 4- or 5-wide instruction, but each thread gets its own 4- or 5-wide ALU. The vec hint is just a way to help a CPU access SIMD units, as CPUs have a very limited number of ‘threads’; GPUs don’t need such a hint as they’re already highly parallel.

(Note that each thread’s ALU on pre-GCN AMD hardware is VLIW, not SIMD: SIMD enforces vectorised algorithms, but AMD gets parallelism with scalar code simply based on data dependencies, so a vec_type hint isn’t going to be very useful.)

The AMD doco makes it clear that 64 is the minimum size you want for efficiency. I find it works pretty well as a baseline for most algorithms. If you have small kernels (small register usage, local usage) and a lot of jobs, they can be scheduled on the same processor core and hide latencies; so the optimum work size depends on the code being run and the size of the problem.
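If you don’t want to hard-code 64, OpenCL 1.1 can also report the warp/wavefront width at run time. A host-side sketch (assuming `kernel` and `device` already exist; error checking omitted):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdio.h>

/* CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE reports the wavefront/
   warp width (64 on this AMD part, 32 on NVIDIA). */
void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);
}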

In my specific case, I would like to use a minimal but efficient work-group size, as I have a branch in the kernel that allows me to stop the workgroup’s job early and save some time (it saves ~80% of the time in my CPU implementation). As the break happens in all work-items of the group together, this should not make the performance worse (am I right?).

You might have to re-think that. Branches that ‘save work’ can often result in slower code, particularly in an inner loop where any extra work in evaluating a termination condition can add up. But it depends a lot on the algorithm. Except in specific circumstances, all threads execute all paths of all branches; they just mask out results in inactive branches. The specific circumstance is that, since the processor executes a wavefront in groups of 16 (AFAIK; maybe it’s groups of 64) in sequence, if all threads beyond those completed are terminated then they can avoid being executed at all. So if you’re terminating random threads across the wavefront, you will gain nothing but the cost of testing when they’re done.
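The case where early termination does pay off is when the whole wavefront takes the branch together, which is what your kernel is doing. A sketch of that pattern (hypothetical names, not your original code):

/* Group-uniform early exit: every work-item reads the same local value
   after a barrier, so the whole wavefront branches the same way and no
   lanes are masked out. */
__kernel void early_exit_demo(__global const float *in,
                              __global float *out,
                              int n,
                              float limit)
{
    __local float group_sum;
    float acc = 0.0f;

    for (int i = 0; i < n; i++) {
        acc += in[get_global_id(0)];    /* placeholder per-item work */

        if (get_local_id(0) == 0)
            group_sum = acc;            /* one item publishes the value */
        barrier(CLK_LOCAL_MEM_FENCE);   /* everyone sees the same value */

        if (group_sum > limit)
            break;                      /* uniform: all items break together */
        barrier(CLK_LOCAL_MEM_FENCE);   /* keep item 0 from overwriting early */
    }
    out[get_global_id(0)] = acc;
}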

So long as branches aren’t in the innermost loop, the cost is small. AMD hardware has some overhead implementing a branch, but often branches can be removed by using branchless logic, i.e. select(), (?:), etc.
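For example (a hypothetical kernel; the threshold test and the doubling are made up just to show the transformation):

/* The same per-component computation written with branches and with
   branchless select(). */
__kernel void branchless_demo(__global const float4 *in,
                              float threshold,
                              __global float4 *out)
{
    size_t i = get_global_id(0);
    float4 v = in[i];

    /* branchy form, per component:
       if (v.x > threshold) v.x *= 2.0f;   ... likewise for y, z, w */

    /* branchless form: the vector comparison yields -1 per component
       where true, and select() picks the second operand there */
    int4 mask = v > (float4)(threshold);
    out[i] = select(v, v * 2.0f, mask);
}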

How can I check my hypothesis, or understand what’s going on there and why a larger workgroup size gives better performance?

As suggested: read the vendor documentation. The AMD stuff is quite comprehensive (the AMD APP OpenCL Programming Guide; chapter 4 is all about performance). Some of the magazine articles on the hardware (AnandTech, Tom’s Hardware, and so on) are also good for an overview.

Thanks a lot for the detailed reply!

The branching is done by testing some shared value, so all threads in the wavefront should terminate together. And yes, the branching is done in the outer loop, so it’s not that expensive.

I’m developing on Mac, and I have not found anything about automatically splitting kernels across threads using the vec hint, so thanks for the information.
BTW, the NVIDIA implementation for Windows explicitly outputs a warning that the vec hint is ignored.

Why do GPU implementations ignore the vec hint? Is it a real limitation, or just because of the assumption that when you have a lot of threads it would be more efficient to ignore it?

I could have implemented the intended effect of this vec hint myself, if I could write the float4 result of read_imagef directly to the private registers of four threads, but this is not possible in OpenCL without passing it through local memory (right?).
Is this a hardware limitation or an OpenCL language limitation? (i.e. can different GPU hardware load image2d_t pixels into the registers of 4 physical threads?)

It’s just not necessary. From the documentation http://www.khronos.org/registry/cl/sdk/ … fiers.html it’s basically a way to utilise a wide SIMD unit: but GPUs don’t have such SIMD units, so the hint just isn’t appropriate for them.

I could have implemented the intended effect of this vec hint myself, if I could write the float4 result of read_imagef directly to the private registers of four threads, but this is not possible in OpenCL without passing it through local memory (right?).
Is this a hardware limitation or an OpenCL language limitation? (i.e. can different GPU hardware load image2d_t pixels into the registers of 4 physical threads?)

Actually, that isn’t what the vec hint is for. It would be more like taking a routine that works on float4 and making it run in a single thread with float8. The hint helps the compiler combine multiple ‘OpenCL threads’ into single ‘CPU threads’, not the other way around.
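A rough illustration of the direction of that transformation (conceptual only; these hypothetical kernels just show what fusing two float4 work-items into one float8 CPU thread would amount to):

/* What the kernel says, per work-item: */
__kernel
__attribute__((vec_type_hint(float4)))
void add4(__global const float4 *a, __global const float4 *b,
          __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];            /* one float4 add per work-item */
}

/* Roughly what a CPU implementation could fuse it into: half as many
   threads, each doing the work of two work-items with float8. */
__kernel void add8(__global const float8 *a, __global const float8 *b,
                   __global float8 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];            /* one float8 add for two work-items */
}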

But as GPUs are optimised for float4 (the memory system as well as the ALUs), trying to do that would almost certainly result in slower code: so just stick to float4.