Urgent: Cache optimization for spring-mass system in OpenCL

Hi there,

For a master’s thesis that is close to submission, we urgently need some help with the cache utilization of the graphics card.

We’re developing a rather complex spring-mass system in OpenCL. It is based on a tetrahedral topology and uses springs on edges, triangles, and tetrahedra. All of these springs act on their adjacent vertices and apply forces, which are accumulated in a reduction operation.

Here’s a brief description of the general algorithm:

1. Compute forces on vertices per edge and store them in a temp buffer
2. Compute forces on vertices per triangle and store them in a temp buffer
3. Compute forces on vertices per tetrahedron and store them in a temp buffer
4. Accumulate forces per vertex from the temp buffers
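
To make this concrete, here is a stripped-down sketch of what kernel 1 looks like (OpenCL C). The buffer names and the plain linear (Hooke) spring are simplified stand-ins for our actual code:

__kernel void edge_forces(__global const float4 *pos,        // vertex positions
                          __global const int2   *edges,      // vertex index pair per edge
                          __global const float  *rest_len,   // rest length per edge
                          __global float4       *edge_force, // temp buffer, one entry per edge
                          const float stiffness)
{
    int  e  = get_global_id(0);
    int2 ev = edges[e];

    // These two reads are the unordered, cache-unfriendly lookups described
    // above: ev.x and ev.y can point anywhere in the vertex buffer.
    float4 pa = pos[ev.x];
    float4 pb = pos[ev.y];

    float4 d   = pb - pa;
    float  len = length(d.xyz);

    // Plain Hooke spring: force proportional to extension beyond rest length.
    edge_force[e] = (len > 0.0f)
        ? (stiffness * (len - rest_len[e]) / len) * d
        : (float4)(0.0f);
}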
The problem is that in the first three kernels, each edge/triangle/tetrahedron must look up the positions of its adjacent vertices. These accesses are unordered, so the cache is barely used. The last kernel is even worse: each vertex needs to look up the forces from all adjacent edges/triangles/tetrahedra. For this, we use three arrays of indices pointing to the elements and then fetch the force vectors from the temp buffers written by the previous kernels. These lookups are also essentially random and don’t use the cache.
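
Kernel 4 is essentially a gather over those index arrays. A simplified sketch (edge_off/edge_idx are stand-in names for our CSR-style index arrays; triangles and tetrahedra follow the same pattern):

__kernel void accumulate_forces(__global const int    *edge_off,   // CSR offsets, num_verts + 1 entries
                                __global const int    *edge_idx,   // adjacent-edge indices per vertex
                                __global const float4 *edge_force, // temp buffer from the edge kernel
                                __global float4       *vertex_force)
{
    int v = get_global_id(0);
    float4 f = (float4)(0.0f);

    // Every iteration dereferences an essentially random element of the temp
    // buffer -- these are the scattered reads that defeat the cache.
    for (int i = edge_off[v]; i < edge_off[v + 1]; ++i)
        f += edge_force[edge_idx[i]];

    vertex_force[v] = f; // triangle and tetra contributions are added the same way
}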

The AMD profiler tells us that the cache hit rate is close to 0% and that the ALUs are only at about 10% utilization, which isn’t surprising, as they get bored waiting for global memory reads.

So, does anybody have suggestions on how to optimize the memory access? We believe this is a well-known problem (probably in applications other than spring-mass simulations, too).

Any help is really appreciated!

I don’t know the standard solution to this problem. What I would do is this: perform a preprocessing pass in which you reorder the triangles/tetras in a way that maximizes locality. In the OpenGL/D3D world there are tools for this called triangle strippers; I suppose the algorithms they use can be generalized to tetras.
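
Just to illustrate the idea, here is a crude stand-in for a real stripper: a greedy breadth-first reordering in plain C. adjacency() is a hypothetical callback you would supply from your topology:

#include <stdlib.h>

/* Computes `order`, the new visiting sequence of the elements, so that
 * neighbours end up close together in memory. adjacency() fills `neighbours`
 * with the elements sharing a vertex with `e` and returns how many there
 * are (at most 64 here, for simplicity). */
void reorder_bfs(int num_elems, int *order,
                 int (*adjacency)(int e, int *neighbours))
{
    char *seen  = calloc(num_elems, 1);
    int  *queue = malloc(num_elems * sizeof(int));
    int head = 0, tail = 0, out = 0;
    int neighbours[64];

    for (int seed = 0; seed < num_elems; ++seed) {
        if (seen[seed]) continue;
        seen[seed] = 1;
        queue[tail++] = seed;
        while (head < tail) {
            int e = queue[head++];
            order[out++] = e; /* element e is emitted at position out-1 */
            int n = adjacency(e, neighbours);
            for (int i = 0; i < n; ++i) {
                if (!seen[neighbours[i]]) {
                    seen[neighbours[i]] = 1;
                    queue[tail++] = neighbours[i];
                }
            }
        }
    }
    free(queue);
    free(seen);
}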

Once you have this locality, you can take advantage of it by loading the triangle data into local memory. This use of local memory is akin to a programmable cache: if most work-items in a work-group access the same data, placing it in local memory greatly reduces external bandwidth, and you will see increased ALU utilization.
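
The staging idiom could look something like this, assuming the reordering has packed each work-group’s vertices into one contiguous range of the position buffer:

__kernel void forces_tiled(__global const float4 *pos,
                           __local  float4       *tile,
                           const int base,   // first vertex of this work-group's chunk
                           const int count)  // number of vertices in the chunk
{
    // All work-items cooperatively stage the shared vertex block once...
    for (int i = get_local_id(0); i < count; i += get_local_size(0))
        tile[i] = pos[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ...after which the spring computations read positions from `tile`
    // (fast local memory) instead of `pos` (slow global memory).
}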

Thanks for the reply. I’ll have a look into the triangle strippers. Concerning the local memory: unfortunately, work-items are usually independent of each other and only access “their” data, so we won’t be able to make use of local memory very often…

That’s why you need to reorder the input data so that work-items in the same work-group do indeed access vertices (masses) that share edges (springs). You reduce bandwidth to global memory by increasing the share of data you access in local memory.

Basically, what I’m suggesting is to segment the spring-mass system into work-group-sized discrete pieces and store those contiguously in memory. When a work-group launches, it copies its nicely packed chunk of data into local memory for fast access.

You will still need random accesses to global memory for the edges and vertices that lie on the boundary between two pieces, and those accesses will indeed produce cache misses. However, the majority of memory accesses will be served from local memory (see the sketch below).
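
One possible way to express that split (the index convention here is purely an assumption of mine, for illustration):

// Hypothetical index convention: after preprocessing, a vertex index below
// `count` is a position inside this work-group's tile; an index >= count
// encodes (global index + count) for a boundary vertex.
float4 fetch_vertex(__local  const float4 *tile,
                    __global const float4 *pos,
                    int idx, int count)
{
    return (idx < count) ? tile[idx]          // interior: fast local read
                         : pos[idx - count];  // boundary: global read, may miss the cache
}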

Do you see where I’m going?

Yep, I guess I do. We actually talked about such an approach about half an hour ago :). Sounds like quite a bit of work, though. I’ll dig into the code and data structures and see how it goes.

By the way, are there any publications on this topic?

“Sounds like quite a bit of work though.”

Yes, it is. The good thing is that you can do it in two stages. First, simply pack the input data into bite-sized chunks as best you can (using a stripper or something similar). Once you’ve done that, your indices will often be sequential: [0 1 2 3 4 5 945 7 8 9 10 452 …]. That will increase the number of coalesced memory accesses and the usage of your global memory cache (since you said your hardware does have one).
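
Applying the new order is then just a host-side permutation pass, something like this (plain C; the buffer layout is an assumption):

#include <string.h>

/* perm[old] = new vertex position. Positions are assumed to be stored as
 * 4 floats per vertex, edges as 2 ints per edge. */
void apply_permutation(int num_verts, const int *perm,
                       const float *pos_in, float *pos_out,
                       int num_edges, int *edge_idx)
{
    /* Move each vertex to its new slot. */
    for (int v = 0; v < num_verts; ++v)
        memcpy(&pos_out[4 * perm[v]], &pos_in[4 * v], 4 * sizeof(float));

    /* Rewrite the index buffer so it references the new vertex order. */
    for (int i = 0; i < 2 * num_edges; ++i)
        edge_idx[i] = perm[edge_idx[i]];
}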

Later, as a further improvement, you could do the local memory thing; but first you need to increase memory-access locality, as described above.

I can’t think of any specific publication to point you at, although both AMD and NVIDIA publish some nice OpenCL programming guides that touch on these memory-access topics.