OpenCL Kernel Memory Optimization - Local vs. Global Memory

Hi,

I’m new to OpenCL and I’m considering using it for some graphics computations where an OpenGL shader doesn’t seem to be the natural fit. Before I actually do so, I thought I’d check how much of a performance improvement I could get using OpenCL on my Nvidia GTX 460 over my CPU. For this reason, I implemented a simple skeletal skinning algorithm twice: once on the CPU, without multithreading but using the Eigen library, which provides SSE-optimized vector and matrix classes, and once as an OpenCL kernel executing on the GPU. The vertices, bone matrices etc. are generated randomly at application start. I repeat the whole skinning several times so that it runs long enough to give meaningful timing results.

First I simply tried a kernel with as many work-items as there are vertices, each one generating one output vertex. I quickly saw that this is not a good idea, because performance was even worse than on the CPU. I figured this was essentially a problem of too many memory accesses, mainly to the bone matrices, which are an array of float16 vectors that is addressed four times in each work-item. I then changed the algorithm so that each work-item handles multiple output vertices, one after the other, so that there are fewer work-items. In each work-group I create a copy of the bone matrices in local memory, and all further accesses to these matrices go through that local copy. The interesting part of my C++ code looks like this:

#define NUM_BONES 30
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000

uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
    File kernelFile("/home/alemariusnexus/test/skelanim.cl");
    
    char opts[256];
    sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
    
    cl_program prog = BuildOpenCLProgram(kernelFile, opts);
    
    cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
    
    cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
    cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
    cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
    cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
    cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
    
    uint64_t s, e;
    s = GetTickcount();
    
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
    
    size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
    size_t localWorkSize[] = { NUM_BONES };
    
    for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    }
    
    clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
    
    e = GetTickcount();
    
    return e-s;
}

The associated program/kernel looks like this:

inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
    return (float4) (
        dot(m.s048C, v),
        dot(m.s159D, v),
        dot(m.s26AE, v),
        dot(m.s37BF, v)
    );
}


kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    
    local float16 lBoneMats[NUM_BONES];
    lBoneMats[lid] = boneMats[lid];
    
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
        int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
    
        float4 vertex = vertices[vidx];
        float4 w = weights[vidx];
        uint4 idx = indices[vidx];
        
        resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
                + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
                + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
                + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
    }
}

Now, per work-item there is only one access to the global boneMats, when the local copy is created, and far fewer work-items are executing altogether. After that, each work-item does NUM_VERTICES_PER_WORK_ITEM*4 accesses to the local array, so per work-group of 30 work-items that is 30 global float16 loads versus 30*100*4 = 12000 reads served from local memory. As I understand it, local memory should be way faster than global memory, so I thought this would greatly improve performance. Well, the opposite is the case: when I let lBoneMats alias the global boneMats instead, I actually get better performance than with the kernel listed above.

What did I get wrong here?

Thanks in advance!

I have still not found a solution for this problem. Does nobody have any idea, or anything I could try?

There is the Nvidia Visual Compute Profiler which can tell you some performance information.

Looking at your launch parameters, you are launching 300 work-items arranged into 10 groups of 30 work-items each. On Nvidia GPUs, threads are grouped into warps - groups of 32 threads. Each multiprocessor executes hundreds of threads, and such a large number of threads is needed to hide the latency involved in accessing either global or local memory (although local memory accesses are not as costly).

There are several multiprocessors, hence you need thousands of threads to adequately use a GPU. That is why I think your global-memory version is faster. I would suggest changing your launch settings so that there are more threads (keep the shared memory though, just do less work per work item). Also, use multiples of 32 threads.
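For example, something along these lines (just a rough sketch, not tested; the kernel then has to process one vertex per work-item, guard the local copy with "if (lid < NUM_BONES)" and skip padding work-items whose index is >= NUM_VERTICES):

size_t localWorkSize[] = { 128 };   // a multiple of 32
// round the global size up to the next multiple of the local size,
// so that every vertex gets its own work-item (30000 -> 30080 here)
size_t globalWorkSize[] = { ((NUM_VERTICES + localWorkSize[0] - 1) / localWorkSize[0]) * localWorkSize[0] };

clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);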

a couple of things:

a) even if the gpu is slower than a cpu, not having to copy the data to/from the device could help. (i.e. go straight from opencl to opengl buffer or whatever)
b) looks like your loop is accessing memory (vidx) in pretty much a worst-case access pattern. each work-item should access adjacent values where possible.

To me it looks like it would be best implemented as a one-work-item-per-output algorithm, as you said you first tried. Or even use 4 work-items per result (one for each of idx.xyzw). Assuming the indexing is correct (i.e. set vidx == get_global_id(0)), I would have expected that to be faster.
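Roughly like this, just to sketch the idea (untested; it keeps your local copy of the bone matrices, the real changes are the indexing and the two guards, and it assumes the global work size is rounded up to a multiple of the work-group size):

kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
    int vidx = get_global_id(0);    // one work-item per output vertex
    int lid = get_local_id(0);
    
    local float16 lBoneMats[NUM_BONES];
    if (lid < NUM_BONES)            // work-group size may now be larger than NUM_BONES
        lBoneMats[lid] = boneMats[lid];
    
    barrier(CLK_LOCAL_MEM_FENCE);
    
    if (vidx >= NUM_VERTICES)       // padding work-items do nothing
        return;
    
    float4 vertex = vertices[vidx]; // adjacent work-items now load adjacent elements
    float4 w = weights[vidx];
    uint4 idx = indices[vidx];
    
    resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
            + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
            + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
            + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
}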

I think you’re also confusing what local work size is - it is just the size of the chunks that the total work size gets divided into, each chunk being allocated to one work-group (i.e. the work-items that share LDS and some other resources). It isn’t a separate dimension from global work size. It is only a ‘coincidence’ that your code is working, because num_vertices/num_vertices_per_work_item happens to be a multiple of num_bones.

LDS is way faster than uncached global memory, but if you’re only accessing 30 ‘bones’ (30 float16 matrices is only 30*64 = 1920 bytes), then they should fit into the L1 cache, in which case LDS isn’t that much of a boost (it depends on the hardware, not sure what it is on nvidia).

There is the Nvidia Visual Compute Profiler which can tell you some performance information.

I have tried it with CUDA 4.2, but couldn’t really make sense of what it was trying to tell me. With CUDA 5.0 I can’t get OpenCL profiling to work at all. From what I’ve read, OpenCL profiling seems to be broken in 5.0, and the 4.2 driver does not compile against my 3.5 kernel, so I guess I have to wait until Nvidia fixes that (I’m sceptical as to whether they will at all).

I would suggest changing your launch settings so that there are more threads

Seems like this was the main problem. I have no idea what I’m doing now that I didn’t do in my first implementation, but now it’s about four times faster than on my CPU+SSE.

Also, use multiples of 32 threads.

I don’t really know how to do that with my algorithm, as I have no control over the number of vertices, but now that I launch one thread for every vertex, I guess the bit of processing time wasted is not really significant anymore.

even if the gpu is slower than a cpu, not having to copy the data to/from the device could help. (i.e. go straight from opencl to opengl buffer or whatever)

Maybe I’ll do this in my final implementation.

looks like your loop is accessing memory (vidx) in pretty much a worst-case access pattern. each work-item should access adjacent values where possible.

I can’t see any nice way to do this. Maybe presorting the vertices by their bone matrix indices, but that would be quite costly (although it would have to be done only once) and I don’t like the idea of changing the vertex order.

To me it looks like it would be best implemented as a one-work-item-per-output algorithm, as you said you first tried. Or even use 4 work-items per result (one for each of idx.xyzw). Assuming the indexing is correct (i.e. set vidx == get_global_id(0)), I would have expected that to be faster.

As I mentioned before, I do this now, and for whatever reason it’s faster than what I first tried.

I think you’re also confusing what local work size is - it is just the size of the chunks that the total work size gets divided into, each chunk being allocated to one work-group (i.e. the work-items that share LDS and some other resources). It isn’t a separate dimension from global work size. It is only a ‘coincidence’ that your code is working, because num_vertices/num_vertices_per_work_item happens to be a multiple of num_bones.

That’s what I thought it was: the number of work-items (threads) per work-group (thread block). I know I have to choose the execution parameters so that the total number of work-items is evenly divisible by it.
In my understanding, changing the local work size should not affect performance, as long as shared memory is not used (otherwise the more work-groups you have, the more global-to-shared memory copies have to be done, assuming every work-group always copies the same amount of data) and it is still a multiple of the warp size (because otherwise the warps aren’t fully utilized).

One question I still have which I couldn’t guess from Nvidias docs is: Can a single warp be made up of threads from different work groups (thread blocks)?

Although there might be room for further improvement, at least I can see that the GPU is actually faster than the CPU, so I’m satisfied for now. The only thing I can’t quite figure out is why the same program runs about 8 times slower than the CPU on my old GeForce 8200, even when I optimize the execution parameters. I guess that is because it’s an onboard GPU and global memory accesses are even slower than on a GPU with dedicated memory. The same is true when I execute the CL program on my CPU device, but that might just be because it’s too massively multithreaded for a CPU; I haven’t tested this enough yet.

Anyway, thanks for your help!

Changing the local work size will affect performance beyond just the use of LDS, for a bunch of reasons: everything in the work-group executes in lock-step, which affects caching and branching behaviour; it affects how many registers are required, which in turn affects how many work-groups can be executed concurrently; etc.

BTW use a worksize multiple of 64 if you also want it to work well on AMD hardware, as that is the minimum it requires.
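If you don't want to hardcode 32 or 64, you can also query the preferred multiple at runtime (OpenCL 1.1; "device" here stands for the cl_device_id you're running on, which isn't shown in your snippet):

size_t mult = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(mult), &mult, NULL);
// typically the warp/wavefront size: 32 on nvidia, 64 on AMD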

One question I still have which I couldn’t guess from Nvidias docs is: Can a single warp be made up of threads from different work groups (thread blocks)?

A warp is just a hardware implementation detail specific to nvidia. But afaik, all threads in a warp are executing the same code at the same time, so they have to be part of the same opencl workgroup for it to make any sense.

i.e. I believe there is a 1:N mapping of opencl workgroup to nvidia warp.

Although there might be room for further improvement, at least I can see that the GPU is actually faster than the CPU, so I’m satisfied for now. The only thing I can’t quite figure out is why the same program runs about 8 times slower than the CPU on my old GeForce 8200, even when I optimize the execution parameters. I guess that is because it’s an onboard GPU and global memory accesses are even slower than on a GPU with dedicated memory. The same is true when I execute the CL program on my CPU device, but that might just be because it’s too massively multithreaded for a CPU; I haven’t tested this enough yet.

Anyway, thanks for your help!

Well, there’s a large variation in performance across GPU cards, so a low-end card just can’t speed things up the way a faster one can.

And to get that performance you need to access memory properly - i.e. coalesced.
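In your original blocked kernel, for example, the difference is roughly this (just a sketch of the two indexing schemes):

// blocked (what you had): in any given iteration i, neighbouring work-items read
// addresses NUM_VERTICES_PER_WORK_ITEM elements apart, so the loads can't be merged
vidx = gid * NUM_VERTICES_PER_WORK_ITEM + i;

// interleaved: in iteration i, neighbouring work-items read neighbouring elements,
// so the hardware can combine them into a few wide memory transactions
vidx = i * get_global_size(0) + gid;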

I have a follow-up question to this. My GPU has 384 cores and 8 compute units (streaming multiprocessors), so there are 384/8 = 48 streaming processors on each compute unit. Given that the Nvidia warp size is 32, which means 32 threads execute in lock-step, doesn’t that mean 16 SPs are doing nothing on each cycle? That doesn’t seem to make sense to me. Can someone help to clarify?

Thanks,
J