Profiling Code

Ok, so I have my code running, but I have to say I’m disappointed with the performance.

It’s a particle system and I get the following approximate performance statistics:

Scalar version on the CPU: 1 Million particles per second.
GPGPU version using GLSL: 55 Million particles per second.
OpenCL version on CPU: 5 Million particles per second.
OpenCL version on GPU: 4 Million particles per second.

OpenCL on the CPU seems about right. I’m doing the calculations on 3-component float vectors (packed into float4s) and I’m on a Core 2 Duo, so two cores of 4-wide SIMD with only three of the four lanes doing useful work: a roughly six-times speed-up would be my theoretical maximum, and that’s before allowing for some unavoidable scalar calculation. I’m happy with that result.

The problem is obviously the GPU-based OpenCL. It’s nearly 14x slower than my GPGPU implementation, and it’s even slower than the CPU OpenCL. Clearly something is going very wrong. I suspect it’s down to memory access, but I don’t know for sure.

How can I find out what is making my code slow?
What profiling tools are there?

I’m currently on Snow Leopard, but could probably port my code to Linux if there are better tools there.

Can you post the kernel you’re using? What work-sizes are you using with it?

Beyond the profiling flag for your command queue (which probably isn’t too helpful here?) there isn’t a standard way of profiling.
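
The profiling flag does at least give you per-kernel timings through events, which might help narrow things down. A rough sketch with the C API (error handling omitted; the queue/kernel/size names are placeholders):

    // Create the queue with profiling enabled
    cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

    // Keep the event from the kernel launch and wait for it to complete
    cl_event kernelEvent;
    err = clEnqueueNDRangeKernel(queue, particleKernel, 1, NULL,
                                 &globalSize, NULL, 0, NULL, &kernelEvent);
    clWaitForEvents(1, &kernelEvent);

    // Device timestamps are in nanoseconds
    cl_ulong start, end;
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);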

Have you tried commenting out various parts of your kernel and seeing where the ‘slowness’ comes from?

Work-group size is a really important factor. If you’re setting it to 1 you will get terrible utilization on GPUs! You also need to make sure that the data transfer is not killing you. On the CPU the data doesn’t have to move over the PCI bus, but it does on the GPU. This means you want to move as little data as possible and do a lot of computation. Given that your GLSL performance is high I doubt this is the issue, though.
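
If you do want to experiment with the work-group size rather than letting the driver choose, it’s just the local-size argument to the enqueue. A quick sketch (64 is only an example; whatever you pick must divide the global size evenly):

    size_t globalSize = 1000000;   // one work-item per particle
    size_t localSize  = 64;        // try a few values; must divide globalSize evenly
    err = clEnqueueNDRangeKernel(queue, particleKernel, 1, NULL,
                                 &globalSize, &localSize, 0, NULL, NULL);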

This is the kernel:

kernel void particle(int numberOfGalaxies,

                     global float4 * galaxyPositions,
                     global float * galaxyMasses,
                     float G,
                     float dT,

                     global float4 * starPositions,
                     global float4 * starVelocities,

                     global float4 * newStarPositions,
                     global float4 * newStarVelocities) {
                     
    int gid = get_global_id(0);
    
    float4 starPosition = starPositions[gid];
    float4 starVelocity = starVelocities[gid];
    
    for (int galaxy = 0; galaxy < numberOfGalaxies; galaxy++) {
       float4 galaxyPosition = galaxyPositions[galaxy];
       float  galaxyMass     = galaxyMasses[galaxy];

       float4 dP = starPosition - galaxyPosition;
       float d = length(dP);
       float acceleration = galaxyMass * G / (d * d * d) * dT; 
           
       starVelocity = starVelocity - dP * acceleration;
    }
    
    starPosition = starPosition + starVelocity * dT;

    newStarPositions[gid]  = starPosition;
    newStarVelocities[gid] = starVelocity;
}

Each time, I enqueue this with a 1-dimensional, one-million-item global work-size. I’ve been leaving the local work-size as NULL to allow the driver to decide an optimal value.
- starPosition/Velocities are read/write buffer objects, but used for input only.
- newStarPosition/Velocities are read/write buffer objects, but used for output only.
- After each iteration the buffer objects are swapped new<->old, avoiding any copies.
- One call to enqueueReadBuffer is performed on newStarPositions after an iteration to get the locations back for display. It’s non-blocking, but I do wait for the event before display. Removing this read and the display routines doesn’t make a huge difference. (The per-frame host side is roughly as sketched after this list.)
- The main load appears to be the loop. I would have liked to put the galaxyPositions/Masses into constant space, but making that change (i.e. changing global to constant) crashes the compiler.
- numberOfGalaxies = 20.
- The speeds I gave before were for 3-element vectors. I changed to 4 as an experiment and got a speed increase of about 1 million particles per second on both the CPU and the GPU.
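
Roughly, the per-frame host side looks like this (a simplified sketch using the C API; buffer and variable names are placeholders, and error handling, display code, and re-setting the kernel arguments to the swapped buffers are omitted):

    // Run one iteration; local work-size left NULL so the driver chooses
    err = clEnqueueNDRangeKernel(queue, particleKernel, 1, NULL,
                                 &globalSize, NULL, 0, NULL, NULL);

    // Non-blocking read of the new positions for display, then wait on the event
    cl_event readEvent;
    err = clEnqueueReadBuffer(queue, newStarPositionsBuffer, CL_FALSE, 0,
                              sizeof(cl_float4) * particleCount, hostPositions,
                              0, NULL, &readEvent);
    clWaitForEvents(1, &readEvent);

    // Swap old <-> new buffers for the next iteration (no copies)
    cl_mem tmp;
    tmp = starPositionsBuffer;  starPositionsBuffer  = newStarPositionsBuffer;  newStarPositionsBuffer  = tmp;
    tmp = starVelocitiesBuffer; starVelocitiesBuffer = newStarVelocitiesBuffer; newStarVelocitiesBuffer = tmp;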

Hmm, the code looks quite sensible and there are no barriers/fences in there that might mess things up. My best guess would be that the reads from the non-const input arrays are causing something bad (and unneeded) to happen with the memory/cache on the device. That would hurt quite a lot, especially given that every work-item reads every single galaxy.

Does changing "global float4 *" to "global const float4 *" change the timings at all? It might let the compiler optimize the loads better. You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? Might be interesting to see if that makes any difference.

I’m speculating here, so I could be miles off though!

Alan

Does changing "global float4 *" to "global const float4 *" change the timings at all?

It brings the GPU to approximate parity with the CPU (6M each). Worthwhile, but it’s not the order of magnitude I’m looking for.

You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? Might be interesting to see if that makes any difference.

I just gave it a go, and it cost me a little (about 300k particles a second on the GPU - about 800k on the CPU). I have also tried just having a single buffer that’s read and written to, and that didn’t seem to be a win or a loss. I was wondering if I might benefit from a smaller cache footprint.

Paul,
Use async_work_group_copy to copy your 20 galaxies into local memory. That should give you a tremendous speed boost. Currently you are reading the same data from global memory for every operation, which is probably causing a huge slowdown. (On the CPU that data sits in the cache, so you don’t see the hit as much.)

That’s a big improvement. Up to about 15.5 Million now, so we’re nearly 4x from where we started. The kernel now looks like this:

kernel void particle(int numberOfGalaxies,

                     global const float4 * galaxyPositions,
                     global const float * galaxyMasses,
                     
                     local float4 * localGalaxyPositions,
                     local float  * localGalaxyMasses,
                     
                     float G,
                     float dT,

                     global const float4 * starPositions,
                     global const float4 * starVelocities,

                     global float4 * newStarPositions,
                     global float4 * newStarVelocities) {

    event_t galaxyEvent[2];

    galaxyEvent[0] = async_work_group_copy(localGalaxyPositions, galaxyPositions, numberOfGalaxies, 0);
    galaxyEvent[1] = async_work_group_copy(localGalaxyMasses, galaxyMasses, numberOfGalaxies, 0);

    int gid = get_global_id(0);
    
    float4 starPosition = starPositions[gid];
    float4 starVelocity = starVelocities[gid];
    
    wait_group_events(2, galaxyEvent);

    for (int galaxy = 0; galaxy < numberOfGalaxies; galaxy++) {
       float4 galaxyPosition = localGalaxyPositions[galaxy];
       float  galaxyMass     = localGalaxyMasses[galaxy];

       float4 dP = starPosition - galaxyPosition;
       float d = length(dP);
       float acceleration = galaxyMass * G / (d * d * d) * dT; 
           
       starVelocity = starVelocity - dP * acceleration;
    }
    
    starPosition = starPosition + starVelocity * dT;

    newStarPositions[gid]  = starPosition;
    newStarVelocities[gid] = starVelocity;
}

It took me a little while to work out how to deal with the local memory. Does that look ok to people?
I set the kernel argument with

    err = clSetKernelArg(particleKernel, 3, sizeof(float) * 4 * numberOfGalaxies, NULL); // localGalaxyPositions
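    // (Presumably the matching call for the masses buffer looks like this;
    //  argument index 4 assumes the order in the kernel signature above.)
    err = clSetKernelArg(particleKernel, 4, sizeof(float) * numberOfGalaxies, NULL); // localGalaxyMasses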

I should add that the async copy cost a little performance on the CPU, but not much.

I’ve now taken that improvement over to a version that shares an OpenGL VBO as the starPositions memory buffer object. This eliminates the read-back and re-submission of all the position data for display, which was becoming significant.
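
In case it’s useful to anyone, the CL/GL sharing part is roughly this pattern (a sketch rather than my exact code; the context has to be created with GL sharing enabled, which is platform-specific, and the names are placeholders):

    // Create a CL buffer that aliases the existing GL VBO
    cl_mem starPositionsBuffer = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, positionVBO, &err);

    // Per frame: let CL borrow the buffer, run the kernel, then hand it back to GL
    clEnqueueAcquireGLObjects(queue, 1, &starPositionsBuffer, 0, NULL, NULL);
    err = clEnqueueNDRangeKernel(queue, particleKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &starPositionsBuffer, 0, NULL, NULL);
    clFinish(queue);   // make sure CL has finished before GL draws from the VBO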

We’re now up to 30 Million particles per second, which is within a factor of 2 of my best GPGPU results on this machine, and 7.5 times what we started with.

I expect some of the difference versus my GLSL kernel is that the galaxyPositions/Masses were defined as uniforms, so they could be loaded once into fast memory and left alone. This kernel has to copy them to local memory for each work-group. Am I right in saying that putting the values in constant memory would have the same effect?

Not sure how much more there is to squeeze out of this, but it’s been an interesting experiment. Hope it’s been useful to others too.

Could still do with some form of profiling :wink: I know it’s not that easy though.

Paul,

You should put in a memory barrier after your async_work_group_copy calls to make sure all outstanding memory accesses across the work-group are done before your kernel continues. (This shouldn’t matter on current hardware, but may be needed in the future.)

The other thing you should do is enable MADs. Take a look at the OpenCL documentation for the build option that enables use of the mad instruction (it’s -cl-mad-enable). It will be off by default in CL, but on by default in GLSL, so you may be able to get a boost out of that.
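
i.e. pass it in the options string when you build the program, something like:

    // Allow the compiler to fuse multiplies and adds into mad() (trades a little accuracy for speed)
    err = clBuildProgram(program, 0, NULL, "-cl-mad-enable", NULL, NULL);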

You should put in a memory barrier after your async_work_group_copy calls to make sure all outstanding memory accesses across the work-group are done before your kernel continues. (This shouldn’t matter on current hardware, but may be needed in the future.)

I thought that was what I was doing with the wait_group_events call. Would a memory barrier do something that wait_group_events doesn’t?

The other thing you should do is enable MADs

I’ll give it a go and report back when I have some time.

The wait_group_events waits for the copies to finish, but does not guarantee that the memory operations are finished. You need to make sure that all memory operations are finished before any member of the workgroup begins to use the data. (The memory operations may be in-flight after the copies are done.)
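
In kernel terms that would be a local-memory barrier straight after the wait, something like:

    wait_group_events(2, galaxyEvent);
    barrier(CLK_LOCAL_MEM_FENCE);   // ensure the copied galaxy data is visible to the whole work-group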

The wait_group_events waits for the copies to finish, but does not guarantee that the memory operations are finished

That’s certainly a subtle distinction, but worth being aware of.

Enabling MADs seems to reduce the code a bit, and the register usage (I’ve discovered that reading back the program binary gives you an intermediate representation of the compiled form; there’s a rough sketch of how below), but there’s no measurable difference in speed.

I suspect something else is the bottleneck here.
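
For anyone curious, reading the binary back is roughly this (assuming a single-device program; error handling omitted):

    // Ask for the size of the compiled binary, then fetch the binary itself
    size_t binarySize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);

    unsigned char *binary = malloc(binarySize);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

    FILE *f = fopen("kernel.bin", "wb");
    fwrite(binary, 1, binarySize, f);
    fclose(f);
    free(binary);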

Paul, what is your global size? If it’s big (>10k), you could try rewriting your kernel to have each work-item process X positions and shrink your global size by a factor of X. You want to make sure your global size is >1k, but beyond ~4k you won’t see much benefit (and may even see a small decrease in performance) from much larger sizes. This is because the hardware only supports a certain maximum kernel launch size, and to handle larger sizes the runtime has to break the work up into multiple separate launches, each of which has some overhead. This might be worth experimenting with, because it will let you amortize the get_local_id overhead that OpenCL has vs. GLSL.
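
Something along these lines, with X = 4 just as an example (the galaxy arguments and the inner loop are unchanged, so I’ve left them out):

kernel void particle(global const float4 * starPositions,
                     global const float4 * starVelocities,
                     global float4 * newStarPositions,
                     global float4 * newStarVelocities) {

    const int X = 4;                    // particles handled per work-item (example value)
    int base = get_global_id(0) * X;    // enqueue with global size = particleCount / X

    for (int i = 0; i < X; i++) {
        float4 starPosition = starPositions[base + i];
        float4 starVelocity = starVelocities[base + i];

        // ... the existing galaxy loop and integration go here ...

        newStarPositions[base + i]  = starPosition;
        newStarVelocities[base + i] = starVelocity;
    }
}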

Paul,

You should try using constant memory for galaxyPositions and galaxyMasses. Declare them as

constant float4 *galaxyPositions,
constant float *galaxyMasses.

This will be similar to uniforms in GLSL. If you do this, you no longer need to do the async_work_group_copy and wait_group_events

Another interesting thing to try is to use fast_length instead of length.

Also, do you really need a float4 for positions and velocities, or do you only need 3 of the 4 components?

Pulling this up from the depths, as I’ve been working on other things the last couple of weeks, but wanted to reply to these points.

You should try using constant memory for galaxyPositions and galaxyMasses. Declare them as

constant float4 *galaxyPositions,
constant float *galaxyMasses.

This will be similar to uniforms in GLSL. If you do this, you no longer need to do the async_work_group_copy and wait_group_events

On Snow Leopard at the moment, declaring those parameters as constant buffers causes the compiler to crash. It seems to come down to using a loop counter to index into them. This has been reported to Apple, and I’ve had a response back saying it’s a known issue.

Another interesting thing to try is to use fast_length instead of length.

Just gave that a quick try and it does boost performance a little. Thanks.

Also, do you really need a float4 for positions and velocities, or do you only need 3 of the 4 components?

No, I don’t, but using float4s is faster. You have to jump through hoops to load and store 3-component vectors, and that costs. My code was originally GLSL vec3s, but I moved it to vec4s early on for a fairer comparison. If OpenCL had a 3-component vector type I’d be using it.