Reaction-diffusion: an ideal system for OpenCL?

Hello,

Reaction-diffusion systems are a bit like cellular automata, except that they work on floating point values and use a differential equation to compute the new value. I would have thought this was an ideal system for OpenCL to work on.

Here’s an example of one of our OpenCL kernels:

__kernel void grayscott_compute(
    __global float *U,__global float *V,
    __global float *U2, __global float *V2,
    float k,float F,float D_u,float D_v,float delta_t)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int X = get_global_size(0);
    const int Y = get_global_size(1);
    const int i = x*Y+y;

    const float u = U[i];
    const float v = V[i];
    
    // compute the Laplacians of U and V
    const int xm1 = max(x-1,0);
    const int xp1 = min(x+1,X-1);
    const int ym1 = max(y-1,0);
    const int yp1 = min(y+1,Y-1);
    const int iLeft = xm1*Y + y;
    const int iRight = xp1*Y + y;
    const int iUp = x*Y + ym1;
    const int iDown = x*Y + yp1;
    // Standard 5-point stencil
    const float nabla_u = U[iLeft] + U[iRight] + U[iUp] + U[iDown] - 4*u;
    const float nabla_v = V[iLeft] + V[iRight] + V[iUp] + V[iDown] - 4*v;

    // compute the new rate of change (Gray-Scott)
    const float delta_u = D_u * nabla_u - u*v*v + F*(1.0f-u);
    const float delta_v = D_v * nabla_v + u*v*v - (F+k)*v;

    // apply the change (to the new buffer)
    U2[i] = u + delta_t * delta_u;
    V2[i] = v + delta_t * delta_v;
}

Comparing our OpenCL implementations with CPU versions (e.g. using SSE) we’re finding that OpenCL doesn’t give us the performance expected. Here’s a page comparing the speeds, with links to the different implementations:
http://code.google.com/p/reaction-diffusion/wiki/SpeedComparisons

You can see that we’ve been trying different things, including image2d_t and float4 (these help). We try to keep all the data on the card over many iterations. Trying to use local data manually made things slower (GrayScott_OpenCL_Local). Using NDRange local(8,8) helps a lot, compared to local(1,1).

Is there something obvious we can do to improve our OpenCL code? Were my expectations that OpenCL would work well for reaction-diffusion wrong?

Thanks,

Tim

http://www.sq3.org.uk - tim.hutton@gmail.com

From what I can see, the gist of the app looks like this:


Initialize input buffers with clEnqueueWriteBuffer().

while(true) {
    start_timer();

    for(i=0; i < 2000; ++i) {
        clEnqueueNDRangeKernel()
    }

    event.wait();

    clEnqueueReadBuffer(BLOCKING);
    clEnqueueReadBuffer(BLOCKING);
    clEnqueueReadBuffer(BLOCKING);

    end_timer();

    // display
}

In summary: use a profiling tool to find out where the time is being spent in your application and how varying the parameters of the app affects performance. Everything else in this message is pure guesswork.

Some comments/guesses:

  • The simulation appears to be 256x256 in size. That’s pretty small. Given that you run the simulation 2000 times per frame, the overhead of each NDRange call could be a significant factor in the total run time.

  • Have you considered using clEnqueueMapBuffer() instead of clEnqueueReadBuffer()? You will need to create your buffers with CL_MEM_ALLOC_HOST_PTR enabled.

  • Have you tried running with a different local size? Since you compare this to OpenMP it sounds like you are running both on a CPU. Ideally you would like the work set of a work-group to be about the size of the L1 or L2 cache.

  • The call to event.wait(); is unnecessary.

  • The three blocking calls to read the buffer are hurting the amount of concurrency between host and device. Only the last one needs to be blocking. It probably doesn’t make any difference since you are running on a CPU but if it was running on a GPU it could be noticeable.

Hi David, thanks for looking at this.

Yes, I’ve tried with different sizes (up to 1024x1024) and different numbers of iterations per frame and the figures aren’t much different.

I’ll have a go at it. Am I right that this shouldn’t affect the speed as long as N_FRAMES_PER_DISPLAY is high enough? I’m only calling clEnqueueReadBuffer once per very many kernel iterations.

Yes, I’ve tried changing NDRange local(8,8); to different values (1,2,4,8,12,16,32). I hadn’t tried changing the shape (e.g (1,64) or (32,2)) until just now, and I’ve just discovered that (1,512) gives a 4x speedup over (8,8). Presumably that’s because the consecutive y-values lie next to each other in memory? There’s no benefit to changing the shape in the image2d_t versions.

I’ll ask the others to try different values on their systems.

The OpenCL tests are running on the GPU, not the CPU. I haven’t tried running OpenCL on the CPU yet. Some of the results (like OpenMP) aren’t a straight comparison since they’re running on different hardware but it’s giving us an idea of the best approach to implement reaction-diffusion systems.

My card reports (using CL_DEVICE_LOCAL_MEM_SIZE) 48KB of local memory which I think is the L1 cache (I presume the same suggestion applies to the GPU). How do I work out how big the work set is? It would be nice to pick the right work-group size for the target device at runtime.

CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE are both 1024.

Yes, true, since enqueueReadBuffer is blocking. I’ve removed it. (I’ve also simplified the code since we didn’t need the color bits for this test.)

But these calls only happen after many iterations of the kernel, so they won’t affect the overall speed as long as N_FRAMES_PER_DISPLAY is high enough, right? And get the same sort of speed with more frames per display.

We’re timing the kernel execution already, and trying different parameters, so is there anything left to profile? There’s only four values available from clGetEventProfilingInfo, and we’re effectively measuring START and END already. Is it worth looking at QUEUED and SUBMIT?

Thanks,

Tim

We’re timing the kernel execution already, and trying different parameters, so is there anything left to profile?

Hardware vendors typically provide a profiler in their SDK that gives a lot more information than calling clGetEventProfilingInfo().

You right, this is a perfect problem for GPU, but you do have to write it for one.

Using a local size of (1,1) is like flushing $63 of every $64 you paid for down the toilet! Current GPU devices execute 64 threads concurrently on the same problem, you’re effectively forcing it to use only 1 of those and leave the others idle. So it’s not surprising that first result was so poor.

Since you’re accessing arrays you want to access them in a device friendly manner, which means simply: each thread reads consecutive memory addresses at the same time.

For completely parallel array algorithms, for any current gpu you probably want to start with a local work size of (64*m,n) where n+m <=4 (on most devices). e.g. look up ‘coalesced reads’ in any of the vendor doco. Then n, m can be played with to see if some combination works better for a given device/problem.

As with the image algorithm you could use float2 to store U+V together and halve the memory reads ops.

You want to use x=64,y=1 too, not x=1,y=64 as the work items are allocated to the work-group in x first, and you want the memory accesses to be in the same order. i.e. always process rows not columns.

Given you don’t do many reads then local store might not help, but to give it a better chance you might need a 2d work-group size. e.g. use 64x4, then you can get 256 results from 396 (66x6) memory reads rather than 64x1 results from 66x3. i.e. 1.5 global memory reads per result rather than 3, or 5 (with no LS code). It’s a bit more complicated than that, but that gives the idea. The edge cases make it a bit messy for one.

I’m not sure how your local version is even working since it only reads 1 column of data and then access x+1 and x-1 indices.

For images 16x16 works well for me for simple problems like this, and LS probably wont help here. If you’re only using 2-element data use a 2-element data-type - i.e. don’t use float4 and RGBA, but use RG and float2. If the device supports it. I found nvidia image writes rather slow, so you will probably fare better with an array version.

Also saw this comment:
#define LOCAL_X 1
#define LOCAL_Y 256
// annoyingly, we have to keep these manually synced with the size of the local group in the host”

Its easy: just pass “-DLOCAL_X=blah” etc arguments the compiler when you invoke clBuildProgram

It’s good to hear that this is a suitable GPU problem, I had been coming to the conclusion that there wasn’t enough processing per work item.

In GrayScott_2x2 I’m using float4 to retrieve a 2x2 block of values. But with the latest version this is actually slower on my machine. I’ll try your suggestion.

I tried that but I get horrible performance. I suspect the reason is that our arrays are allocated as float a[Y]; I’ll try it the other way round to see if that helps.

The programs are giving the correct output. With local(1,256) the GrayScott_OpenCL_Local implementation looks pretty odd, I agree. It must be falling back on the global memory access path, hence why it is slower than GrayScott_OpenCL.

Oh yes. We’ve actually worked that out elsewhere in the code but haven’t updated that particular bit. I’m not sure why using get_local_size() doesn’t work.