Small matrix operations

After googling a bit, it appears that open-source implementations of small matrix operations for OpenCL are not easy to come by.

I frequently need such functionality, so I have started with 3-by-3 matrix equations. I've tested pivoted Gaussian elimination with back substitution


bool matrixSolve3x3f(float16 *M, float4 *x)
{
    // find first pivot
    if(M->s4 > M->s0 && M->s4 > M->s8)
    {
        float4 t = {M->s0, M->s1, M->s2, x->x};
        M->s0 = M->s4; M->s1 = M->s5; M->s2 = M->s6; x->x = x->y;
        M->s4 = t.x; M->s5 = t.y; M->s6 = t.z; x->y = t.w;
    }
    else if(M->s8 > M->s0)
    {
        float4 t = {M->s0, M->s1, M->s2, x->x};
        M->s0 = M->s8; M->s1 = M->s9; M->s2 = M->sa; x->x = x->z;
        M->s8 = t.x; M->s9 = t.y; M->sa = t.z; x->z = t.w;
    }

    if(fabs(M->s0) < FLOAT_EPSILON)
        return false;

    // eliminate
    float u = M->s4/M->s0;
    M->s5 -= M->s1*u; M->s6 -= M->s2*u; x->y -= x->x*u;

    u = M->s8/M->s0;
    M->s9 -= M->s1*u; M->sa -= M->s2*u; x->z -= x->x*u;

    // find second pivot
    if(M->s5 < M->s9)
    {
        float4 t = {M->s5, M->s6, x->y, 0.0f};
        M->s5 = M->s9; M->s6 = M->sa; x->y = x->z;
        M->s9 = t.x; M->sa = t.y; x->z = t.z;
    }

    if(fabs(M->s5) < FLOAT_EPSILON)
        return false;

    // eliminate
    u = M->s9/M->s5;
    M->sa -= M->s6*u; x->z -= x->y*u;

    if(fabs(M->sa) < FLOAT_EPSILON)
        return false;

    // back substitution
    x->z /= M->sa;
    x->y = (x->y - M->s6*x->z)/M->s5;
    x->x = (x->x - M->s1*x->y - M->s2*x->z)/M->s0;

    return true;
}

and Cramer's rule


bool matrixSolve3x3fKramer(const float16 *M, float4 *x)
{
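    // determinant of M by cofactor expansion along the first row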
    float D = M->s0*(M->s5*M->sa - M->s6*M->s9)
            - M->s1*(M->s4*M->sa - M->s6*M->s8)
            + M->s2*(M->s4*M->s9 - M->s5*M->s8);

    if(fabs(D) < FLOAT_EPSILON)
        return false;

    float D0 = x->s0*(M->s5*M->sa - M->s6*M->s9)
             - M->s1*(x->s1*M->sa - M->s6*x->s2)
             + M->s2*(x->s1*M->s9 - M->s5*x->s2);

    float D1 = M->s0*(x->s1*M->sa - M->s6*x->s2)
             - x->s0*(M->s4*M->sa - M->s6*M->s8)
             + M->s2*(M->s4*x->s2 - x->s1*M->s8);

    float D2 = M->s0*(M->s5*x->s2 - x->s1*M->s9)
             - M->s1*(M->s4*x->s2 - x->s1*M->s8)
             + x->s0*(M->s4*M->s9 - M->s5*M->s8);

    x->s0 = D0/D;
    x->s1 = D1/D;
    x->s2 = D2/D;

    return true;
}

I set the tolerance FLOAT_EPSILON to approximately 2^-32, and matrix multiplication was implemented as follows


float4 matrixMult3x3f(const float16 *M, const float4 *v)
{
    return (float4)(M->s0*v->x + M->s1*v->y + M->s2*v->z,
                    M->s4*v->x + M->s5*v->y + M->s6*v->z,
                    M->s8*v->x + M->s9*v->y + M->sa*v->z,
                    0.0f);
}
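
The tolerance itself is just a define, roughly along these lines (2^-32 is about 2.3e-10; the exact value is a matter of taste):


// pivot/determinant tolerance, approx 2^-32 (assumed definition, exact value to taste)
#define FLOAT_EPSILON 2.3283064e-10f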

To test the accuracy I used the following kernel (with random matrices)


__kernel void vecMathTest(__global const float16 *A, __global const float4 *b, __global const int *n, __global float *r)
{
    int i = get_global_id(0);

    if(i < *n)
    {
        float16 M = A[i];
        float4 x = b[i];
        matrixSolve3x3f(&M, &x);

        float16 T = A[i];

        float4 R = matrixMult3x3f(&T, &x) - b[i];
        r[i] = length(R)/length(b[i]);
    }
}

Cramer's rule appears to produce more accurate results, but I expect Gaussian elimination to be faster as it has fewer floating-point operations. I haven't gotten around to timing it yet, though.
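
When I do, something along these lines on the host side should do the job (a rough sketch using OpenCL profiling events; context, device, kernel and globalSize are assumed to exist already, and error checking is omitted):


// enable profiling on the command queue so kernel events carry timestamps
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

// event timestamps are in nanoseconds
cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
double ms = (end - start)*1e-6;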

Now I'm quite new to OpenCL, so I'd be interested in any optimization tips or improvements, as well as thoughts on whether to go for Gauss or Cramer. I'm leaning towards Cramer, but although it is more accurate for random matrices, it is not a given that it will be so in all cases.

Also, if anyone notices any bugs, I'd appreciate it if you'd point them out.

Some ideas:

o (Divergent) branches are slow, so that Gaussian elimination implementation will necessarily be slow. At best, branches are implemented by all threads executing all cases and just ignoring the results in the locally untaken branches. At worst, they're implemented serially.

For this problem you probably will not get it very fast, although for larger problems the idea is to parameterise the code so that it doesn't need to branch: the decisions are instead stored in indices and masks. This can either lead to branchless code or at least code with reduced control flow (see the sketch after this list).

Even something as innocuous as “if(M->s4 > M->s0 && M->s4 > M->s8)” has to be implemented as two nested ifs in order to honour the C short-circuit rules for &&.

Use non-short-circuit logic like this instead:

if ( (M->s4 > M->s0) & (M->s4 > M->s8) )

o Float multiply-and-add is really fast. I don't think you'd be able to beat Cramer's rule for this problem, and if you're going for speed you're probably wasting your time trying anything else.

o Global memory access is slow, and much worse if it isn't coalesced. Reading a float16 looks good on paper, but it will be a fairly inefficient access pattern. Assuming your typical device will only load a float4 at a time, it is translated into 4x float4 accesses with a per-thread stride of 4.

One can either re-arrange the data in global memory to be better - but that's usually impractical - or use local memory to re-arrange the data. All threads in the workgroup load consecutive float4's at the same time, and then write them out to local mem in a way that can be accessed efficiently later. Local mem has its own caveats though - e.g. threads working with consecutive float4s will cause bank conflicts.

For small problems like this though, it may not be worth it since there are overheads in implementing the software cache too.
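
To illustrate the mask idea from the first point: a conditional swap can be written with select() instead of an if. A minimal sketch, not your code, just the general shape:


// branchless conditional swap of two private floats;
// select(a, b, c) returns b where c is non-zero, so every thread executes the same instructions
void cswap(float *a, float *b, const int cond)
{
    float t = select(*a, *b, cond);   // cond ? *b : *a
    *b = select(*b, *a, cond);        // cond ? *a : *b
    *a = t;
}

The pivot search in your solver could then be expressed as integer conditions feeding a few of these swaps (one per row element plus the right-hand side), with no divergent control flow at all.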

Interesting, thanks for the feedback. It seems quite difficult to avoid branching at times. For example, right now I'm writing a function to test for intersecting tetrahedrons. Getting rid of conditionals here is a challenge, to say the least.

Well that’s the challenge. Sometimes you need different algorithms and sometimes you can’t avoid it - the higher memory bandwidth and higher parallelism can usually be used to make the code run acceptably, if not amazingly.

Do you have a code example of that particular problem?

Actually, it was not too hard to figure out a way


// One-dimensional intersection with the open interval (0,1)
bool sect1d(const float a, const float b, const float c, const float d)
{
    bool greater = (a >= 1.0f) & (b >= 1.0f) & (c >= 1.0f) & (d >= 1.0f);
    bool lower = (a <= 0.0f) & (b <= 0.0f) & (c <= 0.0f) & (d <= 0.0f);

    return !(greater | lower);
}

// evaluate if two tetrahedrons intersect without branching
bool intersection(const float4 A0, const float4 B0, const float4 C0, const float4 D0,
                         const float4 A1, const float4 B1, const float4 C1, const float4 D1)
{
    float16 M;
    M.s048c = B0 - A0;
    M.s159d = C0 - A0;
    M.s26ae = D0 - A0;

    float4 V[4] = {A1 - A0, B1 - A0, C1 - A0, D1 - A0};

    matNVecSolve9f(&M, V, 4);

    return sect1d(V[0].x, V[1].x, V[2].x, V[3].x)
         & sect1d(V[0].y, V[1].y, V[2].y, V[3].y)
         & sect1d(V[0].x + V[0].y + V[0].z,
                  V[1].x + V[1].y + V[1].z,
                  V[2].x + V[2].y + V[2].z,
                  V[3].x + V[3].y + V[3].z);
}

Basically it evaluates all possible cases, even when the result might already be determined.

Still room for some optimizations though I think.
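
For reference, matNVecSolve9f just solves the same 3-by-3 system for several right-hand sides; it is not shown above, but a Cramer-style version of it could look roughly like this (sketch only, reusing the cofactors of M across all right-hand sides):


// hypothetical multi-RHS solver in the spirit of matrixSolve3x3fKramer;
// solves M*x = v[i] in place for i = 0..n-1, sharing the determinant of M
bool matNVecSolve3x3fSketch(const float16 *M, float4 *v, const int n)
{
    // cofactors of the first row of M, reused for every right-hand side
    float c00 = M->s5*M->sa - M->s6*M->s9;
    float c01 = M->s4*M->sa - M->s6*M->s8;
    float c02 = M->s4*M->s9 - M->s5*M->s8;

    float D = M->s0*c00 - M->s1*c01 + M->s2*c02;
    if(fabs(D) < FLOAT_EPSILON)
        return false;

    for(int i = 0; i < n; ++i)
    {
        float4 b = v[i];
        float D0 = b.s0*c00 - M->s1*(b.s1*M->sa - M->s6*b.s2) + M->s2*(b.s1*M->s9 - M->s5*b.s2);
        float D1 = M->s0*(b.s1*M->sa - M->s6*b.s2) - b.s0*c01 + M->s2*(M->s4*b.s2 - b.s1*M->s8);
        float D2 = M->s0*(M->s5*b.s2 - b.s1*M->s9) - M->s1*(M->s4*b.s2 - b.s1*M->s8) + b.s0*c02;

        v[i].s0 = D0/D;
        v[i].s1 = D1/D;
        v[i].s2 = D2/D;
    }

    return true;
}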

Well, small matrix operations are a bit of a borderline case if you ask me. Anyway, nice to see you come up with an idea which seems to thwart branching in an easier manner. I think once you manage to get clear of conditionals, you are rid of most of the issues.