OpenGL ES fixed point matrix transform on ARM11 CPU

dear all:
does anyone know how to implement an EFFICIENT fixed-point matrix transform and vertex transform using the ARM11 instruction set?

You do it in the same way as with most other CPUs :slight_smile:

Off the top of my head, I’d say that your compiler isn’t smart enough to optimize the 32×32 → 64-bit multiply into the SMULL or SMLAL instruction. Most compilers call a library routine for that. If so, you might have to write the relevant parts in assembler.
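If it does turn out that your compiler calls a helper for the 64-bit multiply-accumulate, here is a hedged sketch of forcing SMLAL through GCC inline assembly (the helper name mla_64 is just an example of mine, not from any library; check your compiler’s output first, since recent GCCs often emit SMULL/SMLAL for (long long)a * b on their own):

/* multiply-accumulate sketch: returns acc + (long long)a * b using one SMLAL. */
static inline long long mla_64 (long long acc, int a, int b)
{
  unsigned int lo = (unsigned int) acc;
  int          hi = (int) (acc >> 32);
  /* SMLAL RdLo, RdHi, Rm, Rs:  {RdHi:RdLo} += Rm * Rs (signed) */
  __asm__ ("smlal %0, %1, %2, %3" : "+r"(lo), "+r"(hi) : "r"(a), "r"(b));
  return ((long long) hi << 32) | lo;
}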

Another thing that tends to degrade the performance is pointer aliasing. If your compiler supports the C99 restrict keyword, use it! It can make a huge difference.
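A minimal illustration of the aliasing problem (my own example, not from the thread):

/* Without restrict the compiler must assume dst might overlap mat or v,
   so it reloads them from memory after every store to dst[i].
   With restrict it may keep them in registers for the whole loop. */
void xform_may_alias (int *dst, const int *mat, const int *v)
{
  int i;
  for (i = 0; i < 4; i++)
    dst[i] = mat[4*i+0]*v[0] + mat[4*i+1]*v[1] + mat[4*i+2]*v[2] + mat[4*i+3]*v[3];
}

void xform_no_alias (int * __restrict__ dst, const int * __restrict__ mat, const int * __restrict__ v)
{
  int i;
  for (i = 0; i < 4; i++)
    dst[i] = mat[4*i+0]*v[0] + mat[4*i+1]*v[1] + mat[4*i+2]*v[2] + mat[4*i+3]*v[3];
}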

I suggest that you just show us the relevant part of the vector*matrix transform along with the compiler output (assembler output). I’m pretty sure I can tell you how to hint the compiler to generate better code after I’ve seen what exactly goes wrong.

Btw, do you really have a bottleneck in the matrix*matrix transform, or are you guessing? Even if it’s horribly slow, it shouldn’t show up in the profile; you shouldn’t need to call that operation very often anyway.

Regards,
Nils Pipenbrinck

If you compile with GCC, try these routines. They generate pretty good code. The matrix-multiply and transform might be the wrong way around (e.g. rows/columns exchanged), but that’s an easy fix.



/* 
 Note: this code is written with ARM in mind, and might not run 
 fast unless you have a clever compiler and the SMLAL instruction. 
*/

typedef int i32;
typedef long long i64;

#define RESTRICT  __restrict__
#define INLINE    static inline


/* 32x32 -> 64-bit signed multiply of two 16.16 values; the product is a
   32.32 fixed-point number.  On ARM this should map to a single SMULL. */
INLINE i64 mul32x32(i32 a, i32 b)
{
  return (i64)a * b;
}


/* Narrow a 32.32 product back to 16.16: arithmetic shift right by 16,
   then truncate to 32 bits. */
INLINE i32 sar64_32 (i64 a)
{
  return (i32)(a >> 16);
}


/* 4x4 matrix concatenation in 16.16 fixed point: a_dest = a_left * a_right.
   Each element is a sum of four 64-bit products, narrowed once at the end. */
void MatrixMultiply4 (i32 * RESTRICT a_dest, const i32 * RESTRICT a_left, const i32 * RESTRICT a_right)
{
  const i32 * RESTRICT l = a_left;
  const i32 * RESTRICT r = a_right; 
  i32 i;

  for (i=0; i<16; i+=4)
  {
    const i32 x = l[i+0];
    const i32 y = l[i+1];
    const i32 z = l[i+2];
    const i32 w = l[i+3];
    a_dest[i+0] = sar64_32(mul32x32(x, r[ 0]) + mul32x32(y, r[ 4]) + mul32x32(z, r[ 8]) + mul32x32(w, r[12]));
    a_dest[i+1] = sar64_32(mul32x32(x, r[ 1]) + mul32x32(y, r[ 5]) + mul32x32(z, r[ 9]) + mul32x32(w, r[13]));
    a_dest[i+2] = sar64_32(mul32x32(x, r[ 2]) + mul32x32(y, r[ 6]) + mul32x32(z, r[10]) + mul32x32(w, r[14]));
    a_dest[i+3] = sar64_32(mul32x32(x, r[ 3]) + mul32x32(y, r[ 7]) + mul32x32(z, r[11]) + mul32x32(w, r[15]));
  }
}

/* Transform a_count 4-component 16.16 vertices by a 4x4 16.16 matrix.
   The prefetch pulls in the vertex two iterations ahead. */
void Transform_xyzw (i32 * RESTRICT a_dest, const i32 * RESTRICT a_src, const i32 * RESTRICT a_mat, const i32 a_count)
{
  i32 i;
  for (i=0; i<a_count; i++)
  {
    const i32 x = a_src[0];
    const i32 y = a_src[1];
    const i32 z = a_src[2];
    const i32 w = a_src[3];
    __builtin_prefetch (a_src+8, 0, 1);
    a_dest[0] = sar64_32(mul32x32(x, a_mat[ 0]) + mul32x32(y, a_mat[ 4]) + mul32x32(z, a_mat[ 8]) + mul32x32(w, a_mat[12]));
    a_dest[1] = sar64_32(mul32x32(x, a_mat[ 1]) + mul32x32(y, a_mat[ 5]) + mul32x32(z, a_mat[ 9]) + mul32x32(w, a_mat[13]));
    a_dest[2] = sar64_32(mul32x32(x, a_mat[ 2]) + mul32x32(y, a_mat[ 6]) + mul32x32(z, a_mat[10]) + mul32x32(w, a_mat[14]));
    a_dest[3] = sar64_32(mul32x32(x, a_mat[ 3]) + mul32x32(y, a_mat[ 7]) + mul32x32(z, a_mat[11]) + mul32x32(w, a_mat[15]));
    a_src  += 4;
    a_dest += 4;
  }
}
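
For reference, a small usage sketch (mine, not part of the original post) showing how the two routines are meant to be combined; remember that 1.0 in 16.16 is 0x10000, and as noted above the multiplication order / row-column convention may need swapping for your pipeline:

#define NUM_VERTS 128   /* example batch size */

i32 modelview[16], projection[16], mvp[16];
i32 verts_in[4*NUM_VERTS], verts_out[4*NUM_VERTS];

/* concatenate the matrices once ... */
MatrixMultiply4 (mvp, modelview, projection);
/* ... then spend the per-vertex work only on the combined matrix */
Transform_xyzw (verts_out, verts_in, mvp, NUM_VERTS);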


dear Nils

ya, you are right, the greatest bottleneck in my profiling is rasterization.
but i have fine-tuned the rasterization stage as much as i can, and no. 2 is the matrix transform and the vertex (normal) transform.

thanks for your suggestion, i will try this.

dear all
i have finished an OpenGL ES software renderer, but
i have a problem when running Jbenchmark (GLbenchmark may show it too):

http://www.jbenchmark.com/tools.jsp?benchmark=3d

if i use only fixed-point matrix transforms and vertex transforms, the vertex coordinates after the model-view transform do not look right. but if i use a floating-point matrix transform with a fixed-point vertex transform (converting the floating-point modelview matrix to fixed point), the space soldier looks fine (but performance suffers on the ARM11). it seems the precision of the fixed-point numbers is not enough. can anyone give me some suggestions on how to solve this problem?
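
One hedged thing to try (my own suggestion, not something confirmed in the thread): the >>16 in sar64_32 truncates toward minus infinity, so each term of a dot product can lose up to one LSB, and the error compounds when the modelview stack is concatenated entirely in fixed point. Rounding to nearest before narrowing is nearly free and may recover some of the lost precision, though it may not be enough on its own:

/* round-to-nearest variant of sar64_32 (sketch): add half an LSB of the
   result before shifting, so the 32.32 product is rounded, not truncated,
   when narrowed back to 16.16. */
INLINE i32 sar64_32_round (i64 a)
{
  return (i32)((a + 0x8000) >> 16);
}

If that is still not enough, the usual fallback is to keep the matrix concatenation at higher precision and do only the per-vertex transform in 16.16, which is essentially what your float-matrix path already does.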

hi dear all:
i forgot to post some examples:

fixed-point matrix:
0xffffffff 0 0 0xfffffd30
0 0 0 0xffffd6bb
0xffffffff 0 0xffffffff 0xffffdd5e
0 0 0 0x0001000

fixed-point matrix (converted from floating point):

0xffffff94 0 0x1f4 0xfffffd33
0 0x200 0 0xffffd6bd
0xfffffe0d 0 0xffffff94 0xffffdd61
0 0 0 0x0001000

for vertex (x, y, z, w):
( 0xecb30000, 0x4d260000, 0x13130000, 0x00010000 )

after transform by the fixed-point matrix:
(0x0000107d, 0xffffd66b, 0xffffdd98, 0x00010000)

after transform by the floating-point matrix:
(0x000013e5, 0x000023e3, 0xffffec2a, 0x0001000)

note that the y channels differ: one value is positive and the other is negative.
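
(Reading those y values as signed 16.16: the all-fixed-point result 0xffffd66b is about -0.162, while the float-derived result 0x000023e3 is about +0.140. Both are small in magnitude, but they sit on opposite sides of zero, which is enough to push the vertex visibly to the wrong place. Assuming the two matrices are printed in the same layout, the second row also hints at where the precision goes: the float-converted matrix has a 0x200 (about 0.0078) entry that has collapsed to 0 in the all-fixed-point matrix, so that contribution of the input y is lost entirely.)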

dear all:
i think i can try the long long type (64 bits).

any suggestions?
