Wrong precision in multiplication results

Hello.

I wrote a simple kernel that multiplies a 3x3 matrix by vectors of the form (x, y, 1).

The kernel works fine if I fill the matrix with simple values, like

[1.0 2.0 3.0]
[4.0 5.0 6.0]
[7.0 8.0 9.0]

The same computation also gives correct results when I run it on the CPU.

However, when I set the matrix values as shown below, I get wrong results.

[0.000000 0.109586 1068.300049]
[41760.031250 0.438342 2670.750000]
[83520.062500 0.767098 4273.200195]

For example,

For a vector: (15, 0, 1)
GPU: 1068.300049, 629071.250000, 1257074.125000
True value: 1068.300049, 629071.250000, 1257074.250000
Diff: 0.000000, 0.000000, -0.125000

For: (124, 0, 1)
GPU: 1068.300049, 5180914.500000, 10360761.000000
True value: 1068.300049, 5180915.000000, 10360761.000000
Diff: 0.000000, -0.500000, 0.000000 --> a large error.

The errors are not consistent and seem unpredictable.

Does anybody know why this happens? Please give me a clue.

I have attached my kernel below.



struct my_vec4 {
    float x;
    float y;
    float z;
    float w;
};
typedef struct my_vec4 MyVec4;
//----------------------------------------------------

__kernel void compute_ep_lines(
    __global MyVec4 *g_dst,
    __constant float *c_fmat,   // 3x3 matrix, row-major
    int N)
{
    // use the global IDs as the vector (x, y, 1)
    int x = get_global_id(0);
    int y = get_global_id(1);

    int index = y * N + x;

    float e1 = c_fmat[0] * (float)x + c_fmat[1] * (float)y + c_fmat[2];
    float e2 = c_fmat[3] * (float)x + c_fmat[4] * (float)y + c_fmat[5];
    float e3 = c_fmat[6] * (float)x + c_fmat[7] * (float)y + c_fmat[8];

    // assign the result to global memory (w is left unwritten)
    g_dst[index].x = e1;
    g_dst[index].y = e2;
    g_dst[index].z = e3;
}
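
For comparison, the CPU check I use looks roughly like this (a sketch only; the names and the extra height parameter are illustrative, my actual host code differs slightly):

// Host-side reference: same row-major 3x3 matrix, same arithmetic as the kernel.
// "height" is a hypothetical parameter for the number of rows of work-items.
void compute_ep_lines_cpu(MyVec4 *dst, const float *fmat, int N, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < N; ++x) {
            int index = y * N + x;
            dst[index].x = fmat[0] * (float)x + fmat[1] * (float)y + fmat[2];
            dst[index].y = fmat[3] * (float)x + fmat[4] * (float)y + fmat[5];
            dst[index].z = fmat[6] * (float)x + fmat[7] * (float)y + fmat[8];
        }
    }
}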

That looks about right for 32-bit floating-point precision. Single-precision floats have about 24 bits of mantissa, which is around 7 decimal digits. Are you comparing against 64-bit doubles on the CPU or against 32-bit floats? Remember that there is no single exact answer: floating-point results (both single and double precision) depend on the order of execution and the magnitudes of the operands. (For example, adding two small numbers and then multiplying by a big number will produce a different result than multiplying a big number by a small one and then adding a small number.)
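
To make the order-of-execution point concrete, here is a tiny standalone C example (the constants are made up for illustration, not taken from your matrix):

#include <stdio.h>

int main(void)
{
    float big = 100000000.0f;   /* spacing between adjacent floats at this magnitude is 8.0 */
    float one = 1.0f;

    float left  = (big + one) - big;   /* the 1.0 is lost in the first addition */
    float right = (big - big) + one;   /* the 1.0 survives */

    printf("left  = %f\n", left);      /* prints 0.000000 */
    printf("right = %f\n", right);     /* prints 1.000000 */
    return 0;
}

Both expressions are mathematically the same sum, but because each operation rounds to the nearest representable float, the two orderings give different answers.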

I used 32-bit floats on both the GPU and CPU sides.

Remember that there is no single exact answer: floating-point results (both single and double precision) depend on the order of execution and the magnitudes of the operands. (For example, adding two small numbers and then multiplying by a big number will produce a different result than multiplying a big number by a small one and then adding a small number.)

I agree with you.

But some of the results have quite large errors, like 0.5, which is significant for my problem.
Even if the floating-point results are not exactly identical, shouldn't the differences be very small (like 0.00005)?

The size of the error depends on the magnitude of the result.

An error of about 0.000005 would be expected if your result is around 1.0.
If your result is around 1,000,000.0, then you'd expect an error of about 0.5.
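
You can see this directly by printing the spacing between adjacent representable floats (one "ulp") at different magnitudes; this is a small standalone sketch, not part of the original code:

#include <stdio.h>
#include <math.h>

/* distance from x to the next representable float above it */
static float ulp_of(float x)
{
    return nextafterf(x, INFINITY) - x;
}

int main(void)
{
    printf("ulp near 1.0        : %g\n", ulp_of(1.0f));        /* ~1.19e-07 */
    printf("ulp near 1000.0     : %g\n", ulp_of(1000.0f));     /* ~6.1e-05  */
    printf("ulp near 1000000.0  : %g\n", ulp_of(1000000.0f));  /* 0.0625    */
    printf("ulp near 10000000.0 : %g\n", ulp_of(10000000.0f)); /* 1.0       */
    return 0;
}

Near your results of a few million to ten million, the spacing between representable values is already 0.0625 to 1.0, so a difference of 0.5 is only a few ulps.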

The results you get from OpenCL 1.0 may also differ from what you get on the CPU, for two reasons. First, if the compiler reorders the math you can get different results; this can happen if you build the kernel with the -cl-fast-relaxed-math option. Second, the / operator does not have to be correctly rounded in OpenCL 1.0, so if you are doing division you may get different results depending on the hardware and the values.

Thanks, dbs2.
I will look at the values and try to scale or normalize them if possible.

Make sure that -cl-fast-relaxed-math or -cl-mad-enable is not being passed to clBuildProgram.
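
In host code that means calling clBuildProgram with an empty (or NULL) options string, for example (a sketch; creation of the program and device objects is assumed to happen elsewhere):

#include <CL/cl.h>

/* Build with no compiler options, so neither -cl-fast-relaxed-math nor
 * -cl-mad-enable is in effect. */
cl_int build_strict(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device, NULL, NULL, NULL);
}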

Multiplies and adds are required to be correctly rounded, so you should see the same results on the CPU and GPU, provided no denormals appear in the input or are generated.
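
The reason -cl-mad-enable matters is that it allows a * b + c to be replaced by mad(), whose rounding is implementation-defined; a fused multiply-add rounds once, while a separate multiply and add round twice. Here is a small C sketch of that effect, using fmaf() from <math.h> to stand in for a fused mad (the constants are chosen purely for illustration):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float x = 1.0f + 1.0f / 4096.0f;      /* exactly representable */

    float separate = x * x - 1.0f;        /* rounds after the * and after the - */
    float fused    = fmaf(x, x, -1.0f);   /* rounds only once, at the end */

    printf("separate = %.10f\n", separate);
    printf("fused    = %.10f\n", fused);
    printf("diff     = %g\n", separate - fused);  /* nonzero: about -6e-08 here */
    return 0;
}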