Hi all, I’m trying to find out whether loop-invariant code hoisting is implemented at the compiler level for OpenCL code. I ran a simple test, and it seems that hoisting is not performed on the Nvidia and AMD platforms. Since loop-invariant code is a very common scenario and hoisting is a simple concept, I would like to know whether anyone has a definitive answer on the current state of code hoisting in the Nvidia, AMD and Intel compilers. If it does exist, how do we turn it on?
Here is a minimal example:
//Just some complicated operations on the variable random_private
#define zero random_private[0]*random_private[1]*random_private[2]*random_private[3]*random_private[4]*random_private[5]*random_private[6]*random_private[7]*random_private[8]*random_private[9]
#define zero1 powr((double)zero*zero+zero,10.0)
#define zero2 (zero1/(zero1+1))
#define zero3 (zero2+zero2*zero2)
//Test whether code hoisting exists
//C=A-B+something
kernel void matrix_add1(global double *A, global double *B, global double *C, global uint *random) {
    uint rowNum=10000;
    uint colNum=100;
    //Copy random into private memory so that hoisting is valid (otherwise another thread could modify random while the loop is executing, and hoisting would give an incorrect result)
    uint random_private[10]={random[0],random[1],random[2],random[3],random[4],random[5],random[6],random[7],random[8],random[9]};
    for(uint j=0;j<colNum;j++){
        for(uint i=0;i<rowNum;i++){
            //zero3 is a macro that performs a very expensive operation on random_private
            C[i+j*rowNum]=A[i+j*rowNum]-B[i+j*rowNum]+zero3;
        }
    }
}
//Manually hoist the loop-invariant code
kernel void matrix_add2(global double *A, global double *B, global double *C, global uint *random) {
    uint rowNum=10000;
    uint colNum=100;
    uint random_private[10]={random[0],random[1],random[2],random[3],random[4],random[5],random[6],random[7],random[8],random[9]};
    //Compute the loop-invariant value once (kept as double so it matches the type of zero3 in matrix_add1)
    double tmp=zero3;
    for(uint j=0;j<colNum;j++){
        for(uint i=0;i<rowNum;i++){
            C[i+j*rowNum]=A[i+j*rowNum]-B[i+j*rowNum]+tmp;
        }
    }
}
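To make the intended equivalence concrete, here is a minimal host-side sketch in plain C (not OpenCL) of the same two variants. It is only an illustration under assumptions: the sizes are shrunk, `invariant_term` is a hypothetical stand-in for the `zero3` macro (it uses repeated multiplication instead of `powr`), and all function names are made up for this sketch. It checks that the hoisted version writes the same values as the in-loop version when the hoisted value is kept as a double.

```c
#include <stddef.h>

/* Shrunk problem size; hypothetical values chosen only for this sketch. */
enum { ROWS = 100, COLS = 10, N = ROWS * COLS };

/* Stand-in for zero3: an expensive value that depends only on r,
   so it is invariant across both loops. */
static double invariant_term(const unsigned r[10]) {
    double prod = 1.0;
    for (int k = 0; k < 10; k++) prod *= r[k];
    double base = prod * prod + prod;
    double z1 = 1.0;
    for (int k = 0; k < 10; k++) z1 *= base;   /* base to the 10th power */
    double z2 = z1 / (z1 + 1.0);
    return z2 + z2 * z2;
}

/* Like matrix_add1: the invariant term is recomputed on every iteration. */
static void add_in_loop(const double *A, const double *B, double *C,
                        const unsigned r[10]) {
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            C[i + j * ROWS] = A[i + j * ROWS] - B[i + j * ROWS] + invariant_term(r);
}

/* Like matrix_add2: the invariant term is computed once, as a double. */
static void add_hoisted(const double *A, const double *B, double *C,
                        const unsigned r[10]) {
    const double tmp = invariant_term(r);
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            C[i + j * ROWS] = A[i + j * ROWS] - B[i + j * ROWS] + tmp;
}

/* Runs both variants on the same inputs and counts differing elements. */
int count_mismatches(void) {
    const unsigned r[10] = {1, 2, 1, 3, 1, 2, 1, 1, 2, 1};
    static double A[N], B[N], C1[N], C2[N];
    for (size_t k = 0; k < N; k++) {
        A[k] = 0.5 * (double)k;
        B[k] = 0.25 * (double)k;
    }
    add_in_loop(A, B, C1, r);
    add_hoisted(A, B, C2, r);
    int mismatches = 0;
    for (size_t k = 0; k < N; k++)
        if (C1[k] != C2[k]) mismatches++;
    return mismatches;
}
```

Calling `count_mismatches()` from a small `main` should report no differing elements, which is the property a compiler would rely on to hoist the computation automatically.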
The example was run 20 times with just one thread. Here are the results on my computer:
Nvidia 1070:
matrix_add1: 28.46 sec
matrix_add2: 4.3 sec
AMD 1600X:
matrix_add1: 5.78 sec
matrix_add2: 0.16 sec
The function matrix_add1 is much slower than matrix_add2. Is there anything wrong with matrix_add1, or does the hoisting simply not exist? Is there any third-party compiler that can do the code hoisting and generate the intermediate code? Thanks