Code 10x faster on CPU device than GPU device

Hi all,

I’m using a 1D cellular automaton (CA) to generate random unsigned ints in OpenCL. The CA is an array of uchars, and it is evolved in iterations such that ca[i] depends on ca[i] and ca[i-1] from the previous iteration. Each work-item evolves 4 uchars in a row and writes them (combined into one uint) to our output array after every iteration.
(alg based on http://home.southernct.edu/~pasqualonia1/ca/report.html).
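
To make the update rule concrete, one generation in plain scalar C looks roughly like this (an illustrative sketch only; the names and the uchar rule table are placeholders, and cell states are assumed small enough that the sum of two cells always indexes into the table):

/* One CA generation: next[i] depends on cur[i] and cur[i-1] via a rule lookup. */
#include <stddef.h>

typedef unsigned char uchar;

void ca_step(const uchar *cur, uchar *next, const uchar *rules, size_t lenCA)
{
    for (size_t i = 0; i < lenCA; i++) {
        uchar left = (i > 0) ? cur[i - 1] : 0;   /* leftmost cell has no left neighbour */
        next[i] = rules[left + cur[i]];          /* rule table indexed by the sum of the two states */
    }
    /* every 4 consecutive uchars of next are then reinterpreted as one uint
       and appended to the stream of output random numbers */
}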

The point of all this is that a relatively small (but necessary) change makes my GPU code an order of magnitude slower but has no real effect on my CPU code. Note also that the CPU is doing this MUCH faster than the GPU. I’ve tried values from 1 to 512 for local_size(0).

In the inner loop of the kernel below, if we change

		ca2[ar] = locRules[ca1[a3] + ca1[ar]]; 
		ca2[a3] = locRules[ca1[a2] + ca1[a3]]; 
		ca2[a2] = locRules[ca1[a1] + ca1[a2]]; 
		ca2[a1] = locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];

to

		 locRules[ca1[a3] + ca1[ar]]; 
		 locRules[ca1[a2] + ca1[a3]]; 
		 locRules[ca1[a1] + ca1[a2]]; 
		 locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];

the execution time improves by at least 10x. Any ideas why? Any ideas how to fix it? ca2 is still a __local buffer, so the write shouldn’t be that slow. I’m planning to release this code under the GNU GPL once it is polished. More details after the code.

/*
ASSERT: lenCA must be divisible by 4 * local_size (each work-item owns 4 cells).
ASSERT: global_size == local_size!
ASSERT: globCA is initialized to random (by burning-in gen or from global source)
Generates nNums random 32-bit unsigned ints and stores result in outNums.

The state of the ith cell at iteration j depends on the state of the
ith and (i-1)th cells at iteration (j-1).

*/
__kernel void ca_rng_get_ints (__global uchar *globCA, // the initial state of cellular automaton (ca)
__local uchar *ca1, // local copies of globCA
__local uchar *ca2, // we will swap between ca1 and ca2 for efficiency
int lenCA,
__global const uint *rules, // transition rules from one state to next
__local uint *locRules,
int nRules,
int nRNs, // number of random numbers to generate
__global uint *outNums) {

__global uint *tgon = outNums;

int gid = get_global_id(0);
int nLoc = get_local_size(0); 
int nGlob = get_global_size(0);
__local int toSave;
event_t saveRNs;

// load our cell indices into registers. Every 4 cells represent one uint
int cright = (gid+1) * 4 - 1; // right idx. 
int c3 = cright-1;
int c2 = cright-2;
int c1 = cright-3;
int c0 = cright-4; //ca[c0] is READ ONLY

// load ca into local memory
event_t loadCA = async_work_group_copy(ca1, globCA, lenCA, 0);

event_t loadRules = async_work_group_copy(locRules, rules, nRules, 0);

wait_group_events(1, &loadRules);
wait_group_events(1, &loadCA);

// Note that 1 iteration is one generation of CA state.
int nIters = 1 + nRNs / (lenCA/4); // 4 chars per out number

// run CA and generate rand nums
// idea: bounce back and forth from ca1 to ca2
for (int iter=0; iter < nIters; iter++){

	// Evolve the CA according to the rules
	for (int offset=0; offset < lenCA; offset+=nLoc*4){
		// update cell states using rule table:
		int a0=c0+offset; int a1=c1+offset; int a2 = c2+offset; int a3 = c3+offset; int ar = cright+offset;
		
		// for some reason saving to ca2 takes a lot of time!!
		ca2[ar] = locRules[ca1[a3] + ca1[ar]]; 
		ca2[a3] = locRules[ca1[a2] + ca1[a3]]; 
		ca2[a2] = locRules[ca1[a1] + ca1[a2]]; 
		ca2[a1] = locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];
	}
	
	
	// We've finished a complete revolution over every CA cell
	// save results to our output random number array (outNums)
	if (gid==0)
		toSave = min(lenCA/4,nRNs);
		
	barrier(CLK_LOCAL_MEM_FENCE);
	
	if (iter>0)
		wait_group_events(1, &saveRNs); // make sure previous iter's save has finished
	
	saveRNs = async_work_group_copy(outNums, (__local uint*)ca2, toSave, 0);
	
	// Reposition pointer and number of rand nums still to generate.
	// outNums and nRNs are private (per-work-item) variables, so every
	// work-item must update its own copy.
	outNums = &outNums[toSave];
	nRNs -= lenCA/4;
	
	// swap ca1 and ca2 (also private pointers, so every work-item swaps its own)
	__local uchar *t = ca1;
	ca1 = ca2;
	ca2 = t;
	barrier(CLK_LOCAL_MEM_FENCE);
}

// remember that we just swapped ca1 with ca2, so ca1 is most up to date
event_t saveCA = async_work_group_copy(globCA, ca1, lenCA, 0); 
wait_group_events(1, &saveRNs); 
wait_group_events(1, &saveCA);

}

CA: 2056 bytes; rules: 256 entries; nRNs: on the order of millions.
Running on a MacBook Pro i5 with
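
For anyone who wants to reproduce this, a host-side launch consistent with the ASSERTs in the kernel comment would be roughly the following (a sketch only, not my exact host code; the handle and buffer names are placeholders, and the kernel/queue/buffers are assumed to have been created already):

#include <OpenCL/opencl.h>   /* <CL/cl.h> on non-Apple platforms */

/* lenCA should be divisible by 4 * local_size, per the kernel's ASSERT. */
static void launch_ca_rng(cl_command_queue queue, cl_kernel kernel,
                          cl_mem d_ca, cl_mem d_rules, cl_mem d_out,
                          cl_int lenCA, cl_int nRules, cl_int nRNs,
                          size_t local_size)
{
    size_t global_size = local_size;   /* ASSERT: global_size == local_size */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_ca);            /* globCA */
    clSetKernelArg(kernel, 1, lenCA  * sizeof(cl_uchar), NULL);  /* ca1 (__local) */
    clSetKernelArg(kernel, 2, lenCA  * sizeof(cl_uchar), NULL);  /* ca2 (__local) */
    clSetKernelArg(kernel, 3, sizeof(cl_int), &lenCA);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &d_rules);         /* rules */
    clSetKernelArg(kernel, 5, nRules * sizeof(cl_uint), NULL);   /* locRules (__local) */
    clSetKernelArg(kernel, 6, sizeof(cl_int), &nRules);
    clSetKernelArg(kernel, 7, sizeof(cl_int), &nRNs);
    clSetKernelArg(kernel, 8, sizeof(cl_mem), &d_out);           /* outNums */

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);
}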

When you eliminate the assignments and leave the code like this:

locRules[ca1[a3] + ca1[ar]]; 
locRules[ca1[a2] + ca1[a3]]; 
locRules[ca1[a1] + ca1[a2]]; 
locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];

The compiler is removing all those instructions since they have no effect on the output of the kernel. It’s not surprising that it is faster.

As for “why so much faster”, I don’t know since I’m not familiar with the implementation you are using.
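
One way to separate "cost of the stores" from "cost of the code they keep alive" would be to keep the lookups live without writing to ca2, for example by folding them into an accumulator that can reach global memory. Something like this (untested sketch; it only makes sense as a timing experiment, since the real CA state is no longer being written):

// declare once, before the iteration loop:
uint acc = 0;

// inside the inner offset loop, in place of the four ca2[...] stores:
acc ^= locRules[ca1[a3] + ca1[ar]];
acc ^= locRules[ca1[a2] + ca1[a3]];
acc ^= locRules[ca1[a1] + ca1[a2]];
acc ^= locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];

// after the loops, consume acc so the compiler cannot prove the work is dead
// (the branch will almost never be taken, but it cannot be optimised away):
if (acc == 0xDEADBEEF)
    globCA[0] = (uchar)acc;

If that variant is still roughly as slow as the ca2[...] version, the cost is the lookups themselves rather than the __local store.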

To clarify: without that code it takes less than a second. With the

ca2[ar] = …

part, it takes more like 10 seconds (on the GPU only). It is just a local memory assignment, so the time increase is very unexpected to me.

It’s not just a local memory assignment from the compiler’s point of view: it makes a bunch of code dead or not dead, and that saves or costs a ton of instructions. At least that’s what it seems like from what I can see.