Faster on CPU than on the GPU

All,

I need to run a population-based code that computes an algebraic expression on the GPU and returns the result to the CPU. The host code seems to be set up correctly, but the CPU implementation is at least twice as fast as the GPU one. My global work size is 200 for the CL code below.

Here are my questions:

  1. Are there any obvious flaws in the CL code?

  2. Would a dedicated video card be any faster than a comparable on-board chip? I tried two machines: 1) an iMac (3.06 GHz CPU with a GeForce 8800 GS) and 2) a MacBook Pro (2.26 GHz CPU with a GeForce 9400M)

  3. This is a straight OpenCL implementation from Apple. Will using CL with the CUDA architecture be any faster?


__kernel void clTestProblems(int funcindx, __global float *dv, int nvars, __global float *fitval)
{
	int gid = get_global_id(0);	// one work-item per population member

	if(funcindx == 1 || funcindx == 2 || funcindx == 3)
	{
         ...
	}
	else if(funcindx == 4 || funcindx == 5 || funcindx == 6 || funcindx == 7)
	{
		// The f suffix keeps every literal in single precision
		float term1 = 0.0f;
		float term2 = 0.0f;
		float pi_2 = 6.2831854f;	// 2 x pi
		float e_1 = exp(1.0f);	// store exponential of 1.0
		int offset = gid*nvars;	// start of this member's variables in dv

		for(int i=0; i<nvars; i++)
		{
			float x = dv[offset+i];	// read each variable from global memory once
			term1 += x*x;	// square directly; pown() expects an int exponent
			term2 += cos(pi_2*x);
		}
		term1 = term1/((float) nvars);
		term1 = -0.2f*sqrt(term1);
		term1 = -20.0f*exp(term1);

		term2 = term2/((float) nvars);
		term2 = exp(term2);

		fitval[gid] = term1 - term2 + 20.0f + e_1;
	}
}


Thanks,
Vijay.

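200 threads is way too few for a GPU. A GPU needs thousands of threads in flight to hide memory latency and to amortize the kernel launch and transfer overhead. Scale the problem up and the GPU will probably catch up to and surpass the CPU. I haven’t studied your particular program, but that’s the way it usually goes.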