OpenCL Kernel Performance bad (vs CPU)

Furtano · April 9, 2014, 8:05am

Hello,

I’m writing an ant simulation.
The kernel performance is very bad. Compared to a standard C++ solution, it is at a big performance disadvantage.

I don’t understand why. The operations in the kernel are mostly free of control structures (like if/else).

Kernels:

github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Ant.cl

/*
*
* This is the OpenCL Kernel. It calculates each step for each ant.
*
*/

float lenghtOfVector (float x, float y){

	 return native_sqrt(pown(x,2) + pown(y,2));
}
int getPheromonMapID (int x, int y){
	int WIDTH = 800;

	return y*WIDTH+x;
}

/***
	in: newAntX, newAntY
	out:  newAntX, newAntY
**/

This file has been truncated.

github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Pheromon.cl



int getPheromonMapID (int x, int y){
	int WIDTH = 800;

	return y*WIDTH+x;
}

// Checks if Ant has Smelled a Pheromon
// if yes, it gets into Pheromon Mode

// TODO: SPEED INTEGRATION !!!
__kernel void pheromon (
	__global  float *trash1,
	__global  float *trash2,
	__global  int *pheromonMap,
	__global  float *antX,
	__global  float *antY,
	__global  int * modus,
	__global  bool * isCarryingFood

This file has been truncated.

I ran a benchmark, and the OpenCL kernel performs very badly.
(Left axis: execution time in ms; bottom axis: number of simulated ants.)

Can you give me advice?

You can find the whole code in the git repo, if you are interested (the OpenCL setup happens here: BA-Code-fuer-Mac/clInitFunctions.cpp at master · Furtano/BA-Code-fuer-Mac · GitHub).

Thanks

utnapishtim · April 10, 2014, 3:50am

Your kernels could be optimized, but the most important parameter when using a GPU is the local work size.

NVIDIA GPUs, for instance, tend to do well with a local work size of 128 (a multiple of the 32-wide warp), so you should try again with an explicit local work size (and the global work size must be a multiple of the local work size, of course).
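As a rough sketch of the "global size is a multiple of the local size" part: the host can round the ant count up to the next multiple of the chosen local work size before launching. The helper name `roundUpToMultiple` and the `numAnts` kernel argument mentioned in the comments are assumptions for illustration, not names from the repo.

```cpp
#include <cassert>
#include <cstddef>

// Round `count` up to the next multiple of `localSize`.
// Hypothetical helper: the result would be passed as the global work size
// to clEnqueueNDRangeKernel, with `localSize` as the local work size.
std::size_t roundUpToMultiple(std::size_t count, std::size_t localSize) {
    assert(localSize > 0);
    return ((count + localSize - 1) / localSize) * localSize;
}

// Because the padded global size can exceed the real ant count, the kernel
// then needs a bounds guard at the top, assuming the count is passed in as
// an extra argument (here called numAnts):
//
//     if (get_global_id(0) >= numAnts) return;
```

For example, 1000 ants with a local work size of 128 gives a global work size of 1024, so the last 24 work-items simply exit through the guard. It is worth benchmarking a few local sizes (64, 128, 256), since the best value depends on the device and the kernel.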

tornado · April 14, 2014, 5:39pm

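Not every use case is suitable for a GPU. Your kernel has lots of divergent branches, which are generally bad for GPU performance: work-items that run in lockstep but take different paths end up executing both paths one after the other.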
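One common way to reduce divergence is to compute both candidate values and blend them with a 0/1 mask instead of branching. The sketch below shows the idea in plain C++ so it can be run on the host; the function names and the nest/food target scenario are made up for illustration and are not taken from the kernels in the repo. In OpenCL C the same selection can be written with the built-in `select()` or a ternary, which compilers usually turn into a predicated move.

```cpp
// Branchy version: each ant takes one of two code paths.
float targetXBranchy(bool isCarryingFood, float nestX, float foodX) {
    if (isCarryingFood) {
        return nestX;   // heading home
    } else {
        return foodX;   // heading to the food source
    }
}

// Branch-free version: both candidates are already available, and a
// 0.0/1.0 mask picks one of them arithmetically.
// Assumes finite inputs (0 * infinity would give NaN).
float targetXBranchFree(bool isCarryingFood, float nestX, float foodX) {
    const float mask = static_cast<float>(isCarryingFood); // 1.0f or 0.0f
    return mask * nestX + (1.0f - mask) * foodX;
}
```

This only pays off when both sides are cheap to compute. If one side of a branch does a lot of work, sorting or grouping the ants by mode so that neighbouring work-items take the same path may be the better option.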

andrew.brownsword · April 16, 2014, 12:08pm

One thing I notice is that you are reading back several buffers and then writing them again. All this data transfer in/out of the cl_mem buffer objects is going to carry a substantial performance penalty. You want to minimize memory traffic wherever possible, and if you don’t need something on the host between kernel calls, don’t copy it back.
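To get a feel for how much this matters, here is a back-of-the-envelope estimate of the bytes crossing the host/device boundary in the two schemes. The function names and the example numbers are assumptions for illustration; the real buffer sizes depend on the code in the repo.

```cpp
#include <cstddef>

// Every buffer is written to the device before each kernel launch and read
// back afterwards (one clEnqueueWriteBuffer plus one clEnqueueReadBuffer
// per buffer per simulation step).
std::size_t bytesCopyEveryStep(std::size_t bytesPerBuffer,
                               std::size_t numBuffers,
                               std::size_t steps) {
    return 2 * bytesPerBuffer * numBuffers * steps;
}

// Buffers are uploaded once, stay resident in their cl_mem objects across
// all kernel launches, and are read back once at the end.
std::size_t bytesKeepResident(std::size_t bytesPerBuffer,
                              std::size_t numBuffers) {
    return 2 * bytesPerBuffer * numBuffers;
}
```

With 10,000 ants, four float buffers (40,000 bytes each), and 1,000 steps, copying every step moves 320,000,000 bytes while keeping the data resident moves 320,000 bytes. If the host only needs the positions for drawing, reading back just those buffers, and only on the steps that are actually rendered, keeps most of the saving.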