(One Intel CPU + One NVIDIA GPU) is slower than just One Intel CPU!!!

I am on Windows and my laptop has one Intel Core i7-5500U CPU and one NVIDIA GeForce 940M GPU.
The whole task is divided into about 800 sub-tasks,
and every sub-task is sent to a device by clEnqueueNDRangeKernel with a global size of 10000 (the group size is ignored).

Here is the situation:
1) When computed serially with no parallelism, the computing time is about 113300 ms.
2) When using only the Intel CPU with OpenCL, the computing time is about 4400 ms.
3) When using only the NVIDIA GPU with OpenCL, the computing time is about 16400 ms.
4) When using both the Intel CPU and the NVIDIA GPU with OpenCL, the computing time is about 4500 ms,
and the Intel CPU finished 75% of the sub-tasks while the NVIDIA GPU finished 25%.

Look: case 4) is slower than case 2)!

The 800 sub-tasks are dispatched to the 2 devices this way:
1) First, every device is dispatched 3 sub-tasks.
2) When a sub-task finishes on a device, the event callback sends a user-defined message to the main window.
3) The Windows message handler then dispatches the next sub-task to that device.
Note: I use Windows messages to avoid access conflicts on data shared between threads.

Everything seems excellent except the computing time.
I have a strong feeling that I can't reduce the computing time significantly, even if I add more GPUs to the computer.
What should I do if I want to reduce the computing time by another factor of ten?

That is really annoying!
Any suggestions?

I mean, can I get significantly better performance by adding GPUs?
Where is my problem?

Thanks in advance.

There’s no way for anyone to know why this is happening, since it is ultimately reliant on something we can’t see (ie: your code).

That being said, not all tasks are compute-friendly. More importantly, not all tasks are worth the overhead of GPU compute operations.

In order to use GPU compute, you have to put the source data in GPU accessible memory. Since that source data is generally loaded into CPU memory first, that means you have to do some form of memory transfer. After the GPU compute is finished, you have to do another memory transfer to get the data back where you need it: CPU memory.

None of those things have to happen when using CPU compute. So if the added performance gains from using the GPU do not exceed the overhead of this data management, then you won’t see faster compute performance from using the GPU. This would be true for compute operations where the “compute” part is pretty fast already.
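
To make that concrete, here is a rough sketch of the extra steps in the GPU path (the buffer, kernel, and variable names here are made up for illustration, not taken from your code):

// Hypothetical buffers/kernel, just to show the extra data movement the GPU path needs.
cl_int err;
size_t global = n;
cl_mem src_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, n*sizeof(double), NULL, &err);
cl_mem dst_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n*sizeof(double), NULL, &err);

// 1) Copy the source data from CPU memory into GPU-accessible memory.
err = clEnqueueWriteBuffer(queue, src_buf, CL_FALSE, 0, n*sizeof(double), host_src, 0, NULL, NULL);

// 2) The actual compute.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &src_buf);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst_buf);
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

// 3) Copy the results back to CPU memory (blocking read).
err = clEnqueueReadBuffer(queue, dst_buf, CL_TRUE, 0, n*sizeof(double), host_dst, 0, NULL, NULL);

Steps 1 and 3 do no useful computation; they are exactly the overhead being described above.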

Additionally:

the group size is ignored

… that doesn’t sound good.

The 800 sub-tasks are dispatched to the 2 devices this way:
1) First, every device is dispatched 3 sub-tasks.
2) When a sub-task finishes on a device, the event callback sends a user-defined message to the main window.
3) The Windows message handler then dispatches the next sub-task to that device.
Note: I use Windows messages to avoid access conflicts on data shared between threads.

You didn’t provide your code, so I can only speculate based on this description. But this does not sound like a recipe for performance. The structure you describe means that, every time you need a new compute task, the GPU has to signal the CPU, the CPU has to receive that signal and do something, then the GPU has to get started on that something.

Broadly speaking, that sounds like a good way to stall the GPU. You should try to shove as much work at the GPU as possible, as fast as possible. Don’t wait for one thing to be done before sending more work; this is why GPUs have queues.

The way to avoid shared data conflicts is to either not share data or to use GPU synchronization so that sub-tasks that depend on each other do not overlap in execution. Or if there are dependencies, combine “sub-tasks” together so that they’re just one sub-task.
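
For example (just a sketch, not your code; the loop bounds and names are placeholders), instead of round-tripping through a window message per sub-task, you can submit the whole batch up front and synchronize once:

// Sketch: keep the device's queue full instead of signalling the CPU after every sub-task.
for (int t = 0; t < num_sub_tasks; ++t)
{
	// ...set the per-sub-task kernel arguments here...
	err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
	if (err < 0) { /* handle error */ }
}
clFinish(queue); // block once, after all the work has been submitted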

Thank you so much for your detailed reply, Alfonse Reinheart!
I think you fully understood me.
And I am not so sure whether

using CPU compute
means “using the CPU with OpenCL”, as opposed to serial computing.

The whole task is to calculate the anomalies at every terrain grid point caused by the objects underground. (Sorry, I can't upload the image.)
The terrain can have many points, so I divide them into pieces, which I call targets.
Every object is made up of a different number of triangles (cells).
There is a formula for calculating the anomaly at a grid point caused by one triangle, and it is quite complicated (some sin, cos, logs, geometric transforms).
The terrain points have independent coordinates.
Each object is expressed as vertex coordinates plus cells composed of vertices (in fact it is vtk polydata).
So the whole task is to calculate the anomaly at every grid point, which is the sum of the anomalies caused by every triangle.
Because the terrain is big enough and the sizes of the objects vary a lot, I let the global size be the number of target points in one piece (about 10000).
And I let a sub-task be the anomalies caused by part of one object, which means there is a for loop in the kernel over those triangles (about 20).
The results of the sub-tasks are summed up on the host side.
Sorry for my poor English.

The following is the pseudo code for the case where I use only one OpenCL device.
I still have to work out how to adapt this pseudo code for multiple devices.

//this is the pseudo C++ code
int Target_from, Target_to;
int nPieceSize = 10000; //for example
int err;
for(Target_from=0; Target_from<nTerrainPoints; Target_from+=nPieceSize) //piece by piece, nTerrainPoints: total number of terrain points
{
	//the last piece may be smaller
	Target_to = Target_from + nPieceSize;
	if(Target_to > nTerrainPoints) Target_to = nTerrainPoints;

	double *pTargetCoords = pTerrainCoords + Target_from*3; //pTerrainCoords: all coordinates (x,y,z) of the terrain
	double *pTargetValues = pTerrainValues + Target_from;   //pTerrainValues: anomaly results for the whole terrain
	int nTargetPoints = Target_to - Target_from;

	size_t global_size[1];
	global_size[0] = nTargetPoints;

	// Create CL buffers for this piece of the terrain
	cl_mem targetCoord_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		sizeof(double)*3*nTargetPoints, pTargetCoords, &err);
	if(err < 0)...
	cl_mem targetValue_buff = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
		sizeof(double)*nTargetPoints, pTargetValues, &err);
	if(err < 0)...
	// Set kernel arguments
	err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &targetCoord_buff);
	if(err < 0)...
	err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &targetValue_buff);
	if(err < 0)...

	for(int i=0; i<nObjects; i++) //nObjects: number of objects underground
	{
		Object *pObject = objects[i]; //for example

		int nObjectPoints = pObject->GetNumberOfPoints(); //for example (function name is not exact)
		double *pObjectCoords = pObject->GetCoordPointer(); //for example
		int nCells = pObject->GetNumberOfCells(); //for example
		int *pCells = pObject->GetCellPointer(); //for example
		// object parameter
		double fRho = pObject->fRho;

		// Create CL buffers for this object
		cl_mem objectCoord_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
			sizeof(double)*3*nObjectPoints, pObjectCoords, &err);
		if(err < 0)...
		cl_mem cell_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
			sizeof(int)*4*nCells, pCells, &err);
		if(err < 0)...
		// Set kernel arguments
		err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &objectCoord_buff);
		if(err < 0)...
		err = clSetKernelArg(kernel, 3, sizeof(cl_mem), &cell_buff);
		if(err < 0)...
		err = clSetKernelArg(kernel, 4, sizeof(double), &fRho);
		if(err < 0)...

		int cell_from, cell_to;
		int nCellStep = 20; //for example
		for(cell_from=0; cell_from<nCells; cell_from+=nCellStep)
		{
			cell_to = cell_from + nCellStep;
			if(cell_to > nCells) cell_to = nCells;

			// Set the cell range for this sub-task
			err = clSetKernelArg(kernel, 5, sizeof(int), &cell_from);
			if(err < 0)...
			err = clSetKernelArg(kernel, 6, sizeof(int), &cell_to);
			if(err < 0)...

			// Run the kernel (one sub-task)
			err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, NULL, 0, NULL, NULL);
			if(err < 0)...
		}
		clReleaseMemObject(objectCoord_buff);
		clReleaseMemObject(cell_buff);
	}
	// Read the accumulated result for this piece back to the host
	err = clEnqueueReadBuffer(queue, targetValue_buff, CL_TRUE, 0,
		sizeof(double)*nTargetPoints, pTargetValues, 0, NULL, NULL);
	if(err < 0)...

	clReleaseMemObject(targetCoord_buff);
	clReleaseMemObject(targetValue_buff);
}

//this is part of the pseudo kernel code
__kernel void Anomaly(__global double* target_coords, __global double* target_values, //target parameters
	__global double* object_coords, __global int4* mesh_cells, double fRho, //object parameters
	int cell_from, int cell_to) //loop parameters
{
	int k;                               //loop variable
	int iTargetPoint = get_global_id(0); //index of the target point
	double dg = 0.0;                     //anomaly accumulated over the cells of this sub-task
	double ig;                           //contribution of one triangle

	for(k=cell_from; k<cell_to; k++) //loop over triangles
	{
		...
		...
		ig = (p1.x*sin1-p1.y*cos1)*log((p1.x*cos1+p1.y*sin1+R1)/(p2.x*cos1+p2.y*sin1+R2));
		ig+= (p2.x*sin2-p2.y*cos2)*log((p2.x*cos2+p2.y*sin2+R2)/(p3.x*cos2+p3.y*sin2+R3));
		ig+= p3.y*log((p1.x+R1)/(p3.x+R3));
		ig+= 2*p1.z*atan((p1.y*sin1+(1+cos1)*(p1.x+R1))/(p1.z*sin1));
		ig-= 2*p1.z*atan((p2.y*sin1+(1+cos1)*(p2.x+R2))/(p1.z*sin1));
		ig+= 2*p1.z*atan((p2.y*sin2+(1+cos2)*(p2.x+R2))/(p1.z*sin2));
		ig-= 2*p1.z*atan((p3.y*sin2+(1+cos2)*(p3.x+R3))/(p1.z*sin2));

		//add to the total anomaly
		dg += (-n.z*ig); //-n: outward normal
	}
	target_values[iTargetPoint] += fRho*dg;
}

Are you sure you can run your kernel simultaneously on the NVIDIA GPU + Intel CPU? My impression is that you cannot mix devices if they belong to different platforms. Since NVIDIA OpenCL does not support the Intel CPU and Intel OpenCL does not support the NVIDIA GPU, I was not able to use them together. Maybe newer implementations now allow this?

However, I was able to run an AMD GPU + Intel CPU together using AMD OpenCL (amdgpu-pro 16.x), because it supports both CPU and GPU, and yes, if you do the load balancing carefully, you can see a speed improvement - although it was very small in my case. Essentially, since the GPU is about 10-50x faster than the CPU, you have to give the CPU a very small portion of the total task so that both finish at the same time. In one of our papers, we did not bother to include GPU+CPU benchmark results, but we did include mixing GPUs of different generations (see Fig. 3b).

The other factor to consider is that NVIDIA OpenCL does not support running jobs in a queue in parallel - all jobs in a queue are executed in a blocking fashion, even if you have multiple devices available. To run them in parallel, you will have to create multiple contexts and queues in multiple threads. In comparison, the AMD OpenCL driver automatically launches multiple devices to run jobs in the same queue (which is a lot nicer!).
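
Roughly something like this (just a sketch, error checking omitted; each context also needs its own program and kernel objects, and gpu_device/cpu_device are whatever clGetDeviceIDs gave you on each platform):

// One context and one queue per device, since the CPU and the GPU sit on different platforms.
cl_int err;
cl_context ctx_gpu = clCreateContext(NULL, 1, &gpu_device, NULL, NULL, &err);
cl_context ctx_cpu = clCreateContext(NULL, 1, &cpu_device, NULL, NULL, &err);
cl_command_queue q_gpu = clCreateCommandQueue(ctx_gpu, gpu_device, 0, &err);
cl_command_queue q_cpu = clCreateCommandQueue(ctx_cpu, cpu_device, 0, &err);
// ...then feed q_gpu and q_cpu from two host threads so both devices stay busy...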

You can see a code change I made to allow OpenCL to use multiple devices using NVIDIA OCL

Hi, fangqq.
I am quite sure (but not completely sure) that you can use NVIDIA OpenCL to compute on the Intel CPU, as long as you install the Intel OpenCL driver.
My program can run on the Intel CPU and the NVIDIA GPU on Windows, and it is developed with the NVIDIA OpenCL SDK (part of CUDA).
In my case, the problem is the performance.
I do not know whether the program runs on the 2 devices at the same time or in turn; the fact is that it takes longer than running on the faster device alone.
I want to know how you make multiple devices finish their work at the same time.
Sorry, it is hard for me to understand your code.
Thank you.

[QUOTE=tdchen;44031]Hi, fangqq.
I am quite sure (but not completely sure) that you can use NVIDIA OpenCL to compute on the Intel CPU[/QUOTE]

I am also quite sure (probably slightly more sure than your sureness) that NVIDIA OpenCL does not support the Intel CPU. If it did, you would have seen both the CPU and the NVIDIA GPU listed under the same platform, but I’ve never seen that with NVIDIA OCL.

Try renaming your Intel OpenCL library (libintelocl.so) and running your program again; you will see an error when you specify the CPU device.
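
Or just enumerate the platforms and see which device types each one exposes, for example:

// List every platform and how many CPU/GPU devices it exposes.
cl_platform_id platforms[8];
cl_uint num_platforms = 0;
clGetPlatformIDs(8, platforms, &num_platforms);
for (cl_uint i = 0; i < num_platforms; ++i)
{
	char name[256];
	cl_uint num_cpu = 0, num_gpu = 0;
	clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
	clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU, 0, NULL, &num_cpu);
	clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_gpu);
	printf("%s: %u CPU device(s), %u GPU device(s)\n", name, num_cpu, num_gpu);
}

If I am right, you will see the CPU only under the Intel platform, never under the NVIDIA one.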

Sorry fangqq, I misunderstood you.
Of course, the CPU is on another platform.