Performance comparison between OpenCL & DirectCompute

Hi all,
I implemented the same algorithm in both OpenCL and DirectCompute, using the same total thread count and the same number of groups, so I expected roughly the same performance. But I was wrong: DirectCompute is about 2 times faster than OpenCL, and I don't know why. Can anybody tell me the reason?
I am using Windows 7 and an NVIDIA 8600 GT. Thanks!

Take a 1000×1000 image for example: I set the global size to 1000×64 and the local size to 64, both for OpenCL and DirectCompute. The result is that DirectCompute is 2 times faster than OpenCL on the NVIDIA 8600 GT.

We cannot give any advice unless we can study the source code for both applications. The first place I would look is the memory transfers, since I've seen people misuse buffers (particularly buffer initialization) frequently.
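
For reference, the two usual ways to initialize a device buffer are sketched below; doing both for the same buffer (creating it with initial data and then writing the same data again) wastes a transfer. The names here (ctx, queue, hostPixels, numBytes) are placeholders, not code from either application:

	/* (a) One call: allocate the buffer and copy the host data at creation time. */
	cl_int err;
	cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
	                            numBytes, hostPixels, &err);

	/* (b) Two calls: allocate an empty buffer, then enqueue an explicit write. */
	cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY, numBytes, NULL, &err);
	err = clEnqueueWriteBuffer(queue, buf2, CL_TRUE /* blocking */, 0,
	                           numBytes, hostPixels, 0, NULL, NULL);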

Theoretically, should the same algorithm implemented in OpenCL and DirectCompute have approximately the same performance? Is that right?

Yes, in principle executing the same algorithm in OpenCL and DX Compute on the same hardware will have the same performance.

I tried my best to optimize the OpenCL code, but to no avail: the OpenCL version is still almost 2.5 times slower than the DirectCompute version. I'm posting the OpenCL code here to ask for your help optimizing it.
The algorithm is very simple: apply a Gaussian blur to the image pixels.

#define NUMPIXELPERGROUP 256

const sampler_t RowSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

// Horizontal Gaussian blur. Each work-group produces NUMPIXELPERGROUP output
// pixels of one row; it first stages NUMPIXELPERGROUP + 2*iRadius input pixels
// (the tile plus an apron of iRadius on each side) in local memory.
__kernel void GaussRowProcess(__read_only image2d_t image,
                              __global unsigned int* uiDest,
                              __local float4* localData,
                              unsigned int uiWidth, unsigned int uiHeight,
                              int iRadius,
                              __global const float* pCo)
{
    unsigned int y = get_group_id(1);   // one image row per group in Y
    unsigned int x = get_local_id(0);
    // Global X of the pixel this work-item loads, shifted left by iRadius so
    // that the first iRadius work-items fetch the left apron.
    int globalPosX = ((int)get_group_id(0) * NUMPIXELPERGROUP) + (int)x - iRadius;
    __global unsigned int* pDst = uiDest + mul24(y, uiWidth);

    // Stage one input pixel into local memory (skip positions outside the image).
    if (globalPosX >= 0 && globalPosX < (int)uiWidth)
    {
        int2 pos = (int2)(globalPosX, y);
        localData[x] = read_imagef(image, RowSampler, pos) * 255.0f;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // Only the work-items inside the apron compute an output pixel.
    if (globalPosX >= 0 && globalPosX < (int)uiWidth &&
        x >= (unsigned int)iRadius && x < (unsigned int)(iRadius + NUMPIXELPERGROUP))
    {
        float4 result = (float4)0.0f;
        float  nTotal = 0.0f;
        // Weighted sum over the 2*iRadius+1 neighbours held in local memory.
        for (int i = -iRadius; i <= iRadius; ++i)
        {
            float tmpF = pCo[iRadius + i];
            result += tmpF * localData[x + i];
            nTotal += tmpF;
        }
        // rgbaFloat4ToUint() is a helper (defined elsewhere) that packs a
        // float4 colour into one 32-bit RGBA value.
        pDst[globalPosX] = rgbaFloat4ToUint(result / nTotal);
    }
}

And here is the host code that sets the kernel arguments and launches it:

	int n = 0;
	clSetKernelArg(clKernel, n++, sizeof(clSrc), (void*)&clSrc);     // input image
	clSetKernelArg(clKernel, n++, sizeof(dstMem), (void*)&dstMem);   // output buffer
	// Local memory: one tile of output pixels plus the apron on both sides.
	clSetKernelArg(clKernel, n++, sizeof(cl_float4) * (NUMPIXELPERGROUP + g_templateLen * 2), NULL);
	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_ImageWidth);
	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_ImageHeight);
	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_templateLen);  // iRadius
	clSetKernelArg(clKernel, n++, sizeof(coMem), (void*)&coMem);            // pCo

	// One group per NUMPIXELPERGROUP pixels in X (rounded up), one row per group in Y.
	size_t local_size[2];
	size_t global_size[2];
	local_size[0]  = (size_t)(NUMPIXELPERGROUP + g_templateLen * 2);
	local_size[1]  = 1;
	global_size[0] = local_size[0] * (g_ImageWidth / NUMPIXELPERGROUP + (g_ImageWidth % NUMPIXELPERGROUP == 0 ? 0 : 1));
	global_size[1] = g_ImageHeight;

	clEnqueueNDRangeKernel(g_Command_queue, clKernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
	clFinish(g_Command_queue);

"g_templateLen*2+1" is the template length of the Gaussian blur: for every pixel I use g_templateLen*2+1 pixels to calculate its RGBA, g_templateLen pixels before it and g_templateLen pixels after it.
"__global const float *pCo" holds the coefficients that those g_templateLen*2+1 pixels are multiplied by.
I use "__local float4* localData" to speed things up, but it doesn't help.
As you can see, I only blur horizontally.
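
The coefficients are just Gaussian weights. For illustration, they could be generated on the host roughly like this (a sketch, not my exact code; sigma is an arbitrary choice, pCoHost is a placeholder name, and it needs <math.h> and <stdlib.h>):

	/* Sketch: Gaussian weights w[i] = exp(-i^2 / (2*sigma^2)), radius g_templateLen. */
	float  sigma   = g_templateLen / 3.0f;
	float* pCoHost = (float*)malloc((2 * g_templateLen + 1) * sizeof(float));
	for (int i = -(int)g_templateLen; i <= (int)g_templateLen; ++i)
		pCoHost[(int)g_templateLen + i] = expf(-(i * i) / (2.0f * sigma * sigma));
	/* The kernel divides by nTotal (the sum of the weights), so there is no
	   need to normalize before uploading pCoHost into coMem. */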
What should I do to optimize this code?
Using shared (local) memory may increase the transfer speed, but the kernel is still much slower than the DirectCompute version.
So could it have something to do with NVIDIA? My card is an NVIDIA 8600 GT.

Can you show us how you create the buffers and load data into them? Also, how do you read back the result?
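
For reference, the typical blocking readback looks something like this (a sketch; hostResult is a placeholder, not your code):

	/* Sketch: blocking read of the result buffer back to host memory. */
	cl_int err = clEnqueueReadBuffer(g_Command_queue, dstMem, CL_TRUE /* blocking */,
	                                 0, g_ImageWidth * g_ImageHeight * sizeof(cl_uint),
	                                 hostResult, 0, NULL, NULL);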

Also, have you done any profiling of the individual steps? The very first thing to do when you want to improve the performance of some code is to measure where the time is spent.

Look in the spec for "CL_QUEUE_PROFILING_ENABLE", "CL_DEVICE_PROFILING_TIMER_RESOLUTION", "CL_PROFILING_COMMAND_QUEUED", "CL_PROFILING_COMMAND_START" and "CL_PROFILING_COMMAND_END".
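
For instance, timing a single launch could look roughly like this (a sketch; context, device_id and err are placeholders, and the printf needs <stdio.h>):

	/* Sketch: create a queue with profiling enabled, then time one kernel
	   launch with an event. The timestamps are in nanoseconds. */
	cl_command_queue queue = clCreateCommandQueue(context, device_id,
	                                              CL_QUEUE_PROFILING_ENABLE, &err);
	cl_event evt;
	clEnqueueNDRangeKernel(queue, clKernel, 2, NULL, global_size, local_size,
	                       0, NULL, &evt);
	clWaitForEvents(1, &evt);

	cl_ulong t_start = 0, t_end = 0;
	clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
	                        sizeof(t_start), &t_start, NULL);
	clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
	                        sizeof(t_end), &t_end, NULL);
	printf("kernel time: %f ms\n", (t_end - t_start) * 1e-6);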

Thank you, David. The same program runs as fast as the DirectCompute version on a GeForce 240, so I think the 8600 GT may not support OpenCL well.

Performance in any system tends to be a fragile thing, so it's likely that something you are doing in your OpenCL version is causing trouble, and there may well be a way to fix it. The only way to determine what is causing the problem is to profile the code in some detail so that you can see where the excess time is going. Once you know that, we might be able to help you address the issue.