An OpenCL puzzle

Hi everyone,
I ran into a puzzle while trying to test memory copy performance between the GPU and the host.
I use the following code.

		nCount=1000000;
		BYTE*	pBuf	=new	BYTE[nCount];
		ZeroMemory(pBuf, nCount);	// source buffer: all zeros
		BYTE*	pBuf2	=new	BYTE[nCount];	// destination buffer: left uninitialized
		cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, nCount, NULL, &err);
		QueryPerformanceFrequency(&nFreq);	// counter ticks per second
		QueryPerformanceCounter(&timeBegin);
		// blocking write (host -> device), then blocking read (device -> host)
		clEnqueueWriteBuffer(Command_queue, mem, CL_TRUE, 0, nCount, pBuf, 0, NULL, NULL);
		err	=clEnqueueReadBuffer(Command_queue, mem, CL_TRUE, 0, nCount, pBuf2, 0, NULL, NULL);
		QueryPerformanceCounter(&timeEnd);
		// elapsed time in milliseconds
		float fInterval = (timeEnd.QuadPart - timeBegin.QuadPart)/(1.0*nFreq.QuadPart)*1000;
		cout<<fInterval<<endl;

When I run this, the contents of pBuf2 do not change. If the code is correct, pBuf2 should be all zeros.
Why?

I don’t understand. You allocate two uninitialized buffers “pBuf” and “pBuf2”. Then you copy the uninitialized contents of “pBuf” into the memory object “mem”. Finally, you copy the contents of “mem”, which now contains the same uninitialized data as “pBuf” into “pBuf2”.

At this point the contents of “pBuf”, “mem” and “pBuf2” are identical, but since you never initialized “pBuf” in the first place, the data is garbage instead of zeroes. Is that what you are seeing?

He initializes pBuf to zero (see third line of code: zeromemory(pBuf, nCount);)

Did you check the return values of the functions?
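
A minimal sketch of what checking the return values could look like. The helper names clStatusName and clCheck are hypothetical, not part of OpenCL. The numeric codes are the values defined in CL/cl.h, copied here as plain integers so the sketch compiles without the OpenCL headers; in real code you would compare against the named constants such as CL_SUCCESS instead.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: maps a few OpenCL status codes to their names.
// The integers are the values from CL/cl.h; only a handful of codes
// relevant to buffer creation and transfers are listed.
std::string clStatusName(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -61: return "CL_INVALID_BUFFER_SIZE";
    default:  return "unknown OpenCL status " + std::to_string(status);
    }
}

// Hypothetical helper: returns true on success, otherwise reports
// which call failed and with which status code.
bool clCheck(int status, const char* what)
{
    if (status == 0) {
        return true;
    }
    std::cerr << what << " failed: " << clStatusName(status)
              << " (" << status << ")\n";
    return false;
}
```

With this in place, each call in the snippet could be wrapped, for example clCheck(clEnqueueWriteBuffer(Command_queue, mem, CL_TRUE, 0, nCount, pBuf, 0, NULL, NULL), "clEnqueueWriteBuffer"), and the err value filled in by clCreateBuffer could be passed to clCheck in the same way.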

Thank you both!
The code I posted works fine; after it runs, pBuf2 is indeed all zeros. It failed on my computer earlier because I had set nCount=100000000.
I think that buffer is too big for the GPU, which is why the problem happened.
But I have another question: does OpenCL provide any way to relieve the bottleneck of copying memory between the GPU and the host?
Thanks for your answers, and sorry for my English.

No, OpenCL doesn’t help you here. It’s your job to copy data between devices and host. All you can do is keep the memory transfers to a minimum when you write your programs.