I am working on convolving an image with a 3x3 mask.

I am facing a problem with reading the data back from the GPU to the CPU. The kernel itself takes only about 30 ms to execute, but reading the result back takes more than a second for a 6000 x 6000 image.
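For scale, here is a rough back-of-the-envelope check of what that read-back time implies. This is only a sketch: it assumes 4 bytes per pixel (for example one float per pixel) and a read time of exactly 1 second, both of which are illustrative assumptions rather than measured values, and the helper names image_bytes and throughput_mb_per_s are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a width x height image.
 * bytes_per_pixel is an assumption; adjust it to the real pixel format. */
size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Effective transfer rate in MB/s (1 MB = 10^6 bytes)
 * for moving `bytes` bytes in `seconds` seconds. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0); /* a zero or negative duration is meaningless */
    return ((double)bytes / 1e6) / seconds;
}
```

Under these assumptions, image_bytes(6000, 6000, 4) is 144,000,000 bytes, so a 1-second read-back corresponds to an effective rate of about 144 MB/s.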

I am using clEnqueueReadBuffer() to read the data back. I have also tried pinned memory, but I didn't see any improvement.

I tried a synchronous (blocking) read with CL_TRUE, and it takes this long. I tried an asynchronous (non-blocking) read with CL_FALSE, and the call returns very quickly, but I do not get the full image back. If I call clFinish(cqCommandQueue) after the asynchronous read, I get the full image, but it takes the same time as the synchronous read.

I am not able to make the GPU faster than the CPU. The time taken by the convolution on the CPU is half the time taken by the convolution on the GPU. The problem is in reading the data back; everything else is fine.

Please help me if you know how to reduce the read-back time.

Thanks in advance.