Image2D objects in OpenCL and OpenCL kernel performance

There are two ways to create and populate an image object (texture) in OpenCL: a) setting the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D(), or b) calling clEnqueueWriteImage() on an image created without host data.
texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &imageFormat, imageWidth, imageHeight, 0, inputData, &err);

			or

texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY, &imageFormat, imageWidth, imageHeight, 0, 0, &err);
size_t size3D[3] = {imageWidth, imageHeight,1};
size_t size3DOrig[3] = {0, 0, 0};
err = clEnqueueWriteImage(commandQueue, texImage, CL_TRUE, size3DOrig, size3D, 0, 0, inputData, 0, NULL, NULL);
With the second alternative, the time to create and populate the texture is similar to CUDA's, while the first alternative is about 6 times slower. The second alternative also gives at least 3 times faster access to the texture data inside the OpenCL kernel than the first. Any idea why? Is there a difference in locality?

Also, in general, my OpenCL kernels run 2-3 times slower than the CUDA kernels. Is this due to some overhead?

Btw, I am using NVIDIA’s OpenCL implementation.

Those two approaches to initializing a memory object should give identical performance for accessing the image. (They do on Mac OS X.) As long as you don't accidentally set CL_MEM_USE_HOST_PTR (which can force the runtime to do extra copies over the PCIe bus or to use slower mapped system memory), you should be able to use either. I would suggest filing a performance bug with Nvidia about this.
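To back up a bug report, it helps to time both alternatives with the same host-side clock. Below is a minimal sketch of such a harness, assuming a POSIX system with clock_gettime(). The names upload_fn, best_time_ms and fake_upload are made up for illustration and are not part of OpenCL; fake_upload is only a stand-in for a real upload.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() under strict C modes */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical callback type: one complete "create and populate" run. */
typedef void (*upload_fn)(void *ctx);

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Run fn `repeats` times and return the fastest run in milliseconds.
 * Returns -1.0 if repeats is less than 1. */
static double best_time_ms(upload_fn fn, void *ctx, int repeats)
{
    assert(fn != NULL);
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        double t0 = now_ms();
        fn(ctx);
        double elapsed = now_ms() - t0;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Stand-in for a real upload: it only counts how often it was called.
 * A real callback would hold the OpenCL calls for one alternative. */
static void fake_upload(void *ctx)
{
    int *calls = ctx;
    ++*calls;
}
```

In real use, each callback would run one of the two alternatives from the question: create the image, populate it, call clFinish() on the command queue so the work has actually completed, then release the image with clReleaseMemObject(). Reporting the fastest of several runs is a design choice that keeps one-time setup cost out of the number; if the gap between the two paths only shows up on the first run, that is worth mentioning in the bug report.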

As to why the kernels are a lot slower, it's most likely because Nvidia's OpenCL is a lot newer than CUDA and hence less optimized. My understanding is that the compiler backend is similar, so my guess is that the difference comes from performance issues in their runtime. Again, I would suggest filing a performance bug with them.