There are two alternatives to creating and populating an image object (texture) in OpenCL: a) Setting the CL_MEM_COPY_HOST_PTR flag in clCreateImage2D() or b) using the clEnqueueWriteImage() API.
texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &imageFormat, imageWidth, imageHeight, 0, inputData, &err);


texImage = clCreateImage2D(GPUContext, CL_MEM_READ_ONLY, &imageFormat, imageWidth, imageHeight, 0, 0, &err);
size_t size3D[3] = {imageWidth, imageHeight,1};
size_t size3DOrig[3] = {0, 0, 0};
err = EnqueueWriteImage(commandQueue, texImage, CL_TRUE, size3DOrig, size3D, 0, 0, inputData, 0, NULL, NULL);
Using the second alternative, the time to create and populate texture is similar to that of CUDA, while the first is 6 times slower? Also, the second alternative leads to atleast 3 times faster access to the texture data within the OpenCL kernel as compared to the first. Any idea why? Any differences in the locality?

Also, in general, OpenCL kernel performance is 2-3 times slower than a CUDA kernel? Is this due to some overheads?