How to avoid unnecessary memory copying when using a CPU as the device?

I’m planning a program that should run on both GPU and CPU. The time-critical operation is a large FFT (up to 1 million data points). I want the program to detect which device type is in use and switch between an OpenCL implementation (clAmdFft) and a C implementation (FFTW).

The second case only makes sense if the FFTW routines can access the device memory buffers directly, without copying them first. Is this possible with the CL_MEM_USE_HOST_PTR flag? What else do I have to consider to avoid unnecessary memory copying?

Yes, I believe that is what CL_MEM_USE_HOST_PTR is intended for. Another alternative is to use CL_MEM_ALLOC_HOST_PTR and then map the buffer afterwards - although that might not guarantee suitable alignment for SSE/AVX-optimized C code.

One thing to keep in mind is that for correctness, you need to map the buffer (clEnqueueMapBuffer) before you start touching it with CPU code, and unmap it again before CL code starts working on it. On a CPU it isn’t going to make any difference since everything is cache-coherent anyway, but at some point you might also want to be doing this on newfangled devices like APUs and MICs where it might be required for cache coherence.

You might also want to take a look at clEnqueueNativeKernel. I’ve never used it myself, but it should let you slot the host C code into the OpenCL scheduling.

Another thing to consider when using the CL_MEM_USE_HOST_PTR flag is that some hardware requires the host pointer and/or the buffer size to meet certain alignment requirements. In some cases, a buffer with a misaligned pointer may get copied anyway. The optimization guide for whatever hardware you are using should give you advice here.
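One way to satisfy such requirements is to allocate the host memory yourself with an explicit alignment and a padded size, and then pass that pointer to clCreateBuffer with CL_MEM_USE_HOST_PTR. The sketch below uses C11 aligned_alloc; the helper names are invented for illustration, and the alignment you pass in is an assumption that has to come from your vendor's guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of align (align must be a power of two). */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Check whether a pointer is aligned to the given boundary. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Allocate at least `bytes` bytes, aligned to `align`, with the size
   padded to a multiple of `align`. The padded size is stored in
   *padded_bytes (if non-NULL) so it can be used as the buffer size.
   Release the memory with free(). Returns NULL on failure. */
void *alloc_host_buffer(size_t bytes, size_t align, size_t *padded_bytes)
{
    size_t padded = round_up(bytes, align);
    if (padded_bytes)
        *padded_bytes = padded;
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    return aligned_alloc(align, padded);
}
```

For example, alloc_host_buffer(n * sizeof(float), 4096, &padded) would give a page-aligned block on many systems; 4096 here is just an example value, not a figure that applies to every device.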