I'm planning a program that should run on both GPU and CPU. The time-critical operation is a large FFT (up to 1 million data points). I want the program to detect which device type is in use and switch between an OpenCL implementation (clAmdFft) and a C implementation (FFTW).

The second case only makes sense if the FFTW routines can access the device memory buffers directly, without copying them first. Is this possible using the CL_MEM_USE_HOST_PTR flag? What else do I have to consider to avoid unnecessary memory copies?