Enqueue/Finish Scheme for FFT

This is likely the first of two posts related to some trouble I'm having with an FFT signal cross-correlation module I'm creating (it makes use of the circular convolution theorem, etc.). I'd like to confirm that the following scheme ensures that each recursion level of the FFT butterfly computation is finished, with its buffers fully written, before the next level starts. The circular correlation/convolution involves an FFT, an element-wise product of the spectra, and then an IFFT.
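
In other words, the identity I'm exploiting is

    circular convolution:  IFFT( FFT(x) ⊙ FFT(y) )
    circular correlation:  IFFT( FFT(x) ⊙ conj(FFT(y)) )

where ⊙ is the element-wise product.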

Because of this scheme, I have no kernels that reorder the data into bit-reversed index order. The forward FFT kernel produces a bit-reversed-order spectrum and, since the element-wise products don't care about ordering as long as both operands use the same permutation, the IFFT consumes that result directly and computes a natural-order solution.
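
For reference, "bit-reversed index order" is the standard radix-2 permutation: element i lands at the index whose low log2(fftSize) bits are those of i reversed. A quick C++ sketch of that index map:

unsigned bitReverse(unsigned i, unsigned bits)  // bits = log2(fftSize)
{
    // Reverse the low 'bits' bits of i; this is the permutation a
    // radix-2 FFT introduces between natural and bit-reversed order.
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}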

I should mention that I have multiple GPUs.

Anyway, here is a pseudo-code representation of what happens for every FFT/IFFT (the access/operation algorithms are identical apart from the conjugated twiddle factors; the normalization kernel comes later):



for numDevices:
    data -> buffers
    buffers -> kernel arguments

for fftRecursionLevels:
    for numDevices:
        recursionLevel -> kernel arguments
        deviceCommandQueue -> enqueueNDRangeKernel()
        deviceCommandQueue -> flush()

    for numDevices:
        deviceCommandQueue -> finish()
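
For concreteness, here is roughly what that scheme looks like with the C++ bindings. This is a minimal sketch rather than my actual code: it assumes one shared context, one cl::CommandQueue and one cl::Kernel object per device, buffers already bound as argument 0, and the recursion level as argument 1.

#include <CL/cl.hpp>
#include <cstddef>
#include <vector>

// Per-level enqueue/flush/finish across all devices. Each recursion
// level is a separate launch over fftSize/2 radix-2 butterflies.
void runFftLevels(std::vector<cl::CommandQueue>& queues,
                  std::vector<cl::Kernel>& kernels,
                  cl_uint fftRecursionLevels,
                  std::size_t fftSize)
{
    for (cl_uint level = 0; level < fftRecursionLevels; ++level) {
        // Enqueue this level on every device; flush() lets a device
        // start working while the remaining devices are still being fed.
        for (std::size_t d = 0; d < queues.size(); ++d) {
            kernels[d].setArg(1, level);
            queues[d].enqueueNDRangeKernel(kernels[d], cl::NullRange,
                                           cl::NDRange(fftSize / 2),
                                           cl::NullRange);
            queues[d].flush();
        }
        // Block until every device has finished this level before any
        // device is handed the next one.
        for (std::size_t d = 0; d < queues.size(); ++d)
            queues[d].finish();
    }
}

With in-order queues, once finish() has returned for every queue, each device's global-memory writes for that level are complete before any next-level kernel is enqueued.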


Can I get away with that? As far as I understand, finish() is a blocking call, so that very last for loop will not complete until every kernel has finished computing across its global range (here fftSize / 2; see any literature on radix-2 butterfly operations), and, for bonus points, some of the kernels are already executing thanks to flush() while I'm still enqueueing the remaining ones.
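
If the full queue drain implied by finish() ever becomes a concern, I believe the same per-level barrier can be expressed with events instead (same assumptions as the sketch above):

// Drop-in replacement for the two inner loops above: wait on
// per-kernel events rather than draining each queue with finish().
std::vector<cl::Event> levelDone(queues.size());
for (std::size_t d = 0; d < queues.size(); ++d) {
    queues[d].enqueueNDRangeKernel(kernels[d], cl::NullRange,
                                   cl::NDRange(fftSize / 2), cl::NullRange,
                                   NULL, &levelDone[d]);
    queues[d].flush();
}
for (std::size_t d = 0; d < queues.size(); ++d)
    levelDone[d].wait();  // returns once that device's kernel completes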

Overall, I'm getting some funky/garbage results using OpenCL/C++ for this particular software. I have implemented the full data pipeline in Python (the algorithm is "topologically equivalent", if you will; obviously there are no host<->device buffer transfers/instructions in the Python version) and simulated how the kernel should run, and it produces IDENTICAL results to just operating on the signal data vectors with the scipy.fftpack modules.

Thanks in advance.

As a follow-up, I'm also wondering if maybe the problem is that the number of intra-kernel operations I have going on is too large for using only global memory and registers (is that even a thing, a per-kernel register limit?). As far as I can tell, I make no use of local memory (no __local address space usage).
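
In case it's useful for judging that, the runtime can report per-kernel resource usage via the standard work-group-info queries; a sketch (kernel and device being the objects in question):

#include <CL/cl.hpp>
#include <iostream>

// Report limits relevant to register/private-memory pressure.
// CL_KERNEL_PRIVATE_MEM_SIZE is the private memory used per
// work-item; on many implementations this reflects register spilling.
void printKernelLimits(const cl::Kernel& kernel, const cl::Device& device)
{
    size_t wgSize = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
    cl_ulong privateMem = kernel.getWorkGroupInfo<CL_KERNEL_PRIVATE_MEM_SIZE>(device);
    std::cout << "max work-group size: " << wgSize << "\n"
              << "private mem per work-item: " << privateMem << " bytes\n";
}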

Sorry if this sounds vague. I'm loath to post code at the moment and really get into it; I want to see if there is any obvious stuff I'm missing first. Please let me know if you'd like me to post something specific.

I guess some pictures would help. Here is exactly what happens in both programs (a sketch of the first two steps follows the list):

  1. generate a Gaussian vector
  2. zero-pad the Gaussian vector to the next-next-highest power-of-2 length
  3. forward FFT, resulting in a natural-order (in ω) result
  4. plot
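
Taking "Gaussian vector" to mean samples of a Gaussian pulse, here is a minimal sketch of steps 1 and 2 (parameters are illustrative; by "next-next-highest power of 2" I mean one doubling past the first power of 2 >= n):

#include <cmath>
#include <cstddef>
#include <vector>

// Steps 1 and 2: sample a Gaussian pulse, then zero-pad.
std::vector<float> makePaddedGaussian(std::size_t n, float mu, float sigma)
{
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        float x = (static_cast<float>(i) - mu) / sigma;
        v[i] = std::exp(-0.5f * x * x);
    }
    std::size_t padded = 1;
    while (padded < n) padded <<= 1;  // first power of 2 >= n
    padded <<= 1;                     // one more doubling ("next next")
    v.resize(padded, 0.0f);           // zero-pad to that length
    return v;
}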

Here is the Python simulation of my kernel, compared to just using scipy.fftpack.fft(vector):


They are the same. Now compare that to either of:

They all use the same starting data, and they should all look like the green/blue line in the first image, but they do not. My eyes have glazed over from how long I've been staring at the host/device code for that second program, and I do not see any typos or incorrect algorithms. I highly suspect that something is happening device-side that I am not aware of, hence my posting here. The algorithm clearly LOOKS like it is operating correctly (the general shape of the red curves is approximately that of the blue/green ones no matter the starting data; I have run the algorithm on different starting sets and it consistently looks like the blue/green result but with that nonsensical noise/error), but something is amiss.

I greatly appreciate/welcome any assistance.