I have a question about element-wise (element-by-element) multiplication of complex matrices larger than 8000x8000. On my GPU (Tesla C2075), a simple implementation takes approximately 200 ms, which is 60% of the total run time. The multiplication is part of a Fourier-based convolution for image filtering using clFFT.
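
To be concrete about the operation, it is the pointwise product c[i] = a[i] * b[i] over all matrix elements. Below is a minimal CPU reference sketch in plain C. It assumes the complex values are stored as interleaved (real, imaginary) single-precision floats; the function name `complex_pointwise_mul` is hypothetical, and this is not the GPU code that was timed.

```c
#include <stddef.h>

/* Reference (CPU) element-wise complex multiply: c[i] = a[i] * b[i].
 * Assumed layout: interleaved floats, element i has its real part at
 * index 2*i and its imaginary part at index 2*i + 1.
 * n is the number of complex elements (rows * cols for a matrix). */
void complex_pointwise_mul(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        /* (ar + i*ai) * (br + i*bi) */
        c[2 * i]     = ar * br - ai * bi;  /* real part */
        c[2 * i + 1] = ar * bi + ai * br;  /* imaginary part */
    }
}
```

Each output element depends only on the matching pair of input elements, so a GPU version would typically assign one element (or a few) to each work-item.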

If anyone knows an efficient method for this problem (element-wise multiplication of complex matrices), please help.