I have been developing with OpenCL for a week now and am running into my first performance issues. I am processing a video with a resolution of 320x240 (floating-point data), and I do several computations for each frame.
My GPU is an NVIDIA Quadro FX 370, which has only 16 CUDA cores and supports OpenCL 1.0.
My basic steps are as follows:
1. Copy an image from host memory to the device.
2. Enqueue a single kernel with a work-group size of 64x8 (the maximum work-group size is 512, so this fits nicely) and a global size of 320x240.
3. Wait until the kernel has finished.
4. Read the results back from device memory.
5. Go to 1.
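
To double-check the launch geometry in step 2, here is a small sketch of the arithmetic. `LaunchShape` and `describe_launch` are hypothetical helpers written only for this post, not part of the OpenCL API. The check relies on the OpenCL 1.0 rule that each local size must evenly divide the corresponding global size.

```cpp
#include <cstddef>

// Describes how a 2D global range is split into work-groups.
struct LaunchShape {
    std::size_t groups_x;        // work-groups along x
    std::size_t groups_y;        // work-groups along y
    std::size_t total_groups;    // groups_x * groups_y
    std::size_t items_per_group; // local_x * local_y
    bool valid;                  // false if a local size is zero or does not divide the global size
};

// Hypothetical helper (not an OpenCL call): computes the work-group grid
// for a given global size and local (work-group) size.
LaunchShape describe_launch(std::size_t global_x, std::size_t global_y,
                            std::size_t local_x, std::size_t local_y) {
    LaunchShape s{0, 0, 0, local_x * local_y, false};
    if (local_x == 0 || local_y == 0) return s;
    // OpenCL 1.0 requires the global size to be a multiple of the local size.
    if (global_x % local_x != 0 || global_y % local_y != 0) return s;
    s.groups_x = global_x / local_x;
    s.groups_y = global_y / local_y;
    s.total_groups = s.groups_x * s.groups_y;
    s.valid = true;
    return s;
}
```

Calling `describe_launch(320, 240, 64, 8)` gives 5 x 30 = 150 work-groups of 512 work-items each, so the sizes are at least consistent with each other.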

I benchmarked this over 1000 frames, timing only the enqueueKernel + finish calls with a performance counter.
Result on GPU: 16 s
I also have a hand-written SSE version running on the CPU (single thread only):
Result on CPU (SSE): 30 s

I am a bit disappointed, as I had hoped that moving from the CPU to the GPU would make this at least 4 times faster (SSE = 4 instructions in parallel, 16 CUDA cores = 16 instructions in parallel).

I wonder how I can improve the performance. Do I have to enqueue the same kernel multiple times to fully use the GPU? If so, which global size should I choose? Do I have to split the global size by hand?

All pixels within one image are independent, but each pixel must be processed in order from the first frame to the last.
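
To make this constraint concrete, here is a minimal CPU-side sketch of the dependency structure. It is not my actual kernel: `update_pixel` (a simple running average), `process_frame`, `process_video` and the `state` buffer are placeholders I made up for illustration. The only point is that each pixel carries its own state from frame to frame, while pixels within one frame never read each other.

```cpp
#include <cstddef>
#include <vector>

// Placeholder per-pixel update: the new state depends only on this pixel's
// previous state and its value in the current frame.
float update_pixel(float previous_state, float input) {
    return 0.5f * previous_state + 0.5f * input;
}

// Processes one frame, updating the per-pixel state in place.
// The loop iterations are independent of each other, so each one
// could map to a single work-item on the GPU.
void process_frame(std::vector<float>& state, const std::vector<float>& frame) {
    for (std::size_t i = 0; i < state.size() && i < frame.size(); ++i) {
        state[i] = update_pixel(state[i], frame[i]);
    }
}

// Frames must be applied in order, because frame n reads the state
// left behind by frame n-1.
void process_video(std::vector<float>& state,
                   const std::vector<std::vector<float>>& frames) {
    for (const auto& frame : frames) {
        process_frame(state, frame);
    }
}
```

So the parallelism is across the pixels of a frame (320x240 = 76800 of them), while the frame loop itself is sequential.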

Thanks for any advice.