How do I get the best performance when using parallel processing?

Basically I have two questions.

  1. How many commands can I pass to the queue when I want to perform parallel processing, i.e. when sending a program to the queue?
  2. What is the best practice for sending kernels to the queue when applying effects to images, such as a Gaussian blur, a B/W mask generator, or a color replacer?

I have an idea about the Gaussian blur and would be glad if you shared your opinion on it. Here is a basic kernel I found on a blog that teaches how to implement a Gaussian blur with OpenCL (by Lefteris):
http://paste.ofcode.org/383aLG4EW6S8cRUbhvfcceA
What I think is wrong here is that all the processing is done on one GPU processing unit. I have a cheap card, a GT640, which has (only) 384 unified shaders. So my idea is to separate the data to be calculated into 384 groups. For instance, if I had an image of dimensions 4000x4000 (at least), then 4000 / 384 = 10.41.
I would need to create a loop with 11 cycles. I would go from y=0 to 3999, creating the commands of the gaussian_blur program and sending them to the parallel processing queue. Which line of pixel data to work on would be specified in every call as an input to the function, so 384 calls per cycle, repeated 11 times, until all pixels have been processed.

Of course, I would need to change the kernel so that it processes only one pixel; which pixel's color information I want to receive would be determined outside of the program, while the computation itself would be performed inside it.
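As a sketch of what I mean by processing only one pixel, here is a plain-C version of the per-pixel computation (a 3x3 Gaussian kernel on a grayscale float image; the function name and parameters are mine, not from the linked kernel):

```c
#include <assert.h>

/* Blur a single pixel (x, y) of a w x h grayscale image with a 3x3
   Gaussian kernel; coordinates are clamped at the image borders. */
float blur_one_pixel(const float *img, int w, int h, int x, int y) {
    /* 3x3 Gaussian weights; they sum to 16, so divide by 16 at the end */
    static const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int sx = x + dx, sy = y + dy;
            if (sx < 0) sx = 0;          /* clamp to the left/right edge */
            if (sx >= w) sx = w - 1;
            if (sy < 0) sy = 0;          /* clamp to the top/bottom edge */
            if (sy >= h) sy = h - 1;
            sum += k[dy + 1][dx + 1] * img[sy * w + sx];
        }
    }
    return sum / 16.0f;
}
```

The host side would decide which (x, y) each call works on and pass it in as an argument, which is the "determined outside, performed inside" split I described.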

Is this realistic? Or would it be bad practice? Or does OpenCL already have some function that can split the data into chunks automatically, without my having to do it manually?