Your kernels could be optimized, but the most important parameter when using a GPU is the local work size.
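NVIDIA GPUs, for instance, execute work-items in warps of 32, so a local work size that is a multiple of 32 tends to work best, and 128 is a good starting point. You should try again with an explicit local work size (and with the global work size rounded up to a multiple of the local work size, of course).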
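To make the "multiple of the local work size" part concrete, here is a minimal sketch of the rounding step in plain C. The helper name `round_up_global_size` is made up for illustration and is not part of the OpenCL API; it assumes a one-dimensional range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): round the number of
 * work-items actually needed, n, up to the next multiple of local_size
 * so that the global work size divides evenly into work-groups. */
static size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);  /* a zero local size would divide by zero */

    size_t remainder = n % local_size;
    if (remainder == 0) {
        return n;            /* already a multiple, nothing to pad */
    }
    return n + (local_size - remainder);
}
```

You would pass the result as the global_work_size argument of clEnqueueNDRangeKernel and the address of your local size (say 128) as local_work_size. Because the global size may now be larger than your data, the kernel should return early when get_global_id(0) is past the real element count, which you would have to pass in as an extra kernel argument.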
One other thing I notice is that you are reading back several buffers between kernel calls and then writing them to the device again. All of that data transfer into and out of the cl_mem buffer objects is going to carry a substantial performance penalty. You want to minimize host/device memory traffic wherever possible: if you don’t need something on the host between kernel calls, don’t copy it back, just leave it in its buffer and pass the same cl_mem to the next kernel.