OpenCL compression code running 10 times slower than serial code!

I am a newbie to OpenCL and am working on a project to parallelize a serial lossless compression algorithm.
I have run the code on a GPU by splitting the huge input into 1K chunks,
and individual work items perform the compression on them.
I was expecting a speedup on the GPU over the serial code on a CPU,
but in fact the GPU code is 10 times slower.
Could anyone help by pointing out where I could be going wrong?

DETAILS:
I am working on LZRW compression code that serially processes 1K chunks of data. My initial, obvious thought was to split the input file (say 30 MB) across multiple work items. This way, each work item would need to process much less input, which logically should give a speedup compared to the serial code. But to my surprise, the OpenCL code runs slower than the serial CPU code.
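To make the partitioning concrete, here is a minimal sketch in plain C of how the input could map onto work items, assuming one 1K chunk per work item (the post does not state the exact mapping). The names `CHUNK_SIZE`, `chunk_count`, `chunk_range` and `chunk_for_item` are hypothetical and not taken from the actual project.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 1024u /* 1K chunks, as described in the post */

/* Number of chunks needed to cover input_len bytes; the last chunk may be short. */
size_t chunk_count(size_t input_len)
{
    return (input_len + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* Byte range handled by one work item (hypothetical helper type). */
typedef struct {
    size_t offset;
    size_t length;
} chunk_range;

/* Range that work item `id` would compress. Inside an OpenCL kernel,
 * `id` would typically come from get_global_id(0). */
chunk_range chunk_for_item(size_t id, size_t input_len)
{
    chunk_range r;

    assert(id < chunk_count(input_len));
    r.offset = id * CHUNK_SIZE;
    r.length = input_len - r.offset;
    if (r.length > CHUNK_SIZE)
        r.length = CHUNK_SIZE; /* every chunk except possibly the last is full */
    return r;
}
```

With 30 MB taken as 30 * 1024 * 1024 bytes, `chunk_count` gives 30720 chunks, so there is no shortage of parallel work items. Each of them still runs the same serial compression loop over its own 1K, though.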

Could someone point out where I could be going so wrong as to actually get slower code on a GPU?

There are many reasons why a GPU might be slower than a CPU. There is quite a bit of variation between devices, but in general:

  • GPUs run at a lower clock rate
  • GPUs don’t run branch-intensive code very efficiently
  • GPUs are sensitive to memory access patterns and suffer much reduced bandwidth and poor latency when doing random accesses
  • GPUs are on the other side of the PCIe bus, and thus data typically has to be transported there and back again
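To illustrate the branch-efficiency point, here is a toy model in plain C, not a measurement of any real device. It assumes work items are executed in lockstep groups (often 32 or 64 wide on GPUs), so a group stays busy for as long as its slowest member. The function name `lockstep_utilization` and the step counts are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: `steps[i]` is how many loop iterations work item i needs.
 * Items are grouped `width` at a time and run in lockstep, so each group
 * occupies `width` lanes for as many steps as its slowest item takes.
 * Returns useful lane-steps divided by occupied lane-steps (1.0 = no waste). */
double lockstep_utilization(const unsigned *steps, size_t n, size_t width)
{
    unsigned long long useful = 0, occupied = 0;
    size_t base, i;

    assert(width > 0);
    for (base = 0; base < n; base += width) {
        size_t end = (base + width < n) ? base + width : n;
        unsigned slowest = 0;

        for (i = base; i < end; ++i) {
            useful += steps[i];
            if (steps[i] > slowest)
                slowest = steps[i];
        }
        /* Lanes that finish early (or are unused) still wait for the slowest. */
        occupied += (unsigned long long)slowest * width;
    }
    return occupied ? (double)useful / (double)occupied : 1.0;
}
```

If four work items in one group need 1, 1, 1 and 100 steps, the model gives 103 useful lane-steps out of 400 occupied, roughly 26% utilization. Data-dependent loops, such as a compressor's match search, tend to produce this kind of imbalance.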

LZRW will likely suffer very badly on all of these counts, so I’m not at all surprised that the GPU version is running much more slowly. You should get good results from using OpenCL to run on multi-core CPUs, although you won’t see any benefit from the CPU’s SIMD capabilities.

It may be possible to create compression algorithms that are better suited to GPUs and CPU SIMD processing, but I’m not aware of any existing general-purpose compression algorithm of that kind, and creating one is a challenging task.