I am new to OpenCL and am working on a project to parallelize a serial lossless compression algorithm.
I have run the code on a GPU by splitting the huge input into 1K chunks,
and individual work items perform the compression on them.
I was expecting a speedup on the GPU over the serial code on a CPU,
but in fact the GPU code is about 10 times slower.
Could anyone help point out where I might be going wrong?

I am working on an LZRW compression code that serially processes 1K chunks of data. My initial, obvious thought was to split the input file (say 30 MB) across multiple work items. This way, each work item would need to process much less input, which logically should give a speedup compared to the serial code. To my surprise, the OpenCL code runs slower than the serial CPU code.
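
To make the partitioning concrete, here is a minimal sketch in plain C of how the input could be divided, assuming one 1K chunk per work item. It is not the actual kernel: `CHUNK_SIZE`, `chunk_count`, `chunk_offset` and `chunk_length` are hypothetical names used only for illustration, and `gid` stands in for the value `get_global_id(0)` would return inside an OpenCL kernel.

```c
#include <stddef.h>

#define CHUNK_SIZE 1024  /* assumed: one 1K chunk per work item */

/* Number of chunks (and so work items) needed to cover total_bytes. */
size_t chunk_count(size_t total_bytes) {
    return (total_bytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* Byte offset of the chunk handled by work item gid.
   In a kernel, gid would come from get_global_id(0). */
size_t chunk_offset(size_t gid) {
    return gid * CHUNK_SIZE;
}

/* Length of that chunk; the last chunk may be shorter than CHUNK_SIZE,
   and a work item past the end of the input gets length 0. */
size_t chunk_length(size_t gid, size_t total_bytes) {
    size_t off = chunk_offset(gid);
    if (off >= total_bytes) {
        return 0;
    }
    size_t rest = total_bytes - off;
    return rest < CHUNK_SIZE ? rest : CHUNK_SIZE;
}
```

Under these assumptions a 30 MB input (30 * 1024 * 1024 bytes) maps to 30720 work items, each running the serial compression routine over its own 1K slice.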

Could someone point out where I could be going wrong to end up with slower code on the GPU?