I have a kernel in which each output value depends on the values computed before it. I intend to run this kernel hundreds of thousands of times, each with different input parameters. Right now the kernel walks the input set in a for-loop, and I schedule it on a single work item. I know that leaves most of the device idle, but it was the first approach that came to mind for making sure separate work items didn't try to fill in the array out of order.
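To make the structure concrete, here is a minimal sketch in plain C rather than OpenCL C. The recurrence, the function names `run_series` and `run_all`, and the parameters `seed` and `scale` are hypothetical stand-ins, not the actual kernel. The sketch only illustrates that the dependency is inside one run, while separate runs with different parameters do not touch each other's output.

```c
#include <stddef.h>

/* Hypothetical recurrence: out[i] depends on out[i - 1].
 * This loop has to run in order, so it stays on one work item. */
static void run_series(float seed, float scale, float *out, size_t n)
{
    if (n == 0) {
        return;
    }
    out[0] = seed;
    for (size_t i = 1; i < n; ++i) {
        out[i] = out[i - 1] * scale + 1.0f;
    }
}

/* Host-side stand-in for an NDRange over parameter sets.
 * Each iteration writes only to its own slice of `out`, so in OpenCL
 * the loop body could be the kernel, with `gid` taken from
 * get_global_id(0) (assumed mapping, not tested on a device). */
static void run_all(const float *seeds, const float *scales,
                    float *out, size_t n, size_t num_sets)
{
    for (size_t gid = 0; gid < num_sets; ++gid) {
        run_series(seeds[gid], scales[gid], out + gid * n, n);
    }
}
```

In this layout the in-order loop is untouched, and the only parallelism assumed is across independent parameter sets.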

I have an Intel CPU and an NVIDIA GPU, but right now I'm only scheduling work on the CPU. Without OpenCL, I could have multiple threads, each running this load, scheduled across the cores. I was hoping to get something similar with OpenCL. I see that device fission is available for the CPU but not for the GPU. So, without using it, I tried enqueuing several kernels on the CPU's command queue. That changed the run time by only a few percent, nowhere near the 100% or more I would expect from putting additional cores to work.

There are probably a few things going on here. Is there a way to have this kind of function work across work items and actually speed up? Should I look into device fission instead? If so, what about the GPU? It has 7 compute units, but since it doesn't support device fission, it would be only as good as one for my purposes.