Scheduling a workload that relies on sequential values

I have a kernel whose next value depends on preceding values. I intend to run this kernel hundreds of thousands of times with different input parameters. The way I have it written right now, the kernel goes through a for-loop across the input set, and that loop is scheduled to run on a single work item. So I already know I'm not utilizing things very well, but it was the first way that came to mind to make sure separate work items didn't try to fill in the array out of order.
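
For reference, the current kernel is roughly shaped like this (simplified; the names and the actual recurrence are placeholders):

    /* A single work item walks the whole output, since each element depends
       on the previous one. Simplified sketch, not the real math. */
    __kernel void sequential_fill(__global const float *params,
                                  __global float *out,
                                  const int count)
    {
        float prev = 0.0f;                       /* seed for the recurrence */
        for (int i = 0; i < count; ++i) {
            /* placeholder for the real dependence on the previous value */
            prev = prev * params[i] + params[i];
            out[i] = prev;
        }
    }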

I have an Intel CPU and an NVIDIA GPU, but right now I'm only scheduling work on the CPU. Without OpenCL, I could run this load on multiple threads scheduled across the cores, and I was hoping to get something similar with OpenCL. I see device fission is available for the CPU but not the GPU. So without using that, I've started trying to line up several kernels on the CPU's command queue. I only saw a few percentage points of change, definitely not the 100%+ improvement I was hoping for.
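
What I mean by lining them up is roughly this (simplified host code; context, queue, kernel, and buffers are assumed to be created earlier, and error checking is omitted):

    #include <CL/cl.h>

    static void enqueue_jobs(cl_command_queue queue, cl_kernel kernel,
                             cl_mem *param_bufs, cl_mem *out_bufs,
                             cl_int count, int num_jobs)
    {
        const size_t one = 1;   /* one work item per enqueue: the kernel is sequential */
        for (int i = 0; i < num_jobs; ++i) {
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &param_bufs[i]);
            clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_bufs[i]);
            clSetKernelArg(kernel, 2, sizeof(cl_int), &count);
            /* On a default in-order queue these presumably just run back to back,
               which may be why I'm only seeing a few percent difference. */
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &one, &one, 0, NULL, NULL);
        }
        clFinish(queue);
    }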

There are probably a few things going on here. Is there a way I can make this kind of function work across work items and actually have it speed up? Should I instead look into device fission? If so, what about the GPU? It has 7 compute units, but since it doesn't support device fission, it would only be as good as a single compute unit to me.

If you can schedule your work over multiple threads when not using OpenCL, then I don't see why you can't use multiple work items. That is the way to use all of the cores on your CPU.

I was pondering this a bit. Do you think I could do this by basically adding a dimension to it?

What I was thinking is that, right now, my scheduling and kernel interaction goes something like this:

  1. Prepare a single kernel for a single output set, out of the 100k+ outputs I need.
  2. Run that single kernel, which can only safely run on one work item.
  3. Fetch it out and move on to the next element.

Do you expect I would then benefit from this:

  1. Still prepare a single kernel, but have it cover x work items' worth of output at once.
  2. Run the kernel, written so that each work item uses its global ID as an index into the output set, from which it fetches the particular parameters it needs and into which it writes its results (see the sketch after this list).
  3. Fetch everything out and move on to the next x work items.
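
Something like this is what I had in mind for the kernel (simplified; names, buffer layout, and the actual math are placeholders):

    /* Each work item picks its own slice of the input and output by global ID,
       so items never touch each other's elements. The sequential dependence
       stays inside one work item. */
    __kernel void batch_fill(__global const float *params,  /* one big input buffer    */
                             __global float *out_a,         /* four big output buffers */
                             __global float *out_b,
                             __global float *out_c,
                             __global float *out_d,
                             const int params_per_item,
                             const int count)
    {
        const int gid  = get_global_id(0);
        const int base = gid * params_per_item;  /* where this item's parameters start */

        float prev = 0.0f;
        for (int i = 0; i < count; ++i) {
            /* placeholder recurrence, sequential within this work item only */
            prev = prev * params[base] + params[base + 1];
            out_a[gid * count + i] = prev;
            out_b[gid * count + i] = prev * 2.0f;
            out_c[gid * count + i] = prev * 0.5f;
            out_d[gid * count + i] = -prev;
        }
    }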

Each work item then is basically working on four large output buffers of floats and one large input buffer of floats. The maximum is 1024 work items in the first dimension on both devices. Should I just go for 1024? Or will the runtime schedule things sensibly if I assume one work item per compute unit? Do I get any control over that?
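
To make the question concrete, this is the call I'd be making for each batch of x work items (queue and kernel set up as before, error checking omitted). As far as I can tell, the local_work_size argument is where that control would come from; passing NULL there should let the runtime pick the work-group size:

    size_t global = 1024;                /* x work items in the first dimension */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,        /* work_dim: one dimension        */
                                        NULL,     /* no global offset               */
                                        &global,  /* total work items this batch    */
                                        NULL,     /* local size: runtime decides    */
                                        0, NULL, NULL);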