How can kernels that operate on single items be efficient?

I am new to GPUs, OpenCL, and parallel programming in general. One thing that confuses me is a kernel like this:


__kernel void add_buffers(__global const float *a, __global const float *b, __global float *result) {
    int gid = get_global_id(0);
    result[gid] = a[gid] + b[gid];
}

I see this pattern very often in code examples, where the kernel operates on a single item of a large data set. Semantically it makes sense, but I don’t see how it can possibly be performant. If a kernel is basically a function call, then as the size of the input grows, the overhead of invoking the function on each individual element should dominate.
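For context, here's roughly how such a kernel gets launched from the host, as I understand it (a sketch; setup and error handling omitted, and `queue`, `kernel`, and the `cl_mem` buffers are assumed to already exist):

```c
/* Host side (sketch only): the kernel is enqueued ONCE over the whole
   index space of SIZE work-items -- there is no per-element API call. */
size_t global_size = SIZE;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &result_buf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);
```

So there's only one API call on the host side; the per-item overhead I'm worried about would have to be whatever happens on the device for each of the SIZE work-items.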

For example, nobody would write traditional multithreaded code for a CPU like this:

void add_buffers(float *a, float *b, float *result) {
    for (int i = 0; i < SIZE; i++) {
        // Pseudocode...
        spawn a thread that adds a[i] to b[i] and stores it to result[i]
    }
}

The overhead of spawning all those threads, with their function calls and context switches and whatnot, would eliminate most or all of the speedup gained from doing more than one add operation in parallel. Instead, you’d write something like this:

void add_buffers(float *a, float *b, float *result) {
    int range = SIZE / NUMBER_OF_CPU_CORES;  // Assume it divides evenly
    for (int i = 0; i < NUMBER_OF_CPU_CORES; i++) {
        int start = i * range;
        int end = i * range + range - 1;
        // Pseudocode...
        spawn a thread that adds the numbers a[start to end] to b[start to end] and stores the result in result[start to end]
    }
}

Although each thread is now doing more work, this is better because it only spawns as many threads as there are processing elements (cores), thus keeping overhead to a minimum.

In fact, even a sequential version of the algorithm running on a single-core CPU should beat the OpenCL variant running on a GPU, even for large input sizes, due to the overhead of each kernel invocation handling just one item of the data set. But of course that would defeat the purpose of OpenCL, so my understanding of how a kernel works must be wrong. What am I missing? Thanks.

I like how you thought about this :slight_smile:

The overhead of spawning all those threads, with their function calls and context switches and whatnot, would eliminate most or all of the speedup gained from doing more than one add operation in parallel.

That is correct. You may be assuming that an OpenCL implementation on a CPU spawns a new thread for each work-item; that's not the case. A typical implementation is actually more efficient than the more elaborate chunked code you showed. My understanding is that Apple, for example, implements OpenCL for their CPUs on top of Grand Central Dispatch.

In fact, even a sequential version of the algorithm running on a single-core CPU should beat the OpenCL variant running on a GPU, even for large input sizes, due to the overhead of each kernel invocation handling just one item of the data set.

GPUs have dedicated hardware threads that are very cheap to initialize. That's why kernels like the one you wrote can run at great speed.

Right, but let’s say we ignore the cost of thread spawning. There’s still some amount of overhead in invoking a kernel, just as there’s some amount of overhead in invoking a function in a single-threaded program. If that overhead is K, and I’m processing N items, then the total overhead in the single-item-per-kernel model is KN. But if I simply rewrite the kernel to compute, say, 16 work items instead of just one, then the overhead shrinks to KN/16. That’s quite a speedup for such a relatively simple change to the kernel, especially as N grows. This is why the prevalence of the single-item-per-kernel design surprises me.
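To make the change concrete, here's roughly what I mean (a sketch, assuming SIZE divides evenly by 16 and that the global work size is correspondingly reduced to SIZE / 16):

```c
__kernel void add_buffers16(__global const float *a,
                            __global const float *b,
                            __global float *result) {
    int base = get_global_id(0) * 16;  // each work-item now handles 16 elements
    for (int i = 0; i < 16; i++)
        result[base + i] = a[base + i] + b[base + i];
}
```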

Hmm, perhaps the value of K on a GPU is far lower than I realize. But not all OpenCL code runs on the GPU. For some platforms, it may run on the CPU. Would the overhead matter then?

In any case, your point highlights a gap in my knowledge. I’ve tried to locate a good resource for learning about GPU hardware architecture, but there doesn’t seem to be much out there. Almost everything I find about GPGPU is from a software perspective and virtually ignores the hardware or treats it very abstractly. (The OpenCL spec is an example of the latter.) What I would love to find is a book of the same scope and caliber as Hennessy and Patterson’s venerable “Computer Architecture”, but for the GPU instead of the CPU. Does anything remotely like that exist?

Thanks for any advice.

If that overhead is K, and I’m processing N items, then the total overhead in the single-item-per-kernel model is KN. But if I simply rewrite the kernel to compute, say, 16 work items instead of just one, then the overhead shrinks to KN/16. That’s quite a speedup for such a relatively simple change to the kernel, especially as N grows.

Good thinking! Maybe this is something that can be automated :wink:

But not all OpenCL code runs on the GPU. For some platforms, it may run on the CPU. Would the overhead matter then?

Not really. The people who write OpenCL drivers and compilers have already thought about these issues and solved them pretty well. I can’t go into details as this is proprietary information. However, you are already thinking in the right direction.

I’ve tried to locate a good resource for learning about GPU hardware architecture, but there doesn’t seem to be much out there. […] What I would love to find is a book of the same scope and caliber as Hennessy and Patterson’s venerable “Computer Architecture”, but for the GPU instead of the CPU. Does anything remotely like that exist?

I don’t think such a book exists. The only way I know of to learn more about what goes on under the hood of GPUs is to start working for a company that designs them. The most common way to do that is to join as a recent graduate, but it’s possible to join later as well. We are hiring (hint, hint).

Does the smiley at the end mean that OpenCL already scales automatically like this, or is it a suggestion to contribute a way to automate such a thing? :slight_smile:

I’ve been a bit AFK from OpenCL, but we were having similar thoughts about combining kernels, or other means of manipulating the hand-written kernel distribution, to adjust the resulting kernels based on the core and thread count.

So in case you are looking for a way to control such automation, I’d be interested to show what we can do :slight_smile:

Kalle