I am new to GPUs, OpenCL, and parallel programming in general. One thing that confuses me is a kernel like this:
__kernel void add_buffers(__global const float *a,
                          __global const float *b,
                          __global float *result) {
    int gid = get_global_id(0);
    result[gid] = a[gid] + b[gid];
}
I see this pattern very often in code examples, where the kernel operates on a single item of a large data set. Semantically it makes sense, but I don't see how it can possibly be performant. If a kernel is basically a function call, then as the input grows, the overhead of invoking the function once per element should come to dominate the cost of the arithmetic itself.
For example, nobody would write traditional multithreaded code for a CPU like this:
void add_buffers(float *a, float *b, float *result) {
    for (int i = 0; i < SIZE; i++) {
        // Pseudocode...
        spawn a thread that adds a[i] to b[i] and stores the sum in result[i]
    }
}
The overhead of spawning all those threads, with their function calls and context switches and whatnot, would eliminate most or all of the speedup gained from doing more than one add operation in parallel. Instead, you’d write something like this:
void add_buffers(float *a, float *b, float *result) {
    int range = SIZE / NUMBER_OF_CPU_CORES; // Assume it divides evenly
    for (int i = 0; i < NUMBER_OF_CPU_CORES; i++) {
        int start = i * range;
        int end = i * range + range - 1;
        // Pseudocode...
        spawn a thread that adds the numbers a[start to end] to b[start to end]
        and stores the result in result[start to end]
    }
}
Although each thread is now doing more work, this is better because it only spawns as many threads as there are processing elements (cores), thus keeping overhead to a minimum.
In fact, by this reasoning, even a sequential version of the algorithm running on a single CPU core should beat the OpenCL version running on a GPU, even for large inputs, because of the overhead of each kernel invocation handling just one item of the data set. But that would obviously defeat the purpose of OpenCL, so my mental model of how a kernel executes must be wrong. What am I missing? Thanks.