Calling CPU kernels with lower overhead!

Hello!

Please provide a way to call OpenCL kernels on the CPU through a simple function pointer, bypassing the threading engine.

Purpose:

  • suitable for very short kernels which require low overhead
  • useful for compilers that do not yet emit the latest CPU instructions (including, but not limited to, MS VC++). In that case OpenCL could deliver functions that run as fast as the platform allows, regardless of the capabilities of the compiler or scripting language used.

Thanks!
Atmapuri

>Please provide a way to call OpenCL kernels on the CPU through a simple function pointer

Wouldn’t clEnqueueNativeKernel() be what you are looking for?

Which compilers and scripting languages other than C++ come with a built-in C++ compiler, so that programmers could use clEnqueueNativeKernel(…)?

Take a function such as:

double aFun(double a, double b)
{
    return a + b;
}

Such short functions have a huge call overhead under OpenCL, because on CPU devices every call has to go through the threading library. Calling aFun directly from a C++ for loop is 1000x faster than calling it through the OpenCL API.
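To make the overhead argument concrete, here is a minimal sketch in plain C++ with no OpenCL involved. WorkerQueue, sum_direct and sum_queued are hypothetical names invented for this illustration, and the queue is only a stand-in for "hand the call to another thread and wait for it"; it is not how any particular OpenCL runtime is implemented.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

double aFun(double a, double b) { return a + b; }

// Direct path: n plain calls from a for loop.
double sum_direct(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i) acc = aFun(acc, 1.0);
    return acc;
}

// Hypothetical stand-in for a command queue: one worker thread,
// a mutex-protected task list, and a blocking submit.
class WorkerQueue {
public:
    WorkerQueue() : worker_([this] { run(); }) {}

    ~WorkerQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        worker_.join();
    }

    // Push one task and block until the worker has finished it.
    void enqueue_and_wait(const std::function<void()>& task) {
        std::unique_lock<std::mutex> lock(mutex_);
        tasks_.push(task);
        const unsigned long ticket = ++submitted_;
        wake_.notify_all();
        wake_.wait(lock, [&] { return completed_ >= ticket; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
            if (tasks_.empty()) return;  // woken only to stop
            std::function<void()> task = tasks_.front();
            tasks_.pop();
            lock.unlock();
            task();  // run the task without holding the lock
            lock.lock();
            ++completed_;
            wake_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    unsigned long submitted_ = 0;
    unsigned long completed_ = 0;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Queued path: the same n calls, each one submitted to the worker and awaited.
double sum_queued(int n) {
    WorkerQueue queue;
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        queue.enqueue_and_wait([&acc] { acc = aFun(acc, 1.0); });
    return acc;
}
```

Both functions return the same sum. Timing them with std::chrono::steady_clock for a large n should show the queued version spending nearly all of its time on locking and thread wake-ups rather than on the addition; the actual ratio depends on the machine and the compiler.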

>Which compilers and scripting languages other than C++ come with a built-in C++ compiler, so that programmers could use clEnqueueNativeKernel(…)?

clEnqueueNativeKernel() does not require such a thing. You simply pass a function pointer to it, much like you pass a function pointer in regular C.

OK, but who compiles, with support for Intel AVX and SSE 4.2, the function whose pointer you are passing?

>OK, but who compiles, with support for Intel AVX and SSE 4.2, the function whose pointer you are passing?

The same compiler you are using for the rest of your application. Again, this is not any different from using a function pointer in C99, which is what you asked for.

The compiler I am using does not support Intel AVX or SSE 4.2 and produces much slower code than the OpenCL compiler does. You are assuming that the compiler used to call the OpenCL code is a good one. Take, for example, any .NET compiler or JavaScript. They produce substandard code compared to, say, Intel C++.

>Again, this is not any different from using a function pointer in C99, which is what you asked for.

In terms of syntax, and when using a C++ compiler, maybe. In terms of performance, definitely not. I read the documentation for clEnqueueNative… and the call has an enormous performance overhead. If nothing else, it adds the function to the queue. Handling the queue alone already takes 1000x longer than a direct function call in C++. (The queue-handling overhead comes mostly from thread synchronization.) Is it possible to call an OpenCL kernel without adding it to the queue?

>If nothing else, it adds the function to the queue. Handling the queue alone already takes 1000x longer than a direct function call in C++.

Even executing a simple function pointer requires the OpenCL runtime to guarantee the same synchronization and memory coherency constraints as when you are running any other kernel. The runtime can’t simply take the pointer and call it right away.

>Is it possible to call an OpenCL kernel without adding it to the queue?

No, it’s not possible. It’s hard to understand the value of executing a piece of code without synchronizing it with the rest of the computations going on in the runtime.

>The runtime can’t simply take the pointer and call it right away.

I am very well aware of that. Hence the suggestion.

>It’s hard to understand the value of executing a piece of code without synchronizing it with the rest of the computations going on in the runtime.

That’s what I tried to explain, although obviously not very successfully. (You do agree that there are algorithms which the current OpenCL API is not suited to accelerate?) The primary value is the quality, or the speed, of the compiled code. As mentioned before, many compilers from which you can call OpenCL generate substandard code. In a world where CPUs will soon have registers wide enough to hold 8 double-precision values (so that an add or mul runs on all of them concurrently in one cycle), while some compilers still operate only on the first item in those registers, this can make a big difference.
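As a rough illustration of the "8 doubles per register" point, the sketch below adds two arrays once element by element and once in blocks of 8. The names add_scalar, add_blocked and kLanes are made up for this example, and the lane count of 8 is an assumption (a 512-bit register holding 64-bit doubles). Whether the blocked loop actually becomes a single vector instruction depends entirely on the compiler and its flags, which is exactly the point being argued.

```cpp
#include <cstddef>

// What a non-vectorizing compiler effectively does: one element per iteration,
// using only the first lane of a wide register.
void add_scalar(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Assumed lane count: 8 doubles fit in one 512-bit register.
constexpr std::size_t kLanes = 8;

// Same arithmetic, written as fixed-width blocks of kLanes elements.
// A vectorizing compiler can map each block to one wide add;
// this source code by itself does not guarantee that.
void add_blocked(const double* a, const double* b, double* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            out[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover elements
}
```

Both functions produce identical results; the only difference is how much of the register width the generated code ends up using.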

In the world of Intel CPUs, you can easily see a performance ratio of 50x depending on the compiler you use. This is not a marginal gain (!)

Here is a list of reasons why it makes sense to call OpenCL kernels outside the threaded context:

1.) Users can use the OpenCL compiler to generate code many times faster than what their primary development compiler can deliver.
2.) The resulting application will be cross-platform (portable performance).
3.) The use of a threading library (inside the OpenCL API) implies big jobs. But we all know that not all jobs can be threaded. Some are simply too small in that context, yet they could still benefit greatly from a compiler capable of vectorization.
4.) You get free access to a high-performance compiler.

For these reasons I would like OpenCL to allow unthreaded calls of its kernels, and/or calls threaded from the caller’s side, where the entire OpenCL API can optionally be bypassed except for the platform->device->compiler->getkernel->function_call path.

Such an API would enable cross-platform applications to accelerate more of their code at lower development cost.

Thanks for the explanation. I understand now.

In my book, OpenCL is the most innovative design in computing science of the last 10 years. I think that although it was designed to address GPUs, it will ultimately make the biggest difference in the world of CPUs. Just the fact that you can virtually ship your application with a compiler included opens up a great many possibilities.

Hi,

+1 on this suggestion.

Dynamic to-native compilation that results in code optimized for the target device would be a great asset, even (and especially) when that portion of the code does not need the parallel potential.

We have made our own breakthroughs in structuring code generation in a certain simple fashion. OpenCL already seems to provide very interesting concepts for parallelism.

This feature sounds like the basis for a solution in which we could move most (if not all) of our generation targets to OpenCL, even in cases where we know we truly want to focus on the CPU only.

Kalle Launiala