Communication between OpenCL and CUDA

Hi all,

I’d like to develop a graphics program where end users can create their own arbitrary ‘script’ functions inside my main program. The catch is that their ‘script’ will have to run on the GPU for speed; ideally, it would be compiled, or JIT-compiled, at runtime.

For this kind of thing, I imagine I would need to redistribute the OpenCL compiler along with my own software.

To complicate things slightly, I’m currently using CUDA for the main code base. However, since NVIDIA’s OpenCL implementation generates PTX just as CUDA does, I was thinking I could redistribute the OpenCL compiler with my software solely for the purpose of generating PTX files when the user wants a function to run on the GPU. From what I understand, I can then use the CUDA Driver API to load these PTX files (sort of treating them like DLLs) and have them run transparently alongside my CUDA code (CUDA 2.1 or higher is needed, I think).
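In outline, the Driver API path I have in mind would look something like this (a minimal sketch, not tested; error handling abbreviated, and `user_func` is a hypothetical kernel name standing in for whatever the user’s script exports):

```c
#include <stdio.h>
#include <cuda.h>   /* CUDA Driver API */

/* Sketch: load a PTX string at runtime and fetch a kernel handle
 * from it, treating the PTX module a bit like a DLL. */
int load_user_ptx(const char *ptx_source)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "no CUDA driver available\n");
        return -1;
    }
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The driver JIT-compiles the PTX to native code in-process */
    if (cuModuleLoadData(&mod, ptx_source) != CUDA_SUCCESS)
        return -1;

    /* Look up the user's function by name, like GetProcAddress on a DLL */
    if (cuModuleGetFunction(&fn, mod, "user_func") != CUDA_SUCCESS)
        return -1;

    /* fn could now be launched with cuLaunchKernel(...) */
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```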

I would appreciate any insight into whether what I am asking is possible. Maybe there are other ways of going about the problem of allowing arbitrary user code at runtime? Examples may include the new GPU.NET or, at a stretch, creating my own PTX and/or CUBIN compiler (shudder).

A few mini-questions also:
1: To make my task easier, can OpenCL directly generate CUBIN files?

2: Can I redistribute the OpenCL compiler, and if so, what’s the latency time compiling from source to object file (presume a tiny source code function of about 4 lines) ?

3: If I were to switch to OpenCL entirely, can I use it to create DLL files and for the main code to use an arbitrary function from within the DLL to execute (pointer to function needed I think) ?

The compiler comes with NVIDIA’s drivers, so you wouldn’t need to distribute it yourself. I doubt NVIDIA would let you redistribute the compiler anyway, as it would create conflicts etc., and new CUDA cards would not be supported by older compilers.

You can just use OpenCL to create the binaries without needing to know how the compiler works.

The OpenCL compiler does return a PTX file when you retrieve the binaries of a compiled program. However, when I look at the PTX, there is a really strange binary blob inside an array at the top. I’m not sure whether this is OpenCL-specific and whether it would stop you loading the file through the CUDA driver interface. You would need to test it.
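Retrieving that PTX looks roughly like this (an untested sketch assuming a single device and an already-built `cl_program`):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch: pull the "binary" back out of a built cl_program. On
 * NVIDIA's implementation this is the PTX text (with the vendor
 * blob mentioned above at the top of it). */
char *get_program_ptx(cl_program prog)
{
    size_t size = 0;
    /* One size per device; assume a single device for brevity */
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    char *ptx = malloc(size + 1);
    unsigned char *bins[1] = { (unsigned char *)ptx };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES,
                     sizeof(bins), bins, NULL);
    ptx[size] = '\0';
    return ptx;   /* caller frees; could be written to disk as a cache */
}
```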

I would just try it to see if it works.

You can currently bypass OpenCL and call the NVIDIA compiler DLL directly (see here), but I would not rely on this, as it is not official and they might decide to prevent people from doing this, or change the interface, in any new driver version.

If you do it yourself, the best option is to write your own compiler that outputs PTX; that way you can make the language work however you want it to. If you want to look at open source compilers you can modify, both AMD and NVIDIA use Clang and LLVM to compile the OpenCL to PTX and AMD IL.

I don’t think it is possible to create a CUBIN compiler, as the format is kept very secret by NVIDIA and can change unexpectedly between CUDA versions. You can only use PTXAS from the command line or the CUDA driver API (which has an internal version of PTXAS).

1: To make my task easier, can OpenCL directly generate CUBIN files?

Not in the current CUDA SDK or drivers. You can only get PTX. NVIDIA have hinted this may change later on to produce CUBIN and/or PTX at the same time. (This is what AMD do: they use an ELF file that contains both AMD IL and the GPU’s ISA code.)

2: Can I redistribute the OpenCL compiler, and if so, what’s the latency time compiling from source to object file (presume a tiny source code function of about 4 lines) ?

I don’t think you can redistribute the compiler. The compiler seems to take a few seconds to compile even small code currently. Once you get the PTX and cache it, the compile time is a lot quicker.

3: If I were to switch to OpenCL entirely, can I use it to create DLL files and for the main code to use an arbitrary function from within the DLL to execute (pointer to function needed I think) ?

Can you explain this better? All you get back from OpenCL is PTX which you could include as a resource section in the DLL. It is still better to include the OpenCL code and just compile it the first time your program runs after a driver version change so that you get the benefit of any bug fixes or performance enhancements included in future compilers.

both AMD and NVIDIA use Clang and LLVM to compile the OpenCL to PTX and AMD IL.

Interesting, never knew that. For ease of use and fast conversion from C/C++ code to PTX, would you recommend Clang or LLVM? Also, maybe I wouldn’t even need to modify the Clang or LLVM code, as conversion from C to PTX seems to be what I need here, and they seem to do that just fine?

I would just try it to see if it works.

I’ll need to study the driver API as my experience with either CUDA or OpenCL is minimal. I’m only just beginning to figure out the basics of the various compiler chains.

What I might try first is an all-OpenCL approach to see if I can get it up and running. After that, I might try to find some way for OpenCL and CUDA to cooperate.

[quote:3goky2pj]1: To make my task easier, can OpenCL directly generate CUBIN files?

Not in the current CUDA SDK or drivers. You can only get PTX.[/quote:3goky2pj]

FYI, when I try to use nvcc to compile from .cu to .ptx, it also requires the Microsoft Visual C compiler, cl.exe (even though, in theory, it shouldn’t need cl.exe at all). Unfortunately, Microsoft won’t allow me to redistribute their cl compiler (and I haven’t heard back yet from NVidia about redistributing nvcc either).

I don’t think you can redistribute the compiler. The compiler seems to take a few seconds to compile even small code currently. Once you get the PTX and cache it, the compile time is a lot quicker.

Just to be clear, when you say compiler here, I presume you mean NVidia’s version of the OpenCL compiler.

(sub-issue: I initially thought there was only “one OpenCL”, because it is able to run on any platform, but NVidia, AMD, Intel etc. seem to be offering their own flavour - a real shame they can’t join together and make available one single download for all).

You mention it takes a few seconds to compile even small code (I presume you mean from source to PTX). That sounds like a potential issue, because I’d need compilation to take no more than about half a second. If I use Clang or LLVM to convert from source to PTX, will that be any quicker?

[quote:3goky2pj]3: If I were to switch to OpenCL entirely, can I use it to create DLL files and for the main code to use an arbitrary function from within the DLL to execute (pointer to function needed I think) ?
Can you explain this better? All you get back from OpenCL is PTX which you could include as a resource section in the DLL. It is still better to include the OpenCL code and just compile it the first time your program runs after a driver version change so that you get the benefit of any bug fixes or performance enhancements included in future compilers.[/quote:3goky2pj]
Sure and thanks for asking…

Currently, on the CPU, my graphics program contains a section where the user can input ‘scripting’ code (essentially a function with parameters). My program then uses a bundled C/C++ compiler (say TCC, the Tiny C Compiler, which is unfortunately pretty slow even considering it’s CPU-only) to convert the user ‘script’ code into a DLL (which could potentially contain a few user functions). All this happens during runtime of the main program. My graphics program then calls the particular DLL containing the user’s compiled function and uses it from within the main program.

Okay, now imagine all of that, but on the GPU under OpenCL instead of the CPU, where not only my graphics program is GPU-accelerated but also the user’s DLL. Please let me know if that’s not clear or if you have any questions.

What’s the OpenCL equivalent of .cubin? Based on what you’ve said, I’m thinking a possible plan of action would be to use Clang or LLVM to convert to PTX, and then use ‘something’ to convert from PTX to the final binary object code that either the NVidia or AMD GPUs can understand.

(sub-issue: I initially thought there was only “one OpenCL”, because it is able to run on any platform, but NVidia, AMD, Intel etc. seem to be offering their own flavour - a real shame they can’t join together and make available one single download for all).

The ICD is supposed to enable applications to use all vendor implementations. Your application links against the ICD DLL and uses clGetPlatformIDs to find all the installed implementations, then chooses which one it wants to use. Users will have the implementation(s) installed that are appropriate to their hardware.
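Enumerating the installed implementations through the ICD looks roughly like this (an untested sketch; it just lists each platform’s vendor so the application can pick one):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: enumerate every OpenCL implementation registered with the
 * ICD loader and print its vendor name. */
int main(void)
{
    cl_uint count = 0;
    clGetPlatformIDs(0, NULL, &count);   /* how many platforms? */

    cl_platform_id platforms[16];
    if (count > 16) count = 16;
    clGetPlatformIDs(count, platforms, NULL);

    for (cl_uint i = 0; i < count; i++) {
        char vendor[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                          sizeof(vendor), vendor, NULL);
        printf("platform %u: %s\n", i, vendor);
    }
    return 0;
}
```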

I strongly recommend against messing around with the compiler toolchains to the degree being discussed in this thread. You are exposing yourself to implementation details that are particular to specific vendors and (even worse) to particular versions of a vendor’s drivers. The OpenCL spec is defined such that you should be able to avoid these issues and be interoperable with any vendor’s OpenCL implementation. As implementations mature, their compatibility and correctness with respect to the spec will improve.

Thanks, what would you recommend then to achieve what I want?

I don’t know exactly what your requirements are, but I would suggest an all-OpenCL application that compiles from source (and perhaps caches compiled binaries), or, if you can’t ship source, precompiling binaries for all the vendors you can. Most vendors are providing a forward & backward compatible binary form – see their implementation docs.
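That “binary first, source as fallback” flow could be sketched like this (untested; `ctx`, `dev`, and the cached binary buffer are assumed to be set up elsewhere):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: try a cached binary first; if it is missing or no longer
 * builds (e.g. after a driver change), fall back to source. */
cl_program build_cached(cl_context ctx, cl_device_id dev,
                        const unsigned char *binary, size_t binary_len,
                        const char *source)
{
    cl_int err, bin_status;
    cl_program prog = NULL;

    if (binary != NULL) {
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &binary_len,
                                         &binary, &bin_status, &err);
        if (err == CL_SUCCESS &&
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) == CL_SUCCESS)
            return prog;              /* cached binary still works */
        if (prog) clReleaseProgram(prog);
    }

    /* Cache miss or stale binary: compile from source */
    prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    if (err != CL_SUCCESS ||
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) != CL_SUCCESS) {
        fprintf(stderr, "build failed\n");
        return NULL;
    }
    return prog;   /* caller may now retrieve and cache its binary */
}
```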