Nvidia recently announced CUDA 7 will come with a runtime compilation feature. I find this quite interesting for two reasons. First and foremost, this is something OpenCL could do from the beginning, in fact, that is how OpenCL works. Pass a kernel string to the OpenCL library at runtime, get back something you can call and run on a device. With CUDA you have to compile to PTX, which in turn could be passed as a kernel string to the CUDA runtime or driver API1. So CUDA actually tries to catch up with OpenCL here – an interesting role reversal. Usually OpenCL is behind CUDA.
The second point is that now Nvidia either has to ship a more or less complete C++ compiler with their driver. Or software vendors have to ship this runtime compilation feature with their products. The idea of making a C++ compiler available at runtime seems ludicrous and ingenious at the same time. I wonder what possibilities this opens up in general, beyond runtime parameter tuning. It is obvious that the more the compiler knows the better it can optimize.
But this is not new in the realm of graphics programming APIs. Both Direct3D and OpenGL involve runtime compilation of shader code. The model of 2-stage offline compilation to generic byte-code and runtime compilation for a specific device that CUDA uses also exists in graphics programming.
A direct consequence for my Aura C++ accelerator programming library is that by constraining kernel code to C99 and using a few preprocessor macros as well as wrapper functions, OpenCL and CUDA kernels can now look exactly the same and can be defined in the same way. I hit a wall in this regard recently, due to the PTX generation requirement in CUDA. I started thinking about an additional build step that extracts kernel strings, feeds them to the CUDA compiler, and re-embeds the resulting PTX in the source code. This seems now superfluous.
This new feature will be valuable to other C++ GPU programming libraries too. Recently reviewed (and accepted) Boost.Compute could add a CUDA backend without too much effort – OpenCL and CUDA kernels are not very different. VexCL could use this feature as an alternative to requiring the CUDA compiler installed on the target system. ArrayFire could replace their JIT PTX generation facilities with a JIT C++ generator. Exciting possibilities.
- An alternative is to generate PTX at runtime, a method ArrayFire uses to fuse element-wise operations. CUDA kernels can also be compiled to device code directly at compile time.