Archive for January, 2015

CUDA 7 Runtime Compilation

Monday, January 19th, 2015

Nvidia recently announced that CUDA 7 will come with a runtime compilation feature. I find this quite interesting for two reasons. First and foremost, this is something OpenCL could do from the beginning; in fact, that is how OpenCL works: pass a kernel string to the OpenCL library at runtime and get back something you can call and run on a device. With CUDA you have to compile to PTX, which in turn can be passed as a kernel string to the CUDA runtime or driver API1. So CUDA is actually trying to catch up with OpenCL here – an interesting role reversal, since usually OpenCL is behind CUDA.
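For readers unfamiliar with that flow, here is a minimal sketch of the OpenCL path; the kernel, its name and the omission of error handling are my own illustrative choices.

#include <CL/cl.h>

// A kernel passed around as a plain source string.
const char* src =
    "__kernel void scale(__global float* v, float f) {"
    "  v[get_global_id(0)] *= f;"
    "}";

cl_kernel build_scale_kernel(cl_context ctx, cl_device_id dev)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL); // compilation happens here, at runtime
    return clCreateKernel(prog, "scale", &err);    // ready for clEnqueueNDRangeKernel
}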

The second point is that now either Nvidia has to ship a more or less complete C++ compiler with their driver, or software vendors have to ship this runtime compilation feature with their products. The idea of making a C++ compiler available at runtime seems ludicrous and ingenious at the same time. I wonder what possibilities this opens up in general, beyond runtime parameter tuning. It is obvious that the more the compiler knows, the better it can optimize.
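Based on what Nvidia has announced, the runtime compilation library (NVRTC) takes a CUDA C++ string, compiles it to PTX and hands the result to the driver API. A hedged sketch of that flow, with a kernel and compile options of my own choosing, could look like this:

#include <nvrtc.h>
#include <cuda.h>
#include <vector>

CUfunction compile_axpy(CUmodule* module)
{
    const char* src =
        "extern \"C\" __global__ void axpy(float a, float* x, float* y, int n) {"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;"
        "  if (i < n) y[i] += a * x[i];"
        "}";

    // Compile the C++ kernel string to PTX at runtime.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "axpy.cu", 0, NULL, NULL);
    const char* opts[] = { "--gpu-architecture=compute_30" };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the generated PTX through the driver API, as before.
    CUfunction kernel;
    cuModuleLoadData(module, ptx.data());
    cuModuleGetFunction(&kernel, *module, "axpy");
    return kernel;
}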

But this is not new in the realm of graphics programming APIs. Both Direct3D and OpenGL involve runtime compilation of shader code. The model CUDA uses – two-stage compilation, offline to generic byte-code and at runtime for a specific device – also exists in graphics programming.

A direct consequence for my Aura C++ accelerator programming library is that by constraining kernel code to C99 and using a few preprocessor macros as well as wrapper functions, OpenCL and CUDA kernels can now look exactly the same and can be defined in the same way. I recently hit a wall in this regard, due to the PTX generation requirement in CUDA. I started thinking about an additional build step that extracts kernel strings, feeds them to the CUDA compiler, and re-embeds the resulting PTX in the source code. This now seems superfluous.
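To illustrate the idea (the macro names below are invented for this post, not Aura's actual API), a handful of macros is enough to let a single C99-style kernel source compile as both OpenCL C and CUDA C:

// Backend-dependent spellings hidden behind macros (names hypothetical).
#if defined(USE_OPENCL_BACKEND)
  #define KERNEL       __kernel
  #define GLOBAL_MEM   __global
  #define GLOBAL_ID()  get_global_id(0)
#else // CUDA backend
  #define KERNEL       extern "C" __global__
  #define GLOBAL_MEM
  #define GLOBAL_ID()  (blockIdx.x * blockDim.x + threadIdx.x)
#endif

// The same kernel text now works with either backend.
KERNEL void scale(GLOBAL_MEM float* v, float f)
{
    int i = GLOBAL_ID();
    v[i] *= f;
}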

This new feature will be valuable to other C++ GPU programming libraries too. The recently reviewed (and accepted) Boost.Compute could add a CUDA backend without too much effort – OpenCL and CUDA kernels are not very different. VexCL could use this feature as an alternative to requiring the CUDA compiler to be installed on the target system. ArrayFire could replace its JIT PTX generation facilities with a JIT C++ generator. Exciting possibilities.


  1. An alternative is to generate PTX at runtime, a method ArrayFire uses to fuse element-wise operations. CUDA kernels can also be compiled to device code directly at compile time.

SyCL

Saturday, January 3rd, 2015

The fourth post in my C++ accelerator library series is about SyCL. SyCL is a royalty-free provisional specification from the Khronos Group. It is spearheaded by Edinburgh, UK-based Codeplay Software. Its aim is to simplify programming many-core vector co-processors and multi-core vector processors by offering a modern C++ interface on top of OpenCL. The main selling point of SyCL is similar to that of C++ AMP: single-source development and inline kernels. That means kernel code can be embedded within C++ code. To that effect, kernel code can also be templated, something that is not possible using standard OpenCL1.

Sickle, image by Enrico Sanna

SyCL builds upon a new abstraction – the command_group. It groups “lists of OpenCL commands that are necessary to perform all the work required to correctly process host data on a device using a kernel”. When acting in concert with the command_queue type, it also represents a novel attempt at expressing a dataflow graph. I suspect it is also a means for a SyCL compiler to find and extract the code necessary to generate the proper device kernel. In addition, it is supposed to simplify memory management. This is what SyCL looks like:

std::vector<float> h_a(LENGTH);      // a vector
std::vector<float> h_r(LENGTH);      // r vector
{
  // Device buffers
  buffer<float> d_a(h_a);
  buffer<float> d_r(h_r);
  queue myQueue;
  command_group(myQueue, [&]()
  {
    // Data accessors
    auto a = d_a.get_access<access::read>();
    auto r = d_r.get_access<access::read_write>();
    // Kernel
    parallel_for(count, kernel_functor<class vec_add>([=](id<> item) {
      int i = item.get_global(0);
      r[i] += a[i];
    }));
  });
}

For each command_group a kernel is launched and for each kernel the user specifies how memory should be accessed through the accessor type. The runtime can thus infer which blocks of memory should be copied when and where.

Hence, a SyCL user does not have to issue explicit copy instructions but instead informs the runtime about the type of memory access the kernel requires (read, write, read_write, discard_write, discard_read_write). The runtime then ensures that buffers containing the correct data are available once the kernel starts and that the result is available in host memory once it is required. Besides simplifying memory management for the user, I can guess what other problem SyCL tries to solve here: zero-copy. With such an interface, the same code can perform efficiently on zero-copy architectures: memory copies are simply omitted on those systems, and user code does not have to change.
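As a rough illustration (using the access modes listed above; the kernel name and details are my own), a kernel that only reads one buffer and completely overwrites another could declare that intent like this, allowing the runtime to skip the copy-in of the output buffer:

command_group(myQueue, [&]()
{
  auto in  = d_a.get_access<access::read>();           // host data must be made available on the device
  auto out = d_r.get_access<access::discard_write>();  // previous contents may be discarded, no copy-in
  parallel_for(count, kernel_functor<class scale>([=](id<> item) {
    int i = item.get_global(0);
    out[i] = 2.0f * in[i];
  }));
});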

Between the accessor abstraction and host memory is another abstraction: the buffer or image. It seems to be a concept representing data that can be used by SyCL kernels and that is stored in some well-defined but unknown location.

The buffer or image in turn uses the storage concept. It encapsulates the “ownership of shared data” and can be reimplemented by the user. I could not determine from browsing the specification where allocation, deallocation and copying occur. It must be either in the accessor, the buffer or the storage. Oddly enough, storage is not a template argument of the buffer concept but instead is passed in as an optional reference. I wonder how a user can implement a custom version of storage with this kind of interface.

Kernels are implemented with the help of parallel_for meta-functions. Hierarchies can be expressed by nesting parallel_for, parallel_for_workitem and parallel_for_workgroup. Kernel lambda functions are passed an instance of the id<> type. This object can be used by each instance of the kernel to access its location in the global and local work group. There are also barrier functions that synchronize the work-items of a workgroup to facilitate communication.
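Based on my reading of the provisional specification, the nesting could look roughly like the following sketch; the exact signatures may well differ, and the sizes and kernel body are my own.

command_group(myQueue, [&]()
{
  auto r = d_r.get_access<access::read_write>();
  // Outer loop over workgroups, inner loop over the work-items of each group.
  parallel_for_workgroup(nd_range<1>(range<1>(globalSize), range<1>(groupSize)),
                         [=](group<1> grp)
  {
    parallel_for_workitem(grp, [=](item<1> item)
    {
      int i = item.get_global(0);
      r[i] *= 2.0f;
    });
  });
});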

To summarize, I applaud the plan to create a single-source OpenCL programming environment. For this to work with standard C++11 or even C++14, however, an additional compiler or a non-standard compiler extension is required. Such a compiler must extract the parallel_for meta-functions and generate code that can execute on the target device. As of the publication of this article, such a compiler does not exist. There is, however, a test-bed implementation of SyCL that runs only on the CPU.

In addition, I fear that the implicit memory management and the command_group abstraction might be too inflexible and convoluted for SyCL to become a valuable foundation one can build even fancier things on.

An update:
Andrew Richards (founder and CEO of Codeplay) adds that they are working on simplifying storage objects. He also believes that the runtime will have enough information to schedule work and memory accesses in the best performing way. Codeplay is going to use a Clang C++ compiler to extract kernels into OpenCL binary kernels, but there is no release date yet.


  1. An exception to this is the OpenCL Static C++ Kernel Language Extension implemented by AMD; it allows overloading, template arguments, namespaces and references in kernel functions.