echelon


SyCL

The fourth post in my C++ accelerator library series is about SyCL. SyCL is a royalty-free provisional specification from the Khronos Group. It is spearheaded by Codeplay Software, based in Edinburgh, UK. Its aim is to simplify programming many-core vector co-processors and multi-core vector processors by offering a modern C++ interface on top of OpenCL. The main selling point of SyCL is similar to that of C++ AMP: single-source development and inline kernels. That means kernel code can be embedded within C++ code. As a consequence, kernel code can also be templated, something that is not possible using standard OpenCL [1].

Sickle, image by Enrico Sanna

SyCL builds upon a new abstraction – the command_group. It groups “lists of OpenCL commands that are necessary to perform all the work required to correctly process host data on a device using a kernel”. When acting in concert with the command_queue type it also represents a novel attempt at expressing a dataflow graph. I suspect it is also a means for a SyCL compiler to find and extract the code necessary to generate the proper device kernel. In addition it is supposed to simplify memory management. This is what SyCL looks like:

std::vector h_a(LENGTH);             // a vector
std::vector h_r(LENGTH);             // r vector
{
  // Device buffers
  buffer d_a(h_a);
  buffer d_r(h_r);
  queue myQueue;
  command_group(myQueue, [&]()    
  {
    // Data accessors
    auto a = d_a.get_access();
    auto r = d_r.get_access();
    // Kernel
    parallel_for(count, kernel_functor([ = ](id<> item) {
      int i = item.get_global(0);
      r[i] += a[i];
    }));
  });
}

For each command_group a kernel is launched and for each kernel the user specifies how memory should be accessed through the accessor type. The runtime can thus infer which blocks of memory should be copied when and where.

Hence, a SyCL user does not have to issue explicit copy instructions but instead informs the runtime about the type of memory access the kernel requires (read, write, read_write, discard_write, discard_read_write). The runtime then ensures that buffers containing the correct data are available once the kernel starts and that the result is available in host memory once it is required. Besides simplifying memory management for the user, I can guess what other problem SyCL tries to solve here: zero-copy. On zero-copy systems the runtime can simply omit the memory copies, so the same user code performs efficiently on those architectures without any changes.
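To make the idea concrete, here is a minimal sketch in plain C++ of how a runtime could derive copy decisions from the access mode. It is not taken from the SyCL specification: access_mode, copy_plan and plan_copies are hypothetical names, and the mapping rests on my assumptions that the device copy is stale before the kernel runs, that the host reads the data afterwards, and that the non-discarding write modes must preserve elements the kernel does not touch.

```cpp
#include <cassert>

// Hypothetical stand-in for the access modes listed in the text.
// This is an illustration, not the SyCL header.
enum class access_mode { read, write, read_write, discard_write, discard_read_write };

// Copies a runtime would plausibly schedule around one kernel launch.
struct copy_plan {
    bool host_to_device;  // copy data in before the kernel starts
    bool device_to_host;  // copy data back before the host reads it
};

// Assumption: the device copy is stale and the host needs the result later.
copy_plan plan_copies(access_mode mode, bool zero_copy) {
    // On a zero-copy system host and device share memory: nothing to copy.
    if (zero_copy) return {false, false};
    switch (mode) {
        case access_mode::read:               return {true,  false};
        // A plain write may touch only part of the buffer, so the old
        // contents are copied in to keep untouched elements intact.
        case access_mode::write:              return {true,  true};
        case access_mode::read_write:         return {true,  true};
        // The discard modes declare the old contents irrelevant.
        case access_mode::discard_write:      return {false, true};
        case access_mode::discard_read_write: return {false, true};
    }
    assert(false && "unknown access mode");
    return {false, false};
}
```

Calling plan_copies(access_mode::discard_write, false) yields a plan without a host-to-device copy, while any mode combined with zero_copy set to true yields no copies at all. The user-facing code is the same in both cases, which is the point of the interface.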

Between the accessor abstraction and host memory is another abstraction: the buffer or image. It seems to be a concept representing data that can be used by SyCL kernels and that is stored in some well-defined but unknown location.

The buffer or image in turn uses the storage concept. It encapsulates the “ownership of shared data” and can be reimplemented by the user. I could not determine from browsing the specification where allocation, deallocation and copying occur. It must be in either the accessor, the buffer or the storage. Oddly enough, storage is not a template argument of the buffer concept but instead is passed in as an optional reference. I wonder how a user can implement a custom version of storage with this kind of interface.

Kernels are implemented with the help of parallel_for meta-functions. Hierarchies can be expressed by nesting parallel_for, parallel_for_workitem and parallel_for_workgroup. Kernel lambda functions are passed an instance of the id<> type. Each instance of the kernel can use this object to look up its position in the global range and within its local workgroup. There are also barrier functions that synchronize the work-items of a workgroup to facilitate communication.
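The indexing scheme can be emulated on the host in a few lines of standard C++. The sketch below is not SyCL: item_id, host_parallel_for and add_vectors are hypothetical names I made up for illustration, the loop runs sequentially, and I assume a one-dimensional range with a zero offset so that the global index is the group index times the workgroup size plus the local index. It also shows a kernel whose element type is a template parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for id<>: the position of one kernel instance.
struct item_id {
    std::size_t global;  // index in the whole range
    std::size_t group;   // index of the workgroup
    std::size_t local;   // index inside the workgroup
};

// Sequential host emulation of a 1-D parallel_for (illustration only).
template <typename Kernel>
void host_parallel_for(std::size_t global_size, std::size_t local_size, Kernel kernel) {
    assert(local_size > 0 && global_size % local_size == 0);
    for (std::size_t group = 0; group < global_size / local_size; ++group) {
        for (std::size_t local = 0; local < local_size; ++local) {
            // Assumption: zero offset, so global = group * local_size + local.
            kernel(item_id{group * local_size + local, group, local});
        }
    }
}

// A templated "kernel": the element type T is chosen by the caller.
template <typename T>
std::vector<T> add_vectors(const std::vector<T>& a, std::vector<T> r, std::size_t local_size) {
    assert(a.size() == r.size());
    host_parallel_for(r.size(), local_size, [&](item_id item) {
        r[item.global] += a[item.global];
    });
    return r;
}
```

For example, add_vectors<int>({1, 2, 3, 4}, {10, 20, 30, 40}, 2) visits two workgroups of two items each and adds the vectors element by element. A real implementation would run the kernel instances concurrently on the device instead of looping over them.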

To summarize, I applaud the plan to create a single-source OpenCL programming environment. However, even with standard C++11 or C++14, this requires an additional compiler or a non-standard compiler extension. Such a compiler must extract the kernels passed to the parallel_for meta-functions and generate code that can execute on the target device. As of the publication of this article, such a compiler does not exist. There is however a test-bed implementation of SyCL that runs only on the CPU.

In addition, I fear that the implicit memory management and command_group abstractions might be too inflexible and convoluted for SyCL to become a valuable foundation one can build even fancier things on.

An update:
Andrew Richards (founder and CEO of Codeplay) adds that they are working on simplifying storage objects. He also believes that the runtime will have enough information to choose the best-performing scheduling and memory access. Codeplay is going to use a Clang C++ compiler to extract kernels into OpenCL binary kernels but there is no release date yet.


  1. An exception to this is the OpenCL Static C++ Kernel Language Extension implemented by AMD; it allows overloading in kernel functions, template arguments, namespaces and references.


4 Comments

#1 Mathias Gaunard wrote on January 4, 2015:

My understanding is that SyCL is designed for HSA-capable devices. Those have a unified memory model, so copying is not needed.

As for how the compiler works, it simply triggers code generation for OpenCL targets of any function passed to the higher order template functions. This logic doesn’t necessarily require a compiler with special extensions and could be handled at a build system stage with multiple compilation steps.

#2 Sebastian Schaetz wrote on January 5, 2015:

Thanks for your comment Mathias.

I lack experience with unified memory systems but my guess is that you’ll always want to have some control over where data is allocated and when data is copied. If you don’t have control, you’re at the mercy of the runtime/subsystem to do the right thing for you.

I agree with your comment about the compiler. It is very similar to what CUDA is doing, they call it a “compiler driver”. I’m looking forward to experimenting with any special compiler, custom compiler extension, compiler driver or build system step once it is available.

#3 Andrew Richards wrote on February 20, 2015:

It isn’t designed for HSA-capable devices. SYCL is targeted at OpenCL 1.2-generation devices, so it doesn’t require any shared-virtual-memory. Data is stored in SYCL buffers, which can be implemented as OpenCL buffers. Copying may be necessary on devices which require it, as with OpenCL normally.

We would like to support shared-virtual-memory in OpenCL 2.x in a future SYCL. We aim to stay reasonably up-to-date with OpenCL releases.

#4 Sebastian Schaetz wrote on February 20, 2015:

Andrew, thanks for your comment. I guess this is in reply to Mathias Gaunard?
