echelon


C++ Accelerator Libraries

In preparation for my C++Now talk entitled The Future of Accelerator Programming in C++, I am currently reviewing numerous C++ libraries. I put together a catalogue of questions for these reviews. The questions are intended to gauge the scope, use-cases, performance, quality, and level of abstraction of each library.

  1. Is concurrency supported?
    Accelerators are massively parallel devices, but due to memory transfer overhead, concurrency is a central aspect of many efficient programs.
  2. How is memory managed?
    This is a central question since simple and efficient management of distributed memory is not trivial.
  3. What parallel primitives are provided?
    Parallel primitives are essential building blocks for many accelerator-enabled programs.
  4. How is numerical analysis supported?
    Massively parallel accelerator architectures lend themselves well to numerical analysis.
  5. How can users specify custom accelerator functions?
    A useful accelerator library should allow users to specify custom functions.
  6. What is the intended use-case for the library? Who is the target audience?
    Is the library suitable for, e.g., high performance computing, prototyping, or signal processing?
  7. What are noteworthy features of the library?

This is a list of all libraries that I am reviewing:

Library        CUDA   OpenCL   Other           Type1
Thrust         X               OMP, TBB        header
Bolt                  X2       TBB, C++ AMP    link
VexCL          X3     X                        header
Boost.Compute         X                        header
C++ AMP               X4       DX11            compiler
SyCL                  X5                       compiler
ViennaCL       X      X        OMP             header
SkePU          X      X        OMP, seq        header
SkelCL                X                        link
HPL                   X                        link
ArrayFire      X      X                        link
CLOGS                 X                        link
hemi           X                               header
MTL4           X                               header
Kokkos         X               OMP, PTH, seq   link
Aura6          X      X                        header

If I missed a library, please let me know. I will add it immediately. I’m going to publish selected library reviews here on my blog. I’m hoping to discuss specific reviews with the original library authors. The conclusions of these reviews will be part of my talk at C++Now.


  1. either a header-only library, a link library, or a library that requires compiler support
  2. custom AMD OpenCL Static C++ Kernel Language extension required
  3. CUDA SDK required at runtime
  4. Prototype implementation available here
  5. only specification released so far
  6. disclaimer: library developed by the author


20 Comments

#1 Matt wrote on April 15, 2014:

In related context, see also “Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries”: http://arxiv.org/abs/1212.6326

#2 Sebastian Schaetz wrote on April 15, 2014:

Hi Matt,
thanks for the hint. I’ll put together a post about related reading and references and I’ll make sure to include the paper. The first author of the paper is also the author of VexCL.
Cheers,
Sebastian

#3 Denis Demidov wrote on April 15, 2014:

As it happens, the third co-author of the paper above is the author of ViennaCL, and the fourth co-author is the author of MTL4 for CUDA. The second co-author is the author of Boost.odeint :).

#4 Sebastian Schaetz wrote on April 15, 2014:

Thanks for the additional information Denis :-)

#5 Carter Edwards wrote on April 15, 2014:

Kokkos is an open source accelerator performance portability library (C++, link) being developed at Sandia National Laboratories. Current back-ends are OpenMP, pthreads, and CUDA. See https://www.xsede.org/documents/271087/586927/Edwards-2013-XSCALE13-Kokkos.pdf

#6 Sebastian Schaetz wrote on April 15, 2014:

Carter, thanks for the heads up about Kokkos. Haven’t had that one on the radar. I will have a look.

I found that source-code is available from the trilinos.org website, is that correct? And is this the latest documentation? http://trilinos.org/docs/r11.6/packages/kokkos/doc/html/index.html For a while now I really prefer looking at code and documentation as opposed to reading the paper. Could you point out what sets Kokkos apart from other libraries?

Cheers,
Sebastian

#7 Carter Edwards wrote on April 15, 2014:

That is the current public-facing location for Kokkos. We are moving from pure research to research & production with this library; documentation is catching up accordingly.
I believe that our unified treatment of thread parallel dispatch and polymorphic multidimensional array layout in a pure-library implementation is unique. This is essential for performance-portable memory access patterns. The closest other package that we looked into is C++AMP, which requires compiler support. I am looking forward to your broader coverage of options than what we have done.

#8 Sebastian Schaetz wrote on April 15, 2014:

Carter, all right – I’m sold. I’ll definitely consider it in my reviews. I also added it to the list. Feel free to send me a note once the move to production is finished. I’ll update the link accordingly.

#9 ltroot wrote on April 16, 2014:

there is a typo in “CUDA SKD required at runtime”

#10 Sebastian Schaetz wrote on April 16, 2014:

Thanks ltroot, I fixed the typo.

#11 Hartmut Kaiser wrote on April 22, 2014:

HPX is a general purpose C++ runtime system for parallel and distributed applications of any scale (https://github.com/STEllAR-GROUP/hpx). It is more low level than many of the libraries listed above but could be used as a backend for them. A current GSoC project will create an HPX backend for Thrust, for instance. HPX is known to outperform applications which are based on existing programming models (like OMP or MPI). In terms of accelerators the most notable feature is HPX’ ability to enable highly efficient use of the Intel Xeon Phi accelerators (see: for instance http://stellar.cct.lsu.edu/pubs/scala13.pdf).

#12 Sebastian Schaetz wrote on April 22, 2014:

Hartmut, thanks for your comment. To be honest, I have yet to understand what HPX is. It seems to me as if it were a cluster/grid level tool, not an accelerator programming library but I might be wrong about that. Is this the accelerator specific part of HPX [0]? Is there documentation for this?

Cheers,
Sebastian

[0] https://github.com/STEllAR-GROUP/hpxcl

#13 Paul Jurczak wrote on May 12, 2014:

I think that Intel Cilk Plus http://www.cilkplus.org/ deserves a place on your list. It is a C++ language extension, not a library, but so are some of the other entries on the list, e.g. C++ AMP. As far as I know, current Cilk Plus implementations run on CPUs, but Intel announced future support for Xeon Phi.

Have a look at array notation in Cilk Plus, I don’t think you can find the same level of terse notation in other libraries. There is also a standard kernel based approach available.

#14 Sebastian Schaetz wrote on May 12, 2014:

Thanks for the information about Cilk Plus, Paul. I had a look at Cilk Plus and as you said, the array notation is nice (Matlab/Fortran style). But since my articles focus on accelerator libraries, Cilk Plus didn’t make the list.

A better candidate would be OpenMP 4.0, which has accelerator support. Another would be OpenACC. Both seem less capable than Cilk in terms of syntax, but they do support accelerators today.

#15 Mathias Gaunard wrote on May 13, 2014:

NT2 could deserve a mention there. It supports vectorization, parallelization on shared memory systems (OpenMP, TBB and HPX backends), and GPU code generation (CUDA and OpenCL backends, both proprietary).
The GPU code generation is entirely done at compile-time, so there is no need for the CUDA SDK to be present on the target machine for example.
NT2 also integrates many numerical analysis algorithms, including state of the art linear algebra on both multi-core CPU and GPU with performance beating that of PLASMA, MAGMA or MKL.

#16 Ben Sander wrote on May 13, 2014:

Hi Sebastian-
The Bolt library also has a back-end for C++AMP – you could list this in your table in the “other” column. Feel free to shoot me a note if you’d like more info on what we’re doing. Looks like a very interesting talk.
-ben

#17 Sebastian Schaetz wrote on May 14, 2014:

Mathias, thanks for your comment. I’m of course aware of NT2. I was not aware of the fact that you can now generate GPU code, probably because it is proprietary. Is there any documentation, examples, or code showing its capabilities? I guess then I could include it :-)

Ben, thanks for your comment. I saw that there is a C++AMP backend for Bolt. I must have forgotten to add it. I just fixed it.

#18 Carter Edwards wrote on July 22, 2014:

Link to the presentation you gave last May?

#19 Sebastian Schaetz wrote on July 22, 2014:

Linked from my talk page and here.

#20 Varun Nagpal wrote on June 3, 2015:

Have a look at CUDA CUB and ModernGPU libraries

[1] http://nvlabs.github.io/cub/
[2] http://nvlabs.github.io/moderngpu/
