Archive for April, 2014

Boost.Compute

Monday, April 28th, 2014

This is the second post in my C++ accelerator library series. It is about Boost.Compute, a header-only C++ accelerator programming library developed by Kyle Lutz and others. It is based on OpenCL and is released under the Boost Software License 1.0. The name of the library, the license and the documentation all suggest that Boost.Compute will one day apply for an official review by the Boost community to become part of the Boost C++ library suite. Acceptance into Boost can be a stepping stone to inclusion in the C++ Standard. But this is not the only reason why it is worthwhile to take a look at Boost.Compute.

Compute, stand by, operate, stand by to operate, image by Marcin Wichary

Boost.Compute manages memory through the compute::vector template, a container similar to std::vector. Memory is transferred either synchronously through the copy function or asynchronously through copy_async. I consider this split an excellent design decision: copy does exactly what one would expect, since it is the same call we know from the STL, while the asynchronous variant is available for programmers who need more performance and know what they are doing. An asynchronous copy returns an instance of compute::future, which can be waited on to ensure deterministic and correct execution of the program. Both copy interfaces are iterator-based.

All functions resulting in commands issued to an accelerator accept an optional command_queue argument. A command_queue has a finish function that blocks until all enqueued commands have completed. In addition, barriers and markers can be enqueued in a command_queue for additional control over asynchronous operations. These are, in my opinion, all the tools necessary to express concurrency and control command-level parallelism.
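
To make this concrete, here is a minimal sketch of both transfer styles. The setup follows the usual Boost.Compute pattern; the buffer size and contents are arbitrary choices of mine:

#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main()
{
    // set up the default device, a context and a command queue
    compute::device gpu = compute::system::default_device();
    compute::context context(gpu);
    compute::command_queue queue(context, gpu);

    std::vector<float> host(1024, 1.0f);
    compute::vector<float> device_vec(host.size(), context);

    // synchronous transfer: returns once the copy has completed
    compute::copy(host.begin(), host.end(), device_vec.begin(), queue);

    // asynchronous transfer: returns immediately with a future
    auto f = compute::copy_async(host.begin(), host.end(),
                                 device_vec.begin(), queue);
    f.wait(); // block until this transfer has finished

    // alternatively, block until all commands in the queue have completed
    queue.finish();

    return 0;
}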

Boost.Compute comes with a multitude of parallel primitives. These STL-like functions form the core of Boost.Compute. It is an impressive collection of accelerator-enabled general-purpose algorithms, and I believe a listing of the more complex functions is in order:


/**
  Returns the result of applying function to the elements in the 
  range [first, last) and init.
 */
T accumulate(InputIterator first, InputIterator last, T init, 
  BinaryFunction function)

/**
  Returns true if value is in the sorted range [first, last).
 */
bool binary_search(InputIterator first, InputIterator last, 
  const T & value)

/**
  Performs an exclusive scan on the elements in the range 
  [first, last) and stores the results in the range beginning 
  at result.
 */
OutputIterator exclusive_scan(InputIterator first, 
  InputIterator last, OutputIterator result) 

/**
  Calls function on each element in the range [first, last).
 */
UnaryFunction for_each(InputIterator first, InputIterator last, 
  UnaryFunction function)

/**
  Copies the elements using the indices from the range [first, last) 
  to the range beginning at result using the input values from the 
  range beginning at input.
 */
void gather(MapIterator first, MapIterator last, 
  InputIterator input, OutputIterator result)

/**
  Performs an inclusive scan on the elements in the range [first, last) 
  and stores the results in the range beginning at result.
 */
OutputIterator inclusive_scan(InputIterator first, 
  InputIterator last, OutputIterator result)

/**
  Returns the inner product of the elements in the range 
  [first1, last1) with the elements in the range beginning at first2. 
 */
T inner_product(InputIterator1 first1, InputIterator1 last1, 
  InputIterator2 first2, T init, 
  BinaryAccumulateFunction accumulate_function, 
  BinaryTransformFunction transform_function)

/**
  Merges the sorted values in the range [first1, last1) with the sorted 
  values in the range [first2, last2) and stores the result in the range 
  beginning at result. Values are compared using the comp function. 
 */
OutputIterator merge(InputIterator1 first1, InputIterator1 last1, 
  InputIterator2 first2, InputIterator2 last2, 
  OutputIterator result, Compare comp)

/**
  Calculates the cumulative sum of the elements in the range 
  [first, last) and writes the resulting values to the range beginning 
  at result. 
 */
OutputIterator partial_sum(InputIterator first, 
  InputIterator last, OutputIterator result)

/**
  Randomly shuffles the elements in the range [first, last).
 */
void random_shuffle(Iterator first, Iterator last)

/**
  Reduces the elements in the range [first, last) with function and 
  stores the result at result. Requires the binary operator to be 
  commutative.
 */
void reduce(InputIterator first, InputIterator last, 
  OutputIterator result, BinaryFunction function)

/**
  Performs left rotation such that element at n_first comes 
  to the beginning.
 */
void rotate(InputIterator first, InputIterator n_first, 
  InputIterator last)

/**
  Copies the elements from the range [first, last) to the range 
  beginning at result using the output indices from the range 
  beginning at map.
 */
void scatter(InputIterator first, InputIterator last, 
  MapIterator map, OutputIterator result)

/**
  Sorts the values in the range [first, last) according to compare.
 */
void sort(Iterator first, Iterator last, Compare compare)

/**
  Performs a key-value sort using the keys in the range 
  [keys_first, keys_last) on the values in the range 
  [values_first, values_first + (keys_last - keys_first)) using compare.
 */
void sort_by_key(KeyIterator keys_first, KeyIterator keys_last, 
  ValueIterator values_first, Compare compare)

/**
  Sorts the values in the range [first, last) according to compare. 
  The relative order of identical values is preserved.
 */
void stable_sort(Iterator first, Iterator last, Compare compare)

/**
  Transforms the elements in the range [first1, last1) and the range 
  beginning at first2 using the binary operator op and stores the 
  results in the range beginning at result. 
 */
OutputIterator transform(InputIterator1 first1, InputIterator1 last1, 
  InputIterator2 first2, OutputIterator result, BinaryOperator op)

/**
  Transforms each value in the range [first, last) with the unary 
  transform_function and then reduces each transformed value with 
  reduce_function.
 */
void transform_reduce(InputIterator first, InputIterator last, 
  OutputIterator result, UnaryTransformFunction transform_function, 
  BinaryReduceFunction reduce_function)


There is no special support for numerical analysis, but a number of numerical operations can be built from the aforementioned parallel primitives.
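
As a small example, a dot product can be assembled directly from the inner_product primitive listed above. This is a sketch; the surrounding device setup is assumed to exist:

#include <boost/compute.hpp>

namespace compute = boost::compute;

// dot product built from the inner_product primitive; x and y are
// assumed to be device vectors of equal size
float dot(const compute::vector<float> &x,
          const compute::vector<float> &y,
          compute::command_queue &queue)
{
    return compute::inner_product(x.begin(), x.end(), y.begin(),
                                  0.0f, queue);
}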

Boost.Compute provides a number of built-in functions to pass to the parallel primitives, but programmers may also specify custom functions. This can be done by creating an instance of compute::function; the shorthand macro BOOST_COMPUTE_FUNCTION() simplifies this task. In addition to custom functions, programmers can declare a BOOST_COMPUTE_CLOSURE(), with the added benefit of capturing variables that can then be used within the accelerator function. As a third option, programmers can specify a lambda expression instead of a custom function object; this is accomplished with the help of Boost.Proto. Kyle Lutz talks about these features in a recent blog post.
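
The following sketch shows all three approaches side by side, assuming a compute::vector<float> v and a command_queue queue have been set up as before; the function names and bodies are made up for illustration:

#include <boost/compute.hpp>

namespace compute = boost::compute;

// a custom device function defined through the shorthand macro
BOOST_COMPUTE_FUNCTION(float, plus_two, (float x),
{
    return x + 2.0f;
});

void example(compute::vector<float> &v, compute::command_queue &queue)
{
    // apply the custom function to every element ...
    compute::transform(v.begin(), v.end(), v.begin(), plus_two, queue);

    // ... or use a closure capturing the host variable 'shift' ...
    float shift = 4.0f;
    BOOST_COMPUTE_CLOSURE(float, plus_shift, (float x), (shift),
    {
        return x + shift;
    });
    compute::transform(v.begin(), v.end(), v.begin(), plus_shift, queue);

    // ... or express the operation as a lambda
    using compute::lambda::_1;
    compute::transform(v.begin(), v.end(), v.begin(), _1 + 2.0f, queue);
}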

Boost.Compute contains a well-designed C++ wrapper for OpenCL. Each wrapper type has conversion operators to the underlying OpenCL object, a handy feature that enables full compatibility with the lower layer of accelerator programming. Boost.Compute is interoperable with OpenGL, OpenCV, Eigen, Qt and VTK. It is available on GitHub; the documentation is available here.

I would categorize Boost.Compute as a general-purpose, high-productivity library. Depending on the quality of implementation and the degree of specialization of the provided parallel primitives, performance close to peak should be possible with Boost.Compute. I like the combination of low-level OpenCL wrappers and higher-level parallel primitives, as well as the possibility to fall back to plain OpenCL if certain features are not yet wrapped by Boost.Compute. I think the work in Boost.Compute can be an important part of a yet-to-be-defined standard C++ accelerator library.

VexCL

Saturday, April 26th, 2014

This is the first post in my C++ accelerator library series. It is about VexCL, a header-only C++ library for accelerator programming developed by Denis Demidov and others. The library is released under the MIT license.

VexCL supports OpenCL and CUDA as accelerator backends. For the CUDA backend it is important to note that the CUDA SDK must be installed on systems running VexCL code, because kernel code is generated and compiled at runtime.

LAb[au] : Spectr|a|um, image by Marc Wathieu

The central memory abstraction in VexCL is the vex::vector template. It represents contiguous data on a single accelerator, but it can also act as a segmented data container that manages disjoint blocks of memory on multiple accelerators. When choosing memory segment sizes, the library considers device bandwidth measurements or a user-defined device weighting function. Explicit copy functions allow programmers to move data to, from and between accelerators; both an iterator-based and a range-based syntax are supported. Additional data types include the sparse matrix type vex::SpMat as well as vex::multivector types, which represent lists of vectors that can be processed in a single step.
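
A minimal sketch of the container and the explicit copy functions; the device filter and problem size are arbitrary choices of mine:

#include <vector>
#include <vexcl/vexcl.hpp>

int main()
{
    // initialize a context with all devices that support double precision
    vex::Context ctx(vex::Filter::DoublePrecision);

    const size_t n = 1024 * 1024;
    std::vector<double> host(n, 1.0);

    // x may be transparently segmented across all devices in ctx
    vex::vector<double> x(ctx, n);

    vex::copy(host, x); // host -> device
    vex::copy(x, host); // device -> host

    return 0;
}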

Concurrent execution of multiple kernels, or of copy and kernel operations, is partially supported by VexCL. Copy operations are synchronous by default but can be configured to be asynchronous. Each container has associated command_queues that are used to enqueue operations. A command_queue has a finish() method that blocks until all commands in the queue have completed. This is not the most elegant way to handle concurrency, but VexCL does not abstract away the option of executing operations in parallel, which is nice.

VexCL supports numerous parallel primitives such as inclusive_scan, exclusive_scan, sort, sort_by_key and reduce_by_key. But the core functionality of VexCL is its kernel generation from vector expressions. If X, Y and Z are of type vex::vector, the expression

X = 2 * Y - sin(Z);

generates a single kernel that is automatically executed on all accelerators that the vectors occupy:

kernel void vexcl_vector_kernel(
    ulong n, global double * prm_1,
    int prm_2, global double * prm_3,
    global double * prm_4)
{
    for(size_t i = get_global_id(0); i < n; i += get_global_size(0)) {
        prm_1[i] = ( ( prm_2 * prm_3[i] ) - sin( prm_4[i] ) );
    }
}

This expression template mechanism has many features: numerous built-in functions, support for constants, access to vector indices, inclusion of user-defined functions, tagging of data to avoid reading memory more than once, temporary values, random numbers, permutations, slicing, reducing, reshaping, scattered data interpolation and fast Fourier transform. A performance improvement can be expected from expression-template-generated kernels, since such fused kernels save on memory reads and writes compared to individual calls to BLAS functions.
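
As an example of user-defined functions in vector expressions, the following sketch follows the squared_radius example from the VexCL documentation; inside the function body, parameters are referred to as prm1, prm2 and so on:

#include <vexcl/vexcl.hpp>

// a user-defined device function that can appear in vector expressions
VEX_FUNCTION(squared_radius, double(double, double),
    "return prm1 * prm1 + prm2 * prm2;");

// given vex::vector<double> X, Y and Z of equal size:
// Z = sqrt(squared_radius(X, Y));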

Programmers may also generate kernels by feeding a vex::symbolic type to an algorithm. The symbolic value records any arithmetic operation it is subjected to, and an accelerator kernel can be generated from that recording. As an alternative, the function generator accepts a function object that can then be used in vector expressions. And finally, a custom kernel can be specified in source code and registered with VexCL, either through the shorthand VEX_STRINGIZE_SOURCE macro or by creating an instance of vex::backend::kernel.

VexCL is interoperable with ViennaCL, Boost.Compute and CLOGS. The library is available on GitHub; the documentation is available here.

Denis Demidov mentions that performance results for VexCL are published in one of his papers and are included in one of his recent talks.

I would categorize VexCL as a high-productivity prototyping library that can also be used in scientific production code. It is particularly well suited to implementing numerical analysis. I see minor problems when it comes to kernel caching, the fact that the CUDA SDK is required at runtime, and the lack of elegant concurrency constructs. Nevertheless, the number of features in VexCL is enormous. VexCL is an excellent example of how sophisticated C++ programming techniques can lead to efficient code as well as a beautiful and concise interface.

C++ Accelerator Libraries

Tuesday, April 15th, 2014

In preparation for my C++Now talk entitled The Future of Accelerator Programming in C++, I am currently reviewing numerous C++ libraries. I put together a catalogue of questions for these reviews. The questions are intended to gauge the scope, use-cases, performance, quality and level of abstraction of each library.

Iqra: Read, image by Farrukh
  1. Is concurrency supported?
    Accelerators are massively parallel devices, but due to memory transfer overhead, concurrency is a central aspect of many efficient programs.
  2. How is memory managed?
    This is a central question since simple and efficient management of distributed memory is not trivial.
  3. What parallel primitives are provided?
    Parallel primitives are essential building blocks for many accelerator-enabled programs.
  4. How is numerical analysis supported?
    Massively parallel accelerator architectures lend themselves well to numerical analysis.
  5. How can users specify custom accelerator functions?
    A useful accelerator library should allow users to specify custom functions.
  6. What is the intended use-case for the library? Who is the target audience?
    Is the library suitable for e.g. high performance computing, prototyping or signal processing?
  7. What are noteworthy features of the library?

This is a list of all libraries that I am reviewing:

Library          CUDA    OpenCL    Other            Type¹
Thrust           X                 OMP, TBB         header
Bolt                     X²        TBB, C++ AMP     link
VexCL            X³      X                          header
Boost.Compute            X                          header
C++ AMP                  X⁴        DX11             compiler
SyCL                     X⁵                         compiler
ViennaCL         X       X         OMP              header
SkePU            X       X         OMP, seq         header
SkelCL                   X                          link
HPL                      X                          link
ArrayFire        X       X                          link
CLOGS                    X                          link
hemi             X                                  header
MTL4             X                                  header
Kokkos           X                 OMP, PTH, seq    link
Aura⁶            X       X                          header

If I missed a library, please let me know. I will add it immediately. I’m going to publish selected library reviews here on my blog. I’m hoping to discuss specific reviews with the original library authors. The conclusions of these reviews will be part of my talk at C++Now.


  1. either header-only library, link-library or library that requires compiler support
  2. custom AMD OpenCL Static C++ Kernel Language extension required
  3. CUDA SDK required at runtime
  4. Prototype implementation available here
  5. only specification released so far
  6. disclaimer: library developed by the author