VexCL

April 26, 2014 - 4 comments

This is the first post in my C++ accelerator library series. It is about VexCL, a header-only C++ library for accelerator programming, developed by Denis Demidov and others. The library is released under the MIT licence.

VexCL supports OpenCL and CUDA as accelerator backends. For the CUDA backend it is important to note that the CUDA SDK must be installed on systems running VexCL code, because kernel code is generated and compiled at runtime.

LAb[au] : Spectr|a|um, image by Marc Wathieu

The central memory abstraction in VexCL is vex::vector. The template represents contiguous data on one accelerator. It can also act as a segmented data container that manages disjoint blocks of memory on multiple accelerators. When choosing memory segment sizes, the library considers either device bandwidth measurements or a user-defined device weighting function. Explicit copy functions allow programmers to move data to, from and between accelerators. Both iterator and range-based syntax are supported. Additional data types include the sparse matrix type vex::SpMat and vex::multivector, which represents a list of vectors that can be processed in a single step.
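To make the segmentation idea a bit more concrete, here is a minimal host-side sketch of how n elements could be split into contiguous segments in proportion to per-device weights. It is plain C++ and not VexCL code: the helper name partition_by_weight and the weights are assumptions for illustration, and VexCL's own partitioning logic may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of VexCL): split n elements into contiguous
// segments whose sizes are proportional to non-negative per-device weights.
// Device d owns the half-open index range [offsets[d], offsets[d + 1]).
std::vector<std::size_t> partition_by_weight(std::size_t n,
                                             const std::vector<double>& weights)
{
    assert(!weights.empty());
    const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    assert(total > 0.0);

    std::vector<std::size_t> offsets(weights.size() + 1, 0);
    double running = 0.0;
    for (std::size_t d = 0; d < weights.size(); ++d) {
        running += weights[d];
        // Cumulative share of the vector assigned to devices 0..d.
        offsets[d + 1] = static_cast<std::size_t>(n * (running / total));
    }
    offsets.back() = n; // rounding must never drop trailing elements
    return offsets;
}
```

With weights {1.0, 3.0} and 1000 elements, the first device would own the elements [0, 250) and the second [250, 1000).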

Concurrent execution of multiple kernels or of copy and kernel operations is partially supported by VexCL. Copy operations are synchronous by default but can be configured to be asynchronous. Each container has associated command_queues that are used to enqueue operations. A command_queue has a finish() method that blocks until all commands in the queue have completed. This is not the most elegant way to handle concurrency, but VexCL does not abstract away the option for parallel execution of operations, which is nice.

VexCL supports numerous parallel primitives such as inclusive_scan, exclusive_scan, sort, sort_by_key and reduce_by_key. But the core functionality of VexCL is its mechanism for generating kernels from vector expressions. If X, Y and Z are of type vex::vector, the expression

X = 2 * Y - sin(Z);

generates a single kernel that is automatically executed on all accelerators that the vectors occupy:

kernel void vexcl_vector_kernel(
    ulong n, global double * prm_1,
    int prm_2, global double * prm_3,
    global double * prm_4)
{
    for(size_t i = get_global_id(0); i < n; i += get_global_size(0)) {
        prm_1[i] = ( ( prm_2 * prm_3[i] ) - sin( prm_4[i] ) );
    }
}
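For readers unfamiliar with expression templates, the following toy sketch shows the general idea of turning an expression such as 2 * Y - sin(Z) into kernel source text. Everything in the toy namespace is made up for illustration and is not VexCL's implementation; in particular, the constant is inlined as a literal here, whereas the kernel above receives it as the parameter prm_2.

```cpp
#include <sstream>
#include <string>

namespace toy {

// Terminal: a named vector, printed as an indexed access.
struct Vec {
    std::string name;
    std::string str() const { return name + "[i]"; }
};

// Terminal: an integer constant, printed as a literal.
struct Const {
    int value;
    std::string str() const { return std::to_string(value); }
};

// Inner node: a binary operation whose operand types are part of its type.
template <class L, class R>
struct Binary {
    L lhs;
    char op;
    R rhs;
    std::string str() const {
        return "( " + lhs.str() + " " + op + " " + rhs.str() + " )";
    }
};

// Inner node: application of sin to a sub-expression.
template <class A>
struct Sin {
    A arg;
    std::string str() const { return "sin( " + arg.str() + " )"; }
};

// The operators compute nothing; they only build the expression tree.
template <class R>
Binary<Const, R> operator*(int c, const R& r) { return {Const{c}, '*', r}; }

template <class L, class R>
Binary<L, R> operator-(const L& l, const R& r) { return {l, '-', r}; }

template <class A>
Sin<A> sin(const A& a) { return {a}; }

// "Assignment": emit one fused loop body for the whole expression.
template <class Expr>
std::string kernel_source(const Vec& out, const Expr& rhs)
{
    std::ostringstream src;
    src << "for(size_t i = get_global_id(0); i < n; i += get_global_size(0)) {\n"
        << "    " << out.str() << " = " << rhs.str() << ";\n"
        << "}\n";
    return src.str();
}

} // namespace toy
```

Given toy::Vec X{"X"}, Y{"Y"}, Z{"Z"}, the call toy::kernel_source(X, 2 * Y - sin(Z)) yields a loop whose body is X[i] = ( ( 2 * Y[i] ) - sin( Z[i] ) );. A real library would additionally collect the kernel parameters, compile the source and cache the result.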

This expression template mechanism has many features, including numerous built-ins, support for constants, access to vector indices, inclusion of user-defined functions, tagging of data to avoid reading memory more than once, temporary values, random numbers, permutations, slicing, reducing, reshaping, scattered data interpolation and fast Fourier transforms. Kernels generated from expression templates can be expected to perform better than a sequence of individual BLAS calls, because a fused kernel avoids the memory reads and writes for intermediate results.
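The saving in memory traffic can be counted by hand for X = 2 * Y - sin(Z). The sketch below is a simplified host-side model in plain C++ (the names Traffic, unfused and fused are my own, and caches are ignored): it evaluates the expression once as three separate BLAS-style passes with temporaries and once as a single fused loop, tallying element reads and writes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts element-wise loads and stores; a deliberately crude traffic model.
struct Traffic {
    std::size_t reads = 0;
    std::size_t writes = 0;
};

// One pass per operation, with temporaries in between.
std::vector<double> unfused(const std::vector<double>& y,
                            const std::vector<double>& z, Traffic& t)
{
    assert(y.size() == z.size());
    const std::size_t n = y.size();
    std::vector<double> t1(n), t2(n), x(n);

    for (std::size_t i = 0; i < n; ++i) t1[i] = 2 * y[i];        // t1 = 2 * Y
    t.reads += n;     t.writes += n;
    for (std::size_t i = 0; i < n; ++i) t2[i] = std::sin(z[i]);  // t2 = sin(Z)
    t.reads += n;     t.writes += n;
    for (std::size_t i = 0; i < n; ++i) x[i] = t1[i] - t2[i];    // X = t1 - t2
    t.reads += 2 * n; t.writes += n;
    return x;
}

// A single pass that never materialises the intermediate results.
std::vector<double> fused(const std::vector<double>& y,
                          const std::vector<double>& z, Traffic& t)
{
    assert(y.size() == z.size());
    const std::size_t n = y.size();
    std::vector<double> x(n);

    for (std::size_t i = 0; i < n; ++i) x[i] = 2 * y[i] - std::sin(z[i]);
    t.reads += 2 * n; t.writes += n;
    return x;
}
```

In this model the unfused version performs 4n reads and 3n writes, the fused version 2n reads and n writes. For a memory-bound operation like this one, that difference is what the fused kernel buys.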

Programmers may also generate kernels by feeding a vex::symbolic type to an algorithm. The symbol records any arithmetic operation it is subjected to and an accelerator kernel can be generated. As an alternative, the function generator also accepts a function object that can then be used in vector expressions. And finally, a custom kernel can be specified in source code and registered with VexCL through the shorthand VEX_STRINGIZE_SOURCE macro or by creating an instance of vex::backend::kernel.
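The recording idea can be illustrated with a toy symbolic type in plain C++. This is a sketch under my own assumptions and not the vex::symbolic interface: toy::Sym and toy::horner are hypothetical names, and the toy only records a single expression, whereas a real generator also has to handle assignments, temporaries and kernel parameters.

```cpp
#include <string>

namespace toy {

// Hypothetical stand-in for a symbolic value: instead of a number it
// stores the text of the expression that produced it.
struct Sym {
    std::string expr;
};

// Each operator records what was done rather than computing a result.
inline Sym operator+(const Sym& a, const Sym& b)
{
    return {"(" + a.expr + " + " + b.expr + ")"};
}

inline Sym operator*(const Sym& a, const Sym& b)
{
    return {"(" + a.expr + " * " + b.expr + ")"};
}

// A generic algorithm: quadratic polynomial evaluated with Horner's scheme.
// It works for any T that supports + and *, numeric or symbolic.
template <class T>
T horner(const T& x, const T& c0, const T& c1, const T& c2)
{
    return (c2 * x + c1) * x + c0;
}

} // namespace toy
```

Calling toy::horner(2.0, 1.0, 2.0, 3.0) computes the number 17, while calling it with Sym values named x, c0, c1 and c2 returns the text ((((c2 * x) + c1) * x) + c0), which could then be pasted into a kernel body.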

VexCL is interoperable with ViennaCL, Boost.Compute and CLOGS. The library is available on GitHub, and the documentation is available here.

Denis Demidov mentions that performance results for VexCL are published in one of his papers and included in one of his recent talks.

I would categorize VexCL as a high productivity prototyping library that can also be used in scientific production code. It is particularly well suited to implementing numerical analysis algorithms. I see minor problems in three areas: kernel caching, the requirement to have the CUDA SDK installed and the lack of elegant concurrency constructs. Nevertheless, the number of features in VexCL is enormous. VexCL is an excellent example of how sophisticated C++ programming techniques can lead to efficient code as well as a beautiful and concise interface.

4 Comments

#1 Jim wrote on April 27, 2014:

Great write-up and overall coverage. It would be great to see more detailed coverage, but this is very helpful and a great start to a series.

#2 Sebastian Schaetz wrote on April 27, 2014:

Jim, thank you for your comment. You are right of course. On Twitter @relativetoyou also asked for more examples and performance measurements. With regard to performance measurements, I suspect there are no big surprises when using VexCL. It might of course be interesting to do a platform comparison, but that is for now out of the scope of my reviews.

Edit: by “no big surprises” with regards to performance I meant to say that VexCL will most probably not reduce performance, so no negative surprises.

#3 Denis Demidov wrote on April 27, 2014:

Some performance results may be found in [1] (already cited in the comments to the parent post) and [2].

[1] Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries, http://arxiv.org/abs/1212.6326
[2] VexCL – a Vector Expression Template Library for OpenCL, https://speakerdeck.com/ddemidov/vexcl-at-pecos-university-of-texas-2013

#4 Sebastian Schaetz wrote on April 27, 2014:

Thanks Denis, I updated the review and link to your paper as well as slides.
