Archive for September, 2014

C++ AMP

Monday, September 1st, 2014

The third post in my C++ accelerator library series is about C++ AMP. C++ AMP is many things: a programming model, a C++ library and an associated compiler extension. It can be used to program accelerators, that is GPUs and other massive parallel coprocessors. And it is an open specification that constitutes a novel way of describing data parallelism in C++. Very exciting.

Calculating Machine, image by Kim P

This is what C++ AMP looks like:

#include &lt;amp.h&gt;
using namespace concurrency;

const int size = 5;
int aCPP[] = {1, 2, 3, 4, 5};
int bCPP[] = {6, 7, 8, 9, 10};
int sumCPP[size];

array_view&lt;const int, 1&gt; a(size, aCPP), b(size, bCPP);
array_view&lt;int, 1&gt; sum(size, sumCPP);

parallel_for_each(
	sum.extent,
	[=](index&lt;1&gt; idx) restrict(amp)
	{
		sum[idx] = a[idx] + b[idx];
	}
);

I believe the appeal of this programming model is immediately visible in the code sample above. Existing, contiguous blocks of data that should be computed on are made known to the runtime through the array_view concept. Parallel computation is expressed inline through a call to the parallel_for_each function: the loop boundaries are given by the extent, and an index object is passed to the lambda, which describes the computation on a per-element basis. The programmer says: “do this for each element”. It is high-level, it is inline, it is terse, it is pretty.

The downsides of this model can be uncovered by asking a few key questions: When is data copied to the accelerator (if that is necessary)? When are results copied back? Can the underlying array be modified after the view is created? Is the meta-function synchronous or asynchronous? C++ AMP does a great job at behaving as one would expect without any extra effort when these questions are irrelevant. In my opinion this programming model was designed for an audience that does not ask these questions.

If these questions are important, things start to get a little messy. Copying data to the accelerator starts when parallel_for_each is invoked. Results are copied back when they are accessed from the host (through the view) or when they are explicitly synchronized with the underlying array. Whether the underlying array can be modified after the view is created depends on the platform. If it supports zero-copy, that is, host and accelerator share memory, the answer is: probably no. If the code is written with a distributed memory system in mind, the answer is: probably yes. The parallel_for_each function is asynchronous: it is queued and executed at some point in the future. It does not return a future, though. Instead, the data can be synchronized explicitly, and that operation returns a future (source).
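
A minimal sketch of what that explicit synchronization looks like, reusing the array_view idea from the example above; double_in_place is just an illustrative helper name:

#include &lt;amp.h&gt;
using namespace concurrency;

void double_in_place(array_view&lt;int, 1&gt; data)
{
	// Only queued at this point: parallel_for_each returns immediately.
	parallel_for_each(data.extent, [=](index&lt;1&gt; idx) restrict(amp)
	{
		data[idx] *= 2;
	});

	// parallel_for_each returned no future. To get one, synchronize
	// explicitly: synchronize_async copies the results back to the
	// underlying array and returns a waitable completion_future.
	completion_future done = data.synchronize_async();
	done.wait();
}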

C++ AMP has many more features beyond what the above code example shows. There are tiles that correspond to the well-known work-groups (OpenCL) or thread blocks (CUDA). There are accelerator and accelerator_view abstractions to select and manage accelerators. There are atomics, parallel_for_each throws exceptions for simplified error handling, and there is more.
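
To give a taste of these features, here is a sketch of a per-tile sum that also selects an accelerator_view explicitly and catches runtime errors. The helper tiled_sums is illustrative, and it assumes the input extent is a multiple of the tile size with one output element per tile:

#include &lt;amp.h&gt;
#include &lt;iostream&gt;
using namespace concurrency;

// Sums each tile of 256 elements into one output element.
// Assumes in.extent[0] is a multiple of 256 and
// out.extent[0] == in.extent[0] / 256.
void tiled_sums(array_view&lt;const int, 1&gt; in, array_view&lt;int, 1&gt; out)
{
	accelerator acc;  // the default accelerator
	try
	{
		parallel_for_each(
			acc.default_view,             // explicit accelerator_view
			in.extent.tile&lt;256&gt;(),
			[=](tiled_index&lt;256&gt; tidx) restrict(amp)
			{
				tile_static int local[256];  // shared within one tile
				local[tidx.local[0]] = in[tidx.global];
				tidx.barrier.wait();         // synchronize the tile
				if (tidx.local[0] == 0)
				{
					int s = 0;
					for (int i = 0; i &lt; 256; ++i) s += local[i];
					out[tidx.tile[0]] = s;
				}
			});
	}
	catch (const runtime_exception&amp; e)
	{
		// parallel_for_each reports failures by throwing.
		std::cerr &lt;&lt; e.what() &lt;&lt; std::endl;
	}
}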

There exists an implementation by Microsoft in their Visual Studio product line and a free implementation by AMD together with Microsoft. There is supposed to be an implementation by Intel called Shevlin Park, but I could not find anything about it beyond the much-cited presentation here. Indeed, in June of 2013 Intel proclaimed that they were looking into different solutions to ease the programming experience on their graphics processors, but that none were in a state they could share (source). Microsoft, on the other hand, is doing a good job of marketing and documenting C++ AMP. Their Parallel Programming in Native Code blog has some helpful entries, including for example “How to measure the performance of C++ AMP algorithms?”.

For me the central feature of C++ AMP is the elegant way kernels can be implemented inline. The concepts built around it are great but could be designed better. The problem is: if you go one level below what C++ AMP presents on the outside, things get messy. But it does not have to be this way. There are simple and well-understood concepts underlying the whole problem of accelerator/co-processor programming. I would love to see a well-thought-out low-level abstraction for accelerator programming that C++ AMP builds upon and that programmers can fall back to if circumstances require it.