Echelon Blog

Archive for December, 2016

Defining Data Decomposition across CUDA, OpenCL, Metal

Monday, December 12th, 2016

When solving computational problems on accelerators (for example, graphics processing units), a central challenge is to decompose the computation into many small, identical problems. These small problems are then mapped to the execution units of a given accelerator and solved in parallel.

Décomposition lumineuse, image by Groume

The three established low-level accelerator frameworks are CUDA, OpenCL and Metal. Besides mapping sub-problems to execution units, they also allow sub-problems to be grouped together. Groups of sub-problems have two interesting properties:

they can be synchronized and
they can share data.

Accelerator frameworks thus ask developers to define two layers of data decomposition: (1) the overall size of the problem space and (2) the size of a group of sub-problems.

Moving a problem from one accelerator framework to another, or implementing a solution on several frameworks at once, is where things get interesting. The frameworks are very similar, but the devil is in the details. There is no standard for mapping sub-problems to execution units. Both the naming conventions and the semantics differ:

	CUDA	OpenCL	Metal	Aura
level 1	grid	global work	thread groups	mesh
level 2	block	local work	threads per group	bundle
overall	grid * block	global work	thread groups * threads per group	mesh * bundle

I added to this table the naming convention and semantics of the Aura library, which is under development. The library wraps the three standard accelerator frameworks and exposes a single API for all three.

