Defining Data Decomposition across CUDA, OpenCL, Metal

December 12, 2016

When attempting to solve computational problems using accelerators (for example, graphics processing units), a central challenge is to decompose the computation into many small, identical problems. These small problems are then mapped to the execution units of a given accelerator and solved in parallel.

Image: Décomposition lumineuse, by Groume

The three best-known low-level accelerator frameworks are CUDA, OpenCL and Metal. Besides mapping sub-problems to execution units, they also allow sub-problems to be grouped together. Groups of sub-problems have two interesting properties:

they can be synchronized and
they can share data.

Accelerator frameworks thus ask developers to define two layers of data decomposition: (1) the overall size of the problem space and (2) the size of a group of sub-problems.
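The two layers can be illustrated with a minimal C++ sketch for a one-dimensional problem. The names `Decomposition`, `decompose`, `global_index` and `in_range` are hypothetical helpers made up for this example and do not belong to any of the frameworks. The sketch assumes the number of groups is rounded up so that the groups cover the whole problem space, which means the last group may contain padding sub-problems that have to be skipped.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: a two-level decomposition of a 1D problem.
struct Decomposition {
    std::size_t num_groups;  // how many groups are needed to cover the problem
    std::size_t group_size;  // number of sub-problems in each group
};

// Round up so that num_groups * group_size >= problem_size.
inline Decomposition decompose(std::size_t problem_size, std::size_t group_size) {
    assert(group_size > 0);
    return {(problem_size + group_size - 1) / group_size, group_size};
}

// Position of a sub-problem in the overall problem space, computed from
// the index of its group and its index within that group.
inline std::size_t global_index(const Decomposition& d,
                                std::size_t group, std::size_t local) {
    assert(group < d.num_groups && local < d.group_size);
    return group * d.group_size + local;
}

// Sub-problems that only exist because of rounding up must do no work.
inline bool in_range(std::size_t index, std::size_t problem_size) {
    return index < problem_size;
}
```

With a problem of 1000 elements and groups of 256, `decompose` yields 4 groups, so 1024 sub-problems are launched and the last 24 fail the `in_range` check.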

Moving a problem from one accelerator framework to another, or implementing a solution using multiple accelerator frameworks, can be interesting. The frameworks are very similar, but the devil is in the details. There is no standard for mapping sub-problems to execution units, and both the naming conventions and the semantics differ:

	CUDA	OpenCL	Metal	Aura
level 1	grid	global work	threads per group	mesh
level 2	block	local work	thread groups	bundle
overall	grid * block	global work	threads per group * thread groups	mesh * bundle
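The difference in the "overall" row can be sketched in plain C++ by computing the sizes each framework expects for the same one-dimensional problem. The struct and function names below are hypothetical and exist only for this illustration. The sketch assumes that CUDA's `<<<grid, block>>>` launch takes a number of blocks and a number of threads per block, that OpenCL's `clEnqueueNDRangeKernel` takes the total number of work-items as the global size (padded here to a multiple of the local size, as OpenCL 1.x requires), and that Metal's `dispatchThreadgroups:threadsPerThreadgroup:` takes a number of threadgroups and a number of threads per threadgroup.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of groups needed to cover n sub-problems.
inline std::size_t group_count(std::size_t n, std::size_t group_size) {
    assert(group_size > 0);
    return (n + group_size - 1) / group_size;
}

// CUDA: the grid counts blocks, the block counts threads.
struct CudaLaunch {
    std::size_t grid;   // number of blocks
    std::size_t block;  // threads per block
    std::size_t total() const { return grid * block; }
};

// OpenCL: the global size already counts all work-items.
struct OpenClLaunch {
    std::size_t global_work;  // total number of work-items
    std::size_t local_work;   // work-items per work-group
    std::size_t total() const { return global_work; }
};

// Metal: threadgroups and threads per threadgroup are multiplied.
struct MetalLaunch {
    std::size_t threadgroups;
    std::size_t threads_per_threadgroup;
    std::size_t total() const { return threadgroups * threads_per_threadgroup; }
};

inline CudaLaunch cuda_launch(std::size_t n, std::size_t g) {
    return {group_count(n, g), g};
}

inline OpenClLaunch opencl_launch(std::size_t n, std::size_t g) {
    // Pad the global size up to a whole number of work-groups.
    return {group_count(n, g) * g, g};
}

inline MetalLaunch metal_launch(std::size_t n, std::size_t g) {
    return {group_count(n, g), g};
}
```

For 1000 sub-problems in groups of 256, all three launches cover 1024 sub-problems, but CUDA and Metal are given 4 as the first size while OpenCL is given 1024.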

I added to this table the naming convention and semantics of the Aura library, which is under development. The library wraps the three standard accelerator frameworks and exposes a single API for all of them.
