
Defining Data Decomposition across CUDA, OpenCL, Metal

Monday, December 12th, 2016

When attempting to solve computational problems using accelerators (for example, graphics processing units), a central challenge is to decompose the computation into many small, identical problems. These small problems are then mapped to the execution units of a given accelerator and solved in parallel.

Décomposition lumineuse, image by Groume

The three mainstream low-level accelerator frameworks are CUDA, OpenCL and Metal. Besides mapping sub-problems to execution units, they also allow sub-problems to be grouped together. Groups of sub-problems have certain interesting properties:

  • they can be synchronized and
  • they can share data.

Accelerator frameworks thus ask developers to define two layers of data decomposition: (1) the overall size of the problem space and (2) the size of a group of sub-problems.
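The two layers can be illustrated with plain arithmetic. The following is a minimal sketch in Python, not framework code; the names `decompose`, `problem_size` and `group_size` are chosen here purely for illustration. It computes how many groups are needed to cover a one-dimensional problem space and which sub-problems end up in each group.

```python
def decompose(problem_size, group_size):
    """Split range(problem_size) into consecutive groups of at most group_size.

    Returns a list of groups; each group is a list of sub-problem indices.
    Both parameter names are illustrative and not taken from any framework.
    """
    if problem_size < 0 or group_size <= 0:
        raise ValueError("problem_size must be >= 0 and group_size must be > 0")

    # Ceiling division: the last group may be only partially filled.
    num_groups = (problem_size + group_size - 1) // group_size

    groups = []
    for group_id in range(num_groups):
        start = group_id * group_size
        end = min(start + group_size, problem_size)
        # Overall index of a sub-problem = group_id * group_size + local index.
        groups.append(list(range(start, end)))
    return groups


if __name__ == "__main__":
    for group_id, members in enumerate(decompose(10, 4)):
        print(group_id, members)
```

On a real accelerator the last group is typically still launched at full size, and the kernel itself skips indices that fall outside the problem space.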

Moving a problem from one accelerator framework to another, or implementing a solution using multiple accelerator frameworks, can be interesting. The frameworks are very similar, but the devil is in the details. There is no standard for mapping sub-problems to execution units, and both the naming conventions and the semantics differ:

          CUDA           OpenCL         Metal                                Aura
level 1   grid           global work    threads per group                    mesh
level 2   block          local work     thread groups                        bundle
overall   grid * block   global work    threads per group * thread groups    mesh * bundle
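To make the difference in semantics concrete, here is a small sketch in Python that derives the values each of the three frameworks would expect for the same one-dimensional problem. The function name `launch_parameters` and the dictionary keys are illustrative only and simply reuse the terms from the table. The sketch assumes the problem is padded to a whole number of groups, which is a simplification made here. Aura is left out because its API is not described in this post.

```python
def launch_parameters(problem_size, group_size):
    """Return per-framework decomposition values for a 1D problem.

    Assumption (for illustration): the problem is padded up to a whole
    number of groups, so every group is launched at full size.
    """
    if problem_size <= 0 or group_size <= 0:
        raise ValueError("problem_size and group_size must be positive")

    num_groups = (problem_size + group_size - 1) // group_size  # ceiling division
    padded_size = num_groups * group_size

    return {
        # CUDA: the grid counts blocks, the block counts threads per block.
        "CUDA": {
            "grid": num_groups,
            "block": group_size,
            "overall": num_groups * group_size,
        },
        # OpenCL: the global work size already is the overall size.
        "OpenCL": {
            "global work": padded_size,
            "local work": group_size,
            "overall": padded_size,
        },
        # Metal: threads per group times number of thread groups.
        "Metal": {
            "threads per group": group_size,
            "thread groups": num_groups,
            "overall": group_size * num_groups,
        },
    }


if __name__ == "__main__":
    for framework, values in launch_parameters(1000, 256).items():
        print(framework, values)
```

The point of the sketch is that the same pair of numbers cannot be passed unchanged to all three frameworks: CUDA and Metal take a group count and a group size, while OpenCL takes the overall size and a group size.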

I added to this table the naming convention and semantics of the Aura library, which is under development. The library wraps the three standard accelerator frameworks and exposes a single API for all three.