StarPU

A Unified Runtime System for Heterogeneous Multicore Architectures

Overview

StarPU is a task programming library for hybrid architectures:

  1. The application provides algorithms and constraints
    • CPU/GPU implementations of tasks
    • A graph of tasks, using either StarPU's high-level GCC plugin pragmas or StarPU's rich C API

  2. StarPU handles run-time concerns
    • Task dependencies
    • Optimized heterogeneous scheduling
    • Optimized data transfers and replication between main memory and discrete memories
    • Optimized cluster communications

Rather than handling low-level issues, programmers can concentrate on algorithmic concerns!
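
Here is a minimal sketch of what this looks like with the C API (the kernel name scale_cpu and the scaling operation are made up for the example, error checking is omitted, and STARPU_MAIN_RAM is the 1.2-era name of the main-memory node):

#include <starpu.h>

/* CPU implementation of the task: scale a vector in place. */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    float *val = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned i;
    (void) cl_arg;
    for (i = 0; i < n; i++)
        val[i] *= 2.0f;
}

/* The codelet: StarPU's offloadable task abstraction. */
static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float vector[1024];
    starpu_data_handle_t handle;
    unsigned i;

    for (i = 0; i < 1024; i++)
        vector[i] = 1.0f;

    starpu_init(NULL);

    /* Hand the vector over to StarPU's data management. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t) vector, 1024, sizeof(float));

    /* Submit two tasks; StarPU infers the dependency between them
     * from their read-write accesses to the same data. */
    starpu_task_insert(&scale_cl, STARPU_RW, handle, 0);
    starpu_task_insert(&scale_cl, STARPU_RW, handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}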

The StarPU documentation is available in PDF and in HTML. Please note that these documents are up-to-date with the latest release of StarPU.

News

May 2015 » The second release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

April 2015 » A tutorial on runtime systems including StarPU will be given at INRIA Bordeaux in June 2015.

March 2015 » The first release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

March 2015 » The v1.1.4 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computation resources.

September 2014 » The v1.1.3 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computation resources.

June 2014 » The v1.1.2 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computation resources.

May 2014 » Open Engineer Position.

Get the latest StarPU news by subscribing to the starpu-announce mailing list. See also the full news.

Contact

For any questions regarding StarPU, please contact the StarPU developers mailing list.

starpu-devel@lists.gforge.inria.fr

Features

Portability

Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. In case a codelet may run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, including multiple GPUs. One can even specify several functions for each architecture (new in v1.0), as well as parallel implementations (e.g. in OpenMP), and StarPU will automatically determine which version is best for each input size (new in v0.9).
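
For instance, a codelet can list one or several implementations per architecture. A sketch (the kernel names are hypothetical, and the existing kernels are simply wrapped):

#include <starpu.h>

/* Hypothetical existing kernels, wrapped as codelet implementations. */
extern void scal_cpu(void *buffers[], void *cl_arg);     /* plain C version */
extern void scal_sse(void *buffers[], void *cl_arg);     /* SSE-optimized version */
extern void scal_cuda(void *buffers[], void *cl_arg);    /* launches a CUDA kernel */
extern void scal_opencl(void *buffers[], void *cl_arg);  /* enqueues an OpenCL kernel */

static struct starpu_codelet scal_cl = {
    /* several CPU implementations: StarPU picks the best one per input size */
    .cpu_funcs    = { scal_cpu, scal_sse },
    .cuda_funcs   = { scal_cuda },
    .opencl_funcs = { scal_opencl },
    .nbuffers     = 1,
    .modes        = { STARPU_RW },
};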

Data transfers

To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all the data it needs are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as further tasks need them. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which makes it possible to overlap data transfers with computations (including direct GPU-GPU transfers) and get the most out of the architecture.
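
When the application itself needs to look at a piece of data, it can ask StarPU for a coherent copy back in main memory. A small sketch (the handle is assumed to have been registered earlier):

#include <starpu.h>

/* After tasks accessing `handle' have been submitted, get a coherent
 * view of the data back in main memory to read the result. */
void read_result(starpu_data_handle_t handle)
{
    starpu_data_acquire(handle, STARPU_R);  /* waits for the tasks, fetches the data */
    /* ... read the data through its registered main-memory pointer ... */
    starpu_data_release(handle);
}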

Dependencies

Dependencies between tasks can be expressed in several ways, to give the programmer maximum flexibility:
  • implicitly, inferred by StarPU from the tasks' data access modes (sequential consistency),
  • explicitly, declared between tasks,
  • explicitly, through tags.

StarPU also supports an OpenMP-like reduction access mode (new in v0.9).

It also supports a commute access mode to allow data access commutativity (new in v1.2).
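
As a sketch of the explicit form (cl is a hypothetical codelet and ha, hb already-registered data handles), a dependency can be declared directly between two tasks:

#include <starpu.h>

extern struct starpu_codelet cl;         /* some codelet (hypothetical) */
extern starpu_data_handle_t ha, hb;      /* some registered data (hypothetical) */

void submit_two_tasks(void)
{
    struct starpu_task *task_a = starpu_task_create();
    struct starpu_task *task_b = starpu_task_create();

    task_a->cl = &cl;
    task_a->handles[0] = ha;

    task_b->cl = &cl;
    task_b->handles[0] = hb;

    /* Explicit dependency: task_b will not start before task_a has
     * completed, regardless of the data they access. */
    starpu_task_declare_deps_array(task_b, 1, &task_a);

    starpu_task_submit(task_a);
    starpu_task_submit(task_b);
}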

Heterogeneous Scheduling

StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These determine the relative performance achieved by the different processing units for the various kinds of tasks, and thus automatically let each processing unit execute the tasks it is best at. Various strategies and variants are available: dmda (a data-aware MCT strategy, thus similar to HEFT, but which starts executing tasks before the whole task graph is submitted, thus allowing dynamic task submission), eager, locality-aware work-stealing, ... The overhead per task is typically on the order of a microsecond. Tasks should thus be a few orders of magnitude bigger, such as 100 microseconds or 1 millisecond, to make the overhead negligible.
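
As a sketch, an auto-tuned, history-based performance model can be attached to a codelet (the kernel and symbol names are made up for the example); model-based schedulers such as dmda then place each task on the processing unit expected to complete it the soonest:

#include <starpu.h>

extern void scal_cpu(void *buffers[], void *cl_arg);   /* hypothetical kernels */
extern void scal_cuda(void *buffers[], void *cl_arg);

/* History-based model: StarPU measures and records execution times per
 * processing unit and per input size under the symbol "scal". */
static struct starpu_perfmodel scal_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "scal",
};

static struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu },
    .cuda_funcs = { scal_cuda },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },
    .model      = &scal_model,
};

The scheduling policy itself is selected at run time, e.g. by running the application with STARPU_SCHED=dmda.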

Clusters

To deal with clusters, StarPU can nicely integrate with MPI through explicit network communications, which are then automatically combined and overlapped with the intra-node data transfers and computation. The application can also just provide the whole task graph and a data distribution over the MPI nodes, and StarPU will automatically determine which MPI node should execute which task and generate all the required MPI communications accordingly (new in v0.9). We have obtained excellent scaling on a 144-node cluster with GPUs; we have not yet had the opportunity to test on a larger cluster. We have however measured that with naive task submission it should scale to a thousand nodes, and with pruning-tuned task submission it should scale to about a million nodes.
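
A sketch of the distributed case, assuming the StarPU-MPI API of the 1.2 series (starpu_mpi_task_insert was called starpu_mpi_insert_task in earlier releases; cl is a hypothetical codelet):

#include <starpu.h>
#include <starpu_mpi.h>

extern struct starpu_codelet cl;   /* some codelet (hypothetical) */

int main(int argc, char **argv)
{
    starpu_data_handle_t handle;
    float x = 42;
    int rank;

    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1);   /* 1: let StarPU call MPI_Init */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every node registers the same data and declares its owner. */
    if (rank == 0)
        starpu_variable_data_register(&handle, STARPU_MAIN_RAM,
                                      (uintptr_t) &x, sizeof(x));
    else
        starpu_variable_data_register(&handle, -1, 0, sizeof(x));
    starpu_mpi_data_register(handle, /* MPI tag */ 0, /* owner rank */ 0);

    /* Every node submits the same task graph; StarPU-MPI decides where
     * each task runs and generates the required communications. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &cl, STARPU_RW, handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}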

Out of core

When the memory is not big enough for the working set, one may have to resort to using disks. StarPU makes this seamless thanks to its out-of-core support (new in v1.2): StarPU will automatically evict data from the main memory in advance, and prefetch required data back before tasks need them.
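
A sketch, assuming the starpu_disk_register call and the unistd disk backend documented for the 1.2 series (the path and size are arbitrary):

#include <starpu.h>

/* Register 200 MB of disk space under /tmp as an additional memory
 * node: StarPU can then evict data to disk and prefetch it back. */
void setup_out_of_core(void)
{
    starpu_disk_register(&starpu_disk_unistd_ops, (void *) "/tmp",
                         1024UL * 1024 * 200);
}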

Extensions to the C Language

StarPU comes with a GCC plug-in that extends the C programming language with pragmas and attributes that make it easy to annotate a sequential C program to turn it into a parallel StarPU program (new in v1.0).
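
A sketch of what this looks like, following the attribute and pragma names documented for the plug-in (the vector_scal example is hypothetical and syntax details may differ; see the StarPU documentation):

/* Declare a task and one CPU implementation of it. */
static void vector_scal (unsigned size, float vector[size], float factor)
  __attribute__ ((task));

static void vector_scal_cpu (unsigned size, float vector[size], float factor)
  __attribute__ ((task_implementation ("cpu", vector_scal)));

static void vector_scal_cpu (unsigned size, float vector[size], float factor)
{
  unsigned i;
  for (i = 0; i < size; i++)
    vector[i] *= factor;
}

int main (void)
{
  static float vector[1024];
#pragma starpu initialize
#pragma starpu register vector
  vector_scal (1024, vector, 3.0f);  /* submitted as an asynchronous task */
#pragma starpu wait
#pragma starpu unregister vector
#pragma starpu shutdown
  return 0;
}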

OpenCL-compatible interface

StarPU provides an OpenCL-compatible interface, SOCL, which makes it possible to simply run OpenCL applications on top of StarPU (new in v1.0).

Simulation support

StarPU can simulate an application execution very accurately and measure the resulting performance thanks to the SimGrid simulator (new in v1.1). This makes it possible to quickly experiment with various scheduling heuristics, various application algorithms, and even various platforms (available GPUs and CPUs, available bandwidth)!

All in all

All this means that, with the help of StarPU's extensions to the C language, the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also valid StarPU code, possibly running on all the CPUs and GPUs; and given a data distribution over MPI nodes, it even becomes a distributed version!

for (k = 0; k < tiles; k++) {
  potrf(A[k,k])
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m, m])
  for (m = k+1; m < tiles; m++)
    for (n = k+1; n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])
}

Supported Architectures

  • Multicore CPUs
  • NVIDIA GPUs (with CUDA)
  • OpenCL devices

and soon (in v1.2):

  • Intel MIC / Xeon Phi

Supported Operating Systems

  • GNU/Linux
  • Mac OS X
  • Windows

Performance analysis tools

In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behaviour of the applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized thanks to the ViTE (Visual Trace Explorer) open source tool.

Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers from a lack of parallelism. Measured speed: 175.32 GFlop/s

LU decomposition (greedy)

This second trace depicts the behaviour of the same application using a scheduling strategy trying to minimize load imbalance thanks to auto-tuned performance models and to keep data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s

LU decomposition (dmda)

Temanejo can be used to debug the task graph, as shown below (new in v1.1).

Software using StarPU

Some software is known to be able to use StarPU to tackle heterogeneous architectures; here is a non-exhaustive list:

You can find below the list of publications related to applications using StarPU.

Give it a try!

You can easily try the performance on the Cholesky factorization, for instance. Make sure to have pkg-config and hwloc installed for proper CPU control, as well as BLAS kernels for your computation units (e.g. MKL for CPUs and CUBLAS for GPUs) installed and configured in your environment.

$ wget http://starpu.gforge.inria.fr/files/starpu-someversion.tar.gz
$ tar xf starpu-someversion.tar.gz
$ cd starpu-someversion
$ ./configure
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ STARPU_SCHED=dmdas mpirun -np 4 -machinefile mymachines ./mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((960*40*4)) -nblocks $((40*4))

Note that the dmdas scheduler uses performance models, and thus needs a few calibration executions before exhibiting optimized performance (until the "model something is not calibrated enough" messages go away).
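
If needed, calibration can also be forced explicitly through the STARPU_CALIBRATE environment variable, for instance:

$ STARPU_CALIBRATE=1 STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40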

To get a glimpse at what happened, you can get an execution trace by installing FxT and ViTE, and enabling traces:

$ ./configure --with-fxt
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ ./tools/starpu_fxt_tool -i /tmp/prof_file_${USER}_0
$ vite paje.trace

Starting with StarPU 1.1, it is also possible to reproduce the performance that we show in our articles on our machines, by installing SimGrid and then using StarPU's simulation mode with the performance models of our machines:

$ ./configure --enable-simgrid
$ make -j 12
$ STARPU_PERF_MODEL_DIR=$PWD/tools/perfmodels/sampling STARPU_HOSTNAME=mirage STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
# size	ms	GFlops
38400	10216	1847.6

(MPI simulation is not supported yet)

Publications

All StarPU-related publications are also listed here with the corresponding BibTeX entries.

A good overview is available in the following Research Report.

General presentations

  1. C. Augonnet.
    Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective. PhD thesis, Université Bordeaux 1, December 2011.
    Available here.
  2. C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
    StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23:187-198, February 2011.
    Available here.
  3. C. Augonnet.
    StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes. In 19èmes Rencontres Francophones du Parallélisme, September 2009. Note: Best Paper Award.
    Available here. (French version)
  4. C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier.
    StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Proceedings of the 15th International Euro-Par Conference, volume 5704 of LNCS, August 2009.
    Available here. (short version)
  5. C. Augonnet and R. Namyst.
    A unified runtime system for heterogeneous multicore architectures. In Proceedings of the International Euro-Par Workshops 2008, HPPC'08, volume 5415 of LNCS, August 2008.
    Available here. (early version)

On Composability

  1. A. Hugo, A. Guermouche, R. Namyst, and P.-A. Wacrenier.
    Composing multiple StarPU applications over heterogeneous machines: a supervised approach. In Third International Workshop on Accelerators and Hybrid Exascale Systems, Boston, USA, May 2013.
    Available here.
  2. A. Hugo.
    Le problème de la composition parallèle : une approche supervisée. In 21èmes Rencontres Francophones du Parallélisme (RenPar'21), Grenoble, France, January 2013.
    Available here.

On Scheduling

  1. E. Agullo, O. Beaumont, L. Eyraud-Dubois, J. Herrmann, S. Kumar, L. Marchal, and S. Thibault.
    Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms. In Heterogeneity in Computing Workshop 2015, Hyderabad, India, May 2015.
    Available here.
  2. M. Sergent and S. Archipoff.
    Modulariser les ordonnanceurs de tâches : une approche structurelle. In Conférence d’informatique en Parallélisme, Architecture et Système (Compas'2014), Neuchâtel, Switzerland, April 2014.
    Available here.

On the C Extensions

  1. L. Courtès.
    C Language Extensions for Hybrid CPU/GPU Programming with StarPU.
    Available here.

On OpenMP support on top of StarPU

  1. P. Virouleau, P. Brunet, F. Broquedis, N. Furmento, S. Thibault, O. Aumage, and T. Gautier.
    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite. In 10th International Workshop on OpenMP, IWOMP2014, September 2014.
    Available here.

On MPI support

  1. C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault.
    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators. INRIA Research Report RR-8538, May 2014.
    Available here.
  2. C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault.
    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators. In EuroMPI 2012, volume 7490 of LNCS, September 2012. Note: Poster Session.
    Available here.

On data transfer management

  1. C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst.
    Data-Aware Task Scheduling on Multi-Accelerator based Platforms. In The 16th International Conference on Parallel and Distributed Systems (ICPADS), December 2010.
    Available here.

On performance model tuning

  1. C. Augonnet, S. Thibault, and R. Namyst.
    Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures. In Proceedings of the International Euro-Par Workshops 2009, HPPC'09, volume 6043 of LNCS, August 2009.
    Available here.

On the simulation support through SimGrid

  1. L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut.
    Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures. In Euro-Par 2014 - 20th International Conference on Parallel Processing, Porto, Portugal, August 2014.
    Available here.

On the Cell support

  1. C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis.
    Exploiting the Cell/BE architecture with the StarPU unified runtime system. In SAMOS Workshop - International Workshop on Systems, Architectures, Modeling, and Simulation, volume 5657 of LNCS, July 2009.
    Available here.

On Applications

  1. S. Henry, A. Denis, D. Barthou, M.-C. Counilh, and R. Namyst.
    Toward OpenCL Automatic Multi-Device Support. In Euro-Par 2014, Porto, Portugal, August 2014.
    Available here.
  2. X. Lacoste, M. Faverge, P. Ramet, S. Thibault, and G. Bosilca.
    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes. In HCW'2014 workshop of IPDPS, May 2014.
    Available here.
  3. T. Odajima, T. Boku, M. Sato, T. Hanawa, Y. Kodama, R. Namyst, S. Thibault, and O. Aumage.
    Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing. In The 2013 International Symposium on Advances of Distributed and Parallel Computing (ADPC 2013), Vietri sul Mare, Italy, December 2013.
    Available here.
  4. S. Henry.
    Modèles de programmation et supports exécutifs pour architectures hétérogènes. PhD thesis, Université Bordeaux 1, November 2013.
    Available here.
  5. S. Ohshima, S. Katagiri, K. Nakajima, S. Thibault, and R. Namyst.
    Implementation of FEM Application on GPU with StarPU. In SIAM CSE13 - SIAM Conference on Computational Science and Engineering 2013, Boston, USA, February 2013.
    Available here.
  6. C. Rossignon.
    Optimisation du produit matrice-vecteur creux sur architecture GPU pour un simulateur de réservoir. In 21èmes Rencontres Francophones du Parallélisme (RenPar'21), Grenoble, France, January 2013.
    Available here.
  7. S. Henry, A. Denis, and D. Barthou.
    Programmation unifiée multi-accélérateur OpenCL. Techniques et Sciences Informatiques, (8-9-10):1233-1249, 2012.
    Available here.
  8. S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.
    Traitements d'Images sur Architectures Parallèles et Hétérogènes. Technique et Science Informatiques, 2012.
    Available here.
  9. S. Benkner, S. Pllana, J.L. Träff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov.
    PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems. IEEE Micro, 31(5):28-41, September 2011.
    Available here.
  10. U. Dastgeer, C. Kessler, and S. Thibault.
    Flexible runtime support for efficient skeleton programming on hybrid systems. In Proceedings of the International Conference on Parallel Computing (ParCo), Applications, Tools and Techniques on the Road to Exascale Computing, volume 22 of Advances of Parallel Computing, August 2011.
    Available here.
  11. S. Henry.
    Programmation multi-accélérateurs unifiée en OpenCL. In 20èmes Rencontres Francophones du Parallélisme (RenPar'20), May 2011.
    Available here.
  12. S.A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault.
    Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes. In 20èmes Rencontres Francophones du Parallélisme, May 2011.
    Available here.
  13. E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov.
    A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs. In GPU Computing Gems, volume 2., September 2010.
    Available here.
  14. E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov.
    QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators. In 25th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2011), May 2011.
    Available here.
  15. E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, J. Roman, S. Thibault, and S. Tomov.
    Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. In Symposium on Application Accelerators in High Performance Computing (SAAHPC), July 2010.
    Available here.
  16. E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov.
    LU factorization for accelerator-based systems. In 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), June 2011.
    Available here.

Last updated on 2012/10/03.