StarPU

A Unified Runtime System for Heterogeneous Multicore Architectures

Overview

StarPU is a task programming library for hybrid architectures:

  1. The application provides algorithms and constraints
    • CPU/GPU implementations of tasks
    • A graph of tasks, using either StarPU's high-level GCC plugin pragmas or its rich C API

  2. StarPU handles run-time concerns
    • Task dependencies
    • Optimized heterogeneous scheduling
    • Optimized data transfers and replication between main memory and discrete memories
    • Optimized cluster communications

Rather than handling low-level issues, programmers can concentrate on algorithmic concerns!

The StarPU documentation is available in PDF and in HTML. Please note that these documents are up-to-date with the latest release of StarPU.

News

August 2016 » The v1.1.6 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computing resources.

August 2016 » The 1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

August 2016 » The sixth (and really hopefully last) release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

March 2016 » Engineer job offer at Inria: more details on the job and on how to apply are available here.

December 2015 » The fifth (and hopefully last) release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

September 2015 » The v1.1.5 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computing resources.

August 2015 » The fourth release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

July 2015 » The third release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

May 2015 » The second release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

April 2015 » A tutorial on runtime systems including StarPU will be given at INRIA Bordeaux in June 2015.

March 2015 » The first release candidate of the v1.2.0 release of StarPU is now available! This release notably brings out-of-core support, MIC Xeon Phi support, OpenMP runtime support, and a new internal communication system for MPI.

March 2015 » The v1.1.4 release of StarPU is now available! This release notably brings the concept of scheduling contexts, which makes it possible to separate computing resources.

Get the latest StarPU news by subscribing to the starpu-announce mailing list. See also the full news.

Video Conference

A video recording (26 minutes) of a presentation at the XDC2014 conference gives an overview of StarPU (slides).

Contact

For any questions regarding StarPU, please contact the StarPU developers mailing list.

starpu-devel@lists.gforge.inria.fr

Features

Portability

Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. If a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, including multiple GPUs. One can even specify several functions for each architecture (new in v1.0), as well as parallel implementations (e.g. in OpenMP), and StarPU will automatically determine which version is best for each input size (new in v0.9).

Data transfers

To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all of its data are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as further tasks need them. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which makes it possible to overlap data transfers with computations (including GPU-GPU direct transfers) and get the most out of the architecture.

Dependencies

Dependencies between tasks can be expressed in several ways, to give the programmer the best flexibility.

These dependencies are computed in a completely decentralized way.

StarPU also supports an OpenMP-like reduction access mode (new in v0.9).

It also supports a commute access mode to allow data access commutativity (new in v1.2).

Heterogeneous Scheduling

StarPU achieves portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These models determine the relative performance achieved by the different processing units for the various kinds of task, and thus make it possible to automatically let each processing unit execute the tasks it is best at. Various strategies and variants are available; some are centralized, but most are completely distributed: dmda (a data-locality-aware MCT strategy, similar to HEFT but starting to execute tasks before the whole task graph is submitted, which allows dynamic task submission and a decentralized scheduler), eager (a simple centralized queue), decentralized locality-aware work-stealing, etc. The overhead per task is typically on the order of a microsecond. Tasks should thus be a few orders of magnitude bigger, such as 100 microseconds or 1 millisecond, for the overhead to be negligible.

Clusters

To deal with clusters, StarPU can nicely integrate with MPI through explicit network communications, which are then automatically combined and overlapped with the intra-node data transfers and computation. The application can also simply provide the whole task graph and a data distribution over the MPI nodes, and StarPU will automatically determine which MPI node should execute which task and generate all the required MPI communications accordingly (new in v0.9). We have obtained excellent scaling on a 144-node cluster with GPUs; we have not yet had the opportunity to test on a larger cluster. We have however measured that with naive task submission it should scale to a thousand nodes, and with pruning-tuned task submission to about a million nodes.

Out of core

When the main memory is not large enough for the working set, one may have to resort to disks. StarPU makes this seamless thanks to its out-of-core support (new in v1.2): it automatically evicts data from the main memory in advance, and prefetches data back before tasks need it.

Extensions to the C Language

StarPU comes with a GCC plug-in that extends the C programming language with pragmas and attributes that make it easy to annotate a sequential C program to turn it into a parallel StarPU program (new in v1.0).

OpenMP 4-compatible interface

K'Star provides an OpenMP 4-compatible interface on top of StarPU. This makes it possible to rebuild OpenMP applications with the K'Star source-to-source compiler, then build the result with the usual compiler; the resulting program uses the StarPU runtime.

K'Star also provides some extensions to the OpenMP 4 standard, to let the StarPU runtime perform online optimizations.

OpenCL-compatible interface

StarPU provides an OpenCL-compatible interface, SOCL, which makes it possible to simply run OpenCL applications on top of StarPU (new in v1.0).

Simulation support

StarPU can very accurately simulate an application execution and measure the resulting performance thanks to the SimGrid simulator (new in v1.1). This makes it possible to quickly experiment with various scheduling heuristics, various application algorithms, and even various platforms (available GPUs and CPUs, available bandwidth)!

All in all

All this means that, with the help of StarPU's extensions to the C language, the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also valid StarPU code, possibly running on all the CPUs and GPUs; given a data distribution over MPI nodes, it is even a distributed version!

for (k = 0; k < tiles; k++) {
  potrf(A[k,k])
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m,m])
  for (m = k+1; m < tiles; m++)
    for (n = k+1; n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])
}

Supported Architectures

and soon (in v1.2)

Supported Operating Systems

Performance analysis tools

In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behaviour of applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized with the ViTE (Visual Trace Explorer) open-source tool.

Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers from a lack of parallelism. Measured speed: 175.32 GFlop/s

LU decomposition (greedy)

This second trace depicts the behaviour of the same application using a scheduling strategy that tries to minimize load imbalance thanks to auto-tuned performance models, while keeping data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s

LU decomposition (dmda)

Temanejo can be used to debug the task graph, as shown below (new in v1.1).

Software using StarPU

Some software is known to be able to use StarPU to tackle heterogeneous architectures; here is a non-exhaustive list (feel free to ask to be added to the list!):

You can find below the list of publications related to applications using StarPU.

Give it a try!

You can easily try the performance on the Cholesky factorization, for instance. Make sure pkg-config and hwloc are installed for proper CPU control, and that BLAS kernels for your computation units (e.g. MKL for CPUs and CUBLAS for GPUs) are installed and configured in your environment.

$ wget http://starpu.gforge.inria.fr/files/starpu-someversion.tar.gz
$ tar xf starpu-someversion.tar.gz
$ cd starpu-someversion
$ ./configure
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ STARPU_SCHED=dmdas mpirun -np 4 -machinefile mymachines ./mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((960*40*4)) -nblocks $((40*4))

Note that the dmdas scheduler uses performance models, and thus needs a few calibration executions before exhibiting optimized performance (until the "model something is not calibrated enough" messages go away).

To get a glimpse at what happened, you can get an execution trace by installing FxT and ViTE, and enabling traces:

$ ./configure --with-fxt
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ ./tools/starpu_fxt_tool -i /tmp/prof_file_${USER}_0
$ vite paje.trace

Starting with StarPU 1.1, it is also possible to reproduce the performance figures that we show in our articles on our machines, by installing SimGrid and then using StarPU's simulation mode with the performance models of our machines:

$ ./configure --enable-simgrid
$ make -j 12
$ STARPU_PERF_MODEL_DIR=$PWD/tools/perfmodels/sampling STARPU_HOSTNAME=mirage STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
# size	ms	GFlops
38400	10216	1847.6

(MPI simulation is not supported yet)

Publications

All StarPU-related publications are also listed here with the corresponding BibTeX entries.

A good overview is available in the following Research Report.

General Presentations

  1. Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, and Samuel Thibault
    Harnessing clusters of hybrid nodes with a sequential task-based programming model
    In 8th International Workshop on Parallel Matrix Algorithms and Applications, July 2014
    [WWW] [PDF]
  2. Cédric Augonnet
    Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective
    PhD thesis, Université Bordeaux 1, 351 cours de la Libération --- 33405 TALENCE cedex, December 2011
    [WWW]
  3. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier
    StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
    Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23:187-198, February 2011
    [WWW] [doi:10.1002/cpe.1631]
  4. Cédric Augonnet, Samuel Thibault, and Raymond Namyst
    StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines
    Technical Report 7240, INRIA, March 2010
    [WWW]
  5. Cédric Augonnet
    StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes
    In 19èmes Rencontres Francophones du Parallélisme, Toulouse / France, September 2009
    Note: Best Paper Award
    [WWW]
  6. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier
    StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
    In Proceedings of the 15th International Euro-Par Conference, volume 5704 of Lecture Notes in Computer Science, Delft, The Netherlands, pages 863-874, August 2009
    Springer
    [WWW] [doi:10.1007/978-3-642-03869-3_80]
  7. Cédric Augonnet and Raymond Namyst
    A unified runtime system for heterogeneous multicore architectures
    In Proceedings of the International Euro-Par Workshops 2008, HPPC'08, volume 5415 of Lecture Notes in Computer Science, Las Palmas de Gran Canaria, Spain, pages 174-183, August 2008
    Springer
    ISBN: 978-3-642-00954-9
    [WWW] [doi:10.1007/978-3-642-00955-6_22]
  8. Cédric Augonnet
    Vers des supports d'exécution capables d'exploiter les machines multicoeurs hétérogènes
    Mémoire de DEA, Université Bordeaux 1, June 2008
    [WWW]

On Composability

  1. Andra Hugo
    Le problème de la composition parallèle : une approche supervisée
    In 21èmes Rencontres Francophones du Parallélisme (RenPar'21), Grenoble, France, January 2013
    [WWW]
  2. Andra Hugo, Abdou Guermouche, Raymond Namyst, and Pierre-André Wacrenier
    Composing multiple StarPU applications over heterogeneous machines: a supervised approach
    In Third International Workshop on Accelerators and Hybrid Exascale Systems, Boston, USA, May 2013
    [WWW]
  3. Andra Hugo
    Composabilité de codes parallèles sur architectures hétérogènes
    Mémoire de Master, Université Bordeaux 1, June 2011
    [WWW]

On Scheduling

  1. Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, and Suraj Kumar
    Are Static Schedules so Bad? A Case Study on Cholesky Factorization
    In Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS'16), Chicago, IL, United States, May 2016
    IEEE
    [WWW] [PDF]
  2. Olivier Beaumont, Terry Cojean, Lionel Eyraud-Dubois, Abdou Guermouche, and Suraj Kumar
    Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources
    In International Conference on High Performance Computing, Data, and Analytics (HiPC), Hyderabad, India, December 2016
    [WWW] [PDF]
  3. Vinicius Garcia Pinto, Luka Stanisic, Arnaud Legrand, Lucas Mello Schnorr, Samuel Thibault, and Vincent Danjean
    Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach
    In 3rd Workshop on Visual Performance Analysis (VPA), Salt Lake City, United States, November 2016
    Note: Held in conjunction with SC16
    [WWW] [PDF]
  4. Olivier Beaumont, Lionel Eyraud-Dubois, and Suraj Kumar
    Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs
    Note: Working paper or preprint, October 2016
    [WWW] [PDF]
  5. Emmanuel Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Suraj Kumar, Loris Marchal, and Samuel Thibault
    Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms
    In Heterogeneity in Computing Workshop 2015, Hyderabad, India, May 2015
    [WWW]
  6. Marc Sergent and Simon Archipoff
    Modulariser les ordonnanceurs de tâches : une approche structurelle
    In Compas'2014, Neuchâtel, Switzerland, April 2014
    [WWW] [PDF]

On The C Extensions

  1. Ludovic Courtès
    C Language Extensions for Hybrid CPU/GPU Programming with StarPU
    Research Report RR-8278, INRIA, April 2013
    [WWW] [PDF]

On OpenMP Support on top of StarPU

  1. Philippe Virouleau, Pierrick BRUNET, François Broquedis, Nathalie Furmento, Samuel Thibault, Olivier Aumage, and Thierry Gautier
    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite
    In 10th International Workshop on OpenMP (IWOMP 2014), Salvador, Brazil, pages 16-29, September 2014
    Springer
    [WWW] [doi:10.1007/978-3-319-11454-5_2]

On MPI Support

  1. Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, and Samuel Thibault
    Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model
    Research Report RR-8927, Inria Bordeaux Sud-Ouest ; Bordeaux INP ; CNRS ; Université de Bordeaux ; CEA, June 2016
    [WWW] [PDF]
  2. Cédric Augonnet, Olivier Aumage, Nathalie Furmento, Samuel Thibault, and Raymond Namyst
    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators
    Rapport de recherche RR-8538, INRIA, May 2014
    [WWW] [PDF]
  3. Cédric Augonnet, Olivier Aumage, Nathalie Furmento, Raymond Namyst, and Samuel Thibault
    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators
    In Siegfried Benkner Jesper Larsson Träff and Jack Dongarra, editors, EuroMPI 2012, volume 7490 of LNCS, September 2012
    Springer
    Note: Poster Session
    [WWW]

On Memory Control

  1. Marc Sergent, David Goudin, Samuel Thibault, and Olivier Aumage
    Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System
    In 21st International Workshop on High-Level Parallel Programming Models and Supportive Environments, Chicago, United States, May 2016
    [WWW] [PDF]
  2. Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, and Samuel Thibault
    Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model
    Research Report RR-8927, Inria Bordeaux Sud-Ouest ; Bordeaux INP ; CNRS ; Université de Bordeaux ; CEA, June 2016
    [WWW] [PDF]

On Data Transfer Management

  1. Cédric Augonnet, Jérôme Clet-Ortega, Samuel Thibault, and Raymond Namyst
    Data-Aware Task Scheduling on Multi-Accelerator based Platforms
    In The 16th International Conference on Parallel and Distributed Systems (ICPADS), Shanghai, China, December 2010
    [WWW] [doi:10.1109/ICPADS.2010.129]

On Performance Model Tuning

  1. Cédric Augonnet, Samuel Thibault, and Raymond Namyst
    Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures
    In Proceedings of the International Euro-Par Workshops 2009, HPPC'09, volume 6043 of Lecture Notes in Computer Science, Delft, The Netherlands, pages 56-65, August 2009
    Springer
    [WWW] [doi:10.1007/978-3-642-14122-5_9]

On The Simulation Support through SimGrid

  1. Luka Stanisic, Samuel Thibault, Arnaud Legrand, Brice Videau, and Jean-François Méhaut
    Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures
    Concurrency and Computation: Practice and Experience, pp 16, May 2015
    [WWW] [PDF] [doi:10.1002/cpe]
  2. Luka Stanisic, Samuel Thibault, Arnaud Legrand, Brice Videau, and Jean-François Méhaut
    Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures
    In Euro-par - 20th International Conference on Parallel Processing, Porto, Portugal, August 2014
    Springer-Verlag
    [WWW] [PDF]

On The Cell Support

  1. Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Maik Nijhuis
    Exploiting the Cell/BE architecture with the StarPU unified runtime system
    In SAMOS Workshop - International Workshop on Systems, Architectures, Modeling, and Simulation, volume 5657 of Lecture Notes in Computer Science, Samos, Greece, July 2009
    [WWW] [doi:10.1007/978-3-642-03138-0_36]

On Applications

  1. Emmanuel Agullo, Bérenger Bramas, Olivier Coulaud, Martin Khannouz, and Luka Stanisic
    Task-based fast multipole method for clusters of multicore processors
    Research Report RR-8970, Inria Bordeaux Sud-Ouest, October 2016
    [WWW] [PDF]
  2. E Agullo, L Giraud, A Guermouche, S Nakov, and Jean Roman
    Task-based Conjugate Gradient: from multi-GPU towards heterogeneous architectures
    Research Report 8912, Inria Bordeaux Sud-Ouest, May 2016
    [WWW] [PDF]
  3. Corentin Rossignon
    A fine grain model programming for parallelization of sparse linear solver
    PhD thesis, Université de Bordeaux, July 2015
    [WWW] [PDF]
  4. Víctor Martínez, David Michéa, Fabrice Dupros, Olivier Aumage, Samuel Thibault, Hideo Aochi, and Philippe Olivier Alexandre Navaux
    Towards seismic wave modeling on heterogeneous many-core architectures using task-based runtime system
    In 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Florianopolis, Brazil, October 2015
    [WWW] [PDF]
  5. Sylvain Henry, Alexandre Denis, Denis Barthou, Marie-Christine Counilh, and Raymond Namyst
    Toward OpenCL Automatic Multi-Device Support
    In Fernando Silva, Ines Dutra, and Vitor Santos Costa, editors, Euro-Par 2014, Porto, Portugal, August 2014
    Springer
    [WWW] [PDF]
  6. Xavier Lacoste, Mathieu Faverge, Pierre Ramet, Samuel Thibault, and George Bosilca
    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
    In HCW'2014 workshop of IPDPS, Phoenix, United States, May 2014
    IEEE
    Note: RR-8446
    [WWW] [PDF]
  7. Xavier Lacoste, Mathieu Faverge, Pierre Ramet, Samuel Thibault, and George Bosilca
    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
    Rapport de recherche RR-8446, INRIA, January 2014
    [WWW] [PDF]
  8. Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, and Samuel Thibault
    Overview of Distributed Linear Algebra on Hybrid Nodes over the StarPU Runtime
    SIAM Conference on Parallel Processing for Scientific Computing, February 2014
    [WWW] [PDF]
  9. Cyril Bordage
    Ordonnancement dynamique, adapté aux architectures hétérogènes, de la méthode multipôle pour les équations de Maxwell, en électromagnétisme
    PhD thesis, Université Bordeaux 1, 351 cours de la Libération --- 33405 TALENCE cedex, December 2013
  10. Sylvain Henry
    Modèles de programmation et supports exécutifs pour architectures hétérogènes
    PhD thesis, Université Bordeaux 1, 351 cours de la Libération --- 33405 TALENCE cedex, November 2013
    [WWW]
  11. Sylvain Henry
    ViperVM: a Runtime System for Parallel Functional High-Performance Computing on Heterogeneous Architectures
    In 2nd Workshop on Functional High-Performance Computing (FHPC'13), Boston, United States, September 2013
    [WWW] [PDF]
  12. Tetsuya Odajima, Taisuke Boku, Mitsuhisa Sato, Toshihiro Hanawa, Yuetsu Kodama, Raymond Namyst, Samuel Thibault, and Olivier Aumage
    Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing
    In The 2013 International Symposium on Advances of Distributed and Parallel Computing (ADPC 2013), Vietri sul Mare, Italy, December 2013
    [WWW] [PDF]
  13. Satoshi Ohshima, Satoshi Katagiri, Kengo Nakajima, Samuel Thibault, and Raymond Namyst
    Implementation of FEM Application on GPU with StarPU
    In SIAM CSE13 - SIAM Conference on Computational Science and Engineering 2013, Boston, United States, February 2013
    SIAM
    [WWW]
  14. Corentin Rossignon
    Optimisation du produit matrice-vecteur creux sur architecture GPU pour un simulateur de reservoir
    In 21èmes Rencontres Francophones du Parallélisme (RenPar'21), Grenoble, France, January 2013
    [WWW]
  15. Corentin Rossignon, Pascal Hénon, Olivier Aumage, and Samuel Thibault
    A NUMA-aware fine grain parallelization framework for multi-core architecture
    In PDSEC - 14th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing - 2013, Boston, United States, May 2013
    [WWW] [PDF]
  16. Sylvain Henry, Alexandre Denis, and Denis Barthou
    Programmation unifiée multi-accélérateur OpenCL
    Techniques et Sciences Informatiques, (8-9-10):1233-1249, 2012
    [WWW]
  17. Sidi Ahmed Mahmoudi, Pierre Manneback, Cédric Augonnet, and Samuel Thibault
    Traitements d'Images sur Architectures Parallèles et Hétérogènes
    Technique et Science Informatiques, 2012
    [WWW]
  18. Siegfried Benkner, Enes Bajrovic, Erich Marth, Martin Sandrieser, Raymond Namyst, and Samuel Thibault
    High-Level Support for Pipeline Parallelism on Many-Core Architectures
    In Europar - International European Conference on Parallel and Distributed Computing - 2012, Rhodes Island, Greece, August 2012
    [WWW] [PDF]
  19. Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards, Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Träff, and Sabri Pllana
    Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems
    In Design, Automation and Test in Europe (DATE), Dresden, Germany, March 2012
    [WWW] [PDF]
  20. Siegfried Benkner, Sabri Pllana, Jesper Larsson Träff, Philippas Tsigas, Uwe Dolinsky, Cédric Augonnet, Beverly Bachmayer, Christoph Kessler, David Moloney, and Vitaly Osipov
    PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems
    IEEE Micro, 31(5):28-41, September 2011
    ISSN: 0272-1732
    [WWW] [doi:10.1109/MM.2011.67]
  21. Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, and Stanimire Tomov
    LU factorization for accelerator-based systems
    In 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), Sharm El-Sheikh, Egypt, June 2011
    [WWW]
  22. Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Samuel Thibault, and Stanimire Tomov
    QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
    In 25th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2011), Anchorage, Alaska, USA, May 2011
    [WWW] [doi:10.1109/IPDPS.2011.90]
  23. Usman Dastgeer, Christoph Kessler, and Samuel Thibault
    Flexible runtime support for efficient skeleton programming on hybrid systems
    In Proceedings of the International Conference on Parallel Computing (ParCo), Applications, Tools and Techniques on the Road to Exascale Computing, volume 22 of Advances of Parallel Computing, Gent, Belgium, pages 159-166, August 2011
    [WWW]
  24. Sylvain Henry
    Programmation multi-accélérateurs unifiée en OpenCL
    In 20èmes Rencontres Francophones du Parallélisme (RenPar'20), Saint Malo, France, May 2011
    [WWW]
  25. Sidi Ahmed Mahmoudi, Pierre Manneback, Cédric Augonnet, and Samuel Thibault
    Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes
    In 20èmes Rencontres Francophones du Parallélisme, Saint-Malo / France, May 2011
    [WWW]
  26. Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, Samuel Thibault, and Stanimire Tomov
    A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
    In Wen-mei W. Hwu, editor, GPU Computing Gems, volume 2
    Morgan Kaufmann, September 2010
    [WWW]
  27. Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, Jean Roman, Samuel Thibault, and Stanimire Tomov
    Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators
    In Symposium on Application Accelerators in High Performance Computing (SAAHPC), Knoxville, USA, July 2010
    [WWW]

Last updated on 2016/04/13.