Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License”.
The use of specialized hardware such as accelerators or coprocessors offers an interesting approach to overcome the physical limits encountered by processor architects. As a result, many machines are now equipped with one or several accelerators (e.g. a GPU), in addition to the usual processor(s). While much effort has been devoted to offloading computation onto such accelerators, very little attention has been paid to portability concerns on the one hand, and to the possibility of having heterogeneous accelerators and processors interact on the other hand.
StarPU is a runtime system that offers support for heterogeneous multicore architectures. It not only offers a unified view of the computational resources (i.e. CPUs and accelerators at the same time), but it also takes care of efficiently mapping and executing tasks onto a heterogeneous machine while transparently handling low-level issues such as data transfers in a portable fashion.
StarPU is a software tool aiming to allow programmers to exploit the computing power of the available CPUs and GPUs, while relieving them from the need to specially adapt their programs to the target machine and processing units.
At the core of StarPU is its run-time support library, which is responsible for scheduling application-provided tasks on heterogeneous CPU/GPU machines. In addition, StarPU comes with programming language support, in the form of extensions to languages of the C family (C Extensions), as well as an OpenCL front-end (SOCL OpenCL Extensions).
StarPU's run-time and programming language extensions support a task-based programming model. Applications submit computational tasks, with CPU and/or GPU implementations, and StarPU schedules these tasks and associated data transfers on available CPUs and GPUs. The data that a task manipulates are automatically transferred among accelerators and the main memory, so that programmers are freed from the scheduling issues and technical details associated with these transfers.
StarPU takes particular care of scheduling tasks efficiently, using well-known algorithms from the literature (Task Scheduling Policy). In addition, it allows scheduling experts, such as compiler or computational library developers, to implement custom scheduling policies in a portable fashion (Defining A New Scheduling Policy).
The remainder of this section describes the main concepts used in StarPU.
One of StarPU's primary data structures is the codelet. A codelet describes a computational kernel that can possibly be implemented on multiple architectures such as a CPU, a CUDA device or an OpenCL device.
Another important data structure is the task. Executing a StarPU task consists in applying a codelet on a data set, on one of the architectures on which the codelet is implemented. A task thus describes the codelet that it uses, but also which data are accessed, and how they are accessed during the computation (read and/or write). StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking operation. The task structure can also specify a callback function that is called once StarPU has properly executed the task. It also contains optional fields that the application may use to give hints to the scheduler (such as priority levels).
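The relationship between a codelet and a task can be sketched in plain C. This is a toy model, not the StarPU API: every name in it (toy_codelet, toy_task, toy_task_execute, and so on) is hypothetical, and it runs the kernel synchronously, whereas real StarPU tasks are submitted asynchronously and placed by the scheduler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model for illustration only; none of these names are StarPU's. */
enum toy_arch { TOY_CPU, TOY_CUDA, TOY_OPENCL, TOY_NARCHS };
enum toy_access { TOY_R = 1, TOY_W = 2, TOY_RW = 3 };

typedef void (*toy_kernel)(float *data, size_t n);

/* A codelet: one kernel pointer per architecture (NULL when not implemented). */
struct toy_codelet {
    toy_kernel impl[TOY_NARCHS];
};

/* A task: a codelet applied to a data set, the access mode used for
 * that data, and an optional callback run after the kernel finishes. */
struct toy_task {
    const struct toy_codelet *cl;
    float *data;
    size_t n;
    enum toy_access mode;
    void (*callback)(void *arg);
    void *callback_arg;
};

/* Example CPU implementation: doubles each element in place. */
static void scale_by_two_cpu(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Example callback: increments the int counter passed as argument. */
static void toy_count_done(void *arg)
{
    ++*(int *)arg;
}

/* Run the task on `arch`. Returns 0 on success, or -1 when the codelet
 * has no implementation for that architecture. */
static int toy_task_execute(const struct toy_task *t, enum toy_arch arch)
{
    assert(t != NULL && t->cl != NULL);
    toy_kernel k = t->cl->impl[arch];
    if (k == NULL)
        return -1;
    k(t->data, t->n);
    if (t->callback != NULL)
        t->callback(t->callback_arg);
    return 0;
}
```

In this sketch, a codelet that only fills the TOY_CPU slot can only be executed on a CPU; adding a pointer in another slot is what would make the same task eligible for another kind of processing unit.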
By default, task dependencies are inferred from data dependency (sequential coherency) by StarPU. The application can however disable sequential coherency for some data, and dependencies can be specifically expressed. A task may be identified by a unique 64-bit number chosen by the application, which we refer to as a tag. Task dependencies can be enforced either by means of callback functions, by submitting other tasks, or by expressing dependencies between tags (which can thus correspond to tasks that have not yet been submitted).
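To make the idea of inferring dependencies from data accesses more concrete, here is a minimal sketch in plain C. It is not StarPU's implementation, and all names (dep_access, dep_needed, dep_collect) are hypothetical. It assumes the usual rule for sequential consistency: two tasks that touch the same data must keep their submission order unless both only read it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical access modes for illustration; not StarPU identifiers. */
enum dep_access { DEP_R = 1, DEP_W = 2, DEP_RW = DEP_R | DEP_W };

/* True when a task accessing some data with mode `later` must wait for
 * an earlier-submitted task that accessed the same data with mode
 * `earlier`. Two reads commute; any pair involving a write does not. */
static bool dep_needed(enum dep_access earlier, enum dep_access later)
{
    return (((int)earlier | (int)later) & DEP_W) != 0;
}

/* modes[k] is the access mode of the k-th submitted task on one piece
 * of data. Stores in deps[] the indices of the earlier tasks that task
 * `i` must wait for and returns how many there are. The result may
 * include dependencies already implied transitively. */
static size_t dep_collect(const enum dep_access *modes, size_t i, size_t *deps)
{
    assert(modes != NULL && deps != NULL);
    size_t n = 0;
    for (size_t j = 0; j < i; j++)
        if (dep_needed(modes[j], modes[i]))
            deps[n++] = j;
    return n;
}
```

With the submission sequence write, read, read, read-write, the two readers depend only on the initial writer and may therefore run concurrently, while the final read-write task has to wait for all three earlier tasks.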
Because StarPU schedules tasks at runtime, data transfers have to be done automatically and “just-in-time” between processing units, relieving application programmers from explicit data transfers. Moreover, to avoid unnecessary transfers, StarPU keeps data where it was last needed, even if it was modified there, and it allows multiple copies of the same data to reside on several processing units at the same time as long as the data is not modified.
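The policy described above (copy on demand, keep data where it was last used, allow several read-only copies, drop the other copies on a write) can be illustrated with a small bookkeeping model in plain C. This is only a sketch under simplifying assumptions, not StarPU's data management code; the names coh_handle, coh_init, coh_read and coh_write are hypothetical, and a transfer is merely counted rather than performed.

```c
#include <assert.h>
#include <stdbool.h>

#define COH_MAX_NODES 4

/* Hypothetical per-data bookkeeping: valid[n] tells whether memory
 * node n currently holds an up-to-date copy. */
struct coh_handle {
    bool valid[COH_MAX_NODES];
    int transfers; /* number of copies that had to be made so far */
};

/* Initially only the node that owns the data holds a valid copy. */
static void coh_init(struct coh_handle *h, int home)
{
    assert(home >= 0 && home < COH_MAX_NODES);
    for (int n = 0; n < COH_MAX_NODES; n++)
        h->valid[n] = (n == home);
    h->transfers = 0;
}

/* Make the data readable on `node`: a transfer happens only when that
 * node has no valid copy yet. Existing copies elsewhere stay valid. */
static void coh_read(struct coh_handle *h, int node)
{
    assert(node >= 0 && node < COH_MAX_NODES);
    if (!h->valid[node]) {
        h->transfers++;
        h->valid[node] = true;
    }
}

/* Make the data modifiable on `node` (modelled as a read-write
 * access): fetch the current value if needed, then invalidate every
 * other copy. The modified data stays on `node` until needed elsewhere. */
static void coh_write(struct coh_handle *h, int node)
{
    coh_read(h, node);
    for (int n = 0; n < COH_MAX_NODES; n++)
        if (n != node)
            h->valid[n] = false;
}
```

In this model, reading twice on the same node costs a single transfer, and after a write on node 1 the copy on node 0 is stale, so node 0 pays for a new transfer only if and when it reads the data again.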
A codelet records pointers to various implementations of the same theoretical function.
A memory node can be either the main RAM, GPU-embedded memory or a disk memory.
A bus is a link between memory nodes.
A data handle keeps track of replicates of the same data (registered by the application) over various memory nodes. The data management library manages to keep them coherent.
The home memory node of a data handle is the memory node from which the data was registered (usually the main memory node).
A task represents a scheduled execution of a codelet on some data handles.
A tag is a rendez-vous point. Tasks typically have their own tag, and can depend on other tags. The value is chosen by the application.
A worker executes tasks. There is typically one per CPU computation core and one per accelerator (for which a whole CPU core is dedicated).
A driver drives a given kind of worker. There are currently CPU, CUDA, and OpenCL drivers. They usually start several workers to actually drive them.
A performance model is a (dynamic or static) model of the performance of a given codelet. Codelets can have execution-time performance models as well as energy consumption performance models.
A data interface describes the layout of the data: for a vector, a pointer for the start, the number of elements and the size of elements; for a matrix, a pointer for the start, the number of elements per row, the offset between rows, and the size of each element; etc. To access their data, codelet functions are given interfaces for the local memory node replicates of the data handles of the scheduled task.
Partitioning data means dividing the data of a given data handle (called father) into a series of children data handles which designate various portions of the former.
A filter is the function which computes children data handles from a father data handle, and thus describes how the partitioning should be done (horizontal, vertical, etc.).
Acquiring a data handle can be done from the main application, to safely access the data of a data handle from its home node, without having to unregister it.
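The glossary entries on data interfaces, partitioning and filters can be made more concrete with a small sketch in plain C. The structures below only mirror the fields listed in the data interface entry; they are not StarPU's own types, and the names lay_vector, lay_matrix, lay_matrix_elem and lay_vector_child are hypothetical. Two assumptions are made: the offset between rows is counted in elements, and the toy filter gives one extra element to the first children when the father's size does not divide evenly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Vector layout: start address, number of elements, element size. */
struct lay_vector {
    uintptr_t ptr;
    size_t nx;
    size_t elemsize;
};

/* Matrix layout: start address, elements per row, offset between rows
 * (assumed here to be counted in elements), element size. */
struct lay_matrix {
    uintptr_t ptr;
    size_t nx;
    size_t ld;
    size_t elemsize;
};

/* Address of element (row, col). ld may exceed nx when rows are padded
 * or when the matrix is a sub-block of a larger one. */
static uintptr_t lay_matrix_elem(const struct lay_matrix *m, size_t row, size_t col)
{
    assert(col < m->nx && m->ld >= m->nx);
    return m->ptr + (row * m->ld + col) * m->elemsize;
}

/* Toy filter: describes child `i` out of `nchildren` contiguous blocks
 * of the father vector. Assumption: the first nx % nchildren children
 * receive one extra element. No data is copied; the child simply
 * designates a portion of the father. */
static struct lay_vector lay_vector_child(const struct lay_vector *father,
                                          size_t nchildren, size_t i)
{
    assert(nchildren > 0 && i < nchildren);
    size_t base = father->nx / nchildren;
    size_t extra = father->nx % nchildren;
    size_t len = base + (i < extra ? 1 : 0);
    size_t first = i * base + (i < extra ? i : extra);
    struct lay_vector child = {
        father->ptr + first * father->elemsize, len, father->elemsize
    };
    return child;
}
```

Because a child is only a description of a sub-range of the father, partitioning in this model is cheap and the children never overlap, so tasks working on different children do not touch the same elements.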
Research papers about StarPU can be found at http://starpu.gforge.inria.fr/publications/.
A good overview is available in the research report at http://hal.archives-ouvertes.fr/inria-00467677.
Many examples are also available in the StarPU sources in the directory examples/. Simple examples include:
Trivial incrementation test.
Simple documented Hello world and vector/scalar product (as shown in Basic Examples), matrix product examples (as shown in Performance Model Example), an example using the blocked matrix data interface, an example using the variable data interface, and an example using different formats on CPUs and GPUs.
OpenCL example from NVidia, adapted to StarPU.
AXPY CUBLAS operation adapted to StarPU.
Example of using StarPU's native Fortran support.
Example of Fortran 90 bindings, using C marshalling wrappers.
More advanced examples include:
Examples using filters, as shown in Partitioning Data.
LU matrix factorization, see for instance
The documentation chapters include
Make sure to have had a look at those too!