



# StarPU, a Task-Based Runtime System

### for Heterogeneous Platform Programming



Olivier Aumage, Team STORM Inria – LaBRI olivier.aumage@inria.fr

### Team STORM

#### STatic Optimizations, Runtime Methods

- Inria Bordeaux Sud-Ouest, LaBRI Laboratory
- Head: Denis Barthou
- Research directions
  - Expressing...
  - Adapting... ... parallelism
  - Optimizing...



Inría

### Contents

- 1. Runtime Systems for Heterogeneous Platforms
- 2. The StarPU Task-Based Runtime System
- 3. Programming with StarPU
- 4. StarPU Internals
- 5. Scheduling Policies
- 6. Data Management
- 7. Analysis and Monitoring
- 8. Distributed Computing
- 9. Interoperability and Composition
- 10. Advanced Scheduling Topics
- 11. Advanced Data Management Topics
- 12. Advanced Analysis and Monitoring Topics
- 13. Conclusion





# Runtime Systems for Heterogeneous Platforms

Innia

O. Aumage – StarPU Runtime

More capabilities, more complexity

Ínría\_

More capabilities, more complexity

#### Display

- Higher resolutions
- 2D acceleration
- 3D rendering



More capabilities, more complexity

#### Display

- Higher resolutions
- 2D acceleration
- 3D rendering

#### Networking

- Processing offload
- Zero-copy transfers
- Hardware multiplexing



More capabilities, more complexity

#### Display

- Higher resolutions
- 2D acceleration
- 3D rendering

#### Networking

- Processing offload
- Zero-copy transfers
- Hardware multiplexing

### I/O

- RAID
- SSD vs Disks
- Network-attached disks
- Parallel file systems



More capabilities, more complexity

#### Display

- Higher resolutions
- 2D acceleration
- 3D rendering

#### Networking

- Processing offload
- Zero-copy transfers
- Hardware multiplexing

### I/O

- RAID
- SSD vs Disks
- Network-attached disks
- Parallel file systems

#### **Computing Hardware?**



Ínría\_

Stay conservative?

Ínría\_

#### Stay conservative?

- Only use long established features
  - Display: Basic graphics or terminal output
  - Networking: Unix systems calls, TCP sockets
  - I/O: Unix systems calls, read/write

Innía

#### Stay conservative?

- Only use long established features
  - Display: Basic graphics or terminal output
  - Networking: Unix systems calls, TCP sockets
  - I/O: Unix systems calls, read/write
- Under-used hardware?
- Low performance?

Ínría\_

- Efficiency
- Convenience

Inría

- Efficiency
- Convenience
- Portability?
  - What if the application is used on different hardware?

Inría

- Efficiency
- Convenience
- Portability?
  - What if the application is used on different hardware?
- Adaptiveness?
  - What if hardware resource availability/capacity is higher? Lower?

Inría

- Efficiency
- Convenience
- Portability?
  - What if the application is used on different hardware?
- Adaptiveness?
  - What if hardware resource availability/capacity is higher? Lower?
- Cost?
  - Is it worthwhile to use such "specific" features?



- Efficiency
- Convenience
- Portability?
  - What if the application is used on different hardware?
- Adaptiveness?
  - What if hardware resource availability/capacity is higher? Lower?
- Cost?
  - Is it worthwhile to use such "specific" features?
- Long-term viability?
- Vendor-tied code?
  - Is it worthwhile to invest into porting on such platforms?

Innía

Answer: Use runtime systems!

Ínría\_



## **Principles of Runtime Systems**

Innía

O. Aumage – StarPU Runtime

Answer: Use runtime systems!



Answer: Use runtime systems!

The Role(s) of Runtime Systems

Portability

Inría

Answer: Use runtime systems!

The Role(s) of Runtime Systems

- Portability
- Control

Innía

Answer: Use runtime systems!

The Role(s) of Runtime Systems

- Portability
- Control
- Adaptiveness

Inría

Answer: Use runtime systems!

The Role(s) of Runtime Systems

- Portability
- Control
- Adaptiveness
- Optimization



Ínría\_

Networking

- MPI (Message Passing Interface), Global Arrays
- GASPI / GPI-2
- GASNet, CCI
- Distributed Shared Memory systems
- SHMEM

Innía

#### Networking

- MPI (Message Passing Interface), Global Arrays
- GASPI / GPI-2
- GASNet, CCI
- Distributed Shared Memory systems
- SHMEM

#### Graphics

- DirectX, Direct3D (Microsoft Windows)
- OpenGL



#### Networking

- MPI (Message Passing Interface), Global Arrays
- GASPI / GPI-2
- GASNet, CCI
- Distributed Shared Memory systems
- SHMEM

### Graphics

- DirectX, Direct3D (Microsoft Windows)
- OpenGL

### I/O

- MPI-IO
- HDF5 libraries
- Database engines



(nría\_

- Abstraction
  - Uniform front-end layer
  - Device-independent API
  - Targeted by applications





- Abstraction
  - Uniform front-end layer
  - Device-independent API
  - Targeted by applications
- Drivers, plugins
  - Device-dependent backend layer
  - $-\,$  Targeted by vendors and/or device specialist





- Abstraction
  - Uniform front-end layer
  - Device-independent API
  - Targeted by applications
- Drivers, plugins
  - Device-dependent backend layer
  - Targeted by vendors and/or device specialist
- Decoupling applications from device specific matters





### The Role(s) of Runtime Systems: Control





### The Role(s) of Runtime Systems: Control

- Resource mapping
  - Deciding which hardware resource to use/not to use for some application workload
  - Spatial work mapping





# The Role(s) of Runtime Systems: Control

- Resource mapping
  - Deciding which hardware resource to use/not to use for some application workload
  - Spatial work mapping
- Scheduling
  - Deciding when and in which order to perform some application workload
  - Temporal work mapping





# The Role(s) of Runtime Systems: Control

- Resource mapping
  - Deciding which hardware resource to use/not to use for some application workload
  - Spatial work mapping
- Scheduling
  - Deciding when and in which order to perform some application workload
  - Temporal work mapping
- Plan application workload execution





Ínría\_

- Discovering, sampling, calibrating
  - Detecting qualitative hardware capabilities
  - Providing fallbacks, when possible
  - Detecting quantitative hardware capabilities





- Discovering, sampling, calibrating
  - Detecting qualitative hardware capabilities
  - Providing fallbacks, when possible
  - Detecting quantitative hardware capabilities
- Monitoring, load balancing
  - Throttling workload feed
  - Reacting to hardware status changes





- Discovering, sampling, calibrating
  - Detecting qualitative hardware capabilities
  - Providing fallbacks, when possible
  - Detecting quantitative hardware capabilities
- Monitoring, load balancing
  - Throttling workload feed
  - Reacting to hardware status changes
- Cope with effective hardware aptitude and performance level





Ínría\_

- Capitalize on workload look-ahead to bring performance-oriented added value
  - Requests aggregation
  - Resource locality
  - Computation offload
  - Computation/transfer overlap





- Capitalize on workload look-ahead to bring performance-oriented added value
  - Requests aggregation
  - Resource locality
  - Computation offload
  - Computation/transfer overlap
- Take advantage of the cross-cutting point of view of the runtime system

- Perform global optimizations when possible





- Capitalize on workload look-ahead to bring performance-oriented added value
  - Requests aggregation
  - Resource locality
  - Computation offload
  - Computation/transfer overlap
- ∎ Take advantage of the cross-cutting point of view of the runtime system
  - Perform global optimizations when possible
- Out-weight the cost of an extra, intermediate software layer



nnía



# **Runtime Systems for Computing**

Innía

O. Aumage – StarPU Runtime

# **Evolution of Computing Hardware**

Rupture

- The "Frequency Wall"
  - Processing units cannot run anymore faster
- Looking for other sources of performance

Innía

# **Evolution of Computing Hardware**

Rupture

- The "Frequency Wall"
  - Processing units cannot run anymore faster
- Looking for other sources of performance

Hardware Parallelism

- Multiply existing processing power
  - Have several processing units work together

Innía

# **Evolution of Computing Hardware**

Rupture

- The "Frequency Wall"
  - Processing units cannot run anymore faster
- Looking for other sources of performance

Hardware Parallelism

- Multiply existing processing power
  - Have several processing units work together
- Not a new idea...
- .... but definitely the key performance factor now

Heterogeneous Association

- General purpose processor
- Specialized accelerator





Heterogeneous Association

- General purpose processor
- Specialized accelerator

- Distributed cores, discrete accelerators
  - Standalone GPUs
  - Intel Xeon Phi (KNC)



Heterogeneous Association

- General purpose processor
- Specialized accelerator

- Distributed cores, discrete accelerators
  - Standalone GPUs
  - Intel Xeon Phi (KNC)
- Integrated cores
  - Intel Skylake / Kaby Lake
  - Intel Xeon Phi (KNL)
  - AMD Fusion
  - nVidia Tegra, ARM big.LITTLE





Heterogeneous Association

- General purpose processor
- Specialized accelerator

- Distributed cores, discrete accelerators
  - Standalone GPUs
  - Intel Xeon Phi (KNC)
- Integrated cores
  - Intel Skylake / Kaby Lake
  - Intel Xeon Phi (KNL)
  - AMD Fusion
  - nVidia Tegra, ARM big.LITTLE
- Combination of various units
  - Latency-optimized cores
  - Throughput-optimized cores
  - Energy-optimized cores





Heterogeneous Association

- General purpose processor
- Specialized accelerator

Generalization

- Distributed cores, discrete accelerators
  - Standalone GPUs
  - Intel Xeon Phi (KNC)
- Integrated cores
  - Intel Skylake / Kaby Lake
  - Intel Xeon Phi (KNL)
  - AMD Fusion
  - nVidia Tegra, ARM big.LITTLE
- Combination of various units
  - Latency-optimized cores
  - Throughput-optimized cores
  - Energy-optimized cores

#### Overall increased parallelism diversity

- Multiprocessors, multicores
- Vector processing extensions
- Accelerators



# Example: CPU vs GPU Hardware

#### Multiple strategies for multiple purposes

- CPU
  - Strategy
    - Large caches
    - Large control
  - Purpose
    - Complex codes, branching
    - Complex memory access patterns
  - World Rally Championship car

## GPU

- Strategy
  - Lot of computing power
  - Simplified control
- Purpose
  - Regular data parallel codes
  - Simple memory access patterns
- Formula One car







Special purpose computing devices (or general purpose GPUs)

- (initially) a discrete expansion card
- Rationale: dye area trade-off

Innía

Special purpose computing devices (or general purpose GPUs)

- (initially) a discrete expansion card
- Rationale: dye area trade-off

Single Instruction Multiple Threads (SIMT)

- A single control unit...
- ... for several computing units

Innía

Special purpose computing devices (or general purpose GPUs)

- (initially) a discrete expansion card
- Rationale: dye area trade-off

Single Instruction Multiple Threads (SIMT)

- A single control unit...
- ... for several computing units





Special purpose computing devices (or general purpose GPUs)

- (initially) a discrete expansion card
- Rationale: dye area trade-off

Single Instruction Multiple Threads (SIMT)

- A single control unit...
- ... for several computing units

SIMT is distinct from SIMD

- Allows flows to diverge
- ... but better avoid it!







## Problematics

### Unified computing runtime system for heterogeneous platforms

- Portability of performance
  - Abstraction
  - Adaptiveness
  - Execution Control
  - Optimization

#### Need a way to abstract application execution...

... into elementary, manageable objects





# **Abstracting Application Workload**

Innía

O. Aumage – StarPU Runtime

Reasoning on Thread objects?

- One instruction flow
  - Unbounded flow
  - Parallel activity
- One state/context per thread
  - Stack

Inría

#### Reasoning on Thread objects?

Thread

- One instruction flow
  - Unbounded flow
  - Parallel activity
- One state/context per thread
  - Stack

### Examples

- OpenMP parallel regions
- libpthread
- C++ threads



Reasoning on Thread objects?

- One instruction flow
  - Unbounded flow
  - Parallel activity
- One state/context per thread
  - Stack



- OpenMP parallel regions
- libpthread
- C++ threads









Reasoning on Thread objects?

- One instruction flow
  - Unbounded flow
  - Parallel activity
- One state/context per thread
  - Stack



- OpenMP parallel regions
- libpthread
- C++ threads









Reasoning on Thread objects?

- One instruction flow
  - Unbounded flow
  - Parallel activity
- One state/context per thread
  - Stack



- OpenMP parallel regions
- libpthread
- C++ threads









## Threads: Resources vs Needs

Lack of abstraction

- Threads express explicit resource request
- instead of application requirements

Inría

## Threads: Resources vs Needs

Lack of abstraction

- Threads express explicit resource request
- instead of application requirements









# Threads: Resources Miss-subscription

Software vs hardware mismatch

- Over-subscription
- Under-subscription
- Fixed number of threads

Innía

# Threads: Resources Miss-subscription

Software vs hardware mismatch

- Over-subscription
- Under-subscription
- Fixed number of threads









# Threads: Resources Miss-subscription

Software vs hardware mismatch

- Over-subscription
- Under-subscription
- Fixed number of threads





Computation Threads



### Threads: Resources Miss-subscription

Software vs hardware mismatch

- Over-subscription
- Under-subscription
- Fixed number of threads





### Threads: Resources Miss-subscription

Software vs hardware mismatch

- Over-subscription
- Under-subscription
- Fixed number of threads





Computation Threads

Time



#### **Threads: Lack of Semantics**

What does a thread really do?

- Resource usage?
- Inter-thread constraints
- Chaining constraints, ordering?

Planning Issues

- Unbounded computation
- System-controlled context switches

Consequences

- Heavy synchronizations: barriers
- User-managed fine-grain synchronizations: locks, mutexes
- Little to no help from runtime system



#### Threads: Load Balancing Issues

Keeping every hardware unit busy

- Irregular application, workload
- Uncontrolled synchronization shift
- Heterogeneous platforms: accelerators, GPU

Innía

#### Threads: Load Balancing Issues

Keeping every hardware unit busy

- Irregular application, workload
- Uncontrolled synchronization shift
- Heterogeneous platforms: accelerators, GPU





#### Threads: Load Balancing Issues

Keeping every hardware unit busy

- Irregular application, workload
- Uncontrolled synchronization shift
- Heterogeneous platforms: accelerators, GPU









#### Threads: Networking and I/O Issues

- Computation/communication overlapping?
- Bulk I/O / network transfer mitigation?
- Thread-level idle time reduction?

Innía

#### Threads: Networking and I/O Issues

- Computation/communication overlapping?
- Bulk I/O / network transfer mitigation?
- Thread-level idle time reduction?



### Threads: Networking and I/O Issues

- Computation/communication overlapping?
- Bulk I/O / network transfer mitigation?
- Thread-level idle time reduction?









#### Threads: Outcome

Perhaps not the right semantics for end-user application development

- Over-constrained concept for application programming
- Awkward object to manipulate at the runtime system level
- Not well suited to leverage theoretical scheduling results
  - Completion?
  - Other metrics?

Innía

Reasoning on Task objects

Common definition

- Elementary computation
  - Numerical kernel
  - BLAS call
  - ...
- $\rightarrow$  Potential parallel work





Reasoning on Task objects

Common definition

- Elementary computation
  - Numerical kernel
  - BLAS call
  - ..
- $\bullet \rightarrow \mathsf{Potential} \mathsf{ parallel} \mathsf{ work}$

#### Constraints

- Input needed
- Output produced
- $\ \rightarrow \ \mathsf{Dependencies}$
- No side effect (no hidden dependencies)
- ${\scriptstyle \bullet} \rightarrow {\rm Degrees} \mbox{ of Freedom}$  in realizing the potential parallelism





Reasoning on Task objects

Common definition

- Elementary computation
  - Numerical kernel
  - BLAS call
  - ..
- $\bullet \rightarrow \mathsf{Potential} \mathsf{ parallel} \mathsf{ work}$

#### Constraints

- Input needed
- Output produced
- $\ \rightarrow \ \mathsf{Dependencies}$
- No side effect (no hidden dependencies)
- ${\scriptstyle \bullet} \rightarrow {\rm Degrees} \mbox{ of Freedom}$  in realizing the potential parallelism
- Shared (often fixed) pool of worker threads
- ${\scriptstyle \bullet} \rightarrow {\rm Decoupled}$  engine, to realize a potentially parallel execution





#### Tasks: Resources vs Needs?

A task expresses what to do (e.g. which computation)

The runtime remains free to decide the amount of resources to execute a task

- Rationalize resource consumption
  - Thread and associated stack reused among several tasks
- Enforce separation of concerns
  - Management code brought out of the application
- Open the way to resource allocation optimization
  - Cross-cutting view of the application requirements

#### Tasks: Resources vs Needs?

A task expresses what to do (e.g. which computation)

The runtime remains free to decide the amount of resources to execute a task

- Rationalize resource consumption
  - Thread and associated stack reused among several tasks
- Enforce separation of concerns
  - Management code brought out of the application
- Open the way to resource allocation optimization
  - Cross-cutting view of the application requirements



main

#### Tasks: Resources Miss-subscription?

The runtime system may initialize a pool of worker threads according to the hardware capabilities

The application submit tasks independently to the runtime, independently of the hardware capabilities

- Tasks submitted by the application according to its natural algorithm
  - Abstraction with respect to hardware
- Workers allocated according to hardware resource, topology
  - Typically one thread per core or per hardware thread
- Operating system scheduler interference largely eliminated
  - No competition between worker threads

A task expresses what to do (e.g. which computation), under which constraints.

The runtime system can take advantage of this knowledge



- Optimize spatial resource usage
  - Decide which computing resource is best suited for a given task



- Optimize spatial resource usage
  - Decide which computing resource is best suited for a given task
- Optimize temporal resource usage
  - Decide in which order to execute tasks



- Optimize spatial resource usage
  - Decide which computing resource is best suited for a given task
- Optimize temporal resource usage
  - Decide in which order to execute tasks
- Optimize concurrent resource usage
  - Decide which pairs of tasks to execute in parallel



- Optimize spatial resource usage
  - Decide which computing resource is best suited for a given task
- Optimize temporal resource usage
  - Decide in which order to execute tasks
- Optimize concurrent resource usage
  - Decide which pairs of tasks to execute in parallel
- No lock directly manipulated by the application



Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency
  - No need for explicit yield

Innía

Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency





Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency





Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency





Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency



Tasks may transparently fill arising idle times as long as sufficient parallelism is available

- Flexibility
  - No need for all tasks to have a uniform duration
  - Naturally opens the way to heterogeneous computations, accelerated offloads
- Transparency
  - No need for explicit yield







#### Tasks: Networking and I/O Issues?

Potential 1-to-1 relationship between tasks and network/IO requests

- Network/IO request may start as soon as the task producing the data has been completed
- Tasks may be triggered as the result of network/IO requests completion
- Significant difference with fork-join models, MPI+X
  - Transparent interoperability
  - Avoid deferred network/IO requests until next join
  - Avoid custom network/IO requests management inside the application code

#### Tasks: Outcome

Task = Characterizable work

#### Well-defined

- Workload
- Completion
- Dependencies
- Similar to the pure function concept from Functional programming domain

#### Suitable object for modelling

- Constraints
- Degrees of freedom
- Large corpus of task scheduling theory

#### Enforcing separation of concerns

- Application specialist
- Kernel(s) specialist
- Scheduling theoretician specialist
- Runtime-system specialist



### Programming Modern Platforms using Tasks

**See second part**: Programming Modern Platforms with the StarPU Task-Based Runtime System

Rich set of existing task-based programming models and associated runtime systems  $% \left( {{{\mathbf{x}}_{i}}} \right)$ 

- DuctTeip
- Legion
- OCR
- OpenMP 4.x
- OmpSs
- ParalleX
- PaRSEC
- Swan
- Uintah/Kokkos
- XKaapi
- • •





# The StarPU Task-Based Runtime System

Innía

O. Aumage – StarPU Runtime

### **Heterogeneous Parallel Platforms**

Heterogeneous Association

- General purpose processor
- Specialized accelerator

Generalization

- Distributed cores, discrete accelerators
  - Standalone GPUs
  - Intel Xeon Phi (KNC)
- Integrated cores
  - Intel Skylake / Kaby Lake
  - Intel Xeon Phi (KNL)
  - AMD Fusion
  - nVidia Tegra, ARM big.LITTLE
- Combination of various units
  - Latency-optimized cores
  - Throughput-optimized cores
  - Energy-optimized cores





#### Task

- Elementary computation
  - Some kernel
- $\bullet \rightarrow \mathsf{Potential} \ \mathsf{parallel} \ \mathsf{work}$
- Constraints
  - Input needed
  - Output produced
  - $\ \rightarrow \ \mathsf{Dependencies}$



 ${\scriptstyle \bullet} \rightarrow {\rm Degrees} \mbox{ of Freedom}$  in realizing the potential parallelism



#### Task

- Elementary computation
  - Some kernel
- $\bullet \rightarrow \mathsf{Potential}$  parallel work
- Constraints
  - Input needed
  - Output produced
  - $\rightarrow \mathsf{Dependencies}$



 ${\scriptstyle \bullet} \rightarrow {\rm Degrees} \mbox{ of Freedom}$  in realizing the potential parallelism

#### Expressing tasks?

- Divide and conquer: Cilk (recursive tasks)
- Dependencies compiler: PaRSEC (parameterized task graph)
- Sequential task flow: StarPU (directed acyclic task graph)



#### StarPU Programming Model: Sequential Task Flow

- Express parallelism...
- ... using the natural program flow
- **Submit** tasks in the sequential flow of the program...
- ... then let the runtime schedule the tasks asynchronously

Innía

# Sequential Task Flow Graph Building

#### Example: Cholesky Decomposition



















#### 

Tasks are submitted asynchronously





- Tasks are submitted asynchronously
- StarPU infers data dependences...





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks







- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





#### 

- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks







- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks





- Tasks are submitted asynchronously
- StarPU infers data dependences...
- ... and build a graph of tasks
- The graph of tasks is executed





# StarPU Execution Model: Task Scheduling

Mapping the graph of tasks (DAG) on the hardware

- Allocating computing resources
- Enforcing dependency constraints
- Handling data transfers

#### Adaptiveness

- A single DAG enables multiple schedulings
- A single DAG can be mapped on multiple platforms



# Example: SCHNAPS, Implicit kinetic schemes

#### SCHNAPS Solver (Inria TONUS)

Example of a task graph submitted to StarPU





# Heterogeneous Showcase with Chameleon + StarPU

#### UTK, Inria HIEPACS, Inria RUNTIME

• QR decomp. on 16 CPUs (AMD) + 4 GPUs (C1060) using MAGMA GPU kernels



"E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, et al. *QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators*. 25th IEEE IPDPS, 2011."



# Heterogeneous Showcase with Chameleon + StarPU

#### UTK, Inria HIEPACS, Inria RUNTIME

• QR decomp. on 16 CPUs (AMD) + 4 GPUs (C1060) using MAGMA GPU kernels



"E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, et al. *QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators.* 25th IEEE IPDPS, 2011."



# Heterogeneous Showcase with Chameleon + StarPU

#### **QR** kernel properties

| Kernel SGEQRT |             |      |                         |           |    |
|---------------|-------------|------|-------------------------|-----------|----|
|               | 9 GFlop/s   | GPU: | <mark>30</mark> GFlop/s | Speed-up: | 3  |
| Kernel STSQRT |             |      |                         |           |    |
| CPU:          | 12 GFlop/s  | GPU: | 37 GFlop/s              | Speed-up: | 3  |
| Kernel SOMQRT |             |      |                         |           |    |
| CPU:          | 8.5 GFlop/s | GPU: | 227 GFlop/s             | Speed-up: | 27 |
| Kernel SSSMQ  |             |      |                         |           |    |
| CPU:          | 10 GFlop/s  | GPU: | 285 GFlop/s             | Speed-up: | 28 |

#### Consequences

- Task distribution
  - SGEQRT: 20% Tasks on GPU
  - SSSMQ: 92% tasks on GPU
- Taking advantage of heterogeneity!
  - Only do what you are good for
  - Don't do what you are not good for





# Programming with StarPU

Innía

O. Aumage - StarPU Runtime

# Terminology

- Codelet
- Task
- Data handle



- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- .... defines characteristics common to a set of tasks

Innía

A Codelet...

- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- .... defines characteristics common to a set of tasks

Codelet scal\_cl





A Codelet...

- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- ... defines characteristics common to a set of tasks

Codelet scal\_cl



- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- ... defines characteristics common to a set of tasks





- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- ... defines characteristics common to a set of tasks





- ... relates an abstract computation kernel to its implementation(s)
- .... can be instantiated into one or more tasks
- ... defines characteristics common to a set of tasks





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output

Innía

- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





#### **Definition: A Task**

A Task...

- ... is an instantiation of a Codelet
- ... atomically executes a kernel from its beginning to its end
- ... receives some input
- ... produces some output





#### **Definition: A Data Handle**

#### A Data Handle...

- ... designates a piece of data managed by StarPU
- ... is typed (vector, matrix, etc.)
- ${\scriptstyle \bullet}$  ... can be passed as input/output for a  ${\bf Task}$

Innía

#### **Elementary API**

- Declaring a codelet
- Declaring and Managing Data
- Writing a Kernel Function
- Submitting a task
- Waiting for submitted tasks



```
1 struct starpu_codelet scal_cl = {
2 ...
3 };
```



- Plug the kernel function
  - Here: scal\_cpu\_func

```
struct starpu_codelet scal_cl = {
    cpu_func = { scal_cpu_func, NULL },
    ...
};
```

- Plug the kernel function
  - Here: scal\_cpu\_func
- Declare the number of data pieces used by the kernel
  - Here: A single vector

```
struct starpu_codelet scal_cl = {
    cpu_func = { scal_cpu_func, NULL },
    ...
    hbuffers = 1,
    ...
};
```

- Plug the kernel function
  - Here: scal\_cpu\_func
- Declare the number of data pieces used by the kernel
  - Here: A single vector
- Declare how the kernel accesses the piece of data
  - $-\,$  Here: The vector is scaled in-place, thus  ${\sf R}/{\sf W}$

```
struct starpu_codelet scal_cl = {
    cpu_func = { scal_cpu_func, NULL },
    .nbuffers = 1,
    .modes = { STARPU_RW },
```



Put data under StarPU control

Initialize a piece of data

```
1 float vector[NX];
2 /* ... fill data ... */
```



- Initialize a piece of data
- Register the piece of data and get a handle
  - The vector is now under StarPU control

- Initialize a piece of data
- Register the piece of data and get a handle
  - The vector is now under StarPU control
- Use data through the handle



- Initialize a piece of data
- Register the piece of data and get a handle
  - The vector is now under StarPU control
- Use data through the handle
- Unregister the piece of data
  - The handle is destroyed
  - The vector is now back under user control



Every kernel function has the same C prototype

```
void scal_cpu_func(void *buffers[], void *cl_arg) {
    ...
}
```



- Every kernel function has the same C prototype
- Retrieve the vector's handle

```
void scal_cpu_func(void *buffers[], void *cl_arg) {
    struct starpu_vector_interface *vector_handle = buffers
    [0];
    ...
  }
```

- Every kernel function has the same C prototype
- Retrieve the vector's handle
- Get vector's number of elements and base pointer

```
void scal_cpu_func(void *buffers[], void *cl_arg) {
    struct starpu_vector_interface *vector_handle = buffers
    [0];
    unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
    float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
    ...
    ...
    }
}
```



- Every kernel function has the same C prototype
- Retrieve the vector's handle
- Get vector's number of elements and base pointer
- Get the scaling factor as inline argument

```
void scal_cpu_func(void *buffers[], void *cl_arg) {
1
      struct starpu_vector_interface *vector_handle = buffers
2
          [0];
3
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
4
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
5
6
      float *ptr_factor = cl_arg;
7
8
9
 }
10
```

- Every kernel function has the same C prototype
- Retrieve the vector's handle
- Get vector's number of elements and base pointer
- Get the scaling factor as inline argument
- Compute the vector scaling

```
void scal_cpu_func(void *buffers[], void *cl_arg) {
1
      struct starpu_vector_interface *vector_handle = buffers
2
           [0];
3
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
4
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
5
6
      float *ptr_factor = cl_arg;
7
8
      unsigned i;
9
      for (i = 0; i < n; i++)
10
          vector[i] *= *ptr_factor:
11
12
```



The starpu\_task\_insert call

Inserts a task in the StarPU DAG



The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

The codelet structure

The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data

```
starpu_task_insert(&scal_cl,
    STARPU_RW, vector_handle,
    ...);
```



The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data

```
starpu_task_insert(&scal_cl,
STARPU_RW, vector_handle,
STARPU_VALUE, &factor, sizeof(factor),
...);
```



The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data
- 0 to mark the end of arguments

```
starpu_task_insert(&scal_cl,
    STARPU_RW, vector_handle,
    STARPU_VALUE, &factor, sizeof(factor),
    0);
```



The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data
- 0 to mark the end of arguments

Notes

The task is submitted non-blockingly

Innía

The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data
- 0 to mark the end of arguments

Notes

- The task is submitted non-blockingly
- Dependencies are enforced with previously submitted tasks' data...

Innia

The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data
- 0 to mark the end of arguments

#### Notes

- The task is submitted non-blockingly
- Dependencies are enforced with previously submitted tasks' data...
- ... following the natural order of the program

The starpu\_task\_insert call

Inserts a task in the StarPU DAG

Arguments

- The codelet structure
- The StarPU-managed data
- The small-size inline data
- 0 to mark the end of arguments

Notes

- The task is submitted non-blockingly
- Dependencies are enforced with previously submitted tasks' data...
- ... following the natural order of the program
- This is the Sequential Task Flow Paradigm



Tasks are submitted non-blockingly

(nría\_

Tasks are submitted non-blockingly

```
1 /* non-blocking task submits */
2 starpu_task_insert(...);
3 ...
```

Innía

- Tasks are submitted non-blockingly
- Wait for all submitted tasks to complete their work

```
1 /* non-blocking task submits */
2 starpu_task_insert(...);
3 ...
```



- Tasks are submitted non-blockingly
- Wait for all submitted tasks to complete their work

```
1 /* non-blocking task submits */
2 starpu_task_insert(...);
3 ...
4
5 /* wait for all task submitted so far */
6 starpu_task_wait_for_all();
```



```
1 float factor = 3.14;
2 float vector[NX];
```

Inría

```
1 float factor = 3.14;
2 float vector[NX];
3 starpu_data_handle_t vector_handle;
```

Innía

Innia

```
_1 float factor = 3.14:
2 float vector[NX];
3 starpu_data_handle_t vector_handle;
4
 /* ... fill vector ... */
5
7 starpu_vector_data_register(&vector_handle, 0,
                         (uintptr_t)vector, NX, sizeof(vector[0]))
8
9
10
  starpu_task_insert(
                   &scal cl.
11
                   STARPU RW. vector handle.
12
                   STARPU_VALUE, &factor, sizeof(factor),
13
                   0);
14
```



```
_1 float factor = 3.14:
2 float vector[NX];
3 starpu_data_handle_t vector_handle;
4
  /* ... fill vector ... */
5
7 starpu_vector_data_register(&vector_handle, 0,
                         (uintptr_t)vector, NX, sizeof(vector[0]))
8
9
10
  starpu_task_insert(
                   &scal cl.
11
                   STARPU RW. vector handle.
                   STARPU_VALUE, &factor, sizeof(factor),
13
                   0):
14
15
16 starpu_task_wait_for_all();
```

```
_1 float factor = 3.14:
2 float vector[NX];
3 starpu_data_handle_t vector_handle;
4
5 /* ... fill vector ... */
7 starpu_vector_data_register(&vector_handle, 0,
                         (uintptr_t)vector, NX, sizeof(vector[0]))
8
9
10
  starpu_task_insert(
                   &scal cl.
11
                   STARPU RW. vector handle.
                   STARPU_VALUE, & factor, sizeof(factor),
13
                   0):
14
15
  starpu_task_wait_for_all();
16
17 starpu_data_unregister(vector_handle);
18
19 /* ... display vector ... */
```



### Heterogeneity: Device Kernels

Extending a codelet to handle heterogeneous platforms

Ínría

### Heterogeneity: Device Kernels

Extending a codelet to handle heterogeneous platforms

- Multiple kernel implementations for a CPU
  - SSE, AVX, ... optimized kernels

```
struct starpu_codelet scal_cl = {
    cpu_func = { scal_cpu_func,
        scal_sse_cpu_func, scal_avx_cpu_func, NULL },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};
```

### Heterogeneity: Device Kernels

Extending a codelet to handle heterogeneous platforms

- Multiple kernel implementations for a CPU
  - SSE, AVX, ... optimized kernels
- Kernels implementations for accelerator devices
  - OpenCL, NVidia Cuda kernels

```
struct starpu codelet scal cl = {
1
      .cpu_func = \{ scal_cpu_func, \}
2
               scal_sse_cpu_func, scal_avx_cpu_func, NULL },
3
      .opencl_func = { scal_cpu_opencl, NULL },
4
      .cuda_func = \{ scal_cpu_cuda, NULL \},
5
      . nbuffers = 1.
6
      .modes = \{ STARPU_RW \},
7
8
 };
```

Ínría\_

```
1
2
3
5
6
7
  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
8
       struct starpu_vector_interface *vector_handle = buffers
9
           [0];
       unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
10
       float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
11
       float *ptr_factor = cl_arg;
12
13
14
       . . .
15
16
17
18
19
```



```
1
2
3
4
5
6
7
  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
8
      struct starpu_vector_interface *vector_handle = buffers
9
           [0];
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
10
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
11
      float *ptr_factor = cl_arg;
12
13
      unsigned threads_per_block = 64;
14
      unsigned nblocks = (n+threads_per_block-1)/
15
           threads_per_block;
16
17
       . . .
18
19
  ł
```

```
1
2
3
5
6
  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
8
      struct starpu_vector_interface *vector_handle = buffers
9
           [0];
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
10
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
11
      float *ptr_factor = cl_arg;
12
13
      unsigned threads_per_block = 64;
14
      unsigned nblocks = (n+threads_per_block-1)/
15
           threads per block;
16
      vector_mult_cuda<<<nblocks,threads_per_block,0,
17
           starpu_cuda_get_local_stream()>>>(n, vector ,*
18
               ptr factor);
19
  }
```

```
static __global__ void vector_mult_cuda(unsigned n,
1
                                      float *vector, float factor
2
      unsigned i = blockldx.x*blockDim.x + threadldx.x;
3
4
5
  }
6
7
  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
8
      struct starpu_vector_interface *vector_handle = buffers
9
           [0];
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
10
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
11
      float *ptr factor = cl arg;
12
13
      unsigned threads per block = 64;
14
      unsigned nblocks = (n+threads_per_block-1)/
15
          threads per block;
16
      vector_mult_cuda<<<nblocks,threads_per_block,0,
17
          starpu_cuda_get_local_stream()>>>(n, vector ,*
18
               ptr factor);
```

```
static __global__ void vector_mult_cuda(unsigned n,
1
                                      float *vector, float factor
2
      unsigned i = blockldx.x*blockDim.x + threadldx.x;
3
      if (i < n)
4
          vector[i] *= factor;
5
  }
6
7
  extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
8
      struct starpu_vector_interface *vector_handle = buffers
9
           [0];
      unsigned n = STARPU_VECTOR_GET_NX(vector_handle);
10
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
11
      float *ptr factor = cl arg;
12
13
      unsigned threads per block = 64;
14
      unsigned nblocks = (n+threads_per_block-1)/
15
          threads per block;
16
      vector mult cuda<<<nblocks,threads per block,0,
17
          starpu_cuda_get_local_stream()>>>(n, vector ,*
18
               ptr factor);
```



# **StarPU Internals**

Ínría O. Aumage - StarPU Runtime

# StarPU Internal Structure









e – StarPU Runtime – 4. StarPU Internals

69































# **Scheduling Policies**

Innía

O. Aumage – StarPU Runtime

### StarPU Scheduling Policies

- No one size fits all policy
- Selectable scheduling policy
  - Predefined set of popular policies: eager, work-stealing, etc.

Inría

### StarPU Scheduling Policies

- No one size fits all policy
- Selectable scheduling policy
  - Predefined set of popular policies: eager, work-stealing, etc.

Going beyond?



### StarPU Scheduling Policies

- No one size fits all policy
- Selectable scheduling policy
  - Predefined set of popular policies: eager, work-stealing, etc.

Going beyond?

Scheduling is a decision process:

- Providing more input to the scheduler...
- .... can lead to better scheduling decisions

What kind of information?

- Relative importance of tasks
  - Priorities
- Cost of tasks
  - Codelet models
- Cost of transferring data
  - Bus calibration



• Use the **STARPU\_SCHED** environment variable

Inría

- Use the STARPU\_SCHED environment variable
- Example 1: selecting the prio scheduler

```
1 $ export STARPU_SCHED=prio
2 $ my_program
3 ...
```



- Use the STARPU\_SCHED environment variable
- Example 1: selecting the prio scheduler
- Example 2: selecting the dm scheduler

```
1 $ export STARPU_SCHED=prio
2 $ my_program
3 ...
```

```
1 $ export STARPU_SCHED=dm
2 $ my_program
3 ...
```



- Use the STARPU\_SCHED environment variable
- Example 1: selecting the prio scheduler
- Example 2: selecting the dm scheduler
- Example 3: resetting to default scheduler eager

```
1 $ export STARPU_SCHED=prio
2 $ my_program
3 ...
```

```
1 $ export STARPU_SCHED=dm
2 $ my_program
3 ...
```

```
1 $ unset STARPU_SCHED
2 $ my_program
3 ...
```

- Use the STARPU\_SCHED environment variable
- Example 1: selecting the prio scheduler
- Example 2: selecting the dm scheduler
- Example 3: resetting to default scheduler eager
- No need to recompile the application

```
1 $ export STARPU_SCHED=prio
2 $ my_program
3 ...
```

```
1 $ export STARPU_SCHED=dm
2 $ my_program
3 ...
```

```
1 $ unset STARPU_SCHED
2 $ my_program
3 ...
```



### Task Mapping using a Performance Model



Innía

### Task Mapping using a Performance Model

- Using codelet performance models
  - Kernel calibration on each available computing device
  - Raw history model of kernels' past execution times
  - Refined models using regression on kernels' execution times history
- Model parameter(s)
  - Data size
  - User-defined parameters





# Data Management

Ínría

O. Aumage – StarPU Runtime



























GEMM



Handles dependencies

































Handles dependencies









Handles dependencies

















- Handles dependencies
- Handles scheduling (policy)









- Handles dependencies
- Handles scheduling (policy)









- Handles dependencies
- Handles scheduling (policy)
- Handles data consistency (MSI protocol)











- Handles dependencies
- Handles scheduling (policy)
- Handles data consistency (MSI protocol)





- Handles dependencies
- Handles scheduling (policy)
- Handles data consistency (MSI protocol)









- Handles dependencies
- Handles scheduling (policy)
- Handles data consistency (MSI protocol)









- Handles dependencies
- Handles scheduling (policy)
- Handles data consistency (MSI protocol)







## **Distributed Shared Memory Consistency**

### MSI Protocol

- M: Modified
- S: Shared
- I: Invalid



Μ

SS

## **Distributed Shared Memory Consistency**

### MSI Protocol

- M: Modified
- S: Shared
- I: Invalid





## **Distributed Shared Memory Consistency**

### MSI Protocol

- M: Modified
- S: Shared
- I: Invalid



## Data Transfer Cost Modelling for Improved Scheduling

Discrete accelerators

- CPU  $\leftrightarrow$  GPU transfers
- Data transfer cost vs kernel offload benefit

Innía

## Data Transfer Cost Modelling for Improved Scheduling

Discrete accelerators

- $\blacksquare$  CPU  $\leftrightarrow$  GPU transfers
- Data transfer cost vs kernel offload benefit

### Transfer cost modelling

- Bus calibration
  - Can differ even for identical devices
  - Platform's topology

Innía

# Data Transfer Cost Modelling for Improved Scheduling

Discrete accelerators

- $\blacksquare$  CPU  $\leftrightarrow$  GPU transfers
- Data transfer cost vs kernel offload benefit

#### Transfer cost modelling

- Bus calibration
  - Can differ even for identical devices
  - Platform's topology

#### Data-transfer aware scheduling

- Deque Model Data Aware (dmda) scheduling policy variants
- Tunable data transfer cost bias
  - locality
  - vs load balancing



## **Data Prefetching**

Task states

- Submitted
  - Task inserted by the application
- Ready
  - Task's dependencies resolved
- Scheduled
  - Task queued on a computing unit
- Executing
  - Task running on a computing unit

Innía

## **Data Prefetching**

Task states

- Submitted
  - Task inserted by the application
- Ready
  - Task's dependencies resolved
- Scheduled
  - Task queued on a computing unit
- Executing
  - Task running on a computing unit

Anticipate on the  $\textbf{Scheduled} \rightarrow \textbf{Executing}$  transition

Prefetch triggered ASAP after Scheduled state

Innía

## **Data Prefetching**

Task states

- Submitted
  - Task inserted by the application
- Ready
  - Task's dependencies resolved
- Scheduled
  - Task queued on a computing unit
- Executing
  - Task running on a computing unit

Anticipate on the  $\textbf{Scheduled} \rightarrow \textbf{Executing}$  transition

- Prefetch triggered ASAP after Scheduled state
- Prefetch may also be triggered by the application



- Vector
- Matrix
- BCSR sparse matrix

```
int vector[NX];
starpu_data_handle_t handle;
starpu_vector_data_register(&handle, 0, (uintptr_t)vector,
NX, sizeof(vector[0]));
```



- Vector
- Matrix
- BCSR sparse matrix

```
float matrix [NX*NY];
starpu_data_handle_t handle;
starpu_matrix_data_register(&handle, 0, (uintptr_t)matrix,
NX, NX, NY, sizeof(matrix[0]));
```



- Vector
- Matrix
- BCSR sparse matrix

```
1 ...
2 starpu_data_handle_t handle;
4 starpu_bcsr_data_register(&handle, 0, NNZ, NROW,
5 (uintptr_t)bcsr_matrix_data,
6 bcsr_matrix_indices, bcsr_matrix_rowptr,
6 first_entry,
7 BLOCK_NROW, BLOCK_NCOL, sizeof(double));
```



- Vector
- Matrix
- BCSR sparse matrix
- Extensible data type set
  - You can write your own, specifically tailored data type

Innía

- Vector
- Matrix
- BCSR sparse matrix
- Extensible data type set
  - You can write your own, specifically tailored data type
- Only the byte size and the shape of data matter, not the actual element type (integer, float, double precision float, ...)

Innía

Splitting a piece of managed data into several handles

- Granularity adjustment
- Notion of filter

Ínría

Splitting a piece of managed data into several handles

- Granularity adjustment
- Notion of filter

### Partition

```
int vector[NX];
2 starpu_data_handle_t handle;
3 starpu_vector_data_register(&handle, 0, (uintptr_t)vector,
                               NX, sizeof(vector[0]));
4
5
  /* Partition the vector in NB_PARTS sub-vectors */
7 struct starpu_data_filter filter = {
      .filter_func = starpu_vector_filter_block ,
8
      .nchildren = NB PARTS
9
  }:
10
  starpu_data_partition(handle, &filter);
11
12
  /* Data can only be accessed through sub-handles now */
13
```



Splitting a piece of managed data into several handles

- Granularity adjustment
- Notion of filter

 $\mathsf{Partition} \to \mathsf{Use}$ 

```
for (i=0; i<starpu_data_get_nb_children(handle); i++) {</pre>
1
      /* Get subdata number i */
2
      starpu_data_handle_t sub_handle =
3
           starpu_data_get_sub_data(handle, 1, i);
4
5
      starpu_task_insert(
6
           &scal_cl,
7
            STARPU_RW, sub_handle,
8
            STARPU_VALUE, & factor, sizeof(factor),
9
            0):
10
11
```



Splitting a piece of managed data into several handles

- Granularity adjustment
- Notion of filter

 $\mathsf{Partition} \to \mathsf{Use} \to \mathsf{Unpartition}$ 

```
1 /* Wait for submitted tasks to complete */
2 starpu_task_wait_for_all();
3 /* Unpartition data */
5 starpu_data_unpartition(handle, 0);
6 /* Data can now be accessed through 'handle' only */
```



## **Asynchronous Partitioning**

Inserting a partitioning request in the submission flow

Two steps



# Asynchronous Partitioning

Inserting a partitioning request in the submission flow

Two steps

Partition planning

```
int vector[NX];
2 starpu_data_handle_t handle;
3 starpu_vector_data_register(&handle, 0, (uintptr_t)vector,
                               NX, sizeof(vector[0]));
4
5
  /* Partition the vector in NB_PARTS sub-vectors */
7 struct starpu_data_filter filter = {
      .filter_func = starpu_vector_filter_block ,
8
      . nchildren = NB PARTS
9
  }:
10
  starpu_data_handle_t children [NB_PARTS];
11
  starpu_data_partition_plan(handle, &filter, children);
12
13
  /* Data can only be accessed through sub-handles now */
14
```



# Asynchronous Partitioning

Inserting a partitioning request in the submission flow

Two steps

- Partition planning
- Asynchronous partition inforcement

```
starpu_task_insert(&scal_cl ,
1
       STARPU RW. handle.
2
       STARPU_VALUE, &factor1, sizeof(factor1), 0);
3
4 starpu_data_partition_submit(handle, NB_PARTS, children);
  for (i=0; i < NB_PARTS; i++) {
5
      starpu_task_insert(&scal_cl,
6
           STARPU_RW, children[i],
7
           STARPU_VALUE, & factor2, sizeof(factor2),
8
           0):
9
10
  starpu_data_unpartition_submit(handle, NB_PARTS, children,
11
      node);
12 starpu_task_insert(&scal_cl ,
       STARPU RW, handle,
13
       STARPU_VALUE, &factor3, sizeof(factor3), 0);
14
```



## Reduction

Merge contributions from a set of tasks into a single buffer

- Define neutral element initializer
- Define reduction operator



## Reduction

Merge contributions from a set of tasks into a single buffer

- Define neutral element initializer
- Define reduction operator

Define zero

```
void bzero_cpu(void *descr[], void *cl_arg) {
1
      double *v_zero = (double *)STARPU_VARIABLE_GET_PTR(descr
2
          [0]);
      *v_zero = 0.0;
3
 }
4
5
6 struct starpu_codelet bzero_cl = {
      .cpu_funcs = { bzero_cpu, NULL },
7
      . nbuffers = 1
8
9
 };
```



## Reduction

Merge contributions from a set of tasks into a single buffer

- Define neutral element initializer
- Define reduction operator

 $\mathsf{Define}\ \mathsf{zero}\ \rightarrow\ \mathsf{Define}\ \mathsf{op}$ 

```
void accumulate_cpu(void *descr[], void *cl_arg) {
1
      double *v_dst = (double *)STARPU_VARIABLE_GET_PTR(descr
2
           [0]):
      double *v_src = (double *)STARPU_VARIABLE_GET_PTR(descr
3
          [1]):
      *v_dst = *v_dst + *v_src;
4
  }
5
6
  struct starpu_codelet accumulate_cl = {
7
      .cpu_funcs = { accumulate_cpu, NULL },
8
      . nbuffers = 1
9
10
  };
```



## Reduction

Merge contributions from a set of tasks into a single buffer

- Define neutral element initializer
- Define reduction operator

 $\mathsf{Define}\ \mathsf{zero}\ \to\ \mathsf{Define}\ \mathsf{op}\ \to\ \mathsf{Reduce}\ \mathsf{task}\ \mathsf{contributions}$ 

```
starpu_variable_data_register(&accum_handle, -1,
                                 NULL, sizeof(type));
2
3 starpu_data_set_reduction_methods(accum_handle,
                                    &accumulate_cl, &bzero_cl);
4
5
  for (b = 0; b < nblocks; b++)
6
      starpu_task_insert(&dot_kernel_cl,
7
          STARPU_REDUX , accum_handle ,
8
          STARPU_R, starpu_data_get_sub_data(v1, 1, b),
9
          STARPU_R, starpu_data_get_sub_data(v2, 1, b),
10
          0);
11
```

Innía

### **Commutative Write Accesses**

- Write accesses enforce sequential consistency by default
  - To strong for some kind of workloads
  - N-body, unstructured meshes

Innía

### **Commutative Write Accesses**

- Write accesses enforce sequential consistency by default
  - To strong for some kind of workloads
  - N-body, unstructured meshes





### **Commutative Write Accesses**

- Write accesses enforce sequential consistency by default
  - To strong for some kind of workloads
  - N-body, unstructured meshes
- . Commute: allows a set of tasks to modify a buffer in any order

```
starpu_task_insert(&cl1 ,
1
      STARPU_R, handle0,
2
      STARPU_RW, handle,
3
      0);
4
  starpu_task_insert(&cl2 ,
5
      STARPU_R, handle1,
6
      STARPU_RW | STARPU_COMMUTE, handle,
7
      0);
8
  starpu_task_insert(&cl2 ,
q
      STARPU R. handle2.
10
      STARPU_RW | STARPU_COMMUTE, handle,
11
      0);
12
  starpu_task_insert(&cl3 ,
13
      STARPU_R, handle3,
14
      STARPU_RW, handle,
15
      0);
16
```



# **Analysis and Monitoring**

Innía

O. Aumage – StarPU Runtime

### Feedback mechanisms

#### Online Tools

- Statistics
- Visual debugging

### Offline Tools

Trace-based analysis

Innía

### Offline Trace-Based Feedback

- FxT trace collection
- Trace analysis and display
  - ViTE Gantt
  - Graphviz DAG
  - R plots

Innía

# Offline Feedback – Trace Analysis

Automatically generated

- Dependency graph (DAG)
- Activity diagramm (GANTT)
  - Visualize with ViTE



Innía

### Offline Feedback – Kernel Model

Display the codelet performance models recorded by StarPU

- Command-line tool starpu\_perfmodel\_display
- History-based models
- Regression-based models

Innía

### Offline Feedback – Kernel Model

Display the codelet performance models recorded by StarPU

- Command-line tool starpu\_perfmodel\_display
- History-based models
- Regression-based models

| 1 | <pre>\$ starpu_perfmodel_display -s starpu_slu_lu_model_11</pre> |         |              |              |    |  |
|---|------------------------------------------------------------------|---------|--------------|--------------|----|--|
| 2 | _                                                                |         |              |              |    |  |
| 3 | performance model <b>for</b> cpu0_parallel1_impl0                |         |              |              |    |  |
| 4 | # hash                                                           | size    | mean (us)    | stddev (us)  | n  |  |
| 5 | aa6d4ef7                                                         | 4194304 | 3.055501e+05 | 5.804822e+04 | 48 |  |

### **Offline Feedback – Kernel Model Characteristics**



### Offline Feedback – Kernel Model Regression Fitness





### Offline Feedback – Synthetic Kernels' Behaviour



Data trace

Ínría\_

O. Aumage - StarPU Runtime - 7. Analysis and Monitoring



# **Distributed Computing**

Innía

O. Aumage – StarPU Runtime

# **Distributed Support**

#### Sequential Task Flow Paradigm on Clusters

#### Each node unrolls the sequential task flow

#### Data↔Node Mapping

- Provided by the application
- Can be altered dynamically





## **Distributed Support**

Sequential Task Flow Paradigm on Clusters

Each node unrolls the sequential task flow

#### Inter-node dependence management

- Inferred from the task graph edges
- Automatic Isend and Irecv calls







# **Distributed Support**

Sequential Task Flow Paradigm on Clusters

#### Each node unrolls the sequential task flow

#### Task↔Node Mapping

- Inferred from data location:
  - Tasks move to data they modify
- No global scheduling
- No synchronizations

#### Optimization

Local DAG pruning



# Ínría\_

# **Distributed Scalability Study Results**

Chameleon linear algebra library (Inria Team HiePACS)

Heterogeneous cluster: 1152 CPU cores+288 GPUs



IEEE TPDS Paper: DOI: 10.1109/TPDS.2017.2766064 — https://hal.inria.fr/hal-01618526





# Interoperability and Composition

Inría

O. Aumage – StarPU Runtime

Rationale



#### Rationale

Sharing computing resources...



#### Rationale

- Sharing computing resources...
- ... among multiple DAGs

Inría

#### Rationale

- Sharing computing resources...
- ... among multiple DAGs
- ... simultaneously



#### Rationale

- Sharing computing resources...
- ... among multiple DAGs
- ... simultaneously

#### **Scheduling Contexts**





#### Rationale

- Sharing computing resources...
- ... among multiple DAGs
- ... simultaneously

#### Scheduling Contexts

Map DAGs on subsets of computing units





#### Rationale

- Sharing computing resources...
- ... among multiple DAGs
- ... simultaneously

#### Scheduling Contexts

- Map DAGs on subsets of computing units
- Isolate competing kernels or library calls
  - OpenMP kernel, Intel MKL, etc.





#### Rationale

- Sharing computing resources...
- ... among multiple DAGs
- ... simultaneously

#### Scheduling Contexts

- Map DAGs on subsets of computing units
- Isolate competing kernels or library calls
  - OpenMP kernel, Intel MKL, etc.
- Select scheduling policy per context





















Inría

How to Make Runtimes, Libs Cooperate?



# How to Make Runtimes, Libs Cooperate?

- Project INTERTWinE (EU H2020, 3-years, 2015-2018)
  - Task-based runtimes: StarPU, OmpSs, PaRSEC, OpenMP
  - Networking APIs: MPI, GASPI
  - Libraries: Plasma, DPlasma
  - Applications





# How to Make Runtimes, Libs Cooperate?

- Project INTERTWinE (EU H2020, 3-years, 2015-2018)
  - Task-based runtimes: StarPU, OmpSs, PaRSEC, OpenMP
  - Networking APIs: MPI, GASPI
  - Libraries: Plasma, DPlasma
  - Applications

#### Cooperative resource allocation and management

- Cores
- Accelerators
- Memory
- Pinned memory segments
- ...

#### www.intertwine-project.eu







# **Resource Management APIs**

#### Olivier Aumage (Inria), Vicenç Beltran & Xavier Teruel (BSC)

# http://www.intertwine-project.eu



This project is funded from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671602.

## **INTERTWINE**

#### **Interoperability between programming models** for scalable performance on extreme-scale supercomputers



## **Computational Resource Management Objectives**

• Implement a **Resource Management API** to share computing resources between parallel applications, libraries and runtime systems



• Fork-join pattern





## Motivation Sequential applications + parallel libraries

• Fork-join pattern







## Motivation Sequential applications + parallel libraries

- Fork-join pattern
- No over-subscription, but most CPUs underutilized on sequential parts







## Motivation Parallel application + parallel libraries





## Motivation Parallel application + parallel libraries

Uncoordinated access to CPU cores







## Motivation Parallel application + parallel libraries

- Uncoordinated access to CPU cores
- Oversubscription
  - Cache pollution
  - Higher number of context switches







#### **Computational Resource Sharing**

- Multiple codes compete for CPU cores, accelerator devices on cluster nodes
  - Application threads
  - Numerical libraries threads
  - Runtime systems threads
  - Communication library threads



#### **Computational Resource Sharing**

- Multiple codes compete for CPU cores, accelerator devices on cluster nodes
- Application threads
- Numerical libraries threads
- Runtime systems threads
- Communication library threads
- Interference leads to resource over-subscription or under-subscription on cluster nodes
- Interoperability?



#### **Computational Resource Sharing**

- Multiple codes compete for CPU cores, accelerator devices on cluster nodes
- Application threads
- Numerical libraries threads
- Runtime systems threads
- Communication library threads
- Interference leads to resource over-subscription or under-subscription on cluster nodes
- Interoperability?
- Need coordinated resource sharing:
- Ability to express general resource needs
- Ability to express dynamic resource requirements:
  - computational-heavy periods, idleness periods



#### **Computational Resource Sharing**

- Multiple codes compete for CPU cores, accelerator devices on cluster nodes
- Application threads
- Numerical libraries threads
- Runtime systems threads
- Communication library threads
- Interference leads to resource over-subscription or under-subscription on cluster nodes
- Interoperability?
- Need coordinated resource sharing:
- Ability to express general resource needs
- Ability to express dynamic resource requirements:
  - computational-heavy periods, idleness periods

#### INTERTWinE Resource Management APIs



#### **Resource Manager Overview**

• Implement a **Resource Manager** to share CPU resources between parallel application, libraries and runtime systems



### **Resource Manager APIs Native offload and resource enforcement API**

**Coordinated execution of a parallel library kernel from a parallel application** 



### **Resource Manager APIs Native offload and resource enforcement API**

#### Coordinated execution of a parallel library kernel from a parallel application





## Resource Manager APIs Native offload and resource enforcement API

• Each runtime has its own (similar) asynchronous API:

```
    Nanos6

    void nanos spawn function (
            void (*function) (void *),
            void *args,
            void (*completion callback)(void *),
            void *completion args,
            char const *label,
            cpu set t *cpu mask)

    StarPU

    void starpurm spawn kernel on cpus callback(
            void *data,
            void(*f)(void *),
            void *args,
            hwloc cpuset t cpuset,
            void(*cb f)(void *),
            void *cb args)
```



### Resource Manager APIs Performance evaluation of Native (and OpenCL) offloading API

- MatMul: 16 CPUs
  - Outermost task: block size 4K, 4 CPUs assigned to each task
  - Innermost task: block size 512 bytes



When there is only one level of tasks, high performance is not RTW.
 achieved until matrix is very big

## Resource Manager APIs Dynamic Resource Sharing (DRS)



### See StarPU dynamic resource management animation



#### **Accelerator Resource Management**

#### • Dynamic Resource Sharing API extended for devices

- Device sharing routines
  - Lend/Reclaim device
  - Acquire/Return device



### **Accelerator Resource Management**

#### Dynamic Resource Sharing API extended for devices

- Device sharing routines
  - Lend/Reclaim device
  - Acquire/Return device
- StarPU's Resource Manager implementation extended to support devices
- Device types supported
  - CUDA devices
  - OpenCL devices
  - (Xeon Phi KNC accelerator devices )



### **Accelerator Resource Management**

#### Dynamic Resource Sharing API extended for devices

- Device sharing routines
  - Lend/Reclaim device
  - Acquire/Return device
- StarPU's Resource Manager implementation extended to support devices
- Device types supported
  - CUDA devices
  - OpenCL devices
  - (Xeon Phi KNC accelerator devices, ...)
- Dynamic notifications
  - Device becoming idle, from the runtime point of view
  - Device becoming needed, from the runtime point of view
  - Could be interfaced with DLB as for the CPU support.



### **INTERTWinE – Resource Management APIs**

#### • Exascale Scheme

- Parallel application + Parallel libraries
- Need for coordinated access to computing resources
- Avoid undersubscription, oversubscription, idleness
- Interoperability



### **INTERTWinE – Resource Management APIs**

### Exascale Scheme

- Parallel application + Parallel libraries
- Need for coordinated access to computing resources
- Avoid undersubscription, oversubscription, idleness
- Interoperability

#### **INTERTWinE Resource Management APIs**

- Kernel offload and resource enforcement APIs
  - Native & via OpenCL
- Dynamic resource sharing API
- (Pause/Resume API)



### **INTERTWinE:**

## **Programming Model INTERoperability ToWards Exascale**

#### Visit http://www.intertwine-project.eu to find out about our:

- Best Practice Guides:
- Writing GASPI-MPI Interoperable Programs
- MPI + OpenMP Programming
- MPI + OmpSs Interoperable Programs
- Open MP/OmpSs/StarPU + Multi-threaded Libraries Interoperable Programs
- "Developer Hub" of resources for developers & application users

...and to sign up for the latest news from INTERTWinE at http://www.intertwine-project.eu/newsletter



http://www.intertwine-project.eu



#### **Advanced Scheduling Topics**

Ínría

O. Aumage – StarPU Runtime

#### Multicore CPUs: Parallel Tasks (T. Cojean)

Kernel sweet spots: example with Cholesky factorization kernels (1x Xeon E5-2680v3 2.5GHz 12 cores)





Rationale

- Run parallel kernels on multiple CPU cores
- Address CPU/GPU computing power imbalance
- Address nested-runtime interoperability





Rationale

- Run parallel kernels on multiple CPU cores
- Address CPU/GPU computing power imbalance
- Address nested-runtime interoperability

Reduce computing power imbalance between CPU and GPU

- Big kernel for GPU
- Small kernel for a single CPU core
- Run "bigger" kernel on several CPU cores



Rationale

- Run parallel kernels on multiple CPU cores
- Address CPU/GPU computing power imbalance
- Address nested-runtime interoperability

Reduce computing power imbalance between CPU and GPU

- Big kernel for GPU
- Small kernel for a single CPU core
- Run "bigger" kernel on several CPU cores

Make use of existing parallel kernels/codes

- Interoperability
- Libraries: BLAS, FFT, ...
- OpenMP code



Two flavors of parallel tasks



Two flavors of parallel tasks

Fork-mode

StarPU provides threads on the participating cores

Inría

Two flavors of parallel tasks

#### Fork-mode

StarPU provides threads on the participating cores

#### SPMD-mode

- StarPU launches the task on a single core
- . . . . and let the task create its own threads
  - Black-box mode



Two flavors of parallel tasks

#### Fork-mode

StarPU provides threads on the participating cores

#### SPMD-mode

- StarPU launches the task on a single core
- . . . . and let the task create its own threads
  - Black-box mode

Locality enforcement in NUMA context

Combined worker threads



### Submission-side Task Flow Optimizations

- Global task-graph pruning in distributed computing sessions
- Memory subscription control



### **Distributed Scalability Study Results**

Chameleon linear algebra library (Inria Team HiePACS)

Heterogeneous cluster: 1152 CPU cores+288 GPUs



IEEE TPDS Paper: DOI: 10.1109/TPDS.2017.2766064 — https://hal.inria.fr/hal-01618526



### **Distributed Support**

Sequential Task Flow Paradigm on Clusters

#### Each node unrolls the sequential task flow

#### Task↔Node Mapping

- Inferred from data location:
  - Tasks move to data they modify
- No global scheduling
- No synchronizations

#### Optimization

Local DAG pruning





#### **Global Task-Graph Pruning Issue**







### **Unbounded Task Submission Issue**





### Implementing Some Scheduling Lookahead Window

Control of the task submission flow

#### Memory tracking

- Account the memory subscription
- Task submission throttling
  - Blocking mechanism of the task submission flow
  - Allows the task submission to be controlled by an external criteria
- A control policy which uses the memory tracking to throttle the task submission flow

Inría

### Memory Behaviour Without Memory Control





### Memory Behaviour With Memory Control







# **Advanced Data Management Topics**

Innía

O. Aumage – StarPU Runtime

**Advanced Data Management** 

Ínría\_

### **Advanced Data Management**

Heterogeneous data layout

Multiformat support



### **Advanced Data Management**

Heterogeneous data layout

Multiformat support

Large workloads

Out-of-core support

Innía

### **Data Layout**

Heterogeneous platforms

- Heterogeneous data layout requirements
- Example:
  - Arrays of Structures (AoS), for CPU cache locality
  - vs Structures of Arrays (SoA), for GPU coalesced memory accesses
  - vs Arrays of Structures of Arrays (AoSoA), for MIC/Xeon Phi
  - $-\ldots$  any other data layout

Innía

### Data Layout

Heterogeneous platforms

- Heterogeneous data layout requirements
- Example:
  - Arrays of Structures (AoS), for CPU cache locality
  - vs Structures of Arrays (SoA), for GPU coalesced memory accesses
  - vs Arrays of Structures of Arrays (AoSoA), for MIC/Xeon Phi
  - $\ \ldots$  any other data layout

StarPU enables Multiformat kernel implementations

- User-provided data layout conversion codelets...
- ... automatically called upon transfers between devices



### Multiformat

#### Example

Declare conversion codelets

```
/* Conversion codelets */
2 struct starpu_multiformat_data_interface_ops format_ops = {
      .cuda_elemsize = 2 * sizeof(float),
3
      .cpu_to_cuda_cl = &cpu_to_cuda_cl,
4
5
      .cuda_to_cpu_cl = &cuda_to_cpu_cl,
6
      .cpu_elemsize = 2 * sizeof(float),
7
8
9
  };
10
  /* Multiformat handle registration */
11
  starpu_multiformat_data_register(handle, 0,
12
                         &array_of_structs, NX, &format_ops);
13
```



### Multiformat

Example

- Declare conversion codelets
- Array of structures for CPU

```
/* CPU Computation Kernel */
1
2
a void
  multiformat_scal_cpu_func(void *buffers[],void *cl_arg) {
4
      struct point *aos;
5
      unsigned int n;
6
7
      aos = STARPU_MULTIFORMAT_GET_CPU_PTR(buffers[0]);
8
      n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
9
10
       . . .
11
  ł
```

Innía

### Multiformat

Example

- Declare conversion codelets
- Array of structures for CPU
- Structure of arrays for NVidia CUDA GPU

```
/* GPU Computation Kernel */
1
2
  extern "C" void
3
  multiformat_scal_cuda_func(void *buffers[],void *cl_arg) {
4
      unsigned int n;
5
      struct struct_of_arrays *soa;
6
7
      soa = (struct struct_of_arrays *)
8
                  STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
9
      n = STARPU MULTIFORMAT GET NX(buffers[0]);
10
11
12
       . . .
13
  ł
```



Using disks as StarPU memory nodes

Out-of-Core



Using disks as StarPU memory nodes

- Out-of-Core
- Enable StarPU to evict temporarily unused data to disk

Innía

Using disks as StarPU memory nodes

Out-of-Core



Using disks as StarPU memory nodes

- Out-of-Core
- Enable StarPU to evict temporarily unused data to disk

Innía

Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance



Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance





Integration with general StarPU's memory management layer

- StarPU data handles
- Task dependencies
- Multiple I/O drivers supported

- Out-of-core / swap
- Mitigated startup load / solution output
- Building block for fault tolerance







# **Advanced Analysis and Monitoring Topics**

Innía

O. Aumage – StarPU Runtime

- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_task_insert(...);
starpu_task_insert(...);
...
starpu_task_wait_for_all();
...
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
starpu_bound_print_lp();
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs
  - Generate a Linear Programming problem...
    - ... to be solved externally (Ip\_solve, etc.)

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
starpu_bound_print_lp();
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_task_insert(...);
starpu_task_insert(...);
...
starpu_task_wait_for_all();
...
...
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
starpu_bound_print_lp();
...
```



- ... on Execution Time
  - Have realistic expectations from the scheduler
  - Identify issues
    - Abnormal overhead
    - Bugs
  - Generate a Linear Programming problem...
    - ... to be solved externally (Ip\_solve, etc.)

```
int ret = starpu_init(NULL);
...
starpu_bound_start();
starpu_task_insert(...);
starpu_task_insert(...);
...
r starpu_task_wait_for_all();
starpu_bound_stop();
starpu_bound_print_lp();
...
```



#### Simulation with SimGrid

#### Scheduling without executing kernels

- Requires the SimGrid simulation environment
- Enables simulating large-scale scenarios
  - Large data sets
  - Large simulated hardware plaform
- Relies on real performance models...
- ... collected by StarPU on a real machine
- Enables fast experiments when designing application algorithms
- Enables fast experiments when designing scheduling algorithms

\$\$STARPU\_DIR/configure ---enable-simgrid [... other opts ...]
 ...



#### Simulation with SimGrid

#### Scheduling without executing kernels

- Requires the SimGrid simulation environment
- Enables simulating large-scale scenarios
  - Large data sets
  - Large simulated hardware plaform
- Relies on real performance models...
- ... collected by StarPU on a real machine
- Enables fast experiments when designing application algorithms
- Enables fast experiments when designing scheduling algorithms

\$\$STARPU\_DIR/configure — enable-simgrid [... other opts ...]
 ...



#### Simulation with SimGrid

Scheduling without executing kernels

- Requires the SimGrid simulation environment
- Enables simulating large-scale scenarios
  - Large data sets
  - Large simulated hardware plaform
- Relies on real performance models...
- ... collected by StarPU on a real machine
- Enables fast experiments when designing application algorithms
- Enables fast experiments when designing scheduling algorithms

\$\$STARPU\_DIR/configure —enable-simgrid [... other opts ...]
 ...



#### Simulation accuracy with SimGrid



Ínría

#### Simulation with StarPU/SimGrid (L. Stanisic)





## Simulation with StarPU/SimGrid (L. Stanisic)



Ínría\_



Inria O. Aumage - StarPU Runtime

## $\mathsf{StarPU}$

A Unified Runtime System for Heterogeneous Multicore Architectures



# StarPU

A Unified Runtime System for Heterogeneous Multicore Architectures

Programming Model: Async. Task Submission + Inferred Dependencies

Inría

# StarPU

A Unified Runtime System for Heterogeneous Multicore Architectures

Programming Model: Async. Task Submission + Inferred Dependencies Execution Model: Scheduler + Distributed Shared Memory

Inría

# StarPU

A Unified Runtime System for Heterogeneous Multicore Architectures

Programming Model: Async. Task Submission + Inferred Dependencies Execution Model: Scheduler + Distributed Shared Memory

The key combination for:

- Portability
- Control
- Adaptiveness
- Optimization

#### Portability of Performance



# Thanks for your attention. StarPU runtime system

Web Site: http://starpu.gforge.inria.fr/ LGPL License

Open to external contributors

Innia

O. Aumage - StarPU Runtime