StarPU

Runtime Systems for Heterogeneous Platform Programming

StarPU Tutorial - Maison de la Simulation, June 2016

This tutorial is part of the PATC Training "Runtime systems for heterogeneous platform programming".

Other materials (talk slides, links) for the whole tutorial session are available at the bottom of this page.

Setup

Connection to the Platform

The lab works are going to be done on the MDS platform. You should have received information on how to connect to the platform.

To use StarPU on the machines, you need to load the following modules

module load cuda
module load openmpi
module load hwloc/1.6.2_gnu47
module load openblas/v0.2.8_gnu48

The following variables need to be set to use StarPU.

export STARPU_PATH=/gpfslocal/pub/training/runtime_june2016/softs/starpu/starpu-1.2.0rc5
export PATH=$STARPU_PATH/bin:$PATH
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$STARPU_PATH/lib/pkgconfig
export LD_LIBRARY_PATH=$STARPU_PATH/lib:$LD_LIBRARY_PATH

export LD_LIBRARY_PATH=/gpfslocal/pub/training/runtime_june2016/softs/starpu:$LD_LIBRARY_PATH
export LIBRARY_PATH=$LD_LIBRARY_PATH

export STARPU_IDLE_FILE=$HOME/starpu_idle_microsec.log

You can either add the previous lines to your $HOME/.bash_profile file, or use the script file /gpfslocal/pub/training/runtime_june2016/starpu_env.sh

Job Submission

Jobs can be submitted to the platform to reserve a set of nodes and to execute an application on those nodes. Here is a script to submit your first StarPU application. It calls the tool starpu_machine_display, which shows the processing units that StarPU can use, and the bandwidth and affinity measured between the memory nodes.

MdS nodes are accessed through queues which represent machines with similar characteristics. For our lab works, we have 2 sets of machines:

#!/bin/bash
# @ class            = clgpu
# @ job_name = job_starpu_machine_display
# @ total_tasks = 10
# @ node = 1
# @ wall_clock_limit = 00:10:00
# @ output = $(HOME)/starpu/$(job_name).$(jobid).out
# @ error = $(HOME)/starpu/$(job_name).$(jobid).err
# @ job_type = mpich
# @ queue

source /gpfslocal/pub/training/runtime_june2016/starpu_env.sh
starpu_machine_display

You will find a copy of the script in /gpfslocal/pub/training/runtime_june2016/starpu_machine_display.sh. To submit the script, simply call:

llsubmit starpu_machine_display.sh

The state of the job can be queried by calling the command llq | grep $USER. Once the job has finished, the standard output and the standard error generated by the script execution are available in the files $HOME/starpu/job_starpu_machine_display.<jobid>.out and $HOME/starpu/job_starpu_machine_display.<jobid>.err.

Note that the first time starpu_machine_display is executed, it calibrates the performance model of the bus; the results are then stored in different files in the directory $HOME/.starpu/sampling/bus. If you run the command several times, you will notice that StarPU may calibrate the bus speed several times. This is because the cluster's batch scheduler may assign a different node each time, and StarPU does not know that the cluster we use is homogeneous; it thus assumes that all nodes of the cluster may be different. The following line can be added to the script file to force StarPU to use the same machine ID for the whole cluster:

$ export STARPU_HOSTNAME=poincaregpu

Of course, on a heterogeneous cluster, the cluster launcher script should set various hostnames for the different node classes, as appropriate.

Tutorial Material

All files needed for the lab works are available on the machine in the directory /gpfslocal/pub/training/runtime_june2016.

Session Part 1: Task-based Programming Model

Application Example: Vector Scaling

Making it and Running it

A typical Makefile for applications using StarPU is the following:

CFLAGS += $(shell pkg-config --cflags starpu-1.2)
LDFLAGS += $(shell pkg-config --libs starpu-1.2)
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o

Here are the source files for the application: vector_scal_task_insert.c (the main code), vector_scal_cpu.c, vector_scal_cuda.cu, vector_scal_opencl.c, and vector_scal_opencl_kernel.cl.

Run make, then run the resulting vector_scal_task_insert executable through the batch scheduler, using the given script vector_scal.sh. It should work: it simply scales a given vector by a given factor.

#!/bin/bash
# @ class            = clgpu
# @ job_name = job_vector_scal
# @ total_tasks = 10
# @ node = 1
# @ wall_clock_limit = 00:10:00
# @ output = $(HOME)/starpu/$(job_name).$(jobid).out
# @ error = $(HOME)/starpu/$(job_name).$(jobid).err
# @ job_type = mpich
# @ queue

source /gpfslocal/pub/training/runtime_june2016/starpu_env.sh

make vector_scal_task_insert

./vector_scal_task_insert

Computation Kernels

Examine the source code, starting with vector_scal_cpu.c: this is the actual computation code, wrapped in a vector_scal_cpu function which takes a series of DSM interfaces and a non-DSM parameter. The code simply gets the factor value from the non-DSM parameter and an actual pointer from the first DSM interface, then performs the vector scaling.

The GPU implementation, in vector_scal_cuda.cu, is basically the same, with the host part (vector_scal_cuda) which extracts the actual CUDA pointer from the DSM interface, and passes it to the device part (vector_mult_cuda) which performs the actual computation.

The OpenCL implementation in vector_scal_opencl.c and vector_scal_opencl_kernel.cl is more hairy due to the low-level nature of the OpenCL standard, but the principle remains the same.

Modify the source code of the different implementations (CPU, CUDA and OpenCL) and see which one gets executed. You can force the execution of one of the implementations simply by disabling a type of device when running your application, e.g.:

# to force execution on a GPU device; with CPUs disabled, CUDA is used by default
STARPU_NCPUS=0 vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 vector_scal_task_insert

You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.

STARPU_WORKER_STATS=1 vector_scal_task_insert

Main Code

Now examine vector_scal_task_insert.c: the cl (codelet) structure simply gathers pointers to the functions mentioned above.

The main function
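The main function registers the vector with StarPU and submits a task referring to the codelet. Here is a hedged sketch of that pattern, assuming the StarPU 1.2 API; it requires StarPU to build, and the vector size, factor value and function names are illustrative, not copied from the tutorial sources:

```c
/* Sketch of the main code: register a vector, describe the codelet,
 * submit one task. Only the starpu_* calls are the real API; the
 * surrounding names are illustrative. */
#include <starpu.h>

#define NX 2048

extern void vector_scal_cpu(void *buffers[], void *cl_arg);

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    /* .cuda_funcs / .opencl_funcs would list the GPU implementations */
    .nbuffers = 1,            /* one DSM buffer: the vector */
    .modes = { STARPU_RW },   /* accessed in read-write mode */
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    starpu_init(NULL);

    /* Hand the buffer over to StarPU's data management. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    /* Submit one task; factor is passed by value as the non-DSM argument. */
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, handle,
                       0);

    starpu_data_unregister(handle);  /* waits for the task, syncs data back */
    starpu_shutdown();
    return 0;
}
```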

Data Partitioning

In the previous section, we submitted only one task. We now discuss how to partition data so as to submit multiple tasks, which can be executed in parallel by the various CPUs and GPUs.

Let's examine mult.c.

Run the application with the batch scheduler, enabling some statistics:


The figures show how the computation was distributed over the various processing units.

Other example

gemm/xgemm.c is a very similar matrix-matrix product example, but one which uses BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (on CPUs) or cublasDgemm (on GPUs) on the DSM interface.

Let's execute it.

#!/bin/bash
# @ class            = clgpu
# @ job_name = job_gemm
# @ total_tasks = 10
# @ node = 1
# @ wall_clock_limit = 00:10:00
# @ output = $(HOME)/starpu/$(job_name).$(jobid).out
# @ error = $(HOME)/starpu/$(job_name).$(jobid).err
# @ job_type = mpich
# @ queue

source /gpfslocal/pub/training/runtime_june2016/starpu_env.sh

make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm

Exercise

Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Here we will use the starpu_vector_filter_block() filter function. You can see the list of predefined filters provided by StarPU here. Try to run it with various numbers of tasks.
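A hedged sketch of what the partitioning additions could look like, assuming the StarPU 1.2 filter API; the number of blocks is illustrative, and handle, cl and factor stand for the handle, codelet and scaling factor of the vector example:

```c
/* Sketch: split the registered vector into NB_BLOCKS pieces and submit
 * one task per piece. Requires StarPU; only the starpu_* calls are the
 * real API, the surrounding names are illustrative. */
#define NB_BLOCKS 4

struct starpu_data_filter filter =
{
    .filter_func = starpu_vector_filter_block,  /* predefined block filter */
    .nchildren = NB_BLOCKS,
};

starpu_data_partition(handle, &filter);

for (unsigned i = 0; i < NB_BLOCKS; i++)
{
    starpu_data_handle_t sub = starpu_data_get_sub_data(handle, 1, i);
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, sub,
                       0);
}

/* Gather the pieces back into the original handle. */
starpu_data_unpartition(handle, STARPU_MAIN_RAM);
```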

Session Part 2: Optimizations

This is based on StarPU's documentation optimization chapter.

Data Management

We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers: the application should thus use starpu_malloc() when allocating its buffers, to permit asynchronous DMAs from and to them.

Take the vector example again, and fix the allocation, to make it use starpu_malloc().
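The change amounts to replacing the plain allocation with starpu_malloc()/starpu_free(); a hedged sketch, with NX standing for the vector size of the example:

```c
/* Allocate the vector with starpu_malloc() so that StarPU can pin the
 * memory, which permits asynchronous DMA transfers to and from it. */
float *vector;
starpu_malloc((void **)&vector, NX * sizeof(float));

/* ... register the buffer, submit tasks, unregister ... */

starpu_free(vector);
```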

Task Submission

To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave like that: submit the whole graph of tasks, and wait for termination.
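In code, that ideal shape is: submit everything asynchronously, then wait once at the end. A hedged sketch (ntasks, handles and cl are illustrative names):

```c
/* Submit the whole task graph without waiting in between; StarPU can
 * then reorder tasks and start data transfers early. */
for (unsigned i = 0; i < ntasks; i++)
    starpu_task_insert(&cl, STARPU_RW, handles[i], 0);  /* asynchronous */

/* Block only once, after the whole graph has been submitted. */
starpu_task_wait_for_all();
```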

Task Scheduling Policy

By default, StarPU uses the eager scheduler, a simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.

If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.

For instance, compare the eager (default) and dmda scheduling policies:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -x 1024 -y 1024 -z 1024

with:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -x 1024 -y 1024 -z 1024

You can see that most (if not all) of the computation has been done on GPUs, leading to better performance.

Try other schedulers; use STARPU_SCHED=help to get the list.

Also try with various sizes and draw curves.

You can also try the double-precision version, dgemm, and notice that the GPUs achieve less impressive performance there.

Performance Model Calibration

Performance prediction is essential for proper scheduling decisions, so the performance models have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $STARPU_HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.

$ starpu_perfmodel_display -l
file: <starpu_sgemm_gemm.mirage>
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	1.848856e+04   	4.026761e+03   	12
performance model for cuda_0_impl_0
# hash		size		flops		mean (us)	stddev (us)		n
8bd4e11d	2359296        	0.000000e+00   	4.918095e+02   	9.404866e+00   	66
...

This shows that for the sgemm kernel with a 2,359,296-byte matrix tile (a 768x768 single-precision block, about 2.25 MiB), the average execution time on CPUs was about 18.5 ms, with a 4 ms standard deviation, over 12 samples, while it took about 0.49 ms on GPUs, with a 0.009 ms standard deviation, over 66 samples. It is a good idea to check this before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using export STARPU_CALIBRATE=1

If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use export STARPU_CALIBRATE=2

The performance model can also be drawn by using starpu_perfmodel_plot, which will emit a gnuplot file in the current directory.

Sessions Part 3: MPI Support

StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.

Manual MPI transfers

Basically, StarPU provides equivalents of the MPI_* functions which operate on DSM handles instead of void* buffers. The difference is that the source data may reside on a GPU where it was just computed; StarPU will automatically copy it back to main memory before submitting it to MPI.
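For instance, the detached send/receive variants mirror their MPI counterparts but take a data handle; a hedged sketch (ranks, tag and handle are illustrative names):

```c
/* Send/receive a registered vector instead of a raw buffer. StarPU
 * fetches the data back from GPU memory if needed before the send. */
starpu_mpi_isend_detached(handle, dest_rank, tag, MPI_COMM_WORLD,
                          NULL, NULL);   /* no completion callback */
starpu_mpi_irecv_detached(handle, src_rank, tag, MPI_COMM_WORLD,
                          NULL, NULL);
```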

ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical MPI ring ping-pong, but the token being passed from neighbour to neighbour is incremented by a StarPU task at each step.

This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.

#!/bin/bash
# @ class            = clgpu
# @ job_name = job_ring
# @ total_tasks = 10
# @ node = 1
# @ wall_clock_limit = 00:10:00
# @ output = $(HOME)/starpu/$(job_name).$(jobid).out
# @ error = $(HOME)/starpu/$(job_name).$(jobid).err
# @ job_type = mpich
# @ queue

source /gpfslocal/pub/training/runtime_june2016/starpu_env.sh

make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit

starpu_mpi_insert_task

A stencil application shows a basic MPI task model application. The data distribution over MPI nodes is decided by the my_distrib function, and can thus be changed trivially. It also shows how data can be migrated to a new distribution.
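With this model, every node submits the same task graph, and StarPU generates the required MPI transfers from the data distribution. A hedged sketch, assuming the StarPU 1.2 MPI API (the distribution function and surrounding names are illustrative):

```c
/* Each handle is assigned an owner rank; starpu_mpi_task_insert() then
 * executes the task on the owner of the data being written to, and
 * generates the needed MPI transfers automatically. */
starpu_mpi_data_register(handle, tag, my_distrib(x, y));  /* owner rank */

/* Every node submits the same graph; only the owner runs the task. */
starpu_mpi_task_insert(MPI_COMM_WORLD, &cl,
                       STARPU_RW, handle,
                       0);
starpu_task_wait_for_all();
```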

#!/bin/bash
# @ class            = clgpu
# @ job_name = job_stencil
# @ total_tasks = 10
# @ node = 1
# @ wall_clock_limit = 00:10:00
# @ output = $(HOME)/starpu/$(job_name).$(jobid).out
# @ error = $(HOME)/starpu/$(job_name).$(jobid).err
# @ job_type = mpich
# @ queue

source /gpfslocal/pub/training/runtime_june2016/starpu_env.sh

make stencil5
mpirun -np 2 $PWD/stencil5 -display

Session Part 4: OpenMP Support

The Klang-Omp OpenMP Compiler

The Klang-Omp OpenMP compiler converts C/C++ source code annotated with OpenMP 4 directives into StarPU-enabled code. Klang-Omp is a source-to-source compiler based on the LLVM/Clang compiler framework.

The following shell sequence shows an example of an OpenMP version of the Cholesky decomposition compiled into StarPU code.

cd
source /gpfslocal/pub/training/runtime_june2016/openmp/environment
cp -r /gpfslocal/pub/training/runtime_june2016/openmp/Cholesky .
cd Cholesky
make
./cholesky_omp4.starpu

Homepage of the Klang-Omp OpenMP compiler: Klang-Omp

Contact

For any questions regarding StarPU, please contact the StarPU developers mailing list: starpu-devel@lists.gforge.inria.fr

More Performance Optimizations

The StarPU documentation performance feedback chapter provides more optimization tips for further reading after this tutorial.

Other Materials: Talk Slides and Website Links

General Session Introduction

The Hardware Locality Library (hwloc)

The StarPU Runtime System

The EZtrace Performance Debugging Framework

Last updated on 2016/06/17.