Runtime Systems for Heterogeneous Platform Programming, StarPU Tutorial - Maison de la Simulation, June 2016
This tutorial is part of the PATC Training “Runtime systems for heterogeneous platform programming”.
Other materials (talk slides, links) for the whole tutorial session are available at the bottom of this page.
Setup
Connection to the Platform
The lab works are going to be done on the PlaFRIM/DiHPES platform. A subset of machines has been specifically booked for our own usage. You should have received information on how to connect to the platform.
Once you are connected, we advise you to add the following lines at the end of your .bash_profile file:
module purge
module load compiler/intel
module load hardware/hwloc
module load compiler/cuda
module load mpi/intel
module load runtime/starpu/1.1.4
Important: Due to an issue with the NFS-mounted home, you need to redirect CUDA’s cache to /tmp, so please add the following lines to your .bash_profile file:
rm -fr ~/.nv
mkdir -p /tmp/$USER/nv
ln -s /tmp/$USER/nv ~/.nv
StarPU uses a file locking mechanism which also clashes with the NFS-mounted home; you will need to change the location where StarPU stores its files by adding the following line to your .bash_profile file:
export STARPU_HOME=/tmp/$USER/starpu
Job Submission
Jobs can be submitted to the platform to reserve a set of nodes and to execute an application on those nodes. We advise not to reserve nodes interactively, so as not to block the machines for the other participants. Here is a script to submit your first StarPU application. It calls the tool starpu_machine_display, which shows the processing units that StarPU can use, and the bandwidth and affinity measured between the memory nodes.
PlaFRIM/DiHPES nodes are accessed through queues which represent machines with similar characteristics. For our lab works, we have 2 sets of machines:
- GPU nodes, accessed with the queue formation_gpu.
- Non-GPU nodes, accessed with the queue formation.

For the rest of the tutorial, we will use the queue formation_gpu.
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12 -q formation_gpu
starpu_machine_display
To submit the script, simply call:
qsub starpu_machine_display.pbs
The state of the job can be queried by calling the command qstat | grep $USER.
Once finished, the standard output and the standard error generated by
the script execution are available in the files:
- jobname.osequence_number
- jobname.esequence_number
Note that the first time starpu_machine_display is executed, it calibrates the performance model of the bus; the results are then stored in different files in the directory $HOME/.starpu/sampling/bus. If you run the command several times, you will notice that StarPU may calibrate the bus speed several times. This is because the cluster’s batch scheduler may assign a different node each time, and StarPU does not know that the local cluster we use is homogeneous, and thus assumes that all nodes of the cluster may be different. Let’s force it to use the same machine ID for the whole cluster:
$ export STARPU_HOSTNAME=mirage
Also add this line to your .bash_profile for further connections. Of course, on a heterogeneous cluster, the cluster launcher script should set different hostnames for the different node classes, as appropriate.
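Such a launcher script could, for instance, derive the machine class from the node's hostname. This is only a sketch: the "gpunode"/"cpunode" prefixes and class names below are made-up examples, not actual PlaFRIM node names.

```shell
# starpu_class: map a node hostname to a STARPU_HOSTNAME machine class,
# so that StarPU shares one bus calibration per node class instead of
# recalibrating for every individual node. (Hypothetical name patterns.)
starpu_class() {
  case "$1" in
    gpunode*) echo gpunode-class ;;
    cpunode*) echo cpunode-class ;;
    *)        echo generic-class ;;
  esac
}

export STARPU_HOSTNAME="$(starpu_class "$(hostname)")"
```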
Tutorial Material
All files needed for the lab works are available in the zip file. Copy this file on your PlaFRIM/DiHPES account and unzip its contents.
Session Part 1: Task-based Programming Model
Application Example: Vector Scaling
Making it and Running it
A typical Makefile for applications using StarPU is the following:

CFLAGS += $(shell pkg-config --cflags starpu-1.1)
LDFLAGS += $(shell pkg-config --libs starpu-1.1)

%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
Here are the source files for the application:
- The main application
- The CPU implementation of the codelet
- The CUDA implementation of the codelet
- The OpenCL host implementation of the codelet
- The OpenCL device implementation of the codelet
Run make, then run the resulting vector_scal_task_insert executable through the batch scheduler, using the given qsub script vector_scal.pbs. It should work: it simply scales a given vector by a given factor.
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12 -q formation_gpu
# go in the directory from which the submission was made
cd $PBS_O_WORKDIR
make vector_scal_task_insert
./vector_scal_task_insert
Computation Kernels
Examine the source code, starting from vector_scal_cpu.c: this is the actual computation code, which is wrapped in a vector_scal_cpu function that takes a series of DSM interfaces and a non-DSM parameter. The code simply gets the factor value from the non-DSM parameter, an actual pointer from the first DSM interface, and performs the vector scaling.
The GPU implementation, in vector_scal_cuda.cu, is basically the same, with a host part (vector_scal_cuda) which extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda), which performs the actual computation.
The OpenCL implementation, in vector_scal_opencl.c and vector_scal_opencl_kernel.cl, is hairier due to the low-level nature of the OpenCL standard, but the principle remains the same.
Modify the source code of the different implementations (CPU, CUDA and OpenCL) and see which one gets executed. You can force the execution of one of the implementations simply by disabling a type of device when running your application, e.g.:
# to force execution on a GPU device (by default, CUDA will be used)
STARPU_NCPUS=0 vector_scal_task_insert
# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 vector_scal_task_insert
You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.
STARPU_WORKER_STATS=1 vector_scal_task_insert
Main Code
Now examine vector_scal_task_insert.c: the cl (codelet) structure simply gathers pointers to the functions mentioned above.

The main function:
- Allocates a vector application buffer and fills it.
- Registers it to StarPU, and gets back a DSM handle. From now on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
- Submits an (asynchronous) task to StarPU.
- Waits for task completion.
- Unregisters the vector from StarPU, which brings the modified version back to main memory.
Data Partitioning
In the previous section, we submitted only one task. We here discuss how to partition data so as to submit multiple tasks, which can be executed in parallel by the various CPUs and GPUs.
Let’s examine mult.c.
- The computation kernel, cpu_mult, is a trivial matrix multiplication kernel, which operates on 3 given DSM interfaces. These will actually not be whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, vert, is used to split B and C vertically. The second one, horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B.
- launch_tasks submits the actual tasks: for each piece of C, take the appropriate pieces of A and B to produce the piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will make copies of the pieces of A and B across the machines where tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since they will not be read from.
- The main code initializes StarPU and the data, launches the tasks, unpartitions the data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they were computed. Unpartitioning collects all the pieces of C into main memory to form the whole C result matrix.
Run the application with the batch scheduler, enabling some statistics:
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12 -q formation_gpu
# go in the directory from which the submission was made
cd $PBS_O_WORKDIR
make mult
STARPU_WORKER_STATS=1 ./mult
The figures show how the computation was distributed over the various processing units.
Another Example
gemm/xgemm.c is a very similar matrix-matrix product example, but one which makes use of BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (CPUs) or cublasDgemm (GPUs) on the DSM interface.
Let’s execute it.
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12 -q formation_gpu
# go in the directory from which the submission was made
cd $PBS_O_WORKDIR
make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm
More Advanced Examples
examples/lu/xlu_implicit.c is a more involved example: this is a simple LU decomposition algorithm. The dw_codelet_facto_v3 function is actually the main algorithm loop, written in a very readable, sequential-looking way. It simply submits all the tasks asynchronously and waits for them all.
examples/cholesky/cholesky_implicit.c is a similar example, but one which makes use of the starpu_insert_task helper. The _cholesky function looks very much like dw_codelet_facto_v3 of the previous paragraph, and all task submission details are handled by starpu_insert_task.
Thanks to already using a task-based programming model, MAGMA and PLASMA were easily ported to StarPU, simply by using starpu_insert_task.
Exercise
Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Here we will use the starpu_vector_filter_block() filter function. You can see the list of predefined filters provided by StarPU here.
Try to run it with various numbers of tasks.
Session Part 2: Optimizations
This part is based on the optimization chapter of StarPU’s documentation.
Data Management
We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers. The application should thus use starpu_malloc() when allocating its buffers, to permit asynchronous DMAs from and to them.

Take the vector example again, and fix the allocation to make it use starpu_malloc().
Task Submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave like that: submit the whole graph of tasks, and wait for termination.
Task Scheduling Policy
By default, StarPU uses the eager simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then actually take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.
For instance, compare the eager (default) and dmda scheduling policies:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -x 1024 -y 1024 -z 1024
with:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -x 1024 -y 1024 -z 1024
You can see that most (if not all) of the computation has been done on GPUs, leading to better performance.
Try other schedulers; use STARPU_SCHED=help to get the list.
Also try with various sizes and draw curves.
You can also try the double-precision version, dgemm, and notice that the GPUs achieve comparatively lower performance.
Performance Model Calibration
Performance prediction is essential for proper scheduling decisions; the performance models thus have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $STARPU_HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.
$ starpu_perfmodel_display -l
file: <starpu_sgemm_gemm.mirage>
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu_impl_0
# hash size flops mean (us) stddev (us) n
8bd4e11d 2359296 0.000000e+00 1.848856e+04 4.026761e+03 12
performance model for cuda_0_impl_0
# hash size flops mean (us) stddev (us) n
8bd4e11d 2359296 0.000000e+00 4.918095e+02 9.404866e+00 66
...
This shows that for the sgemm kernel with a 2.5M matrix slice, the average execution time on CPUs was about 18ms, with a 4ms standard deviation, over 12 samples, while it took about 0.49ms on GPUs, with a 0.009ms standard deviation. It is a good idea to check this before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using:

export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use:

export STARPU_CALIBRATE=2
The performance model can also be plotted using starpu_perfmodel_plot, which will emit a gnuplot file in the current directory.
Sessions Part 3: MPI Support
StarPU provides support for MPI communications. It does so in two ways: either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.
Manual MPI transfers
Basically, StarPU provides equivalents of the MPI_* functions which operate on DSM handles instead of void* buffers. The difference is that the source data may be residing on a GPU where it was just computed. StarPU will automatically handle copying it back to main memory before submitting it to MPI.
ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical MPI ring ping-pong, but the token which is being passed on from neighbour to neighbour is incremented by a StarPU task at each step.
This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.
#how many nodes and cores
#PBS -W x=NACCESSPOLICY:SINGLEJOB -l nodes=1:ppn=12 -q formation_gpu
# go in the directory from which the submission was made
cd $PBS_O_WORKDIR
make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
starpu_mpi_insert_task
A stencil application shows a basic MPI task model application. The data distribution over the MPI nodes is decided by the my_distrib function, and can thus be changed trivially. It also shows how data can be migrated to a new distribution.
Contact
For any questions regarding StarPU, please contact the StarPU developers mailing list: starpu-devel@inria.fr.
More Performance Optimizations
The StarPU documentation performance feedback chapter provides more optimization tips for further reading after this tutorial.
FxT Tracing Support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
To use the version of StarPU compiled with FxT support, you need to reload the module StarPU after loading the module FxT.
module unload runtime/starpu/1.1.4
module load trace/fxt/0.2.13
module load runtime/starpu/1.1.4
The trace file is stored in /tmp by default. Since execution will happen on a cluster node, the file will not be reachable after execution, so we need to tell StarPU to store output traces in the home directory, by setting:

$ export STARPU_FXT_PREFIX=$HOME/

Do not forget to add this line to your .bash_profile file.
The application should be run again, and this time a prof_file_XX_YY
trace file will be generated in your home directory. This can be converted to
several formats by using:
$ starpu_fxt_tool -i ~/prof_file_*
This will create:
- a paje.trace file, which can be opened using the ViTE tool. This shows a Gantt diagram of the tasks which executed, and thus the activity and idleness of the processing units, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened using Graphviz.
- an activity.data file, which records the activity of all processing units over time.
Other Materials: Talk Slides and Website Links
General Session Introduction
The Hardware Locality Library (hwloc)
- Tutorial: hwloc. For questions regarding hwloc, please contact brice.goglin@inria.fr.
TreeMatch
- Tutorial: TreeMatch. For questions regarding TreeMatch, please contact emmanuel.jeannot@inria.fr.
The StarPU Runtime System
The EZtrace Performance Debugging Framework
- Tutorial: EzTrace. For questions regarding EzTrace, please contact eztrace-devel@lists.gforge.inria.fr.