StarPU Tutorial - Workshop HPC-GA - 2012
Download & Install
DAS-4 modules
The first step is of course to download and install StarPU. Before doing so, make sure to enable the paths to the CUDA and CUBLAS environments on your machine; on DAS-4 that means running
$ module load cuda32/toolkit
$ module load cuda32/blas
We will also use the prun tool:
$ module load prun
You should probably put these module load commands in your .bashrc for further connections to DAS-4.
hwloc
In order to properly discover the machine's cores, StarPU uses the hwloc library, which can be downloaded from the hwloc website. The build procedure is the usual:
$ ./configure --prefix=$HOME
$ make
$ make install
To easily get the proper compiler and linker flags for StarPU, as well as execution paths, add them to the pkg-config search path, library path, and binary path:
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/lib/pkgconfig
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib
$ export PATH=$PATH:$HOME/bin
You should add these lines to your .bashrc file for further connections.
StarPU
The StarPU source code can be downloaded from the StarPU website; make sure to get the latest release, i.e. 0.9.1. The build process follows the usual GNU style:
$ ./configure --prefix=$HOME
$ make
$ make install
In the summary dumped at the end of the configure step, check that CUDA support was detected (CUDA enabled: yes), as well as hwloc.
You can test execution of a “Hello world!” program (using prun to run the command on a DAS-4 computation node, as required by the usage policy):
$ prun -np 1 ./examples/basic_examples/hello_world
If execution does not find the cudart library, make sure that your .bashrc properly keeps the existing paths in the LD_LIBRARY_PATH environment variable.
Run the command several times; you will notice that StarPU calibrates the bus speed each time. This is because DAS-4’s job scheduler assigns a different node each time, and StarPU does not know that the local cluster we use is homogeneous, so it assumes that all nodes of the cluster may be different. Let’s force it to use the same machine ID for the whole cluster:
$ export STARPU_HOSTNAME=das4
Also add this to your .bashrc for further connections. Of course, on a heterogeneous cluster, the cluster launcher script should set different hostnames for the different node classes, as appropriate.
Hands-on session part 1: Task-based programming model
Application example: vector scaling
Making it and running it
A typical Makefile for applications using StarPU is the following (available for download):
CFLAGS += $(shell pkg-config --cflags libstarpu)
LDFLAGS += $(shell pkg-config --libs libstarpu)
vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@
clean:
	rm -f vector_scal *.o
Copy the vector_scal*.c* files from examples/basic_examples into a new empty directory, along with the Makefile mentioned above. Run make, and try:
$ prun -np 1 ./vector_scal
It should work: it simply scales a given vector by a given factor.
Computation kernels
Examine the source code, starting with vector_scal_cpu.c: this is the actual computation code, wrapped in a scal_cpu_func function which takes a series of DSM interfaces and a non-DSM parameter. The code simply gets an actual pointer from the first DSM interface and the factor value from the non-DSM parameter, then performs the vector scaling.
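In outline, the CPU kernel looks roughly like this (a sketch from memory of the shipped example, using StarPU 1.x-style type names; older releases spell some identifiers slightly differently):

#include <starpu.h>

/* Sketch of scal_cpu_func: buffers[] holds the DSM interfaces,
 * cl_arg carries the non-DSM factor argument. */
void scal_cpu_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);           /* number of elements */
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector); /* actual pointer */
    float factor = *(float *)cl_arg;

    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}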
The GPU implementation, in vector_scal_cuda.cu, is basically the same, with a host part (scal_cuda_func) which extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda) which performs the actual computation.
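A condensed sketch of the idea (the shipped .cu file contains the definitive version; the accessor macros and kernel geometry here are illustrative assumptions):

#include <starpu.h>

static __global__ void vector_mult_cuda(float *val, unsigned n, float factor)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        val[i] *= factor;
}

/* Host part: runs on the CPU, launches the device kernel above. */
extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
{
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]); /* device pointer */
    float factor = *(float *)cl_arg;

    unsigned threads_per_block = 64;
    unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;
    vector_mult_cuda<<<nblocks, threads_per_block>>>(val, n, factor);
    cudaThreadSynchronize();
}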
The OpenCL implementation is hairier, due to the low-level nature of the OpenCL standard, but the principle remains the same.
Main code
Now examine vector_scal.c: the cl (codelet) structure simply gathers pointers to the functions mentioned above. It also includes a performance model, which we will discuss this afternoon.
The main function:
- Allocates an application buffer, vector, and fills it.
- Registers it to StarPU, and gets back a DSM handle. From then on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
- Submits a (synchronous) task to StarPU.
- Unregisters the vector from StarPU, which brings the modified version back to main memory.
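Put together, the main code follows roughly this pattern (a sketch using StarPU 1.x-style field names such as cpu_funcs and handles; the 0.9 release spells some of these slightly differently):

#include <stdint.h>
#include <starpu.h>

#define NX 2048

extern void scal_cpu_func(void *buffers[], void *cl_arg);

static struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    starpu_init(NULL);
    for (unsigned i = 0; i < NX; i++)
        vector[i] = 1.0f;

    /* Register the buffer; from now on it is accessed through the handle. */
    starpu_vector_data_register(&handle, 0 /* home node: main memory */,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    struct starpu_task *task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = handle;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);
    task->synchronous = 1; /* wait for completion at submission */
    starpu_task_submit(task);

    /* Fetch the (possibly GPU-resident) result back to main memory. */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}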
Data partitioning
In the previous section, we submitted only one task. Here we discuss how to partition data so as to submit multiple tasks, which can be executed in parallel by the various CPUs and GPUs.
Let’s examine examples/basic_examples/mult.c.
- The computation kernel, cpu_mult, is a trivial matrix multiplication kernel, which operates on 3 given DSM interfaces. These are actually not whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, vert, is used to split B and C vertically. The second one, horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B; see the sketch after this list.
- launch_tasks submits the actual tasks: for each piece of C, take the appropriate pieces of A and B to produce that piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will replicate the pieces of A and B across the machine wherever tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since these will not be read from.
- The main code initializes StarPU and the data, launches the tasks, unpartitions the data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they were computed. Unpartitioning collects all the pieces of C into main memory to form the whole C result matrix.
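The partitioning step boils down to something like the following sketch. It uses the 1.x-era filter names (the 0.9 release used starpu_block_filter_func and starpu_vertical_block_filter_func instead), and the handles are assumed to have been registered earlier:

#include <starpu.h>

/* Handles registered earlier with starpu_matrix_data_register(). */
extern starpu_data_handle_t A_handle, B_handle, C_handle;

static void partition(unsigned nslicesx, unsigned nslicesy)
{
    struct starpu_data_filter vert =
    {
        .filter_func = starpu_matrix_filter_vertical_block, /* vertical split */
        .nchildren = nslicesx,
    };
    struct starpu_data_filter horiz =
    {
        .filter_func = starpu_matrix_filter_block,          /* horizontal split */
        .nchildren = nslicesy,
    };

    starpu_data_partition(B_handle, &vert);  /* B: vertical stripes */
    starpu_data_partition(A_handle, &horiz); /* A: horizontal stripes */
    /* C is cut both ways, giving an nslicesx x nslicesy grid of pieces,
     * each retrieved with starpu_data_get_sub_data(C_handle, 2, x, y). */
    starpu_data_map_filters(C_handle, 2, &vert, &horiz);
}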
Run the application, enabling some statistics:
$ prun -np 1 STARPU_WORKER_STATS=1 ./examples/basic_examples/mult
The figures show how the computation was distributed over the various processing units. We will discuss performance further this afternoon.
examples/mult/xgemm.c is a very similar matrix-matrix product example, but which makes use of BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (on CPUs) or cublasDgemm (on GPUs) on the DSM interface.
It is also able to benefit from a parallel implementation of DGEMM; we will however not have the time to discuss this still-experimental feature.
Let’s execute it on a node with one GPU:
$ prun -native '-l gpu=GTX480' -np 1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm
(it takes some time for StarPU to make an off-line bus performance calibration, but this is done only once).
We can notice that StarPU gave many more tasks to the GPU. You can also try to set num_gpu=2 to run on the machine which has two GPUs (there is only one such machine, so you may have to wait a long time; submit this in the background in a separate terminal). The interesting thing here is that, with no application modification beyond using a task-based programming model, we get multi-GPU support for free!
More advanced examples
examples/lu/xlu_implicit.c is a more involved example: this is a simple LU decomposition algorithm. The dw_codelet_facto_v3 function is actually the main algorithm loop, written in a very readable, sequential-looking way. It simply submits all the tasks asynchronously, and waits for them all.
examples/cholesky/cholesky_implicit.c is a similar example, but which makes use of the starpu_insert_task helper. The _cholesky function looks very much like dw_codelet_facto_v3 of the previous paragraph, and all task submission details are handled by starpu_insert_task.
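A single call to this helper looks roughly as follows (a sketch; cl, the handles and factor stand in for whatever codelet and data the application uses):

/* starpu_insert_task packs the codelet, the access modes and handles,
 * and small by-value arguments into one variadic call; the trailing 0
 * terminates the argument list. */
starpu_insert_task(&cl,
                   STARPU_R,  A_handle,
                   STARPU_R,  B_handle,
                   STARPU_RW, C_handle,
                   STARPU_VALUE, &factor, sizeof(factor),
                   0);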
Thanks to already using a task-based programming model, MAGMA and PLASMA have been easily ported to StarPU by simply using starpu_insert_task.
Exercise
Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Try to run it with various numbers of tasks.
Hands-on session part 2: Optimizations
This is based on the optimization chapter of StarPU’s documentation.
Data management
We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers. The application should thus use starpu_malloc when allocating its buffers, to permit asynchronous DMAs from and to them.
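In practice this just replaces malloc, along these lines (a minimal sketch):

#include <starpu.h>

#define NX 2048

int main(void)
{
    float *vector;

    starpu_init(NULL);
    /* Allocate a pinned buffer, suitable for asynchronous DMA transfers. */
    starpu_malloc((void **)&vector, NX * sizeof(*vector));

    /* ... register the buffer, submit tasks, unregister ... */

    starpu_free(vector);
    starpu_shutdown();
    return 0;
}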
Task submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave like the applications we have observed this morning: submit the whole graph of tasks, and wait for termination.
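In code, this means leaving task->synchronous at 0 and waiting only once at the end (a sketch, assuming ntasks tasks over pre-registered handles and a codelet cl):

/* Submit the whole task graph asynchronously... */
for (unsigned i = 0; i < ntasks; i++)
{
    struct starpu_task *task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = handles[i];
    /* task->synchronous is 0 by default: submission returns immediately. */
    starpu_task_submit(task);
}
/* ...and wait for all of them at once. */
starpu_task_wait_for_all();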
Task scheduling policy
By default, StarPU uses the eager simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.
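Declaring such a model is a matter of a few extra lines in the codelet (a sketch with 1.x-style names; the symbol string identifies the model on disk, and cpu_mult is a stand-in kernel):

extern void cpu_mult(void *buffers[], void *cl_arg);

/* History-based model: StarPU records measured execution times per
 * input size and uses them to predict future executions. */
static struct starpu_perfmodel mult_perf_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "mult_perf_model",
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { cpu_mult },
    .nbuffers = 3,
    .modes = { STARPU_R, STARPU_R, STARPU_W },
    .model = &mult_perf_model,
};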
For instance, compare the eager (default) and heft scheduling policies:
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
with
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=heft ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
There are far fewer data transfers, and StarPU realizes that there is no point in giving tasks to GPUs, resulting in better performance.
You may have to use STARPU_CALIBRATE=2, as the performance of the kernels seems to vary on DAS-4; I have not had the time to check why.
Depending on whether you use gpu=GTX480 (old GPUs) or gpu=C2050 (recent GPUs), you should set export STARPU_HOSTNAME=das4-gtx480 or export STARPU_HOSTNAME=das4-c2050 respectively, to make StarPU use separate performance models for these two kinds of machines.
Try other schedulers; use STARPU_SCHED=help to get the list.
Also try with various sizes and draw curves.
You can also try the double-precision version, dgemm, and notice that the GPUs achieve less impressive performance there.
Performance model calibration
Performance prediction is essential for proper scheduling decisions, so the performance models have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.
$ starpu_perfmodel_display -l
file: <starpu_sgemm_gemm.das4>
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu
# hash size mean dev n
8bd4e11d 2359296 9.318547e+04 4.335047e+02 700
performance model for cuda_0
# hash size mean dev n
8bd4e11d 2359296 3.396056e+02 3.391979e+00 900
This shows that for the sgemm kernel with a 2359296-byte (about 2.3MB) matrix slice, the average execution time on CPUs was about 93ms, with a 0.4ms standard deviation, over 700 samples, while it took about 0.34ms on GPUs, with a 0.003ms standard deviation, over 900 samples. It is a good idea to check this before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using
export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use
export STARPU_CALIBRATE=2
More performance optimizations
The StarPU documentation’s optimization chapter provides more optimization tips, for further reading after the Spring School.
FxT tracing support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
The tool used by StarPU to record a trace is called FxT, and can be downloaded from Savannah. The build process is as usual:
$ ./configure --prefix=$HOME
$ make
$ make install
StarPU should then be recompiled with FxT support:
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
You should make sure that the summary at the end of ./configure shows that tracing was enabled:
Tracing enabled: yes
The trace file is output in /tmp by default. Since execution will happen on a cluster node, the file would not be reachable after execution, so we need to direct StarPU to output traces to the home directory, by using
$ export STARPU_FXT_PREFIX=$HOME/
Also add this to your .bashrc.
The application should be run again, and this time a prof_file_XX_YY trace file will be generated in your home directory. It can be converted to several formats by using:
$ starpu_fxt_tool -i ~/prof_file_*
That will create:
- a paje.trace file, which can be opened using the ViTE tool. This shows a Gantt diagram of the tasks which were executed, and thus the activity and idleness of the processing units, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened using Graphviz.
- an activity.data file, which records the activity of all processing units over time.
MPI support
StarPU provides support for MPI communications. Basically, it provides equivalents of the MPI_* functions, but which operate on DSM handles instead of void* buffers. The difference is that the source data may reside on a GPU where it has just been computed; StarPU automatically handles copying it back to main memory before submitting it to MPI.
mpi/tests/ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical ring MPI ping-pong, but the token which is passed from neighbour to neighbour is incremented by a StarPU task at each step.
This is written very naturally, by simply submitting all MPI communication requests and tasks asynchronously in a sequential-looking loop, and eventually waiting for all of them to complete.
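The body of such a loop boils down to the following pattern (a sketch; token_handle, prev, next, tag and increment_cl stand in for the application's handle, neighbour ranks, MPI tag and codelet):

/* Receive the token from the previous neighbour, increment it with a
 * task, and send it on: all three calls return immediately, and the
 * implicit dependencies on token_handle order them correctly. */
starpu_mpi_irecv_detached(token_handle, prev, tag, MPI_COMM_WORLD, NULL, NULL);
starpu_insert_task(&increment_cl,
                   STARPU_RW, token_handle,
                   0);
starpu_mpi_isend_detached(token_handle, next, tag, MPI_COMM_WORLD, NULL, NULL);

/* After the loop, wait for everything that was submitted. */
starpu_task_wait_for_all();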
starpu_mpi_insert_task
The Cholesky factorization shown in the presentation slides is available in mpi/examples/cholesky/mpi_cholesky.c. The data distribution over the MPI nodes is decided by the my_distrib function, and can thus be changed trivially.
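The key idea is that each handle is assigned an owner rank, and starpu_mpi_insert_task then infers the necessary communications by itself (a sketch following the example's structure; data_handles, my_distrib, world_size and cl are stand-ins):

/* Assign each data piece to the MPI rank chosen by my_distrib... */
starpu_data_set_rank(data_handles[x][y], my_distrib(x, y, world_size));

/* ...then submit tasks on every node: StarPU-MPI runs each task on the
 * owner of the data it writes, and transfers the inputs as needed. */
starpu_mpi_insert_task(MPI_COMM_WORLD, &cl,
                       STARPU_RW, data_handles[x][y],
                       0);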
Contact
For any questions regarding StarPU, please contact the StarPU developers on the mailing list: starpu-devel@inria.fr