StarPU Tutorial - Workshop HPC-GA - 2012
Download & Install
DAS-4 modules
The first step is of course to download and install StarPU. Before doing so, make sure to enable the paths to the CUDA and CUBLAS environments on your machine; on DAS-4 that means running
$ module load cuda32/toolkit
$ module load cuda32/blas
We will also use the prun tool:
$ module load prun
You should probably put these module load commands in your .bashrc for further connections to DAS-4.
hwloc
In order to properly discover the machine's cores, StarPU uses the hwloc library, which can be downloaded from the hwloc website. The build procedure is the usual:
$ ./configure --prefix=$HOME
$ make
$ make install
To easily get the proper compiler and linker flags for StarPU, as well as execution paths, add them to the pkg-config search path, library path, and binary path:
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/lib/pkgconfig
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib
$ export PATH=$PATH:$HOME/bin
You should add these lines to your .bashrc file for further connections.
StarPU
The StarPU source code can be downloaded from the StarPU website; make sure to get the latest release, i.e. 0.9.1. The build process follows the usual GNU style:
$ ./configure --prefix=$HOME
$ make
$ make install
In the summary dumped at the end of the configure step, check that CUDA support was detected (CUDA enabled: yes), as well as hwloc.
You can test execution of a “Hello world!” program (using prun to run the command on a DAS-4 computation node, as required by the usage policy):
$ prun -np 1 ./examples/basic_examples/hello_world
If execution does not find the cudart library, make sure that your .bashrc properly keeps the existing paths in the LD_LIBRARY_PATH environment variable.
Run the command several times; you will notice that StarPU calibrates the bus speed each time. This is because DAS-4’s job scheduler assigns a different node each time, and StarPU does not know that the local cluster we use is homogeneous, so it assumes that all nodes of the cluster may be different. Let’s force it to use the same machine ID for the whole cluster:
$ export STARPU_HOSTNAME=das4
Also add this to your .bashrc for further connections. Of course, on a heterogeneous cluster, the cluster launcher script should set different hostnames for the different node classes, as appropriate.
Hands-on session part 1: Task-based programming model
Application example: vector scaling
Making it and running it
A typical Makefile for applications using StarPU is the following (available for download):
CFLAGS += $(shell pkg-config --cflags libstarpu)
LDFLAGS += $(shell pkg-config --libs libstarpu)
vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@
clean:
	rm -f vector_scal *.o
Copy the vector_scal*.c* files from examples/basic_examples into a new empty directory, along with the Makefile mentioned above. Run make, and try:
$ prun -np 1 ./vector_scal
It should work: it simply scales a given vector by a given factor.
Computation kernels
Examine the source code, starting with vector_scal_cpu.c: this is the actual computation code, wrapped in a scal_cpu_func function which takes a series of DSM interfaces and a non-DSM parameter. The code simply gets an actual pointer from the first DSM interface and the factor value from the non-DSM parameter, then performs the vector scaling.
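In outline, the CPU kernel looks roughly like this (a sketch from memory of the shipped example, using StarPU 1.x-style type names; older releases spell some identifiers slightly differently):

#include <starpu.h>

/* Sketch of scal_cpu_func: buffers[] holds the DSM interfaces,
 * cl_arg carries the non-DSM factor argument. */
void scal_cpu_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);           /* number of elements */
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector); /* actual pointer */
    float factor = *(float *)cl_arg;

    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}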
The GPU implementation, in vector_scal_cuda.cu, is basically the same, with a host part (scal_cuda_func) which extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda) which performs the actual computation.
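A condensed sketch of the idea (the shipped .cu file contains the definitive version; the accessor macros and kernel geometry here are illustrative assumptions):

#include <starpu.h>

static __global__ void vector_mult_cuda(float *val, unsigned n, float factor)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        val[i] *= factor;
}

/* Host part: runs on the CPU, launches the device kernel above. */
extern "C" void scal_cuda_func(void *buffers[], void *cl_arg)
{
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]); /* device pointer */
    float factor = *(float *)cl_arg;

    unsigned threads_per_block = 64;
    unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;
    vector_mult_cuda<<<nblocks, threads_per_block>>>(val, n, factor);
    cudaThreadSynchronize();
}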
The OpenCL implementation is hairier, due to the low-level nature of the OpenCL standard, but the principle remains the same.
Main code
Now examine vector_scal.c: the cl (codelet) structure simply gathers pointers to the functions mentioned above. It also includes a performance model, which we will discuss this afternoon.
The main function:
- Allocates an application buffer, vector, and fills it.
- Registers it to StarPU, and gets back a DSM handle. From then on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
- Submits a (synchronous) task to StarPU.
- Unregisters the vector from StarPU, which brings the modified version back to main memory.
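Put together, the main code follows roughly this pattern (a sketch using StarPU 1.x-style field names such as cpu_funcs and handles; the 0.9 release spells some of these slightly differently):

#include <stdint.h>
#include <starpu.h>

#define NX 2048

extern void scal_cpu_func(void *buffers[], void *cl_arg);

static struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    starpu_init(NULL);
    for (unsigned i = 0; i < NX; i++)
        vector[i] = 1.0f;

    /* Register the buffer; from now on it is accessed through the handle. */
    starpu_vector_data_register(&handle, 0 /* home node: main memory */,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    struct starpu_task *task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = handle;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);
    task->synchronous = 1; /* wait for completion at submission */
    starpu_task_submit(task);

    /* Fetch the (possibly GPU-resident) result back to main memory. */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}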
Data partitioning
In the previous section, we submitted only one task. Here we discuss how to partition data so as to submit multiple tasks, which can be executed in parallel by the various CPUs and GPUs.
Let’s examine examples/basic_examples/mult.c.
- The computation kernel, cpu_mult, is a trivial matrix multiplication kernel, which operates on 3 given DSM interfaces. These are actually not whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, vert, is used to split B and C vertically. The second one, horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B; see the sketch after this list.
- launch_tasks submits the actual tasks: for each piece of C, take the appropriate pieces of A and B to produce that piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will replicate the pieces of A and B across the machine wherever tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since these will not be read from.
- The main code initializes StarPU and the data, launches the tasks, unpartitions the data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they were computed. Unpartitioning collects all the pieces of C into main memory to form the whole C result matrix.
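The partitioning step boils down to something like the following sketch. It uses the 1.x-era filter names (the 0.9 release used starpu_block_filter_func and starpu_vertical_block_filter_func instead), and the handles are assumed to have been registered earlier:

#include <starpu.h>

/* Handles registered earlier with starpu_matrix_data_register(). */
extern starpu_data_handle_t A_handle, B_handle, C_handle;

static void partition(unsigned nslicesx, unsigned nslicesy)
{
    struct starpu_data_filter vert =
    {
        .filter_func = starpu_matrix_filter_vertical_block, /* vertical split */
        .nchildren = nslicesx,
    };
    struct starpu_data_filter horiz =
    {
        .filter_func = starpu_matrix_filter_block,          /* horizontal split */
        .nchildren = nslicesy,
    };

    starpu_data_partition(B_handle, &vert);  /* B: vertical stripes */
    starpu_data_partition(A_handle, &horiz); /* A: horizontal stripes */
    /* C is cut both ways, giving an nslicesx x nslicesy grid of pieces,
     * each retrieved with starpu_data_get_sub_data(C_handle, 2, x, y). */
    starpu_data_map_filters(C_handle, 2, &vert, &horiz);
}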
Run the application, enabling some statistics:
$ prun -np 1 STARPU_WORKER_STATS=1 ./examples/basic_examples/mult
The figures show how the computation was distributed over the various processing units. We will discuss performance further this afternoon.
examples/mult/xgemm.c is a very similar matrix-matrix product example, but which makes use of BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (on CPUs) or cublasDgemm (on GPUs) on the DSM interface.
It is also able to benefit from a parallel implementation of DGEMM; we will however not have the time to discuss this still-experimental feature.
Let’s execute it on a node with one GPU:
$ prun -native '-l gpu=GTX480' -np 1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm
(it takes some time for StarPU to make an off-line bus performance calibration, but this is done only once).
We can notice that StarPU gave many more tasks to the GPU. You can also try to set num_gpu=2 to run on the machine which has two GPUs (there is only one such machine, so you may have to wait a long time; submit this in the background in a separate terminal). The interesting thing here is that, with no application modification beyond using a task-based programming model, we get multi-GPU support for free!
More advanced examples
examples/lu/xlu_implicit.c is a more involved example: this is a simple LU decomposition algorithm. The dw_codelet_facto_v3 function is actually the main algorithm loop, written in a very readable, sequential-looking way. It simply submits all the tasks asynchronously, and waits for them all.
examples/cholesky/cholesky_implicit.c is a similar example, but which makes use of the starpu_insert_task helper. The _cholesky function looks very much like dw_codelet_facto_v3 of the previous paragraph, and all task submission details are handled by starpu_insert_task.
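A single call to this helper looks roughly as follows (a sketch; cl, the handles and factor stand in for whatever codelet and data the application uses):

/* starpu_insert_task packs the codelet, the access modes and handles,
 * and small by-value arguments into one variadic call; the trailing 0
 * terminates the argument list. */
starpu_insert_task(&cl,
                   STARPU_R,  A_handle,
                   STARPU_R,  B_handle,
                   STARPU_RW, C_handle,
                   STARPU_VALUE, &factor, sizeof(factor),
                   0);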
Thanks to already using a task-based programming model, MAGMA and PLASMA have been easily ported to StarPU by simply using starpu_insert_task.
Exercise
Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Try to run it with various numbers of tasks.
Hands-on session part 2: Optimizations
This is based on the optimization chapter of StarPU’s documentation.
Data management
We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers. The application should thus use starpu_malloc when allocating its buffers, to permit asynchronous DMAs from and to them.
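In practice this just replaces malloc, along these lines (a minimal sketch):

#include <starpu.h>

#define NX 2048

int main(void)
{
    float *vector;

    starpu_init(NULL);
    /* Allocate a pinned buffer, suitable for asynchronous DMA transfers. */
    starpu_malloc((void **)&vector, NX * sizeof(*vector));

    /* ... register the buffer, submit tasks, unregister ... */

    starpu_free(vector);
    starpu_shutdown();
    return 0;
}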
Task submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave like the applications we have observed this morning: submit the whole graph of tasks, and wait for termination.
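In code, this means leaving task->synchronous at 0 and waiting only once at the end (a sketch, assuming ntasks tasks over pre-registered handles and a codelet cl):

/* Submit the whole task graph asynchronously... */
for (unsigned i = 0; i < ntasks; i++)
{
    struct starpu_task *task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = handles[i];
    /* task->synchronous is 0 by default: submission returns immediately. */
    starpu_task_submit(task);
}
/* ...and wait for all of them at once. */
starpu_task_wait_for_all();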
Task scheduling policy
By default, StarPU uses the eager simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.
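Declaring such a model is a matter of a few extra lines in the codelet (a sketch with 1.x-style names; the symbol string identifies the model on disk, and cpu_mult is a stand-in kernel):

extern void cpu_mult(void *buffers[], void *cl_arg);

/* History-based model: StarPU records measured execution times per
 * input size and uses them to predict future executions. */
static struct starpu_perfmodel mult_perf_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "mult_perf_model",
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { cpu_mult },
    .nbuffers = 3,
    .modes = { STARPU_R, STARPU_R, STARPU_W },
    .model = &mult_perf_model,
};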
For instance, compare the eager (default) and heft scheduling policies:
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
with
prun -native '-l gpu=GTX480' -np 1 STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=heft ./examples/mult/sgemm -x 1024 -y 1024 -z 1024
There are far fewer data transfers, and StarPU realizes that there is no point in giving tasks to GPUs, resulting in better performance.
You may have to use STARPU_CALIBRATE=2, as the performance of the kernels seems to vary on DAS-4; I have not had the time to check why.
Depending on whether you use gpu=GTX480 (old GPUs) or gpu=C2050 (recent GPUs), you should set export STARPU_HOSTNAME=das4-gtx480 or export STARPU_HOSTNAME=das4-c2050 respectively, to make StarPU use separate performance models for these two kinds of machines.
Try other schedulers; use STARPU_SCHED=help to get the list.
Also try with various sizes and draw curves.
You can also try the double-precision version, dgemm, and notice that the GPUs achieve less impressive performance there.
Performance model calibration
Performance prediction is essential for proper scheduling decisions, so the performance models have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.
$ starpu_perfmodel_display -l
file: <starpu_sgemm_gemm.das4>
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu
# hash size mean dev n
8bd4e11d 2359296 9.318547e+04 4.335047e+02 700
performance model for cuda_0
# hash size mean dev n
8bd4e11d 2359296 3.396056e+02 3.391979e+00 900
This shows that for the sgemm kernel with a 2359296-byte (about 2.3MB) matrix slice, the average execution time on CPUs was about 93ms, with a 0.4ms standard deviation, over 700 samples, while it took about 0.34ms on GPUs, with a 0.003ms standard deviation, over 900 samples. It is a good idea to check this before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using
export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use
export STARPU_CALIBRATE=2
More performance optimizations
The StarPU documentation’s optimization chapter provides more optimization tips, for further reading after the Spring School.
FxT tracing support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
The tool used by StarPU to record a trace is called FxT, and can be downloaded from Savannah. The build process is as usual:
$ ./configure --prefix=$HOME
$ make
$ make install
StarPU should then be recompiled with FxT support:
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
You should make sure that the summary at the end of ./configure shows that tracing was enabled:
Tracing enabled: yes
The trace file is output in /tmp by default. Since execution will happen on a cluster node, the file would not be reachable after execution, so we need to direct StarPU to output traces to the home directory, by using
$ export STARPU_FXT_PREFIX=$HOME/
Also add this to your .bashrc.
The application should be run again, and this time a prof_file_XX_YY trace file will be generated in your home directory. It can be converted to several formats by using:
$ starpu_fxt_tool -i ~/prof_file_*
That will create:
- a paje.trace file, which can be opened using the ViTE tool. This shows a Gantt diagram of the tasks which were executed, and thus the activity and idleness of the processing units, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened using Graphviz.
- an activity.data file, which records the activity of all processing units over time.
MPI support
StarPU provides support for MPI communications. Basically, it provides equivalents of the MPI_* functions, but which operate on DSM handles instead of void* buffers. The difference is that the source data may reside on a GPU where it has just been computed; StarPU automatically handles copying it back to main memory before submitting it to MPI.
mpi/tests/ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical ring MPI ping-pong, but the token which is passed from neighbour to neighbour is incremented by a StarPU task at each step.
This is written very naturally, by simply submitting all MPI communication requests and tasks asynchronously in a sequential-looking loop, and eventually waiting for all of them to complete.
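The body of such a loop boils down to the following pattern (a sketch; token_handle, prev, next, tag and increment_cl stand in for the application's handle, neighbour ranks, MPI tag and codelet):

/* Receive the token from the previous neighbour, increment it with a
 * task, and send it on: all three calls return immediately, and the
 * implicit dependencies on token_handle order them correctly. */
starpu_mpi_irecv_detached(token_handle, prev, tag, MPI_COMM_WORLD, NULL, NULL);
starpu_insert_task(&increment_cl,
                   STARPU_RW, token_handle,
                   0);
starpu_mpi_isend_detached(token_handle, next, tag, MPI_COMM_WORLD, NULL, NULL);

/* After the loop, wait for everything that was submitted. */
starpu_task_wait_for_all();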
starpu_mpi_insert_task
The Cholesky factorization shown in the presentation slides is available in mpi/examples/cholesky/mpi_cholesky.c. The data distribution over the MPI nodes is decided by the my_distrib function, and can thus be changed trivially.
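The key idea is that each handle is assigned an owner rank, and starpu_mpi_insert_task then infers the necessary communications by itself (a sketch following the example's structure; data_handles, my_distrib, world_size and cl are stand-ins):

/* Assign each data piece to the MPI rank chosen by my_distrib... */
starpu_data_set_rank(data_handles[x][y], my_distrib(x, y, world_size));

/* ...then submit tasks on every node: StarPU-MPI runs each task on the
 * owner of the data it writes, and transfers the inputs as needed. */
starpu_mpi_insert_task(MPI_COMM_WORLD, &cl,
                       STARPU_RW, data_handles[x][y],
                       0);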
Contact
For any questions regarding StarPU, please contact the StarPU developers on the mailing list: starpu-devel@inria.fr