The first step is of course to download and install StarPU. Before doing so, make sure to enable the paths to the CUDA and CUBLAS environments on your machine. On the Erik cluster, that, as well as enabling GCC and OpenMPI, means running:
module load gcc/4.6.3
module load cuda/5.0
module load openmpi/1.6.4/gcc/4.6.3
You should probably put these module load commands in your .bashrc for further connections to the Erik cluster.
In order to properly discover the machine cores, StarPU uses the hwloc library. It can be downloaded from the hwloc website. The build procedure is the usual one:
$ ./configure --prefix=$HOME
$ make
$ make install
To easily get the proper compiler and linker flags for StarPU as well as execution paths, enable them in the pkg-config search path and library path:
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/lib/pkgconfig
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib
$ export PATH=$PATH:$HOME/bin
You should add these lines to your .bashrc file for further connections.
The StarPU source code can be downloaded from the StarPU website; make sure to get the latest release, that is 1.1.0rc1. The build process uses the usual GNU style:
$ ./configure --prefix=$HOME
$ make
$ make install
In the summary dumped at the end of the configure step, check that CUDA support was detected (CUDA enabled: yes) as well as hwloc.
You can test the execution of a "Hello world!" StarPU program, using the cluster's batch scheduler to run the command on a computation node, as required by the cluster's usage policy.
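Such a program can be as small as the following sketch (a hypothetical hello.c, not one of the StarPU example files; it initializes StarPU, submits a single synchronous task whose CPU implementation prints the message, and shuts down — field and function names follow the StarPU 1.1 API):

```c
#include <stdio.h>
#include <starpu.h>

/* CPU implementation of the codelet: no data buffers, just print. */
void hello_cpu_func(void *buffers[], void *cl_arg)
{
    (void) buffers; (void) cl_arg;
    printf("Hello world!\n");
}

static struct starpu_codelet hello_cl =
{
    .cpu_funcs = { hello_cpu_func, NULL },
    .nbuffers = 0,
};

int main(void)
{
    /* Initialize StarPU with the default configuration. */
    if (starpu_init(NULL) != 0)
        return 1;

    /* Submit one task and wait for it synchronously. */
    struct starpu_task *task = starpu_task_create();
    task->cl = &hello_cl;
    task->synchronous = 1;
    starpu_task_submit(task);

    starpu_shutdown();
    return 0;
}
```

Compile it with the pkg-config flags set up above, and submit the resulting binary through the batch scheduler.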
Run the command several times; you will notice that StarPU may calibrate the bus speed several times. This is because the cluster's batch scheduler assigns a different node each time, and StarPU does not know that the cluster we use is homogeneous; it thus assumes that all nodes of the cluster may be different. Let's force it to use the same machine ID for the whole cluster:
$ export STARPU_HOSTNAME=erik
Also add this to your .bashrc for further connections. Of course, on a heterogeneous cluster, the cluster launcher script should set different hostnames for the different node classes, as appropriate.
A typical Makefile for applications using StarPU is then the following (available for download):
CFLAGS += $(shell pkg-config --cflags starpu-1.1)
LDFLAGS += $(shell pkg-config --libs starpu-1.1)

vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o

%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

clean:
	rm -f vector_scal *.o
Copy the vector_scal*.c* files and the vector_scal_cpu_template.h file from examples/basic_examples into a new empty directory, along with the Makefile mentioned above. Run make, then run the resulting vector_scal executable using the batch scheduler. It should work: it simply scales a given vector by a given factor.
Examine the source code, starting from vector_scal_cpu.c: this is the actual computation code, which is wrapped in a scal_cpu_func function which takes a series of DSM interfaces and a non-DSM parameter. The code simply gets an actual pointer from the first DSM interface and the factor value from the non-DSM parameter, and performs the vector scaling.
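Stripped of the StarPU-specific parts, the computation at the heart of scal_cpu_func boils down to the following (in the real code, val and n are extracted from the vector DSM interface with STARPU_VECTOR_GET_PTR and STARPU_VECTOR_GET_NX, and factor comes through the cl_arg pointer; vector_scal_core is a name made up here for illustration):

```c
#include <stddef.h>

/* Core of the vector scaling: multiply each element in place.
   In scal_cpu_func, val and n come from the DSM vector interface,
   and factor from the non-DSM cl_arg parameter. */
void vector_scal_core(float *val, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        val[i] *= factor;
}
```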
The GPU implementation, in vector_scal_cuda.cu, is basically the same, with the host part (scal_cuda_func) which extracts the actual CUDA pointer from the DSM interface, and passes it to the device part (vector_mult_cuda) which performs the actual computation.
The OpenCL implementation is more hairy due to the low-level aspect of the OpenCL standard, but the principle remains the same.
Now examine vector_scal.c: the cl (codelet) structure simply gathers pointers on the functions mentioned above. It also includes a performance model.
The main function initializes StarPU, registers the vector with StarPU's data management, creates and submits a task that applies the codelet to it, and finally unregisters the vector and shuts StarPU down.
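In outline, the codelet declaration and the main function follow this pattern (a sketch of what vector_scal.c does, abridged and with error checking omitted; field and function names follow the StarPU 1.1 API):

```c
#include <stdint.h>
#include <starpu.h>

#define NX 2048

/* Implementations live in vector_scal_cpu.c, _cuda.cu and _opencl.c. */
extern void scal_cpu_func(void *buffers[], void *cl_arg);
extern void scal_cuda_func(void *buffers[], void *cl_arg);
extern void scal_opencl_func(void *buffers[], void *cl_arg);

static struct starpu_perfmodel vector_scal_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "vector_scal",
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func, NULL },
    .cuda_funcs = { scal_cuda_func, NULL },
    .opencl_funcs = { scal_opencl_func, NULL },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &vector_scal_model,
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    starpu_init(NULL);

    /* Hand the vector over to StarPU's data management (DSM). */
    starpu_vector_data_register(&handle, 0 /* home node: main memory */,
                                (uintptr_t) vector, NX, sizeof(vector[0]));

    struct starpu_task *task = starpu_task_create();
    task->cl = &cl;
    task->handles[0] = handle;       /* DSM parameter */
    task->cl_arg = &factor;          /* non-DSM parameter */
    task->cl_arg_size = sizeof(factor);
    task->synchronous = 1;
    starpu_task_submit(task);

    /* Bring the data back to main memory and release the handle. */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}
```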
In the previous section, we submitted only one task. We here discuss how to partition data so as to submit multiple tasks which can be executed in parallel by the various CPUs and GPUs.
Let's examine examples/basic_examples/mult.c.
Run the application with the batch scheduler, enabling some statistics:
The statistics show how the computation was distributed over the various processing units.
examples/mult/xgemm.c is a very similar matrix-matrix product example, but one which makes use of BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (on CPUs) or cublasDgemm (on GPUs) on the DSM interface.
Let's execute it on a node with one GPU:
(it takes some time for StarPU to perform an off-line bus performance calibration, but this is done only once).
We can notice that StarPU gave many more tasks to the GPU. You can also try to set num_gpu=2 to run on the machine which has two GPUs (there is only one such machine, so you may have to wait a long time; submit this in the background from a separate terminal). The interesting thing here is that, with no application modification beyond using a task-based programming model, we get multi-GPU support for free!
examples/lu/xlu_implicit.c is a more involved example: this is a simple LU decomposition algorithm. The dw_codelet_facto_v3 is actually the main algorithm loop, in a very readable, sequential-looking way. It simply submits all the tasks asynchronously, and waits for them all.
examples/cholesky/cholesky_implicit.c is a similar example, but which makes use of the starpu_insert_task helper. The _cholesky function looks very much like dw_codelet_facto_v3 of the previous paragraph, and all task submission details are handled by starpu_insert_task.
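For instance, submitting the vector-scaling task of the earlier example with this helper would reduce to something like the following (a sketch, assuming cl, handle and factor are defined as before; the terminating 0 ends the variadic argument list):

```c
/* One call replaces task creation, argument packing and submission.
   STARPU_RW passes a DSM handle; STARPU_VALUE passes a plain value
   which StarPU copies into cl_arg. */
starpu_insert_task(&cl,
                   STARPU_RW, handle,
                   STARPU_VALUE, &factor, sizeof(factor),
                   0);
```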
Thanks to already using a task-based programming model, MAGMA and PLASMA were easily ported to StarPU, simply by using starpu_insert_task.
Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Try to run it with various numbers of tasks.
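As a starting point, partitioning the vector into equal blocks and submitting one task per block can be sketched as follows (assuming handle, cl and factor from the earlier example; starpu_vector_filter_block is StarPU's predefined block filter for vectors):

```c
#define NPARTS 8

/* Split the registered vector into NPARTS contiguous blocks. */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = NPARTS,
};
starpu_data_partition(handle, &f);

/* Submit one task per sub-vector; each may run on a different unit. */
for (int i = 0; i < NPARTS; i++)
{
    starpu_data_handle_t sub = starpu_data_get_sub_data(handle, 1, i);
    starpu_insert_task(&cl,
                       STARPU_RW, sub,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);
}

starpu_task_wait_for_all();
/* Gather the pieces back into the original handle. */
starpu_data_unpartition(handle, 0 /* main memory node */);
```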
This section is based on StarPU's documentation optimization chapter.
We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers. The application should thus use starpu_malloc when allocating its buffer, to permit asynchronous DMAs from and to it.
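For instance (a sketch; NX is assumed to be defined by the application, and starpu_malloc pins the memory so that CUDA can perform asynchronous DMAs to and from it):

```c
float *vector;

/* Allocate a pinned buffer suitable for asynchronous DMA. */
starpu_malloc((void **) &vector, NX * sizeof(*vector));

/* ... register the buffer, submit tasks, unregister ... */

starpu_free(vector);
```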
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave like that: submit the whole graph of tasks, and wait for termination.
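In outline, the recommended structure is the following (a sketch, assuming an array of ntasks registered handles; starpu_insert_task is non-blocking by default):

```c
/* Submit the whole task graph asynchronously... */
for (int i = 0; i < ntasks; i++)
    starpu_insert_task(&cl, STARPU_RW, handles[i], 0);

/* ...and only then wait for every submitted task to complete. */
starpu_task_wait_for_all();
```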
By default, StarPU uses the eager scheduler, a simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken too late.
If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests to overlap data transfers and computations.
For instance, compare the eager (default) and dmda scheduling policies:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 [PATH]/examples/mult/sgemm -x 1024 -y 1024 -z 1024
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda [PATH]/examples/mult/sgemm -x 1024 -y 1024 -z 1024
There are far fewer data transfers, and StarPU realizes that there is no point in giving tasks to the CPUs, resulting in better performance.
Try other schedulers, use STARPU_SCHED=help to get the list.
Also try with various sizes and draw curves.
You can also try the double-precision version, dgemm, and notice that the GPUs achieve less impressive performance there.
Performance prediction is essential for proper scheduling decisions; the performance models thus have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance models.
$ starpu_perfmodel_display -l
file: <starpu_sgemm_gemm.erik>
$ starpu_perfmodel_display -s starpu_sgemm_gemm
performance model for cpu
# hash      size        mean          dev           n
8bd4e11d    2359296     9.318547e+04  4.335047e+02  700
performance model for cuda_0
# hash      size        mean          dev           n
8bd4e11d    2359296     3.396056e+02  3.391979e+00  900
This shows that for the sgemm kernel with a 2359296-byte (2.25 MiB) matrix slice, the average execution time on CPUs was about 93 ms, with a 0.43 ms standard deviation, over 700 samples, while it took about 0.34 ms on GPUs, with a 3.4 µs standard deviation, over 900 samples. It is a good idea to check this before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use export STARPU_CALIBRATE=2
StarPU provides support for MPI communications. Basically, it provides equivalents of MPI_* functions, but which operate on DSM handles instead of void* buffers. The difference is that the source data may be residing on a GPU where it just got computed. StarPU will automatically handle copying it back to main memory before submitting it to MPI.
mpi/tests/ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical ring MPI ping-pong, but the token which is being passed on from neighbour to neighbour is incremented by a StarPU task at each step.
This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.
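One step of the ring can be sketched as follows (hypothetical prev_rank, next_rank, tag and increment_cl names, introduced here for illustration; starpu_mpi_irecv_detached and starpu_mpi_isend_detached are the handle-based, detached equivalents of MPI_Irecv and MPI_Isend):

```c
/* Receive the token from the previous neighbour (detached: StarPU
   completes the request in the background). */
starpu_mpi_irecv_detached(token_handle, prev_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);

/* Increment the token with a StarPU task; the dependency on the
   pending receive is handled automatically through the handle. */
starpu_insert_task(&increment_cl, STARPU_RW, token_handle, 0);

/* Pass it on to the next neighbour; if the data was just computed
   on a GPU, StarPU fetches it back to main memory as needed. */
starpu_mpi_isend_detached(token_handle, next_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);
```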
The Cholesky factorization shown in the presentation slides is available in mpi/examples/cholesky/mpi_cholesky.c. The data distribution over MPI nodes is decided by the my_distrib function, and can thus be changed trivially.
The StarPU documentation's optimization chapter provides more optimization tips, for further reading after the Spring School.
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
The tool used by StarPU to record a trace is called FxT, and can be downloaded from Savannah. The build process is as usual:
$ ./configure --prefix=$HOME
$ make
$ make install
StarPU should then be recompiled with FxT support:
$ ./configure --with-fxt --prefix=$HOME
$ make clean
$ make
$ make install
You should make sure that the summary at the end of ./configure shows that tracing was enabled:
Tracing enabled: yes
The trace file is output to /tmp by default. Since execution happens on a cluster node, the file would not be reachable after execution, so we need to direct StarPU to output traces to the home directory, by using
$ export STARPU_FXT_PREFIX=$HOME/
and add it to your .bashrc.
The application should be run again, and this time a prof_file_XX_YY trace file will be generated in your home directory. This can be converted to several formats by using
$ starpu_fxt_tool -i ~/prof_file_*
That will create, among other files, a paje.trace file, which can be visualized with the ViTE trace viewer, and a dag.dot file containing the task graph, which can be rendered with Graphviz.
Last updated on 2013/06/03.