StarPU Handbook
Data Management

TODO: intro which mentions consistency among other things

Data Interface

StarPU provides several data interfaces for programmers to describe the data layout of their application. There are predefined interfaces already available in StarPU. Users can define new data interfaces as explained in Defining A New Data Interface. All functions provided by StarPU are documented in Data Interfaces. You will find a short list below.

Variable Data Interface

A variable is a given-size byte element, typically a scalar. Here is an example of how to register a variable with StarPU by using starpu_variable_data_register().

float var = 42.0;
starpu_data_handle_t var_handle;
starpu_variable_data_register(&var_handle, STARPU_MAIN_RAM, (uintptr_t)&var, sizeof(var));

Vector Data Interface

A vector is a fixed number of elements of a given size. Here is an example of how to register a vector with StarPU by using starpu_vector_data_register().

float vector[NX];
starpu_data_handle_t vector_handle;
starpu_vector_data_register(&vector_handle, STARPU_MAIN_RAM, (uintptr_t)vector, NX, sizeof(vector[0]));

Vectors can be partitioned into pieces by using starpu_vector_filter_block(). They can also be partitioned with some overlapping by using starpu_vector_filter_block_shadow(). By default StarPU uses the same size for each piece. If different sizes are desired, starpu_vector_filter_list() or starpu_vector_filter_list_long() can be used instead. To just divide in two pieces, starpu_vector_filter_divide_in_2() can be used.

Matrix Data Interface

To register 2-D matrices with a potential padding, one can use the matrix data interface. Here is an example of how to register a matrix with StarPU by using starpu_matrix_data_register().

float *matrix;
starpu_data_handle_t matrix_handle;
matrix = (float*)malloc(width * height * sizeof(float));
starpu_matrix_data_register(&matrix_handle, STARPU_MAIN_RAM, (uintptr_t)matrix, width, width, height, sizeof(float));

2D matrices can be partitioned into 2D matrices along the x dimension by using starpu_matrix_filter_block(), and along the y dimension by using starpu_matrix_filter_vertical_block(). They can also be partitioned with some overlapping by using starpu_matrix_filter_block_shadow() and starpu_matrix_filter_vertical_block_shadow().

Block Data Interface

To register 3-D matrices with potential padding on the Y and Z dimensions, one can use the block data interface. Here is an example of how to register a block with StarPU by using starpu_block_data_register().

float *block;
starpu_data_handle_t block_handle;
block = (float*)malloc(nx*ny*nz*sizeof(float));
starpu_block_data_register(&block_handle, STARPU_MAIN_RAM, (uintptr_t)block, nx, nx*ny, nx, ny, nz, sizeof(float));

3D matrices can be partitioned along the x dimension by using starpu_block_filter_block(), along the y dimension by using starpu_block_filter_vertical_block(), or along the z dimension by using starpu_block_filter_depth_block(). They can also be partitioned with some overlapping by using starpu_block_filter_block_shadow(), starpu_block_filter_vertical_block_shadow(), or starpu_block_filter_depth_block_shadow().

BCSR Data Interface

BCSR (Blocked Compressed Sparse Row) sparse matrix data can be registered with StarPU using the BCSR data interface. Here is an example of how to do so by using starpu_bcsr_data_register().

/*
 * We use the following matrix:
 *
 *   +----------------+
 *   |  0   1   0   0 |
 *   |  2   3   0   0 |
 *   |  4   5   8   9 |
 *   |  6   7  10  11 |
 *   +----------------+
 *
 * nzval  = [0, 1, 2, 3] ++ [4, 5, 6, 7] ++ [8, 9, 10, 11]
 * colind = [0, 0, 1]
 * rowptr = [0, 1, 3]
 * r = c = 2
 */

/* Size of the blocks */
int R = 2;
int C = 2;

int NROWS = 2;
int NNZ_BLOCKS = 3; /* out of 4 */
int NZVAL_SIZE = (R*C*NNZ_BLOCKS);

int nzval[NZVAL_SIZE] =
{
    0, 1, 2, 3,    /* First block */
    4, 5, 6, 7,    /* Second block */
    8, 9, 10, 11   /* Third block */
};
uint32_t colind[NNZ_BLOCKS] =
{
    0, /* block-column index for first block in nzval */
    0, /* block-column index for second block in nzval */
    1  /* block-column index for third block in nzval */
};
uint32_t rowptr[NROWS+1] =
{
    0, /* block-index in nzval of the first block of the first row. */
    1, /* block-index in nzval of the first block of the second row. */
    NNZ_BLOCKS /* number of blocks, to allow an easier element's access for the kernels */
};

starpu_data_handle_t bcsr_handle;
starpu_bcsr_data_register(&bcsr_handle,
                          STARPU_MAIN_RAM,
                          NNZ_BLOCKS,
                          NROWS,
                          (uintptr_t) nzval,
                          colind,
                          rowptr,
                          0, /* firstentry */
                          R,
                          C,
                          sizeof(nzval[0]));

StarPU provides an example of how to deal with such matrices in examples/spmv.

BCSR data handles can be partitioned into their dense matrix blocks by using starpu_bcsr_filter_canonical_block(), or split into other BCSR data handles by using starpu_bcsr_filter_vertical_block() (only splitting along the leading dimension, i.e. along adjacent nnz blocks, is supported).

CSR Data Interface

CSR (Compressed Sparse Row) sparse matrix data can similarly be registered with StarPU by using starpu_csr_data_register().

CSR data handles can be partitioned into vertical CSR matrices by using starpu_csr_filter_vertical_block().

Data Interface with Variable Size

Tasks are actually allowed to change the size of data interfaces.

The simplest case is just changing the amount of data actually used within the allocated buffer. This is for instance implemented for the matrix interface: one can set the new NX/NY values with STARPU_MATRIX_SET_NX(), STARPU_MATRIX_SET_NY(), and STARPU_MATRIX_SET_LD() at the end of the task implementation. Data transfers achieved by StarPU will then use these values instead of the whole allocated size. The new values must of course still fit within the original allocation. To reserve room for increasing the NX/NY values, one can use starpu_matrix_data_register_allocsize() instead of starpu_matrix_data_register(), to specify the allocation size to be used instead of the default NX*NY*ELEMSIZE. To support this, the data interface has to implement the starpu_data_interface_ops::alloc_footprint and starpu_data_interface_ops::alloc_compare methods, for proper StarPU allocation management.

A more involved case is changing the amount of allocated data. The task implementation can just reallocate the buffer during its execution, and set the proper new values in the interface structure, e.g. nx, ny, ld, etc. so that the StarPU core knows the new data layout. The starpu_data_interface_ops structure however then needs to have the starpu_data_interface_ops::dontcache field set to 1, to prevent StarPU from trying to perform any cached allocation, since the allocated size will vary. An example is available in tests/datawizard/variable_size.c. The example uses its own data interface so as to contain some simulation information for data growth, but the principle can be applied for any data interface.

The principle is to use starpu_malloc_on_node_flags() to make the new allocation, and starpu_free_on_node_flags() to release any previous allocation. The flags have to be precisely like in the example:

unsigned workerid = starpu_worker_get_id_check();
unsigned dst_node = starpu_worker_get_memory_node(workerid);
/* allocate the larger buffer and release the previous one (old_ptr);
   see tests/datawizard/variable_size.c for the exact flags to use */
interface->ptr = starpu_malloc_on_node_flags(dst_node, interface->size + increase, STARPU_MALLOC_PINNED | STARPU_MALLOC_COUNT);
starpu_free_on_node_flags(dst_node, old_ptr, interface->size, STARPU_MALLOC_PINNED | STARPU_MALLOC_COUNT);
interface->size += increase;

so that the allocated area has the expected properties and the allocation is accounted for properly.

Depending on the interface (vector, CSR, etc.) you may have to fix several members of the data interface: e.g. both nx and allocsize for vectors, and store the pointer both in ptr and dev_handle.

Some interfaces make a distinction between the actual number of elements stored in the data and the actually allocated buffer. For instance, the vector interface uses the nx field for the former, and the allocsize field for the latter. This allows for lazy reallocation, to avoid reallocating the buffer every time to exactly match the actual number of elements. Computations and data transfers will use the nx field, while allocation functions will use allocsize. One just has to make sure that allocsize is always greater than or equal to nx.

Important note: one cannot change the size of partitioned data.

Data Management

When the application allocates data, whenever possible it should use the starpu_malloc() function, which will ask CUDA or OpenCL to perform the allocation itself and pin the corresponding allocated memory, or use the starpu_memory_pin() function to pin memory allocated by other means, such as local arrays. This is needed to permit asynchronous data transfers, i.e. to allow data transfers to overlap with computations. Otherwise, the trace will show that the DriverCopyAsync state takes a lot of time; this is because CUDA or OpenCL then reverts to synchronous transfers.

The application can provide its own allocation function by calling starpu_malloc_set_hooks(). StarPU will then use them for all data handle allocations in the main memory.

By default, StarPU leaves replicates of data wherever they were used, in case they will be re-used by other tasks, thus saving the data transfer time. When some task modifies some data, all the other replicates are invalidated, and only the processing unit which ran this task will have a valid replicate of the data. If the application knows that this data will not be re-used by further tasks, it should advise StarPU to immediately replicate it to a desired list of memory nodes (given through a bitmask). This can be understood like the write-through mode of CPU caches.

starpu_data_set_wt_mask(img_handle, 1<<0);

will for instance request to always automatically transfer a replicate into the main memory (node 0), as bit 0 of the write-through bitmask is being set.

starpu_data_set_wt_mask(img_handle, ~0U);

will request to always automatically broadcast the updated data to all memory nodes.

Setting the write-through mask to ~0U can also be useful to make sure all memory nodes always have a copy of the data, so that it is never evicted when memory gets scarce.

Implicit data dependency computation can become expensive if a lot of tasks access the same piece of data. If no dependency is required on some piece of data (e.g. because it is only accessed in read-only mode, or because write accesses are actually commutative), use the function starpu_data_set_sequential_consistency_flag() to disable implicit dependencies on this data.

In the same vein, accumulation of results in the same data can become a bottleneck. The mode STARPU_REDUX makes it possible to optimize such accumulation (see Data Reduction). To a lesser extent, the flag STARPU_COMMUTE keeps the bottleneck (see Commute Data Access), but at least permits the accumulation to happen in any order.

Applications often need data just for temporary results. In such a case, registration can be made without an initial value; for instance, this produces a vector data:

starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));

StarPU will then allocate the actual buffer only when it is actually needed, e.g. directly on the GPU without allocating in main memory.

In the same vein, once the temporary results are not useful any more, the data should be thrown away. If the handle is not to be reused, it can be unregistered:

starpu_data_unregister_submit(handle);

actual unregistration will be done after all tasks working on the handle terminate.

If the handle is to be reused, instead of unregistering it, it can simply be invalidated:

starpu_data_invalidate_submit(handle);

the buffers containing the current value will then be freed, and reallocated only when another task writes some value to the handle.

Data Prefetch

The scheduling policies heft, dmda and pheft perform data prefetch (see STARPU_PREFETCH): as soon as a scheduling decision is taken for a task, requests are issued to transfer its required data to the target processing unit, if needed, so that when the processing unit actually starts the task, its data will hopefully be already available and it will not have to wait for the transfer to finish.

The application may want to perform some manual prefetching, for several reasons such as excluding initial data transfers from performance measurements, or setting up an initial statically-computed data distribution on the machine before submitting tasks, which will thus guide StarPU toward an initial task distribution (since StarPU will try to avoid further transfers).

This can be achieved by giving the function starpu_data_prefetch_on_node() the handle and the desired target memory node. The starpu_data_idle_prefetch_on_node() variant can be used to issue the transfer only when the bus is idle.

Conversely, one can advise StarPU that some data will not be useful in the close future by calling starpu_data_wont_use(). StarPU will then write its value back to its home node, and evict it from GPUs when room is needed.

Partitioning Data

An existing piece of data can be partitioned into sub-parts to be used by different tasks, for instance:

#define NX 1048576
#define PARTS 16
int vector[NX];
starpu_data_handle_t handle;

/* Declare data to StarPU */
starpu_vector_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)vector, NX, sizeof(vector[0]));

/* Partition the vector in PARTS sub-vectors */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = PARTS
};
starpu_data_partition(handle, &f);

The task submission then uses the function starpu_data_get_sub_data() to retrieve the sub-handles to be passed as task parameters.

/* Submit a task on each sub-vector */
for (i = 0; i < starpu_data_get_nb_children(handle); i++)
{
    /* Get subdata number i (there is only 1 dimension) */
    starpu_data_handle_t sub_handle = starpu_data_get_sub_data(handle, 1, i);
    struct starpu_task *task = starpu_task_create();

    task->handles[0] = sub_handle;
    task->cl = &cl;
    task->synchronous = 1;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
}

Partitioning can be applied several times, see examples/basic_examples/mult.c and examples/filters/.

Whenever the whole piece of data is already available, the partitioning will be done in-place, i.e. without allocating new buffers, but just using pointers inside the existing copy. This is particularly important to be aware of when using OpenCL, where the kernel parameters are not pointers, but cl_mem handles. The kernel thus also needs to be passed the offset within the OpenCL buffer:

void opencl_func(void *buffers[], void *cl_arg)
{
    cl_mem vector = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
    unsigned offset = STARPU_VECTOR_GET_OFFSET(buffers[0]);
    ...
    clSetKernelArg(kernel, 0, sizeof(vector), &vector);
    clSetKernelArg(kernel, 1, sizeof(offset), &offset);
    ...
}

And the kernel has to shift from the pointer passed by the OpenCL driver:

__kernel void opencl_kernel(__global int *vector, unsigned offset)
{
    vector = (__global int *)((__global char *)vector + offset);
    ...
}

When the sub-data is not of the same type as the original data, the starpu_data_filter::get_child_ops field needs to be set appropriately for StarPU to know which type should be used.

StarPU provides various interfaces and filters for matrices, vectors, etc., but applications can also write their own data interfaces and filters, see examples/interface and examples/filters/custom_mf for an example, and see Defining A New Data Interface and Defining A New Data Filter for documentation.

Asynchronous Partitioning

The partitioning functions described in the previous section are synchronous: starpu_data_partition() and starpu_data_unpartition() both wait for all the tasks currently working on the data. This can be a bottleneck for the application.

An asynchronous API also exists; it works only on handles with sequential consistency. The principle is to first plan the partitioning, which returns the data handles of the partition, although they are not functional yet. When submitting tasks, one can mix using the handles of the partition and the handle of the whole data. One can even partition recursively and mix using handles at different levels of the recursion. Of course, StarPU will have to introduce coherency synchronization.

fmultiple_submit_implicit is a complete example using this technique. One can also look at fmultiple_submit_readonly, which contains the explicit coherency synchronization that is automatically introduced by StarPU for fmultiple_submit_implicit.

In short, we first register a matrix and plan the partitioning:

starpu_data_handle_t vert_handle[PARTS];
starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0]));

struct starpu_data_filter f_vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = PARTS
};
starpu_data_partition_plan(handle, &f_vert, vert_handle);

starpu_data_partition_plan() returns the handles for the partition in vert_handle.

One can then submit tasks working on the main handle, and tasks working on vert_handle handles. Between using the main handle and vert_handle handles, StarPU will automatically call starpu_data_partition_submit() and starpu_data_unpartition_submit().

All this code is asynchronous: one just submits tasks, and the partitioning and unpartitioning will be done at runtime.

Planning several partitionings of the same data is also possible; StarPU will unpartition and repartition as needed when mixing accesses of different partitions. If data are accessed in read-only mode, StarPU will allow the different partitionings to coexist. As soon as the data is accessed in read-write mode, StarPU will automatically unpartition everything and activate only the partitioning leading to the data being written to.

For instance, for a stencil application, one can split a subdomain into its interior and halos, and then just submit a task updating the whole subdomain, then submit MPI sends/receives to update the halos, then submit again a task updating the whole subdomain, etc. and StarPU will automatically partition/unpartition each time.

Manual Partitioning

One can also handle partitioning by hand, by registering several views on the same piece of data. The idea is then to manage the coherency of the various views through the common buffer in the main memory. fmultiple_manual is a complete example using this technique.

In short, we first register the same matrix several times:

starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0]));
for (i = 0; i < PARTS; i++)
starpu_matrix_data_register(&vert_handle[i], STARPU_MAIN_RAM, (uintptr_t)&matrix[0][i*(NX/PARTS)], NX, NX/PARTS, NY, sizeof(matrix[0][0]));

Since StarPU is not aware that the two handles are actually pointing to the same data, there is a danger of inadvertently submitting tasks to both views, which would make a mess since StarPU would not guarantee any coherency between the two views. To make sure we don't do this, we invalidate the view that we will not use:

for (i = 0; i < PARTS; i++)
    starpu_data_invalidate(vert_handle[i]);

Then we can safely work on handle.

When we want to switch to the vertical slice view, all we need to do is bring coherency between them by running an empty task on the home node of the data:

void empty(void *buffers[], void *cl_arg)
{ }

struct starpu_codelet cl_switch =
{
    .cpu_funcs = {empty},
    .nbuffers = STARPU_VARIABLE_NBUFFERS,
};

ret = starpu_task_insert(&cl_switch, STARPU_RW, handle,
                         STARPU_W, vert_handle[0],
                         STARPU_W, vert_handle[1],
                         0);

The execution of the switch task will get back the matrix data into the main memory, and thus the vertical slices will get the updated value there.

Again, we prefer to make sure that we don't accidentally access the matrix through the whole-matrix handle:

starpu_data_invalidate_submit(handle);

And now we can start using the vertical slices, etc.

Handle Data Buffer Pointers

A simple understanding of StarPU handles is that they are collections of buffers, one on each memory node of the machine, which contain the same data. The picture is however made more complex by the OpenCL support and by partitioning.

When partitioning a handle, the data buffers of the subhandles will indeed be inside the data buffers of the main handle (to save transferring data back and forth between the main handle and the subhandles). But in OpenCL, a cl_mem is not a pointer, but an opaque value on which pointer arithmetic cannot be used. That is why data interfaces contain three members: dev_handle, offset, and ptr. The dev_handle member is what the allocation function returned, and one cannot do arithmetic on it. The offset member is the offset inside the allocated area; most often it will be 0 because data start at the beginning of the allocated area, but when the handle is partitioned, the subhandles will have varying offset values, one for each subpiece. The ptr member, in the non-OpenCL case, i.e. when pointer arithmetic can be used on dev_handle, is just the sum of dev_handle and offset, provided for convenience.

This means that:

  • computation kernels can use ptr in non-OpenCL implementations.
  • computation kernels have to use dev_handle and offset in the OpenCL implementation.
  • allocation methods of data interfaces have to store the value returned by starpu_malloc_on_node() in dev_handle and ptr, and set offset to 0.
  • partitioning filters have to copy over dev_handle without modifying it, set in the child different values of offset, and set ptr accordingly as the sum of dev_handle and offset.

Defining A New Data Filter

StarPU provides a series of predefined filters in Data Partition, but additional filters can be defined by the application. The principle is that the filter function just fills the memory location of the i-th subpart of a data. Examples are provided in src/datawizard/interfaces/*_filters.c, and see starpu_data_filter::filter_func for the details. The starpu_filter_nparts_compute_chunk_size_and_offset() helper can be used to compute the division of pieces of data.

Data Reduction

In various cases, some piece of data is used to accumulate intermediate results: for instance, the dot product of a vector, maximum/minimum finding, the histogram of a photograph, etc. When these results are produced all over the machine, it would not be efficient to accumulate them in only one place, incurring a data transfer for each access, as well as access concurrency.

StarPU provides the mode STARPU_REDUX, which optimizes this case: it will allocate a buffer on each worker (lazily), and accumulate intermediate results there. When the data is eventually accessed in the normal mode STARPU_R, StarPU will collect the intermediate results into one buffer.

For this to work, the user has to use the function starpu_data_set_reduction_methods() to declare how to initialize these buffers, and how to assemble partial results.

For instance, cg uses that to optimize its dot product: it first defines the codelets for initialization and reduction:

struct starpu_codelet bzero_variable_cl =
{
    .cpu_funcs = { bzero_variable_cpu },
    .cpu_funcs_name = { "bzero_variable_cpu" },
    .cuda_funcs = { bzero_variable_cuda },
    .nbuffers = 1,
};

static void accumulate_variable_cpu(void *descr[], void *cl_arg)
{
    double *v_dst = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
    double *v_src = (double *)STARPU_VARIABLE_GET_PTR(descr[1]);
    *v_dst = *v_dst + *v_src;
}

static void accumulate_variable_cuda(void *descr[], void *cl_arg)
{
    double *v_dst = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
    double *v_src = (double *)STARPU_VARIABLE_GET_PTR(descr[1]);
    cublasaxpy(1, (double)1.0, v_src, 1, v_dst, 1);
}

struct starpu_codelet accumulate_variable_cl =
{
    .cpu_funcs = { accumulate_variable_cpu },
    .cpu_funcs_name = { "accumulate_variable_cpu" },
    .cuda_funcs = { accumulate_variable_cuda },
    .nbuffers = 2, /* destination and source of the accumulation */
};

and attaches them as reduction methods for its handle dtq:

starpu_variable_data_register(&dtq_handle, -1, NULL, sizeof(type));
starpu_data_set_reduction_methods(dtq_handle, &accumulate_variable_cl, &bzero_variable_cl);

and dtq_handle can now be used in mode STARPU_REDUX for the dot products with partitioned vectors:

for (b = 0; b < nblocks; b++)
    starpu_task_insert(&dot_kernel_cl,
                       STARPU_REDUX, dtq_handle,
                       STARPU_R, starpu_data_get_sub_data(v1, 1, b),
                       STARPU_R, starpu_data_get_sub_data(v2, 1, b),
                       0);

During registration, we have here provided NULL, i.e. there is no initial value to be taken into account during reduction. StarPU will thus only take into account the contributions from the tasks using dot_kernel_cl. Also, it will not allocate any memory for dtq_handle before these tasks are ready to run.

If another dot product has to be performed, one could unregister dtq_handle and re-register it. But one can also call starpu_data_invalidate_submit() with the parameter dtq_handle, which will clear all data from the handle, thus resetting it back to the initial status it had when registered with NULL.

The example cg also uses reduction for the blocked gemv kernel, leading to yet more relaxed dependencies and more parallelism.

STARPU_REDUX can also be passed to starpu_mpi_task_insert() in the MPI case. This will however not produce any MPI communication, but just pass STARPU_REDUX to the underlying starpu_task_insert(). It is up to the application to call starpu_mpi_redux_data(), which posts tasks which will reduce the partial results among MPI nodes into the MPI node which owns the data. For instance, some hypothetical application which collects partial results into data res, then uses it for other computation, before looping again with a new reduction:

for (i = 0; i < 100; i++)
{
    starpu_mpi_task_insert(MPI_COMM_WORLD, &init_res, STARPU_W, res, 0);
    starpu_mpi_task_insert(MPI_COMM_WORLD, &work, STARPU_RW, A, STARPU_R, B, STARPU_REDUX, res, 0);
    starpu_mpi_redux_data(MPI_COMM_WORLD, res);
    starpu_mpi_task_insert(MPI_COMM_WORLD, &work2, STARPU_RW, B, STARPU_R, res, 0);
}

Commute Data Access

By default, the implicit dependencies computed from data accesses use the sequential semantic. Notably, write accesses are always serialized in the order of submission. In some applications, the write contributions can actually be performed in any order without affecting the eventual result. In this case it is useful to drop the strictly sequential semantic, to improve parallelism by allowing StarPU to reorder the write accesses. This can be done by using the STARPU_COMMUTE data access flag. Accesses without this flag will however properly be serialized against accesses with this flag. For instance:

starpu_task_insert(&cl1, STARPU_R, h, STARPU_RW, handle, 0);
starpu_task_insert(&cl2, STARPU_R, handle1, STARPU_RW|STARPU_COMMUTE, handle, 0);
starpu_task_insert(&cl2, STARPU_R, handle2, STARPU_RW|STARPU_COMMUTE, handle, 0);
starpu_task_insert(&cl3, STARPU_R, g, STARPU_RW, handle, 0);

The two tasks running cl2 will be able to commute: depending on whether the value of handle1 or handle2 becomes available first, the corresponding task running cl2 will start first. The task running cl1 will however always be run before them, and the task running cl3 will always be run after them.

If a lot of tasks use the commute access on the same set of data and a lot of them are ready at the same time, it may become interesting to use an arbiter, see Concurrent Data Accesses.

Concurrent Data Accesses

When several tasks are ready and will work on several data, StarPU is faced with the classical Dining Philosophers problem, and has to determine the order in which it will run the tasks.

Data accesses usually use sequential ordering, so data accesses are usually already serialized, and thus by default StarPU uses the Dijkstra solution, which scales very well in terms of overhead: tasks will just acquire data one by one, in data handle pointer value order.

When sequential ordering is disabled or the STARPU_COMMUTE flag is used, there may be a lot of concurrent accesses to the same data, and the Dijkstra solution gets only poor parallelism, typically in some pathological cases which do happen in various applications. In this case, one can use a data access arbiter, which implements the classical centralized solution for the Dining Philosophers problem. This is more expensive in terms of overhead since it is centralized, but it opportunistically gets a lot of parallelism. The centralization can also be avoided by using several arbiters, thus separating sets of data for which arbitration will be done. If a task accesses data from different arbiters, it will acquire them arbiter by arbiter, in arbiter pointer value order.

See the tests/datawizard/test_arbiter.cpp example.

Arbiters however do not support the STARPU_REDUX flag yet.

Temporary Buffers

There are two kinds of temporary buffers: temporary data, which just pass results from one task to another, and scratch data, which are needed only internally by tasks.

Temporary Data

Data can sometimes be entirely produced by a task, and entirely consumed by another task, without the need for other parts of the application to access it. In such a case, registration can be done without prior allocation, by using the special memory node number -1, and passing a zero pointer. StarPU will actually allocate memory only when the task creating the content gets scheduled, and destroy it on unregistration.

In addition to this, it can be tedious for the application to have to unregister the data, since it will not use its content anyway. The unregistration can be done lazily by using the function starpu_data_unregister_submit(), which will record that no more tasks accessing the handle will be submitted, so that it can be freed as soon as the last task accessing it is over.

The following code exemplifies both points: it registers the temporary data, submits three tasks accessing it, and records the data for automatic unregistration.

starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));
starpu_task_insert(&produce_data, STARPU_W, handle, 0);
starpu_task_insert(&compute_data, STARPU_RW, handle, 0);
starpu_task_insert(&summarize_data, STARPU_R, handle, STARPU_W, result_handle, 0);
starpu_data_unregister_submit(handle);

The application may also want to see the temporary data initialized on the fly before being used by the task. This can be done by using starpu_data_set_reduction_methods() to set an initialization codelet (no redux codelet is needed).

Scratch Data

Some kernels sometimes need temporary data to achieve the computations, i.e. a workspace. The application could allocate it at the start of the codelet function, and free it at the end, but this would be costly. It could also allocate one buffer per worker (similarly to How To Initialize A Computation Library Once For Each Worker?), but this would make them systematic and permanent. A more optimized way is to use the data access mode STARPU_SCRATCH, as exemplified below, which provides per-worker buffers without content consistency. The buffer is registered only once, using memory node -1, i.e. the application didn't allocate memory for it, and StarPU will allocate it on demand at task execution.

starpu_vector_data_register(&workspace, -1, 0, WORKSPACE_SIZE, sizeof(float)); /* WORKSPACE_SIZE: number of elements needed by the kernel */
for (i = 0; i < N; i++)
starpu_task_insert(&compute, STARPU_R, input[i], STARPU_SCRATCH, workspace, STARPU_W, output[i], 0);

StarPU will make sure that the buffer is allocated before executing the task, and make this allocation per-worker: for CPU workers, notably, each worker has its own buffer. This means that each task submitted above will actually have its own workspace, which will actually be the same for all tasks running one after the other on the same worker. Also, if for instance memory becomes scarce, StarPU will notice that it can free such buffers easily, since the content does not matter.

The example examples/pi uses scratch data for a temporary buffer.

The Multiformat Interface

It may be interesting to represent the same piece of data using two different data structures: one only used on CPUs, and one only used on GPUs. This can be done by using the multiformat interface. StarPU will be able to convert data from one data structure to the other when needed. Note that the scheduler dmda is the only one optimized for this interface. The user must provide StarPU with conversion codelets:

#define NX 1024
struct point array_of_structs[NX];
starpu_data_handle_t handle;

/*
 * The conversion of a piece of data is itself a task, though it is created,
 * submitted and destroyed by StarPU internals and not by the user. Therefore,
 * we have to define two codelets.
 * Note that for now the conversion from the CPU format to the GPU format has to
 * be executed on the GPU, and the conversion from the GPU to the CPU has to be
 * executed on the CPU.
 */
void cpu_to_opencl_opencl_func(void *buffers[], void *args);
struct starpu_codelet cpu_to_opencl_cl =
{
    .opencl_funcs = { cpu_to_opencl_opencl_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};

void opencl_to_cpu_func(void *buffers[], void *args);
struct starpu_codelet opencl_to_cpu_cl =
{
    .cpu_funcs = { opencl_to_cpu_func },
    .cpu_funcs_name = { "opencl_to_cpu_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};

struct starpu_multiformat_data_interface_ops format_ops =
{
    .opencl_elemsize = 2 * sizeof(float),
    .cpu_to_opencl_cl = &cpu_to_opencl_cl,
    .opencl_to_cpu_cl = &opencl_to_cpu_cl,
    .cpu_elemsize = 2 * sizeof(float),
};

starpu_multiformat_data_register(&handle, STARPU_MAIN_RAM, &array_of_structs, NX, &format_ops);

Kernels can be written almost as for any other interface. Note that STARPU_MULTIFORMAT_GET_CPU_PTR shall only be used for CPU kernels. CUDA kernels must use STARPU_MULTIFORMAT_GET_CUDA_PTR, and OpenCL kernels must use STARPU_MULTIFORMAT_GET_OPENCL_PTR. STARPU_MULTIFORMAT_GET_NX may be used in any kind of kernel.

static void
multiformat_scal_cpu_func(void *buffers[], void *args)
{
    struct point *aos;
    unsigned int n;

    aos = STARPU_MULTIFORMAT_GET_CPU_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}

extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
{
    unsigned int n;
    struct struct_of_arrays *soa;

    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}

A full example may be found in examples/basic_examples/multiformat.c.

Defining A New Data Interface

This section shows an example of how to define your own data interface, for when the interfaces provided by StarPU do not fit your needs. We take here a simple example of an array of complex numbers represented by two arrays of double values.

Let's thus define a new data interface to manage arrays of complex numbers:

/* interface for complex numbers */
struct starpu_complex_interface
{
    double *real;
    double *imaginary;
    int nx;
};
This structure stores enough information to describe one buffer of this kind of data. One instance is used for the buffer stored in main memory, another instance for the buffer stored in a GPU, etc. A data handle is thus a collection of such structures, one per buffer on each memory node.

Note: one should not take pointers into such structures, because StarPU needs to be able to copy their content over to various places, for instance to efficiently migrate a data buffer from one data handle to another.

Registering such data with StarPU is easily done using the function starpu_data_register(). Its last parameter, interface_complex_ops, will be described below.

void starpu_complex_data_register(starpu_data_handle_t *handleptr,
        unsigned home_node, double *real, double *imaginary, int nx)
{
    struct starpu_complex_interface complex =
    {
        .real = real,
        .imaginary = imaginary,
        .nx = nx
    };

    if (interface_complex_ops.interfaceid == STARPU_UNKNOWN_INTERFACE_ID)
        interface_complex_ops.interfaceid = starpu_data_interface_get_next_id();

    starpu_data_register(handleptr, home_node, &complex, &interface_complex_ops);
}

The struct starpu_complex_interface variable complex is here used just to store the parameters that the user provided to starpu_complex_data_register(). starpu_data_register() will first allocate the handle, and then pass the starpu_complex_interface structure to the starpu_data_interface_ops::register_data_handle method, which records them within the data handle:

static void complex_register_data_handle(starpu_data_handle_t handle, unsigned home_node, void *data_interface)
{
    struct starpu_complex_interface *complex_interface = (struct starpu_complex_interface *) data_interface;

    unsigned node;
    for (node = 0; node < STARPU_MAXNODES; node++)
    {
        struct starpu_complex_interface *local_interface = (struct starpu_complex_interface *)
            starpu_data_get_interface_on_node(handle, node);

        local_interface->nx = complex_interface->nx;
        if (node == home_node)
        {
            local_interface->real = complex_interface->real;
            local_interface->imaginary = complex_interface->imaginary;
        }
        else
        {
            local_interface->real = NULL;
            local_interface->imaginary = NULL;
        }
    }
}

If the application provided a home node, the corresponding pointers will be recorded for that node. Others have no buffer allocated yet.

Different operations need to be defined for a data interface through the type starpu_data_interface_ops. We only define here the basic operations needed to run simple applications. The source code for the different functions can be found in the file examples/interface/complex_interface.c, and the details of the hooks to be provided are documented in starpu_data_interface_ops.

static struct starpu_data_interface_ops interface_complex_ops =
{
    .register_data_handle = complex_register_data_handle,
    .allocate_data_on_node = complex_allocate_data_on_node,
    .copy_methods = &complex_copy_methods,
    .get_size = complex_get_size,
    .footprint = complex_footprint,
    .interface_size = sizeof(struct starpu_complex_interface),
};

Functions need to be defined to access the different fields of the complex interface from a StarPU data handle.

double *starpu_complex_get_real(starpu_data_handle_t handle)
{
    struct starpu_complex_interface *complex_interface =
        (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, STARPU_MAIN_RAM);
    return complex_interface->real;
}

double *starpu_complex_get_imaginary(starpu_data_handle_t handle);
int starpu_complex_get_nx(starpu_data_handle_t handle);

Similar functions need to be defined to access the different fields of the complex interface from a void * pointer, to be used within codelet implementations.

#define STARPU_COMPLEX_GET_REAL(interface) (((struct starpu_complex_interface *)(interface))->real)
#define STARPU_COMPLEX_GET_IMAGINARY(interface) (((struct starpu_complex_interface *)(interface))->imaginary)
#define STARPU_COMPLEX_GET_NX(interface) (((struct starpu_complex_interface *)(interface))->nx)

Complex data can then be registered to StarPU:

double real = 45.0;
double imaginary = 12.0;
starpu_data_handle_t handle1;

starpu_complex_data_register(&handle1, STARPU_MAIN_RAM, &real, &imaginary, 1);
starpu_task_insert(&cl_display, STARPU_R, handle1, 0);

and used by codelets.

void display_complex_codelet(void *descr[], void *_args)
{
    int nx = STARPU_COMPLEX_GET_NX(descr[0]);
    double *real = STARPU_COMPLEX_GET_REAL(descr[0]);
    double *imaginary = STARPU_COMPLEX_GET_IMAGINARY(descr[0]);
    int i;

    for (i = 0; i < nx; i++)
        fprintf(stderr, "Complex[%d] = %3.2f + %3.2f i\n", i, real[i], imaginary[i]);
}

The whole code for this complex data interface is available in the directory examples/interface/.

Specifying A Target Node For Task Data

When executing a task on a GPU, for instance, StarPU would normally copy all the data needed by the task to the embedded memory of the GPU. It may however happen that the task kernel would rather have some of the data kept in main memory instead of copied to the GPU, a pivoting vector for instance. This can be achieved by setting the starpu_codelet::specific_nodes flag to 1, and then filling the starpu_codelet::nodes array (or starpu_codelet::dyn_nodes when starpu_codelet::nbuffers is greater than STARPU_NMAXBUFS) with the node numbers where data should be copied to, or STARPU_SPECIFIC_NODE_LOCAL to let StarPU copy it to the memory node where the task will be executed.

STARPU_SPECIFIC_NODE_CPU can also be used to request data to be put in CPU-accessible memory (and let StarPU choose the NUMA node). STARPU_SPECIFIC_NODE_FAST and STARPU_SPECIFIC_NODE_SLOW can also be used, to request data to be put in fast (but usually size-limited) or slow (but usually large) memory, respectively.

For instance, with the following codelet:

struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_RW},
    .specific_nodes = 1,
    .nodes = {STARPU_SPECIFIC_NODE_CPU, STARPU_SPECIFIC_NODE_LOCAL},
};
the first data of the task will be kept in the CPU memory, while the second data will be copied to the CUDA GPU as usual. A working example is available in tests/datawizard/specific_node.c.

With the following codelet:

struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_RW},
    .specific_nodes = 1,
    .nodes = {STARPU_SPECIFIC_NODE_FAST, STARPU_SPECIFIC_NODE_SLOW},
};
The first data will be copied into fast (but probably size-limited) local memory, while the second data will be left in slow (but large) memory. This makes sense when the kernel does not make many accesses to the second data, so that having it remote (e.g. accessed over a PCI bus) is not a performance problem, and it avoids filling the limited fast memory with data which does not need it.