StarPU Handbook
Tasks In StarPU

Task Granularity

Like any other runtime system, StarPU has some overhead for managing tasks. Since it performs smart scheduling and data management, that overhead is not always negligible: its order of magnitude is typically a couple of microseconds, which is actually considerably smaller than the CUDA launch overhead itself. The amount of work a task performs should thus be large enough for this overhead to become negligible. The offline performance feedback can provide a measure of task length, which should be checked if poor performance is observed. To get an idea of the scalability that can be expected for a given task size, one can run tests/microbenchs/tasks_size_overhead.sh, which plots the speedup of independent tasks of very small sizes.

The choice of scheduler also has an impact on the overhead: for instance, the scheduler dmda takes time to make a decision, while eager does not. tasks_size_overhead.sh can again be used to measure how much impact this has on the target machine.
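The scheduler can be switched without recompiling, through the environment variable STARPU_SCHED; for instance (./my_app is only a placeholder for an application binary):

```shell
# Compare the task-management overhead of two schedulers on the same run;
# ./my_app stands for any StarPU application binary.
STARPU_SCHED=eager ./my_app
STARPU_SCHED=dmda  ./my_app
```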

Task Submission

To let StarPU make online optimizations, tasks should be submitted asynchronously as much as possible. Ideally, all tasks should be submitted up front, with mere calls to starpu_task_wait_for_all() or starpu_data_unregister() to wait for termination. StarPU will then be able to rework the whole schedule, overlap computation with communication, manage accelerator local memory usage, etc.
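A minimal sketch of this pattern (the codelet cl, the registered handles array and NTASKS are assumed to be defined elsewhere):

```c
/* Submit everything asynchronously; StarPU reorders and overlaps as it sees fit. */
unsigned i;
for (i = 0; i < NTASKS; i++)
{
	struct starpu_task *task = starpu_task_create();
	task->cl = &cl;
	task->handles[0] = handles[i];
	/* task->synchronous is 0 by default: submission returns immediately */
	starpu_task_submit(task);
}
/* Only wait at the very end */
starpu_task_wait_for_all();
```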

Task Priorities

By default, StarPU will consider the tasks in the order they are submitted by the application. If the application programmer knows that some tasks should be performed in priority (for instance because their output is needed by many other tasks and may thus be a bottleneck if not executed early enough), the field starpu_task::priority should be set to transmit the priority information to StarPU.
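For instance, a sketch of raising a task's priority (the codelet cl is an assumption; priority-aware schedulers honor this field):

```c
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
/* Valid values range from STARPU_MIN_PRIO to STARPU_MAX_PRIO;
 * higher means "schedule earlier". */
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);
```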

Setting Many Data Handles For a Task

The maximum number of data buffers a task can manage is fixed by the macro STARPU_NMAXBUFS, whose default value can be changed through the configure option --enable-maxbuffers.

However, it is possible to define tasks managing more data by using the field starpu_task::dyn_handles when defining a task and the field starpu_codelet::dyn_modes when defining the corresponding codelet.

enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
	STARPU_RW /* , ... one access mode per handle ... */
};
struct starpu_codelet dummy_big_cl =
{
	.cuda_funcs = { dummy_big_kernel },
	.opencl_funcs = { dummy_big_kernel },
	.cpu_funcs = { dummy_big_kernel },
	.cpu_funcs_name = { "dummy_big_kernel" },
	.nbuffers = STARPU_NMAXBUFS+1,
	.dyn_modes = modes
};

struct starpu_task *task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
	task->dyn_handles[i] = handle;
}
starpu_task_submit(task);

The same can be done with starpu_task_insert() and STARPU_DATA_ARRAY:

starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
	handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
                   STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
                   STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
                   0);

The whole code for this example is available in the file examples/basic_examples/dynamic_handles.c.

Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is fixed in the starpu_codelet::nbuffers codelet field. This field can however be set to STARPU_VARIABLE_NBUFFERS, in which case the starpu_task::nbuffers task field must be set, and the starpu_task::modes field (or starpu_task::dyn_modes field, see Setting Many Data Handles For a Task) should be used to specify the modes for the handles.
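For instance, a sketch of a codelet accepting a per-task number of handles (variable_kernel and the handles h0, h1, h2 are assumptions):

```c
struct starpu_codelet variable_cl =
{
	.cpu_funcs = { variable_kernel },
	/* The number of buffers is chosen per task, not in the codelet */
	.nbuffers = STARPU_VARIABLE_NBUFFERS,
};

struct starpu_task *task = starpu_task_create();
task->cl = &variable_cl;
task->nbuffers = 3;          /* this particular task takes 3 handles */
task->handles[0] = h0; task->modes[0] = STARPU_R;
task->handles[1] = h1; task->modes[1] = STARPU_R;
task->handles[2] = h2; task->modes[2] = STARPU_RW;
starpu_task_submit(task);
```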

Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of device and let StarPU choose which one to run. As an example, we will show how to use SSE to scale a vector. The codelet can be written as follows:

#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
	float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
	unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
	unsigned int n_iterations = n/4;
	if (n % 4 != 0)
		n_iterations++;

	__m128 *VECTOR = (__m128*) vector;
	__m128 factor __attribute__((aligned(16)));
	factor = _mm_set1_ps(*(float *) cl_arg);

	unsigned int i;
	for (i = 0; i < n_iterations; i++)
		VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}

struct starpu_codelet cl =
{
	.cpu_funcs = { scal_cpu_func, scal_sse_func },
	.cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};

Schedulers which are multi-implementation aware (only dmda and pheft for now) will use the performance models of all the implementations they were given, and pick the one that seems to be the fastest.

Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA devices do not support double floating point precision, and thus the kernel execution would just fail; or the device may not have enough shared memory for the implementation being used. The field starpu_codelet::can_execute makes it possible to express this. For instance:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
	const struct cudaDeviceProp *props;
	if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
		return 1;
	/* CUDA device */
	props = starpu_cuda_get_device_properties(workerid);
	if (props->major >= 2 || props->minor >= 3)
		/* At least compute capability 1.3, supports doubles */
		return 1;
	/* Old card, does not support doubles */
	return 0;
}

struct starpu_codelet cl =
{
	.can_execute = can_execute,
	.cpu_funcs = { cpu_func },
	.cpu_funcs_name = { "cpu_func" },
	.cuda_funcs = { gpu_func },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};

This can be essential e.g. when running on a machine which mixes various models of CUDA devices, to take benefit from the new models without crashing on old models.

Note: the function starpu_codelet::can_execute is called by the scheduler each time it tries to match a task with a worker, and should thus be very fast. The function starpu_cuda_get_device_properties() provides a quick access to CUDA properties of CUDA devices to achieve such efficiency.

Another example is to compile CUDA code for various compute capabilities, resulting with two CUDA functions, e.g. scal_gpu_13 for compute capability 1.3, and scal_gpu_20 for compute capability 2.0. Both functions can be provided to StarPU by using starpu_codelet::cuda_funcs, and starpu_codelet::can_execute can then be used to rule out the scal_gpu_20 variant on a CUDA device which will not be able to execute it:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
	const struct cudaDeviceProp *props;
	if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
		return 1;
	/* CUDA device */
	if (nimpl == 0)
		/* Trying to execute the 1.3 capability variant, we assume it is ok in all cases. */
		return 1;
	/* Trying to execute the 2.0 capability variant, check that the card can do it. */
	props = starpu_cuda_get_device_properties(workerid);
	if (props->major >= 2)
		/* At least compute capability 2.0, can run it */
		return 1;
	/* Old card, does not support 2.0, will not be able to execute the 2.0 variant. */
	return 0;
}

struct starpu_codelet cl =
{
	.can_execute = can_execute,
	.cpu_funcs = { cpu_func },
	.cpu_funcs_name = { "cpu_func" },
	.cuda_funcs = { scal_gpu_13, scal_gpu_20 },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};

Another example is having specialized implementations for some given common sizes, for instance here we have a specialized implementation for 1024x1024 matrices:

static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
	if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
		return 1;
	/* CUDA device */
	switch (nimpl)
	{
	case 0:
		/* The generic variant works in all cases. */
		return 1;
	case 1:
	{
		/* The size == 1024 specific variant: check the matrix dimensions. */
		struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
		return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
	}
	}
	return 0;
}

struct starpu_codelet cl =
{
	.can_execute = can_execute,
	.cpu_funcs = { cpu_func },
	.cpu_funcs_name = { "cpu_func" },
	.cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};

Note: the most generic variant should be provided first, as some schedulers are not able to try the different variants.

Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease the creation and submission of tasks.

Here is the implementation of the codelet:

void func_cpu(void *descr[], void *_args)
{
	int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
	float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
	int ifactor;
	float ffactor;

	starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
	*x0 = *x0 * ifactor;
	*x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
	.cpu_funcs = { func_cpu },
	.cpu_funcs_name = { "func_cpu" },
	.nbuffers = 2,
	.modes = { STARPU_RW, STARPU_RW }
};

And the call to the function starpu_task_insert():

starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);

The call to starpu_task_insert() is equivalent to the following code:

struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);

Here is a similar call using STARPU_DATA_ARRAY:

starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);

If some part of the task insertion depends on the value of some computation, the macro STARPU_DATA_ACQUIRE_CB can be very convenient. For instance, assuming that the index variable i was registered as handle i_handle:

/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task, once the value of i can be read */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
	starpu_task_insert(&cl,
	                   STARPU_RW, A_handle[i],
	                   0));

The macro STARPU_DATA_ACQUIRE_CB submits an asynchronous request for acquiring data i for the main application, and will execute the code given as third parameter when it is acquired. In other words, as soon as the value of i computed by the codelet which_index can be read, the portion of code passed as third parameter of STARPU_DATA_ACQUIRE_CB will be executed, and is allowed to read from i to use it e.g. as an index. Note that this macro is only available when compiling StarPU with the compiler gcc.

There are several ways of calling the function starpu_codelet_unpack_args().

void func_cpu(void *descr[], void *_args)
{
	int ifactor;
	float ffactor;

	/* Unpack all the arguments at once */
	starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

void func_cpu(void *descr[], void *_args)
{
	int ifactor;
	float ffactor;

	/* Unpack the arguments in several calls; a 0 stops the unpacking */
	starpu_codelet_unpack_args(_args, &ifactor, 0);
	starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

void func_cpu(void *descr[], void *_args)
{
	int ifactor;
	float ffactor;
	char buffer[100];

	/* Unpack the first argument and copy the remaining ones into buffer */
	starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
	starpu_codelet_unpack_args(buffer, &ffactor);
}

Getting Task Children

It may be interesting to get the list of tasks which depend on a given task, notably when using implicit dependencies, since this list is computed by StarPU. starpu_task_get_task_succs() provides it. For instance:

struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
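starpu_task_get_task_succs() returns the number of successors actually stored, so the array can then be walked; a sketch (setting the starpu_task::name field is an assumption made for the sake of the printout):

```c
struct starpu_task *succs[4];
int nsuccs = starpu_task_get_task_succs(task, sizeof(succs)/sizeof(*succs), succs);
int i;
for (i = 0; i < nsuccs; i++)
	fprintf(stderr, "successor %d: %s\n", i,
	        succs[i]->name ? succs[i]->name : "(unnamed)");
```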

Parallel Tasks

StarPU can leverage existing parallel computation libraries by means of parallel tasks. A parallel task is a task worked on by a set of CPUs (called a parallel or combined worker) at the same time, using an existing parallel CPU implementation of the computation to be achieved. This can also be useful to improve the load balance between slow CPUs and fast GPUs: since CPUs work collectively on a single task, the completion time of tasks on CPUs becomes comparable to the completion time on GPUs, thus relieving granularity discrepancy concerns. hwloc support needs to be enabled to get good performance; otherwise StarPU will not know how to group cores appropriately.

Two modes of execution exist, to accommodate existing usage.

Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one of the CPUs of the combined worker. The codelet function can use starpu_combined_worker_get_size() to get the number of threads it is allowed to start to achieve the computation. The CPU binding mask for the whole set of CPUs is already enforced, so that threads created by the function will inherit the mask, and thus execute where StarPU expected, the OS being in charge of choosing how to schedule threads on the corresponding CPUs. The application can also choose to bind threads by hand, using e.g. sched_getaffinity to know the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in examples/openmp/vector_scal.c):

void scal_cpu_func(void *buffers[], void *_args)
{
	unsigned i;
	float *factor = _args;
	struct starpu_vector_interface *vector = buffers[0];
	unsigned n = STARPU_VECTOR_GET_NX(vector);
	float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
	for (i = 0; i < n; i++)
		val[i] *= *factor;
}

static struct starpu_codelet cl =
{
	.modes = { STARPU_RW },
	.where = STARPU_CPU,
	.type = STARPU_FORKJOIN,
	.max_parallelism = INT_MAX,
	.cpu_funcs = { scal_cpu_func },
	.cpu_funcs_name = { "scal_cpu_func" },
	.nbuffers = 1,
};

Other examples include for instance calling a BLAS parallel CPU implementation (see examples/mult/xgemm.c).

SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on each CPU of the combined worker. The codelet function can use starpu_combined_worker_get_size() to get the total number of CPUs involved in the combined worker, and thus the number of calls that are made in parallel to the function, and starpu_combined_worker_get_rank() to get the rank of the current CPU within the combined worker. For instance:

static void func(void *buffers[], void *_args)
{
	unsigned i;
	float *factor = _args;
	struct starpu_vector_interface *vector = buffers[0];
	unsigned n = STARPU_VECTOR_GET_NX(vector);
	float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

	/* Compute the slice this CPU has to work on */
	unsigned m = starpu_combined_worker_get_size();
	unsigned j = starpu_combined_worker_get_rank();
	unsigned slice = (n+m-1)/m;

	for (i = j * slice; i < (j+1) * slice && i < n; i++)
		val[i] *= *factor;
}

static struct starpu_codelet cl =
{
	.modes = { STARPU_RW },
	.type = STARPU_SPMD,
	.max_parallelism = INT_MAX,
	.cpu_funcs = { func },
	.cpu_funcs_name = { "func" },
	.nbuffers = 1,
};

Of course, this trivial example will not really benefit from parallel task execution, and is only meant to be simple to understand. The benefit comes when the computation to be done is such that threads have to e.g. exchange intermediate results, or write to the same buffer in a complex but safe way.

Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to be used. When exposed to codelets with the flag STARPU_FORKJOIN or STARPU_SPMD, the schedulers pheft (parallel-heft) and peager (parallel eager) will indeed also try to execute tasks with several CPUs. They will automatically try the various available combined worker sizes (making several measurements for each worker size) and can thus avoid choosing a large combined worker if the codelet does not actually scale that well.

Combined Workers

By default, StarPU creates combined workers according to the architecture structure as detected by hwloc: for each object of the hwloc topology (NUMA node, socket, cache, ...) a combined worker is created. If some nodes of the hierarchy have a large arity (e.g. many cores in a socket without a hierarchy of shared caches), StarPU will create combined workers of intermediate sizes. The environment variable STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER permits tuning the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the tool starpu_machine_display (the environment variable STARPU_SCHED has to be set to a combined worker-aware scheduler such as pheft or peager).
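For instance (the output is machine-dependent; this is only a usage sketch):

```shell
# Show workers, including the synthesized combined workers,
# as seen by a parallel-task-aware scheduler
STARPU_SCHED=pheft starpu_machine_display
```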

Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent calls.

For instance, most OpenMP implementations (including the main ones) do not support concurrent pragma omp parallel statements without nesting them in another pragma omp parallel statement, but StarPU does not yet support creating its CPU workers by using such pragma.

Other parallel libraries are also not safe when being invoked concurrently from different threads, due to the use of global variables in their sequential sections for instance.

The solution is then to use only one combined worker at a time. This can be done by setting the field starpu_conf::single_combined_worker to 1, or by setting the environment variable STARPU_SINGLE_COMBINED_WORKER to 1. StarPU will then run only one parallel task at a time (but other CPU and GPU tasks are not affected and can run concurrently). The parallel task scheduler will however still try varying combined worker sizes to look for the most efficient ones.
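A sketch of the programmatic variant:

```c
struct starpu_conf conf;
starpu_conf_init(&conf);
/* Run at most one parallel (combined-worker) task at a time;
 * sequential CPU and GPU tasks still run concurrently. */
conf.single_combined_worker = 1;
int ret = starpu_init(&conf);
```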

Synchronization Tasks

For the application's convenience, it may be useful to define tasks which do not actually perform any computation, but merely bear dependencies between other tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make the kernel functions empty, but such a task will still have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its starpu_task::cl field to NULL. The task will thus be a mere synchronization point, without any data access or execution content: as soon as its dependencies become available, it will terminate, call the callbacks, and release dependencies.
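A sketch of such an empty task, assuming taskA and taskB were created earlier:

```c
struct starpu_task *sync_task = starpu_task_create();
sync_task->cl = NULL; /* no computation: a pure synchronization point */

/* Terminates (and triggers its callback) once taskA and taskB are done */
struct starpu_task *deps[] = { taskA, taskB };
starpu_task_declare_deps_array(sync_task, 2, deps);
starpu_task_submit(sync_task);
```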

An intermediate solution is to define a codelet with its starpu_codelet::where field set to STARPU_NOWHERE, for instance:

struct starpu_codelet cl =
{
	.where = STARPU_NOWHERE,
	.nbuffers = 1,
	.modes = { STARPU_R },
};

struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;

will create a task which simply waits for the value of handle to be available for read. This task can then be depended on, etc.