StarPU Handbook
Clustering A Machine

TODO: clarify and put more explanations, express how to create clusters using the context API.

General Ideas

Clusters are a concept introduced in this paper. This comes from a basic idea, making use of two level of parallelism in a DAG. We keep the DAG parallelism but consider on top of it that a task can contain internal parallelism. A good example is if each task in the DAG is OpenMP enabled.

The particularity of such tasks is that we will combine the power of two runtime systems: StarPU will manage the DAG parallelism and another runtime (e.g. OpenMP) will manage the internal parallelism. The challenge is in creating an interface between the two runtime systems so that StarPU can regroup cores inside a machine (creating what we call a "cluster") on top of which the parallel tasks (e.g. OpenMP tasks) will be ran in a contained fashion.

The aim of the cluster API is to facilitate this process in an automatic fashion. For this purpose, we depend on the hwloc tool to detect the machine configuration and then partition it into usable clusters.

An example of code running on clusters is available in examples/sched_ctx/parallel_tasks_with_cluster_api.c.

Let's first look at how to create one in practice, then we will detail their internals.

Creating Clusters

Partitioning a machine into clusters with the cluster API is fairly straightforward. The simplest way is to state under which machine topology level we wish to regroup all resources. This level is an HwLoc object, of the type hwloc_obj_type_t. More can be found in the hwloc documentation.

Once a cluster is created, the full machine is represented with an opaque structure starpu_cluster_machine. This can be printed to show the current machine state.

struct starpu_cluster_machine *clusters;
clusters = starpu_cluster_machine(HWLOC_OBJ_SOCKET, 0);
starpu_cluster_print(clusters);
starpu_uncluster_machine(clusters);

The following graphic is an example of what a particular machine can look like once clusterized. The main difference is that we have less worker queues and tasks which will be executed on several resources at once. The execution of these tasks will be left to the internal runtime system, represented with a dashed box around the resources.

runtime-par.png
StarPU using parallel tasks

Creating clusters as shown in the example above will create workers able to execute OpenMP code by default. The cluster API aims in allowing to parametrize the cluster creation and can take a va_list of arguments as input after the HwLoc object (always terminated by a 0 value). These can help creating clusters of a type different from OpenMP, or create a more precise partition of the machine.

Example Of Constraining OpenMP

Clusters require being able to constrain the runtime managing the internal task parallelism (internal runtime) to the resources set by StarPU. The purpose of this is to express how StarPU must communicate with the internal runtime to achieve the required cooperation. In the case of OpenMP, StarPU will provide an awake thread from the cluster to execute this liaison. It will then provide on demand the process ids of the other resources supposed to be in the region. Finally, thanks to an OpenMP region we can create the required number of threads and bind each of them on the correct region. These will then be reused each time we encounter a #pragma omp parallel in the following computations of our program.

The following graphic is an example of what an OpenMP-type cluster looks like and how it represented in StarPU. We can see that one StarPU (black) thread is awake, and we need to create on the other resources the OpenMP threads (in pink).

parallel_worker2.png
StarPU with an OpenMP cluster

Finally, the following code shows how to force OpenMP to cooperate with StarPU and create the aforementioned OpenMP threads constrained in the cluster's resources set:

void starpu_openmp_prologue(void * sched_ctx_id)
{
int sched_ctx = *(int*)sched_ctx_id;
int *cpuids = NULL;
int ncpuids = 0;
int workerid = starpu_worker_get_id();
//we can target only CPU workers
{
//grab all the ids inside the cluster
starpu_sched_ctx_get_available_cpuids(sched_ctx, &cpuids, &ncpuids);
//set the number of threads
omp_set_num_threads(ncpuids);
#pragma omp parallel
{
//bind each threads to its respective resource
starpu_sched_ctx_bind_current_thread_to_cpuid(cpuids[omp_get_thread_num()]);
}
free(cpuids);
}
return;
}

This is in fact exactly the default function used when we don't specify anything. As can be seen, we based the clusters on several tools and models present in the StarPU contexts, and merely extended them to allow to represent and carry clusters. More on contexts can be read here Scheduling Contexts.

Creating Custom Clusters

As was previously said it is possible to create clusters using another cluster type, in order to bind another internal runtime inside StarPU. This can be done with in several ways:

  • By using the currently available functions
  • By passing as argument a user defined function

Here are two examples:

struct starpu_cluster_machine *clusters;
clusters = starpu_cluster_machine(HWLOC_OBJ_SOCKET,
STARPU_CLUSTER_TYPE, STARPU_CLUSTER_GNU_OPENMP_MKL,
0);

This type of clusters is available by default only if StarPU is compiled with MKL. It uses MKL functions to set the number of threads which is more reliable when using an OpenMP implementation different from the Intel one.

void foo_func(void* foo_arg);
\\...
int foo_arg = 0;
struct starpu_cluster_machine *clusters;
clusters = starpu_cluster_machine(HWLOC_OBJ_SOCKET,
STARPU_CLUSTER_CREATE_FUNC, &foo_func,
STARPU_CLUSTER_CREATE_FUNC_ARG, &foo_arg,
0);

Clusters With Scheduling

Contexts API As previously mentioned, the cluster API is implemented on top of Scheduling Contexts. Its main addition is to ease the creation of a machine CPU partition with no overlapping by using HwLoc, whereas scheduling contexts can use any number of any resources.

It is therefore possible, but not recommended, to create clusters using the scheduling contexts API. This can be useful mostly in the most complex machine configurations where the user has to dimension precisely clusters by hand using his own algorithm.

/* the list of resources the context will manage */
int workerids[3] = {1, 3, 10};
/* indicate the list of workers assigned to it, the number of workers,
the name of the context and the scheduling policy to be used within
the context */
int id_ctx = starpu_sched_ctx_create(workerids, 3, "my_ctx", 0);
/* let StarPU know that the following tasks will be submitted to this context */
starpu_sched_ctx_set_task_context(id);
task->prologue_callback_pop_func=runtime_interface_function_here;
/* submit the task to StarPU */

As this example illustrates, creating a context without scheduling policy will create a cluster. The important change is that the user will have to specify an interface function between the two runtimes he plans to use. This can be done in the prologue_callback_pop_func field of the task. Such a function can be similar to the OpenMP thread team creation one.

Note that the OpenMP mode is the default one both for clusters and contexts. The result of a cluster creation is a woken up master worker and sleeping "slaves" which allow the master to run tasks on their resources. To create a cluster with woken up workers one can use the flag STARPU_SCHED_CTX_AWAKE_WORKERS with the scheduling context API and STARPU_CLUSTER_AWAKE_WORKERS with the cluster API as parameter to the creation function.