HPX and C++ Task Blocks

The quest for finding efficient, convenient, and widely usable higher-level parallelization constructs in C++ continues. In the standardization document N4411, a group of authors from Intel and Microsoft propose the task_block facility, which formalizes fork-join parallelism. From the paper (slightly edited):

The model of parallelism supported by the constructs in this paper is called strict fork-join task parallelism, which has decades of research behind it and is the form of structured parallelism supported by all of the prominent parallel languages, including Cilk, X10, Habanero, and OpenMP.

Consider an example of a parallel traversal of a tree, where a user-provided function f is applied to each node of the tree, returning the sum of the results:

template <typename Func>
int traverse(node& n, Func && f)
{
    int left = 0, right = 0;
    define_task_block(
        [&](task_block& tb) {
            if (n.left)
                tb.run([&] { left = traverse(*n.left, f); });
            if (n.right)
                tb.run([&] { right = traverse(*n.right, f); });
        });
    return f(n) + left + right;
}

The example above demonstrates the use of two of the functions proposed in N4411: define_task_block and task_block::run. The define_task_block function delineates a region in the program code potentially containing invocations of tasks spawned by the run member function of the task_block class. The run function spawns a task, a unit of work that is allowed to execute in parallel with respect to the caller. Any parallel tasks spawned by run within the define_task_block are joined back to a single thread of execution on return from define_task_block. The function run takes a user-provided function object f and starts it asynchronously, i.e. it may return before the execution of f completes. The implementation's scheduler may choose to run this function immediately or delay running it until compute resources become available.

While we understand that fork-join parallelism has certain disadvantages – it introduces a forced barrier into the computation, after all – we decided to support the proposed construct in HPX as well, since the concept of fork-join parallelism is widely known and easy to understand. However, we added our own extensions to the proposal, which a) improve the integration of our implementation of task_block with other higher-level parallel facilities, such as our parallel algorithms, and b) help to mitigate the detrimental effects of the forced barrier.

The parallel algorithms extensions to the standard library described in the Parallelism TS (N4354) introduce the concept of execution policies. These essentially allow the user to describe the concurrency guarantees for their code (e.g. that the lambda passed to an algorithm is safe to execute concurrently). In HPX, we decided to enable specifying the very same execution policies when creating a task block. In addition to the syntax shown in the example above, we now support writing something like:

template <typename Func>
int traverse(node& n, Func && f)
{
    int left = 0, right = 0;
    define_task_block(
        par, // specify an execution policy for the created task block
        [&](task_block<>& tb) {
            if (n.left)
                tb.run([&] { left = traverse(*n.left, f); });
            if (n.right)
                tb.run([&] { right = traverse(*n.right, f); });
        });
    return f(n) + left + right;
}

The main differences from the snippet shown at the beginning are the additional (optional) first parameter passed to define_task_block and the fact that task_block is now a template (where the template argument is the type of the passed-in execution policy, defaulting to parallel_execution_policy). This change is especially interesting in light of executors (see HPX and C++ Executors): our implementation of task blocks allows execution policies to be specified in combination with executors. This enables a very flexible and adaptable way of defining how and where the tasks inside a task block will be run.

The second addition to the proposal is based on our experience with implementing the parallel algorithms (see HPX and the C++ Standard). For the parallel algorithms we implemented a set of special execution policies which make the invocation of the algorithm itself asynchronous. For instance, by specifying par(task) as the execution policy, the invocation of the algorithm merely kicks off the required operation asynchronously, returning a future object which becomes ready once the algorithm execution has finished. For task blocks we decided to implement a similar extension. As you can see, making the execution of a task block asynchronous helps mitigate the effects of the forced barrier at the end of the fork-join operation:

template <typename Func>
int traverse(node& n, Func && f)
{
    int left = 0, right = 0;
    future<void> done = 
        define_task_block(
            par(task), // specify an asynchronous execution policy 
            [&](auto& tb) {
                if (n.left)
                    tb.run([&] { left = traverse(*n.left, f); });
                if (n.right)
                    tb.run([&] { right = traverse(*n.right, f); });
            });
 
    int result = f(n);  // do other stuff
    done.wait();        // wait for task block to finish
 
    return result + left + right;
}

Now other work can easily be performed while the task block is still executing – nifty! Note the use of the C++14 generic lambda syntax, which allows us to ignore the concrete type of the generated task block instance.

As always, if you want to try things out, please fork HPX from our GitHub site.

2 Comments

  1. Alex

    This is amazing! What are the chances that we’ll see this feature in C++17?

    • Hartmut Kaiser

      As said, the latest version of the proposal is N4411. I’d expect this to become part of one of the technical specifications related to concurrency and/or parallelism.
