HPX and C++ Executors

The STE||AR Group has implemented executors in HPX. As described in the C++ standardization proposal ‘Parallel Algorithms Need Executors’ (document number N4406), executors are objects that choose where and how a function call is executed. This is a step in the right direction for HPX and parallelism: executors give the programmer more flexibility over how and where task-based work is accomplished, and provide a means to compose executors cleanly with execution policies inside algorithm implementations.

Analogous to iterator_traits for iterators, the functionality of executors is exposed through executor_traits. Each executor must implement the member function async_execute, which calls a passed function and returns a future holding the result. From that one function, executor_traits can synthesize three others: a bulk overload of async_execute, which calls a function once for each element of an iterable range of inputs and, as implemented in HPX, returns a vector of futures holding the result of each call; the synchronous counterpart, execute; and the bulk overload of execute. The following example clarifies the core functions executor_traits provides:

using hpx::future;
using std::vector;

some_executor_type exec;
some_shape_type inputs;
auto fire_n_forget = [](){ /*..accomplish something..*/ };
auto f1 = [](){ /*..compute..*/ return t; };
auto f2 = [](T t_a){ /*..compute..*/ return t_b; };

typedef executor_traits<some_executor_type> traits;

// executor calls fire_n_forget function but is not required to 
// wait for the function to finish
traits::apply_execute(exec, fire_n_forget);

// executor calls f1 and returns the result
traits::execute(exec, f1);

// Bulk overload. Calls f2 for each of the inputs and returns a
// vector<T> holding the results
traits::execute(exec, f2, inputs);

// Asynchronously calls f1 and returns a future of the result
traits::async_execute(exec, f1);

// Bulk overload. Calls f2 for each of the inputs and returns a
// vector<future<T>> holding the results
traits::async_execute(exec, f2, inputs);

The implementation of executor_traits allows programmers to easily create their own executors so as to support diverse uses. By default, executor_traits forwards each function call to the executor it was passed. But since execute, the bulk overload of execute, and the bulk overload of async_execute can all be implemented in terms of async_execute, executor_traits can, if needed, synthesize the former three functions from the latter. This eases the burden on programmers implementing their own executors, an important design feature. Being able to extend executors to make decisions at runtime, or to target platform-specific mechanisms of parallelism such as GPU threads or SIMD vector units, magnifies the power of executors.
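To see how little an executor author has to write, here is a minimal sketch. The executor name below is our own illustration, assuming hpx::async and C++11 trailing return types; everything beyond async_execute is supplied by executor_traits:

struct simple_executor
{
    // The one required member: run f asynchronously and return a
    // future holding its result. apply_execute, execute and the
    // bulk overloads are then synthesized by executor_traits.
    template <typename F>
    auto async_execute(F && f) -> decltype(hpx::async(std::forward<F>(f)))
    {
        return hpx::async(std::forward<F>(f));
    }
};

// Usable through the traits class exactly as above:
// typedef executor_traits<simple_executor> traits;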

Not all of the design features we implemented are in concordance with the C++ standardization proposal; we took the liberty of slightly modifying the bulk overloads in two key ways. First, the shape parameter, where inputs is passed above, is templated in HPX as opposed to the proposed std::size_t. This allows more versatile inputs and goes hand in hand with ranges. Second, we have chosen to modify the return types. The bulk overloads of execute and async_execute return std::vector<T> and std::vector<hpx::future<T>> respectively, instead of the proposed void and std::future<void>. We believe that the benefits of asynchronous computation and the reduction of fork-join parallelism are better supported in this form.
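As a hypothetical illustration of both changes, reusing exec and traits from the example above: the templated shape parameter lets a std::vector serve as the input range, and the returned vector of futures can be composed further, for example with hpx::when_all, instead of being joined on the spot:

std::vector<int> inputs = {1, 2, 3, 4};

// One future per element, thanks to the modified return type
std::vector<future<int>> squares =
    traits::async_execute(exec, [](int i){ return i * i; }, inputs);

// Compose rather than join: ready once every element is done
auto all_done = hpx::when_all(std::move(squares));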

We are also in the process of converting our parallel algorithms to make full use of executors via execution policies. Our previous implementation of execution policies specified parallel and asynchronous computation for the algorithms defined by the Parallelism TS. While this functionality remains useful and maintained, what if the programmer wants further control over the specifics of parallel execution? To accomplish this, a .on function taking an executor has been added to our execution policies. Our parallel algorithms will thus be able to support executors, as shown below with the already modified for_each algorithm:

// Computed sequentially
for_each(seq, first, last, f);       // returns f
for_each(seq(task), first, last, f); // returns a future for f

// Can be computed in parallel or sequentially
for_each(par, first, last, f);       // returns f
for_each(par(task), first, last, f); // returns a future for f

// Can be computed in parallel or sequentially,
// as decided by the executor
for_each(par.on(an_executor),        // New: an executor is used,
    first, last, f);                 // returns f
for_each(par(task).on(an_executor),  // New: an executor is used,
    first, last, f);                 // returns a future for f

Note that the execution policies still specify what work may be done, but now they also check that the work performed by the executor is allowed. Thus seq.on(parallel_executor()) fails to compile, since it would not be sensible for a sequential execution policy to allow parallel execution, while seq.on(sequential_executor()) works fine.
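One way such a compile-time check can be expressed is sketched below. The tag and executor names are hypothetical and not HPX’s actual implementation; the point is only that an executor can advertise its execution category as a nested typedef, which a policy’s .on can inspect with std::is_same from <type_traits>:

struct sequential_execution_tag {};
struct parallel_execution_tag {};

struct sequential_executor { typedef sequential_execution_tag execution_category; };
struct parallel_executor   { typedef parallel_execution_tag   execution_category; };

// A sequential policy's .on() could verify its executor like this;
// seq.on(parallel_executor()) would then trip the static_assert
template <typename Executor>
void check_sequential(Executor const&)
{
    static_assert(
        std::is_same<typename Executor::execution_category,
                     sequential_execution_tag>::value,
        "a sequential policy cannot allow parallel execution");
}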

This is an exciting proposal for us. Not only do executors provide programmers finer control over parallel execution and work well with our preexisting algorithms, but they do so at the correct level of abstraction, allowing for reusable and maintainable code. In HPX, this will allow users to express concepts such as “run this code on neighboring machines” or “schedule this work in one NUMA domain”. Additionally, by integrating executors with the parallel algorithms and the existing execution policies, users will be able to specify abstractions such as “parallelize this algorithm this way” or “run this algorithm on that accelerator.”

If you want to try things out, please fork HPX from our GitHub repository.
