HPX and C++ Parallel Algorithms

In Lenexa (May 2015), the C++ standardization committee finalized the work related to the Technical Specification for C++ Extensions for Parallelism (the latest document at the time of this writing is N4507). This document describes parallel algorithms which extend and complement the (sequential) standard library algorithms we have all loved and used for over a decade now. This is an important – albeit only a first – step towards standardizing higher-level abstractions for parallelism and concurrency in C++.

In a nutshell, the parallel algorithms are very similar to their sequential counterparts (in name, expected arguments, and guarantees) with the addition of a new first argument: the execution policy. The table below lists all of the parallel algorithms. Note that not all sequential algorithms have a parallel counterpart.

(Table: parallel algorithms)

An object of an execution policy type indicates the kinds of parallelism allowed in the execution of the algorithm and expresses the consequent requirements on the element access functions. The technical specification (TS) defines three execution policy types: the sequential, parallel, and parallel-vector execution policies (a short usage sketch follows the list below):

  • The class sequential_execution_policy requires that a parallel algorithm’s execution may not be parallelized.
  • The class parallel_execution_policy indicates that a parallel algorithm’s execution may be parallelized.
  • The class parallel_vector_execution_policy indicates that a parallel algorithm’s execution may be vectorized and parallelized.
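
To make this concrete, here is a minimal sketch of how the corresponding predefined policy objects (seq, par, and par_vec, as defined by the TS; HPX exposes them in the hpx::parallel namespace) select the execution behavior at the call site. The lambda is purely illustrative work:

    std::vector<int> v = { ... };
    auto triple = [](int& i) { i *= 3; };

    // must not be parallelized, runs on the calling thread
    hpx::parallel::for_each(hpx::parallel::seq, v.begin(), v.end(), triple);

    // may be parallelized across worker threads
    hpx::parallel::for_each(hpx::parallel::par, v.begin(), v.end(), triple);

    // may be both parallelized and vectorized
    hpx::parallel::for_each(hpx::parallel::par_vec, v.begin(), v.end(), triple);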

In HPX, we take conformance to the C++ standard (and related documents) very seriously. For this reason we started implementing these parallel algorithms over a year ago and as of today have implemented almost 80% of them (for a document showing our progress, please see the corresponding ticket). For us, the implementation of parallel algorithms is very important. They provide higher level and general purpose abstractions on top of the basic parallelization constructs exposed by HPX (such as async and dataflow). Our experience with the implemented algorithms is very positive. We were able to show yet again the benefits of being able to build on top of our very fine-grained parallelization infrastructure. I will show numbers below which highlight some of this.

As usual, we not only implement features conforming to the standard documents, but also try to extend those facilities to support the specifics of HPX. For the parallel algorithms we decided to add three (in our opinion very important) things:

  • Full support of executors as proposed in the paper Parallel Algorithms Need Executors (N4406); see the sketch after this list. We have already written about this here: HPX and C++ Executors.
  • Enable the use of the parallel algorithms in distributed scenarios. We will talk about this in more detail in one of our future posts.
  • Implement additional execution policies which enable asynchronous execution of parallel algorithms.
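
Here is the sketch referenced in the first bullet above: executor support lets us bind an algorithm to a particular executor by rebinding the policy with on(). This is only a minimal sketch; the executor type name parallel_executor and its namespace reflect HPX at the time of this writing and may differ in later versions:

    std::vector<std::size_t> data = { ... };

    // an executor scheduling work on the HPX thread pool (name as of this writing)
    hpx::parallel::parallel_executor exec;

    hpx::parallel::for_each(
        hpx::parallel::par.on(exec),      // rebind the policy to the executor
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            do_some_work(val);            // any per-element work (illustrative)
        });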

In this blog post I focus on the last of these points. We already mentioned our approach to asynchrony for constructs implementing fork-join parallelism in a previous post: HPX and C++ Task Blocks. There we introduced new asynchronous execution policies, one for each of the standardized policies. For instance, for the pre-defined parallel execution policy par we added a corresponding asynchronous execution policy generated by par(task). Using one of those asynchronous execution policies with any of the parallel algorithms makes its execution fully asynchronous. In this case the algorithm immediately returns an instance of a future representing the end of the execution of the algorithm; it does not wait for the algorithm to finish running before returning. This is fundamentally different from the conventional, sequential way algorithms are normally executed.

Let’s have a look at some examples. The first code snippet demonstrates the application of the parallel for_each algorithm for a simple use case:

    std::vector<std::size_t> data = { ... };
    hpx::parallel::for_each(
        hpx::parallel::par,
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            do_some_work(val);
        });

By the way, this code is also a very nice demonstration of how the new parallel algorithms provide a viable alternative to classical parallelization constructs such as those provided by OpenMP. The new algorithms are fully generic and perfectly integrated with C++ and its type system, which allows for better optimizations. However, this example still has the same disadvantage as an equivalent OpenMP loop: it is a form of fork-join parallelism, which is well known to introduce a forced barrier into the overall execution flow at the end of the loop.

Luckily, the mentioned asynchronous execution policies help mitigate this disadvantage. Here is an equivalent asynchronous version of the code snippet above:

    std::vector<std::size_t> data = { ... };
    hpx::future<void> f = hpx::parallel::for_each(
        hpx::parallel::par(task),
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            do_some_work(val);
        });

    // more parallel work can be performed here
    do_other_work();

    // wait for the for_each to be finished
    f.wait();

If the function do_other_work() schedules more parallel work, the runtime system will be able to avoid resource starvation effects caused by the fork-join barrier at the end of the execution of the parallel algorithm.

On a side note, we also implemented an asynchronous execution policy which still forces sequential execution: seq(task). In this case, the algorithm is executed sequentially, but the call immediately returns a future representing the algorithm’s eventual completion. This is an important facility as it is not always possible to enable full parallelism, for instance for pre-existing code.
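
Here is a minimal sketch of this policy in use; process_legacy_data is a hypothetical function which must not be called concurrently:

    std::vector<std::size_t> data = { ... };

    // executed element by element in order, but the call returns immediately
    hpx::future<void> f = hpx::parallel::for_each(
        hpx::parallel::seq(task),
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            process_legacy_data(val);   // hypothetical, not safe to parallelize
        });

    // the future becomes ready once all elements have been processed
    f.wait();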

There is another big advantage in letting the algorithm return a future representing the completion of its overall execution. The returned future allows integration of the asynchronously executed algorithm with any of the other possible modes of asynchrony exposed by HPX. The future returned by the algorithm is indistinguishable from any other future returned by other (possibly user-defined) functions. For this reason, it can be used together with any of those other futures to orchestrate the overall asynchronous execution flow in an application. In addition, that makes algorithms usable with the proposed await facility (see our post HPX and C++ Dataflow for more information).
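
As a small sketch of such composition, the returned future can be combined with other futures using when_all and then (analyze_data and post_process are hypothetical user functions):

    // start the algorithm asynchronously
    hpx::future<void> f1 = hpx::parallel::for_each(
        hpx::parallel::par(task),
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            do_some_work(val);
        });

    // some unrelated asynchronous work
    hpx::future<int> f2 = hpx::async(analyze_data);

    // run post_process once both of them have finished
    hpx::future<void> f3 = hpx::when_all(f1, f2).then(
        [](auto&&) { post_process(); });

    f3.wait();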

Now for some numbers. The results presented here are taken from Grant Mercer’s talk at C++Now 2015 (the videos are not available yet), where he presented his work on implementing some of the parallel algorithms for HPX.

Grant took a simple parallel for_each (very similar to the code shown above) and performed a constant amount of artificial work in each iteration. The graph below shows the results for running the parallel for_each on 16 cores for a vector of 1,000,000 elements, where each iteration artificially delays execution for 1μs (1,000ns). The numbers shown are the result of averaging the runtime for 100 runs. The scaling shown is the speedup over the equivalent sequential execution (larger numbers are better).

(Figure: scaling of the parallel for_each over grain size, 16 cores)

The graph shows three results: for the sequential runs (seq), for runs using the parallel execution policy (par), and for runs using the asynchronous parallel execution policy (marked as task). The grain size in this graph is the number of iterations executed on one HPX thread. Please note that in the case of task the test additionally delayed waiting for the future returned from the algorithm by one test iteration:

    std::vector<std::size_t> data = { ... };
    hpx::future<void> outer;
    for (int i = 0; i != 100; ++i)
    {
        // launch the next algorithm invocation without waiting for it
        hpx::future<void> f = hpx::parallel::for_each(
            hpx::parallel::par(task),
            std::begin(data), std::end(data),
            [](std::size_t val)
            {
                busy_wait_for_1000ns();
            });

        // wait for the previous invocation only after the next one has been
        // started, keeping two invocations in flight at any time
        if (i != 0)
            outer.wait();
        outer = std::move(f);
    }
    outer.wait();

As expected, for small grain sizes (left end of the graph, every iteration runs on a separate HPX thread), the scaling (i.e. benefit from running in parallel) is very low because of the overheads imposed by the management of the threads. We know from other measurements that one HPX thread creates overheads in the order of ~700ns (this includes the full life cycle of the thread: creation, scheduling, execution, and deletion). That means that for very small grain sizes the thread management overhead is comparable to the actual work performed, which kills our performance.

For very large grain sizes (right end of the graph, all of the iterations are executed on the same HPX thread) we see the expected result as well: the time required to run the tests is equivalent to the sequential execution time. The performance suffers from starvation effects caused by under-utilizing the available compute resources.

The graph, however, shows that (coming from the left) the achievable scaling increases very quickly and reaches its maximum at a grain size of around 100-200 iterations per HPX thread (please note that the x-axis in this graph is logarithmic, which might be misleading). This also means that in the optimal case we still create 5-10 thousand HPX threads per algorithm invocation.

Another notable result is that the asynchronous execution (including that one test iteration of overlap) gives us much better scaling compared to ‘normal’ parallel execution. In fact, the asynchronous execution policy (task) gives us almost perfect scaling, a notable speedup of close to 16 on 16 cores over a wide range of grain sizes. This demonstrates that the negative fork-join effects mentioned above can be mitigated by: a) creating a sufficient amount of parallelism, and b) enabling its use through high-level APIs supporting asynchrony in a manageable way. Our test simply executed two staggered algorithm invocations at the same time (i.e. it had 10-20 thousand concurrently active threads), but any other form of parallelism would have the same advantageous effect.

Another conclusion from these results is that the parallel for_each we implemented provides the best possible speedup if we run 100-200μs of work per HPX thread. For this reason, our algorithms are built around an auto-partitioner which decides at runtime how many iterations will run on the same HPX thread. This auto-partitioner makes sure that at least 100μs of work are executed per HPX thread, which ensures the best possible scaling for a wide range of loop parameters. For us, this is just another example of how the HPX runtime system applies runtime-adaptive techniques to improve resource utilization and application performance.
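
If an application knows a better grain size than the auto-partitioner chooses, the partitioning can also be overridden explicitly. The following is only a sketch using HPX’s static_chunk_size executor parameter together with par.with() (names as of this writing, they may differ in newer versions); it runs 200 consecutive iterations per HPX thread:

    std::vector<std::size_t> data = { ... };

    // fix the grain size to 200 iterations per HPX thread
    hpx::parallel::static_chunk_size chunk(200);

    hpx::parallel::for_each(
        hpx::parallel::par.with(chunk),
        std::begin(data), std::end(data),
        [](std::size_t val)
        {
            do_some_work(val);
        });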

If you want to try out today the new parallelization facilities which will hopefully become available on every major compiler platform in the future, please fork the HPX repository here and let us know what you think.
