HPX and Index-based C++ Parallel Loops

In Jacksonville at the winter 2016 C++ standardization meeting, Intel presented the second revision of their standardization proposal targeting index-based parallel for-loops (the latest document at the time of this writing is P0075R1). This document describes a couple of basic parallel for-loop constructs which complement the existing parallel algorithms which we have already implemented for quite some time in HPX. The following is taken from this document:

The looping construct in the existing parallelism TS, parallel::for_each, while convenient for traversing a sequence of elements in parallel, would require tricky and convoluted code in order to handle a number of common patterns:

Traversing multiple sequences in the same loop, e.g., A[i] = B[i].

Referring to elements before or after the current element, e.g., iter[0] = iter[1].

Performing computations based on the position in the loop, e.g., A[i] += i % 2 ? 1 : -1;

The proposed looping constructs are meant to overcome these limitations. Here is a short example:

void saxpy_ref(int n, float a, float x[], float y[])
{
    for_loop(par, 0, n, 
        [&](int i)
        {
            y[i] += a * x[i];
        });
}

Please note, that the for_loop expects to be called with an execution_policy as its first parameter, very much in the same way as the already existing parallel algorithms. This is a very important decision, as it marks a first step of the standardization process into a direction which HPX has been following all along: finding an uniform way of expressing and controlling various types of parallelism in C++. We have written about this more than once, for instance in our post about HPX and executors.

Conformance to the C++ Standard and the ongoing standardization effort is very important for HPX. For this reason we decided to start implementing the index-based for-loops. For our progress on this, please see the corresponding ticket in our issue tracker. We will update this ticket as we implement the various facilities proposed. In addition to implementing the constructs as specified, we will extend those in similar ways as we have done already for the existing parallel algorithms (parallel::for_each, et. al.), namely:

We add full support for executors as proposed in the paper An Interface for Abstracting Execution (P0058R1, to which we have contributed as well). This for instance allows to write code like:

void saxpy_ref(int n, float a, float x[], float y[])
{
    // create an executor running on 4 cores of the
    // socket where the data 'x' is located
    numa_executor exec(4, x); 

    // the iterations of the following for_loop will 
    // run in parallel on 4 cores of the same NUMA 
    // domain close to the data
    for_loop(par.on(exec), 0, n, 
        [&](int i)
        {
            y[i] += a * x[i];
        });
}

We augment the new algorithms to be usable with our additional execution policies enabling their asynchronous execution. In several discussions in the C++ committee people expressed their support for finding a way of exposing algorithms asynchronously. We are obviously not sure whether in the end the standard will adopt an API for asynchronous algorithms similar to what we have in HPX, but we are reasonably certain that something very similar will be proposed. Here is an example how it looks in HPX today:
```
future<void> saxpy_ref(int n, float a, float x[], float y[])
{
    return
        for_loop(par(task), 0, n, 
            [&](int i)
            {
                y[i] += a * x[i];
            });
}
```
Note that the execution policy par(task) instructs the previously synchronous algorithm to kick off the iterations in parallel and to return immediately. The returned future object will become ready once the algorithm has run to completion. Note also, that in this example the user has to make sure the arrays x and y are being kept alive until after the algorithm has finished executing.

Intel’s standardization proposal defines more facilities, such like the ability to specify reductions and to add additional loop variables. While we already have implemented those, I will not go into more detail than necessary here. Please refer to the standardization document or our implementation in HPX.

To the best of our knowledge, the implementation of these parallelization constructs in HPX is the first to be openly available. So if you decide today that you want to try out these new facilities which will be available in the future, hopefully on every major compiler platform, please fork the HPX repository here and let us know what you think.

HPX and Index-based C++ Parallel Loops

1 Comment

Leave a Reply Cancel reply