As HPC shifts its long-range focus from peta- to exascale, the need for programmers to efficiently utilize the entirety of a machine's compute resources has become paramount. This has grown increasingly difficult as most of the Top500 machines rely, in some capacity, on hardware accelerators such as GPUs and coprocessors, which often require special languages and APIs to take advantage of them. In C++, the concept of executors, as currently discussed by the C++ standardization committee, opens the possibility of a flexible and dynamic choice of the execution platform for various types of parallelism in C++, including the execution of user code on heterogeneous resources such as accelerators and GPUs in a portable way. This will also allow us to develop a solution that seamlessly integrates iterative execution (parallel algorithms) with other types of parallelism, such as task-based parallelism, asynchronous execution flows, continuation-style computation, and explicit fork-join control flow of independent and non-homogeneous code paths, as illustrated by the sketch below.
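To make this concrete, the following minimal sketch shows how task-based parallelism and continuation-style composition already interoperate in HPX through futures. It assumes a recent HPX release; header and namespace names have shifted between versions.

```cpp
#include <hpx/future.hpp>
#include <hpx/init.hpp>

#include <iostream>
#include <utility>

int hpx_main()
{
    // Task-based parallelism: spawn two independent asynchronous tasks.
    hpx::future<int> a = hpx::async([] { return 40; });
    hpx::future<int> b = hpx::async([] { return 2; });

    // Continuation-style composition: this lambda runs only once both
    // tasks have produced their results, without blocking any thread.
    hpx::future<int> sum = hpx::dataflow(
        [](hpx::future<int> fa, hpx::future<int> fb) {
            return fa.get() + fb.get();
        },
        std::move(a), std::move(b));

    std::cout << sum.get() << std::endl;    // prints 42

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);    // runs hpx_main on the HPX runtime
}
```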
Today, HPX implements executors to support sequential and parallel execution of parallel algorithms (and other parallelization constructs) on the main cores (CPUs). However, the current solution does not directly utilize the power of GPUs. Support for GPU frameworks (such as CUDA and OpenCL) has become standard in modern graphics processors, but the ability to utilize the computational capabilities provided by GPUs is still, especially in HPC applications, far from common. The most frequent reasons include a less intuitive programming model, the more complex architecture of both memory and streaming processors, and the host-device paradigm, which requires manual data transfers and synchronization. These disadvantages will only be aggravated by the newer computer architectures coming online over the next years. There is a need for portable, higher-level APIs for parallelization across a diverse set of heterogeneous architectures, including CPUs, GPUs, accelerators, and coprocessors, which would simplify the creation of applications even by programmers who are not experienced with GPU architectures or low-level GPU programming languages.
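For reference, the sketch below shows the executor support HPX already provides on CPUs: a standard-conforming parallel algorithm whose scheduling is steered by rebinding the parallel execution policy to a concrete executor. Names follow recent HPX releases and may differ in older versions.

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>

#include <vector>

int hpx_main()
{
    std::vector<double> v(1'000'000, 1.0);

    // An executor encapsulating the set of (CPU) resources to run on.
    hpx::execution::parallel_executor exec;

    // The same interface as std::for_each; the execution policy,
    // rebound to the executor with .on(), decides where and how the
    // iterations are scheduled.
    hpx::for_each(hpx::execution::par.on(exec), v.begin(), v.end(),
        [](double& x) { x *= 2.0; });

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```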
In HPX, we are working on bringing GPU parallelization into the implementation of a diverse set of parallelization constructs. The major goal of this work is to ensure portability, both in terms of code and in terms of performance, across heterogeneous compute elements: the same C++ application code should be compilable and highly performant for CPUs, GPUs, Xeon Phi, and other resources. All provided higher-level C++ parallelization APIs fully conform to the existing C++ standardization documents (C++14 and various Technical Specifications), and we will make every effort to keep up with the ongoing standardization efforts (for C++17 and C++20).
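The intended usage pattern is sketched below: the numerical kernel is written once against the standard algorithm interface and is retargeted purely through the execution policy passed in. The GPU executor named in the comment is illustrative only; providing such executors is the subject of the ongoing work described here.

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <hpx/init.hpp>

#include <utility>
#include <vector>

// Written once: the execution target is chosen solely via the policy.
template <typename ExPolicy>
void scale(ExPolicy&& policy, std::vector<double>& v, double factor)
{
    hpx::for_each(std::forward<ExPolicy>(policy), v.begin(), v.end(),
        [factor](double& x) { x *= factor; });
}

int hpx_main()
{
    std::vector<double> v(1'000'000, 1.0);

    scale(hpx::execution::seq, v, 2.0);    // sequential, one CPU core
    scale(hpx::execution::par, v, 2.0);    // parallel, all CPU cores

    // The goal of this work: the same call, retargeted to an
    // accelerator by rebinding the policy to a (hypothetical) GPU
    // executor, e.g. scale(hpx::execution::par.on(gpu_exec), v, 2.0);

    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    return hpx::init(argc, argv);
}
```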
First results of this research demonstrate the viability of our approach: we have shown that it is possible to create higher-level parallelization APIs that are at least as performant as existing solutions (based on OpenMP and/or MPI). We have also shown that it is possible to achieve native performance when integrating work on (local and remote) GPUs into the overall parallel execution flow (running tasks on CPUs and GPUs concurrently). We expect to see similar results when exposing GPUs through the very same APIs as we continue the work described above.