It has been another STE||AR Summer of Code! This year our team had the privilege to work with some very talented students. Moreover, we are pleased to say that their work clearly showcases their potential. It has been exciting and rewarding to watch these students submit increasingly influential contributions to our community. Below you can read an outline of their projects, which includes links to the source code they added. Check it out!
Madhavan Seshadri – HPXCL – Asynchronous Integration of CUDA and OpenCL to HPX
The rapidly increasing quantity of problem data requires massively parallel processing to obtain meaningful inference. GPUs provide the compute capability for SIMD-style processing through Streaming Multiprocessors, while CPUs optimize performance through instruction pipelining. Optimizing large-scale problems constrains us to combine computation on both CPUs and GPUs, which requires complex synchronization mechanisms. HPX enhances synchronization across nodes using future<>, and HPXCL can provide intra-device, inter-device and node-device synchronization using future<>. The work done during GSoC optimizes the performance of HPXCL CUDA by making it possible to choose between the default stream and multiple streams. Test cases were written to ensure that the system performs as expected. Multiple benchmarks were also written to measure the performance of the system against native applications running on the GPU, which helps identify where the system can be improved further. A detailed report of Madhavan’s project is available here.
Denis Blank – Re-implementing hpx::util::unwrapped and unifying the API of hpx::wait and hpx::when
In HPX we express dependencies between future results through the when_*, wait_* and dataflow functions, which are mostly used in combination with the unwrapped function, which lets code work with plain values directly rather than with futures of values. Denis’ project added the ability for those functions to also accept arbitrary containers, tuple-like objects and move-only types, as well as the capability to deal with futures nested inside such types. For his GSoC project, Denis re-implemented and extended the existing unwrapped solution as an independent implementation, which provides this functionality for synchronous and asynchronous usage. These features will make it much easier to express dependencies between futures in HPX and to access their results in a convenient way. A detailed report of Denis’ project is available here.
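To make the idea of unwrapping concrete, here is a minimal sketch written against the standard library's std::future rather than HPX. The names unwrap_one and unwrapping_sketch are hypothetical and are not part of the HPX API; the sketch only shows how a callable that expects plain values can be adapted to accept futures, containers of futures, and ordinary values side by side.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration only; this is not the HPX implementation.

// Plain (non-future) arguments pass through unchanged.
template <typename T>
T unwrap_one(T value) { return value; }

// A single future is replaced by the value it holds (blocks until ready).
template <typename T>
T unwrap_one(std::future<T> f) { return f.get(); }

// A container of futures is replaced by a container of values.
template <typename T>
std::vector<T> unwrap_one(std::vector<std::future<T>> futures) {
    std::vector<T> values;
    values.reserve(futures.size());
    for (auto& f : futures) values.push_back(f.get());
    return values;
}

// Wraps a callable so that every argument is unwrapped before the call.
// Futures are move-only, so callers must pass them as rvalues.
template <typename F>
auto unwrapping_sketch(F f) {
    return [f = std::move(f)](auto&&... args) {
        return f(unwrap_one(std::forward<decltype(args)>(args))...);
    };
}
```

With these definitions, a callable such as `[](int a, int b) { return a + b; }` wrapped by unwrapping_sketch can be called with a std::future<int> and a plain int. Unlike this synchronous sketch, the HPX facilities described above also cover tuple-like objects, nested futures and asynchronous use.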
Ajai George – Work on Parallel Algorithms I
My proposal was to implement distributed versions of the STL parallel algorithms. The main focus has been on resolving as much of the pending work in #1338 as possible; that issue is about ensuring that these algorithms work seamlessly with distributed data structures like hpx::partitioned_vector. I have implemented for_each_n, unary and binary transform, reduce, transform_inclusive_scan, transform_exclusive_scan, find, find_if, find_if_not, any_of, all_of, none_of, binary transform_reduce, adjacent_find, and adjacent_difference. A complex algorithm has been implemented to find sequences spanning multiple segments in a distributed data structure. This works for almost all the test cases except the corner case of searching for repeating sequences (like 123123123) across segments. The implementations of the transform scans use the existing distributed implementation of scans. The implementations of find and its variations use a common intermediate function for handling the distribution of work among segments. Test cases have been provided for all of the implemented functions. A detailed report of Ajai’s project is available here.
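The idea of routing find and its variations through one shared helper can be sketched in plain, sequential C++. This is an assumption-laden illustration: segments are modeled as a vector of vectors rather than an hpx::partitioned_vector, the names segmented_find_if, segmented_find and segmented_find_if_not are hypothetical, and the real implementation distributes the per-segment work instead of looping over it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a distributed container: each inner vector is one segment.
using Segments = std::vector<std::vector<int>>;

// Hypothetical shared helper: search each segment locally, in order, and
// translate the first local hit into a global index.
// Returns the total element count when nothing matches (like an end iterator).
template <typename Pred>
std::size_t segmented_find_if(const Segments& segments, Pred pred) {
    std::size_t offset = 0;
    for (const auto& segment : segments) {
        auto it = std::find_if(segment.begin(), segment.end(), pred);
        if (it != segment.end())
            return offset + static_cast<std::size_t>(it - segment.begin());
        offset += segment.size();
    }
    return offset;
}

// find and find_if_not only differ in the predicate they hand to the helper.
inline std::size_t segmented_find(const Segments& segments, int value) {
    return segmented_find_if(segments, [value](int x) { return x == value; });
}

template <typename Pred>
std::size_t segmented_find_if_not(const Segments& segments, Pred pred) {
    return segmented_find_if(segments, [pred](int x) { return !pred(x); });
}
```

Because each segment is searched independently here, this shape does not cover matches that straddle a segment boundary, which is why searching for sequences across segments needs the more complex algorithm mentioned above.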
Taeguk Kwon – Work on Parallel Algorithms II
HPX is a C++ Standard Library for Concurrency and Parallelism. Therefore, implementing a C++17 parallelism proposal like N4409 in HPX was a natural fit. Most of the parallel algorithms were already implemented before I began working with the HPX team; however, some of them were not. My main objective for the summer was to implement as many of the remaining parallel algorithms as possible. Additionally, I set out to adapt the algorithms to the Ranges TS as well as add container versions of them. I added unit tests and benchmark code to test my work. I was also able to catch many issues in the previously implemented algorithms and suggest fixes for them. A detailed report of Taeguk’s project is available here.
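As a rough illustration of what a container version of an algorithm means, the sketch below pairs an iterator-based overload with one that takes the whole container. The namespace sketch and its all_of overloads are hypothetical and use only the sequential standard library; they do not reproduce the HPX signatures, execution policies, or the Ranges TS concepts.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

namespace sketch {

// Iterator-pair version: the classic STL shape.
template <typename Iterator, typename Pred>
bool all_of(Iterator first, Iterator last, Pred pred) {
    return std::all_of(first, last, pred);
}

// Container version: takes the range as one argument and forwards
// its begin/end to the iterator-pair version above.
template <typename Range, typename Pred>
bool all_of(Range&& range, Pred pred) {
    return sketch::all_of(std::begin(range), std::end(range), pred);
}

}  // namespace sketch
```

Given a `std::vector<int> v`, the call `sketch::all_of(v, pred)` is then equivalent to `sketch::all_of(v.begin(), v.end(), pred)`, which removes a common source of mismatched iterator pairs.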