CPPCast Episode: HPX and DLA Future

CppCast, hosted by Jason Turner and JeanHeyd Meneide, is the first podcast for C++ developers by C++ developers. Since 2015 CppCast has been having conversations with C++ conference speakers, library authors, writers, ISO committee members and more.

Hartmut Kaiser and Mikael Simberg of the STE||AR Group joined Jason and JeanHeyd for a podcast episode recently. They discussed some blog posts on returning multiple values from a function and C++ Ranges. Then they talk about the latest version of HPX, how easy it is to gain performance improvements with HPX, and DLA Futures, the Distributed Linear Algebra library built using HPX.

To listen to the podcast episode, click here.

GSoC 21 – First Eval Status Update

In the first second of July, we completed the first evaluation of our Google Summer of Code program. The students have provided summaries of their work and details of the pull requests they’ve created. Check them out below:

Akhil Nair

My GSoC work mainly involves targeting the following three issues :-

I’ll be adapting the remaining algorithms in these issues so that they conform to the C++20 standard.

The project involves adding tag_dispatch and tag_fallback_dispatch CPO (Customization Point Object) overloads. Tag_dispatch overloads are used for the segmented overloads and tag_fallback_dispatch for the parallel overloads as this ensures that it tries to dispatch to a segmented overload even if it’s not an exact match before looking at the parallel overloads.

Range based overloads are also added as well as overloads taking in sentinel values. The base implementations are modified to support these sentinel values. Tests for these overloads as well as any missing tests for the parallel overloads have also been added. Doxygen documentation for each overload and the deprecation of the old hpx::parallel namespace overloads is also done.

The first phase of the GSoC period was great, the community was very supportive and nice. The weekly meetings are really helpful and something to look forward to every week and everyone is helpful and responsive on the irc channel. I am slowly getting to know my mentors as well as my fellow GSoC/GSoD students.

For the first phase of GSoC, I’ve adapted the following algorithms to C++ 20 :-

Merged:

Open:

Apart from this, I’ve also created a PR to add the ranges starts_with and ends_with algorithm.

Moving forward, I would be working on adapting the remaining algorithms (not a lot left now) and working on some other issues as well such as adding the shift_left and shift_right algorithms and looking into the performance issues of the scan partitioner.

Srinivas Yadav

Add vectorization to par_unseq implementations of Parallel Algorithms

An Overview of the Project

 Currently the HPX algorithms support data parallelism through explicit vectorization using Vc library and only for few algorithms like for_each, transform and count, but recently the support for Vc library has been deprecated and it is moved to std-experimental-simd. So the aim of the project is to adapt the data-parallel support for parallel algorithms using the new  std-experimental-simd.

Work Done So far

The existing support for data parallelism is tightly entangled with the implementation of the parallel base algorithms. So the first job was to separate the existing datapar algorithms by creating Customization Point Objects (CPOs) using tag_dispatch for datapar execution policies and fallback using tag_fallback_dispatch for other execution policies to the base implementation.

At first I really did not understand how the CPOs work and why they were used in HPX, but later soon I really got into understanding how they work with the help of my mentors and their references to some nice resources for C++ metaprogramming and CRTPs and turns out CPOs was the perfect solution for separating the datapar algorithms. The existing algorithms that support data parallelism rely on two main utility algorithms called util::loop and util::transform. So I had converted these two algorithms to CPOs which successfully separated the datapar algorithms. Here is the link to PR.

This first job really made me understand the existing implementation of the data parallel algorithms and helped me a lot in future in adapting other algorithms. Then I moved on to adapt the std-experimental-simd backend. This task mainly requires implementing four basic traits i.e vector_pack_type, load_store, size and count which can be used to adapt most parallel algorithms. It was fairly simple and I implemented them, but later the CI tests were failing for clang compilers after going through the errors we found out that std-experimental-simd was partially implemented for clang compilers and was only fully  implemented for GCC so then adding extra flags which enable datapar support only with GCC during cmake resolved this issue. Then I was told by one my mentors that there was proposal to C++23 for adding simd support to parallel algorithms which is the one of the main goals of this project, so in the proposal [P0350] the author has proposed the datapar execution policies with the name of simd execution policy, so after discussing with mentors we decided to rename the datapar and dataseq policies to simdpar and simd. Then after completing  this task the existing algorithms support with new simd and simdpar execution policies. Here is the link to PR.

I was playing around with the new std-experimental-simd and was testing out the performance for a few kernels then it turns out the performance when compared to the old Vc backend was really impressive and there were some nice speed ups. I created a new repo for the performance tests which were run on different machines with different architectures.

I was testing the performance only with one metric i.e speed up, but later I was told by my mentors that the roofline model is another way to measure the performance, which I was new to and I was lucky to get an explanation on it from one of my mentors and and understood that  which essentially shows whether the performance that we got out of the newly adapted simd algorithms is reaching CPU theoretical peak performance or not and shows the kernel used is compute bound or memory bound.

The following figure shows speed plot for one the algorithms (for_each using compute bound kernel)

X_axis : Number of elements in array (In powers of 2)

Y_axis : Speed up against sequential execution policy

What’s Next ?

The next step is to implement different compute bound and memory bound kernels and get the roofline plots for them to measure the performance and I still need to clean up the repo I created for the performance benchmarks, after that I will be working on adapting the remaining algorithms here to datapar.

STE||AR Spotlight: Patrick Diehl

Patrick Diehl is a research scientist here in the STE||AR Group at CCT – LSU. He is definitely one of our most active members!  In addition to his extensive research activities and numerous publications, Patrick also teaches in the LSU Math Department and has organized several workshops and events.

Before joining LSU, Patrick was a postdoctoral fellow at the Laboratory for Multiscale Mechanics at Polytechnique Montreal. He received his diploma in computer science at the University of Stuttgart and his Ph.D. in Applied Mathematics from the university of Bonn.

Patrick created and hosted the virtual CAIRO colloquium series in the Spring.  Speakers from across the country, and even internationally, joined to discuss various AI (artificial intelligence) topics.  The series was an overall success and will continue in the Fall.

Patrick is the liaison for universities in Louisiana for the Texas and Louisiana section of the Society for Applied and Industrial Mathematics (SIAM). He is a topic editor for the Journal for Open Source Software (JOSS) for computational fracture mechanics, applied mathematics, C++, asynchronous and task-based programming.

Patrick also cohosts a podcast – FLOSS For Science – with episodes that showcase free, libre and open source software uses in science with the aim to advocate for the usage of Open Source software in academia and higher education.

Patrick’s main research interests are:

Computational engineering with the focus on peridynamic material models for the application in solids, like glassy or composite materials

High performance computing, especially the asynchronous many task system (ATM), e.g. the C++ standard library for parallelism and concurrency (HPX) for large heterogeneous computations.

In addition, Patrick has a deep interest in the usage of Open Source software to enhance the openness of Science. With respect to teaching, he is interested to develop tools to easily introduce C++ and parallel computing to non-computer science students.

Patrick lives in Baton Rouge with his wife Sylvia and their young daughter Ava.  Aside from all the great work he does at LSU, he’s an active family man and enjoys trips to the park and gymnastics lessons for Ava.  Some of their favorite activities are enjoying the local Cajun food and visiting the amazing BREC parks.