GSoC 2022 – Adapting std Algorithms for the unseq and par_unseq Execution Policies

Kishore Kumar, International Institute of Information Technology, Hyderabad

Adapting std Algorithms for the unseq and par_unseq Execution Policies

I began my work by first analyzing and testing compiler support and codegen for different user provided hints. This was used to create the original version of #6016. Later, I added support for the omp backend which is supported by later versions of Clang and ICC out of the box. As of the latest PR the unseq backend will first attempt to use the omp backend, and if it is not available, default to compiler specific hints. 

After this, the next task assigned to me was to implement a basic version of the transform_loop and loop CPO’s. This was initially completed keeping in mind just supporting the original non-omp backend. Later, it was ported to account for supporting the omp backend as well. In particular, GCC will throw errors if the loops asked to vectorize are not conforming to the standard syntax:

for(int counter=0; counter < limit; counter++) { … }

So the implementation was then changed to vectorize loops only when passed std::random_access_iterator’s. This is #6017.

Following this, I wrote a mini-benchmark environment for testing the performance of my adaptation of the std algorithms here. This exists as a separate repo and was used to report all the benchmark numbers shown here. 

A strong case for switching to the omp backend was its support for declaring reductions on supported clauses. The next task I worked on was implementing an efficient version of the reduce CPO’s here #6018. Reductions for default-supported ones were overloaded to their respective methods, and a generalized implementation is given as well. This mostly gets the job done, however for the specialized-overloads to accept the overload the reduction operation must exactly match the type of the init value. For example, if reduction is over unsigned int and init is signed, the overload will not accept. This is a TODO that I believe is possible to achieve with more template meta-programming. I will be working on this post GSoC.

Note: GCC Unseq can probably be made a decent amount faster by switching to the omp backend (Does not default support). Also, clang no-vec benchmarks were removed from the chart as they were very slow and skewed the visualization. 

GSoC 22: First Eval Update

In the second week of July, we completed the first evaluation of our Google Summer of Code program. The students have provided summaries of their work and details of the pull requests they’ve created. Check them out below:

Monalisha Ojha:

https://medium.com/@monalisha-ojha/multiple-datasets-performance-visualization-traveler-a352c13f7c25

Multiple Datasets Performance Visualization — Traveler

Phase-1 of Google Summer of Code 2022 at Stellar Group

This summer, I am working as a Google Summer of Code mentee in STE||AR Group on “Upgrading Multiple Datasets Performance Visualization feature in Traveler” under the mentorship of Kate Isaacs. This blog summarizes my work on the Traveler Platform during phase 1 of Google Summer of Code 2022 program.

About Traveler

Traveler-Integrated is a web-based visualization system for parallel performance data, such as OTF2 traces and HPX execution trees. HPX traces are collected with APEX and written as OTF2 files with extensions. It is developed by the HDC Lab (Humans, Data and Computers Lab) at the University of Arizona.The major goal of this platform is to provide meaningful insights into parallel performance data in the form of Gantt charts (trace data timelines with dependencies), source code, expression tree, aggregated time series line charts for counter data, utilization chart and task level histograms.

Web Interface of Traveler

Abstract

The aim of this project, “Multiple Datasets Performance Visualization,’’ is to add specific features in the platform that will help in managing multiple data files and organizing traveler interface windows to handle the comparison of data. Organizing multiple datasets in the platform, comparison of datasets side by side, implementing a highlighted linking system for multiple datasets and organizing datasets efficiently for visualization are some of the major sub-goals.

Phase — 1

Updated the Tagging system of Traveler Interface to accommodate multiple datasets

Issue : Organizing the datasets according to their assigned tags.

Made changes in the interface main menu to display the datasets according to their tags names. Tested the tagging system back-end to accommodate multiple datasets. The screenshot displays the fixes made when tested with 2 datasets.

Traveler Interface

Issue Link: https://github.com/hdc-arizona/traveler-integrated/issues/90

Pull Request: https://github.com/hdc-arizona/traveler-integrated/pull/91

Fixed glitches related Traveler front-end

Issue: Displaying a clear relationship between a folder and its datasets.

Made changes in the front-end to make the lines visible that shows the connection between folder and its datasets. Adjusted the tag header to solve the tag overlapping issue for multiple datasets. The screenshot of the changes are shown below.

Traveler Interface

Issue link: https://github.com/hdc-arizona/traveler-integrated/issues/92

Pull request link: https://github.com/hdc-arizona/traveler-integrated/pull/93

Adding dynamic color highlighting system

Issue: Adding a color picker system to distinguish between multiple datasets.

“Change Datasets color” option is added to datasets context menu. With this feature, a user can change the datasets selection color and main menu color to be distinguishable from other datasets. The screenshots of changes done till now are displayed below:

Traveler Interface

Pull request link: https://github.com/hdc-arizona/traveler-integrated/pull/94

Shreyas Atre

https://satacker.github.io/docs/c++/GSoC-HPX/

Mentors (STE||AR Group @ LSU)

  1. Dr. Hartmut Kaiser, Adjunct Professor @ LSU
  2. Giannis Gonidelis, RA @ LSU

Abstract#

HPX being up to date with Std C++ Proposals, Senders/Receivers were implemented as per P2300. But they have been missing coroutine (co_await) integration and minor functionalities as described in P2300 which is likely to be accepted. Hence I plan to implement these functionalities within the Core HPX Library.

  • Benefits:
    • Coroutines introduce better async code. For example, it is more readable, local variables have the same lifespan as the coroutine which means we don’t need to worry about allocation/release.
    • S/R algorithms can work with coroutines which they cannot as of now unless relied on futures which as mentioned are single-time use.
    • Adding co_await support makes the code more structured with respect to concurrency which can also be done by library abstractions of callbacks but using co_await may make it more optimized.

Brief Summary#

  • Senders, and Receivers
    • Because it makes a more consistent programming model considering async programming types i.e. Parallelism and Concurrency. It standardizes the terminologies and execution policies which are more generic and reduce redundancy.
    • Coroutines have a direct connection between Senders and Coroutine Awaitables.
  • Futures
    • One of the points of S/R is to avoid the allocations associated with futures, also, futures are single-use, whereas S/R, in general, can be used (started) multiple times. – Dr. H. Kaiser

Goal is to enable all Sender CPOs to do the following:

  • If we write a sender and pass it to a function which could be a coroutine that could co_await that sender and get its result.
  • If they are not generally awaitable then we can await transform them (i.e. make them awaitable).

Work#

My PRs can be found using this link as it’ll always be updated.

Following are the Merged PRs until now:

With coroutine traits completed, my remaining work is the following:

  1. Adapt get_completion_signatures when Sender is a awaitable
  2. Utility as_awaitable_t
    • receiver_basesender_awaitable_base
    • to transform an object into one that is awaitable within a particular coroutine.
  3. promise base for 5.
  4. operation base for 5.
  5. Utility connect_awaitable to adapt connect mentioned in spec 2.2
  6. Utility with_awaitable_senders
    • Used as the base class of a coroutine promise type, makes senders awaitable in that coroutine type

References#

Panagiotis Syskakis:

I’m Panos, currently studying Electrical and Computer Engineering in Aristotle University of Thessaloniki, in Greece. This summer, I joined the HPX team as a contributor through Google Summer of Code (GSoC).

My GSoC project involves performance analysis and optimization on C++ standard parallel algorithms.

To explain further:
The C++ standard defines many functions for algorithms that are commonly used by developers (eg. sorting, searching).
HPX provides sequential and parallel implementations for all these algorithms.
I’m working on improving the performance of these implementations.

So far, I have explored different methodologies for visualizing and assessing an algorithm’s performance. This has involved a lot of scripting for automating tasks, as well as data collection and analysis.

With help from my mentor, I have produced plots that show how an algorithm’s performance changes when tweaking different parameters (such as workload size and number of computer cores). We also produced visualizations of how different tasks are distributed and where/how they are executed in a parallel environment.

Most importantly though:
The HPX community has been immensely welcoming. It can often be awkward being “the new junior guy”, but my mentor quickly made me feel like a part of the team.
People here are talented, but also fun and humble, and always eager to help.

This summarizes my experience for the first two months of GSoC. I have learned tons so far. My work here is far from done, however we have laid a great foundation for the work that will follow.

GSoC 2022 Participants Announced!

It is time to announce the participants for in the STE||AR Group’s 2022 Google Summer of Code! We are very proud to announce the names of the 5 contributors this year who will be funded by Google to work on projects for our group.

These recipients represent the very best of the many excellent proposals that we had to choose from. For those unfamiliar with the program, the Google Summer of Code brings together ambitious students from around the world with open source developers by giving each mentoring organization funds to hire a set number of participants. Students then write proposals, which they submit to a mentoring organization, in hopes of having their work funded.

Below are the contributors who will be working with the STE||AR Group this summer listed with their mentors and their proposal abstracts.


Participant:

Shreyas Swanand Atre, Veermata Jijabai Technological Institute

Mentors:

Giannis Gonidelis

Hartmut Kaiser

Project: Coroutine-like interface

HPX being up to date with Std C++ Proposals, Senders/Receivers were implemented as per P2300. But they have been missing coroutine (co_await) integration and minor functionalities as described in P2300 which is likely to be accepted. Hence I propose to implement these functionalities within the Core HPX Library. Benefits: * Coroutines introduce better async code. For example, it is more readable, local variables have the same lifespan as the coroutine which means we don’t need to worry about allocation/release. * S/R algorithms can work with coroutines which they cannot as of now unless relied on futures which as mentioned are single-time use. * Adding co_await support makes the code more structured with respect to concurrency which can also be done by library abstractions of callbacks but using co_await may make it more optimized.


Participant:

Panos Syskakis, Aristotle University of Thessaloniki

Mentors:

Giannis Gonidelis

Hartmut Kaiser

Project:  HPX Algorithm Performance Analysis & Optimization

The latest C++ specifications and the HPX library introduce a variety of ready-to-use algorithms that may use parallelization and concurrency, in order to more efficiently utilize system resources. However, current implementations of parallel algorithms don’t always perform ideally (low thread utilization, large overhead, in some cases slower than sequential). The goal of this project is to investigate this under-performance and improve current implementations, using scaling analysis, profiling tools and visualizations.


Participant:

Bo Chen, University of Science and Technology Beijing

Mentors:

Patrick Diehl

Project: Implement your favorite Computational Algorithm in HPX ( Molecular Dynamics Simulation of Metal)

My Implement will base on MISA-MD. There are various potential functions used in MD simulation under fields, such as Tersoff potential and Lennard-Jones (L-J) potential, for calculating the interaction among atoms. To improve the simulation accuracy, MISA-MD adopted Embedded Atom Method (EAM) potential, a complex but pretty accurate potential Function, which can provide an effective interatomic description for metallic system. To improve the runtime performance, MISA-MD designed and realized a new hash based data structure for efficient atom storage and quick neighbor atom indexing.


Participant:

Kishore Kumar, International Institute of Information Technology, Hyderabad

Mentors:

Nikunj Gupta

Srinivas Yadav

Project:  Implementing auto-vectorization hints for par_unseq and unseq versions of HPX parallel algorithms

C++ 17 and 20 released the par_unseq and unseq execution models which give guarantees to functions which specialize on them that data access functions can be interleaved even between iterations of one thread. This means that these functions are vectorization safe and can thus gain massive boosts in performance by compiler auto-vectorization. Compilers however are conservative and auto-vectorize loops only when they are sure that vectorized versions give the same result as their scalar counterparts and that vectorization will actually end up being profitable. GCC, Clang, MSVC, ICC all rely on different optimization passes in their backend and are all capable of auto-vectorizing certain loop patterns but not all. The goal of this project is to analyze compiler codegen response to different hints and implement a version of the par_unseq and unseq execution policies in HPX that makes use of these guarantees to provide compilers with as many hints as possible to encourage auto-vectorization.


Participant:

Monalisha Ojha, Birla Institute of Technology, Mesra

Mentors:

Kate Isaacs

Project: Multiple Dataset Performance Visualization

Traveler-Integrated is a web-based visualization system for parallel performance data, such as OTF2 traces and HPX execution trees. HPX traces are collected with APEX and written as OTF2 files with extensions. The major goal of this platform is to provide meaningful insights into parallel performance data in the form of Gantt charts (trace data timelines with dependencies), source code, expression tree, aggregated time series line charts for counter data, utilization chart and task level histograms. The aim of this project, “Multiple Dataset Performance Visualization,” is to add specific features in the platform that will help in managing multiple data files and organising traveler interface windows to handle the comparison of data. Organising multiple datasets in the platform, comparison of datasets side by side, implementing a highlighted linking system for multiple datasets and organising datasets efficiently for visualisation are some of the major sub-goals.

GSoD 2022 – STE||AR Group announces technical writer hire!

Documentation is a love letter that your write to your future self.

Damian Conway

We are proud to welcome Bhumit Attarde to STE||AR Group as the new technical writer that will work with us during this year’s Google Season of Docs period. Bhumit will focus on developing additional content for our HPX documentation in order to aid prospective users to navigate easier through our codebase. We are looking forward to a fruitful collaboration that will benefit our open source community and enrich our impact in the world of High Performance Computing.

GSoD 2022

STE||AR Group was accepted for Google Season of Docs 2022! We look forward to developing our HPX documentation even more and expanding our group this summer.

https://developers.google.com/season-of-docs/docs/participants

Like Google Summer of Code (GSoC) the program aims to match motivated people with interesting open source projects that are looking for volunteer contributions. GSoD, however, aims to improve open source project documentation, which often tends to get less attention than the code itself.

We are now looking for motivated people to help us improve our documentation. If you have some prior experience with technical writing, and are interested in working together with us on making the documentation of a cutting edge open source C++ library the best possible guide for new and experienced users, this is your chance. You can read more about the program on the official GSoD home page. We’ve provided a few project ideas on our wiki, but you can also come up with your own. Our current documentation can be found here.

GSoC 22: Come and code a STE||AR Summer with us!

The STE||AR Group is honored to be selected as one of the 2022 Google Summer of Code (GSoC) mentor organizations! This program, which pays students over the summer to work on open source projects, has been a wonderful experience for students and mentors alike. This is our 8th summer being accepted by the program!

Interested students can find out more about the details of the program on GSoC’s official website. As a mentor organization we have come up with a list of suggested topics for students to work on, however, any student can write a proposal about any topic they are interested. We find that students who engage with us on IRC (#ste||ar on freenode) or via our mailing list hpx-users@stellar-group.org have a better chance of having their proposals accepted and a better understanding of their project scope. Students may also read through our hints for successful proposals.

If you are interested in working with an international team of developers on the cutting edge of C++ parallel, task-based runtime systems please check us out!

STE||AR Spotlight: Srinivas Yadav

Srinivas Yadav is a final year under-graduate student pursuing a Bachelors in Computer Science in India. He has been working for STE||AR Group for 8 months now and is interested in the area of vectorization in the field of HPC. He has worked on HPX for the project “Adding par_simd implementations to parallel algorithms”.

Srinivas relied on guidance from STE||AR Group members, Prof. Hartmut Kaiser, Nikunj Gupta, and Auriane Reverdell, to work through his GSoC project.

Srinivas published and presented the paper “Parallel SIMD – A Policy Based Solution for Free Speed-Up using C++ Data-Parallel Types at ESPM2, SC21 in November 2021.

While working with STE||AR Group and HPX Srinivas has learned more about the field of computer science and programming. He learned how to use HPC clusters, and was given remote access to Rostam Cluster at LSU, CCT by Hartmut Kaiser to run SIMD benchmarks on different machines available. Currently, Srinivas is working remotely on Ookami Cluster at Stony Brook University to perform the SIMD benchmarks for ARM machines (on A64FX node) and working with octo-tiger to  port new vectorization backends which could be used to run the octo-tiger benchmarks on these nodes

He also learned the importance of collaboration with different researchers from different time zones. Getting international schedules together can be tricky!

Currently, Srinivas is working on three key areas

First is adapting algorithms to SIMD policies in HPX. The main aim of this task is to adapt as many algorithms as possible to SIMD execution policies in HPX, which contributes to fixing #2333.

Second is to port std::experimental::simd to Octo-Tiger. There are many kernels in Octo-Tiger library which currently use HPX-Kokkos with Vc library for vectorization, but now Vc is deprecated and the plan isto replace it with std::experimental::simd.

Finally, porting the EVE library as a new vectorization/simd backend to HPX. HPX currently has two vectorization backends, a newer one is std::experimental::simd, the older one is Vc (now deprecated) so needs to be replaced by a newer library and EVE seems to be perfect fit for that slot.

Srinivas has applied to LSU for a masters in Computer Science for the coming Fall Semester 2022.  He is very excited to come to CCT and LSU!

Other than academics/work, Srinivas enjoys playing badminton, watching and playing cricket and exploring new places/traveling a lot. Recently, he started learning cooking and learning a new Indian language (Tamil).

Exploit data parallelism using hpx simd and par_simd policies.

Srinivas Yadav

Introduction

Vectorization is a technique to allow incore parallelism using CPU vector registors which enables us to exploit data-parallelism. Recent additions in C++17 and C++20 to parallel algorithms accept execution policy as first argument which changes the execution behaviour based on the given policy. We implement two new execution policies hpx::execution::simd and hpx::execution::par_simd. The former policy does execution in sequential fashion with Vectorization added, where as the latter one does execution in parallel with Vectorization. For both of these newly implemented policies, the iterator function now no longer accepts static types instead accept only generic types i.e templated or generic function objects. This allows the function object to work with both non-simd and simd policies with a very little or no change in the code. We used std::experimental::simd (_available in C++20 with GCC >= 11.1 and Clang >= 12) as the Vectorization backend in implementing the 2 new execution policies. In the following sections we dicuss example codes on how to use these new facilites adapted to hpx and the benchmarks with results performed on various architectures using different kernels.

Example Usage

The following example code snippet describes the use of hpx for_each algorithm with different execuction policies such as seqpar, simd and par_simd.

Note that we passed a generic lambda to for_each algorithm as argument because the same lambda can be used with different execution policies. The template argument ExPolicy is used to accept execution policy, T is used for handling data-types for creating std::vector and Gen is used to accept geneartor function to fill the std::vector. If the execution policy is seq or par then variable x would be of arithmetic type T as intfloat and so on.. where as if execution policy is simd or par_simd then x would of type std::experimental::simd<T> as std::experimental::simd<int>std::experimental::simd<float> and so on. std::experimental::simd<T> is a vector_pack of type T which is value_type of iterator nums. Contigous elements of the iterator are loaded internally into the vector_pack. The sin and cos functions in the lambda are adapted to arithmetic types and vector_pack types (available in std::experimental namespace).

The lambda used in this code snippet is a compute-bound kernel because of high Arithmetic Intensity due to loop running for 100 steps performing sin and cos operations at each step.

Now, we look into another example using a memory bound kernel (performing SAXPY Operation) with help of transform algorithm.

This code snippet is very similar to the previous with change in lambda and algorithm. Here as well, the arguments to lambda is of two types either arithmetic type (if execution policy is seq or par), or vector_pack type (if the execution policy is simd or par_simd).

The following code snippet describes the usage of algorithms such as count, find. These do not require any lambda and hence vectorization is straightforward just with implementation itself.

This class of algorithms is much easier and are more prone to getting vectorized because of minimum intervention with users i.e no lamda or function is taken in the arguments. Note that we get vectorization benefits only if the iterators passed to algorithm are random access iterators.

Example Implementation

The above code snippet shows implementation for datapar_loop function, which is main vectorization backend helper function for most of the iterative algorithms in hpx. The function call in datapar_loop class can be divided into 3 main steps.

  • First a prefix loop runs the code in sequential fashion by calling each element with function f using the helper function datapar_loop_step::call1. This loop runs until it finds first aligned element.
  • Secondly, the main vectorization loop, where actual vectorization happens with datapar_loop_step::callv function which actually creates a vector_pack then loads the elements from iterator and calls the function f and then stores back the results as below.
  • Finally, the last block i.e post-fix loops handles the elements at the end of array or container that are less than vector_pack size and hence cannot be fit into single vector_pack. So they are handled in sequential fashion similar to pre-fix loop.

Benchmarks

We ran the benchmarks for 2 classes of algorithms. First class being iterative algorithms where each element from the iterator gets mapped using some function. We used for_each and transform algorithms with compute bound and memory bound kernels. For the second class of algorithms, we pick the algorithms which have consists of conditional statements and these can be called as algorithms with simd mask reductions. For this class, we chose count and find algorithms.

Results

The Following figure shows benchmark of for_each algorithm with compute bound kernel i.e Example 1 These benchmarks were run on Intel Xeon Skylake with AVX512AVX512 vector register can hold 16 floating point elements. We can see a 12x speed up with simd policy and over 140x with par_simd.

all

The above image shows the benchmark results graph depeciting speed ups of simdpar and par_simd against seq execution policy. These benchmarks were run on AMD EPYC 7H12 with AVX2AVX2 vector register can hold 8 floating point elements.The array used contains 128 Billion elements with float and double as data types. We can see super-linear scaling for simd speed up in compute bound kernels i.e speed up of simd (10.37) is more than vector_pack size (8) because sin and cos implementations for scalar arithmetic types and vector_pack types are slightly different. We can see a 3 order magnitude of speed up when using par_simd execution policy.

Conclusion

From the examples illustrated and the benchmarks, we can see how easy it is to vectorize the code using simd and par_simd execution policies and gain massive speed ups with very little change in code. Currently adapting algorithms to simd and par_simd policies is still under progress. You can find the list of algorithms adapted to these policies .

STE||AR Spotlight: Alireza Kheirkhahan

Alireza Kheirkhahan is an IT Consultant in the STE||AR Group at LSU.  He received his B.S in Computer Engineering from Sharif University of Technology, Tehran, Iran and his master’s in computer science from LSU.

Alireza’s master thesis focused on I/O backend and Storage solutions. His main research focus is High Performance Computing, I/O Systems, high-throughput, redundant and distributed storage systems

Alireza designed and implemented STE||AR Group at LSU’s small research cluster, Rostam. Since 2015, he manages and improves this cluster. Currently, the second generation of the computer cluster is in use, and the next generation is in design. Rostam consists of nearly 80 compute nodes and multiple storage servers. Over the course of the years, more than a dozen graduate students used Rostam for their thesis work, and more than thirty scientific publications have been created using this cluster.

On the CERA (Coastal Emergency Risk Assessment) project, Alireza acts as residing high performance specialist. He carries the specific hpc tasks for domain scientists.  He adapts and maintains their computational needs in HPC clusters provided by LSU and the State of Louisiana. 

Alireza designed and implemented a special purpose storage system for CERA projects. CERA has a particular need for a specific storage solution. The application creates bursts of gigabytes of data at once and goes quite for few hours until another burst arrives. The recently generated data is highly valued and must have very quick and reliable access, but the older data must be archived with a cost-effective manner. With his research background, Alireza designed the new storage system to carry out both tasks at once, which reduced the data transfer time significantly and increased reliability.

Alireza lives in Baton Rouge, with his wife Shahrzad and son Damon. In his spare time, he enjoys woodworking and tinkering with electronics.