We are pleased to announce the 1st place Best Poster Award winner, Maxwell Cole, for his poster: Computational feasibility of simulating radiation induced changes to vasculature and blood flow rates in the entire human body.
2nd place was awarded to Ioannis Gonidelis for the poster titled Evaluating and Improving Shared Memory Performance of HPX and OpenMP using Task Bench.
Each year, the Best Poster Award recognizes outstanding presentations in the conference’s Poster Session. Posters are judged by external workshop attendees.
We would like to thank HPE Enterprise for sponsoring the poster prizes.
Kishore Kumar, International Institute of Information Technology, Hyderabad
Adapting std Algorithms for the unseq and par_unseq Execution Policies
I began by analyzing and testing compiler support and code generation for different user-provided hints. This was used to create the original version of #6016. Later, I added support for the omp backend, which later versions of Clang and ICC support out of the box. As of the latest PR, the unseq backend first attempts to use the omp backend and, if it is not available, falls back to compiler-specific hints.
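The fallback order described above can be sketched with a small macro. This is only an illustration: the macro name SKETCH_VECTORIZE and the function scale_in_place are made up for this example and are not the names used in #6016, and the exact hints chosen per compiler are an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro for this sketch only (not the HPX macro).
// Prefer the OpenMP simd directive when OpenMP is enabled; otherwise
// fall back to a compiler-specific vectorization hint, if one is known.
#if defined(_OPENMP)
#define SKETCH_VECTORIZE _Pragma("omp simd")
#elif defined(__clang__)
#define SKETCH_VECTORIZE _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
#define SKETCH_VECTORIZE _Pragma("GCC ivdep")
#else
#define SKETCH_VECTORIZE
#endif

// Scales every element in place. The hint only influences code
// generation; the result is the same with or without it.
inline void scale_in_place(float* data, std::size_t n, float factor)
{
    assert(data != nullptr || n == 0);
    SKETCH_VECTORIZE
    for (std::size_t i = 0; i < n; ++i)
    {
        data[i] *= factor;
    }
}
```

Keeping the choice behind one macro means the loop bodies stay identical across backends, which is presumably what makes switching the default backend cheap.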
After this, my next task was to implement a basic version of the transform_loop and loop CPOs. This was initially done with only the original non-omp backend in mind and was later ported to support the omp backend as well. In particular, GCC will throw errors if the loops it is asked to vectorize do not conform to the standard syntax.
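As a rough illustration of the loop shape that OpenMP directives expect, here is a minimal sketch. The function add_to_all is hypothetical, and the exact set of loops GCC rejects depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>

// Adds delta to each element of [first, first + n).
inline void add_to_all(int* first, std::size_t n, int delta)
{
    assert(first != nullptr || n == 0);

    // Canonical form: one loop variable set in the init-statement, a
    // relational test against a loop-invariant bound, a simple increment.
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
    {
        first[i] += delta;
    }

    // By contrast, a loop with an empty init-statement, such as
    //     for (; first != last; ++first) { *first += delta; }
    // is not in canonical form, and GCC is expected to reject it when it
    // directly follows the directive and OpenMP support is switched on.
}
```

Without OpenMP enabled the pragma is simply ignored, so the function behaves the same either way.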
Following this, I wrote a mini-benchmark environment for testing the performance of my adaptation of the std algorithms. It exists as a separate repo and was used to produce all the benchmark numbers reported here.
A strong case for switching to the omp backend was its support for declaring reductions on supported clauses. The next task I worked on was implementing an efficient version of the reduce CPOs in #6018. Reductions that are supported by default are dispatched to specialized overloads, and a generalized implementation is provided as well. This mostly gets the job done; however, for a specialized overload to be selected, the type of the reduction operation must exactly match the type of the init value. For example, if the reduction is over unsigned int and init is signed, the specialized overload will not be selected. This is a TODO that I believe can be resolved with more template metaprogramming. I will be working on this after GSoC.
Note: GCC unseq can probably be made a decent amount faster by switching to the omp backend (which GCC does not support by default). Also, the Clang no-vec benchmarks were removed from the chart because they were very slow and skewed the visualization.
In the second week of July, we completed the first evaluation of our Google Summer of Code program. The students have provided summaries of their work and details of the pull requests they’ve created. Check them out below:
Phase-1 of Google Summer of Code 2022 at Stellar Group
This summer, I am working as a Google Summer of Code mentee in STE||AR Group on “Upgrading Multiple Datasets Performance Visualization feature in Traveler” under the mentorship of Kate Isaacs. This blog summarizes my work on the Traveler Platform during phase 1 of Google Summer of Code 2022 program.
About Traveler
Traveler-Integrated is a web-based visualization system for parallel performance data, such as OTF2 traces and HPX execution trees. HPX traces are collected with APEX and written as OTF2 files with extensions. It is developed by the HDC Lab (Humans, Data and Computers Lab) at the University of Arizona. The major goal of this platform is to provide meaningful insights into parallel performance data in the form of Gantt charts (trace data timelines with dependencies), source code, expression trees, aggregated time series line charts for counter data, utilization charts and task-level histograms.
Web Interface of Traveler
Abstract
The aim of this project, “Multiple Datasets Performance Visualization,” is to add features to the platform that help manage multiple data files and organize the Traveler interface windows for comparing data. The major sub-goals are organizing multiple datasets in the platform, comparing datasets side by side, implementing a highlighted linking system for multiple datasets, and organizing datasets efficiently for visualization.
Phase — 1
Updated the Tagging system of Traveler Interface to accommodate multiple datasets
Issue: Organizing the datasets according to their assigned tags.
Made changes in the interface main menu to display the datasets according to their tags names. Tested the tagging system back-end to accommodate multiple datasets. The screenshot displays the fixes made when tested with 2 datasets.
Issue: Displaying a clear relationship between a folder and its datasets.
Made changes in the front-end to make visible the lines that show the connection between a folder and its datasets. Adjusted the tag header to solve the tag overlapping issue for multiple datasets. A screenshot of the changes is shown below.
Issue: Adding a color picker system to distinguish between multiple datasets.
A “Change Datasets color” option was added to the dataset context menu. With this feature, a user can change a dataset's selection color and main menu color to distinguish it from other datasets. Screenshots of the changes made so far are displayed below:
HPX keeps up to date with standard C++ proposals, and Senders/Receivers were implemented as per P2300. However, coroutine (co_await) integration and some minor functionality described in P2300, which is likely to be accepted, are still missing. Hence I plan to implement this functionality within the core HPX library.
Benefits:
Coroutines lead to better async code. For example, the code is more readable, and local variables have the same lifetime as the coroutine, which means we don’t need to worry about allocation and release.
S/R algorithms can work with coroutines, which they currently cannot do without relying on futures, and futures are single-use.
Adding co_await support makes concurrent code more structured. The same can be achieved with library abstractions over callbacks, but using co_await may allow better optimization.
It makes for a more consistent programming model across the kinds of async programming, i.e. parallelism and concurrency. It standardizes terminology and execution policies, which become more generic and less redundant.
There is a direct connection between Senders and coroutine Awaitables.
Futures
One of the points of S/R is to avoid the allocations associated with futures, also, futures are single-use, whereas S/R, in general, can be used (started) multiple times. – Dr. H. Kaiser
Goal is to enable all Sender CPOs to do the following:
If we write a sender and pass it to a function that may be a coroutine, that coroutine can co_await the sender and get its result.
If senders are not generally awaitable, then we can await-transform them (i.e. make them awaitable).
I’m Panos, currently studying Electrical and Computer Engineering in Aristotle University of Thessaloniki, in Greece. This summer, I joined the HPX team as a contributor through Google Summer of Code (GSoC).
My GSoC project involves performance analysis and optimization on C++ standard parallel algorithms.
To explain further: The C++ standard defines many functions for algorithms that are commonly used by developers (e.g. sorting, searching). HPX provides sequential and parallel implementations for all these algorithms. I’m working on improving the performance of these implementations.
So far, I have explored different methodologies for visualizing and assessing an algorithm’s performance. This has involved a lot of scripting for automating tasks, as well as data collection and analysis.
With help from my mentor, I have produced plots that show how an algorithm’s performance changes when tweaking different parameters (such as workload size and number of computer cores). We also produced visualizations of how different tasks are distributed and where/how they are executed in a parallel environment.
Most importantly though: The HPX community has been immensely welcoming. It can often be awkward being “the new junior guy”, but my mentor quickly made me feel like a part of the team. People here are talented, but also fun and humble, and always eager to help.
This summarizes my experience for the first two months of GSoC. I have learned tons so far. My work here is far from done; however, we have laid a great foundation for the work that will follow.
Documentation is a love letter that you write to your future self.
Damian Conway
We are proud to welcome Bhumit Attarde to STE||AR Group as the new technical writer who will work with us during this year’s Google Season of Docs period. Bhumit will focus on developing additional content for our HPX documentation to help prospective users navigate our codebase more easily. We are looking forward to a fruitful collaboration that will benefit our open source community and enrich our impact in the world of High Performance Computing.
Vectorization is a technique that enables in-core parallelism by using CPU vector registers, which lets us exploit data parallelism. The parallel algorithms added in C++17 and C++20 accept an execution policy as their first argument, and the given policy changes how the algorithm executes. We implement two new execution policies, hpx::execution::simd and hpx::execution::par_simd. The former executes sequentially with vectorization, whereas the latter executes in parallel with vectorization. For both policies, the function passed to the algorithm can no longer take fixed (static) argument types; it must be generic, i.e. a templated or generic function object. This allows the same function object to work with both non-simd and simd policies with little or no change in the code. We used std::experimental::simd (available in C++20 with GCC >= 11.1 and Clang >= 12) as the vectorization backend for the two new execution policies. In the following sections we discuss example code showing how to use these new facilities in HPX, along with benchmarks and their results on various architectures using different kernels.
Example Usage
The following example code snippet shows the use of the hpx for_each algorithm with different execution policies such as seq, par, simd and par_simd.
Note that we passed a generic lambda to the for_each algorithm because the same lambda can be used with different execution policies. The template argument ExPolicy accepts the execution policy, T is the data type of the elements in the std::vector, and Gen accepts the generator function used to fill the std::vector. If the execution policy is seq or par, the variable x is of arithmetic type T, such as int or float, whereas if the execution policy is simd or par_simd, x is of type std::experimental::simd<T>, such as std::experimental::simd<int> or std::experimental::simd<float>. std::experimental::simd<T> is a vector_pack of type T, which is the value_type of the iterator nums. Contiguous elements of the iterator are loaded internally into the vector_pack. The sin and cos functions in the lambda are overloaded for both arithmetic types and vector_pack types (the latter are available in the std::experimental namespace).
The lambda used in this code snippet is a compute-bound kernel: it has high arithmetic intensity because the loop runs for 100 steps and performs sin and cos operations at each step.
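To make the role of the generic lambda concrete, here is a standalone sketch that does not use HPX or std::experimental::simd. The type pack4 is a hypothetical four-lane stand-in for a vector_pack, and kernel is an assumed shape for a compute-bound lambda like the one described above, not the original snippet.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

namespace sketch {
    // Hypothetical four-lane pack standing in for std::experimental::simd<float>.
    struct pack4
    {
        std::array<float, 4> lane;
    };

    inline pack4 operator+(pack4 a, pack4 const& b)
    {
        for (std::size_t i = 0; i < 4; ++i)
            a.lane[i] += b.lane[i];
        return a;
    }

    // Lane-wise math overloads, found through argument-dependent lookup.
    inline pack4 sin(pack4 a)
    {
        for (auto& v : a.lane)
            v = std::sin(v);
        return a;
    }

    inline pack4 cos(pack4 a)
    {
        for (auto& v : a.lane)
            v = std::cos(v);
        return a;
    }

    // One generic lambda serves both cases: x is a float under a scalar
    // policy and a pack under a vectorizing policy.
    auto const kernel = [](auto& x) {
        using std::cos;
        using std::sin;
        for (int step = 0; step < 100; ++step)
            x = sin(x) + cos(x);
    };
}    // namespace sketch
```

Because the lambda never names the type of x, the algorithm is free to hand it either a single element or a whole pack, which is why the same user code works under all four policies.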
Now we look at another example that uses a memory-bound kernel (a SAXPY operation) with the help of the transform algorithm.
This code snippet is very similar to the previous one; only the lambda and the algorithm change. Here as well, the arguments to the lambda are of one of two kinds: arithmetic types (if the execution policy is seq or par) or vector_pack types (if the execution policy is simd or par_simd).
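For reference, a SAXPY kernel written with a generic lambda looks roughly like this. The sketch uses the sequential std::transform from the standard library rather than the hpx transform with an execution policy, and the wrapper function saxpy is a name chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Computes y <- a * x + y element by element.
template <typename T>
void saxpy(T a, std::vector<T> const& x, std::vector<T>& y)
{
    assert(x.size() == y.size());
    // The lambda is generic, so the same body would also accept pack
    // types if an algorithm passed them instead of single elements.
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
        [a](auto xi, auto yi) { return a * xi + yi; });
}
```

Each element is read once, used in one multiply and one add, and written back, which is why this kernel is limited by memory bandwidth rather than arithmetic.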
The following code snippet shows the usage of algorithms such as count and find. These do not require a lambda, so vectorization is handled entirely by the implementation.
This class of algorithms is much easier to vectorize because it requires minimal intervention from users, i.e. no lambda or function is taken as an argument. Note that we get vectorization benefits only if the iterators passed to the algorithm are random access iterators.
Example Implementation
The above code snippet shows the implementation of the datapar_loop function, which is the main vectorization backend helper for most of the iterative algorithms in hpx. The function call in the datapar_loop class can be divided into 3 main steps.
First, a prefix loop runs sequentially, calling the function f on each element through the helper function datapar_loop_step::call1. This loop runs until it reaches the first aligned element.
Second, the main vectorization loop is where the actual vectorization happens: the datapar_loop_step::callv function creates a vector_pack, loads the elements from the iterator, calls the function f, and then stores the results back.
Finally, the postfix loop handles the elements at the end of the array or container that are fewer than the vector_pack size and hence cannot fill a single vector_pack. They are handled sequentially, like in the prefix loop.
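The three steps can be sketched without any SIMD library as follows. This is not the HPX datapar_loop: three_phase_loop and pack_size are hypothetical names, the pack is modelled as a small local array, and f is applied lane by lane where a real backend would call it once on the whole pack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {
    constexpr std::size_t pack_size = 4;    // hypothetical lane count

    // Applies f to every element of [first, first + n) in three phases.
    template <typename F>
    void three_phase_loop(float* first, std::size_t n, F f)
    {
        assert(first != nullptr || n == 0);
        constexpr std::size_t alignment = pack_size * sizeof(float);
        float* const last = first + n;

        // 1. Prefix: scalar steps until the pointer is pack-aligned.
        while (first != last &&
            reinterpret_cast<std::uintptr_t>(first) % alignment != 0)
        {
            f(*first);
            ++first;
        }

        // 2. Main loop: load a full pack, apply f, store the results back.
        for (; static_cast<std::size_t>(last - first) >= pack_size;
             first += pack_size)
        {
            float pack[pack_size];
            for (std::size_t i = 0; i < pack_size; ++i)
                pack[i] = first[i];
            for (std::size_t i = 0; i < pack_size; ++i)
                f(pack[i]);
            for (std::size_t i = 0; i < pack_size; ++i)
                first[i] = pack[i];
        }

        // 3. Postfix: the remaining elements do not fill a pack.
        for (; first != last; ++first)
            f(*first);
    }
}    // namespace sketch
```

Whatever the alignment of the input, every element is visited exactly once, so the result matches a plain sequential loop.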
Benchmarks
We ran the benchmarks for 2 classes of algorithms. The first class consists of iterative algorithms, where each element from the iterator is mapped using some function. Here we used the for_each and transform algorithms with compute-bound and memory-bound kernels. The second class consists of algorithms that contain conditional statements; these can be described as algorithms with simd mask reductions. For this class, we chose the count and find algorithms.
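To illustrate what a simd mask reduction means for an algorithm like count, here is a scalar emulation. The names count_equal and lanes are hypothetical, and a real backend would produce the mask with a single vector comparison instead of an inner loop.

```cpp
#include <cstddef>

namespace sketch {
    constexpr std::size_t lanes = 4;    // hypothetical pack width

    // Counts the elements of data[0..n) that are equal to value.
    inline std::size_t count_equal(int const* data, std::size_t n, int value)
    {
        std::size_t total = 0;
        std::size_t i = 0;
        for (; i + lanes <= n; i += lanes)
        {
            // Lane-wise comparison: one mask bit per lane.
            unsigned mask = 0;
            for (std::size_t l = 0; l < lanes; ++l)
                mask |= static_cast<unsigned>(data[i + l] == value) << l;

            // Mask reduction: add the number of set bits to the total.
            for (; mask != 0; mask &= mask - 1)
                ++total;
        }

        // Leftover elements that do not fill a pack.
        for (; i < n; ++i)
            total += (data[i] == value);
        return total;
    }
}    // namespace sketch
```

The conditional disappears from the inner loop: the comparison becomes data (a mask) and the branch becomes a reduction over that mask.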
Results
The following figure shows the benchmark of the for_each algorithm with the compute-bound kernel, i.e. Example 1. These benchmarks were run on Intel Xeon Skylake with AVX512. An AVX512 vector register can hold 16 single-precision floating point elements. We can see a 12x speed-up with the simd policy and over 140x with par_simd.
The above image shows the benchmark results, depicting the speed-ups of simd, par and par_simd over the seq execution policy. These benchmarks were run on AMD EPYC 7H12 with AVX2. An AVX2 vector register can hold 8 single-precision floating point elements. The array used contains 128 billion elements with float and double as data types. We can see super-linear scaling of the simd speed-up in compute-bound kernels, i.e. the simd speed-up (10.37) is larger than the vector_pack size (8), because the sin and cos implementations for scalar arithmetic types and vector_pack types are slightly different. We can see a speed-up of three orders of magnitude when using the par_simd execution policy.
Conclusion
From the examples and the benchmarks, we can see how easy it is to vectorize code using the simd and par_simd execution policies and gain massive speed-ups with very little change in code. Adapting algorithms to the simd and par_simd policies is still in progress. A list of the algorithms adapted to these policies is available.
HPX algorithms have supported data parallelism through explicit vectorization using the Vc library, but only for a few algorithms such as for_each, transform and count. Recently, support for the Vc library was deprecated and replaced by std::experimental::simd. In this project I have adapted many algorithms to datapar using the new std::experimental::simd backend, with two new policies, simd and par_simd, built on the data-parallel types proposed in the experimental namespace. Separate tests have been created for all the algorithms adapted to datapar.
I have created a new GitHub repository, std-simd-perf, for the benchmarks of the algorithms that I have adapted to datapar. It contains various plots for speed-up analysis and roofline models for artificial benchmarks and real-world applications.
The std-simd-perf repository contains all the simd benchmarks for artificial algorithms such as for_each, transform, count and find, and for real-world examples such as the Mandelbrot set.
These benchmarks were run on different clusters, and the repo has a separate branch for each architecture.
Speed up plot for a compute bound kernel using for_each algorithm
Speed up plot for a simd reduction based algorithm using count algorithm
Beyond GSoC
Adapt the rest of the algorithms in #2333 to support data parallelism.
I will continue working with the STE||AR Group on HPX in other areas as well, as this is a great community with great people in which to learn and expand my knowledge.
Acknowledgements
Special thanks to Hartmut Kaiser, Nikunj Gupta and Auriane R. for all the guidance and help with frequent meetings.
My main task involves adapting the remaining algorithms from this issue to C++ 20 by using the tag_invoke CPO mechanism to add the correct overloads for the algorithms as mentioned by the C++20 standard. It also involves adding ranges and sentinel overloads for these algorithms as well as ensuring that the base implementations support sentinels. I also added doxygen documentation for each overload.
We have managed to cover almost all algorithms thanks to previous contributions prior to the 2021 GSoC period from Giannis, Hartmut, Mikael and others as well as from Chuanqiu He and Karame for adapting the rotate/rotate_copy and adjacent_difference respectively.
Apart from the adaptation work, I have also created PRs adding the shift_left and shift_right algorithms (Issue #3706) and the ranges starts_with and ends_with algorithms (Issue #5381) and they’re currently under review.
Details:
Tag_invoke:
We mark the old hpx::parallel overloads as deprecated and add new tag_fallback_dispatch overloads that follow the function signatures specified in the C++20 standard, using the tag_invoke CPO mechanism to dispatch each call to the correct overload.
The segmented overloads for an algorithm use tag_dispatch, while the normal parallel and container overloads use tag_fallback_dispatch, so that the segmented overloads are all preferred before falling back to the remaining parallel overloads.
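The preference order can be shown with a stripped-down customization point object. Everything here is an illustration under assumptions: count_evens, segmented_data and segmented_calls are invented for this sketch, and the detection trait is a simplified version of what a real tag_invoke implementation provides, not the HPX one. It needs C++17.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {
    // Declared so the unqualified name is known inside this namespace;
    // the real overloads are found by argument-dependent lookup.
    void tag_invoke();

    template <typename Void, typename Tag, typename... Args>
    struct is_tag_invocable : std::false_type
    {
    };

    template <typename Tag, typename... Args>
    struct is_tag_invocable<
        std::void_t<decltype(tag_invoke(
            std::declval<Tag>(), std::declval<Args>()...))>,
        Tag, Args...> : std::true_type
    {
    };

    // The CPO: a tag_invoke overload wins if one exists, otherwise the
    // call falls back to tag_fallback_invoke.
    struct count_evens_t
    {
        template <typename... Args>
        auto operator()(Args&&... args) const
        {
            if constexpr (is_tag_invocable<void, count_evens_t, Args...>::value)
                return tag_invoke(*this, std::forward<Args>(args)...);
            else
                return tag_fallback_invoke(*this, std::forward<Args>(args)...);
        }

        // Fallback overload, playing the role of the normal overloads.
        friend std::size_t tag_fallback_invoke(
            count_evens_t, std::vector<int> const& v)
        {
            std::size_t n = 0;
            for (int x : v)
                n += (x % 2 == 0);
            return n;
        }
    };

    inline constexpr count_evens_t count_evens{};
}    // namespace sketch

namespace user {
    // Stand-in for a segmented container.
    struct segmented_data
    {
        std::vector<std::vector<int>> segments;
    };

    inline int segmented_calls = 0;    // lets a caller observe the dispatch

    // Preferred overload, playing the role of the segmented overloads.
    inline std::size_t tag_invoke(
        sketch::count_evens_t tag, segmented_data const& d)
    {
        ++segmented_calls;
        std::size_t n = 0;
        for (auto const& s : d.segments)
            n += tag(s);    // each segment goes through the fallback path
        return n;
    }
}    // namespace user
```

Because the fallback is only consulted when no tag_invoke overload matches, a segmented overload can never be shadowed by a more generic parallel overload, which is the point of splitting the two dispatch levels.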
Range and sentinel overloads:
C++ 20 introduced the ranges overloads for many of the algorithms and we have done the same for our algorithms, available in the hpx::ranges namespace.
We can pass a range as either a single range argument or by using an iterator-sentinel pair. The range overloads also make use of tag_fallback_dispatch for overload resolution.
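The two calling styles can be sketched as a pair of overloads. The names below (sketch::count, null_terminator) are illustrative and are not the hpx::ranges signatures; the point is only that the sentinel may have a different type from the iterator.

```cpp
#include <cstddef>
#include <iterator>

namespace sketch {
    // Sentinel marking the end of a null-terminated character sequence.
    struct null_terminator
    {
    };

    inline bool operator==(char const* it, null_terminator)
    {
        return *it == '\0';
    }

    inline bool operator!=(char const* it, null_terminator s)
    {
        return !(it == s);
    }

    // Iterator-sentinel overload: Sent need not be the same type as Iter.
    template <typename Iter, typename Sent, typename T>
    std::size_t count(Iter first, Sent last, T const& value)
    {
        std::size_t n = 0;
        for (; first != last; ++first)
        {
            if (*first == value)
                ++n;
        }
        return n;
    }

    // Range overload: forwards to the iterator-sentinel overload.
    template <typename Rng, typename T>
    std::size_t count(Rng const& rng, T const& value)
    {
        return sketch::count(std::begin(rng), std::end(rng), value);
    }
}    // namespace sketch
```

With the sentinel, counting characters in a C string needs no prior strlen pass: the end of the range is discovered while iterating.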
Separating the segmented overloads:
For algorithms having segmented overloads, we add tag_dispatch overloads and remove the forward declarations in both files to separate the segmented overloads completely from the parallel overloads.
Shift left and shift right algorithms:
Shift left and shift right algorithms have been added. They make use of reverse in the parallel implementations (anyone reading this in the future, feel free to attempt a more efficient parallel implementation if possible). Range and sentinel overloads for these algorithms have been added as well. Ranges starts_with and ends_with algorithms have been added too.
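One way to build a left shift out of reversals is sketched below. It is an assumption that this matches the general idea of the parallel implementation; the function shift_left_by_reverse is hypothetical and uses the sequential std::reverse where a parallel version would use a parallel reverse.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace sketch {
    // Shifts the elements of v towards the front by n positions and
    // returns how many leading elements hold shifted values afterwards.
    // As with std::shift_left, the remaining tail is left unspecified.
    template <typename T>
    std::size_t shift_left_by_reverse(std::vector<T>& v, std::size_t n)
    {
        if (n == 0)
            return v.size();
        if (n >= v.size())
            return 0;    // shifting by the whole length has no effect

        std::size_t const kept = v.size() - n;
        // Reversing everything brings the kept elements to the front in
        // reverse order; reversing that front part restores their order.
        std::reverse(v.begin(), v.end());
        std::reverse(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(kept));
        return kept;
    }
}    // namespace sketch
```

For example, shifting {1, 2, 3, 4, 5} by 2 goes through {5, 4, 3, 2, 1} and ends with 3, 4, 5 in the first three positions. The cost is that every element is moved twice, which is the inefficiency the note above invites future contributors to remove.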
Other:
I’ve also been looking into the senders and receivers proposal and looking into the performance issues of the scan partitioner by trying to measure the execution time and scheduling of the various stages of the scan algorithm.
PR Details:
The following PRs have been merged as of writing this report:
My experience working with and being mentored by the STE||AR Group has been amazing. This being my second GSoC, I was looking for an organization that had both challenging and interesting work and a helpful and supportive community, and the STE||AR Group ticked both of those boxes wonderfully.
Hartmut and Giannis were amazing mentors and have been very helpful. The weekly meetings with them and Auriane were very useful to keep track of the progress and get guidance on how to proceed. Thanks to Hartmut, Auriane and Mikael for reviewing my PRs. I’m also grateful for the help of other members of the community who were very helpful and responsive on the IRC chat.
Over the summer my understanding of C++ has definitely increased. There is a LOT more to cover, but I’m sure continuing to work on HPX (and asking questions on the IRC) will help with that. Being able to ask questions of community members who have such a deep understanding of the topics is a very valuable advantage of contributing to HPX.
I fully intend to continue working on HPX and with the STE||AR Group after GSoC is over and look forward to learning and working on more interesting stuff in the coming months.
HPX was recently selected to be part of Google’s Season of Docs (GSoD), a program designed to improve the documentation of open source software, as well as being a Google Summer of Code organization.
GSoD aims to fill the documentation gaps that organizations face for various reasons, while giving technical writers an avenue to showcase their skills.
I will help organize and update the existing documentation to make it more navigable and give it a user-friendly structure, since many users have had issues using the current documentation. I will work closely with the HPX team and our users to collect feedback, find user pain points, and improve the existing docs, which mainly consist of the build instructions.
Alongside this, I will create a “design document” containing guidelines for how to add new content to the documentation: tips on how to structure new sections, general guidelines on what sort of content should be presented in what chapters, etc. The project may also include content rearrangement and a change of hierarchy, if the users find it is needed.
I am currently working on a timeline and action items and researching a possible shift to another documentation platform.
I am reachable at rachitt01@gmail.com or on the IRC as rachitt_shah. Please contact me to suggest changes to the documentation or to provide feedback. We can always benefit from your ideas.
About me: I’m an undergrad majoring in electronics, and I’m a casual sport programmer as well. I’ve been a product manager and venture capital intern in the past, and have done Google Summer of Code with OpenAstronomy.
We’ve reached the end of Google’s Season of Docs, and we’ve accomplished a lot in the past three months. My initial proposal was to work on three sections of the manual, and we have far exceeded our goal, managing to make changes to twelve different sections of the documentation. The majority of the work I’ve done has consisted of cleaning up grammatical errors and improving sentence structure. I have also added a style guide to the wiki, which should help standardize future changes to the documentation. The style guide can be found in the “HPX Source Code Structure and Coding Standards” wiki document under the section “Documentation Style Guide”. For a complete list of my pull requests during Season of Docs, please see here. To view my changes to the wiki, please see here.