STE||AR Spotlight: Nanmiao Wu

Nanmiao Wu is a Ph.D. student in the Department of Electrical and Computer Engineering and the Center for Computation and Technology (CCT) at LSU. She has been working in the STE||AR Group for more than 2 years and is co-advised by Dr. Hartmut Kaiser, head of the STE||AR Group, and Dr. Ram Ramanujam, Director of CCT.

Before joining LSU, she received a B.S. degree in Electronic Information Science and Technology from Nankai University, and an M.S. degree in Electrical and Computer Engineering from the University of Macau.

Nanmiao’s research focuses on scalable and distributed high-performance computation for machine learning and deep learning applications.

She was an intern at Pacific Northwest National Laboratory (PNNL) from February to August 2021, developing an HPX runtime interface for SHAD, a C++ algorithm and data-structure library, to improve its scalability and performance. Linear scaling was achieved on a single locality with varying data-structure sizes and on multiple localities. During the internship, she used the HPX serialization library to bitwise serialize SHAD types. She also learned how to associate multiple tasks with the same handle, forming a task group, and how to run callbacks on remote localities via customized actions.

Before that, she collaborated with PNNL on scalable second-order optimization for deep learning applications. During the collaboration, she implemented a PyTorch second-order optimizer and compared its performance with stochastic gradient descent (SGD), a first-order optimizer, on an image classification task using a multi-layer perceptron network with one hidden layer. The optimizer scaled well and improved throughput: it achieved a 2.2x speedup over SGD in the multi-thread scenario and a 5.8x speedup in the multi-process scenario.

Previously, she implemented a scalable and distributed alternating least squares (ALS) recommendation algorithm for large recommendation systems, as well as a number of iterative solvers, on Phylanx, the open source distributed machine learning framework. The Phylanx ALS implementation was shown to be faster than an optimized NumPy implementation (both running on CPUs only) on a single node, and to exhibit improving speedups as the number of nodes increases [1]. She also contributed to deploying a forward pass of a 4-layer CNN on the Human Activity Recognition dataset on Phylanx and comparing its performance with Horovod. Phylanx showed a notable reduction in execution time as the number of nodes increased and took less execution time (about 18%) than Horovod when using 32 or more nodes [2].

Outside the lab, Nanmiao enjoys spending time in nature. She likes hiking, camping, snorkeling, and travelling. She also likes reading. Her favorite books of 2021 are the Neapolitan Novels.


[1] Steven R. Brandt, Bita Hasheminezhad, Nanmiao Wu, Sayef Azad Sakin, Alex R. Bigelow, Katherine E. Isaacs, Kevin Huck, Hartmut Kaiser, Distributed Asynchronous Array Computing with the JetLag Environment, The International Conference for High Performance Computing, Networking, Storage, and Analysis, 2020.

[2] Bita Hasheminezhad, Shahrzad Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser, Towards a Scalable and Distributed Infrastructure for Deep Learning Applications, 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), 2020.

GSoC 2021 – Add vectorization to par_unseq implementations of Parallel Algorithms

by Srinivas Yadav

GSoC 2021 Final Report


HPX algorithms supported data parallelism through explicit vectorization using the Vc library, but only for a few algorithms such as for_each, transform and count. Support for the Vc library has recently been deprecated and replaced by std::experimental::simd. In this project I adapted many algorithms to datapar using the new std::experimental::simd backend, with two new execution policies, simd and par_simd, built on the data-parallel types proposed in the experimental namespace. Separate tests have been created for every algorithm adapted to datapar.
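To make the idea of a data-parallel count concrete, here is a minimal sketch of the chunk-plus-remainder structure such an algorithm typically follows. It is not the HPX implementation and does not use std::experimental::simd: a plain array of per-lane accumulators stands in for a SIMD register, and the names count_in_lanes and lanes are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical lane count; a real SIMD type derives this from the hardware.
constexpr std::size_t lanes = 8;

// Counts occurrences of `value`, processing `lanes` elements per step.
std::size_t count_in_lanes(std::vector<int> const& data, int value)
{
    std::array<std::size_t, lanes> partial{};    // one partial count per lane
    std::size_t i = 0;

    // Main loop over full chunks: each lane compares and accumulates
    // independently, which is the part a vector unit can do in one step.
    for (; i + lanes <= data.size(); i += lanes)
        for (std::size_t lane = 0; lane < lanes; ++lane)
            partial[lane] += (data[i + lane] == value) ? 1 : 0;

    // Horizontal reduction of the per-lane partial counts.
    std::size_t total = 0;
    for (std::size_t p : partial)
        total += p;

    // Scalar tail: elements that do not fill a whole chunk.
    for (; i < data.size(); ++i)
        total += (data[i] == value) ? 1 : 0;

    return total;
}
```

With a real SIMD type, the inner lane loop collapses into a single vector comparison and add, while the scalar tail stays as it is.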

I have created a new GitHub repository, std-simd-perf, for benchmarks of the algorithms that I adapted to datapar. It contains speedup plots and roofline-model analyses for artificial benchmarks and real-world applications.

Pull Requests for HPX Repo



Other algorithms adapted to datapar [code]:

  • adjacent_difference
  • adjacent_find
  • all_of, any_of, none_of
  • copy
  • count
  • find
  • for_each
  • generate
  • transform

Performance Benchmarks

  • The std-simd-perf repository contains all the simd benchmarks, both for artificial kernels using algorithms such as for_each, transform, count and find, and for real-world examples such as the Mandelbrot set.
  • These benchmarks were run on different clusters; the repository has a separate branch for each architecture.
  • Speed up plot for a compute bound kernel using for_each algorithm
  • Speed up plot for a simd reduction based algorithm using count algorithm

Beyond GSoC

  • Adapt the rest of the algorithms (#2333) to support data parallelism.
  • I will continue working with the STE||AR Group on HPX in other areas as well, since this is a great community of great people in which to learn and expand my knowledge.


Special thanks to Hartmut Kaiser, Nikunj Gupta and Auriane R. for all the guidance and help with frequent meetings.

GSoC 2021 – Adapting algorithms to C++ 20 and Ranges TS

by Akhil Nair


My main task involved adapting the remaining algorithms from this issue to C++20 by using the tag_invoke CPO mechanism to add the correct overloads for the algorithms, as specified by the C++20 standard. It also involved adding range and sentinel overloads for these algorithms and ensuring that the base implementations support sentinels. I also added Doxygen documentation for each overload.

We have managed to cover almost all algorithms thanks to contributions made before the 2021 GSoC period by Giannis, Hartmut, Mikael and others, as well as by Chuanqiu He and Karame, who adapted rotate/rotate_copy and adjacent_difference respectively.

Apart from the adaptation work, I have also created PRs adding the shift_left and shift_right algorithms (Issue #3706) and the ranges starts_with and ends_with algorithms (Issue #5381) and they’re currently under review.



We mark the old hpx::parallel overloads as deprecated and add new tag_fallback_dispatch overloads matching the function signatures specified in the C++20 standard, using the tag_invoke CPO mechanism to dispatch each call to the correct overload.

The segmented overloads for an algorithm use tag_dispatch, while the normal parallel and container overloads use tag_fallback_dispatch, so that all of the segmented overloads are preferred before falling back to the remaining parallel overloads.
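The preference order described above can be illustrated with a small, self-contained C++17 sketch. This is not HPX's actual implementation: the namespace sketch, the customization point describe, the type segmented_data and the detection trait has_tag_dispatch are all hypothetical names chosen for illustration. Only the idea is taken from the text: try a tag_dispatch overload first, and use tag_fallback_dispatch only when none exists.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {

namespace detail {
    // Declared only so that unqualified lookup finds a function name;
    // the real candidates are found through argument-dependent lookup.
    void tag_dispatch();

    // Detects whether tag_dispatch(Tag, T const&) is callable.
    template <typename Tag, typename T, typename = void>
    struct has_tag_dispatch : std::false_type {};

    template <typename Tag, typename T>
    struct has_tag_dispatch<Tag, T,
        std::void_t<decltype(tag_dispatch(
            std::declval<Tag>(), std::declval<T const&>()))>>
      : std::true_type {};
}    // namespace detail

// Hypothetical customization point object (CPO).
struct describe_t
{
    template <typename T>
    std::string operator()(T const& t) const
    {
        // Prefer a tag_dispatch overload; otherwise use the fallback.
        if constexpr (detail::has_tag_dispatch<describe_t, T>::value)
            return tag_dispatch(*this, t);
        else
            return tag_fallback_dispatch(*this, t);
    }

    // Generic fallback, standing in for the normal parallel overloads.
    template <typename T>
    friend std::string tag_fallback_dispatch(describe_t, T const&)
    {
        return "parallel fallback";
    }
};

inline constexpr describe_t describe{};

// Hypothetical segmented container with its own, preferred overload.
struct segmented_data
{
    std::vector<std::vector<int>> segments;
};

inline std::string tag_dispatch(describe_t, segmented_data const&)
{
    return "segmented";
}

}    // namespace sketch
```

Calling sketch::describe on a std::vector<int> selects the fallback, while calling it on a sketch::segmented_data selects the tag_dispatch overload, even though the generic fallback would also accept that argument.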

Range and sentinel overloads:

C++20 introduced range overloads for many of the algorithms, and we have done the same for our algorithms, which are available in the hpx::ranges namespace.

We can pass a range as either a single range argument or by using an iterator-sentinel pair. The range overloads also make use of tag_fallback_dispatch for overload resolution.
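As an illustration of the two calling conventions, the following sketch shows one algorithm with both an iterator-sentinel overload and a range overload. It is plain standard C++ rather than the hpx::ranges API; count_values and null_sentinel are hypothetical names, and the sentinel here marks the end of a null-terminated string.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

namespace sketch {

// Hypothetical sentinel marking the end of a null-terminated string.
struct null_sentinel {};

inline bool operator==(char const* p, null_sentinel) { return *p == '\0'; }
inline bool operator!=(char const* p, null_sentinel s) { return !(p == s); }

// Iterator-sentinel overload: `last` may have a different type than `first`.
template <typename Iter, typename Sent, typename T>
std::size_t count_values(Iter first, Sent last, T const& value)
{
    std::size_t n = 0;
    for (; first != last; ++first)
        if (*first == value)
            ++n;
    return n;
}

// Range overload: forwards to the iterator-sentinel overload.
template <typename Rng, typename T>
std::size_t count_values(Rng const& rng, T const& value)
{
    return count_values(std::begin(rng), std::end(rng), value);
}

}    // namespace sketch
```

For example, sketch::count_values("banana", sketch::null_sentinel{}, 'a') walks the string until the terminator without computing its length first, and sketch::count_values(v, 1) accepts a whole container v.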

Separating the segmented overloads:

For algorithms having segmented overloads, we add tag_dispatch overloads and remove the forward declarations in both files to separate the segmented overloads completely from the parallel overloads.

Shift left and shift right algorithms:

Shift left and shift right algorithms have been added. They make use of reverse in the parallel implementations (anyone reading this in the future, feel free to attempt a more efficient parallel implementation if possible). Range and sentinel overloads for these algorithms have been added as well. Ranges starts_with and ends_with algorithms have been added too.
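One way a shift can be expressed with reverse is the classic three-reversal rotation, sketched below in sequential form. This is an assumption about the general technique, not necessarily what the HPX implementation does; shift_left_by_reverse is a hypothetical name. After the call, the first size - n elements hold the shifted values; the tail holds the old leading elements, which a shift is free to leave unspecified.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Shifts the elements of [first, last) left by n positions using three
// reversals. Returns the end of the shifted range.
template <typename BidirIt>
BidirIt shift_left_by_reverse(BidirIt first, BidirIt last,
    typename std::iterator_traits<BidirIt>::difference_type n)
{
    auto const size = std::distance(first, last);
    if (n <= 0)
        return last;     // nothing to shift
    if (n >= size)
        return first;    // every element is shifted out

    BidirIt mid = std::next(first, n);
    std::reverse(first, mid);     // reverse the part that is shifted out
    std::reverse(mid, last);      // reverse the part that is kept
    std::reverse(first, last);    // reverse everything: kept part now leads
    return std::next(first, size - n);
}
```

Applied to {1, 2, 3, 4, 5} with n = 2, the sequence becomes {3, 4, 5, 1, 2} and the returned iterator points just past the 5. Each reversal is itself easy to parallelize, which is presumably what makes the approach attractive, at the cost of touching every element more than once.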


I’ve also been looking into the senders and receivers proposal, and into the performance issues of the scan partitioner, by trying to measure the execution time and scheduling of the various stages of the scan algorithm.

PR Details:

The following PRs have been merged as of writing this report :-

Open PRs currently under review :-

My experience:

My experience working with and being mentored by the STE||AR Group has been amazing. This being my second GSoC, I was looking for an organization that had both challenging and interesting work and a helpful and supportive community, and the STE||AR Group ticked both of those boxes wonderfully.

Hartmut and Giannis were amazing mentors and have been very helpful. The weekly meetings with them and Auriane were very useful to keep track of the progress and get guidance on how to proceed. Thanks to Hartmut, Auriane and Mikael for reviewing my PRs. I’m also grateful for the help of other members of the community who were very helpful and responsive on the IRC chat.

Over the summer my understanding of C++ has definitely increased. There is still a LOT more to cover, but I’m sure that continuing to work on HPX (and asking questions on IRC) will help with that. Being able to ask questions of community members who have such a deep understanding of these topics is a very valuable advantage of contributing to HPX.

I fully intend to continue working on HPX and with the STE||AR Group after GSoC is over and look forward to learning and working on more interesting stuff in the coming months.

CppCast Episode: HPX and DLA Future

CppCast, hosted by Jason Turner and JeanHeyd Meneide, is the first podcast for C++ developers by C++ developers. Since 2015 CppCast has been having conversations with C++ conference speakers, library authors, writers, ISO committee members and more.

Hartmut Kaiser and Mikael Simberg of the STE||AR Group recently joined Jason and JeanHeyd for a podcast episode. They discussed some blog posts on returning multiple values from a function and on C++ Ranges. They then talked about the latest version of HPX, how easy it is to gain performance improvements with HPX, and DLA Future, the distributed linear algebra library built using HPX.

To listen to the podcast episode, click here.

STE||AR Spotlight: Patrick Diehl

Patrick Diehl is a research scientist here in the STE||AR Group at CCT – LSU. He is definitely one of our most active members!  In addition to his extensive research activities and numerous publications, Patrick also teaches in the LSU Math Department and has organized several workshops and events.

Before joining LSU, Patrick was a postdoctoral fellow at the Laboratory for Multiscale Mechanics at Polytechnique Montreal. He received his diploma in computer science at the University of Stuttgart and his Ph.D. in Applied Mathematics from the University of Bonn.

Patrick created and hosted the virtual CAIRO colloquium series in the Spring.  Speakers from across the country, and even internationally, joined to discuss various AI (artificial intelligence) topics.  The series was an overall success and will continue in the Fall.

Patrick is the liaison for universities in Louisiana for the Texas and Louisiana section of the Society for Industrial and Applied Mathematics (SIAM). He is also a topic editor for the Journal of Open Source Software (JOSS), covering computational fracture mechanics, applied mathematics, C++, and asynchronous and task-based programming.

Patrick also cohosts a podcast – FLOSS For Science – with episodes that showcase free, libre and open source software uses in science with the aim to advocate for the usage of Open Source software in academia and higher education.

Patrick’s main research interests are:

Computational engineering with the focus on peridynamic material models for the application in solids, like glassy or composite materials

High performance computing, especially asynchronous many-task (AMT) systems, e.g. HPX, the C++ standard library for parallelism and concurrency, for large heterogeneous computations.

In addition, Patrick has a deep interest in the usage of Open Source software to enhance the openness of science. With respect to teaching, he is interested in developing tools that make it easy to introduce C++ and parallel computing to non-computer science students.

Patrick lives in Baton Rouge with his wife Sylvia and their young daughter Ava.  Aside from all the great work he does at LSU, he’s an active family man and enjoys trips to the park and gymnastics lessons for Ava.  Some of their favorite activities are enjoying the local Cajun food and visiting the amazing BREC parks.

Important: IRC channel change

Because of the growing problems with Freenode, we have decided to move our IRC channel to a different network. Please /join #ste||ar at Libera.Chat (irc:// If you are using the Matrix bridge to IRC, you can join #ste|| through Matrix.

HPX 1.6.0 Released!

The STE||AR Group is proud to announce the release of HPX 1.6.0! HPX is an implementation of the C++ standard library for parallelism and concurrency on an efficient user-level threading runtime, with extensions for distributed computing.

This release continues the focus on C++20 conformance with multiple new algorithms adapted to be C++20 conformant and becoming customization point objects (CPOs). We have added experimental support for HIP, allowing existing CUDA features to now be compiled with hipcc and run on AMD GPUs as well. We have also continued improving the performance of the parallel executors, and added an experimental fork-join executor. The full list of improvements, fixes, and breaking changes can be found in the release notes.

Thank you to everyone in the STE||AR Group and all the volunteers who have provided fixes, opened issues, and improved the documentation.

Download the release from our releases page.

If you have any questions, comments, or exploits to report you can reach us on IRC (#ste||ar on freenode), on the matrix hpx channel, or email us at hpx-users. We depend on your input!

HPX accepted for Google Season of Docs 2019

This year Google is organizing Google Season of Docs (GSoD) for the first time. Like Google Summer of Code (GSoC), the program aims to match motivated people with interesting open source projects that are looking for volunteer contributions. GSoD, however, aims to improve open source project documentation, which often tends to get less attention than the code itself. We recognize this all too well in the HPX project. For this reason, we decided to apply for GSoD and can now proudly announce that HPX has been selected as one of 50 projects participating in this year’s GSoD!

This means that we are now looking for motivated people to help us improve our documentation. If you have some prior experience with technical writing, and are interested in working together with us on making the documentation of a cutting edge open source C++ library the best possible guide for new and experienced users, this is your chance. You can read more about the program on the official GSoD home page. We’ve provided a few project ideas on our wiki, but you can also come up with your own. Our current documentation can be found here.

The deadline for technical writer applications is June 28. Come talk to us about your ideas and your application on our mailing list, IRC, or Slack. We’d love to hear from you!