GSoC 2020 – Massively Parallel Solvers

Final Report for project: Domain decomposition, load balancing, and massively parallel solvers for the class of nonlocal models

By: Pranav Gadikar

Mentors: Patrick Diehl, Prashant K. Jha

Project Proposal

Description of the problem

Recently various nonlocal models have been proposed and applied for understanding of complex spatially multiscale phenomena in diverse fields such as solid mechanics, fluid mechanics, particulate media, directed self-assembly of Block-Copolymer, tumor modeling, etc. Unlike local models which are governed by partial differential equations such as heat equation, wave equation, nonlocal models have dependence over a larger length than the size of discretization. This makes the partitioning of mesh and information sharing among the processors more difficult. Also, the amount of data communicated among the processors is higher compared to what is expected in the local models. The challenge is also to not have idle processors during the communication step. It is our understanding that an efficient computational method to solve the nonlocal model is very important and will have impact in many fields. The algorithm developed for one field can be easily applied and extended to models in the other fields. Our main goal in this project is to highlight and address the key difficulties in designing massively parallel schemes for nonlocal models while keeping the model related complexity to minimum. To meet the previous goal and to fix the ideas, we use the nonlocal heat equation in 2D setting. We utilize a modern parallelization library, HPX, to develop the parallel solver.

We consider a nonlocal diffusion equation for a temperature field u over a square domain D = [0,1]2 subjected to zero temperature conditions on its boundary and to a given heat source distribution within the domain. It is a first order transient equation in time. For simplification, we only consider an explicit time integration scheme, namely the forward Euler scheme. For spatial discretization, we consider uniform mesh of square elements of size h over domain D and consider finite difference approximation. We let ϵ > 0 be the nonlocal length scale which is typically 3-10 times of the mesh size h. We specify zero temperature on all mesh nodes in the shaded area which is denoted by non-local boundary Dc. Any mesh node xi in D, see figure, only depends on mesh nodes in the green shaded region (ball of size ϵ). For the discretization of time domain, we assume time steps of size Δt over a given interval [0,T]. We update the temperature at each mesh node xi every Δt step.

Thread level parallelism has been exploited previously to solve the problem on a single node. However, the dependence of a single point on a small set of near neighboring points hints for data-level parallelism. Distributing the workload among various compute elements and computing the parts of the domain independently can help to improve the performance. Here, the challenge is to perform an optimal domain decomposition to perform an efficient load balancing of the computation.

Description of the solution

The goal of the proposed solution is to achieve full distribution of workload starting from mesh partition to computation of fields at mesh nodes. The following goal is achieved step by step, starting from a simple sequential implementation of 1D nonlocal diffusion. Ideas of 1D nonlocal serial implementation are extended to 2D nonlocal serial implementation. The 2D nonlocal diffusion equation is made multi-threaded to enable concurrent processing within a single node in the 2D nonlocal asynchronous implementation. The 2D nonlocal diffusion equation is then made completely distributed to exploit data level parallelism in 2D nonlocal distributed implementation. The 2D nonlocal distributed implementation also implements a load balancing algorithm. Domain decomposition for the rectangle domain is implemented as a separate tool.

Link to the repository

Progress Timeline

GSoC First Evaluation

  1. Work Accomplished:

Completed serial implementation of numerical and analytical solutions for 1d and 2d nonlocal equations. The code has been tested with random test cases, by comparing the numerical and analytical solution results. As we move towards a finer mesh, it is expected that the difference between the analytical and numerical solution should reduce.

Pull Requests:

Plot for dx (space discretization) versus error:

  1. Work Accomplished:

Completed the multithreaded version of the 2D nonlocal diffusion equation using various HPX constructs like hpx::async calls to functions and hpx::futures to extract asynchronously computed data. The entire domain is divided into smaller rectangle partitions to overcome delays due to the asynchronous functions calls . The code has been tested with random test cases, by comparing the numerical and analytical solution results.

Pull Requests:

Strong scaling of asynchronous 2d nonlocal equation:

    Weak scaling of asynchronous of 2d nonlocal equation:

GSoC Second Evaluation

  1. Work Accomplished:

Completed distributed implementation of the 2d nonlocal heat equation with a decomposition of the domain into smaller rectangles. Achieved full data-level parallelism for distribution of workload across various nodes. Each rectangle’s values for the numerical solution would be computed on a single node. However, each node can have multiple squares.

    Pull Requests:

Distributed scaling results:

    Weak scaling of distributed implementation of 2d nonlocal equation:

  1. Work Accomplished:

Completed the integration of GMSH and METIS libraries to obtain an efficient domain decomposition of the coarse mesh. The main idea is to partition the coarse mesh i.e. the mesh created using smaller rectangles within the rectangle domain. Coarse mesh allows better I/O times as well as faster domain decomposition using METIS API. METIS library is used for partitioning whereas the GMSH library is used for reading and interpreting the mesh written in .msh format from the GMSH GUI.

Pull Requests:

Plot for distributed scaling of 2d nonlocal equation with domain decomposition:

GSoC Third Evaluation

  1. Work Accomplished:

Currently, there is no load balancing algorithm readily available for the asynchronous many task systems. So the work involved devising a new load balancing algorithm and demonstrating its correctness on the use cases relevant to the 2d nonlocal equation. The accuracy of the load balancing algorithm has been demonstrated for cases when the amount of compute on each of the nodes is more than the data transfer times. 

Pull Requests:

    Testing load balancing:

  • Refer to in the Test Load balancing section

Post GSoC work

At present, the load balancing algorithm works only when the data transfer time is lower than the computational time across the nodes. Post GSoC work involves devising and implementing an algorithm for the case when the data transfer times are comparable to the computational time across the nodes.

Leave a Reply

Your email address will not be published. Required fields are marked *