Final Report for project: Domain decomposition, load balancing, and massively parallel solvers for the class of nonlocal models

By: Pranav Gadikar

Mentors: Patrick Diehl, Prashant K. Jha

**Project Proposal**

**Description of the problem**

Recently various nonlocal models have been proposed and applied for understanding of complex spatially multiscale phenomena in diverse fields such as solid mechanics, fluid mechanics, particulate media, directed self-assembly of Block-Copolymer, tumor modeling, etc. Unlike local models which are governed by partial differential equations such as heat equation, wave equation, nonlocal models have dependence over a larger length than the size of discretization. This makes the **partitioning of mesh** and **information sharing** among the processors more difficult. Also, the amount of data communicated among the processors is higher compared to what is expected in the local models. The challenge is also to not have **idle processors** during the communication step. It is our understanding that an efficient computational method to solve the nonlocal model is very important and will have impact in many fields. The algorithm developed for one field can be easily **applied and extended** to models in the other fields. Our main goal in this project is to highlight and address the key difficulties in designing massively parallel schemes for nonlocal models while keeping the model related complexity to minimum. To meet the previous goal and to fix the ideas, we use the nonlocal heat equation in 2D setting. We utilize a modern parallelization library, **HPX**, to develop the parallel solver.

We consider a nonlocal diffusion equation for a temperature field ** u** over a square domain

**D**=

**[0,1]**

**subjected to zero temperature conditions on its boundary and to a given heat source distribution within the domain. It is a first order transient equation in time. For simplification, we only consider an explicit time integration scheme, namely the forward Euler scheme. For spatial discretization, we consider uniform mesh of square elements of size**

^{2}**h**over domain

**D**and consider finite difference approximation. We let

**ϵ**> 0 be the nonlocal length scale which is typically 3-10 times of the mesh size

**h**. We specify zero temperature on all mesh nodes in the shaded area which is denoted by non-local boundary

**D**

**. Any mesh node**

_{c}**x**

**in**

_{i}**D,**see figure, only depends on mesh nodes in the green shaded region (ball of size

**ϵ**). For the discretization of time domain, we assume time steps of size

**Δt**over a given interval

**[0,T]**. We update the temperature at each mesh node

**x**

_{i}**every**

**Δt**step.

Thread level parallelism has been exploited previously to solve the problem on a single node. However, the dependence of a single point on a small set of near neighboring points hints for **data-level parallelism**. Distributing the workload among various compute elements and computing the parts of the domain independently can help to improve the performance. Here, the challenge is to perform an optimal domain decomposition to perform an efficient load balancing of the computation.

**Description of the solution**

The goal of the proposed solution is to achieve full distribution of workload starting from mesh partition to computation of fields at mesh nodes. The following goal is achieved step by step, starting from a simple sequential implementation of **1D nonlocal diffusion**. Ideas of 1D nonlocal serial implementation are extended to **2D nonlocal serial implementation**. The 2D nonlocal diffusion equation is made multi-threaded to enable concurrent processing within a single node in the **2D nonlocal asynchronous implementation**. The 2D nonlocal diffusion equation is then made completely distributed to exploit data level parallelism in **2D nonlocal distributed implementation**. The 2D nonlocal distributed implementation also implements a **load balancing** algorithm. **Domain decomposition** for the rectangle domain is implemented as a separate tool.

**Link to the repository**

**Progress Timeline**

**GSoC First Evaluation**

**Work Accomplished:**

Completed **serial implementation** of **numerical** and **analytical** solutions for **1d **and** 2d** nonlocal equations. The code has been tested with random test cases, by comparing the numerical and analytical solution results. As we move towards a finer mesh, it is expected that the difference between the analytical and numerical solution should reduce.

**Pull Requests:**

- https://github.com/nonlocalmodels/nonlocalheatequation/pull/1
- https://github.com/nonlocalmodels/nonlocalheatequation/pull/3

**Plot for ***dx ***(space discretization) versus error:**

*dx*

**Work Accomplished:**

Completed the **multithreaded version** of the **2D nonlocal** diffusion equation using various HPX constructs like **hpx::async** calls to functions and **hpx::futures** to extract asynchronously computed data. The entire domain is divided into smaller rectangle partitions to overcome delays due to the asynchronous functions calls . The code has been tested with random test cases, by comparing the numerical and analytical solution results.

**Pull Requests:**

**Strong scaling of asynchronous 2d nonlocal equation****:**

** ****Weak scaling of asynchronous of 2d nonlocal equation:**

**GSoC Second Evaluation**

**Work Accomplished:**

Completed **distributed implementation** of the **2d nonlocal** heat equation with a decomposition of the **domain into smaller rectangles**. Achieved full **data-level parallelism** for distribution of workload across various nodes. Each rectangle’s values for the numerical solution would be computed on a single node. However, each node can have multiple squares.

#### **Pull Requests:**

**Distributed scaling results:**

** ****Weak scaling of distributed implementation of 2d nonlocal equation:**

**Work Accomplished:**

Completed the integration of **GMSH** and **METIS** libraries to obtain an efficient domain decomposition of the coarse mesh. The main idea is to partition the coarse mesh i.e. the mesh created using smaller rectangles within the rectangle domain. Coarse mesh allows better I/O times as well as faster domain decomposition using **METIS API**. **METIS** library is used for **partitioning** whereas the **GMSH** library is used for **reading and interpreting** the mesh written in **.msh** format from the GMSH GUI.

**Pull Requests:**

**Plot for distributed scaling of 2d nonlocal equation with domain decomposition:**

**GSoC Third Evaluation**

**Work Accomplished:**

Currently, there is no **load balancing algorithm** readily available for the asynchronous many task systems. So the work involved devising a new load balancing algorithm and demonstrating its correctness on the use cases relevant to the 2d nonlocal equation. The accuracy of the load balancing algorithm has been demonstrated for cases when the amount of compute on each of the nodes is more than the data transfer times.

**Pull Requests:**

** ****Testing load balancing:**

- Refer to
**Readme.md**in the**Test Load balancing section**

**Post GSoC work**

At present, the load balancing algorithm works only when the data transfer time is lower than the computational time across the nodes. Post GSoC work involves devising and implementing an algorithm for the case when the data transfer times are comparable to the computational time across the nodes.