The STE||AR Group’s research is guided by five principles necessary to achieve exascale computing:
Scalability
To run at exascale, our applications must use the resources available to them efficiently. The current programming paradigm suffers from four factors that we call SLOW: Starvation, Latencies, Overheads, and Waiting for contention. We propose to overcome these barriers to scalability by combining techniques such as an active global address space, fine-grained parallelism, and lightweight synchronization. Our current research in this area focuses on using runtime systems to expose these features. To learn more about our work on scalability, please see our work with HPX.
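As a rough illustration of fine-grained parallelism with lightweight synchronization, the sketch below splits a sum into small tasks and joins them through futures. It uses only standard C++ (`std::async` and `std::future`) and is not HPX code; the function name `parallel_sum` and the `grain` parameter are hypothetical and chosen only for this example.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data[first, last) by recursively splitting the
// range into tasks until a chunk holds at most `grain` elements.
long long parallel_sum(const std::vector<int>& data,
                       std::size_t first, std::size_t last,
                       std::size_t grain) {
    if (grain == 0) {
        grain = 1;  // guard against endless splitting
    }
    if (last - first <= grain) {
        // Small enough: do the work sequentially.
        return std::accumulate(data.begin() + first, data.begin() + last, 0LL);
    }
    std::size_t mid = first + (last - first) / 2;
    // Launch the left half as a separate task; the future is the only
    // synchronization point between the two halves.
    std::future<long long> left = std::async(
        std::launch::async,
        [&data, first, mid, grain] { return parallel_sum(data, first, mid, grain); });
    // Compute the right half on the current thread in the meantime.
    long long right = parallel_sum(data, mid, last, grain);
    return left.get() + right;
}
```

Calling `parallel_sum(v, 0, v.size(), 100)` on a vector `v` sums all of its elements. The grain size trades task-creation overhead against available parallelism, which is the kind of balance between Overheads and Starvation that SLOW describes.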
Programmability
As applications grow in complexity, it is important to design abstractions that allow complex operations to be stated plainly. These abstractions must allow domain scientists, who are not experts in HPC, to write simple codes that can be parallelized automatically. We approach programmability by building libraries and application frameworks on top of scalable underlying runtime systems. This allows the scalability exposed by the runtime system to be encapsulated in the manner most useful to a particular domain. The STE||AR Group currently develops the following application frameworks:
- LibGeoDecomp
- Octopus
Performance Portability
Performance portability is a crucial feature for future applications. No one wants to rewrite their code for every machine that an application needs to run on. By keeping performance portability in mind during the design of the software stack, we believe that we can run the same code on different machines and maintain efficient utilization of the hardware. In our work with HPX, we have demonstrated performance portability by running the same code on different operating systems (Linux, Windows, Mac, and Blue Gene/Q) and architectures (x86, k1om, and PowerPC A2). In addition, we have taken the same code written for a traditional host core and compiled and run it on a Xeon Phi accelerator with excellent scaling results¹. Our current research in this area includes:
- HPXCL: A library that supports percolation via OpenCL and CUDA
- Legacy Migration: Providing a way to run legacy code within an HPX application
Resilience
Time-to-failure of future machines will be unprecedentedly short. This requires computer scientists to come up with a system that can recover from hardware failures and errors without disrupting the running application. Of the five principles, resilience is arguably the most difficult to tackle. Our current work in this area deals with how to store the states of an application. We believe that by using an object-oriented file system we will be able to store the states of a simulation by having each object know which other objects it depends on. With this capability we will be able to bring back previously saved states of a simulation in order to recover from the loss of hardware.
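To make the idea of dependency-aware state storage more concrete, here is a minimal in-memory sketch, assuming each saved object carries a list of the objects it depends on. It is not the object-oriented file system itself; the names `CheckpointStore`, `SavedObject`, `save`, and `restore_order` are hypothetical and exist only for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: a serialized state plus the ids it depends on.
struct SavedObject {
    std::string state;
    std::vector<std::string> depends_on;
};

// Hypothetical store that remembers saved objects and their dependencies.
class CheckpointStore {
public:
    void save(const std::string& id, std::string state,
              std::vector<std::string> deps) {
        objects_[id] = SavedObject{std::move(state), std::move(deps)};
    }

    // Returns the ids that must be restored to bring back `id`,
    // ordered so that every dependency comes before its dependents.
    std::vector<std::string> restore_order(const std::string& id) const {
        std::vector<std::string> order;
        std::set<std::string> visited;
        visit(id, visited, order);
        return order;
    }

private:
    void visit(const std::string& id, std::set<std::string>& visited,
               std::vector<std::string>& order) const {
        if (!visited.insert(id).second) {
            return;  // already handled (this also breaks cycles)
        }
        auto it = objects_.find(id);
        if (it == objects_.end()) {
            return;  // never saved: nothing to restore
        }
        for (const std::string& dep : it->second.depends_on) {
            visit(dep, visited, order);
        }
        order.push_back(id);
    }

    std::map<std::string, SavedObject> objects_;
};
```

In this sketch, recovering one object after a hardware loss means restoring only the objects returned by `restore_order`, rather than the whole simulation.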
Energy Efficiency
Some of the largest obstacles to exascale simulations are related to the large amount of power needed to run the machines. The cost of power alone will demand that our applications be power efficient. In some cases, future applications will need to react to a changing power environment, perhaps to take advantage of changing electricity prices or to allocate power to different areas of the machine. Currently, we are working with our applications so that they can better utilize the hardware they are running on. In this way, less energy is wasted on idling hardware. Future research, however, must look at how to take hardware information such as core temperature, clock rate, and power consumption and use it to drive policy engines that allow the user to define what power information an application should react to.
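One possible shape for such a policy engine is sketched below, assuming the user registers rules that pair a condition on sampled hardware data with a named action. This is an illustration of the idea only, not an existing component; `HardwareSample`, `PolicyEngine`, `add_rule`, `evaluate`, and the thresholds in the usage note are all hypothetical.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot of the hardware information named in the text.
struct HardwareSample {
    double core_temp_c;
    double clock_ghz;
    double power_watts;
};

// Hypothetical engine: user-defined rules decide which samples matter.
class PolicyEngine {
public:
    using Condition = std::function<bool(const HardwareSample&)>;

    // Register an action name and the condition under which it applies.
    void add_rule(std::string action, Condition when) {
        rules_.push_back(Rule{std::move(action), std::move(when)});
    }

    // Returns the names of all actions whose condition holds for `sample`,
    // in the order the rules were registered.
    std::vector<std::string> evaluate(const HardwareSample& sample) const {
        std::vector<std::string> actions;
        for (const Rule& rule : rules_) {
            if (rule.when(sample)) {
                actions.push_back(rule.action);
            }
        }
        return actions;
    }

private:
    struct Rule {
        std::string action;
        Condition when;
    };
    std::vector<Rule> rules_;
};
```

A user might register a rule such as `engine.add_rule("throttle", [](const HardwareSample& s) { return s.core_temp_c > 85.0; })` and let the application respond to whatever `evaluate` returns for each new sample.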
¹T. Heller, H. Kaiser, A. Schäfer, and D. Fey, Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers, ScalA ’13, Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Article No. 1, 2013.