Colloquium on Artificial Intelligence Research and Optimization

Every first and third Wednesday of the month, at 1:00 pm CST, Zoom

Some of today’s most visible and, indeed, remarkable achievements in artificial intelligence (AI) have come from advances in deep learning (DL). The formula for the success of DL has been compute power – artificial neural networks are a decades-old idea, but it was the use of powerful accelerators, mainly GPUs, that truly enabled DL to blossom into its current form.

As significant as the impacts of DL have been, there is a realization that current approaches are merely scratching the surface of what might be possible and that researchers could more rapidly conduct exploratory research on ever larger and more complex systems – if only more compute power could be effectively applied.

There are three emerging trends that, if properly harnessed, could enable such a boost in compute power applied to AI, thereby paving the way for major advances in AI capabilities. 

  • Optimization algorithms based on higher-order derivatives are well-established numerical methods, offering superior convergence characteristics and inherently exposing more opportunities for scalable parallel performance than first-order methods commonly applied today. Despite their potential advantages, these algorithms have not yet found their way into mainstream AI applications, as they require significantly more powerful computational resources and must manage significantly larger amounts of data.
  • High-performance computing (HPC) brings more compute power to bear via parallel programming techniques and large-scale hardware clusters and will be required to satisfy the resource requirements of higher-order methods. That DL is not currently taking advantage of HPC resources is not due to lack of imagination or lack of initiative in the community.  Rather, matching the needs of DL systems with the capabilities of HPC platforms presents significant challenges that can only be met by coordinated advances across multiple disciplines.
  • Hardware architecture advances continue apace, with diversification and specialization increasingly being seen as a critical mechanism for increased performance. Cyberinfrastructure (CI) and runtime systems that insulate users from hardware changes, coupled with tools that support performance evaluation and adaptive optimization of AI applications, are increasingly important to achieving high user productivity, code portability, and application performance.

The colloquium brings together experts in the fields of algorithmic theory, artificial intelligence (AI), and high-performance computing (HPC), and aims to transform research in the broader field of AI and optimization. The first focus of the colloquium is distributed AI frameworks, e.g. TensorFlow, PyTorch, Horovod, and Phylanx. One challenge here is the integration of accelerator devices and support for a wide variety of target architectures, since recent supercomputers are increasingly heterogeneous, with some systems offering accelerator cards and others only CPUs. A framework should be easy to deploy and maintain and provide good portability and productivity, so abstractions and a unified API that hide the zoo of accelerator devices from users are important.

The second focus is higher-order algorithms, e.g. second-order methods and Bayesian optimization. These methods can yield higher accuracy but are more computationally intensive. We will look into both the theoretical and computational aspects of these methods.
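As a toy illustration of why second-order information helps (an assumed ill-conditioned quadratic, not an example from the colloquium): Newton's method rescales the step using curvature, while fixed-step gradient descent is throttled by the worst-conditioned direction.

```python
import numpy as np

# Toy ill-conditioned quadratic f(x) = 0.5 * x^T A x with condition number 100.
A = np.diag([1.0, 100.0])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def hess(x):
    return A  # constant Hessian for a quadratic

x_gd = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])

for _ in range(10):
    # First-order step: the fixed rate must stay below 2/L = 0.02 for stability.
    x_gd = x_gd - 0.009 * grad(x_gd)
    # Second-order step: solve H p = g instead of scaling g by a scalar rate.
    x_newton = x_newton - np.linalg.solve(hess(x_newton), grad(x_newton))

# On a quadratic, Newton lands on the minimizer in a single iteration,
# while gradient descent is still far away along the flat direction.
print(f"gradient descent: f = {f(x_gd):.3f}, Newton: f = {f(x_newton):.1e}")
```

The second-order step costs a linear solve per iteration, which is exactly the kind of dense linear algebra that parallel hardware handles well, hence the connection to HPC drawn above.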

______________________________________________________________________________

Confirmed Speakers

02/03/2021   J. Ram Ramanujam (*Introductory remarks)   LSU – CCT
02/03/2021   Hongchao Zhang                             LSU
02/17/2021   Sam Bentley (*Introductory remarks)        LSU – ORED
02/17/2021   Andrew Lumsdaine                           University of Washington
03/03/2021   John T. Foster                             UT Austin
03/17/2021   Anshumali Shrivastava                      Rice University
04/07/2021   Andy Hock                                  Cerebras Systems, Inc.
04/21/2021   Vijay Gadepally                            MIT
05/05/2021   Utham Kamath                               Groq
05/19/2021   Lexie Yang                                 Oak Ridge National Laboratory
06/02/2021   Cansu Canca                                AI Ethics Lab

Registration

Registration for the colloquium is free. Please complete your registration here: registration form

Local organizers

  • Patrick Diehl
  • Katie Bailey
  • Hartmut Kaiser
  • Bita Hasheminezhad
  • Mayank Tyagi

For questions or comments regarding the colloquium, please contact Bita Hasheminezhad.

Talks

Speaker:            Dr. Hongchao Zhang, Louisiana State University

Date:                  Wed 2/3 @ 1:00 pm CST

Title:                   Inexact proximal stochastic gradient method for empirical risk minimization

Abstract:            We will talk about algorithmic frameworks for an inexact proximal stochastic gradient method for solving empirical composite optimization, whose objective function is the sum of an average of a large number of smooth convex or nonconvex functions and a convex, but possibly nonsmooth, function. At each iteration, the algorithm inexactly solves a proximal subproblem constructed using a stochastic gradient of the objective function. Variance reduction techniques are incorporated to reduce the stochastic gradient variance. The main feature of these algorithms is to allow the proximal subproblems to be solved inexactly while still keeping global convergence with desirable complexity bounds. Global convergence and component gradient complexity bounds are derived for the cases where the objective function is strongly convex, convex, or nonconvex. Some preliminary numerical experiments indicate the efficiency of the algorithm.
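The iteration discussed in the abstract, in its simplest form, is a stochastic gradient step followed by a proximal step. Below is an illustrative sketch on an assumed L1-regularized least-squares problem, with the proximal subproblem solved exactly in closed form (the talk's framework notably allows inexact solves); all dimensions and step sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: f_i(x) = 0.5 * (a_i @ x - b_i)^2, h(x) = lam * ||x||_1.
n, d, lam = 200, 10, 0.1
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(n)

def soft_threshold(v, t):
    # Closed-form proximal operator of t*||.||_1 (solved exactly here;
    # the talk's framework allows this subproblem to be solved inexactly).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(d)
eta = 0.01
for k in range(2000):
    i = rng.integers(n)                          # sample one component function
    g = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient of f_i
    x = soft_threshold(x - eta * g, eta * lam)   # proximal (forward-backward) step

print(np.round(x, 2))  # approximately recovers the sparse x_true
```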

Bio:                     Hongchao Zhang received his PhD in applied mathematics from the University of Florida in 2006. He then held postdoctoral positions at the Institute for Mathematics and Its Applications (IMA) and the IBM T.J. Watson Research Center. He joined LSU as an assistant professor in 2008 and is now a professor in the Department of Mathematics and the Center for Computation & Technology (CCT) at LSU. His research interests are nonlinear optimization theory, algorithms, and applications.

______________________________________________________________________________

Speaker:           Dr. Andrew Lumsdaine, University of Washington

Date:                  Wed 2/17 @ 1:00 pm CST

Title:                   Second order optimizations for scalable training of DNNs

Abstract:           With the increasing availability of scalable computing platforms (including accelerators), there is an opportunity for more sophisticated approaches to deep network training to be developed and applied. Higher-order optimization methods (e.g., Hessian-based) can provide improved convergence rates compared to first-order approaches such as stochastic gradient descent, at the cost of increased resource requirements. However, higher-order approaches also expose increased opportunities for parallelism, such that the increased resource requirements are readily amortized across a scalable computing platform. The combination of improved convergence and increased parallelism promises much more rapid time to solution than is currently achieved. In this talk I present an overview of the landscape of second-order (and higher-order) methods, including initial experimental results.
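One way the extra parallelism of second-order methods shows up in practice is via Hessian-vector products: a conjugate-gradient solver can compute a Newton step without ever forming the Hessian, and each product costs roughly two gradient evaluations, which parallelize like ordinary backpropagation. The sketch below uses a made-up test function and finite-difference products; it is not the speaker's method.

```python
import numpy as np

# Matrix-free Newton step via Hessian-vector products: the Hessian is never
# formed; only H @ v is needed, which costs about two gradient evaluations.

def grad(x):
    # Gradient of the test function f(x) = sum(x**4)/4 + 0.5*||x||^2.
    return x**3 + x

def hvp(x, v, eps=1e-5):
    # Finite-difference approximation of the Hessian-vector product H(x) @ v.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

def cg_solve(x, b, iters=20):
    # Conjugate gradient for H(x) p = b using only Hessian-vector products.
    p = np.zeros_like(b)
    r = b.copy()
    d = r.copy()
    for _ in range(iters):
        Hd = hvp(x, d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        if np.sqrt(r_new @ r_new) < 1e-10:
            return p  # residual small enough: stop early
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p

x = np.full(5, 2.0)
for _ in range(10):
    x = x - cg_solve(x, grad(x))  # inexact Newton step

print(np.abs(x).max())  # near the minimizer at x = 0
```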

Bio:                     Andrew Lumsdaine is an internationally recognized expert in the area of high-performance computing who has made important contributions in many of the constitutive areas of HPC, including systems, programming languages, software libraries, and performance modeling. His work in HPC has been motivated by data-driven problems (e.g., large-scale graph analytics), as well as more traditional computational science problems. He has been an active participant in multiple standardization efforts, including the MPI Forum, the BLAS Technical Forum, the ISO C++ standardization committee, the oneAPI Technical Advisory Board, and the SYCL Advisory Panel. Open source software projects resulting from his work include the Matrix Template Library, the Boost Graph Library, and Open MPI.

______________________________________________________________________________

Speaker:            Dr. John Foster, University of Texas at Austin

Date:                  Wed 3/3 @ 1:00 pm CST

Title:                  Scientific Machine Learning (SciML): overview and discussion of applications in petroleum engineering

Abstract:            Scientific Machine Learning, or SciML, is a relatively new phrase used to describe the intersection of data science, machine learning, and physics-based computational simulation. SciML encompasses many ideas, including physics-informed neural networks, universal differential equations, and the use of synthetic data generated from physical simulators in training machine learning models for rapid decision making. This talk will give an overview of SciML using simple examples and discuss recent results from our investigations using SciML in petroleum engineering applications, specifically for reservoir simulation and drill string dynamics.
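One of the ingredients mentioned above, training a data-driven model on synthetic data from a physical simulator, can be sketched in a few lines (the toy ODE, parameter range, and polynomial surrogate below are assumptions for illustration, not from the talk):

```python
import numpy as np

# Sketch of the surrogate-modeling idea in SciML: use a cheap "physical
# simulator" to generate synthetic training data, then fit a fast data-driven
# surrogate that answers queries without re-running the solver.

def simulate(k, t_end=1.0, steps=1000):
    # Toy simulator: explicit Euler for dy/dt = -k*y, y(0) = 1.
    y, dt = 1.0, t_end / steps
    for _ in range(steps):
        y += dt * (-k * y)
    return y  # approximately exp(-k)

# Generate synthetic data over a range of the physical parameter k.
ks = np.linspace(0.1, 3.0, 50)
ys = np.array([simulate(k) for k in ks])

# Fit a simple polynomial surrogate y(k) ~ poly(k).
coeffs = np.polyfit(ks, ys, deg=5)
surrogate = np.poly1d(coeffs)

print(surrogate(1.0), np.exp(-1.0))  # surrogate vs. exact solution
```

In practice the simulator is expensive (e.g. a reservoir simulation) and the surrogate is a neural network, but the division of labor is the same.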

Bio:                     Before joining UT Austin, John was a faculty member in mechanical engineering at UTSA and was a Senior Member of the Technical Staff at Sandia National Laboratories. He received his BS and MS in mechanical engineering from Texas Tech University and PhD from Purdue University. He is a registered Professional Engineer in the State of Texas and the co-founder and CTO of Daytum, a tech-enabled professional education company for data science and machine learning targeting the energy industry.

______________________________________________________________________________

Speaker:            Dr. Anshumali Shrivastava, Rice University

Date:                  Wed 3/17 @ 1:00 pm CST

Title:                   SLIDE: Commodity Hardware is All You Need for Large-Scale Deep Learning

Abstract:            Current Deep Learning (DL) architectures are growing larger to learn from complex datasets. The trends show that the only sure-shot way of surpassing prior accuracy is to increase the model size, supplement it with more data, and follow up with aggressive fine-tuning. However, training and tuning astronomically sized models is time-consuming and stalls progress in AI. As a result, industries are increasingly investing in specialized hardware and deep learning accelerators like GPUs to scale up the process. It is taken for granted that commodity CPU hardware is incapable of outperforming powerful accelerators such as V100 GPUs in a head-to-head comparison of training large DL models. However, GPUs come with additional concerns: expensive infrastructure changes, difficulty of virtualization, and main memory limitations.

In this talk, I will demonstrate the first algorithmic progress that challenges the common knowledge prevailing in the community that specialized processors like GPUs are significantly superior to CPUs for training large neural networks. The algorithm is a novel alternative to traditional matrix-multiplication-based backpropagation. We will show how data structures, particularly hash tables, can reduce the number of multiplications associated with the forward pass of a neural network. The very sparse nature of the updates uniquely allows for an asynchronous data-parallel gradient descent algorithm. A C++ implementation with multi-core parallelism and workload optimization on CPU is anywhere from 4-15x faster than the most optimized implementations of TensorFlow on the best available V100 GPUs in head-to-head comparisons. The associated task is training a 200-million-parameter neural network on Kaggle Amazon recommendation datasets.
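To make the hash-table idea concrete, here is a deliberately simplified sketch using a single SimHash table (SLIDE itself uses multiple tables and other hash families; all sizes and names below are invented for the example):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)

# Simplified sketch: hash each neuron's weight vector once, then for an input
# activate only neurons whose hash collides with the input's hash, instead of
# computing all n_neurons dot products.

d, n_neurons, n_bits = 64, 10_000, 12
W = rng.standard_normal((n_neurons, d))      # one weight row per neuron
planes = rng.standard_normal((n_bits, d))    # random hyperplanes for SimHash

def simhash(v):
    # Sign pattern of v against the random hyperplanes, packed into an int key.
    bits = (planes @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Build the hash table: bucket -> list of neuron ids (redone per weight update).
table = defaultdict(list)
for j in range(n_neurons):
    table[simhash(W[j])].append(j)

x = rng.standard_normal(d)
active = table[simhash(x)]       # candidate neurons for this input
activations = W[active] @ x      # sparse forward pass: len(active) << n_neurons

print(len(active), "of", n_neurons, "neurons activated")
```

Because each input touches only a tiny, input-dependent subset of neurons, the resulting gradient updates are sparse, which is what enables the asynchronous data-parallel training described in the abstract.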

Bio:                     Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. In 2018, Science News named him one of the top 10 scientists under 40 to watch. He is a recipient of the National Science Foundation CAREER Award, a Young Investigator Award from the Air Force Office of Scientific Research, and a machine learning research award from Amazon. He has won numerous paper awards, including the Best Paper Award at NIPS 2014 and the Most Reproducible Paper Award at SIGMOD 2019. IEEE Spectrum describes his work on scaling up deep learning as "stunning." InvestorPlace considers the SLIDE algorithm one of the biggest threats to NVIDIA stock.

______________________________________________________________________________

Speaker:            Dr. Andy Hock, Cerebras Systems, Inc.

Date:                  Wed 4/7 @ 1:00 pm CST

Title:                   AI Research and Optimization at Wafer-Scale

Abstract:           Deep learning and artificial intelligence (AI) have great potential for a wide variety of scientific and industry applications. However, modern AI research is constrained by compute: state-of-the-art deep learning models often take days or weeks to train, even on large clusters of legacy, general-purpose processors such as graphics processing units (GPUs). These machines are suitable, but not optimal, for AI work. We need a new compute solution to accelerate time to solution, reduce the cost of curiosity, and unlock new research and applications.

Cerebras has developed a new processor and system that is able to accelerate AI compute by orders of magnitude beyond GPU, training models in minutes or hours that previously took days or weeks. At the heart of our system is the Wafer-Scale Engine (WSE) – the largest chip ever built and the most powerful processor available for deep learning. The WSE is massive, more than 56x larger than the largest chip built previously. With 400,000 cores, 18GB of fast on-chip SRAM, and a high-bandwidth, low-latency, software-configurable on-chip network, the WSE delivers cluster-scale compute within a single device programmable via familiar ML frameworks such as TensorFlow. Housed in a standard datacenter-compatible server called the CS-1, our novel solution enables AI research at previously impossible speeds and scale.

In this talk we will provide an introduction to the Cerebras WSE processor and CS-1 system, and discuss its implications for AI research and optimization.

Bio:                     Dr Andy Hock is VP of Product at Cerebras. Andy came to Cerebras from Google, where he led Data and Analytics Product for the Terra Bella project, using deep learning and AI to create useful data for maps and enterprise applications from satellite imagery. Computation speed was a problem for this work, and at Cerebras, Andy saw an opportunity to help deliver the right compute solution for deep learning and AI. Before Google, Andy was a Senior Scientist and Senior Technical Program Manager at Arete Associates, where he led research for image processing algorithm development. He has a PhD in Geophysics and Space Physics from UCLA and a BA in Astronomy-Physics from Colgate University.

______________________________________________________________________________

Speaker:            Dr. Utham Kamath, Groq

Date:                  Wed 5/5 @ 1:00 pm CST

Title:                   The Groq Tensor Streaming Processor (TSP) and the Value of Deterministic Instruction Execution

Abstract:           The explosion of machine learning and its many applications has motivated a variety of new domain-specific architectures to accelerate deep learning workloads. The Groq Tensor Streaming Processor (TSP) is a functionally sliced microarchitecture with memory units interleaved with vector and matrix functional units. This architecture takes advantage of the dataflow locality of deep learning operations. The TSP is built on two key observations: (1) machine learning workloads exhibit abundant data parallelism, which can be readily mapped to tensors in hardware, and (2) a deterministic processor with a stream programming model enables precise reasoning about and control of hardware components to achieve good performance and power efficiency.

The TSP is designed to exploit the parallelism inherent in machine-learning workloads, including instruction-level parallelism, memory concurrency, and data and model parallelism. It guarantees determinism by eliminating all reactive elements in the hardware, for example, arbiters and caches. Instruction ordering is entirely software-controlled; the underlying hardware cannot reorder events, and they must complete in a fixed amount of time. This has several consequences for system design: zero-variance latency, low latency, high throughput at batch size 1, and reduced total cost of ownership (TCO) for data centers with diverse service level agreements (SLAs).

Early ResNet50 image classification results demonstrate 20.4K processed images per second with a batch size of one, a 4x improvement compared to other modern GPUs and accelerators. The first ASIC implementation of the TSP architecture yields a computational density of more than 1 TOp/s per square mm of silicon; the TSP is a 25x29mm 14nm chip operating at a nominal clock frequency of 900MHz. In this talk we discuss the TSP and the design implications of its architecture. The talk will cover our work published at ISCA 2020.

Bio:                     Utham Kamath is Director of Machine Learning Systems at Groq where he works on the implementation and optimization of ML models for Groq’s hardware and performance analysis of ML workloads. He has twenty years of industry experience including technical and management roles at Qualcomm, Atheros Communications and Hewlett Packard. He has a Bachelor’s degree in Engineering from Bangalore University and an MS and PhD from the University of Southern California.

______________________________________________________________________________

Speaker:            Dr. Lexie Yang, Oak Ridge National Laboratory

Date:                  Wed 5/19 @ 1:00 pm CST

Title:                   Scaling Geospatial Artificial Intelligence: Opportunities and Challenges

Abstract:           The increasing accessibility of geospatial imagery and rapid advances in AI continue to drive a surge in the adoption of GeoAI systems. From mapping to real-time monitoring to solving long-standing geoscience problems, GeoAI plays a critical role in revolutionizing how geospatial data can be transformed into actionable knowledge in our daily lives. While holding great promise, the tremendous amount of imagery describing the surface of the Earth every day entails challenges in collecting, refining, analyzing, and curating those data. These challenges necessitate integrating advances in computing technologies to deliver actionable geo-knowledge with scalable approaches.

In this talk, we will review current GeoAI advances and several demonstrations of large-scale and impactful applications. We will also present several key research directions to foster end-to-end, scalable GeoAI systems that address societal challenges.

Bio:                     H. Lexie Yang is a lead staff scientist in the GeoAI Group at Oak Ridge National Laboratory. Her research interests focus on advancing high-performance computing and machine learning approaches for geospatial data analysis. She has collaborated with esteemed scholars on NASA AIST, NSF, and DOE sponsored projects and currently leads several AI-enabled geoscience data analytics projects with large-scale, multi-modality geospatial data. Recent work from her team has been widely used by agencies to support national-scale disaster assessment and management. She received her PhD in Civil Engineering from Purdue University in 2014.

______________________________________________________________________________