Compilers for Machine Learning
Scope
Machine learning applications are becoming ubiquitous in large-scale production systems. With that growth, and with the accompanying scaling of data volume and model complexity, the focus on efficiently executing machine learning models has become even greater. The push for increased energy efficiency has led to the emergence of diverse heterogeneous system and accelerator architectures. In parallel, model complexity and diversity have pushed for higher-productivity systems, more powerful programming abstractions, type systems, language embeddings, frameworks and libraries. Compilers have historically been the bridge between programmer efficiency and high performance, allowing developers to express code that remains understandable and productive to port and extend, while producing high-performance code for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. This goes both ways: given the broadening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks have also emerged as natural clients of machine learning techniques, from domain-specific heuristics to autotuning.
This workshop aims to highlight cutting-edge work and research that applies compiler techniques and algorithms to optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack, and the workshop topics span from high-level abstract representations down to code generation for accelerators. The invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.
Program
The workshop features 9 presentations from leading ML compiler experts from industry and academia. 5 posters will be displayed throughout the event, with presenters available at the breaks and during a dedicated poster session (prior to the main conference's welcome and poster reception).
08:30-08:40 - Opening
08:40-10:00 - Session 1 - End-to-end ML compiler flows
Bin Bao (Meta) - A deep dive on TorchInductor
Mahesh Ravishankar (Google) - IREE: MLIR-based compilation for Tensor-based programs
10:00-10:20 - Break & Poster Installation
10:20-12:20 - Session 2 - Software architecture and compiler construction
Li-Wen Chang (ByteDance AML) - Open Source Adoption Lessons and Improvements for a Production ML Compiler
[slides] Hong-Rong Hsu & Pen Li (SiFive) - Software Components and Methodology for Designing and Optimizing RISC-V ML Compilers
[slides] Harsh Menon, Anush Elangovan (Nod.ai) - Decomposable operators in IREE: Winograd Convolutions and Flash Attention
12:20-13:20 - Lunch & Posters
13:20-15:20 - Session 3 - Target- and domain-specific optimization
David Packwood, Robert Walker, Andrzej Warzynski (ARM) - VOSA - A high level IR to enable compilation of imaging and computer vision workloads
[slides] Gil Rapaport (Mobileye) - MLIR targeting Mobileye's EyeQ Heterogeneous Platform
[slides] Manish Gupta (Google) - Developing and Generating Optimal CUDA Kernels on NVIDIA Tensor Cores
15:20-15:40 - Break & Posters
15:40-16:20 - Session 4 - New application areas
[slides] Dumitru Potop-Butucaru (INRIA) - Dataflow reactive ML programming
16:20-17:20 - Posters & Discussion
Posters
Richard Schulze, Ari Rasch and Sergei Gorlatch (University of Münster)
Optimization for Deep-Learning Computations on GPUs via Multi-Dimensional Homomorphisms
Nicolas Bohm Agostini, David Kaeli (Northeastern University) and Antonino Tumeo (PNNL)
Efficient Mapping of ML applications on Heterogeneous Targets with the SODA Toolchain
Marius Brehler, Simon Camphausen (Fraunhofer IML), Hsin-I Cindy, Ben Vanik, Lun Dong and Stella Laurenzo (Google)
Compiling and Deploying Machine Learning Models to Bare-Metal Devices with TinyIREE
Sebastian Buschjäger (TU Dortmund)
FastInference: Model compilation for Embedded Systems
Hanwoong Jung, Gyeongmin Lee, Bongjun Kim, Seokyoung Yoon, Heewoo Nam, Hyunjun Kim, Dongguen Lim, Young H. Oh, Hyunjun Shin, Hexiang Ji, Wei Chen (Samsung Advanced Institute of Technology), Ilya Palachev, Ilya Afanasyev, Dmitry Ryabko (Samsung R&D Institute Russia) and Pengcheng Liang (Samsung R&D Institute Xi'an)
Universal Deep Learning Compiler (UDLC): The end-to-end compiler-driven software infrastructure for AI accelerators and systems
Abstracts
Bin Bao (Meta) - A deep dive on TorchInductor
This talk presents a deep dive into the design principles of TorchInductor, the PyTorch compiler backend, the lowering stack that it uses to transform PyTorch programs, and the optimization and code generation technologies that it employs.
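For readers unfamiliar with TorchInductor, the following minimal sketch (ours, not from the talk) shows the usual entry point, torch.compile with the "inductor" backend; the toy model and shapes are illustrative placeholders.

import torch

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 256)
        self.fc2 = torch.nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()
# TorchDynamo captures the graph; TorchInductor ("inductor", the default
# backend) lowers it and generates code for the captured program.
compiled = torch.compile(model, backend="inductor")
out = compiled(torch.randn(32, 128))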
Mahesh Ravishankar (Google) - IREE: MLIR-based compilation for Tensor-based programs
IREE is a compiler built entirely on MLIR that uses progressive lowering to compile programs written in frameworks like TensorFlow and PyTorch down to machine code that can be executed on many different hardware architectures. While today IREE supports CPU compilation (targeting x86, ARM and RISC-V) and GPU compilation (CUDA, ROCm and SPIR-V), it can easily be extended to target other hardware since all target architectures are layered below the Hardware Abstraction Layer (HAL). Being an open-source project, IREE has already been used by the community to deliver state-of-the-art performance for ML models on CPUs and GPUs, for both server and device deployments. Since most of the compilation stack is shared across devices, new backends automatically inherit much of the functionality supported by existing ones. IREE's compilation flow broadly contains three phases. The first phase partitions the program into atomic units of work known as dispatches, which represent computations to be performed on the device. The data dependences between these dispatches are used to compute a schedule that the host uses to launch them. The computation within a dispatch is then compiled down to executable code for the device executing that dispatch. Both the computation of the host-side launch schedule and the progressive lowering of the computation within a dispatch into target executable code are achieved through the use of different dialects in IREE and MLIR. This talk will describe the compilation flow in IREE through these different dialects, and show the performance of models compiled using IREE on different architectures.
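As a purely conceptual illustration (ours, not IREE code) of the host-side scheduling step described above, the Python sketch below derives a launch order from data dependences between hypothetical dispatches; the dispatch names are made up.

from graphlib import TopologicalSorter

# Each dispatch maps to the set of dispatches whose results it consumes.
dependences = {
    "dispatch_matmul": set(),
    "dispatch_bias_relu": {"dispatch_matmul"},
    "dispatch_softmax": {"dispatch_bias_relu"},
}

# Any order consistent with the dependences is a valid launch schedule.
launch_order = list(TopologicalSorter(dependences).static_order())
print(launch_order)  # ['dispatch_matmul', 'dispatch_bias_relu', 'dispatch_softmax']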
Li-Wen Chang (ByteDance AML) - Open Source Adoption Lessons and Improvements for a Production ML Compiler
Adopting an open-source compiler infrastructure into an existing production ML compiler is challenging and tricky. In this talk, I will share our experience of adopting MLIR and reforming our existing production ML compiler, and discuss the lessons we learned and the benefits we gained. Our compiler involves optimizations at multiple levels, across services, graphs, and loops, in order to deploy models automatically and seamlessly. In this talk, I will also cover at least one optimization at each level, take a deep dive into their technical needs and challenges, and present our corresponding solutions.
Hong-Rong Hsu & Pen Li (SiFive) - Software Components and Methodology for Designing and Optimizing RISC-V ML Compilers
RISC-V's open standard Instruction Set Architecture (ISA) has enabled numerous processor innovations. Notably, the Vector extension, compared to packed-SIMD and GPU implementations, offers variable vector length and high code density for efficient ML computation. RISC-V Vector (RVV) processors are shipping or being actively designed into SoCs from the smallest IoT devices to the largest cloud servers. To optimize ML compiler code-generation quality for these cutting-edge RVV processors, we propose a novel bidirectional iterative approach: code is first lowered top-down from the front ends to MLIR, followed by a bottom-up pass from the RVV LLVM backend and mathematical libraries up to MLIR, and this process is repeated several times. In just a few months of integration work, IREE, an ML compiler and runtime in Google's OpenXLA ecosystem, is able to demonstrate outstanding performance with multi-core scalability on SiFive's Intelligence X280. We will conclude this talk by sharing future enhancements and collaboration opportunities.
Harsh Menon, Anush Elangovan (Nod.ai) - Decomposable operators in IREE: Winograd Convolutions and Flash Attention
Existing ML compilation pipelines in IREE focus on having a primary compute operator, such as a matrix multiplication or convolution, followed by elementwise operations within a single dispatch region. While this approach covers most existing ML use cases, in this talk we go over two algorithms that break it: Winograd convolutions and Flash Attention. We describe the algorithms, how we represented them in MLIR, and their specific code generation strategies. We present some preliminary results and discuss a way of unifying these operators under a general interface.
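As background for the first of these algorithms, the short NumPy sketch below (ours, not the IREE implementation) shows the classic 1-D Winograd F(2,3) decomposition: two outputs of a 3-tap filter computed with four elementwise multiplies instead of six, using the standard Lavin & Gray transform matrices.

import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    # y = A^T [ (G g) * (B^T d) ], where "*" is elementwise multiplication
    return AT @ ((G @ g) * (BT @ d))

d = np.random.rand(4)   # input tile of 4 samples
g = np.random.rand(3)   # 3-tap filter
direct = np.array([d[0:3] @ g, d[1:4] @ g])   # direct sliding-window result
assert np.allclose(winograd_f23(d, g), direct)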
David Packwood, Robert Walker, Andrzej Warzynski (ARM) - VOSA - A high level IR to enable compilation of imaging and computer vision workloads
Compiler technology for machine learning is an active area of work for many individuals and groups in both academia and industry. The usual input to such compilation is a neural network generated by ML frameworks such as TensorFlow. However, deployed imaging and computer vision applications often mesh these neural networks with more classical computer vision operations, often as pre- or post-processing steps. Currently, these CV operations may be supported by different frameworks or methodologies than the compiled ML network itself. We introduce VOSA (Vision Operator Set Architecture), a specification for an intermediate representation to support these computer vision operations, and its associated materialization as an MLIR dialect. By this mechanism, in concert with machine learning representations (such as TOSA), we hope to present to a compiler a more complete and holistic application description, where improvements in compiler technology can benefit both components and performance optimizations can be leveraged at the level of the complete pipeline. VOSA is an ongoing technology project. While it is still in its early days, we have already been able to use it to successfully compile various computer vision as well as mixed computer vision/machine learning pipelines. We will share the history and motivation in more detail, along with some lessons learned so far. We also hope to engage with the community to better understand what the most valuable next steps would be.
Gil Rapaport (Mobileye) - MLIR targeting Mobileye's EyeQ Heterogeneous Platform
In this talk we introduce our team's ongoing effort to build an MLIR-based compiler targeting Mobileye's EyeQ system on chip. This chip is composed of multiple accelerator types, ranging from general-purpose multi-threaded processor clusters through VLIW SIMD and CGRA machines to deep learning accelerators. Furthermore, these accelerators rely on software to manage their activations and resources, such as caches, in a functionally correct and performance-efficient manner, in order to cope with stringent validation and performance requirements. We describe our domain of programs, including TensorFlow and OpenCL, the compilation challenges in mapping such programs to EyeQ, and the role MLIR can play, specifically in legalizing and optimizing affine loop nests. We also describe our reasons for choosing C over LLVM IR as our output format for MLIR, and suggest extending this perspective in the other direction.
Manish Gupta (Google) - Developing and Generating Optimal CUDA Kernels on NVIDIA Tensor Cores
NVIDIA's Tensor Cores are at the core of every significant advancement in artificial intelligence, from speech and language processing to computer vision and robotics. The NVIDIA A100 Ampere architecture introduced exciting changes to Tensor Cores while also making them more accessible to developers. In this talk, we will provide a detailed explanation of NVIDIA Tensor Cores. Additionally, we will discuss how to implement an optimal CUDA kernel using Tensor Cores on the NVIDIA A100 by applying techniques for efficient data movement and software pipelining, which are essential for maximizing Tensor Core performance. Finally, we will describe the next-generation Tensor Cores and advancements in data movement techniques on the latest NVIDIA H100 Hopper architecture.
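To give a flavor of the software-pipelining idea mentioned above, here is a conceptual sketch in plain Python (not CUDA, and not from the talk): the next tile of operands is staged while the current tile is consumed, so data movement can overlap with compute on hardware. Tile size and shapes are arbitrary placeholders.

import numpy as np

def tiled_matmul(A, B, tile=4):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # "Prefetch" the first K-tile into a staging buffer.
    staged = (A[:, 0:tile].copy(), B[0:tile, :].copy())
    for k0 in range(0, K, tile):
        a_tile, b_tile = staged                    # consume the current tile
        nxt = k0 + tile
        if nxt < K:                                # issue the next load early
            staged = (A[:, nxt:nxt + tile].copy(), B[nxt:nxt + tile, :].copy())
        C += a_tile @ b_tile                       # compute on the current tile
    return C

A, B = np.random.rand(8, 16), np.random.rand(16, 8)
assert np.allclose(tiled_matmul(A, B), A @ B)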
Dumitru Potop-Butucaru (INRIA) - Dataflow reactive ML programming
Today's machine learning (ML) frameworks (TensorFlow, PyTorch...) and compilers (MLIR, Glow...) allow the specification and efficient implementation of deep neural networks (DNNs) with ever-increasing precision when applied to big data already stored in large databases. By contrast, in various applications, a clear need emerges for reactive ML applications that operate in a stateful fashion on streams of data, in time (in RNNs, attention-based approaches, reinforcement learning (RL), or even when scheduling convolutional networks in time to reduce memory requirements). Beyond this view focused on ML algorithmic design, the field of embedded ML, where ML components are placed in the feedback loop of real-time embedded control applications (along with data pre- and post-processing routines), is becoming both a reality and an industrial necessity. Existing ML frameworks and compilers cannot adequately handle the specification and implementation of these reactive aspects. The situation has reached a point where ML algorithmic innovation (which largely embraces approaches that process data in time, as in RL or in transformers) is largely distinct from embedded ML engineering, the latter focusing on simpler (often stateless) network architectures. Ad hoc remedies are proposed in the literature for specific and limited cases. For instance, the time dimension of RL applications is traversed using (interpreted) Python code unsuitable for an embedded implementation. Modifications to the Python front end likewise provide solutions for streaming applications (a sub-case of general-purpose reactivity).
To bridge between ML algorithmic research and reactive embedded implementation, instead of the existing ad hoc workarounds, we propose the direct integration of general-purpose reactiveness into the specification formalisms and compilers of ML frameworks. We have integrated low-level (imperative) and high-level (dataflow) synchronous reactive programming into MLIR. We first recall commonalities between dataflow synchronous languages and the static single assignment (SSA) form of general-purpose/ML compilers. We highlight the key mechanisms of synchronous languages that SSA does not cover—denotational concepts such as synchronizing computations with an external time base, cyclic and reactive I/O, as well as the operational notions of relaxing control flow dominance and the modeling of absent values. We discover that initialization-related static analyses and code generation aspects can be fully decoupled from other aspects of synchronous semantics such as memory management and causality analysis, the latter being covered by existing dominance-based algorithms of SSA-form compilers. We show how the SSA form can be seamlessly extended to enable all SSA-based transformations and optimizations on reactive programs with synchronous concurrency. We derive a compilation flow suitable for both high-performance and reactive aspects of a control application, by embedding the Lustre dataflow synchronous language into the SSA-based MLIR/LLVM compiler infrastructure. This allows the modeling of signal processing and deep neural network inference in the (closed) loop of feedback-directed control systems. Performance is not affected by the use of reactive modeling.
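For readers unfamiliar with the synchronous dataflow style mentioned above, the tiny Python sketch below (ours, purely illustrative and unrelated to the MLIR embedding described in the talk) mimics a Lustre-like resettable counter as a stream transformer: each output tick depends on the previous state, the kind of stateful, reactive computation this work targets.

def counter(resets):
    # At each tick: emit 0 on the first tick or on reset, else previous value + 1
    # (roughly "c = 0 -> if reset then 0 else pre(c) + 1" in Lustre syntax).
    prev = None
    for reset in resets:
        c = 0 if (prev is None or reset) else prev + 1
        prev = c
        yield c

print(list(counter([False, False, True, False])))  # [0, 1, 0, 1]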
Call for posters
We seek poster abstracts describing recent or ongoing research related to the topics of the C4ML workshop. All researchers and practitioners are welcome to submit their work for presentation at this workshop. Posters will not be published and can therefore present work in progress. There will be an in-person poster session during the workshop but no separate presentation slot.
Submit via HotCRP (contact posters@c4ml.org in case of any issues during submission). The format is 1-2 double-column pages (excluding references).
Important deadlines
Submission deadline for extended abstracts: January 5, 2023 (AoE)
Notification: January 12, 2023
Organizers
Albert Cohen, Google
Ayal Zaks, Mobileye
Diego Caballero, Google
Gokcen Kestor, PNNL
Jacques Pienaar, Google
Tatiana Shpeisman, Modular
Tianqi Chen, CMU and OctoML