Compilers for Machine Learning
Scope
Machine learning applications are becoming ubiquitous in large-scale production systems. With that growth, and with the scaling of data volume and model complexity, efficiently executing machine learning models has become ever more important. The push for increased energy efficiency has led to the emergence of diverse heterogeneous system and accelerator architectures. In parallel, model complexity and diversity have pushed for higher-productivity systems, more powerful programming abstractions, type systems, language embeddings, frameworks, and libraries. Compilers have historically been the bridge between programmer efficiency and high performance, allowing code to remain understandable, portable, and easy to extend while producing high-performance executables for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. This goes both ways: given the broadening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks have also emerged as natural clients of machine learning techniques, from domain-specific heuristics to autotuning.
This workshop aims to highlight cutting-edge work and research that applies compiler techniques and algorithms to optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack, and the workshop topics span from high-level abstract representations to code generation for accelerators. The invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.
Program
The workshop features 8 presentations from leading ML compiler experts in industry and academia. 6 posters will be displayed at the end of the workshop (together with the main conference's welcome and poster reception), with short talks introducing the posters in the last session.
Venue: The Westin Las Vegas Hotel & Spa
Room: Willow.
08:30-08:40 - Opening
08:40-10:00 - Session 1
Javed Absar, Samarth Narang, Muthu Baskaran; Qualcomm
Tensor Evolution: A framework for fast evaluation of Tensor Computations using recurrences
Tianqi Chen; NVIDIA and Carnegie Mellon University
Enable Large language model deployment across cloud and edge with ML Compilation
10:00-10:30 - Break
10:30-11:50 - Session 2
Ari Rasch, Richard Schulze; University of Muenster
MDH+ATF: Code Generation and Optimization for Deep Learning Computations
Marco Siracusa, Olivia Hsu, Víctor Soria-Pardos, Joshua Randall, Arnaud Grasset, Eric Biscondi, Douglas J. Joseph, Randy Allen, Fredrik Kjolstad, Miquel Moreto, Adrià Armejach; Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, Arm, Samsung, Stanford
An Architecture and a Compiler to Accelerate Embedding Operations in Machine Learning Models
12:00-13:00 - Lunch
13:00-15:00 - Session 3
Zachary Cetinic, Philip Lassen, Arash Taheri-Dezfouli, Abel Nieto; Groq
The Front-End of the Groq Deep Learning Compiler
Samuel J. Kaufman, René Just, Rastislav Bodik; University of Washington
Morello: Compiling Neural Networks with Dynamic Programming & Spatial Compression
Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhiru Zhang; Cornell University
Allo: Catalyzing Accelerator Design and Programming for Machine Learning
15:00-15:30 - Break
15:30-16:20 - Session 4 - Poster Lightning Talks
Yongin Kwon, Misun Yu, Jeman Park, Jemin Lee, Jinse Kwon; ETRI
Optimizing Collective Communication for Deep Learning in NUMA Systems: A Compiler-Based Approach
Charlie Lin, Richa Gadgil, Shivad Bhavsar, Umang Yadav, Ted Themistokleous, Aarushi Jain; AMD
MIGraphX: Deep Learning Inference Engine for AMD Hardware
Jeman Park, Misun Yu, Jemin Lee, Jinse Kwon, Yongin Kwon; ETRI
A Lightweight Deep Learning Backend for Edge Devices Optimized for Limited C Library Environments
Perry Gibson, Sam Kellett, Aran McConnell; Fractile Ltd.
Rapid Compiler Prototyping using MLIR Python Bindings: A Case Study from Fractile
Misun Yu, Jemin Lee, Jinse Kwon, Jeman Park, Yongin Kwon; ETRI
Dynamic Layer-Specific Overlapping for Efficient LLM Inference on Resource-Constrained Systems
Abstracts
Javed Absar (Qualcomm UK); Samarth Narang, Muthu Baskaran (Qualcomm) - Tensor Evolution: A framework for fast evaluation of Tensor Computations using recurrences
This paper introduces a new mathematical framework for the analysis and optimization of tensor expressions within an enclosing loop. Tensors are multi-dimensional arrays of values, common in the high performance computing (HPC) and machine learning domains. Our framework extends Scalar Evolution -- an important optimization pass implemented in both LLVM and GCC -- to tensors. Scalar Evolution (SCEV) relies on the theory of Chains of Recurrences for its mathematical underpinnings; we use the same theory for Tensor Evolution (TeV). While some concepts from SCEV, such as element-wise operations, map easily to TeV, tensors introduce new operations -- concatenation, slicing, broadcast, reduction, and reshape -- which have no equivalent for scalars and SCEV. Not all computations are amenable to TeV analysis, but it can play a part in the analysis and optimization passes of ML and HPC compilers. Moreover, for many mathematical and compiler ideas, applications go beyond what was initially envisioned once others build on them and take them further. We hope for a similar trajectory for the tensor-evolution concept.
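To make the idea concrete, here is a minimal sketch (not the paper's actual framework or API) of a chain-of-recurrences add-recurrence {base, +, step} extended element-wise to tensors: a loop that repeatedly adds a step tensor has the closed form base + k * step at iteration k, so the loop need not be executed.

```python
import numpy as np

def tev_add_recurrence(base, step, k):
    """Closed-form value of T after k iterations of T += step.

    This mirrors SCEV's {base, +, step} add-recurrence, applied
    element-wise to tensors (an illustrative sketch only)."""
    return np.asarray(base) + k * np.asarray(step)

def loop_version(base, step, k):
    """The loop the closed form replaces."""
    t = np.array(base, dtype=float)
    for _ in range(k):
        t = t + step
    return t

# Both paths agree: 10 additions of an all-3s tensor to an all-0s tensor.
base = np.zeros((2, 2))
step = np.full((2, 2), 3.0)
assert np.allclose(tev_add_recurrence(base, step, 10),
                   loop_version(base, step, 10))
```

Tensor-specific operations such as slicing or reshape inside the loop body are exactly what make the full analysis harder than this scalar-style case.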
Tianqi Chen (NVIDIA and Carnegie Mellon University) - Enable Large language model deployment across cloud and edge with ML Compilation
In this talk, we will discuss the lessons learned in building an efficient large language model deployment system for both server and edge settings. We will cover general techniques in machine learning compilation and system support for efficient structure generation. We will also discuss the future opportunities in system co-design for cloud-edge model deployments.
Ari Rasch, Richard Schulze (University of Muenster) - MDH+ATF: Code Generation and Optimization for Deep Learning Computations
Deep Learning (DL) is a popular approach in machine learning, known for its effectiveness in areas like speech recognition, image classification, and language generation. Our MDH approach allows DL scientists to define DL computations, such as Convolution (CONV) and Matrix Multiplication (MatMul), at a high level of abstraction. This means they can define these computations without needing to worry about the specific hardware or optimization details. However, MDH still captures the necessary information to generate high-performance code for different target architectures like GPUs and CPUs, as well as different data characteristics such as size and memory layout.
Internally, MDH generates a search space of optimized low-level representations for a DL computation. These low-level instances explicitly show data movement and parallelization optimizations, which are important for efficiently using the complex memory and core structures of modern parallel architectures. Because optimizations are clearly expressed in MDH's low-level representation, creating executable code (e.g., in CUDA, OpenMP, or OpenCL) from the representation is a straightforward and technical process.
We use our ATF auto-tuning framework to select a device- and data-optimized instance from MDH's optimization space. A key feature of ATF is its efficient handling of performance-critical parameters, also known as tuning parameters, that have constraints. For example, a tile size parameter on a higher memory layer needs to be related to a tile size parameter on a lower memory layer. ATF introduces new methods for generating, storing, and exploring the optimization spaces of these constrained tuning parameters.
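The inter-layer tile-size constraint described above can be illustrated with a small sketch (names are illustrative, not ATF's real API): instead of enumerating the full cross product of tile sizes and discarding invalid points, only pairs that satisfy the divisibility relation between memory layers are generated.

```python
def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def constrained_space(matrix_dim):
    """Enumerate only valid (l1_tile, l0_tile) pairs.

    Hypothetical example of interdependent tuning parameters:
    the cache-level tile l1 must divide the matrix dimension, and
    the register-level tile l0 must divide l1."""
    space = []
    for l1 in divisors(matrix_dim):   # higher memory layer
        for l0 in divisors(l1):       # lower layer, constrained by l1
            space.append((l1, l0))
    return space

space = constrained_space(64)
# Every generated point satisfies the constraints by construction,
# so the tuner never wastes evaluations on invalid configurations.
assert all(64 % l1 == 0 and l1 % l0 == 0 for l1, l0 in space)
```

Generating the space constraint-first like this, rather than filtering a full Cartesian product, is the kind of efficiency a constrained-parameter tuner needs as the number of layers grows.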
Our experiments demonstrate that our MDH+ATF code generation and optimization approach achieves promising results on GPUs and CPUs. These results are comparable to state-of-the-art approaches, including hand-optimized vendor libraries like NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN, polyhedral compilers like PPCG and Pluto, and the popular TVM compiler.
Marco Siracusa (Barcelona Supercomputing Center and Universitat Politècnica de Catalunya); Olivia Hsu (Stanford University); Víctor Soria-Pardos (Barcelona Supercomputing Center); Joshua Randall, Arnaud Grasset, Eric Biscondi (Arm); Douglas J. Joseph (Samsung); Randy Allen (Barcelona Supercomputing Center); Fredrik Kjolstad (Stanford University); Miquel Moreto (UPC/BSC); Adrià Armejach (Barcelona Supercomputing Center) - An Architecture and a Compiler to Accelerate Embedding Operations in Machine Learning Models
This work introduces a Decoupled Access-Execute (DAE) architecture and a DAE compiler to accelerate critical embedding operations in modern machine learning models. Models such as deep-learning recommendation models, sparse large-language models, and graph-learning models need to look up large numbers of embedding vectors from scattered memory locations. However, traditional architectures are not optimized for these irregular memory accesses, leaving performance on the table. Hence, we first designed a DAE multicore processor that, by offloading embedding lookups to specialized access units, achieves 5.8x higher performance than traditional out-of-order processors, and 2.5x higher performance and 1.6x higher performance/watt than GPUs. We then designed a DAE compiler, Ember, to compile MLIR embedding operations to this DAE processor, unlocking its performance potential at no programmability cost.
Zachary Cetinic, Philip Lassen, Arash Taheri-Dezfouli, Abel Nieto (Groq) - The Front-End of the Groq Deep Learning Compiler
This talk is an experience report on our ongoing work on the front-end of groq-compiler, a modern deep learning compiler targeting Groq's LPU architecture. Specifically, we describe the compiler's model ingestion pipeline, wherein a deep-learning model, written in a high-level language like Python, is translated into a representation suitable for optimization and subsequent compilation. We focus on lessons learned, challenges, and optimization opportunities.
Samuel J. Kaufman, René Just, Rastislav Bodik (University of Washington) - Morello: Compiling Neural Networks with Dynamic Programming & Spatial Compression
A persistent challenge in generating high-performance neural network implementations is coordinating the many interacting optimization decisions (i.e., it is a large combinatorial optimization problem). For example, buffer layout and instruction selection interact strongly but are usually decided greedily by different compiler phases. We address this problem with a search-based compiler built on a hierarchical IR and a compositional cost model. The IR decomposes program specifications into partial programs containing 'smaller' specifications. Each program has a cost which is affine in its sub-programs' costs, and a dynamic programming-style solver can straightforwardly find the program which minimizes that cost. Storing memoization tables for this high-throughput solver would be prohibitively expensive, so we employ a novel database representation which maps specifications to coordinates in Z^n and compresses identical solutions.
Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhiru Zhang (Cornell University) - Allo: Catalyzing Accelerator Design and Programming for Machine Learning
As the benefits of technology scaling diminish, specialized hardware accelerators are crucial for performance in emerging machine learning applications. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. While new accelerator design languages (ADLs) aim to enhance or replace HLS, they are typically more effective for simple applications with a single kernel, rather than for hierarchical designs with multiple kernels.
In the first part of this talk, we will introduce Allo, a composable programming model for efficient hardware accelerator design. Allo decouples hardware customizations -- including compute, memory, communication, and data types -- from the algorithm specification, and encapsulates them as a set of customization primitives that enable verifiable, stepwise optimizations. Allo also preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner, enabling both temporal and spatial composition.
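The decoupling idea can be sketched schematically (this is a generic illustration, not Allo's actual API): the algorithm is written once as plain code, and customizations are recorded separately as named primitives that can be composed and inspected step by step.

```python
def gemm(A, B):
    """Algorithm specification only: a plain matrix multiply,
    with no hardware-specific detail mixed in."""
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

class Schedule:
    """Hypothetical customization layer: records primitives such as
    loop splitting and pipelining separately from the algorithm,
    mimicking stepwise, composable optimization."""
    def __init__(self, fn):
        self.fn = fn
        self.customizations = []

    def split(self, axis, factor):
        self.customizations.append(("split", axis, factor))
        return self

    def pipeline(self, axis):
        self.customizations.append(("pipeline", axis))
        return self

# The algorithm stays untouched; the schedule is built separately.
s = Schedule(gemm).split("i", 8).pipeline("j")
assert s.customizations == [("split", "i", 8), ("pipeline", "j")]
```

Keeping the customization record separate from the kernel is what allows each optimization step to be verified against the unmodified algorithm.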
We will then illustrate how Allo optimizes large-scale designs with two case studies. First, we develop a spatial accelerator architecture for large language models (LLMs) and prototype it on an AMD U280 FPGA, demonstrating higher energy efficiency than NVIDIA GPUs in generative inference settings. In addition, we deploy a convolutional neural network (CNN) design using Allo on the AMD Ryzen AI Engine, achieving substantial speedups over prior approaches.
Organizers
Jacques Pienaar, Google
Gokcen Kestor, PNNL
Renato Golin, Intel Research
Tianqi Chen, CMU