Computer Science Thesis Oral

Wednesday, November 30, 2022 - 10:00am to 12:00pm


In Person and Virtual - ET Reddy Conference Room, Gates Hillman 4405 and Zoom


PRATIK PRAMOD FEGADE, Ph.D. Candidate, Computer Science Department, Carnegie Mellon University

Auto-batching Techniques for Dynamic Deep Learning Computations

Deep learning is increasingly used across a range of domains. Dynamism, where the execution of a computation differs across inputs, has been shown to be important in enabling deep learning models to effectively model the varying structure of input data in these domains, thereby achieving high accuracy. However, dynamism often makes batching, an important performance optimization, difficult to apply. This thesis presents techniques to enable efficient auto-batching (automatically enabling batched execution for a computation) for dynamic deep learning computations. We consider two kinds of dynamism commonly exhibited by deep learning computations: control flow dynamism, where the computation involves control flow structures such as conditional statements and recursion, and shape dynamism, where the computation involves tensors of different shapes across different inputs. Past work has proposed a variety of approaches for solving this problem. However, past work is often characterized by significant fragmentation from a compilation and execution point of view.

Techniques often target individual components of the compilation/runtime stack without taking a holistic view of the entire stack, and hence the entire computation, into account. For instance, tensor kernels are often optimized in isolation, independent of the surrounding computation, while auto-batching techniques often rely primarily on either compile-time or runtime approaches, rather than on an end-to-end approach. Considering these limitations, this thesis attempts to remove the aforementioned fragmentation to enable efficient auto-batching. We rely on two insights: (1) hybrid static+dynamic analysis to exploit available parallelism while keeping runtime overheads low, and (2) allowing the flow of information across the compilation and execution of tensor operators and the surrounding computation. These insights enable us to obtain significant gains over past work. For instance, Cortex, a compiler specialized for recursive deep learning computations, achieves up to 14X faster inference than past work, while ACRoBat, an auto-batching framework that can handle unrestricted control flow, is up to 8.5X faster. Further, CoRa, a tensor compiler we designed for batched execution in the presence of shape dynamism, performs on par with highly hand-optimized implementations of the transformer model.
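As a rough illustration of why shape dynamism complicates batching, the sketch below batches variable-length sequences by padding them to a common length and measures how much of the batch is wasted on padding. This is a generic baseline written for this announcement, not the technique used by Cortex, ACRoBat, or CoRa; the function names `pad_batch` and `padding_waste` and the sample data are hypothetical.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[float]], pad: float = 0.0) -> Tuple[List[List[float]], List[int]]:
    """Pad variable-length sequences to one length so they form a rectangular batch."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    # Each sequence is extended with `pad` values up to the longest length.
    padded = [s + [pad] * (max_len - len(s)) for s in seqs]
    return padded, lengths


def padding_waste(lengths: List[int]) -> float:
    """Fraction of batch elements that are padding rather than real data."""
    total = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / total


# Hypothetical inputs with three different shapes (lengths 3, 1, and 2).
seqs = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded, lengths = pad_batch(seqs)
waste = padding_waste(lengths)
```

In this toy case a third of the batched elements are padding, which is computation spent on values that carry no information. The more the input shapes vary, the larger that fraction tends to become.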

Thesis Committee:

Todd C. Mowry (Co-Chair)

Phillip B. Gibbons (Co-Chair)

Tianqi Chen (Co-Chair)

Graham Neubig

Saman Amarasinghe (Massachusetts Institute of Technology)

In Person and Zoom Participation. See announcement.
