Doctoral Thesis Oral Defense - Suhas Jayaram Subramanya

— 2:00pm

Location:
In Person and Virtual - ET - Traffic21 Classroom, Gates Hillman 6501 and Zoom

Speaker:
SUHAS JAYARAM SUBRAMANYA, Ph.D. Candidate, Computer Science Department, Carnegie Mellon University
https://suhasjs.github.io/

Efficient and Responsive Job-Resource Co-adaptivity for Deep Learning Workloads in Large Heterogeneous GPU Clusters

Existing cluster schedulers face many limitations in scheduling adaptive deep learning training jobs on large heterogeneous GPU clusters – many are not heterogeneity-aware, few are adaptivity-aware, and none scale to large clusters without sacrificing allocation fidelity or cluster efficiency. Emerging clusters further complicate this problem with larger, more heterogeneous resources running more, and increasingly diverse, jobs with more dimensions of adaptivity.

This thesis develops new scheduling approaches and algorithms that can (1) scale to emerging clusters with hundreds of thousands of GPUs and many GPU types, (2) quickly optimize high-fidelity allocations for adaptive DL training jobs with low scheduler overhead, and (3) efficiently adapt to changing cluster conditions to improve goodput on the limited GPU resources.

We first introduce Sia, a round-based scheduler that efficiently optimizes adaptive jobs in a heterogeneous cluster with many GPU types. Sia uses GPU resources judiciously to gather information on job-GPU fit-levels using a mix of online and offline profiling, and continuously co-optimizes the GPU resources allocated to jobs and their execution parameters at runtime to maximize cluster-wide training progress. Using job traces derived from real-world data centers, we find that Sia’s allocations are fair and efficient, and are quickly computed using an efficient formulation, even for 1000-GPU clusters.

Second, we introduce continual optimization, a new paradigm that explicitly models the slow evolution of resource-allocation problems at scale to reduce solver runtime for quick responses to changes in jobs or resources. We then introduce COpter, our approach to continual optimization that (a) efficiently updates the optimization problems for job and resource changes using a differential interface, (b) implements a factorization-free warm-started LP solver to benefit from the slowly evolving nature of the allocations, and (c) implements lightweight heuristics to recover feasible integral solutions with negligible quality loss. In our evaluations, COpter speeds up Sia's scheduling policy by a few orders of magnitude on clusters with tens of thousands of GPUs without sacrificing job completion times or makespan.

Third, COpter is easily applied to resource-allocation problems in other domains (e.g., shard load-balancing, WAN traffic engineering), where we see 57–83× reductions in solver runtimes.

Thesis Committee

Gregory Ganger (Chair)
Zhihao Jia
Virginia Smith
Amar Phanishayee (Meta Platforms Inc.)

In Person and Zoom Participation.  See announcement.

