Doctoral Thesis Proposal - Suhas Jayaram Subramanya

— 4:30pm

Location:
In Person - Reddy Conference Room, Gates Hillman 4405

Speaker:
SUHAS JAYARAM SUBRAMANYA, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://suhasjs.github.io/


Efficient job-resource co-adaptivity for deep learning workloads on large heterogeneous GPU clusters

The training performance of a deep learning (DL) job is determined by the number, type, and arrangement of its allocated GPUs, and by the job parameters (such as batch size and learning rate) used for execution. Modern clusters for DL training contain tens of thousands of GPUs of many types, and a cluster scheduler allocates GPUs to training jobs to maximize collective training progress in the cluster. Existing DL cluster schedulers cannot handle the large space of adaptivity choices (i.e., the combined space of GPU allocations and job parameters) for large, heterogeneous GPU clusters: many are not heterogeneity-aware, few are adaptivity-aware, and none scale to large clusters without sacrificing allocation fidelity and cluster efficiency.

In this thesis, we introduce (a) a scheduler to facilitate efficient job-resource adaptivity for DL training jobs on large heterogeneous GPU clusters, and (b) a method to scale optimization-based scheduling to much larger cluster sizes without sacrificing allocation fidelity and resource efficiency. Our adaptivity-aware scheduler, Sia, uses GPU resources judiciously to learn a job's training performance across different GPU types, and continuously co-optimizes the GPU allocation and job execution parameters to maximize cluster-wide training progress in heterogeneous GPU clusters. We then scale Sia to large cluster sizes by modeling the scheduling policy as a continuous optimization problem. We show that it is possible to augment the interface between a scheduler and the optimization problem solver to efficiently track changes to the scheduling problem arising from changing cluster conditions such as job arrivals, departures, and phase changes. We develop a prototype solver with the augmented interface for the Sia scheduling policy that can efficiently recover allocations for very large clusters. As an additional contribution, we observe that many other resource-allocation problems can also be formulated as continuous optimization problems and can be solved both quickly and efficiently using our proposed solver.

Thesis Committee:

Greg Ganger (Chair)
Zhihao Jia
Virginia Smith
Amar Phanishayee (Meta)
 

Additional Information

Event Website:
https://csd.cmu.edu/calendar/doctoral-thesis-proposal-suhas-jayaram-subramanya