Doctoral Thesis Proposal - Daiyaan Arfeen

June 17, 2025, 2:00pm - 3:30pm
Location: In Person - ASA Conference Room, Gates Hillman 6115
Speaker: DAIYAAN ARFEEN, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://csd.cmu.edu/people/doctoral-student/daiyaan-arfeen

Designing Scalable DNN Training Systems to Overcome Algorithmic Constraints

LLM training requires massive amounts of compute due to large model and dataset sizes, so it is not unusual to train LLMs on tens or hundreds of thousands of GPUs to complete training in a reasonable amount of time (days or weeks). However, GPU failures (which are common at these scales) and data dependencies (introduced by the training algorithms) can lead to severe GPU underutilization. In this talk, we present distributed LLM training systems that are efficient and fault-tolerant at these scales. We first present Nonuniform Tensor Parallelism (NTP), a technique that increases the fault tolerance of tensor-parallel training, thereby reducing the blast radius of GPU failures. NTP enables scale-up training with little to no loss in training efficiency under realistic rates of GPU failures. Next, we present PipeFill, a system that recovers GPU utilization (lost due to scale-out training) by filling pipeline bubbles with third-party latency-insensitive jobs. We will discuss how PipeFill could be extended to support filling pipeline bubbles with online inference jobs, which are latency-sensitive.

Thesis Committee:
Greg Ganger (Chair)
Zhihao Jia
Phillip B. Gibbons
Dheevatsa Mudigere (NVIDIA)