Ellango Jothimurugesan

Large-Scale Machine Learning over Streaming Data

Degree Type: Ph.D. in Computer Science
Advisor(s): Phillip B. Gibbons
Graduated: December 2022

Abstract:

This thesis introduces new techniques for efficiently training machine learning models over continuously arriving data to achieve high accuracy, even under changes in the data distribution over time, known as concept drift. First, we address the case of IID data with STRSAGA, an optimization algorithm based on variance-reduced stochastic gradient descent that can incorporate incrementally arriving data and efficiently converges to statistical accuracy. Second, we address the case of non-IID data over time with DriftSurf. Previous work on drift detection generally relies on threshold parameters that are difficult to set, making it less practical without prior knowledge of the magnitude and rate of change. DriftSurf improves the robustness of traditional drift detection tests through a stable-state/reactive-state process, and attains higher statistical accuracy whenever an efficient optimizer like STRSAGA is used. Third, we address the case of data that is non-IID both over time and across space, in the federated learning setting, with FedDrift. We empirically show that previous centralized drift adaptation methods and previous personalized federated learning methods are ill-suited under staggered drifts. FedDrift is the first algorithm explicitly designed for both dimensions of heterogeneity, and accurately identifies distinct concepts by learning a time-varying clustering, which enables collaborative training despite drifts. We show the presented algorithms are effective through theoretical competitive analyses and experimental studies that demonstrate higher accuracy on benchmark datasets over the prior state-of-the-art.

Thesis Committee:
Phillip B. Gibbons (Chair)
Gauri Joshi
Virginia Smith
Kevin Hsieh (Microsoft)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Keywords: Machine learning, streaming, concept drift, federated learning

CMU-CS-22-150.pdf (3.28 MB) (116 pages)