Monday, May 10, 2021 - 4:30pm to 5:30pm
Location: Virtual Presentation - ET Remote Access - Zoom
Speaker: BENJAMIN BERG, Ph.D. Student http://bsb20.github.io/
A Principled Approach to Parallel Job Scheduling
Parallelizable workloads are ubiquitous and appear across a diverse array of modern computer systems. Data centers, supercomputers, machine learning clusters, distributed computing frameworks, and databases all process jobs designed to be parallelized across many servers or cores. A job receives some speedup from being parallelized across additional servers or cores, allowing it to complete more quickly. However, jobs generally cannot be perfectly parallelized and receive diminishing returns from being allocated additional servers. Hence, given a fixed number of servers, it is not obvious how to allocate servers across a set of jobs in order to reduce the jobs' response times — the time from when each job arrives at the system until it is completed. While this question has been considered by the worst-case scheduling community, existing results are hampered by strong lower bounds on worst-case performance and tend to suggest policies that do not work well in practice. The goal of this thesis is to develop and analyze practical policies for scheduling parallelizable jobs using the tools of performance modeling and stochastic analysis.
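The tension described above can be made concrete with a toy example. The sketch below (my own illustration, not from the talk) uses an Amdahl's-law-style speedup s(k) = 1/(f + (1-f)/k), where f is a hypothetical serial fraction of a job; the specific value f = 0.3 is an assumption chosen for illustration.

```python
# Toy illustration of diminishing returns from parallelization.
# Assumption: Amdahl's-law-style speedup with hypothetical serial fraction f.

def speedup(k, f=0.3):
    """Speedup of a job running on k servers with serial fraction f."""
    return 1.0 / (f + (1.0 - f) / k)

# The marginal benefit of each additional server shrinks:
gains = [speedup(k) - speedup(k - 1) for k in range(2, 6)]
print(gains)  # each gain is smaller than the last

# Allocation matters: with 2 identical jobs (size 1) and 8 servers,
# compare running the jobs one after another on all 8 servers versus
# splitting the servers 4/4 so both jobs run concurrently.
run_serially = (1 / speedup(8) + 2 / speedup(8)) / 2  # job 2 waits for job 1
split_evenly = 1 / speedup(4)                         # both finish together
print(run_serially, split_evenly)  # mean response time of each schedule
```

For this serial fraction, splitting the servers gives the lower mean response time, but with a more parallelizable job (say f = 0.1) running the jobs serially wins instead. This sensitivity to the speedup curve is exactly why the allocation question is nontrivial.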
Our approach to developing new scheduling policies for parallelizable jobs is threefold. First, we develop new stochastic models of parallelizable jobs running in a multi-server system. Second, we analyze these new models using the tools of stochastic performance modeling, showing that a stochastic analysis yields scheduling policies that outperform the policies suggested by the worst-case literature. Finally, we validate our theoretical models through both simulation and real-world implementation to show that the scheduling policies we derive work well in practice. We apply this methodology to a variety of practical settings where systems are required to schedule parallelizable jobs. We consider the case where job sizes are completely unknown to the system, the case where job sizes are perfectly known to the system, and the case where job sizes are known to follow general distributions. Similarly, we consider the case where all jobs are assumed to have the same scaling behavior, the case where some jobs are more parallelizable than others, and the case where a job's parallelizability can change over time.
Mor Harchol-Balter (Chair)
Gregory R. Ganger
Christos Kozyrakis (Stanford University)
Zoom Participation. See announcement.