Wednesday, May 1, 2019 - 12:00pm to 1:30pm
Location: 3305 Newell-Simon Hall
Speaker: JUN WOO PARK, Ph.D. Student https://junwoo.me/
Distribution-based cluster scheduling
Modern computing clusters support a mixture of diverse activities, including customer-facing internet services, software development and testing, scientific research, and exploratory data analytics. Many schedulers exploit knowledge of pending jobs' runtimes and resource usage as a powerful building block, but suffer significant performance penalties when that knowledge is imperfect. This dissertation demonstrates that schedulers that rely on information about job runtimes and resource usage can address imperfect predictions more robustly by considering the likelihoods of possible outcomes rather than single-point expected outcomes.
This dissertation presents a workload analysis and two case studies of scheduling systems: 3Sigma and a resource-runtime distribution-based scheduler. Characterization of real workloads reveals inherent variability in job runtimes and resource usage that single-point estimates cannot capture. An evaluation of a history-based runtime predictor on four different traces demonstrates that obtaining perfect runtime predictions for real workloads is not trivial, especially when the predictor is given insufficient information. 3Sigma is a scheduler that leverages distributions of the relevant runtime histories rather than just a point estimate derived from them. By leveraging distributions and mis-estimate mitigation mechanisms, 3Sigma makes more robust scheduling decisions and outperforms state-of-the-art scheduling systems that rely on only limited or no runtime knowledge. The resource-runtime distribution scheduler is a system that leverages the distribution of resource usage (CPU, memory, and CPU time) and accounts for the risk of contention to make robust scheduling decisions. The evaluation of the scheduler demonstrates that leveraging full history and mitigation mechanisms allows it to address imperfect predictions more robustly and to perform almost as well as a hypothetical system equipped with perfect knowledge of runtime and resource usage.
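The contrast between a single-point estimate and a distribution of past runtimes can be illustrated with a minimal sketch. It is not 3Sigma's actual algorithm: the function names, the sample runtimes, and the deadline are all hypothetical, and the "distribution" here is simply the empirical history of similar jobs.

```python
from statistics import mean

def point_estimate_meets_deadline(history, start, deadline):
    """Point-estimate view: a yes/no answer based only on the mean runtime."""
    return start + mean(history) <= deadline

def probability_meets_deadline(history, start, deadline):
    """Distribution view: fraction of past runtimes that would finish in time."""
    on_time = sum(1 for runtime in history if start + runtime <= deadline)
    return on_time / len(history)

# Hypothetical runtimes (seconds) of similar past jobs: mostly short, one long outlier.
history = [10, 12, 11, 13, 60]
start, deadline = 0, 20

# The mean (21.2 s) suggests the job misses the deadline...
print(point_estimate_meets_deadline(history, start, deadline))
# ...while the history shows that most past runs would have finished in time.
print(probability_meets_deadline(history, start, deadline))
```

In this toy case the outlier pulls the mean past the deadline, so a point-estimate scheduler would treat the job as infeasible, whereas the distribution view exposes a likelihood that a scheduler could weigh against other placement options.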
Thesis Committee:
Gregory R. Ganger (Chair)
Phillip B. Gibbons
Michael Kozuch (Intel Labs)