Doctoral Thesis Proposal - Gabriele Oliaro
May 12, 2026, 5:00 PM to 6:30 PM
Location:
Gates and Hillman Centers 4405 & Zoom
Speaker:
GABRIELE OLIARO, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://www.gabrieleoliaro.com/
Large language models have become central infrastructure for modern AI applications, but running them efficiently remains a major systems challenge. Larger and more capable models require more GPU computation, GPUs remain expensive, and sophisticated applications such as agents require low latency and predictable service-level objectives to be practically usable. At the same time, inference optimization increasingly depends on deep specialized expertise in scheduling, memory management, and GPU kernel engineering. This expertise is difficult to scale because model architectures and accelerator platforms evolve rapidly, introducing new operators, precision formats, parallelization patterns, and hardware-specific optimization requirements.
This thesis develops systems that use model-driven and agentic techniques to automate parts of this optimization process. SpecInfer uses smaller language models to speculate on the outputs of larger models, converting otherwise serial autoregressive decoding into parallel verification. SuffixDecoding extends this idea to agentic workloads by caching and reusing prior generation patterns to speculate with minimal GPU overhead. FlexLLM automates fine-grained resource allocation between latency-critical inference and throughput-oriented finetuning, allowing both services to share GPUs while preserving inference service-level objectives. The remaining work moves from optimizing inference systems with AI techniques to using AI agents to optimize the systems themselves. FastKernels provides a production-faithful benchmark for LLM-based GPU kernel agents, and the proposed kernel agent will use compiler feedback, correctness tests, runtime measurements, and hardware feedback to generate optimized kernels for rapidly evolving operators such as linear attention, state-space models, mixture-of-experts routing, quantized inference, and multimodal fusion.
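The draft-then-verify idea behind the speculative decoding work described above can be illustrated with a toy sketch. This is not the SpecInfer or SuffixDecoding implementation: the function names `suffix_draft` and `verify`, the integer tokens, and the toy `target_next` model are all assumptions made for illustration. The drafter proposes tokens by reusing what followed a matching suffix in previously generated text, and the verifier keeps the longest draft prefix that agrees with the target model's greedy choice. A real system would check all draft positions in one batched forward pass; the loop below calls the target sequentially only for clarity.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical interface: a greedy "model" maps a context to its next token.
NextToken = Callable[[Sequence[Token]], Token]


def suffix_draft(history: Sequence[Token], context: Sequence[Token],
                 k: int, max_suffix: int = 4) -> List[Token]:
    """Propose up to k tokens by copying what followed the most recent
    occurrence of the longest matching context suffix in `history`."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = list(context[-n:])
        # Scan from the most recent position that still has a following token.
        for start in range(len(history) - n - 1, -1, -1):
            if list(history[start:start + n]) == suffix:
                return list(history[start + n:start + n + k])
    return []  # no match: fall back to ordinary one-token decoding


def verify(target_next: NextToken, context: Sequence[Token],
           draft: Sequence[Token]) -> Tuple[List[Token], int]:
    """Accept the longest draft prefix matching the target's greedy output.
    Returns (new tokens, number of accepted draft tokens)."""
    out = list(context)
    accepted = 0
    for tok in draft:
        expected = target_next(out)
        if tok != expected:
            out.append(expected)  # replace the first wrong token and stop
            return out[len(context):], accepted
        out.append(tok)
        accepted += 1
    out.append(target_next(out))  # all accepted: one extra target token
    return out[len(context):], accepted


def target_next(ctx: Sequence[Token]) -> Token:
    """Toy stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


history = [1, 2, 3, 4, 5, 6]   # previously generated tokens (the "cache")
context = [0, 1, 2]            # current prompt plus tokens generated so far
draft = suffix_draft(history, context, k=3)
new_tokens, accepted = verify(target_next, context, draft)
```

In this toy run the draft is fully accepted, so a single verification step yields several tokens instead of one. When the draft is wrong, the output still matches what the target model alone would have produced, only with less speedup.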
Thesis Committee
Zhihao Jia (Chair)
Tianqi Chen
Phillip Gibbons
Hao Zhang (University of California, San Diego)
In-person & Zoom