5th Year Master's Thesis Presentation - Hao Kang
April 21, 2026, 12:00 PM to 1:30 PM
Location:
In Person - ASA Conference Room, Gates Hillman 6115
Speaker:
HAO KANG,
Master's Student, Computer Science Department, Carnegie Mellon University
https://haokang.me/
Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models, offering a favorable trade-off between model capacity and per-token computation. This thesis studies the training of MoE language models from two perspectives: modeling and systems.
On the modeling side, we present FLAME-MoE, a transparent research platform providing a suite of MoE models across seven scales, with all code, data pipelines, intermediate checkpoints, and routing logs released publicly. We establish MoE scaling laws and show that the resulting models outperform dense baselines at matched compute. Using the full training trace, we conduct empirical analyses of expert behavior, finding that expert specialization emerges gradually, co-activation remains sparse but intensifies in deeper layers, and routing decisions converge early in training.
On the systems side, we present PithTrain, a Python-native MoE training framework that delivers production-grade throughput in roughly 10K lines of code. PithTrain supports 4D parallelism, a DualPipeV pipeline scheduler that overlaps computation with communication, FP8 training via DeepGEMM, and fused Triton kernels for expert dispatch. At roughly 10K lines, the entire codebase fits within the context window of modern AI coding tools, making it readable end to end by both humans and agents. We also explore the implications of this minimal design for agentic development.
Thesis Committee
Chenyan Xiong (Chair)
Tianqi Chen
Additional Information
For More Information:
amalloy@cs.cmu.edu