5th Year Master's Thesis Presentation - Hao Kang

April 21, 2026, 12:00 PM to 1:30 PM

Location:
In Person - ASA Conference Room, Gates Hillman 6115

Speaker:
HAO KANG, Master's Student, Computer Science Department, Carnegie Mellon University
https://haokang.me/

Training Mixture-of-Experts Language Models

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models, offering a favorable trade-off between model capacity and per-token computation. This thesis studies the training of MoE language models from two perspectives: modeling and systems.

On the modeling side, we present FLAME-MoE, a transparent research platform providing a suite of MoE models across seven scales, with all code, data pipelines, intermediate checkpoints, and routing logs released publicly. We establish MoE scaling laws and show that the resulting models outperform dense baselines at matched compute. Using the full training trace, we conduct empirical analyses of expert behavior, finding that expert specialization emerges gradually, co-activation remains sparse but intensifies in deeper layers, and routing decisions converge early in training.

On the systems side, we present PithTrain, a Python-native MoE training framework that delivers production-grade throughput in roughly 10K lines of code. PithTrain supports 4D parallelism, a DualPipeV pipeline scheduler that overlaps computation with communication, FP8 training via DeepGEMM, and fused Triton kernels for expert dispatch. At roughly 10K lines, the entire codebase fits within the context window of modern AI coding tools, so both humans and agents can read it end to end. We also explore the implications of this minimal design for agentic development.

Thesis Committee
Chenyan Xiong (Chair)
Tianqi Chen

For More Information:
amalloy@cs.cmu.edu

