Doctoral Thesis Proposal - Yonghao Zhuang
Time: 2:00pm ET
Location: In Person and Virtual - Newell-Simon 4201 and Zoom
Speaker:
Yonghao Zhuang,
Ph.D. Student
Computer Science Department
Carnegie Mellon University
https://zyhowell.github.io/
Today's LLM training introduces additional stages to further improve model quality, among which post-training is the most important. Post-training includes reinforcement learning (RL), which iteratively runs the "rollout generation - reward evaluation - policy update" pipeline; long-context rollouts in this pipeline introduce workload imbalance.
This thesis proposes the concept of an attention server, in which the main part of attention (core attention) is disaggregated from the other components of the model and handled by an independent cluster of GPUs. The first benefit of this disaggregation is independent scaling, which enables a higher batch size for the other components. Moreover, the core attention kernel needs only a subset of the GPU's resources to saturate its memory-bandwidth demand, allowing the remaining resources to be used for compute-intensive tasks.
Thesis Committee
Eric Xing (Chair)
Tianqi Chen
Zhihao Jia
Ion Stoica (University of California, Berkeley)
Hao Zhang (University of California, San Diego)
Additional Information
In Person and Zoom Participation. See announcement.
For More Information:
matthewstewart@cmu.edu