Doctoral Speaking Skills Talk - Yonghao Zhuang

— 4:00pm

Location:
In Person - Gates Hillman 7101

Speaker:
YONGHAO ZHUANG, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://zyhowell.github.io/

Scaling Long Context Training Data by Long-Distance Referrals

Training large language models for long-context understanding is limited by a shortage of long training documents. Previous data engineering approaches mechanically concatenate short documents, which can produce many pseudo-long documents but raises concerns about data quality.

In this paper, we study the core attribute of high-quality data for long-context training and provide a data pipeline, LongPack, to scale such data. We find that long-distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. We further show that the density of long-distance referrals, which is higher in longer documents, plays a key role in training efficiency, making previous upsampling methods suboptimal.

To enrich long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, which are the primary source of language model training data, we find that hyperlinks are a native signal of such relationships. By packing web pages along their hyperlink connections, we can create longer, high-quality documents.
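
As a rough illustration of the packing idea described above, the sketch below walks outgoing hyperlinks breadth-first from a seed page and concatenates the pages it reaches until a length budget is filled. This is not the LongPack implementation: the function name `pack_by_links`, the `pages` and `links` dictionaries, the breadth-first order, and the whitespace word count used as a stand-in for a tokenizer are all assumptions made for this example.

```python
from collections import deque


def pack_by_links(pages, links, start, max_tokens):
    """Greedily pack pages reachable from `start` via hyperlinks.

    pages: dict mapping URL -> page text (hypothetical crawl result)
    links: dict mapping URL -> list of outgoing URLs
    Length is approximated by whitespace-separated words, not real tokens.
    Returns the list of packed URLs and the concatenated document.
    """
    packed, used = [], 0
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text = pages.get(url)
        if text is None:
            continue  # link target is missing from the crawl
        n = len(text.split())
        if used + n > max_tokens:
            continue  # page would overflow the budget; its links are not followed
        packed.append(url)
        used += n
        # Enqueue pages this one refers to, so linked content is packed together.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return packed, "\n\n".join(pages[u] for u in packed)
```

Calling `pack_by_links(pages, links, seed_url, max_tokens)` on a small crawl returns the pages in the order they were packed, so a page and the pages it links to end up in the same long document. A real pipeline would likely use an actual tokenizer and a more careful ordering and filtering policy.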

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

