Computer Science 5th Years Master's Thesis Presentation
May 6, 2024, 10:00am to 11:00am
Location: In Person - Traffic21 Classroom, Gates Hillman 6501
Speaker: MEDHA PALAVALLI, Master's Student, Computer Science Department, Carnegie Mellon University

A Taxonomy for Data Contamination in Large Language Models

Large language models pre-trained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern surrounds data contamination, where evaluation datasets may unintentionally influence the pretraining corpus, potentially inflating model performance. Despite these concerns, there remains a lack of comprehensive understanding regarding how such contamination impacts the performance of language models on downstream tasks, highlighting the necessity to investigate and mitigate this issue for accurate model evaluation.

In this thesis, we present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks, summarization and question answering, revealing how different types of contamination influence task performance during evaluation. Our findings yield concrete recommendations for prioritizing data decontamination for pretraining.

Thesis Committee:
Matt Gormley (Chair)
Lori Levin