Thesis Proposal - Mingjie Sun

May 28, 2024, 1:00pm
Location: In Person - Traffic21 Classroom, Gates Hillman 6501
Speaker: MINGJIE SUN, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://eric-mingjie.github.io/

The Transformer is a neural network architecture centered on the self-attention mechanism. In recent years, it has become the de facto architecture in deep learning, e.g., in Large Language Models (LLMs) and Vision Transformers (ViTs). However, these models, with millions to billions of parameters, remain largely opaque, and their mechanisms are difficult to interpret. As their real-world applications grow, a deep understanding of their internal representations is essential for effectively using and improving these models.

In this work, we closely examine the activation landscape in Transformers and demonstrate that understanding the intriguing activation phenomena in Transformers has practical and meaningful implications. First, we identify a fundamental limitation of the well-established magnitude pruning method: it fails to account for the existence of features with large activations in large-scale Transformers. Leveraging this key insight, we develop a simple and effective pruning approach. Second, we discover and study the presence of very few activations with extremely large magnitudes, which we call massive activations. We investigate the role of massive activations in Transformers and show how they are fundamentally connected to the self-attention mechanism. Last, we discuss proposed extensions of this work, primarily focused on developing a unified framework for LLM compression through a principled investigation of existing methods.

Thesis Committee:
J. Zico Kolter (Chair)
Graham Neubig
Aditi Raghunathan
Kaiming He (Massachusetts Institute of Technology)
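
To make the first contribution above concrete: the abstract contrasts plain weight-magnitude pruning with a criterion that also accounts for activation scale. Below is a minimal sketch of one way that idea could be instantiated; the exact scoring rule, the use of calibration activations, and the per-row pruning granularity are illustrative assumptions, not details taken from the abstract.

    import torch

    def activation_aware_prune(weight: torch.Tensor,
                               calib_inputs: torch.Tensor,
                               sparsity: float = 0.5) -> torch.Tensor:
        # weight:       (out_features, in_features) matrix of a linear layer
        # calib_inputs: (num_tokens, in_features) activations from calibration data
        # sparsity:     fraction of weights removed within each output row

        # Per-input-feature activation scale (L2 norm over calibration tokens).
        act_norm = calib_inputs.norm(p=2, dim=0)              # (in_features,)

        # Importance score: a weight feeding a large activation is kept,
        # even if its magnitude alone looks small.
        score = weight.abs() * act_norm.unsqueeze(0)          # (out, in)

        # Zero out the lowest-scoring fraction of weights in each row.
        k = int(weight.shape[1] * sparsity)
        _, prune_idx = torch.topk(score, k, dim=1, largest=False)
        mask = torch.ones_like(weight, dtype=torch.bool)
        mask.scatter_(1, prune_idx, False)
        return weight * mask

Pruning by weight magnitude alone corresponds to dropping act_norm from the score; the sketch highlights how large-activation features change which weights survive.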
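For the second contribution, one simple way to surface the handful of extremely large activations the abstract describes is to scan a layer's hidden states for entries far above the typical magnitude. The 100x-median threshold below is an assumed, illustrative cutoff, not a definition from the abstract.

    import torch

    def find_massive_activations(hidden_states: torch.Tensor, ratio: float = 100.0):
        # hidden_states: (seq_len, hidden_dim) activations from one Transformer layer
        # ratio:         how many times the median magnitude counts as "massive"
        mags = hidden_states.abs()
        threshold = ratio * mags.median()
        idx = (mags > threshold).nonzero(as_tuple=False)      # (num_hits, 2)
        return [(int(t), int(d), float(hidden_states[t, d])) for t, d in idx]

Running such a scan across layers and input sequences is one way to check whether the flagged (token, dimension) positions are as rare and as consistent as the abstract suggests.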