Jielin Qiu
On the Alignment, Robustness, and Generalizability of Multimodal Learning
Degree Type: Ph.D. in Computer Science
Advisor(s): Christos Faloutsos, Lei Li
Graduated: April 2024

Abstract: Multimodal intelligence, where AI systems process and integrate information from multiple modalities such as text, vision, and audio, has emerged as a key concept in today's data-driven era. This cross-modal approach has diverse applications and transformative potential across industries. By fusing heterogeneous data streams, multimodal AI produces representations closer to human-like intelligence than traditional unimodal techniques.

In this thesis, we aim to advance the field of multimodal intelligence along three crucial dimensions: multimodal alignment, robustness, and generalizability. By introducing new approaches and methods, we improve the performance, robustness, and interpretability of multimodal models in practical applications. We address the following critical questions: (1) How do we uncover the inner semantic alignment between different types of data, and how can the learned alignment advance multimodal applications? (2) How robust are multimodal models, and how can we improve their robustness in real-world applications? (3) How do we generalize knowledge from a learned domain to an unseen domain?

This thesis contributes to all three technical challenges. Our first contribution is learning cross-modal semantic alignment, where we establish rich connections between language and image/video data, with a focus on the multimodal summarization task. By aligning the semantic content of language with visual elements, the resulting models gain a more nuanced understanding of the underlying concepts. We apply Optimal Transport-based approaches to learn cross-domain alignment, enabling models to provide interpretable explanations of their multimodal reasoning process.

For the second contribution, we develop comprehensive evaluation metrics and methodologies to assess the robustness of multimodal models. By simulating distribution shifts and measuring model performance under different scenarios, we gain a deeper understanding of a model's adaptability and identify potential vulnerabilities. We also adopt Optimal Transport to improve robustness through data augmentation via Wasserstein geodesic perturbation.

The third contribution concerns the generalizability of multimodal systems, with an emphasis on the interactive and healthcare domains. In the interactive domain, we develop new learning paradigms for learning executable robotic policy plans from visual observations by incorporating latent language encoding. We also use retrieval augmentation to equip vision-language models to recognize entities and provide knowledgeable answers in real-world entity-centric VQA. In the healthcare domain, we bridge the gap by transferring the knowledge of LLMs to clinical ECG and EEG signals. In addition, we design retrieval systems that automatically match a clinical healthcare signal to the most similar records in a database, which can significantly aid in diagnosing diseases and reduce physicians' workload.

In essence, this thesis seeks to propel the field of multimodal AI forward by enhancing alignment, robustness, and generalizability, paving the way for more sophisticated and efficient multimodal AI systems.
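As a rough illustration of the Optimal Transport-based cross-domain alignment mentioned in the abstract, the sketch below computes a soft alignment between a set of text-token embeddings and a set of visual-region embeddings using entropic OT solved with Sinkhorn iterations. This is a minimal sketch under assumed choices (cosine cost, uniform marginals, a numpy-only solver), not the thesis implementation; names such as sinkhorn_alignment, text_feats, and image_feats are illustrative.

```python
# Minimal sketch (not the thesis code): entropic Optimal Transport between
# text and image embeddings via Sinkhorn iterations. All names and
# hyperparameters here are illustrative assumptions.
import numpy as np

def sinkhorn_alignment(text_feats, image_feats, eps=0.05, n_iters=200):
    """Return a soft alignment (transport plan) between two embedding sets.

    text_feats:  (n, d) array of text token embeddings
    image_feats: (m, d) array of visual region/frame embeddings
    """
    # Cosine-distance cost matrix between every text/image pair.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ v.T                      # (n, m)

    # Uniform marginals: every token / region carries equal mass.
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

    # Sinkhorn iterations on the Gibbs kernel K = exp(-cost / eps).
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v_scale = b / (K.T @ u)
        u = a / (K @ v_scale)

    plan = np.diag(u) @ K @ np.diag(v_scale)  # rows ~ text tokens, cols ~ regions
    ot_distance = np.sum(plan * cost)         # alignment cost usable as a loss
    return plan, ot_distance

# Toy usage: align 5 text tokens with 7 image regions in a 16-dim space.
rng = np.random.default_rng(0)
plan, dist = sinkhorn_alignment(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)))
print(plan.shape, round(dist, 4))
```

The transport plan gives an interpretable, token-to-region correspondence, and the OT cost can serve as an alignment objective; the thesis's actual formulation, cost functions, and training setup may differ from this sketch.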
Thesis Committee:
Christos Faloutsos (Co-chair)
Lei Li (Co-chair)
Yonatan Bisk
William Wang (University of California, Santa Barbara)
Ashish Goel (Stanford University)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Keywords: Multimodal learning, semantic alignment, multimodal robustness, generalization, cross-domain alignment

CMU-CS-24-101.pdf (41.42 MB, 220 pages)