Lucian Vlad Lita
Instance-Based Question Answering
Degree Type: Ph.D. in Computer Science
Advisor(s): Jaime Carbonell
Graduated: December 2006

Abstract:
During recent years, question answering (QA) has grown from simple passage retrieval and information extraction into very complex approaches that incorporate deep question and document analysis, reasoning, planning, and sophisticated uses of knowledge resources. Most existing QA systems combine rule-based, knowledge-based, and statistical components, and are highly optimized for a particular style of questions in a given language. Typical question answering approaches depend on specific ontologies, resources, processing tools, and document sources, and very often rely on expert knowledge and rule-based components. Furthermore, such systems are very difficult to re-train and optimize for different domains and languages, requiring considerable time and human effort.

We present a fully statistical, data-driven, instance-based approach to question answering (IBQA) that learns how to answer new questions from similar training questions and their known correct answers. We represent training questions as points in a multi-dimensional space and cluster them according to different granularity, scatter, and similarity metrics. From each individual cluster we automatically learn an answering strategy for finding answers to questions. When answering a new question that is covered by several clusters, multiple answering strategies are employed simultaneously. The resulting answer confidence combines elements such as each strategy's estimated probability of success, cluster similarity to the new question, cluster size, and cluster granularity. The IBQA approach obtains good performance on factoid and definitional questions, comparable to the performance of top systems participating in official question answering evaluations.

Each answering strategy is cluster-specific and consists of an expected answer model, a query content model, and an answer extraction model. The expected answer model is derived from all training questions in its cluster and takes the form of a distribution over all possible answer types. The query content model for document retrieval is constructed from the content of queries that are successful on training questions in that cluster. Finally, we train cluster-specific answer extractors on training data and use them to find answers to new questions.

The IBQA approach is not resource-intensive, but it can easily be extended to incorporate knowledge resources or rule-based components. Since it does not rely on hand-written rules, expert knowledge, or manually tuned parameters, it is less dependent on a particular language or domain, allowing for fast re-training with minimal human effort. Under limited data, our implementation of an IBQA system achieves good performance, improves with additional training instances, and is easily trainable and adaptable to new types of data. The IBQA approach provides a principled, robust, and easy-to-implement base system that constitutes a well-performing platform for further domain-specific adaptation.
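To make the clustering step concrete: the abstract does not specify the question representation or the clustering algorithm, so the sketch below assumes TF-IDF features and k-means at two granularities purely for illustration; the thesis uses its own granularity, scatter, and similarity metrics.

    # A minimal sketch of clustering training questions at several granularities.
    # TF-IDF features and k-means are illustrative assumptions, not the
    # thesis's actual representation or clustering method.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    training_questions = [
        "Who wrote War and Peace?",
        "Who invented the telephone?",
        "When did World War II end?",
        "When was the Eiffel Tower built?",
        "What is the capital of Australia?",
        "What is the population of Japan?",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(training_questions)  # questions as points in a vector space

    # Cluster at multiple granularities: coarse clusters cover many questions,
    # fine clusters group only very similar ones.
    clusterings = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
                   for k in (2, 3)}
    for k, km in clusterings.items():
        print(f"granularity k={k}:", km.labels_.tolist())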
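The exact rule for combining evidence into an answer confidence is likewise not given in the abstract; one plausible hypothetical form, treating the listed factors as multiplicative weights summed over the clusters that cover the new question q, is

    \mathrm{conf}(a \mid q) \;=\; \sum_{c \,:\, q \in c} P(\mathrm{success} \mid s_c) \cdot \mathrm{sim}(q, c) \cdot w(|c|, g_c) \cdot P(a \mid s_c, q)

where s_c is the strategy learned from cluster c, sim(q, c) measures the new question's similarity to the cluster, and w(|c|, g_c) weights cluster size and granularity.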
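The expected answer model admits an especially simple sketch: given the answer types of a cluster's training questions, it is their empirical distribution over answer types. The type inventory and training pairs below are hypothetical.

    # A minimal sketch of a cluster-specific expected answer model: a
    # distribution over answer types estimated from the answer types of the
    # training questions in one cluster. The (question, answer_type) pairs
    # are illustrative assumptions, not the thesis's actual taxonomy.
    from collections import Counter

    cluster_training = [
        ("Who wrote War and Peace?", "PERSON"),
        ("Who invented the telephone?", "PERSON"),
        ("Who founded the Red Cross?", "PERSON"),
        ("What country borders France?", "LOCATION"),
    ]

    counts = Counter(atype for _, atype in cluster_training)
    total = sum(counts.values())
    expected_answer_model = {atype: n / total for atype, n in counts.items()}
    print(expected_answer_model)  # {'PERSON': 0.75, 'LOCATION': 0.25}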
Thesis Committee:
Jaime Carbonell (Chair)
Eric Nyberg
Tom Mitchell
Nanda Kambhatla (IBM T.J. Watson)

Jeannette Wing, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords: Statistical question answering, QA, natural language processing, statistical answer extraction, question clustering, answer type distributions, cluster-based query expansion, learning answering strategies, machine learning in NLP

CMU-CS-06-179.pdf (1.79 MB) (231 pages)