Adam Berger
Statistical Machine Learning for Information Retrieval

Degree Type: Ph.D. in Computer Science
Advisor(s): John Lafferty
Graduated: May 2001

Abstract:
The purpose of this work is to introduce and experimentally validate a framework, based on statistical machine learning, for handling a broad range of problems in information retrieval (IR).

Probably the most important single component of this framework is a parametric statistical model of word relatedness. A longstanding problem in IR has been to develop a mathematically principled model of document processing which acknowledges that one sequence of words may be closely related to another even if the pair have few (or no) words in common. The fact that a document contains the word "automobile," for example, suggests that it may be relevant to the queries "Where can I find information on motor vehicles?" and "Tell me about car transmissions," even though the word "automobile" itself appears nowhere in these queries. Likewise, a document containing the words (plumbing, caulk, paint, gutters) might best be summarized as "common house repairs," even if none of the three words in this candidate summary ever appears in the document.

Until now, the word-relatedness problem has typically been addressed with techniques like automatic query expansion (Rocchio, 1971), an often successful though ad hoc technique which artificially injects new, related words into a document to ensure that related documents have some lexical overlap.

In the past few years, a number of novel probabilistic approaches to information processing have emerged, including the language modeling approach to document ranking first proposed by Ponte and Croft, the non-extractive summarization work of Mittal and Witbrock, and the Hidden Markov Model-based ranking of Miller et al. This thesis advances that body of work by proposing a principled, general probabilistic framework which naturally accounts for word-relatedness issues, using techniques from statistical machine learning such as the Expectation-Maximization (EM) algorithm. Applying this new framework to the problem of ranking documents by relevance to a query, for instance, we discover a model that contains a version of the Ponte and Miller models as a special case, but surpasses these in its ability to recognize the relevance of a document to a query even when the two have minimal lexical overlap.

Historically, information retrieval has been a field of inquiry driven largely by empirical considerations. After all, whether system A was constructed from a sounder theoretical framework than system B is of no concern to the system's end users. This thesis honors the strong engineering flavor of the field by evaluating the proposed algorithms in many different settings and on datasets from many different domains. The result of this analysis is an empirical validation of the notion that one can devise useful real-world information processing systems built from statistical machine learning techniques.

Thesis Committee:
John Lafferty (Chair)
Jamie Callan
Jaime Carbonell
Daniel Sleator
Jan Pedersen (Centrata Corp.)

Randy Bryant, Head, Computer Science Department
James Morris, Dean, School of Computer Science

Keywords: Information retrieval, machine learning, language models, statistical inference, Hidden Markov Models, information theory, text summarization

CMU-CS-01-110.pdf (1.39 MB, 147 pages)
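
For intuition about the kind of ranking model the abstract describes, the following is a minimal sketch of query-likelihood scoring with an optional word-relatedness component, in the spirit of translation-style language models for IR. The toy corpus, the hand-set relatedness table t, and the mixing weight lam are illustrative assumptions, not values or code from the thesis; in the thesis framework such a table would be learned (for example with EM) rather than specified by hand.

    # Hypothetical sketch: rank documents by query likelihood p(q|d),
    # optionally routing each query word through a word-relatedness
    # ("translation") table t[q_word][doc_word]. All data here is toy data.
    from collections import Counter
    import math

    docs = {
        "d1": "automobile engine transmission repair".split(),
        "d2": "plumbing caulk paint gutters".split(),
    }

    # Background (collection) unigram counts for smoothing.
    bg = Counter(w for words in docs.values() for w in words)
    bg_total = sum(bg.values())

    # Hand-set relatedness table (an assumption for illustration only):
    # t[w_query][w_doc] approximates p(w_query | w_doc).
    t = {"car": {"automobile": 0.7, "car": 1.0},
         "repairs": {"repair": 0.8, "repairs": 1.0}}

    def p_word_given_doc(w, words, lam=0.5, translate=False):
        """Jelinek-Mercer-smoothed p(w|d). With translate=True, the
        document probability of w is a sum over document words weighted
        by the relatedness table; unknown words fall back to identity."""
        counts = Counter(words)
        n = len(words)
        if translate:
            direct = sum(t.get(w, {}).get(v, 1.0 if v == w else 0.0) * c / n
                         for v, c in counts.items())
        else:
            direct = counts[w] / n
        # Mix the document model with an add-one-smoothed background model.
        return lam * direct + (1 - lam) * (bg[w] + 1) / (bg_total + len(bg))

    def score(query, words, translate=False):
        return sum(math.log(p_word_given_doc(w, words, translate=translate))
                   for w in query.split())

    for name, words in docs.items():
        print(name,
              round(score("car repairs", words), 3),
              round(score("car repairs", words, translate=True), 3))

Without the relatedness table, the query "car repairs" shares no words with either document, so both receive only background-smoothing mass; with it, d1 scores higher because "automobile" and "repair" carry probability toward "car" and "repairs". This is exactly the lexical-mismatch behavior the abstract attributes to the proposed framework, though the full model and its EM training are considerably more involved than this two-entry toy.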