Monica Rogati
Domain Adaptation of Translation Models for Multilingual Applications

Degree Type: Ph.D. in Computer Science
Advisor(s): Yiming Yang, Jaime G. Carbonell
Graduated: May 2009

Abstract:
The performance of a statistical translation algorithm in multilingual applications such as cross-lingual information retrieval (CLIR) and machine translation (MT) depends on the quality, quantity, and domain match of the training data. Traditionally, manual selection and customization of training resources have been the prevailing approach. In addition to being labor-intensive, this approach does not scale to the large quantity of heterogeneous resources that have recently become available, such as parallel text and bilingual thesauri in various domains. More importantly, manual customization does not offer a way to efficiently and effectively produce tailored translation models for a mixture of heterogeneous target documents in various domains, topics, languages, and genres.

Translation models trained on a general domain do not work well in technical domains; models trained on written documents are not appropriate for spoken dialogue; models trained on manual transcripts can be sub-optimal for translating noisy transcripts produced by a speech recognizer; finally, models trained on a mixture of topics are not optimal for any of the topic-specific documents.

We seek to address this challenge by automatically adapting translation models (and, implicitly, parallel training resources) to specific target domains or sub-domains. The high-level adaptation process involves automatically weighting and combining multiple translation resources, according to several criteria, in order to better match a target corpus or a specific domain sample. The criteria we examine include lexical-level domain match, translation quality estimates, size, and taxonomy representation. An orthogonal dimension in the adaptation process is the granularity level at which these criteria are measured and applied: from the collection level (under the assumption of homogeneous within-collection data) to the document level. The relative contribution of each criterion is subsequently determined by a model that can range from uniform weighting to a global non-linear optimization model trained on application-specific evaluation data.

In this thesis, we examine how such adaptation applies to two important multilingual applications: cross-lingual information retrieval and machine translation. In CLIR, we adapt translation models for domain-specific query translation; in MT, we adapt translation models to heterogeneous target corpora and compare this approach with previously studied target language model adaptation. We use our adaptation algorithms to enhance state-of-the-art systems, seeking to improve performance under different testing conditions and to reduce the demand for large amounts of domain-specific parallel data.

We also address the challenge of combining multiple criteria to rank parallel sentence candidates. We investigate Continuous Reactive Tabu Search (CRTS) [2], a global optimization method, as well as Reactive Affine Shaker (RASH) [6], an efficient algorithm that continuously adjusts its search area in order to identify a local minimum.
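As a rough illustration of the criteria-weighted selection idea (the abstract does not specify the actual scoring functions or the optimization setup), the Python sketch below combines hypothetical per-resource criterion scores with a set of weights to rank candidate training resources against a target domain. All names, criteria keys, and values are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch (not the thesis implementation): rank parallel training
# resources by a weighted combination of criterion scores.
# All names (Resource, the criterion keys, the weight values) are hypothetical.

from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    scores: dict  # criterion name -> score in [0, 1] against the target domain

def combined_score(resource: Resource, weights: dict) -> float:
    """Weighted linear combination of the resource's criterion scores."""
    return sum(weights.get(criterion, 0.0) * score
               for criterion, score in resource.scores.items())

def rank_resources(resources, weights):
    """Return candidate resources sorted by combined score, best first."""
    return sorted(resources, key=lambda r: combined_score(r, weights), reverse=True)

if __name__ == "__main__":
    # Hypothetical criteria: lexical domain match, translation quality
    # estimate, relative size, and taxonomy representation.
    weights = {"lexical_match": 0.5, "quality": 0.3, "size": 0.1, "taxonomy": 0.1}
    pool = [
        Resource("news_parallel_text",
                 {"lexical_match": 0.4, "quality": 0.9, "size": 0.8, "taxonomy": 0.6}),
        Resource("medical_bilingual_thesaurus",
                 {"lexical_match": 0.9, "quality": 0.7, "size": 0.2, "taxonomy": 0.9}),
    ]
    for r in rank_resources(pool, weights):
        print(r.name, round(combined_score(r, weights), 3))
```

In the thesis, the weights themselves are not fixed by hand as above but are tuned, ranging from uniform weighting to global non-linear optimization (e.g., CRTS or RASH) against application-specific evaluation data.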
Our experiments in CLIR and statistical MT indicate that selecting training data based on the above-mentioned approaches allows a significant reduction in training data while preserving about 90% of the performance. This result significantly surpasses the random selection approach, and it holds for both CLIR and MT. As expected, the difference increases as the sub-domain becomes more specific. Our optimized criteria weights considerably outperform the uniform distribution baseline, as well as lexical similarity adaptation.

Thesis Committee:
Yiming Yang (Co-Chair)
Jaime Carbonell (Co-Chair)
Jamie Callan
Salim Roukos (IBM TJ Watson)

Peter Lee, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords: domain adaptation, machine translation, statistical machine translation, parallel corpora, domain specific, cross-language information retrieval, CLIR, criteria optimization, SMT, resource selection, training resource adaptation

CMU-CS-09-116.pdf (5.12 MB) (127 pages)