Kamal Paul Nigam

Using Unlabeled Data to Improve Text Classification
Degree Type: Ph.D. in Computer Science
Advisor(s): Tom Mitchell
Graduated: May 2001

Abstract:

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. Under the assumption that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all of the data, both labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.
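
This basic approach can be sketched roughly as follows. This is a minimal illustration, assuming a multinomial naive Bayes bag-of-words model with Dirichlet smoothing (giving MAP rather than maximum-likelihood estimates) and NumPy; the function name and parameters are illustrative, not the dissertation's code.

    import numpy as np

    def semisupervised_nb_em(X_l, y_l, X_u, n_classes, n_iters=10, alpha=1.0):
        """EM for semi-supervised multinomial naive Bayes (a sketch).

        X_l: (n_labeled, vocab) word-count matrix; y_l: integer class labels
        X_u: (n_unlabeled, vocab) word-count matrix
        alpha: Dirichlet smoothing, acting as the prior for MAP estimation
        """
        R_l = np.eye(n_classes)[y_l]          # fixed one-hot responsibilities
        X_all = np.vstack([X_l, X_u])
        vocab = X_all.shape[1]

        def m_step(R, X):
            # MAP estimates of class priors and per-class word distributions
            # from the (expected) counts of the documents in X.
            class_counts = R.sum(axis=0)
            pi = (class_counts + alpha) / (class_counts.sum() + n_classes * alpha)
            word_counts = R.T @ X             # expected word counts per class
            theta = (word_counts + alpha) / (
                word_counts.sum(axis=1, keepdims=True) + vocab * alpha)
            return pi, theta

        pi, theta = m_step(R_l, X_l)          # initialize from labeled data only

        for _ in range(n_iters):
            # E-step: posterior class membership of each unlabeled document.
            log_post = np.log(pi) + X_u @ np.log(theta).T
            log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
            R_u = np.exp(log_post)
            R_u /= R_u.sum(axis=1, keepdims=True)
            # M-step: re-estimate the model from all documents, with labeled
            # documents' responsibilities held fixed at their true labels.
            pi, theta = m_step(np.vstack([R_l, R_u]), X_all)

        return pi, theta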

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
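
The sub-topic extension can be pictured as a many-to-one mapping from mixture components to classes, with each class generated by several sub-topic components. Below is a rough sketch of the classification side only, under the same multinomial naive Bayes assumptions as above; the names components_per_class, comp_to_class, and classify are hypothetical. During EM training, labeled documents would constrain only the class identity, with their responsibilities among that class's components re-estimated at each iteration.

    import numpy as np

    # Hypothetical setup: 3 sub-topic mixture components for each of 2
    # classes, with a fixed many-to-one mapping from components to classes.
    n_classes, components_per_class = 2, 3
    comp_to_class = np.repeat(np.arange(n_classes), components_per_class)

    def classify(X, pi, theta):
        """Classify documents under a multi-component naive Bayes model by
        summing the posterior mass of the components mapped to each class.
        Here pi and theta are indexed by component, not by class."""
        log_post = np.log(pi) + X @ np.log(theta).T   # (n_docs, n_components)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        class_post = np.stack(
            [post[:, comp_to_class == c].sum(axis=1) for c in range(n_classes)],
            axis=1)
        return class_post.argmax(axis=1)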

Thesis Committee:
Tom M. Mitchell (Chair)
Andrew Kachites McCallum
John Lafferty
Tommi Jaakkola (Massachusetts Institute of Technology)

Randy Bryant, Head, Computer Science Department
James Morris, Dean, School of Computer Science

Keywords:
Text classification, text categorization, unlabeled data, Expectation-Maximization, unsupervised learning

CMU-CS-01-126.pdf (964.16 KB, 138 pages)