Seán Slattery

Hypertext Classification
Degree Type: Ph.D. in Computer Science
Advisor(s): Tom Mitchell
Graduated: December 2001

Abstract:

Hypertext classification is the task of assigning labels to arbitrary hypertext documents, typically Web pages. One major problem with current techniques for this task is that they cannot easily be extended to incorporate hyperlink information. This dissertation explores the space of algorithms that use hyperlinks effectively and shows that such algorithms can improve classification accuracy.

I demonstrate how a First-Order learner (FOIL) can be used for hypertext classification in a way that easily incorporates hyperlink information. This approach leads to better classification performance and also produces learned rules which tell us more about how hyperlinks can help classification.
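As a rough illustration of what a first-order rule over hyperlinks might look like, the sketch below checks one hand-written rule against a tiny set of relations. The rule, the page names, and the relation names `HAS_WORD` and `LINK_TO` are hypothetical and are not taken from the dissertation; FOIL would learn such rules from training data rather than have them written by hand.

```python
# Toy relational data (hypothetical): which words each page contains,
# and which pages link to which.
HAS_WORD = {
    "page_a": {"research", "publications"},
    "page_b": {"faculty", "directory"},
    "page_c": {"homework"},
}
LINK_TO = {("page_b", "page_a"), ("page_b", "page_c")}  # (source, destination)


def rule_covers(page):
    """Evaluate one hand-written first-order rule:

    faculty_page(A) :- has_word(A, research),
                       link_to(B, A),
                       has_word(B, faculty).
    """
    if "research" not in HAS_WORD.get(page, set()):
        return False
    # The variable B ranges over every page that links to A.
    return any(
        dst == page and "faculty" in HAS_WORD.get(src, set())
        for src, dst in LINK_TO
    )


if __name__ == "__main__":
    for name in sorted(HAS_WORD):
        print(name, rule_covers(name))
```

The point of the sketch is that the rule mentions a second page, B, through the link relation, which a classifier that looks only at the words of A cannot express.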

A drawback of this approach is that it builds rules which assess document content using the presence or absence of specific keywords. The word-distribution approach used by text classifiers such as Naive Bayes and k-Nearest Neighbour is more intuitively appealing for testing document content. I show how a new hypertext classifier, FOIL-PILFS, combines the ability to use hyperlinks easily (via FOIL) and test document content effectively (using Naive Bayes) to produce improved classification performance.
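The difference between a keyword test and a word-distribution test can be sketched as follows. This is a minimal multinomial Naive Bayes with add-one smoothing on made-up data, not the FOIL-PILFS implementation; the training pages, the labels, and the keyword "professor" are assumptions chosen for illustration.

```python
import math
from collections import Counter


def keyword_rule(words, keyword="professor"):
    """Keyword-style test: succeeds only if one specific word is present."""
    return keyword in words


def train_naive_bayes(labeled_docs):
    """Collect per-class word counts from (words, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab


def classify(words, model):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training pages, already tokenized.
TRAIN = [
    ("professor teaching research students".split(), "faculty"),
    ("research publications teaching lab".split(), "faculty"),
    ("homework syllabus lecture exam".split(), "course"),
    ("lecture schedule homework grading".split(), "course"),
]
MODEL = train_naive_bayes(TRAIN)
PAGE = "research lab publications students".split()

if __name__ == "__main__":
    print("keyword rule fires:", keyword_rule(PAGE))
    print("Naive Bayes label:", classify(PAGE, MODEL))
```

On this toy page the keyword test fails because "professor" is absent, while the word-distribution test still favours the faculty class because the page's overall vocabulary resembles the faculty training pages.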

Another useful source of information for improved classification can be the hyperlink structure of the test set. Given an initial labeling of the test documents, hyperlink patterns in the test set can allow us to achieve even better classification. The First-Order Hubs algorithm looks for one kind of hyperlink regularity in the test set, similar to Kleinberg's Hubs and Authorities regularity, and can improve upon an initial test-set classification. Of course, other types of regularity are possible, and I show how we might find and use these with First-Order Hubs.
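The idea of using a hub-like regularity to revise an initial labeling can be sketched in a few lines. This is a deliberately simplified stand-in, not the First-Order Hubs algorithm: the thresholds `min_fraction` and `min_links`, the page names, and the one-pass relabeling step are all assumptions made for illustration.

```python
def find_hubs(links, labels, target, min_fraction=0.6, min_links=3):
    """Return pages whose outlinks mostly point at pages labeled `target`."""
    hubs = set()
    for page, outlinks in links.items():
        if len(outlinks) < min_links:
            continue  # too few links to count as evidence
        hits = sum(1 for p in outlinks if labels.get(p) == target)
        if hits / len(outlinks) >= min_fraction:
            hubs.add(page)
    return hubs


def relabel_with_hubs(links, labels, target, **kwargs):
    """Assign `target` to every page that a detected hub points to."""
    updated = dict(labels)
    for hub in find_hubs(links, labels, target, **kwargs):
        for p in links[hub]:
            updated[p] = target
    return updated


# Hypothetical test-set link structure and initial classifier output.
LINKS = {
    "dept/people": ["alice", "bob", "carol", "dave"],
    "dept/news": ["alice", "course101"],
}
INITIAL = {
    "alice": "faculty",
    "bob": "faculty",
    "carol": "faculty",
    "dave": "other",  # plausibly a miss by the initial classifier
    "course101": "course",
}

if __name__ == "__main__":
    print(relabel_with_hubs(LINKS, INITIAL, "faculty"))
```

Here "dept/people" points mostly at pages already labeled faculty, so it is treated as a hub and the remaining page it links to is relabeled, while "dept/news" has too few links to qualify.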

Thesis Committee:
Tom Mitchell (Chair)
Avrim Blum
Yiming Yang
Raymond Mooney (University of Texas at Austin)

Randy Bryant, Head, Computer Science Department
James Morris, Dean, School of Computer Science

Keywords:
Machine learning, text classification, hypertext classification, relational learning

CMU-CS-02-142.pdf (1.64 MB) (134 pages)