Seán Slattery
Hypertext Classification

Degree Type: Ph.D. in Computer Science
Advisor(s): Tom Mitchell
Graduated: December 2001

Abstract:

Hypertext classification is the task of assigning labels to arbitrary hypertext documents, typically Web pages. One major problem with current techniques for this task is that they can not be easily extended to incorporate hyperlink information. This dissertation explores the space of algorithms that use hyperlinks effectively and shows that such algorithms can improve classification accuracy.

I demonstrate how a First-Order learner (FOIL) can be used for hypertext classification in a way that easily incorporates hyperlink information. This approach leads to better classification performance and also produces learned rules which tell us more about how hyperlinks can help classification. A drawback of this approach is that it builds rules which assess document content using the presence or absence of specific keywords. The word-distribution approach used by text classifiers such as Naive Bayes and k Nearest Neighbour is more intuitively appealing for testing document content. I show how a new hypertext classifier, FOIL-PILFS, combines the ability to use hyperlinks easily (via FOIL) and test document content effectively (using Naive Bayes) to produce improved classification performance.

Another useful source of information for improved classification can be the hyperlink structure of the test set. Given an initial labeling of the test documents, hyperlink patterns in the test set can allow us to achieve even better classification. The First-Order Hubs algorithm looks for one kind of hyperlink regularity in the test set, similar to Kleinberg's Hubs and Authorities regularity, and can improve upon an initial test-set classification. Of course other types of regularity are possible and I show how we might find and use these with First-Order Hubs.

Thesis Committee:
Tom Mitchell (Chair)
Avrim Blum
Yiming Yang
Raymond Mooney (University of Texas at Austin)

Randy Bryant, Head, Computer Science Department
James Morris, Dean, School of Computer Science

Keywords: Machine learning, text classification, hypertext classification, relational learning

CMU-CS-02-142.pdf (1.64 MB) (134 pages)