Maria-Florina Balcan New Theoretical Frameworks for Machine Learning Degree Type: Ph.D. in Computer Science Advisor(s): Avrim Blum Graduated: December 2008 Abstract: This thesis has two primary thrusts. The first is developing new models and algorithms for important Machine Learning paradigms, including Semi-Supervised Learning, Active Learning, Learning via Kernels (and more general similarity functions), and Clustering. The second thrust is establishing new connections between Machine Learning and Algorithmic Game Theory. The formulation of the PAC learning model by Valiant and the Statistical Learning Theory framework by Vapnik have been instrumental in the development of machine learning and the design and analysis of algorithms for supervised learning. However, while extremely influential, these models do not capture or explain other important classic learning paradigms such as Clustering, nor do they capture important emerging learning paradigms such as Semi-Supervised Learning and other ways of incorporating unlabeled data in the learning process. In this thesis, we develop the first analog of these general discriminative models to the problems of Semi-Supervised Learning and Clustering, and we analyze both their algorithmic and sample complexity implications. We also provide the first generalization of the well-established theory of learning with kernel functions to case of general pairwise similarity functions and in addition provide new positive theoretical results for Active Learning. Finally, this dissertation presents new applications of techniques from Machine Learning to Algorithmic Game Theory, which has been a major area of research at the intersection of Computer Science and Economics. In machine learning, there has been growing interest in using unlabeled data together with labeled data due to the availability of large amounts of unlabeled data in many contemporary applications. As a result, a number of different semi-supervised learning methods such as Co-training, transductive SVM, or graph based methods have been developed. However, the underlying assumptions of these methods are often quite distinct and not captured by standard theoretical models. This thesis introduces a new discriminative model (a PAC or Statistical Learning Theory style model) for semi-supervised learning, that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help in the semi-supervised learning setting, in which one can analyze both sample-complexity and algorithmic issues. In particular, our model allows us to address in a unified way key issues such as "Under what conditions will unlabeled data help and by how much?" and "How much data should I expect to need in order to perform well?". Another important part of this thesis is Active Learning for which we provide several new theoretical results In particular, this dissertation includes the first Active Learning algorithm which works in the presence of arbitrary forms of noise, as well as margin-based active learning algorithms. In the context of Kernel methods (another flourishing area of machine learning research), this thesis shows how Random Projection techniques can be used to convert a given kernel function into an explicit distribution-dependent set of features, which can then be fed into more general (not necessarily kernelizable) learning algorithms. In addition, this work shows how such methods can be extended to more general pairwise similarity functions and also gives a formal theory which matches the standard intuition that a good kernel function is one that acts as a good measure of similarity. We thus strictly generalize and simplify the existing theory of Kernel methods. Our approach brings a new perspective as well as a more concrete explanation for the effectiveness of kernel methods, which can help in the design of good kernel functions for new learning problems. We also show how we can use this perspective to help thinking about Clustering in a novel way. While the study of clustering is centered around an intuitively compelling goal (and it has been a major tool in many different fields), it has been difficult to reason about it in a generic and unified way in part due to the lack of a general theoretical framework along the lines we have for supervised classification. In our work we develop the first general framework for analyzing accuracy in clustering without making strong probabilistic assumptions. Finally, this dissertation presents new connections between Machine Learning and Mechanism Design. Specifically, we show how machine learning methods can be used to reduce mechanism design problems to standard algorithmic questions for a wide range of revenue maximization problems. Our results substantially generalize previous work based on Random Sampling mechanisms both by broadening the applicability of such mechanisms and by simplifying the analysis. From a learning perspective, these settings present several unique challenges and particularities: the loss function is discontinuous and asymmetric, and the range of bidders' valuations may be large. Thesis Committee: Avrim Blum (Chair) Manuel Blum Yishay Mansour Tom Mitchell Santosh Vempala Peter Lee, Head, Computer Science Department Randy Bryant, Dean, School of Computer Science Keywords: Data dependent concept spaces, clustering, value of unlabeled data, semi-supervised learning, active learning, co-training, similarity-based learning, kernels, margins, low-dimensional mappings, sample complexity, mechanism and auction design, random sampling mechanisms, profit maximization CMU-CS-08-153.pdf (1.57 MB) ( 196 pages) Copyright Notice