Dan Pelleg

Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection Degree Type: Ph.D. in Computer Science
Advisor(s): Andrew Moore
Graduated: May 2004

Abstract:

Originally, astronomers dealt with stars. Later, with galaxies. Today, large scale cosmological structures are so complex, they must be first reduced into more succinct representations. For example, a universe simulation containing millions of objects is characterized by its halo occupation distribution.

This progression is typical of many disciplines of science, and even resonates in our daily lives. The easier it is for us to collect new data, store it and manage it, the harder it becomes to keep up with what it all means. For that we need to develop tools capable of mining big data sets.

This new generation of data analysis tools must meet the following requirements. They have to be fast and scale well to big data. Their output has to be straightforward to understand and easy to visualize. They need to only ask for the minimum of user input - ideally they would run completely autonomously once given the data.

I focus on clustering. Its main advantage is its generality. Separating data into groups of similar objects reduces the perception problem significantly. In this context, I propose new algorithms and tools to meet the challenges: an extremely fast spatial clustering algorithm, which can also estimate the number of clusters; a novel and highly comprehensible mixture model; a sub-linear learner for dependency trees; and an active learning framework to minimize the burden on a human expert hunting for rare anomalies. I implemented the algorithms and used them with very large data sets in a wide variety of applications, including astrophysics.

Thesis Committee:
Andrew Moore (Chair)
Manuela Veloso
Geoffrey Gordon
Nir Friedman (Hebrew University)

Randy Bryant, Head, Computer Science Department
James Morris, Dean, School of Computer Science

Keywords:
Machine learning, scientific discovery, mixture models, probability density estimators, efficient algorithms

CMU-CS-04-134.pdf (9.05 MB) ( 166 pages)
Copyright Notice