Bin Fu
Algorithms for Large-Scale Astronomical Problems
Degree Type: Ph.D. in Computer Science
Advisor(s): Jaime Carbonell, Eugene Fink
Graduated: August 2013

Abstract:

Modern astronomical datasets keep growing: they already include billions of celestial objects and occupy terabytes of disk space. Meanwhile, many astronomical applications do not scale well to such large amounts of data, which raises the following question: How can we use modern computer science techniques to help astronomers better analyze large datasets? To answer this question, we applied various computer science techniques to provide fast, scalable solutions to astronomical problems, as follows:

We developed algorithms to work better with big data. We found that for some astronomical problems, the information a user requires for each query covers only a small proportion of the input dataset. We therefore carefully organized the data layout on disk to answer user queries quickly; the resulting technique handles datasets with billions of entries on a single desktop computer.

We made use of database techniques to store and retrieve data. We designed table schemas and query processing functions to maximize their performance on large datasets. Database features such as indexing and sorting further reduce the processing time of user queries.

We processed large datasets using modern distributed computing frameworks. We considered frameworks widely used in astronomy, such as the Message Passing Interface (MPI), as well as emerging frameworks such as MapReduce. The resulting implementations scale well to tens of billions of objects on hundreds of compute cores.

During our research, we noticed that modern computer hardware helps solve some of the sub-problems we encountered. One example is the use of Solid-State Drives (SSDs), whose random access is faster than that of regular hard disk drives.
Graphics Processing Units (GPUs) are another example: under the right circumstances, they achieve a higher level of parallelism than ordinary CPU clusters.

Some astronomical problems are machine learning and statistics problems. For example, the problem of distinguishing quasars from other similar astronomical objects can be formalized as a classification problem. In this thesis, we applied supervised learning techniques to the quasar detection problem. Additionally, in the context of big data, we evaluated existing active learning algorithms, which aim to reduce the total number of human labels required.

All the developed techniques are designed to work with datasets that contain billions of astronomical objects. We have tested them extensively on large datasets and report the running times. We believe the interdisciplinary work between computer science and astronomy has great potential, especially given the trend toward big data.

Thesis Committee:
Jaime Carbonell (Co-Chair)
Eugene Fink (Co-Chair)
Garth Gibson
Michael Wood-Vasey (University of Pittsburgh)

Frank Pfenning, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords: Big data, eScience, astronomy applications, distributed computing, data mining

CMU-CS-13-122.pdf (2.16 MB) (122 pages)