At Carnegie Mellon, we’ve taken on data-intensive scalable computing (DISC) as a major focus of our research and educational efforts.
We believe that the potential applications for data‐intensive computing are nearly limitless, that many challenging and exciting research problems arise when trying to scale up our systems and computations to handle terabyte‐scale datasets, and that we need to expose our students to the technologies that will help them cope with the data‐intensive world they will live in.
There are many important research problems to be addressed to fully realize the potential of data-intensive computing. How can processor, storage, and networking hardware be designed to improve performance, energy efficiency, and reliability? How can we run a collection of data-intensive computations on the system simultaneously? What programming models and languages can we devise to support forms of computation that do not fit well in the Map/Reduce model? What machine-learning algorithms can scale to datasets with billions of elements? As a research organization, the School of Computer Science views DISC as a rich source of exciting opportunities.
To deal with these large-scale datasets, we need to spread the data across many disk drives, possibly hundreds, so that we can access large amounts of information in parallel. These disks also need processors and networking, so we incorporate them into a cluster computing system that can be set up in a machine room, with the nodes and power supplies mounted in racks and connected by cables. We are interested in the reliability of the hard drives that underpin cloud computing and in the parallel-programming issues these facilities raise, because answering those questions will help projects that require very large datasets to be analyzed quickly and accurately.
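The idea of spreading data across many drives so it can be read back in parallel can be illustrated with a minimal sketch. This is not any particular system's implementation: the stripe size, file naming, and use of temporary directories to stand in for separate disk drives are all assumptions made for the example, and threads stand in for the per-drive I/O parallelism a real cluster would get from independent spindles and nodes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe for the demo; real systems use much larger chunks


def stripe_write(data: bytes, dirs: list) -> int:
    """Split `data` into fixed-size stripes, writing stripe i to dirs[i % len(dirs)].

    Each directory stands in for one disk drive in the cluster.
    Returns the number of stripes written.
    """
    stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    for i, chunk in enumerate(stripes):
        path = os.path.join(dirs[i % len(dirs)], f"stripe_{i:06d}")
        with open(path, "wb") as f:
            f.write(chunk)
    return len(stripes)


def stripe_read(n_stripes: int, dirs: list) -> bytes:
    """Read all stripes concurrently and reassemble them in order.

    One worker per "drive" models overlapping the I/O of independent disks;
    pool.map preserves stripe order, so the result matches the original bytes.
    """
    def read_one(i: int) -> bytes:
        path = os.path.join(dirs[i % len(dirs)], f"stripe_{i:06d}")
        with open(path, "rb") as f:
            return f.read()

    with ThreadPoolExecutor(max_workers=len(dirs)) as pool:
        return b"".join(pool.map(read_one, range(n_stripes)))


if __name__ == "__main__":
    # Four temporary directories stand in for four disk drives.
    dirs = [tempfile.mkdtemp() for _ in range(4)]
    data = b"terabyte-scale datasets need many parallel disks"
    n = stripe_write(data, dirs)
    assert stripe_read(n, dirs) == data
```

On a single machine the threads mostly serialize on one disk; the point of the cluster architecture described above is that with hundreds of drives, each stripe read really does proceed independently, so aggregate bandwidth scales with the number of disks.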