Minglong Shao
Efficient Data Organization and Management on Heterogeneous Storage Hierarchies
Degree Type: Ph.D. in Computer Science
Advisor(s): Anastasia Ailamaki
Graduated: May 2008

Abstract:

Due to preferences for design and implementation simplicity, current data organization and management in database systems are based on simple assumptions about storage devices and workload characteristics. This has been the major design principle since the inception of database systems. While the device- and workload-oblivious approach worked well in the past, it falls short of today's demands for fast data processing on large-scale datasets with varied characteristics. Ignoring the rich and diverse features of both devices and workloads imposes unnecessary performance trade-offs on existing database systems.

This dissertation proposes efficient, flexible, and robust data organization and management for database systems by enhancing the interaction with workloads and hardware devices. It achieves this goal in three steps.

First, a microbenchmark suite is needed for quick and accurate evaluation. The proposed solution is DBmbench, a significantly reduced database microbenchmark suite that simulates OLTP and DSS workloads. DBmbench enables quick evaluation and provides performance forecasting for real large-scale benchmarks.

Second, Clotho investigates how to build a workload-conscious buffer pool manager by utilizing query payload information. Clotho decouples the in-memory page layout from the storage organization by using a new query-specific layout called CSM. Due to its adaptive structure, CSM eliminates the long-standing performance trade-offs of NSM and DSM, thus achieving good performance for both DSS and OLTP applications, two predominant database workloads with conflicting characteristics. Clotho demonstrates that simple workload information, such as query payloads, is of great value for improving performance without increasing complexity.
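
The layout trade-off described for NSM, DSM, and a query-specific layout can be illustrated with a toy sketch. This is not Clotho's implementation: the schema, the function names, and the use of Python lists as "pages" are all assumptions made for illustration, and the last function only mimics the idea of building an in-memory page from the attributes a query actually requests.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical example schema (not from the dissertation): (id, name, balance).
Row = Tuple[int, str, float]
COLUMNS = ("id", "name", "balance")


def nsm_pages(rows: Sequence[Row], rows_per_page: int) -> List[List[Row]]:
    """Row-oriented (NSM-style) layout: each page holds whole records."""
    return [list(rows[i:i + rows_per_page])
            for i in range(0, len(rows), rows_per_page)]


def dsm_pages(rows: Sequence[Row]) -> Dict[str, list]:
    """Column-oriented (DSM-style) layout: one value list per attribute."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def query_specific_page(rows: Sequence[Row], payload: Sequence[str]) -> Dict[str, list]:
    """Toy query-specific layout: keep only the attributes named in the
    query payload, grouped per attribute. This only imitates the idea of
    a payload-driven in-memory page; it is an assumption, not CSM itself."""
    indexes = [COLUMNS.index(name) for name in payload]
    return {COLUMNS[i]: [row[i] for row in rows] for i in indexes}


if __name__ == "__main__":
    rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
    print(nsm_pages(rows, rows_per_page=2))
    print(dsm_pages(rows)["balance"])
    print(query_specific_page(rows, ["id", "balance"]))
```

In this sketch, a record lookup typical of OLTP is served by a single row-oriented page, while a scan over one attribute typical of DSS touches only one column list. The payload-driven page holds exactly the attributes a query names, which is the intuition behind decoupling the in-memory layout from the storage organization.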

The third step looks at how to use hardware information to eliminate performance trade-offs in existing device-oblivious designs. MultiMap is first proposed as a new mapping algorithm that stores multidimensional data on disks without losing spatial locality. MultiMap exploits the new adjacency model of disks to build a multidimensional structure on top of the linear disk space. It outperforms existing mapping algorithms on various spatial queries. Later, MultiMap is extended to organize intermediate results for hash join and external sorting, where the I/O performance of different execution phases exhibits trade-offs similar to those in 2-D data accesses. Our prototype demonstrates an improvement of up to 2 times over the existing implementation in memory-limited executions. The above two projects complete Clotho by showing the benefits of exploiting detailed hardware features.

Thesis Committee:
Anastasia Ailamaki (Chair)
Greg Ganger
Todd Mowry
Per-Åke (Paul) Larson (Microsoft Research)

Peter Lee, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords: Data placement, data organization, data management, benchmark, optimization, buffer pool, database management system, storage, performance, multidimensional

CMU-CS-07-170.pdf (1011.32 KB, 128 pages)