Database Seminar - Xiangpeng Hao

January 21, 2025, 12:00pm - 1:00pm
Location: In Person - Blelloch-Skees Conference Room, Gates Hillman 8115
Speaker: XIANGPENG HAO, Ph.D. Student, Computer Sciences Department, University of Wisconsin-Madison
https://haoxp.xyz/

Modern data analytics embrace a disaggregated architecture which decouples storage, cache, and compute into network-connected independent components. With disaggregated cache, a key design decision is whether to push down query predicates to the cache server. Without predicate pushdown, the cache must send all data to compute nodes, creating network bottlenecks. With predicate pushdown, the cache server evaluates predicates on cached data, but its limited computational resources become the bottleneck.

In this talk, we introduce SplitSQL, a pushdown cache system with efficient predicate evaluation. Our system is built upon a surprising observation: pushdown cost is dominated by decoding data, not predicate evaluations. SplitSQL reduces decoding overhead by transcoding storage formats (like Parquet) into a cache-optimized format that enables predicate evaluation on encoded data and supports efficient, fine-grained decoding. Implemented on Apache DataFusion, SplitSQL achieves both low network traffic and significantly reduced computational overhead compared to conventional pushdown systems. Experiments on ClickBench show that SplitSQL's cache-specific format delivers up to 3x end-to-end performance improvement while maintaining compression ratio on par with the original storage format.

Xiangpeng Hao is a PhD student at the University of Wisconsin-Madison studying computer science with a focus on database/storage systems.

Event Website: https://db.cs.cmu.edu/events/splitsql-practical-pushdown-cache-for-datalake-analytics-xiangpeng-hao
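
As a rough illustration of what "predicate evaluation on encoded data" and "fine-grained decoding" can mean, the sketch below uses plain dictionary encoding as a stand-in. The announcement does not describe SplitSQL's actual cache format, so this is not the system's implementation; the function names (dictionary_encode, filter_encoded, decode_rows) and the sample column are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Encode a column as (distinct values, one integer code per row)."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def filter_encoded(
    dictionary: Sequence[str],
    codes: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[int]:
    """Return matching row ids without decoding the column.

    The predicate runs once per distinct value; rows are then matched
    by comparing integer codes only.
    """
    matching_codes = {
        code for code, value in enumerate(dictionary) if predicate(value)
    }
    return [row for row, code in enumerate(codes) if code in matching_codes]


def decode_rows(
    dictionary: Sequence[str], codes: Sequence[int], rows: Sequence[int]
) -> List[str]:
    """Decode only the selected rows instead of the whole column."""
    return [dictionary[codes[row]] for row in rows]


# Hypothetical sample column; assumption for illustration only.
column = ["chrome", "firefox", "chrome", "safari", "firefox", "chrome"]
dictionary, codes = dictionary_encode(column)
rows = filter_encoded(dictionary, codes, lambda v: v.startswith("f"))
result = decode_rows(dictionary, codes, rows)
```

In this toy setup the string predicate is evaluated three times (once per distinct value) rather than six times (once per row), and only the two matching rows are ever decoded. That is the general shape of the saving the abstract attributes to avoiding full decoding, under the assumption that the encoding exposes its distinct values cheaply.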