Parallel Data Laboratory Summer Talk Series - Jacob Baskin

July 31, 2024, 12:00pm - 1:00pm
Location: Virtual Presentation - ET - Remote Access - Zoom

Speaker: JACOB BASKIN, Software Engineer, Jane Street
https://www.linkedin.com/in/jacob-baskin-4a0a29b1/

Superstore: What We Learned Building a Data Warehouse

In 2022, Jane Street decided to build an on-premises data warehouse, called Superstore, which launched in 2023 and stores about 2PB of data. While we used existing software for most of the heavy lifting, some of our design decisions were a bit more customized. In this talk, I will give a brief architecture overview of Superstore and discuss the choices we made, how they worked in practice, and what we could or should have done differently. Is Parquet the storage format of the future? Does data locality matter? How do you efficiently handle arbitrarily wide data sets with a fixed amount of RAM? Our opinions on all these questions have changed significantly in the past year.

Jacob Baskin is a software engineer at Jane Street. His previous jobs have included CTO and co-founder of Coord, an urban transportation startup, and software engineer at Google. His focus is on managing data effectively at the application level at scales ranging from "big" to "artisanal small-batch". He graduated from Brown University with a B.A. in Computer Science.

Zoom Participation. See announcement.
Event Website: https://pdl.cmu.edu/talk-series/2024/073124.shtml