DOI: 10.1145/3299869.3319896
research-article

FishStore: Faster Ingestion with Subset Hashing

Published: 25 June 2019

Abstract

The last decade has witnessed a huge increase in data being ingested into the cloud, in forms such as JSON, CSV, and binary formats. Traditionally, data is either ingested into storage in raw form, indexed ad-hoc using range indices, or cooked into analytics-friendly columnar formats. None of these solutions is able to handle modern requirements on storage: making the data available immediately for ad-hoc and streaming queries while ingesting at extremely high throughputs. This paper builds on recent advances in parsing and indexing techniques to propose FishStore, a concurrent latch-free storage layer for data with flexible schema, based on multi-chain hash indexing of dynamically registered predicated subsets of data. We find predicated subset hashing to be a powerful primitive that supports a broad range of queries on ingested data and admits a high-performance concurrent implementation. Our detailed evaluation on real datasets and queries shows that FishStore can handle a wide range of workloads and can ingest and retrieve data at an order of magnitude lower cost than state-of-the-art alternatives.
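
As a rough illustration of the idea of hash indexing over dynamically registered predicated subsets, the following minimal sketch keeps an append-only log and links every record into one hash chain per (predicate, value) pair that it matches. It is not FishStore's implementation or API: the class SubsetHashLog and its methods register, ingest, and scan are hypothetical names, the code is single-threaded rather than latch-free, it holds everything in memory, and it parses each record fully with the standard json module.

```python
import json
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

# A predicate maps a parsed record to a hashable value, or to None when the
# record belongs to no subset of that predicate. (Hypothetical interface.)
Predicate = Callable[[dict], Optional[Any]]
ChainKey = Tuple[int, Any]  # (predicate id, predicate value)


class SubsetHashLog:
    """Toy append-only log with one hash chain per (predicate, value) subset."""

    def __init__(self) -> None:
        # Each log entry stores the raw record and, for every chain it is on,
        # the address of the previous record on that chain (or None).
        self._log: List[Tuple[str, Dict[ChainKey, Optional[int]]]] = []
        # Hash table from chain key to the address of the newest chain member.
        self._heads: Dict[ChainKey, int] = {}
        self._predicates: Dict[int, Predicate] = {}
        self._next_id = 0

    def register(self, predicate: Predicate) -> int:
        """Register a predicate at runtime; only later records are indexed."""
        pid = self._next_id
        self._next_id += 1
        self._predicates[pid] = predicate
        return pid

    def ingest(self, raw: str) -> int:
        """Append a raw JSON record and link it into every matching chain."""
        record = json.loads(raw)
        address = len(self._log)
        links: Dict[ChainKey, Optional[int]] = {}
        for pid, predicate in self._predicates.items():
            value = predicate(record)
            if value is None:
                continue  # record is not part of this predicate's subsets
            key = (pid, value)
            links[key] = self._heads.get(key)  # previous member, if any
            self._heads[key] = address         # this record is the new head
        self._log.append((raw, links))
        return address

    def scan(self, pid: int, value: Any) -> Iterator[str]:
        """Yield the raw records of one subset, newest first."""
        key = (pid, value)
        address = self._heads.get(key)
        while address is not None:
            raw, links = self._log[address]
            yield raw
            address = links[key]


store = SubsetHashLog()
by_lang = store.register(lambda r: r.get("lang"))
is_big = store.register(lambda r: True if r.get("size", 0) > 100 else None)

store.ingest('{"id": 1, "lang": "en", "size": 50}')
store.ingest('{"id": 2, "lang": "fr", "size": 500}')
store.ingest('{"id": 3, "lang": "en", "size": 700}')

english_ids = [json.loads(r)["id"] for r in store.scan(by_lang, "en")]
big_ids = [json.loads(r)["id"] for r in store.scan(is_big, True)]
```

In this sketch a single record can sit on several chains at once, which is one way to read "multi-chain", and a lookup follows only the chain of the requested subset instead of scanning the whole log.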

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. csv
  2. flexible schema
  3. hashing
  4. indexing
  5. ingestion
  6. json
  7. key-value store
  8. parsing
  9. performance
  10. predicates
  11. subsets

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 24
  • Downloads (last 6 weeks): 3
Reflects downloads up to 23 Dec 2024

Cited By

  • (2024) On-demand JSON: A better way to parse documents? Software: Practice and Experience 54:6 (1074-1086). DOI: 10.1002/spe.3313. Online publication date: 18-Jan-2024
  • (2022) Design and Implementation of an Efficient Key-Value Storage Engine for Mobile Edge Computing. Journal of Digital Contents Society 23:5 (921-927). DOI: 10.9728/dcs.2022.23.5.921. Online publication date: 31-May-2022
  • (2022) Fast JSON parser using metaprogramming on GPU. 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA) (1-10). DOI: 10.1109/DSAA54385.2022.10032381. Online publication date: 13-Oct-2022
  • (2022) Workload-optimized sensor data store for industrial IoT gateways. Future Generation Computer Systems 135 (394-408). DOI: 10.1016/j.future.2022.05.012. Online publication date: Oct-2022
  • (2021) JSON Tiles: Fast Analytics on Semi-Structured Data. Proceedings of the 2021 International Conference on Management of Data (445-458). DOI: 10.1145/3448016.3452809. Online publication date: 9-Jun-2021
  • (2019) FishStore. Proceedings of the VLDB Endowment 12:12 (1922-1925). DOI: 10.14778/3352063.3352100. Online publication date: 1-Aug-2019
  • (2019) Parsing gigabytes of JSON per second. The VLDB Journal 28:6 (941-960). DOI: 10.1007/s00778-019-00578-5. Online publication date: 11-Oct-2019
