
GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example

Published: 20 June 2023

Abstract

    Data scientists deal with a wide variety of file formats and data representations. Probably the most difficult to handle are custom data formats that liberally define their own particular flat or nested structure with multiple custom delimiters, multi-line records, or undocumented semantics of attribute sequences, co-appearances, and repetitions. As a prerequisite for exploratory ML model training, data scientists need to map these data representations into regular frames or matrices. Unfortunately, existing tools and frameworks provide only limited support for this process, which causes redundant manual effort and unnecessary data quality issues. In this paper, we initiate work on automatic matrix and frame reader generation by example. A user provides a sample of raw text data and its mapped matrix or frame representation. Our GIO framework then first identifies the mapping rules from raw to structured data, and subsequently generates the source code of an efficient, multi-threaded reader for reading full raw datasets of this format. To facilitate manual improvements, both the mapping rules and the generated reader can be modified as needed. Our experiments show that GIO is able to correctly identify the mapping rules for basic text formats such as CSV, LibSVM, and MatrixMarket; custom text formats from publishing, automotive, and health care; as well as various nested formats such as JSON and XML. Additionally, the automatically generated readers yield competitive performance compared to hand-coded readers and tuned libraries like RapidJSON.
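
    To make the "by example" idea concrete, the following is a minimal sketch and not GIO's actual algorithm: it assumes a flat, single-line-per-record format with exactly one delimiter, and all names (`infer_rule`, `make_reader`, the candidate delimiter list) are hypothetical. Given a few raw lines and the rows they should map to, it searches for a delimiter and a column-to-field assignment consistent with every sample row, then builds a reader from that rule.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Rule = Tuple[str, List[int]]  # (delimiter, field index for each output column)


def _same(field: str, value) -> bool:
    """Compare a raw text field with a sample value (numeric-aware)."""
    field = field.strip()
    if isinstance(value, (int, float)):
        try:
            return float(field) == float(value)
        except ValueError:
            return False
    return field == str(value)


def _try_delimiter(raw: Sequence[str], sample: Sequence[Sequence], d: str) -> Optional[List[int]]:
    """Return one field index per sample column, or None if d does not fit."""
    mapping = []
    for c in range(len(sample[0])):
        candidates = None
        for line, row in zip(raw, sample):
            fields = line.split(d)
            hits = {i for i, f in enumerate(fields) if _same(f, row[c])}
            # Keep only field positions that match in every sample row.
            candidates = hits if candidates is None else candidates & hits
        if not candidates:
            return None
        mapping.append(min(candidates))
    return mapping


def infer_rule(raw: Sequence[str], sample: Sequence[Sequence],
               delimiters: Sequence[str] = (",", ";", "\t", "|", " ")) -> Optional[Rule]:
    """Try each candidate delimiter (an assumed list) and return the first fit."""
    for d in delimiters:
        mapping = _try_delimiter(raw, sample, d)
        if mapping is not None:
            return d, mapping
    return None


def make_reader(delim: str, mapping: List[int]) -> Callable[[Sequence[str]], List[List[str]]]:
    """Build a reader that applies the inferred rule to unseen lines."""
    def read(lines: Sequence[str]) -> List[List[str]]:
        return [[line.split(delim)[i].strip() for i in mapping] for line in lines]
    return read


rule = infer_rule(["a|7|skip|2.5", "b|9|skip|1.0"],
                  [["a", 2.5, 7], ["b", 1.0, 9]])
reader = make_reader(*rule)
print(rule, reader(["c|4|skip|0.5"]))
```

    The sketch returns values as strings and ignores nesting, multi-line records, and multi-threading, all of which the abstract names as part of the real problem.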




    Published In

    Proceedings of the ACM on Management of Data, Volume 1, Issue 2
    June 2023
    2310 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. custom data format
    2. data loading
    3. efficient readers
    4. raw data


    • Research-article

