GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example

Authors:

Saeed Fathollahzadeh,

Matthias Boehm

Proceedings of the ACM on Management of Data, Volume 1, Issue 2

Article No.: 120, Pages 1 - 26

https://doi.org/10.1145/3589265

Published: 20 June 2023

Abstract

Data scientists deal with a wide variety of file formats and data representations. Probably the most difficult to handle are custom data formats that liberally define their own flat or nested structure with multiple custom delimiters, multi-line records, or undocumented semantics of attribute sequences, co-appearances, and repetitions. As a prerequisite for exploratory ML model training, data scientists need to map these data representations into regular frames or matrices. Unfortunately, existing tools and frameworks provide only limited support for this process, which causes redundant manual effort and unnecessary data quality issues. In this paper, we initiate work on automatic matrix and frame reader generation by example. A user provides a sample of raw text data and its mapped matrix or frame representation. Our GIO framework then first identifies the mapping rules from raw to structured data, and subsequently generates source code of an efficient, multi-threaded reader for reading full raw datasets of this format. To facilitate manual improvements, both the mapping rules and the generated reader can be modified as needed. Our experiments show that GIO is able to correctly identify the mapping rules for basic text formats like CSV, LibSVM, and MatrixMarket; custom text formats from publishing, automotive, and health care; as well as various nested formats such as JSON and XML. Additionally, the automatically generated readers yield competitive performance compared to hand-coded readers and tuned libraries like RapidJSON.
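To make the by-example idea concrete, the following is a deliberately tiny sketch, not GIO's actual rule-identification algorithm or generated code. It assumes a flat, single-delimiter numeric format; the function names `infer_delimiter` and `make_reader` and the candidate delimiter set are hypothetical choices made for illustration. Given a raw sample and its mapped matrix, it searches for a delimiter that reproduces the matrix and then builds a reader for further data of the same format.

```python
from typing import Callable, Iterable, List, Optional

Matrix = List[List[float]]


def infer_delimiter(sample_raw: List[str], sample_matrix: Matrix,
                    candidates: str = ",;\t |") -> Optional[str]:
    """Return the first candidate delimiter that maps the raw sample lines
    exactly onto the provided matrix, or None if no candidate fits."""
    for delim in candidates:
        try:
            parsed = [[float(tok) for tok in line.split(delim)]
                      for line in sample_raw]
        except ValueError:
            # Some token was not numeric under this delimiter; try the next.
            continue
        if parsed == sample_matrix:
            return delim
    return None


def make_reader(delim: str) -> Callable[[Iterable[str]], Matrix]:
    """Build a reader that applies the identified rule to new lines,
    skipping blank lines."""
    def read(lines: Iterable[str]) -> Matrix:
        return [[float(tok) for tok in line.split(delim)]
                for line in lines if line.strip()]
    return read


# The user supplies a raw sample and the matrix it should map to.
sample_raw = ["1;2.5;3", "4;5;6"]
sample_matrix = [[1.0, 2.5, 3.0], [4.0, 5.0, 6.0]]

delim = infer_delimiter(sample_raw, sample_matrix)
reader = make_reader(delim)
full_data = reader(["7;8;9", "", "10;11;12"])
```

The formats named in the abstract (multi-line records, nested JSON or XML, multiple custom delimiters) need far richer mapping rules than a single delimiter; the sketch only shows the input and output contract of reader generation by example.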

Index Terms

GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
        Data scans
      2. Data layout
        Record and block layout
    2. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 2

PACMMOD

June 2023

2310 pages

EISSN:2836-6573

DOI:10.1145/3605748

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States
