Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data

Authors: Michael DiScala, Daniel J. Abadi

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 295 - 310

https://doi.org/10.1145/2882903.2882924

Published: 14 June 2016

Abstract

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data.

In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
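The schema generation idea described above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that every record has the same attributes, that all values are hashable, and that the only relationships of interest are exact functional dependencies between single attributes. The helper names (`flatten`, `functional_dependencies`, `extract_entity`) and the sample documents are hypothetical and exist only for illustration.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def functional_dependencies(rows):
    """Return pairs (a, b) such that each value of a maps to one value of b.

    Assumption: every row has the same attributes and hashable values.
    """
    attrs = sorted(rows[0])
    fds = set()
    for a in attrs:
        for b in attrs:
            if a == b:
                continue
            seen = {}
            # setdefault records the first b-value seen for each a-value;
            # any later mismatch means a does not determine b.
            if all(seen.setdefault(row[a], row[b]) == row[b] for row in rows):
                fds.add((a, b))
    return fds


def extract_entity(rows, key, dependents):
    """Move `dependents` into a table keyed by `key`, removing repetition."""
    entity = {}
    remainder = []
    for row in rows:
        entity[row[key]] = {d: row[d] for d in dependents}
        remainder.append({k: v for k, v in row.items() if k not in dependents})
    return entity, remainder


# Hypothetical denormalized documents: the user is repeated in each post.
docs = [
    {"id": 1, "text": "hello", "user": {"id": 10, "name": "ann"}},
    {"id": 2, "text": "world", "user": {"id": 10, "name": "ann"}},
    {"id": 3, "text": "again", "user": {"id": 11, "name": "bob"}},
]

rows = [flatten(d) for d in docs]
fds = functional_dependencies(rows)

# user.id determines user.name but not the post id, so the user
# attributes can be split into their own table.
users, posts = extract_entity(rows, "user.id", ["user.name"])
```

In this sketch, `users` holds one row per distinct `user.id`, and each row in `posts` keeps only `user.id` as a foreign key. A realistic implementation would also need to handle missing attributes, arrays, and dependencies that hold only approximately.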

Index Terms

Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Information integration
  2. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016, 2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 30

Total Downloads: 775

Downloads (Last 12 months): 48

Downloads (Last 6 weeks): 1

Reflects downloads up to 30 Aug 2024

Citations

Cited By

Afonso A, Martins P, da Silva A (2024). SEREIA: document store exploration through keywords. Knowledge and Information Systems. Online publication date: 10-Jun-2024.
https://doi.org/10.1007/s10115-024-02151-1
Saeedan M, Eldawy A, Zhao Z (2023). dsJSON: A Distributed SQL JSON Processor. Proceedings of the ACM on Management of Data 1:1 (1-25). Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588957
Yuan G, Lu J, Yan Z, Wu S (2023). A Survey on Mapping Semi-Structured Data and Graph Data to Relational Data. ACM Computing Surveys 55:10 (1-38). Online publication date: 2-Feb-2023.
https://dl.acm.org/doi/10.1145/3567444
Yuan Q, Yuan Y, Wen Z, Wang H, Tang S, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B (2023). An Effective Framework for Enhancing Query Answering in a Heterogeneous Data Lake. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (770-780). Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3539618.3591637
Cao Y, Fan W, Fu W, Jin R, Ou W, Yi W (2023). Extracting Graphs Properties with Semantic Joins. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (2262-2275). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00175
Priya D, Thilagam P (2022). JSON document clustering based on schema embeddings. Journal of Information Science (016555152211165). Online publication date: 12-Sep-2022.
https://doi.org/10.1177/01655515221116522
Forresi C, Francia M, Gallinucci E, Golfarelli M (2022). Cost-based Optimization of Multistore Query Plans. Information Systems Frontiers 25:5 (1925-1951). Online publication date: 4-Oct-2022.
https://doi.org/10.1007/s10796-022-10320-2
Baazizi M, Colazzo D, Ghelli G, Sartiani C (2022). Parametric schema inference for massive JSON datasets. The VLDB Journal 28:4 (497-521). Online publication date: 10-Mar-2022.
https://dl.acm.org/doi/10.1007/s00778-018-0532-7
Leclercq É, Gillet A, Grison T, Savonnet M (2022). Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. Transactions on Large-Scale Data- and Knowledge-Centered Systems XLII (51-90). Online publication date: 11-Mar-2022.
https://dl.acm.org/doi/10.1007/978-3-662-60531-8_3
Yusop M, Ghazali M, Yusof M, Remli M (2021). Pipeline Condition Assessment by Instantaneous Frequency Response over Hydroinformatics Based Technique: An Experimental and Field Analysis. Fluids 6:11 (373). Online publication date: 21-Oct-2021.
https://doi.org/10.3390/fluids6110373
