DOI: 10.1145/2882903.2882924

Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data

Published: 14 June 2016

Abstract

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data.
In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
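
The idea behind the schema generation step can be illustrated with a deliberately naive sketch. This is not the paper's algorithm: it assumes every record has the same attributes, that all leaf values are hashable scalars, and that the attribute to split on is already known. The function names (`flatten`, `holds`, `dependents_of`, `split_entity`) and the sample records are hypothetical. The sketch flattens nested records, checks single-attribute functional dependencies by brute force, and moves the attributes determined by a chosen attribute into their own deduplicated table.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level with dotted attribute names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds: each lhs value maps to one rhs value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def dependents_of(rows, lhs):
    """List the attributes functionally determined by lhs in this sample."""
    return [a for a in sorted(rows[0]) if a != lhs and holds(rows, lhs, a)]


def split_entity(rows, determinant):
    """Move determinant and its dependents into a deduplicated table."""
    deps = dependents_of(rows, determinant)
    entity = []
    for row in rows:
        projected = {c: row[c] for c in [determinant] + deps}
        if projected not in entity:
            entity.append(projected)
    # The original rows keep the determinant as a foreign key.
    remaining = [{c: v for c, v in row.items() if c not in deps} for row in rows]
    return entity, remaining


# Hypothetical denormalized order documents with a repeated nested customer.
records = [
    {"id": 1, "total": 30, "customer": {"id": "c1", "name": "Ann", "city": "Oslo"}},
    {"id": 2, "total": 12, "customer": {"id": "c2", "name": "Bob", "city": "Rome"}},
    {"id": 3, "total": 45, "customer": {"id": "c1", "name": "Ann", "city": "Oslo"}},
]
flat_rows = [flatten(r) for r in records]
customers, orders = split_entity(flat_rows, "customer.id")
print(customers)
print(orders)
```

On this sample the repeated customer "c1" is stored once in `customers`, while `orders` keeps only `id`, `total`, and the `customer.id` reference. A dependency that holds in a small sample can be accidental, so a practical approach would need to choose determinants and validate dependencies far more carefully than this brute-force check does.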

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. JSON
  2. deduplication
  3. denormalized data
  4. entity extraction
  5. functional dependencies
  6. functional dependency mining
  7. key-value data
  8. normalization
  9. relational databases
  10. schema extraction
  11. schema generation
  12. schema matching
  13. semistructured data
  14. semistructured-to-relational mappings

Qualifiers

  • Research-article

Funding Sources

  • National Science Foundation

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
