Normalizing Property Graphs

Published: 01 July 2023

Abstract

Normalization aims to minimize the sources of potential data inconsistency and the update maintenance costs that data redundancy incurs. For relational databases, different classes of dependencies cause data redundancy and have resulted in proposals such as Third, Boyce-Codd, Fourth, and Fifth Normal Form. Features of more advanced data models make it challenging to extend achievements from the relational model to missing, non-atomic, or uncertain data. We initiate research on the normalization of graph data, starting with a class of functional dependencies tailored to property graphs. We show that this class captures important semantics of applications, constitutes a rich source of data redundancy, has an implication problem that can be decided in linear time, and facilitates the normalization of property graphs flexibly tailored to the labels and properties that applications target. We normalize property graphs into Boyce-Codd Normal Form without loss of data and dependencies whenever possible for the target labels and properties, but guarantee Third Normal Form in general. Experiments on real-world property graphs quantify and qualify various benefits of graph normalization: 1) removing redundant property values as sources of inconsistent data, 2) detecting inconsistency as violations of functional dependencies, 3) reducing update overheads by orders of magnitude, and 4) significantly speeding up aggregate queries.
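To make the linear-time implication claim concrete, the following is a minimal sketch of the classical counter-based attribute-closure test for ordinary relational functional dependencies. It is not the paper's algorithm: the dependencies studied in the paper additionally refer to labels and property sets of a property graph, which this sketch ignores. The function names `closure` and `implies` and the property names in the usage lines are hypothetical.

```python
from collections import defaultdict, deque


def closure(attrs, fds):
    """Return all attributes determined by `attrs` under the FDs in `fds`.

    `fds` is a list of (lhs, rhs) pairs, each side an iterable of attribute
    names. Every FD keeps a counter of left-hand-side attributes that are
    not yet derived and fires once that counter reaches zero, so each
    attribute occurrence is processed a constant number of times.
    """
    remaining = [len(set(lhs)) for lhs, _ in fds]
    watchers = defaultdict(list)  # attribute -> indices of FDs using it on the left
    for i, (lhs, _) in enumerate(fds):
        for a in set(lhs):
            watchers[a].append(i)

    derived = set(attrs)
    queue = deque(derived)

    def fire(i):
        # Add the right-hand side of FD i to the derived set.
        for b in fds[i][1]:
            if b not in derived:
                derived.add(b)
                queue.append(b)

    # FDs with an empty left-hand side hold unconditionally.
    for i, count in enumerate(remaining):
        if count == 0:
            fire(i)

    while queue:
        a = queue.popleft()
        for i in watchers[a]:
            remaining[i] -= 1
            if remaining[i] == 0:
                fire(i)
    return derived


def implies(fds, lhs, rhs):
    """Decide whether `fds` logically imply the FD lhs -> rhs."""
    return set(rhs) <= closure(lhs, fds)


# Hypothetical property names, for illustration only.
fds = [({"name"}, {"country"}), ({"country"}, {"currency"})]
print(implies(fds, {"name"}, {"currency"}))
print(implies(fds, {"currency"}, {"name"}))
```

In this classical setting, a dependency whose left-hand side closure does not cover all attributes of interest is the kind that violates Boyce-Codd Normal Form, which is why an efficient closure test matters for normalization.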


Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 11
July 2023
789 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
