
Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets

Published: 27 January 2018

Abstract

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets that are useful in various tasks including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., for obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., for assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as for estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naïve way, in this article, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out so far.
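To make the ideas of a sameAs catalog and an element index concrete, here is a minimal single-machine sketch in Python. It is not the authors' implementation: it swaps in a plain union-find structure to compute the symmetric and transitive closure of owl:sameAs pairs, and it counts common elements of a dataset subset by a naive scan rather than by the lattice-based incremental algorithms the abstract mentions. All names (`SameAsCatalog`, `build_element_index`, `common_elements`) and the example URIs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


class SameAsCatalog:
    """Union-find over URIs; each root identifies one equivalence class."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        # Register unseen URIs as singleton classes.
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[uri] != root:
            self.parent[uri], uri = root, self.parent[uri]
        return root

    def union(self, a, b):
        # Merging classes makes sameAs symmetric and transitive by construction.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_element_index(datasets, same_as_pairs):
    """Map each equivalence class (by its root URI) to the datasets containing it."""
    catalog = SameAsCatalog()
    for a, b in same_as_pairs:
        catalog.union(a, b)
    index = defaultdict(set)
    for name, uris in datasets.items():
        for uri in uris:
            index[catalog.find(uri)].add(name)
    return index


def common_elements(index, subset):
    """Count equivalence classes that occur in every dataset of `subset` (naive scan)."""
    subset = set(subset)
    return sum(1 for found_in in index.values() if subset <= found_in)


# Hypothetical toy data: three datasets and the sameAs links among their URIs.
datasets = {
    "D1": {"ex1:Aristotle", "ex1:Plato"},
    "D2": {"ex2:Aristoteles", "ex2:Socrates"},
    "D3": {"ex3:Aristotle", "ex3:Socrates"},
}
same_as = [
    ("ex1:Aristotle", "ex2:Aristoteles"),
    ("ex2:Aristoteles", "ex3:Aristotle"),
    ("ex2:Socrates", "ex3:Socrates"),
]

if __name__ == "__main__":
    index = build_element_index(datasets, same_as)
    # Enumerate every subset of two or more datasets and report its commonalities.
    for size in range(2, len(datasets) + 1):
        for subset in combinations(sorted(datasets), size):
            print(subset, common_elements(index, subset))
```

The naive scan visits the whole index once per subset, and the number of subsets grows exponentially with the number of datasets. That cost is the motivation for the incremental and parallel approaches described in the abstract.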

Published In

Journal of Data and Information Quality, Volume 9, Issue 3
Special Issue on Improving the Veracity and Value of Big Data
September 2017
140 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3183573

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 January 2018
Accepted: 01 November 2017
Revised: 01 November 2017
Received: 01 March 2017
Published in JDIQ Volume 9, Issue 3

Author Tags

  1. Data quality
  2. big data
  3. connectivity
  4. dataset discovery
  5. dataset selection
  6. lattice of measurements
  7. linked data
  8. mapreduce
  9. spark

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI)
  • European Union's Horizon 2020 research BlueBRIDGE project

Cited By

  • (2022) Modular framework for similarity-based dataset discovery using external knowledge. Data Technologies and Applications 56:4 (506-535). DOI: 10.1108/DTA-09-2021-0261. Online publication date: 15-Feb-2022
  • (2022) Open dataset discovery using context-enhanced similarity search. Knowledge and Information Systems 64:12 (3265-3291). DOI: 10.1007/s10115-022-01751-z. Online publication date: 1-Dec-2022
  • (2022) How Your Cultural Dataset is Connected to the Rest Linked Open Data? Trandisciplinary Multispectral Modelling and Cooperation for the Preservation of Cultural Heritage (136-148). DOI: 10.1007/978-3-031-20253-7_12. Online publication date: 24-Nov-2022
  • (2022) LODChain: Strengthen the Connectivity of Your RDF Dataset to the Rest LOD Cloud. The Semantic Web – ISWC 2022 (537-555). DOI: 10.1007/978-3-031-19433-7_31. Online publication date: 23-Oct-2022
  • (2021) GeoLOD: A Spatial Linked Data Catalog and Recommender. Big Data and Cognitive Computing 5:2 (17). DOI: 10.3390/bdcc5020017. Online publication date: 19-Apr-2021
  • (2021) A Hybrid Approach Combining R*-Tree and k-d Trees to Improve Linked Open Data Query Performance. Applied Sciences 11:5 (2405). DOI: 10.3390/app11052405. Online publication date: 8-Mar-2021
  • (2021) Towards Interactive Analytics over RDF Graphs. Algorithms 14:2 (34). DOI: 10.3390/a14020034. Online publication date: 25-Jan-2021
  • (2021) Large scale services for connecting and integrating hundreds of linked datasets. ACM SIGWEB Newsletter 2021:Autumn (1-4). DOI: 10.1145/3494825.3494828. Online publication date: 3-Dec-2021
  • (2020) The contribution of linked open data to augment a traditional data warehouse. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-020-00594-w. Online publication date: 19-Feb-2020
  • (2020) Analytics over RDF Graphs. Information Search, Integration, and Personalization (37-52). DOI: 10.1007/978-3-030-44900-1_3. Online publication date: 27-Mar-2020
