Comparing data summaries for processing live queries over Linked Data

Published: 01 October 2011

Abstract

    A growing amount of Linked Data (graph-structured data accessible at sources distributed across the Web) enables advanced data integration and decision-making applications. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralised repository. Because crawling and indexing are time-consuming operations, the data in the centralised index may be out of date at query execution time. An ideal system for querying Linked Data live should return current answers in a reasonable amount of time, even on corpora as large as the Web. In such a live query system, source selection (determining which sources contribute answers to a query) is a crucial step. In this article we propose to use lightweight data summaries for determining relevant sources during query evaluation. We compare several data structures and hash functions with respect to their suitability for building such summaries, stressing benefits for queries that contain joins and require ranking of results and sources. We elaborate on join variants, join ordering and ranking. We analyse the different approaches theoretically and provide results of an extensive experimental evaluation.
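
To make the idea of summary-based source selection concrete, the following is a minimal sketch and not the article's actual data structures. It assumes a hypothetical `SourceSummary` class that hashes each term of an RDF triple into a fixed number of buckets (`NUM_BUCKETS` is an arbitrary choice) and records which sources contributed to each bucket. The CRC32 hash, the class and method names, and the example URIs are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 64  # hypothetical summary size per triple position


def bucket(term: str) -> int:
    """Map an RDF term to a bucket using a stable (non-randomised) hash."""
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


class SourceSummary:
    """Toy summary: for each triple position, bucket -> set of source URIs."""

    def __init__(self):
        # One bucket map each for subject, predicate and object.
        self.index = [defaultdict(set) for _ in range(3)]

    def add(self, source, triple):
        """Record that `source` contains `triple` (subject, predicate, object)."""
        for pos, term in enumerate(triple):
            self.index[pos][bucket(term)].add(source)

    def select(self, pattern):
        """Return sources that may match a triple pattern; None marks a variable."""
        candidates = None
        for pos, term in enumerate(pattern):
            if term is None:
                continue
            hits = self.index[pos].get(bucket(term), set())
            candidates = set(hits) if candidates is None else candidates & hits
        if candidates is None:
            # No constants in the pattern: every summarised source qualifies.
            candidates = set().union(*(s for d in self.index for s in d.values()))
        return candidates


# Usage with made-up sources and triples.
summary = SourceSummary()
summary.add("http://example.org/alice", ("ex:alice", "foaf:knows", "ex:bob"))
summary.add("http://example.org/bob", ("ex:bob", "foaf:name", '"Bob"'))
relevant = summary.select((None, "foaf:knows", None))
print(sorted(relevant))
```

In this sketch a source that really contains a matching triple is always returned, but hash collisions and the independent treatment of the three positions can add sources that contribute nothing. That trade-off between summary size and selection precision is the kind of property a comparison of data structures and hash functions would need to measure.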


    Published In

    World Wide Web, Volume 14, Issue 5-6
    October 2011
    243 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. Linked Data
    2. RDF querying
    3. index structures

    Qualifiers

    • Article
