DOI: 10.5555/3200334.3200337
research-article

Building and querying semantic layers for web archives

Published: 19 June 2017

Abstract

    Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods remains a major hurdle to turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles ("layers") that describe semantic information about the contents of web archives. A semantic layer allows describing metadata about the archived documents, annotating them with useful semantic information (such as entities, concepts, and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems cannot sufficiently satisfy.
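
To make the idea of a semantic layer concrete, the following is a minimal sketch in plain Python rather than the RDF/S model or distributed framework proposed in the paper. It stores a few subject-predicate-object triples about archived documents and answers a structured question ("which documents captured in a given year mention a given entity?") of the kind a keyword search cannot express directly. The predicate names (`ex:mentions`, `ex:capturedOn`), the document and entity identifiers, and the helper functions are all hypothetical and chosen only for illustration; a real layer would presumably be stored as RDF and queried with SPARQL.

```python
from datetime import date

# Hypothetical vocabulary: illustrative names only, not the paper's RDF/S model.
TYPE = "rdf:type"
MENTIONS = "ex:mentions"
CAPTURED_ON = "ex:capturedOn"

# A toy "semantic layer": (subject, predicate, object) triples describing
# archived documents, their capture dates, and the entities they mention.
triples = {
    ("doc:1", TYPE, "ex:ArchivedDocument"),
    ("doc:1", CAPTURED_ON, date(2012, 3, 14)),
    ("doc:1", MENTIONS, "ex:EntityA"),
    ("doc:2", TYPE, "ex:ArchivedDocument"),
    ("doc:2", CAPTURED_ON, date(2014, 7, 2)),
    ("doc:2", MENTIONS, "ex:EntityB"),
    ("doc:3", TYPE, "ex:ArchivedDocument"),
    ("doc:3", CAPTURED_ON, date(2012, 11, 6)),
    ("doc:3", MENTIONS, "ex:EntityA"),
    ("doc:3", MENTIONS, "ex:EntityB"),
}


def match(pattern, store):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def documents_mentioning(entity, year, store):
    """Sorted ids of documents that mention `entity` and were captured in `year`."""
    mentioning = {s for s, _, _ in match((None, MENTIONS, entity), store)}
    return sorted(
        s for s, _, captured in match((None, CAPTURED_ON, None), store)
        if s in mentioning and captured.year == year
    )


result = documents_mentioning("ex:EntityA", 2012, triples)
print(result)  # ['doc:1', 'doc:3']
```

The query joins two triple patterns on the document identifier, which is the same basic operation a SPARQL engine performs over a published layer.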

    Cited By

    • (2018) Ranking Archived Documents for Structured Queries on Semantic Layers. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 155-164. DOI: 10.1145/3197026.3197049. Online publication date: 23-May-2018
    • (2018) Entity-Aspect Linking. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 49-58. DOI: 10.1145/3197026.3197047. Online publication date: 23-May-2018
    • (2017) Towards a ranking model for semantic layers over digital archives. Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries, pp. 336-337. DOI: 10.5555/3200334.3200400. Online publication date: 19-Jun-2017

      Published In

      JCDL '17: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries
      June 2017
      383 pages
      ISBN: 9781538638613

      Publisher

      IEEE Press

      Acceptance Rates

      Overall Acceptance Rate 415 of 1,482 submissions, 28%
