
Bringing Semantics into Historical Archives with Computer-aided Rich Metadata Generation

Published: 16 September 2022

Abstract

This article relies on the idea that a semantically rich metadata layer is required in order to provide effective, intelligent, and engaging access to historical archives. However, building such a semantic layer represents a well-known bottleneck that can be overcome only by a hybrid strategy, integrating user-generated content and automatic techniques. The PRiSMHA project provides a contribution in this direction with the design and development of the prototype of an ontology-driven platform supporting users in semantic metadata generation. In particular, the main contribution of this article is to show how automatic information extraction techniques (namely, Named Entity and Temporal Expression Recognition) and information retrieved from external datasets in the LOD cloud can support users in the identification and characterization of new entities with which to annotate documents.

1 Introduction

In recent years, awareness of the great value of stories and memories stored in historical archives has increased, and considerable effort worldwide has been devoted to the digital curation of archival resources [Lee and Tibbo 2011; Post et al. 2019], including document digitization, metadata generation, and the development of access tools. Many initiatives, for example, try to use storytelling techniques to provide a large public with an immersive experience of interacting with historical memories; see, for instance, projects like CrossCult (www.crosscult.eu) or meSch (www.mesch-project.eu); see also Battad et al. [2019], Underberg-Goode [2017], Lombardo et al. [2016], and Damiano and Lombardo [2016], among many others.
However, many archives host a huge amount of undigitized material, and digital catalogs often consist of quite poor metadata, usually describing resources at the fonds or series level, while the actual content of single documents is neglected. Yet detailed knowledge about what individual archival documents talk about is precisely what is needed to provide effective, intelligent, and engaging access to historical archives. The idea that a semantically rich metadata layer is required to enhance access to archival resources is shared in the Digital Humanities community; see, for instance, Motta et al. [2000] and Goy et al. [2015].
The PRiSMHA (Providing Rich Semantic Metadata for Historical Archives) project [Goy et al. 2017] aims at providing a contribution in this direction by designing an ontology-driven platform that supports the semantic metadata generation needed to offer effective access to archival documents. PRiSMHA is a 3-year (2017–2020) Italian project, funded by Compagnia di San Paolo Foundation and Università di Torino. It involves the Computer Science and Historical Studies Departments of the same university and the Fondazione Istituto piemontese Antonio Gramsci (www.gramscitorino.it), a member of the Polo del ‘900 (www.polodel900.it), a cultural institution headquartered in Torino hosting a very rich archive (www.polodel900.it/9centro).
The main idea underlying the PRiSMHA approach is that only a crowdsourcing model, coupled with automatic techniques, can enable the (collaborative) construction of the semantic layer required to guarantee content-based access to historical archives. In order to test and demonstrate the feasibility of this approach, the PRiSMHA team developed a proof-of-concept prototype (see Section 3), running on a small set of Istituto Gramsci's collections (slightly more than 200 documents) related to the students’ and workers’ protests of 1968–1969 in Italy [Goy et al. 2019]; besides a few pictures and newspaper articles, these resources are mainly typewritten leaflets, often with handwritten annotations and drawings (see Figures 1 and 2).
Fig. 1. A leaflet about a strike at FIAT (copyright: Fondazione Istituto piemontese Antonio Gramsci Onlus).
Fig. 2. A leaflet mentioning Guido Viale (copyright: Fondazione Istituto piemontese Antonio Gramsci Onlus).
Let us introduce a scenario with the goal of providing the reader with the motivations underlying the research in the PRiSMHA project. Consider Antonio, a schoolteacher who wants to enrich his lessons with information taken directly from original documents. He is talking to his students about protest actions that took place in Torino in 1968. In particular, he is searching a digital online archive system, looking for leaflets referring to strikes that both students and workers participated in. Even with a (very) good OCR tool, if the system is based on a keyword search mechanism, the results of a query for “sciopero” (strike) would not include the document shown in Figure 1: the document, in fact, does not contain the word “sciopero/i” (strike/strikes), although it actually talks about a strike, using the very specific word “picchetti” (pickets). Antonio also looks for leaflets mentioning specific people (e.g., Guido Viale, a leader of the ‘68 Movement in Torino) involved in protest actions. The results of a query for “Guido Viale” in a keyword-based system would probably include the leaflet in Figure 2. However, that leaflet would not be retrieved if the query also contains (in an AND combination) a keyword for the action (such as “sciopero”/strike, “manifestazione”/demonstration, and so on), since the protest action is not explicitly mentioned, although it is clear to a human reader that the document talks about a protest action.
Moreover, Guido Viale, although mentioned in the document, is not an active participant; the leaflet indeed says that he has been arrested and the demonstration is organized to ask for his release (he could be considered the “topic” of the protest action, or a participant in his release from prison, i.e., in the event representing the goal of the protest action).
If Antonio were searching for the active involvement of Guido Viale in protest actions, for example, he would need a tool enabling him to ask for all documents talking about any type of protest action having a relation of type active participation with the person Guido Viale. A simplified sketch of the knowledge involved in such a query is shown in Figure 3, where rounded squares represent concepts (i.e., ontology classes), ovals represent individuals (class instances), and arrows represent relations (see Section 3).
Fig. 3. Simplified graphical semantic representation of participation of Guido Viale in protest actions, with links to archival documents.
This scenario demonstrates that, in order both to let users pose such queries and to be able to answer them, the system needs a semantic layer over archival documents, containing a formal, machine-readable representation of their content, based on the conceptual vocabulary provided by computational ontologies.
Such a semantic layer could also be exploited by third-party applications, ranging from education-oriented games to citizen services and history-aware tourist guides, which could exploit knowledge about places, people, and organizations involved in the narrated events. Moreover, these rich semantic metadata are linked to the archival resources, thereby providing applications with access to the original documents and offering archival institutions a great opportunity to turn their heritage into a visible, living resource.
However, given the conceptual vocabulary (i.e., concepts and relations), who builds the potentially extremely huge semantic knowledge base containing the formal representation of the content of archival documents?
We believe that the bottleneck represented by the population of the semantic knowledge base of systems like PRiSMHA can only be overcome by implementing a hybrid strategy that integrates user-generated content and automatic techniques [Foley et al. 2017]. In particular, the approach described in this article builds on the work presented in Goy et al. [2020], where we described our solution to provide users with an ontology-driven user interface enabling them to build the semantic layer, by “annotating” archival documents with semantic representations of their content. In this article we show how this activity can be supported by exploiting automatic information extraction techniques (namely, Named Entity Recognition) and linking to external resources (such as Wikidata).
Therefore, the research question of the work presented in this article is the following:
Given an ontology-driven web-based system enabling users to build the formal semantic representations of archival document content, can automatic text mining techniques (such as Named Entity Recognition) and entity linking to external resources (Linked Open Data) provide users with effective support in the annotation activity?
The main contribution of this article is to provide an answer to this research question by describing and evaluating the support provided to users of the annotation platform.
The rest of the article is organized as follows: Section 2 discusses the most relevant related works. Section 3 provides the background by briefly introducing the PRiSMHA architecture with its main modules, the underlying ontology, and the main steps of the interaction with the annotation platform. Section 4 is devoted to the presentation of the details about the automatic support provided by Information Extraction and linking to external datasets in the LOD cloud. Finally, Section 5 presents the results of a qualitative evaluation of the mentioned support, while Section 6 concludes the article by sketching the main future work directions.

2 Related Work

Several research areas are related to the approach presented in this article: ontology-based data access, ontology-based/semantic annotation (in particular in relation to Computational Archival Science), and, partially, ontology integration. In the work presented in this article, we also use NLP techniques to extract relevant entities from texts. However, since we rely on state-of-the-art tools, without the ambition of advancing the field, we do not review related work about NLP in this section; we simply provide an overview in Section 4.1.
The first research area worth mentioning is ontology-based data access. Considerable effort has been invested in making metadata resources more accessible to users [Walsh and Hall 2015; Kollia et al. 2012; Tonkin and Tourte 2016; Windhager et al. 2016]. These projects aim at increasing resource accessibility for end users, while in PRiSMHA the issue of accessibility is mainly considered in the context of data acquisition, where the user has to deal with the annotation tool to equip documents with semantic metadata. The goal of the PRiSMHA project, in fact, is to provide users with an ontology-based platform that enables them to annotate documents, offering support in the identification and semantic characterization of the entities used in the annotation activity.
Probably the closest research area is ontology-based annotation, for which several ontologies have been developed in the Cultural Heritage domain [Alma'aitah et al. 2020].
Many studies have addressed the development of ontologies for annotating documents in this domain according to a semantic model. Laclavik et al. [2006], for example, developed a tool for ontology-based text annotation called OnTeA that processes text documents by employing regular expressions and detects equivalent semantic elements (i.e., elements that share the same meaning) according to the defined domain ontology. Lana et al. [2014] built an ontology, in the framework of the Geolat project (https://geolat.uniupo.it/), to make Latin literature accessible. Garozzo et al. [2017] developed CulTO, a tool relying on an ontology specifically designed to support the annotation of photographic data and text documents related to historical buildings. Carboni and De Luca [2017] developed the Visual and Iconographical Representations (VIR) ontology to record statements about the physical and conceptual nature of the heritage.
Andrews et al. [2012] identify four types of annotation: tags, attributes, relations, and ontologies. In particular, ontology annotation (or semantic annotation) enables annotators to associate a resource with a semantic tag relying on the vocabulary provided by an ontology. The framework proposed by Andrews et al. assumes that the ontology completely describes the domain of the examined resources: users are requested to provide a link from a resource to the ontology. For example, consider a document about a strike by FIAT employees: the aforementioned framework assumes that the ontology already contains the characterization of the FIAT company and its employees, and users are called to link the document with both entities.
This approach is the closest to PRiSMHA, although the PRiSMHA point of view on annotations and on the role of users is slightly different. In PRiSMHA, the contribution of users is twofold: (1) populating a semantic knowledge base, where entities of different types (events, persons, organizations, places, etc.) are represented, and (2) linking archival documents to entities in the knowledge base, in order to describe their content. The users themselves, supported by the automatic tools described in Section 4, build the semantic description of entities and events mentioned in the examined document by exploiting the web platform (see Section 3). Let us consider again the document about the strike by FIAT employees. According to the PRiSMHA model, an annotator should identify the occurrences of “FIAT” in the document, provide the description of the FIAT company, and link the FIAT characterization to the document. The support provided by automatic Information Extraction techniques and entity linking to LOD datasets aims at helping the user in these tasks.
In connection with the ontology-based annotation research, projects and tools supporting collaborative semantic annotation are worth mentioning. For example, the Micropasts (crowdsourced.micropasts.org) [Bonacchi et al. 2019] and CULTURA (www.cultura-strep.eu) [Agosti et al. 2013] projects aimed at enabling users to participate in the annotation of historical documents, while the MultimediaN E-Culture project (multimedian.project.cwi.nl) [Schreiber et al. 2008] developed a web-based system for the semantic annotation of Cultural Heritage objects. The SAGE framework [Foley et al. 2017] is particularly interesting because, besides semantic annotation, it introduces automatic techniques to support users in the annotation activity. Another relevant project is SAWS (Sharing Ancient Wisdoms) [Jordanous et al. 2012], which provided semantic annotation of historical texts through RDF triples, thus enabling the creation of a conceptual network of historical documents based on Linked Data. This project can be seen as a pioneer of the Computational Archival Science (CAS) field of study (computationalarchives.net), where one of the main goals is to interconnect archival resources as a way to make their (historical) context explicit.
Moreover, there are several web-based annotation software tools, usually based on the W3C Web Annotation standards (www.w3.org/blog/news/archives/6156) and the Open Annotation model (www.openannotation.org/spec/core); examples include Hypothesis (web.hypothes.is) and Recogito (recogito.pelagios.org), a platform mainly oriented to geodata that includes a Named Entity Recognition module identifying names of places and people. However, the most interesting tool in this category is probably Pundit (www.netseven.it/pundit), which supports the annotation of web pages based on semantic vocabularies and Linked Open Data; Pundit is natively linked to DBPedia, but other datasets can be connected, and custom vocabularies can be imported.
However, our experience demonstrated that none of these projects and tools, not even the most promising one, Pundit, enables the exploitation of a full-fledged domain ontology that drives the annotation user interface and includes effective support for the definition and characterization of new entities, as required by the PRiSMHA approach.
In the last few decades, much effort has also been devoted to integrating and connecting heterogeneous ontologies. The so-called alignment operation aims at building a generic, shared schema that acts as an interface between syntactically and semantically heterogeneous metadata [Alma'aitah et al. 2020]. The CIDOC-CRM ontology is probably the best-known project aimed at supporting the integration and connection of heterogeneous sources of Cultural Heritage knowledge [Crofts et al. 2003]. Other projects are worth mentioning, like, for instance, the REACH project [Doulaverakis et al. 2005], aimed at defining an ontology-based representation providing enhanced unified access to heterogeneous, distributed Cultural Heritage digital databases, mainly focused on Greek and Roman Antiquity.
Hyvönen et al. [2005] successfully carried out the MuseumFinland project, with the objective of building a single access point to more than 15 museum collections. The software combines databases by transforming them into a shared XML format, thus obtaining syntactic interoperability. Semantic interoperability is obtained by translating from XML to RDF, exploiting seven domain ontologies.
Daquino et al. [2016] developed two ad hoc ontologies to describe the Zeri Photo Archive catalog, aligned with CIDOC-CRM. Finally, two initiatives should be mentioned: Data for History (dataforhistory.org), which offers a web-based platform supporting ontology development and alignment with CIDOC-CRM, and ArCo (wit.istc.cnr.it/arco), an initiative whose goal is to develop a knowledge graph for Italian Cultural Heritage by aligning top-level models and ontologies characterizing properties and actions for Cultural Heritage curation.
Most ontological models related to the Cultural Heritage domain, including CIDOC-CRM and the others mentioned above, are mainly designed with curation in mind. This implies that they usually include a fine-grained characterization of cultural resource types, their properties, and the actions that can be taken with them. The goal of the ontology we developed within the PRiSMHA project is, instead, to model the concepts and relations representing the content of cultural resources. In this perspective, the major part of our ontology is not about Cultural Heritage, but about the domain Cultural Heritage “talks about.” Only a very small part of our ontology, in fact, is used to model archival documents as such, the fragments they are composed of, and the relation between fragments and the semantic representations of their content.
To conclude, we mention the work by Dragoni et al. [2016], which presents a general architecture for knowledge management platforms, together with an implementation (MOKI-CH) in the Cultural Heritage domain. The authors identify a set of requirements for the presented architecture; the most interesting for the approach presented in this article are data exposure and data linking, which are also goals of the PRiSMHA project (as mentioned in Section 1 and explained in Sections 3 and 4).

3 The Architecture, the Ontology, and the Annotation Platform

Figure 4 shows the architecture of the PRiSMHA prototype. As already stated, the goal of the project was the design and development of a prototype of the Crowdsourcing Platform, with its User Interface (Crowdsourcing Platform UI), which is driven by the ontology (HERO and HERO-900) and aims at populating the Semantic KB (implemented as an RDF triplestore). The Semantic KB, together with the ontology, represents the above-mentioned semantic layer, which contains a formal machine-readable representation of the content of archival documents. As stated in Section 1, PRiSMHA implements a hybrid strategy, by integrating user-generated content, provided through the Crowdsourcing Platform, and automatic techniques, represented by the Information Extraction (IE) module, which identifies relevant entities within textual documents, and the LOD linking module, which supports the connection of the Semantic KB with datasets in the LOD cloud.
Fig. 4. PRiSMHA prototype architecture.
The ontology-driven Crowdsourcing Platform and its UI, enabling users to annotate documents with semantic representations of their content stored in the Semantic KB, are described in detail in Goy et al. [2020]. In this article, we focus on the role of the IE and LOD linking modules, which support users of the Crowdsourcing Platform in their activity. These modules are described in detail in Section 4.
In the rest of this section, we briefly describe the HERO and HERO-900 ontologies and the structure of the Semantic KB; we then provide an overview of the most relevant aspects of the Crowdsourcing Platform UI, focusing on the work of a single user annotating a single document, with the collaborative aspects out of the scope of the present article. This will lay the ground for the presentation of the additional features, represented by the IE and LOD linking functionalities, in Sections 4 and 5.
HERO (Historical Event Representation Ontology) is a reference ontology that provides classes and properties useful to characterize historical events. In particular, HERO offers the conceptual vocabulary for specifying event types (e.g., a strike), the places events occur in (e.g., Milan), the date or time frame events take place in (e.g., November 1968), and—most important for this article—the participants in the events, i.e., people, organizations, objects, and so forth involved in events with different roles.
The upper-level module, HERO-TOP, provides the most general concepts, inherited by all the other modules. It is grounded in the basic distinctions defined in DOLCE [Borgo and Masolo 2009], i.e., perdurants, objects, and abstract entities. Perdurants can be states (e.g., being wounded) or events (e.g., a strike). Among objects, HERO distinguishes between physical objects (e.g., persons, buildings) and non-physical objects, among which social objects play a major role in the historical domain; examples of social objects are organizations (e.g., trade union) and social roles (e.g., student).
One of the most relevant relationships between objects and perdurants is participation (e.g., Sergio Garavini, a person—thus an object—participated in a strike, an event—thus a perdurant). The HERO-EVENT module accounts for the mentioned distinction between states and events and offers properties for describing them, among which are thematic (or semantic) roles, expressing the modalities in which objects (e.g., persons, organizations) participate in events or states (e.g., agent, patient, instrument); see Goy et al. [2018]. HERO-PLACE defines concepts and properties relevant for the characterization of places, while HERO-TIME provides the notions for expressing time. HERO is available at w3id.org/hero/HERO. With respect to the work presented in this article, the most important module is HERO-ROCS, which defines the formal instruments for describing organizations (e.g., political parties, companies), collective entities (e.g., students, workers), and social roles (e.g., professions).
HERO-EVENT-900, HERO-PLACE-900, and HERO-ROCS-900 are domain modules refining the corresponding HERO modules by introducing concepts and properties useful to describe the history of the 20th century. The current version of these modules covers the concepts and properties needed to describe the historical events—and the involved entities—considered in the PRiSMHA project, i.e., the students’ and workers’ protests during the years 1968–1969 in Italy. In particular, HERO-ROCS-900 offers a set of specific organization types (e.g., various types of trade unions, various types of organizations in the political sphere), a set of specific collective entity types (e.g., social classes, political-based collective entities), and a set of specific role types (e.g., various types of workers, various types of students).
Within the PRiSMHA project, we developed an application version of HERO, encoded in OWL 2 DL (www.w3.org/OWL), containing 429 classes, 378 properties, 79 individuals, and nearly 4,500 logical axioms.
The Semantic KB is implemented as an RDF triplestore (www.w3.org/RDF), containing RDF triples of the form <s, p, o>, where s is an entity in the Semantic KB, p is a property (defined in HERO/HERO-900, or belonging to RDF itself, e.g., rdf:type), and o can be either an entity in the Semantic KB, a literal (e.g., a string, a number), or a class defined in HERO/HERO-900 (e.g., Organization). Each triple represents an assertion stating that the entity s has the value o for the property p. Entities, properties, and classes are represented in the triplestore by URIs. Data stored in the Semantic KB can be accessed through a SPARQL endpoint or navigated through the Final User Interface (see Figure 5), which is currently available as a mockup (the description of which is out of the scope of the present article).
Fig. 5. A document on the PRiSMHA Crowdsourcing Platform (the biography of Emilio Pugno).
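As an illustration of the kind of query that can be posed at this endpoint, the scenario of Section 1 (documents annotated with protest events in which Guido Viale is an active participant) could be expressed roughly as follows. The sketch is indicative only: the prismha: namespace and property names such as prismha:agent, prismha:annotatedWith, and prismha:hasFragment are hypothetical stand-ins, not the actual HERO/PRiSMHA identifiers.

# Illustrative query; class and property names are hypothetical stand-ins.
PREFIX hero: <https://w3id.org/hero/HERO-TOP#>
PREFIX prismha: <http://example.org/prismha/>
SELECT DISTINCT ?document
WHERE
{
  ?event a hero:ProtestAction .              # hypothetical event class
  ?event prismha:agent ?person .             # hypothetical thematic (agent) role
  ?person prismha:name "Guido Viale" .
  ?fragment prismha:annotatedWith ?event .   # fragment-to-entity link
  ?document prismha:hasFragment ?fragment .
}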
The Crowdsourcing Platform prototype is a web application accessible through a browser. Its implementation is based on the AJAX model and exploits JQuery 3.3.1 (jquery.com) and Bootstrap 3.3.7 (getbootstrap.com/docs/3.3/). The implementation of the Crowdsourcing Platform application logic relies on the Spring Boot 1.5.10 framework (spring.io/projects/spring-boot), while data is stored in a MySQL 5.6.38 (www.mysql.com) relational database. The OWLAPI 5.1.0 (owlcs.github.io/owlapi) library supports the interaction with the ontology, and an RDF triplestore, implemented by means of Jena TDB 3.6.0 (jena.apache.org/documentation/tdb), stores the semantic representations.
Before describing the User Interface of the Crowdsourcing Platform, a few words should be devoted to its prospective users. By means of informal interviews with users and employees of the library and archives of the Polo del ‘900, we identified the potential users of the PRiSMHA Crowdsourcing Platform: historians, archivists, students, researchers, or simply enthusiasts interested in the history of the 20th century, participating in the PRiSMHA community as experts or simply as trusted users, motivated to spend time and effort on the semantic annotation process. Despite the efforts to reach a good level of usability (see Goy et al. [2020]), the interaction with the Crowdsourcing Platform UI remains a challenging task that requires some learning and training, as well as some knowledge about the domain (basically, Italian history of the 20th century).
Figure 5 shows a textual document, the biography of Emilio Pugno (an Italian trade union leader), accessed through the Crowdsourcing Platform. Users enabled to work on this document can identify textual units that can be annotated, called fragments (highlighted in cyan). By clicking on a fragment, users can see, in the right-hand bar, the annotations for that fragment, and by clicking on them, a modal window shows the details. For example, Figure 6 shows the semantic representation of the entity Partito Comunista Italiano, which at least one fragment in the biography of Emilio Pugno is annotated with. Such a representation contains the label for the entity (in the upper left corner), the type (class) label (Partito politico / Political party), the corresponding entity in Wikidata, if any (see Section 4.2), the values of the properties (in this case, the value of the data-property name, i.e., PCI), the URI in the Semantic KB, and the list of documents containing fragments annotated with this entity.
Fig. 6. Semantic representation of the entity Partito Comunista Italiano (Italian Communist Party).
By clicking on a fragment, users can also add new annotations (Figure 7): the system suggests searching the Semantic KB for a suitable entity among the existing ones; if nothing satisfactory is found, the user can create a new entity and link it to the fragment in focus by clicking on Add and link new entity (or Add new entity, to link it later on).
Fig. 7. The modal window enabling users to add a new annotation.
If the user decides to create a new entity, he or she can characterize it. Figure 8 shows the window enabling the user to describe an entity through its properties. First, the user selects a HERO class (Entity type) and enters a label for the entity. On the basis of the selected class, the system calculates the available properties, which are presented as a form, split into three tabs, corresponding to important, useful, and other properties; the details of the algorithm computing the compatible properties with respect to the selected class, as well as the criteria to split properties into tabs, can be found in Goy et al. [2020]. When the user clicks the Save button at the end of the form, the semantic representation of the entity in focus, corresponding to the form data, is generated and the RDF triples are saved in the Semantic KB.
Fig. 8. The window enabling users to describe an entity, by specifying its properties.
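To make this saving step concrete, the following minimal sketch shows how such triples can be written to a Jena TDB triplestore (the library cited above). The entity URI, dataset path, and label are hypothetical; the actual platform derives them from the form data.

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.tdb.TDBFactory;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class SaveEntity {
    public static void main(String[] args) {
        Dataset ds = TDBFactory.createDataset("/data/prismha-tdb"); // hypothetical path
        ds.begin(ReadWrite.WRITE);
        try {
            Model m = ds.getDefaultModel();
            // A new entity for "Partito Comunista Italiano", typed with a HERO class.
            Resource pci = m.createResource("http://example.org/prismha/entity/pci"); // hypothetical URI
            pci.addProperty(RDF.type, m.createResource("https://w3id.org/hero/HERO-ROCS#Organization"));
            pci.addProperty(RDFS.label, "Partito Comunista Italiano", "it");
            ds.commit(); // the triples <s, p, o> become part of the Semantic KB
        } finally {
            ds.end();
        }
    }
}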
A detailed description of the properties is out of the scope of the present article. We will focus on the first one (Corrisponde esattamente a / Exactly matching, in the figure) in Section 4.2, when describing the link to external datasets.
In Sections 4 and 5 we will describe the features representing the core of the work presented in this article, i.e., the support provided to users by the IE and LOD linking modules in the annotation activity.

4 Support to User Annotations

4.1 Automatic Information Extraction from Texts

Suppose that a user of the Crowdsourcing Platform is working on a text, obtained from an OCR-ized document or from an original textual source, such as the one in Figure 5. By clicking on the Show Named Entities and Temporal Expressions link, Named Entities and Temporal Expressions are automatically extracted by the IE module (see Section 3) and highlighted in the text (see Figure 9, where the biography of Sergio Garavini, another well-known Italian trade union leader, is displayed).
Fig. 9. A textual document (the biography of Sergio Garavini) with Named Entities and Temporal Expressions highlighted.
These entities are highlighted in two different ways, depending on whether or not they appear in a fragment (i.e., a textual portion highlighted in cyan): entities that do not belong to fragments are displayed in an orange font color and are not clickable, while entities that belong to fragments are displayed with an orange background and are clickable. In the first case, the entities can help users recognize interesting fragments; in the second case, they can be useful to recognize relevant entities representing the content of the fragment to be annotated (see the discussion in Section 5).
By clicking on a Named Entity or Temporal Expression that occurs in a fragment, the system shows some information items that hopefully help the user in describing the entity (Figure 10).
Fig. 10. The window describing the recognized entity Torino, shown by clicking on “Torino” in Figure 9.
In particular, besides the entity name (corresponding to the expression identified in the text), the system suggests the HERO class to be associated with the entity (Elemento Geografico / Geographic Feature, in this example), and it proposes potentially available links to external resources, such as the semantic networks BabelNet [Navigli and Ponzetto 2010] and DBPedia [Auer et al. 2007]. The system also tries to identify entities already available in the Semantic KB that refer to the Named Entity or Temporal Expression recognized in the textual fragment (Triplestore URI(s) in the figure), in order to avoid duplicates. Such entities can be used to annotate the current fragment by clicking on Link entity to fragment (if the entity is already linked to the fragment in focus, the link can be removed). If no entity in the Semantic KB corresponds to the one recognized in the text fragment, the user can add a new one by clicking Add new entity or Add and link new entity: a user interface similar to the one in Figure 8 is shown, pre-filled with the label and the ontology class automatically assigned by the system.
The IE module consists of two sub-modules:
Named Entities Recognition, recognizing persons, organizations, and places
Temporal Expressions Recognition, recognizing hours, days, months, years, seasons, and centuries.
The Named Entity Recognition sub-module is based on the approach described in Carducci et al. [2019], adapted to the Italian language. It consists of two integrated components:
A component based on a machine learning approach, provided by the TINT Named Entity Recognition Module [Aprosio and Moretti 2016], which builds on the Stanford CoreNLP NER module [Manning et al. 2014], which in turn relies on Conditional Random Field (CRF) classifiers [Lafferty et al. 2001]. The classifier is trained on the Italian Content Annotation Bank (I-CAB) corpus [Magnini et al. 2006], containing around 180,000 words taken from the Italian newspaper L'Adige.
A component that exploits a semantic-based approach, using the Word-Sense Disambiguation and Named Entity Recognition technique provided by BabelFy [Moro et al. 2014], which, in turn, employs the semantic network BabelNet [Navigli and Ponzetto 2010] as its source of information. The use of BabelFy is particularly relevant since it supports word-sense disambiguation; i.e., it selects the most promising sense within the set of candidates provided by BabelNet. Consider, for example, the sentence “Cavour è nato a Torino nel 1810” (“Cavour was born in Turin in 1810”); in this case, BabelFy recognizes the verb “nascere” (“to be born”) and two Named Entities (“Cavour” and “Torino”), assigning a particular sense to each of them. Assigning the correct sense to each element recognized in the text is not a trivial task: for example, if we look for the verb “nascere” in BabelNet, the semantic network suggests 10 possible senses (babelnet.org/search?word=nascere&lang=IT). Moreover, considering “Cavour,” the algorithm needs to discriminate between the sense representing the town of Cavour in Piedmont and the one representing the politician Camillo Benso conte di Cavour, while for “Torino” the algorithm has to decide whether it refers to the city of Turin in Piedmont or to the Torino Football Club. As we can see in Figure 11, BabelFy chooses the correct sense for each considered element, thanks to its word-sense disambiguation module.
Fig. 11. The results of the BabelFy disambiguation algorithm considering the sentence “Cavour è nato a Torino nel 1810” (“Cavour was born in Turin in 1810”), from http://babelfy.org/. Screenshot by authors.
The Named Entity Recognition sub-module can recognize three different types of entities, namely instances of the HERO PhysicalPerson class, instances of the HERO Organization class, and instances of the HERO GeographicFeature class, on the basis of the mappings shown in Table 1.
Table 1. Mappings between HERO, TINT, and BabelNet

HERO | TINT | BabelNet
PhysicalPerson (https://w3id.org/hero/HERO-TOP#PhysicalPerson) | PER | BabelSynset representing the concept of human (https://babelnet.org/synset?word=bn:00044576n)
Organization (https://w3id.org/hero/HERO-ROCS#Organization) | ORG | BabelSynset representing the concept of company (https://babelnet.org/synset?word=bn:00021286n)
GeographicFeature (https://w3id.org/hero/HERO-TOP#GeographicFeature) | LOC | BabelSynset representing the concept of location (https://babelnet.org/synset?word=bn:00051760n)
TINT categories are natively supported by the TINT Named Entity Recognition Module, while the component based on BabelFy analyzes the ancestors of the entity in the BabelNet semantic network (following edges labeled is-a) in order to obtain the correct classification: for example, considering the BabelNet synset that represents the city of Turin, it is classified as a city (babelnet.org/synset?word=bn:03335997n), which in turn is a settlement (babelnet.org/synset?word=bn:00070724n), which in turn is a location; so, Turin is recognized as a location.
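The following sketch illustrates this ancestor-based classification; it assumes a hypothetical getHypernyms lookup standing in for the BabelNet is-a graph (the real BabelNet Java API differs in its details).

import java.util.*;
import java.util.function.Function;

// Walks is-a edges upward from a synset until it reaches one of the
// three target synsets of Table 1 (human, company, location).
final class BabelNetTypeClassifier {

    // Target BabelNet synsets from Table 1, mapped onto HERO classes.
    private static final Map<String, String> TARGETS = Map.of(
        "bn:00044576n", "https://w3id.org/hero/HERO-TOP#PhysicalPerson",
        "bn:00021286n", "https://w3id.org/hero/HERO-ROCS#Organization",
        "bn:00051760n", "https://w3id.org/hero/HERO-TOP#GeographicFeature");

    private final Function<String, List<String>> getHypernyms; // hypothetical is-a lookup

    BabelNetTypeClassifier(Function<String, List<String>> getHypernyms) {
        this.getHypernyms = getHypernyms;
    }

    // Returns the HERO class URI for the synset, or empty if no target ancestor is found.
    Optional<String> classify(String synsetId) {
        Deque<String> frontier = new ArrayDeque<>(List.of(synsetId));
        Set<String> visited = new HashSet<>();
        while (!frontier.isEmpty()) { // breadth-first walk up the is-a hierarchy
            String current = frontier.poll();
            if (!visited.add(current)) continue;
            if (TARGETS.containsKey(current)) return Optional.of(TARGETS.get(current));
            frontier.addAll(getHypernyms.apply(current));
        }
        return Optional.empty();
    }
}

For the Turin example above, the walk visits the synsets for city, settlement, and finally location, and thus returns the HERO GeographicFeature class.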
The results provided by the two components are merged into S (i.e., the set containing all the Named Entities that our system can retrieve) according to the following strategy (a minimal sketch follows the list):
If an entity is recognized only by the TINT-based component, it is added to S (associated with the corresponding class).
If an entity is recognized only by the BabelFy-based component, it is added to S (associated with the corresponding class).
If an entity is recognized by both components, it is added to S, associated with the class identified by the TINT-based component, even when the BabelFy-based component disagrees. This choice reflects the fact that the accuracy of the classification is usually better for the approach based on TINT than for the one based on BabelFy.
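A compact sketch of this merge policy follows (the types and the keying on bare surface forms are illustrative; the actual module works on text offsets):

import java.util.*;

record NamedEntity(String surfaceForm, String heroClass) {}

final class NerMerger {
    // Merges TINT and BabelFy results; on conflicts, TINT's class wins,
    // since its classification is usually the more accurate of the two.
    static Collection<NamedEntity> merge(List<NamedEntity> tint, List<NamedEntity> babelfy) {
        Map<String, NamedEntity> merged = new LinkedHashMap<>();
        for (NamedEntity e : babelfy) merged.put(e.surfaceForm(), e); // BabelFy-only entities survive
        for (NamedEntity e : tint) merged.put(e.surfaceForm(), e);    // TINT overwrites on conflict
        return merged.values();
    }
}

Since TINT entries are inserted last, they overwrite BabelFy entries with the same surface form, which implements the preference expressed by the third rule while keeping entities recognized by only one component.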
The Temporal Expressions Recognition sub-module is based on the Heideltime library [Strötgen and Gertz 2010], which, besides recognizing temporal expressions, also normalizes them. Normalization is essential in order to recognize the “prototype” of a particular temporal indication, whatever expression is used in the text. For example, both the “2 giugno 2020” (“June 2, 2020”) and “2/6/2020” expressions are normalized as “2020-06-02”. In this way, we have a unique representation of a temporal interval, independent of the natural language used in the text.
Heideltime recognizes temporal expressions using patterns represented as regular expressions and annotates them with the TimeML markup language [Pustejovsky et al. 2005]. In particular, it can retrieve three types of temporal expressions:
Explicit expressions, such as “2 giugno 2020” or “2/6/2020”.
Implicit expressions, such as “San Silvestro 2015” (“New Year's Eve 2015”) or “Natale 2020” (“Christmas Day 2020”), normalized respectively as “2015-12-31” and “2020-12-25”.
Relative expressions, which can only be normalized using the context in which they occur. For example, if it finds the expression “due anni dopo” (“two years later”) and the previous lines provide the information that the story took place in 2017, Heideltime can normalize the expression as “2019.”
Each recognized Temporal Expression is assigned one of the four TIMEX3 types [Saurí et al. 2006] available, namely:
DATE, which describes a calendar time interval or subinterval (e.g., a day)
TIME, which refers to a time frame within a day (e.g., in the afternoon)
DURATION, which refers to explicit durations (e.g., 2 months)
SET, which describes a set of time intervals (e.g., every two weeks).
The PRiSMHA Temporal Expressions Recognition sub-module considers only the first two types. In particular, using a pattern-based approach that analyzes the normalized form (based on the Heideltime temporal tagger; a sketch is given below), we recognize five subtypes of entities belonging to the Heideltime DATE type, namely:
days, i.e., instances of the HERO Day class (https://w3id.org/hero/HERO-TIME#Day)
months, i.e., instances of the HERO class CalendarYearMonth (https://w3id.org/hero/HERO-TIME#CalendarYearMonth)
years, i.e., instances of the HERO CalendarYear class (https://w3id.org/hero/HERO-TIME#CalendarYear)
seasons, i.e., instances of the HERO CalendarSeason class (https://w3id.org/hero/HERO-TIME#CalendarSeason)
centuries, i.e., instances of the corresponding HERO-TIME class.
Entities associated by Heideltime with the TIME type are recognized as instances of the HERO DayTime class (https://w3id.org/hero/HERO-TIME#DayTime), representing time spans within a day, like hours.
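The pattern-based analysis of the normalized values can be sketched as follows; the regular expressions mirror the normalized forms discussed above (e.g., “2020-06-02” for a day, “2020-06” for a month), but the code is indicative, not the project's actual implementation.

import java.util.Optional;
import java.util.regex.Pattern;

// Maps Heideltime-normalized TIMEX3 values onto HERO-TIME classes.
final class TimexToHero {
    private static final Pattern TIME   = Pattern.compile("\\d{4}-\\d{2}-\\d{2}T.*"); // intra-day time
    private static final Pattern DAY    = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
    private static final Pattern SEASON = Pattern.compile("\\d{4}-(SP|SU|FA|WI)");    // TIMEX3 season codes
    private static final Pattern MONTH  = Pattern.compile("\\d{4}-\\d{2}");
    private static final Pattern YEAR   = Pattern.compile("\\d{4}");
    // century patterns omitted for brevity

    static Optional<String> heroClass(String value) {
        if (TIME.matcher(value).matches())   return Optional.of("https://w3id.org/hero/HERO-TIME#DayTime");
        if (DAY.matcher(value).matches())    return Optional.of("https://w3id.org/hero/HERO-TIME#Day");
        if (SEASON.matcher(value).matches()) return Optional.of("https://w3id.org/hero/HERO-TIME#CalendarSeason");
        if (MONTH.matcher(value).matches())  return Optional.of("https://w3id.org/hero/HERO-TIME#CalendarYearMonth");
        if (YEAR.matcher(value).matches())   return Optional.of("https://w3id.org/hero/HERO-TIME#CalendarYear");
        return Optional.empty(); // unhandled normalization (e.g., DURATION, SET)
    }
}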

4.2 Linking External Datasets

Fully describing an entity from a semantic point of view is not a trivial task, and the user who is adding a new entity to the Semantic KB is often not aware of all its features. In this case, the user can ask the system for support to obtain further information and possibly link the PRiSMHA entity to an external one. In particular, when facing the form for the characterization of a new entity, the user can click the Search on external resource link (see Figure 8), thus activating the Wikidata explorer interface. Wikidata [Vrandečić and Krötzsch 2014] is a collaboratively edited multilingual knowledge graph, hosted by the Wikimedia Foundation and focused on items that represent topics, concepts, or objects, using RDF as a data model. Each item is identified by a unique and persistent identifier, a positive integer prefixed with the uppercase letter Q, known as a QID.
The Wikidata support page available in the PRiSMHA Crowdsourcing Platform UI (Figure 12) enables the user to specify the entity to search for, by indicating a label for the entity (e.g., the name of the particular person to search for) and its type. Currently, the user can search for persons, organizations, and places. In particular, the mappings shown in Table 2 have been defined.
Fig. 12. Wikidata support page.
Table 2. Mappings between HERO and Wikidata Types
When the label and type have been specified, the user can click the Search button: the system sends a query to Wikidata (see details below) and shows the results (Figure 12). For example, if the user searches for an entity labeled “Partito comunista,” the system retrieves multiple candidates (Partito Comunista Italiano, Partito Comunista Serbo, Partito Marxista-leninista americano, etc.), among which the user can select the entity matching the one he or she has in mind by clicking the corresponding Select button. The form for characterizing the new entity is then shown again (see Figure 8), filled in with the link between the entity that the user is describing and the corresponding entry in Wikidata.
The user can select between two types of matching:
Corrisponde esattamente a (exactly matching): it refers to the skos:exactMatch property in SKOS (Simple Knowledge Organization System) [Miles and Pérez-Agüera 2007] and “indicates a high degree of confidence that two concepts can be used interchangeably across a wide range of information retrieval applications” (www.w3.org/TR/skos-reference/#L4858).
Corrisponde più o meno a (roughly corresponding to): it refers to the skos:closeMatch property in SKOS and “indicates that two concepts are sufficiently similar that they can be used interchangeably in some information retrieval applications” (www.w3.org/TR/skos-reference/#L4858).
In order to help the user to discriminate between different candidates, each one is represented as a “card” showing the following information:
A Wikidata label (in Italian)
A short description of the entity (in Italian)
A set of available links to external resources representing the same entity; in particular, we selected the following external resources as relevant in our historical domain:
Enciclopedia Treccani (http://www.treccani.it)
The Virtual International Authority File (VIAF, https://viaf.org)
Catalogo del Servizio Bibliotecario Nazionale (SBN, https://opac.sbn.it/opacsbn/opac/iccu/free.jsp)
Sito storico del Senato (http://www.senato.it/)
Sito della camera (http://dati.camera.it/)
Sito storico della camera (https://storia.camera.it/)
Dizionario storico Treccani (http://www.treccani.it/)
When the user clicks the Search button in the Wikidata support page (Figure 12), a SPARQL query is sent to the Wikidata Query Service (https://query.wikidata.org/), using the Jena API [McBride 2002]. Moreover, in order to retrieve the needed information, some additional services are used [Malyshev et al. 2018], namely:
wikibase:label (https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_Label): the Wikibase (https://wikiba.se/) label service is used to obtain the label and the short Italian description of the entity.
wikibase:mwapi (https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_mwapi): the MediaWiki API Query Service (MWAPI) is used to search for entities filtered by the particular type selected by the user.
For example, if the user is looking for “Sergio Garavini” (label) and provides PhysicalPerson as type, the following SPARQL query is executed:
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT DISTINCT ?item ?itemLabel ?itemDescription
WHERE
{
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch".
    bd:serviceParam wikibase:endpoint "www.wikidata.org".
    bd:serviceParam mwapi:search "Sergio Garavini".
    bd:serviceParam mwapi:language "it".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  ?item wdt:P31/wdt:P279* wd:Q5.
  SERVICE wikibase:label {bd:serviceParam wikibase:language "it"}
}
ORDER BY asc(str(fn:lower-case(?itemLabel)))
Consider, in particular, the line ?item wdt:P31/wdt:P279* wd:Q5: the Wikidata node wd:Q5 (https://www.wikidata.org/wiki/Q5) represents the concept of human (mapped onto the HERO PhysicalPerson class), and thus the system selects all the entities in Wikidata, retrieved by the MWAPI service, that are instances of human or of any of its sub-concepts.
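For reference, a minimal sketch of how such a query can be submitted via the Jena API mentioned above (using Wikidata's public SPARQL endpoint; error handling omitted):

import org.apache.jena.query.*;

public class WikidataSearch {
    public static void main(String[] args) {
        String sparql = "..."; // the EntitySearch query shown above
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "https://query.wikidata.org/sparql", sparql)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) { // one row per candidate Wikidata item
                QuerySolution row = results.next();
                System.out.println(row.getResource("item") + " " + row.getLiteral("itemLabel"));
            }
        }
    }
}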
Available links to external resources are found through another SPARQL query; for example, the external links related to the Wikidata node Q338536 (https://www.wikidata.org/wiki/Q338536) representing Sergio Garavini are extracted with the following query:
SELECT ?treccaniURL ?viafURL ?sbnURL ?openPolisURL ?senateURL ?cameraDatiURL ?cameraStoriaURL ?storiaTreccaniURL ?wikipediaURL WHERE {
wd:P3365 wdt:P1630 ?treccaniFormatter.
wd:P214 wdt:P1630 ?viafFormatter.
wd:P396 wdt:P1630 ?sbnFormatter.
wd:P1229 wdt:P1630 ?openPolisFormatter.
wd:P2549 wdt:P1630 ?senateFormatter.
wd:P1341 wdt:P1630 ?cameraDatiFormatter.
wd:P3935 wdt:P1630 ?cameraStoriaFormatter.
wd:P6404 wdt:P1630 ?storiaTreccaniFormatter.
optional {?wikipediaIRI schema:about wd:Q338536; schema:isPartOf <https://it.wikipedia.org/>}.
optional {wd:Q338536 wdt:P3365 ?treccaniID}.
optional {wd:Q338536 wdt:P214 ?viafID}.
optional {wd:Q338536 wdt:P396 ?sbnID}.
optional {wd:Q338536 wdt:P1229 ?openPolisID}.
optional {wd:Q338536 wdt:P2549 ?senateID}.
optional {wd:Q338536 wdt:P1341 ?cameraDatiID}.
optional {wd:Q338536 wdt:P3935 ?cameraStoriaID}.
optional {wd:Q338536 wdt:P6404 ?storiaTreccaniID}.
BIND(str(?wikipediaIRI) as ?wikipediaURL).
BIND(REPLACE(?treccaniID, '^(.+)$', ?treccaniFormatter) AS ?treccaniURL).
BIND(REPLACE(?viafID, '^(.+)$', ?viafFormatter) AS ?viafURL).
BIND(REPLACE(?sbnID, '^(.+)$', ?sbnFormatter) AS ?sbnURL).
BIND(REPLACE(?openPolisID, '^(.+)$', ?openPolisFormatter) AS ?openPolisURL).
BIND(REPLACE(?senateID, '^(.+)$', ?senateFormatter) AS ?senateURL).
BIND(REPLACE(?cameraDatiID, '^(.+)$', ?cameraDatiFormatter) AS ?cameraDatiURL).
BIND(REPLACE(?cameraStoriaID, '^(.+)$', ?cameraStoriaFormatter) AS ?cameraStoriaURL).
BIND(REPLACE(?storiaTreccaniID, '^(.+)$', ?storiaTreccaniFormatter) AS ?storiaTreccaniURL).
} LIMIT 1

5 Evaluating Suggestions by IE and LOD Modules

5.1 Evaluation Setting

In order to assess the effectiveness of the IE and LOD linking modules in supporting PRiSMHA users, we carried out a qualitative evaluation, for which 30 subjects were recruited. Each participant was asked to perform a sequence of mini-tasks (referred to in the following as the Main Task) twice: once with a version of the Crowdsourcing Platform prototype without IE and LOD support, and once with the full-fledged version. The subjects were split into two groups of 15 people, named Group O and Group W:
Group O performed the Main Task first without IE and LOD support and then with the full-fledged version of the prototype.
Group W performed the Main Task first with IE and LOD support and then with the stripped-down version of the prototype.
The two groups were necessary because it could be expected that, repeating the Main Task twice, the second execution would be experienced as “easier” by the participants (as we will see in Section 5.3, this was indeed the case): we did not want them to incorrectly attribute this increased ease to our support tools when it was actually due to a better knowledge of the application.
The Main Task included the following steps:
(A) Log into the prototype and open a given project.
(B) Find a specific document within the project and read it. The documents used for this step already included a few annotations, and the relevant fragments within them were highlighted.
(C) Open a few relevant entities within highlighted text fragments, check them for existing annotations, and, if none were present, annotate the corresponding fragment as they saw fit.
Users were asked to annotate two different documents, one to be used in the first execution of the Main Task, the other in the second. The two documents were the same for everyone, but which was used in the first execution and which in the second was randomly chosen by the system.
The third sub-step (C) was the active phase, where the IE and LOD support played its role. In the full-fledged version of the prototype the participants could benefit from the following support functionalities:
NER/TER: When the user is reading a document, by clicking the Show Named Entities and Temporal Expressions link, the IE module automatically identifies Named Entities and Temporal Expressions. The corresponding phrases are highlighted in the text (see Section 4.1) by using an orange background when they belong to a fragment and an orange text color when they do not.
INFO: By clicking on a Named Entity or Temporal Expression that occurs in a fragment, the system shows some information items describing the entity. In particular, it proposes links to external resources (namely, DBPedia and BabelNet), if available. The system also tries to identify those entities that are already available in the Semantic KB, in order to avoid duplicates. Such entities can be directly used to annotate the fragment in focus by clicking the Link entity to fragment link (see Section 4.1).
LINK: When adding a new entity, in order to annotate a fragment with it, the user can specify a label and a type for the entity and then ask the PRiSMHA platform to search Wikidata for possible matches. The Wikidata query results are then displayed as a list of cards among which the user can select the one representing the entity he or she has in mind. Such an entity can be linked to the one in the PRiSMHA Semantic KB by means of the exactly matching or roughly corresponds to property (see Section 4.2).
AUTOFILL: When the user selects a Wikidata entry as an exact or rough match, the PRiSMHA platform prompts him or her with the form for creating a new entity (see Figure 8 and Section 4.2); the system then automatically fills in the label and type fields thanks to the data retrieved from Wikidata.
After completing the assigned tasks, participants were asked to fill in a questionnaire, which mainly focused on the difference between the two experiences with the two prototype versions. The results we present in the following section consist of the questionnaire answers we collected.

5.2 Results

All the subjects had the Italian equivalent of a BSc, an MSc, or a PhD. In both groups, all of the subjects but one read their emails and browsed the web on a daily basis. All of them also worked with standard office applications (text editors, spreadsheets, presentation software) at least on a weekly basis. All of them declared regular usage of both a personal computer and a smartphone. About half of them (7 out of 15 in Group W, 8 out of 15 in Group O) said they also regularly used a tablet.
The first part of the questionnaire was aimed at setting the context of the support offered by IE and LOD linking modules, by measuring the perceived complexity of the Main Task. The purpose of this section is not the evaluation of the annotation tool as a whole, nor the assessment of the user interface usability. We rather aimed at establishing a baseline for evaluating the potential usefulness of the support functionalities mentioned above.
In this section of the questionnaire, we asked the subjects to express their agreement with three statements, on a 5-point scale ranging from 1 (complete disagreement) to 5 (complete agreement). The statements were the following:
(a) The task was in itself complex.
(b) Even if the task was complex, it became easier once learned.
(c) I did what was asked of me quite easily.
Figure 13 shows the answers to these questions, split between the two groups W and O, by means of boxplots.
Fig. 13. Boxplots for the answers to the first part of the questionnaire, evaluating the difficulty of the task. Participants were asked how much they agreed with statements (a), (b), and (c) described in the text. Answers were given on a 5-point scale (1 to 5), with 1 representing “complete disagreement” and 5 representing “complete agreement.” The boxplots themselves span from the first quartile Q1 to the third quartile Q3, with a dividing line showing the median (second quartile Q2). Whiskers show the minimum and maximum values based on the interquartile range (respectively, the lowest value above Q1 – 1.5 * IQR and the highest value below Q3 + 1.5 * IQR). The diamond inside each boxplot represents the mean value.
Subsequently, we asked the subjects to express the degree to which each of the support functionalities mentioned above (NER/TER, INFO, LINK, AUTOFILL) helped them or rather hindered them. Again, the participants could answer on a 5-point scale, with 1 representing “significant hindrance” and 5 representing “significant help.”
The boxplots for the replies to these questions, from both Group W and Group O, are shown in Figure 14. In this phase users could also provide free-text comments; the feedback received in these comments is discussed in Section 5.3.
Fig. 14. Boxplots for the answers to the second part of the questionnaire, evaluating the degree of help/hindrance experienced for each support functionality. Answers were given on a 5-point scale (1 to 5), with 1 representing “significant hindrance” and 5 representing “significant help.” The conventions for the boxplot representation are the same as those used for Figure 13.

5.3 Discussion

As noted in Section 3, the PRiSMHA Crowdsourcing Platform is aimed at a quite specific and competent type of user; using the application is not easy, at least at first glance, and becoming acquainted with the underlying ontology requires some background knowledge about the domain and some interaction rounds. Although our 30 test subjects were reasonably tech-savvy and had a good degree of academic education, they were using the application for the first time. This can explain why most of them, in both groups, found the task rather complex, although less so once learned, and did not find it easy to complete, as the results in Figure 13 show. Such difficulty could of course impact their evaluation of the support functionalities offered by the application, since appreciating them required a certain degree of familiarity with both the domain and the document annotation task itself.
Nonetheless, Figure 14 shows that both groups of users found the four support functionalities reasonably helpful. None of the boxplots falls in the bottom (“hindrance”) half of the diagram; for all four functionalities, the median is 4 and the mean falls between 3.5 and 4.
It can be observed that answers from Group W were consistently slightly lower than those from Group O. This mild difference between the two groups can be ascribed to the fact that, for Group W, the lack of support in the second execution of the task was compensated by an increased familiarity with the application. Also, as some participants remarked in the free-text comments to the questionnaire, the support features added visual and interaction complexity for people new to the task (Group W). The support features were in general easier to exploit for those who had already learned the basic use of the application, i.e., people in Group O. This is consistent with the fact that, as we can notice in Figure 13, Group W deemed the task slightly more difficult, and less easy to learn, than Group O.
While all the participants found the second execution easier than the first, some people in Group W actually blamed the support tools for this. As previously stated, we introduced the two groups to factor out a “false-positive” bias from Group O (“the second time I did it, it was easier, thus the support tools were helpful”); apparently, Group W “compensated” for it with the opposite, negative bias (“the second time I did it, it was easier, thus the support tools were a hindrance”). For this reason, we particularly appreciated the overall positive evaluation given by the participants in Group W.
As stated above, participants had the possibility to add free-text comments to their answers. These provided us with an assessment of the main advantages provided by the support tools, as well as directions regarding possible areas of improvement. Let us briefly discuss them.
Regarding the NER/TER support, 18 people out of 30 provided a free-text comment. Of these, 13 remarked positively on the helpfulness of the tool: they not only found it useful for identifying potential entities (“it helps recognizing which phrases correspond to entities,” “it helps recognizing relevant entities”) but also saw it as an aid in reading the text and locating key concepts and relevant portions (“it was useful in identifying keywords,” “it made my job faster,” “it helped me choose which parts of a fragment were relevant and which not”). The remaining five commented on possible improvements: two remarks concerned the UI (“the meaning of differences in text color/background was not immediately clear,” “I was expecting that the highlighted entities were not only suggested, but already added into the system”), while the other three concerned the IE module itself, which did not always identify the full phrase corresponding to an entity. This was particularly true for organizations with multi-word names; for example, an organization named “Federazione giovanile del Partito socialista di unità proletaria” (Youth Federation of the Proletarian Unity Socialist Party) was recognized as three separate entities (“Federazione giovanile”/Youth Federation, “Partito socialista”/Socialist Party, “unità proletaria”/Proletarian Unity). Overall, we can state that the NER/TER functionality mainly helps users in identifying the relevant entities to be used in the annotation.
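To make the reported shortcoming concrete, the following Python sketch shows how the output of a CoreNLP-compatible NER service (such as Tint for Italian) could be post-processed by merging consecutive tokens bearing the same entity tag into a single span. This is our illustration, not the PRiSMHA IE module: it assumes a server listening on localhost:9000, and the merging heuristic is ours.

```python
import json
import requests

def ner_spans(text, url="http://localhost:9000"):
    """Return (surface form, tag) pairs, merging adjacent same-tag tokens."""
    props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
    resp = requests.post(url, params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    resp.raise_for_status()
    spans = []
    for sentence in resp.json()["sentences"]:
        prev_tag = "O"  # do not merge across sentence boundaries
        for token in sentence["tokens"]:
            tag = token.get("ner", "O")
            if tag != "O" and tag == prev_tag:
                # Merge with the previous token, so that, e.g.,
                # "Federazione" + "giovanile" yields one span.
                spans[-1] = (spans[-1][0] + " " + token["word"], tag)
            elif tag != "O":
                spans.append((token["word"], tag))
            prev_tag = tag
    return spans

print(ner_spans("La Federazione giovanile del Partito socialista "
                "di unità proletaria si riunisce a Torino."))
```

Note that such a heuristic alone cannot bridge function words like “del” or “di,” which typically receive no entity tag; this is precisely why names of this kind remained fragmented in the tool's output.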
Moving to the INFO support, i.e., the link to external resources provided when clicking on an entity recognized by NER or TER, we collected 13 free-text comments. Eight participants expressed a positive remark (“it sped up the process of inserting simple entities, so that I could concentrate on more complex ones,” “the external links were useful to discover more about certain entities”), along with some suggestions for improving the UI (“it would be helpful to include here a brief description of the entity taken from these external sources”). The other five expressed perplexity toward this feature, partly because the dialog presenting the information was not informative enough (“I did not understand these alphanumeric IDs”) and partly because of unmet expectations (“I expected to be able to directly add the entity by using the external sources”). The latter issue was reported by three people; interestingly, they lamented the absence of a feature that was in fact available at a later stage in the annotation process, namely the LINK support tool.
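The complaint about opaque alphanumeric identifiers suggests a simple remedy: showing a short human-readable description alongside the ID. As a hedged sketch of how this could be realized (ours, not the PRiSMHA implementation), the snippet below uses the public Wikidata wbgetentities API to fetch the label and description of a given entity.

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"

def describe_entity(qid, lang="it"):
    """Fetch a human-readable label and description for a Wikidata ID."""
    resp = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": qid,
        "props": "labels|descriptions", "languages": f"{lang}|en",
        "format": "json",
    })
    resp.raise_for_status()
    entity = resp.json()["entities"][qid]
    label = entity.get("labels", {}).get(lang, {}).get("value", qid)
    desc = entity.get("descriptions", {}).get(lang, {}).get("value", "")
    return f"{label} ({qid}): {desc}"

print(describe_entity("Q42"))  # Q42 (Douglas Adams), a common test entity
```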
The LINK support received nine free-text comments. Six people commented positively, stating that being able to search for possible correspondences in Wikidata helped them discover more about the entity itself; besides being helpful for filling in the entity creation form (“It was very intuitive to use and it helped me discover facts concerning the entity I was exploring”), they also found it interesting as an enrichment of their knowledge on the topic (“I found unexpected connections”) and as a validation instrument (“I think that connecting to an external resource is an enrichment because it somehow validates the identity of the entity itself”). The three negative remarks concerned exclusively the UI: for one participant the labels used in the form were misleading (“It was not clear to me that selecting ‘exactly matches’ would trigger a search on Wikidata”); the other two complained that the search would not find anything, probably because it was not clear that both the label and type fields had to be filled in for the search to work.
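For context, a label-based search of the kind the LINK tool performs can be implemented on top of Wikidata's wbsearchentities endpoint; the sketch below is again our illustration rather than the production code. In PRiSMHA the type field is additionally required to constrain the search, a requirement that, as noted, should be made more evident in the UI.

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"

def search_wikidata(label, lang="it", limit=5):
    """Return candidate (id, label, description) triples for a surface form."""
    resp = requests.get(WD_API, params={
        "action": "wbsearchentities", "search": label,
        "language": lang, "limit": limit, "format": "json",
    })
    resp.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in resp.json()["search"]]

for qid, lab, desc in search_wikidata("Antonio Gramsci"):
    print(qid, lab, "-", desc)
```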
Last but not least, nine people commented on the AUTOFILL tool. All of them provided positive remarks, highlighting how having the label and type fields already filled in partially compensated for the complexity of the form, making it faster, if not always easier, to fill in (“Without it you would spend a lot of time finding the right category,” “When I had the automatic suggestion I felt less worried about making mistakes”).
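A minimal sketch of how such pre-filling can work follows, assuming a hand-maintained mapping from Wikidata “instance of” (P31) values to the platform's ontology classes; both the mapping and the class names below are hypothetical, since the actual PRiSMHA mapping is not reproduced here.

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"

# Hypothetical mapping from Wikidata classes to ontology typologies.
P31_TO_TYPOLOGY = {
    "Q5": "Person",              # Q5 = human
    "Q7278": "PoliticalParty",   # Q7278 = political party
    "Q43229": "Organization",    # Q43229 = organization
}

def autofill(qid, lang="it"):
    """Pre-fill the label and typology fields from a selected Wikidata entry."""
    resp = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": qid,
        "props": "labels|claims", "languages": lang, "format": "json",
    })
    resp.raise_for_status()
    entity = resp.json()["entities"][qid]
    label = entity.get("labels", {}).get(lang, {}).get("value", "")
    typology = None
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") == "value":
            target = snak["datavalue"]["value"]["id"]
            typology = P31_TO_TYPOLOGY.get(target, typology)
    return {"label": label, "typology": typology}
```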
On the basis of the results and comments, we can say that the INFO, LINK, and AUTOFILL functionalities actually help users in gathering information about entities used for the annotation; moreover, such information supports them in the characterization of these entities (i.e., in filling in the form shown in Figure 8).
On the whole, the support offered by the IE and LOD linking modules (which was the focus of the assessment) received a positive evaluation, and most of the problems turned out to be related not to the support itself but to some awkwardness in the User Interface, which either did not sufficiently highlight how one could benefit from the support or did not enforce the steps needed for the support to be effective. In any case, these remarks pointed out the main directions for improvement.
An interesting point is the request for broader support in the annotation task itself: regarding NER/TER, some users suggested going beyond the simple identification of entities in the text (together with their correspondence to external resources) and asked for automatic annotation, or at least suggestions for it. Similarly, for the LINK functionality, some users suggested automatically creating a candidate annotation on the basis of the selected Wikidata entry.
These enhancements to the annotation support are not trivial (e.g., they risk introducing lower-quality data into the Semantic KB), but they are clearly worth considering. Moreover, the fact that users found the AUTOFILL functionality useful encourages us to design a new version of the system in which properties other than typology and label are automatically pre-filled (see Section 6).

6 Conclusions and Future Work

In this article we have demonstrated that automatic approaches can be successfully exploited to support users in the semantic annotation of archival documents. More specifically, Named Entity (and Temporal Expression) Recognition techniques proved useful when users of the PRiSMHA Crowdsourcing Platform had to identify and characterize the entities to annotate a document with, and information from external datasets such as Wikidata proved valuable when creating new entities to be added to the Semantic KB. These findings answer the research question introduced in Section 1: given an ontology-driven web-based system enabling users to build formal semantic representations of archival document content, automatic text mining techniques (namely, Named Entity and Temporal Expression Recognition) and entity linking to external resources (Linked Open Data) provide users with effective support in the (semantic) annotation activity. Moreover, a Semantic KB that describes the content of historical archival documents, and is linked to external LOD sets, enhances the interconnection of archival resources and represents a step forward in the direction indicated by Computational Archival Science (see computationalarchives.net).
Participants in the evaluation also suggested some promising improvements, the most interesting and challenging of which concerns a deeper exploitation of external datasets such as Wikidata. As discussed in Section 5.3, when a Wikidata entry is selected, the PRiSMHA platform automatically fills in the label and type fields of the form used to characterize the corresponding new entity to be added to the Semantic KB. The fact that users greatly valued this feature encourages us to continue the study aimed at designing a new version of the system in which properties other than typology and label are automatically pre-filled on the basis of information retrieved from Wikidata or other datasets.
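As a hint of what this extension could look like, the sketch below (ours, with illustrative property choices) queries the public Wikidata Query Service via SPARQL for two further properties of a person entity, birth date (P569) and birthplace (P19), which could then pre-fill the corresponding form fields.

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?birthDate ?birthPlaceLabel WHERE {
  wd:%(qid)s wdt:P569 ?birthDate .
  OPTIONAL { wd:%(qid)s wdt:P19 ?birthPlace . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "it,en". }
}
"""

def extra_properties(qid):
    """Retrieve birth date and birthplace for a person entity, if present."""
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": QUERY % {"qid": qid},
                                "format": "json"},
                        # The service requires a User-Agent; this one is made up.
                        headers={"User-Agent": "metadata-prefill-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [{k: v["value"] for k, v in row.items()} for row in rows]

print(extra_properties("Q42"))  # Q42 (Douglas Adams) as a safe test entity
```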

Acknowledgments

Thanks to all PRiSMHA collaborators, and special thanks to Rossana Damiano and Daniele Paolo Radicioni, for their valuable support in the project.

