The current digital turn in studying and analyzing historical documents results both in machine-actionable cultural data and in software able to process them. However, these data and services often lack integration strategies that would allow them to be reused in contexts different from the original ones. As Franz Fischer pointed out in a noteworthy article: “There is no out-of-the-box software available for creating truly critical and truly digital editions at the same time” [1]. Likewise, Monica Berti stated that it is now important to “build a model for representing quotations and text reuses of lost works in a digital environment” [2]. In this vision, Bridget Almas is in charge of developing an integrated platform for collaboratively transcribing, editing, and translating historical documents and texts. She claimed that through this platform, called Perseids, students and scholars are able to create open-source digital scholarly editions [3]. A numbe...
The paper focuses on the automatic extraction of domain knowledge from Italian legal texts and presents a fully implemented ontology learning system (T2K, Text-2-Knowledge) that includes a battery of tools for Natural Language Processing, statistical text analysis, and machine learning. Evaluation results show the considerable potential of systems like T2K, which exploit an incremental interleaving of NLP and machine learning techniques for accurate, large-scale, semi-automatic extraction and structuring of domain-specific knowledge.
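The abstract does not detail T2K's internals, but the terminology-extraction stage of such a pipeline can be illustrated with a minimal sketch: collect frequent word n-grams from a text as candidate domain terms. This is a toy stand-in, not the actual T2K implementation, which additionally filters candidates through POS patterns and statistical association measures.

```python
from collections import Counter
import re

def extract_candidate_terms(text, max_len=3, min_freq=2):
    """Collect frequent word n-grams as candidate domain terms.

    A simplified stand-in for the terminology-extraction step of an
    ontology-learning pipeline; real systems also apply linguistic
    filters (POS patterns) and association measures.
    """
    tokens = re.findall(r"[a-zà-ù]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only candidates that recur often enough to be term-like.
    return [(t, c) for t, c in counts.most_common() if c >= min_freq]

sample = ("Il contratto di lavoro subordinato. Il contratto di lavoro "
          "a tempo determinato. Il lavoro subordinato.")
terms = extract_candidate_terms(sample)
```

On this toy legal snippet, recurring multi-word candidates such as "contratto di lavoro" surface with their frequencies, which a full system would then rank and structure into a domain vocabulary.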
One of the main challenges of the DH community is to provide suitable software models and tools. To model the literary domain and the related user requirements, we chose to follow the engineering principles of object-oriented analysis and design. The digital representation of a textual resource is a challenge, as it involves several theoretical and epistemological issues in semiotics, paleography, philology, linguistics, engineering, and computer science. We have designed and implemented a set of core entities as the fundamental data types shared among all the components of the environment.
Abstract. The domain adaptation task was aimed at investigating techniques for adapting state-of-the-art dependency parsing systems to new domains. Both the language dealt with, i.e. Italian, and the target domain, namely the legal domain, represent two main novelties of the task organised at Evalita 2011. In this paper, we define the task and describe how the datasets were created from different resources. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.
This paper introduces the POESIA internet filtering system, which is open source and combines standard filtering methods, such as positive/negative URL lists, with more advanced techniques, such as image processing and NLP-enhanced text filtering. The description here focuses on components providing textual content filtering for three European languages (English, Italian, and Spanish), employing NLP methods to enhance performance. We also address the acquisition of language data needed to develop these filters, and the evaluation of the system and its components.
The inclusion of semantic features in the stylometric analysis of literary texts appears to be poorly investigated. In this work, we experiment with the application of Distributional Semantics to a corpus of Italian literature to test whether word distribution can convey stylistic cues. To verify our hypothesis, we set up an Authorship Attribution experiment. Indeed, the results we obtained suggest that the style of an author can reveal itself through word distribution as well.
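The core of such an Authorship Attribution experiment can be sketched with a deliberately simplified model: represent each author by a word-frequency profile and attribute an unknown text to the nearest profile by cosine similarity. This is only an illustrative baseline under assumed toy data; the paper's actual distributional-semantic representations are richer than raw frequency vectors.

```python
import math
import re
from collections import Counter

def profile(text):
    """Relative word-frequency vector of a text: a crude stand-in
    for richer distributional representations."""
    words = re.findall(r"\w+", text.lower())
    n = len(words)
    return {w: c / n for w, c in Counter(words).items()}

def cosine(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def attribute(unknown, candidates):
    """Nearest-neighbour attribution: pick the candidate author whose
    profile is most similar to the unknown text."""
    u = profile(unknown)
    return max(candidates, key=lambda a: cosine(u, profile(candidates[a])))

# Toy corpora (Dante verses used purely as illustrative data).
candidates = {
    "A": "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura",
    "B": "la gloria di colui che tutto move per l universo penetra e risplende",
}
unknown = "mi ritrovai per una selva oscura che la diritta via era smarrita"
guess = attribute(unknown, candidates)
```

With these toy texts the unknown fragment shares far more vocabulary with author "A", so the nearest-profile rule assigns it accordingly.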
Within the “Museo Virtuale della Musica BellinInRete” project, a corpus of letters written by the renowned Catania-born composer Vincenzo Bellini (1801–1835) will be encoded and made publicly available. This contribution aims at illustrating the part of the project regarding the implementation of the prototype for the metadata and text encoding, indexing, and visualisation of Bellini’s correspondence. The encoding scheme has been defined according to the latest guidelines of the Text Encoding Initiative and has been instantiated on a sample of letters. Contextually, a first environment has been implemented by customizing two open-source tools: Edition Visualization Technology and the Omega scholarly platform. The main objective of the digital edition is to engage the general public with the cultural heritage held by the Belliniano Civic Museum of Catania. This wide access to Bellini’s correspondence has been conceived preserving the scholarly transcriptions of the letters edited by S...
Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. Results: This article provides an overvi...
This article illustrates the first steps towards the implementation of a Decision Support System aimed at recreating a research environment for scholars and providing them with computational tools to assist in the processing and interpretation of texts. While outlining the general characteristics of the system, the paper presents a minimal set of user requirements and provides a possible use case on Dante’s Inferno.
Abstract. The management and exchange of multimedia data is a challenging area of research due to the variety of formats and standards and the many interesting intended applications. Semantic web technologies are very promising for enabling interoperability and integration of media. ...
Proc. of the 3rd Italian Semantic Web Workshop-SWAP 2006, 2006
Abstract—The demand for efficient methods for extracting knowledge from multimedia content has led to a growing research community investigating the convergence of multimedia and knowledge technologies. In this paper we describe a methodology for extracting multimedia information from product catalogues, empowered by the synergetic use and extension of a domain ontology. The methodology was implemented in the Trade Fair Advanced Semantic Annotation Pipeline of the VIKE-framework. Index Terms—Semantic ...
A formal digital structuring of the terminology of the Talmud is being carried out in the context of the Project for the Translation of the Babylonian Talmud into Italian. Following the principles of Meaning-Text Theory, the terminological resource was encoded in the form of a multilingual Explanatory Combinatorial Dictionary (Hebrew-Aramaic-Italian). The construction of this resource was supported by text processing and computational linguistics techniques aimed at automatically extracting terms from the Italian translation of the Talmud and aligning them with the corresponding Hebrew/Aramaic source terms. The paper describes the process that was set up for constructing the terminological resource, with the ultimate goal of illustrating the advantages of adopting a formal linguistic model. The terminological resource aims to be a useful tool for investigating in depth the characteristics of the languages of the Talmud, for helping translators in their work and, more generally, scholars in their study of the Talmud itself.
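The extract-and-align step described above can be illustrated with a minimal sketch: given translation-aligned segments, score candidate source/target term pairs by how consistently they co-occur, using the Dice coefficient. The data and threshold below are illustrative assumptions, not the project's actual method or terms.

```python
from collections import Counter

def dice_alignments(segment_pairs, min_dice=0.6):
    """Rank candidate translation pairs by the Dice coefficient of
    their co-occurrence across aligned segments: a toy version of a
    term-alignment step, not the project's actual pipeline."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in segment_pairs:
        for s in set(src_terms):
            src_freq[s] += 1
        for t in set(tgt_terms):
            tgt_freq[t] += 1
        # Count every source/target term pairing within a segment.
        for s in set(src_terms):
            for t in set(tgt_terms):
                pair_freq[(s, t)] += 1
    scores = {(s, t): 2 * c / (src_freq[s] + tgt_freq[t])
              for (s, t), c in pair_freq.items()}
    return {pair: d for pair, d in scores.items() if d >= min_dice}

# Hypothetical aligned segments: (source terms, Italian terms).
pairs = [
    (["halakhah"], ["norma"]),
    (["halakhah", "mishnah"], ["norma", "mishnà"]),
    (["mishnah"], ["mishnà"]),
]
aligned = dice_alignments(pairs)
```

Terms that always appear together across segments score 1.0 and survive the threshold, while incidental pairings from the mixed segment score lower and are filtered out.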
The automatic linguistic analysis of ancient Hebrew represents a new research opportunity in the field of Jewish studies. In fact, very little has been produced so far, both in terms of linguistic resources and, above all, of tools for the analysis of ancient Hebrew. This article illustrates work carried out within the Italian Translation of the Babylonian Talmud Project, aimed at the construction of an automatic linguistic annotator of Mishnaic Hebrew.
In the context of the Project for the Translation of the Babylonian Talmud into Italian (PTTB), a formal digital structuring of the terminology is being carried out. The terminological resource has been encoded in the form of a multilingual (Hebrew-Aramaic-Italian) Explanatory Combinatorial Dictionary, following the principles of Meaning-Text Theory. The construction of this resource was supported by text processing and computational linguistics techniques aimed at automatically extracting terms from the Italian translation of the Talmud and aligning them with the corresponding Hebrew/Aramaic source terms. The article describes the process set up for constructing the terminological resource, with the ultimate goal of illustrating the advantages of adopting a formal linguistic model. Indeed, the terminological resource aims to be a useful tool for investigating in depth the characteristics of the languages of the Talmud, for helping translators in their work and, more generally, the wide community of Talmud scholars.
The Literary Computing group of the Institute for Computational Linguistics at the National Research Council of Italy (ILC-CNR) is carrying out a line of research on designing software models for textual scholarship, as well as on implementing them using cutting-edge software engineering approaches and technologies. This research work is aimed at providing a general framework (called Omega) [4] - inherently conceived with the object-oriented paradigm and semantic web technologies - suitable for studying historical and literary documents and texts.