Documentation for the six semantically tagged Europarl v7 data files
<strong>Semantically tagged Europarl-es.v7 </strong> 55.8+ M lines. Lexical coverage of the tagging: 85.54%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ Output is in base form. Documentation of the semantic tagger (for Finnish, but the same principles hold for Swedish, too): https://www.aclweb.org/anthology/W19-0306/ https://zenodo.org/record/3676372#.YFNwIa8zY2w <strong>Format: base form POS Semtag</strong> Unknown words are marked with the tag Z99. <strong>Example output</strong> reanudación# Z99<br> del prep+art Z5<br> período noun T1.1 T1.3<br> de prep Z5<br> sesión# Z99<br> declarar# Z99<br> reanudar# Z99<br> el art Z5<br> período noun T1.1 T1.3<br> de prep Z5<br> sesión# Z99<br> del prep+art Z5<br> parlamento noun G1.2<br> europeo adj Z2 Z2/S2mfnc
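The "base form POS Semtag" format above can be read with a few lines of Python. The following is a minimal sketch, not the tooling used to produce the files. It assumes one token per line with whitespace-separated fields, that unknown words appear as the word followed by `#` and the single tag `Z99`, and that `NUMB` and `PUNCT` lines carry no semantic tags. The names `TaggedToken`, `parse_line` and `lexical_coverage` are hypothetical, and leaving numbers and punctuation out of the coverage figure is an assumption; the source does not say how the reported percentages were computed.

```python
from typing import NamedTuple, Optional, Tuple


class TaggedToken(NamedTuple):
    lemma: str
    pos: Optional[str]          # None for unknown words (no POS is given)
    semtags: Tuple[str, ...]    # all candidate tags, no disambiguation


def parse_line(line: str) -> TaggedToken:
    """Parse one 'base form POS Semtag' line (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"malformed line: {line!r}")
    lemma, rest = fields[0], fields[1:]
    # Unknown word: 'reanudación# Z99'
    if lemma.endswith("#") and rest == ["Z99"]:
        return TaggedToken(lemma.rstrip("#"), None, ("Z99",))
    # Numbers and punctuation: '17 NUMB', '. PUNCT'
    if len(rest) == 1 and rest[0] in ("NUMB", "PUNCT"):
        return TaggedToken(lemma, rest[0], ())
    # Known word: 'período noun T1.1 T1.3'
    return TaggedToken(lemma, rest[0], tuple(rest[1:]))


def lexical_coverage(lines) -> float:
    """Share of lexical tokens that received a tag other than Z99.

    Assumption: NUMB and PUNCT lines are excluded from the count.
    """
    tokens = [parse_line(l) for l in lines if l.strip()]
    lexical = [t for t in tokens if t.pos not in ("NUMB", "PUNCT")]
    if not lexical:
        return 0.0
    known = [t for t in lexical if "Z99" not in t.semtags]
    return len(known) / len(lexical)


sample = [
    "reanudación# Z99",
    "del prep+art Z5",
    "período noun T1.1 T1.3",
    "de prep Z5",
    "sesión# Z99",
]
print(lexical_coverage(sample))
```

On a full file the same function can be fed an open file handle, since it only iterates over lines.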
The so-called <em>Kotus word list</em> consists of the words in the 1990s <em>Perussanakirja</em> (Basic dictionary of Finnish), and in its original form it is available here: https://kaino.kotus.fi/sanat/nykysuomi/ The version of the word list published here, with 94 385 lexemes, is a modification that combines information from two sources: UD1 (Universal Dependency Parser) of the Turku NLP group (analysis runs were performed in The Language Bank of Finland), and semantic tags based on the UCREL Finnish semantic tag system: https://github.com/UCREL/Multilingual-USAS/tree/master/Finnish assigned with the FiST semantic tagger. If the word has been tagged with semantic tags by FiST, the output looks like this: <em> aakkonen Noun Q3</em> If the word was not analyzed by FiST, it is given its UD1 analysis and the tag Z99: <em> aallokas NOUN§ Case=Nom|Number=Sing Z99</em> UD1 was able to analyze 39 524 of the compounds not analyzed by FiST...
<strong>Semantically tagged Europarl-fi.v7 with POS data (UD1)</strong> 37.5+ M lines. Lexical coverage of the tagging: 91.31%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with UD1: https://turkunlp.org/finnish_nlp.html#parser Output is in base form together with the original running text. Documentation of the FiST semantic tagger: https://www.aclweb.org/anthology/W19-0306/ <strong>Semantic tagging</strong> Tagging of the data was performed in the Puhti computing environment of the CSC – IT CENTER FOR SCIENCE LTD. https://research.csc.fi/-/puhti <strong>Format</strong>: Text form Base form Semtag POS information. Unknown words are marked with the tag Z99. <strong>Example:</strong> 1 Istuntokauden istuntokausi Z99 NOUN _ Case=Gen|Number=Sing 2 nmod:poss _ _<br> 2 uudelleenavaaminen uudelleenavaaminen Z99 NOUN _ Case=Nom|Number=Sing 0 root _ _<br> 1 Julistan #julistaa#Verb#Q2.1 VERB _ Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ _<br> 2 perjantaina #perjantai#Noun#T1.3 NOUN _ Case=Ess|Number=Sing 1 nmod _ _<br> 3 joulukuun #joulukuu#Noun#T1.3 NOUN _ Case=Gen|Number=Sing 5 nmod:poss _ _<br> 4 NUMB<br> 5 päivänä #päivä#Noun#T1.3 NOUN _ Case=Ess|Number=Sing 6 nmod _ _<br> 6 keskeytetyn #keskeyttää#Verb#T2- VERB _ Case=Gen|Degree=Pos|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Pass 8 acl _ _<br> 7 Euroopan #Eurooppa#Proper#Z2 PROPN _ Case=Gen|Number=Sing 8 nmod:poss _ _<br> 8 parlamentin #parlamentti#Noun#G1.1/S5+ NOUN _ Case=Gen|Number=Sing 9 nmod:poss _ _<br> 9 istunnon #istunto#Noun#G1.1 Y2 NOUN _ Case=Gen|Number=Sing 10 dobj _ _<br> 10 avatuksi #avata#Verb#A10+ T2+ A1.1.1 VERB _ Case=Tra|Degree=Pos|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Pass 1 xcomp:ds _ _<br> 11 ja #ja#Conjunction#Z5 CONJ _ _ 1 cc _ _
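In the example above, recognized words carry a packed field of the form `#lemma#POS#semtags`, where the semantic tags are separated by spaces. The sketch below shows one way to unpack that field in Python. It assumes the field has already been isolated from the rest of the line; how columns are delimited in the actual files (tabs or spaces) is not stated in the description, so column splitting is left out. The function name `unpack_semfield` is hypothetical.

```python
from typing import List, Tuple


def unpack_semfield(field: str) -> Tuple[str, str, List[str]]:
    """Split a packed '#lemma#POS#tag1 tag2 ...' field into its parts.

    Assumption: the field starts with '#' and contains exactly three '#'
    separators before the tag list; compound tags such as 'G1.1/S5+' are
    kept as single items.
    """
    if not field.startswith("#"):
        raise ValueError(f"not a packed semantic field: {field!r}")
    parts = field.split("#", 3)   # ['', lemma, POS, 'tag1 tag2 ...']
    if len(parts) != 4:
        raise ValueError(f"not a packed semantic field: {field!r}")
    _, lemma, pos, tags = parts
    return lemma, pos, tags.split()


print(unpack_semfield("#avata#Verb#A10+ T2+ A1.1.1"))
print(unpack_semfield("#parlamentti#Noun#G1.1/S5+"))
```

Unknown words in the example do not use the packed form (the base form is followed directly by `Z99`), so callers would need to check for the leading `#` before calling the function.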
Purpose: This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper. Design/methodology/approach: This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar 1869–1918, which consists of ca. 1.45 million autosegmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top 10 results using a graded relevance scale of 0–3. Users were not informed about the OCR quality differences of the otherwise identical articles. Findings: The main resul...
<strong>Semantically tagged Europarl-fi.v7 </strong> 37+ M lines. Lexical coverage of the tagging: 92.88%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ Output is in base form. Documentation of the<strong> FiST</strong> semantic tagger: https://www.aclweb.org/anthology/W19-0306/ https://zenodo.org/record/3676372#.YFNwIa8zY2w <strong>Format: base form POS Semtag</strong> <em>Unknown words are marked with the tag Z99.</em> <strong><em>Example:</em></strong> istunto Noun G1.1 Y2<br> uudelleen Adverb N6+<br> julistaa Verb Q2.1<br> perjantai Noun T1.3<br> joulu Noun S9/T1.3<br> 17 NUMB<br> . PUNCT<br> päivä Noun T1.3<br> keskeyttää Verb T2-<br> Eurooppa Proper Z2
Pentti Haanpää (1905-1955) was one of the most important Finnish authors in the first half of the 20th century. His short stories and novels often describe life in the northwestern part of the Finnish countryside, but his collected works also include many other themes. Among his works are five books (three novels and two short story collections) that describe either military life or war. His first war novel, Korpisotaa, describes the Finnish Winter War of 1939-40. Haanpää wrote the novel, based loosely on his own war experiences, for a competition for the best Winter War novel arranged in 1940 by Prentice-Hall together with the Finnish publisher Otava; the novel was ranked third best in the competition. The novel is generally considered the first realistic war novel published in Finland [1-3], and its reception was favorable in general [4]. In this study, we focus on the analysis of geographic space in Korpisotaa. We use a digital version of the novel to be able to easily search for all the relevant space and location words in the novel. The methods we use in the study are familiar from linguistic corpus studies, and they have been used to some extent in literary studies as well. Besides common methods like keyness and frequency counts, we can benefit from a lexical semantic tagger of Finnish. Usage of the tagger systematizes the finding of the geographic space words in the novel and the comparison texts and enables us to perform keyness counts for semantic word groups instead of single words. Our work contributes especially to the use of digital methods in literary analysis and the creation of literary study corpora. Even for a novel-length text, the availability of a digital version greatly helps detailed analysis, as will be shown in the analysis of Korpisotaa.
Historical newspapers are increasingly accessed digitally for different purposes both by professional and lay users. These ever-growing historical collections are usually formed by utilizing Optical Character Recognition (OCR), which may introduce noise to the texts. This subsequently leads to compromised information retrieval (IR) performance and user understanding. The effect of OCR noise on IR performance has been studied earlier by utilizing artificially degraded OCR quality texts (see, e.g., [2, 15]), test collections containing documents with authentic low OCR quality [12], or by gathering end-user impressions [23]. However, it remains challenging to measure how the user's subjective perception is affected by the amount of OCR noise remaining in the documents. Recently, the National Library of Finland has set up an experimental system which allows studying this issue. The system allows presenting each underlying historical document as two alternatives, based either on the baseline OCR quality or on the new, improved OCR quality. This setup facilitates studying the effects of OCR quality changes on the user's subjective perception of the document.

Following Gäde et al. [8], we describe in this paper the research design, infrastructure, and research data utilized in a recent user experiment of Kettunen et al. [19] entailing thirty-two test subjects performing simulated work tasks [4], and discuss the prospects of reuse of the experimental components of the study. So far, the system has been used in one experiment in which the subjects performed simulated tasks. However, the research design and its general model could be utilized in the future to study the effects of OCR quality on professional settings entailing historians performing naturalistic phases of their research ta
This article uses semantic tagging to analyse the Nordic concept of everyman's rights (a right of public access to nature) in protocols of the Finnish parliament. In the analysis, we use a novel tool, a lexical semantic tagger for Finnish (Finnish-language Semantic Tagger), which is used to tag key discussions about everyman's rights in the Finnish parliament. The article makes two contributions. First, it presents a method that combines semantic tagging and similarity analysis of corpora (keyness) for studying the formation of political concepts in large textual data. Second, it sheds light on the Nordic access rights and the underlying customary everyman's rights. Despite its central role in public debate, the history of the concept has not been well researched. Our analysis shows that the legislative context could be clearly detected with our approach, and that the method allowed us to describe shifts in the meaning of everyman's rights in the legislative discussion.
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 16.51 million pages mainly in Finnish and Swedish. Out of these about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929.
Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so far been studied in data-oriented scenarios regarding the effectiveness of retrieval results. Such studies have either focused on the effects of artificially degraded OCR quality (see, e.g., [1-2]) or utilized test collections containing texts based on authentic low quality OCR data (see, e.g., [3]). In this paper the effects of OCR quality are studied in a user-oriented information retrieval setting. Thirty-two users each subjectively evaluated the query results of six topics (out of 30 topics) based on pre-formulated queries using a simulated work task setting. To the best of our knowledge, our simulated work task experiment is the first one showing empirically that users' subjective relevance assessments of retrieved documents are affected by a change in the quality of optically read text.
<strong>Semantically tagged Europarl-sv.v7 </strong> 45.6+ M lines. Lexical coverage of the tagging: 83.90%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ Output is in base form. Documentation of the semantic tagger (for Finnish, but the same principles hold for Swedish, too): https://www.aclweb.org/anthology/W19-0306/ https://zenodo.org/record/3676372#.YFNwIa8zY2w Format: base form POS Semtag. Unknown words are marked with the tag Z99. <strong>Example output:</strong> Återupptagande# Z99<br> av pp Z5<br> sessionen# Z99<br> jag nn S1.2.3+ Q4.1<br> förklara vb Q2.2 K5.1%<br> Europaparlamentets# Z99<br> session nn T1.3<br> återuppta# Z99<br> efter av X9.1-<br> avbrottet# Z99<br> en nl N1 Z8<br> 17 NUMB<br> december nn T1.3<br> . PUNCT<br> jag nn S1.2.3+ Q4.1
<strong>Semantically tagged Europarl-sv.v7 with POS data (UD2)</strong> 49.7+ M lines. Lexical coverage of the tagging: 87.79%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with UD2: https://turkunlp.org/finnish_nlp.html#parser Output is in base form together with the original running text, with the original sentences separated. Documentation of the FiST semantic tagger: https://www.aclweb.org/anthology/W19-0306/ Semantic tagging: Tagging of the data was performed in the Puhti computing environment of the CSC – IT CENTER FOR SCIENCE LTD. https://research.csc.fi/-/puhti Format: Text form Base form Semtag POS information. Unknown words are marked with the tag Z99. Example: # sent_id = 2<br> # text = Jag förklarar Europaparlamentets session återupptagen efter avbrottet den 17 december.<br> 1 Jag #jag#nn#S1.2.3+ Q4.1 PRON PERS-P1SG-NOM Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs 2 nsubj _ _<br> 2 förklarar #förklara#vb#Q2.2 K5.1% VERB PRES-ACT Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ _<br> 3 Europaparlamentets Europaparlamentets Z99 PROPN SG-GEN Case=Gen 4 nmod:poss _ _<br> 4 session #session#nn#T1.3 NOUN SG-IND-NOM Case=Nom|Definite=Ind|Gender=Com|Number=Sing 2 obj _ _<br> 5 återupptagen återupptå Z99 VERB AD-SG-IND Mood=Ind|VerbForm=Inf|Voice=Pass 2 xcomp _ _<br> 6 efter #efter#pp#M6 N4 T4- X7 A6.1+ ADP _ _ 7 case _ _<br> 7 avbrottet #avbrott#nn#T2- T1.2 NOUN SG-DEF-NOM Case=Nom|Definite=Def|Gender=Neut|Number=Sing 5 obl _ _<br> 8 den #den#al#Z5 PRON PERS-P3SG Definite=Def|Number=Plur|PronType=Prs 7 nmod _ _<br> 9 NUMB<br> 10 december decemb Z99 NOUN PL-IND-NOM Case=Nom|Definite=Ind|Gender=Com|Number=Plur 8 nmod _ SpaceAfter=No<br> . PUNCT<br>
<strong>Semantically tagged Europarl-pt.v7 </strong> 55.9+ M lines. Lexical coverage of the tagging: 76.97%. No semantic disambiguation is performed; all candidate tags are listed. POS tagging for the semantic tagging was performed with TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ Output is in base form. Documentation of the semantic tagger (for Finnish, but the same principles hold for Swedish, too): https://www.aclweb.org/anthology/W19-0306/ https://zenodo.org/record/3676372#.YFNwIa8zY2w <strong>Format: base form POS Semtag</strong> Unknown words are marked with the tag Z99. <strong>Example output</strong> reinício# Z99<br> de+a# Z99<br> sessão noun T1.3 T1.3/G1.1 T1.3/G2.1 T1.3/K1<br> declarar# Z99<br> reaberto# Z99<br> o pron Z8 Z8f Z8m Z8mf Z8mfn<br> sessão noun T1.3 T1.3/G1.1 T1.3/G2.1 T1.3/K1<br> de+o# Z99<br> parlamento noun G1.1<br> europeu noun Z2/S2mf Z3<br> , PUNCT<br> que conj A13.3 A6.1+ Z5 Z8<br> ter# Z99<br> ser verb A3+ Z5<br> interromper verb H4 T1.3+<br> em+a# Z99
This paper presents work that has been carried out in the National Library of Finland to improve optical character recognition (OCR) quality of a Finnish historical newspaper and journal collection 1771–1910. Work and results reported in the paper are based on a 500 000 word ground truth (GT) sample of the Finnish language part of the whole collection. The sample has three different parallel parts: a manually corrected ground truth version, original OCR with ABBYY FineReader v. 7 or v. 8, and an ABBYY FineReader v. 11 re-OCRed version. Based on this sample and its page image originals we have developed a re-OCRing process using the open source software package Tesseract v. 3.04.01. Our methods in the re-OCR include image preprocessing techniques, usage of morphological analyzers and a set of weighting rules for resulting candidate words. Besides results based on the GT sample, we also present results of re-OCR for a 29-year period of one newspaper of our collection, Uusi Suometar. The...
This study describes the first usage of a particular implementation of Normalized Compression Distance (NCD) as a machine translation quality evaluation tool. NCD has been introduced and tested for clustering and classification of different types of data and found to be a reliable and general tool. As far as we know NCD in its Complearn implementation has not been evaluated as a MT
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data.
This paper introduces mNCD, a method for automatic evaluation of machine translations. The measure is based on normalized compression distance (NCD), a general information theoretic measure of string similarity, and flexible word matching provided by stemming and synonyms. The mNCD measure outperforms NCD in system-level correlation to human judgments in English.
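For readers unfamiliar with NCD, the standard definition is NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed length of a string. The following Python sketch illustrates that formula only. It uses `zlib` from the standard library as the compressor, which is an assumption made for the sake of a self-contained example and not the compressor or implementation used in the papers above; it also does not include the stemming and synonym matching that distinguish mNCD. The function names are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


reference = b"the cat sat on the mat " * 20
candidate = b"completely different words appear in this other sentence " * 20

# A text compared with itself should be much closer than two unrelated texts.
print(ncd(reference, reference))
print(ncd(reference, candidate))
```

In a translation evaluation setting, `x` would be a reference translation and `y` a system output, with lower values indicating greater similarity. Real compressors are imperfect, so values can fall slightly outside the interval [0, 1].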
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary [16]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report evaluation results of NER with data from the digitized Finnish historical newspaper collection Digi. Experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from years 1771-1910 both in Finnish a...
This study continues a work in progress for implementing a full-text lexical semantic tagger for Finnish, FiST. The tagger is based on a 46,226 lexeme semantic lexicon of Finnish that was published in 2016 [1]. Kettunen [2], [3] describes the basic working version of FiST. FiST is based on freely available components: the first implementation uses Omorfi and FinnPos for morphological analysis and disambiguation of Finnish words. The current paper describes work with compound splitting for semantic tagging and its effects on the lexical coverage of the tagger. We try out two different approaches to morphological analysis and disambiguation of words for an improved version of FiST, FiSTComp: FinnPos [4], and Turku Dependency Parser [5], [6], UD1. Both these tools disambiguate morphological interpretations of words and provide boundary markings for compounds, but details and granularity of constituent decomposition vary. Our results with two-, three- and four-part compounds show that an...
The paper presents a new method for handling morphological variation of query terms in best-match IR. The method is based on enhanced inflectional stems. Use of inflectional stems has earlier been shown to be a good retrieval method in inflected indexes in a best-match environment for a highly inflected and compound-rich language, Finnish. In this paper the earlier stem method is elaborated upon by enhancing the stems with regular expressions. Contrary to our expectations, the results show that the enhanced stem queries do not outperform basic inflectional stems, but neither are they considerably worse with long queries. With short web-like queries they perform relatively better than with long queries and clearly outperform stemming (Finnish stemmer of Snowball) and plain, unprocessed query words. The main benefits of the proposed method, besides fairly good precision and recall (P-R) performance, are shorter and more manageable queries, which is of practical importance, e.g. with...
The paper introduces the evaluation results of Cross Language Information Retrieval (CLIR) for three target languages, Finnish, German and Swedish, using English as the source language. Our CLIR approach is based on machine translation of topics and usage of the Frequent Case Generation (FCG) method for management of query term variation in translated topics and retrieval in inflected indexes. Retrieval results of more standard query term variation management approaches, such as stemming and lemmatization of translated topics, are also shown. Results of the paper show that when machine translation of queries is combined with FCG, results can be at best very promising. The best Machine Translation (MT) programs seem to translate standard laboratory type Information Retrieval (IR) topics quite well, at least from the query performance point of view. A few times the translated queries perform as well as or slightly better than the monolingual baseline. Many times differences to monolingua...
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 16.51 million pages mainly in Finnish and Swedish. Out of these about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu . The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The last nine years, 1921–1929, were opened in January 2018. This paper presents briefly the ground truth Optical Character Recognition data of about 500 000 words that has been compiled at the NLF for development of an improved OCR process for the Finnish collection. We discuss compilation of the data generally and show results of the new OCR process in comparison to current OCR, using the ground truth data as an evaluation benchmark. We also show with real ...
This article reviews the research use of the open data of the National Library of Finland's digitized newspaper and journal materials. In 2017 a data package covering the years 1771–1910 of these materials was published, and so far a little over a year of experience of its research use has accumulated. The review also touches on the online use of the materials in research. In addition, we present the software interfaces through which the materials can be accessed.
Digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the early Finnish newspapers from 1771 until the present day. The material up to 1910 can be viewed in the public web service, whereas anything later is available at the six legal deposit libraries in Finland. A recent user study found that research use of different kinds is one of the key uses of the collection. The National Library of Finland has received several requests to provide the content of the digital collections as one offline bundle, where all the needed content is included. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats the most useful to provide as raw data for researchers. In this paper we will describe how the export format was created, how other parties have packaged the same data and what the benefits of the current approach are. We shall also briefly discuss word level quality of the content and show a real research scenario for the data.
Effects of three different morphological methods (lemmatization, stemming and inflectional stem generation) for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation is done using a four point relevance scale which is partitioned differently in different test settings. Results show that inflectional stem generation, which has not been used much in IR, compares well with lemmatization in a best-match IR environment. Differences in performance between inflectional stem generation and lemmatization are small and they are not statistically significant in most of the tested settings. It is also shown that a hitherto rather neglected method of morphological processing for Finnish, stemming, performs reasonably well although the stemmer used (a Porter stemmer implementation) is far from optimal for a morphologically complex language like Finnish. In another series of tests, the effects of compound splitting and derivational expansion of queries are tested.


Digitization by means of scanning and optical character recognition (OCR) of both handwritten and printed historical material during the last 10–15 years has been an ongoing academic and non-academic industry. Most probably this activity will only increase in the ongoing Digital Humanities era. As a result of past and current work we have lots of digital historical document collections available and will have more of them in the future. The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2014; Kettunen et al. 2014). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of the newspaper material (years 1771–1874) is also available freely downloadable in The Language Bank of Finland provided by the FinCLARIN consortium 1. The collection can also be accessed through the Korp 2 environment that has been developed by Språkbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield style information retrieval test collection has also been produced out of a small part of the Digi newspaper material at the University of Tampere (Järvelin et al. 2015). The web service digi.kansalliskirjasto.fi contains different material besides newspapers, including journals, and ephemera (different small prints). Recently a new service was created: it enables marking of clips and storing of them to a personal scrapbook. The web service is used, for example, by genealogists, heritage societies, researchers, and history enthusiast laymen. There is also an increasing desire to offer the material more widely for educational use. 
In 2014 the service had over 10 million page loads. User statistics show that about 88.5 % of Digi usage comes from Finland and 11.5 % from outside Finland. The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections (Holley 2008; Tanner et al. 2009). There is no single available method for assessing the quality of large collections, but different methods can be used to approximate it. This paper discusses different corpus-analysis-style methods for approximating the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analysers, frequency analysis of words, and comparisons with comparable edited lexical data. Our aim in the quality analysis is twofold: firstly, to analyse the present state of the lexical data, and secondly, to establish a set of assessment methods that build up a compact procedure for overall quality assessment after, e.g., re-OCRing or post-correction of the material. In the discussion part of the paper we synthesise the results of our different analyses. Our results show that about 69 % of all the word tokens of the Digi can be recognized with a modern Finnish morphological analyser. If the orthographical variation of v/w in 19th-century Finnish is taken into account and the number of out-of-vocabulary words (OOVs) is estimated, the recognition rate increases to 74–75 %. The rest, about 625 M words, is estimated to consist mostly of OCR errors, at least half of them being hard ones. The 1 M most frequent word types in the data make up 2.043 billion tokens, of which 79.1 % can be recognized. If words that occur only once in the data (hapax legomena) are analysed, 98 % of them are unrecognized by morphological software.
[1] https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/KielipankkiAineistotDigilibPub
[2] https://korp.csc.fi/
The lexical quality approximation process we have set up is relatively straightforward and does not need complicated tools. It is based on frequency calculations and the use of off-the-shelf modern Finnish morphological analysers. Even though we have so far done the estimation in a partially automated way, it is possible to automate the operation completely. It is also apparent that we need to be cautious in our conclusions, as the different data sets are of different sizes, which may cause errors in the estimations (Baayen 2001; Kilgarriff 2001). However, we believe that our analyses have shed considerable light on the quality of the Digi collection.
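The frequency-based estimation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`normalize_w`, `lexical_quality`), the toy lexicon, and the sample tokens are hypothetical, a plain set lookup stands in for a real Finnish morphological analyser, and the blanket w-to-v replacement is an assumed simplification of the v/w variation handling.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def normalize_w(word: str) -> str:
    """Map the historical letter 'w' to modern 'v' (e.g. 'wanha' -> 'vanha').

    Assumption: a blanket replacement is enough for illustration.
    """
    return word.replace("w", "v")


def lexical_quality(
    tokens: Iterable[str], is_known: Callable[[str], bool]
) -> Dict[str, float]:
    """Approximate lexical quality from token frequencies.

    `is_known` is a stand-in for a morphological analyser: it returns True
    when a word form is recognized.
    """
    freqs = Counter(token.lower() for token in tokens)
    total_tokens = sum(freqs.values())

    def known_vw(word: str) -> bool:
        # Recognized either as written or after the w -> v mapping.
        return is_known(word) or is_known(normalize_w(word))

    # Token-level counts: each word type contributes its frequency.
    plain = sum(count for word, count in freqs.items() if is_known(word))
    with_vw = sum(count for word, count in freqs.items() if known_vw(word))

    # Hapax legomena: word types that occur exactly once.
    hapaxes = [word for word, count in freqs.items() if count == 1]
    unknown_hapaxes = [word for word in hapaxes if not known_vw(word)]

    return {
        "recognition_rate": plain / total_tokens if total_tokens else 0.0,
        "recognition_rate_vw": with_vw / total_tokens if total_tokens else 0.0,
        "unrecognized_hapax_share": (
            len(unknown_hapaxes) / len(hapaxes) if hapaxes else 0.0
        ),
    }


# Hypothetical toy data: a tiny lexicon and an OCR-like token stream.
toy_lexicon = {"vanha", "talo", "on", "suuri"}
toy_tokens = "wanha talo on suuri talo on wanha tal0 snuri".split()

report = lexical_quality(toy_tokens, lambda word: word in toy_lexicon)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy run, 5 of the 9 tokens are recognized as written and 7 of 9 after the w-to-v mapping, while 2 of the 3 hapax legomena remain unrecognized. Replacing the set lookup with a call to an actual analyser would leave the rest of the calculation unchanged.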
Historical newspapers are increasingly accessed digitally for different purposes by both professional and lay users. These ever-growing historical collections are usually formed by utilizing Optical Character Recognition (OCR), which may introduce noise
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report the first large-scale trials and evaluation of NER with data from Digi, a digitized Finnish historical newspaper collection. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70–75 % (Kettunen and Pääkkönen, 2016). Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This research reports the first published large-scale results of NER in a historical Finnish OCRed newspaper collection. Results of the research supplement NER