Obtaining semantic or functional word categories from data in an unsupervised manner is a problem motivated both from the linguistic point of view and from that of constructing language models for various language processing tasks. In this work, we use the Self-Organizing Map algorithm to visualize and cluster common Finnish verbs based on their immediate morphological contexts. Based on a data set of over 500,000 utterances of 25 verbs, we studied (1) the base forms and (2) the most common word forms of the same verbs (4764 forms). The results show that even the simple feature selection used in this experiment is suitable for rough automatic categorization of verbs on the basis of data extracted from unrestricted texts. In particular, the automatically obtained organization resembles the semantic classification of verbs designed by a linguist.
The Consumer Society Research Centre (Kuluttajatutkimuskeskus), which operates at the Department of Political and Economic Studies of the University of Helsinki, has opened the Suomi24 data set for research use in cooperation with Aller, FIN-CLARIN and CSC. By Finnish standards, this is a unique open data project. Dozens of researchers and partner organizations have already joined the project, so the analyses of the data and the discussion around them will produce research and new kinds of uses for social media. Work in progress includes, for example, research initiatives on predictive algorithms and neologisms.
Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, WEBSOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information about short word contexts. The resulting ordered map, where similar documents lie near each other, thus presents a general view of the document space. With the aid of a suitable (WWW-based) interface, documents in interesting areas of the map can be browsed. The browsing can also be interactively extended to related topics, which appear in nearby areas on the map. Along with the method we present a case study of its use.
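The map ordering described above can be illustrated with a minimal sketch of a SOM training loop. It assumes toy 2-D feature vectors in place of the word-context statistics the abstract mentions; the names train_som and best_matching_unit, the grid size, and the decay schedules are hypothetical choices for illustration, not the WEBSOM implementation.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood (illustrative only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)    # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):        # present samples in random order
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the unit towards x, more strongly near the winner.
                        w[i] += lr * h * (x[i] - w[i])
    return weights


# Two well-separated toy "document" clusters in a 2-D feature space (invented data).
docs = [[0.1, 0.1], [0.15, 0.05], [0.05, 0.12],
        [0.9, 0.9], [0.85, 0.95], [0.92, 0.88]]
som = train_som(docs)
units = [best_matching_unit(som, d) for d in docs]
```

After training, units holds one grid coordinate per toy document; documents with similar feature vectors would be expected to land on the same or neighbouring units, which is the property a map-based browsing interface relies on.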
We study properties of morphemes by analyzing their use in a large Finnish text corpus using Independent Component Analysis (ICA). As a result, we obtain emergent linguistic representations for the morphemes. On a coarse level, main syntactic categories are observed. On a more detailed level, the components depict potential thematic roles of the morphemes. An interesting question is whether these discovered lower-dimensional representations could be directly utilized in language processing applications.
Obtaining semantic or functional word categories from data in an unsupervised manner is a problem motivated both from the linguistic point of view and from that of constructing language models for various language processing tasks. In this work, we use the self-organizing map algorithm to visualize and cluster common Finnish verbs based on functional and semantic information coded by case marking and function words like postpositions and adverbs. First, based on a data set of over 500,000 utterances of 25 verbs, we studied (a) the base forms and (b) the most common word forms of the same verbs (4764 forms). Second, the first experiment was repeated on a set of 600 verbs. The results show that even the simple feature selection used in this experiment is suitable for rough automatic categorization of verbs on the basis of data extracted from unrestricted texts. In particular, the results demonstrate the importance of cultural, social and emotional dimensions in lexical
Research focusing on online health discussions provides valuable insights into the use of medicines, as well as into health-related experiences and difficulties that are currently not well understood. We introduce Medicine Radar, a tool for exploring health-related online discussions obtained from the Finnish Suomi24 chat forum. The health subset of the entire Suomi24 data consists of 19 million messages written over a time span of 16 years. We outline the method, identify some challenges in analyzing Finnish texts and explain how we overcame them in this specific domain. In particular, we present a novel method for generating domain vocabularies from colloquial texts, which utilizes a combination of machine learning and human input. Medicine Radar is accessible as an open-source web interface that we hope will inspire and facilitate further research.
Statistical language modeling (SLM) is an essential part of any large-vocabulary continuous speech recognition (LVCSR) system. The development of the standard SLM methods has been strongly affected by the goals of LVCSR in English. The structure of Finnish is substantially different from that of English, so if the standard SLMs are applied directly, success is by no means guaranteed. In this paper we describe our first attempts at building an LVCSR system for Finnish and the new SLMs that we have tried. One of our objectives has been the indexing and recognition of broadcast news, so the issues of special interest to us are topic detection, word stemming and the modeling of words that are poorly covered in the training data. Our new methods are based on neural computing using the self-organizing map (SOM), which has recently been shown to successfully extract and approximate latent semantic structures from massive text collections.
In this work, we announce the Morfessor 1.0 software package, which is a program that takes as input a corpus of raw text and produces a segmentation of the word forms observed in the text. The segmentation obtained often resembles a linguistic morpheme segmentation. In addition, we briefly describe the Hutmegs package, which is also publicly available for research purposes. Hutmegs contains semi-automatically produced correct, or gold-standard, morpheme segmentations for a large number of Finnish and English word forms. One easy way for readers to familiarize themselves with our work is to test the demonstration program on our Internet site. The demo shows how Morfessor segments words that the user types in.
In this work, we describe the first public version of the Morfessor software, which is a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text. The segmentation obtained often resembles a linguistic morpheme segmentation. Morfessor is not language-dependent. The number of segments per word is not restricted to two or three as in some other existing morphology learning models. The current version of the software essentially implements two morpheme segmentation models presented earlier by us (Creutz and Lagus, 2002; Creutz, 2003). The document contains user instructions, as well as the mathematical formulation of the model and a description of the search algorithm used. Additionally, a few experiments on Finnish and English text corpora are reported in order to give users some idea of how to apply the program to their own data sets and how to evaluate the results.
Humans tend to group together related properties in order to understand complex phenomena. When modeling large problems with limited representational resources, it is important to be able to construct compact models of the data. Structuring the problem into sub-problems that can be modeled independently is a means of achieving compactness. We describe Independent Variable Group Analysis (IVGA), an unsupervised learning principle that, in modeling a data set, also discovers a grouping of the input variables that reflects statistical independencies in the data. In addition, we discuss its connection to some aspects of cognitive modeling and of representations in the brain. The IVGA approach and its implementation are designed to be practical, efficient, and useful for real-world applications. Initial experiments on several data sets are reported to examine the performance and potential uses of the method. The preliminary results are promising: the method does seem to find independ...
Text mining systems are developed to aid users in satisfying their information needs, which may vary from searching for answers to well-specified questions to learning more about a scientific discipline. The major tasks of web mining are searching, browsing, and visualization. Searching is best suited for answering specific questions of a well-informed user. Browsing and visualization, on the other hand, are beneficial especially when the information need is more general, or the topic area is new to the user [6]. The SOM, applied to organizing very large document collections, can aid in all three tasks. Statistical models of documents: In the vast majority of SOM applications, the input data constitute high-dimensional real feature vectors. In the SOMs that form similarity graphs of text documents, the feature sets that describe collections of words in the documents should be expressible as real vectors, too. The feature sets can simply be weighted histograms of the words, but usual...
Morpho Challenge is an annual evaluation campaign for unsupervised morpheme analysis. In morpheme analysis, words are segmented into smaller meaningful units. This is an essential part of processing complex word forms in many large-scale natural language processing applications, such as speech recognition, information retrieval, and machine translation. The discovery of morphemes is particularly important for morphologically rich languages, where inflection, derivation and composition can produce a huge number of different word forms. Morpho Challenge aims at language-independent unsupervised learning algorithms that can discover useful morpheme-like units from raw text material. In this paper we define the challenge, review the proposed algorithms, evaluations and results so far, and point out the questions that are still open.
Determining optimal units for representing morphologically complex words in the mental lexicon is a central question in psycholinguistics. Here, we utilize advances in computational sciences to study human morphological processing using statistical models of morphology, particularly the unsupervised Morfessor model that works on the principle of optimization. The aim was to see what kind of model structure corresponds best to human word recognition costs for multimorphemic Finnish nouns: a model incorporating units resembling linguistically defined morphemes, a whole-word model, or a model that seeks an optimal balance between these two extremes. Our results showed that human word recognition was predicted best by a combination of two models: a model that decomposes words at some morpheme boundaries while keeping others unsegmented, and a whole-word model. The results support dual-route models that assume that both decomposed and full-form representations are utilized to optimally...
Neuroimaging studies of the reading process point to functionally distinct stages in word recognition. Yet, current understanding of the operations linked to those various stages is mainly descriptive in nature. Approaches developed in the field of computational linguistics may offer a more quantitative way of understanding brain dynamics. Our aim was to evaluate whether a statistical model of morphology, with well-defined computational principles, can capture the neural dynamics of reading, using the concept of surprisal from information theory as the common measure. The Morfessor model, created for unsupervised discovery of morphemes, is based on the minimum description length principle and attempts to find optimal units of representation for complex words. In a word recognition task, we correlated brain responses with word surprisal values derived from Morfessor and with other psycholinguistic variables that have been linked with various levels of linguistic abstraction. The ...
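Surprisal, the measure named in the abstract above, is the negative log probability of an event, -log2 p, expressed in bits. The following is a small sketch under stated assumptions: it uses a maximum-likelihood unigram model over invented counts, and the name surprisal_bits and the toy units are hypothetical. It does not reproduce how Morfessor assigns probabilities to words.

```python
import math


def surprisal_bits(unit, counts):
    """Surprisal -log2 p(unit) under a maximum-likelihood unigram model."""
    total = sum(counts.values())
    return -math.log2(counts[unit] / total)


# Invented counts for illustration only; not taken from any corpus.
toy_counts = {"talo": 4, "ssa": 2, "kin": 2}
for unit in toy_counts:
    print(unit, surprisal_bits(unit, toy_counts))
```

Rarer units receive higher surprisal: here a unit covering half of the counts costs 1 bit, and one covering a quarter costs 2 bits.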
Recent entrepreneurship education research underlines the need to better understand affective and conative aspects of learning entrepreneurial behaviour. However, this research has not succeeded in defining how the interplay between the cognitive, conative and affective aspects takes place in learning processes. To better understand these differences we adopt the tripartite constructs of personality and intelligence originally introduced by Snow,
Papers by Krista Lagus