The first 100 days corpus is a curated corpus of the first 100 days of the United States of America’s President and the Senate. During the first 100 days, the political parties in the USA try to push their agendas for the upcoming year under the new President. As communication has changed, this is primarily being done on Twitter so that the President and Senators can communicate directly with their constituents. We analyzed the current President along with 100 Senators ranging across the political spectrum to see the differences in their language usage. The creation of this corpus is intended to help Natural Language Processing (NLP) and Political Science research studying the changing political climate during a shift in power through language. To help accomplish this, the corpus is harvested and normalized in multiple formats. We also include gold standard part-of-speech tags for selected individuals including the President. Through analysis of the text, a clear distinction between pol...
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
The paper presents ongoing work on the implementation of an MT system between Indonesian and Malaysian. The system uses a method of almost direct translation, exploiting the similarity of the two languages. This method was previously used on a number of European language pairs. The paper also gives an overview of linguistic phenomena that can negatively influence translation quality and suggests solutions for some of them. Keywords: machine translation; related languages; direct translation; morphology; hybrid method
In this paper, we study the effect of incorporating morphological information on an Indonesian (id) to English (en) Statistical Machine Translation (SMT) system as part of a preprocessing module. The linguistic phenomenon addressed here is Indonesian cliticized words. The approach is to transform the text by separating the correct clitics from a cliticized word to simplify the word alignment. We also study the effect of applying the preprocessing on different SMT systems trained on different kinds of text, such as spoken language text. The system is built using the state-of-the-art SMT tool, MOSES. The Indonesian morphological information is provided by MorphInd. Overall, the preprocessing improves the translation quality, especially for the Indonesian spoken language text, where it gains 1.78 BLEU points.
We adopt a previously developed model of deep syntactic and semantic processing to support question answering for Bahasa Indonesia, and extend it by adding a number of axioms designed to encode useful knowledge for answering questions, thus increasing the inferential power of the QA system. We believe this approach can increase the robustness of semantic analysis-based QA systems, whilst simultaneously lightening the burden of complexity in designing semantic attachment rules that transduce logical forms from syntactic structures. We show how these added axioms enable the system to answer questions which previously could not have been answered.
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to assure its quality. We also apply language-specific text processing such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: 'plain', stored in text format, and 'morphologically enriched', stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
This thesis addresses the problem of Verb Sense Disambiguation (VSD) in European Portuguese. Verb Sense Disambiguation is a subproblem of the Word Sense Disambiguation (WSD) problem, which tries to identify in which sense a polysemous word is used in a given sentence. Thus a sense inventory for each word (or lemma) must be used. For the VSD problem, this sense inventory consisted of a lexico-syntactic classification of the most frequent verbs in European Portuguese (ViPEr). Two approaches to VSD were considered. The first, rule-based approach makes use of the lexical, syntactic and semantic descriptions of the verb senses present in ViPEr to determine the meaning of a verb. The second approach uses machine learning with a set of features commonly used in the WSD problem to determine the correct meaning of the target verb. Both approaches were tested in several scenarios to determine the impact of different features and different combinations of methods. The baseline accuracy of 84%...
We introduce and describe ongoing work on our Indonesian dependency treebank. We describe the characteristics of the source data as well as our annotation guidelines for creating the dependency structures. Reported within are the results from the start of the Indonesian dependency treebank. We also show ensemble dependency parsing and self-training approaches applicable to under-resourced languages using our manually annotated dependency structures. We show that for an under-resourced language, the use of tuning data for a meta-classifier is more effective than using it as additional training data for individual parsers. This meta-classifier creates an ensemble dependency parser and increases the dependency accuracy by 4.92% on average and 1.99% over the best individual models on average. As the data sizes grow for the under-resourced language, a meta-classifier can easily adapt. To the best of our knowledge this is the first full implementation of a dependency parser for I...
This paper presents a method to improve a word alignment model in a phrase-based Statistical Machine Translation system for a low-resourced language using a string similarity approach. Our method captures similar words that can be seen as semi-monolingual across languages, such as numbers, named entities, and adapted/loan words. We use several string similarity metrics to measure the monolinguality of the words, such as Longest Common Subsequence Ratio (LCSR) and Minimum Edit Distance Ratio (MEDR), and we also use a modified BLEU Score (modBLEU). Our approach is to add intersecting alignment points for word pairs that are orthographically similar, before applying a word alignment heuristic, to generate a better word alignment. We demonstrate this approach on an Indonesian-to-English translation task, where the languages share many similar words that are poorly aligned given limited training data. This approach gives a statistically significant improvement by up to 0.66 in terms of BLEU ...
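The two named string similarity metrics can be illustrated with a minimal Python sketch. The abstract does not give formulas, so the definitions here are assumptions: LCSR is taken as the longest common subsequence length divided by the length of the longer word, and MEDR as one minus the Levenshtein distance divided by the length of the longer word. The function names are hypothetical, and the paper's actual normalization, thresholds, and modBLEU variant may differ.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed definition: LCS length over the length of the longer string."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def medr(a: str, b: str) -> float:
    """Assumed definition: 1 - edit distance over the length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Illustrative loanword pair (Indonesian "komputer", English "computer").
    print(lcsr("komputer", "computer"), medr("komputer", "computer"))
```

Under these assumed definitions, orthographically close pairs such as numbers or loanwords score near 1, which is the kind of signal that could be used to propose extra alignment points; any cutoff value would have to be tuned and is not stated in the abstract.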
The Votter Corpus is a new annotated corpus of social polling questions and answers. The Votter Corpus is novel in its use of the mobile application format and novel in its coverage of specific demographics. With over 26,000 polls and close to 1 million votes, the Votter Corpus covers everyday question and answer language, primarily for users who are female and between the ages of 13-24. The corpus is annotated by topic and by popularity of particular answers. The corpus contains many unique characteristics such as emoticons, common mobile misspellings, and images associated with many of the questions. The corpus is a collection of questions and answers from The Votter App on the Android operating system. Data is created solely on this mobile platform, which differs from most social media corpora. The Votter Corpus is being made available online in XML format for research and non-commercial use. The Votter Android app can be downloaded for free in most Android app stores.
This paper discusses the adaptation of the Stanford typed dependency model (de Marneffe and Manning 2008), initially designed for English, to the requirements of typologically different languages from the viewpoint of practical parsing. We argue for a framework of functional dependency grammar that is based on the idea of parallelism between syntax and semantics. There is a twofold challenge: (1) specifying the annotation scheme in order to deal with the morphological and syntactic peculiarities of each language and (2) maintaining cross-linguistically consistent annotations to ensure homogeneous analysis for similar linguistic phenomena. We applied a number of modifications to the original Stanford scheme in an attempt to capture the language-specific grammatical features present in heterogeneous CoNLL-encoded data sets for German, Dutch, French, Spanish, Brazilian Portuguese, Russian, Polish, Indonesian, and Traditional Chinese. From a multilingual perspective, we discuss features...
We describe the development of a bidirectional rule-based machine translation system between Indonesian and Malaysian (id-ms), two closely related Austronesian languages natively spoken by approximately 35 million people. The system is based on the re-use of free and publicly available resources, such as the Apertium machine translation platform and Wikipedia articles. We also present our approaches to overcome the data scarcity problems in both languages by exploiting the morphological similarities between the two.
This paper describes work on preparing an Indonesian-English Statistical Machine Translation (SMT) System. It includes the creation of an Indonesian morphological analyzer, MorphInd, and the composing of an Indonesian-English parallel corpus, IDENTIC. We build an SMT system using the state-of-the-art phrase-based SMT system, MOSES. We show several scenarios where the morphological tool is used to incorporate morphological information in the SMT system trained with the composed parallel corpus.
This paper presents a model of deep syntactic and semantic processing to support question-answering for Bahasa Indonesia. Starting from an existing unification-based grammar, we specify lexical semantics for each lexeme and semantic attachment rules for each grammar rule using lambda-calculus notation. This approach enables us to obtain semantic representations of Indonesian sentences in the form of first order
Proceedings of the 22nd Pacific Asia Conference on Language, Information, and Computation (PACLIC 2008), 2008
Abstract. We adopt a previously developed model of deep syntactic and semantic processing to support question answering for Bahasa Indonesia, and extend it by adding a number of axioms designed to encode useful knowledge for answering questions, thus increasing the inferential power of the QA system. We believe this approach can increase the robustness of semantic analysis-based QA systems, whilst simultaneously lightening the burden of complexity in designing semantic attachment rules that transduce logical forms ...
Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, Sep 1, 2007
This paper presents a model of deep syntactic and semantic processing to support question-answering for Bahasa Indonesia. Starting from an existing unification-based grammar, we specify lexical semantics for each lexeme and semantic attachment rules for each grammar rule using lambda-calculus notation. This approach enables us to obtain semantic representations of Indonesian sentences in the form of first-order logic literals. These representations are used by a question-answering module which stores declarative ...
This paper describes a robust finite state morphology tool for Indonesian (MorphInd), which handles both morphological analysis and lemmatization for a given surface word form so that it is suitable for further language processing. MorphInd has wider coverage on handling Indonesian derivational and inflectional morphology compared to an existing Indonesian morphological analyzer [1], along with a more detailed tagset. MorphInd
Papers by Septina Larasati