Showing 1–36 of 36 results for author: Malmasi, S

  1. arXiv:2407.09653  [pdf, other]

    cs.CL cs.IR

    Bridging the Gap Between Information Seeking and Product Search Systems: Q&A Recommendation for E-commerce

    Authors: Saar Kuzi, Shervin Malmasi

    Abstract: Consumers on a shopping mission often leverage both product search and information seeking systems, such as web search engines and Question Answering (QA) systems, in an iterative process to improve their understanding of available products and reach a purchase decision. While product search is useful for shoppers to find the actual products meeting their requirements in the catalog, information s…

    Submitted 16 July, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

    Journal ref: In ACM SIGIR Forum, vol. 58, no. 1, pp. 1-10. New York, NY, USA: ACM, 2024

  2. arXiv:2406.05255  [pdf, other]

    cs.CL cs.AI

    Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers

    Authors: Lütfi Kerem Senel, Besnik Fetahu, Davis Yoshida, Zhiyu Chen, Giuseppe Castellucci, Nikhita Vedula, Jason Choi, Shervin Malmasi

    Abstract: Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enables good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning L…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL 2024 Main Proceedings

  3. Question Suggestion for Conversational Shopping Assistants Using Product Metadata

    Authors: Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Digital assistants have become ubiquitous in e-commerce applications, following the recent advancements in Information Retrieval (IR), Natural Language Processing (NLP) and Generative Artificial Intelligence (AI). However, customers are often unsure or unaware of how to effectively converse with these assistants to meet their shopping needs. In this work, we emphasize the importance of providing c…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 5 pages, 1 figure

  4. arXiv:2404.06659  [pdf, other]

    cs.CL

    Leveraging Interesting Facts to Enhance User Engagement with Conversational Interfaces

    Authors: Nikhita Vedula, Giuseppe Castellucci, Eugene Agichtein, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Conversational Task Assistants (CTAs) guide users in performing a multitude of activities, such as making recipes. However, ensuring that interactions remain engaging, interesting, and enjoyable for CTA users is not trivial, especially for time-consuming or challenging tasks. Grounded in psychological theories of human interest, we propose to engage users with contextual and interesting statements…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: 10 pages, 1 figure

  5. arXiv:2404.06017  [pdf, other]

    cs.CL

    Identifying Shopping Intent in Product QA for Proactive Recommendations

    Authors: Besnik Fetahu, Nachshon Cohen, Elad Haramaty, Liane Lewin-Eytan, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Voice assistants have become ubiquitous in smart devices allowing users to instantly access information via voice questions. While extensive research has been conducted in question answering for voice search, little attention has been paid to how to enable proactive recommendations from a voice assistant to its users. This is a highly challenging problem that often leads to user friction, mainly d…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Accepted at IronGraphs@ECIR'2024

  6. arXiv:2404.02422  [pdf, other]

    cs.CL cs.LG

    Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data

    Authors: Parth Patwa, Simone Filice, Zhiyu Chen, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but it pays in terms of efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted at LREC-COLING 2024

  7. arXiv:2401.09785  [pdf, other]

    cs.CL

    Instant Answering in E-Commerce Buyer-Seller Messaging using Message-to-Question Reformulation

    Authors: Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi

    Abstract: E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts the buyer's shopping experience with response time fluctuations ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain…

    Submitted 30 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted at ECIR 2024

  8. arXiv:2401.09775  [pdf, other]

    cs.CL

    Controllable Decontextualization of Yes/No Question and Answers into Factual Statements

    Authors: Lingbo Mo, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Yes/No or polar questions represent one of the main linguistic question categories. They consist of a main interrogative clause, for which the answer is binary (assertion or negation). Polar questions and answers (PQA) represent a valuable knowledge resource present in many community and other curated QA sources, such as forums or e-commerce applications. Using answers to polar questions alone in…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted at ECIR 2024

  9. arXiv:2310.17034  [pdf, other]

    cs.CL

    Follow-on Question Suggestion via Voice Hints for Voice Assistants

    Authors: Besnik Fetahu, Pedro Faustini, Giuseppe Castellucci, Anjie Fang, Oleg Rokhlenko, Shervin Malmasi

    Abstract: The adoption of voice assistants like Alexa or Siri has grown rapidly, allowing users to instantly access information via voice search. Query suggestion is a standard feature of screen-based search experiences, allowing users to explore additional topics. However, this is not trivial to implement in voice-based settings. To enable this, we tackle the novel task of suggesting questions with compact…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Accepted as Long Paper at EMNLP'23 Findings

  10. arXiv:2310.16361  [pdf, other]

    cs.CL cs.AI

    InstructPTS: Instruction-Tuning LLMs for Product Title Summarization

    Authors: Besnik Fetahu, Zhiyu Chen, Oleg Rokhlenko, Shervin Malmasi

    Abstract: E-commerce product catalogs contain billions of items. Most products have lengthy titles, as sellers pack them with product attributes to improve retrieval, and highlight key product aspects. This results in a gap between such unnatural product titles, and how customers refer to them. It also limits how e-commerce stores can use these seller-provided titles for recommendation, QA, or review summa…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 (Industry Track)

  11. arXiv:2310.13213  [pdf, other]

    cs.CL cs.AI

    MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition

    Authors: Besnik Fetahu, Zhiyu Chen, Sudipta Kar, Oleg Rokhlenko, Shervin Malmasi

    Abstract: We present MULTICONER V2, a dataset for fine-grained Named Entity Recognition covering 33 entity classes across 12 languages, in both monolingual and multilingual settings. This dataset aims to tackle the following practical challenges in NER: (i) effective handling of fine-grained classes that include complex entities like movie titles, and (ii) performance degradation due to noise generated from…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to the Findings of EMNLP 2023

  12. arXiv:2306.03411  [pdf, other]

    cs.CL cs.AI cs.IR

    Generate-then-Retrieve: Intent-Aware FAQ Retrieval in Product Search

    Authors: Zhiyu Chen, Jason Choi, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Customers interacting with product search engines are increasingly formulating information-seeking queries. Frequently Asked Question (FAQ) retrieval aims to retrieve common question-answer pairs for a user query with question intent. Integrating FAQ retrieval in product search can not only empower users to make more informed purchase decisions, but also enhance user retention through efficient po…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: ACL 2023 Industry Track

  13. arXiv:2305.17393  [pdf, other]

    cs.CL cs.AI

    Answering Unanswered Questions through Semantic Reformulations in Spoken QA

    Authors: Pedro Faustini, Zhiyu Chen, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Spoken Question Answering (QA) is a key feature of voice assistants, usually backed by multiple QA systems. Users ask questions via spontaneous speech which can contain disfluencies, errors, and informal syntax or phrasing. This is a major challenge in QA, causing unanswered questions or irrelevant answers, and leading to bad user experiences. We analyze failed QA requests to identify core challen…

    Submitted 3 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Industry Track

  14. arXiv:2305.14793  [pdf, other]

    cs.CL

    Faithful Low-Resource Data-to-Text Generation through Cycle Training

    Authors: Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, Oleg Rokhlenko

    Abstract: Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to…

    Submitted 11 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 19 pages, 4 figures, ACL 2023

  15. arXiv:2305.06586  [pdf, other]

    cs.CL cs.AI

    SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)

    Authors: Besnik Fetahu, Sudipta Kar, Zhiyu Chen, Oleg Rokhlenko, Shervin Malmasi

    Abstract: We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as noisy settings. The task used the MultiCoNER V2 dataset, compos…

    Submitted 25 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: SemEval-2023 (co-located with ACL-2023 in Toronto, Canada)

  16. arXiv:2302.11074  [pdf, other]

    cs.CL cs.AI cs.LG

    Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks

    Authors: Sudipta Kar, Giuseppe Castellucci, Simone Filice, Shervin Malmasi, Oleg Rokhlenko

    Abstract: Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. Training an MTL model requires having the training data for all tasks available at the same time. As systems usually evolve over time (e.g., to support new functionalities), adding a new task to an existing MTL model usually requires retraining the…

    Submitted 21 February, 2023; originally announced February 2023.

    Comments: KDD 2022

  17. arXiv:2210.15777  [pdf, other]

    cs.CL cs.IR

    Reinforced Question Rewriting for Conversational Question Answering

    Authors: Zhiyu Chen, Jie Zhao, Anjie Fang, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

    Abstract: Conversational Question Answering (CQA) aims to answer questions contained within dialogues, which are not easily interpretable without context. Developing a model to rewrite conversational questions into self-contained ones is an emerging solution in industry settings as it allows using existing single-turn QA systems to avoid training a CQA model from scratch. Previous work trains rewriting mode…

    Submitted 31 October, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: A cleaned version of our paper accepted by EMNLP 2022 (Industry Track)

  18. arXiv:2208.14536  [pdf, other]

    cs.CL

    MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

    Authors: Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko

    Abstract: We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, a…

    Submitted 30 August, 2022; originally announced August 2022.

    Comments: Accepted at COLING 2022

  19. arXiv:1904.07839  [pdf, other]

    cs.CL

    UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks

    Authors: Gustavo Henrique Paetzold, Shervin Malmasi, Marcos Zampieri

    Abstract: In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by…

    Submitted 16 April, 2019; originally announced April 2019.

    Comments: Proceedings of SemEval

  20. arXiv:1903.08983  [pdf, other]

    cs.CL

    SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

    Authors: Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

    Abstract: We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. I…

    Submitted 26 April, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: Proceedings of the International Workshop on Semantic Evaluation (SemEval)

  21. arXiv:1902.09666  [pdf, ps, other]

    cs.CL

    Predicting the Type and Target of Offensive Posts in Social Media

    Authors: Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

    Abstract: As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offe…

    Submitted 16 April, 2019; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

  22. arXiv:1811.04695  [pdf, ps, other]

    cs.CL

    Classifying Patent Applications with Ensemble Methods

    Authors: Fernando Benites, Shervin Malmasi, Marcos Zampieri

    Abstract: We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We teste…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: Proceedings of ALTA 2018

  23. arXiv:1808.04800  [pdf, other]

    cs.CL

    Classifier Ensembles for Dialect and Language Variety Identification

    Authors: Liviu P. Dinu, Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi

    Abstract: In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018. We present a system developed to discriminate between Flemish and Dutch in subtitles and a system trained to discriminate between four Arabic dialects: Egyptian, Levantine, Gulf, North African, and Modern Standa…

    Submitted 14 August, 2018; originally announced August 2018.

  24. arXiv:1807.08230  [pdf, other]

    cs.CL

    German Dialect Identification Using Classifier Ensembles

    Authors: Alina Maria Ciobanu, Shervin Malmasi, Liviu P. Dinu

    Abstract: In this paper we present the GDI_classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transc…

    Submitted 21 July, 2018; originally announced July 2018.

  25. arXiv:1807.03108  [pdf, other]

    cs.CL

    Discriminating between Indo-Aryan Languages Using SVM Ensembles

    Authors: Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Santanu Pal, Liviu P. Dinu

    Abstract: In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. We investigate the performance of individual features and combine the output of single classifiers to maximize performance. The system competed in the Indo-Aryan Language Identification (I…

    Submitted 9 July, 2018; originally announced July 2018.

    Comments: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects

  26. arXiv:1804.11346  [pdf, other]

    cs.CL

    A Portuguese Native Language Identification Dataset

    Authors: Iria del Río, Marcos Zampieri, Shervin Malmasi

    Abstract: In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, D…

    Submitted 30 April, 2018; originally announced April 2018.

    Comments: Proceedings of The 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)

  27. arXiv:1804.09132  [pdf, other]

    cs.CL

    A Report on the Complex Word Identification Shared Task 2018

    Authors: Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo H. Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, Marcos Zampieri

    Abstract: We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT'2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification…

    Submitted 24 April, 2018; originally announced April 2018.

    Comments: Second CWI Shared Task co-located with the BEA Workshop 2018 at NAACL-HLT in New Orleans, USA

  28. arXiv:1803.05495  [pdf, other]

    cs.CL

    Challenges in Discriminating Profanity from Hate Speech

    Authors: Shervin Malmasi, Marcos Zampieri

    Abstract: In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically for this task, we employ supervised classification along with a set of features that includes n-grams, skip-grams and clustering-based word representations. We apply approaches based on single classifi…

    Submitted 14 March, 2018; originally announced March 2018.

    Journal ref: Shervin Malmasi, Marcos Zampieri (2018) Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence. Volume 30, Issue 2, pp. 187-202. Taylor & Francis

  29. arXiv:1712.06427  [pdf, other]

    cs.CL

    Detecting Hate Speech in Social Media

    Authors: Shervin Malmasi, Marcos Zampieri

    Abstract: In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in…

    Submitted 26 December, 2017; v1 submitted 18 December, 2017; originally announced December 2017.

    Comments: Proceedings of Recent Advances in Natural Language Processing (RANLP). pp. 467-472. Varna, Bulgaria

  30. arXiv:1710.09306  [pdf, other]

    cs.CL

    Exploring the Use of Text Classification in the Legal Domain

    Authors: Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, Josef van Genabith

    Abstract: In this paper, we investigate the application of text classification methods to support law professionals. We present several experiments applying machine learning techniques to predict with high accuracy the ruling of the French Supreme Court and the law area to which a case belongs. We also investigate the influence of the time period in which a ruling was made on the form of the case descrip…

    Submitted 25 October, 2017; originally announced October 2017.

    Comments: Proceedings of the 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL)

  31. arXiv:1710.04989  [pdf, other]

    cs.CL

    Complex Word Identification: Challenges in Data Annotation and System Performance

    Authors: Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, Lucia Specia

    Abstract: This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed p…

    Submitted 13 October, 2017; originally announced October 2017.

    Comments: Proceedings of the 4th Workshop on NLP Techniques for Educational Applications (NLPTEA 2017)

  32. arXiv:1707.04817  [pdf, ps, other]

    cs.CL

    Open-Set Language Identification

    Authors: Shervin Malmasi

    Abstract: We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus for each…

    Submitted 16 July, 2017; originally announced July 2017.

  33. arXiv:1707.00621  [pdf, ps, other]

    cs.CL

    Including Dialects and Language Varieties in Author Profiling

    Authors: Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Liviu P. Dinu

    Abstract: This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word $n$-grams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender iden…

    Submitted 3 July, 2017; originally announced July 2017.

    Comments: Proceedings of PAN at CLEF 2017

  34. arXiv:1703.06541  [pdf, other]

    cs.CL

    Native Language Identification using Stacked Generalization

    Authors: Shervin Malmasi, Mark Dras

    Abstract: Ensemble methods using multiple classifiers have proven to be the most successful approach for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of exp…

    Submitted 19 March, 2017; originally announced March 2017.

  35. arXiv:1610.00031  [pdf, other]

    cs.CL

    Discriminating Similar Languages: Evaluations and Explorations

    Authors: Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

    Abstract: We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ense…

    Submitted 30 September, 2016; originally announced October 2016.

    Comments: Proceedings of Language Resources and Evaluation (LREC)

    Journal ref: Proceedings of Language Resources and Evaluation (LREC). Portoroz, Slovenia. pp. 1800-1807 (2016)

  36. arXiv:1610.00030  [pdf, other]

    cs.CL

    Modeling Language Change in Historical Corpora: The Case of Portuguese

    Authors: Marcos Zampieri, Shervin Malmasi, Mark Dras

    Abstract: This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. W…

    Submitted 30 September, 2016; originally announced October 2016.

    Comments: Proceedings of Language Resources and Evaluation (LREC)

    Journal ref: Proceedings of Language Resources and Evaluation (LREC). Portoroz, Slovenia. pp. 4098-4104 (2016)