Search | arXiv e-print repository

Bridging the Gap Between Information Seeking and Product Search Systems: Q&A Recommendation for E-commerce

Abstract: Consumers on a shopping mission often leverage both product search and information seeking systems, such as web search engines and Question Answering (QA) systems, in an iterative process to improve their understanding of available products and reach a purchase decision. While product search is useful for shoppers to find the actual products meeting their requirements in the catalog, information s… ▽ More Consumers on a shopping mission often leverage both product search and information seeking systems, such as web search engines and Question Answering (QA) systems, in an iterative process to improve their understanding of available products and reach a purchase decision. While product search is useful for shoppers to find the actual products meeting their requirements in the catalog, information seeking systems can be utilized to answer any questions they may have to refine those requirements. The recent success of Large Language Models (LLMs) has opened up an opportunity to bridge the gap between the two tasks to help customers achieve their goals quickly and effectively by integrating conversational QA within product search. In this paper, we propose to recommend users Question-Answer (Q&A) pairs that are relevant to their product search and can help them make a purchase decision. We discuss the different aspects of the problem including the requirements and characteristics of the Q&A pairs, their generation, and the optimization of the Q&A recommendation task. We highlight the challenges, open problems, and suggested solutions to encourage future research in this emerging area. △ Less

Submitted 16 July, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

Journal ref: In ACM SIGIR Forum, vol. 58, no. 1, pp. 1-10. New York, NY, USA: ACM, 2024

arXiv:2406.05255 [pdf, other]

Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers

Authors: Lütfi Kerem Senel, Besnik Fetahu, Davis Yoshida, Zhiyu Chen, Giuseppe Castellucci, Nikhita Vedula, Jason Choi, Shervin Malmasi

Abstract: Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enable good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning L… ▽ More Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enable good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations, and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at ACL 2024 Main Proceedings

arXiv:2405.01738 [pdf, other]

doi 10.1145/3626772.3661371

Question Suggestion for Conversational Shopping Assistants Using Product Metadata

Authors: Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi

Abstract: Digital assistants have become ubiquitous in e-commerce applications, following the recent advancements in Information Retrieval (IR), Natural Language Processing (NLP) and Generative Artificial Intelligence (AI). However, customers are often unsure or unaware of how to effectively converse with these assistants to meet their shopping needs. In this work, we emphasize the importance of providing c… ▽ More Digital assistants have become ubiquitous in e-commerce applications, following the recent advancements in Information Retrieval (IR), Natural Language Processing (NLP) and Generative Artificial Intelligence (AI). However, customers are often unsure or unaware of how to effectively converse with these assistants to meet their shopping needs. In this work, we emphasize the importance of providing customers a fast, easy to use, and natural way to interact with conversational shopping assistants. We propose a framework that employs Large Language Models (LLMs) to automatically generate contextual, useful, answerable, fluent and diverse questions about products, via in-context learning and supervised fine-tuning. Recommending these questions to customers as helpful suggestions or hints to both start and continue a conversation can result in a smoother and faster shopping experience with reduced conversation overhead and friction. We perform extensive offline evaluations, and discuss in detail about potential customer impact, and the type, length and latency of our generated product questions if incorporated into a real-world shopping assistant. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 5 pages, 1 figure

arXiv:2404.06659 [pdf, other]

Leveraging Interesting Facts to Enhance User Engagement with Conversational Interfaces

Authors: Nikhita Vedula, Giuseppe Castellucci, Eugene Agichtein, Oleg Rokhlenko, Shervin Malmasi

Abstract: Conversational Task Assistants (CTAs) guide users in performing a multitude of activities, such as making recipes. However, ensuring that interactions remain engaging, interesting, and enjoyable for CTA users is not trivial, especially for time-consuming or challenging tasks. Grounded in psychological theories of human interest, we propose to engage users with contextual and interesting statements… ▽ More Conversational Task Assistants (CTAs) guide users in performing a multitude of activities, such as making recipes. However, ensuring that interactions remain engaging, interesting, and enjoyable for CTA users is not trivial, especially for time-consuming or challenging tasks. Grounded in psychological theories of human interest, we propose to engage users with contextual and interesting statements or facts during interactions with a multi-modal CTA, to reduce fatigue and task abandonment before a task is complete. To operationalize this idea, we train a high-performing classifier (82% F1-score) to automatically identify relevant and interesting facts for users. We use it to create an annotated dataset of task-specific interesting facts for the domain of cooking. Finally, we design and validate a dialogue policy to incorporate the identified relevant and interesting facts into a conversation, to improve user engagement and task completion. Live testing on a leading multi-modal voice assistant shows that 66% of the presented facts were received positively, leading to a 40% gain in the user satisfaction rating, and a 37% increase in conversation length. These findings emphasize that strategically incorporating interesting facts into the CTA experience can promote real-world user participation for guided task interactions. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: 10 pages, 1 figure

arXiv:2404.06017 [pdf, other]

Identifying Shopping Intent in Product QA for Proactive Recommendations

Authors: Besnik Fetahu, Nachshon Cohen, Elad Haramaty, Liane Lewin-Eytan, Oleg Rokhlenko, Shervin Malmasi

Abstract: Voice assistants have become ubiquitous in smart devices allowing users to instantly access information via voice questions. While extensive research has been conducted in question answering for voice search, little attention has been paid on how to enable proactive recommendations from a voice assistant to its users. This is a highly challenging problem that often leads to user friction, mainly d… ▽ More Voice assistants have become ubiquitous in smart devices allowing users to instantly access information via voice questions. While extensive research has been conducted in question answering for voice search, little attention has been paid on how to enable proactive recommendations from a voice assistant to its users. This is a highly challenging problem that often leads to user friction, mainly due to recommendations provided to the users at the wrong time. We focus on the domain of e-commerce, namely in identifying Shopping Product Questions (SPQs), where the user asking a product-related question may have an underlying shopping need. Identifying a user's shopping need allows voice assistants to enhance shopping experience by determining when to provide recommendations, such as product or deal recommendations, or proactive shopping actions recommendation. Identifying SPQs is a challenging problem and cannot be done from question text alone, and thus requires to infer latent user behavior patterns inferred from user's past shopping history. We propose features that capture the user's latent shopping behavior from their purchase history, and combine them using a novel Mixture-of-Experts (MoE) model. Our evaluation shows that the proposed approach is able to identify SPQs with a high score of F1=0.91. Furthermore, based on an online evaluation with real voice assistant users, we identify SPQs in real-time and recommend shopping actions to users to add the queried product into their shopping list. We demonstrate that we are able to accurately identify SPQs, as indicated by the significantly higher rate of added products to users' shopping lists when being prompted after SPQs vs random PQs. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: Accepted at IronGraphs@ECIR'2024

arXiv:2404.02422 [pdf, other]

Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data

Authors: Parth Patwa, Simone Filice, Zhiyu Chen, Giuseppe Castellucci, Oleg Rokhlenko, Shervin Malmasi

Abstract: Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but it pays in terms of efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or… ▽ More Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks. In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but it pays in terms of efficiency, due to the longer input prompt. In this paper, we propose a strategy to make LLMs as efficient as 0-shot text classifiers, while getting comparable or better accuracy than ICL. Our solution targets the low resource setting, i.e., when only 4 examples per class are available. Using a single LLM and few-shot real data we perform a sequence of generation, filtering and Parameter-Efficient Fine-Tuning steps to create a robust and efficient classifier. Experimental results show that our approach leads to competitive results on multiple text classification datasets. △ Less

Submitted 2 April, 2024; originally announced April 2024.

Comments: Accepted at LREC-COLING 2024

arXiv:2401.09785 [pdf, other]

Instant Answering in E-Commerce Buyer-Seller Messaging using Message-to-Question Reformulation

Authors: Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi

Abstract: E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts buyer's shopping experience with response time fluctuations ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain… ▽ More E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts buyer's shopping experience with response time fluctuations ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain-specific federated Question Answering (QA) system. The main challenge is adapting current QA systems, designed for single questions, to address detailed customer queries. We address this with a low-latency, sequence-to-sequence approach, MESSAGE-TO-QUESTION ( M2Q ). It reformulates buyer messages into succinct questions by identifying and extracting the most salient information from a message. Evaluation against baselines shows that M2Q yields relative increases of 757% in question understanding, and 1,746% in answering rate from the federated QA system. Live deployment shows that automatic answering saves sellers from manually responding to millions of messages per year, and also accelerates customer purchase decisions by eliminating the need for buyers to wait for a reply △ Less

Submitted 30 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted at ECIR 2024

arXiv:2401.09775 [pdf, other]

Controllable Decontextualization of Yes/No Question and Answers into Factual Statements

Authors: Lingbo Mo, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

Abstract: Yes/No or polar questions represent one of the main linguistic question categories. They consist of a main interrogative clause, for which the answer is binary (assertion or negation). Polar questions and answers (PQA) represent a valuable knowledge resource present in many community and other curated QA sources, such as forums or e-commerce applications. Using answers to polar questions alone in… ▽ More Yes/No or polar questions represent one of the main linguistic question categories. They consist of a main interrogative clause, for which the answer is binary (assertion or negation). Polar questions and answers (PQA) represent a valuable knowledge resource present in many community and other curated QA sources, such as forums or e-commerce applications. Using answers to polar questions alone in other contexts is not trivial. Answers are contextualized, and presume that the interrogative question clause and any shared knowledge between the asker and answerer are provided. We address the problem of controllable rewriting of answers to polar questions into decontextualized and succinct factual statements. We propose a Transformer sequence to sequence model that utilizes soft-constraints to ensure controllable rewriting, such that the output statement is semantically equivalent to its PQA input. Evaluation on three separate PQA datasets as measured through automated and human evaluation metrics show that our proposed approach achieves the best performance when compared to existing baselines. △ Less

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted at ECIR 2024

arXiv:2310.17034 [pdf, other]

Follow-on Question Suggestion via Voice Hints for Voice Assistants

Authors: Besnik Fetahu, Pedro Faustini, Giuseppe Castellucci, Anjie Fang, Oleg Rokhlenko, Shervin Malmasi

Abstract: The adoption of voice assistants like Alexa or Siri has grown rapidly, allowing users to instantly access information via voice search. Query suggestion is a standard feature of screen-based search experiences, allowing users to explore additional topics. However, this is not trivial to implement in voice-based settings. To enable this, we tackle the novel task of suggesting questions with compact… ▽ More The adoption of voice assistants like Alexa or Siri has grown rapidly, allowing users to instantly access information via voice search. Query suggestion is a standard feature of screen-based search experiences, allowing users to explore additional topics. However, this is not trivial to implement in voice-based settings. To enable this, we tackle the novel task of suggesting questions with compact and natural voice hints to allow users to ask follow-up questions. We define the task, ground it in syntactic theory and outline linguistic desiderata for spoken hints. We propose baselines and an approach using sequence-to-sequence Transformers to generate spoken hints from a list of questions. Using a new dataset of 6681 input questions and human written hints, we evaluated the models with automatic metrics and human evaluation. Results show that a naive approach of concatenating suggested questions creates poor voice hints. Our approach, which applies a linguistically-motivated pretraining task was strongly preferred by humans for producing the most natural hints. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted as Long Paper at EMNLP'23 Findings

arXiv:2310.16361 [pdf, other]

InstructPTS: Instruction-Tuning LLMs for Product Title Summarization

Authors: Besnik Fetahu, Zhiyu Chen, Oleg Rokhlenko, Shervin Malmasi

Abstract: E-commerce product catalogs contain billions of items. Most products have lengthy titles, as sellers pack them with product attributes to improve retrieval, and highlight key product aspects. This results in a gap between such unnatural products titles, and how customers refer to them. It also limits how e-commerce stores can use these seller-provided titles for recommendation, QA, or review summa… ▽ More E-commerce product catalogs contain billions of items. Most products have lengthy titles, as sellers pack them with product attributes to improve retrieval, and highlight key product aspects. This results in a gap between such unnatural products titles, and how customers refer to them. It also limits how e-commerce stores can use these seller-provided titles for recommendation, QA, or review summarization. Inspired by recent work on instruction-tuned LLMs, we present InstructPTS, a controllable approach for the task of Product Title Summarization (PTS). Trained using a novel instruction fine-tuning strategy, our approach is able to summarize product titles according to various criteria (e.g. number of words in a summary, inclusion of specific phrases, etc.). Extensive evaluation on a real-world e-commerce catalog shows that compared to simple fine-tuning of LLMs, our proposed approach can generate more accurate product name summaries, with an improvement of over 14 and 8 BLEU and ROUGE points, respectively. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted by EMNLP 2023 (Industry Track)

arXiv:2310.13213 [pdf, other]

MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition

Authors: Besnik Fetahu, Zhiyu Chen, Sudipta Kar, Oleg Rokhlenko, Shervin Malmasi

Abstract: We present MULTICONER V2, a dataset for fine-grained Named Entity Recognition covering 33 entity classes across 12 languages, in both monolingual and multilingual settings. This dataset aims to tackle the following practical challenges in NER: (i) effective handling of fine-grained classes that include complex entities like movie titles, and (ii) performance degradation due to noise generated from… ▽ More We present MULTICONER V2, a dataset for fine-grained Named Entity Recognition covering 33 entity classes across 12 languages, in both monolingual and multilingual settings. This dataset aims to tackle the following practical challenges in NER: (i) effective handling of fine-grained classes that include complex entities like movie titles, and (ii) performance degradation due to noise generated from typing mistakes or OCR errors. The dataset is compiled from open resources like Wikipedia and Wikidata, and is publicly available. Evaluation based on the XLM-RoBERTa baseline highlights the unique challenges posed by MULTICONER V2: (i) the fine-grained taxonomy is challenging, where the scores are low with macro-F1=0.63 (across all languages), and (ii) the corruption strategy significantly impairs performance, with entity corruption resulting in 9% lower performance relative to non-entity corruptions across all languages. This highlights the greater impact of entity noise in contrast to context noise. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted to the Findings of EMNLP 2023

arXiv:2306.03411 [pdf, other]

Generate-then-Retrieve: Intent-Aware FAQ Retrieval in Product Search

Authors: Zhiyu Chen, Jason Choi, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

Abstract: Customers interacting with product search engines are increasingly formulating information-seeking queries. Frequently Asked Question (FAQ) retrieval aims to retrieve common question-answer pairs for a user query with question intent. Integrating FAQ retrieval in product search can not only empower users to make more informed purchase decisions, but also enhance user retention through efficient po… ▽ More Customers interacting with product search engines are increasingly formulating information-seeking queries. Frequently Asked Question (FAQ) retrieval aims to retrieve common question-answer pairs for a user query with question intent. Integrating FAQ retrieval in product search can not only empower users to make more informed purchase decisions, but also enhance user retention through efficient post-purchase support. Determining when an FAQ entry can satisfy a user's information need within product search, without disrupting their shopping experience, represents an important challenge. We propose an intent-aware FAQ retrieval system consisting of (1) an intent classifier that predicts when a user's information need can be answered by an FAQ; (2) a reformulation model that rewrites a query into a natural question. Offline evaluation demonstrates that our approach improves Hit@1 by 13% on retrieving ground-truth FAQs, while reducing latency by 95% compared to baseline systems. These improvements are further validated by real user feedback, where 71% of displayed FAQs on top of product search results received explicit positive user feedback. Overall, our findings show promising directions for integrating FAQ retrieval into product search at scale. △ Less

Submitted 6 June, 2023; originally announced June 2023.

Comments: ACL 2023 Industry Track

arXiv:2305.17393 [pdf, other]

Answering Unanswered Questions through Semantic Reformulations in Spoken QA

Authors: Pedro Faustini, Zhiyu Chen, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

Abstract: Spoken Question Answering (QA) is a key feature of voice assistants, usually backed by multiple QA systems. Users ask questions via spontaneous speech which can contain disfluencies, errors, and informal syntax or phrasing. This is a major challenge in QA, causing unanswered questions or irrelevant answers, and leading to bad user experiences. We analyze failed QA requests to identify core challen… ▽ More Spoken Question Answering (QA) is a key feature of voice assistants, usually backed by multiple QA systems. Users ask questions via spontaneous speech which can contain disfluencies, errors, and informal syntax or phrasing. This is a major challenge in QA, causing unanswered questions or irrelevant answers, and leading to bad user experiences. We analyze failed QA requests to identify core challenges: lexical gaps, proposition types, complex syntactic structure, and high specificity. We propose a Semantic Question Reformulation (SURF) model offering three linguistically-grounded operations (repair, syntactic reshaping, generalization) to rewrite questions to facilitate answering. Offline evaluation on 1M unanswered questions from a leading voice assistant shows that SURF significantly improves answer rates: up to 24% of previously unanswered questions obtain relevant answers (75%). Live deployment shows positive impact for millions of customers with unanswered questions; explicit relevance feedback shows high user satisfaction. △ Less

Submitted 3 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: ACL 2023 Industry Track

arXiv:2305.14793 [pdf, other]

Faithful Low-Resource Data-to-Text Generation through Cycle Training

Authors: Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, Oleg Rokhlenko

Abstract: Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to… ▽ More Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to improve the faithfulness of output text. Since the problem is fundamentally one of consistency between the representations of the structured data and text, we evaluate the effectiveness of cycle training in this work. Cycle training uses two models which are inverses of each other: one that generates text from structured data, and one which generates the structured data from natural language text. We show that cycle training, when initialized with a small amount of supervised data (100 samples in our case), achieves nearly the same performance as fully supervised approaches for the data-to-text generation task on the WebNLG, E2E, WTQ, and WSQL datasets. We perform extensive empirical analysis with automated evaluation metrics and a newly designed human evaluation schema to reveal different cycle training strategies' effectiveness of reducing various types of generation errors. Our code is publicly available at https://github.com/Edillower/CycleNLG. △ Less

Submitted 11 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 19 pages, 4 figures, ACL 2023

arXiv:2305.06586 [pdf, other]

SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)

Authors: Besnik Fetahu, Sudipta Kar, Zhiyu Chen, Oleg Rokhlenko, Shervin Malmasi

Abstract: We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as noisy settings. The task used the MultiCoNER V2 dataset, compos… ▽ More We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as noisy settings. The task used the MultiCoNER V2 dataset, composed of 2.2 million instances in Bangla, Chinese, English, Farsi, French, German, Hindi, Italian., Portuguese, Spanish, Swedish, and Ukrainian. MultiCoNER 2 was one of the most popular tasks of SemEval-2023. It attracted 842 submissions from 47 teams, and 34 teams submitted system papers. Results showed that complex entity types such as media titles and product names were the most challenging. Methods fusing external knowledge into transformer models achieved the best performance, and the largest gains were on the Creative Work and Group classes, which are still challenging even with external knowledge. Some fine-grained classes proved to be more challenging than others, such as SCIENTIST, ARTWORK, and PRIVATECORP. We also observed that noisy data has a significant impact on model performance, with an average drop of 10% on the noisy subset. The task highlights the need for future research on improving NER robustness on noisy data containing complex entities. △ Less

Submitted 25 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: SemEval-2023 (co-located with ACL-2023 in Toronto, Canada)

arXiv:2302.11074 [pdf, other]

doi 10.1145/3534678.3539169

Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks

Authors: Sudipta Kar, Giuseppe Castellucci, Simone Filice, Shervin Malmasi, Oleg Rokhlenko

Abstract: Multi-Task Learning (MTL) is widely-accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. Training an MTL model requires having the training data for all tasks available at the same time. As systems usually evolve over time, (e.g., to support new functionalities), adding a new task to an existing MTL model usually requires retraining the… ▽ More Multi-Task Learning (MTL) is widely-accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. Training an MTL model requires having the training data for all tasks available at the same time. As systems usually evolve over time, (e.g., to support new functionalities), adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks and this can be time-consuming and computationally expensive. Moreover, in some scenarios, the data used to train the original training may be no longer available, for example, due to storage or privacy concerns. In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks. To avoid catastrophic forgetting, we propose to exploit unlabeled data from the same distributions of the old tasks. Our experiments on publicly available benchmarks show that such a technique dramatically benefits the distillation by preserving the already acquired knowledge (i.e., preventing up to 20% performance drops on old tasks) while obtaining good performance on the incrementally added tasks. Further, we also show that our approach is beneficial in practical settings by using data from a leading voice assistant. △ Less

Submitted 21 February, 2023; originally announced February 2023.

Comments: KDD 2022

arXiv:2210.15777 [pdf, other]

Reinforced Question Rewriting for Conversational Question Answering

Authors: Zhiyu Chen, Jie Zhao, Anjie Fang, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi

Abstract: Conversational Question Answering (CQA) aims to answer questions contained within dialogues, which are not easily interpretable without context. Developing a model to rewrite conversational questions into self-contained ones is an emerging solution in industry settings as it allows using existing single-turn QA systems to avoid training a CQA model from scratch. Previous work trains rewriting mode… ▽ More Conversational Question Answering (CQA) aims to answer questions contained within dialogues, which are not easily interpretable without context. Developing a model to rewrite conversational questions into self-contained ones is an emerging solution in industry settings as it allows using existing single-turn QA systems to avoid training a CQA model from scratch. Previous work trains rewriting models using human rewrites as supervision. However, such objectives are disconnected with QA models and therefore more human-like rewrites do not guarantee better QA performance. In this paper we propose using QA feedback to supervise the rewriting model with reinforcement learning. Experiments show that our approach can effectively improve QA performance over baselines for both extractive and retrieval QA. Furthermore, human evaluation shows that our method can generate more accurate and detailed rewrites when compared to human annotations. △ Less

Submitted 31 October, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: A cleaned version of our paper Accepted by EMNLP 2022 (Industry Track)

arXiv:2208.14536 [pdf, other]

MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition

Authors: Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko

Abstract: We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, a… ▽ More We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We applied two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%), highlighting the difficulty of our data. GEMNET, which uses gazetteers, improvement significantly (average improvement of macro-F1=+30%). MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems. MultiCoNER is publicly available at https://registry.opendata.aws/multiconer/ and we hope that this resource will help advance research in various aspects of NER. △ Less

Submitted 30 August, 2022; originally announced August 2022.

Comments: Accepted at COLING 2022

arXiv:1904.07839 [pdf, other]

UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks

Authors: Gustavo Henrique Paetzold, Shervin Malmasi, Marcos Zampieri

Abstract: In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by… ▽ More In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track. △ Less

Submitted 16 April, 2019; originally announced April 2019.

Comments: Proceedings of SemEval

arXiv:1903.08983 [pdf, other]

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

Authors: Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

Abstract: We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. I… ▽ More We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. In sub-task B, the focus was on the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, about 800 teams signed up to participate in the task, and 115 of them submitted results, which we present and analyze in this report. △ Less

Submitted 26 April, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

Comments: Proceedings of the International Workshop on Semantic Evaluation (SemEval)

arXiv:1902.09666 [pdf, ps, other]

Predicting the Type and Target of Offensive Posts in Social Media

Authors: Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

Abstract: As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbulling, or cyber-aggression. In contrast, here we target several different kinds of offe… ▽ More As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbulling, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we complied the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and we compare the performance of different machine learning models on OLID. △ Less

Submitted 16 April, 2019; v1 submitted 25 February, 2019; originally announced February 2019.

Comments: Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

arXiv:1811.04695 [pdf, ps, other]

Classifying Patent Applications with Ensemble Methods

Authors: Fernando Benites, Shervin Malmasi, Marcos Zampieri

Abstract: We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We teste… ▽ More We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition. △ Less

Submitted 12 November, 2018; originally announced November 2018.

Comments: Proceedings of ALTA 2018

arXiv:1808.04800 [pdf, other]

Classifier Ensembles for Dialect and Language Variety Identification

Authors: Liviu P. Dinu, Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi

Abstract: In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018. We present a system developed to discriminate between Flemish and Dutch in subtitles and a system trained to discriminate between four Arabic dialects: Egyptian, Levantine, Gulf, North African, and Modern Standa… ▽ More In this paper we present ensemble-based systems for dialect and language variety identification using the datasets made available by the organizers of the VarDial Evaluation Campaign 2018. We present a system developed to discriminate between Flemish and Dutch in subtitles and a system trained to discriminate between four Arabic dialects: Egyptian, Levantine, Gulf, North African, and Modern Standard Arabic in speech broadcasts. Finally, we compare the performance of these two systems with the other systems submitted to the Discriminating between Dutch and Flemish in Subtitles (DFS) and the Arabic Dialect Identification (ADI) shared tasks at VarDial 2018. △ Less

Submitted 14 August, 2018; originally announced August 2018.

arXiv:1807.08230 [pdf, other]

German Dialect Identification Using Classifier Ensembles

Authors: Alina Maria Ciobanu, Shervin Malmasi, Liviu P. Dinu

Abstract: In this paper we present the GDI_classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transc… ▽ More In this paper we present the GDI_classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transcripts included in the dataset contained speakers from Basel, Bern, Lucerne, and Zurich. Our entry in the challenge reached 62.03% F1-score and was ranked third out of eight teams. △ Less

Submitted 21 July, 2018; originally announced July 2018.

arXiv:1807.03108 [pdf, other]

Discriminating between Indo-Aryan Languages Using SVM Ensembles

Authors: Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Santanu Pal, Liviu P. Dinu

Abstract: In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. We investigate the performance of individual features and combine the output of single classifiers to maximize performance. The system competed in the Indo-Aryan Language Identification (I… ▽ More In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. We investigate the performance of individual features and combine the output of single classifiers to maximize performance. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, scored 88:95% F1 score and it was ranked 3rd out of 8 teams. △ Less

Submitted 9 July, 2018; originally announced July 2018.

Comments: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects

arXiv:1804.11346 [pdf, other]

A Portuguese Native Language Identification Dataset

Authors: Iria del Río, Marcos Zampieri, Shervin Malmasi

Abstract: In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, D… ▽ More In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI. △ Less

Submitted 30 April, 2018; originally announced April 2018.

Comments: Proceedings of The 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)

arXiv:1804.09132 [pdf, other]

A Report on the Complex Word Identification Shared Task 2018

Authors: Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo H. Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, Marcos Zampieri

Abstract: We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT'2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification… ▽ More We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT'2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings. △ Less

Submitted 24 April, 2018; originally announced April 2018.

Comments: Second CWI Shared Task co-located with the BEA Workshop 2018 at NAACL-HLT in New Orleans, USA

arXiv:1803.05495 [pdf, other]

Challenges in Discriminating Profanity from Hate Speech

Authors: Shervin Malmasi, Marcos Zampieri

Abstract: In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically for this task, we employ supervised classification along with a set of features that includes n-grams, skip-grams and clustering-based word representations. We apply approaches based on single classifi… ▽ More In this study we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically for this task, we employ supervised classification along with a set of features that includes n-grams, skip-grams and clustering-based word representations. We apply approaches based on single classifiers as well as more advanced ensemble classifiers and stacked generalization, achieving the best result of 80% accuracy for this 3-class classification task. Analysis of the results reveals that discriminating hate speech and profanity is not a simple task, which may require features that capture a deeper understanding of the text not always possible with surface n-grams. The variability of gold labels in the annotated data, due to differences in the subjective adjudications of the annotators, is also an issue. Other directions for future work are discussed. △ Less

Submitted 14 March, 2018; originally announced March 2018.

Journal ref: Shervin Malmasi, Marcos Zampieri (2018) Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence. Volume 30, Issue 2, pp. 187-202. Taylor & Francis

arXiv:1712.06427 [pdf, other]

Detecting Hate Speech in Social Media

Authors: Shervin Malmasi, Marcos Zampieri

Abstract: In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in… ▽ More In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in identifying posts across three classes. Results demonstrate that the main challenge lies in discriminating profanity and hate speech from each other. A number of directions for future work are discussed. △ Less

Submitted 26 December, 2017; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: Proceedings of Recent Advances in Natural Language Processing (RANLP). pp. 467-472. Varna, Bulgaria

arXiv:1710.09306 [pdf, other]

Exploring the Use of Text Classification in the Legal Domain

Authors: Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P. Dinu, Josef van Genabith

Abstract: In this paper, we investigate the application of text classification methods to support law professionals. We present several experiments applying machine learning techniques to predict with high accuracy the ruling of the French Supreme Court and the law area to which a case belongs to. We also investigate the influence of the time period in which a ruling was made on the form of the case descrip… ▽ More In this paper, we investigate the application of text classification methods to support law professionals. We present several experiments applying machine learning techniques to predict with high accuracy the ruling of the French Supreme Court and the law area to which a case belongs to. We also investigate the influence of the time period in which a ruling was made on the form of the case description and the extent to which we need to mask information in a full case ruling to automatically obtain training and test data that resembles case descriptions. We developed a mean probability ensemble system combining the output of multiple SVM classifiers. We report results of 98% average F1 score in predicting a case ruling, 96% F1 score for predicting the law area of a case, and 87.07% F1 score on estimating the date of a ruling. △ Less

Submitted 25 October, 2017; originally announced October 2017.

Comments: Proceedings of the 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL)

arXiv:1710.04989 [pdf, other]

Complex Word Identification: Challenges in Data Annotation and System Performance

Authors: Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, Lucia Specia

Abstract: This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed p… ▽ More This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed. △ Less

Submitted 13 October, 2017; originally announced October 2017.

Comments: Proceedings of the 4th Workshop on NLP Techniques for Educational Applications (NLPTEA 2017)

arXiv:1707.04817 [pdf, ps, other]

Open-Set Language Identification

Authors: Shervin Malmasi

Abstract: We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One- Class Support Vector Machine using only a monolingual corpus for each… ▽ More We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One- Class Support Vector Machine using only a monolingual corpus for each language. Each model is evaluated against a test set of data from all 10 languages and we achieve an average F-score of 0.99, highlighting the effectiveness of this approach for open-set language identification. △ Less

Submitted 16 July, 2017; originally announced July 2017.

arXiv:1707.00621 [pdf, ps, other]

Including Dialects and Language Varieties in Author Profiling

Authors: Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, Liviu P. Dinu

Abstract: This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word $n$-grams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender iden… ▽ More This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word $n$-grams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender identification on tweets written in four languages and 97% accuracy on language variety identification for Portuguese. △ Less

Submitted 3 July, 2017; originally announced July 2017.

Comments: Proceedings of PAN at CLEF 2017

arXiv:1703.06541 [pdf, other]

Native Language Identification using Stacked Generalization

Authors: Shervin Malmasi, Mark Dras

Abstract: Ensemble methods using multiple classifiers have proven to be the most successful approach for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of exp… ▽ More Ensemble methods using multiple classifiers have proven to be the most successful approach for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on three datasets from different languages. We also present the first use of statistical significance testing for comparing NLI systems, showing that our results are significantly better than the previous state of the art. We make available a collection of test set predictions to facilitate future statistical tests. △ Less

Submitted 19 March, 2017; originally announced March 2017.

arXiv:1610.00031 [pdf, other]

Discriminating Similar Languages: Evaluations and Explorations

Authors: Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri

Abstract: We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ense… ▽ More We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation. △ Less

Submitted 30 September, 2016; originally announced October 2016.

Comments: Proceedings of Language Resources and Evaluation (LREC)

Journal ref: Proceedings of Language Resources and Evaluation (LREC). Portoroz, Slovenia. pp 1800-1807 (2016)

arXiv:1610.00030 [pdf, other]

Modeling Language Change in Historical Corpora: The Case of Portuguese

Authors: Marcos Zampieri, Shervin Malmasi, Mark Dras

Abstract: This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. W… ▽ More This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change. △ Less

Submitted 30 September, 2016; originally announced October 2016.

Comments: Proceedings of Language Resources and Evaluation (LREC)

Journal ref: Proceedings of Language Resources and Evaluation (LREC). Portoroz, Slovenia. pp. 4098-4104 (2016)

Showing 1–36 of 36 results for author: Malmasi, S