2023
Reflection of Demographic Background on Word Usage
Aparna Garimella, Carmen Banea, Rada Mihalcea
Computational Linguistics, Volume 49, Issue 2 - June 2023
The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches to encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usage, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors of word usage.
2020
“Judge me by my size (noun), do you?” YodaLib: A Demographic-Aware Humor Generation Framework
Aparna Garimella, Carmen Banea, Nabil Hossain, Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics
The subjective nature of humor makes computerized humor generation a challenging task. We propose an automatic humor generation framework for filling the blanks in Mad Libs® stories, while accounting for the demographic backgrounds of the desired audience. We collect a dataset consisting of such stories, which are filled in and judged by carefully selected workers on Amazon Mechanical Turk. We build upon the BERT platform to predict location-biased word fillings in incomplete sentences, and we fine-tune BERT to classify location-specific humor in a sentence. We leverage these components to produce YodaLib, a fully automated Mad Libs-style humor generation framework, which selects and ranks appropriate candidate words and sentences in order to generate a coherent and funny story tailored to certain demographics. Our experimental results indicate that YodaLib outperforms a previous semi-automated approach proposed for this task, while also surpassing human annotators in both qualitative and quantitative analyses.
Building Location Embeddings from Physical Trajectories and Textual Representations
Laura Biester, Carmen Banea, Rada Mihalcea
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Word embedding methods have become the de facto way to represent words, having been successfully applied to a wide array of natural language processing tasks. In this paper, we explore the hypothesis that embedding methods can also be effectively used to represent spatial locations. Using a new dataset consisting of the location trajectories of 729 students over a seven-month period and text data related to those locations, we implement several strategies to create location embeddings, which we then use to create embeddings of the sequences of locations a student has visited. To identify the surface-level properties captured in the representations, we propose a number of probing tasks such as the presence of a specific location in a sequence or the type of activities that take place at a location. We then leverage the representations we generated and employ them in more complex downstream tasks ranging from predicting a student’s area of study to a student’s depression level, showing the effectiveness of these location embeddings.
2019
Women’s Syntactic Resilience and Men’s Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing
Aparna Garimella, Carmen Banea, Dirk Hovy, Rada Mihalcea
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Several linguistic studies have shown the prevalence of various lexical and grammatical patterns in texts authored by a person of a particular gender, but models for part-of-speech tagging and dependency parsing have still not adapted to account for these differences. To address this, we annotate the Wall Street Journal part of the Penn Treebank with the gender information of the articles’ authors, and build taggers and parsers trained on this data that show performance differences in text written by men and women. Further analyses reveal numerous part-of-speech tags and syntactic relations whose prediction performances benefit from the prevalence of a specific gender in the training data. The results underscore the importance of accounting for gendered differences in syntactic tasks, and outline future avenues for developing more accurate taggers and parsers. We release our data to the research community.
2017
Demographic-aware word associations
Aparna Garimella, Carmen Banea, Rada Mihalcea
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Variations of word associations across different groups of people can provide insights into people’s psychologies and their world views. To capture these variations, we introduce the task of demographic-aware word associations. We build a new gold standard dataset consisting of word association responses for approximately 300 stimulus words, collected from more than 800 respondents of different genders (male/female) and from different locations (India/United States), and show that there are significant variations in the word associations made by these groups. We also introduce a new demographic-aware word association model based on a neural net skip-gram architecture, and show how computational methods for measuring word associations that specifically account for writer demographics can outperform generic methods that are agnostic to such information.
2016
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce Wiebe
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Building a Dataset for Possessions Identification in Text
Carmen Banea, Xi Chen, Rada Mihalcea
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Just as industrialization matured from mass production to customization and personalization, so has the Web migrated from generic content to public disclosures of one’s most intimately held thoughts, opinions and beliefs. This relatively new type of data is able to represent finer and more narrowly defined demographic slices. While researchers have until now primarily focused on leveraging personalized content to identify latent information such as gender, nationality, location, or age of the author, this study seeks to establish a structured way of extracting possessions, or items that people own or are entitled to, as a way to ultimately provide insights into people’s behaviors and characteristics. In order to promote more research in this area, we are releasing a set of 798 possessions extracted from the blog genre, where possessions are marked at different confidence levels, as well as a detailed set of guidelines to help in future annotation studies.
2015
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
SimCompass: Using Deep Learning Word Embeddings to Assess Cross-level Similarity
Carmen Banea, Di Chen, Rada Mihalcea, Claire Cardie, Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
2013
CPN-CORE: A Text Semantic Similarity System Infused with Opinion Knowledge
Carmen Banea, Yoonjung Choi, Lingjia Deng, Samer Hassan, Michael Mohler, Bishan Yang, Claire Cardie, Rada Mihalcea, Jan Wiebe
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
2012
Multilingual Subjectivity and Sentiment Analysis
Rada Mihalcea, Carmen Banea, Janyce Wiebe
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Learning Sentiment Lexicons in Spanish
Verónica Pérez-Rosas, Carmen Banea, Rada Mihalcea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper we present a framework to derive sentiment lexicons in a target language by using manually or automatically annotated data available in a language rich in electronic resources, such as English. We show that bridging the language gap using the multilingual sense-level aligned WordNet structure allows us to generate a high-accuracy (90%) polarity lexicon comprising 1,347 entries, and a disjoint lower-accuracy (74%) one encompassing 2,496 words. By using an LSA-based vectorial expansion for the generated lexicons, we are able to obtain an average F-measure of 66% in the target language. This implies that the lexicons could be used to bootstrap higher-coverage lexicons using in-language resources.
Measuring Semantic Relatedness using Multilingual Representations
Samer Hassan, Carmen Banea, Rada Mihalcea
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
UNT: A Supervised Synergistic Approach to Semantic Text Similarity
Carmen Banea, Samer Hassan, Michael Mohler, Rada Mihalcea
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
2011
Word Sense Disambiguation with Multilingual Features
Carmen Banea, Rada Mihalcea
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)
Sense-level Subjectivity in a Multilingual Setting
Carmen Banea, Rada Mihalcea, Janyce Wiebe
Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011)
2010
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing
Carmen Banea, Alessandro Moschitti, Swapna Somasundaran, Fabio Massimo Zanzotto
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing
Multilingual Subjectivity: Are More Languages Better?
Carmen Banea, Rada Mihalcea, Janyce Wiebe
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
2008
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea, Janyce Wiebe
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper introduces a method for creating a subjectivity lexicon for languages with scarce resources. The method is able to build a subjectivity lexicon by using a small seed set of subjective words, an online dictionary, and a small raw corpus, coupled with a bootstrapping process that ranks new candidate words based on a similarity measure. Experiments performed with a rule-based sentence-level subjectivity classifier show an 18% absolute improvement in F-measure as compared to previously proposed semi-supervised methods.
Multilingual Subjectivity Analysis Using Machine Translation
Carmen Banea, Rada Mihalcea, Janyce Wiebe, Samer Hassan
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing
2007
UNT: SubFinder: Combining Knowledge Sources for Automatic Lexical Substitution
Samer Hassan, Andras Csomai, Carmen Banea, Ravi Sinha, Rada Mihalcea
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
Learning Multilingual Subjective Language via Cross-Lingual Projections
Rada Mihalcea, Carmen Banea, Janyce Wiebe
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics
2006
Random-Walk Term Weighting for Improved Text Classification
Samer Hassan, Carmen Banea
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing