Multi-aspect Sentiment Analysis with Topic Models
Bin Lu∗†‡§ , Myle Ott†§ , Claire Cardie† and Benjamin K. Tsou‡∗
∗ Department of Chinese, Translation and Linguistics, City University of Hong Kong, Hong Kong
† Department of Computer Science, Cornell University, Ithaca, NY, USA
‡ Research Centre on Linguistics and Language Information Sciences, Hong Kong Institute of Education, Hong Kong
lubin2010@gmail.com, {myleott,cardie}@cs.cornell.edu, btsou99@gmail.com
Abstract—We investigate the efficacy of topic model based
approaches to two multi-aspect sentiment analysis tasks: multi-aspect sentence labeling and multi-aspect rating prediction. For
sentence labeling, we propose a weakly-supervised approach
that utilizes only minimal prior knowledge—in the form of
seed words—to enforce a direct correspondence between topics
and aspects. This correspondence is used to label sentences
with performance that approaches a fully supervised baseline.
For multi-aspect rating prediction, we find that overall ratings
can be used in conjunction with our sentence labelings to
achieve reasonable performance compared to a fully supervised
baseline. When gold-standard aspect-ratings are available, we
find that topic model based features can be used to improve
unsophisticated supervised baseline performance, in agreement
with previous multi-aspect rating prediction work. This improvement is diminished, however, when topic model features
are paired with a more competitive supervised baseline—a
finding not acknowledged in previous work.
Keywords-multi-aspect sentiment analysis; topic modeling
I. INTRODUCTION
The ever-increasing popularity of websites that feature user-generated opinions (e.g., TripAdvisor.com and
Yelp.com) has led to an abundance of customer reviews that
are often too numerous for a user to read. Consequently,
there is a growing need for systems that are able to automatically extract, evaluate and present opinions in ways that
are both helpful and easy for a user to interpret.
Early approaches to this problem [1]–[4] have focused
on determining either the overall polarity (i.e., positive or
negative) or the sentiment rating (e.g., one-to-five stars) of
a review. However, only considering coarse overall ratings
fails to adequately represent the multiple potential dimensions on which an entity can be reviewed. For example,
while the following review from OpenTable.com might
express an overall sentiment rating of 3-stars, it additionally
expresses a positive opinion toward the restaurant’s food, as
well as negative opinions toward the restaurant’s ambiance
and service:
“The food was very good, but it took over half an
hour to be seated, ... and the service was terrible.
The room was very noisy and cold wind blew in
from a curtain next to our table. Desserts were
very good, but because of [the] poor service, I’m
not sure we’ll ever go back!”
Looking beyond just overall ratings is important for users,
too, because they are likely to differ in how much value they
ascribe to each of these distinct aspects. For example, while
a gourmand may forgive a restaurant’s poor ambiance, they
may be uncompromising when it comes to food quality. Accordingly, a new branch of sentiment analysis has emerged,
called MULTI-ASPECT SENTIMENT ANALYSIS, that aims to
take into account these various, potentially related aspects
often discussed within a single review.
Recently, several topic modeling approaches based on
Latent Dirichlet Allocation (LDA) [5] have been proposed
for multi-aspect sentiment analysis tasks [6]–[8]. These
approaches use variations of LDA to uncover latent topics in
a document collection, with the hopes that these topics will
correspond to rateable aspects for the entity under review.
In this work, we investigate the role of several unsupervised and weakly supervised topic modeling approaches
to two popular multi-aspect sentiment analysis tasks: (1)
MULTI-ASPECT SENTENCE LABELING, where each sentence
in a review is labeled according to the aspects it discusses
(see Section III-A), and (2) MULTI-ASPECT RATING PREDICTION, the goal of which is to predict implicit aspect-specific star ratings for each review (see Section III-B).
For multi-aspect sentence labeling, we propose a weakly
supervised topic modeling approach (see Section III-A1) that
uses minimal prior knowledge in the form of seed words
to encourage a correspondence between topics and ratable
aspects. We find that these models generally perform quite
well (see Section VI-A), and that the best of these models
performs comparably to a supervised approach.
For multi-aspect rating prediction, we consider two settings. In the first, we assume that aspect-ratings are unavailable, but find (in Section VI-B) that by leveraging overall
ratings in conjunction with our multi-aspect sentence labeling approach, we can produce significant improvements over
an aspect-blind baseline. In our second setting, we use gold-standard aspect-ratings to train supervised classifiers both
with and without topic model based features. We find (in
Section VI-C) that these additional features improve performance over an online supervised baseline (Perceptron Rank).
However, this improvement is diminished when a more
competitive supervised baseline is used instead (Support Vector Regression), a finding not previously acknowledged.
§ The first two authors are listed in alphabetical order.
For both tasks, we examine and compare four types of
topic models (see Section IV): LDA, Local LDA [6], Multi-Grain LDA [7], and Segmented Topic Models (STM), a
recently proposed [9] topic model that, to date, has not been
applied to sentiment analysis tasks.
Lastly, we perform our experiments using three datasets
(see Section V-A) from two domains (hotel and restaurant
reviews). Specifically, we evaluate on data from
CitySearch, OpenTable, and TripAdvisor.
II. RELATED WORK
While sentiment analysis has been studied extensively for
some time [10], most approaches have focused on document-level overall sentiment. Recently, there has been a growing
interest in sentiment analysis at finer levels of granularity,
and specifically approaches that take into account the multi-aspect nature of many sentiment analysis tasks.
A. Multi-aspect Sentiment Analysis
Early multi-aspect work focused on creating aspect-based
review summaries using mined product features [11]–[13].
More recent work [14], [15] has also begun modeling
implicit aspects. For example, [16] develop an aspect-based
review summarization system that extracts and aggregates
aspects and their corresponding sentiments.
Recent work has also begun to look at multi-aspect rating
prediction. [17] present the Good Grief algorithm, which
jointly learns ranking models for individual aspects using
an online Perceptron Rank (PRank) [18] algorithm. [19] and
[20] bootstrap aspect terms with seed words for unsupervised
multi-aspect opinion polling and probabilistic rating regression, respectively. [21] integrate a document-level HMM
model to improve both multi-aspect rating prediction and
aspect-based sentiment summarization.
B. Multi-aspect Topic Models
While early generative approaches to sentiment analysis
tasks focused only on latent topics [22]–[24], recent work
has begun to additionally model multiple aspects present in a
single document. For example, [7] present Multi-grain LDA
(MG-LDA), in which review-specific elements and ratable
aspects are modeled by global and local topics, respectively.
[6] introduce Local-LDA, a sentence-level LDA that discovers ratable aspects in reviews. [8] present MaxEnt-LDA, a
maximum entropy hybrid model that discovers both aspects
and aspect-specific opinion words.
However, the mapping between topics and aspects in these
models is still largely implicit, which can be burdensome
when working with different parameterizations or datasets.
[25] integrate ground-truth aspect ratings into MG-LDA to
force topics to correlate directly with aspects. However, their
approach requires gold-standard aspect ratings. In contrast,
in this work we both consider settings in which aspect
ratings are available (see Section III-B), and settings in
which they are unavailable (see Section III-A).
III. MULTI-ASPECT SENTIMENT ANALYSIS TASKS
A. Multi-aspect Sentence Labeling
The first phase of multi-aspect sentiment analysis is aspect
identification and mention extraction. This step identifies the
relevant aspects for a rated entity and extracts all textual
mentions associated with those aspects [25].
In this work, we consider a limited version of the aspect identification and mention extraction task, which we
call multi-aspect sentence labeling. In our limited setting,
we assume that aspects are fixed—e.g., food, service, and
ambiance for restaurant reviews—and that it is sufficient to
identify a single aspect for each sentence in a document.
In particular, we evaluate four topic models, weakly supervised with aspect-specific seed words (see Section III-A1),
and label each sentence according to its latent topic distribution. Formally, for each sentence $s$ and topic $k$, we calculate
the probability, $p^{s}_{k}$, of words in $s$ assigned to $k$, averaged over
$n$ samples, and use $\arg\max_k p^{s}_{k}$ as the label for $s$.
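The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sampler's output is available as one list of per-word topic ids per retained sample; the function name `label_sentence` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter


def label_sentence(samples, num_topics):
    """Return (best_topic, averaged_proportions) for a single sentence.

    samples: list of retained samples; each sample is a list holding the
    topic id assigned to every word of the sentence in that sample.
    """
    averaged = [0.0] * num_topics
    for assignment in samples:
        counts = Counter(assignment)
        for k in range(num_topics):
            # Proportion of the sentence's words assigned to topic k in this sample.
            averaged[k] += counts[k] / len(assignment)
    averaged = [p / len(samples) for p in averaged]
    # The sentence label is the topic with the largest averaged proportion.
    best = max(range(num_topics), key=lambda k: averaged[k])
    return best, averaged


# Toy input (assumed): three samples for a four-word sentence,
# with topic 0 standing for "food" and topic 1 for "service".
samples = [[0, 0, 1, 0], [0, 0, 0, 0], [0, 1, 1, 0]]
label, proportions = label_sentence(samples, num_topics=2)
```

Because each topic corresponds to a fixed aspect under the seed-word prior, the returned topic id can be read directly as an aspect label.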
1) Weak Supervision with Minimal Prior Knowledge: To
encourage topic models to learn latent topics that correlate
directly with aspects, we augment them with a weak supervision signal in the form of aspect-specific seed words. Rather
than directly using the seed words to do bootstrapping, as in
[19] and [20], we use them to define an asymmetric prior on
the word-topic distributions. This approach guides the latent
topic learning towards more coherent aspect-specific topics,
while also allowing us to utilize large-scale unlabeled data.
For example, we define our prior knowledge (seed words)
for the original LDA model as a conjugate Dirichlet prior
to the multinomial word-topic distributions $\phi$. By integrating with the symmetric smoothing prior $\beta$, we define a
combined conjugate prior for each seed word $w$ in $\phi \sim \mathrm{Dir}(\{\beta + C_w\}_{w \in V})$,
where $C_w$ can be interpreted as an
equivalent sample size; i.e., the impact of our asymmetric
prior is equivalent to adding $C_w$ pseudo counts to the
sufficient statistics of the topic to which $w$ belongs. When we
do not have prior knowledge for a word $w$, we set $C_w = 0$.
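As a rough illustration of this asymmetric prior, the sketch below builds per-topic prior weights from seed words and shows their effect on a smoothed word-topic estimate. The seed words come from Table I, and β = 0.1 and a pseudo count of 3,000 follow the values reported later in the paper; the function names, the tiny vocabulary, and the extra background topic are assumptions made for this example only.

```python
def build_seeded_prior(vocab, seed_words, num_topics, beta=0.1, pseudo_count=3000):
    """Return prior[k][w] = beta + C_w if w is a seed word of topic k, else beta."""
    prior = [{w: beta for w in vocab} for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for w in words:
            if w in prior[topic]:
                prior[topic][w] += pseudo_count  # C_w pseudo counts for this topic
    return prior


def smoothed_word_prob(word, topic, counts, prior):
    """Posterior-mean estimate of phi[topic][word] under the Dirichlet prior."""
    numerator = counts[topic].get(word, 0) + prior[topic][word]
    denominator = sum(counts[topic].values()) + sum(prior[topic].values())
    return numerator / denominator


# Hypothetical toy vocabulary; topic 0 = food, topic 1 = service, topic 2 = background.
vocab = ["food", "steak", "staff", "waiter", "noisy"]
seeds = {0: ["food", "steak"], 1: ["staff", "waiter"]}
prior = build_seeded_prior(vocab, seeds, num_topics=3)

# With no observed counts yet, the prior alone already ties "steak" to the food topic.
empty_counts = [{}, {}, {}]
p_food = smoothed_word_prob("steak", 0, empty_counts, prior)
p_service = smoothed_word_prob("steak", 1, empty_counts, prior)
```

Words without prior knowledge keep only the symmetric β, so the remaining vocabulary is still free to be assigned to any topic by the data.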
B. Multi-aspect Rating Prediction
The second phase of multi-aspect sentiment analysis is
multi-aspect rating prediction [7], [17], [20], [21]—in which
each aspect of a document is assigned polar (i.e., positive,
negative, neutral), numeric, or “star” (i.e., 1-5) ratings.
Specifically, we consider two settings: (1) multi-aspect
rating prediction with indirect supervision, and (2) supervised multi-aspect rating prediction. In (1), aspect ratings are
predicted based only on the text and overall rating of each
review. Specifically, we train a regression model on the given
overall ratings and, for each aspect, apply the model to the
corresponding aspect-labeled sentences (see Section III-A).
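The indirect-supervision idea in setting (1) can be sketched as follows. This is only an illustration under stated assumptions: a plain least-squares linear model trained by stochastic gradient descent stands in for the regression model, and the function names, learning rate, and toy reviews are hypothetical.

```python
def featurize(tokens):
    """Binary unigram presence features, represented as a set of words."""
    return set(tokens)


def train_overall_regressor(reviews, ratings, epochs=200, lr=0.05):
    """Fit bias + word weights to overall ratings with SGD on squared error."""
    weights = {}
    bias = sum(ratings) / len(ratings)
    for _ in range(epochs):
        for tokens, rating in zip(reviews, ratings):
            feats = featurize(tokens)
            error = bias + sum(weights.get(w, 0.0) for w in feats) - rating
            bias -= lr * error
            for w in feats:
                weights[w] = weights.get(w, 0.0) - lr * error
    return weights, bias


def predict_aspect_rating(weights, bias, labeled_sentences, aspect):
    """Apply the overall-rating model to only the sentences labeled with `aspect`."""
    tokens = [t for label, sent in labeled_sentences if label == aspect for t in sent]
    if not tokens:
        return None  # the aspect is not discussed in this review
    return bias + sum(weights.get(w, 0.0) for w in featurize(tokens))


# Toy training data (assumed): whole reviews paired with overall ratings only.
reviews = [["great", "food"], ["terrible", "service"],
           ["great", "service"], ["terrible", "food"]]
overall = [5, 1, 5, 1]
weights, bias = train_overall_regressor(reviews, overall)

# A new review whose sentences were already labeled with aspects.
labeled = [("food", ["great", "food"]), ("service", ["terrible", "service"])]
food_rating = predict_aspect_rating(weights, bias, labeled, "food")
service_rating = predict_aspect_rating(weights, bias, labeled, "service")
```

The model never sees an aspect rating during training; aspect-specific predictions arise only because it is applied to aspect-specific subsets of the text.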
In (2), the supervised multi-aspect rating prediction setting, we augment and compare standard supervised regression learners with features derived from unsupervised topic
models (without seed words). Following [7], we create
features based on the output of each topic model by concatenating standard n-gram features with their associated
sentence-level topic assignments, and then evaluate supervised classifiers trained on those features.
IV. TOPIC MODELS
In their most basic form, topic models exploit word co-occurrence information to capture latent topics in a corpus.
Approaches to both tasks described in Section III use these
latent topics to model multiple aspects within a document;
however, the quality of these topics varies depending on the
topic model used. In this work we consider four topic models,
described here. Graphical representations for each of these
models appear in Figure 1, in plate notation.
Figure 1. Plate notations for topic models described in Section IV: (a) LDA, (b) Local LDA, (c) MG-LDA, (d) STM.
1) LDA and Local LDA: The first two topic models
that we consider are based on Latent Dirichlet Allocation
(LDA) [5]. LDA is a probabilistic generative model in which
documents are represented as mixtures over latent topics.
Formally, LDA assumes that a corpus is generated according
to the following generative story:
• For each topic $k$:
  – Choose word-topic mixture: $\phi_k \sim \mathrm{Dir}(\beta)$
• For each document $d$:
  – Choose document topic proportions: $\theta_d \sim \mathrm{Dir}(\alpha)$
  – For each word $w$ in document $d$:
    ∗ Choose topic: $z_{d,w} \sim \theta_d$
    ∗ Choose word: $w \sim \phi_{z_{d,w}}$
While LDA can effectively model word co-occurrence
at the document level, [6] argue that review aspects are
more likely to be discovered from sentence-level word co-occurrence information. They propose Local LDA, in which
sentences are modeled as documents are in standard LDA.
2) Multi-grain LDA: In response to limitations of standard LDA for multi-aspect work, [7] propose Multi-Grain
LDA (MG-LDA). MG-LDA jointly models document-specific themes (global topics), and themes that are common
throughout the corpus and intended to correspond to ratable aspects (local topics). Additionally, while the distribution
over global topics is fixed for a given document (review),
local topic proportions are varied across the document
according to sentence-level sliding windows. Formally, each
document $d$ is generated as follows:
For each document $d$:
• Choose global topic proportions: $\theta^{gl} \sim \mathrm{Dir}(\alpha^{gl})$
• For each sliding window $v$ of size $T$:
  – Choose local topic proportions: $\theta^{loc}_{d,v} \sim \mathrm{Dir}(\alpha^{loc})$
  – Choose granularity mixture: $\pi_{d,v} \sim \mathrm{Beta}(\alpha^{mix})$
• For each sentence $s$:
  – Choose window proportions: $\psi_{d,s} \sim \mathrm{Dir}(\gamma)$
• For each word $w$ in sentence $s$ of document $d$:
  – Choose sliding window: $v_{d,w} \sim \psi_{d,s}$
  – Choose granularity: $r_{d,w} \sim \pi_{d,v_{d,w}}$
  – Choose topic: $z_{d,w} \sim \{\theta^{gl}, \theta^{loc}_{d,v}\}_{r_{d,w}}$
  – Choose word: $w \sim \phi^{r_{d,w}}_{z_{d,w}}$
When $T = 1$, MG-LDA generalizes to a combination of
standard and Local LDA, where $\alpha^{mix}$ regulates the tradeoff
between document- and sentence-level topic proportions.
3) Segmented Topic Model: Lastly, we introduce the
Segmented Topic Model (STM) [9], which jointly models
document- and sentence-level topic proportions using a two-parameter Poisson Dirichlet Process (PDP). Documents $d$ are
generated as follows:
• Choose document topic proportions: $\theta_d \sim \mathrm{Dir}(\alpha)$
• For each sentence $s$:
  – Choose topic proportions: $\theta_s \sim \mathrm{PDP}(\theta_d, a, b)$
• For each word $w$ in sentence $s$:
  – Choose topic: $z_{d,w} \sim \theta_s$
  – Choose word: $w \sim \phi_{z_{d,w}}$
STM can be considered an extension of Local LDA that
additionally considers document-level topic distributions induced from the individual sentence-level topic distributions.
4) Inference: While exact inference for the models just
presented is largely intractable [5], approximate techniques
such as variational inference or Gibbs sampling can be used
instead. Following [26], we use a collapsed Gibbs sampling
approach for inference.¹ The exact sampling algorithms are
excluded for brevity. We instead refer the reader to [26]
for the LDA and Local LDA sampler, [7] for the MG-LDA
sampler, and [9] for the STM sampler.
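For orientation, the sketch below shows a textbook collapsed Gibbs sampler for plain LDA, using a schedule of 1,000 iterations, a 500-iteration burn-in and a lag of 10 as default arguments. It is not the authors' implementation and does not cover MG-LDA or STM; the function name, the symmetric priors, and the toy corpus are assumptions for illustration.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.1,
              iterations=1000, burn_in=500, lag=10, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the retained topic assignments.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    z = []
    for d, doc in enumerate(docs):  # random initialisation of all assignments
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    kept = []
    for it in range(1, iterations + 1):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = z[d][i]  # remove the current assignment from the counts
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Unnormalised conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
        # Keep every lag-th state after the burn-in period.
        if it > burn_in and (it - burn_in) % lag == 0:
            kept.append([list(assignments) for assignments in z])
    return kept


# Toy corpus (assumed): three short documents over a four-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
samples = gibbs_lda(docs, num_topics=2, vocab_size=4)
```

With the default schedule the sampler retains 50 states. Passing individual sentences as `docs` gives the sentence-level variant in the spirit of Local LDA, and replacing the scalar `beta` with per-word, per-topic values would be one way to plug in a seed-word prior.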
V. EXPERIMENTAL SETUP
A. Dataset and Preprocessing
Tasks and models discussed in Section III and Section IV
are evaluated on three datasets. The first dataset contains
73,495 reviews and their associated overall, food, service,
and ambiance aspect ratings for all restaurants in the New
York/Tri-State area appearing on OpenTable.com, and is
used for our multi-aspect rating prediction task. After excluding reviews that were too short (< 50 words) or too
long (> 300 words), we were left with 29,596 reviews.
Since the OpenTable dataset does not contain gold-standard labeled sentences, we evaluate our multi-aspect sentence labeling performance on a second, annotated dataset
of 652 restaurant reviews from CitySearch.com, introduced
by [27]. Each sentence in this corpus has been manually
labeled with one or more of the following six aspects: food,
service, ambiance, price, anecdotes, or miscellaneous.
Finally, we evaluate multi-aspect rating prediction on
[20]’s TripAdvisor hotel review corpus. For each review,
this corpus contains an associated overall rating, as well
as ratings for 7 aspects: value, room, location, cleanliness,
check-in/front desk, service, and business services. After
removing reviews missing any of the first 6 aspect-ratings
and (as before) excluding reviews that were too short or too
long, we were left with 66,512 reviews.
Datasets were tokenized and sentence split using the
Stanford POS Tagger [28]. For topic models, we removed
singleton words, and stop words not appearing in the sentiment lexicon introduced by [29].
B. Supervised Classifiers for Multi-aspect Rating Prediction
We consider two supervised machine-learning approaches
to multi-aspect rating prediction. The first is linear ε-Support Vector Regression (SVR) [30]. We use the LIBSVM
toolkit [31] with default parameters.² The second is Perceptron Ranking (PRank) [18], an online ordinal regression
classifier that has been used in related work [7], [17], [19],
[32]. We use the implementation by [17].³
¹ In this work we sample all models for 1,000 iterations, with a 500-iteration burn-in and a sampling lag of 10.
² Pilot experiments suggest that these values give near-optimal performance compared to parameters fully tuned by grid search.
³ http://people.csail.mit.edu/bsnyder/naacl07/
Table I
SEED WORDS FOR RESTAURANT REVIEWS.
Aspect     Seed Words
food       food, chicken, beef, steak
service    service, staff, waiter, reservation
ambiance   ambiance, atmosphere, room, experience
price      price, value, quality, worth
Classifiers are trained on unit-normalized binary unigram⁴
presence features. We also experimented with raw and
normalized frequency counts and raw binary features, but
found that normalized binary features work best.
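One possible reading of "unit-normalized binary unigram presence features" is sketched below, assuming that unit normalization means scaling each review's binary vector to Euclidean length 1. The function name and the sparse dictionary representation are hypothetical.

```python
import math


def binary_unit_features(tokens):
    """Map a tokenized review to a sparse, L2-normalized binary unigram vector."""
    present = set(tokens)  # presence only: repeated words count once
    if not present:
        return {}
    scale = 1.0 / math.sqrt(len(present))
    return {word: scale for word in present}


features = binary_unit_features("the food was the best".split())
```

Under this reading, every present word receives the same weight, and longer reviews receive smaller per-word weights so that review length alone does not dominate the regression.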
Finally, pilot experiments suggest that the optimal number
of iterations for PRank is data-dependent, and can heavily
influence performance. Consequently, except where specified, the number of iterations for PRank is always tuned via
nested cross validation on the training set.
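To make the role of the iteration count concrete, here is a compact sketch of the PRank update rule of [18] as it is commonly described: one weight vector plus an ordered list of rank thresholds, updated only on mistakes. It is not the implementation by [17]; the function names and the one-dimensional toy data are assumptions for illustration.

```python
def prank_predict(w, b, x):
    """Return the smallest 1-based rank whose threshold exceeds the score w.x."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    for rank, threshold in enumerate(b, start=1):
        if score < threshold:
            return rank
    return len(b) + 1  # the last rank has an implicit threshold of +infinity


def prank_train(data, num_ranks, dim, iterations=1):
    """Run `iterations` passes of PRank over data = [(feature_vector, rank), ...]."""
    w = [0.0] * dim
    b = [0.0] * (num_ranks - 1)
    for _ in range(iterations):
        for x, y in data:
            if prank_predict(w, b, x) == y:
                continue  # updates happen only on mistakes
            score = sum(wi * xi for wi, xi in zip(w, x))
            taus = []
            for r in range(1, num_ranks):
                y_r = -1 if y <= r else 1  # desired side of threshold r
                taus.append(y_r if (score - b[r - 1]) * y_r <= 0 else 0)
            total = sum(taus)
            w = [wi + total * xi for wi, xi in zip(w, x)]
            b = [br - tau for br, tau in zip(b, taus)]
    return w, b


# Toy one-dimensional data (assumed) with three ordered ranks.
data = [([-1.0], 1), ([0.0], 2), ([1.0], 3)]
w, b = prank_train(data, num_ranks=3, dim=1, iterations=3)
predictions = [prank_predict(w, b, x) for x, _ in data]
```

The `iterations` argument is the quantity that would be tuned by nested cross-validation. Because the result depends on the order in which examples are visited, shuffling `data` can change the learned model.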
C. Topic Model Hyperparameters
Unless otherwise stated, topic model hyperparameters are
assigned the following values: α: 0.5 for STM, 0.1 for LDA,
Local LDA and MG-LDA (including $\alpha^{gl}$, $\alpha^{loc}$ and $\alpha^{mix}$);
β: 0.1 for all models; the window size $T$ for MG-LDA
is 3; and $a$ and $b$ for STM are 0.1 and 1, respectively.
The values for LDA and MG-LDA follow [7], [25], and
those for Local LDA follow [6]. Some experimentation
was performed with different hyperparameter choices, but
downstream performance was not significantly affected.
VI. RESULTS AND DISCUSSION
A. Multi-aspect Sentence Labeling
Topic models were weakly supervised using seed words
in Table I. The pseudo count $C_w$ for seed words was
heuristically set to 3,000 (∼1% of the number of reviews),
although we show in Section VI-A1 that performance is
robust to variations of this parameter. Assuming that the
majority of sentences are aspect-related, we set the number
of topics $K$ to 5, thereby allowing a single “background”
topic.⁵ We also tried other topic numbers in the range
[5, 30] with a step of 5; performance decreased with
increasing $K$ in most cases.⁶
⁴ While bigram and trigram features can be considered, unigram features
better highlight differences between competing topic models.
⁵ Note that the number of global topics for MG-LDA was set to 10.
⁶ While we restrict ourselves to only one set of seed words for each aspect,
it is also possible to enlarge the topic number K by providing more than
one set of seed words for the major aspects, such as food for the restaurant
domain, to reflect the reality that there could be many subtopics of major
aspects, such as the subtopics drink, bakery and main dishes shown in [6].
However, that strategy would involve more fine-tuning of seed words for
each subtopic, and is therefore left to future work.
For evaluation, we used all 1,490 singly-labeled sentences
from the annotated portion of the CitySearch corpus for
the three main aspects (food, service, and ambiance), following [6] and [8]. Because LDA, MG-LDA and STM are
document-level models, inference is performed on all 652
documents, and then performance is evaluated on the 1,490-sentence subset. Note that none of the OpenTable data is
labeled with respect to sentence-level aspects.
Results are given in terms of precision (P), recall (R),
and F1 score in Table II. The majority baseline labels all
sentences according to the most common aspect label, food.
As an upper bound, we also test a fully supervised SVM
classifier on the labeled data with 5-fold cross-validation.
Table II
MULTI-ASPECT SENTENCE LABELING RESULTS.
                      Food                   Service                Ambiance
           Accuracy   P      R      F1       P      R      F1       P      R      F1
Majority   0.595      0.595  1      0.746    0      0      0        0      0      0
LDA        0.477      0.646  0.554  0.597    0.469  0.494  0.481    0.126  0.179  0.148
MG-LDA     0.760      0.888  0.772  0.826    0.637  0.648  0.642    0.609  0.876  0.719
STM        0.794      0.954  0.776  0.856    0.674  0.759  0.714    0.611  0.908  0.731
Local LDA  0.803      0.969  0.775  0.861    0.731  0.810  0.768    0.573  0.892  0.698
SVM        0.830      0.814  0.975  0.887    0.874  0.670  0.759    0.860  0.538  0.662
We can see that weakly supervised topic models achieve
good performance on this task, and at best are comparable to
the supervised SVM classifier, confirming that adding prior
knowledge can encourage latent topics to correlate directly
with aspects. Among the topic models themselves, Local
LDA gives the highest accuracy and is also the best at
labeling food and service aspects; STM achieves similar
results and is the best performing topic model for the
ambiance aspect, followed by MG-LDA and LDA.
These results can be explained as follows. Since most
sentences usually focus on just one or two aspects, sentence-level word co-occurrence information is more appropriate
than document-level co-occurrences for studying aspects.
Indeed, while a review may talk about several aspects simultaneously, the document-level word co-occurrence may not
distinguish the individual aspects well from each
other. Through directly modeling the word co-occurrences
within sentences, Local LDA better captures aspect information, while standard LDA fails to differentiate between
words in different aspects, even given seed words.
While both STM and MG-LDA simultaneously model
document- and sentence-level word co-occurrences, the former indirectly models document-level co-occurrences via
sentence-level co-occurrences and a PDP prior. The latter,
MG-LDA, models both document- and sentence-level co-occurrences directly, and may therefore consider some
aspects to be global topics when they are in fact specific
to a type of restaurant, as mentioned in [7].
Table III
ENTITY-LEVEL MULTI-ASPECT RATING PREDICTION RESULTS FOR TRIPADVISOR DATA.
           L1 error   $\rho_{aspect}$   $\rho_{review}$   MAP@10
SVR Ovr    0.311      0                 0.800             0.429
LDA        0.645      −0.149            0.454             0.143
MG-LDA     0.400      0.407             0.622             0.129
STM        0.517      0.218             0.694             0.286
Local LDA  0.433      0.335             0.729             0.229
SVR        0.238      0.715             0.846             0.400
Figure 2. Influence of pseudo counts. (Plot of accuracy, on a 0.45 to 0.8 axis, against the pseudo count for seed words, from 500 to 5,000, for LDA, MG-LDA, STM and Local LDA.)
1) Influence of Pseudo Counts: We also examine the
influence of the seed-word pseudo-count parameter, $C_w$,
with results shown in Figure 2. We observe that performance
is reasonable across a variety of values of $C_w$, and the
relative ordering between models is stable. Notably, there is
a dramatic drop in performance for LDA at $C_w$ = 3,000. By
looking at the corresponding LDA topics, we found that with
large $C_w$, LDA separates the food aspect into two topics,
one focusing on main dishes (due to the seed words for
food) and the other focusing on dessert. This dramatically
decreases overall performance, since only a single label is
assigned to each sentence.
B. Multi-aspect Rating Prediction with Indirect Supervision
For multi-aspect rating prediction with indirect supervision, we assume that we only have access to overall ratings
in the training data, and no gold-standard aspect ratings. We
label sentences with aspects using weakly supervised topic
models on both the OpenTable and TripAdvisor datasets (see
Section III-B). Seed words for TripAdvisor come from [20].
For TripAdvisor, we also set $C_w$ = 6,000, and use $K$ = 8
topics (with 15 global topics for MG-LDA).
Because not all aspects are discussed in every review,
we chose to combine all reviews for each entity (hotel or
restaurant) into a single “super”-review. Ground-truth aspect
ratings are obtained by averaging the overall/aspect ratings
for each “super”-review. After excluding “super”-reviews
containing fewer than 10 reviews, we were left with 913
restaurants and 1,604 hotels.
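The construction of "super"-reviews can be sketched as follows, assuming each review is a dictionary holding an entity identifier, its text, and a mapping of rating names to values. These field names and the function name are hypothetical; the threshold of 10 reviews follows the text above.

```python
from collections import defaultdict


def build_super_reviews(reviews, min_reviews=10):
    """Merge all reviews of an entity and average their overall/aspect ratings."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review["entity"]].append(review)

    super_reviews = {}
    for entity, items in grouped.items():
        if len(items) < min_reviews:
            continue  # too few reviews to give a reliable average
        rating_names = items[0]["ratings"].keys()
        super_reviews[entity] = {
            "text": " ".join(item["text"] for item in items),
            "ratings": {
                name: sum(item["ratings"][name] for item in items) / len(items)
                for name in rating_names
            },
        }
    return super_reviews


# Toy input (assumed); a lower threshold is used so the example stays small.
reviews = [
    {"entity": "A", "text": "Great steak.", "ratings": {"overall": 5, "food": 4}},
    {"entity": "A", "text": "Slow service.", "ratings": {"overall": 3, "food": 2}},
    {"entity": "B", "text": "Nice room.", "ratings": {"overall": 4, "food": 4}},
]
merged = build_super_reviews(reviews, min_reviews=2)
```

Pooling text at the entity level makes it far more likely that every aspect is mentioned at least once, which is the motivation given for the "super"-review.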
Table IV
MULTI-ASPECT RATING PREDICTION RESULTS FOR RESTAURANT DATA.
Restaurant                  Food         Service      Ambiance
Greek Taverna - Glen Rock   4.19 (3.9)   3.31 (3.2)   3.9 (3.6)
Milonga Wine and Tapas      4.0 (4.1)    3.54 (3.1)   3.97 (3.7)
Equus Tavern                3.87 (3.8)   3.97 (4.1)   3.83 (3.6)
We then predict aspect ratings based on the aspect-labeled
sentences by using a support vector regression (SVR) model
trained on all combined vectors for each kind of entity
(hotel or restaurant) and their overall ratings. The baseline
approach always uses the predicted overall rating as aspect
ratings for each entity, called SVR Ovr. As an upper bound,
we also test a fully supervised SVR model (SVR) trained
with ground-truth aspect ratings. For both SVR Ovr and
SVR, we use 5-fold cross validation.
In addition to L1 error (absolute difference) [21], we use
three other metrics from [20]. The first metric is MAP@10,
which measures how well the predicted ratings keep the
top entities at the top. The other two metrics are $\rho_{aspect}$
and $\rho_{review}$, two averaged Pearson correlations
between the predicted and the ground-truth ratings: the first over all
aspects within each review, and the second for each aspect across all
entities. The former assesses whether the predicted ratings
give the correct preference order over the different aspects
within each review, e.g., the reviewer likes food more than
service. The latter measures how well the predicted aspect
ratings rank entities for each aspect, in order to answer
questions such as “which restaurant has the best food.”
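The L1 error and the two correlation metrics can be sketched as below. The input layout (entity, then aspect, then rating) and the function names are assumptions, and treating an undefined correlation (constant input) as 0 is also an assumption of this sketch rather than something stated in [20]. The toy ratings reuse the predicted and ground-truth values reported in Table IV.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant (assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def l1_error(pred, gold):
    """Mean absolute difference over all (entity, aspect) pairs."""
    diffs = [abs(pred[e][a] - gold[e][a]) for e in gold for a in gold[e]]
    return sum(diffs) / len(diffs)


def rho_aspect(pred, gold):
    """Average over entities of the correlation across that entity's aspects."""
    values = [pearson([pred[e][a] for a in gold[e]], [gold[e][a] for a in gold[e]])
              for e in gold]
    return sum(values) / len(values)


def rho_review(pred, gold):
    """Average over aspects of the correlation across entities."""
    aspects = list(next(iter(gold.values())))
    values = [pearson([pred[e][a] for e in gold], [gold[e][a] for e in gold])
              for a in aspects]
    return sum(values) / len(values)


# Predicted and ground-truth ratings for the three restaurants in Table IV.
predicted = {
    "Greek Taverna - Glen Rock": {"food": 4.19, "service": 3.31, "ambiance": 3.9},
    "Milonga Wine and Tapas": {"food": 4.0, "service": 3.54, "ambiance": 3.97},
    "Equus Tavern": {"food": 3.87, "service": 3.97, "ambiance": 3.83},
}
gold = {
    "Greek Taverna - Glen Rock": {"food": 3.9, "service": 3.2, "ambiance": 3.6},
    "Milonga Wine and Tapas": {"food": 4.1, "service": 3.1, "ambiance": 3.7},
    "Equus Tavern": {"food": 3.8, "service": 4.1, "ambiance": 3.6},
}
error = l1_error(predicted, gold)
within_entity = rho_aspect(predicted, gold)
across_entities = rho_review(predicted, gold)
```

A predictor that assigns the same value to every aspect of an entity has zero variance within that entity, so under the convention used here its within-entity correlation is 0.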
Due to space constraints, we only show averaged results
over all aspects for the hotel dataset in Table III. We observe
that, with the exception of LDA for $\rho_{aspect}$, topic models
provide positive correlations. MG-LDA and Local LDA
show a medium correlation (larger than 0.3) with the gold
standard on $\rho_{aspect}$, which means that even without access
to the ground-truth aspect ratings, we can still reasonably
predict the relative preference order over aspects by using
the aspect-labeled sentences given by the weakly supervised
topic models. Although STM and MG-LDA perform well on
some metrics, Local LDA is always among the top two
topic models on every metric. LDA performs the
worst. Not surprisingly, SVR, which has access to the
ground-truth aspect ratings, performs the best on three metrics,
while SVR Ovr does quite well on three metrics but cannot
provide information on $\rho_{aspect}$.
For qualitative evaluation, we select 3 restaurants with
the same overall rating of 3.7 (on average) but different
aspect ratings, and compare the predicted ratings given by
Local LDA. The prediction results are shown in Table IV
with ground-truth ratings in parentheses. We observe that
although all three restaurants have the same overall rating,
the aspect ratings are quite different: Greek Taverna - Glen
Rock and Milonga Wine and Tapas have higher ratings for
food, and Equus Tavern has better service. This kind of
detailed aspect information is important for users who have
different aspect preferences.
C. Supervised Multi-aspect Rating Prediction
We also evaluate multi-aspect rating prediction for each
classifier introduced in Section V-B, trained with and without
features derived from topic models introduced in Section IV,
in addition to baseline n-gram features (unigrams). Topic
models trained in this section do not make use of seed words.
Topic model features are created following [7]. For each
sentence $s$ and topic $k$, we calculate the proportion, $p^{s}_{k}$, of
words in $s$ assigned to $k$, averaged over 50 samples. We then
bucket the corpus-wide proportions as evenly as possible
into five buckets, such that $b^{s}_{k} \in \{1, 2, 3, 4, 5\}$ corresponds
to the bucket containing $p^{s}_{k}$. Then, since a sentence will
typically contain small proportions of many topics, we limit
our consideration to only the top-3 topics per sentence,
ordered by $p^{s}_{k}$, which we denote $k^{s*}_{1}$, $k^{s*}_{2}$ and $k^{s*}_{3}$. Finally,
for each word $w$ in sentence $s$, we construct three binary
features of the form $\langle w, k^{s*}_{i}, b^{s}_{k^{s*}_{i}} \rangle$ for $i \in \{1, 2, 3\}$.
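The feature construction just described can be sketched as follows. The bucketing here uses simple rank-based cut points as one way of splitting the corpus-wide proportions "as evenly as possible"; that choice, the function names, and the toy numbers are assumptions made for illustration.

```python
def bucket_boundaries(all_proportions, num_buckets=5):
    """Cut points that split the corpus-wide proportions into equal-sized buckets."""
    ordered = sorted(all_proportions)
    return [ordered[(i * len(ordered)) // num_buckets] for i in range(1, num_buckets)]


def bucket_of(proportion, boundaries):
    """1-based bucket index of a proportion given the cut points."""
    return 1 + sum(1 for cut in boundaries if proportion >= cut)


def topic_features(tokens, proportions, boundaries, top_n=3):
    """Binary features (word, topic, bucket) for the top_n topics of one sentence."""
    top_topics = sorted(range(len(proportions)),
                        key=lambda k: proportions[k], reverse=True)[:top_n]
    return {(word, k, bucket_of(proportions[k], boundaries))
            for word in tokens for k in top_topics}


# Toy corpus-wide proportions (assumed) used only to derive the bucket cut points.
corpus_proportions = [i / 10 for i in range(10)]
boundaries = bucket_boundaries(corpus_proportions)

# One sentence with averaged proportions over four topics.
features = topic_features(["good", "steak"], [0.6, 0.05, 0.25, 0.1], boundaries)
```

Each resulting triple is a distinct binary feature, so the same word can contribute different evidence depending on which topics dominate its sentence and how strongly.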
We report 5-fold cross-validated performance for each
method on subsets of the OpenTable and TripAdvisor data
introduced in Section V-A. For each dataset we select a
balanced (according to overall rating) random subset of
5,000 reviews. The remaining reviews are used to train the
unsupervised topic models.
Results appear in Table V. Interestingly, we find that
the PRank baseline performs worse than the SVR baseline
across all aspects and datasets. This is perhaps unsurprising,
since PRank was originally proposed for online learning,
and is very sensitive to both its parameterization and data
ordering. While more experiments are necessary, these results suggest that despite PRank’s recent popularity, it is
perhaps an ineffective baseline for aspect-rating prediction.
We also observe that adding features derived from topic
models can increase performance (albeit slightly) over even
a strong (SVR) baseline. However, in contrast to previous
work by [7], we find that the choice of topic model makes
little difference in this case. Indeed, LDA often outperforms
other more complicated models on this supervised task.
D. Further Discussion on Aspect-based Summarization
In addition to the concise aspect-based opinion summary
shown in Section VI-B, we can choose sentences from
reviews based on their aspects and rating scores to provide
aspect-based review summaries for a given entity. Since the
aspect label for each sentence has a probability, as mentioned
in Section III-A, we set a threshold to filter out unconfident
sentences for each aspect (e.g., 0.75). We predict the rating
of each sentence by using an SVR model trained on the
overall ratings of the 5,000 balanced restaurant reviews
mentioned in Section VI-C, and then we select the sentences
with the highest and lowest scores for each aspect.
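The selection step can be sketched as below, assuming each sentence already carries its aspect label, the label's confidence, and a predicted sentence-level score. The field names, function name, and all numeric values in the toy input are hypothetical; only the 0.75 threshold and the sentence texts come from the paper.

```python
def summarize(sentences, threshold=0.75):
    """Pick the highest- and lowest-scoring confident sentence for each aspect."""
    extremes = {}
    for sentence in sentences:
        if sentence["confidence"] < threshold:
            continue  # drop sentences whose aspect label is not confident enough
        best, worst = extremes.get(sentence["aspect"], (sentence, sentence))
        if sentence["score"] > best["score"]:
            best = sentence
        if sentence["score"] < worst["score"]:
            worst = sentence
        extremes[sentence["aspect"]] = (best, worst)
    return {aspect: {"+": best["text"], "-": worst["text"]}
            for aspect, (best, worst) in extremes.items()}


# Sentence texts are taken from Table VI; confidences and scores are made up.
sentences = [
    {"text": "The staff is professional and friendly.",
     "aspect": "service", "confidence": 0.9, "score": 4.6},
    {"text": "One entree was not even edible it was overcooked and dry.",
     "aspect": "food", "confidence": 0.8, "score": 1.4},
    {"text": "Atmosphere was great; full of energy and a great open bar area.",
     "aspect": "ambiance", "confidence": 0.95, "score": 4.8},
    {"text": "The place was too cramped as you feel like the restaurant management "
             "has squeezed too many tables in the seating area.",
     "aspect": "ambiance", "confidence": 0.85, "score": 1.9},
    {"text": "We went on a Tuesday.",
     "aspect": "ambiance", "confidence": 0.4, "score": 0.5},
]
summary = summarize(sentences)
```

When only one confident sentence exists for an aspect, this sketch returns it as both the positive and the negative example; a real system would probably also require the two extremes to fall on opposite sides of a neutral score.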
In Table VI, we show a sample aspect-based summary
(with ground-truth ratings in parentheses) generated in this
way for Mesa Grill, one of the most popular restaurants
on OpenTable.com. We can see that reviewers had different
experiences or preferences. For example, in terms of ambiance,
one user thinks that the restaurant is full of energy, while
another considers it too cramped. Such detailed summaries
could be helpful to both consumers and service providers.
Table V
SUPERVISED MULTI-ASPECT RATING PREDICTION RESULTS, WITH MODELS RUN TO GENERATE 15 TOPICS (45 GLOBAL TOPICS FOR MG-LDA).
RESULTS WERE SIMILAR ACROSS A VARIETY OF TOPIC NUMBER CHOICES.
                     OpenTable (L1 error)          TripAdvisor (L1 error)
Learner  Model       Over.  Food   Serv.  Amb.     Over.  Check.  Serv.  Value  Loc.   Rooms  Clean.
PRank    Baseline    0.798  0.821  1.052  1.071    0.687  0.818   0.856  0.946  0.828  0.932  0.900
         LDA         0.638  0.683  0.806  0.817    0.563  0.640   0.682  0.770  0.668  0.737  0.721
         Local LDA   0.650  0.703  0.815  0.841    0.569  0.657   0.685  0.761  0.680  0.757  0.716
         MG-LDA      0.650  0.707  0.812  0.841    0.554  0.656   0.685  0.767  0.672  0.764  0.722
         STM         0.642  0.686  0.812  0.838    0.574  0.647   0.689  0.750  0.679  0.754  0.723
SVR      Baseline    0.654  0.700  0.810  0.811    0.585  0.651   0.708  0.737  0.695  0.747  0.725
         LDA         0.637  0.679  0.790  0.781    0.560  0.628   0.667  0.738  0.663  0.732  0.709
         Local LDA   0.651  0.686  0.786  0.804    0.576  0.654   0.688  0.731  0.688  0.742  0.729
         MG-LDA      0.656  0.693  0.787  0.804    0.576  0.648   0.681  0.743  0.676  0.744  0.725
         STM         0.650  0.682  0.794  0.794    0.571  0.643   0.686  0.741  0.675  0.741  0.718
Table VI
ASPECT-BASED COMPARATIVE SUMMARY FOR MESA GRILL RESTAURANT.
Aspect                 Summary
Food 3.90 (3.69)       [+] The [food] is delicious, the grits are phenomenal and I love the breads they bring before
                       the meal with the pepper jelly.
                       [–] One entree was not even edible it was overcooked and dry.
Service 3.53 (3.87)    [+] The staff is professional and friendly.
                       [–] Our server hovered over us until we got our appetizers, trying to push more booze, but
                       then disappeared, and we had to wait for about half an hour between apps and main meal,
                       with no one coming over to check in with us about what was going on.
Ambiance 3.66 (3.71)   [+] Atmosphere was great; full of energy and a great open bar area.
                       [–] The place was too cramped as you feel like the restaurant management has squeezed too
                       many tables in the seating area.
VII. C ONCLUSION
We investigate the role of unsupervised and weakly supervised topic modeling approaches to multi-aspect sentiment
analysis. We show that weakly supervised topic models
perform quite well on multi-aspect sentence labeling tasks,
and can also be used to aid multi-aspect rating prediction
with only indirect supervision. In combination, they can
also support interesting applications for aspect-based review
summarization. Finally, we find that incorporating features
derived from unsupervised topic models provides substantial
increases in performance, but only for weak prediction
models like PRank. With a stronger model, like SVR, this
improvement is diminished.
ACKNOWLEDGMENT
We thank Shuo Chen, Long Jiang, Lillian Lee, Chenhao
Tan, Ainur Yessenalina, and Jingbo Zhu, as well as members
of the Cornell NLP seminar group and the anonymous
reviewers for useful comments and discussion. This work
was supported in part by National Science Foundation
Grants BCS-0904822, IIS-0968450, IIS-1111176; a gift
from Google; the Jack Kent Cooke Foundation; and a
HKSAR Research Grant Council Grant (No. 149607).
Rating
4.62
0.85
5.09
1.08
4.30
1.85
R EFERENCES
[1] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in
Proceedings of the ACL-02 conference on Empirical methods
in natural language processing-Volume 10. Association for
Computational Linguistics, 2002, pp. 79–86.
[2] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating
scales,” in Proceedings of the 43rd Annual Meeting on
Association for Computational Linguistics. Association for
Computational Linguistics, 2005, pp. 115–124.
[3] S. Baccianella, A. Esuli, and F. Sebastiani, “Multi-facet rating
of product reviews,” in Proceedings of the 31th European
Conference on IR Research on Advances in Information
Retrieval. Springer-Verlag, 2009, pp. 461–472.
[4] L. Qu, G. Ifrim, and G. Weikum, “The bag-of-opinions
method for review rating prediction from sparse text patterns,”
in Proceedings of the 23rd International Conference on
Computational Linguistics. Association for Computational
Linguistics, 2010, pp. 913–921.
[5] D. Blei, A. Ng, and M. Jordan, “Latent dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, pp. 993–1022,
2003.
[6] S. Brody and N. Elhadad, “An unsupervised aspect-sentiment
model for online reviews,” in Proceedings of ACL:HLT, 2010,
pp. 804–812.
[7] I. Titov and R. McDonald, “Modeling online reviews with
multi-grain topic models,” in Proceeding of the 17th international conference on World Wide Web. ACM, 2008, pp.
111–120.
[8] W. Zhao, J. Jiang, H. Yan, and X. Li, “Jointly modeling aspects and opinions with a maxent-lda hybrid,” in Proceedings
of the 2010 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, 2010, pp. 56–65.
[21] C. Sauper, A. Haghighi, and R. Barzilay, “Incorporating content structure into text analysis applications,” in Proceedings
of the 2010 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, 2010, pp. 377–387.
[9] L. Du, W. Buntine, and H. Jin, “A segmented topic model
based on the two-parameter poisson-dirichlet process,” Machine learning, vol. 81, no. 1, pp. 5–19, 2010.
[22] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, “Topic
sentiment mixture: modeling facets and opinions in weblogs,”
in Proceedings of the 16th international conference on World
Wide Web. ACM, 2007, pp. 171–180.
[10] B. Pang and L. Lee, “Opinion mining and sentiment analysis,”
Foundations and Trends in Information Retrieval, vol. 2, no.
1-2, pp. 1–135, 2008.
[11] M. Hu and B. Liu, “Mining opinion features in customer
reviews,” in Proceedings of the National Conference on
Artificial Intelligence. Menlo Park, CA; Cambridge, MA;
London; AAAI Press; MIT Press; 1999, 2004, pp. 755–760.
[12] ——, “Mining and summarizing customer reviews,” in Proceedings of the tenth ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2004, pp.
168–177.
[23] C. Lin and Y. He, “Joint sentiment/topic model for sentiment
analysis,” in Proceeding of the 18th ACM conference on
Information and knowledge management. ACM, 2009, pp.
375–384.
[24] Y. Lu, C. Zhai, and N. Sundaresan, “Rated aspect summarization of short comments,” in Proceedings of the 18th
international conference on World wide web. ACM, 2009,
pp. 131–140.
[25] I. Titov and R. McDonald, “A joint model of text and aspect
ratings for sentiment summarization,” Urbana, vol. 51, pp.
308–316, 2008.
[13] A. Popescu and O. Etzioni, “Extracting product features and
opinions from reviews,” in Proceedings of the conference on
Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, 2005, pp. 339–346.
[26] T. Griffiths and M. Steyvers, “Finding scientific topics,”
Proceedings of the National Academy of Sciences of the
United States of America, vol. 101, no. Suppl 1, p. 5228,
2004.
[14] L. Zhuang, F. Jing, and X. Zhu, “Movie review mining and
summarization,” in Proceedings of the 15th ACM international conference on Information and knowledge management. ACM, 2006, pp. 43–50.
[27] G. Ganu, N. Elhadad, and A. Marian, “Beyond the stars:
Improving rating predictions using review text content,” in
Proceedings of the 12th International Workshop on the Web
and Databases. Citeseer, 2009.
[15] K. Lerman, S. Blair-Goldensohn, and R. McDonald, “Sentiment summarization: Evaluating and learning user preferences,” in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009, pp.
514–522.
[28] K. Toutanova, D. Klein, C. Manning, and Y. Singer, “Featurerich part-of-speech tagging with a cyclic dependency network,” in Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003, pp. 173–180.
[16] S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon,
G. Reis, and J. Reynar, “Building a sentiment summarizer
for local service reviews,” in WWW Workshop on NLP in the
Information Explosion Era, 2008.
[29] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of the conference on Human Language Technology
and Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, 2005, pp. 347–
354.
[17] B. Snyder and R. Barzilay, “Multiple aspect ranking using the
good grief algorithm,” in Proceedings of NAACL HLT, 2007,
pp. 300–307.
[18] K. Crammer and Y. Singer, “Pranking with ranking,” in
Proceedings of NIPS, 2001, pp. 641–647.
[19] J. Zhu, H. Wang, B. Tsou, and M. Zhu, “Multi-aspect opinion
polling from textual reviews,” in Proceeding of the 18th ACM
Conference on information and Knowledge Management.
ACM, 2009, pp. 1799–1802.
[20] H. Wang, Y. Lu, and C. Zhai, “Latent aspect rating analysis
on review text data: a rating regression approach,” in Proceedings of the 16th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2010, pp.
783–792.
[30] A. Smola and B. Schölkopf, “A tutorial on support vector
regression,” Statistics and computing, vol. 14, no. 3, pp. 199–
222, 2004.
[31] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support
vector machines,” ACM Transactions on Intelligent Systems
and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
[32] N. Gupta, G. Di Fabbrizio, and P. Haffner, “Capturing the
stars: Predicting ratings for service and product reviews,” in
Proceedings of the NAACL HLT 2010 Workshop on Semantic
Search. Association for Computational Linguistics, 2010,
pp. 36–43.