
Towards Automatic Generation of Product Reviews From Aspect-Sentiment Scores



Hongyu Zang and Xiaojun Wan


Institute of Computer Science and Technology, Peking University
The MOE Key Laboratory of Computational Linguistics, Peking University
{zanghy, wanxiaojun}@pku.edu.cn

Proceedings of The 10th International Natural Language Generation conference, pages 168–177, Santiago de Compostela, Spain, September 4-7 2017. © 2017 Association for Computational Linguistics

Abstract

Data-to-text generation is essential in machine writing applications. Recent deep learning models, such as Recurrent Neural Networks (RNNs), have shown a bright future for related text generation tasks. However, little work has been done on the automatic generation of long reviews from user opinions. In this paper, we introduce a deep neural network model that generates long Chinese reviews from aspect-sentiment scores representing users' opinions. We conduct our study within the framework of encoder-decoder networks, and we propose a hierarchical structure with aligned attention in the Long Short-Term Memory (LSTM) decoder. Experiments show that our model outperforms retrieval-based baseline methods and also beats the sequential generation models in qualitative evaluations.

1 Introduction

Text generation is a central task in the NLP field. Progress in text generation will help a lot in building strong artificial intelligence (AI) that can comprehend and compose human languages.

Review generation is an interesting subtask of data-to-text generation. With more and more online trade, customers are often reluctant to make the effort to write reviews, while sellers want to benefit from good reviews. Review generation is therefore useful and worthy of study. But recent research on text generation mainly focuses on the generation of weather reports, financial news, sports news (Konstas, 2014; Kim et al., 2016; Zhang et al., 2016), and so on. The task of review generation still needs to be further explored.

Think about how we generate review texts: we usually have the sentiment polarities with respect to product aspects in mind before we speak or write. Inspired by this, we focus on the study of review generation from structured data, which consist of aspect-sentiment scores.

Traditional generation models are mainly based on rules, and handcrafting rules is time-consuming. Thanks to the quick development of neural networks and deep learning, text generation has achieved a breakthrough in recent years in many domains, e.g., image-to-text (Karpathy and Fei-Fei, 2015; Xu et al., 2015), video-to-text (Yu et al., 2016), and text-to-text (Sutskever et al., 2014; Li et al., 2015). More and more work shows that generation models with neural networks can generate meaningful and grammatical texts (Bahdanau et al., 2015; Sutskever et al., 2011). However, recent studies of text generation mainly focus on generating short, sentence-level texts. Handling long texts is still a challenge for modern sequential generation models, and very little work has been done on generating long reviews.

In this paper, we aim to address the challenging task of long review generation within the encoder-decoder neural network framework. Based on this framework, we investigate different models to generate review texts. In these models, the encoder is a Multi-Layer Perceptron (MLP) that embeds the input aspect-sentiment scores. The decoders are RNNs with LSTM units, but they differ in architecture. We propose a hierarchical generation model with a new attention mechanism, which shows better results than the other models in both automatic and manual evaluations on a real Chinese review dataset.

To the best of our knowledge, our work is the first attempt to generate long review texts from aspect-sentiment scores with neural network models. Experiments show that it is feasible to generate long product reviews with our model.

2 Problem Definition and Corpus

To give a better understanding of the task investigated in this study, we first introduce the corpus.

Without loss of generality, we use Chinese car reviews in this study; reviews in other domains can be processed and generated in the same way. The Chinese car reviews are crawled from the website AutoHome¹. Each review text contains eight sentences describing eight aspects², respectively: 空间/Space, 动力/Power, 控制/Control, 油耗/Fuel Consumption, 舒适度/Comfort, 外观/Appearance, 内饰/Interior, and 性价比/Price. Each review text corresponds to these eight aspects and the corresponding sentiment ratings, and the review sentences are aligned with the aspects and ratings, so we can split the whole review into eight sentences when needed. Note that the sentences in each review are correlated with each other, so if we regard them as independent sentences with respect to individual aspect-sentiment scores, they are likely to seem inconsistent and unconvincing when put together. We should keep each review text as a whole and generate the long and complete review at one time, rather than generating each review sentence independently. Specifically, we define our task as generating long Chinese car reviews from eight aspect-sentiment scores.

The raw data are badly formatted. To clean the data, we keep only the reviews that contain sentences for all eight aspects, and we skip the reviews whose sentences are too long or too short. We accept a length of 10 to 40 words per sentence. We use Jieba³ for Chinese word segmentation. Note that each review text contains eight sentences, where each sentence has 24 Chinese characters on average. The review texts in our corpus are therefore very long, about 195 Chinese characters per review.

The rating score for each aspect is in the range [1, 5]. We regard rating 3 as neutral and normalize ratings into [-1.0, 1.0] by Equation (1)⁴, so that the sign of a normalized rating gives the sentiment polarity. For instance, if the original ratings for all eight aspects are [1,2,3,4,5,4,3,2], we normalize them into [-1.0,-0.5,0.0,0.5,1.0,0.5,0.0,-0.5] and use the normalized vector as the input for review generation.

$$x' = \frac{x - \frac{Max + Min}{2}}{\frac{Max - Min}{2}} \tag{1}$$

Finally, we get 43060 pairs of aspect-sentiment vectors and corresponding review texts, in which there are 8340 different inputs⁵. We then split the data randomly into a training set and a test set. The training set contains 32195 pairs (about 75%) with 6290 different inputs, while the test set contains the remaining 10865 pairs with 2050 different inputs. The test set does not overlap with the training set with respect to the input aspect-sentiment vector.

Furthermore, we transform the input vector into aspect-oriented vectors as input for our models. For each aspect, we use an additional one-hot vector to represent the aspect, and then append the one-hot vector to the input vector. For example, if we are dealing with the aspect Power, corresponding to the one-hot vector [0,1,0,0,0,0,0,0], for the above review with input vector [-1.0,-0.5,0.0,0.5,1.0,0.5,0.0,-0.5], the new input vector with respect to this aspect is [-1.0,-0.5,0.0,0.5,1.0,0.5,0.0,-0.5,0,1,0,0,0,0,0,0]. Each new input vector is aligned with a review sentence. Similarly, we get eight new vectors with respect to the eight aspects as input for our models.

¹ www.autohome.com.cn
² In fact, there may be multiple grammatical sentences describing one single aspect. But for simplification, we define the sequence of characters describing the same aspect as a sequence.
³ github.com/fxsjy/jieba
⁴ We denote the original rating as $x$ and the normalized rating as $x'$. $Max$ and $Min$ are the maximum and minimum values of all the original ratings in the dataset, namely 5 and 1.
⁵ We allow multiple gold-standard answers to one input.
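
The input preparation described in Section 2 (rating normalization by Equation (1), followed by appending a one-hot aspect indicator) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the `ASPECTS` list are hypothetical and are not taken from the authors' code.

```python
# Aspect order as listed in Section 2 (the list name itself is an assumption).
ASPECTS = ["Space", "Power", "Control", "Fuel Consumption",
           "Comfort", "Appearance", "Interior", "Price"]


def normalize_rating(x, r_max=5, r_min=1):
    """Equation (1): map a rating in [r_min, r_max] to [-1.0, 1.0]."""
    midpoint = (r_max + r_min) / 2
    half_range = (r_max - r_min) / 2
    return (x - midpoint) / half_range


def normalize_ratings(ratings):
    """Normalize all eight aspect ratings of one review."""
    return [normalize_rating(x) for x in ratings]


def aspect_oriented_vectors(ratings):
    """Build the eight vectors [V_s, O], one per aspect.

    V_s is the normalized rating vector and O is a one-hot vector whose
    i-th element is 1 for the i-th aspect.
    """
    v_s = normalize_ratings(ratings)
    vectors = []
    for i in range(len(ASPECTS)):
        one_hot = [1 if j == i else 0 for j in range(len(ASPECTS))]
        vectors.append(v_s + one_hot)
    return vectors


if __name__ == "__main__":
    for name, vec in zip(ASPECTS, aspect_oriented_vectors([1, 2, 3, 4, 5, 4, 3, 2])):
        print(name, vec)
```

With the example ratings from the text, the second vector (aspect Power) reproduces the 16-dimensional vector given above.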

3 Preliminaries

In this section, we give a brief introduction to the LSTM network (Hochreiter and Schmidhuber, 1997).

3.1 RNN

RNNs have been widely used for sequence generation tasks (Graves, 2012a; Schuster and Paliwal, 1997). An RNN accepts a sequence of inputs $X = \{x_1, x_2, x_3, \ldots, x_{|X|}\}$ and computes $h_t$ at time $t$ according to Equation (2).

$$h_t = W_H \times \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \tag{2}$$

3.2 LSTM Network

An LSTM network contains LSTM units in an RNN. An LSTM unit is a recurrent network unit that excels at remembering values for either long or short durations of time (Graves, 2012b; Sundermeyer et al., 2012). It contains an input gate, a forget gate, an output gate and a memory cell. At time $t$, we denote these parts as $i_t$, $f_t$, $o_t$, $c_t$, respectively. In an LSTM network, we propagate according to Equations (3), (4) and (5).

$$\begin{bmatrix} i_t \\ f_t \\ o_t \end{bmatrix} = \mathrm{sigmoid}\left( \begin{bmatrix} W_I \\ W_F \\ W_O \end{bmatrix} \times \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \right) \tag{3}$$

$$c_t = i_t \times \tanh\left( W_C \times \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \right) + f_t \times c_{t-1} \tag{4}$$

$$h_t = c_t \times o_t \tag{5}$$

In the past few years, many generation models based on LSTM networks have given promising results in different domains (Xu et al., 2015; Shang et al., 2015; Wu et al., 2016). Compared to other RNN units, like GRU (Chung et al., 2014), LSTM is considered the best one in most cases.

4 Review Generation Models

4.1 Notations

We define our task as receiving a vector of aspect-sentiment scores $V_s$ and generating a review text, which is a long sequence of words $Y = \{y_1, y_2, \ldots, y_{|Y|-1}, \langle EOS \rangle\}$ ($\langle EOS \rangle$ is the special word representing the end of a sequence). As mentioned in Section 2, we also transform the input vector $V_s$ into a series of new input vectors $\{V_1, V_2, \ldots, V_8\}$ with respect to the eight aspects. More specifically, to obtain each $V_i$, we append a one-hot vector representing a specific aspect to $V_s$. That is, $V_i = [V_s, O]$, where $O$ is a one-hot vector of size eight in which only the $i$-th element is 1.

We have three different kinds of embeddings: $E^W$ stands for word embedding, $E^V$ stands for the embedding of the input vector by an MLP encoder, and $E^C$ stands for the embedding of context sentences. Subscripts specify the word, the vector, and the context.

In the LSTM, $h$ is a hidden vector, $x$ is an input vector, $P$ is the probability distribution, $y'$ is the predicted word, and $t$ is the time step.

4.2 Sequential Review Generation Models (SRGMs)

SRGMs are similar to the popular Seq2Seq models (Chung et al., 2014; Sutskever et al., 2011), except that they receive structured data (such as aspect-sentiment scores) as input and encode them with an MLP. The encoder's output $E^V_s$ is treated as the initial hidden state $h_0$ of the decoder, and the initial input vector is set to the word embedding of $\langle BOS \rangle$ ($\langle BOS \rangle$ is the special word representing the beginning of a sequence). The decoder then proceeds as a standard LSTM network.

At time $t$ ($t \geq 1$), the hidden state of the decoder $h_t$ is used to predict the distribution of words through a softmax layer. We choose the word with the highest probability as the word predicted at time $t$, and this word is used as the input of the decoder at time $t + 1$.

This procedure can be formulated as follows:

$$h_0 = E^V_s = \mathrm{MLP}(V_s) \tag{6}$$

$$x_1 = E^W_{\langle BOS \rangle} \tag{7}$$

$$h_t = \mathrm{LSTM}(h_{t-1}, x_t) \tag{8}$$

$$P_t = \mathrm{softmax}(h_t) \tag{9}$$

$$y'_t = \arg\max_w (P_{t,w}) \tag{10}$$
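
As a rough illustration, the LSTM unit of Equations (3)–(5) and the greedy decoding loop that starts with Equations (6)–(10) and feeds each predicted word back as the next input (Equation (11)) can be sketched in plain Python. Several details are assumptions made only to keep the sketch runnable: the initial memory cell is set to zero, a projection matrix `W_out` maps the hidden state to vocabulary logits before the softmax, no bias terms are used, and all function and parameter names are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(h_prev, c_prev, x, W_I, W_F, W_O, W_C):
    """One LSTM step as written in Equations (3)-(5), without bias terms."""
    hx = h_prev + x                                   # concatenation [h_{t-1}; x_t]
    i = [sigmoid(z) for z in matvec(W_I, hx)]         # Eq. (3), input gate
    f = [sigmoid(z) for z in matvec(W_F, hx)]         # Eq. (3), forget gate
    o = [sigmoid(z) for z in matvec(W_O, hx)]         # Eq. (3), output gate
    g = [math.tanh(z) for z in matvec(W_C, hx)]       # candidate values in Eq. (4)
    c = [i_k * g_k + f_k * c_k
         for i_k, g_k, f_k, c_k in zip(i, g, f, c_prev)]  # Eq. (4)
    h = [c_k * o_k for c_k, o_k in zip(c, o)]         # Eq. (5)
    return h, c


def softmax(logits):
    m = max(logits)                                   # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(h0, E_W, lstm_weights, W_out, bos_id, eos_id, max_len=50):
    """Greedy decoding: h0 is the encoder output of Eq. (6).

    E_W holds one embedding row per word id; lstm_weights is the tuple
    (W_I, W_F, W_O, W_C). W_out is an assumed output projection.
    """
    h, c = list(h0), [0.0] * len(h0)                  # c_0 = 0 is an assumption
    x = E_W[bos_id]                                   # Eq. (7)
    words = []
    for _ in range(max_len):
        h, c = lstm_step(h, c, x, *lstm_weights)      # Eq. (8)
        p = softmax(matvec(W_out, h))                 # Eq. (9), via assumed W_out
        y = max(range(len(p)), key=p.__getitem__)     # Eq. (10)
        words.append(y)
        if y == eos_id:
            break
        x = E_W[y]                                    # Eq. (11)
    return words
```

The loop stops either at the end-of-sequence word or after `max_len` steps; the latter cap is a practical safeguard added for the sketch.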

Figure 1: The architecture of SRGM-w.

$$x_{t+1} = E^W_{y'_t} \tag{11}$$

In each training step, we adopt the negative log-likelihood loss function.

$$Loss = -\frac{1}{|Y|} \sum_t \log P_{t, y_t} \tag{12}$$

However, Sutskever et al. (2014) and Pouget-Abadie et al. (2014) have shown that a standard LSTM decoder does not perform well in generating long sequences. Therefore, besides treating the review as a whole sequence, we also tried splitting the reviews into sentences, generating the sentences separately, and then concatenating the generated sentences. We name the sequential model generating the whole review SRGM-w, and the one generating separate sentences SRGM-s.

4.3 Hierarchical Review Generation Models (HRGMs)

Inspired by Li et al. (2015), we build a hierarchical LSTM decoder based on the SRGMs. Note that we have two different LSTM units in the hierarchical models: the superscript $S$ denotes the sentence-level LSTM, and the superscript $P$ denotes the paragraph-level one. $t$ is the time step in the sentence decoder, while $T$ is the time step in the paragraph decoder. Both time step symbols are written as subscripts.

A one-hidden-layer MLP encodes the input vector into $E^V_s$. $LSTM^P$ receives $E^V_s$ as its initial hidden state, and its initial input $x^P_1$ is a zero vector. At time $T$ ($T \geq 1$), the output of $LSTM^P$ is used as the initial hidden state of $LSTM^S$, and $LSTM^S$ then works just like the LSTM decoder in SRGMs. The final output of $LSTM^S$ is treated as the embedding of the context sentences $E^C_T$, which is also the input of $LSTM^P$ at time $T + 1$. We call this hierarchical model HRGM-o.

$$h^P_0 = E^V_s = \mathrm{MLP}(V_s) \tag{13}$$

$$x^P_1 = 0 \tag{14}$$

$$h^P_T = \mathrm{LSTM}^P(h^P_{T-1}, x^P_T) \tag{15}$$

$$h^S_{T,0} = h^P_T \tag{16}$$

$$h^S_{T,t} = \mathrm{LSTM}^S(h^S_{T,t-1}, x^S_{T,t}) \tag{17}$$

$$P_{T,t} = \mathrm{softmax}(h^S_{T,t}) \tag{18}$$

$$y'_{T,t} = \arg\max_w (P_{T,t,w}) \tag{19}$$

$$x^S_{T,t+1} = E^W_{y'_{T,t}} \tag{20}$$

$$x^P_{T+1} = E^C_T = h^S_{T,|Y_T|} \tag{21}$$

In the experimental results of HRGM-o, we find that the model has a drawback: in some test cases, the output texts miss some important parts of the input aspects.

Many previous studies have shown that the attention mechanism promises a better result by considering the context (Bahdanau et al., 2015; Fang et al., 2016; Li et al., 2015). We therefore apply attention to the generation of each sentence, aligned with the sentence's main aspect.

Different from the attention mechanisms in previous studies, in our situation we have the alignment relationships between aspect-sentiment ratings and sentences, which are natural attentions to be used in the generation process. By applying an additional input vector $V_T$ at each time step $T$, we obtain the initial hidden state of $LSTM^S$ from two source vectors, $E^V_T$ and $h^P_T$. We simply train a gate vector $g$ to control the two parts of information. The encoding of $V_T$ is similar to Equation (13), but with different parameters. In brief, we replace Equation (16) with Equations (22) and (23).

$$E^V_T = \mathrm{MLP}'(V_T) \tag{22}$$

$$h^S_{T,0} = \begin{bmatrix} h^P_T \\ E^V_T \end{bmatrix} \times [g, 1 - g] \tag{23}$$

Based on all of these, we propose a hierarchical model with a special aligned attention mechanism, as shown in Figure 2. We call the model HRGM-a.
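
The control flow of the hierarchical decoder can be sketched as follows. This is a hedged sketch rather than the authors' implementation: Equation (23) is read here as an element-wise gate, $g \odot h^P_T + (1 - g) \odot E^V_T$, which is an interpretation; the two LSTMs are passed in as plain callables; the paragraph-level input is assumed to have the same size as the paragraph-level hidden state; and all names are hypothetical.

```python
def gated_initial_state(h_p, e_v, g):
    """Eq. (23), read as an element-wise gate between h^P_T and E^V_T."""
    return [g_k * h_k + (1.0 - g_k) * e_k for h_k, e_k, g_k in zip(h_p, e_v, g)]


def generate_review(h_p0, aspect_embeddings, g, paragraph_step, decode_sentence):
    """Hierarchical decoding loop in the spirit of Equations (13)-(23).

    h_p0              -- encoder output used as h^P_0 (Eq. 13)
    aspect_embeddings -- one vector E^V_T per aspect (Eq. 22), already encoded
    g                 -- gate vector (learned in the paper, fixed here)
    paragraph_step    -- callable (h_p, x_p) -> h_p, standing in for LSTM^P (Eq. 15)
    decode_sentence   -- callable h_s0 -> (words, last_hidden), standing in for
                         the sentence-level decoder LSTM^S (Eq. 17-20)
    """
    h_p = list(h_p0)
    x_p = [0.0] * len(h_p0)                          # Eq. (14): zero initial input
    review = []
    for e_v in aspect_embeddings:                    # paragraph time steps T = 1, 2, ...
        h_p = paragraph_step(h_p, x_p)               # Eq. (15)
        h_s0 = gated_initial_state(h_p, e_v, g)      # Eq. (23) replaces Eq. (16)
        words, h_s_last = decode_sentence(h_s0)      # Eq. (17)-(20)
        review.append(words)
        x_p = h_s_last                               # Eq. (21): context embedding E^C_T
    return review
```

Setting every element of `g` to 1 recovers the HRGM-o initialization of Equation (16), since the aspect embedding is then ignored.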

Figure 2: The architecture of HRGM-a.

5 Experiments

5.1 Training Detail

We implemented our models with TensorFlow 1.10⁶ and trained them on an NVIDIA TITAN X GPU (12 GB).

Because of the limitations of our hardware, we experiment only with a one-layer encoder and a one-layer LSTM network. The batch size is 4 for HRGMs and 32 for SRGMs. The initial learning rate is set to 0.5, and we dynamically adjust the learning rate according to the loss value. As experiments show that the size of the hidden layer does not affect the results in any regular way, we set all hidden sizes to 500. All remaining parameters in our models are learned during training.

⁶ github.com/tensorflow/tensorflow/tree/r0.10

5.2 Baselines

Apart from SRGM-w and SRGM-s, we also developed several baselines for comparison.
• Rand-w: It randomly chooses a whole review from the training set.
• Rand-s: It randomly chooses a sentence for each aspect from the training set and concatenates the sentences to form a review.
• Cos: It finds the sentiment vector in the training set that has the largest cosine similarity with the input vector, and then returns the corresponding review text.
• Match: It finds the sentiment vector in the training set that has the maximum number of rating scores matching exactly with those in the input vector, and then returns the corresponding review text.
• Pick: It finds one sentence for each aspect in the training set by matching the same sentiment rating, and then concatenates the sentences to form a review.

Generally speaking, the models in this paper fall into four classes. The first class is the lower-bound methods (Rand-w, Rand-s), which choose something from the training set randomly. The second class is based on retrieval (Cos, Match, Pick), where we use similarity to decide what to choose. The third class is the sequential generation models based on RNNs (SRGM-w, SRGM-s). The last class is the hierarchical RNN models that handle whole-review generation (HRGM-o, HRGM-a).

5.3 Automatic Evaluation

We use the popular BLEU scores (Papineni et al., 2002) as evaluation metrics; BLEU has shown good consistency with human evaluation in many machine translation and text generation tasks. A high BLEU score means that many n-grams in the hypothesis texts match the gold-standard references. Here, we report BLEU-2 to BLEU-4 scores, and the evaluation is conducted after Chinese word segmentation. The only parameters in BLEU are the weights $W$ for the n-gram precisions. In this study, we use uniform weights ($W_i = \frac{1}{n}$ for BLEU-n evaluation). When there are multiple answers to the same input, we put all of them into the reference set of the input.

The results are shown in Table 1. Retrieval based

baselines get low BLEU scores in BLEU-2, BLEU-3 and BLEU-4. Among these models, Cos and Match even get lower BLEU scores than the lower-bound methods in some BLEU evaluations, which may be attributed to the sparsity of the data in the training set. Pick is better than the lower-bound methods in all of the BLEU evaluations. Compared to the retrieval-based baselines, SRGMs get higher scores in BLEU-2, BLEU-3, and BLEU-4. It is very promising that the hierarchical models obtain the highest BLEU scores, with HRGM-a best in every evaluation (HRGM-o falls below SRGM-s only on BLEU-4), which demonstrates the effectiveness of the hierarchical structures. Moreover, HRGM-a achieves better scores than HRGM-o, which verifies the helpfulness of our proposed new attention mechanism.

            BLEU-2    BLEU-3    BLEU-4
Rand-w      0.1307    0.0378    0.0117
Rand-s      0.1406    0.0412    0.0124
Cos         0.1342    0.0403    0.0129
Match       0.1358    0.0423    0.0136
Pick        0.1427    0.0434    0.0133
SRGM-w      0.1554    0.0713    0.0307
SRGM-s      0.1709    0.0829    0.0369
HRGM-o      0.1850    0.0854    0.0334
HRGM-a      0.1985    0.0942    0.0412

Table 1: The results of BLEU evaluations.

In all, the retrieval models and sequential generation models cannot handle long sequences well, but the hierarchical models can. The reviews generated by our models are of better quality according to the BLEU evaluations.

5.4 Human Evaluation

We also perform human evaluation to further compare these models. Human evaluation requires human judges to read all the results and give judgments with respect to different aspects of quality.

We randomly choose 50 different inputs from the test set. For each input, we compare the best model in each class, specifically Rand-s, Pick, SRGM-s, and HRGM-a, together with the Gold (gold-standard) answer. We employ three subjects (excluding the authors of this paper) who have good knowledge of the car review domain to evaluate the outputs of the models. The outputs are shuffled before being shown to the subjects. Without knowing which output belongs to which model, the subjects are required to rate readability, accuracy, and usefulness on a 5-pt Likert scale⁷. On our scale, 5 points means "very satisfying", while 1 point means "very terrible". The ratings for each aspect of quality are then averaged across the three subjects and the 50 inputs.

To be more specific, we define readability, accuracy, and usefulness as follows. Readability is concerned with the fluency and coherence of the texts. Accuracy indicates how well the review text matches the given aspects and sentiment ratings. Usefulness is more subjective: subjects need to decide whether they would accept the text when it is shown to them. The readability, the accuracy, and even the length of the review text all have an effect on the usefulness metric.

⁷ en.wikipedia.org/wiki/Likert_scale

            Readability    Accuracy    Usefulness
Gold        4.61           4.41        4.39
Rand-s      4.44           3.21        3.52
Pick        4.55*          4.15        4.20
SRGM-s      4.51           4.21        4.21
HRGM-a      4.52           4.33*       4.26*

Table 2: Human evaluation results of typical models. The best result of each metric, excluding Gold-Standard, is marked with an asterisk.

The results are shown in Table 2. We can see that in human evaluations, all the models get high scores in readability. The readability score of our model HRGM-a is very close to the highest readability score, achieved by Pick. Rand-s gets the worst scores for accuracy and usefulness, while the remaining models perform much better on these metrics. Compared to the strong baselines Pick and SRGM-s, although our model is not the best in readability, it performs better in accuracy and usefulness. The results also demonstrate the efficacy of our proposed models.

5.5 Samples

To get a clearer view of what we have done and an intuitive judgment of the generated texts, we present some samples in Table 3.

In Table 3, the first three samples are output texts of Gold-Standard, Pick, and our model HRGM-a for the same input. And in the last sample, we

Sample 1: Gold-Standard
Inputs: Space: 3, Power: 4, Control: 5, Fuel: 3, Comfort: 3, Appearance: 5, Interior: 4, Price: 4
Output: 后备箱的空间还是蛮大的,就是后排的空间比较小,座椅也不平整。动力还行吧,只要舍得给油,还说的过去。方向盘精准度高,路况反应清晰。可能是因为轮胎薄的原因吧,自动档的油耗有点高,市区油耗在10个左右,高速最多7个油。座椅还是蛮舒适的,就是行车中噪音比较大,建议做个全车隔音比较好。小⟨UNK⟩的颜值在同级别里算高的了,这点比较不错,特别是那个战斧轮毂。用料还行,偶尔会有点小异响,这个价位的车差不多⟨UNK⟩这样。不错,对得起这个价了,毕竟价钱摆在那里。
Translation: Trunk space is quite large, but the rear space is relatively small, and the seat is not smooth. Power is also okay, as long as willing to give oil. Steering wheel has high precision. Probably because of the reasons for thin tires, fuel consumption of automatic transmission is a bit high, urban fuel consumption in 10 or so, while on high way up to 7 oil. The seat is still quite comfortable, but there is large noise when driving. I propose the car to have a better sound insulation. Little ⟨UNK⟩'s appearance is better than others in the same class, and this is quite good. I especially like the Tomahawk wheels. Materials are okay, there is occasionally a little abnormal sound. The price of the car is acceptable. Yeah, worthy of the price ⟨UNK⟩. After all, the price is not that high.

Sample 2: Pick
Inputs: Space: 3, Power: 4, Control: 5, Fuel: 3, Comfort: 3, Appearance: 5, Interior: 4, Price: 4
Output: 略显狭小,如果坐4个人就很拥挤了。1.4T双增压带来的直观动力表现是不错的,不输给家里的锐志,毕竟1.4的车。新款也把双增压换成单增压,感觉有点不厚道。指哪打哪,宝马的公路操控,应该是比较不错了,很精准,无虚伪,路感强,现在400公里纯市内全程空调12个多,没平均速度可看,2.5的排量这个油耗能够接受。由于空间较大,所以随之舒适性也加了不少分。这个不多说OK!!!满意!!!一般吧,希望有些小细节要改进一下,多听听大家的意见。优惠完之后,觉得性价比还好,要是优惠⟨UNK⟩就不考虑宝来了
Translation: The space is slightly narrow, not enough for four people. The intuitional performance given by 1.4T double booster is good, better the Reiz of mine. After all, 1.4T. The new dual-pressure supercharger is replaced by a single one, where I feel a little unkind. The car goes anywhere I want. BMW's road control should be relatively good. Very accurate, no hypocrisy, strong sense of the road. I have driven the car 400 km along with air condition, the fuel consumption is about 12. I didn't pay attention to the average speed, but I think a displacement of 2.5 is acceptable to have such fuel consumption. Because of the large space, it's very comfortable in the car. I'm very satisfactory with the appearance! About interior, I think there are some small details to improve. The price is good with discount, if not for the discount ⟨UNK⟩, I won't consider Bora.

Sample 3: HRGM-a
Inputs: Space: 3, Power: 4, Control: 5, Fuel: 3, Comfort: 3, Appearance: 5, Interior: 4, Price: 4
Output: 空间一般,后排空间有点小。动力方面,提速很快。超车很轻松。操控很好,指向精准,方向盘很轻,高速容易上手。油耗在市区内比较高,高速在7个左右,市区的话油耗在10个左右。舒适性一般,毕竟是运动型的车。外观很满意,我喜欢。内饰做工还可以,就是中控的塑料感太强了。性价比很高,这个价位的车,这个配置的值价位,这个价格。性价比配置,这个价位。值!
Translation: The space is just so so, as the rear space is a little small. As for power, it can speed up very quickly, which makes it pretty easy to overtake. The control is good. It's very precise. And the steering wheel is very light, easy to use on highway. Fuel consumption in the urban area is relatively high, about 7 on highway, about 10 in urban roads. It's not comfortable enough in the car. After all, it is a sports car. The appearance is very satisfactory. I like it very much. Interiors are ok. But there is too much plastic in center control area. The price/performance ratio is very high. A car at this price, with these configurations, worths buying.

Sample 4: HRGM-a
Inputs: Space: 3, Power: 4, Control: 5, Fuel: 3, Comfort: 5, Appearance: 5, Interior: 4, Price: 4
Output: 空间一般,后排空间有点小,后备箱空间也不错,就是后排座椅不能放倒。动力还不错,提速很快。操控很好,指向精准。油耗还可以,毕竟是2.0的排量,油耗也不高,毕竟是2.0的排量,也不可能我个人开车的原因。舒适性很好,座椅的包裹性很好,坐着很舒服。外观很满意,就是喜欢。很有个性。内饰做工一般,但是用料还是很好的,不过这个价位的车也就这样了!性价比不错,值得购买。
Translation: The space is just so so, as the rear space is a little small. The trunk space is also good, but the rear seat cannot be tipped. Power is also OK. The car can speed up very quickly. Control is very good. It goes wherever you want. Fuel consumption is acceptable. After all, with a 2.0 displacement, fuel consumption is not that high. But it can't be my problem. It's comfortable in the car. The seats are well wrapped, which makes them really comfortable. The appearance is very satisfactory. I just like the cool features. Interiors are ok. The materials are ok. After all, you can't want more from cars at this price. It's worth buying the car, and I can say that the price/performance ratio is pretty good.

Table 3: Sample reviews. Given the same input, our model can generate long reviews that match the input aspects and sentiments better than the baseline methods. When we change the input rating for Comfort from middle (3) to high (5), our model can also detect the difference and change the outputs accordingly.

change one rating in the input to show how our model changes the output according to the slight difference in the input.

As we can see, Pick is a little better than our model HRGM-a in text length and content abundance. But the output of Pick has a few problems. For example, there is a serious logic problem in the reviews of Space and Comfort: the Space sentence says the car is narrow, but the Comfort sentence says the car has a large space, which violates contextual consistency.

What's more, Pick gives an improper review for Comfort. Although Comfort gets 3 points, the review sentence is rather positive, which can be considered a mismatch with the input. In contrast, our model produces the review text as a whole, and the texts are aligned with the input aspect-sentiment scores more appropriately. All 3-point aspects get neutral or slightly negative reviews, while all 5-point aspects get definitely positive comments. The 4-point aspects also get reviews biased towards being positive.

As for the last example, after changing the rating of Comfort from 3 points to 5 points, we can see that, except for the review sentence for Comfort, the other sentences do not change noticeably. But the review sentence for Comfort changes significantly from neutral to positive, which shows the power of our model.

6 Related Work

Several previous studies have attempted review generation (Tang et al., 2016; Lipton et al., 2015; Dong et al., 2017). They generate personalized reviews according to an overall rating, but they do not consider the product aspects or whether each generated sentence is produced as the user requires. The models they proposed are very similar to SRGMs, and their review texts are not as long as ours. Therefore, our work can be regarded as a significant improvement over their research.

Many studies of text generation are also closely related to our work. Traditional approaches to text generation (Genest and Lapalme, 2012; Yan et al., 2011) mainly focus on grammars, templates, and so on. But it is usually complicated to make every part of such a system work and cooperate perfectly, while today's end-to-end generation systems, like the ones within the encoder-decoder framework (Cho et al., 2014; Sordoni et al., 2015), have distinct architectures and achieve promising performance.

Moreover, recent research on hierarchical structure has helped a lot to improve generation systems. Li et al. (2015) experimented on LSTM autoencoders to show the power of hierarchically structured LSTM networks to encode and decode long texts. Recent studies have also successfully generated Chinese poetry (Yi et al., 2016) and Song iambics (Wang et al., 2016) with hierarchical RNNs.

The attention mechanism originated in the area of image processing (Mnih et al., 2014), but it is now widely used in all kinds of generation models in NLP (Bahdanau et al., 2015; Fang et al., 2016). Besides, attention today is not exactly the same as in its original form. It is more a way of thinking than an algorithm, and various changes can be made to construct a better model.

7 Conclusion and Future Work

In this paper, we design end-to-end models to tackle the automatic review generation task. Retrieval-based methods have problems generating texts consistent with the input aspect-sentiment scores, while RNNs cannot deal well with long texts. To overcome these obstacles, we propose several models and find that our model with hierarchical structure and aligned attention can produce long reviews of high quality, outperforming the baseline methods.

However, there are still some problems in the texts generated by our models. In some generated texts, the contents are not rich enough compared to human-written reviews, which may be improved by applying diversity decoding methods (Vijayakumar et al., 2016; Li et al., 2016). There are also a few logical problems in some generated texts, which may be improved by generative adversarial nets (Goodfellow et al., 2014) or reinforcement learning (Sutton and Barto, 1998).

In future work, we will apply our proposed models to text generation in other domains. As mentioned earlier, our models can be easily adapted to other data-to-text generation tasks if the alignment between structured data and texts can be provided. We hope our work will not only be an exploration of review generation, but also contribute to general data-to-text generation.

Acknowledgments

This work was supported by 863 Program of China (2015AA015403), NSFC (61331011), and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for helpful comments. Xiaojun Wan is the corresponding author.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS, Deep Learning and Representation Learning Workshop.

Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 623–632, Valencia, Spain, April. Association for Computational Linguistics.

Wei Fang, Juei-Yang Hsu, Hung-yi Lee, and Lin-Shan Lee. 2016. Hierarchical attention model for improved machine comprehension of spoken content. IEEE Workshop on Spoken Language Technology (SLT).

Pierre-Etienne Genest and Guy Lapalme. 2012. Fully abstractive approach to guided summarization. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 354–358. Association for Computational Linguistics.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.

In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, pages 141–145. ACM.

Ioannis Konstas. 2014. Joint models for concept-to-text generation.

Jiwei Li, Minh-thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL. Citeseer.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT, pages 110–119.

Zachary C Lipton, Sharad Vikram, and Julian McAuley. 2015. Generative concatenative nets jointly learn to write and classify reviews. arXiv preprint arXiv:1511.03683.

Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.

Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merriënboer, Kyunghyun Cho, and Yoshua Bengio. 2014. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. Syntax, Semantics and Structure in Statistical Translation, page 78.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Alex Graves. 2012a. Sequence transduction with recur- Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neu-
rent neural networks. International Conference of Ma- ral responding machine for short-text conversation.
chine Learning (ICML) Workshop on Representation ACL.
Learning. Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi,
Alex Graves. 2012b. Supervised sequence labelling. In Christina Lioma, Jakob Grue Simonsen, and Jian-Yun
Supervised Sequence Labelling with Recurrent Neural Nie. 2015. A hierarchical recurrent encoder-decoder
Networks, pages 5–13. Springer. for generative context-aware query suggestion. In Pro-
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long ceedings of the 24th ACM International on Conference
short-term memory. Neural computation, 9(8):1735– on Information and Knowledge Management, pages
1780. 553–562. ACM.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney.
semantic alignments for generating image descrip- 2012. Lstm neural networks for language modeling.
tions. In Proceedings of the IEEE Conference on Com- In Interspeech, pages 194–197.
puter Vision and Pattern Recognition, pages 3128– Ilya Sutskever, James Martens, and Geoffrey E Hinton.
3137. 2011. Generating text with recurrent neural networks.
Soomin Kim, JongHwan Oh, and Joonhwan Lee. 2016. In Proceedings of the 28th International Conference
Automated news generation for tv program ratings. on Machine Learning (ICML-11), pages 1017–1024.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.
Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
Qixin Wang, Tianyi Luo, Dong Wang, and Chao Xing. 2016. Chinese song iambics generation with neural attention-based model. arXiv preprint arXiv:1604.06274.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81.
Rui Yan, Liang Kong, Congrui Huang, Xiaojun Wan, Xiaoming Li, and Yan Zhang. 2011. Timeline generation through evolutionary trans-temporal summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 433–443. Association for Computational Linguistics.
Xiaoyuan Yi, Ruoyu Li, and Maosong Sun. 2016. Generating Chinese classical poems with RNN encoder-decoder. arXiv preprint arXiv:1604.01537.
Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4584–4593.
Jianmin Zhang, Jin-ge Yao, and Xiaojun Wan. 2016. Toward constructing sports news from live text commentary. In Proceedings of ACL.

