Twitter Topic Modeling for Breaking News Detection
Henning M. Wold1, Linn Vikre1, Jon Atle Gulla1, Özlem Özgöbek1,2 and Xiaomeng Su3
1 Department of Computer and Information Science, NTNU, Trondheim, Norway
2 Department of Computer Engineering, Balikesir University, Balikesir, Turkey
3 Department of Informatics and e-Learning, NTNU, Trondheim, Norway
Abstract: Social media platforms like Twitter have become increasingly popular for the dissemination and discussion
of current events. Twitter makes it possible for people to share stories they find interesting with their
followers, and to write updates on what is happening around them. In this paper we attempt to use topic models
of tweets in real time to identify breaking news. Two different methods, Latent Dirichlet Allocation (LDA)
and Hierarchical Dirichlet Process (HDP), are tested both with each tweet in the training corpus treated as a
document by itself, and with all the tweets of a unique user regarded as one document. The second approach
emulates Author-Topic modeling (AT modeling). The evaluation of the methods relies on manual scoring of the
accuracy of the modeling by volunteer participants. The experiments indicate that topic modeling on tweets in
real time is not suitable for detecting breaking news by itself, but may be useful in analyzing and describing
news tweets.
Wold, H., Vikre, L., Gulla, J., Özgöbek, Ö. and Su, X.
Twitter Topic Modeling for Breaking News Detection.
In Proceedings of the 12th International Conference on Web Information Systems and Technologies (WEBIST 2016) - Volume 2, pages 211-218
ISBN: 978-989-758-186-1
Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
to build context-aware news experiences based on deep understanding of text in continuous news
streams (Ingvaldsen et al., 2015).
In Section 2 we present previous research in the field of topic modeling in general, and on tweets
in particular. Section 4 describes how we aim to find a suitable topic modeling technique to handle
tweets in real time. After that, we evaluate the results of the different topic modeling techniques in
Section 5. In Section 7 we discuss our results, while Section 8 summarizes our findings along with a
proposal for future work.

2 RELATED WORK

A topic model of a collection of documents is a trained statistical model that exposes abstract topics
in the collection. Each document may concern multiple topics in different proportions; a news article
about pets, for example, may cover topics like cats, dogs, and fish. The topics themselves are
represented by high-frequency words that occur in the descriptions of the topics. Historically, topic
modeling has been widely used to explore topical trends in large corpora that span several years,
analyzing for example political and social changes in important time periods.
In recent years, topic modeling has been increasingly utilized for analyzing corpora of tweets.
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is one of the most widely used techniques for
this analysis. There are several modifications and extensions proposed to LDA that improve its
performance in social media settings in general, and for tweets in particular.
In (Rosen-Zvi et al., 2004) an extension to LDA called the Author-Topic model (AT model) is
proposed. In (Rosen-Zvi et al., 2010) it is shown that when the test documents contain only a small
number of words, the proposed model outperforms LDA. This research was done on a collection of
1,700 NIPS conference papers and 160,000 CiteSeer abstracts; the work is done on abstracts, which
are shorter than normal documents but still longer than regular tweets.
(Hong and Davison, 2010) showed how training a topic model on aggregated messages results in
higher quality learned models, yielding better results when modeling tweets. In this work all the
messages of a particular user are aggregated before training the LDA model on them. This simple
and straightforward extension to LDA gives a more accurate topic distribution than running LDA on
each tweet individually.
In (Zhao et al., 2011) an empirical comparison is done between Twitter and the New York Times.
Using an extension to LDA called Twitter-LDA, they concluded that Twitter could be a good source
of news that has low coverage in other news media. The study also suggests that while Twitter users
are not exceptionally interested in world news, they do help spread awareness of important world
events.
Another method that has been used for topic modeling tweets is the Hierarchical Dirichlet
Process (HDP) (Teh et al., 2006). HDP is a nonparametric Bayesian approach which is used to
cluster related data, and can be used to cluster tweets that have similar topics. (Wang et al., 2013)
describes how HDP can be used to detect events from tweets in real time, and shows how the
clustering in HDP works on these tweets.
All of these studies suggest that there are several techniques and methods that perform well in
different areas and that can be utilized for our purpose of collecting breaking news from Twitter.
Although there are studies that have experimented with similar ideas (Zhao et al., 2011), few have
tried to do this in real time. Most studies have done this in semi real-time, or looked at previous
data to see if they could get an indication of whether or not it is possible to fetch potential breaking
news using Twitter.

3 TWITTER NEWS DETECTION

When newsworthy events happen, people who witness them are often quick to post about the event to
their social network feeds in general, and to their Twitter accounts in particular. In (Kwak et al.,
2010) it was found that any retweeted tweet reached an average of 1,000 users. This means Twitter
could be an interesting place for detecting breaking news as it emerges. Unfortunately, many of the
posts on Twitter are not news items, but rather express personal opinions, mundane status updates,
and the like, such as those found in Figures 1 and 4. Other tweets are more serious and potentially
related to news, such as Figure 3. Lastly, some tweets (e.g. Figure 2) are entirely written in foreign
languages, often using non-Latin alphabets. The first step in detecting breaking news is then to
filter out the noise, leaving posts that are potential news for the analysis.

Figure 1: Example of a short, non-serious tweet.

The main challenge of news detection on Twitter is the length restriction on tweets. A tweet can be a
maximum of 140 characters long. Because tweets are
continuously build upon the already existing model, which is based on the information that comes
from the data stream. In our case that means new documents (that is, tweets) will be added to the
model continuously as they are received by the system.

4.2 Author-topic

The Author-topic (AT) model is an extension of LDA. In this model, the content of each document and
the interests of its authors are modeled simultaneously. AT uses a probabilistic "topics-to-author"
model, which allows a mixture of different weights (θ and φ) for different topics to be determined by
the authors of a specific document (Rosen-Zvi et al., 2004). φ is generated similarly to θ, in that it
is drawn from a symmetric Dirichlet(β).
The AT model as defined above would result in a very large memory footprint if implemented
directly. The reason is that we would have to store all inbound tweets to be able to continuously
update the documents written by each author, which would scale linearly in time. So instead of
implementing it directly, we replicate the approach of (Hong and Davison, 2010). This involves
aggregating all the documents made by a single author into one document; in our case this means
concatenating all the tweets of a unique user into one document.

4.3 Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is a topic modeling technique for performing unsupervised
analysis of grouped data. HDP provides a nonparametric topic model where documents are grouped by
observed words, topics are distributions over several terms, and every document shows different
distributions of topics. Its formal definition, due to (Teh et al., 2006), is:

G_0 | γ, H ∼ DP(γ, H)
G_j | α_0, G_0 ∼ DP(α_0, G_0)

for each group j. Here G_j is the random probability measure of group j, and its distribution is
given by a Dirichlet process. This Dirichlet process depends on α_0, the concentration parameter
associated with the group, and on G_0, the base distribution shared across all groups. This base
distribution is given by a Dirichlet process as well, where γ is the base concentration parameter
and H is the base distribution which governs the a priori distribution over data items.
One limitation of standard HDP is that it has to crawl through all the existing data multiple
times. Therefore (Wang et al., 2011) proposed a variant of HDP which they called the Online
Hierarchical Dirichlet Process. Online-HDP is designed to analyze streams of data: it provides the
flexibility of HDP and the speed of online variational Bayes. The online aspect of Online-HDP is
both a performance improvement and a variation in how models are updated. The performance
improvement is achieved by not having to crawl through the entire corpus repeatedly. Instead of
making several passes over a fixed set of data, (Wang et al., 2011) suggested that model updates can
be optimized by iteratively choosing a random subset of the data and then updating the variational
parameters based on the current subset.

5 NEWS DETECTION EXPERIMENT

In the experiment part of this work, we utilize a set of tweets to train topic models using LDA and
HDP. We train both models by treating each tweet as a separate document, and also by aggregating
all the tweets of a single user into one document. This gives us four different approaches to
compare. Our motivation for training the models in two different ways on the same dataset is to see
if we can counter the biggest problem with modeling tweets, which is their short length. As this
aggregation of tweets is an attempt to emulate the AT model, we refer to the experiments using this
dataset as LDA-AT and HDP-AT.
Before briefly describing the pre-processing steps and the training of the models, we describe
the data set and the data collection process. After that, we test the various methods described in
the previous section on a set of test tweets, separate from the training tweets.

5.1 Data Set

For our experiments, we have fetched data from Twitter's own streaming API (http://dev.twitter.com)
and built a separate data set. Our training data set contains approximately 600,000 tweets from the
New York (USA) area, collected in the period from November 26, 2014 to December 5, 2014.
Using Twitter's API it is possible to get an extensive set of metadata connected to each tweet.
Most of this metadata is of no significance for topic modeling, and is stripped away. For the
purposes of the experiment, the only things we keep are the screen name of the author and the
content of the tweet itself.
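The metadata reduction just described can be sketched in a few lines of Python. The sample payload below is a hypothetical fragment of a v1.1 streaming-API tweet object; real objects carry many more fields, all of which are discarded here.

```python
import json

def strip_tweet(raw_json: str):
    """Reduce a raw streaming-API tweet to the two fields the experiment
    keeps: the author's screen name and the tweet text.
    Field names follow Twitter's v1.1 JSON payload."""
    tweet = json.loads(raw_json)
    return tweet["user"]["screen_name"], tweet["text"]

# Hypothetical example payload with one extraneous metadata field:
raw = ('{"user": {"screen_name": "nyc_witness", "followers_count": 42},'
       ' "text": "Huge fire downtown!"}')
print(strip_tweet(raw))  # ('nyc_witness', 'Huge fire downtown!')
```

Keeping only these two fields also makes the per-user aggregation of Section 4.2 straightforward: tweets can be grouped by screen name before training.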
Based on an idea from Meyer et al. (Meyer et al., 2011), we ignored any tweet containing non-ASCII
characters. The rationale behind this idea is that there are several tweets containing nothing but
emoticons, as well as several tweets written using non-ASCII characters (such as Arabic and
Chinese). The danger of doing this is that a few relevant tweets might also be removed. We
nevertheless consider this a fair tradeoff, and the loss negligible, in order to find the majority of
news-carrying tweets. If we were to include tweets made in foreign languages using a non-ASCII
alphabet, the complexity of the analysis would dramatically increase. Moreover, we are interested in
presenting breaking news to a user who, presumably, knows English but not necessarily other
languages. Another aspect of ignoring non-ASCII characters is that tweets containing Unicode
emoticons are mostly status updates or chatter, which can be safely ignored. Finally, if an actual
news post gets filtered out due to including non-ASCII characters, chances are high that someone
else will have posted about the same event without using non-ASCII characters. All these points
considered, we decided that it was a fair tradeoff to ignore all tweets containing non-ASCII
characters. In our data set, around 30% of the tweets are ignored due to their inclusion of
non-ASCII characters.
Other than removing tweets containing non-ASCII characters, we have replaced all URLs with the
word "LINK". We kept all hashtags, but removed words that are not meaningful, such as combinations
of many words written as a hashtag but lacking the '#' sign. We also removed common stop words,
since they have no effect on analyzing the meaning of tweets through topic modeling. We have not
performed any other word processing on the tweets, or stemming on the words. We also made a copy of
the data set where all the tweets belonging to a user are merged together.
Before performing the experiment, we trained the different algorithms on the dataset described
above. We used the Python library Gensim (https://radimrehurek.com/gensim/), as it has embedded
support for both LDA and HDP. For both models, the number of topics was set to 50. This number was
chosen based on earlier research by Hong and Davison (Hong and Davison, 2010), which showed that the
best results were obtained when the number of topics was set to 50. There are some additional
parameters that can be tweaked for the models, and we tried a few different settings. First we made
the LDA model update in chunks of 1000, meaning that the model is trained with 1000 tweets at a
time. For the training set where each user's messages are aggregated, this meant that the model was
updated in chunks of 1000 unique users. By doing so in our experiments, we observed that this caused
one topic to be "inflated": almost every message we attempted to classify using a model trained in
this fashion was assigned to the same, highly general topic. By increasing the chunk size by a
factor of ten to 10,000, this phenomenon disappeared.
Furthermore, we set the number of passes to perform with LDA to 10. This was chosen to ensure
the initial training set converged well on topics. For HDP, we set the chunk size to 256. When doing
online training (that is, updating the model with real-time tweets), it is not possible to change
these parameters, so the model is simply updated straight away with the provided corpus (new
messages received in real time). It is somewhat possible to adjust the chunk size with real-time
messages by doing batch updates. The size of the batches has to be balanced: updates should be
neither too rare nor too small, so that they do not skew the model. A compromise from our trials
seems to be an update for about every 500-1000 new tweets that arrive.
After training the models, we used them on a set of 100 tweets collected in the same time period
as the training set. These tweets were new in that they were not part of the training set.
Additionally, we used the models on portions of the New York Times Annotated Corpus (Sandhaus,
2008) to assess which topics were most frequently found in actual news articles. We did this by
tallying up the most relevant topic for each of the articles in the corpus. Doing this gave us a
list of the topics most related to the articles in the New York Times corpus. The "score" of a
topic, then, is the number of articles in the New York Times corpus for which that topic was the
best fit. This was done as part of filtering tweets concerning news, and not as part of the
experiment described below.

5.2 Experiment

To conduct our experiment, we asked 7 people to participate in manual grading of our results. As
mentioned above, we collected 100 tweets for our experiment. We utilized each of the trained models
(LDA, LDA-AT, HDP, HDP-AT) to topic model these new tweets. The results of this topic modeling were
distributed to our participants, who graded how well the assigned topic fit on a scale from 1-5.
Here a score of 1 means the topic is not relevant, and 5 means it is a perfect fit.
As shown in Table 1, some of the resulting topics are very generic and cover a broad set of
terms. This is not surprising, as many tweets are about the everyday affairs of their authors. As we
collected tweets from the New York area, we expected an abundance of tweets about things occurring
there.

Table 1: An example of words in topics #32 and #1 found by LDA-AT.

Topic #32
WORD      PROB.
good      0.0216
love      0.0207
time      0.0205
day       0.0188
today     0.0150
night     0.0124
great     0.0103
work      0.0099
life      0.0093
youre     0.0089

Topic #1
WORD      PROB.
game      0.0649
played    0.0490
team      0.0354
win       0.0345
play      0.0253
football  0.0147
games     0.0145
ball      0.0114
pick      0.0112
points    0.0102

Figure 5 shows the scores given by our test subjects to the categorization of each tweet in our
test set. As is evident, there is a clear difference in performance between the four methods. The
two HDP variants, HDP-AT in particular, performed rather poorly, with HDP having an average score of
1.792 and HDP-AT an average score of 1.178. The LDA variants performed much better: LDA had an
average score of 2.000, while LDA-AT had an average score of 2.610, as graded by the participants of
the experiment.

Figure 5: Results from modeling 100 tweets using LDA, LDA-AT, HDP and HDP-AT.

To measure the results and give a final score for the different methods, we use precision. We
set the threshold for a topic being relevant to a tweet at 3 out of 5. This means that every
assignment with a score of 3 or higher is marked as relevant, while any assignment graded 2 or lower
is marked as not relevant. We calculate the precision by taking the number of relevant assignments
and dividing it by the total number of assignments.
Using the scores given by the test participants and the definition above, we calculated the
precision of each method. These scores can be found in Table 2.

Table 2: Precision results given by the different topic modeling methods.

Method   Precision
LDA      0.287
LDA-AT   0.520
HDP      0.059
HDP-AT   0.198

As mentioned previously, the sparsity and noisy nature of tweets make it difficult to get
reasonable data out of them. This is especially true once we strip them of stop words: after this
process, some tweets end up with a very low word count, and as such are not likely to get a suitable
topic assigned. Furthermore, tweets in foreign languages are a challenge, and not something we have
taken into account in this work. Even when ignoring tweets containing non-ASCII characters, some
languages (such as Spanish) still slip through.
As the results show, LDA-AT outperforms the other three models. The main rationale behind this
is that by combining all tweets from a single author in the training set into one document (meaning
the training set then contains one document per author), it becomes possible to somewhat counter the
document length limitation inherent to tweets. As authors tend to stick to only a handful of topics,
this should not skew the model in any meaningful way.

6 TOPIC MODELING COMBINED WITH NEWS INDEX

To be able to filter out what is news on Twitter, we utilized the New York Times Annotated Corpus,
as described earlier. We modeled all the articles published by the New York Times in 2006 using the
LDA-AT approach. We chose that approach as our previous experiment had suggested it had the best
performance on tweets of the four approaches tested. This gave us a
handful of topics that were much more likely to be assigned to actual news articles than others. We
then ran some fresh tweets through the model, and those that were assigned to one of the top-ranked
topics were saved, to see if they were actually news items. We did this three times: once with the
top 3 topics, once with the top 2, and finally once with only the top topic. The results can be seen
in Table 3.

Table 3: Test results showing the number of relevant and non-relevant tweets, with their precision
values, for different numbers of categories.

#Categories   Relevant   Not rel.   Precision
3 categories  7          247        0.028
2 categories  7          101        0.065
1 category    4          35         0.103

As is immediately clear, our approach for extracting news does not perform very well. The data
set tested consisted of 3,454 tweets, meaning more than 90% of the total tweets were removed. Even
so, the best precision achieved was merely 0.103.

"I'm on grand jury watch in the Eric Garner case. They are meeting and could decide today.
Follow here and @NYTMetro for breaking updates"

Figure 6: An example of a news tweet found by using the topic modeling method combined with the
news index.

7 DISCUSSION

In this work we experimented with different methods for detecting breaking news from Twitter
streams. As a result of our comparison of different methods for topic modeling tweets, it seems that
using an LDA model coupled with aggregating all the tweets of a user, called LDA-AT, is the most
effective approach. The largest limitation of using only topic modeling for news detection is that
it is sometimes difficult to know what a topic represents. Another pervasive limitation is Twitter's
140-character limit on tweets.
Using the LDA-AT approach is a challenge when working with streams of real-time tweets. In a
real-time setting, the goal is to model tweets as they arrive. As our results show, combining all
the tweets of a single user into one document is desirable when doing this. In a real-time setting,
however, one cannot afford to wait for tweets for an extended period of time to make sure that each
user's document is large enough before updating the model, because that would compromise the goal of
topic modeling the tweets as soon as they arrive. One potential solution for news detection purposes
would be to use a static topic model; on the other hand, this has the risk of the model getting
outdated as popular topics on Twitter drift. Another potential solution for real-time processing is
to incrementally update the model at set intervals, so as not to overly delay the topic modeling of
the tweets themselves. This solution comes with another issue, however: the topic model we have used
does not allow terms to be added to the dictionary after the model has been initialized. This can be
alleviated by using a hash-map-based dictionary, with the caveat that certain terms will share the
same index, which potentially lowers the precision of the model. The alternative is to keep the
dictionary static; the danger of doing this is that, over time, certain terms that are not in the
dictionary could become important for identifying news.
However, the inherent limit of 140 characters per tweet is problematic for any statistical model
that draws on word frequencies. Only a few content-relevant nominal phrases are included, and most
of the words within those 140 characters tend to come from the stop word list. Even though the
approach may be improved somewhat, the experiments seem to suggest that topic modeling is far from
being effective at detecting breaking news on Twitter in the near future. Other, simpler techniques,
like detecting clusters of tweets at particular locations at particular times, may be both
computationally more efficient and quality-wise more precise.
On the other hand, topic modeling has some other advantages that may be interesting as part of a
larger news aggregator service. By associating tweets with multiple topics, we can organize news
tweets along multiple dimensions for easier user inspection. We may also use the most prominent
words of each topic representation as a short summary or title of a group of related tweets, making
it unnecessary to check every tweet to understand the overall content.

8 CONCLUSION AND FUTURE WORK

In this paper we have compared four different methods for topic modeling tweets as part of a
breaking news detection system. As a result of our experiments, LDA-AT outperforms the other three
models. We also attempted to use the trained topic model in a practical manner to detect news. We
used the New York Times Annotated Corpus to decide which topics were most likely to be assigned to
news articles, before using those topics to filter a new data set. The results of this experiment
show that the majority of tweets fetched using this method are not news, achieving a precision of
only 0.103 in the best case.
Topic modeling by itself is not likely to be sufficient for detecting breaking news from
Twitter. The tweets are too short and too ambiguous to generate statistical models of the necessary
precision. As a supplement to other techniques for news detection, topic models may however be
useful, since they assume no knowledge of location, time or author.
From a news aggregator perspective, topic modeling is also interesting for clustering and
summarizing news content. Each tweet is associated with a number of relevant topics or clusters, and
each topic is in turn described using a set of prominent words for that topic. In the future we
intend to further explore the clustering abilities of topic modeling to improve the user experience
of our news aggregator. It allows us to structure news content along several dimensions and use
short labels to summarize sets of news stories.

REFERENCES

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of
Machine Learning Research, 3:993–1022.
Gulla, J. A., Fidjestøl, A. D., Su, X., and Castejon, H. (2014). Implicit user profiling in news
recommender systems.
Hong, L. and Davison, B. D. (2010). Empirical study of topic modeling in twitter. In Proceedings of
the First Workshop on Social Media Analytics, pages 80–88. ACM.
Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., and Ma, K.-L. (2012). Breaking news on twitter. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2751–2754. ACM.
Ingvaldsen, J. E., Gulla, J. A., and Özgöbek, Ö. (2015). User controlled news recommendations. In
Proceedings of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems
co-located with ACM Conference on Recommender Systems (RecSys 2015).
Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is twitter, a social network or a news media?
In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM.
Mendoza, M., Poblete, B., and Castillo, C. (2010). Twitter under crisis: Can we trust what we rt? In
Proceedings of the First Workshop on Social Media Analytics, pages 71–79. ACM.
Meyer, B., Bryan, K., Santos, Y., and Kim, B. (2011). Twitterreporter: Breaking news detection and
visualization through the geo-tagged twitter network. In CATA, pages 84–89.
Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., and Steyvers, M. (2010). Learning
author-topic models from text corpora. ACM Transactions on Information Systems (TOIS), 28(1):4.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors
and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence,
pages 487–494. AUAI Press.
Sakaki, T., Okazaki, M., and Matsuo, Y. (2010). Earthquake shakes twitter users: real-time event
detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web,
pages 851–860. ACM.
Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia,
6(12):e26752.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical dirichlet processes.
Journal of the American Statistical Association, 101(476).
Titov, I. and McDonald, R. (2008). Modeling online reviews with multi-grain topic models. In
Proceedings of the 17th International Conference on World Wide Web, pages 111–120. ACM.
Wang, C., Paisley, J. W., and Blei, D. M. (2011). Online variational inference for the hierarchical
dirichlet process. In International Conference on Artificial Intelligence and Statistics, pages
752–760.
Wang, X., Zhu, F., Jiang, J., and Li, S. (2013). Real time event detection in twitter. In Wang, J.,
Xiong, H., Ishikawa, Y., Xu, J., and Zhou, J., editors, Web-Age Information Management, volume 7923
of Lecture Notes in Computer Science, pages 502–513. Springer Berlin Heidelberg.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., and Li, X. (2011). Comparing twitter
and traditional media using topic models. In Advances in Information Retrieval, pages 338–349.
Springer.