Choudhery2017 Social Media Mining Prediction of Box Office Revenue

Social Media Mining:
Prediction of Box Office Revenue
Deepankar Choudhery and Carson K. Leung

University of Manitoba
Winnipeg, MB
Canada
kleung@cs.umanitoba.ca

ABSTRACT
In recent years, social media has played a huge role in how we share and communicate our thoughts and opinions. This information can be very valuable for companies and governments, as it can be used to analyze public mood and opinion, which is a very powerful tool. In this paper, we present a system that mines social media content from a platform such as Twitter for predicting future outcomes. Specifically, it uses chatter from Twitter to predict box office revenue of movies by extracting features such as tweets and their sentiments. Then, by using these features, our system constructs a polynomial regression model for predicting box office revenue. Experimental results show the effectiveness of our system in mining social media and predicting box office revenue.

CCS CONCEPTS
• Networks → Online social networks;
• Information systems → Data mining; Social networks;
• Computing methodologies → Supervised learning by regression

KEYWORDS
Data mining, prediction, social media data, social networking sites, Twitter, tweets, box office, movies

ACM Reference format:
D. Choudhery and C.K. Leung. 2017. Social media mining: prediction of box office revenue. In Proceedings of 21st International Database Engineering & Applications Symposium, Bristol, United Kingdom, July 2017 (IDEAS '17), 10 pages.
DOI: 10.1145/3105831.3105854

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
IDEAS '17, July 12-14, 2017, Bristol, United Kingdom
©2017 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-5220-8/17/07 …$15.00
http://dx.doi.org/10.1145/3105831.3105854

1 INTRODUCTION
In the current era of big data, a wide variety of valuable data of different veracities can be easily collected and generated from a broad range of data sources at a high velocity in various real-life applications (e.g., bioinformatics, sensor and stream systems, smart worlds, Web, social networks). Moreover, volumes of these big data are also beyond the ability of commonly-used software to manage, query, process, and analyze within a tolerable elapsed time. In general, characteristics of these big data can be described by the following well-known 5V’s [1, 2]:
1. Variety, which focuses on differences in types, contents, or formats of data;
2. Value, which focuses on the usefulness of data (e.g., knowledge that can be discovered from these big data);
3. Veracity, which focuses on the quality of data (e.g., precise data, uncertain and imprecise data);
4. Velocity, which focuses on the speed at which data are collected or generated; and
5. Volume, which focuses on the quantity of data.
A rich source of big data is social networking sites such as Twitter. Embedded in these big data (such as social media data) is implicit, previously unknown and potentially useful information and knowledge. To discover the useful information and knowledge, data mining techniques, including data analytics and visual analytics techniques [3, 4], are in demand. Common data mining tasks include the following:
• Association rule and frequent pattern mining [5-7], which aims to find (a) frequent patterns in the form of sets of frequently purchased merchandise items or co-occurring events, or (b) associations between the frequent patterns;
• Anomaly or outlier detection [8, 9], which aims to find rarely occurring events or objects that deviate from the norm;
• Clustering [10, 11], which aims to find clusters in the form of groups of similar objects; and
• Classification and prediction [12-17], which aims to establish a systematic model to give or predict labels for classes of test data based on the training data.
Due to their popularity, social networks have played a major role in the spread or propagation of information and opinions. As a social media platform, Twitter has played a popular role. For instance, as of the end of June 2016, there were 313 million monthly active Twitter users.†
Intuitively, one may expect that a high social media presence for goods or services (e.g., a movie) would correlate to high revenue. Since the volume of information and opinions has grown exponentially, strong correlations between the frequency of chatter and revenue can be a good indicator of future outcomes [18, 19]. As a result, systems that combine the frequency of chatter with the opinions of the population on social media, for gaining useful insights and predicting trends, are in demand.
Our key contributions of this paper include the design and development of a system that uses chatter from social media platforms to discover interesting patterns that can be used for prediction. In particular, we focus on mining Twitter tweets and predicting box office revenue of movies. Reasons for mining tweets relating to movies (rather than data from other entertainment media) are partially due to the following:
• Movie chatter is quite popular on Twitter, and
• Movie box office revenue numbers are publicly accessible (cf. companies for other media such as television and video games usually do not provide accurate sale numbers).
It is important to note that, although our designed and developed social media mining system focuses on Twitter tweets, it can be applicable to data from other social media platforms such as Facebook. Moreover, although our system mines tweets about movies to predict the corresponding box office revenue, our system can be adapted to mine other entertainment media (e.g., magazines, songs) to discover other interesting knowledge (e.g., trends).
As a preview, we will use six recent movies for illustration and evaluation. These six movies were chosen based on multiple popular movie news websites (e.g., Movie Insider‡, Screen Rant§) when searching for the “most anticipated movies in 2017”. Due to their popularity, there are lots of social media data (e.g., tweets) about these movies on social media websites like Twitter. All these movies coincidentally have a budget of at least 50 million dollars. Since movies are heavily advertised on social media, a high budget would also indicate high social media presence.
The remainder of this paper is organized as follows. The next section describes related works. In Section 3, we present our social media mining system that can be used for predicting box office revenue. Section 4 shows experimental results. Finally, Section 5 gives the conclusions.

2 RELATED WORKS
Predicting box office revenue has been an ongoing research topic for many researchers [20-24]. Some existing approaches [25] use a lot of movie-specific data, including genre, rating, release date, cast, and number of debut screens, and use some form of regression to predict earnings from these features. Another approach [26] is to consider this topic as a classification problem and use neural networks to classify movies from “flop” to “blockbuster”, which may lead to low accuracy. A third approach [19] constructs a linear regression model using Twitter data ahead of a movie release. This approach creates a quantifiable measure, namely a tweet rate (tweets per hour), for use in the model, which is then used to predict the first 2 weeks of revenue. Then, sentiment is added to fine-tune the model and improve its accuracy. In contrast, our social media mining system uses a polynomial regression model because a curve is expected to be better suited to the data for predicting week-by-week drop-off in revenue.

3 OUR SOCIAL MEDIA MINING SYSTEM
In this section, we describe our social media mining system. It mines tweets about movies from Twitter. Consequently, data analysts could get a better understanding of the consumers’ (i.e., movie viewers’) behaviour, which in turn helps predict the box office revenue.
Twitter is a micro-blogging network with over 313 million monthly active users. Users can tweet about any topic within the 140-character limit, and can follow other Twitter users to receive their tweets.

3.1 Collection of Social Media Data
Twitter provides users with a robust application program interface (API) to query for keywords or tags. In response, the API returns a tweet object in JavaScript object notation (JSON) format. Hence, to collect social media data, our social media mining system provides users with two options, which use the following primary Twitter APIs, to collect data through a keyword:
1. Twitter Search APIs**, which look for older tweets. These older tweets are limited to those posted within the past 7-8 days. In a single request, Twitter gathers and returns 100 tweets for any specific keyword. Twitter allows 180 such requests in a 15-minute timeframe (i.e., 4 × 180 = 720 requests per hour). Hence, the hypothetical maximum number of tweets one can gather is 4 × 180 × 100 = 72,000 tweets per hour.
2. Twitter Streaming APIs††, which collect tweets as they are being tweeted. This makes it easier to get a lot of data quickly for a particular keyword. However, the keyword should not be too general, so as to avoid hitting the stream limit, which is 1% of the total tweets currently being posted. As an example, if there are more than about 500 million tweets a day, then the Streaming API can collect a hypothetical maximum of at least 1 million tweets a day (i.e., about 41,667 tweets per hour).
† https://about.twitter.com/company
‡ https://www.movieinsider.com/
§ http://screenrant.com/
** https://dev.twitter.com/rest/public/search
Between the two data gathering tools, the default of our social media mining system is the Twitter Search API because it provides more powerful queries: it allows filtering results based on location, language, account IDs, and specific words in the text of a tweet. Moreover, the Twitter Search API can collect small datasets quickly because multiple days’ and weeks’ worth of tweets can be mined in a few hours (cf. collecting day-by-day using the Twitter Streaming API). However, our system also warns users of a caveat of using the Twitter Search API: only tweets from the past 7-8 days can be collected, so the users have to run the Twitter Search API frequently (say, at least every 7-8 days).
With the default tweet gathering tool of the Twitter Search API, let us open the “black box” and reveal some implementation details of our social media mining system. Specifically, our system applies a Python library called Tweepy‡‡ to access the Twitter Search API. With Tweepy, the API calls are simplified because they use Python methods§§ as wrappers. By specifying a string array of the terms to be searched for (e.g., movie titles) and the day ranges (e.g., 0-8 days), the system returns a JSON file for each movie title. The system then augments JSON objects of tweets with several informative attributes (e.g., user ID, time, text). The system then usually iterates through all the tweets in the resulting JSON files, normalizes the tweet text (including emoji and hyperlinks), and adds the tweets to a database (e.g., MySQL database). An advantage of creating JSON files first (cf. adding tweets to the database right away) is the fast speed of accessing the data and storing them into the database.

3.2 Analysis of Social Media Data
After collecting social media data (e.g., tweets) via Tweepy (which calls the Twitter Search API), our social media mining system analyzes the data. Our system provides users with the analysis results in visual form (e.g., charts).
For instance, the system plots a temporal chart, in terms of the tweet counts (i.e., the number of tweets), for each movie. Such a chart helps reveal the association between the number of tweets about a movie and its time before and after the movie release (i.e., pre-release and post-release periods):
• Our system starts plotting the number of tweets two weeks before the movie releases, when potential viewers start talking about the movie based on trailers, ads, and other promotion materials. Tweets at that time show the eagerness (or its lack) of these potential viewers in watching the movie.
• Our system continues plotting the number of tweets weeks after the movie releases, when viewers express their opinions about the movie. Tweets at that time show the positive (or negative) opinions from viewers about the movie, and they also show the eagerness (or its lack) of potential viewers in watching the movie after seeing these positive (or negative) opinions.
Knowing that tweet counts alone may not be sufficient to make a good predictive model, our social media mining system also considers other factors like sentiment, public reception, and age demographic. For instance, in addition to counting relevant tweets, our system also applies sentiment analysis on tweets and analyzes the emotion in the text of a tweet.
Again, let us open the “black box” and reveal some implementation details of our social media mining system. Specifically, to analyze the emotion in the text of a tweet, our system uses TextBlob***, which is a simple natural language processing library for Python. It provides a simple API for text processing and common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation. Here, TextBlob helps extract the polarity, which is a measure of the positivity, negativity, or neutrality of tweet text. Based on the polarity measure, a tweet can then be classified as follows:
• A positive tweet, which possesses a polarity value between +0.01 and +1;
• A negative tweet, which possesses a polarity value between -1 and -0.01; and
• A neutral tweet, which possesses a polarity value between -0.01 and +0.01.
In addition to temporal charts on tweet counts, our system also plots a temporal chart, in terms of the percentage of positive and negative tweets, for each movie. Such a chart helps reveal the association between the positive (or negative) sentiment about a movie during pre-release and post-release periods.

3.3 Prediction for Social Media Data
After analyzing different features (e.g., tweet counts, percentage of positive tweets, percentage of negative tweets) of social media data (e.g., tweets), our social media mining system takes these features into account for prediction (e.g., when predicting box office revenue). For simplicity and time-efficiency, our system builds a simple prediction model by applying polynomial regression to the following features of social media data collected on a particular day:
• Tweet counts,
• Percentage of positive tweets, and
• Percentage of negative tweets.
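As an illustration of how the three per-day features of Section 3.3 (tweet count, percentage of positive tweets, percentage of negative tweets) might be derived from per-tweet polarity scores, the following is a minimal sketch rather than the system's actual code. The function names `classify_polarity` and `daily_features` are hypothetical, the sample scores are made up, and treating the boundary values ±0.01 as neutral is an assumption, since the ranges in Section 3.2 touch at those points. In practice each score could come from TextBlob's `TextBlob(text).sentiment.polarity`.

```python
from collections import Counter

# Cut-offs taken from the polarity ranges in Section 3.2; treating the
# boundary values +/-0.01 themselves as neutral is an assumption of this sketch.
POSITIVE_MIN = 0.01
NEGATIVE_MAX = -0.01


def classify_polarity(polarity):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if polarity > POSITIVE_MIN:
        return "positive"
    if polarity < NEGATIVE_MAX:
        return "negative"
    return "neutral"


def daily_features(polarities):
    """Return (tweet count, % positive, % negative) for one day's tweets."""
    count = len(polarities)
    if count == 0:
        return 0, 0.0, 0.0
    labels = Counter(classify_polarity(p) for p in polarities)
    return (count,
            100.0 * labels["positive"] / count,
            100.0 * labels["negative"] / count)


# Hypothetical polarity scores for one day's tweets about one movie.
sample_day = [0.8, 0.35, 0.0, -0.4, 0.005, -0.6, 0.1, 0.5]
print(daily_features(sample_day))  # (8, 50.0, 25.0)
```

Keeping the classification separate from the aggregation means the cut-offs can be changed without touching the feature computation.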
In general, polynomial regression is a form of statistical regression analysis, in which the relationship between the independent variable x and the expected value of dependent variable y is modelled as an nth degree polynomial in x:
$y = \sum_{i=0}^{n} a_i x^i + \varepsilon$ (1)
where each power $x^i$ of independent variable x is weighted by parameter $a_i$, and $\varepsilon$ captures the error. Here, independent variable x is also known as the predictor (or explanatory) variable, and dependent variable y is also known as the response variable.
Hence, a simple linear regression expressed in Equation (2) can be considered as a specialization of Equation (1) where n=1:
$y = a_0 + a_1 x + \varepsilon$ (2)
and a quadratic model expressed in Equation (3) can be considered as another specialization of Equation (1) where n=2:
$y = a_0 + a_1 x + a_2 x^2 + \varepsilon$ (3)
To predict the expected box office revenue y, our system applies polynomial regression to 3-dimensional data x (capturing dimensions “tweet counts”, “percentage of positive tweets”, and “percentage of negative tweets”). The system chooses the polynomial degree n such that the results lead to a low mean squared error while not likely to over-fit the data.
†† https://dev.twitter.com/streaming/overview
‡‡ http://www.tweepy.org/
§§ https://www.python.org/
*** https://textblob.readthedocs.io/en/dev/

4 EVALUATION
As mentioned in Section 1, we use six recent movies for illustration and evaluation. These six movies, as listed in Table 1, were chosen based on multiple popular movie news websites (e.g., Movie Insider, Screen Rant) when searching with the query “most anticipated movies in 2017”. Due to their popularity, there are lots of tweets about these movies on Twitter. All these movies coincidentally have a budget of at least US$50M. Since movies are heavily advertised on social media, a high budget would also indicate high social media presence.

Table 1: Six Selected 2017 Movies for Evaluation

Movie title            Release date
Logan                  3-Mar-2017
Kong: Skull Island     10-Mar-2017
Beauty and the Beast   17-Mar-2017
Power Rangers          24-Mar-2017
Life                   24-Mar-2017
Ghost in the Shell     31-Mar-2017

First, our social media mining system allows us, as users, to choose the default data gathering tool of the Twitter Search API (which applies the Python library Tweepy) to collect about 30GB worth of tweets in JSON files (which enables fast data access and storage). This tool is faster than the alternative tool of the Twitter Streaming API. Recall that the Twitter Search API could collect 72,000 tweets per hour, whereas the Twitter Streaming API could only collect 41,667 tweets per hour.
Then, our social media mining system allows us, as users, to analyze the collected tweets and visualize the results of the analysis. For instance, Figure 1 shows the tweet count for each of the six selected movies. More specifically, Figure 1(a) shows the tweet counts for movie “Logan” for a 1-month period from 14 days before its release date of 3-Mar-2017 to 16 days after the release date (i.e., from 17-Feb-2017 to 19-Mar-2017). Similarly, Figure 1(b) shows the tweet counts for movie “Kong: Skull Island”, also for a 1-month period from 14 days before its release date to 16 days after the release date. The difference is that its release date was 10-Mar-2017. Hence, the 1-month period was from 24-Feb-2017 to 26-Mar-2017. Similar comments apply to the four other sub-figures. Figure 1(c) shows the tweet counts for movie “Beauty and the Beast” from 3-Mar-2017 to 2-Apr-2017 (with its release date of 17-Mar-2017). Figures 1(d) and 1(e) show the tweet counts for movies “Power Rangers” and “Life”, respectively, both from 10-Mar-2017 to 9-Apr-2017 (with their release date of 24-Mar-2017). Finally, Figure 1(f) shows the tweet counts for movie “Ghost in the Shell” from 17-Mar-2017 to 16-Apr-2017 (with its release date of 31-Mar-2017). All these six sub-figures are then put together in Figure 2, which shows tweet counts of all six movies from Day -14 (i.e., a 14-day pre-release period) to Day 16 (i.e., a 16-day post-release period) with Day 0 being the release day. The following are some observations from Figures 1 and 2:
• Every movie got an influx of tweets during the weekends (e.g., Days 0-2, Days 7-9, and Days 14-16) because Friday (Day 0) is the standard day for a movie release and subsequent weekends got the most traffic for movies. The bumps got smaller every weekend, and so did the revenue.
• Surprisingly, unlike the other five movies, the peak of tweet counts for the movie “Power Rangers” occurred on 22-Mar-2017 (i.e., two days before its release of 24-Mar-2017). This may be partially due to the premiere of the movie and its release elsewhere across the globe just a few days before the North American release.
To relate to revenue, Figure 3 shows both tweet counts and revenues for the movies on the same graph from Day 0 (i.e., the release day) to Day 16 (i.e., a 16-day post-release period). In the graph, tweet counts are shown as a scatterplot in the same colors as in Figures 1 and 2, whereas revenues are overlaid as a black polyline. Table 2 and Figure 4 present the aggregated total tweet count and revenue for each of the six movies for the first 16 days after its release. The following are some observations from Figures 3 and 4, together with Table 2:
• The movie “Logan” had a higher peak in tweet counts during the first week than the movie “Beauty and the Beast”, even though the latter had roughly twice the revenue for the first weekend. However, in the second weekend after its release date, “Beauty and the Beast” overtook “Logan” in terms of tweet counts.
• All movies suffered from significant drop-off in tweet counts and revenue, with the lowest revenue drop-off percentage being 47.3%.
• Considering only tweet counts may not be sufficient to make a good predictive model. This explains why our social media mining system also considers other factors like sentiment, public reception, and age demographic.
Figure 1: Numbers of tweets from 14 days before and 16 days after the release of each of the six movies.
Figure 2: Numbers of tweets from 14 days before and 16 days after the release of six movies.
Figure 3: Numbers of tweets and revenue from the release date to subsequent 16 days of six movies.
Table 2: Production Budget and Revenue of the Six Movies

                                              Revenue
Movie title           Production  25% of      1st weekend  2nd weekend  3rd weekend   Total
                      budget      production  (Days 0-2)   (Days 7-9)   (Days 14-16)  (Days 0-16)
                                  budget
Logan                 $97M        $24.25M     $88.41M      $38.11M      $18.81M       $184.28M
Kong: Skull Island    $185M       $46.25M     $61.02M      $27.83M      $14.67M       $133.66M
Beauty and the Beast  $160M       $40.00M     $174.75M     $90.42M      $45.00M       $360.68M
Power Rangers         $100M       $25.00M     $40.30M      $14.20M      $6.20M        $75.01M
Life                  $58M        $14.50M     $12.50M      $5.55M       $2.37M        $26.84M
Ghost in the Shell    $110M       $27.50M     $18.67M      $7.31M       $2.46M        $35.41M
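The weekend-to-weekend revenue drop-off quoted in the evaluation can be recomputed from the Table 2 figures. The sketch below is illustrative only: the helper names are hypothetical, and comparing the opening weekend against 25% of the production budget is one assumed reading of how the "25% of production budget" column might be used as a rough success check.

```python
# Revenue in millions of US dollars for the 1st, 2nd and 3rd weekends,
# and production budgets, both copied from Table 2.
WEEKEND_REVENUE = {
    "Logan": (88.41, 38.11, 18.81),
    "Kong: Skull Island": (61.02, 27.83, 14.67),
    "Beauty and the Beast": (174.75, 90.42, 45.00),
    "Power Rangers": (40.30, 14.20, 6.20),
    "Life": (12.50, 5.55, 2.37),
    "Ghost in the Shell": (18.67, 7.31, 2.46),
}
PRODUCTION_BUDGET = {
    "Logan": 97.0,
    "Kong: Skull Island": 185.0,
    "Beauty and the Beast": 160.0,
    "Power Rangers": 100.0,
    "Life": 58.0,
    "Ghost in the Shell": 110.0,
}


def drop_off_percent(previous, current):
    """Percentage decrease from one weekend's revenue to the next."""
    return 100.0 * (previous - current) / previous


def weekend_drop_offs(revenues):
    """Drop-off percentages between consecutive weekends."""
    return [drop_off_percent(a, b) for a, b in zip(revenues, revenues[1:])]


def opened_above_threshold(title, fraction=0.25):
    """Assumed check: opening weekend revenue >= fraction of the budget."""
    return WEEKEND_REVENUE[title][0] >= fraction * PRODUCTION_BUDGET[title]


# Smallest drop-off over all movies and all consecutive weekend pairs.
lowest_drop, lowest_title = min(
    (drop, title)
    for title, revenues in WEEKEND_REVENUE.items()
    for drop in weekend_drop_offs(revenues)
)
below_threshold = [t for t in WEEKEND_REVENUE if not opened_above_threshold(t)]

print(f"Lowest drop-off: {lowest_drop:.1f}% ({lowest_title})")
print("Opened below 25% of budget:", below_threshold)
```

With the Table 2 values, the smallest drop-off is the 47.3% decline of “Kong: Skull Island” from its second to its third weekend, and under the assumed threshold only “Life” and “Ghost in the Shell” opened below 25% of their budgets.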
Figure 4: Total numbers of tweets and total revenue from the release date to subsequent 16 days of six movies.
Figure 5: Percentages of positive tweets from 14 days before and 16 days after the release of six movies.
Figure 6: Percentages of negative tweets from 14 days before and 16 days after the release of six movies.
Figure 7: Two-dimensional projections of three-dimensional feature-graphs for 6th degree polynomial regression function.
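To make the polynomial regression of Section 3.3 concrete, the following is a minimal NumPy sketch of fitting a degree-n polynomial to three features and choosing n by mean squared error. It is not the authors' implementation: the function names are hypothetical, the data are synthetic, the features are assumed to be scaled to [0, 1], and using a held-out validation split as the guard against over-fitting is an assumption, since the paper does not state how over-fitting was assessed.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_features(X, degree):
    """Expand each row of X into all monomials of total degree <= degree."""
    X = np.asarray(X, dtype=float)
    columns = [np.ones(len(X))]  # constant term a_0
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def fit(X, y, degree):
    """Least-squares coefficients of a degree-n polynomial in the features."""
    coeffs, *_ = np.linalg.lstsq(polynomial_features(X, degree), y, rcond=None)
    return coeffs


def mse(X, y, coeffs, degree):
    """Mean squared error of the fitted polynomial on (X, y)."""
    residual = polynomial_features(X, degree) @ coeffs - y
    return float(np.mean(residual ** 2))


def choose_degree(X_train, y_train, X_val, y_val, max_degree):
    """Pick the degree with the lowest MSE on held-out data (assumed criterion)."""
    scores = {d: mse(X_val, y_val, fit(X_train, y_train, d), d)
              for d in range(1, max_degree + 1)}
    return min(scores, key=scores.get), scores


# Synthetic stand-ins for (tweet count, % positive, % negative), scaled to [0, 1],
# with a made-up quadratic relationship to "revenue".
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 - X[:, 2]

best, scores = choose_degree(X[:40], y[:40], X[40:], y[40:], max_degree=3)
print("validation MSE per degree:", scores)
```

On this noise-free toy data a linear model cannot capture the squared term, whereas degree 2 fits essentially exactly. On real tweet features the validation error would typically fall and then rise again as the degree grows, which is the trade-off the degree choice is meant to balance.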
• Our system gathers the following age demographic information (e.g., based on the Motion Picture Association of America (MPAA) film rating system or the Canadian Home Video Rating System (CHVRS)) from sources like the Internet Movie Database (IMDb)†††:
  o Movies like “Logan” and “Life” are rated R, which are restricted to viewers of 18 years of age and older;
  o Movies like “Kong: Skull Island”, “Power Rangers”, and “Ghost in the Shell” are rated PG-13, in which some scenes may be inappropriate for children under 13 years old and parental guidance is strongly advised; and
  o Movies like “Beauty and the Beast” are rated PG, in which some scenes may not be suitable for children and parental guidance is suggested.
• These film ratings dictate the age demographic that is allowed to watch the movie. For example, “Beauty and the Beast” had a certain percentage (say, 22%) of the crowd occupied by children under 12 for its opening weekend, and this demographic would not be allowed to watch a movie such as “Logan”.
• The Twitter age demographic also plays a role in the movies that end up trending on the platform itself.
Recall from Section 3.2 that, in addition to providing users with temporal charts on tweet counts, our social media mining system also provides users with temporal charts on the percentage of positive and negative tweets for each movie. These charts help reveal the association between the positive (or negative) sentiment about a movie during pre-release and post-release periods. Figures 5 and 6 show percentages of positive and negative tweets, respectively, of all six movies from Day -14 (i.e., a 14-day pre-release period) to Day 16 (i.e., a 16-day post-release period) with Day 0 being the release day. The following are some observations from Figures 5 and 6:
• All movies had more positive sentiment than negative sentiment during both the pre-release and the post-release periods.
• Major changes in the direction of both positive and negative curves occurred every weekend after the movie’s release.
• Movies “Kong: Skull Island”, “Power Rangers”, and “Ghost in the Shell” dived down in positive sentiment even in their opening weekends.
• For negative sentiment, all movies spiked up during the opening weekend, with the strongest spike being that of the movie “Life”, which also underperformed during its opening week.
††† http://www.imdb.com/
‡‡‡ http://www.boxofficemojo.com
The temporal charts on the percentage of positive and negative tweets also help determine whether sentiment leading up to a movie affects the opening-week revenue, and its effect on subsequent weekends. Knowing that a movie’s opening weekend usually accounts for 25% of its domestic revenue, our social media mining system uses this metric to check the sentiment for the weekend of release so as to see whether a movie was a success or not. The following are some observations from Figures 5 and 6, together with Table 2 that shows the revenue and budget information (which can be obtained from websites like Box Office Mojo‡‡‡):
• With positive sentiment, movies “Kong: Skull Island” and “Power Rangers” performed successfully, in terms of revenue, at the opening weekend of the box office, whereas movie “Ghost in the Shell” underperformed during the opening weekend.
• With negative sentiment, movies “Kong: Skull Island”, “Life”, and “Ghost in the Shell” suffered from heavy
drop-offs in revenue in the weekends subsequent to the opening weekend.
• Movies “Kong: Skull Island” and “Ghost in the Shell” also had the lowest percentage of positive tweets among the six movies used in the evaluation, whereas movie “Life” had the highest percentage of negative tweets.
To further analyze tweets, our social media mining system applies polynomial regression to social media data. In other words, it tries to fit the data using an nth degree polynomial function and computes the mean squared error (MSE). Recall from Section 3.3 that our system chooses a polynomial degree n such that the results lead to a low MSE while not likely to over-fit the data. With the 30GB worth of tweets as input data, our social media mining system fits the data using a 6th degree polynomial regression function, which provides a reasonably low MSE of about 13% while not over-fitting the data. Some 2-dimensional projections of the 3-dimensional feature graph for this 6th degree polynomial regression function are shown in Figure 7.

5 CONCLUSIONS
In this paper, we presented a social media mining system that mines social media platforms such as Twitter so as to predict box office revenue of movies. Specifically, our system takes into account the number of tweets per day, as well as percentages of positive and negative tweets (based on sentiment analysis on those tweets). With these three features, our system builds a polynomial regression model to predict the expected box office revenue. Evaluation on 30GB worth of Twitter tweets shows the effectiveness of our system in mining social media and predicting box office revenue. As ongoing work, we plan to explore other features (e.g., “likes”, trailer engagement [27-30]) to further enhance the predictive power of our system and to conduct a more exhaustive evaluation of our social media mining system.

ACKNOWLEDGMENTS
This work is partially supported by Natural Sciences and Engineering Research Council of Canada (NSERC) and University of Manitoba.

REFERENCES
[1] C.K. Leung. 2018. Big data analysis and mining. Encyclopedia of Information Science and Technology, 4th ed., vol. I, 338-348. DOI: 10.4018/978-1-5225-2255-3.ch030
[2] C.K. Leung, F. Jiang, T.W. Poon, and P.-E. Crevier. 2018. Big data analytics of social network data: who cares most about you on Facebook? In Highlighting the Importance of Big Data Management and Analysis for Various Applications, 1-15. DOI: 10.1007/978-3-319-60255-4_1
[3] A. Cuzzocrea, G. Psaila, and M. Toccu. 2016. An innovative framework for effectively and efficiently supporting big data analytics over geo-located mobile social media. In Proceedings of the IDEAS 2016, 62-69.
[4] K. Kurzhals, M. John, F. Heimerl, P. Kuznecov, and D. Weiskopf. 2016. Visual movie analytics. IEEE Transactions on Multimedia 18(11), 2149-2160.
[5] C.K. Leung, F. Jiang, and Y. Hayduk. 2011. A landmark-model based system for mining frequent patterns from uncertain data streams. In Proceedings of the IDEAS 2011, 249-250.
[6] C.K. Leung, F. Jiang, L. Sun, and Y. Wang. 2012. A constrained frequent pattern mining system for handling aggregate constraints. In Proceedings of the IDEAS 2012, 14-23.
[7] C.K. Leung, S.K. Tanbeer, B.P. Budhia, and L.C. Zacharias. 2012. Mining probabilistic datasets vertically. In Proceedings of the IDEAS 2012, 199-204.
[8] B. Hao, C.K. Leung, S. Camorlinga, M.H. Reed, M.K. Bunge, J. Wrogemann, and R.J. Higgins. 2008. A computer-aided change detection system for paediatric acute intracranial haemorrhage. In Proceedings of the C3S2E 2008, 109-111.
[9] M.A.F. Mateo and C.K. Leung. 2008. Design and development of a prototype system for detecting abnormal weather observations. In Proceedings of the C3S2E 2008, 45-59.
[10] P. Braun, A. Cuzzocrea, T.D. Keding, C.K. Leung, A.G.M. Pazdor, and D. Sayson. 2017. Game data mining: clustering and visualization of online game data in cyber-physical worlds. Procedia Computer Science.
[11] R.C. Lee, A. Cuzzocrea, W. Lee, and C.K. Leung. 2017. Majority voting mechanism in interactive social network clustering. ACM WISM 2017 (12).
[12] S.D. Bernhard, C.K. Leung, V.J. Reimer, and J. Westlake. 2016. Clickstream prediction using sequential stream mining techniques with Markov chains. In Proceedings of the IDEAS 2016, 24-33.
[13] P. Braun, A. Cuzzocrea, L.M.V. Doan, S. Kim, C.K. Leung, J.F.A. Matundan, and R.R. Singh. 2017. Enhanced prediction of user-preferred YouTube videos based on cleaned viewing pattern history. Procedia Computer Science.
[14] N.K. Chowdhury and C.K. Leung. 2011. Improved travel time prediction algorithms for intelligent transportation systems. In Proceedings of the KES 2011, Part II, 355-365. W. Lee, J.J. Song, and C.K. Leung. 2011. Categorical data skyline using classification tree. In Proceedings of the APWeb 2011, 181-187.
[15] C.K. Leung and K.W. Joseph. 2014. Sports data mining: predicting results for the college football games. Procedia Computer Science 35, 710-719.
[16] C.K. Leung, R.K. MacKinnon, and Y. Wang. 2014. A machine learning approach for stock price prediction. In Proceedings of the IDEAS 2014, 274-277.
[17] R.K. MacKinnon and C.K. Leung. 2015. Stock price prediction in undirected graphs using a structural support vector machine. In Proceedings of the IEEE/WIC/ACM WI-IAT 2015, vol. 1, 548-555.
[18] D.M. Pennock, S. Lawrence, C.L. Giles, and F.A. Nielsen. 2001. The real power of artificial markets. Science 291(5506), 987-988.
[19] K.-Y. Chen, L.R. Fine, and B.A. Huberman. 2003. Predicting the future. Information Systems Frontiers 5(1), 47-61.
[20] J. Duan, X. Ding, and T. Liu. 2015. A Gaussian copula regression model for movie box-office revenue prediction with social media. In Proceedings of the SMP 2015, 28-37.
[21] Z. Guo, X. Zhang, and Y. Hou. 2015. Predicting box office receipts of movies with pruned random forest. In Proceedings of the ICONIP 2015, vol. 1, 55-62.
[22] M. Hur, P. Kang, and S. Cho. 2016. Box-office forecasting based on sentiments of movie reviews and independent subspace method. Information Sciences 372, 608-624.
[23] J. Lee, S. Jung, and J. Park. 2017. The role of entropy of review text sentiments on online WOM and movie box office sales. Electronic Commerce Research and Applications 22, 42-52.
[24] T.G. Rhee and F.H. Zulkernine. 2016. Predicting movie box office profitability: a neural network approach. In Proceedings of the ICMLA 2016, 665-670.
[25] M. Joshi, D. Das, K. Gimpel, and N.A. Smith. 2010. Movie reviews and revenues: an experiment in text regression. In Proceedings of the HLT 2010, 293-296.
[26] R. Sharda and D. Delen. 2006. Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications 30, 243-254.
[27] C. Ding, H.K. Cheng, Y. Duan, and Y. Jin. 2017. The power of the “like” button: the impact of social media on box office. Decision Support Systems 94, 77-84.
[28] T. Liu, X. Ding, Y. Chen, H. Chen, and M. Guo. 2016. Predicting movie box-office revenues by exploiting large-scale social media content. Multimedia Tools and Applications 75(3), 1509-1528.
[29] C. Oh, Y. Roumani, J.K. Nwankpa, and H. Hu. 2017. Beyond likes and tweets: consumer engagement behavior and movie box office in social media. Information & Management 54(1), 25-37.
[30] S. Oh, J.H. Ahn, and H. Baek. 2015. Viewer engagement in movie trailers and box office revenue. In Proceedings of the HICSS 2015, 1724-1732.
