aghababaei2016
Abstract—Social media provides increasing opportunities for users to voluntarily share their thoughts and concerns in a large volume of data. While user-generated data from each individual may not provide considerable information, when combined, they include hidden variables, which may convey significant events. In this paper, we pursue the question of whether social media context can provide socio-behavioral “signals” for crime prediction. The hypothesis is that publicly available crowd data in social media, in particular Twitter, may include predictive variables, which can indicate the changes in crime rates. We developed a model for crime trend prediction where the objective is to employ Twitter content to identify whether crime rates have dropped or increased for the prospective time frame. We also present a Twitter sampling model to collect historical data to avoid missing data over time. The prediction model was evaluated for different cities in the United States. The experiments revealed the correlation between features extracted from the content and crime rate directions. Overall, the study provides insight into the correlation of social content and crime trends as well as the impact of social data in providing predictive indicators.

1. Introduction

Conventional crime prediction methods rely on historical socio-economic indexes and demographic information. This information is collected from the areas of concentrated crime known as hot-spot maps and is applied to predict the distributions of different crime types. However, there are some arguments that hot-spot maps do not indicate with certainty the concentration of all crime types [1]. As an example, in taxicab robberies, the victims are spread across different locations, so maps of specific streets or neighborhoods are not representative of high-crime regions [1]. Another major issue is the lack of data for prediction models. Conventional methods depend on the availability of historical criminal records for the locations of interest and cannot be easily generalized to other locations. The main drawback is that these methods focus only on historical crime records while ignoring the socio-behavioral data of the community.

As an ever-growing number of users share their thoughts, concerns, and feelings on social media, their user-generated content includes valuable signals as socio-behavioral factors, which convey significant information and can be useful for many commercial purposes or security applications. In this paper, we study how information captured from Twitter provides socio-behavioral signals to predict crime rate directions as “crime trends”. A future trend is predicted based on information observed from the content of previously posted tweets. In fact, content is not the cause of crime, but includes signals to predict future incidents.

We propose a prediction model where trend prediction is converted to a classification problem. In order to tackle the lack of data in supervised learning, our prediction model addresses automatic data annotation. In this model, learning examples are aggregated tweets with different smoothing windows. The examples are labeled with the knowledge inferred from the problem (in our case, crime trends) in a prospective time frame. In fact, the concept of data annotation is similar to other labeling approaches such as the classic lexicon-based approach. In this approach, polarities or strengths are inferred based on a set of dictionaries. Thus inspired, in our prediction model we infer labels based on objective trends. The collective content of individual users is labeled positive or negative if the trend goes up or down, respectively, in the prospective time frame.

A Twitter sampling method is also proposed to retrieve a selection of historical tweets for the prediction model. Based on the nature of our problem (trend prediction), the method is intended to retrieve a sufficient number of tweets over a considerable period of time (avoiding missing posts over time), accessing a good set of representative users who actively share content on a daily basis. Although our targeted problem is crime trend prediction, we did not apply any topic-based sampling [2], where specific keywords or hashtags are applied to collect tweets. This group of sampling techniques limits the study to the content of shared topics, which does not scale to predicting different types of crime. In fact, the prediction model is not concerned with any specific topics that result from previous incidents, but draws attention to any signals extracted from the content that are predictive of crime rate directions.

The remainder of the paper is structured as follows: the next section provides a review of crime index prediction as well as the role of social media. Section 3 describes our dataset and Twitter sampling method. Section 4 explains the prediction model, while Section 5 discusses our experi-
or hashtags are applied to collect tweets through Twitter API. This group of sampling techniques limits the study to the content of shared topics, which does not scale to predicting different crime types. In fact, the prediction model is not concerned with any specific topics, but draws attention to any signals extracted from content that are predictive of crime trends. The second group focuses on sampling a subset of users from their networks. The drawback of the latter approach is that the availability of user posts is not tracked over time. In fact, there is no guarantee that sampled users are active on a daily basis, which is necessary for temporal prediction models.

We propose a sampling method to retrieve a subset of historical tweets for the prediction model. Based on the nature of our problem (trend prediction), we address two main characteristics in our sampling model: retrieving a sufficient number of tweets over a desired period of time (avoiding missing posts over time), and accessing the best representation of active users who share content on a daily basis. In this method, the interest is to find a set of active users while being unbiased toward individuals with a very high or low number of tweets.

We applied the Streaming API to access Twitter’s stream of tweets for specific locations using the “location” parameter, which selects any tweets (geo-tagged and non-geo-tagged) coming from the specified cities. For each tweet, the user profile of its author is retrieved, which includes elements such as statuses_count, created_at, followers_count, and following_count. For each user, two main specifications are calculated:

1) The number of days a user is active (days). We calculate the number of days from the time the user’s profile was generated (created_at) until the current time (time_now) as follows:

$$\mathit{days} = \mathit{time\_now} - \mathit{created\_at} \tag{1}$$

A longer period of activity is a primary criterion for the selection. As we track the content of users over time, users who recently became members are ignored.

2) The average number of tweets per day (tweets_day). As this parameter cannot be retrieved directly, we derive it from the total number of tweets and the number of days a user is active:

$$\mathit{tweets\_day} = \mathit{statuses\_count} / \mathit{days} \tag{2}$$

Users are considered active if they have a high number of active days (days) as well as tweets per day (tweets_day). Active users are classified using followers_count to filter out accounts belonging to celebrities, news agencies, or major companies. Finally, selected users are fed to the REST API to collect their historical timelines. Overall, we collected approximately 29M, 22M, 37M, and 13M tweets from Chicago, Houston, Philadelphia, and San Francisco, respectively. The historical timelines of the selected users were restricted to the same time frame as the crime rates, between January 2014 and September 2014.

4. Prediction Model

Crime index prediction, similar to any non-deterministic signal prediction, such as stock price, is a difficult if not impossible task. For example, it seems impossible to predict that 25 incidents of homicide will occur within 24 hours. On the other hand, the question “what direction may the crime trend take tomorrow” leads us to some extent to a possible answer. What we mean by “direction” or “trend” is the sign of the change in the signal at $t(i)$ compared to some reference such as $t(i - d)$. A positive change means that the signal has a rising trend, while a negative change has the opposite meaning.

4.1. General Structure

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of temporal examples, or in general temporal data, each of which is defined as a state in time. The state is represented by a vector of features $x_i = (f_1, f_2, \ldots, f_{|V|})$ where $V$ is the global vocabulary. Since each state $x_i$ is sampled at time $t(i)$, then $X = \bigcup_{i=1}^{n} x_i$ is the result of $n$ consecutive samplings. One important pre-processing task in time-series data is smoothing, to increase predictability and to reduce noise and outliers. Hypothetically, temporal data, which is high-dimensional time-series data, can also be smoothed. In our model, each state is represented by a document, and a naive smoothing is a rolling averaging algorithm over the temporal documents:

$$z_i = \frac{1}{q} \sum_{j=1}^{q} x_{j-q+1}, \quad Z = \bigcup_{i=1}^{n} z_i, \quad q = [1, n] \tag{3}$$

where $q$ is the size of the aggregation window and $x$ is an example in $t(i)$, or in our case the day $i$, which is represented by a single document. All the relevant tweets are aggregated into a single document without targeted filtering. As a result, $X$ is an $n \times |V|$ document-term matrix. The vocabulary $V$ is simply the set of all distinct words that appear in all collected, relevant tweets. Although no keyword search is conducted, blind filtering, including stopword reduction and low-frequency term reduction, is applied to the vocabulary. As a result, $z_i$ is defined as the average of a set of documents from day $j$ back to day $j - q + 1$, retrospectively.

In our prediction model, the objective is to transform a prediction problem into a supervised classification task. Let $Y = \{y_1, y_2, \ldots, y_n\}$ be the target time series whose future values are to be predicted. The time series $Y$ is sampled in time steps $t(i)$, $1 \le i \le n$. To convert regression-based prediction into classification, the continuous signal $Y$ has to be mapped into a categorical set, which is called the set of labels. There are several techniques to infer labels from a continuous variable, such as quantization or direction of changes in rate. Due to the nature of the research, we adopt trend analysis of the continuous rates for labeling as follows:

$$l_i = \operatorname{sgn}(y_{i+d} - y_i), \quad \text{if } \begin{cases} d > 0: & \text{lag} \\ d \le 0: & \text{lead} \end{cases} \tag{4}$$
where $d$ is the lead or lag between the current state ($z_i$) and the target label, $l_i$ is the label at $t = i$, and $L$ is the sequence of labels in $n$ consecutive time steps. After inferring labels, a set of annotated examples is generated by associating high-dimensional temporal data with one-dimensional target labels inferred from the time series of interest: $\forall z_i \in Z,\ z_i \to l_i$. In total, $n - d$ training examples of the form $\{(z_1, l_1), \ldots, (z_{n-d}, l_{n-d})\}$ are generated.

4.2. Crime Prediction

The objective of the proposed method is to predict whether crime rates increased or decreased for the prospective time frame. Therefore, a set of training data ($D$) is given to a classifier as follows:

$$D = \{(z_i, l_i) \mid z_i \in Z,\ l_i \in \{-1, 1\} \text{ or } l_i = y_{i+d}\}, \quad 1 \le i \le n - d \tag{5}$$

where in our target problem $z_i$ and $l_i$ are defined as follows:

Aggregated tweets at time slice $i$ ($z_i$): All tweets which have been posted at time slice $i$ (for instance day $i$) are aggregated as a single document ($z_i$). Several preprocessing tasks such as low-frequency term reduction, stopword removal, and stemming may be applied to $z_i$. In our model every $z_i$ is a vector of terms (features) under a unigram model, without filtering for any specific keywords. One might speculate that we must collect keywords to emphasize offensive language implying a rough context. Nevertheless, content is rich data which contains valuable hidden variables including activities, topics of discussion, public interests, and sentiments, which are not necessarily carried by offensive language.

Class label at time slice $i$ ($l_i$): It is derived in one of the following ways:

1) Index of future crime rates ($y_{i+d}$), where $d$ is the lag between content and crime trends.

2) The changes in crime indexes when comparing the current index ($i$) with the index of ($i + d$):

$$l_i = \begin{cases} 1 & \text{if } \mathrm{rate}(i) < \mathrm{rate}(i + d) \\ -1 & \text{otherwise} \end{cases} \tag{6}$$

3) The changes in crime index when comparing the index of ($i + d$) with the mean of all rates:

$$l_i = \begin{cases} 1 & \text{if } \mathrm{rate}(i + d) > \mathrm{mean} \\ -1 & \text{otherwise} \end{cases} \tag{7}$$

where $\mathrm{rate}(i)$ and $\mathrm{rate}(i + d)$ are the crime indexes at $i$ and $i + d$ according to our historical data.

5. Experimental Results and Discussion

A set of comprehensive experiments was conducted to evaluate the performance of our prediction model on different locations and crime types. The experimental results are presented based on the contribution of different labeling approaches. The predictability of different crime types for each city is examined using the labeling approaches of Equations 6 and 7, where d = [1, 7].

The classifier is linearSVC, and filtering including stopword reduction and low-frequency term reduction was applied to the vocabulary. Documents were given a binary representation, which in our model performed better than tf-idf. We applied rolling origin [22] as the common method for training and evaluating the performance of the model on series observations. If $k$ denotes the index of the last known observation, from which the forecast is started, then the following steps represent the evaluation method:

• Train the model using observations at times 1, 2, ..., k; the observation at time k + 1 is selected as the test data.
• Compute the error on the prediction for time k + 1.
• Repeat the steps with the training set moved one document forward (the first k + 1 documents) and test on document k + 2. This process is continued until all the test data is classified.
• Compute F-measure based on the errors.

To examine the performance of predicting directions of the indexes, document $z_i$, which was generated at time $t_i$, is labeled with crime trend $l_i$ (see Equation 4) in two different approaches. In the first approach, documents are annotated positive or negative if the future index increased or decreased in the prospective time frame, respectively. In fact, documents are labeled based on the “trend” in the future. The second approach considers a global threshold (“mean”) and annotates documents positive or negative if the index in the prospective time frame is greater or less than the mean value of the overall indexes. Figure 1 illustrates macro-averaged F-measure for different lags (d = [1, 7]). The intention is to understand the best lag between content and crime trends. The results indicate that lags 1 and 7 achieved the lowest predictability for most of the crime types. Table 1 shows the best results obtained over different lags for the labeling approaches based on mean and trend. Overall, the predictability is higher for Philadelphia, Houston, and Chicago compared to San Francisco (see Figure 1). The results indicate that the proposed prediction model reveals satisfactory performance for most crime types, such as Theft, Burglary, and Sex offenses, with F-measure up to 0.83. However, for some types of crime such as Murder and Vandalism, the approach achieved the lowest results compared to the other crimes. This can be explained by the nature of the incidents. In fact, some crimes such as Battery, Narcotics, and Prostitution are mostly street incidents and might be reflected in daily social conversation, while the others have a more organized nature. Although predicting the exact index remains a challenge, content can be employed to predict the directions of future crime indexes, which has an important application for law enforcement decision makers.
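
The annotation step of Sections 4.1 and 4.2 (smooth the daily documents, then attach a label derived from the crime index) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`smooth`, `trend_labels`, `mean_labels`, `annotate`) and the dictionary-based document representation are assumptions made for illustration, and the smoothing window is read here as the current day plus the q − 1 days before it, truncated at the start of the series.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, float]  # hypothetical representation: term -> weight for one day


def smooth(docs: List[Doc], q: int) -> List[Doc]:
    """Rolling average over the current day and the q - 1 days before it."""
    if q < 1:
        raise ValueError("q must be at least 1")
    smoothed = []
    for i in range(len(docs)):
        window = docs[max(0, i - q + 1): i + 1]  # shorter near the series start
        terms = set().union(*window)
        smoothed.append(
            {t: sum(d.get(t, 0.0) for d in window) / len(window) for t in terms}
        )
    return smoothed


def trend_labels(rates: List[float], d: int) -> List[int]:
    """Equation 6: +1 if the rate at i + d is higher than at i, else -1."""
    return [1 if rates[i] < rates[i + d] else -1 for i in range(len(rates) - d)]


def mean_labels(rates: List[float], d: int) -> List[int]:
    """Equation 7: +1 if the rate at i + d exceeds the overall mean, else -1."""
    mean = sum(rates) / len(rates)
    return [1 if rates[i + d] > mean else -1 for i in range(len(rates) - d)]


def annotate(
    docs: List[Doc],
    rates: List[float],
    q: int,
    d: int,
    labeler: Callable[[List[float], int], List[int]] = trend_labels,
) -> List[Tuple[Doc, int]]:
    """Pair each smoothed document z_i with its label l_i (n - d examples)."""
    if len(docs) != len(rates):
        raise ValueError("docs and rates must cover the same days")
    labels = labeler(rates, d)
    return list(zip(smooth(docs, q)[: len(labels)], labels))
```

For example, `annotate(docs, rates, q=2, d=1)` would yield one example per day except the last, labeled by trend; passing `labeler=mean_labels` switches to the mean-threshold labeling.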
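
The rolling-origin evaluation steps listed in Section 5 can be sketched as follows. The paper's classifier is linearSVC; to keep this sketch free of external dependencies, a majority-class learner stands in for it, which is purely an assumption for illustration. The names `majority_classifier`, `rolling_origin`, and `macro_f1` are likewise hypothetical, and any learner that maps a training list to a prediction function could be passed as `fit`.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})
Model = Callable[[Sequence[float]], int]


def majority_classifier(train: List[Example]) -> Model:
    """Stand-in learner: always predicts the most frequent training label."""
    positives = sum(1 for _, label in train if label == 1)
    guess = 1 if positives * 2 >= len(train) else -1
    return lambda features: guess


def rolling_origin(
    examples: List[Example],
    k: int,
    fit: Callable[[List[Example]], Model] = majority_classifier,
) -> Tuple[List[int], List[int]]:
    """Train on the first k examples, test on the next, then move the origin."""
    truth, predicted = [], []
    while k < len(examples):
        model = fit(examples[:k])       # observations 1..k
        features, label = examples[k]   # observation k + 1
        truth.append(label)
        predicted.append(model(features))
        k += 1                          # grow the training set by one document
    return truth, predicted


def macro_f1(truth: List[int], predicted: List[int], classes=(-1, 1)) -> float:
    """Unweighted mean of the per-class F-measures."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(truth, predicted) if t == c and p == c)
        fp = sum(1 for t, p in zip(truth, predicted) if t != c and p == c)
        fn = sum(1 for t, p in zip(truth, predicted) if t == c and p != c)
        scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return sum(scores) / len(scores)
```

Because every test document is predicted by a model that has only seen earlier documents, the procedure respects the temporal order of the series, which an ordinary shuffled cross-validation would not.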
Figure 1: The performance (F-measure) of the prediction model for different lags.
TABLE 1: F-measure of prediction model for labeling based on “trend” and “mean”.

(a) Chicago
Crime type            mean   trend
All                   0.71   0.62
Battery               0.67   0.63
Burglary              0.5    0.63
Criminal damage       0.68   0.6
Deceptive practice    0.53   0.73
Narcotics             0.45   0.69
Prostitution          0.7    0.61
Weapon violation      0.65   0.67

(b) Houston
Crime type            mean   trend
All                   0.53   0.53
Assault               0.71   0.77
Auto theft            0.71   0.56
Burglary              0.71   0.71
Murder                0.59   0.58
Rape                  0.55   0.59
Robbery               0.55   0.6
Theft                 0.55   0.63

(c) Philadelphia
Crime type            mean   trend
All                   0.63   0.7
Assault               0.61   0.71
Burglary              0.61   0.81
Criminal mischief     0.67   0.69
Drug violation        0.44   0.78
Fraud                 0.82   0.73
Prostitution          0.75   0.7
Thefts                0.58   0.83

(d) San Francisco
Crime type            mean   trend
All                   0.52   0.58
Assault               0.51   0.57
Juvenile crime        0.54   0.59
Narcotics             0.53   0.63
Prostitution          0.57   0.54
Robbery               0.7    0.67
Theft                 0.67   0.63
Vandalism             0.52   0.61

6. Concluding Remarks

In this paper, a model for crime prediction is presented based on mining posted tweets from a relevant geographic area. The proposed method does not need any previously available training data. In fact, the proposed prediction model generates its own training data. In this model, labels are derived from target signals (here, crime indexes) and are then assigned to the input data. We also present trend prediction by reducing the model to a binary classification. The classifier predicts, given input data, whether the crime index will go up or down in the future. In order to evaluate the predictability of user-generated content in social media, no keywords or specific terms related to crime are targeted.

A comprehensive experiment was conducted in different cities of the United States to predict crime trends. Overall, the results indicate that there is a relationship between previously posted tweets and crime rates in the prospective time frame. Despite using only Twitter content, the correlation between content and crime trends has been revealed. However, the results pinpoint the dependency of the prediction on the nature of the crime type. Some crime types, such as Burglary and Sex Offenses, have a high correlation with the shared content.

In the future, we would like to semantically analyze the textual content for a better understanding of the relation between features. Although in this study we were interested in presenting the effectiveness of a content-based model, further analysis is needed to examine the incorporation of other socio-economic indexes and geographical information, which correlate with criminal activities. These auxiliary resources, which are shown to have correlation with different incidents, can enhance the performance of the prediction model.

References

[1] J. Eck, S. Chainey, J. Cameron, and R. Wilson, “Mapping crime: Understanding hotspots,” 2005.
[2] C. Gerlitz and B. Rieder, “Mining one percent of twitter: collections, baselines, sampling,” M/C Journal, vol. 16, no. 2, 2013.
[3] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th international conference on World wide web. ACM, 2010, pp. 851–860.
[4] J. Weng and B.-S. Lee, “Event detection in twitter.” ICWSM, vol. 11, pp. 401–408, 2011.
[5] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu, “Twitter improves seasonal influenza prediction.” in HEALTHINF, 2012, pp. 61–70.
[6] S. Chainey, L. Tompson, and S. Uhlig, “The utility of hotspot mapping for predicting spatial patterns of crime,” Security Journal, vol. 21, no. 1, pp. 4–28, 2008.
[7] X. Wang and D. E. Brown, “The spatio-temporal modeling for criminal incidents,” Security Informatics, vol. 1, no. 1, pp. 1–17, 2012.
[8] G. O. Mohler, M. B. Short, P. J. Brantingham, F. P. Schoenberg, and G. E. Tita, “Self-exciting point process modeling of crime,” Journal of the American Statistical Association, vol. 106, no. 493, 2011.
[9] G. E. Tita and A. Boessen, “Social networks and the ecology of crime: using social network data to understand the spatial distribution of crime,” The SAGE Handbook of Criminological Research Methods, p. 128, 2011.
[10] A. B. George E. Tita, “9 social networks and the ecology of crime: Using social network data to understand the spatial distribution of crime,” pp. 128–143, 2012.
[11] J. R. Hipp, C. T. Butts, R. Acton, N. N. Nagle, and A. Boessen, “Extrapolative simulation of neighborhood networks based on population spatial distribution: Do they predict crime?” Social Networks, vol. 35, no. 4, pp. 614–625, 2013.
[12] A. Bogomolov, B. Lepri, J. Staiano, N. Oliver, F. Pianesi, and A. Pentland, “Once upon a crime: Towards crime prediction from demographics and mobile data,” in Proceedings of the 16th International Conference on Multimodal Interaction. ACM, 2014, pp. 427–434.
[13] M. Traunmueller, G. Quattrone, and L. Capra, “Mining mobile phone data to investigate urban crime theories at scale,” in Social Informatics. Springer, 2014, pp. 396–411.
[14] J. Q. Wilson and R. J. Herrnstein, Crime & Human Nature: The Definitive Study of the Causes of Crime. Simon and Schuster, 1998.
[15] S. Aghababaei and M. Makrehchi, “Temporal topic inference for trend prediction,” in 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015, pp. 877–884.
[16] N. Malleson and M. A. Andresen, “The impact of using social media data in crime rate calculations: shifting hot spots and changing spatial patterns,” Cartography and Geographic Information Science, vol. 42, no. 2, pp. 112–121, 2015.
[17] X. Wang, M. S. Gerber, and D. E. Brown, “Automatic crime prediction using events extracted from twitter posts,” in Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 2012, pp. 231–238.
[18] M. S. Gerber, “Predicting crime using twitter and kernel density estimation,” Decision Support Systems, vol. 61, pp. 115–125, 2014.
[19] X. Chen, Y. Cho, and S. Y. Jang, “Crime prediction using twitter sentiment and weather,” in Systems and Information Engineering Design Symposium (SIEDS), 2015. IEEE, 2015, pp. 63–68.
[20] S. Ghosh, M. B. Zafar, P. Bhattacharya, N. Sharma, N. Ganguly, and K. Gummadi, “On sampling the wisdom of crowds: Random vs. expert sampling of the twitter stream,” in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2013, pp. 1739–1744.
[21] K. White, G. Li, and N. Japkowicz, “Sampling online social networks using coupling from the past,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 266–272.
[22] E. Zivot and J. Wang, “Rolling analysis of time series,” in Modeling Financial Time Series with S-Plus®. Springer, 2003, pp. 299–346.