Crime Hot-Spots Prediction Using Support Vector Machine
Keivan Kianmehr
Department of Computer Science
University of Calgary
Calgary, Alberta, CANADA
kiamehr@cpsc.ucalgary.ca
Reda Alhajj
Department of Computer Science
University of Calgary
Calgary, Alberta, CANADA
alhajj@cpsc.ucalgary.ca
Abstract
Location prediction is a special case of spatial data mining classification. For instance, in the public safety domain, it may be interesting to predict the location(s) of crime hot spots. In this study, we present a Support Vector Machine (SVM) based approach to predict such locations as an alternative to existing modeling approaches. SVMs form the new generation of machine learning techniques used to find optimal separability between classes within datasets. Experiments on two different spatial datasets show that SVMs give reasonable results.
1 Introduction
Data mining employs algorithms and techniques from statistics, machine learning, artificial intelligence, databases, data warehousing, etc. [1]. Data mining techniques have been successfully applied to analyze spatial data, i.e., data related to objects that occupy space. Spatial data carries topological and/or distance information; it is often organized by spatial index structures and accessed by spatial access methods. These distinct features of a spatial database bring challenges and opportunities for mining knowledge from spatial data [2]. Spatial data mining is a subfield of data mining; it is a process that uses a variety of data analysis tools to discover spatial patterns and relationships in spatial data that may be used to make valid predictions [3, 4]. This has wide applications in Geographic Information Systems (GIS), remote sensing, image database exploration, medical imaging, robot navigation, and other areas where spatial data are used. The main methods for spatial data analysis include spatial association rule extraction, clustering, and classification.
Data objects stored in a database are identified by their attributes. Classification finds a set of rules which determine the class of each classified object according to its attributes; objects with similar attribute values are classified into the same class.
Figure 1. Definition of hyper-plane and margin: circular and square dots are samples of classes -1 and +1, respectively.
2 Support Vector Machines

SVM distinguishes whether each data instance belongs to the positive or the negative class according to the Optimal Separating Hyper-plane. For most real-world problems that appear not to be linearly separable, SVMs can work in combination with the kernel function technique [11], which automatically realizes a non-linear mapping into a feature space. The Optimal Separating Hyper-plane found by SVM in the feature space corresponds to a non-linear decision boundary in the input space [11].
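As a concrete illustration (not from the paper), the following minimal Python sketch uses scikit-learn's SVC to show the effect of a kernel on a dataset that is not linearly separable; the toy dataset and parameter values are illustrative assumptions only.

```python
# Illustrative sketch only: kernel SVM on a non-linearly-separable toy set.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: no separating hyper-plane exists in the input space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A Gaussian (RBF) kernel implicitly maps the points into a feature space
# where an Optimal Separating Hyper-plane does exist; in the input space
# the resulting decision boundary is non-linear.
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```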
Figure 2. Hyper-sphere with the target data described by center c and radius R. Three objects on the boundary are the support vectors; one object x_i lies outside the sphere and has ξ_i > 0.
The smallest hyper-sphere that encloses the target data is found by solving the following optimization problem:

$$\min_{R,\,c,\,\xi}\;\; R^2 + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i \qquad (1)$$

Subject to:

$$(x_i - c)^T (x_i - c) \le R^2 + \xi_i,\qquad \xi_i \ge 0,\quad i \in [1, l]$$

where c and R are the center and radius of the sphere, respectively, T is the transpose, and ν ∈ (0, 1] is the trade-off between the volume of the sphere and the number of training data points rejected. When ν is large, the volume of the sphere is small, so more training points will be rejected. This optimization problem can be solved by the Lagrangian:

$$L(R, \xi, c, a_i, \gamma_i) = R^2 + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l} a_i\left\{R^2 + \xi_i - \left(x_i^2 - 2\,c \cdot x_i + c^2\right)\right\} - \sum_{i=1}^{l}\gamma_i \xi_i \qquad (2)$$

where a_i ≥ 0 and γ_i ≥ 0. Setting to zero the partial derivatives of L with respect to R, ξ_i, and c, we get:

$$\sum_{i=1}^{l} a_i = 1 \qquad (3)$$

$$0 \le a_i \le \frac{1}{\nu l} \qquad (4)$$

$$c = \sum_{i=1}^{l} a_i x_i \qquad (5)$$

Substituting (3)-(5) into (2) gives the dual problem

$$\max_{a}\;\; \sum_{i=1}^{l} a_i (x_i \cdot x_i) - \sum_{i,j} a_i a_j (x_i \cdot x_j)$$

Subject to:

$$0 \le a_i \le \frac{1}{\nu l},\qquad \sum_{i=1}^{l} a_i = 1$$

A second one-class formulation separates the training data from the origin in a feature space [15]. The goal is a decision function f that takes the value +1 on a region S capturing most of the data points and -1 on its complement:

$$f(x) = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \in \bar{S} \end{cases} \qquad (8)$$

where S and its complement are simple subsets of the input space. Let Φ : R^N → F be a non-linear mapping that maps the training data from R^N to a feature space F. To separate the data set from the origin, solve the following primal optimization problem [15]:

$$\text{Minimize}\;\; V(w, \xi, \rho) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \rho \qquad (9)$$

Subject to:

$$(w \cdot \Phi(x_i)) \ge \rho - \xi_i,\qquad \xi_i \ge 0,$$

where ν ∈ (0, 1) is a parameter for controlling the trade-off between the number of outliers and model complexity, and ρ is the margin. Using the following decision function, a label can be assigned to a new given data point x for the classification task:

$$f(x) = \mathrm{sgn}\left(w \cdot \Phi(x) - \rho\right) \qquad (10)$$
The solution can be written as an expansion in terms of the mapped training points, w = Σ_i α_i Φ(x_i), where only a subset of points x_i that are closest to the hyper-plane have nonzero values α_i. These points are called support vectors. Instead of solving the primal optimization problem directly, the following dual problem can be considered:

$$\text{Maximize}\;\; W(\alpha) = -\frac{1}{2}\sum_{i,j}\alpha_i \alpha_j K(x_i, x_j) \qquad (12)$$

Subject to:

$$0 \le \alpha_i \le \frac{1}{\nu l},\qquad \sum_i \alpha_i = 1$$

where K(x_i, x_j) = (Φ(x_i) · Φ(x_j)) are the kernels, i.e., the dot products between mapped pairs of input points. Kernel functions allow more general decision functions when the data points are not linearly separable. The decision function then becomes:

$$f(x) = \mathrm{sgn}\left(\sum_i \alpha_i K(x_i, x) - \rho\right) \qquad (13)$$

In reality the data points are not always spherically distributed, so different types of kernel functions K(x_i, x_j) that satisfy Mercer's conditions can be used. Since two types of these functions are most often used for classification problems, the polynomial and the Gaussian kernel, here we limit ourselves to these two kernel functions.

A polynomial mapping is a popular method for non-linear modeling:

$$K(x, x') = \langle x, x'\rangle^{d} \qquad (14)$$

$$K(x, x') = \left(\langle x, x'\rangle + 1\right)^{d} \qquad (15)$$

The second kernel is usually preferable as it avoids problems with the Hessian becoming zero. Radial basis functions have received significant attention, most commonly with a Gaussian of the form:

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^{2}}{2\sigma^{2}}\right) \qquad (18)$$
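The following minimal Python sketch (not part of the original paper) illustrates the one-class SVM described above using scikit-learn's OneClassSVM, which wraps LIBSVM; the data, nu, and kernel settings are illustrative assumptions.

```python
# Illustrative sketch only: one-class SVM with linear, polynomial and Gaussian kernels.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # "target" data points
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])    # a likely inlier and a likely outlier

for kernel in ("linear", "poly", "rbf"):
    # nu corresponds to the parameter controlling the fraction of rejected
    # training points; gamma and degree are the kernel parameters.
    model = OneClassSVM(kernel=kernel, nu=0.1, gamma="scale", degree=3).fit(X_train)
    # predict() returns the sign of the decision function: +1 inside the
    # learned region, -1 outside, as in Eqs. (10) and (13).
    print(kernel, model.predict(X_new))
```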
3 Experiment Methods

In order to select a representative portion of the crime datasets to be used as the training set by the system, we experiment with the following approach: for a given percentage of the data and a predefined level of crime rate, we select a subset of the crime dataset to label; then, based on the predefined level of crime rate, we assign a class label to each data point in the selected set. Data points whose crime rate is above the predefined rate are positive, i.e., members of the hot-spot class, and data points whose crime rate is below the predefined rate are negative, i.e., non-members of the hot-spot class. This labeled data set is then used as the training set in SVM classification. To select a given percentage of the data to be labeled, we use the k-median clustering algorithm. Then, we compare the result with the case where the same percentage of the data is selected randomly. This leads to the following experimental setups:

1. Random Selection for Labeling + One-Class SVM: a given percentage of the data points is selected at random and labeled. Then, the one-class SVM algorithm uses the labeled set as the training set to build the classifier.

2. Clustering-based Selection for Labeling + One-Class SVM: k-means is used to select a given percentage of the data points to be labeled. Then, the one-class SVM algorithm uses the output labeled set as the training set to build the classifier.

3. Complete Data Set + One-Class SVM: after choosing a certain percentage of the dataset for labeling by random selection or the clustering technique, we label the rest of the dataset as negative samples and add them to the training set. Then, we pass the complete labeled dataset to the one-class SVM as the training set.
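A rough Python sketch of the selection-and-labeling step described in this section is given below (not from the paper); it uses k-means with nearest-to-centroid representatives as a stand-in for the clustering-based selection, and all variable names, the 20% fraction, and the threshold value are illustrative assumptions.

```python
# Illustrative sketch only: clustering-based selection, threshold labeling,
# and one-class SVM training. Names, sizes, and thresholds are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

def select_for_labeling(X, fraction=0.2, random_state=0):
    """Pick roughly `fraction` of the points, one representative per cluster."""
    n_select = max(1, int(fraction * len(X)))
    km = KMeans(n_clusters=n_select, n_init=10, random_state=random_state).fit(X)
    chosen = []
    for k, centre in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == k)[0]
        # Use the point closest to the cluster centre as its representative.
        chosen.append(members[np.argmin(np.linalg.norm(X[members] - centre, axis=1))])
    return np.array(chosen)

def label_hot_spots(rates, rate_threshold):
    """+1 for hot-spot members (rate above threshold), -1 for non-members."""
    return np.where(rates > rate_threshold, 1, -1)

# Synthetic stand-in for a crime dataset: 49 areas, 5 explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 5))
rates = rng.uniform(0.0, 70.0, size=49)

idx = select_for_labeling(X, fraction=0.2)
labels = label_hot_spots(rates[idx], rate_threshold=35.0)
# Train the one-class SVM on the points labeled as hot spots.
clf = OneClassSVM(kernel="rbf", nu=0.1).fit(X[idx][labels == 1])
```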
4 Datasets

To test the different approaches used by our model, we downloaded two published crime datasets from the internet. The datasets consist of the crime rate and related variables for each data point. The location of each data point is described in the dataset by Euclidean coordinates or by latitude and longitude. The datasets were downloaded from [http://www.terraseer.com/]. In this section, we provide a short description of each dataset; for further information, please refer to [http://www.terraseer.com/].

The first dataset is a small crime dataset [22] that records the crime rate and 20 related variables in 49 neighborhoods in Columbus, Ohio, USA (see Figures 4a and 4b). The problem is to distinguish between members and non-members of the crime hot-spot class. The second dataset is a homicide dataset for the St. Louis region (see Figures 4c and 4d).

5 Model Evaluation

Basically, n-fold cross validation is a method in which the data is randomly divided into n disjoint groups [24]. For example, suppose the data is divided into ten groups. The first group is set aside for testing and the other nine are put together for model building. The model built on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside. Finally, a model is built using all the data. The mean of the 10 independent error rate predictions is used as the error rate for this final model. In our study, a five-fold cross validation method has been used to estimate the accuracy of the classification model.
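For illustration (not from the paper), a minimal sketch of the five-fold cross-validation procedure might look as follows, assuming a labeled feature matrix X and label vector y; the data here is synthetic.

```python
# Illustrative sketch only: five-fold cross validation for a classifier.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # stand-in feature matrix
y = np.where(rng.uniform(size=100) > 0.5, 1, -1)   # stand-in labels

error_rates = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])        # build on 4 folds
    error_rates.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # test on held-out fold
print("estimated error rate:", np.mean(error_rates))               # mean of the 5 estimates
```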
Figure 4. (a) Crime data and (b) its rate distribution; (c) Homicide data and (d) its rate distribution.
In the clustering-based data selection approach, we first applied the k-means clustering algorithm to the datasets. This second approach to data selection chooses the data more wisely than the random selection. After labeling the data, we used the resulting set as input to the SVM algorithm. In the one-class SVMs, we applied different kernel functions to see how they would influence the classification accuracy. The results for the four different experiments are shown in Tables 1-4. As can be seen, in each experiment we first evaluate the effect of using different kernel functions for the one-class SVM technique. Here we have chosen the Linear, Polynomial and Gaussian kernel functions. We applied the default LIBSVM parameter values for the Polynomial and Gaussian kernels, and also varied the kernel parameter over a range to see whether the result would improve or not. We performed each step 20 times and present the average of the 20 runs as the final result.
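A minimal sketch of such an experiment loop (not from the paper) is shown below; synthetic data stands in for the labeled crime datasets, and the kernel list, nu value, and the averaging over 20 shuffled five-fold splits are illustrative assumptions.

```python
# Illustrative sketch only: compare kernels for the one-class SVM, repeating
# the evaluation 20 times and averaging, as described above.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(49, 5))                             # stand-in feature matrix
y = np.where(rng.uniform(0.0, 70.0, 49) > 35.0, 1, -1)   # stand-in hot-spot labels

for kernel in ("linear", "poly", "rbf"):                 # Linear, Polynomial, Gaussian
    run_scores = []
    for run in range(20):
        folds = KFold(n_splits=5, shuffle=True, random_state=run)
        clf = OneClassSVM(kernel=kernel, nu=0.1)         # other parameters left at defaults
        run_scores.append(cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean())
    print(f"{kernel}: average accuracy over 20 runs = {100.0 * np.mean(run_scores):.2f}%")
```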
In Tables 1-4, the Linear, Polynomial and Gaussian columns report the one-class SVM classification results for the Columbus and St. Louis datasets.

Table 1.

Data Set  | C     | Linear | Polynomial | Gaussian
Columbus  | 35.13 | 63.00  | 60.50      | 68.50
St. Louis | 51.86 | 58.50  | 57.50      | 62.50
Columbus  | 4.57  | 53.13  | 53.44      | 49.69
St. Louis | 10.58 | 50.63  | 50.94      | 48.44

Table 2.

Data Set  | C     | Linear | Polynomial | Gaussian
Columbus  | 35.13 | 63.89  | 61.11      | 69.44
St. Louis | 51.86 | 78.33  | 76.11      | 65.55
Columbus  | 4.57  | 53.44  | 54.38      | 50.94
St. Louis | 10.58 | 51.25  | 51.56      | 24.38

Table 3.

Data Set  | C     | Linear | Polynomial | Gaussian
Columbus  | 35.13 | 52.45  | 51.43      | 57.04
St. Louis | 51.86 | 51.22  | 51.53      | 55.10
Columbus  | 4.57  | 49.94  | 49.62      | 38.65
St. Louis | 10.58 | 50.39  | 50.06      | 38.46

Table 4. 20% of the dataset selected by clustering and labeled; the rest 80% labeled as negative samples and added to the training set.

Data Set  | C     | Linear | Polynomial | Gaussian
Columbus  | 35.13 | 53.47  | 53.06      | 57.35
St. Louis | 51.86 | 53.37  | 53.37      | 57.76
Columbus  | 4.57  | 51.47  | 51.09      | 39.23
St. Louis | 10.58 | 50.51  | 50.64      | 39.31
References

[1] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
[4] K. Koperski and J. Han, "Discovery of Spatial Association Rules in Geographic Information Databases," Proc. of the International Symp. on Large Spatial Databases, pp. 47-66, Portland, Maine, 1995.
[8] S. Chawla, S. Shekhar, and W. Wu, "Predicting Locations Using Map Similarity (PLUMS): A Framework for Spatial Data Mining," Proc. of the ACM International Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
[10] B. E. Boser, I. M. Guyon, and V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. of the Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992.
[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[13] A. Kowalczyk and B. Raskutti, "One Class SVM for Yeast Regulation Prediction," SIGKDD Explorations, Vol. 4, pp. 99-100, 2002.
[19] V. N. Vapnik, The Nature of Statistical Learning Theory, Second Edition, Springer, New York, 1999.
[22] L. Anselin, Spatial Econometrics: Methods and Models, Dordrecht: Kluwer Academic, Table 12.1, p. 189, 1988.
[25] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[26] J. P. LeSage, MATLAB Toolbox for Spatial Econometrics, 1999. URL: http://www.spatial-econometrics.com.