Data Mining Full

ANALYSIS OF GPS TRAJECTORIES
DATA SET
(Using Classification & Clustering Techniques)
REVIEW-3
FOR
DATA MINING TECHNIQUES (SWE2009)
M.Tech. (5 Years Integrated)

in
Software Engineering
by
K.DINESH KUMAR-16MIS0286
HEMANTHRUDRA-16MIS00232
Under the guidance of
Prof. SUDHA.M
School of Information Technology and Engineering

1. Title:-
Analysis of the GPS Trajectories Dataset using Classification & Clustering
Techniques.
2. Abstract:-
o Number of instances: 164.
o The dataset has 3 attributes and a class label.
o Class variable (0 or 1):
   - Class value 1 is interpreted as “bus” for the GPS trajectory.
   - Class value 0 is interpreted as “car” for the GPS trajectory.
o Several constraints were considered when selecting these instances from a
larger database.
o Overall, this data set represents the trajectory paths of vehicles, which are
analysed here using classification methods.
3. INTRODUCTION:
The analysis and study of the relationship between a geo-spatial event and human mobility in
an urban area is very significant for improving productivity, mobility, and safety. In
particular, in order to alleviate serious road congestion, traffic jams, and stampedes, it is
essential to predict an event and be informed about its occurrence as early as possible.
When an event is known in advance, people who are not interested in it can change their
plans or take a detour to avoid getting caught in heavy congestion. In this context, this
project presents an early event detection technique using GPS trajectories collected from
periodic cars and buses, which are vehicles periodically traveling on a pre-scheduled route
with pre-determined rating_bus and rating_weather attributes, such as a transit bus, shuttle,
garbage truck, or municipal patrol car. Using these trajectories, which provide real-time and
continuous traffic flow and speed, our technique detects large-scale events in advance,
without incurring any privacy invasion. The behavior of periodic cars or buses shows certain
signs of a large-scale event before attendees gather around a venue, because traffic can slow
down around the venue before the event occurs.
4. PROBLEM STATEMENT:
Given a dataset containing various attributes of cars and buses, define a classification
algorithm that can identify whether a trajectory belongs to a car or a bus at a particular
time. The problem is addressed using the k-means algorithm and Gaussian Naive Bayes.
5. Literature survey: -
Clustering-
• Multispectral images segmentation for biomedical applications diagnosis: K-means
oriented approach
The segmentation of multispectral images is considered a key step in image processing for
biomedical applications. Performing this step using the appropriate methodology is a real
issue that is being investigated by the research community. In this paper, we propose a new
algorithm to perform automatic segmentation based on the k-means methodology with an
automatic generation of the optimal value of “K”. We applied the new algorithm to a dataset
of a real medical image. The experimental results showed the efficiency and the speed of our
methodology in choosing the “K” value and in tracking the evolution of a pathology by
detecting cancerous blood cells for biomedical diagnosis. Some segmentation experiments
show that our proposed system achieves better accuracy than some other methods.
• Primary cloud assessment in THEOS imagery using k-means clustering and
morphological transformation algorithms
THEOS is the earth observation satellite system, which acquires earth images via its optical
instruments. As its instruments consist of passive type CCD sensors, the instruments require
sunlight reflected from the Earth's surface for imaging. Cloud presence above the imaging
area therefore directly affects image usability, image interpretation, image classification
accuracy, calibration activity and so on. As THEOS mainly focuses on country needs, imaging
over Thailand and Asian countries has been the main priority since its launch. These countries
are in the equatorial region and are subject to heavy cloud throughout the year. Thus, cloud
cover assessment plays an important role in assessing image usability. This study aims to
develop algorithms which will be applied to the images for automatic identification and
estimation of cloud content for each image. The algorithms are separated into 2 steps. Firstly,
k-means clustering is used to detect a specific cloud threshold value. Pixels with a digital
number (DN) value above the cloud threshold are marked as cloud pixels. Secondly, a
morphological transformation consisting of individual steps (erosion, dilation, closing and
opening) is applied to the data for reassessing and double-checking the non-cloud pixels. The
results after implementing the cloud mask investigation indicate that this approach is
capable of providing accurate cloud coverage assessment for THEOS images.
• Motion-based moving object detection and tracking using automatic K-means
Multiple objects detection and tracking are amongst the most important tasks in computer
vision-based surveillance and activity recognition. This paper proposes a real-time multiple
objects detection method and compares its performance with three existing methods. ‘Good
Features to Track’ algorithm is used to extract feature points from each frame. Based on the
motion-based information, feature points corresponding to moving objects are extracted from
next frame. Then, the number of moving objects in each frame is determined according to
their motion-based information and position, and are later clustered using the k-means
algorithm. Clustering of moving objects in this paper is performed using feature vectors made
of pixels' intensities, motion magnitudes, motion directions and feature point positions. In
terms of accuracy and efficiency, the proposed method is shown to be highly accurate in
determining the number of moving objects and also fast in tracking them in the scene.
• An effective method determining the initial cluster centers for K-means for
clustering gene expression data
Clustering is an important tool for analyzing gene expression data. Many clustering
algorithms have been proposed for the analysis of gene expression data. In this article we
have clustered real-life gene expression data via K-means, which is one of the clustering
algorithms. Also, we have proposed a new method for determining the initial cluster centers
for K-means. We have compared the results of our method with other clustering algorithms.
The comparison results show that the K-means algorithm which uses the proposed method
converges to better clustering results than other clustering algorithms.
• A GPS data based distributed K-means for cabstand location selection
Taxis have become an important component of the public transportation system. A proper
cabstand location can alleviate traffic pressure. In this paper, a large set of global
positioning system (GPS) data of taxis in Jinan City is employed to help locate
cabstands. By analyzing more than 300 million taxi driving records in Jinan, Shandong
Province, the parallel K-means algorithm is applied to the cluster analysis based on the Spark
distributed computing framework. Based on the clustering results, the characteristics of taxi
passengers are revealed, and the traffic hot spot map of taxi operation is generated
according to the visualized data results, which provides technical support for the selection of
cabstand locations. Although the results and conclusion are specific to Jinan City, the
methods and models used in this paper can be employed in other cities as well.
Classification: -
• Fixture identification from aggregated hot water consumption data
Activity identification in smart housing utilizes smart meters to label consumption of utilities,
such as cold and hot water, into human activities, such as cooking and cleaning. Typical
approaches utilize a large array of high sampling rate sensors installed at each fixture
location. This high density-high sampling rate approach raises computational challenges due
to the volume of data generated over time. In this paper, we present a novel approach for
identifying water usage patterns using a sparse array of sensors. Unlike traditional
approaches which utilize data from individual fixtures, our approach identifies fixtures by
classifying the aggregated water usage from the kitchen sink, bathroom sink and shower.
Furthermore, we model fixture and user characteristics to generate a set of higher level
features that are used to identify individual fixtures. We evaluate our approach using a novel
dataset of 12 apartments from the Clarkson University Smart Housing Project. Our results
show that our approach reduces the number of fixture level smart meters from 7 to 3, while
achieving an average accuracy between 70% and 80% for identifying hot water fixtures used in
the kitchen sink, bathroom sink and shower.
• Software and machine learning tools for monitoring railway track switch
performance
Trackside data logging hardware is often used in the UK, and increasingly elsewhere in the
world, to record and transmit processed condition data from track switching equipment
(points) in order to gauge asset health. This paper presents a novel implementation of three
tools which can be used together to make the analysis and handling of this data easier. The
first of these tools is a statistical classifier which automatically assigns labels to the process
data. The classifier is trained using historical data containing examples of events of interest,
such as recordings taken when maintenance activity or failures have developed. In practice,
the labels are used to pre-filter the data, to bring swift attention to events of interest, and to
automatically create categorised datasets which can be used to analyse historical
performance. Two different types of classifier are presented: a Gaussian Naïve Bayes
classifier and a neural network classifier. The second tool is a simple pattern recognition
algorithm which can determine when the different phases of mechanical operation in a
single track switch movement occur, for example locking, unlocking, and moving. The final
tool is a statistical technique which is used to extract simple features from the data and raise
alarms if they indicate poor track switch performance. The effectiveness of these tools is
tested using real world data taken from three different railways.
• Optimizing indoor location recognition through wireless fingerprinting at the Ian
Potter Museum of Art
Indoor tracking of smartphones adds context to smartphone applications, enabling a range
of smarter behaviours. The predicted use cases are many and varied, and include
navigation, planning, advertising and communication. Potentially, indoor tracking could
become as ubiquitous as GPS - however, all of these possibilities depend on being able to
produce a reasonably accurate, reliable system which does not require specialised
infrastructure. While professional systems using custom devices are able to achieve very
high levels of accuracy (<1 cm), consumer no-infrastructure systems struggle to achieve
reliable room-level tracking. This paper focuses on the use of WiFi received signal strength
indicator (RSSI) fingerprinting, a machine learning approach which currently seems to be
the most promising option for consumer smartphones. We have undertaken experimentation
and optimisation in a real-world, noisy environment - the Ian Potter Museum of Art - where
we developed and deployed a no-infrastructure, indoor visitor tracking application. Data
was collected in a trial involving several dozen users over a few weeks, who used the
system extensively. This data was analysed with a range of current WiFi RSSI
fingerprinting techniques and algorithms (WASP, Redpin (kNN), SSD, SVM, Gaussian
Naive Bayes and Random Forests), and their efficacy was compared and improved where
possible. Known challenges such as device heterogeneity are explored, and the consistency
of signal levels, including magnetic fields, are examined. Large Random Forests (200 trees)
were found to have the best performance, which was further improved by calibrating for
average differences in RSSI between phone models, to achieve an average of 90% correct
classification of exhibits within the top five hits.
• Predicting Transportation Carbon Emission with Urban Big Data
Transportation carbon emission is a significant contributor to the increase of greenhouse
gases, which directly threatens the change of climate and human health. Under the pressure
of the environment, it is very important to master the information of transportation carbon
emission in real time. In the traditional way, we get the information of the transportation
carbon emission by calculating the combustion of fossil fuel in the transportation sector.
However, it is very difficult to obtain the real-time and accurate fossil fuel combustion in
the transportation field. In this paper, we predict the real-time and fine-grained
transportation carbon emission information in the whole city, based on the spatio-temporal
datasets we observed in the city, that is taxi GPS data, transportation carbon emission data,
road networks, points of interests (POIs) and meteorological data. We propose a three-layer
perceptron neural network (3-layer PNN) to learn the characteristics of collected data and
infer the transportation carbon emission. We evaluate our method with extensive
experiments based on five real data sources obtained in Zhuhai, China. The results show
that our method has advantages over the well-known three machine learning methods
(Gaussian Naive Bayes, Linear Regression, Logistic Regression) and two deep learning
methods (Stacked Denoising Autoencoder, Deep Belief Networks).
• Comparing Classification Methods for Longitudinal fMRI Studies
We compare 10 methods of classifying fMRI volumes by applying them to data from a
longitudinal study of stroke recovery: adaptive Fisher's linear and quadratic discriminant;
gaussian naive Bayes; support vector machines with linear, quadratic, and radial basis
function (RBF) kernels; logistic regression; two novel methods based on pairs of restricted
Boltzmann machines (RBM); and K-nearest neighbors. All methods were tested on three
binary classification tasks, and their out-of-sample classification accuracies are compared.
The relative performance of the methods varies considerably across subjects and
classification tasks. The best overall performers were adaptive quadratic discriminant,
support vector machines with RBF kernels, and generatively trained pairs of RBMs.
6. Data Set Description: -
• Reference link:
https://archive.ics.uci.edu/ml/datasets/GPS+Trajectories
• No. of columns: 4
• No. of rows: 160
• No. of attributes: 3
i. Rating
ii. Rating_bus
iii. Rating_weather
6.1 Input Value:
Description of each attribute:
- rating_weather
  minimum value: 1
  maximum value: 2
- rating_bus
  minimum value: 0
  maximum value: 3
- rating (traffic)
  minimum value: 0
  maximum value: 3
6.2 Target Value:
• Class variable (0 or 1)
• Class value 1 is interpreted as “bus” for the GPS trajectory.
• Class value 0 is interpreted as “car” for the GPS trajectory.
• Class value    Number of instances
  0 (car)        88
  1 (bus)        77
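The attribute ranges and class counts listed above can be checked with a few pandas calls. The sketch below is illustrative only: it uses a tiny hand-made table rather than the real file, and the class column name `car_or_bus` is an assumption, since the report does not name that column.

```python
import pandas as pd

# Tiny hand-made stand-in for the real table; the values are illustrative only.
# "car_or_bus" is an assumed name for the class column (0 = car, 1 = bus).
df = pd.DataFrame({
    "rating":         [3, 2, 0, 1, 3, 2],
    "rating_bus":     [0, 1, 3, 0, 2, 0],
    "rating_weather": [1, 2, 2, 1, 1, 2],
    "car_or_bus":     [0, 1, 1, 0, 1, 0],
})

features = ["rating", "rating_bus", "rating_weather"]

# Minimum and maximum of every input attribute.
value_ranges = df[features].agg(["min", "max"])
print(value_ranges)

# Number of instances per class value.
class_counts = df["car_or_bus"].value_counts().sort_index()
print(class_counts)
```

Running the same two calls on the actual dataset would show directly whether the minimum and maximum values and the per-class counts match the figures quoted in this section.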
7. DATASET AND DATA PREPROCESSING: -
Data Set Characteristics: Multivariate          Number of Instances: 160
Attribute Characteristics: Real                 Number of Attributes: 3
Associated Tasks: Classification, Regression    Missing Values? Yes
7.1. Training Data set:
98 training instances (about 60% of the 160 instances) are present in the training
data set.
7.2. Test Data set:
62 test instances (about 40% of the 160 instances) are present in the test data set.
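A split with exactly these sizes can be sketched with scikit-learn's `train_test_split`, which accepts absolute row counts as well as fractions. The data below is a random stand-in with the same shape as the dataset (160 rows, 3 attributes), not the real values. Note that a strict 60/40 split of 160 rows would give 96 and 64 rows, so the counts are passed explicitly here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Random stand-in data: 160 rows, 3 integer-valued attributes, binary label.
X = rng.integers(0, 4, size=(160, 3))
y = rng.integers(0, 2, size=160)

# Passing integers requests absolute sizes rather than fractions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=98, test_size=62, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the training and test sets on every run.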
8. Sample database of GPS trajectory dataset: -
9. Testing & Training
Instances information
Total number of instances: 164
Using 120 training instances, the sensitivity and the specificity of their
algorithm were both 76% on the remaining 44 instances.
Training and test split: 98 training instances (about 60% of 160 instances) and
62 test instances (about 40% of 160 instances).
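Sensitivity and specificity both follow from the four cells of a binary confusion matrix. The sketch below shows the calculation on made-up labels (1 = bus, 0 = car); the numbers are illustrative and are not results from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only (1 = bus, 0 = car); not results from the report.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of buses correctly identified
specificity = tn / (tn + fp)  # share of cars correctly identified

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Sensitivity is the true positive rate for the class coded as 1, and specificity is the true negative rate for the class coded as 0, so swapping the class coding swaps the two values.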
10. Tool used for execution of algorithms:
Programming language: Python
Tools: Spyder, Anaconda
Algorithms: K-means, Gaussian Naive Bayes
11.IMPLEMENTATION:
CODE:
1. CONFUSION MATRIX:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Load the GPS trajectory data as a NumPy array. The raw string keeps the
# backslashes of the Windows path literal; to_numpy() is used because
# as_matrix() is no longer available in pandas 1.0 and later.
data = pd.read_csv(r'D:\A fall3\data mining\GPS Trajectory\go_track_tracks.csv').to_numpy()

# Columns 0-2 are used as features and column 4 as the class label;
# adjust these positions to match the layout of the CSV file.
X = data[1:130, 0:3].astype(float)
y = data[1:130, 4].astype(int)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a Gaussian Naive Bayes classifier and predict the test set
clf = GaussianNB()
y_pred = clf.fit(X_train, y_train).predict(X_test)


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # Prints the confusion matrix; the remaining arguments are unused here.
    print('Confusion matrix')
    print(cm)


cnf_matrix = confusion_matrix(y_test, y_pred)

# normalize=False returns the number of correctly classified samples
# instead of the fraction.
accuracy = accuracy_score(y_test, y_pred, normalize=False)
print(accuracy)

# Print the non-normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=clf.classes_)
2. NAÏVE BAYES CLASSIFICATION:
import pandas as pd
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv(r'D:\A fall3\data mining\GPS Trajectory\go_track_tracks.csv').to_numpy()
clf = GaussianNB()

# Rows 1-129 train the model; the remaining rows are held out for testing.
traindata = data[1:130, 0:3].astype(float)
trainLabel = data[1:130, 4].astype(int)
clf.fit(traindata, trainLabel)

testData = data[130:, 0:3].astype(float)
testLabel = data[130:, 4].astype(int)

# Print each held-out row, its true label and the predicted label.
for i in range(len(testData)):
    print(testData[i])
    print(testLabel[i])
    x = clf.predict([testData[i]])
    print(x)
3. K-MEANS CLUSTERING:
import pandas as pd
from sklearn.cluster import KMeans

pd.set_option('display.max_rows', 165)
x = pd.read_csv(r'D:\A fall3\data mining\GPS Trajectory\go_track_tracks.csv').to_numpy()
df = pd.DataFrame(x)

# Only rows 1-4 and the first two columns are clustered here, with column 3
# as the reference label; widen the slices to cluster more of the data.
data = x[1:5, 0:2].astype(float)
target = x[1:5, 3]
print(df)

# Fit K-means with two clusters (car and bus)
kmeans = KMeans(n_clusters=2)
KMmodel = kmeans.fit(data)
print(KMmodel.labels_)

# iloc selects rows and columns by position in a DataFrame
predicttarget = df.iloc[1:162, 4]
print(KMmodel.cluster_centers_)

# Cross-tabulate the reference labels against the cluster assignments
print(pd.crosstab(target, KMmodel.labels_))
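Because the script above depends on a local CSV file, the following self-contained sketch shows the same idea on synthetic data: fit K-means with two clusters, cross-tabulate clusters against known labels, and measure agreement. The two point groups are generated, not taken from the GPS dataset. K-means cluster ids are arbitrary (cluster 0 need not correspond to class 0), so the sketch maps each cluster to its majority class before comparing.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups standing in for the two vehicle classes.
group_a = rng.normal(loc=0.0, scale=0.3, size=(40, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(40, 2))
points = np.vstack([group_a, group_b])
target = np.array([0] * 40 + [1] * 40)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Rows are the known classes, columns are the cluster ids.
table = pd.crosstab(target, labels, rownames=["class"], colnames=["cluster"])
print(table)

# Map every cluster id to the class that occurs most often inside it.
cluster_to_class = table.idxmax(axis=0)
mapped = pd.Series(labels).map(cluster_to_class).to_numpy()

# Fraction of points whose mapped cluster matches the known class.
agreement = (mapped == target).mean()
print(f"agreement = {agreement:.2f}")
```

On real data the groups overlap, so the agreement is usually lower than on this deliberately easy example; the cross-tabulation shows where the clusters and the class labels disagree.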
OUTPUTS:
CONFUSION MATRIX:
GAUSSIAN NAIVE BAYES:
VISUALIZATION WITH WEKA TOOL:
VISUALIZATION WITH R-STUDIO:
CORRELATION:
ACCURACY:
OUTPUT: The accuracy is:
-- rating_weather = 79.29%
-- rating_bus = 80.26%
-- rating = 88.64%
RATING ON A SCALE OF 10:
SYSTEM MODEL:
• Import all the required packages for classification and clustering.
• Import or open the dataset using a pandas DataFrame.
• Train the model and test it on the dataset using Python library functions to predict the outcome.
• Then use the confusion_matrix and accuracy_score functions to get the results.
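The four steps above can be sketched end to end as follows. This is a minimal illustration on randomly generated data, not the project's actual run: the column name `car_or_bus`, the 60/40 stratified split and the way the synthetic values are generated are all assumptions made for the example.

```python
# Step 1: import the required packages.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Step 2: hold the data in a pandas DataFrame.
# Synthetic stand-in values; "car_or_bus" is an assumed class column name.
rng = np.random.default_rng(42)
n = 160
label = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "rating": rng.integers(0, 4, size=n),
    "rating_bus": np.where(label == 1,
                           rng.integers(1, 4, size=n),
                           rng.integers(0, 2, size=n)),
    "rating_weather": rng.integers(1, 3, size=n),
    "car_or_bus": label,
})

X = df[["rating", "rating_bus", "rating_weather"]].to_numpy(dtype=float)
y = df["car_or_bus"].to_numpy()

# Step 3: train on one part of the data and predict the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Step 4: summarise the predictions.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {acc:.2%}")
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test instances, which is a quick consistency check between the two outputs.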
12. References:
• Bouzid-Daho LERICA Laboratory, Faculty of Sciences of engineers, University Badji
Mokhtar, Annaba, Algeria.
• Prayot Puangjaktha Satellite Engineering Division, Geo-Informatics and Space Technology
Development Agency (Public Organization), GISTDA, Bangkok, Thailand
• Jules-Raymond Tapamo School of Engineering, University of KwaZulu-Natal, South Africa
• Deniz Tanır Matematik Bölümü, Ege Üniversitesi, İzmir, Türkiye
• Wei Gui School of MS&E, Anhui Technology of University, Maanshan, China
• Yan Gao Department of Electrical and Computer Engineering, Clarkson University,
Potsdam, NY, USA 13699
• N P Wright MPEC Technology Ltd, United Kingdom
• Işıl Karabey, Levent Bayındır, "Utilization of room-to-room transition time in Wi-Fi
fingerprint-based indoor localization", High Performance Computing & Simulation (HPCS)
2015 International Conference on, pp. 318-322, 2015.
• Xiangyong Lu School of Computer Science and Technology, Huazhong University of
Science and Technology, 12443 Wuhan, hubei China 430074 (e-mail:
m201572848@hust.edu.cn)
• Orhan Firat, Like Oztekin, Fatos T. Yarman Vural, "Deep learning for brain
decoding", Image Processing (ICIP) 2014 IEEE International Conference on, pp. 2784-2788,
2014
