Data Mining Full
DATA SET
(Using Classification & Clustering Techniques)
REVIEW-3
FOR
Software Engineering
by
K.DINESH KUMAR-16MIS0286
HEMANTHRUDRA-16MIS00232
Prof. SUDHA.M
2. Abstract:-
3. INTRODUCTION:
The analysis of the relationship between a geo-spatial event and human mobility in an urban
area is significant for improving productivity, mobility, and safety. In particular, to
alleviate serious road congestion, traffic jams, and stampedes, it is essential to predict an
event and be informed about it as early as possible. When an event is known in advance,
people who are not interested in it can change their plans or take a detour to avoid heavy
congestion. In this context, this project presents an early event detection technique using GPS
trajectories collected from periodic-cars and buses, which are vehicles that periodically travel
on a pre-scheduled route with pre-determined rating_bus and rating_weather values, such as a
transit bus, shuttle, garbage truck, or municipal patrol car. Using these trajectories, which
provide real-time and continuous traffic flow and speed, our technique detects large-scale
events in advance without any privacy invasion. The behavior of periodic-cars or buses
shows signs of a large-scale event before attendees gather around a venue, because
traffic around the venue can slow down before the event occurs.
4. PROBLEM STATEMENT:
Given a dataset containing various attributes of cars and buses, define a classification
algorithm that can identify whether a record belongs to a car or a bus at a particular time.
The problem is addressed using the k-means algorithm and naive Bayes.
5. Literature survey: -
Clustering-
Multispectral images segmentation for biomedical applications diagnosis: K-means
oriented approach
The segmentation of multispectral images is considered as a key step in image processing for
biomedical applications. Performing this step using the appropriate methodology is a real
issue that is being investigated by the research community. In this paper, we propose a new
algorithm to perform automatic segmentation based on k-means methodology within an
automatic generation of the optimal value of “K”. We applied the new algorithm on a dataset
of a real medical image. The experimental results showed the efficiency and speed of our
methodology in choosing the “K” value and in tracking a pathology's evolution through the
detection of cancerous blood cells for biomedical diagnosis. Some segmentation experiments
also show that our proposed system achieves better accuracy than some other methods.
Classification: -
Fixture identification from aggregated hot water consumption data
Activity identification in smart housing utilizes smart meters to label consumption of utilities,
such as cold and hot water, into human activities, such as cooking and cleaning. Typical
approaches utilize a large array of high sampling rate sensors installed at each fixture
location. This high density-high sampling rate approach raises computational challenges due
to the volume of data generated over time. In this paper, we present a novel approach for
identifying water usage patterns using a sparse array of sensors. Unlike traditional
approaches which utilize data from individual fixtures, our approach identifies fixtures by
classifying the aggregated water usage from the kitchen sink, bathroom sink and shower.
Furthermore, we model fixture and user characteristics to generate a set of higher level
features that are used to identify individual fixtures. We evaluate our approach using a novel
dataset of 12 apartments from the Clarkson University Smart Housing Project. Our results
show that our approach reduces the number of fixture level smart meters from 7 to 3, while
achieving an average accuracy between 70% and 80% for identifying hot water fixtures used in
the kitchen sink, bathroom sink and shower.
Software and machine learning tools for monitoring railway track switch
performance
Trackside data logging hardware is often used in the UK, and increasingly elsewhere in the
world, to record and transmit processed condition data from track switching equipment
(points) in order to gauge asset health. This paper presents a novel implementation of three
tools which can be used together to make the analysis and handling of this data easier. The
first of these tools is a statistical classifier which automatically assigns labels to the process
data. The classifier is trained using historical data containing examples of events of interest,
such as recordings taken when maintenance activity or failures have developed. In practice,
the labels are used to pre-filter the data, to bring swift attention to events of interest, and to
automatically create categorised datasets which can be used to analyse historical
performance. Two different types of classifier are presented: a Gaussian Naïve Bayes
classifier and a neural network classifier. The second tool is a simple pattern recognition
algorithm which can determine when the different phases of mechanical operation in a
single track switch movement occur, for example locking, unlocking, and moving. The final
tool is a statistical technique which is used to extract simple features from the data and raise
alarms if they indicate poor track switch performance. The effectiveness of these tools is
tested using real world data taken from three different railways.
Optimizing indoor location recognition through wireless fingerprinting at the Ian
Potter Museum of Art
Indoor tracking of smartphones adds context to smartphone applications, enabling a range
of smarter behaviours. The predicted use cases are many and varied, and include
navigation, planning, advertising and communication. Potentially, indoor tracking could
become as ubiquitous as GPS - however, all of these possibilities depend on being able to
produce a reasonably accurate, reliable system which does not require specialised
infrastructure. While professional systems using custom devices are able to achieve very
high levels of accuracy (<1 cm), consumer no-infrastructure systems struggle to achieve
reliable room-level tracking. This paper focuses on the use of WiFi received signal strength
indicator (RSSI) fingerprinting, a machine learning approach which currently seems to be
the most promising option for consumer smartphones. We have undertaken experimentation
and optimisation in a real-world, noisy environment - the Ian Potter Museum of Art - where
we developed and deployed a no-infrastructure, indoor visitor tracking application. Data
was collected in a trial involving several dozen users over a few weeks, who used the
system extensively. This data was analysed with a range of current WiFi RSSI
fingerprinting techniques and algorithms (WASP, Redpin (kNN), SSD, SVM, Gaussian
Naive Bayes and Random Forests), and their efficacy was compared and improved where
possible. Known challenges such as device heterogeneity are explored, and the consistency
of signal levels, including magnetic fields, is examined. Large Random Forests (200 trees)
were found to have the best performance, which was further improved by calibrating for
average differences in RSSI between phone models, to achieve an average of 90% correct
classification of exhibits within the top five hits.
Predicting Transportation Carbon Emission with Urban Big Data
Transportation carbon emission is a significant contributor to the increase of greenhouse
gases, which directly threatens the climate and human health. Under this environmental
pressure, it is very important to obtain information on transportation carbon
emission in real time. In the traditional way, we get the information of the transportation
carbon emission by calculating the combustion of fossil fuel in the transportation sector.
However, it is very difficult to obtain the real-time and accurate fossil fuel combustion in
the transportation field. In this paper, we predict the real-time and fine-grained
transportation carbon emission information in the whole city, based on the spatio-temporal
datasets we observed in the city, that is taxi GPS data, transportation carbon emission data,
road networks, points of interests (POIs) and meteorological data. We propose a three-layer
perceptron neural network (3-layer PNN) to learn the characteristics of collected data and
infer the transportation carbon emission. We evaluate our method with extensive
experiments based on five real data sources obtained in Zhuhai, China. The results show
that our method has advantages over three well-known machine learning methods
(Gaussian Naive Bayes, Linear Regression, Logistic Regression) and two deep learning
methods (Stacked Denoising Autoencoder, Deep Belief Networks).
Comparing Classification Methods for Longitudinal fMRI Studies
We compare 10 methods of classifying fMRI volumes by applying them to data from a
longitudinal study of stroke recovery: adaptive Fisher's linear and quadratic discriminant;
Gaussian naive Bayes; support vector machines with linear, quadratic, and radial basis
function (RBF) kernels; logistic regression; two novel methods based on pairs of restricted
Boltzmann machines (RBM); and K-nearest neighbors. All methods were tested on three
binary classification tasks, and their out-of-sample classification accuracies are compared.
The relative performance of the methods varies considerably across subjects and
classification tasks. The best overall performers were adaptive quadratic discriminant,
support vector machines with RBF kernels, and generatively trained pairs of RBMs.
Reference link:
https://archive.ics.uci.edu/ml/datasets/GPS+Trajectories
No. of columns: 4
No. of rows: 160
No. of attributes: 3
i. rating
ii. rating_bus
iii. rating_weather
rating_weather
minimum value: 1
maximum value: 2
rating_bus
minimum value: 0
maximum value: 3
rating (traffic)
minimum value: 0
maximum value: 3
Class variable (0 or 1)
Class value 1 is interpreted as “bus” for GPS trajectory.
Class value 0 is interpreted as “car” for GPS trajectory.
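The attribute ranges and the class encoding listed above can be captured in a small validation helper. The sketch below is illustrative only: `TrackRecord`, `RANGES`, `CLASS_NAMES`, `label_name`, and the field name `vehicle_class` are hypothetical names chosen for this example and are not part of the project code. Only the ranges and the 0/1 interpretation come from the description above.

```python
from dataclasses import dataclass

# Inclusive value ranges, as stated in the dataset description.
RANGES = {
    "rating": (0, 3),
    "rating_bus": (0, 3),
    "rating_weather": (1, 2),
}

# Class encoding, as stated in the dataset description.
CLASS_NAMES = {0: "car", 1: "bus"}


@dataclass
class TrackRecord:
    """One row of the dataset (hypothetical container for illustration)."""
    rating: int
    rating_bus: int
    rating_weather: int
    vehicle_class: int  # class variable: 0 or 1

    def validate(self):
        """Return a list of problems; an empty list means the row is in range."""
        problems = []
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                problems.append(f"{name}={value} outside [{low}, {high}]")
        if self.vehicle_class not in CLASS_NAMES:
            problems.append(f"class={self.vehicle_class} is not 0 or 1")
        return problems


def label_name(record):
    """Translate the numeric class value into 'car' or 'bus'."""
    return CLASS_NAMES[record.vehicle_class]
```

A check of this kind could be run once after loading the file, so that out-of-range ratings or unexpected class values are caught before training.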
7. DATASET AND DATA PREPROCESSING: -
Attribute Characteristics: Real
Number of Attributes: 3
Associated Tasks: Classification, Regression
Missing Values? Yes
11.IMPLEMENTATION:
CODE:
1. CONFUSION MATRIX:
import itertools
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm, datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score
3. K-MEANS CLUSTERING:
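import csv, math, random
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt
# Show up to 165 rows when the frame is displayed.
pd.set_option('display.max_rows', 165)
# Raw string so the backslashes in the Windows path are not read as escapes.
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
x = pd.read_csv(r'D:\A fall3\data mining\GPS Trajectory\go_track_tracks.csv').to_numpy()
df = pd.DataFrame(x)
# Rows 1-4 only: columns 0-1 as the features, column 3 as the class label.
data = x[1:5, 0:2]
target = x[1:5, 3]
df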
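The listing under this heading loads the data but does not show the clustering step itself. As a rough sketch of what k-means does, the block below writes out the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is not the project's code: the function name `kmeans`, its parameters, and the sample rows are assumptions made for illustration, and in practice a library implementation such as `sklearn.cluster.KMeans` would normally be used instead.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    `init` may supply k starting centroids; otherwise k points are drawn at random.
    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])

        # Stop once neither the assignments nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids

    return centroids, labels


# Made-up (rating, rating_bus, rating_weather) rows, not taken from the dataset.
points = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (3, 3, 2), (3, 2, 2), (2, 3, 2)]
centroids, labels = kmeans(points, k=2, init=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The starting centroids are passed explicitly in the usage lines because k-means can settle in a poor local optimum from an unlucky random start; library implementations usually guard against this by restarting several times.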
CORRELATION:
ACCURACY:
OUTPUT: The accuracy is
--rating_weather = 79.29%
--rating_bus = 80.26%
--rating = 88.64%
RATING ON A SCALE OF 10:
SYSTEM MODEL :
Import all the required packages for classification and clustering.
Load the dataset into a pandas DataFrame.
Split the dataset into training and test sets and fit the model using the scikit-learn functions to predict the outcome.
Then use the confusion_matrix and accuracy_score functions to get the results.
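The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the data here is synthetic (generated with NumPy so the block runs without the CSV file), the link between `rating_bus` and the class is invented so that there is something to learn, and the 75/25 split and `random_state` values are arbitrary choices. The scikit-learn calls themselves (`GaussianNB`, `train_test_split`, `confusion_matrix`, `accuracy_score`) are the ones named in the report.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Step 2 stand-in: a synthetic table with 160 rows, three rating columns
# and a 0/1 class column. All values are made up for illustration.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=160)                  # 0 = car, 1 = bus
rating = rng.integers(0, 4, size=160)             # values 0..3
rating_bus = 2 * y + rng.integers(0, 2, size=160) # assumed: cars 0..1, buses 2..3
rating_weather = rng.integers(1, 3, size=160)     # values 1..2
X = np.column_stack([rating, rating_bus, rating_weather])

# Step 3: hold out a quarter of the rows for testing, then fit naive Bayes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 4: rows of the matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {acc:.2%}")
```

With the real file, the synthetic block would be replaced by the `pd.read_csv` call and a selection of the three rating columns and the class column; the accuracy printed here says nothing about the figures reported for the actual dataset.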