Big Data Analytics - Unit 3
Classification
Supervised vs. Unsupervised Learning
Supervised learning (classification):
Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations.
New data is classified based on the training
set
Unsupervised learning (clustering):
The class labels of the training data are unknown.
Given a set of measurements, observations,
etc., the aim is to establish the existence
of classes or clusters in the data.
What is Classification and
Prediction?
Two forms of data analysis can be used to extract
models that describe important classes or
to predict future data trends. These two forms are as
follows −
➢ Classification
➢ Prediction
Classification models predict categorical class
labels, whereas prediction models predict
continuous-valued functions.
For example, we can build a classification model to
categorize bank loan applications as either safe or
risky, or a prediction model to predict the
expenditures in dollars of potential customers on
computer equipment given their income and
occupation.
WHAT IS CLASSIFICATION?
The following are examples of cases where
the data analysis task is classification −
➢ A bank loan officer wants to analyze the data
in order to know which customers (loan
applicants) are risky and which are safe.
➢ A marketing manager at a company needs to
analyze whether a customer with a given profile
will buy a new computer.
In both of the above examples, a model or
classifier is constructed to predict the
categorical labels. These labels are risky or
safe for the loan application data and yes or no
for the marketing data.
WHAT IS PREDICTION?
The following is an example of a case where the
data analysis task is prediction −
Suppose the marketing manager needs to
predict how much a given customer will spend
during a sale at the company. In this example
we need to predict a numeric value.
Therefore the data analysis task is an example
of numeric prediction. In this case, a model or
predictor is constructed that predicts a
continuous-valued function or ordered value.
Note − Regression analysis is a statistical
methodology that is most often used for
numeric prediction.
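As a minimal sketch of numeric prediction by regression, the following fits a straight line by ordinary least squares, assuming a single numeric input. The income and spending figures are made-up illustrative values, and `fit_line` and `predict` are hypothetical helper names, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict a continuous value for input x from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: customer income (thousands) vs. spend during a sale (dollars).
incomes = [20, 30, 40, 50, 60]
spend = [110, 160, 210, 260, 310]

model = fit_line(incomes, spend)
print(predict(model, 45))  # predicted spend for an income of 45
```

The output is a number on a continuous scale rather than a class label, which is what separates prediction from classification.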
Classification – A 2 step
process
Model construction: describing a set of predetermined
classes
◦ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
◦ The set of tuples used for model construction is the training set
◦ The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: classifying future or unknown objects
Estimate the accuracy of the model
◦ The known label of each test sample is compared with the classified
result from the model
◦ The accuracy rate is the percentage of test set samples that are
correctly classified by the model
◦ The test set must be independent of the training set; otherwise the
accuracy estimate is over-optimistic and over-fitting goes undetected
If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
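The two steps can be sketched as follows. This is only an illustration under stated assumptions: the model is a single income threshold (a deliberately simple stand-in for rules or a decision tree), the loan data are invented, and `train_threshold`, `classify` and `accuracy` are hypothetical names.

```python
def train_threshold(training_set):
    """Step 1, model construction: learn an income threshold from labelled tuples."""
    safe = [income for income, label in training_set if label == "safe"]
    risky = [income for income, label in training_set if label == "risky"]
    # Midpoint between the two class means (an assumed, very simple rule).
    return (sum(safe) / len(safe) + sum(risky) / len(risky)) / 2


def classify(threshold, income):
    """Step 2, model usage: assign a class label to a tuple."""
    return "safe" if income >= threshold else "risky"


def accuracy(threshold, test_set):
    """Percentage of test samples whose known label matches the model's result."""
    correct = sum(1 for income, label in test_set
                  if classify(threshold, income) == label)
    return 100.0 * correct / len(test_set)


# Invented (income, label) tuples; the test set is kept separate from the training set.
training_set = [(15, "risky"), (22, "risky"), (30, "risky"),
                (48, "safe"), (55, "safe"), (70, "safe")]
test_set = [(18, "risky"), (35, "risky"), (45, "safe"),
            (60, "safe"), (38, "safe")]

threshold = train_threshold(training_set)
print(accuracy(threshold, test_set))
```

If the reported accuracy is acceptable, `classify` can then be applied to applicants whose labels are not known.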
[Figure: Model Construction]
[Figure: Using the Model in Prediction]
Classification of Data Mining
Techniques
Classification of data mining frameworks based on the type of data
sources mined:
◦ Here the classification is as per the type of
data, e.g. multimedia, text data, spatial data,
time series data, WWW, etc.
Classification of data mining frameworks based on the database involved:
◦ Here the classification is as per the data
model involved, e.g. object-oriented
database, transactional database, relational
database, etc.
Classification of data mining frameworks
as per the kind of knowledge discovered:
◦ This classification depends on the types of
knowledge discovered, e.g. discrimination,
classification, clustering, characterization, etc.
Classification of data mining frameworks
based on the data mining techniques used:
◦ This classification is based on the data
analysis approach utilized, e.g. neural
networks, machine learning, genetic
algorithms, statistics, data-warehouse-oriented
approaches, etc.
Issues regarding Classification
and Prediction
The major issue is preparing the data for
classification and prediction. Preparing the data
involves the following activities −
➢ Data Cleaning − Data cleaning involves
removing noise and treating missing
values. Noise is removed by applying
smoothing techniques, and the problem of missing
values is solved by replacing a missing value with the
most commonly occurring value for that attribute.
➢ Relevance Analysis − The database may also contain
irrelevant attributes. Correlation analysis is
used to determine whether any two given attributes
are related.
➢ Data Transformation and Reduction − The
data can be transformed by either of the
following methods.
◦ ■ Normalization − Normalization involves scaling all
values of a given attribute so that they
fall within a small specified range. Normalization is
used when the learning step involves neural
networks or methods involving distance
measurements.
◦ ■ Generalization − The data can also be
transformed by generalizing it to a higher-level
concept. For this purpose we can use concept
hierarchies.
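Two of the preparation steps above can be sketched in a few lines. This is a minimal illustration, assuming missing values are represented as `None` and that normalization means min-max scaling to the range [0, 1] (one common choice; the slides only say "a small specified range"). The function names and sample values are hypothetical.

```python
from collections import Counter


def fill_missing_with_mode(values):
    """Data cleaning: replace None with the most commonly occurring value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Invented attribute columns.
occupations = ["clerk", None, "engineer", "clerk", None]
incomes = [20000, 35000, 50000, 80000]

cleaned = fill_missing_with_mode(occupations)
scaled = min_max_normalize(incomes)
print(cleaned)
print(scaled)
```

Scaling matters for distance-based methods because an attribute with a large range, such as income, would otherwise dominate attributes with a small range.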
Comparison of Classification and
Prediction Methods
Here are the criteria for comparing the methods of
classification and prediction −
➢ Accuracy − The accuracy of a classifier refers to its ability
to predict the class label correctly; the
accuracy of a predictor refers to how well a given
predictor can guess the value of the predicted attribute for
new data.
➢ Speed − This refers to the computational cost of
generating and using the classifier or predictor.
➢ Robustness − This refers to the ability of the classifier or
predictor to make correct predictions from noisy data.
➢ Scalability − This refers to the ability to construct
the classifier or predictor efficiently, given a large amount of
data.
➢ Interpretability − This refers to the extent to which the classifier or
predictor can be understood.
Classification and
Prediction
Building Classification
Model
Decision Tree
Decision Tree Algorithm
Naïve Bayes Classifier
PlayTennis is the target variable, with output Yes / No.
New instance to be classified as Yes / No:
Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong

Day   Outlook    Temperature   Humidity   Wind     PlayTennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No

Prior Probability
P(Play Tennis = yes) = 9 / 14 = 0.64
P(Play Tennis = no) = 5 / 14 = 0.36

Current Probability / conditional probabilities of individual attributes:
4 attributes viz., Outlook, Temperature, Humidity and Wind
Find conditional probabilities of individual attributes.

Outlook       Y     N
Sunny         2/9   3/5
Overcast      4/9   0
Rain          3/9   2/5

Temperature   Y     N
Hot           2/9   2/5
Mild          4/9   2/5
Cool          3/9   1/5

Humidity      Y     N
High          3/9   4/5
Normal        6/9   1/5

Wind          Y     N
Strong        3/9   3/5
Weak          6/9   2/5
Naïve Bayes Classifier
The probability that the person will play tennis is less than the probability that he will
not play tennis. Hence the conclusion is that he will not play tennis.
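The comparison behind this conclusion can be reproduced from the 14-day PlayTennis table. The sketch below multiplies the prior by the conditional probability of each attribute value, counted directly from the data, and picks the class with the larger product. It assumes no smoothing of zero counts, and `naive_bayes_scores` is a hypothetical helper name.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for days D1..D14.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def naive_bayes_scores(data, instance):
    """Unnormalised score per class: P(class) * product of P(value | class)."""
    class_counts = Counter(row[-1] for row in data)
    total = len(data)
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability
        for i, value in enumerate(instance):
            matches = sum(1 for row in data
                          if row[-1] == label and row[i] == value)
            score *= matches / count  # conditional probability
        scores[label] = score
    return scores


new_instance = ("Sunny", "Cool", "High", "Strong")
scores = naive_bayes_scores(DATA, new_instance)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For this instance the "Yes" score is 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053 and the "No" score is 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206, so the predicted class is "No".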
Classification by
Backpropagation
Back Propagation
Features of Back-propagation:
It uses the gradient descent method.
It differs from other networks in the
process by which the weights are calculated during the
learning period of the network.
Training is done in three stages:
◦ the feed-forward of the input training pattern
◦ the calculation and back-propagation of the error
◦ the updating of the weights
Back Propagation
Algorithm
Step 1: Inputs X arrive through the pre-connected path.
Step 2: The input is modeled using real-valued weights W. The weights are
usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer
to the hidden layer to the output layer.
Step 4: Calculate the error in the outputs:
Back-propagation error = Actual Output - Desired Output
Step 5: From the output layer, go back to the hidden layer to
adjust the weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
Parameters:
x = input training vector, x = (x1, x2, …, xn).
t = target vector, t = (t1, t2, …, tn).
δk = error at output unit k.
δj = error at hidden unit j.
α = learning rate.
V0j = bias of hidden unit j.
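The steps and parameters above can be sketched for a network with one hidden layer and a single output unit. Several choices here are assumptions not stated in the slides: sigmoid activations, a squared-error measure, a fixed number of epochs in place of a stopping test, and the logical OR function as training data. The names `train` and `sigmoid` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(patterns, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the total error per epoch."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # Step 2: weights (v, w) and biases (v0, w0) are chosen randomly.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)
    history = []
    for _ in range(epochs):  # Step 6: repeat
        total_error = 0.0
        for x, t in patterns:  # Step 1: inputs arrive
            # Step 3: feed-forward, input layer -> hidden layer -> output layer.
            z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n_in)))
                 for j in range(n_hidden)]
            y = sigmoid(w0 + sum(z[j] * w[j] for j in range(n_hidden)))
            # Step 4: error = actual output - desired output.
            error = y - t
            total_error += 0.5 * error ** 2
            # Step 5: back-propagate the error terms.
            delta_k = error * y * (1 - y)
            delta_j = [delta_k * w[j] * z[j] * (1 - z[j])
                       for j in range(n_hidden)]
            # Gradient descent update with learning rate alpha.
            for j in range(n_hidden):
                w[j] -= alpha * delta_k * z[j]
                v0[j] -= alpha * delta_j[j]
                for i in range(n_in):
                    v[i][j] -= alpha * delta_j[j] * x[i]
            w0 -= alpha * delta_k
        history.append(total_error)
    return history


# Assumed training data: the logical OR of two binary inputs.
or_patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
history = train(or_patterns)
print(history[0], history[-1])
```

Because the error is defined as actual minus desired output, the weights are updated by subtracting `alpha` times the gradient; with the opposite sign convention the update would be an addition.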
Types of Back-Propagation
Static back-propagation: Static back-propagation
is a network designed to map
static inputs to static outputs.
E.g. OCR (Optical Character Recognition)
Disadvantages:
It is sensitive to noisy data and irregularities.
Noisy data can lead to inaccurate results.
Performance is highly dependent on the input
data.
Training can take a long time.
Prediction
1. Predictive Modeling and Machine Learning
Machine learning takes weather data and builds
relationships between the available data and the
relative predictors.
2. Data – A Crucial Part of Weather Predictions
3. Weather Data – An Aid for Many Events
Prediction of floods, sports outcomes, car sales and
asthma attacks.
4. Satellite Imagery and Sensor Data
----------------------------------
Total number of predictions
Clustering
Clustering is an unsupervised machine learning
technique.
Clustering uses only the input data to determine
patterns, anomalies or similarities in it.
A good clustering algorithm aims to obtain clusters
in which:
◦ The intra-cluster similarity is high, which
implies that the data inside a cluster are
similar to one another.
◦ The inter-cluster similarity is low, which
means each cluster holds data that are not similar
to the data in other clusters.
What is a Cluster?
◦ A cluster is a subset of similar objects
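One way to check the two properties above is to compare distances within a cluster to distances between clusters. The sketch below assumes Euclidean distance as the (inverse) measure of similarity, so high similarity means small distance. The points and function names are illustrative only.

```python
import math
from itertools import combinations


def mean_intra_cluster_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def mean_inter_cluster_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    total = sum(math.dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Invented two-dimensional points forming two well-separated groups.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

intra_a = mean_intra_cluster_distance(cluster_a)
inter_ab = mean_inter_cluster_distance(cluster_a, cluster_b)
print(intra_a, inter_ab)
```

For a good clustering the intra-cluster distance is small relative to the inter-cluster distance, as it is for these two groups.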
Clustering
Clustering is the grouping of objects based on their
characteristics and their similarities.
A good clustering algorithm is able to identify
clusters independently of their shape.
The 3 basic stages of a clustering algorithm are:
Raw Data → Clustering Algorithm → Clusters of Data
Methods of Clustering in Data
Mining
Many clustering methods can be used to partition
a data set into clusters.
◦ 1. Partitioning based method
◦ 2. Density-based method
◦ 3. Centroid-based method
◦ 4. Hierarchical method
◦ 5. Grid-based method
◦ 6. Model-based method
Partitioning based Method
A partitioning algorithm divides the data into many subsets (partitions).
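As an illustration of partitioning, the sketch below uses k-means, one widely used partitioning method that the slides do not name. It assumes Euclidean distance, caller-supplied initial centroids and a fixed number of iterations. The data points and the `k_means` function are hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Partition points into len(centroids) subsets by nearest centroid."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the subset of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its subset
        # (an empty subset keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Invented two-dimensional points and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
clusters, centroids = k_means(points, [(1.0, 1.0), (9.0, 8.0)])
print(clusters)
print(centroids)
```

Each point ends up in exactly one subset, which is the defining property of a partitioning method.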
4. Social-Media