Big Data Analytics - Unit 3

Big data analytics for MCA

Uploaded by

Prabha Joshi
Copyright © All Rights Reserved
Classification
Supervised v/s Unsupervised Learning
Supervised learning (classification):
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on the training set.
Unsupervised learning (clustering):
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
What is Classification and Prediction?
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows −
➢ Classification
➢ Prediction
• Classification models predict categorical class labels; prediction models predict continuous-valued functions.
• For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
WHAT IS CLASSIFICATION?
The following are examples of cases where the data analysis task is classification −
➢ A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
➢ A marketing manager at a company needs to predict whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
WHAT IS PREDICTION?
The following is an example of a case where the data analysis task is prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at the company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
Classification – A 2-step process
• Model construction: describing a set of predetermined classes
◦ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
◦ The set of tuples used for model construction is the training set
◦ The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
• Estimate the accuracy of the model
◦ The known label of each test sample is compared with the classified result from the model
◦ The accuracy rate is the percentage of test set samples that are correctly classified by the model
◦ The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic because over-fitting to the training data goes undetected
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Model Construction
Using the Model in Prediction
Classification of Data Mining Techniques
• Classification of data mining frameworks based on the type of data sources mined:
◦ Here the classification is as per the type of data, e.g. multimedia, text data, spatial data, time series data, WWW, etc.
• Classification of data mining frameworks based on the database involved:
◦ Here the classification is as per the data model involved, e.g. object-oriented database, transactional database, relational database, etc.
• Classification of data mining frameworks as per the kind of knowledge discovered:
◦ This classification depends on the type of knowledge discovered, e.g. discrimination, classification, clustering, characterization, etc.
• Classification of data mining frameworks based on the data mining techniques used:
◦ This classification is based on the data analysis approach utilized, e.g. neural networks, machine learning, genetic algorithms, statistics, data-warehouse-oriented approaches, etc.
Issues regarding Classification and Prediction
• The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities −
➢ Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and a missing value is handled by replacing it with the most commonly occurring value for that attribute.
➢ Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
➢ Data Transformation and Reduction − The data can be transformed by any of the following methods.
◦ Normalization − Normalization involves scaling all values of a given attribute so that they fall within a small specified range. It is used when the learning step involves neural networks or methods based on distance measurements.
◦ Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
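The normalization step can be illustrated with a minimal sketch. It assumes min-max scaling, which is one common way to bring an attribute into a small range such as [0, 1]; the slides do not name a specific formula. The function name `min_max_normalize` and the income values are made up for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map every one to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


incomes = [20000, 35000, 50000, 80000]  # made-up attribute values
normalized = min_max_normalize(incomes)
print(normalized)
```

After scaling, the smallest income maps to the lower bound and the largest to the upper bound, so attributes measured in very different units contribute comparably to distance calculations.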
Comparison of Classification and Prediction Methods
• Here are the criteria for comparing the methods of classification and prediction −
➢ Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
➢ Speed − This refers to the computational cost of generating and using the classifier or predictor.
➢ Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
➢ Scalability − This refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
➢ Interpretability − This refers to the level of understanding and insight that the classifier or predictor provides.
Classification and
Prediction
Building Classification
Model
Decision Tree
Decision Tree Algorithm
Naïve Bayes Classifier
• PlayTennis is the target variable, with output Yes / No.
• New instance to be classified as Yes / No: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No

• Prior probability
P(PlayTennis = Yes) = 9 / 14 = 0.64
P(PlayTennis = No) = 5 / 14 = 0.36
• Conditional probabilities of the individual attributes
• There are 4 attributes, viz. Outlook, Temperature, Humidity and Wind.
• Find the conditional probabilities of the individual attribute values (Y = PlayTennis is Yes, N = PlayTennis is No).

Outlook      Y     N
Sunny        2/9   3/5
Overcast     4/9   0
Rain         3/9   2/5

Temperature  Y     N
Hot          2/9   2/5
Mild         4/9   2/5
Cool         3/9   1/5

Humidity     Y     N
High         3/9   4/5
Normal       6/9   1/5

Wind         Y     N
Strong       3/9   3/5
Weak         6/9   2/5
Naïve Bayes Classifier

The probability that the person will play tennis is less than the probability that he will not play tennis. Hence the conclusion is that he will not play tennis.
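The conclusion can be checked with a short sketch that counts the prior and conditional probabilities from the 14 PlayTennis records and multiplies them for the new instance (Sunny, Cool, High, Strong). The function names are illustrative, and no smoothing is applied, so the scores are the raw products of the fractions in the tables.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for days D1..D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def train_naive_bayes(rows):
    """Count class frequencies and (attribute, value, class) frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts


def class_scores(instance, class_counts, value_counts):
    """Return prior * product of conditional probabilities for each class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total  # prior probability of the class
        for i, value in enumerate(instance):
            p *= value_counts[(i, value, label)] / n  # P(value | class)
        scores[label] = p
    return scores


class_counts, value_counts = train_naive_bayes(DATA)
scores = class_scores(("Sunny", "Cool", "High", "Strong"), class_counts, value_counts)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for Yes is 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053 and the score for No is 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206, so the predicted class is No.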
Classification by Back-propagation
Back Propagation
• Features of back-propagation:
• It uses the gradient descent method.
• It differs from other networks in the process by which the weights are calculated during the learning period of the network.
• Training is done in three stages:
◦ feed-forward of the input training pattern
◦ calculation and back-propagation of the error
◦ updating of the weights
Back Propagation Algorithm
• Step 1: Inputs X arrive through the pre-connected path.
• Step 2: The input is modeled using weights W, which are usually chosen randomly.
• Step 3: Calculate the output of each neuron, from the input layer to the hidden layer to the output layer.
• Step 4: Calculate the error in the outputs:
Back-propagation error = Actual output – Desired output
• Step 5: From the output layer, go back to the hidden layer and adjust the weights to reduce the error.
• Step 6: Repeat the process until the desired output is achieved.

• Parameters:
• x = input training vector, x = (x1, x2, …, xn)
• t = target vector, t = (t1, t2, …, tn)
• δk = error at output unit k
• δj = error at hidden unit j
• α = learning rate
• V0j = bias of hidden unit j
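The three training stages and the parameters listed above can be sketched in plain Python. This is a minimal illustration under several assumptions that the slides do not state: one hidden layer, a single output unit, sigmoid activations, a squared-error measure, and the logical OR function as toy training data. The names `train_backprop`, `v`, `w` and `w0` are illustrative; `alpha`, `delta_k`, `delta_j` and `v0` mirror α, δk, δj and V0j.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_backprop(samples, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; returns (predict, error_history)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Weights are chosen randomly (Step 2). v0[j] is the bias of hidden unit j.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)

    def forward(x):
        # Stage 1: feed-forward from input layer to hidden layer to output.
        z = [sigmoid(v0[j] + sum(v[j][i] * x[i] for i in range(n_in)))
             for j in range(n_hidden)]
        y = sigmoid(w0 + sum(w[j] * z[j] for j in range(n_hidden)))
        return z, y

    history = []
    for _ in range(epochs):
        total_error = 0.0
        for x, t in samples:
            z, y = forward(x)
            # Stage 2: error = actual output - desired output, propagated back.
            error = y - t
            total_error += 0.5 * error ** 2
            delta_k = error * y * (1.0 - y)
            delta_j = [delta_k * w[j] * z[j] * (1.0 - z[j]) for j in range(n_hidden)]
            # Stage 3: update the weights by gradient descent.
            for j in range(n_hidden):
                w[j] -= alpha * delta_k * z[j]
                v0[j] -= alpha * delta_j[j]
                for i in range(n_in):
                    v[j][i] -= alpha * delta_j[j] * x[i]
            w0 -= alpha * delta_k
        history.append(total_error)

    return (lambda x: forward(x)[1]), history


# Toy training set (an assumption for illustration): logical OR.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, history = train_backprop(or_samples)
print("error in first epoch:", history[0])
print("error in last epoch:", history[-1])
```

The per-epoch error stored in `history` should shrink as Steps 3 to 5 are repeated, which is the stopping idea behind Step 6.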
Types of Back-Propagation
• Static back-propagation: a network designed to map static inputs to static outputs.
• E.g. OCR (Optical Character Recognition)
• Recurrent back-propagation: activation is fed forward until a fixed value is reached.
• Static back-propagation provides an instant mapping, while recurrent back-propagation does not.
• Advantages:
• It is simple, fast, and easy to program.
• It is flexible and efficient.
• Users do not need to learn any special functions.
• Disadvantages:
• It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
• Performance is highly dependent on the input data.
• Training can take a long time.
Prediction
1. Predictive Modeling and Machine Learning − Machine learning takes weather data and builds relationships between the available data and the relevant predictors.
2. Data – A Crucial Part of Weather Predictions
3. Weather Data – An Aid for Many Events − It is used to predict floods, sports events, car sales and asthma attacks.
4. Satellite Imagery and Sensor Data
• Example: A record 1.2 million people (equal to the population of Mauritius) were evacuated in less than 48 hours thanks to the work of data scientists. The cyclone that prompted the evacuation was one of the strongest to have hit India in the last 20 years.
Accuracy
Classification accuracy
• True Positive (TP) – the model correctly predicts the positive class.
• True Negative (TN) – the model correctly predicts the negative class.
• False Positive (FP) – the model incorrectly predicts the positive class.
• False Negative (FN) – the model incorrectly predicts the negative class.
• Accuracy = (Number of correct predictions) / (Total number of predictions) = (TP + TN) / (TP + TN + FP + FN)
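As a small worked example of the accuracy formula, the sketch below counts the four outcomes for a made-up list of actual and predicted labels and divides the correct predictions by the total. The label lists and function names are illustrative only.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, TN, FP, FN) for two equally long label lists."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correctly predicted positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


actual = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```

Here 6 of the 8 predictions match the actual labels (3 true positives and 3 true negatives), giving an accuracy of 0.75.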
Clustering
• Clustering is an unsupervised machine learning technique.
• Clustering uses only the input data to determine patterns, anomalies or similarities in that data.
• A good clustering algorithm aims to obtain clusters in which:
◦ The intra-cluster similarity is high, meaning that the data inside a cluster are similar to one another.
◦ The inter-cluster similarity is low, meaning that the data in one cluster are not similar to the data in other clusters.
• What is a cluster?
◦ A cluster is a subset of similar objects.
Clustering
• Clustering is the grouping of objects based on their characteristics and similarities.
• A good clustering algorithm is able to identify clusters independent of their shape.
• The 3 basic stages of a clustering algorithm are:
Raw Data → Clustering Algorithm → Clusters of Data
Methods of Clustering in Data Mining
There are many clustering methods that can partition a data set into groups:
◦ 1. Partitioning based method
◦ 2. Density-based method
◦ 3. Centroid-based method
◦ 4. Hierarchical method
◦ 5. Grid-based method
◦ 6. Model-based method
Partitioning based Method
• A partitioning algorithm divides the data into many subsets.
• Given a database of n objects, the algorithm builds k partitions of the data (k ≤ n).
• Each group must contain at least one object, and every object must belong to exactly one group.
Density based Method
• The algorithm finds clusters as high-density regions in the data space, separated by regions of lower point density.
Centroid Based Method
• In a centroid-based clustering algorithm, clusters are formed by the closeness of data points to the centroid of each cluster.
• The cluster center, i.e. the centroid, is chosen so that the distance of the data points to the center is minimal.
• In this type of grouping technique, each cluster is referenced by a vector of values (its centroid).
• The number of groups should be predefined.
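A minimal sketch of a centroid-based method is given below. It assumes k-means as the concrete algorithm (the slides do not name one), Euclidean distance, and randomly chosen data points as the initial centroids; the sample points and the function name `k_means` are made up for illustration.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Group points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points as initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

The number of groups `k` has to be supplied up front, which matches the last bullet above. On this toy data the first three points end up in one cluster and the last three in the other.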
Hierarchical Method
• Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-type structure.
• Agglomerative clustering: also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
• This clustering algorithm does not require us to prespecify the number of clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
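The bottom-up procedure can be sketched as follows. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the slides do not specify a linkage rule, and the sample points are made up.

```python
import math


def single_link_distance(a, b):
    """Distance between clusters a and b: their closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Merge singleton clusters pairwise; returns the list of merges in order."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerative(points)
for left, right in merges:
    print(left, "+", right)
```

For n points there are n − 1 merges, and the recorded sequence is the tree-type hierarchy: cutting it at any level gives a clustering without the number of clusters being fixed in advance.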
Hierarchical Agglomerative
Approach
Divisive Approach
• Also known as the top-down approach.
• This algorithm also does not require the number of clusters to be prespecified.
• Top-down clustering requires a method for splitting a cluster. It starts with a single cluster that contains all the data and proceeds by splitting clusters recursively until each data point is in its own singleton cluster.
Grid-Based Method
• The grid is divided based on the characteristics of the data.
• Using this method, non-numeric data is easy to manage.
• The order of the data does not affect the partitioning of the grid.
• An important advantage is the faster execution time.
Model-Based Method
• In this method a hypothesized model based on a probability distribution is used.
• The method locates the clusters by constructing a density function that reflects the distribution of the data points.
Applications of Clustering in Data Mining
• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations.
• Clustering also helps in the identification of areas of similar land use in an earth observation database, and in the identification of groups of houses in a city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as the detection of credit card fraud.
Spatial Mining
• Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases.
• Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
• Such mining demands the unification of data mining with spatial database technologies.
• Examples where spatial mining is used: Earth observation satellites, census, weather systems, marketing, determining hotspots, unusual locations, etc.
Spatial mining

Domain          Spatial data mining application
Public Safety   Discovery of hotspot patterns from crime event maps
Epidemiology    Detection of disease outbreaks
Business        Market allocation to maximize store profits
Neuroscience    Discovery of patterns of human brain activity from neuro-images
Web Mining
It is the application of data mining techniques to automatically discover and extract information from web documents and services.
The main purpose is to discover useful information from the World Wide Web and its usage patterns.
Applications: it is used for web searching (e.g. Google, Yahoo) and to predict user behavior.
3 types of mining
Web Content Mining – the process of mining useful information from web pages and web documents (mainly uses NLP and information retrieval).
Web Structure Mining – analyzes the nodes and connection structure of a website using graph theory (gives information on the structure of the website and how documents are structured within it).
Web Usage Mining – the process of extracting patterns and information from server logs (to gain insight into user activity).
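As a small illustration of web structure mining, the sketch below treats a website as a directed graph and counts the incoming links of each page. The link graph is entirely hypothetical, and in-degree is just one simple graph measure chosen for the example.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "/home": ["/products", "/about"],
    "/products": ["/home", "/cart"],
    "/about": ["/home"],
    "/cart": ["/home", "/products"],
}


def in_degree(link_graph):
    """Count how many links point at each page (nodes = pages, edges = links)."""
    counts = Counter({page: 0 for page in link_graph})
    for targets in link_graph.values():
        counts.update(targets)
    return counts


incoming = in_degree(links)
most_linked = max(incoming, key=incoming.get)
print(dict(incoming), most_linked)
```

Pages with many incoming links tend to be structural hubs of the site; in this toy graph that is `/home`, with three incoming links.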
Text Mining
It is the process of extracting useful information by analyzing relations, patterns and rules among textual data.
Process of Text Mining
1. Gather unstructured information from different sources
2. Preprocess and cleanse the data
3. Process and review the data after cleansing
4. Pattern analysis
5. Use the results for decision making and to analyze trends
Procedures of analyzing Text
Mining
Text Summarization
Text Categorization
Text Clustering
Text Mining Techniques
Information Extraction: the process of extracting meaningful words from documents.
Information Retrieval: the process of extracting relevant and associated patterns.
Natural Language Processing: the analysis of unstructured text information.
Clustering: groups texts according to their similar characteristics.
Text Summarization: extracts a part of the content that reflects the whole.
Applications of Text Mining
1. Digital Library

2. Academic and Research Field

3. Life Science

4. Social-Media

5. Business Intelligence


Issues in Text Mining
Efficiency and effectiveness of decision-making.
Sometimes the original message or meaning can be changed due to alteration.
Many algorithms and techniques support multi-language text, which may lead to false-positive results.
