Module 3 Data Science Machine Learning

This document provides an overview of machine learning, including its definitions, applications, and the modeling process. It discusses various techniques for feature engineering, model training, validation, and scoring, as well as libraries and tools used in data analysis. Additionally, it covers types of machine learning, including supervised, unsupervised, and semi-supervised learning, along with a case study on recognizing digits from images.


MODULE 3

MACHINE LEARNING
“Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.”
—Arthur Samuel

“Machine learning is the process by which a computer can work more accurately as it collects and learns from the data it is given.”
—Mike Roberts
Applications of Machine Learning
• Finding oil fields, gold mines, or archeological sites based on existing sites (classification and regression)
• Finding place names or persons in text (classification)
• Identifying people based on pictures or voice recordings (classification)
• Recognizing birds based on their whistle (classification)
• Identifying profitable customers (regression and classification)
• Proactively identifying car parts that are likely to fail (regression)
Applications of Machine Learning
• Identifying tumors and diseases (classification)
• Predicting the amount of money a person will spend on product X (regression)
• Predicting the number of eruptions of a volcano in a period (regression)
• Predicting your company’s yearly revenue (regression)
• Predicting which team will win the Champions League in soccer (classification)
Goal of Model Building (Root Cause Analysis)
Interpretation, Not Prediction
• Understanding and optimizing a business process, such as determining which products add value to a product line
• Discovering what causes diabetes
• Determining the causes of traffic jams
PACKAGES FOR WORKING WITH DATA IN MEMORY
• SciPy is a library that integrates fundamental packages often used in scientific computing such as NumPy, matplotlib, Pandas, and SymPy.
• NumPy gives you access to powerful array functions and linear algebra functions.
• Matplotlib is a popular 2D plotting package with some 3D functionality.
• Pandas is a high-performance, but easy-to-use, data-wrangling package. It allows us to analyze data and draw conclusions based on statistical theory.
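A minimal sketch of NumPy and pandas in action; the column names and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# NumPy: fast array and linear-algebra operations
prices = np.array([250_000, 320_000, 180_000], dtype=float)
print(prices.mean())                # average price
print(np.linalg.norm(prices))       # a linear-algebra routine

# pandas: tabular data wrangling and quick summary statistics
houses = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "sqft": [1400, 2100, 900],
    "price": prices,
})
print(houses.describe())            # count, mean, std, quartiles per column
print(houses.groupby("bedrooms")["price"].mean())
```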
PACKAGES FOR WORKING WITH DATA IN MEMORY
• SymPy is a package used for symbolic mathematics and computer algebra.
• StatsModels is a package for statistical methods and algorithms.
• Scikit-learn is a library filled with machine learning algorithms.
• RPy2 allows you to call R functions from within Python. R is a popular open source statistics program.
• NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
Libraries to optimize the operations
• Numba and NumbaPro—These use just-in-time compilation to speed up applications written directly in Python and a few annotations. NumbaPro also allows you to use the power of your graphics processor unit (GPU).
• PyCUDA—This allows you to write code that will be executed on the GPU instead of your CPU and is therefore ideal for calculation-heavy applications. It works best with problems that lend themselves to being parallelized and need little input compared to the number of required computing cycles. An example is studying the robustness of your predictions by calculating thousands of different outcomes based on a single start state.
• Cython, or C for Python—This brings the C programming language to Python. C is a lower-level language, so the code is closer to what the computer eventually uses (bytecode). The closer code is to bits and bytes, the faster it executes. A computer is also faster when it knows the type of a variable (called static typing).
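A small, hedged sketch of what Numba's just-in-time compilation looks like, assuming Numba is installed; the function is purely illustrative:

```python
import numpy as np
from numba import jit

@jit(nopython=True)          # compiled to machine code on first call
def sum_of_squares(x):
    total = 0.0
    for i in range(x.shape[0]):   # a plain Python loop, sped up by the JIT
        total += x[i] * x[i]
    return total

data = np.random.rand(1_000_000)
print(sum_of_squares(data))  # subsequent calls run at compiled speed
```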
Libraries to optimize the operations
• Blaze—Blaze gives you data structures that can be bigger than your computer’s main memory, enabling you to work with large data sets.
• Dispy and IPCluster—These packages allow you to write code that can be distributed over a cluster of computers.
• PP—Python is executed as a single process by default. With the help of PP you can parallelize computations on a single machine or over clusters.
• Pydoop and Hadoopy—These connect Python to Hadoop, a common big data framework.
• PySpark—This connects Python and Spark, an in-memory big data framework.
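A hedged PySpark sketch, assuming a working Spark installation; the file name and column are hypothetical:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV into a distributed DataFrame and run an in-memory aggregation
df = spark.read.csv("customers.csv", header=True, inferSchema=True)
df.groupBy("country").count().show()

spark.stop()
```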
Modeling Process Steps
1. Feature engineering and model selection
2. Training the model
3. Model validation and selection
4. Applying the trained model to unseen data
Step 1: Feature engineering and model selection
Feature engineering is the process of transforming raw data into relevant information for machine learning (ML) models to use.

Types of Feature Selection (a sketch of all three appears below):
• Filter Method: Based on a statistical measure of the relationship between the feature and the target variable. Features with a high correlation are selected.
• Wrapper Method: Based on the evaluation of feature subsets using a specific machine learning algorithm. The feature subset that results in the best performance is selected.
• Embedded Method: Based on feature selection performed as part of the training process of the model itself, for example through L1 regularization or tree-based feature importance.
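A minimal sketch of the three feature-selection families using scikit-learn on a synthetic dataset; the dataset and parameter values are invented for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter method: rank features by a statistical score against the target
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper method: repeatedly fit a model and drop the weakest features
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit_transform(X, y)

# Embedded method: selection happens inside training, here via an L1 penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear")
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```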
Feature engineering
What is a feature?
• A feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm.
• Features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem at hand.
• For example, in a dataset of housing prices, features could include:
  • the number of bedrooms,
  • the square footage,
  • the location,
  • the age of the property.
Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to be used in a machine-learning model.
• Numerical Variables: Variables with continuous values such as integers or floats.
• Categorical Variables: Variables with categorical values such as Boolean, ordinal, or nominal values (a small encoding sketch follows below).
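A small sketch of handling numerical and categorical variables with pandas; the housing columns below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "sqft": [1400, 2100, 900],                 # numerical (continuous)
    "location": ["city", "suburb", "city"],    # categorical (nominal)
    "has_garden": [True, False, True],         # categorical (Boolean)
})

# One-hot encode the nominal column so a model can consume it as numbers
encoded = pd.get_dummies(df, columns=["location"])
print(encoded)
```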
Example of Feature Selection in the Heart Failure Clinical Records Dataset
Step 1: Feature engineering and model selection
Availability bias:
• Availability bias is the human tendency to rely on information that comes readily to mind when evaluating situations or making decisions.
• Sometimes the features selected are only the ones that represent this one-sided “truth.” Models suffering from availability bias often fail when they’re validated, because it becomes clear that they’re not a valid representation of the truth.
• Availability bias can lead us to believe information that is untrue or make decisions without all the facts.
Step 2: Training the Model
• Train the model to learn from the collected data; this often takes only a few lines of code.
• Underfitting occurs when the model’s performance is very low in the training stage as well as the testing stage.
• Overfitting occurs when the model gives very good performance in the training stage, but when we test it on testing data it doesn’t give good performance/results.
Two factors for a good machine learning model
• Bias—An error due to the model’s inability to represent the true relationship between input and output accurately. When a model performs poorly on both the training and the testing data, it has high bias because the model is too simple, indicating underfitting.
• Variance—The variability of the model’s predictions for different instances of training data. When variance is high, the model fits the noise in the training data; as a result, it performs well on the training data but poorly on the testing data, indicating overfitting.
Reasons for Underfitting
A model that shows poor performance on both the training and the testing set is probably underfitting.
• The model is too simple, so it may not be capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
• The size of the training dataset is not large enough.
• Features are not scaled.
Reasons for Overfitting
A model that performs exceptionally well on its training data but poorly on the validation set is likely overfitting.
• High variance and low bias.
• The model is too complex.
• The size of the training data is too small.
Step 3: Validating the Model
Validation is extremely important because
it determines whether your model works.
A good model has two properties: it has
good predictive power and it generalizes
well to data it hasn’t seen.
Two common error measures are:
• the classification error rate for classification problems
• the mean squared error for regression problems
The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better.

The mean squared error measures how big the average error of your prediction is. Squaring the errors has two consequences: you can’t cancel out a wrong prediction in one direction with a faulty prediction in the other direction, and bigger errors weigh more heavily in the average.
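A minimal sketch of both error measures using scikit-learn metrics; the label and prediction values are invented:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification error rate = share of mislabeled test observations
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
error_rate = 1 - accuracy_score(y_true, y_pred)   # 0.2 -> 20% mislabeled

# Mean squared error for a regression problem
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.5, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)

print(error_rate, mse)
```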
Validation Techniques
• Split validation
• K-fold cross validation
• Leave-one-out cross validation
Cross validation
Cross validation is a technique used in machine
learning to evaluate the performance of a model on
unseen data. It involves dividing the available data
into multiple folds or subsets, using one of these folds
as a validation set, and training the model on the
remaining folds. This process is repeated multiple
times, each time using a different fold as the
validation set.
• An example of split validation in machine learning is when a dataset is divided into three sets (a minimal sketch follows this list): training, validation, and testing:
  • Training set: Used to train the model
  • Validation set: Used to validate the model
  • Testing set: Used to test the model
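A minimal sketch of such a three-way split, assuming scikit-learn and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the final test set, then split the rest into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```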
K-fold cross validation
K-fold cross validation is a powerful technique for evaluating predictive models in data science. It involves splitting the dataset into k subsets or folds, where each fold is used as the validation set in turn while the remaining k-1 folds are used for training.
In the K-fold method, the dataset is divided into ‘k’ subsets, called folds. Training is done using all but one fold, and the fold left out is used to evaluate the model once it is trained. This method performs k iterations, in each of which a different subset is reserved for testing.
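A minimal k-fold sketch with scikit-learn (k = 5, synthetic data; the model choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the k folds
```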
L.O.O.C.V. (Leave One Out Cross Validation)
In the Leave One Out Cross Validation (L.O.O.C.V.) method, the model is trained on the entire dataset while leaving out only one data point, and this is repeated for each data point. One prominent benefit of this method is that all the data points are used, so the resulting performance estimate has low bias.
• LOOCV is appropriate when an accurate estimate of model performance is critical. This is particularly the case when the dataset is small (fewer than a few thousand examples), where an ordinary train/test split can lead to overfitting during training and biased estimates of model performance.
• LOOCV is an extreme version of k-fold cross-validation that has the maximum computational cost. It requires one model to be created and evaluated for each example in the training dataset.
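A minimal LOOCV sketch with scikit-learn on a deliberately small synthetic dataset, since the cost grows with the number of observations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# One model is fit and evaluated per observation (100 fits here)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```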
Validation Techniques
• Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation)—This is the most common technique.
• K-folds cross validation—This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.
• Leave-1 out—This approach is the same as k-folds, but with k equal to the number of observations: you always leave one observation out and train on the rest of the data.
Regularization in Validation
• Regularization introduces a penalty for more complex models, effectively reducing their complexity and encouraging the model to learn more generalized patterns.
• With L1 regularization you ask for a model with as few predictors as possible. This is important for the model’s robustness: simple solutions tend to hold true in more situations.
• L2 regularization aims to keep the variance between the coefficients of the predictors as small as possible.
• Overlapping variance between predictors makes it hard to make out the actual impact of each predictor. Keeping their variance from overlapping will increase interpretability. To keep it simple: regularization is mainly used to stop a model from using too many features and thus prevent over-fitting.
Regularization in Validation
Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting.
• L1 regularization (Lasso regularization)
• L2 regularization (Ridge regularization)
The L1 norm calculates the sum of the absolute values of the vector elements.
The L2 norm calculates the square root of the sum of the squared values of the vector elements.
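A minimal sketch of L1 and L2 regularization with scikit-learn's Lasso and Ridge, plus the two norms in NumPy; the synthetic data and alpha values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can drive coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward 0

print(np.sum(lasso.coef_ == 0), "coefficients driven to zero by L1")

v = np.array([3.0, -4.0])
print(np.linalg.norm(v, 1))   # L1 norm: |3| + |-4| = 7
print(np.linalg.norm(v, 2))   # L2 norm: sqrt(9 + 16) = 5
```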
Step 4: Applying the trained model to unseen data
The process of applying your model to new data is called model scoring.
Two steps in Model Scoring:
• Prepare a data set that has features exactly as defined by your model. This boils down to repeating the data preparation performed during model training.
• Apply the model on this new data set, and this results in a prediction.
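A minimal scoring sketch, assuming a scikit-learn model; the feature values for the unseen rows are invented:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 1: prepare unseen data with exactly the same four features
X_new = [[0.2, -1.1, 0.5, 0.0],
         [1.3,  0.4, -0.7, 2.1]]

# Step 2: apply the trained model, producing one prediction per row
print(model.predict(X_new))
print(model.predict_proba(X_new))   # class probabilities, if needed
```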
Types of machine learning
• Supervised learning techniques attempt to discern results and learn by trying to find patterns in a labeled data set. Human interaction is required to label the data.
• Unsupervised learning techniques don’t rely on labeled data and attempt to find patterns in a data set without human interaction.
• Semi-supervised learning techniques need labeled data, and therefore human interaction, to find patterns in the data set, but they can still progress toward a result and learn even if passed unlabeled data as well.
Supervised Machine Learning
CASE STUDY: DISCERNING DIGITS FROM IMAGES
A simple Captcha control can be used to prevent automated spam from being sent through an online web form.
Data Science Process
CASE STUDY: DISCERNING DIGITS FROM IMAGES
Step 1: Setting up the Research Goal
The research goal is to let a computer recognize numbers from images.
Step 2: Data Collection or fetching the digital image data
The MNIST data set is often used in the data science literature for teaching and benchmarking. The MNIST images can be found in the data sets package of Scikit-learn and are already normalized for you (all scaled to the same size: 8×8 pixels).
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning.
Step 4: Data Exploration using Scikit-learn
• Each entry in digits.images is a two-dimensional array (a matrix) reflecting the shape of the image; pl.matshow() displays it.
• To flatten it into a list, we need to call reshape() on digits.images.
• The net result is a one-dimensional array of grayscale pixel values per image.
• We’ll turn an image into something usable by the Naïve Bayes classifier by getting the grayscale value for each of its pixels and putting those values in a list.
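A hedged sketch of this exploration and modeling step using the Scikit-learn digits data; the split and classifier settings are illustrative, not necessarily those used in the original case study:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

digits = load_digits()
print(digits.images[0].shape)                # each image is an 8x8 matrix of grayscale values

plt.matshow(digits.images[0], cmap="gray")   # display the two-dimensional array
plt.show()

# Flatten each 8x8 matrix into a 64-element list of pixel values
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_train, y_train)

print(confusion_matrix(y_test, model.predict(X_test)))   # 10x10 confusion matrix
```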
Image data classification problem on images of digits.
Confusion matrix produced by predicting what number is depicted by a blurry image.
Confusion matrix
A confusion matrix is a matrix showing how wrongly (or correctly) a model predicted, that is, how much it got “confused.”
In its simplest form it is a 2x2 table for models that try to classify observations as being A or B.
Let’s say we have a classification model that predicts whether somebody will buy our newest product: deep-fried cherry pudding. We can either predict “Yes, this person will buy” or “No, this customer won’t buy.”
Confusion Matrix Example
The model was correct in 35 + 40 = 75 cases and incorrect in 15 + 10 = 25 cases, resulting in an accuracy of 75 correct / 100 total observations = 75%.
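A small sketch that reproduces these counts with scikit-learn; the label vectors are constructed only to match the example numbers above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Actual buyers/non-buyers and the model's predictions, built from the counts
y_true = np.array([1] * 35 + [1] * 15 + [0] * 40 + [0] * 10)
y_pred = np.array([1] * 35 + [0] * 15 + [0] * 40 + [1] * 10)

print(confusion_matrix(y_true, y_pred))   # 2x2 table of correct/incorrect predictions
print(accuracy_score(y_true, y_pred))     # (35 + 40) / 100 = 0.75
```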
Inspecting predictions vs. actual numbers
For each blurry image a number is predicted; only the number 2 is interpreted as 8. Then an ambiguous number is predicted to be 3, but it could just as well be 5; even to human eyes this isn’t clear.
• Supervised learning aims to find a mapping or relationship between the input variables and the desired output, which enables the algorithm to produce precise predictions or classifications when faced with fresh, unobserved data.
• A training set of input-output pairs is given to the algorithm during a supervised learning process. For every example in the training set, the algorithm iteratively modifies its parameters to minimize the discrepancy between its predicted output and the actual output (the ground truth). This procedure keeps going until the algorithm performs at an acceptable level.
Supervised learning can be divided into two main types (a small sketch of each follows below):
• Regression: In regression problems, the goal is to predict a continuous output or value. For example, predicting the price of a house based on its features, such as the number of bedrooms, square footage, and location.
• Classification: In classification problems, the goal is to assign input data to one of several predefined categories or classes. Examples include spam email detection, image classification (e.g., identifying whether an image contains a cat or a dog), and sentiment analysis.
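A minimal sketch contrasting the two task types with scikit-learn, using synthetic data in place of real house prices or emails:

```python
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., a price)
X_r, y_r = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:1]))          # a continuous prediction

# Classification: assign one of several predefined classes (e.g., spam / not spam)
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
print(clf.predict(X_c[:1]))          # a discrete class label
```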
Example of Supervised Learning
• Suppose there is a basket filled with some fresh fruits, and the task is to arrange the same type of fruits in one place.
• Suppose the fruits are apple, banana, cherry, and grape, and suppose one already knows from previous work (or experience) the shape of every fruit present in the basket, so it is easy to arrange the same type of fruits in one place. Here, the previous work is called training data in Data Mining terminology.
• So, the model learns from the training data. This is because it has a response variable that says that if some fruit has such-and-such features then it is a grape, and similarly for every fruit. This type of information is deciphered from the data that is used to train the model. This type of learning is called Supervised Learning. Such problems are listed under classical Classification Tasks.
Unsupervised Learning
• In unsupervised learning, the algorithm tries to find patterns, structures, or relationships in the data without the guidance of labeled outputs. There are several common types of unsupervised learning techniques (a small sketch follows this list):
• Clustering: Clustering algorithms aim to group similar data points into clusters based on some similarity metric. K-means clustering and hierarchical clustering are examples of unsupervised clustering techniques.
• Dimensionality Reduction: These techniques aim to reduce the number of features (or dimensions) in the data while preserving its essential information. Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are examples of dimensionality reduction methods.
• Association: Association rule learning is used to discover interesting relationships or associations between variables in large datasets. The Apriori algorithm is a well-known example used for association rule learning.
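A minimal sketch of clustering and dimensionality reduction with scikit-learn on synthetic, unlabeled data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)  # labels ignored

# Clustering: group similar points without any target variable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])           # cluster assignment per point

# Dimensionality reduction: compress 5 features into 2 while keeping most variance
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                    # (300, 2)
```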
Example of Unsupervised Learning
• Suppose there is a basket filled with some fresh fruits, and the task is to arrange the same type of fruits in one place. This time there is no information about those fruits beforehand; it’s the first time the fruits are being seen or discovered. So how can similar fruits be grouped without any prior knowledge about them? First, a physical characteristic of the fruits is selected, say color, and the fruits are arranged based on it. The groups will be something like this:
• RED COLOR GROUP: apples & cherries.
• GREEN COLOR GROUP: bananas & grapes.
Now take another physical characteristic, say size; the groups become:
• RED COLOR AND BIG SIZE: apple.
• RED COLOR AND SMALL SIZE: cherries.
• GREEN COLOR AND BIG SIZE: bananas.
• GREEN COLOR AND SMALL SIZE: grapes.
