Machine Learning Lab
Dharanya S
◦ Types of Naive Bayes Algorithm
◦ Gaussian Naive Bayes: when attribute values are continuous, the values associated with each class are assumed to follow a Gaussian, i.e. normal, distribution. If an attribute, say "x", contains continuous data, we first segment the data by class and then compute the mean and variance of "x" for each class.
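The per-class mean/variance step can be sketched in a few lines. The dataset below is invented for illustration; a library implementation such as scikit-learn's `GaussianNB` would normally do this for you:

```python
import math

# Toy continuous attribute "x", already segmented by class (values invented)
data = {
    "yes": [25.0, 27.5, 30.0, 26.5],
    "no":  [15.0, 18.0, 16.5, 17.0],
}

# Step 1: compute the mean and variance of "x" for each class
params = {}
for label, xs in data.items():
    mean = sum(xs) / len(xs)
    var = sum((v - mean) ** 2 for v in xs) / (len(xs) - 1)  # sample variance
    params[label] = (mean, var)

# Step 2: likelihood P(x | class) from the normal density
def gaussian_likelihood(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

for label, (mean, var) in params.items():
    print(label, round(gaussian_likelihood(26.0, mean, var), 4))
```

The class whose Gaussian assigns the new value the highest likelihood (weighted by the class prior) wins.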
◦ Multinomial Naive Bayes: Multinomial Naive Bayes is preferred for data that is multinomially distributed. It is one of the standard classic algorithms used in text categorization (classification), where each event represents the occurrence of a word in a document.
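A minimal sketch of the multinomial idea, where each word occurrence is one event. The tiny spam/ham corpus and the equal-prior assumption are invented for illustration; scikit-learn's `MultinomialNB` is the production route:

```python
from collections import Counter

# Tiny toy corpus (invented): each document is labelled spam or ham
train = [("buy cheap pills now", "spam"),
         ("cheap pills cheap", "spam"),
         ("meeting agenda for monday", "ham"),
         ("monday project meeting", "ham")]

# Count word occurrences per class -- each occurrence is one "event"
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def word_prob(word, label, alpha=1.0):
    """P(word | class) with Laplace smoothing."""
    counts = word_counts[label]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

def score(text, label):
    """Unnormalized class score: product of per-word probabilities (equal priors assumed)."""
    p = 1.0
    for w in text.split():
        p *= word_prob(w, label)
    return p

doc = "cheap pills"
print("spam" if score(doc, "spam") > score(doc, "ham") else "ham")  # spam
```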
Machine Learning
How does ML work?
Dataset
◦ A dataset is a collection of data in which the data is arranged in some order. A dataset can contain anything from a series of arrays to a database table.
◦ A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a record of the dataset. The most widely supported file type for a tabular dataset is the Comma-Separated Values (CSV) file.
Country   Age   Salary   Purchased
India     38    48000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
Steps involved in ML projects
◦ Import Data (CSV Files)
◦ Clean Data
◦ Removing duplicates, irrelevant and incomplete data.
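The import-and-clean steps can be sketched in pure Python; in practice you would load the CSV with `pandas.read_csv` and use `drop_duplicates()` / `dropna()`. The rows below are invented to show one duplicate and one incomplete record:

```python
# Minimal cleaning sketch (in practice: pandas.read_csv, drop_duplicates, dropna)
rows = [
    {"Country": "India",   "Age": 38, "Salary": 48000, "Purchased": "No"},
    {"Country": "France",  "Age": 43, "Salary": 45000, "Purchased": "Yes"},
    {"Country": "France",  "Age": 43, "Salary": 45000, "Purchased": "Yes"},  # duplicate
    {"Country": "Germany", "Age": 40, "Salary": None,  "Purchased": "Yes"},  # incomplete
]

seen, clean = set(), []
for row in rows:
    key = tuple(row.items())
    if key in seen:            # drop duplicate rows
        continue
    seen.add(key)
    if None in row.values():   # drop incomplete rows (missing Salary)
        continue
    clean.append(row)

print(len(clean))  # 2 rows survive
```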
◦ Step 2:
◦ S1 = <'Sunny', 'warm', 'normal', 'strong', 'warm', 'same'>
◦ G0 = G1 = <?, ?, ?, ?, ?, ?>
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Where, p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
◦ Example: 3 of the 8 instances are positive, so p+ = 0.375 and p- = 0.625:
Entropy(S) = -[0.375 * log2(0.375) + 0.625 * log2(0.625)]
= -[0.375 * (-1.415) + 0.625 * (-0.678)]
= -(-0.530 - 0.424) = 0.954
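The entropy calculation above can be reproduced directly; the `entropy` helper below is a sketch of the two-class formula, with 0 log 0 treated as 0:

```python
import math

def entropy(p_pos, p_neg):
    """Entropy(S) = -p+ log2(p+) - p- log2(p-), with 0 log 0 taken as 0."""
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            total -= p * math.log2(p)
    return total

# The slide's example: 3 of 8 instances in one class
print(round(entropy(3 / 8, 5 / 8), 3))  # 0.954
```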
Information Gain
◦ Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
◦ The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ v ∈ Values(A) (|Sv| / |S|) * Entropy(Sv)
where Values(A) is the set of possible values of attribute A, and Sv is the subset of S for which attribute A has value v.
Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows do not belong to the same class, split the dataset S into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat for the remaining features until we run out of features or the decision tree consists only of leaf nodes.
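Step 1 above can be sketched as a small information-gain function. The toy rows and the "Windy" feature are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data (invented): does "Windy" help predict the class?
rows = [{"Windy": "yes"}, {"Windy": "yes"}, {"Windy": "no"}, {"Windy": "no"}]
labels = ["No", "No", "Yes", "Yes"]
print(information_gain(rows, labels, "Windy"))  # 1.0: the split separates the classes perfectly
```

ID3 would pick the feature with the largest such gain at each node.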
Advantage/Disadvantage
Advantage :
◦ Understandable prediction rules are created from the training data.
◦ Builds the fastest tree.
◦ Build a short tree.
◦ Only need to test enough attributes until all data is classified.
◦ Finding leaf nodes enables test data to be pruned, reducing the number of tests.
Disadvantage:
◦ Data may be over-fitted or over classified if a small sample is tested.
◦ Only one attribute at a time is tested for making a decision.
◦ Classifying continuous data may be computationally expensive, as many trees must be generated to see
where to break the continuum.
Iterative Dichotomiser 3: ID3
◦ A decision tree is a structure that contains nodes and
edges and is built from a dataset.
The initial node is called the root node, the final nodes are called the leaf nodes, and the rest of
the nodes are called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the leaf nodes represent the
outcomes.
◦ Example:
◦ if a person is less than 30 years of age and doesn’t eat junk food then he is Fit,
◦ if a person is less than 30 years of age and eats junk food then he is Unfit and so on.
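The example rules above can be written directly as a decision function. The branches for age 30 and over are not given on the slide ("and so on"), so they are left undecided here:

```python
def fitness(age, eats_junk_food):
    """Root node tests age; the next node tests junk food; leaves are the outcomes."""
    if age < 30:
        return "Unfit" if eats_junk_food else "Fit"
    # Branches for age >= 30 are not specified on the slide
    return "Unknown"

print(fitness(25, False))  # Fit
print(fitness(25, True))   # Unfit
```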
Naïve Bayes Algorithm
◦ Naive Bayes uses Bayes' theorem to predict the probability of different classes based on various attributes.
This algorithm is mostly used in text classification and in problems having multiple classes.
◦ The dataset is divided into two parts, namely, feature matrix and the response vector.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the independent features. In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
• The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is 'Play Tennis'.
Bayes’ Theorem
◦ Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred.
Bayes' theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) * P(A) / P(B)
• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute
value of an unknown instance (here, event B).
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
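A quick numeric illustration of the theorem; the probabilities below are invented for the example:

```python
# A worked numeric example of Bayes' theorem (numbers invented for illustration)
p_a = 0.3          # P(A): prior probability of the event
p_b_given_a = 0.8  # P(B|A): probability of the evidence given the event
p_b = 0.4          # P(B): overall probability of the evidence

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.6
```

Seeing the evidence B raised the probability of A from the prior 0.3 to the posterior 0.6.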
Text Classification
◦ Naive Bayes classifiers have been heavily used for text classification and text analysis machine learning problems.
◦ Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of
symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them expect numerical feature
vectors of a fixed size rather than raw text documents of variable length.
◦ In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from
text content, namely:
• tokenizing strings and giving an integer id to each possible token, for instance by using whitespace and
punctuation as token separators.
• counting the occurrences of tokens in each document.
◦ In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
Count Vectorizer
◦ document = ["One Geek helps Two Geeks",
              "Two Geeks help Four Geeks",
              "Each Geek helps many other Geeks at GeeksforGeeks."]
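Running scikit-learn's `CountVectorizer` on the document list above performs exactly the tokenize-and-count scheme described earlier (this assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks."]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(document)  # 3 documents x vocabulary-size count matrix

print(sorted(vectorizer.vocabulary_))  # learned vocabulary (lowercased tokens)
print(matrix.toarray())                # token counts per document
```

Each row of the matrix is the multivariate sample for one document; each column is one token feature.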
◦ A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where
N is the number of target classes. The matrix compares the actual target values with those predicted by the
machine learning model.
◦ Precision is one indicator of a machine learning model's performance – the quality of a positive prediction
made by the model. Precision refers to the number of true positives divided by the total number of positive
predictions (i.e., the number of true positives plus the number of false positives).
◦ Recall is calculated as the ratio of the number of positive samples correctly classified as
positive to the total number of positive samples. Recall measures the model's ability to detect positive
samples: the higher the recall, the more positive samples are detected.
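Precision and recall follow directly from the confusion-matrix counts; the counts below are invented for illustration:

```python
# Precision and recall from confusion-matrix counts (values invented)
tp, fp, fn, tn = 40, 10, 5, 45  # true/false positives, false/true negatives

precision = tp / (tp + fp)  # quality of the positive predictions
recall = tp / (tp + fn)     # ability to detect positive samples

print(precision)         # 0.8
print(round(recall, 3))  # 0.889
```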