
Unit 1 - Machine Learning - NOTES1 - ML


Unit 1

Introduction to Machine Learning
Machine Learning

• Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed.
• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
• Machine learning algorithms are used in a wide variety of applications, such as medicine, email filtering, speech recognition, and computer vision.

• The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions.
• Machine learning algorithms use historical data as input to predict new output values.
• The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Machine Learning Examples
• Image Recognition
• Speech Recognition
• Medical Diagnosis
Types of Machine Learning
Supervised Learning

Supervised learning is when the model is trained on a labelled dataset.

A labelled dataset is one that has both input and output parameters.

In this type of learning, both the training and the validation datasets are labelled.

Figure A: A dataset from a shopping store, used to predict whether a customer will purchase a particular product based on his/her gender, age, and salary.

Input: Gender, Age, Salary

Output: Purchased, i.e. 0 or 1; 1 means the customer will purchase the product and 0 means the customer won’t.

While training the model, the data is commonly split in the ratio 80:20, i.e. 80% as training data and the rest as testing data.

For the training data, we feed the model both the input and the output. The model learns from the training data only.

Learning here means that the model builds some logic of its own.

Once the model has been trained, it is ready to be tested.

At testing time, the input is fed from the remaining 20% of the data, which the model has never seen before. The model predicts a value for each input, and we compare it with the actual output to calculate the accuracy.
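
The split-train-test procedure described above can be sketched in a few lines of Python. This is only an illustration: the rows are made-up values in the shape of Figure A, and the "model" is a deliberately simple age cut-off, not a real learning algorithm. The names `train_test_split`, `fit_age_threshold` and `accuracy` are hypothetical helpers written for this sketch.

```python
import random

# Hypothetical rows in the shape of Figure A: (gender, age, salary, purchased).
rows = [
    ("M", 19, 19000, 0), ("F", 26, 43000, 0), ("F", 27, 57000, 0),
    ("M", 35, 20000, 0), ("F", 32, 150000, 1), ("M", 47, 25000, 1),
    ("F", 45, 26000, 1), ("M", 46, 28000, 1), ("F", 48, 29000, 1),
    ("M", 25, 33000, 0),
]

def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_age_threshold(train):
    """Toy 'model': pick the age cut-off that best separates the training labels."""
    ages = sorted({age for _, age, _, _ in train})

    def correct(threshold):
        return sum((age >= threshold) == bool(y) for _, age, _, y in train)

    return max(ages, key=correct)

def accuracy(threshold, test):
    """Fraction of test rows whose predicted label matches the actual label."""
    hits = sum((age >= threshold) == bool(y) for _, age, _, y in test)
    return hits / len(test)

train, test = train_test_split(rows)          # 80% train, 20% test
threshold = fit_age_threshold(train)          # the model only sees training data
print(f"train={len(train)} test={len(test)} accuracy={accuracy(threshold, test):.0%}")
```

The test rows are used only in the final `accuracy` call, so the reported figure reflects data the model has not seen.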
Examples of Supervised Learning

Advertisement Popularity

Email Filtering

Face Recognition
Unsupervised Learning

Unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled.

The model itself finds the hidden patterns and insights in the given data. This can be compared to the learning that takes place in the human brain while learning new things.

The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures in unlabeled data.
E.g. grouping customers by purchase behavior, clustering similar images

The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never given labels for these images, so it has no prior idea which image shows a cat and which shows a dog. Its task is to identify the image features on its own.

The unsupervised learning algorithm performs this task by clustering the image dataset into groups according to the similarities between images.

It’s a type of learning where we don’t give the model a target while training, i.e. the training data has only input parameter values.

The model has to find out by itself what it can learn.

The dataset in Figure A is mall data containing information about the clients who subscribe to the mall. Once subscribed, they are given a membership card, so the mall has complete information about each customer and his/her every purchase.

Using this data and unsupervised learning techniques, the mall can easily group clients based on the parameters we feed in.
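
As a rough illustration of grouping clients without labels, the sketch below runs a plain k-means clustering on a handful of made-up customers. K-means is one common clustering technique, chosen here as an assumption; the notes do not name a specific algorithm. The customer values and the choice of starting centroids are hypothetical.

```python
import math

# Hypothetical mall customers: (age, yearly spend in thousands). No labels.
customers = [(22, 15), (25, 18), (23, 20), (51, 80), (48, 85), (55, 78)]

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iters=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(max_iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Mean of each group; an empty group keeps its old centroid.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:   # assignments have stopped changing
            break
        centroids = updated
    return centroids, groups

# Starting from the first and last customer is an arbitrary choice.
centroids, groups = k_means(customers, [customers[0], customers[-1]])
for centroid, group in zip(centroids, groups):
    print(centroid, group)
```

The algorithm is never told which customers belong together; the two groups emerge only from the distances between the input values.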
Semi-Supervised Learning

In this type of learning, the algorithm is trained on a combination of labeled and unlabelled data.

This combination typically contains a very small amount of labeled data and a very large amount of unlabelled data.

It uses unsupervised techniques to predict labels and then feeds these labels to supervised techniques. This approach is mostly applicable to image datasets, where usually not all images are labeled.
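
One way to picture "predict labels with an unsupervised technique, then feed them to a supervised technique" is the cluster-then-label sketch below. It is a minimal example under strong assumptions: the data is a made-up one-dimensional feature, there are exactly two classes, and each cluster happens to contain one labeled example. The helper names are hypothetical.

```python
from statistics import mean

# Hypothetical 1-D feature values; only two examples carry a label.
labeled = [(1.0, "cat"), (9.0, "dog")]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]

def closer_to(x, centers):
    """Index (0 or 1) of the center nearest to x."""
    return 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1

def two_means(values, iters=10):
    """Unsupervised step: split the values into two clusters (k-means, k=2)."""
    centers = [min(values), max(values)]
    for _ in range(iters):
        clusters = [[], []]
        for v in values:
            clusters[closer_to(v, centers)].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# 1) Cluster every value, ignoring the labels.
centers = two_means([x for x, _ in labeled] + unlabeled)

# 2) Name each cluster after the labeled example that falls inside it.
#    Assumes each cluster contains a labeled example.
cluster_label = {closer_to(x, centers): y for x, y in labeled}

# 3) Pseudo-label the unlabeled values to build a supervised training set.
training_set = labeled + [(x, cluster_label[closer_to(x, centers)]) for x in unlabeled]

# 4) Supervised step: a nearest-class-mean classifier fit on the training set.
class_means = {
    y: mean(x for x, lab in training_set if lab == y)
    for y in {lab for _, lab in training_set}
}

def predict(x):
    """Label of the class whose mean is closest to x."""
    return min(class_means, key=lambda y: abs(x - class_means[y]))

print(predict(3.0), predict(7.0))
```

Only two of the eight training examples were labeled by hand; the other six labels come from the clustering step.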
Reinforcement Learning

In this technique, the model keeps improving its performance by using reward feedback to learn the behavior or pattern.

It will make a lot of mistakes in the beginning.

As long as we provide some sort of signal that associates good behaviors with a positive signal and bad behaviors with a negative one, the learning algorithm learns to make fewer mistakes than it used to.

E.g. video games such as Mario

These algorithms are specific to a particular problem, e.g. Google's self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a better and better player of the game of Go.
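
The reward-feedback idea can be illustrated with a very small sketch. It assumes a toy environment with just two actions, one always rewarded with +1 and one always penalised with -1, and uses an epsilon-greedy rule, which is one simple strategy among many. It is far simpler than the systems named above, and all names and values are hypothetical.

```python
import random

# Hypothetical environment: action 0 is "bad" (reward -1), action 1 is "good" (+1).
REWARDS = {0: -1.0, 1: +1.0}

def learn(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy learner: mostly repeat the best-known action, sometimes explore."""
    rng = random.Random(seed)
    value = [0.0, 0.0]   # running average reward per action
    count = [0, 0]       # how often each action was tried
    mistakes = []        # 1 where the bad action was taken, else 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)            # explore a random action
        else:
            action = value.index(max(value))     # exploit the best estimate
        reward = REWARDS[action]                 # feedback signal
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
        mistakes.append(int(action == 0))
    return value, mistakes

value, mistakes = learn()
print("estimated values:", value)
print("mistakes in first 10 steps:", sum(mistakes[:10]))
print("mistakes in last 10 steps:", sum(mistakes[-10:]))
```

The learner starts with no knowledge, tries the bad action, receives the negative signal, and from then on prefers the good action except when it explores.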
ML Applications

Virtual Personal Assistants - Siri, Alexa, and Google Now are some popular examples of virtual personal assistants.

Email Spam and Malware Filtering - Email clients use a number of spam filtering approaches. To ensure that these spam filters are continuously updated, they are powered by machine learning.

Product Recommendations - Product recommendation is one of the most prominent features of almost every e-commerce website today, and it is an advanced application of machine learning techniques. Using machine learning and AI, websites track your behavior based on your previous purchases, your search patterns, and your cart history, and then make product recommendations.

Online Fraud Detection - Machine learning is proving its potential to make cyberspace a secure place, and tracking monetary fraud online is one example. For instance, PayPal uses ML for protection against money laundering.

Image Recognition - It is an approach for cataloging and detecting a feature or an object in a digital image, e.g. pattern recognition, face detection, or face recognition.

Sentiment Analysis - Sentiment analysis is a real-time machine
learning application that determines the emotion or opinion of
the speaker or the writer.
Machine Learning Life Cycle

The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

The problem must be understood first, because a good result depends on a good understanding of the problem.

In the complete life cycle, to solve a problem we create a machine learning system called a "model", and this model is created by "training" it. But to train a model we need data; hence, the life cycle starts with collecting data.
1. Collecting Data:

Identify the different data sources; data can be collected from various sources such as files, databases, the internet, or mobile devices.

The quantity and quality of the collected data determine the quality of the output. In general, the more good-quality data there is, the more accurate the prediction will be.

This step includes the following tasks:

Identify various data sources

Collect data

Integrate the data obtained from different sources

The resulting coherent set of data is called a dataset.
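
As a small illustration of the "integrate" task, the sketch below joins records from two made-up sources into one dataset. The CSV text, the record list, the `customer_id` key and the `integrate` helper are all assumptions invented for the example.

```python
import csv
import io

# Hypothetical source 1: a CSV export (inlined here so the sketch is self-contained).
csv_text = "customer_id,age\n1,34\n2,27\n3,45\n"
# Hypothetical source 2: records fetched from a database or web service.
db_records = [{"customer_id": 1, "salary": 52000}, {"customer_id": 3, "salary": 61000}]

def integrate(csv_text, records):
    """Join the two sources on customer_id; absent fields become None."""
    merged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        cid = int(row["customer_id"])
        merged[cid] = {"customer_id": cid, "age": int(row["age"]), "salary": None}
    for rec in records:
        entry = merged.setdefault(
            rec["customer_id"],
            {"customer_id": rec["customer_id"], "age": None, "salary": None},
        )
        entry["salary"] = rec["salary"]
    return [merged[cid] for cid in sorted(merged)]

dataset = integrate(csv_text, db_records)
for row in dataset:
    print(row)
```

Customer 2 appears in only one source, so its salary stays `None`; gaps like this are what the later cleaning steps deal with.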

2. Data preparation

Prepare the data for use in machine learning training.

This step can be further divided into two processes:

Data exploration:
It is used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome.

Data pre-processing:
Pre-processing of the data for its analysis.
3. Data Wrangling

It is the process of cleaning and converting raw data into a usable format: cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis.

Cleaning of data is required to address quality issues.

Collected data may have various issues, including:

Missing values

Duplicate data

Invalid data

Noise
4. Data Analysis

This step involves:

Selecting analytical techniques

Building models

Reviewing the result

The aim is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome.

It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis, or association. We then build the model using the prepared data and evaluate it.
5. Train Model

In this step we train the model to improve its performance and get a better outcome for the problem.

Training a model is required so that it can understand the various patterns, rules, and features.
6. Test Model


Once the machine learning model has been trained on a given dataset, we test it. In this step, we check the accuracy of the model by providing a test dataset to it.

Testing the model determines its percentage accuracy, as per the requirements of the project or problem.
7. Deployment


The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
AI vs ML
Artificial Intelligence: A technology which enables a machine to simulate human behavior.
Machine Learning: A subset of AI which allows a machine to learn automatically from past data without being explicitly programmed.

Artificial Intelligence: The goal is to make a smart computer system that, like humans, can solve complex problems.
Machine Learning: The goal is to allow machines to learn from data so that they can give accurate output.

Artificial Intelligence: We make intelligent systems to perform any task like a human.
Machine Learning: We teach machines with data to perform a particular task and give an accurate result.

Artificial Intelligence: Machine learning and deep learning are the two main subsets of AI.
Machine Learning: Deep learning is a main subset of machine learning.

Artificial Intelligence: Has a very wide scope.
Machine Learning: Has a limited scope.

Artificial Intelligence: Works to create an intelligent system which can perform various complex tasks.
Machine Learning: Works to create machines that can perform only those specific tasks for which they are trained.

Artificial Intelligence: Concerned with maximizing the chances of success.
Machine Learning: Mainly concerned with accuracy and patterns.

Artificial Intelligence: The main applications are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
Machine Learning: The main applications are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

Artificial Intelligence: On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI.
Machine Learning: Can also be divided into mainly three types: supervised learning, unsupervised learning, and reinforcement learning.
Data in Machine Learning

DATA: Any unprocessed fact, value, text, sound, or picture that has not been interpreted or analyzed. Data is the most important part of data analytics, machine learning, and artificial intelligence.

INFORMATION: Data that has been interpreted and manipulated and now carries some meaningful inference for its users.

KNOWLEDGE: A combination of inferred information, experiences, learning, and insights. It results in awareness or concept building for an individual or organization.

Training Data: The part of the data we use to train our model. This is the data that the model actually sees (both input and output) and learns from.

Validation Data: The part of the data used to evaluate the model frequently while it is being fit on the training dataset, and to tune its hyperparameters (parameters set before the model begins learning). This data plays its part while the model is actually training.

Testing Data: Once our model is completely trained, testing data provides an unbiased evaluation. When we feed in the inputs of the testing data, the model predicts some values (without seeing the actual output). After prediction, we evaluate the model by comparing its predictions with the actual output present in the testing data. This is how we see how much the model has learned from the experience fed in as training data.
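
The roles of the three parts can be sketched as follows. The example is an assumption-heavy illustration: the data is synthetic, the 60/20/20 split is just one possible choice, and a tiny k-nearest-neighbours classifier stands in for "the model", with k as its hyperparameter.

```python
import random
from collections import Counter

rng = random.Random(0)

def make_row():
    """Hypothetical noisy example: label is 1 when x > 50, flipped 10% of the time."""
    x = rng.uniform(0, 100)
    y = int(x > 50)
    if rng.random() < 0.1:
        y = 1 - y
    return (x, y)

data = [make_row() for _ in range(200)]
# 60% training, 20% validation, 20% testing (the ratios are an assumption).
train, validation, test = data[:120], data[120:160], data[160:]

def knn_predict(x, k, train_rows):
    """Majority label among the k training rows nearest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def accuracy(k, rows, train_rows):
    """Share of rows whose predicted label matches the actual label."""
    return sum(knn_predict(x, k, train_rows) == y for x, y in rows) / len(rows)

# The hyperparameter k is chosen on the validation data, never on the test data.
best_k = max([1, 3, 5, 7, 9], key=lambda k: accuracy(k, validation, train))
print("best k:", best_k)
print("test accuracy:", accuracy(best_k, test, train))
```

Because the test rows play no part in choosing `best_k`, the final figure is an evaluation on data the tuning process never touched.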

Properties of Data –


Volume: Scale of data. With a growing world population and increasing exposure to technology, huge amounts of data are generated every millisecond.

Variety: Different forms of data, e.g. healthcare records, images, videos, audio clips.

Velocity: Rate of data streaming and generation.

Value: Meaningfulness of data in terms of the information that researchers can infer from it.

Veracity: Certainty and correctness of the data we are working on.
Data Processing
Data Cleaning

Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data and then modifying, replacing, or deleting them as necessary.
Inconsistent column

If a DataFrame (a two-dimensional data structure, i.e. data aligned in a tabular fashion in rows and columns) contains columns that are irrelevant or will never be used, they can be dropped to give more focus to the remaining columns.
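
Dropping an unused column can be sketched without any library by treating the table as a list of row dictionaries. The column names and the `drop_columns` helper are hypothetical. With the pandas library, `DataFrame.drop(columns=[...])` performs the equivalent operation on a real DataFrame.

```python
# Hypothetical table stored as a list of row dictionaries.
rows = [
    {"roll_no": 1, "math": 50, "science": 55, "locker_color": "red"},
    {"roll_no": 2, "math": 100, "science": 90, "locker_color": "blue"},
]

def drop_columns(rows, unwanted):
    """Return new rows without the unwanted columns; the input is left unchanged."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]

cleaned = drop_columns(rows, {"locker_color"})
print(cleaned)
```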
Missing data:

Most datasets contain missing values.

Handling missing values is very important because they may affect your analysis and machine learning models.

If you find missing values in the dataset, you can do one of three things:

1. Leave them as they are

2. Fill in the missing values

3. Drop them
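
The three options can be compared side by side in a short sketch. The marks are made up, `None` is used to mark a missing value, and filling with the mean of the known values is only one possible filling strategy.

```python
from statistics import mean

# Hypothetical marks; None marks a missing value.
rows = [
    {"roll_no": 1, "math": 50},
    {"roll_no": 2, "math": None},
    {"roll_no": 3, "math": 80},
]

# Option 1: leave as is (nothing to do, but later code must tolerate None).
left_as_is = rows

# Option 2: fill each gap, here with the mean of the known values.
known = [r["math"] for r in rows if r["math"] is not None]
filled = [{**r, "math": mean(known) if r["math"] is None else r["math"]} for r in rows]

# Option 3: drop every row that has a missing value.
dropped = [r for r in rows if r["math"] is not None]

print(filled)
print(dropped)
```

Which option is appropriate depends on how much data is missing and on how sensitive the later analysis is to the filled-in values.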
Outliers:

“In statistics, an outlier is a data point that differs significantly from other observations.”

That is, an outlier is a data point that is significantly different from the other data points in the dataset.

Outliers can be caused by errors in experiments or by variability in measurements.

For example, suppose all the values in a math column lie between 90 and 95 except one value of 20, which is significantly different from the others. It may be an input error in the dataset, so we can call it an outlier. One thing should be added here: not all outliers are bad data points. Some can be errors, but others are valid values.
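
One common way to flag such a value automatically is the interquartile range (IQR) rule, sketched below. The rule and its usual factor of 1.5 are a convention brought in for illustration, not something these notes prescribe, and the marks are made-up values matching the description above.

```python
from statistics import quantiles

# Hypothetical math marks: all between 90 and 95 except one value of 20.
math_marks = [92, 94, 90, 95, 20, 93, 91]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

print(iqr_outliers(math_marks))
```

The function only flags candidates; whether a flagged value is an input error or a valid but unusual value still has to be judged by someone who knows the data.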
Duplicate rows:

Datasets may contain duplicate entries. Deleting duplicate rows is one of the easiest cleaning tasks.
Roll No   Math   Science
1         50     55
2         100    90
3         80     85
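
Removing duplicates can be sketched in a couple of lines. The rows below reuse the values from the table above plus one repeated entry, which is added here as an assumption for illustration.

```python
# Hypothetical rows (roll_no, math, science) with one repeated entry.
rows = [(1, 50, 55), (2, 100, 90), (1, 50, 55), (3, 80, 85)]

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```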
Data cleansing tools

OpenRefine

Trifacta Wrangler

TIBCO Clarity

Cloudingo

IBM InfoSphere QualityStage
Tidy data set:

A tidy dataset is one in which each column represents a separate variable and each row represents an individual observation. In untidy data, by contrast, columns represent values rather than variables. Tidying data helps to fix common data problems.
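
The difference can be seen in a small sketch that reshapes an untidy table into a tidy one. The table, its column names, and the `tidy` helper are hypothetical; in the untidy version the column names "2019" and "2020" are really values of a "year" variable.

```python
# Hypothetical untidy table: the column names "2019" and "2020" are values
# of a "year" variable rather than variables themselves.
untidy = [
    {"student": "A", "2019": 70, "2020": 75},
    {"student": "B", "2019": 60, "2020": 68},
]

def tidy(rows, id_column, variable, value):
    """Turn every non-id column into its own (variable, value) row."""
    return [
        {id_column: row[id_column], variable: name, value: cell}
        for row in rows
        for name, cell in row.items()
        if name != id_column
    ]

for row in tidy(untidy, "student", "year", "marks"):
    print(row)
```

After reshaping, each row is one observation (one student in one year) and each column is one variable (student, year, marks).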
