Unit 1 - Machine Learning - NOTES1 - ML
Introduction to Machine Learning
Machine Learning
Duplicate data
4. Data Analysis
This step involves:
Selection of analytical techniques
Building models
Reviewing the result
The aim of this step is to build a machine learning model that analyzes the
data using various analytical techniques, and then to review the outcome.
It starts with determining the type of problem, which guides the choice of
machine learning technique, such as classification, regression, cluster
analysis, or association. We then build the model using the prepared data and
evaluate it.
5. Train Model
In this step, we train the model to improve its performance and obtain a
better outcome for the problem.
Training is required so that the model can learn the various patterns,
rules, and features in the data.
6. Test Model
Once the machine learning model has been trained on a given dataset, we test
it. In this step, we check the accuracy of the model by providing a test
dataset to it.
Testing determines the percentage accuracy of the model against the
requirements of the project or problem.
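The train and test steps can be sketched in a few lines of plain Python. This is only a minimal illustration: the toy two-feature dataset, the 1-nearest-neighbour classifier, the 75/25 split, and the function names `split`, `predict`, and `accuracy` are all assumptions made for the example and are not part of these notes.

```python
import random

# Hypothetical toy dataset: each row is ((feature_1, feature_2), label).
data = [
    ((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((0.8, 1.0), "A"), ((1.1, 1.3), "A"),
    ((5.0, 5.2), "B"), ((5.1, 4.9), "B"), ((4.8, 5.0), "B"), ((5.3, 5.1), "B"),
]

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (training set, test set)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def predict(train, point):
    """1-nearest-neighbour: return the label of the closest training point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(train, key=lambda row: sq_dist(row[0], point))
    return nearest[1]

def accuracy(train, test):
    """Percentage of test rows whose predicted label matches the true label."""
    correct = sum(predict(train, x) == y for x, y in test)
    return 100.0 * correct / len(test)

train_set, test_set = split(data)
print(f"accuracy: {accuracy(train_set, test_set):.2f}%")
```

The model only ever sees `train_set`; the accuracy is computed on `test_set`, which it has not seen, and that is what makes the figure a fair estimate of how the model behaves on new data.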
7. Deployment
The last step of the machine learning life cycle is deployment, where we
deploy the model in a real-world system.
AI vs ML
1. Artificial Intelligence: Artificial intelligence is a technology which
   enables a machine to simulate human behavior.
   Machine Learning: Machine learning is a subset of AI which allows a
   machine to automatically learn from past data without being explicitly
   programmed.
2. Artificial Intelligence: The goal of AI is to make a smart computer system
   like humans to solve complex problems.
   Machine Learning: The goal of ML is to allow machines to learn from data
   so that they can give accurate output.
3. Artificial Intelligence: In AI, we make intelligent systems to perform any
   task like a human.
   Machine Learning: In ML, we teach machines with data to perform a
   particular task and give an accurate result.
4. Artificial Intelligence: Machine learning and deep learning are the two
   main subsets of AI.
   Machine Learning: Deep learning is a main subset of machine learning.
5. Artificial Intelligence: AI has a very wide range of scope.
   Machine Learning: Machine learning has a limited scope.
Volume: Scale of data. With the growing world population and increasing
exposure to technology, huge amounts of data are being generated every
millisecond.
Variety: Different forms of data – healthcare, images, videos, audio clippings.
Velocity: Rate of data streaming and generation.
Value: Meaningfulness of data in terms of information that researchers can infer
from it.
Veracity: Certainty and correctness of the data we are working with.
Data Processing
Data Cleaning
Data cleaning is the process of identifying the incorrect,
incomplete, inaccurate, irrelevant, or missing parts of the data and
then modifying, replacing, or deleting them as necessary.
Inconsistent column:
If a DataFrame (a two-dimensional data structure in which data is
aligned in a tabular fashion in rows and columns) contains columns
that are irrelevant or will never be used, they can be dropped so
that the analysis focuses on the remaining columns.
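Dropping unused columns can be sketched as follows. The example uses a plain list of dictionaries in place of a DataFrame so that it runs without extra libraries; the records, the `notes` column, and the helper `drop_columns` are hypothetical.

```python
# Hypothetical records; "notes" is assumed to be irrelevant to the analysis.
rows = [
    {"roll_no": 1, "math": 50, "notes": "n/a"},
    {"roll_no": 2, "math": 100, "notes": ""},
]

def drop_columns(records, unwanted):
    """Return new records without the columns named in `unwanted`."""
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]

cleaned = drop_columns(rows, {"notes"})
print(cleaned)
```

The function builds new records rather than modifying the originals, so the raw data stays available if a dropped column turns out to be needed later.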
Missing data:
Most datasets contain missing values.
Handling missing values is very important because they may affect your
analysis and machine learning models.
If you find any missing values in the dataset, you can do any of these
three things:
1. Leave them as they are
2. Fill in the missing values
3. Drop them
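The three options can be sketched in plain Python. The scores are hypothetical, `None` is assumed to mark a missing value, and filling with the mean of the observed values is just one common choice among several (median or a fixed constant are others).

```python
from statistics import mean

# Hypothetical scores; None marks a missing value.
scores = [50, None, 80, 110, None]

# 1. Leave as is: keep the list unchanged and let later steps handle None.
left_as_is = scores

# 2. Fill: replace each missing value with the mean of the observed values.
observed = [s for s in scores if s is not None]
fill_value = mean(observed)
filled = [fill_value if s is None else s for s in scores]

# 3. Drop: remove the missing entries altogether.
dropped = observed

print(filled)
print(dropped)
```

Which option is appropriate depends on the data: dropping is simple but loses rows, while filling keeps every row at the cost of introducing values that were never actually measured.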
Outliers:
“In statistics, an outlier is a data point that differs significantly
from other observations.”
That is, an outlier is a data point that is significantly
different from the other data points in the dataset.
Outliers can arise from errors in the experiment or from
variability in the measurements.
For example, suppose all the values in a math column lie between
90 and 95 except one value of 20, which is significantly different
from the others. It may be an input error in the dataset, so we can
call it an outlier. One thing should be added here: “Not all
outliers are bad data points. Some can be errors, but others are
valid values.”
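One way to flag a value such as the 20 in the math column is the interquartile range (IQR) rule. This is a commonly used convention brought in for illustration, not a method stated in these notes; the sample scores, the function name `iqr_outliers`, and the multiplier `k=1.5` are assumptions.

```python
from statistics import quantiles

# Assumed math scores: all between 90 and 95 except a single 20.
math_scores = [92, 90, 95, 20, 93, 91, 94]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers(math_scores))
```

The function only flags candidates. Because some outliers are valid values, flagged points should be reviewed before they are corrected or removed.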
Duplicate rows:
Datasets may contain duplicate entries. Deleting duplicate rows is
one of the easiest cleaning tasks.
Roll No Math Science
1 50 55
2 100 90
3 80 85
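Removing duplicate rows can be sketched as below. The rows reuse the Roll No, Math, and Science values from the table above, with one duplicate row added purely for illustration; the helper `drop_duplicates` is a hypothetical name.

```python
# Rows are (roll_no, math, science).
rows = [
    (1, 50, 55),
    (2, 100, 90),
    (2, 100, 90),  # hypothetical duplicate added for illustration
    (3, 80, 85),
]

def drop_duplicates(records):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in records:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

print(drop_duplicates(rows))
```

This sketch treats two rows as duplicates only when every field matches; deciding whether rows that differ in a single field are "the same" record is a separate judgment that depends on the dataset.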
Data cleansing tools
OpenRefine
Trifacta Wrangler
TIBCO Clarity
Cloudingo
IBM InfoSphere QualityStage
Tidy data set:
A tidy dataset is one in which each column represents a separate
variable and each row represents an individual observation. In
untidy data, by contrast, columns represent values rather than
variables. Tidying data helps fix common data problems.
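Reshaping untidy data into tidy form can be sketched as follows. The example assumes that the subject is itself a variable, so a layout with one column per subject (`math`, `science`) stores values of that variable in the column names. The records, the helper `melt`, and the output column names `subject` and `score` are all hypothetical.

```python
# Untidy (wide) layout: the subject names appear as column names.
wide = [
    {"roll_no": 1, "math": 50, "science": 55},
    {"roll_no": 2, "math": 100, "science": 90},
    {"roll_no": 3, "math": 80, "science": 85},
]

def melt(records, id_key, value_keys, var_name="subject", value_name="score"):
    """Turn one row per student into one row per (student, subject) pair."""
    return [
        {id_key: r[id_key], var_name: key, value_name: r[key]}
        for r in records
        for key in value_keys
    ]

tidy = melt(wide, "roll_no", ["math", "science"])
for row in tidy:
    print(row)
```

In the tidy result every column is a variable (`roll_no`, `subject`, `score`) and every row is a single observation, which makes filtering or grouping by subject straightforward.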