Part 2 Introduction To ML

Uploaded by

sujatakaya
Copyright
© All Rights Reserved

Terminologies of Machine Learning

• Model: A model is a specific representation learned from data by applying a machine
learning algorithm. A model is also called a hypothesis.

• Feature: A feature is an individual measurable property of the data. A set of numeric features
can be conveniently described by a feature vector. Feature vectors are fed as input to the
model. For example, to predict the type of a fruit, the features might be color, smell,
taste, etc. Note: Choosing informative, discriminating, and independent features is a crucial step
for effective algorithms. We generally employ a feature extractor to extract the relevant
features from the raw data.

• Target (Label): A target variable or label is the value to be predicted by the model. For the
fruit example discussed under Feature, the label for each set of inputs would be the
name of the fruit, such as apple, orange, or banana.

• Training: The idea is to give the algorithm a set of inputs (features) and their expected outputs
(labels), so that after training we have a model (hypothesis) that maps new data to one of the
categories it was trained on.

• Prediction: Once the model is ready, it can be fed a set of inputs, for which it provides a
predicted output (label). A model can be said to perform well only if it performs well on
unseen data.
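
The terms above can be tied together in a short sketch. It assumes scikit-learn is installed. The two numeric features (weight in grams and a 0-10 skin smoothness score) and all the values are made up for illustration and do not come from these notes.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors: [weight in grams, skin smoothness from 0 to 10].
X_train = [[150, 9], [170, 8], [140, 9], [130, 3], [160, 2], [155, 3]]
# Targets (labels): the fruit name that belongs to each feature vector.
y_train = ["apple", "apple", "apple", "orange", "orange", "orange"]

# Training: the algorithm learns a model (hypothesis) from features and labels.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: the trained model maps unseen feature vectors to labels.
X_new = [[145, 9], [150, 2]]
predictions = model.predict(X_new)
print(predictions)
```

A decision tree is only one possible choice of algorithm here; any classifier with `fit` and `predict` would illustrate the same vocabulary.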

The figure below illustrates the above concepts:

Here are the steps to get started with machine learning:

1. Define the Problem: Identify the problem you want to solve and determine if machine learning
can be used to solve it.

2. Collect Data: Gather and clean the data that you will use to train your model. The quality of
your model will depend on the quality of your data.

3. Explore the Data: Use data visualization and statistical methods to understand the structure and
relationships within your data.

4. Pre-process the Data: Prepare the data for modeling by normalizing, transforming, and
cleaning it as necessary.

5. Split the Data: Divide the data into training and test datasets to validate your model.

6. Choose a Model: Select a machine learning model that is appropriate for your problem and the
data you have collected.

7. Train the Model: Use the training data to train the model, adjusting its parameters to fit the data
as accurately as possible.

8. Evaluate the Model: Use the test data to evaluate the performance of the model and determine
its accuracy.

9. Fine-tune the Model: Based on the results of the evaluation, fine-tune the model by adjusting
its parameters and repeating the training process until the desired level of accuracy is achieved.

10. Deploy the Model: Integrate the model into your application or system, making it available for
use by others.

11. Monitor the Model: Continuously monitor the performance of the model to ensure that it
continues to provide accurate results over time.
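
As a rough illustration of steps 2 to 8, the sketch below runs a compact workflow with scikit-learn. It assumes scikit-learn is installed and uses the small built-in Iris dataset as a stand-in for data you would collect yourself. The choice of a k-nearest-neighbours model and of the parameter values is an assumption made for the example, not a recommendation from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Steps 2-3: collect and explore the data (a small built-in dataset here).
X, y = load_iris(return_X_y=True)
print("samples, features:", X.shape)

# Step 5: split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 4: pre-process; the scaler is fitted on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Steps 6-7: choose and train a model.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)

# Step 8: evaluate on the held-out test data.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"test accuracy: {accuracy:.2f}")
```

Splitting before scaling keeps information from the test set out of the pre-processing step, which is why step 5 appears before step 4 in the code.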

Machine learning Life cycle


Machine learning gives computer systems the ability to learn automatically without being
explicitly programmed. But how does a machine learning system work? It can be described using
the machine learning life cycle, a cyclic process for building an efficient machine learning
project. The main purpose of the life cycle is to find a solution to the problem or project.
The machine learning life cycle involves seven major steps, which are given below:
• Gathering Data
• Data Preparation
• Data Wrangling
• Data Analysis
• Train the Model
• Test the Model
• Deployment

The most important thing in the complete process is to understand the problem and to know its
purpose. Therefore, before starting the life cycle, we need to understand the problem,
because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system called
a "model", and this model is created by "training" it. But to train a model we need data; hence,
the life cycle starts with collecting data.
1. Gathering Data
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify
and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of
the life cycle. The quantity and quality of the collected data determine the quality of the
output. In general, the more good-quality data we have, the more accurate the predictions tend to be.
This step includes the following tasks:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used
in the further steps.
2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step
where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:
• Data exploration:

It is used to understand the nature of the data that we have to work with. We need to understand
the characteristics, format, and quality of the data. A better understanding of the data leads to a
more effective outcome. Here we look for correlations, general trends, and outliers.
• Data pre-processing:

The next step is to pre-process the data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a usable format. It involves
cleaning the data, selecting the variables to use, and transforming the data into a proper format
to make it more suitable for analysis in the next step. It is one of the most important steps of the
complete process. Cleaning the data is required to address quality issues.
Not all of the data we have collected is necessarily useful. In real-world applications, collected
data may have various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise
So, we use various filtering techniques to clean the data.
It is important to detect and remove the above issues because they can negatively affect the quality of
the outcome.
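
The four issues listed above can be illustrated with a small pandas sketch. It assumes pandas and NumPy are installed. The column names, the values, and the validity limits (ages between 0 and 120, salaries below 10,000,000) are hypothetical choices made for the example, and treating an extreme salary as noise is a simplification.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing the four kinds of issues.
raw = pd.DataFrame(
    {
        "age": [25, 25, np.nan, 40, -3, 31],
        "salary": [50000, 50000, 42000, 61000, 58000, 1_000_000_000],
    }
)

clean = raw.drop_duplicates()                  # duplicate data
clean = clean.dropna()                         # missing values
clean = clean[clean["age"].between(0, 120)]    # invalid data (negative age)
clean = clean[clean["salary"] < 10_000_000]    # implausible value treated as noise
clean = clean.reset_index(drop=True)

print(clean)
```

Dropping rows is the simplest filter; in practice, missing values are often imputed instead so that less data is lost.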
4. Data Analysis
The cleaned and prepared data is now passed on to the analysis step. This step involves:
• Selection of analytical techniques
• Building models
• Reviewing the result
The aim of this step is to build a machine learning model to analyze the data using various analytical
techniques and to review the outcome. It starts with determining the type of problem, where
we select a machine learning technique such as classification, regression, cluster analysis, or
association; we then build the model using the prepared data and evaluate it.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its performance
and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is
required so that it can learn the various patterns, rules, and features in the data.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test it. In this
step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines its percentage accuracy, which is compared against the requirements
of the project or problem.
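
Percentage accuracy is the share of test examples that the model labels correctly. The sketch below computes it by hand; the labels and predictions are invented for illustration.

```python
# Hypothetical true labels of a test dataset and the model's predictions for it.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

# Count the positions where the prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy_percent = 100 * correct / len(y_true)

print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy_percent:.1f}%")
```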
7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model in a
real-world system.
If the model prepared above produces accurate results as per our requirements, with acceptable
speed, then we deploy it in the real system. But before deploying the project, we check
whether it is improving its performance using the available data. The deployment phase is similar
to making the final report for a project.

Difference between Artificial intelligence and Machine learning


Artificial intelligence and machine learning are closely related parts of computer science.
These two technologies are among the most popular technologies used for creating
intelligent systems.
Although the two are related and people sometimes use the terms as synonyms for each
other, they are distinct terms.
Artificial Intelligence
Artificial intelligence is a field of computer science that builds computer systems that can mimic
human intelligence. The term comprises two words, "artificial" and "intelligence", and means "a
human-made thinking power."
Artificial intelligence is a technology with which we can create intelligent systems that can
simulate human intelligence.
An artificial intelligence system does not need to be pre-programmed for every case; instead, it uses
algorithms that can work with their own intelligence. It involves machine learning algorithms
such as reinforcement learning algorithms and deep learning neural networks. AI is used in
many places, such as Siri, Google's AlphaGo, and chess-playing programs.
Based on capabilities, AI can be classified into three types:
• Weak AI
• General AI
• Strong AI
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which
is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from data. It can be defined as follows:
Machine learning is a subfield of artificial intelligence that enables machines to learn from
past data or experiences without being explicitly programmed.
Machine learning enables a computer system to make predictions or take decisions using
historical data without being explicitly programmed. Machine learning uses a large amount of
structured and semi-structured data so that a machine learning model can generate accurate results
or give predictions based on that data.
Machine learning works on algorithms that learn on their own using historical data. It works only for
specific domains: for example, if we create a machine learning model to detect pictures of dogs, it
will give meaningful results only for dog images; if we provide new data such as a cat image, its
output will not be reliable. Machine learning is used in various places, such as online recommender
systems, Google search algorithms, email spam filters, and Facebook auto friend tagging suggestions.
It can be divided into three types:
• Supervised learning
• Reinforcement learning
• Unsupervised learning
Machine Learning (ML) vs Traditional Programming vs Artificial Intelligence (AI):

Machine Learning | Traditional Programming | Artificial Intelligence

Machine learning is a subset of artificial intelligence (AI) that focuses on learning from data to develop an algorithm that can be used to make predictions. | In traditional programming, rule-based code is written by developers according to the problem statement. | Artificial intelligence involves making machines as capable as possible, so that they can perform tasks that typically require human intelligence.

Machine learning uses a data-driven approach. A model is typically trained on historical data and then used to make predictions on new data. | Traditional programming is typically rule-based and deterministic. It has no self-learning features like those of machine learning and AI. | AI can involve many different techniques, including machine learning and deep learning, as well as traditional rule-based programming.

ML can find patterns and insights in large datasets that might be difficult for humans to discover. | Traditional programming depends entirely on the intelligence of the developers, so its capability is limited. | AI sometimes uses a combination of data and pre-defined rules, which gives it a great edge in solving, with good accuracy, complex tasks that seem impossible for humans.

Machine learning is a subset of AI and is now used in various AI-based tasks such as chatbot question answering and self-driving cars. | Traditional programming is often used to build applications and software systems that have specific functionality. | AI is a broad field that includes many different applications, including natural language processing, computer vision, and robotics.
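
The contrast between the first two columns can be sketched in a few lines of plain Python. The task (deciding whether a temperature reading counts as "hot"), the data, and the function names are hypothetical. The learning step is a deliberately tiny stand-in for a real algorithm: it picks the single threshold that makes the fewest mistakes on the training examples.

```python
# Traditional programming: the developer hard-codes the rule.
def is_hot_rule_based(temperature_c):
    return temperature_c > 30  # threshold chosen by the developer


# Machine learning: the rule (here, a single threshold) is learned from data.
def learn_threshold(temperatures, labels):
    """Return the candidate threshold with the fewest training errors."""
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(temperatures):
        errors = sum((t > candidate) != label for t, label in zip(temperatures, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical historical data: readings and whether people called them hot.
temperatures = [18, 22, 24, 26, 27, 29, 33, 35]
labels = [False, False, False, False, True, True, True, True]

threshold = learn_threshold(temperatures, labels)
print("learned threshold:", threshold)
print("rule-based says 28 is hot:", is_hot_rule_based(28))
print("learned rule says 28 is hot:", 28 > threshold)
```

With different historical data the learned threshold would change without any edit to the code, whereas the hard-coded rule stays fixed until a developer rewrites it.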

How to get datasets for Machine Learning


The field of ML depends heavily on datasets for training models and making accurate predictions.
Datasets play a vital role in the success of AI/ML projects and are essential for becoming a
skilled data scientist. In this article, we will explore the various types of datasets used
in ML and give a detailed guide on where to find them.

What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain
anything from a simple array to a database table. The table below shows an example of a dataset:
Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes

A tabular dataset can be understood as a database table or matrix, where each column corresponds to
a particular variable and each row corresponds to one record of the dataset. The most widely
supported file type for a tabular dataset is the comma-separated values (CSV) file. To store
tree-like data, a JSON file is usually a better fit.
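
The sketch below stores the example table as CSV text and reads it back with Python's standard library, then regroups the same records into a nested structure and prints it as JSON. The nested layout (records grouped by country) is a hypothetical illustration of tree-like data, not something defined in these notes.

```python
import csv
import io
import json

# The example table as CSV text; the blank Salary value is kept blank.
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

# Each row becomes a dictionary keyed by the column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows), "rows; columns:", list(rows[0].keys()))
print("row with missing salary:", rows[4])

# Tree-like (nested) data: records grouped under their country.
tree = {}
for row in rows:
    tree.setdefault(row["Country"], []).append(
        {"age": int(row["Age"]), "purchased": row["Purchased"] == "Yes"}
    )
print(json.dumps(tree, indent=2))
```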

Types of data in datasets


• Numerical data: Such as house price, temperature, etc.
• Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
• Ordinal data: These data are similar to categorical data, but their values have a meaningful
order and can be compared (for example, low < medium < high).
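
One way to represent the three kinds of data in pandas is sketched below. It assumes pandas is installed; the column names and values are hypothetical. The ordinal column is stored as an ordered categorical so that comparisons between its values are meaningful.

```python
import pandas as pd

# Hypothetical columns illustrating the three kinds of data.
df = pd.DataFrame(
    {
        "house_price": [250000.0, 310000.0, 180000.0],  # numerical
        "has_garage": ["Yes", "No", "Yes"],              # categorical (no order)
        "condition": ["medium", "high", "low"],          # ordinal (ordered)
    }
)

# An ordered categorical records that low < medium < high.
df["condition"] = pd.Categorical(
    df["condition"], categories=["low", "medium", "high"], ordered=True
)

codes = df["condition"].cat.codes.tolist()       # integer rank of each value
above_low = (df["condition"] > "low").tolist()   # comparisons are meaningful
print(codes)
print(above_low)
```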
Real-world datasets are often very large, which makes them difficult to manage and process at the
outset. Therefore, to practice machine learning algorithms, we can start with a small dummy dataset.

Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. A few
common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are commonly used in computer vision tasks
such as image classification, object detection, and image segmentation.
Examples:
• ImageNet
• CIFAR-10
• MNIST
Text Datasets:
Text datasets consist of textual data, such as articles, books, or social media posts. These
datasets are used in NLP tasks like sentiment analysis, text classification, and machine
translation.
Examples:
• Project Gutenberg dataset
• IMDb movie reviews dataset
Time Series Datasets:
Time series datasets contain data points collected over time. They are generally
used in forecasting, anomaly detection, and trend analysis.
Examples:
• Stock market data
• Weather data
• Sensor readings
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain
rows representing instances or samples and columns representing features or attributes. Tabular
datasets are used for tasks like regression and classification. The dataset given earlier in this
document is an example of a tabular dataset.

Need of Dataset
• Well-prepared and pre-processed datasets are essential for machine learning projects.
• They provide the foundation for training accurate and reliable models. However, working
with large datasets can present challenges in terms of management and processing.
• To address these challenges, efficient data management techniques and processing algorithms
are required.

Data Pre-processing:
Data pre-processing is a fundamental step in preparing datasets for machine learning. It involves
transforming raw data into a format suitable for model training. Common pre-processing
techniques include data cleaning to remove inconsistencies or errors, normalization to scale
data within a specific range, feature scaling to ensure features have comparable ranges, and
handling missing values through imputation or removal.
During the development of an ML project, the developers rely completely on the datasets. In building
ML applications, datasets are divided into two parts:
• Training dataset
• Test dataset
Training Dataset and Test Dataset:
In machine learning, datasets are usually split into two parts: the training dataset and the
test dataset. The training dataset is used to train the machine learning model, while the test
dataset is used to evaluate the model's performance. This split makes it possible to assess the
model's ability to generalize to unseen data. It is essential to ensure that the datasets are
representative of the problem domain and are split appropriately to avoid bias or overfitting.

Data Preprocessing in Machine Learning


Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.
When creating a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with the data, it must be cleaned and put into a
consistent format. For this, we use data preprocessing.

Necessity of Data Preprocessing


Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be used directly by machine learning models. Data preprocessing is required for
cleaning the data and making it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.
It involves the following steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling

1. Getting the Dataset


This is the first step in data preprocessing. You need to collect the data that you want to use for your
machine learning project. The data can come from various sources such as databases, web scraping,
APIs, or existing datasets from repositories like Kaggle, UCI Machine Learning Repository, etc.
2. Importing Libraries
To perform data preprocessing and build machine learning models, you need to import the necessary
libraries. Commonly used libraries include:
• Pandas: for data manipulation and analysis.
• NumPy: for numerical operations.
• Scikit-learn: for machine learning algorithms and preprocessing utilities.
• Matplotlib/Seaborn: for data visualization.
Python code (import the libraries needed in the program)
import pandas as pd                                    # data manipulation and analysis
import numpy as np                                     # numerical operations
from sklearn.model_selection import train_test_split   # train/test splitting
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder  # scaling and encoding
import matplotlib.pyplot as plt                        # plotting
import seaborn as sns                                  # statistical visualization
3. Importing Datasets
After importing the necessary libraries, the next step is to load your dataset into your
environment. This can be done using Pandas for the most common data formats, such as CSV, Excel,
and SQL databases.
Python code (import the dataset)
dataset = pd.read_csv('data.csv')  # 'data.csv' is a placeholder; replace it with the path to your own dataset
4. Finding Missing Data
Real-world data often has missing values. It is important to identify and handle these missing values
appropriately. Common strategies include:
• Removing rows or columns with missing values.
• Imputing missing values with the mean, median, mode, or a constant value.
Python code
# Check for missing values (count per column)
print(dataset.isnull().sum())

# Impute missing values (example: fill numeric columns with their column mean;
# numeric_only=True skips text columns, which have no mean)
dataset.fillna(dataset.mean(numeric_only=True), inplace=True)
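
The two strategies can be tried on the example dataset from earlier in these notes, where one Salary value is blank. The sketch assumes pandas and NumPy are installed and represents the blank cell as NaN.

```python
import numpy as np
import pandas as pd

# The example dataset from earlier in these notes; the blank Salary becomes NaN.
dataset = pd.DataFrame(
    {
        "Country": ["India", "France", "Germany", "France", "Germany", "India"],
        "Age": [38, 43, 30, 48, 40, 35],
        "Salary": [48000, 45000, 54000, 65000, np.nan, 58000],
        "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
    }
)

# Count missing values per column.
missing_counts = dataset.isnull().sum()
print(missing_counts)

# Strategy 1: drop the rows that contain a missing value.
dropped = dataset.dropna()

# Strategy 2: impute the numeric column with its mean.
imputed = dataset.copy()
imputed["Salary"] = imputed["Salary"].fillna(imputed["Salary"].mean())
print(imputed)
```

Dropping loses one of only six rows here, which is why imputation is often preferred on small datasets.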
5. Encoding Categorical Data
Machine learning models require numerical input, so categorical data needs to be converted into
numerical form. This can be done using techniques like:
• Label Encoding: Converts categories to numeric labels.
• One-Hot Encoding: Converts categories to binary columns.
Python code
# 'Category' is a placeholder column name. Use one of the two options below, not both.

# Option 1: Label Encoding (one integer per category)
labelencoder = LabelEncoder()
dataset['Category'] = labelencoder.fit_transform(dataset['Category'])

# Option 2: One-Hot Encoding (one binary column per category)
dataset = pd.get_dummies(dataset, columns=['Category'])
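
A small runnable sketch of both encodings, using two columns of the example dataset from earlier in these notes. It assumes pandas and scikit-learn are installed. Applying label encoding to the target and one-hot encoding to the Country feature is a common convention, chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two columns taken from the example dataset earlier in these notes.
dataset = pd.DataFrame(
    {
        "Country": ["India", "France", "Germany", "France", "Germany", "India"],
        "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
    }
)

# Label encoding for the binary target: classes are sorted, so No -> 0, Yes -> 1.
label_encoder = LabelEncoder()
dataset["Purchased"] = label_encoder.fit_transform(dataset["Purchased"])

# One-hot encoding for an unordered feature: one 0/1 column per country.
encoded = pd.get_dummies(dataset, columns=["Country"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between countries, which integer labels such as France = 0, Germany = 1, India = 2 would suggest to some models.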
6. Splitting Dataset into Training and Test Set
To evaluate the performance of the machine learning model, the dataset is split into a training set and
a test set. The training set is used to train the model, while the test set is used to evaluate its
performance.
Python code
X = dataset.iloc[:, :-1].values  # Features: every column except the last
y = dataset.iloc[:, -1].values   # Target variable: the last column

# Hold out 20% of the rows as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
7. Feature Scaling
Feature scaling is a method to standardize the range of independent variables or features of data. It is
essential for algorithms that compute distances between data points, like K-Nearest Neighbors
(KNN) and Support Vector Machines (SVM). Common techniques include:
• Standardization (Z-score normalization): (value - mean) / standard deviation.
• Normalization (Min-Max scaling): (value - min) / (max - min).
Python code
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn the mean and standard deviation from the training set only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test set
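
The two formulas can be checked by hand against the scikit-learn scalers. The sketch below assumes NumPy and scikit-learn are installed; the single feature and its values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training and test values for one feature (for example, age).
X_train = np.array([[30.0], [35.0], [40.0], [45.0], [50.0]])
X_test = np.array([[38.0], [55.0]])

# Standardization by hand: (value - mean) / standard deviation.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
z_manual = (X_train - mean) / std

# Min-max scaling by hand: (value - min) / (max - min).
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
minmax_manual = (X_train - lo) / (hi - lo)

# The scikit-learn scalers apply the same formulas.
standard_scaler = StandardScaler().fit(X_train)
minmax_scaler = MinMaxScaler().fit(X_train)
print(np.allclose(z_manual, standard_scaler.transform(X_train)))
print(np.allclose(minmax_manual, minmax_scaler.transform(X_train)))

# Test data is transformed with the statistics learned from the training data,
# so scaled test values may fall outside the range [0, 1].
X_test_minmax = minmax_scaler.transform(X_test).ravel()
print(X_test_minmax)
```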

Following these steps helps ensure that your data is clean, well-structured, and suitable for
training machine learning models, which generally leads to better performance and accuracy.
