Part 2 Introduction To ML
Model A model is a specific representation learned from data by applying some machine
learning algorithm. A model is also called a hypothesis.
Feature A feature is an individual measurable property of our data. A set of numeric features
can be conveniently described by a feature vector. Feature vectors are fed as input to the
model. For example, to predict the type of a fruit, there may be features like color, smell,
taste, etc. Note: choosing informative, discriminating, and independent features is a crucial step
for building effective algorithms. We generally employ a feature extractor to extract the relevant
features from the raw data.
Target (Label) A target variable or label is the value to be predicted by our model. For the
fruit example discussed in the features section, the label with each set of input would be the
name of the fruit like apple, orange, banana, etc.
Training The idea is to give a set of inputs (features) and their expected outputs (labels), so that
after training, we have a model (hypothesis) that maps new data to one of the categories it was
trained on.
Prediction Once our model is ready, it can be fed a set of inputs, to which it will provide a
predicted output (label). Note that only if the model performs well on unseen data can we say
that it truly performs well.
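A minimal sketch of these terms in code, using scikit-learn; the fruit feature values (color, smell, and taste scores) are made-up numbers purely for illustration:

from sklearn.tree import DecisionTreeClassifier

# Feature vectors: [color_score, smell_score, taste_score] (hypothetical values)
X_train = [[1.0, 0.2, 0.9],   # apple
           [0.6, 0.8, 0.4],   # orange
           [0.3, 0.5, 0.7]]   # banana
y_train = ["apple", "orange", "banana"]  # labels (target values)

model = DecisionTreeClassifier()   # the model ("hypothesis") to be learned
model.fit(X_train, y_train)        # training: learn a mapping from features to labels

print(model.predict([[0.9, 0.3, 0.8]]))  # prediction on a new feature vector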
1. Define the Problem: Identify the problem you want to solve and determine if machine learning
can be used to solve it.
2. Collect Data: Gather and clean the data that you will use to train your model. The quality of
your model will depend on the quality of your data.
3. Explore the Data: Use data visualization and statistical methods to understand the structure and
relationships within your data.
4. Pre-process the Data: Prepare the data for modeling by normalizing, transforming, and
cleaning it as necessary.
5. Split the Data: Divide the data into training and test datasets to validate your model.
6. Choose a Model: Select a machine learning model that is appropriate for your problem and the
data you have collected.
7. Train the Model: Use the training data to train the model, adjusting its parameters to fit the data
as accurately as possible.
8. Evaluate the Model: Use the test data to evaluate the performance of the model and determine
its accuracy (steps 5-8 are illustrated in the sketch after this list).
9. Fine-tune the Model: Based on the results of the evaluation, fine-tune the model by adjusting
its parameters and repeating the training process until the desired level of accuracy is achieved.
10. Deploy the Model: Integrate the model into your application or system, making it available for
use by others.
11. Monitor the Model: Continuously monitor the performance of the model to ensure that it
continues to provide accurate results over time.
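As a compact, hypothetical illustration of steps 5-8, the sketch below uses scikit-learn's built-in Iris dataset as a stand-in for data you would collect yourself:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 5: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 6-7: choose a model and train it on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 8: evaluate the model on the held-out test data
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))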
The most important thing in the complete process is to understand the problem and to know its
purpose. Therefore, before starting the life cycle, we need to understand the problem, because a
good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem we create a machine learning system called a
"model", and this model is created through "training". But to train a model we need data; hence,
the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify
and obtain all the data relevant to the problem.
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, databases, the internet, or mobile devices. It is one of the most important steps
of the life cycle. The quantity and quality of the collected data will determine the efficiency of the
output: in general, the more data we have, the more accurate the prediction can be.
This step includes the following tasks:
Identify various data sources
Collect data
Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used
in further steps.
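As a sketch of collecting and integrating data from different sources with pandas (the file names and the shared 'id' key are hypothetical):

import pandas as pd

customers = pd.read_csv("customers.csv")   # e.g. data from a file
orders = pd.read_json("orders.json")       # e.g. data exported from another source

# Integrate the two sources into one coherent dataset via a shared key
dataset = customers.merge(orders, on="id", how="inner")
print(dataset.head())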
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step
where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:
Data exploration:
It is used to understand the nature of the data we have to work with. We need to understand
the characteristics, format, and quality of the data; a better understanding of the data leads to a
more effective outcome. Here we look for correlations, general trends, and outliers, as sketched below.
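A brief sketch of these preparation ideas with pandas, assuming the collected data sits in a hypothetical data.csv with a numeric Salary column:

import pandas as pd

df = pd.read_csv("data.csv")

# Randomize the ordering of the rows
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Data exploration: characteristics, correlations, and outliers
print(df.describe())                       # summary statistics per column
print(df.corr(numeric_only=True))          # pairwise correlations
print(df[df["Salary"] > df["Salary"].quantile(0.99)])  # crude outlier check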
Data pre-processing:
Here the explored data is transformed into a format suitable for analysis and model training; this
step is described in more detail under "Data Pre-processing" later in this section.
Machine Learning vs. Traditional Programming vs. Artificial Intelligence:
Machine Learning:
Uses a data-driven approach; it is typically trained on historical data and then used to make
predictions on new data.
Can find patterns and insights in large datasets that might be difficult for humans to discover.
Is a subset of AI and is now used in various AI-based tasks like chatbot question answering,
self-driving cars, etc.
Traditional Programming:
Is typically rule-based and deterministic; it has no self-learning features like Machine Learning
and AI.
Is totally dependent on the intelligence of developers, so it has very limited capability.
Is often used to build applications and software systems that have specific functionality.
Artificial Intelligence:
Can involve many different techniques, including Machine Learning and Deep Learning, as well as
traditional rule-based programming.
Sometimes uses a combination of both data and pre-defined rules, which gives it a great edge in
solving complex tasks, with good accuracy, that seem impossible to humans.
Is a broad field that includes many different applications, including natural language processing,
computer vision, and robotics.
What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain
anything from a simple array to a complete database table. The table below shows an example of a dataset:
Country Age Salary Purchased
India 38 48000 No
Germany 30 54000 No
France 48 65000 No
Germany 40 Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to
a particular variable, and each row corresponds to a single record of the dataset. The most widely
supported file type for a tabular dataset is the "Comma-Separated Values" (CSV) file, but tree-like
(hierarchical) data can be stored more efficiently in a JSON file.
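The example dataset above can be built as a pandas DataFrame (the missing Salary value in the last row is represented as NaN) and then stored as either CSV or JSON:

import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "Country":   ["India", "Germany", "France", "Germany"],
    "Age":       [38, 30, 48, 40],
    "Salary":    [48000, 54000, 65000, np.nan],
    "Purchased": ["No", "No", "No", "Yes"],
})

dataset.to_csv("data.csv", index=False)         # flat, tabular storage
dataset.to_json("data.json", orient="records")  # tree-like storage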
Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. Some common
types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are typically used in computer vision tasks
such as image classification, object detection, and image segmentation.
Examples:
ImageNet
CIFAR-10
MNIST
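One of these, MNIST, can be fetched through scikit-learn's OpenML interface; a minimal sketch (requires a network connection on first download):

from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target   # 70,000 images of 28x28 = 784 pixels each
print(X.shape, y.shape)           # (70000, 784) (70000,)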
Text Datasets:
Text datasets consist of textual data, such as articles, books, or social media posts. These
datasets are used in NLP tasks like sentiment analysis, text classification, and machine
translation.
Examples:
Project Gutenberg dataset
IMDb film reviews dataset
Time Series Datasets:
Time series datasets consist of data points collected over time. They are commonly used in
forecasting, anomaly detection, and trend analysis. Examples:
Stock market data
Weather data
Sensor readings
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows
representing instances or samples and columns representing features or attributes. Tabular datasets
are used for tasks like regression and classification. The dataset given earlier in this section is
an example of a tabular dataset.
Need of Dataset
Properly prepared and pre-processed datasets are crucial for machine learning projects. They
provide the foundation for training accurate and reliable models. However, working with large
datasets can present challenges in terms of management and processing. To address these
challenges, efficient data-management techniques and processing algorithms are required.
Data Pre-processing:
Data pre-processing is a fundamental stage in preparing datasets for machine learning. It involves
transforming raw data into a format suitable for model training. Common pre-processing
techniques include data cleaning to remove inconsistencies or errors, normalization to scale
data within a specific range, feature scaling to ensure features have comparable ranges, and
handling missing values through imputation or removal.
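A minimal sketch of two of these techniques (mean imputation and normalization to [0, 1]) with scikit-learn, assuming the example dataset from earlier is loaded as a pandas DataFrame named dataset:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_cols = ["Age", "Salary"]

# Handle missing values through imputation (here: the column mean)
imputer = SimpleImputer(strategy="mean")
dataset[num_cols] = imputer.fit_transform(dataset[num_cols])

# Normalization: scale each numeric feature into the range [0, 1]
scaler = MinMaxScaler()
dataset[num_cols] = scaler.fit_transform(dataset[num_cols])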
During the development of an ML project, the developers rely completely on the datasets. In building
ML applications, datasets are divided into two parts:
Training dataset
Test dataset
Training Dataset and Test Dataset:
In machine learning, datasets are typically partitioned into two parts: the training dataset and the
test dataset. The training dataset is used to train the machine learning model, while the test
dataset is used to evaluate the model's performance. This division assesses the model's ability to
generalize to unseen data. It is essential to ensure that the datasets are representative of the
problem domain and properly split to avoid bias or overfitting.
# One-Hot Encoding ('Category' is a placeholder for a categorical column name)
import pandas as pd

dataset = pd.get_dummies(dataset, columns=['Category'])
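For the example dataset shown earlier, pd.get_dummies(dataset, columns=['Country']) would replace
the Country column with binary indicator columns such as Country_France, Country_Germany, and
Country_India.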
6. Splitting Dataset into Training and Test Set
To evaluate the performance of the machine learning model, the dataset is split into a training set and
a test set. The training set is used to train the model, while the test set is used to evaluate its
performance.
Python code
X = dataset.iloc[:, :-1].values # Features
y = dataset.iloc[:, -1].values # Target variable
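The extraction above separates the features from the target but does not itself perform the split; a minimal sketch of the split, assuming scikit-learn is used:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)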
By following these steps, you ensure that your data is clean, well-structured, and suitable for
training machine learning models, which ultimately leads to better performance and accuracy.