ML Notes

This document discusses issues related to data quality in machine learning. It notes that data used to train machine learning models needs to be of high quality, with characteristics like accuracy, completeness, and consistency. Poor quality data can negatively impact model performance and lead to issues like overfitting or underfitting. The document also outlines different types of data structures commonly used in machine learning, including numeric, categorical, and ordinal data. High quality data is important for building effective machine learning models that can make reliable predictions.

Uploaded by

Vijay Mahalingam

Issues in Machine Learning

Although machine learning is used across many industries and helps organizations make
more informed, data-driven choices than classical methodologies allow, it still has
problems that cannot be ignored. Here are some common issues that professionals face
when building ML skills and creating an application from scratch.
• Inadequate Training Data
i. Noisy Data
ii. Incorrect data
iii. Generalizing of output data
• Poor quality of data
• Non-representative training data
• Overfitting and Underfitting
• Monitoring and maintenance
• Getting bad recommendations
• Lack of skilled resources
• Customer Segmentation
• Process Complexity of Machine Learning
• Data Bias
• Lack of Explainability
• Slow implementations and results
• Irrelevant features
Machine Learning Models
A machine learning model is defined as a mathematical representation of the output
of the training process.
Machine learning is the study of algorithms that improve automatically through
experience and historical data and, in doing so, build a model. A machine learning
model is similar to computer software designed to recognize patterns or behaviors
based on previous experience or data.
The learning algorithm discovers patterns within the training data and outputs an
ML model that captures these patterns and makes predictions on new data.
As an example, suppose we are creating an app that recognizes the user's emotions from
facial expressions. We can build such an app by training a model on images of faces,
each labeled with the emotion it shows. When the app is later used to determine a
user's mood, the trained model applies the patterns it learned from those labeled
images to the new face; it does not reread the training data.
Hence, in simple words, a machine learning model is a simplified representation of
something or of a process. In this topic, we will discuss different machine
learning models and their techniques and algorithms.
What is a Machine Learning Model?
A machine learning model can be understood as a program that has been trained to find
patterns in data and make predictions on new data. A model is represented as a
mathematical function that takes a request in the form of input data, makes a
prediction on that input, and returns an output in response. First, a learning
algorithm is run over a set of training data; it reasons over the data, extracts
patterns from it, and stores what it learns in the model. Once trained, the model can
be used to make predictions on unseen data.
There are various types of machine learning models, suited to different
business goals and data sets.
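The idea of a model as a function learned from data can be sketched in a few lines. The
snippet below is only an illustration, not a method taken from these notes: it fits a
straight line by ordinary least squares and then predicts an unseen input. The data and
the names fit_line and predict are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Learn slope and intercept of y = slope*x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, x):
    """Apply the learned function to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: hours studied -> exam score (exactly 50 + 2*hours)
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)   # training: patterns are stored as two numbers
print(predict(model, 6))          # prediction on an input not seen in training
```

After training, only the two learned numbers are needed to make predictions; the
training data itself is no longer consulted.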
Training Machine Learning Models
Once a model type has been chosen, it is trained in order to get appropriate
results. Training usually needs a large amount of pre-processed data. Here,
pre-processed data means data in structured form with null values reduced or filled,
and so on. If we do not provide pre-processed data, the model is likely to perform
poorly.
How to choose the best model?
In the above section, we discussed machine learning models and algorithms. A question
that confuses many beginners is "which model should I choose?". The answer depends
mainly on the business or project requirement. Apart from this, it also depends on the
associated attributes, the volume of the available dataset, the number of features,
complexity, etc. In practice, it is recommended to start with the simplest model
that can be applied to the problem, then gradually increase the complexity and test
the accuracy with the help of parameter tuning and cross-validation.
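As a rough sketch of "start simple, then compare with cross-validation", the snippet
below scores two models with k-fold cross-validation: a baseline that always predicts
the training mean, and a least-squares line. The data set, the fold scheme (contiguous,
unshuffled folds), and all function names are assumptions made for illustration.

```python
from statistics import mean

def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def fit_mean(xs, ys):
    """Simplest model: ignore x and always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Slightly more complex model: least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def cv_mse(fit, xs, ys, k=5):
    """Average mean-squared error over k held-out folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Hypothetical data: a linear trend plus a small alternating disturbance
xs = list(range(10))
ys = [3 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

baseline_error = cv_mse(fit_mean, xs, ys)
line_error = cv_mse(fit_line, xs, ys)
print("baseline:", baseline_error, "line:", line_error)
```

The more complex model is kept only if its cross-validated error is clearly lower than
that of the simpler one.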

Machine learning activities

Types of Data
Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.

Numerical data
Numerical data, or quantitative data, is any form of measurable data such as your
height, weight, or the cost of your phone bill. You can check whether a set of data is
numerical by trying to average the values or sort them in ascending or descending
order. Counts that take only whole values (e.g. 26 students in a class) are discrete,
while measurements that can take any value within a range (e.g. a 3.6 percent interest
rate) are continuous. Keep in mind that numerical data, in this sense, is not tied to
any specific point in time; the values are simply raw numbers.

Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
Keep in mind that this data type is non-numerical: you cannot add the values together,
average them, or sort them in any meaningful numeric order. Categorical data is great
for grouping individuals or ideas that share similar attributes, which helps your
machine learning model streamline its data analysis.
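Two common ways of handling categorical values are grouping (counting records per
category) and one-hot encoding (one 0/1 column per category). The sketch below uses
made-up industry labels; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical categorical column: the industry each person works in
industries = ["retail", "finance", "retail", "health", "finance", "retail"]

# Grouping: how many records share each label
counts = Counter(industries)

# One-hot encoding: one 0/1 column per category, in a fixed (alphabetical) order
categories = sorted(set(industries))
one_hot = [[1 if value == c else 0 for c in categories] for value in industries]

print(counts)
print(categories)
print(one_hot[0])
```

One-hot encoding gives a model numeric inputs without implying that one category is
"larger" than another.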

Time series data

Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Time series data
makes it easy to compare values from week to week, month to month, year to year, or
according to any other time-based metric you choose. The key difference between time
series data and plain numerical data is that every time series value is tied to a
timestamp, so the series has established starting and ending points and its order
matters, while numerical data is simply a collection of numbers that are not rooted in
particular time periods.
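A week-to-week comparison can be sketched as follows. The daily sales figures are
invented, and grouping by ISO calendar week is just one possible choice of interval.

```python
from datetime import date, timedelta

# Hypothetical daily sales, one value per day starting Monday 2024-01-01
start = date(2024, 1, 1)
daily_sales = [100 + d for d in range(14)]
series = [(start + timedelta(days=d), v) for d, v in enumerate(daily_sales)]

def weekly_totals(series):
    """Sum values per (ISO year, ISO week); each value keeps its timestamp."""
    totals = {}
    for day, value in series:
        iso = day.isocalendar()
        key = (iso[0], iso[1])
        totals[key] = totals.get(key, 0) + value
    return totals

totals = weekly_totals(series)
change = totals[(2024, 2)] - totals[(2024, 1)]
print(totals, change)
```

Unlike a plain list of numbers, shuffling this series would destroy the information
needed for the comparison.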

Text data
Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for models
to interpret on their own, they are most often grouped together or analyzed using
various methods such as word frequency, text classification, or sentiment analysis.
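Of the methods just mentioned, word frequency is the simplest to sketch. The tokenizer
below (lowercase, letters and apostrophes only) is a deliberately crude assumption, and
the review text is made up.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

review = "Great phone. Great battery, great price!"
freq = word_frequencies(review)
print(freq.most_common(2))
```

The resulting counts are numeric features that a model can work with, which raw text
is not.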
Introduction to Data in Machine Learning
DATA: Any unprocessed fact, value, text, sound, or picture that has not yet been
interpreted or analyzed. Data is the most important part of Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we cannot train any model, and
modern research and automation would be in vain. Big enterprises spend a lot of money
just to gather as much reliable data as possible.

Example: Why did Facebook acquire WhatsApp for the huge price of $19 billion?
One commonly given answer is simple: to gain access to user information that
Facebook may not have had but WhatsApp did. Such information is of paramount
importance to Facebook because it helps the company improve its services.
INFORMATION: Data that has been interpreted and manipulated and now carries some
meaningful inference for its users.

KNOWLEDGE: A combination of inferred information, experiences, learning, and insights.
It results in awareness or concept building for an individual or organization.

How do we split data in Machine Learning?

Training Data: The part of the data we use to train our model. This is the data that the
model actually sees (both input and output) and learns from.
Validation Data: The part of the data used for frequent evaluation of the model while it
is being fit on the training dataset, and for tuning hyperparameters (parameters set
before the model begins learning). This data plays its part during training, but the
model does not learn its parameters from it.
Testing Data: Once our model is completely trained, testing data provides an unbiased
evaluation. When we feed in the inputs of the testing data, the model predicts some
values (without seeing the actual output). After prediction, we evaluate the model by
comparing its predictions with the actual output present in the testing data. This is
how we see how much the model has learned from the experience fed in as training data.
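A three-way split can be sketched with the standard library alone. The 70/15/15
proportions, the fixed seed, and the function name train_val_test_split are assumptions
chosen for illustration; the notes do not prescribe particular ratios.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of the rows, then cut it into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

data = list(range(100))                 # stand-in for 100 labeled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Every example lands in exactly one of the three parts, so the test set stays unseen
until the final evaluation.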

Consider an example:
A shopping mart owner conducts a survey, for which he has a long list of questions and
the answers his customers gave. This list of questions and answers is DATA. Whenever he
wants to infer something, he cannot go through every answer from thousands of customers
to find something relevant, as that would be time-consuming and not helpful. To reduce
this overhead and make the work easier, the data is manipulated through software,
calculations, graphs, etc. as convenient; the inference drawn from this manipulated
data is INFORMATION. So, data is a must for information. KNOWLEDGE then has its role in
differentiating between two individuals who have the same information. Knowledge is
not technical content as such but is linked to the human thought process.
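The step from data to information in this example can be illustrated with a tiny,
invented survey. The field names and answers below are hypothetical; the point is only
that raw answers become usable once they are summarized.

```python
from collections import Counter

# DATA: raw survey answers (hypothetical records)
survey = [
    {"customer": 1, "preferred_payment": "card", "satisfaction": 4},
    {"customer": 2, "preferred_payment": "cash", "satisfaction": 3},
    {"customer": 3, "preferred_payment": "card", "satisfaction": 5},
    {"customer": 4, "preferred_payment": "card", "satisfaction": 2},
]

# INFORMATION: summaries computed from the raw answers
payment_counts = Counter(r["preferred_payment"] for r in survey)
avg_satisfaction = sum(r["satisfaction"] for r in survey) / len(survey)

print(payment_counts.most_common(1))
print(avg_satisfaction)
```

What the owner decides to do with these summaries is where knowledge, in the sense used
above, comes in.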

Different Structures of Data

Numeric Data: If a feature represents a characteristic measured in numbers, it is called
a numeric feature.
Categorical Data: A categorical feature is an attribute that can take on one of a limited,
and usually fixed, number of possible values on the basis of some qualitative property.
A categorical feature is also called a nominal feature.
Ordinal Data: This denotes a nominal variable with categories falling in an ordered list.
Examples include clothing sizes such as small, medium, and large, or a measurement of
customer satisfaction on a scale from “not at all happy” to “very happy”.
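For ordinal data, the order of the categories can be preserved by mapping each one to
its rank. The mapping below uses the clothing sizes from the example; the integer codes
themselves (0, 1, 2) are an arbitrary choice that only needs to respect the order.

```python
# Rank of each category in its natural order (codes are an assumption)
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}

sizes = ["large", "small", "medium", "small"]   # hypothetical feature values

encoded = [SIZE_ORDER[s] for s in sizes]        # ordinal encoding
sorted_sizes = sorted(sizes, key=SIZE_ORDER.get)

print(encoded)
print(sorted_sizes)
```

Sorting the raw strings alphabetically would put "large" before "small", which is why
the explicit order is needed.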
Data Quality in ML
Data quality is the measure of how well suited a data set is to serve its specific purpose.
Measures of data quality are based on data quality characteristics such as accuracy,
completeness, consistency, validity, uniqueness, and timeliness.
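Some of these characteristics can be turned into simple ratios. The sketch below
computes completeness, uniqueness, and validity for a few invented records; the
definitions used here (share of non-missing values, share of distinct values, share of
present values passing a rule) are one reasonable reading, not a standard taken from
these notes.

```python
# Hypothetical records with typical quality problems
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": None, "email": "b@example.com"},   # missing age
    {"id": 2, "age": 29, "email": "b@example.com"},     # duplicate id
    {"id": 4, "age": -5, "email": "d@example.com"},     # impossible age
]

def completeness(records, field):
    """Share of records where the field is present (not None)."""
    return sum(r[field] is not None for r in records) / len(records)

def uniqueness(records, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in records]
    return len(set(values)) / len(values)

def validity(records, field, is_valid):
    """Share of present values that satisfy the given rule."""
    present = [r[field] for r in records if r[field] is not None]
    return sum(is_valid(v) for v in present) / len(present)

age_completeness = completeness(records, "age")
id_uniqueness = uniqueness(records, "id")
age_validity = validity(records, "age", lambda a: 0 <= a <= 120)
print(age_completeness, id_uniqueness, age_validity)
```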

Data remediation

Data remediation is the process of correcting the mistakes that accumulate during and
after data collection: cleansing, organizing, and migrating data so that it is properly
protected and best serves its intended purpose. The teams responsible (often security
teams) reorganize, cleanse, migrate, archive, and delete data to ensure optimal storage
and eliminate data quality issues. A common misconception is that data remediation
simply means deleting business data that is no longer needed; deletion is only one of
these activities.
Data Preprocessing in Machine Learning
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine
learning model.
When working on a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with data, it must be cleaned and put into
a consistent format. For this, we use data preprocessing.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be fed directly to machine learning models.
Data preprocessing is the set of tasks required to clean the data and make it suitable
for a machine learning model, which also increases the model's accuracy and efficiency.

Data Preprocessing steps

1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling
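Steps 4 to 7 above can be sketched on a tiny in-memory dataset using only the standard
library. Everything here is an assumption made for illustration: the records, mean
imputation for missing values, one-hot encoding, a single held-out test row, and
min-max scaling. In practice these steps are usually done with a data-frame or ML
library.

```python
import random
from statistics import mean

# Hypothetical dataset (steps 1-3 would load this from a file)
rows = [
    {"age": 25, "city": "Paris", "purchased": 0},
    {"age": None, "city": "Delhi", "purchased": 1},
    {"age": 40, "city": "Paris", "purchased": 1},
    {"age": 31, "city": "Tokyo", "purchased": 0},
    {"age": 52, "city": "Delhi", "purchased": 1},
]

# Step 4: find missing ages and fill them with the mean of the known ages
known_ages = [r["age"] for r in rows if r["age"] is not None]
age_mean = mean(known_ages)
for r in rows:
    if r["age"] is None:
        r["age"] = age_mean

# Step 5: encode the categorical column as one 0/1 column per city
cities = sorted({r["city"] for r in rows})
features = [[r["age"]] + [1 if r["city"] == c else 0 for c in cities] for r in rows]
labels = [r["purchased"] for r in rows]

# Step 6: split row indices into a training set and a test set
indices = list(range(len(rows)))
random.Random(0).shuffle(indices)
test_idx, train_idx = indices[:1], indices[1:]

# Step 7: min-max scale age using training rows only, so that no information
# from the test row leaks in (a test value may therefore fall outside [0, 1])
train_ages = [features[i][0] for i in train_idx]
lo, hi = min(train_ages), max(train_ages)
for row in features:
    row[0] = (row[0] - lo) / (hi - lo)

print(features)
```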
