ML Notes

This document discusses issues related to data quality in machine learning. It notes that data used to train machine learning models needs to be of high quality, with characteristics like accuracy, completeness, and consistency. Poor quality data can negatively impact model performance and lead to issues like overfitting or underfitting. The document also outlines different types of data structures commonly used in machine learning, including numeric, categorical, and ordinal data. High quality data is important for building effective machine learning models that can make reliable predictions.

Uploaded by

Vijay Mahalingam

Issues in Machine Learning

Although machine learning is used in every industry and helps organizations make
more informed, data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some
common issues in machine learning that professionals face while building ML
skills and creating an application from scratch.
• Inadequate Training Data
i. Noisy Data
ii. Incorrect data
iii. Generalization of output data
• Poor quality of data
• Non-representative training data
• Overfitting and Underfitting
• Monitoring and maintenance
• Getting bad recommendations
• Lack of skilled resources
• Customer Segmentation
• Process Complexity of Machine Learning
• Data Bias
• Lack of Explainability
• Slow implementations and results
• Irrelevant features
Machine Learning Models
A machine learning model is defined as a mathematical representation of the output
of the training process.
Machine learning is the study of algorithms that improve automatically through
experience and historical data in order to build such models. A machine learning
model is similar to computer software designed to recognize patterns or behaviors
based on previous experience or data.
The learning algorithm discovers patterns within the training data, and it outputs an
ML model which captures these patterns and makes predictions on new data.
Let's understand the ML model with an example: suppose we are creating an app to
recognize a user's emotions from facial expressions. We can build such an app with
a machine learning model trained by feeding it images of faces labeled with various
emotions. Whenever the app is used, the trained model analyzes the new facial
image and predicts the user's mood.
Hence, in simple words, a machine learning model is a simplified representation of
something or of a process. In this topic, we will discuss different machine
learning models and their techniques and algorithms.
What is a Machine Learning Model?
A machine learning model is a program that has been trained to find patterns in
data and make predictions. A model can be represented as a mathematical function
that takes requests in the form of input data, makes predictions on that data, and
provides an output in response. First, the model is trained on a dataset: a
learning algorithm reasons over the data, extracts patterns from it, and learns
from them. Once trained, the model can be used to make predictions on unseen data.
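The idea above — training produces a function that captures a pattern and can then be applied to new inputs — can be shown with a minimal sketch in plain Python. The one-feature least-squares fit here is purely illustrative, not tied to any particular library:

```python
# Minimal sketch: training turns data into a "model" (a function).
# Here we fit a one-feature linear model y = w*x + b by least squares.

def train(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x

    # The returned closure is the "model": it captures the learned pattern.
    def model(x):
        return w * x + b
    return model

model = train([1, 2, 3, 4], [2, 4, 6, 8])
print(model(5))  # → 10.0: the model generalizes to an unseen input
```

The training data follows y = 2x exactly, so the learned function extends that pattern to inputs it never saw.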
There are various types of machine learning models available based on different
business goals and data sets.
Training Machine Learning Models
Once a machine learning model is built, it must be trained in order to get
appropriate results. Training a machine learning model requires a large amount of
pre-processed data, i.e., data in a structured form with missing (null) values
reduced. If we do not provide pre-processed data, there is a strong chance the
model will perform poorly.
How to choose the best model?
A common question for any beginner is: "Which model should I choose?" The answer
depends mainly on the business or project requirements. It also depends on the
associated attributes: the volume of the available dataset, the number of
features, complexity, and so on. In practice, it is recommended to always start
with the simplest model that can be applied to the particular problem and then
gradually increase complexity, testing accuracy with the help of parameter tuning
and cross-validation.
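The cross-validation mentioned above can be sketched in plain Python: split the data into k folds, train on k−1 of them, score on the held-out fold, and average. The mean-predictor baseline here is a deliberately simple hypothetical model, chosen only to make the mechanics visible:

```python
# Sketch of k-fold cross-validation: each fold takes a turn as the
# held-out evaluation set; the k scores are then averaged.

def k_fold_mse(xs, ys, k, train_fn):
    n = len(xs)
    fold = n // k
    scores = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        train_x = xs[:lo] + xs[hi:]          # everything except fold i
        train_y = ys[:lo] + ys[hi:]
        model = train_fn(train_x, train_y)
        mse = sum((model(x) - y) ** 2
                  for x, y in zip(xs[lo:hi], ys[lo:hi])) / fold
        scores.append(mse)
    return sum(scores) / k

# A deliberately simple baseline: always predict the training mean.
def mean_model(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
print(k_fold_mse(xs, ys, 3, mean_model))  # → 25.0
```

A richer model would be compared against this baseline with the same procedure; the one with the lower averaged error generalizes better on this data.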

Machine learning activities


Types of Data
Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.

Numerical data
Numerical data, or quantitative data, is any form of measurable data, such as your
height, weight, or the cost of your phone bill. You can determine whether a set of
data is numerical by attempting to average the numbers or sort them in ascending
or descending order. Exact or whole numbers (e.g., 26 students in a class) are
considered discrete, while values that can fall anywhere within a range (e.g., a
3.6 percent interest rate) are considered continuous. When working with this type
of data, keep in mind that numerical data is not tied to any specific point in
time; it is simply raw numbers.
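The averaging and sorting test described above can be shown in a couple of lines of Python; the counts and rates are illustrative values, not real data:

```python
# Numerical data supports averaging and ordering; discrete counts and
# continuous measurements both pass this test.
students_per_class = [26, 24, 30, 28]   # discrete: exact whole numbers
interest_rates = [3.6, 2.9, 4.1]        # continuous: any value in a range

average = sum(students_per_class) / len(students_per_class)
print(average)                 # → 27.0
print(sorted(interest_rates))  # → [2.9, 3.6, 4.1], ascending order
```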

Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
When working with this data type, keep in mind that it is non-numerical: you
cannot add the values together, average them, or sort them chronologically.
Categorical data is great for grouping individuals or ideas that share similar
attributes, helping your machine learning model streamline its data analysis.
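The grouping role of categorical data can be sketched with a plain dictionary; the names and industries below are hypothetical examples:

```python
from collections import defaultdict

# Categorical values cannot be averaged, but they are ideal grouping keys.
people = [("Asha", "Retail"), ("Ben", "Tech"), ("Chen", "Retail")]

by_industry = defaultdict(list)
for name, industry in people:
    by_industry[industry].append(name)

print(dict(by_industry))  # → {'Retail': ['Asha', 'Chen'], 'Tech': ['Ben']}
```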

Time series data


Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series
data has established starting and ending points, while numerical data is simply a
collection of numbers that aren’t rooted in particular time periods.
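The month-to-month comparison described above becomes trivial once values are indexed by time; the sales figures here are invented for illustration:

```python
from datetime import date

# Time series data: values indexed at consistent points in time,
# with established starting and ending points.
monthly_sales = {
    date(2023, 1, 1): 100,
    date(2023, 2, 1): 120,
    date(2023, 3, 1): 90,
}

months = sorted(monthly_sales)  # chronological order
changes = [monthly_sales[b] - monthly_sales[a]
           for a, b in zip(months, months[1:])]
print(changes)  # → [20, -30], month-over-month change
```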

Text data
Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for models
to interpret on their own, they are most often grouped together or analyzed using
various methods such as word frequency, text classification, or sentiment analysis.
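Word frequency, the simplest of the methods named above, can be sketched with the standard library alone:

```python
from collections import Counter

# Word frequency: turning raw text into counts a model can use.
text = "the cat sat on the mat and the cat slept"
freq = Counter(text.split())
print(freq.most_common(2))  # → [('the', 3), ('cat', 2)]
```

Text classification and sentiment analysis build on representations like this, typically with normalization (lowercasing, stop-word removal) added first.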
Introduction to Data in Machine Learning
DATA: Any unprocessed fact, value, text, sound, or picture that has not been
interpreted and analyzed. Data is the most important part of Data Analytics,
Machine Learning, and Artificial Intelligence. Without data we cannot train any
model, and all modern research and automation would be in vain. Large enterprises
spend a lot of money just to gather as much data as possible.

Example: Why did Facebook acquire WhatsApp by paying a huge price of $19 billion?
The answer is simple and logical: to gain access to user information that Facebook
may not have had but WhatsApp did. This information about users is of paramount
importance to Facebook, as it facilitates the improvement of their services.
INFORMATION: Data that has been interpreted and manipulated, and that now carries
some meaningful inference for its users.

KNOWLEDGE: A combination of inferred information, experiences, learning, and
insights. It results in awareness or concept building for an individual or
organization.

How do we split data in Machine Learning?

Training Data: The part of the data we use to train our model. This is the data
the model actually sees (both input and output) and learns from.
Validation Data: The part of the data used for frequent evaluation of the model as
it fits the training dataset, and for tuning hyperparameters (parameters set
before the model begins learning). This data plays its part while the model is
being trained.
Testing Data: Once the model is completely trained, testing data provides an
unbiased evaluation. We feed in the inputs of the testing data, and the model
predicts values without seeing the actual outputs. We then evaluate the model by
comparing its predictions with the actual outputs present in the testing data.
This is how we measure how much the model has learned from the training data.
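A common way to produce these three parts is to shuffle the data and slice it; the 60/20/20 ratio below is illustrative, not a fixed rule:

```python
import random

# Sketch of a train/validation/test split: shuffle, then slice.
data = list(range(100))   # stand-in for 100 labeled examples
random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(data)      # avoid ordering effects from data collection

train = data[:60]         # ~60% for training
validation = data[60:80]  # ~20% for tuning hyperparameters
test = data[80:]          # ~20% held out for the final, unbiased evaluation
print(len(train), len(validation), len(test))  # → 60 20 20
```

Keeping the three sets disjoint is the whole point: the test set must never influence training or tuning.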

Consider an example:
A shopping mart owner conducts a survey and collects a long list of questions and
answers from his customers; this list is DATA. Whenever he wants to infer
anything, he cannot simply go through every question from thousands of customers,
as that would be time-consuming and unhelpful. To reduce this overhead and make
the work easier, the data is manipulated through software, calculations, graphs,
etc., as convenient; the inference drawn from this manipulated data is
INFORMATION. So data is a prerequisite for information. Knowledge, in turn, is
what differentiates two individuals holding the same information: it is not
technical content but is linked to the human thought process.

Different Structure of Data


Numeric Data: If a feature represents a characteristic measured in numbers, it is
called a numeric feature.
Categorical Data: A categorical feature is an attribute that can take on one of a
limited, and usually fixed, number of possible values on the basis of some
qualitative property. A categorical feature is also called a nominal feature.
Ordinal Data: This denotes a nominal variable whose categories fall in an ordered
list. Examples include clothing sizes such as small, medium, and large, or a
measurement of customer satisfaction on a scale from "not at all happy" to "very
happy".
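The distinction between ordinal and plain nominal data is that the order can be encoded and used; a minimal sketch, using the satisfaction scale above (the response list is hypothetical):

```python
# Ordinal data: categories with a meaningful order. Mapping each level
# to its rank preserves that order for a model.
order = ["not at all happy", "somewhat happy", "very happy"]
rank = {level: i for i, level in enumerate(order)}

responses = ["very happy", "not at all happy", "somewhat happy"]
encoded = [rank[r] for r in responses]
print(encoded)  # → [2, 0, 1]

# Unlike nominal data, ordinal responses can be meaningfully sorted.
print(sorted(responses, key=rank.get))
```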
Data Quality in ML
Data quality is the measure of how well suited a data set is to serve its specific purpose.
Measures of data quality are based on data quality characteristics such as accuracy,
completeness, consistency, validity, uniqueness, and timeliness.
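Two of these characteristics, completeness and uniqueness, can be measured with one-liners; the age column below is a toy example:

```python
# Simple data-quality checks on one column with missing values.
ages = [34, None, 45, 34, None, 29]

# Completeness: share of non-null entries.
completeness = sum(v is not None for v in ages) / len(ages)
# Uniqueness: number of distinct non-null values.
uniqueness = len({v for v in ages if v is not None})

print(round(completeness, 3))  # → 0.667
print(uniqueness)              # → 3
```

Accuracy, validity, and timeliness generally need an external reference (ground truth, a schema, timestamps) and cannot be computed from the column alone.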

Data Remediation

Data remediation is the process of correcting the mistakes that accumulate during
and after data collection: cleansing, reorganizing, migrating, archiving, and
deleting data so that it is properly protected, stored optimally, and best serves
its intended purpose. There is a misconception that data remediation simply means
deleting business data that is no longer needed; in practice it covers much more.
Data Preprocessing in Machine learning
Data preprocessing is the process of preparing raw data and making it suitable for
a machine learning model. It is the first and most crucial step in creating a
machine learning model.
When creating a machine learning project, we do not always come across clean,
well-formatted data. Before doing any operation with data, it is necessary to
clean it and put it in a formatted way; for this, we use data preprocessing.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be used directly by machine learning models. Data
preprocessing cleans the data and makes it suitable for a machine learning model,
which also increases the model's accuracy and efficiency.

Data Preprocessing steps


1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling
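Steps 4, 5, and 7 above can be sketched on a toy dataset in plain Python: fill missing values with the column mean, one-hot encode a categorical column, and min-max scale a numeric column. The column names and values are illustrative, and real projects would typically use a library such as pandas or scikit-learn for this:

```python
rows = [
    {"age": 20, "city": "Delhi"},
    {"age": None, "city": "Mumbai"},
    {"age": 40, "city": "Delhi"},
]

# Step 4: find missing data and fill it with the column mean.
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Step 5: encode the categorical column as one-hot indicator features.
cities = sorted({r["city"] for r in rows})
for r in rows:
    for c in cities:
        r[f"city_{c}"] = 1 if r["city"] == c else 0

# Step 7: feature scaling (min-max to [0, 1]) for the numeric column.
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

print([r["age_scaled"] for r in rows])  # → [0.0, 0.5, 1.0]
```

Steps 1–3 and 6 (getting the dataset, imports, loading, and the train/test split) bookend this cleaning work in a full project.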
