ML Notes
Numerical data
Numerical data, or quantitative data, is any form of measurable data such as your
height, weight, or the cost of your phone bill. You can determine if a set of data is
numerical by attempting to average out the numbers or sort them in ascending or
descending order. Counts that take only whole-number values (e.g., 26 students in a
class) are considered discrete, while measurements that can take any value within a
range (e.g., a 3.6 percent interest rate) are considered continuous. When working with
this type of data, keep in mind that numerical data is not tied to any specific point in
time; the values are simply raw numbers.
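The checks described above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample values and the helper names is_numerical and is_discrete are assumptions made up for this example, not part of any library.

```python
from statistics import mean

# Hypothetical sample values, chosen only for illustration.
class_sizes = [26, 31, 24, 28]          # discrete: whole-number counts
interest_rates = [3.6, 4.1, 2.9, 3.3]   # continuous: any value in a range


def is_numerical(values):
    """Return True if every value is an int or float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )


def is_discrete(values):
    """Return True if every value is a whole number."""
    return all(float(v).is_integer() for v in values)


# Numerical data can be averaged and sorted.
average_size = mean(class_sizes)
sorted_rates = sorted(interest_rates)
```

Here is_discrete only looks at the values themselves, so a continuous measurement that happens to be recorded as whole numbers would also pass; in practice the distinction depends on what the number represents.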
Categorical data
Categorical data is grouped by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
When working with this data type, keep in mind that it is non-numerical: you cannot add
the values together or average them, and in general they have no inherent numerical
order to sort by. Categorical data is great for grouping individuals or ideas that share
similar attributes, which can help your machine learning model streamline its data
analysis.
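As a rough illustration of grouping by a categorical label, the sketch below groups a few people by industry. It also shows one-hot encoding, a common way (not mentioned in these notes, so treat it as an assumption) to turn labels into numbers a model can consume. The records and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (name, industry).
people = [
    ("Ana", "finance"),
    ("Ben", "retail"),
    ("Chloe", "finance"),
    ("Dev", "health"),
]


def group_by_category(records):
    """Collect names under the category label they share."""
    groups = defaultdict(list)
    for name, category in records:
        groups[category].append(name)
    return dict(groups)


def one_hot(records):
    """Encode each record's category as a 0/1 vector, one slot per label."""
    categories = sorted({category for _, category in records})
    vectors = [
        [1 if category == c else 0 for c in categories]
        for _, category in records
    ]
    return categories, vectors


groups = group_by_category(people)
categories, encoded = one_hot(people)
```

One-hot vectors avoid implying that one label is "larger" than another, which is consistent with the point that categorical values cannot be added or averaged.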
Text data
Text data is simply words, sentences, or paragraphs that can provide some level of
insight for your machine learning models. Since these words can be difficult for models
to interpret on their own, they are most often grouped together or analyzed using
various methods such as word frequency, text classification, or sentiment analysis.
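Of the methods listed, word frequency is the simplest to sketch. The example below is a minimal version that assumes lowercase English text and a very simple tokenizer; the sample sentence and the function name word_frequency are made up for illustration.

```python
import re
from collections import Counter


def word_frequency(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical piece of text data, e.g. a short product review.
review = "Great phone. Great battery, great price!"
freq = word_frequency(review)
top_word, top_count = freq.most_common(1)[0]
```

Real text pipelines usually need more careful tokenization than a single regular expression, so this should be read as a starting point only.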
Introduction to Data in Machine Learning
DATA: Any unprocessed fact, value, text, sound, or picture that has not yet been
interpreted or analyzed. Data is the most important part of Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we cannot train any model, and
modern research and automation efforts would be in vain. Large enterprises spend a lot
of money just to gather as much useful data as possible.
Example: Why did Facebook acquire WhatsApp for a price as high as $19 billion?
One commonly given answer is access to user information that WhatsApp has and
Facebook may not. That information is valuable to Facebook because it can help the
company improve its services.
INFORMATION: Data that has been interpreted and manipulated so that it now carries
some meaningful inference for its users.
Training Data: The part of the data we use to train our model. This is the data that your
model actually sees (both input and output) and learns from.
Validation Data: The part of the data used to frequently evaluate the model as it is fit
on the training dataset, and to tune its hyperparameters (parameters that are set before
the model begins learning). This data plays its part while the model is being trained.
Testing Data: Once our model is completely trained, testing data provides an unbiased
evaluation. When we feed in the inputs of the testing data, our model predicts some
values (without seeing the actual output). After prediction, we evaluate the model by
comparing its predictions with the actual output present in the testing data. This is
how we see how much our model has learned from the examples fed in as training data.
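The three-way division into training, validation, and testing data can be sketched as follows. This is a minimal example using only the standard library; the function name train_val_test_split, the 60/20/20 proportions, and the toy samples are assumptions chosen for illustration, not values prescribed by these notes.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test parts."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical (input, output) pairs.
samples = [(x, 2 * x) for x in range(100)]
train, val, test = train_val_test_split(samples)
```

The important property is that the three parts do not overlap: the model learns from train, hyperparameters are tuned against val, and test is touched only once for the final evaluation.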
Consider an example:
A shopping mart owner conducts a survey with a long list of questions and records the
answers his customers give. This list of questions and answers is DATA. Whenever he
wants to infer something, he cannot go through every answer from thousands of
customers to find what is relevant, as that would be time-consuming and unhelpful. To
reduce this overhead and make the work easier, the data is manipulated through software,
calculations, graphs, etc., as convenient. The inference drawn from this manipulated
data is INFORMATION. So, Data is a must for Information. Knowledge, in turn, is what
differentiates two individuals who have the same information. Knowledge is not
technical content as such; it is linked to the human thought process.
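The step from data to information in this example can be illustrated with a small calculation. The survey fields, the answers, and the function name summarize below are all hypothetical; they only show raw answers being reduced to a summary someone can act on.

```python
from collections import Counter

# Hypothetical raw survey answers (the DATA): one dict per customer.
survey = [
    {"customer": 1, "preferred_payment": "card", "satisfaction": 4},
    {"customer": 2, "preferred_payment": "cash", "satisfaction": 3},
    {"customer": 3, "preferred_payment": "card", "satisfaction": 5},
    {"customer": 4, "preferred_payment": "upi", "satisfaction": 2},
]


def summarize(responses):
    """Reduce raw answers to a small summary (the INFORMATION)."""
    payments = Counter(r["preferred_payment"] for r in responses)
    average = sum(r["satisfaction"] for r in responses) / len(responses)
    return {
        "most_common_payment": payments.most_common(1)[0][0],
        "average_satisfaction": average,
    }


information = summarize(survey)
```

What the owner decides to do with this summary is the knowledge part, which the code does not capture.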
Data remediation
By definition, data remediation is correcting the mistakes that accumulate during and
after data collection. Security teams are responsible for reorganizing, cleansing,
migrating, archiving, and deleting data to ensure optimal storage and eliminate data
quality issues.
Data remediation is the process of cleansing, organizing, and migrating data so that it is
properly protected and best serves its intended purpose. There is a common misconception
that data remediation simply means deleting business data that is no longer needed;
deletion is only one of the activities involved.
Data Preprocessing in Machine Learning
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine learning
model.
When working on a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with data, it must be cleaned and put into a
consistent format. Data preprocessing is the task that does this.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be fed directly to machine learning models.
Data preprocessing is the set of tasks required to clean the data and make it suitable
for a machine learning model, which can also improve the model's accuracy and
efficiency.
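A small sketch of what such cleaning can look like is given below. It handles two of the problems named above, missing values and inconsistent formatting. Filling missing ages with the mean of the known ages is just one common choice and is an assumption of this example, as are the sample rows and the function names parse_age and preprocess.

```python
from statistics import mean

# Hypothetical raw rows with missing and inconsistently formatted values.
raw_rows = [
    {"age": "34", "city": " Pune "},
    {"age": "", "city": "pune"},
    {"age": "41", "city": None},
    {"age": "n/a", "city": "Delhi"},
]


def parse_age(value):
    """Convert an age to float, or return None if it is missing or unusable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def preprocess(rows):
    """Fill missing ages with the mean age and normalize city names."""
    ages = [parse_age(row["age"]) for row in rows]
    known = [age for age in ages if age is not None]
    fill_value = mean(known) if known else 0.0  # assumption: mean imputation
    cleaned = []
    for row, age in zip(rows, ages):
        city = (row["city"] or "unknown").strip().lower()
        cleaned.append({
            "age": age if age is not None else fill_value,
            "city": city,
        })
    return cleaned


clean_rows = preprocess(raw_rows)
```

After this step every row has a numeric age and a consistently formatted city, so the rows can be passed on to later stages without special cases.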