
Summary Chap 1 & 2


Chapter 1:

Data: What we call data are observations of real-world phenomena.


Each piece of data provides a small window into a limited aspect of reality. The collection of all of these
observations gives us a picture of the whole. But the picture is messy because it is composed of a
thousand little pieces, and there’s always measurement noise and missing pieces.
Tasks: There are questions that data can help us answer. (From data to answer)
The path from data to answers is full of false starts and dead ends:
• What starts out as a promising approach may not pan out.
• What was originally just a hunch may end up leading to the best solution.
• Workflows with data are frequently multistage, iterative processes.
Models: This is where mathematical modeling—in particular statistical modeling—comes in.
The language of statistics contains concepts for many frequent characteristics of data, such as wrong,
redundant, or missing values.
A mathematical model of data describes the relationships between different aspects of the data.
A feature is a numeric representation of raw data.
• There are many ways to turn raw data into numeric measurements, which is why features can end
up looking like a lot of things. Naturally, features must derive from the type of data that is
available.
• Perhaps less obvious is the fact that they are also tied to the model; some models are more
appropriate for some types of features, and vice versa. The right features are relevant to the task
at hand and should be easy for the model to ingest.

Feature engineering: is the process of formulating the most appropriate features given the data, the
model, and the task.
The number of features is also important:
• If there are not enough informative features, then the model will be unable to perform the
ultimate task.
• If there are too many features, or if most of them are irrelevant, then the model will be more
expensive and tricky to train. Something might go awry in the training process that impacts
the model’s performance.
In a machine learning workflow, we pick not only the model, but also the features.
• Good features make the subsequent modeling step easy and the resulting model more capable of
completing the desired task.
• Bad features may require a much more complicated model to achieve the same level of
performance.

Data & Features:


Data are:
• Recordings of fact.
• Observations of real-world phenomena.
Feature is:
• An attribute of data that is meaningful to the machine learning process.
• A numeric representation of raw data.
Feature Vector is:
• A representation of a datum.
• An ordered list of numerical properties of observed phenomena.
f: U → Y
where U is the domain (here, called the feature space) and Y is the range, or image, of the function.
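As a quick illustration (a hypothetical toy function, not from the chapter), a feature vector is one point in the feature space U, and f maps it to a value in Y:

import numpy as np

def f(u: np.ndarray) -> float:
    # A toy linear function of the ordered numerical properties of a datum.
    weights = np.array([0.5, -1.0, 2.0])
    return float(weights @ u)

u = np.array([1.0, 2.0, 3.0])  # a feature vector: one point in U
print(f(u))                    # 4.5 -- its image in Y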
Feature Engineering (by task, objective, indicator): The process of transforming data into features
that better represent the underlying problem, resulting in improved machine learning performance.
Feature Engineering Types: (Definitions are for understanding only)
1. Feature Improvement: Making existing features more usable through mathematical
transformations.
• Imputing (filling in) missing data
2. Feature Construction: Augmenting the dataset by creating net new interpretable features from
existing interpretable features.
• Combining features
• Expanding features
3. Feature Extraction: Relying on algorithms to automatically create new, sometimes
uninterpretable, features, usually based on making parametric assumptions about the data.
4. Feature Selection: Choosing the best subset of features from an existing set of features.
5. Feature Learning: Automatically generating a brand new set of features usually by extracting
structure and learning representations from raw unstructured data such as text, images, and
videos, often using deep learning.
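As a rough sketch of the first four types (toy data and scikit-learn usage are illustrative, not taken from the chapter):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 600.0]])

# 1. Feature improvement: impute the missing entry with the column mean.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)

# 2. Feature construction: combine existing columns into a new ratio feature.
ratio = (X_imp[:, 1] / X_imp[:, 0]).reshape(-1, 1)
X_con = np.hstack([X_imp, ratio])

# 3. Feature extraction: PCA builds new, less interpretable features.
X_ext = PCA(n_components=1).fit_transform(X_con)

# 4. Feature selection: keep only columns whose variance exceeds a threshold.
X_sel = VarianceThreshold(threshold=0.5).fit_transform(X_con)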
Chapter 2:
Before diving into complex data types such as text and image, let’s start with the simplest: numeric
data. This may come from a variety of sources: geolocation of a place or a person, the price of a
purchase, measurements from a sensor, traffic counts, etc.
Numeric data is already in a format that’s easily ingestible by mathematical models.
The first sanity check for numeric data is whether the magnitude matters.
• Do we just need to know whether it’s positive or negative?
Next, consider the scale of the features.
• What are the largest and the smallest values?
A single numeric feature is also known as a scalar. An ordered list of scalars is known as a vector.
Vectors sit within a vector space.
In the vast majority of machine learning applications, the input to a model is usually represented as a
numeric vector.
A vector can be visualized as a point in space. For example, the two-dimensional vector v = [1, –1]
is a point in the plane.
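A tiny numeric check of this picture (NumPy assumed):

import numpy as np

v = np.array([1, -1])      # the two-dimensional vector v
print(v.shape)             # (2,): one scalar per dimension
print(np.linalg.norm(v))   # ~1.414: its distance from the origin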
In the age of Big Data, counts can quickly accumulate without bound.
Binarization: The official user data collection for the Million Song Dataset contains the full music
listening histories of one million users on Echo Nest. A raw listen count is not a robust measure of
preference, so it can be binarized: 1 if the user listened to a song at least once, 0 otherwise.
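A minimal sketch of binarization on made-up listen counts (not real dataset statistics):

import numpy as np

listen_counts = np.array([0, 3, 1, 27, 0, 2])   # hypothetical raw counts
listened = (listen_counts > 0).astype(int)      # 1 = listened at least once
print(listened)                                 # [0 1 1 1 0 1]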
Quantization or Binning: Suppose our task is to use collaborative filtering to predict the rating a user
might give to a business. Raw counts that span many orders of magnitude can be grouped into bins,
either fixed-width bins or quantile bins that adapt to the data distribution.
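A minimal sketch of both binning strategies on made-up counts (fixed-width bins on a log scale, and quantile bins):

import numpy as np

counts = np.array([3, 15, 820, 94, 7, 12000])

# Fixed-width bins by powers of 10: 0-9 -> 0, 10-99 -> 1, 100-999 -> 2, ...
fixed_bins = np.floor(np.log10(counts)).astype(int)

# Quantile bins: cut points chosen so each bin holds roughly equal mass.
cuts = np.quantile(counts, [0.25, 0.5, 0.75])
quantile_bins = np.digitize(counts, cuts)

print(fixed_bins)   # [0 1 2 1 0 4]
print(quantile_bins)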
What is Normalization in Machine Learning?
Normalization is a scaling technique applied during data preparation to change the values of numeric
columns in the dataset to a common scale. It is not necessary for every dataset; it is required only
when the features of a machine learning model have different ranges. It helps to enhance the
performance and reliability of a machine learning model.
We can calculate normalization with the formula below (min-max scaling):
X_n = (X − X_min) / (X_max − X_min)
• X_n = the normalized value
• X_max = maximum value of the feature
• X_min = minimum value of the feature

Case 1: If X is the minimum value, the numerator is 0, so the normalized value is also 0.

Case 2: If X is the maximum value, the numerator equals the denominator, so the normalized value is 1.

Case 3: If X is neither the maximum nor the minimum, the normalized value lies between 0 and 1.
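A minimal numeric check of the formula (illustrative values):

import numpy as np

X = np.array([10.0, 20.0, 25.0, 40.0])
X_n = (X - X.min()) / (X.max() - X.min())
print(X_n)   # approx. [0. 0.33 0.5 1.] -- min maps to 0, max to 1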
Standardization scaling:

’=-/
However, unlike Min-Max scaling technique, feature values are not restricted to a specific range in the
standardization technique.
This technique is helpful for various machine learning algorithms that use distance measures such as
KNN, K-means clustering, and Principal component analysis.
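A minimal numeric check of standardization (illustrative values):

import numpy as np

X = np.array([10.0, 20.0, 25.0, 40.0])
X_std = (X - X.mean()) / X.std()
print(X_std.mean(), X_std.std())   # ~0.0 and 1.0 after scaling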
Important:
Normalization vs. Standardization:
• Normalization scales values to a range such as [0, 1] or [−1, 1]; standardization does not restrict
values to a specific range.
• Normalization is affected by outliers; standardization is comparatively less affected by outliers.
• Normalization is also called scaling normalization; standardization is known as Z-score normalization.
• Normalization is useful when the feature distribution is unknown; standardization is useful when the
feature distribution is normal.

Normalization: a transformation technique that helps to improve the performance as well as the
accuracy of your model. Normalization is useful when you don't know the feature distribution exactly;
in other words, when the feature distribution of the data does not follow a Gaussian (bell curve)
distribution.
Normalization maps values into a bounded range, so if you have outliers in the data, the result of
normalization will be affected by them.
It is also useful when data has varying scales and the algorithm makes no assumption about the
distribution, such as KNN and artificial neural networks.
Standardization: useful when your data follows a Gaussian distribution. However, this does not have
to be strictly true.
Standardization does not have a bounding range, so even if you have outliers in your data,
standardization will not be affected by them.
It is also useful when data has varying scales, and for techniques such as linear regression, logistic
regression, and linear discriminant analysis.
Feature Selection:

• Filtering: Filtering techniques preprocess features to remove ones that are unlikely to be useful
for the model.
• Wrapper methods: These techniques are expensive, but they allow you to try out subsets of
features.
• Embedded methods: These methods perform feature selection as part of the model training
process.
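A rough sketch of the three families with scikit-learn (assumed available) on a built-in toy dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filtering: score each feature independently of any model, keep the top k.
X_filt = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Wrapper: repeatedly refit a model on candidate subsets (more expensive).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrap = rfe.fit_transform(X, y)

# Embedded: selection happens inside training, e.g. via L1-penalty coefficients.
l1_model = LogisticRegression(penalty="l1", solver="liblinear")
X_emb = SelectFromModel(l1_model).fit_transform(X, y)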
