
Chapter One

Introduction to Machine Learning

1. What Is Machine Learning?


Machine Learning is the science (and art) of programming computers so they can learn from data.

Here is a slightly more general definition:


Machine Learning is the field of study that gives computers the ability to learn without being
explicitly programmed.
— Arthur Samuel, 1959
And a more engineering-oriented one:
A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience E.
— Tom Mitchell, 1997

For example, your spam filter is a Machine Learning program that can learn to flag spam given
examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called
“ham”) emails. The examples that the system uses to learn are called the training set. Each training
example is called a training instance (or sample). In this case, the task T is to flag spam for new
emails, the experience E is the training data, and the performance measure P needs to be defined;
for example, you can use the ratio of correctly classified emails. This particular performance
measure is called accuracy and it is often used in classification tasks. If you just download a copy
of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus, it
is not Machine Learning.
A spam filter based on Machine Learning techniques automatically learns which words and
phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam
examples compared to the ham examples (Figure 1-1 ML Approach).

Figure 1-1 ML Approach

To summarize, Machine Learning is great for:
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.

2. History and relationships to other fields

As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. In the early days of AI as an academic discipline, some researchers were already interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed "neural networks"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics. Probabilistic reasoning was also employed, especially in automated medical diagnosis.

However, an increasing emphasis on the logical, knowledge-based approach caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation. By 1980, expert systems had come to dominate AI, and statistics was out of favor. Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval. Neural network research had been abandoned by AI and computer science around the same time. This line, too, was continued outside the AI/CS field, as "connectionism", by researchers from other disciplines including Hopfield, Rumelhart, and Hinton. Their main success came in the mid-1980s with the reinvention of backpropagation.

Machine learning, reorganized as a separate field, started to flourish in the 1990s. The field
changed its goal from achieving artificial intelligence to tackling solvable problems of a
practical nature. It shifted focus away from the symbolic approaches it had inherited from AI,
and toward methods and models borrowed from statistics and probability theory. It also benefited from the increasing availability of digitized information and the possibility of distributing it via the internet.

Machine learning and data mining often employ the same methods and overlap significantly.
They can be roughly distinguished as follows:
• Machine learning focuses on prediction, based on known properties learned from the training data.
• Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases.

The two areas overlap in many ways: data mining uses many machine learning methods, but
often with a slightly different goal in mind. On the other hand, machine learning also employs
data mining methods as “unsupervised learning” or as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss functions
express the discrepancy between the predictions of the model being trained and the actual
problem instances (for example, in classification, one wants to assign a label to instances, and
models are trained to correctly predict the pre-assigned labels of a set of examples). The
difference between the two fields arises from the goal of generalization: while optimization
algorithms can minimize the loss on a training set, machine learning is concerned with
minimizing the loss on unseen samples.

3. Types of machine learning techniques


There are so many different types of Machine Learning systems that it is useful to
classify them in broad categories based on:
1. Whether or not they are trained with human supervision (supervised, unsupervised,
semisupervised, and Reinforcement Learning)
2. Whether or not they can learn incrementally on the fly (online versus batch learning)
3. Whether they work by simply comparing new data points to known data points, or instead
detect patterns in the training data and build a predictive model, much like scientists do
(instance-based versus model-based learning)

Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of supervision they
get during training. There are four major categories: supervised learning, unsupervised learning,
semisupervised learning, and Reinforcement Learning.

Supervised learning

Supervised learning algorithms and supervised learning models make predictions based on labeled
training data. Each training sample includes an input and a desired output. A supervised learning
algorithm analyzes this sample data and makes an inference – basically, an educated guess when
determining the labels for unseen data. This is the most common and popular approach to machine
learning. It’s “supervised” because these models need to be fed manually tagged sample data to
learn from. Data is labeled to tell the machine what patterns (similar words and images, data
categories, etc.) it should be looking for and recognize connections with.

In supervised learning, the training data you feed to the algorithm includes the desired solutions,
called labels (Figure 1-2).

Figure 1-2. A labeled training set for supervised learning (e.g., spam classification)

A typical supervised learning task is classification. The spam filter is a good example of this: it is
trained with many example emails along with their class (spam or ham), and it must learn how to
classify new emails.

Supervised learning problems can be further grouped into regression and classification problems.

• Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Here are some of the most important supervised learning algorithms:


• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
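As a rough sketch of this supervised workflow, the example below trains a k-Nearest Neighbors classifier with scikit-learn. The iris dataset, the 80/20 split, and n_neighbors=3 are illustrative choices, not prescribed by this chapter.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features (inputs) and labels (desired outputs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                  # learn from the labeled training set
print("Test accuracy:", model.score(X_test, y_test))   # accuracy on held-out data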

Unsupervised Learning

Unsupervised learning algorithms uncover insights and relationships in unlabeled data. In this
case, models are fed input data but the desired outputs are unknown, so they have to make inferences from the structure of the data itself, without any labeled guidance. The models are not trained with the “right answer,” so they must find patterns on their own. One of the most common
types of unsupervised learning is clustering, which consists of grouping similar data. This method
is mostly used for exploratory analysis and can help you detect hidden patterns or trends.

Unsupervised learning problems can be further grouped into clustering and association problems.

• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:

• k-means for clustering problems.
• Apriori algorithm for association rule learning problems.
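As a minimal sketch of clustering, the example below runs k-means from scikit-learn on a tiny made-up dataset; the toy points and the choice of two clusters are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])    # unlabeled data points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # learned cluster centers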

Semi-Supervised Learning

In semi-supervised learning, the training data is split into two parts: a small amount of labeled data and a larger set of unlabeled data. The model uses the labeled data to make inferences about the unlabeled data, often producing more accurate results than training on the small labeled set alone.

This approach is gaining popularity, especially for tasks involving large datasets such as image classification. Semi-supervised learning doesn’t require a large amount of labeled data, so it’s faster to set up, more cost-effective than fully supervised methods, and well suited to businesses that receive huge amounts of data.
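A minimal sketch of this idea, assuming scikit-learn's LabelPropagation as the semi-supervised learner: unlabeled samples are marked with -1, and the model propagates the few known labels to them. The toy data below are purely illustrative.

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# two clusters of points; only one point per cluster is labeled, the rest are marked -1 (unlabeled)
X = np.array([[1.0], [1.1], [0.9], [1.2], [5.0], [5.1], [4.9], [5.2]])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)   # inferred labels for every sample, including the unlabeled ones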

Reinforcement Learning

Reinforcement learning (RL) is concerned with how a software agent (or computer program) ought to act in a situation to maximize a reward. Reinforcement learning models attempt to determine the best possible actions to take in a given situation. They do this through trial and error: since there is no labeled training data, the agent learns from its own mistakes and chooses the actions that lead to the best solution or maximum reward.

This machine learning method is mostly used in robotics and gaming. Video games demonstrate a
clear relationship between actions and results, and can measure success by keeping score.
Therefore, they’re a great way to improve reinforcement learning algorithms.

4. Essential math and statistics for machine learning

Before discussing the 4 math skills needed in machine learning, let’s first of all describe the
machine learning process. The machine learning process includes 4 main stages:

1. Problem Framing: This is where you decide what kind of problem you are trying to solve, e.g. a model to classify emails as spam or not spam, a model to classify tumor cells as malignant or benign, a model to improve customer experience by routing calls into different categories so that calls can be answered by personnel with the right expertise, a model to predict whether a loan will charge off over the duration of the loan, a model to predict the price of a house based on different features or predictors, and so on.

2. Data Analysis: This is where you handle the data available for building the model. It includes
data visualization of features, handling missing data, handling categorical data, encoding class
labels, normalization, and standardization of features, feature engineering, dimensionality
reduction, data partitioning into training, validation and testing sets, etc.

3. Model Building: This is where you select the model that you would like to use, e.g. linear
regression, logistic regression, KNN, SVM, K-means, Monte Carlo simulation, time series
analysis, etc. The data set has to be divided into training, validation, and test sets. Hyperparameter
tuning is used to fine-tune the model in order to prevent overfitting. Cross-validation is performed
to ensure the model performs well on the validation set. After fine-tuning model parameters, the
model is applied to the test data set. The model’s performance on the test data set is approximately
equal to what would be expected when the model is used for making predictions on unseen data.

4. Application: In this stage, the final machine learning model is put into production to start
improving the customer experience or increasing productivity, or deciding if a bank should
approve credit to a borrower, etc. The model is evaluated in a production setting in order to assess
its performance. This can be done by comparing the performance of the machine learning solution
against a baseline or control solution using methods such as A/B testing. Any discrepancies between the model's performance in experiments and its actual performance in production have to be analyzed. This analysis can then be used to fine-tune the original model.

Most of the math skills you need for building a machine learning model are used in stages 2, 3,
and 4: Data Analysis, Model Building, and Application.

The 4 Math and Statistics Skills for Machine Learning

(I) Statistics and Probability


Statistics and Probability is used for visualization of features, data preprocessing, feature
transformation, data imputation, dimensionality reduction, feature engineering, model evaluation,
etc. Here are the topics you need to be familiar with:

1. Mean
2. Median
3. Mode
4. Standard deviation/variance
5. Correlation coefficient and the covariance matrix
6. Probability distributions (Binomial, Poisson, Normal)
7. p-value
8. Bayes' Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
9. A/B Testing
10. Monte Carlo Simulation

Statistics: Mean / Median /Mode/ Variance /Standard Deviation


Overview: Mean, median, mode, variance, and standard deviation are basic but very important concepts of statistics used in data science. Almost all machine learning algorithms use these concepts in the data preprocessing steps. They are part of descriptive statistics, which we use to describe and understand the data for the features in machine learning.

Mean: The mean is the average of all the numbers in the data set, calculated as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Let's say we have the following heights (in cm):

Heights = [168, 170, 150, 160, 182, 140, 175, 191, 152, 150]

The sum of the heights is 1638, so the mean is 1638 / 10 = 163.8 cm.

Median: The median is the middle value of the ordered data set.

Arrange the data in increasing order and then find the middle value.

If we have an even number of values in the data set, the median is the sum of the two middle numbers divided by 2. For the ten ordered heights above, the middle values are 160 and 168, so the median is 164.

If we have an odd number of values in the data set, for example 9 heights, the median is simply the 5th value.

Mode: The mode is the number which occurs most often in the data set. Here 150 occurs twice, so it is our mode.

Variance: Variance is a numerical value that describes the variability of the observations from their arithmetic mean; it is denoted by sigma-squared ($\sigma^2$).

Variance measures how far the individual values in the data set are spread out from the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Where

• $x_i$: elements in the data set
• $\mu$: the population mean
• $N$: the number of elements in the population

Step 1: Take each element of the data set (population), subtract the mean of the data set from it, and square the result; then sum all these values.

Step 2: Divide the sum from Step 1 by the total number of elements.

Squaring in the formula nullifies the effect of the negative sign (−).

Standard Deviation: The standard deviation is a measure of the dispersion of the observations in a dataset relative to their mean. It is the square root of the variance and is denoted by sigma ($\sigma$):

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

The standard deviation is expressed in the same unit as the values in the dataset, so it measures how much the observations of the data set differ from the mean.
Understanding Variance, Covariance, and Correlation
One of the topics that a data scientist must understand is the relationships that exist in a dataset. Before you start the machine learning process, it is critical to prepare your data so that only the relevant parts of your dataset are used for training. To understand the relationships in your dataset, you need to understand the following concepts:

• Variance
• Covariance
• Correlation

As usual, my aim is to make it easy for you to digest these topics. Let’s begin!

Creating the Sample Dataset


To understand relationships in your dataset, let's create a simple one and load it into a Pandas DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [1, 3, 4, 6, 8],
    'b': [2, 3, 5, 6, 8],
    'c': [6, 5, 4, 3, 2],
    'd': [5, 4, 3, 4, 6]
})
df

The dataframe contains five rows and four columns:

Variance: Variance is the spread of values in a dataset around its mean value. It tells you how far each number in the dataset is from its mean. The sample variance ($s^2$) is defined as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
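The original walkthrough computed each column's variance; a minimal equivalent sketch is shown below (pandas' var() uses the sample variance, dividing by n − 1), with the expected values as comments.

df.var()
# a    7.3
# b    5.7
# c    2.5
# d    1.3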

Covariance: Now that you have seen the variances of each column, it is time to see how the columns relate to each other. While variance measures the spread of data around its mean value, covariance measures the relationship between two random variables.

In statistics, covariance is the measure of the directional relationship between two random
variables.

Let’s plot a scatter plot to see how the columns in our dataframe relate to each other. We shall start
with the a and b columns first:

import matplotlib.pyplot as plt


plt.scatter(df['a'], df['b'])
plt.xlabel('a')
plt.ylabel('b')

As you can see, there seems to be a trend between a and b — as a increases, so does b.

In statistics, a and b are known to have a positive covariance. A positive covariance indicates
that both random variables tend to move upward or downward at the same time.

How about columns b and c? Let’s see:

plt.scatter(df['b'], df['c'])
plt.xlabel('b')
plt.ylabel('c')

This time round, the trend seems to go the other way — as b increases, c decreases.

In statistics, b and c are known to have a negative covariance. A negative covariance indicates
that both variables tend to move away from each other — when one moves upward the other
moves downward, and vice versa.

Finally, let’s examine columns c and d:

plt.scatter(df['c'], df['d'])
plt.xlabel('c')
plt.ylabel('d')

There doesn’t seem to exist a direct linear relationship between c and d.

In statistics, c and d are known to have zero covariance (or close to zero). When two random variables are independent, their covariance is zero. However, the reverse is not necessarily true — a covariance of zero does not mean that two random variables are independent (a non-linear relationship can still exist between two random variables that have zero covariance). In the above example, you can see that there exists some sort of non-linear, v-shaped relationship.

Mathematically, the formula for the (sample) covariance is defined as follows:

$$cov(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The covariance between 2 random variables is calculated by taking, for each observation, the product of the difference between the value of each random variable and its mean, summing all the products, and finally dividing by the number of values in the dataset minus one (for a sample).

As usual, let’s calculate the covariance between a and b manually using NumPy:

#---covariance for a and b---
((df['a'] - df['a'].mean()) * (df['b'] - df['b'].mean())).sum() / (df.shape[0] - 1)
# 6.35

Like variance, NumPy has a cov() function to calculate the covariance of two random variables directly:

np.cov(df['a'],df['b'])
# array([[7.3 , 6.35],
# [6.35, 5.7 ]])

The output of the cov() function is a 2D array: the diagonal entries are the variances of a and b, and the off-diagonal entries are their covariance. In this case, the covariance of a and b is 6.35 (a positive covariance).

Here is the covariance for b and c (-3.75, a negative covariance):

np.cov(df['b'], df['c'])
# array([[ 5.7 , -3.75],
# [-3.75, 2.5 ]])

While the covariance measures the directional relationship between 2 random variables, it does
not show the strength of the relationship between the 2 random variables. Its value is not constrained, and can range from -infinity to +infinity.

Also, covariance is dependent on the scale of the values. For example, if you double each value
in columns a and b, you will get a different covariance:

np.cov(df['a']*2, df['b']*2)
# array([[29.2, 25.4],
# [25.4, 22.8]])

A much better way to measure the strength of the relationship between two random variables is correlation.

Correlation
The correlation between two random variables measures both the strength and direction of a
linear relationship that exists between them. There are two ways to measure correlation:

• Pearson Correlation Coefficient — captures the strength and direction of the linear association between two continuous variables
• Spearman's Rank Correlation Coefficient — determines the strength and direction of the monotonic relationship which exists between two ordinal (categorical) or continuous variables.

Pearson Correlation Coefficient

The formula for the Pearson Correlation Coefficient is:

$$r = \frac{cov(x, y)}{s_x \, s_y}$$

The Pearson Correlation Coefficient is defined to be the covariance of x and y divided by the product of each random variable's standard deviation.

Substituting the formulas for covariance and standard deviation of x and y, you have:

$$r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Simplifying (the $\frac{1}{n-1}$ factors cancel), the formula becomes:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Pandas has a corr() function that calculates the correlation of columns in a dataframe:

df[['a','b']].corr()

The result is:

The diagonal values of 1 indicate the correlation of each column with itself. Obviously, the correlation of a with itself is 1, and the same holds for column b. The value of 0.984407 is the Pearson correlation coefficient of a and b.

The Pearson correlation coefficient of b and c is -0.993399:

df[['b','c']].corr()

The Pearson correlation coefficient of c and d is -0.27735:

df[['c','d']].corr()

Like covariance, the sign of the Pearson correlation coefficient indicates the direction of the relationship. However, the value of the Pearson correlation coefficient is constrained to be between -1 and 1. Based on the value, you can deduce the following degrees of correlation:

• Perfect — values near ±1
• High degree — values between ±0.5 and ±1
• Moderate degree — values between ±0.3 and ±0.49
• Low degree — values below ±0.29
• No correlation — values close to 0

From the above results, you can see that a and b, and b and c, have high degrees of correlation, while c and d have a very low degree of correlation.

Understanding the correlations between the various columns in your dataset is an important part
of the process of preparing your data for machine learning. You want to train your model using the columns that have the highest correlation with the label of your dataset.

Unlike covariance, correlation is not affected by the scale of the values. As an experiment, multiply columns a and b by 2 and find their correlation:

df['2a'] = df['a']*2 # multiply the values in a by 2


df['2b'] = df['b']*2 # multiply the values in b by 2
df[['2a','2b']].corr() # the result is the same as
# df[['a','b']].corr()

The result is the same as that of a and b:

Spearman’s Rank Correlation Coefficient

If your data is not linearly distributed, you should use Spearman’s Rank Correlation Coefficient
instead of the Pearson Correlation Coefficient. The Spearman’s Rank Correlation Coefficient
is designed for distributions that are monotonic.

In algebra, a monotonic function is a function whose gradient never changes sign. In other words, it is a function which is either always increasing or always decreasing. The first two figures below are monotonic, while the third is not (since the gradient changes sign a few times going from left to right).

The formula for Spearman's Rank Correlation Coefficient is:

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference in rank between the 2 random variables for the i-th observation and $n$ is the number of observations. An example will make it clear.

For this example, I will have another dataframe:

df = pd.DataFrame({
    'math':    [78, 89, 75, 67, 60, 58, 71],
    'science': [91, 92, 90, 80, 60, 56, 84]
})
df

It would be useful to first visualize the data:

plt.scatter(df['math'], df['science'])
plt.xlabel('math')
plt.ylabel('science')

And this looks like a monotonic distribution. The next step is to rank the scores using the rank()
function in Pandas:

df['math_rank'] = df['math'].rank(ascending=False)
df['science_rank'] = df['science'].rank(ascending=False)
df

You now have two additional columns containing the ranks for each subject:

Let's also create two more columns to store the differences between the ranks and their squared values:

df['diff'] = df['math_rank'] - df['science_rank']
df['diff_sq'] = np.square(df['diff'])
df

You are now ready to calculate the Spearman’s Rank Correlation Coefficient using the
formula defined earlier:

n = df.shape[0]
p = 1 - ((6 * df['diff_sq'].sum()) / (n * (n**2 - 1)))
p # 1.0

And you get a perfect 1.0! Of course, to spare you all the effort in calculating the Spearman’s
Rank Correlation Coefficient manually, you can use the corr() function and specify
‘spearman’ for the method parameter:

df[['math','science']].corr(method='spearman')

Note that the formula for Spearman’s Rank Correlation Coefficient that I have just listed above
is for cases where you have distinct ranks (meaning there is no tie in either math or science scores).
In the event of tied ranks, the formula is a little more complicated.

Which method should you use: Pearson or Spearman's?

• Pearson correlation describes linear relationships and Spearman correlation describes monotonic relationships.
• A scatter plot is helpful for visualizing the data — if the distribution is linear, use Pearson correlation. If it is monotonic, use Spearman correlation.
• You can also apply both methods and check which performs better. For instance, if the Spearman rank correlation coefficient is greater than the Pearson coefficient, it means your data has a monotonic rather than a linear relationship.

Understanding P-values
The p-value is a number, calculated from a statistical test, that describes how likely you are to have
found a particular set of observations if the null hypothesis were true. P-values are used in
hypothesis testing to help decide whether to reject the null hypothesis. The smaller the p-value,
the more likely you are to reject the null hypothesis.

All statistical tests have a null hypothesis. For most tests, the null hypothesis is that there is no
relationship between your variables of interest or that there is no difference among groups.

The alternate hypothesis (Ha) is usually your initial hypothesis that predicts a relationship
between variables. The null hypothesis (Ho) is a prediction of no relationship between the
variables you are interested in.

You want to test whether there is a relationship between gender and height. Based on your
knowledge of human physiology, you formulate a hypothesis that men are, on average, taller than
women. To test this hypothesis, you restate it as:

Ho: Men are, on average, not taller than women.


Ha: Men are, on average, taller than women.

More generally, for any comparison of two groups:

• Null hypothesis: there is no difference between the two groups.
• Alternative hypothesis: there is a difference between the two groups.

The p-value, or probability value, tells you how likely it is that your data could have occurred
under the null hypothesis. It does this by calculating the likelihood of your test statistic, which is
the number calculated by a statistical test using your data.

The p-value tells you how often you would expect to see a test statistic as extreme as or more extreme than the one calculated by your statistical test if the null hypothesis of that test were true. The p-value gets smaller as the test statistic calculated from your data gets further away from the range of test statistics predicted by the null hypothesis.
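A hedged sketch of how a p-value is obtained in practice: a two-sample t-test with SciPy on made-up height samples (the numbers are purely illustrative, not real data).

from scipy import stats

men   = [178, 182, 175, 180, 177, 185, 179]
women = [165, 170, 162, 168, 171, 166, 169]

t_stat, p_value = stats.ttest_ind(men, women)
print(t_stat, p_value)   # a very small p-value would lead us to reject the null hypothesis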

Normal, Binomial, and Poisson Distribution


A distribution is an important part of analyzing data sets; it indicates all the potential outcomes of the data and how frequently they occur. In a business context, forecasting the occurrence of events, understanding the success or failure of outcomes, and predicting the probability of outcomes are essential to business development and to interpreting data sets. In a modern digital workplace, businesses need to rely on more than just pure instinct and experience, and instead use analytics to derive value from data sets.

Normal Distribution
The normal distribution is often called a bell curve and is broadly used in statistics, business settings, and government entities such as the FDA. It is widely recognized in the grading of standardized tests such as the SAT and ACT in high school or the GRE for graduate students.

Normal Distribution contains the following characteristics:

• It occurs naturally in numerous situations.
• Data points are similar and occur within a small range.
• There are far fewer outliers at the low and high ends of the data range.

Example:

Use the following formula to convert a raw data value X to a standard score Z:

$$Z = \frac{X - \mu}{\sigma}$$

Formula values:

• X = value that is being standardized
• μ = mean of the distribution
• σ = standard deviation of the distribution

Assume a specific population has μ = 4 and σ = 2, and we want the probability that a randomly selected value is greater than 6. The Z score corresponding to X = 6 is

$$Z = \frac{6 - 4}{2} = 1$$

Z = 1 means that the value X = 6 is 1 standard deviation above the mean.
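A minimal sketch of this calculation with SciPy's normal distribution (assuming SciPy is available):

from scipy.stats import norm

mu, sigma, x = 4, 2, 6
z = (x - mu) / sigma                 # z = 1
p_greater = 1 - norm.cdf(z)          # P(X > 6) = P(Z > 1), roughly 0.1587
print(z, p_greater)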

Business Applications
• Can be used to model risks and follow the distribution of likely outcomes for certain events, like the amount of next month's revenue from a specific service.
• Process variations in operations management are sometimes normally distributed, as is employee performance in Human Resource Management, which applies the normal distribution to employee performance.

Binomial Distribution
The binomial distribution gives the likelihood of a pass-or-fail outcome in a survey or experiment that is replicated numerous times. There are only two potential outcomes for this type of distribution, like True or False, or Heads or Tails, for example.

Characteristics of Binomial Distribution:


• First variable: the number of times an experiment is conducted
• Second variable: the probability of a single, particular outcome
• None of the performed trials have any effect on the probability of the following trial
• The likelihood of success is the same from one trial to the next

The probability of exactly x successes in n trials is

$$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$

Formula values:

• x: number of successes
• X: random variable
• $\binom{n}{x}$ (also written $^{n}C_{x}$): combination of x successes from n trials
• p: probability of success
• (n − x): number of failures
• (1 − p): probability of failure

Assume that 15% of light changes at a street light record a car running the red light, and that the data follow a binomial distribution. To determine the probability that exactly 3 cars will run a red light in 20 light changes, use p = 0.15, n = 20, x = 3 and apply the formula:

$$P(X = 3) = \binom{20}{3} (0.15)^{3} (0.85)^{17} \approx 0.243$$

Therefore, the probability of 3 cars running a red light in 20 light changes is about 0.24, or 24%.
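A minimal sketch of the same calculation with SciPy's binomial distribution:

from scipy.stats import binom

p, n, x = 0.15, 20, 3
print(binom.pmf(x, n, p))   # roughly 0.243, i.e. about a 24% chance of exactly 3 cars in 20 light changes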

Business Applications
• Banks and other financial institutions use the binomial distribution to determine the likelihood of borrowers defaulting, and apply the number toward pricing insurance, figuring out how much money to keep in reserve, and deciding how much to loan.

Poisson Distribution
The Poisson distribution gives the probability of a given number of events occurring in a fixed period. In other words, when you know how often an event has happened on average, the Poisson distribution can be used to predict how often that event will occur in a set period.

Poisson Distribution Characteristics


• An event can happen any number of times during a period.
• Events occurring do not affect the probability of another event occurring within the same period.
• The occurrence rate is constant and does not change over time.
• The likelihood of an event occurring is proportional to the length of the period.

The probability of exactly x events is

$$P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}$$

Formula values:

• x: actual number of occurring successes
• e: 2.71828 (the mathematical constant)
• λ: average number of successes within the specified region or period

For example, the average number of yearly accidents at a traffic intersection is 5. To determine the probability that there are exactly three accidents at the same intersection this year, apply the formula with λ = 5 and x = 3:

$$P(X = 3) = \frac{e^{-5}\,5^{3}}{3!} \approx 0.14$$

Therefore, there is about a 14% chance that there will be exactly three accidents there this year.
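A minimal sketch of the same calculation with SciPy's Poisson distribution:

from scipy.stats import poisson

lam, x = 5, 3
print(poisson.pmf(x, lam))   # roughly 0.1404, i.e. about a 14% chance of exactly three accidents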

Business Applications
• Predicting customer sales on particular days/times of the year.
• Supply and demand estimations to help with stocking products.
• Service industries can prepare for an influx of customers, hire temporary help, order additional supplies, and make alternative plans to reroute customers if needed.

(II) Multivariable Calculus


Most machine learning models are built with a data set having several features or predictors. Hence
familiarity with multivariable calculus is extremely important for building a machine learning
model. Here are the topics you need to be familiar with:

1. Functions of several variables


2. Derivatives and gradients
3. Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function
4. Cost function
5. Plotting of functions
6. Minimum and Maximum values of a function

Multivariable calculus

Multivariable calculus (also known as multivariate calculus) is the extension of calculus in one
variable to calculus with functions of several variables: the differentiation and integration of
functions involving several variables, rather than just one.

Multivariable calculus may be thought of as an elementary part of advanced calculus. It deals with functions of multiple variables, whereas single-variable calculus deals with functions of one variable. The differentiation and integration processes are similar to those of single-variable calculus. In multivariable calculus, to find a partial derivative, you take the derivative with respect to the appropriate variable while holding the other variables constant (a short sketch follows the list below). It mainly deals with three-dimensional objects or higher dimensions. The typical operations involved in multivariable calculus are:

• Limits and Continuity
• Partial Differentiation
• Multiple Integration
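A minimal sketch of the partial-derivative recipe above for the example function f(x, y) = x²y + y³ (an illustrative choice): differentiate with respect to one variable while holding the other constant, and check against a numerical finite-difference estimate.

def f(x, y):
    return x**2 * y + y**3

def grad_f(x, y):
    df_dx = 2 * x * y            # treat y as a constant
    df_dy = x**2 + 3 * y**2      # treat x as a constant
    return df_dx, df_dy

h = 1e-6
x, y = 1.0, 2.0
num_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
num_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
print(grad_f(x, y))      # (4.0, 13.0)
print(num_dx, num_dy)    # approximately the same values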

Multivariable Calculus Applications

One of the core tools of Applied Mathematics is multivariable calculus. It is used in various
fields such as Economics, Engineering, Physical Science, Computer Graphics, and so on. Some
of the applications of multivariable calculus are as follows:

• Multivariable Calculus provides a tool for dynamic systems.
• It is used in a continuous-time dynamic system for optimal control.

• In regression analysis, it helps to derive the formulas for estimating the relationship among a set of empirical data.
• In engineering and social science, it helps to study and model high-dimensional systems that exhibit deterministic behavior.
• In finance, quantitative analysts use multivariable calculus to predict future trends in the stock market.

(III) Linear Algebra

Linear algebra is the most important math skill in machine learning. A data set is represented as a
matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation.
Here are the topics you need to be familiar with:

1. Vectors
2. Matrices
3. Transpose of a matrix
4. The inverse of a matrix
5. The determinant of a matrix
6. Dot product

Linear Algebra: Vectors and matrices


Definition: A scalar is a number. Examples of scalars are temperature, distance, speed, or mass
– all quantities that have a magnitude but no “direction”, other than perhaps positive or negative.

Definition: A vector is a list of numbers. There are (at least) two ways to interpret what this list of numbers means. One way is to think of the vector as a point in a space; the list of numbers is then a way of identifying that point, where each number represents the vector's component in that dimension. Another way is to think of a vector as a magnitude and a direction, e.g. a quantity like velocity (“the fighter jet's velocity is 250 mph north-by-northwest”). In this way of thinking of it, a vector is a directed arrow pointing from the origin to the end point given by the list of numbers.

The “magnitude” of a vector is the distance from the endpoint of the vector to the origin – in a word, its length.

Definition: A unit vector is a vector of magnitude 1. Unit vectors can be used to express the
direction of a vector independent of its magnitude.

Matrix

A matrix is a two-dimensional array that has a fixed number of rows and columns and contains a number at the intersection of each row and column. A matrix is usually delimited by square brackets. For example, a matrix having two rows and two columns has the form

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$

Dimension of a matrix: If a matrix has K rows and L columns, we say that it has dimension $K \times L$, or that it is a $K \times L$ matrix.

Example: a matrix A with 2 rows and 3 columns is a $2 \times 3$ matrix.

Note:

• If a matrix has only one row or only one column it is called a vector.
• A matrix having only one row is called a row vector.
• A matrix having only one column is called a column vector.
• A matrix having only one row and one column is called a scalar.

Equal matrices: Two matrices A and B having the same dimension are said to be equal if and only if all their corresponding elements are equal to each other.

Zero matrix: A matrix is a zero matrix if all its elements are equal to zero, and we then write A = 0.

Square matrices: A matrix is called a square matrix if the number of its rows is the same as the
number of its columns.

Identity matrix: A square matrix is called an identity matrix if all its diagonal elements are
equal to 1 and all its off-diagonal elements are equal to 0. It is usually indicated by the letter I.

Transpose of a matrix

If A is a $K \times L$ matrix, its transpose, denoted by $A^T$, is the $L \times K$ matrix such that the (l,k)-th element of $A^T$ is equal to the (k,l)-th element of A.
Symmetric matrices: A square matrix is said to be symmetric if it is equal to its transpose.
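A short NumPy sketch of the linear-algebra objects described in this section (the matrix entries are illustrative):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])            # a 2 x 2 square matrix
v = np.array([5, 6])              # a vector

print(A.T)                        # transpose
print(np.linalg.det(A))           # determinant: -2.0 (up to floating-point error)
print(np.linalg.inv(A))           # inverse (exists because the determinant is nonzero)
print(np.dot(A, v))               # dot product (matrix-vector): [17 39]
print(np.eye(2))                  # 2 x 2 identity matrix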

(IV) Optimization Methods

Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that are then applied to new (test) data in order to obtain the predicted labels. Here are the topics you need to be familiar with (a gradient descent sketch follows the list):

1. Cost function/Objective function


2. Likelihood function
3. Error function
4. Gradient Descent Algorithm and its variants (e.g. Stochastic Gradient Descent
Algorithm)
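A minimal sketch of gradient descent minimizing a simple one-parameter cost function J(w) = (w − 3)²; the learning rate and iteration count are illustrative choices.

def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0                      # initial weight
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)   # move against the gradient

print(w, cost(w))            # w converges towards 3, where the cost is minimal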

In summary, we've discussed the essential math skills that are needed for building a machine learning model. There are several free online courses that will teach you these skills.

5. Applications of Machine Learning

There are many different applications of machine learning, which can benefit your business in
countless ways. You’ll just need to define a strategy to help you decide the best way to implement
machine learning into your existing processes. In the meantime, here are some common machine
learning use cases and applications that might spark some ideas:

• Social Media Monitoring
• Customer Service & Customer Satisfaction
• Image Recognition
• Virtual Assistants
• Product Recommendations
• Stock Market Trading
• Medical Diagnosis

Question: Discuss how a machine learning approach could be used in the applications listed above.

--------------------------------------------------------------------------------------------------------------------

Machine learning is one of the most algorithm-intensive fields in computer science. Gone are the days when people had to code all machine learning algorithms by hand, thanks to Python and its libraries, modules, and frameworks.

Python has grown to become the most preferred language for machine learning implementations. Learning Python is essential to mastering data science and machine learning. Let's have a look at the main Python libraries used for machine learning.

Top Python Machine Learning Libraries

1) NumPy

NumPy is a well-known general-purpose array-processing package. An extensive collection of high-complexity mathematical functions makes NumPy powerful for processing large multi-dimensional arrays and matrices. NumPy is very useful for handling linear algebra, Fourier transforms, and random numbers. Other libraries like TensorFlow use NumPy at the backend for manipulating tensors.

With NumPy, you can define arbitrary data types and easily integrate with most databases. NumPy can also serve as an efficient multi-dimensional container for generic data of any datatype. The key features of NumPy include a powerful N-dimensional array object, broadcasting functions, and out-of-the-box tools to integrate C/C++ and Fortran code.

Its key features are as follows:

• Supports n-dimensional arrays to enable vectorization, indexing, and broadcasting operations.
• Supports Fourier transforms, mathematical functions, linear algebra methods, and random number generators.
• Runs on different computing platforms, including distributed and GPU computing.
• Easy-to-use high-level syntax with optimized code to provide high speed and flexibility.
• In addition, NumPy enables the numerical operations of plenty of libraries associated with data science, data visualization, image processing, quantum computing, signal processing, geographic processing, bioinformatics, etc. So, it is one of the most versatile machine learning libraries.
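A short sketch of NumPy's n-dimensional arrays, vectorization, and broadcasting (the array values are illustrative):

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])              # a 2 x 3 array
print(a.shape)                         # (2, 3)
print(a * 10)                          # vectorized elementwise multiplication
print(a + np.array([10, 20, 30]))      # broadcasting a 1-D array across the rows
print(a.T @ a)                         # matrix product of the transpose with the original (3 x 3)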

2) SciPy

With machine learning growing at supersonic speed, many Python developers were creating Python libraries for machine learning, especially for scientific and analytical computing. In 2001, Travis Oliphant, Eric Jones, and Pearu Peterson decided to merge most of these bits and pieces of code and standardize them. The resulting library was named SciPy.

The current development of the SciPy library is supported and sponsored by an open community
of developers and distributed under the free BSD license.

The SciPy library offers modules for linear algebra, optimization, integration, interpolation, special functions, the Fast Fourier Transform, signal and image processing, Ordinary Differential Equation (ODE) solving, and other computational tasks in science and analytics.

The underlying data structure used by SciPy is a multi-dimensional array provided by the NumPy
module. SciPy depends on NumPy for the array manipulation subroutines. The SciPy library was
built to work with NumPy arrays along with providing user-friendly and efficient numerical
functions.

One of the unique features of SciPy is that its functions are useful in maths and other sciences. Some of its extensively used functions are optimization functions, statistical functions, and signal processing functions. It supports functions for finding the numerical solution of integrals, so you can solve differential equations and optimization problems.

The following areas of SciPy’s applications make it one of the popular machine learning
libraries.

• Multidimensional image processing
• Solving Fourier transforms and differential equations
• Optimized algorithms that help you to efficiently and reliably perform linear algebra calculations
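A short sketch of two typical SciPy routines, numerical integration and minimization (the integrand and objective are illustrative choices):

import numpy as np
from scipy import integrate, optimize

area, err = integrate.quad(np.sin, 0, np.pi)                 # integral of sin(x) on [0, pi], roughly 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)    # minimum of a simple quadratic
print(area, result.x)                                        # roughly 2.0 and roughly 2.0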

3) Scikit-learn

Scikit-learn is perhaps the most popular library for Machine Learning. It provides almost every popular model – Linear Regression, Lasso and Ridge, Logistic Regression, Decision Trees, SVMs, and a lot more.

In 2007, David Cournapeau developed the Scikit-learn library as part of the Google Summer of Code project. INRIA then got involved and made the first public release in January 2010. Scikit-learn was built on top of two Python libraries – NumPy and SciPy – and has become the most popular Python machine learning library for developing machine learning algorithms.

Scikit-learn has a wide range of supervised and unsupervised learning algorithms that work through a consistent interface in Python. The library can also be used for data mining and data analysis. The main machine learning functions that the Scikit-learn library can handle are classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Many ML enthusiasts and data scientists use scikit-learn in their AI journey. Essentially, it is an
all-inclusive machine learning framework. Some people overlook it because of the prevalence of more cutting-edge Python libraries and frameworks, but it is still a powerful library that efficiently solves complex machine learning tasks.

The following features of scikit-learn make it one of the best machine learning libraries in
Python:

• Easy to use for precise predictive data analysis
• Simplifies solving complex ML problems like classification, preprocessing, clustering, regression, model selection, and dimensionality reduction
• Plenty of inbuilt machine learning algorithms
• Helps build everything from fundamental to advanced-level ML models
• Developed on top of prevalent libraries like SciPy, NumPy, and Matplotlib
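A hedged sketch of a typical scikit-learn workflow combining preprocessing, model selection, and evaluation; the breast-cancer dataset and logistic-regression model are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X_train, y_train, cv=5).mean())   # cross-validated accuracy
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                             # accuracy on the held-out test set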

4) TensorFlow

TensorFlow was developed for Google’s internal use by the Google Brain team. Its first release
came in November 2015 under Apache License 2.0. TensorFlow is a popular computational
framework for creating machine learning models. TensorFlow supports a variety of different
toolkits for constructing models at varying levels of abstraction.

TensorFlow exposes very stable Python and C++ APIs. It can expose backward-compatible APIs for other languages too, but they might be unstable. TensorFlow has a flexible architecture with which it can run on a variety of computational platforms: CPUs, GPUs, and TPUs. TPU stands for Tensor Processing Unit, a hardware chip built around TensorFlow for machine learning and artificial intelligence.

TensorFlow powers some of the largest contemporary AI models globally. It is also recognized as an end-to-end Deep Learning and Machine Learning library for solving practical challenges.

The following key features of TensorFlow make it one of the best machine learning libraries in Python:

• Comprehensive control over developing a machine learning model and robust neural networks
• Deploy models on cloud, web, mobile, or edge devices through TFX, TensorFlow.js, and TensorFlow Lite
• Supports abundant extensions and libraries for solving complex problems
• Supports different tools for integrating Responsible AI and ML solutions
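A minimal sketch of TensorFlow's lower-level building blocks, tensors and automatic differentiation (the computation is an illustrative choice):

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x              # a simple computation expressed with tensor operations
grad = tape.gradient(y, x)          # dy/dx = 2x + 2 = 8 at x = 3
print(y.numpy(), grad.numpy())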

5) Keras

Keras had over 200,000 users as of November 2017. Keras is an open-source library used for neural networks and machine learning. Keras can run on top of TensorFlow, Theano, Microsoft Cognitive Toolkit, or PlaidML, and also has an R interface. Keras can run efficiently on CPUs and GPUs.

Keras works with neural-network building blocks like layers, objectives, activation functions, and optimizers. Keras also has a number of features for working with image and text data that come in handy when writing deep neural network code.

Apart from the standard neural network, Keras supports convolutional and recurrent neural
networks.

It was released in 2015 and is by now a mature open-source Python deep learning framework and API. It is similar to TensorFlow in several aspects, but it is designed with a human-centred approach to make DL and ML accessible and easy for everybody.

You can conclude that Keras is one of the most versatile Python machine learning libraries because it:

• Provides everything that TensorFlow provides, presented in an easy-to-understand format.
• Quickly runs various DL iterations with full deployment capabilities.
• Supports large TPU and GPU clusters, which facilitates commercial Python machine learning.
• Is used in various applications, including natural language processing, computer vision, reinforcement learning, and generative deep learning, and is useful for graph, structured, audio, and time series data.
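A minimal, hypothetical sketch of stacking Keras building blocks (layers, an optimizer, a loss) into a model; the layer sizes and the random training data are purely illustrative.

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(100, 10)                   # made-up data purely for illustration
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=3, verbose=0)
model.summary()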

6) PyTorch

PyTorch has a range of tools and libraries that support computer vision, machine learning, and natural language processing. The PyTorch library is open-source and is based on the Torch library. The most significant advantage of the PyTorch library is its ease of learning and use.

PyTorch integrates smoothly with the Python data science stack, including NumPy; you will hardly notice a difference between NumPy and PyTorch. PyTorch also allows developers to perform computations on tensors. PyTorch has a robust framework for building computational graphs on the go and even changing them at runtime. Other advantages of PyTorch include multi-GPU support, simplified preprocessors, and custom data loaders.

Facebook released PyTorch as a powerful competitor of TensorFlow in 2016. It has now attained
huge popularity among deep learning and machine learning researchers. Various aspects of
PyTorch suggest that it is one of the outstanding Python libraries for machine learning. Here
are some of its key capabilities.

• Fully supports the development of customized deep neural networks
• Production-ready with TorchServe
• Supports distributed computing through the torch.distributed backend
• Supports various extensions and tools to solve complex problems
• Compatible with all leading cloud platforms for scalable deployment
• Also available on GitHub as an open-source Python framework
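A minimal sketch of PyTorch tensors and automatic differentiation (autograd); the values are illustrative.

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()        # a simple scalar function of the tensor
y.backward()              # build the computational graph on the fly and backpropagate
print(x.grad)             # gradient dy/dx = 2x -> tensor([2., 4., 6.])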

7) Pandas

In simple terms, Pandas is the Python equivalent of Microsoft Excel. Whenever you have tabular
data, you should consider using Pandas to handle it.

Pandas has turned out to be the most popular Python library for data analysis, with support for fast, flexible, and expressive data structures designed to work with both “relational” and “labeled” data. Pandas is today an indispensable library for solving practical, real-world data analysis problems in Python. Pandas is highly stable and provides highly optimized performance. The backend code is written purely in C or Python.

The two main types of data structures used by pandas are:

• Series (1-dimensional)
• DataFrame (2-dimensional)

These two put together can handle a vast majority of data requirements and use cases from most
sectors like science, statistics, social, finance, and of course, analytics and other areas of
engineering.

Pandas supports and performs well with different kinds of data, including the following:

• Tabular data with columns of heterogeneous data, for instance data coming from a SQL table or an Excel spreadsheet.
• Ordered and unordered time series data. The frequency of the time series need not be fixed, unlike with other libraries and tools; pandas is exceptionally robust in handling uneven time-series data.
• Arbitrary matrix data with homogeneous or heterogeneous types of data in the rows and columns.
• Any other form of statistical or observational data sets. The data need not be labeled at all; the pandas data structures can process it even without labeling.

It was launched as an open-source Python library in 2009. Currently, it has become one of the
favourite Python libraries for machine learning among many ML enthusiasts, because it offers robust techniques for data analysis and data manipulation. This library is extensively
used in academia. Moreover, it supports different commercial domains like business and web
analytics, economics, statistics, neuroscience, finance, advertising, etc. It also works as a
foundational library for many advanced Python libraries.

Here are some of its key features:

• Handles missing data
• Handles time series data
• Supports indexing, slicing, reshaping, subsetting, joining, and merging of large datasets
• Offers optimized code for Python using C and Cython
• Powerful DataFrame object for broad data manipulation support
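A short sketch of the two core pandas data structures; the data values are made up purely for illustration.

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])         # 1-dimensional, labeled
df = pd.DataFrame({"city": ["Lagos", "Accra", "Nairobi"],
                   "population_m": [15.4, 2.5, 4.4]})       # 2-dimensional, tabular

print(s["b"])                          # label-based indexing
print(df[df["population_m"] > 3])      # boolean filtering
print(df.describe())                   # quick summary statistics of the numeric columns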

8) Matplotlib

Matplotlib is a data visualization library that is used for 2D plotting to produce publication-quality image plots and figures in a variety of formats. The library helps to generate histograms, plots, error charts, scatter plots, and bar charts with just a few lines of code.

It provides a MATLAB-like interface and is exceptionally user-friendly. It works by using


standard GUI toolkits like GTK+, wxPython, Tkinter, or Qt to provide an object-oriented API that
helps programmers to embed graphs and plots into their applications.

It is one of the oldest Python data visualization libraries, yet it is still not obsolete. It remains one of the most innovative data visualization libraries for Python, and the ML community admires it.

The following features of the Matplotlib library make it a popular Python machine learning library among the ML community:

• Its interactive charts and plots allow fascinating data storytelling
• Offers an extensive list of plots appropriate for particular use cases
• Charts and plots are customizable and exportable to various file formats
• Offers embeddable visualizations within different GUI applications
• Various Python frameworks and libraries extend Matplotlib
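A short sketch producing a couple of basic Matplotlib plots and exporting them to a file (the data and filename are illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")                         # line plot
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")  # scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("example.png")                                     # export to an image file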

Summary

Purpose                           Libraries
Scientific Computation            NumPy, SciPy
Tabular Data                      Pandas
Data Modelling & Preprocessing    Scikit-learn
Deep Learning                     Keras, TensorFlow, PyTorch
Data Visualization                Matplotlib

Sample Code (Numpy Array)
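A short, illustrative NumPy array sketch (the specific arrays and operations are assumptions chosen to match the topics above, not taken from the original sample):

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr.dtype, arr.shape)        # element type and shape, e.g. int64 (5,)
print(arr + 10)                    # elementwise arithmetic
print(arr[1:4])                    # slicing
print(arr.reshape(5, 1))           # reshaping into a column vector

matrix = np.arange(1, 7).reshape(2, 3)
print(matrix)                      # [[1 2 3] [4 5 6]]
print(matrix.sum(axis=0))          # column sums: [5 7 9]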
