Tie up with 20+ Real MNC's and 50+ Mid Companies to provide Training
12+ AI Real Time Experts are providing Training
Training Delivered to 1000+ Engineering and Degree Students in 20+ Colleges
100+ Students Got Placed in Different MNC's
Short-time Training on AI, ML, BPO and Back Office Operations
Contents
Artificial Intelligence History
Why is artificial intelligence important?
How Artificial Intelligence Is Being Used
    Health Care
    Retail
    Manufacturing
    Sports
What are the challenges of using artificial intelligence?
How Artificial Intelligence Works
Why Algebra Matters
3 Reasons Why We Learn Algebra
Algebra in Everyday Life
Examples of using algebra in everyday life
    Going shopping
    Calculating grocery expense
    Filling up the gas tank
Basic Terms
Coefficient of Algebraic Expressions Properties
Frequency Histogram
Relative Frequency Histograms
When to Use a Histogram
Where Do Outliers Come From?
Detecting Outliers
Table of Contents
Common Data Types
Types of Distributions
    Bernoulli Distribution
    Uniform Distribution
    Binomial Distribution
    Normal Distribution
    Poisson Distribution
Artificial Intelligence
What is Artificial Intelligence (AI)?
According to the father of Artificial Intelligence, John McCarthy, it is “The science and engineering
of making intelligent machines, especially intelligent computer programs”.
Artificial Intelligence is a way of making a computer, a computer-controlled robot, or
software think intelligently, in a manner similar to how intelligent humans think.
Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by
machines, in contrast to the natural intelligence (NI) displayed by humans and other
animals.
Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to
new inputs and perform human-like tasks.
Goals of AI:
To Create Expert Systems − systems which exhibit intelligent behavior, learn,
demonstrate, explain, and advise their users.
To Implement Human Intelligence in Machines − Creating systems that understand,
think, learn, and behave like humans.
Some of the activities computers with artificial intelligence are designed for include:
Speech recognition
Learning
Planning
Problem solving
Examples range from chess-playing computers to self-driving cars, which rely heavily on deep
learning and natural language processing. Using these technologies, computers can be trained to
accomplish specific tasks by processing large amounts of data and recognizing patterns in the data.
AI research is defined as the study of "intelligent agents": any device that perceives its
environment and takes actions that maximize its chance of successfully achieving its goals.
Artificial intelligence was founded as an academic discipline in 1956, and in the years since
has experienced several waves of optimism, followed by disappointment and the loss of
funding (known as an "AI winter"), followed by new approaches, success and renewed
funding.
Attendees of the 1956 Dartmouth workshop, Allen Newell (CMU), Herbert Simon (CMU), John
McCarthy (MIT), Marvin Minsky (MIT) and Arthur Samuel (IBM), became the founders and leaders of AI research.
Research associated with artificial intelligence is highly technical and specialized. The core
problems of artificial intelligence include programming computers for certain traits such as:
Knowledge
Reasoning
Problem solving
Perception
Learning
Planning
Ability to manipulate and move objects

Why is artificial intelligence important?
AI automates repetitive learning and discovery through data. But AI is different from
hardware-driven, robotic automation. Instead of automating manual tasks, AI performs
frequent, high-volume, computerized tasks reliably and without fatigue. For this type of
automation, human inquiry is still essential to set up the system and ask the right questions.
AI adds intelligence to existing products. In most cases, AI will not be sold as an individual
application. Rather, products you already use will be improved with AI capabilities, much like
Siri was added as a feature to a new generation of Apple products. Automation, conversational
platforms, bots and smart machines can be combined with large amounts of data to improve
many technologies at home and in the workplace, from security intelligence to investment
analysis.
AI adapts through progressive learning algorithms to let the data do the programming. AI
finds structure and regularities in data so that the algorithm acquires a skill: the algorithm
becomes a classifier or a predictor. So, just as the algorithm can teach itself how to play chess,
it can teach itself what product to recommend next online. And the models adapt when given
new data. Backpropagation is an AI technique that allows the model to adjust, through training
and added data, when the first answer is not quite right.
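To make the idea concrete, here is a minimal sketch (not from the original text) of a tiny neural network trained with backpropagation on a made-up XOR data set; the data, layer sizes, learning rate and epoch count are all illustrative assumptions.

```python
# Minimal two-layer network trained by explicit backpropagation (sketch only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # made-up inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))          # hidden layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))          # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error back through the layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent update of the weights
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # compare against the XOR targets y
```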
AI analyzes more and deeper data using neural networks that have many hidden layers.
Building a fraud detection system with five hidden layers was almost impossible a few years
ago. All that has changed with incredible computer power and big data. You need lots of data to
train deep learning models because they learn directly from the data. The more data you can
feed them, the more accurate they become.
AI achieves incredible accuracy through deep neural networks, which was previously
impossible. For example, your interactions with Alexa, Google Search and Google Photos are all
based on deep learning – and they keep getting more accurate the more we use them. In the
medical field, AI techniques from deep learning, image classification and object recognition can
now be used to find cancer on MRIs with the same accuracy as highly trained radiologists.
AI gets the most out of data. When algorithms are self-learning, the data itself can become
intellectual property. The answers are in the data; you just have to apply AI to get them out.
Since the role of the data is now more important than ever before, it can create a competitive
advantage. If you have the best data in a competitive industry, even if everyone is applying
similar techniques, the best data will win.
How Artificial Intelligence Is Being Used
Health Care
AI applications can provide personalized medicine and X-ray readings.
Personal health care assistants can act as life coaches, reminding you to take your pills,
exercise or eat healthier.
Retail
AI provides virtual shopping capabilities that offer personalized recommendations and
discuss purchase options with the consumer.
Stock management and site layout technologies will also be improved with AI.
Manufacturing
AI can analyze factory IoT data as it streams from connected equipment to forecast
expected load and demand using recurrent networks, a specific type of deep learning
network used with sequence data.
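As a hedged illustration only (none of this comes from the text): a small recurrent (LSTM) model that forecasts the next value of a synthetic "equipment load" series; the signal, window size and layer sizes are made-up assumptions.

```python
# Sketch: forecast the next reading of a streaming sensor signal with an LSTM.
import numpy as np
import tensorflow as tf

t = np.arange(0, 200, 0.1)
signal = np.sin(t) + 0.1 * np.random.randn(len(t))   # synthetic "equipment load"

window = 20   # each training example is the previous 20 readings
X = np.array([signal[i:i + window] for i in range(len(signal) - window)])[..., None]
y = signal[window:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),  # recurrent layer for sequence data
    tf.keras.layers.Dense(1),                           # next-step forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

print(model.predict(X[-1:], verbose=0).ravel())  # forecast for the next time step
```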
Sports
AI is used to capture images of game play and provide coaches with reports on how to
better organize the game, including optimizing field positions and strategy.
AI in Recent News:
What are the challenges of using artificial intelligence?
The principal limitation of AI is that it learns from the data. There is no other way in which
knowledge can be incorporated. That means any inaccuracies in the data will be reflected in
the results. And any additional layers of prediction or analysis have to be added separately.
Today’s AI systems are trained to do a clearly defined task. The system that plays poker
cannot play solitaire or chess.
The system that detects fraud cannot drive a car or give you legal advice. In fact, an AI
system that detects health care fraud cannot accurately detect tax fraud or warranty claims
fraud.
In other words, these systems are very, very specialized. They are focused on a single task
and are far from behaving like humans.
Self-learning systems are not autonomous systems. The imagined AI technologies that you
see in movies and TV are still science fiction. But computers that can probe complex data to
learn and perfect specific tasks are becoming quite common.
How Artificial Intelligence Works
Machine learning automates analytical model building. It uses methods from neural networks,
statistics, operations research and physics to find hidden insights in data without explicitly being
programmed for where to look or what to conclude.
Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify patterns
and make decisions with minimal human intervention.
A neural network is a type of machine learning that is made up of interconnected units (like
neurons) that processes information by responding to external inputs, relaying information
between each unit. The process requires multiple passes at the data to find connections and derive
meaning from undefined data.
Deep learning uses huge neural networks with many layers of processing units, taking advantage
of advances in computing power and improved training techniques to learn complex patterns in
large amounts of data. Common applications include image and speech recognition.
Cognitive computing is a subfield of AI that strives for a natural, human-like interaction with
machines. Using AI and cognitive computing, the ultimate goal is for a machine to simulate human
processes through the ability to interpret images and speech – and then speak coherently in
response.
Computer vision relies on pattern recognition and deep learning to recognize what’s in a picture
or video. When machines can process, analyze and understand images, they can capture images or
videos in real time and interpret their surroundings.
Natural language processing (NLP) is the ability of computers to analyze, understand and
generate human language, including speech. The next stage of NLP is natural language interaction,
which allows humans to communicate with computers using normal, everyday language to perform
tasks.
Graphics processing units (GPUs) are key to AI because they provide the heavy compute power that's
required for iterative processing. Training neural networks requires big data plus compute power.
The Internet of Things generates massive amounts of data from connected devices, most of it
unanalyzed. Automating models with AI will allow us to use more of it.
Advanced algorithms are being developed and combined in new ways to analyze more data faster
and at multiple levels. This intelligent processing is key to identifying and predicting rare events,
understanding complex systems and optimizing unique scenarios.
APIs, or application programming interfaces, are portable packages of code that make it possible to
add AI functionality to existing products and software packages. They can add image recognition
capabilities to home security systems and Q&A capabilities that describe data, create captions and
headlines, or call out interesting patterns and insights in data.
Algebra
What is Algebra?
The word "algebra" comes from the Arabic "al-jabr", literally meaning "reunion of broken parts" (as per Wikipedia).
Algebra Definition:
Algebra can be defined as the representation of numbers and quantities in equations
and formulae in the form of letters.
The topic of algebra is broadly divided into 2 parts, namely elementary algebra (or
basic algebra) and abstract algebra (or modern algebra).
Modern algebra is mainly studied by professional mathematicians.
Basic algebra is involved in most mathematical and scientific subjects.
The application of algebra is not only limited to science and engineering but is also
used in medicine and economics.
Various types of definitions for Algebra:
1) Algebra is a branch of mathematics that deals with properties of operations and the
structures these operations are defined on.
2) Algebra is a branch of mathematics that substitutes letters for numbers, and an
algebraic equation represents a scale where what is done on one side of the scale is
also done to the other side of the scale and the numbers act as constants.
3) Algebra is a branch of pure mathematics that deals with the rules of operations and
solving equations.
* Algebra is a lot like arithmetic: it follows all the rules of arithmetic and it uses the same
four main operations that arithmetic is built on.
-->Algebra introduces a new element of unknown "?"
Ex: - 1+2=?
The answer isn't known until you go ahead and do the arithmetic. In algebra we use a
symbol in its place. The symbol is usually a letter of the alphabet, A to Z.
A really popular letter to choose is "X". In algebra we would write it like this:
1+2=X
Arithmetic 1+2=_____
Algebra 1+2=X
“This is the very basic algebraic equation.”
=>An equation is just a mathematical statement that two things are equal.
GOALS: Figure out what the unknown values in equations are; when you do that, it is
called solving the equation.
So 1+2= X
X=3
Definition: Solving equations in algebra is a lot like a puzzle: you are given mixed-up,
complicated equations and it is your job to simplify and rearrange them until you have a
nice, simple equation where it is easy to tell what the unknown values are.
2X = (30 − 2X)/4
8X = 30 − 2X
8X + 2X = 30
10X = 30
X = 3
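As a quick sanity check of the worked example above, here is a minimal sketch using SymPy (assumed to be installed; not part of the original text).

```python
# Solve 2X = (30 - 2X)/4 symbolically and confirm the answer X = 3.
from sympy import symbols, Eq, solve

X = symbols("X")
print(solve(Eq(2 * X, (30 - 2 * X) / 4), X))   # -> [3]
```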
Concepts Associated with Algebra
Elementary Algebra involves simple rules and operations on numbers such as:
Addition
Subtraction
Multiplication
Division
Equation solving techniques
Variables
Functions
Polynomials
Algebraic Expressions
There is no limit on the complexity of all these concepts as far as “Algebra” is concerned
Even if you don't think you'll need algebra outside of the hallowed halls of your
average high school, managing budgets, paying bills, and even determining health care
costs and planning for future investments will require a basic understanding of algebra.
Here are some simple examples that demonstrate the relevance of algebra in the real
world.
Example 1: Going shopping
You purchased 10 items from a shopping plaza, and now you need plastic bags to carry
them home. If each bag can hold only 3 items, how many plastic bags will you need to
accommodate 10 items?
10 items ÷ (3 items/bag) = 3.33 bags ≈ 4 bags
Explanation:
The figure below illustrates the problem. The different shapes inside the bags denote
different items purchased; the number depicts the item number.
Example 2: Filling up the gas tank
We use the simple algebraic formula x/y to calculate the total number of gallons that can be bought.
x = Money in your pocket= $15
y = Price of 1 gallon of gas= $3
Hence,
$15 / $3 = 5 gallons
So, with $15 we can buy 5 gallons of gas.
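The two examples above reduce to one line of arithmetic each; the sketch below (not from the text) just replays them in Python.

```python
# Example 1: bags needed for 10 items at 3 items per bag (round up).
import math
print(math.ceil(10 / 3))   # -> 4 bags

# Example 2: gallons of gas for $15 at $3 per gallon (x / y).
print(15 / 3)              # -> 5.0 gallons
```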
Basic Terms
Equation : An equation can be defined as a statement involving symbols (variables),
numbers (constants) and mathematical operators (Addition, Subtraction, Multiplication,
Division etc) that asserts the equality of two mathematical expressions. The equality of the
two expressions is shown by using a symbol “=” read as “is equal to”.
For example: 3X + 7 = 16 is an equation in the variable X.
=====================================================================
Variable: A variable is a symbol that represents a quantity in an algebraic expression. It is a
value that may change with time and scope of the concerned problem.
For example: in the equation 3X + 7 = 16, X is the variable.
Also, in the polynomial X² + 5XY − 3Y²,
both X and Y are variables.
When variables are used with other numbers, parentheses, or operations, they create an
algebraic expression. For example:
a + 2
3m + 6n − 6
n + 5
x − 7
w − 25
In algebra, variables are symbols used to represent unspecified numbers or values. Any
letter may be used as a variable.
Example: 3x + 2
An algebraic expression consists of one or more constants and variables along with
one or more arithmetic operations.
In each expression, the quantities being multiplied are called factors, and the result is
called the product.
Variables are used to change verbal expressions into algebraic expressions, that is,
expressions that are composed of letters that stand for numbers.
Key words that can help you translate words into letters and numbers include:
addition (sum, plus, more than), subtraction (difference, minus, less than), multiplication
(product, times, of), and division (quotient, per, divided by).

COEFFICIENT
Expression        Coefficient(s)
6m + 5            6
8r + 7m + 4       8, 7
14b − 8           14

Consider the expression: 7x + 5 − 3y² + y + x/3
In the term 7x, 7 is called the coefficient. A coefficient is a number that is multiplied by a
variable in an algebraic expression.
7x => coefficient = 7
variable = x
A solution of a linear equation in one variable is a real number which, when substituted for
the variable in the equation, makes the equation true.
Example: Is 3 a solution of 2x + 3 = 11?
2x + 3 = 11
2(3) + 3 = 11
6 + 3 = 11
9 = 11 is a false equation, so 3 is not a solution.
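A one-line sketch (not from the text) of the same substitution check in Python:

```python
# Substitute the candidate value x = 3 into 2x + 3 = 11 and test the equality.
x = 3
print(2 * x + 3 == 11)   # -> False, so 3 is not a solution
```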
To solve linear equations, we will make heavy use of the following facts: adding or subtracting
the same quantity on both sides of an equation, or multiplying or dividing both sides by the
same non-zero quantity, does not change the solution set.
Graphs of functions: A graph of a function is a visual representation of a function's
behavior on an x-y plane.
Graphs help us understand different aspects of the function, which would be difficult
to understand by just looking at the function itself.
You can graph thousands of equations, and there are different formulas for each
one. That said, there are always ways to graph a function if you forget the exact
steps for the specific type of function.
The graph of a function f is the set of all points in the plane of the form (x, f(x)). We could
also define the graph of f to be the graph of the equation y = f(x).
So, the graph of a function is a special case of the graph of an equation.
Graphing steps:
A function is a relation in which each element of the domain is paired with exactly one
element of the range. Another way of saying it is that there is one and only one output (y)
with each input (x).
X → f(x) → Y
Example 1.
Let f(x) = x² − 3.
The graph of f(x) in this example is the graph of y = x² − 3. It is easy to generate points on
the graph. Choose a value for the first coordinate, then evaluate f at that number to find the
second coordinate. The following table shows several values for x and the function f
evaluated at those numbers.
Each column of numbers in the table holds the coordinates of a point on the graph of f.
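Since the table itself is not reproduced here, the following sketch (not from the text; the x values are chosen for illustration) generates such points for f(x) = x² − 3.

```python
# Generate (x, f(x)) pairs that lie on the graph of y = x**2 - 3.
def f(x):
    return x ** 2 - 3

for x in [-3, -2, -1, 0, 1, 2, 3]:
    print(x, f(x))   # each printed pair is a point on the graph
```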
Histograms:
Histograms: - A graphical display of data using bars of different heights.
Def 1:- A histogram is a visual way to display frequency data using bars. A feature of
histograms is that they show the frequency of continuous data, such as the number of
trees at various heights from 3 feet to 8 feet.
General Characteristics:
• Column label is quantitative variable (ages).
• Column label is a range of values (or single value).
• Column height is size of the group.
• Columns NOT separated by space.
• Calculate mean, median, quartiles, standard deviation, and so on.
It is similar to a Bar Chart, but a histogram groups numbers into ranges.
Histograms are a great way to show results of continuous data, such as:
weight
height
how much time, etc.
But when the data is in categories (such as Country or Favorite Movie), we should
use a Bar Chart.
1. Histograms are used to show distributions of variables, while bar graphs are used
to compare variables.
2. Histograms plot quantitative (numerical) data with ranges of the data grouped into
intervals, while bar graphs plot categorical data.
3. The bars in a bar graph can be rearranged, but it does not make sense to rearrange
the bars in a histogram.
4. It is possible to speak of the skewness of a histogram, but not of a bar graph.
5. Bar graphs have space between the columns, while histograms do not.
Frequency Histogram:
• Each interval now contains a frequency number which represents that specific
interval's count added to the sum of all the previous intervals' counts.
• A cumulative frequency histogram will always contain column bars that get
increasingly taller (or stay the same height) as you move to the right.
Data set: {9, 25, 30, 31, 34, 36, 37, 42, 45, 47, 49, 43, 55, 58, 61, 63, 67}
Interval   Count (frequency)   Cumulative frequency
0-10       1                   1
11-20      0                   1 + 0 = 1
21-30      1                   1 + 1 = 2
31-40      5                   2 + 5 = 7
41-50      5                   7 + 5 = 12
51-60      2                   12 + 2 = 14
61-70      3                   14 + 3 = 17
Notice that the columns increase in height, or stay the same, as you move to the right.
• While the term "frequency" refers to the number of observations (or counts) of a
given piece of data, the term "relative frequency" refers to the number of
observations (or counts) expressed as a "part of the whole" or percentage.
• Each of the number of observations is divided by the total number of observations
from the entire data set.
• In the data set we have been examining, there are 2 pieces of data in the interval
"51-60", but there are 17 total pieces of data.
• The relative frequency of the 2 pieces of data in the interval "51-60" is 2/17 or 0.12
(to the nearest hundredth) or 12%. Notice that when you add all of the relative
frequencies in the chart below, you get a value of one (or 100%).
• Relative frequency may be expressed as a fraction (ratio), a decimal, or a
percentage.
Data set: {9, 25, 30, 31, 34, 36, 37, 42, 45, 47, 49, 43, 55, 58, 61, 63, 67}
Interval   Count (frequency)   Relative frequency
0-10       1                   1/17 ≈ 0.06
11-20      0                   0/17 = 0
21-30      1                   1/17 ≈ 0.06
31-40      5                   5/17 ≈ 0.29
41-50      5                   5/17 ≈ 0.29
51-60      2                   2/17 ≈ 0.12
61-70      3                   3/17 ≈ 0.18
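A short sketch (not from the text) of how the count, cumulative frequency, and relative frequency columns can be computed for the data set above.

```python
# Tally each interval, accumulate the running total, and express each count
# as a fraction of all 17 observations.
data = [9, 25, 30, 31, 34, 36, 37, 42, 45, 47, 49, 43, 55, 58, 61, 63, 67]
intervals = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50), (51, 60), (61, 70)]

total, cumulative = len(data), 0
for low, high in intervals:
    count = sum(low <= x <= high for x in data)
    cumulative += count
    print(f"{low}-{high}: count={count}, cumulative={cumulative}, "
          f"relative={count / total:.2f}")
```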
• How do you know when to use a histogram? You can decide this by looking at your
data. Ask yourself, 'Is the data continuous, or can I group the data into ranges?'
• What you are looking for is a group of data that is continuous. What this means is
that the data covers a range of values that does not jump.
• For example, the range of tree heights is 3 to 8 feet. The data does not jump from 4
to 6 feet. There is no gap. The countries of Norway, Finland, and Sweden, though, do
jump.
• There is no continuity between the countries. They are separate entities. For
histograms, you need continuous information and not just categories that jump.
Logarithms:
To be specific, the logarithm of a number x to a base b is just the exponent you put onto b to
make the result equal x.
Generically, if x = b^y, then we say that y is "the logarithm of x to the base b" or "the base-b
logarithm of x".
For instance, since 5² = 25, we know that 2 (the power) is the logarithm of 25 to base 5.
Symbolically, log5(25) = 2.
Log Rules:
1) logb(m·n) = logb(m) + logb(n)
2) logb(m/n) = logb(m) − logb(n)
3) logb(m^n) = n · logb(m)
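A quick numeric check of these statements (a sketch, not from the text; math.log(x, base) computes the base-b logarithm).

```python
# Verify log_5(25) = 2 and the power rule log_2(8**3) = 3 * log_2(8).
import math

print(math.log(25, 5))                           # ≈ 2.0, since 5**2 == 25
print(math.log(8 ** 3, 2), 3 * math.log(8, 2))   # both ≈ 9.0 (power rule)
```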
Derivatives
The derivative of a function of a real variable measures the sensitivity to change of the
function value (output value) with respect to a change in its argument (input value).
• Derivatives are a fundamental tool of calculus.
• For example, the derivative of the position of a moving object with respect to time is
the object's velocity: this measures how quickly the position of the object changes
when time advances.
m = f′(x) = lim (h → 0) [f(x + h) − f(x − h)] / (2h)
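A numerical sketch of that symmetric difference quotient (not from the text; f(x) = x² is chosen only for illustration, and its exact derivative is 2x).

```python
# Approximate f'(x) with the symmetric difference quotient from the formula above.
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(lambda x: x ** 2, 3))   # ≈ 6.0, matching d/dx x**2 = 2x at x = 3
```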
The derivative is the exact rate at which one quantity changes with respect to another.
(In finance the word has a different meaning: a derivative is a contract between two parties
which derives its value/price from an underlying asset.)
Geometrically, the derivative is the slope of the curve at a point on the curve.
The derivative of a function represents an infinitesimally small change in the function with
respect to one of its variables. The process of finding the derivative is called "differentiation."
[Figure: plots of a function f(x) and its derivative f′(x) over the interval 0 to 1.]
does not use function minimization. Minimizing functions has proven very useful in
ML.
Statistics
• Many studies generate large numbers of data points, and to make sense of all that
data, researchers use statistics that summarize the data, providing a better
understanding of overall tendencies within the distributions of scores.
• Statistics is a branch of mathematics dealing with the collection, analysis,
interpretation, presentation, and organization of data.
• Statistics deals with all aspects of data including the planning of data collection in
terms of the design of surveys and experiments.
• Two main statistical methods are used in data analysis: descriptive statistics, which
summarize data from a sample using indexes such as the mean or standard
deviation, and inferential statistics, which draw conclusions from data that are
subject to random variation (e.g., observational errors, sampling variation).
• Descriptive statistics are most often concerned with two sets of properties of a
distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion (or
variability) characterizes the extent to which members of the distribution depart
from its center and each other.
• Inferences on mathematical statistics are made under the framework of probability
theory, which deals with the analysis of random phenomena.
Statistics: The science of collecting, describing, and interpreting data. Two broad families of
statistics are used:
1. descriptive (which summarize the sample)
• Measures of central tendency
• Measures of dispersion
• Measures of skewness
2. inferential (which test for significant differences between groups and/or significant
relationships among variables within the sample)
• t-ratio, chi-square, beta-value
Descriptive statistics:
A descriptive statistic (in the count noun sense) is a summary statistic that
quantitatively describes or summarizes features of a collection of information, while
descriptive statistics in the mass noun sense is the process of using and analyzing
those statistics.
Descriptive statistics is distinguished from inferential statistics (or inductive
statistics), in that descriptive statistics aims to summarize a sample, rather than use
the data to learn about the population that the sample of data is thought to
represent.
This generally means that descriptive statistics, unlike inferential statistics, is not
developed on the basis of probability theory, and are frequently nonparametric
statistics.
Even when a data analysis draws its main conclusions using inferential statistics,
descriptive statistics are generally also presented.
For example, in papers reporting on human subjects, typically a table is included
giving the overall sample size, sample sizes in important subgroups (e.g., for each
treatment or exposure group), and demographic or clinical characteristics such as
the average age, the proportion of subjects of each sex, the proportion of subjects
with related comorbidities, etc.
Descriptive Statistics:
Frequencies
Basic measurements
Descriptive statistics can be used to summarize and describe a single variable (aka univariate).
Number
Frequency Count
Percentage
Deciles and quartiles
Measures of Central Tendency (Mean, Midpoint, Mode)
Variability
Variance and standard deviation
Graphs
Normal Curve
Averages:
Mode: most frequently occurring value in a distribution (any scale, most unstable).
Mean: the arithmetic average: the sum of all values in a distribution divided by the number of values.
• the "center of gravity" of the distribution
• evenly partitions the sum of all measurements among all cases
x̄ = ( Σ xᵢ ) / n, where the sum runs over i = 1, …, n
R: mean(x)
Besides the arithmetic mean, there are also the harmonic mean, the geometric mean, and the
generalized f-mean.
Median: midpoint in the distribution below which half of the cases reside (ordinal and
above)
• 50th percentile…
• less useful for inferential purposes
• more resistant to effects of outliers…
unit 1    unit 2
9.7       9.0
11.5      11.2
11.6      11.3
12.1      11.7
12.4      12.2
12.6      12.5
12.9 <--  13.2 <--
13.1      13.8
13.5      14.0
13.6      15.5
14.8      15.6
16.3      16.2
26.9      16.4
(the arrows mark the median of each column)
Sample mean: x̄ ("x-bar")
Population mean: µ ("mu")
The value 26.9 in unit 1 is an outlier.
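A short sketch (not from the text) using the unit 1 column above to show how the outlier 26.9 pulls the mean while the median barely moves.

```python
# Compare mean and median of "unit 1" with and without the outlier 26.9.
from statistics import mean, median

unit1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 12.9,
         13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
print(round(mean(unit1), 2), median(unit1))            # ≈ 13.92 and 12.9
print(round(mean(unit1[:-1]), 2), median(unit1[:-1]))  # ≈ 12.84 and 12.75 without 26.9
```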
outliers : An outlier is any value that is numerically distant from most of the other data
points in a set of data. It is not uncommon to find an outlier in a data set.
In statistics, an outlier is an observation point that is distant from other
observations.
An outlier may be due to variability in the measurement or it may indicate
experimental error; the latter are sometimes excluded from the data set.
Box plot construction: The box plot is a useful graphical display for describing the behavior of
the data in the middle as well as at the ends of the distributions. The box plot uses the median
and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower
quartile is Q1 and the upper quartile is Q3, then the difference (Q3 − Q1) is called the
interquartile range or IQ.
Box plots with fences: A box plot is constructed by drawing a box between the upper and
lower quartiles with a solid line drawn across the box to locate the median. The following
quantities (called fences) are needed for identifying extreme values in the tails of the distribution:
lower inner fence: Q1 − 1.5·IQ
upper inner fence: Q3 + 1.5·IQ
lower outer fence: Q1 − 3·IQ
upper outer fence: Q3 + 3·IQ
From an examination of the fence points and the data, one point (1441) exceeds the upper
inner fence and stands out as a mild outlier; there are no extreme outliers.
The outlier is identified as the largest value in the data set, 1441, and appears as the circle to
the right of the box plot.
Outliers may contain important information: Outliers may contain information about the
process under investigation or the data gathering and recording process. Before considering
the possible elimination of these points from the data, one should try to understand why they
appeared and whether it is likely similar values will continue to appear.
Of course, outliers are often bad data points.
For example, it could be that there were battery problems with the timer that caused the
alarm to go off before the runner's 60 seconds were up.
For example, it could be that the running signal was not loud enough for all of the athletes
to hear, resulting in one runner having a late start. This would put the runner's time far
below that of the other runners.
However, if the outlier was due to chance or some natural process of the construct that is
being measured, it should not be removed.
Detecting Outliers
The easiest way to detect an outlier is by creating a graph. We can spot outliers by using
histograms, scatterplots, number lines, and the interquartile range.
Histogram:
Suppose that we were asked to create a histogram using the data that we collected from the
high school track runners. The following is the histogram of the change in distance for each
of the track runners.
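Beyond eyeballing a graph, the interquartile-range rule mentioned above can be coded directly; here is a sketch (not from the text) applied to the unit 1 measurements from the earlier median example. With this quartile convention, only 26.9 is flagged.

```python
# Flag values outside the inner fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
from statistics import quantiles

data = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 12.9,
        13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
q1, _, q3 = quantiles(data, n=4)                    # quartiles ("exclusive" method)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < lower or x > upper])  # -> [26.9]
```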
Percentiles
A “percentile” is a comparison score between a particular score and the scores of the
rest of a group. It shows the percentage of scores that a particular score surpassed.
R = (P/100) × N, where P is the desired percentile and N is the total number of values.
Percentiles are commonly used to report scores in tests, like the SAT, GRE and LSAT.
Definition 1: The nth percentile is the lowest score that is greater than a certain
percentage (“n”) of the scores. In this example, our n is 25, so we’re looking for the lowest
score that is greater than 25%.
Definition 2: The nth percentile is the smallest score that is greater than or equal to a
certain percentage of the scores. To rephrase this, it’s the percentage of data that falls at or
below a certain observation. This is the definition used in AP statistics.
• Percentiles are standard scores and may be used to compare scores of different measurements.
• Percentiles change at different rates (remember the comparison of low and high percentile
scores with middle percentiles), so they should not be used to determine one score
for several different tests.
• You may prefer to use the T-scale when converting raw scores to standard scores.
Variance:
• analogous to average deviation of cases from mean
• in fact, based on sum of squared deviations from the mean—“sum-of-squares”
s² = Σ (xᵢ − x̄)² / (n − 1), where the sum runs over i = 1, …, n
R: var(x)
• computational form:
s² = [ Σ xᵢ² − ( Σ xᵢ )² / n ] / (n − 1)
• note: units of variance are squared…
• this makes variance hard to interpret
• ex.: projectile point sample:
mean = 22.6 mm
variance = 38 mm²
standard deviation:
The standard deviation gives a rough estimate of the typical distance of the data
values from the mean.
The larger the standard deviation, the more variability there is in the data and the
more spread out the data are
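A short sketch (not from the text) of the sample variance and standard deviation formulas above, applied to the unit 1 measurements from the earlier example; the library calls at the end just confirm the hand computation.

```python
# Sample variance: sum of squared deviations from the mean, divided by n - 1;
# the standard deviation is its square root.
from statistics import mean, stdev, variance

data = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 12.9,
        13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
n, xbar = len(data), mean(data)
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
print(round(s2, 2), round(s2 ** 0.5, 2))                 # variance and std deviation
print(round(variance(data), 2), round(stdev(data), 2))   # same values via the library
```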
[Figure: two bell-shaped frequency distributions plotted over the range −15 to 15.]
Both of these distributions are bell-shaped.
Square root of the Variance:
– expressed in the original units of measurement
– Represents the average amount of dispersion in a sample
– Used in a number of inferential statistics
• square root of variance:
s = √[ Σ (xᵢ − x̄)² / (n − 1) ] = √[ ( Σ xᵢ² − ( Σ xᵢ )² / n ) / (n − 1) ]
• units are in same units as base measurements
• ex.: projectile point sample:
mean = 22.6 mm
standard deviation = 6.2 mm
• mean +/- sd (16.4—28.8 mm)
– should give at least some intuitive sense of where most of the cases lie
Skewness(symmetry):
• Measures look at how lopsided distributions are—how far from the ideal of the
normal curve they are
• When the median and the mean are different, the distribution is skewed. The
greater the difference, the greater the skew.
• Distributions that trail away to the left are negatively skewed and those that trail
away to the right are positively skewed
• If the skewness is extreme, the researcher should either transform the data to make
them better resemble a normal curve or else use a different set of statistics that does
not assume a normal distribution.
Indicators of Skewness:
Frequency curve is not Symmetrical bell shaped.
Values of Mean, Median, and Mode do not coincide.
Sum of positive deviation is not equal to sum of negative deviation.
Measures of Skewness:
Karl Pearson's coefficient of skewness:
Sk = (Mean − Mode) / Standard Deviation
• This visual indicator is imprecise and does not take into consideration sample size n.
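A minimal sketch (not from the text) of that coefficient; the mean and standard deviation are borrowed from the earlier projectile-point example, and the mode of 20.0 mm is a made-up value for illustration.

```python
# Karl Pearson's coefficient of skewness: Sk = (mean - mode) / standard deviation.
def pearson_skewness(mean, mode, std_dev):
    return (mean - mode) / std_dev

print(round(pearson_skewness(mean=22.6, mode=20.0, std_dev=6.2), 2))  # positive => right skew
```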
Kurtosis
Kurtosis:- It is concerned with the degree of Flatness or Peakedness in a curve.
• Kurtosis is the relative length of the tails and the degree of concentration in the
center.
• Consider three kurtosis prototype shapes.
Types of Kurtosis:-
Leptokurtic: a curve which is more peaked than the normal curve.
Platykurtic: a curve which is flatter than the normal curve.
Mesokurtic: a curve with the same degree of peakedness as the normal curve.
For example, you could stand in a mall and ask a sample of 100 people if they like shopping at
Sears. You could make a bar chart of yes or no answers (that would be descriptive
statistics), or you could use your research (and inferential statistics) to reason that
around 75-80% of the population (all shoppers in all malls) like shopping at Sears.
In order to measure probabilities, mathematicians have devised the following formula for
finding the probability of an event.
Probability of an Event:
P(A) = (The number of ways event A can occur) / (The total number of possible outcomes)
The probability of event A is the number of ways event A can occur divided by the
total number of possible outcomes. Let's take a look at a slight modification of the
problem from the top of the page.
Experiment 1: A spinner has 4 equal sectors colored yellow, blue, green and red. After
spinning the spinner, what is the probability of landing on each color?
Probabilities:
P(yellow) = (# of ways to land on yellow) / (total # of colors) = 1/4
P(blue) = (# of ways to land on blue) / (total # of colors) = 1/4
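A small simulation sketch (not from the text) that approximates these spinner probabilities; random.choice stands in for one spin of the four equal sectors.

```python
# Estimate P(yellow) by spinning 100,000 times and counting.
import random

colors = ["yellow", "blue", "green", "red"]
spins = [random.choice(colors) for _ in range(100_000)]
print(spins.count("yellow") / len(spins))   # ≈ 0.25, i.e. P(yellow) = 1/4
```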
Probability Distributions:
Suppose you are a teacher in a university. After checking assignments for a week, you
graded all the students. You gave these graded papers to a data entry guy in the university
and told him to create a spreadsheet containing the grades of all the students. But the guy
only stores the grades and not the corresponding students.
He made another blunder: he missed a couple of entries in a hurry and we have no idea
whose grades are missing. Let's find a way to solve this.
One way is that you visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that
there is a smooth curve like structure that defines our data, but do you notice an anomaly?
We have an abnormally low frequency at a particular score range. So the best guess would
be to have missing values that remove the dent in the distribution.
This is how you would try to solve a real-life problem using data analysis. For any Data
Scientist, a student or a practitioner, distribution is a must know concept. It provides the
basis for analytics and inferential statistics.
While the concept of probability gives us the mathematical calculations, distributions help
us actually visualize what’s happening underneath.
In this article, I have covered some important probability distributions which are explained
in a lucid as well as comprehensive manner.
Note: This article assumes you have a basic knowledge of probability. If not, you can refer
to an introductory article on the basics of probability.
Table of Contents
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
Types of Distributions
Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you decide
who is going to bat or ball? A toss! It all depends on whether you win or lose the toss, right?
Let’s say if the toss results in a head, you win. Else, you lose. There’s no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0
(failure), and a single trial. So the random variable X which has a Bernoulli distribution can
take value 1 with the probability of success, say p, and the value 0 with the probability of
failure, say q or 1-p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two
possible outcomes.
The probability mass function is given by: p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as P(X = x) = p if x = 1, and P(X = x) = 1 − p if x = 0.
The probabilities of success and failure need not be equally likely, like the result of a fight
between me and Undertaker. He is pretty much certain to win. So in this case probability of
my success is 0.15 while my failure is 0.85
Here, the probability of success(p) is not same as the probability of failure. So, the chart
below shows the Bernoulli Distribution of our fight.
Here, the probability of success = 0.15 and probability of failure = 0.85. The expected value
is exactly what it sounds. If I punch you, I may expect you to punch me back. Basically
expected value of any distribution is the mean of the distribution. The expected value of a
random variable X from a Bernoulli distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p
The variance of a random variable from a bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
There are many examples of Bernoulli distribution such as whether it’s going to rain
tomorrow or not where rain denotes success and no rain denotes failure and Winning
(success) or losing (failure) the game.
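A tiny sketch (not from the text) of the Bernoulli pmf, mean, and variance, using the p = 0.15 example above.

```python
# Bernoulli distribution with success probability p: P(X=1)=p, P(X=0)=1-p.
p = 0.15
pmf = {1: p, 0: 1 - p}          # p**x * (1 - p)**(1 - x) for x in {0, 1}
mean = 1 * p + 0 * (1 - p)      # E(X) = p
var = p * (1 - p)               # V(X) = p(1 - p)
print(pmf, mean, round(var, 4))   # -> {1: 0.15, 0: 0.85} 0.15 0.1275
```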
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these
outcomes are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli
Distribution, all the n number of possible outcomes of a uniform distribution are equally
likely.
A variable X is said to be uniformly distributed if its density function is f(x) = 1/(b − a)
for a ≤ x ≤ b (and 0 otherwise).
You can see that the shape of the uniform distribution curve is rectangular, which is why the
uniform distribution is also called the rectangular distribution.
For a Uniform Distribution, a and b are the parameters.
The number of bouquets sold daily at a flower shop is uniformly distributed with a
maximum of 40 and a minimum of 10.
Let’s try calculating the probability that the daily sales will fall between 15 and 30.
The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5
Similarly, the probability that daily sales are greater than 20 is = 0.667
The mean and variance of X following a uniform distribution is:
Mean -> E(X) = (a+b)/2
Variance -> V(X) = (b-a)²/12
The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard
uniform density is f(x) = 1 for 0 ≤ x ≤ 1 (and 0 otherwise).
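A short sketch (not from the text) replaying the flower-shop numbers above for a uniform distribution on [10, 40].

```python
# For X ~ Uniform(a, b), P(x1 < X <= x2) = (x2 - x1) / (b - a).
a, b = 10, 40
density = 1 / (b - a)
print(round((30 - 15) * density, 3))   # P(15 < X <= 30) -> 0.5
print(round((40 - 20) * density, 3))   # P(X > 20)       -> 0.667
print((a + b) / 2, (b - a) ** 2 / 12)  # mean = 25.0, variance = 75.0
```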
Binomial Distribution
Let’s get back to cricket. Suppose that you won the toss today and this indicates a
successful event. You toss again but you lost this time. If you win a toss today, this does not
necessitate that you will win the toss tomorrow. Let’s assign a random variable, say X, to
the number of times you won the toss. What can be the possible value of X? It can be any
number depending on the number of times you tossed a coin.
There are only two possible outcomes. Head denoting success and tail denoting failure.
Therefore, probability of getting a head = 0.5 and the probability of failure can be easily
computed as: q = 1- p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or
loss, win or lose and where the probability of success and failure is same for all the trials is
called a Binomial Distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
failure can be easily computed as q = 1 – 0.2 = 0.8.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect
the outcome of the current toss. An experiment with only two possible outcomes repeated
n number of times is called binomial. The parameters of a binomial distribution are n and p
where n is the total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are:
1. Each trial is independent.
2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)
The mathematical representation of the binomial distribution is given by:
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k), for k = 0, 1, …, n
A binomial distribution graph where the probability of success does not equal the probability
of failure is skewed toward the more likely outcome. When the probability of success equals
the probability of failure, the graph of the binomial distribution is symmetric.
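A small sketch (not from the text) of that pmf; the toss numbers are illustrative.

```python
# P(X = k) for a binomial distribution with n trials and success probability p.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(binom_pmf(4, 10, 0.5), 4))   # exactly 4 wins in 10 fair tosses ≈ 0.2051
```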
Normal Distribution
Normal distribution represents the behavior of most of the situations in the universe
(That is why it’s called a “normal” distribution. I guess!). The large sum of (small) random
variables often turns out to be normally distributed, contributing to its widespread
application. Any distribution is known as a normal distribution if it has the following
characteristics:
• The mean, median and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about the line x = µ.
• The total area under the curve is 1.
• Exactly half of the values are to the left of the center and exactly half to the right.
The mean and variance of a random variable X which is said to be normally distributed is
given by:
Mean -> E(X) = µ
Variance -> Var(X) = σ^2
Here, µ (mean) and σ (standard deviation) are the parameters.
The graph of a random variable X ~ N (µ, σ) is shown below.
A standard normal distribution is defined as the distribution with mean 0 and standard
deviation 1. For such a case, the PDF becomes: f(x) = (1/√(2π)) · e^(−x²/2).
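A minimal sketch (not from the text) of the normal density; with mu = 0 and sigma = 1 it reduces to the standard normal PDF above.

```python
# Normal density with mean mu and standard deviation sigma.
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(round(normal_pdf(0), 4))   # ≈ 0.3989, the peak of the standard normal curve
```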
Poisson Distribution
Suppose you work at a call center: approximately how many calls do you get in a day? It can
be any number. The total number of calls at a call center in a day is modeled by a Poisson
distribution. Some more examples are the number of emergency calls recorded at a hospital in
a day, the number of thefts reported in an area in a month, or the number of customers
arriving at a salon in an hour.
A Poisson process assumes that events occur independently of one another, at a constant
average rate, and one at a time (the chance of two or more events in a very small interval is
negligible). Now, if any distribution validates the above assumptions then it is a Poisson
distribution.
Some notations used in the Poisson distribution are: λ is the rate at which an event occurs, t is
the length of a time interval, and X is the number of events in that time interval.
The mean µ is the parameter of this distribution. µ is also defined as the λ times length of
that interval. The graph of a Poisson distribution is shown below:
The graph shown below illustrates the shift in the curve due to increase in mean.
It is perceptible that as the mean increases, the curve shifts to the right.
The mean and variance of X following a Poisson distribution:
Mean -> E(X) = µ
Variance -> Var(X) = µ
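A short sketch (not from the text) of the Poisson pmf for the call-center example; the mean of 10 calls per day is a made-up value.

```python
# P(X = k) for a Poisson distribution with mean mu.
from math import exp, factorial

def poisson_pmf(k, mu):
    return exp(-mu) * mu ** k / factorial(k)

print(round(poisson_pmf(8, 10), 4))   # exactly 8 calls when the mean is 10: ≈ 0.1126
```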
Exponential Distribution
Let’s consider the call center example one more time. What about the interval of time
between the calls ? Here, exponential distribution comes to our rescue. Exponential
distribution models the interval of time between the calls.
Other examples are:
1. Length of time between metro arrivals
2. Length of time between arrivals at a gas station
3. The life of an Air Conditioner
Exponential distribution is widely used for survival analysis. From the expected life of a
machine to the expected life of a human, exponential distribution successfully delivers the
result.
A random variable X is said to have an exponential distribution with PDF:
f(x) = λe^(−λx) for x ≥ 0 (and 0 otherwise),
with parameter λ > 0, which is also called the rate.
For survival analysis, λ is called the failure rate of a device at any time t, given that it has
survived up to t.
Mean and Variance of a random variable X following an exponential distribution:
Mean -> E(X) = 1/λ
Variance -> Var(X) = (1/λ)²
Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the
curve. This is explained better with the graph shown below.
To ease the computation, there are some formulas given below.
P{X ≤ x} = 1 − e^(−λx), which corresponds to the area under the density curve to the left of x.
P{X > x} = e^(−λx), which corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2), which corresponds to the area under the density curve
between x1 and x2.
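A small sketch (not from the text) of those formulas; the rate of 0.5 calls per minute is a made-up assumption, giving a mean gap of 1/λ = 2 minutes between calls.

```python
# Exponential distribution: P(X <= x) = 1 - exp(-lam * x).
from math import exp

lam = 0.5
print(round(1 - exp(-lam * 2), 4))   # P(gap <= 2 minutes) ≈ 0.6321
print(round(exp(-lam * 4), 4))       # P(gap > 4 minutes)  ≈ 0.1353
print(1 / lam, (1 / lam) ** 2)       # mean = 2.0, variance = 4.0
```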
Hypothesis Testing
The main purpose of statistics is to test a hypothesis. For example, you might
run an experiment and find that a certain drug is effective at treating headaches. But if you
can’t repeat that experiment, no one will take your results seriously. A good example of this
was the cold fusion discovery, which petered into obscurity because no one was able to
duplicate the results.
Contents :
1. What is a Hypothesis?
2. What is Hypothesis Testing?
3. Hypothesis Testing Examples (One Sample Z Test).
4. Hypothesis Test on a Mean (TI 83).
5. Bayesian Hypothesis Testing.
6. More Hypothesis Testing Articles
See also:
Critical Values
What is the Null Hypothesis?
What is a Hypothesis?
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation. For example:
A new medicine you think might work.
A way of teaching you think might be better.
A possible location of new species.
A fairer way to administer standardized tests.
It can really be anything at all as long as you can put it to the test.
What is a Hypothesis Statement?
If you are going to propose a hypothesis, it’s customary to write a statement. Your
statement will look like this:
“If I…(do this to an independent variable)….then (this will happen to the dependent
variable).”
For example:
If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
If I (give patients counseling in addition to medication) then (their overall depression
scale will decrease).
If I (give exams at noon instead of 7) then (student test scores will improve).
If I (look in this certain location) then (I am more likely to find new species).
A good hypothesis statement should include both the independent and dependent variables,
be written as an "if ... then" statement, and be testable by experiment, survey, or another
scientifically sound technique.
Hypothesis testing in statistics is a way for you to test the results of a survey or experiment
to see if you have meaningful results. You’re basically testing whether your results are valid
by figuring out the odds that your results have happened by chance. If your results may
have happened by chance, the experiment won’t be repeatable and so has little use.
Hypothesis testing can be one of the most confusing aspects for students, mostly because
before you can even perform a test, you have to know what your null hypothesis is. Often,
those tricky word problems that you are faced with can be difficult to decipher. But it’s
easier than you think; all you need to do is:
1. Figure out your null hypothesis,
2. State your null hypothesis,
3. Choose what kind of test you need to perform,
4. Either support or reject the null hypothesis.
What is the Null Hypothesis?
If you trace back the history of science, the null hypothesis is always the accepted fact.
Simple examples of null hypotheses that are generally accepted as being true are:
1. DNA is shaped like a double helix.
2. There are 8 planets in the solar system (excluding Pluto).
3. Taking Vioxx can increase your risk of heart problems (a drug now taken off the
market).
How do I State the Null Hypothesis?
You won’t be required to actually perform a real experiment or survey in elementary
statistics (or even disprove a fact like “Pluto is a planet”!), so you’ll be given word problems
from real-life situations. You’ll need to figure out what your hypothesis is from the
problem. This can be a little trickier than just figuring out what the accepted fact is. With
word problems, you are looking to find a fact that is nullifiable (i.e. something you can
reject).
Hypothesis Testing Examples #1: Basic Example
A researcher thinks that if knee surgery patients go to physical therapy twice a week
(instead of 3 times), their recovery period will be longer. The average recovery time for knee
surgery patients is 8.2 weeks.
The hypothesis statement in this question is that the researcher believes the average
recovery time is more than 8.2 weeks. It can be written in mathematical terms as:
H1: μ > 8.2
Next, you’ll need to state the null hypothesis (See: How to state the null hypothesis).
That’s what will happen if the researcher is wrong. In the above example, if the researcher
is wrong then the recovery time is less than or equal to 8.2 weeks. In math, that’s:
H0: μ ≤ 8.2
Rejecting the null hypothesis
Ten or so years ago, we believed that there were 9 planets in the solar system. Pluto was
demoted as a planet in 2006. The null hypothesis of “Pluto is a planet” was replaced by
“Pluto is not a planet.” Of course, rejecting the null hypothesis isn’t always that easy — the
hard part is usually figuring out what your null hypothesis is in the first place.
Hypothesis Testing Examples (One Sample Z Test)
The one sample z test isn’t used very often (because we rarely know the actual
population standard deviation). However, it’s a good idea to understand how it works as
it’s one of the simplest tests you can perform in hypothesis testing. In English class you got
to learn the basics (like grammar and spelling) before you could write a story; think of one
sample z tests as the foundation for understanding more complex hypothesis testing. This
page contains two hypothesis testing examples for one sample z-tests.
One Sample Hypothesis Testing Examples: #2
A principal at a certain school claims that the students in his school are above average
intelligence. A random sample of thirty students' IQ scores has a mean score of 112.5. Is
there sufficient evidence to support the principal’s claim? The mean population IQ is 100
with a standard deviation of 15.
Step 1: State the Null hypothesis. The accepted fact is that the population mean is 100, so:
H0: μ=100.
Step 2: State the Alternate Hypothesis. The claim is that the students have above average IQ
scores, so:
H1: μ > 100.
The fact that we are looking for scores “greater than” a certain point means that this is a
one-tailed test.
Step 3: Draw a picture to help you visualize the problem.
Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above) from the z-table. An
area of .05 is equal to a z-score of 1.645.
Step 6: Find the test statistic using this formula:
For this set of data: z= (112.5-100) / (15/√30)=4.56.
Step 7: If the test statistic from Step 6 is greater than the critical value from Step 5, reject the
null hypothesis. If it's less than the critical value, you cannot reject the null hypothesis. In this
case, it is greater (4.56 > 1.645), so you can reject the null.
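A short sketch (not from the text) that replays Example #2 numerically.

```python
# One-sample z test of H0: mu = 100 vs H1: mu > 100 with known sigma.
from math import sqrt

x_bar, mu0, sigma, n = 112.5, 100, 15, 30
z = (x_bar - mu0) / (sigma / sqrt(n))
print(round(z, 2))       # -> 4.56
print(z > 1.645)         # True: reject H0 at the 0.05 level (one-tailed)
```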
One Sample Hypothesis Testing Examples: #3
Blood glucose levels for obese patients have a mean of 100 with a standard deviation of 15.
A researcher thinks that a diet high in raw cornstarch will have a positive or negative effect
on blood glucose levels. A sample of 30 patients who have tried the raw cornstarch diet
have a mean glucose level of 140. Test the hypothesis that the raw cornstarch had an effect.
Step 1: State the null hypothesis: H0:μ=100
Step 2: State the alternate hypothesis: H1: μ ≠ 100
Step 3: State your alpha level. We’ll use 0.05 for this example. As this is a two-tailed test,
split the alpha into two. 0.05/2=0.025.
Step 4: Find the z-score associated with your alpha level. You're looking for the area in one
tail only. A z-score for 0.975 (1 − 0.025 = 0.975) is 1.96. As this is a two-tailed test, you would
also be considering the left tail (z = −1.96).
Step 5: Find the test statistic using this formula:
z=(140-100)/(15/√30)=14.60.
Step 6: If the test statistic from Step 5 is less than −1.96 or greater than 1.96 (the critical
values from Step 4), reject the null hypothesis. In this case, it is greater (14.60 > 1.96), so you
can reject the null.
*This process is made much easier if you use a TI-83 or Excel to calculate the z-score (the
"critical value").
See:
Critical z value TI 83
Z Score in Excel
Hypothesis Testing Examples: Mean (Using TI 83)
You can use the TI 83 calculator for hypothesis testing, but the calculator won’t figure out
the null and alternate hypotheses; that’s up to you to read the question and input it into the
calculator.
Sample problem: A sample of 200 people has a mean age of 21 with a population standard
deviation (σ) of 5. Test the hypothesis that the population mean is 18.9 at α = 0.05.
Step 1: State the null hypothesis. In this case, the null hypothesis is that the population
mean is 18.9, so we write: H0: μ = 18.9
Step 2: State the alternative hypothesis. We want to know if our sample, which has a mean
of 21 instead of 18.9, really is different from the population, therefore our alternate
hypothesis:
H1: μ ≠ 18.9
Step 3: Press Stat then press the right arrow twice to select TESTS.
Step 4: Press 1 to select 1:Z-Test…. Press ENTER.
Step 5: Use the right arrow to select Stats.
Step 6: Enter the data from the problem:
μ0: 18.9
σ: 5
x: 21
n: 200
μ: ≠μ0
Step 7: Arrow down to Calculate and press ENTER. The calculator shows the p-value:
p = 2.87 × 10⁻⁹
This is smaller than our alpha value of 0.05. That means we should reject the null hypothesis.
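If a TI-83 is not available, the same p-value can be reproduced in Python (a minimal sketch, assuming SciPy is installed):

from math import sqrt
from scipy.stats import norm

mu0, sigma = 18.9, 5      # hypothesised mean and population standard deviation
x_bar, n = 21, 200        # sample mean and sample size

z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value

print("z =", round(z, 2), "p =", p_value)   # roughly 2.9e-09, matching the calculator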
Bayesian Hypothesis Testing: What is it?
Bayesian hypothesis testing helps to answer the question: Can the results from a test or
survey be repeated?
Why do we care if a test can be repeated? Let’s say twenty people in the same village came
down with leukemia. A group of researchers find that cell-phone towers are to blame.
However, a second study found that cell-phone towers had nothing to do with the cancer
cluster in the village. In fact, they found that the cancers were completely random. If that
sounds impossible, it actually can happen! Clusters of cancer can happen simply by chance.
There could be many reasons why the first study was faulty. One of the main reasons could
be that they just didn’t take into account that sometimes things happen randomly and we
just don’t know why.
P Values.
It’s good science to let people know if your study results are solid, or if they could have
happened by chance. The usual way of doing this is to test your results with a p-value. A p
value is a number that you get by running a hypothesis test on your data. A P value of 0.05
(5%) or less is usually enough to claim that your results are repeatable. However, there’s
another way to test the validity of your results: Bayesian Hypothesis testing. This type of
testing gives you another way to test the strength of your results.
The usual p-value approach is sometimes called frequentist, or Non-Bayesian. It is how often an outcome happens over repeated runs of the experiment. It’s an objective view of whether an experiment is repeatable.
Bayesian hypothesis testing is a subjective view of the same thing. It takes into account
how much faith you have in your results. In other words, would you wager money on the
outcome of your experiment?
Z-TEST:
A Z-test is a type of hypothesis test. Hypothesis testing is just a way for you to figure out if
results from a test are valid or repeatable.
A hypothesis test will tell you if it’s probably true, or probably not true. A Z-test is used when your data is approximately normally distributed.
A z-test is a statistical test used to determine whether two population means are
different when the variances are known and the sample size is large. The test
statistic is assumed to have a normal distribution, and nuisance parameters such as
standard deviation should be known in order for an accurate z-test to be performed.
A one-sample location test, two-sample location test, paired difference test and
maximum likelihood estimate are examples of tests that can be conducted as z-tests.
Z-tests are closely related to t-tests, but t-tests are best performed when an
experiment has a small sample size.
Also, t-tests assume the standard deviation is unknown, while z-tests assume it is
known. If the standard deviation of the population is unknown, the assumption of
the sample variance equaling the population variance is made.
Hypothesis Test
The z-test is also a hypothesis test in which the z-statistic follows a normal
distribution.
The z-test is best used for greater than 30 samples because, under the central limit
theorem, as the number of samples gets larger, the samples are considered to be
approximately normally distributed.
When conducting a z-test, the null and alternative hypotheses, alpha and z-score
should be stated. Next, the test statistic should be calculated, and the results and
conclusion stated.
For example, assume an investor wishes to test whether the average daily return of a stock
is greater than 1%. A simple random sample of 50 returns is calculated and has an average
of 2%.
Assume the standard deviation of the returns is 2.50%. Therefore, the null hypothesis is that the average, or mean, return is equal to 1%.
Conversely, the alternative hypothesis is that the mean return is greater than 1%.
Assume an alpha of 0.05 is selected with a two-tailed test.
Consequently, 2.5% of the samples lie in each tail, and the alpha has a critical value of 1.96 or −1.96. If the value of z is greater than 1.96 or less than −1.96, the null hypothesis is rejected.
The value for z is calculated by subtracting the value of the average daily return selected for the test, or 1% in this case, from the observed average of the samples.
Next, divide the resulting value by the standard deviation divided by the square root of the
number of observed values.
Therefore, the test statistic is calculated to be 2.83, or (0.02 - 0.01) / (0.025 / (50)^(1/2)).
The investor rejects the null hypothesis since z is greater than 1.96, and concludes that the
average daily return is greater than 1%.
There are different types of Z-test, each for a different purpose. Some of the popular types are outlined below:
1. z-test for a single proportion is used to test a hypothesis on a specific value of the population proportion:
Statistically speaking, we test the null hypothesis H0: p = p0 against the alternative hypothesis H1: p ≠ p0, where p is the population proportion and p0 is a specific value of the population proportion we would like to test for acceptance.
The example on tea drinkers explained above requires this test. In that example, p0 = 0.5. Notice that in this particular example, proportion refers to the proportion of tea drinkers.
2. z-test for difference of proportions is used to test the hypothesis that two populations
have the same proportion.
Example;- suppose one is interested to test if there is any significant difference in the
habit of tea drinking between male and female citizens of a town. In such a situation, Z-
test for difference of proportions can be applied.
One would have to obtain two independent samples from the town- one from males and
the other from females and determine the proportion of tea drinkers in each sample in
order to perform this test.
3. z-test for single mean is used to test a hypothesis on a specific value of the population
mean.
Statistically speaking, we test the null hypothesis H0: μ = μ0 against the alternative hypothesis H1: μ ≠ μ0, where μ is the population mean and μ0 is a specific value of the population mean that we would like to test for acceptance.
Unlike the t-test for single mean, this test is used if n ≥ 30 and population standard
deviation is known.
4. z-test for single variance is used to test a hypothesis on a specific value of the
population variance.
Statistically speaking, we test the null hypothesis H0: σ = σ0 against H1: σ ≠ σ0, where σ is the population standard deviation and σ0 is a specific value of the population standard deviation that we would like to test for acceptance.
In other words, this test enables us to test if the given sample has been drawn from a
population with specific variance σ0. Unlike the chi square test for single variance, this
test is used if n ≥ 30.
5. Z-test for testing equality of variances is used to test the hypothesis of equality of two population variances when the sample size of each sample is 30 or larger.
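As an illustration of the first type, a z-test for a single proportion can be computed directly (a minimal sketch; the sample counts here are made up, and SciPy is assumed to be installed):

from math import sqrt
from scipy.stats import norm

# H0: p = p0 versus H1: p != p0
p0 = 0.5            # hypothesised proportion of tea drinkers
x, n = 290, 500     # hypothetical sample: 290 tea drinkers out of 500 people

p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - norm.cdf(abs(z)))

print("z =", round(z, 2), "p-value =", round(p_value, 4))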
T-test
The t-test was described in 1908 by William Sealy Gosset for monitoring the brewing at
Guinness in Dublin. Guinness considered the use of statistics a trade secret, so he published
his test under the pen-name 'Student' -- hence the test is now often called the 'Student's t-
test'.
The t-test is a basic test that is limited to two groups. For multiple groups, you would have
to compare each pair of groups, for example with three groups there would be three tests
(AB, AC, BC).
The t-test (or student's t-test) gives an indication of the separateness of two sets of
measurements, and is thus used to check whether two sets of measures are essentially
different (and usually that an experimental effect has been demonstrated). The typical way
of doing this is with the null hypothesis that means of the two sets of measures are equal.
The t-test assumes:
A normal distribution (parametric data)
Underlying variances are equal (if not, use Welch's test)
It is used when there is random assignment and only two sets of measurement to compare.
There are two main types of t-test:
Independent-measures t-test: when samples are not matched.
Matched-pair t-test: When samples appear in pairs (e.g. before-and-after).
For a single sample, the test statistic is t = (x̄ − μ0) / (s / √n), where s is the sample standard deviation of the sample and n is the sample size. The degrees of freedom used in this test is n − 1.
Independent two-sample t-test:
Equal sample sizes, equal variance
This test is only used when both:
the two sample sizes (that is, the number, n, of participants of each group) are equal;
it can be assumed that the two distributions have the same variance.
The statistic is t = (x̄1 − x̄2) / (Sx1x2 · √(2/n)), where Sx1x2 = √((s²x1 + s²x2)/2) is the grand standard deviation (or pooled standard deviation), 1 = group one, 2 = group two. The denominator of t is the standard error of the difference between two means.
For significance testing, the degrees of freedom for this test is 2n − 2 where n is the number
of participants in each group.
Unequal sample sizes, equal variance :
This test is used only when it can be assumed that the two distributions have the same
variance. (When this assumption is violated, see below.) The t statistic to test whether the means are different can be calculated as follows:
t = (x̄1 − x̄2) / (Sx1x2 · √(1/n1 + 1/n2)), where Sx1x2 = √(((n1 − 1)s²x1 + (n2 − 1)s²x2) / (n1 + n2 − 2)).
Note that the formulae above are generalizations of the case where both samples have equal sizes.
Sx1x2 is an estimator of the common standard deviation of the two samples: it is defined in
this way so that its square is an unbiased estimator of the common variance whether or not
the population means are the same. In these formulae, n = number of participants, 1 =
group one, 2 = group two. n − 1 is the number of degrees of freedom for either group, and the total sample size minus two (that is, n1 + n2 − 2) is the total number of degrees of freedom used in significance testing.
Unequal sample sizes, unequal variance (Welch's t-test):
Here the statistic is t = (x̄1 − x̄2) / √(s²1/n1 + s²2/n2), where s²i is the unbiased estimator of the variance of each of the two samples and ni is the number of participants in group i (1 = group one, 2 = group two). Note that in this case s²1/n1 + s²2/n2 is not a pooled variance. For use in significance testing, the distribution of the test statistic is
approximated as being an ordinary Student's t distribution with the degrees of freedom
calculated using
This is called the Welch–Satterthwaite equation. Note that the true distribution of the test
statistic actually depends (slightly) on the two unknown variances.
Dependent t-test for paired samples:
This test is used when the samples are dependent; that is, when there is only one sample
that has been tested twice (repeated measures) or when there are two samples that have
been matched or "paired". This is an example of a paired difference test. The statistic is t = (X̄D − μ0) / (sD / √n).
For this equation, the differences between all pairs must be calculated. The pairs are either
one person's pre-test and post-test scores or between pairs of persons matched into
meaningful groups (for instance drawn from the same family or age group). The average
(XD) and standard deviation (sD) of those differences are used in the equation. The
constant μ0 is non-zero if you want to test whether the average of the difference is
significantly different from μ0. The degree of freedom used is n − 1.
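In practice these t-tests are rarely worked out by hand; SciPy provides them directly (a minimal sketch with made-up data, assuming SciPy is installed):

from scipy import stats

group_a = [24.1, 25.3, 26.2, 24.8, 25.9, 26.5]   # hypothetical measurements
group_b = [22.7, 23.9, 24.2, 23.1, 24.8, 23.5]

# Independent-measures t-test (equal variances assumed)
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Welch's test (unequal variances)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Matched-pair t-test (e.g. before-and-after scores for the same subjects)
before = [70, 68, 75, 80, 64]
after = [74, 71, 77, 83, 66]
t_paired, p_paired = stats.ttest_rel(after, before)

print(t_ind, p_ind)
print(t_welch, p_welch)
print(t_paired, p_paired)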
F – Test :
An F-test (Snedecor and Cochran, 1983) is used to test if the standard deviations of two populations are equal. This test can be a two-tailed test or a one-tailed test. The two-tailed version tests against the alternative that the standard deviations are not equal. The one-tailed version only tests in one direction, that is, the standard deviation from the first population is either greater than or less than (but not both) the second population standard deviation. The choice is determined by the problem. For example, if we are testing a new process, we may only be interested in knowing if the new process is less variable than the old process. The test statistic for F is simply the ratio of the two sample variances: F = s1² / s2².
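A minimal sketch of this two-sample F-test in Python (made-up data; SciPy is assumed to be installed):

import numpy as np
from scipy.stats import f

old_process = np.array([2.1, 2.4, 1.9, 2.6, 2.3, 2.2, 2.5, 2.0])       # hypothetical measurements
new_process = np.array([2.22, 2.30, 2.25, 2.35, 2.28, 2.31, 2.27, 2.29])

s1_sq = old_process.var(ddof=1)      # sample variance of the old process
s2_sq = new_process.var(ddof=1)      # sample variance of the new process
F = s1_sq / s2_sq
df1, df2 = len(old_process) - 1, len(new_process) - 1

# One-tailed test: is the new process less variable than the old one?
p_value = 1 - f.cdf(F, df1, df2)
print("F =", round(F, 2), "p-value =", round(p_value, 4))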
ANOVA (Analysis of Variance):
Analysis of variance splits the observed variability in a data set into systematic factors and random factors. The systematic factors have a statistical influence on the given data set, but the random factors do not.
Analysts use the analysis of variance test to determine the influence that independent variables have on the dependent variable in a regression study.
An ANOVA test is a way to find out if survey or experiment results are significant.
In other words, they help you to figure out if you need to reject the null hypothesis or
accept the alternate hypothesis.
Basically, you’re testing groups to see if there’s a difference between them. Examples of
when you might want to test different groups:
Types of ANOVA
There are two types of analysis of variance: one-way (or unidirectional) and two-
way.
One-way or two-way refers to the number of independent variables in your Analysis
of Variance test.
A one-way ANOVA evaluates the impact of a sole factor on a sole response variable.
It determines whether all the samples are the same.
The one-way ANOVA is used to determine whether there are any statistically
significant differences between the means of three or more independent (unrelated)
groups.
A two-way ANOVA is an extension of the one-way ANOVA.
With a one-way, you have one independent variable affecting a dependent variable.
With a two-way ANOVA, there are two independents. For example, a two-way
ANOVA allows a company to compare worker productivity based on two
independent variables, say salary and skill set. It is utilized to observe the
interaction between the two factors.
One-way or two-way refers to the number of independent variables (IVs) in your Analysis
of Variance test.
One-way has one independent variable (with two or more levels) and two-way has two independent variables (each can have multiple levels).
For example, a one-way Analysis of Variance could have one IV (brand of cereal) and a two-
way Analysis of Variance has two IVs (brand of cereal, calories).
One way analysis: When we are comparing three or more groups based on one factor variable, it is said to be a one way analysis of variance (ANOVA). For example, if we want to compare whether or not the mean output of three workers is the same based on the working hours of the three workers.
Two way analysis: When there are two factor variables, it is said to be a two way analysis of variance (ANOVA). For example, based on working condition and working hours, we can compare whether or not the mean output of three workers is the same.
K-way analysis: When factor variables are k, then it is said to be the k-way analysis of
variance (ANOVA).
In the above example, your levels for “brand of cereal” might be Lucky Charms, Raisin
Bran, Cornflakes — a total of three levels. Your levels for “Calories” might be: sweetened,
unsweetened — a total of two levels.
The use of this parametric statistical technique involves certain key assumptions,
including the following:
1. Independence of case: Independence of case assumption means that the case of the
dependent variable should be independent or the sample should be selected randomly.
There should not be any pattern in the selection of the sample.
2. Normality: the distribution of the dependent variable should be approximately normal within each group.
3. Homogeneity: Homogeneity means variance between the groups should be the same. Levene’s test is used to test the homogeneity between groups.
Sum of squares between groups: For the sum of squares between groups, we calculate the mean of each group and the grand mean of all observations. For each group we take the deviation of its mean from the grand mean, square it, multiply it by the group size, and finally sum these quantities over all groups.
Sum of squares within groups: To get the sum of squares within groups, we take the deviation of each observation from its own group mean, square these deviations, and sum them over all groups.
F-ratio: To calculate the F-ratio, the sum of squares between groups divided by its degrees of freedom (the mean square between groups) is divided by the sum of squares within groups divided by its degrees of freedom (the mean square within groups).
Degree of freedom: To calculate the degree of freedom between the sums of the squares
group, we will subtract one from the number of groups. The sum of the square within the
group’s degree of freedom will be calculated by subtracting the number of groups from the
total observation.
BSS df = (g − 1), where BSS is the between-groups sum of squares, g is the number of groups, and df is the degrees of freedom.
WSS df = (N − g), where WSS is the within-groups sum of squares and N is the total sample size.
Significance: The calculated F-ratio is compared with the critical F value from the F-table at the chosen alpha level and degrees of freedom (g − 1, N − g). If the calculated F exceeds the critical value, the null hypothesis that all group means are equal is rejected.
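The whole procedure can be carried out in one call with SciPy's one-way ANOVA function (a minimal sketch with made-up worker-output data; SciPy is assumed to be installed):

from scipy.stats import f_oneway

# Hypothetical daily output for three workers
worker_1 = [23, 25, 21, 24, 26]
worker_2 = [30, 28, 29, 32, 27]
worker_3 = [22, 20, 24, 23, 21]

F, p = f_oneway(worker_1, worker_2, worker_3)
print("F =", round(F, 2), "p =", round(p, 4))
# If p < 0.05, reject H0 that all three mean outputs are equal.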
Extension:
Many people have this doubt, what’s the difference between statistics and machine
learning? Is there something like machine learning vs. statistics?
From a traditional data analytics standpoint, the answer to the above question is simple.
Machine Learning is an algorithm that can learn from data without relying on rules-
based programming.
Statistical modeling is a formalization of relationships between variables in the data
in the form of mathematical equations.
Machine learning is all about predictions, supervised learning, unsupervised
learning, etc.
Statistics is about sample, population, hypothesis, etc.
A statistician and machine learning expert at Stanford calls machine learning “glorified
statistics”.
Nowadays, both machine learning and statistics techniques are used in pattern
recognition, knowledge discovery and data mining. The two fields are converging
more and more, even though they are sometimes depicted as almost mutually exclusive.
Machine learning and statistics share the same goal: learning from data. Both
these methods focus on drawing knowledge or insights from the data. But, their
methods are affected by their inherent cultural differences.
The machine learning community is often described in terms of five “tribes”:
Symbolists: The origin of this tribe is in logic and philosophy. This group relies on
inverse deduction to solve problems.
Connectionists: The origin of this tribe is in neuroscience. This group relies on
backpropagation to solve problems.
Evolutionaries: The origin of this tribe is in evolutionary biology. This group relies
on genetic programming to solve problems.
Bayesians: The origin of this tribe is in statistics. This group relies on probabilistic
inference to solve problems.
Analogizers: The origin of this tribe is in psychology. This group relies on kernel
machines to solve problems.
CALCULUS
The word Calculus comes from Latin meaning "small stone", Because it is like
understanding something by looking at small pieces.
Calculus (from Latin calculus, literally 'small pebble', used for counting and calculations, as
on an abacus), is the mathematical study of continuous change, in the same way that
geometry is the study of shape and algebra is the study of generalizations of arithmetic
operations.
Differential Calculus cuts something into small pieces to find how it changes.
Integral Calculus joins (integrates) the small pieces together to find how much there is.
Slope = Change in Y/Change in X
Notation
"The derivative of f equals the limit as Δx goes to zero of f(x+Δx) - f(x) over Δx"
What is a 'Derivative':
The slope of a line in the plane containing the x and y axes is generally represented
by the letter m, and is defined as the change in the y coordinate divided by the
corresponding change in the x coordinate, between two distinct points on the line.
This is described by the following equation:
m = Δy / Δx = (y2 − y1) / (x2 − x1)
The grade (also called slope, incline, gradient, mainfall, pitch or rise) of a
physical feature, landform or constructed line refers to the tangent of the
angle of that surface to the horizontal.
It is a special case of the slope, where zero indicates horizontality. A larger
number indicates higher or steeper degree of "tilt".
Often slope is calculated as a ratio of "rise" to "run", or as a fraction ("rise over
run") in which run is the horizontal distance and rise is the vertical distance.
The grades or slopes of existing physical features such as canyons and
hillsides, stream and river banks and beds are often described.
Grades are typically specified for new linear constructions (such as roads,
landscape grading, roof pitches, railroads, aqueducts, and pedestrian or
bicycle circulation routes). The grade may refer to the longitudinal slope or
the perpendicular cross slope.
[Figure: a sloped line where d = run, Δh = rise, l = slope length, and α = angle of inclination.]
There are several ways to express a slope or grade:
1. as an angle of inclination to the horizontal (the angle α opposite the "rise" side of a right triangle with a vertical rise and a horizontal run).
2. as a percentage, the formula for which is 100(rise/run), which could also be expressed as the tangent of the angle of inclination times 100. In the U.S., this percentage "grade" is the most commonly used unit for communicating slopes in transportation (streets, roads, highways and rail tracks), surveying, construction, and civil engineering.
3. as a per mille figure, the formula for which is 1000(rise/run), which could also be expressed as the tangent of the angle of inclination times 1000. This is commonly used in Europe to denote the incline of a railway.
4. as a ratio of one part rise to so many parts run. For example, a slope that has a
rise of 5 feet for every 100 feet of run would have a slope ratio of 1 in 20. (The
word "in" is normally used rather than the mathematical ratio notation of
"1:20"). This is generally the method used to describe railway grades in
Australia and the UK.
5. as a ratio of many parts run to one part rise, which is the inverse of the
previous expression (depending on the country and the industry
standards). For example, "slopes are expressed as ratios such as 4:1.
This means that for every 4 units (feet or meters) of horizontal
distance there is a 1-unit (foot or meter) vertical change either up or
down."
Any of these may be used. Grade is usually expressed as a percentage, but this is
easily converted to the angle α from horizontal or the other expressions.
Slope may still be expressed when the horizontal run is not known: the rise
can be divided by the hypotenuse (the slope length).
Interesting Functions
Accuracy
There are only a few hundred pixels in either direction, and so the calculations are not
totally accurate. But they should give you a good feel for what is going on.
And don't worry, you can usually use differential calculus to find an accurate
answer.
Here we look at doing the same thing but using the "dy/dx" notation (also called
Leibniz's notation) instead of limits.
1. Add Δx
When x increases by Δx, y increases by Δy: y + Δy = f(x + Δx)
2. Subtract y = f(x)
To get: Δy = f(x + Δx) − f(x)
3. Rate of Change
To work out how fast (called the rate of change) we divide by Δx: Δy/Δx = [ f(x + Δx) − f(x) ] / Δx
4. Reduce Δx close to 0
We can't let Δx become 0 (because that would be dividing by 0), but we can make it head towards zero and call it "dx": Δx → dx, so Δy/Δx becomes dy/dx.
Try It On A Function
f(x) = x²
dy/dx = [ (x + dx)² − x² ] / dx
= [ x² + 2x·dx + (dx)² − x² ] / dx (expand (x + dx)²)
= 2x + dx (simplify the fraction)
= 2x (dx goes towards 0)
So the derivative of x² is 2x.
Derivative Rules:
Here are useful rules to help you work out the derivatives of many functions (with examples below). Note: the little mark ’ means "Derivative of".
Power Rule
The derivative of xⁿ is n·xⁿ⁻¹. "The derivative of" can be shown with the little mark ’:
f’(xⁿ) = n·xⁿ⁻¹
For example, the derivative of x³ is 3x².
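The power rule can be checked numerically with a small finite difference (a minimal sketch in plain Python; the function and step size are arbitrary choices):

def numerical_derivative(f, x, dx=1e-6):
    # Approximate f'(x) with a small finite difference
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**3                    # n = 3, so the power rule predicts f'(x) = 3x^2
print(numerical_derivative(f, 2.0))   # approximately 12, i.e. 3 * 2**2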
When we write u = u(x,y), we are saying that we have a function, u, which depends
on two independent variables: x and y. We can consider the change in u with respect
to either of these two independent variables by using the partial derivative. The
partial derivative of u with respect to x is written as ∂u/∂x.
What this means is to take the usual derivative, but only x will be the variable. All
other variables will be treated as constants. We can also determine how u changes
with y when x is held constant. This is the partial of u with respect to y. It is written
as ∂u/∂y.
The rule for partial derivatives is that we differentiate with respect to one variable
while keeping all the other variables constant. As another example, find the partial
derivatives of u with respect to x and with respect to y for a given function, treating the other variable as a constant in each case.
So far we have defined and given examples for first-order partial derivatives.
Second-order partial derivatives are simply the partial derivative of a first-order
partial derivative. We can have four second-order partial derivatives:
∂²u/∂x², ∂²u/∂y∂x (differentiate first with respect to x, then with respect to y), ∂²u/∂x∂y, and ∂²u/∂y².
What if the variables x and y also depend on other variables? For example, we could
have x = x(s,t) and y = y(s,t).
Then,
∂u/∂s = (∂u/∂x)(∂x/∂s) + (∂u/∂y)(∂y/∂s)
and
∂u/∂t = (∂u/∂x)(∂x/∂t) + (∂u/∂y)(∂y/∂t).
Example: for a box with a square base of side x and height y, the surface area is f(x, y) = 2x² + 4xy: the top and bottom with areas of x² each, and 4 sides of area xy. Then
f’x = 4x + 4y
f’y = 0 + 4x = 4x
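These partial derivatives can be checked symbolically with SymPy (a minimal sketch; it assumes the sympy package is installed and uses the surface-area function above):

import sympy as sp

x, y = sp.symbols('x y')
f = 2*x**2 + 4*x*y        # top and bottom (2 * x^2) plus four sides (4 * x*y)

print(sp.diff(f, x))      # 4*x + 4*y
print(sp.diff(f, y))      # 4*x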
Chain Rule
The chain rule provides us a technique for finding the derivative of composite
functions, with the number of functions that make up the composition determining
how many differentiation steps are necessary. For example, if a composite function
f(x) is defined as f(x) = (g ∘ h)(x) = g(h(x)).
Note that because two functions, g and h, make up the composite function f, you
have to consider the derivatives g′ and h′ in differentiating f( x).
Consider the two functions f (x) and g (x). These two functions are differentiable.
Then, the chain rule has two different forms as given below:
1. If F(x) = (f o g)(x), then the derivative of the composite function F(x) is F’(x) = f’(g(x)) · g’(x).
2. If y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx).
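The chain rule can likewise be verified with SymPy (a minimal sketch using an arbitrary composite function):

import sympy as sp

x = sp.symbols('x')
g = x**2 + 1              # inner function g(x)
F = sp.sin(g)             # composite function F(x) = sin(g(x))

# Chain rule: F'(x) = cos(g(x)) * g'(x)
print(sp.diff(F, x))                # 2*x*cos(x**2 + 1)
print(sp.cos(g) * sp.diff(g, x))    # the same expression, built by hand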
PYTHON
What is Python?
1. Python is an easy to learn, powerful programming language. It makes the application development process much faster and easier.
2. The programming language Python was conceived in the late 1980s, and its
implementation was started in December 1989 by Guido van Rossum in the Netherlands as a successor to the ABC programming language.
3. Python First release happened in 1991.
4. Python was named for the BBC TV show Monty Python's Flying Circus.
Why python?
1. Easy to understand
2. Beginners language
3. Portable
4. Less lines of code
5. Simple to implement
6. Huge libraries supports
Features of python:
1. Easy to learn and use.
2. Expressive language
3. Interpreted language.
4. Cross platform language(windows, linux, mac)
5. Free to install and opensource
6. Object oriented language
7. Extensible, Awesome online community
8. Large standard library (numpy, scipy)
9. GUI programming (Tkinter)
Python Implementation alternatives:
1. CPython (standard implementation of Python)
2. Jython(Python for java)
3. IronPython( Python for .net)
4. Stackless (Python for concurrency)
5. PyPy ( Python for speed)
Python Packages :
1. Web development - Django, Flask, Pylons, Web2py frameworks
2. Artificial Intelligence - Scikit-learn, Keras, TensorFlow, OpenCV
3. GUI - Tkinter
4. Desktop Applications - Jython, wxPython
>>> _account_number=34525
>>> print(_account_number)
34525
Name='RAVI'
Age=21
a=b=c=1
print(a)
print(b)
print(c)
Swapping variables
Python can swap values in a single line, and this applies to all objects in Python.
Syntax:
var1, var2 = var2, var1
Example:
>>> x = 10
>>> y = 20
>>> print(x)
10
>>> print(y)
20
>>> x, y = y, x
>>> print(x)
20
>>> print(y)
10
input() Function :
>>> a = input()
12
>>> print(a)
12
>>> age = input("Enter your age \n")
Enter your age
28
>>> print(age)
28
Python Datatypes :
A datatype represents the type of data stored in a variable or memory.
Built-in datatypes -- Already available in Python
User-defined datatypes -- Datatypes created by programmers
1. Built in datatypes :
* None Type
* Numeric Types --> int, float, complex
* Sequences --> str,bytes,bytearray,list,tuple,range
* Sets --> set,frozenset
* Mappings --> dict
None :
'None' datatype represents an object that does not contain any value.
In Java - null
In Python - None
Numeric data types :
1. int
2. float
3. complex
int :
* int represents an integer number.
* It is a number without a decimal or fraction part.
* There is no limit on the size of the int datatype; it can store very large integer numbers conveniently.
Eg : >>> a = 20
>>> type(a)
<class 'int'>
Float :
* Float represents floating number
* A floating number contains decimal part
Eg: >>> a = 223.345
>>> type(a)
<class 'float'>
>>> a = 22e5
>>> print(a)
2200000.0
Sequences in Python :
A sequence represents a group of elements or items.
eg : group of integer numbers will form sequence
There are six types of sequences in python :
1. str
2. bytes
3. bytearray
4. list
5. tuple
6. range
str datatype :
str represents the string datatype.
A string represents a group of characters.
Strings are enclosed in single quotes or double quotes ('', "").
A string can also be written in triple quotes (""" or '''): if assigned to a variable it is a string, otherwise it acts only as a docstring/comment.
eg :
>>> word = 'welcome'
>>> print(word)
welcome
>>> word = "welcome"
>>> print(word)
welcome
>>> word = '''welcome'''
>>> print(word)
welcome
>>> ch = 'A'
>>> type(ch)
<class 'str'>
>>> name = "srinivas"
>>> name[0]
's'
bytes data type :
The bytes data type represents a group of byte numbers, just like an array does.
Each value should be in the range 0 to 255.
It does not support negative numbers.
>>> a = [12,23,45]
>>> x = bytes(a)
>>> type(x)
<class 'bytes'>
>>> x[0]=55
Traceback (most recent call last):
File "<pyshell#120>", line 1, in <module>
x[0]=55
TypeError: 'bytes' object does not support item assignment
>>> a = [12,23,45,345]
>>> x = bytes(a)
Traceback (most recent call last):
File "<pyshell#122>", line 1, in <module>
x = bytes(a)
ValueError: bytes must be in range(0, 256)
>>> a = [12,23,45,-23]
>>> x = bytes(a)
Traceback (most recent call last):
File "<pyshell#124>", line 1, in <module>
x = bytes(a)
ValueError: bytes must be in range(0, 256)
tuple datatype :
A tuple is a read-only (immutable) sequence of elements, e.g.
>>> account_details = (101, 'srinivas')
>>> type(account_details)
<class 'tuple'>
>>> account_details[0]
101
>>> account_details[1]
'srinivas'
>>> account_details[1] = "PHANI"
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
account_details[1] = "PHANI"
TypeError: 'tuple' object does not support item assignment
range datatype :
range represents a sequence of numbers; range(start, stop, step) generates numbers from start up to (but not including) stop, in steps of step.
>>> r = range(10,30,3)
>>> for i in r:
print(i)
10
13
16
19
22
25
28
Sets :
A set is an unordered collection of elements much like a set in mathematics.
Order of elements is not maintained. It means elements may not appear in the same
order as they entered in to set.
Set does not accept duplicate elements
Two types of sets
1. set datatype
2. frozenset datatype
set datatype :
Set elements should be separated with commas (,).
A set always prints only unique elements.
>>> s = {10,20,30,40,50}
>>> print(s)
{40, 10, 50, 20, 30}
>>> type(s)
<class 'set'>
>>> s = {10,10,20,20,30,30}
>>> print(s)
{10, 20, 30}
frozenset datatype :
A frozenset is an immutable version of a set.
>>> s = {50,60,70,80,90}
>>> fs = frozenset(s)
>>> type(fs)
<class 'frozenset'>
>>> print(fs)
frozenset({80, 50, 70, 90, 60})
>>> s = {50,50,60,60,70}
>>> fs1 = frozenset(s)
>>> type(fs1)
<class 'frozenset'>
>>> print(fs1)
frozenset({50, 60, 70})
Mapping Type :
A map represents a group of elements in the form of key-value pairs, so that when a key is given, the corresponding value is retrieved.
The 'dict' datatype is an example of a map.
dict represents a dictionary that contains pairs of elements: the first one is the key and the second one is the value.
Key and value are separated by ':'
>>> d = {101:"Ram",102:"Ravi",103:"Rani"}
>>> print(d)
{101: 'Ram', 102: 'Ravi', 103: 'Rani'}
>>> d[101]
'Ram'
>>> d[102]
'Ravi'
>>> type(d)
<class 'dict'>
d = {}   # avoid using 'dict' as a variable name, as it shadows the built-in type
d['one'] = "This is one"
d[2] = "This is two"
Command line arguments :
* Arguments passed on the command line are stored by default as strings in a list named argv, which is available in the sys module. For example, running python add.py 10 20 gives:
argv[0] --> add.py
argv[1] --> 10
argv[2] --> 20
cmnd_line.py
===============
import sys
a = int(sys.argv[1])
b = int(sys.argv[2])
print(type(a))
print(type(b))
add = a + b
print(add)
NOTE : by default the arguments are taken as strings, so we have to convert them to integers.
execution :
C:\Users\welcome\Desktop>python cmnd_line.py 10 20
<class 'int'>
<class 'int'>
30
Control statements(if..elif..else)
============================
Any statement that expects sub-statements (a block) should end with a colon (:).
conditional execution :
if x > 0 :
print('x is positive')
Alternative execution :
if x%2 == 0 :
print('x is even')
else :
print('x is odd')
Chained conditionals :
if x < y:
    print('x is less than y')
elif x > y:
    print('x is greater than y')
else:
    print('x and y are equal')
What is loop?
A loop is a sequence of instructions that is continually repeated until a certain condition is reached.
Types of loops in Python :
1. for loop
2. while loop
3. Nested loop
for loop :
=========
for(i=0; i<n; i++) ===> (C-style for loops are not available in Python; use range() instead)
>>> for i in range(5):
print(i)
0
1
2
3
4
>>> for i in range(1,5):
print(i)
1
2
3
4
for_example.py
-------------
1. To do operation on each and every element of list.
a = [1,2,3,4]
s=0
for i in a:
s = s+i
print(s)
2. return to list
a = [1,2,3,4]
for i in a:
print(i ** 2)
b = [ (i**2) for i in a]
print(b)
for with if :
student_marks = [12,3,42,34,4,2,2]
for data in student_marks:
    if data % 2 == 0:
        print(data, " is even number ")
    else:
        print(data, " is odd number ")
output:
12 is even number
3 is odd number
42 is even number
34 is even number
4 is even number
2 is even number
2 is even number
student_marks = [12,3,42,34,4,2,2]
for data in student_marks:
    if data % 2 == 0:
        print(" %d is even number " % data)
    else:
        print(" %d is odd number " % data)
%d --> integer
%s --> string
%f --> float
dataset = ['python','java','perl']
for i in dataset:
print(i.upper())
PYTHON
JAVA
PERL
for i in dataset:
print(i[0].upper()+i[1:])
Python
Java
Perl
A statement that alters the execution of a loop from its designated sequence is called a loop control statement.
1. Break:
To break out of a loop we can use the break statement.
syntax :
for variable_name in sequence :
statement1
statement2
if(condition):
break
lst = [10,20,30,40]
for i in lst:
    if i == 30:
        break
    print(i)
else:
    print("loop completed")
10
20
Continue statement:
Continue statement is used to tell python to jump to the next iteration of loop.
lst = [10,20,30,40]
for i in lst:
    if i == 30:
        continue
    print(i)
else:
    print("loop completed")
10
20
40
loop completed
while loop:
A while loop is used to execute a set of statements as long as the condition passed to the while loop is true. Once the condition is false, control comes out of the loop.
syntax :
while<expression>:
statement1
statement2
(assuming, for example, i = 0 and n = 5)
>>> while(i < n):
    print(i)
    i += 1
This loop prints 0 to 4 and then stops. The following loop, however, never updates i, so the condition never becomes false:
>>> while(i < n):
    print(i)
infinite loop
while else loop:
a = int(input("Enter integer less 10\n"))
print(a)
print("\n")
Introduction to Numpy:
Numpy is a module that contains several classes, functions, variables etc. to deal with scientific calculations in Python. Numpy is useful to create and also process single and multi-dimensional arrays.
An array is an object that stores a group of elements of the same data type.
This means that, in the case of numbers, we can store only integers or only floats, but not a mix of an integer and a float.
To work with numpy we should import numpy module.
import numpy
arr = numpy.array([10,20,30])
print(arr)
output:
[10 20 30]
import numpy as np
arr = np.array([10,20,30])
print(arr)
output:
[10 20 30]
>>> from numpy import *
>>> a = array([1,2,3,4])
>>> a
array([1, 2, 3, 4])
>>> b = array(a)
>>> b
array([1, 2, 3, 4])
>>> c = b
>>> c
array([1, 2, 3, 4])
>>> arr1 = arange(6, 11)
>>> arr1
array([ 6, 7, 8, 9, 10])
sin(arr)
cos(arr)
tan(arr)
arcsin(arr) - sin inverse
arccos(arr) - cos inverse
arctan(arr) - tan inverse
log(arr) - logarithmic value
sum(arr)
prod(arr)
min(arr)
max(arr)
mean(arr)
median(arr)
var(arr)
std(arr)
unique(arr)
sort(arr)
>>>
>>> arr1
array([ 6, 7, 8, 9, 10])
>>> min(arr1)
6
>>> max(arr1)
10
>>> sum(arr1)
40
Comparison of arrays:
>>> from numpy import *
>>> a = array([1,2,3,4])
>>> b = array([0,2,2,4])
>>> c = a == b
>>> c
array([False, True, False, True])
>>>
>>> d = a>b
>>> d
array([ True, False, True, False])
Dimensions of array:
>>> b = array([10,
... 20,
... 30,
... 40
... ])
>>> b
array([10, 20, 30, 40])
>>>
2 Dimension : An array that contains more than one row and column; it is a combination of several 1D arrays.
3 Dimension : An array that contains 2D arrays as its elements, e.g.
[[[10 20]
  [30 40]]
 [[40 50]
  [50 60]]]
Attributes of array:
-------------------
1. ndim attribute
2. shape attribute
3. size attribute
ndim attribute : represents no.of dimensions or axes of the array.
>>> arr4.ndim
3
shape attribute : The shape attribute gives the shape of an array. The shape is a tuple listing the number of elements along each dimension.
1D --> shape represents the number of elements.
2D --> (number of rows, number of columns)
>>> arr1= array([1,2,3,4,5])
>>> print(arr1.shape)
(5,)
arr1 = arange(10)
print(arr1)
[0 1 2 3 4 5 6 7 8 9]
flatten() function : flatten() converts a multi-dimensional array into a 1D array, e.g. for a 2D array such as array([[1,2,3,4],[4,5,6,7]]):
>>> print(arr2.flatten())
[1 2 3 4 4 5 6 7]
ones() function : ones((r,c),dtype) creates an array with r rows and c columns, all elements filled with 1's, e.g.
>>> ones((2,3),int)
array([[1, 1, 1],
       [1, 1, 1]])
zeros((r,c),dtype)
>>> arr_0 = zeros((4,3),int)
>>> arr_0
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
>>> arr_0 = zeros((4,3),float)
>>> arr_0
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
eye() function :
---------------
eye() creates a 2D array and fills the elements in the diagonal with 1's. The general format of using that function is eye(n,dtype=datatype)
>>> arr_e = eye(3,dtype=int)
>>> arr_e
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> print(arr_e)
[[1 0 0]
 [0 1 0]
 [0 0 1]]
reshape() Function :
------------------
As we discussed it will convert 1D to multi dimensional arrays. The syntax is as below.
reshape(arrayname, (r, c)), or reshape(arrayname, (n, r, c)) for a 3D result
>>> a= array([1,2,3,4,5,6,7,8])
>>> b = reshape(a,(2,4))
>>> b
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
>>> b = reshape(a,(4,2))
>>> b
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]])
>>> a= array([1,2,3,4,5,6,7,8,9,10,11,12])
>>> b=a.reshape(3,4)
>>> b
array([[ 1, 2, 3, 4],
       [ 5, 6, 7, 8],
       [ 9, 10, 11, 12]])
>>> b = a.reshape(4,3)
>>>
>>> b
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
Matrices in numpy :
--------------------
In mathematics, a matrix represents a rectangular array of elements arranged in rows and columns.
If a matrix has only one row it is called a row matrix; if it has only one column it is called a column matrix. Row and column matrices are nothing but 1D arrays.
When a matrix has more than one row and one column it is called an m x n matrix, where m is the number of rows and n is the number of columns.
Using the 'matrix' function directly, we can convert an array to a matrix.
a = array([[1,2,3],[4,5,6]])
print(a)
print(type(a))
b = matrix(a)
print(type(b))
A matrix can also be created directly from a string:
s = "1 2; 3 4; 5 6"
b = matrix(s)
print(b)
output:
[[1 2]
 [3 4]
 [5 6]]
diagonal() function : diagonal() retrieves the diagonal elements of a matrix or array, e.g.
c = array([[1,3,6],[3,5,7],[6,7,8]])
print(c)
output :
[[1 3 6]
 [3 5 7]
 [6 7 8]]
b = diagonal(c)
print(b)
output :
[1 5 8]
max(),min(),sum(),mean():
------------------------
a = matrix([[1,2,3],[4,5,6],[7,8,9]])
print(a)
output :
[[1 2 3]
[4 5 6]
[7 8 9]]
max element in a is 9
min element in a is 1
sum of elements in a is 45
mean of a is 5.0
product of elements
----------------
>>> a = matrix(arange(12).reshape(3,4))
>>> a
matrix([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a.prod(0)
matrix([[ 0, 45, 120, 231]])
>>> a.prod(1)
matrix([[   0],
        [ 840],
        [7920]])
Transpose of matrix :
---------------------
Rewriting matrix rows into columns and vice versa is called 'transpose'.
we can use transpose() or getT()
Matrix add,sub,div:
--------------------
>>> a = matrix(" 1 2; 2 3")
>>> a
matrix([[1, 2],
[2, 3]])
>>> b = matrix("2 4 ; 4 5")
>>> b
matrix([[2, 4],
        [4, 5]])
>>> add = a+b
>>> add
matrix([[3, 6],
[6, 8]])
>>> sub = b-a
>>> sub
matrix([[1, 2],
[2, 2]])
>>> div = b/a
>>> div
matrix([[2. , 2. ],
[2. , 1.66666667]])
Matrix multiplication
-------------------
If a is an m x n matrix and b is an n x p matrix, the product a * b is an m x p matrix. Using the matrices a and b defined above:
>>> c = a * b
>>> c
matrix([[10, 14],
[16, 23]])
Machine Learning
The heavily hyped, self-driving Google car? The essence of machine learning.
Online recommendation offers such as those from Amazon and Netflix? Machine
learning applications for everyday life.
Knowing what customers are saying about you on Twitter? Machine learning
combined with linguistic rule creation.
Fraud detection? One of the more obvious, important uses in our world today.
• The process of ML is similar to that of data mining: both systems search through data to look for patterns. However, instead of extracting data for human comprehension (as in data mining), machine learning uses that data to detect patterns and adjust program actions accordingly.
• Regression: A regression problem is when the output variable is a real value, such
as “dollars” or “weight”.
Unsupervised Learning:
- Conceptually a different problem: no information about correct outputs is available.
- Regression: no guesses about the function can be made.
- Classification: no information about the correct classes is given; the algorithm has to discover structure in the data automatically.
Ex: in image recognition, they might learn to identify images that contain cats by
analyzing example images that have been manually labeled as "cat" or "no cat" and
using the results to identify cats in other images. They do this without any prior
knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead,
they automatically generate identifying characteristics from the learning material that
they process.
Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to be
noticed by either humans or other computer techniques. A trained neural network can be
thought of as an "expert" in the category of information it has been given to analyse. This
expert can then be used to provide projections given new situations of interest and answer
"what if" questions.
1. Adaptive learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
2. Self-Organisation: An ANN can create its own organisation or representation of the
information it receives during learning time.
3. Real Time Operation: ANN computations may be carried out in parallel, and special
hardware devices are being designed and manufactured which take advantage of
this capability.
4. Fault Tolerance via Redundant Information Coding: Partial destruction of a network
leads to the corresponding degradation of performance. However, some network
capabilities may be retained even with major network damage.
Much is still unknown about how the brain trains itself to process information, so theories
abound. In the human brain, a typical neuron collects signals from others through a host of
fine structures called dendrites. The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches. At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurones. When a neuron receives excitatory input that is
sufficiently large compared with its inhibitory input, it sends a spike of electrical activity
down its axon. Learning occurs by changing the effectiveness of the synapses so that the
influence of one neuron on another changes.
Components of a neuron
The synapse
We construct these neural networks by first trying to deduce the essential features of
neurones and their interconnections. We then typically program a computer to simulate
these features. However because our knowledge of neurones is incomplete and our
computing power is limited, our models are necessarily gross idealisations of real networks
of neurones.
An engineering approach
1. A simple neuron
An artificial neuron is a device with many inputs and one output. The neuron has two
modes of operation; the training mode and the using mode. In the training mode, the
neuron can be trained to fire (or not), for particular input patterns. In the using mode,
when a taught input pattern is detected at the input, its associated output becomes the
current output. If the input pattern does not belong in the taught list of input patterns, the
firing rule is used to determine whether to fire or not.
A simple neuron
2 Firing rules
The firing rule is an important concept in neural networks and accounts for their high
flexibility. A firing rule determines how one calculates whether a neuron should fire for any
input pattern. It relates to all the input patterns, not only the ones on which the node was
trained.
A simple firing rule can be implemented by using Hamming distance technique. The rule
goes as follows:
Take a collection of training patterns for a node, some of which cause it to fire (the 1-taught
set of patterns) and others which prevent it from doing so (the 0-taught set). Then the
patterns not in the collection cause the node to fire if, on comparison , they have more
input elements in common with the 'nearest' pattern in the 1-taught set than with the
'nearest' pattern in the 0-taught set. If there is a tie, then the pattern remains in the
undefined state.
For example, a 3-input neuron is taught to output 1 when the input (X1,X2 and X3) is 111
or 101 and to output 0 when the input is 000 or 001. Then, before applying the firing rule,
the truth table is;
X1: 0 0 0 0 1 1 1 1
X2: 0 0 1 1 0 0 1 1
X3: 0 1 0 1 0 1 0 1
OUT: 0 0 0/1 0/1 0/1 1 0/1 1
As an example of the way the firing rule is applied, take the pattern 010. It differs from 000
in 1 element, from 001 in 2 elements, from 101 in 3 elements and from 111 in 2 elements.
Therefore, the 'nearest' pattern is 000, which belongs in the 0-taught set. Thus the firing rule requires that the neuron should not fire when the input is 010. On the other hand, 011
is equally distant from two taught patterns that have different outputs and thus the output
stays undefined (0/1).
By applying the firing in every column the following truth table is obtained;
X1: 0 0 0 0 1 1 1 1
X2: 0 0 1 1 0 0 1 1
X3: 0 1 0 1 0 1 0 1
OUT: 0 0 0 0/1 0/1 1 1 1
The difference between the two truth tables is called the generalisation of the neuron.
Therefore the firing rule gives the neuron a sense of similarity and enables it to respond
'sensibly' to patterns not seen during training.
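The Hamming-distance firing rule described above is easy to prototype (a minimal sketch in Python; the taught sets follow the 3-input example in the text):

def hamming(a, b):
    # Number of positions in which two equal-length patterns differ
    return sum(x != y for x, y in zip(a, b))

def fire(pattern, one_taught, zero_taught):
    # Return 1, 0 or '0/1' (undefined) using the nearest-taught-pattern rule
    d1 = min(hamming(pattern, p) for p in one_taught)
    d0 = min(hamming(pattern, p) for p in zero_taught)
    if d1 < d0:
        return 1
    if d0 < d1:
        return 0
    return "0/1"

one_taught = ["111", "101"]    # patterns taught to fire
zero_taught = ["000", "001"]   # patterns taught not to fire

for x1 in "01":
    for x2 in "01":
        for x3 in "01":
            p = x1 + x2 + x3
            print(p, fire(p, one_taught, zero_taught))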
Figure 1.
For example: The network of figure 1 is trained to recognise the patterns T and H. The
associated patterns are all black and all white respectively, as shown below.
If we represent black squares with 0 and white squares with 1 then the truth tables for the
3 neurones after generalisation are;
X11: 0 0 0 0 1 1 1 1
X12: 0 0 1 1 0 0 1 1
X13: 0 1 0 1 0 1 0 1
OUT: 0 0 1 1 0 0 1 1
Top neuron
X21: 0 0 0 0 1 1 1 1
X22: 0 0 1 1 0 0 1 1
X23: 0 1 0 1 0 1 0 1
Middle neuron
X31: 0 0 0 0 1 1 1 1
X32: 0 0 1 1 0 0 1 1
X33: 0 1 0 1 0 1 0 1
OUT: 1 0 1 1 0 0 1 0
Bottom neuron
From the tables it can be seen that the following associations can be extracted:
In this case, it is obvious that the output should be all blacks since the input pattern is
almost the same as the 'T' pattern.
Here also, it is obvious that the output should be all whites since the input pattern is almost
the same as the 'H' pattern.
Here, the top row is 2 errors away from a T and 3 from an H. So the top output is black.
The middle row is 1 error away from both T and H so the output is random. The bottom
row is 1 error away from T and 2 away from H. Therefore the output is black. The total
output of the network is still in favour of the T shape.
The previous neuron doesn't do anything that conventional computers don't
do already. A more sophisticated neuron (figure 2) is the McCulloch and Pitts model (MCP).
The difference from the previous model is that the inputs are 'weighted', the effect that
each input has at decision making is dependent on the weight of the particular input. The
weight of an input is a number which when multiplied with the input gives the weighted
input. These weighted inputs are then added together and if they exceed a pre-set
threshold value, the neuron fires. In any other case the neuron does not fire.
The addition of input weights and of the threshold makes this neuron a very flexible and
powerful one. The MCP neuron has the ability to adapt to a particular situation by changing
its weights and/or threshold. Various algorithms exist that cause the neuron to 'adapt'; the
most used ones are the Delta rule and the back error propagation. The former is used in
feed-forward networks and the latter in feedback networks.
1. Feed-forward networks
Feed-forward ANNs (figure 1) allow signals to travel one way only; from input to output.
There is no feedback (loops) i.e. the output of any layer does not affect that same layer.
Feed-forward ANNs tend to be straightforward networks that associate inputs with
outputs. They are extensively used in pattern recognition. This type of organisation is also
referred to as bottom-up or top-down.
2 Feedback networks
Feedback networks (figure 1) can have signals travelling in both directions by introducing
loops in the network. Feedback networks are very powerful and can get extremely
complicated. Feedback networks are dynamic; their 'state' is changing continuously until
they reach an equilibrium point. They remain at the equilibrium point until the input
changes and a new equilibrium needs to be found. Feedback architectures are also referred
to as interactive or recurrent, although the latter term is often used to denote feedback
connections in single-layer organisations.
3 Network layers
The commonest type of artificial neural network consists of three groups, or layers, of
units: a layer of "input" units is connected to a layer of "hidden" units, which is connected
to a layer of "output" units. (see Figure 4.1)
The activity of the input units represents the raw information that is fed into the
network.
The activity of each hidden unit is determined by the activities of the input units and
the weights on the connections between the input and the hidden units.
The behaviour of the output units depends on the activity of the hidden units and
the weights between the hidden and output units.
This simple type of network is interesting because the hidden units are free to construct
their own representations of the input. The weights between the input and hidden units
determine when each hidden unit is active, and so by modifying these weights, a hidden
unit can choose what it represents.
We also distinguish single-layer and multi-layer organisations. In multi-layer networks, units are often numbered by layer, instead of following a global numbering.
4 Perceptrons
The most influential work on neural nets in the 60's went under the heading of
'perceptrons' a term coined by Frank Rosenblatt. The perceptron (figure 4.4) turns out to
be an MCP model (neuron with weighted inputs) with some additional, fixed, pre-
processing. Units labelled A1, A2, Aj , Ap are called association units and their task is to
extract specific, localised featured from the input images. Perceptrons mimic the basic idea
behind the mammalian visual system. They were mainly used in pattern recognition even
though their capabilities extended a lot more.
Figure 4.4
In 1969 Minsky and Papert wrote a book in which they described the limitations of single
layer Perceptrons. The impact that the book had was tremendous and caused a lot of neural
network researchers to lose their interest. The book was very well written and showed
mathematically that single layer perceptrons could not do some basic pattern recognition
operations like determining the parity of a shape or determining whether a shape is
connected or not. What they did not realise, until the 80's, is that given the appropriate
training, multilevel perceptrons can do these operations.
The main steps for building a Neural Network are:
Define the model structure (such as number of input features and outputs)
Initialize the model's parameters.
Loop.
o Calculate current loss (forward propagation)
o Calculate current gradient (backward propagation)
o Update parameters (gradient descent), as in the sketch below.
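A minimal sketch of that loop for a single-neuron (logistic-regression style) network, using only NumPy; the tiny dataset here is made up purely for illustration:

import numpy as np

# Toy data: 4 examples with 2 features each, and binary labels
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [0.], [0.], [1.]])

# Define the structure and initialize the parameters
w = np.zeros((2, 1))
b = 0.0
lr = 0.5   # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Loop: forward propagation, backward propagation, parameter update
for _ in range(1000):
    a = sigmoid(X @ w + b)                                    # forward: predictions
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # cross-entropy loss
    dz = a - y                                                # backward: gradient w.r.t. z
    dw = X.T @ dz / len(X)
    db = dz.mean()
    w -= lr * dw                                              # gradient-descent update
    b -= lr * db

print(loss, sigmoid(X @ w + b).round(2))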
Deep learning architectures such as deep neural networks, deep belief networks and
recurrent neural networks have been applied to fields including computer vision, speech
recognition, natural language processing, audio recognition, social network filtering,
machine translation, bioinformatics, drug design and board game programs, where they
have produced results comparable to and in some cases superior to human experts.[4][5][6]
Deep learning models are vaguely inspired by information processing and communication
patterns in biological nervous systems yet have various differences from the structural and
functional properties of biological brains, which make them incompatible with neuroscience evidence.
Deep learning methods aim at learning feature hierarchies with features from higher levels
of the hierarchy formed by the composition of lower level features. Automatically learning
features at multiple levels of abstraction allow a system to learn complex functions
mapping the input to the output directly from data, without depending completely on
human-crafted features.
Deep learning is focused on training deep (many layered) neural network models
using the backpropagation algorithm. The most popular techniques are:
Multilayer Perceptron Networks
Convolutional Neural Networks
Long Short-Term Memory Recurrent Neural Networks
Data Preparation
Data preparation usually involves iterative steps. In this documentation the steps for
getting data ready for supervised machine learning are identified as:
1. Gather data.
2. Clean the data.
3. Split the data.
4. Engineer features.
5. Preprocess the features.
Gather data
Finding data, especially data with the labels you need, can be a challenge. Your sources
might vary significantly from one machine learning project to the next. If you find that you
are merging data from different sources, or getting data entry from multiple places, you’ll
need to be extra careful in the next step.
Clean the data
Cleaning data is the process of checking for integrity and consistency. At this stage you
shouldn't be looking at the data overall for patterns. Instead, you clean data by column
(attribute), looking for such anomalies as:
Multiple methods of representing a feature. For example, some instances might list a
length measurement in inches, and others might list it in centimeters. It is crucial
that all instances of a given feature use the same scale and follow the same format.
Features with values far out of the typical range (outliers), which may be data-entry
anomalies or other invalid data.
Significant changes in the data over distances in time, geographic location, or other
recognizable characteristics.
Incorrect labels or poorly defined labeling criteria.
Split the data
You need at least three subsets of data in a supervised learning scenario: training data,
evaluation data, and test data.
Training data is the data that you’ll get acquainted with. You use it to train your model, yes,
but you also analyze it as you develop the model in the first place.
Evaluation data is what you use to check your model’s performance during the regular
training cycle. It is your primary tool in ensuring that your model is generalizable to data
beyond the training set.
Test data is used to test a model that’s close to completion, usually after multiple training
iterations. You should never analyze or scrutinize your test data, instead keeping it fresh
until needed to test your model. That way you can be assured that you aren't making
assumptions based on familiarity with the data that can pollute your training results.
Here are some important things to remember when you split your data:
It is better to randomly sample the subsets from one big dataset than to use some
pre-divided data, such as instances from two distinct date ranges or data-collection
systems. The latter approach has an increased risk of non-uniformity that can lead
to overfitting.
Ideally you should assign instances to a dataset and keep those associations
throughout the process.
Experts disagree about the right proportions for the different datasets. However,
regardless of the specific ratios you should have more training data than evaluation
data, and more evaluation data than test data.
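One common way to perform this three-way split in Python is scikit-learn's train_test_split, applied twice (a minimal sketch with dummy data; scikit-learn is assumed to be installed, and the 70/15/15 ratio is just one reasonable choice):

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for your real features and labels
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve off the test set, then split the remainder into train/evaluation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_eval), len(X_test))   # roughly 700 / 150 / 150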
Engineer features
Before you develop your model, you should get acquainted with your training data. Look
for patterns in your data, and think about what values could influence your target attribute.
This process of deciding which data is important for your model is called feature
engineering.
Feature engineering is not just about deciding which attributes you have in your raw data
that you want in your model. The harder and often more important work is extracting
generalizable, indicative features from specific data. That means combining the data you
have with your knowledge about the problem space to get the data you really need. It can
be a complex process, and doing it right depends on understanding the subject matter and
the goals of your problem. Here are a couple of examples:
Data about people often includes a residential address, which is a complex string, often
hard to make consistent, and not particularly useful for many applications on its own. You
should usually extract a more meaningful feature from it. Here are some examples of things
you could extract from an address:
Another common item of data is a timestamp, which is usually a large numerical value
indicating the amount of time elapsed since a common reference point. Here are some
examples of things you might extract from a precise timestamp:
Here are some important things to note about the examples above:
You can combine multiple attributes to make one generalizable feature. For
example, address and timestamp can get you the position of the sun.
You can use feature engineering to simplify data. For example, timestamp to time of
day takes an attribute with seemingly countless values and reduces it to four
categories.
You can get useful features, and reduce the number of instances in your dataset, by
engineering across instances. For example, use multiple instances to calculate the
frequency of something.
When you're done you'll have a list of features to include when training your model.
One of the most difficult parts of the process is deciding when you have the right set of
features. It's sometimes difficult to know which features are likely to affect your prediction
accuracy. Machine learning experts often stress that it's a field that requires flexibility and
experimentation. You'll never get it perfect the first try, so make your best guess and use
the results to inform your next iteration.
Preprocessing data
So far this page has described generally applicable steps to take when getting your data
ready to train your model. It hasn't mattered up to this point how your data is represented
and formatted. Preprocessing is the next step: getting your prepared data into a format that
works with the tools and techniques you use to train a model.
Cloud ML Engine doesn't get involved in your data format; you can use whatever input
format is convenient for your training application. That said, you'll need to have your input
data in a format that TensorFlow can read. You also need to have your data in a location
that your Cloud ML Engine project can access. The simplest solution is often to use a CSV
file in a Google Cloud Storage bucket that your Google Cloud Platform project has access to.
Some types of data, such as sparse vectors and binary data, can be better represented using
TensorFlow's tf.train.Example format serialized in a TFRecords file.
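As a rough sketch only (assuming TensorFlow 2; the file name and feature values are invented for illustration), one record can be serialized to a TFRecords file like this:
# A minimal sketch: serialize one tf.train.Example into a TFRecords file.
# Assumes TensorFlow 2; 'sample.tfrecords' and the values are illustrative.
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    'petal_length': tf.train.Feature(float_list=tf.train.FloatList(value=[1.4])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
}))
with tf.io.TFRecordWriter('sample.tfrecords') as writer:
    writer.write(example.SerializeToString())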
Transforming data
There are many transformations that might be useful to perform on your raw feature data.
Some of the more common ones are:
Changing raw text strings to a more compact representation, like a bag of words.
Cloud ML Engine doesn't impose specific requirements on your input data, leaving you to
use whatever format works for your training application. Follow TensorFlow's data
reading procedures.
Use the raw data you have to get the data you need.
Split your dataset into training, validation, and test subsets.
Store your data in a location that your Cloud ML Engine project can access—a Cloud
Storage bucket is often the easiest approach.
Transform features to suit the operations you perform on them.
There are also a lot of modules and libraries to choose from, providing multiple ways to do
each task. It can feel overwhelming.
The best way to get started using Python for machine learning is to complete a project.
It will force you to install and start the Python interpreter (at the very least).
It will give you a bird’s eye view of how to step through a small project.
It will give you confidence, maybe to go on to your own small projects.
When you are applying machine learning to your own datasets, you are working on a project.
A machine learning project may not be linear, but it has a number of well known steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
The best way to really come to terms with a new platform or tool is to work through a machine
learning project end-to-end and cover the key steps: namely, loading data, summarizing the
data, evaluating algorithms, and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in
the gaps such as further data preparation and improving result tasks later, once you have more
confidence.
This tutorial uses the iris flowers dataset, which makes a good first project for several reasons:
Attributes are numeric, so you have to figure out how to load and handle data.
It is a classification problem, allowing you to practice with perhaps an easier type of
supervised learning algorithm.
It is a multi-class classification problem (multi-nominal) that may require some
specialized handling.
It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory
(and a screen or A4 page).
All of the numeric attributes are in the same units and the same scale, not requiring
any special scaling or transforms to get started.
Let’s get started with your hello world machine learning project in Python.
Try to type in the commands yourself or copy-and-paste the commands to speed things up.
I do not want to cover installation in great detail, because others already have and it is
already pretty straightforward, especially if you are a developer.
There are 5 key libraries that you will need to install. Below is a list of the Python SciPy
libraries required for this tutorial:
scipy
numpy
matplotlib
pandas
sklearn
There are many ways to install these libraries. My best advice is to pick one method then be
consistent in installing each library.
The scipy installation page provides excellent instructions for installing the above libraries on
multiple platforms, such as Linux, Mac OS X and Windows. If you have any doubts or
questions, refer to that guide; it has been followed by thousands of people.
On Mac OS X, you can use macports to install Python 2.7 and these libraries.
On Linux you can use your package manager, such as yum on Fedora to install RPMs.
If you are on Windows or you are not confident, I would recommend installing the
free version of Anaconda that includes everything you need.
It is a good idea to make sure your Python environment was installed successfully and is
working as expected.
The script below will help you test out your environment. It imports each library required in
this tutorial and prints the version.
I recommend working directly in the interpreter or writing your scripts and running them on
the command line rather than big editors and IDEs. Keep things simple and focus on the
machine learning not the toolchain.
# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
Ideally, your versions should match or be more recent. The APIs do not change quickly, so do
not be too concerned if you are a few versions behind; everything in this tutorial will very likely
still work for you.
If you cannot run the above script cleanly you will not be able to complete this tutorial.
My best advice is to Google search for your error message or post a question on Stack
Exchange.
2. Load The Data
We are going to use the iris flowers dataset. This dataset is famous because it is used as the
“hello world” dataset in machine learning and statistics by pretty much everyone.
The dataset contains 150 observations of iris flowers. There are four columns of measurements
of the flowers in centimeters. The fifth column is the species of the flower observed. All
observed flowers belong to one of three species.
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
Everything should load without error. If you have an error, stop. You need a working
SciPy environment before continuing. See the advice above about setting up your environment.
We are using pandas to load the data. We will also use pandas next to explore the data both
with descriptive statistics and data visualization.
Note that we are specifying the names of each column when loading the data. This will help
later when we explore the data.
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
The dataset should load without incident.
If you do have network problems, you can download the iris.data file into your working
directory and load it using the same method, changing the URL to the local file name.
Now it is time to take a look at the data. In this step we are going to look at the data a
few different ways: the dimensions of the dataset, a peek at the data itself, a statistical
summary of all attributes, and a breakdown of the data by the class variable.
Don’t worry, each look at the data is one command. These are useful commands that you
can use again and again on future projects.
3.1 Dimensions of the Dataset
# shape
print(dataset.shape)
You should see 150 instances and 5 attributes:
(150, 5)
3.2 Peek at the Data
It is also always a good idea to actually eyeball your data.
# head
print(dataset.head(20))
You should see the first 20 rows of the dataset.
3.3 Statistical Summary
Now we can take a look at a summary of each attribute. This includes the count, mean, the
min and max values, as well as some percentiles.
# descriptions
print(dataset.describe())
We can see that all of the numerical values have the same scale (centimeters) and similar
ranges between 0 and 8 centimeters.
Let’s now take a look at the number of instances (rows) that belong to each class. We can
view this as an absolute count.
# class distribution
print(dataset.groupby('class').size())
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
We can see that each class has the same number of instances (50, or 33% of the dataset).
4. Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
1. Univariate plots to better understand each attribute.
2. Multivariate plots to better understand the relationships between attributes.
4.1 Univariate Plots
We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.
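The box and whisker plot code is not reproduced in this copy; a sketch consistent with the rest of the tutorial (reusing dataset and plt from above) would be:
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()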
We can also create a histogram of each input variable to get an idea of the distribution.
# histograms
dataset.hist()
plt.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful
to note as we can use algorithms that can exploit this assumption.
Histogram Plots
4.2 Multivariate Plots
Now we can look at the interactions between the variables. First, let’s look at scatterplots of
all pairs of attributes. This can be helpful to spot structured relationships between input variables.
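The corresponding code is not shown in this copy; using the scatter_matrix import from earlier, it would be along these lines:
# scatter plot matrix
scatter_matrix(dataset)
plt.show()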
Scatterplot Matrix
5. Evaluate Some Algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Later, we will use statistical methods to estimate the accuracy of the models that we create
on unseen data. We also want a more concrete estimate of the accuracy of the best model
on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we
will use this data to get a second and independent idea of how accurate the best model
might actually be.
5.1 Create a Validation Dataset
We will split the loaded dataset into two: 80% of it we will use to train our models, and
20% we will hold back as a validation dataset.
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
You now have training data in X_train and Y_train for preparing models, and X_validation
and Y_validation sets that we can use later.
5.2 Test Harness
We will use 10-fold cross validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all
combinations of train-test splits.
The specific random seed does not matter; any fixed value simply keeps the split and the
results reproducible.
5.3 Build Models
We don’t know which algorithms would be good on this problem or what configurations to
use. We get an idea from the plots that some of the classes are partially linearly separable
in some dimensions, so we are expecting generally good results.
        scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
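Only the tail of the evaluation loop appears above. A sketch of the full spot-check step, consistent with the imports and the X_train, Y_train and seed variables defined earlier, and assuming 10-fold cross-validation with accuracy as the metric, would be:
# Spot Check Algorithms (a sketch; the model list and settings are assumptions based on the text)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))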
We now have 6 models and accuracy estimations for each. We need to compare the models
to each other and select the most accurate.
We can also create a plot of the model evaluation results and compare the spread and the
mean accuracy of each model. There is a population of accuracy measures for each
algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
You can see that the box and whisker plots are squashed at the top of the range, with many
samples achieving 100% accuracy.
6. Make Predictions
The KNN algorithm was the most accurate model that we tested. Now we want to get an
idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable
to keep a validation set just in case you made a slip during training, such as overfitting to
the training set or a data leak. Both will result in an overly optimistic result.
We can run the KNN model directly on the validation set and summarize the results as a
final accuracy score, a confusion matrix and a classification report.
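The prediction code itself is not reproduced in this copy; a sketch that reuses the imports and the training/validation splits from earlier would be:
# Make predictions on the validation dataset (a sketch with default KNN settings)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Output along the lines shown below should be produced.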
0.9

[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

    avg / total       0.90      0.90      0.90        30
1. Business Understanding
Goal
Clearly and explicitly specify the model target(s) as 'sharp' question(s) that are used
to drive the customer engagement.
Clearly specify where to find the data sources of interest. Define the predictive
model target in this step and determine whether we need to bring in ancillary data from
other sources.
How to do it
In this stage, you work with your customer and stakeholder to understand the business
problems that can be greatly enhanced with predictive analytics. A central objective of this
step is to identify the key business variables (sales forecast or the probability of an order
being fraudulent, for example) that the analysis needs to predict (also known as model
targets) to satisfy these requirements. In this stage you also develop an understanding of
the data sources needed to address the objectives of the project from an analytical
perspective. There are two main aspects of this stage - Define Objectives and Identify data
sources.
2. Define the project goals with 'sharp' question(s). A fine description of what a sharp
question is, and how you can ask it, can be found in this article. As per the article,
here is a very useful tip to ask a sharp question - "When choosing your question,
imagine that you are approaching an oracle that can tell you anything in the
universe, as long as the answer is a number or a name". Data science / machine
learning is typically used to answer five types of questions: how much or how many
(regression), which category (classification), which group (clustering), is this unusual
(anomaly detection), and which option should be taken (recommendation).
3. Define the project team, the role and responsibilities. Develop a high level milestone
plan that you iterate upon as more information is discovered.
4. Define success metrics. The metrics must be SMART (Specific, Measurable,
Achievable, Relevant and Time-bound). For example: Achieve customer churn
prediction accuracy of X% by the end of this 3-month project so that we can offer
promotions to reduce churn.
Identify data sources that contain known examples of answers to the sharp questions. Look
for the following:
Data that is Relevant to the question. Do we have measures of the target and
features that are related to the target?
Data that is an Accurate measure of our model target and the features of interest.
It is not uncommon, for example, to find that existing systems need to collect and log
additional kinds of data to address the problem and achieve the project goals. In this case,
you may want to look for external data sources or update your systems to collect newer
data.
Artifacts
Data Sources: This is part of the Data Report that is found in the TDSP project
structure. It describes the sources for the raw data. In later stages you will fill in
additional details like scripts to move the data to your analytic environment.
Data Dictionaries: This document provides the descriptions and the schema (data
types, information on any validation rules) of the data which will be used to answer
the question. If available, the entity-relation diagrams or descriptions are included
too.
2. Data Acquisition and Understanding
Goal
Produce a clean, high-quality data set whose relationship to the target variables is
understood, and locate it in the analytics environment where you are ready to model.
How to do it
In this stage, you will start developing the process to move the data from the source
location to the target locations where the analytics operations like training and predictions
(also known as scoring) will be run. For technical details and options on how to do this on
various Azure data services, see Load data into storage environments for analytics.
Before you train your models, you need to develop a deep understanding about the data.
Real world data is often messy with incomplete or incorrect data. By data summarization
and visualization of the data, you can quickly identify the quality of your data and inform
how to deal with the data quality. For guidance on cleaning the data, see this article.
Data visualization can be particularly useful to answer questions like - Have we measured
the features consistently enough for them to be useful or are there a lot of missing values in
the data? Has the data been consistently collected over the time period of interest or are
there blocks of missing observations? If the data does not pass this quality check, we may
need to go back to the previous step to correct or get more data.
Otherwise, you can start to better understand the inherent patterns in the data that will
help you develop a sound predictive model for your target. Specifically, you look for
evidence of how well connected the data is to the target and whether the data is large
enough to move forward with the next steps. As we determine whether the data is connected or whether we
have enough data, we may need to find new data sources with more accurate or more
relevant data to complete the data set initially identified in the previous stage. TDSP also
provides an automated utility called IDEAR to help visualize the data and prepare data
summary reports. We recommend starting with IDEAR to explore the data and develop an
initial understanding interactively with no coding, and then writing custom code for data
exploration and visualization as needed.
In addition to the initial ingestion of data, you will typically need to set up a process to score
new data or refresh the data regularly as part of an ongoing learning process. This can be
done by setting up a data pipeline or workflow. Here is an example of how to setup a
pipeline with Azure Data Factory. A solution architecture of the data pipeline is developed
in this stage. The pipeline is developed in parallel in the following stages of the data science
project. The pipeline may be batch-based, streaming/real-time, or a hybrid, depending
on your business need and the constraints of the existing systems into which this solution
is being integrated.
Artifacts
Data Quality Report: This report contains data summaries, relationships between
each attribute and the target, variable ranking, and so on. The IDEAR tool provided as part of
TDSP can help with quickly generating this report on any tabular dataset, such as a
CSV file or relational table.
Solution Architecture: This can be a diagram and/or description of your data
pipeline used to run scoring or predictions on new data once you have built a model.
It will also contain the pipeline to retrain your model based on new data. The
document is stored in this directory when using the TDSP directory structure
template.
Checkpoint Decision: Before we begin to do the full feature engineering and model
building process, we can reevaluate the project to determine value in continuing this effort.
We may be ready to proceed, need to collect more data, or it’s possible the data does not
exist to answer the question.
3. Modeling
Goal
Develop new attributes or data features (also known as feature engineering), for
building the machine learning model.
Construct and evaluate an informative model to predict the target.
Determine if we have a model that is suitable for production use
How to do it
There are two main aspects in this stage - Feature Engineering and Model Training. They
are described in the following sub-sections.
Depending on the type of question you are trying to answer, there are multiple modeling
algorithm options available. For guidance on choosing the algorithms, see this article.
NOTE: Though this article is written for Azure Machine Learning, it should be generally
useful even when using other frameworks.
The input data for modeling is usually split randomly into a training data set and a
test data set.
The models are built using the training data set.
Evaluate (training and test dataset) a series of competing machine learning
algorithms along with the various associated tuning parameters (also known as
parameter sweep) that are geared toward answering the question of interest with
the data we currently have at hand.
Determine the “best” solution to answer the question by comparing the success
metric between alternative methods.
[NOTE] Avoid leakage: Leakage is caused by including variables that can perfectly predict
the target. These are usually variables that may have been used to detect the target initially.
As the target is redefined, these dependencies can be hidden from the original definition.
To avoid this often requires iterating between building an analysis data set, and creating a
model and evaluating the accuracy. Leakage is a major reason data scientists get nervous
when they get really good predictive results.
We provide an Automated Modeling and Reporting tool with TDSP that is able to run
through multiple algorithms and parameter sweeps to produce a baseline model. It also
produces a baseline modeling report summarizing the performance of each model and
parameter combination, including variable importance. This can drive further
feature engineering.
Artifacts
Feature Sets: The features developed for the modeling are described in the
Feature Set section of the Data Definition report. It contains pointers to the code used to
generate the features and a description of how each feature was generated.
Modeling Report: For each model that is tried, a standard report following a
specified TDSP template is produced.
Checkpoint Decision: Evaluate whether the model performs well enough for production.
Some key questions to ask are:
Does the model answer the question sufficiently given the test data?
Should we go back and collect more data or do more feature engineering or try
other algorithms?
4. Deployment
Goal
Deploy models and pipeline to a production or production-like environment for final user
acceptance.
How to do it
Once we have a set of models that perform well, they can be operationalized for other
applications to consume. Depending on the business requirements, predictions are made
either in real time or on a batch basis. To be operationalized, the models have to be
exposed with an open API interface that is easily consumed from various applications, such
as online websites, spreadsheets, dashboards, or line-of-business and backend applications. See an
example of model operationalization with Azure Machine Learning web service in this
article. It is also a good idea to build in telemetry and monitoring of the production model
deployment and the data pipeline to help with system status reporting and
troubleshooting.
Artifacts
5. Customer Acceptance
Goal
To finalize the project deliverables by confirming the pipeline, the model, and their
deployment in a production environment.
How to do it
The customer validates that the system meets their business needs and answers the
questions with acceptable accuracy, so that the system can be deployed to production for use
by their client application. All the documentation is finalized and reviewed. The project is
handed off to the entity responsible for operations. This could be an IT or data science
team at the customer, or an agent of the customer, that will be responsible for running the
system in production.
Artifacts
The main artifact produced in this final stage is the Project Final Report. This is the
project's technical report, containing all details of the project that are useful for learning about
and operating the system. A template is provided by TDSP that can be used as is or customized for specific
client needs.
Summary
We have seen the Team Data Science Process lifecycle, which is modeled as a sequence of
iterated steps that provide guidance on the tasks needed to build predictive models that can
be deployed in a production environment and leveraged to build intelligent applications.
The goal of this process lifecycle is to continue to move a data science project forward
towards a clear engagement end point. While it is true that data science is an exercise in
research and discovery, being able to clearly communicate this to customers using a well
defined set of artifacts in a standardized template can help avoid misunderstanding and
increase the odds of success.
Gathering Data
Once we have our equipment and booze, it’s time for our first real step of machine learning:
gathering data. This step is very important because the quality and quantity of data that
you gather will directly determine how good your predictive model can be. In this case, the
data we collect will be the color and the alcohol content of each drink.
This will yield a table of color, alcohol%, and whether it’s beer or wine. This will be our
training data.
Data preparation
A few hours of measurements later, we have gathered our training data. Now it’s time for
the next step of machine learning: Data preparation, where we load our data into a
suitable place and prepare it for use in our machine learning training.
We’ll first put all our data together, and then randomize the ordering. We don’t want the
order of our data to affect what we learn, since that’s not part of determining whether a
drink is beer or wine. In other words, we make a determination of what a drink is,
independent of what drink came before or after it.
We’ll also need to split the data in two parts. The first part, used in training our model, will
be the majority of the dataset. The second part will be used for evaluating our trained
model’s performance. We don’t want to use the same data that the model was trained on
for evaluation, since it could then just memorize the “questions”, just as you wouldn’t use
the same questions from your math homework on the exam.
Sometimes the data we collect needs other forms of adjusting and manipulation. Things
like de-duping, normalization, error correction, and more. These would all happen at the
data preparation step. In our case, we don’t have any further data preparation needs, so
let’s move forward.
Choosing a model
The next step in our workflow is choosing a model. There are many models that
researchers and data scientists have created over the years. Some are very well suited for
image data, others for sequences (like text, or music), some for numerical data, others for
text-based data. In our case, since we only have 2 features, color and alcohol%, we can use
a small linear model, which is a fairly simple one that should get the job done.
Training
Now we move onto what is often considered the bulk of machine learning — the training.
In this step, we will use our data to incrementally improve our model’s ability to predict
whether a given drink is wine or beer.
We will do this on a much smaller scale with our drinks. In particular, the formula for a
straight line is y=m*x+b, where x is the input, m is the slope of that line, b is the y-intercept,
and y is the value of the line at the position x. The values we have available to us for
adjusting, or “training”, are m and b. There is no other way to affect the position of the line,
since the only other variables are x, our input, and y, our output.
In machine learning, there are many m’s since there may be many features. The collection
of these m values is usually formed into a matrix, that we will denote W, for the “weights”
matrix. Similarly for b, we arrange them together and call that the biases.
The training process involves initializing some random values for W and b and attempting
to predict the output with those values. As you might imagine, it does pretty poorly. But we
can compare our model’s predictions with the output it should have produced, and adjust
the values in W and b so that we get more correct predictions.
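As a toy sketch of that update loop (the data, learning rate, and step count below are invented purely for illustration), gradient-style training of m and b against a squared-error loss might look like this:
# A toy sketch: iteratively adjust m and b for y = m*x + b using a squared-error loss.
# The data, learning rate, and number of steps are illustrative assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])   # roughly y = 2x + 1

m, b = 0.0, 0.0                       # starting values; a real run might start randomly
learning_rate = 0.01

for step in range(1000):
    y_pred = m * x + b
    error = y_pred - y
    grad_m = 2 * np.mean(error * x)   # gradient of the mean squared error w.r.t. m
    grad_b = 2 * np.mean(error)       # gradient w.r.t. b
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(m, b)                           # should approach roughly 2 and 1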
This process then repeats. Each iteration or cycle of updating the weights and biases is
called one training “step”.
Let’s look at what that means in this case, more concretely, for our dataset. When we first
start the training, it’s like we drew a random line through the data. Then as each step of the
training progresses, the line moves, step by step, closer to an ideal separation of the wine
and beer.
Evaluation
Once training is complete, it’s time to see if the model is any good, using Evaluation. This is
where that dataset that we set aside earlier comes into play. Evaluation allows us to test
our model against data that has never been used for training. This metric allows us to see
how the model might perform against data that it has not yet seen. This is meant to be
representative of how the model might perform in the real world.
A good rule of thumb I use is a training-evaluation split somewhere on the order of 80/20
or 70/30. Much of this depends on the size of the original source dataset. If you have a lot of
data, perhaps you don’t need as big of a fraction for the evaluation dataset.
Parameter Tuning
Once you’ve done evaluation, it’s possible that you want to see if you can further improve
your training in any way. We can do this by tuning our parameters. There were a few
parameters we implicitly assumed when we did our training, and now is a good time to go
back and test those assumptions and try other values.
One example is how many times we run through the training dataset during training. What
I mean by that is we can “show” the model our full dataset multiple times, rather than just
once. This can sometimes lead to higher accuracies.
Estimating Sales
Linear Regression finds great use in business for sales forecasting based on trends. If a
company observes a steady increase in sales every month, a linear regression analysis of the
monthly sales data helps the company forecast sales in upcoming months.
Risk Assessment
Linear Regression helps assess risk involved in insurance or financial domain. A health
insurance company can do a linear regression analysis on the number of claims per
customer against age. This analysis helps insurance companies find that older customers
tend to make more insurance claims. Such analysis results play a vital role in important
business decisions, as they account for risk.
Data Science Libraries in Python to implement Linear Regression – statsmodels and scikit-learn
Data Science Libraries in R to implement Linear Regression – stats
Linear regression requires a linear model. No surprise, right? But what does that really
mean?
A model is linear when each term is either a constant or the product of a parameter and a
predictor variable. A linear equation is constructed by adding the results for each term,
which constrains the equation to just one basic form:
Y = b0 + b1X1 + b2X2 + ... + bkXk
In statistics, a regression equation (or function) is linear when it is linear in the parameters.
While the equation must be linear in the parameters, you can transform the predictor
variables in ways that produce curvature. For instance, you can include a squared variable
to produce a U-shaped curve.
Y = b0 + b1X1 + b2X1²
This model is still linear in the parameters even though the predictor variable is squared.
You can also use log and inverse functional forms that are linear in the parameters to
produce different types of curves.
Example: a linear regression model can use a squared term to fit the curved relationship
between BMI and body fat percentage. Nonlinear models, by contrast, are typically fit by a
method of successive approximations.
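A minimal sketch of fitting such a squared-term (but still linear-in-the-parameters) model with scikit-learn, using synthetic data in place of the BMI example, could be:
# A minimal sketch: linear regression with a squared term, Y = b0 + b1*X + b2*X^2.
# The synthetic data stands in for the BMI vs. body fat example.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x - 0.15 * x**2 + np.random.normal(scale=0.2, size=x.shape)

X = np.column_stack([x, x**2])        # the model stays linear in b0, b1, b2
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of b0, then b1 and b2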
While a linear equation has one basic form, nonlinear equations can take many different
forms. The easiest way to determine whether an equation is nonlinear is to focus on the
term “nonlinear” itself. Literally, it’s not linear. If the equation doesn’t meet the criteria
above for a linear equation, it’s nonlinear.
That covers many different forms, which is why nonlinear regression provides the most
flexible curve-fitting functionality. Here are several examples from Minitab’s nonlinear
function catalog. Thetas represent the parameters and X represents the predictor in the
nonlinear functions. Unlike linear regression, these functions can have more than one
parameter per predictor variable.
(Table omitted: each nonlinear function in the catalog is paired with one possible shape of its curve.)
Linear and nonlinear regression are actually named after the functional form of the models
that each analysis accepts. I hope the distinction between linear and nonlinear equations is
clearer and that you understand how it’s possible for linear regression to model curves! It
also explains why you’ll see R-squared displayed for some curvilinear models even though
it’s impossible to calculate R-squared for nonlinear regression.
Lasso Regression:-
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where
data values are shrunk towards a central point, like the mean. The lasso procedure
encourages simple, sparse models (i.e. models with fewer parameters). This particular type
of regression is well-suited for models showing high levels of multicollinearity or when you
want to automate certain parts of model selection, like variable selection / parameter
elimination.
The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a
regression analysis method that performs both variable selection and regularization in order
to enhance the prediction accuracy and interpretability of the statistical model it produces.
Lasso was originally formulated for least squares models and this simple case reveals a
substantial amount about the behavior of the estimator, including its relationship to
ridge regression and best subset selection and the connections between lasso
coefficient estimates and so-called soft thresholding. It also reveals that (like standard
linear regression) the coefficient estimates need not be unique if covariates are
collinear.
Lasso’s ability to perform subset selection relies on the form of the constraint and has a
variety of interpretations including in terms of geometry, Bayesian statistics, and
convex analysis.
L1 Regularization
Lasso regression performs L1 regularization, which adds a penalty equal to the absolute
value of the magnitude of coefficients. This type of regularization can result in sparse
models with few coefficients; some coefficients can become zero and be eliminated from the
model. Larger penalties result in coefficient values closer to zero, which is ideal for
producing simpler models. On the other hand, L2 regularization (e.g. Ridge regression)
doesn’t result in elimination of coefficients or sparse models. This makes the Lasso far
easier to interpret than the Ridge.
Lasso solutions are quadratic programming problems, which are best solved with software
(like MATLAB). The goal of the algorithm is to minimize
Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} |β_j|,
which is the same as minimizing the sum of squares subject to the constraint Σ_{j=1}^{p} |β_j| ≤ s. Some of the βs
are shrunk to exactly zero, resulting in a regression model that’s easier to interpret.
A tuning parameter, λ, controls the strength of the L1 penalty. λ is basically the amount of
shrinkage:
When λ = 0, no parameters are eliminated. The estimate is equal to the one found
with linear regression.
As λ increases, more and more coefficients are set to zero and eliminated
(theoretically, when λ = ∞, all coefficients are eliminated).
As λ increases, bias increases.
As λ decreases, variance increases.
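In practice you rarely solve the quadratic program by hand; a minimal scikit-learn sketch (the alpha value and synthetic data are illustrative, with alpha playing the role of λ) is:
# A minimal sketch: lasso (L1-penalized) regression with scikit-learn.
# alpha plays the role of λ; larger values shrink more coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most coefficients should come out exactly zero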
Ridge regression is a way to create a parsimonious model when the number of predictor
variables in a set exceeds the number of observations, or when a data set has
multicollinearity (correlations between predictor variables).
Tikhonov regularization is basically the same as ridge regression, except that Tikhonov's
method covers a more general set of problems. It can produce solutions even when your data
set contains a lot of statistical noise (unexplained variation in a sample).
Ridge regression avoids these problems. It works in part because it doesn’t require
unbiased estimators; ridge regression adds just enough bias to make the estimates
reasonably reliable approximations to the true population values.
Shrinkage
Ridge regression uses a type of shrinkage estimator called a ridge estimator. Shrinkage
estimators theoretically produce new estimators that are shrunk closer to the “true”
population parameters. The ridge estimator is especially good at improving the least-
squares estimate when multicollinearity is present.
Regularization
Ridge regression belongs to a class of regression tools that use L2 regularization. The other
type of regularization, L1 regularization, limits the size of the coefficients by adding an L1
penalty equal to the absolute value of the magnitude of coefficients. This sometimes results
in the elimination of some coefficients altogether, which can yield sparse models. L2
regularization adds an L2 penalty, which equals the square of the magnitude of
coefficients. All coefficients are shrunk by the same factor (so none are eliminated). Unlike
L1 regularization, L2 will not result in sparse models.
A tuning parameter (λ) controls the strength of the penalty term. When λ = 0, ridge
regression equals least squares regression. If λ = ∞, all coefficients are shrunk to zero. The
ideal penalty is therefore somewhere in between 0 and ∞.
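A minimal scikit-learn sketch of the same idea (alpha corresponds to λ; the collinear synthetic data is illustrative) is:
# A minimal sketch: ridge (L2-penalized) regression with scikit-learn.
# alpha corresponds to λ; alpha = 0 reduces to ordinary least squares.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=50)   # deliberately collinear columns
y = X[:, 0] + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # coefficients are shrunk, but none are exactly zero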
On Mathematics
If X is a centered and scaled matrix, the cross-product matrix (X'X) is nearly singular when
the X columns are highly correlated. Ridge regression adds a ridge parameter (k) times the
identity matrix to the cross-product matrix, forming a new matrix (X'X + kI). It’s called
ridge regression because the diagonal of ones in the correlation matrix can be described as
a ridge. The new formula used to find the coefficients is:
β̂ = (X'X + kI)⁻¹ X'Y
Choosing a value for k is not a simple task, which is perhaps one major reason why ridge
regression isn’t used as much as least squares or logistic regression. You can read one way
to find k in Dorugade and D. N. Kashid’s paper, Alternative Method for Choosing Ridge
Parameter for Regression.
Note:-
It shrinks the parameters, therefore it is mostly used to prevent multicollinearity.
It reduces the model complexity by coefficient shrinkage.
It uses L2 regularization technique.
The ridge estimator is
β̂_ridge = (X'X + λI_p)⁻¹ X'Y,
which minimizes the penalized residual sum of squares
Σ_{i=1}^{n} (y_i − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} β_j²
over the coefficients β_j of the linear model. In this case, instead of just minimizing the
residual sum of squares, we also have a penalty term on the β's. This penalty term is λ (a
pre-chosen constant) times the squared norm of the β vector. This means that if the β_j's
take on large values, the optimization function is penalized; we would prefer smaller β_j's,
or β_j's that are close to zero, to keep the penalty term small.
Geometrically, the ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse
has smaller RSS, and RSS is minimized at the ordinary least squares (OLS) estimates.
For p = 2, the constraint in ridge regression corresponds to a circle, Σ_{j=1}^{p} β_j² < c.
We are trying to minimize the ellipse size and the circle simultaneously in ridge regression;
the ridge estimate is given by the point at which the ellipse and the circle touch.
Gradient descent is a first-order iterative optimization algorithm for finding the minimum
of a function. To find a local minimum of a function using gradient descent, one takes steps
proportional to the negative of the gradient (or approximate gradient) of the function at
the current point. If instead one takes steps proportional to the positive of the gradient, one
approaches a local maximum of that function; the procedure is then known as gradient
ascent.
Gradient descent is also known as steepest descent. However, gradient descent should not
be confused with the method of steepest descent for approximating integrals.
Gradient Descent
Gradient descent is best used when the parameters cannot be calculated analytically (e.g.
using linear algebra) and must be searched for by an optimization algorithm.
Suppose you are at the top of a mountain, and you have to reach a lake which is at the
lowest point of the mountain (a.k.a valley). A twist is that you are blindfolded and you have
zero visibility to see where you are headed. So, what approach will you take to reach the
lake?
The best way is to check the ground near you and observe where the land tends to descend.
This will give an idea in what direction you should take your first step. If you follow the
descending path, it is very likely you would reach the lake.
Suppose we want to find out the best parameters (θ1) and (θ2) for our learning algorithm.
Similar to the analogy above, we find similar mountains and valleys when we plot
our “cost space”. Cost space is nothing but how our algorithm would perform when we
choose a particular value for a parameter.
So on the y-axis, we have the cost J(θ) against our parameters θ1 and θ2 on x-axis and z-
axis respectively. Here, hills are represented by red region, which have high cost, and
valleys are represented by blue region, which have low cost.
Now there are many types of gradient descent algorithms. They can be classified mainly by
how much data is used to compute each gradient update: in full batch gradient descent
algorithms, you use the whole dataset at once to compute the gradient, whereas in stochastic
gradient descent you take a sample while computing the gradient.
When applying gradient descent, you can look at these points which might be helpful in
circumventing the problem:
Error rates – You should check the training and testing error after specific
iterations and make sure both of them decrease. If that is not the case, there might
be a problem!
Gradient flow in hidden layers – Check if the network doesn’t show a vanishing
gradient problem or exploding gradient problem.
Learning rate – which you should check when using adaptive techniques.
There are three main variants of the gradient descent algorithm: Batch Gradient Descent,
Mini-batch Gradient Descent, and Stochastic Gradient Descent.
Batch Gradient Descent, also called vanilla gradient descent, calculates the error for each
example within the training dataset, but only after all training examples have been
evaluated, the model gets updated. This whole process is like a cycle and called a training
epoch.
Advantages of it are that it’s computationally efficient and that it produces a stable error gradient and
a stable convergence. Batch Gradient Descent has the disadvantage that the stable error
gradient can sometimes result in a state of convergence that isn’t the best the model can
achieve. It also requires that the entire training dataset is in memory and available to the
algorithm.
Stochastic gradient descent (SGD), by contrast, does this for each training example within
the dataset. This means that it updates the parameters for each training example, one by
one. This can make SGD faster than Batch Gradient Descent, depending on the problem.
One advantage is that the frequent updates allow us to have a pretty detailed rate of
improvement.
The downside is that the frequent updates are more computationally expensive than the approach
of Batch Gradient Descent. The frequency of those updates can also result in noisy
gradients, which may cause the error rate to jump around instead of slowly decreasing.
Mini-batch Gradient Descent is the go-to method since it’s a combination of the concepts of
SGD and Batch Gradient Descent. It simply splits the training dataset into small batches and
performs an update for each of these batches. Therefore it creates a balance between the
robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like for any other machine
learning techniques, there is no clear rule, because they can vary for different applications.
Note that it is the go-to algorithm when you are training a neural network and it is the most
common type of gradient descent within deep learning.
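A toy sketch of the mini-batch idea (the batch size, learning rate, and synthetic data are all illustrative assumptions) is:
# A toy sketch: mini-batch gradient descent for a least-squares linear model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
learning_rate = 0.05
batch_size = 64

for epoch in range(20):
    indices = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        error = X[batch] @ w - y[batch]
        grad = 2 * X[batch].T @ error / len(batch) # gradient of the mean squared error
        w -= learning_rate * grad

print(w)   # should approach [1.5, -2.0, 0.5]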
Classification models
Logistic Regression
The name of this algorithm could be a little confusing in the sense that Logistic Regression
machine learning algorithm is for classification tasks and not regression problems. The
name ‘Regression’ here implies that a linear model is fit into the feature space. This
algorithm applies a logistic function to a linear combination of features to predict the
outcome of a categorical dependent variable based on predictor variables.
The odds or probabilities that describe the outcome of a single trial are modelled as a
function of explanatory variables. Logistic regression algorithms help estimate the
probability of falling into a specific level of the categorical dependent variable based on the
given predictor variables.
Just suppose that you want to predict if there will be a snowfall tomorrow in New York.
Here the outcome of the prediction is not a continuous number because there will either be
snowfall or no snowfall and hence linear regression cannot be applied. Here the outcome
variable is one of the several categories and using logistic regression helps.
Based on the nature of the categorical response, logistic regression is classified into 3 types –
· Binary Logistic Regression – The most commonly used logistic regression, when the
categorical response has 2 possible outcomes, i.e. either yes or no. Example – predicting
whether a student will pass or fail an exam, predicting whether a student will have low or
high blood pressure, predicting whether a tumour is cancerous or not.
· Multinomial Logistic Regression - Categorical response has 3 or more possible
outcomes with no ordering. Example- Predicting what kind of search engine (Yahoo, Bing,
Google, and MSN) is used by majority of US citizens.
· Ordinal Logistic Regression - Categorical response has 3 or more possible outcomes
with natural ordering. Example- How a customer rates the service and quality of food at a
restaurant based on a scale of 1 to 10.
Let us consider a simple example where a cake manufacturer wants to find out if baking a
cake at 160°C, 180°C and 200°C will produce a 'hard' or 'soft' variety of cake (assuming the
fact that the bakery sells both varieties of cake with different names and prices).
Logistic regression is a perfect fit in this scenario instead of other statistical techniques. For
example, suppose the manufacturer produces 2 cake batches, where the first batch contains 20
cakes (of which 7 were hard and 13 were soft) and the second batch consists of 80 cakes
(of which 41 were hard and 39 were soft). Here, if a linear regression algorithm were used,
it would give equal importance to both batches of cakes regardless of the number of cakes
in each batch. Applying a logistic regression algorithm will consider this factor and give the
second batch of cakes more weight than the first batch.
When to Use Logistic Regression Machine Learning Algorithm
Use logistic regression algorithms when there is a requirement to model the
probabilities of the response variable as a function of some other explanatory
variable. For example, probability of buying a product X as a function of gender
Use logistic regression algorithms when there is a need to predict probabilities that
categorical dependent variable will fall into two categories of the binary response as
a function of some explanatory variables. For example, what is the probability that a
customer will buy a perfume given that the customer is a female?
Logistic regression algorithms are also best suited when the need is to classify
elements into two categories based on the explanatory variable. For example, classify
females into a ‘young’ or ‘old’ group based on their age.
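A minimal scikit-learn sketch along these lines (the tiny perfume-purchase dataset is made up purely for illustration) is:
# A minimal sketch: binary logistic regression predicting a purchase from gender.
import numpy as np
from sklearn.linear_model import LogisticRegression

# feature: 1 = female, 0 = male; label: 1 = bought the perfume, 0 = did not
X = np.array([[1], [1], [1], [0], [0], [1], [0], [0]])
y = np.array([1, 1, 0, 0, 0, 1, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[1]]))   # estimated class probabilities for a female customer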
Advantages of Using Logistic Regression
Easier to inspect and less complex.
Robust algorithm as the independent variables need not have equal variance or
normal distribution.
These algorithms do not assume a linear relationship between the dependent and
independent variables and hence can also handle non-linear effects.
Controls confounding and tests interaction.
Drawbacks of Using Logistic Regression
When the training data is sparse and high dimensional, a logistic model may
overfit the training data.
Logistic regression algorithms cannot predict continuous outcomes. For instance,
logistic regression cannot be applied when the goal is to determine how heavily it
will rain because the scale of measuring rainfall is continuous. Data scientists can
predict heavy or low rainfall but this would make some compromises with the
precision of the dataset.
Logistic regression algorithms require more data to achieve stability and meaningful
results. These algorithms require a minimum of 50 data points per predictor to achieve
stable outcomes.
It predicts outcomes depending on a group of independent variables, and if a data
scientist or a machine learning expert goes wrong in identifying the independent
variables, then the developed model will have minimal or no predictive value.
It is not robust to outliers and missing values.
You are making a weekend plan to visit the best restaurant in town as your parents are
visiting but you are hesitant in making a decision on which restaurant to choose. Whenever
you want to visit a restaurant you ask your friend Tyrion if he thinks you will like a
particular place. To answer your question, Tyrion first has to find out, the kind of
restaurants you like. You give him a list of restaurants that you have visited and tell him
whether you liked each restaurant or not (giving a labelled training dataset). When you ask
Tyrion whether you will like a particular restaurant R or not, he asks you various
questions like “Is R a rooftop restaurant?”, “Does restaurant R serve Italian cuisine?”,
“Does R have live music?”, “Is restaurant R open till midnight?” and so on. Tyrion asks you
several informative questions to maximize the information gain and gives you YES or NO
answer based on your answers to the questionnaire. Here Tyrion is a decision tree for your
favourite restaurant preferences.
A decision tree is a graphical representation that makes use of branching methodology to
exemplify all possible outcomes of a decision, based on certain conditions. In a decision
tree, the internal node represents a test on the attribute, each branch of the tree represents
the outcome of the test and the leaf node represents a particular class label i.e. the decision
made after computing all of the attributes. The classification rules are represented through
the path from root to the leaf node.
Types of Decision Trees
Classification Trees- These are considered as the default kind of decision trees used to
separate a dataset into different classes, based on the response variable. These are
generally used when the response variable is categorical in nature.
Regression Trees-When the response or target variable is continuous or numerical,
regression trees are used. These are generally used in predictive type of problems when
compared to classification.
Decision trees can also be classified into two types, based on the type of target variable-
Continuous Variable Decision Trees and Binary Variable Decision Trees. It is the target
variable that helps decide what kind of decision tree would be required for a particular
problem.
Decision tree machine learning algorithms help a data scientist capture the idea that
if a different decision were taken, the operational nature of the situation or model
would have changed.
Decision tree algorithms help make optimal decisions by allowing a data scientist to
traverse through forward and backward calculation paths.
When to use Decision Tree Machine Learning Algorithm
Decision trees are robust to errors; if the training data contains errors, decision
tree algorithms are well suited to address such problems.
Decision trees are best suited for problems where instances are represented by
attribute value pairs.
If the training data has missing value then decision trees can be used, as they can
handle missing values nicely by looking at the data in other columns.
Decision trees are best suited when the target function has discrete output values.
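A minimal scikit-learn sketch of a classification tree (reusing the iris data introduced earlier in this material; the max_depth setting is only illustrative) is:
# A minimal sketch: a classification tree on the iris data, printed as if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))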
Advantages of Using Decision Tree Machine Learning Algorithms
Decision trees are very intuitive and can be explained to anyone with ease. People
from a non-technical background, can also decipher the hypothesis drawn from a
decision tree, as they are self-explanatory.
When using decision tree machine learning algorithms, data type is not a constraint
as they can handle both categorical and numerical variables.
Decision tree machine learning algorithms do not require making any assumption on
the linearity in the data and hence can be used in circumstances where the
parameters are non-linearly related. These machine learning algorithms do not make
any assumptions on the classifier structure and space distribution.
These algorithms are useful in data exploration. Decision trees implicitly perform
feature selection which is very important in predictive analytics. When a decision
tree is fit to a training dataset, the nodes at the top on which the decision tree is split,
are considered as important variables within a given dataset and feature selection is
completed by default.
Decision trees help save data preparation time, as they are not sensitive to missing
values and outliers. Missing values will not stop you from splitting the data for
building a decision tree. Outliers will also not affect the decision trees as data
splitting happens based on some samples within the split range and not on exact
absolute values.
Drawbacks of Using Decision Tree Machine Learning Algorithms
The greater the number of decisions in a tree, the lower the accuracy of any expected
outcome.
A major drawback of decision tree machine learning algorithms, is that the outcomes
may be based on expectations. When decisions are made in real-time, the payoffs and
resulting outcomes might not be the same as expected or planned. There are chances
that this could lead to unrealistic decision trees leading to bad decision making. Any
irrational expectations could lead to major errors and flaws in decision tree analysis,
as it is not always possible to plan for all eventualities that can arise from a decision.
Decision Trees do not fit well for continuous variables and result in instability and
classification plateaus.
Decision trees are easy to use when compared to other decision making models but
creating large decision trees that contain several branches is a complex and time
consuming task.
Decision tree machine learning algorithms consider only one attribute at a time and
might not be best suited for actual data in the decision space.
Large sized decision trees with multiple branches are not comprehensible and pose
several presentation difficulties.
Applications of Decision Tree Machine Learning Algorithm
Decision trees are among the popular machine learning algorithms that find great
use in finance for option pricing.
Remote sensing is an application area for pattern recognition based on decision
trees.
Decision tree algorithms are used by banks to classify loan applicants by their
probability of defaulting payments.
Gerber Products, a popular baby product company, used decision tree machine
learning algorithm to decide whether they should continue using the plastic PVC
(Poly Vinyl Chloride) in their products.
Rush University Medical Centre has developed a tool named Guardian that uses a
decision tree machine learning algorithm to identify at-risk patients and disease
trends.
The Data Science libraries in the Python language to implement the Decision Tree Machine
Learning Algorithm are SciPy and scikit-learn.
The Data Science libraries in R language to implement Decision Tree Machine
Learning Algorithm is caret.
Random Forest Machine Learning Algorithm
To understand how the Random Forest Machine Learning Algorithm works, imagine that Tyrion is a decision tree for your
restaurant preferences. However, Tyrion, being a human being, does not always generalize
your restaurant preferences accurately. To get a more accurate restaurant
recommendation, you ask a couple of your friends and decide to visit the restaurant R if
most of them say that you will like it. Instead of just asking Tyrion, you would like to ask
Jon Snow, Sandor, Bronn and Bran who vote on whether you will like the restaurant R or
not. This implies that you have built an ensemble classifier of decision trees - also known as
a forest.
You don’t want all your friends to give you the same answer - so you provide each of
your friends with slightly varying data. You are also not sure of your restaurant preferences
and are in a dilemma. You told Tyrion that you like Open Roof Top restaurants but maybe,
just because it was summer when you visited the restaurant you could have liked it then.
You may not be a fan of the restaurant during the chilly winters. Thus, all your friends
should not make use of the data point that you like open roof top restaurants, to make their
recommendations for your restaurant preferences.
By providing your friends with slightly different data on your restaurant
preferences, you make your friends ask you different questions at different times. In this
case just by slightly altering your restaurant preferences, you are injecting randomness at
model level (unlike randomness at data level in case of decision trees). Your group of
friends now form a random forest of your restaurant preferences.
Random Forest is the go-to machine learning algorithm that uses a bagging
approach to create a bunch of decision trees with random subsets of the data. A model is
trained several times on random samples of the dataset to achieve good prediction
performance from the random forest algorithm. In this ensemble learning method, the
outputs of all the decision trees in the random forest are combined to make the final
prediction. The final prediction of the random forest algorithm is derived by polling the
results of each decision tree, or simply by going with the prediction that appears most
often across the decision trees.
For instance, in the above example - if 5 friends decide that you will like restaurant
R but only 2 friends decide that you will not like the restaurant then the final prediction is
that, you will like restaurant R as majority always wins.
Why use Random Forest Machine Learning Algorithm?
There are many good open source, free implementations of the algorithm available in
Python and R.
It maintains accuracy when there is missing data and is also resistant to outliers.
Simple to use as the basic random forest algorithm can be implemented with just a
few lines of code.
Random Forest machine learning algorithms help data scientists save data
preparation time, as they do not require much input preparation. They also perform
implicit feature selection, giving estimates of which variables are important in the
classification.
Advantages of Using Random Forest Machine Learning Algorithms
Overfitting is less of an issue with Random Forests, unlike decision tree machine
learning algorithms. There is no need of pruning the random forest.
These algorithms are fast but not in all cases. A random forest algorithm, when run
on an 800 MHz machine with a dataset of 100 variables and 50,000 cases produced
100 decision trees in 11 minutes.
Random Forest is one of the most effective and versatile machine learning algorithms
for a wide variety of classification and regression tasks, as it is more robust to
noise.
It is difficult to build a bad random forest. In the implementation of Random Forest
Machine Learning algorithms, it is easy to determine which parameters to use
because they are not sensitive to the parameters that are used to run the algorithm.
One can easily build a decent model without much tuning.
Random Forest machine learning algorithms can be grown in parallel.
This algorithm runs efficiently on large databases.
Has higher classification accuracy.
Drawbacks of Using Random Forest Machine Learning Algorithms
They might be easy to use, but analysing them theoretically is difficult.
Large number of decision trees in the random forest can slow down the algorithm in
making real-time predictions.
If the data consists of categorical variables with different numbers of levels, the
algorithm gets biased in favour of those attributes that have more levels. In such
situations, variable importance scores do not seem reliable.
When the Random Forest algorithm is used for regression tasks, it does not predict beyond
the range of the response values in the training data.
Applications of Random Forest Machine Learning Algorithms
Random Forest algorithms are used by banks to predict if a loan applicant is a likely
high risk.
They are used in the automobile industry to predict the failure or breakdown of a
mechanical part.
These algorithms are used in the healthcare industry to predict if a patient is likely to
develop a chronic disease or not.
They can also be used for regression tasks like predicting the average number of
social media shares and performance scores.
Recently, the algorithm has also made way into predicting patterns in speech
recognition software and classifying images and texts.
The Data Science library in Python language to implement the Random Forest Machine Learning Algorithm is Sci-Kit Learn.
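A minimal Sci-Kit Learn sketch of such a forest (the synthetic dataset and settings are illustrative, not from the text):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random subset of features;
# n_jobs=-1 grows the trees in parallel, and predictions are majority votes.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_[:5])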
Support Vector Machine (SVM) Machine Learning Algorithm
A Support Vector Machine (SVM) is a very powerful and flexible Machine Learning model,
capable of performing linear or nonlinear classification, regression, and even outlier
detection. It is one of the most popular models in Machine Learning, and anyone interested
in ML should have it in their toolbox. SVMs are particularly well suited for classification of
complex but small- or medium-sized datasets.
The 2 classes can clearly be separated with a straight line (they are linearly separable). The
left plot shows the decision boundaries of 2 possible linear classifiers. An SVM model is all
about generating the right line (called a hyperplane in higher dimensions) that classifies the
data well. In the left plot, even though the red line classifies the data, it might not perform
very well on new instances. We can draw many lines that classify this data, but among
them the blue line separates the classes best. The same blue line is shown on the
right plot. This line (hyperplane) not only separates the two classes but also stays as far
away from the closest training instances as possible. You can think of an SVM classifier as
fitting the widest possible street (represented by the parallel dashed lines on the right plot)
between the classes. This is called Large Margin Classification.
This best possible decision boundary is determined (or “supported”) by the instances
located on the edge of the street. These instances are called the support vectors. The
distance between the edges of “the street” is called margin.
If we strictly impose that all instances must be off the "street" and on the correct side of the line, this is
called Hard margin classification. There are 2 problems with hard margin classification: it
only works if the data is linearly separable, and it is sensitive to outliers.
In the above data classes, there is a blue outlier, and if we apply Hard margin classification
on this dataset, we will get a decision boundary with a small margin, shown in the left diagram.
To avoid these issues it is preferable to use a more flexible model. The objective is to find a
good balance between keeping the street as wide as possible and limiting margin
violations (i.e., instances that end up in the middle of the street or even on the wrong side).
This is called Soft margin classification. If we apply Soft margin classification on this
dataset, we will get a decision boundary with a larger margin than Hard margin classification,
as shown in the right diagram.
Nonlinear SVM
Although linear SVM classifiers are efficient and work surprisingly well in many cases,
many datasets are not even close to being linearly separable. One simple method to handle
nonlinear datasets is to add more features, such as polynomial features; sometimes this
can result in a linearly separable dataset. By generating polynomial features, we will have a
new feature matrix consisting of all polynomial combinations of the features with degree
less than or equal to the specified degree. The following image is an example of using
Polynomial Features for SVM.
Kernel Trick
A kernel is a way of computing the dot product of two vectors x and y in some (possibly very
high dimensional) feature space, which is why kernel functions are sometimes called a
"generalized dot product".
Suppose we have a mapping φ: ℝn → ℝm that brings our vectors in ℝn to some feature space
ℝm. Then the dot product of x and y in this space is φ(x)ᵀφ(y). A kernel is a function k that
corresponds to this dot product, i.e. k(x, y) = φ(x)ᵀφ(y). Kernels give a way to compute dot
products in some feature space without even knowing what this space is or what φ is.
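A tiny numerical check of this idea, using the degree-2 polynomial kernel k(x, y) = (x·y)² and its explicit feature map as an illustration (the vectors are arbitrary):
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel (no bias term)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

lhs = phi(x) @ phi(y)   # dot product computed in the 3-dimensional feature space
rhs = (x @ y) ** 2      # the same value computed directly in the original space

print(lhs, rhs)         # both print 121.0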
Polynomial Kernel
Adding polynomial features is very simple to implement. But a low polynomial degree
cannot deal with complex datasets, and a high polynomial degree creates a huge
number of features, making the model too slow. In these situations we can use a
polynomial kernel to avoid this problem. The polynomial kernel has the following form:
k(x, y) = (γ·(x·y) + r)^d
Gaussian RBF (Radial Basis Function) is another popular kernel used in SVM
models. The Gaussian kernel has the following form:
k(x, y) = exp(−γ·‖x − y‖²)
RBF Kernels are very useful if we have datasets like the following one;
Hyperparameters
There are 2 important hyperparameters in an SVM classifier: C and γ (gamma).
The C parameter decides the margin width of the SVM classifier. A large value of C makes the
classifier strict and thus gives a small margin width. For large values of C, the model will choose a
smaller-margin hyperplane if that hyperplane does a better job of getting all the training
points classified correctly. Conversely, a very small value of C will cause the model to look
for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
The γ parameter defines how far the influence of each training example reaches. The γ parameter is
not used by a linear kernel in scikit-learn.
Implementation using scikit-learn
In this part we will implement SVM using scikit-learn. We will be using artificial datasets.
Linear Kernel
Python Code:
import numpy as np
import pandas as pd
from matplotlib import style
from sklearn.svm import SVC
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)
style.use('ggplot')

# NOTE: draw_svm was not defined in the original listing; a minimal
# linear-kernel version is sketched here (points, margins, support vectors).
def draw_svm(X, y, C=1.0):
    plt.scatter(X[:, 0], X[:, 1], c=y)
    clf = SVC(kernel='linear', C=C)
    clf_fit = clf.fit(X, y)
    ax = plt.gca()
    xlim, ylim = ax.get_xlim(), ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                         np.linspace(ylim[0], ylim[1], 50))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k',
               linestyles=['--', '-', '--'], alpha=0.5)
    ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
               s=100, linewidth=1, facecolors='none')
    plt.show()
    return clf_fit

# Import Dataset
data = pd.read_csv('data.csv', header=None)
X = data.values[:, :2]
y = data.values[:, 2]

clf_arr = []
clf_arr.append(draw_svm(X, y, 0.0001))
clf_arr.append(draw_svm(X, y, 0.001))
clf_arr.append(draw_svm(X, y, 1))
clf_arr.append(draw_svm(X, y, 10))

# Accuracy Score and sample predictions for each value of C
for clf in clf_arr:
    print(clf.score(X, y))
    pred = clf.predict([(12, 32), (-250, 32), (120, 43)])
    print(pred)
Output:
0.992907801418
[1 0 1]
0.992907801418
[1 0 1]
1.0
[1 0 1]
1.0
[1 0 1]
You can see the same hyperplane with different margin widths, depending on the C
hyperparameter.
Polynomial Kernel
import numpy as np
import pandas as pd
from matplotlib import style
from sklearn.svm import SVC
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)
style.use('ggplot')

# NOTE: the original listing omits the data-loading step and the body of
# draw_svm for this example; the sketch below assumes the same CSV layout
# as the linear example and a degree-3 polynomial kernel.
def draw_svm(X, y, C=1.0, degree=3):
    plt.scatter(X[:, 0], X[:, 1], c=y)
    clf = SVC(kernel='poly', degree=degree, C=C)
    clf_fit = clf.fit(X, y)
    ax = plt.gca()
    ax.scatter(clf.support_vectors_[:, 0],
               clf.support_vectors_[:, 1],
               s=100, linewidth=1, facecolors='none')
    plt.show()
    return clf_fit

data = pd.read_csv('data.csv', header=None)  # file name assumed
X = data.values[:, :2]
y = data.values[:, 2]

clf = draw_svm(X, y)
score = clf.score(X, y)
pred = clf.predict([(-130, 110), (-170, -160), (80, 90), (-280, 20)])
print(score)
print(pred)

Output:
1.0
[0 1 0 1]
Gaussian Kernel
import numpy as np
from matplotlib import style
from sklearn.svm import SVC
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)
style.use('ggplot')

X, y = make_moons(n_samples=200)

# Auto gamma equals 1/n_features
def draw_svm(X, y, C=1.0, gamma='auto'):
    plt.scatter(X[:, 0], X[:, 1], c=y)
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    clf_fit = clf.fit(X, y)
    ax = plt.gca()
    ax.scatter(clf.support_vectors_[:, 0],
               clf.support_vectors_[:, 1],
               s=100, linewidth=1, facecolors='none')
    plt.show()
    return clf_fit

clf_arr = []
clf_arr.append(draw_svm(X, y, 0.01))
clf_arr.append(draw_svm(X, y, 0.1))
clf_arr.append(draw_svm(X, y, 1))
clf_arr.append(draw_svm(X, y, 10))

# Accuracy for each value of C (the printing step is reconstructed)
for clf in clf_arr:
    print(clf.score(X, y))

Output:
0.83
0.9
1.0
1.0
import numpy as np
from matplotlib import style
from sklearn.svm import SVC
from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)
style.use('ggplot')

# NOTE: the original listing for this example is incomplete; the dataset
# generation is assumed to use make_gaussian_quantiles, and gamma is varied
# with C fixed, which matches the discussion that follows.
X, y = make_gaussian_quantiles(n_samples=200, n_features=2, n_classes=2)

def draw_svm(X, y, gamma='auto', C=1.0):
    plt.scatter(X[:, 0], X[:, 1], c=y)
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    clf_fit = clf.fit(X, y)
    ax = plt.gca()
    ax.scatter(clf.support_vectors_[:, 0],
               clf.support_vectors_[:, 1],
               s=100, linewidth=1, facecolors='none')
    plt.show()
    return clf_fit

clf_arr = []
clf_arr.append(draw_svm(X, y, 0.1))
clf_arr.append(draw_svm(X, y, 1))
clf_arr.append(draw_svm(X, y, 10))
clf_arr.append(draw_svm(X, y, 100))

for clf in clf_arr:
    print(clf.score(X, y))

Output:
0.965
0.97
0.985
0.995
The γ parameter is very important to the RBF SVM model. In the first example, a low value
of γ leads to an almost linear classification.
GBM (gradient boosting):-
Gradient boosting is a machine learning technique for regression and classification
problems, which produces a prediction model in the form of an ensemble of weak
prediction models, typically decision trees. (Wikipedia definition)
The accuracy of a predictive model can be boosted in two ways: either by embracing
feature engineering or by applying boosting algorithms straight away. Having participated
in lots of data science competitions, I've noticed that people prefer to work with boosting
algorithms, as it takes less time and produces similar results.
There are multiple boosting algorithms like Gradient Boosting, XGBoost, AdaBoost, Gentle
Boost etc. Every algorithm has its own underlying mathematics and a slight variation is
observed while applying them. If you are new to this, Great! You shall be learning all these
concepts in a week’s time from now.
In this article, I’ve explained the underlying concepts and complexities of Gradient Boosting
Algorithm. In addition, I’ve also shared an example to learn its implementation in R.
While working with boosting algorithms, you'll soon come across two frequently occurring
buzzwords: Bagging and Boosting. So, how are they different? Here's a one-line
explanation:
Bagging: It is an approach where you take random samples of data, build learning
algorithms and take simple means to find bagging probabilities.
Boosting: It is similar, but the samples are selected more intelligently: observations that
were hard to classify in earlier rounds are given progressively more weight.
Okay! I understand you have questions sprouting up, like 'what do you mean by hard? How do
I know how much additional weight I am supposed to give to a mis-classified observation?' I
shall answer all your questions in subsequent sections. Keep calm and proceed.
Assume, you are given a previous model M to improve on. Currently you observe that the
model has an accuracy of 80% (any metric). How do you go further about it?
One simple way is to build an entirely different model using new set of input variables and
trying better ensemble learners. On the contrary, I have a much simpler way to suggest. It
goes like this:
Y = M(x) + error
What if I find that the error is not white noise but has some correlation with the
outcome (Y)? What if we can develop a model on this error term? Like,
error = G(x) + error2
Probably, you’ll see error rate will improve to a higher number, say 84%. Let’s take another
step and regress against error2.
error2 = H(x) + error3
This will probably have an accuracy of even more than 84%. What if I can find optimal
weights for each of the three learners,
Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
If we found good weights, we probably have made even a better model. This is the
underlying principle of a boosting learner. When I read the theory for the first time, two
quick questions came up, and I'll answer them here in a crisp manner. First, boosting is
generally done on weak learners, which do not have the capacity to leave behind white noise.
Second, boosting can lead to overfitting, so we need to stop at the right point.
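A minimal sketch of this residual-fitting idea, assuming scikit-learn's DecisionTreeRegressor as the weak learner and a synthetic regression dataset (both are illustrative choices, not part of the original text):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())    # the initial model M(x): just the mean
trees = []

for _ in range(n_rounds):
    residual = y - prediction             # the "error" term we model next
    tree = DecisionTreeRegressor(max_depth=2)   # a weak learner
    tree.fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))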
We start with the first box. We see one vertical line, which becomes our first weak learner.
Now in total we have 3/10 mis-classified observations. We then start giving higher weights
to these 3 mis-classified observations, so it becomes very important to classify them right.
Hence the vertical line towards the right edge in the next box. We repeat this process and
then combine each of the learners with appropriate weights.
We always start with a uniform distribution assumption. Let's call it D1, which assigns a
weight of 1/n to each of the n observations.
Steps 1 to 3 (whose formulas are not reproduced here) build a weak learner h(x) on the
current distribution, compute its weighted error, and update the weight of each observation
roughly as D_new(i) ∝ D_old(i) · exp(−alpha · y · h(x)), so that mis-classified observations gain weight.
Step 4 : Use the new population distribution to again find the next learner.
Scared of the Step 3 mathematics? Let me break it down for you. Simply look at the argument
in the exponent. Alpha is a kind of learning rate, y is the actual response (+1 or -1) and h(x) is
the class predicted by the learner. If the learner goes wrong, the exponent becomes
+1*alpha, and otherwise -1*alpha. Essentially, the weight will increase if the
prediction went wrong the last time. So, what's next?
Step 5 : Iterate step 1 – step 4 until no hypothesis is found which can improve further.
Step 6 : Take a weighted average of the frontier using all the learners used till now. But
what are the weights? The weights are simply the alpha values, calculated as
alpha = (1/2) · ln((1 − error rate) / error rate)
so learners with a lower weighted error receive a higher weight.
Gradient boosting builds on the same idea and involves three elements: a loss function to be
optimized, a weak learner to make predictions, and an additive model that adds weak
learners to minimize the loss.
1. Loss Function
The loss function used depends on the type of problem being solved.
It must be differentiable, but many standard loss functions are supported and you can
define your own.
For example, regression may use a squared error and classification may use logarithmic
loss.
A benefit of the gradient boosting framework is that a new boosting algorithm does not
have to be derived for each loss function one may want to use; instead, it is a generic
enough framework that any differentiable loss function can be used.
2. Weak Learner
Specifically, regression trees are used that output real values for splits and whose outputs
can be added together, allowing subsequent models' outputs to be added to "correct" the
residuals in the predictions.
Trees are constructed in a greedy manner, choosing the best split points based on purity
scores like Gini or to minimize the loss.
Initially, such as in the case of AdaBoost, very short decision trees were used that only had
a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.
It is common to constrain the weak learners in specific ways, such as a maximum number
of layers, nodes, splits or leaf nodes.
This is to ensure that the learners remain weak, but can still be constructed in a greedy
manner.
3. Additive Model
Trees are added one at a time, and existing trees in the model are not changed.
A gradient descent procedure is used to minimize the loss when adding trees.
Generally this approach is called functional gradient descent or gradient descent with
functions.
One way to produce a weighted combination of classifiers which optimizes [the cost] is by
gradient descent in function space
The output for the new tree is then added to the output of the existing sequence of trees in
an effort to correct or improve the final output of the model.
A fixed number of trees are added or training stops once loss reaches an acceptable level or
no longer improves on an external validation dataset.
Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.
It can benefit from regularization methods that penalize various parts of the algorithm and
generally improve the performance of the algorithm by reducing overfitting.
1. Tree Constraints
2. Shrinkage
3. Random sampling
4. Penalized Learning
1. Tree Constraints
It is important that the weak learners have skill but remain weak.
A good general heuristic is that the more constrained tree creation is, the more trees you
will need in the model, and the reverse, where less constrained individual trees, the fewer
trees that will be required.
Below are some constraints that can be imposed on the construction of decision trees:
Number of trees, generally adding more trees to the model can be very slow to
overfit. The advice is to keep adding trees until no further improvement is observed.
Tree depth, deeper trees are more complex trees and shorter trees are preferred.
Generally, better results are seen with 4-8 levels.
Number of nodes or number of leaves, like depth, this can constrain the size of
the tree, but is not constrained to a symmetrical structure if other constraints are
used.
Number of observations per split imposes a minimum constraint on the amount
of training data at a training node before a split can be considered
Minimum improvement to loss is a constraint on the improvement of any split
added to a tree.
2. Shrinkage (Weighted Updates)
The contribution of each tree to this sum can be weighted to slow down the learning by the
algorithm. This weighting is called a shrinkage or a learning rate.
Each update is simply scaled by the value of the “learning rate parameter v”
The effect is that learning is slowed down, in turn requiring more trees to be added to the
model and taking longer to train, providing a configuration trade-off between the
number of trees and the learning rate.
Decreasing the value of v [the learning rate] increases the best value for M [the number of
trees].
It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.
3. Random Sampling
A big insight into bagging ensembles and random forest was allowing trees to be greedily
created from subsamples of the training dataset.
This same benefit can be used to reduce the correlation between the trees in the sequence
in gradient boosting models.
Generally, aggressive sub-sampling such as selecting only 50% of the data has shown to be
beneficial.
According to user feedback, using column sub-sampling prevents over-fitting even more so
than the traditional row sub-sampling
4. Penalized Learning
Classical decision trees like CART are not used as weak learners; instead, a modified form
called a regression tree is used that has numeric values in the leaf nodes (also called
terminal nodes). The values in the leaves of the trees can be called weights in some
literature.
As such, the leaf weight values of the trees can be regularized using popular regularization
functions, such as:
L1 regularization of weights.
L2 regularization of weights.
The additional regularization term helps to smooth the final learnt weights to avoid over-
fitting. Intuitively, the regularized objective will tend to select a model employing simple
and predictive functions.
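A minimal sketch of the first three knobs (tree constraints, shrinkage and row subsampling) using scikit-learn's GradientBoostingRegressor; leaf-weight L1/L2 penalties are exposed by libraries such as XGBoost rather than by this estimator. The dataset and parameter values below are illustrative:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=200,      # number of trees
    max_depth=3,           # tree constraint: depth
    min_samples_split=10,  # tree constraint: observations per split
    learning_rate=0.1,     # shrinkage
    subsample=0.5,         # random (row) sampling
)
gbm.fit(X_train, y_train)
print("test R^2:", gbm.score(X_test, y_test))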
Unsupervised Machine Learning Algorithms
Clustering Models:
K-Means :-
Clustering is a type of unsupervised learning. It is very often used when you don't have
labeled data. K-Means Clustering is one of the most popular clustering algorithms. The goal of
this algorithm is to find groups (clusters) in the given data. We will implement the K-Means
algorithm using Python from scratch.
K-Means Clustering
K-Means is a very simple algorithm which clusters the data into K number of clusters. The
following image from PyPR is an example of K-Means Clustering.
Use Cases
Image Segmentation
Clustering Gene Segmentation Data
News Article Clustering
Clustering Languages
Species Clustering
Anomaly Detection
Algorithm
Our algorithm works as follows, assuming we have inputs x1, x2, x3, …, xn and a value of K:
Step 1: Pick K random points as the initial cluster centers (centroids).
Step 2: Assign each input point to the cluster whose centroid is nearest (for example, by Euclidean distance).
Step 3: Find the new centroid of each cluster by taking the average of all the points assigned to that cluster.
Step 4: Repeat steps 2 and 3 until none of the cluster assignments change, that is, until the clusters remain stable.
We run the algorithm for different values of K (say K = 1 to 10) and plot the K values against
SSE (Sum of Squared Errors). We then select the value of K at the elbow point, as shown in the
figure.
The dataset we are going to use has 3000 entries with 3 clusters, so we already know the
value of K.
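A small sketch of the elbow method, assuming scikit-learn's KMeans (its inertia_ attribute is the SSE of the fitted clustering) and a synthetic stand-in dataset:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)  # stand-in data

sse = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the closest centroid

plt.plot(list(ks), sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()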
%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('xclara.csv')
print(data.shape)
data.head()
(3000, 2)
V1 V2
0 2.072345 -3.241693
1 17.936710 15.784810
2 1.083576 7.319176
3 11.120670 14.406780
4 23.711550 2.557729
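The from-scratch clustering referred to below (the array C of hand-computed centroids, plotted as black stars) is not reproduced in the original listing; continuing from the code above, a minimal sketch with k = 3 is:
# Minimal from-scratch K-Means (an illustrative reconstruction)
X = data.values          # the two columns V1, V2
k = 3

# Step 1: pick k random points as the initial centroids
np.random.seed(0)
C = X[np.random.choice(len(X), k, replace=False)].astype(float)

for _ in range(100):
    # Step 2: assign every point to its nearest centroid
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    clusters = np.argmin(dist, axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points
    C_new = np.array([X[clusters == i].mean(axis=0) for i in range(k)])
    # Step 4: stop when the centroids no longer move
    if np.allclose(C, C_new):
        break
    C = C_new

plt.scatter(X[:, 0], X[:, 1], s=7)
plt.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='black')
plt.show()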
From this visualization it is clear that there are 3 clusters with black stars as their centroid.
If you run K-Means with wrong values of K, you will get completely misleading clusters. For
example, if you run K-Means on this with values 2, 4, 5 and 6, you will get the following
clusters.
from sklearn.cluster import KMeans

# X holds the same two columns (V1, V2) loaded above
X = data.values

# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
# Comparing with scikit-learn centroids
print(C) # From Scratch
print(centroids) # From sci-kit learn
[[ 9.47804546 10.68605232]
[ 40.68362808 59.71589279]
[ 69.92418671 -10.1196413 ]]
[[ 9.4780459 10.686052 ]
[ 69.92418447 -10.11964119]
[ 40.68362784 59.71589274]]
You can see that the centroid values are equal, just in a different order.
Example 2
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.rcParams['figure.figsize'] = (16, 9)

# NOTE: the original listing omits the dataset; a 3-dimensional blob
# dataset with 4 clusters is generated here as a stand-in.
X, y = make_blobs(n_samples=800, n_features=3, centers=4, random_state=0)

# Initializing KMeans
kmeans = KMeans(n_clusters=4)
# Fitting with inputs
kmeans = kmeans.fit(X)
# Predicting the clusters
labels = kmeans.predict(X)
# Getting the cluster centers
C = kmeans.cluster_centers_

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.scatter(C[:, 0], C[:, 1], C[:, 2], marker='*', c='#050505', s=1000)
plt.show()
In the above image, you can see 4 clusters and their centroids as stars. scikit-learn
approach is very simple and concise.
K Means Clustering Algorithm
K-means is a popularly used unsupervised machine learning algorithm for cluster analysis.
K-Means is a non-deterministic and iterative method. The algorithm operates on a given
data set through pre-defined number of clusters, k. The output of K Means algorithm is k
clusters with input data partitioned among the clusters.
For instance, let’s consider K-Means Clustering for Wikipedia Search results. The search
term “Jaguar” on Wikipedia will return all pages containing the word Jaguar which can
refer to Jaguar as a Car, Jaguar as Mac OS version and Jaguar as an Animal. K Means
clustering algorithm can be applied to group the webpages that talk about similar concepts.
So, the algorithm will group all web pages that talk about Jaguar as an Animal into one
cluster, Jaguar as a Car into another cluster, and so on.
In case of globular clusters, K-Means produces tighter clusters than hierarchical clustering.
Given a smaller value of K, K-Means clustering computes faster than hierarchical clustering
for large number of variables.
Applications of K-Means Clustering
K Means Clustering algorithm is used by most of the search engines like Yahoo, Google to
cluster web pages by similarity and identify the ‘relevance rate’ of search results. This helps
search engines reduce the computational time for the users.
Data Science Libraries in Python to implement K-Means Clustering – SciPy, Sci-Kit Learn,
Python Wrapper
Hierarchical Clustering:-
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of
clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: This is a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
In hierarchical clustering, the two most similar clusters are combined and continue to be
combined until all objects are in the same cluster. Hierarchical clustering produces a tree
(called a dendrogram) that shows the hierarchy of the clusters. Hierarchical clustering
is considered an unsupervised clustering method.
Hierarchical clustering outputs a structure that is more informative than the unstructured
set of clusters returned by flat clustering. It does not require us to prespecify the number of
clusters, and most hierarchical algorithms that have been used in IR are deterministic. These
advantages of hierarchical clustering come at the cost of lower efficiency: the most common
hierarchical clustering algorithms have a complexity that is at least quadratic in the number of
data points.
Hierarchical clustering involves creating clusters that have a predetermined ordering from
top to bottom. For example, all files and folders on the hard disk are organized in a
hierarchy. There are two types of hierarchical clustering, Divisive and Agglomerative.
Divisive method
In divisive or top-down clustering method we assign all of the observations to a single
cluster and then partition the cluster to two least similar clusters. Finally, we proceed
recursively on each cluster until there is one cluster for each observation. There is evidence
that divisive algorithms produce more accurate hierarchies than agglomerative algorithms
in some circumstances, but the divisive approach is conceptually more complex.
Agglomerative method
In the agglomerative or bottom-up clustering method we (1) assign each observation to its own
cluster, then (2) compute the similarity (e.g., distance) between each pair of clusters and (3) join
the two most similar clusters. Steps 2 and 3 are repeated until only a single cluster is left.
Before any clustering is performed, a proximity matrix is computed containing the distance
between each pair of points using a distance function. The matrix is then updated to show the
distance between each cluster. The following three methods differ in how the distance between
each cluster is measured.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the
shortest distance between two points in each cluster. For example, the distance between
clusters “r” and “s” to the left is equal to the length of the arrow between their two closest
points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as
the longest distance between two points in each cluster. For example, the distance between
clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest
points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined
as the average distance between each point in one cluster and every point in the other
cluster.
For example, the distance between clusters "r" and "s" to the left is equal to the average
length of the arrows connecting the points of one cluster to the other.
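A small sketch of agglomerative clustering with these linkage options, assuming SciPy's hierarchical clustering API and a synthetic dataset (both illustrative):
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# method can be 'single', 'complete' or 'average', matching the three
# linkage definitions described above
Z = linkage(X, method='average')

dendrogram(Z)       # the tree of merges
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels[:10])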
TensorFlow
This section describes how to use machine learning to categorize Iris flowers by species. It
uses TensorFlow's eager execution to (1) build a model, (2) train the model on example
data, and (3) use the model to make predictions on unknown data. Machine learning
experience isn't required to follow this guide, but you'll need to read some Python code.
TensorFlow programming
There are many TensorFlow APIs available, but we recommend starting with these high-level
TensorFlow concepts:
Enable an eager execution development environment,
Import data with the Datasets API,
Build models and layers with TensorFlow's Keras API.
This section uses these APIs and is structured like many other TensorFlow programs:
1. Import and parse the data sets.
2. Select the type of model.
3. Train the model.
4. Evaluate the model's effectiveness.
5. Use the trained model to make predictions.
This is available as an interactive Colab notebook for you to run and change the Python
code directly in the browser. The notebook handles setup and dependencies while you
"play" cells to execute the code blocks. This is a fun way to explore the program and test
ideas. If you are unfamiliar with Python notebook environments, there are a couple of
things to keep in mind:
1. Executing code requires connecting to a runtime environment. In the Colab
notebook menu, select Runtime > Connect to runtime...
2. Notebook cells are arranged sequentially to gradually build the program. Typically,
later code cells depend on prior code cells, though you can always rerun a code
block. To execute the entire notebook in order, select Runtime > Run all. To rerun a
code cell, select the cell and click the play icon on the left.
Setup program
This uses eager execution, which is available in TensorFlow 1.8. (You may need to restart
the runtime after upgrading.)
Import the required Python modules, including TensorFlow, and enable eager execution for
this program. Eager execution makes TensorFlow evaluate operations immediately,
returning concrete values instead of creating a computational graph that is executed later.
If you are used to a REPL or the python interactive console, you'll feel at home.
Once eager execution is enabled, it cannot be disabled within the same program. See
the eager execution guidefor more details.
import os
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()
Imagine you are a botanist seeking an automated way to categorize each Iris flower you
find. Machine learning provides many algorithms to statistically classify flowers. For
instance, a sophisticated machine learning program could classify flowers based on
photographs. Our ambitions are more modest—we're going to classify Iris flowers based
on the length and width measurements of their sepals and petals.
The Iris genus entails about 300 species, but our program will classify only the following
three:
Iris setosa
Iris virginica
Iris versicolor
Fortunately, someone has already created a data set of 120 Iris flowers with the sepal and
petal measurements. This is a classic dataset that is popular for beginner machine learning
classification problems.
Figure 1. Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor, (by Dlanglois,
CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).
We need to download the dataset file and convert it to a structure that can be used by this
Python program.
Download the training dataset file using the tf.keras.utils.get_file function. This returns the
file path of the downloaded file.
train_dataset_url = "http://download.tensorflow.org/data/iris_training.csv"
train_dataset_fp = tf.keras.utils.get_file(fname=os.path.basename(train_dataset_url),
origin=train_dataset_url)
This dataset, iris_training.csv, is a plain text file that stores tabular data formatted as comma-
separated values (CSV). Use the head -n5 command to take a peek at the first five entries:
120,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,2.5,4.5,1.7,2
4.9,3.1,1.5,0.1,0
Each label is associated with a string name (for example, "setosa"), but machine learning
typically relies on numeric values. The label numbers are mapped to a named
representation, such as:
0: Iris setosa
1: Iris versicolor
2: Iris virginica
For more information about features and labels, see the ML Terminology section of the
Machine Learning Crash Course.
Since our dataset is a CSV-formatted text file, we'll parse the feature and label values into a
format our Python model can use. Each line—or row—in the file is passed to
the parse_csv function which grabs the first four feature fields and combines them into a
single tensor. Then, the last field is parsed as the label. The function
returns both the features and label tensors:
def parse_csv(line):
example_defaults = [[0.], [0.], [0.], [0.], [0]] # sets field types
parsed_line = tf.decode_csv(line, example_defaults)
# First 4 fields are features, combine into single tensor
features = tf.reshape(parsed_line[:-1], shape=(4,))
# Last field is the label
label = tf.reshape(parsed_line[-1], shape=())
return features, label
TensorFlow's Dataset API handles many common cases for feeding data into a model. This
is a high-level API for reading data and transforming it into a form used for training. See
the Datasets Quick Start guide for more information.
This program uses tf.data.TextLineDataset to load a CSV-formatted text file, which is then parsed
with our parse_csv function. A tf.data.Dataset represents an input pipeline as a collection of
elements and a series of transformations that act on those elements. Transformation
methods are chained together or called sequentially—just make sure to keep a reference to
the returned Dataset object.
Training works best if the examples are in random order. Use tf.data.Dataset.shuffle to
randomize entries, setting buffer_size to a value larger than the number of examples (120 in
this case). To train the model faster, the dataset's batch size is set to 32 examples to train at
once.
train_dataset = tf.data.TextLineDataset(train_dataset_fp)
train_dataset = train_dataset.skip(1) # skip the first header row
train_dataset = train_dataset.map(parse_csv) # parse each row
train_dataset = train_dataset.shuffle(buffer_size=1000) # randomize
train_dataset = train_dataset.batch(32)
Why model?
A model is the relationship between features and the label. For the Iris classification
problem, the model defines the relationship between the sepal and petal measurements
and the predicted Iris species. Some simple models can be described with a few lines of
algebra, but complex machine learning models have a large number of parameters that are
difficult to summarize.
Could you determine the relationship between the four features and the Iris
species without using machine learning? That is, could you use traditional programming
techniques (for example, a lot of conditional statements) to create a model? Perhaps—if
you analyzed the dataset long enough to determine the relationships between petal and
sepal measurements to a particular species. And this becomes difficult—maybe
impossible—on more complicated datasets. A good machine learning approach determines
the model for you. If you feed enough representative examples into the right machine
learning model type, the program will figure out the relationships for you.
We need to select the kind of model to train. There are many types of models and picking a
good one takes experience. This document uses a neural network to solve the Iris
classification problem. Neural networks can find complex relationships between features
and the label. It is a highly-structured graph, organized into one or more hidden layers.
Each hidden layer consists of one or more neurons. There are several categories of neural
networks and this program uses a dense, or fully-connected neural network: the neurons in
one layer receive input connections from every neuron in the previous layer. For example,
Figure 2 illustrates a dense neural network consisting of an input layer, two hidden layers,
and an output layer:
When the model from Figure 2 is trained and fed an unlabeled example, it yields three
predictions: the likelihood that this flower is the given Iris species. This prediction is
called inference. For this example, the sum of the output predictions is 1.0. In Figure 2,
this prediction breaks down as: 0.03 for Iris setosa, 0.95 for Iris versicolor, and 0.02 for Iris
virginica. This means that the model predicts—with 95% probability—that an unlabeled
example flower is an Iris versicolor.
The TensorFlow tf.keras API is the preferred way to create models and layers. This makes it
easy to build models and experiment while Keras handles the complexity of connecting
everything together. See the Keras documentation for details.
The tf.keras.Sequential model is a linear stack of layers. Its constructor takes a list of layer
instances, in this case, two Dense layers with 10 nodes each, and an output layer with 3
nodes representing our label predictions. The first layer's input_shape parameter
corresponds to the number of features from the dataset, and is required.
model = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),  # input shape required
  tf.keras.layers.Dense(10, activation="relu"),
  tf.keras.layers.Dense(3)
])
The activation function determines the output of a single neuron to the next layer. This is
loosely based on how brain neurons are connected. There are many available activations,
but ReLU is common for hidden layers.
The ideal number of hidden layers and neurons depends on the problem and the dataset.
Like many aspects of machine learning, picking the best shape of the neural network
requires a mixture of knowledge and experimentation. As a rule of thumb, increasing the
number of hidden layers and neurons typically creates a more powerful model, which
requires more data to train effectively.
Training is the stage of machine learning when the model is gradually optimized, or the
model learns the dataset. The goal is to learn enough about the structure of the training
dataset to make predictions about unseen data. If you learn too much about the training
dataset, then the predictions only work for the data it has seen and will not be
generalizable. This problem is called overfitting—it's like memorizing the answers instead
of understanding how to solve a problem.
The Iris classification problem is an example of supervised machine learning: the model is
trained from examples that contain labels. In unsupervised machine learning, the examples
don't contain labels. Instead, the model typically finds patterns among the features.
Both training and evaluation stages need to calculate the model's loss. This measures how
far off a model's predictions are from the desired label; in other words, how badly the model is
performing. We want to minimize, or optimize, this value.
Our model will calculate its loss using the tf.losses.sparse_softmax_cross_entropy function
which takes the model's prediction and the desired label. The returned loss value is
progressively larger as the prediction gets worse.
The grad function uses the loss function and the tf.GradientTape to record operations that
compute the gradients used to optimize our model. For more examples of this, see
the eager execution guide.
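The loss and grad helpers described here are not reproduced in the original listing; a sketch consistent with the TensorFlow 1.8 eager-execution API used above is:
def loss(model, x, y):
  # Cross-entropy between the model's logits and the integer labels
  y_ = model(x)
  return tf.losses.sparse_softmax_cross_entropy(labels=y, logits=y_)

def grad(model, inputs, targets):
  # Record operations on a tape so the gradient of the loss with
  # respect to the model variables can be computed
  with tf.GradientTape() as tape:
    loss_value = loss(model, inputs, targets)
  return tape.gradient(loss_value, model.variables)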
Create an optimizer
Figure 3. Optimization algorithms visualized over time in 3D space. (Source: Stanford class
CS231n, MIT License)
TensorFlow has many optimization algorithms available for training. This model uses
the tf.train.GradientDescentOptimizer that implements the stochastic gradient
descent (SGD) algorithm. The learning_rate sets the step size to take for each iteration down
the hill. This is a hyperparameter that you'll commonly adjust to achieve better results.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
Training loop
With all the pieces in place, the model is ready for training! A training loop feeds the
dataset examples into the model to help it make better predictions. The following code
block sets up these training steps:
1. Iterate each epoch. An epoch is one pass through the dataset.
2. Within an epoch, iterate over each example in the training Dataset grabbing its features (x)
and label (y).
3. Using the example's features, make a prediction and compare it with the label. Measure the
inaccuracy of the prediction and use that to calculate the model's loss and gradients.
4. Use an optimizer to update the model's variables.
5. Keep track of some stats for visualization.
6. Repeat for each epoch.
The num_epochs variable is the number of times to loop over the dataset collection. Counter-
intuitively, training a model longer does not guarantee a better model. num_epochs is
a hyperparameter that you can tune. Choosing the right number usually requires both
experience and experimentation.
num_epochs = 201
train_loss_results = []
train_accuracy_results = []

for epoch in range(num_epochs):
  epoch_loss_avg = tfe.metrics.Mean()
  epoch_accuracy = tfe.metrics.Accuracy()
  for x, y in train_dataset:
    # Optimize the model on this batch (loss and grad are the helpers sketched above)
    grads = grad(model, x, y)
    optimizer.apply_gradients(zip(grads, model.variables),
                              global_step=tf.train.get_or_create_global_step())
    # Track progress
    epoch_loss_avg(loss(model, x, y)) # add current batch loss
    epoch_accuracy(tf.argmax(model(x), axis=1, output_type=tf.int32), y)
  # end epoch
  train_loss_results.append(epoch_loss_avg.result())
  train_accuracy_results.append(epoch_accuracy.result())
  if epoch % 50 == 0:
    print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(
        epoch, epoch_loss_avg.result(), epoch_accuracy.result()))
While it's helpful to print out the model's training progress, it's often more helpful to see
this progress visually. TensorBoard is a nice visualization tool that is packaged with TensorFlow,
but we can create basic charts using the matplotlib module.
Interpreting these charts takes some experience, but you really want to see the loss go
down and the accuracy go up.
fig, axes = plt.subplots(2, sharex=True, figsize=(12, 8))  # reconstructed: the figure setup was missing
fig.suptitle('Training Metrics')
axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(train_loss_results)
axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Epoch", fontsize=14)
axes[1].plot(train_accuracy_results)
plt.show()
Now that the model is trained, we can get some statistics on its performance.
Evaluating means determining how effectively the model makes predictions. To determine
the model's effectiveness at Iris classification, pass some sepal and petal measurements to
the model and ask the model to predict what Iris species they represent. Then compare the
model's prediction against the actual label. For example, a model that picked the correct
species on half the input examples has an accuracy of 0.5. Figure 4 shows a slightly more
effective model, getting 4 out of 5 predictions correct at 80% accuracy:
Evaluating the model is similar to training the model. The biggest difference is the
examples come from a separate test set rather than the training set. To fairly assess a
model's effectiveness, the examples used to evaluate a model must be different from the
examples used to train the model.
The setup for the test Dataset is similar to the setup for the training Dataset. Download the CSV
text file and parse the values, then give it a little shuffle:
test_url = "http://download.tensorflow.org/data/iris_test.csv"
test_fp = tf.keras.utils.get_file(fname=os.path.basename(test_url),
origin=test_url)
test_dataset = tf.data.TextLineDataset(test_fp)
test_dataset = test_dataset.skip(1) # skip header row
test_dataset = test_dataset.map(parse_csv) # parse each row with the function created earlier
test_dataset = test_dataset.shuffle(1000) # randomize
test_dataset = test_dataset.batch(32) # use the same batch size as the training set
Unlike the training stage, the model only evaluates a single epoch of the test data. In the
following code cell, we iterate over each example in the test set and compare the model's
prediction against the actual label. This is used to measure the model's accuracy across the
entire test set.
test_accuracy = tfe.metrics.Accuracy()
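The evaluation loop itself is not shown in the original listing; a sketch using the tfe.metrics.Accuracy object created above is:
for (x, y) in test_dataset:
  # For each batch, compare the predicted class with the true label
  prediction = tf.argmax(model(x), axis=1, output_type=tf.int32)
  test_accuracy(prediction, y)

print("Test set accuracy: {:.3%}".format(test_accuracy.result()))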
We've trained a model and "proven" that it's good—but not perfect—at classifying Iris
species. Now let's use the trained model to make some predictions on unlabeled examples;
that is, on examples that contain features but not a label.
In real-life, the unlabeled examples could come from lots of different sources including
apps, CSV files, and data feeds. For now, we're going to manually provide three unlabeled
examples to predict their labels. Recall, the label numbers are mapped to a named
representation as:
0: Iris setosa
1: Iris versicolor
2: Iris virginica
predict_dataset = tf.convert_to_tensor([
[5.1, 3.3, 1.7, 0.5,],
[5.9, 3.0, 4.2, 1.5,],
[6.9, 3.1, 5.4, 2.1]
])
predictions = model(predict_dataset)
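The original stops here; a short sketch of turning these predictions into species names (the class_ids list simply mirrors the label mapping above) is:
class_ids = ["Iris setosa", "Iris versicolor", "Iris virginica"]

for i, logits in enumerate(predictions):
  class_idx = int(tf.argmax(logits).numpy())   # index of the highest logit
  print("Example {} prediction: {}".format(i, class_ids[class_idx]))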
In this document, we will use the tf.estimator API in TensorFlow to solve a binary
classification problem: Given census data about a person such as age, education, marital
status, and occupation (the features), we will try to predict whether or not the person
earns more than 50,000 dollars a year (the target label). We will train a logistic
regression model, and given an individual's information our model will output a number
between 0 and 1, which can be interpreted as the probability that the individual has an
annual income of over 50,000 dollars.
Setup
1. Install TensorFlow if you haven't already
2. Download the code. :
https://github.com/tensorflow/models/tree/master/official/wide_deep/
3. Execute the data download script we provide to you:
$ python data_download.py
4. Execute the code with the following command to train the linear model described in this
document:
Read on to find out how this code builds its linear model.
Since the task is a binary classification problem, we'll construct a label column named
"label" whose value is 1 if the income is over 50K, and 0 otherwise. For reference,
see input_fn in wide_deep.py.
Next, let's take a look at the dataframe and see which columns we can use to predict the
target label. The columns can be grouped into two types—categorical and continuous
columns:
A column is called categorical if its value can only be one of the categories in a finite set.
For example, the relationship status of a person (wife, husband, unmarried, etc.) or the
education level (high school, college, etc.) are categorical columns.
A column is called continuous if its value can be any numerical value in a continuous
range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
When building a tf.estimator model, the input data is specified by means of an Input Builder
function. This builder function will not be called until it is later passed to
tf.estimator.Estimator methods such as train and evaluate. The purpose of this function is to
construct the input data, which is represented in the form of tf.Tensors or tf.SparseTensors. In
more detail, the input builder function returns the following as a pair:
1. features: A dict from feature column names to Tensors or SparseTensors.
2. labels: A Tensor containing the label column.
The keys of the features will be used to construct columns in the next section. Because we
want to call the train and evaluate methods with different data, we define a method that
returns an input function based on the given data. Note that the returned input function
will be called while constructing the TensorFlow graph, not while running the graph. What
it is returning is a representation of the input data as the fundamental unit of TensorFlow
computations, a Tensor (or SparseTensor).
Each continuous column in the train or test data will be converted into a Tensor, which in
general is a good format to represent dense data. For categorical data, we must represent
the data as a SparseTensor. This data format is good for representing sparse data.
Our input_fn uses the tf.data API, which makes it easy to apply transformations to our dataset:
def input_fn(data_file, num_epochs, shuffle, batch_size):
  # NOTE: the enclosing input_fn definition and the dataset construction were
  # missing from the original listing and are reconstructed here.
  def parse_csv(value):
    print('Parsing', data_file)
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    return features, tf.equal(labels, '>50K')

  dataset = tf.data.TextLineDataset(data_file)
  if shuffle:
    dataset = dataset.shuffle(buffer_size=_SHUFFLE_BUFFER)
  dataset = dataset.map(parse_csv)
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)
  iterator = dataset.make_one_shot_iterator()
  features, labels = iterator.get_next()
  return features, labels
Selecting and crafting the right set of feature columns is key to learning an effective model.
A feature column can be either one of the raw columns in the original dataframe (let's call
them base feature columns), or any new columns created based on some transformations
defined over one or multiple base columns (let's call them derived feature columns).
Basically, "feature column" is an abstract concept of any raw or derived variable that can be
used to predict the target label.
To define a feature column for a categorical feature, we can create a CategoricalColumn using
the tf.feature_column API. If you know the set of all possible feature values of a column and
there are only a few of them, you can use categorical_column_with_vocabulary_list. Each key in the
list will get assigned an auto-incremental ID starting from 0. For example, for
the relationship column we can assign the feature string "Husband" to an integer ID of 0 and
"Not-in-family" to 1, etc., by doing:
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])
What if we don't know the set of possible values in advance? Not a problem. We can
use categorical_column_with_hash_bucket instead:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
'occupation', hash_bucket_size=1000)
What will happen is that each possible value in the feature column occupation will be hashed
to an integer ID as we encounter them in training. See an example illustration below:
ID Feature
9 "Machine-op-inspct"
103 "Farming-fishing"
375 "Protective-serv"
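The IDs in this table are only illustrative. Conceptually, the bucket ID is just a hash of the
feature string reduced modulo the bucket count, roughly like this sketch using the TF 1.x op
tf.string_to_hash_bucket_fast (the hash used inside the feature column may differ):
occupation_id = tf.string_to_hash_bucket_fast(
    'Machine-op-inspct', num_buckets=1000)  # deterministic ID in [0, 1000)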
No matter which way we choose to define a SparseColumn, each feature string will be
mapped into an integer ID by looking up a fixed mapping or by hashing. Note that hashing
collisions are possible, but may not significantly impact the model quality. Under the hood,
the LinearModel class is responsible for managing the mapping and creating tf.Variable to store
the model parameters (also known as model weights) for each feature ID. The model
parameters will be learned through the model training process we'll go through later.
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])
Similarly, we can define a NumericColumn for each continuous feature column that we want
to use in the model:
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')
Sometimes the relationship between a continuous feature and the label is not linear. As a
hypothetical example, a person's income may grow with age in the early stage of one's
career, then the growth may slow at some point, and finally the income decreases after
retirement. In this scenario, using the raw age as a real-valued feature column might not be
a good choice because the model can only learn one of the three cases:
1. Income always increases at some rate as age grows (positive correlation),
2. Income always decreases at some rate as age grows (negative correlation), or
3. Income stays roughly flat regardless of age (no correlation).
If we want to learn the fine-grained correlation between income and each age group
separately, we can leverage bucketization. Bucketization is a process of dividing the entire
range of a continuous feature into a set of consecutive bins/buckets, and then converting
the original numerical feature into a bucket ID (as a categorical feature) depending on
which bucket that value falls into. So, we can define a bucketized_column over age as:
age_buckets = tf.feature_column.bucketized_column(
age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
where boundaries is a list of bucket boundaries. In this case, there are 10 boundaries,
resulting in 11 age group buckets (age 17 and below, 18-24, 25-29, ..., up to 65 and over).
For example, with these boundaries an age of 33 falls into the [30, 35) bucket.
Using each base feature column separately may not be enough to explain the data. For
example, the correlation between education and the label (earning > 50,000 dollars) may
be different for different occupations. Therefore, if we only learn a single model weight
for education="Bachelors" and education="Masters", we won't be able to capture every single
education-occupation combination (e.g. distinguishing between education="Bachelors" AND
occupation="Exec-managerial" and education="Bachelors" AND occupation="Craft-repair"). To learn the
differences between different feature combinations, we can add crossed feature
columns to the model.
education_x_occupation = tf.feature_column.crossed_column(
['education', 'occupation'], hash_bucket_size=1000)
We can also create a CrossedColumn over more than two columns. Each constituent column
can be either a base feature column that is categorical (SparseColumn), a bucketized real-
valued feature column (BucketizedColumn), or even another CrossedColumn. Here's an example:
age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
[age_buckets, 'education', 'occupation'], hash_bucket_size=1000)
Defining The Logistic Regression Model
After processing the input data and defining all the feature columns, we're now ready to
put them all together and build a Logistic Regression model. In the previous section we've
seen several types of base and derived feature columns, including:
CategoricalColumn
NumericColumn
BucketizedColumn
CrossedColumn
All of these are subclasses of the abstract FeatureColumn class, and can be added to
the feature_columns field of a model:
base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]
crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
]
model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns)
The model also automatically learns a bias term, which controls the prediction one would
make without observing any features (see the section "How Logistic Regression Works" for
more explanations). The learned model files will be stored in model_dir.
After adding all the features to the model, now let's look at how to actually train the model.
Training a model is just a single command using the tf.estimator API:
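A minimal sketch of that call, reusing the input_fn defined earlier; train_data, train_epochs,
and batch_size are hypothetical variables holding the training file path and the chosen
hyperparameters:
model.train(input_fn=lambda: input_fn(
    train_data, train_epochs, True, batch_size))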
After the model is trained, we can evaluate how good our model is at predicting the labels
of the holdout data:
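A matching evaluation sketch, with test_data as a hypothetical path to the holdout file;
evaluate returns a dict of metrics that includes accuracy:
results = model.evaluate(input_fn=lambda: input_fn(
    test_data, 1, False, batch_size))
for key in sorted(results):
    print('%s: %s' % (key, results[key]))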
The first line of the final output should be something like accuracy: 0.83557522, which means
the accuracy is 83.6%. Feel free to try more features and transformations and see if you can
do even better!
After the model is evaluated, we can use the model to predict whether an individual has an
annual income of over 50,000 dollars given an individual's information input.
The model's prediction output will look like [b'1'] or [b'0'], indicating whether or not the
corresponding individual has an annual income of over 50,000 dollars.
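A sketch of such a prediction call under the same assumptions as above; for a LinearClassifier,
predict yields one dict per example, and its 'classes' entry holds the predicted label:
pred_iter = model.predict(input_fn=lambda: input_fn(
    test_data, 1, False, batch_size))
for pred in pred_iter:
    print(pred['classes'])  # e.g. [b'1'] or [b'0']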
If you'd like to see a working end-to-end example, you can download our example code and
set the model_type flag to wide.
In the Linear Model library, you can add L1 and L2 regularization to the model as follows:
model = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=1.0,
        l2_regularization_strength=1.0))
In practice, you should try various combinations of L1 and L2 regularization strengths and
find the parameters that best control overfitting while giving you a desirable model size.
Model training is an optimization problem: The goal is to find a set of model weights (i.e.
model parameters) to minimize a loss function defined over the training data, such as
logistic loss for Logistic Regression models. The loss function measures the discrepancy
between the ground-truth label and the model's prediction. If the prediction is very close to
the ground-truth label, the loss value will be low; if the prediction is very far from the label,
then the loss value will be high.
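As a concrete illustration (this is the standard formula, not code from the script above), the
logistic loss for one example with label y in {0, 1} and predicted probability p is
-[y*log(p) + (1-y)*log(1-p)]; a tiny sketch:
import math

def logistic_loss(y, p):
    # Cross-entropy loss for one example: small when p is close to y.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(logistic_loss(1, 0.9))  # close prediction -> low loss (about 0.11)
print(logistic_loss(1, 0.1))  # far prediction   -> high loss (about 2.30)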
-------o------
Note: This book will be updated as the technology changes; updates will be available in
the next version.