
FUNDAMENTALS OF DATA SCIENCE

What is Data Science?


• Data Science is the science of analysing raw data using statistics and machine learning
techniques with the purpose of drawing insights from the data.
• Data Science is used in many industries to allow them to make better business
decisions, and in the sciences to test models or theories.
• This requires a process of inspecting, cleaning, transforming, modelling, analyzing, and
interpreting raw data.
Important Disciplines Under Data Science

The Constant Evolution of the Data Science Industry: Statistician, Data Mining Specialist, Predictive Analytics Specialist, Data Scientist.
Data science is mainly needed for:

Better decision making (Whether A or B?)
Predictive analysis (What will happen next?)
Pattern discovery (Is there any hidden information in the data?)

So data science is about:

Asking the right questions and exploring the data
Modelling the data using various algorithms
Finally, communicating and visualising the results
Examples:
1) Self-driving cars
2) Airlines
3) Logistics companies like FedEx
The Various Data Science Disciplines
Not everyone in the field of Data Science is a Data Scientist!

Data Engineer
Data Engineers are software engineers who handle the design, building, integration and management of data from various data sources.

Big Data Engineer
This set of engineers handles the data warehousing process by running the Extract-Transform-Load (ETL) procedure on data; they are also known as Big Data Engineers. "Big data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity" ~ Gartner.

Data Analyst
A Data Analyst is someone who processes and runs statistical analysis on data to discover possible patterns and trends, and also appropriately communicates the insights obtained for proper understanding. Data Analysts are sometimes called "Junior Data Scientists" or "Data Scientists in Training".

Machine Learning Engineer
A Machine Learning (ML) Engineer is a software engineer who specializes in making data products work in production. They are involved in software architecture and design; they understand and carry out practices like A/B testing (a user experience research methodology).

Data Visualization Engineer
This is someone who tells visually stunning stories with data, creating dynamic data visualizations to help businesses and customers make meaningful decisions in an interactive format. They collaborate with Data Analysts and Data Scientists to build visualizations which effectively communicate the insights obtained from data to the business.

Data Scientist
A Data Scientist is an analytical data expert who has the technical skills to solve complex problems and the curiosity to explore what problems need to be solved. Data Scientists apply statistics, machine learning and analytical approaches to solve critical business problems. A Data Scientist can be seen as a mathematician, a statistician, a computer programmer and an analyst in one, equipped with a diverse and wide-ranging skill set, balancing knowledge of different programming languages with advanced experience in data mining and visualization.
Difference between Analysis and Analytics

Analysis
Consider you have a huge data set containing data of various types. Instead of tackling the entire dataset and running the risk of becoming overwhelmed, you separate it into easier-to-digest chunks, study them individually and examine how they relate to the other parts. That's analysis.
One important thing to remember, however, is that you perform analysis on things that have already happened in the past, such as using an analysis to explain how a story ended the way it did, or why there was a decrease in sales last summer.
All this means that we do analysis to explain how and/or why something happened.

Analytics
Analytics generally refers to the future instead of explaining past events; it explores potential future ones.
Analytics is essentially the application of logical and computational reasoning to the component parts obtained in an analysis, and in doing this you are looking for patterns and exploring what you can do with them in the future.
Here analytics branches off into two areas:
Qualitative analytics - using your intuition and experience in conjunction with the analysis to plan your next business move.
Quantitative analytics - applying formulas and algorithms to the numbers you have gathered from your analysis.
Business Analytics, Data Analytics, Data Science: An Introduction
❑ Qualitative analytics
- This includes working with tools that help predict future behaviour; therefore it must be placed on the right (future) side of the timeline.
- In essence, what we have now is qualitative analytics, which belongs to the area of business analytics.
❑ Sales Forecasting
- Sales forecasting, though, is a future-oriented activity, so we can move it to the right of the black line, but not too far.
- It must still belong to the sphere of business.
- So it must sit in the area where business analytics and data intersect.
❑ Data Science
- The most sparkly of them
all is data science.
- Data science is a discipline
reliant on data availability
while business analytics
does not completely rely on
data.
- However, data science
incorporates part of data
analytics mostly the part
that uses complex
mathematical statistical and
programming tools.
- Consequently, this green
rectangle representing data
science on our diagram will
not overlap with data
analytics completely but it
will reach a point beyond
the area of business
analytics.
- An example of a discipline that belongs to
the field of data science and is considered
data analytics but not business analytics is
the oil and gas industry and the
optimization of drilling operations (It
aims to optimize weight on bit, bit rotation
for obtaining maximum drilling rate as
well as minimizing drilling cost).
- This is a perfect fit for this sub area data
science can be used to improve the
accuracy of predictions based on data
extracted from various activities typical for
drilling efficiency.
- Something that involves data analytics but neither data science nor business analytics is digital signal processing.
- A digital signal is used to represent data in the form of discrete values, which is an example of numeric data.
- Therefore data analytics can be applied to digital signals in order to produce a higher-quality signal, and that's what digital signal processing is all about.
- Business intelligence, or BI, is the process of analysing and reporting historical business data.
- After reports and dashboards have been prepared, they can be used to make informed strategic and tactical business decisions by end users such as the general manager.
- Business intelligence aims to explain past
events using business data.
- It must go on the left of the timeline as it deals
only with past events and it must sit within the
data science rectangle as a subfield business
intelligence fits comfortably within data science
because it is the preliminary step of predictive
analytics.
- First, you must analyse past data and extract useful insights; these inferences will allow you to create appropriate models that can predict the future of your business accurately.
- Reporting and creating dashboards are precisely what business intelligence is all about, so we will neatly place these two into the orange rectangle.
❑ Machine Learning
- The ability of machines to predict outcomes without being explicitly programmed to do so is regarded as machine learning.
- Expanding on this, it is about creating and implementing algorithms that let machines receive data and use this data to make predictions, analyse patterns and give recommendations on their own.
- Machine learning cannot be implemented without data, hence it should stay within data analytics completely.
❑ Artificial Intelligence
- By definition, artificial intelligence is about simulating human knowledge and decision making with computers.
- Artificial intelligence is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals.
- We as humans have only managed to reach AI through machine learning, the discipline we just talked about, and as data scientists we are interested in how tools from machine learning can help us improve the accuracy of our estimations.
- AI as such is beyond our expertise.
- The client retention(process of engaging
existing customers to continue buying
products or services from your business)
and acquisition(process of gaining new
customers) are two typical business
activities where machine learning is
involved. It helps develop models that
predict what a client's next purchase would
be.
- For example, since data analytics and data science are applied in client retention and acquisition as well, we can leave this term right over here on the diagram.
- ML can be applied to fraud prevention as another example: we can feed a machine learning algorithm with prior fraudulent-activity data, and it will find patterns which the human brain is incapable of seeing.
- Having a model which can detect such transactions or operations in real time has helped the financial system prevent a huge amount of fraudulent activity.
- When talking about AI and ML, speech and image recognition are usually among the most popular examples, as they are already being implemented in products like Siri, Cortana, Google's Assistant and, more impressively, self-driving cars.
- Finally, an example that is considered artificial intelligence but not machine learning is symbolic reasoning.
- It is based on high-level, human-readable representations of problems in logic.
A Breakdown of Data Science

The step-by-step comparison between the terms and buzzwords related to each discipline:

DATA
- Data is defined as information stored in a digital format, which can then be used as a base for performing analyses and decision making.
- There are two types of data: traditional data and big data. Dealing with data is the first step when solving business problems or researching.
- Traditional data is data in the form of tables containing numeric or text values; data that is structured and stored in databases which can be managed from one computer.

BIG DATA
- Big data is a term reserved for extremely large data, and it is not just humongous in terms of volume. This data could be in various formats: structured, semi-structured or unstructured. Big data is just that: big.
- You will also often see it characterized by words starting with the letter V. These may include the vision you have about big data, the value big data carries, the visualization tools you use, or the variability and consistency of big data, and so on.
- However, the following are probably the most important criteria:
- Volume: big data needs a whopping amount of memory space, typically distributed across many computers. Its size is measured in terabytes, petabytes, and even exabytes.
- Variety: here we are not talking just about numbers and text; big data often implies dealing with images, audio files, mobile data and others.
- Velocity: when working with big data, one's goal is to make extracting patterns from it as quickly as possible. The progress made in this area is remarkable; outputs from huge data sets can be retrieved in real time.
DATA SCIENCE
- Data science is a broad subject. It's an interdisciplinary field that combines statistical, mathematical, programming, problem-solving and data-management tools.
- We have divided data science into three segments: business intelligence, traditional methods and machine learning.

BUSINESS INTELLIGENCE
- Business intelligence is the discipline that includes the technology-driven tools involved in the process of analysing, understanding and reporting available past data. This results in reports or dashboards and helps you make informed strategic and tactical business decisions.
- You can extract insights and ideas about your business that will help it grow and give you an edge over your competitors, giving you added stability.
- Business intelligence means understanding:
  - How did your sales grow, and why did competitors lose market share?
  - Was there an increase in the price of your products, or did you sell a mix of more expensive products?
  - How did your profitability margins behave in the same time frame of a previous year?
  - Were there client accounts that were more profitable?
- This is what BI is all about: understanding past business performance in order to improve future performance.
- Once your BI reports and dashboards are completed and presented, it's time to apply one of the two types of predictive data science.
TRADITIONAL METHODS
- Traditional methods, according to our framework, are a set of methods derived mainly from statistics and adapted for business.
- There is no denying that these conventional data science tools are applicable today. They are perfect for forecasting future performance with great accuracy.
- Regression analysis, cluster analysis and factor analysis are all prime examples of traditional methods.

MACHINE LEARNING
- The last column is machine learning, in contrast to traditional methods.
- Here the responsibility is left to the machine: through mathematics, a significant amount of computing power and AI, the machine is given the ability to predict outcomes from data without being explicitly programmed to do so.
- ML is all about creating algorithms that let machines receive data, perform calculations and apply statistical analysis in order to make predictions with unprecedented accuracy.
The Benefits of Each Discipline
There are two types of data. Traditional and big data. Data driven decisions require well organized and relevant raw data stored in a
digital format which can be processed and transformed into meaningful and useful information. It is the material on which you base your
analysis. Without data, a decision maker wouldn't be able to test their decisions and ensure they have taken the right course of
action.
The data you have describes what happened in the past. It is the job of the business intelligence analyst to study the numbers and explain
where and why some things went well and others not so well. Having the business context in mind the business intelligence analyst will
present the data in the form of reports and dashboards.
What else is needed once the patterns have been interpreted. You can forecast potential future outcomes. The application of any term
related to the columns traditional methods or machine learning can be said to belong to the field of predictive analytics.
There is a difference between the two. Traditional methods relate to traditional data. They were designed prior to the existence of big
data where the technology simply wasn't as advanced as it is today. They involve applying statistical approaches to create predictive
models.
If you want to dig deeper however or tackle huge amounts of big data utilizing unconventional methods or AI then you can predict
behaviour in unprecedented ways using machine learning techniques and tools. Both techniques are useful for different purposes.
Traditional methods are better suited for traditional data while machine learning will have better results when it comes to tackling big data.
Techniques for Working with Traditional Data
Data Collection: The gathering of raw data is referred to as data collection. An example would be the use of surveys asking people to rate how much they like or dislike a product or experience on a scale of 1 to 10.

Preprocessing: Preprocessing is a group of operations that convert your raw data into a format that is more understandable and hence useful for further processing. For example, if a customer has entered their age as 942 or their name as "UK", these entries are invalid and must be corrected before further processing.

Class Labelling: One technique is class labelling. This involves labelling the data point with the correct data type or arranging data by category. One such category is numerical: for example, the number of goods sold daily. These are numbers that can be manipulated, such as computing the average number of goods sold per day or month. The other label is categorical: here you are dealing with information that cannot be manipulated mathematically, for example a person's profession or place of birth.

Data Cleansing: The goal of data cleansing is to deal with inconsistent data. This can come in various forms. Say you are provided with a data set containing the US states and a quarter of the names are misspelled; in this situation certain techniques must be applied to correct these mistakes.

Missing Values: Missing values are another thing you will have to deal with. Data cleansing and dealing with missing values are problems that must be solved before you can process the data further.

Case-Specific Techniques: Shuffling and balancing the data set.

Visualisation: Traditional data stored in a database is typically visualised through an ER (Entity-Relationship) diagram or a relational schema.
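The cleaning steps above can be sketched in Python with pandas. The small table below is a made-up illustration (the invalid age of 942 and the name entered as "UK" follow the examples mentioned earlier); it is not data from the course.

import pandas as pd

# Hypothetical raw customer data, mirroring the invalid entries mentioned above
raw = pd.DataFrame({
    "name": ["Alice", "UK", "Bob", None],
    "age": [34, 29, 942, 41],
    "state": ["Texas", "Texsa", "Ohio", "Ohio"],   # one misspelled state name
})

# Data cleansing: keep only plausible ages, fix a known misspelling
raw["age"] = raw["age"].where(raw["age"].between(0, 120))
raw["state"] = raw["state"].replace({"Texsa": "Texas"})

# Class labelling: mark the state column as categorical (no maths allowed on it)
raw["state"] = raw["state"].astype("category")

# Missing values: drop rows with an unusable name, fill missing ages with the median
clean = raw[raw["name"].notna() & (raw["name"] != "UK")].copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)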
Real Life Examples of Traditional Data

Consider basic customer data as an example of the difference between a numerical and a categorical variable.

• The first column shows the IDs of the different customers. These numbers, however, cannot be meaningfully manipulated: calculating an average ID would not give you any useful information. This means that even though they are numbers, they hold no numerical value and therefore represent categorical data.
• Now focus on the last column, which shows how many times each customer has filed a complaint. These numbers are easily manipulated: adding them all together to give a total number of complaints is useful information. Therefore they are numerical data.

Another example we can look at is daily historical stock price data


There's a column containing the dates of the observations which is considered categorical data
and a column containing the stock prices which is numerical data.
Techniques for Working with Big Data
As with traditional data, preprocessing can also be implemented on big data; it is essential to help organize the data before doing analyses or making predictions, as is grouping the data into classes or categories.
While working with big data, things can get a little more complex, as you have much more variety beyond the simple distinction of numerical and categorical data. Examples of big data can be text data, digital image data, digital video data, digital audio data and more.
Consequently, with a larger number of data types comes a wider range of data-cleansing methods. There are techniques that verify that a digital image observation is ready for processing, and specific approaches exist that can ensure the audio quality of your file is adequate to proceed.
So what about dealing with missing values? This step is a crucial one, as big data can contain a large number of missing values, which is a big problem. To exemplify, consider the following techniques.

Text data mining represents the process of deriving valuable information from unstructured text.
Consider a database which stores information about marketing expenditure. It may contain information from academic papers, blogs, articles, online platforms, private Excel files and more.
This means you will need to extract marketing expenditure information from many sources. Text mining can find the information you need without much of a problem.

Data masking: if you want to maintain a credible business or governmental activity, you must preserve confidential information. When personal information is shared online, that doesn't mean it can't be touched or used for analysis; instead, you must apply data masking techniques, such as data shuffling, so you can analyse the information without compromising private details.
Masking can be quite complex: it conceals the original data with random and false data, allowing you to conduct analysis while keeping all confidential information in a secure place. An example of applying data masking to big data is through what we call confidentiality-preserving data mining techniques.
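One simple masking idea mentioned above, data shuffling, can be sketched in Python with pandas; the column names and salary values below are hypothetical.

import pandas as pd

# Hypothetical data set with a confidential column
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "salary": [40000, 52000, 61000, 48000, 75000],
})

# Data shuffling: permute the confidential column so individual rows no longer
# reveal a real person's value, while aggregate statistics stay the same for analysis
df["salary_masked"] = df["salary"].sample(frac=1, random_state=42).to_numpy()

print(df[["customer_id", "salary_masked"]])
print("Mean before:", df["salary"].mean(), "Mean after:", df["salary_masked"].mean())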
Real Life Examples of Big Data

- Facebook keeps track of its users' names, personal data, photos, videos, recorded messages and so on. This means that their data has a lot of variety, and with over 2 billion users worldwide, the volume of data stored on their servers is tremendous.
- Facebook requires real time reporting of the aggregated anonymised voice of its users and
it applies many analytical tools for its mobile applications.
- This means the company is investing in boosting its real time data processing powers
increasing the velocity of its data set.

- Let's take financial trading data, for example: what happens when we record the stock price every five seconds, or every single second?
- We get a data set that is incredibly voluminous, requiring significantly more memory and disk space, and various techniques to extract meaningful information from it. Data like this would also be considered big data.
Business Intelligence (BI) Techniques
Let's assume your data has been pre-processed and is ready for
analysis. It is beautifully organized. This means you are ready to
enter the realm of business intelligence.
The job of a business intelligence analyst requires her to understand
the essence of a business and strengthen that business through the
power of data.
So here we have techniques to measure business performance.
Observation: From the diagram above, you can observe variables such as sales volume (marked as blue dots) or new customers who have enrolled on your web site (marked as red dots). Each monthly revenue figure, or each new customer, is considered a single observation. However, no mathematical manipulations have been applied to these observations yet; what we must do is quantify that information.

Quantification: Quantification is the process of representing observations as numbers. Consider that your revenues from new customers for January, February and March were $100, $120 and $130 respectively, while the corresponding numbers of new customers for the same three months are 10, 15 and 25.

Measure: A measure is the accumulation of observations to show some information. For example, if you total the revenues of all three months, you obtain $350: a measure of the revenue for the first quarter of that year. Similarly, add together the number of new customers for the same period and you have another measure (50 new customers).

Metric: A metric refers to a value that derives from the measures you obtain and aims at gauging business performance or progress, so that comparisons can be made. Where a measure is something like a simple descriptive statistic of past performance, a metric has a business meaning attached. For example, if you estimate the average quarterly revenue per new customer, which equals 350 divided by 50, that is $7, this is a metric.

KPI: In a real business, where the number of observations is significantly larger, you can derive thousands of metrics, and you cannot keep track of every possible metric that can be extracted from a data set. What you need to do is choose the metrics that are tightly aligned with your business objectives. These metrics are called KPIs, Key Performance Indicators: Key, because they are related to your main business goals; Performance, because they show how successfully you have performed within a specified time frame; and Indicators, because their values indicate something related to your business performance.
Real Life Examples of Business Intelligence

- BI can be used for price optimization. Hotels use price optimization very effectively by raising the price of a room during periods when many people want to visit the hotel, and by reducing it to attract visitors when demand is low; in this way they can greatly increase their profits.
- In order to competently apply such a strategy, they must extract the relevant information in real time and compare it with historical data. BI allows you to adjust your strategy to past data as soon as it becomes available.

- Another application of business intelligence is enhancing inventory management; both over- and undersupply can cause problems in a business.
- Implementing effective inventory management means supplying enough stock to meet demand with the minimal amount of waste and cost.
- To do this well, you can perform an in-depth analysis of past sales transactions for the purpose of identifying seasonality patterns and the times of the year with the highest sales.
- Additionally, you could track your inventory to identify the months in which you have over- or understocked. A detailed analysis can even pinpoint the day, or time of day, when the need for a given good is highest.
- If done right, business intelligence will help you manage your shipment logistics efficiently and in turn reduce costs and increase profit.

- So, once the BI reports and dashboards are prepared and the executives have extracted insights about the business, what do you do with the information? You use it to predict some future values as accurately as possible. That's why at this stage you stop dealing with analysis and start applying analytics, more precisely predictive analytics.
- We separate predictive analytics into two branches: traditional methods, which comprise the classical statistical methods for forecasting, and machine learning.
Techniques for Working with Traditional Methods
1) Regression
- In business statistics, a regression is a model used for quantifying causal relationships among the different variables included in your analysis.
- In this dataset we have house prices in dollars in one column, while the other column holds house sizes measured in square feet.
- Every row in the data table is an observation, and each can be plotted on the graph as a dot: the house size is measured along the horizontal axis and its price along the vertical axis. The further to the right an observation is, the larger the house; the further up, the higher the price. Once we have plotted all 20 observations from our dataset, the graph takes shape.
- The thing is, there is a straight line (red), called a regression line, that goes through these dots while being as close as it can be to all of them simultaneously.
- Now imagine we drew another line (green). If you observe, altogether the dots are closer to the first, red line than to the second, green one. This means the red line more accurately represents the distribution of the observations.

A) Linear Regression
- In this case, if y signifies the house price, then B represents a coefficient which we multiply by x, the house size:
y = Bx
- So the equation professionals work with is y equals B times x, and they use the graph as visual support.
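A minimal sketch of fitting such a line in Python; the house sizes and prices below are made-up values, not the 20 observations from the example.

import numpy as np

# Hypothetical observations: house size (sq ft) and price ($)
size = np.array([1000, 1500, 1800, 2200, 2600], dtype=float)
price = np.array([150000, 210000, 260000, 310000, 370000], dtype=float)

# Least-squares estimate of B in the model  price = B * size  (no intercept)
B = (size @ price) / (size @ size)

print("Estimated coefficient B:", round(B, 2))
print("Predicted price for a 2000 sq ft house:", round(B * 2000, 2))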
B) Nonlinear Regression – Logistic Regression

- A logistic regression is a common example of a nonlinear model.
- In this case, unlike the house-prices example, the values on the vertical axis won't be arbitrary values.
- They'll be ones or zeros only. Such a model is useful during a decision-making process.
- Companies apply logistic regression algorithms to filter job candidates during their screening process.
- If the algorithm estimates that the probability of a prospective candidate performing well in the company is above 50 percent, it will predict one, or a successful application.
- Otherwise it will predict zero. The nonlinear nature of the logistic regression is nicely summarized by its graph, which looks very different from that of the linear regression.
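A minimal sketch of the screening idea using scikit-learn's LogisticRegression; the candidate scores and labels below are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one screening-test score per candidate, and whether
# they later performed well (1) or not (0)
scores = np.array([[35], [42], [50], [58], [63], [70], [78], [85]])
performed_well = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(scores, performed_well)

# Predict 1 (successful application) when the estimated probability exceeds 50%
new_candidates = np.array([[45], [80]])
print(model.predict_proba(new_candidates)[:, 1])  # estimated probabilities
print(model.predict(new_candidates))              # 0/1 decisions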
2) Cluster Analysis

- Imagine the data points are derived from research on German house prices; hence they are dispersed differently.
- When the data is divided into a few groups called clusters
you can apply cluster analysis.
- This is another technique that will take into account that
certain observations exhibit similar house sizes and prices.
- For instance this cluster of observations denotes small
houses but with a high price. This could be typical for
houses in the city centre.
- The second cluster could represent houses that are far from
the city because they are quite big but cost less.
- Finally, the last cluster concerns houses that are probably
not in the city centre but are still in nice neighbourhoods.
They are big and cost a lot.
- Noticing that your data can be clustered is important so
you can improve your further analysis. In our example
clustering allowed us to conclude that location is a
significant factor when pricing a house.
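A minimal sketch of such a cluster analysis using k-means from scikit-learn; the house sizes and prices are hypothetical, and three clusters are chosen to mirror the example above. In practice the two features would usually be standardised first so that size and price contribute comparably to the distances.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical observations: (house size in sq ft, price in $1000s)
X = np.array([
    [ 600, 450], [ 650, 470], [ 700, 500],   # small but expensive (city centre)
    [2500, 250], [2700, 260], [2600, 240],   # big but cheap (far from the city)
    [2400, 600], [2600, 650], [2500, 620],   # big and expensive (nice neighbourhoods)
], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster label for each house:", labels)
print("Cluster centres (size, price):")
print(kmeans.cluster_centers_.round(1))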
3) Factor Analysis
What about a more complicated study, where you consider explanatory variables apart from house size? You might have quantified the location, number of rooms, year of construction and so on, which can all affect the house price. Then, when thinking about the mathematical expression corresponding to the regression model, you won't just have one explanatory variable x; you will have many: x1, x2, x3 and so on.
Note that an explanatory variable can also be called a regressor, an independent variable or a predictor variable.
Imagine analysing a survey that consists of 100 questions. Performing any analysis on 100 different variables is tough. This means you have variables starting from x1 and going all the way up to x100. The good thing is that often different questions measure the same issue, and this is where factor analysis comes in.
Assume your survey contained this question on a scale from 1 to 5. How much do you agree with the
following statements.
Survey:
1.I like animals
2. I care about animals.
3. I am against animal cruelty.
People are likely to respond consistently to these three questions.
That is, whoever marks five for the first question is likely to do the same for the second and third questions as well.
In other words, if you strongly agree with one of these three statements, you won't disagree with the other two, right?
With factor analysis we can combine all the questions into general attitude towards animals. So instead of
three variables we now have one in a similar manner.
You can reduce the dimensionality of the problem from 100 variables to 10 which can be used for a
regression that will deliver a more accurate prediction.
To sum up we can say that clustering is about grouping observations together and factor analysis is
about grouping explanatory variables together.
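A minimal sketch of this dimensionality reduction with scikit-learn's FactorAnalysis; the simulated survey answers are hypothetical and are generated so that the three "animal attitude" questions share one underlying factor.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical survey: 200 respondents, 3 questions driven by one latent attitude
attitude = rng.normal(size=200)                      # latent "attitude towards animals"
answers = np.column_stack([
    attitude + 0.3 * rng.normal(size=200),           # "I like animals"
    attitude + 0.3 * rng.normal(size=200),           # "I care about animals"
    attitude + 0.3 * rng.normal(size=200),           # "I am against animal cruelty"
])

fa = FactorAnalysis(n_components=1)
factor_scores = fa.fit_transform(answers)            # one variable instead of three

print("Loadings of each question on the factor:", fa.components_.round(2))
print("Shape before:", answers.shape, "after:", factor_scores.shape)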
4) Time Series Analysis

You will use this technique especially if you are working in economics or finance. In these fields you will have to follow the development of certain values over time, such as stock prices or sales volume. You can associate time series with plotting values against time.
Time will always be on the horizontal line as time is
independent of any other variable therefore such a graph can
end up depicting a few lines that illustrate the behaviour of your
stocks over time.
So when you study the visualization you can spot which stock
performed well and which did not. We must admit there is a vast
variety of methods that professionals can choose from.
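A minimal sketch of plotting values against time with pandas (and matplotlib); the stock prices below are hypothetical.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily closing prices for two stocks
dates = pd.date_range("2023-01-02", periods=5, freq="B")
prices = pd.DataFrame(
    {"Stock A": [101, 103, 102, 106, 108],
     "Stock B": [98, 97, 99, 95, 94]},
    index=dates,
)

# Time goes on the horizontal axis; each line shows one stock's behaviour over time
ax = prices.plot(title="Hypothetical stock prices over time")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
plt.show()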
Real Life Examples of Traditional Methods
- Imagine you are the head of the user experience (UX) department of a web site selling goods on a global scale.
- So what is your goal as head of UX? To maximize user satisfaction, right?
- Assume you have already designed and implemented a survey that measures the attitude of your customers towards the latest global product.
- Once you have launched the survey, the graph where you plot your observations will likely appear in the following way.
- When the data is concentrated in such a way, you should cluster the observations.
- Remember So when you perform cluster analysis you will find that each cluster represents a different continent.
- This group may refer to the responses gathered from Asia or Europe or South America or from North America.
- Once you realize there are four distinct groups it makes sense to run four separate tests. Obviously, the difference between clusters is too great
for us to make a general conclusion.
- Asians may enjoy using your web site one way while Europeans in another. Thus, it would be sensible to adjust your strategy for each of these
groups individually.
- Another noteworthy example is forecasting sales volume; every business and financial company does this.
- So, which traditional statistical technique that we discussed would fit the picture here? Time series analysis it is. Say this was your data until a certain date: what will happen next? How should you expect sales to behave for the year ahead? Will their volume increase or decrease?
- Several types of mathematical and statistical models allow you to run multiple simulations which can provide you with future scenarios. Based on these scenarios, you can make better predictions and implement adequate strategies. You are now acquainted with many of the essential data science terms, but not all.
Machine Learning (ML) Techniques
- The core of machine learning is creating an algorithm which a computer then uses to find a model that fits the data as well as possible and makes very accurate predictions based on it. How is that different from conventional methods? In machine learning we provide the algorithms that give the machine directions on how to learn on its own.
- A machine learning algorithm is like a trial-and-error process. Each consecutive trial is at least as good as the previous one.
- Technically speaking there are four ingredients data, model, objective function and optimization algorithm.
- Example. Imagine a robot holding a bow. We want to find the best way to use that bow to fire accurately. In other words, the usage of the bow is our model. The best way to learn archery is to train, right? We train by taking different arrows and trying to hit the target. So, the quiver of arrows will be our data, or more precisely the data that the robot will use for training.
- They are all arrows, but they have their subtleties: there are straight ones, crooked ones, light ones, heavy ones. So we can safely say the arrows represent different data values.
- We said the robot will be firing at a target. In machine learning or at least in the most common type supervised learning, we know what we are
aiming for and we call it a target.
- The objective function will calculate how far, on average, the robot's shots were from the target.
- Here comes the fourth ingredient the optimization algorithm.
- It steps on the findings of the objective function and consists of the mechanics that will improve the robot's archery skills somehow. It's posture
the way it holds the bow how strong it pulls the bowstring etc. Then the robot will take the exact same data or arrows and fire them once again
with its adjusted posture.
- This time the shots will, on average, be closer to the centre of the target. Normally the improvement will be almost unnoticeable. This entire process may be repeated hundreds or thousands of times until the robot finds the optimal way to fire this set of arrows and hit the centre every single time.
- Nevertheless, it is important to remember that while training you won't provide the robot with a set of rules; that is, you won't have programmed a set of instructions like "place the arrow in the middle of the bow", "pull the bowstring" and so on.
- Instead, you will have given the machine a final goal: to place the arrow in the centre of the target.
- So you don't care if it places the arrow in the middle or in the bottom of the bow as long as it hits the target.
- Another important thing is that it won't learn to shoot well right away but after a hundred thousand tries it may have learned how to be
the best archer out there.
- Now, there can be infinitely many possible trials, so when will the robot stop training? First, the robot will learn certain things along the way and will take them into consideration for the next shots it fires; for instance, if it learns that it must look towards the target, it will stop firing in the opposite direction.
- That is the purpose of the optimization algorithm. Second, it cannot fire arrows forever.
- However, hitting the centre nine out of 10 times may be good enough. So, we can choose to stop it after it reaches a certain level of accuracy or fires a certain number of arrows.
- So, let us follow the four ingredients at the end of the training. Our robot, or model, is already trained on this data: with this set of arrows most shots hit the centre, so the error, i.e. the objective function, is quite low, or minimized as we like to say; the posture, the technique and all other factors cannot be improved further.
- So, the optimization algorithm has done its best to improve the shooting ability of the machine.
- We now own a robot that is an amazing archer. So, what can we do? Give it a different bag of arrows. If it has seen most types of arrows while training, it will do great with the new ones.
- However, if we give it half an arrow, or a longer arrow than it has seen, it will not know what to do with it.
- In all ordinary cases, though, we would expect the robot to hit the centre or at least get close.
- The benefit of using machine learning is that the robot can learn to fire more effectively than a human. It might even discover that we've been holding bows the wrong way for centuries. To conclude, we must say that machine learning is not really about robots; the robot archer is simply an illustration.
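To make the four ingredients concrete, here is a minimal sketch of a training loop in Python: the data is made up, the model is a one-parameter line, the objective function is the mean squared error, and the optimization algorithm is plain gradient descent.

import numpy as np

# 1) Data (the "arrows"): inputs x and the targets we are aiming for
x = np.array([1.0, 2.0, 3.0, 4.0])
target = np.array([2.1, 3.9, 6.2, 7.8])   # roughly 2 * x

# 2) Model: prediction = w * x, with a single trainable parameter w
w = 0.0

learning_rate = 0.01
for step in range(1000):
    prediction = w * x

    # 3) Objective function: how far, on average, the "shots" are from the target
    error = prediction - target
    loss = np.mean(error ** 2)

    # 4) Optimization algorithm: adjust the "posture" (parameter) to reduce the loss
    gradient = np.mean(2 * error * x)
    w -= learning_rate * gradient

print("Learned w:", round(w, 3), "final loss:", round(loss, 5))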
Types of Machine Learning

1) Supervised Machine Learning
• This name derives from the fact that training an algorithm resembles a teacher supervising her students.
• In supervised machine learning, it is important to mention, you are dealing with labelled data. In other words, you can assess the accuracy of each shot.
• Consider the previous example, but now there isn't a single target: different arrows have their own targets.
• Let's check what the robot sees when shooting: the ground, a target at a short distance, a target at a further distance, a target hanging on a tree far behind it, a house to the side, and the sky.
• So, having labelled data means associating (labelling) a target with each type of arrow.
• You know that with a small arrow the robot is supposed to hit the closest target; with a medium arrow it can reach the target located further away; while with a larger arrow, the target that's hanging on the tree. Finally, a crooked arrow is expected to hit the ground, not reaching any target.
• During the training process the robot will be shooting arrows at the respective targets as well as it can.
• After training is finished, ideally the robot will be able to fire the small arrow at the centre of the closest target, the middle arrow at the centre of the one further away, and so on.
• To summarize, labelled data means we know the target prior to the shot, and we can associate each shot with a target. This way, we're sure where the arrow should hit.
• This allows us to measure the inaccuracy of the shot through the objective function and improve the way the robot shoots through the optimization algorithm. So, what we supervise is the training itself: if a shot is far off from its target, we correct the posture; otherwise, we don't.
2) Unsupervised Machine learning
• In practice though it might happen that you won't have the time or the resources to
associate the arrows with targets before giving them to the robot.
• In that case you could apply the other major type of ML: unsupervised learning. Here you will just give your robot a bag of arrows with unknown physical properties, i.e. unlabelled data. This means neither you nor the robot will have separated the arrows into groups.
• Then you'd ask the machine to simply fire in a direction without providing it with targets. Therefore, in this case you won't be looking for a model that helps you shoot better; rather, you'll be looking for one which divides the arrows in a certain way.
• The robot will see just the ground the tree the House and the sky. Remember there are
no targets. So, after firing thousands of shots during the training process we will end up
having different types of arrows stuck in different areas.
• For instance, you may identify all the broken arrows by noticing they have fallen on the
ground nearby the others you may realise are divided into small medium and large
arrows.
• There may be anomalies like crossbow bolts in your bag that after being shot may have
accumulated in a pile over here.
• You wouldn't want to use them with a simple bow, would you? At the end of the training the robot will have fired so many times that it could discover answers that may surprise you.
• The machine may have managed to split the arrows not into four but into five size categories, due to discovering the crossbow bolts. Or it may have identified that some arrows are going to break soon by placing them in the broken-arrow pile.
• It is worth mentioning that supervised learning can deal with such problems too, and it does very often. However, if you have one million arrows, you don't really have the time to assign targets to all of them, do you?
• To save time and resources you should apply unsupervised learning.
3) Reinforcement Learning
• The third major type of machine learning is called reinforcement learning. This time we introduce a reward system.
• Every time the robot fires an arrow better than before, it will receive a reward, say a chocolate; it will receive nothing if it fires worse.
• So instead of minimizing an error we are maximizing a reward, or in other words maximizing the objective function.
• If you put yourself in the shoes of the machine, you'll be reasoning in the following way: "I fire an arrow and receive a reward. I'll try to figure out what I did correctly, so I get more chocolate with the next shot." Or: "I fire an arrow and don't receive a reward. There must be something I need to improve for me to get some chocolate on my next shot." That is positive reinforcement.

• In addition, don't forget the robot Archer was an abstract depiction of what a machine learning model can do.
• In reality there are robots, but the model will be a highly complex mathematical formula the arrows will be a data set and the goals will be various and
quantifiable
• Here are the most notable approaches you will encounter when talking about machine learning support vector machines neural networks deep learning
random forced models and Bazy and networks are all types of supervised learning.
• There are neural networks that can be applied to an unsupervised type of machine learning, but K means is the most common unsupervised approach.
• By the way you may have noticed we have placed deep learning in both categories.
• This is a relatively new revolutionary computational approach which is acclaimed as the State-of-the-art email today.
• Describing it briefly we can say it is fundamentally different from the other approaches.
• However, it has a broad practical scope of application in all M-L areas because of the extremely high accuracy of its models.
• Note that deep learning is still divided and supervised, unsupervised and reinforcement, so it solves the same problems but in a conceptually different way.
Real Life Examples of Machine Learning (ML)
The financial sector and banks have ginormous data sets of credit card transactions.
Unfortunately, banks are facing issues with fraud daily. They are tasked with preventing fraudsters from
acquiring customer data and in order to keep customers funds safe they use machine learning algorithms.
They take past data and because they can tell the computer which transactions in their history were legitimate
and which were found to be fraudulent, they can label the data as such.
So, through supervised learning, they train models that detect fraudulent activity. When these models detect even the slightest probability of theft, they flag the transactions and prevent the fraud in real time, although no one in the sector has reached a perfect solution yet.

Another example of using supervised machine learning with labelled data can be found in client retention.
A focus of any business, be it a global supermarket chain or an online clothing shop, is to retain its customers.
But the larger a business grows, the harder it is to keep track of customer trends. A local corner shop owner will recognize and get to know their most loyal customers. They will offer them exclusive discounts to thank them for their custom, and by doing so keep them returning.
On a larger scale, companies can use machine learning and past labelled data to automate this practice.
With this they can predict which customers may purchase goods from them. This means the store can offer discounts and a personal touch in an efficient way, minimizing marketing costs and maximizing profits.
Popular Data Science Tools
STATISTICS
• Statistics is the discipline that
concerns the collection,
organization, analysis,
interpretation, and
presentation of data.

• The practice or science of collecting and analysing
numerical data in large
quantities, especially for the
purpose of inferring
proportions in a whole from
those in a representative
sample.
Probability
- Life is filled with uncertain events and often we must consider the
possible outcomes before deciding.

- We ask ourselves questions like "What is the chance of success?" and "What is the probability that we fail?" to determine whether the risk is worth taking.

- By using probability and statistical data, decision makers can predict how likely each outcome is and make the right call for their firm.

- The probability is the chance of something happening. A more academic definition for this would be the likelihood of an event occurring.

- An event is a specific outcome or a combination of several outcomes.

- Probability values range between 0 and 1: a probability of 1 expresses absolute certainty of the event occurring, and a probability of 0 expresses absolute certainty of the event not occurring.

Examples:

Example 1 (flipping a coin):
- Consider event A: flipping a coin and getting heads. In this case heads is our only preferred outcome.
- Assuming the coin doesn't somehow stay in the air indefinitely, there are only two possible outcomes: heads or tails.
- This means that our probability is one half, so we write:
P(A) = 1/2 = 0.5

Example 2 (rolling a die):
- Imagine we have a standard six-sided die and we want to roll a four.
- We have a single preferred outcome (getting a 4), but this time the total number of possible outcomes is 6.
- Therefore the probability of this event is:
P(A) = 1/6 = 0.167

Q. What if we wanted to roll the die and get a number divisible by three?
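(There are two favourable outcomes, 3 and 6, out of six possible outcomes, so P(A) = 2/6 = 1/3.)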


- Note that the probability of two independent events occurring at the same time is equal to the product of all the probabilities of the
individual events.

P(A and B) = P(A).P(B)

- For instance, the likelihood of getting the Ace of Spades equals the probability of getting an ace times the probability of getting a
spade.
P(Ace ♠)=P(Ace) .P(♠)
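With a standard 52-card deck this gives:
P(Ace ♠) = P(Ace) × P(♠) = (4/52) × (13/52) = 1/52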

Problems:
1. Alice has 2 kids and one of them is a girl. What is the probability that the other child is also a girl? You can assume
that there are an equal number of males and females in the world.
2) A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not getting 4 on the second
roll?

3) Cross-fertilizing a red and a white flower produces red flowers 25% of the time. Now we cross-fertilize five pairs of
red and white flowers and produce five offspring. What is the probability that there are no red flower plants in the five
offspring?
Computing Expected Values
• Expected values represent what we expect the outcome to be if
we run an experiment many times to fully grasp the concept.
• So first, what is an experiment? Imagine we don't know the probability of getting heads when flipping a coin and we are going to try to estimate it ourselves. We toss a coin several times: after doing one flip and recording the outcome, we complete a trial. By completing multiple trials, we are conducting an experiment.
• For example, if we toss a coin 20 times and record the 20
outcomes that entire process is a single experiment with 20
trials.
• The probabilities we get after conducting experiments are
called experimental probabilities.
• Generally, when we are uncertain what the true probabilities are or how to compute them, we like conducting experiments. The experimental probabilities we get are not always equal to the theoretical ones, but they are a good approximation.
• The formula we use to calculate experimental probabilities is similar to the formula applied for the theoretical ones: it is simply the number of successful trials divided by the total number of trials.
• Now that we know what an experiment is, we can define the expected value.
• The expected value of an event A denoted as E(A) is the
outcome we expect to occur when we run an experiment.
• To calculate Expected value we take the value for every element in the sample space and multiply it by its probability. Then we add all of those up
to get the expected value.
• For example, if our random variable were the number obtained by rolling 6-sided die, the expected value would be
E(x) = 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6)
= 3.5
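This calculation can be checked with a few lines of Python:

# Expected value: sum of (value * probability) over the sample space
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6] * 6

expected_value = sum(v * p for v, p in zip(outcomes, probabilities))
print(expected_value)   # 3.5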

• You are trying to hit a target with a bow and arrow the target has three
layers. The outermost one is worth 10 points, the second one is worth
20 points and the innermost is worth 100. You have practiced enough
to always be able to hit the target but not so much that you hit the centre
every time.
• Suppose the probability of hitting each layer is as follows: 0.5 for the outermost, 0.4 for the second and 0.1 for the centre.
• The expected value for this example would be,
E(X) = 0.5*10 + 0.4*20 + 0.1*100 = 23

- But we can never get 23 points with a single shot. So why is it important to know what the expected value of an event is?
We can use expected values to make predictions about the future, based on past data.
- We frequently make predictions using intervals instead of specific values due to the uncertainty the future brings.
Example 1:
A local club plans to invest $10000 to host a baseball game. They expect to sell tickets worth $15000 . But if it rains on the
day of game, they won't sell any tickets and the club will lose all the money invested. If the weather forecast for the day of
game is 20% possibility of rain, is this a good investment?

Example 2:
A company makes electronic gadgets. One out of every 50 gadgets is faulty, but the company doesn't know which ones are
faulty until a buyer complains. Suppose the company makes a $3 profit on the sale of any working gadget, but suffers a loss
of $80 for every faulty gadget because they have to repair the unit. Check whether the company can expect a profit in the
long term. Write the probability distribution.
Example 3:
Ahmed is playing a lottery game where he must pick 2 numbers from 0 to 9, followed by an English letter (from the 26-letter alphabet). He may choose the same number both times.
If his ticket matches the 2 numbers and 1 letter drawn in order, he wins the grand prize and receives $10405. If just his
letter matches but one or both of the numbers do not match, he wins $100. Under any other circumstance, he wins
nothing. The game costs him $5 to play. Suppose he has chosen 04R to play. What is the expected net profit from playing
this ticket?
Frequency
- Sometimes the result of the expected value is confusing or doesn't tell us much.
- Consider an example: throwing two standard six-sided dice and adding up the numbers on top.
- We have six options for what the result of the first one could be, and regardless of that number, we still have six different possibilities for what we can roll on the second die. That gives us a total of 6 * 6 = 36 different outcomes for the two rolls.

- We can write out the results in a six-by-six table where we record the sum of the two dice.

- You can clearly see that we have repeating entries along the secondary diagonal and all diagonals parallel to it.

- Notice how 7 occurs 6 times in the table. This means we have six favourable outcomes for a sum of 7. There are 36 possible outcomes in total, so the chance of getting a seven is:
P(7) = 6/36 = 1/6
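A short Python sketch that enumerates all 36 outcomes, counts the frequency of each sum and divides by the size of the sample space; this is exactly the probability frequency distribution discussed in the next section.

from collections import Counter
from fractions import Fraction

# Enumerate all 36 outcomes of rolling two six-sided dice and record the sums
sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))

# Divide each frequency by the size of the sample space (36) to get probabilities
for total in sorted(sums):
    probability = Fraction(sums[total], 36)
    print(total, sums[total], probability)

# e.g. the sum 7 occurs 6 times, so P(7) = 6/36 = 1/6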
Probability Frequency Distribution
• A probability frequency distribution
is a collection of the probabilities for
each possible outcome.
• If we write out all the outcomes in
ascending order and the frequency of each
one we construct a frequency distribution
table. By examining this table we can
easily see how the frequency changes with
the results.
• We need to transform the frequency of
each outcome into a probability. Knowing
the size of the sample space we can
determine the true probabilities for each
outcome. We simply divide the frequency
for each possible outcome by the size of
the sample space.
• A collection of all the probabilities for the
various outcomes is called a probability
frequency distribution.
• We can express this probability
frequency distribution through a table or
a graph.
• On the graph we see the probability frequency distribution: the x-axis depicts the different possible sums we can get, and the y-axis represents the probability of getting each outcome.
• When making predictions, we generally want our interval to have the highest probability.
• We can see that the individual outcomes with the highest probability are the ones with the highest bars in the graph; usually the highest bars form around the expected value.
• Thus the values around the expected value are also the values with the highest probability.
Events and Their Complements
• A complement of an event is everything the event is not.
• If we add the probabilities of different events, we get their sum of probabilities. If we add up the probabilities of all the possible outcomes of an event, we should always get 1.
• Example:
P(head) + P(tail) =1

• All events have complements, and we denote them by adding an apostrophe: for example, the complement of the event A is denoted as A'.
• It is also worth noting that the complement of a complement is the event itself, so A double-prime equals A:
(A')' = A
- Example: if you were rolling a standard six-sided die and wanted to roll an even number, the opposite of that would be not rolling an even number, which is the same as wanting to roll an odd number.
- Complements are often used when the event we want to occur is satisfied by many outcomes.
- For example, say you want to know the probability of rolling a one, two, four, five or six. That is the same as the probability of not rolling a three.
- We already said that the sum of the probabilities of all possible outcomes equals one, so you can probably guess how we calculate complements: the probability of the complement equals 1 minus the probability of the event itself.
P(A) + P(B) + P(C) = 1
A' = B + C
P(A') = 1 - P(A)
- The sum of the probabilities of getting 1, 2, 4, 5 or 6 is equal to the sum of the separate probabilities; the likelihood of each outcome is one sixth, so their probabilities add up to five sixths:
P(A) = P(1) + P(2) + P(4) + P(5) + P(6)
     = 1/6 + 1/6 + 1/6 + 1/6 + 1/6
     = 5/6
- Another way of describing "getting one, two, four, five or six" is "not getting a three". Not getting a three is the complement of A here, so the two probabilities should add up to 1:
P(A') = 1 - 5/6 = 1/6
- Equivalently, the probability of not getting a three equals one minus the probability of getting a three. We know that P(3) equals one sixth, so the probability of not getting a three is 1 - 1/6 = 5/6. This shows that the probability of getting one, two, four, five or six is equal to the probability of not getting a three.
Probability – Combinatorics
- Combinatorics deals with combinations of objects from a specific finite set.
- In addition, we will also consider certain restrictions that can be applied to form combinations. These restrictions can be in
terms of repetition, order or a different criterion.
- We will explore the three integral parts of combinatorics permutations, variations and combinations. Then we will use
each of these parts to determine the number of favourable outcomes or the number of all elements in a sample space.

1) Permutations

Permutations represent the number of different possible ways we can arrange a set of elements. These elements can be digits, letters, objects or even people.
- For example, consider a Formula 1 race in which the three drivers on the podium are Lewis, Max and Kimmie.
- A permutation of three, denoted P(3) would express the total number of different ways these
drivers could split the medals among one another
- Suppose Lewis won the race. Then we have two possible scenarios Max finished second and
Kimmie finished third or Kimmie finished second and Max finished third.
- Now suppose that Max won. Once again we have two possible outcomes but this time it is Lewis
and Kimmie who have to split the silver and bronze medals. Either Kimmie got silver and Lewis
got bronze or the other way around.
- If Kimmie won the race we would have two more ways the drivers can be arranged on the podium
either Max gets silver and Lewis gets bronze or Lewis gets silver and Max gets bronze in total.
- This leaves us with six unique ways.
- We start filling out the positions one by one. The order in which we fill them out is completely up
to us; for convenience, we usually start with the first slot, which represents the race winner in our
example, since anybody out of the n many drivers in the set could have won the race.
- We have n different possible winners. After that we have n-1 possible drivers left, and any one of
those can finish second regardless of which of the n elements we chose to take the first slot.
- We have n-1 possibilities for the second slot. Similarly, we would have n-2 possible outcomes for
who finishes third and so on.
- Generally, the further down the ranking we go, the more options we exhaust, and the more options
we exhaust the fewer options we have left.
- This trend will continue until we get to the last element, for which we will only have a single
option available. Therefore, mathematically, the number of permutations is represented as

• Pn = n * (n-1) * (n-2) * (n-3) * … * 3 * 2 * 1 = n!


Simple
Operations with
Factorials
• The factorial notation n! is
used to express the product of
the natural numbers from 1 to
n.
- Negative numbers don't have a
factorial and 0! is equal to 1.
- The first property for any
natural number n we know
that,
• n! = (n-1)! * n
• (n+1)! = (n+1) * n!
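To make this concrete, here is a minimal Python sketch (standard library only) checking the permutation count for the three drivers and the two factorial properties above:

import math

n = 3
# Permutations of n elements: P(n) = n!
print(math.factorial(n))                            # 6 ways to arrange Lewis, Max and Kimmie

# Property: n! = (n-1)! * n
print(math.factorial(5) == math.factorial(4) * 5)   # True

# Property: (n+1)! = (n+1) * n!
print(math.factorial(6) == 6 * math.factorial(5))   # True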
Solving Variations with Repetition

** We interpret this as: there are nine different
variations of two-letter pass codes consisting
of A, B or C only. What happens if the lock
could use any of the 26 letters? We would
have 26 to the power of 2, which is 676
different variations. **
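As a quick illustration (plain Python, standard library only), the counts of variations with and without repetition for the two-letter code example:

import math

n, p = 26, 2
# Variations with repetition: n ** p
print(n ** p)            # 676 two-letter codes when letters may repeat

# Variations without repetition: n! / (n - p)!
print(math.perm(n, p))   # 650 two-letter codes when letters may not repeat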
Solving Variations without Repetition

Combinations
- The combinations represent the number of different ways we can pick certain elements of a set.
- Imagine you were trying to pick 3 people to represent your company on a very important
technology related conference. There are 10 people working in the office.
- So how many different combinations are there?
- If you calculate this as a variation your answer would be 720, but you would be counting
every group of 3 people several times over. This is because picking Alex, Sarah and Dave
to go to the conference is the same as picking Alex, Dave and Sarah, yet variations count
these orderings as different elements.
- We can say that all the different permutations of a single combination are different
variations.
- Let us look at the Sarah, Alex and Dave example choosing those three to represent the
company is a single combination since the order in which we pick them is not relevant.
Choosing Sarah, Alex and Dave is exactly the same as choosing Sarah, Dave and Alex.
Dave, Sarah and Alex; Dave, Alex and Sarah; Alex Dave and Sarah or Alex ,Sarah and
Dave any of the six permutations we wrote is a different variation but not a different
combination.
- That is what we meant when we said that combinations take into account double
counting.
- The formula for calculating permutations of n many elements is simply n! since n is 3 in
this case.
• P(n)= P(3)=6
- There would be a total of 6 permutations for choosing Alex, Dave and Sarah since
variations count these six as separate.

Example:
Amita randomly picks 4 cards from a deck of 52-cards and places them back into the deck ( Any set of 4 cards is
equally likely ). Then, Babita randomly chooses 8 cards out of the same deck ( Any set of 8 cards is equally likely).
Assume that the choice of 4 cards by Amita and the choice of 8 cards by Babita are independent. What is the probability
that all 4 cards chosen by Amita are in the set of 8 cards chosen by Babita?
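One way to work this example out (a sketch, not necessarily the intended solution method): Babita's 8 cards must contain Amita's specific 4 cards, so the favourable hands are those choosing the remaining 4 cards from the other 48.

from math import comb

# Favourable: the 4 specific cards plus any 4 of the remaining 48 cards
favourable = comb(48, 4)
total = comb(52, 8)
print(favourable / total)   # ≈ 0.000259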
Symmetry of Combinations

Solving Combinations with Separate Sample Spaces
- Imagine the diner near work just introduced a lunch menu which consists of a sandwich, a drink,
and a side.
- Assuming you go there every day. How long will it take for you to be able to try out every possible
item on the menu? To solve this you need to know what is included in their lunch deal.
- Each menu consists of a sandwich, a side and a drink. They offer three types of sandwiches a
panini, a toast and a veggie wrap. The sides they have available are only fries and onion rings and
the drinks they offer are cola or water.
- The way to tackle such problems is by thinking about the different parts of the menu as separate
positions. If we start by choosing a sandwich first we have 3 options for each of them. We can pick
one of 2 sides fries or rings to complete our menu. We would also have to add a drink which can
either be water or coke.

- Therefore for any combination of Sandwich and side we have two ways of completing the menu
thus we would have a total of three dishes times two sides times two drinks or 12 different lunch
menus at the diner.
3* 2 * 2 =12

- This is important because it shows us how many different possibilities there are available despite
the choices for each part seeming limited.
- Furthermore, this allows us to determine the appropriate amount of time it would take for such a
task to be completed when the components are simply too many. We can remove several of the
options to tremendously decrease the workload.
- The way of calculating the total number of combinations for these kinds of questions is by simply
multiplying the number of options available for each individual event.
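A short illustrative sketch (the item names are just the ones from the example) that enumerates the menus and confirms the 3 * 2 * 2 count:

from itertools import product

sandwiches = ["panini", "toast", "veggie wrap"]
sides = ["fries", "onion rings"]
drinks = ["cola", "water"]

menus = list(product(sandwiches, sides, drinks))
print(len(menus))   # 12 possible lunch menus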
Example:
Consider to win lottery you need to satisfy two independent events
1) Correctly guess “Powerball” number (From 1 to 26)
2)Correctly guess 5 regular numbers (From 1 to 69)
Find the probability of winning single ticket.
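A sketch of the calculation, assuming one Powerball out of 26 and five regular numbers out of 69 chosen without regard to order:

from math import comb

# One favourable outcome out of 26 * C(69, 5) equally likely tickets
total_tickets = 26 * comb(69, 5)
print(1 / total_tickets)   # ≈ 3.42e-09, about 1 in 292 million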

In how many different ways could 23 children sit on 23 chairs in a Maths Class? If you have 4 lessons a week and there are 52
weeks in a year, how many years does it take to get through all different possibilities? Note: The age of the universe is about
14 billion years.
Unfortunately, you can’t remember the code for your four-digit lock. You only know that you didn’t use any digit more than
once. How many different ways do you have to try? What do you conclude about the safety of those locks?

In a shop there are five different T-shirts you like, coloured red, blue, green, yellow and black. Unfortunately you only have
enough money to buy three of them. How many ways are there to select three T-shirts from the five you like?
Four children, called A, B, C and D, sit randomly on four chairs. What is the probability that A sits on the first chair?
Sets And Events
- Every event has a set of outcomes that satisfy it. These are the favourable outcomes.
- For example, the event could be "even", and the set of values would consist of 2, 4, 6 and all other even
numbers.
- However, values of a set do not always have to be numerical. For instance, an event can be being a
member of the European Union. Values like France or Germany would be a part of this set and
values like USA or Japan would not.
- Convention dictates that we use upper case letters to denote these sets and lower-case letters to
express individual elements.
- In the numerical example, upper case X will express the set of all even numbers and lowercase x an individual even number.
- Any set can be either empty or have values in it.
- If it does not contain any values, we call it the empty set or null set and denote it with ∅
- The non-empty sets can be finite or infinite depending on the number of elements they have when
working with them.
- We often want to express if an element is part of a set the symbol, we use to denote that is the
“belongs to (∈)” symbol.
• x∈X
- We read it as x is an element of or simply in set X.
- But what if we want to show that an element is not contained in a set then we can use the same
notations but simply cross out the symbol with a single diagonal line like,
• x∉X
- So the statements now mean x IS NOT IN X and X does not contain x.
- A subset is a set that is fully contained in another set.
- If every element of A is also an element of B then A is a subset of B. Note that even when A is a subset of
B, not all elements of B are necessarily part of A.
- Remember that every set contains at least two subsets itself and the null set.
- Take events A and B for example, we express the set of values that satisfy each of them as Circle's one for A and another one for B
- Any element that is part of either set will be represented by a point in the appropriate circle. Since we can have additional events the more events, we have the more circles
we draw.
- Let's only focus on A and B: the two circles can either not touch at all, intersect, or one can completely overlap the other.
- If the two circles never touch, then the two events can never happen simultaneously. Essentially, event A occurring guarantees that event B is not occurring and vice
versa. For example, getting a diamond and getting a heart would be such a situation. If we get a heart, we can't get a diamond and if we get a diamond, we can't get a heart, since
each card has exactly one suit.
- Now if these circles intersect it means that the two events can occur at the same time.
- Imagine we draw a card from a standard deck of playing cards. If event A is drawing a diamond and event B is drawing a queen, the area where they intersect will be
represented solely by the queen of diamonds; the remaining area of A will represent all other diamonds, whereas the area of B outside of that will represent all other queens.
- The third case happens if one circle completely overlaps another. That means that one event can only ever occur if the other one does as well. For instance, event A could
be drawing a red card and event B could be drawing a diamond, then the circle of B is completely contained inside A. So, we can only ever get a diamond if we get a red
card notice that, if the card we drew is black it cannot be a diamond.
- Thus, if event A does not occur then neither does event B. However, because we can draw a heart it is possible to get a red card that isn't a diamond.
- Therefore, event B not occurring does not guarantee event A not occurring. In short, if an outcome is not part of a set it cannot be part of any of its subsets. However, an
outcome not being part of some subset does not exclude it from the entirety of the greater set.
- The intersection of two events occur when we want both A and B to happen at the
same time.
- Graphically the intersection is exactly as the name suggests the area where these
events intersect.
- It consists of all the outcomes that are favourable for both event A and event B
simultaneously as we denote it as A intersect B
- Consider the examples the intersection of all hearts and all diamonds is the empty
set as there are no outcomes which satisfy both events simultaneously, we would
write this as,
A∩B=∅
- Consider example, the intersection of all diamonds and all queens is represented
by the queen of diamonds. That card is the only one that satisfies being a queen
and being a diamond at the same time.
- In the example with red cards and diamonds the intersection of the two would
simply be all diamonds. That is because any diamond is simultaneously red and a
diamond. We would write this as
A∩B=B
- Remember, we use intersections only when we want to denote instances where
both events A and B happen simultaneously.
If we only require one of A or B to occur, regardless of which one, that is the same as asking for either A or B to happen; in such cases
we need to find the union of A and B.

The union of two sets is a combination of all outcomes favourable for either A or B.

We denote the union of two sets with the U symbol.

Let us examine what the unions would be in the three different cases. If the sets A and B do not touch at all, then their intersection
would be the empty set; therefore their union would simply be their sum.

(A U B) = A + B

Going back to the card example, the union of hearts and diamonds would be all red cards. No card can have multiple suits, so we
need not worry about counting a card twice; therefore the number of red cards equals the union of cards which are either diamonds
or hearts.

If the events intersect, then the area of the union is represented by the sum of the two sets minus their intersection.

(A U B) = A + B – (A ∩ B)

That is because if we simply add up the area of the two sets we would be double counting every element that is part of the
intersection.

In fact the union formula we just showed is universally true regardless of the relationship between A and B.

So what happens if B is a subset of A? In that case the union would simply be the entire set A.

Imagine event A is being from the U.S. and event B is being from California. If you talk about all the people who are either from
California or the United States, you are simply talking about all the people from the USA.

Remember that the intersection of these two sets is equal to the entire set B, so the intersection of A and B represents all the
people from California.

If we plug this into the formula, that would give us the following statement:

The union of all people from California or the United States = all California natives + all Americans - all Californians

We get exactly what we expect: the union of people from either California or the United States equals the entire population of the
USA.
- Mutually exclusive sets are sets in which you're not allowed to have any
overlapping elements graphically their circles never intersect.
- Mutually exclusive sets have the empty set as their intersection.
- Therefore, if the intersection of any number of sets is the empty set then they must
be mutually exclusive and vice versa.
- What about their union if some sets are mutually exclusive? Their union is simply
the sum of all separate individual sets.
- Sets have complements which consist of all values that are parts of the sample
space but not part of the set.
- Consider a set consisting of all the odd numbers, its complement would be the set
of all even numbers. It means complements are always mutually exclusive.
- However not all mutually exclusive sets are complements.
- For instance, imagine A is the set of all even numbers and B is the set of all
numbers ending in five. We know that any number ending with five is odd.
- So these two sets are definitely mutually exclusive.
- However the complement of all even is all odd and not just the ones ending with 5.
- Therefore a number like 13 would be part of the complement but not the set B.
Dependence and Independence of Sets
- We can have dependent events, as their probabilities vary as conditions change.
- For instance, take the probability of drawing the Queen of Spades normally. The
answer is 1/52, since we have exactly one favourable outcome and fifty-two elements
in the sample space.
- Now imagine we know that the card we drew was a spade. Our chances of getting the
queen of spades suddenly go up since the new sample space contains the 13 cards
from the suit only. Therefore, the probability becomes 1/13.
- Now imagine a different scenario instead of a spade. We know our card is a queen.
- So, the sample space only consists of 4 cards. Therefore, the probability of drawing
the Queen of Spades becomes 1/4
- With this example you could clearly see how the probability of an event changes
depending on the information we have.
- Lets introduce some new notation as usual. Suppose we have two events A and B to
express the probability of getting A, if we are given that B has occurred, we use the
following notation, P(A/B), we read this as P of A given B.
- Going back to our card example, event A is drawing the queen of spades and event B
is drawing a spade.
- Therefore P(A/B) would represent the probability of drawing the queen of spades. If
we know the card is a spade so
P(A/B)=1/13
- Similarly if event C represents getting a queen then P(A/C) expresses the likelihood
of getting the Queen of Spades assuming we drew a queen.
- Thus
P(A/C) = 1/4
- We call this probability the conditional probability and we use it to distinguish
dependent from independent events
• Example:

1) In a manufacturing unit 3 parts from assembly are selected. You are observing whether they are defective or
non-defective.
• Determine
a) The sample space.
b) The probability of the event of getting 2 defective parts if there is chance of 1 defective in 10.
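A possible worked sketch of this example, treating the three parts as independent with a 0.1 chance of being defective:

from itertools import product
from math import comb

# a) Sample space: every defective (D) / non-defective (N) pattern for the 3 parts
sample_space = list(product("DN", repeat=3))
print(sample_space)                  # 8 outcomes: ('D','D','D'), ('D','D','N'), ...

# b) P(exactly 2 defective) with p = 0.1, binomial reasoning
p = 0.1
print(comb(3, 2) * p**2 * (1 - p))   # 0.027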
The Conditional Probability Formula

P(A/B) = P(A ∩ B) / P(B)
Ex 1
Suppose you draw two cards from a deck and you win if you get a jack followed by an ace (without replacement).
What is the probability of winning, given we know that you got a jack in the first turn?

Ex 2:

Suppose you have a jar containing 6 marbles – 3 black and 3 white. What is the probability of getting a black given the
first one was black too without replacement
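Quick sketches of the two exercises (both without replacement, as stated):

# Ex 1: given the first card was a jack, 4 aces remain among the 51 cards left
print(4 / 51)   # ≈ 0.0784

# Ex 2: given the first marble was black, 2 black marbles remain among the 5 left
print(2 / 5)    # 0.4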
Marginal Probability
• Contingency table consists of rows and columns of two attributes at different
levels with frequencies or number in each of the cells.
• It is matrix of frequencies assigned to rows and columns.
• The term marginal is used to indicate that the probabilities are calculated
using a contingency table.
• Also called as joint probability table.
A survey of 200 families were conducted. Information regarding family income per year & whether family buys a car is given
in following table
Family          Income below Rs 10 lakh    Income above Rs 10 lakh    Total
Buyer of car              38                         42                 80
Non buyer                 82                         38                120
Total                    120                         80                200

a) What is the probability that a randomly selected family is buyer of the car?
b) What is the probability that a family is both buyer of the car & belonging to income of Rs 10 lakh & above.
c) A family selected at random is found to be belonging to income of Rs 10 lakh & above. What is the probability that this
family is buyer of car?
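A sketch of the three calculations, read straight off the contingency table:

total = 200
buyers = 80
buyer_and_high_income = 42
high_income = 80

print(buyers / total)                        # a) P(buyer) = 0.40
print(buyer_and_high_income / total)         # b) P(buyer and income >= 10 lakh) = 0.21
print(buyer_and_high_income / high_income)   # c) P(buyer | income >= 10 lakh) = 0.525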

Example 3
A research group collected the yearly data of road accidents with respect to the conditions of following and not following the
traffic rules of an accident prone area. They are interested in calculating the probability of accident given that a person
followed the traffic rules. The table of the data is given as follows:

Condition       Follow Traffic Rule    Does not follow Traffic Rule
Accident                 50                        500
No Accident            2000                       5000
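A sketch of the requested conditional probability from this table:

accident_and_follow = 50
follow_total = 50 + 2000

print(accident_and_follow / follow_total)   # P(accident | follows rules) ≈ 0.0244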


The Law of Total Probability
Many scientific papers rely on conducting experiments or surveys

They often provide summarized statistics we use to analyse and interpret how certain factors affect one
another.

An example will illustrate this better: imagine you conducted a survey where 100 men and women of all ages
were asked if they eat meat.

The results are summarized in a table; we see that 15 of the 47 women that participated are vegetarian, as are 29
out of the 53 men.

- Now if A represents being vegetarian and B represents being a woman, then P(A/B) and P(B/A) express different events.
- The former, which equals 15/47, represents the likelihood of a woman being vegetarian, while the latter, which equals 15/44, expresses the
likelihood of a vegetarian being a woman. Note that 15/44 is greater than 15/47:
P(A/B) = 15/47 ≠ P(B/A) = 15/44
- It is more likely for a vegetarian to be female than for a woman not to eat meat.
- It shows you that in probability theory things are never straightforward
- Now we will discuss important concept the law of
total probability.
- Imagine A is the union of some finitely many
events B1, B2 and so on.
A=B1 U B2 U …….. U Bn
- This law dictates that P(A) is the sum of the conditional probabilities of A given each B,
multiplied by the probability of the associated B:
P(A) = P(A/B1) * P(B1) + P(A/B2) * P(B2) + … + P(A/Bn) * P(Bn)

Let us go back to the survey example. The probability of being vegetarian equals

P(vegetarian) = P(vegetarian/male)*P(male) + P(vegetarian/female)*P(female)
              = 29/53 * 53/100 + 15/47 * 47/100
              = 44/100 = 0.44
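The same computation as a small Python sketch:

p_veg_given_male = 29 / 53
p_veg_given_female = 15 / 47
p_male, p_female = 53 / 100, 47 / 100

p_vegetarian = p_veg_given_male * p_male + p_veg_given_female * p_female
print(p_vegetarian)   # 0.44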
The Additive Rule
- The additive law recall when we
introduced the concept of unions
- The union of two events A and B is equal
to their sum minus their intersection
- The additive law states something very
similar the probability of the union of two
sets is equal to the sum of the individual
probabilities of each event minus the
probability of their intersection.
- The union of women and vegetarians equal the sum of probabilities of being a woman
and being a vegetarian minus the probability of being a vegetarian woman

P(women U vegetarian) = P(women) + P(vegetarian) – P(being a vegetarian women)

P(WU V) =P(W) + P(V) – P(W∩V)


= 47/100 + 44/100 – 15/100
=0.47 + 0.44 – 0.15
= 0.76
- Thus if we picked a random person from the survey there is a 76% chance they're
either female vegetarian or both.
- Now consider a different example.
- Suppose we know 38% of our colleagues can proficiently use Tableau and 45% are experts in SQL
- Additionally 66% of the people in the office are good with at least one of the two. What is the
probability of somebody being able to simultaneously implement SQL and Tableau
- To answer this, we can rearrange the additive law so that the intersection of Tableau and SQL users
equals the sum of the two individual probabilities minus the probability of their union.

P(Tableau) = 38%
P(SQL) = 45%
P(Tableau U SQL) = 66%

- According to additive rule,


P(T U S) = P(T) + P(S) – P(T ∩ S)

P(T ∩ S) = P(T) + P(S) – P(T U S)


= 38% + 45% - 66%
= 17%

Transforming this into a probability gives us a likelihood of 0.17 for somebody in the office to be able to
proficiently implement SQL and Tableau
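A quick numeric check of the rearranged additive rule:

p_tableau, p_sql, p_union = 0.38, 0.45, 0.66

p_both = p_tableau + p_sql - p_union
print(round(p_both, 2))   # 0.17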
Example:
1) From a pack of well shuffled cards, a card is picked up at random.
a) What is probability that the selected card is a king or a queen?
b) What is probability that the selected card is king or a diamond?

2) The probability that you will get an A grade in quantitative methods is 0.7. The probability that you get an A grade in
marketing is 0.5. Assuming these 2 courses are independent, compute the probability that you will get an A grade in both
these subjects.
The Multiplication Law

- Consider a numerical example and see why this makes sense.

- Suppose the probability of event B is 0.5 and the probability of event A given B, P(A/B) is 0.8.

- This suggests that event B occurs 50% of the time and event A also appears in 80% of those 50% when B occurred.

- Therefore, the likelihood of A and B occurring simultaneously is 0.8*0.5 or 0.4


Example:
Suppose we draw two cards from a standard deck of 52 playing cards we draw one. Shuffle the deck without returning the card
and then draw a second one. What is the probability of drawing a spade on the second draw and not drawing a spade on the first
draw?
If we express these as a single conditional probability event A would be drawing a spade on the second try and event B would be not
drawing a spade on the first try.
As stated before the likelihood of drawing a specific suit is 1/4th or 0.25.
We already discussed how to calculate the probability of complements so the probability of B would equal,
P(B) = 1 – 0.25 = 0.75
Now be careful when estimating the probability of drawing a spade on the second turn.
There are only 51 cards left, so we must adjust the "favourable over all outcomes" formula to find the new likelihood.
We have assumed we did not draw a spade on the first go so the favourable outcomes would still be the 13 spades left.
However, we are one card short from having a complete deck. So the new sample space would be 51.
Therefore, the probability would be,
P(A/B) = 13/51 = 0.255.
So far we have calculated the likelihood of not drawing a spade on the first turn, and the probability of drawing a spade on the second go given what happened on the first.
However, we still haven't answered the question we are interested in: what is the probability of drawing a spade on the second draw and not drawing a spade on the first draw?
To answer it, we need to apply the multiplication rule,
P(A∩B) = P(A/B) * P(B)
= 0.255 * 0.75
= 0.191

So we have a probability of 0.191 of drawing a spade on the second turn assuming we did not draw 1 initially
• Example:

From a pack of cards, 2 cards are drawn in succession one after another. After every draw, the selected card is not replaced.
What is probability that in both the draws you will get spades?
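A sketch of this exercise using the multiplication rule:

p_first_spade = 13 / 52
p_second_spade_given_first = 12 / 51

print(p_first_spade * p_second_spade_given_first)   # ≈ 0.0588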
• Bayes' Law
- One of the most prominent examples of using Bayes Rule is in medical research when trying to find a causal relationship
between symptoms.
- Knowing both conditional probabilities between the two helps us make more reasonable arguments about which one causes
the other.
- For instance, there is certain correlation between patients with back problems and patients wearing glasses.
- More specifically 67% of people with spinal problems wear glasses (P(VI/BP) =67%) while only 41% of patients with
eyesight issues have back pains (P(BP/VI)=41%)
- These conditional probabilities suggest that it is much more likely for someone with back problems to wear glasses than the
other way around, even though we cannot find a direct causal link between the two.
- There exists some arguments to support such claims.
- For instance, most patients with back pain are either elderly or work a desk job where they remained stationary for long
periods
- Old age and a lot of time in front of the desktop computer can have a deteriorating effect on an individual's eyesight
however many healthy and young individuals wear glasses from a young age.
- In those cases, there is no other underlying factor that would suggest incoming back pains.
- Similarly, we can also apply Bayes Theorem in business. Let's explore this fictional scenario.

Example1: A bag I contain 4 white and 6 black balls while another Bag II contains 4 white and 3 black balls. One ball is drawn
at random from one of the bags, and it is found to be black. Find the probability that it was drawn from Bag I.

Example 2:
You are planning a picnic today, but the morning is cloudy
∙ Oh no! 50% of all rainy days start off cloudy!
∙ But cloudy mornings are common (about 40% of days start cloudy)
∙ And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%)
What is the chance of rain during the day?
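A sketch of the picnic example with Bayes' rule, P(Rain|Cloud) = P(Cloud|Rain) * P(Rain) / P(Cloud):

p_cloud_given_rain = 0.5
p_rain = 0.1
p_cloud = 0.4

print(p_cloud_given_rain * p_rain / p_cloud)   # 0.125, i.e. a 12.5% chance of rain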
• Let us say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

• P(Fire|Smoke) means how often there is fire when we can see smoke
P(Smoke|Fire) means how often we can see smoke when there is fire
∙ dangerous fires are rare (1%)
∙ but smoke is fairly common (10%) due to barbecues,
∙ and 90% of dangerous fires make smoke

Discover the probability of dangerous Fire when there is Smoke
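And the same pattern for the fire-and-smoke question:

p_smoke_given_fire = 0.9
p_fire = 0.01
p_smoke = 0.10

print(p_smoke_given_fire * p_fire / p_smoke)   # ≈ 0.09, a 9% chance of dangerous fire given smoke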


Bayes' Law Using Total Law of Probability

• One clever application of Bayes theorem is in spam filtering.

• We have

Event A: The message is spam


Test x: The message contains certain words (x)

- The idea of Bayes' theorem is to switch which event is being conditioned on: we switch between A given B and B given A.
- In the spam mail example we want to find out whether the email that we are receiving is spam or not.
- Gmail identifies such emails, marks them as spam and moves them to a specific folder.
- Suppose we want to build the same application: based on the contents of a mail, we need to decide whether the mail is spam or not.
- We could design the program as: if I know the words of the message, then I can tell you whether it is spam or not, i.e. we want the probability of spam given the words.
- It means if I tell you the words, can you tell me whether this is spam or not?
- We solve this problem by finding the opposite conditional, i.e. the probability of the words given spam, as that is easier for us to estimate.
- It means if I know that a message is spam then I know the distribution of its words, and if I know that it is not spam then I also know the distribution of words.
- So now we twist the problem and say: "if you give me the words, I will tell you whether it is spam or not."
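A minimal, illustrative sketch of this idea as a toy word-based spam score; the probabilities below are made-up numbers for demonstration, not real spam statistics:

# Assumed toy numbers for illustration only
p_spam = 0.3                    # P(spam)
p_word_given_spam = 0.8         # P("free offer" appears | spam)
p_word_given_not_spam = 0.05    # P("free offer" appears | not spam)

# Total probability of seeing the word, then Bayes' rule
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)        # ≈ 0.87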


Probability – Distributions
- A distribution shows the possible values a variable can take and how
frequently they occur
- Let us introduce some important notation we use. The uppercase Y represents the
actual outcome of an event and lowercase y represents one of the possible
outcomes.
- One way to denote the likelihood of reaching a particular outcome y is
P(Y = y) or p(y)
- For example, uppercase Y could represent the number of red marbles we draw out
of a bag and lowercase y would be a specific number like 3 or 5; then we express
the probability of getting exactly 5 red marbles as,
P(Y = 5) or p(5)
- Since P(Y) expresses the probability for each distinct outcome. We call this the
probability function.
- So, probability distributions or simply probabilities measure the likelihood
of an outcome depending on how often it is featured in the sample space.

- The probability frequency distribution of an event is obtained by taking the frequency of each unique value and dividing it by the total number
of elements in the sample space.
- Usually that is the way we construct these probabilities when we have a finite number of possible outcomes if we had an
infinite number of possibilities then recording the frequency for each one becomes impossible because there are infinitely
many of them.
- Imagine you are a data scientist and want to analyse the time it takes for your code to run any single compilation could take
anywhere from a few milliseconds to several days. Often the result will be between a few milliseconds and a few minutes.

- One idea which we will use a lot is that any value between µ - σ and µ + σ falls within one
standard deviation away from the mean.
- The more congested the middle of the distribution the more data falls within that interval
similarly the less data that falls within the interval the more dispersed the data is.
- It is important to know that a constant relationship exists between mean and variance for
any distribution.
- The variance equals,

σ 2 = E((Y - µ)2)
= E(Y2) - µ2
Types of Probability
Distributions
- Here we are going to discuss the various
types of probability distributions and what
kind of events they can be used to describe.
- Certain distributions share features, so we
group them into types. Some, like rolling a
die or picking a card, have a finite number
of outcomes; they follow discrete
distributions. Others, like recording time and
distance in track and field, have infinitely
many outcomes; they follow continuous
distributions.
- Here we will focus on the important aspects
of each distribution and when it is used.
- Before, we get into the specifics you will need
to know the proper notation we implement
when defining distributions, we start off by
writing down the variable name for our set of
values followed by the tilde sign, this is
superseded by a capital letter depicting the
type of the distribution and some
characteristics of the dataset in parentheses.
X ~ N ( µ , σ2 )
Discrete Distributions
- A discrete distribution is a distribution of data in statistics that
has discrete values. Discrete values are countable, finite,
non-negative integers, such as 1, 10, 15, etc.
- A discrete distribution, as mentioned earlier, is a distribution of
values that are countable whole numbers. On the other hand, a
continuous distribution includes values with infinite decimal
places. An example of a value on a continuous distribution would
be “pi.” Pi is a number with infinite decimal places (3.14159…).
- Both distributions relate to probability distributions, which are
the foundation of statistical analysis and probability theory.
- A probability distribution is a statistical function that is used to
show all the possible values and likelihoods of a random
variable in a specific range.
- The range would be bound by maximum and minimum values,
but the actual value would depend on numerous factors.
- There are descriptive statistics used to explain where the
expected value may end up. Some of which are:
∙ Mean (average)
∙ Median
∙ Mode
∙ Standard deviation
∙ Skewness
∙ Kurtosis
- Consider an example where you are counting the number of people walking
into a store in any given hour. The values would need to be countable, finite,
non-negative integers.
- It would not be possible to have 0.5 people walk into a store, and it would
not be possible to have a negative amount of people walk into a store.

- Therefore, the distribution of the values, when represented on a distribution


plot, would be discrete.

- Observing the above discrete distribution of collected data points, we can see
that there were five hours where between one and five people walked into
the store. In addition, there were ten hours where between five and nine
people walked into the store and so on.
- The probability distribution above gives a visual representation of the
probability that a certain amount of people would walk into the store at any
given hour. Without doing any quantitative analysis, we can observe that
there is a high likelihood that between 9 and 17 people will walk into the
store at any given hour.
- Types of Discrete distributions are:
• Discrete Uniform distribution
• Bernoulli Distribution
• Binomial Distribution
• Poisson Distribution
Discrete Distributions: The Uniform Distribution
- It’s when all the distinct random variables have the exact same
probability values, so everything is constant or just a number.
- In a uniform probability distribution, all random variables have the
same or uniform probability; thus, it is referred to as a discrete uniform
distribution.
- Imagine a box of 12 donuts sitting on the table, and you are asked to
randomly select one donut without looking. Each of the 12 donuts has
an equal chance of being selected.
- Therefore, the probability of any one donut being chosen is the same or
uniform.

- In fact, we can represent this idea using a simple graph as follows.

- But notice that we can show this graphical representation as a density


curve of a uniform distribution as a set of rectangles all having equal
heights.
And, what’s important to note is that the value of the total area under any
density curve equals one. Therefore, for a discrete uniform distribution over n
equally likely values, the probability mass function is

P(X = x) = 1/n

Moreover, if X is a uniform random variable on the integers a to b (with a less than or
equal to b, so n = b - a + 1), the mean and variance of a discrete uniform
distribution are

E(X) = (a + b)/2        Var(X) = ((b - a + 1)² - 1)/12
So, using our previous example of the box of 12 donuts, where you randomly select one donut without looking, let’s identify
the distribution and calculate its mean and variance.
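A sketch of the donut example, labelling the 12 donuts 1 to 12 for convenience:

a, b = 1, 12
n = b - a + 1

pmf = 1 / n
mean = (a + b) / 2
variance = (n**2 - 1) / 12

print(pmf, mean, variance)   # 1/12 ≈ 0.0833, mean 6.5, variance ≈ 11.92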

Discrete Distributions: The Bernoulli Distribution
- At the beginning of any cricket match, how do you decide who is
going to bat or ball? A toss! It all depends on whether you win or
lose the toss, right? Let’s say if the toss results in a head, you win.
Else, you lose. There’s no midway.
- A Bernoulli distribution has only two possible outcomes, namely
1 (success) and 0 (failure), and a single trial.
- So, the random variable X which has a Bernoulli distribution can
take value 1 with the probability of success, say p, and the value 0
with the probability of failure, say q or 1-p.
- Here, the occurrence of a head denotes success, and the occurrence
of a tail denotes failure.
- Probability of getting a head = 0.5 = Probability of getting a tail
since there are only two possible outcomes
- The probability mass function is given by:
p^x * (1-p)^(1-x), where x ∈ {0, 1}.
- It can also be written as
- The probabilities of success and failure need not be equally likely, like the
result of a fight between me and Undertaker. He is pretty much certain to
win. So in this case probability of my success is 0.15 while my failure is
0.85
- Here, the probability of success(p) is not same as the probability of failure.
So, the chart below shows the Bernoulli Distribution of our fight.

- Here, the probability of success = 0.15 and probability of failure = 0.85.


- The expected value is exactly what it sounds like. If I punch you, I may expect
you to punch me back.
- Basically, the expected value of any distribution is the mean of the
distribution. The expected value of a random variable X from a Bernoulli
distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p
- The variance of a random variable from a Bernoulli distribution is:
E(X²) = 1²*p + 0²*q = p
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
- There are many examples of the Bernoulli distribution, such as whether it’s
going to rain tomorrow or not, where rain denotes success and no rain
denotes failure, and winning (success) or losing (failure) a game.

Example:
Find the variance of the Bernoulli distribution with p = 0.6.
As p = 0.6,
q = 1 – p = 1 – 0.6 = 0.4
E(X) = 1*p + 0*q = 0.6
E(X²) = 1²*p + 0²*q = p = 0.6
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p) = p*q = 0.6*0.4 = 0.24
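A quick check of the p = 0.6 example (assuming scipy is available):

from scipy.stats import bernoulli

mean, var = bernoulli.stats(0.6, moments='mv')
print(mean, var)   # 0.6 and 0.24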
Discrete Distributions: The Binomial Distribution
- Let’s get back to cricket. Suppose that you won the toss today and
this indicates a successful event. You toss again but you lost this
time. If you win a toss today, this does not necessitate that you will
win the toss tomorrow.
- Let’s assign a random variable, say X, to the number of times you
won the toss. What can be the possible value of X? It can be any
number depending on the number of times you tossed a coin.
- There are only two possible outcomes. Head denoting success and
tail denoting failure.
- Therefore, probability of getting a head = 0.5 and the probability of
failure can be easily computed as: q = 1- p = 0.5.
- A distribution where only two outcomes are possible, such as success
or failure, gain or loss, win or lose and where the probability of
success and failure is same for all the trials is called a Binomial
Distribution.
- The outcomes need not be equally likely. So, if the probability of
success in an experiment is 0.2 then the probability of failure can be
easily computed as q = 1 – 0.2 = 0.8.

• Each trial is independent since the outcome of the previous toss


doesn’t determine or affect the outcome of the current toss.
- An experiment with only two possible outcomes repeated n number of times is called binomial. The parameters of a binomial
distribution are n and p where n is the total number of trials and p is the probability of success in each trial.
- On the basis of the above explanation, the properties of a Binomial Distribution are
1. Each trial is independent.
2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)

- The mathematical representation of the binomial distribution is given by:
P(X = k) = C(n, k) * p^k * (1-p)^(n-k),   k = 0, 1, …, n
A binomial distribution graph where the probability of success does not equal the probability of failure looks like
Now, when probability of success = probability of failure, in such a situation the graph of binomial distribution looks like

- The mean and variance of a binomial distribution are given by:


Mean -> µ = n*p
Variance -> Var(X) = n*p*q
Examples:

1. A coin is tossed four times. Calculate the probability of obtaining more heads than tails.

2. An agent sells life insurance policies to five equally aged, healthy people. According to recent data, the probability of
a person living in these conditions for 30 years or more is 2/3. Calculate the probability that after 30 years:
1. All five people are still living.
2. At least three people are still living.
3. Exactly two people are still living.
Example:
Let X be the number of heads flipped in n=10 flips of a coin. If the coin is fair, then it is clear that X ~ Bin(n=10, p=0.500)
i) What is probability of getting 3 heads?
ii) What is probability of getting 3 tails?
iii) What is probability of getting at most 3 heads or at least 7 heads?

Example:
Given two randomly selected countries, the probability of war between them in given year is one in ten-thousand p=0.0001.
Calculate the probability of at least one war in a given year. (Given that there are 194 countries in world.)
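Sketches of the binomial examples above (scipy and the standard library assumed available); these are one way to set the problems up, with p = 0.5 for the coin, p = 2/3 for the insurance example, and C(194, 2) country pairs for the war example:

from math import comb
from scipy.stats import binom

# Coin tossed 4 times: P(more heads than tails) = P(X >= 3)
print(1 - binom.cdf(2, 4, 0.5))                              # 0.3125

# Insurance: n = 5 people, p = 2/3 chance of surviving 30 years
print(binom.pmf(5, 5, 2/3))                                  # all five alive ≈ 0.1317
print(1 - binom.cdf(2, 5, 2/3))                              # at least three alive ≈ 0.7901
print(binom.pmf(2, 5, 2/3))                                  # exactly two alive ≈ 0.1646

# X ~ Bin(10, 0.5)
print(binom.pmf(3, 10, 0.5))                                 # P(3 heads) ≈ 0.1172; P(3 tails) is the same by symmetry
print(binom.cdf(3, 10, 0.5) + (1 - binom.cdf(6, 10, 0.5)))   # P(X <= 3 or X >= 7) ≈ 0.3438

# War example: every pair of countries is an independent trial with p = 0.0001
n_pairs = comb(194, 2)
print(1 - binom.cdf(0, n_pairs, 0.0001))                     # P(at least one war) ≈ 0.846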
Discrete Distributions: The Poisson Distribution
- Suppose you work at a call centre; approximately how many calls do you get in a day? It can be any number. Now,
the entire number of calls at a call centre in a day is modelled by Poisson distribution. Some more examples are

1. The number of emergency calls recorded at a hospital in a day.

2. The number of thefts reported in an area on a day.

3. The number of customers arriving at a salon in an hour.

4. The number of suicides reported in a particular city.

5. The number of printing errors at each page of the book.

- You can now think of many examples following the same course.

- Poisson Distribution is applicable in situations where events occur at random points of time and space
wherein our interest lies only in the number of occurrences of the event.
- A distribution is called Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over a short interval is proportional to the length of that interval (the event rate stays constant over time).
3. The probability of success in an interval approaches zero as the interval becomes smaller.
- Now, if any distribution validates the above assumptions then it
is a Poisson distribution. Some notations used in the Poisson
distribution are:
∙ λ is the rate at which an event occurs,
∙ t is the length of a time interval,
∙ And X is the number of events in that time interval.
- Here, X is called a Poisson random variable and the probability
distribution of X is called the Poisson distribution.
- Let µ denote the mean number of events in an interval of length t.
- Then, µ = λ*t.
- The mean µ is the parameter of this distribution; µ is defined as λ
times the length of the interval.
- The PMF of X following a Poisson distribution is given by:
P(X = k) = (e^(-µ) * µ^k) / k!,   k = 0, 1, 2, …
- The graph of a Poisson distribution shifts to the right as the mean increases.
The mean and variance of X following a Poisson distribution:
Mean -> E(X) = µ
Variance -> Var(X) = µ
Example:
Births in a hospital occur randomly at an average rate of 1.8 births per hour. What is the probability of observing 4 births in a
given hour at the hospital?

If random variable X follows Poisson distribution, if P(X=2) = P(X=3) then find mean P(X=0)
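Sketches of the two Poisson exercises (scipy assumed available); for the second, P(X=2) = P(X=3) forces µ = 3:

from scipy.stats import poisson

# Births at an average rate of 1.8 per hour: P(4 births in a given hour)
print(poisson.pmf(4, 1.8))   # ≈ 0.0723

# If P(X=2) = P(X=3), then µ²e^(-µ)/2! = µ³e^(-µ)/3!, so µ = 3
print(poisson.pmf(0, 3))     # P(X=0) = e^(-3) ≈ 0.0498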
Continuous Distributions

• Some examples of graphical representations of continuous
distributions are displayed below. It may be observed that
these graphs possess a curved nature, with no distinct
values for each x value.
• This is because, at any given specific x value or
observation in a continuous distribution, the probability is
zero. We are only able to calculate the probability that a
continuous random variable lies within a range of values.
• It should also be noted that the area underneath the
curve is equal to one because this represents the
probability of all outcomes.

❑ Variance
- Each defined random variable has a variance associated with it as well.
- This is a measure of the concentration of the observations within that random variable. This number will give us intel on
how far spread the observations are from the mean.
- The standard deviation is also useful, this is equal to the square root of the variance.
- When calculating variance, the idea is to calculate how far each observation of the random variable is from it’s expected
value, square it, then take the average of all of these squared distances.
- The formula for variance is as follows:
Var(X) = E[(X – E[X])²]
Example:
Verify whether following function is pdf. If yes then Obtain
i) P(X < 0.5) ii) P(X > 0.25) iii) P(0.2 < X < 0.4)
where
f(x) = 6x(1-x); for o<x<1
=0 otherwise
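A sketch that checks the pdf and evaluates the three probabilities by integrating f(x) = 6x(1-x) (sympy assumed available):

import sympy as sp

x = sp.symbols('x')
f = 6 * x * (1 - x)

print(sp.integrate(f, (x, 0, 1)))       # 1, so f is a valid pdf
print(sp.integrate(f, (x, 0, 0.5)))     # P(X < 0.5)  = 0.5
print(sp.integrate(f, (x, 0.25, 1)))    # P(X > 0.25) ≈ 0.84375
print(sp.integrate(f, (x, 0.2, 0.4)))   # P(0.2 < X < 0.4) ≈ 0.248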
Continuous Distributions: The Normal Distribution
- For starters we define a normal distribution as N(µ, σ2) using a
capital letter N followed by the mean and variance of the
distribution
- We read the following notation as variable X follows a normal
distribution with mean µ and variance σ2.
• X~N(µ, σ2)
- When dealing with actual data, We would usually know the
numerical values of µ and σ2
- The normal distribution frequently appears in nature as well as in
life in various shapes and forms.
- For example, the size of a fully grown male lion follows a normal
distribution. Many records suggest that the average lion weighs
between 150 and 250 kg or 332 - 550 pounds.
- Of course, specimens exist which fall outside of this range.
However, Lions weighing less than 150 or more than 250 kilograms
tend to be the exception rather than the rule.
- Such individuals serve as outliers in our set, and the more data we
gather, the smaller the part of the data they represent.
- Now that you know what types of events follow a normal distribution,
let us examine some of its distinct characteristics.
∙ For starters the graph of a normal distribution is bell shaped.
∙ Therefore, the majority of the data is centred around the mean.

• Thus, values further away from the mean are less likely to occur.
Furthermore, we can see that the graph is symmetric with regard to the
mean, which suggests that values equally far away in opposing directions are
still equally likely.
• Let's assume a population of animals in a zoo is known to
be normally distributed.
• Each animal lives to be 13.1 years old on average (mean),
and the standard deviation of the lifespan is 1.5 years.
• If someone wants to know the probability that an animal
will live longer than 14.6 years, they could use the
empirical rule.
• Knowing the distribution's mean is 13.1 years old, the
following age ranges occur for each standard deviation:

❖ One standard deviation (µ ± σ): (13.1 - 1.5) to (13.1 + 1.5), or 11.6 to 14.6

❖ Two standard deviations (µ ± 2σ): 13.1 - (2 x 1.5) to 13.1 + (2 x 1.5), or 10.1 to 16.1

❖ Three standard deviations (µ ± 3σ): 13.1 - (3 x 1.5) to 13.1 + (3 x 1.5), or, 8.6 to 17.6

• The person solving this problem needs to calculate the total probability of the animal living 14.6 years or longer.
• The empirical rule shows that 68% of the distribution lies within one standard deviation, in this case, from 11.6 to 14.6
years.
• Thus, the remaining 32% of the distribution lies outside this range.
• One half lies above 14.6 and the other below 11.6. So, the probability of the animal living for more than 14.6 is 16%
(calculated as 32% divided by two).
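A quick check of the zoo example against the exact normal distribution (scipy assumed available); the empirical rule gives 16%, the exact tail is about 15.9%:

from scipy.stats import norm

mu, sigma = 13.1, 1.5
print(1 - norm.cdf(14.6, loc=mu, scale=sigma))   # ≈ 0.1587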
Continuous Distributions: The
Standard Normal Distribution

- The transformation is a way in which we can alter every element


of a distribution to get a new distribution with similar
characteristics.
- For normal distributions we can use addition, subtraction,
multiplication and division without changing the type of the
distribution.
- For instance, if we add a constant to every element of a normal
distribution the new distribution would still be normal.
- Let's discuss the four algebraic options and see how each one
affects the graph. Suppose we add a constant like three to the entire
distribution.

• Then we simply need to move the graph three places to the right,
similarly if we subtract a number from every element we would
simply move our current graph to the left to get the new one

Continuous Distributions: The Students' T Distribution
- We use the lowercase letter t to define a Student's t distribution followed by a
single parameter in parentheses called degrees of freedom t(k)
- We read this next statement as variable Y follows A Student's t distribution with 3
degrees of freedom.
Y ~ t(3)
- It is a small sample size approximation of a normal distribution in instances where
we would assume a normal distribution were it not for the limited number of
observations we use the Student's t distribution
- For instance, the average lap times for the entire season of a Formula One race
follow a normal distribution but the lap times for the first lap of the Monaco
Grand Prix would follow A Student's t distribution.
Continuous Distributions: The Chi-Squared Distribution


Continuous Distributions: The Exponential Distribution


Continuous Distributions: The Logistic Distribution
- We denote a logistic distribution with the entire word logistic followed by two
parameters.
Logistic (µ, S)
- Its two parameters are a mean and a scale parameter, like the one for the exponential distribution.
- We also refer to the mean parameter as the location and we shall use the terms
interchangeably.
- Thus we read the statement below as: variable X follows a logistic distribution with
location 6 and a scale of 3.
X ~ Logistic (6,3)
- We often encounter logistic distributions when trying to determine how continuous
variable inputs can affect the probability of a binary outcome.
- This approach is commonly found in forecasting competitive sports events where there
exist only two clear outcomes victory or defeat.
- For instance, we can analyse whether the average speed of a tennis player's serve plays a crucial role in the outcome of the match.
- Expectation dictates that sending the ball with higher velocity leaves opponents with a shorter period to respond; this usually results in a better hit which could
lead to a point for the server.
- To reach higher speeds, tennis players often give up some control over the shot, so they are less accurate. Therefore we cannot assume that there is a linear
relationship between point conversion and serve speed. Theory suggests there exists some optimal speed which enables the serve to still be accurate enough; most of the shots converted into points will then likely have similar velocities. As tennis players go further away from the optimal speed, their shots either become
too slow and easy to handle, or too inaccurate.
- This suggests that the graph of the PDF of the logistic distribution would look similarly to the normal distribution.
- The graph of the logistic distribution is defined by two key features it's mean and its scale parameter.

Weibull Distribution

Population and Sample
• The branch of statistics called inferential statistics is often defined as
the science of drawing conclusions about a population from
observations made on a representative sample of that population.

• A population includes all members from a specified group, all


possible outcomes or measurements that are of interest. The exact
population will depend on the scope of the study.

• A population is the collection of all items of interest to our study


and is usually denoted with an upper case, N. The numbers we've
obtained when using a population are called parameters.

• A sample consists of some observations drawn from the population,


so a part or a subset of the population. The sample is the group of
elements who participated in the study.

• A sample is a subset of the population and is denoted with a


lowercase n and the numbers we've obtained when working with
the sample are called statistics.
Example:
- Let's say we want to perform a survey of the job prospects of the students studying at New York University. What is the
population? You could simply walk into New York University and interview every student you find.
- Surely, though, that would not be the population of NYU students.
- The population of interest includes not only the students on campus but also the ones at home, on exchange abroad, distance
education students, part time students, even the ones who enrolled but are still at high school.
- Though exhaustive, even this list probably misses someone.
- Populations are hard to define and hard to observe in real life a sample however is much easier to gather.
- It is less time consuming and less costly time and resources are the main reasons we prefer drawing samples compared to
analysing an entire population.
- So let's draw a sample then as we first wanted to do we can just go to the NYU campus. Next let's enter the canteen because
we know it will be full of people. We can then interview 50 of them.
- This is a sample drawn from the population of NYU students.
- Populations are hard to observe and contact. That's why statistical tests are designed to work with incomplete data.
- You will almost always be working with sample data and make data-driven decisions and inferences based on it.
- Since statistical tests are usually based on sample data, samples are key to accurate statistical insights. They have two defining characteristics.
Why a sample?
In general, it is almost always impossible to carry out measurements for the entire study
population because:

• The population is too large. Example: the population of IT students. If we want to take measurements on all IT
students in the world, it will most likely either take too long or cost too much

• The population is virtual. In this case “virtual” population is understood as a “hypothetical” population: it is
unlimited in size. Example: for an experimental study, we focus on men with blood cancer treated with a new
treatment. We do not know how many people will be treated, so the population varies, is infinite and
uncountable at the present time, and therefore virtual

• The population is not easily reachable. Example: the population of homeless persons in Belgium.

• For these reasons, measurements are made on a subgroup of observations from the
population, i.e., on a sample of our population.
• These measures are then used to draw conclusions about the population of interest.
• With an appropriate methodology and a sufficiently large sample size, the results
obtained on a sample are often almost as accurate as those that would be obtained on
the entire population.
Representative sample
• The sample must be selected to be representative of the population under study. If participants are included in a
study on a voluntary basis, there is a serious concern that the resulting sample may not be representative of the
population.
• It may be the case that volunteers are different in terms of the parameter of interest, leading to a selection bias.
Another selection bias can occur when, for instance, a researcher collects citizens’ wage, by the means of
internet. It might be the case that people having access to internet have different wages than people who do not
have access.
• The best way to select a sample representative of the population under study is by selecting a random sample. A
random sample is a sample selected at random from the population so that each member of the population has
an equal chance of being selected. A random sample is usually an unbiased sample, that is, a sample whose
randomness is not in doubt.
• In some situations (e.g., in medicine) it is complicated or even impossible to obtain a random sample of the
population. In such cases, it will be important to consider how representative the resulting sample will be.

Paired samples
• Finally, paired samples are samples in which groups (often pairs) of experimental units are linked together by the
same experimental conditions. For example, one may measure the hours of sleep for 20 individuals before taking
a sleeping pill (forming sample A), and then repeat the measurements on the same individuals after they have
taken a sleeping pill (forming sample B).
• The two measurements for each individual (hours of sleep before and after the sleeping pill) and the two samples
are of course related.
• Statistical tools accounting for a relation between the samples exist and should be preferred in that case.
Randomness and representativeness.
- A sample must be both random and representative for an insight to be precise.
- A random sample is collected when each member of the sample is chosen from the population strictly by chance.
- A representative sample is a subset of the population that accurately reflects the members of the entire population.
- Let's go back to the sample we just discussed the 50 students from the NYU canteen we walked into the university canteen
and violated both conditions.
- People were not chosen by chance. They were a group of NYU students who were there for lunch. Most members did not
even get the chance to be chosen as they were not in the canteen.
- Thus, we conclude the sample was not random but was it representative.
- It represented a group of people but definitely not all students in the university to be exact.
- It represented the people who have lunch at the university canteen.
- Had our survey been about job prospects of NYU students who eat in the university canteen we would have done well.
- You must be wondering how to draw a sample that is both random and representative.
- The safest way would be to get access to the student database and contact individuals in a random manner.
- However such surveys are almost impossible to conduct without assistance from the university.
Types of Data
- Different types of variables require different types of statistical and
visualization approaches.
- Therefore, to be able to classify the data you are working with is key.
- We can classify data in two main ways based on its type and on its
measurement level.
- Let us start from the types of data we can have.
- There are two types of data - categorical and numerical data.
- The categorical data describes categories or groups. One example is car
brands like Mercedes, BMW and Audi. They show different categories.
- Another instance is answers to yes and no questions. If I ask questions like
Are you currently enrolled in a university or do you own a car. Yes and no
would be the two groups of answers that can be obtained. This is
categorical data.
- Numerical data on the other hand as its name suggests represents
numbers.
- It is further divided into two subsets: discrete and continuous.
- Discrete data can usually be counted in a finite manner. A good example would be the number of children you want to have. Even if you don't know exactly how many, you are absolutely sure that the value will be an integer such as 0, 1, 2 or even 10.
- Another instance is grades on an exam. You may get 1000, 560, 1500 or 2500.
- What is important for a variable to be defined as discrete is that you can imagine, or list, every possible value it can take.
- Continuous data, on the other hand, is infinite and impossible to count. For example, your weight can take on every value in some range.
- You get on the scale and the screen shows 68 or 68.0369 kilograms, but this is just an approximation. If you gain 0.01 kilograms, the figure on the scale is unlikely to change, but your actual weight will be 68.0469 kilograms.
- Now think about sweating: every drop of sweat reduces your weight by the weight of that drop. But once again, a scale is unlikely to capture that change. Your exact weight is a continuous variable.
- It can take on an infinite number of values, no matter how many digits there are after the decimal point. To sum up, your weight can vary by incomprehensibly small amounts and is continuous.
- Just to make sure, here are some other examples of discrete and continuous data. Grades at university are discrete: A, B, C, D, E, F, or 0 to 100%.
- The number of objects in general, no matter if bottles, glasses, tables or cars, can only take integer values.
- Money can be considered both, but physical money like bank notes and coins is definitely discrete.
- You can pay one dollar and 23 cents or one dollar and 24 cents, but nothing in between. That's because the smallest possible difference between two sums of physical money is one cent. What else is continuous?
- Apart from weight, other measurements are also continuous. Examples are height, area, distance and time. All of these can vary by infinitely small amounts, incomprehensible for a human.
- Time on a clock is discrete, but time in general is continuous; it can be anything like 72.123456 seconds.
- We are constrained in measuring weight, height, area, distance and time by our technology, but in general they can take on any value.
Levels of Measurement
- Furthermore, we saw that numerical data that can be discrete
and continuous. It's time to move on to the other classification
levels of measurement.
- These can be split into two groups qualitative and quantitative
data.
- Qualitative data can be nominal or ordinal.
- Nominal variables are like the categories we talked about just now: Mercedes, BMW or Audi. They aren't numbers and cannot be ordered.
- Ordinal data, on the other hand, consists of groups and categories which follow a strict order. Imagine you have been asked to rate your lunch and the options are: disgusting, unappetising, neutral, tasty and delicious. Although we have words and not numbers, it is obvious that these preferences are ordered from negative to positive. Thus, the level of measurement is qualitative ordinal.
- The quantitative variables are also split into two groups interval
and ratio.
- Intervals and ratios are both represented by numbers but have
one major difference ratios have a true zero and intervals don't.
- Most things we observe in the real world are ratios. Their name comes from the fact that they can represent
ratios of things.
- For instance, if I have two apples and you have six apples you would have three times as many as I do. The
ratio of 6 and 2 is 3
- Other examples are the number of objects in general, distance and time.
- Temperature is the most common example of an interval variable. Remember it can not represent a ratio of
things and doesn't have a true zero.
- Usually temperature is expressed in Celsius or Fahrenheit; both are interval variables. Say today is 5 degrees Celsius, or 41 degrees Fahrenheit, and yesterday was 10 degrees Celsius, or 50 degrees Fahrenheit.
- In terms of Celsius it seems today is twice as cold, but in terms of Fahrenheit, not really. The issue comes from the fact that zero degrees Celsius and zero degrees Fahrenheit are not true zeros.
- These scales were artificially created by humans for convenience.
- Now there is another scale, called Kelvin, which has a true zero. Zero Kelvin is the temperature at which atoms stop moving, and nothing can be colder than zero Kelvin.
- This equals -273.15 degrees Celsius or -459.67 degrees Fahrenheit.
- Variables shown in Kelvin are ratios, as we have a true zero and we can make the claim that one temperature is two times more than another. Celsius and Fahrenheit have no true zero and are intervals.
• Finally, numbers like 2, 3, 10, 10.5, etc. can be either interval or ratio, depending on what they measure.
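- As a small illustration of this point (plain Python, reusing the temperatures from the example above), only the Kelvin ratio is physically meaningful:

    # Why Celsius/Fahrenheit are interval scales while Kelvin is a ratio scale
    today_c, yesterday_c = 5.0, 10.0

    def c_to_f(c):
        return c * 9 / 5 + 32

    def c_to_k(c):
        return c + 273.15

    print(yesterday_c / today_c)                  # 2.0   -> "twice as cold"? only in Celsius
    print(c_to_f(yesterday_c) / c_to_f(today_c))  # ~1.22 -> the "ratio" changes in Fahrenheit
    print(c_to_k(yesterday_c) / c_to_k(today_c))  # ~1.018 -> Kelvin ratios are physically meaningful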
Mean, Median and Mode
- The central tendency is stated as the statistical measure that represents the single value of the entire distribution or dataset.
- It aims to provide an accurate description of the entire data in the distribution.
- In statistics, the central tendency is the descriptive summary of a data set. Through a single value from the dataset, it reflects the center of the data distribution.
- The central tendency of the dataset can be found using three important measures, namely the mean, the median and the mode.
- A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value.
Mean
- The mean is the simple average of the dataset: the sum of all observations divided by their number. It is the most widely used measure of central tendency, but it is heavily affected by outliers, as the pizza-price example below will show.
Median
- The median is basically the middle number in an ordered
dataset. For example, in order to calculate the median we have
to order our data in ascending order
- The median of the dataset is the value at position (n+1)/2 in the ordered list, where n is the number of observations.
- Therefore the median for NYC is at the 6th position, or six dollars, which is much closer to most of the observed prices than the mean of 11 dollars.
- What about L.A.? We have just 10 observations there. According to our formula, the median is at position 5.5.
- In cases like this, the median is the simple average of the numbers at positions 5 and 6.
- Therefore the median of the L.A. prices is $5.50.
- We have seen that the median is not affected by extreme prices, which is good when we have posh New York restaurants and street pizza in the same sample, but we still don't get the full picture.
Mode
- The mode is the value that occurs most often.
- It can be used for both numerical and categorical data, but we will stick to our numerical example.
- After counting the frequencies of each value, we find that the mode of The New York pizza prices is $3.
- The most common price of pizza in NYC is just $3, but the mean and median led us to believe it was much more expensive.
- Let's do the same and find the mode of the L.A. pizza prices. Here, each price appears only once. How do we find the mode then?
- We could say that there is no mode, or that there are 10 modes. Sure, you can do that, but it would be meaningless with 10 observations, and an experienced statistician would never do it.
- In general, you often have multiple modes; usually two or three modes are tolerable, but more than that would defeat the purpose of finding a mode.
- There is one last question that we haven't answered: which measure is best? The NYC and L.A. example shows us that the measures of central tendency should be used together rather than independently.
- Therefore, there is no best, but using only one is the worst.
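- As a minimal sketch of computing all three measures together, Python's standard statistics module is enough; the prices below are illustrative, chosen so that the mean, median and mode come out close to the values quoted above:

    import statistics

    # Illustrative NYC-style pizza prices: cheap street slices plus one posh restaurant
    prices = [1, 3, 3, 3, 5, 6, 7, 9, 10, 11, 66]

    print("mean:  ", round(statistics.mean(prices), 2))   # ~11.27, pulled up by the outlier
    print("median:", statistics.median(prices))           # 6, robust to the outlier
    print("mode:  ", statistics.mode(prices))             # 3, the most frequent value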
- After exploring the measures of central tendency, let's move on to the measures of asymmetry.
Skewness
- The most commonly used tool to measure asymmetry is skewness.
- Skewness indicates whether the observations in a dataset are concentrated on one side.
- As an example, consider frequency distribution tables. Here we have three data sets and the respective frequency distributions. We have also calculated the means, medians and modes.
- The first dataset has a mean of 2.79 and a median of 2; hence the mean is bigger than the median. We say that this is a positive, or right, skew.
- From the graph you can clearly see that the data points are concentrated on the left side. Note that the direction of the skew is counterintuitive.
- It does not depend on which side the bulk of the observations is leaning to, but rather on which side the tail extends. So right skewness means that the outliers are to the right.
- It is interesting to see the measures of central tendency incorporated in the graph: when we have right skewness, the mean is bigger than the median, and the mode is the value with the highest visual representation.
- In the second graph we have plotted a dataset that has an equal mean, median and mode. The frequency of occurrence is completely symmetrical, and we call this a zero or no skew. Most often you will hear people say that such a distribution is symmetrical.
- In the third graph we have a mean of 4.9, a median of 5 and a mode of 6. As the mean is lower than the median, we say that there is a negative, or left, skew.
- Once again, the highest point is defined by the mode. Why is it called a left skew? Again, because the outliers are to the left.
- So why is skewness important?
- Skewness tells us a lot about where the data is situated. The mean, median and mode should be used together to get a good understanding of the dataset.
- Measures of asymmetry like skewness are the link between central tendency measures and probability theory, which ultimately allows us to get a more complete understanding of the data we are working with.
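- As a small illustration with made-up right-skewed data (scipy provides a ready skewness function), the three central-tendency measures line up exactly as described above:

    import statistics
    from scipy.stats import skew

    # Hypothetical right-skewed data: many small values, a few large outliers
    data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 12]

    print("mean:    ", round(statistics.mean(data), 2))   # 4.0, pulled right by the outliers
    print("median:  ", statistics.median(data))           # 3
    print("mode:    ", statistics.mode(data))             # 2
    print("skewness:", round(skew(data), 2))              # positive -> right (positive) skew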
• Variance
- Population variance, denoted σ², is equal to the sum of squared differences between the observed values and the population mean, divided by the total number of observations.
- Sample variance, on the other hand, is denoted by s² and is equal to the sum of squared differences between the observed sample values and the sample mean, divided by the number of sample observations minus one.
- The main part of the formula is its numerator, so that's what we want to comprehend: the sum of the squared differences between each observation and the mean.
- So, the closer a number is to the mean, the lower the result we obtain; and the further away from the mean it lies, the larger this difference. Easy.
- But why do we elevate to the second degree?
- Squaring the differences has two main purposes. First, by squaring the numbers we always get non-negative computations; dispersion is about distance from the mean, and distance cannot be negative.
- If, on the other hand, we calculated the plain differences without squaring them, we would obtain both positive and negative values that would cancel out when summed, leaving us with no information about the dispersion.
- Second, squaring amplifies the effect of large differences.
- For example, if the mean is 0 and you have an observation of 100 the
squared spread is 10,000
- Consider a practical example.
- We have a population of 5 observations as follows
- Let's find its variance.
- We start by calculating the mean: (1+2+3+4+5)/5 = 3.
- Then we apply the formula and get 2.
- So the population variance of the data set is 2 but what about the sample variance.
- This would only be suitable if we were told that these five observations were a sample drawn from a population.
- So let's imagine that's the case.
- The sample mean is once again 3 and the numerator is the same, but the denominator is going to be 4 instead of 5, giving us a sample variance of 2.5. To conclude the variance topic, we should interpret the result. Why is the sample variance bigger than the population variance? In the first case we knew the whole population: we had all the data, and we calculated the variance.
- In the second case we were told that 1, 2, 3, 4 and 5 was a sample drawn from a bigger population.
- Imagine the population behind this sample were these 10 numbers: 1, 1, 1, 2, 3, 4, 5, 5, 5 and 5.
- Clearly the numbers are the same but there is a concentration around the two extremes of the data set 1 and 5.
- The variance of this population is 2.96 so our sample variance has rightfully corrected upwards in order to reflect the higher
potential variability.
- This is the reason why there are different formulas for sample and population data.
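- A quick check of these numbers with Python's standard library (statistics.pvariance uses the population formula, statistics.variance the sample formula):

    import statistics

    data = [1, 2, 3, 4, 5]

    print(statistics.mean(data))       # 3
    print(statistics.pvariance(data))  # 2    -> divide by n   (population formula)
    print(statistics.variance(data))   # 2.5  -> divide by n-1 (sample formula)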
Standard Deviation and Coefficient of Variation
- Variance is a common measure of data dispersion. In most cases the figure you will
obtain is pretty large and hard to compare as the unit of measurement is squared.
- The easy fix is to calculate its square root and obtain a statistic known as standard
deviation.
- In most analyses you perform standard deviation will be much more meaningful than
variance.
- There are different measures for the population and sample variance. Consequently,
there is also population and sample standard deviation.
- The formulas are the square root of the population variance and the square root of the sample variance, respectively.
- The other measure we still have to introduce is the coefficient of variation. It is equal
to the standard deviation divided by the mean. Another name for the term is relative
standard deviation.
- This is an easy way to remember its formula: it is simply the standard deviation relative to the mean. As you probably guessed, there is a population and a sample formula.
- Once again, standard deviation is the most common measure of variability for a single dataset.
- But why do we need yet another measure such as the coefficient of variation? Comparing the standard deviations of two different datasets is meaningless, but comparing their coefficients of variation is not.
- Here's an example of a comparison between standard deviations.
- Let's take the prices of pizza at 10 different places in New York. They range from 1 to $11.
- Now imagine that you only have Mexican pesos. To you, the prices look more like 18.81 pesos to 206.91 pesos, given the exchange rate of 18.81 pesos for $1.
- Let's combine our knowledge so far and find the standard deviations and coefficients of variation of these two data sets.
- First, we have to see if this is a sample or a population.
- Are there only 11 restaurants in New York. Of course not. This is obviously a sample drawn
from all the restaurants in the city. Then we have to use the formulas for sample measures of
variability.
- Second, we have to find the mean. The mean in dollars is equal to 5.5 and the mean in pesos
to 103.46
- The third step of the process is finding the sample variance. Following the formula that we showed earlier, we obtain 10.72 dollars squared and 3,793.69 pesos squared. The respective sample standard deviations are 3.27 dollars and 61.59 pesos.
- Let's make a couple of observations. First, variance gives results in squared units, while standard deviation gives results in the original units.
- This is the main reason why professionals prefer to use standard deviation as the main measure of variability: it is directly interpretable. Squared dollars mean nothing, even in the field of statistics.
- Second, we got standard deviations of 3.27 and 61.59 for the same pizzas at the same restaurants in New York City. Seems wrong, right?
- It is time to use our last tool: the coefficient of variation. Dividing the standard deviations by the respective means, we get the two coefficients of variation. The result is the same: 0.60. Notice that it is not in dollars, pesos, dollars squared or pesos squared. It is just 0.60.
- This shows us the great advantage that the coefficient of variation gives us. Now we can confidently say that the two data sets have the same variability. Which is
what we expected beforehand.
- There are three main measures of variability variance, standard deviation and coefficient of variation.
- Each of them has different strength and applications. You should feel competent using all of them as we are getting closer to more complex statistical topics.
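- A minimal sketch of this comparison in Python; the individual prices are illustrative, and only the exchange rate of 18.81 pesos per dollar comes from the example above:

    import statistics

    # Illustrative dollar prices for pizza at New York restaurants
    usd = [1, 2, 3, 3, 5, 6, 7, 9, 10, 11]
    mxn = [p * 18.81 for p in usd]          # the same prices in Mexican pesos

    def coef_of_variation(sample):
        # sample standard deviation relative to the sample mean
        return statistics.stdev(sample) / statistics.mean(sample)

    print(round(statistics.stdev(usd), 2), round(statistics.stdev(mxn), 2))      # different units, different numbers
    print(round(coef_of_variation(usd), 2), round(coef_of_variation(mxn), 2))    # identical, unit-free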
Covariance
- We've covered all univariate measures; now it is time to see measures that are used when we work with more than one variable.
- We'll explore measures that can help us understand the relationship between variables.
- Our focus will be on covariance and the linear correlation coefficient.
- Consider the example of real estate. What is one of the main factors that determine house prices? Their size, right?
- Typically, larger houses are more expensive as
people like having extra space.
- The table that you can see here shows us data about several houses: on the left side we can see the size of each house, and on the right we have the price at which it has been listed in the local newspaper.
- We can present these data points in a scatterplot.
- The X-axis will show the house size and the Y-axis will provide information about its price.
- We can certainly notice a pattern. There is a
clear relationship between these variables.
- We say that the two variables are correlated
and the main statistic to measure this
correlation is called covariance.
- Unlike variance, covariance may be positive, equal to zero, or negative.
- To understand the concept better, look at its formula: once again there is a sample and a population version.
- Since this is obviously sample data, we should use the sample covariance formula.
- Let's apply it in practice to the example that we saw earlier: x will be the house size and y stands for the house price. We need to calculate the mean size and the mean price, and we will also compute the sample standard deviations.
- Now let's calculate the numerator of the covariance formula, starting with the first house.
- We multiply the difference between its size and the average house size by the difference between the price of the same house and the average house price.
- Once we're ready, we perform this calculation for all the houses in the table and then sum the numbers we've obtained. Our sample size is 5.
- Now we have to divide the sum above by the sample size minus one. The result is the covariance.
- It gives us a sense of the direction in which the two variables are moving. If they go in the same direction the
covariance will have a positive sign. While if they move in opposite directions then will have a negative sign. Finally, if
their movements are independent the covariance between the house size and its price will be equal to zero.
- There is just one tiny problem with covariance, though: it could be a number like 5 or 50, but it can also be something like 0.0023456 or even over 30 million, as in our example, because it uses a completely different scale for every pair of variables.
- How could one interpret such numbers? Proceed to the next part to find out how the correlation coefficient can help us with this issue.
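- As a minimal sketch (the five house sizes and prices below are hypothetical, since the original table is not reproduced here), the sample covariance can be computed with numpy as follows:

    import numpy as np

    # Hypothetical data: house sizes (sq. ft) and listed prices (USD)
    size  = np.array([650, 785, 1200, 1480, 1770])
    price = np.array([77_000, 95_000, 156_000, 191_000, 245_000])

    # np.cov uses the sample formula (divides by n - 1) by default;
    # the off-diagonal entry is the covariance between size and price
    cov_matrix = np.cov(size, price)
    print(cov_matrix[0, 1])   # positive and in the tens of millions: the variables move together,
                              # but the magnitude alone is hard to interpret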
• Correlation adjusts covariance so that the relationship
between the two variables becomes easy and intuitive to
interpret.
Correlation Coefficient
• The formulas for the correlation coefficient are the
covariance divided by the product of the standard
deviations of the two variables.
• This is either the sample or the population formula, depending on the data you're working with.
• We already have the standard deviations of the two datasets.
• Now we'll use the formula in order to find the sample
correlation coefficient.
• Mathematically there is no way to obtain a correlation value
greater than 1 or less than -1.
• Remember the coefficient of variation we covered earlier? We have already done something similar there: here we manipulate the hard-to-interpret covariance value in order to get something intuitive.
- Let's examine it for a bit. We got a sample correlation
coefficient of 0.87. So there is a strong relationship between
the two values.
- A correlation of one also known as perfect positive
correlation means that the entire variability of one variable is
explained by the other variable.
- However, logically we know that size determines the price: on average, the bigger the house you build, the more expensive it will be. This relationship only goes one way.
- Once a house is built, if for some reason it becomes more expensive, its size doesn't increase, although there is a positive correlation between the two.
- A correlation of zero between two variables means that there is no linear relationship between them.
- We would expect a correlation of zero between the price of
coffee in Brazil and the price of houses in London. The two
variables don't have anything in common. Finally we can
have a negative correlation coefficient.
- It can be perfect negative correlation of -1 or much more
likely an imperfect negative correlation of a value between -1
and zero.
- Think of the following businesses: a company producing ice cream and a company selling umbrellas. Ice cream tends to sell more when the weather is sunny, and people buy umbrellas when it's rainy. Obviously, there is a negative correlation between the two, and hence when one of the companies makes more money, the other tends to make less.
- Before we continue, we must note that the correlation between two variables x and y is the same as the correlation between y and x.
- The formula is completely symmetrical with respect to both variables; therefore the correlation of price and size is the same as that of size and price. This leads us to causality.
- It is very important for any analyst or researcher to understand the direction of causal relationships: in the housing business, it is the size that causes the price, and not the other way around.
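- Continuing the hypothetical house data from the covariance sketch above, the correlation coefficient simply rescales the covariance by the two standard deviations:

    import numpy as np

    size  = np.array([650, 785, 1200, 1480, 1770])
    price = np.array([77_000, 95_000, 156_000, 191_000, 245_000])

    # Correlation = covariance / (std(x) * std(y)); np.corrcoef does the same thing directly
    cov = np.cov(size, price)[0, 1]
    manual_corr = cov / (np.std(size, ddof=1) * np.std(price, ddof=1))

    print(round(manual_corr, 2))
    print(round(np.corrcoef(size, price)[0, 1], 2))   # same number, always between -1 and 1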
Statistics - Inferential Statistics Fundamentals
- Now that we have covered the basics of descriptive statistics, it is time to move on to inferential statistics.
- Inferential statistics refers to methods that rely on probability theory and distributions to predict population values based on sample data.
- This will naturally lead us to the point estimate, and we will conclude the section with confidence intervals.
What is a Distribution?
- In statistics when we use the term distribution we usually mean
a probability distribution.
- A distribution is a function that shows the possible values for
a variable and how often they occur.
- We are sure that we have exhausted all possible values when the sum of their probabilities is equal to 1.
- The probabilities do not have to be equal: only in a uniform distribution, such as rolling a fair die, does every outcome have the same chance of occurring.
- Each probability distribution has a visual representation: a graph describing the likelihood of occurrence of every event.
- Often when we talk about distributions we make use of the graph, but the graph is just a visual representation.
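- As a tiny illustration (a fair six-sided die, an example not taken from the text above), here is a probability distribution whose probabilities sum to 1:

    from fractions import Fraction

    # Probability distribution of a fair six-sided die: each outcome maps to its probability
    die = {face: Fraction(1, 6) for face in range(1, 7)}

    print(die[3])              # probability of rolling a 3 -> 1/6
    print(sum(die.values()))   # the probabilities of all possible values sum to 1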
- Let's examine some of the main types of continuous distributions, starting with the normal distribution.
The Normal Distribution
- You can notice that the highest point is located at the mean, because it coincides with the mode. The spread of the graph is determined by the standard deviation.
- Now let's try to understand the normal
distribution a little bit better. Let's look at this
approximately normally distributed histogram.
- There is a concentration of the observations
around the mean which makes sense as it is
equal to the mode. Moreover, it is symmetrical
on both sides of the mean.
- We used 80 observations to create this histogram. Its mean is 743 and its standard deviation is 140. But what if the mean were smaller or bigger? Let's zoom out a bit by adding the origin of the graph; the origin is the zero point.
- Keeping the standard deviation fixed, or in statistical jargon, controlling for the standard deviation, a lower mean would result in the same shape of the distribution, but located further to the left side of the plane.
- In the same way, a bigger mean would move the graph to the right. In our example this resulted in two new distributions: one with a mean of 470 and a standard deviation of 140, and one with a mean of 960 and a standard deviation of 140.
- All right, let's do the opposite. Controlling for the mean, we can change the standard deviation and see what happens.
- This time the graph is not moving but rather reshaping: a lower standard deviation results in a lower dispersion, so more data in the middle and thinner tails.
- On the other hand, a higher standard deviation will cause the graph to flatten out, with fewer points in the middle and more towards the ends, or in statistics jargon, fatter tails.
- These are the basics of a normal distribution.

The Standard Normal Distribution
- Let's see an example that will help us get a better grasp of the concept. We'll take an approximately normally distributed set of nine numbers: 1, 2, 2, 3, 3, 3, 4, 4 and 5.
- Its mean is 3 and its standard deviation is 1.22, so the original dataset is distributed N ~ (3, 1.22).
- Now let's subtract the mean from all data points to get a new dataset, and calculate its mean: it is 0, exactly as we anticipated. On a graph, we have shifted the curve to the left while preserving its shape.
- So far we have a new distribution which is still normal, but with a mean of 0 and a standard deviation of 1.22, i.e. N ~ (0, 1.22). Adding or subtracting a constant to all data points does not change the standard deviation.
- The next step of the standardization is to divide all data points by the standard deviation, 1.22. This drives the standard deviation of the new dataset to 1, while the mean stays at 0, which gives the standard normal distribution N ~ (0, 1).

x | x - µ | (x - µ)/σ
1 | -2 | -1.63
2 | -1 | -0.82
2 | -1 | -0.82
3 |  0 |  0.00
3 |  0 |  0.00
3 |  0 |  0.00
4 |  1 |  0.82
4 |  1 |  0.82
5 |  2 |  1.63

Original dataset: mean 3, st. dev. 1.22 -> N ~ (3, 1.22)
After subtracting the mean: mean 0, st. dev. 1.22 -> N ~ (0, 1.22)
After dividing by the st. dev.: mean 0, st. dev. 1 -> N ~ (0, 1)

- In terms of a curve, we kept it at the same position but reshaped it a bit.
- This is how we can obtain a standard normal distribution from any normally distributed dataset.
• Using it makes predictions and inference much easier, and this will help us a great deal in what we will see next.
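- A quick sketch reproducing this standardization with numpy (ddof=1 is used so that the standard deviation matches the 1.22 quoted above):

    import numpy as np

    data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5], dtype=float)

    mu = data.mean()               # 3.0
    sigma = data.std(ddof=1)       # ~1.22 (sample standard deviation)

    z = (data - mu) / sigma        # the standardized dataset
    print(np.round(z, 2))          # -1.63 -0.82 -0.82 0. 0. 0. 0.82 0.82 1.63
    print(round(z.mean(), 2), round(z.std(ddof=1), 2))   # 0.0 1.0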
Central Limit Theorem
- Before we continue, let's introduce a concept: the sampling distribution. Say you have a population of used cars in a car shop.
- We want to analyse the car prices and be able to make some predictions on them. Population parameters which may be of interest are the mean car price, the standard deviation of prices, covariance and so on.
- Normally, in statistics we would not have data on the whole population, but rather just a sample.
- Let's draw a sample out of that data. Its mean is $2,617.23. Now a problem arises from the fact that if I take another sample I may get a completely different mean, say $3,201.34, and then a third with a mean of $2,844.33. As you can see, the sample mean depends on which observations end up in the sample.
- So taking a single value as we did in descriptive statistics is
definitely suboptimal.
- What we can do is draw many samples and create a new dataset comprised of the sample means. These values are distributed in some way, so we have a distribution.
- When we are referring to a distribution formed by sample statistics, we use the term sampling distribution. In our case we can be even more precise: we are dealing with a sampling distribution of the mean.
- Now if we inspect these values closely, we will realize that they are different but are concentrated
around a certain value.
- In our case, somewhere around $2,800. Since each of these sample means is nothing but an approximation of the population mean, the value they revolve around is actually the population mean itself. Most probably none of them is exactly the population mean, but taken together they give a really good idea of it.
- In fact if we take the average of those sample means we expect to get a very precise
approximation of the population mean.
- Let me give you some more information. Here's a plot of the distribution of the car prices. We haven't seen many distributions yet, but we know that this is not a normal distribution: it has a right skew, and that's about all we can say.
- Here is the big revelation. It turns out that if we visualize the distribution of the sample means, we get something else, something familiar, something useful: a normal distribution. That's what the central limit theorem states: no matter the underlying distribution of the population (binomial, uniform, exponential or any other), the sampling distribution of the mean will approximate a normal distribution.
- Not only that but its mean is the same as the population mean. That's something we already
noticed.
- What about the variance? It depends on the size of the samples we draw, but it is quite elegant: it is the population variance divided by the sample size. Since the sample size is in the denominator, the bigger the sample size, the lower the variance, or in other words, the closer the approximation we get.
- So if you are able to draw bigger samples, your statistical results will be more accurate. As a rule of thumb, we need a sample size of at least 30 observations for the approximation to hold.
- Finally, why is the central limit theorem so important? As we already know, the normal distribution has elegant statistics and an unmatched applicability in calculating confidence intervals and performing tests.
- The central limit theorem allows us to perform tests, solve problems and make inferences
using the normal distribution even when the population is not normally distributed.
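- A minimal simulation of this idea using numpy (the right-skewed "population" of car prices below is drawn from an exponential distribution with made-up parameters, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # A right-skewed "population" of car prices (exponential, mean around $2,800)
    population = rng.exponential(scale=2800, size=100_000)

    # Draw many samples of size 30 and record each sample mean
    sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

    print(round(population.mean(), 1), round(np.mean(sample_means), 1))       # the two means nearly coincide
    print(round(population.var() / 30, 1), round(np.var(sample_means), 1))    # variance ~ population variance / n
    # A histogram of sample_means would look approximately normal,
    # even though the population itself is heavily right-skewed.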
Standard Error
- The standard error is the standard deviation of the sampling distribution of the mean: the population standard deviation divided by the square root of the sample size (σ/√n), or the sample standard deviation divided by √n when the population value is unknown.
- As with the variance of the sampling distribution, it decreases as the sample size increases, which is why bigger samples give more precise estimates.
Estimators and Estimates
- We learned about point estimators but as you can guess they are not exceptionally reliable.
- Imagine visiting 5 percent of the restaurants in London and saying that the average meal is worth 22.50 pounds. You may be close, but chances are that the true value isn't exactly 22.50, but somewhere around it.
What are Confidence Intervals?
- It's much safer to say that the average meal in London is somewhere between 20 and 25 pounds, isn't it?
- In this way you have created a confidence interval around your point estimate of 22.50. A confidence interval is a much more accurate representation of reality.
- However, there is still some uncertainty left, which we measure in levels of confidence.
- So, getting back to our example, you may say that you are 95% confident that the population parameter lies between 20 and 25.
- Keep in mind that you can never be 100% confident unless you go through the entire population, and there is of course a 5% chance that the actual population parameter is outside of the 20 to 25 pound range.
- Observe also that we may be off if the sample we have considered deviates significantly from the entire population.
- There is one more ingredient needed: the level of confidence. It is denoted by 1 - α and is called the confidence level of the interval. Alpha is a value between 0 and 1.
- For example, if we want to be 95% confident that the parameter is inside the interval, alpha is 5%.
- If we want a higher confidence level of, say, 99%, then alpha will be 1%.
- Then here it is: the formula for all confidence intervals runs from the point estimate minus the reliability factor times the standard error, to the point estimate plus the reliability factor times the standard error.
- Formula for all confidence intervals = [point estimate - reliability factor * standard error, point estimate + reliability factor * standard error]
Confidence Intervals; Population Variance Known; Z-score
- A confidence interval is the range within which you expect the population parameter to be and its estimation is based on the
data we have in our sample.
- There can be two main situations when we calculate a confidence interval for a population mean: when the population variance is known and when it is unknown. Depending on the situation, we use a different calculation method.
- Now, the whole field of statistics exists because we almost never have population data. Even if we do have the whole population, we may not be able to analyse it; there may be so much data that it doesn't make sense to use it all at once.
- Here we will explore the confidence intervals for a population mean with a known variance. An important assumption in this calculation is that the population is normally distributed.
- Even if it is not, you should use a large sample and let the Central Limit Theorem do the normalization magic for you. Remember: if you work with a sample which is large enough, you can assume normality of the sample means.
❑ Example:
- Let's say you want to become a data scientist and you're interested in the salary you're going to get.
- Imagine you have certain information that the population standard deviation of data scientist salaries is equal to $15,000.
- Furthermore, you know the salaries are normally distributed and your sample consists of 30 salaries. The formula for the confidence interval with a known variance is given below.
- The population mean will fall between the sample mean minus z of alpha over two times the standard error, and the sample mean plus z of alpha over two times the standard error: [x̄ - z(α/2)·σ/√n , x̄ + z(α/2)·σ/√n].
- The sample mean is the point estimate. You know all about the standard error already so let's compute it using the formula.
- The table summarizes the standard normal distribution's critical values and the corresponding (1 - α) levels.
- Let's say that we want to find the values for the 95 percent confidence interval; alpha is 0.05.
- Therefore, we are looking for z of alpha divided by two, that is z of 0.025, in the table. In the body of the table this matches the value 1 - 0.025 = 0.975.
- The corresponding z comes from the
sum of the row and column table
headers associated with this cell. In
our case the value is 1.9 plus 0.06 or
1.96
- A commonly used term for this z is the critical value. So we have found the critical value for this confidence interval.
Now we can easily substitute in the formula; the final confidence interval becomes $94,833 to $105,568.
The interpretation is the following: we are 95 percent confident that the average data scientist salary will lie in the interval between $94,833 and $105,568.
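- A small sketch of the same calculation with scipy (the sample mean of roughly $100,200 is an assumption inferred from the interval above, since the text does not state it explicitly):

    import math
    from scipy.stats import norm

    x_bar = 100_200        # assumed sample mean (implied by the interval above)
    sigma = 15_000         # known population standard deviation
    n = 30                 # sample size
    alpha = 0.05           # 95% confidence

    z = norm.ppf(1 - alpha / 2)          # ~1.96, the critical value
    margin = z * sigma / math.sqrt(n)    # reliability factor * standard error

    print(round(x_bar - margin), round(x_bar + margin))   # roughly the interval quoted above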
Example:
From 1984 to 1985, the mean height of 15 to 18-year-old males from Chile was
172.36 cm, and the standard deviation was 6.34 cm. Let Y = the height of 15 to
18-year-old males in 1984 to 1985. Then Y ~ N(172.36, 6.34).
About 95% of the y values lie between what two values?
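- A possible solution, using the fact that about 95% of a normal distribution lies within roughly two standard deviations of the mean: 172.36 - 2(6.34) = 159.68 and 172.36 + 2(6.34) = 185.04, so about 95% of the heights lie between approximately 159.68 cm and 185.04 cm.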
Confidence Interval Clarifications
- Let's take a step back and try to understand confidence
intervals a bit better.
- Here is a graph of a normal distribution you know where
the sample mean is in the middle of the graph.
- Now if we know that a variable is normally distributed,
basically we are making the statement that most
observations will be around the mean and the rest far away
from it.
- Let's draw a confidence interval. There is the lower limit,
and the upper limit and 95% confidence interval would
imply that we are 95% confident that the true population
mean falls within this interval.
- There is a 2.5% chance that it will be to the left of the lower limit and a 2.5% chance that it will be to the right of the upper limit.
- Overall, there is a 5% chance that our confidence interval does not contain the true population mean. So when alpha is 0.05, or 5%, we have alpha divided by 2, or a 2.5% chance, that the true mean is to the left of the interval and 2.5% that it is to the right.
- Using the z score and the formula we are
implicitly starting from a standard normal
distribution.
- Therefore, the mean is zero the lower limit
is -z, while the upper one is z.
- For a 95% confidence interval using the Z
table we can find that these limits are -1.96
and 1.96.
- Finally, the formula makes sure we go back to the original range of values, and we get the interval for our dataset.
- What if we are looking at a 90% confidence interval?
- In that case the interval looks like this. And there is a 10% chance that the true mean is outside the interval 5% on each side.
- This causes the confidence interval to shrink. So, when our confidence is lower the confidence interval itself is smaller.
- Similarly, for a 99% confidence interval we would have a higher confidence but a much larger confidence interval.
- Consider an example just to make sure we have solidified this knowledge. I don't know your age, but since you are taking this statistics course, I am 95 percent confident that you are between 18 and 55 years old.
- That is not much information to begin with, as I don't have any data about the age of any of the students. Hence the wide interval.
- So, I am 95% confident you are between 18 and 55 years old. Also, I am 99% confident that you are between 10 and 70 years old.
- I am 100% confident that you are between 0 and 118 years old which is the age of the oldest person alive at the time of recording.
- Finally, I am 5% confident that you are 25 years old. Obviously, this is a completely arbitrary number as you can see there is a trade-off
between the level of confidence and the range of the interval.
- The 100% confidence interval is completely useless, as it must include every possible age in order to guarantee 100% confidence.
- The 99% confidence interval gives a much narrower range, but it is still not insightful enough for this particular problem.
- 25 years old on the other hand is a useful estimate as we have an exact number but the level of confidence of 5% is too small for us to
make use of in any meaningful analysis. There is always a trade-off which depends on the problem at hand.

- 95% is the accepted norm as we don’t compromise with accuracy too much but still get a relatively narrow interval.
Student's T Distribution
- William Gosset was an English statistician who worked for the brewery of Guinness.
- He developed different methods for the selection of the best yielding varieties of barley, an important ingredient in making beer. Gosset found big samples tedious, so he was trying to develop a way to extract small samples but still come up with meaningful predictions.
- He was a curious and productive researcher and published several papers
that are still relevant today.
- However due to his company policy he was not allowed to sign the
papers with his own name. Therefore, all his work was under the pen
name student.
- Later on, a friend of his, the famous statistician Ronald Fisher, building on the findings of Gosset, introduced the t-statistic, and the name that stuck with the corresponding distribution even today is Student's t.
- The Student's t distribution is one of the biggest breakthroughs in statistics, as it allowed inference through small samples with an unknown population variance.
- This setting can be applied to a big part of the statistical problems we
face today.
- Visually, the Student's t distribution looks much like a normal distribution but generally has fatter tails.
- The formula that allows us to calculate it is: t with n - 1 degrees of freedom and a significance level of α equals the sample mean minus the population mean, divided by the standard error of the sample, i.e. t(n-1, α) = (x̄ - μ) / (s/√n). As you can see, it is very similar to the z-statistic.
- After all, it approximates the normal distribution. The last characteristic of the Student's t statistic is its degrees of freedom: for a sample of n observations, we have n - 1 degrees of freedom.
- So, for a sample of 20 observations, the degrees of freedom are 19. Much like the standard normal distribution table, we also have a Student's t table, where the rows indicate different degrees of freedom, abbreviated as DF, while the columns indicate common alphas.
- Please note that after the 13th row the numbers don't vary that much. Actually, after 30 degrees of freedom, the t-statistic table becomes almost the same as the z-statistic, as the degrees of freedom depend on the sample size.
- In essence, the bigger the sample, the closer we get to the actual numbers. A common rule of thumb is that for a sample containing more than 50 observations we use the z table instead of the t table.
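- As a small sketch, these critical values can be looked up with scipy instead of a printed table (the degrees of freedom below are chosen to match the examples in this section):

    from scipy.stats import t, norm

    alpha = 0.05

    # Two-tailed critical values: t for small samples, z for comparison
    print(round(t.ppf(1 - alpha / 2, df=8), 3))    # ~2.306  (n = 9, as in the next example)
    print(round(t.ppf(1 - alpha / 2, df=19), 3))   # ~2.093  (n = 20)
    print(round(t.ppf(1 - alpha / 2, df=50), 3))   # ~2.009  (already close to z)
    print(round(norm.ppf(1 - alpha / 2), 3))       # ~1.960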
Confidence Intervals; Population Variance Unknown; T-score
- So, we have learned that confidence intervals based on small samples from normally distributed populations are calculated with the t-statistic.
- Let's check a similar example to the one we saw earlier. You are an aspiring data scientist and are wondering what the mean data scientist salary is.
- This time, though, you do not have the population variance.
- In fact, you have a sample of only nine compensations you found on Glassdoor, and the information is summarized in the table above.
• In this example we are going to use a
confidence level of 95%.
• This means that α is equal to 5%.
Therefore, half of Alpha would be
2.5%.
• You can now see that the associated t
statistic is 2.31
• We have all the information needed, so we just plug in the numbers. What we get is a confidence interval from $81,806 to $103,261.
- Let's compare this result with the confidence interval we obtained when the population variance was known.
- There, we got a 95% confidence interval between $94,833 and $105,568.
- You can clearly note that when we know the population variance, we get a narrower confidence interval. When we do not know the population variance, there is higher uncertainty, which is reflected by wider boundaries for our interval.
- It means that even when we do not know the population variance, we can still make predictions, but they will be less accurate. Furthermore, the proper statistic for estimating the confidence interval when the population variance is unknown is the t-statistic and not the z-statistic.
- The more observations there are in the sample, the higher the chances of getting a good idea about the true mean of the entire population.
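- A minimal sketch of this t-based interval in Python (the nine salary figures are hypothetical stand-ins, since the original table is not reproduced in the text):

    import math
    import statistics
    from scipy.stats import t

    # Hypothetical sample of nine data scientist salaries (USD)
    salaries = [78_000, 84_000, 88_000, 90_000, 92_000, 95_000, 98_000, 102_000, 110_000]

    n = len(salaries)
    x_bar = statistics.mean(salaries)
    s = statistics.stdev(salaries)                  # sample standard deviation
    t_crit = t.ppf(1 - 0.05 / 2, df=n - 1)          # ~2.306 for df = 8

    margin = t_crit * s / math.sqrt(n)
    print(round(x_bar - margin), round(x_bar + margin))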
Hypothesis Testing
- Confidence intervals provide us with an
estimation of where the parameters are located.
- However, when you are making a decision, you
need a yes or no answer.
- The correct approach in this case is to use a test
- Here we learn how to perform one of the fundamental tasks in statistics: hypothesis testing.
- There are four steps in data driven decision
making.
o First you must formulate a hypothesis.
o Second, once you have formulated a
hypothesis you will have to find the right
tests for your hypothesis.
o Third, you execute the test.
o And fourth, you make a decision based on the result.
- The most intuitive definition is that a hypothesis is an idea that can be tested. This is not the formal definition, but it explains the point very well.
- For example, if I tell you that apples in New York are
expensive. This is an idea or a statement but is not
testable until I have something to compare it with.
- For instance, if I define expensive as any price higher than $1.75 per pound, then it immediately becomes a hypothesis.
- What's something that cannot be a hypothesis? An
example may be would the USA do better or worse under
a Clinton administration compared to a Trump
administration?
- Statistically speaking, this is an idea but there is no data
to test it. Therefore, it cannot be a hypothesis of a
statistical test.
- It is more likely to be a topic of another discipline.
- Conversely, in statistics we may compare different U.S. presidencies that have already been completed, such as the Obama administration and the Bush administration, as we have data on both.
- Here is a simple topic that can be tested. According to Glassdoor, the popular salary information website, the mean data scientist salary in the US is $113,000.
- There are two hypotheses that are made.
- The null hypothesis denoted H0 and the alternative hypothesis denoted H1 or Ha.
- The null hypothesis is the one to be tested and the alternative is everything else.
- In our example the null hypothesis would be: the mean data scientist salary is $113,000; while the alternative would be: the mean data scientist salary is not $113,000.
- Now you would want to check whether $113,000 is close enough to the true mean predicted by our sample. If it is, you would accept the null hypothesis; otherwise, you would reject it. The concept of the null hypothesis is similar to innocent until proven guilty.
- We assume that the mean salary is 113000 dollars, and we try to prove otherwise.
- This was an example of a two sided or a two tailed test.
- You can also form one-sided, or one-tailed, tests. Say your friend Paul told you that he thinks data scientists earn more than $135,000 per year; you doubt him.
- So you design a test to see who's right. The null hypothesis of this test would be: the mean data scientist salary is more than or equal to $135,000. The alternative will cover everything else.
- Thus, the mean data scientist salary is less than 135000 dollars. It is important to know that outcomes of tests refer to the population parameter
rather than the sample statistic.
- So the result that we get is for the population. Another crucial consideration is that, generally, the researcher is trying to reject the null hypothesis.
- Think of the null hypothesis as the status quo (the status quo is keeping things the way they presently are) and of the alternative as the change or innovation that challenges that status quo.
- In our example, Paul was representing the status quo, which we were challenging. Let me emphasize this once again: in statistics, the null hypothesis is the statement we are trying to reject. Therefore, the null hypothesis is the present situation, while the alternative is our own opinion.
Rejection Region and Significance Level
- We will understand the reason why hypothesis testing works. First, we must define the term significance level.
- Normally we aim to reject the null if it is false.
- However as with any test there is a small chance that we could get it wrong and reject the null hypothesis that is true.
- The significance level is denoted by Alpha(α) and is the probability of rejecting the null hypothesis if it is true.
- So alpha is the probability of making this error. Typical values for alpha are 0.01, 0.05 and 0.1. It is a value that you select based on the certainty you need.
- In most cases, the choice of α is determined by the context you are operating in; 0.05 is the most commonly used value.
- For example, consider that you need to test whether a machine is working properly. You would expect the test to make little or no mistakes, as you want to be very precise.
- You should pick a low significance level such as 0.01. The famous Coca-Cola glass bottle holds 12 ounces; if the machine pours 12.1 ounces, some of the liquid will be spilled and the label would be damaged as well.
- So in certain situations we need to be as accurate as possible. However, if we are analysing humans or companies, we would expect more random, or at least uncertain, behaviour and hence accept a higher degree of error.
- For instance, if we want to predict how much Coca-Cola its consumers drink on average, the difference between 12 ounces and 12.1 ounces will not be that crucial.
- So we can choose a higher significance level like 0.05 or 0.1.
- Now that we have an idea about the significance level, let's get to the mechanics of hypothesis testing. Imagine you are consulting for a university and want to carry out an analysis of how students are performing.
- The University Dean believes that, on average, students have a GPA of 70%.
- Being the data-driven researcher that you are, you cannot simply agree with his opinion, so you start testing. The null hypothesis is: the population mean grade is 70%. This is the hypothesized value, and we denote it with μ0 (mu zero).
- The alternative hypothesis is the population mean grade is not 70%. Assuming that the population of grades is normally distributed all grades received by students should
look this way.
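- As a rough sketch of how the mechanics of such a two-sided test could look in code (the GPA values and the 5% significance level below are hypothetical, since the text does not give the actual student data; for such a small sample with unknown variance, the t-statistic from the earlier section would strictly be more appropriate than z):

    import math
    import statistics
    from scipy.stats import norm

    # Hypothetical sample of student GPAs (in %), testing H0: population mean = 70
    grades = [62, 68, 71, 75, 66, 73, 69, 78, 64, 72, 70, 67]
    mu_0 = 70
    alpha = 0.05

    n = len(grades)
    x_bar = statistics.mean(grades)
    s = statistics.stdev(grades)

    z = (x_bar - mu_0) / (s / math.sqrt(n))   # standardized test statistic
    z_crit = norm.ppf(1 - alpha / 2)          # ~1.96 for a two-sided test

    print(round(z, 2), round(z_crit, 2))
    print("reject H0" if abs(z) > z_crit else "fail to reject H0")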
Type I Error and Type II Error
• In general, we can have two types of errors type 1 error and type 2
error.
• Type 1 error is when you reject a true null hypothesis. It is also called a false positive. The probability of making this error is alpha, the level of significance. Since you, the researcher, choose the alpha, the responsibility for making this error lies solely on you.
• Type 2 error is when you accept a false null hypothesis. The
probability of making this error is denoted by β
• When you are conducting statistical tests to decide whether you
think the argument is true or false, you are doing hypothesis
testing.
• The null hypothesis is the initial statement that you are testing.
• The null hypothesis is believed to be true unless there is
overwhelming evidence to the contrary.
• However, there are times when scientists reject the null hypothesis
when they should not have rejected it.
• The reverse could also happen if the null hypothesis is not rejected
when it should have been. Data scientists refer to these errors as
Type I and Type II errors, respectively.
Type I Errors — False Positives (Alpha)
• There will almost always be a possibility of wrongly rejecting a null hypothesis when it should not have been rejected while
performing hypothesis tests.
• Data scientists have the option of selecting an alpha (𝛼) confidence level threshold that they will use to accept or reject the null
hypothesis.
• This confidence threshold, which is in other words a level of trust, is also the likelihood that you will reject the null hypothesis
when it is valid. This case is a type I error, which is more generally referred to as a false positive.
• In hypothesis testing, you need to decide what degree of confidence, or trust, for which you can dismiss the null hypothesis.
• If a scientist were to set alpha (𝛼) =.05, this means that there is a 5 percent probability that they would reject the null hypothesis
when it is actually valid.
• Another way to think about this is that you would expect the hypothesis to be rejected once, simply by chance, if you repeated
this experiment 20 times.
• Generally speaking, an alpha level of 0.05 is adequate to show that certain findings are statistically significant.
Type II Errors — False Negatives (Beta)
• Beta (β) is another type of error, which is the possibility that you have not rejected the null hypothesis when it is incorrect. Type
II errors are also known as false negatives.
• Beta is linked to something called Power, which, given that the null hypothesis is actually false, is the likelihood of rejecting it.
• When planning an experiment, researchers will always select the power level they want and get their Type II error rate from that.

The two types of error are inversely related to each other; decreasing type I errors will increase type II errors, and vice versa.
Example
1. Imagine that you are on a jury and you need to determine whether an individual is going to be sent to jail for a crime. You don't know the truth as to whether or not this person committed the crime. Here the null hypothesis is: "The person has not committed the crime."
• A type I error would suggest that, if they were really not guilty, you would send them to jail! The jury
has dismissed the null hypothesis that the defendant is innocent while he has not committed any
crime.
• You would also not want to make a type II error here because this would mean that someone has
actually committed a crime and the jury is letting them get away with it.
Ho: The person has not committed the crime.

                 H0 is true                              H0 is false
Accept H0        Correct decision 👍                      Type II error (false negative):
                 (the person is innocent and the          the person has committed the crime,
                 jury sends him home)                     but the jury lets him get away with it
Reject H0        Type I error (false positive):           Correct decision 👍
                 the person is innocent, but the          (the person has committed the crime
                 jury sends him to jail                   and the jury sends him to jail)