Cs - Fundamentals of Data Science
Predictive Analytics
Data Scientist
Specialist
Data science is mainly needed for:
Examples:
1) Self-driving cars
2) Airlines
3) Logistics companies like FedEx
The Various Data Science Disciplines
Not everyone in the field of Data Science is a Data Scientist!
Analysis vs. Analytics
Analysis: Consider you have a huge data set containing data of various types. Instead of tackling the entire dataset and running the risk of becoming overwhelmed, you separate it into easier-to-digest chunks, study them individually and examine how they relate to the other parts. That's analysis. One important thing to remember, however, is that you perform analyses on things that have already happened in the past, such as using an analysis to explain how a story ended the way it did or why there was a decrease in sales last summer. All this means that we do analyses to explain how and/or why something happened.
Analytics: Analytics, by contrast, generally refers to the future instead of past events; it explores potential future ones. Analytics is essentially the application of logical and computational reasoning to the component parts obtained in an analysis, and in doing this you are looking for patterns and exploring what you can do with them in the future. Here analytics branches off into two areas:
Qualitative analytics: using your intuition and experience in conjunction with the analysis to plan your next business move.
Quantitative analytics: applying formulas and algorithms to the numbers you have gathered from your analysis.
Business Analytics, Data Analytics, Data Science: An Introduction
❑ Qualitative analytics
- This includes working with tools that help predict future behaviour; therefore it must be placed on the right (the future side of the timeline in our diagram).
- In essence, what we have now is qualitative analytics, which belongs to the area of business analytics.
❑ Sales Forecasting
- Sales forecasting, though, is a future-oriented activity, so we can move it to the right of the black line, but not too much.
- It must still belong to the sphere of business.
- So it must sit in the area where business analytics and data intersect.
❑ Data Science
- The most sparkly of them all is data science.
- Data science is a discipline reliant on data availability, while business analytics does not completely rely on data.
- However, data science incorporates part of data analytics, mostly the part that uses complex mathematical, statistical and programming tools.
- Consequently, the green rectangle representing data science on our diagram will not overlap with data analytics completely, but it will reach a point beyond the area of business analytics.
- An example of a discipline that belongs to the field of data science, and is considered data analytics but not business analytics, comes from the oil and gas industry: the optimization of drilling operations (which aims to optimize weight on bit and bit rotation to obtain the maximum drilling rate while minimizing drilling cost).
- This is a perfect fit for this sub-area: data science can be used to improve the accuracy of predictions based on data extracted from the various activities typical for drilling, improving efficiency.
- Something that involves data analytics but neither data science nor business analytics is digital signal processing.
- A digital signal is used to represent data in the form of discrete values, which is an example of numeric data.
- Therefore data analytics can be applied to a digital signal in order to produce a higher quality signal, and that's what digital signal processing is all about.
- Business intelligence, or BI, is the process of analysing and reporting historical business data. After reports and dashboards have been prepared, they can be used to make informed strategic and tactical business decisions by end users such as the general manager.
- Business intelligence aims to explain past events using business data.
- It must go on the left of the timeline, as it deals only with past events, and it must sit within the data science rectangle as a subfield. Business intelligence fits comfortably within data science because it is the preliminary step of predictive analytics.
- First you must analyse past data and extract useful insights; using these inferences will allow you to create appropriate models that could predict the future of your business accurately.
- Reporting and creating dashboards are precisely what business intelligence is all about, so we will neatly place these two into the orange rectangle.
❑ Machine Learning
- The ability of machines to predict outcomes without being explicitly programmed to do so is regarded as machine learning.
- Expanding on this, it is about creating and implementing algorithms that let machines receive data and use this data to make predictions, analyse patterns and give recommendations on their own.
- Machine learning cannot be implemented without data, hence it should stay within data analytics completely.
❑ Artificial Intelligence
- Artificial intelligence is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals.
- By definition, it is about simulating human knowledge and decision making with computers.
- We as humans have only managed to reach AI through machine learning, the discipline we just talked about, and as data scientists we are interested in how tools from machine learning can help us improve the accuracy of our estimations.
- AI in its broader sense is beyond our expertise.
- Client retention (the process of engaging existing customers to continue buying products or services from your business) and acquisition (the process of gaining new customers) are two typical business activities where machine learning is involved. It helps develop models that predict what a client's next purchase will be.
- Since data analytics and data science are applied in client retention and acquisition as well, we can leave this term right over here, in the overlapping area.
- ML can be applied to fraud prevention as another example: we can feed a machine learning algorithm with prior fraudulent activity data, and it will find patterns which the human brain is incapable of seeing.
- Having a model which can detect such transactions or operations in real time has helped the financial system prevent a huge amount of fraudulent activity.
- When talking about AI and ML, speech and image recognition are usually among the most popular examples, as they are already being implemented in products like Siri, Cortana, Google's Assistant and, more impressively, self-driving cars.
- Finally, an example that is considered artificial intelligence but not machine learning is symbolic reasoning.
- It is based on high-level, human-readable representations of problems in logic.
A Breakdown of Data Science
A step-by-step comparison between the terms and buzzwords related to each discipline.
DATA
- Data is defined as information stored in a digital format which can then be used as a base for performing analyses and decision making.
- There are two types of data: traditional data and big data. Dealing with data is the first step when solving business problems or researching.
- Traditional data is data in the form of tables containing numeric or text values, data that is structured and stored in databases which can be managed from one computer.
BIG DATA
- Big data is a term reserved for extremely large data, and it is not just humongous in terms of volume. This data could be in various formats: it can be structured, semi-structured or unstructured. Big data is just that: big.
- You will also often see big data characterized by the letter V. The Vs may include the vision you have about big data, the value big data carries, the visualization tools you use, or the variability and consistency of big data, and so on. However, the following are probably the most important criteria to remember:
- Volume: big data needs a whopping amount of memory space, typically distributed across many computers. Its size is measured in terabytes, petabytes, and even exabytes.
- Variety: here we are not talking just about numbers and text; big data often implies dealing with images, audio files, mobile data and others.
- Velocity: when working with big data, one's goal is to make extracting patterns from it as quick as possible. The progress that has been made in this area is remarkable: outputs from huge data sets can be retrieved in real time.
DATA SCIENCE
- Data science is a broad subject. It's an interdisciplinary field that combines statistical, mathematical, programming, problem-solving and data-management tools.
- We have divided data science into three segments: business intelligence, traditional methods and machine learning.
BUSINESS INTELLIGENCE
- Business intelligence is the discipline that includes the technology-driven tools involved in the process of analysing, understanding, and reporting available past data. This will result in reports or dashboards and will help you on your way to making informed strategic and tactical business decisions.
- You can extract insights and ideas about your business that will help it grow and give you an edge over your competitors, giving you added stability.
- Business intelligence means understanding:
• how your sales grew and why competitors lost market share;
• whether there was an increase in the price of your products, or you sold a mix of more expensive products;
• how your profitability margins behaved in the same time frame of a previous year;
• whether there were client accounts that were more profitable.
- This is what BI is all about: understanding past business performance in order to improve future performance.
- Once your BI reports and dashboards are completed and presented, it's time to apply one of two types of data science.
TRADITIONAL METHODS
- Traditional methods, according to our framework, are a set of methods that are derived mainly from statistics and are adapted for business.
- There is no denying that these conventional data science tools are applicable today. They are perfect for forecasting future performance with great accuracy.
- Regression analysis, cluster analysis and factor analysis are all prime examples of traditional methods.
MACHINE LEARNING
- The last column we will be discussing is machine learning, in contrast to traditional methods.
- Here the responsibility is left to the machine: through mathematics and a significant amount of computing power, the machine is given the ability to predict outcomes from data without being explicitly programmed to.
- ML is all about creating algorithms that let machines receive data, perform calculations and apply statistical analysis in order to make predictions with unprecedented accuracy.
The Benefits of Each Discipline
There are two types of data. Traditional and big data. Data driven decisions require well organized and relevant raw data stored in a
digital format which can be processed and transformed into meaningful and useful information. It is the material on which you base your
analysis. Without data, a decision maker wouldn't be able to test their decisions and ensure they have taken the right course of
action.
The data you have describes what happened in the past. It is the job of the business intelligence analyst to study the numbers and explain
where and why some things went well and others not so well. Having the business context in mind the business intelligence analyst will
present the data in the form of reports and dashboards.
What else is needed once the patterns have been interpreted? You can forecast potential future outcomes. The application of any term
related to the columns traditional methods or machine learning can be said to belong to the field of predictive analytics.
There is a difference between the two. Traditional methods relate to traditional data. They were designed prior to the existence of big
data, when the technology simply wasn't as advanced as it is today. They involve applying statistical approaches to create predictive
models.
If you want to dig deeper however or tackle huge amounts of big data utilizing unconventional methods or AI then you can predict
behaviour in unprecedented ways using machine learning techniques and tools. Both techniques are useful for different purposes.
Traditional methods are better suited for traditional data while machine learning will have better results when it comes to tackling big data.
Techniques for Working with Traditional Data
Data Collection: The gathering of raw data is referred to as data collection. An example would be the use of surveys asking people to rate how much they like or dislike a product or experience on a scale of 1 to 10.
Class Labelling: Another technique is class labelling. This involves labelling the data point with the correct data type, or arranging data by category. One such category is numerical: for example, storing the number of goods sold daily. These are numbers which can be manipulated, such as computing the average number of goods sold per day or month. The other label is categorical: here you are dealing with information that cannot have mathematical manipulations, for example a person's profession or place of birth.
Data Cleansing: The goal of data cleansing is to deal with inconsistent data. This can come in various forms. Say you are provided with a data set containing the US states and a quarter of the names are misspelled; in this situation certain techniques must be performed to correct these mistakes.
Visualisation: E-R diagrams and relational schemas.
Real Life Examples of Traditional Data
Consider basic customer data as an example of the difference between a numerical and a categorical variable.
• The first column shows the IDs of the different customers. These numbers cannot be meaningfully manipulated: calculating an average ID is not something that would give you any sort of useful information. This means that even though they are numbers, they hold no numerical value and therefore represent categorical data.
• Now focus on the last column. This shows how many times each customer filed a complaint. These numbers are easily manipulated: adding them all together to give a total number of complaints is useful information. Therefore they are numerical data.
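As a quick illustrative sketch (assuming Python with pandas; the column names and values are invented for this example), this is how the distinction can be enforced in code:

import pandas as pd

# Hypothetical customer table: IDs look numeric but carry no numerical value.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],   # a label, i.e. categorical data
    "complaints":  [0, 2, 5],            # a true numerical variable
})

# Store the ID as a category so nobody averages it by accident.
df["customer_id"] = df["customer_id"].astype("category")

print(df["complaints"].mean())   # meaningful: average complaints per customer
print(df.dtypes)                 # customer_id is now 'category'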
Text data mining represents the process of deriving valuable information from unstructured text.
Suppose you have a database which has stored information about marketing expenditure.
It may contain information from academic papers, blogs, articles, online platforms, private Excel files and more.
This means you will need to extract marketing expenditure information from many sources. This technique can find the information you need without much of a problem.
Data masking: If you want to maintain a credible business or governmental activity, you must preserve confidential information. However, when personal information is shared online, it doesn't mean that it can't be touched or used for analysis. Instead you must apply data masking techniques, such as data shuffling, so you can analyse the information without compromising private details.
Masking can be quite complex. It conceals the original data with random and false data, allowing you to conduct analysis while keeping all confidential information in a secure place. An example of applying data masking to big data is through what we call confidentiality-preserving data mining techniques.
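A minimal sketch of data shuffling in Python (the names and salaries are made up, and this is an illustration rather than a production-grade masking scheme):

import random

names    = ["Alice", "Bob", "Carol", "Dan"]
salaries = [52000, 61000, 47000, 58000]

masked = salaries.copy()
random.shuffle(masked)   # same values, random order

# The distribution is preserved (the mean is unchanged),
# but no salary can be tied back to a specific person.
print(sum(masked) / len(masked))
print(list(zip(names, masked)))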
Real Life Examples of Big Data
- Facebook keeps track of its users' names, personal data, photos, videos, recorded messages and so on. This means that its data has a lot of variety, and with over 2 billion users worldwide, the volume of data stored on its servers is tremendous.
- Facebook requires real-time reporting of the aggregated, anonymised voice of its users, and it applies many analytical tools in its mobile applications.
- This means the company is investing in boosting its real-time data processing powers, increasing the velocity of its data.
- Let's take financial trading data as another example: what happens when we record the stock price every five seconds, or every single second?
- We get a data set that is incredibly voluminous, requiring significantly more memory, disk space and various techniques to extract meaningful information from it. Data like this would also be considered big data.
Business Intelligence (BI) Techniques
Let's assume your data has been pre-processed and is ready for
analysis. It is beautifully organized. This means you are ready to
enter the realm of business intelligence.
The job of a business intelligence analyst requires her to understand
the essence of a business and strengthen that business through the
power of data.
So here we have the terms used to measure business performance:
- Quantification: the process of representing observations as numbers.
- Measure: the accumulation of observations to show some information.
- Metric: a metric refers to a value that derives from one or more measures and aims at gauging business performance or progress.
- KPI: a key performance indicator, i.e. a metric tied to your business objectives.
- BI can be used for price optimization. Hotels use price optimization very effectively by raising the price of a room at periods when many people want to visit the hotel, and by reducing it to attract visitors when demand is low; in this way they can greatly increase their profits.
- In order to competently apply such a strategy, they must extract the relevant information in real time and compare it with historical data. BI allows you to adjust your strategy to past data as soon as it is available.
• In addition, don't forget the robot archer was an abstract depiction of what a machine learning model can do.
• In reality there is no robot; the model will be a highly complex mathematical formula, the arrows will be a data set, and the goals will be various and quantifiable.
• Here are the most notable approaches you will encounter when talking about machine learning: support vector machines, neural networks, deep learning, random forest models and Bayesian networks are all types of supervised learning.
• There are neural networks that can be applied to an unsupervised type of machine learning, but k-means is the most common unsupervised approach.
• By the way, you may have noticed we have placed deep learning in both categories.
• This is a relatively new, revolutionary computational approach which is acclaimed as the state of the art in ML today.
• Describing it briefly, we can say it is fundamentally different from the other approaches.
• However, it has a broad practical scope of application in all ML areas because of the extremely high accuracy of its models.
• Note that deep learning is also divided into supervised, unsupervised and reinforcement learning, so it solves the same problems but in a conceptually different way.
Real Life Examples of Machine Learning (ML)
The financial sector and banks have ginormous data sets of credit card transactions.
Unfortunately, banks face issues with fraud daily. They are tasked with preventing fraudsters from acquiring customer data, and in order to keep customers' funds safe they use machine learning algorithms.
They take past data, and because they can tell the computer which transactions in their history were legitimate and which were found to be fraudulent, they can label the data as such.
So through supervised learning they train models that detect fraudulent activity. When these models detect even the slightest probability of theft, they flag the transactions and prevent the fraud in real time, although no one in the sector has reached a perfect solution.
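A hedged sketch of that workflow, assuming Python with scikit-learn and using invented toy data in place of real transaction records:

from sklearn.linear_model import LogisticRegression

# Toy features: [amount, transactions_in_last_hour]; labels: 1 = fraudulent.
X = [[20, 1], [15, 2], [900, 30], [25, 1], [850, 25], [30, 2]]
y = [0, 0, 1, 0, 1, 0]

# Supervised learning: the model is trained on labelled past transactions.
model = LogisticRegression().fit(X, y)

# Score a new transaction and flag it if the fraud probability is high.
prob_fraud = model.predict_proba([[700, 20]])[0][1]
print("flag" if prob_fraud > 0.5 else "allow", round(prob_fraud, 3))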
Another example of using supervised machine learning with labelled data can be found in client retention.
A focus of any business, be it a global supermarket chain or an online clothing shop, is to retain its customers.
But the larger a business grows, the harder it is to keep track of customer trends. A local corner-shop owner will recognize and get to know their most loyal customers. They will offer them exclusive discounts to thank them for their custom and, by doing so, keep them returning.
On a larger scale, companies can use machine learning and past labelled data to automate this practice.
With this they can know which customers may purchase goods from them. This means the store can offer discounts and a personal touch in an efficient way, minimizing marketing costs and maximizing profits.
Popular Data Science Tools
STATISTICS
• Statistics is the discipline that
concerns the collection,
organization, analysis,
interpretation, and
presentation of data.
- By using probability and statistical data, decision makers can predict how likely each outcome is and make the right call for their firm.
Example:
- Consider event A: flipping a coin and getting heads. In this case heads is our only preferred outcome. Assuming the coin doesn't just somehow stay in the air indefinitely, there are only two possible outcomes: heads or tails. This means that our probability would be one half, so we write P(A) = 1/2.
- Now imagine we have a standard six-sided die and we want to roll a four. We have a single preferred outcome (getting a 4), but this time the total number of possible outcomes is 6. Therefore the probability of this event would be P(A) = 1/6.
- For two independent events, the likelihood of both occurring is the product of their probabilities. For instance, the likelihood of getting the Ace of Spades equals the probability of getting an ace times the probability of getting a spade:
P(Ace ♠) = P(Ace) · P(♠)
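As a quick check: P(Ace) = 4/52 and P(♠) = 13/52, so P(Ace ♠) = (4/52) × (13/52) = 1/52, which matches the direct count of one favourable card out of 52.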
Problems:
1) Alice has 2 kids and one of them is a girl. What is the probability that the other child is also a girl? You can assume that there are an equal number of males and females in the world.
2) A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not getting 4 on the second roll?
3) Cross-fertilizing a red and a white flower produces red flowers 25% of the time. Now we cross-fertilize five pairs of red and white flowers and produce five offspring. What is the probability that there are no red flower plants in the five offspring?
Computing Expected Values
• Expected values represent what we expect the outcome to be if
we run an experiment many times to fully grasp the concept.
• So first what is an experiment? Imagine we don't know the
probability of getting heads when flipping a coin. We are going
to try to estimate it ourselves. So, we toss a coin several times
after doing one flip and recording the outcome. We complete a
trial by completing multiple trials. We are conducting an
experiment.
• For example, if we toss a coin 20 times and record the 20
outcomes that entire process is a single experiment with 20
trials.
• The probabilities we get after conducting experiments are
called experimental probabilities.
• Generally, when we are uncertain what the true probabilities
are or how to compute them.
• We like conducting experiments the experimental probabilities
we get are not always equal to the theoretical ones but are a
good approximation.
• The formula we use to calculate experimental probabilities is
similar to the formula applied for the theoretical ones, it is
simply the number of successful trials divided by the total
number of trials now that we know what an experiment is.
• The expected value of an event A, denoted E(A), is the outcome we expect to occur when we run an experiment.
• To calculate the expected value, we take the value of every element in the sample space, multiply it by its probability, and then add all of those up.
• For example, if our random variable were the number obtained by rolling a six-sided die, the expected value would be
E(X) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 3.5
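A small simulation sketch in Python (the trial count is arbitrary) showing that the experimental average of many die rolls approaches the theoretical expected value of 3.5:

import random

trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]

experimental = sum(rolls) / trials                    # experimental estimate
theoretical = sum(v * (1 / 6) for v in range(1, 7))   # 3.5, by the formula above

print(experimental, theoretical)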
• Suppose you are trying to hit a target with a bow and arrow. The target has three layers: the outermost one is worth 10 points, the second one is worth 20 points and the innermost is worth 100. You have practiced enough to always be able to hit the target, but not so much that you hit the centre every time.
• Suppose the probability of hitting each layer is as follows: 0.5 for the outermost, 0.4 for the second and 0.1 for the centre.
• The expected value for this example would be
E(X) = 0.5·10 + 0.4·20 + 0.1·100 = 23
- But we can never get 23 points with a single shot. So why is it important to know what the expected value of an event is? We can use expected values to make predictions about the future based on past data.
- We frequently make predictions using intervals instead of specific values, due to the uncertainty the future brings.
Example 1:
A local club plans to invest $10000 to host a baseball game. They expect to sell tickets worth $15000. But if it rains on the day of the game, they won't sell any tickets and the club will lose all the money invested. If the weather forecast for the day of the game is a 20% possibility of rain, is this a good investment?
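One way to check this (treating the $15000 as total ticket revenue against the $10000 cost): E(profit) = 0.8 × (15000 − 10000) + 0.2 × (−10000) = 4000 − 2000 = $2000. The expected profit is positive, so under these assumptions the investment looks worthwhile.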
Example 2:
A company makes electronic gadgets. One out of every 50 gadgets is faulty, but the company doesn't know which ones are
faulty until a buyer complains. Suppose the company makes a $3 profit on the sale of any working gadget, but suffers a loss
of $80 for every faulty gadget because they have to repair the unit. Check whether the company can expect a profit in the
long term. Write the probability distribution.
Example 3:
Ahmed is playing a lottery game where he must pick 2 numbers from 0 to 9, followed by an English letter (from the 26-letter alphabet). He may choose the same number both times.
If his ticket matches the 2 numbers and 1 letter drawn in order, he wins the grand prize and receives $10405. If just his
letter matches but one or both of the numbers do not match, he wins $100. Under any other circumstance, he wins
nothing. The game costs him $5 to play. Suppose he has chosen 04R to play. What is the expected net profit from playing
this ticket?
Frequency
- Sometimes the result of the expected value is confusing or doesn't tell us much.
- Consider an example: throwing two standard six-sided dice and adding up the numbers on top.
- We have six options for what the result of the first die could be, and regardless of the number we roll, we still have six different possibilities for what we can roll on the second die. That gives us a total of 6 × 6 = 36 different outcomes for the two rolls.
- We can write out the results in a six-by-six table where we record the sum of the two dice.
- You can clearly see that we have repeating entries along the secondary diagonal and all diagonals parallel to it.
- Notice how 7 occurs 6 times in the table. This means we have six favourable outcomes for a sum of 7. There are 36 possible outcomes, so the chance of getting a seven is
P(7) = 6/36 = 1/6
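The same table can be generated by brute force; a short Python sketch that counts how often each sum occurs among the 36 outcomes:

from collections import Counter

sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for total in sorted(sums):
    print(total, f"{sums[total]}/36")
# The sum 7 appears 6 times, so P(7) = 6/36 = 1/6, as computed above.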
Probability Frequency Distribution
• A probability frequency distribution
is a collection of the probabilities for
each possible outcome.
• If we write out all the outcomes in
ascending order and the frequency of each
one we construct a frequency distribution
table. By examining this table we can
easily see how the frequency changes with
the results.
• We need to transform the frequency of
each outcome into a probability. Knowing
the size of the sample space we can
determine the true probabilities for each
outcome. We simply divide the frequency
for each possible outcome by the size of
the sample space.
• A collection of all the probabilities for the
various outcomes is called a probability
frequency distribution.
• We can express this probability
frequency distribution through a table or
a graph.
• On the graph of the probability frequency distribution, the x-axis depicts the different possible sums we can get and the y-axis represents the probability of getting each outcome.
• When making predictions, we generally want our interval to have the highest probability.
• We can see that the individual outcomes with the highest probability are the ones with the highest bars in the graph; usually the highest bars form around the expected value.
• Thus the values around it will also be the values with the highest probability.
Events and Their Complements
• A complement of an event is everything the event is not.
• If we add the probabilities of different events, we get their sum of probabilities; and if we add up the probabilities of all the possible outcomes of an event, we should always get one.
• Example:
P(heads) + P(tails) = 1
1) Permutations
- For any combination of sandwich and side we have two ways of completing the menu; thus we would have a total of three dishes times two sides times two drinks, or 12 different lunch menus at the diner:
3 × 2 × 2 = 12
- This is important because it shows us how many different possibilities are available, despite the choices for each part seeming limited.
- Furthermore, this allows us to determine the appropriate amount of time it would take for such a task to be completed when the components are simply too many, and we can remove several of the options to tremendously decrease the workload.
- The way to calculate the total number of combinations for these kinds of questions is simply to multiply the number of options available for each individual event.
Example:
Consider that to win a lottery you need to satisfy two independent events:
1) Correctly guess the "Powerball" number (from 1 to 26)
2) Correctly guess 5 regular numbers (from 1 to 69)
Find the probability of winning with a single ticket.
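One way to compute this, sketched in Python (assuming the rules stated above: 5 regular numbers chosen without regard to order from 69, plus one Powerball from 26):

from math import comb

regular = comb(69, 5)          # ways to choose the 5 regular numbers
powerball = 26                 # choices for the Powerball

tickets = regular * powerball
print(tickets, 1 / tickets)    # 292,201,338 combinations, p ≈ 3.4e-9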
In how many different ways could 23 children sit on 23 chairs in a Maths Class? If you have 4 lessons a week and there are 52
weeks in a year, how many years does it take to get through all different possibilities? Note: The age of the universe is about
14 billion years.
Unfortunately, you can’t remember the code for your four-digit lock. You only know that you didn’t use any digit more than
once. How many different ways do you have to try? What do you conclude about the safety of those locks?
In a shop there are five different T-shirts you like, coloured red, blue, green, yellow and black. Unfortunately you only have
enough money to buy three of them. How many ways are there to select three T-shirts from the five you like?
Four children, called A, B, C and D, sit randomly on four chairs. What is the probability that A sits on the first chair?
Sets And Events
- Every event has a set of outcomes that satisfy it. These are the favourable outcomes.
- For example, the event could be "even", and the set of values would consist of 2, 4, 6 and all other even numbers.
- However, values of a set do not always have to be numerical. For instance, an event can be being a
member of the European Union. Values like France or Germany would be a part of this set and
values like USA or Japan would not.
- Convention dictates that we use upper case letters to denote these sets and lower-case letters to
express individual elements.
- In the numerical example, uppercase X will express the set of all even numbers and lowercase x an individual even number.
- Any set can be either empty or have values in it.
- If it does not contain any values, we call it the empty set or null set and denote it with ∅
- The non-empty sets can be finite or infinite depending on the number of elements they have when
working with them.
- We often want to express that an element is part of a set; the symbol we use to denote that is the "belongs to" symbol (∈).
• x∈X
- We read it as x is an element of or simply in set X.
- But what if we want to show that an element is not contained in a set then we can use the same
notations but simply cross out the symbol with a single diagonal line like,
• x∉X
- So the statements now mean "x IS NOT IN X" and "X does not contain x".
- A subset is a set that is fully contained in another set.
- If every element of A is also an element of B, then A is a subset of B. Note, however, that not all elements of B are necessarily part of A.
- Remember that every set contains at least two subsets: itself and the null set.
- Take events A and B, for example. We express the set of values that satisfy each of them as circles: one for A and another one for B.
- Any element that is part of either set will be represented by a point in the appropriate circle. Since we can have additional events, the more events we have, the more circles we draw.
- Let's focus only on A and B: the two circles can either not touch at all, intersect, or one can completely overlap the other.
- If the two circles never touch, then the two events can never happen simultaneously. Essentially, event A occurring guarantees that event B is not occurring, and vice versa. For example, getting a diamond and getting a heart would be such a situation: if we get a heart, we can't get a diamond, and if we get a diamond, we can't get a heart, since each card has exactly one suit.
- Now, if the circles intersect, it means that the two events can occur at the same time.
- Imagine we draw a card from a standard deck of playing cards. If event A is drawing a diamond and event B is drawing a queen, the area where they intersect is represented solely by the queen of diamonds; the remaining area of A represents all other diamonds, whereas the area of B outside the intersection represents all other queens.
- The third case happens if one circle completely overlaps the other. That means that one event can only ever occur if the other one does as well. For instance, event A could be drawing a red card and event B could be drawing a diamond; then the circle of B is completely contained inside A. We can only ever get a diamond if we get a red card: notice that if the card we drew is black, it cannot be a diamond.
- Thus, if event A does not occur, then neither does event B. However, because we can draw a heart, it is possible to get a red card that isn't a diamond.
- Therefore, event B not occurring does not guarantee event A not occurring. In short, if an outcome is not part of a set, it cannot be part of any of its subsets; however, an outcome not being part of some subset does not exclude it from the entirety of the greater set.
- The intersection of two events occurs when we want both A and B to happen at the same time.
- Graphically, the intersection is exactly as the name suggests: the area where the events intersect.
- It consists of all the outcomes that are favourable for both event A and event B simultaneously, and we denote it as A ∩ B.
- Consider the examples: the intersection of all hearts and all diamonds is the empty set, as there are no outcomes which satisfy both events simultaneously. We would write this as
A ∩ B = ∅
- The intersection of all diamonds and all queens is represented by the queen of diamonds: that card is the only one that satisfies being a queen and being a diamond at the same time.
- In the example with red cards and diamonds, the intersection of the two is simply all diamonds, because any diamond is simultaneously red and a diamond. We would write this as
A ∩ B = B
- Remember: we use intersections only when we want to denote instances where both events A and B happen simultaneously.
If we only require one of A or B to occur, regardless of which one, that is the same as asking either A or B to happen; in such cases we need to find the union of A and B.
The union of two sets is a combination of all outcomes preferred for either A or B.
Let us examine what the unions would be in the three different cases. If the sets A and B do not touch at all, then their intersection is the empty set, therefore their union is simply their sum:
(A U B) = A + B
Going back to the card example, the union of hearts and diamonds would be all red cards. No card can have multiple suits, so we need not worry about counting a card twice; therefore the number of red cards equals the union of cards which are either diamonds or hearts.
If the events intersect, the area of the union is represented by the sum of the two sets minus their intersection:
(A U B) = A + B – (A ∩ B)
That is because if we simply added up the areas of the two sets, we would be double counting every element that is part of the intersection.
In fact, the union formula we just showed is universally true, regardless of the relationship between A and B.
So what happens if B is a subset of A? In that case the union is simply the entire set A.
Imagine event A is being from the U.S. and event B is being from California. If you talk about all the people who are either from California or from the United States, you are simply talking about all the people from the USA.
Remember that the intersection of these two sets is equal to the entire set B, so the intersection of A and B represents all the people from California.
If we plug this into the formula, that gives us the following statement:
the union of all people from California or the United States = all Californians + all Americans − all Californians.
We get exactly what we expect: the union of people from either California or the United States equals the entire population of the USA.
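A quick sanity check of the union formula, with Python sets standing in for the earlier card example (ranks and suits are encoded as simple values purely for illustration):

# 52 cards: ranks 1..13 (12 stands for the queen), suits S, H, D, C.
deck = {(rank, suit) for rank in range(1, 14) for suit in "SHDC"}
diamonds = {card for card in deck if card[1] == "D"}
queens = {card for card in deck if card[0] == 12}

# |A U B| = |A| + |B| - |A ∩ B|
by_formula = len(diamonds) + len(queens) - len(diamonds & queens)
print(by_formula, len(diamonds | queens))   # both 16: 13 diamonds + 4 queens - 1 overlap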
- Mutually exclusive sets are sets which are not allowed to have any overlapping elements; graphically, their circles never intersect.
- Mutually exclusive sets have the empty set as their intersection.
- Therefore, if the intersection of any number of sets is the empty set, they must be mutually exclusive, and vice versa.
- What about their union? If some sets are mutually exclusive, their union is simply the sum of all the separate individual sets.
- Sets have complements, which consist of all values that are part of the sample space but not part of the set.
- Consider a set consisting of all the odd numbers; its complement would be the set of all even numbers. This means a set and its complement are always mutually exclusive.
- However, not all mutually exclusive sets are complements.
- For instance, imagine A is the set of all even numbers and B is the set of all numbers ending in five. We know that any number ending in five is odd.
- So these two sets are definitely mutually exclusive.
- However, the complement of all evens is all odds, not just the ones ending in 5.
- Therefore a number like 13 would be part of the complement but not of the set B.
Dependence and Independence of Sets
- We can have dependent events, as their probabilities vary when conditions change.
- For instance, take the probability of drawing the Queen of Spades. Normally the answer is 1/52, since we have exactly one favourable outcome and fifty-two elements in the sample space.
- Now imagine we know that the card we drew was a spade. Our chances of getting the Queen of Spades suddenly go up, since the new sample space contains only the 13 cards from that suit. Therefore the probability becomes 1/13.
- Now imagine a different scenario: instead of a spade, we know our card is a queen.
- The sample space then consists of only 4 cards, so the probability of drawing the Queen of Spades becomes 1/4.
- With this example you can clearly see how the probability of an event changes depending on the information we have.
- Let's introduce some new notation, as usual. Suppose we have two events A and B. To express the probability of getting A, given that B has occurred, we use the notation P(A|B), which we read as "the probability of A given B".
- Going back to our card example, event A is drawing the Queen of Spades and event B is drawing a spade.
- Therefore P(A|B) represents the probability of drawing the Queen of Spades if we know the card is a spade, so
P(A|B) = 1/13
- Similarly, if event C represents getting a queen, then P(A|C) expresses the likelihood of getting the Queen of Spades assuming we drew a queen. Thus
P(A|C) = 1/4
- We call this probability the conditional probability, and we use it to distinguish dependent from independent events.
• Example:
1) In a manufacturing unit, 3 parts from an assembly are selected. You are observing whether they are defective or non-defective.
• Determine
a) the sample space;
b) the probability of the event of getting 2 defective parts, if there is a 1 in 10 chance of a part being defective.
The Conditional Probability Formula
• For two events A and B with P(B) > 0, the conditional probability of A given B is
P(A|B) = P(A ∩ B) / P(B)
Ex 1:
Suppose you draw two cards from a deck (without replacement) and you win if you get a jack followed by an ace. What is the probability of winning, given we know that you got a jack on the first turn?
Ex 2:
Suppose you have a jar containing 6 marbles: 3 black and 3 white. What is the probability of drawing a black marble, given the first one drawn was black too (without replacement)?
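For reference, under the stated setups: in Ex 1, given the first card was a jack, winning only requires drawing an ace from the remaining 51 cards, so P(win | jack first) = 4/51 ≈ 0.078. In Ex 2, after one black marble is removed, 2 of the remaining 5 marbles are black, so P(second black | first black) = 2/5 = 0.4.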
Marginal Probability
• A contingency table consists of rows and columns of two attributes at different levels, with frequencies or counts in each of the cells.
• It is a matrix of frequencies assigned to rows and columns.
• The term marginal is used to indicate that the probabilities are calculated using a contingency table.
• It is also called a joint probability table.
A survey of 200 families was conducted. Information regarding family income per year and whether the family buys a car is given in the following table:

Family        | Income below Rs 10 lakh | Income above Rs 10 lakh | Total
Buyer of car  |                      38 |                      42 |    80
Non-buyer     |                      82 |                      38 |   120
Total         |                     120 |                      80 |   200
a) What is the probability that a randomly selected family is a buyer of a car?
b) What is the probability that a family is both a buyer of a car and belongs to the income group of Rs 10 lakh and above?
c) A family selected at random is found to belong to the income group of Rs 10 lakh and above. What is the probability that this family is a buyer of a car?
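For reference: a) P(buyer) = 80/200 = 0.40; b) P(buyer ∩ income above Rs 10 lakh) = 42/200 = 0.21; c) P(buyer | income above Rs 10 lakh) = 42/80 = 0.525.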
Example 3:
A research group collected yearly data on road accidents, with respect to the conditions of following and not following the traffic rules, in an accident-prone area. They are interested in calculating the probability of an accident given that a person followed the traffic rules. The table of the data is given as follows:
Accident 50 500
Contingency tables often provide summarized statistics we use to analyse and interpret how certain factors affect one another.
An example will illustrate this better. Imagine you conducted a survey where 100 men and women of all ages were asked if they eat meat.
The results are summarized in a table: 15 of the 47 women that participated are vegetarian, as are 29 out of the 53 men.
- Now, if A represents being vegetarian and B represents being a woman, then P(A|B) and P(B|A) express different events.
- The former, which equals 15/47, represents the likelihood of a woman being vegetarian, while the latter, which equals 15/44, expresses the likelihood of a vegetarian being a woman. Since 15/44 is greater than 15/47:
P(A|B) = 15/47 ≠ P(B|A) = 15/44
- It is more likely for a vegetarian to be female than for a woman to be vegetarian.
- This shows you that in probability theory things are never straightforward.
- Now we will discuss an important concept: the law of total probability.
- Imagine A is the union of finitely many events B1, B2, ..., Bn:
A = B1 U B2 U ... U Bn
- This law dictates that P(A) is the sum of all the conditional probabilities of A given some B, multiplied by the probability of the associated B:
P(A) = P(A|B1)·P(B1) + P(A|B2)·P(B2) + ... + P(A|Bn)·P(Bn)
For example, suppose that in an office:
P(Tableau) = 38%
P(SQL) = 45%
P(Tableau U SQL) = 66%
Then P(Tableau ∩ SQL) = P(Tableau) + P(SQL) − P(Tableau U SQL) = 0.38 + 0.45 − 0.66 = 0.17.
Transforming this into a probability gives us a likelihood of 0.17 for somebody in the office to be able to proficiently implement both SQL and Tableau.
Example:
1) From a pack of well shuffled cards, a card is picked at random.
a) What is the probability that the selected card is a king or a queen?
b) What is the probability that the selected card is a king or a diamond?
2) The probability that you will get an A grade in quantitative methods is 0.7. The probability that you get an A grade in marketing is 0.5. Assuming these 2 courses are independent, compute the probability that you will get an A grade in both these subjects.
The Multiplication Law
• The multiplication law: P(A ∩ B) = P(A|B) · P(B)
- Suppose the probability of event B is 0.5 and the probability of event A given B, P(A|B), is 0.8.
- This suggests that event B occurs 50% of the time, and event A also appears in 80% of those 50% when B occurred, so P(A ∩ B) = 0.8 × 0.5 = 0.4.
Similarly, the probability of not drawing a spade first and then drawing a spade on the second turn is (39/52) × (13/51) ≈ 0.191; so we have a probability of 0.191 of drawing a spade on the second turn while not having drawn one initially.
• Example:
From a pack of cards, 2 cards are drawn in succession, one after another. After every draw, the selected card is not replaced. What is the probability that in both draws you will get spades?
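For reference: P(both spades) = P(spade first) × P(spade second | spade first) = (13/52) × (12/51) = 1/17 ≈ 0.059.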
• Bayes' Law
- One of the most prominent examples of using Bayes' rule is in medical research, when trying to find a causal relationship between symptoms.
- Knowing both conditional probabilities between the two helps us make more reasonable arguments about which one causes the other.
- For instance, there is a certain correlation between patients with back problems and patients wearing glasses.
- More specifically, 67% of people with spinal problems wear glasses (P(VI|BP) = 67%), while only 41% of patients with eyesight issues have back pains (P(BP|VI) = 41%).
- These conditional probabilities suggest that it is much more likely for someone with back problems to wear glasses than the other way around, even though we cannot find a direct causal link between the two.
- There exist some arguments to support such claims.
- For instance, most patients with back pain are either elderly or work a desk job where they remain stationary for long periods.
- Old age and a lot of time in front of a desktop computer can have a deteriorating effect on an individual's eyesight; however, many healthy and young individuals wear glasses from a young age.
- In those cases, there is no other underlying factor that would suggest incoming back pains.
- Similarly, we can also apply Bayes' theorem in business. Let's explore a fictional scenario.
• Bayes' law: for two events A and B (with P(B) > 0),
P(A|B) = P(B|A) · P(A) / P(B)
Example 1: Bag I contains 4 white and 6 black balls, while Bag II contains 4 white and 3 black balls. One ball is drawn at random from one of the bags, and it is found to be black. Find the probability that it was drawn from Bag I.
Example 2:
You are planning a picnic today, but the morning is cloudy
∙ Oh no! 50% of all rainy days start off cloudy!
∙ But cloudy mornings are common (about 40% of days start cloudy)
∙ And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%)
What is the chance of rain during the day?
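Applying Bayes' law with the stated numbers: P(Rain|Cloud) = P(Cloud|Rain) · P(Rain) / P(Cloud) = (0.5 × 0.1) / 0.4 = 0.125, so about a 12.5% chance of rain.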
• Let us say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:
• P(Fire|Smoke) means how often there is fire when we can see smoke
P(Smoke|Fire) means how often we can see smoke when there is fire
∙ dangerous fires are rare (1%)
∙ but smoke is fairly common (10%) due to barbecues,
∙ and 90% of dangerous fires make smoke
• We have:
P(Fire|Smoke) = P(Smoke|Fire) · P(Fire) / P(Smoke) = (0.90 × 0.01) / 0.10 = 0.09
So when we see smoke, the probability of a dangerous fire is 9%.
- The idea of Bayes' theorem is that it switches which event is being conditioned on: it switches between A given B and B given A.
- In the spam mail example, we want to find out whether an email that we are receiving is spam or not.
- Gmail identifies spam emails and moves them to a specific folder.
- Suppose we want to build the same application: based on the contents of a mail, we need to find out whether the mail is spam or not.
- So we can design the program like this: if I know the words of the email, then I can tell you whether it is spam or not, i.e. I want the probability of spam given the words.
- It means: if I tell you the words, can you tell me whether this is spam or not?
- We will solve this problem by finding the opposite conditional, i.e. the probability of the words given spam, as that is easier for us to do.
- If I know that an email is spam, then I know the distribution of its words, and if I know that it is not spam, then I know the distribution of words as well.
- So now we twist the problem with Bayes' law and say: "if you give me the words, I will tell you whether it is spam or not":
P(spam|words) = P(words|spam) · P(spam) / P(words)
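A minimal sketch of that "twist" using scikit-learn's multinomial naive Bayes (assumed available); the tiny corpus is invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "cheap money win", "lunch at noon"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # words -> counts

# Learns P(words | spam) and P(spam), then applies Bayes' law to classify.
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["win cheap money"])))   # -> [1]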
Probability – Distributions
- A distribution shows the possible values a variable can take and how
frequently they occur
- Let us introduce some important notation. The uppercase Y represents the actual outcome of an event and lowercase y represents one of the possible outcomes.
- One way to denote the likelihood of reaching a particular outcome y is
P(Y = y) or p(y)
- For example, uppercase Y could represent the number of red marbles we draw out of a bag, and lowercase y would be a specific number like 3 or 5; then we express the probability of getting exactly 5 red marbles as
P(Y = 5) or p(5)
- Since p(y) expresses the probability for each distinct outcome, we call it the probability function.
- So probability distributions, or simply probabilities, measure the likelihood of an outcome depending on how often it is featured in the sample space.
- The probability frequency distribution of an event is obtained by taking the frequency of each unique value and dividing it by the total number of elements in the sample space.
- Usually that is the way we construct these probabilities when we have a finite number of possible outcomes; if we had an infinite number of possibilities, then recording the frequency of each one would become impossible, because there are infinitely many of them.
- Imagine you are a data scientist and want to analyse the time it takes for your code to run: any single compilation could take anywhere from a few milliseconds to several days, though often the result will be between a few milliseconds and a few minutes.
- One idea which we will use a lot is that any value between µ − σ and µ + σ falls within one standard deviation of the mean.
- The more congested the middle of the distribution, the more data falls within that interval; similarly, the less data falls within the interval, the more dispersed the data is.
- It is important to know that a constant relationship exists between the mean and the variance of any distribution. The variance equals
σ² = E((Y − µ)²) = E(Y²) − µ²
Types of Probability Distributions
- Here we are going to discuss the various types of probability distributions and what kinds of events they can be used to describe.
- Certain distributions share features, so we group them into types: some, like rolling a die or picking a card, have a finite number of outcomes and follow discrete distributions; others, like recording time and distance in track and field, have infinitely many outcomes and follow continuous distributions.
- For each distribution, we will focus on an important aspect of it: when is it used?
- Before we get into the specifics, you will need to know the proper notation we implement when defining distributions: we start off by writing down the variable name for our set of values, followed by the tilde sign; this is superseded by a capital letter depicting the type of the distribution and some characteristics of the dataset in parentheses, e.g.
X ~ N(µ, σ²)
Discrete Distributions
- A discrete distribution is a distribution of data in statistics that
has discrete values. Discrete values are countable, finite,
non-negative integers, such as 1, 10, 15, etc.
- A discrete distribution, as mentioned earlier, is a distribution of
values that are countable whole numbers. On the other hand, a
continuous distribution includes values with infinite decimal
places. An example of a value on a continuous distribution would
be “pi.” Pi is a number with infinite decimal places (3.14159…).
- Both distributions relate to probability distributions, which are
the foundation of statistical analysis and probability theory.
- A probability distribution is a statistical function that is used to
show all the possible values and likelihoods of a random
variable in a specific range.
- The range would be bound by maximum and minimum values,
but the actual value would depend on numerous factors.
- There are descriptive statistics used to explain where the
expected value may end up. Some of which are:
∙ Mean (average)
∙ Median
∙ Mode
∙ Standard deviation
∙ Skewness
∙ Kurtosis
- Consider an example where you are counting the number of people walking
into a store in any given hour. The values would need to be countable, finite,
non-negative integers.
- It would not be possible to have 0.5 people walk into a store, and it would not be possible to have a negative number of people walk into a store.
- Observing such a discrete distribution of collected data points, we might see that there were five hours where between one and five people walked into the store, ten hours where between five and nine people walked into the store, and so on.
- The probability distribution gives a visual representation of the probability that a certain number of people will walk into the store at any given hour. Without doing any quantitative analysis, we can observe that there is a high likelihood that between 9 and 17 people will walk into the store at any given hour.
- Types of Discrete distributions are:
• Discrete Uniform distribution
• Bernoulli Distribution
• Binomial Distribution
• Poisson Distribution
Discrete Distributions: The Uniform Distribution
- It’s when all the distinct random variables have the exact same
probability values, so everything is constant or just a number.
- In a uniform probability distribution, all random variables have the
same or uniform probability; thus, it is referred to as a discrete uniform
distribution.
- Imagine a box of 12 donuts sitting on the table, and you are asked to
randomly select one donut without looking. Each of the 12 donuts has
an equal chance of being selected.
- Therefore, the probability of any one donut being chosen is the same or
uniform.
A binomial distribution graph where the probability of success does not equal the probability of failure is skewed; when the probability of success equals the probability of failure, the graph of the binomial distribution is symmetric.
1. A coin is tossed four times. Calculate the probability of obtaining more heads than tails.
2. An agent sells life insurance policies to five equally aged, healthy people. According to recent data, the probability of
a person living in these conditions for 30 years or more is 2/3. Calculate the probability that after 30 years:
1. All five people are still living.
2. At least three people are still living.
3. Exactly two people are still living.
Example:
Let X be the number of heads flipped in n=10 flips of a coin. If the coin is fair, then it is clear that X ~ Bin(n=10, p=0.500)
i) What is probability of getting 3 heads?
ii) What is probability of getting 3 tails?
iii) What is probability of getting at most 3 heads or at least 7 heads?
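A sketch of these computations with scipy (assumed available):

from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(3, n, p))                      # i)  P(X = 3) ≈ 0.117
print(binom.pmf(7, n, p))                      # ii) 3 tails means 7 heads; same value by symmetry
print(binom.cdf(3, n, p) + binom.sf(6, n, p))  # iii) P(X <= 3) + P(X >= 7) ≈ 0.344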
Example:
Given two randomly selected countries, the probability of a war between them in a given year is one in ten thousand (p = 0.0001). Calculate the probability of at least one war in a given year, given that there are 194 countries in the world.
Discrete Distributions: The Poisson Distribution
- Suppose you work at a call centre; approximately how many calls do you get in a day? It can be any number. The total number of calls arriving at a call centre in a day is modelled by a Poisson distribution, and you can think of many more examples following the same course.
- Poisson Distribution is applicable in situations where events occur at random points of time and space
wherein our interest lies only in the number of occurrences of the event.
- A distribution is called Poisson distribution when the following assumptions are valid:
1. Any successful event should not influence the outcome of another successful event.
2. The probability of success over a short interval must equal the probability of success over a longer interval.
3. The probability of success in an interval approaches zero as the interval becomes smaller.
- Now, if any distribution validates the above assumptions, then it is a Poisson distribution. The notation used is X ~ Po(µ).
- Here, X is called a Poisson random variable and the probability distribution of X is called the Poisson distribution.
- The mean µ is the parameter of this distribution. Let µ denote the mean number of events in an interval of length t; µ is defined as λ times the length of that interval:
µ = λ·t
- Comparing the graphs of Poisson distributions with different means, it is perceptible that as the mean increases, the curve shifts to the right.
The mean and variance of X following a Poisson distribution:
Mean: E(X) = µ
Variance: Var(X) = µ
Example:
Births in a hospital occur randomly at an average rate of 1.8 births per hour. What is the probability of observing 4 births in a
given hour at the hospital?
If a random variable X follows a Poisson distribution and P(X = 2) = P(X = 3), find the mean and P(X = 0).
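A sketch of both computations with scipy (assumed available):

from scipy.stats import poisson

print(poisson.pmf(4, mu=1.8))   # P(4 births in an hour) ≈ 0.0723

# For the second problem: P(X=2) = P(X=3) forces mu^2/2! = mu^3/3!, i.e. mu = 3,
# so P(X=0) = e^(-3).
print(poisson.pmf(0, mu=3))     # ≈ 0.0498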
Continuous Distributions
❑ Variance
- Each defined random variable has a variance associated with it as well.
- This is a measure of the concentration of the observations of that random variable: it tells us how far spread the observations are from the mean.
- The standard deviation is also useful; it is equal to the square root of the variance.
- When calculating variance, the idea is to calculate how far each observation of the random variable is from its expected value, square it, then take the average of all of these squared distances.
- The formula for variance is as follows:
Var(X) = E((X − µ)²)
Example:
Verify whether the following function is a pdf. If yes, then obtain
i) P(X < 0.5) ii) P(X > 0.25) iii) P(0.2 < X < 0.4)
where
f(x) = 6x(1 − x) for 0 < x < 1
f(x) = 0 otherwise
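One way to verify the conditions and compute the requested probabilities, sketched with sympy (assumed available):

from sympy import symbols, integrate

x = symbols("x")
f = 6 * x * (1 - x)   # non-negative on (0, 1)

print(integrate(f, (x, 0, 1)))       # 1, so the total area checks out: f is a pdf
print(integrate(f, (x, 0, 0.5)))     # i)   P(X < 0.5) = 0.5
print(integrate(f, (x, 0.25, 1)))    # ii)  P(X > 0.25) ≈ 0.844
print(integrate(f, (x, 0.2, 0.4)))   # iii) P(0.2 < X < 0.4) ≈ 0.248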
Continuous Distributions: The Normal Distribution
- For starters, we define a normal distribution as N(µ, σ²), using a capital letter N followed by the mean and variance of the distribution.
- We read the following notation as: variable X follows a normal distribution with mean µ and variance σ².
• X ~ N(µ, σ²)
- When dealing with actual data, we would usually know the numerical values of µ and σ².
- The normal distribution frequently appears in nature as well as in
life in various shapes and forms.
- For example, the size of a fully grown male lion follows a normal distribution. Many records suggest that the average lion weighs between 150 and 250 kg, or roughly 330 to 550 pounds.
- Of course, specimens exist which fall outside of this range; however, lions weighing less than 150 or more than 250 kilograms tend to be the exception rather than the rule.
- Such individuals serve as outliers in our data set, and the more data we gather, the smaller the share of the data they represent.
- Now that you know what types of events follow a normal distribution, let us examine some of its distinct characteristics.
∙ For starters the graph of a normal distribution is bell shaped.
∙ Therefore, the majority of the data is centred around the mean.
• Thus, values further away from the mean are less likely to occur.
• Furthermore, we can see that the graph is symmetric with regard to the mean, which suggests that values equally far away from the mean in opposing directions are equally likely.
• Let's assume a population of animals in a zoo is known to
be normally distributed.
• Each animal lives to be 13.1 years old on average (mean),
and the standard deviation of the lifespan is 1.5 years.
• If someone wants to know the probability that an animal
will live longer than 14.6 years, they could use the
empirical rule.
• Knowing the distribution's mean is 13.1 years old, the
following age ranges occur for each standard deviation:
❖ One standard deviation (µ ± σ): (13.1 - 1.5) to (13.1 + 1.5), or 11.6 to 14.6
❖ Two standard deviations (µ ± 2σ): 13.1 - (2 x 1.5) to 13.1 + (2 x 1.5), or 10.1 to 16.1
❖ Three standard deviations (µ ± 3σ): 13.1 - (3 x 1.5) to 13.1 + (3 x 1.5), or 8.6 to 17.6
• The person solving this problem needs to calculate the total probability of the animal living 14.6 years or longer.
• The empirical rule shows that 68% of the distribution lies within one standard deviation, in this case, from 11.6 to 14.6
years.
• Thus, the remaining 32% of the distribution lies outside this range.
• One half lies above 14.6 and the other below 11.6. So, the probability of the animal living for more than 14.6 years is 16% (calculated as 32% divided by two).
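The empirical rule gives 16%; for comparison, here is a quick sketch (assuming scipy) of the exact normal tail probability, since 14.6 is exactly one standard deviation above the mean.

```python
# Exact tail probability for the zoo example above.
from scipy.stats import norm

mu, sigma = 13.1, 1.5
print(norm.sf(14.6, loc=mu, scale=sigma))  # ~ 0.1587, close to the empirical rule's 16%
```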
Continuous Distributions: The
Standard Normal Distribution
• If we add a number (say, three) to every element, we simply need to move the graph three places to the right; similarly, if we subtract a number from every element, we simply move our current graph to the left to get the new one.
Continuous Distributions: The Students' T Distribution
- We use the lowercase letter t to define a Student's t distribution, followed by a single parameter in parentheses called degrees of freedom: t(k).
- We read this next statement as: variable Y follows a Student's t distribution with 3 degrees of freedom.
Y ~ t(3)
- It is a small-sample approximation of the normal distribution: in instances where we would assume a normal distribution, were it not for the limited number of observations, we use the Student's t distribution.
- For instance, the average lap times for the entire season of a Formula One race follow a normal distribution, but the lap times for the first lap of the Monaco Grand Prix would follow a Student's t distribution.
Continuous Distributions: The Chi-Squared Distribution
Continuous Distributions: The Exponential Distribution
Continuous Distributions: The Logistic Distribution
- We denote a logistic distribution with the entire word logistic followed by two
parameters.
Logistic (µ, S)
- Its parameters are the mean and a scale parameter, like the one for the exponential distribution.
- We also refer to the mean parameter as the location, and we shall use the terms interchangeably.
- Thus, we read the statement below as: variable X follows a logistic distribution with location 6 and a scale of 3.
X ~ Logistic (6, 3)
- We often encounter logistic distributions when trying to determine how continuous
variable inputs can affect the probability of a binary outcome.
- This approach is commonly found in forecasting competitive sports events, where there exist only two clear outcomes: victory or defeat.
- For instance, we can analyse whether the average speed of a tennis player's serve plays a crucial role in the outcome of the match.
- Expectation dictates that sending the ball with higher velocity leaves opponents a shorter period to respond; this usually results in a better hit, which could lead to a point for the server.
- To reach higher speeds, tennis players often give up some control over the shot and are therefore less accurate, so we cannot assume a linear relationship between point conversion and serve speed.
- Theory suggests there exists some optimal speed which enables the serve to still be accurate enough; most of the shots converted into points will likely have similar velocities, and as tennis players move further away from the optimal speed, their shots become either too slow and easy to handle, or too inaccurate.
- This suggests that the graph of the PDF of the logistic distribution would look similar to that of the normal distribution.
- The graph of the logistic distribution is defined by two key features: its mean and its scale parameter.
Weibull Distribution
Population and Sample
• The branch of statistics called inferential statistics is often defined as
the science of drawing conclusions about a population from
observations made on a representative sample of that population.
- Samples are key to accurate statistical insights. They have two defining characteristics: randomness and representativeness.
Why a sample?
In general, it is almost always impossible to carry out measurements for the entire study
population because:
• The population is too large. Example: the population of IT students. If we want to take measurements on all IT
students in the world, it will most likely either take too long or cost too much
• The population is virtual. In this case “virtual” population is understood as a “hypothetical” population: it is
unlimited in size. Example: for an experimental study, we focus on men with blood cancer treated with a new
treatment. We do not know how many people will be treated, so the population varies, is infinite and
uncountable at the present time, and therefore virtual
• The population is not easily reachable. Example: the population of homeless persons in Belgium.
• For these reasons, measurements are made on a subgroup of observations from the
population, i.e., on a sample of our population.
• These measures are then used to draw conclusions about the population of interest.
• With an appropriate methodology and a sufficiently large sample size, the results
obtained on a sample are often almost as accurate as those that would be obtained on
the entire population.
Representative sample
• The sample must be selected to be representative of the population under study. If participants are included in a
study on a voluntary basis, there is a serious concern that the resulting sample may not be representative of the
population.
• It may be the case that volunteers are different in terms of the parameter of interest, leading to a selection bias.
Another selection bias can occur when, for instance, a researcher collects citizens’ wage, by the means of
internet. It might be the case that people having access to internet have different wages than people who do not
have access.
• The best way to select a sample representative of the population under study is by selecting a random sample. A
random sample is a sample selected at random from the population so that each member of the population has
an equal chance of being selected. A random sample is usually an unbiased sample, that is, a sample whose
randomness is not in doubt.
• In some situations (e.g., in medicine) it is complicated or even impossible to obtain a random sample of the
population. In such cases, it will be important to consider how representative the resulting sample will be.
Paired samples
• Finally, paired samples are samples in which groups (often pairs) of experimental units are linked together by the
same experimental conditions. For example, one may measure the hours of sleep for 20 individuals before taking
a sleeping pill (forming sample A), and then repeat the measurements on the same individuals after they have
taken a sleeping pill (forming sample B).
• The two measurements for each individual (hours of sleep before and after the sleeping pill) and the two samples
are of course related.
• Statistical tools accounting for a relation between the samples exist and should be preferred in that case.
Randomness and representativeness.
- A sample must be both random and representative for an insight to be precise.
- A random sample is collected when each member of the sample is chosen from the population strictly by chance.
- A representative sample is a subset of the population that accurately reflects the members of the entire population.
- Let's go back to the sample we just discussed: the 50 students from the NYU canteen. By walking into the university canteen and surveying whoever was there, we violated both conditions.
- People were not chosen by chance; they were a group of NYU students who were there for lunch. Most members of the population did not even get the chance to be chosen, as they were not in the canteen.
- Thus, we conclude the sample was not random. But was it representative?
- It represented a group of people, but definitely not all students in the university.
- To be exact, it represented the people who have lunch at the university canteen.
- Had our survey been about the job prospects of NYU students who eat in the university canteen, we would have done well.
- You must be wondering how to draw a sample that is both random and representative.
- The safest way would be to get access to the student database and contact individuals in a random manner.
- However such surveys are almost impossible to conduct without assistance from the university.
Types of Data
- Different types of variables require different types of statistical and
visualization approaches.
- Therefore, to be able to classify the data you are working with is key.
- We can classify data in two main ways based on its type and on its
measurement level.
- Let us start from the types of data we can have.
- There are two types of data - categorical and numerical data.
- The categorical data describes categories or groups. One example is car
brands like Mercedes, BMW and Audi. They show different categories.
- Another instance is answers to yes-and-no questions. If I ask questions like "Are you currently enrolled in a university?" or "Do you own a car?", yes and no would be the two groups of answers that can be obtained. This is categorical data.
- Numerical data on the other hand as its name suggests represents
numbers.
- It is further divided into two subsets discrete and continuous.
- Discrete data can usually be counted in a finite manner. A good example would be the number of children that you want to have: even if you don't know exactly how many, you are absolutely sure that the value will be an integer such as 0, 1, 2 or even 10.
- Another instance is grades on an exam: you may get 1000, 560, 1500 or 2500 points, but the set of possible scores is countable.
- What is important for a variable to be defined as discrete is that you can imagine each member of the data set; knowing all the possible values that can be obtained is key.
- The continuous data is infinite and impossible to count. For example, your weight can take on every value in some range.
- You get on the scale and the screen shows 68 or, more precisely, 68.0369 kilograms. But this is just an approximation: if you gain 0.01 kilograms, the figure on the scale is unlikely to change, but your new weight will be 68.0469 kilograms.
- Now think about sweating: every drop of sweat reduces your weight by the weight of that drop, but once again, a scale is unlikely to capture that change. Your exact weight is a continuous variable.
- It can take on an infinite number of values, no matter how many digits there are after the decimal point. To sum up, your weight can vary by incomprehensibly small amounts and is continuous.
- Just to make sure, here are some other examples of discrete and continuous data. Grades at university are discrete: A, B, C, D, E, F, or 0 to 100%.
- The number of objects in general no matter if bottles, glasses, tables or cars they can only take integer values.
- Money can be considered both, but physical money like banknotes and coins is definitely discrete.
- You can pay one dollar and 23 cents or one dollar and 24 cents, but nothing in between, because the smallest difference between two sums of physical money is one cent. What else is continuous?
- Apart from weight, other measurements are also continuous. Examples are height, area, distance and time; all of these can vary by amounts incomprehensibly small for a human.
- Time on a clock is discrete, but time in general is continuous: it can be anything, like 72.123456 seconds.
- We are constrained in measuring weight, height, area, distance and time by our technology, but in general they can take on any value.
Levels of Measurement
- Furthermore, we saw that numerical data can be discrete or continuous. It's time to move on to the other classification: levels of measurement.
- These can be split into two groups qualitative and quantitative
data.
- The qualitative data can be nominal or ordinal
- The nominal variables are like the categories we talked about just now: Mercedes, BMW or Audi. They aren't numbers and cannot be ordered.
- Ordinal data, on the other hand, consists of groups and categories which follow a strict order. Imagine you have been asked to rate your lunch and the options are: disgusting, unappetising, neutral, tasty and delicious. Although we have words and not numbers, it is obvious that these preferences are ordered from negative to positive. Thus, the level of measurement is qualitative ordinal.
- The quantitative variables are also split into two groups interval
and ratio.
- Intervals and ratios are both represented by numbers but have
one major difference ratios have a true zero and intervals don't.
- Most things we observe in the real world are ratios. Their name comes from the fact that they can represent
ratios of things.
- For instance, if I have two apples and you have six apples you would have three times as many as I do. The
ratio of 6 and 2 is 3
- Other examples are the number of objects in general, distance and time.
- Temperature is the most common example of an interval variable. Remember: it cannot represent a ratio of things and doesn't have a true zero.
- Usually temperature is expressed in Celsius or Fahrenheit. They are both interval variables say today is 5
degrees Celsius or 41 degrees Fahrenheit and yesterday was 10 degrees Celsius or 50 degrees Fahrenheit.
- In terms of Celsius, it seems today is twice as cold as yesterday (5 versus 10), but in terms of Fahrenheit, not really (41 is not half of 50). The issue comes from the fact that zero degrees Celsius and zero degrees Fahrenheit are not true zeros.
- These scales were artificially created by humans for convenience.
- Now, there is another scale, called Kelvin, which has a true zero: 0 K is the temperature at which atoms stop moving, and nothing can be colder than 0 K.
- This equals −273.15 degrees Celsius or −459.67 degrees Fahrenheit.
- Variables shown in kelvins are ratios, as we have a true zero and we can claim that one temperature is two times higher than another; Celsius and Fahrenheit have no true zero and are intervals.
• Finally, numbers like 2, 3, 10, 10.5, etc. can be either interval or ratio, depending on what they represent.
Mean, Median and Mode
The central tendency is a statistical measure that represents an entire distribution or dataset with a single value.
Median
- The median is basically the middle number in an ordered dataset; in order to calculate the median, we first have to order our data in ascending order.
- The median of the dataset is at position (n + 1)/2 in the ordered list, where n is the number of observations.
- Therefore, the median for NYC is at the 6th position, or six dollars: much closer to the observed prices than the mean of 11 dollars.
- What about L.A.? We have just 10 observations there; according to our formula, the median is at position 5.5.
- In cases like this, the median is the simple average of the numbers at positions 5 and 6.
- Therefore, the median of L.A. prices is $5.50 (see the short sketch below).
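The actual NYC and L.A. price lists are not reproduced in these notes, so the arrays below are made-up placeholders; the sketch only illustrates the odd-n and even-n median rules described above.

```python
# Median for an odd-sized and an even-sized dataset (hypothetical prices).
import numpy as np

nyc = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50])  # n = 11: median is the 6th ordered value
la = np.array([2, 3, 4, 5, 5, 6, 6, 7, 8, 9])        # n = 10: average of the 5th and 6th values

print(np.median(nyc))  # 6.0 - unaffected by the 50 outlier that inflates the mean
print(np.median(la))   # 5.5
```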
- There is one last question that we haven't answered: which measure is best? The NYC and L.A. example shows us that the measures of central tendency should be used together rather than independently.
- As an example, consider frequency distribution tables. Here we have three data sets and the respective frequency distributions; we have also calculated the means, medians and modes.
- The first dataset has a mean of 2.79 and a median of 2; hence, the mean is bigger than the median. We say that this is a positive or right skew.
- From the graph, you can clearly see that the data points are concentrated on the left side. Note that the direction of the skew is counter-intuitive:
- it does not depend on which side the peak leans to, but rather on which side its tail leans to. So right skewness means that the outliers are to the right.
- It is interesting to see the measures of central tendency incorporated in the graph: when we have right skewness, the mean is bigger than the median, and the mode is the value with the highest visual representation.
- In the second graph, we have plotted a data set that has an equal mean, median and mode. The frequency of occurrence is completely symmetrical, and we call this a zero or no skew. Most often you will hear people say that the distribution is symmetrical.
- In the third graph, we have a mean of 4.9, a median of 5 and a mode of 6. As the mean is lower than the median, we say that there is a negative or left skew.
- Once again, the highest point is defined by the mode. Why is it called a left skew? Again, because the outliers are to the left.
- Skewness tells us a lot about where the data is situated. The mean, median and mode
should be used together to get a good understanding of the dataset.
- Measures of asymmetry like skewness are the link between central tendency measures and probability theory, which ultimately allows us to get a more complete understanding of the data we are working with.
• Variance
- Sample variance, on the other hand, is denoted by s² and is equal to the sum of squared differences between the observed sample values and the sample mean, divided by the number of sample observations minus one: s² = Σ(xᵢ − x̄)² / (n − 1).
- The main part of the formula is its numerator, so that's what we want to comprehend:
- the sum of the squared differences between the observations and the mean.
- So, the closer a number is to the mean, the lower the result we obtain; and the further away from the mean it lies, the larger this difference.
- But why do we raise to the second power?
- Squaring the differences has two main purposes. First, by squaring the numbers, we always get non-negative values, and dispersion cannot be negative: dispersion is about distance, and distance cannot be negative.
- If, on the other hand, we calculated the differences without squaring, we would obtain both positive and negative values that, when summed, would cancel out, leaving us with no information about the dispersion. Second, squaring amplifies the effect of large differences.
- For example, if the mean is 0 and you have an observation of 100 the
squared spread is 10,000
- Consider a practical example.
- We have a population of 5 observations: 1, 2, 3, 4 and 5.
- Let's find its variance.
- We start by calculating the mean: (1+2+3+4+5)/5 = 3.
- Then we apply the formula: σ² = ((1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)²)/5 = 10/5 = 2.
- So, the population variance of the data set is 2. But what about the sample variance?
- This would only be suitable if we were told that these five observations were a sample drawn from a population.
- So let's imagine that's the case.
- The sample mean is once again 3 and the numerator is the same, but the denominator is going to be 4 instead of 5, giving us a sample variance of 2.5.
- To conclude the variance topic, we should interpret the result: why is the sample variance bigger than the population variance?
- In the first case, we knew the entire population; that is, we had all the data and we calculated the variance.
- In the second case, we were told that 1, 2, 3, 4 and 5 was a sample drawn from a bigger population.
- Imagine the population behind this sample were these 10 numbers: 1, 1, 1, 2, 3, 4, 5, 5, 5 and 5.
- Clearly the numbers are the same, but there is a concentration around the two extremes of the data set, 1 and 5.
- The variance of this population is 2.96, so our sample variance has rightfully corrected upwards in order to reflect the higher potential variability.
- This is the reason why there are different formulas for sample and population data.
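The population-vs-sample variance calculation above is easy to reproduce; here is a short sketch using numpy's ddof argument (delta degrees of freedom), which controls the denominator.

```python
# Population vs. sample variance for the example above.
import numpy as np

data = np.array([1, 2, 3, 4, 5])
print(data.mean())           # 3.0
print(np.var(data, ddof=0))  # population variance, divide by n:     2.0
print(np.var(data, ddof=1))  # sample variance, divide by n - 1:     2.5

wider = np.array([1, 1, 1, 2, 3, 4, 5, 5, 5, 5])
print(np.var(wider, ddof=0)) # 2.96 - the population our sample "corrected" towards
```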
Standard Deviation and Coefficient of Variation
- Variance is a common measure of data dispersion. In most cases the figure you will
obtain is pretty large and hard to compare as the unit of measurement is squared.
- The easy fix is to calculate its square root and obtain a statistic known as standard
deviation.
- In most analyses you perform standard deviation will be much more meaningful than
variance.
- There are different measures for the population and sample variance. Consequently,
there is also population and sample standard deviation.
- The formulas are the square root of the population variance and square root of the
sample variance respectively.
- The other measure we still have to introduce is the coefficient of variation. It is equal
to the standard deviation divided by the mean. Another name for the term is relative
standard deviation.
- This is an easy way to remember its formula: it is simply the standard deviation relative to the mean. As you probably guessed, there is a population and a sample formula.
- So, once again: standard deviation is the most common measure of variability for a single dataset.
- But why do we need yet another measure, such as the coefficient of variation? Comparing the standard deviations of two different data sets is meaningless, but comparing their coefficients of variation is not.
- Here's an example of a comparison between standard deviations.
- Let's take the prices of pizza at 11 different places in New York; they range from $1 to $11.
- Now imagine that you only have Mexican pesos; to you, the prices look more like 18.81 pesos to 206.91 pesos,
- given the exchange rate of 18.81 pesos per dollar. Let's combine our knowledge so far and find the standard deviations and coefficients of variation of these two data sets.
- First, we have to see if this is a sample or a population.
- Are there only 11 restaurants in New York? Of course not. This is obviously a sample drawn from all the restaurants in the city, so we have to use the formulas for sample measures of variability.
- Second, we have to find the mean. The mean in dollars is equal to 5.5 and the mean in pesos
to 103.46
- The third step of the process is finding the sample variance. Following the formula that we showed earlier, we obtain 10.72 dollars squared and 3793.69 pesos squared. The respective sample standard deviations are 3.27 dollars and 61.59 pesos.
- Let's make a couple of observations. First, variance gives results in squared units, while standard deviation gives them in the original units.
- This is the main reason why professionals prefer to use standard deviation as the main measure of variability.
- It is directly interpretable; squared dollars mean nothing, even in the field of statistics.
- Second, we got standard deviations of 3.27 and 61.59 for the same pizzas at the same eleven restaurants in New York City. Seems wrong, right?
- It is time to use our last tool: the coefficient of variation. Dividing the standard deviations by the respective means, we get the two coefficients of variation.
- The result is the same: 0.60. Notice that it is not dollars, pesos, dollars squared or pesos squared; it is just 0.60.
- This shows us the great advantage that the coefficient of variation gives us. Now we can confidently say that the two data sets have the same variability. Which is
what we expected beforehand.
- There are three main measures of variability: variance, standard deviation and coefficient of variation.
- Each of them has different strengths and applications. You should feel competent using all of them as we are getting closer to more complex statistical topics.
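The exact pizza prices are not listed in these notes, so the array below is an assumption; the sketch demonstrates the key point, that the coefficient of variation is unchanged by a change of currency (with the actual course data it came out to 0.60 in both cases).

```python
# Standard deviation vs. coefficient of variation under a currency change.
import numpy as np

usd = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)  # hypothetical prices
mxn = usd * 18.81                                                 # 18.81 pesos per dollar

for prices in (usd, mxn):
    std = prices.std(ddof=1)         # sample standard deviation
    print(std, std / prices.mean())  # stds differ by the exchange rate; CVs are identical
```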
Covariance
- We've covered all univariate measures.
Now it is time to see measures that are
used when we work with more than one
variable.
This will naturally lead us to the point estimate, and we will conclude
the section with confidence intervals.
What is a Distribution?
- In statistics when we use the term distribution we usually mean
a probability distribution.
- A distribution is a function that shows the possible values for
a variable and how often they occur.
- We are sure that you have exhausted all possible values when
the sum of the probabilities is equal to 1.
- It is not necessary that all outcomes have an equal chance of occurring. Each probability distribution has a visual representation.
- You can notice that the highest point is located
at the mean because it coincides with the mode.
The spread of the graph is determined by the
standard deviation.
- Now let's try to understand the normal
distribution a little bit better. Let's look at this
approximately normally distributed histogram.
- There is a concentration of the observations
around the mean which makes sense as it is
equal to the mode. Moreover, it is symmetrical
on both sides of the mean.
- We used 80 observations to create this histogram. Its mean is 743 and its standard deviation is 140. But what if the mean were smaller or bigger?
- Let's zoom out a bit by adding the origin to the graph (the origin is the zero point).
- Keeping the standard deviation fixed, or in statistical jargon, controlling for the standard deviation, a lower mean would result in the same shape of the distribution, but positioned further to the left of the plane; in the same way, a higher mean would move the graph to the right.
Let's see an example that will help us get a better grasp of the concept. We will take an approximately normally distributed set of numbers: 1, 2, 2, 3, 3, 3, 4, 4 and 5.
- Its mean is 3 and its standard deviation is 1.22.
- Now let's subtract the mean from all data points. We get a new dataset; let's calculate its mean. It is 0, exactly as we anticipated: on a graph, we have shifted the curve to the left while preserving its shape.
- So far we have a new distribution, which is still normal, but with a mean of 0 and a standard deviation of 1.22.
- The next step of the standardization is to divide all data points by the standard deviation; this will drive the standard deviation of the new dataset to 1.
- Let's go back to our example. Both the original dataset and the one we obtained after subtracting the mean from each data point have a standard deviation of 1.22: adding or subtracting a value to or from all data points does not change the standard deviation.
- Now let's divide each data point by 1.22.
- If we calculate the standard deviation of this new dataset, we will get 1, and the mean is still 0.
- In terms of the curve, we kept it at the same position but reshaped it a bit.

x : 1, 2, 2, 3, 3, 3, 4, 4, 5 (mean 3, st. dev. 1.22; N ~ (3, 1.22))
x − µ : −2, −1, −1, 0, 0, 0, 1, 1, 2 (mean 0, st. dev. 1.22; N ~ (0, 1.22))
(x − µ)/σ : −1.63, −0.82, −0.82, 0.00, 0.00, 0.00, 0.82, 0.82, 1.63 (mean 0, st. dev. 1; N ~ (0, 1))
- This is how we can obtain a standard normal distribution
from any normally distributed dataset.
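The standardization walk-through above is easy to replay in code; here is a minimal sketch with numpy on the same nine numbers.

```python
# Standardizing the dataset 1, 2, 2, 3, 3, 3, 4, 4, 5 step by step.
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5], dtype=float)
print(x.mean(), x.std(ddof=1))               # 3.0 and ~1.22

centered = x - x.mean()                      # step 1: subtract the mean
print(centered.mean(), centered.std(ddof=1)) # 0.0 and ~1.22 (shape preserved)

z = centered / x.std(ddof=1)                 # step 2: divide by the standard deviation
print(z.mean(), z.std(ddof=1))               # 0.0 and 1.0 -> a standardized dataset
```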
- Let's draw a sample out of that data. The mean is $2,617.23. Now a problem arises from the fact that if I take another sample, I may get a completely different mean, say $3,201.34; then a third, with a mean of $2,844.33. As you can see, the sample mean depends on the members of the sample itself.
- So taking a single value as we did in descriptive statistics is
definitely suboptimal.
- What we can do is draw as many samples and create a new dataset
comprised of sample means these values are distributed in some
way.
- Let me give you some more information. Here's a plot of the distribution of the car prices. We
haven't seen many distributions but we know that this is not a normal distribution.
- It has a right skew, and that's about all we can say. Here is the big revelation: it turns out that if we visualize the distribution of the sampling means, we get something else, something familiar, something useful: a normal distribution. That's what the central limit theorem states: no matter the distribution of the population (uniform, exponential or any other), the sampling distribution of the mean will approximate a normal distribution.
- Not only that but its mean is the same as the population mean. That's something we already
noticed.
- What about the variance? It depends on the size of the samples we draw, but it is quite elegant: it is the population variance divided by the sample size.
- Since the sample size is in the denominator, the bigger the sample size, the lower the variance, or in other words, the closer the approximation we get.
- So, if you are able to draw bigger samples, your statistical results will be more accurate. A common rule of thumb is a sample size of at least 30 observations.
- Finally, why is the central limit theorem so important? As we already know, the normal distribution has elegant statistics and an unmatched applicability in calculating confidence intervals and performing tests.
- The central limit theorem allows us to perform tests, solve problems and make inferences
using the normal distribution even when the population is not normally distributed.
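The central limit theorem is easy to see in a small simulation; here is a sketch (assuming numpy) that draws many samples of n = 30 from a heavily right-skewed exponential population and inspects the resulting sample means.

```python
# Simulating the sampling distribution of the mean for a skewed population.
import numpy as np

rng = np.random.default_rng(42)
samples = rng.exponential(scale=2.0, size=(10_000, 30))  # 10,000 samples, n = 30 each
means = samples.mean(axis=1)

print(means.mean())  # ~2.0   (the population mean)
print(means.var())   # ~0.133 (the population variance 4.0 divided by n = 30)
# A histogram of `means` looks approximately bell-shaped even though the
# exponential population itself is strongly right-skewed.
```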
Standard Error
- The standard error is the standard deviation of the distribution of the sample means: SE = σ/√n, where σ is the population standard deviation and n is the sample size.
- It decreases as the sample size increases, which is why bigger samples yield more accurate estimates.
Estimators and Estimates
- We learned about point estimators, but as you can guess, they are not exceptionally reliable.
- Imagine visiting 5 percent of the restaurants in London and saying that the average meal is worth 22.50 pounds.
What are Confidence Intervals?
- You may be close, but chances are that the true value isn't really 22.50 but somewhere around it. It's much safer to say that the average meal in London costs somewhere between 20 and 25 pounds, isn't it?
- In this way, you have created a confidence interval around your point estimate of 22.50. A confidence interval is a much more accurate representation of reality.
- However, there is still some uncertainty left which we measure in levels of confidence.
- So getting back to our example you may say that you are 95% confident that the population
parameter lies between 20 and 25.
- Keep in mind that you can never be 100% confident unless you go through the entire population
and there is of course a 5% chance that the actual population parameter is outside of the 20 to 25
pounds range.
- Observe that these statements can be misleading if the sample we have considered deviates significantly from the entire population.
- There is one more ingredient needed: the level of confidence. It is denoted by 1 − α and is called the confidence level of the interval. Alpha is a value between 0 and 1.
- For example, if we want to be 95% confident that the parameter is inside the interval Alpha is
5%.
- If we want a higher confidence level of, say, 99%, then alpha will be 1%.
- Then here it is: the formula for all confidence intervals runs from the point estimate minus the reliability factor times the standard error, to the point estimate plus the reliability factor times the standard error.
- Formula for all confidence intervals:
[point estimate − reliability factor × standard error, point estimate + reliability factor × standard error]
Confidence Intervals; Population Variance Known; Z-score
- A confidence interval is the range within which you expect the population parameter to be and its estimation is based on the
data we have in our sample.
- There can be two main situations when we calculate the confidence interval for a population: when the population variance is known and when it is unknown. Depending on the situation, we use a different calculation method.
- Now, the whole field of statistics exists because we almost never have population data. Even if we do have population data, we may not be able to analyse it; there may be so much that it doesn't make sense to use it all at once.
- Here we will explore the confidence interval for a population mean with a known variance. An important assumption in this calculation is that the population is normally distributed.
- Even if it is not, you should use a large sample and let the Central Limit Theorem do the normalization magic for you. Remember: if you work with a sample which is large enough, you can assume normality of the sample means.
❑ Example:
- Let's say you want to become a data scientist and you're interested in the salary you're going to get.
- Imagine you have certain information that the population standard deviation of data scientist salaries is equal to $15,000.
- Furthermore, you know the salaries are normally distributed and your sample consists of 30 salaries.
The formula for the confidence interval with a known variance is:
[x̄ − z(α/2) · σ/√n, x̄ + z(α/2) · σ/√n]
- The population mean will fall between the sample mean minus z(α/2) times the standard error and the sample mean plus z(α/2) times the standard error.
- The sample mean is the point estimate. You know all about the standard error (σ/√n) already, so let's compute it using the formula.
The z-table summarizes the standard normal distribution's critical values and the corresponding (1 − α) confidence levels.
- Let's say that we want to find the values for the 95 percent confidence interval. Alpha is 0.05.
- Therefore, we are looking for z(α/2) = z(0.025) in the table. This matches the cumulative probability 1 − 0.025 = 0.975.
- The corresponding z comes from the sum of the row and column headers associated with this cell; in our case, the value is 1.9 plus 0.06, or 1.96.
The interpretation is the following: we are 95 percent confident that the average data scientist's salary lies in the interval between $94,833 and $105,568.
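The notes do not restate the sample mean, but from the interval it must be about $100,200; with that assumption, a short sketch (using scipy) reproduces the calculation.

```python
# Reproducing the known-variance confidence interval above.
import numpy as np
from scipy.stats import norm

x_bar, sigma, n = 100_200.5, 15_000, 30  # sample mean inferred from the interval
z = norm.ppf(1 - 0.05 / 2)               # z(alpha/2) = 1.96 for 95% confidence
se = sigma / np.sqrt(n)                  # standard error ~ 2,738.61
print(x_bar - z * se, x_bar + z * se)    # ~ (94,833; 105,568)
```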
Example:
From 1984 to 1985, the mean height of 15 to 18-year-old males from Chile was
172.36 cm, and the standard deviation was 6.34 cm. Let Y = the height of 15 to
18-year-old males in 1984 to 1985. Then Y ~ N(172.36, 6.34).
About 95% of the y values lie between what two values?
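Answer sketch: by the empirical rule, about 95% of values lie within two standard deviations of the mean.

```python
# About 95% of heights lie within mu +/- 2*sigma.
mu, sigma = 172.36, 6.34
print(mu - 2 * sigma, mu + 2 * sigma)  # ~ 159.68 cm to 185.04 cm
```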
Confidence Interval Clarifications
- Let's take a step back and try to understand confidence
intervals a bit better.
- Here is a graph of a normal distribution; as you know, the sample mean is in the middle of the graph.
- Now if we know that a variable is normally distributed,
basically we are making the statement that most
observations will be around the mean and the rest far away
from it.
- Let's draw a confidence interval: there is the lower limit and the upper limit, and a 95% confidence interval would imply that we are 95% confident that the true population mean falls within this interval.
- There is a 2.5% chance that it will be to the left of the lower limit and a 2.5% chance that it will be to the right of the upper limit.
- Overall, there is a 5% chance that our confidence interval does not contain the true population mean. So, when alpha is 0.05 or 5%, we have an alpha-divided-by-two, or 2.5%, chance that the true mean is to the left of the interval and a 2.5% chance that it is to the right.
- Using the z score and the formula we are
implicitly starting from a standard normal
distribution.
- Therefore, the mean is zero the lower limit
is -z, while the upper one is z.
- For a 95% confidence interval using the Z
table we can find that these limits are -1.96
and 1.96.
- 95% is the accepted norm as we don’t compromise with accuracy too much but still get a relatively narrow interval.
Student's T Distribution
- William Gossett was an English statistician who worked for the brewery
of Guinness.
- He developed different methods for the selection of the best-yielding varieties of barley, an important ingredient in making beer. Gossett found big samples tedious, so he was trying to develop a way to extract small samples but still come up with meaningful predictions.
- He was a curious and productive researcher and published several papers
that are still relevant today.
- However due to his company policy he was not allowed to sign the
papers with his own name. Therefore, all his work was under the pen
name student.
- Later on, a friend of his, the famous statistician Ronald Fisher, building on the findings of Gossett, introduced the t-statistic, and the name that stuck with the corresponding distribution even today is Student's t.
- The Student's t distribution is one of the biggest breakthroughs in statistics, as it allowed inference through small samples with an unknown population variance.
- This setting can be applied to a big part of the statistical problems we
face today.
- So, for a sample of 20 observations, the degrees of freedom are 19. Much like the standard normal distribution table, we also have a Student's t table, where the rows indicate different degrees of freedom, abbreviated as df, while the columns indicate common alphas.
- Please note that after the 13th row the numbers don't vary that much.
- Actually, after 30 degrees of freedom the t-statistic table becomes almost the same as the z-statistic, as the degrees of freedom depend on the sample size.
- In essence, the bigger the sample, the closer we get to the actual numbers. A common rule of thumb is that for a sample containing more than 50 observations we use the z table instead of the t table.
Confidence Intervals; Population Variance Unknown; T-score
- So, we have learned that confidence intervals based on small samples from normally distributed populations are calculated with the t
statistic.
- Let's check a similar example to the one we saw earlier. You are an aspiring data scientist and are wondering what the mean data scientist salary is.
- This time, though, you do not have the population variance.
- In fact, you have a sample of only nine compensations you found on Glassdoor, and you summarize the information in a table.
• In this example we are going to use a
confidence level of 95%.
• This means that α is equal to 5%.
Therefore, half of Alpha would be
2.5%.
• You can now see that the associated t
statistic is 2.31
• We have all the information needed, so we just plug in the numbers; what we get is a confidence interval from $81,806 to $103,261.
- Let's compare this result with the earlier confidence interval computed with a known population variance: there we got a 95% confidence interval between $94,833 and $105,568.
- You can clearly note that when we know the population variance we get a narrower confidence interval. When we do not know the population
variance, there is a higher uncertainty that is reflected by wider boundaries for our interval.
- It means that even when we do not know the population variance, we can still make predictions, but they will be less accurate. Furthermore, the proper statistic for estimating the confidence interval when the population variance is unknown is the t-statistic, not the z-statistic.
- The more observations there are in the sample the higher the chances of getting a good idea about the true mean of the entire population.
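The nine Glassdoor salaries are not reproduced in these notes, so the sample mean and standard deviation below are inferred from the stated interval (about $92,533 and $13,956) and should be treated as assumptions; the sketch shows the t-based mechanics.

```python
# Reproducing the unknown-variance (t-based) confidence interval above.
import numpy as np
from scipy.stats import t

x_bar, s, n = 92_533.5, 13_956, 9                # summary stats inferred from the interval
t_crit = t.ppf(1 - 0.05 / 2, df=n - 1)           # ~ 2.306, the 2.31 quoted above
se = s / np.sqrt(n)                              # estimated standard error
print(x_bar - t_crit * se, x_bar + t_crit * se)  # ~ (81,806; 103,261)
```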
Hypothesis Testing
- Confidence intervals provide us with an
estimation of where the parameters are located.
- However, when you are making a decision, you
need a yes or no answer.
- The correct approach in this case is to use a test
- Here we learn how to perform one of the fundamental tasks in statistics: hypothesis testing.
- There are four steps in data driven decision
making.
o First, you must formulate a hypothesis.
o Second, once you have formulated a hypothesis, you have to find the right test for it.
o Third, you execute the test.
o Fourth, you make a decision based on the result of the test.
The two types of error are inversely related to each other; decreasing type I errors will increase type II errors, and vice versa.
Example
1. Imagine that you are on a jury and you need to determine whether an individual will be sent to jail for a
crime. You don't know the truth as to whether or not this person committed the crime. Here, the null
hypothesis is: "The person has not committed the crime."
• A type I error would mean that, even though they were really not guilty, you send them to jail: the jury
has rejected the null hypothesis that the defendant is innocent, while he has not committed any
crime.
• You would also not want to make a type II error here because this would mean that someone has
actually committed a crime and the jury is letting them get away with it.