Business Analytics
INTRODUCTION TO
STATISTICS AND
PROBABILITY (BASICS)
CONTENTS
INTRODUCTION
DATA AND CATEGORIES OF DATA
TYPES OF DATA
STATISTICS AND BASIC TERMINOLOGIES
DIFFERENT TYPES OF PROBABILITY SAMPLING AND TYPES OF STATISTICS
DESCRIPTIVE STATISTICS
INFERENTIAL STATISTICS
PROBABILITY AND BASIC TERMINOLOGIES
PROBABILITY DISTRIBUTION
TYPES OF PROBABILITY
BAYES THEOREM
COMPARISON BETWEEN STATISTICS AND PROBABILITY
CONCLUSION
INTRODUCTION
Statistics and probability are used in the field of data science to obtain accurate and timely
information about the processed data. They help provide proper precautions and forecasting
measures for future decisions.
Probability and statistics are related areas of mathematics. We use them to analyse the
relative frequency of events, but there is a vast difference between probability and statistics.
Probability deals with the prediction of future events, whereas statistics is used to analyse
the frequency of past events. The important thing to note is that probability is a theoretical
branch of mathematics, whereas statistics is an applied branch of mathematics.
DATA AND CATEGORIES OF DATA
Data refers to facts and statistics collected together for reference or analysis.
Data can be collected, measured and analyzed. It can also be visualized by using statistical models and graphs.
CATEGORIES OF DATA
Data can be classified into two types namely qualitative and quantitative data.
QUALITATIVE DATA: Data that deals with characteristics and descriptors that cannot be easily measured but can be
observed subjectively. Qualitative data is further classified into two types, namely nominal and ordinal data.
QUANTITATIVE DATA: Data that deals with numbers and things you can measure objectively. Quantitative data is further
classified into two types, namely discrete and continuous data.
VARIOUS TYPES OF DATA (SUB-CATEGORIES)
•NOMINAL DATA (QUALITATIVE): Data with no inherent order or ranking, such as gender or race.
•ORDINAL DATA (QUALITATIVE): Data with an inherent order or ranking, such as customer satisfaction ratings (low, medium, high).
•DISCRETE DATA (QUANTITATIVE): Data that can hold only a finite (countable) number of possible values.
Example: Number of students in a class.
•CONTINUOUS DATA (QUANTITATIVE): Data that can hold an infinite number of possible values.
Example: Weight of a person.
STATISTICS
Statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
This area of mathematics deals with understanding how data can be used to solve complex problems.
For every Artificial Intelligence, Machine Learning, and Data Science enthusiast, Statistics is fundamental to learn in
order to dive deeper into these fields. With a proper understanding of Statistics, it becomes much easier to implement
regression, classification, and numerous other algorithms in Machine Learning.
BASIC TERMINOLOGIES IN STATISTICS
The two important terminologies in statistics are the population and the sample. The statistical tests and analyses
that are used depend on the size of the given data.
POPULATION: A collection or set of individuals or objects or events whose properties are to be analyzed.
•SAMPLE: A subset of the population is called ‘Sample’. A well-chosen sample will contain most of the information
about a particular population parameter.
•Several sampling techniques are used to select the sample that best represents the entire population.
SAMPLING TECHNIQUES
SAMPLING IS A STATISTICAL METHOD THAT DEALS WITH THE SELECTION OF INDIVIDUAL
OBSERVATIONS FROM WITHIN A POPULATION. IT IS PERFORMED TO INFER STATISTICAL KNOWLEDGE ABOUT A
POPULATION.
THE TWO MAIN SAMPLING TECHNIQUES ARE PROBABILITY SAMPLING AND NON-PROBABILITY
SAMPLING.
THE MOST COMMONLY USED SAMPLING TECHNIQUE IN STATISTICS IS PROBABILITY SAMPLING;
NON-PROBABILITY SAMPLING IS USED FAR LESS OFTEN BY COMPARISON.
• DIFFERENT TYPES OF PROBABILITY SAMPLING TECHNIQUES:
RANDOM SAMPLING.
SYSTEMATIC SAMPLING.
STRATIFIED SAMPLING.
RANDOM SAMPLING: IN THIS METHOD, EACH MEMBER OF THE POPULATION HAS AN EQUAL CHANCE
OF BEING SELECTED IN THE SAMPLE.
SYSTEMATIC SAMPLING: IN SYSTEMATIC SAMPLING, EVERY NTH RECORD IS CHOSEN FROM THE
POPULATION TO BE A PART OF THE SAMPLE.
STRATIFIED SAMPLING: IN STRATIFIED SAMPLING, A STRATUM IS USED TO FORM SAMPLES FROM A LARGE POPULATION. A
STRATUM IS A SUBSET OF THE POPULATION THAT SHARES AT LEAST ONE COMMON CHARACTERISTIC. AFTER THIS, THE RANDOM SAMPLING
METHOD IS USED TO SELECT A SUFFICIENT NUMBER OF SUBJECTS FROM EACH STRATUM.
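A minimal Python sketch of the three probability sampling techniques, assuming a hypothetical pandas DataFrame named population with a "stratum" column (for example, gender or class section) used for stratification:

import numpy as np
import pandas as pd

# Hypothetical population of 1,000 records with a score and a stratum label.
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "score": rng.normal(70, 10, 1000),
    "stratum": rng.choice(["A", "B", "C"], 1000),
})

# Random sampling: every member has an equal chance of being selected.
random_sample = population.sample(n=100, random_state=42)

# Systematic sampling: every nth record is chosen from the population.
step = len(population) // 100
systematic_sample = population.iloc[::step]

# Stratified sampling: random sampling applied within each stratum.
stratified_sample = (
    population.groupby("stratum", group_keys=False)
    .apply(lambda stratum: stratum.sample(frac=0.1, random_state=42))
)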
TYPES OF STATISTICS
There are two well-defined types of statistics used in the data analysis framework.
Descriptive statistics .
Inferential statistics .
DESCRIPTIVE STATISTICS: Descriptive statistics is a method used to describe and understand the features of a specific data set by
giving short summaries about the sample and measures of the data. Descriptive Statistics is mainly focused upon the main
characteristics of data. It provides a graphical summary of the data.
INFERENTIAL STATISTICS: Inferential statistics makes inferences and predictions about a population based on a sample of
data taken from the population in question. It generalizes from the sample to the larger population and applies probability to draw
conclusions. It allows us to infer population parameters based on a statistical model using sample data.
DESCRIPTIVE STATISTICS
Descriptive statistics is further classified into two types which are measures of central tendency and measures of
variability (spread).
MEASURES OF CENTRAL TENDENCY: These are statistical measures that represent a summary of the dataset.
Measures of central tendency are further classified into three types – MEAN, MEDIAN AND MODE.
MEASURES OF VARIABILITY (SPREAD): These statistical measures are also called measures of dispersion. They help
describe the variability in the sample (or) population. Measures of spread are further classified into four types –
RANGE, INTER-QUARTILE RANGE, VARIANCE AND STANDARD DEVIATION.
MEASURES OF CENTRAL TENDENCY
MEAN: In mathematics and statistics, the mean is the average of the numerical observations which is equal to the sum of the
observations divided by the number of observations. In other words, it can be simply stated as measure of the average of all the values
in a sample is called Mean.
A = (a1 + a2 + … + an) / n
Where,
A = arithmetic mean.
n = number of values.
ai = data set values.
MEDIAN: The median of the data, when arranged in ascending or descending order, is the middle observation of the data, i.e., the
point separating the higher half from the lower half of the data. In other words, the measure of the central value of
the sample set is called the Median.
For calculating the median,
- First, arrange the data in ascending (or) descending order.
- For an odd number of data points, the middle value is the median: Median = X[(n + 1) / 2].
- For an even number of data points, the average of the two middle values is the median: Median = (X[n / 2] + X[n / 2 + 1]) / 2.
Where,
X = the ordered list of values in the dataset.
n = number of values in the dataset.
MEASURES OF CENTRAL TENDENCY (CONTINUED)
MODE: The mode of the set of data points is the most frequently occurring value. In simple words, the value most recurrent in the sample set is
known as Mode.
FOR EXAMPLE,
5,2,6,5,1,1,2,5,3,8,5,9,5 are the set of data points. Here 5 is the mode because it’s occurring most frequently.
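A short Python sketch, using the standard library's statistics module, that reproduces the mode example above and also computes the mean and median of the same data points:

import statistics

data = [5, 2, 6, 5, 1, 1, 2, 5, 3, 8, 5, 9, 5]

print("Mean:  ", statistics.mean(data))    # sum of the values divided by their count
print("Median:", statistics.median(data))  # middle value of the sorted data
print("Mode:  ", statistics.mode(data))    # most frequently occurring value, here 5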
RANGE: It is a measure of how spread apart the values in a data set are.
It can be calculated as –
Range = Max(xi) − Min(xi)
Where,
Max(xi) = maximum value in the data set (X).
Min(xi) = minimum value in the data set (X).
QUARTILE : Quartiles tell us about the spread of a data set by breaking the data set into quarters, just
like the median breaks it in half.
INTER-QUARTILE RANGE: It is the measure of variability, based on dividing a data set into
quartiles.
MEASURES OF SPREAD (VARIABILITY) (CONTINUED)
FOR EXAMPLE – QUARTILES AND INTER-QUARTILE RANGE: the marks of 100 students arranged from lowest to highest score can be
split into four equal quarters; the inter-quartile range is the spread of the middle two quarters, i.e., Q3 − Q1.
VARIANCE: Variance measures how far the data points are spread out from the mean. It can be calculated as –
Variance = Σ (xi − x̄)² / n
Where,
(xi) = individual data point.
(x̄) = mean of the data.
(n) = number of observations in the dataset.
MEASURES OF SPREAD (VARIABILITY) (CONTINUED)
POPULATION VARIANCE: Population Variance is the average of squared deviations. Population data refers to the
complete data set .
SAMPLE VARIANCE: Sample Variance is the average of squared differences from the mean. Sample data refers to a part of
the population data which is used for analysis. Sampling is done to make analysis easier.
STANDARD DEVIATION : It is the measure of the dispersion of a set of data from its mean. Standard deviation measures
the variation or dispersion of the data points in a dataset. It depicts the closeness of the data point to the mean and is calculated as
the square root of the variance.
In data science, the standard deviation is often used to identify outliers in a data set. Data points that lie far from the
mean (typically more than two or three standard deviations away) are considered unusual.
σ = √( Σ (xi − µ)² / N )
Where,
(σ) = population standard deviation. (N) = size of the population. (xi) = each value from the population.
(µ) = population mean.
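A minimal NumPy sketch of the measures of spread described above (range, inter-quartile range, variance and standard deviation), applied to the same illustrative data points used in the mode example:

import numpy as np

data = np.array([5, 2, 6, 5, 1, 1, 2, 5, 3, 8, 5, 9, 5])

data_range = data.max() - data.min()        # Range = Max(xi) - Min(xi)
q1, q3 = np.percentile(data, [25, 75])      # first and third quartiles
iqr = q3 - q1                               # inter-quartile range
population_variance = np.var(data)          # average of squared deviations from the mean
sample_variance = np.var(data, ddof=1)      # divides by n - 1 instead of n
std_dev = np.std(data)                      # square root of the population variance

print(data_range, iqr, population_variance, sample_variance, std_dev)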
INFERENTIAL STATISTICS
Inferential statistics is not easy statistics. It is more complicated than descriptive
statistics. It is produced through complex mathematical calculations. These calculations
are quite helpful for scientists.
And allow them to infer trends about a larger population based on a study of a sample
taken from it. Most predictions of the future are made with the help of inferential
statistics. Statisticians need to design the right experiment to draw the relevant
conclusions from his study.
There are five different types of inferential statistics, which are –
REGRESSION
ANALYSIS OF VARIANCE (ANOVA)
ANALYSIS OF COVARIANCE (ANCOVA)
STATISTICAL SIGNIFICANCE (T-TEST)
CORRELATION
REGRESSION ANALYSIS
Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable
and one or more independent variables. It can be utilized to assess the strength of the relationship between variables
and for modeling the future relationship between them. It shows the cause-and-effect relationship between the
dependent and independent variables in the model.
There are three types of regression analysis.
SIMPLE LINEAR REGRESSION
MULTIPLE LINEAR REGRESSION
NON-LINEAR REGRESSION
SIMPLE LINEAR REGRESSION: Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable.
REGRESSION ANALYSIS (CONTINUED)
MULTIPLE LINEAR REGRESSION: Multiple linear regression analysis is essentially similar to the simple
linear model, with the exception that multiple independent variables are used in the model.
NON-LINEAR REGRESSION: Nonlinear regression analysis is commonly used for more complicated data sets in
which the dependent and independent variables show a nonlinear relationship.
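A minimal sketch of simple linear regression with NumPy, fitted to hypothetical data x (independent variable) and y (dependent variable); the assumed true relationship y = 2x + 1 plus noise is only for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)                    # independent variable
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)      # dependent variable with random noise

slope, intercept = np.polyfit(x, y, deg=1)    # ordinary least-squares fit of a straight line
y_pred = intercept + slope * x                # predicted values of the dependent variable

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")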
ANALYSIS OF VARIANCE (ANOVA)
An ANOVA test is a type of statistical test used to determine whether there is a statistically significant difference between two
or more categorical groups by testing for differences of means using variance. Another key part of ANOVA is that it
splits the independent variable into two or more groups. For example, one or more groups might be expected to influence
the dependent variable, while another group is used as a control group and is not expected to influence the dependent
variable.
ANOVA is classified into two types, namely one-way ANOVA and two-way ANOVA.
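A minimal sketch of a one-way ANOVA using scipy.stats.f_oneway, with three hypothetical groups of exam scores; a small p-value (for example, below 0.05) would suggest a statistically significant difference between at least two of the group means:

from scipy import stats

# Hypothetical exam scores for three groups of students.
group_a = [85, 86, 88, 75, 78, 94, 98]
group_b = [91, 92, 93, 85, 87, 84, 82]
group_c = [79, 78, 88, 94, 92, 85, 83]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")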
ANALYSIS OF COVARIANCE (ANCOVA)
ANCOVA is a blend of analysis of variance (ANOVA) and regression. It is similar to factorial ANOVA, in
that it can tell you what additional information you can get by considering one independent variable
(factor) at a time, without the influence of the others.
PROBABILITY
Probability is a measure of how likely an event is. For example, if there is a 60% chance that it will rain tomorrow, the
probability of the outcome "it rains tomorrow" is 0.6.
BASIC TERMINOLOGIES IN PROBABILITY
There are three main terminologies used in the concept of probability: RANDOM
EXPERIMENT, SAMPLE SPACE AND EVENT.
- RANDOM EXPERIMENT: An experiment or a process for which the outcome cannot be predicted with certainty.
- SAMPLE SPACE: The entire set of possible outcomes of a random experiment is the sample space of that experiment.
- EVENT: One or more outcomes of an experiment is called an event. It is a subset of the sample space. Events are further
classified into two types –
DISJOINT EVENT: Disjoint Events do not have any common outcomes. For example, a single card drawn from a deck
cannot be a king and a queen.
NON-DISJOINT EVENT: Non-Disjoint Events can have common outcomes. For example,
a student can get 100 marks in statistics and 100 marks in probability.
PROBABILITY DISTRIBUTION
The discussion of probability distributions covers four topics, namely the PROBABILITY DENSITY FUNCTION,
NORMAL DISTRIBUTION, BINOMIAL DISTRIBUTION AND CENTRAL LIMIT THEOREM.
PROBABILITY DENSITY FUNCTION: The Probability Density Function (PDF) describes the relative likelihood of a
continuous random variable taking on a given value. The area under the PDF between two values 'a' and 'b' gives the
probability that the variable lies in that range.
The graph of a probability density function shows this relative likelihood over a range of values; for a normal
distribution, this graph is popularly known as the bell-shaped curve.
PROBABILITY DISTRIBUTION (CONTINUED)
Properties of probability density function:
•Graph of a PDF will be continuous over a range.
•The area bounded by the curve of the density function and the x-axis is equal to 1.
•The probability that a random variable assumes a value between a and b is equal to the area under the PDF bounded
by a and b.
NORMAL DISTRIBUTION: The normal distribution, otherwise known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean. The idea behind this distribution is that data near the mean occurs more
frequently than data far away from the mean, so values around the mean are the most representative of
the entire data set.
Its graph is the bell-shaped curve described under the probability density function.
PROBABILITY DISTRIBUTION (CONTINUED)
The graph of the normal distribution depends mainly on two factors namely- MEAN AND STANDARD DEVIATION.
MEAN: It determines the location of the centre of the graph.
STANDARD DEVIATION: It determines the spread of the graph, and therefore how tall and wide it is.
If the standard deviation is large, the curve will be short and wide.
If the standard deviation is small, the curve will be tall and narrow.
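A small sketch using scipy.stats.norm that illustrates the point above: with the mean fixed at 0, a larger standard deviation produces a shorter, wider curve and a smaller standard deviation a taller, narrower one (the chosen values 1 and 3 are arbitrary):

import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 201)

narrow = norm.pdf(x, loc=0, scale=1)   # small standard deviation: tall and narrow curve
wide = norm.pdf(x, loc=0, scale=3)     # large standard deviation: short and wide curve

print(narrow.max(), wide.max())        # the peak is higher for the smaller standard deviation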
PROBABILITY DISTRIBUTION (CONTINUED)
BINOMIAL DISTRIBUTION: Much of the time, the situations we encounter are of the pass-fail type, so there are only two
outcomes – win and lose, or success and failure. The likelihood of the two may or may not be the same.
FOR EXAMPLE,
Suppose you play 5 games, where the probability of winning each game is 0.75 and the probability of losing is 0.25
(assume that a tie cannot happen). Let's define our random variable (X) to be the number of wins in the 5 games.
X = Number of wins in 5 games.
The first game has 2 outcomes – win and lose, the second again has 2, and so on.
So, the total number of possible outcomes is 2*2*2*2*2 = 32.
While we could list each of these possible outcomes, that becomes a very exhaustive and intensive exercise. Instead, we can
count with another method: the number of ways to choose 2 wins out of 5 games is 5C2 = 10.
•P(X=2) denotes the probability that you win 2 games. There are 5C2 = 10 cases where you win 2 games. Hence the probability =
10*0.75*0.75*0.25*0.25*0.25 = 0.088.
•P(X=3) denotes the probability that you win 3 games. There are 5C3 = 10 cases where you win 3 games. Hence the probability =
10*0.75*0.75*0.75*0.25*0.25 = 0.264.
SIMILARLY, P(X=4) = 0.395.
P(X=5) = 0.237.
THE PROBABILITIES LISTED ABOVE FORM THE DISCRETE PROBABILITY DISTRIBUTION OF THIS BINOMIAL RANDOM VARIABLE.
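A short sketch that reproduces the worked example above with scipy.stats.binom, where X is the number of wins in 5 games and the win probability is 0.75:

from scipy.stats import binom

n, p = 5, 0.75
for k in range(n + 1):
    # P(X = k) = nCk * p**k * (1 - p)**(n - k)
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.3f}")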
CENTRAL LIMIT THEOREM : THE CENTRAL LIMIT THEOREM STATES THAT THE SAMPLING DISTRIBUTION OF THE MEAN
OF ANY INDEPENDENT, RANDOM VARIABLE WILL BE NORMAL OR NEARLY NORMAL IF THE SAMPLE SIZE IS LARGE ENOUGH.
IN SIMPLE TERMS, IF WE HAD A LARGE POPULATION DIVIDED INTO SAMPLES, THEN THE MEAN OF ALL THE SAMPLES FROM THE
POPULATION WILL BE ALMOST EQUAL TO THE MEAN OF THE ENTIRE POPULATION.
THE ACCURACY (OR) RESEMBLANCE TO THE NORMAL DISTRIBUTION DEPENDS ON TWO MAIN
FACTORS .
NUMBER OF SAMPLE POINTS TAKEN.
THE SHAPE OF THE UNDERLYING POPULATION.
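A minimal sketch of the central limit theorem in NumPy: even though the hypothetical population below comes from a skewed exponential distribution, the means of repeated samples cluster around the population mean and are approximately normally distributed:

import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal population

# Draw 1,000 samples of size 50 and record the mean of each sample.
sample_means = [rng.choice(population, size=50).mean() for _ in range(1_000)]

print("population mean:         ", population.mean())
print("mean of the sample means:", np.mean(sample_means))   # close to the population mean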
TYPES OF PROBABILITY
There are three main types of probability used in analysis work.
MARGINAL PROBABILITY
JOINT PROBABILITY
CONDITIONAL PROBABILITY
MARGINAL PROBABILITY: The probability of an event occurring (p(A)), unconditioned on any other events.
FOR EXAMPLE: The probability that a card drawn from a standard deck is a 3 is p(3) = 4/52 = 1/13.
EQUATION: p(A), the unconditional probability of event A occurring.
JOINT PROBABILITY: Joint probability is a measure of two events happening at the same time, i.e., p(A and B), the probability of
both event A and event B occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A
and B may be written p(A ∩ B).
FOR EXAMPLE: The probability that a card is a four and red = p(four and red) = 2/52 =1/26.
CONDITIONAL PROBABILITY: The probability of an event or outcome based on the occurrence of a previous event or outcome.
The conditional probability of an event B is the probability that B will occur given that an event A has already occurred;
it is written p(B|A) = p(A and B) / p(A).
TYPES OF PROBABILITY (CONTINUED)
FOR EXAMPLE: The probability that a card drawn from a standard deck is a queen, given that it is a face card, is
p(queen | face) = p(queen and face) / p(face) = (4/52) / (12/52) = 1/3.
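A small sketch that checks the card examples above by enumerating a standard 52-card deck: the marginal probability p(four), the joint probability p(four and red), and the conditional probability p(queen | face card):

from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))           # all 52 (rank, suit) pairs

red_suits = {"hearts", "diamonds"}
face_ranks = {"J", "Q", "K"}

p_four = sum(r == "4" for r, s in deck) / len(deck)                      # marginal: 1/13
p_four_and_red = sum(r == "4" and s in red_suits for r, s in deck) / 52  # joint: 1/26
p_face = sum(r in face_ranks for r, s in deck) / 52
p_queen_and_face = sum(r == "Q" for r, s in deck) / 52
p_queen_given_face = p_queen_and_face / p_face                           # conditional: 1/3

print(p_four, p_four_and_red, p_queen_given_face)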
BAYES THEOREM
The Bayes theorem is used to calculate conditional probability, which is nothing but the probability of an event
occurring based on prior knowledge of conditions that might be related to the event. It can be written as
P(A|B) = P(B|A) * P(A) / P(B).
Bayes' Theorem is a very important statistical concept used in many industries such as healthcare and finance. The formula for
conditional probability used above is closely related to this theorem.
It is used to calculate the probability of a hypothesis based on prior knowledge and on the probability of the observed data under that hypothesis.
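A minimal sketch of Bayes' theorem with purely hypothetical numbers, where H is a hypothesis (for example, "a patient has a disease") and D is the observed data (a positive test result):

# Hypothetical prior and likelihoods (illustrative values only).
p_h = 0.01              # prior probability of the hypothesis H
p_d_given_h = 0.95      # probability of the data D if H is true
p_d_given_not_h = 0.05  # probability of the data D if H is false

# Total probability of observing the data D.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(H | D) = {p_h_given_d:.3f}")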
Mathematician
Statistician
Budget analyst
Economist
Financial analyst
Management analyst
Market Research analyst
Cost Estimator
THANK YOU