
Stats For Managers - Intro


Statistics for Managers –

Theoretical Introduction
Prepared By:
Manuj Madan,
Assistant Professor,
Chitkara University
Elements Vs Variables?
Elements Vs Variables

Elements are the entities on which data
are collected, and variables are the
characteristics of interest for those
elements.
Quantitative Variable?
Quantitative Variable

It tells us about the quantity of


what is measured.
Categorical Variable?
Categorical Variable

They do not measure a quantity of


something.
Is Telephone Country Code a categorical
variable? Yes/No
Yes, because it does not measure a
quantity of something
Ordinal Vs Nominal Variables?
Ordinal Vs Nominal Variables

A categorical variable can be either ordinal or nominal. When the categories
have a natural order, such as when a customer rates a product or service as
‘Not Satisfied’, ‘Moderately satisfied’ or ‘Extremely satisfied’, the variable is
called a ________________________ variable

Ordinal categorical

When no order is implied among the categories, it is called a nominal
categorical variable. Let us see some examples on the next slide.
Example of identifying variables types

Identifiers are a special type of categorical variable


Population in Statistics?
Population in Statistics

The entire set of elements that is the focus of our study is the population.


I want to identify “How many male students wear jeans
daily in statistics class of Chitkara University” What is my
population?

Answer: Statistics Class, not Chitkara University


Sampling Frame?
Sampling Frame

The list from which we draw our sample is called the sampling frame.
What should be the sampling frame if I want to identify -
“How many male students wear jeans daily in statistics class of Chitkara
University?”

Male students of the Statistics class only


Sample?
Sample

It is a subset of population.
Census?
Census

This is a special sample that contains the whole
population.
Statistic and Parameter?
Statistic and Parameter

A statistic is a description of a sample,
whereas a parameter is a description of
a population.

The mean, median or mode computed for a sample
is called a statistic; computed for a
population, it is known as a parameter.
Descriptive Statistics?
Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or


summarizes features of a collection of information.
Any Examples of descriptive stats?

Mean, Median, Standard deviation etc. are descriptive in nature as these describe the
performance of one set of data and no generalization about other data sets is made
from this.
Inferential Statistics?
Inferential Statistics

Inferential statistics makes inferences and


predictions about a population based on
a sample of data taken from the
population in question.
Any Examples?
Hypothesis testing (covered later in the course)
Frequency Distribution?
Frequency Distribution

It is a tabular summary of data showing the number of


items in each non-overlapping class.
Probability Distribution?
Probability Distribution

It is a theoretical frequency distribution, that is, one that describes
how outcomes (values of a random variable) are expected to
vary.
Relative Frequency?
Relative Frequency

Frequency of class divided by total of all frequencies.
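
A minimal sketch of a frequency distribution and its relative frequencies. The survey responses are made-up illustrative data, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data, not from the slides).
responses = ["Satisfied", "Not satisfied", "Satisfied", "Neutral",
             "Satisfied", "Neutral", "Not satisfied", "Satisfied"]

# Frequency distribution: number of items in each non-overlapping class.
frequency = Counter(responses)

# Relative frequency: class frequency divided by the total of all frequencies.
total = sum(frequency.values())
relative_frequency = {cls: count / total for cls, count in frequency.items()}

for cls in frequency:
    print(f"{cls:15s} {frequency[cls]:3d} {relative_frequency[cls]:.3f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.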


Scatter Diagram?
Scatter Diagram

A graph showing the relationship between two quantitative variables.


Trendline?
Trendline

A trendline is a line that approximates the relationship between
two variables.
Bar Chart vs Histogram
Bar Chart vs Histogram

Bar Chart uses qualitative data


whereas histogram uses quantitative
data.

Note: Pareto diagram is a type of bar


chart where arrangement of bars is in
descending order of height
Simpson’s Paradox

Conclusions drawn from two or more separate crosstabulations that can


be reversed when the data are aggregated into a single crosstabulation.
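
A small sketch of the reversal, using hypothetical counts chosen only to illustrate the effect. Treatments "A" and "B" and the groups "small" and "large" are assumed labels.

```python
# Illustrative (hypothetical) counts: (successes, trials) per treatment and group.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success rate for one cell of a crosstabulation."""
    return successes / trials

# Separate crosstabulations: compare A and B within each group.
for group in ("small", "large"):
    a = rate(*results["A"][group])
    b = rate(*results["B"][group])
    print(f"{group:5s}  A={a:.3f}  B={b:.3f}")

def pooled_rate(treatment):
    """Success rate after aggregating both groups into a single table."""
    successes = sum(s for s, _ in results[treatment].values())
    trials = sum(t for _, t in results[treatment].values())
    return successes / trials

print(f"all    A={pooled_rate('A'):.3f}  B={pooled_rate('B'):.3f}")
```

Here A has the higher rate in each group, yet B has the higher rate once the groups are pooled, because the group sizes differ sharply between the treatments.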
Percentile?
Percentile

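The pth percentile is a value such that at least p percent of the
observations are less than or equal to it.
Arrange the data in ascending order and compute
i = (p/100 ) * n, where n is the number of observations.
If i is an integer, the pth percentile is the average of the values in positions i and
i+1; if i is not an integer, the next integer greater than i
denotes the position of the pth percentile.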
Interquartile Range?
Interquartile Range

Q3 – Q1
where Q3 = 75th percentile
and Q1 = 25th percentile
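
The percentile position rule and the interquartile range can be sketched as follows. The function name `percentile` and the salary figures are illustrative assumptions. The rule is implemented exactly as stated on the slide, and other textbooks and software use slightly different conventions.

```python
import math

def percentile(data, p):
    """Return the p-th percentile (0 < p < 100) using i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)            # the rule assumes ascending order
    n = len(ordered)
    i = p * n / 100
    if i == int(i):
        i = int(i)
        # Positions i and i+1 (1-based) are indexes i-1 and i (0-based).
        return (ordered[i - 1] + ordered[i]) / 2
    # Otherwise round up to the next integer position.
    return ordered[math.ceil(i) - 1]

# Illustrative monthly salaries (hypothetical values).
salaries = [3310, 3355, 3450, 3480, 3480, 3490,
            3520, 3540, 3550, 3650, 3730, 3925]

q1 = percentile(salaries, 25)   # 25th percentile
q3 = percentile(salaries, 75)   # 75th percentile
iqr = q3 - q1                   # interquartile range
print(q1, q3, iqr)
```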
Variance and Standard
Deviation?
Variance and Standard Deviation

$\sum (x_i - \mu)^2 / N$ is the population variance,
where N is the population size and µ = population mean
$\sum (x_i - \bar{X})^2 / (n - 1)$ is the sample variance,
where n is the sample size and $\bar{X}$ is the sample mean

Standard Deviation is another measure of dispersion of data and is calculated


using square root of variance.

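Why do we use n-1 in the case of a sample?

To avoid bias. The sample mean is computed from the same data, so only n-1
deviations are free to vary (the degrees of freedom). Dividing by n-1 makes the
sample variance an unbiased estimate of the population variance.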
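
A short sketch of the two variance formulas, with the standard deviation as the square root of the variance. The data values and function names are illustrative assumptions. The standard-library `statistics` module is used only to cross-check the hand-written formulas.

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]  # illustrative values

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

pop_var = population_variance(data)
samp_var = sample_variance(data)
print(pop_var, math.sqrt(pop_var))     # population variance and std deviation
print(samp_var, math.sqrt(samp_var))   # sample variance and std deviation

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance is always a little larger than the population formula applied to the same numbers, because it divides by n-1 instead of n.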
Coefficient of variation?
Coefficient of Variation

It is equal to the standard deviation divided by the mean,
multiplied by 100.

It is used to compare the dispersion of data sets that have different units or very
different means, where the std deviation alone is not the right measure of dispersion.
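
A minimal sketch showing why the coefficient of variation helps when units differ. The lengths are illustrative data, and the sample standard deviation (`statistics.stdev`) is an assumed choice, since the slide does not say which standard deviation to use.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, multiplied by 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# The same five lengths expressed in two different units (illustrative data).
lengths_cm = [150, 160, 170, 180, 190]
lengths_m = [x / 100 for x in lengths_cm]

print(statistics.stdev(lengths_cm), statistics.stdev(lengths_m))  # differ by a factor of 100
cv_cm = coefficient_of_variation(lengths_cm)
cv_m = coefficient_of_variation(lengths_m)
print(cv_cm, cv_m)  # the same percentage in both units
```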
Formula of Z-score?
Formula of Z-score

Z-score = $(x_i - \bar{X}) / \sigma$
where $x_i$ = the observation,
$\bar{X}$ = mean of the data and
σ = standard deviation of the data

What is Outlier?
A value in a set of observations that is abnormally far from the mean, median or
mode.
If Z-score > 3 or Z-score < -3, then the
observation is an outlier
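
The Z-score rule for outliers can be sketched as below. The data are hypothetical, and the sample standard deviation is an assumed choice. With very small data sets no Z-score can exceed 3, so the example uses 20 values.

```python
import statistics

# Hypothetical observations: 19 typical values and one extreme value.
values = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 52, 48, 50, 51, 49, 50, 52, 48, 120]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # assumption: sample standard deviation

# Z-score of each observation: distance from the mean in standard deviations.
z_scores = [(x - mean) / sd for x in values]

# Flag observations whose Z-score is above 3 or below -3.
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 3]
print(outliers)
```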
How to calculate covariance
and correlation?
Calculate covariance and correlation

Covariance = $\sum (x_i - \bar{X})(y_i - \bar{Y}) / (n - 1)$

where n = number of paired (x, y) observations
$x_i$ = ith observation of x variable
$y_i$ = ith observation of y variable
$\bar{X}$ = mean of x variable observations
$\bar{Y}$ = mean of y variable observations
Correlation coefficient = covariance (x,y) / (σx * σy)

where σx = standard deviation of x variable observations
σy = standard deviation of y variable observations
Both are called
Measures of Association
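
A short sketch of both measures computed directly from the formulas. The paired data and the function names are illustrative assumptions, and sample (n-1) standard deviations are used to match the covariance formula.

```python
import math

x = [1, 2, 3, 4, 5]   # illustrative paired observations
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (len(xs) - 1)

def std_dev(values):
    """Sample standard deviation (square root of the n - 1 variance)."""
    return math.sqrt(covariance(values, values))

cov_xy = covariance(x, y)
r = cov_xy / (std_dev(x) * std_dev(y))   # correlation coefficient
print(cov_xy, r)
```

Covariance depends on the units of x and y, while the correlation coefficient always lies between -1 and +1.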
Correlation vs Regression?
Correlation vs Regression

Both are used to describe nature and strength of relationship between two
continuous variables.
Correlation focusses on association whereas regression is inclined towards
making predictions.
Regression analysis always has dependent and independent variables,
whereas correlation treats any two variables alike.
Do they prove cause and effect? Yes/No
No. Neither correlation nor regression by itself establishes cause and effect.
Broad types of sampling?
Broad types of sampling

Probability Sampling and Non-Probability Sampling.


What is probability sampling?
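Sampling in which every element of the population has a known, non-zero chance
of selection, so that statistical inference about the population is valid, is called
probability sampling.
Non-probability Sampling?
The chances of selection are unknown, so sampling error cannot be estimated statistically. Also called judgement sampling.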
Simple Random Sampling?
Simple Random Sampling

It is a type of probability sampling in which each element of the population has an equal
chance of being selected.
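Is it possible in real world? Yes/No
Rarely in a perfect form, because complete lists of the population are hard to obtain
and some selected elements do not respond. This adds bias on top of the sampling
error that arises by chance in any sample.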
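
A minimal sketch of drawing a simple random sample from a sampling frame. The frame of 40 student IDs, the sample size of 5 and the seed are all hypothetical.

```python
import random

# Hypothetical sampling frame: a list of 40 student IDs.
sampling_frame = [f"student_{i:02d}" for i in range(1, 41)]

# A seeded generator makes the illustration repeatable.
rng = random.Random(42)

# Draw 5 students without replacement; every student is equally likely to be chosen.
sample = rng.sample(sampling_frame, k=5)
print(sample)
```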
Biased Samples?
Biased Samples

The parliament is debating some gun control laws. You are asked to conduct
an opinion survey. Because hunters are the ones that are most affected by the
gun control laws, you went to a hunting lodge and interviewed the members
there. Then you reported that in a survey done by you, about 97 percent of
the respondents were in favour of repealing all gun control laws.
A week later, the Parliament took up another bill: “Should working pregnant
women be given a maternity leave of one year with full pay to take care of
new-born babies?” Because this issue affects women most, this time you went
to all the high-rise office complexes in your city and interviewed several
working women of child-bearing age. Again you reported that in a survey
done by you, about 93 percent of the respondents were in favour of the one-
year maternity leave with full pay.
In both of these situations you picked a biased sample by choosing people
who would have very strong feelings on one side of the issue.
Other names of
non-probability Sampling?

Convenience Sample (drawn according to


convenience of researcher)

Purposive or judgement sampling (drawn based on


experience of expert)
Sample Size and Its Determination?
Sample Size and Its Determination?

Sample size is a trade-off between achieving the study objectives and the costs / resources available.


Size depends on:
- Nature of Universe (Homogeneous or Heterogeneous)
- Nature of Study (intense - small or general - large)
- Type of sampling technique (a small SRS is superior to a large, badly selected
sample)
- Availability of Finance
- Standard of accuracy
Primary Data and Secondary Data?

Primary data are data collected first-hand through methods such as
questionnaires, observation, interviews and
schedules.
Secondary data are data that already exist: you
use somebody else’s primary data in your
research and cite the source. Sources of
secondary data must be checked for reliability.
Observation Method -
Advantages and Disadvantages?
Observation Method –
Advantages and Disadvantages
This method is subject to checks and controls of
validity & reliability.
Subjective bias is eliminated.
Information obtained is current and independent of the
respondents' willingness to respond.
It is less demanding of respondents' cooperation, but it is a very costly method.
Some people are rarely accessible to direct
observation.
Interview Method – Types?
Interview Method – Types

Structured – Set of predetermined questions.


Unstructured – freedom to ask supplementary questions and omit certain
questions and may even change the sequence of questions.
Focussed – attention on given experience of respondent & its effects.
Clinical interview – concerned with the individual's life experience and feelings.
Non-directive interview – encourages respondents to talk about the given topic
with a bare minimum of direct questioning.
Question Sequence – easy questions in the
beginning or end?

Ideally, easy questions should be at the beginning,
because if the respondent skips the questions at the end,
considerable information will already have been
obtained.
What is Schedule method of collecting
primary data?
The only difference between the questionnaire
method and the schedule method is that an enumerator,
rather than the respondent, records the answers.
What is the probability distribution of the
possible number of tails from two tosses of a
fair coin?
No. of tails = 0 (H,H)
Probability of outcome is 0.5 X 0.5 = 0.25
No. of Tails = 1 (T,H) or (H,T)
Probability of outcome = 0.25 + 0.25 = 0.5
No. of tails = 2 (T,T)
Probability of outcome = 0.5 X 0.5 = 0.25
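
The same distribution can be derived by listing every equally likely outcome of two tosses. This is a small sketch, and the variable names are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give 0, 1 or 2 tails.
tails_count = Counter(outcome.count("T") for outcome in outcomes)

# Probability = favourable outcomes / total outcomes.
distribution = {tails: Fraction(count, len(outcomes))
                for tails, count in sorted(tails_count.items())}

for tails, probability in distribution.items():
    print(tails, probability)
```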
Frequency Distribution
vs Probability Distribution
Freq. distribution is a listing of observed frequencies of all possible
outcomes of an experiment that actually occurred when experiment
was done whereas

Probability Distribution is listing of the probability of possible outcomes


that could result if experiment was done.
Types of Probability Distribution

Discrete, in which the variable can take on only a limited number of values that can be
listed. The probability that you were born in a given month is discrete
because there are only 12 possible values.

Continuous in which variable under consideration is allowed to take on any


values within a range e.g. examining the level of effluent in a variety of
streams. We would expect continuous range of ppm from very low levels in
clear mountain streams to very high levels in polluted streams.
Random Variables

A variable is random if it takes on different values


as a result of outcomes of a random experiment.

A random variable can be discrete or continuous.


Bernoulli Process assumptions?

1. Each trial has only 2 possible outcomes i.e. heads / tails, yes/no,
success or failure.
2. Probability of outcome of any trial remains fixed over time e.g. with
a fair coin, the probability of head is 0.5.
3. All trials are statistically independent i.e. one outcome of toss does
not affect outcome of any other toss.
Binomial Distribution?

It is applied to discrete random variables only. It describes data


resulting from an experiment known as Bernoulli process.

Probability of r successes in n trials =

$\frac{n!}{r!\,(n-r)!} \, p^r q^{n-r}$
Where p = probability of success
q = probability of failure = 1-p
n = no. of trials undertaken and r = no. of successes
Using Binomial tables
Measures of central tendency and
dispersion for binomial distribution
Mean = np and
standard deviation = Square root(npq)
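
A minimal sketch of the binomial probability together with its mean and standard deviation. The function name `binomial_pmf` and the example of 5 tosses of a fair coin are illustrative assumptions.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n Bernoulli trials."""
    q = 1 - p
    # math.comb(n, r) equals n! / (r! * (n - r)!)
    return math.comb(n, r) * p**r * q**(n - r)

n, p = 5, 0.5          # e.g. 5 tosses of a fair coin
q = 1 - p

prob_two_heads = binomial_pmf(2, n, p)
mean = n * p                      # np
std_dev = math.sqrt(n * p * q)    # square root of npq
print(prob_two_heads, mean, std_dev)

# The probabilities of all possible values of r add up to 1.
print(sum(binomial_pmf(r, n, p) for r in range(n + 1)))
```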
Poisson Distribution?

Discrete probability distribution again. Poisson distribution is useful for


characterizing events with very low probabilities of occurrence within some definite
time or space.

This is used in cases such as arrivals of trucks and cars at a tollbooth.


No. of patients who arrive at a physician's office in a given interval of time will be
0,1,2,3,4,5 or some other whole number.

Formula is $P(x) = \lambda^x e^{-\lambda} / x!$
Where P(x) = probability of exactly x no of occurrences
λ = mean number of occurrences per interval of time
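
A short sketch of the Poisson formula. The function name and the rate of 3 arrivals per interval are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean per interval is lam."""
    return lam**x * math.exp(-lam) / math.factorial(x)

lam = 3  # hypothetical mean number of arrivals per interval

for x in range(6):
    print(x, round(poisson_pmf(x, lam), 4))

# Summing over enough values of x gets arbitrarily close to 1.
total = sum(poisson_pmf(x, lam) for x in range(50))
print(total)
```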
Poisson as an approximation
of Binomial
Poisson can be approximation of binomial when n is large and p is small.

Formula becomes:
$P(x) = (np)^x e^{-np} / x!$
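
A small comparison of the exact binomial probabilities with the Poisson approximation. The values n = 100 and p = 0.02 are an assumed example of "n large and p small".

```python
import math

n, p = 100, 0.02   # illustrative: many trials, small probability of success
lam = n * p        # the Poisson mean is np

def binomial_pmf(x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x):
    return lam**x * math.exp(-lam) / math.factorial(x)

# Largest gap between the exact and approximate probabilities for x = 0..5.
differences = [abs(binomial_pmf(x) - poisson_pmf(x)) for x in range(6)]
for x in range(6):
    print(x, round(binomial_pmf(x), 4), round(poisson_pmf(x), 4))
print(max(differences))
```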
Normal Distribution?

It is a continuous probability distribution.

It occupies important place in statistics.


Characteristics are:
1. Unimodal bell shaped curve
2. Mean of normally distributed population lies at centre of its normal curve
3. Median and mode of distribution are also at centre
4. Two tails of normal distribution extend indefinitely
Areas under the normal curve

68 % of all values lie within ± 1 σ of the mean

95.5 % of all values lie within ± 2 σ of the mean

99.7 % of all values lie within ± 3 σ of the mean
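
These areas can be checked with the standard normal distribution from Python's `statistics` module. This is a minimal sketch, and the helper name `area_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1), "%")
```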


Z-table demonstration
Shortcoming of Normal Distribution

Tails approach horizontal axis, but never touch it


resulting in some probability that random variable
can take on enormous values.
Normal Distribution as approximation of
Binomial Distribution
Although the normal distribution is continuous, it
can be used to approximate a discrete distribution
such as the binomial when np>5 and nq>5.
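
A sketch comparing an exact binomial probability with its normal approximation. The values n = 30 and p = 0.5 are illustrative. The half-unit continuity correction (12.5 instead of 12) is a common refinement that the slide does not mention, so it is an assumption here.

```python
import math
from statistics import NormalDist

n, p = 30, 0.5
q = 1 - p
print(n * p > 5 and n * q > 5)   # the rule of thumb for the approximation holds

# Exact binomial probability of 12 or fewer successes.
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(13))

# Normal approximation with mean np and standard deviation sqrt(npq).
approximation = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q)).cdf(12.5)

print(round(exact, 4), round(approximation, 4))
```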
Other continuous distributions

1. t-distribution
2. Chi-square
3. F-distribution
Sampling distribution of the mean

A probability distribution of all the possible means of


samples of given size n, from a population.
Sampling Error

Error or variation among sample statistics due to chance;


a measure of the extent to which we expect the means
from different samples to vary from the population mean,
owing to the chance error in sampling.
Standard error

The standard deviation of the sampling distribution


of a statistic.
Central Limit theorem

It is a result assuring that the sampling distribution of mean


approaches normality as the sample size increases,
regardless of the shape of the population distribution from
which the sample is collected.
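
A simulation sketch of the sampling distribution of the mean, the standard error and the central limit theorem. The skewed exponential population with mean 1 and standard deviation 1, the sample size of 40, the 2000 repetitions and the seed are all assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)   # seeded so the illustration is repeatable
n = 40                   # size of each sample
num_samples = 2000       # number of samples drawn

# Each sample comes from a strongly skewed (exponential) population.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
observed_standard_error = statistics.stdev(sample_means)
theoretical_standard_error = 1.0 / math.sqrt(n)   # sigma / sqrt(n)

# If the sample means are roughly normal, about 68% lie within one standard error.
within_one = sum(
    abs(m - mean_of_means) <= observed_standard_error for m in sample_means
) / num_samples

print(mean_of_means, observed_standard_error, theoretical_standard_error, within_one)
```

Even though the population is skewed, the sample means cluster around the population mean with a spread close to sigma divided by the square root of n.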
Estimator and estimate

A sample statistic used to estimate the population


parameter is called estimator.

A specific observed value of an estimator is estimate.


Point estimate and interval
estimate
A single number used to estimate an unknown population
parameter is point estimate.

A range of values to estimate an unknown population


parameter is interval estimate.
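
A sketch of a point estimate and an interval estimate of a population mean. The sample values are hypothetical. The 95% level and the normal-based formula (mean ± z × standard error) are assumptions not stated on the slide, and a small sample would usually call for the t-distribution instead.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample of 30 observations.
sample = [52, 48, 55, 50, 47, 53, 51, 49, 54, 50,
          46, 52, 51, 49, 53, 50, 48, 52, 51, 49,
          50, 54, 47, 51, 52, 49, 50, 53, 48, 51]

n = len(sample)
point_estimate = statistics.mean(sample)              # single number
standard_error = statistics.stdev(sample) / math.sqrt(n)

z = NormalDist().inv_cdf(0.975)                       # about 1.96 for 95% confidence
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error           # range of values

print(point_estimate, (lower, upper))
```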
Hypothesis?

Any assumption about the population


NULL and Alternate Hypothesis

Null hypothesis specifies a parameter and a value


for that parameter.

Alternate hypothesis specifies the range of plausible
values for the parameter that we accept if we reject the null.
Testing Hypothesis

We must state the assumed or hypothesized value of the


population parameter before we begin sampling.

The assumption we wish to test is called NULL hypothesis


and is symbolized by H0(H sub zero).
Purpose of hypothesis testing

The purpose of hypothesis testing is not to question the
computed value of the sample statistic but to make a judgment
about the difference between the sample statistic and the hypothesized
population parameter.
Significance Level

A value indicating the percentage of sample values that is
outside certain limits, assuming the null hypothesis is correct.

It can also be called probability of rejecting Null hypothesis


when actually it is true.

Generally it is taken as 5% or 1% in real world situations.


Type I and Type II error

Rejecting a null hypothesis when it is true is a Type I error. Its probability is denoted by α.

Accepting (failing to reject) a null hypothesis when it is false is a Type II error. Its probability is denoted by β.

Power of a test is 1- β.

Trade-off between the two errors is needed depending on penalties attached


to each error.
Two tailed and one tailed tests

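A two-tailed test is a hypothesis test in which the null hypothesis is rejected if the
sample value is significantly higher or lower than the hypothesized
value of the population parameter; it involves two
rejection regions.
A one-tailed test rejects the null hypothesis only for a departure in one
direction and has a single rejection region.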
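
A minimal sketch of a two-tailed test of a population mean. All numbers are hypothetical, and the test assumes a known population standard deviation so that the normal distribution applies. The function name `two_tailed_z_test` is an assumption.

```python
import math
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha):
    """Return the z statistic, the two-tailed p-value and the reject decision."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two rejection regions: add the probability in both tails.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# H0: population mean = 100; sample of 36 with mean 106; sigma assumed to be 15.
z, p_value, reject_at_5 = two_tailed_z_test(106, 100, 15, 36, alpha=0.05)
_, _, reject_at_1 = two_tailed_z_test(106, 100, 15, 36, alpha=0.01)

print(z, p_value, reject_at_5, reject_at_1)
```

The same sample leads to rejecting the null at the 5% significance level but not at the 1% level, which shows how the chosen level controls the risk of a Type I error.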
Two sample tests

Hypothesis tests based on samples taken from two


populations in order to compare their means or
proportions.
t-Distribution

A family of probability distributions, distinguished by their
individual degrees of freedom, similar in form to the normal distribution
and used when the population standard deviation is unknown
and the sample size is less than 30.
Chi-Square distribution

A family of probability distributions, differentiated by their degrees of freedom,
used to test a number of different hypotheses about variances, proportions and
goodness of fit.
The chi-square test of independence is used when both the variables are categorical.
ANOVA

A statistical technique used to test the equality of 3 or


more sample means and thus, make inference as to
whether the samples come from populations having the
same mean.
Goodness of fit test

A statistical test for determining whether there is a significant


difference between an observed frequency distribution and a
theoretical probability distribution hypothesized to describe the
observed distribution.
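
A sketch of the goodness-of-fit statistic, computed as the sum of (observed − expected)² / expected over all classes. The observed counts and the hypothesized uniform distribution are made up. The critical value 7.815 is the standard chi-square table entry for 3 degrees of freedom at the 5% level and is hard-coded as an assumption.

```python
# Hypothetical observed frequencies for four categories (100 observations).
observed = [18, 22, 30, 30]

# Hypothesized distribution: all four categories equally likely.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

degrees_of_freedom = len(observed) - 1
critical_value = 7.815   # table value for 3 degrees of freedom, alpha = 0.05

reject_null = chi_square > critical_value
print(chi_square, degrees_of_freedom, reject_null)
```

Here the statistic is below the critical value, so the difference between the observed and hypothesized distributions is not significant at the 5% level.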
