
Essential Concepts in Statistics

Dr. A. R. Murali Dharan, M.Sc., M.S.I.T., M.Phil., Ph.D.
Assistant Professor
BGS MCH College, Nagarur
Bengaluru
Basics of Statistics
Definition: Statistics is the science of collection, presentation, analysis, and reasonable interpretation of data.
▶ Statistics provides a rigorous scientific method for gaining insight into data.
▶ For example, suppose we measure the weight of 100 patients in a study. With so many measurements, simply looking at the data fails to provide an informative account.
▶ However, statistics can give an instant overall picture of the data based on graphical presentation or numerical summarization, irrespective of the number of data points.
▶ Besides data summarization, another important task of statistics is to make inferences and predict relations between variables.
What is Data?
Definition: Facts or figures, which are numerical or
otherwise, collected with a definite purpose are
called data.
▶ Every day we come across a lot of information in the form of facts, numerical figures, tables, graphs, etc.
▶ These are provided by newspapers, televisions, magazines and
other means of communication.
▶ These may relate to cricket batting or bowling averages, profits of
a company, temperatures of cities, expenditures in various sectors
of a five year plan, polling results, and so on.
▶ These facts or figures, which are numerical or otherwise, collected
with a definite purpose are called data.
Sources of Data
• Primary Sources: Firsthand data collection
(surveys, experiments, etc.).
• Secondary Sources: Pre-collected data
(government reports, research papers).
• Tertiary Sources: Summarized or consolidated
data (encyclopedias, aggregated databases).
Primary Data vs Secondary Data
Primary Data
▶ Primary data is the data that is collected for the first time
through personal experiences or evidence, particularly for
research.
▶ It is also described as raw data or first-hand information.
▶ The mode of assembling the information is costly.
▶ The data is mostly collected through observations, physical
testing, mailed questionnaires, surveys, personal interviews,
telephonic interviews, case studies, and focus groups, etc.
Primary Data vs Secondary Data
Secondary Data
▶ Secondary data is second-hand data that has already been collected and recorded by other researchers for their own purposes, not for the current research problem.
▶ It is accessible in the form of data collected from different sources such as
government publications, censuses, internal records of the organisation, books,
journal articles, websites and reports, etc .
▶ This method of gathering data is affordable, readily available, and saves cost
and time.
▶ However, the one disadvantage is that the information assembled is for some
other purpose and may not meet the present research purpose or may not be
accurate.
Discrete vs Continuous Data
▶ Discrete data (countable) is information that can only take certain
values. These values don’t have to be whole numbers but they are
fixed values – such as shoe size, number of teeth, number of kids,
etc.
▶ Discrete data includes discrete variables that are finite, numeric,
countable, and non-negative integers (5, 10, 15, and so on).
▶ Continuous data (measurable) is data that can take any value.
Height, weight, temperature and length are all examples of
continuous data.
▶ Continuous data changes over time and can have different
values at different time intervals like weight of a person.
Data Presentation
▶ Two types of statistical presentation of data - graphical and
numerical.
▶ Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, centre, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.
▶ Bar diagram and Pie charts are used for categorical
variables.
▶ Histogram, stem-and-leaf and box-plot are used for numerical variables.
Histogram
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.
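As an illustration, here is a minimal Python sketch of drawing such a histogram with matplotlib; the weights below are made-up values, not data from this presentation.

import matplotlib.pyplot as plt

# Hypothetical weights (kg) of 20 patients, for illustration only
weights = [52, 55, 58, 60, 61, 62, 63, 64, 65, 66,
           67, 68, 70, 71, 73, 75, 78, 80, 84, 90]

# Group the values into 5 equal-width ranges (bins); taller bars mean more data in that range
plt.hist(weights, bins=5, edgecolor="black")
plt.xlabel("Weight (kg)")
plt.ylabel("Frequency")
plt.title("Histogram of patient weights")
plt.show()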
Box Plotting

▶ Box plots (also called box-and-whisker plots) give a good graphical image of the concentration of the data.
▶ They also show how far the extreme values are from most of the data.
▶ A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value.
Box Plotting

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"). It can tell you about your outliers and what their values are. It can also tell you whether your data are symmetrical, how tightly your data are grouped, and whether and how your data are skewed.
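A minimal sketch, using numpy on made-up observations, of computing the five values from which a box plot is constructed.

import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])   # hypothetical observations

five_number_summary = {
    "minimum": np.min(data),
    "Q1": np.percentile(data, 25),
    "median": np.median(data),
    "Q3": np.percentile(data, 75),
    "maximum": np.max(data),
}
print(five_number_summary)

# matplotlib can draw the box plot directly from the raw data:
# import matplotlib.pyplot as plt; plt.boxplot(data); plt.show()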
Statistical concepts of classification of Data
▶ Classification is the process of arranging data into homogeneous
(similar) groups according to their common characteristics.
▶ Raw data cannot be easily understood, and it is not fit for further
analysis and interpretation. Arrangement of data helps users in
comparison and analysis. It is also important for statistical
sampling.
Classification of Data
There are four types of classification. They are:
▶ Geographical classification
When data are classified on the basis of location or areas, it is called geographical classification.
▶ Chronological classification
Chronological classification means classification on the basis of time, like months,
years etc.
▶ Qualitative classification
In Qualitative classification, data are classified on the basis of some attributes or
quality such as gender, colour of hair, literacy and religion. In this type of
classification, the attribute under study cannot be measured. It can only be found
out whether it is present or absent in the units of study.
▶ Quantitative classification
Quantitative classification refers to the classification of data according to some
characteristics, which can be measured such as height, weight, income, profits etc.
Quantitative classification
▶ There are two types of quantitative classification of data: Discrete
frequency distribution and Continuous frequency distribution.
▶ In this type of classification there are two elements
▶ variable
Variable refers to the characteristic that varies in magnitude or quantity.
E.g. weight of the students. A variable may be discrete or continuous.
▶ Frequency
Frequency refers to the number of times each value of the variable occurs. For example, if 50 students have a weight of 60 kg, then the frequency for 60 kg is 50.
Qualitative classification

 Qualitative classification is the process of categorizing data based on non-numerical characteristics or qualities.
 In qualitative classification, data is divided into
distinct classes based on descriptive
properties rather than numerical values. This
is useful in studies that focus on categorical
data rather than measurements or quantities.
Frequency distribution

Frequency distribution refers to data classified on the basis of some variable that can be measured, such as prices, weight, height, wages, etc.
Frequency distribution
The following technical terms are important when a
continuous frequency distribution is formed
Class limits: Class limits are the lowest and highest
values that can be included in a class. For example
take the class 51-55. The lowest value of the class is 51
and the highest value is 55. In this class there can be no value lower than 51 or higher than 55. 51 is the lower class limit and 55 is the upper class limit.
Class interval: The difference between the upper and
lower limit of a class is known as class interval of that
class.
Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class.
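A small sketch of forming a continuous frequency distribution in plain Python; the weights and the class limits (51-55, 56-60, 61-65) are illustrative assumptions, not data from the slides.

# Hypothetical weights (kg) of 12 students
weights = [52, 54, 55, 57, 58, 58, 60, 61, 63, 64, 64, 65]

# Each class is given by its (lower limit, upper limit); class interval = upper - lower
classes = [(51, 55), (56, 60), (61, 65)]

for lower, upper in classes:
    # Class frequency: number of observations falling within this class
    frequency = sum(1 for w in weights if lower <= w <= upper)
    print(f"{lower}-{upper}: frequency = {frequency}")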
Measures of Central Tendency

In statistics, a measure of central tendency is a descriptive summary of a data set.
▶ Through a single value from the dataset, it reflects the centre of the data distribution.
▶ It does not provide information about individual data points; rather, it gives a summary of the dataset. Generally, the central tendency of a dataset can be described using several measures in statistics.
Mean
▶ The mean represents the average value of the dataset.
▶ It can be calculated as the sum of all the values in the dataset
divided by the number of values. In general, it is considered as the
arithmetic mean.
▶ Some other measures of mean used to find the central tendency are
as follows:
▶ Geometric Mean (nth root of the product of n numbers)
▶ Harmonic Mean (the reciprocal of the average of the reciprocals)
▶ Weighted Mean (where some values contribute more than others)
▶ It is observed that if all the values in the dataset are the same, then the arithmetic, geometric and harmonic means are all equal. If there is variability in the data, these mean values differ.
Arithmetic Mean
The arithmetic mean is the number obtained by dividing the sum of the elements of a set by the number of values in the set; in layman's terms, the average. If a data set consists of the values b1, b2, b3, ..., bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations)

The arithmetic mean of Virat Kohli's batting scores, also called his batting average, is:
Sum of runs scored / Number of innings = 661/10
The arithmetic mean of his scores in the last 10 innings is 66.1.
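The individual innings scores are not listed on the slide, so the sketch below uses invented scores, chosen only so that they total 661 and the average comes out to 66.1 as above.

import statistics

# Invented innings scores for illustration; they sum to 661
scores = [45, 82, 0, 116, 54, 73, 91, 33, 102, 65]

total = sum(scores)               # 661
mean = statistics.mean(scores)    # 661 / 10 = 66.1
print(total, mean)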
Harmonic Mean
A Harmonic Progression is a sequence whose terms' reciprocals are in Arithmetic Progression, and the harmonic mean (HM) is calculated by dividing the number of terms by the sum of the reciprocals of the terms.

In particular cases, especially those involving rates and ratios, the harmonic mean gives the most appropriate value of the mean. For example, if a vehicle travels a specified distance at speed x (e.g., 60 km/h) and then travels the same distance again at speed y (e.g., 40 km/h), the average speed is the harmonic mean of x and y (i.e., 48 km/h).
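The speed example can be checked directly with Python's statistics module.

import statistics

speeds = [60, 40]   # km/h over the two equal-distance legs of the journey

# HM = number of terms / sum of reciprocals = 2 / (1/60 + 1/40)
hm = statistics.harmonic_mean(speeds)
print(hm)   # 48.0 km/h, the true average speed over the whole journey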
Geometric Mean
▶ The Geometric Mean (GM) is the average value or mean which signifies the central tendency of a set of numbers by using the product of their values.
▶ Basically, we multiply the numbers together and take the nth root of the product, where n is the total number of values.
▶ For example, for the two numbers 3 and 1, the geometric mean is √(3 × 1) = √3 ≈ 1.73.
Use of Geometric Mean
▶ For example, suppose you have an investment which earns 10% the first year, 60% the second year, and 20% the third year. What is its average rate of return?
▶ It is not the arithmetic mean, because what these numbers mean is that in the first year your investment was multiplied (not added to) by 1.10, in the second year it was multiplied by 1.60, and in the third year it was multiplied by 1.20. The relevant quantity is the geometric mean of these three numbers.
▶ The question about finding the average rate of return can be rephrased as: "by what constant factor would your investment need to be multiplied each year in order to achieve the same effect as multiplying by 1.10 one year, 1.60 the next, and 1.20 the third?"
▶ If you calculate the geometric mean of 1.10, 1.60 and 1.20, you get approximately 1.283, so the average rate of return is about 28% (not 30%, which is what the arithmetic mean of 10%, 60%, and 20% would give you).
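A short check of the investment example in Python (statistics.geometric_mean requires Python 3.8 or later).

import statistics

growth_factors = [1.10, 1.60, 1.20]   # +10%, +60%, +20% in successive years

gm = statistics.geometric_mean(growth_factors)
print(gm)        # approximately 1.283
print(gm - 1)    # average rate of return, about 28.3% per year

# Compare with the (misleading) arithmetic mean of the yearly returns:
print(statistics.mean([0.10, 0.60, 0.20]))   # 0.30, i.e. 30%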
Median
The median is the middle value of the dataset when the values are arranged in ascending or descending order.
▶ When the dataset contains an even number of values, the median is found by taking the mean of the middle two values.
▶ If you have a skewed distribution, the best measure of central tendency is the median.
▶ The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g., family income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40, so the mean 270 fails to give a realistic picture of the major part of the data; it is pulled up by the extreme value 990.
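The income example can be reproduced with the statistics module.

import statistics

values = [20, 30, 40, 990]

print(statistics.mean(values))     # 270.0 - pulled up by the outlier 990
print(statistics.median(values))   # 35.0 - the mean of the two middle values, (30 + 40) / 2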
Mode

▶ The mode is the most frequently occurring value in the dataset.
▶ Sometimes the dataset may contain multiple modes and, in some cases, it may not contain any mode at all.
▶ If you have categorical data, the mode is the best choice for describing the central tendency.
Measures of Dispersion
Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent to which numerical data are likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data.
Objectives of computing dispersion
Comparative study
▶ Measures of dispersion give a single value indicating the degree of consistency or
uniformity of distribution. This single value helps us in making comparisons of various
distributions.
Reliability of an average
▶ A small value of dispersion means low variation between observations and average.
It means that the average is a good representative of observation and very
reliable. A higher value of dispersion means greater deviation among the
observations.
Control the variability
▶ Different measures of dispersion describe variability from different angles, and this knowledge can prove helpful in controlling the variation.
Basis for further statistical analysis
▶ Measures of dispersion provide the basis for further statistical analysis like computing
correlation, regression, test of hypothesis, sampling etc.
Types of Measures of Dispersion
There are two main types of dispersion methods in statistics which
are:
▶ Absolute Measure of Dispersion
▶ Relative Measure of Dispersion
Absolute Measure of Dispersion
An absolute measure of dispersion contains the same unit as the original data set.
Absolute dispersion method expresses the variations in terms of the average of
deviations of observations like standard or means deviations. It includes range,
standard deviation, quartile deviation, etc. The types of absolute measures of
dispersion are:

▶ Range: It is simply the difference between the maximum value and the minimum value given in a data set. Example: 1, 3, 5, 6, 7 → Range = 7 − 1 = 6
▶ Variance: Subtract the mean from each value in the set, square each of these differences, add the squares, and divide by the total number of values in the data set. Variance (σ²) = Σ(X − μ)² / N
▶ Standard Deviation: The square root of the variance is known as the standard deviation, i.e., S.D. = √(σ²) = σ.
▶ Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
▶ Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).
Range
▶ It is the simplest method of measurement of dispersion.
▶ It is defined as the difference between the largest and the
smallest item in a given distribution.
▶ Range = Largest item (L) – Smallest item (S)
Interquartile Range
▶ It is defined as the difference between the Upper Quartile
and Lower Quartile of a given distribution.
▶ Interquartile Range = Upper Quartile (Q3) − Lower Quartile (Q1)
Variance
▶ Variance is a measure of how data points differ from the mean.
▶ A variance is a measure of how far a set of data (numbers) are spread
out from their mean (average) value.
▶ The larger the variance, the more scattered the data are about the mean; if the variance is small, the data lie close to the mean. Variance is therefore called a measure of the spread of the data about the mean.
▶ The formula for variance is
Var(X) = E[(X − μ)²]
▶ The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Variance

Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
Given: 3, 8, 6, 10, 12, 9, 11, 10, 12, 7
Step 1: Compute the mean of the 10 values given.
Mean (μ) = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8
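A sketch completing the calculation with the statistics module; pvariance divides by N, matching the formula σ² = Σ(X − μ)²/N used above.

import statistics

data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]

mu = statistics.mean(data)             # 8.8
var = statistics.pvariance(data, mu)   # sum of squared deviations / N = 73.6 / 10 = 7.36
sd = var ** 0.5                        # standard deviation, approximately 2.71
print(mu, var, sd)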
Relative measures of dispersion
• Relative measures of dispersion are statistical tools that express the extent of
variability or spread of a dataset in relation to a central value (such as the mean or
median). Unlike absolute measures of dispersion (such as range or standard deviation),
relative measures are expressed as ratios, percentages, or coefficients, making them
useful for comparing the variability of datasets that have different units or scales.

• Relative measures of dispersion are especially valuable when you need to compare the
variability of different datasets where absolute values could mislead due to differing
scales.
▶ Coefficient of Range: the difference between the maximum and minimum values divided by their sum, i.e., (Max − Min) / (Max + Min).
▶ Coefficient of Variation: this expresses the standard deviation as a percentage of the mean. It is useful for comparing the degree of variability between datasets with different units or vastly different means.
▶ Coefficient of Standard Deviation: this expresses the standard deviation as a ratio of the mean (SD / Mean), not as a percentage.
▶ Coefficient of Quartiles and Coefficient of Quartile Deviation: this measure is based on the interquartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1). It is useful when the data distribution is skewed, as it focuses on the middle 50% of the data.
Coefficient of variation

The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation in relation to its mean.
▶ It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics.
▶ It is also known as the relative standard deviation (RSD).
▶ The coefficient of variation facilitates meaningful comparisons in scenarios where absolute measures cannot.
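A minimal sketch comparing the variability of two made-up groups whose means differ greatly; the CV puts them on a common, unitless scale.

import statistics

# Hypothetical datasets measured on very different scales
incomes_city_a = [30000, 32000, 35000, 28000, 31000]
incomes_city_b = [300, 340, 280, 310, 330]

def coefficient_of_variation(data):
    # CV = standard deviation / mean, reported here as a percentage
    return statistics.pstdev(data) / statistics.mean(data) * 100

print(coefficient_of_variation(incomes_city_a))
print(coefficient_of_variation(incomes_city_b))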
Quartile Deviation

The Quartile Deviation (QD) is half of the difference between the upper and lower quartiles.
▶ Mathematically: Quartile Deviation = (Q3 − Q1) / 2
▶ Quartile Deviation is an absolute measure of dispersion. The corresponding relative measure is known as the coefficient of QD, which is obtained from the formula: Coefficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1)
▶ The coefficient of QD is used to study and compare the degree of variation in different situations.
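A sketch of the quartile deviation and its coefficient using numpy percentiles on made-up data; note that different quartile conventions can give slightly different Q1 and Q3 values.

import numpy as np

data = np.array([20, 25, 29, 30, 35, 39, 42, 48, 51, 60])   # hypothetical observations

q1, q3 = np.percentile(data, [25, 75])

quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, coefficient_of_qd)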
Skewness
▶ Skewness is a measure of the degree of asymmetry of a distribution.
▶ If the left tail (tail at small end of the distribution) is more
pronounced than the right tail (tail at the large end of the
distribution), the function is said to have negative skewness.
▶ If the reverse is true, it has positive skewness. If the two are equal, it
has zero skewness.
Kurtosis
▶ Kurtosis is a measure of whether the data are heavy-tailed or light-
tailed relative to a normal distribution.
▶ That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or lack of
outliers.
▶ Significant skewness and kurtosis clearly indicate that data are not
normal.
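A sketch using scipy.stats on simulated data; by default scipy reports excess kurtosis, which is close to 0 for normal data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=1000)   # roughly normal data

print(stats.skew(sample))       # near 0 for symmetric data
print(stats.kurtosis(sample))   # excess kurtosis, near 0 for normal data

skewed = rng.exponential(scale=1, size=1000)     # right-skewed data
print(stats.skew(skewed))       # clearly positive, indicating positive skewness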
Types of Distributions
Normal Distribution
▶ In probability theory and statistics, the Normal Distribution, also
called the Gaussian Distribution, is the most significant
continuous probability distribution.
▶ A large number of random variables are either nearly or
exactly represented by the normal distribution, in every
physical science and economics.
▶ In a normal distribution, the mean, median and mode are equal (i.e., Mean = Median = Mode). The normal distribution curve is symmetric about the centre.
Hypothesis Testing

Introduction
 The purpose of hypothesis testing is to determine whether there is
enough statistical evidence in favor of a certain belief about a
parameter.
 An hypothesis is a preliminary or tentative explanation or postulate
by the researcher of what the researcher considers the outcome of an
investigation will be. It is an informed/educated guess.
 It indicates the expectations of the researcher regarding certain
variables. It is the most specific way in which an answer to a
problem can be stated.
What is a hypothesis?

 A tentative statement about a population parameter that might be true or false.
THE DIFFERENCE BETWEEN AN HYPOTHESIS AND A PROBLEM

 Both an hypothesis and a problem contribute to the body of knowledge which supports or refutes an existing theory.
 An hypothesis differs from a problem.
 A problem is formulated in the form of a question;
it serves as the basis or origin from which an
hypothesis is derived.
 An hypothesis is a suggested solution to a
problem.
A problem (question) cannot be directly tested,
whereas an hypothesis can be tested and verified.
WHEN IS AN HYPOTHESIS FORMULATED?

 An hypothesis is formulated after the problem has been stated and the literature study has been concluded.
 It is formulated when the researcher is totally aware of the theoretical and empirical background to the problem.
PURPOSE AND FUNCTION OF AN HYPOTHESIS

 It offers explanations for the relationships between those variables that can be empirically tested.
 It furnishes proof that the researcher has sufficient background knowledge to enable him/her to make suggestions in order to extend existing knowledge.
 It gives direction to an investigation.
 It structures the next phase in the investigation and therefore furnishes continuity to the examination of the problem.
CHARACTERISTICS OF AN HYPOTHESIS

 It should have elucidating power.
 It should strive to furnish an acceptable explanation of the phenomenon.
 It must be verifiable.
 It must be formulated in simple, understandable terms.
 It should correspond with existing knowledge.
Types of Hypotheses

1. Descriptive Hypotheses
 These are propositions that describe the characteristics (such as size, form or distribution) of a variable. The variable may be an object, person, organization, etc.
e.g., The rate of unemployment among arts graduates is higher than that of commerce graduates. The educational system is not oriented to the human resource needs of the country.
2. Relational Hypotheses
 These are propositions which describe the relationship between two variables.
e.g., Families with higher incomes spend more on recreation. Upper-class people have fewer children than lower-class people.
Cont…

3. Causal Hypotheses
 These state that the existence of, or a change in, one variable causes or leads to an effect on another variable.
 The first variable is called the independent variable, and the latter the dependent variable.
 When dealing with causal relationships between variables the researcher must consider the direction in which such relationships flow, i.e., which is the cause and which is the effect.

4. Working Hypotheses
 While planning the study of a problem, hypotheses are formed. Initially they may not be very specific. In such cases, they are referred to as 'working hypotheses', which are subject to modification as the investigation proceeds.
Cont…

5. Null Hypotheses
 These hypotheses are formulated for testing statistical significance, since this form is a convenient approach to statistical analysis; the test seeks to nullify the null hypothesis.
e.g., for the research hypothesis "There is a relationship between a family's income and expenditure on recreation", the null hypothesis may state: "There is no relationship between family income level and expenditure on recreation."

6. Statistical Hypotheses
 These are statements about a statistical population. These are derived from a sample. These are quantitative in nature in that they are numerically measurable.
e.g., "Group A is older than Group B."
Cont…
7. Common Sense Hypotheses
 These state the existence of empirical uniformities perceived through day-to-day observations.
e.g., "Shop assistants in small shops lack motivation."

8. Complex Hypotheses
 These aim at testing the existence of logically derived relationships between empirical uniformities.
e.g., in its early stages human ecology described empirical uniformities in the distribution of land values, industrial concentrations, types of business and other phenomena.

9. Analytical Hypotheses
 These are concerned with the relationship of analytic variables. These hypotheses occur at the highest level of abstraction.
 They specify relationships between changes in one property and changes in another.
e.g., the study of human fertility might show empirical regularities by wealth, education, region, and religion.
Characteristics of a Good Hypothesis
 Conceptual Clarity
 Specificity
 Testability
 Availability of Techniques
 Theoretical relevance
 Consistency
 Objectivity
 Simplicity
Sources of Hypotheses

 Theory
 Observation
 Analogies
 Intuition and personal experience
 Findings of studies
 State of Knowledge
 Culture
 Continuity of Research
Steps for Hypothesis Testing
Step 1. Formulate the null hypothesis H0 and the alternative hypothesis H1.
Step 2. Select an appropriate statistical test.
Step 3. Choose the level of significance, α.
Step 4. Collect the data and calculate the test statistic, TSCAL.
Step 5. Determine the probability associated with the test statistic, or determine the critical value of the test statistic, TSCR.
Step 6. Compare that probability with the level of significance α, or determine whether TSCAL falls into the rejection or non-rejection region.
Step 7. Reject or do not reject H0.
Step 8. Draw the marketing research conclusion.
Step 1: Formulate the Hypothesis
 A null hypothesis is a statement of the status quo, one of no difference or no effect. If the null hypothesis is not rejected, no changes will be made.
 An alternative hypothesis is one in which some difference or effect is expected.
 The null hypothesis refers to a specified value of a population parameter (e.g., μ, σ, π), not a sample statistic (e.g., X̄).
Example of a Hypothesis Test
For the data in Table 15.1, suppose we wanted to test the hypothesis that the mean familiarity rating exceeds 4.0, the neutral value on a 7-point scale. A significance level of α = 0.05 is selected. The hypotheses may be formulated as:
H0: μ ≤ 4.0
H1: μ > 4.0
t = (X̄ − μ) / sX̄
sX̄ = 0.293
tCAL = (4.724 − 4.0) / 0.293 = 2.471

The df for the t statistic is n − 1. In this case, n − 1 = 28.
The probability associated with 2.471 is less than 0.05, so the null hypothesis is rejected.
 Alternatively, the critical tα value for a significance level of 0.05 is 1.7011.
Since 1.7011 < 2.471, the null hypothesis is rejected.
The familiarity level does exceed 4.0.
Note that if the population standard deviation were known to be 1.5, rather than estimated from the sample, a z test would be appropriate.
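Since the individual ratings from Table 15.1 are not reproduced here, the sketch below works from the summary figures on the slide (mean 4.724, standard error 0.293, n = 29) and uses scipy only for the t distribution.

from scipy import stats

mean, mu0, se, n = 4.724, 4.0, 0.293, 29

t_cal = (mean - mu0) / se                    # (4.724 - 4.0) / 0.293, about 2.47
p_one_tailed = stats.t.sf(t_cal, df=n - 1)   # upper-tail probability, about 0.01
t_crit = stats.t.ppf(0.95, df=n - 1)         # critical value, about 1.70 for alpha = 0.05

print(t_cal, p_one_tailed, t_crit)
# p < 0.05 and t_cal > t_crit, so H0 is rejected: mean familiarity exceeds 4.0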
Step 1: Formulate the
Hypothesis
 A null hypothesis may be rejected, but it
can never be accepted based on a single
test.
 In marketing research, the null hypothesis
is formulated in such a way that its
rejection leads to the acceptance of the
desired conclusion.
 A new Internet Shopping Service will be introduced if more than 40% of people use it:
H0: π ≤ 0.40
H1: π > 0.40
Step 1: Formulate the Hypothesis
 In the example on the previous slide, the test is a one-tailed test, because the alternative hypothesis is expressed directionally.
 If not, then a two-tailed test would be required, as follows:

H0: π = 0.40
H1: π ≠ 0.40
Step 2: Select an Appropriate
Test
 The test statistic measures how close the
sample has come to the null hypothesis.
 The test statistic often follows a well-known
distribution (eg, normal, t, or chi-square).
 In our example, the z statistic, which follows
the standard normal distribution, would be
appropriate.
z = (p̂ − π) / σp
where σp is the standard deviation (standard error) of the sample proportion.
Step 3: Choose Level of
Significance
Type I Error
 Occurs if the null hypothesis is rejected when it is in
fact true.
 The probability of type I error ( α ) is also called the
level of significance.

Type II Error
 Occurs if the null hypothesis is not rejected when it is in
fact false.
 The probability of type II error is denoted by β .
 Unlike α, which is specified by the researcher, the
magnitude of β depends on the actual value of the
population parameter (proportion).

It is necessary to balance the two types of errors.


Step 3: Choose Level of
Significance
Power of a Test
 The power of a test is the probability (1 - β) of
rejecting the null hypothesis when it is false and should
be rejected.
 Although β is unknown, it is related to α. An extremely low value of α (e.g., α = 0.001) will result in intolerably high β errors.
Probability of z with a One-Tailed Test (Fig. 15.5)
The figure shows the standard normal curve: the shaded area to the left of zCAL = 1.88 is 0.9699, and the unshaded area to the right of zCAL = 1.88 is 0.0301.
Step 4: Collect Data and
Calculate Test Statistic
 The required data are collected and the
value of the test statistic computed.
 In our example, 30 people were surveyed
and 17 shopped on the internet. The
value of the sample proportion is
p = 17/30 = 0.567.
 The value of σp under H0 is:
σp = √(π(1 − π)/n) = √(0.40 × 0.60 / 30) = 0.089
Step 4: Collect Data and Calculate Test Statistic
The test statistic z can be calculated as follows:

zCAL = (p̂ − π) / σp = (0.567 − 0.40) / 0.089 = 1.88
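A sketch of the same calculation using only the Python standard library (NormalDist is available from Python 3.8).

from statistics import NormalDist
from math import sqrt

n, successes, pi0 = 30, 17, 0.40

p_hat = successes / n                   # about 0.567
sigma_p = sqrt(pi0 * (1 - pi0) / n)     # about 0.089 under H0

z_cal = (p_hat - pi0) / sigma_p         # about 1.86 (1.88 on the slide, from the rounded 0.567 and 0.089)
p_value = 1 - NormalDist().cdf(z_cal)   # area to the right of z_cal, about 0.03
z_crit = NormalDist().inv_cdf(0.95)     # about 1.645 for a one-tailed alpha = 0.05

print(z_cal, p_value, z_crit)
# p_value < 0.05 and z_cal > z_crit, so H0: pi <= 0.40 is rejected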
Step 5: Determine Probability Value/
Critical Value
 Using standard normal tables (Table 2 of the Statistical
Appendix), the area to the right of zCAL is .0301 (zCAL =1.88)
 The shaded area between 0 and 1.88 is 0.4699. Therefore,
the area to the right of 1.88 is 0.5 - 0.4699 = 0.0301.
 Thus, the p-value is .0301

 Alternatively, the critical value of z, called zα, which will


give an area to the right side of the critical value of α=0.05,
is between 1.64 and 1.65. Thus zα =1.645.

 Note, in determining the critical value of the test statistic, the area to the right of the critical value is either α or α/2. It is α for a one-tailed test and α/2 for a two-tailed test.
Steps 6 & 7: Compare Prob
and Make the Decision
 If the probability associated with the calculated value of the test statistic (zCAL) is less than the level of significance (α), the null hypothesis is rejected.
 In our case, the p-value is 0.0301. This is less than the level of significance of α = 0.05. Hence, the null hypothesis is rejected.
 Alternatively, if the calculated value of the test
statistic is greater than the critical value of the
test statistic ( zα), the null hypothesis is
rejected.
Steps 6 & 7: Compare Prob
and Make the Decision
 The calculated value of the test statistic zCAL = 1.88 lies in the rejection region, beyond the value of zα = 1.645. Again, the same conclusion to reject the null hypothesis is reached.
 Note that the two ways of testing the null hypothesis are equivalent but mathematically opposite in the direction of comparison.
 Writing the test statistic as TS: if the probability of TSCAL < significance level (α) then reject H0; likewise, if TSCAL > TSCR then reject H0.
Step 8: Marketing Research Conclusion

 The conclusion reached by hypothesis testing must be expressed in terms of the marketing research problem.

 In our example, we conclude that there is evidence that the proportion of Internet users who shop via the Internet is significantly greater than 0.40. Hence, the department store should introduce the new Internet shopping service.
Using a t-Test
 Assume that the random variable X is normally distributed, with the unknown population variance estimated by the sample variance s².

 Then a t test is appropriate:

t = (X̄ − μ) / sX̄

 The t statistic is t-distributed with n − 1 df.

 The t distribution is similar to the normal distribution: bell-shaped and symmetric. As the number of df increases, the t distribution approaches the normal distribution.
Broad Classification of Hypothesis Tests
Hypothesis tests fall into two broad groups: tests of association and tests of differences. Each can concern either means or proportions.
Hypothesis Testing for Differences
Tests for differences are either parametric (metric data) or non-parametric (nonmetric data). Parametric tests are further classified by the number of samples:
▶ One sample: t test, z test
▶ Two or more independent samples: two-group t test, z test
▶ Paired samples: paired t test
Two Independent Samples: Means
 In the case of means for two independent samples, the hypotheses take the following form:

H0: μ1 = μ2
H1: μ1 ≠ μ2

 The two populations are sampled and the means and variances are computed based on samples of sizes n1 and n2.
 The idea behind the test is similar to the test for a single mean, though the formula for the standard error is different.
 Suppose we want to determine whether internet usage is different for males than for females, using the data in Table 15.1.
Two Independent-Samples t Tests
Table 15.14 Summary Statistics

           Number of Cases    Mean     Standard Deviation
Male             15           9.333         1.137
Female           15           3.867         0.435

F Test for Equality of Variances
F value = 15.507, 2-tail probability = 0.000

t Test
Equal variances assumed:     t value = 4.492,  df = 28,     2-tail probability = 0.000
Equal variances not assumed: t value = -4.492, df = 18.014, 2-tail probability = 0.000
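The tabled t value of 4.492 (and the Welch df of 18.014) is reproduced if the 1.137 and 0.435 above are read as standard errors of the two sample means; the sketch below makes that assumption, recovers the sample standard deviations as SE·√n, and hands the summary statistics to scipy.

from math import sqrt
from scipy import stats

n1, mean1, se1 = 15, 9.333, 1.137   # males (values from Table 15.14)
n2, mean2, se2 = 15, 3.867, 0.435   # females

# Assumption: 1.137 and 0.435 are standard errors, so s = SE * sqrt(n)
s1, s2 = se1 * sqrt(n1), se2 * sqrt(n2)

# Pooled-variance t test (equal variances assumed)
t_eq, p_eq = stats.ttest_ind_from_stats(mean1, s1, n1, mean2, s2, n2, equal_var=True)
# Welch t test (equal variances not assumed)
t_uneq, p_uneq = stats.ttest_ind_from_stats(mean1, s1, n1, mean2, s2, n2, equal_var=False)

print(t_eq, p_eq)       # t about 4.49 with 28 df, p well below 0.001
print(t_uneq, p_uneq)   # t about 4.49 with about 18 df, p well below 0.001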
Two Independent Samples: Proportions
Consider the data of Table 15.1. Is the proportion of respondents using the Internet for shopping the same for males and females? The null and alternative hypotheses are:
H0: π1 = π2
H1: π1 ≠ π2

The test statistic is similar to the one for the difference of means, with a different formula for the standard error.
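The raw counts from Table 15.1 are not shown on the slide, so the sketch below uses hypothetical counts purely to illustrate the pooled two-proportion z test.

from math import sqrt
from statistics import NormalDist

# Hypothetical counts of Internet shoppers among males and females (not from Table 15.1)
x1, n1 = 11, 15
x2, n2 = 6, 15

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                         # pooled proportion under H0: pi1 = pi2

se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))   # standard error of the difference
z = (p1 - p2) / se

p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-tailed p-value
print(z, p_value)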
Summary of Hypothesis Tests for Differences

One sample
Application: Proportion — Level of scaling: Metric — Test: z test
Application: Means — Level of scaling: Metric — Test: t test if the variance is unknown; z test if the variance is known

Two independent samples
Application: Means — Level of scaling: Metric — Test: two-group t test; F test for equality of variances
Application: Proportions — Level of scaling: Metric — Test: z test; Nonmetric — chi-square test
