Testing Techniques in Data Science
Testing
T-test, Wilcoxon Test, Mann-Whitney U-Test, Friedman Test
Statistical Hypothesis
• A statistical hypothesis is an assertion or conjecture
concerning one or more populations.
• The truth or falsity of a statistical hypothesis is never
known with absolute certainty unless we examine the
entire population. This, of course, would be impractical
in most situations.
• Instead, we take a random sample from the
population of interest and use the data contained in
this sample to provide evidence that either supports or
does not support the hypothesis.
• Evidence from the sample that is inconsistent with the
stated hypothesis leads to a rejection of the hypothesis.
Null Hypothesis (H0)
• A null hypothesis is a type of statistical hypothesis
that proposes that no statistically significant difference
exists in a given set of observations or samples.
• It is also called the hypothesis of no difference.
• The null hypothesis can also be described as the
hypothesis in which no relationship exists
between two sets of data or variables being
analyzed.
• If the null hypothesis is true, any experimentally
observed effect is due to chance alone.
Alternate Hypothesis
• The alternate hypothesis claims that a
relationship does exist between two variables.
• The alternative hypothesis H1 usually
represents the research question to be
answered or the new
theory/method/technique to be tested.
• The null hypothesis H0 nullifies or opposes H1
and is often the logical complement of H1.
Some examples of Hypothesis in
Empirical Software Engineering
• H0 (Null Hypothesis): There is no difference
between the performance of technique X and
technique Y in predicting the fault proneness
of a class.
• Ha (Alternate Hypothesis): Technique X
performs better than technique Y in
predicting the fault proneness of a class.
• H0 (Null Hypothesis): There is no relationship
between the Coupling Between Objects (CBO)
metric and the fault proneness of a class.
• Ha (Alternate Hypothesis): Classes with a high
value of the CBO metric are more fault prone than
classes with a low value of the CBO metric.
Categories of Statistical Tests
• Statistical tests can be classified according to:
– The relationship between the samples, that is,
whether they are independent or dependent.
– The underlying data distribution (parametric and
non-parametric tests).
– The number of populations from which the samples
are drawn (one sample, two samples, more than
two samples).
– Whether the test is two-sided (two-tailed) or
one-sided (one-tailed).
One-Tailed and Two-Tailed Tests
• In a two-tailed test, deviation of the parameter in either direction from the
specified value is considered. For example:
H0 : μ = μ0
Ha : μ ≠ μ0
where:
μ is the population mean
μ0 is the hypothesized value of the population mean
• When the hypothesis is specified in one direction, a one-tailed test is used. For
example, consider the following null and alternative hypotheses for a one-tailed test:
H0 : μ = μ0
Ha : μ > μ0
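The tail choice determines how the p-value is computed. As a minimal sketch (assuming a z-test with known population standard deviation, and an illustrative z value not taken from these slides), the two-tailed p-value doubles the one-tailed tail area:

```python
import math

def normal_sf(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative test statistic for H0: mu = mu0 (assumed value for demonstration)
z = 1.8
p_one_tailed = normal_sf(z)            # for Ha: mu > mu0
p_two_tailed = 2 * normal_sf(abs(z))   # for Ha: mu != mu0
```

With z = 1.8, the one-tailed p-value is about 0.036 (significant at α = 0.05), while the two-tailed p-value is about 0.072 (not significant), which is why the direction of Ha must be fixed before the test is run.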
Parametric Tests
• Parametric tests are those that make
assumptions about the parameters of the
population distribution from which the
sample is drawn.
• This is often the assumption that the
population data are normally distributed.
Non-Parametric Tests
• Non-parametric tests are those statistical tests
that do not make any assumptions about the
probability distribution of underlying
population from which the sample is drawn.
• Non-parametric methods are also called
distribution-free tests since they do not make
any assumptions about underlying
population.
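A quick way to judge whether a parametric test's normality assumption is plausible is to look at the shape of the sample. The sketch below uses a simple moment-based skewness estimate (the function name and example data are illustrative, not from the slides); in practice a formal normality test such as Shapiro-Wilk would be used.

```python
import statistics

def sample_skewness(xs):
    """Adjusted moment-based skewness; values near 0 are consistent with symmetry."""
    n = len(xs)
    m = statistics.fmean(xs)
    s = statistics.stdev(xs)
    g = sum(((x - m) / s) ** 3 for x in xs)
    return g * n / ((n - 1) * (n - 2))

symmetric_data = [4, 5, 5, 6, 6, 6, 7, 7, 8]   # roughly bell-shaped sample
skewed_data = [1, 1, 1, 2, 2, 3, 5, 9, 20]     # heavy right tail
```

A heavily skewed sample is a signal to prefer a distribution-free test such as the Mann-Whitney U-test over a t-test.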
Significance level of a test
• The level of significance (α) is the fixed
or preset probability of wrongly rejecting the null
hypothesis when it is, in fact, true.
• The significance level is generally set to 0.05 or 0.01,
which means there is a 5% (or 1%) probability
that the null hypothesis is rejected due to
random chance alone.
p-value
• The p-value is the lowest level of significance
at which the observed value of the test
statistic would lead to rejection of the null hypothesis.
• Equivalently, it is the probability, assuming the null
hypothesis is true, of obtaining a test statistic at least
as extreme as the one observed.
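The decision rule that connects the p-value to the significance level can be sketched as follows (the p-value shown is the one from the small-sample Mann-Whitney example later in these slides):

```python
def decide(p_value, alpha=0.05):
    """Reject H0 only when the p-value falls below the preset significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

decision = decide(0.0556)   # table p-value from the Mann-Whitney example below
```

Since 0.0556 > 0.05, the decision is "fail to reject H0".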
Type I and Type II Errors
• Rejection of the null hypothesis when it is true is called a type I
error.
• In other words, a type I error occurs when the null hypothesis of
no difference is rejected even though there is no difference.
• A type I error is also called a "false positive": a result in which an
actual "miss" is erroneously seen as a "hit."
• Non-rejection of the null hypothesis when it is false is called a type
II error.
• A type II error is also known as a "false negative": a result in which an
actual "hit" is erroneously seen as a "miss." The rate of the type II
error is denoted by the Greek letter beta (β) and is related to the
power of a test (which equals 1 − β).
• The power of a test is the probability of rejecting H0 given that a
specific alternative is true.
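The meaning of α as the type I error rate can be checked by simulation: when H0 is true, a test at α = 0.05 should reject in roughly 5% of repeated experiments. A stdlib sketch (the sample size, seed, and number of trials are arbitrary choices):

```python
import math
import random
import statistics

random.seed(42)

def z_test_rejects(sample, mu0, sigma, z_crit=1.96):
    """Two-tailed one-sample z-test with known sigma; True means H0 is rejected.

    1.96 is the two-tailed critical value for alpha = 0.05.
    """
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return abs(z) > z_crit

# H0 is actually true here: every sample is drawn from N(mu0 = 0, sigma = 1)
trials = 2000
rejections = sum(
    z_test_rejects([random.gauss(0, 1) for _ in range(30)], mu0=0, sigma=1)
    for _ in range(trials)
)
type_i_rate = rejections / trials   # should come out close to 0.05
```

Every rejection counted here is a false positive, since the null hypothesis was true in every trial.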
Confusion matrix
                      H0 is true                 H0 is false
Do not reject H0      Correct decision (1 − α)   Type II error (β)
Reject H0             Type I error (α)           Correct decision (power, 1 − β)

Data for the t-test question below:
S1 = 10   S2 = 15   S3 = 24   S4 = 29   S5 = 16
S6 = 35   S7 = 26   S8 = 29   S9 = 19   S10 = 18
S11 = 24  S12 = 23  S13 = 14  S14 = 12  S15 = 5
t-test statistic
• For a single sample, t = (x̄ − μ0) / (s / √n), with n − 1 degrees of
freedom, where x̄ is the sample mean, s is the sample standard
deviation, and n is the sample size.
Answer the following Questions
• State the null and alternate hypotheses.
• Compare the calculated t-value with the t-table value.
• Is the null hypothesis rejected at α = 0.05 in
the above question?
• What is the conclusion?
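A worked sketch of the questions above, using the fifteen sample values from the table. The slides do not state the hypothesized mean, so μ0 = 20 below is an assumed value for illustration only; the critical value 2.145 is the standard two-tailed t-table entry for α = 0.05 and df = 14.

```python
import math
import statistics

# S1..S15 from the table above
data = [10, 15, 24, 29, 16, 35, 26, 29, 19, 18, 24, 23, 14, 12, 5]

mu0 = 20  # hypothesized population mean -- an ASSUMED value, not given on the slide
n = len(data)
x_bar = statistics.fmean(data)
s = statistics.stdev(data)
t_calc = (x_bar - mu0) / (s / math.sqrt(n))  # one-sample t statistic

t_table = 2.145  # two-tailed critical value, alpha = 0.05, df = n - 1 = 14
reject_h0 = abs(t_calc) > t_table
```

With this assumed μ0, the sample mean (≈19.93) is so close to 20 that |t_calc| is far below 2.145, and the null hypothesis is not rejected.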
Two independent samples t-test
• The two sample (independent samples) t-test determines the difference
between the unknown means of two populations based on
independent samples drawn from the two populations.
• If the means of the two samples are significantly different from each other,
then we conclude that the populations are different from each other.
• The samples are either derived from two different populations, or the
population is divided into two random subgroups and the samples are
derived from these subgroups, where each group is subjected to a
different treatment (or technique).
• In both cases, it is necessary that the two samples are independent of
each other.
The hypothesis for the application of Independent Samples t-test can be
formulated as given below:
H0: μ1 = μ2 (There is no difference in the mean values of the two samples.)
Ha: μ1 ≠ μ2 (There is a difference in the mean values of the two samples.)
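The test can be sketched with a stdlib pooled-variance implementation (the example samples are illustrative, not from the slides):

```python
import math
import statistics

def two_sample_t(x, y):
    """Pooled-variance t statistic for two independent samples, with its df."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

t_stat, df = two_sample_t([10, 11, 12, 13], [1, 2, 3, 4])
```

Compare |t_stat| against the t-table value for the returned degrees of freedom. The pooled variance assumes the two populations have equal variances; Welch's variant drops that assumption.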
T-test statistic for two Independent
Samples
• The t-statistic for the two-sample t-test is given as
t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)), where sp is the pooled
standard deviation, with n1 + n2 − 2 degrees of freedom.
Wilcoxon signed-rank test example (computed values; the paired data are not shown)
• T+ = 10.5
• T− = 179.5, T = min(T+, T−) = 10.5
• μT = 95, σT = 24.8
• Z = −3.41, so the null hypothesis is rejected. (Refer to the normal distribution table.)
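The computed values above are consistent with the large-sample normal approximation of the Wilcoxon signed-rank test. Since μT = 95 = n(n + 1)/4 implies n = 19 pairs (the raw paired data are not reproduced, so n here is inferred), the approximation can be checked as:

```python
import math

n = 19     # number of pairs, INFERRED from mu_T = n*(n+1)/4 = 95
T = 10.5   # min(T+, T-) taken from the slide

mu_T = n * (n + 1) / 4
sigma_T = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (T - mu_T) / sigma_T   # |z| > 1.96 rejects H0 at alpha = 0.05 (two-tailed)
```

This reproduces μT = 95, σT ≈ 24.85, and z ≈ −3.40 (the slide's −3.41 comes from rounding σT to 24.8), so the null hypothesis is rejected.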
Mann-Whitney U-Test
• It is a non-parametric equivalent to two
independent samples t-test.
• The underlying data does not need to be normal
for the application of Wilcoxon–Mann–Whitney
test.
• This test investigates whether the two samples
drawn independently belong to the same
population by checking the equality of the two
sample means.
• It can be used when sample sizes are unequal.
Hypothesis for Mann-Whitney U-Test
The hypothesis for the application of the Mann-Whitney U-Test
can be formulated as given below:
• H0: μ1 − μ2 = 0 (The two sample means belong
to the same population and are identical.)
• Ha: μ1 − μ2 ≠ 0 (The two sample means are not
equal and belong to different populations.)
Steps of Mann-Whitney U test
• To perform the test, we need to compute the
rank-sum statistics for all the observations in
the following manner.
• We assume that the number of observations
in sample 1 is n1 and the number of
observations in sample 2 is n2.
• The total number of observations is denoted
by N (N = n1 + n2):
Mann-Whitney U test
1. Combine the data of the two samples. Arrange the combined data in
ascending (low to high) order.
2. Assign ranks to all the observations. The lowest value observation is
provided rank 1, the next to lowest observation is provided rank 2, and so on,
with the highest observation given the rank N.
3. In case of ties (more than one observation having the same value), each
tied observation is assigned the average of the tied ranks. For example, if there are
three observations of data value 20 occupying the 7th, 8th, and 9th ranks,
we would assign the mean rank, that is, 8 ([7 + 8 + 9]/3 = 8), to each of the
observations.
4. We then find the sum of all the ranks allotted to observations in sample 1
and denote it with T1. Similarly, find the sum of all the ranks allotted to
observations in sample 2 and denote it as T2.
5. Step 5 depends on whether the sample size is large or small. If both n1 and
n2 are ≤ 10, the samples are considered small. If either n1 or n2 is greater
than 10, the sample is considered large.
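Steps 1-4 above can be sketched as a stdlib function: a tie-aware joint ranking followed by the two rank sums.

```python
def rank_sums(sample1, sample2):
    """Return (T1, T2): the rank sums of the two samples after joint ranking.

    Tied values all receive the average of the ranks they jointly occupy.
    """
    combined = sorted(sample1 + sample2)          # step 1: pool and sort
    avg_rank = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1                                # positions i..j-1 hold tied values
        avg_rank[combined[i]] = (i + 1 + j) / 2   # steps 2-3: mean of ranks i+1..j
        i = j
    t1 = sum(avg_rank[v] for v in sample1)        # step 4
    t2 = sum(avg_rank[v] for v in sample2)
    return t1, t2
```

As a sanity check, T1 + T2 always equals N(N + 1)/2. For example, rank_sums([20, 20, 25], [20, 30]) averages the three tied 20s over ranks 1-3, giving each the rank 2.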
Mann-Whitney U-test Small Sample
Example
• Consider an example comparing the coupling values
of two different software systems (one open source and the
other academic), to ascertain whether the two
samples are identical with respect to coupling values
(the coupling of a module corresponds to the number of
other modules to which it is coupled).
– Academic: 89, 93, 35, 43
– Open source: 52, 38, 5, 23, 32
Formation of hypothesis.
In this step, the null (H0) and alternative (Ha) hypotheses are formed. The
hypotheses for the example are given below:
H0: μ1 − μ2 = 0 (The two samples are identical in terms of coupling values.)
Ha: μ1 − μ2 ≠ 0 (The two samples are not identical in terms of coupling values.)
Select the appropriate statistical test.
The two samples in our study are independent in nature, as they are collected from
two different software systems (independent samples). Also, the outcome variable
(amount of coupling) is continuous or ordinal in nature. The data may not be normal.
Hence, we use the Mann-Whitney test for comparing the differences among coupling
values of the academic and open source software. Since both n1 and n2 are less than
10, it is the small sample case.
With rank sums T1 = 27 (academic) and T2 = 18 (open source):
U1 = n1·n2 + n1(n1 + 1)/2 − T1 = 4×5 + (4×5)/2 − 27 = 3
U2 = n1·n2 + n2(n2 + 1)/2 − T2 = 4×5 + (5×6)/2 − 18 = 17
U0 = min(U1, U2) = min(3, 17) = 3
Next, look up the p-value in the Mann-Whitney table given on the next slide. If the
p-value is less than α = 0.05, the null hypothesis is rejected; otherwise, the null
hypothesis is not rejected.
For n1 = 4, n2 = 5, and U0 = 3, the p-value as per the Mann-Whitney table is 0.0556,
which is greater than α = 0.05. Thus, the null hypothesis is not rejected, and the
conclusion is that the two samples are identical in terms of coupling values.
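The whole small-sample computation can be reproduced in a few lines of stdlib Python (there are no ties in this data, so simple enumeration ranks suffice):

```python
academic = [89, 93, 35, 43]
open_source = [52, 38, 5, 23, 32]

# Joint ranking of all N = 9 observations (no ties here)
rank = {v: i + 1 for i, v in enumerate(sorted(academic + open_source))}

n1, n2 = len(academic), len(open_source)
t1 = sum(rank[v] for v in academic)      # rank sum of sample 1
t2 = sum(rank[v] for v in open_source)   # rank sum of sample 2

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
u0 = min(u1, u2)   # look up (n1, n2, u0) in the small-sample table
```

This yields T1 = 27, T2 = 18, U1 = 3, U2 = 17, and U0 = 3, matching the worked example above.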
Mann-Whitney Table for small samples
Mann-Whitney U-test Large Sample
Example
• Consider an example comparing the
AVG_CC (Average Cyclomatic Complexity)
values of two different software systems (one open
source and the other academic), to
ascertain whether the two samples are
identical with respect to AVG_CC values.
• Data on next slide
Open Source AVG_CC   Academic AVG_CC
2.75                 4.10
3.29                 4.75
4.53                 3.95
3.61                 3.50
3.10                 4.25
4.29                 4.98
2.25                 5.75
2.97                 4.10
4.01                 2.70
3.68                 3.65
3.15                 5.11
2.97                 4.80
4.05                 6.25
3.60                 3.89
                     4.80
                     5.50
n1 = 14              n2 = 16
Solution to the above large sample
Question
Formation of hypothesis.
In this step, the null (H0) and alternative (Ha) hypotheses are formed. The
hypotheses for the example are given below:
H0: μ1 − μ2 = 0 (The two samples are identical in terms of AVG_CC values.)
Ha: μ1 − μ2 ≠ 0 (The two samples are not identical in terms of AVG_CC
values.)
Select the appropriate statistical test.
The two samples in our study are independent in nature, as they are collected from
two different software systems (independent samples). Also, the outcome variable
(AVG_CC) is continuous or ordinal in nature. The data may not be normal. Hence, we
use the Mann-Whitney test for comparing the differences among AVG_CC values of
the academic and open source software. Since both n1 and n2 are greater than 10, it
is the large sample case.
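The large-sample case can be completed as follows: rank the 30 pooled values (averaging tied ranks), compute U, and apply the normal approximation z = (U − μU)/σU with μU = n1·n2/2 and σU = √(n1·n2·(n1 + n2 + 1)/12). The sketch below ignores the small tie correction to σU:

```python
import math

open_source = [2.75, 3.29, 4.53, 3.61, 3.10, 4.29, 2.25,
               2.97, 4.01, 3.68, 3.15, 2.97, 4.05, 3.60]
academic = [4.10, 4.75, 3.95, 3.50, 4.25, 4.98, 5.75, 4.10,
            2.70, 3.65, 5.11, 4.80, 6.25, 3.89, 4.80, 5.50]

n1, n2 = len(open_source), len(academic)
combined = sorted(open_source + academic)

def avg_rank(v):
    """Average rank of value v in the pooled, sorted data (handles ties)."""
    positions = [i + 1 for i, x in enumerate(combined) if x == v]
    return sum(positions) / len(positions)

t1 = sum(avg_rank(v) for v in open_source)   # rank sum of the open source sample
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
u0 = min(u1, n1 * n2 - u1)

mu_u = n1 * n2 / 2
sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u0 - mu_u) / sigma_u
reject_h0 = abs(z) > 1.96   # two-tailed critical value for alpha = 0.05
```

This gives U0 = 39 and z ≈ −3.03, so |z| > 1.96 and the null hypothesis is rejected: the two software systems differ in their AVG_CC values.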