Testing Technique in Data Science

Hypothesis Testing / Statistical Testing
T-test, Wilcoxon Test, Mann-Whitney U-Test, Friedman Test
Statistical Hypothesis
• A statistical hypothesis is an assertion or conjecture
concerning one or more populations.
• The truth or falsity of a statistical hypothesis is never
known with absolute certainty unless we examine the
entire population. This, of course, would be impractical
in most situations.
• Instead, we take a random sample from the
population of interest and use the data contained in
this sample to provide evidence that either supports or
does not support the hypothesis.
• Evidence from the sample that is inconsistent with the
stated hypothesis leads to a rejection of the hypothesis.
Null Hypothesis (H0)
• A null hypothesis is a type of statistical hypothesis
that proposes that no statistical significance
exists in a set of given observations or a sample.
• It is also called the hypothesis of no difference.
• The null hypothesis can also be described as the
hypothesis in which no relationship exists
between two sets of data or variables being
analyzed.
• If the null hypothesis is true, any experimentally
observed effect is due to chance alone.
Alternate Hypothesis
• Alternate Hypothesis claims that a
relationship does exist between two variables.
• The alternative hypothesis H1 usually
represents the research question to be
answered or the new
theory/method/technique to be tested.
• The null hypothesis H0 nullifies or opposes H1
and is often the logical complement of H1.
Some examples of Hypothesis in
Empirical Software Engineering
• H0 (Null Hypothesis): There is no difference
between the performance of technique X and
technique Y in predicting the fault proneness
of a class.
• Ha (Alternate Hypothesis): "Technique X"
performs better than "Technique Y" in
predicting the fault proneness of a class.
• H0 (Null Hypothesis): There is no relationship
between the Coupling Between Objects (CBO)
metric and fault proneness of a class.
• Ha (Alternate Hypothesis): Classes with a high
value of the CBO metric are more fault prone than
classes with a low value of the CBO metric.
Categories of Statistical Tests
• Statistical tests can be classified according to:
– The relationship between the samples, that is,
whether they are independent or dependent.
– The underlying data distribution (parametric and
non-parametric tests).
– The number of populations from which the samples
are drawn (one sample, two samples, more than
two samples).
– Whether the test is two-sided (tailed) or one-sided (tailed).
One-Tailed and Two-Tailed Tests
• In two-tailed test, the deviation of the parameter in each direction from the
specified value is considered.
H0 : μ = μ0
Ha :μ ≠ μ0
where:
μ is the population mean
μ0 is the hypothesized value of the mean

• When the hypothesis is specified in one direction, then one-tailed test is used. For
example, consider the following null and alternative hypotheses for one tailed test:
H0 : μ = μ0
Ha :μ > μ0
where:
μ is the population mean
μ0 is the hypothesized value of the mean
Parametric Tests
• Parametric tests are those that make
assumptions about the parameters of the
population distribution from which the
sample is drawn.
• This is often the assumption that the
population data are normally distributed.
Non-Parametric Tests
• Non-parametric tests are those statistical tests
that do not make any assumptions about the
probability distribution of underlying
population from which the sample is drawn.
• Non-parametric methods are also called
distribution-free tests since they do not make
any assumptions about underlying
population.
Significance level of a test
• The level of significance (α) is the probability that
the reported results happened due to random
chance.
• The level of significance (α) is defined as the fixed
or preset probability of wrongly rejecting the null
hypothesis when, in fact, it is true.
• The significance level is generally set to 0.05 or 0.01,
which means that there is a 5% (or 1%) probability
that the null hypothesis is rejected due to
random chance.
p-value
• A P-value is the lowest level (of significance)
at which the observed value of the test
statistic is significant
Type I and Type II Errors
• Rejection of the null hypothesis when it is true is called a type I
error.
• In other words, a type I error occurs when the null hypothesis of
no difference is rejected, even though there is actually no difference.
• A type I error can also be called a "false positive"; a result where an
actual "miss" is erroneously seen as a "hit."
• Non-rejection of the null hypothesis when it is false is called a type
II error.
• A type II error is also known as a "false negative"; a result where an
actual "hit" is erroneously seen as a "miss." The rate of the type II
error is denoted by the Greek letter beta (β) and is related to the
power of a test (which equals 1 − β).
• The power of a test is the probability of rejecting H0 given that a
specific alternative is true.
Confusion matrix

                   H0 is true            H0 is false
H0 not rejected    Correct decision      Type II error
                   (True Negative)       (False Negative)
H0 rejected        Type I error          Correct decision
                   (False Positive)      (True Positive)
Steps in Hypothesis Testing
Step 1: Define hypothesis—In the first step, the hypothesis is defined corresponding
to the outcomes. The statistical tests are used to verify the hypothesis formed in the
experimental design phase.
Step 2: Select the appropriate statistical test—The appropriate statistical test is
determined in experiment design on the basis of assumptions of a given statistical
test.
Step 3: Apply test and calculate p-value—The next step involves applying the
appropriate statistical test and calculating the significance value, also known as p-
value.
Step 4: Define significance level—The threshold level or critical value (also known as
α-value) that is used to check the significance of the test statistic is defined. If the p-value
is less than the α-value, the null hypothesis is rejected; otherwise it is not rejected.
Step 5: Derive conclusions—Finally, the conclusions on the hypothesis are derived
using the results of the statistical test carried out in step 3.
Question on one-sample t-test
Consider the table below, where the number of modules for 15
software systems is shown.
We want to determine whether the population from which the
sample is derived differs, on average, from 12 modules.
Software System   Number of Modules   Software System   Number of Modules   Software System   Number of Modules
S1                10                  S6                35                  S11               24
S2                15                  S7                26                  S12               23
S3                24                  S8                29                  S13               14
S4                29                  S9                19                  S14               12
S5                16                  S10               18                  S15               5
t test statistic
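• For reference, the standard one-sample t statistic for a sample of size n with sample mean x̄ and sample standard deviation s is
  t = (x̄ − μ0) / (s / √n)
  with n − 1 degrees of freedom, where μ0 is the hypothesized mean (12 modules in this example).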
Answer the following Questions
• State the Null and Alternate Hypothesis.
• Compare the t-calculated with t- table.
• Is the Null Hypothesis rejected at α=0.05 in
the above question?
• What is the conclusion?
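As a quick check of the hand calculation, the same test can be run with scipy (a minimal sketch, assuming scipy is installed; the module counts are taken from the table above):

```python
# Sketch: one-sample t-test on the module counts, testing H0: mean = 12.
from scipy import stats

modules = [10, 15, 24, 29, 16, 35, 26, 29, 19, 18, 24, 23, 14, 12, 5]
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(modules, popmean=12)  # two-sided by default
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the mean number of modules differs from 12.")
else:
    print("Do not reject H0.")
```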
Two independent samples t-test
• The two sample (independent sample) t-test determines the difference
between the unknown means of two populations based on the
independent samples drawn from the two populations.
• If the means of the two samples are different from each other, then we
conclude that the populations are different from each other.
• The samples are either derived from two different populations, or the
population is divided into two random subgroups and the samples are
derived from these subgroups, where each group is subjected to a
different treatment (or technique).
• In both cases, it is necessary that the two samples are independent of
each other.
The hypothesis for the application of Independent Samples t-test can be
formulated as given below:
H0: μ1 = μ2 (There is no difference in the mean values of both the samples.)
Ha: μ1 ≠ μ2 (There is difference in the mean values of both the samples.)
T-test statistic for two Independent
Samples
• The t-statistic for the two-sample t-test is given below,
where μ1 and μ2 are the means of the two samples,
σ1 and σ2 are the standard deviations of the two samples,
and n1 and n2 are their sample sizes.
• The degrees of freedom is n1 + n2 − 2.
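• One common form of the statistic, consistent with the quantities listed above (this is the unpooled version; a pooled-variance version is also widely used), is
  t = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2)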
Example of Two independent samples
t-test
Consider an example for comparing the properties
of industrial and open source software in terms of
the average amount of coupling between modules
(the other modules to which a module is coupled).
In this example, we wish to test the hypothesis that
the type of software affects the amount of coupling
between modules.
• Industrial: 150, 140, 172, 192, 186, 180, 144, 160,
188, 145, 150, 141
• Open source: 138, 111, 155, 169, 100, 151, 158,
130, 160, 156, 167, 132
Paired t-test
• Study From pdf provided
Answer the Following Questions
• State the Null and Alternate Hypothesis.
• Calculate Mean and Std. Dev of both samples
• Apply two independent sample t-test.
• Is the Null Hypothesis Rejected at α=0.05 in
the above question?
• What is the conclusion?
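A sketch of the same comparison using scipy (assuming scipy is available; equal_var=True gives the pooled two-sample test, while equal_var=False gives Welch's version):

```python
# Sketch: two independent samples t-test on the coupling data from the example.
from scipy import stats

industrial  = [150, 140, 172, 192, 186, 180, 144, 160, 188, 145, 150, 141]
open_source = [138, 111, 155, 169, 100, 151, 158, 130, 160, 156, 167, 132]
alpha = 0.05

# equal_var=True -> pooled (classic) two-sample t-test with n1 + n2 - 2 degrees of freedom
t_stat, p_value = stats.ttest_ind(industrial, open_source, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```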
Wilcoxon Matched pair test
• Wilcoxon signed-ranks test is a nonparametric
test that is used to perform pairwise
comparisons among different treatments.
• It is also called Wilcoxon matched pairs test
and is used in the scenario of two related
samples.
• It is the non-parametric equivalent of the paired t-test.
Hypothesis for Wilcoxon Test
• The Wilcoxon test is based on the following
hypotheses:
• H0: There is no statistical difference between
the two treatments.
• Ha: There exists a statistical difference
between the two treatments.
Steps 1-3 of Wilcoxon Test
• To perform the test, we compute the differences among the related pair of
values of both the treatments. The differences are then ranked based on
their absolute values.
• We perform the following steps while assigning ranks to the differences:
1. Exclude the pairs where the absolute difference is 0. Let nr be the reduced
number
of pairs.
2. Assign rank to the remaining nr pairs based on the absolute difference. The
smallest absolute difference is assigned a rank 1.
3. In case of ties among differences (more than one difference having the
same value), each tied difference is assigned the average of the tied ranks. For
example, if there are two differences of value 5 occupying the 7th and
8th ranks, we would assign the mean rank, that is, 7.5 ([7 + 8]/2 = 7.5), to each
of the differences.
Steps 4 of Wilcoxon Test
• We now compute two variables T+ and T−.
• T+ represents the sum of ranks assigned to
differences, where the data instance in the
first treatment outperforms the second
treatment.
• Similarly, T− represents the sum of ranks
assigned to differences, where the second
treatment outperforms the first treatment.
• Find T= min(T+, T-)
Example of Wilcoxon Test (Small
Sample n<=15)
• Consider the Sum of Absolute residual Errors
(Sum ARE) of Two ML Techniques on Six
Software Maintainability Prediction Datasets.
Dataset   ML1 SumARE   ML2 SumARE   di     Abs(di)   Rank
D1        1950         1760         +190   190       4 (+)
D2        1840         1870         -30    30        1 (-)
D3        2015         1810         +205   205       5 (+)
D4        1580         1660         -80    80        2 (-)
D5        1790         1340         +450   450       6 (+)
D6        1925         1765         +160   160       3 (+)
Contd..
Is there a statistically significant difference between the
performance of the two techniques?

• The Null and Alternate Hypotheses are:
• H0: There is no statistical difference between the
SumARE of two ML Techniques
• Ha: There exists a statistical difference between
SumARE of two ML Techniques.
• Thus in the above example T+ = 3+4+5+6= 18
• T- = 1+2=3
• T= min (T+, T-)=3
Step 5 Refer to Wilcoxon Table
Step 6
• Because the observed T = 3 is greater than the critical
T = 1 from the table for α = 0.05, the decision is not
to reject the null hypothesis.
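A sketch of this comparison with scipy (assuming scipy is available); note that scipy's reported statistic follows its own convention and may not equal T = min(T+, T−) as computed above, but the p-value drives the same decision:

```python
# Sketch: Wilcoxon signed-rank test on the SumARE values of the two ML techniques.
from scipy import stats

ml1 = [1950, 1840, 2015, 1580, 1790, 1925]
ml2 = [1760, 1870, 1810, 1660, 1340, 1765]
alpha = 0.05

stat, p_value = stats.wilcoxon(ml1, ml2)  # two-sided by default
print(f"statistic = {stat}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```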
Wilcoxon Test Large Sample
case(n>15)
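• For a large sample with nr non-zero differences, T is approximately normally distributed under H0. The standard normal approximation (stated here for reference; it reproduces the values used in the example below) is
  μT = nr(nr + 1)/4
  σT = √(nr(nr + 1)(2nr + 1)/24)
  Z = (T − μT)/σT
• The computed Z is then compared against the normal distribution table.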
Example Wilcoxon Test Large Sample
case(n>15)
• Given the Productivity (in terms of number of
modules developed per day) of 20 SDEs before
and after Agile adoption.
• Research Question- Is there any significant
increase in productivity?
• ( Data on Next Slide)
• H0: There is no statistical difference between the
productivity, before and after Agile adoption
• Ha: Productivity significantly increases after Agile
adoption.

• T+ = 10.5
• T− = 179.5, T = min(T+, T−) = 10.5
• μT = 95, σT = 24.8 (with nr = 19 non-zero differences)
• Z = (10.5 − 95)/24.8 = −3.41; the null hypothesis is rejected (refer to the normal distribution table).
Mann- Whitney U-Test
• It is a non-parametric equivalent to two
independent samples t-test.
• The underlying data does not need to be normal
for the application of Wilcoxon–Mann–Whitney
test.
• This test investigates whether the two samples
drawn independently belong to the same
population by checking the equality of the two
sample means.
• It can be used when sample sizes are unequal.
Hypothesis for Mann- Whitney U-Test
The hypothesis for the application of Mann-
Whitney U-Test can be formulated as given
below:
• H0: μ1 − μ2 = 0 (The two sample means belong
to the same population and are identical.)
• Ha: μ1 − μ2 ≠ 0 (The two sample means are not
equal and belong to different populations.)
Steps of Mann- Whitney U test
• To perform the test, we need to compute the
rank-sum statistics for all the observations in
the following manner.
• We assume that the number of observations
in sample 1 is n1 and the number of
observations in sample 2 is n2.
• The total number of observations is denoted
by N (N = n1 + n2):
Mann-Whitney U test
1. Combine the data of the two samples. Arrange the combined data in
ascending (low to high) order.
2. Assign ranks to all the observations. The lowest value observation is
given rank 1, the next lowest observation is given rank 2, and so on,
with the highest observation given the rank N.
3. In case of ties (more than one observation having the same value), each
tied observation is assigned the average of the tied ranks. For example, if there are
three observations of value 20 occupying the 7th, 8th, and 9th ranks,
we would assign the mean rank, that is, 8 ([7 + 8 + 9]/3 = 8), to each of the
observations.
4. We then find the sum of all the ranks allotted to observations in sample 1
and denote it by T1. Similarly, find the sum of all the ranks allotted to
observations in sample 2 and denote it by T2.
5. Step 5 depends on whether the sample size is large or small. If both n1 and
n2 are ≤ 10, the samples are considered small. If either n1 or n2 is greater
than 10, the sample is considered large.
Mann-Whitney U-test Small Sample
Example
• Consider an example for comparing the coupling values
of two different software (one open source and other
academic software), to ascertain whether the two
samples are identical with respect to coupling values
(coupling of a module corresponds to the number of
other modules to which a module is coupled).
– Academic: 89, 93, 35, 43
– Open source: 52, 38, 5, 23, 32

Since both n1 and n2 are less than 10, it is the small sample case.
Solution to the above Question

Formation of hypothesis.
In this step, the null (H0) and alternative (Ha) hypotheses are formed. The
hypotheses for the example are as follows:
H0: μ1 − μ2 = 0 (The two samples are identical in terms of coupling values.)
Ha: μ1 − μ2 ≠ 0 (The two samples are not identical in terms of coupling values.)
Select the appropriate statistical test.
The two samples of our study are independent in nature, as they are collected from
two different software systems (independent samples). Also, the outcome variable (amount of
coupling) is continuous or ordinal in nature. The data may not be normal. Hence, we
use the Mann-Whitney test for comparing the differences among coupling values of
an academic and an open source software. Since both n1 and n2 are less than 10, it is
the small sample case.

Now, we apply the Mann-Whitney U-Test steps.

• Steps 1 and 2: Combine and rank all observations. The combined data in
ascending order are 5, 23, 32, 35, 38, 43, 52, 89, 93, which receive ranks 1-9.
• Step 3: There are no ties.
• Step 4: Calculate the sum of ranks for both samples.
• Sum of ranks assigned to observations in Academic
software (T1) = 4 + 6 + 8 + 9 = 27.
• Sum of ranks assigned to observations in open
source software (T2) = 1 + 2 + 3 + 5 + 7 = 18.
Mann-Whitney U-statistic
• Step 5: Calculate U1, U2, and U0 according to the formulas given below:
  U1 = [(n1 × n2) + n1(n1 + 1)/2] − T1
  U2 = [(n1 × n2) + n2(n2 + 1)/2] − T2

Thus U1 = [(4 × 5) + (4 × 5)/2] − 27 = 3
U2 = [(4 × 5) + (5 × 6)/2] − 18 = 17
U0 = min(U1, U2) = min(3, 17) = 3
Next, look up the p-value in the Mann-Whitney table given on the next slide. If the p-value is less
than α = 0.05, the null hypothesis is rejected; otherwise the null hypothesis is not rejected.

For n1 = 4, n2 = 5, and U0 = 3, the p-value as per the Mann-Whitney table is 0.0556, which is
greater than α = 0.05. Thus, the null hypothesis is not rejected and the conclusion is that
the two samples are identical in terms of coupling values.
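A sketch of the same test with scipy (assuming a recent scipy is available); note that scipy's reported U statistic follows its own convention, which can be the complement of U0 = min(U1, U2), but the p-value leads to the same decision:

```python
# Sketch: Mann-Whitney U test on the coupling values of the two software samples.
from scipy import stats

academic    = [89, 93, 35, 43]
open_source = [52, 38, 5, 23, 32]
alpha = 0.05

# method="exact" requests the exact distribution, appropriate for small samples without ties
u_stat, p_value = stats.mannwhitneyu(academic, open_source,
                                     alternative="two-sided", method="exact")
print(f"U = {u_stat}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```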
Mann – Whitney Table for small
samples
Mann-Whitney U-test Large Sample
Example
• Consider an example for comparing the
AVG_CC (Average Cyclomatic Complexity)
values of two different software (one open
source and other academic software), to
ascertain whether the two samples are
identical with respect to AVG_CC values.
• Data on next slide
Open Source AVG_CC   Academic AVG_CC
2.75                 4.10
3.29                 4.75
4.53                 3.95
3.61                 3.50
3.10                 4.25
4.29                 4.98
2.25                 5.75
2.97                 4.10
4.01                 2.70
3.68                 3.65
3.15                 5.11
2.97                 4.80
4.05                 6.25
3.60                 3.89
(n1 = 14)            4.80
                     5.50
                     (n2 = 16)
Solution to the above large sample
Question
Formation of hypothesis.
In this step, the null (H0) and alternative (Ha) hypotheses are formed. The
hypotheses for the example are as follows:
H0: μ1 − μ2 = 0 (The two samples are identical in terms of AVG_CC values.)
Ha: μ1 − μ2 ≠ 0 (The two samples are not identical in terms of AVG_CC values.)
Select the appropriate statistical test.
The two samples of our study are independent in nature, as they are collected from
two different software systems (independent samples). Also, the outcome variable (AVG_CC)
is continuous or ordinal in nature. The data may not be normal. Hence, we use the
Mann-Whitney test for comparing the differences among AVG_CC values of an academic
and an open source software. Since both n1 and n2 are greater than 10, it is
the large sample case.

Now, we apply the Mann-Whitney U-Test steps for the large sample case.

• Steps 1, 2, and 3 on the next slide.
AVG_CC Value   Rank   Group        AVG_CC Value   Rank   Group
2.25           1      OSS          4.10           18.5   Acad
2.70           2      Acad         4.10           18.5   Acad
2.75           3      OSS          4.25           20     Acad
2.97           4.5    OSS          4.29           21     OSS
2.97           4.5    OSS          4.53           22     OSS
3.10           6      OSS          4.75           23     Acad
3.15           7      OSS          4.80           24.5   Acad
3.29           8      OSS          4.80           24.5   Acad
3.50           9      Acad         4.98           26     Acad
3.60           10     OSS          5.11           27     Acad
3.61           11     OSS          5.50           28     Acad
3.65           12     Acad         5.75           29     Acad
3.68           13     OSS          6.25           30     Acad
3.89           14     Acad
3.95           15     Acad
4.01           16     OSS
4.05           17     OSS
• Step 4 Calculate Sum of ranks for both
samples
– Sum of ranks assigned to observations in OSS
software (T1) =
1+3+4.5+4.5+6+7+8+10+11+13+16+17+21+22=144
– Sum of ranks assigned to observations in
Academic software (T2) =
2+9+12+14+15+18.5+18.5+20+23+24.5+24.5+26+27+28+29+30=321
Step 5 for large sample case
– U1 = [(14 × 16) + (14 × 15)/2] − 144 = 185
– μU = (14 × 16)/2 = 112, σU = √((14 × 16 × 31)/12) = 24.1
– Z = (185 − 112)/24.1 = 3.03

The p-value corresponding to the Z-value from the normal
distribution table is 0.0012. Hence, the null hypothesis
is rejected.
Conclusion: There is a significant difference in the AVG_CC of
OSS and Academic software.
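For the large sample case, the normal approximation can be requested explicitly in scipy (a sketch, assuming a recent scipy; scipy applies tie and continuity corrections, so its Z and p may differ slightly from the hand calculation above):

```python
# Sketch: Mann-Whitney U test on the AVG_CC values using the normal approximation.
from scipy import stats

oss  = [2.75, 3.29, 4.53, 3.61, 3.10, 4.29, 2.25, 2.97, 4.01, 3.68, 3.15, 2.97, 4.05, 3.60]
acad = [4.10, 4.75, 3.95, 3.50, 4.25, 4.98, 5.75, 4.10, 2.70, 3.65, 5.11, 4.80, 6.25, 3.89, 4.80, 5.50]
alpha = 0.05

u_stat, p_value = stats.mannwhitneyu(oss, acad, alternative="two-sided",
                                     method="asymptotic")
print(f"U = {u_stat}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```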
Friedman Test
• This Test is used to compare multiple classifiers over multiple datasets
• The statistical hypothesis is:
– H0: There is no difference in the performance among k classifiers.
– vs.
– H1: At least two classifiers have significantly different performance.
• When more than two classifiers are under comparison, a multiple test
procedure may be appropriate. If the null hypothesis of equivalent
performance among k classifiers is rejected, we can proceed with a post-
hoc test.
• Demsar (2006) overviewed theoretical work on statistical tests for
classifier comparison.
• When the comparison includes more than two classifiers over multiple
datasets, he recommends the Friedman test followed by the
corresponding post-hoc Nemenyi test.
• The Friedman test and the Nemenyi test are the nonparametric counterparts
of analysis of variance (ANOVA) and the Tukey test, respectively.
Steps 1 and 2 to compute the Friedman
test statistic χ2
1. Organize and sort the data values of all the treatments
for a specific data instance or data set in descending (high
to low) order. Allocate ranks to all the observations from
1 to k, where rank 1 is assigned to the best performing
treatment value and rank k to the worst performing
treatment. In case of two or more observations of equal
values, assign the average of the ranks that would have
been assigned to the observations.
2. We then compute the total of ranks allocated to a
specific treatment over all the data instances. This is done
for all the treatments, and the rank totals for the k treatments
are denoted by R1, R2, ..., Rk.
Step 3 to compute the Friedman test
statistic χ2
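• The standard Friedman statistic for n data sets and k treatments (stated here for reference; it reproduces the value computed in the example below) is
  χ² = [12 / (n · k · (k + 1))] · (R1² + R2² + ... + Rk²) − 3 · n · (k + 1)
  and it is compared against the chi-square distribution with k − 1 degrees of freedom.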
Friedman test example
• Consider the table on next slide, where the
performance values of six different ML
classifiers are stated when they are evaluated
on six data sets.
• Investigate whether the performance of
different methods differ significantly.
PERFORMANCE OF SIX CLASSIFIERS ON SIX DATA SETS

                      Classifiers
Datasets   ML1     ML2     ML3     ML4     ML5     ML6
D1         83.07   75.38   73.84   72.30   56.92   52.30
D2         66.66   75.72   73.73   71.71   70.20   45.45
D3         83.00   54.00   54.00   77.00   46.00   59.00
D4         61.93   62.53   62.53   64.04   56.79   53.47
D5         74.56   74.56   73.98   73.41   68.78   43.35
D6         72.16   68.86   63.20   58.49   60.37   48.11


Computation of Ranks

                        Classifiers
Datasets       ML1    ML2    ML3   ML4        ML5        ML6
D1             1      2      3     4          5          6
D2             5      1      2     3          4          6
D3             1      4.5    4.5   2          6          3
D4             4      2.5    2.5   1          5          6
D5             1.5    1.5    3     4          5          6
D6             1      2      3     5          4          6
Sum of Ranks   13.5   13.5   18    19         29         33
Mean Rank      2.25   2.25   3     3.166667   4.833333   5.5
Calculate Friedman test statistic χ2
• Using the formula in Step 3, the calculated χ2 = 15.88
• Degrees of freedom is k − 1 = 5
• The table χ2 for 5 DOF at α = 0.05 significance is 11.070.
• χ2 (calculated or observed) > χ2 (table critical value)
• Thus, the null hypothesis is rejected.
• Conclusion: At least two classifiers have significantly different performance.
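A sketch of the same comparison with scipy (assuming scipy is available; scipy applies a tie correction, so its statistic can differ slightly from the hand-computed 15.88):

```python
# Sketch: Friedman test on the performance of six classifiers over six data sets.
from scipy import stats

# Each list holds one classifier's performance on data sets D1-D6 (from the table above).
ml1 = [83.07, 66.66, 83.00, 61.93, 74.56, 72.16]
ml2 = [75.38, 75.72, 54.00, 62.53, 74.56, 68.86]
ml3 = [73.84, 73.73, 54.00, 62.53, 73.98, 63.20]
ml4 = [72.30, 71.71, 77.00, 64.04, 73.41, 58.49]
ml5 = [56.92, 70.20, 46.00, 56.79, 68.78, 60.37]
ml6 = [52.30, 45.45, 59.00, 53.47, 43.35, 48.11]
alpha = 0.05

chi2, p_value = stats.friedmanchisquare(ml1, ml2, ml3, ml4, ml5, ml6)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```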
Nemenyi test
• The Nemenyi test is a post-hoc test of the Friedman
test, applied if the null hypothesis is rejected.
• It will compare all classifiers with each other.
• The critical difference (CD) of the Nemenyi test is
calculated as CD = qα × √(k(k + 1)/(6 × n)),
where k is the number of classifiers and n is the number of datasets.
• The critical value qα is the studentized range statistic
divided by √2, and α is the significance level.
• The performance difference between the two
classifiers is significant if the difference of average
ranks between them is larger than the value of CD.
Table for Nemenyi Test
• For six classifiers, CD = 2.85 × √((6 × 7)/(6 × 6)) = 3.078.
• Thus, the following classifier pairs are significantly
different (their mean rank difference of 5.5 − 2.25 = 3.25 exceeds the CD of 3.078):
• ML1-ML6
• ML2-ML6
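A minimal sketch of this post-hoc check (using only the Python standard library; q_alpha = 2.85 is taken from the Nemenyi table for k = 6 at α = 0.05, and the mean ranks are from the table above):

```python
# Sketch: Nemenyi critical difference and pairwise comparison of mean ranks.
from itertools import combinations
import math

mean_ranks = {"ML1": 2.25, "ML2": 2.25, "ML3": 3.0,
              "ML4": 3.17, "ML5": 4.83, "ML6": 5.5}
k, n = 6, 6          # number of classifiers, number of data sets
q_alpha = 2.85       # Nemenyi critical value for k = 6, alpha = 0.05 (from the table)

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))   # critical difference, about 3.08
for (a, ra), (b, rb) in combinations(mean_ranks.items(), 2):
    if abs(ra - rb) > cd:
        print(f"{a} vs {b}: |mean rank difference| = {abs(ra - rb):.2f} > CD = {cd:.2f}")
```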
Chi-Square Test
• Study from pdf provided
