Quantitative Data Coding & Analysis
International Webinar
Dr Rajesh G Konnur
Director / Pro V C
ERC / ERU
In this webinar-
- Concept & definition of Data Coding
Quantitative Data Coding:
Coding Process:
1. Variable Identification: Before coding, identify the variables you want
to study. These are the characteristics or attributes that will be measured or
observed.
Cont…
4. Data Entry: The coded data is entered into a database or statistical
software for analysis. This could involve manual data entry or the use of
automated tools.
1. Continuous Data:
• Continuous data can take any value within a given range and can be measured with
great precision.
• Examples: Age, height, weight, temperature, income, and time are typical
examples of continuous data.
• Examples: Gender (male, female), eye color, ethnicity, and types of cars are examples of
nominal data.
• Measurement: Nominal data are often used for classification purposes, and the categories
have no numerical significance.
4. Ordinal Data:
• Ordinal data have categories with a meaningful order or ranking, but the intervals between the
categories are not consistent or measurable.
• Examples: Educational levels (e.g., high school, college, graduate school), socioeconomic
status, and customer satisfaction ratings (e.g., very dissatisfied, dissatisfied, neutral, satisfied,
very satisfied) are examples of ordinal data.
• Measurement: The order matters, but the differences between the categories are not
standardized or quantifiable.
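As a minimal sketch of how nominal and ordinal categories might be coded, the example below maps text responses to numbers. The codebooks, the code values and the missing-value code 99 are illustrative assumptions, not values given in the webinar.

```python
# Hypothetical codebooks: the numeric codes are illustrative assumptions.
GENDER_CODES = {"male": 1, "female": 2}  # nominal: codes are labels only
SATISFACTION_CODES = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}  # ordinal: codes preserve the rank order of the categories


def code_responses(responses, codebook, missing_code=99):
    """Map raw text responses to numeric codes; unknown answers get missing_code."""
    return [codebook.get(r.strip().lower(), missing_code) for r in responses]


raw = ["Satisfied", " neutral", "Very Satisfied", "no answer"]
coded = code_responses(raw, SATISFACTION_CODES)
print(coded)  # [4, 3, 5, 99]
```

For nominal codes such as the gender codes above, the numbers only label categories, so a mean of the codes would not be meaningful.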
Interval Measurement:
In interval measurement, the intervals between consecutive values are equal and
meaningful, but there is no true zero point.
• Characteristics:
• The zero point is arbitrary and does not represent the absence of the measured
quantity.
• Characteristics:
• Zero represents a complete absence of the measured attribute.
• Ex: Height, weight, age, and income are often measured on a ratio scale. For
example, a person with a height of 180 cm is twice as tall as a person with a
height of 90 cm.
Data Analysis
Introduction:
• Data analysis is the process of cleaning, transforming and
processing raw data and extracting actionable, relevant
information that helps researchers make informed
decisions.
• Statistical analysis is the organisation and analysis of
quantitative or qualitative data using statistical procedures,
including both descriptive and inferential statistics.
• It’s the science of collecting, exploring and presenting
large amounts of data to discover underlying patterns
and trends.
Types of Data Analysis:
1) Descriptive Analysis:
- Used to summarize the data for better understanding
- Used to describe the variables
2) Exploratory Analysis:
- Used to discover patterns & relationships in data
- Used to identify correlations or create predictive models
3) Predictive Analysis:
- Used to make predictions about future outcomes
4) Prescriptive Analysis:
- Used to suggest the best courses of action
Data Analysis Process:
Data Requirement Gathering: Decide what type of data you want to use & what
data you plan to analyze.
Data Collection: Collect data from varied sources, e.g. case studies,
surveys, interviews, questionnaires and observation.
Data Cleaning: Not all of the data you collect will be useful, so clean it
up. Remove white spaces, duplicate records & basic errors.
Data Analysis: Use software to interpret & understand the data & arrive at
conclusions, e.g. Excel, Python, R, Looker, RapidMiner, Chartio, Metabase,
Redash & Microsoft Power BI.
Data Interpretation: After obtaining results, interpret them & come up with the
best courses of action based on the findings.
Data Visualization: Show information graphically using charts,
graphs, maps, bullet points or a host of other methods. It yields valuable
insights by helping to compare datasets & observe relationships.
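The data cleaning step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (ID, gender, age as text fields) and the helper name `clean_records` are hypothetical.

```python
def clean_records(records):
    """Strip surrounding white space, drop empty rows and exact duplicates.

    The original order of the first occurrences is kept.
    """
    seen = set()
    cleaned = []
    for row in records:
        stripped = tuple(field.strip() for field in row)
        if not any(stripped):   # every field is empty: skip the row
            continue
        if stripped in seen:    # duplicate of an earlier record: skip
            continue
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


# Hypothetical raw survey rows: (respondent ID, gender, age)
raw_rows = [
    (" A01 ", "female", "23"),
    ("A01", "female", "23"),    # duplicate once white space is removed
    ("", "", ""),               # empty row
    ("A02", " male", "31 "),
]
clean_rows = clean_records(raw_rows)
print(clean_rows)
```

Stripping white space before checking for duplicates matters here, because the first two rows only match once the padding is removed.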
Elements of Statistical Analysis:
Exploratory
If the problem requires data to
be classified or a pattern to be determined
Statistics
STATISTICAL MEASURES
Mean
Mode
Median
Interquartile Range
Standard Deviation.
MEAN
Example of Mean :
MEDIAN
• When all the observations are arranged in ascending or descending
order of magnitude, the middle one is the median.
• For raw data, if n is the total number of observations, the value of the
$\frac{n+1}{2}$th item is called the median.
• If n is an even number, the mean of the $\frac{n}{2}$th item and the
$\left(\frac{n}{2} + 1\right)$th item is the median.
MODE
• The mode is the value of a series that appears more frequently than
any other.
• For grouped data,
Mode, $M_0 = L_0 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c$
where $L_0$ is the lower limit of the modal class,
$c$ is the class interval,
$\Delta_1$ is the difference between the modal frequency and the preceding class
frequency,
$\Delta_2$ is the difference between the modal frequency and the following class
frequency.
Example: the mode of the data 80, 90, 86, 80, 72, 80, 96 is 80.
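As a quick check of the three measures of central tendency, the sketch below applies Python's standard `statistics` module to the seven values from the mode example. The variable names are illustrative.

```python
import statistics

scores = [80, 90, 86, 80, 72, 80, 96]  # data from the mode example

mean_value = statistics.mean(scores)      # sum of values / number of values
median_value = statistics.median(scores)  # middle value of the sorted data
mode_value = statistics.mode(scores)      # most frequent value

# Manual check of the (n + 1) / 2 rule for an odd number of observations
ordered = sorted(scores)
n = len(ordered)
middle_item = ordered[(n + 1) // 2 - 1]   # 4th item when n = 7

print(round(mean_value, 2), median_value, mode_value, middle_item)
# 83.43 80 80 80
```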
INTERQUARTILE RANGE
• IQR = Q3 − Q1.
Example
• Q1 = $\left[\frac{n+1}{4}\right]$th item = 1.5th item = 20 + 0.5 × (30 − 20) = 25
• Q3 = $3\left[\frac{n+1}{4}\right]$th item = 4.5th item = 50 + 0.5 × (60 − 50) = 55
• IQR = Q3 − Q1 = 55 − 25 = 30
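The quartile positions used above can be sketched as follows. The data set itself is not shown in the text, so the list below is an assumption: the values 20, 30, 50 and 60 follow from the worked figures, while the middle value 40 is a guess that does not affect Q1, Q3 or the IQR.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    low_val = sorted_values[lower - 1]
    high_val = sorted_values[lower]
    return low_val + fraction * (high_val - low_val)


data = sorted([20, 30, 40, 50, 60])  # assumed data; 40 is a placeholder value
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)      # 1.5th item
q3 = value_at_position(data, 3 * (n + 1) / 4)  # 4.5th item
iqr = q3 - q1

print(q1, q3, iqr)  # 25.0 55.0 30.0
```

Statistical packages use several slightly different quartile conventions, so their output may differ from this (n + 1)/4 rule on other data.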
STANDARD DEVIATION
EXAMPLE :
TYPES
PARAMETRIC STATISTICAL ANALYSIS
NONPARAMETRIC STATISTICAL ANALYSIS
2. Ordinal or ranked data: Non-parametric tests are suitable for analyzing data that can
only be ranked, such as Likert scale data or data involving ratings. Parametric tests
require interval or ratio data.
Cont…
3. Small sample size: When the sample size is small, non-parametric tests can be
more robust. Parametric tests may not be reliable with small sample sizes.
6. Categorical data: Non-parametric tests are also used when the data are inherently
categorical, such as in the case of frequency counts or proportions.
EXPLORATORY DATA ANALYSIS AND CONFIRMATORY DATA ANALYSIS
• John Tukey
Student's t-test :
• Developed by Prof. W. S. Gosset
• Student's t-test is used to test the null hypothesis that there is no difference between
the means of the two groups.
• One-sample t-test:
• t calculated = 2.981
• t table value = 2.093
• t calculated > t table value; reject H0.
Independent Two Sample T Test (the unpaired t-test)
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
Example
The Hb levels of 5 males are 10, 11, 12.5, 10.5 and 12, and of 5 females are 10, 17.5, 14.2, 15 and
14.1. Test whether there is any significant difference between the Hb values.
• H0: There is no significant difference between the Hb levels.
• H1: There is a significant difference between the Hb levels.
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
X1       X2      X1 - x̄1   X2 - x̄2   (X1 - x̄1)²   (X2 - x̄2)²
10       10      -1.2      -4.16     1.44         17.305
11       17.5    -0.2       3.34     0.04         11.156
12.5     14.2     1.3       0.04     1.69          0.0016
10.5     15      -0.7       0.84     0.49          0.706
12       14.1     0.8      -0.06     0.64          0.0036
Σ = 56   70.8                        4.3          29.172
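The table can be carried through to the t statistic with a short sketch of the pooled-variance formula. The function name `pooled_t` is illustrative; the data are the Hb values from the example, whose means are 56 / 5 = 11.2 and 70.8 / 5 = 14.16.

```python
import math


def pooled_t(sample1, sample2):
    """Independent two-sample t statistic with pooled variance, and its df."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sums of squared deviations; each equals (n - 1) * S^2 for its group
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, n1 + n2 - 2


male = [10, 11, 12.5, 10.5, 12]
female = [10, 17.5, 14.2, 15, 14.1]

t_stat, df = pooled_t(male, female)
print(round(t_stat, 3), df)  # -2.288 8
```

With 8 degrees of freedom, standard t tables give a two-tailed 5% value of 2.306, so on this calculation |t| = 2.288 falls just short of it and H0 would not be rejected at that level.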
The paired t-test
$t = \frac{\bar{d}}{S_d / \sqrt{n}}$
where $\bar{d}$ is the mean difference, $S_d$ denotes the standard deviation of the
differences and n is the number of pairs.
Example
Generally, z-tests are used when we have large sample sizes (n > 30),
whereas t-tests are most helpful with a smaller sample size (n < 30).
Both methods assume a normal distribution of the data, but the z-tests
are most useful when the standard deviation is known.
z = (x̄ – μ) / (σ / √n)
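A minimal sketch of the z formula follows. The numbers are hypothetical, chosen only to make the arithmetic easy to follow: a sample mean of 52 from n = 36 observations, a population mean of 50 and a known standard deviation of 6.

```python
import math


def z_statistic(sample_mean, population_mean, population_sd, n):
    """One-sample z statistic when the population standard deviation is known."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Hypothetical values, not taken from the webinar
z = z_statistic(sample_mean=52.0, population_mean=50.0, population_sd=6.0, n=36)
print(z)  # 2.0
```

Here the standard error is 6 / √36 = 1, so z = 2.0, which exceeds the usual two-tailed 5% cut-off of 1.96.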
ANALYSIS OF VARIANCE (ANOVA)
• Developed by R. A. Fisher.
• The Student's t-test cannot be used for comparison of three or more groups.
• The purpose of ANOVA is to test if there is any significant difference between the
means of two or more groups.
• The analysis of variance is the systematic algebraic procedure of decomposing the
overall variation in the responses observed in an experiment into its component sources of variation.
• Two variances are compared: (a) between-group variability, the variation existing
between the samples, and (b) within-group variability, the variation existing within the samples.
• The within-group variability (error variance) is the variation that cannot be
accounted for in the study design.
• The between-group variability (effect variance) is the result of the treatment.
• A simplified formula for the F statistic is
$F = \frac{MST}{MSE}$
where MST is the mean squares between the groups and MSE is the
mean squares within groups.
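The ratio of the two mean squares can be sketched for a one-way layout as below. The three small groups are made-up numbers for illustration, and the function name `one_way_f` is hypothetical.

```python
def one_way_f(groups):
    """F = MST / MSE for a one-way ANOVA on a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group (effect) sum of squares
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group (error) sum of squares
    ss_within = sum(
        sum((x - m) ** 2 for x in g) for g, m in zip(groups, group_means)
    )

    mst = ss_between / (k - 1)       # mean squares between groups
    mse = ss_within / (n_total - k)  # mean squares within groups
    return mst / mse


# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_f(groups)
print(f_stat)  # 3.0
```

For these numbers the between-group sum of squares is 6 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 3 / 1 = 3.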
CHI-SQUARE TEST
• The Chi-square test compares the frequencies and tests whether the
observed data differ significantly from that of the expected data.
Example
Vaccinated 10 90
Not vaccinated 26 74
• H0: There is no significant association between vaccination and attack
of measles.
Oi    Ei    Oi - Ei   (Oi - Ei)²   (Oi - Ei)² / Ei
10    18    -8        64           3.556
90    82     8        64           0.780
26    18     8        64           3.556
74    82    -8        64           0.780
                                   Σ = 8.672
• Chi-square table value = 3.841, chi-square calculated value = 8.672
• χ² calculated > χ² table value; reject H0.
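The expected frequencies and the chi-square value above can be reproduced with a short sketch. The function name `chi_square` is illustrative; the observed counts are the 2 × 2 table from the example.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count = row total x column total / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


observed = [
    [10, 90],  # vaccinated
    [26, 74],  # not vaccinated
]
chi2 = chi_square(observed)
print(round(chi2, 3))  # 8.672
```

The expected counts come out as 18 and 82 in each row, matching the Ei column, and the statistic exceeds the table value of 3.841 quoted for this example.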
THE WILCOXON'S SIGNED RANK TEST:
• Wilcoxon's rank sum test ranks all data points in order, calculates the rank
sum of each sample and compares the difference in the rank sums.
Method:
• H0: There is no difference in the paired values, on average, between the two
groups.
• H1: There is a difference in the paired values, on average, between the two
groups.
• Compute the difference between each pair of values in the two groups.
• Rank the differences from the smallest, without considering the sign of the difference.
• After assigning ranks, attach the corresponding sign to each rank.
• Compute T+ (sum of the ranks with positive sign) and T- (sum of the ranks with negative
sign). Wstat is the smaller of T+ and T-.
• Find the W critical value from the Wilcoxon signed rank table.
• If Wstat < W critical value, reject H0.
EXAMPLE:
Before 40 60 55 65 43 70 80 60
After 50 80 50 70 40 60 90 85
• H0: There is no difference in the paired values.
• H1: There is difference in the paired values.
Before                40    60    55    65    43    70    80    60
After                 50    80    50    70    40    60    90    85
Difference           -10   -20     5    -5     3    10   -10   -25
Absolute difference   10    20     5     5     3    10    10    25
• T+ = 8.5, T- = 27.5. T = 8.5
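The ranking steps of the signed rank method can be sketched as follows, using the before and after values of the example. The helper `average_ranks` is an illustrative name; it gives tied values the mean of the ranks they occupy. Note that 60 − 85 = −25 for the last pair, which does not change T+ or T-.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


before = [40, 60, 55, 65, 43, 70, 80, 60]
after = [50, 80, 50, 70, 40, 60, 90, 85]

# Differences (before - after); zero differences would be dropped
diffs = [b - a for b, a in zip(before, after) if b != a]
ranks = average_ranks([abs(d) for d in diffs])

t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
w_stat = min(t_plus, t_minus)

print(t_plus, t_minus, w_stat)  # 8.5 27.5 8.5
```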
• Assuming a normal distribution for the differences, the test statistic is
$Z = \frac{|T - m| - 0.5}{SD}$, where $m = \frac{n(n+1)}{4}$ and
$SD = \sqrt{\frac{n(n+1)(2n+1)}{24}}$
• If Z is less than 1.96, H0 is accepted, and if Z > 1.96, H0 is rejected.
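The normal approximation can be sketched for the example above (n = 8 pairs, T = 8.5). The text does not define m, so this sketch assumes the usual mean of the signed rank statistic, m = n(n + 1)/4.

```python
import math


def signed_rank_z(t_stat, n):
    """Normal approximation to the Wilcoxon signed rank statistic."""
    m = n * (n + 1) / 4  # assumed mean of T under H0 (not stated in the text)
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (abs(t_stat - m) - 0.5) / sd  # 0.5 is the continuity correction


z = signed_rank_z(t_stat=8.5, n=8)
print(round(z, 2))  # 1.26
```

Here m = 18 and SD = √51, so Z is about 1.26, which is below 1.96. With only 8 pairs the approximation is rough, so the table-based comparison is usually preferred.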
MANN-WHITNEY U TEST
METHOD of calculation:
H0: The average values in the two groups are the same.
H1: The average values in the two groups are different.
• Let n1 be the sample size of one group and n2 the sample size of the second
group. Rank all the values in the two groups taken together. Tied values
should be given the same rank.
• The rank sum of each group is taken and U is calculated for that group using
U = Rank Sum - {n(n + 1)/2}, where n is the sample size of that group.
• Both U1 and U2 are calculated and the smaller value is taken as Ustat. The
Ucritical value is read from the Mann-Whitney U test table.
• If Ustat < Ucritical, reject H0.
Example
Treatment A Treatment B
3 9
4 7
2 5
6 10
2 6
5 8
• H0: The average values in the 2 treatment are the same.
Ranks 1 2 3 4 5 6 7 8 9 10 11 12
Values 2 2 3 4 5 5 6 6 7 8 9 10
• UA = 23 - 21 = 2 and UB = 55 - 21 = 34, so Ustat = 2 (the lower value).
• Ucritical = 5
• Ustat < Ucritical value; reject H0.
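The rank sums and U values of this example can be reproduced with the sketch below. The helper `average_ranks` is an illustrative name; tied values (the two 2s, 5s and 6s) receive the mean of the ranks they occupy, which is how the rank sums 23 and 55 arise.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


treatment_a = [3, 4, 2, 6, 2, 5]
treatment_b = [9, 7, 5, 10, 6, 8]
n_a, n_b = len(treatment_a), len(treatment_b)

# Rank all values from both groups taken together
ranks = average_ranks(treatment_a + treatment_b)
rank_sum_a = sum(ranks[:n_a])
rank_sum_b = sum(ranks[n_a:])

# U = rank sum - n(n + 1)/2 for each group
u_a = rank_sum_a - n_a * (n_a + 1) / 2
u_b = rank_sum_b - n_b * (n_b + 1) / 2
u_stat = min(u_a, u_b)

print(rank_sum_a, rank_sum_b, u_a, u_b, u_stat)  # 23.0 55.0 2.0 34.0 2.0
```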
• Assuming that the ranks are randomly distributed in the two groups,
the test statistic is
$Z = \frac{|m - T| - 0.5}{SD}$
where T = smaller of T1 and T2,
T1 = sum of the ranks of the smaller group, T2 = {(n1 + n2)(n1 + n2 + 1) / 2} - T1,
$SD = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$
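The Kruskal-Wallis test statistic is
$H = \frac{12}{n(n+1)} \sum_i \frac{R_i^2}{n_i} - 3(n+1)$
where n is the total of the sample sizes in all the groups, $n_i$ is the sample size
of the ith group and $R_i$ is the sum of the ranks in the ith group.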
Method:
H0: The average values in the different groups are the same.
H1: The average values in the different groups are different.
• Rank all the values, taking all the groups together.
Example
• H0: The average values in the three groups are the same.
• H1: The average values in the three groups are different.
Rank        1    2    3    4    5    6    7    8    9   10   11    12    13    14    15   16   17   18
Values      8    8    9    9    9    9   10   10   11   12   13    13    13    13    14   15   16   17
Tied rank   1.5  1.5  4.5  4.5  4.5  4.5  7.5  7.5  9   10   12.5  12.5  12.5  12.5  15   16   17   18
The four 9s occupy ranks 3 to 6, so each receives the tied rank (3 + 4 + 5 + 6) / 4 = 4.5.
Sample 1   Rank 1   Sample 2   Rank 2   Sample 3   Rank 3
8          1.5      10         7.5      13         12.5
10         7.5      9          4.5      8          1.5
9          4.5      13         12.5     9          4.5
12         10       14         15       13         12.5
11         9        9          4.5      17         18
13         12.5     16         17       15         16
           Σ = 45              Σ = 61              Σ = 65
• $H = \frac{12}{18 \times 19}\left[\frac{45^2}{6} + \frac{61^2}{6} + \frac{65^2}{6}\right] - 3 \times 19$
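The H statistic for the three samples can be sketched as below. The helper `average_ranks` is an illustrative name; it gives tied values the mean of the ranks they occupy, so the four 9s (ranks 3 to 6) each receive 4.5 and the rank sums come to 45, 61 and 65. No correction for ties is applied, in line with the formula used in the example.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


samples = [
    [8, 10, 9, 12, 11, 13],   # Sample 1
    [10, 9, 13, 14, 9, 16],   # Sample 2
    [13, 8, 9, 13, 17, 15],   # Sample 3
]

# Rank all values, taking the groups together
pooled = [x for sample in samples for x in sample]
ranks = average_ranks(pooled)
n = len(pooled)

# Split the pooled ranks back into per-group rank sums
rank_sums = []
position = 0
for sample in samples:
    rank_sums.append(sum(ranks[position:position + len(sample)]))
    position += len(sample)

h_stat = 12 / (n * (n + 1)) * sum(
    r ** 2 / len(s) for r, s in zip(rank_sums, samples)
) - 3 * (n + 1)

print(rank_sums, round(h_stat, 2))  # [45.0, 61.0, 65.0] 1.31
```

The rank sums add up to 171 = 18 × 19 / 2, which is a useful check that the tied ranks have been assigned correctly.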
TESTING NULL HYPOTHESIS
• When attempting to determine if an outcome is related to a cause, it is necessary
to know if the outcomes or results could have occurred by chance alone.
• This cannot be done with certainty, but researchers can determine the probability
that the hypothesis is true.
• Using a null hypothesis enables the researcher to quantify and report the
probability that the outcome was due to random error.
DETERMINING THE PROBABILITY OF TYPE I AND TYPE II ERROR
• Before accepting the results as evidence for practice, however,
the probability that an error was made should be evaluated.
• On the other hand, when effect sizes are very small, then the
potential for effects from extraneous variables is more likely and
the results may have less credibility.
Ensuring data meet the fundamental assumptions of the statistical test
• Data analysis is based on many assumptions about the
nature of the data, the statistical procedures that are
used to conduct the analysis and the match between the
data and the procedure.
• Inaccurate conclusions lead to an error, which in
turn affects the validity of a study.
RESOURCES FOR STATISTICAL ANALYSIS PROGRAM
Pitfalls of Statistics:
• Pie chart
• Simple bar diagram
• Multiple bar diagram
• Component bar diagram or subdivided bar diagram
• Histogram
• Frequency polygon
• Frequency curve
• Ogive curve
• Scatter diagram
• Line diagram
• Pictogram
• Statistical maps
Bar charts
[Figure: bar chart "Year Wise Enrollment of Students in Government School"; x-axis categories: One to Nine; y-axis: No. of Students]
Multiple Bar Charts
• Also called compound bar charts.
• More than one sub-attribute of a variable can be expressed.
[Figure: multiple bar chart of Percentage of World Total for Population and Land, by region: Asia, Europe, Africa, Latin America, USSR, North America, Oceania]
Component bar charts
• When there are many categories on X-axis (more than 5) and they
have further subcategories, then to accommodate the categories, the
bars may be divided into parts, each part representing a certain item
and proportional to the magnitude of that particular item.
[Figure: bar chart "India: Growth of Population"; x-axis: Census Decades, 1901 to 2001; y-axis: Population in Million]
Cont…
[Figure: component bar chart with Female and Male segments for Pakistan, USA and Sweden]
Histogram
[Figure: histogram; x-axis: Serum Cholesterol, mg/dl, in class intervals from 161-170 to 251-260; y-axis: frequency]
Frequency polygon
[Figure: frequency polygon; x-axis: class intervals from 59-69 to 129-139; y-axis: percentage total frequency]
Line diagram
• Line diagrams are used to show the trend of events with the passage of time.
[Figure: line diagram showing the malaria cases reported throughout the world, excluding the African region, during 1972-78; x-axis: year; y-axis: Cases]
Pie charts
[Figure: pie chart "World Population": Developed Countries 26%, Developing Countries 74%]
Pictogram
A scatter diagram is useful:
• When your dependent variable may have multiple values for each value of
your independent variable.
• When trying to determine whether two variables are related.
For example, by plotting two variables, test cases and employees by designation, in a scatter
diagram, it is easy to compare the productivity of employees.
[Figure: scatter diagram; x-axis: Employees (T, JTE, STE, TL, M); y-axis: No of Test Cases; labelled points: Trainee 20, Senior Test Engineer 35, Test lead 60]
Summary & Conclusion:
Thanks