Initial Data Analysis

STATISTICS MODULE

Uploaded by

Baby Shark
Copyright © All Rights Reserved
Initial Data Analysis (IDA)

INITIAL DATA ANALYSIS


Initial data analysis (IDA) is done to process the data and assess its
quality before any further analysis is undertaken. It is crucial to be aware of,
and confident in, the integrity of your data: all recording, encoding, and
processing errors should have been identified and remedied.

[Diagram: the stages of IDA are Data Processing, Data Scrutiny and Cleaning, and Data Description.]
Data Quality

• Accuracy. All entries must be correct.
• Completeness. Let your data be as complete as possible.
• Reliability. Reliable data means that variables should conform to, not contradict, other related variables.
• Timeliness. All entries should be up to date, especially for variables that are constantly changing and when the observation/analysis is longitudinal.
• Relevance. Data might be accurate and complete, but some variables may not be relevant to a particular analysis/research inquiry.
• Availability. The data must always be organized, useful, and readily available for possible analysis in the future.
[Diagram: Data Processing, together with Validity and Reliability (if survey questionnaire) and Ethical Consideration.]
Data Processing
involves coding and entry of the data into a dataset with a format suitable for further data analysis. Each variable must be numerically coded.

Things to settle and record when coding:
• Number of observations/respondents
• Number of variables
• Scale of measurement for each variable
• Labels/categories/characteristics that each numerical code represents for nominal and ordinal variables. The numerical codes must be consistent.
• For ratio and interval variables, be consistent in the number of decimal places of the values.
• Provide codes for missing values (no value recorded), out-of-range values (a value recorded but known to be impossible), and “don’t know” and “not applicable” responses.
• Make sure that missing values are recorded as no value, not as zeroes. A genuine zero is a value and should still be recorded as zero, especially for interval and ratio variables. Hence, zero should not be used as the code for missing values. Some may keep a missing value as an empty cell (no value encoded).
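The coding rules above can be sketched in Python. This is only an illustration under stated assumptions: the special codes (-99, -98, -97, -96), the `decode` helper, and the sample variable are hypothetical and do not come from the module.

```python
# Hypothetical special codes; the module does not prescribe particular numbers.
SPECIAL_CODES = {
    -99: "missing (no value recorded)",
    -98: "out of range (recorded but impossible)",
    -97: "don't know",
    -96: "not applicable",
}


def decode(value):
    """Return (clean_value, status). Special codes become None; 0 stays a real 0."""
    if value is None:  # an empty cell is also treated as missing
        return None, SPECIAL_CODES[-99]
    if value in SPECIAL_CODES:
        return None, SPECIAL_CODES[value]
    return value, "valid"


# Number of children per respondent: 0 is a genuine value, not a missing code.
children_raw = [2, 0, -99, 3, -97, None, 0]
decoded = [decode(v) for v in children_raw]
children = [value for value, _ in decoded]
n_valid = sum(status == "valid" for _, status in decoded)

print(children)  # [2, 0, None, 3, None, None, 0]
print(n_valid)   # 4
```

Keeping the status next to each value lets "don't know" and "not applicable" be counted separately later instead of being lumped together with blank cells.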
Data Scrutiny and Cleaning

involves checking the quality and structure of the data and correcting any errors due to recording and processing.

How to handle…
• MISSING DATA
• OUTLIERS
Data Scrutiny and Cleaning
MISSING DATA
If missing data are scattered at random throughout the cases/observations/respondents and variables, either estimate the missing values or delete the cases/observations/respondents or the particular variables that have missing data.

Deleting cases/observations/respondents is advised only when just a few of them have missing data.

Dropping variables but retaining cases/observations/respondents is an alternative, but it is generally suitable only when the variable is not critical to the analysis.

If the missing values seem non-random, they should be retained and investigated.

Most of the time, missing values in research arise because entries were deleted unintentionally, hence the data must be revalidated.

If a survey questionnaire was used, one must go back and check whether the missing values are indeed “no responses”. For this purpose, make sure that each questionnaire is marked with a control code/number that is also reflected in the encoded data.
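The deletion options above might look like this in plain Python. It is a sketch only: the records, the variable names, and the helper functions are made up for illustration, with `None` standing for an empty cell.

```python
# Illustrative records; control numbers link each row back to its questionnaire.
rows = [
    {"control_no": 1, "age": 21, "income": 300, "height": 160},
    {"control_no": 2, "age": 34, "income": None, "height": 172},
    {"control_no": 3, "age": 29, "income": 450, "height": None},
    {"control_no": 4, "age": 41, "income": 520, "height": 168},
]


def missing_report(rows):
    """Count missing cells per variable so the pattern can be inspected first."""
    return {k: sum(r[k] is None for r in rows) for k in rows[0]}


def drop_cases(rows):
    """Delete cases: keep only respondents with no missing cell."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_variable(rows, name):
    """Drop one non-critical variable but retain every case."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]


report = missing_report(rows)
complete_cases = drop_cases(rows)
# Control numbers of the questionnaires that should be rechecked.
to_recheck = [r["control_no"] for r in rows if r not in complete_cases]
without_height = drop_variable(rows, "height")

print(report)      # {'control_no': 0, 'age': 0, 'income': 1, 'height': 1}
print(to_recheck)  # [2, 3]
```

Listing the control numbers before deleting anything makes it possible to revalidate the paper questionnaires first.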
Data Scrutiny and Cleaning
MISSING DATA
DATA IMPUTATION: substituting missing values with the “best estimates” is another alternative.
• Best guess: not advisable.
• The overall mean for that particular variable: not suitable for binary/dichotomous variables (e.g. sex, Yes and No responses, etc.) or for nominal and ordinal data in general.
• A relevant group mean: not suitable for binary/dichotomous variables or for nominal and ordinal data in general.
• A regression equation based on complete data to predict missing values: use only if other variables are likely to predict the variable that has missing values.
• Or a generalized approach based on the likelihood function: it is sophisticated and generally not necessary, as it uses the iterative two-step expectation-maximization (E-M) algorithm to derive maximum likelihood estimates for incomplete values.
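Three of the options above (overall mean, group mean, and a regression equation) can be sketched as follows. The data and variable names are invented for illustration, and the regression is a plain one-predictor least-squares fit; the E-M approach is not shown.

```python
from statistics import mean

# Illustrative data: weight (kg) has gaps; height (cm) and group are complete.
group = ["A", "A", "A", "B", "B", "B"]
height = [150, 160, 170, 160, 170, 180]
weight = [50.0, None, 60.0, 65.0, 70.0, None]

observed = [w for w in weight if w is not None]

# 1) Overall mean of the variable.
overall = mean(observed)
by_overall = [overall if w is None else w for w in weight]

# 2) Mean of the relevant group.
group_means = {
    g: mean(w for w, gg in zip(weight, group) if gg == g and w is not None)
    for g in set(group)
}
by_group = [group_means[g] if w is None else w for w, g in zip(weight, group)]

# 3) Least-squares regression of weight on height, fitted on complete cases only.
pairs = [(h, w) for h, w in zip(height, weight) if w is not None]
h_bar = mean(h for h, _ in pairs)
w_bar = mean(w for _, w in pairs)
slope = sum((h - h_bar) * (w - w_bar) for h, w in pairs) / sum(
    (h - h_bar) ** 2 for h, _ in pairs
)
intercept = w_bar - slope * h_bar
by_regression = [
    intercept + slope * h if w is None else w for w, h in zip(weight, height)
]

print(by_overall)  # gaps filled with 61.25
print(by_group)    # gaps filled with 55.0 (group A) and 67.5 (group B)
```

The three methods fill the same two gaps with different values, which is a reminder that the imputation method should be reported along with the results.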
Data Scrutiny and Cleaning
OUTLIERS
Outliers are seemingly spurious values that are extreme but plausible: they are possible, but not consistent with the remaining data.

They are more problematic than gross errors and need to be checked carefully.

An outlier may be an error or a valid and influential observation.

If the raw data have been checked and recording, transcription, and typing errors have been eliminated, then the analyses can be done with and without the outliers. Provided that the interpretation of the findings is not radically different between the two analyses, it is not crucial whether or not the outliers are counted as valid values.

When errors have been ruled out in this way, the outlier is not an error but an influential observation. However, one should interpret such data with care, as the influential observations may represent a different population from the majority of observations.
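Running an analysis with and without the outliers can be sketched as below. The values are made up, and the flagging rule (Tukey's 1.5 × IQR fences) is a common convention assumed here for illustration; the slides do not prescribe a rule.

```python
import statistics as st

# Made-up measurements with one suspicious extreme value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 58]

# Assumed flagging rule (not from the slides): Tukey's 1.5 x IQR fences.
q1, _, q3 = st.quantiles(values, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]
kept = [v for v in values if low_fence <= v <= high_fence]

# Run the same summary with and without the flagged values.
with_outliers = {"mean": st.mean(values), "median": st.median(values)}
without_outliers = {"mean": st.mean(kept), "median": st.median(kept)}

print(outliers)          # [58]
print(with_outliers)     # {'mean': 20.4, 'median': 16.5}
print(without_outliers)  # the mean drops to about 16.2; the median is 16
```

Here the mean moves noticeably while the median barely changes, so a conclusion based on the mean would need the more careful interpretation described above.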
Data Description

involves summary [measures] and display of the main characteristics of the data distribution.

The main purpose of data description is to present essential and important features of the data, usually in tables (summary measures), graphs, and charts (data presentations).
Data Description

• Detecting OUTLIERS? Box and Whisker Plot
• Examining the DISTRIBUTION? Histograms
• Detecting OUTLIERS and examining the DISTRIBUTION at the same time? Stem and Leaf Plot
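A stem and leaf plot is simple enough to build by hand. The sketch below uses the 30 ungrouped values from this module's example; the `stem_and_leaf` helper is a hypothetical name, and it assumes tens as stems and units as leaves.

```python
from collections import defaultdict

# The 30 ungrouped values from the module's worked example.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]


def stem_and_leaf(values):
    """Return {stem: sorted leaves} with tens as stems and units as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)


plot = stem_and_leaf(data)
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")

# Printed display:
# 0 | 6678889
# 1 | 00012336666889
# 2 | 01344778
# 3 | 0
```

The row lengths show the shape of the distribution, while the individual leaves keep every raw value visible, so an isolated extreme value would stand out on its own row.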
DESCRIPTIVE STATISTICS
Summary Measures
Graphical Presentation
SUMMARY MEASURES
• Measures of Central Tendency: as the name suggests, these show the center point of the data.
• Measures of Variability/Dispersion: these describe the amount of variability or spread of the values around the central tendency in a set of data.
• Measures of Position/Location: these indicate the relative standing (position) of a particular value relative to the entire data set.
• Measures of Skewness and Kurtosis: these measure the degree of asymmetry and ‘peakedness’ of a frequency distribution.
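Ungrouped data (n = 30)

16  18  12
30   7  13
 8  10  10
 9  18  11
24  28  23
16  20  27
13  21  16
10   6  19
24  16   6
 8   8  27

Grouped data

CI        f    TCB            CM (xi)   RF(%)   <cF   >cF   Rank
6 - 10    10   5.5 – 10.5     8         33.33   10    30    5
11 - 15   4    10.5 – 15.5    13        13.33   14    20    (3) 2
16 - 20   8    15.5 – 20.5    18        26.67   22    16    4
21 - 25   4    20.5 – 25.5    23        13.33   26    8     (2) 2
26 - 30   4    25.5 – 30.5    28        13.33   30    4     (1) 2

CI = class interval, f = frequency, TCB = true class boundaries, CM = class mark, RF = relative frequency, <cF and >cF = "less than" and "greater than" cumulative frequencies. Rank orders the frequencies from lowest to highest; the three classes tied at f = 4 occupy positions (1) to (3) and share the average rank 2.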
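The grouped table can be rebuilt from the raw values with a few lines of Python. This sketch assumes the last class is 26 - 30 (width 5, consistent with its class mark of 28); the dictionary keys simply mirror the column headers.

```python
# The 30 ungrouped values from the example.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]

# Assumed class limits: five classes of width 5.
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
n = len(data)

table = []
less_than_cf = 0
for lower, upper in classes:
    f = sum(lower <= x <= upper for x in data)
    less_than_cf += f
    table.append({
        "CI": f"{lower} - {upper}",
        "f": f,
        "TCB": (lower - 0.5, upper + 0.5),   # true class boundaries
        "CM": (lower + upper) / 2,           # class mark
        "RF": round(100 * f / n, 2),         # relative frequency in percent
        "<cF": less_than_cf,                 # cumulative from the lowest class
        ">cF": n - less_than_cf + f,         # cumulative from the highest class
    })

for row in table:
    print(row)
```

The frequencies come out as 10, 4, 8, 4, 4, with <cF ending at 30 and >cF starting at 30, which is a quick check that no value fell outside the classes.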
SUMMARY MEASURES
Visualization using Box and Whisker Plot
Summary measures of the 30 ungrouped values:
Mean     15.8
Median   16
Mode     16
StDev    7.20345
IQR      7.75

[Box and whisker plot of the ungrouped data, with labels for the highest value, 75th percentile, mean (x), median (line), mode, 25th percentile, lowest value, and the IQR.]
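The ingredients of a box and whisker plot can be computed with Python's standard `statistics` module, as sketched below. The quartile method (`"inclusive"`) is an assumption; quartile conventions differ between tools, so the computed IQR is not guaranteed to reproduce the figure printed on the slide.

```python
import statistics as st

# The 30 ungrouped values from the example.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]

summary = {
    "mean": st.mean(data),      # 15.8
    "median": st.median(data),  # 16.0
    "mode": st.mode(data),      # 16
    "stdev": st.stdev(data),    # sample SD (n - 1 divisor), about 7.2035
}

# Five-number summary: lowest value, Q1, median, Q3, highest value.
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # depends on the quartile convention chosen above

print(summary)
print(five_number)
```

The box spans Q1 to Q3, the line inside it is the median, and the whiskers reach toward the lowest and highest values.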
SUMMARY MEASURES
Visualization using Histogram and Frequency Polygon
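MEASURE OF SKEWNESS
MEASURE OF KURTOSIS

Summary measures of the grouped data:
Mean     16
Median   16.125
Mode     9
StDev    7.1438

[Histogram and frequency polygon of the grouped data; x-axis marked at the class boundaries 10.5, 15.5, 20.5, 25.5, 30.5.]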
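The grouped mean, median, and standard deviation, along with one way of measuring skewness and kurtosis, can be sketched from the class marks and frequencies. Two assumptions are made here: the last class is taken as 26 - 30, and skewness and kurtosis use the moment-based definitions with population moments, which is only one of several conventions.

```python
import math

# Class limits and frequencies of the grouped example (last class assumed 26 - 30).
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
freqs = [10, 4, 8, 4, 4]
n = sum(freqs)
marks = [(lo + hi) / 2 for lo, hi in classes]

# Grouped mean and sample standard deviation (n - 1 divisor) from the class marks.
mean = sum(f * x for f, x in zip(freqs, marks)) / n
ss = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks))
stdev = math.sqrt(ss / (n - 1))

# Grouped median by linear interpolation inside the median class.
cum = 0
for (lo, hi), f in zip(classes, freqs):
    if cum + f >= n / 2:
        median = (lo - 0.5) + (n / 2 - cum) / f * (hi - lo + 1)
        break
    cum += f

# Moment-based skewness and excess kurtosis (assumed convention).
m2 = ss / n
m3 = sum(f * (x - mean) ** 3 for f, x in zip(freqs, marks)) / n
m4 = sum(f * (x - mean) ** 4 for f, x in zip(freqs, marks)) / n
skewness = m3 / m2 ** 1.5         # positive: longer tail to the right
excess_kurtosis = m4 / m2 ** 2 - 3  # negative: flatter than a normal curve

print(mean, median)  # 16.0 16.125
print(skewness > 0, excess_kurtosis < 0)  # True True
```

For this example the skewness is positive and the excess kurtosis is negative, matching a histogram whose tallest bar sits at the lowest class and whose bars are otherwise fairly level.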
SUMMARY MEASURES
ORDINAL DATA
SUMMARY MEASURES
NOMINAL DATA
SUMMARY MEASURES
In relation to SCALE OF MEASUREMENTS
NOMINAL (e.g. Ethnicity)
  Appropriate summary statistics: Mode; no measure of variability
  Commonly used graphical presentation: Frequency and Percentage Distribution, Bar Graph

ORDINAL (e.g. Satisfaction Level)
  Appropriate summary statistics: Median, Mode; Range, Interquartile Range
  Commonly used graphical presentation: Frequency and Percentage Distribution, Column Chart/Graph

INTERVAL/RATIO (e.g. Weight)
  Appropriate summary statistics: Mean, Median, Mode; Range, Interquartile Range, Variance and Standard Deviation, Standard Error of the Mean
  Commonly used graphical presentation: Frequency Distribution Table (FDT), Histogram, Box and Whisker
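The pairing of scale and summary statistics can be expressed as a small helper. This is a sketch under assumptions: the `summarize` function, its return keys, and the sample values are hypothetical, ordinal categories are assumed to be stored as ordered numeric codes, and the standard error of the mean is taken as the sample SD divided by the square root of n.

```python
import math
import statistics as st
from collections import Counter


def summarize(values, scale):
    """Return summary statistics suited to the variable's scale of measurement."""
    n = len(values)
    if scale == "nominal":
        counts = Counter(values)
        return {
            "mode": st.mode(values),
            "percent": {k: round(100 * c / n, 1) for k, c in counts.items()},
        }
    if scale == "ordinal":  # ordered numeric codes, e.g. 1 = lowest level
        q1, _, q3 = st.quantiles(values, n=4)
        return {
            "median": st.median_low(values),  # always an actual category code
            "mode": st.mode(values),
            "range": max(values) - min(values),
            "iqr": q3 - q1,
        }
    if scale in ("interval", "ratio"):
        sd = st.stdev(values)
        return {
            "mean": st.mean(values),
            "median": st.median(values),
            "variance": st.variance(values),
            "stdev": sd,
            "sem": sd / math.sqrt(n),  # standard error of the mean
        }
    raise ValueError(f"unknown scale: {scale}")


ethnicity = summarize(["A", "B", "A", "C", "A"], "nominal")
satisfaction = summarize([1, 2, 2, 3, 4, 4, 4, 5], "ordinal")
weight = summarize([50, 55, 60, 65, 70], "ratio")

print(ethnicity)     # mode 'A'; A is 60.0 percent of the cases
print(satisfaction)  # median 3, mode 4
print(weight)        # mean 60, variance 62.5
```

No mean is returned for nominal or ordinal variables, in line with the table above.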
DATA PRESENTATIONS
TEXTUAL: paragraph or narrative
DATA PRESENTATIONS

Word Cloud: A word cloud (also called tag cloud or weighted list) is a visual representation of text data. Entries are usually single words, and the importance of each is shown with font size or color.
DATA PRESENTATIONS

TABULAR: a systematic way of arranging data in columns and rows according to classifications or categories.

[Example table from Meana et al. (2019), MJSIR]
DATA PRESENTATIONS
GRAPHICAL. A graph or chart is a pictorial presentation of a set of data
Bar Graph vs. Column Chart
https://depictdatastudio.com/when-to-use-horizontal-bar-charts-vs-vertical-column-charts/

Clustered vs. Stacked
https://www.xelplus.com/excel-clustered-column-and-stacked-chart/
DATA PRESENTATIONS
Histogram vs. Pareto

https://study.com/academy/lesson/difference-between-a-pareto-chart-histogram.html
DATA PRESENTATIONS
Line graph is used to show fluctuations and trends in the components of the total over time, or the pattern of changes in the data.
https://www.excel-easy.com/examples/depreciation.html

A frequency polygon is obtained by plotting the Class Mark vs. the Frequency of that class.
https://www.sciencedirect.com/topics/mathematics/frequency-polygon

An ogive is constructed by plotting a true class boundary of each class vs. the cumulative frequency of the corresponding class: the upper boundary against <cF for a "less than" ogive, or the lower boundary against >cF for a "greater than" ogive.
https://www.brainkart.com/article/Types-of-Graphs_35074/
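The points to plot for a frequency polygon and for both kinds of ogive can be derived from a grouped table. The sketch below reuses this module's grouped example, assuming the last class is 26 - 30; it only computes the coordinates, which can then be passed to any plotting tool.

```python
# Class limits and frequencies of the grouped example (last class assumed 26 - 30).
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
freqs = [10, 4, 8, 4, 4]

# Frequency polygon: class mark vs. frequency.
polygon = [((lo + hi) / 2, f) for (lo, hi), f in zip(classes, freqs)]

# "Less than" ogive: upper true class boundary vs. cumulative frequency from the bottom.
less_than, running = [], 0
for (lo, hi), f in zip(classes, freqs):
    running += f
    less_than.append((hi + 0.5, running))

# "Greater than" ogive: lower true class boundary vs. cumulative frequency from the top.
greater_than, remaining = [], sum(freqs)
for (lo, hi), f in zip(classes, freqs):
    greater_than.append((lo - 0.5, remaining))
    remaining -= f

print(polygon)       # [(8.0, 10), (13.0, 4), (18.0, 8), (23.0, 4), (28.0, 4)]
print(less_than)     # [(10.5, 10), (15.5, 14), (20.5, 22), (25.5, 26), (30.5, 30)]
print(greater_than)  # [(5.5, 30), (10.5, 20), (15.5, 16), (20.5, 8), (25.5, 4)]
```

The "less than" curve rises to n = 30 and the "greater than" curve falls from it; where the two cross gives a graphical estimate of the median.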
DATA PRESENTATIONS
Pie Diagram: this is simply a circle subdivided into a number of slices that represent the various categories.
https://www.think-cell.com/en/resources/manual/pie.html

Doughnut Chart is a variation of the pie chart, but it can contain more than one data series.
https://community.plotly.com/t/nested-pie-charts/24011
DATA PRESENTATIONS
Pictograph makes use of symbols and is used to compare a few quantitative discrete data, usually of one kind.

Statistical map. This shows the geographical location and may contain different symbols on the map. The legend, which tells what the symbols represent, is very necessary.

https://psa.gov.ph/content/philippine-population-density-based-2015-census-population
https://old.pcij.org/stories/stats-on-the-state-of-the-regionsland-population-population-density/
https://www.pinterest.ph/pin/540924605225268792/
https://www.anychart.com/products/anychart/gallery/Radar_Charts_(Spiderweb)/
https://support.microsoft.com/en-us/office/available-chart-types-in-office-a6187218-807e-4103-9e0a-27cdb19afb90
https://www.anychart.com/products/anychart/gallery/Tree_Map_Charts/
https://www.excel-exercise.com/sunburst-chart/

DATA PRESENTATIONS
https://towardsdatascience.com/scattered-boxplots-graphing-experimental-results-with-matplotlib-seaborn-and-pandas-81f9fa8a1801?gi=c1f8591680e3
DATA PRESENTATIONS
INFOGRAPHIC
“A chart, diagram, or illustration (as in a book or magazine, or on a website) that uses graphic elements to present information in a visually striking way” (Merriam-Webster Dictionary)

“A picture or diagram or a group of pictures or diagrams showing or explaining information” (Cambridge Dictionary)

“Information or data that is shown in a chart, diagram, etc. so that it is easy to understand” (Oxford Learner’s Dictionary)
Thanks!!
INITIAL DATA ANALYSIS (IDA)
