Initial Data Analysis

STATISTICS MODULE

Uploaded by

Baby Shark


Initial Data Analysis (IDA)

INITIAL DATA ANALYSIS

Process the data and assess its quality before any further analysis
is undertaken. It is crucial to be aware of, and confident in, the
integrity of your data: all recording, encoding, and processing errors
should be identified and remedied.

Data Processing
Data Scrutiny and Cleaning
Data Description
Data Quality

Accuracy. All entries must be correct.
Completeness. Let your data be as complete as possible.
Reliability. Reliable data means that variables conform to, and do not contradict, other related variables.
Timeliness. All entries should be updated and timely, especially for variables that are constantly changing and when the observation/analysis is longitudinal.
Relevance. Data might be accurate and complete, but some variables are not relevant to a particular analysis/research inquiry.
Availability. The data must always be organized, useful, and readily available for possible analysis in the future.
INITIAL DATA ANALYSIS
Data Processing
Validity and Reliability (if survey questionnaire)
Ethical Consideration
Data Processing

Data processing involves coding and entry of the data into a dataset with a format suitable for further data analysis. Each variable must be numerically coded.

Record the following for the dataset:
• Number of observations/respondents
• Number of variables
• Scale of measurement for each variable
• Labels/categories/characteristics that the numerical code represents for nominal and ordinal variables. The numerical code must be consistent.
• For ratio and interval variables, be consistent in the number of decimal places of the values.
• Provide codes for missing values (no value recorded), out-of-range values (a value recorded but known to be impossible), and “don’t know” and “not applicable” responses.
• Make sure that missing values are cells with no value recorded, not zeroes. A true zero is a value and must still be recorded as zero, especially for interval and ratio variables. Hence, zero should not be used as the code for missing values. Some may keep a missing value as an empty cell (no value encoded).
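The coding rules above can be sketched in a few lines of Python. This is only an illustration: the special codes (-97, -98, -99), the variable names, the category labels, and the plausible maximum of 30 are assumptions made for the example and do not come from the slides.

```python
# Hypothetical special codes; the slides ask for such codes but do not fix their values.
OUT_OF_RANGE = -97
DONT_KNOW = -98
NOT_APPLICABLE = -99
SPECIAL_CODES = {OUT_OF_RANGE, DONT_KNOW, NOT_APPLICABLE}

# Consistent numeric codes for a nominal variable (assumed labels).
SEX_CODES = {"male": 1, "female": 2}


def code_sex(raw):
    """Code a raw answer for a nominal variable."""
    answer = (raw or "").strip().lower()
    if answer == "":
        return None  # missing: leave the cell empty, never 0
    if answer == "don't know":
        return DONT_KNOW
    if answer == "not applicable":
        return NOT_APPLICABLE
    return SEX_CODES.get(answer, OUT_OF_RANGE)


def code_children(raw, maximum=30):
    """Code a ratio variable; a recorded 0 stays 0 and is never 'missing'."""
    text = (raw or "").strip()
    if text == "":
        return None
    value = int(text)
    return value if 0 <= value <= maximum else OUT_OF_RANGE


def valid_values(column):
    """Keep only real observations for analysis."""
    return [v for v in column if v is not None and v not in SPECIAL_CODES]


raw_children = ["2", "0", "", "45", "3"]
coded_children = [code_children(r) for r in raw_children]
print(coded_children)                # [2, 0, None, -97, 3]
print(valid_values(coded_children))  # [2, 0, 3]
```

Keeping the missing cell empty (None here) rather than 0 means the recorded zero in the second entry still counts as a valid observation.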
INITIAL DATA ANALYSIS
Validity and Reliability (if survey questionnaire)
Data Scrutiny and Cleaning

Data Scrutiny and Cleaning

Checking the quality and structure of the data and correcting any errors due to recording and processing.

How to handle…
• MISSING DATA
• OUTLIERS
Data Scrutiny and Cleaning
MISSING DATA

If the missing values seem non-random, they should be retained and investigated.

Most of the time, missing values in research are values that were deleted unintentionally, hence they must be revalidated.

If a survey questionnaire was used, one must call out and check whether the missing values are indeed “no responses”. In this case, make sure that each questionnaire is marked with a control code/number that is reflected in the encoded data.

If missing data are scattered at random throughout the cases/observations/respondents and variables, either estimate the missing values or delete the cases/observations/respondents or the particular variables that have missing data.

Deleting cases/observations/respondents is advised only when just a few of them have missing data.

Dropping variables but retaining cases/observations/respondents is an alternative, but it is generally suitable only when the variable is not critical to the analysis.
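The two deletion options can be compared directly before choosing one. The sketch below uses a small hypothetical dataset (the records and variable names are invented for illustration) in which None marks a missing value.

```python
# Hypothetical records; None marks a missing value (never 0).
records = [
    {"id": 1, "age": 34, "income": 52000, "score": 7},
    {"id": 2, "age": None, "income": 48000, "score": 6},
    {"id": 3, "age": 29, "income": None, "score": 8},
    {"id": 4, "age": 41, "income": 61000, "score": 9},
    {"id": 5, "age": 37, "income": None, "score": 5},
]
variables = ["age", "income", "score"]


def missing_counts(rows, names):
    """Count missing values per variable to see where the gaps are."""
    return {name: sum(row[name] is None for row in rows) for name in names}


def drop_incomplete_cases(rows, names):
    """Delete every case that has a missing value in any listed variable."""
    return [row for row in rows if all(row[name] is not None for name in names)]


def drop_variable(rows, name):
    """Drop one variable but retain all cases."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


print(missing_counts(records, variables))  # {'age': 1, 'income': 2, 'score': 0}

complete = drop_incomplete_cases(records, variables)
print([r["id"] for r in complete])  # [1, 4]

kept = drop_incomplete_cases(drop_variable(records, "income"), ["age", "score"])
print([r["id"] for r in kept])  # [1, 3, 4, 5]
```

In this toy example, deleting cases loses three of five respondents, while dropping the income variable loses only one, which is acceptable only if income is not critical to the analysis.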
Data Scrutiny and Cleaning
MISSING DATA

DATA IMPUTATION: substituting missing values with the “best estimates” is another alternative.

• Best guess: not advisable.
• The overall mean for that particular variable: not suitable for binary/dichotomous variables (e.g. sex, Yes and No responses, etc.) or for nominal and ordinal data in general.
• A relevant group mean: not suitable for binary/dichotomous variables or for nominal and ordinal data in general.
• A regression equation based on complete data to predict missing values: use only if the other variables are likely to predict the variable with missing values.
• A generalized approach based on the likelihood function: it is sophisticated and generally not necessary, as it uses the iterative two-step expectation-maximization (E-M) algorithm to derive maximum likelihood estimates for incomplete values.
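The first three estimation options can be sketched as follows. The data, the variable names, and the use of a single predictor with ordinary least squares are assumptions for illustration; the E-M approach is not shown.

```python
from statistics import mean

# Hypothetical data: weight (kg) has gaps; height (cm) and group are complete.
rows = [
    {"group": "A", "height": 150, "weight": 50.0},
    {"group": "A", "height": 160, "weight": 60.0},
    {"group": "A", "height": 155, "weight": None},
    {"group": "B", "height": 170, "weight": 70.0},
    {"group": "B", "height": 180, "weight": 80.0},
    {"group": "B", "height": 175, "weight": None},
]


def impute_overall_mean(rows, target):
    """Fill gaps with the overall mean (interval/ratio variables only)."""
    fill = mean(r[target] for r in rows if r[target] is not None)
    return [r[target] if r[target] is not None else fill for r in rows]


def impute_group_mean(rows, target, by):
    """Fill gaps with the mean of the case's own group."""
    groups = {}
    for r in rows:
        if r[target] is not None:
            groups.setdefault(r[by], []).append(r[target])
    fills = {g: mean(vals) for g, vals in groups.items()}
    return [r[target] if r[target] is not None else fills[r[by]] for r in rows]


def impute_regression(rows, target, predictor):
    """Fill gaps with predictions from a least-squares line fitted on complete cases."""
    complete = [r for r in rows if r[target] is not None]
    xs = [r[predictor] for r in complete]
    ys = [r[target] for r in complete]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return [
        r[target] if r[target] is not None else intercept + slope * r[predictor]
        for r in rows
    ]


overall = impute_overall_mean(rows, "weight")
by_group = impute_group_mean(rows, "weight", "group")
by_regression = impute_regression(rows, "weight", "height")
print(overall[2], by_group[2], by_regression[2])
```

In this example the overall mean fills both gaps with the same value, while the group mean and the regression line give each missing case an estimate closer to similar cases.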
Data Scrutiny and Cleaning
OUTLIERS

Outliers are apparently spurious values that are extreme but plausible. Such extreme values are possible but not consistent with the remaining data.

An outlier is more problematic than a gross error and needs to be checked carefully.

An outlier may be an error or a valid and influential observation.

If the raw data have been checked and recording, transcription, and typing errors have been eliminated, then analyses with and without the outliers can be done. Provided that the interpretation of the findings is not radically different between the two analyses, it is not crucial whether or not the outliers are counted as valid values.

In this particular instance, the outlier is not an error but an influential observation. However, one should interpret such data with care, as influential observations may represent a different population from the majority of observations.
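A with-and-without comparison can be run as a simple sensitivity check. The values below are hypothetical, and the suspected outlier is assumed to have already been verified against the raw record.

```python
from statistics import mean, median, stdev

# Hypothetical values; 95 is extreme but assumed to be verified as correctly recorded.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
suspected = {95}


def summarize(data):
    """Return the summary measures to compare between the two analyses."""
    return {
        "n": len(data),
        "mean": mean(data),
        "median": median(data),
        "stdev": stdev(data),
    }


with_outliers = summarize(values)
without_outliers = summarize([v for v in values if v not in suspected])

for key in ("n", "mean", "median", "stdev"):
    print(key, with_outliers[key], without_outliers[key])
```

Here the median barely moves while the mean and standard deviation change a lot, which suggests the value is influential and that conclusions resting on the mean should be interpreted with care.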
INITIAL DATA ANALYSIS
Data Description

Data Description

Data description involves summary measures and display of the main characteristics of the data distribution.

The main purpose of data description is to present essential and important features of the data, usually in tables (summary measures) and in graphs and charts (data presentations).
Data Description

DETECTING OUTLIERS? Box and Whisker Plot
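A box and whisker plot usually flags points lying beyond "fences" around the quartiles. The sketch below assumes the common 1.5 × IQR convention, which the slides do not state, and the "inclusive" quartile method of Python's statistics module; other conventions give slightly different fences. The data are the 30 ungrouped values used later in the deck, and the extra value 60 is hypothetical.

```python
from statistics import quantiles

# The 30 ungrouped values from the Summary Measures slides.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]


def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) using quartiles and k times the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values lying outside the fences."""
    low, high = iqr_fences(values, k)
    return [x for x in values if x < low or x > high]


print(flag_outliers(data))         # [] : no value lies beyond the fences
print(flag_outliers(data + [60]))  # [60] : a hypothetical extreme value is flagged
```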
Data Description

Examining the DISTRIBUTION? Histograms
Data Description

Detecting OUTLIERS and examining the DISTRIBUTION at the same time? Stem and Leaf Plot
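A stem and leaf plot is easy to build by hand or in code. The sketch below assumes non-negative integers, with the tens digit as the stem and the units digit as the leaf, and uses the 30 ungrouped values from the Summary Measures slides.

```python
from collections import defaultdict

# The 30 ungrouped values from the Summary Measures slides.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]


def stem_and_leaf(values):
    """Return the plot as text lines: stem = tens digit, leaf = units digit."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    lines = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


for line in stem_and_leaf(data):
    print(line)
# 0 | 6 6 7 8 8 8 9
# 1 | 0 0 0 1 2 3 3 6 6 6 6 8 8 9
# 2 | 0 1 3 4 4 7 7 8
# 3 | 0
```

The row lengths show the shape of the distribution, and an isolated leaf far from the other stems would stand out as a candidate outlier.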
DESCRIPTIVE STATISTICS
Summary Measures
Graphical Presentation
SUMMARY MEASURES
Measures of Central Tendency: as the name suggests, these show the center point of the data.
Measures of Variability/Dispersion: describe the amount of variability or spread of the values around the central tendency in a set of data.
Measures of Position/Location: indicate the relative standing (position) of a particular value relative to the entire data set.
Measures of Skewness and Kurtosis: measure the degree of asymmetry and ‘peakedness’ of a frequency distribution.
Ungrouped data

16  18  12
30   7  13
 8  10  10
 9  18  11
24  28  23
16  20  27
13  21  16
10   6  19
24  16   6
 8   8  27

Grouped data

CI        f    TCB            CM (xi)   RF(%)   <cF   >cF   Rank
6 - 10    10   5.5 – 10.5     8         33.33   10    30    5
11 - 15   4    10.5 – 15.5    13        13.33   14    20    (3) 2
16 - 20   8    15.5 – 20.5    18        26.67   22    16    4
21 - 25   4    20.5 – 25.5    23        13.33   26    8     (2) 2
26 - 30   4    25.5 – 30.5    28        13.33   30    4     (1) 2

CI = class interval, f = frequency, TCB = true class boundaries, CM = class mark, RF = relative frequency, <cF and >cF = less-than and greater-than cumulative frequency.
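The grouped columns can be derived from the ungrouped values. The sketch below assumes five classes of width 5 (6-10 through 26-30) and integer data, so that true class boundaries lie half a unit beyond the class limits; the function and key names are illustrative only.

```python
# The 30 ungrouped values from the slide.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]

# Assumed class limits: width 5, covering the smallest to the largest value.
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]


def frequency_table(values, class_limits):
    """Build f, TCB, CM, RF(%), <cF and >cF for each class interval."""
    n = len(values)
    rows, cumulative = [], 0
    for low, high in class_limits:
        f = sum(low <= x <= high for x in values)
        cumulative += f
        rows.append({
            "class": (low, high),
            "f": f,
            "tcb": (low - 0.5, high + 0.5),    # true class boundaries
            "cm": (low + high) / 2,            # class mark
            "rf": round(100 * f / n, 2),       # relative frequency in percent
            "less_cf": cumulative,             # <cF
            "greater_cf": n - cumulative + f,  # >cF
        })
    return rows


for row in frequency_table(data, classes):
    print(row)
```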
SUMMARY MEASURES
Visualization using Box and Whisker Plot

Mean 15.8
Median 16
Mode 16
StDev 7.20345
IQR 7.75

[Figure: box and whisker plot of the ungrouped data, marking the highest value, 75th percentile, mean (x), median (line), mode, 25th percentile, lowest value, and the IQR]
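The summary measures for the ungrouped data can be checked with Python's statistics module. This sketch assumes the sample (n − 1) standard deviation. Quartiles, and therefore the IQR, depend on the interpolation convention used by the software, so the IQR computed here may not reproduce the figure on the slide.

```python
from statistics import mean, median, mode, quantiles, stdev

# The 30 ungrouped values from the slide.
data = [16, 18, 12, 30, 7, 13, 8, 10, 10, 9, 18, 11, 24, 28, 23,
        16, 20, 27, 13, 21, 16, 10, 6, 19, 24, 16, 6, 8, 8, 27]

summary = {
    "mean": mean(data),      # 15.8
    "median": median(data),  # 16.0
    "mode": mode(data),      # 16
    "stdev": stdev(data),    # sample standard deviation, about 7.2034
}

# The IQR under two quartile conventions offered by the statistics module.
for method in ("inclusive", "exclusive"):
    q1, _, q3 = quantiles(data, n=4, method=method)
    summary[f"iqr_{method}"] = q3 - q1

for name, value in summary.items():
    print(name, value)
```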
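SUMMARY MEASURES
Visualization using Histogram and Frequency Polygon

MEASURE OF SKEWNESS
MEASURE OF KURTOSIS

Mean 16
Median 16.125
Mode 9
StDev 7.4138

[Figure: histogram and frequency polygon of the grouped data; x-axis marked at the true class boundaries 10.5, 15.5, 20.5, 25.5, 30.5]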
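The grouped-data measures can be sketched from the frequency table using class marks and true class boundaries. The formulas below are the usual textbook ones and are assumptions about how the slide values were obtained: the standard deviation uses the sample (n − 1) formula, and the skewness shown is Pearson's second coefficient, 3(mean − median)/SD, which is only one of several skewness measures. Values computed this way need not match every figure on the slide, which may rest on different formulas or rounding.

```python
from math import sqrt

# (lower true boundary, upper true boundary, frequency) for classes 6-10 to 26-30.
classes = [(5.5, 10.5, 10), (10.5, 15.5, 4), (15.5, 20.5, 8),
           (20.5, 25.5, 4), (25.5, 30.5, 4)]


def grouped_mean(groups):
    """Mean = sum(f * class mark) / n."""
    n = sum(f for _, _, f in groups)
    return sum(f * (lo + hi) / 2 for lo, hi, f in groups) / n


def grouped_median(groups):
    """Median = L + ((n/2 - cF before) / f) * class width, in the median class."""
    n = sum(f for _, _, f in groups)
    before = 0
    for lo, hi, f in groups:
        if before + f >= n / 2:
            return lo + (n / 2 - before) / f * (hi - lo)
        before += f


def grouped_stdev(groups):
    """Sample standard deviation computed from class marks."""
    n = sum(f for _, _, f in groups)
    m = grouped_mean(groups)
    ss = sum(f * ((lo + hi) / 2 - m) ** 2 for lo, hi, f in groups)
    return sqrt(ss / (n - 1))


def pearson_skewness(groups):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3 * (grouped_mean(groups) - grouped_median(groups)) / grouped_stdev(groups)


print(grouped_mean(classes))    # 16.0
print(grouped_median(classes))  # 16.125
print(round(grouped_stdev(classes), 4))
print(round(pearson_skewness(classes), 4))
```

A coefficient close to zero, as here, indicates that the mean and median nearly coincide.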
SUMMARY MEASURES
ORDINAL DATA
SUMMARY MEASURES
NOMINAL DATA
SUMMARY MEASURES
In relation to SCALE OF MEASUREMENTS

NOMINAL (e.g. Ethnicity)
  Appropriate summary statistics: Mode; no measure of variability
  Commonly used graphical presentation: Frequency and Percentage Distribution, Bar Graph

ORDINAL (e.g. Satisfaction Level)
  Appropriate summary statistics: Median, Mode; Range, Interquartile Range
  Commonly used graphical presentation: Frequency and Percentage Distribution, Column Chart/Graph

INTERVAL/RATIO (e.g. Weight)
  Appropriate summary statistics: Mean, Median, Mode; Range, Interquartile Range, Variance and Standard Deviation, Standard Error of the Mean
  Commonly used graphical presentation: Frequency Distribution Table (FDT), Histogram, Box and Whisker
DATA PRESENTATIONS
TEXTUAL: paragraph or narrative
DATA PRESENTATIONS

Word Cloud: A word cloud (also called tag cloud or weighted list) is a visual representation of text data. Entries are usually single words, and the importance of each is shown with font size or color.
DATA PRESENTATIONS

TABULAR is a systematic way of arranging data in columns and rows according to classifications or categories.

(Example table: Meana et al., 2019, MJSIR)
DATA PRESENTATIONS
GRAPHICAL. A graph or chart is a pictorial presentation of a set of data.

Bar Graph vs. Column Chart
https://depictdatastudio.com/when-to-use-horizontal-bar-charts-vs-vertical-column-charts/

Clustered vs. Stacked
https://www.xelplus.com/excel-clustered-column-and-stacked-chart/
DATA PRESENTATIONS
Histogram vs. Pareto

https://study.com/academy/lesson/difference-between-a-pareto-chart-histogram.html
DATA PRESENTATIONS
Line graph is used to show fluctuations and trends in the components of the total over time, or the pattern of changes in the data.
https://www.excel-easy.com/examples/depreciation.html

A frequency polygon is obtained by plotting the class mark against the frequency of each class.
https://www.sciencedirect.com/topics/mathematics/frequency-polygon

Ogive is constructed by plotting a true class boundary of each class against the cumulative frequency of the corresponding class: upper boundaries with the less-than cumulative frequency (<cF), or lower boundaries with the greater-than cumulative frequency (>cF).
https://www.brainkart.com/article/Types-of-Graphs_35074/
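The plotting coordinates for a frequency polygon and an ogive can be derived from a frequency table. The sketch below assumes the grouped data from the Summary Measures slides (classes 6-10 through 26-30) and the less-than ogive convention, which pairs each upper true class boundary with the cumulative frequency and starts at zero on the first lower boundary; the function names are illustrative.

```python
# (lower class limit, upper class limit, frequency) for the grouped data.
classes = [(6, 10, 10), (11, 15, 4), (16, 20, 8), (21, 25, 4), (26, 30, 4)]


def frequency_polygon_points(groups):
    """Pair each class mark with the frequency of its class."""
    return [((low + high) / 2, f) for low, high, f in groups]


def less_than_ogive_points(groups, half_unit=0.5):
    """Pair each upper true class boundary with the less-than cumulative frequency."""
    points = [(groups[0][0] - half_unit, 0)]  # start at zero on the first lower boundary
    cumulative = 0
    for _, high, f in groups:
        cumulative += f
        points.append((high + half_unit, cumulative))
    return points


print(frequency_polygon_points(classes))
# [(8.0, 10), (13.0, 4), (18.0, 8), (23.0, 4), (28.0, 4)]
print(less_than_ogive_points(classes))
# [(5.5, 0), (10.5, 10), (15.5, 14), (20.5, 22), (25.5, 26), (30.5, 30)]
```

These point lists can be passed to any plotting tool and joined with straight line segments.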
DATA PRESENTATIONS
Pie Diagram: this is simply a circle subdivided into a number of slices that represent the various categories.
https://www.think-cell.com/en/resources/manual/pie.html

Doughnut Chart is a variation of the pie chart, but it can contain more than one data series.
https://community.plotly.com/t/nested-pie-charts/24011
DATA PRESENTATIONS
Pictograph makes use of symbols and is used to compare a few quantitative discrete data, usually of one kind.

Statistical map. This shows the geographical location and may contain different symbols on the map. The legend, which tells what the symbols represent, is very necessary.

https://psa.gov.ph/content/philippine-population-density-based-2015-census-population
https://old.pcij.org/stories/stats-on-the-state-of-the-regionsland-population-population-density/
https://www.pinterest.ph/pin/540924605225268792/
https://www.anychart.com/products/anychart/gallery/Radar_Charts_(Spiderweb)/
https://support.microsoft.com/en-us/office/available-chart-types-in-office-a6187218-807e-4103-9e0a-27cdb19afb90
https://www.anychart.com/products/anychart/gallery/Tree_Map_Charts/
https://www.excel-exercise.com/sunburst-chart/

DATA PRESENTATIONS
https://towardsdatascience.com/scattered-boxplots-graphing-experimental-results-with-matplotlib-seaborn-and-pandas-81f9fa8a1801?gi=c1f8591680e3
DATA PRESENTATIONS
INFOGRAPHIC
“A chart, diagram, or illustration (as in a book or magazine, or on a website) that uses graphic elements to present information in a visually striking way” (Merriam-Webster Dictionary)

“A picture or diagram or a group of pictures or diagrams showing or explaining information” (Cambridge Dictionary)

“Information or data that is shown in a chart, diagram, etc. so that it is easy to understand” (Oxford Learner’s Dictionary)
Thanks!!
INITIAL DATA ANALYSIS (IDA)
