Session 2 - Excel Fundamentals For Data Exploration
TABLE OF CONTENTS
01 02 03
Excel is
clear
Company Data
Open Data
• URL of the page visited
• An identifier for the element that was clicked
• The timestamp of the event
• An identifier for the user that performed the action
2.1 HANDLING
DATA SOURCE
Survey Data
Data
• Qualitative: described by a characteristic
• Quantitative: described by a numerical scale
20%
Analysing
WHY?
o Better decision-making process
2.3 HANDLING
DATA ISSUES
Types of Data Issues
• Dirty data
• Missing data
• Outliers
Dirty Data
Dirty data contains errors, or is in a format that is unfriendly or unusable.
Examples:
• Name stored as one field: Smith, John (parsed: LName = Smith, Fname = John)
• Name wrapped in quotation marks: “John Smith” (clean: John Smith), “Patrick”, “Ben”
• Email: John.doe@gmail.com, helen@hotmail, schow@yahoo.com
Dirty Data – Not Parsed Correctly
Raw data should be stored in its smallest parts, since that makes it easier to analyze.
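A minimal sketch of parsing a combined name into its smallest parts, written in Python rather than Excel for illustration. The helper name `split_name` is hypothetical, and the sketch assumes the "Last, First" layout shown in the example; other layouts would need their own rules.

```python
def split_name(raw):
    """Split a 'Last, First' string into (last_name, first_name)."""
    # Remove surrounding spaces and stray straight or curly quotation marks.
    cleaned = raw.strip().strip('"“”').strip()
    # partition() splits on the first comma; with no comma, first is empty.
    last, _, first = cleaned.partition(",")
    return last.strip(), first.strip()


last_name, first_name = split_name("Smith, John")
print(last_name, first_name)
```

Storing the two parts in separate columns makes later sorting and filtering by last name straightforward.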
Missing Data
Missing data = Gaps in data
The same gaps can appear as blank cells, as (null), or as N/A:
Name   City      Age     GPA
John   Paris     12      90
Minh   New York  (null)  100
Mei    (null)    (null)  23
Lucy   Tokyo     11      92
Peter  Beijing   12      (null)
ID   Age (no missing values)   Age (with missing values)
1    27                        27
2    37                        37
3    70                        (missing)
4    55                        55
5    23                        23
6    25                        25
7    35                        35
8    51                        51
9    65                        (missing)
10   67                        67
Average age without missing values: 45.5
Average age with missing values: 40 (DOWNWARD BIAS)
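The comparison in the two Age columns can be checked with a short Python sketch. It assumes the missing entries are the ages for IDs 3 and 9, represented here as `None`; the variable names are illustrative.

```python
from statistics import mean

# Ages for IDs 1..10 with no gaps.
ages_complete = [27, 37, 70, 55, 23, 25, 35, 51, 65, 67]
# Same column with the ages for IDs 3 and 9 missing (None).
ages_with_gaps = [27, 37, None, 55, 23, 25, 35, 51, None, 67]

# Averaging only the observed values silently ignores the gaps.
observed = [age for age in ages_with_gaps if age is not None]

mean_complete = mean(ages_complete)
mean_observed = mean(observed)
print(mean_complete, mean_observed)
```

Because the two missing ages are among the highest in the column, the average of the remaining values comes out lower than the average of the full column.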
Imputation 2
Missing Data – Deleting Missing Data
Deleting missing data is often the default method because of its simplicity.
Before:
ID  Name   City      Age  GPA
1   John   Paris     12   90
2   Minh   New York       100
3   Mei                   23
4   Lucy   Tokyo     11   92
5   Peter  Beijing   12
After deleting rows with missing values:
ID  Name   City      Age  GPA
1   John   Paris     12   90
4   Lucy   Tokyo     11   92
However, you should make sure that deleting missing data doesn't have adverse effects on your analysis. For example, if a particular demographic tended to leave a response blank in a survey, then removing records with blank entries will mean that a part of the population is underrepresented.
One of the downsides is that eliminating missing data reduces the size of the dataset. (Ex: cost data)
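Deleting incomplete records can be sketched in Python as follows. The rows mirror the example table, with `None` standing in for a blank cell (an assumption about how the gaps are encoded).

```python
rows = [
    {"ID": 1, "Name": "John", "City": "Paris", "Age": 12, "GPA": 90},
    {"ID": 2, "Name": "Minh", "City": "New York", "Age": None, "GPA": 100},
    {"ID": 3, "Name": "Mei", "City": None, "Age": None, "GPA": 23},
    {"ID": 4, "Name": "Lucy", "City": "Tokyo", "Age": 11, "GPA": 92},
    {"ID": 5, "Name": "Peter", "City": "Beijing", "Age": 12, "GPA": None},
]

# Keep a row only if every one of its fields has a value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

kept_ids = [r["ID"] for r in complete_rows]
dropped_count = len(rows) - len(complete_rows)
print(kept_ids, dropped_count)
```

Counting the dropped rows before discarding them is a cheap way to see how much of the dataset the deletion costs.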
Missing Data – Imputation
In statistics, imputation is the process of substituting values in the data where values are missing. Candidate substitute values:
• MEAN
• MEDIAN
• MODE
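A minimal imputation sketch in Python, assuming missing values are encoded as `None`. The helper `impute` is hypothetical; it fills every gap with a single summary value computed from the observed entries. The sample ages are taken from the Age column of the earlier example table.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with strategy(observed values)."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Age column for John, Minh, Mei, Lucy, Peter (two ages missing).
ages = [12, None, None, 11, 12]

ages_mean = impute(ages, mean)
ages_median = impute(ages, median)
ages_mode = impute(ages, mode)
print(ages_median)
```

Passing the summary function as an argument keeps the three strategies interchangeable, so their effect on the column can be compared side by side.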
Missing Data – Imputation
How can we decide when to use mean, median and mode?

MEAN vs MEDIAN
When is a median a better summary description of data than the mean? Let’s look at seven employees at a small firm with the following salaries.
Salary
$28,000
$33,000
$33,000
$34,000
$37,000
$40,000
$400,000
What’s the typical salary in this group?
Mean: about $86,000
Median: $34,000

MODE
MODE is not a relevant descriptive statistic when the data is essentially continuous.
Date      Rate
1/1/2022  0.936
2/1/2022  0.93
3/1/2022  0.876
4/1/2022  0.875
5/1/2022  0.86
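The salary comparison can be reproduced with a short Python sketch. It assumes the seven salaries are $28,000, $33,000, $33,000, $34,000, $37,000, $40,000 and $400,000, which is the set consistent with the stated mean and median.

```python
from statistics import mean, median

# Seven salaries; the last one is far above the rest.
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

salary_mean = mean(salaries)
salary_median = median(salaries)

# How many employees earn less than the mean?
below_mean = sum(1 for s in salaries if s < salary_mean)
print(round(salary_mean), salary_median, below_mean)
```

Six of the seven employees earn less than the mean, which is why the median is the more representative "typical" salary for this group.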
Outliers
Incorrect Data
OR
• Violin Plot (a hybrid of a box plot and a kernel density plot): shows the volume of the distribution
• Others: z-scores or standard deviations
BOX AND WHISKER PLOT ELEMENTS: Upper Fence, 3rd Quartile, Median, 1st Quartile, Interquartile Range, Lower Fence, Outliers
Outliers – Identify: Exercise
Data points:
10 12 11 15
11 14 13 17
12 22 14 11
Sort the data points in ascending order:
10 11 11 11 12 12 13 14 14 15 17 22
IQR = Q3 – Q1 = 14.5 – 11 = 3.5
[Figure: box plot over a number line from 10 to 22, with the Upper Fence marked]
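The exercise can be worked through in Python as a sketch. Two assumptions are made here: quartiles are taken as the medians of the lower and upper halves of the sorted data (spreadsheet quartile functions may use a different interpolation and give slightly different values), and the fences sit 1.5 × IQR beyond the quartiles, which is a common convention rather than something the exercise states.

```python
from statistics import median

data = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]
ordered = sorted(data)

# Split into lower and upper halves (the middle value is skipped if n is odd).
half = len(ordered) // 2
lower_half, upper_half = ordered[:half], ordered[-half:]

q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Under these assumptions the only point beyond the fences is 22.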
Outliers – Dealing with outliers
Just because numbers are atypical doesn’t mean they are unreasonable. Many phenomena yield “long-tail” distributions where a few outliers legitimately exist. For instance, in economics most people have modest wealth, but a few have very high net worth, and to exclude them from analysis would be misleading.
[Figure: the data BEFORE and AFTER dealing with outliers]
Outliers – Dealing with outliers: WINSORIZING
We can cap outlying values to within a defined range by setting the “ceiling” and “floor” attributes.
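A minimal sketch of capping values at a floor and a ceiling, in Python. The helper name `cap_values` and the limits 11 and 17 are hypothetical, chosen only for illustration; winsorizing in the stricter statistical sense usually derives the limits from percentiles of the data rather than fixing them by hand.

```python
def cap_values(values, floor, ceiling):
    """Clamp every value into the range [floor, ceiling]."""
    return [min(max(v, floor), ceiling) for v in values]


data = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]

# Hypothetical limits: anything below 11 becomes 11, anything above 17 becomes 17.
capped = cap_values(data, floor=11, ceiling=17)
print(capped)
```

Unlike deletion, capping keeps every record, so the size of the dataset does not change.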
The 3 Hows:
1. Pivot/Unpivot
2. Aggregate Data/Group Data
3. Conditional Formatting
2.4 HANDLING
DATA FORMATTING
Data Formatting – Pivot/Unpivot