Week2_UnderstandingData
Learning Applications
Roselyne Tchoua
rtchoua@depaul.edu
School of Computing, CDM, DePaul University
Understanding Your Data
Text Data Matrices
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Such data can be represented as an m by n matrix, with m rows, one for
each object, and n columns, one for each attribute
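[Figure: document-term matrix. Each row is a document and each column holds the count of one term: timeout, season, coach, game, score, team, ball, lost, play, win (the term order here is not the column order)]
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 3: 0 1 0 0 1 2 2 0 3 0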
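As a minimal sketch of the "m by n matrix" view, the two document rows shown above can be stored as a list of rows. The variable names `X`, `m` and `n` are illustrative, and no term labels are attached to the columns.

```python
# Term counts for the two documents shown above (one row per document).
X = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],  # Document 1
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],  # Document 3
]

m = len(X)      # number of rows (objects)
n = len(X[0])   # number of columns (attributes)

# Every object must have the same fixed set of numeric attributes.
assert all(len(row) == n for row in X)

print(f"{m} objects, each a point in {n}-dimensional space")
```

Each row can then be read as the coordinates of one point in a 10-dimensional space.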
Basic Statistical Description of Data
Before deeper analysis, it’s important to explore the basic characteristics and
relationships in the data set
• Descriptive Statistics
– To better understand the characteristics of attributes and fields: central
tendency, variation, spread, etc.
– To get a feel for general patterns or relationships among variables: e.g.,
correlation, covariance, etc.
• Data Visualization
– Visual examination of data distributions often helps uncover
important patterns and guides further investigation or decision making
Measuring the Central Tendency
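• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Median: the middle value of the sorted data if n is odd, otherwise the
average of the two middle values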
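The three measures above can be checked on a small example. This is a sketch only: the sample `values` and the `weights` are made up for illustration.

```python
import statistics

values = [2, 4, 4, 10]   # hypothetical sample
weights = [1, 1, 2, 1]   # hypothetical weights, one per value

# Sample mean: sum of the values divided by the sample size n.
mean = sum(values) / len(values)

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Median: middle value of the sorted data (average of the two
# middle values when n is even, as it is here).
median = statistics.median(values)

print(mean, weighted_mean, median)
```

The single large value 10 pulls the mean above the median, which is one reason to report both.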
Graphic Display of Basic Stat. Description
• Boxplot: graphic display of five-number summary
• Histogram: x-axis → values, y-axis → frequencies
• Quantile plot: each value x_i is paired with f_i,
indicating that approximately 100 f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
– Gives an at-a-glance idea of whether two variables may come
from similar distributions
• Scatter plot: each pair of values is treated as a pair of
coordinates and plotted as a point in the plane
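To make the quantile plot concrete, the sketch below pairs each sorted value x_i with a fraction f_i. The convention f_i = (i − 0.5) / n is an assumption (one common choice, not stated in these slides), and the data are made up.

```python
values = [7, 1, 5, 3]   # hypothetical data

xs = sorted(values)     # x_1 <= x_2 <= ... <= x_n
n = len(xs)

# Assumed convention: f_i = (i - 0.5) / n for i = 1..n, so that
# roughly 100 * f_i percent of the data are <= x_i.
pairs = [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]

for x, f in pairs:
    print(f"x = {x}, f = {f:.3f}")
```

Plotting f on the horizontal axis against x on the vertical axis gives the quantile plot.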
Boxplot
• Five-number summary of a distribution
“Minimum”, Q1, Median, Q3, “Maximum”
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the
box
– Whiskers: two lines outside the box, extended to the
smallest and largest values that lie within
[Q1 − 1.5 IQR, Q3 + 1.5 IQR]
– Outliers: points beyond a specified outlier
threshold, plotted individually
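The quantities a boxplot draws can be computed directly. The sketch below uses made-up data. The 1.5 IQR threshold comes from the slide, while the quartile interpolation method (`method="inclusive"`) is an assumption, since several conventions exist.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # hypothetical data

# Quartiles; the interpolation method is an assumed convention.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Outlier thresholds at 1.5 IQR beyond the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences.
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Points outside the fences are plotted individually.
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the value 50 falls above the upper threshold, so the upper whisker stops at 9 and 50 is drawn as a separate point.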
Histograms
• Histogram: displays the frequency
distribution of continuous data, with the values
grouped into non-overlapping intervals (bins)
https://towardsdatascience.com/violin-plots-explained-fb1d115e023d
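The binning behind a histogram can be sketched in a few lines. The data and the bin width of 5 are hypothetical, chosen only to show how values fall into non-overlapping intervals.

```python
from collections import Counter

values = [1, 3, 4, 6, 7, 7, 8, 12, 14, 19]   # hypothetical data
width = 5                                     # hypothetical bin width

# Map each value to the left edge of its bin [start, start + width).
counts = Counter((v // width) * width for v in values)

# Text rendering: one row per bin, one '#' per observation.
for start in sorted(counts):
    print(f"[{start}, {start + width}): {'#' * counts[start]}")
```

Changing `width` changes the shape of the histogram, so it is worth trying more than one value.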
Correlations (Categorical)
• Chi-square (χ²) test:
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• Chi-squared test is used to determine whether there is a statistically
significant difference between the expected frequencies and the
observed frequencies in one or more categories of a contingency
table
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose
actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to the third variable: population
Correlations (Categorical Example)
                      Play chess   Not play chess   Sum (row)
Like science fiction  250 (90)     200 (360)        450
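The χ² formula can be applied to a 2 × 2 table as sketched below. Only the first row (250, 200) comes from the table above. The second row is a hypothetical completion, chosen so that the expected counts for the first row come out as the 90 and 360 shown in parentheses.

```python
observed = [
    [250, 200],    # like science fiction (from the table above)
    [50, 1000],    # hypothetical second row, NOT from the slide
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected count under independence: row sum * column sum / total.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Chi-square: sum over all cells of (Observed - Expected)^2 / Expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected[0], round(chi2, 2))
```

With these assumed counts, the cell with observed 250 against expected 90 contributes the most to χ², illustrating the point about cells whose actual count is far from the expected count.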
ID   Outlook    Temperature   Humidity   Windy
1    sunny      85            85         FALSE
2    sunny      80            90         TRUE
3    overcast   83            78         FALSE
4    rain       70            96         FALSE
5    rain       68            80         FALSE
6    rain       65            70         TRUE
7    overcast   58            65         TRUE
8    sunny      72            95         FALSE
9    sunny      69            70         FALSE
10   rain       71            80         FALSE
11   sunny      75            70         TRUE
12   overcast   73            90         TRUE
13   overcast   81            75         FALSE
14   rain       75            80         TRUE

                     Windy   Not Windy
Outlook = sunny      2       3
Outlook = rain       2       3
Outlook = overcast   2       2

[Figure: bar chart of the counts above, grouped by Windy / Not Windy, with one bar per Outlook value (sunny, rain, overcast)]
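The Outlook by Windy counts can be derived from the 14 records by simple counting. The sketch below keeps only the two categorical columns involved; the names `records` and `counts` are illustrative.

```python
from collections import Counter

# (Outlook, Windy) for IDs 1 to 14, in order, taken from the table.
records = [
    ("sunny", False), ("sunny", True), ("overcast", False),
    ("rain", False), ("rain", False), ("rain", True),
    ("overcast", True), ("sunny", False), ("sunny", False),
    ("rain", False), ("sunny", True), ("overcast", True),
    ("overcast", False), ("rain", True),
]

# Contingency table: number of records per (Outlook, Windy) pair.
counts = Counter(records)

print("outlook    windy  not windy")
for outlook in ("sunny", "rain", "overcast"):
    print(f"{outlook:<10} {counts[(outlook, True)]:<6} {counts[(outlook, False)]}")
```

A table of observed counts like this one is the starting point for a chi-square test between two categorical attributes.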
Other Types of Visualization
• Understanding Properties of Text
– Zipf distribution
– TF x IDF
– Tag/Word Clouds
• Graph Visualization