Week2_UnderstandingData

The document provides an overview of understanding data in machine learning, including types of datasets, data objects, attributes, and basic statistical descriptions. It emphasizes the importance of exploring data characteristics, relationships, and visualizations to inform further analysis. Additionally, it covers correlation analysis and various statistical plots to aid in data interpretation.
DSC478: Programming Machine Learning Applications
Roselyne Tchoua
rtchoua@depaul.edu
School of Computing, CDM, DePaul University
Understanding Your Data

Before doing any machine learning, you should have an idea of what your data looks like
– What type of data is it?
– What are the features? What types are they? Do they need to be converted into different types? e.g., categorical to numeric
– Are some features correlated?
– Is the data skewed?
– Are there outliers?
– If you have class labels, are the classes imbalanced? e.g., a few transactions labeled as fraud in a sea of “normal” transactions
– Part of visualization is about whether you can easily digest and convey the characteristics of your data
Types of Dataset
• Record Data
– Relational records
– Data matrix, similar to records but all numeric data
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial and multimedia:
– Spatial data: maps
– Image + Video + Text data…
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: objects → customers, store items, sales
– medical database: objects → patients, treatments
– university database: objects → students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples, vectors.
• Data objects are described by attributes.
• Database rows → data objects; columns → attributes.
Attributes
• Attribute (or dimensions, features, variables): a data field
representing a characteristic or property of a data object
– E.g., customer_ID, name, address, income, GPA, …
• Types:
– Nominal (Categorical)
• e.g., Gender (M,F), Movie Genre (Action, Drama, Comedy etc.)
– Ordinal (Ordered categories)
• e.g., Army ranks, Age groups, Grades
– Numeric: quantitative
• Interval-scaled: e.g., calendar dates, temperature in °C (differences are meaningful, but there is no true zero, so ratios are not)
• Ratio-scaled: e.g., length, counts (distinct, ordered, with a true zero, so +, −, × and ÷ are all meaningful)
Data Objects and Attributes
ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Example of Record Data: each row is a data object and each column is an attribute.
Data Matrices: Bag-of-Words
• Data is an m by n matrix, where there are m rows (objects), one for each
object, and n columns, one for each attribute
• For text data, a common way to represent a document (text message,
email, document) is a vector of the frequencies of the words it contains (0
for words in the corpus that do not occur in that document)

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

Each row is the feature vector of one document.
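The bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy documents, the whitespace tokenization, and the helper name `term_frequency_vector` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus; the documents are illustrative only.
docs = [
    "team play play score game",
    "coach ball score lost",
    "coach game win timeout",
]

# The vocabulary is every distinct word in the corpus, in a fixed (sorted) order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def term_frequency_vector(doc, vocabulary):
    """Return the count of each vocabulary word in doc (0 if the word is absent)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# m-by-n data matrix: one row per document, one column per vocabulary word.
matrix = [term_frequency_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, so documents of different lengths become points in the same n-dimensional space.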
Text Data Matrices
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Data is an m by n matrix, where there are m rows (objects), one for each
object, and n columns, one for each attribute

Basic Statistical Description of Data
Before deeper analysis, it’s important to explore the basic characteristics and
relationships in the data set
• Descriptive Statistics
– To better understand the characteristics of attributes and fields: central
tendency, variation, spread, etc.
– To get a feel for general patterns or relationships among variables: e.g.,
correlation, covariance, etc.
• Data Visualization
– Visual examination of data distributions often helps uncover important patterns and guides further investigation or decision making
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data)
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: $mean - mode = 3 \times (mean - median)$
• Midrange: average of the smallest and largest values, $(min + max) / 2$ (susceptible to outliers)
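The measures of central tendency above can be computed with Python's standard `statistics` module. The sketch below uses the Temperature column of the record-data example; the helper names `weighted_mean` and `trimmed_mean`, and the choice to trim a fixed number `k` of values from each end, are assumptions made for illustration.

```python
import statistics

# Temperature column from the record-data example.
temperatures = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]

mean = statistics.mean(temperatures)
median = statistics.median(temperatures)   # average of the two middle values (even n)
mode = statistics.mode(temperatures)       # most frequent value
midrange = (min(temperatures) + max(temperatures)) / 2

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

print(mean, median, mode, midrange)
print(trimmed_mean(temperatures, 1))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

Comparing the mean with the trimmed mean and the median is a quick way to see how much the extreme values pull the average.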
Symmetric vs. Skewed Data
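• Median, mean and mode of symmetric, positively skewed and negatively skewed data
– Symmetric (unimodal): mean, median and mode coincide
– Positively skewed (long right tail): typically mode < median < mean
– Negatively skewed (long left tail): typically mean < median < mode
[Figure: three density curves: symmetric, positively skewed, negatively skewed]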


Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked;
add whiskers, and plot outliers individually
– Outlier: usually, a value lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR
• Variance and standard deviation
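– Variance, $s^2$ (sample) or $\sigma^2$ (population): (algebraic, scalable computation)
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$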

– Standard deviation is the square root of variance
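The dispersion measures above can be sketched with the standard library. Quartile conventions differ between tools, so the values depend on the interpolation method; the `method="inclusive"` setting and the helper names `five_number_summary` and `iqr_outliers` are assumptions made for this example.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

temperatures = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]
summary = five_number_summary(temperatures)

# Sample variance two ways: library call vs. the one-pass "sum of squares" form.
n = len(temperatures)
sample_var = statistics.variance(temperatures)
shortcut_var = (sum(x * x for x in temperatures) - sum(temperatures) ** 2 / n) / (n - 1)
sample_std = statistics.stdev(temperatures)

print(summary)
print(iqr_outliers(temperatures), iqr_outliers(temperatures + [120]))
print(sample_var, shortcut_var, sample_std)
```

The shortcut form needs only the running sums of $x_i$ and $x_i^2$, which is why the variance can be computed in a single pass over large data.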


Properties of Normal Distribution Curve
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of it
• From μ–3σ to μ+3σ: contains about 99.7% of it
• Depending on the problem, a 3σ cutoff may be too high
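The 68% / 95% / 99.7% figures can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2); the function name `coverage_within` is a hypothetical helper for this sketch.

```python
import math

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k):.4f}")
```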
Outlier Detection
• Univariate case: use IQR or standard deviation
• Multivariate case, use:
– Regression: Errors ≥ some threshold
– Clustering: Points that do not belong to any group
[Figure: regression line y = x + 1; at X1 the observed value Y1 differs from the fitted value Y1’]
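As a rough sketch of the regression approach, a point can be flagged when its prediction error against a fitted line reaches a threshold. The data points, the threshold of 2.0, and the helper name `residual_outliers` are hypothetical; the line y = x + 1 is taken as already fitted.

```python
def residual_outliers(points, predict, threshold):
    """Return the (x, y) points whose absolute prediction error is >= threshold."""
    return [(x, y) for x, y in points if abs(y - predict(x)) >= threshold]

# Hypothetical observations scattered around the line y = x + 1.
points = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 9.0), (5, 6.1)]

flagged = residual_outliers(points, predict=lambda x: x + 1, threshold=2.0)
print(flagged)
```

In practice the threshold is a judgment call, often expressed as a multiple of the standard deviation of the residuals.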
Graphic Display of Basic Stat. Description
• Boxplot: graphic display of five-number summary
• Histogram: x-axis → values, y-axis → frequencies
• Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are ≤ $x_i$
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
– Giving you an at-a-glance idea whether two variables may come from a similar distribution
• Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Boxplot
• Five-number summary of a distribution
“Minimum”, Q1, Median, Q3, “Maximum”
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the
box
– Whiskers: two lines outside the box extended to Minimum and Maximum (within [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR])
– Outliers: points beyond a specified outlier
threshold, plotted individually
Histograms
• Histogram: displays the frequency distribution of continuous data. It indicates the number of observations that lie in-between a range of values (bin)
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis values from 10000 to 90000, y-axis frequencies from 0 to 40]
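The binning behind a histogram can be sketched without a plotting library. The function below counts values in equal-width, adjacent, non-overlapping bins; the bin range (55 to 85), the number of bins, and the name `histogram_counts` are assumptions chosen for the Temperature column of the record-data example.

```python
def histogram_counts(values, low, high, num_bins):
    """Count values in num_bins equal-width adjacent bins covering [low, high]."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if low <= v <= high:
            # The last bin is closed on the right so that v == high is counted.
            index = min(int((v - low) / width), num_bins - 1)
            counts[index] += 1
    return counts

temperatures = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]

# Bins: [55, 65), [65, 75), [75, 85]
counts = histogram_counts(temperatures, low=55, high=85, num_bins=3)
print(counts)
```

The bar heights of the histogram are exactly these counts; changing the number of bins can noticeably change the apparent shape of the distribution.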
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
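The slide does not say how $f_i$ is computed. One common convention, assumed here, is $f_i = (i - 0.5) / n$ for the i-th smallest of n values; the function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"{f:.3f} -> {x}")
```

Plotting $f_i$ on the x-axis against $x_i$ on the y-axis gives the quantile plot.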
Scatter Plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Other Useful Stat. Plots for Projects
• Violin plot (to see what boxplots may be hiding)
• QQ plots

https://towardsdatascience.com/violin-plots-explained-fb1d115e023d
Correlations (Categorical)
• Chi-square ($\chi^2$) test
  $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
• Chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table
• The larger the $\chi^2$ value, the more likely the variables are related
• The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to the third variable: population
Correlations (Categorical Example)
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

• $\chi^2$ (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

  $\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.94$

• It shows that like_science_fiction and play_chess are correlated in the group
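The calculation above can be reproduced from the observed counts alone, since each expected count is (row total × column total) / grand total. This is a minimal sketch; the function name `chi_square` is an assumption, and no continuity correction or p-value is computed.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of the two attributes.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2 = chi_square(observed)
print(round(chi2, 2))
```

The result is roughly 507.9, matching the hand calculation; a value this large for a 2 × 2 table indicates the two attributes are strongly related in this group.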
Correlations (Numeric)
• Correlation coefficient (also called Pearson’s product moment coefficient)

  $r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n - 1) \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{(n - 1) \sigma_A \sigma_B}$

• where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation (this alone does not guarantee independence); $r_{A,B} < 0$: negatively correlated
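The first form of the formula translates directly into code. The sketch below applies it to the Temperature and Humidity columns of the record-data example, using the sample standard deviation to match the (n − 1) factor; the function name `pearson_r` is an assumption.

```python
import statistics

def pearson_r(a, b):
    """Pearson correlation: sum of centered cross-products over (n - 1) * s_A * s_B."""
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * statistics.stdev(a) * statistics.stdev(b))

temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

r = pearson_r(temperature, humidity)
print(f"r(temperature, humidity) = {r:.3f}")
```

The result always lies between −1 and 1; perfectly linear data such as `[1, 2, 3]` against `[2, 4, 6]` gives 1.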
Visualizing Correlations

[Figure: visualizing correlation with a colormap or heatmap]
[Figure: scatter plots showing the correlations from –1 to 1]
Correlation viewed as a linear relationship
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot product

  $a'_k = (a_k - mean(A)) / std(A)$

  $b'_k = (b_k - mean(B)) / std(B)$

  $correlation(A, B) = A' \cdot B'$

• With the sample standard deviation, the dot product is divided by n − 1 so that the result lies in [−1, 1]
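The standardize-then-dot-product view can be sketched as follows. One detail is an assumption of this sketch rather than something stated on the slide: with the sample standard deviation, the dot product has to be divided by n − 1 to give the usual correlation coefficient. The helper names `standardize` and `correlation` are illustrative.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    a_std, b_std = standardize(a), standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```

Because both vectors are centered and rescaled first, the result does not depend on the units or offsets of A and B, only on how linearly they move together.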
Visualizing Patterns Using Aggregation
Visualizing cross-tabulation (contingency table)

                    Windy  Not Windy
Outlook = sunny     2      3
Outlook = rain      2      3
Outlook = overcast  2      2

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

[Figure: grouped bar chart of counts (0 to 3) for Windy vs. Not Windy, with one bar each for Outlook = sunny, rain, overcast]
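The cross-tabulation above is a count of (Outlook, Windy) pairs over the 14 records. A minimal sketch with `collections.Counter` follows; only the two relevant columns are kept, and the variable names are assumptions.

```python
from collections import Counter

# (Outlook, Windy) for records 1..14 of the weather table.
records = [
    ("sunny", False), ("sunny", True), ("overcast", False), ("rain", False),
    ("rain", False), ("rain", True), ("overcast", True), ("sunny", False),
    ("sunny", False), ("rain", False), ("sunny", True), ("overcast", True),
    ("overcast", False), ("rain", True),
]

# Each distinct (outlook, windy) pair becomes one cell of the contingency table.
crosstab = Counter(records)

print(f"{'':20}{'Windy':>7}{'Not Windy':>11}")
for outlook in ("sunny", "rain", "overcast"):
    label = f"Outlook = {outlook}"
    print(f"{label:20}{crosstab[(outlook, True)]:>7}{crosstab[(outlook, False)]:>11}")
```

The same counts are what a grouped bar chart of the table would display.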
Other Types of Visualization
• Understanding Properties of Text
– Zipf distribution
– TF x IDF
– Tag/Word Clouds

• Graph Visualization
