Week2_UnderstandingData

The document provides an overview of understanding data in machine learning, including types of datasets, data objects, attributes, and basic statistical descriptions. It emphasizes the importance of exploring data characteristics, relationships, and visualizations to inform further analysis. Additionally, it covers correlation analysis and various statistical plots to aid in data interpretation.

Uploaded by

bommenasaikiran07

DSC478: Programming Machine Learning Applications
Roselyne Tchoua
rtchoua@depaul.edu
School of Computing, CDM, DePaul University
Understanding Your Data

Before doing any machine learning, you should have an idea of what your data looks like
– What type of data is it?
– What are the features? What types are they? Do they need to be converted into different types? e.g., categorical to numeric
– Are some features correlated?
– Is the data skewed?
– Are there outliers?
– If you have class labels, is it imbalanced? e.g., a few transactions labeled as fraud in a sea of “normal” transactions
– Part of visualization is about whether you can easily digest and convey the characteristics of your data
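As a quick illustration of the class-imbalance check, the sketch below counts labels and prints each class's share. The label list is made up for the example and is not course data.

```python
from collections import Counter

# Hypothetical labels: a few "fraud" cases in a sea of "normal" ones.
labels = ["normal"] * 97 + ["fraud"] * 3

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, most frequent class first.
shares = {label: count / total for label, count in counts.most_common()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} rows ({share:.1%})")
```

A very small share for one class is a hint that plain accuracy will be a misleading measure later on.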
Types of Dataset
• Record Data
– Relational records
– Data matrix, similar to records but all numeric data
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial and multimedia:
– Spatial data: maps
– Image + Video + Text data…
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: object → customers, store items, sales
– medical database: object → patients, treatments
– university database: object → students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples, vectors.
• Data objects are described by attributes.
• Database rows → data objects; columns → attributes.
Attributes
• Attribute (or dimensions, features, variables): a data field
representing a characteristic or property of a data object
– E.g., customer_ID, name, address, income, GPA, …
• Types:
– Nominal (Categorical)
• e.g., Gender (M,F), Movie Genre (Action, Drama, Comedy etc.)
– Ordinal (Ordered categories)
• e.g., Army ranks, Age groups, Grades
– Numeric: quantitative
• Interval-scaled: e.g., dates (can add/subtract but cannot multiply)
• Ratio-scaled: e.g., length, counts (distinct, ordered and can +, - , * and /)
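The attribute type decides how a feature can be turned into numbers. The sketch below is one possible encoding, not a prescribed one: a nominal attribute is one-hot encoded (no order is implied), while an ordinal attribute is mapped to integer ranks. The grade ordering and the sample values are assumptions made for the example.

```python
# Nominal attribute: one-hot encode, since the categories have no order.
outlook = ["sunny", "overcast", "rain", "sunny"]
categories = sorted(set(outlook))  # column order of the encoding
one_hot = [[int(value == c) for c in categories] for value in outlook]

# Ordinal attribute: map to integer ranks that preserve the order.
grade_rank = {"C": 0, "B": 1, "A": 2}  # hypothetical ordering
grades = ["B", "A", "C"]
encoded_grades = [grade_rank[g] for g in grades]

print(categories)
print(one_hot)
print(encoded_grades)
```

Using integer ranks for a nominal attribute would invent an order that is not in the data, which is why the two cases are handled differently.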
Data Objects and Attributes
Rows are objects; columns are attributes.

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Example of Record Data

Data Matrices: Bag-of-Words
• Data is an m by n matrix, where there are m rows (objects), one for each
object, and n columns, one for each attribute
• For text data, a common way to represent a document (text message,
email, document) is a vector of the frequencies of the words it contains (0
for words in the corpus that do not occur in that document)

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

Each row is the feature vector of one document.
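A minimal sketch of how such a term-frequency matrix can be built. The three short documents and the whitespace tokenizer are assumptions for illustration; real text would need proper tokenization.

```python
from collections import Counter

# Hypothetical toy corpus; tokenization is a plain lowercase split.
docs = [
    "team play play score",
    "coach ball score coach",
    "game win timeout game",
]
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: sorted set of all words in the corpus (the n columns).
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document (the m rows); 0 for words the document lacks.
matrix = []
for tokens in tokenized:
    freq = Counter(tokens)
    matrix.append([freq[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, so the documents become points in the same n-dimensional space.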
Text Data Matrices
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Data is an m by n matrix, where there are m rows (objects), one for each
object, and n columns, one for each attribute

Basic Statistical Description of Data
Before deeper analysis, it’s important to explore the basic characteristics and
relationships in the data set
• Descriptive Statistics
– To better understand the characteristics of attributes and fields: central
tendency, variation, spread, etc.
– To get a feel for general patterns or relationships among variables: e.g.,
correlation, covariance, etc.
• Data Visualization
– Visual examination of data distributions often helps in uncovering
important patterns and guides further investigation or decision making
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
– Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data)
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: mean − mode = 3 × (mean − median)
• Midrange: average of the smallest and largest values (susceptible to outliers)
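The measures above can be tried on the Temperature column of the record-data example. This is a sketch using Python's `statistics` module; the helper names `trimmed_mean` and `weighted_mean` and the weights in the last line are made up for illustration.

```python
import statistics

# Temperature column from the record-data example.
temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]

mean = statistics.mean(temperature)
median = statistics.median(temperature)   # average of the two middle values
mode = statistics.mode(temperature)       # most frequent value
midrange = (min(temperature) + max(temperature)) / 2


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


print(mean, median, mode, midrange)
print(trimmed_mean(temperature, 1))
print(weighted_mean([80, 90], [1, 3]))  # hypothetical weights
```

Comparing the mean with the trimmed mean and the median gives a first hint about how much extreme values pull the average.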
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked;
add whiskers, and plot outliers individually
– Outlier: usually, a value lower than Q1 − 1.5 × IQR or higher
than Q3 + 1.5 × IQR
• Variance and standard deviation
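– Variance: sample variance $s^2$ and population variance $\sigma^2$ (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$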

– Standard deviation is the square root of variance
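The dispersion measures can be sketched with the standard library as follows. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3. The helper name `iqr_outliers` is made up for the example.

```python
import statistics

# Temperature column from the record-data example.
temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]
n = len(temperature)

# Five-number summary and IQR (assumed quartile method: "inclusive").
q1, q2, q3 = statistics.quantiles(temperature, n=4, method="inclusive")
iqr = q3 - q1
five_number = (min(temperature), q1, q2, q3, max(temperature))


def iqr_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    lo_q, _, hi_q = statistics.quantiles(values, n=4, method="inclusive")
    spread = hi_q - lo_q
    low, high = lo_q - 1.5 * spread, hi_q + 1.5 * spread
    return [v for v in values if v < low or v > high]


# Sample variance (divide by n - 1) and population variance (divide by N).
s2 = statistics.variance(temperature)
sigma2 = statistics.pvariance(temperature)

# The same values from the one-pass "scalable" forms of the formulas.
sum_x = sum(temperature)
sum_x2 = sum(x * x for x in temperature)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sigma2_shortcut = sum_x2 / n - (sum_x / n) ** 2

print(five_number, iqr)
print(iqr_outliers(temperature + [120]))  # 120 is a made-up extreme value
print(s2, sigma2)
```

The shortcut forms need only the running sums of x and x², which is why the computation is called scalable.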

Properties of Normal Distribution Curve
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of it
• From μ–3σ to μ+3σ: contains about 99.7% of it
• Depending on the problem, 3σ may be too high a threshold
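The 68%, 95% and 99.7% figures can be checked numerically. This sketch uses `statistics.NormalDist` (Python 3.8 or later); the function name `coverage` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability mass of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

Because the result depends only on k, the same percentages hold for any mean μ and standard deviation σ.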
Outlier Detection
• Univariate case: use IQR or standard deviation
• Multivariate case, use:
– Regression: Errors ≥ some threshold
– Clustering: Points that do not belong to any group

[Figure: regression line y = x + 1, with a point at X1 and its fitted value Y1’]
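A minimal sketch of the regression-based idea: fit a least-squares line and flag points whose absolute error reaches a threshold. The data points, the threshold of 2.0 and the helper names are assumptions for illustration; in practice the threshold depends on the problem.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_outliers(xs, ys, threshold):
    """Points whose absolute residual is at least the threshold."""
    slope, intercept = fit_line(xs, ys)
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if abs(y - (slope * x + intercept)) >= threshold
    ]


# Hypothetical data lying on y = x + 1, except for the point at x = 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 4, 10, 6, 7, 8, 9]

outliers = residual_outliers(xs, ys, threshold=2.0)
print(outliers)
```

Note that the outlier itself pulls the fitted line toward it, so a robust fit may be preferable when extreme points are common.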
Graphic Display of Basic Stat. Description
• Boxplot: graphic display of five-number summary
• Histogram: x-axis → values, y-axis → frequencies
• Quantile plot: each value xi is paired with fi,
indicating that approximately 100·fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
– Giving you an at-a-glance idea whether two vars may come
from similar distribution
• Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Boxplot
• Five-number summary of a distribution
“Minimum”, Q1, Median, Q3, “Maximum”
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the
box
– Whiskers: two lines outside the box extended
to Minimum and Maximum (within [Q1 − 1.5 × IQR,
Q3 + 1.5 × IQR])
– Outliers: points beyond a specified outlier
threshold, plotted individually
Histograms
• Histogram: displays the frequency distribution of continuous data. It indicates the number of observations that lie in-between a range of values (bin)
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram with x-axis values from 10000 to 90000 and y-axis frequencies from 0 to 40]
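The binning behind a histogram can be sketched without a plotting library. The example below counts the Temperature values from the record-data example in equal-width, adjacent, non-overlapping bins; the range 55 to 90 and the bin count of 7 are arbitrary choices for illustration.

```python
def histogram(values, low, high, bins):
    """Counts per equal-width bin on [low, high]; the last bin includes high."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]
counts = histogram(temperature, low=55, high=90, bins=7)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    start = 55 + 5 * i
    print(f"{start}-{start + 5}: {'#' * count}")
```

Changing the number of bins can change the apparent shape of the distribution, so it is worth trying more than one value.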
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the value xi
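The (fi, xi) pairs of a quantile plot can be computed as below. The slides do not give a formula for fi; this sketch assumes the common convention fi = (i − 0.5) / n for the i-th smallest value.

```python
# Temperature column from the record-data example.
temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]

ordered = sorted(temperature)
n = len(ordered)

# Assumed convention: f_i = (i - 0.5) / n for i = 1..n.
quantile_pairs = [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_pairs:
    print(f"{f:.3f}  {x}")
```

Plotting f on the x-axis against the value on the y-axis gives the quantile plot; every observation appears as one point.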
Scatter Plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Other Useful Stat. Plots for Projects
• Violin plot (to see what boxplots may be hiding)
• QQ plots

https://towardsdatascience.com/violin-plots-explained-fb1d115e023d
Correlations (Categorical)
• Chi-square ($\chi^2$) test
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
• Chi-squared test is used to determine whether there is a statistically
significant difference between the expected frequencies and the
observed frequencies in one or more categories of a contingency
table
• The larger the $\chi^2$ value, the more likely the variables are related
• The cells that contribute the most to the $\chi^2$ value are those whose
actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to the third variable: population
Correlations (Categorical Example)
                          Play chess  Not play chess  Sum (row)
Like science fiction      250 (90)    200 (360)       450
Not like science fiction  50 (210)    1000 (840)      1050
Sum (col.)                300         1200            1500

• $\chi^2$ (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
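The calculation can be reproduced in a few lines. The sketch derives each expected count as (row total × column total) / grand total, which is the usual way to obtain them, and then sums the per-cell contributions; variable names are made up for the example.

```python
# Observed counts: rows = like / not like science fiction,
# columns = play chess / not play chess.
observed = [[250, 200], [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi2)
```

The expected counts match the values in parentheses in the table, and the statistic agrees with 507.93 up to rounding.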
Correlations (Numeric)
• Correlation coefficient (also called Pearson’s product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

• where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation (which does not by itself imply independence); $r_{A,B} < 0$: negatively correlated
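Both forms of the coefficient can be checked against each other. The sketch below uses the Temperature and Humidity columns of the record-data example and the sample standard deviation (`statistics.stdev`), which matches the (n − 1) in the denominator; the function names are made up for the example.

```python
import statistics

# Temperature and Humidity columns from the record-data example.
temperature = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]


def pearson_centered(a, b):
    """Sum of products of deviations over (n - 1) * s_A * s_B."""
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * statistics.stdev(a) * statistics.stdev(b))


def pearson_shortcut(a, b):
    """Sum of a_i * b_i minus n * mean_A * mean_B, same denominator."""
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / ((n - 1) * statistics.stdev(a) * statistics.stdev(b))


r = pearson_centered(temperature, humidity)
print(r, pearson_shortcut(temperature, humidity))
```

The two functions return the same value up to floating-point error, and the result always lies between −1 and 1.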
Visualizing Correlations

Visualizing correlation with colormap or heatmap

Scatter plots showing the correlations from –1 to 1.

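Correlation viewed as a linear relationship
• Correlation measures the linear relationship between
objects
• To compute correlation, we standardize data objects,
A and B, and then take their dot product

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)} \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = A' \cdot B'$$

• When std is the sample standard deviation, the dot product is divided by n − 1 so that the result lies in [−1, 1]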
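A short sketch of the standardize-then-dot-product view. It assumes the sample standard deviation, so the dot product is divided by n − 1 to land in [−1, 1]; the function names and the sample vector are made up for the example.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    a_std, b_std = standardize(a), standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)


a = [1.0, 2.0, 4.0, 7.0]  # hypothetical data object
print(correlation(a, a))                    # perfectly correlated with itself
print(correlation(a, [-x for x in a]))      # perfectly negatively correlated
```

Standardizing removes differences in location and scale, so only the shared linear pattern between the two objects is left in the dot product.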
Visualizing Patterns Using Aggregation
Visualizing cross-tabulation (contingency table)
Outlook   Windy  Not Windy
sunny     2      3
rain      2      3
overcast  2      2

[Figure: grouped bar chart of the counts above, with groups Windy and Not Windy, one bar per Outlook value (sunny, rain, overcast), and a y-axis from 0 to 3]

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE
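The cross-tabulation can be derived from the records by counting (Outlook, Windy) pairs. This is a sketch using only `collections.Counter`; the tuple layout is a choice made for the example.

```python
from collections import Counter

# (Outlook, Windy) for the 14 records, in ID order.
records = [
    ("sunny", False), ("sunny", True), ("overcast", False), ("rain", False),
    ("rain", False), ("rain", True), ("overcast", True), ("sunny", False),
    ("sunny", False), ("rain", False), ("sunny", True), ("overcast", True),
    ("overcast", False), ("rain", True),
]

# Count how often each (Outlook, Windy) combination occurs.
counts = Counter(records)

print(f"{'Outlook':<10}{'Windy':<7}{'Not Windy'}")
for outlook in ("sunny", "rain", "overcast"):
    windy = counts[(outlook, True)]
    not_windy = counts[(outlook, False)]
    print(f"{outlook:<10}{windy:<7}{not_windy}")
```

The printed counts are the bar heights of the grouped bar chart.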
Other Types of Visualization
• Understanding Properties of Text
– Zipf distribution
– TF x IDF
– Tag/Word Clouds

• Graph Visualization
