
Data Visualization

This document discusses univariate descriptive statistics and examples of various graphs and measures used to summarize quantitative data, including histograms, boxplots, dot plots, density plots, pie charts, bar graphs, measures of center such as mean, median and mode, and measures of spread such as standard deviation, interquartile range, and mean absolute deviation. Examples are provided to illustrate sea urchin size data, weather categories, personal income distributions, and comparing income distributions from different years.

Uploaded by

Chakradhar Nakka

Univariate Descriptive Statistics

Displays: pie charts, bar graphs, box plots, histograms, density estimates, dot plots, stem-and-leaf plots, tables, lists.

Example: sea urchin sizes


[Figure: four displays of the sea urchin sizes. Boxplot: Urchin Size (mm). Histogram: Number of Urchins against Urchin Size (mm). Dot Plot: Urchin Size (mm). Density: Density against Urchin Size (mm), with the x-axis running from −20 to 80.]
Points:

1) Useful for quantitative variables.

2) Boxplot shows five point summary: minimum, first quartile, median, third quartile, maximum.

3) Dot Plot illegible with 250 data points. (1 dot for each size plotted on line.)

4) Histogram, density plot serve similar purposes.

5) Density estimate extends below 0 mm: bad, since sizes cannot be negative.

6) Histogram doesn’t show the clustering that the density plot shows.

Example: Categorical: Weather in Central Park

[Figure: Pie Chart and Bar Graph of the weather categories clear, partly.cloudy, cloudy; bar graph count axis runs from 0 to 10.]

Pie chart harder to read.

General summary: Pie Charts are bad.

More useful with more categories.

Ordering of categories important for nominal variables.

Cloudiness is ordinal.
Pie charts: wedge has area proportional to # of individuals in category.

Bar chart: bar has height equal to # of individuals in category.

Density estimates not discussed in this course.

Histogram:

1) divide range of values into intervals.

2) Count numbers of individuals in each interval.

3) bar AREA is proportional to # of individuals in interval; width is length of interval.

4) equal width bars best; then height proportional to # of individuals.

5) label x-axis; include units.

6) label y-axis.

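The histogram steps above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_heights`, the income values and the interval edges are all made up, and the heights are scaled so that each bar's area equals the fraction of individuals in its interval (one common choice of the "proportional to" constant).

```python
from bisect import bisect_right

def histogram_heights(values, edges):
    """Count values in each interval [edges[k], edges[k+1]) and return
    (counts, heights), where width * height = fraction in the interval."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        k = bisect_right(edges, v) - 1      # index of the interval containing v
        if 0 <= k < len(counts):            # values outside the edges are dropped
            counts[k] += 1
    n = sum(counts)
    heights = [c / (n * (hi - lo))
               for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights

# Hypothetical incomes in $000s, with unequal interval widths.
incomes = [3, 7, 8, 12, 14, 18, 22, 27, 35, 60]
edges = [0, 10, 20, 40, 100]

counts, heights = histogram_heights(incomes, edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]

for lo, hi, c, h, a in zip(edges, edges[1:], counts, heights, areas):
    print(f"[{lo:>3}, {hi:>3}): count={c}  height={h:.4f}  area={a:.2f}")
```

With unequal widths, two intervals holding the same number of individuals get different heights; only the areas are comparable.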
Example: Personal Income for BC (ages 15+). (For those with income.) Source: 2001 Census.

[Figure: histogram titled "Adult Personal Income (BC)"; x-axis Income ($000s), 0 to 100; vertical axis from 0.00 to 0.03.]

Points

1) Bar widths unequal; census tables given that way.

2) So take width times height to get area = fraction of population in that income group.

3) Last group on right open ended; artificially cut off at $100,000 by me.

4) Plot is “long-tailed to the right” or “skewed to the right”.

5) Based on 20% sample of 1,523,720 people aged 15+ in BC on census day, 2001.

6) Income is for previous year: 2000.

Comparison of 1995, 2005.
[Figure: two density histograms, panels titled "1996 Income" and "2001 Income"; x-axis 0 to 100; y-axis Density.]

Comparison of 2000, 2005.

[Figure: "BC Individual Income 2000 and 2005"; overlaid densities with legend entries 2005 and 2000; x-axis labelled "BC Individual Income 2000", 0 to 100; y-axis Density, 0.000 to 0.030.]

Summarizing the pictures.

Purposes: less space in text than a graph; precise numerical comparison between groups.

Summarizing a histogram:

Where is centre of the x-axis values? Jargon: location or centre.

How far do the x values extend on either side? Jargon: spread, variation, width.

Is the picture symmetric or does it extend farther to right than left?

Location and number of bumps.

Measures of location:

Mean, Arithmetic Mean, Average, Arithmetic Average: total of x-values divided by number of x values.

Histogram balances at mean. (First Moment in physics.) Think of See-Saw: small kid far from centre balances big kid close to centre.

Formula: data $X_1, \ldots, X_n$.

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Utility of summation notation in this course: NIL. But $\bar{X}$ is standard notation for average of X.

Median: number such that 1/2 of X values at least that large, and 1/2 of X values at least that small.

Sort list: if n is odd median is middle of sorted list. If n is even take average of two middle values.

Numerical examples: ages in my family:

50, 50, 20, 15, 8, 8.


$$\bar{A} = \frac{50 + 50 + 20 + 15 + 8 + 8}{6} = \frac{151}{6} \approx 25.2$$

Median age: middle numbers are 15, 20.

Halfway between is median = 17.5.
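
As a check on the arithmetic, the mean and median of the family ages can be computed directly. This is a minimal sketch in Python; the variable names are illustrative, and the median follows the sort-and-take-the-middle rule stated earlier, cross-checked against the standard library's `statistics.median`.

```python
import statistics

ages = [50, 50, 20, 15, 8, 8]

# Mean: total of the values divided by the number of values.
mean_age = sum(ages) / len(ages)

# Median: sort, then take the middle value (n odd)
# or the average of the two middle values (n even).
sorted_ages = sorted(ages)
n = len(sorted_ages)
if n % 2 == 1:
    median_age = sorted_ages[n // 2]
else:
    median_age = (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2

print(f"mean   = {mean_age:.1f}")
print(f"median = {median_age}")
print("matches statistics.median:", median_age == statistics.median(ages))
```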

Mode: most common value. Not useful concept in most cases. Location of tallest bar in histogram (affected by definition of classes).

Mode of ages is not unique: 50 or 8. Not useful summary of centre.

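The non-uniqueness of the mode is easy to see in code. The sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later), which returns every value that attains the highest count.

```python
import statistics

ages = [50, 50, 20, 15, 8, 8]

# multimode returns all values tied for the highest count.
modes = statistics.multimode(ages)
print("modes:", modes)
print("unique mode?", len(modes) == 1)
```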
Comparison:

Advantages of mean:

1) if your average weekly income is $100 you know how you will do in the long run; not so if median weekly income is $100.

2) Same point: average and sample size tells you total.

3) Has simpler mathematical behaviour than median.

Advantages of median:

Not influenced by extreme members of list.

Median income, for instance, gives more information about typical person.

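A small illustration of both sides of the comparison, using made-up weekly incomes (the numbers are hypothetical, chosen only to make the arithmetic easy): the mean times the sample size recovers the total, while one extreme value moves the mean a long way and leaves the median alone.

```python
import statistics

# Hypothetical weekly incomes in dollars.
incomes = [80, 90, 100, 110, 120]
with_extreme = [80, 90, 100, 110, 1120]   # last value replaced by an extreme one

mean_plain = statistics.mean(incomes)
median_plain = statistics.median(incomes)
mean_extreme = statistics.mean(with_extreme)
median_extreme = statistics.median(with_extreme)

# Average and sample size tell you the total.
total_from_mean = mean_plain * len(incomes)

print("mean, median (plain):  ", mean_plain, median_plain)
print("mean, median (extreme):", mean_extreme, median_extreme)
print("total recovered from mean:", total_from_mean, "actual:", sum(incomes))
```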
Measures of spread:

Standard Deviation

Interquartile Range

Mean Absolute Deviation.

Deviations from the mean: subtract mean from each number in list: $X_i - \bar{X}$. For my family deviations are

24.8, 24.8, −5.2, −10.2, −17.2, −17.2.

Summarize size of deviations:

Average is 0. Not useful as measure of size since pluses cancel minuses.

Mean absolute deviation: take absolute values (ignore − signs) and average

$$\frac{24.8 + 24.8 + 5.2 + 10.2 + 17.2 + 17.2}{6} = 16.6 \text{ years}$$

Standard deviation: square deviations, average, take square root:

$$s = \sqrt{\frac{(24.8)^2 + \cdots + (-17.2)^2}{5}} = 19.8 \text{ years}.$$

WARNING: notice the 5 not 6. This is Traditional. Not important in large data sets.

Jargon: variance is $s^2$:

$$s^2 = \frac{(24.8)^2 + \cdots + (-17.2)^2}{5} = 390.6 \text{ years}^2$$

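The three calculations above can be reproduced as follows. This is a sketch with illustrative variable names; it uses the unrounded mean 151/6 rather than 25.2, so intermediate values differ slightly from the rounded deviations, but the results agree to one decimal place. `statistics.stdev` divides by n − 1 and `statistics.pstdev` divides by n.

```python
import math
import statistics

ages = [50, 50, 20, 15, 8, 8]
n = len(ages)
mean_age = sum(ages) / n

# Deviations from the mean; their sum is 0 up to rounding error.
deviations = [x - mean_age for x in ages]

# Mean absolute deviation: average of the absolute deviations (divide by n).
mad = sum(abs(d) for d in deviations) / n

# Variance and standard deviation with the traditional n - 1 divisor.
variance = sum(d * d for d in deviations) / (n - 1)
sd = math.sqrt(variance)

# For comparison only: the same calculation with divisor n.
sd_divide_by_n = statistics.pstdev(ages)

print(f"MAD      = {mad:.1f} years")
print(f"variance = {variance:.1f} years^2")
print(f"SD       = {sd:.1f} years")
print(f"SD with divisor n = {sd_divide_by_n:.1f} years")
```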
Interquartile Range:

First define quartiles, quintiles, etc.

First, second and third quartiles split list into 4 equal pieces.

One quarter of list below first quartile, two quarters below second, three quarters below third.

Second quartile is median.

Interquartile range is third quartile minus first quartile.

Book gives method to find quartiles.

Quintiles split list into 5 equal parts.

Percentiles split list into 100 equal parts.

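A hedged sketch of the quartile and IQR calculation for the family ages. The book's method for finding quartiles is not reproduced here; the code uses the default convention of Python's `statistics.quantiles`, which is one of several in common use and may give slightly different first and third quartiles than the book's rule. The second quartile agrees with the median under any of them.

```python
import statistics

ages = [50, 50, 20, 15, 8, 8]

# n=4 asks for the three cut points that split the list into 4 pieces.
# The interpolation rule is the library's default, not necessarily the book's.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

print("quartiles:", q1, q2, q3)
print("IQR:", iqr)

# Quintiles (n=5) and percentiles (n=100) use the same call.
quintiles = statistics.quantiles(ages, n=5)
print("quintiles:", quintiles)
```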
Comparison:

Advantages of IQR: like median not influenced by extremes.

Easily related to proportions of population.

But: rather than use 2 number summary (median, IQR) typically use 3 number summary (quartiles) or 5 number summary (min, max, quartiles).

Boxplot is graph of 5 number summary.

Advantages of Mean Absolute Deviation:

Seems intuitive.

Less influenced by extremes than Standard Deviation.

But: poor mathematical properties.

We mostly use Standard Deviation.

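The sensitivity to extremes can be illustrated on a made-up list: the integers 1 to 20, and the same list with the largest value replaced by 200. The helper names and the data are hypothetical, and the quartiles again use the default convention of `statistics.quantiles`, which is an assumption rather than the book's method.

```python
import statistics

def mean_abs_dev(xs):
    """Average absolute deviation from the mean (divisor n)."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def iqr(xs):
    """Third quartile minus first quartile (library default convention)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

clean = list(range(1, 21))          # 1, 2, ..., 20
extreme = clean[:-1] + [200]        # largest value replaced by 200

# Ratio of each spread measure after vs. before adding the extreme value.
sd_ratio = statistics.stdev(extreme) / statistics.stdev(clean)
mad_ratio = mean_abs_dev(extreme) / mean_abs_dev(clean)
iqr_ratio = iqr(extreme) / iqr(clean)

print(f"SD grows by a factor of  {sd_ratio:.2f}")
print(f"MAD grows by a factor of {mad_ratio:.2f}")
print(f"IQR grows by a factor of {iqr_ratio:.2f}")
```

In this example the IQR is unchanged, the mean absolute deviation grows, and the standard deviation grows the most, matching the ordering described in the comparison.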
Why the Standard Deviation?

Usual explanation: squares nicer mathematically than absolute values.

Real explanation (WARNING: personal view): ONLY the SD works in normal approximations for sums.

Normal approximations? A common summary for curves.

Rule of thumb: in many lists of data about 2/3 of the observations are within 1 SD of the mean, about 95% within 2 SDs of the mean and almost all within 3 SDs of the mean.
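
The rule of thumb can be tried out on simulated data. The sketch below is only an illustration under stated assumptions: it draws 10,000 values from a normal distribution with an arbitrary mean of 100 and SD of 15 (the seed, sample size and parameters are all made up), then counts the fraction of values within 1, 2 and 3 SDs of the sample mean. Real data lists that are far from bell-shaped need not follow the rule this closely.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated data: arbitrary mean and SD, chosen only for illustration.
data = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)

def fraction_within(k):
    """Fraction of observations within k SDs of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

fractions = {k: fraction_within(k) for k in (1, 2, 3)}

for k, frac in fractions.items():
    print(f"within {k} SD: {frac:.3f}")
```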

NEXT TOPIC: the normal curve. (bell curve, Gaussian)
