Numerical Descriptive Techniques (6 Hours)

Chapter 3
Numerical Descriptive
Techniques (6 hours)
Learning Objectives
In this chapter you learn:
 1. Measures of centre and location
 2. Measures of dispersion and variation
 3. Measures of correlation
Definitions
 The central tendency is the extent to which the
values of a numerical variable group around a typical
or central value.
 The variation is the amount of dispersion or

scattering away from a central value that the values
of a numerical variable show.
 The shape is the pattern of the distribution of values

from the lowest value to the highest value.
Central tendency
The central tendency of the set of
measurements–that is, the tendency of the data to
cluster, or center, about certain numerical values.
Central Tendency
(Location)
Variation
The variability of the set of measurements–that is,
the spread of the data.
Variation
(Dispersion)
Measures of Central Tendency:
The Mean
 The arithmetic mean (often just called the “mean”)

is the most common measure of central tendency
 For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where $\bar{X}$ (pronounced x-bar) is the sample mean, $X_i$ is the ith observed value, and n is the sample size
The Mean (con’t)
 The most common measure of central tendency

 Mean = sum of values divided by the number of values
 Affected by extreme values (outliers)
Data 11, 12, 13, 14, 15: Mean = (11 + 12 + 13 + 14 + 15)/5 = 65/5 = 13
Data 11, 12, 13, 14, 20: Mean = (11 + 12 + 13 + 14 + 20)/5 = 70/5 = 14
Numerical Descriptive
Measures for a Population
 Descriptive statistics discussed previously described a

sample, not the population.
 Summary measures describing a population, called

parameters, are denoted with Greek letters.
 Important population parameters are the population mean,

variance, and standard deviation.
Numerical Descriptive Measures
for a Population: The mean µ
 The population mean is the sum of the values in

the population divided by the population size, N
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Where μ = population mean
N = population size
Xi = ith value of the variable X
Example
An investment of $100,000 declined to $50,000 at the

end of year one and rebounded to $100,000 at end
of year two:
X1 = $100,000, X2 = $50,000, X3 = $100,000
Year one: 50% decrease. Year two: 100% increase.
The overall two-year return is zero, since it started and

ended at the same level.
Arithmetic Mean
 The arithmetic mean (mean) is the most
common measure of central tendency
 For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and $X_1, \ldots, X_n$ are the observed values

Geometric Mean
 Geometric mean
 Used to measure the rate of change of a variable
over time
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
 Geometric mean rate of return

 Measures the status of an investment over time
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$
 Where Ri is the rate of return in time period i
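The investment example (a 50% decrease followed by a 100% increase) shows why the geometric mean rate of return is used here. The following is a minimal sketch, assuming returns are supplied as decimals; the function name `geometric_mean_return` is illustrative and not part of the original slides.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return for per-period returns given as decimals."""
    # Multiply the growth factors (1 + R_i), take the n-th root, subtract 1.
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (1.0 / len(returns)) - 1.0

# Returns from the investment example: -50% in year one, +100% in year two.
returns = [-0.50, 1.00]
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"Arithmetic mean return: {arithmetic:.2%}")  # 25.00%
print(f"Geometric mean return:  {geometric:.2%}")   # 0.00%
```

The arithmetic mean suggests a 25% average return per year, while the geometric mean agrees with the overall two-year return of zero.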
The Median
 In an ordered array, the median is the “middle”

number (50% above, 50% below)
[Figure: two dot plots on an axis from 11 to 20. Median = 13 for both data sets.]
 Less sensitive than the mean to extreme values

Locating the Median
 The location of the median when the values are in numerical order (smallest to largest):
Median position = (n + 1)/2 position in the ordered data
 If the number of values is odd, the median is the middle number
 If the number of values is even, the median is the average of the two middle numbers
Note that (n + 1)/2 is not the value of the median, only the position of
the median in the ranked data
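The position rule can be sketched in a few lines. This is an illustrative helper (the name `median_by_position` is an assumption, and the input is assumed to be non-empty). The first two data sets are the ones used in the mean example; the four-value data set is hypothetical.

```python
def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered data
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        median = ordered[int(position) - 1]
    else:
        # Even n: the position falls halfway between the two middle values.
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return position, median

print(median_by_position([11, 12, 13, 14, 15]))  # (3.0, 13)
print(median_by_position([11, 12, 13, 14, 20]))  # (3.0, 13)
print(median_by_position([11, 12, 13, 14]))      # (2.5, 12.5)
```

Replacing 15 with 20 raises the mean from 13 to 14 but leaves the median at 13.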
The Mode
 Value that occurs most often

 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode
 There may be several modes
[Figure: two dot plots. Left, on an axis from 0 to 14: Mode = 9. Right, on an axis from 0 to 6: No Mode.]
Which Measure to Choose?
 The mean is generally used, unless extreme values

(outliers) exist.
 The median is often used instead, since it is less
sensitive to extreme values. For example, median
home prices may be reported for a region.
 In some situations it makes sense to report both the
mean and the median.
Shape of a Distribution
 Describes how data are distributed
 Two useful shape related statistics are:
 Skewness
 Measures the extent to which data values are not
symmetrical
 Kurtosis
 Measures the peakedness of the curve of
the distribution, that is, how sharply the curve
rises approaching the center of the distribution
Shape of a Distribution
(Skewness)
 Measures the extent to which data is not

symmetrical
                     Left-Skewed      Symmetric        Right-Skewed
Mean vs. median      Mean < Median    Mean = Median    Median < Mean
Skewness statistic   < 0              0                > 0
Measures of Variation
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
 Measures of variation give information on the spread or variability or
dispersion of the data values.
[Figure: two distributions with the same center but different variation]
Measures of Variation:
The Range
 Simplest measure of variation

 Difference between the largest and the smallest values:
Range = Xlargest – Xsmallest
Example:
[Figure: dot plot on an axis from 0 to 14]
Range = 13 - 1 = 12
Why The Range Can Be Misleading
 Does not account for how the data are distributed
[Figure: two dot plots on an axis from 7 to 12. Range = 12 - 7 = 5 for both data sets.]
 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
The Sample Variance
 Average (approximately) of squared deviations
of values from the mean
 Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Where $\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
The Sample Standard Deviation
 Most commonly used measure of variation

 Shows variation about the mean
 Is the square root of the variance
 Has the same units as the original data
 Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
The Standard Deviation
Steps for Computing Standard Deviation
1. Compute the difference between each value and the

mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get
the sample standard deviation.
Sample Standard Deviation:
Calculation Example
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16
$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$
$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$
$$= \sqrt{\frac{130}{7}} = 4.3095$$
A measure of the “average” scatter around the mean
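The five computing steps can be traced on the same eight sample values. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)
mean = sum(data) / n                       # 16.0

deviations = [x - mean for x in data]      # Step 1: difference from the mean
squared = [d ** 2 for d in deviations]     # Step 2: square each difference
total = sum(squared)                       # Step 3: add the squares -> 130.0
variance = total / (n - 1)                 # Step 4: divide by n - 1
std_dev = math.sqrt(variance)              # Step 5: take the square root

print(f"mean={mean}, sum of squares={total}, S^2={variance:.4f}, S={std_dev:.4f}")
# mean=16.0, sum of squares=130.0, S^2=18.5714, S=4.3095
```

Python's standard library function `statistics.stdev` also uses the n - 1 divisor and can serve as a cross-check.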
Comparing Standard Deviations
Smaller standard deviation
Larger standard deviation

Numerical Descriptive Measures For A
Population: The Variance σ2
 Average of squared deviations of values from

the mean
 Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Where μ = population mean

N = population size
Numerical Descriptive Measures For A
Population: The Standard Deviation σ
 Most commonly used measure of variation
 Shows variation about the mean
 Is the square root of the population variance
 Has the same units as the original data
 Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
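Sample statistics versus
population parameters
                     Population parameter   Sample statistic
Mean                 $\mu$                  $\bar{X}$
Variance             $\sigma^2$             $S^2$
Standard deviation   $\sigma$               $S$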
Interpreting Standard
Deviation: Empirical Rule
 Applies to data sets that are mound shaped and

symmetric
 Approximately 68% of the measurements lie in
the interval $\bar{x} - s$ to $\bar{x} + s$
 Approximately 95% of the measurements lie in
the interval $\bar{x} - 2s$ to $\bar{x} + 2s$
 Approximately 99.7% of the measurements lie in
the interval $\bar{x} - 3s$ to $\bar{x} + 3s$
Interpreting Standard
Deviation: Empirical Rule
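[Figure: mound-shaped curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$, showing the intervals that hold approximately 68%, 95%, and 99.7% of the measurements]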
Empirical Rule Example
Previously we found the mean closing stock
price of new stock issues is 15.5 and the
standard deviation is 3.34. If we can assume
the data is symmetric and mound shaped,
calculate the percentage of the data that lie
within the intervals
$\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$.
Empirical Rule Example
According to the Empirical Rule, approximately 68% of
the data will lie in the interval ($\bar{x} - s$, $\bar{x} + s$),
(15.5 – 3.34, 15.5 + 3.34) = (12.16, 18.84)
Approximately 95% of the data will lie in the interval
($\bar{x} - 2s$, $\bar{x} + 2s$),
(15.5 – 2∙3.34, 15.5 + 2∙3.34) = (8.82, 22.18)
Approximately 99.7% of the data will lie in the interval
($\bar{x} - 3s$, $\bar{x} + 3s$),
(15.5 – 3∙3.34, 15.5 + 3∙3.34) = (5.48, 25.52)
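The three intervals can be reproduced with a short helper. This is a sketch only: the function name `empirical_rule_intervals` is illustrative, and the percentages are the approximate Empirical Rule values, which apply only if the data are mound shaped and symmetric.

```python
def empirical_rule_intervals(mean, std):
    """Map k = 1, 2, 3 to (lower, upper, approximate coverage)."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {k: (mean - k * std, mean + k * std, pct) for k, pct in coverage.items()}

# Mean closing price 15.5 and standard deviation 3.34, as in the example.
for k, (low, high, pct) in empirical_rule_intervals(15.5, 3.34).items():
    print(f"about {pct} within {k} standard deviation(s): ({low:.2f}, {high:.2f})")
```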
Numerical Measures of
Relative Standing: z–Scores
 Describes the relative location of a measurement
compared to the rest of the data
Sample z–score: $z = \frac{x - \bar{x}}{s}$
Population z–score: $z = \frac{x - \mu}{\sigma}$
Measures the number of standard deviations
away from the mean a data value is located
z–Score Example
 The mean time to assemble a product is 22.5
minutes with a standard deviation of 2.5 minutes.
 Find the z–score for an item that took 20 minutes
to assemble.
 Find the z–score for an item that took 27.5
minutes to assemble.
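One way to work the two assembly-time questions is a direct application of the z-score formula. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

# Mean assembly time 22.5 minutes, standard deviation 2.5 minutes.
print(z_score(20, 22.5, 2.5))    # -1.0
print(z_score(27.5, 22.5, 2.5))  # 2.0
```

The 20-minute item is one standard deviation below the mean. The 27.5-minute item is two standard deviations above it.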
Interpretation of z–Scores for
Mound-Shaped Distributions of
Data
1. Approximately 68% of the measurements will
have a z-score between –1 and 1.
2. Approximately 95% of the measurements will
have a z-score between –2 and 2.
3. Approximately 99.7% of the measurements will
have a z-score between –3 and 3.
(see the figure on the next slide)
Interpretation of z–Scores
Numerical Measures of
Relative Standing:
Percentiles
 Descriptive measures of the relative location of a
measurement compared to the rest of the data are called
measures of relative standing.
 The pth percentile is a number such that p% of
the data falls below it and (100 – p)% falls above
it
 Median = 50th percentile
Quartiles
Measure of noncentral tendency
Split ordered data into 4 quarters
25% 25% 25% 25%
Q1 Q2 Q3
Lower quartile QL is 25th percentile.
Middle quartile m is the median.
Upper quartile QU is 75th percentile.

Percentile Example
 You scored 560 on the GMAT exam. This score

puts you in the 58th percentile.
 What percentage of test takers scored lower
than you did?
 58% of test takers scored lower than 560.
 What percentage of test takers scored higher
than you did?
 (100 – 58)% = 42% of test takers scored
higher than 560.
Outlier
An observation (or measurement) that is unusually
large or small relative to the other values in a data
set is called an outlier. Outliers typically are
attributable to one of the following causes:
1. The measurement is observed, recorded, or
entered into the computer incorrectly.
2. The measurement comes from a different
population.
3. The measurement is correct but represents a
rare (chance) event.
Measure of noncentral tendency
Split ordered data into 4 quarters

25% 25% 25% 25%
Q1 Q2 Q3
Lower quartile QL is 25th percentile.
Middle quartile m is the median.
Upper quartile QU is 75th percentile.
Interquartile range: IQR = QU – QL
Quartile (Q2) Example
 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7

 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6
Q2 is the median, the average of the two middle

scores (7.7 + 8.9)/2 = 8.3
 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7

 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6
QL is median of bottom half = 6.3

 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7

 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6
QU is median of top half = 10.3
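The quartile example uses the median-of-halves approach, sketched below. The helper name `quartiles_by_halves` is an assumption, and at least two values are assumed. Excluding the middle value from both halves when n is odd is one common convention, so statistical software may report slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q_L, median, Q_U) using the median of each half of the ordered data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # bottom half of the ordered data
    upper_half = ordered[-half:]   # top half (middle value excluded when n is odd)
    return median(lower_half), median(ordered), median(upper_half)

raw = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
q_l, q2, q_u = quartiles_by_halves(raw)
iqr = q_u - q_l
print(f"Q_L={q_l}, Q2={q2:.1f}, Q_U={q_u}, IQR={iqr:.1f}")
```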

Interquartile Range
1. Measure of dispersion
2. Also called midspread
3. Difference between upper and lower quartiles
 Interquartile Range = QU – QL
4. Spread in middle 50%
5. Not affected by extreme values
Thinking Challenge
 You’re a financial analyst for Prudential-Bache
Securities. You have collected the following
closing stock prices of new stock issues: 17,
16, 21, 18, 13, 16, 12, 11.
 What are the quartiles, Q1 and Q3, and the
interquartile range?
Box Plot
1. Graphical display of data using 5-number summary:
Xsmallest, Q1, Median, Q3, Xlargest
[Figure: box plot on an axis from 4 to 12]
Box Plot
1. Draw a rectangle (box) with the ends
(hinges) drawn at the lower and upper
quartiles (QL and QU). The median is
shown by a line or symbol (such as “+”).
2. The points at distances 1.5(IQR) from each
hinge define the inner fences of the data set.
Lines (whiskers) are drawn from each hinge to
the most extreme measurements inside the
inner fence.
Box Plot
3. A second pair of fences, the outer fences, are
defined at a distance of 3(IQR) from the
hinges. One symbol (*) represents
measurements falling between the inner and
outer fences, and another (0) represents
measurements beyond the outer fences.
4. Symbols that represent the median and
extreme data points vary depending on
software used. You may use your own
symbols if you are constructing a box plot by
hand.
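The fence rules can be sketched as a small classifier. The quartiles QL = 6.3 and QU = 10.3 come from the quartile example earlier in the chapter. The test values 17.0 and 25.0 are hypothetical, added only to show each category, and the function name is illustrative.

```python
def classify_by_fences(value, q_l, q_u):
    """Place a measurement relative to the inner and outer fences of a box plot."""
    iqr = q_u - q_l
    inner_low, inner_high = q_l - 1.5 * iqr, q_u + 1.5 * iqr
    outer_low, outer_high = q_l - 3.0 * iqr, q_u + 3.0 * iqr
    if value < outer_low or value > outer_high:
        return "beyond outer fences"
    if value < inner_low or value > inner_high:
        return "between inner and outer fences"
    return "inside inner fences"

q_l, q_u = 6.3, 10.3  # inner fences near 0.3 and 16.3, outer near -5.7 and 22.3
for value in [11.7, 17.0, 25.0]:
    print(value, "->", classify_by_fences(value, q_l, q_u))
```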
Shape & Box Plot
[Figure: three box plots, each marking Q1, Median, and Q3, for Left-Skewed, Symmetric, and Right-Skewed distributions]
Detecting Outliers
Box Plots: Observations falling between the inner

and outer fences are deemed suspect outliers.
Observations falling beyond the outer fence
are deemed highly suspect outliers.
z-scores: Observations with z-scores greater than
3 in absolute value are considered outliers.
(For some highly skewed data sets,
observations with z-scores greater than 2 in
absolute value may be outliers.)
Example
 In the Journal of Experimental Social

Psychology (Vol. 45, 2009) study on whether
money can buy love, recall that the researchers
measured the quantitative variable birthday gift
price (dollars) for each of the 237 participants.
Are there any unusual reported prices in the
BUYLOV data set?
The Sample Covariance
 The sample covariance measures the strength of the
linear relationship between two variables (called
bivariate data)
 The sample covariance:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
 Only concerned with the strength of the relationship
 No causal effect is implied
Interpreting Covariance
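 Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence)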

Coefficient of Correlation
 Measures the relative strength of the linear
relationship between two variables
 Sample coefficient of correlation:

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
Features of
Correlation Coefficient, r
 Unit free
 Ranges between –1 and 1
 The closer to –1, the stronger the negative linear
relationship
 The closer to 1, the stronger the positive linear
relationship
 The closer to 0, the weaker any linear
relationship
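The sample covariance and the coefficient of correlation can be computed directly from their definitions. The sketch below uses hypothetical paired data (`xs`, `ys`), and the function names are illustrative.

```python
import math

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free and between -1 and 1."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
cov_xy = sample_covariance(xs, ys)
r = correlation(xs, ys)
print(f"cov = {cov_xy:.2f}, r = {r:.4f}")  # cov = 1.50, r = 0.7746
```

Rescaling either variable (for example, dollars to cents) changes the covariance but not r, which is why r is the better measure of relative strength.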
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0]
Applications of standard deviation
 Quality Management: control chart

 Risk Management
Quality Management
Control Chart
Process Variation
Total Process Variation = Common Cause Variation + Special Cause Variation
 Variation is natural; inherent in the world

around us
 No two products or service experiences are
exactly the same
 With a fine enough gauge, all things can be
seen to differ
Process Variation

Variation is often due to differences in:

 People
 Machines
 Materials
 Methods
 Measurement
 Environment
Process Variation

Common cause variation

 naturally occurring and expected
 the result of normal variation in materials,
tools, machines, operators, and the
environment
Process Variation

Special cause variation

 abnormal or unexpected variation
 has an assignable cause
 variation beyond what is considered
inherent to the process
Control Limits
Forming the Upper control limit (UCL) and the Lower
control limit (LCL):
UCL = Process Mean + 3 Standard Deviations

LCL = Process Mean – 3 Standard Deviations
[Figure: control chart over time with the process average as the center line, the UCL at +3σ, and the LCL at -3σ]
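The control limits can be sketched as follows. This assumes the process mean and standard deviation are estimated from a set of measurements taken while the process is in control, using the sample standard deviation. Control charts in practice often estimate σ in other ways, so treat this as an illustration. The measurements and the function name are hypothetical.

```python
from statistics import mean, stdev

def control_limits(measurements):
    """Return (LCL, center, UCL) as the mean plus or minus 3 standard deviations."""
    center = mean(measurements)
    sigma = stdev(measurements)  # sample standard deviation (n - 1 divisor)
    return center - 3 * sigma, center, center + 3 * sigma

# Hypothetical in-control measurements.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
lcl, center, ucl = control_limits(baseline)
print(f"LCL={lcl:.3f}, center={center:.3f}, UCL={ucl:.3f}")
```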
Control Chart Basics
Special Cause Variation: range of unexpected variability
Common Cause Variation: range of expected variability
[Figure: control chart over time with the process mean as the center line, the UCL at +3σ, and the LCL at -3σ]
Process Variability
Special Cause of Variation:
A measurement this far from the process average is very
unlikely if only expected variation is present
±3σ → 99.7% of process values should be in this range
[Figure: control chart over time showing the UCL, the process mean, and the LCL]
Using Control Charts
Control Charts are used to check for process control
If the process is found to be out of control, steps

should be taken to find and eliminate the special
causes of variation
In-control Process
 A process is said to be in control when

the control chart does not indicate any
out-of-control condition
 Contains only common causes of variation
 If the common cause variation is small, then a
control chart can be used to monitor the process
 If the common cause variation is too large, you
need to alter the process
Process In Control
 Process in control: points are randomly

distributed around the center line and all
points are within the control limits
[Figure: control chart over time with all points between the LCL and the UCL, scattered randomly around the process mean]
Process Not in Control
Out of control conditions:
 One or more points outside control limits

 8 or more points in a row on one side of the
center line
 8 or more points in a row moving in the same
direction
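The three out-of-control conditions can be checked mechanically. The sketch below is one possible reading of the rules: "8 points moving in the same direction" is taken to mean 7 consecutive increases or 7 consecutive decreases, and a point exactly on the center line breaks a run. All names and data are hypothetical.

```python
def sign(value):
    return (value > 0) - (value < 0)

def longest_run(labels):
    """Length of the longest run of equal, non-zero consecutive labels."""
    best = current = previous = 0
    for label in labels:
        if label != 0 and label == previous:
            current += 1
        else:
            current = 1 if label != 0 else 0
        previous = label
        best = max(best, current)
    return best

def out_of_control_signals(points, center, lcl, ucl, run_length=8):
    signals = []
    # Condition 1: any point outside the control limits.
    if any(p < lcl or p > ucl for p in points):
        signals.append("point outside control limits")
    # Condition 2: run_length or more points in a row on one side of the center line.
    if longest_run([sign(p - center) for p in points]) >= run_length:
        signals.append("run on one side of center line")
    # Condition 3: run_length points in a trend means run_length - 1 same-sign steps.
    steps = [sign(b - a) for a, b in zip(points, points[1:])]
    if longest_run(steps) >= run_length - 1:
        signals.append("trend in one direction")
    return signals

print(out_of_control_signals([1, 2, 3, 4, 5, 6, 7, 8], center=4.5, lcl=0, ucl=10))
# ['trend in one direction']
```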
Process Not in Control
[Figure: three control charts, each showing the UCL, process average, and LCL: (1) one or more points outside control limits; (2) eight or more points in a row on one side of the center line; (3) eight or more points in a row moving in the same direction]
Out-of-control Processes
 When the control chart indicates an out-of-

control condition (a point outside the control
limits or exhibiting trend, for example)
 Contains both common causes of variation and
assignable causes of variation
 The assignable causes of variation must be identified
 If they are detrimental to quality, assignable causes of variation
must be removed
 If they increase quality, assignable causes must be incorporated
into the process design
Financial Risk Management
Measures of Risk
• Standard deviation: is a measure of the dispersion
of a set of returns around their expected value.
• Beta: (systematic risk) measures the degree to which
the stock moves with the overall market.
$$E(R_i) = R_f + \beta\,(E(R_m) - R_f)$$
• Volatility : The volatility is the standard deviation of
the continuously compounded rate of return in 1 year
 Uncorrected sample standard deviation / standard deviation of the sample
$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (R_i - E(R))^2}$$
 Corrected sample standard deviation
$$\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (R_i - E(R))^2}$$
Two Assets With Same Expected Return But
Different (Continuous) Probability Distributions
[Figure: probability density curves for Stock 1 and Stock 2. Horizontal axis: Return %, from 0 to 15.]
Return and Risk of a portfolio
$$R_P = R_1 w_1 + R_2 w_2$$
$$\sigma_P^2 = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \operatorname{cov}(R_1, R_2)$$
$$= w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho_{12} \sigma_1 \sigma_2$$
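The two-asset return and risk formulas can be sketched numerically. All weights, returns, standard deviations, and correlations below are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import math

def portfolio_return(w1, r1, w2, r2):
    """Weighted average of the two asset returns."""
    return w1 * r1 + w2 * r2

def portfolio_std(w1, s1, w2, s2, rho):
    """Portfolio standard deviation from weights, asset risks, and correlation rho."""
    variance = w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * rho * s1 * s2
    return math.sqrt(variance)

# Hypothetical inputs: 60/40 weights, returns of 8% and 12%, risks of 10% and 20%.
w1, w2 = 0.6, 0.4
print(f"return = {portfolio_return(w1, 0.08, w2, 0.12):.3f}")
for rho in (1.0, 0.3, -1.0):
    print(f"rho = {rho:+.1f}: sigma_P = {portfolio_std(w1, 0.10, w2, 0.20, rho):.4f}")
```

With rho = 1 the portfolio risk equals the weighted average of the two risks. Any lower correlation gives a smaller portfolio standard deviation for the same expected return.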
The Question Being Asked in VaR
“What loss level is such that we are X%

confident it will not be exceeded in N
business days?”
The VaR measure
 When using the value-at-risk measure, an analyst is
interested in making a statement of the following form:
“I am X percent certain there will not be a loss of more than
V dollars in the next N days”.
 The variable V is the VaR of the portfolio. It is a function
of two parameters: the time horizon (N days) and the

confidence level (X%).
 When N days is the time horizon and X% is the confidence level, VaR is the
loss corresponding to the (100- X) th percentile of the distribution of the gain
in the value of the portfolio over the next N days.
 It is the loss level over N days that has a probability of
only (100-X)% of being exceeded. Bank regulators require
banks to calculate VaR for market risk with N = 10 and X
=99.
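The percentile definition of VaR can be illustrated under one simplifying assumption that the slides do not make: that the N-day gain in portfolio value is normally distributed. The mean, standard deviation, and function name below are hypothetical.

```python
from statistics import NormalDist

def normal_var(mean_gain, std_gain, confidence):
    """VaR as the loss at the (100 - X)th percentile of a normal gain distribution.

    Assumes (hypothetically) that the gain over the horizon is normal.
    """
    tail_probability = 1.0 - confidence
    percentile_gain = NormalDist(mean_gain, std_gain).inv_cdf(tail_probability)
    return -percentile_gain  # a negative gain is a loss, reported as a positive number

# Hypothetical 10-day gain: mean 0, standard deviation $1,000,000, X = 99.
var_99 = normal_var(0.0, 1_000_000.0, 0.99)
print(f"10-day 99% VaR is roughly ${var_99:,.0f}")
```

Under these assumptions the analyst would be 99 percent certain that the loss over the next 10 days will not exceed about 2.33 standard deviations of the 10-day gain.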
Distribution of the change in the
portfolio value
 END OF CHAPTER 3
