
Class Test 1 Revision Notes

Chapter 1 Descriptive Statistics

Types of Variables
- Categorical variables
- Ordinal variables
- Quantitative variables

Charts
- Histogram

- Histogram vs. Bar Chart


 One implication of this distinction: it is always appropriate to talk about the
skewness of a histogram, that is, the tendency of observations to fall more
on the low end or the high end of the x-axis.
 With bar charts, however, the x-axis does not have a low end or a high end,
because the labels on the x-axis are categorical, not quantitative. As a
result, it is less appropriate to comment on the skewness of a bar chart.

- Pros and Cons of the Four Visual Displays for Quantitative Variables
 Box plots, stem-and-leaf plots, dot plots, and histograms organize
quantitative data in ways that let us begin to find the information in a data
set.
 As to the question of which type of display is the best, there is no unique
answer.
 The answer depends on what feature of the data may be of interest and, to a
certain degree, on the sample size.
 Box plot
 Strength:
 Give a direct look at central location and spread as it summarizes
the five-number summary.
 Can identify outliers.
 Side-by-side box plot is an excellent tool for comparing two or
more groups
 Weakness:
 Not entirely useful for judging shape.
 Cannot distinguish between bell-shaped and bimodal distributions.
 Stem-and-Leaf plot
 Strength:
 Excellent for sorting data.
 With a sufficient sample size, it can be used to judge shape.
 Weakness:
 With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
 More restricted in the choices for “intervals” when compared to
histograms.
 Dot plot
 Strength:
 Can present all individual data values.
 Easy to create.
 Weakness:
 With a large sample size, a dot plot may be too cluttered.
 Histogram
 Strength:
 Excellent for judging the shape of a data set with moderate or
large sample sizes.
 Flexible in choosing the number as well as the width of the intervals
for the display.
 Between 6 and 15 intervals usually gives a good picture of the
shape.
 Weakness:
 With a small sample size, a histogram may not “fill in”
sufficiently well to show the shape of the data.
 With either too few intervals or too many, we may not see the true
shape of the data.
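To make the interval choice concrete, here is a minimal Python sketch of binning a quantitative variable into k equal-width intervals and counting frequencies; the function name and data are illustrative, not from any particular source:

    import collections

    def frequency_table(data, k):
        # Split [min, max] into k equal-width intervals and count values in each.
        lo, hi = min(data), max(data)
        width = (hi - lo) / k
        counts = collections.Counter(
            min(int((x - lo) / width), k - 1)  # clamp the maximum into the last bin
            for x in data
        )
        return {(lo + i * width, lo + (i + 1) * width): counts.get(i, 0)
                for i in range(k)}

    data = [2.1, 2.4, 3.0, 3.3, 3.8, 4.1, 4.6, 5.0, 5.2, 6.7]
    for interval, count in frequency_table(data, k=4).items():
        print(interval, count)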

- Misleading Graphs
 Statistics can be misleading if not presented appropriately.
 Same data can appear very differently when graphed.
 E.g. break in the vertical axis.
 The frequency axis should be continuous from zero. When we put a
break in the axis, we lose the proportional relationship among the
class-interval frequencies.

- Shape of Frequency Distributions


 J-shaped
 Positively skewed
 Negatively skewed
 Rectangular
 Bimodal
 Bell-shaped

Numerical Summaries
- Measures of Central Location: Mean, Mode, Median
 Mean as the Balance Point of a Distribution:
 Unlike the median and the mode, the mean is responsive to the exact
position of each score in the distribution. It is the balance point of a
distribution.
 Median in the Case with Outliers:
 The median is less sensitive than the mean to the presence of a
few extreme scores (outliers)
 Is it permissible to calculate the mean for tests in the behavioral
sciences? We first have to ask ourselves a question: "Is the
measurement on this scale interval or ordinal?" Sometimes it may not be
clear whether the scale is interval or ordinal; if it is only ordinal, the
mean may not be a meaningful summary.
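A minimal Python sketch of this outlier-sensitivity point, with made-up scores:

    import statistics

    scores = [10, 12, 13, 14, 15]
    with_outlier = scores + [80]          # one extreme score added
    print(statistics.mean(scores), statistics.median(scores))
    print(statistics.mean(with_outlier), statistics.median(with_outlier))
    # The mean jumps from 12.8 to 24, while the median barely moves (13 -> 13.5).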
 Measures of Variability: Standard Deviation, Range, Interquartile Range
 The standard deviation, like the mean, is responsive to the exact
position of every score in the distribution, because it is calculated
from deviations from the mean. If a score is shifted to a position more
deviant from the mean, the standard deviation increases; if the shift
is to a position closer to the mean, the standard deviation decreases
(see the sketch below).
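A minimal sketch of this point, with illustrative numbers:

    import statistics

    base = [8, 9, 10, 11, 12]
    shifted = [8, 9, 10, 11, 15]   # last score moved farther from the mean
    print(statistics.stdev(base), statistics.stdev(shifted))  # the second is larger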
 Measures of Shape: Skewness, Kurtosis
 Skewness is a measure of a data set's deviation from symmetry.

 Skewness $= \dfrac{m_3}{m_2^{3/2}}$, where $m_2 = \dfrac{\sum (x - \bar{x})^2}{n}$ and $m_3 = \dfrac{\sum (x - \bar{x})^3}{n}$.

The value of this measure generally lies between -3 and +3. The
closer the value is to -3, the more the distribution is skewed to the left,
and vice versa. A value close to 0 indicates a symmetric distribution; a
normal distribution is symmetric and has a skewness of 0 (a short
computation is sketched below).
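A minimal Python sketch of this moment coefficient of skewness; the function name is illustrative:

    def moment_skewness(data):
        n = len(data)
        mean = sum(data) / n
        m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
        m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
        return m3 / m2 ** 1.5

    print(moment_skewness([1, 2, 2, 3, 3, 3, 10]))    # positive: skewed right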
 There are other measures of skewness:
 1. Pearson mode skewness or first skewness coefficient:
$\text{skewness} = \dfrac{\text{mean} - \text{mode}}{\text{s.d.}}$
Mean < (>) mode ⇒ distribution is negatively (positively) skewed.

 2. Pearson median skewness or second skewness coefficient:
$\text{skewness} = \dfrac{3(\text{mean} - \text{median})}{\text{s.d.}}$
Mean < (>) median ⇒ distribution is negatively (positively) skewed.
 3. Bowley skewness or quartile skewness coefficient:
$\text{skewness} = \dfrac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \dfrac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$

    Distribution          Coefficient of Skewness   Measures of Central Location
    Symmetrical           0                         Mean = Median = Mode
    Skewed to the right   > 0                       Mean > Median > Mode
    Skewed to the left    < 0                       Mean < Median < Mode
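A minimal sketch of the second and third coefficients above, using Python's statistics module; note that statistics.quantiles uses one particular quartile convention, which can differ slightly from a textbook's:

    import statistics

    def pearson_median_skewness(data):
        mean, med = statistics.mean(data), statistics.median(data)
        return 3 * (mean - med) / statistics.stdev(data)

    def bowley_skewness(data):
        q1, q2, q3 = statistics.quantiles(data, n=4)   # quartile cut points
        return (q3 - 2 * q2 + q1) / (q3 - q1)

    data = [1, 2, 2, 3, 4, 6, 12]          # right-skewed sample
    print(pearson_median_skewness(data))   # > 0: skewed to the right
    print(bowley_skewness(data))           # > 0: skewed to the right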
 Kurtosis is a measure of the peakedness of a distribution:
$\text{kurtosis} = \dfrac{m_4}{m_2^{2}}$, where $m_4 = \dfrac{\sum (x - \bar{x})^4}{n}$.
 Excess kurtosis is defined as the kurtosis minus 3, i.e.
excess kurtosis = kurtosis - 3.
The normal distribution has a kurtosis of 3 and hence an excess
kurtosis of 0. Generally, if a distribution has a greater excess kurtosis,
it has a higher peak and thicker tails, compared to another distribution
of the same kind.
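A minimal sketch of these kurtosis formulas, with illustrative names and data:

    def kurtosis(data):
        n = len(data)
        mean = sum(data) / n
        m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
        m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
        return m4 / m2 ** 2

    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(kurtosis(data), kurtosis(data) - 3)         # kurtosis and excess kurtosis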
 An outlier is a data point that is not consistent with the bulk of the data.
If an observation falls outside the range [Q1 - 1.5 IQR, Q3 + 1.5 IQR],
it is regarded as an outlier (as sketched below).
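A minimal sketch of this 1.5 IQR fence rule; statistics.quantiles uses one particular quartile convention, which can differ slightly from a textbook's:

    import statistics

    def outliers(data):
        q1, _, q3 = statistics.quantiles(data, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # the fences
        return [x for x in data if x < low or x > high]

    print(outliers([5, 7, 8, 8, 9, 10, 11, 35]))       # 35 lies outside the fences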
 Possible reasons for outliers and what to do about them:
 The outlier is a legitimate data value and represents natural
variability for the group and variable(s) measured. Such values
may not be discarded; they provide important information
about location and spread.
 A mistake was made while taking the measurement or entering it into
the computer. If this is verified, the value should be discarded or corrected.
 The individual in question belongs to a group other than the
bulk of the individuals measured. Values may be discarded if a
summary is desired and reported for the majority group only.
 Coefficient of Variation
 The standard deviation measures the variation in a set of data. For
decision makers, the standard deviation indicates how spread out a
distribution is.
 For distributions having the same mean, the distribution with the
largest standard deviation has the greatest relative spread.
 When two or more distributions have different means, the relative
spread cannot be determined by merely comparing the standard
deviations.
 The coefficient of variation (CV) is used to measure the relative variation
for distributions with different means.
 Sample coefficient of variation: $CV = \dfrac{s}{\bar{x}} \times 100\%$
 When the coefficients of variation for two or more distributions are
compared, the distribution with the largest CV is said to have the
greatest relative spread.
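A minimal sketch of comparing relative spread via the sample CV, with made-up samples whose means differ greatly:

    import statistics

    def cv(data):
        return statistics.stdev(data) / statistics.mean(data) * 100  # in percent

    sample_a = [98, 101, 100, 103, 99]
    sample_b = [10, 12, 9, 14, 11]
    print(cv(sample_a), cv(sample_b))  # the larger CV has the greater relative spread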

Normal Distribution

Percentile

- The k-th percentile is a number that has k% of the data values at or below it and
(100-k)% of the data values at or above it. The lower quartile, median, and upper
quartile are special cases of percentiles: lower quartile = 25th percentile,
median = 50th percentile, upper quartile = 75th percentile.
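A minimal sketch of the quartiles-as-percentiles relationship; statistics.quantiles with n=4 returns the three quartile cut points:

    import statistics

    data = list(range(1, 101))                # 1..100
    q1, median, q3 = statistics.quantiles(data, n=4)
    print(q1, median, q3)                     # 25th, 50th and 75th percentiles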
Value-at-Risk (VaR)

- One important application of percentile in risk management is VaR.

- VaR is defined as the worst loss over a target horizon that will not be exceeded
with a certain confidence level. For instance, the VaR at the 95% confidence level
gives a loss value that will not be exceeded with probability of at least 95%.
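A minimal sketch of historical VaR as a percentile of the loss distribution, following the definition above; the return series is made up for illustration:

    import statistics

    returns = [-0.02, 0.01, -0.05, 0.003, -0.01, 0.02, -0.03, 0.015, -0.04, 0.005]
    losses = [-r for r in returns]                    # losses = negative returns
    var_95 = statistics.quantiles(losses, n=100)[94]  # 95th percentile of losses
    print(var_95)  # historical loss not exceeded with at least 95% probability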

Z-score

- μ ± 1σ contains about 68% of the scores

- μ ± 2σ contains about 95% of the scores

- μ ± 3σ contains about 99.7% of the scores
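A minimal sketch of z-scores and the 68% part of this rule, on illustrative, roughly bell-shaped data:

    import statistics

    data = [52, 55, 60, 61, 63, 65, 66, 68, 70, 74, 78, 80]
    mean, sd = statistics.mean(data), statistics.stdev(data)
    z = [(x - mean) / sd for x in data]           # standardized scores
    print(sum(abs(v) <= 1 for v in z) / len(z))   # near 0.68 for bell-shaped data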


Chapter 2 Correlation and Regression

Scatterplot

- Positive/negative association, linear relationship/nonlinear (curvilinear)
relationship

Correlation Coefficient r

- Strength

 It is determined by the closeness of the points to a straight line.

- Direction

 It is determined by whether one variable generally increases or generally
decreases when the other variable increases.

- Linear

 When the pattern is nonlinear, the correlation coefficient is not an
appropriate way to measure the strength of the relationship.

- The measure is also called the Pearson product-moment correlation coefficient.

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

where

$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$

$S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - n\bar{y}^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$

$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - n\bar{x}\bar{y} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$
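A minimal Python sketch of these shortcut formulas, with illustrative names and data:

    import math

    def pearson_r(x, y):
        n = len(x)
        sxx = sum(v * v for v in x) - sum(x) ** 2 / n
        syy = sum(v * v for v in y) - sum(y) ** 2 / n
        sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
        return sxy / math.sqrt(sxx * syy)

    print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))  # about 0.85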

- r is always between -1 and +1.

- Magnitude indicates the strength of the linear relationship.

- Sign indicates the direction of the association.

Rank Correlation Coefficient rs

- Since rankings are qualitative (ordinal) rather than quantitative data, even
though they are numerical, the sample correlation coefficient r cannot be used.
- Instead, we use the nonparametric counterpart of r, the rank correlation
coefficient rs, to perform correlation analysis on a form of qualitative data:
bivariate rankings.

- If we wish to assess the strength of the relation between the two sets of ranks, we
can compute the sample rank correlation coefficient rs.

- The Spearman correlation coefficient rs is defined as the Pearson correlation
coefficient between the ranks of the data.

$r_s = \dfrac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}}$, where $R_x$ and $R_y$ are the ranks of the two variables of interest.

If there are no tied ranks in the data, then the following shortcut formula also works:

$r_s = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$,

where $d_i = \text{Rank}(x_i) - \text{Rank}(y_i) = R_{x_i} - R_{y_i}$ (the difference between a pair of ranks)
and $n$ = the number of pairs of ranks.
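A minimal sketch of the shortcut formula, valid only when there are no tied ranks; names and data are illustrative:

    def spearman_rs(x, y):
        n = len(x)
        def rank(v, data):
            return sorted(data).index(v) + 1   # 1-based rank; assumes no ties
        d = [rank(a, x) - rank(b, y) for a, b in zip(x, y)]
        return 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))

    print(spearman_rs([86, 97, 99, 100, 101], [2, 20, 28, 27, 50]))  # 0.9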

- When to use rs instead of r?

 Situation 1: Data are given in the form of ranks.

 Situation 2: Data are given in the form of scores, but what matters is
whether one score is higher than another, not how much higher.
Translating scores to ranks is then suitable.

- Cautions in the use of correlation

 Bear in mind the following five cautions in the use of correlation.

 Correlation does not prove causation

 If variation in X causes variation in Y, that causal connection will
appear as some degree of correlation between X and Y.

 However, we cannot reason backward from a correlation to a
causal relationship.
 We must always remember “correlation does not imply
causation”.

 There are at least four possible explanations for an observed correlation.

Denote X as the explanatory variable, Y as the response variable.

(a) Causation – X is a cause of Y.

(b) Reverse of causation – Y is a cause of X.

(c) A third variable influences both X and Y.

(d) A complex of interrelated variables influences X and Y.

Note: Two or more of these situations may occur simultaneously.


For example, X and Y may influence each other. (a+b)

 r and rs are only for linear relationships

 When the relationship between the variables is not linear, other
measures of association are better.

 effect of variability

 The correlation coefficient is sensitive to the variability


characterizing the measurements of the two variables.

 For example, suppose one university has only minimal entrance
requirements, while another, more selective private university
admits only students with SAT scores of 1200 or higher. The
correlation between total SAT scores and any other variable of
interest will be weaker at the selective university.

 Therefore, restricting the range, whether in X, in Y, or in both,
results in a lower correlation coefficient (in magnitude); a small
simulation is sketched below.
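A minimal simulation of this restriction-of-range effect; all numbers are made up, and statistics.correlation requires Python 3.10 or later:

    import random
    import statistics

    random.seed(1)
    x = [random.gauss(0, 1) for _ in range(2000)]
    y = [0.7 * xi + random.gauss(0, 0.7) for xi in x]   # correlated pair

    kept = [(a, b) for a, b in zip(x, y) if a > 1.0]    # restrict the X range
    xr, yr = [a for a, _ in kept], [b for _, b in kept]
    print(statistics.correlation(x, y), statistics.correlation(xr, yr))
    # The restricted-range correlation is smaller in magnitude.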

 effect of discontinuity

 The correlation tends to be an overestimate in discontinuous
distributions.

 Usually, discontinuity, whether in X, in Y, or in both, results in a
higher correlation coefficient.

 correlation for combined data

 When data from several groups are combined, the correlation
coefficient may increase or decrease; it depends on the data.

- Examples of deceiving relationship


 Outliers can substantially inflate or deflate correlations.
 An outlier that is consistent with the trend of the rest of the data will
inflate the correlation.

 An outlier that is not consistent with the rest of the data can
substantially decrease the correlation.

 Groups combined inappropriately may mask relationships.

 The missing link is a third variable.

 Simpson’s Paradox

 Two or more groups

 Variables for each group may be strongly correlated

 When the groups are combined into one, there is very little
correlation between the two variables (see the sketch below).
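A minimal sketch of this pattern with made-up numbers (statistics.correlation requires Python 3.10 or later): each group is perfectly correlated on its own, yet the pooled data show much weaker correlation.

    import statistics

    ax, ay = [1, 2, 3, 4], [3, 4, 5, 6]              # group A
    bx, by = [3, 4, 5, 6], [1, 2, 3, 4]              # group B
    print(statistics.correlation(ax, ay))            # 1.0 within group A
    print(statistics.correlation(bx, by))            # 1.0 within group B
    print(statistics.correlation(ax + bx, ay + by))  # much weaker when pooled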

Simple Linear Regression
