Class Test 1 Revision Notes
Types of Variables
- Categorical variables
- Ordinal variables
- Quantitative variables
Charts
- Histogram
- Pros and Cons of the Four Visual Displays for Quantitative Variables
Box plots, stem-and-leaf plots, dot plots, and histograms organize
quantitative data in ways that let us begin to find the information in a data
set.
As to the question of which type of display is the best, there is no unique
answer.
The answer depends on what feature of the data may be of interest and, to a
certain degree, on the sample size.
Box plot
Strength:
Gives a direct look at central location and spread, as it displays
the five-number summary.
Can identify outliers.
Side-by-side box plots are an excellent tool for comparing two or
more groups.
Weakness:
Not very useful for judging shape.
Cannot distinguish between bell-shaped and bimodal distributions.
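As a rough illustration of what a box plot summarizes, the sketch below computes the five-number summary and flags possible outliers. The data are made up, the helper names are hypothetical, and the 1.5 * IQR fence is a common convention assumed here rather than something stated in these notes. Quartile conventions also differ between textbooks and software.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the notes)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

scores = [52, 55, 58, 60, 61, 63, 65, 67, 70, 98]  # made-up data
summary = five_number_summary(scores)   # (52, 58.5, 62.0, 66.5, 98)
outliers = iqr_outliers(scores)         # [98]
```

The summary shows centre and spread at a glance, but it would look much the same for a bell-shaped and a bimodal data set with the same quartiles.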
Stem-and-Leaf plot
Strength:
Excellent for sorting data.
With a sufficient sample size, it can be used to judge shape.
Weakness:
With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
More restricted in the choices for “intervals” when compared to
histograms.
Dot plot
Strength:
Can present all individual data values.
Easy to create.
Weakness:
With a large sample size, a dot plot may be too cluttered.
Histogram
Strength:
Excellent for judging the shape of a data set with moderate or
large sample sizes.
Flexible in choosing the number as well as the width of the intervals
for the display.
Between 6 and 15 intervals usually gives a good picture of the
shape.
Weakness:
With a small sample size, a histogram may not “fill in”
sufficiently well to show the shape of the data.
With either too few intervals or too many, we may not see the true
shape of the data.
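To see how the number of intervals changes the picture, the sketch below counts made-up bimodal data into equal-width intervals. bin_counts is a hypothetical helper written for this illustration, not a library function.

```python
def bin_counts(values, num_bins):
    """Count values in num_bins equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Made-up data with two clusters (around 3 and around 13).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 11, 12, 12, 13, 13, 13, 14, 14, 15]

too_few = bin_counts(data, 2)   # [9, 9]
enough = bin_counts(data, 7)    # [3, 5, 1, 0, 0, 3, 6]
```

With two intervals the display looks flat and hides the gap between the clusters. With seven intervals the empty middle intervals make the two groups visible.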
- Misleading Graphs
Statistics can be misleading if not presented appropriately.
The same data can appear very different when graphed.
E.g. a break in the vertical axis.
Frequency on the vertical axis should be continuous from zero.
When we put a break in the axis, we lose the proportional relationship
among class interval frequencies.
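A small numeric sketch of why a broken axis misleads: the ratio of the drawn bar heights changes once the axis no longer starts at zero. The frequencies and the function name are made up for illustration.

```python
def apparent_ratio(freq_a, freq_b, axis_start=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at axis_start."""
    return (freq_a - axis_start) / (freq_b - axis_start)

# Two class frequencies: 55 and 50 (made-up numbers).
true_ratio = apparent_ratio(55, 50)          # axis from zero: 1.1
broken_ratio = apparent_ratio(55, 50, 45)    # axis starting at 45: 2.0
```

The first bar is only 10% taller in the data, but it is drawn twice as tall when the axis starts at 45.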
Numerical Summaries
- Measures of Central Location: Mean, Mode, Median
Mean as the Balance Point of a Distribution:
Unlike the median and the mode, the mean is responsive to the exact
position of each score in the distribution. It is the balance point of a
distribution.
Median in the Case with Outliers:
The median is less sensitive than the mean to the presence of a
few extreme scores (outliers).
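The two points above can be checked on a small made-up data set. The sketch uses Python's standard statistics module. The numbers are illustrative only.

```python
import statistics

scores = [30, 32, 35, 36, 40]      # made-up data
with_outlier = scores + [400]      # same data plus one extreme score

mean_before = statistics.mean(scores)            # 34.6
median_before = statistics.median(scores)        # 35
mean_after = statistics.mean(with_outlier)       # 95.5
median_after = statistics.median(with_outlier)   # 35.5

# Balance point: deviations from the mean sum to zero (up to rounding).
balance = sum(x - mean_before for x in scores)
```

One extreme score moves the mean from 34.6 to 95.5, while the median only moves from 35 to 35.5.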
Is it permissible to calculate the mean for tests in the behavioral
sciences? First of all, we have to ask ourselves a question: “Is the
measurement on this scale interval or ordinal?” Sometimes it may be
neither interval nor ordinal.
- Measures of Variability: Standard Deviation, Range, Interquartile Range
The standard deviation, like the mean, is responsive to the exact
position of every score in the distribution. Because it is calculated by
taking deviations from the mean, if a score is shifted to a position more
deviant from the mean, the standard deviation will increase. If the shift
is to a position closer to the mean, the standard deviation decreases.
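A quick check of this behaviour on made-up data is sketched below. It assumes the population form of the standard deviation (statistics.pstdev). The notes do not say which form is intended, and the sample form moves in the same direction here.

```python
import statistics

base = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up data, mean 5
farther = [2, 4, 4, 4, 5, 5, 7, 13]  # last score moved away from the mean
closer = [2, 4, 4, 4, 5, 5, 7, 6]    # last score moved toward the mean

sd_base = statistics.pstdev(base)        # 2.0
sd_farther = statistics.pstdev(farther)  # larger than sd_base
sd_closer = statistics.pstdev(closer)    # smaller than sd_base
```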
- Measures of Shape: Skewness, Kurtosis
Skewness is a measure of a data set’s deviation from symmetry.
$$\text{Skewness} = \frac{m_3}{m_2^{3/2}}, \qquad m_2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad m_3 = \frac{\sum (x - \bar{x})^3}{n}$$
The value of this measure generally lies between -3 and +3. The
closer the value lies to -3, the more the distribution is skewed left;
the closer it lies to +3, the more it is skewed right. A value close
to 0 indicates a roughly symmetric distribution. A normal distribution
is symmetric and has skewness of 0.
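The moment-based skewness (third central moment divided by the second central moment raised to the power 3/2) can be sketched directly from its definition. The function name and the data are made up for illustration.

```python
def moment_skewness(values):
    """Skewness = m3 / m2**1.5, with central moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]             # long tail to the right
left_skewed = [-x for x in right_skewed]       # mirror image

skew_sym = moment_skewness(symmetric)          # 0 for symmetric data
skew_right = moment_skewness(right_skewed)     # positive
skew_left = moment_skewness(left_skewed)       # negative, same magnitude
```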
There are other measures of skewness:
1. Pearson mode skewness or first skewness coefficient
$$\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{s.d.}}$$
If mean < mode, the distribution is negatively skewed; if mean > mode, it is positively skewed.
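A minimal sketch of the Pearson mode skewness, (mean - mode) / s.d., on made-up data. It assumes the sample standard deviation (statistics.stdev) and a single, well-defined mode. The notes do not specify either choice.

```python
import statistics

def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation; assumes a unique mode."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)
    sd = statistics.stdev(values)   # sample s.d. is an assumption here
    return (mean - mode) / sd

data = [1, 2, 2, 2, 3, 4, 9]        # made-up data: mode 2, mean above 2
coefficient = pearson_mode_skewness(data)   # positive: mean > mode
```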
Normal Distribution
Percentile
- The k-th percentile is a number that has k% of the data values at or below it and
(100-k)% of the data values at or above it. The lower quartile, median, and upper
quartile are special cases of percentiles: lower quartile = 25th percentile,
median = 50th percentile, upper quartile = 75th percentile.
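The link between percentiles and quartiles can be sketched with the standard library. The helper name is hypothetical, the data are made up, and the interpolation rule (statistics.quantiles with method="inclusive") is one of several conventions in use, so other software may give slightly different values.

```python
import statistics

def percentile(values, k):
    """k-th percentile for 1 <= k <= 99, interpolating between order statistics."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[k - 1]

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # made-up data

lower_quartile = percentile(data, 25)     # 7.0
median = percentile(data, 50)             # 12.0
upper_quartile = percentile(data, 75)     # 14.0
```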
Value-at-Risk (VaR)
- VaR is defined as the worst loss over a target horizon that will not be exceeded
at a given confidence level. For instance, the VaR at the 95% confidence level
is a loss value that will not be exceeded with a probability of at least 95%.
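One simple way to read this definition off a set of observed losses is sketched below: take the smallest observed loss that at least 95% of the observations do not exceed. This empirical (historical-simulation style) approach is an assumption made for illustration, not necessarily the method intended in the course, and the losses are made up.

```python
import math

def historical_var(losses, confidence_pct=95):
    """Smallest observed loss L with at least confidence_pct% of losses <= L."""
    ordered = sorted(losses)
    n = len(ordered)
    # 1-based position of the required order statistic.
    position = math.ceil(confidence_pct * n / 100)
    return ordered[position - 1]

# Made-up losses over 20 periods (negative numbers are gains).
losses = [-5, -3, -2, -1, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 9, 12, 20]

var_95 = historical_var(losses, 95)                       # 12
not_exceeding = sum(1 for x in losses if x <= var_95)     # 19 of 20
```

Here 19 of the 20 observed losses (95%) are no larger than 12, so 12 is the 95% VaR under this reading.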
Z-score
Scatterplot
Correlation Coefficient r
- Strength
- Direction
- Linear
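$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
where
$$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - n\bar{y}^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - n\bar{x}\bar{y} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$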
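The computational forms of S_xx, S_yy and S_xy translate directly into code. The sketch below uses made-up data and a hypothetical function name.

```python
import math

def pearson_r(xs, ys):
    """r = S_xy / sqrt(S_xx * S_yy), using the sum-based shortcut forms."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]                      # made-up data
r = pearson_r(xs, ys)                     # S_xx=10, S_yy=6, S_xy=6

r_perfect = pearson_r(xs, [2 * x + 1 for x in xs])   # exact line, r = 1
```

For the first data set S_xx = 10, S_yy = 6 and S_xy = 6, so r = 6 / sqrt(60), roughly 0.77: a fairly strong, positive, linear association.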
- Since rankings are qualitative (ordinal) rather than quantitative data, even though
they are numerical, the sample correlation coefficient r cannot be used.
- Instead, we use the nonparametric counterpart of r, the rank correlation
coefficient rs, to perform correlation analysis on a form of qualitative data:
bivariate rankings.
- If we wish to assess the strength of the relation between the two sets of ranks, we
can compute the sample rank correlation coefficient rs.
$$r_s = \frac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}}$$
where $R_x$ and $R_y$ are the ranks of the two variables of interest.
If there are no tied ranks in the data, then the following formula also works.
Shortcut formula:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i = \text{Rank}(x_i) - \text{Rank}(y_i) = R_{x_i} - R_{y_i}$ (difference between a pair of ranks).
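The sketch below converts made-up scores to ranks and checks that, without ties, the shortcut formula agrees with the full rank correlation formula. The function names are hypothetical, and the ranking helper assumes there are no tied values.

```python
import math

def ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_shortcut(xs, ys):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid without tied ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def spearman_full(xs, ys):
    """Correlation coefficient computed on the ranks themselves."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

xs = [10, 20, 30, 40, 50]   # made-up scores
ys = [1, 3, 2, 5, 4]

rs_short = spearman_shortcut(xs, ys)   # sum of d^2 is 4, so 1 - 24/120 = 0.8
rs_full = spearman_full(xs, ys)        # same value
```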
Situation 2: Data are given in the form of scores, but what matters is that
one score is higher than another and how much higher is not really
important. Then, translating scores to ranks will be suitable.
When the relationship between the variables is not linear, other measures
of association are better.
effect of variability
effect of discontinuity
An outlier that is not consistent with the rest of the data can
substantially decrease the correlation.
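The effect of a single inconsistent point can be sketched numerically. The data are made up: six points on an exact line, then the same points plus one outlier. The function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]                 # exactly y = 2x

r_clean = pearson_r(xs, ys)               # 1.0
r_outlier = pearson_r(xs + [7], ys + [0]) # one outlier at (7, 0): 0.25
```

A single point far from the line reduces r from 1 to 0.25 in this example.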
Simpson’s Paradox