
Initial Data Analysis: Central Tendency


Initial Data Analysis

Central Tendency
Outline
 What is ‘central tendency’?
 Classic measures
 Mean, Median, Mode
 What’s an ‘average’?
 Properties of statistics
 Sufficiency
 Efficiency
 Bias
 Resistance
 Resistant measures
Measures of Central Tendency
 While distributions provide an overall picture of
some data set, it is sometimes desirable to represent
some property of the entire data set using a single
statistic
 The first descriptive statistics we will discuss are
those used to indicate where the ‘center’ of the
distribution lies.
 The expected value
 It is not a value that has to be in the dataset itself
 There are different measures of central tendency,
each with their own advantages and disadvantages
The Mode
 The mode is simply the value of the relevant variable that
occurs most often (i.e., has the highest frequency) in the
sample

 Note that if you have done a frequency histogram, you can
often identify the mode simply by finding the value with the
highest bar.

 However, that will not work when grouping was performed
prior to plotting the histogram (although you can still use the
histogram to identify the modal group, just not the modal
value).

 Modes in particular are probably best applied to nominal data


Mode
 Advantages
 Very quick and easy to determine
 Is an actual value of the data
 Not affected by extreme scores

 Disadvantages
 Sometimes not very informative (e.g. cigarettes smoked
in a day)
 Can change dramatically from sample to sample
 Might be more than one (which is more representative?)
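As a rough illustration of the points above, the following Python sketch finds the mode(s) of two small samples. Both data sets are made up for this example. `multimode` from Python's standard `statistics` module (Python 3.8+) is used because it reports every tied value rather than picking one.

```python
from collections import Counter
from statistics import multimode

# Hypothetical sample: cigarettes smoked per day by 10 people.
cigarettes = [0, 0, 0, 0, 0, 0, 10, 20, 20, 40]

# Frequency table: the mode is the value with the highest count.
frequencies = Counter(cigarettes)
modal_values = multimode(cigarettes)

# Hypothetical nominal data with a tie for the most frequent value.
eye_colour = ["brown", "blue", "brown", "green", "blue"]
tied_modes = multimode(eye_colour)

print(frequencies.most_common(1))  # [(0, 6)]
print(modal_values)                # [0]
print(tied_modes)                  # ['brown', 'blue']
```

The first result shows a mode that is easy to find but says little about the smokers in the sample; the second shows a sample with more than one mode.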
The Median
 The median is the point corresponding to the score that lies in
the middle of the distribution (i.e., there are as many data
points above the median as there are below the median).
 To find the median, the data points must first be sorted into
either ascending or descending numerical order.
 The position of the median value can then be calculated using
the following formula:

$$\text{Median Location} = \frac{N + 1}{2}$$
Median
 Advantage:
 Resistant to outliers

 Disadvantage:
 May not be so informative:
 (1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10 )

 Does the value of 2 really represent this sample as a
whole very well?
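The median-location rule can be sketched in Python as follows. The function names are chosen for this example only. Averaging the two neighbouring scores when the location ends in .5 is the usual convention and is an assumption here, since the slide does not spell it out.

```python
import math

def median_location(n):
    """1-based position of the median in the sorted data: (N + 1) / 2."""
    return (n + 1) / 2

def median(values):
    ordered = sorted(values)  # the data must be sorted first
    location = median_location(len(ordered))
    # A whole-number location picks a single score. A location ending
    # in .5 falls between two scores, which are averaged (assumed
    # convention, not stated on the slide).
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return (lower + upper) / 2

sample = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]  # data from the slide
print(median_location(len(sample)))  # 6.0
print(median(sample))                # 2.0
print(median([7, 1, 5, 3]))          # 4.0 (location 2.5, between 3 and 5)
```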
The Mean
 The most commonly used measure of central
tendency is called the mean (denoted $\bar{X}$ for
a sample, and µ for a population).

 The mean is the same as what many of us call
the ‘average’, and it is calculated in the
following manner:

$$\bar{X} = \frac{\sum X}{N}$$
Mode vs. Median vs. Mean
 When there is only one mode and the distribution
is fairly symmetrical, the three measures (as
well as others to be discussed) will have
similar values

 However, when the underlying distribution is
not symmetrical, the three measures of central
tendency can be quite different.
Some Visual Demos
 Here is a demonstration that allows you to change a
frequency histogram while simultaneously noting the
effects of those changes on the mean versus the
median.

 As you use the demo, you should fairly easily be
able to think about how these changes are also
affecting the mode

 Note that in a skewed distribution the order is typically
Mode, Median, then Mean, moving in the direction the tail
is pointing.
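In place of the interactive demo, a small Python sketch can show the same effect. The two samples are invented for illustration, and the ordering they show is a tendency of skewed data, not a guarantee for every data set.

```python
import statistics

# Hypothetical right-skewed sample: the tail points toward high values.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

# Mirror image of the same sample: the tail points toward low values.
left_skewed = [15 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name,
          statistics.mode(data),
          statistics.median(data),
          statistics.mean(data))
# right 2 3.0 4.5     (mode < median < mean)
# left 13 12.0 10.5   (mode > median > mean)
```

In both cases the mean is pulled furthest toward the tail, the mode stays at the peak, and the median sits between them.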
What’s an average?
 We’ve been referring to the mean without qualification, but
in fact there are many types of averages, and that is only one
 The mean we typically use is the arithmetic mean
 Together with the geometric mean and harmonic mean, it is
one of the Pythagorean means.
 For positive values, the arithmetic mean is greater than or equal to
the geometric mean, which is greater than or equal to the harmonic
mean
 The geometric mean of n values is found by multiplying them all and
taking the nth root of the product
 The harmonic mean is the reciprocal of the
arithmetic mean of the reciprocals of all the values of the
variable in question
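The three definitions can be written out directly in Python. This is a minimal sketch using made-up positive values; the standard library also provides `statistics.geometric_mean` and `statistics.harmonic_mean` (Python 3.8+) for the same calculations.

```python
import math

values = [2, 4, 8]  # hypothetical positive values
n = len(values)

# Arithmetic mean: sum of the values divided by n.
arithmetic = sum(values) / n

# Geometric mean: nth root of the product of the values.
geometric = math.prod(values) ** (1 / n)

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals.
harmonic = n / sum(1 / x for x in values)

print(round(arithmetic, 3))  # 4.667
print(round(geometric, 3))   # 4.0
print(round(harmonic, 3))    # 3.429
```

For these values the ordering arithmetic ≥ geometric ≥ harmonic holds, as expected for positive data.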
More means
 The geometric mean is particularly appropriate for
data that change exponentially
 E.g. Human population over a period of time
 The harmonic mean is good for things like rates and
ratios where an arithmetic mean would actually be
incorrect. It also turns up in ANOVA: with
unequal sample sizes, by far the most
common procedure uses the harmonic mean of
the sample sizes
 As a result, an unbalanced design will have less statistical
power because the average sample size will tend toward
the smallest sample size
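Both uses of the harmonic mean can be checked with a short Python sketch. The speeds and group sizes below are hypothetical numbers picked for the example, and `harmonic_mean` comes from the standard `statistics` module.

```python
from statistics import harmonic_mean, mean

# Rates: the same distance is covered at 60 and then at 40 (say, mph).
# True average speed = total distance / total time
#                    = 2d / (d/60 + d/40) = 48, not (60 + 40) / 2 = 50.
speeds = [60, 40]
print(round(harmonic_mean(speeds), 2))  # 48.0
print(mean(speeds))                     # 50

# Unequal sample sizes: the harmonic mean sits much closer to the
# smallest group than the arithmetic mean does.
group_sizes = [10, 50]
print(round(harmonic_mean(group_sizes), 2))  # 16.67
print(mean(group_sizes))                     # 30
```

The second result is why an unbalanced design behaves as if it had a smaller average sample size than the arithmetic mean suggests.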
More means
 Weighted averages
 Sometimes we will want to weight a measure of
some variable by the values of some other variable
 E.g. If each person gets a score on several items and we
want an average of the total score for each person across
the items, we might weight them by 1/variance to give the
more consistent scorers more importance in the calculation
 The arithmetic mean is a weighted average in which
all weights are equal (e.g., all weights = 1).
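A weighted average can be sketched as below. The `weighted_mean` helper, the scores, and the variances are all hypothetical and only illustrate 1/variance weighting; they are not taken from any real instrument.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

scores = [4, 6, 8]           # hypothetical scores
variances = [1.0, 2.0, 4.0]  # hypothetical variance attached to each score

# Weight each score by 1/variance: more consistent scores count for more.
weights = [1 / v for v in variances]  # [1.0, 0.5, 0.25]

print(round(weighted_mean(scores, weights), 3))  # 5.143
print(weighted_mean(scores, [1, 1, 1]))          # 6.0 (arithmetic mean)
```

With equal weights the function reduces to the arithmetic mean, which is the point made in the last bullet above.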
Properties of a Statistic: Sampling Distribution
 In order to examine the properties
of a statistic we often want to take
repeated samples from some
population of data and calculate
the relevant statistic on each
sample.
 We can then look at the
distribution of the statistic across
these samples and ask a variety of
questions about it.
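The repeated-sampling idea can be simulated in a few lines of Python. This is a sketch under stated assumptions: the population is artificial (10,000 normally distributed values with mean 50 and SD 10), and the sample size, number of samples, and random seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Artificial population (assumption for the sketch).
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)

sample_means = []
sample_medians = []
for _ in range(2_000):                     # 2,000 repeated samples
    sample = rng.sample(population, 25)    # each of size 25
    sample_means.append(statistics.fmean(sample))
    sample_medians.append(statistics.median(sample))

# Questions we can ask of the sampling distributions:
# Is the statistic centred on the population value?
mean_of_means = statistics.fmean(sample_means)
# How much does the statistic vary from sample to sample?
sd_of_means = statistics.stdev(sample_means)
sd_of_medians = statistics.stdev(sample_medians)

print(population_mean, mean_of_means)
print(sd_of_means, sd_of_medians)
```

In a run like this the mean of the sample means should land very close to the population mean, and the sample medians should vary somewhat more from sample to sample than the sample means do.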
Properties of a Statistic
 Sufficiency
 A sufficient statistic is one that makes use of all of the information in
the sample to estimate its corresponding parameter
 For example, this property makes the mean more attractive as a measure
of central tendency compared to the mode or median.
 Unbiasedness
 A statistic is said to be an unbiased estimator if its expected value
(i.e., the mean of the statistic across many repeated samples) is equal to the
population parameter it is estimating.
 As one can see using the resampling procedure, the mean can be shown
to be an unbiased estimator
Properties of a Statistic
 Efficiency
 The efficiency of a statistic is reflected in the variance that is observed
when one examines the statistic over independently chosen samples
 Standard error
 The smaller the variance, the more efficient the statistic is said to be
 Resistance
 The resistance of an estimator refers to the degree to which that
estimate is affected by extreme values, i.e., outliers
 Small changes in the data result in only small changes in estimate
 Finite-sample breakdown point
 Measure of resistance to contamination
 The smallest proportion of observations that, when altered sufficiently, can
render the statistic arbitrarily large or small
 Median = 1/2 (about n/2 of the n observations)
 Trimmed mean = whatever the trimming amount is
 Mean = 1/n (a single observation)
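The breakdown idea can be made concrete with a small Python experiment. The `contaminate` helper and the data are hypothetical, written only for this illustration: the helper replaces the k largest observations with one wildly extreme value.

```python
import statistics

def contaminate(values, k, bad_value=1e6):
    """Return a sorted copy with the k largest values replaced."""
    ordered = sorted(values)
    return ordered[: len(ordered) - k] + [bad_value] * k

clean = [10, 12, 13, 14, 15, 16, 18]  # hypothetical sample, n = 7

for k in range(5):
    dirty = contaminate(clean, k)
    print(k, round(statistics.mean(dirty), 1), statistics.median(dirty))
```

A single altered observation (1 of n) is enough to drag the mean far from the bulk of the data. The median stays at 14 while 3 of the 7 values are altered and only breaks once more than half of them (4 of 7) are.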
Resistant measures of central tendency
 Trimmed mean
 Created by “trimming” some percentage of the
high and low ends of the data
 The median is actually a trimmed estimate
 Winsorized mean
 M-estimators
 Extreme values are given less weight than those closer to
the center of the distribution.
 May be more robust than mean or median for certain
types of “funky” data
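The first two resistant measures can be sketched in Python as follows. The function names, the data, and the 10% trimming proportion are illustrative assumptions, and the number of values cut from each end is taken as the whole-number part of proportion × n. M-estimators are not shown because they need an iterative fit.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # values cut from each end
    return statistics.fmean(ordered[k: len(ordered) - k])

def winsorized_mean(values, proportion):
    """Mean after replacing each tail with its nearest remaining value."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    k = int(proportion * n)
    if k > 0:
        ordered[:k] = [ordered[k]] * k              # pull in the low tail
        ordered[n - k:] = [ordered[n - k - 1]] * k  # pull in the high tail
    return statistics.fmean(ordered)

data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 40]  # hypothetical, one extreme score

print(statistics.fmean(data))       # 10.3
print(trimmed_mean(data, 0.10))     # 7.625
print(winsorized_mean(data, 0.10))  # 7.7
```

Trimming discards the extreme scores, while Winsorizing keeps the sample size and replaces them with the nearest retained score; both land much closer to the bulk of the data than the ordinary mean.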
Practical Example
 Administer the BDI to 10 randomly selected UNT students
 8 of the students scored 25 or less; two scored greater than 45.
 8, 12, 6, 16, 10, 20, 22, 25, 47, 55
 Median = 18
 Mean =22.1
 Which is more accurate regarding generalization to the ‘typical
UNT student’? One that includes:
 Two people that perhaps reversed their ratings on the items?
 A score that was miskeyed (using the number pad they hit a 4 instead
of 1 leading to a score of 47)?
 Two people who do not have English as their native language?
 Two people that did not answer honestly?
 Two people that are actually clinically depressed?
 One that is clinically depressed, one that just ‘wants to be different’?
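The figures on this slide, and how much they depend on the two high scores, can be checked with a short Python sketch. The two "what if" scenarios are illustrative: dropping the scores above 45, and re-keying 47 as 17 as in the miskeying bullet. The code cannot say which scenario is the right one; that depends on why the scores are what they are.

```python
import statistics

bdi = [8, 12, 6, 16, 10, 20, 22, 25, 47, 55]  # scores from the slide

print(statistics.median(bdi))  # 18.0
print(statistics.mean(bdi))    # 22.1

# Scenario 1: the two scores above 45 are excluded.
without_high = [score for score in bdi if score <= 45]
print(statistics.median(without_high))  # 14.0
print(statistics.mean(without_high))    # 14.875

# Scenario 2: the 47 was a miskeyed 17.
rekeyed = [17 if score == 47 else score for score in bdi]
print(statistics.median(rekeyed))  # 16.5
print(statistics.mean(rekeyed))    # 19.1
```

In both scenarios the mean moves further than the median, which is the resistance property at work.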
Practical Example
 While many think of outliers as representing the ‘complexity of human
nature’, the issue revolves more around inadequate data collection (which fails to
reveal why the score is what it is) and problematic population description
 E.g. my definition of typical UNT student, if such a thing could be said to
exist at all, is not one that is on suicide watch
 However, the previous problem most likely represents an attempt to
generalize to something that doesn’t exist.
 Better populations to try and represent: UNT Texans, UNT Psych grad
students, UNT international students, UNT students who have visited C & T
in the last semester (in which case those would probably not be outliers) etc.
 Application to current events: Do you really think there is a ‘middle
America’, a ‘female vote’ etc. to which the presidential candidates are
trying to appeal? There are demographics, very specific ones yes, but
those connotations do little to note the specifics.
Summary
 Favoritism for the arithmetic mean is the result of familiarity
only, and until you came to this course you would have been
hard-pressed to explain your preference outside of arguments
from authority
 The AM is to be valued for some properties it has relative to
other measures (sufficiency, efficiency, unbiasedness), and also
rejected for another (it has the least resistance)
 In many cases it’s entirely inappropriate to use the AM as it
would be a distorted view of central tendency
 Which statistics you use to represent your data should be
considered as much as the measures themselves.
