SCHOOL OF GEOSCIENCES
DIVISION OF GEOLOGY AND GEOPHYSICS
INTRODUCTORY STATISTICAL
DATA ANALYSIS
FOR GEOSCIENTISTS
USING MATLAB/OCTAVE
R. Dietmar Müller
TABLE OF CONTENTS
1 Introduction...................................................................................................................1
2 Introduction to MatLab .................................................................................................1
2.1 What is MATLAB? ......................................................................................................... 1
2.2 Getting Started ................................................................................................................ 1
2.3 Help.................................................................................................................................. 1
2.4 Colons & brackets .......................................................................................... 2
2.5 Plotting/Graphs: online help ........................................................................................... 3
2.6.1 Formats ..................................................................................................................................... 4
2.6.2 Saving and loading the workspace.............................................................................................. 5
2.7 Introduction to Plotting ................................................................................................... 6
2.8 Overview, colourmaps ..................................................................................... 7
2.8.1 Important steps for making 2D Graphs ....................................................................................... 7
2.8.2 Important steps for making 3D Graphs ....................................................................................... 8
2.8.3 Functions that create and plot continuous surfaces ...................................................................... 9
2.9 COLOUR....................................................................................................................... 10
3. Types of Geoscience Data and Data Quality................................................................12
3.1 Types of data ........................................................................................................................ 12
3.2 Dependent Versus Independent Variables .......................................................................... 13
3.3 Data quality ................................................................................................................... 13
4. Statistical data analysis................................................................................................13
4.1 Introduction.......................................................................................................................... 13
4.2 Plotting data as line/symbol and stem diagrams........................................................... 14
4.3 Plotting data as histograms ........................................................................................... 14
4.4 Plotting data as scatterplots .......................................................................................... 16
5. Probability ...................................................................................................................16
5.1. Random variables.......................................................................................................... 16
5.2 What is statistical significance?..................................................................................... 18
5.3 How to determine that a result is "really" significant.................................................. 18
5.4 Gaussian distribution .................................................................................................... 19
5.5 Random Data Sample Statistics .................................................................................... 21
5.5.1 Mean ....................................................................................................................................... 21
5.5.2 Standard deviation ................................................................................................................... 21
5.5.3 Standardization to unit standard deviation................................................................................. 22
5.5.4 Geometric mean....................................................................................................................... 22
5.6 Other distributions and their applications ................................................................... 22
5.6.1 Binomial.................................................................................................................................. 22
5.6.2 Multinomial............................................................................................................................. 23
5.6.3 Poisson .................................................................................................................................... 23
5.6.4 Exponential.............................................................................................................................. 24
5.6.5 Gamma.................................................................................................................................... 25
5.6.6 Beta......................................................................................................................................... 25
5.6.7 Negative Binomial ................................................................................................................... 26
5.6.8 Log-normal.............................................................................................................................. 26
5.6.9 Rayleigh .................................................................................................................................. 26
5.6.10 Weibull ............................................................................................................................... 26
6 Significance Tests ........................................................................................................27
6.1 Introduction................................................................................................................... 27
6.2 Testing whether data are normally distributed............................................................ 27
6.3 Outliers .......................................................................................................................... 30
6.4 Distribution of the sample mean.................................................................... 33
6.5 Student t-distribution.................................................................................................... 35
6.6 Student t-Test for Independent Samples ...................................................................... 36
6.7 Student t-test for dependent samples ............................................................................ 36
6.7.1 Within-group Variation ............................................................................................................ 36
6.7.2 Purpose.................................................................................................................................... 37
6.7.3 Assumptions............................................................................................................................ 37
6.7.4 One Sample Student's t test ...................................................................................................... 37
6.7.5 Two Independent Samples Student's t test................................................................................ 38
6.8 Chi-square distribution................................................................................................. 41
6.9 Pearson Chi-Square test ................................................................................ 42
6.10 Fisher F distribution...................................................................................................... 45
7 ANOVA: Analysis of Variance ....................................................................................46
8 Statistics between two or more variables......................................................................47
8.1 Correlations between two or more variables ................................................................ 47
8.1.1 Introduction ............................................................................................................................. 47
8.1.2 Significance of Correlations .................................................................................................... 47
8.1.3 Nonlinear Relations between Variables ................................................................................... 48
8.1.4 Measuring Nonlinear Relations ............................................................................................... 48
9 When to Use Nonparametric Techniques ....................................................................50
10 Directional and Oriented data..................................................................................52
10.1 Introduction........................................................................................................................ 52
10.2 Rose Plots............................................................................................................................ 52
10.3 Plotting and contouring oriented data on stereonets......................................................... 54
10.3.1 Types of stereonets.............................................................................................................. 54
10.3.2 Which net to use for which purpose?.................................................................................... 54
10.4 Tests of significance of mean direction .............................................................. 55
11 Spatial data analysis: Contouring unequally spaced data........................................56
12 Overview of Computer Intensive Statistical Inference Procedures ..........................58
12.1 Introduction................................................................................................................... 58
12.2 Monte Carlo Methods ................................................................................................... 58
12.2.1 Introduction......................................................................................................................... 58
12.2.2 Monte Carlo Estimation....................................................................................................... 59
12.2.3 Bootstrapping...................................................................................................................... 60
12.2.4 The "Jackknife" ................................................................................................................... 61
12.2.5 Markov Chain Monte Carlo Estimation................................................................................ 61
12.2.6 Meta-Analysis ..................................................................................................................... 62
12.2.7 Multivariate Modelling........................................................................................................ 62
12.2.8 Overall Assessment of Strengths and Weaknesses................................................................ 63
13 References................................................................................................................65
13.1 Geosciences .................................................................................................................... 65
13.2 General .......................................................................................................................... 65
14 Appendix..................................................................................................................67
14.1 Tables............................................................................................................................. 67
14.1.1 Critical values of R for Rayleigh's test ................................................................................. 67
14.1.2 Values of concentration parameter K from R for Rayleigh's test ........................................... 68
14.1.3 Critical values of Spearman's rank correlation coefficient..................................................... 69
14.1.4 Critical values of T for Mann-Whitney Test (α=5%)........................................................... 69
14.2 STIXBOX Contents....................................................................................................... 70
14.3 Computational Tools and Demos on the Internet......................................... 71
1 INTRODUCTION
This course module is designed to convey the principles of statistics applied to earth
science data. Problem solving is illustrated using Matlab and its free counterpart, Octave,
and two free sets of toolboxes, which work with both Matlab (www.mathworks.com) and
Octave (www.octave.org): the "stixbox" (www.maths.lth.se/matstat/stixbox/), which includes
most popular forward and inverse distribution functions, various hypothesis tests and graphics
functions, and an earth science toolbox written by G. Middleton (1999).
2 INTRODUCTION TO MATLAB
2.1 WHAT IS MATLAB?
Definition. MATLAB is a high-performance language for technical computing. After
mastering the basics, you will see that it combines powerful computational capabilities with
equally powerful graphics capabilities. MATLAB stands for "matrix laboratory", which
reflects its original focus on matrix computation.
2.3 HELP
Your first resource should be the "help" function that is part of the MATLAB program.
There is a "Help" icon which will bring up a searchable help data base. It contains tutorials
for basic MATLAB functions. If you are not familiar with MATLAB I strongly
recommend to go through the help pages on all basic MATLAB functions.
Another nice tool is the "lookfor keyword" command (type "help lookfor" for information
on it). It will look through all the help pages and give you back the commands that have the
keyword in the first line of their help pages. For example, how would we find all the
commands that relate to "meshes" in MATLAB? Type:
lookfor mesh
You can then follow this up by doing a "help" on any of these programs/tools.
2.4 COLONS & BRACKETS
The colon operator is very useful and important for defining arrays and for specifying
increment sizes. For instance, type the expression:
20:30
ans =
20 21 22 23 24 25 26 27 28 29 30
You can easily change the default increment, in this case "1", to anything else by simply
adding the increment size between the start and end values. For example, type:
20:2:30
ans =
20 22 24 26 28 30
Pretty straightforward stuff. Using the left and right brackets "[" and "]", you can define
matrices, where the rows are separated by a semicolon ";". For example, to define a 3 by 3
matrix with the numbers 1 through 9, type:
[1 2 3;4 5 6;7 8 9]
ans =
1 2 3
4 5 6
7 8 9
To give this matrix a name in MATLAB's memory, such as Ed, type:
Ed = [1 2 3;4 5 6;7 8 9]
Ed =
1 2 3
4 5 6
7 8 9
If you don't give such things a name, the result is assigned to the default variable "ans",
which stands for "answer". Let's use the colon and brackets to define a 5x5 matrix.
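For example, a command along these lines produces the matrix shown below:
junk = [1:5; 10:10:50; 5:9; 1000:-100:600; 3 5 2 9 44]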
junk =
1 2 3 4 5
10 20 30 40 50
5 6 7 8 9
1000 900 800 700 600
3 5 2 9 44
Online Reference: the MATLAB manual, and a very nice Frequently Asked Questions page
(Univ. Texas, Austin):
http://www-math.cc.utexas.edu/math/Matlab/Manual/ReferenceTOC.html
Let's look at an example from the above site. Here's a "plotting lines in 3D" page:
http://www-math.cc.utexas.edu/math/Matlab/Manual/plot3.html Do the example.
Colourmaps:
http://www-math.cc.utexas.edu/math/Matlab/Manual/colourmap.html
Prefab shapes: sphere:
http://www-math.cc.utexas.edu/math/Matlab/Manual/sphere.html Do the example (and
launch the MATLAB window, go to the 3D shapes page in Visualizations, and experiment
with changing the surfacing, shading, colourmap.)
A UTAH PAGE (nice tutorial); note an important point about array math:
http://www.mines.utah.edu/gg_computer_seminar/matlab/tut3.html
2.6.1 Formats
MATLAB has different formatting options for how we view the variables in the workspace.
Common formats are listed below (do a help on these to learn more):
format short
format short e
format long
format long e
format bank
x = [4/3 1.2345e-6]
x =
1.3333 0.0000
format short
x
x =
1.3333 0.0000
format short e
x
x =
1.3333e+00 1.2345e-06
format long
x
x =
1.33333333333333 0.00000123450000
format long e
x
x =
1.333333333333333e+00 1.234500000000000e-06
format bank
x
x =
1.33 0.00
Suppressing output: By default, MATLAB always displays the result of a command you
type on the screen. This is not always optimal, especially if you define a rather large array
of numbers. To suppress the output from being displayed on the screen, simply put a
semicolon ";" at the end of the line.
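For example:
y = 20:30;       % the semicolon suppresses the printed output
y                % typing the variable name displays its values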
2.6.2 Saving and loading the workspace
Typing
save
will create a file called matlab.mat in your present working directory, containing your
workspace variables. To access this file during another MATLAB session, just type:
load
which loads the matlab.mat file. You can give the file its own name with:
save filename
Then, to load it into memory at a later time (as you might imagine), type:
load filename
If you are only interested in saving a particular variable, no problem: let's say we want to just
save our A, and want to put it into a file called A_nov21.mat. Then type:
save A_nov21 A
2.6.3 Loading ASCII Data Files
Arrays of numbers can be stored in files on disk. For instance, open Text Editor and create a
file that has an array of numbers:
1 3 5 7
2 2 2 2
3 4 4 4
1 1 1 1
NOTE: don't worry about copying my numbers, any 4x4 matrix will do. Then save your file.
Ideally, save it as filename.dat, then you will know it is a data file that you use with
MATLAB (however, the ".dat" extension is not necessary). If I called my file "crap.dat", then
to load it into MATLAB, I type "load crap.dat". Now, the variable "crap" (without the ".dat")
is assigned to the data that was in that file. To be sure it is properly loaded, in MATLAB,
simply type the variable name and hit return.
We will use this file for our simple demonstration of 2D graphs. This is a classic data set of
118 measurements of zinc (Zn) made at 2 meter intervals along a single sphalerite quartz vein
in the Pulacayo Mine in Bolivia, by De Wijs (1951). So, these data are measured at equal
intervals in space. Load the data into MATLAB after you copy it over, with:
load dewijs.dat
If you now type "dewijs", the array will spew back to you (recall, to view a screen at a time,
type "more on"). Make a simple XY plot with circles for symbols:
plot(dewijs, 'o')
You can see that the Y ordinate is plotted at the value of the measurement, but the X ordinate
is simply a count of the number of the entry (entry 1, entry 2, etc), and has nothing to do with
the 2 meter interval. We can address this in a number of ways... 118 times 2 is 236. So let's just
make a new array of numbers: 0, 2, 4 ... up to 234 (not 236, because we start counting at zero):
x = [0:2:234]
Now, let's plot both arrays, so that the Y values are properly spaced in X, with a command along these lines:
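plot(x, dewijs, 'o')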
This literally says: plot a 'o' at [x(1),dewijs(1)], plot a 'o' at [x(2),dewijs(2)], and so on. Now
look at the X-axis. Things are shaping up. Let's add labels, for example (using the axis names
from the zinc figure shown later):
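xlabel('Distance (m)')
ylabel('Zn (%)')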
Let's add a solid line to the plot, since it is difficult to identify any kind of trend. We can add
as many data sets to a single plot as we like by listing several x, y (and attribute) groups in one
plot command. Thus, we will plot the same data set twice, once with "o"s and once as a solid
line (the default, so we don't need to give it attributes), for example:
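plot(x, dewijs, 'o', x, dewijs)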
A related display is a stem diagram, produced with:
stem(dewijs)
So, there doesn't appear to be any strong systematic trend in this data set, so let's do some
statistics with it. From inspection of our last plot, we see our Y values ranging from a little
under 5 to almost 40. Let's make a histogram of the data. First, we need to define bins for the
data. For this data set, let's make our bins 5 units wide; the vector below specifies the bin
centres (2.5, 7.5, ..., 37.5), which is how the hist function expects bins to be given. In one line:
x = [2.5:5:37.5]
hist(dewijs,x)
2.8.2 Important steps for making 3D Graphs
The typical steps in making a 3D graph are similar to the 2D case, except now we call a
3D graphing function, which typically has far more options, such as lighting and viewpoint,
etc. These are really just attributes of the projection and how we "foof" it up. They are
important functions to know about if you are going to continue on in MATLAB.
2.8.3 Functions that create and plot continuous surfaces
MATLAB defines a surface by the z-coordinates of points above a rectangular grid in the
x-y plane. The plot is formed by joining adjacent points with straight lines. Surface plots are
useful for visualizing matrices that are too large to display in numerical form, and for
graphing functions of two variables.
For us in the Earth sciences, the importance of plotting surfaces is obvious: so much of
our data is spatially oriented, such as topography, gravity, heat, magnetism, yada yada yada.
MATLAB provides several functions (such as mesh and surf) that make surfaces from your
input matrices. And all of our spatial data sets can be made into matrices (since we have
x=longitude, y=latitude, and z=measurement).
The command
figure
creates a new window and makes it the current destination for graphics output. You can make
any figure window the current/active one by clicking it with the mouse, or look at the title bar
of your figure windows -- they are numbered. To make the nth window active, just type:
figure(n)
2.8.5 Subplots
You can display multiple plots in the same figure window. The function
subplot(m,n,i)
breaks the figure window into an m-by-n matrix of small subplots (m rows and n columns),
and selects the ith subplot for the current plot. For example, if m=3 and n=4, then we are
dividing the figure window into 12 subplots: 3 rows and 4 columns.
Let's do one with 2 rows and 2 columns. Then we have "subplot(2,2,i)". We would designate
i=1 for the 1st plot -- they are ordered from left to right in row one, then row two, and so on.
Here's our example... let's plot a bunch of relationships between the sine and cosine function,
all on the same page:
t=0:pi/20:2*pi
[x,y]=meshgrid(t);
subplot(2,2,1)
plot(sin(t),cos(t))
axis equal
subplot(2,2,2)
z = sin(x) + cos(y)
plot(t,z)
axis([0 2*pi -2 2])
subplot(2,2,3)
z = sin(x).*cos(y);
plot(t,z)
axis([0 2*pi -1 1])
subplot(2,2,4)
z = (sin(x).^2)-(cos(y).^2);
plot(t,z)
axis([0 2*pi -1 1])
2.9 COLOUR
2.9.1 Defaults
As you noticed, your line colours in your last plot were cycled through a range of colours.
This is a default feature, which can all be changed. Look at the help page for "plot" to see how
you can easily hardwire different colours to different lines.
You can easily change the background colour of your figures, which is white by default. Try
colordef black
to change it to black (use the UP arrow to find your command for running the subplot_ex
example, and run it again).
colordef white
changes it back to the white background.
Red Green Blue Colour
0 0 0 black
1 1 1 white
1 0 0 red
0 1 0 green
0 0 1 blue
1 1 0 yellow
1 0 1 magenta
0 1 1 cyan
.5 .5 .5 gray
.5 0 0 dark red
1 .62 .40 copper
.49 1 .83 aquamarine
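RGB triples like those in the table can also be passed directly to plotting commands; a
minimal sketch:
% draw a line in dark red, using an RGB triple from the table above
plot(1:10, (1:10).^2, 'Color', [0.5 0 0])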
2.9.3 Colourmaps
Each MATLAB figure window has a "colourmap" associated with it. A colourmap is simply a
3-column matrix whose length (number of rows) is equal to the number of colours it defines.
MATLAB's default colourmap is "jet"; more precisely, it is "jet(64)" -- a 64-colour rendering
of jet. Type "colormap" and MATLAB will spew out its 3-column matrix of jet(64) --
remember, these are just RGB values.
You can see the colour scale by looking at a colourbar: open a new figure window (type
"figure"), then type "colorbar", and a colour definition scale bar will appear in your figure.
Look at some of the default maps: try
colormap(pink)
colormap(copper)
colormap(flag)
Flag is obviously not well suited for our purposes, but type "colormap" so MATLAB spews
out "flag"'s RGB matrix. You can see how it is composed:
ans =
1 0 0 (red)
1 1 1 (white)
0 0 1 (blue)
0 0 0 (black)
1 0 0
1 1 1
0 0 1
0 0 0
1 0 0
...etc...
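You can also define your own colourmap as an n-by-3 RGB matrix and pass it to colormap;
a minimal sketch:
% a simple three-colour map: dark red -> white -> blue
mymap = [0.5 0 0; 1 1 1; 0 0 1];
colormap(mymap)
colorbar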
3. TYPES OF GEOSCIENCE DATA AND DATA QUALITY
3.1 TYPES OF DATA
The data types we analyse in statistics are also called "variables" in a statistical context.
Variables can be things that we measure, control, or manipulate in research. In our case, we
will only be concerned with data or variables that we measure.
Data differ in "how well" they can be measured, i.e., in how much measurable
information their measurement scale can provide. There is obviously some measurement error
involved in every measurement, which determines the "amount of information" that we can
obtain. Another factor that determines the amount of information that can be provided by a
variable is its "type of measurement scale." Specifically data are classified as follows.
a. Nominal (or categorical) data allow for only qualitative classification. That is, they can
be measured only in terms of whether the individual items belong to some distinctively
different categories, but we cannot quantify or even rank order those categories. For
example, information may be given as a list of names, descriptions etc (e.g. sediment type
given as sand, silt, clay, ooze, …)
b. Ordinal data allow us to rank order the items we measure in terms of which has less and
which has more of the quality represented by the variable, but still they do not allow us to
say "how much more." A typical example of an ordinal variable in geology is Moh's
hardness scale or Richter's earthquake scale.
c. Interval data allow us not only to rank order the items that are measured, but also to
quantify and compare the sizes of differences between them. For example, temperature,
as measured in degrees Celsius, constitutes an interval scale. We can say that a
temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase
from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
d. Ratio data are very similar to interval variables; in addition to all the properties of
interval variables, they feature an identifiable absolute zero point, thus they allow for
statements such as x is two times more than y. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio scale,
not only can we say that a temperature of 200 degrees is higher than one of 100 degrees,
we can correctly state that it is twice as high. Interval scales do not have the ratio
property. Most statistical data analysis procedures do not distinguish between the interval
and ratio properties of the measurement scales.
e. Discrete data: These data can only assume specific (usually integer) values (e.g. counts
of objects).
f. Closed data: percentages, or ppm's etc. These are very common in geochemistry and
petrology.
g. Directional data: These data are given as angles, and are extremely important in
geoscience, as dips and strikes of structures measured in the field or in cores are
expressed this way. Most standard statistics textbooks do not include a treatment of
directional data.
3.2 DEPENDENT VERSUS INDEPENDENT VARIABLES
Independent variables are those that are manipulated or controlled, whereas dependent
variables are only measured or registered. This distinction appears terminologically confusing
to many because, we may say, "all variables depend on something." However, once you get
used to this distinction, it becomes indispensable. For example, if we collect geological data
along a creek-bed, or geophysical data on a paddock or at sea, distance along the traverse
would be an independent variable, whereas rock type, type of fossils, or variation in the
magnetic field etc. would be dependent variables.
If repeated measurements are similar, they are called precise. If the same systematic
error is made for a set of repeated measurements, they would be precise but entirely inaccurate
at the same time. The robustness of a procedure is the extent to which its properties do not
depend on those assumptions which you do not wish to make.
This class will be concerned mostly with data analysis/hypothesis testing and decision
making. For the purpose of statistical data analysis, distinguishing between cross-sectional
and time series data is important. Cross-sectional data are data collected at the same or
approximately the same point in time. Time series data are data collected over several time
periods. A Meta-analysis deals with a set of results to give an overall result that is
comprehensive and valid. Principal component analysis and factor analysis are used to
reduce the dimensionality of multivariate data. In these techniques correlations and
interactions among the variables are summarized in terms of a small number of underlying
factors. The methods rapidly identify key variables or groups of variables that control the
system under study. The resulting dimension reduction also permits graphical representation
of the data so that significant relationships among observations or samples can be identified.
Multivariate analysis involves the consideration of objects on each of which are observed the values of a number of variables. A
wide range of methods is used for the analysis of multivariate data, and this course will give a
view of the variety of methods available, as well as going into some of them in detail.
Figure: plot of the zinc data points from the quartz vein, marked as circles and connected by
lines (x-axis: Distance (m); y-axis: Zn (%)). A very simple diagram to create using the
MATLAB plot function.
Figure: stem diagram of the De Wijs data (x-axis: Position (2 m units); y-axis: Zn (%)).
Figure: histogram of the De Wijs data (x-axis: Zn (%); y-axis: frequency).
Before we can construct a histogram we must determine how many classes we should
use. This is purely arbitrary, but too few classes or too many classes will not provide as clear
a picture as can be obtained with some more nearly optimum number. An empirical
relationship (known as Sturges' rule) which seems to hold and which may be used as a guide
to the number of classes (k) is given by
k = the smallest integer greater than or equal to 1 + log(n)/log(2) = 1 + 3.322 log10(n)
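For example, Sturges' rule can be evaluated quickly in MATLAB:
n = 118;                       % e.g. the number of De Wijs zinc values
k = ceil(1 + log(n)/log(2))    % smallest integer >= 1 + log2(n)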
To have an 'optimum' you need some measure of quality - presumably in this case, the
'best' way to display whatever information is available in the data. The sample size contributes
to this, so the usual guidelines are to use between 5 and 15 classes, with more classes possible
if you have a larger sample. You take into account a preference for tidy class widths,
preferably a multiple of 5 or 10, because this makes it easier to appreciate the scale.
Beyond this it becomes a matter of judgement - try out a range of class widths and choose
the one that works best. (This assumes you have a computer and can generate alternative
histograms fairly readily). There are often management issues that come into it as well. For
example, if your data is to be compared to similar data - such as prior studies, or from other
countries - you are restricted to the intervals used therein.
If the histogram is very skewed, then unequal classes should be considered: use narrow
classes where the class frequencies are high, and wide classes where they are low. The
following worked example illustrates plotting histograms in Matlab:
2.9 3.2 3.4 2.9 3.6 3.7 3.3
3.4 4.0 3.8 3.7 3.3 2.9 3.1
3.2 3.6 3.5 3.3 3.4
% Plotting histograms
load ex_2_1.dat
help histo
% Sturges' rule:
% 1 + log(40)/log(2)=6.3 suggests 6 bins
subplot(1,2,1)
histo(ex_2_1,6,1)
subplot(1,2,2)
histo(ex_2_1,6,2)
mx = max(ex_2_1)
mn = min(ex_2_1)
range = mx - mn
mids = [2.95 3.15 3.35 3.55 3.75 3.95]
[n,x] = hist(ex_2_1, mids)
bar(x,n,1,'w')
___________________________________________________________________________
Figure: histogram of Delta O18 (PDB) values (x-axis: Delta O18, PDB).
5. PROBABILITY
5.1. RANDOM VARIABLES
Turning geoscience data into knowledge requires an ability to test hypotheses and to
analyse errors. This in turn requires some understanding of probability and certain statistical
principles. An underlying concept in statistics is that of a random variable. Random
variables may be thought of as physical quantities, which are yet to be known. Since we
cannot predict their values, we may say that they depend on "chance". Examples for random
variables include quantities in the future (like future earthquakes or floods), quantities in the
past (like the past state of the Earth's magnetic field), or present properties of the Earth,
which we attempt to infer from geological/geophysical measurements (like the motion of a
particular tectonic plate that is measured by GPS data). More traditional types of random
variables include the outcome of future experiments, like the toss of a coin, or the content X
of a can of coke drawn from a vending machine and selected at random, or the outcome of a
physical experiment.
After collecting a series of data, these data are regarded as a set in probability theory,
defined as a collection of objects (also called points or elements) about which it is possible to
determine whether any particular object is a member of the set. In particular, the possible
results of a series of measurements (or experiments) represent a set of points called the sample
space. These points may be grouped together in various ways called events, and under
suitable conditions probability functions may be assigned to each. The probabilities always
lie between zero and one, such that an impossible event has the probability of zero, and the
probability of a certain event is one.
Each x(j) has a probability p(j). The discrete cumulative distribution function F(x) is:

F(x) = Σ_{x_j ≤ x} f(x_j) = Σ_{x_j ≤ x} p(j)
The first moment of a probability distribution is the mean, and the first central moment
about the mean is zero. The second moment about the mean is called the variance, σx², and
its square root, σx, is called the standard deviation. The third moment about the mean is
called the skewness γ, and is zero for PDFs which are symmetric about the mean. The fourth
moment about the mean is the kurtosis. It measures the "peakedness" of the distribution.
Second central moment (variance): σ² = E(x²) − µ²
Third central moment (skewness): γ = E[(x − µ)³] / σ³
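As a sketch, these sample moments can be computed directly in MATLAB for any data
vector x:
x  = randn(1000,1);               % any data vector (here simulated normal values)
m  = mean(x);                     % first moment (mean)
v  = mean((x - m).^2);            % second central moment (variance)
sk = mean((x - m).^3) / v^1.5;    % skewness (zero for a symmetric distribution)
ku = mean((x - m).^4) / v^2       % kurtosis ("peakedness")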
The median value divides the probability density distribution into two halves, such that
there is a 50% chance for x to be less than the median and a 50% chance for it to be greater
than the median. The next figure shows measures of central tendency of an arbitrary
probability density distribution.
For example, a p-level of .05 (i.e.,1/20) indicates that there is a 5% probability that the
relation between the variables found in our sample is a "fluke." In other words, assuming that
in the population there was no relation between those variables whatsoever, and we were
repeating experiments like ours one after another, we could expect that approximately in
every 20 replications of the experiment there would be one in which the relation between the
variables in question would be equal or stronger than in ours. In many areas of research, the p-
level of .05 is customarily treated as a "border-line acceptable" error level.
5.3 HOW TO DETERMINE THAT A RESULT IS "REALLY" SIGNIFICANT
Which significance level to adopt depends on the amount of consistent supportive evidence
in the entire data set, and on "traditions" existing in the particular area of research. Typically,
in many sciences, results that yield p=0.05 are considered borderline statistically significant,
but remember that this level of significance still involves a pretty high probability of error
(5%). Results that are significant at the p=0.01 level are commonly considered statistically
significant, and p=0.005 or p=0.001 levels are often called "highly" significant. But remember
that those classifications represent nothing else but arbitrary conventions that are only
informally based on general research experience.
5.4 GAUSSIAN DISTRIBUTION
A random variable is said to follow a Gaussian (or normal) distribution if its probability
density function is given by:

f(x) = 1/(σ√(2π)) exp[ −(x − µ)² / (2σ²) ]

The Gaussian PDF is completely specified by the mean µ and standard deviation σ. The
shape of the Gaussian distribution is a bell-shaped curve, symmetric about the mean, with
about 68% of its area within one standard deviation of the mean, and about 95% within two
standard deviations.
Figure: normal probability density and distribution functions.
The integral of the Gaussian distribution is the error function, which is important for
solving thermal heat conduction problems. The Gaussian distribution is quite frequently used
in science, because of a result known as the central limit theorem, which states that the sum
of many independent random variables tends to behave as a Gaussian random variable, no
matter what their distribution, as long as it is the same for all. This result implies that any
physical process which is the sum of random events is Gaussian in its distribution.
Unfortunately this assumption often does not hold for some distributions of "real" data we
have to deal with.
Most computers contain a random number generator, which produces numbers between 0
and 1 with an approximately uniform distribution. One may use the random number
generator, and the central limit theorem, to compute random numbers, which are
approximately Gaussian with zero mean and unit variance. With a computer program to
generate uniformly distributed random numbers on the interval (0,1), one may compute the
sum of 12 of them, which by the Central Limit Theorem is approximately Gaussian, subtract
the mean value (6), and obtain approximately Gaussian numbers with unit variance.
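A minimal sketch of this recipe in MATLAB:
% approximate standard normal numbers from sums of 12 uniform(0,1) values
u = rand(12, 1000);     % 12 x 1000 uniformly distributed numbers
z = sum(u) - 6;         % each column sum has mean 6 and variance 1
hist(z, 20)             % the result is roughly bell-shaped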
Application: Many applications arise from central limit theorem (average of values of n
observations approaches normal distribution, irrespective of form of original distribution
under quite general conditions). Consequently, the normal distribution is an appropriate
model for many, but not all, physical phenomena. Examples: Distribution of physical
measurements on fossils, abundance of elements in rocks, average temperatures, etc. Many
methods of statistical analysis presume normal distribution.
___________________________________________________________________________
Worked Matlab Example: Normal distribution
Grades of chip samples from a body of ore have a normal distribution with a mean of 12%
and a standard deviation of 1.6%. Find the probability that a chip sample taken at random
will have a grade of:
1) 15% or less
2) 14% or more
3) 8% or less
4) between 8% and 15%
m1=12;     % mean grade (%)
sd1=1.6;   % standard deviation (%)
%prob <15%
%first standardise
z1=sdiz(15,m1,sd1)
%then use (just created) cumulative prob fn
prob1=cump(z1)
%Now prob>14%
z2=sdiz(14,m1,sd1)
prob2=1-cump(z2)
%prob <8%
z3=sdiz(8,m1,sd1)
prob3=cump(z3)
% 8%<grade<15%
%already standardised 8 & 15% (ie z3 and z1)
prob4=cump(z1)-cump(z3)
___________________________________________________________________________
5.5.1 Mean
The mean value of a data series gt is given by:
ḡ = (1/N) Σ_{t=1}^{N} g_t
where N is the number of data samples. The quantity g calculated here is an unbiased
estimate of the "true" mean value of the continuous function g(t). For many data analysis
procedures it is necessary to remove the mean value. For example, in Fourier transformed
data the presence of a mean value different from zero would result in a spurious large
frequency component in the amplitude spectrum. Hence we form a new time series given by:
x_t = g_t − ḡ,   t = 1,2,...,N.
When a distribution is skewed (i.e. not normal), the mean is not representative and the
median should rather be used. The median, which splits the distribution into two halves, is
often very similar to the mode, the most commonly occurring value, but the median is much
easier to compute than the mode. Therefore, the median is often used as a simple substitute
for the mode.
5.5.2 Standard deviation
The standard deviation s of the mean-removed series x_t is estimated by:

s = [ Σ_{t=1}^{N} x_t² / (N − 1) ]^{1/2}
Note that here xt has zero mean. Both s and s2 are unbiased estimates of the standard
deviation σ and the variance σ2 (for details see Bendat and Piersol, 1986, Section 4.1).
5.5.3 Standardization to unit standard deviation
For some computer operations which require fixed point rather than floating point
calculations, it is desirable to standardize the time series to unit standard deviation. This can
be achieved by multiplying the transformed values xt by 1/s:
z_t = x_t / s,   t = 1,2,...,N.
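In MATLAB, removing the mean (as in 5.5.1) and standardising a series can be done as, for
example (g being any data vector):
x = g - mean(g);     % remove the mean
z = x / std(x);      % standardise to unit standard deviation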
5.5.4 Geometric mean
The geometric mean of N values x_1, x_2, ..., x_N is:

x_geo = (x_1 · x_2 · ... · x_N)^{1/N}
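In MATLAB, the geometric mean of a vector x of positive values can be computed as, for
example:
xgeo = exp(mean(log(x)))    % equivalent to prod(x)^(1/N), but numerically safer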
___________________________________________________________________________
Worked Matlab Example: Basic sample statistics
The following data are diameters (in mm) of clasts from a conglomerate:
load ex_2_2.dat
median1=median(ex_2_2)
mean1=mean(ex_2_2)
5.6.1 Binomial
Application: Gives the probability of exactly x successes in n independent trials, when the
probability of success p on a single trial is a constant. Used frequently in quality control,
reliability, survey sampling, and other industrial problems.
Example: What is the probability of 7 or more "heads" in 10 tosses of a fair coin? The
binomial distribution can sometimes be approximated by a normal or by a Poisson
distribution.
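A minimal sketch of this calculation in base MATLAB:
% P(7 or more heads in 10 tosses of a fair coin)
p = 0;
for k = 7:10
    p = p + nchoosek(10,k) * 0.5^10;
end
p    % about 0.17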
5.6.2 Multinomial
Application: Gives probability of exactly ni outcomes of event i, for i = 1, 2, ..., k in n
independent trials when the probability pi of event i in a single trial is a constant. Used
frequently in quality control and other industrial problems. Generalization of binomial
distribution for more than 2 outcomes.
Example: Four companies are bidding for each of three contracts, with specified success
probabilities. What is the probability that a single company will receive all the orders?
5.6.3 Poisson
Application: Gives probability of exactly x independent occurrences during a given
period of time if events take place independently and at a constant rate. May also represent
number of occurrences over constant areas or volumes. Used frequently in quality control,
reliability, queuing theory, and so on. Frequently used as approximation to binomial
distribution.
___________________________________________________________________________
Worked Matlab Example: Poisson distribution
The number of major floods occurring in 50-year periods in a certain region has a Poisson
distribution with a mean of 2.2. What is the probability of the region experiencing: (1) exactly
two floods in 50 years; (2) exactly one flood in a 25-year period; (3) at least one flood in 50
years; (4) at most two floods in a 25-year period?
m1=2.2;   % mean number of floods per 50-year period
% exactly two floods in 50 years: P(X=2)
prob1=exp(-m1)*((m1).^2)/factorial(2)
% exactly one flood in 25 years (t=0.5): P(X=1) with mean m1*0.5
prob2=exp(-m1*0.5)*((m1*0.5).^1)/factorial(1)
% at least one flood in 50 years (t=1) is all but zero floods in 50 years
% ie 1-P(X=0)
prob3=1-exp(-m1)*((m1).^0)/factorial(0) %nb 0!=1
% at most two floods in 25 years: P(X=0)+P(X=1)+P(X=2) with mean m1*0.5
prob4=exp(-m1*0.5)*((m1*0.5).^0)/factorial(0) + exp(-m1*0.5)*((m1*0.5).^1)/factorial(1) ...
      + exp(-m1*0.5)*((m1*0.5).^2)/factorial(2)
___________________________________________________________________________
5.6.4 Exponential
Many geological events can be represented by points in space or time. When discrete
events occur randomly and independently at a mean rate λ per unit interval, the intervals x
between successive events follow the exponential probability density function:

f(x) = λ exp(−λx),   x ≥ 0

In this case, the number of events occurring in a unit interval has a Poisson distribution
with parameter λ, the mean rate of occurrence, and the mean time between events is 1/λ.
The probability that the time period X separating two events is at least x is:

P(X ≥ x) = exp(−λx)

and

P(X ≤ x) = 1 − exp(−λx)

It follows that

P(x1 ≤ X ≤ x2) = exp(−λx1) − exp(−λx2)

where P(x1 ≤ X ≤ x2) is the probability that an event will occur within a time period
between x1 and x2 after the previous event.
___________________________________________________________________________
Worked Matlab Example: Exponential distribution
The number of major earthquakes occurring in 100-year intervals in a certain region has a
Poisson distribution with a mean rate of 2.1. Find the probability that the time between two
successive earthquakes falls within given intervals, for example:
m1=2.1;
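Using the exponential formulas above (the specific intervals below are illustrative):
% probability that the interval exceeds one 100-year unit
prob1=exp(-m1*1)
% probability that the interval is half a unit (50 years) or less
prob2=1-exp(-m1*0.5)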
5.6.5 Gamma
Application: A basic distribution of statistics for variables bounded at one side - for
example x greater than or equal to zero. Gives distribution of time required for exactly k
independent events to occur, assuming events take place at a constant rate. Used frequently in
queuing theory, reliability, and other industrial applications.
Erlangian, exponential, and chi-square distributions are special cases. The Dirichlet is a
multidimensional extension of the Beta distribution.
Distribution of a product of iid uniform (0, 1) random variables? Like many problems with
products, this becomes a familiar problem when turned into a problem about sums. If X is
uniform (for simplicity of notation make it U(0,1)), Y=-log(X) is exponentially distributed, so
minus the log of the product of X1, X2, ... Xn is the sum of Y1, Y2, ... Yn, which has a gamma
(scaled chi-square) distribution. Thus, it is a gamma density with shape parameter n and scale
1.
5.6.6 Beta
Application: A basic distribution of statistics for variables bounded at both sides - for
example x between 0 and 1. Useful for both theoretical and applied problems in many areas.
Uniform, right triangular, and parabolic distributions are special cases. To generate a beta
variate, generate two random values from a gamma, g1, g2. The ratio g1/(g1+g2) follows a
beta distribution. The beta distribution can also be thought of as the distribution of
X1/(X1+X2), when X1 and X2 are independent gamma random variables.
There is also a relationship between the Beta and Normal distributions. The conventional
calculation is that given a PERT Beta with highest value b, lowest value a and most likely
value m, the equivalent normal distribution has a mean of (a + 4m + b)/6 and a standard
deviation of (b - a)/6. Many stixbox distribution functions are based on the beta and gamma
functions.
5.6.7 Negative Binomial
Application: Gives probability similar to Poisson distribution when events do not
occur at a constant rate and occurrence rate is a random variable that follows a gamma
distribution. Generalization of Pascal distribution when s is not an integer. Many authors do
not distinguish between Pascal and negative binomial distributions.
5.6.8 Log-normal
Application: Permits representation of random variable whose logarithm follows
normal distribution. Model for a process arising from many small multiplicative errors.
Appropriate when the value of an observed variable is a random proportion of the previously
observed value.
In the case where the data are lognormally distributed, the geometric mean acts as a better
data descriptor than the mean. The more closely the data follow a lognormal distribution, the
closer the geometric mean is to the median, since the log re-expression produces a
symmetrical distribution. The ratio of two log-normally distributed variables is log-normal.
5.6.9 Rayleigh
Application: Gives distribution of radial error when the errors in two mutually
perpendicular axes are independent and normally distributed around zero with equal
variances. Special case of Weibull distribution.
5.6.10 Weibull
Application: General time-to-failure distribution due to wide diversity of hazard-rate
curves, and extreme-value distribution for minimum of N values from distribution bounded at
left. The Weibull distribution is often used to model "time until failure." In this manner, it is
applied in actuarial science and in engineering work.
Example: Life distribution for some capacitors, ball bearings, relays, and so on.
6 SIGNIFICANCE TESTS
6.1 INTRODUCTION
Significance tests are based on certain assumptions: the data have to be random samples
from a well-defined population, and one has to assume that some variables follow a
certain distribution - in most cases the normal distribution is assumed.
The power of a test is the probability of correctly rejecting a false null hypothesis. A null
hypothesis is a hypothesis of no difference. If a null hypothesis is rejected when it is
actually true, a Type I error has occurred; its probability is α. The power is one minus the
probability of making a Type II error (β), which is the error that occurs when a false null
hypothesis is accepted. We choose the probability of making a Type I error when we set α,
and if we decrease the probability of making a Type I error we increase the probability of
making a Type II error.
Thus, the probability of correctly retaining a true null hypothesis has the same
relationship to Type I errors as the probability of correctly rejecting an untrue null hypothesis
does to Type II error.
Power and the True Difference Between Population Means: Any time we test whether a
sample differs from a population, or whether two samples come from two separate
populations, there is the assumption that each of the populations we are comparing has its
own mean and standard deviation (even if we do not know them). The distance between the
two population means will affect the power of our test.
Power as a Function of Sample Size and Variance: What really makes the difference in
the size of β is how much overlap there is between the two distributions. When the means are
close together, the two distributions overlap a great deal compared to when the means are
farther apart. Thus, anything that affects the extent to which the two distributions share
common values will increase β (the likelihood of making a Type II error).
Sample size has an indirect effect on power because it affects the measure of variance we
use to calculate the t-test statistic. Since we are calculating the power of a test that involves
the comparison of sample means, we will be more interested in the standard error (the
standard deviation of the sample mean) than in the standard deviation or variance by itself.
Thus, sample size is of interest because it modifies our estimate of the standard error. When n
is large we will have a lower standard error than when n is small. In turn, when n is large we
will have a smaller β region than when n is small.
6.2 TESTING WHETHER DATA ARE NORMALLY DISTRIBUTED
A fitted normal curve can be computed and compared with a histogram of the data. The
non-standardised variable x is plotted, and the area under the curve is equal to the total
frequency times the histogram class interval.
Figure: histogram and fitted normal curve. A histogram of 200 values drawn at random from
a normal population with a mean of 0 and a standard deviation of 1, together with the fitted
normal curve (x-axis: x; y-axis: frequency).
These data are clearly quite close to a normal distribution. Now let's look at another
example, namely data on uranium content (in ppm) in lake sediments from Saskatchewan,
Canada.
Figure: histogram of the uranium data from the lake sediments (x-axis: uranium content in
ppm; y-axis: frequency).
Initially it looks like there is not too much hope that these data might be anything close to
a normal distribution. However, it turns out that log-normal distributions are quite common
in nature. For example, if you take a hammer and smash a quartz grain to pieces, the resulting
grain-size distribution will likely be log-normal. Let's have a look at whether this might hold
for the lake sediment data:
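A minimal sketch of this comparison, assuming the uranium concentrations are held in a
vector u:
subplot(1,2,1), hist(u,20), title('Uranium (ppm)')
subplot(1,2,2), hist(log(u),20), title('Log of uranium')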
Figure: histogram of the logarithm of the uranium data from the lake sediments (y-axis:
frequency).
Now the picture has changed: our data may indeed be log-normally distributed, but the
distribution is not quite symmetric. In order to be sure that we can safely follow the
assumption that, say at 95% confidence, the data are log-normally distributed, we would now
have to carry out a chi-squared test (see later).
The following plot shows a bathymetric profile across the Hawaiian seamount chain.
Figure: bathymetric profile across the Hawaiian seamount chain (depths between about
-5000 m and sea level).
A histogram of the depths (below) clearly shows that this is at the very least a bimodal
distribution, with one very shallow peak that corresponds to the top of the Hawaiian chain,
and a large range of abyssal seafloor depths that could be broken up into two further
distributions (flexural bulges next to the seamount chain, and true abyssal seafloor depths).
Figure: histogram of the depths along the profile (x-axis: depth between -6000 m and 0 m;
y-axis: frequency).
The stixbox normmix function estimates the mixture of normal distributions, and their
means, standard deviations, and mixture weights. In this case, this matrix contains the
following values for three estimated distributions.
Other tests to determine the probability that a sample came from a normally distributed
population of observations are the Kolmogorov-Smirnov test and the Shapiro-Wilk W test.
6.3 OUTLIERS
Outliers are atypical (by definition), infrequent observations. Because of the way in
which the regression line is determined (especially the fact that it is based on minimizing not
the sum of simple distances but the sum of squares of distances of data points from the line),
outliers have a profound influence on the slope of the regression line and consequently on the
value of the correlation coefficient. A single outlier is capable of considerably changing the
slope of the regression line and, consequently, the value of a correlation, and it may also
change the outcome of a test for normality of a distribution. Note that if the sample size is
relatively small, then including or excluding data points that are not clearly "outliers" can
noticeably change the result.
Typically, we believe that outliers represent a random error that we would like to be able
to control. Unfortunately, there is no single widely accepted method to remove outliers
automatically. However, quantitative methods to exclude outliers have been proposed. The
best method that I have come across is "Chauvenet's
criterion", as described in Taylor's book "An introduction to error analysis". Chauvenet's
criterion states that if the expected number of measurements at least as bad as the
suspect measurement is less than 1/2, then the suspect measurement should be rejected.
The suspect value is first expressed as

tsus = |xsus − x̄| / s

where tsus is the number of standard deviations by which the suspect measurement xsus
differs from the mean x̄ (s being the sample standard deviation). We next find the probability
P(outside tsus*std) that a legitimate measurement will differ from the mean by tsus or more
standard deviations. This can be done in Matlab by using the stixbox function pnorm. pnorm
computes values of the normal distribution function (i.e. the cumulative distribution function;
right panel in the figure below), whereas dnorm computes the normal density function (left
panel in the figure below).
For example:
pnorm(0) = 0.5
As shown in the figure above, because at a z (or x)-value of 0, the grey area underneath
the density function is exactly half the area under this curve.
pnorm([-1 1])
yields:
0.1587 0.8413
z (or x) = -1 and 1 corresponds to minus and plus one standard deviation of a normal
distribution function.
We can use a small Matlab script, based on the commands dnorm and pnorm to
reproduce the figure above:
clear
x = [-4:0.1:4];
dens=dnorm(x);
dist=pnorm(x);
subplot(1,2,1)
plot(x,dens)
grid on
title('Normal density function')
xlabel('x')
ylabel('Probability')
subplot(1,2,2)
plot(x,dist)
grid on
title('Normal distribution function')
xlabel('x')
ylabel('Cumulative area')
Figure: the normal density function (left) and the normal (cumulative) distribution function
(right), as produced by the script above.
The probability that a measurement from a normally distributed population falls within x
standard deviations of the mean corresponds to the area under the normal density curve
between -x and +x.
The likelihood that a measurement falls within one standard deviation of the mean therefore
corresponds to the area under the normal density curve (left above) between -1 and 1. To find
this area all we have to do is look up the cumulative area at x=-1 and x=1 on the normal
distribution curve, and take the difference between them. In Matlab, we type:
diff(pnorm([-1 1]))
ans = 0.6827
We have just verified that about 68% of data from a normal distribution should lie within 1
standard deviation of the mean (which is zero for standardized data). In other words, the
probability for this to happen is 0.68, or 68%. By the same token, the probability that a
measurement will lie outside one standard deviation is 1-0.68=0.32.
Now back to Chauvenet's criterion. We want to find the probability that an "outlier" in our
series of measurements is actually an "erroneous" measurement that should be removed. For
that we need to find the probability P(outside tsus*std) that a legitimate measurement will
differ from the mean by tsus or more standard deviations. For a deviation of one standard
deviation, we have just shown that this probability is 0.32. Now, we multiply this value by the
number of measurements N:

n(worse than tsus) = N * P(outside tsus*std)

As an example, suppose we have made ten measurements, and we notice that one value, 58,
seems anomalously large. We check our field records, but cannot find any evidence that this
measurement was caused by a mistake. We use Chauvenet's criterion to evaluate whether this
measurement should be rejected as an unexpected outlier in a data set from a presumably
normal population.
32
First we compute a mean and standard deviation for all ten measurements as 45.8 and 5.1.
The suspect value differs from the mean by 12.2, which corresponds to tsus = 2.4 standard deviations:
(58 - 45.8)/5.1
ans = 2.4
diff(pnorm([-2.4 2.4]))
ans = 0.9836
1 - 0.9836
ans = 0.0164
10 * 0.0164
ans = 0.164
In ten measurements, we would expect to find only 0.164 of one measurement as bad as our
suspect result. This is less than the number 0.5 set by Chauvenet's criterion, so we should
consider rejecting the anomalous measurement.
Our next most suspect result is 38. Check whether this result is expected or not given our ten
measurements.
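The calculation above can be wrapped into a small helper function. This is a minimal sketch
only: the function name chauvenet and its arguments are hypothetical, and it assumes that the
stixbox function pnorm is on the Matlab/Octave path.
--------
function reject = chauvenet(x,xsus)
%Applies Chauvenet's criterion to a suspect measurement xsus in the data vector x
tsus = abs(xsus-mean(x))/std(x);        %deviation of xsus from the mean, in standard deviations
pout = 1 - diff(pnorm([-tsus tsus]));   %P(outside tsus*std) for a legitimate measurement
nbad = length(x)*pout;                  %expected number of measurements at least this bad
reject = nbad < 0.5;                    %Chauvenet: reject if fewer than 1/2 are expected
--------
For the example above, chauvenet(data,58) would return 1 (reject), where data is the vector of
the ten measurements.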
___________________________________________________________________________
The variance of the sampling distribution of the sample mean is

$$ \sigma_{\bar{x}}^{2} = \frac{\sigma_{x}^{2}}{n} $$
The square root of the variance of this or any other sampling distribution is called the
standard error (SE) of the statistic:
$$ SE(\bar{x}) = \sqrt{\frac{\sigma^{2}}{n}} = \frac{\sigma}{\sqrt{n}} $$
The standard error is useful to calculate confidence intervals of means, as seen in the
following example.
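As a minimal illustration before that example, a 95% confidence interval for a single mean can
be computed with the stixbox inverse-t function qt; the data vector x below is hypothetical.
x=[23.5 16.6 25.4 19.1 19.3 22.4 20.9 24.9];   %any vector of measurements (hypothetical)
n=length(x);
se=std(x)/sqrt(n);         %standard error of the mean
tcrit=qt(0.975,n-1);       %two-sided 95% critical t value
ci=[mean(x)-tcrit*se mean(x)+tcrit*se]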
Sometimes we have to estimate confidence intervals for differences between two means.
The estimated standard error of the difference in sample means is:
$$ s_{\bar{x}_A - \bar{x}_B} = \sqrt{ s_c^{2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right) },
\qquad
s_c^{2} = \frac{(n_A - 1)\,s_A^{2} + (n_B - 1)\,s_B^{2}}{n_A + n_B - 2} $$

where $s_c^{2}$ is the pooled (common) variance of the two samples.
___________________________________________________________________________
Worked example: Confidence intervals of means
1) Given data on the percentage of quartz in thin sections from an igneous rock, what is the
confidence interval around the estimated mean quartz percentage in the rock?
2) Is there any evidence that two brachiopod samples could have been derived from
populations having the same mean (data are contained in data files ex_2_22_a.dat and
ex_2_22_b.dat)?
load ex_2_22_a.dat;
load ex_2_22_b.dat;
n1=length(ex_2_22_a); n2=length(ex_2_22_b);
dm=mean(ex_2_22_a)-mean(ex_2_22_b);   % difference between the sample means
% pooled ("common") variance
sc=((n1-1)*std(ex_2_22_a)^2+(n2-1)*std(ex_2_22_b)^2)/(n1+n2-2);
t1=qt(0.975,n1+n2-2)                  % critical t for a 95% confidence interval
st=t1*sqrt(sc*((1/n1)+(1/n2)))
c1=dm-st
c2=dm+st
___________________________________________________________________________
Here $\bar{x}$ is the data mean and µ the (unknown) population mean. The population standard
deviation σ is usually unknown, and therefore we must estimate it with the sample statistic s,
which gives the following:

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$

The distribution of this quantity is not normal, but it is bell-shaped and centered on zero.
It depends on the degrees of freedom (d.f.), denoted by ν (nu), which equal n-1, with n
being the number of sample observations.
The t distributions were discovered in 1908 by William Gosset, a chemist and statistician
employed by the Guinness brewing company. He considered himself a student still learning
statistics, which is why he signed his papers with the pseudonym "Student"; alternatively, he
may have used a pseudonym because of "trade secrets" restrictions imposed by Guinness.
Note that there is not just one t distribution; it is a family of distributions. When we speak of
a specific t distribution, we have to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have their peak at 0.
However, their spread is greater than that of the standard normal distribution. The larger the
degrees of freedom, the closer the t density is to the normal density, as the short script below
illustrates.
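The comparison can be reproduced with a few lines of Matlab; this is a sketch that assumes the
stixbox density functions dt and dnorm, with dt taking the degrees of freedom as its second
argument.
x=[-4:0.1:4];
plot(x,dnorm(x),'k')        %standard normal density
hold on
plot(x,dt(x,2),'b')         %t density with 2 degrees of freedom
plot(x,dt(x,10),'r')        %t density with 10 degrees of freedom
hold off
grid on
xlabel('x')
ylabel('Probability density')
title('t densities (df = 2 and 10) versus the standard normal density')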
6.6 STUDENT T-TEST FOR INDEPENDENT SAMPLES
The t-test is the most commonly used method to evaluate the differences in means
between two groups. Theoretically, the t-test can be used even if the sample sizes are very
small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long
as the variables are normally distributed within each group and the variation of scores in the
two groups is not reliably different. As mentioned before, the normality assumption can be
evaluated by investigating the distribution of the data via histograms or by performing a
normality test. If the normality condition is not met, then you can evaluate the differences in
means between two groups using a nonparametric alternative to the t-test (discussed later).
The p-level reported with a t-test represents the probability of error involved in
accepting our research hypothesis about the existence of a difference. Technically
speaking, this is the probability of error associated with rejecting the hypothesis of no
difference between the two categories of observations (corresponding to the groups) in the
population when, in fact, the hypothesis is true.
If the difference is in the predicted direction, you can consider only one half (one "tail")
of the probability distribution and thus divide the standard p-level reported with a t-test (a
"two-tailed" probability) by two.
In order to perform the t-test for independent samples, one independent (grouping)
variable and at least one dependent variable (e.g., a test score) are required. The means of the
dependent variable will be compared between selected groups based on the specified values of
the independent variable.
It often happens in research practice that you need to compare more than two groups, or
compare groups created by more than one independent variable while controlling for the
separate influence of each of them. In these cases, you need to analyze the data using Analysis
of variance (ANOVA), which can be considered to be a generalization of the t-test. In fact, for
two group comparisons, ANOVA will give results identical to a t-test (t²(d.f.) = F(1, d.f.)).
However, when the design is more complex, ANOVA offers numerous advantages that t-tests
cannot provide.
For example, if the mean count of oil inclusions was 102 in formation A and 104 in
formation B, then this difference of "only" 2 points would be extremely important if all values
for formation A fell within a range of 101 to 103, and all scores for formation B fell within a
range of 103 to 105. However, if the same difference of 2 was obtained from very
differentiated scores (e.g., if their range was 0-200), then we would consider the difference
entirely negligible. That is to say, reduction of the within-group variation increases the
sensitivity of our test.
6.7.2 Purpose
The t-test for dependent samples helps us to take advantage of one specific type of design
in which an important source of within-group variation (or so-called, error) can be easily
identified and excluded from the analysis. Specifically, if two groups of observations (that are
to be compared) are based on the same set of samples who were analysed twice (e.g., before
and after a particular treatment), then a considerable part of the within-group variation in both
groups of scores can be attributed to the initial individual differences between subjects.
Note that, in a sense, this fact is not much different than in cases when the two groups are
entirely independent, where individual differences also contribute to the error variance; but in
the case of independent samples, we cannot do anything about it because we cannot identify
(or "subtract") the variation due to individual differences in subjects.
However, if the same sample was tested twice, then we can easily identify (or "subtract")
this variation. Specifically, instead of treating each group separately, and analyzing raw
scores, we can look only at the differences between the two measures (e.g., "pre-test" and
"post test") in each subject. By subtracting the first score from the second for each subject and
then analyzing only those "pure (paired) differences," we will exclude the entire part of the
variation in our data set that results from unequal base levels of individual subjects. This is
precisely what is being done in the t-test for dependent samples, and, as compared to the t-test
for independent samples, it always produces "better" results (i.e., it is always more sensitive).
6.7.3 Assumptions
The theoretical assumptions of the t-test for independent samples also apply to the
dependent samples test; that is, the paired differences should be normally distributed. If these
assumptions are clearly not met, then one of the nonparametric alternative tests should be
used.
Formula for t_obs, with the statistical hypotheses, critical values and comparisons for two-
tailed and one-tailed tests (the formula tables are not reproduced here).
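As a minimal sketch of the dependent-samples calculation, the paired differences can be
analysed directly; the vectors x1 and x2 below (repeat analyses of the same samples) are
hypothetical, and pt is the stixbox t distribution function.
x1=[12.1 10.4 11.8 9.9 13.0 10.7];    %first analysis (hypothetical)
x2=[12.9 10.8 12.4 10.5 13.6 11.1];   %repeat analysis of the same samples (hypothetical)
d=x2-x1;                              %paired differences
n=length(d);
tobs=mean(d)/(std(d)/sqrt(n))         %observed t with n-1 degrees of freedom
p=2*(1-pt(abs(tobs),n-1))             %two-tailed p-value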
___________________________________________________________________________
Worked example: T-test
A random sample of 12 observations is obtained from a normal distribution. What value
of the t-statistic will be exceeded with a probability of 0.025? The number of degrees of
freedom (df) is n-1=11. Use the help function to find out how dt, pt and qt work.
p=0.975;
df=11; %n-1
%inverse t is qt
t=qt(p,df)
A random sample of 8 hand specimens of rock was analysed for organic material; the
sample mean was found to be 5.8% and the sample standard deviation was 2.3. Do you
think it reasonable to suppose that the organic content of the rock is 5.0%?
--------
function t = tstat(m1,m2,s,n)
%Calculates t- test statistic
t = (m1-m2)/(s/sqrt(n))
--------
m1=5.8; %sample mean
m2=5.0; %suggested mean
s=2.3; %std
n=8;
t2=tstat(m1,m2,s,n)
Is there any evidence that the igneous rock from which the eight measurements of quartz
were taken has a mean quartz percentage greater than 20%?
qz=[23.5 16.6 25.4 19.1 19.3 22.4 20.9 24.9];
m1=mean(qz)
sd1=std(qz)
s1=sd1*sd1
n=length(qz)
ss1=sqrt(s1/n)
%Our t-stat
m2=20
t2=tstat(m1,m2,sd1,n)
___________________________________________________________________________
1) Is there any evidence that the brachiopods from horizon A are longer than those from
horizon B?
2) Is there any evidence of difference in lengths between A and B?
load ex_2_24_a.dat
load ex_2_24_b.dat
% Basic Stats
m1=mean(ex_2_24_a)
m2=mean(ex_2_24_b)
s1=std(ex_2_24_a)
s2=std(ex_2_24_b)
ss1=s1*s1
ss2=s2*s2
n1=size(ex_2_24_a,1)
n2=size(ex_2_24_b,1)
%Degrees of freedom n1 + n2 - 2
df=n1+n2-2;
%Critical t for a one-sided test, p=0.95
t1=qt(0.95,df)
%Common (pooled) variance
sc=((n1-1)*ss1+(n2-1)*ss2)/(n1+n2-2)
%Two-sample t statistic
t2=(m1-m2)/sqrt(sc*((1/n1)+(1/n2)))
%Question II
%Here use both ends of the t distribution to get alpha=5%, i.e. 0.025 in each tail
t3=qt(0.975,df)
1) The Chi-square Test for Association is a non-parametric test of statistical significance (and
can therefore be used for nominal data) that is widely used in bivariate tabular association
analysis. Typically, the hypothesis is whether or not two different populations differ
enough in some characteristic or aspect of their behavior, based on two random
samples. This test procedure is also known as the Pearson chi-square test.
Like the Student's t distribution, the shape of the Chi-square distribution is determined by its
degrees of freedom. The figure below shows the shape of the Chi-square distribution as
the degrees of freedom increase (1, 2, 5, 10, 25 and 50).
6.9 PEARSON CHI-SQUARE TEST
The Pearson Chi-square test is the most common test for significance of the relationship
between categorical variables. This measure is based on the fact that we can compute the
expected frequencies in a two-way table (i.e., frequencies that we would expect if there was
no relationship between the variables). For example, suppose we ask 20 male and 20 female
geologists to choose between two brands of beer (brands A and B). If there is no relationship
between preference and gender, then we would expect about an equal number of choices of
brand A and brand B for each sex. The Chi-square test becomes increasingly significant as the
numbers deviate further from this expected pattern; that is, the more this pattern of choices for
males and females differs.
The value of the Chi-square test and its significance level depends on the overall number
of observations and the number of cells in the table. Relatively small deviations of the relative
frequencies across cells from the expected pattern will prove significant if the number of
observations is large.
The only assumption underlying the use of the Chi-square test (other than random
selection of the sample) is that the expected frequencies are not very small. The reason for this
is that the Chi-square test inherently tests the underlying probabilities in each cell; and when
the expected cell frequencies fall, for example, below 5, those probabilities cannot be
estimated with sufficient precision.
$$ \chi^{2} = \sum_{j=1}^{k} \frac{(O_j - E_j)^{2}}{E_j} $$
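A minimal sketch of this calculation for a two-way table, using the beer-preference example
above; the counts are hypothetical, and qchisq is the stixbox inverse chi-square function used
elsewhere in these notes.
O=[14 6; 8 12];                   %rows: males, females; columns: brand A, brand B (hypothetical)
E=sum(O,2)*sum(O,1)/sum(O(:));    %expected counts if preference and gender are unrelated
chi2=sum(sum((O-E).^2./E))        %Pearson chi-square statistic
df=(size(O,1)-1)*(size(O,2)-1);   %(rows-1)*(columns-1) = 1
crit=qchisq(0.95,df)              %5% critical value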
___________________________________________________________________________
Worked example: Pearson Chi-square test
You have developed a theory that the proportions of four minerals in granite are 4:1:2:3.
You have analysed a random sample of 100 grains in a thin section consisting of 35, 12, 22
and 31 of these species. You would like to test if this sample lends support to your theory or
not.
obsf=[35 12 22 31];
exf=[40 10 20 30];
% (Observed - Expected)^2/Expected
dif=obsf-exf;
dif2=dif.^2./exf;
% Chi-square statistic is the sum over all classes
chi2sum=sum(dif2)
% Critical value for 3 degrees of freedom at the 95% level
qchisq(0.95,3)
___________________________________________________________________________
Worked example: Goodness-of-fit Chi-square test for a normal distribution
The data file ex_2_25.dat contains data on uranium content from lake sediments from 71
sites from Saskatchewan, Canada. Are these data drawn from a normally distributed
population? First plot a histogram of these data. You quickly see that the data are not
normally distributed at all. Our next guess is that they may be log-normally distributed, and
we proceed from there.
load ex_2_25.dat
histo(ex_2_25)
u=log(ex_2_25);
umean=mean(u)
ustd=std(u)
nsamp=length(u);
ust=(u-umean)/ustd;
% Bin the standardized data
% To compare with normal dist must take bins symmetric about 0
% Boundaries of bins (example values; the original choice of boundaries is not shown)
yb=[-2:0.5:2]';
[N,bin]=histc(ust,yb)
bar(yb,N,'histc','c')
n=length(yb);
% Observed frequencies per bin (drop histc's last entry, which only counts values equal to yb(end))
ofreq=N(1:n-1);
ep=diff(pnorm(yb));
efreq=ep*nsamp;
% Calculate (Obsfreq-expfreq)^2/expfreq
chi2=((ofreq-efreq).^2)./efreq
chi2sum=sum(chi2)
classes=length(yb);
df=classes-1-2
chi2inv(0.95,df)
% -> Calc. chi2sum (9.4) does not exceed critical Chi2 (11.4)
% -> Accept null hypo.
% -> Data do not differ sign. from normal dist.
Suppose we want to perform a hypothesis test to determine whether two population variances
are the same. The test statistic is the ratio of the two sample variances,

$$ F = \frac{s_1^{2}}{s_2^{2}}, $$

with the larger variance placed in the numerator.
A specific F distribution is denoted by the degrees of freedom for the numerator Chi-
square and the degrees of freedom for the denominator Chi-square. An example of the F(10,10)
distribution is shown in the figure below. When referencing the F distribution, the numerator
degrees of freedom are always given first, as switching the order of degrees of freedom
changes the distribution (e.g., F(10,12) does not equal F(12,10)).
___________________________________________________________________________
Worked example: F-test
We have obtained two sets of porosity measurements from a sandstone formation at two
different locations (5 and 11 samples). We are interested in determining if the variation in
porosity is the same in both areas. We will use a 95% level of significance. The two degrees
of freedom (n-1) are 4 and 10. What value of a statistic from the F4,10 distribution will be
exceeded with probability 0.05?
% Porosity measurements
a=[10.0 8.5 7.9 9.2 7.5]';
b=[10.5 7.9 8.7 7.3 10.4 8.8 7.7 9.4 10.4 8.3 9.2]';
% Compute F-statistic
av=(std(a))^2
bv=(std(b))^2
% the larger variance must be in the numerator
F=bv/av
p=0.95;
%use the inverse F distribution function qf to obtain the critical F
f=qf(0.95,10,4)
Thus, when the variability that we predict (between the two groups) is much greater than
the variability we do not predict (within each group), we conclude that our treatments
produce different results.
Levene's Test: Suppose that the sample data do not support the homogeneity-of-variance
assumption, but there is good reason to believe that the population variances are nevertheless
almost the same. In such a situation you may like to use Levene's modified test: in each
group, first compute the absolute deviation of the individual values from the median in that
group. Then apply the usual one-way ANOVA to the set of deviation values and interpret the
results.
The stixbox does not have an anova tool available, but both the (commercial) statistics
toolbox from Matlab and the free octave software do have anova functions built in, which are
fairly easy to use.
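A minimal sketch of Levene's (median-based) procedure, reusing the porosity data from the
F-test example above; it assumes the anova1 function from the Matlab statistics toolbox
(Octave's statistics package provides an equivalent).
a=[10.0 8.5 7.9 9.2 7.5]';
b=[10.5 7.9 8.7 7.3 10.4 8.8 7.7 9.4 10.4 8.3 9.2]';
da=abs(a-median(a));                  %absolute deviations from the group median
db=abs(b-median(b));
y=[da; db];
g=[ones(size(da)); 2*ones(size(db))]; %group labels
p=anova1(y,g)                         %one-way ANOVA on the deviation values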
8 STATISTICS BETWEEN TWO OR MORE VARIABLES
8.1 CORRELATIONS BETWEEN TWO OR MORE VARIABLES
8.1.1 Introduction
Correlation is a measure of the relation between two or more variables. The measurement
scales used should be at least interval scales, but other correlation coefficients are available to
handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value
of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect
positive correlation. A value of 0.00 represents a lack of correlation. The most widely used
type of correlation coefficient is Pearson r, also called linear or product-moment correlation.
Pearson correlation (hereafter called correlation), assumes that the two variables are
measured on at least interval scales, and it determines the extent to which values of the two
variables are "proportional" to each other. The value of correlation (i.e., correlation
coefficient) does not depend on the specific measurement units used; for example, the
correlation between height and weight will be identical regardless of whether inches and
pounds, or centimeters and kilograms are used as measurement units. Proportional means
linearly related; that is, the correlation is high if it can be "summarized" by a straight line
(sloped upwards or downwards).
This line is called the regression line or least squares line, because it is determined such
that the sum of the squared distances of all the data points from the line is the lowest possible.
Note that the concept of squared distances will have important functional consequences on
how the value of the correlation coefficient reacts to various specific arrangements of data (as
we will later see).
As mentioned before, the correlation coefficient (r) represents the linear relationship
between two variables. If the correlation coefficient is squared, then the resulting value (r2,
the coefficient of determination) will represent the proportion of common variation in the two
variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the
correlation between variables, it is important to know this "magnitude" or "strength" as well
as the significance of the correlation.
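As a minimal sketch, r and r² can be computed directly (the stixbox function corr offers an
alternative); the x and y vectors here are hypothetical.
x=[2.1 3.4 4.0 5.2 6.8 7.1 8.3]';
y=[1.9 3.0 4.4 4.9 6.5 7.4 8.0]';
r=sum((x-mean(x)).*(y-mean(y)))/((length(x)-1)*std(x)*std(y))   %Pearson r
r2=r^2        %coefficient of determination: proportion of common variation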
However, Monte Carlo studies suggest that meeting those assumptions closely is not
absolutely crucial if your sample size is not very small. It is impossible to formulate precise
recommendations based on those Monte Carlo results, but many researchers follow a rule of
thumb that if your sample size is 50 or more then serious biases are unlikely, and if your
sample size is over 100 then you should not be concerned at all with the normality
assumptions. There are, however, much more common and serious threats to the validity of
information that a correlation coefficient can provide; they are briefly discussed in the
following paragraphs.
___________________________________________________________________________
Worked example: Linear and polynomial regression
On ODP Leg 183 (Kerguelen Plateau) sediment velocity data were collected based on
both downhole geophysical logs and on laboratory measurements of core samples. The
question arises: How well do the two sets of measurements correlate?
We can only work with values collected below 80 m below sea floor, because the hole
was cased above this depth, which prevents us from collecting downhole log data there.
clear
load vel_log.dat
load vel_samp.dat
deplog=vel_log(:,1);
depsamp=vel_samp(:,1);
vellog=vel_log(:,2);
velsamp=vel_samp(:,2);
plot(vellog,-deplog,'o')
hold on
plot(velsamp,-depsamp,'+r')
title('Velocities from logs (blue) and from samples (red)');
xlabel('Velocity [km/s]')
ylabel('Depth [m]')
hold off
help interp1
% Common depth vector for the comparison (an assumed range below the cased interval)
depi=[80:1:350]';
vell=interp1(deplog,vellog,depi);
vels=interp1(depsamp,velsamp,depi);
figure
plot(vell,vels,'+');
Figure: velocity versus depth for the downhole log data (left panel, 'Log') and the core sample
data (right panel, 'Samples'); depths reach about -350 m and velocities range from 0 to 8 km/s.
linreg(vell,vels)
Figure: output of linreg(vell,vels) - interpolated sample velocities plotted against log velocities
with the fitted regression line.
A pointwise confidence band for the expected y-value is plotted, as well as a dashed line
which indicates the prediction interval given x (e.g. by default the dashed line encompasses
95% of the data).
Now add arguments for confidence interval and polynomial degree (from 1-3).
Which polynomial model fits the data? Also check out the identify stixbox function.
___________________________________________________________________________
1. The data entering the analysis are enumerative - that is, count data representing the
number of observations in each category or cross-category.
2. The data are measured and/or analyzed using a nominal scale of measurement.
3. The data are measured and/or analyzed using an ordinal scale of measurement.
4. The inference does not concern a parameter in the population distribution - as, for
example, the hypothesis that a time-ordered set of observations exhibits a random pattern.
5. The probability distribution of the statistic upon which the analysis is based is not
dependent upon specific information or assumptions about the population(s) from which the
sample(s) are drawn, but only on general assumptions, such as a continuous and/or
symmetric population distribution.
By this definition, the distinction of nonparametric is accorded either because of the level
of measurement used or required for the analysis, as in types 1 through 3; the type of
inference, as in type 4; or the generality of the assumptions made about the population
distribution, as in type 5. For example, one may use the Mann-Whitney rank test as a
nonparametric alternative to Student's t-test when one does not have normally distributed
data.
Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the
related samples t-test) (stixbox function test1r)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-
factor between-subjects ANOVA)
Friedman: To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)
Spearman's rank correlation coefficient: Rank correlation is useful if variables are not
normally distributed. For example, the depth or time ranges of the occurrence of a
particular fossil can be cross-correlated (stixbox function spearman; see the sketch below)
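A minimal sketch of a rank (Spearman-type) correlation computed by hand, which is the idea
the stixbox function spearman wraps; the x and y vectors (e.g. depths of first occurrence of
two fossils at several sites) are hypothetical, and the simple formula below assumes no tied
values.
x=[120 155 131 170 149 162]';
y=[118 160 127 175 150 158]';
[xs,ix]=sort(x); rx=zeros(size(x)); rx(ix)=(1:length(x))';   %ranks of x
[ys,iy]=sort(y); ry=zeros(size(y)); ry(iy)=(1:length(y))';   %ranks of y
d=rx-ry;
n=length(x);
rs=1-6*sum(d.^2)/(n*(n^2-1))      %Spearman's rank correlation coefficient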
10 DIRECTIONAL AND ORIENTED DATA
10.1 INTRODUCTION
Directional and oriented data abound in geology and geophysics. Examples of directional
data, for which the sense of direction is known, include:
1) Asymmetrical ripples
2) Flute marks
3) Faults (downthrown side known)
4) Belemnites
5) Gastropods
Examples of oriented (axial) data, for which only the orientation is known, include:
1) Symmetrical ripples
2) Grooves
3) Joints
4) Graptolites
5) Crinoid stems
% Dips and dip directions of faults measured on the 592 North Level at
% Brown's Creek Copper-Gold Mine, Blayney NSW.
fdip = [48 65 83 74 87 52 56 57 60 73 79 52 43 57 69 71 69 87 70 35 33 75 ...
87 72 15 34 60 63 59 59 64 60 42 40 80 97 79 69 83 60 79 58 66 16 78 69 76 ...
82 84 73 85 73 74 81 77 52 69 68 81 83 71 83 87 78 69 63 74 81 86 87 77 69 ...
72 74 19 86 81 74 37 31 74 79 84 85 85 79 86 82 73 75 76 78 69 86 72 35 47 ...
51 55 82 79 71 86 81 84 72 79 76 85 78 79 85 60 81 71 75 74 71 78 50 84 77 ...
41 53 74 60];
fdipdir = [330 070 084 350 275 081 084 324 075 050 069 001 016 098 085 095 ...
108 114 107 271 339 305 110 311 123 324 295 295 055 096 268 290 067 105 090 ...
087 119 092 280 086 089 088 094 280 302 047 042 274 107 114 314 098 349 088 ...
255 090 092 065 078 078 085 028 045 091 117 285 295 104 273 105 108 034 048 ...
079 145 300 297 103 081 087 079 173 072 071 283 060 256 270 259 249 087 093 ...
105 109 108 125 095 103 105 291 266 250 062 268 271 284 279 278 105 295 268 ...
292 278 096 293 109 107 314 160 108 273 245 118 313 286 112];
load browns_creek.dat
% Dip direction
fdipdir=browns_creek(:,2);
figure
grose3(fdipdir,24,1,0)
Figure: rose diagram of the Browns Creek fault dip directions produced by grose3(fdipdir,24,1,0).
figure
grose3(fdipdir,24,1,1)
Figure: rose diagram of the same data produced by grose3(fdipdir,24,1,1).
10.3 PLOTTING AND CONTOURING ORIENTED DATA ON STEREONETS
1) Equal-angle stereonet, also termed Wulff net. It maintains angular relationships within
the projection plane of the stereonet. For example, if the small circle intersection of a
cone with the lower hemisphere is plotted, on an equal-angle net the shape of this surface
will project as a perfect circle.
2) Equal-area stereonet, also termed Schmidt net. It maintains the proportion of the lower
hemisphere surface projected to the plane of the net. In other words, no preferred
alignment of data will be apparent if the data are truly random.
In effect, both nets preserve the angular relationships between lines and planes in three
dimensional space; however, when these elements are projected into the two dimensional
plane of the net diagram they are somewhat distorted on the equal area stereonet.
___________________________________________________________________________
Worked example: Plotting and contouring oriented data on stereonets
Now we'll use a couple of Middleton's scripts to create a Schmidt (equal area) net, and plot
the Browns Creek data.
schmidt
Snetplot
% Now the data can be contoured
vgcnt3(X);
pause
Snetplot will ask you for input interactively, i.e. you have to type in the data file name
(e.g. "browns_creek.dat"), and the symbol for the plot (e.g. '+b' will plot blue plus signs).
___________________________________________________________________________
$$ R = \frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}\sin\theta_i\right)^{2} + \left(\sum_{i=1}^{n}\cos\theta_i\right)^{2}} $$
___________________________________________________________________________
Worked example: Rayleigh's test
1) Calculate R based on above equation. You have to convert all angles to radians in
Matlab.
2) The critical value for R has to be taken from an appropriate table (see Swan and
Sandilands, 1995) for a given n and α=0.05. In this case, it is 0.27.
3) If the calculated value exceeds the critical value, we reject the null hypothesis (i.e. there is
a preferred trend). The direction of the preferred trend is found by:
$$ \bar{\theta} = \tan^{-1}\!\left(\sum\sin\theta \Big/ \sum\cos\theta\right) $$
%Rayleighs test
load ex_5_2_a_fa.dat;
fa=ex_5_2_a_fa;
histo(fa)
figure
na=length(fa);
% Convert the angles from degrees to radians
rfa=fa*pi/180;
ssrfa=(sum(sin(rfa)))^2;
scrfa=(sum(cos(rfa)))^2;
Rfa=(1/na)*(sqrt(ssrfa+scrfa))
% Direction of the preferred trend (atan2 resolves the correct quadrant)
dira=atan2(sum(sin(rfa)),sum(cos(rfa)));
dirtenda=(dira/pi)*180
___________________________________________________________________________
For more examples for directional data analysis see Swan and Sandilands (1995).
Worked example: Gridding data
% dmap is assumed to hold the data as three columns: x, y, z
ti = 0:0.25:6.5;
[xi,yi,zi] = griddata(dmap(:,1),dmap(:,2),dmap(:,3), ti, ti');
v = 700:25:950;
contour(ti, ti', zi, v), axis('square')
Now let's look at a mesh version of this in 3d, with a defined perspective, then add the data to
it (note, these commands are in the runfile "gd2.m"):
mesh(xi,yi,zi);
view(170,30);
hold on;
plot3(dmap(:,1), dmap(:,2), dmap(:,3), 'o');
hold off;
Notice the "view" command. Let's do a "help view" to understand what that defined. Again,
we can get rid of the default colourising of the grid by defining any colour as a new colour
map, such as:
colourmap(hot)
Now use triangle based cubic and nearest neighbor interpolation to grid the data and then plot
them again. Do you notice differences?
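A minimal sketch of the re-gridding, assuming dmap and ti from the example above; 'cubic'
requests triangle-based cubic interpolation and 'nearest' nearest-neighbour interpolation.
[xi,yi,zc] = griddata(dmap(:,1),dmap(:,2),dmap(:,3), ti, ti', 'cubic');    % triangle-based cubic
[xi,yi,zn] = griddata(dmap(:,1),dmap(:,2),dmap(:,3), ti, ti', 'nearest');  % nearest neighbour
subplot(1,2,1), mesh(xi,yi,zc), title('cubic')
subplot(1,2,2), mesh(xi,yi,zn), title('nearest')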
Type help interp2 to find out how it works. Then try the 'spline' option, plot the data again,
and evaluate the difference to the previous results.
___________________________________________________________________________
12 OVERVIEW OF COMPUTER INTENSIVE
STATISTICAL INFERENCE PROCEDURES
12.1 INTRODUCTION
Resampling procedures, also commonly referred to as computer intensive statistical
inference procedures, may be used to assess the significance of a statistic in a hypothesis test
or to determine the lower and upper bounds for a confidence interval when the usual
assumptions of parametric statistical procedures are not met (Manly, 1991). Computer
intensive procedures require the recomputation of hundreds or thousands of artificially
constructed data sets. Like other nonparametric statistical procedures, these procedures
existed as theory on paper long before they were brought into the practical mainstream. The
Monte Carlo method of resampling, for example, was introduced by Barnard in 1963 (Noreen,
1989), but at that time could only be illustrated and implemented operationally on very small
sample sizes.
However, with the advent of fast, inexpensive computing, essentially since around 1990,
the use of computer intensive procedures has grown dramatically, particularly in the area of
basic academic research. Actually, with the widespread availability of powerful personal
computers and free statistical software like the stixbox for Matlab and Octave that even
brings resampling-type methods right into the home, the name computer intensive seems
today to be as anachronistic as it was descriptive just a few years ago.
Computer intensive procedures are for probability estimation; that is, they are used to
calculate p-values for test statistics or lower and upper bounds for confidence intervals
without relying on classical inferential assumptions like normality of the sampling
distribution and the Central Limit Theorem (Noreen, 1989). The taxonomy of computer
intensive methods is difficult to define, primarily for three reasons:
1) there are many subtly different ways to perform each of the methods, with each way
leading to slightly different results and interpretations;
3) asymptotically, all the procedures are forms of each other and of the permutation test
(Noreen, 1989). Manly (1991) and Noreen (1989) concur on a taxonomy that divides
computer intensive procedures into two related yet unique streams: randomization
methods and Monte Carlo methods.
12.2.1 Introduction
Monte Carlo methods are most often used in simulation studies on computer-generated
data to show how a statistical test or estimation method performs when no
convenient real data exist. Monte Carlo methods are used to make inferences about the
population from which a sample has been drawn. The Monte Carlo methods are Monte Carlo
estimation, bootstrapping, the jackknife, and Markov Chain Monte Carlo estimation.
It’s important to note that Monte Carlo refers to the type of resampling process used to
produce a probability estimate, not the act of generating a data set. That is, it is possible to
computer-generate a data set, and numerous recombinations of it, without it being a Monte
Carlo procedure. For example, Hambleton, Swaminathan, and Rogers (1991) identify a
procedure for detecting differential item functioning in IRT-calibrated test items that uses one
or more simulations of test data to generate sets of item and examinee ability parameters for
comparison, but this is not truly a Monte Carlo procedure. On the other hand, Yen (1986), in
assessing the distributional qualities of Thurstonian absolute scaling, presents a method for
generating two simulation samples of data drawn at random from an assumed population. In
this case, even though only two samples are drawn at a time, this is, according to Noreen
(1989), considered a Monte Carlo method.
The steps for performing a Monte Carlo procedure, as given by Noreen (1989), are:
1. [Given a "real" sample of n observations from a defined population and a computed statistic of interest]
Identify a model of the population from which simulated samples are to be drawn.
2. Generate a large number N of simulated samples of size n, and compute the statistic of interest for each
sample.
3. Order the computed simulated sample statistics in a distribution, called the "Monte Carlo distribution" of
the statistic.
4. Map the "real" statistic to the Monte Carlo distribution; use the would-be percentile rank of the "real"
statistic to estimate its p-value.
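A minimal sketch of these four steps in Matlab, for the hypothetical case of testing a sample
mean against a normal population model; all values below are illustrative assumptions.
x=[5.1 4.7 5.6 5.9 4.8 5.4 5.2 6.1];     %"real" sample (hypothetical)
n=length(x);
tobs=mean(x);                            %statistic of interest
mu0=5.0;                                 %hypothesised population mean (the model)
N=1000;                                  %number of simulated samples
tsim=zeros(N,1);
for i=1:N
  xs=mu0+std(x)*randn(n,1);              %simulated sample drawn from the model
  tsim(i)=mean(xs);                      %statistic for each simulated sample
end
p=sum(tsim>=tobs)/N                      %one-sided Monte Carlo p-value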
The procedure for establishing a confidence interval for the "real" statistic in a Monte
Carlo procedure is analogous to that used in a randomization test, as described above. As with
randomization confidence intervals, a Monte Carlo confidence interval need not be
symmetrical.
12.2.3 Bootstrapping
According to legend, Baron Münchhausen saved himself from drowning in quicksand by
pulling himself up using only his bootstraps. The statistical bootstrap, which uses
resampling from a given set of data to mimic the variability that produced the data in
the first place, has a rather more dependable theoretical basis and can be a highly effective
procedure for estimation of error quantities in statistical problems.
Efron & Tibshirani (1993) provide the generic algorithm for performing a bootstrapping
procedure as follows:
1. [Given a sample of size n and a calculated sample statistic of interest] Draw a random "bootstrap" sample
of size n with replacement (that is, an observation, once drawn, may be drawn again), and calculate the
"bootstrap" statistic of interest from this sample.
3. Estimate the "bootstrap standard error" of the parameter of interest using the N bootstrap statistics as the
inputs for the usual standard error equation.
An estimate of the bias of the statistic of interest is obtained simply by subtracting the
mean bootstrap statistic from the original sample statistic. While this bias estimate may be
useful as a descriptive tool for readers (Efron, 1982), a problem arises from "adjusting out"
the bias: "the bootstrap bias estimator from a single sample contains an indeterminate amount
of random variability along with bias, and this may artificially inflate the mean squared error
of the statistic" (Mooney & Duval, 1993).
There are a variety of bootstrapping methods available; the shift method and the normal
approximation method are two popular methods (Noreen, 1989). According to Noreen, both
of these methods are frequently used "for estimating significance levels based on the
bootstrap sampling distribution … the ‘shift’ method assumes that the bootstrap sampling
distribution and the null hypothesis sampling distribution have the same shape but different
distributions… the ‘normal approximation’ method assumes that the null hypothesis sampling
distribution is normal; the bootstrap sampling distribution is used only to estimate the
variance of the normal distribution."
the population from which the sample is drawn." Monte Carlo studies (!) on simulated highly
nonnormal distributions have shown bootstrap estimates to be discomfortingly liberal with
respect to Type I error (Mooney & Duval, 1993). Currently, bootstrapping is most commonly
used to estimate population variances in the absence of conventional parametric estimation
assumptions (Noreen, 1989).
2. Drop out one subsample from the entire original sample. Calculate $\hat{\theta}_{-i}$, the statistic
recomputed with that subsample removed, from the reduced sample of size (g-1)h.
Since bootstrapping is considered a more general statistical procedure and works at least
as well as the jackknife in most situations, the jackknife is generally only of historical interest
today. An exception to this is in the area of identifying influential cases or strata in a
statistical model. Here, "the case subgroup or stratum pseudovalue can be used to determine
whether the subgroup or stratum has a greater-than-average effect on the overall parameter
estimate than other subgroups" (Mooney & Duval, 1993). This useful by-product of jackknife
estimation remains popular.
Applications of MCMC currently popular in the literature include Gibbs sampling and
data augmentation. Both of these have been used in the context of hierarchical modeling in
the social sciences. Gibbs sampling has been particularly popular in recent years, as interest in
modeling multilevel social phenomena has unearthed the problem of estimating level-wise
parameters when marginal posterior distributions of those parameters are unknown. This
problem is elegantly resolved by use of Gibbs sampling, in which Monte Carlo estimation is
performed on conditional posterior distributions of known form (e.g., "known" normals, as in
standard scores). Plots of Monte Carlo-estimated values then reveal the contours of the
unknown marginal posterior and joint posterior distributions of the parameters of interest, and
descriptions and inferences can be presented (Seltzer, 1993). For a complete introduction to
data augmentation, I refer the reader to Tanner & Wong (1987); for Gibbs sampling, Gelfand
& Smith (1990).
12.2.6 Meta-Analysis
Harwell (1990) provides an overview of many types of Monte Carlo studies done for the
purpose of illustrating the performance of certain statistical methods, such as single- and
multifactor ANOVA and ANCOVA, multiple regression, and hierarchical models. Harwell
reviews methods employed for synthesizing Monte Carlo studies, and criticizes the way
Monte Carlo studies are conducted without regard for some "overarching theory" to guide
interpretation. He presents a five-step strategy for summarizing Monte Carlo results that
includes specific problem formulation, data design and collection, data evaluation, analysis
and interpretation, and presentation of results. In a later article (1992), Harwell embellishes
this strategy specifically for one- and two-factor fixed effects ANOVA.
Seltzer (1993) draws the same conclusion about the use of Gibbs sampling with HMs,
providing more detail on what goes wrong when, under the standard assumptions of normality
in HM, "fat-tailed" data is simulated. Seltzer concludes that many MCMC studies may
already exist that mislead the reader as to the precision of the HM primarily because proper
treatment for platykurtic data has not been addressed.
Several other authors present Monte Carlo investigation studies for performance of
statistical estimation procedures: Bacon (1995) for performance of correlational outlier
identification methods over a variety of data distribution types; Bengt (1994) for testing of a
Bayesian process for filling in missing data for covariance structure modeling; Wolins (1995)
for comparing the speed and efficiency of maximum likelihood and unweighted least squares
estimation procedures in factor analysis, over many samples and with a variety of
distributions represented; and Finch, et al. (1997) for examining bias in the estimation of
indirect effects and their standard errors, using simulated skewed data, in structural equation
models using maximum likelihood estimation.
These studies represent the type of research, almost entirely investigative in nature, that
has arisen from the advances in Monte Carlo methods. While highly sophisticated and likely
inaccessible beyond the abstract and discussion to those not intimately familiar with the
procedures, these studies nevertheless convey the complex dimensionality and depth of
inquiry now achievable using these new methods and a good computer.
• MCMC and variations: Very promising for investigation of complex, multilevel social
phenomena, especially with HMs; provide both descriptive and inferential
information; great for use with missing data and/or latent variables;
• All methods: Easily adaptable (or perhaps even already adapted) to a user’s
substantive field;
Weaknesses
• All methods: Not yet widely available in commercial PC- or LAN-based statistical
packages, although more widely available for use on mainframe computers;
• All methods: Requires computer automation, distancing the user/student from the
procedure and potentially undermining the acquisition of conceptual understanding;
• Jackknife: Largely irrelevant, especially when bootstrapping procedures available;
same weaknesses as bootstrapping;
13 REFERENCES
13.1 GEOSCIENCES
Davis, J.C., 1973, Statistics and data analysis in geology, Wiley International, 550 pp.
Middleton, G.V., 1999, Data analysis in the earth sciences using Matlab, Prentice Hall.
Swan, A.R.H., and Sandilands, M., 1995, Introduction to geological data analysis, Blackwell
Science, 446 pp.
13.2 GENERAL
Akkermans, W. M. W. (1994). Monte Carlo estimation of the conditional Rasch model.
Research Report 94-09, Faculty of Educational Science and Technology, University of
Twente, The Netherlands.
Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of
Educational and Behavioral Statistics, 20(2), 115-147.
Edgington, E. S. (1987). Randomization tests (2nd Ed.). New York: Marcel Dekker.
Efron, B. (1982). The Jackknife, the Bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman
& Hall.
Finch, J. F., et al. (1997). Effects of sample size and nonnormality on the estimation of
mediated effects in latent variable models. Structural Equation Modeling, 4(2), 87-107.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling based approaches to calculating marginal
densities. Journal of the American Statistical Association, 85, 398-409.
Good, P. (1994). Permutation tests: A practical guide to resampling methods for testing
hypotheses. New York: Springer-Verlag New York.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage Publications.
Harwell, M. R. (1997). An investigation of the Raudenbush (1988) test for studying variance
homogeneity. Journal of Experimental Education, 65(2), 181-190.
Harwell, M. R. (1997). Analyzing the results of Monte Carlo studies in item response theory.
Educational and Psychological Measurement, 57(2), 266-279.
Harwell, M. R. (1990). Summarizing Monte Carlo results in methodological research. Journal
of Educational Statistics, 17(4), 297-313.
Harwell, M. R., Rubinstein, E.N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte
Carlo results in methodological research: The one- and two-factor fixed effects
ANOVA cases. Journal of Educational Statistics, 17(4), 315-339.
Ludbrook, J., & Dudley, H. (1998). Why permutation tests are superior to t and F tests in
biomedical research. The American Statistician, 52(2), 127-132.
Manly, B. F. J. (1986). Multivariate statistical methods: A primer. London: Chapman & Hall.
Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. London, U.K.:
Chapman & Hall
Marco, D. (2000). Building and managing the meta data repository: A full lifecycle guide.
John Wiley.
Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving
real-world problems. John Wiley.
Wolins, L. (1995). A Monte Carlo study of constrained factor analysis using maximum
likelihood and unweighted least squares. Educational and Psychological Measurement,
55(4), 545-557.
Yen, W.M. (1986). The choice of scale for educational measurement: An IRT perspective.
Journal of Educational Measurement, 23(4), 299-325.
14 APPENDIX
14.1 TABLES
14.1.2 Values of concentration parameter K from R for Rayleigh's test
14.1.3 Critical values of Spearman's rank correlation coefficient
14.2 STIXBOX CONTENTS
A statistics toolbox for Matlab and Octave.
Version 1.29, 10-May-2000
GNU Public Licence Copyright (c) Anders Holtsberg.
Comments and suggestions to andersh@maths.lth.se.
Distribution functions.
dbeta - Beta density function.
dbinom - Binomial probability function.
dchisq - Chisquare density function.
df - F density function.
dgamma - Gamma density function.
dhypg - Hypergeometric probability function.
dlognorm - The log-normal density function.
dnorm - Normal density function.
dt - Student t density function.
dweib - The Weibull density function.
dgumbel - The Gumbel density function.
Logistic regression.
ldiscrim - Compute a linear discriminant and plot the result.
logitfit - Fit a logistic regression model.
lodds - Log odds function.
loddsinv - Inverse of log odds function.
Various functions.
bincoef - Binomial coefficients.
cat2tbl - Take category data and produce a table of counts.
getdata - Some famous multivariate data sets.
quantile - Empirical quantile (percentile).
ranktrf - Rank transform data.
spearman - Spearman's rank correlation coefficient.
stdize - Standardize columns to have mean 0 and standard deviation 1.
corr - Correlation coefficient.
cvar - Covariance.
Resampling methods.
covjack - Jackknife estimate of the variance of a parameter estimate.
covboot - Bootstrap estimate of the variance of a parameter estimate.
stdjack - Jackknife estimate of the parameter standard deviation.
stdboot - Bootstrap estimate of the parameter standard deviation.
rboot - Simulate a bootstrap resample from a sample.
ciboot - Bootstrap confidence interval.
test1b - Bootstrap t test and confidence interval for the mean.
Graphics.
qqgamma - Gamma probability paper plot (and estimate).
qqnorm - Normal probability paper plot.
qqplot - Plot empirical quantile vs empirical quantile.
qqweib - Weibull probability paper plot.
qqgumbel - Gumbel probability paper plot.
kaplamai - Plot Kaplan-Meier estimate of survivor function.
linreg - Linear or polynomial regression, including plot.
histo - Plot a histogram (alternative to hist).
plotsym - Plot with symbols.
plotdens - Draw a nonparametric density estimate.
plotempd - Plot empirical distribution.
identify - Identify points on a plot by clicking with the mouse.
pairs - Pairwise scatter plots.
Statistical Calculators, provided at UCLA. Material here includes: Power Calculator, Statistical
Tables, Regression and GLM Calculator, Two Sample Test Calculator, Correlation and
Regression Calculator, and CDF/PDF Calculators.
http://www.stat.ucla.edu/calculators
External Links, by SPSS: free resources for SPSS, Excel, Word and more.
http://www.spss.org/wwwroot
Interactive Statistics, by the University of Illinois. Calculators include: Data, Correlations,
Scatter plot, Box Models, and a Chi-square applet.
http://www.stat.uiuc.edu/
Interactive Statistical Calculation, by John Pezzullo. Web pages that perform most common
statistical calculations.
http://members.aol.com/johnp71/javastat.html
Statistics: a guide to basic statistics labs, ANOVA, confidence intervals, regression, Spearman's
rank correlation, t-tests, simple least-squares regression, and discriminant analysis. The
Demos section contains a few interesting demonstrations, such as changing the parameters of
various distributions and the convergence of the t distribution to the normal.
http://www-stat.stanford.edu/
Statistics: The Study of Stability in Variation, by Jan de Leeuw, The Textbook has
components which can be used on all levels of statistics teaching. It is disguised as an
introductory textbook, perhaps, but many parts are completely unsuitable for introductory
teaching. Its contents are Introduction, Analysis of a Single Variable, Analysis of a Pair of
Variables, and Analysis of Multi-variables.
http://www.stat.ucla.edu/textbook
Selecting Statistics, Cornell University. Answer the questions therein correctly, then Selecting
Statistics leads you to an appropriate statistical test for your data.
http://trochim.human.cornell.edu/selstat/ssstart.htm
SURFSTAT Australia, by Keith Dear, Summarizing and Presenting Data, Producing Data,
Variation and Probability, Statistical Inference, Control Charts.
http://www.anu.edu.au/nceph/surfstat/surfstat-home/surfstat.html
Introduction to Quantitative Methods, by Gene Glass, A basic statistics course in the College
of Education at Arizona State University.
http://olam.ed.asu.edu/%7eglass/502/home.html
Sanda Kaufman's Teaching Resources contains teaching resources for a variety of topics,
including quantitative methods.
http://cua6.csuohio.edu/~sanda/teach.htm
Some experimental pages for teaching statistics, by Juha Puranen, contains some different
methods for visualizing statistical phenomena, such as power and Box-Cox transformations.
http://noppa5.pc.helsinki.fi/koe/index.htm
Statistical Home Page by David C. Howell, containing statistical material covered in the
author's textbooks (Statistical Methods for Psychology and Fundamental Statistics for the
Behavioral Sciences), but it will also be useful to others not using these books. It is always
under construction.
http://www.uvm.edu/~dhowell/StatPages/StatHomePage.html