
Pcs - 2


Chapter 2

Background Mathematics

This section will attempt to give some elementary background mathematical skills that
will be required to understand the process of Principal Components Analysis. The
topics are covered independently of each other, and examples are given. It is less important
to remember the exact mechanics of a mathematical technique than it is to understand
the reason why such a technique may be used, and what the result of the operation tells
us about our data. Not all of these techniques are used in PCA, but the ones that are not
explicitly required do provide the grounding on which the most important techniques
are based.
I have included a section on Statistics which looks at distribution measurements,
or, how the data is spread out. The other section is on Matrix Algebra and looks at
eigenvectors and eigenvalues, important properties of matrices that are fundamental to
PCA.

2.1 Statistics
The entire subject of statistics is based around the idea that you have this big set of data,
and you want to analyse that set in terms of the relationships between the individual
points in that data set. I am going to look at a few of the measures you can calculate on a set
of data, and what they tell you about the data itself.

2.1.1 Standard Deviation


To understand standard deviation, we need a data set. Statisticians are usually concerned
with taking a sample of a population. To use election polls as an example, the
population is all the people in the country, whereas a sample is a subset of the population
that the statisticians measure. The great thing about statistics is that by only
measuring (in this case by doing a phone survey or similar) a sample of the population,
you can work out what is most likely to be the measurement if you used the entire population.
In this statistics section, I am going to assume that our data sets are samples
of some bigger population. There is a reference later in this section pointing to more
information about samples and populations.
Here’s an example set:

I could simply use the symbol $X$ to refer to this entire set of numbers. If I want to
refer to an individual number in this data set, I will use subscripts on the symbol $X$ to
indicate a specific number. Eg. $X_3$ refers to the 3rd number in $X$, namely the number
4. Note that $X_1$ is the first number in the sequence, not $X_0$ like you may see in some
textbooks. Also, the symbol $n$ will be used to refer to the number of elements in the
set $X$.
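As an aside on notation versus code: the subscripts here count from 1, while Python lists count from 0. The sketch below is only an illustration of that mapping. The list values are made up (chosen so that the 3rd element is 4, as in the text), and the helper name `element` is hypothetical.

```python
# Hypothetical data set: the values are illustrative, not taken from the text.
X = [1, 2, 4, 6, 12]

def element(data, i):
    """Return X_i using the text's 1-based subscript convention."""
    if not 1 <= i <= len(data):
        raise IndexError("subscript must be between 1 and n")
    # Python lists are 0-based, so X_i lives at position i - 1.
    return data[i - 1]

n = len(X)             # n is the number of elements in the set
third = element(X, 3)  # X_3, the 3rd number in X
print(third, n)
```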
There are a number of things that we can calculate about a data set. For example,
we can calculate the mean of the sample. I assume that the reader understands what the
mean of a sample is, and will only give the formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Notice the symbol $\bar{X}$ (said “X bar”) to indicate the mean of the set $X$. All this formula
says is “Add up all the numbers and then divide by how many there are”.
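The same "add up and divide" recipe can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the data values are made up and the function name `mean` is just a convenient label.

```python
def mean(data):
    """Add up all the numbers, then divide by how many there are."""
    if not data:
        raise ValueError("mean of an empty data set is undefined")
    return sum(data) / len(data)

# Hypothetical data set: the values are illustrative, not taken from the text.
X = [1, 2, 4, 6, 12]
x_bar = mean(X)
print(x_bar)
```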
Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of
middle point. For example, these two data sets have exactly the same mean (10), but
are obviously quite different:

So what is different about these two sets? It is the spread of the data that is different.
The Standard Deviation (SD) of a data set is a measure of how spread out the data is.
How do we calculate it? The English definition of the SD is: “The average distance
from the mean of the data set to a point”. The way to calculate it is to compute the
squares of the distance from each data point to the mean of the set, add them all up,
divide by $n-1$, and take the positive square root. As a formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
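In code, the recipe just described might look like the sketch below. It assumes the sample form of the calculation (dividing by $n-1$). The two data sets are hypothetical, chosen only because both have a mean of 10 while being spread out very differently, and the name `sample_std` is an illustrative choice.

```python
import math

def sample_std(data):
    """Sample standard deviation: square root of the summed squared
    distances from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    x_bar = sum(data) / n
    squared_distances = sum((x - x_bar) ** 2 for x in data)
    return math.sqrt(squared_distances / (n - 1))

# Hypothetical data sets with the same mean (10) but different spread.
A = [0, 8, 12, 20]
B = [8, 9, 11, 12]
sd_a = sample_std(A)
sd_b = sample_std(B)
print(round(sd_a, 4), round(sd_b, 4))
```

The more widely spread set gives the larger value. Python's standard library function `statistics.stdev` computes the same $n-1$ quantity, so it can serve as a cross-check.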

Where $s$ is the usual symbol for standard deviation of a sample. I hear you asking “Why
are you using $(n-1)$ and not $n$?”. Well, the answer is a bit complicated, but in general,
if your data set is a sample data set, ie. you have taken a subset of the real-world (like
surveying 500 people about the election) then you must use $(n-1)$ because it turns out
that this gives you an answer that is closer to the standard deviation that would result
if you had used the entire population, than if you’d used $n$. If, however, you are not
calculating the standard deviation for a sample, but for an entire population, then you
should divide by $n$ instead of $(n-1)$. For further reading on this topic, the web page
http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard
deviation in a similar way, and also provides an example experiment that shows the
