Advanced Excel Formulas
Advanced Excel Statistical Functions and Formulae
Document No. IS-113 v1
A test of Association that allows the comparison of two values in a sample of data to determine
if there is any relationship between them.
UCL Information Systems
Data management
Although Excel doesn’t provide the sophisticated data coding techniques of a specialist statistical
application, there are useful methods for accomplishing some common data management tasks.
We can label column G Mean Result and then enter the following formula in cell G2
=sum(D2,E2,F2)/3
and then copy the formula using the fill handle down to row 31. This will calculate the average
exam score for each pupil.
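As a cross-check outside Excel, the same per-pupil calculation can be sketched in Python. The scores are hypothetical stand-ins for columns D, E and F, and None stands for an empty cell. The sketch also illustrates one difference worth knowing: SUM(...)/3 always divides by 3, whereas AVERAGE divides by the number of scores actually present.

```python
def sum_over_three(scores):
    """Mimic =SUM(D2,E2,F2)/3: empty cells add nothing, but the divisor stays 3."""
    return sum(s for s in scores if s is not None) / 3


def average(scores):
    """Mimic =AVERAGE(D2:F2): empty cells are left out of the count as well."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present)


# Hypothetical exam scores for two pupils (not taken from results.xls).
complete = [60, 70, 80]
one_missing = [60, None, 80]

print(sum_over_three(complete), average(complete))
print(sum_over_three(one_missing), average(one_missing))
```

If every pupil has all three scores the two approaches agree; they diverge only when a score is absent.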
Missing values
Sometimes you will not have a recorded observation or score for some case of a variable; that is,
there will be missing values. You then have to decide how to manage these cases. The usual
practice is either to choose a code to enter whenever a missing value is encountered or to
impute a value for the missing observation. Since Excel doesn’t have the sophisticated
recoding methods that specialist packages offer, you will have to code missing values
yourself in such a way that your analysis can still be carried out accurately.
Choose the codes for your missing values carefully. If you have numeric variables, remember that
Excel has no way to define a particular value as missing and thus exclude it from calculations.
Therefore, while you might be tempted to code a missing age as 999, if you do this and then
compute the mean age, Excel will include all your 999-year-olds. It may be wiser to use a text
string as the missing value, since functions such as AVERAGE normally skip text cells in the
ranges they are given.
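The effect of the two coding choices can be sketched in Python. This is an illustration only: the ages are hypothetical, and excel_like_average is a made-up helper that imitates how AVERAGE skips text cells in a range.

```python
def excel_like_average(cells):
    """Average only the numeric cells, skipping text, as AVERAGE does over a range."""
    numbers = [c for c in cells
               if isinstance(c, (int, float)) and not isinstance(c, bool)]
    return sum(numbers) / len(numbers)


# Hypothetical ages with one missing observation, coded in two different ways.
ages_coded_999 = [23, 31, 999, 27]
ages_coded_text = [23, 31, "missing", 27]

print(excel_like_average(ages_coded_999))   # the 999 code is treated as a real age
print(excel_like_average(ages_coded_text))  # the text code is left out
```

With the numeric code the mean is pulled far above any real age; with the text code only the three recorded ages contribute.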
Descriptive measures
Below is a list of common Excel functions used for descriptive statistical measures.
MEDIAN(range)     Calculates the median value for a data set; half the values in
                  the data set are greater than the median and half are less than
                  the median
SMALL(range,k)    Returns the kth smallest value in a specified data range
LARGE(range,k)    Returns the kth largest value in a specified data range
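For readers who want to check what SMALL and LARGE return, here is a minimal Python sketch. The function names small and large and the scores are hypothetical; k is counted from 1, as in Excel.

```python
def small(values, k):
    """Return the kth smallest value (k starts at 1), like SMALL(range,k)."""
    return sorted(values)[k - 1]


def large(values, k):
    """Return the kth largest value (k starts at 1), like LARGE(range,k)."""
    return sorted(values, reverse=True)[k - 1]


scores = [72, 45, 88, 61, 53]  # hypothetical data

print(small(scores, 2))  # second smallest
print(large(scores, 2))  # second largest
```

Note that small(values, 1) is the minimum and large(values, 1) is the maximum.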
Each of these can be accessed from the menu sequence Insert | Function, using the function
wizard, or by writing a formula in a cell.
Using the mouse, highlight the cells containing the data range just entered; alternatively, select
the data by first clicking the collapse icons.
These are the collapse icons and are used in
selecting ranges in many Excel dialogues.
Notice that as you fill in the ranges Excel previews the value that will result from applying the
function.
Click OK.
The value of the mean will now appear in the blank cell you selected in step 2.
To calculate the median or mode, follow the same procedure but highlight MEDIAN or MODE in
step 4. Alternatively, you can enter the formulae directly into spreadsheet cells as shown below. All
the statistical functions are accessed in the same way and have a similar interface.
Using formulae in cells to calculate descriptive statistical
measures
N
Before we calculate the measures of central tendency, we need to find out the value of N, the
number of subjects or observations. The way to do this in Excel is to use the Count() function over
the range of values. In the results spreadsheet, use Count() to find out the number of pupils, for
example =Count(D2:D31) for the scores in column D.
Mode
The syntax for this computation is
=Mode(Range)
Median
The syntax for this computation is
=Median(Range)
Mean
There is a built-in Excel function that returns the mean as its value:
=Average(Range)
It is often useful to put the result of this function into a suitably named cell in a spreadsheet.
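The four measures above can be cross-checked with Python's standard statistics module. This is a sketch with hypothetical scores, not data from the results spreadsheet; the data has a single mode so that no tie-breaking rule comes into play.

```python
import statistics

scores = [55, 62, 62, 70, 81]  # hypothetical scores

n = len(scores)                      # counterpart of =Count(Range)
mode = statistics.mode(scores)       # counterpart of =Mode(Range)
median = statistics.median(scores)   # counterpart of =Median(Range)
mean = statistics.mean(scores)       # counterpart of =Average(Range)

print(n, mode, median, mean)
```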
Measures of Dispersion
Range
The range of a sample is the largest score minus the smallest score. This can be calculated using the
Excel Formula
=(Max(A1:A10))-(Min(A1:A10))
Variance
The variance in a population is calculated as follows. We won’t build this equation ourselves in
Excel during this session but I give it here so that you can try it in your own time.
$$S^2 = \frac{\sum (x - \bar{x})^2}{N}$$
gives the population variance and
$$S^2 = \frac{\sum (x - \bar{x})^2}{N - 1}$$
gives the sample variance.
This formula depends upon first calculating $\bar{x}$ and N, which we have already seen above.
The Excel function to calculate the variance for a population is
varp(range)
And for a sample
var(range)
You can access both from the function wizard or use them by typing formulae in cells.
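To see the difference between the two divisors, the population and sample variance can be built by hand in Python and compared with the standard library. The function names and data are hypothetical; the only difference between the two functions is dividing by N or by N - 1.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N (like varp)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1 (like var)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(population_variance(data), statistics.pvariance(data))
print(sample_variance(data), statistics.variance(data))
```

The sample variance is always the larger of the two, and the gap narrows as N grows.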
The following worksheet contains the examination results for 14 students. The numbers in the
column headed Score Below form the bins array.
Before keying in the function, you must select the range of the array for the result. In this case it
will be F8:F17.
With this range selected, the following function is keyed into the Formula bar:
=FREQUENCY(C4:C17,E8:E17)
Press Ctrl+Shift+Enter to enter the function as an array formula.
The array is now filled with data. This data shows that no student scored below 30, 1 student scored
between 30 and 39, 3 between 40 and 49, 1 between 50 and 59, 3 between 60 and 69, 1 between 70
and 79, 3 between 80 and 89, and 2 scored between 90 and 100.
If any of the results are changed, the data in the No. In Range column will be updated automatically.
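The counting rule behind FREQUENCY can be sketched in Python: each value is counted in the first bin whose upper limit it does not exceed, and one extra count at the end collects anything above the highest bin. The scores and bin limits below are hypothetical, invented only so that the counts follow the pattern described above; they are not the worksheet's actual data.

```python
def frequency(data, bins):
    """Count values per bin; bins holds ascending upper limits.

    Returns len(bins) + 1 counts: the last one is for values above the top bin.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        for i, upper in enumerate(bins):
            if value <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # larger than every bin limit
    return counts


# Hypothetical scores for 14 students and hypothetical bin upper limits.
scores = [35, 42, 45, 48, 55, 61, 64, 68, 72, 81, 85, 88, 92, 97]
bins = [29, 39, 49, 59, 69, 79, 89, 100]

print(frequency(scores, bins))
```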
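$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$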
We would build a complicated formula like this incrementally, in steps, having broken it down into
its component parts, each of which can be written simply using standard Excel features. If we
have time, we will construct this formula in the training session.
Using an Excel function
=CORREL(A1:A15,B1:B15)
We will build this and see that the result from the hand built formula is more than tolerably close to
Excel’s result. When you have built it, you can compare your result with that in the spreadsheet
pearson.xls.
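The step-by-step construction can be sketched in Python. This assumes the hand-built formula is the usual computational form of Pearson's r, that is, (n·Σxy − Σx·Σy) divided by the square root of [n·Σx² − (Σx)²]·[n·Σy² − (Σy)²]. The data is hypothetical and is not the content of pearson.xls.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation built from its component sums, one step at a time."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


xs = [1, 2, 3, 4, 5]  # hypothetical paired observations
ys = [2, 4, 5, 4, 5]

print(round(pearson_r(xs, ys), 4))
```

Each intermediate sum corresponds to one helper cell or column in the spreadsheet version.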
Note that in order to find the acceleration, we must divide the slope by 2 and to find the initial
velocity, we must take the square root of the y-intercept.
Note that the CORREL( ) function was used to check that the data did display a linear trend;
otherwise, the slope and y-intercept values are meaningless! It is always a good idea to plot the data
as well as use these statistics functions, because trends are sometimes not obvious. Additionally, a
plot lets us visualize the data, so gross blunders and errant data points are easily
detected. The graph below tells us immediately that our data appear reasonable.
Enter your data as we did in columns B and C. The reason for this is strictly cosmetic as you will
soon see.
3. Finally, use the above components and the linear regression equations given in the previous
section to calculate the slope (m), y-intercept (b) and correlation coefficient (r) of the
data. The spreadsheet will look like that below. Note that our equations for the slope,
y-intercept and correlation coefficient are highlighted in yellow.
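As a rough cross-check of the slope and y-intercept, here is a Python sketch that assumes the regression equations are the standard least-squares ones: m = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²) and b = (Σy − m·Σx) / n. The function name and data points are hypothetical. The correlation coefficient can be obtained from the same component sums.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical points lying exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

m, b = least_squares_line(xs, ys)
print(m, b)
```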
Goal seeking
Excel has a number of ways of altering conditions on the spreadsheet and making formulae produce
whatever result is required. Excel can also forecast what conditions on the spreadsheet would be
needed to optimise the result of a formula. For instance, there may be a profits figure that needs to
be kept as high as possible, a costs figure that needs to be kept to a minimum, or a budget constraint
that has to equal a certain figure exactly. Usually, these figures are formulae that depend on a great
many other variables on the spreadsheet. Therefore, you would have to do an awful lot of trial-and-
error analysis to obtain the desired result. Excel can, however, perform this analysis very quickly to
obtain optimum results. The Goal Seek command can be used to make a formula achieve a certain
value by altering just one variable. The Solver can be used for more painstaking analysis where
many variables could be adjusted to reach a desired result. The Solver can be used not only to
obtain a specific value, but to maximise or minimise the result of a formula (e.g. maximise profits
or minimise costs).
Goal Seek
The Goal Seek command is used to bring one formula to a specific value. It does this by changing
one of the cells that is referenced by the formula. Goal Seek asks for a cell reference that contains a
formula (the Set cell). It also asks for a value, which is the figure you want the cell to equal. Finally,
Goal Seek asks for a cell to alter in order to take the Set cell to the
required value.
In this example, cell B6 contains a formula that sums Costs and
Salaries. Cell B9 contains a Profits formula based on the Income figure,
minus the Total Costs.
A user may want to see how a profit of £6,000.00 can be achieved by
altering Salaries.
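The idea behind goal seeking can be sketched in Python: repeatedly adjust one input until the formula reaches the target. The sketch uses simple bisection purely for illustration; it is not a description of the search method Excel itself uses. The Income, Costs and Salaries figures are hypothetical, since the worksheet's actual values are not reproduced here.

```python
def goal_seek(formula, target, low, high, tol=1e-9, max_iter=200):
    """Find an input between low and high where formula(input) equals target."""
    f_low = formula(low) - target
    f_high = formula(high) - target
    if f_low * f_high > 0:
        raise ValueError("target is not bracketed by low and high")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = formula(mid) - target
        if abs(f_mid) < tol or (high - low) / 2 < tol:
            return mid
        if f_low * f_mid <= 0:
            high = mid              # the solution lies in the lower half
        else:
            low, f_low = mid, f_mid  # the solution lies in the upper half
    return (low + high) / 2


INCOME = 20000.0  # hypothetical figure
COSTS = 5000.0    # hypothetical figure


def profit(salaries):
    """Profit = Income minus Total Costs, where Total Costs = Costs + Salaries."""
    return INCOME - (COSTS + salaries)


# Which Salaries figure gives a profit of 6,000?
salaries = goal_seek(profit, target=6000.0, low=0.0, high=20000.0)
print(round(salaries, 2))
```

Here profit plays the role of the Set cell, 6000 is the value to reach, and salaries is the cell being changed.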
Solver parameters
The Solver needs quite a lot of information in order for it to be able to come up with a realistic
solution. These are the Solver parameters.
Constraints
Constraints prevent the Solver from coming up with unrealistic solutions.
This dialog box asks you to choose a cell whose value will be kept within certain limits. It can be any cell or
cells on the spreadsheet (simply type the reference or select the range).
This cell can be subjected to an upper or lower limit, made to equal a specific value or forced to be a whole
number. Use the drop-down arrow in the centre of the Constraint box to see the list of choices: to set an
upper limit, click on the <= symbol; for a lower limit, >=; the = sign for a specific value and the int option
for an integer (whole number).
Once the OK button is chosen, the Solver Parameter dialog box displays again and the constraint appears in
the window at the bottom. This constraint can be amended using the Change button, or removed using the
Delete button.
IMPORTANT
When maximising or minimising a formula value, it is important to include constraints, which set
upper or lower limits on the changing values. For instance, when maximising Profits by changing
Income figures, the Solver could conceivably increase these figures to infinity. If the Income figures
are not limited by an upper constraint, the Solver will return an error message stating that the cell
values do not converge. Similarly, minimising total costs could be achieved by making one of the
contributing costs, i.e. Salaries and Costs, infinitely less than zero. A constraint should be included,
therefore, to set a minimum level on these values.
When Solve is chosen, the Solver carries out its analysis and finds a solution. This may be
unsatisfactory. Further constraints could now be added to force the Solver to increase salaries or
costs etc.
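The role of constraints can be illustrated with a toy Python search. This is not the Solver's algorithm: it is an exhaustive search over a small grid of hypothetical candidate values, written only to show that lower limits are what stop a minimisation from driving the values as low as the candidates allow.

```python
from itertools import product


def minimise_total_cost(salary_options, cost_options, constraints):
    """Try every combination and keep the cheapest one that meets all constraints."""
    best = None
    for salaries, costs in product(salary_options, cost_options):
        if all(check(salaries, costs) for check in constraints):
            total = salaries + costs
            if best is None or total < best[0]:
                best = (total, salaries, costs)
    return best


# Hypothetical candidate values, in steps of 500.
salary_options = range(0, 20001, 500)
cost_options = range(0, 10001, 500)

# Hypothetical lower limits, playing the part of >= constraints in the Solver.
constraints = [
    lambda salaries, costs: salaries >= 8000,
    lambda salaries, costs: costs >= 3000,
]

print(minimise_total_cost(salary_options, cost_options, constraints))
print(minimise_total_cost(salary_options, cost_options, []))  # no constraints
```

Without the constraints the search simply returns the lowest candidates on offer, which is the grid-bound analogue of the Solver pushing costs ever lower.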
Anova
An ANOVA is a guide for determining whether or not an event was most likely due to the random
chance of natural variation. Or, conversely, the same method provides guidance in saying with a
95% level of confidence that a certain factor (X) or factors (X, Y, and/or Z) were the more likely
reason for the event.
Once you are sure you have the Analysis ToolPak installed, open the file results.xls. We would like
to know if there is any significant difference between the mean scores in the three subjects, English,
History and Maths. We can’t use a Student's t-test because that test only compares two groups of
scores.
The F ratio is the test statistic produced by an ANOVA: the variance between the groups divided
by the variance within them. It was named for Fisher. The
orthogonal array and the Results Project, DMAIC designed experiment's cube were also his
inventions.
An ANOVA can be, and ought to be, used to evaluate differences between data sets. It can be used
with any number of data sets, recorded from any process. The data sets need not be equal in size.
Data sets suitable for an ANOVA can range from as few as three or four numbers to arbitrarily
large sets of numbers.
Here is how you could use an Excel ANOVA to determine who is the better bowler. An ANOVA
can be used to compare any scores: lengths of stay, days in AR, the number of phone calls,
readmission rates, stock prices and any other measure are all fair game. Below are
six game scores for three bowlers. Which bowler is best? If there is a best bowler, is the difference
between bowlers statistically significant?
Step 1. Recreate the columns using Excel. Each bowler's name is the field title.
Step 2. Go to Tools and select Data Analysis as shown. If Data Analysis does not appear as the last
choice on the list on your computer, you must click Add-Ins and tick the Analysis ToolPak option.
Step 5. Interpret the probability results by evaluating the F ratio. If the F ratio is larger than the F
critical value, F crit, there is a statistically significant difference. If it is smaller than the F crit value,
the score differences are best explained by chance.
The F ratio, 12.57, is larger than the F crit value, 3.68. Mark is the better bowler: the difference
between him and the other two bowlers is statistically significant. Excel automatically calculated
the average, the variance (which is the standard deviation, s, squared) and the essential probability
information. You can use this technique to compare physicians, nurses, hospital lengths of
stay, revenue, expense, supply cost, days in accounts receivable, or any other factor of interest.
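To show where the F ratio comes from, here is a Python sketch of a one-way ANOVA built from sums of squares. The bowlers' scores are hypothetical, not the ones in the worksheet, and the function name is made up. The F crit value of 3.68 is the one quoted above, which applies to three groups of six scores.

```python
def one_way_anova_f(groups):
    """Return the F ratio: between-group variance over within-group variance."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # Variation of each group's mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each score around its own group's mean.
    ss_within = sum((s - sum(g) / len(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical scores: six games for each of three bowlers.
bowler_a = [150, 160, 155, 165, 158, 162]
bowler_b = [152, 149, 161, 157, 163, 154]
bowler_c = [185, 190, 178, 182, 195, 188]

f_ratio = one_way_anova_f([bowler_a, bowler_b, bowler_c])
F_CRIT = 3.68  # critical value quoted in the text for three groups of six
print(f_ratio, f_ratio > F_CRIT)
```

A large F ratio means the group means differ by more than the spread within each group would explain by chance.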
These workbooks are available for students at the Help Desk.
Online learning
There is also a comprehensive range of online training available via TheLearningZone at:
www.ucl.ac.uk/elearning
Getting help
The following faculties have a dedicated Faculty Information Support Officer (FISO) who works
with faculty staff on one-to-one help as well as group training, and general advice tailored to your
subject discipline:
Arts and Humanities
The Bartlett
Engineering
Maths and Physical Sciences
Life Sciences
Social & Historical Sciences
See the faculty-based support section of the www.ucl.ac.uk/is/fiso Web page for more details.
A Web search using a search engine such as Google (www.google.co.uk) can also retrieve helpful
Web pages. For example, a search for "Excel tutorial" would return a useful selection of tutorials.