ch02 DataPresVisualisation
1.1. Organising your material. Your organisation of material is the main task.
1.1.1. Document skeleton. Important ideas should be distinguished by (sub)section headings, not
buried in text. When writing your document or report in a top-down fashion, you first explicitly highlight
the main ideas, then develop these.
Once you have a rough understanding of the material that fits in each section (or chapter in a longer
document), try to write out the main ideas as the section headings, the breakdown of these as the
subsection headings, and so on, with a couple of “stake-in-the-ground” sentences in each (sub)section
for later expansion. Using physical materials (e. g., a blank wall/desk and post-its as mentioned in §3.1
below) can be very handy here.
This process will result in a skeleton of the document, and will feed into writing the document roadmap.
Writing at a higher level, and so focussing on the main points of the document, will help clarify your
own thought (seeing the wood instead of the trees) and aid in developing later presentations or other
texts based on your work.
An initial skeleton is provisional: you may well alter it and any roadmap section as you work more on
the chapter: there is nothing wrong with this. Over time, you will of course flesh it out.
It is very useful to have this skeleton present at as early a stage as possible, so that a reader (in particular
your co-authors) can form a high-level view of the emerging document. The helpfulness of including these
broad brushstrokes, however rough (perhaps with a health warning emphasising its provisional nature)
outweighs the dangers of giving a disjointed or partial picture.
A subsection of more than a couple of pages is probably too long: it may have more than one major
topic in it, and so warrant breaking up and treatment in several subsections.
2.1. Introduction. The function of the document’s introduction is to tell the reader what the
document is about: it is where you state and address your business and/or research problem(s). It does
this because having a road map before embarking on a detailed exposition creates a level of comfort for
the reader; but, more importantly, avoids the reader’s ‘reading in’ his/her own interpretation of what the
work is about and judging it against possibly false expectations. That is, you must manage the reader’s
expectations.
In it, you set the scene: describe the purpose and kind of your project, and state your research and/or
business question(s). It must accurately describe the scope of your work. An important purpose of the
introduction is to establish the significance of your study: why did you need to do this piece of work? It
can finish with your statement of objectives and/or with a brief statement of your main findings. Either
12 MIS10090
way, you must give the reader an idea of where the document is heading, so that he/she can follow the
development of the argument and its supporting evidence.
This is all it needs to do. It should not present any of the material at this stage (a common problem
when the introduction is written too early). There should be no “data” included, or analysis done here.
As with the conclusion, it is meta-discourse about the work done.
2.2. Literature Review and Methodology. You will probably have a very limited (if any)
Literature Review, but any sources used (for data or previous work you are building on) must be
referenced correctly.
You should have a short section called Methodology.
Here, you will need to provide enough information so that another researcher/worker could repeat your
work and reproduce your results if desired (excepting the provision of confidential data).
In statistical work, clearly indicate sample sizes, measures of central tendency, dispersion, etc., used,
and comment on survey response rates: what percentage of each population surveyed gave answers, and
which questions they answered; and consequent potential bias.
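As a minimal sketch of the kind of summary a Methodology section might report, the following Python uses only the standard library. The response values and the number of people surveyed are invented purely for illustration; they do not come from any study discussed here.

```python
import statistics

# Hypothetical data, invented for illustration only.
surveyed = 120                                # number of people invited to respond
responses = [3, 4, 4, 5, 2, 4, 3, 5, 4, 4]    # answers received (1 to 5 scale)

n = len(responses)                            # sample size
mean = statistics.mean(responses)             # central tendency
median = statistics.median(responses)         # central tendency, robust to outliers
stdev = statistics.stdev(responses)           # dispersion (sample standard deviation)
response_rate = 100 * n / surveyed            # percentage of those surveyed who answered

print(f"n = {n}, mean = {mean:.2f}, median = {median}, sd = {stdev:.2f}")
print(f"response rate = {response_rate:.1f}%")
```

A low response rate, as in this made-up case, is exactly the situation in which the potential for bias should be discussed explicitly.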
2.3. Results. Here you describe your results. State your findings clearly and simply. Present the
output data, summarised, distilled and condensed. The background to each result must be clear to the
reader.
Keep your writing as short and sweet as possible; however, it is possible to be over-concise.
Extract and describe any important trends. You cannot expect the readers to distil significant trends
from the data by themselves. This is your role, not theirs.
Combine your text, figures and tables to condense the data and show trends. Thus, figures and tables
will be very important in this chapter.
If you benchmark your methods against other “control” methods, you must provide here the results of
the comparisons (whether advantageous to your argument or not).
If your work involves testing empirical research hypotheses, you need to state
• whether your survey results allow you to reject (or not reject, as the case may be) the hypotheses,
• and at what confidence level.
2.4. Analysis/Discussion. Here you analyse and discuss your results, as found from your experi-
ments, surveys or other investigations. Make sure that you address your research aims and question(s).
Seek patterns in your results, and explain what you saw. Do your data allow you to give underlying
causes? Or can you only observe statistical regularities?
Your discussion must be firmly supported by the evidence presented in your results section. Thus, you
should refer briefly to your results to support the statements in your discussion. You may only make
conclusions that are directly supported by the evidence of your results, and may not extend beyond
those.
Data Analysis for Decision Makers 13
2.5. Conclusions. This section is a “wrap-up” for the document as a whole. It is not just a
summary. A small part may be a summary to make it self-contained but fundamentally it is:
• (conceptually) a reflection on the importance of what has been said and done in the rest of your
work; and
• (pragmatically) an assessment of the value of your work — a conversation with the examiners
about why it should be passed.
A reflection means that you the author(s) now step out of the plane of the document and look down on
it from above, and discuss its contents (with the examiners in mind as the audience). The reader should
not be left thinking “So what?”
There should be no new material introduced in the Conclusions chapter: it is tying together what you
have discussed in the previous parts.
Many people find it best to start in the middle of the document, with what is most natural, or what you
know best, e. g., with the literature review or business background.
Do not try to complete the Introduction first: you will end up rewriting it (maybe several times) because
your understanding of what it is about will change as you write. In practice, you will probably write
your document in a combination of:
Top down: where you start with the high level skeleton and progressively add levels of detail
(flesh); and
Bottom up: where you start with isolated snippets and progressively combine them into a whole.
3.1. Sentences and paragraphs. A trick some people like is to use post-it notes to record the
main idea in each paragraph of the existing writing. Then take the post-it notes somewhere else with
plenty of space (e. g., a white board or desk) and reorganise these first. The point of this is that working
directly with text is quite a challenge — you want to work with ideas first, get those right, and do word
processing later.
Every paragraph should state an idea (topic sentence), then provide support for that idea, then (possibly)
give examples. (There are variations, of course, but this is the basic logical structure.) Single sentence
paragraphs are usually bits of paragraphs that have lost their home. Such one sentence paragraphs are
a sure sign of problems with paragraph or section structure.
3.2. Logical structure versus appearance. It is very easy with modern tools to get caught up
in the form or appearance of your document. This advice1 is very relevant:
Concentrate on the structure of your document (defining a portion of your work as a section, subsection,
etc.) and afterwards use the tool to overlay the appearance according to the visual style you have chosen.
4. Writing Style
We will say more about graphics later, but every graphic should adhere to the same principles as all of
your writing: it should be
• informative,
• accurate (A),
• as concise as possible (B),
• clear (C),
• uncluttered but not overly information-sparse (B again), and
• never misleading (A again).
Your writing style should aim first and foremost to convey the message of your work to the reader.
Anything that gets in the way of that should be avoided. Eschew long-windedness. Write as simply and
as clearly as possible: there are some caveats here but clarity is your main objective. This applies to all
business writing.
Use the most precise word you can. Words like “good” and “nice” have such broad meanings as to be
almost meaningless. (If something is good, clarify what measure of goodness is being used.)
For more detail, see Strunk’s pithy book (Strunk and White, 1979), e. g., “Omit needless words!”, or
There are many other aspects of standard style, which are worth informing yourself of. For example, it
is common practice in a piece of text to use words for numbers up to ten, but numerals from 11 on (as
done in this sentence).
4.1. Some dos and don’ts. I suggest you write in the active voice, e. g., “we carried out a sur-
vey. . . ” rather than the passive voice: “a survey was carried out. . . ”. Why? — a series of sentences in
the passive voice can be very hard to read. There are some possible exceptions2, but be careful not to
overuse the passive voice.
Avoid acronyms and jargon as far as possible. In particular, an abstract or executive summary must
never contain either. Always explain an acronym on its first use, e. g., “A three-letter acronym (TLA)
is not always useful.”
In short, make the reader’s life easier. Think of documents you found clear and easy to read and why
you found them so. Conversely, think of documents you found unclear and hard to read (allowing for
possibly boring subject matter) and what made them so.
4.2. The golden rule: know your audience. Always ask yourself:
• Who is my audience?
• What background do they have?
• What do they know or not know?
• What is their context?
• What preconceived ideas might they have?
To be useful, analytics must not only analyse data but convey meaning to its audience; hence, we now
consider ways to organise and communicate quantitative information. Structures to convey meaning
include the main text of your document, tables and graphics (also called figures, particularly when
encapsulated with explanatory text such as a caption). A very useful, well-laid-out and readable reference
here is (Tufte, 2001): Edward Tufte (2001), The Visual Display of Quantitative Information, Graphics
Press, Cheshire, Connecticut.
Note: a graphic (or table) is only as good as the data it shows. If the data are dubious, no amount of
creativity will give a good graphic.
Recall that data can be
5.1. Figures and tables: commonalities. Figures (diagrams, artwork, etc.) and tables must be
neat, legible and not hand-drawn.
Each figure or table must be centred on the page and must have a caption. The caption must contain
enough information for the reader to understand the figure or table without needing to refer to the text.
Each figure or table must be self-contained, with all terms in it clearly defined. The text must explicitly
refer to and “talk us through” each one, explaining what it shows, how it fits in to your general argument
and drawing attention to significant features. Do not leave this work to the reader!
Figures are numbered consecutively, starting at 1. Tables are also numbered consecutively, with their
own counter, also starting at 1. Put the number and caption after the figure/table. The following
subsections say more on the use of tables and figures.
Organise your table so that entries of the same kind read down (in columns), not across.
Tables should have no vertical lines separating columns (see the Chicago Manual of Style) unless abso-
lutely necessary.
Choose units of measurement (e. g., thousands or millions) to avoid having an excessive number of digits.
Do not include columns of data that have the same value in each entry. If that value is important for
the table, give it in the caption.
Note that a single row or column of values is a list rather than a table.
          Number   Percentage
Male          30          61%
Female        19          39%
Total         49         100%
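The percentages in a table like this one can be derived from the counts rather than typed by hand, which avoids transcription slips. The sketch below assumes rounding to the nearest whole percent; the helper name `percentage` is hypothetical.

```python
# Counts taken from the table above.
counts = {"Male": 30, "Female": 19}
total = sum(counts.values())

def percentage(count, total):
    """Share of the total, rounded to the nearest whole percent."""
    return round(100 * count / total)

rows = [(label, n, percentage(n, total)) for label, n in counts.items()]
rows.append(("Total", total, percentage(total, total)))

# Entries of the same kind read down the columns; numbers are right-aligned.
print(f"{'':<8}{'Number':>8}{'Percentage':>12}")
for label, n, pct in rows:
    print(f"{label:<8}{n:>8}{pct:>11}%")
```

With other data, rounded percentages need not sum to exactly 100, so it is worth checking the total row before publishing a table.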
6.1. When to use a table. Tables are ideal for capturing a simple relationship between quanti-
tative data and the categories they relate to. The main strength of a table is that it is easy to look up
particular values.
Few (2012, 45) summarises the criteria for choosing a tabular display as
(1) The display will be used to look up individual values;
(2) It will be used to compare individual values but not entire series of values to one another;
(3) Precise values are required;
(4) The quantitative information to be communicated involves more than one unit of measure;
(5) Both summary and detail values are included.
Tufte (2001, 33) argues that
Small, noncomparative, highly labeled data sets usually belong in tables.
Our brains are well able to comprehend one, two or even a few values at the same time. The number
of things we can simultaneously grasp varies among people, with studies suggesting it is usually in the
range 5 ± 2. Tufte (2001, 56) says that “tables usually outperform graphics in reporting on small data
sets of 20 numbers or less.” In particular, graphics should rarely, if ever, be used to display very small
data sets; however, Few (2012, 50–51) gives an example of how the right, well-designed graphic may help
to tell a story about a very small data set.
6.2. When to use a graphic. The human eye-brain combination has evolved a great ability to
detect patterns in, and garner meaning from, visual depictions of information: graphics. Leverage this
where you can, provided it is appropriate. This visual perception differs from our sequential processing
of text and the rows and columns of tables, which is handled by our verbal (language) system.
One picture is worth ten thousand words. . .
— F. R. Barnard, “Printer’s Ink”, 10 March 1927
. . . or ten thousand numbers.
But remember our earlier comments (and more examples later): one picture may not be worth a few
numbers (even 20 or so).
The earliest known map is Babylonian, dating from about 2400–2200 BC. The Chinese map in Figure 2.2
(from c. the 11th century AD) is the first known to use grid lines.
Nicolas Oresme (1323-1381) plotted one variable against another (see Figure 2.3), now called Cartesian
xy co-ordinates after René Descartes (1596–1650).
Graphical representation of data can be traced back to Johann H. Lambert (1728–1777), a Swiss-German
mathematician5, and William Playfair (1759–1823), a Scottish political economist.
They introduced abstract forms of graphical display of data.
The idea of a relational graphic of statistical variables (e. g., standard linear regression, where a “best-
fitting” line is drawn to several observed points) goes back to Lambert (1765)6. Lambert says there7
We have in general two variable quantities x, y, which will be collated with one another
by observation, so that we can determine for each value of x, which may be considered
as an abscissa, the corresponding ordinate y. Were the experiments or observations
completely accurate, these coordinates would give a number of points through which
a straight or a curved line should be drawn. But as this is not so, the line deviates to
a greater or lesser extent from the observational points. It must therefore be drawn in
4This is reproduced in F.J.AnscombeFeb1973 GraphsinStatisticalAnalysis AmerStatnV27n1p17-21.pdf on the
Brightspace page.
5Lambert was the first to prove that π is irrational. He also studied non-Euclidean geometry (e. g., geometry on a sphere)
and map projections of the spherical earth onto a flat surface, inventing seven projections including the transverse Mercator
(widely used in maps) and Lambert conformal conic (widely used in aeronautical charts) projections. He introduced the
properties of conformality (projections that preserve angles) and equal area preservation and showed that they were mutually
exclusive.
6 Lambert, J. H. (1765) Beyträge zum Gebrauche der Mathematik und deren Anwendung.
7 As quoted in: Laura Tilling (1975). Early Experimental Graphs. British Journal for the History of Science. 8:204–205.
Figure 2.2. The map of the tracks of Yu the Great, engraved on stone, China, c. 1100 AD
such a way that it comes as near as possible to its true position and goes, as it were,
through the middle of the given points.
See Figure 2.4 for an example.
You will be familiar with other graphics where area depicts quantity, such as the bar chart (Figure 2.5),
time series (Figure 2.6) and pie chart (Figure 2.7), all invented by Playfair8.
• A bar chart displays (for the usual case of vertical bars) a numerical y-value (height of the
bar) corresponding to a discrete (either categorical or numerical) x-value; the bars have gaps
between them to emphasise the discreteness of the x-values;
• a time series displays a numerical y-value corresponding to an x-value denoting time;
• a pie chart displays relative proportion around the circumference of a disc (harder for our brains
to compare).
8The first two were introduced in his 1786 book The Commercial and Political Atlas; the last in his 1801 book The
Statistical Breviary.
Figure 2.4. An example plot of temperature variation and lags with depth, from Lam-
bert (1779) Pyrometrie.
Figure 2.5. An example of a Playfair bar chart, showing exports from and imports to
Scotland in 1781
Figure 2.6. An example where Playfair plots three parallel time-series: price of wheat,
weekly wages of a “Good Mechanic” (skilled labourer), and the reigns of British monarchs
Figure 2.7. A Playfair graphic showing population, tax revenue, etc., of the Great
Powers. This seems to be the first graphic to use multivariate data. It is also a bubble
plot (see later), in that the areas of the circles are proportional to the areas of nations,
and includes pie charts (to show the divisions of the Turkish and German Empires) and
lines (to show population sizes and taxes raised)
There are many other graphics that may be less familiar; some are presented here to give you ideas of
what can be done.
A Nightingale Rose chart (or Nightingale Coxcomb chart), e. g., Figure 2.8, from (Nightingale, 1858)9 is
a kind of pie chart with relative segment areas determined by radius rather than angle.10
A Sankey flow diagram shows flows (of energy, goods, etc) among parts of a system using links (often
depicted as arrows), with the thickness of the links proportional to the magnitude of the flow. It is
named after Irish engineer Captain Matthew H. P. R. Sankey, who used this kind of diagram in 1898 to show where energy
went in a steam engine.
The first use of a Sankey-type diagram appears to be by another Irishman, Lieutenant Henry D. Harness
of the Royal Engineers. He included maps “drawn to a new design” in an 1837 report (Figure 2.9) for
the Irish Railway Commissioners. These maps showed lines joining pairs of locations; each line’s width
was proportional to the average weekly number of travellers between the two locations. Many people
commuted from Kingstown (now Dún Laoghaire) to Dublin, so — rather than show an extremely wide
line — Harness showed a narrow black line and made a note in the text to explain it. Also, this is
a bubble plot (see below), since the size of the dot representing a given town is proportional to the
population of that town.
Colour coding can be helpful to indicate (roughly) a third dimension. For example, this is often done in
Self-Organising Maps (SOMs) where a colour “heatmap” indicates the value of a variable not included
in the SOM training. This can suggest clusters to the naked eye. See Figure 2.10.
9 Florence Nightingale (1858). Notes on matters affecting the health, efficiency and hospital administration of the
British army.
10 “We do not want impressions, we want facts. You complain that your report would be dry. The dryer the better.
Statistics should be the dryest of all reading.” — William Farr, Compiler of Abstracts in the General Registry Office,
responding to Nightingale’s Rose chart showing bad sanitation to be by far the greatest cause of death during the Crimean
War. Nightingale had compared his (innovative) civilian mortality tables to her data, and had shown that soldiers had
twice the mortality of civilians, even in peacetime. She became the first female fellow of the Statistical Society of London
(now Royal Statistical Society) in 1858.
Figure 2.9. Excerpt from Harness’s original “Sankey” flow diagram, 1837
Figure 2.10. Example of a Self-Organising Map (SOM) where a colour “heatmap” in-
dicates the value of a variable not included in the SOM training
8.1. Maps and plans. These show spatial aspects, as, e. g., in John Snow’s famous map in Fig-
ure 2.11. Figure 2.12 is an example of combining time-series and spatial information. Figures 2.13–2.14
are examples of combining virtual and spatial information. The most extensive data maps can place
millions of bits of information on a single page.
Figure 2.11. John Snow’s map of cholera deaths in the Broad Street area of London,
1854. This graphic efficiently testified about the data. It showed the clustering of deaths
around the water pump × in Broad Street, disproving the “bad air” theory of cholera
and persuading the council to remove the handle of the pump
Figure 2.12. Nathan Yau: Unemployment in the United States, 2004–2009. From
http://projects.flowingdata.com/america/unemployment
8.2. Illustrative diagrams. These portray a real object in simplified or schematic form. Such
images are models and so trade off realism and abstraction. Examples include diagrams of the human
eye in cross-section (Figure 2.15), or the covers of Haynes manuals (Figure 2.16).
Figure 2.13. Paul Butler: visualizing friendships in the facebook social graph,
2010. From http://www.facebook.com/notes/facebook-engineering/visualizing-
friendships/469716398919
8.3. Organisational diagrams. These show relationships among parts of real/abstract objects,
e. g., UML shapes class hierarchy as in Figure 2.17, or volumes of trade in crude oil (a Sankey flow
diagram) as in Figure 2.18.
8.4. Statistical graphics. These are some of the most important for us, so we dedicate all of the
next section to them.
Figure 2.15. A section through the human eye, with an enlarged schematic of the retina
Figure 2.16. Haynes manual schematics of a Land Rover (left) and the USS Enterprise
NCC-1701 (right)
9. Statistical graphics
Statistical graphics represent more abstract designs, including relational graphics such as function plots.
Examples of statistical graphics include
9.1. Histograms. A frequency distribution is a summary table of numerical data (usually continu-
ous) in which the data are grouped into numerically ordered nonoverlapping classes (also called categories
or bins).
The frequency (or number of occurrences) of a class is the number of observations (data items) falling
within that class.
You must carefully select (a) the appropriate number of classes, (b) a suitable fixed class width, and (c)
suitable boundaries of each class to avoid overlapping.
(a) The number k of classes depends on the number n of data items (typically more classes for a
larger number of items); k should be between 4 and 15 to aid human understanding: too many
classes defeats this. The “$2^k$” rule/guideline (also called Sturges’s rule) says:
• use k classes, where k is chosen as the smallest integer for which $2^k > n$; for example, if
you have n = 19 items, the smallest power of 2 greater than 19 is $32 = 2^5$, so we choose
k = 5.
This is a guideline, and we might choose k − 1 or k + 1 classes; but not a very different number
from k.
(b) The class width of each interval is the data range (highest value − lowest value) divided by the
number k of classes, i. e.,
\[ \frac{\max - \min}{k}. \]
(c) The class boundaries are where the classes meet: so classes touch but don’t overlap.
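The three choices above can be sketched in a few lines of Python. The function names are hypothetical, chosen here for illustration, and the sketch assumes equal-width classes that start at the lowest data value.

```python
def number_of_classes(n):
    """Smallest integer k for which 2**k > n (the "2^k" guideline)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def class_width(lowest, highest, k):
    """Data range divided by the number of classes."""
    return (highest - lowest) / k

def class_boundaries(lowest, highest, k):
    """The k + 1 points where equal-width classes meet, from lowest to highest."""
    width = class_width(lowest, highest, k)
    return [lowest + i * width for i in range(k + 1)]

print(number_of_classes(19))        # 5, matching the n = 19 example in (a)
print(class_boundaries(0, 10, 5))   # six boundaries for five classes
```

Because this is only a guideline, the result of `number_of_classes` is a starting point; k − 1 or k + 1 may suit a particular data set better.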
A histogram is a vertical “bar” plot of a frequency distribution over a range. Each bar represents a
class. There are no gaps between adjacent bars. The class boundaries (or class midpoints) are shown on
the horizontal axis. The vertical axis (showing height of bars) is either frequency, relative frequency, or
percentage.
Points to bear in mind when choosing classes for a histogram:
• Different class boundaries (e. g., from more or fewer classes) may give different pictures for the
same data (especially for smaller data sets);
• Shifts in data concentration may show up when different class boundaries are chosen;
• As the size of the data set increases, the impact of changes in the choices of class boundaries is
greatly reduced.
• A bar chart displays a y-value corresponding to a discrete x-value, e. g., the numbers of students
from particular countries in the class (here, the x-value would be a country [nominal level of
measurement] and the y-value is the corresponding number of students); or the probability of
a particular discrete outcome (see later). The bars are drawn so as not to touch each other,
emphasising the discreteness of the x-values;
• A histogram displays a y-value which is the frequency (number of occurrences) corresponding
to the range (width) of this class; that is, how many times did we see an observation within
this class. Thus the vertical bars are drawn to touch each other, since each covers a range of
x-values from the start to the end of the class it covers.
62.5, 63.4, 64.7, 66.5, 67.7, 68.4, 70.5, 70.9, 71.3, 72.5, 72.9, 73.8, 74.5, 74.9, 75.3, 75.7, 76.2, 76.8, 77.1,
77.6, 78.3, 78.7, 79.4, 79.8, 80.5, 81.8, 82.6, 83.7, 84.5, 86.9
then classed (grouped/binned) into six classes of 5-feet intervals, ordered by increasing tree height. For
each class, the number of trees in that class is counted; the number counted is called the frequency for
that class and is shown on the vertical axis.
This gives us a visual representation of how the heights are distributed over the set of trees: how many
trees of each height range there are. We will have much more to say about distributions later.
Table 2.2. Frequency distribution: six classes of great white shark lengths, giving the
upper and lower class boundaries and the number of items (frequency) in each class
In the sample in Figure 2.19, there are 30 observations, so by Sturges’s rule we would first consider using
5 classes since $2^5 = 32$, the first power of 2 which is bigger than the sample size of 30. However, the
authors decided to use 6 classes here, presumably for clarity and/or symmetry.
Example 2.1. Suppose that a marine biologist has measured the lengths in metres of 37 adult great
white sharks, and found that they were (in ascending order):
3.2, 3.5, 3.6, 3.9, 4.2, 4.4, 4.7, 4.8, 4.8, 5.0, 5.0, 5.1, 5.1, 5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 5.7, 5.9, 5.9, 5.9, 6.0,
6.2, 6.3, 6.5, 6.6, 6.8, 7.1, 7.2, 7.2, 7.4, 7.5, 7.5, 7.8, 8.0.
We first decide on the number of bins or classes: since there are 37 data items, we choose k, the number
of classes, to be the smallest number for which $2^k > 37$, that is11, k = 6.
Next, we identify the maximum length as 8.0 m and the minimum as 3.2 m, so the total range is 8.0 − 3.2 =
4.8 m. Dividing this into k = 6 equal intervals gives a class width of 0.8 m. Counting the number of items in
each class we get the frequency distribution (Table 2.2). The resulting histogram is shown in Figure 2.20.
From this, we see that there is not such a well-defined bell shape as in the previous example; but real
examples are rarely truly symmetrical.
11If you are familiar with logarithms, just find $\log_2 37 = 5.209453366$ and let k be the smallest integer greater than
this, i. e., k = 6.
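The counting step in Example 2.1 can be sketched in Python as follows. The source does not say which class an observation lying exactly on a boundary belongs to, so this sketch assumes that each class includes its lower boundary and that the last class also includes the maximum; a different convention would move a few items between neighbouring classes.

```python
# Shark lengths in metres, copied from Example 2.1.
lengths = [3.2, 3.5, 3.6, 3.9, 4.2, 4.4, 4.7, 4.8, 4.8, 5.0, 5.0, 5.1, 5.1,
           5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 5.7, 5.9, 5.9, 5.9, 6.0, 6.2, 6.3,
           6.5, 6.6, 6.8, 7.1, 7.2, 7.2, 7.4, 7.5, 7.5, 7.8, 8.0]

k = 6  # number of classes chosen in the example

# Work in tenths of a metre (integers) so that boundary comparisons are exact.
tenths = [round(10 * x) for x in lengths]
lowest, highest = min(tenths), max(tenths)
width = (highest - lowest) // k          # 8 tenths, i.e. 0.8 m

counts = [0] * k
for t in tenths:
    # Assumed convention: lower boundary included; maximum goes in the last class.
    index = min((t - lowest) // width, k - 1)
    counts[index] += 1

for i, c in enumerate(counts):
    lo = (lowest + i * width) / 10
    hi = (lowest + (i + 1) * width) / 10
    print(f"{lo:.1f} to {hi:.1f} m: {c}")
```

Several lengths (4.8, 5.6, 7.2) fall exactly on class boundaries, which is why the convention has to be stated alongside any frequency table.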
Figure 2.20. Example of a histogram of 37 great white shark lengths (different colours
have been used for each bin for clarity; usually all bins are shown in the same colour)
For more examples of histograms, e. g., skewed, bimodal, etc, see https://www.cuemath.com/data/histograms
In tutorials you will see how to create a histogram in Excel. The chart, title, axes, etc. can be modified
and formatted to best reflect the data under analysis. The histogram is treated as though it was any
other chart in Excel.
There are many other plot options available in tools such as Excel. Other commonly used plots are
Scatter Plots, Line Plots, Pie Charts, Stem and Leaf Displays and Cumulative graphs.
9.2. Scatter Plots. A scatter plot depicts a finite number of pairs of numerical values (x, y) as
isolated points on the plane. It has two axes, one for each variable x, y, and each observation is plotted
as a point — no lines or bars. In Figure 2.21, each point represents an observation (one person).
[Figure 2.21: scatter plot titled “Weight/Height relationship”; horizontal axis Weight (kg), vertical axis Height (cm).]
Figure 2.21. Scatter plot showing height and weight of a sample of adults
9.2.1. Bubble Plots. We mentioned that colour (or shading density) can represent a third dimension;
in fact, by varying the size of points on a scatter plot we can even get a fourth dimension, giving a
bubble plot as in Figure 2.22.
Figure 2.22. An example of a bubble plot, with axes showing number of burglaries
versus number of murders per 100,000 population. Every bubble is a US state: the size of
each bubble represents the population of the state; the colour is the number of larcenies.
From http://glowingpython.blogspot.ie/2011/11/how-to-make-bubble-charts-with.html
10.1. A famous example of a good graphic. A famous graphic is Charles Joseph Minard’s
1869 depiction of Napoleon’s 1812 campaign in Russia, ending in retreat from Moscow. See Figure 2.23
(Minard was over 80 when he made this). Minard shows seven kinds of information (in eight dimensions)
on a two-dimensional chart:
(1) the two-dimensional path of the army’s movement (from the Polish-Russian border to Moscow,
and back: follow the midpoints of the gold and black bands);
(2) direction of movement (colour coded: gold for outward and black for return);
(3) number of troops remaining (widths of the paths, as in a Sankey flow diagram);
(4) geographical information (names of rivers, towns, etc);
(5) names of major battles (locations along the troop movement path);
(6) date (for the return journey, as a separate plot to the bottom, directly under the corresponding
event on the march);
(7) temperature (also for the return journey, shown against date; in addition, occurrence of rain
[pluie] is given).
Figure 2.23. Minard’s 1869 depiction of Napoleon’s 1812 campaign in Russia shows:
the path of the army’s movement (follow midpoints of the gold and black bands); di-
rection of movement (gold for outward and black for return); and the number of troops
remaining (widths of the paths); as well as geographical information, names of battles,
dates, temperatures and rain occurrence on the return journey
• Clear, detailed and thorough labelling should be used to defeat graphical distortion and ambi-
guity. Write out explanations of the data on the graphic itself. Label important events in the
data.
• Show data variation, not design variation.
• In time series displays of money, deflated and standardized units of monetary measurement are
nearly always better than nominal units.
• The number of information-carrying (variable) dimensions depicted should not exceed the num-
ber of dimensions in the data.
• Graphics must not quote data out of context.
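The principle on deflated monetary units can be illustrated with a minimal sketch. It assumes a price index with a base-year value of 100; the function name `deflate` and all the figures are hypothetical and chosen only for illustration.

```python
def deflate(nominal, price_index, base=100.0):
    """Convert nominal money amounts to real (constant-price) amounts.

    `nominal` and `price_index` are equal-length sequences, and the index
    takes the value `base` in the reference year. Illustrative only.
    """
    if len(nominal) != len(price_index):
        raise ValueError("nominal and price_index must have the same length")
    return [amount * base / index for amount, index in zip(nominal, price_index)]


# Hypothetical data: spending doubles in nominal terms, but so do prices.
nominal_spend = [100.0, 150.0, 200.0]
price_index = [100.0, 150.0, 200.0]

real_spend = deflate(nominal_spend, price_index)
```

Plotted in nominal units, this made-up series appears to double; in deflated units it is flat, which is the more truthful picture of purchasing power.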
11.1. Low data density. Recall from §6 that graphics should not be used to display very small
data sets.
The graphic in Figure 2.24 is probably a record: it uses a lot of space and ink to tell us one number,
90%.
11.2. Misleading graphics. From the principles we’ve seen, a graphic should inform, not mislead,
the reader, and give a truthful view of the data.
Distortion can be accidental or deliberate. Deliberate distortions are usually an attempt to hide a feature
of the data.
Note that any medium of communication can be used to mislead. Graphics are no more vulnerable to
exploitation by liars than other media. Our intuition is usually a good graphical lie detector, spotting
frauds.
If the reader begins to suspect you of manipulating his/her impressions, it may cast doubt on all of your
work.
A graphic with low data density (little information for the page space it takes up) promises more than
it delivers and so may be regarded as misleading. Consider whether the data can be better represented
in a table.
11.2.1. Distortion in graphics. For most people, the perceived area of a disc grows more slowly than
the actual (physical) area, as the radius increases.¹³ Also, perceptions alter with experience; perceptions
are context-dependent; and they depend on what we have heard others say about the object.
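As a rough illustration of the first point, the sketch below applies the relation given in footnote 13 (perceived area equals actual area raised to a power c of about 0.8) to two discs. The function name is hypothetical, and the exponent is only an approximate, typical value.

```python
def perceived_area_ratio(radius_ratio, c=0.8):
    """Ratio of the perceived areas of two discs whose radii differ by radius_ratio.

    Uses perceived area = (actual area) ** c, with c of about 0.8 assumed.
    """
    actual_area_ratio = radius_ratio ** 2  # disc area grows with radius squared
    return actual_area_ratio ** c


actual = 2 ** 2                      # doubling the radius quadruples the area
perceived = perceived_area_ratio(2)  # but the disc looks only about 3 times larger
```

Under this assumption, a reader shown a disc with four times the area tends to see something closer to a threefold increase.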
Given these differences in perception, Tufte (2001, 56) suggests principles to enhance graphical integrity.
(1) Make the physical dimensions of numbers on the graphic directly proportional to the numerical
quantities represented.
(2) Defeat graphical distortion and ambiguity by clear, detailed and thorough labelling:
• Give explanations of the data on the graphic itself;
• Label important events in the data.
¹³ $\text{Perceived area} = (\text{actual area})^{c}$, where $c \approx 0.8$ (Tufte, 2001, 55).
Data Analysis for Decision Makers 35
Tufte (2001, 57) proposes a way to measure violations of the first principle:
\[
\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}.
\]
A Lie Factor between 0.95 and 1.05 indicates reasonably accurate representation of the underlying
numbers. A Lie Factor outside this range indicates substantial distortion, beyond minor plotting inaccuracy.
Most distortions overstate, with Lie Factors of two to five not uncommon.
Tufte (2001, 57) reprints an example from The New York Times, 9 August 1978, p. D-2 (Figure 2.25).
We can calculate the Lie Factor from the above formula:
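The worked calculation does not survive in this text, so what follows is only a sketch. It takes the "size of effect" to be the relative change between the first and last values, and it uses the figures usually quoted from Tufte's discussion of this graphic: a fuel economy standard rising from 18 to 27.5 miles per gallon, drawn as lines 0.6 and 5.3 inches long. Treat these inputs as assumptions and check them against Tufte (2001, 57); the function names are hypothetical.

```python
def relative_change(first, last):
    """Size of effect, taken here as the relative change from first to last."""
    return (last - first) / first


def lie_factor(graphic_first, graphic_last, data_first, data_last):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = relative_change(graphic_first, graphic_last)
    actual = relative_change(data_first, data_last)
    return shown / actual


def is_reasonably_accurate(factor, low=0.95, high=1.05):
    """True when the Lie Factor lies in the 0.95 to 1.05 band."""
    return low <= factor <= high


# Assumed inputs: line lengths in inches, then the standard in miles per gallon.
factor = lie_factor(0.6, 5.3, 18.0, 27.5)
accurate = is_reasonably_accurate(factor)
```

With these inputs the data rise by about 53% while the drawn lines grow by about 783%, giving a Lie Factor of roughly 14.8, far outside the acceptable band.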
Figure 2.25. A misleading graphic from “The New York Times”, 9 August 1978, p. D-2,
reprinted in (Tufte, 2001, 57)
Tufte states “Each part of a graphic generates visual expectations about its other parts, [so] these
expectations often determine what the eye sees. Deception results from the incorrect extrapolation of visual
expectations”. For example, in Figure 2.25
As a general principle, do not mix scales on the same axis. If a scale moves at regular intervals in one
part of the graphic, the eye expects it to move to the end in a coherent way.
Mixing design variation with data variation gives ambiguity and deception, as the eye may not distinguish
between them (Tufte, 2001, 61).
Our eye judges area and volume differently from distances. Distortion can arise from the inappropriate
use of linear scaling in every dimension when using area or volume to represent values. The eye confuses
design variation with data variation. Furthermore, the use of 3-dimensional “effects” often causes
distortions, even if not representing data values (see, for example, Figures 2.27 and 2.29 below).
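A minimal sketch of why linear scaling in every dimension distorts: if each length of a symbol is scaled by the data ratio, the drawn area grows with the square of that ratio and the drawn volume with the cube. The function names and the data ratio of 3 are illustrative assumptions.

```python
def shown_ratio(data_ratio, dimensions_scaled):
    """Apparent size ratio when this many lengths are each scaled by data_ratio."""
    return data_ratio ** dimensions_scaled


def side_scale_for_true_area(data_ratio):
    """Linear scale factor that makes the drawn area proportional to the data."""
    return data_ratio ** 0.5


data_ratio = 3.0                            # one value is three times the other
area_ratio = shown_ratio(data_ratio, 2)     # both width and height scaled
volume_ratio = shown_ratio(data_ratio, 3)   # width, height and depth scaled
corrected_area = side_scale_for_true_area(data_ratio) ** 2
```

Here a threefold difference in the data is drawn as a ninefold difference in area, or a 27-fold difference in volume. Scaling each side by the square root of the data ratio instead keeps the area proportional to the data.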
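Tufte (2001, 71) encapsulates this in the principle that the number of information-carrying (variable)
dimensions depicted should not exceed the number of dimensions in the data.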
Figure 2.26. A misleading graphic from “Time”, 9 April, 1979, p. 57, reprinted in
(Tufte, 2001, 62)
There are other ways graphics can mislead, which you should consider and avoid if possible. Be careful
with an xy plot having a greater vertical range than horizontal range (taller than it is wide). In particular,
if the horizontal axis shows time, you may give the impression of large variations with time. Is this what
you want to show? Would it mislead the reader? Variations can be further exaggerated by having the
vertical axis only show part of the range from 0 to max. Again, be cautious with this.
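To see how a truncated vertical axis exaggerates variation, the sketch below compares the ratio of two values with the ratio of their drawn heights when the axis starts above zero. The values and the function name are made up for illustration.

```python
def drawn_height_ratio(small, large, axis_min=0.0):
    """Ratio of drawn heights of two values on an axis that starts at axis_min."""
    if axis_min >= small:
        raise ValueError("axis_min must be below the smaller value")
    return (large - axis_min) / (small - axis_min)


# Hypothetical values: a 10% rise from 50 to 55.
full_axis = drawn_height_ratio(50.0, 55.0)             # axis starts at 0
truncated_axis = drawn_height_ratio(50.0, 55.0, 45.0)  # axis starts at 45
```

On the full axis the larger value is drawn 10% taller, matching the data. On the axis starting at 45 it is drawn twice as tall, so a modest change looks like a doubling.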
The moral is: take care with graphics, as with all statistical interpretation of data.
11.3. Chartjunk. Our guiding principle that a graphic should clearly portray complexity means
the graphic must not be more complex than the data it portrays.
Chartjunk is a term for things which introduce unnecessary complexity:
• Irrelevant optical art or decoration such as Moiré effects (cross-hatching gives a vibrating,
illusory, eye-straining effect and should be replaced by shades of grey)
• Grid (esp. dark lines): mute or suppress this “so that its presence is only implicit—lest it
compete with the data . . . Dark grid lines . . . carry no information, clutter up the graphic and
generate activity unrelated to data information” (Tufte, 2001, 112)
• Duck: “when a graphic is taken over by decorative forms or computer debris, . . . , when the
overall design purveys Graphical Style rather than quantitative information, then that graphic
may be called a duck ” (Tufte, 2001, 116)
– meaningless colours
– fake perspective and 3-d effects
Figure 2.27. The dangers of colour and/or 3-d effects: Percentage of Total College
Enrolment aged 25 or over (1972–1976), from “American Education”
11.4.2. Pie charts: avoid unless absolutely necessary. Many authorities, such as Bertin (1977); Tufte
(2001); Few (2012, 94), argue that pie charts have limited use at best, since they have low data density
and fail to order numbers along a visual dimension (the eye can judge linear ranges better); one of the
few valid uses being Figure 2.28.
Three-dimensional pie charts are an abomination and may be very misleading because of their tilt, as in
(the exaggerated) Figure 2.29.