
ch02 DataPresVisualisation

Uploaded by

Herman Herman
Copyright
© All Rights Reserved

CHAPTER 2

Presenting and Visualising Data

1. Writing a report on data analysis

We now give a brief overview of some ideas on writing and graphics.


The output of a data analysis exercise is likely a technical document that aims to make a case: to
persuade. Much of your career may involve producing such texts, if not on the same scale.
Far too much business writing is a meandering muddle of jargon, waffle, puff and platitudes. If your
writing is clear, to the point, explanatory and persuasive, it will stand out from the crowd.

1.1. Organising your material. Your organisation of material is the main task.
1.1.1. Document skeleton. Important ideas should be distinguished by (sub)section headings, not
buried in text. When writing your document or report in a top-down fashion, you first explicitly highlight
the main ideas, then develop these.
Once you have a rough understanding of the material that fits in each section (or chapter in a longer
document), try to write out the main ideas as the section headings, the breakdown of these as the
subsection headings, and so on, with a couple of “stake-in-the-ground” sentences in each (sub)section
for later expansion. Using physical materials (e. g., a blank wall/desk and post-its as mentioned in §3.1
below) can be very handy here.
This process will result in a skeleton of the document, and will feed into writing the document roadmap.
Writing at a higher level, and so focussing on the main points of the document, will help clarify your
own thought (seeing the wood instead of the trees) and aid in developing later presentations or other
texts based on your work.
An initial skeleton is provisional: you may well alter it and any roadmap section as you work more on
the chapter: there is nothing wrong with this. Over time, you will of course flesh it out.
It is very useful to have this skeleton present at as early a stage as possible, so that a reader (in particular
your co-authors) can form a high-level view of the emerging document. The helpfulness of including these
broad brushstrokes, however rough (perhaps with a health warning emphasising its provisional nature)
outweighs the dangers of giving a disjointed or partial picture.
A subsection of more than a couple of pages is probably too long: it may have more than one major
topic in it, and so warrant breaking up and treatment in several subsections.

2. The main parts of a data analysis document

2.1. Introduction. The function of the document’s introduction is to tell the reader what the
document is about: it is where you state and address your business and/or research problem(s). Having a
road map before embarking on a detailed exposition puts the reader at ease; more importantly, it stops
the reader ‘reading in’ his/her own interpretation of what the work is about and judging it against
possibly false expectations. That is, you must manage the reader’s
expectations.
In it, you set the scene: describe the purpose and kind of your project, and state your research and/or
business question(s). It must accurately describe the scope of your work. An important purpose of the
introduction is to establish the significance of your study: why did you need to do this piece of work? It
can finish with your statement of objectives and/or with a brief statement of your main findings. Either
way, you must give the reader an idea of where the document is heading, so that he/she can follow the
development of the argument and its supporting evidence.
This is all it needs to do. It should not present any of the material at this stage (a common problem
when the introduction is written too early). There should be no “data” included, or analysis done here.
As with the conclusion, it is meta-discourse about the work done.

2.2. Literature Review and Methodology. You will probably have a very limited (if any)
Literature Review, but any sources used (for data or previous work you are building on) must be
referenced correctly.
You should have a short section called Methodology.
Here, you will need to provide enough information so that another researcher/worker could repeat your
work and reproduce your results if desired (excepting the provision of confidential data).
In statistical work, clearly indicate sample sizes, measures of central tendency, dispersion, etc., used,
and comment on survey response rates: what percentage of each population surveyed gave answers, and
which questions they answered; and consequent potential bias.
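As a minimal sketch of the reporting described above, the following Python snippet computes a response rate, a sample size, two measures of central tendency and one of dispersion. The survey figures and the names `describe_sample`, `responses` and `surveyed` are hypothetical, invented purely for illustration; your own Methodology section should report whichever measures you actually used.

```python
import statistics

def describe_sample(values, surveyed):
    """Summarise a sample for a Methodology section.

    values   -- numeric answers actually received (hypothetical data)
    surveyed -- number of people who were asked
    """
    n = len(values)
    return {
        "sample_size": n,
        "response_rate_pct": 100 * n / surveyed,
        "mean": statistics.mean(values),      # central tendency
        "median": statistics.median(values),  # central tendency, robust to outliers
        "std_dev": statistics.stdev(values),  # dispersion (sample standard deviation)
    }

# Hypothetical example: 40 people surveyed, 8 usable answers on a 1-5 scale.
responses = [4, 5, 3, 4, 2, 5, 4, 5]
summary = describe_sample(responses, surveyed=40)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A response rate as low as the one in this invented example is exactly the kind of figure that should prompt a comment on potential bias.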

2.3. Results. Here you describe your results. State your findings clearly and simply. Present the
output data, summarised, distilled and condensed. The background to each result must be clear to the
reader.
Keep your writing as short and sweet as possible; however, it is possible to be over-concise.
Extract and describe any important trends. You cannot expect the readers to distil significant trends
from the data by themselves. This is your role, not theirs.
Combine your text, figures and tables to condense the data and show trends. Thus, figures and tables
will be very important in this chapter.
If you benchmark your methods against other “control” methods, you must provide here the results of
the comparisons (whether advantageous to your argument or not).
If your work involves testing empirical research hypotheses, you need to state

• whether your survey results allow you to reject (or not reject, as the case may be) the hypotheses,
• and at what confidence level.

2.4. Analysis/Discussion. Here you analyse and discuss your results, as found from your experiments,
surveys or other investigations. Make sure that you address your research aims and question(s).

• What did your results tell you?
• What is their meaning, in academic and/or practical contexts?
• Have you established or reinforced any principles?
• Can you make any generalisations?
• How do your findings compare to those of others?
• How do your findings compare to expectations from previous work?
• How did your methods perform against benchmarks?

Seek patterns in your results, and explain what you saw. Do your data allow you to give underlying
causes? Or can you only observe statistical regularities?
Your discussion must be firmly supported by the evidence presented in your results section. Thus, you
should refer briefly to your results to support the statements in your discussion. Draw only those
conclusions that are directly supported by the evidence of your results; do not extend beyond them.
Data Analysis for Decision Makers 13

2.5. Conclusions. This section is a “wrap-up” for the document as a whole. It is not just a
summary. A small part may be a summary to make it self-contained but fundamentally it is:

• (conceptually) a reflection on the importance of what has been said and done in the rest of your
work; and
• (pragmatically) an assessment of the value of your work — a conversation with the examiners
about why it should be passed.

A reflection means that you the author(s) now step out of the plane of the document and look down on
it from above, and discuss its contents (with the examiners in mind as the audience). The reader should
not be left thinking “So what?”
There should be no new material introduced in the Conclusions chapter: it is tying together what you
have discussed in the previous parts.

3. Putting together the document

Many people find it best to start in the middle of the document, with what is most natural, or what you
know best, e. g., with the literature review or business background.
Do not try to complete the Introduction first: you will end up rewriting it (maybe several times) because
your understanding of what it is about will change as you write. In practice, you will probably write
your document in a combination of:

Top down: where you start with the high level skeleton and progressively add levels of detail
(flesh); and
Bottom up: where you start with isolated snippets and progressively combine them into a whole.

3.1. Sentences and paragraphs. A trick some people like is to use post-it notes to record the
main idea in each paragraph of the existing writing. Then take the post-it notes somewhere else with
plenty of space (e. g., a white board or desk) and reorganise these first. The point of this is that working
directly with text is quite a challenge — you want to work with ideas first, get those right, and do word
processing later.
Every paragraph should state an idea (topic sentence), then provide support for that idea, then (possibly)
give examples. (There are variations, of course, but this is the basic logical structure.) Single-sentence
paragraphs are usually bits of paragraphs that have lost their home, and are a sure sign of problems
with paragraph or section structure.

3.2. Logical structure versus appearance. It is very easy with modern tools to get caught up
in the form or appearance of your document. This advice1 is very relevant:

. . . [S]top worrying about appearance as a be-all-and-end-all. Many people have become
‘Word Processing Junkies’ and no longer ‘write’ documents, they ‘draw’ them, almost
at the same level as a pre-literate 3-year old child might pretend to ‘write’ a story, but
is just creating a sequence of pictures with a pad of paper and box of Crayolas — this
is perfectly normal and healthy in a 3-year old child who is being creative, but is of
questionable usefulness for, say, a grad student writing a Master’s or PhD thesis or a
business person writing a white paper, etc. For this reason, I strongly recommend not
using any sort of fancy GUI ‘crutch’. Use a plain vanilla text editor and treat it like
an old-fashioned typewriter. Don’t waste time playing with your mouse.
Note: I am not saying that you should have no concerns about the appearance of
your document, just that you should write the document (completely) first and tweak
the appearance later. . . not [spend time on] lots of random editing in the bulk of the
document itself.
1 From news:comp.text.tex/MPG.18d82140d65ddc5898968c@news.earthlink.net, Heller, New To LaTeX. . . Unlearning
Bad Habits (11 March 2003)
14 MIS10090

Concentrate on the structure of your document (defining a portion of your work as a section, subsection,
etc.) and afterwards use the tool to overlay the appearance according to the visual style you have chosen.

4. Writing Style

Remember the ABC of good writing: Accuracy, Brevity, Clarity

Accuracy: a true representation of what you did, not misleading
Brevity: as short as possible, but no shorter: use the minimum detail necessary.
Don’t emulate Sir Humphrey Appleby from “Yes, Minister”:
Never use one word where seventeen will do.
Clarity: unambiguous, precise and intelligible.

We will say more about graphics later, but every graphic should adhere to the same principles as all of
your writing: it should be

• informative,
• accurate (A),
• as concise as possible (B),
• clear (C),
• uncluttered but not overly information-sparse (B again), and
• never misleading (A again).

Your writing style should aim first and foremost to convey the message of your work to the reader.
Anything that gets in the way of that should be avoided. Eschew long-windedness. Write as simply and
as clearly as possible: there are some caveats here but clarity is your main objective. This applies to all
business writing.
Use the most precise word you can. Words like “good” and “nice” have such broad meanings as to be
almost meaningless. (If something is good, clarify what measure of goodness is being used.)
For more detail, see Strunk’s pithy book (Strunk and White, 1979), e. g., “Omit needless words!”, or

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph
no unnecessary sentences, for the same reason that a drawing should have no
unnecessary lines and a machine no unnecessary parts. This requires not that the
writer make all his sentences short, or that [s]he avoid all detail and treat his[/her]
subjects only in outline, but that every word tell.

There are many other aspects of standard style that are worth learning. For example, it
is common practice in a piece of text to use words for numbers up to ten, but numerals from 11 on (as
done in this sentence).

4.1. Some dos and don’ts. I suggest you write in the active voice, e. g., “we carried out a survey. . . ”
rather than the passive voice: “a survey was carried out. . . ”. Why? Because a series of sentences in
the passive voice can be very hard to read. There are some possible exceptions2, but be careful not to
overuse the passive voice.
Avoid acronyms and jargon as far as possible. In particular, an abstract or executive summary must
never contain either. Always explain an acronym on its first use, e. g., “A three-letter acronym (TLA)
is not always useful.”
In short, make the reader’s life easier. Think of documents you found clear and easy to read and why
you found them so. Conversely, think of documents you found unclear and hard to read (allowing for
possibly boring subject matter) and what made them so.

2Some journals or editors prefer the passive voice.



4.2. The golden rule: know your audience. Always ask yourself:

• Who is my audience?
• What background do they have?
• What do they know or not know?
• What is their context?
• What preconceived ideas might they have?

With that in mind,

• explain things they might not be aware of;
• do not bore them (and waste valuable words) explaining things they do know.

5. Visualisation: leveraging our spatial awareness to convey information better

To be useful, analytics must not only analyse data but convey meaning to its audience; hence, we now
consider ways to organise and communicate quantitative information. Structures to convey meaning
include the main text of your document, tables and graphics (also called figures, particularly when
encapsulated with explanatory text such as a caption). A very useful, well-laid-out and readable reference
here is (Tufte, 2001): Edward Tufte (2001), The Visual Display of Quantitative Information, Graphics
Press, Cheshire, Connecticut.
Note: a graphic (or table) is only as good as the data it shows. If the data are dubious, no amount of
creativity will give a good graphic.
Recall that data can be

quantitative: measuring something (ordinal/ranking, interval, ratio, correlation, summary): we
can perform mathematical operations on them; or
categorical: (nominal e. g., class labels, ordinal, hierarchical) indicating what the quantitative
data measure: here, mathematical operations would make no sense.

Quantitative information tells a story about a relationship:

• maybe a simple relationship between quantitative data and corresponding categories;
• maybe a more complex relationship among multiple sets of quantitative data.

5.1. Figures and tables: commonalities. Figures (diagrams, artwork, etc.) and tables must be
neat, legible and not hand-drawn.
Each figure or table must be centred on the page and must have a caption. The caption must contain
enough information for the reader to understand the figure or table without needing to refer to the text.
Each figure or table must be self-contained, with all terms in it clearly defined. The text must explicitly
refer to and “talk us through” each one, explaining what it shows, how it fits into your general argument
and drawing attention to significant features. Do not leave this work to the reader!
Figures are numbered consecutively, starting at 1. Tables are also numbered consecutively, with their
own counter, also starting at 1. Put the number and caption after the figure/table. The following
subsections say more on the use of tables and figures.

5.2. Tables. A table is characterised by Few (2012, 43) as:

• information is structured as rows and columns;
• information is realised in text form (such as numbers and words).

Organise your table so that entries of the same kind read down (in columns), not across.
Tables should have no vertical lines separating columns (see the Chicago Manual of Style) unless
absolutely necessary.
Choose units of measurement (e. g., thousands or millions) to avoid having an excessive number of digits.
Do not include columns of data that have the same value in each entry. If that value is important for
the table, give it in the caption.
Note that a single row or column of values is a list rather than a table.

5.3. Figures. A graphic or figure is characterised by Few (2012, 45):

• information is displayed within an area marked out by one or more axes;
• information is realised as visual objects positioned in relation to the axes;
• axes provide scales (both quantitative and categorical) that are used to label and assign values
to the visual objects.
That is, graphics capture quantitative values visually, giving them shape.
If the graphic has axes, ensure that they are clearly labelled. Include a legend describing the figure. It
should be concise but still give enough information for the reader to interpret the figure without referring
to the main text. Put the legend below the figure or in an otherwise empty part of the figure.
Each figure should be within the same margins as the rest of your text: it should never extend beyond
the text and may be smaller.
If the figure is a plot, give each axis a short but informative title, including units of measurement. Make
sure your figure has enough room for axis titles and numbers/ticks, as well as the caption: do not fill
the whole page with the plot. The bulk of the figure should be the plot, with the axes not extending
far beyond the range of the data. For example, if the data range between 17 and 43, your axis might
extend between 15 and 45. However, if a sequence of figures shows results from similar but separate
scenarios, use the same axis ranges for all the figures, to allow for greater ease of comparison. (In
this last case, you should consider whether all the results could gainfully be put in the same figure, or
whether the result would be too cluttered.)
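One way to choose such axis ranges mechanically is sketched below, assuming you are happy to round outwards to a multiple of a fixed step so that no data point falls outside the axis. The function names `padded_limits` and `shared_limits`, and the default step of 5, are illustrative choices rather than anything prescribed here.

```python
import math

def padded_limits(data_min, data_max, step=5):
    """Round the data range outwards to the nearest multiples of `step`."""
    lower = math.floor(data_min / step) * step
    upper = math.ceil(data_max / step) * step
    return lower, upper

def shared_limits(datasets, step=5):
    """One common axis range covering every data set in a sequence of figures."""
    overall_min = min(min(d) for d in datasets)
    overall_max = max(max(d) for d in datasets)
    return padded_limits(overall_min, overall_max, step)

print(padded_limits(17, 43))                        # (15, 45)
print(shared_limits([[17, 30, 43], [22, 51, 38]]))  # (15, 55)
```

Computing the limits once with `shared_limits` and passing them to every plot in the sequence keeps the figures directly comparable.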
Colour figures may be used; however, the figure captions and the dissertation text proper must make
sense if the document is printed either in black & white or colour. (For example, don’t refer to “the red
section of the pie-chart”.)

6. Text, Table or Graphic?

Few (2012, 9)3 says:

The purpose of quantitative tables and graph[ic]s is to communicate important information
effectively. That’s it. Not to entertain, not to indulge in self-expression, not
to make numbers that you would otherwise find boring suddenly interesting through
flash and dazzle.
Present your data either in the main text, or as a table or as a figure, but never in more than one way.
It may be that less is more: do not give data in the form of figures or tables, if this could easily be
replaced by a sentence or two of text. In particular, tables containing just a few entries should be
critically reviewed to see if they are really necessary. By the same token, consider whether a graphic
(figure) with low data density (little information for the page space it takes up) can be better represented
in a table.
In the following, which is the best way to communicate the gender breakdown of the class?
Would you use the sentence “The class size is 49, with 30 male (61%) and 19 female (39%) students.”
Or is Table 2.1 the best approach? Or does Figure 2.1 do the best job?
3Few, Stephen (2012). Show Me the Numbers: Designing Tables and Graphs to Enlighten. Burlingame, CA: Analytics
Press.

          Number   Percentage
Male          30          61%
Female        19          39%
Total         49         100%

Table 2.1. Gender composition of the class

Figure 2.1. Gender composition of the class
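The figures in the sentence and in Table 2.1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the percentages are rounded to whole numbers as in the table.

```python
# Counts taken from the class example in the text.
counts = {"Male": 30, "Female": 19}
total = sum(counts.values())

def percentage(part, whole):
    """Share of `whole` made up by `part`, rounded to a whole percent."""
    return round(100 * part / whole)

rows = [(label, n, percentage(n, total)) for label, n in counts.items()]
# The total row is 100% by definition; rounded parts need not always sum to 100.
rows.append(("Total", total, 100))

print(f"{'':<8}{'Number':>8}{'Percentage':>12}")
for label, n, pct in rows:
    print(f"{label:<8}{n:>8}{str(pct) + '%':>12}")
```

With only two categories, the whole output fits comfortably in a single sentence, which is part of the point of the comparison.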

6.1. When to use a table. Tables are ideal for capturing a simple relationship between quantitative
data and the categories they relate to. The main strength of a table is that it is easy to look up
particular values.
Few (2012, 45) summarises the criteria for choosing a tabular display as
(1) The display will be used to look up individual values;
(2) It will be used to compare individual values but not entire series of values to one another;
(3) Precise values are required;
(4) The quantitative information to be communicated involves more than one unit of measure;
(5) Both summary and detail values are included.
Tufte (2001, 33) argues that
Small, noncomparative, highly labeled data sets usually belong in tables.
Our brains are well able to comprehend one, two or even a few values at the same time. The number
of things we can simultaneously grasp varies among people, with studies suggesting it is usually in the
range 5 ± 2. Tufte (2001, 56) says that “tables usually outperform graphics in reporting on small data
sets of 20 numbers or less.” In particular, graphics should rarely, if ever, be used to display very small
data sets; however, Few (2012, 50–51) gives an example of how the right, well-designed graphic may help
to tell a story about a very small data set.

6.2. When to use a graphic. The human eye-brain combination has evolved a great ability to
detect patterns in, and garner meaning from, visual depictions of information: graphics. Leverage this
where you can, provided it is appropriate. This visual perception differs from our sequential processing
of text and the rows and columns of tables, which is handled by our verbal (language) system.
One picture is worth ten thousand words. . .
— F. R. Barnard, “Printer’s Ink”, 10 March 1927
. . . or ten thousand numbers.
But remember our earlier comments (and more examples later): one picture may not be worth a few
numbers (even 20 or so).

Few (2012, 48) says:

Because of their visual nature, graph[ic]s present the overall shape of the data. Text,
displayed in tables, cannot do this. The patterns revealed by graph[ic]s enable readers
to detect many points of interest in a single collection of information.
Thus, we should use graphics when
(1) We wish to compare multiple sets of quantitative data to each other;
(2) The message is contained in the overall shapes of the sets
• patterns
• trends (especially when there is time variation)
• exceptions (see (Anscombe, 1973) examples for outliers)
(3) Precision is less important;
(4) Summary and detail values may not both be needed;
(5) There are three or four main “chunks” of information, in visual form (thus, a legend with eight
different symbols will have your reader constantly referring to it).
Very different data sets may have similar or identical summary statistics, but a graphic quickly shows
how radically different they are: see (Anscombe, 1973)4 for an example of four such data sets. We will
return to this in the next chapter.
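As a small illustration of this point, the sketch below computes summary statistics for the first two of Anscombe's four data sets, which share the same x values. The helper names `pearson_r` and `summarise` are illustrative, and the choice of mean, variance and correlation as the summaries is an assumption made for this example.

```python
import math
import statistics

# First two of Anscombe's (1973) four data sets; both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def summarise(xs, ys):
    """Mean of y, sample variance of y and correlation, to two decimals."""
    return (round(statistics.mean(ys), 2),
            round(statistics.variance(ys), 2),
            round(pearson_r(xs, ys), 2))

print("Set I: ", summarise(x, y1))
print("Set II:", summarise(x, y2))
```

To two decimal places the two summaries agree, yet a scatter plot of each set shows a roughly linear cloud for the first and a smooth curve for the second.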
Graphics depicting data appropriately allow the human eye-brain’s pattern-detection and interpretive
strengths to be leveraged. If we see that a set of data follows a particular pattern, it may allow us to
induce a rule or principle: to move from the particular to the general. A good graphic may provoke
(research or other) questions, or suggest theories. This is why good graphics are so important.
There are many options for producing plots and other graphics available in Excel and in more professional
tools such as R, Python or gnuplot.

7. The history and range of graphical representation

The earliest known map is Babylonian, dating from about 2400–2200 BC. The Chinese map in Figure 2.2
(from c. the 11th century AD) is the first known to use grid lines.
Nicolas Oresme (c. 1323–1382) plotted one variable against another (see Figure 2.3), using what are now
called Cartesian xy co-ordinates after René Descartes (1596–1650).
Graphical representation of data can be traced back to Johann H. Lambert (1728–1777), a Swiss-German
mathematician5, and William Playfair (1759–1823), a Scottish political economist.
They introduced abstract forms of graphical display of data.
The idea of a relational graphic of statistical variables (e. g., standard linear regression, where a “best-
fitting” line is drawn to several observed points) goes back to Lambert (1765)6. Lambert says there7
We have in general two variable quantities x, y, which will be collated with one another
by observation, so that we can determine for each value of x, which may be considered
as an abscissa, the corresponding ordinate y. Were the experiments or observations
completely accurate, these coordinates would give a number of points through which
a straight or a curved line should be drawn. But as this is not so, the line deviates to
a greater or lesser extent from the observational points. It must therefore be drawn in
such a way that it comes as near as possible to its true position and goes, as it were,
through the middle of the given points.
See Figure 2.4 for an example.

4 This is reproduced in F.J.AnscombeFeb1973 GraphsinStatisticalAnalysis AmerStatnV27n1p17-21.pdf on the
Brightspace page.
5 Lambert was the first to prove that π is irrational. He also studied non-Euclidean geometry (e. g., geometry on a sphere)
and map projections of the spherical earth onto a flat surface, inventing seven projections including the transverse Mercator
(widely used in maps) and Lambert conformal conic (widely used in aeronautical charts) projections. He introduced the
properties of conformality (projections that preserve angles) and equal area preservation and showed that they were mutually
exclusive.
6 Lambert, J. H. (1765) Beyträge zum Gebrauche der Mathematik und deren Anwendung.
7 As quoted in: Laura Tilling (1975). Early Experimental Graphs. British Journal for the History of Science. 8:204–205.

Figure 2.2. The map of the tracks of Yu the great, engraved on stone, China, c. 1100AD

Figure 2.3. Oresme’s plots of semicircle, quadrant and linear function
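Lambert's description of a line that goes "through the middle of the given points" can be made concrete in several ways. The sketch below uses ordinary least squares, which is one modern reading of "best-fitting" and not necessarily the construction Lambert himself had in mind. The function name `fit_line` and the observations are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations scattered around the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} x + {intercept:.2f}")
```

The fitted line passes through the point of means, which is one precise sense in which it goes through the middle of the observations.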
You will be familiar with other graphics where area depicts quantity, such as the bar chart (Figure 2.5),
time series (Figure 2.6) and pie chart (Figure 2.7), all invented by Playfair8.
• A bar chart displays (for the usual case of vertical bars) a numerical y-value (height of the
bar) corresponding to a discrete (either categorical or numerical) x-value; the bars have gaps
between them to emphasise the discreteness of the x-values;
• a time series displays a numerical y-value corresponding to an x-value denoting time;
• a pie chart displays relative proportion around the circumference of a disc (harder for our brains
to compare).

8The first two were introduced in his 1786 book The Commercial and Political Atlas; the last in his 1801 book The
Statistical Breviary.

Figure 2.4. An example plot of temperature variation and lags with depth, from Lambert
(1779) Pyrometrie.

Figure 2.5. An example of a Playfair bar chart, showing exports from and imports to
Scotland in 1781

Figure 2.6. An example where Playfair plots three parallel time-series: price of wheat,
weekly wages of a “Good Mechanic” (skilled labourer), and the reigns of British monarchs

Figure 2.7. A Playfair graphic showing population, tax revenue, etc., of the Great
Powers. This seems to be the first graphic to use multivariate data. It is also a bubble
plot (see later), in that the areas of the circles are proportional to the areas of nations,
and includes pie charts (to show the divisions of the Turkish and German Empires) and
lines (to show population sizes and taxes raised)

There are many other graphics that may be less familiar; some are presented here to give you ideas of
what can be done.

A Nightingale Rose chart (or Nightingale Coxcomb chart), e. g., Figure 2.8, from (Nightingale, 1858)9 is
a kind of pie chart with relative segment areas determined by radius rather than angle.10

Figure 2.8. An example of a Nightingale Rose: Deaths in the Crimea, 1858

A Sankey flow diagram shows flows (of energy, goods, etc) among parts of a system using links (often
depicted as arrows), with the thickness of the links proportional to the magnitude of the flow. It is
named after Irish engineer Captain Matthew H. P. R. Sankey, who used this kind of diagram in 1898 to show where energy
went in a steam engine.
The first use of a Sankey-type diagram appears to be by another Irishman, Lieutenant Henry D. Harness
of the Royal Engineers. He included maps “drawn to a new design” in an 1837 report (Figure 2.9) for
the Irish Railway Commissioners. These maps showed lines joining pairs of locations; each line’s width
was proportional to the average weekly number of travellers between the two locations. Many people
commuted from Kingstown (now Dún Laoghaire) to Dublin, so — rather than show an extremely wide
line — Harness showed a narrow black line and made a note in the text to explain it. Also, this is
a bubble plot (see below), since the size of the dot representing a given town is proportional to the
population of that town.
Colour coding can be helpful to indicate (roughly) a third dimension. For example, this is often done in
Self-Organising Maps (SOMs) where a colour “heatmap” indicates the value of a variable not included
in the SOM training. This can suggest clusters to the naked eye. See Figure 2.10.
9 Florence Nightingale (1858). Notes on matters affecting the health, efficiency and hospital administration of the
British army.
10 “We do not want impressions, we want facts. You complain that your report would be dry. The dryer the better.
Statistics should be the dryest of all reading.” — William Farr, Compiler of Abstracts in the General Registry Office,
responding to Nightingale’s Rose chart showing bad sanitation to be by far the greatest cause of death during the Crimean
War. Nightingale had compared his (innovative) civilian mortality tables to her data, and had shown that soldiers had
twice the mortality of civilians, even in peacetime. She became the first female fellow of the Statistical Society of London
(now Royal Statistical Society) in 1858.

Figure 2.9. Excerpt from Harness’s original “Sankey” flow diagram, 1837

Figure 2.10. Example of a Self-Organising Map (SOM) where a colour “heatmap” indicates
the value of a variable not included in the SOM training

8. Types of modern graphic

8.1. Maps and plans. These show spatial aspects, as, e. g., in John Snow’s famous map in Figure 2.11.
Figure 2.12 is an example of combining time-series and spatial information. Figures 2.13–2.14
are examples of combining virtual and spatial information. The most extensive data maps can place
millions of bits of information on a single page.

Figure 2.11. John Snow’s map of cholera deaths in the Broad Street area of London,
1854. This graphic efficiently testified about the data. It showed the clustering of deaths
around the water pump × in Broad Street, disproving the “bad air” theory of cholera
and persuading the council to remove the handle of the pump

Figure 2.12. Nathan Yau: Unemployment in the United States, 2004–2009. From
http://projects.flowingdata.com/america/unemployment

8.2. Illustrative diagrams. These portray a real object in simplified or schematic form. Such
images are models and so trade off realism and abstraction. Examples include diagrams of the human
eye in cross-section (Figure 2.15), or the covers of Haynes manuals (Figure 2.16).

Figure 2.13. Paul Butler: visualizing friendships in the Facebook social graph, 2010. From
http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

Figure 2.14. Olivier Beauchesne: Map of scientific collaboration between researchers
2005–2009. From http://olihb.com/2011/01/23/map-of-scientific-collaboration-between-researchers

8.3. Organisational diagrams. These show relationships among parts of real/abstract objects,
e. g., UML shapes class hierarchy as in Figure 2.17, or volumes of trade in crude oil (a Sankey flow
diagram) as in Figure 2.18.

8.4. Statistical graphics. These are some of the most important for us, so we dedicate all of the
next section to them.

Figure 2.15. A section through the human eye, with an enlarged schematic of the retina

Figure 2.16. Haynes manual schematics of a Land Rover (left) and the USS Enterprise
NCC-1701 (right)

9. Statistical graphics

Statistical graphics represent more abstract designs, including relational graphics such as function plots.
Examples of statistical graphics include

• Bar charts (e. g., Playfair’s in Figure 2.5),
• Histograms (see below),
• Scatter plots (see below),
• Bubble plots (scatter plot where the dot size has meaning),
• Function plots (e. g., line plots such as Lambert’s in Figure 2.4),
• Density plots,
• Pie charts (not recommended: see below).

Figure 2.17. An example of a UML class hierarchy

Figure 2.18. An example of an organisational diagram: volumes of trade in crude oil
(a Sankey flow diagram)

9.1. Histograms. A frequency distribution is a summary table of numerical data (usually
continuous) in which the data are grouped into numerically ordered nonoverlapping classes (also called
categories or bins).
The frequency (or number of occurrences) of a class is the number of observations (data items) falling
within that class.
You must carefully select (a) the appropriate number of classes, (b) a suitable fixed class width, and (c)
suitable boundaries of each class to avoid overlapping.

(a) The number $k$ of classes depends on the number $n$ of data items (typically more classes for a
larger number of items); $k$ should be between 4 and 15 to aid human understanding: too many
classes defeat this. The “$2^k$” rule/guideline (also called Sturges’s rule) says:
• use $k$ classes, where $k$ is chosen as the smallest integer for which $2^k > n$; for example, if
you have $n = 19$ items, the smallest power of 2 greater than 19 is $32 = 2^5$ so we choose
$k = 5$.
This is a guideline, and we might choose $k - 1$ or $k + 1$ classes; but not a very different number
from $k$.
(b) The class width of each interval is the data range (highest value − lowest value) divided by the
number $k$ of classes, i. e.,
$$\text{class width} = \frac{\max - \min}{k}.$$
(c) The class boundaries are where the classes meet: so classes touch but don’t overlap.
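
Choices (a) to (c) can be sketched in a few lines of Python. This is only an illustration and is not part of the original notes: the function names `number_of_classes` and `class_boundaries` are hypothetical, and the rounding of boundaries is an assumption made to hide floating-point noise.

```python
def number_of_classes(n):
    """Smallest integer k with 2**k > n (the "2^k" guideline)."""
    k = 0
    while 2 ** k <= n:
        k += 1
    return k


def class_boundaries(values, k, ndigits=6):
    """Return the k + 1 equally spaced boundaries from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # class width = (max - min) / k
    # Adjacent classes share a boundary, so they touch but do not overlap.
    return [round(lo + i * width, ndigits) for i in range(k + 1)]


print(number_of_classes(19))  # 5, since 2**5 = 32 > 19
print(class_boundaries([2.0, 3.5, 7.0, 10.0], k=4))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

The small data set in the last line is invented purely to show the output shape. A rule for which class receives a value lying exactly on a boundary still has to be chosen separately.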

A histogram is a vertical “bar” plot of a frequency distribution over a range. Each bar represents a
class. There are no gaps between adjacent bars. The class boundaries (or class midpoints) are shown on
the horizontal axis. The vertical axis (showing height of bars) is either frequency, relative frequency, or
percentage.
Advantages of a histogram:

• It condenses the raw data into a more useful form;


• It allows for a quick visual interpretation of the data;
• It enables determination of the major characteristics of the data set including where the data
are concentrated / clustered.

Some tips for histograms:

• Different class boundaries (e. g., from more or fewer classes) may give different pictures for the
same data (especially for smaller data sets);
• Shifts in data concentration may show up when different class boundaries are chosen;
• As the size of the data set increases, the impact of changes in the choices of class boundaries is
greatly reduced.

A histogram differs from a barchart:

• A barchart displays a y-value corresponding to a discrete x-value, e. g., the numbers of students
from particular countries in the class (here, the x-value would be a country [nominal level of
measurement] and the y-value is the corresponding number of students); or the probability of
a particular discrete outcome (see later). The bars are drawn so as not to touch each other,
emphasising the discreteness of the x-values;
• A histogram displays a y-value which is the frequency (number of occurrences) corresponding
to the range (width) of this class; that is, how many times did we see an observation within
this class. Thus the vertical bars are drawn to touch each other, since each covers a range of
x-values from the start to the end of the class it covers.

In the example in Figure 2.19, derived from https://www.cuemath.com/data/histograms, the heights
in feet of 30 cherry trees are measured as

62.5, 63.4, 64.7, 66.5, 67.7, 68.4, 70.5, 70.9, 71.3, 72.5, 72.9, 73.8, 74.5, 74.9, 75.3, 75.7, 76.2, 76.8, 77.1,
77.6, 78.3, 78.7, 79.4, 79.8, 80.5, 81.8, 82.6, 83.7, 84.5, 86.9

then classed (grouped/binned) into six classes, each 5 feet wide, ordered by increasing tree height. For
each class, the number of trees in that class is counted; the number counted is called the frequency for
that class and is shown on the vertical axis.
This gives us a visual representation of how the heights are distributed over the set of trees: how many
trees of each height range there are. We will have much more to say about distributions later.
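
As a rough check of the binning, the counting can be sketched in Python. The text does not state the class boundaries, so this sketch assumes they fall at 60, 65, ..., 90 feet, with each class taken as lower ≤ height < upper; the variable names are illustrative only.

```python
heights = [
    62.5, 63.4, 64.7, 66.5, 67.7, 68.4, 70.5, 70.9, 71.3, 72.5,
    72.9, 73.8, 74.5, 74.9, 75.3, 75.7, 76.2, 76.8, 77.1, 77.6,
    78.3, 78.7, 79.4, 79.8, 80.5, 81.8, 82.6, 83.7, 84.5, 86.9,
]

# Assumed boundaries: six 5-foot classes starting at 60, 65, ..., 85 feet.
lower_bounds = list(range(60, 90, 5))
frequency = {lo: 0 for lo in lower_bounds}

for h in heights:
    lo = int(h // 5) * 5  # lower boundary of the class containing h
    frequency[lo] += 1

# Print a crude text histogram: one '#' per tree in the class.
for lo in lower_bounds:
    print(f"{lo}-{lo + 5} ft: {frequency[lo]:2d} " + "#" * frequency[lo])
```

Under these assumed boundaries the six frequencies are 3, 3, 8, 10, 5 and 1, which sum to the 30 trees.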

Figure 2.19. Example of a histogram (from https://www.cuemath.com/data/histograms)

Class   Lower (m)   Upper (m)   Frequency
1       3.2         4.0         4
2       4.0         4.8         5
3       4.8         5.6         9
4       5.6         6.4         8
5       6.4         7.2         6
6       7.2         8.0         5

Table 2.2. Frequency distribution: six classes of great white shark lengths, giving the
upper and lower class boundaries and the number of items (frequency) in each class

In the sample in Figure 2.19, there are 30 observations, so by Sturges’s rule we would first consider using
5 classes since $2^5 = 32$, the first power of 2 which is bigger than the sample size of 30. However, the
authors decided to use 6 classes here, presumably for clarity and/or symmetry.
Example 2.1. Suppose that a marine biologist has measured the lengths in metres of 37 adult great
white sharks, and found that they were (in ascending order):
3.2, 3.5, 3.6, 3.9, 4.2, 4.4, 4.7, 4.8, 4.8, 5.0, 5.0, 5.1, 5.1, 5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 5.7, 5.9, 5.9, 5.9, 6.0,
6.2, 6.3, 6.5, 6.6, 6.8, 7.1, 7.2, 7.2, 7.4, 7.5, 7.5, 7.8, 8.0.
We first decide on the number of bins or classes: since there are 37 data items, we choose $k$, the number
of classes, to be the smallest number for which $2^k > 37$, that is,¹¹ $k = 6$.
Next, we identify the maximum length as 8.0 m and the minimum as 3.2 m, so the total range is
8.0 − 3.2 = 4.8 m. Dividing this into $k = 6$ equal intervals gives a class width of 0.8 m. Counting the number of items in
each class we get the frequency distribution (Table 2.2). The resulting histogram is shown in Figure 2.20.

From this, we see that there is not such a well-defined bell shape as in the previous example; but real
examples are rarely truly symmetrical.
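
The frequency counts of Example 2.1 can be reproduced with a short Python sketch. Two points here are assumptions rather than statements from the text: a length lying exactly on a boundary is counted in the lower class (lower < x ≤ upper, with the minimum placed in class 1), which is the convention that matches the frequencies in Table 2.2; and the lengths are converted to whole tenths of a metre so that boundary comparisons are exact.

```python
lengths = [
    3.2, 3.5, 3.6, 3.9, 4.2, 4.4, 4.7, 4.8, 4.8, 5.0, 5.0, 5.1, 5.1,
    5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 5.7, 5.9, 5.9, 5.9, 6.0, 6.2, 6.3,
    6.5, 6.6, 6.8, 7.1, 7.2, 7.2, 7.4, 7.5, 7.5, 7.8, 8.0,
]

k = 6  # smallest k with 2**k > 37

# Work in tenths of a metre (integers) to avoid floating-point boundary errors.
tenths = [round(x * 10) for x in lengths]
lo, hi = min(tenths), max(tenths)
width = (hi - lo) // k  # 48 // 6 = 8 tenths, i.e. 0.8 m (the range divides evenly here)

counts = [0] * k
for t in tenths:
    # Assumed convention: lower < x <= upper; the minimum itself goes in class 1.
    index = max(0, (t - lo - 1) // width)
    counts[index] += 1

for i, c in enumerate(counts):
    lower = (lo + i * width) / 10
    upper = (lo + (i + 1) * width) / 10
    print(f"{i + 1}  {lower:.1f}  {upper:.1f}  {c}")
```

With the opposite convention (lower ≤ x < upper) the boundary values 4.8, 5.6 and 7.2 would move up one class and the counts would differ, which illustrates the earlier tip that boundary choices can change the picture for small data sets.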
¹¹ If you are familiar with logarithms, just find $\log_2 37 = 5.209453366$ and let $k$ be the smallest integer
greater than this, i. e., $k = 6$.

Figure 2.20. Example of a histogram of 37 great white shark lengths (different colours
have been used for each bin for clarity; usually all bins are shown in the same colour)

For more examples of histograms, e. g., skewed, bimodal, etc, see https://www.cuemath.com/data/histograms
In tutorials you will see how to create a histogram in Excel. The chart, title, axes, etc. can be modified
and formatted to best reflect the data under analysis. The histogram is treated as though it was any
other chart in Excel.
There are many other plot options available in tools such as Excel. Other commonly used plots are
Scatter Plots, Line Plots, Pie Charts, Stem and Leaf Displays and Cumulative graphs.

9.2. Scatter Plots. A scatter plot depicts a finite number of pairs of numerical values (x, y) as
isolated points on the plane. It has two axes, one for each variable x, y, and each observation is plotted
as a point — no lines or bars. In Figure 2.21, each point represents an observation (one person).

[Scatter plot titled “Weight/Height relationship”: Weight (kg) on the horizontal axis, Height (cm) on the vertical axis]

Figure 2.21. Scatter plot showing height and weight of a sample of adults

9.2.1. Bubble Plots. We mentioned that colour (or shading density) can represent a third dimension;
in fact, by varying the size of points on a scatter plot we can even get a fourth dimension, giving a
bubble plot as in Figure 2.22.

Figure 2.22. An example of a bubble plot, with axes showing number of burglaries
versus number of murders per 100,000 population. Every bubble is a US state: the size of
each bubble represents the population of the state; the colour is the number of larcenies.
From http://glowingpython.blogspot.ie/2011/11/how-to-make-bubble-charts-with.html

10. What should a good graphic do?

Seek the clear portrayal of complexity. Avoid complication of the simple.


Tufte (2001, 51) gives these principles of graphical excellence:
• Graphical excellence is the well-designed presentation of interesting data.
• Excellence in statistical graphics consists of complex ideas communicated with clarity, precision
and efficiency.
• Graphical excellence gives to the viewer the greatest number of ideas in the shortest time with
the least ink in the smallest space.
• Graphical excellence is nearly always multivariate.
• Graphical excellence requires telling the truth about the data.
Tufte (2001) argues that
Graphical displays should
• show the data
• induce the viewer to think about the substance rather than about methodology,
graphic design, the technology of graphic production, or something else
• avoid distorting what the data have to say
• present many numbers in a small space
• make large data sets coherent
• encourage the eye to compare different pieces of data
• reveal the data at several levels of detail, from a broad overview to the fine structure
• serve a reasonably clear purpose: description, exploration, tabulation or decoration
• be closely integrated with the statistical and verbal description of a data set.

Graphics reveal data.


Tufte (2001) makes the point that as much as possible of a graphic’s ink¹² should present data. He calls
Data-ink
the non-erasable core of a graphic, the non-redundant ink arranged in response to
variation in the numbers represented.
He defines the data-ink ratio as
$$\begin{aligned}
\text{data-ink ratio} &= \frac{\text{data-ink}}{\text{total ink used to print the graphic}} \\
&= \text{proportion of a graphic's ink devoted to the non-redundant display of data-information} \\
&= 1 - \text{proportion of a graphic that can be erased without loss of data-information}
\end{aligned}$$
Tufte (2001, 105) gives the principles
• Above all else show the data.
• Maximize the data-ink ratio.
• Erase non-data-ink (decorations, intrusive grid lines, etc.).
• Erase redundant data-ink.
• Revise and edit.
Strunk (Strunk and White, 1979) had a famous no padding principle for writing: “Omit needless words!”
Tufte is saying the same for graphics: “Omit needless ink!”
The less padding, verbiage, and needless graphical decoration, the more your message will shine through.

10.1. A famous example of a good graphic. A famous graphic is Charles Joseph Minard’s
1869 depiction of Napoleon’s 1812 campaign in Russia, ending in retreat from Moscow. See Figure 2.23
(Minard was over 80 when he made this). Minard shows seven kinds of information (in eight dimensions)
on a two-dimensional chart:
(1) the two-dimensional path of the army’s movement (from the Polish-Russian border to Moscow,
and back: follow the midpoints of the gold and black bands);
(2) direction of movement (colour coded: gold for outward and black for return);
(3) number of troops remaining (widths of the paths, as in a Sankey flow diagram);
(4) geographical information (names of rivers, towns, etc);
(5) names of major battles (locations along the troop movement path);
(6) date (for the return journey, as a separate plot to the bottom, directly under the corresponding
event on the march);
(7) temperature (also for the return journey, shown against date; in addition, occurrence of rain
[pluie] is given).

11. What should a good graphic not do?

Tufte (2001, 74) says


Graphics must not quote data out of context.
He points out that ‘to be truthful and revealing, data graphics must bear on the question at the heart
of quantitative thinking: “Compared to what?”.’
A data-thin design should always provoke suspicion, for graphics often lie by omission, leaving out data
sufficient for comparisons.
We gather together six principles of graphical integrity from (Tufte, 2001, 77):
• The representation of numbers, as physically measured on the surface of the graphic itself,
should be directly proportional to the numerical quantities represented.
• Clear, detailed and thorough labelling should be used to defeat graphical distortion and
ambiguity. Write out explanations of the data on the graphic itself. Label important events in the
data.
• Show data variation, not design variation.
• In time series displays of money, deflated and standardized units of monetary measurement are
nearly always better than nominal units.
• The number of information-carrying (variable) dimensions depicted should not exceed the
number of dimensions in the data.
• Graphics must not quote data out of context.

¹² On a screen, this means pixels coloured differently from the background.

Figure 2.23. Minard’s 1869 depiction of Napoleon’s 1812 campaign in Russia shows:
the path of the army’s movement (follow midpoints of the gold and black bands);
direction of movement (gold for outward and black for return); and the number of troops
remaining (widths of the paths); as well as geographical information, names of battles,
dates, temperatures and rain occurrence on the return journey

11.1. Low data density. Recall from §6 that graphics should not be used to display very small
data sets.
The graphic in Figure 2.24 is probably a record: it uses a lot of space and ink to tell us one number,
90%.

Figure 2.24. The lowest of low data densities

11.2. Misleading graphics. From the principles we’ve seen, a graphic should inform, not mislead,
the reader, and give a truthful view of the data.
Distortion can be accidental or deliberate. Deliberate distortions are usually an attempt to hide a feature
of the data.
Note that any medium of communication can be used to mislead. Graphics are no more vulnerable to
exploitation by liars than other media. Our intuition is usually a good graphical lie detector, spotting
frauds.
If the reader begins to suspect you of manipulating his/her impressions, it may cast doubt on all of your
work.
A graphic with low data density (little information for the page space it takes up) promises more than
it delivers and so may be regarded as misleading. Consider whether the data can be better represented
in a table.
11.2.1. Distortion in graphics. For most people, the perceived area of a disc grows more slowly than
the actual (physical) area, as the radius increases.¹³ Also, perceptions alter with experience; perceptions
are context-dependent; and they depend on what we have heard others say about the object.
Given these differences in perception, Tufte (2001, 56) suggests principles to enhance graphical integrity.
(1) Make the physical dimensions of numbers on the graphic directly proportional to the numerical
quantities represented.
(2) Defeat graphical distortion and ambiguity by clear, detailed and thorough labelling:
• Give explanations of the data on the graphic itself;
• Label important events in the data.
¹³ $\text{Perceived area} = (\text{actual area})^c$ where $c \approx 0.8$ (Tufte, 2001, 55).

Tufte (2001, 57) proposes a way to measure violations of the first principle:
$$\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}.$$
A Lie Factor between 0.95 and 1.05 indicates reasonably accurate representation of the underlying
numbers. A Lie Factor outside this range indicates substantial distortion, beyond minor plotting inaccuracy.
Most distortions overstate, with Lie Factors of two to five not uncommon.
Tufte (2001, 57) reprints an example from The New York Times, 9 August 1978, p. D-2 (Figure 2.25).

Figure 2.25. A misleading graphic from “The New York Times”, 9 August 1978, p. D-2,
reprinted in (Tufte, 2001, 57)

We can calculate the Lie Factor from the above formula:
$$\text{Data increase} = \frac{27.5 - 18}{18} = 0.53; \qquad \text{Graphical increase} = \frac{5.3 - 0.6}{0.6} = 7.83.$$
This gives a Lie Factor of 7.83/0.53 = 14.8.
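
The same arithmetic can be written as a small Python sketch. The function names `relative_change` and `lie_factor` are hypothetical; the four numbers are those used in the calculation above.

```python
def relative_change(first, second):
    """Size of effect: change from first to second, relative to first."""
    return (second - first) / first


def lie_factor(data_first, data_second, graphic_first, graphic_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


data_increase = relative_change(18, 27.5)
graphical_increase = relative_change(0.6, 5.3)

print(round(data_increase, 2))       # 0.53
print(round(graphical_increase, 2))  # 7.83
print(round(lie_factor(18, 27.5, 0.6, 5.3), 1))  # 14.8
```

A result near 1 would indicate that the graphic represents the change faithfully; here the graphic overstates the change by a factor of almost 15.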
Tufte (2001, 60–61) gives the principle

Show data variation, not design variation.

He states “Each part of a graphic generates visual expectations about its other parts, [so] these expec-
tations often determine what the eye sees. Deception results from the incorrect extrapolation of visual
expectations”. For example, in Figure 2.25

• the perspective effect further fools the eye;


• the non-uniform time scale from 1978–1985 makes the change with time harder to judge.

As a general principle, do not mix scales on the same axis. If a scale moves at regular intervals in one
part of the graphic, the eye expects it to move to the end in a coherent way.
Mixing design variation with data variation gives ambiguity and deception, as the eye may not distinguish
between them (Tufte, 2001, 61).
Our eye judges area and volume differently from distances. Distortion can arise from the inappropriate
use of linear scaling in every dimension when using area or volume to represent values. The eye
confuses design variation with data variation. Furthermore, the use of 3-dimensional “effects” often causes
distortions, even if not representing data values (see, for example, Figures 2.27 and 2.29 below).
Tufte (2001, 71) encapsulates this in the principle

The number of information-carrying (variable) dimensions depicted should not exceed
the number of dimensions in the data.
In Figure 2.26, use of area, with each dimension expanded by the same factor, means the data increase of
454% is shown as an increase of 4280%. This gives a Lie Factor of 9.4. The eye is confused by using 2-d
area to show 1-d data. This is compounded further by the fake 3-d which the eye interprets as varying
as the cube of the linear increase, so the Lie Factor = 27,000/454 = 59.4. To make matters even worse,
this graphic uses “nominal” dollars, not adjusting for inflation (Playfair did adjust for inflation).
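
The general mechanism can be illustrated with a small Python sketch. The numbers are invented for illustration and are not those of Figure 2.26, and the helper names are hypothetical. It assumes a value that doubles is drawn as a shape whose every linear dimension is doubled, and computes the resulting Lie Factor when the eye reads length, area or volume.

```python
import math


def shown_effect(data_ratio, dimensions):
    """Relative increase in length (1), area (2) or volume (3) when every
    linear dimension of the shape is multiplied by data_ratio."""
    return data_ratio ** dimensions - 1


def lie_factor_for_scaling(data_ratio, dimensions):
    """Lie Factor = effect shown in graphic / effect in data."""
    data_effect = data_ratio - 1
    return shown_effect(data_ratio, dimensions) / data_effect


data_ratio = 2.0  # hypothetical example: the data value doubles
for d in (1, 2, 3):
    print(d, lie_factor_for_scaling(data_ratio, d))  # 1.0, then 3.0, then 7.0

# To keep the drawn AREA proportional to the data, scale each linear
# dimension by the square root of the data ratio instead.
honest_scale = math.sqrt(data_ratio)
print(honest_scale ** 2)  # approximately 2.0: the area now doubles with the data
```

The same square-root scaling is one reasonable choice when sizing the points of a bubble plot, so that bubble area rather than radius tracks the data value.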

Figure 2.26. A misleading graphic from “Time”, 9 April, 1979, p. 57, reprinted in
(Tufte, 2001, 62)

There are other ways graphics can mislead, which you should consider and avoid if possible. Be careful
with an xy plot having a greater vertical range than horizontal range (taller than it is wide). In particular,
if the horizontal axis shows time, you may give the impression of large variations with time. Is this what
you want to show? Would it mislead the reader? Variations can be further exaggerated by having the
vertical axis only show part of the range from 0 to max. Again, be cautious with this.
The moral is: take care with graphics, as with all statistical interpretation of data.

11.3. Chartjunk. Our guiding principle that a graphic should clearly portray complexity means
the graphic must not be more complex than the data it portrays.
Chartjunk is a term for things which introduce unnecessary complexity:
• Irrelevant optical art or decoration such as Moiré effects (cross-hatching gives a vibrating,
illusory, eye-straining effect and should be replaced by shades of grey)
• Grid (esp. dark lines): mute or suppress this “so that its presence is only implicit—lest it
compete with the data . . . Dark grid lines . . . carry no information, clutter up the graphic and
generate activity unrelated to data information” (Tufte, 2001, 112)
• Duck: “when a graphic is taken over by decorative forms or computer debris, . . . , when the
overall design purveys Graphical Style rather than quantitative information, then that graphic
may be called a duck” (Tufte, 2001, 116)
– meaningless colours
– fake perspective and 3-d effects

11.4. Some examples of bad graphics.


11.4.1. A classic bad graphic: a duck. The graphic in Figure 2.27 uses six colours, unnecessary
perspective (3-d) and a multiply split axis to report five pieces of data (since each year sums to 100%).
Tufte (2001, 118) says this “delighted connoisseurs of the graphically preposterous . . . [it] may well be
the worst graphic ever to find its way into print”.
A table can capture all the facts succinctly, as in Table 2.3.

Year    Total Enrolment (%)
1972    28.0
1973    29.2
1974    32.7
1975    33.6
1976    33.0

Table 2.3. Percentage of Total College Enrolment aged 25 or over (1972–1976)

Figure 2.27. The dangers of colour and/or 3-d effects: Percentage of Total College
Enrolment aged 25 or over (1972–1976), from “American Education”

11.4.2. Pie charts: avoid unless absolutely necessary. Many authorities, such as Bertin (1977); Tufte
(2001); Few (2012, 94), argue that pie charts have limited use at best, since they have low data density
and fail to order numbers along a visual dimension (the eye can judge linear ranges better); one of the
few valid uses being Figure 2.28.
Three-dimensional pie charts are an abomination and may be very misleading because of their tilt, as in
(the exaggerated) Figure 2.29.

Figure 2.28. A valid use of a pie chart

Figure 2.29. The dangers of three-dimensional pie charts

12. Data Presentation: Summary of Guidelines

We summarise some guidelines (rather than rules) on data presentation.


(1) Use tables to display data details in one-to-one relationships, which would be lost in graphs.
(2) Use a line plot to demonstrate how something has changed over a period of time.
(3) Use a barchart to compare discrete data, to show the relative sizes of separate values (use
two-dimensional bar charts as three-dimensional ones can be hard to read).
(4) Consider a pie chart to show how percentages relate to each other within a whole: how
something, the 100%, is shared. But note that the human brain finds it much harder to compare
areas or distances around a circle than it does areas or distances arranged linearly: so a linear
or rectangular display of the areas is preferable. Avoid 3-d pie charts.
(5) Use a map to illustrate differences in rates between and among counties, regions or countries.
Exercise 2.2. What could be the difficulties with 3-d charts displayed on a screen or sheet of paper?
Exercise 2.3. Consider when each of the following might be most appropriate for displaying data:
within the text; in a table; in a graphic.
