The Open Data Handbook (Open Knowledge Foundation)
Release 1.0.0
Table of Contents
1.1 Introduction
1.2 Why Open Data?
1.3 What is Open Data?
1.4 How to Open up Data
1.5 So I've Opened Up Some Data, Now What?
1.6 Glossary
1.7 Appendices
This handbook discusses the legal, social and technical aspects of open data. It can be used by anyone but is
especially designed for those seeking to open up data. It discusses the why, what and how of open data: why to
go open, what open is, and how to open up data.
To get started, you may wish to look at the Introduction. You can navigate through the report using the Table
of Contents (below).
We warmly welcome comments on the text and will incorporate feedback as we go forward. We also welcome
contributions or suggestions for additional sections and areas to examine.
CHAPTER 1
Table of Contents
1.1 Introduction
Do you know exactly how much of your tax money is spent on street lights or on cancer research? What is the
shortest, safest and most scenic bicycle route from your home to your work? And what is in the air that you
breathe along the way? Where in your region will you find the best job opportunities and the highest number of
fruit trees per capita? When can you influence decisions about topics you deeply care about, and whom should
you talk to?
New technologies now make it possible to build the services to answer these questions automatically. Much of the
data you would need to answer these questions is generated by public bodies. However, often the data required
is not yet available in a form which is easy to use. This book is about how to unlock the potential of official and
other information to enable new services, to improve the lives of citizens and to make government and society
work better.
The notion of open data and specifically open government data - information, public or otherwise, which anyone
is free to access and re-use for any purpose - has been around for some years. In 2009 open data started to
become visible in the mainstream, with various governments (such as the USA, UK, Canada and New Zealand)
announcing new initiatives towards opening up their public information.
This book explains the basic concepts of open data, especially in relation to government. It covers how open
data creates value and can have a positive impact in many different areas. In addition to exploring the background,
the handbook also provides concrete information on how to produce open data.
1.1.2 Credits
Credits and Copyright
Contributing authors
Daniel Dietrich
Jonathan Gray
Tim McNamara
Antti Poikola
Rufus Pollock
Julian Tait
Ton Zijlstra
Existing sources directly used
Technical Proposal for how IATI is implemented. The IATI Technical Advisory Group led by Simon Parrish
Unlocking the Potential of Aid Information. Rufus Pollock, Jonathan Gray, Simon Parrish, Jordan Hatcher
Finnish manual authored by Antti Poikola
Beyond Access Report. Access Info and the Open Knowledge Foundation
Other sources
Keep it simple. Start out small, simple and fast. There is no requirement that every dataset must be made
open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine - of
course, the more datasets you can open up the better.
Remember this is about innovation. Moving as rapidly as possible is good because it means you can build
momentum and learn from experience - innovation is as much about failure as success, and not every dataset
will be useful.
Engage early and engage often. Engage with actual and potential users and re-users of the data as early
and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration
of your service is as relevant as it can be.
It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via
info-mediaries. These are the people who take the data and transform or remix it to be presented. For
example, most of us don't want or need a large database of GPS coordinates; we would much prefer a map.
Thus, engage with info-mediaries first. They will re-use and repurpose the material.
Address common fears and misunderstandings. This is especially important if you are working with or
within large institutions such as government. When opening up data you will encounter plenty of questions
and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as
possible.
There are four main steps in making data open, each of which will be covered in detail below. These are in very
approximate order - many of the steps can be done simultaneously.
1. Choose your dataset(s). Choose the dataset(s) you plan to make open. Keep in mind that you can (and
may need to) return to this step if you encounter problems at a later stage.
2. Apply an open license.
(a) Determine what intellectual property rights exist in the data.
(b) Apply a suitable open license that licenses all of these rights and supports the definition of openness
discussed in the section above on "What is Open Data?"
(c) NB: if you can't do this, go back to step 1 and try a different dataset.
3. Make the data available - in bulk and in a useful format. You may also wish to consider alternative ways
of making it available such as via an API.
4. Make it discoverable - post on the web and perhaps organize a central catalog to list your open datasets.
Available: Data should be priced at no more than a reasonable cost of reproduction, preferably as a free download
from the Internet. This pricing model is achievable because your agency should not incur any additional cost
when it provides data for use.
In bulk: The data should be available as a complete set. If you have a register which is collected under statute, the
entire register should be available for download. A web API or similar service may also be very useful, but
it is not a substitute for bulk access.
In an open, machine-readable format: Re-use of data held by the public sector should not be subject to patent
restrictions. More importantly, making sure that you are providing machine-readable formats allows for the
greatest re-use. To illustrate this, consider statistics published as PDF (Portable Document Format) documents,
often used for high-quality printing. While these statistics can be read by humans, they are very hard
for a computer to use. This greatly limits the ability of others to re-use that data, as the sketch below illustrates.
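To make the contrast concrete, here is a minimal sketch in Python of re-using statistics published as CSV; the
file name and column names are invented for illustration. Extracting the same figures from a PDF would require
far more effort, if it is possible at all.

    import csv

    # Once data is published in a machine-readable format such as CSV,
    # a few lines of code are enough to re-use it. "population.csv" and
    # its column names are hypothetical.
    with open("population.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(row["region"], row["population"])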
Here are a few policies that will be of great benefit:
Keep it simple.
Move fast.
Be pragmatic.
In particular, it is better to give out raw data now than perfect data in six months' time.
There are many different ways to make data available to others. The most natural in the Internet age is online
publication. There are many variations to this model. At its most basic, agencies make their data available via
their websites and a central catalog directs visitors to the appropriate source. However, there are alternatives.
When connectivity is limited or the size of the data extremely large, distribution via other formats can be warranted.
This section will also discuss alternatives, which can act to keep prices very low.
Online methods
Via your existing website
The system which will be most familiar to your web content team is to provide files for download from webpages.
Just as you currently provide access to discussion documents, data files are perfectly happy to be made available
this way.
One difficulty with this approach is that it is very difficult for an outsider to discover where to find updated
information. This option places some burden on the people creating tools with your data.
Via 3rd party sites
Many repositories have become hubs of data in particular fields. For example, pachube.com is designed to connect
people with sensors to those who wish to access data from them. Sites like Infochimps.com and Talis.com allow
public sector agencies to store massive quantities of data for free.
Third party sites can be very useful. The main reason for this is that they have already pooled together a community
of interested people and other sets of data. When your data is part of these platforms, a type of positive compound
interest is created.
Wholesale data platforms already provide the infrastructure which can support the demand. They often provide
analytics and usage information. For public sector agencies, they are generally free.
These platforms can have two costs. The first is independence. Your agency needs to be able to yield control to
others. This is often politically, legally or operationally difficult. The second cost may be openness. Ensure that
your data platform is agnostic about who can access it. Software developers and scientists use many operating
systems, from smart phones to supercomputers. They should all be able to access the data.
Via FTP
A less fashionable method for providing access to files is via the File Transfer Protocol (FTP). This may be suitable
if your audience is technical, such as software developers and scientists. The FTP system works in place of HTTP,
but is specifically designed to support file transfers.
FTP has fallen out of favour. Rather than providing a website, looking through an FTP server is much like looking
through folders on a computer. Therefore, even though it is fit for purpose, there is far less capacity for web
development firms to charge for customisation.
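As a rough illustration, bulk files on an FTP server can be fetched with a few lines of standard-library Python;
the server address and file path here are hypothetical.

    from ftplib import FTP

    # Anonymous bulk download from a hypothetical FTP server.
    with FTP("ftp.example.org") as ftp:
        ftp.login()  # anonymous login
        with open("dataset.zip", "wb") as f:
            ftp.retrbinary("RETR /pub/dataset.zip", f.write)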
As torrents
BitTorrent is a system which has become familiar to policy makers because of its association with copyright
infringement. BitTorrent uses files called torrents, which work by splitting the cost of distributing files between all
of the people accessing those files. Instead of servers becoming overloaded, supply increases as demand
increases. This is the reason that this system is so successful for sharing movies. It is a wonderfully efficient way
to distribute very large volumes of data.
As an API
Data can be published via an Application Programming Interface (API). These interfaces have become very popular. They allow programmers to select specific portions of the data, rather than providing all of the data in bulk
as a large file. APIs are typically connected to a database which is being updated in real-time. This means that
making information available via an API can ensure that it is up to date.
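As a sketch of what this looks like from the re-user's side, the following Python fetches a selection of records
from a hypothetical JSON API; the endpoint and field names are invented, not a real government service.

    import json
    import urllib.request

    # Query a hypothetical API for just the records of interest,
    # rather than downloading the whole dataset in bulk.
    url = "https://data.example.gov/api/v1/air-quality?city=London"
    with urllib.request.urlopen(url) as resp:
        records = json.load(resp)

    for record in records:
        print(record["station"], record["updated_at"])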
Publishing raw data in bulk should be the primary concern of all open data initiatives. There are a number of costs
to providing an API:
1. The price. They require much more development and maintenance than providing files.
2. The expectations. In order to foster a community of users behind the system, it is important to provide
certainty. When things go wrong, you will be expected to incur the costs of fixing them.
Access to bulk data ensures that:
1. there is no dependency on the original provider of the data, meaning that if a restructure or budget cycle
changes the situation, the data are still available.
2. anyone else can obtain a copy and redistribute it. This reduces the cost of distribution away from the source
agency and means that there is no single point of failure.
3. others can develop their own services using the data, because they have certainty that the data will not be
taken away from them.
Providing data in bulk allows others to use the data beyond its original purposes. For example, it allows it to be
converted into a new format, linked with other resources, or versioned and archived in multiple places. While the
latest version of the data may be made available via an API, raw data should be made available in bulk at regular
intervals.
For example, the Eurostat statistical service has a bulk download facility offering over 4000 data files. It is updated
twice a day, offers data in Tab-separated values (TSV) format, and includes documentation about the download
facility as well as about the data files.
Another example is the District of Columbia Data Catalog, which allows data to be downloaded in CSV and XLS
format in addition to live feeds of the data.
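Working with a bulk download is typically a two-step affair: fetch the file, then parse it locally. A minimal
Python sketch, with a placeholder URL and file rather than Eurostat's real ones:

    import csv
    import urllib.request

    # Fetch one bulk file, then inspect it locally. The URL is a placeholder.
    url = "https://example.org/bulk/unemployment.tsv"
    urllib.request.urlretrieve(url, "unemployment.tsv")

    with open("unemployment.tsv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)          # column names
        rows = sum(1 for _ in reader)  # count the data rows
    print(header, rows)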
The most important thing is to provide a neutral space which can overcome both inter-agency politics and future
budget cycles. Jurisdictional borders, whether sectorial or geographical, can make cooperation difficult. However,
there are significant benefits in joining forces. The easier it is for outsiders to discover data, the faster new and
useful tools will be built.
Existing tools
There are a number of tools which are live on the web that are specifically designed to make data more discoverable.
One of the most prominent is the DataHub, a catalog and data store for datasets from around the world. The
site makes it easy for individuals and organizations to publish material and for data users to find material they
need.
In addition, there are dozens of specialist catalogs for different sectors and places. Many scientific communities
have created a catalog system for their fields, as data are often required for publication.
For government
As it has emerged, orthodox practice is for a lead agency to create a catalog for the government's data. When
establishing a catalog, try to create some structure which allows many departments to easily keep their own
information current.
Resist the urge to build the software to support the catalog from scratch. There are free and open source software
solutions (such as CKAN) which have been adopted by many governments already. As such, investing in another
platform may not be needed.
There are a few things that most open data catalogs miss. Your programme could consider the following:
Providing an avenue to allow the private and community sectors to add their data. It may be worthwhile to
think of the catalog as the region's catalog, rather than the regional government's.
Facilitating improvement of the data by allowing derivatives of datasets to be cataloged. For example,
someone may geocode addresses and may wish to share those results with everybody. If you only allow
single versions of datasets, these improvements remain hidden.
Be tolerant of your data appearing elsewhere. That is, content is likely to be duplicated to communities
of interest. If you have river level monitoring data available, then your data may appear in a catalog for
hydrologists.
Ensure that access is equitable. Try to avoid creating a privileged level of access for officials or tenured
researchers as this will undermine community participation and engagement.
For civil society
This section looks at additional things which can be done to promote data re-use.
Social media
It's inefficient for cash-strapped agencies to spend hours on social media sites. The most significant way that your
voice can be heard through these fora is by making sure that blog posts are easily shareable. That means, before
reading the next section, make sure that you have read the last. With that in mind, here are a few suggestions:
Discussion fora: Twitter has emerged as the platform of choice for disseminating information rapidly.
Anything tagged with #opendata will be immediately seen by thousands.
LinkedIn has a large selection of groups which are targeted towards open data.
While Facebook is excellent for a general audience, it has not received a great deal of attention
in the open data community.
Link aggregators: Submit your content to the equivalent of newswires for geeks. Reddit and Hacker
News are the two biggest in this arena at the moment. To a lesser extent, Slashdot and Digg are
also useful tools in this area.
These sites have a tendency to drive significant traffic to interesting material. They are also
heavily focused on topic areas.
1.6 Glossary
Anonymisation The process of adapting data so that individuals cannot be identified from it.
Anonymization See Anonymisation.
IARs can be developed in different ways. Government departments can develop their own IARs and these
can be linked to national IARs. IARs can include information which is held by public bodies but which has
not yet been (and maybe will not be) proactively published. Hence they allow members of the public to
identify information which exists and which can be requested.
For the public to make use of these IARs, it is important that any registers of information held should be as
complete as possible, in order to have confidence that documents can be found. The incompleteness of some
registers is a significant problem, as it creates a degree of unreliability which may discourage some from
using the registers to search for information.
It is essential that the metadata in the IARs should be comprehensive so that search engines can function
effectively. In the spirit of open government data, public bodies should make their IARs available to the
general public as raw data under an open license so that civic hackers can make use of the data, for example
by building search engines and user interfaces.
Intellectual property rights Monopolies granted to individuals for intellectual creations.
IP rights See Intellectual property rights.
Machine-readable Formats that are machine readable are ones which are able to have their data extracted by
computer programs easily. PDF documents are not machine readable. Computers can display the text nicely,
but have great difficulty understanding the context that surrounds the text.
Open Data Open data are able to be used for any purpose. More details can be read at opendefinition.org.
Open Government Data Open data produced by the government. This is generally accepted to be data gathered
during the course of business as usual activities which do not identify individuals or breach commercial
sensitivity. Open government data is a subset of Public Sector Information, which is broader in scope. See
http://opengovernmentdata.org for details.
Open standards Generally understood as technical standards which are free from licensing restrictions. Can
also be interpreted to mean standards which are developed in a vendor-neutral manner.
PSI See Public Sector Information.
Public domain No copyright exists over the work. Does not exist in all jurisdictions.
Public Sector Information Information collected or controlled by the public sector.
Re-use Use of content outside of its original intention.
Share-alike License A license that requires users of a work to provide the content under the same or similar
conditions as the original.
Tab-separated values Tab-separated values (TSV) are a very common form of text file format for sharing tabular
data. The format is extremely simple and highly machine-readable.
Web API An API that is designed to work over the Internet.
1.7 Appendices
1.7.1 File Formats
An Overview of File Formats
JSON
JSON is a simple file format that is very easy for any programming language to read. Its simplicity means that it
is generally easier for computers to process than others, such as XML.
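A small illustration of that simplicity in Python, using an invented record:

    import json

    # JSON maps directly onto basic data structures, which is why nearly
    # every language can read it. The record below is invented.
    text = '{"station": "A23", "no2": 41.5, "sampled": "2012-03-01"}'
    record = json.loads(text)
    print(record["no2"])                  # 41.5, already a number
    print(json.dumps(record, indent=2))   # just as easy to write back out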
XML
XML is a widely used format for data exchange because it preserves the structure of the data and the way files
are built up, and it allows developers to embed parts of the documentation within the data without interfering
with machine reading.
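A short Python sketch of walking that structure, with element names invented for illustration:

    import xml.etree.ElementTree as ET

    # Nested elements keep the structure of the data explicit.
    doc = """
    <readings>
      <reading station="A23"><no2>41.5</no2></reading>
      <reading station="B07"><no2>28.0</no2></reading>
    </readings>
    """
    root = ET.fromstring(doc)
    for reading in root.findall("reading"):
        print(reading.get("station"), reading.findtext("no2"))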
RDF
A W3C-recommended format called RDF makes it possible to represent data in a form that makes it easier to
combine data from multiple sources. RDF data can be stored in XML and JSON, among other serializations.
RDF encourages the use of URLs as identifiers, which provides a convenient way to directly interconnect existing
open data initiatives on the Web. RDF is still not widespread, but it has been a trend among Open Government
initiatives, including the British and Spanish Government Linked Open Data projects. The inventor of the Web,
Tim Berners-Lee, has recently proposed a five-star scheme that includes linked RDF data as a goal to be sought
for open data initiatives.
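A brief sketch using the third-party rdflib package (not part of the standard library); the URIs and values are
invented:

    from rdflib import Graph  # pip install rdflib

    # Parse a tiny Turtle document and iterate over its triples.
    ttl = """
    @prefix ex: <http://example.org/> .
    ex:A23 ex:city "London" ; ex:no2 41.5 .
    """
    g = Graph()
    g.parse(data=ttl, format="turtle")
    for subj, pred, obj in g:
        print(subj, pred, obj)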
Spreadsheets
Many authorities hold information in spreadsheets, for example Microsoft Excel. This data can often be used
immediately, provided that the meaning of the different columns is correctly described.
However, in some cases spreadsheets contain macros and formulas, which may be somewhat more cumbersome
to handle. It is therefore advisable to document such calculations alongside the spreadsheet, since that is
generally more accessible for users to read.
Comma Separated Files
CSV files can be a very useful format because they are compact and thus suitable for transferring large sets of
data with the same structure. However, the format is so spartan that data are often useless without documentation,
since it can be almost impossible to guess the significance of the different columns. It is therefore particularly
important for comma-separated formats that documentation of the individual fields is accurate.
Furthermore, it is essential that the structure of the file is respected: a single omitted field may disturb the
reading of all the remaining data in the file, with no real opportunity to rectify it, because it cannot be determined
how the remaining data should be interpreted. A simple validation pass, as sketched below, can catch this early.
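A minimal validation sketch in Python; the file name and expected width are hypothetical:

    import csv

    # CSV carries no structure beyond the delimiter, so check that every
    # row has the expected number of fields before relying on the data.
    EXPECTED_FIELDS = 5
    with open("register.csv", newline="", encoding="utf-8") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_FIELDS:
                print(f"line {lineno}: expected {EXPECTED_FIELDS} fields, got {len(row)}")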
Text Document
Classic documents in formats like Word, ODF, OOXML, or PDF may be sufficient to show certain kinds of data -
for example, relatively stable mailing lists or equivalent. Publishing in these formats may be cheap, as they are
often the formats the data is born in. However, these formats give no support for keeping the structure consistent,
which often means that it is difficult to extract the data by automated means. Be sure to use templates as the
basis of documents that will display data for re-use, so that it is at least possible to pull information out of the
documents.
Using typographic markup as much as possible also supports the further use of the data, since it becomes easier
for a machine to distinguish headings (of whatever type) from content and so on. Generally, it is recommended
not to publish in word-processing formats if the data exists in a different format.
Plain Text
Plain text documents (.txt) are very easy for computers to read. However, they generally exclude structural
metadata from inside the document, meaning that developers will need to create a parser that can interpret each
document as it appears.
Some problems can be caused by moving plain text files between operating systems. MS Windows, Mac OS X
and other Unix variants each have their own way of telling the computer that it has reached the end of a line.
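Most languages paper over this difference when reading; the choice matters when writing. A Python sketch,
with invented file names:

    # Reading: Python's universal-newline mode turns \r\n, \r and \n into \n.
    with open("notes.txt", encoding="utf-8") as f:
        lines = f.read().splitlines()

    # Writing: pick one convention explicitly, regardless of platform.
    with open("out.txt", "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")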
Scanned image
Probably the least suitable form for most data, but both TIFF and JPEG-2000 can at least be marked with
documentation of what is in the picture - right up to marking up an image of a document with the full text content
of the document. Displaying data as images may be relevant for data that was not born electronically - an obvious
example is old church records and other archival material - and a picture is better than nothing.
Proprietary formats
Some dedicated systems have their own data formats in which they can save or export data. It can sometimes be
enough to expose data in such a format - especially if further use is expected to take place in a similar system to
the one it came from. Always indicate where further information on these proprietary formats can be found, for
example by providing a link to the supplier's website. Generally, it is recommended to publish data in
non-proprietary formats where feasible.
HTML
Nowadays much data is available in HTML format on various sites. This may well be sufficient if the data is very
stable and limited in scope. In some cases, it could be preferable to have the data in a form that is easier to
download and manipulate, but as it is cheap and easy to refer to a page on a website, this might be a good starting
point for displaying data.
Typically, it is most appropriate to use tables in HTML documents to hold data, and it is then important that the
various data fields are displayed and given IDs which make it easy to find and manipulate the data. Yahoo
has developed a tool (http://developer.yahoo.com/yql/) that can extract structured information from a website, and
such tools can do much more with the data if it is carefully tagged.
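As an illustration of why careful tagging pays off, even the Python standard library can pull cell text out of a
simple, well-formed HTML table; real pages usually warrant a dedicated tool such as the one above.

    from html.parser import HTMLParser

    # Collect the text of every <td>/<th> cell in a page.
    class CellExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_cell = False
            self.cells = []

        def handle_starttag(self, tag, attrs):
            if tag in ("td", "th"):
                self.in_cell = True

        def handle_endtag(self, tag):
            if tag in ("td", "th"):
                self.in_cell = False

        def handle_data(self, data):
            if self.in_cell and data.strip():
                self.cells.append(data.strip())

    p = CellExtractor()
    p.feed("<table><tr><th>Region</th><th>Trees</th></tr>"
           "<tr><td>North</td><td>1200</td></tr></table>")
    print(p.cells)  # ['Region', 'Trees', 'North', '1200']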
Open File Formats
Even if information is provided in electronic, machine-readable format, and in detail, there may be issues relating
to the format of the file itself.
The formats in which information is published - in other words, the digital base in which the information is stored
- can either be open or closed. An open format is one where the specifications for the software are available
to anyone, free of charge, so that anyone can use these specifications in their own software without any limitations
on re-use imposed by intellectual property rights.
If a file format is closed, this may be either because the file format is proprietary and the specification is not
publicly available, or because the file format is proprietary and even though the specification has been made public,
re-use is limited. If information is released in a closed file format, this can cause significant obstacles to reusing
the information encoded in it, forcing those who wish to use the information to buy the necessary software.
The benefit of open file formats is that they permit developers to produce multiple software packages and services
using these formats. This then minimises the obstacles to reusing the information they contain.
Using proprietary file formats for which the specification is not publicly available can create dependence on
third-party software or file format license holders. In worst-case scenarios, this can mean that information can
only be read using certain software packages, which can be prohibitively expensive, or which may become obsolete.
The preference from the open government data perspective therefore is that information be released in open file
formats which are machine-readable.
Example: UK traffic data
Andrew Nicolson is a software developer who was involved in an (ultimately successful) campaign against the
construction of a new road, the Westbury Eastern bypass, in the UK. Andrew was interested in accessing and using
the road traffic data that was being used to justify the proposals. He managed to obtain some of the relevant data
via freedom of information requests, but the local government provided the data in a proprietary format which
can only be read using software produced by a company called Saturn, who specialise in traffic modelling and
forecasting. There is no provision for a read-only version of the software, so Andrew's group had no choice but
to purchase a software license, eventually paying £500 (€600) when making use of an educational discount. The
main software packages on the April 2010 price list from Saturn start at £13,000 (over €15,000), a price which is
beyond the reach of most ordinary citizens.
Although no access to information law gives a right of access to information in open formats, open government
data initiatives are starting to be accompanied by policy documents which stipulate that official information must
be made available in open file formats. Setting the gold standard has been the Obama Administration, with the
Open Government Directive issued in December 2009, which says:
To the extent practicable and subject to valid restrictions, agencies should publish information online
in an open format that can be retrieved, downloaded, indexed, and searched by commonly used web
search applications. An open format is one that is platform independent, machine readable, and made
available to the public without restrictions that would impede the re-use of that information.
How do I use a given format?
When an authority must publish new data - data that has not been published before - you should choose the format
that provides the best balance between cost and suitability for purpose. For each format there are some things you
should be aware of, and this section aims to explain them.
This section focuses only on how the interfaces are best arranged so that machines can access them directly.
Advice and guidance about how web sites and web solutions should be designed can be found elsewhere.
Web services
For data that changes frequently, and where each pull is limited in size, it is very relevant to expose data through
web services. There are several ways to create a web service, but the most used are SOAP and REST. Generally,
SOAP is more widely used than REST, but REST services are very easy to develop and use, so REST has become
a widely used standard.
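A toy read-only REST-style service in standard-library Python; the path and dataset are invented, and a
production service would sit behind a proper web framework and server.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DATA = [{"station": "A23", "no2": 41.5}]  # stand-in for live data

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/api/readings":
                body = json.dumps(DATA).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    HTTPServer(("localhost", 8000), Handler).serve_forever()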
Database
Like web services, databases provide direct access to data dynamically. Databases have the advantage that they
can allow users to put together just the extraction that they are interested in.
There are some security concerns about allowing remote database extraction, and database access is only useful if
the structure of the database and the significance of individual tables and fields are well documented. Often, it is
relatively simple and inexpensive to create web services that expose data from a database, which can be an easy
way to address these security concerns; a sketch of the pattern follows.
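A sketch of that pattern in Python with the built-in sqlite3 module: the service layer runs a fixed, parameterised
query instead of exposing the database itself. Table and column names are hypothetical.

    import sqlite3

    def get_readings(station_id):
        # Parameter binding (?) keeps user input out of the SQL itself.
        conn = sqlite3.connect("readings.db")
        try:
            cur = conn.execute(
                "SELECT taken_at, no2 FROM readings WHERE station = ?",
                (station_id,),
            )
            return cur.fetchall()
        finally:
            conn.close()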
point of various substances. While the database as a whole might be protected by law so that one is not allowed
to access, re-use or redistribute it without permission, this would never prevent you from stating the fact that
substance Y melts at temperature Z.
Forms of protection fall broadly into two cases:
Copyright for compilations
A sui generis right for collections of data
As we have already emphasized, there are no general rules and the situation varies by jurisdiction. Thus we
proceed country by country detailing which (if any) of these approaches is used in a particular jurisdiction.
Finally, we should point out that in the absence of any legal protection, many providers of (closed) databases are
able to use a simple contract combined with legal provisions prohibiting violation of access-control mechanisms
to achieve results similar to a formal IP right. For example, if X is a provider of a citation database, it can achieve
any set of terms and conditions it wants simply by:
1. Requiring users to login with a password
2. Only providing a user with an account and password on the condition that the user agrees to the terms and
conditions
You can read more about the jurisdiction by jurisdiction situation in the Guide to Open Data Licensing.