SQL for Data Science
Data Cleaning, Wrangling and Analytics with Relational Databases
Antonio Badia
Data-Centric Systems and Applications
Series Editors: Michael J. Carey, University of California, Irvine, CA, USA; Stefano Ceri, Politecnico di Milano, Milano, Italy
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
Data Science (or Data Analytics, or whatever one prefers to call it) is a ‘hot’
topic right now. There is an explosion of courses on the subject, especially online:
many universities and several for-profit and non-profit organizations (Coursera, edX,
Udacity, Udemy, DataCamp, and many others) offer on-campus and online courses,
certification, and degrees. The coverage of these offerings is quite diverse, reflecting
the fact that Data Science is still a young and evolving field. However, many courses
seem to coalesce around a few topics (Machine Learning, mostly) and tools (R,
Python, and SQL, mostly). What few of these courses offer is a textbook.
There are already many books on databases and SQL, but almost all of them
focus on the traditional curriculum for Computer Science majors or Information
Systems majors (there are a few exceptions, like [11] and [17]). In contrast, the
present book explains SQL within the context of Data Science and is more in line
with what is being taught in these new courses. This book introduces the different
parts of SQL as they are needed for the tasks usually carried out during data analysis.
Using the framework of the data life cycle, it focuses on the steps that are given
short shrift in traditional textbooks, like data loading, cleaning, and pre-processing.
This book is for anyone interested in Data Science and/or databases. It should
prove useful to anyone taking any of the abovementioned courses, online or on-
campus, as well as to students working on their own. It assumes very little from
the reader; it just demands a bit of ‘computer fluency,’ but no background on
databases or data analysis. In general, all concepts are introduced intuitively and
with a minimum of specialized jargon. It contains an appendix (Appendix A)
meant to help students without prior experience with databases, with instructions
on how to download and install the two open-source database systems (MySQL and
Postgres) that we use for examples throughout the book. All readers of the book are
encouraged to install both systems and follow the book along with a computer in
order to practice, do the exercises, and play around—simply reading the book alone
is going to be much less useful than using it.
The book is organized as follows: Chapter 1 describes the Data Life Cycle, the
sequence of stages, from data acquisition and ingestion until archiving, that data
goes through as it is prepped for analysis and then actually analyzed, together with
the different activities that take place at each stage. It also explains the different
ways that datasets can be organized, and the different types of data one may have
to deal with. Many students have an intuitive understanding of the concepts in this
chapter, but it is useful to have it all together in one place and to give a name to
each concept for later reference. Chapter 2 gets into databases proper, explaining
how relational databases organize data. The chapter also explains what data in tables
should look like (what Hadley Wickham has called tidy data [19]), a point which
is not traditionally emphasized and can lead to severe problems down the road.
Non-traditional data, like XML and text, are also covered. Chapter 3 introduces
SQL queries, the SQL commands that allow us to ask questions about the data.
Unlike traditional textbooks, queries and their parts are described around typical
data analysis tasks (data exploration, cleaning, and transformation). These tasks
are vital for a proper examination of the data but are frequently overlooked in
Data Mining and Machine Learning textbooks. Chapter 4 introduces some basic
techniques for Data Analysis. Even though this is not the focus of the book, the
chapter shows that SQL can be used for some simple analyses without too much
complication.
After this part, which constitutes the core of the book, Chap. 5 introduces
additional SQL constructs that come in handy in a variety of situations. This chapter
completes the coverage of SQL queries so that readers get an overview of all the
main aspects of this important topic. Chapter 6 briefly explains how to use SQL from
within R and from within Python programs. This chapter is not an introduction to
R (or to Python) and, unlike other chapters in the book, does assume that the reader
is already familiar with at least the basics of R and Python. It focuses on how these
languages can interact with a database, and how what has been learned about SQL
can be leveraged to make life easier when using R or Python.
The book also contains another appendix (besides the one already mentioned),
which introduces some basic approaches for handling very large datasets. The
purpose of this appendix is to demystify the ideas behind the vague label Big Data
and give the readers basic guidance on how to use their newly acquired skills in this
world.
As in many textbooks, none of what this one contains is new. This book covers
the same (or very similar) content to what can be found in many sources, especially
online. What this book does is to put it all together under one roof and to give it
some order and structure. In many blogs and sites, the material is presented as an
answer to a particular question (how do you...?), which may be useful to someone
with a specific need but gives the impression that learning SQL is about a bag of
tricks. Here, the material is logically organized using the idea of the data life cycle
so that all the concepts introduced can be understood as parts of a coherent whole.
Data Science itself is a relatively new and still changing field, but it has deep
roots, as it uses approaches and techniques from well-established fields, mostly math
(statistics, linear algebra, and others) and computer science (databases, machine
learning, and others). As a result, the same concept is sometimes given different
names by different authors in different textbooks. Whenever I am aware of this, I
have given a list of known names so that readers with different backgrounds can
relate what is in here with what they already know.
The goal of the book is to introduce some basic concepts to a wide variety
of readers and provide them a good foundation on which they can build. After
going through this book, readers should be able to profitably learn more about
Data Mining, Machine Learning, and database management from more advanced
textbooks and courses. It is my hope that most of them feel that they have been
given a springboard from which they are in a good position to dive deeper into the
fascinating world of data analysis.
Chapter 1
The Data Life Cycle
It is sometimes said that “data is the new oil.” This is true in several ways:
in particular, data, like oil, needs to be processed before it is useful. Crude oil
undergoes a complex refining procedure as the substance that comes out of wells
is transformed into several products, mostly fuels (but also many other useful
by-products, from asphalt to wax). A complex infrastructure, from pipelines to
refineries, supports this process. In a similar way, raw data must be thoroughly
treated before it can be used for anything. Unfortunately, there is not a big and
sophisticated infrastructure to support data processing. There are many tools that
support some of the steps in the process, but it is still up to every practitioner to
learn them and combine them appropriately.
In this chapter, we introduce the stages through which data passes as it is refined,
analyzed, and finally disposed of. The collection of stages is usually called the data
life cycle, inspired by the idea that data is ‘born’ when it is captured or generated
and goes through several stages until it reaches ‘maturity’ (is ready for analysis) and
finally an end-of-life, at which point it is deleted or archived. Data analysis, which
is the focus of Data Mining and Machine Learning books and courses, is but one
step in this process. The other steps are equally important and often neglected.
The main purpose of this chapter is to introduce a framework that will help
organize the contents of the rest of the book. As part of this, it introduces some
basic concepts and terms that are used in the following chapters. In particular, it
provides a classification of the most common types of datasets and data domains
that will be useful for later work. We will come back to these topics throughout the
book, so the reader is well served to start here, even though SQL itself does not
appear until the next chapter. Also, for readers who are new to data analysis, this
chapter provides a basic outline of the field.
1.1 Stages and Operations in the Data Life Cycle

[Fig. 1.1: The data life cycle. Data flows from the data source through preparation, validation, and analysis; the interpretation of results may loop back to earlier stages; finally, the data is archived or purged.]
The term data life cycle refers both to the transformations applied to data and to the
states that data goes through as a result of these transformations. While there is not,
unfortunately, general agreement on the exact details of what is involved at each
transformation and state, or how to refer to them, there is a wide consensus on the
basic outlines. The states of the cycle can be summarized as follows:
Raw data → cleaned data → prepared data → data + results → archived data
The arrows here indicate precedence; that is, raw data comes first, and cleaned
data is extracted from it, and so on. The activities are usually described as follows.
First, data must be acquired: it is gathered from its source and ingested. When the
data is not directly accessible, sometimes a prior step that uncovers sources of
relevant data must be carried out. What we obtain as the result of this step is called raw data.
It is very important to understand that “raw” refers to the fact that this is the data
before any processing has been applied to it, but does not indicate that this data is
“neutral” or “unfiltered.” In statistics, the domain of study is called the population,
and the data collected about the domain is called the sample. It is understood that
the sample is always a subset of the whole population and may vary in size from a
very small part to a substantial one. However, the sample is never the population,
and the fact that sometimes we have a large amount of data should not fool us
into believing otherwise. For analysis of the sample to provide information about
the population, the sample must be representative of the population. For this to
happen, the sample must be chosen at random from elements of the population
which are equally likely to be selected. It is very typical in data science that the
data is collected in an opportunistic manner, i.e. data is collected because it is
(easily) available. Furthermore, in science data usually comes from experiments,
i.e. a setting where certain features are controlled, while a lot of data currently
collected is observational, i.e. derived from uncontrolled settings. There are always
some decisions as to what/when/how to collect data. Thus, raw data should not be
treated as an absolute source of truth, but should itself be carefully examined.
When data comes in, we can have two different situations. Sometimes datasets
come with a description of the data they contain; this description is called metadata
(metadata is described in some detail in Sect. 1.4). Sometimes the dataset comes
without any indication of what the data is about, or a very poor one. In either
situation, the first step to take is Exploratory Data Analysis (EDA) (also called
data profiling). In this step, we try to learn the basic characteristics of the data
and whatever objects or events or observations it describes. If there is metadata, we
check the dataset against it, trying to validate what we have been told—and augment
it, if possible. If there is no metadata, this is the moment to start gathering it. This is
a crucial step, as it will help us build our understanding of the data and guide further
work. This step involves activities like classifying the dataset, getting an idea of
the attributes involved, and for each attribute, getting an idea of data distribution
through visualization techniques, or descriptive statistics tools, like histograms and
measures of centrality or dispersion.
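To make this more concrete, here is a minimal sketch of the kind of queries used during EDA, written in SQL (SQL queries are introduced in Chapter 3). The table and attribute names (people, city, height) are made up for illustration.

-- how many records are there, and how many distinct cities appear?
SELECT COUNT(*) AS num_records,
       COUNT(DISTINCT city) AS num_cities
FROM people;

-- range, center, and dispersion of a numerical attribute
SELECT MIN(height) AS min_height,
       MAX(height) AS max_height,
       AVG(height) AS avg_height,
       STDDEV(height) AS sd_height
FROM people;

-- a crude histogram: how many people fall on each rounded height value?
SELECT ROUND(height) AS height_bin, COUNT(*) AS frequency
FROM people
GROUP BY ROUND(height)
ORDER BY height_bin;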
We use the knowledge gained in EDA to determine whether data is correct and
complete, at least for current purposes. Most of the time, it will not be, so once we
have determined what problems the data has, we try to fix them. There are often
issues that need to be dealt with: the data may contain errors or omissions, or it may
not be in the right format for analysis. There are many sources of errors: manual
(unreliable) data entry; changes in layout (for records); variations in measurement,
scale, or format (for values); changes in how default or missing values are marked;
or outdated values (called “gaps” in time series). Many of these issues can only be
addressed by changing the data gathering or acquisition phase, while others have to
be fixed once data is acquired.4 The tools and techniques used to fix these problems
are usually called data cleaning (or data cleansing, data wrangling, data munging,
among others). The issues faced, and the typical operations used, include the following (a short SQL sketch after the list illustrates them):
• Finding and handling missing values. Such values may be explicitly or implicitly
denoted. Explicitly denoted missing values are usually identified with a marker
like ‘NULL,’ ‘NA’ (or “N.A.,” for “Not Available”) or similar; but different
datasets may use different conventions. Implicit missing values are denoted
by the absence of a value instead of by a marker. Because of this variety,
finding missing values is not always easy. Handling the absence of values can
be accomplished simply by deleting incomplete data, but there are also several
techniques to impute a missing value, using other related values in the dataset.
For example, assume that we have a dataset describing people, including their
weight in pounds. We realize that sometimes the weight is missing. We could
look for the weight of people with similar age, height, etc. in the dataset and use
such values to fill in for the missing ones.
• Finding and handling outliers. Outliers are data values that have characteristics
that are very different from the characteristics of most other data values in a set.
For example, assume that in the people dataset we also have their height in feet.
This is a value that usually lies in the 4.5–6.5 range; anyone below or above is
considered very short or very tall. A value of 7.5 is possible, but suspicious; it
could be the result of an error in measurement or data entry. As this example
shows, finding outliers (and determining when an outlier is a legitimate value or
an error) may be context-dependent and extremely hard.
• Finding and handling duplicate data. When two pieces of the dataset refer to the
same real-world item (entity, fact, event, or observation), we say the data contains
duplicates. We usually want to get rid of duplicate data, since it could bias (or
otherwise negatively influence) the analysis. Just like dealing with outliers, this
is also a complex task, since it is usually very hard to come up with ways to
determine when duplicate data exists. Using again the example of the people
dataset, it is probably not smart to assume that two records with the same name
refer to the same person; some names are very common and we could have
two people that happen to share the same name. Perhaps if two records have
the same name and address, that would do—although we can imagine cases
where this rule does not work, like a mother and a daughter with the same
name living together. Maybe name, address, and age will work? Many times,
the possibility of duplication depends on the context; for instance, if our dataset
comes from children in a certain school, first and last name and age will usually
do to determine duplication; but if the dataset comes from a whole city, this may
not be enough.
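The SQL sketch below hints at how these operations look in practice on the hypothetical people table used in the examples above; it is only an illustration (SQL is introduced in Chapter 3), not a complete cleaning recipe.

-- missing values: how many records have no weight at all?
-- (if missing values are marked with a label like 'NA' instead of NULL,
--  the condition must be adapted to look for that marker)
SELECT COUNT(*) AS missing_weight
FROM people
WHERE weight IS NULL;

-- outliers: list suspicious heights, using the 4.5-6.5 feet range mentioned above
SELECT *
FROM people
WHERE height < 4.5 OR height > 6.5;

-- duplicates: name/address combinations that appear more than once
SELECT name, address, COUNT(*) AS occurrences
FROM people
GROUP BY name, address
HAVING COUNT(*) > 1;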
4 The overall management of issues in data is sometimes called Data Quality; see Sect. 1.4.
The result of these activities is usually referred to as clean data, as in ‘data that has
been cleaned and fixed.’ While cleaning the data is a necessary pre-requisite for any
type of analysis, at this point the data is still not ready to be analyzed. This is because
different types of analysis may require different additional treatment. Therefore,
another step, usually called data pre-processing or data preparation, is carried out
in order to prepare the data for analysis. Typical tasks of this step include the following (a brief SQL sketch after the list illustrates a couple of them):
• Transformations to put data values in a certain format or within a certain
frame of reference. This involves operations like normalization, scaling, or
standardization.5
• Transformations that change the data value from one type to another, like
discretization or binarization.
• Transformations that change the structure of the dataset, like pivoting or
(de)normalization. Most data analysis tools assume that datasets are organized
in a certain format, called tabular data; datasets not in this format need to be
restructured. We describe tabular data in the next section and discuss how to
restructure datasets in Sect. 3.4.
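Again as an illustration only (these transformations are covered properly later in the book), here is how min-max scaling and a simple binarization might look in SQL over the hypothetical people table; the column names and the 6-foot threshold are made up.

-- min-max scaling: map weight into the [0,1] interval
SELECT name,
       1.0 * (weight - (SELECT MIN(weight) FROM people)) /
       ((SELECT MAX(weight) FROM people) - (SELECT MIN(weight) FROM people)) AS weight_scaled
FROM people;

-- binarization: turn height into a two-valued (0/1) attribute
SELECT name,
       CASE WHEN height >= 6.0 THEN 1 ELSE 0 END AS is_tall
FROM people;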
Data is now finally ready for analysis. Many techniques have been developed for
this step, mostly under the rubric of Statistics, Data Mining, and Machine Learning.
These techniques are explained in detail in many other books and courses; in this
book we explain a selected few in detail (including an implementation in SQL) in
Chap. 4.
Once data has been analyzed, the results of the analysis are usually examined
to see if they confirm or disprove any hypothesis that the researcher/investigator
may have in mind. The results sometimes generate further questions and produce
a cycle of further (or alternative) data analysis. They can also force a rethinking of
assumptions and may lead to alternative ways of pre-processing the data. This is
why there is a loop in Fig. 1.1, indicating that this may become an iterative process.
Finally, once the cycle of analysis is considered complete, the results themselves
are stored, and a decision must be taken about the data. The data is either purged
(deleted) or archived, that is, stored in some long-term storage system in case it is
useful in the future. In many cases, the data is published so it can be shared with
other researchers. This enables others to reproduce an analysis, to make sure that
the results obtained are correct. The publication also allows the data to be reused
for different analyses. Whenever data is published, it is very important that it be
accompanied by its metadata, so that others can understand the meaning of the
dataset (what exactly it is describing) as well as its scope and limitations. If the data
was cleaned and pre-processed, those activities should also be part of the metadata.
In any case, data (like oil) should be disposed of carefully.
5 Again, readers not familiar with these should wait until their introduction in the next chapter.
1.2 Types of Datasets
Our first task is to understand the data. Here we describe how datasets are usually
classified and described.
There are, roughly speaking, two very different types of data: alphanumeric
and multimedia data. Multimedia refers to data that represents audiovisual (video,
images, audio) information. This data is usually encoded using one of the several
standards for such media (for instance, JPEG for digital images6 or MPEG for
audio/video7). Alphanumeric data refers to collections of characters8 used to
represent individual alphabetic (names) and numeric values. For instance, ‘123’
represents a number (an integer) in decimal notation; ‘blue’ represents the name
of a color. Such data is used to provide basic values, which are then grouped or
organized in several ways (described below). Most methods for data analysis have
been developed to deal with alphanumeric data, and that is the only data that we
cover in this book. Handling multimedia data requires specialized tools: in order
to display the image or play the video or music, a special program (a ‘video/audio
player’) that understands how the encoding works is needed.
An alphanumeric dataset (henceforth, simply a ‘dataset’) is a collection of data
items. An item is usually called a row or tuple in database parlance; a record, in
general Computer Science parlance; an observation, in statistical parlance; or an
entity, instance, or a (data) point in other contexts. Each item describes a real-world
entity, fact, or event; it consists of a group of related characteristics, each one giving
information about some aspect of the entity, fact, or event being described. Such
characteristics are called attributes in database parlance; variables, in Statistics;
features in Machine Learning; and properties or measurements in other contexts.
For instance, in our previous example of the people dataset, we implicitly assumed
that the data was an assemblage of items, each item describing a person, and that
the items were composed of attributes describing (among others) the name, address,
age, weight, and height of each person. Important note: in this book we will use
the terms record (although we will still use row for data in tables) and attribute from
now on as unifying vocabulary; in formulas, we will use r, s, r1, r2, ... as variables
over records and A, B, A1, A2, ... as variables over attributes.
The number of records in the dataset is usually termed its size (in databases, the
cardinality). Conversely, the number of attributes present in a record is called the
dimensionality. In some cases, all records in a dataset are similar and share the same
(or almost the same) dimensionality, so that we can speak of the dimensionality of
the dataset too.
6 https://en.wikipedia.org/wiki/Image_file_formats.
7 https://en.wikipedia.org/wiki/MPEG-1.
8 In this context, a character is any symbol that can be produced by a key on the computer's keyboard.
Structured data refers to datasets where all records share a common schema, i.e.
they all have values for the same attributes (sometimes, such datasets are called
homogeneous, to emphasize that all records have a similar structure).
Tabular data is structured data where all attributes are considered simple:
their values are all labels or numbers without any parts. For example, the dataset
ny-flights is considered tabular, while a dataset with records like the imaginary
employee in the previous example would not be considered a tabular dataset, as
some attributes are complex. As we will see later, this type of data is called
hierarchical. It is sometimes possible to transform hierarchical data into tabular and
vice versa.
When tabular data is in a file, each record is usually in a separate line, and inside
each line, each attribute is separated from the next by a character called a delimiter;
usually, a comma or a tab. When using commas, the file is called a CSV file (for
Comma Separated Values). If the schema itself (names of attributes) is included in
the file with the data, it is usually in the first line—this is the reason it is called the
header.
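For instance, the first two records of the ny-flights dataset, shown as a table a little further below, would look as follows in a CSV file with a header line; quoting conventions may vary from file to file.

flightid,year,month,day,dep_time,sched_dep_time,dep_delay,carrier,flight,origin,dest
1,2013,1,1,517,515,2,"UA",1545,"EWR","IAH"
2,2013,1,1,533,529,4,"UA",1714,"LGA","IAH"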
Tabular data is so-called because it is often presented as a table, organized into
rows and columns. The rows correspond to data records and the columns to their
attributes. Intuitively, each row describes an entity, or an event, or a fact that we
wish to capture, with the attributes describing aspects of the entity (or event or fact).
Not all data in tables follows this structure; data that does is called tidy, as we will
see in Sect. 2.1.4.
In most datasets, the size (recall: the number of rows or records in the dataset)
is much larger (at least one order of magnitude) than the dimensionality (recall: the
number of attributes in the schema), but some datasets have high dimensionality.
As an example of tabular data, the first records of the ny-flights dataset (Example 1.2.1) look as follows:
flightid year month day dep_time sched_dep_time dep_delay carrier flight origin dest
1 2013 1 1 517 515 2 "UA" 1545 "EWR" "IAH"
2 2013 1 1 533 529 4 "UA" 1714 "LGA" "IAH"
Semistructured data is data where each record may have a different schema
(also sometimes called heterogeneous data). In particular, some attributes may be
optional, in that they are present in some records and not in others. Also, attributes
may be a mix of simple and complex. Finally, some attributes may have as value
collections of values as opposed to a single one (with each value in turn being simple
or complex). As a consequence, records in semistructured data may have a complex
structure.
In most real situations, semistructured data is used for datasets where attributes
are not simple. Hence, it is very common to have datasets where the records, on top
of being different from each other, have a complex structure.
10 Available at https://bit.ly/30ZDJbf.
11 This example is taken from the Enron dataset, a collection of emails from the Enron Corporation that has been used extensively by researchers for testing and analysis, as it is one of the few large collections of real emails that are publicly available.
ID: 19475126.1075855757890
Header:
    timestamp:
        date: Sun, Feb 4 2001
        time: 03:06:00
    sender: robert.benson@enron.com
    receiver: bsunsurf@aol.com
    subject: Rob-are you getting this?
    CC: peter.shipman@axiaenergy.com, gwadsworth@midf.com
Body: "How about lunch tomorrow?"
The above presentation mixes schema (attribute names) and data (values), while
the tabular representation shows the schema once and does not repeat it. This is due
to the fact that in tabular data, all records share the same schema, and therefore there
is no need to repeat it for each record. On semistructured data, though, a record may
be different from others; hence, we need to indicate, in each record, which attributes
are present. Semistructured data is sometimes called self-describing, because each
record contains both schema and data.
Semistructured data includes data in XML and JSON, two very popular data
formats. They are very similar, differing only in how they present the data and a
few other small details. XML describes the schema by using tags, labels that are
enclosed in angle brackets. Tags always come in pairs, composed of an opening
tag and a closing tag. They can be told apart because the closing tag is exactly
like the opening one but with the addition of a forward slash. The value of the attribute
goes between the tag pair.
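For instance, the email record above could be written in XML along the following lines (this listing is a reconstruction for illustration; the tag names simply follow the attribute names used earlier):

<Email>
  <ID> 19475126.1075855757890 </ID>
  <Header>
    <Timestamp>
      <Date> Sun, Feb 4 2001 </Date>
      <Time> 03:06:00 </Time>
    </Timestamp>
    <Sender> robert.benson@enron.com </Sender>
    <Receiver> bsunsurf@aol.com </Receiver>
    <Subject> Rob-are you getting this? </Subject>
    <CC>
      <Address> peter.shipman@axiaenergy.com </Address>
      <Address> gwadsworth@midf.com </Address>
    </CC>
  </Header>
  <Body> "How about lunch tomorrow?" </Body>
</Email>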
Note that indentation is no longer needed, as the tags clearly indicate the data for all
attributes, simple or complex; it is used here only to aid legibility.
JSON uses a different format for the same idea. Instead of tags, JSON uses labels
for attributes, which are separated by a colon (:) from their value. Also, when an
attribute denotes a collection, the values are enclosed in square brackets ([]), and
when they denote a complex object, they are enclosed in curly brackets ({}).
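The same email record, written here again only as an illustration, could look like this in JSON:

{ "ID": "19475126.1075855757890",
  "Header": {
      "Timestamp": { "Date": "Sun, Feb 4 2001", "Time": "03:06:00" },
      "Sender": "robert.benson@enron.com",
      "Receiver": "bsunsurf@aol.com",
      "Subject": "Rob-are you getting this?",
      "CC": [ "peter.shipman@axiaenergy.com", "gwadsworth@midf.com" ]
  },
  "Body": "How about lunch tomorrow?"
}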
Note how the whole thing is enclosed in curly brackets, as it denotes a single record.
Note also how some values are enclosed in quotes, so that it is clear where they end
(there are no ending tags in JSON). All values (even complex ones) end with a
comma, with the exception of the last value in the array.
When complex attributes are present, the schema of a record can be described by
what is called, in Computer Science, a tree: a hierarchical structure with a root or
main part that is subdivided into parts. Each part can in turn be further subdivided.
The parts that are not subdivided anymore, i.e. those without any sub-parts, are called the leaves of the tree. In this example, the root
is Email and the leaves are ID, Date, Time, Address, Sender, Receiver, Body. The
node A above another node B is called the parent of B; B is the child of A. In a
tree, each node has only one parent, but an arbitrary number of children. The parent
of Sender is Header; the parent of Header is Email. The node CC has multiple
children, indicated by the dots (. . . ).
[Tree diagram of the email schema: the root Email has children ID, Header, and Body; Header in turn has children timestamp (with Date and Time below it), Sender, Receiver, and CC, which has one Address child per recipient (indicated by dots).]
This tree represents the schema of the above examples; to add data, we would add
a value to each leaf in the tree (since leaves represent simple attributes that have a
simple value).
If we wanted to put this data in a table, we would have missing values; when there
is no value for a cell, it is left empty.
As another example, consider the following XML fragment, with (partial) records describing employees of the City of Chicago:12

<Employees>
  <Employee>
    <Name> Jones </Name>
    <JobTitle> Pool Motor Truck </JobTitle>
    <Department> Aviation </Department>
    <FullPartTime> P </FullPartTime>
    <SalaryHourly> Hourly </SalaryHourly>
    <TypicalHours> 10 </TypicalHours>
    <HourlyRate> 32.81 </HourlyRate>
  </Employee>
  <Employee>
    <Name> Smith </Name>
    <JobTitle> Aldermanic Aide </JobTitle>
    <Department> City Council </Department>
    <FullPartTime> F </FullPartTime>
  </Employee>
</Employees>
12 Available at https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w.
Semistructured data can be used for tabular data, since semistructured data can
accommodate cases where all records have the same schema, although in this case
repeating the schema for each record is highly redundant. The idea is to treat the
table as an object made of a repeated attribute; this attribute in turn is an object
representing the row, with attributes for each column.
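For instance, a tiny table of people (names and values made up for illustration) could be carried as a single JSON object whose People attribute collects one object per row:

{ "People": [
    { "Name": "Jones", "Age": 35, "Height": 5.9 },
    { "Name": "Smith", "Age": 42, "Height": 6.1 }
  ]
}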
Exercise 1.2 Put the data about New York flights in XML and JSON.
One particular, important case of semistructured data is graph data. Graph data
(sometimes called network data) represents collections of objects that are connected
(or linked, or related) to each other. A fundamental part of the dataset, then, is not
just the objects themselves, but their connections. In many common situations, each
connection links a pair of objects; the graph can be seen as a collection of objects
and their pair-wise connections.13 In the context of graph data, the objects are called
nodes or vertices, and the links are called edges.
This is usually depicted in a diagram, typically using circles for the nodes and
arcs between them for the edges.
13 It is possible for a connection to link more than two objects; the typical example is a relation
SUPPLIES that connects suppliers, parts, and projects (3 objects). Graphs are limited to binary
(two objects) relations, but are still very useful.
[Graph diagram: circles represent users such as Daphne and Velma, with labeled arcs between them such as Follows, Likes, and Re-tweets.]
One way to think of graph data is as records where the value(s) of some attributes
are other records. For instance, in the previous example we can think of each node
as a ‘person’ record where some attributes (like “follows”) have as the value another
person. Seen this way, graph data is semistructured because a node may have any
number of such attributes—since a node may have an arbitrary number of edges to
other nodes.
What makes graph data special is that the relationships between nodes are
considered particularly important, so most analysis focuses on properties related
to the edges, like finding whether two nodes N1 and N2 are connected (i.e. whether
there is a sequence of edges, called a path, leading from N1 to another node N',
from N' to another node N'', and so on, until one last edge leads to N2),
or how many connections there are from a given node to others. We discuss graph
processing in Sect. 4.6.
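Although graph processing is discussed in Sect. 4.6, it is worth noting already that a graph can be kept in a relational database as two tables, one for the nodes and one for the edges; the following sketch (all table and column names are made up) stores a small social network that way and asks a simple connection question.

CREATE TABLE nodes (
    name VARCHAR(50) PRIMARY KEY);

CREATE TABLE edges (
    source VARCHAR(50) REFERENCES nodes(name),
    target VARCHAR(50) REFERENCES nodes(name),
    label  VARCHAR(20));

-- who does Daphne follow?
SELECT target
FROM edges
WHERE source = 'Daphne' AND label = 'Follows';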
A few variations of this idea are useful. Sometimes the edges between nodes
have a direction, that is, they represent a one-way relationship. Such a relationship is
directed; a graph made up (exclusively) of directed edges is called a directed graph.
14 This example is a simplification. Both Facebook and Twitter keep a very detailed dataset for each
user, including every photo, video, or text that the user has ever posted, all their profile information,
and everyone you have ever declared a friend (plus, in the case of Facebook, information about
advertising categories that Facebook thinks you fit).
16 1 The Data Life Cycle
Other times, the edges have no direction. Such a relationship is undirected; a graph
made up (exclusively) of undirected edges is called an undirected graph. A good
example of this is Facebook vs. Twitter. On Facebook, if person A is a friend of
person B, then (usually) person B is a friend of person A, so this is an undirected
relation. In Twitter, if person A ‘follows’ person B, it may or may not be the case
that person B follows person A, so this is a directed relation.
Sometimes nodes in a graph represent different types of data. For instance, we
may have a graph where some nodes represent people and other nodes represent
books, and an edge between a person node and a book node represents the fact that
the person has read the book. In such cases, nodes usually have a type attribute,
and the graph is called a typed graph. Also, in some cases the edges of a graph
have labels, a value that gives some additional information about the connection
represented by the edge. Following the example of books and people, suppose
that some people may have read the same book more than once, and so it is necessary
to indicate, when a person has read a book, how many times this has happened.
Such information is usually represented by a label with a positive integer value. The
graphs that use labels are called labeled graphs. The graph depicted in the previous
example is directed and labeled.
It is interesting to note that, technically speaking, trees are special types of
graphs. In a tree, edges are directed in the direction parent → son, and each node
must have only one parent (although it can have an arbitrary number of sons). Also,
trees do not admit cycles (paths that end up in the same node that they started). This
implies that no node can appear as its own descendant (a son, or a son of a son, or a
son of a son of a son, etc.) or as its own ancestor (a parent, or a parent of a parent,
or a parent of a parent of a parent, etc.). Hence, all information coming in XML and
JSON can be considered ‘graph data.’ However, this is not how such data is treated.
For one, trees are much simpler to deal with than arbitrary graphs, so it is important
to note when data schema forms a tree and when it forms a graph. The reverse of
this is that graph data can handle some things that are difficult to represent as a tree,
as we will see in the next chapter.
Finally, in some datasets certain attributes are textual in nature, as in the email
dataset example, where the subject and body attributes can be considered text.
We discuss how to store text in Sect. 2.3.3 and how to analyze it in Sect. 4.5.
Again, it is important to remember that the same dataset can be structured in
different ways. For instance, even though Twitter data can be seen as a graph,
Twitter actually shares its data using JSON: each Tweet is described as a JSON
structure with fields like text, user (with subobjects id, name, screen_name,
location), entities (with subobjects hashtags and urls, both of them collections).
As another example, the ny-flights data introduced in a tabular way in Exam-
ple 1.2.1 could also be presented as graph data by making the airports the nodes and
considering each flight an edge from airport A to airport B, using the information
in attributes origin and dest. Note that this is a directed graph, and that the edges
do not have just a single label, but a collection of them (all other attributes). This
format could be more appropriate than the tabular one for certain types of analysis.
This lesson carries one very important consequence: when we acquire data, it will
come in a certain format. This format may or may not be the appropriate one for the
analysis that we want to carry out. Hence, it will sometimes become necessary to
transform or restructure the data from one format to another. We will discuss this
task in Sect. 3.4.1.
Exercise 1.3 A description of Twitter data in JSON can be found at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.
Using that description, build a small JSON dataset with two imaginary users.
Exercise 1.4 Take a few rows of the ny-flights data from Example 1.2.1 and
create a diagram to show it in graph format.
One final observation is that some datasets come as a mix of formats, which
usually renders them useless for analysis. Typical examples include data from
spreadsheets, very popular tools for dealing with data. Even though many datasets
look like a collection of columns organized by rows, spreadsheets are structured
differently—in fact, they are not structured at all. A user is free to put whatever she
wants in any cell of the spreadsheet; as a result, columns may not contain the same
type of data at all. Also, rows are not required to look like other rows. Spreadsheets
and their problems are discussed in depth in Sect. 2.1.4.
1.3 Types of Domains

As we have seen, structured and semistructured data are made up of basic building
blocks: individual attributes that represent features or characteristics of data records.
Complex attributes are made up of schemas of simple ones; in the end, each simple
attribute represents a domain or set of values. For a given data record, the attribute
provides a value from its domain. For instance, in the tabular dataset with people
information, the attribute ‘height’ has values that provide the height of a particular
person, expressed by numbers like 5.8 (if we assume that heights are measured in
feet and inches) or 183 (if we assume that heights are expressed in centimeters).
We would expect all such values to be numbers and to have a certain range. Thus,
we can say that the attribute represents a numerical domain, expressed by real,
positive numbers with one digit precision and up to a certain value. Conversely, an
attribute like ‘name’ would have values like “Smith”; this domain is not numerical,
and in principle it seems impossible to determine what values are in it (even
something like “Xyz” could be a name in some languages). It would make sense
to compare two heights and see which number is larger. It would not make sense
to compare two names that way. It is clear that domains are of different types, and
that different operations can be applied to different domains. Hence, recognizing
different domains is an essential part of data analysis. Roughly speaking, domains
for simple attributes fall into one of the three categories: nominal (also called
20 1 The Data Life Cycle
categorical), ordinal, and numerical. Some people call nominal data qualitative
and numerical data quantitative.16 We explain each domain in more detail next.
16 There is no total agreement on the details of this division.
18 Here and in the next few paragraphs we mention several statistical operations; for readers not
familiar with them, these are all discussed in the next chapter.
Nominal (categorical) attributes have as values labels with no order or structure defined on them. Sometimes the labels are organized in a hierarchy; for instance, in a dataset of locations, an address may be located within a city, the city within a county, the county within a state, the state
within a country. When looking at such data, it is possible to look at the dataset from
several levels of the hierarchy; this is usually called the granularity level. Moving
across levels is, in fact, a common analysis technique.
Ordinal attributes are, like nominal ones, sets of labels. However, a linear ordering is
available for the domain. We can think of a linear order as arranging the elements in
a line (hence the “linear”), with elements ‘going before’ others. Technically, a linear
order is one that is irreflexive (no element goes before itself), asymmetric (if A goes
before B, it cannot be that B goes before A), and transitive (if A goes before B and B
goes before C, then A goes before C). As a consequence of having an order, we can
always enumerate all elements in the domain and distinguish between a first one, a
second one, ..., and a last one.19 The order also allows us to compare values with
respect to their position in the order. Examples would be classifying symptoms of a
sickness as very mild, mild, medium, severe, very severe; or opinions on a subject
as strongly agree, agree, neutral, disagree, strongly disagree. Note that, since values
can be ranked, we can compute a median (and, in the associated frequencies, a
mode) and also percentiles and rank correlation, but we cannot do further operations.
This type of attribute also includes codes; for instance, using ‘1,2,3,4,5’ for ‘very
satisfied,’ ‘satisfied,’ etc. These codes are still not numbers.
Ordinal attributes occupy a gray space between nominal and numerical attributes.
They are labels, but the fact that they can be ordered means that we can, to a small
degree, treat them as numbers. In fact, numbers are also ordinal attributes, as they
can always be put in the usual linear order < (‘less than’).
Numerical attributes are representations of quantities, dimensions, etc. and are truly
represented by numbers, be they integers or reals.20 It is customary to further
distinguish two types of numerical attributes:21
• interval: in these, the zero value (i.e. a point where all values start) is arbitrary,
but the difference between one value and another is meaningful, so we can
compare values and add/subtract them, while taking ratios
19 At least for finite domains, which are the only ones of interest here.
20 Note that reals can only be represented in a computer with a certain degree of precision. This
will be important for numerical computations, and we will discuss it later.
21 Some statistics textbooks distinguish between interval, ratio, and absolute types.
makes no sense. We can take the mean, median, and mode, as well as the range
and standard deviation/variance. Hence, we can apply the t-test and calculate
Pearson’s correlation coefficient. A typical example is a ‘date’ attribute: there
is no ‘zero’ date, but the difference between two given dates can always
be calculated. Another example is the temperature measured in Celsius or
Fahrenheit: while both scales have a zero degree mark, they both continue below
zero: zero is not the absolute lowest value, and the place where the zero sits is
somewhat arbitrary (as shown by the fact that the ‘zero’ in Celsius and the ‘zero’
in Fahrenheit are different).
• ratio: in these, there is an absolute zero, hence ratios make sense. A typical
example is the temperature measured in Kelvin degrees (in this scale, zero
marks the point where there is no thermal motion, and therefore it is impossible
to go below it). In fact, most measurements (of time, distance, mass, money,
etc.) fit here. For instance, a time of 0 s means no time has passed; a time of
10 s is twice as much as 5 s. With this domain, we can compare the values,
add/subtract, multiply/divide, find mode, median, and mean (including arithmetic
and geometric mean), as well as pretty much any mathematical manipulation.
Note that the counts that result from associating frequencies with categorical values
are in this category, which is why calculating frequencies is so useful and so
commonly done.
Determining the type of a domain is extremely useful for analysis since it gives
us an idea of what operations make sense for an attribute.
Exercise 1.5 Classify the domains of all the attributes in the schema of
ny-flights.
Exercise 1.6 A car database has an attribute called “Maintenance.” Typical values
are “60000 km/2 years,” “80000 km/3 years.” Is this a categorical or numerical
attribute? What should it be for analysis purposes?
For practical purposes, it is a good idea to examine the domain from the point of
view of the possible values allowed—equivalently, how membership in the domain
is determined. Roughly, we can distinguish between:
• closed sets: also called enumeration domains. These are domains where the set
of labels allowed can be precisely given. Examples are the cities in a country
or the months of the year. These domains tend to be static, i.e. do not change
over time. There is usually some external resource that can help determine
membership, which is simply a matter of looking the label up in the resource.
Even though these simple sets can present issues, determining membership is
usually not hard. For instance, the list of cities in a country may appear under
different names—a list of German cities in German vs. one in English—but
once the language has been fixed, we can determine the valid labels.
• bounded sets: these are domains where the elements are taken from a larger
domain but somehow limited in some way. A typical example is age, which
is expressed by a number—but not any number. For instance, the age of a
person, expressed in years, can be any number between 0 and, say, 120. Negative
numbers are excluded since they do not make sense; fractions and reals are also
excluded due to the convention of expressing age in rounded amounts. Very large
numbers (like 5,000) are out of bounds (this would be a typical example of an
outlier). The difficulty of determining membership in these domains depends on
whether the bounds are fixed or variable, strict or fuzzy.
• patterned sets: these are domains where all elements must obey a certain pattern
(or one of a finite number of patterns). Typical examples are email addresses
and phone numbers. The difficulty of determining membership here depends
on the complexity of the patterns. For instance, for regular mail addresses in
one country, the complexity tends to be low; however, mail addresses over the
world follow a bewildering variety of patterns, and determining whether an
international address is correct can be extremely hard.
• open sets: these are sets where it is not possible to give a definite criterion for
membership. A typical example is person names. What counts as a person’s
name, even in a given language, is subject to change over time (i.e. these sets
tend to be dynamic) and is open-ended (in some countries, parents can give their
children whatever they want as a name, i.e. they do not have to follow any rules).
Consequently, determining membership may be very difficult or even impossible.
The importance of these distinctions is that one of the main tasks during EDA
and cleaning is to make sure that all values in our data are correct for their domain.
As we can see, deciding this depends on the type of the underlying domain, and it
may be extremely simple or plain impossible.
In the employee dataset seen earlier, attributes like Typical Hours, Annual Salary, and Hourly Rate are very likely bounded, in
that we can determine minimum and maximum values for each. If this is correct,
we can determine whether values in our data are within permissible bounds or they
are the result of some error.
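As a preview of how such checks can be written, the following SQL sketch flags values that fall outside their domain; the tables, the bounds, the list of departments, and the email pattern are all illustrative assumptions.

-- bounded domain: ages outside an assumed 0-120 range
SELECT * FROM people WHERE age < 0 OR age > 120;

-- closed domain: departments not in an assumed list of valid labels
SELECT * FROM employees
WHERE department NOT IN ('Aviation', 'City Council', 'Police');

-- patterned domain: email addresses that do not match even a very rough pattern
SELECT * FROM people WHERE email NOT LIKE '%@%.%';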
Exercise 1.7 Classify the domains of the attributes in the schema of ny-flights
from this point of view.
1.4 Metadata
Metadata is data about data. It describes what the data refers to and its characteris-
tics as a whole. Having metadata is highly beneficial for several reasons:
• To operate meaningfully on data, we must only carry out operations that make
sense for that data: it makes no sense to add a temperature and a height, even
though they are both numbers.
• In general, having metadata can guide our decision as to how to clean and pre-
process the data.
• Sometimes data analysis will reveal that a transformation was not quite appro-
priate; we may want to undo the effects of the change and perhaps try something
different. This is much easier to do if we recorded clearly what the changes were
in the first place.
• If we need to share/export our data, we need to explain what the data are (what
they mean or refer to) for others to be able to use them. Data storage and
manipulation always happens in the context of a project that has certain goals.
However, very frequently data collected for a certain purpose is reused later for
different projects, with different goals. Hence, it is important to understand what
the data was originally meant to represent, and how it has been transformed, in
order to assess suitability for different purposes than the original one.
• Because it provides a trace of how data came to be, metadata is crucial in helping
with repeatability of analysis in order to generate reproducible results. This is
becoming more and more important in all kinds of analysis.
In spite of its importance, metadata has traditionally been ignored in data
projects. One reason is that it is typically considered overhead, i.e. work that gets in
the way of obtaining results from the data. Another significant reason is that there is
very little agreement among experts as to what constitutes good metadata, and how
it should be generated and stored. However, given its potential positive impact, an
attempt should be made to manage the metadata of any given dataset. In this section,
we give some basic guidance as to what metadata should be kept and when it should
be created; later in Sect. 3.5 we discuss how to store and manage it.
The concept of metadata is vague and it can be extended to cover many aspects of
data and data processing. As a consequence, there are many diverse classifications
of metadata in the literature. However, there are some basic parts that enjoy wide
support:
• structural metadata: for both structured and semistructured data, metadata should
include the schema of a given dataset23 and also a description and classification
of the domain of each attribute in the schema.
• domain metadata: Domain information is especially useful. Two aspects merit
mention: syntactic and semantic. Syntactic metadata refers to how data is
represented, what kind of values it can take. For numeric values, this includes
appropriate range and typical values; in the case of measures, unit, precision,
and scale are a must. For instance, a field giving salaries may have values in
Canadian dollars, representing thousands (so that the value ‘85’ actually means
‘CAN $85,000’); a field giving people’s heights (an example we have used
before) should, at the minimum, describe which units are involved: metric ones
(meters and centimeters) or English units (feet and inches).24 For categorical
values, this may involve (depending on the kind of domain) a list of valid values
or valid patterns; this will allow us to check for correct data. For instance, a
name field may come with a description of what a good value is supposed to
look like: last name, followed by a comma, followed by a first name, with both
names capitalized. For dates, a description of the format (for instance, ‘month-
day-year’) is highly desirable, as this avoids (typical but painful) confusions.
Semantic metadata refers to what the data is supposed to represent. For
instance, knowing that an attribute is for people’s names or is a measurement of
people’s height helps considerably when examining the data since in many cases
the analyst can think of typical values, values that may not be correct, etc. Note
that in such cases the name of the attribute (‘name,’ ‘height’) tends to be enough
to point us in the right direction, but even when using meaningful names, this
may not be enough. For instance, an attribute named ‘price’ may give the price
of a product, but this may be before or after taxes, with or without discounts.25
In the case of codes, semantic metadata is especially useful, as many codes tend
to have cryptic names. For instance, in a list of products, a numerical attribute
called FY15 may be an enigma, until we find out that the name refers to ‘Females
Younger than 15’ (and not, say, to ‘Fiscal Year 2015’). The problem with codes
may be in the values themselves; for instance, an attribute Customer Satisfaction
may have values 1–5, which need to be interpreted (while this is a typical ordinal
domain, the metadata should state what each value stands for).
23 In the case of semistructured data, each record may have a different schema, but in most practical
cases the records still share enough structure for a schema to be described.
24 Sometimes the unit and scale can be guessed by looking at some data; however, this will not always be the case.
25 In fact, a product may have many prices in a given dataset, because of these distinctions.
26 Such information is very useful to, for instance, have someone to go and ask questions.
27 Note that recording an action is no guarantee that it can be undone (some actions cannot be
reversed); however, not recording it pretty much guarantees that the action cannot be undone and,
worse, that trying to identify and mitigate its consequences will be almost impossible.
– Completeness: this measures whether any data is missing; it covers both missing values within records and population completeness (do we have all values from the domain?). The latter
is especially important since many datasets are samples from an underlying
population, and often they have been obtained in a process that mixes
opportunistic and random characteristics.
– Consistency: this measures whether the data as a whole is sound; it answers
the question: are there contradictions in or across records? Sometimes a record
may contain inconsistencies; for example, a person record with a name like
“Jim Jones” and “female” for sex; or a record with ‘5’ for age and ‘married’
for marital-status; or an order record where the date when the order has to be
delivered is earlier than the date the order was placed. Sometimes two or more
records may contradict each other (in a dataset of flight information, there
may be two records sharing the same identifier but different destinations).
Inconsistencies in the dataset need to be eliminated before analyzing the data,
as they negatively affect pretty much any type of analysis.
– Timeliness: this refers to value changes along time. We want to know, mainly,
two things: how often data changes (volatility) and how long ago was
data created and/or acquired (currency); the two of them together determine
whether the data is still valid (and, in general, how long it will remain valid).
Volatility refers to the frequency with which data changes (rate of change).
Data can be stable (volatility: 0), long-term (volatility: low), changing
(volatility: high). Currency refers to when data was created, but may also refer
sometimes to when data was acquired for the dataset. For instance, assume a
sensor that takes temperature measurements every hour. Usually, a timestamp
attribute will tell us when the values of the temperature attribute were taken.
Assume, further, that the sensor sends information wirelessly every 24 h to
a database. Thus, the acquisition time is the same for a whole set of data.
Note that data may be current and not timely (for instance, a schedule of
classes is posted after the semester starts); this is because timeliness depends
on both currency and volatility. Note also that if data is updated, this affects
its currency.
– Certainty: how reliable is the data (how much do we trust the source)? For
many measurements, this is not an issue, but in certain analysis it can become
a crucial element. For instance, when analyzing news reports, we may find
out that reports on certain subjects (for instance, sudden, high-impact events)
are highly uncertain.
It is clear that these aspects are related: if the data is uncertain, we cannot estimate
its completeness, but if it is inconsistent, it cannot be certain. Also, accurate data
tends to be certain; it is unlikely (although possible) that we have a very accurate
measurement but are uncertain about its reliability. Thus, it may be tricky to evaluate
all the aspects in isolation. Also, not all aspects apply to all datasets. Thus, one may
want to create only essential metadata for a given project. However, all shortcuts
taken will restrict our ability to reuse and publish the data.
Some authors will consider, beyond what we have outlined here, other aspects
like administrative metadata—a description of rights/licenses, ownership, permis-
sions, regulations that affect the data, etc. This may be advisable for data subject
to legal or ethical rules, for instance, private data that must be kept out of reach of
the general public.
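Although storing and managing metadata is the topic of Sect. 3.5, the following sketch shows one simple, purely illustrative way to keep structural and domain metadata in the database itself, as a table with one row per attribute of a dataset; all names and the example row are made up.

CREATE TABLE metadata (
    dataset     VARCHAR(50),
    attribute   VARCHAR(50),
    domain_type VARCHAR(20),    -- nominal, ordinal, interval, or ratio
    unit        VARCHAR(20),
    min_value   NUMERIC,
    max_value   NUMERIC,
    description VARCHAR(200));

INSERT INTO metadata
VALUES ('people', 'height', 'ratio', 'feet', 4.5, 6.5,
        'Height of the person, measured in feet');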
Exercise 1.8 Generate domain metadata to describe, as best as you can, the
meaning of the attributes in the schema of ny-flights.
When is metadata generated? Ideally, metadata should be created at the acqui-
sition/storage stage, when data is first acquired. This metadata should reflect what
we know about the source of the data and the domains that the data is supposed
to represent, that is, the domain knowledge. Having metadata that describes the
data is very useful for data exploration, cleaning, and pre-processing. In particular,
dealing with missing values and outliers is much easier when we have an idea of
what data is supposed to be like. This initial metadata can be refined by EDA. For
instance, suppose that we are gathering physiological data for a medical study, and
one of the features we have is blood pressure. This is measured with a couple
of numbers that give the systolic and diastolic pressure, which for most people
are around 120 (for systolic) and 70 (for diastolic). Numbers up to 130 and 85
are still considered normal, but numbers above those indicate hypertension (high
blood pressure). Conversely, numbers below 120 and 70 may indicate hypotension
(low blood pressure). Knowing all this can help us determine if some values are
outliers (for instance, numbers like 300 or 20 are probably the result of some
error) or extreme values (for example, numbers like 180 and 110 indicate extreme
hypertension, but are certainly possible). Note that this information comes from the
domain knowledge and needs to be reconciled with what EDA extracts from the
data. For instance, if our dataset is from people with heart disease, it could be the
case that most of them have high blood pressure. Or, if the dataset is about infants
or school-age children, the data will have different values. If EDA finds out that
the metadata is not a good description of the data, we must reconcile the observed
measurements with the assumed interpretation of the data.
During cleaning and pre-processing we may apply several changes to the data
(see Sects. 3.3 and 3.4 for details). These changes should also be incorporated into
the metadata, as already discussed.
As analysis proceeds, the methods used, together with any parameters and
assumptions, should also be recorded. This will contribute to a correct interpretation
of the results.
When it comes to archiving, metadata itself should always be a prime candidate
to be kept. Having provenance metadata may allow us or others to recreate the
original dataset (if not kept) as well as all the work done. Metadata is also the
main tool to determine whether data can be reused for purposes different from the
ones that motivated its collection—in particular, structural and domain metadata
will ensure that the dataset is correctly understood. Finally, metadata tends to be
much smaller than the data itself, and therefore it can be easier to store.
1.5 The Role of Databases in the Cycle
A database can be used in several roles in Data Analysis, depending on the exact
situation. The first scenario is when the data already exists in a database; therefore,
we have to go there to access it. After accessing the data, we have two options:
one is to export the data to files (see Sect. 2.4 for details) and use R or Python or
some other software to do all the work (see Chap. 6 for a brief description of how
R and Python can interact with a database). Another option is to carry out some of
the work (for instance, some EDA, some data cleaning and pre-processing) in the
database and then export only the clean, relevant data to a file and continue with
other tools. When the analysis does not include sophisticated Machine Learning or
Data Mining techniques, we may be able to do all the work inside the database,
avoiding the extra effort to export the data and use a different tool (see Chap. 4).
Even when this is not the case, doing some of the work within the database still
offers a number of advantages over tools like R and Python. First, when the dataset
is very big, some preliminary work may allow us to extract only a part of the data
(either a sample or a carefully chosen subset), which would be beneficial for further
work, as neither R nor Python is particularly suited to work with large datasets.
Second, databases offer strong access control: we can carefully monitor and limit
access to the data, an advantage if there are concerns about data confidentiality.
Third, if the data arrives periodically, or all at once but is later updated (that is, if
data changes at all), the database offers tools to handle this evolution of data that do
not exist in R or Python. Certainly, changes can also be managed at the file level,
especially if one is knowledgeable about the power of the command line [9], but it
is certainly nice to have tools that make life easier (see Sect. 2.4.2 for information
about modifying existing data).
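For example, extracting a carefully chosen subset (or a rough random sample) before exporting can be done with queries like the following sketch; the table and column names are assumptions based on the ny-flights example, and TABLESAMPLE is PostgreSQL-specific syntax.

-- a chosen subset: only 2013 flights leaving from EWR, kept in a new table for export
CREATE TABLE ewr_2013 AS
SELECT * FROM ny_flights
WHERE origin = 'EWR' AND year = 2013;

-- a random sample of roughly 1% of the rows (PostgreSQL)
SELECT * FROM ny_flights TABLESAMPLE BERNOULLI (1);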
In a second scenario, data is to be captured yet and it is decided to use a database
because of the size of the dataset or its complexity. In that case, we start by moving
the data into the database (again, see Sect. 2.4 for details). Once this is done, we are
in the previous scenario, and the same considerations apply.
In many cases right now data exists in files and the whole process takes place
in R or Python. That is, databases are absent from the process. While this is
totally justifiable in many cases (small datasets, complex or ad hoc processing),
this approach is manual-intensive and tends to produce results that are not well
documented and are difficult to reproduce unless the markup tools available in R
and Python are used.
All in all, databases can be a helpful tool in managing data through its life cycle,
many times in combination with other tools. The next chapter goes through the
process of putting data in the database, updating it if needed and exporting it. Later
chapters discuss how to examine, clean, and transform the data for analysis.