Chapter 3
Descriptive Statistics
3.1 Introduction
Descriptive statistics applies the concepts, measures, and terms that are used to
describe the basic features of the samples in a study. These procedures are essential
to provide summaries about the samples as an approximation of the population.
Together with simple graphics, they form the basis of every quantitative analysis of
data. In order to describe the sample data and to be able to infer any conclusion, we
should go through several steps:
1. Data preparation: Given a specific example, we need to prepare the data for
generating statistically valid descriptions.
2. Descriptive statistics: This generates different statistics to describe and summa-
rize the data concisely and evaluate different ways to visualize them.
One of the first tasks when analyzing data is to collect and prepare the data in a format
appropriate for analysis of the samples. The most common steps for data preparation
involve the following operations.
1. Obtaining the data: Data can be read directly from a file or they might be obtained
by scraping the web.
2. Parsing the data: The right parsing procedure depends on what format the data
are in: plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning the data: Survey responses and other data files are almost always incomplete. Sometimes there are multiple codes for things such as "not asked", "did not know", and "declined to answer"; and there are almost always errors. A simple strategy is to remove or ignore incomplete records.
4. Building data structures: Once you read the data, it is necessary to store them in
a data structure that lends itself to the analysis we are interested in. If the data fit
into the memory, building a data structure is usually the way to go. If not, usually
a database is built, which is an out-of-memory data structure. Most databases
provide a mapping from keys to values, so they serve as dictionaries.
Let us consider a public database called the "Adult" dataset, hosted on the UCI's Machine Learning Repository.1 It contains approximately 32,000 observations concerning different financial parameters related to the US population: age, sex, marital (marital status of the individual), country, income (a Boolean variable: whether the person makes more than $50,000 per annum), education (the highest level of education achieved by the individual), occupation, capital gain, etc.
We will show that we can explore the data by asking questions like: Are men
more likely to become high-income professionals than women, i.e., to receive an
income of over $50,000 per annum?
1 https://archive.ics.uci.edu/ml/datasets/Adult.
3.2 Data Preparation
data = []
for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]),
                     chr_int(data1[12]),
                     data1[13], data1[14]])
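The snippet above relies on an open file handle and on a helper chr_int that converts numeric strings to integers; neither is shown in this excerpt. A minimal sketch of both, together with the construction of the DataFrame df used below, might look as follows — the file path and the column names are assumptions based on the dataset's documentation:

import pandas as pd

def chr_int(a):
    # Convert a numeric string to int; return 0 for missing or non-numeric fields.
    return int(a) if a.strip().isdigit() else 0

file = open('adult.data', 'r')
# ... the parsing loop above fills the list `data` ...
df = pd.DataFrame(data)
df.columns = ['age', 'type_employer', 'fnlwgt', 'education',
              'education_num', 'marital', 'occupation', 'relationship',
              'race', 'sex', 'capital_gain', 'capital_loss',
              'hr_per_week', 'country', 'income']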
The shape attribute gives the number of data samples (rows, in this case) and features (columns):
In [4]:
df.shape
Thus, we can see that our dataset contains 32,561 data records with 15 features
each. Let us count the number of items per country:
In [5]:
counts = df.groupby('country').size()
print counts.head()
Out[5]: country
? 583
Cambodia 19
Vietnam 67
Yugoslavia 16
The first row shows the number of samples with unknown country ('?'), followed by the counts for other countries in the dataset.
Let us split people according to their gender into two groups: men and women.
In [6]:
ml = df[(df.sex == 'Male')]
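Later snippets also reference the male and female subsets restricted to high income (ml1, fm1) and the full female subset (fm), which are not shown in this excerpt. Under the same filtering pattern they would presumably be defined as follows (note the trailing newline in the income label, which comes from the raw file):

ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
fm = df[(df.sex == 'Female')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]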
3.3 Exploratory Data Analysis
The data that come from performing a particular measurement on all the subjects in a sample represent our observations for a single characteristic, like country,
age, education, etc. These measurements and categories represent a sample
distribution of the variable, which in turn approximately represents the population
distribution of the variable. One of the main goals of exploratory data analysis is
to visualize and summarize the sample distribution, thereby allowing us to make
tentative assumptions about the population distribution.
The data in general can be categorical or quantitative. For categorical data, a simple
tabulation of the frequency of each category is the best non-graphical exploration
for data analysis. For example, we can ask ourselves what is the proportion of high-
income professionals in our database:
In [8]:
df1 = df[(df.income == '>50K\n')]
print 'The rate of people with high income is: ', int(len(df1)/float(len(df))*100), '%.'
print 'The rate of men with high income is: ', int(len(ml1)/float(len(ml))*100), '%.'
print 'The rate of women with high income is: ', int(len(fm1)/float(len(fm))*100), '%.'
3.3.1.1 Mean
One of the first measurements we use to have a look at the data is to obtain sample
statistics from the data, such as the sample mean [1]. Given a sample of n values,
values, $\{x_i\}, i = 1, \ldots, n$, the mean, $\mu$, is the sum of the values divided by the number of values,2 in other words:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (3.1)$$
The terms mean and average are often used interchangeably. In fact, the main
distinction between them is that the mean of a sample is the summary statistic com-
puted by Eq. (3.1), while an average is not strictly defined and could be one of many
summary statistics that can be chosen to describe the central tendency of a sample.
In our case, we can consider what the average age of the men and the women in our dataset would be in terms of their mean:
2 We will use the following notation: $X$ is a random variable, $\mathbf{x}$ is a column vector, $\mathbf{x}^T$ (the transpose of $\mathbf{x}$) is a row vector, $\mathbf{X}$ is a matrix, and $x_i$ is the i-th element of a dataset.
In [9]:
print 'The average age of men is: ', ml['age'].mean()
print 'The average age of women is: ', fm['age'].mean()

ml_median_age = ml1['age'].median()
fm_median_age = fm1['age'].median()
print "Median age of men and women with high income: ", ml_median_age, fm_median_age
Fig. 3.1 Histogram of the age of working men (left) and women (right)
That value, $x_p$, is the p-th quantile, or the 100p-th percentile. For example, a 5-number summary is defined by the values $x_{min}$, $Q_1$, $Q_2$, $Q_3$, $x_{max}$, where $Q_1$ is the 25th percentile, $Q_2$ is the 50th percentile, and $Q_3$ is the 75th percentile.
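As an illustrative aside (not from the original text), pandas can compute these quantiles directly; a minimal sketch, assuming the DataFrame df from above:

age = df['age']
# 5-number summary: min, Q1, Q2 (the median), Q3, max.
print age.min(), age.quantile(.25), age.quantile(.5), age.quantile(.75), age.max()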
Summarizing data by just looking at their mean, median, and variance can be danger-
ous: very different data can be described by the same statistics. The best thing to do
is to validate the data by inspecting them. We can have a look at the data distribution,
which describes how often each value appears (i.e., what is its frequency).
The most common representation of a distribution is a histogram, which is a graph
that shows the frequency of each value. Let us show the age of working men and
women separately.
In [12]:
ml_age = ml['age']
ml_age.hist(normed=0, histtype='stepfilled', bins=20)
In [13]:
fm_age = fm['age']
fm_age.hist(normed=0, histtype='stepfilled', bins=10)
The output can be seen in Fig. 3.1. If we want to compare the histograms, we can
plot them overlapping in the same graphic as follows:
Fig. 3.2 Histogram of the age of working men (in ochre) and women (in violet) (left). Histogram of
the age of working men (in ochre), women (in blue), and their intersection (in violet) after sample normalization (right)
In [14]:
import seaborn as sns
fm_age.hist(normed=0, histtype='stepfilled', alpha=.5, bins=20)
ml_age.hist(normed=0, histtype='stepfilled', alpha=.5,
            color=sns.desaturate("indianred", .75),
            bins=10)
The output can be seen in Fig. 3.2 (left). Note that we are visualizing the absolute
values of the number of people in our dataset according to their age (the abscissa of
the histogram). As a side effect, we can see that there are many more men in these
conditions than women.
We can normalize the frequencies of the histogram by dividing by n, the number of samples. The normalized histogram is then called the Probability Mass Function (PMF).
In [15]:
fm_age.hist(normed=1, histtype='stepfilled', alpha=.5, bins=20)
ml_age.hist(normed=1, histtype='stepfilled', alpha=.5, bins=10,
            color=sns.desaturate("indianred", .75))
This outputs Fig. 3.2 (right), where we can observe a comparable range of indi-
viduals (men and women).
The Cumulative Distribution Function (CDF), or just distribution function,
describes the probability that a real-valued random variable X with a given proba-
bility distribution will be found to have a value less than or equal to x. Let us show the CDF of the age distribution for both men and women.
In [16]:
ml_age.hist(normed=1, histtype='step', cumulative=True,
            linewidth=3.5, bins=20)
fm_age.hist(normed=1, histtype='step', cumulative=True,
            linewidth=3.5, bins=20,
            color=sns.desaturate("indianred", .75))
The output can be seen in Fig. 3.3, which illustrates the CDF of the age distributions
for both men and women.
As mentioned before, outliers are data samples with a value that is far from the central tendency. Different rules can be defined to detect outliers; for example, we can flag samples that are far from the median, or samples whose values exceed the mean by two or three standard deviations.
For example, in our case, we are interested in the age statistics of men versus
women with high incomes and we can see that in our dataset, the minimum age is 17
years and the maximum is 90 years. We can consider that some of these samples are due to errors or are not representative. Applying domain knowledge, we keep the ages between 22 (the median age of 37 minus 15 years) and 72 (the median plus 35 years), and we consider the rest as outliers.
In [17]:
# Drop the high-income samples whose age falls outside
# [median - 15, median + 35], i.e., keep ages between 22 and 72.
df2 = df.drop(df.index[
    (df.income == '>50K\n') &
    ((df['age'] > df['age'].median() + 35) |
     (df['age'] < df['age'].median() - 15))
])
ml1_age = ml1['age']
fm1_age = fm1['age']
ml2_age = ml1_age.drop(ml1_age.index[
    (ml1_age > df['age'].median() + 35) |
    (ml1_age < df['age'].median() - 15)
])
fm2_age = fm1_age.drop(fm1_age.index[
    (fm1_age > df['age'].median() + 35) |
    (fm1_age < df['age'].median() - 15)
])
We can check how the mean and the median changed once the data were cleaned:
In [18]:
mu2ml = ml2_age.mean()
std2ml = ml2_age.std()
md2ml = ml2_age.median()
mu2fm = fm2_age.mean()
std2fm = fm2_age.std()
md2fm = fm2_age.median()
Fig. 3.4 The cleaned data (in red) and the considered outliers (in blue)
Figure 3.4 shows the outliers in blue and the rest of the data in red. Visually, we
can confirm that we removed mainly outliers from the dataset.
Next we can see that, by removing the outliers, the difference between the populations (men and women) actually decreased. In our case, there were more outliers among men than women. The difference in the mean values is 2.5 years before removing the outliers, and decreases slightly to 2.44 afterwards:
In [20]:
print 'The mean difference with outliers is: %4.2f.' % (ml_age.mean() - fm_age.mean())
print 'The mean difference without outliers is: %4.2f.' % (ml2_age.mean() - fm2_age.mean())

# Plot the difference between the per-age-bin counts of high-income men and
# women; countx/county and the shared bin edges divisionx are assumed to come
# from np.histogram (numpy as np and matplotlib.pyplot as plt imported earlier):
countx, divisionx = np.histogram(ml2_age, bins=20)
county, divisiony = np.histogram(fm2_age, bins=divisionx)  # reuse the same edges
val = [(divisionx[i] + divisionx[i+1])/2 for i in range(len(divisionx) - 1)]
plt.plot(val, countx - county, 'o-')
The results are shown in Fig. 3.5. One can see that the differences between male
and female values are slightly negative before age 42 and positive after it. Hence,
women tend to be promoted (receive more than 50 K) earlier than men.
Fig. 3.5 Differences in high-income earner men versus women as a function of age
For univariate data, the formula for skewness is a statistic that measures the asymmetry of the set of n data samples, $x_i$:

$$g_1 = \frac{\frac{1}{n} \sum_i (x_i - \mu)^3}{\sigma^3}, \qquad (3.3)$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and n is the number of data points. Negative skewness indicates that the distribution skews left (it extends further to the left than to the right). One can easily see that the skewness of a normal distribution is zero, and any symmetric data must have a skewness of zero. Note that skewness can be affected by outliers! A simpler alternative is to look at the relationship between the mean, $\mu$, and the median, $x_{1/2}$.
In [22]:
def skewness(x):
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
        res += (i - m) * (i - m) * (i - m)
    res /= (len(x) * s * s * s)
    return res
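As a sketch of the mean–median alternative mentioned above (this helper is an illustrative addition, not part of the original text), Pearson's second skewness coefficient compares the mean and the median in units of the standard deviation; x is assumed to be a pandas Series such as ml2_age:

def pearson_median_skewness(x):
    # 3(mean - median)/std: positive when the mean exceeds the median,
    # i.e., when the distribution has a longer right tail.
    return 3 * (x.mean() - x.median()) / x.std()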
3.3.4.1 Discussions
After exploring the data, we obtained some apparent effects that seem to support our initial assumptions. For example, the mean age for men in our dataset is 39.4 years, while for women it is 36.8 years. When analyzing the high-income salaries, the mean age for men increases to 44.6 years, while for women it increases to 42.1 years. When the data were cleaned of outliers, we obtained a mean age for high-income men of 44.3 years, and for women of 41.8 years. Moreover, histograms and other statistics show the skewness of the data and the fact that women tend to be promoted a little earlier than men, in general.
Fig. 3.6 Exponential CDF (left) and PDF (right) with $\lambda = 3.00$
The CDF of a continuous random variable X is defined as $F_X(x)$, which satisfies $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$ for all x, where $f_X(t)$ is the probability density function (PDF). There are many continuous distributions; here, we will consider the most common ones: the exponential and the normal distributions.
For the exponential distribution:

$$CDF(x) = 1 - e^{-\lambda x}, \qquad PDF(x) = \lambda e^{-\lambda x}.$$

The parameter $\lambda$ defines the shape of the distribution. An example is given in Fig. 3.6. It is easy to show that the mean of the distribution is $\frac{1}{\lambda}$, the variance is $\frac{1}{\lambda^2}$, and the median is $\frac{\ln(2)}{\lambda}$.
Note that for a small number of samples, it is difficult to see that the exact empirical
distribution fits a continuous distribution. The best way to observe this match is to
generate samples from the continuous distribution and see if these samples match
the data. As an exercise, you can consider the birthdays of a large enough group of
people, sorting them and computing the inter-arrival time in days. If you plot the
CDF of the inter-arrival times, you will observe the exponential distribution.
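A small sketch of this kind of check (an illustrative addition; the sample size and the rate are arbitrary choices) generates exponential samples with NumPy and compares their empirical CDF against the analytic form:

import numpy as np
import matplotlib.pyplot as plt

lam = 3.0
samples = np.sort(np.random.exponential(scale=1.0/lam, size=1000))
ecdf = np.arange(1, len(samples) + 1) / float(len(samples))  # empirical CDF
plt.step(samples, ecdf, label='empirical CDF')
plt.plot(samples, 1 - np.exp(-lam * samples), 'r--', label='analytic CDF')
plt.legend()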
There are a lot of real-world events that can be described with this distribution,
including the time until a radioactive particle decays; the time it takes before your
next telephone call; and the time until default (on payment to company debt holders)
in reduced-form credit risk modeling. The random variable X of the lifetime of some
batteries is associated with a probability density function of the form: $PDF(x) = \frac{1}{4}\, e^{-x/4}$.
The normal CDF has no closed-form expression and its most common representation is the PDF:

$$PDF(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
The parameter $\sigma$ defines the shape of the distribution. An example of the PDF of a normal distribution with $\mu = 6$ and $\sigma = 2$ is given in Fig. 3.7.
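As an illustrative sketch (not part of the original text), a curve like the one in the figure can be reproduced with scipy.stats.norm, using the values of mu and sigma quoted above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

u, s = 6.0, 2.0  # mean and standard deviation
x = np.linspace(u - 4*s, u + 4*s, 200)
plt.plot(x, norm.pdf(x, loc=u, scale=s), 'r-')  # normal PDF with mu = 6, sigma = 2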
Fig. 3.8 Summed kernel functions around a random set of points (left) and the kernel density
estimate with the optimal bandwidth (right) for our dataset. Random data shown in blue, kernel
shown in black and summed function shown in red
Placing a kernel function (e.g., a Gaussian) on each data point and summing them, we obtain a continuous function that, when normalized, would approximate the density of the distribution:
In [24]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x1 = np.random.normal(-1, 0.5, 15)
x2 = np.random.normal(6, 1, 10)
y = np.r_[x1, x2]  # r_ translates slice objects to concatenation along the first axis.
x = np.linspace(min(y), max(y), 100)
s = 0.4  # Smoothing parameter
# Calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1), 'r')
plt.plot(y, np.zeros(len(y)), 'bo', ms=10)
Figure 3.8 (left) shows the result of constructing the continuous function from the kernel summation.
In fact, the library SciPy3 implements a Gaussian kernel density estimation that
automatically chooses the appropriate bandwidth parameter for the kernel. Thus, the
final construction of the density estimate will be obtained by:
3 http://www.scipy.org.
In [25]:
from scipy.stats import kde
density = kde.gaussian_kde(y)
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins=28, normed=True)
plt.plot(xgrid, density(xgrid), 'r-')
Figure 3.8 (right) shows the result of the kernel density estimate for our example.
3.4 Estimation
An important aspect when working with statistical data is being able to use estimates
to approximate the values of unknown parameters of the dataset. In this section, we
will review different kinds of estimators (estimated mean, variance, standard score,
etc.).
In what follows, we will deal with point estimators: single numerical estimates of parameters of a population.
3.4.1.1 Mean
Let us assume that we know that our data are coming from a normal distribution and
the random samples drawn are as follows:
{0.33, -1.76, 2.34, 0.56, 0.89}.
The question is can we guess the mean of the distribution? One approximation is given by the sample mean, $\bar{x}$. This process is called estimation and the statistic (e.g., the sample mean) is called an estimator. In our case, the sample mean is 0.472, and it seems a logical choice to represent the mean of the distribution. It is not so evident if we add a sample with a value of -465. In this case, the sample mean becomes -77.11, which does not look like the mean of the distribution. The reason is that the last value seems to be an outlier compared to the rest of the sample. In order to avoid this effect, we can try first to remove outliers and then to estimate the mean; or we can use the sample median as an estimator of the mean of the distribution.
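A quick numerical check of this effect, using the sample above (an illustrative addition):

import numpy as np

x = np.array([0.33, -1.76, 2.34, 0.56, 0.89])
print np.mean(x)               # 0.472
x_out = np.append(x, -465.0)   # add the outlier
print np.mean(x_out)           # about -77.11: dragged away by the outlier
print np.median(x_out)         # about 0.445: barely affected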
If there are no outliers, the sample mean $\bar{x}$ minimizes the following mean squared error:

$$MSE = \frac{1}{n} \sum (\bar{x} - \mu)^2,$$

where n is the number of times we estimate the mean.
Let us compute the MSE of a set of random data:
In [26]:
NTs = 200   # number of trials
mu = 0.0
var = 1.0   # standard deviation of the sampled normal (equal to the variance here)
err = 0.0
NPs = 1000  # number of points per trial
for i in range(NTs):
    x = np.random.normal(mu, var, NPs)
    err += (x.mean() - mu)**2
print 'MSE: ', err/NTs
3.4.1.2 Variance
If we ask ourselves what is the variance, $\sigma^2$, of the distribution of X, analogously we can use the sample variance as an estimator. Let us denote by $\hat{\sigma}^2$ the sample variance estimator:

$$\hat{\sigma}^2 = \frac{1}{n} \sum (x_i - \bar{x})^2.$$

For large samples, this estimator works well, but for a small number of samples it is biased. In those cases, a better estimator is given by:

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2.$$
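In NumPy, both estimators are available through the ddof parameter of np.var (an illustrative aside, not from the original text):

import numpy as np

x = np.random.normal(0.0, 1.0, 10)  # a small sample
print np.var(x, ddof=0)             # biased estimator: divides by n
print np.var(x, ddof=1)             # unbiased estimator: divides by n - 1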
Variables of data can express relations. For example, countries that tend to invest in
research also tend to invest more in education and health. This kind of relationship
is captured by the covariance.
Fig. 3.9 Positive correlation between economic growth and stock market returns worldwide (left).
Negative correlation between the world oil production and gasoline prices worldwide (right)
3.4.2.1 Covariance
When two variables share the same tendency, we speak about covariance. Let us consider two series, $\{x_i\}$ and $\{y_i\}$. Let us center the data with respect to their means: $dx_i = x_i - \bar{X}$ and $dy_i = y_i - \bar{Y}$. It is easy to show that when $\{x_i\}$ and $\{y_i\}$ vary together, their deviations tend to have the same sign. The covariance is defined as the mean of the following products:

$$Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} dx_i\, dy_i,$$
where n is the length of both sets. Still, the covariance itself is hard to interpret, since its magnitude depends on the units of the variables. Pearson's correlation coefficient, $\rho$, normalizes the covariance by the product of the two standard deviations; however, having $\rho = 0$ does not necessarily mean that the variables are not correlated! Pearson's correlation captures correlations of first order, but not nonlinear correlations. Moreover, it does not work well in the presence of outliers.
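A minimal sketch of these definitions in NumPy/SciPy (an illustrative addition; the data are synthetic):

import numpy as np
from scipy.stats import pearsonr

x = np.random.normal(0.0, 1.0, 100)
y = 2*x + np.random.normal(0.0, 0.5, 100)  # linearly related to x, plus noise
dx, dy = x - x.mean(), y - y.mean()
print (dx * dy).mean()                # covariance: mean of the deviation products
print np.cov(x, y, bias=True)[0, 1]   # the same value from NumPy
print pearsonr(x, y)[0]               # Pearson's correlation, close to 1 here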
In the presence of outliers, Pearson's coefficient may fail to capture the true correlation between the sets. However, the Spearman's rank coefficient, capturing the correlation between the ranks, gives a final value of 0.80, confirming the correlation between the sets. As an exercise, you can compute the Pearson's and the Spearman's rank correlations for the different Anscombe configurations given in Fig. 3.10. Observe whether linear and nonlinear correlations can be captured by the Pearson's and the Spearman's rank correlations.
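The outlier effect discussed above can be illustrated with synthetic data (an illustrative addition; the exact values will differ from the 0.80 quoted in the text):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(50, dtype=float)
y = x + np.random.normal(0.0, 1.0, 50)
y[-1] = -500.0            # inject a single extreme outlier
print pearsonr(x, y)[0]   # strongly distorted by the outlier
print spearmanr(x, y)[0]  # rank-based, so it remains high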
3.5 Conclusions
In this chapter, we have familiarized ourselves with the basic concepts and procedures
of descriptive statistics to explore a dataset. As we have seen, it helps us to understand
the experiment or a dataset in detail and allows us to put the data in perspective. We
introduced measures of central tendency, such as the sample mean and median;
and measures of variability such as the variance and standard deviation. We have also
discussed how these measures can be affected by outliers. In order to go deeper into
visualizing the dataset, we have introduced histograms, quantiles, and percentiles.
In many situations, when the values are continuous variables, it is convenient to
use continuous distributions; the most common of which are the normal and the
exponential distributions. The advantage of most continuous distributions is that
we can have an explicit expression for their PDF and CDF, as well as the mean
and variance in terms of a closed formula. Also, we learned how, by using the
kernel density, we can obtain a continuous representation of the sample distribution.
Finally, we discussed how to estimate the correlation and the covariance of datasets,
where two of the most popular measures are the Pearson's and the Spearman's rank
correlations, which are affected in different ways by the outliers of the dataset.
Acknowledgements This chapter was co-written by Petia Radeva and Laura Igual.
References
1. A. B. Downey, Probability and Statistics for Programmers (O'Reilly Media, 2011). ISBN-10: 1449307116
2. Probability Distributions: Discrete vs. Continuous, http://stattrek.com/probability-distributions/discrete-continuous.aspx