Data Exploration and Visualization
1.15. Cross-Tabulations
UNIT II
2.7. How to manually add a legend with a color box on a Matplotlib figure?
2.9. Subplots
2.11. Customization
UNIT III
3.4. Inequality
3.5. Smoothing
UNIT IV
4.6. Transformations
UNIT V
5.9. Grouping
5.10. Resampling
EDA Fundamentals
Understanding Data Science
Significance of EDA
Making Sense of Data
Comparing EDA with Classical and Bayesian Analysis
Software Tools for EDA
Visual Aids for EDA
Data Transformation Techniques
Merging Database, Reshaping and Pivoting
Transformation Techniques
Grouping Datasets
Data Aggregation
Pivot Tables and Cross-Tabulations
UNIT I
EXPLORATORY DATA ANALYSIS
The main takeaway here is the stages of EDA. Let's understand in brief what these
stages are:
Data requirements: There can be various sources of data for an organization. It
is important to comprehend what type of data is required for the organization to be
collected, curated, and stored.
Data collection: Data collected from several sources must be stored in the correct
format and transferred to the right information technology personnel within a
company. As mentioned previously, data can be collected from several objects on
several events using different types of sensors and storage tools.
Data processing: Preprocessing involves the process of pre-curating the dataset
before actual analysis. Common tasks involve correctly exporting the dataset, placing
them under the right tables, structuring them, and exporting them in the correct
format.
Data cleaning: Preprocessed data must be correctly transformed for an
incompleteness check, duplicates check, error check, and missing value check. These
tasks are performed in the data cleaning stage, which involves matching the correct
record, finding inaccuracies in the dataset, understanding the overall data quality,
removing duplicate items, and filling in the missing values.
Modeling and algorithm: From a data science perspective, generalized models
or mathematical formulas can represent or exhibit relationships among different
variables, such as correlation or causation. These models or equations involve one or
more variables that depend on other variables to cause an event.
Data Product: Any computer software that uses data as inputs, produces outputs,
and provides feedback based on the output to control the environment is referred to
as a data product. A data product is generally based on a model developed during
data analysis, for example, a recommendation model that inputs user purchase
history and recommends a related item that the user is highly likely to buy.
Communication: This stage deals with disseminating the results to end
stakeholders to use the result for business intelligence. One of the most notable steps
in this stage is data visualization. Visualization deals with information relay
techniques such as tables, charts, summary diagrams, and bar charts to show the
analyzed result.
Exploratory data analysis is key, and usually the first exercise in data mining. It
allows us to visualize data to understand it as well as to create hypotheses for further
analysis. The exploratory analysis centers around creating a synopsis of data or
insights for the next steps in a data mining project.
Steps in EDA
Problem definition: The problem definition works as the driving force for a data
analysis plan execution. The main tasks involved in problem definition are defining
the main objective of the analysis, defining the main deliverables, outlining the main
roles and responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis.
Data preparation: This step involves methods for preparing the dataset before
actual analysis. In this step, we define the sources of data, define data schemas and
tables, understand the main characteristics of the data, clean the dataset, delete non-
relevant datasets, transform the data, and divide the data into required chunks for
analysis.
Data analysis: This is one of the most crucial steps that deals with descriptive
statistics and analysis of the data. The main tasks involve summarizing the data,
finding the hidden correlation and relationships among the data, developing
predictive models, evaluating the models, and calculating the accuracies. Some of the
techniques used for data summarization are summary tables, graphs, descriptive
statistics, inferential statistics, correlation statistics, searching and grouping.
Development and representation of the results: This step involves presenting
the dataset to the target audience in the form of graphs, summary tables, maps, and
diagrams. This is also an essential step, as the result analyzed from the dataset should
be interpretable by the business stakeholders, which is one of the major goals of
EDA.
Numerical Data
This data has a sense of measurement involved in it; for example, a person's age,
height, weight, blood pressure, heart rate, temperature, number of teeth, number of
bones, and the number of family members. This data is often referred to as
quantitative data in statistics. The numerical dataset can be either discrete or
continuous types.
Discrete Data
This is data that is countable and its values can be listed out. For example, if we
flip a coin, the number of heads in 200 coin flips can take values from 0 to 200
(finite) cases. A variable that represents a discrete dataset is referred to as a discrete
variable. The discrete variable takes a fixed number of distinct values. For example,
the Country variable can have values such as Nepal, India, Norway, and Japan. It is
fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4,
5, and so on.
Continuous Data
A variable that can have an infinite number of numerical values within a specific
range is classified as continuous data. A variable describing continuous data is a
continuous variable.
Categorical Data
This type of data represents the characteristics of an object; for example, gender,
marital status, type of address, or categories of the movies. This data is often referred
to as qualitative datasets in statistics.
A variable describing categorical data is referred to as a categorical variable.
These types of variables can have one of a limited number of values. There are
different types of categorical variables:
A binary categorical variable can take exactly two values and is also referred to as
a dichotomous variable. For example, when you create an experiment, the result is
either success or failure. Hence, results can be understood as a binary categorical
variable.
Polytomous variables are categorical variables that can take more than two
possible values. For example, marital status can have several values, such as annulled,
divorced, interlocutory, legally separated, married, polygamous, never married,
domestic partners, unmarried, widowed, domestic partner, and unknown. Since
marital status can take more than two possible values, it is a polytomous variable.
Fig. 1.1.
There are several software tools that are available to facilitate EDA.
NumPy
For importing numpy, we will use the following code:
import numpy as np
For NumPy arrays and file operations, we will use the following code:
# Save a numpy array into a file
x = np.arange(0.0, 50.0, 1.0)
np.savetxt('data.out', x, delimiter=',')
# Loading numpy array from text
z = np.loadtxt('data.out', unpack=True)
print(z)
# Loading numpy array using genfromtxt method
my_array2 = np.genfromtxt('data.out', skip_header=1, filling_values=-999)
print(my_array2)
For inspecting NumPy arrays, we will use the following code:
# my2DArray is assumed to be a 2D array, for example:
my2DArray = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# Print the number of `my2DArray`'s dimensions
print(my2DArray.ndim)
# Print the number of `my2DArray`'s elements
print(my2DArray.size)
# Print information about `my2DArray`'s memory layout
print(my2DArray.flags)
# Print the length of one array element in bytes
print(my2DArray.itemsize)
# Print the total consumed bytes by `my2DArray`'s elements
print(my2DArray.nbytes)
Pandas
1. Use the following to set default parameters:
import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
2. Next, let's create a small dataframe (shown here in abbreviated form; the original listing defined several more columns):
series_df = pd.DataFrame({
    'B': pd.Timestamp('20190526'),  # abbreviated example; the index is assumed
}, index=range(4))
print(series_df)
3. Now, let's load the adult dataset from the UCI repository (the column names are reconstructed from the df.info() output shown later):
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'ethnicity',
           'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
           'country_of_origin', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=columns)
df.head(10)
If you run the preceding cell, you should get an output similar to the following
screenshot:
   age  workclass         fnlwgt  education  education_num  marital_status         occupation         relationship   ethnicity  gender  capital_gain  capital_loss  hours_per_week
0  39   State-gov         77516   Bachelors  13             Never-married          Adm-clerical       Not-in-family  White      Male    2174          0             40
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse     Exec-managerial    Husband        White      Male    0             0             13
2  38   Private           215646  HS-grad    9              Divorced               Handlers-cleaners  Not-in-family  White      Male    0             0             40
3  53   Private           234721  11th       7              Married-civ-spouse     Handlers-cleaners  Husband        Black      Male    0             0             40
4  28   Private           338409  Bachelors  13             Married-civ-spouse     Prof-specialty     Wife           Black      Female  0             0             40
5  37   Private           284582  Masters    14             Married-civ-spouse     Exec-managerial    Wife           White      Female  0             0             40
6  49   Private           160187  9th        5              Married-spouse-absent  Other-service      Not-in-family  Black      Female  0             0             16
7  52   Self-emp-not-inc  209642  HS-grad    9              Married-civ-spouse     Exec-managerial    Husband        White      Male    0             0             45
8  31   Private           45781   Masters    14             Never-married          Prof-specialty     Not-in-family  White      Female  14084         0             50
9  42   Private           159449  Bachelors  13             Married-civ-spouse     Exec-managerial    Husband        White      Male    5178          0             40
4. The following code displays the rows, columns, data types, and memory
used by the dataframe:
df.info()
The output of the preceding code snippet should be similar to thefollowing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age                  32561 non-null int64
workclass            32561 non-null object
fnlwgt               32561 non-null int64
education            32561 non-null object
education_num        32561 non-null int64
marital_status       32561 non-null object
occupation           32561 non-null object
relationship         32561 non-null object
ethnicity            32561 non-null object
gender               32561 non-null object
capital_gain         32561 non-null int64
capital_loss         32561 non-null int64
hours_per_week       32561 non-null int64
country_of_origin    32561 non-null object
income               32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
5. Let's now see how we can select rows and columns in any dataframe:
# Selects a row
df.iloc[10]
# Selects 10 rows
df.iloc[0:10]
# Selects a range of rows
df.iloc[10:15]
# Selects the last 2 rows
df.iloc[-2:]
# Selects every other row in columns 3-5
df.iloc[::2, 3:5].head()
6. Let's combine NumPy and pandas to create a dataframe as follows:
import pandas as pd
import numpy as np
np.random.seed(24)
dFrame = pd.DataFrame({'F': np.linspace(1, 10, 10)})
dFrame = pd.concat([dFrame, pd.DataFrame(np.random.randn(10, 5), columns=list('EDCBA'))], axis=1)
dFrame.iloc[0, 2] = np.nan
dFrame
7. Let's style this table using a custom rule. If a value is greater than
zero, we change the color to black (the default color); if the value is less
than zero, we change the color to red; and finally, everything else is
colored green. Let's define a Python function to accomplish that:
# Define a function that colors values according to their sign
def colorNegativeValueToRed(value):
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'black'
    else:
        color = 'green'
    return 'color: %s' % color
8. Now, let's pass this function to the dataframe. We can do this by using
the style method provided by pandas:
s = dFrame.style.applymap(colorNegativeValueToRed, subset=['A', 'B', 'C', 'D', 'E'])
s
It should display a colored dataframe as shown in the followingscreenshot:
     F    E           D            C           B          A
0    1    1.32921     nan          -0.31628    -0.99081   -1.07082
1    2    -1.43871    0.564417     0.295722    -1.6264    0.219565
2    3    0.678805    1.88927      0.961538    0.104011   -0.481165
3    4    0.850229    1.45342      1.05774     0.165562   0.515018
4    5    -1.33694    0.562861     1.39285     -0.063328  0.121668
5    6    1.2076      -0.00204021  1.6278      0.354493   1.03753
6    7    -0.385684   0.519818     1.68658     -1.32596   1.42898
7    8    -2.08935    -0.12982     0.631523    -0.586538  0.29072
8    9    1.2641      0.290035     -1.97029    0.803906   1.03055
9    10   0.118098    -0.0218533   0.0468407   -1.62875   -0.392361
It should be noted that the applymap and apply methods are computationally
expensive, as they are applied to every value inside the dataframe, so they can take
some time to execute.
9. Now, let's go one step deeper. We want to scan each column and
highlight the maximum value and the minimum value in that column:
def highlightMax(s):
    isMax = s == s.max()
    return ['background-color: orange' if v else '' for v in isMax]

def highlightMin(s):
    isMin = s == s.min()
    return ['background-color: green' if v else '' for v in isMin]
We apply these two functions to the dataframe as follows:
dFrame.style.apply(highlightMax).apply(highlightMin).highlight_null(null_color='red')
     F    E           D           C          B          A
0    1    1.32921     nan         -0.31628   -0.99081   -1.07082
1    2    -1.43871    0.564417    0.295722   -1.6264    0.219565
SciPy
SciPy is a scientific library for Python and is open source. We are going to use this
library in the upcoming chapters. This library depends on the NumPy library, which
provides an efficient n-dimensional array manipulation function. If you want to get
started early, check for scipy.stats from the SciPy library.
Matplotlib
Matplotlib provides a huge library of customizable plots, along with a
comprehensive set of backends. It can be utilized to create professional reporting
applications, interactive analytical applications, complex dashboard applications,
web/GUI applications, and embedded views.
As data scientists, two important goals in our work would be to extract knowledge
from the data and to present the data to stakeholders. Presenting results to
stakeholders is very complex in the sense that our audience may not have enough
technical know-how to understand programming jargon and other technicalities.
Hence, visual aids are very useful tools.
Line Chart
We have created a function using the faker Python library to generate the dataset.
It is the simplest possible dataset you can imagine, with just two columns. The first
column is Date and the second column is Price.
My generateData function is defined here:
import datetime
import math
import pandas as pd
import random
import radar
from faker import Faker

fake = Faker()

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        date = radar.random_datetime(start='2019-08-1', stop='2019-08-30').strftime("%Y-%m-%d")
Date Price
2019-08-01 999.598900
2019-08-02 957.870150
2019-08-04 978.674200
2019-08-05 963.380375
2019-08-06 978.092900
2019-08-07 987.847700
2019-08-08 952.669900
2019-08-10 973.929400
2019-08-13 971.485600
2019-08-14 977.036200
Steps involved
Let's look at the process of creating the line chart:
1. Load and prepare the dataset.
2. Import the matplotlib library. It can be done with this command:
import matplotlib.pyplot as plt
3. Plot the graph:
plt.plot(df)
4. Display it on the screen:
plt.show()
Here is the code if we put it all together:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (14, 10)
plt.plot(df)
plt.show()
And the plotted graph looks something like this:
Fig. 1.2.
Bar Charts
This is one of the most common types of visualization that almost everyone must
have encountered. Bars can be drawn horizontally or vertically to represent
categorical variables.
Bar charts are frequently used to distinguish objects between distinct collections
in order to track variations over time. In most cases, bar charts are very convenient
when the changes are large. In order to learn about bar charts, let's assume a
pharmacy in Norway keeps track of the amount of Zoloft sold every month. Zoloft
is a medicine prescribed to patients suffering from depression. We can use the
calendar module to label the months and annotate each bar with its height, as in
the sketch below.
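A minimal sketch of such a bar chart (the original listing is not reproduced in full here; the monthly sales figures are illustrative):
import calendar
import random
import matplotlib.pyplot as plt

# Illustrative monthly sales figures (the real numbers are not given in the text)
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]

figure, axis = plt.subplots()
plt.xticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.bar(months, sold_quantity)

# Annotate each bar with its value, using the rectangle height for placement
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2, height + 0.5,
              '%d' % int(height), ha='center', va='bottom')

plt.show()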
Fig. 1.3.
Scatter Plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinates system to display values of typically
two variables for a set of data.
When should we use a scatter plot? Scatter plots can be constructed in the
following two situations:
When one continuous variable is dependent on another variable, which is under
the control of the observer
When both continuous variables are independent
Fig. 1.4.
Pie Chart
This is one of the more interesting types of data visualization graphs.
There are two ways in which you can load the data: first, directly from the GitHub
URL; or you can download the dataset from the GitHub and reference it from your
local machine by providing the correct path. In either case, you can use the read_csv
method from the pandas library. Check out the following snippet:
# Create URL to CSV file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'
# Load the CSV file into a data frame
pokemon = pd.read_csv(url, index_col='type')
pokemon
The preceding code snippet should display the dataframe as follows:
Type Amount
Bug 45
Dark 16
Dragon 12
Electric 7
Fighting 3
Fire 14
Ghost 10
Grass 31
Ground 17
Ice 11
Normal 29
Poison 11
Psychic 9
Rock 24
Steel 13
Water 45
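The pie chart itself can then be drawn from this dataframe; a minimal sketch (the original plotting code is not reproduced here, and the column holding the counts is assumed to be named 'amount' in the CSV):
import matplotlib.pyplot as plt

plt.pie(pokemon['amount'], labels=pokemon.index, autopct='%1.1f%%', startangle=90)
plt.axis('equal')   # keep the pie circular
plt.show()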
Fig. 1.5.
Histogram
Histogram plots are used to depict the distribution of any continuous variable.
These types of plots are very popular in statistical analysis.
Consider the following use cases. A survey created in vocational training sessions
of developers had 100 participants. They had several years of Python programming
experience ranging from 0 to 20.
Let's import the required libraries and create the dataset:
import numpy as np
import matplotlib.pyplot as plt
#Create data set
yearsOfExperience = np.array([10, 16, 14, 5, 10, 11, 16, 14, 3, 14, 13, 19, 2, 5,
7, 3, 20,11, 11, 14, 2, 20, 15, 11, 1, 15, 15, 15, 2, 9, 18, 1, 17, 18,
13, 9, 20, 13, 17, 13, 15, 17, 10, 2, 11, 8, 5, 19, 2, 4, 9,
17, 16, 13, 18, 5, 7, 18, 15, 20, 2, 7, 0, 4, 14, 1, 14, 18,
8, 11, 12, 2, 9, 7, 11, 2, 6, 15, 2, 14, 13, 4, 6, 15, 3,
6, 10, 2, 11, 0, 18, 0, 13, 16, 18, 5, 14, 7, 14, 18])
yearsOfExperience
In order to plot the histogram chart, execute the following steps:
1. Plot the distribution of group experience:
nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins)
2. Add labels to the axes and a title:
plt.xlabel("Years of experience with Python Programming")plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the vocational training
session")
3. Draw a green vertical line in the graph at the average experience:
plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color = 'g')
4. Display the plot:
plt.show()
The preceding code generates the following histogram:
Fig. 1.6.
Although there are a lot of objects returned by the extracted data, we do not need
all the items. We will only extract the required fields. Data cleansing is one of the
essential steps in the data analysis phase. For our analysis, all we need is data for the
following: subject, from, date, to, label, and thread.
Data Cleansing
Let's create a CSV file with only the required fields. Let's start with the following
steps:
1. Import the csv package:
import csv
2. Create a CSV file with only the required attributes:
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID']
        ])
Loading the CSV File
We will load the CSV file. Refer to the following code block:
dfs = pd.read_csv('mailbox.csv', names=['subject', 'from', 'date', 'to', 'label','thread'])
The preceding code will generate a pandas dataframe with only the required fields
stored in the CSV file.
Converting the Date
Next, we will convert the date.
Check the datatypes of each column as shown here:
dfs.dtypes
The output of the preceding code is as follows:
subject     object
from        object
date        object
to          object
label       object
thread      float64
dtype: object
Note that the date field is an object. So, we need to convert it into a datetime
type. In the next step, we are going to convert the date field into an actual
datetime value. We can do this by using the pandas to_datetime() method. See
the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce',utc=True))
Removing NaN Values
Next, we are going to remove NaN values from the field. We can do this as
follows:
dfs = dfs[dfs['date'].notna()]
Next, it is good to save the preprocessed file into a separate CSV file in case we
need it again. We can save the dataframe into a separate CSV file as follows:
dfs.to_csv('gmail.csv')
Data Refactoring
We noticed that the from field contains more information than we need. We just
need to extract an email address from that field. Let's do some refactoring:
1. First of all, import the regular expression package:
import re
2. Next, let's create a function that takes an entire string from any column and
extracts an email address:
def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan
3. Next, let's apply the function to the from column:
dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))
We used the lambda function to apply the function to each and every
value in the column.
4. Next, we are going to refactor the label field. The logic is simple. If an
email is from your email address, then it is a sent email. Otherwise, it is
a received email, that is, an inbox email:
myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
Dropping Columns
Let's drop a column:
1. Note that the to column only contains your own email. So, we can drop
this irrelevant column:
dfs.drop(columns='to', inplace=True)
2. This drops the to column from the dataframe. Let's display the first 10
entries now:
dfs.head(10)
Data Analysis
This is the most important part of EDA. This is the part where we gain insights
from the data that we have.
Let's answer the following questions one by one:
1. How many emails did I send during a given timeframe?
2. At what times of the day do I send and receive emails with Gmail?
3. What is the average number of emails per day?
4. What is the average number of emails per hour?
5. What am I mostly emailing about?
Number of emails
The answer to the first question, "How many emails did I send during a given
timeframe?", can be answered as shown here:
# Assumes the date column has been set as the dataframe index beforehand
# (for example, with dfs.index = dfs['date'])
print(dfs.index.min().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs.index.max().strftime('%a, %d %b %Y %I:%M %p'))
print(dfs['label'].value_counts())
The output of the preceding code is given here:
Tue, 24 May 2011 11:04 AM
Fri, 20 Sep 2019 03:04 PM
inbox 32952
sent 4602
Name: label, dtype: int64
The following excerpt from the plotting helper used for these charts formats the
hour-of-day tick labels (the surrounding function definition is not shown):
    ax.set_xticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))), "%H").strftime("%I %p")
                        for ts in ax.get_xticks()])
elif orientation == 'horizontal':
    ax.set_ylim(0, 24)
    ax.yaxis.set_major_locator(MaxNLocator(8))
    ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))), "%H").strftime("%I %p")
                        for ts in ax.get_yticks()])
Fig. 1.7.
Fig. 1.8.
The preceding output shows that my busiest day is Thursday. I receive most of
my emails on Thursdays. Let's go one step further and see the most active days for
receiving and sending emails separately:
# sent and received are assumed to be the subsets of dfs with label 'sent' and 'inbox'
sdw = sent.groupby('dayofweek').size() / len(sent)
rdw = received.groupby('dayofweek').size() / len(received)
df_tmp = pd.DataFrame(data={'Outgoing Email': sdw, 'Incoming Email':rdw})
df_tmp.plot(kind='bar', rot=45, figsize=(8,5), alpha=0.5)
plt.xlabel('');
plt.ylabel('Fraction of weekly emails');
plt.grid(ls=':', color='k', alpha=0.5)
Fig. 1.9.
Fig. 1.10.
Merging on Index
Sometimes the keys for merging dataframes are located in the dataframes' index. In
such a situation, we can pass left_index=True or right_index=True to indicate that the
index should be accepted as the merge key.
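A minimal sketch of merging on an index (the dataframes here are illustrative, not from the original text):
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'B': [4, 5, 6]}, index=['K0', 'K1', 'K2'])

# Use the key column of the left frame and the index of the right frame as merge keys
merged = pd.merge(left, right, left_on='key', right_index=True)
print(merged)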
2. Now, using the stack() method on the preceding dframe1, we can pivot
the columns into rows to produce a series:
# dframe1 is assumed to be a 3x5 dataframe (rows Rainfall, Humidity, and Wind;
# columns for five Norwegian cities), as implied by the output below
stacked = dframe1.stack()
stacked
The output of this stacking is as follows:
Rainfall Bergen 0
Oslo 1
Trondheim 2
Stavanger 3
Kristiansand 4
Humidity Bergen 5
Oslo 6
Trondheim 7
Stavanger 8
Kristiansand 9
Wind Bergen 10
Oslo 11
Trondheim 12
Stavanger 13
Kristiansand 14
dtype: int64
3. The preceding series, stored in the stacked variable, can be rearranged
into a dataframe using the unstack() method:
stacked.unstack()
4. Now, let's unstack the concatenated frame:
series1 = pd.Series([000, 111, 222, 333], index=['zeros', 'ones', 'twos', 'threes'])
series2 = pd.Series([444, 555, 666], index=['fours', 'fives', 'sixes'])
frame2 = pd.concat([series1, series2], keys=['Number1', 'Number2'])
frame2.unstack()
The output of the preceding unstacking is shown in the following screenshot:
Let's dive more into how we can perform other types of data transformations,
including cleaning, filtering, deduplication, and others.
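The frame3 dataframe used in the duplicate-detection examples below is not constructed in this excerpt; a construction consistent with the outputs shown is:
import pandas as pd

frame3 = pd.DataFrame({'column 1': ['Looping', 'Looping', 'Looping',
                                    'Functions', 'Functions', 'Functions', 'Functions'],
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})
Displaying frame3 gives the following table: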
    column 1   column 2
0   Looping    10
1   Looping    10
2   Looping    22
3   Functions  23
4   Functions  23
5   Functions  24
6   Functions  24
frame3.duplicated()
The output of the preceding code is pretty easy to interpret:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
Column 1 Column 2
0 Looping 10
2 Looping 22
3 Functions 23
5 Functions 24
Note that rows 1, 4, and 6 are removed. Basically, both the duplicated() and
drop_duplicates() methods consider all of the columns for comparison. Instead of all
the columns, we could specify any subset of the columns to detect duplicated items.
4. Let's add a new column and try to find duplicated items based on the
second column:
frame3['column 3'] = range(7)
frame5 = frame3.drop_duplicates(['column 2'])
frame5
The output of the preceding snippet is as follows:
Replacing Values
Often, it is essential to find and replace some values inside a dataframe. This
can be done with the following steps:
1. We can use the replace method in such cases:
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})
replaceFrame.replace(to_replace=-786, value=np.nan)
The output of the preceding code is as follows:
Column 1 Column 2
0 200.0 0
1 3000.0 1
2 NaN 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 NaN 6
7 332.0 7
8 3332.0 8
Note that we just replaced one value with the other values. We can also replace
multiple values at once.
2. In order to do so, we pass lists of values:
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])
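The missing-value examples that follow operate on a small store-by-fruit dataframe, dfx, whose construction is not shown in this excerpt. A reconstruction consistent with the outputs below is:
import numpy as np
import pandas as pd

dfx = pd.DataFrame({
    'store1': [15, 18, 21, 24, 27, 15, np.nan],
    'store2': [16, 19, 22, 25, 28, 16, np.nan],
    'store3': [17, 20, 23, 26, 29, 17, np.nan],
    'store4': [20, np.nan, np.nan, np.nan, np.nan, 18, np.nan],
    'store5': [np.nan] * 7,
}, index=['apple', 'banana', 'kiwi', 'grapes', 'mango', 'watermelon', 'oranges'])

# dfx.isnull() returns a boolean dataframe that marks each missing entry as True
dfx.isnull()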
Note that the True values indicate the values that are NaN. Pretty obvious, right?
Alternatively, we can also use the notnull() method to do the same thing. The only
difference would be that the function will indicate True for the values which are not
null.
2. Check it out in action:
dfx.notnull()
And the output of this is as follows:
Compare these two tables. These two functions, notnull() and isnull(), are the
complement of each other.
3. We can use the sum() method to count the number of NaN values in each
store. How does this work, you ask? Check the following code:
dfx.isnull().sum()
And the output of the preceding code is as follows:
store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64
The fact that True is 1 and False is 0 is the main logic for summing. The preceding
results show that one value was not reported by store1, store2, and store3. Five
values were not reported by store4 and seven values were not reported by store5.
4. We can go one level deeper to find the total number of missing values:
dfx.isnull().sum().sum()
And the output of the preceding code is as follows:
15
This indicates 15 missing values in our stores. We can use an alternative way to
find how many values were actually reported.
5. So, instead of counting the number of missing values, we can count the
number of reported values:
dfx.count()
And the output of the preceding code is as follows:
store1    6
store2    6
store3    6
store4    2
store5    0
dtype: int64
Pretty elegant, right? We now know two different ways to find the missing values,
and also how to count the missing values.
Dropping by Rows
We can also drop rows that have NaN values. To do so, we can use the how='all'
argument to drop only those rows whose values are all NaN:
dfx.dropna(how='all')
The output of the preceding code is as follows:
            store1  store2  store3  store4  store5
apple 15.0 16.0 17.0 20.0 NaN
banana 18.0 19.0 20.0 NaN NaN
kiwi 21.0 22.0 23.0 NaN NaN
grapes 24.0 25.0 26.0 NaN NaN
mango 27.0 28.0 29.0 NaN NaN
watermelon 15.0 16.0 17.0 18.0 NaN
Note that only the oranges row is removed, because that entire row contained
NaN values.
Dropping by Columns
Furthermore, we can also pass axis=1 to indicate a check for NaN by columns:
dfx.dropna(how='all', axis=1)
And the output of the preceding code is as follows:
Note that store5 is dropped from the dataframe. By passing in axis=1, we are
instructing pandas to drop columns if all the values in the column are NaN.
Furthermore, we can also pass another argument, thresh, to specify the minimum
number of non-NaN values a column must contain in order to be kept:
dfx.dropna(thresh=5, axis=1)
And the output of the preceding code is as follows:
            store1  store2  store3
apple 15.0 16.0 17.0
banana 18.0 19.0 20.0
kiwi 21.0 22.0 23.0
grapes 24.0 25.0 26.0
mango 27.0 28.0 29.0
watermelon 15.0 16.0 17.0
oranges NaN NaN NaN
Compared to the preceding output, note that even the store4 column is now dropped
because it contains only two non-NaN values, which is below the threshold of five.
If we instead fill the missing values, for example with dfx.fillna(0), all the NaN
values are replaced by 0. Replacing the values with 0 will affect several statistics,
including the mean, sum, and median.
Check the difference in the following two examples:
dfx.mean()
And the output of the preceding code is as follows:
store1 20.0
store2 21.0
store3 22.0
store4 19.0
store5 NaN dtype: float64
If we first fill the missing values with 0, using dfx.fillna(0).mean(), the output we
get is as follows:
store1    17.142857
store2    18.000000
store3    18.857143
store4     5.428571
store5     0.000000
dtype: float64
Mean/average
The mean, or average, is a number around which the observed continuous
variables are distributed. This number estimates the value of the entire dataset.
Mathematically, it is the result of dividing the sum of all the values by the number
of values in the dataset.
Let x be a set of integers:
x = (12, 2, 3, 5, 8, 9, 6, 4, 2)
Hence, the mean value of x can be calculated as follows:
Mean(x) = (12 + 2 + 3 + 5 + 8 + 9 + 6 + 4 + 2) / 9 = 51 / 9 ≈ 5.66
Median
Given a dataset that is sorted either in ascending or descending order, the median
divides the data into two parts. The general formula for calculating the median is as
follows:
Median position = ((n + 1) / 2)th observation
Here, n is the number of items in the data. The steps involved in calculating the
median are as follows:
1. Sort the numbers in either ascending or descending order.
2. If n is odd, find the ((n + 1)/2)th term. The value corresponding to this
term is the median.
3. If n is even, the position (n + 1)/2 falls between two terms, and the median
is the average of the values on either side of that position.
For a set of integers such as x, we must arrange them in ascending order and then
select the middle integer.
In ascending order = (2, 2, 3, 4, 5, 6, 8, 9, 12). Here, the median is 5.
Mode
The mode is the integer that appears the maximum number of times in the dataset.
It happens to be the value with the highest frequency in the dataset. In the x dataset
in the median example, the mode is 2 because it occurs twice in the set.
Python provides different libraries for computing descriptive statistics on a dataset. Commonly used libraries are
pandas, numpy, and scipy. These measures of central tendency can be calculated easily with numpy and pandas
functions.
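For instance, a quick sketch using the small x dataset from above (numpy and the standard statistics module are assumed to be available):
import numpy as np
import statistics

x = [12, 2, 3, 5, 8, 9, 6, 4, 2]
print(np.mean(x))           # 5.666...
print(np.median(x))         # 5.0
print(statistics.mode(x))   # 2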
Here is a dataset of automobiles that lists different features and attributes of cars, such as symboling, normalized
losses, aspiration, and many others; analyzing it will provide some valuable insights and findings about the
automobiles in this dataset.
Let's begin by importing the datasets and the Python libraries required:
import pandas as pd
import numpy as np
Now, let's load the automobile database:
df = pd.read_csv("data.csv")
df.head()
The output of the code is given here:
   symboling  normalized-losses  make         aspiration  num-of-doors  body-style   drive-wheels  engine-location  wheel-base  length    width     height  curb-weight  engine-type  num-of-cylinders  engine-size
0  3          122                alfa-romero  std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
1  3          122                alfa-romero  std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
2  1          122                alfa-romero  std         two           hatchback    rwd           front            94.5        0.822681  0.909722  52.4    2823         ohcv         six               152
3  2          164                audi         std         four          sedan        fwd           front            99.8        0.848630  0.919444  54.3    2337         ohc          four              109
4  2          164                audi         std         four          sedan        4wd           front            99.4        0.848630  0.922222  54.3    2824         ohc          five              136
Standard Deviation
Different Python libraries have functions to get the standard deviation of the
dataset. The NumPy library has the numpy.std(dataset) function. The statistics library
has the statistics.stdev(dataset) function. Using the pandas library, we calculate the
standard deviation in our df data frame using the df.std() function:
# standard deviation of dataset using std() function
std_dev = df.std()
print(std_dev)
# standard deviation of the specific column
sv_height = df.loc[:, "height"].std()
print(sv_height)
The output of the preceding code is as follows:
symboling               1.254802
normalized-losses      31.996250
wheel-base              6.066366
length                  0.059213
width                   0.029187
height                  2.447822
curb-weight           517.296727
engine-size            41.546834
bore                    0.268072
stroke                  0.319256
compression-ratio       4.004965
horsepower             37.365700
peak-rpm              478.113805
city-mpg                6.423220
highway-mpg             6.815150
price                7947.066342
city-L/100 km           2.534599
diesel                  0.300083
gas                     0.300083
dtype: float64
2.44782216129631
Variance
Variance is the average of the squared differences between each value in the
dataset and the mean; that is, it is the square of the standard deviation.
Different Python libraries have functions to obtain the variance of the dataset. The
NumPy library has the numpy.var(dataset) function. The statistics library has the
statistics.variance(dataset) function. Using the pandas library, we calculate the
variance in our df data frame using the df.var() function:
# variance of dataset using var() function
variance = df.var()
print(variance)
# variance of the specific column
var_height = df.loc[:, "height"].var()
print(var_height)
The output of the preceding code is as follows:
symboling            1.574527e+00
normalized-losses    1.023760e+03
wheel-base           3.680079e+01
length               3.506151e-03
width                8.518865e-04
height               5.991833e+00
curb-weight          2.675959e+05
engine-size          1.726139e+03
bore                 7.186252e-02
stroke               1.019245e-01
compression-ratio    1.603975e+01
horsepower           1.396195e+03
peak-rpm             2.285928e+05
city-mpg             4.125776e+01
highway-mpg          4.644627e+01
price                6.315586e+07
city-L/100 km        6.424193e+00
diesel               9.004975e-02
gas                  9.004975e-02
dtype: float64
5.991833333333338
Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of
the variable in the dataset about its mean. The skewness value can be positive or
negative, or undefined. The skewness value tells us whether the data is skewed or
symmetric. Here's an illustration of a positively skewed dataset, symmetrical data,
and some negatively skewed data:
Fig. 1.11.
Note the following observations from the preceding diagram:
The graph on the right-hand side has a tail that is longer on the left-hand side.
This indicates that the distribution of the data is skewed to the left. If you
select any point in the left-hand longer tail, the mean is less than the mode. This
condition is referred to as negative skewness.
The graph on the left-hand side has a tail that is longer on the right-hand side. If
you select any point on the right-hand tail, the mean value is greater than the mode.
This condition is referred to as positive skewness.
The graph in the middle has a right-hand tail that is the same as the left-
hand tail. This condition is referred to as a symmetrical condition.
Different Python libraries have functions to get the skewness of the dataset. The
SciPy library has a scipy.stats.skew(dataset) function. Using the pandas library, we
can calculate the skewness in our df data frame using the df.skew()function.
Here, in our data frame of automobiles, let's get the skewness using the
df.skew() function:
df.skew()
The output of the preceding code is as follows:
symboling            0.204275
normalized-losses    0.209007
wheel-base           1.041170
length               0.154086
width                0.900685
height               0.064134
curb-weight          0.668942
engine-size          1.934993
bore                 0.013419
stroke              -0.669515
compression-ratio    2.682640
horsepower           9.985047
peak-rpm             0.073094
city-mpg             0.673533
highway-mpg          0.549104
price                1.812335
dtype: float64
Kurtosis
Basically, kurtosis is a statistical measure that illustrates how heavily the tails of
distribution differ from those of a normal distribution. This technique can identify
whether a given distribution contains extreme values.
Kurtosis, unlike skewness, is not about the peakedness or flatness. It is the
measure of outlier presence in a given distribution. Both high and low kurtosis are an
indicator that data needs further investigation. The higher the kurtosis, the higher the
outliers.
Types of Kurtosis
There are three types of kurtosis - mesokurtic, leptokurtic, and platykurtic. Let's
look at these one by one:
Mesokurtic: If any dataset follows a normal distribution, it follows a mesokurtic
distribution. It has kurtosis around 0.
Leptokurtic: In this case, the distribution has kurtosis greater than 3, and the fat
tails indicate that the distribution produces more outliers.
Platykurtic: In this case, the distribution has negative kurtosis, and the tails are
very thin compared to the normal distribution.
All three types of kurtosis are shown in the following diagram:
Fig. 1.12.
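As with skewness, kurtosis can be computed directly; a minimal sketch, assuming the same automobile dataframe df used above:
from scipy import stats

print(df.kurt(numeric_only=True))             # excess kurtosis of every numeric column
print(stats.kurtosis(df['price'].dropna()))   # excess kurtosis of a single column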
Calculating Percentiles
Percentiles measure the percentage of values in any dataset that lie below a
certain value. In order to calculate percentiles, we need to make sure our list is
sorted. An example would be if you were to say that the 80th percentile of data is
130: then what does that mean? Well, it simply means that 80% of the values lie
below 130. Pretty easy, right? We will use the following formula for this:
The formula for calculating Number of values less than X
` =
percentile of X Total number of observations 100
Suppose we have the given data: 1, 2, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10. Then the
percentile value of 4 = (4 / 12) * 100 = 33.33%.
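A small sketch of this formula applied to the example data:
data = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10]
X = 4
percentile_of_X = sum(1 for value in data if value < X) / len(data) * 100
print(percentile_of_X)   # 33.33..., since 4 of the 12 values are less than 4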
Quartiles
Given a dataset sorted in ascending order, quartiles are the values that split the
given dataset into quarters. Quartiles refer to the three data points that divide the
given dataset into four equal parts, such that each split makes 25% of the dataset. In
terms of percentiles, the 25th percentile is referred to as the first quartile (Q1), the
50th percentile is referred to as the second quartile (Q2), and the 75th percentile is
referred to as the third quartile (Q3).
Based on the quartile, there is another measure called the inter-quartile range that also
measures the variability in the dataset. It is defined as follows:
IQR = Q3 – Q1
IQR is not affected by the presence of outliers. Let's get the IQR for the price
column from the same dataframe we have been using so far:
price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)
IQR = Q3 - Q1
IQR
The output of the preceding snippet is as follows:
8718.5
Understanding groupby()
During the data analysis phase, categorizing a dataset into multiple categories or
groups is often essential. We can do such categorization using the pandas library. The
pandas groupby function is one of the most efficient and time-saving features for
doing this. Groupby provides functionalities that allow us to split-apply-combine
throughout the dataframe; that is, this function can be used for splitting, applying,
and combining dataframes.
Group by mechanics
To work with groupby functionalities, we need a dataset that has multiple numerical as well as categorical records in it
so that we can group by different categories and ranges.
Let's take a look at a dataset of automobiles that lists the different features and attributes of cars, such as symboling,
normalized-losses, make, aspiration, body-style, drive-wheels, engine-location, and many others. Let's get started:
1. Let's start by importing the required Python libraries and datasets:
import pandas as pd
df = pd.read_csv("/content/automobileEDA.csv")
df.head()
The output of the preceding code is as follows:
   symboling  normalized-losses  make         aspiration  num-of-doors  body-style   drive-wheels  engine-location  wheel-base  length    width     height  curb-weight  engine-type  num-of-cylinders  engine-size
0  3          122                alfa-romero  std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
1  3          122                alfa-romero  std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
2  1          122                alfa-romero  std         two           hatchback    rwd           front            94.5        0.822681  0.909722  52.4    2823         ohcv         six               152
3  2          164                audi         std         four          sedan        fwd           front            99.8        0.848630  0.919444  54.3    2337         ohc          four              109
4  2          164                audi         std         four          sedan        4wd           front            99.4        0.848630  0.922222  54.3    2824         ohc          five              136
As you can see, there are multiple columns with categorical variables.
2. Using the groupby() function, let's group this dataset on the basis of the body-style column:
df.groupby('body-style').groups.keys()
The output of the preceding code is as follows:
dict_keys(['convertible', 'hardtop', 'hatchback', 'sedan', 'wagon'])
From the preceding output, we know that the body-style column has five unique values, including convertible,
hardtop, hatchback, sedan, and wagon.
3. Now, we can group the data based on the body-style column. Next, let's print the values contained in the
group that has the body-style value of convertible. This can be done using the following code:
# Group the dataset by the column body-style
style = df.groupby('body-style')
# Get the items from the group with the value convertible
style.get_group("convertible")
The output of the preceding code is as follows:
     symboling  normalized-losses  make           aspiration  num-of-doors  body-style   drive-wheels  engine-location  wheel-base  length    width     height  curb-weight  engine-type  num-of-cylinders  engine-size
0    3          122                alfa-romero    std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
1    3          122                alfa-romero    std         two           convertible  rwd           front            88.6        0.811148  0.890278  48.8    2548         dohc         four              130
69   3          142                mercedes-benz  std         two           convertible  rwd           front            96.6        0.866410  0.979167  50.8    3685         ohcv         eight             234
125  3          122                porsche        std         two           convertible  rwd           rear             89.5        0.811629  0.902778  51.6    2800         ohcf         six               194
168  2          134                toyota         std         two           convertible  rwd           front            98.4        0.846708  0.911111  53.0    2975         ohc          four              146
185  2          122                volkswagen     std         two           convertible  fwd           front            94.5        0.765497  0.891667  55.6    2254         ohc          four              109
Next, let's apply a single aggregation to get the mean of the columns. Here,
new_dataset is assumed to be a subset of df containing the length, width, height,
curb-weight, and price columns (its construction is not shown in this excerpt). To do
this, we can use the agg() method, as shown in the following code:
# applying single aggregation for mean over the columns
new_dataset.agg("mean", axis="rows")
The output of the preceding code is as follows:
length             0.837102
width              0.915126
height            53.766667
curb-weight     2555.666667
price          13207.129353
dtype: float64
Group-wise operations
Let's group the DataFrame, df, by body-style and drive-wheels and extract stats
from each group by passing a dictionary of aggregation functions:
# Group the data frame df by body-style and drive-wheels and extract stats from each group
df.groupby(
    ["body-style", "drive-wheels"]
).agg(
    {
        'height': min,    # minimum height of car in each group
        'length': max,    # maximum length of car in each group
        'price': 'mean',  # average price of car in each group
    }
)
The preceding code groups the dataframe according to body-style and then drive-
wheels. Then, the aggregate functions are applied to the height, length, and price
columns, which return the minimum height, maximum length, and average price in
the respective groups.
# create dictionary of aggregations
aggregations = {
    'height': min,    # minimum height of car in each group
    'length': max,    # maximum length of car in each group
    'price': 'mean',  # average price of car in each group
}
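The dictionary can then be passed to agg(); a usage sketch:
df.groupby(["body-style", "drive-wheels"]).agg(aggregations)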
Group-wise transformations
Working with groupby() and aggregation, you must have thought, why can't we
group data, apply aggregation, and append the result into the dataframe directly? Is it
possible to do all this in a single step? Yes, it is.
Performing a transformation on a group or a column returns an object that is
indexed by the same axis length as itself. It is an operation that's used in conjunction
with groupby(). The aggregation operation has to return a reduced version of the
data, whereas the transformation operation returns a transformed version of the full
data, aligned with the original index. For example, transforming the price column by
its group mean produces a series whose last few rows look like this:
196    23883.016667
197    23883.016667
198    23883.016667
199    23883.016667
3. Now, create a new column for the average price in the original dataframe:
df["average-price"] = df.groupby(["body-style", "drive-wheels"])["price"].transform('mean')
# selecting columns body-style, drive-wheels, price and average-price
df.loc[:, ["body-style", "drive-wheels", "price", "average-price"]]
The output of the preceding code is a table with the body-style, drive-wheels, price,
and average-price columns. (Only fragments remain of the tables that followed: a
group-wise summary of height, length, width, curb-weight, and price by body-style,
and a pivot table indexed by body-style and drive-wheels giving the maximum and
minimum height and width and the mean price.)
This pivot table represents the maximum and minimum of the height and width
and the average price of cars in the respective categories mentioned in the index.
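The original pivot-table call is not reproduced in this excerpt; a sketch consistent with the description above is:
df.pivot_table(index=["body-style", "drive-wheels"],
               aggfunc={"height": [max, min], "width": [max, min], "price": "mean"})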
We can customize the pandas dataframe with another technique called cross-
tabulation. This allows us to combine groupby and aggregation for better data
analysis. Pandas has the crosstab function, which helps when it comes to building a
cross-tabulation table. The cross-tabulation table shows the frequency with which
certain groups of data appear. Let's take a look:
1. Let's use pd.crosstab() to look at how many cars of each body style are
made by the different makers:
pd.crosstab(df["make"], df["body-style"])
The output of the preceding code is as follows:
body-style     Convertible  Hardtop  Hatchback  Sedan  Wagon
make
alfa-romero 2 0 1 0 0
audi 0 0 0 5 1
bmw 0 0 0 8 0
Chevrolet 0 0 2 1 0
Dodge 0 0 5 3 1
Honda 0 0 7 5 1
Isuzu 0 0 1 1 0
Jaguar 0 0 0 3 0
Mazda 0 0 10 7 0
Mercedes-benz 1 2 0 4 1
Mercury 0 0 1 0 0
Let's apply margins and the margins_name attribute to display the row-wise and
column-wise sums of the cross table, as shown in the following code:
# apply margins and the margins_name attribute to display the row-wise
# and column-wise sums of the cross table
pd.crosstab(df["make"], df["body-style"], margins=True, margins_name="Total Made")
body-style     Convertible  Hardtop  Hatchback  Sedan  Wagon  Total Made
make
alfa-romero 2 0 1 0 0 3
audi 0 0 0 5 1 6
bmw 0 0 0 8 0 8
Chevrolet 0 0 2 1 0 3
Dodge 0 0 5 3 1 9
Honda 0 0 7 5 1 13
Isuzu 0 0 1 1 0 2
Jaguar 0 0 0 3 0 3
Mazda 0 0 10 7 0 17
Mercedes-benz 1 2 0 4 1 8
Mercury 0 0 1 0 0 1
Mitsubishi 0 0 9 4 0 13
Nissan 0 1 5 9 3 18
Peugot 0 0 0 7 4 11
2. Let's see how the data is distributed across the body-style and drive-wheels
columns within each car maker and door type in a crosstab:
pd.crosstab([df["make"], df["num-of-doors"]], [df["body-style"], df["drive-wheels"]],
            margins=True, margins_name="Total Made")
body-style               Convertible  Hardtop    Hatchback       Sedan           Wagon          Total
drive-wheels             fwd  rwd     fwd  rwd   4wd  fwd  rwd   4wd  fwd  rwd   4wd  fwd  rwd  Made
make         num-of-doors
alfa-romero  two          0    2       0    0     0    0    1     0    0    0     0    0    0     3
audi         four         0    0       0    0     0    0    0     1    3    0     0    1    0     5
             two          0    0       0    0     0    0    0     0    1    0     0    0    0     1
bmw          four         0    0       0    0     0    0    0     0    0    5     0    0    0     5
             two          0    0       0    0     0    0    0     0    0    3     0    0    0     3
Chevrolet    four         0    0       0    0     0    0    0     0    1    0     0    0    0     1
             two          0    0       0    0     0    2    0     0    0    0     0    0    0     2
Dodge        four         0    0       0    0     0    1    0     0    3    0     0    1    0     5
             two          0    0       0    0     0    4    0     0    0    0     0    0    0     4
Honda        four         0    0       0    0     0    0    0     0    4    0     0    1    0     5
             two          0    0       0    0     0    7    0     0    1    0     0    0    0     8
Isuzu        four         0    0       0    0     0    0    0     0    0    1     0    0    0     1
Like the pivot table syntax, pd.crosstab also takes arguments such as the dataframe
columns, values, normalize, and an aggregation function. We can apply the
aggregation function to a cross table at the same time. Passing the aggregation
function and values, which are the columns that the aggregation will be applied to,
gives us a cross table of a summarized subset of the dataframe.
3. First, let's look at the average curb-weight of cars made by different
makers with respect to their body-style by applying the mean()
aggregation function to the crosstab (see the sketch below):
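The original listing is cut off at this point; a call consistent with the description might be:
pd.crosstab(df["make"], df["body-style"], values=df["curb-weight"], aggfunc='mean')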
1. Define EDA.
EDA is a process of examining the available dataset to discover patterns, spot
anomalies, test hypotheses, and check assumptions using statistical measures. In
this chapter, we are going to discuss the steps involved in performing top-notch
exploratory data analysis and get our hands dirty using some open source
databases.
2. What is data processing?
Preprocessing involves the process of pre-curating the dataset before actual
analysis. Common tasks involve correctly exporting the dataset, placing them
under the right tables, structuringthem, and exporting them in the correct format.
3. What do you understand from data cleaning?
Preprocessed data must be correctly transformed for an incompleteness
check, duplicates check, error check, and missing value check. These tasks are
performed in the data cleaning stage, which involves matching the correct
record, finding inaccuracies in the dataset, understanding the overall data
quality, removing duplicate items, and filling in the missing values.
4. List down the steps in EDA
Problem definition
Data preparation
Data analysis
Development and representation of the results
5. What are different categories of data available in EDA?
Numerical data
Discrete data
Continuous data
Categorical data
6. Briefly explain the term Bayesian analysis.
The Bayesian approach incorporates prior probability distribution
knowledge into the analysis steps as shown in the following diagram. Well,
simply put, prior probability distribution of any quantity expresses the belief
about that particular quantity before considering some evidence.
7. List the Software tools available for EDA.
NumPy
Pandas
Seaborn
SciPy
Matplotlib
8. Define matplotlib
Matplotlib provides a huge library of customizable plots, along with a
comprehensive set of back ends. It can be utilized to create professional
reporting applications, interactive analytical applications, complex dashboard
applications, web/GUI applications, embedded views, and many more.
9. What are the visual aids for EDA?
Line chart
Bar chart
Scatter plot
Pie chart
Histogram
10. What is the purpose of a bar chart?
This is one of the most common types of visualization that almost everyone
must have encountered. Bars can be drawn horizontally or vertically to
represent categorical variables. Bar charts are frequently used to distinguish
objects between distinct collections in order to track variations over time. In
most cases, bar charts are very convenient when the changes are large.
11. What is a scatter plot?
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and
scatter diagrams. They use a Cartesian coordinates system to display values of
typically two variables for a set of data.
Importing Matplotlib
Simple Line Plots
Simple Scatter Plots
Visualizing Errors
Density and Contour Plots
Histograms
Legends
Colors
Subplots
Text and Annotation
Customization
Three Dimensional Plotting
Geographic Data with Basemap
Visualization with Seaborn
UNIT II
VISUALIZING USING MATPLOTLIB
To check Python
python --version
If python is successfully installed, the version of python installed on your
system will be displayed.
To check pip
pip -V
The version of pip will be displayed, if it is successfully installed on your
system.
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually
imported under the plt alias:
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Fig. 2.1.
Example
Draw a line in a diagram from position (0,0) to position (6,250):
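A minimal sketch of such a plot (the variable names are illustrative):
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

plt.plot(xpoints, ypoints)
plt.show()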
# Assuming a figure and axes have been created beforehand, for example:
# fig = plt.figure(); ax = plt.axes(); x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
Alternatively, we can use the pylab interface and let the figure and axes be created
for us in the background,
plt.plot(x, np.sin(x));
If we want to create a single figure with multiple lines, we can simply call
the plot function multiple times:
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));
The plt.axis() method goes even beyond this, allowing you to do things like
automatically tighten the bounds around the current plot:
plt.plot(x, np.sin(x))
plt.axis('tight');
It allows even higher-level specifications, such as ensuring an equal aspect ratio
so that on your screen, one unit in x is equal to one unit in y:
plt.plot(x, np.sin(x))
plt.axis('equal');
Labeling Plots
As the last piece of this section, we'll briefly look at the labeling of plots: titles,
axis labels, and simple legends.
Titles and axis labels are the simplest such labels—there are methods that can be
used to quickly set them:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
The position, size, and style of these labels can be adjusted using optional
arguments to the function. For more information, see the Matplotlib documentation
and the docstrings of each of these functions.
When multiple lines are being shown within a single axes, it can be useful to
create a plot legend that labels each line type. Again, Matplotlib has a built-in way of
quickly creating such a legend. It is done via the (you guessed it) plt.legend() method.
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.legend();
As you can see, the plt.legend() function keeps track of the line style and color,
and matches these with the correct label. More information on specifying and
formatting plot legends can be found in the plt.legend docstring.
Another commonly used plot type is the simple scatter plot, a close cousin of the
line plot. Instead of points being joined by line segments, here the points are
represented individually with a dot, circle, or other shape. We'll start by setting up
the notebook for plotting and importing the functions we will use:
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
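The plt.legend and plt.xlim calls just below belong to a marker demonstration whose body is missing from this excerpt; a plausible sketch, assuming random data and a handful of marker styles:
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))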
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
Additional keyword arguments to plt.plot specify a wide range of properties of the
lines and markers:
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
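The next paragraph refers to a plt.scatter example that is not reproduced in this excerpt; a sketch of the kind of code it describes, using random data so that the colors and sizes vary point by point:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar()   # show the color scale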
Notice that the color argument is automatically mapped to a color scale (shown
here by the colorbar() command), and that the size argument is given in pixels. In
this way, the color and size of points can be used to convey information in the
visualization, in order to visualize multidimensional data.
For example, we might use the Iris data from Scikit-Learn, where each sample is
one of three types of flowers that has had the size of its petals and sepals carefully
measured:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
plt.scatter(features[0],features[1],alpha=0.2,s=100*features[3],c=iris.target,
cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);
We can see that this scatter plot has given us the ability to simultaneously explore
four different dimensions of the data: the (x, y) location of each point corresponds to
the sepal length and width, the size of the point is related to the petal width, and the
color is related to the particular species of flower. Multicolor and multifeature scatter
plots like this can be useful for both exploration and presentation of data.
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call:
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and points, and
has the same syntax as the shorthand used in plt.plot, outlined in Simple Line
Plots and Simple Scatter Plots.
In addition to these basic options, the errorbar function has many options to fine-
tune the outputs. Using these additional options you can easily customize the
aesthetics of your errorbar plot.
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
ecolor='lightgray', elinewidth=3,
capsize=0);
In addition to these options, you can also specify horizontal errorbars (xerr), one-
sided errorbars, and many other variants. For more information on the options
available, refer to the docstring of plt.errorbar.
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities.
Though Matplotlib does not have a built-in convenience routine for this type of
Visualizing using Matplotlib 2.11
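The regression example that the next paragraph describes is not reproduced in this excerpt; as a simplified stand-in for the omitted Gaussian process regression code, here is a minimal sketch of the plt.fill_between pattern it relies on, reusing x and y from the errorbar example above (the fitted curve and error band are illustrative):
xfit = np.linspace(0, 10, 200)
yfit = np.sin(xfit)                        # stand-in for a fitted model
dyfit = 0.3                                # stand-in for the model's uncertainty
plt.plot(x, y, '.k', markersize=4)         # the noisy observations from above
plt.plot(xfit, yfit, '-', color='gray')    # the fitted curve
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)  # shaded error band
plt.xlim(0, 10);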
Note what we've done here with the fill_between function: we pass an x value,
then the lower y-bound, then the upper y-bound, and the result is that the area
between these regions is filled. The resulting figure gives a very intuitive view into
what the Gaussian process regression algorithm is doing: in regions near a measured
data point, the model is strongly constrained and this is reflected in the small model
errors. In regions far from a measured data point, the model is not strongly
constrained, and the model errors increase.
A convenient way to prepare such data is to use the np.meshgrid function, which builds two-dimensional grids
from one-dimensional arrays:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
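The two-variable function f used in the line above is not defined in this excerpt; a definition along the following lines (the one commonly used for this demonstration) produces the banded contours discussed below:
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)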
Now let's look at this with a standard line-only contour plot:
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented
by dashed lines, and positive values by solid lines. Alternatively, the lines can be
color-coded by specifying a colormap with the cmap argument. Here, we'll also
specify that we want more lines to be drawn—20 equally spaced intervals within the
data range:
plt.contour(X, Y, Z, 20, cmap='RdGy');
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice
for centered data. Matplotlib has a wide range of colormaps available, which you can
easily browse in IPython by doing a tab completion on the plt.cm module:
plt.cm.<TAB>
Our plot is looking nicer, but the spaces between the lines may be a bit distracting.
We can change this by switching to a filled contour plot using
the plt.contourf() function (notice the f at the end), which uses largely the same
syntax as plt.contour().Additionally, we'll add a plt.colorbar() command, which
automatically creates an additional axis with labeled color information for the plot:
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
The colorbar makes it clear that the black regions are "peaks," while the red
regions are "valleys."
One potential issue with this plot is that it is a bit "splotchy." That is, the color
steps are discrete rather than continuous, which is not always what is desired. This
could be remedied by setting the number of contours to a very high number, but this
results in a rather inefficient plot: Matplotlib must render a new polygon for each
step in the level. A better way to handle this is to use the plt.imshow() function,
which interprets a two-dimensional grid of data as an image.
The following code shows this:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',cmap='RdGy')
plt.colorbar()
plt.axis(aspect='image');
There are a few potential gotchas with imshow(), however:
plt.imshow() doesn't accept an x and y grid, so you must manually specify
the extent [xmin, xmax, ymin, ymax] of the image on the plot.
plt.imshow() by default follows the standard image array definition where
the origin is in the upper left, not in the lower left as in most contour plots.
This must be changed when showing gridded data.
plt.imshow() will automatically adjust the axis aspect ratio to match the
input data; this can be changed by setting, for
example, plt.axis(aspect='image') to make x and y units match.
Finally, it can sometimes be useful to combine contour plots and image
plots. For example, here we'll use a partially transparent background image
(with transparency set via the alpha parameter) and overplot contours with
labels on the contours themselves (using the plt.clabel() function):
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',cmap='RdGy', alpha=0.5)
plt.colorbar();
The combination of these three functions—plt.contour, plt.contourf,
and plt.imshow—gives nearly limitless possibilities for displaying this sort of three-
dimensional data within a two-dimensional plot.
Creating a Histogram
To create a histogram, the first step is to create the bins: divide the whole range of values into a series of intervals and count the values which fall into each interval. Bins are consecutive, non-overlapping intervals of a variable. The matplotlib.pyplot.hist() function is used to compute and create a histogram of x.
The following table shows the parameters accepted by matplotlib.pyplot.hist()
function :
Parameter Description
x array or sequence of array
bins optional parameter contains integer or sequence or
strings
density optional parameter contains boolean values
range optional parameter represents upper and lower
range of bins
histtype optional parameter used to create type of histogram
[bar, barstacked, step, stepfilled], default is “bar”
align optional parameter controls the plotting of histogram
[left, right, mid]
weights optional parameter contains array of weights having
same dimensions as x
bottom location of the baseline of each bin
rwidth optional parameter which is relative width of the bars
with respect to bin width
color optional parameter used to set color or sequence of
color specs
label optional parameter string or sequence of string
to match with multiple datasets
log optional parameter used to set histogram axis on log
scale
Let's create a basic histogram of some random values. The code below creates a simple histogram:
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])
# Show plot
plt.show()
Output :
Fig. 2.2.
Customization of Histogram
Matplotlib provides a range of different methods to customize a histogram. The matplotlib.pyplot.hist() function itself provides many attributes with the help of which we can modify a histogram. The hist() function also returns a patches object which gives access to the properties of the created bars; using this we can modify the plot according to our will.
Example 1:
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20
# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25
# Creating histogram
fig, axs = plt.subplots(1, 1, figsize =(10, 7), tight_layout = True)
axs.hist(x, bins = n_bins)
# Show plot
plt.show()
Output :
Example 2:
The code below modifies the above histogram for a better view and accurate
readings.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20
# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25
legend = ['distribution']
# Creating histogram
fig, axs = plt.subplots(1, 1,
figsize =(10, 7),
tight_layout = True)
# Remove axes splines
for s in ['top', 'bottom', 'left', 'right']:
axs.spines[s].set_visible(False)
# Remove x, y ticks
axs.xaxis.set_ticks_position('none')
axs.yaxis.set_ticks_position('none')
# Add padding between axes and labels
axs.xaxis.set_tick_params(pad = 5)
axs.yaxis.set_tick_params(pad = 10)
# Add x, y gridlines
axs.grid(b = True, color ='grey',
linestyle ='-.', linewidth = 0.5,
alpha = 0.6)
# Add Text watermark
fig.text(0.9, 0.15, 'Jeeteshgavande30',
fontsize = 12,
color ='red',
ha ='right',
va ='bottom',
alpha = 0.7)
# Creating histogram
N, bins, patches = axs.hist(x, bins = n_bins)
# Setting color
fracs = ((N**(1 / 5)) / N.max())
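The code that actually applies fracs, colors and PercentFormatter is cut off at this point; a sketch of the usual pattern that would complete it (the colormap choice is an assumption):
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    thispatch.set_facecolor(plt.cm.viridis(norm(thisfrac)))
# PercentFormatter (imported above) is intended for a density histogram, e.g.:
# axs.hist(x, bins = n_bins, density = True)
# axs.yaxis.set_major_formatter(PercentFormatter(xmax=1))
plt.show()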
Output :
Fig. 2.3.
A legend is basically an area in the plot which describes the elements present in
the graph. Matplotlib provides an inbuilt method named legend() for this purpose.
The syntax of the method is below :
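The syntax block itself is missing from this excerpt; in outline the method is matplotlib.pyplot.legend(*args, **kwargs), and it is typically called in one of these forms:
plt.legend()                          # use the labels given in the plot() calls
plt.legend(['label 1', 'label 2'])    # supply the labels directly
plt.legend(handles=handle_list)       # handle_list is a list of artist objects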
Output:
Fig. 2.4.
Example 1:
# Import libraries
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
# Creating plot
plt.plot([1, 2, 3, 4], color='blue')
plt.title('simple legend example ')
# Creating legend with color box
blue_patch = mpatches.Patch(color='blue', label='blue legend')
plt.legend(handles=[blue_patch])
# Show plot
plt.show()
Output:
Fig. 2.5.
What is a Subplot?
There are many cases where you will want to generate a plot that contains several
smaller plots within it. That is exactly what a subplot is! A common version of the subplot is the 2x2 subplot. An example of the 2x2 subplot is below:
Fig. 2.6.
Subplots can be very complicated to create when done properly. As an example, consider the code that I used to create the above 2x2 subplot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
tech_stocks_data =
pd.read_csv('https://raw.githubusercontent.com/nicholasmccullum/python-
visualization/master/tech_stocks/GOOG_MSFT_FB_AMZN_data.csv')
tech_stocks_data.sort_values('Period', ascending = True, inplace = True)
google = tech_stocks_data['Alphabet Inc Price']
amazon = tech_stocks_data['Amazon.com Inc Price']
facebook = tech_stocks_data['Facebook Inc Price']
microsoft = tech_stocks_data['Microsoft Corp Price']
dates = tech_stocks_data['Period']
x = []
for date in tech_stocks_data['Period']:
x.append(datetime.strptime(date, '%Y-%m-%d %H:%M:%S').year)
plt.figure(figsize=(16,12))
#Plot 1
plt.subplot(2,2,1)
plt.xticks(np.arange(0, len(x) + 1)[::365], x[::365])
plt.plot(dates, google)
plt.title('Alphabet (GOOG) (GOOGL) Stock Price')
#Plot 2
plt.subplot(2,2,2)
plt.xticks(np.arange(0, len(x) + 1)[::365], x[::365])
plt.plot(dates, amazon)
plt.title('Amazon (AMZN) Stock Price')
#Plot 3
plt.subplot(2,2,3)
plt.xticks(np.arange(0, len(x) + 1)[::365], x[::365])
plt.plot(dates, facebook)
plt.title('Facebook (FB) Stock Price')
#Plot 4
plt.subplot(2,2,4)
plt.xticks(np.arange(0, len(x) + 1)[::365], x[::365])
plt.plot(dates, microsoft)
plt.title('Microsoft (MSFT) Stock Price')
Fig. 2.7.
Let's make each subplot a scatterplot, with the x-variable for each scatterplot
being fixed acidity. Name each plot with an appropriate title for an outside reader to
understand it.
Give this a try yourself before proceeding!
Once you have attempted this on your own, you can view the code below for a
full solution:
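The wine_data DataFrame used below is loaded earlier in the original text and is not shown in this excerpt; something along these lines is assumed (the file path is hypothetical), together with a reasonably large figure:
import pandas as pd
import matplotlib.pyplot as plt
wine_data = pd.read_csv('wine_data.csv')   # hypothetical path to the wine-quality dataset
plt.figure(figsize=(16, 12))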
x = wine_data['fixed acidity']
plt.subplot(2,3,1)
plt.scatter(x, wine_data['chlorides'])
plt.title('Chlorides plotted against Fixed Acidity')
plt.subplot(2,3,2)
plt.scatter(x, wine_data['quality'])
plt.title('Quality plotted against Fixed Acidity')
plt.subplot(2,3,3)
plt.scatter(x, wine_data['alcohol'])
plt.title('Alcohol plotted against Fixed Acidity')
plt.subplot(2,3,4)
plt.scatter(x, wine_data['density'])
plt.title('Density plotted against Fixed Acidity')
plt.subplot(2,3,5)
plt.scatter(x, wine_data['total sulfur dioxide'])
plt.title('Total Sulfur Dioxide plotted against Fixed Acidity')
plt.subplot(2,3,6)
plt.scatter(x, wine_data['citric acid'])
plt.title('Citric Acid plotted against Fixed Acidity')
Fig. 2.8.
If arrows and text are used within the plt.annotate() function, you can also supply two xy coordinates, one for the arrow and the other for the text. These are declared via the “xy” and “xytext” arguments, respectively.
These are the following parameters used:
s : The text of the annotation
xy : The point (x,y) to annotate
xytext : The position (x,y) to place the text at (If None, defaults to xy)
arrowprops : The properties used to draw an arrow between the positions xy and
xytext
#input annotation
plt.annotate(
# Label and coordinate
'My Money Goal Has been Reached!', xy=(2003, 14000), xytext=(2002, 20000),
#Arrow Pointer
arrowprops=dict(facecolor='red'))
Fig. 2.9.
!pip install matplotlib # install matplotlib
import matplotlib.pyplot as plt #import matplotlib
#input annotation
plt.annotate(
# Label and coordinate
'My Money Goal Has been Reached!', xy=(2003, 14000), xytext=(2002, 20000),
#Arrow Pointer
arrowprops=dict(facecolor='red'))
#output chart
plt.show( )
Fig. 2.10.
!pip install matplotlib # install matplotlib
import matplotlib.pyplot as plt #import matplotlib
verticalalignment='top', bbox=textbox)
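The plt.text call above is truncated; only its final arguments survive in this excerpt. A self-contained sketch of the usual pattern, with an assumed textbox style dictionary and illustrative data, placement and wording:
textbox = dict(boxstyle='round', facecolor='white', alpha=0.8)   # assumed box style
plt.plot([2001, 2002, 2003, 2004], [5000, 9000, 14000, 16000])   # illustrative data
plt.text(2001.2, 15500, 'My Money Goal Has been Reached!',       # illustrative placement
         verticalalignment='top', bbox=textbox)
plt.show()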
Fig. 2.11.
Fig. 2.12.
!pip install matplotlib #download matplotlib
Fig. 2.13.
#create a graph
year=[2005,2006, 2007, 2008, 2009, 2010, 2011]
income=[45000, 60000,70000,50000,60000,70000,80000]
plt.plot(year, income)
Fig. 2.14.
#create a graph
year=[2005,2006, 2007, 2008, 2009, 2010, 2011]
income=[45000, 60000,70000,50000,60000,70000,80000]
fig, ax = plt.subplots( )
ax.plot(year, income)
Fig. 2.15.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
# Creating our data
np.random.seed(19680801)
x, y = np.random.rand(2, 200)
#plotting our graph
fig, ax = plt.subplots( )
ax.plot(x, y, alpha=0.2)
#creating our circle
circle = patches.Circle((0.5, 0.5), 0.25, alpha=0.8, fc='yellow')
ax.add_patch(circle)
Adding Fills
Fills are useful for (among other things) visualizing where multiple distributions
overlap.
Let's start with several Gaussian distributions for illustration. The probability density for a Gaussian distribution is given by
p(x) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
where μ is the mean and σ is the standard deviation.
def gaussian(x, μ=0, σ=1, normalized=True):
    u = (x - μ) / σ
    g = np.exp(-u**2 / 2)
    if normalized:
        g /= np.sqrt(2 * np.pi * σ**2)   # normalising constant 1/√(2πσ²)
    return g
Let's generate 3 Gaussian distributions for the plot.
z = np.linspace(-10, 10, 1000)
μ0, μ1, μ2 = -4, 0, 2
y0 = gaussian(z, μ=μ0, σ=1.25)
y1 = gaussian(z, μ=μ1, σ=1.0)
y2 = gaussian(z, μ=μ2, σ=1.5)
You can specify colors from Matplotlib's color cycler with a “CN” color specification. Since this plot only has a few lines, it's simpler to explicitly match the fill color to the color of the associated line. (If you have more than a few lines in your plot, iterate over the lines and use line.get_color() to set the fill color.)
_, ax = plt.subplots()
ax.plot(z, y0, label=r"$G_0$", color="C0")
ax.plot(z, y1, label=r"$G_1$", color="C1")
ax.plot(z, y2, label=r"$G_2$", color="C2")
ax.legend(loc="upper right")
Next, place tick marks at the mean of each distribution.
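The code that marks the means and shades the overlapping distributions is missing from this excerpt; a minimal sketch continuing the example above (drawing the ticks as dotted vertical lines is an assumption about the intended style):
for μ, color in zip([μ0, μ1, μ2], ["C0", "C1", "C2"]):
    ax.axvline(μ, color=color, linestyle=":")     # mark each mean
ax.fill_between(z, y0, color="C0", alpha=0.3)
ax.fill_between(z, y1, color="C1", alpha=0.3)
ax.fill_between(z, y2, color="C2", alpha=0.3)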
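The style definition that the next sentence refers to is not included in this excerpt; a minimal sketch of what such a LaTeX-everywhere style file might contain (the exact settings are an assumption):
# LaTeX_everywhere.mplstyle (assumed contents)
text.usetex : True
font.family : serif
axes.labelsize : 11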
If you place the above code block in a file called LaTeX_everywhere.mplstyle (or
similar) in the stylelib directory (see above), you can then invoke it with
with plt.style.context("LaTeX_everywhere"):
Rendering the trigonometric plot from the first example with our custom LaTeX
style produces
Example:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection ='3d')
Output:
Fig. 2.16.
With the above syntax, three-dimensional axes are enabled and data can be plotted in three dimensions. A 3-D graph gives a more dynamic view and makes the data more interactive. As with 2-D graphs, we can represent a 3-D graph in different ways: a scatter plot, a contour plot, a surface plot, and so on. Let's have a look at different 3-D plots.
Output:
Fig. 2.17.
fig = plt.figure()
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
ax.scatter(x, y, z, c = c)
# syntax for plotting
ax.set_title('3d Scatter plot geeks for geeks')
plt.show()
Output:
Fig. 2.18.
Basemap is a useful tool for Python users to have in their virtual toolbelts. In this section, we'll show several examples of the type of map visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you're using conda you can type
this and the package will be downloaded:
$ conda install basemap
We add just a single new import to our standard boilerplate:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
Once you have the Basemap toolkit installed and imported, geographic plots are
just a few lines away (the graphics in the following also require the PIL package in
Python 2, or the pillow package in Python 3):
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
The meaning of the arguments to Basemap will be discussed momentarily.
The useful thing is that the globe shown here is not a mere image; it is a fully-
functioning Matplotlib axes that understands spherical coordinates and which allows
us to easily overplot data on the map! For example, we can use a different map
projection, zoom-in to North America and plot the location of Seattle. We'll use an
etopo image (which shows topographical features both on land and under the ocean)
as the map background:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
This gives you a brief glimpse into the sort of geographic visualizations that are
possible with just a few lines of Python. We'll now discuss the features of Basemap
in more depth, and provide several examples of visualizing map data. Using these
brief examples as building blocks, you should be able to create nearly any map
visualization that you desire.
Map Projections
The first thing to decide when using maps is what projection to use. You're
probably familiar with the fact that it is impossible to project a spherical map, such
as that of the Earth, onto a flat surface without somehow distorting it or breaking its
continuity. These projections have been developed over the course of human history,
and there are a lot of choices! Depending on the intended use of the map projection,
there are certain map features (e.g., direction, area, distance, shape, or other
considerations) that are useful to maintain.
The Basemap package implements several dozen such projections, all referenced
by a short format code. Here we'll briefly demonstrate some of the more common
ones.
from itertools import chain
def draw_map(m, scale=0.2):
# draw a shaded-relief image
m.shadedrelief(scale=scale)
# lats and longs are returned as a dictionary
lats = m.drawparallels(np.linspace(-90, 90, 13))
lons = m.drawmeridians(np.linspace(-180, 180, 13))
# keys contain the plt.Line2D instances
lat_lines = chain(*(tup[1][0] for tup in lats.items()))
lon_lines = chain(*(tup[1][0] for tup in lons.items()))
all_lines = chain(lat_lines, lon_lines)
# cycle through these lines and set the desired style
for line in all_lines:
    line.set(linestyle='-', alpha=0.3, color='w')
Cylindrical Projections
The simplest of map projections are cylindrical projections, in which lines of
constant latitude and longitude are mapped to horizontal and vertical lines,
respectively. This type of mapping represents equatorial regions quite well, but
results in extreme distortions near the poles. The spacing of latitude lines varies
between different cylindrical projections, leading to different conservation
properties, and different distortion near the poles. In the following figure we show an
example of the equidistant cylindrical projection, which chooses a latitude scaling
that preserves distances along meridians. Other cylindrical projections are the
Mercator (projection='merc') and the cylindrical equal area (projection='cea')
projections.
Fig. 2.19.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl',
resolution=None, llcrnrlat=-90,
urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Pseudo-cylindrical Projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of
constant longitude) remain vertical; this can give better properties near the poles of
the projection. The Mollweide projection (projection='moll') is one common example
of this, in which all meridians are elliptical arcs. It is constructed so as to preserve
area across the map: though there are distortions near the poles, the area of small
patches reflects the true area. Other pseudo-cylindrical projections are the sinusoidal
(projection='sinu') and Robinson (projection='robin') projections.
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
Fig. 2.20.
The extra arguments to Basemap here refer to the central latitude (lat_0) and
longitude (lon_0) for the desired map.
Perspective Projections
Perspective projections are constructed using a particular choice of perspective
point, similar to if you photographed the Earth from a particular point in space (a
point which, for some projections, technically lies within the Earth!). One common
example is the orthographic projection (projection='ortho'), which shows one side of
the globe as seen from a viewer at a very long distance. As such, it can show only
half the globe at a time. Other perspective-based projections include the gnomonic
projection (projection='gnom') and stereographic projection (projection='stere').
These are often the most useful for showing small portions of the map.
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
Fig. 2.21.
Conic Projections
A Conic projection projects the map onto a single cone, which is then unrolled.
This can lead to very good local properties, but regions far from the focus point of
the cone may become very distorted. One example of this is the Lambert Conformal
Conic projection (projection='lcc'), which we saw earlier in the map of North
America. It projects the map onto a cone arranged in such a way that two standard
parallels (specified in Basemap by lat_1 and lat_2) have well-represented distances,
with scale decreasing between them and increasing outside of them. Other useful
conic projections are the equidistant conic projection (projection='eqdc') and the Albers equal-area projection (projection='aea').
Fig. 2.22.
The Basemap instance itself can convert latitude and longitude coordinates to (x, y) coordinates for plotting with plt, as we saw earlier in
the Seattle example.
In addition to this, there are many map-specific functions available as methods of
the Basemap instance. These work very similarly to their standard Matplotlib
counterparts, but have an additional Boolean argument latlon, which if set
to True allows you to pass raw latitudes and longitudes to the method, rather than
projected (x, y) coordinates.
Some of these map-specific methods are:
contour()/contourf() : Draw contour lines or filled contours
imshow(): Draw an image
pcolor()/pcolormesh() : Draw a pseudocolor plot for irregular/regular meshes
plot(): Draw lines and/or markers.
scatter(): Draw points with markers.
quiver(): Draw vectors.
barbs(): Draw wind barbs.
drawgreatcircle(): Draw a great circle.
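The example that follows plots data for Californian cities; the lines that load the dataset are missing from this excerpt, so the snippet below assumes a CSV with latd, longd, population_total and area_total_km2 columns (the file name is an assumption):
import pandas as pd
cities = pd.read_csv('california_cities.csv')   # assumed file name
lat = cities['latd'].values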
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
Next, we set up the map projection, scatter the data, and then create a colorbar and
legend:
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h',
lat_0=37.5, lon_0=-119,
width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
c=np.log10(population), s=area,
cmap='Reds', alpha=0.5)
# 3.make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a,
label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
labelspacing=1, loc='lower left');
Fig. 2.23.
This shows us roughly where larger populations of people have settled in
California: they are clustered near the coast in the Los Angeles and San Francisco
areas, stretched along the highways in the flat central valley, and avoiding almost
completely the mountainous regions along the borders of the state.
Pandas
Pandas offers tools for cleaning and processing your data. It is the most popular Python library used for data analysis. In pandas, a data table is called a DataFrame.
Example 1:
# Python code to demonstrate creating a DataFrame
import pandas as pd
# initialise data of lists.
data = {'Name':[ 'Mohe' , 'Karnal' , 'Yrik' , 'jack' ],
'Age':[ 30 , 21 , 29 , 28 ]}
# Create DataFrame
df = pd.DataFrame( data )
# Print the output.
df
Output:
Name Age
0 Mohe 30
1 Karnal 21
2 Yrik 29
3 jack 28
Example 2:
Load the CSV data from the system and display it through pandas.
# import module
import pandas
# load the csv
data = pandas.read_csv("nba.csv")
# show first 5 rows
data.head()
Output:
Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
Seaborn
Seaborn is an amazing visualization library for statistical graphics plotting in
Python. It is built on top of the matplotlib library and is also closely integrated with the
data structures from pandas.
Installation
For a Python environment:
pip install seaborn
Let's create some basic plots using seaborn:
# Importing libraries
import numpy as np
import seaborn as sns
# Selecting style as white,
# dark, whitegrid, darkgrid
# or ticks
sns.set( style = "white" )
# Generate a random univariate
# dataset
rs = np.random.RandomState( 10 )
d = rs.normal( size = 50 )
# Plot a simple histogram and kde
# with binsize determined automatically
sns.distplot(d, kde = True, color = "g")
Output:
Fig. 2.24.
Line plot:
Lineplot is the most popular plot to draw a relationship between x and y, with the possibility of several semantic groupings.
Syntax : sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. Can pass data directly or reference
columns in data.
Let‟s visualize the data with a line plot and pandas:
Example 1:
# import module
import seaborn as sns
import pandas
# loading csv
data = pandas.read_csv("nba.csv")
# plotting lineplot
sns.lineplot( data['Age'], data['Weight'])
Output:
Fig. 2.25.
Scatter Plot:
A scatterplot can be used with several semantic groupings, which can help in understanding a graph of continuous/categorical data. It draws a two-dimensional graph.
Syntax: seaborn.scatterplot(x=None, y=None)
Parameters:
x, y: Input data variables; should be numeric.
Returns: This method returns the Axes object with the plot drawn onto it.
Example 1:
# import module
import seaborn
import pandas
# load csv
data = pandas.read_csv("nba.csv")
# plotting
seaborn.scatterplot(data['Age'],data['Weight'])
Output:
Fig. 2.26.
Box Plot:
A box plot (or box-and-whisker plot) is a visual representation of groups of numerical data through their quartiles, plotted against continuous/categorical data.
A box plot consists of 5 things.
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
Syntax:
seaborn.boxplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
Returns: It returns the Axes object with the plot drawn onto it.
Draw the box plot with Pandas:
Example 1:
# import module
import seaborn as sns
import pandas
# read csv and plotting
data = pandas.read_csv( "nba.csv" )
sns.boxplot( data['Age'] )
Output:
Fig. 2.27.
Bar Plot:
Barplot represents an estimate of central tendency for a numeric variable with the
height of each rectangle and provides some indication of the uncertainty around that
estimate using error bars.
Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None)
Parameters :
x, y : This parameter take names of variables in data or vector data, Inputs for
plotting long-form data.
hue : (optional) This parameter take column name for colour encoding.
data : (optional) This parameter take DataFrame, array, or list of arrays, Dataset
for plotting. If x and y are absent, this is interpreted as wide-form. Otherwise it is
expected to be long-form.
Returns : Returns the Axes object with the plot drawn onto it.
Example 1:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x =data["Age"])
Output:
Fig. 2.28.
KDE Plot:
KDE Plot described as Kernel Density Estimate is used for visualizing the
Probability Density of a continuous variable. It depicts the probability density at
different values in a continuous variable. We can also plot a single graph for multiple
samples which helps in more efficient data visualization.
Syntax: seaborn.kdeplot(x=None, *, y=None, vertical=False, palette=None,
**kwargs)
Parameters:
x, y : vectors or keys in data
vertical : boolean (True or False)
data : pandas.DataFrame, numpy.ndarray, mapping, or sequence
Draw the KDE plot with Pandas:
Example 1:
# importing the required libraries
from sklearn import datasets
import pandas as pd
import seaborn as sns
# Setting up the Data Frame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length',
'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_df['Target'] = iris.target
iris_df['Target'].replace([0], 'Iris_Setosa', inplace=True)
iris_df['Target'].replace([1], 'Iris_Vercicolor', inplace=True)
iris_df['Target'].replace([2], 'Iris_Virginica', inplace=True)
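The plotting call itself is missing between the setup above and the output below; a plausible call consistent with this setup (the choice of column and species is illustrative):
sns.kdeplot(iris_df.loc[iris_df['Target'] == 'Iris_Virginica', 'Sepal_Length'],
            color='b', shade=True, label='Iris_Virginica')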
Output:
Fig. 2.29.
Univariate data consists of only one variable, so its analysis is the simplest form of analysis: the information deals with only one quantity that changes. It does not deal with causes or relationships, and the main purpose of the analysis is to describe the data and find patterns that exist within it.
Let's see an example of bivariate data:
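The code for this bivariate example is not included in the excerpt; a plausible sketch using the same nba.csv data as the surrounding examples (the choice of columns is illustrative):
import pandas
import seaborn as sns
data = pandas.read_csv("nba.csv")
sns.scatterplot(x=data['Age'], y=data['Weight'])   # two variables, hence bivariate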
Output:
Fig. 2.30.
Let's see an example of univariate data distribution:
Example: Using the dist plot
# import module
import seaborn as sns
import pandas
# read top 5 rows
data = pandas.read_csv("nba.csv").head()
sns.distplot( data['Age'])
Output:
Fig. 2.31.
1. What is a Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility. Matplotlib is open source and we can use it freely. It is mostly written in Python, with a few segments written in C, Objective-C and JavaScript for platform compatibility. Matplotlib helps to plot graphs and is used in data visualization and graphical plotting.
2. What is a simple line plot?
The simplest of all plots is the visualization of a single function y = f(x). Here we will take a first look at creating a simple plot of this type, starting by setting up the notebook for plotting and importing the packages we will use.
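The code that the answer alludes to is not reproduced here; a minimal sketch of such a plot:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)   # sample the function on a grid
plt.plot(x, np.sin(x))        # y = f(x) with f = sin
plt.show()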
To create a histogram, the first step is to create the bins: divide the whole range of values into a series of intervals and count the values which fall into each interval. Bins are consecutive, non-overlapping intervals of a variable. The matplotlib.pyplot.hist() function is used to compute and create a histogram of x.
8. What is a Subplot?
There are many cases where you will want to generate a plot that contains
several smaller plots within it. That is exactly what a subplot is! A common version of the subplot is the 2x2 subplot. Subplots can be very complicated to create when done properly.
create when done properly.
9. How To Create Subplots in Python Using Matplotlib?
We can create subplots in Python using matplotlib with the subplot method,
which takes three arguments:
nrows: The number of rows of subplots in the plot grid.
ncols: The number of columns of subplots in the plot grid.
index: The plot that you have currently selected.
10. How do you annotate text and graph?
Annotate using text
Matplotlib offers the ability to place text within a chart. The only condition is
it requires the positioning co-ordinate of the x and y-axis to place the text.
Annotate using graph: plt.annotate()
To input text using matplotlib's “plt.annotate()” we need to declare two things: the “xy” coordinates, which tell matplotlib where we want to place our text, and the “s” attribute, which holds the text itself.
There is also an additional attribute called “arrowprops”, which allows us to draw an arrow pointing towards a specific point in our graph.
11. Brief three-dimensional Plotting in Python using Matplotlib
Matplotlib was introduced keeping in mind only two-dimensional plotting. But around the time of the 1.0 release, the 3D utilities were developed on top of the 2D ones, and thus we have 3D plotting available today! The
3d plots are enabled by importing the mplot3d toolkit. In this article, we will
deal with the 3d plots using matplotlib.
Example:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection ='3d')
12. What is geographic data with Basemap?
One common type of visualization in data science is that of geographic data.
Matplotlib's main tool for this type of visualization is the Basemap toolkit,
which is one of several Matplotlib toolkits that live under
the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and
often even simple visualizations take much longer to render than you might
hope. More modern solutions such as leaflet or the Google Maps API may be a
better choice for more intensive map visualizations. Still, Basemap is a useful
tool for Python users to have in their virtual toolbelts.
13. What is visualization with Seaborn?
Data visualization is the presentation of data in pictorial format. It is extremely important for data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. It helps in understanding the data, however complex it is, and in conveying the significance of the data by summarizing and presenting a huge amount of it in a simple, easy-to-understand format, which helps communicate information clearly and effectively.
14. Define a KDE plot?
KDE Plot described as Kernel Density Estimate is used for visualizing the
Probability Density of a continuous variable. It depicts the probability density at
different values in a continuous variable. We can also plot a single graph for
multiple samples which helps in more efficient data visualization.
Syntax: seaborn.kdeplot(x=None, *, y=None, vertical=False, palette=None,
**kwargs)
15. Differentiate Bivariate and Univariate data using seaborn and pandas:
Bivariate data: This type of data involves two different variables. The
analysis of this type of data deals with causes and relationships and the analysis
is done to find out the relationship between the two variables.
Univariate data: This type of data consists of only one variable. The
analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes. It does not deal with
causes or relationships and the main purpose of the analysis is to describe the
data and find patterns that exist within it.
16. What is a Box plot?
A box plot (or box-and-whisker plot) is a visual representation of groups of numerical data through their quartiles, plotted against continuous/categorical data.
A box plot consists of 5 things.
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
17. What are the different map projections available?
Cylindrical projections
Pseudo-cylindrical projections
Perspective projections
Conic projections
18. Define Customization
Customization challenges the boundaries of business analytics by constantly
reinventing the delicate configurations between a user's creativity and
fundamental design principles. For that reason, it aims to showcase better
pictorial representations with more flexible and interactive use of the underlying
data.
******************
UNIT III
UNIVARIATE ANALYSIS
SYLLABUS
Inequality
Preliminaries
Two organizing concepts have become the basis of the language of data analysis:
cases and variables. The cases are the basic units of analysis, the things about which
information is collected. The word variable expresses the fact that this feature varies
across different cases.
We will look at some useful techniques for displaying information about the
values of single variables, and will also introduce the differences between interval
level and ordinal level variables.
One simple device is the bar chart, a visual display in which bars are drawn to
represent each category of a variable such that the length of the bar is proportional to
the number of cases in the category.
A pie chart can also be used to display the same information. It is largely a matter
of taste whether data from a categorical variable are displayed in a bar chart or a pie
chart. In general, pie charts are to be preferred when there are only a few categories
and when the sizes of the categories are very different.
Fig. 3.3.
Bar charts and pie charts can be an effective medium of communication if they are
well drawn.
Histograms
Charts that are somewhat similar to bar charts can be used to display interval level
variables grouped into categories and these are called histograms. They are
constructed in exactly the same way as bar charts except that the ordering of the
categories is fixed, and care has to be taken to show exactly how the data were
grouped.
Let us focus on the topic of working hours to demonstrate how simple descriptive
statistics can be used to provide numerical summaries of level and spread. The
chapter will begin by examining data on working hours in Britain taken from the
General Household Survey discussed in the previous chapter. These data are used to
illustrate measures of level such as the mean and the median and measures of spread
or variability such as the standard deviation and the midspread.
Summaries of level
The level expresses where on the scale of numbers found in the dataset the
distribution is concentrated.
Residuals
Another way of expressing this is to say that the residual is the observed data
value minus the predicted value and in this case 45 – 40 = 5. Any data value such as
a measurement of hours worked or income earned can be thought of as being
composed of two components: a fitted part and a residual part. This can be expressed
as an equation:
Data = Fit + Residual
The median
The value of the case at the middle of an ordered distribution would seem to have
an intuitive claim to typicality. Finding such a number is easy when there are very
few cases. In the example of hours worked by a small random sample of 15 men
(figure 3.4), the value of 48 hours per week fits the bill. There are six men who work
fewer hours and seven men who work more hours while two men work exactly 48
hours per week. Similarly, in the female data, the value of the middle case is 37
hours. The data value that meets this criterion is called the median: the value of the
case that has equal numbers of data points above and below it. The median hours
worked by women in this very small sample is 11 hours less than the median for
men. This numeric summary of the level of the data therefore confirms our first
impressions from simply looking at the histograms in figures 3.1 and 3.2 that women
generally work shorter hours than men.
Men's hours, ranked: 30, 37, 39, 40, 45, 47, 48, 48 (median), 50, 54, 55, 55, 67, 70, 80
Fig. 3.4. Men’s working hours ranked to show the median
Summaries of Spread
The second feature of a distribution visible in a histogram is the degree of
variation or spread in the variable.
Once again, there are many candidates we could think of to summarize the spread.
One might be the distance between the two extreme values (the range). Or we might
work out what was the most likely difference between any two cases drawn at
random from the dataset.
The midspread
The range of the middle 50 per cent of the distribution is a commonly used
measure of spread because it concentrates on the middle cases. It is quite stable from
sample to sample. The points which divide the distribution into quarters are called
the quartiles (or sometimes 'hinges' or 'fourths'). The lower quartile is usually denoted QL and the upper quartile QU. (The middle quartile is of course the median.) The distance between QL and QU is called the midspread (sometimes the 'interquartile range'), or the dQ for short.
Men's hours, ranked with quartiles marked: 37, 39, 40, (QL = 42.5), 45, 47, 48, 48, 50, 54, 55, (QU = 55), 55, 67, 70, 80
Fig. 3.5. Men’s working hours ranked and showing the upper and lower quartiles
There is a measure of spread which can be calculated from these squared distances
from the mean. The standard deviation essentially calculates a typical value of these
distances from the mean. It is conventionally denoted s, and defined as:
s = √( Σ(Yi − Ȳ)² / (N − 1) )
The deviations from the mean are squared, summed and divided by one less than the sample size, and then the square root is taken to return to the original units. The order in which the calculations are performed is very important. As always, calculations within brackets are performed first, then multiplication and division, then addition (including summation) and subtraction. Without the square root, the measure is called the variance, s². The layout for a worksheet to calculate the standard deviation
of the hours worked by this small sample of men is shown in figure 3.6.
Y Y − Ȳ (Y − Ȳ)²
54 3 9
30 –21 441
47 –4 16
39 –12 144
50 –1 1
48 –3 9
45 –6 36
40 –11 121
37 –14 196
48 –3 9
67 16 256
55 4 16
55 4 16
80 29 841
70 19 361
Sum = 765 Sum of squared residuals = 2472
Fig. 3.6. Worksheet for standard deviation of men’s weekly working hours
s = √( Σ(Yi − Ȳ)² / (N − 1) ) = √( 2472 / 14 ) = 13.29
We can see that on average men tend to work more hours per week than women
(39.2 hours vs 29.6 hours) and also the higher standard deviation for women, 12.3 vs
11.6 for men indicates that there is more variation among women in terms of the
hours they usually work per week. It should also be noted that the figures for the
means and standard deviations are pasted directly from the SPSS output. We can see
that in each case the number of decimal places provided is four for the mean and five
for the standard deviation.
Total Work Hours (Men)
N Valid 6392
Missing 8188
Mean 39.2268
Std. Deviation 11.64234
Total Work Hours (Women)
Missing 9362
Mean 29.5977
Improvements can often be made to the material at hand without resorting to the
expense of collecting new data.
We must feel entirely free to rework the numbers in a variety of ways to achieve
the following goals:
to make them more amenable to analysis
to promote comparability
to focus attention on differences.
Fig. 3.8.(a) Histogram of weekly alcohol consumption of men who describe themselves as
‘drinking quite a lot’ or ‘heavy drinkers’
A common example of this is the re-expression of one currency in terms of
another. For example, in order to convert pounds to US dollars, the pounds are
multiplied by the current exchange rate. Multiplying or dividing each of the values
has a more powerful effect than adding or subtracting. The result of multiplying or
dividing by a constant is to scale the entire variable by a factor, evenly stretching or
shrinking the axis like a piece of elastic. To illustrate this, let us see what happens if
data from the General Household Survey on the weekly alcohol consumption of men
who classify themselves as moderate or heavy drinkers are divided by seven to give
the average daily alcohol consumption.
Fig. 3.8. (b) Histogram of daily alcohol consumption of men who describe themselves as
‘drinking quite a lot’ or ‘heavy drinkers’
The overall shape of the distributions in figures 3.8 (a) and 3.8 (b) are the same.
The data points are all in the same order, and the relative distances between them
have not been altered apart from the effects of rounding. The whole distribution has
simply been scaled by a constant factor.
In SPSS it is very straightforward to multiply or divide a set of data by a constant
value. For example, using syntax, the command to create the variable drday ‘Average
daily alcohol consumption’ from the variable drating ‘Average weekly alcohol
consumption’ is as follows:
COMPUTE DRDAY = DRATING/7.
Alternatively, to create a new variable ‘NEWVAR’ by multiplying an existing
variable ‘OLDVAR’ by seven the syntax would be:
COMPUTE NEWVAR = OLDVAR*7.
The ‘Compute’ command can also be used to add or subtract a constant,
for example:
COMPUTE NEWVAR = OLDVAR + 100.
COMPUTE NEWVAR = OLDVAR – 60.
The value of multiplying or dividing by a constant is often to promote
comparability between datasets where the absolute scale values are different. For
example, one way to compare the cost of a loaf of bread in Britain and the United
States is to express the British price in dollars. Percentages are the result of dividing
frequencies by one particular constant - the total number of cases.
The second use which is more immediately intelligible: standardized variables are
useful in the process of building complex measures based on more than one
indicator. In order to illustrate this, we will use some data drawn from the National
Child Development Study (NCDS). This is a longitudinal survey of all children born
in a single week of 1958.
There is a great deal of information about children’s education in this survey.
Information was sought from the children’s schools about their performance at state
examinations, but the researchers also decided to administer their own tests of
attainment.
Rather than attempt to assess knowledge and abilities across the whole range of
school subjects, the researchers narrowed their concern down to verbal and
mathematical abilities. Each child was given a reading comprehension test which
was constructed by the National Foundation for Educational Research for use in the
study, and a test of mathematics devised at the University of Manchester.
The two tests were administered at the child’s school and had very different
methods of scoring. As a result they differed in both level and spread. As can be seen
from the descriptive statistics in figure 3.4, the sixteen-year-olds in the National
Child Development Study apparently found the mathematics test rather more
difficult than the reading comprehension test. The reading comprehension was scored
out of a total of 35 and sixteen-year- olds gained a mean score of 25.37, whereas the
mathematics test was scored out of a possible maximum of 31, but the 16-year-olds
only gained a mean score of 12.75.
Descriptive Statistics
N Minimum Maximum Mean Std.Deviation
Age 16 Test 1 – Reading Comprehension 11920 0 35 25.37 7.024
The first two columns of figure 3.11 show the scores obtained on the reading and
mathematics test by fifteen respondents in this study. There is nothing inherently
interesting or intelligible about the raw numbers. The first score of 31 for the reading
test can only be assessed in comparison with what other children obtained. Both tests
can be thought of as indicators of the child’s general attainment at school. It might be
useful to try to turn them into a single measure of that construct.
Columns: (1) Raw reading score, (2) Raw maths score, (3) Standardized reading score, (4) Standardized maths score, (5) Composite score of attainment
31 17 0.8 0.61 1.41
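A minimal sketch of this standardization in Python, using the values in the table above (the standard deviation of the maths test is not given in this excerpt and is assumed to be about 7, consistent with the standardized score shown):
reading_mean, reading_sd = 25.37, 7.024
maths_mean, maths_sd = 12.75, 6.98        # maths SD is an assumption
raw_reading, raw_maths = 31, 17
z_reading = (raw_reading - reading_mean) / reading_sd   # ≈ 0.80
z_maths = (raw_maths - maths_mean) / maths_sd           # ≈ 0.61
composite = z_reading + z_maths                         # ≈ 1.41
print(round(z_reading, 2), round(z_maths, 2), round(composite, 2))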
Standardizing the variables was a necessary, but not a sufficient condition for
creating a simple summary score. It is also important to have confidence that the
components are both valid indicators of the underlying construct of interest.
As the figures stand, the most dominant feature of the dataset is a rather
uninteresting one: the change in the value of the pound. While the median and mid-
spreads of the money incomes each year have increased substantially in this period,
real incomes and differentials almost certainly have not. How could we present the
data in order to focus on the trend in real income differentials over time?
One approach would be to treat the distribution of incomes for each sex in each
year as a separate distribution, and express each of the quartiles relative to the
median. The result of doing this is given in figure 3.16.
money is transferred from the rich to the poor this will increase the happiness of the
poor more than it diminishes the happiness of the rich. This in turn suggests that the
overall happiness rating of a country will go up if income is distributed more equally.
Of course, as Layard acknowledges, the problem with this argument is that it only
works if it is possible to reduce inequality without raising taxes to such an extent that
there is no longer an incentive for individuals to strive to make money so that the
total income is reduced as a result of policies aimed at redistribution. It is clearly
important to understand the principal ways of measuring inequality if we are to
monitor the consequences of changing levels of inequality in society. This chapter
will focus on how we can measure inequality in such a way as to make it possible to
compare levels of inequality in different societies and to look at changes in levels of
inequality over time.
Definition of Income
To say that income is a flow of revenue is fine in theory, but we have to choose
between two approaches to making this operational. One is to follow accounting and
tax practices, and make a clear distinction between income and additions to wealth.
With this approach, capital gains in a given period, even though they might be used
in the same way as income, would be excluded from the definition. This is the
approach of the Inland Revenue, which has separate taxes for income and capital
gains. In this context a capital gain is defined as the profit obtained by selling an
asset that has increased in value since it was obtained. However, interestingly, in
most cases this definition (for the purposes of taxation) does not include any profit
made when you sell your main home.
The second approach is to treat income as the value of goods and services
consumed in a given period plus net changes in personal wealth during that period.
This approach involves constantly monitoring the value of assets even when they do
not come to the market. That is a very hard task.
So, although the second approach is theoretically superior, it is not very practical and
the first is usually adopted.
The definition of income usually only includes money spent on goods and
services that are consumed privately. But many things of great value to different
people are organized at a collective level: health services, education, libraries, parks,
museums, even nuclear warheads.
The benefits which accrue from these are not spread evenly across all members of
society. If education were not provided free, only families with children would need
to use their money income to buy schooling.
Sources of income are often grouped into three types:
earned income, from either employment or self-employment;
unearned income which increases from ownership of investments,
property, rent and so on;
transfer income, that is benefits and pensions transferred on the basis of
entitlement, not on the basis of work or ownership, mainly by the
government but occasionally by individuals.
Quintile group
2nd 7 11 12 12
3rd 15 16 17 16
4th 24 22 22 22
Top 51 44 42 44
All households 100 100 100 100
Decile group
Bottom 1 3 3 2
Top 33 29 27 29
Fig. 3.17. Percentage shares of household income, 2003-4
fashion. It can be noted that the first two columns of this table are simply a more
detailed version of the data presented in figure 3.18. For example, from figure 3.18
we can see that the top quintile group receives 51 per cent of original income; this
figure is also obtained if you sum the first three numbers in the first column of figure
3.19.
The cumulative percentage of the population is then plotted against the
cumulative share of total income. The resulting graphical display is known as a
Lorenz curve. It was first introduced in 1905 and has been repeatedly used for visual
communication of income and wealth inequality. The Lorenz curve for pre-tax
income in 2003/4 in the UK is shown in figure 3.20.
Lorenz curves have visual appeal because they portray how near total equality or
total inequality a particular distribution falls. If everyone in society had the same
income, then the share received by each decile group, for example, would be 10 per
cent, and the Lorenz curve would be completely straight, described by the diagonal
line.
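A Lorenz curve is straightforward to sketch from a vector of incomes; a minimal illustration in Python with synthetic incomes (not the survey data discussed here):
import numpy as np
import matplotlib.pyplot as plt
incomes = np.sort(np.array([5000, 8000, 12000, 20000, 35000, 60000]))  # illustrative values
cum_share = np.insert(np.cumsum(incomes), 0, 0) / incomes.sum()
cum_pop = np.linspace(0, 1, len(incomes) + 1)
plt.plot(cum_pop, cum_share, marker='o', label='Lorenz curve')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfect equality')
plt.xlabel('Cumulative share of population')
plt.ylabel('Cumulative share of income')
plt.legend()
plt.show()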
Scale independence
However, it is important that the measure be sensitive to the level of the
distribution. Imagine a hypothetical society containing three individuals who earned
5,000, 10,000 and 15,000 pounds respectively. If they all had an increase in their incomes of 1 million pounds, we would expect a measure of inequality to decline,
since the differences between these individuals would have become trivial. The
standard deviation and midspread would be unaffected. A popular approach is to log
income data before calculating the numerical summaries of spread. If two
distributions differ by a scaling factor, the logged distributions will differ only in
level. However, if they differ by an arithmetic constant, they will have different
spreads when logged. The existence of units with zero incomes leads to problems,
since the log of zero cannot be defined mathematically. An easy technical solution to
this problem is to add a very small number to each of the zeros. If a numerical
summary of spread in a logged distribution met the other desirable features of a
measure of inequality, we could stop here. Unfortunately, it does not.
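A minimal sketch of this idea is given below, using invented income values: the incomes are logged before the spread is summarized, a small constant is added first so that a zero income does not cause problems, and scaling all incomes by the same factor leaves the logged spread unchanged.

import numpy as np

incomes = np.array([0, 5000, 10000, 15000])   # hypothetical incomes, one of them zero

shifted = incomes + 1                          # small constant added so log(0) is avoided
logged = np.log10(shifted)

print("Standard deviation of raw incomes:   ", incomes.std())
print("Standard deviation of logged incomes:", logged.std())

# Multiplying every income by the same factor only shifts the logged values by a
# constant, so the spread of the logged distribution does not change.
print("Logged spread after doubling incomes:", np.log10(shifted * 2).std())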
Time series such as that shown in the second column of figure 3.21 are displayed
by plotting them against time, as shown in figure 3.22. When such trend lines are
smoothed, the jagged edges are sawn off. A smoothed version of the total numbers of
recorded crimes over the thirty years from the mid 1960s to the mid 1990s is
displayed in figure 3.23.
The result has a somewhat jagged appearance. The sharp edges do not
arise because very sudden changes really occurred in the numbers of recorded crimes.
They are an artefact of the method of constructing the plot, and it is justifiable to
want to remove them. According to Tukey (1977, p. 205), the value of smoothing is
'the clearer view of the general, once it is unencumbered by detail'. The aim of
smoothing is to remove any upward or downward movement in the series that is not
part of a sustained trend.
Sharp variations in a time series can occur for many reasons. Part of the variation
across time may be error. For example, it could be sampling error. The opinion-poll
data used later in this chapter were collected in monthly sample surveys, each of
which aimed to interview a cross-section of the general public, but each of which
will have deviated from the parent population to some extent. Similarly, repeated
measures may each contain a degree of measurement error. In such situations,
smoothing aims to remove the error component and reveal the underlying true trend.
But the variable of interest may of course genuinely swing around abruptly. For
example, the monthly count of unemployed people rises very sharply when school-
leavers come on to the register. In these cases, we may want to smooth to remove the
effect of events which are unique, or which are simply not the main trend in which
we are interested. It is good practice to plot the rough as well as the smooth values, to
inspect exactly what has been discarded.
In engineering terms we want to recover the signal from a message by filtering out
the noise. The process of smoothing time series also produces such a decomposition
of the data. In other words, what we might understand in engineering as
Message = Signal + Noise
becomes
Data = Smooth + Rough
This choice of words helps to emphasize that we impose no a priori structure on
the form of the fit. The smoothing procedure may be determined in advance, but this
is not the case for the shape and form of the final result: the data are allowed to speak
for themselves. Put in another way, the same smoothing recipe applied to different
time series will produce different resulting shapes for the smooth, which, as we will
see later, is not the case when fitting straight lines.
As so often, this greater freedom brings with it increased responsibility. The
choice of how much to smooth will depend on judgement and needs. If we smooth
too much, the resulting rough will itself exhibit a trend. Of course, more work is
required to obtain smoother results, and this is an important consideration when
doing calculations by hand. The smoothing recipe described later in the chapter
generally gives satisfactory results and involves only a limited amount of
computational effort.
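As a rough sketch of the smooth/rough decomposition (not the exact recipe described later in the chapter), the lines below apply a simple running median in pandas to an invented series standing in for the crime counts, and plot the smooth alongside what it discarded; the window length of 5 is an arbitrary choice.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# An invented jagged series standing in for the recorded-crime counts
rng = np.random.default_rng(0)
years = np.arange(1965, 1996)
data = pd.Series(1000 + 40 * (years - 1965) + rng.normal(0, 60, len(years)), index=years)

smooth = data.rolling(window=5, center=True).median()   # a simple running-median smoother
rough = data - smooth                                    # Data = Smooth + Rough

fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(data.index, data, label='data')
axes[0].plot(smooth.index, smooth, label='smooth')
axes[0].legend()
axes[1].plot(rough.index, rough)
axes[1].set_title('rough (what the smoother discarded)')
plt.show()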
Most time series have a past, a present and a future. For example, the rising crime
figures plotted in figure 3.22 and figure 3.23 are part of a story that begins well
before the 1960s and continues to the present day. However, the goal of the
smoothing recipes explained in this chapter is not the extrapolation of a given series
into the future. The following section provides the next instalment in this story and
discusses what happened after the very dramatic increases in total recorded crime in
the early 1990s.
******************
UNIT IV
BIVARIATE ANALYSIS
SYLLABUS
Percentage Tables
Transformations
UNIT IV
BIVARIATE ANALYSIS
Fig. 4.1.
Such models are drawn up according to a set of conventions:
1. The variables are represented inside boxes or circles and labelled; in this
example the variables are class background and performance at school.
2. Arrows run from the variables which we consider to be causes to those we
consider to be effects; class background is assumed to have a causal effect
on school performance.
3. Positive effects are drawn as unbroken lines and negative effects are drawn
as dashed lines.
4. A number is placed on the arrow to denote how strong the effect of the
explanatory variable is.
5. An extra arrow is included as an effect on the response variable, often
unlabelled, to act as a reminder that not all the causes have been specified
in the model.
Fig. 4.2.
Proportions and percentages are bounded numbers, in that they have a floor of
zero, below which they cannot go, and a ceiling of 1.0 and 100 respectively.
Proportions can be used descriptively as in figure 4.1 to represent the relative size
of different subgroups in a population. But they can also be thought of as
probabilities. For example, we can say that the probability of an individual aged 19 in
2005 having a parent in a 'Higher professional' occupation is 0.168.
Contingency Tables
A contingency table does numerically what the three-dimensional bar chart does
graphically. The Concise Oxford Dictionary defines contingent as 'true only under
existing or specified conditions'. A contingency table shows the distribution of each
variable conditional upon each category of the other. The categories of one of the
variables form the rows, and the categories of the other variable form the columns.
Each individual case is then tallied in the appropriate pigeonhole depending on its
value on both variables. The pigeonholes are given the more scientific name cells,
and the number of cases in each cell is called the cell frequency. Each row and
column can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals, and the univariate distributions can be
obtained from the marginal distributions. Figure 4.4 shows a schematic contingency
table with four rows and four columns.
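Such a table can be produced directly in pandas with pd.crosstab, as in the minimal sketch below; the two categorical variables and their values are invented for illustration, and margins=True adds the marginal totals.

import pandas as pd

# Invented individual-level data: one row per case
df = pd.DataFrame({
    "class_background": ["Higher professional", "Routine", "Intermediate",
                         "Routine", "Higher professional", "Intermediate"],
    "activity_at_19":   ["Full-time education", "Full-time job", "Full-time education",
                         "Full-time job", "Full-time education", "Full-time job"],
})

# One variable forms the rows, the other the columns; each case is tallied into a cell
table = pd.crosstab(df["class_background"], df["activity_at_19"], margins=True)
print(table)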
Panel (b) of figure 4.6 shows the percentage of young people within each category
of social class background who are in each main activity grouping at age 19. The
table was constructed by dividing each cell frequency by its appropriate row total.
We can see that whereas nearly two-thirds of those with a parent in a higher
professional occupation are still in full-time education at age 19, less than a quarter
of those with parents in Lower supervisory or Routine occupations are still in full-
time education by this age. Tables that are constructed by percentaging the rows are
usually read down the columns (reading along the rows would probably only confirm
two things we already know: the broad profile of the marginal distribution and the
fact that the percentages sum to 100). This is sometimes called an 'outflow' table.
The row percentages show the different outcomes for individuals with a particular
social class background.
It is also possible to tell the story in a rather different way, and look at where
people who ended up doing the same main activity at age 19 came from: the 'inflow
table'. This is shown in panel (c) of figure 4.6.
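Both kinds of percentage table can be obtained from the same cross-tabulation by changing the normalize argument of pd.crosstab, as sketched below with the same invented data frame as in the previous example.

import pandas as pd

df = pd.DataFrame({
    "class_background": ["Higher professional", "Routine", "Intermediate",
                         "Routine", "Higher professional", "Intermediate"],
    "activity_at_19":   ["Full-time education", "Full-time job", "Full-time education",
                         "Full-time job", "Full-time education", "Full-time job"],
})

# 'Outflow' table: each row sums to 100 (percentages within each class background)
outflow = pd.crosstab(df["class_background"], df["activity_at_19"], normalize="index") * 100

# 'Inflow' table: each column sums to 100 (percentages within each activity)
inflow = pd.crosstab(df["class_background"], df["activity_at_19"], normalize="columns") * 100

print(outflow.round(1))
print(inflow.round(1))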
(a) Cell percentages
Higher professional     10.8   0.7   4.0   0.7   0.3   0.0   0.3   16.8
Lower professional      13.9   1.7   8.6   1.4   1.4   0.3   0.6   27.7
Intermediate             8.1   1.6   9.0   1.1   1.3   0.7   0.5   22.3
Lower supervisory        2.6   1.1   5.4   0.9   0.7   0.4   0.1   11.2
Routine                  3.1   1.5   6.0   1.2   1.6   0.6   0.6   14.5

(b) Row percentages
Higher professional     64     4    24     4     2     -     2    100.0
Lower professional      50     6    31     5     5     1     2    100.0
Intermediate            36     7    40     5     6     3     2    100.0
Lower supervisory       23    10    48     8     6     4     1    100.0
Routine                 21    10    41     8    11     4     4    100.0
Other/unclassified      32     4    31    12    14     6     2    100.0

(c) Column percentages
Higher professional     26.4   9.8  11.4  10.8   5.3   0.0  15.6   16.8
Lower professional      34.0  24.4  24.3  22.5  21.6  11.3  25.2   27.7
Intermediate            19.8  23.2  25.5  18.3  21.1  28.0  20.7   22.3
Lower supervisory        6.3  16.5  15.2  14.6  10.4  18.0   5.2   11.2
Routine                  7.5  21.5  17.0  19.0  25.1  24.0  26.7   14.5
Other/unclassified       5.9   4.5   6.6  14.8  16.5  18.7   6.7    7.6
Total                  100.0 100.0 100.0 100.0 100.0 100.0 100.0  100.0

Fig. 4.6. Main activity at age 19 (columns, of which the first is full-time education; the final column is the total) by social class background of parents (rows)
(ii) Labelling
The title of a table should be the first thing the reader looks at. A clear title should
summarize the contents. It should be as short as possible, while at the same time
making clear when the data were collected, the geographical unit covered, and the
unit of analysis.
(iii) Sources
The reader needs to be told the source of the data. It is not good enough to say that
it was from Social Trends. The volume and year, and either the table or page, and
sometimes even the column in a complex table must be included. When the data are
first collected from a published source, all these things should be recorded, or a
return trip to the library will be needed.
(vi) Definitions
There can be no hard and fast rule about how much definitional information to
include in your tables. They could become unreadable if too much were included. If
complex terms are explained elsewhere in the text, include a precise section or page
reference.
(ix) Layout
The effective use of space and grid lines can make the difference between a table
that is easy to read and one which is not. In general, white space is preferable, but
grid lines can help indicate how far a heading or subheading extends in a complex
table.
Tables of monthly data can be broken up by spaces between every December and
January, for example. Labels must not be allowed to get in the way of the data. Set
variable headings off from the table, and further set off the category headings.
Make a decision about which variable to put in the rows and which in the columns
by combining the following considerations:
1. Closer figures are easier to compare.
2. Comparisons are more easily made down a column.
3. A variable with more than three categories is best put in the rows so that there is plenty of room.
proportion of each category of a 'feeling safe walking alone after dark' variable who
are old. This can be formalized into a rule when dealing with contingency data:
Construct the proportions so that they sum to one within the categories of the
explanatory variable.
The rule is illustrated by the following diagram.
Fig. 4.7.
Note that it cannot be formulated as 'always calculate proportions along the rows'.
This would only work if the explanatory variable was always put in the rows, and no
such convention has been established.
Example:
                   Very safe / fairly safe /
Age group          a bit unsafe                 Very unsafe            Total
                      p          N                 p        N          p        N
16-39               0.93      13,589             0.07     1,083        1     14,672
40-59               0.93      13,861             0.07     1,099        1     14,960
Which categories should be selected as bases for comparison among age groups
feeling unsafe walking alone after dark? An important rule of thumb is to choose a
category with a relatively large number of individuals within it. In this case, since the
age-groups are all of similar size, any one of them could be used as the base category
for the age-group variable.
If we select the youngest age group as the base and then pick feeling very unsafe
as the base for comparison in the fear of walking alone after dark variable, we will
almost certainly avoid too many negative relationships. In summary, each age group
can be compared with those aged 16-39 in their feeling very unsafe when walking
alone after dark.
In order to represent one three-category variable, like age group, in a causal path
model, we have to present it as two dichotomous variables. Instead of coding the age
of respondents as 1, 2 or 3 to denote 60 and over, 40-59, or 16-39, for example, the
information is effectively presented as two dichotomous variables - whether someone
is aged 60 and over or not, and aged 40-59 or not.
Someone who was in neither of these age groups would, by elimination, be in the youngest age
group.
                 Age group as a              Age group as two dichotomies
                 three-category variable     Aged 60+ or not    Aged 40-59 or not
60+                        1                        1                   0
40-59                      2                        0                   1
16-39                      3                        0                   0
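The same recoding can be carried out in pandas with get_dummies, as in the minimal sketch below; the age-group values are invented, and the column for the chosen base category is simply dropped.

import pandas as pd

age_group = pd.Series(["60+", "40-59", "16-39", "40-59", "60+"], name="age_group")

# One dichotomous (dummy) column per category
dummies = pd.get_dummies(age_group, prefix="age")
print(dummies)

# Using 16-39 as the base category: drop its column, leaving two dummy variables
print(dummies.drop(columns="age_16-39"))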
Choosing one category as a base effectively turns any polytomous variable into a
series of dichotomous variables known as dummy variables. Figure 4.9 shows how
the effect of a three-category explanatory variable on a dichotomous response
variable can be portrayed in a causal path model. Age group is represented by two
dummy variables. The effect of the first is denoted b1 and the effect of the second b2.
A line is drawn under which the base category of the explanatory variable is
noted; the fact that some young people are afraid of walking alone after dark
(path a)
reminds us that there are some factors influencing feeling very unsafe that this
particular model does not set out to explain.
Fig. 4.9. Causal path model of age group and feeling unsafe walking alone after dark
If we had instead examined the proportion feeling safe walking alone after dark, the d for the oldest age group would have been 0.84 - 0.93, or -0.09.
The magnitude of the effect would not have altered but the sign would have been
reversed.
Path b2 represents the effect of being in the middle age group on feeling very
unsafe walking alone after dark. We might expect this to be lower than the effect of
being in the oldest age group. It is. In fact, d = 0.07 - 0.07, or 0; the younger two age
groups are extremely similar in their fear of walking alone after dark. While the paths
b1 and b2 are the focus of our attention, it is also important to remember the other
factors which lead to people being afraid to walk alone after dark: age group is not a
complete determinant of who is fearful, since some in the youngest age group report
feeling very unsafe about walking alone after dark. Path a reminds us of this.
The value of path a is given by the proportion of cases in the base category of the
explanatory variable who fall in the non-base category of the response variable.
The quantified model is shown in figure 4.10. The model allows us to decompose
the proportion of older people who are fearful of walking alone after dark (0.16) into
a fitted component (0.07) and an effect (+0.09).
Total ? ? ? ? 1 200
Fig. 4.11. Feeling safe walking alone after dark by gender (hypothetical survey of 200
individuals)
Once we carry out the survey, let us imagine that we find that, in total, 20 individuals
(i.e. 0.1 of the sample) state that they feel very unsafe when walking alone after dark.
We therefore now have some more information that we can add to our table and this
is entered as the column marginals in figure 4.12 below.
If, in the population as a whole, the proportion of men who feel very unsafe
walking alone after dark is the same as the proportion of women who feel very
unsafe walking alone after dark, we would expect this to be reflected in our sample
survey. The expected proportions and frequencies would then be as shown in figure
4.13.
                   Very safe / fairly safe /
                   a bit unsafe                 Very unsafe            Total
                      p         N                  p        N          p       N
Male                0.95       95                0.05        5         1      100
Female              0.85       85                0.15       15         1      100
The chi-square statistic provides a way of making this comparison. The equation for chi-square is given below. In
practical terms we need to find the difference between the observed and expected
frequencies for each cell of the table. We then square this value before dividing it by
the expected frequency for that cell. Finally we sum these values over all the cells of
the table.
Fig. 4.15. The chi-square statistic: chi-square = Σ (O – E)² / E, where O is the observed frequency and E the expected frequency in each cell
For the previous example, the computational details are provided in figure 4.16.
The total chi-square value is calculated as 5.56. Although this provides a measure of
the difference between all the observed and expected values in the table, it must be
converted into a probability, using the chi-square distribution with the appropriate
degrees of freedom, before we can judge whether such a difference could easily have
arisen by chance.
Observed (O)   Expected (E)    O – E    (O – E)²    (O – E)²/E
     95             90            5        25          0.28
      5             10           –5        25          2.50
     85             90           –5        25          0.28
     15             10            5        25          2.50
                                            Total:     5.56
This probability is sometimes thought of as the likelihood that we will make what is
called a 'Type 1' error.
In some surveys, particularly where the sample size is small, we may obtain what
looks like an interesting difference between two groups, but find that the probability
associated with the chi- square is above the conventional cut-off of 0.05. It is in this
situation that we run the risk of making a 'Type 2' error.
Degrees of Freedom
A table with two rows and two columns is said to have one degree of freedom
because, once one cell is known (e.g. once we know how many women are afraid to
walk alone after dark), the values in the other cells can be calculated from the row
and column marginals. Similarly, a table with two columns and three rows is said to
have two degrees of freedom. In formal terms the number of degrees of freedom for a
table with r rows and c columns is given by the equation below:
Degrees of freedom(Df) = (r – 1) (c – 1)
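For a 2 x 2 table like the hypothetical gender example above, a sketch of this calculation with scipy is shown below; the observed frequencies are those of the example, and correction=False is passed so that the result matches the hand calculation (no continuity correction).

import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies: rows = male, female; columns = safe/a bit unsafe, very unsafe
observed = np.array([[95, 5],
                     [85, 15]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print("chi-square:", round(chi2, 2))        # about 5.56
print("p-value:   ", round(p, 4))
print("degrees of freedom:", dof)           # (2 - 1) * (2 - 1) = 1
print("expected frequencies:\n", expected)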
In this section, a new graphical method, the boxplot, will be presented which facilitates
comparisons between distributions, and the idea of an unusual data value will be
given more systematic treatment than previously.
Boxplots
Most people agree that it is important to display data well when communicating it
to others. Pictures are better at conveying the story line than numbers. However,
visual display also has a role that is less well appreciated in helping researchers
themselves understand their data and in forcing them to notice features that they did
not suspect. We have already looked at one pictorial representation of data, the
histogram. Its advantage was that it preserved a great deal of the numerical
information. For some purposes, however, it preserves too much.
The boxplot is a device for conveying the information in the five number
summaries economically and effectively. The important aspects of the distribution
are represented schematically as shown in figure 4.17.
The points beyond which the outliers fall (the inner fences) and the points beyond which the far outliers fall (the outer fences) are
identified; inner fences lie one step beyond the quartiles and outer fences lie two
steps beyond the quartiles.
The boxplot of unemployment in the East Midlands is shown in figure 4.18. It
contains the same data as figure 8.4.
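A minimal matplotlib sketch of a boxplot is given below; the unemployment rates are invented stand-ins, not the East Midlands data.

import matplotlib.pyplot as plt

# Invented unemployment rates for a set of areas (stand-in data)
rates = [3.2, 3.8, 4.1, 4.4, 4.6, 4.9, 5.0, 5.3, 5.7, 6.1, 6.4, 7.0, 9.8]

# matplotlib draws the box at the quartiles, a line at the median, whiskers out to
# the most extreme points within 1.5 midspreads of the box, and plots anything
# beyond that individually as outliers
plt.boxplot(rates, vert=False)
plt.xlabel('Unemployment rate (%)')
plt.show()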
Outliers
Some datasets contain points which are a lot higher or lower than the main body
of the data. These are called outliers. They are always points that require the data
analyst’s special attention. They are important and arise for one of four reasons:
1. They may just result from a fluke of the particular sample that was
drawn. The probability of this kind of fluke can be assessed by
traditional statistical tests, if sensible assumptions can be made about the
shape of the distribution.
2. They may arise through measurement or transcription errors, which can
occur in official statistics as well as anywhere else. We always want to
be alerted to such errors, so that they can be corrected, or so that the
points can be omitted from the analysis.
3. They may occur because the whole distribution is strongly skewed. In
this case they point to the need to transform the data. As we will see later,
transformations such as logging or squaring the values may remove these
outliers.
4. Most interesting of all, they may suggest that these particular data
points do not really belong substantively to the same data batch.
Fig. 4.19. Boxplots comparing girls' and boys' mathematics scores at age 11
In the example above, mathematics score is an interval level variable and we
therefore need a different statistical test to check whether the results are significant.
In this specific example we have two groups (boys and girls) defined by a
dichotomous variable and we are comparing them on an interval level variable
(mathematics score). In these circumstances the statistical test that we need to use is
called the T-test.
The T-Test
The t-test provides a measure of the difference between the means of two groups.
T-test Formula
The formula for a two-sample t-test where the samples are independent (as in the
example of boys' and girls' mathematics test scores) is

    t = (X̄1 – X̄2) / ( S_X1X2 × √(1/n1 + 1/n2) )

where X̄1 and X̄2 are the means of the two samples and S_X1X2 is known as the
pooled standard deviation, calculated as

    S_X1X2 = √[ ((n1 – 1) S²_X1 + (n2 – 1) S²_X2) / (n1 + n2 – 2) ]

Here S_X1 is the standard deviation of one sample and S_X2 is the standard deviation
of the other sample. In these formulae n1 is the sample size of the first sample and n2
is the sample size of the second sample. In simple terms, therefore, the size of the
t-statistic depends on the size of the difference between the two means, adjusted for the
amount of spread and the sample sizes of the two samples.
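A sketch of this calculation in Python is given below; the two score batches are invented, and scipy.stats.ttest_ind with equal_var=True uses exactly the pooled-variance formula above.

import numpy as np
from scipy.stats import ttest_ind

# Invented mathematics scores for two groups (not the actual boys/girls data)
girls = np.array([54, 61, 58, 65, 70, 59, 62, 68])
boys  = np.array([50, 57, 55, 63, 60, 52, 58, 61])

# equal_var=True gives the classic pooled two-sample t-test described above
t_stat, p_value = ttest_ind(girls, boys, equal_var=True)
print("t =", round(t_stat, 3), " p =", round(p_value, 3))

# The same t statistic by hand, using the pooled standard deviation
n1, n2 = len(girls), len(boys)
sp = np.sqrt(((n1 - 1) * girls.var(ddof=1) + (n2 - 1) * boys.var(ddof=1)) / (n1 + n2 - 2))
t_manual = (girls.mean() - boys.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))
print("t by hand =", round(t_manual, 3))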
Scatterplots
To depict the information about the values of two interval level variables at once,
each case is plotted on a graph known as a scatterplot. Visual
inspection of well-drawn scatterplots of paired data can be one of the most effective
ways of spotting important features of a relationship.
A scatterplot has two axes – a vertical axis, conventionally labeled Y and a
horizontal axis, labeled X. The variable that is thought of as a cause (the explanatory
variable) is placed on the X-axis and the variable that is thought of as an effect (the
response variable) is placed on the Y-axis. Each case is entered on the plot at the
point representing its X and Y values.
Scatterplots depict bivariate relationships. To show a third variable would require
a three-dimensional space, and to show four would be impossible.
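A minimal matplotlib sketch of such a plot is shown below; the values are invented stand-ins for the lone-parent and no-car percentages discussed next, with the explanatory variable on the X-axis and the response on the Y-axis.

import matplotlib.pyplot as plt

# Invented area-level percentages (explanatory: lone-parent households;
# response: households with no car or van)
lone_parent = [3.1, 4.5, 5.2, 6.8, 7.4, 8.9, 10.2, 11.5]
no_car      = [14.0, 17.5, 19.2, 24.1, 25.8, 30.3, 33.9, 37.4]

plt.scatter(lone_parent, no_car)
plt.xlabel('Households headed by a lone parent (%)')   # explanatory variable on X
plt.ylabel('Households with no car or van (%)')        # response variable on Y
plt.show()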
Lone Parents
The data in figure 4.21 relate to the percentage of households that are headed by a
lone parent and contain dependent children, and the percentage of households that
have no car or van.
Linear Relationships
Y = a + bX
Equations of this form always describe lines. In this equation, Y and X are the variables, and a and b are
coefficients that quantify any particular line; the figure shows this diagrammatically.
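As a quick illustration, numpy can recover the coefficients a and b for a batch of paired values by least squares; the data below are invented.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.9])      # roughly y = 1 + 2x with a little noise

b, a = np.polyfit(x, y, deg=1)                 # slope first, then intercept
print("intercept a =", round(a, 2), " slope b =", round(b, 2))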
Log Transformation
One method for transforming data or re-expressing the scale of measurement is to take the logarithm
of each data point. This keeps all the data points in the same order but stretches or shrinks the scale by
varying amounts at different points.
                  GNI per capita in      Log GNI per capita
                  2000 ($US)
Australia              20060                   4.3
Benin                    340                   2.53
Burundi                  120                   2.08
China                    930                   2.97
Czech Republic          5690                   3.76
Estonia                 4070                   3.61
Germany                25510                   4.41
Haiti                    490                   2.69
Israel                 17090                   4.23
Korea, Rep.             9790                   3.99
You will notice that all the GNI per capita figures between 100 and 1000 have
been transformed, by taking logs, to lie between 2 and 3 (e.g. Benin with a GNI per
capita of 340 has a log GNI per capita of 2.53 ). While all the data lying between
10,000 and 100,000 have been transformed to lie between 4 and 5 (e.g. Australia
with a GNI per capita of 20,060 has a log GNI per capita of 4.3 ). The higher values
have therefore been pulled down towards the centre of the batch, bringing the United
States and Germany into the main body of the data, and the bottom of the scale has
been stretched out correspondingly. The shape is now more symmetrical.
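The transformation in the table is simply the base-10 logarithm, as the short check below illustrates by reproducing the Benin, Australia, Burundi and Germany figures.

import numpy as np

gni = {"Benin": 340, "Australia": 20060, "Burundi": 120, "Germany": 25510}

for country, value in gni.items():
    # log10 pulls the large values down towards the centre of the batch
    print(f"{country:10s} GNI per capita {value:6d}  log10 = {np.log10(value):.2f}")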
Going up the ladder of powers corrects downward straggle, whereas going down
corrects upward straggle.
Figure 10.9 shows the effect of taking logs on the distribution of GNI in the
different country groups. Logging GNI per capita goes a long way towards holding
the midspreads constant by making them similar in size. This means that statements
can be made describing typical differences in wealth between the country groups
without needing to mention the differences in spread in the same breath. But, by
transforming, progress has also been made towards the first three goals: the batches
are more symmetrical and bell-shaped, and some of the outliers in the original batch
were not really unusual values, but merely a product of the upward straggle of the
raw numbers.
Fig. 4.28.
3. What are proportions, percentages and probabilities?
To express a variable in proportional terms, the number in each category is
divided by the total number of cases N. Percentages are proportions multiplied
by
100. Proportions and percentages are bounded numbers, in that they have a floor
of zero, below which they cannot go, and a ceiling of 1.0 and 100 respectively.
Proportions can be used descriptively to represent the relative size of different
subgroups in a population. But they can also be thought of as probabilities.
4. Define a contingency table
A contingency table does numerically what the three-dimensional bar chart
does graphically. A contingency table shows the distribution of each variable
conditional upon each category of the other. The categories of one of the
variables form the rows, and the categories of the other variable form the
columns. Each individual case is then tallied in the appropriate pigeonhole
depending on its value on both variables. The pigeonholes are given the more
scientific name cells, and the number of cases in each cell is called the cell
frequency. Each row and column can have a total presented at the right-hand end
and at the bottom respectively; these are called the marginals, and the univariate
distributions can be obtained from the marginal distributions.
5. What is a percentage table?
The common way to make contingency tables readable is to cast them in
percentage form. There are three different ways in which this can be done. The
table was constructed by dividing each cell frequency by its appropriate row
total. Tables that are constructed by percentaging the rows are usually read down
the columns. This is sometimes called an 'outflow' table.
6. What are the guidelines for a well-designed table?
(i) Reproducibility versus clarity
(ii) Labelling
(iii) Sources
(iv) Sample data
(v) Missing data
(vi) Opinion data
(vii) Layout
Fig. 4.29.
8. Define degrees of freedom.
A table with two rows and two columns is said to have one degree of freedom
because, once one cell is known (e.g. once we know how many women are afraid to
walk alone after dark), the values in the other cells can be calculated from the row
and column marginals. Similarly, a table with two columns and three rows is said to
have two degrees of freedom. In formal terms, the number of degrees of freedom for
a table with r rows and c columns is given by:
Degrees of freedom (Df) = (r – 1) (c – 1)
9. What is a box plot?
The method of summarizing a set of data measured on an interval scale is
called a box and whisker plot. These are widely used in data analysis. We use
this type of graph or graphical representation to learn about:
the shape of the distribution
its central value
its variability
10. What is an outlier?
In data analytics, outliers are values within a dataset that vary greatly from the
others: they are either much larger or significantly smaller. Outliers may indicate
a fluke of the particular sample, measurement or transcription errors, a strongly
skewed distribution, or cases that do not really belong to the same data batch.
11. T-test Formula
The formula for a two-sample t-test where the samples are independent (as in the
example of boys' and girls' mathematics test scores) is

    t = (X̄1 – X̄2) / ( S_X1X2 × √(1/n1 + 1/n2) )

where X̄1 and X̄2 are the means of the two samples and S_X1X2 is known as the
pooled standard deviation and is calculated as follows:

    S_X1X2 = √[ ((n1 – 1) S²_X1 + (n2 – 1) S²_X2) / (n1 + n2 – 2) ]

Here S_X1 is the standard deviation of one sample and S_X2 is the standard deviation
of the other sample. In these formulae n1 is the sample size of the first sample and n2
is the sample size of the second sample. In simple terms, therefore, the size of the
t-statistic depends on the size of the difference between the two means, adjusted for the
amount of spread and the sample sizes of the two samples.
12. What are scatter plots?
Scatter plots are the graphs that present the relationship between two
variables in a data-set. It represents data points on a two-dimensional plane or on
a Cartesian system. The independent variable or attribute is plotted on the X-
axis, while the dependent variable is plotted on the Y-axis. These plots are often
called scatter graphs or scatter diagrams.
13. What is a resistant line?
We explore paired data where you suspect a relationship between x and y.
The focus here is on how to fit a line to the data in a 'resistant' fashion, so that the fit
is relatively insensitive to extreme points. The first step to fitting a line,
******************
UNIT V
MULTIVARIATE AND TIME
SERIES ANALYSIS
SYLLABUS
Assumption 1
X is causally prior to Y
There is nothing in the data to tell us whether X causes Y or Y causes X, so we
have to make the most plausible assumption we can, based on our knowledge of the
subject matter and our theoretical framework.
Assumption 2
The relationship between X and Y is not spurious.
In a randomized experiment we can be confident that no prior variable is responsible
for the covariation between X and Y because the only way in which the randomized
control groups are allowed to vary is in terms of X. No such assumption can be made
with non-experimental data.
Assumption 3
All variables intervening between X and Y have been controlled.
This assumption is not required before you can assume that there is a causal link
between X and Y, but it is required if you aim to understand how X is causing Y.
Let us first consider a hypothetical example drawn from the earlier discussion of
the causes of absenteeism. Suppose previous research had shown a positive bivariate
relationship between low social status jobs and absenteeism. The question arises: is
there something about such jobs that directly causes the people who do them to go
off sick more than others? Before we can draw such a conclusion, two assumptions
have to be made.
There are many possible outcomes once the relationship between all three
variables is considered at once, four of which are shown in figure 5.3.
Fig. 5.3. The effect of job status on absenteeism: controlling a prior variable
Simpson's Paradox
In some cases the relationship between two variables is not simply reduced when
a third, prior, variable is taken into account but indeed the direction of the
relationship is completely reversed. This is often known as Simpson's paradox
(named after Edward Simpson who wrote a paper describing the phenomenon that
was published by the Royal Statistical Society in 1951). However, the insight that a
third variable can be vitally important for understanding the relationship between two
other variables is also credited to Karl Pearson in the late nineteenth century.
Simpson's paradox can be succinctly summarized as follows: every statistical
relationship between two variables may be reversed by including additional factors in
the analysis.
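The sketch below constructs a small artificial dataset in pandas in which the direction of a two-variable comparison reverses once a third, prior variable is controlled; the counts are invented purely to exhibit the reversal.

import pandas as pd

# Invented counts: two treatments, split by a prior variable (case severity)
data = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "severity":  ["mild", "severe", "mild", "severe"],
    "successes": [81, 192, 234, 55],
    "cases":     [87, 263, 270, 80],
})

# Within each severity group, treatment A has the higher success rate ...
by_group = data.groupby(["severity", "treatment"]).sum(numeric_only=True)
print(by_group["successes"] / by_group["cases"])

# ... but pooled over both groups the ordering is reversed
overall = data.groupby("treatment").sum(numeric_only=True)
print(overall["successes"] / overall["cases"])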
The set of paths of causal influence, both direct and indirect, that we want to
begin to consider are represented in figure 12.5. In this causal model we are trying to
explain social trust, the base is therefore the belief that 'You can't be too careful'. The
base categories selected for the explanatory variables are having lower levels of
qualifications and not being a member of a voluntary organization, to try and avoid
negative paths. Each arrow linking two variables in a causal path diagram represents
the direct effect of one variable upon the other, controlling all other relevant
variables. The rule for identifying the relevant variables was given in chapter 11:
when we are assessing the direct effect of one variable upon another, any third
variable which is likely to be causally connected to both variables and prior to one of
them should be controlled. Coefficient b in figure 12.5 shows the direct effect of
being in a voluntary association on the belief that most people can be trusted. To find
its value, we focus attention on the proportion who say that most people can be
trusted, controlling for level of qualifications.
Fig. 5.9. Social trust by membership of voluntary association and level of qualifications:
causal path diagram
The following section describes the conceptual foundations that underlie
models to examine the factors influencing a simple dichotomous (two-category)
variable.
It is important to distinguish longitudinal data from the time series data. Although
time series data can provide us with a picture of aggregate change, it is only
longitudinal data that can provide evidence of change at the level of the individual.
Time series data could perhaps be understood as a series of snapshots of society,
whereas longitudinal research entails following the same group of individuals over
time and linking information about those individuals from one time point to another.
For example, in a study such as the British Household Panel Survey, individuals
are interviewed each year about a range of topics including income, political
preferences and voting. This makes it possible to link data about individuals over
time and examine, for example, how an individual's income may rise (or fall) year on
year and how their political preferences may change.
The first part provides a brief introduction to longitudinal research design and
focuses on some of the issues in collecting longitudinal data and problems of
attrition. The second part then provides a brief conceptual introduction to the analysis
of longitudinal data.
Time series data is a collection of quantities that are assembled over even
intervals in time and ordered chronologically. The time interval at which data is
collected is generally referred to as the time series frequency.
Typical characteristics of time-series data:
Structured data
No updates on data once recorded
Single data source
A smaller read/write ratio
For example, if you want to analyze data regarding millennial customers, but your dataset includes older
generations, you might remove those irrelevant observations. This can make analysis
more efficient and minimize distraction from your primary target, as well as
creating a more manageable and more performant dataset.
A time series is a series of data points indexed in time order. If you index the
dataset by date, you can easily carry out a time series analysis.
Let's create a variable named date with the start date and end date.
In [2]: date = pd.date_range(start="2018-01-01", end="2018-12-31", freq="BM")
        ts = pd.Series(np.random.randn(len(date)), index=date)
        ts
Out [3]: 2018-01-31 – 1.977003
2018-02-28 – 0.339459
2018-03-30 – 0.587687
2018-04-30 1.141997
2018-05-31 – 0.125199
2018-06-29 – 1.090406
2018-07-31 – 0.435640
2018-08-31 0.181651
2018-09-28 – 2.518869
2018-10-31 1.428868
2018-11-30 – 0.357551
2018-12-31 0.612771
In [4]: ts.index
In [6]: fb = pd.read_csv("FB.csv")
You can find this dataset here. Let's see the first 5 rows of the dataset with the
head method.
In [7]: fb.head()
In [8]: fb.dtypes
As you can see, the type of the date column is an object. Let's convert this date
column to the DateTime type. To do this, I'm going to use the parse_dates parameter
when reading the dataset.
In [9]: fb = pd.read_csv("FB.csv", parse_dates=["Date"])
Let's convert the date column into the index with the index_col parameter.
In [10]: fb = pd.read_csv("FB.csv",
                          parse_dates=["Date"],
                          index_col="Date")
In [11]: fb.index
In [12]: fb.head()
In [13]: fb["2019-06"]
It is very useful to convert the dates into the DatetimeIndex structure. For
example, you can easily select the values for June 2019.
In [14]: fb["2019-06"].Close.mean()
Out[14]: 181.27450025000002
In [16]: t = pd.to_datetime("7/22/2019")
         t
In [17]: fb.loc[fb.index >= t, :]
Dating a Dataset
To perform a time series analysis, you need to assign date values. To show this,
I'm going to use a dataset without dates. Let's read this dataset.
In [18]: fb1 = pd.read_csv("FB-no-date.csv", sep=";")
Let's have a look at the first rows of the dataset.
In [19]: fb1.head()
Notice that there is no date column in the dataset. Let's add a date column to this
dataset. To do this, let me generate a date with the date_range function. I'm going to
use the start, end, and freq parameters. Here, B represents business days.
In [20]: dates = pd.date_range(start="03/01/2019",
                               end="03/29/2019",
                               freq="B")
         dates
Out [20]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',
                         '2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
                         '2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
                         '2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
                         '2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
                         '2019-03-29'],
                        dtype='datetime64[ns]', freq='B')
Now, let's assign this created date variable to the dataset as an index.
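The cell that performs this assignment is not reproduced in the text; a minimal way to do it, assuming the fb1 data frame and the dates variable created above, is:

fb1.index = dates    # use the generated business-day dates as the index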
In [22]: fb1.head()
Out [22]:            Open        High         Low       Close   Adj Close
2019-03-01     162.600006  163.130005  161.690002  162.279999  162.279999
2019-03-04     163.899994  167.500000  163.830002  167.369995  167.369995
2019-03-05     167.369995  171.880005  166.550003  171.259995  171.259995
2019-03-06     172.899994  173.570007  171.270004  172.509995  172.509995
2019-03-07     171.500000  171.740005  167.610001  169.130005  169.130005
As you can see, working days have been added to the dataset. Let's look at the
index of the dataset.
Since the dataset is indexed with time, you can easily work with time series.
In [23]: fb1.index
Out [23]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',
                         '2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
                         '2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
                         '2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
                         '2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
                         '2019-03-29'],
                        dtype='datetime64[ns]', freq='B')
Let's draw a graph showing closing prices. First, I'm going to use the
%matplotlib inline magic command so that the graph is displayed inline in the notebook.
In [25]: %matplotlib inline
         fb1.Close.plot()
Out[25]: <AxesSubplot:>
Fig. 5.11.
Now let's change the frequency of the series with the asfreq method; for example, we
can upsample to hourly frequency, forward-filling the new rows with the pad method.
In [26]: fb1.asfreq("H", method="pad").head()
Out [26]:                        Open        High         Low       Close   Adj Close
2019-03-01 00:00:00        162.600006  163.130005  161.690002  162.279999  162.279999
2019-03-01 01:00:00        162.600006  163.130005  161.690002  162.279999  162.279999
2019-03-01 02:00:00        162.600006  163.130005  161.690002  162.279999  162.279999
2019-03-01 03:00:00        162.600006  163.130005  161.690002  162.279999  162.279999
2019-03-01 04:00:00        162.600006  163.130005  161.690002  162.279999  162.279999
In [27]: fb1.asfreq("W", method="pad")
Out [27]: (weekly rows of Open, High, Low, Close and Adj Close, padded from the most
recent business day)
In [29]: z = pd.date_range(start="3/1/2019",
                           periods=60, freq="B")
         z
In [30]: z = pd.date_range
         z
In [31]: ts = pd.Series(
             np.random.randint(1, 10, len(z)), index=z)
         ts.head()
Group Time Series (GTS) reports contain raw or aggregated data for a group of
resources over a particular reporting period.
Raw data can be displayed for daily and weekly reporting periods only.
Aggregated data can be displayed for any reporting period, but different reporting
periods support different granularity values.
Reports on GTS
Spatial aggregation - Spatial aggregation is the aggregation of all data
points for a group of resources over a specified period (the granularity).
Data aggregations in Group Time Series reports are of the spatial
aggregation type.
Sum of Average Reports - A Sum of Average (sumOfAvg) report is an
extension of the Group Time Series report. It calculates two data points for
each granularity period.
While dealing with time-series data analysis, we often need to combine data into certain
intervals, such as each day, week, or month.
We will solve these using only 2 Pandas APIs, i.e. resample() and GroupBy().
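As a minimal sketch of the two approaches, the lines below build a small random daily series and aggregate it to weekly means once with resample() and once with groupby() plus pd.Grouper; the column name and frequency are illustrative choices, not the parking dataset used below.

import numpy as np
import pandas as pd

# A small invented daily series
idx = pd.date_range("2016-10-04", periods=30, freq="D")
df = pd.DataFrame({"Occupancy": np.random.randint(400, 900, len(idx))}, index=idx)

# Weekly means via resample() ...
weekly_resample = df.resample("W").mean()

# ... and the same aggregation via groupby() with pd.Grouper
weekly_groupby = df.groupby(pd.Grouper(freq="W")).mean()

print(weekly_resample.equals(weekly_groupby))   # True: the two give identical results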
Resampling
Resampling is used for frequency conversion of time series, for example changing
daily data to monthly or weekly data, or vice versa. For this, pandas provides the
resample option. If the data frame is not already indexed by dates, the resample
function also accepts an 'on' parameter naming the column to resample on, but that
column must be datetime-like.
df.resample('W', on='LastUpdated').mean()
Below, resampling with option 'D' changes the data to daily frequency, i.e. every
date is taken into account: 375,717 records are downsampled to 77 records.
df3.resample("D").mean()  # daily option
              Occupancy
LastUpdated
2016-10-04   655.543651
2016-10-05   655.185185
2016-10-06   636.942130
2016-10-07   576.282407
2016-10-08   428.036232
...                 ...
2016-12-15   736.445110
2016-12-16   675.021073
2016-12-17   726.115385
2016-12-18   613.589583
2016-12-19   844.256410
77 rows × 1 columns
The resample option is used in two ways, i.e. upsampling and downsampling.
Upsampling: In this, we resample to a shorter time frame, for example monthly
data to weekly, biweekly or daily data. Because of this, many bins are created with NaN
values, and to fill these there are different methods that can be used, such as the pad
method and the bfill method. For example, changing weekly data to daily data and
using the bfill method, the following results are obtained; bfill fills the new missing
values in the resampled data backward:
dd.resample('D').bfill()[:15]
The other method is the pad method, which forward-fills the values.
We can also use the asfreq() or fillna() methods in upsampling.
Downsampling: In this, we resample to a wider time frame, for example
resampling daily data to weekly, biweekly or monthly data. For this we have options like
sum(), mean(), max() etc. For example, the daily data is resampled to month-start data
and the mean function is used as below:
df3.resample("MS").mean()[:]
              Occupancy
LastUpdated
2016-10-01   600.6633861
2016-11-01   637.142419
2016-12-01   714.497266
Fig. 5.12.
Raw data can be displayed for daily and weekly reporting periods only. Aggregated data can be displayed for
any reporting period, but different reporting periods support different granularity
values.
9. List the features of GTS.
Near Real Time (NRT) data points. NRT data is raw data collected during
the current hour that has not yet been written to the database.
Access to all branches of a group hierarchy. Subelement groups are
organized within a tree structure. When a GTS report is deployed against a
particular group in a group tree, resources in that group and in groups at all
levels of the tree below it are included in the aggregation. If a particular
resource appears in multiple groups within the group tree, that resource is
included in the aggregation only once.
10. What is resampling?
While dealing with time-series data analysis, we often need to combine data into
certain intervals, such as each day, week, or month. We will solve these
using only 2 Pandas APIs, i.e. resample() and GroupBy().
The resample() function is used to resample time-series data. It is a convenience
method for frequency conversion and resampling of time series. The object must
have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or
datetime-like values must be passed to the on or level keyword.
******************
1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/
Power BI.
Program 1:
# importing the pandas package
import pandas as pd
# creating rows
hafeez = ['Hafeez', 19]
aslan = ['Aslan', 21]
kareem = ['Kareem', 18]
# pass those rows to the DataFrame
# passing columns as well
data_frame = pd.DataFrame([hafeez, aslan, kareem], columns = ['Name', 'Age'])
# displaying the DataFrame
print(data_frame)
Output
If you run the above program, you will get the following results.
Name Age
0 Hafeez 19
1 Aslan 21
2 Kareem 18
Program 2:
# importing the pyplot module to create graphs
import matplotlib.pyplot as plot
# importing pandas to read the CSV file
import pandas as pd
# importing the data using the pd.read_csv() method
data = pd.read_csv('CountryData.IND.csv')
# plotting the data (the plotting call is not shown in the source; a default line plot is assumed)
data.plot()
Output
If you run the above program, you will get the following results.
<matplotlib.axes._subplots.AxesSubplot at 0x25e363ea8d0>
2. Perform exploratory data analysis (EDA) on with datasets like email data
set. Export all your emails as a dataset, import them inside a pandas data
frame, visualize them and get different insights from the data.
Create a CSV file with only the required attributes:
import csv

# 'mbox' is assumed to be the mailbox object of exported emails loaded earlier
# (e.g. with the mailbox module)
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID']
        ])
The output of the preceding code is as follows:
subject     object
from        object
date        object
to          object
label       object
thread     float64
dtype: object
def plot_number_perdhour_per_year(df, ax, label=None, dt=1,
                                  smooth=False,
                                  weight_fun=None, **plot_kwargs):
    if weight_fun is None:
        weights = 1 / (np.ones_like(tod) * Ty * 365.25 / dt)
    else:
        weights = weight_fun(df)
    if smooth:
    ax.set_yticklabels([datetime.datetime.strptime(str(int(np.mod(ts, 24))), "%H").strftime("%I %p")
                        for ts in ax.get_yticks()])
3. Working with Numpy arrays, Pandas data frames, Basic plots using
Matplotlib.
Program 1:
import numpy as np
from matplotlib import pyplot as plt
x = np.arange(1,11)
y=2*x+5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x,y)
plt.show()
The above code should produce the following output −
Program 2:
import pandas as pd
import matplotlib.pyplot as plt
# creating a DataFrame with 2 columns
dataFrame = pd.DataFrame(
    {
        "Car": ['BMW', 'Lexus', 'Audi', 'Mustang', 'Bentley', 'Jaguar'],
        "Reg_Price": [2000, 2500, 2800, 3000, 3200, 3500],
    }
)
# plotting registration price by car
# (the closing lines are not shown in the source; this bar-chart completion is assumed)
dataFrame.plot(x="Car", y="Reg_Price", kind="bar")
plt.show()
Output
This will produce the following output −
4. Explore various variable and row filters in R for cleaning data. Apply
various plot features in R on sample data sets and visualize.
install.packages("data.table")     # Install data.table package
library("data.table")              # Load data.table package
We also create some example data.
dt_all <- data.table(x = rep(month.name[1:3], each = 3),
                     y = rep(c(1, 2, 3), times = 3),
                     z = rep(c(TRUE, FALSE, TRUE), each = 3))   # Create data.table
head(dt_all)
Table 1
x y z
1 January 1 TRUE
2 January 2 TRUE
3 January 3 TRUE
4 February 1 FALSE
5 February 2 FALSE
6 February 3 FALSE
Table 2
x y z
1 February 1 FALSE
2 February 2 FALSE
3 February 3 FALSE
Table 3
x y z
1 February 1 FALSE
Date Value
0 1991-07-01 3.526591
1 1991-08-01 3.180891
2 1991-09-01 3.252221
3 1991-10-01 3.611003
4 1991-11-01 3.565869
# Time series data source: fpp package in R
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date'], index_col='date')
# Draw Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16, 5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in
Australia from 1992 to 2008.')
).project(
    type='albersUsa'
).properties(
    width=900,
    height=500
).configure_view(
    stroke=None
)

alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(airports).mark_circle(size=9).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        tooltip='iata:N'
    )
).project(
    type='albersUsa'
).properties(
    width=900,
    height=500
).configure_view(
    stroke=None
)
df.columns
Out [4]: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                'pH', 'sulphates', 'alcohol', 'quality'],
               dtype='object')
df.head()
Out [5]:
   fixed    volatile  citric  residual  chlorides  free     total    density   pH    sulphates  alcohol  quality
   acidity  acidity   acid    sugar                sulfur   sulfur
                                                   dioxide  dioxide
0    7.0      0.27     0.36     20.7      0.045      45.0    170.0    1.0010   3.00    0.45       8.8       6
1    6.3      0.30     0.34      1.6      0.049      14.0    132.0    0.9940   3.30    0.49       9.5       6
2    8.1      0.28     0.40      6.9      0.050      30.0     97.0    0.9951   3.26    0.44      10.1       6
3    7.2      0.23     0.32      8.5      0.068      47.0    186.0    0.9956   3.19    0.40       9.9       6
4    7.2      0.23     0.32      8.5      0.068      47.0    186.0    0.9956   3.19    0.40       9.9       6
In [13]: sns.catplot(x='quality', data=df, kind='count')
9. Use a case study on a data set and apply the various EDA and visualization
techniques and present an analysis report.
import datetime
import math
import pandas as pd
import random
import radar
from faker import Faker
fake = Faker()
def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        # generate a random date in the range and a random price
        # (the loop body is not reproduced in full in the source; the two lines
        # below are an assumed reconstruction using the radar and random packages)
        date = radar.random_datetime(start=start, stop=end).strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.groupby(by='Date').mean()
    return df

A sample of the resulting data frame (mean price per date):

Date          Price
2019-08-01    999.598900
2019-08-02    957.870150
2019-08-04    978.674200
2019-08-05    963.380375
2019-08-06    978.092900
2019-08-07    987.847700
2019-08-08    952.669900
2019-08-10    973.929400
2019-08-13    971.485600
2019-08-14    977.036200
import matplotlib.pyplot as plt
******************