Chapter 5 - Data Exploration and Visualization With
Lecturer: XYZ
Jamhuriya University of Science & Technology (JUST)
Learning outcomes
By the end of this lecture, you will be able to:
• Describe core data analysis concepts, including dataframes and data exploration.
• Create and access the main data structures in Python, such as Series and dataframes.
• Perform exploratory data analysis in Python using the Pandas library.
• Build data visualizations with the Matplotlib library.
Source: https://www.dataquest.io/blog/what-is-data-science/
Types of variables
A variable is any characteristic or attribute that can be measured, either quantitatively or qualitatively.
Variable types: Numeric, Categorical
Pandas
Data analysis is normally performed on data stored in a tabular format, e.g., an Excel spreadsheet.
Each observation is recorded in a row and each of its attributes in a column (e.g., students and their grades in each assignment).
Pandas is a Python library for manipulating data in tabular format; it comes with the Anaconda Python distribution.
In Pandas, data manipulation can be much more varied, can be programmed more easily and can be performed more efficiently, which is critical in large-scale projects.
Pandas
• Pandas is a high-level library built on NumPy, providing
tools that make it easier to work with real-world data:
– load data from a variety of sources (e.g., CSV, JSON, SQL)
– update (add, modify, delete, etc.) data
– select subsets of the data
– group data by a certain criterion
– clean the data and handle missing values (NaNs)
– visualize the data using different plotting tools
– perform statistical analysis of the data
– export the data to other file formats or databases
• Pandas provides two main data structures, namely,
DataFrame (equivalent to a spreadsheet) and a Series
(equivalent to a column in a spreadsheet).
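A few of the tasks listed above can be sketched in a handful of lines. This is a minimal illustration only: the CSV text, the column names (student, group, score) and the threshold of 70 are hypothetical and do not come from the lecture.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """student,group,score
s1,A,60
s2,A,
s3,B,80
s4,B,90
"""

# Load: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Clean: replace the missing score with the mean of the known scores.
df["score"] = df["score"].fillna(df["score"].mean())

# Select a subset: rows whose score is above 70.
high = df[df["score"] > 70]

# Group: mean score per group.
group_means = df.groupby("group")["score"].mean()

# Export: to_csv returns the CSV text when no path is given.
exported = df.to_csv(index=False)

print(high)
print(group_means)
```

In practice the first argument to read_csv would usually be a file name, and to_csv would be given a path to write to.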
Pandas Series
A Series is just a column in a dataframe or spreadsheet.
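As a small sketch of how a Series might be created and accessed (the values, the index labels and the name "score" are illustrative assumptions):

```python
import pandas as pd

# Illustrative values; the index labels "a", "b", "c" are hypothetical.
scores = pd.Series([62.5, 79.0, 16.5], index=["a", "b", "c"], name="score")

print(scores["b"])     # access by label
print(scores.iloc[0])  # access by position
print(scores.mean())   # Series support NumPy-style aggregation
```

If no index is supplied, Pandas numbers the entries 0, 1, 2, and so on.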
Pandas Dataframe
A Pandas dataframe is a collection of Series (a 2D data structure with rows and columns, effectively a spreadsheet).
Let's create an example dataframe with the population and area values for some regions in Somalia.
Hands-on exercise 1
We can create dataframes by supplying dictionaries with
identical sets of keys as arguments to DataFrame()
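One possible reading of this, sketched below, is a dictionary of dictionaries in which the shared keys become the row labels. The region names and all figures are placeholders, not real statistics, and the variable names are hypothetical.

```python
import pandas as pd

# Placeholder figures only: these are NOT real population or area values.
population = {"Region A": 2_000_000, "Region B": 1_000_000, "Region C": 500_000}
area = {"Region A": 500, "Region B": 50_000, "Region C": 25_000}

# Each inner dictionary becomes a column; the shared keys become the index.
regions = pd.DataFrame({"population": population, "area": area})

print(regions)
```

DataFrame() also accepts a list of dictionaries with identical keys, in which case each dictionary becomes one row.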
Dataframe attributes
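The original slide content is not available here, so the following is only a sketch of some commonly used dataframe attributes, on a small dataframe with placeholder values (not real statistics).

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

print(regions.shape)    # (number of rows, number of columns)
print(regions.columns)  # column labels
print(regions.index)    # row labels
print(regions.dtypes)   # data type of each column
```

Attributes are accessed without parentheses because they are properties, not methods.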
Descriptive statistics
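A minimal sketch of computing descriptive statistics, assuming a small dataframe with placeholder values (not real statistics): describe() summarises every numeric column, and individual statistics are available as methods.

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# count, mean, std, min, quartiles and max for each numeric column.
summary = regions.describe()
print(summary)

# Individual statistics on a single column.
print(regions["population"].mean())
print(regions["area"].median())
```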
The logy=True argument scales the y values logarithmically. Without it, columns with very different scales would be hard to compare on the same plot.
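The plot this note refers to is not reproduced here, so the following is a sketch under assumptions: a bar chart of a dataframe with placeholder values whose two columns differ by orders of magnitude. The Agg backend is selected only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# logy=True puts the y-axis on a logarithmic scale.
ax = regions.plot(kind="bar", logy=True)
ax.set_ylabel("value (log scale)")
plt.tight_layout()
```

In an interactive session, plt.show() would display the figure.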
The extracted columns are of the Pandas Series type and can be stored in a different variable or processed separately.
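Column extraction might look like the sketch below, again on placeholder data (the names and figures are hypothetical).

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# A single column name returns a Series.
pop = regions["population"]
print(type(pop))
print(pop.max())

# A list of column names returns a dataframe.
both = regions[["population", "area"]]
print(type(both))
```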
Remember, this is a vectorised (element-wise) math operation, just like with NumPy arrays.
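The operation this note refers to is not shown here. As an assumed example, dividing one column by another computes a value for every row at once; the "density" column name and the data are hypothetical.

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# Element-wise division: no explicit loop over the rows is needed.
regions["density"] = regions["population"] / regions["area"]

print(regions)
```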
Such conditional extraction can be applied to any other dataframe column.
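A minimal sketch of conditional extraction with a boolean mask, assuming placeholder data and a hypothetical threshold of 20,000:

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# The comparison yields a Series of True/False values, one per row.
mask = regions["area"] > 20_000

# Keep only the rows where the mask is True.
large = regions[mask]

# Combine the mask with a column selection using .loc.
large_pops = regions.loc[mask, "population"]

print(large)
print(large_pops)
```

Several conditions can be combined with & (and) and | (or), each wrapped in parentheses.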
Conditional updating
This populates the entire new column with the single value ‘low’.
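The step described above can be sketched as follows. The column names ("density", "density_level"), the data and the threshold of 100 are hypothetical; the second step, overwriting selected rows with .loc, is an assumed follow-up rather than something the slide text confirms.

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)
regions["density"] = regions["population"] / regions["area"]

# Assigning a scalar fills every row of the new column with that value.
regions["density_level"] = "low"

# Assumed follow-up: overwrite only the rows that meet a condition.
regions.loc[regions["density"] > 100, "density_level"] = "high"

print(regions)
```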
Conditional updating
We can also use the apply() function to apply an operation to every row or every column of a dataframe.
apply() takes a custom function as an argument; the custom function receives one row or one column at a time and can return a modified row or column.
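As a sketch of apply() with a custom function, the example below labels each row. The function name size_label, the "size" column, the threshold and the data are all hypothetical. Here the custom function returns a single value per row, which is a simpler variant than returning a modified row.

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)


def size_label(row):
    """Return a label for one row, based on a hypothetical area threshold."""
    return "large" if row["area"] > 20_000 else "small"


# axis=1 passes one row at a time to size_label.
regions["size"] = regions.apply(size_label, axis=1)

print(regions)
```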
Conditional updating
The axis argument controls what apply() passes to the custom function: axis=0 (the default) passes one column at a time, and axis=1 passes one row at a time.
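The effect of the axis argument can be checked on a small example. The data below are placeholders, and the lambda functions are only for illustration.

```python
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# axis=0: the function receives each column, giving one result per column.
col_max = regions.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row, giving one result per row.
row_max = regions.apply(lambda row: row.max(), axis=1)

print(col_max)
print(row_max)
```

The result of axis=0 is indexed by column name, while the result of axis=1 is indexed by row label.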
Box plot elements (top to bottom): Maximum, Q3, Median, Q1, Minimum. The interquartile range (IQR) spans Q1 to Q3.
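The quantities marked on a box plot can also be computed directly. This sketch uses a made-up Series of the integers 1 to 9 and the default (linear interpolation) quantile method.

```python
import pandas as pd

# Made-up values chosen so the quartiles are easy to verify by hand.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)      # first quartile
median = values.quantile(0.50)  # second quartile (median)
q3 = values.quantile(0.75)      # third quartile
iqr = q3 - q1                   # interquartile range

print(values.min(), q1, median, q3, values.max(), iqr)
```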
The two variables (columns) could be drawn on the same plot, but they have been plotted separately because their scales differ.
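One way to draw separate box plots, sketched here with Matplotlib subplots and placeholder data, is to give each column its own axes so that each gets its own y scale. The Agg backend is selected only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data; region names and figures are hypothetical.
regions = pd.DataFrame(
    {"population": [2_000_000, 1_000_000, 500_000], "area": [500, 50_000, 25_000]},
    index=["Region A", "Region B", "Region C"],
)

# One row, two columns of axes; each box plot has an independent y-axis.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].boxplot(regions["population"].to_numpy())
axes[0].set_title("population")
axes[1].boxplot(regions["area"].to_numpy())
axes[1].set_title("area")
fig.tight_layout()
```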
Hands-on exercise 2
Suppose you have this data in a dictionary:
exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
Create a dataframe from this data and retrieve the following subsets of data:
1. The first three rows
2. The following three rows
3. The score for 'Mohamed'
4. The scores of all students who qualify and who made just one
attempt.
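One possible approach to the exercise is sketched below; it is not the only valid solution, and the variable names are hypothetical. It reads "the following three rows" as rows 4 to 6.

```python
import pandas as pd

exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
df = pd.DataFrame(exam_data)

# 1. The first three rows (positions 0, 1, 2).
first_three = df.iloc[:3]

# 2. The following three rows (positions 3, 4, 5).
next_three = df.iloc[3:6]

# 3. The score for 'Mohamed' (a Series with one entry).
mohamed_score = df.loc[df['name'] == 'Mohamed', 'score']

# 4. Scores of students who qualify and made exactly one attempt.
one_attempt = df.loc[(df['qualify'] == 'yes') & (df['attempts'] == 1), 'score']

print(first_three, next_three, mohamed_score, one_attempt, sep="\n\n")
```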
Matplotlib
• https://matplotlib.org/tutorials/index.html
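As a starting point before working through the tutorials, a minimal Matplotlib line plot might look like this. The data and labels are made up, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook

import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Create a figure with a single set of axes and draw a line with markers.
fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A minimal line plot")
```

In an interactive session, plt.show() would display the figure; fig.savefig() with a file name would write it to disk.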