Advanced Python Programming Data Science: The University of Sheffield
Edgar Iyasele
The University of Sheffield
All content following this page was uploaded by Edgar Iyasele on 21 February 2020.
- Capture Data
- Manage and Clean Data
- Data Analysis
- Report data
Requirements:
- CI6010a
- CI6010b
- CI6011a
Anaconda Python should be installed on your desktop; please start Spyder.
Unstructured:
• Data without an inherent structure.
Quasi-Structured:
• Textual data with an erratic format that can be structured with effort.
Semi-Structured:
• Textual data with an apparent pattern (including errors).
Structured:
• A defined data model (errors are less likely).
[Figure: the four categories placed on a spectrum with axes labelled Complexity and Flexibility]
Advantages:
Pandas presents data in a form well suited to data analysis.
The package contains multiple methods for convenient data filtering.
Pandas has a variety of utilities for performing input/output operations
in a seamless manner.
Constructing a DataFrame
import pandas as pd

d_1 = [1, 2, 3]
d_2 = {'header_1': [1, 2], 'header_2': [3, 4]}

df_1 = pd.DataFrame(d_1)  # DataFrame from a list
df_2 = pd.DataFrame(d_2)  # DataFrame from a dict of columns

print(df_1)
print(df_2)

# Output of print(df_1):
#    0   # header
# 0  1   # first row
# 1  2   # second row
# 2  3   # ...
df = pd.read_table('global_temp.txt', sep=' ')
print(df)
MANAGE DATA
Unwanted Observations
Remove Outliers
Irrelevant data
Duplicates
• Removing identical rows.
• Parameter keep: can be 'first', 'last', or False (with False, all rows with the
same values are treated as duplicates and removed).
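A small illustration of the keep parameter (the 'UID' column matches the later example; the rows are made up):

```python
import pandas as pd

# Hypothetical data in which UID 1 appears twice
df = pd.DataFrame({"UID": [1, 1, 2], "Sales": [10, 20, 30]})

first = df.drop_duplicates(subset="UID", keep="first")  # keeps rows 0 and 2
last = df.drop_duplicates(subset="UID", keep="last")    # keeps rows 1 and 2
none = df.drop_duplicates(subset="UID", keep=False)     # keeps row 2 only

print(len(first), len(last), len(none))  # 2 2 1
```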
• Dropping irrelevant columns
df = df.drop(["Sales"], axis=1)
• Dropping irrelevant rows
df = df.drop(["Johnson", "Smith"])
Finding outliers with the method .describe()

import pandas as pd
import matplotlib.pyplot as plt

d = {'Age': [23, 25, 24, 230, 22], 'IQ': [101, 98, 105, 99, 102]}  # illustrative data; 230 is an outlier
df = pd.DataFrame(d)
print(df.describe())

df.hist(['Age', 'IQ'])
plt.show()  # may be necessary after importing matplotlib
Outliers show unexpected behaviour: values far from the general population, nonsense values, a wrong distribution shape, etc.
Removing Outliers from the data
Remove outliers by dropping the affected rows, replacing their values one by
one, or introducing a threshold.
• Dropping column or row can be done by the method .drop() as
discussed before.
# Drop duplicates
df = df.drop_duplicates(subset='UID', keep='first')
import numpy as np
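A sketch of the threshold approach described above; the 'Age' column, the rows, and the cut-off of 100 are all illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [23.0, 25.0, 24.0, 230.0, 22.0]})  # 230 is an obvious outlier

# Option 1: drop rows beyond the threshold
df_drop = df[df["Age"] < 100]

# Option 2: replace outlying values with NaN for later treatment
df_nan = df.copy()
df_nan.loc[df_nan["Age"] >= 100, "Age"] = np.nan

print(df_drop["Age"].tolist())     # [23.0, 25.0, 24.0, 22.0]
print(df_nan["Age"].isna().sum())  # 1
```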
Sorting Data
• Sorting by some dimension alphabetically
or numerically, e.g. sorting by time or date.
• Ascending or Descending.
• Use the method .sort_values().
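A minimal sketch of .sort_values() (column names and rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Smith", "Johnson", "Lee"],
                   "Age": [34, 28, 41]})

# Sort alphabetically by name (ascending) or numerically by age (descending)
by_name = df.sort_values("Name")
by_age = df.sort_values("Age", ascending=False)

print(by_age["Name"].tolist())  # ['Lee', 'Smith', 'Johnson']
```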
Filtering Data by Using .iloc

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

• Select the first two rows of column C:
print(df.iloc[0:2, 2:3])
# Output:
#    C
# 0  7
# 1  8

• Select a column of the DataFrame:
print(df.iloc[:, 1])  # Output: 4, 5, 6

• Select a row:
print(df.iloc[2, :])  # Output: 3, 6, 9

• First 2 rows:
print(df.iloc[0:2, :])
print(df.iloc[:2, :])

• And so on…
CLEAN DATA
• Normalisation typically means rescaling the values into the range [0, 1].
• In most cases, normalising data eliminates the units of measurement,
enabling you to compare data from different places more easily.
x = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]

x_new = (x − x_min) / (x_max − x_min),  where x_min = 1 and x_max = 87

x_new = [0, 0.48, 0.74, 0.25, 0.03, 0.65, 1, 0.51, 0.51, 0.25]
Normalising a NumPy array, or normalising a column of a Pandas
DataFrame (normalise the column named "score" in DataFrame "df"):

import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x = np.array(raw_data)
x_new = (x - x.min()) / (x.max() - x.min())

df = pd.DataFrame({'score': raw_data})
df['score'] = (df['score'] - df['score'].min()) / \
              (df['score'].max() - df['score'].min())
Data normalisation example
• Data Standardisation:
Standardisation typically means rescaling data to have a mean of 0 and a
standard deviation of 1 (unit variance):

x_new = (x − μ) / σ
import numpy as np
import pandas as pd

raw_data = [1, 43, 65, 23, 4, 57, 87, 45, 45, 23]
x = np.array(raw_data)
x_new = (x - x.mean()) / x.std()

df = pd.DataFrame({'sc': raw_data})
df['sc'] = (df['sc'] - df['sc'].mean()) / df['sc'].std()
# Note: NumPy's .std() divides by n (ddof=0), while Pandas' .std()
# divides by n-1 (ddof=1) by default, so the results differ slightly.
EXPLORATORY DATA ANALYSIS
Aim:
Objectives:
Tools:
EDA typically relies heavily on visualising the data to assess patterns and
identify data characteristics that the analyst would not otherwise know to look
for.
Example database: Airline safety
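The snippets below assume df already holds the airline safety table (e.g. loaded with pd.read_csv from the downloaded file). A minimal stand-in, with column names taken from the slides but invented values, keeps them reproducible:

```python
import pandas as pd

# Stand-in for the airline-safety table; the rows below are illustrative,
# not the real figures.
df = pd.DataFrame({
    "airline": ["Aeroflot", "Air France", "Delta"],
    "avail_seat_km_per_week": [1.2e9, 3.0e9, 6.5e9],
    "incidents_85_99": [76, 14, 24],
    "incidents_00_14": [6, 6, 24],
})
print(df.shape)  # (3, 4)
```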
df.hist('incidents_85_99')
Univariate visualisation
df.hist()
The investigation starts.
Insights: somebody is flying a lot… one airline? An outlier? A connection?
Questions: Is my data reliable? Is it safer to fly today than before? And so on…
Download and load the airline safety database.
Standardise the column "avail_seat" and find the airline
that had more than 70 incidents between 1985 and
1999.
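One way to approach the exercise; the full column name and the sample rows below are assumptions, since the real data comes from the downloaded CSV:

```python
import pandas as pd

# Assumed stand-in rows for the airline-safety data
df = pd.DataFrame({
    "airline": ["Aeroflot", "Delta", "KLM"],
    "avail_seat_km_per_week": [1.2e9, 6.5e9, 1.9e9],
    "incidents_85_99": [76, 24, 7],
})

# Standardise the seat-kilometre column (mean 0, std 1)
col = "avail_seat_km_per_week"
df[col] = (df[col] - df[col].mean()) / df[col].std()

# Airlines with more than 70 incidents between 1985 and 1999
risky = df.loc[df["incidents_85_99"] > 70, "airline"]
print(risky.tolist())  # ['Aeroflot']
```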
import pandas as pd
df.plot.scatter('avail_seat_km_per_week', 'incidents_85_99')
• Use the .corr() method to find the correlation among the columns of the
DataFrame, using the 'pearson' method.
• Correlations are never lower than −1. A correlation of −1 indicates that
the data points in a scatter plot lie exactly on a straight descending line.
• A correlation of 0 means that two variables have no linear relation
whatsoever; however, some non-linear relation may still exist between them.
• Correlation coefficients are never higher than 1. A correlation
coefficient of 1 means that two variables are perfectly positively linearly related.
avail_seat_km_per_week incidents_85_99
avail_seat_km_per_week 1.000000 0.279538
incidents_85_99 0.279538 1.000000
fatal_accidents_85_99 0.468300 0.856991
incidents_00_14 0.725917 0.403009
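A small illustration of .corr() on made-up columns, where x and y are perfectly positively related and x and z perfectly negatively related:

```python
import pandas as pd

# Illustrative data: y = 2x (ascending line), z = 5 - x (descending line)
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

corr = df.corr(method="pearson")
print(corr.loc["x", "y"])  # 1.0  (perfect positive linear relation)
print(corr.loc["x", "z"])  # -1.0 (perfect negative linear relation)
```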
df.plot.scatter('incidents_85_99',
'incidents_00_14')
There seems to be a relationship, but is it significant?
Significant improvement.
DATA ANALYSIS
Turn insight and ideas into scientifically valid results.
# Keep rows with incidents_85_99 <= 10
df_l = df.mask(df["incidents_85_99"] > 10).dropna()
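The same filter can also be written with plain boolean indexing (the sample rows below are illustrative); unlike mask(...).dropna(), it does not also discard rows that happen to contain NaN in other columns:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["A", "B", "C"],
                   "incidents_85_99": [76, 8, 10]})

# Keep rows with at most 10 incidents; same result as mask(...).dropna()
df_l = df[df["incidents_85_99"] <= 10]
print(df_l["airline"].tolist())  # ['B', 'C']
```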