Lecture 6
Lecture 8
Data Wrangling & Summarization
COSC-3107 Machine Learning
Shahzad Hussain
Lecturer
2 Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information Technology
v. Imputing Missing Values
Imputing Missing Values
• In the following snippet, we use the dropna() function from pandas
to remove rows of data where the date of transaction is missing:
# Keep only the rows whose 'date' value is present
print("Drop Rows with missing dates::")
df_dropped = df.dropna(subset=['date'])
print("Shape::", df_dropped.shape)
• Along the same lines, we use the ffill() and bfill() functions to
impute missing values for the user_type attribute.
• ffill() copies the value from the previous row forward (forward
fill), while bfill() copies the value from the next row backward
(backward fill).
Imputing Missing Values
• Fill missing price values with the mean price.
• Fill missing user_type values with the value from the previous row (forward fill).
• Fill missing user_type values with the value from the next row (backward fill).
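The three fills listed above can be sketched as follows. This is a minimal illustration rather than the lecture's own code: the sample DataFrame and its values are made up, and only the column names price and user_type come from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow the slides, values are invented.
df = pd.DataFrame({
    "date": ["2016-01-01", None, "2016-01-03", "2016-01-04"],
    "price": [10.0, np.nan, 30.0, 20.0],
    "user_type": ["a", np.nan, "b", np.nan],
})

# Fill missing price values with the mean of the observed prices.
df["price_filled"] = df["price"].fillna(df["price"].mean())

# Forward fill: copy the value from the previous row.
df["user_type_ffill"] = df["user_type"].ffill()

# Backward fill: copy the value from the next row.
df["user_type_bfill"] = df["user_type"].bfill()

print(df)
```

Note that backward fill leaves the last row missing here, because there is no next row to copy from; forward fill has the same limitation for a missing first row.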
Handling Categorical Data
• The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.
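As a rough sketch of the get_dummies() approach, the snippet below converts a categorical column into indicator columns. The user_type column name is borrowed from the earlier slides, and the category values "a", "b" and "c" are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical categorical column; the values are invented for this example.
df = pd.DataFrame({"user_type": ["a", "b", "c", "a"]})

# One indicator column is created per distinct category value.
dummies = pd.get_dummies(df, columns=["user_type"], prefix="user_type")

print(dummies)
```

Each row has exactly one indicator set, the one matching its original category.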
• Normalization is also called feature scaling. There are
various ways of scaling or normalizing features, including
rescaling, standardization (zero mean and unit variance),
and unit scaling.
Normalizing Values
• Unscaled price values compared with the normalized price values,
which have been scaled to the range [0, 1]
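A minimal sketch of rescaling to [0, 1] (min-max scaling) and of standardization, written with plain pandas arithmetic. The price values are made up, and the lecture may have used a library scaler instead; this only illustrates the underlying formulas.

```python
import pandas as pd

# Hypothetical price values, invented for this example.
prices = pd.Series([100.0, 250.0, 400.0, 1000.0], name="price")

# Rescaling (min-max): (x - min) / (max - min) maps values into [0, 1].
price_range = prices.max() - prices.min()
prices_scaled = (prices - prices.min()) / price_range

# Standardization: (x - mean) / std gives zero mean and unit variance.
# ddof=0 uses the population standard deviation.
prices_standardized = (prices - prices.mean()) / prices.std(ddof=0)

print(pd.DataFrame({
    "price": prices,
    "scaled": prices_scaled,
    "standardized": prices_standardized,
}))
```

The smallest price maps to 0 and the largest to 1 under min-max scaling, so a single extreme value compresses all the others toward 0.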
ix. String Manipulation
String Manipulation
• Raw data presents all sorts of issues and complexities before it can be
used for analysis.
• Strings are another class of raw data which needs special attention and
treatment before our algorithms can make sense out of them.
• String data representing natural language is highly noisy and requires its
own set of steps for wrangling.
• String data usually undergoes wrangling steps such as:
• Tokenization: Splitting of string data into constituent units. For example,
splitting sentences into words or words into characters.
• Stemming and lemmatization: These are normalization methods to
bring words into their root or canonical forms. While stemming is a
heuristic process to achieve the root form, lemmatization utilizes rules of
grammar and vocabulary to derive the root.
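The wrangling steps above can be sketched with the standard library alone. The stemmer below is a deliberately naive suffix stripper and the lemma table is a tiny hand-written lookup; both are hypothetical stand-ins for illustration, not real stemming or lemmatization algorithms such as those provided by NLP libraries.

```python
import re

def tokenize(text):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def naive_stem(word):
    """Toy heuristic stemmer: strip one common suffix if enough letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary lookup standing in for a real lemmatizer.
LEMMA_TABLE = {"mice": "mouse", "ran": "run", "better": "good"}

def naive_lemmatize(word):
    """Return the dictionary form if the word is in the lookup table."""
    return LEMMA_TABLE.get(word, word)

sentence = "The mice were running and jumped over boxes"
tokens = tokenize(sentence)
stems = [naive_stem(t) for t in tokens]
lemmas = [naive_lemmatize(t) for t in tokens]

print(tokens)
print(stems)
print(lemmas)
```

The contrast shows up in the results: the heuristic stemmer turns "running" into "runn", which is not a real word, while the vocabulary-based lookup maps "mice" to "mouse", something suffix stripping cannot do.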
Summarization
• Data summarization refers to the process of
preparing a compact representation of raw data at
hand.
• This process involves aggregation of data using
different statistical, mathematical, and other
methods.
• Summarization is helpful for visualization.
Summarization
• We can generate a tabular output representing the sum of
quantities purchased by each user_class.
• The output is generated as follows.
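One way to produce such a table is a groupby followed by a sum. This is a sketch under assumptions: the column name quantity_purchased and all sample values are invented for illustration, and only user_class is taken from the slide.

```python
import pandas as pd

# Hypothetical transactions; 'quantity_purchased' is an assumed column name.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "loyal", "existing"],
    "quantity_purchased": [3, 5, 2, 7, 1],
})

# Sum of quantities purchased by each user_class.
summary = df.groupby("user_class")["quantity_purchased"].sum()

print(summary)
```

The result is a Series indexed by user_class, with one total per class.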
Summarization
• Variant 1: Here we apply three different aggregations on quantity
purchased which is grouped by user_class
• Variant 2: Here we apply three different aggregations on quantity
purchased which is grouped by user_class
Summarization
• Variant 3: Here we apply three different aggregations on quantity
purchased which is grouped by user_class
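The code for the three variants is not reproduced in these notes, so the sketch below shows three common pandas calling styles for applying several aggregations to a grouped column. They are not necessarily the same three variants used in the lecture; the column name quantity_purchased, the sample values, and the choice of sum, mean and count are assumptions.

```python
import pandas as pd

# Hypothetical transactions; 'quantity_purchased' is an assumed column name.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "loyal", "existing"],
    "quantity_purchased": [3, 5, 2, 7, 1],
})
grouped = df.groupby("user_class")["quantity_purchased"]

# Style 1: pass a list of aggregation names.
v1 = grouped.agg(["sum", "mean", "count"])

# Style 2: named aggregation, which also sets the output column names.
v2 = grouped.agg(total="sum", average="mean", n_orders="count")

# Style 3: a dict mapping each column to its list of aggregations.
v3 = df.groupby("user_class").agg({"quantity_purchased": ["sum", "mean", "count"]})

print(v1)
print(v2)
print(v3)
```

All three produce the same numbers; they differ in how the output columns are labeled. Style 3 returns two-level column labels, which is convenient when several columns are aggregated at once.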
Today’s Lecture Summary
2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization