Lecture 6

COSC 3021: Machine Learning
Lecture 8
Data Wrangling & Summarization
COSC-3107 Machine Learning
Shahzad Hussain
Lecturer
Today’s Lecture Outline

2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization
Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information Technology
v. Imputing Missing Values
Imputing Missing Values

• Missing values can cause all sorts of problems in machine learning and data science use cases.
• Not only can they cause problems for algorithms, they can also distort calculations and even final outcomes.
• Missing values also risk being interpreted in non-standard ways, leading to confusion and more errors.
• One of the easiest ways of handling missing values is to ignore them or remove them from the dataset altogether.
• When the dataset is fairly large and still contains enough samples of each required type, this option can be exercised safely.
• In the following snippet, we use the dropna() function from pandas to remove rows where the date of the transaction is missing:
print("Drop Rows with missing dates::" )
df_dropped = df.dropna(subset=['date'])
print("Shape::",df_dropped.shape)
Dataframe without any missing date information


• In many scenarios, missing values are imputed with the help of other values in the dataframe.
• One commonly used trick is to replace missing values with a measure of central tendency such as the mean or median.
• We use the fillna() method from pandas to fill missing price values with the mean price from our dataframe.
• Along the same lines, we use the ffill() and bfill() functions to impute missing values for the user_type attribute.
• Because user_type is a string-type attribute, we use a proximity-based solution to handle its missing values.
• ffill() copies the value forward from the previous row (forward fill), while bfill() copies the value back from the next row (backward fill).
• Fill missing price values with the mean price.
• Fill missing user_type values with the value from the previous row (forward fill).
• Fill missing user_type values with the value from the next row (backward fill).
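The three fills listed above can be sketched as follows. This is a minimal illustration rather than the lecture's own snippet: the sample values and the new column names (price_filled, user_type_ffill, user_type_bfill) are assumptions, while price and user_type follow the running example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the price and user_type names come from the lecture.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "user_type": ["a", np.nan, "b", np.nan],
})

# Fill missing price values with the mean of the observed prices.
df["price_filled"] = df["price"].fillna(df["price"].mean())

# Forward fill: copy the value from the previous row.
df["user_type_ffill"] = df["user_type"].ffill()

# Backward fill: copy the value from the next row.
df["user_type_bfill"] = df["user_type"].bfill()

print(df)
```

Note that a backward fill leaves the last row missing here, because there is no following row to copy from; a forward fill has the same limitation for a missing first row.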
vi. Handling Duplicates

Handling Duplicates
• Another issue with many datasets is the presence of duplicates.
• To identify duplicates, pandas provides the duplicated() method, which can be applied to the whole dataframe or to a subset of its columns.
• We may handle duplicates by using duplicated() to locate them and then fixing the errors, or we may choose to drop the duplicate data points altogether.
• To drop duplicates, we use the drop_duplicates() method.
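A small sketch of both calls, assuming a hypothetical dataframe with a serial_no column (the column names and values are illustrative, not taken from the lecture's dataset):

```python
import pandas as pd

# Hypothetical sample data with one fully repeated row.
df = pd.DataFrame({
    "serial_no": [1001, 1002, 1002, 1003],
    "price": [10.0, 25.0, 25.0, 40.0],
})

# Boolean mask: True for rows that repeat an earlier row (all columns compared).
dup_mask = df.duplicated()

# Restrict the comparison to a subset of columns.
dup_serial_mask = df.duplicated(subset=["serial_no"])

# Keep the first occurrence of each row and drop the repeats.
df_unique = df.drop_duplicates()

print(df[dup_mask])
print("Shape after dropping duplicates:", df_unique.shape)
```

Inspecting df[dup_mask] before dropping anything is a reasonable habit, since a repeated row may point to an upstream error worth fixing rather than data to discard.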

vii. Handling Categorical Data

Handling Categorical Data
• The attribute user_type is a categorical variable that can take only a limited number of values from the allowed set {a, b, c, d}.
• With pandas, we can handle categorical variables in a couple of different ways.
• The first method uses the map() function to map each value from the allowed set to a numeric value.
• The second method converts the categorical variable into indicator variables using the get_dummies() function.

• Method I: map each value of user_type to a numeric value with map().
• Method II: convert user_type into indicator variables with get_dummies().
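Both methods might look like the sketch below. The particular numeric codes and the encoded_user_type column name are assumptions chosen for illustration; only user_type and its allowed set {a, b, c, d} come from the slides.

```python
import pandas as pd

# Hypothetical sample of the user_type attribute.
df = pd.DataFrame({"user_type": ["a", "b", "c", "d", "a"]})

# Method I: map each allowed value to a numeric code (codes chosen arbitrarily).
type_codes = {"a": 0, "b": 1, "c": 2, "d": 3}
df["encoded_user_type"] = df["user_type"].map(type_codes)

# Method II: build one indicator column per category.
indicators = pd.get_dummies(df["user_type"], prefix="user_type")

print(df)
print(indicators)
```

The numeric codes from map() imply an ordering (a < b < c < d) that may not exist in the data, whereas indicator variables treat the categories as unordered at the cost of extra columns.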
viii. Normalizing Values

Normalizing Values
• Attribute normalization is the process of standardizing the range of values of attributes.
• Many machine learning algorithms rely on distance metrics, so attributes or features with different scales or ranges can adversely affect the calculations or bias the outcomes.
• Normalization is also called feature scaling. There are various ways of scaling or normalizing features, including rescaling, standardization (zero mean and unit variance), and unit scaling.
• We may choose a normalization technique based on the feature, algorithm, and use case at hand.
Normalizing Values
• Unscaled price values compared with the normalized price values, which have been scaled to the range [0, 1].
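As a rough sketch, rescaling to [0, 1] and standardization can both be written with plain pandas arithmetic. The slides do not show which tool produced the scaled prices, so the approach, sample values, and new column names here are assumptions.

```python
import pandas as pd

# Hypothetical unscaled prices.
df = pd.DataFrame({"price": [100.0, 250.0, 400.0, 1000.0]})
price = df["price"]

# Rescaling (min-max): (x - min) / (max - min) maps the values onto [0, 1].
df["price_minmax"] = (price - price.min()) / (price.max() - price.min())

# Standardization: subtract the mean and divide by the population
# standard deviation (ddof=0) to get zero mean and unit variance.
df["price_standardized"] = (price - price.mean()) / price.std(ddof=0)

print(df)
```

Min-max rescaling is sensitive to outliers, since a single extreme price stretches the range and compresses every other value toward 0.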
ix. String Manipulation

String Manipulation
• Raw data presents all sorts of issues and complexities before it can be used for analysis.
• Strings are another class of raw data that needs special attention and treatment before our algorithms can make sense of them.
• String data representing natural language is highly noisy and requires its own set of wrangling steps.
• String data usually undergoes wrangling steps such as:
• Tokenization: splitting string data into constituent units, for example splitting sentences into words or words into characters.
• Stemming and lemmatization: normalization methods that bring words into their root or canonical forms. Stemming is a heuristic process for reaching the root form, whereas lemmatization uses rules of grammar and vocabulary to derive the root.
• Stopword removal: text contains words that occur at high frequency yet convey little information (punctuation, conjunctions, and so on). These words and phrases are usually removed to reduce dimensionality and noise in the data.
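The steps above can be illustrated with a toy pipeline in plain Python. Everything here is an assumption made for illustration: the stopword list is a tiny hand-written one, and naive_stem is a crude suffix stripper standing in for a real stemming algorithm. Lemmatization is not attempted, since it needs a vocabulary and grammar rules.

```python
import re

# Tiny hand-written stopword list for illustration; real lists are much longer.
STOPWORDS = {"the", "and", "are", "is", "a", "of", "in", "to", "over"}


def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Drop high-frequency words that carry little information."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Toy heuristic: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The dogs are jumping over the fences, and the cat jumped."
tokens = tokenize(text)
content_tokens = remove_stopwords(tokens)
stems = [naive_stem(t) for t in content_tokens]

print(tokens)
print(content_tokens)
print(stems)
```

A heuristic like this will mis-stem many words (for example, irregular forms), which is the trade-off the slide describes between stemming and lemmatization.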
3. Summarization
Summarization
• Data summarization refers to the process of preparing a compact representation of the raw data at hand.
• This process involves aggregating data using different statistical, mathematical, and other methods.
• Summarization is helpful for visualization, for compressing raw data, and for better understanding its attributes.
• The pandas library provides various powerful summarization techniques to suit different requirements.
Summarization
• Calculate the mean price for all transactions by user_type.
• Count the number of transactions per week.
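One way these two summaries could be computed is sketched below. The sample rows are made up, and grouping by ISO week number via dt.isocalendar() is an assumption about how "per week" is defined; the lecture's own snippet is not shown in the slides.

```python
import pandas as pd

# Hypothetical transactions; date, user_type and price follow the running example.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-12", "2016-01-13"]),
    "user_type": ["a", "b", "a", "b"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# Mean price for all transactions, grouped by user_type.
mean_price_by_type = df.groupby("user_type")["price"].mean()

# Number of transactions per ISO calendar week.
week = df["date"].dt.isocalendar().week
weekly_counts = df.groupby(week).size()

print(mean_price_by_type)
print(weekly_counts)
```

If the data spans more than one year, grouping by week number alone would merge the same week from different years, so the year would need to be part of the grouping key as well.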
Summarization
• Generate a tabular output representing the sum of quantities purchased by each user_class.
• The output is generated as follows.
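A minimal sketch of such a table, assuming a quantity_purchased column and made-up user_class labels (neither the column name nor the labels are confirmed by the slides):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "loyal_existing", "existing"],
    "quantity_purchased": [1, 4, 2, 10, 3],
})

# Sum of quantities purchased by each user_class.
quantity_by_class = df.groupby("user_class")["quantity_purchased"].sum()

# to_frame() turns the resulting Series into a one-column table for display.
print(quantity_by_class.to_frame())
```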
Summarization
• Variant 1: Here we apply three different aggregations on the quantity purchased, grouped by user_class.
• Variant 2: Here we apply three different aggregations on quantity
Summarization
• Variant 3: Here we apply three different aggregations on quantity
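Variant 1 can be sketched as passing a list of aggregation names to agg(). The descriptions of the other two variants are cut off in the slides, so the second and third forms below are only other ways pandas allows the same aggregations to be expressed, not necessarily the lecture's variants. The quantity_purchased column name, the choice of sum, mean, and count, and the sample data are all assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "existing"],
    "quantity_purchased": [1, 4, 3, 6],
})
grouped = df.groupby("user_class")["quantity_purchased"]

# List of aggregation names: one output column per aggregation.
by_list = grouped.agg(["sum", "mean", "count"])

# Named aggregation: keyword arguments choose the output column names.
by_name = grouped.agg(total="sum", average="mean", n_rows="count")

# Dictionary keyed by column name, applied to the grouped dataframe;
# the result has two-level column labels (column, aggregation).
by_dict = df.groupby("user_class").agg(
    {"quantity_purchased": ["sum", "mean", "count"]}
)

print(by_list)
print(by_name)
print(by_dict)
```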
Today’s Lecture Summary
2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
viii. Normalization
3. Summarization
