
Lecture 6


COSC 3021: Machine Learning

Lecture 8
Data Wrangling & Summarization
COSC-3107 Machine Learning

Shahzad Hussain
Lecturer

Today’s Lecture Outline


2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization

2 Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information Technology
v. Imputing Missing Values

Imputing Missing Values


• Missing values can lead to all sorts of problems when
dealing with Machine Learning and Data Science related
use cases.
• Not only can they cause problems for algorithms, they
can mess up calculations and even final outcomes.
• Missing values also risk being interpreted in
non-standard ways, leading to confusion and
more errors.

• One of the easiest ways of handling missing values is to
ignore or remove them altogether from the dataset.
• When the dataset is fairly large and we have enough
samples of various types required, this option can be
safely exercised.

Imputing Missing Values
• We use the dropna() method from pandas in the following
snippet to remove rows of data where the date of transaction
is missing:
# df is the transactions dataframe used throughout the lecture
print("Drop rows with missing dates:")
df_dropped = df.dropna(subset=['date'])
print("Shape:", df_dropped.shape)


Dataframe without any missing date information



Imputing Missing Values


• In many scenarios, missing values are imputed with the help of
other values in the dataframe.

• One commonly used trick is to replace missing values with a
central tendency measure such as the mean or median.

• We use the fillna() method from pandas to fill missing prices
with the mean price value from our dataframe.

• Along the same lines, we use the ffill() and bfill() functions to
impute missing values for the user_type attribute.

• Because user_type is a string attribute, we use a proximity-based
solution to handle missing values in this case.

• ffill() copies the value forward from the previous row (forward
fill), while bfill() copies the value back from the next row
(backward fill).

Imputing Missing Values
• Fill missing price values with the mean price:

• Fill missing user_type values with the value from the previous row (forward fill):

• Fill missing user_type values with the value from the next row (backward fill):
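
The three fills listed on this slide can be sketched as follows. This is a minimal illustration and not the lecture's original snippet: the column names price and user_type come from the slides, but the toy values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the slides.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "user_type": ["a", None, "b", None],
})

# Fill missing price values with the mean of the observed prices.
# mean() skips missing values by default.
mean_price = df["price"].mean()
df["price"] = df["price"].fillna(mean_price)

# Forward fill: each missing user_type takes the value of the previous row.
user_type_ffill = df["user_type"].ffill()

# Backward fill: each missing user_type takes the value of the next row.
user_type_bfill = df["user_type"].bfill()

print(df["price"].tolist())
print(user_type_ffill.tolist())
print(user_type_bfill.tolist())
```

Note that a backward fill leaves the last row missing when there is no later row to copy from, and a forward fill does the same for a missing first row.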


vi. Handling Duplicates


Handling Duplicates
• Another issue with many datasets is the presence of duplicates.

• To identify duplicates, we have a utility called duplicated() that
can be applied to the whole dataframe as well as to a subset of it.

• We may handle duplicates by using duplicated() to locate them and
fixing the underlying errors, although we may also choose to drop the
duplicate data points altogether.
• To drop duplicates, we use the method drop_duplicates().
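
A short sketch of both calls, assuming a small made-up dataframe. The column names serial_no, user_id and price are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the second one.
df = pd.DataFrame({
    "serial_no": [1000, 1001, 1001, 1002],
    "user_id": [1, 2, 2, 3],
    "price": [5.0, 7.5, 7.5, 9.0],
})

# Boolean mask over whole rows: True marks a repeat of an earlier row.
dup_mask = df.duplicated()

# The same check restricted to a subset of the columns.
dup_by_serial = df.duplicated(subset=["serial_no"])

# Drop the repeats, keeping the first occurrence of each row.
df_unique = df.drop_duplicates()

print(dup_mask.tolist())
print(df_unique.shape)
```

Inspecting df[dup_mask] before dropping is one way to decide whether the repeats are genuine errors or legitimate repeated transactions.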

vii. Handling Categorical Data


Handling Categorical Data
• The attribute user_type is a categorical variable that can
take only a limited number of values from the allowed set
{a,b,c,d}.

• With pandas, we can handle categorical variables in a
couple of different ways.

• The first method is using the map() function, where we
simply map each value from the allowed set to a numeric
value.

• The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.


Handling Categorical Data


• Method I: The first method is using the map() function,
where we simply map each value from the allowed set to a
numeric value.
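
A minimal sketch of Method I, assuming the allowed set {a,b,c,d} from the earlier slide. The particular numbers in the mapping and the new column name encoded_user_type are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical toy data using the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# Assumed encoding; any consistent value-to-number mapping would work.
type_map = {"a": 0, "b": 1, "c": 2, "d": 3}

# map() looks up each value in the dictionary.
df["encoded_user_type"] = df["user_type"].map(type_map)

print(df)
```

Values that do not appear in the dictionary come back as NaN, so it may be worth checking for missing values after the mapping.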

Handling Categorical Data
• Method II: The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.
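
A minimal sketch of the indicator-variable approach, again on invented toy values for user_type.

```python
import pandas as pd

# Hypothetical toy data using the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# One indicator column per category; each row is "on" in exactly one of them.
dummies = pd.get_dummies(df, columns=["user_type"])

print(dummies.columns.tolist())
print(dummies.astype(int))
```

Unlike the numeric mapping, indicator variables do not impose an ordering on the categories, at the cost of one extra column per category.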


viii. Normalizing Values


Normalizing Values
• Attribute normalization is the process of standardizing the
range of values of attributes.

• Many machine learning algorithms rely on distance
metrics, so attributes or features with different scales or
ranges can adversely affect the calculations or bias
the outcomes.

• Normalization is also called feature scaling. There are
various ways of scaling/normalizing features, including
rescaling, standardization (zero mean, unit
variance), and unit scaling.

• We may choose a normalization technique based upon the
feature, algorithm and use case at hand.
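
As one example of the techniques named above, standardization can be sketched with plain pandas arithmetic. The price values are invented, the column name price_standardized is hypothetical, and using the population standard deviation (ddof=0) is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy prices on a fairly wide scale.
df = pd.DataFrame({"price": [100.0, 250.0, 400.0, 1000.0]})

price_mean = df["price"].mean()
price_std = df["price"].std(ddof=0)  # population standard deviation (assumed)

# Standardization: subtract the mean, then divide by the standard deviation.
df["price_standardized"] = (df["price"] - price_mean) / price_std

print(df)
```

After this step the column has zero mean and unit variance, so it no longer dominates a distance computation merely because of its units.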


Normalizing Values
• Unscaled price values and the normalized price values
that have been scaled to a range of [0, 1]
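
Scaling to [0, 1] can be sketched with min-max rescaling written directly in pandas. This is an illustration on invented prices and not necessarily the method used to produce the slide's figure; the column name price_scaled is hypothetical.

```python
import pandas as pd

# Hypothetical toy prices.
df = pd.DataFrame({"price": [100.0, 250.0, 400.0, 1000.0]})

price_min = df["price"].min()
price_max = df["price"].max()

# Min-max rescaling: the smallest value maps to 0 and the largest to 1.
df["price_scaled"] = (df["price"] - price_min) / (price_max - price_min)

print(df)
```

The formula divides by zero when every value in the column is identical, so a constant column would need separate handling.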

ix. String Manipulation

String Manipulation
• Raw data presents all sorts of issues and complexities before it can be
used for analysis.
• Strings are another class of raw data which needs special attention and
treatment before our algorithms can make sense out of them.
• String data representing natural language is highly noisy and requires its
own set of steps for wrangling.
• String data usually undergoes wrangling steps such as:
• Tokenization: Splitting string data into constituent units. For example,
splitting sentences into words or words into characters.
• Stemming and lemmatization: These are normalization methods that
bring words into their root or canonical forms. While stemming is a
heuristic process to achieve the root form, lemmatization utilizes rules of
grammar and vocabulary to derive the root.
• Stopword removal: Text contains words that occur at high frequency
yet do not convey much information (punctuation, conjunctions, and so
on). These words/phrases are usually removed to reduce dimensionality
and noise in the data.
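
The wrangling steps above can be sketched in plain Python. Everything here is a simplification made for illustration: the stopword list is a tiny hypothetical one, and toy_stem is a crude suffix stripper standing in for a real stemming algorithm. Lemmatization is not shown because it needs a vocabulary and grammar rules.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "and", "are", "on", "a", "is"}

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(token):
    """Strip one common suffix if at least three letters remain (toy heuristic)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sentence = "The cats are jumping on the tables, and the dog barked."
tokens = tokenize(sentence)
content_tokens = remove_stopwords(tokens)
stems = [toy_stem(t) for t in content_tokens]

print(tokens)
print(content_tokens)
print(stems)
```

The heuristic nature of stemming shows up quickly with a stripper like this: a word such as "running" would become "runn", which is why practical work relies on established stemmers or lemmatizers.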
3. Summarization

Summarization
• Data summarization refers to the process of
preparing a compact representation of raw data at
hand.
• This process involves aggregation of data using
different statistical, mathematical, and other
methods.
• Summarization is helpful for visualization,
compressing raw data, and better understanding
of its attributes.

• The pandas library provides various powerful
summarization techniques to suit different
requirements.
Summarization
• Calculate the mean price for all transactions by user_type:

• Count the number of transactions per week:
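
Both summaries can be sketched with groupby() and value_counts(). The toy values are invented, and deriving the week from the date column via its ISO week number is an assumption of this sketch, since the slides do not show how the week is obtained.

```python
import pandas as pd

# Hypothetical toy transactions; only the column names follow the slides.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-12"]),
    "user_type": ["a", "b", "a"],
    "price": [10.0, 20.0, 40.0],
})

# Mean price of the transactions for each user_type.
mean_price_by_type = df.groupby("user_type")["price"].mean()

# ISO week number of each transaction (needs pandas 1.1 or later),
# then the number of transactions that fall in each week.
weeks = df["date"].dt.isocalendar().week
transactions_per_week = weeks.value_counts().sort_index()

print(mean_price_by_type)
print(transactions_per_week)
```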


Summarization
• We generate a tabular output representing the sum of
quantities purchased by each user_class.
• The output generated is as follows.
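
A sketch of this kind of grouped sum on invented data. The column name quantity_purchased and the user_class labels are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; labels and quantities are invented.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "loyal_existing"],
    "quantity_purchased": [2, 5, 4, 7],
})

# Total quantity purchased within each user_class.
total_quantity = df.groupby("user_class")["quantity_purchased"].sum()

print(total_quantity)
```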

Summarization
• Variant 1: Here we apply three different aggregations on the quantity
purchased, grouped by user_class

• Variant 2: Here we apply three different aggregations on the quantity
purchased, grouped by user_class


Summarization
• Variant 3: Here we apply three different aggregations on the quantity
purchased, grouped by user_class
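
The slides do not preserve how the three variants differ, so the following is only a sketch of three common ways pandas lets us apply several aggregations to one grouped column. The choice of sum, mean and count, the column name quantity_purchased, and the toy values are all assumptions.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "user_class": ["new", "existing", "new", "existing"],
    "quantity_purchased": [2, 5, 4, 1],
})

grouped = df.groupby("user_class")["quantity_purchased"]

# Style 1: pass a list of aggregation names for the selected column.
by_list = grouped.agg(["sum", "mean", "count"])

# Style 2: pass a dict that maps a column to its list of aggregations.
by_dict = df.groupby("user_class").agg(
    {"quantity_purchased": ["sum", "mean", "count"]}
)

# Style 3: named aggregation, which sets the output column names directly.
by_name = grouped.agg(total="sum", average="mean", n="count")

print(by_list)
print(by_dict)
print(by_name)
```

All three produce the same numbers. The dict form yields two-level column labels, while named aggregation gives flat, self-describing ones.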

Today’s Lecture Summary
2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization
