Data Cleaning
Part II
Zakaria KERKAOU
Zakaria.kerkaou@e-polytechnique.ma
Removing Duplicates
The DataFrame method duplicated returns a boolean Series
indicating whether each row is a duplicate (has been observed in a
previous row) or not.
Relatedly, drop_duplicates() returns a DataFrame containing only the rows
for which the duplicated array is False.
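The two methods can be sketched on a small example. The DataFrame below and its column names k1 and k2 are hypothetical values chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: the last row repeats the row just before it.
data = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

# Boolean Series: True where the row has already been seen in an earlier row.
is_dup = data.duplicated()

# DataFrame keeping only the rows where is_dup is False.
deduped = data.drop_duplicates()

print(is_dup)
print(deduped)
```

Only the final row is flagged, because duplicated marks a row only when an identical row appears earlier in the DataFrame.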
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in
an array, Series, or column in a DataFrame. Consider some hypothetical data
collected about various kinds of meat.
Suppose you wanted to add a column indicating the type of animal that each food came
from. We first write down a mapping of each distinct meat type to the kind of animal.
The map method on a Series accepts a function or a dict-like object containing a mapping,
so the mapping can be passed to map directly.

Alternatively, we could pass a function that does all the work.
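Both approaches can be sketched as follows. The food names, weights, and the meat_to_animal dictionary are hypothetical illustration data; some names are deliberately capitalized, so the values are lowercased before the lookup:

```python
import pandas as pd

# Hypothetical data about various kinds of meat.
data = pd.DataFrame({
    "food": ["bacon", "pulled pork", "Pastrami", "corned beef", "Bacon", "honey ham"],
    "ounces": [4.0, 3.0, 6.0, 7.5, 8.0, 5.0],
})

# Mapping of each distinct meat type to the kind of animal.
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
}

# Option 1: normalize the case, then pass the dict to map.
data["animal"] = data["food"].str.lower().map(meat_to_animal)

# Option 2: pass a function that does all the work.
animal_via_function = data["food"].map(lambda x: meat_to_animal[x.lower()])

print(data)
```

With the dict form, a value missing from the mapping becomes NaN, whereas the lambda form above would raise a KeyError for it.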
Replacing Values
The replace method provides a simple and flexible way to modify a subset
of values in an object.
Consider a Series in which some entries are equal to -999.

In some cases the -999 values might be sentinel values for missing data, which we would
want to replace with NA values that pandas understands.
To replace a value we can use replace, which produces a new Series (unless you pass
inplace=True).

To replace multiple values at once, pass a list of the values to replace followed by the
substitute value.
To use a different replacement for each value, pass a list of substitutes of the same
length, or simply pass the argument as a dictionary.
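The variants of replace described above can be sketched on one Series. The values are hypothetical, with -999 and -1000 standing in for sentinel codes:

```python
import numpy as np
import pandas as pd

# Hypothetical Series in which -999 and -1000 act as sentinel values.
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace a single value; a new Series is returned.
one = data.replace(-999, np.nan)

# Replace several values with the same substitute.
many = data.replace([-999, -1000], np.nan)

# A different substitute for each value: list form and dict form.
per_list = data.replace([-999, -1000], [np.nan, 0])
per_dict = data.replace({-999: np.nan, -1000: 0})

print(per_dict)
```

The original Series is left unchanged in every call here, because inplace=True is not passed.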
Discretization and Binning
For analysis, continuous data is frequently discretized or otherwise divided into "bins."
Consider gathering information from a study about a group of individuals and categorizing
them according to distinct age ranges.

These can be separated into four groups: ages 18 to 25, 26 to 35, 36 to 60, and 61 and
older.

To do so, we can use cut, a function in pandas.


The object pandas returns is a special Categorical object. The output you
see describes the bins computed by pandas.cut. You can treat it like an
array of strings indicating the bin name; internally it contains a categories
array specifying the distinct category names, along with a labeling for the
ages data in the codes attribute.
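A minimal sketch of the binning described above. The ages list is hypothetical, and the upper edge of 100 is an assumption standing in for "61 and older":

```python
import pandas as pd

# Hypothetical ages of the individuals in the study.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Bin edges for (18, 25], (25, 35], (35, 60], (60, 100].
# Intervals are closed on the right by default, so 25 falls in the first bin.
bins = [18, 25, 35, 60, 100]

age_categories = pd.cut(ages, bins)

print(age_categories)             # the Categorical object
print(age_categories.categories)  # the distinct bins
print(age_categories.codes)       # integer bin label for each age
```

Because the bins are open on the left, an age of exactly 18 would not fall into any bin with these edges; passing right=False to cut switches which side is closed.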
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data.
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value.

To select all rows having a value greater than 3 or less than -3, you can use the any
method on a boolean DataFrame.
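Both selections can be sketched as below. The shape of the DataFrame (1000 rows, 4 columns) and the random seed are arbitrary assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame of normally distributed data.
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Values in one column exceeding 3 in absolute value.
col = data[2]
col_outliers = col[col.abs() > 3]

# Rows in which at least one column exceeds 3 in absolute value.
outlier_mask = (data.abs() > 3).any(axis="columns")
outlier_rows = data[outlier_mask]

print(col_outliers)
print(outlier_rows)
```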
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do
using the numpy.random.permutation function. Calling permutation with the length
of the axis you want to permute produces an array of integers indicating the new
ordering.
The resulting array of integers (the sampler) can then be used in iloc-based indexing
or with the equivalent take method.
To select a random subset without replacement, you can use the sample method on
Series and DataFrame.

To generate a sample with replacement (to allow repeat choices), pass replace=True
to sample.
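The permutation and sampling steps can be sketched together. The DataFrame df and the Series choices hold hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 5 rows.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Array of integers giving a new random ordering of the 5 rows.
sampler = np.random.permutation(len(df))

# Two equivalent ways to reorder the rows.
permuted_iloc = df.iloc[sampler]
permuted_take = df.take(sampler)

# Random subset of 3 rows, without replacement.
subset = df.sample(n=3)

# Sample with replacement: more draws than there are values is allowed.
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)

print(permuted_iloc)
print(subset)
print(draws)
```

The output differs from run to run; sample accepts a random_state argument if reproducible draws are needed.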
Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is
converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns
containing only 1s and 0s. pandas has a get_dummies function for doing this, though devising
one yourself is not difficult.
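A minimal sketch of get_dummies on a hypothetical column with k = 3 distinct values. Passing dtype=int is an assumption made here so that the result shows 1s and 0s; recent pandas versions return booleans by default:

```python
import pandas as pd

# Hypothetical data: the "key" column has 3 distinct values.
df = pd.DataFrame({
    "key": ["b", "b", "a", "c", "a", "b"],
    "data1": range(6),
})

# One indicator column per distinct value of "key".
dummies = pd.get_dummies(df["key"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column matching that row's original value.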
String manipulation
String built-in methods
Python has long been a popular raw data manipulation language in part due to
its ease of use for string and text processing. Most text operations are made
simple with the string object’s built-in methods. For more complex pattern
matching and text manipulations, regular expressions may be needed. pandas
adds to the mix by enabling you to apply string and regular expressions
concisely on whole arrays of data, additionally handling the annoyance of
missing data.
In many string munging and scripting applications, built-in string methods are
sufficient. As an example, a comma-separated string can be broken into pieces
with split.

These substrings could be concatenated together with a two-colon delimiter
using addition.
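A short sketch of both steps; the string val is a hypothetical example:

```python
# Hypothetical comma-separated string.
val = "a,b,guido"

# Break the string into pieces at each comma.
pieces = val.split(",")

# Concatenate the pieces with a two-colon delimiter using addition.
first, second, third = pieces
joined = first + "::" + second + "::" + third

print(pieces)
print(joined)
```

The built-in join method gives the same result without unpacking the pieces: "::".join(pieces).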
Regular expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string formed
according to the regular expression language. Python’s built-in re module is responsible for
applying regular expressions to strings.
The findall Function
The findall function returns all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned in the order found:
re.findall(pattern, string, flags=0)

The regular expression pattern and the target string are the mandatory arguments; flags is
optional.

pattern: the regular expression pattern to search for in the target string.

string: the target string (i.e., the string we want to search).

flags: by default, no flags are applied. There are many regex flags we can use. For example,
re.I performs case-insensitive matching.
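A minimal sketch of findall, with and without a flag. The sample strings are hypothetical:

```python
import re

# Hypothetical text to scan.
text = "3 apples, 12 pears and 105 plums"

# All runs of one or more digits, in the order found.
numbers = re.findall(r"\d+", text)

# Case-insensitive matching with the re.I flag.
fruit = re.findall(r"apple", "Apple pie, apple tart", flags=re.I)

print(numbers)
print(fruit)
```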
The search Function
The search function scans through string and returns a match object for the first location
where the pattern matches, or None if there is no match.
Here is the syntax for this function:
re.search(pattern, string, flags=0)

The regular expression pattern and the target string are the mandatory arguments; flags is
optional.

pattern: the regular expression pattern to search for in the target string.

string: the target string (i.e., the string we want to search).

flags: by default, no flags are applied. There are many regex flags we can use. For example,
re.I performs case-insensitive matching.
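A minimal sketch of search on a hypothetical string, showing the match object and the None result:

```python
import re

# Hypothetical text containing two numbers.
text = "Order 66 shipped, order 67 pending"

# Only the first occurrence is reported.
match = re.search(r"\d+", text)
first_number = match.group()   # the matched substring
position = match.span()        # (start, end) indices of the match

# No match: search returns None rather than raising an error.
missing = re.search(r"xyz", text)

print(first_number, position, missing)
```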
The split Function
The split function splits the string at each occurrence of the regex pattern, returning a list
containing the resulting substrings.
Here is the syntax for this function:
re.split(pattern, string, maxsplit=0, flags=0)

The regular expression pattern and the target string are the mandatory arguments; maxsplit and
flags are optional.

pattern: the regular expression pattern used for splitting the target string.

string: the target string (i.e., the string we want to split).

maxsplit: the maximum number of splits to perform. The default, 0, means no limit. If maxsplit
is 2, at most two splits occur, and the remainder of the string is returned as the final element
of the list.

flags: by default, no flags are applied. There are many regex flags we can use. For example,
re.I performs case-insensitive matching.
Example:

When you call re.split(r'\s+', text), the regular expression \s+, which matches one or more
whitespace characters (for ASCII text, \s is equivalent to [ \t\n\r\f\v]), is first compiled.
Then its split method is called on the passed text.
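The call described above can be sketched as follows. The string text is hypothetical and mixes spaces and tabs:

```python
import re

# Hypothetical text with a variable amount of whitespace between words.
text = "foo    bar\t baz  \tqux"

# Split on every run of one or more whitespace characters.
parts = re.split(r"\s+", text)

# At most two splits; the remainder stays in the final element.
limited = re.split(r"\s+", text, maxsplit=2)

# Compiling the pattern once and reusing it gives the same result.
regex = re.compile(r"\s+")
compiled_parts = regex.split(text)

print(parts)
print(limited)
```

Compiling explicitly with re.compile is mainly useful when the same expression is applied to many strings.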
Fuzzy matching
FuzzyWuzzy is a Python library for string matching. Fuzzy string
matching is the process of finding strings that approximately match a given pattern.
It uses the Levenshtein distance to calculate the differences between sequences.

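The Levenshtein distance lev(a, b) between two strings a and b is defined recursively:
lev(a, b) = |a| if b is empty
lev(a, b) = |b| if a is empty
lev(a, b) = lev(tail(a), tail(b)) if a[0] = b[0]
lev(a, b) = 1 + min(lev(tail(a), b), lev(a, tail(b)), lev(tail(a), tail(b))) otherwise
where the tail of some string x is a string of all but the first character of x, and x[n] is the n-th
character of the string x, counting from 0.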
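The recursive definition of the Levenshtein distance can be sketched directly in Python. This is an illustration of the definition, not the implementation used inside FuzzyWuzzy; the function name lev and the memoization with lru_cache are choices made here:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Levenshtein distance between a and b, following the recursive definition."""
    if not b:
        return len(a)          # delete every remaining character of a
    if not a:
        return len(b)          # insert every remaining character of b
    if a[0] == b[0]:
        return lev(a[1:], b[1:])   # first characters agree: compare the tails
    return 1 + min(
        lev(a[1:], b),         # deletion
        lev(a, b[1:]),         # insertion
        lev(a[1:], b[1:]),     # substitution
    )


print(lev("kitten", "sitting"))
```

Here a[1:] plays the role of the tail of a. The cache avoids recomputing the same pair of suffixes, which keeps the recursion practical for short strings.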
In most cases the necessary library is not installed by default, so we first need to use
pip, the package installer for Python, to install it from the Python Package
Index or another index. We can then import the fuzzywuzzy library.
FuzzyWuzzy returns a ratio between 0 and 100 given two strings. The closer the ratio is to 100,
the smaller the edit distance between the two strings. Here, we're going to get the
ten strings from our list of cities that are closest to "south korea".
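A minimal sketch of this lookup, assuming the package has been installed (for example with pip install fuzzywuzzy). The list names is hypothetical and stands in for the real list, and the choice of fuzz.token_sort_ratio as the scorer is an assumption; the import is guarded so that the script still runs when the library is missing:

```python
try:
    from fuzzywuzzy import fuzz, process
except ImportError:
    # Library not installed; install it with pip to run the lookup below.
    fuzz = process = None

# Hypothetical list of inconsistently typed names.
names = ["south korea", "southkorea", "s. korea", "north korea", "south africa", "japan"]

if process is not None:
    # Up to ten (name, score) pairs, best match first.
    matches = process.extract(
        "south korea", names, limit=10, scorer=fuzz.token_sort_ratio
    )
    for name, score in matches:
        print(f"{score:3d}  {name}")
```

An exact match scores 100, and near misses such as "southkorea" score slightly lower, which makes them easy to review and merge by hand.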
