Chapter7 PDF
Cleaning
Part II
Zakaria KERKAOU
Zakaria.kerkaou@e-polytechnique.ma
Removing Duplicates
The DataFrame method duplicated returns a boolean Series
indicating whether each row is a duplicate (has been observed in a
previous row) or not.
Relatedly, drop_duplicates() returns a DataFrame containing only the
rows for which the duplicated array is False:
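As a minimal sketch of both methods, assuming a small made-up DataFrame (the column names k1 and k2 and the values are hypothetical, not taken from the slides):

```python
import pandas as pd

# Hypothetical data: the last row repeats the row just before it.
data = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one", "two", "two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

# True where the row has already been observed in a previous row.
is_dup = data.duplicated()
print(is_dup)

# Keep only the rows for which the duplicated flag is False.
deduped = data.drop_duplicates()
print(deduped)
```

By default both methods consider all columns and keep the first observed row.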
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in
an array, Series, or column in a DataFrame. Consider hypothetical data collected about
various kinds of meat.
Suppose you wanted to add a column indicating the type of animal that each food came
from. A first step is to write down a mapping of each distinct meat type to the kind of animal.
The map method on a Series accepts a function or dict-like object containing a mapping,
so we can either pass the mapping itself or pass a function that does all the work.
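The two options can be sketched as follows. The data values, the meat_to_animal dict, and the get_animal helper are all hypothetical, made up here for illustration:

```python
import pandas as pd

# Hypothetical data about various kinds of meat (values are made up).
data = pd.DataFrame({
    "food": ["bacon", "pulled pork", "Pastrami", "corned beef", "honey ham", "nova lox"],
    "ounces": [4.0, 3.0, 6.0, 7.5, 5.0, 6.0],
})

# Mapping of each distinct meat type to the kind of animal.
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon",
}

# Option 1: pass the dict-like mapping to map. The case is normalized first,
# because "Pastrami" is capitalized in the data but not in the mapping.
data["animal"] = data["food"].str.lower().map(meat_to_animal)

# Option 2: pass a function that does all the work.
def get_animal(food):
    return meat_to_animal[food.lower()]

animal_via_function = data["food"].map(get_animal)
print(data)
```

With the dict form, a value missing from the mapping becomes NaN, whereas this particular function would raise a KeyError.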
Replacing Values
The replace method provides a simple and flexible way to modify a subset
of values in an object.
Consider, as an example, a Series in which some values are -999.
In some cases the -999 values might be sentinel values for missing data. To replace these
with NA values that pandas understands, we can use replace, producing a new Series
(unless you pass inplace=True):
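A minimal sketch, assuming a made-up Series in which -999 and -1000 stand in for missing data:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with sentinel values.
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace one sentinel value; the original Series is left unchanged.
cleaned = data.replace(-999, np.nan)
print(cleaned)

# Replace several values at once by passing a list.
cleaned_both = data.replace([-999, -1000], np.nan)
print(cleaned_both)
```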
As an example of binning, a set of ages can be separated into four groups:
ages 18 to 25, 26 to 35, 36 to 60, and 61 and older.
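One way to build these groups is pandas.cut. The slides do not name the function at this point, so treat it as an assumption. The ages, the labels, and the upper edge of 100 for "61 and older" are hypothetical:

```python
import pandas as pd

# Hypothetical ages of people in a study.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Bin edges: (18, 25], (25, 35], (35, 60], (60, 100].
bins = [18, 25, 35, 60, 100]
labels = ["18-25", "26-35", "36-60", "61+"]

age_groups = pd.cut(ages, bins, labels=labels)
print(age_groups)

# Count how many ages fall in each group.
print(pd.Series(age_groups).value_counts(sort=False))
```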
To select all rows having a value exceeding 3 or -3 (that is, an absolute value
greater than 3), you can use the any method on a boolean DataFrame:
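A minimal sketch of this selection, assuming made-up normally distributed data (the shape and the seed are arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows, 4 columns of standard normal values.
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Boolean DataFrame: True where the absolute value exceeds 3.
is_extreme = data.abs() > 3

# Keep the rows in which at least one column is True.
outlier_rows = data[is_extreme.any(axis="columns")]
print(outlier_rows)
```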
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do
using the numpy.random.permutation function. Calling permutation with the length
of the axis you want to permute produces an array of integers indicating the new
ordering.
This array of integers (the sampler) can then be used in iloc-based indexing or
with the equivalent take method:
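A minimal sketch, assuming a small made-up DataFrame with five rows:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 5 rows and 4 columns.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Array of the integers 0..4 in random order: the new row ordering.
sampler = np.random.permutation(5)
print(sampler)

# Both calls reorder the rows according to sampler.
permuted_iloc = df.iloc[sampler]
permuted_take = df.take(sampler)
print(permuted_iloc)
```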
To select a random subset without replacement, you
can use the sample method on Series and
DataFrame:
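A minimal sketch of sample on both object types. The data is made up, and random_state is passed in one call only to make that draw repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame and Series.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
ser = pd.Series([5, 7, -1, 6, 4])

# Three distinct rows chosen at random (without replacement by default).
subset = df.sample(n=3)
print(subset)

# Three distinct values from the Series, with a fixed seed.
picked = ser.sample(n=3, random_state=0)
print(picked)
```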
Regular expressions
The search Function
This function searches for the first occurrence of the RE pattern within string, with optional
flags. It returns a match object, or None if nothing matches.
Here is the syntax for this function:
re.search(pattern, string, flags=0)
The regular expression pattern and target string are the mandatory arguments. The flags
argument is optional.
pattern: the regular expression pattern to search for.
string: The variable pointing to the target string (i.e., the string we want to search).
flags: By default, no flags are applied. There are many regex flags we can use. For example,
re.I is used for performing case-insensitive searching.
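A minimal sketch of re.search, assuming a made-up target string and a simplified email pattern (both are illustrative only):

```python
import re

# Hypothetical target string and a simplified email pattern.
text = "Contact: Dave dave@google.com, Steve steve@gmail.com"
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# re.I makes the uppercase character classes match lowercase letters too.
match = re.search(pattern, text, flags=re.I)
if match is not None:
    # Only the first occurrence is reported.
    print(match.group())
    print(match.start(), match.end())

# When nothing matches, search returns None.
no_match = re.search(pattern, "no address here", flags=re.I)
print(no_match)
```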
The split Function
The re.split function splits the string by the occurrences of the regex pattern, returning
a list containing the resulting substrings.
Here is the syntax for this function:
re.split(pattern, string, maxsplit=0, flags=0)
The regular expression pattern and target string are the mandatory arguments. The maxsplit and flags
arguments are optional.
pattern: the regular expression pattern used for splitting the target string.
string: The variable pointing to the target string (i.e., the string we want to split).
maxsplit: The maximum number of splits to perform. If maxsplit is 2, at most two splits occur, and the
remainder of the string is returned as the final element of the list. The default of 0 means no limit.
flags: By default, no flags are applied. There are many regex flags we can use. For example, re.I is
used for performing case-insensitive searching.
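The arguments listed above can be sketched as follows, assuming a made-up target string. maxsplit and flags are passed by keyword here, which keeps the call unambiguous:

```python
import re

# Hypothetical target string with "," and ";" as separators.
target = "apple, banana;cherry , date"

# Split on a comma or semicolon, swallowing surrounding whitespace.
parts = re.split(r"\s*[,;]\s*", target)
print(parts)    # ['apple', 'banana', 'cherry', 'date']

# At most two splits; the remainder stays in the final element.
limited = re.split(r"\s*[,;]\s*", target, maxsplit=2)
print(limited)  # ['apple', 'banana', 'cherry , date']

# re.I makes the separator match both "x" and "X".
case_insensitive = re.split(r"x", "1X2x3", flags=re.I)
print(case_insensitive)  # ['1', '2', '3']
```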
Example:
When you call re.split(r'\s+', text), the regular expression \s+ is first compiled. It matches
one or more whitespace characters (\s covers [ \t\n\r\f\v], among others). Then its split
method is called on the passed text.
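A minimal sketch of both forms, assuming a made-up string with irregular whitespace. Compiling the pattern yourself with re.compile gives a reusable object, which is convenient when the same expression is applied to many strings:

```python
import re

# Hypothetical text with a mix of spaces and tabs.
text = "foo    bar\t baz  \tqux"

# One-step form: the pattern is compiled, then split is applied.
pieces = re.split(r"\s+", text)
print(pieces)  # ['foo', 'bar', 'baz', 'qux']

# Two-step form: compile once, then call split on the compiled object.
regex = re.compile(r"\s+")
pieces_compiled = regex.split(text)
print(pieces_compiled)  # ['foo', 'bar', 'baz', 'qux']
```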
Fuzzy matching
FuzzyWuzzy is a Python library used for string matching. Fuzzy string
matching is the process of finding strings that approximately match a given
pattern. It uses the Levenshtein distance to calculate the differences between
sequences.
In the recursive definition of the Levenshtein distance, the tail of some string x is the string
of all but the first character of x, and x[n] is the n-th character of the string x, counting from 0.
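The formula itself did not survive in these slides. As a sketch, the standard recursive definition of the Levenshtein distance, written with the tail and x[0] notation described above, looks like this (the function name lev is chosen here for illustration):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Levenshtein distance between a and b, by the recursive definition."""
    # If one string is empty, the distance is the length of the other.
    if len(b) == 0:
        return len(a)
    if len(a) == 0:
        return len(b)
    # tail(x) is x without its first character.
    tail_a, tail_b = a[1:], b[1:]
    # Equal first characters cost nothing.
    if a[0] == b[0]:
        return lev(tail_a, tail_b)
    # Otherwise pay 1 for the cheapest of deletion, insertion, substitution.
    return 1 + min(lev(tail_a, b), lev(a, tail_b), lev(tail_a, tail_b))


print(lev("kitten", "sitting"))  # 3
```

The cache keeps the recursion from recomputing the same pair of suffixes. For long strings an iterative table would be the usual choice.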
In most cases the necessary library is not installed yet, so we first use pip,
the package installer for Python, which installs packages from the Python Package
Index and other indexes. We then import the fuzzywuzzy library.
FuzzyWuzzy returns a ratio given two strings. The closer the ratio is to 100, the
smaller the edit distance between the two strings. Here, we're going to get the
ten strings from our list of cities that are closest to "south korea".