Data Cleaning With Python and Pandas
PID        ST_NUM  ST_NAME      OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH
100001000  104     ANGALLU      Y             3             1
100002000  197     MADANAPALLE  N             3             1.5
100003000          MADANAPALLE  N             n/a           1
100004000  201     TEMPLE       12            1             NaN
           203     TEMPLE       Y             3             2
100006000  207     TEMPLE       Y             NA            1
100007000  NA      KOTAKOTTA                  2             HURLEY
100008000  213     DNR          Y             --            1
100009000  215     DNR          Y             na            2
What are the features?
• ST_NUM: Street number
• ST_NAME: Street name
• OWN_OCCUPIED: Is the residence owner occupied
• NUM_BEDROOMS: Number of bedrooms
• NUM_BATH: Number of bathrooms
Non-Standard Missing Values
• Sometimes missing values appear in different formats.
• Let’s take a look at the “Number of Bedrooms” column to see what I mean.
• In this column, there are four missing values: n/a, NA, -- and na.
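One way to handle these non-standard markers is to list them when loading the data. The sketch below is an illustration only: it rebuilds the table inline as a CSV string (the variable names `raw_csv` and `missing_markers` are assumptions, not part of the original slides) and passes the extra markers to `pd.read_csv` through its `na_values` parameter.

```python
import io
import pandas as pd

# Inline copy of the property table from the slides (illustrative data).
raw_csv = """PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH
100001000,104,ANGALLU,Y,3,1
100002000,197,MADANAPALLE,N,3,1.5
100003000,,MADANAPALLE,N,n/a,1
100004000,201,TEMPLE,12,1,NaN
,203,TEMPLE,Y,3,2
100006000,207,TEMPLE,Y,NA,1
100007000,NA,KOTAKOTTA,,2,HURLEY
100008000,213,DNR,Y,--,1
100009000,215,DNR,Y,na,2
"""

# Default parsing: pandas already recognises some markers such as "NA",
# but "--" and lowercase "na" are kept as ordinary strings.
df_default = pd.read_csv(io.StringIO(raw_csv))
default_missing = int(df_default["NUM_BEDROOMS"].isnull().sum())

# Listing the extra markers makes pandas read every format as NaN.
missing_markers = ["n/a", "na", "--"]
df = pd.read_csv(io.StringIO(raw_csv), na_values=missing_markers)
bedrooms_missing = int(df["NUM_BEDROOMS"].isnull().sum())

print("missing with defaults:  ", default_missing)
print("missing with na_values: ", bedrooms_missing)
```

With the markers listed, all four non-standard entries in NUM_BEDROOMS are counted as missing.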
Remove missing values
pandas.DataFrame.dropna
Parameters:
axis {0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Only a single axis is allowed.
how {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at
least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
thresh int, optional
Require that many non-NA values. Cannot be combined with how.
inplace bool, default False
Whether to modify the DataFrame rather than creating a new one.
Returns:
DataFrame or None : DataFrame with NA entries dropped from it or None if inplace=True.
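The dropna parameters described above can be tried on a small frame. This is a minimal sketch on made-up values (the frame and the variable names are assumptions for illustration), showing `axis`, `how`, `thresh` and `inplace` one at a time.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0, np.nan],
    "NUM_BEDROOMS": [3.0, np.nan, 1.0, np.nan],
    "NUM_BATH": [1.0, 1.5, np.nan, np.nan],
})

rows_any = df.dropna()                # how='any' (default): keep only complete rows
rows_all = df.dropna(how="all")       # drop only rows where every value is NA
cols_any = df.dropna(axis="columns")  # drop every column that contains an NA
rows_thresh = df.dropna(thresh=2)     # keep rows with at least 2 non-NA values

# inplace=True modifies the frame itself and returns None.
df_copy = df.copy()
result = df_copy.dropna(inplace=True)

print(rows_any.index.tolist(), rows_all.index.tolist(), rows_thresh.index.tolist())
```

Because every column in this frame has at least one NA, `axis="columns"` removes all of them, which is why dropping rows is usually the gentler choice.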
Methods for filling missing values:
• fillna()
• bfill()
• ffill()
• replace()
• interpolate()
pandas.DataFrame.fillna
Fill NA/NaN values using the specified method.
df.fillna(value=values)
Replacing
• Use fillna() when you want to fill in missing values with a single value.
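As a rough illustration of fillna(), the sketch below fills a small made-up frame first with one value everywhere and then with a different value per column through a dict, as in `df.fillna(value=values)`. The fill values chosen here (0, the column median, 1.0) are assumptions for the example, not recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NUM_BEDROOMS": [3.0, np.nan, 1.0, np.nan],
    "NUM_BATH": [1.0, 1.5, np.nan, 2.0],
})

# Fill every missing cell with the same value.
filled_zero = df.fillna(0)

# Fill each column with its own value: the median for bedrooms, 1.0 for baths.
values = {"NUM_BEDROOMS": df["NUM_BEDROOMS"].median(), "NUM_BATH": 1.0}
filled_per_column = df.fillna(value=values)

print(filled_per_column)
```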
pandas.DataFrame.ffill
Fill NA/NaN values by propagating the last valid observation to next valid.
Parameters:
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which
will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns:
Series/DataFrame or None
Object with missing values filled or None if inplace=True
pandas.DataFrame.bfill
Fill NA/NaN values by using the next valid observation to fill the gap.
DataFrame.bfill(*, axis=None, inplace=False, limit=None, downcast=_NoDefault.no_default)
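To make the difference between ffill() and bfill() concrete, here is a small sketch on a made-up Series. It also uses the `limit` parameter from the signature above to cap how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward.
# The leading NaN stays, because nothing precedes it.
forward = s.ffill()

# Backward fill: pull the next valid value backward.
# The trailing NaN stays, because nothing follows it.
backward = s.bfill()

# limit=1 fills at most one consecutive NaN after each valid value.
forward_limited = s.ffill(limit=1)

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "ffill_limit_1": forward_limited}))
```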
Filling the Missing Values: Imputation
• With replace(), we can nest regular expressions inside a dictionary as well. Note that column names
(the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
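A nested dictionary with a regular expression might look like the sketch below. The outer key is a literal column name and the inner key is the pattern; the pattern itself and the sample values are assumptions chosen to match the placeholder strings seen in the bedrooms column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ST_NAME": ["ANGALLU", "MADANAPALLE", "TEMPLE", "DNR", "DNR"],
    "NUM_BEDROOMS": ["3", "n/a", "NA", "--", "na"],
})

# Outer key: column name (must be literal, not a regex).
# Inner key: regex matching whole-cell placeholders; inner value: replacement.
placeholder = r"^(n/a|na|NA|--)$"
cleaned = df.replace({"NUM_BEDROOMS": {placeholder: np.nan}}, regex=True)

# With the placeholders gone, the column can be converted to numbers.
bedrooms = pd.to_numeric(cleaned["NUM_BEDROOMS"])

print(cleaned)
```

Scoping the pattern to one column leaves the other columns untouched.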
interpolate() function
• The pandas DataFrame.interpolate() function fills NA values in a DataFrame or Series
with values estimated from the neighbouring data points.
Parameters:
method : {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’,
‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’,
‘pchip’, ‘akima’}
axis : 0 fill column-by-column and 1 fill row-by-row.
limit : Maximum number of consecutive NaNs to fill. Must be greater than 0.
limit_direction : {‘forward’, ‘backward’, ‘both’}, default ‘forward’. If limit is
specified, consecutive NaNs will be filled in this direction.
limit_area : {None, ‘inside’, ‘outside’}, default None
None : No fill restriction.
‘inside’ : Only fill NaNs surrounded by valid values (interpolate).
‘outside’ : Only fill NaNs outside valid values (extrapolate).
inplace : Update the NDFrame in place if possible.
downcast : Downcast dtypes if possible.
kwargs : keyword arguments to pass on to the interpolating function.
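The effect of `limit` and `limit_area` is easiest to see on a short Series. The following is a minimal sketch on made-up values, using the default linear method.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Default linear interpolation: the gap between 1 and 4 is filled evenly,
# and the trailing NaN is filled forward with the last valid value.
linear = s.interpolate()

# limit=1 fills at most one consecutive NaN in each gap.
limited = s.interpolate(limit=1)

# limit_area='inside' fills only NaNs surrounded by valid values,
# so the trailing NaN is left alone.
inside_only = s.interpolate(limit_area="inside")

print(pd.DataFrame({"original": s, "linear": linear,
                    "limit_1": limited, "inside": inside_only}))
```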