
NumPy - Removing Missing Data



Removing Missing Data from Arrays

Removing missing data from arrays involves cleaning the dataset by eliminating entries that contain NaN or other indicators of missing values.

NaN is a special floating-point value used to denote undefined or unrepresentable results. It is important to address NaN values before performing any calculations, because NaN propagates through arithmetic: a sum or mean that includes even one NaN is itself NaN.
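As a minimal sketch of why this matters, the snippet below uses a small made-up array (the name values is only illustrative). It shows NaN propagating through a sum and a mean, and why an equality test cannot be used to find missing entries:

```python
import numpy as np

values = np.array([1.0, 2.5, np.nan, 4.7])

# Any reduction that touches a NaN returns NaN
total = np.sum(values)
mean = np.mean(values)

# NaN never compares equal to anything, including itself,
# so `values == np.nan` finds nothing; np.isnan() is needed instead
equal_check = values == np.nan

print(total)        # nan
print(mean)         # nan
print(equal_check)  # [False False False False]
```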

Removing Missing Data from 1D Arrays

Removing missing data from 1D arrays involves filtering out elements that are marked as missing, usually represented by NaN (Not a Number). In a 1D array, missing values are identified using the np.isnan() function, which returns a boolean array of the same shape in which each True corresponds to a NaN entry in the original array.

To remove these missing values, invert the mask with the ~ operator so that it selects the non-NaN entries: ~np.isnan(arr) is a boolean array in which True marks valid data.

Indexing the original array with this inverted mask returns a new array that contains only the valid entries. The original array is left unchanged.
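To make the intermediate steps visible, here is a small sketch that prints the mask, its inverse, and the filtered result. The variable names nan_mask and valid_mask are illustrative only:

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])

nan_mask = np.isnan(arr)   # True where the value is missing
valid_mask = ~nan_mask     # True where the value is present

print(nan_mask)            # [False  True False]
print(valid_mask)          # [ True False  True]
print(arr[valid_mask])     # [1. 3.]
print(nan_mask.sum())      # 1 (number of missing values)
```

Summing the mask is a quick way to count missing entries before deciding whether to drop them.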

Example

In the following example, we use the np.isnan() function to create a boolean mask that identifies NaN values. We then index the original array with the inverted mask to keep only the non-NaN values −

import numpy as np

# Creating a 1D array with NaN values
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Removing NaN values using Boolean indexing
cleaned_arr = arr[~np.isnan(arr)]

print("Original Array:")
print(arr)
print("Cleaned Array (without NaN):")
print(cleaned_arr)

Following is the output obtained −

Original Array:
[1.  2.5 nan 4.7 nan 6.2]
Cleaned Array (without NaN):
[1.  2.5 4.7 6.2]

Removing Missing Data from 2D Arrays

Removing missing data from 2D arrays involves eliminating rows or columns that contain NaN (Not a Number) values.

This process ensures that the dataset is cleaned and suitable for analysis or modeling. Depending on the specific requirements, you can choose to remove entire rows or columns where missing values are present.
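Before choosing between dropping rows and dropping columns, it can help to see how much data each option would discard. The following is a minimal sketch with a made-up array. Here any(axis=1) collapses each row to a single flag and any(axis=0) does the same for each column:

```python
import numpy as np

arr_2d = np.array([[1.0, np.nan, 3.5],
                   [4.0, 5.1, 6.3],
                   [7.2, 8.1, 9.4]])

nan_mask = np.isnan(arr_2d)
rows_with_nan = nan_mask.any(axis=1)   # one flag per row
cols_with_nan = nan_mask.any(axis=0)   # one flag per column

print(rows_with_nan)         # [ True False False]
print(cols_with_nan)         # [False  True False]
print(rows_with_nan.sum())   # 1 row would be dropped
print(cols_with_nan.sum())   # 1 column would be dropped
```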

Example

In this example, we combine the np.isnan() function with the any() method, called with axis=1, to create a mask that identifies rows containing at least one NaN value. We then invert this mask and use it to filter those rows out of the original 2D array −

import numpy as np 

# Creating a 2D array with NaN values
arr_2d = np.array([[1.0, np.nan, 3.5],
                   [np.nan, 5.1, 6.3],
                   [7.2, 8.1, 9.4]])

# Removing rows with NaN values
cleaned_arr_2d = arr_2d[~np.isnan(arr_2d).any(axis=1)]

print("Original 2D Array:")
print(arr_2d)
print("Cleaned 2D Array (rows without NaN):")
print(cleaned_arr_2d)

This will produce the following result −

Original 2D Array:
[[1.  nan 3.5]
 [nan 5.1 6.3]
 [7.2 8.1 9.4]]
Cleaned 2D Array (rows without NaN):
[[7.2 8.1 9.4]]

Removing Columns with Missing Data

Removing columns with missing data involves eliminating entire columns from a 2D array or dataset where any element is marked as missing, generally represented by NaN (Not a Number).

This is a common data cleaning step used to ensure that the dataset only includes columns with complete data. The trade-off is that the valid values in a removed column are discarded along with the missing ones.

Example

In the example below, we create a 2D array with some NaN values and remove every column that contains at least one NaN. Combining the np.isnan() function with the any() method, called with axis=0, produces one flag per column. The inverted mask is then used as the column index to exclude the flagged columns −

import numpy as np

# Create a 2D array with some NaN values
arr_2d = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, 6.0],
                   [np.nan, 8.0, 9.0]])

# Remove columns with any NaN values
cleaned_arr_2d_cols = arr_2d[:, ~np.isnan(arr_2d).any(axis=0)]

print("Original 2D array:")
print(arr_2d)
print("2D array with columns containing NaN removed:")
print(cleaned_arr_2d_cols)

Following is the output of the above code −

Original 2D array:
[[ 1. nan  3.]
 [ 4.  5.  6.]
 [nan  8.  9.]]
2D array with columns containing NaN removed:
[[3.]
 [6.]
 [9.]]
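The row and column cases differ only in the axis passed to any() and in the position of the mask in the index. As a sketch, both can be wrapped in one small helper. The function drop_nan below is hypothetical, written for this illustration, and is not part of NumPy:

```python
import numpy as np

def drop_nan(arr, axis=0):
    """Hypothetical helper: drop rows (axis=0) or columns (axis=1)
    of a 2D float array that contain at least one NaN."""
    if arr.ndim != 2:
        raise ValueError("expected a 2D array")
    if axis == 0:
        keep = ~np.isnan(arr).any(axis=1)   # one flag per row
        return arr[keep]
    if axis == 1:
        keep = ~np.isnan(arr).any(axis=0)   # one flag per column
        return arr[:, keep]
    raise ValueError("axis must be 0 or 1")

arr_2d = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, 6.0],
                   [np.nan, 8.0, 9.0]])

rows_kept = drop_nan(arr_2d, axis=0)
cols_kept = drop_nan(arr_2d, axis=1)

print(rows_kept)   # [[4. 5. 6.]]
print(cols_kept)   # the single NaN-free column: 3., 6., 9.
```

For this array, dropping rows keeps three values in one row, and dropping columns keeps three values in one column.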

Removing Missing Data from Multi-dimensional Arrays

Removing missing data from multi-dimensional arrays involves a process similar to that used for 1D and 2D arrays but applied to higher dimensions.

Multi-dimensional arrays (e.g., 3D or 4D arrays) present additional complexity because missing values may occur across multiple dimensions. The goal is to filter out slices or specific parts of the array that contain missing data.

Example

In the following example, we create a 3D array with some NaN values and remove the slices (2D arrays along the first axis) that contain any NaN values. Calling any() with axis=(1, 2) on the result of np.isnan() reduces each slice to a single flag, and the inverted flags are used to filter the slices −

import numpy as np 

# Creating a 3D array with NaN values
arr_3d = np.array([[[1.0, np.nan],
                    [3.5, 4.2]],
                   [[np.nan, 6.3],
                    [7.2, 8.1]]])

# Removing slices with NaN values
cleaned_arr_3d = arr_3d[~np.isnan(arr_3d).any(axis=(1, 2))]

print("Original 3D Array:")
print(arr_3d)
print("Cleaned 3D Array (slices without NaN):")
print(cleaned_arr_3d)

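The output obtained is as shown below. Because both slices along the first axis contain a NaN value, every slice is removed and the cleaned array is empty, with shape (0, 2, 2) −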

Original 3D Array:
[[[1.  nan]
  [3.5 4.2]]

 [[nan 6.3]
  [7.2 8.1]]]
Cleaned 3D Array (slices without NaN):
[]
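For comparison, here is a hedged variant with made-up data in which only the first slice contains NaN, so one slice survives the filter. Checking the shape of the result is a simple way to confirm how many slices remain:

```python
import numpy as np

arr_3d = np.array([[[1.0, np.nan],
                    [3.5, 4.2]],
                   [[5.0, 6.3],
                    [7.2, 8.1]]])

# One flag per slice along axis 0: does the slice contain any NaN?
slice_has_nan = np.isnan(arr_3d).any(axis=(1, 2))
cleaned = arr_3d[~slice_has_nan]

print(slice_has_nan)   # [ True False]
print(cleaned.shape)   # (1, 2, 2)
```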

Removing Missing Values from Structured Arrays

Removing missing values from structured arrays in NumPy involves handling arrays with complex data types where each element is a record or a row with multiple fields.

Structured arrays can include missing values (NaN or other placeholders) in specific fields. The goal is to filter out records that contain missing values, ensuring that only complete data is retained.

Example

In the following example, we define a structured array with the fields 'name' and 'age'. The 'age' field uses 'f4' (float32) because NaN can be stored only in floating-point fields, not in integer ones. We then create a boolean mask from the 'age' field and use its inverse to remove the records whose age is missing −

import numpy as np

# Define a structured array with fields 'name' and 'age'
# Use 'f4' (float32) for the 'age' field to handle NaN values
dtype = [('name', 'U10'), ('age', 'f4')]
data = [('Alice', 25.0), ('Bob', np.nan), ('Charlie', 30.0)]
structured_array = np.array(data, dtype=dtype)

# Identify missing values in the 'age' field
nan_mask = np.isnan(structured_array['age'])

# Remove records with missing values in the 'age' field
cleaned_structured_array = structured_array[~nan_mask]

print("Original structured array:")
print(structured_array)
print("Structured array with missing values removed:")
print(cleaned_structured_array)

After executing the above code, we get the following output −

Original structured array:
[('Alice', 25.) ('Bob', nan) ('Charlie', 30.)]
Structured array with missing values removed:
[('Alice', 25.) ('Charlie', 30.)]
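When a structured array has several floating-point fields, the same idea can be applied field by field and the masks combined. The sketch below assumes a made-up 'height' field and treats a record as incomplete if any float field is NaN. Other placeholder conventions, such as empty strings, would need their own checks:

```python
import numpy as np

dtype = [('name', 'U10'), ('age', 'f4'), ('height', 'f4')]
data = [('Alice', 25.0, 1.68),
        ('Bob', np.nan, 1.80),
        ('Charlie', 30.0, np.nan)]
people = np.array(data, dtype=dtype)

# Start with "nothing missing", then OR in the NaN mask of each float field
missing = np.zeros(people.shape, dtype=bool)
for field in people.dtype.names:
    if np.issubdtype(people.dtype[field], np.floating):
        missing |= np.isnan(people[field])

complete = people[~missing]
print(complete['name'])   # ['Alice']
```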