Chapter 2:
Data Manipulation and Preprocessing
• All machine learning is concerned with extracting information from data.
• Machine learning typically requires working with large datasets, which we can think of as tables:
o The rows correspond to examples, and
o The columns correspond to attributes.
• Linear algebra gives us a powerful set of techniques for working with tabular data.
• Additionally, deep learning is all about optimization.
o We have a model with some parameters and we want to find those that fit our data the best.
o Determining which way to move each parameter at each step of an algorithm requires a little bit of calculus.
o The autograd package automatically computes differentiation for us.
• This chapter provides a rapid introduction to basic and frequently-used mathematics to allow anyone
to understand at least most of the mathematical content of the book.
Data Manipulation
• There are two important things we need to do with data:
o Acquire them
o Process them once they are inside the computer.
Getting Started
• To start:
o Import the np and npx modules from MXNet.
o The np module includes functions supported by NumPy.
o The npx module contains a set of extensions developed to empower deep learning within a NumPy-like environment.
from mxnet import np, npx
• We can use arange to create a row vector containing the first 12 integers starting with 0.
o The elements are created as floats by default.
x = np.arange(12)
• We can access a tensor’s shape (the length along each axis) by inspecting its shape property.
• To know the total number of elements in a tensor, i.e., the product of all of the shape elements, we can inspect its size property.
• To change the shape of a tensor without altering either the number of elements or their values,
invoke the reshape function.
X = x.reshape(3, 4)
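The relationship between shape, size, and reshape can be checked with a minimal sketch. It assumes standard NumPy (imported as np) in place of mxnet.np, on the assumption that the shape and size properties and the reshape call behave the same way. Note that standard NumPy creates integers here rather than floats.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

x = np.arange(12)      # row vector holding 0, 1, ..., 11
print(x.shape)         # (12,)
print(x.size)          # 12

X = x.reshape(3, 4)    # same 12 values, arranged as 3 rows by 4 columns
print(X.shape)         # (3, 4)
print(X.size)          # 12, the product of the shape elements
```

Reshaping never changes size: 3 * 4 equals the original 12 elements.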
• We can create a tensor with all elements set to 0 and a shape of (2, 3, 4) as follows:
np.zeros((2, 3, 4))
• We can also create a tensor with shape (3, 4) in which each element is randomly sampled from a standard
Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1.
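A minimal sketch of such a sampling call, assuming standard NumPy's np.random.normal (the MXNet np.random.normal call is assumed to accept the same mean, standard deviation, and size arguments):

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

# Draw each of the 3 * 4 entries independently from a Gaussian
# with mean 0 and standard deviation 1.
R = np.random.normal(0, 1, size=(3, 4))
print(R.shape)  # (3, 4)
print(R)        # values differ on every run
```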
• We can also specify the exact values for each element in the desired tensor by supplying a Python list (or list of lists)
containing the numerical values.
np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
• Some of the simplest and most useful operations are the elementwise operations.
o These apply a standard scalar operation to each element of an array.
• We would denote:
o A unary scalar operator (taking one input) by the signature $f: \mathbb{R} \rightarrow \mathbb{R}$.
o A binary scalar operator (taking two real inputs, and yielding one output) by the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$.
o Given any two vectors u and v of the same shape, and a binary operator $f$, we can produce a vector c = F(u, v)
by setting $c_i \gets f(u_i, v_i)$ for all $i$, where $c_i$, $u_i$, and $v_i$ are the $i$-th elements of the vectors c, u, and v.
o Here, we produced the vector-valued $F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$ by lifting the scalar function $f$ to an elementwise
vector operation.
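The lifting of a scalar function to an elementwise vector operation can be sketched in plain Python. The helper name lift is hypothetical and introduced only for illustration; tensor libraries implement the same idea far more efficiently.

```python
import math
from typing import Callable, List, Sequence


def lift(f: Callable[[float, float], float]) -> Callable[[Sequence[float], Sequence[float]], List[float]]:
    """Lift a binary scalar function f to an elementwise vector operation F (hypothetical helper)."""
    def F(u: Sequence[float], v: Sequence[float]) -> List[float]:
        if len(u) != len(v):
            raise ValueError("u and v must have the same shape")
        # c_i = f(u_i, v_i) for every index i
        return [f(ui, vi) for ui, vi in zip(u, v)]
    return F


add = lift(lambda a, b: a + b)
power = lift(math.pow)

c_add = add([1, 2, 4, 8], [2, 2, 2, 2])
c_pow = power([1, 2, 4, 8], [2, 2, 2, 2])
print(c_add)  # [3, 4, 6, 10]
print(c_pow)  # [1.0, 4.0, 16.0, 64.0]
```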
• In the following example, we use commas to formulate a 5-element tuple, where each element is the result of an
elementwise operation.
x = np.array([1, 2, 4, 8])
y = np.array([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y # The ** operator is exponentiation
• Many more operations can be applied elementwise, including unary operators like exponentiation.
• In addition to elementwise computations, we can also perform linear algebra operations, including vector dot
products and matrix multiplication.
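A short sketch of a unary elementwise operation alongside the two linear algebra operations mentioned above. It assumes standard NumPy in place of mxnet.np; the matrices A and B are made up for illustration.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

x = np.array([1.0, 2.0, 4.0, 8.0])
y = np.array([2.0, 2.0, 2.0, 2.0])

print(np.exp(x))          # unary elementwise operation: e ** x_i for each i
d = np.dot(x, y)          # dot product: 1*2 + 2*2 + 4*2 + 8*2
print(d)                  # 30.0

A = np.arange(6.0).reshape(2, 3)   # illustrative 2 x 3 matrix
B = np.ones((3, 4))                # illustrative 3 x 4 matrix
C = A @ B                          # matrix multiplication gives a 2 x 4 result
print(C.shape)                     # (2, 4)
```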
• We can also concatenate multiple tensors together, stacking them end-to-end to form a larger tensor.
• The example below shows what happens when we concatenate two matrices along:
o Rows (axis 0, the first element of the shape).
o Columns (axis 1, the second element of the shape).
X = np.arange(12).reshape(3, 4)
Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
np.concatenate([X, Y], axis=0), np.concatenate([X, Y], axis=1)
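The effect of the axis argument is easiest to see in the resulting shapes. A sketch assuming standard NumPy in place of mxnet.np:

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

X = np.arange(12).reshape(3, 4)
Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

rows = np.concatenate([X, Y], axis=0)  # stack Y below X
cols = np.concatenate([X, Y], axis=1)  # place Y to the right of X

print(rows.shape)  # (6, 4): axis 0 length is 3 + 3
print(cols.shape)  # (3, 8): axis 1 length is 4 + 4
```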
• Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example.
X == Y
• Summing all the elements in the tensor yields a tensor with only one element.
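A sketch of the comparison and the sum, assuming standard NumPy in place of mxnet.np. Summing the Boolean result of X == Y counts the positions where the two tensors agree.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

X = np.arange(12).reshape(3, 4)
Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

mask = X == Y          # True where the entries match, False elsewhere
matches = mask.sum()   # number of matching positions
total = X.sum()        # 0 + 1 + ... + 11

print(mask)
print(matches)  # 2
print(total)    # 66
```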
Broadcasting Mechanism
• Under certain conditions, even when shapes differ, we can still perform elementwise operations by invoking the
broadcasting mechanism. This mechanism works in the following way:
o Expand one or both arrays by copying elements appropriately so that the two tensors have the same shape.
o Carry out the elementwise operations on the resulting arrays.
• In most cases, we broadcast along an axis where an array initially only has length 1, such as in the following example:
a = np.arange(3).reshape(3, 1)
b = np.arange(2).reshape(1, 2)
a, b
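The two broadcasting steps can be sketched explicitly. The example assumes standard NumPy in place of mxnet.np and uses np.broadcast_to only to make the expansion visible; in practice a + b performs the expansion implicitly.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(2).reshape(1, 2)   # shape (1, 2)

# Step 1: expand both arrays to the common shape (3, 2).
a_exp = np.broadcast_to(a, (3, 2))   # columns of a are replicated
b_exp = np.broadcast_to(b, (3, 2))   # rows of b are replicated

# Step 2: carry out the elementwise operation.
explicit = a_exp + b_exp
c = a + b                            # broadcasting happens implicitly

print(c.shape)  # (3, 2)
print(c)
```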
Indexing and Slicing
• Elements in a tensor can be accessed by index.
o The first element has index 0, and ranges include the first index but exclude the last.
[Figure: an array with indices 0 to 7; the slice 2:5 selects the elements at indices 2, 3, and 4.]
• We can access elements according to their relative position to the end of the list by using negative indices.
o Thus, X[-1] selects the last element and X[1:3] selects the second and the third elements as follows:
X[-1], X[1:3]
• To assign multiple elements the same value, we simply index all of them and then assign them the value.
X[0:2, :] = 12
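The indexing rules above can be checked with a short sketch, assuming standard NumPy in place of mxnet.np. The vector v is introduced only to illustrate the half-open slice 2:5.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

v = np.arange(8)
print(v[2:5])        # [2 3 4]: index 2 is included, index 5 is excluded

X = np.arange(12).reshape(3, 4)
print(X[-1])         # last row: [ 8  9 10 11]
print(X[1:3])        # second and third rows

X[0:2, :] = 12       # assign 12 to every element of the first two rows
print(X)
```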
Saving Memory
• Running operations can cause new memory to be allocated to host results.
o For example, if we write Y = X + Y, we will dereference the tensor that Y used to point to and instead point Y
at the newly allocated memory.
o In the following example, we demonstrate this with Python’s id() function, which gives us the exact address
of the referenced object in memory.
before = id(Y)
Y = Y + X
id(Y) == before
[Figure: Y = Y + X. Labels: Y value; result of Y + X.]
• Allocating new memory for new results for the same variable might be undesirable for two reasons:
o We do not want to run around allocating memory unnecessarily all the time.
o We might point at the same parameters from multiple variables.
• In machine learning, we might have hundreds of megabytes of parameters and update all of them multiple
times per second.
• If we do not update in place, other references will still point to the old memory location, making it possible for parts
of our code to inadvertently reference stale parameters.
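The stale-reference problem can be sketched as follows, assuming standard NumPy in place of mxnet.np. The names params and alias are hypothetical and stand for two variables that refer to the same parameters.

```python
import numpy as np  # assumption: standard NumPy stands in for mxnet.np

# Case 1: rebinding allocates new memory, so the alias goes stale.
params = np.zeros(3)
alias = params                 # second reference to the same array
params = params + 1            # new array; alias still points at the old one
stale = alias.tolist()
print(stale)                   # [0.0, 0.0, 0.0]
print(alias is params)         # False

# Case 2: updating in place keeps every reference current.
params = np.zeros(3)
alias = params
params += 1                    # writes into the existing memory
fresh = alias.tolist()
print(fresh)                   # [1.0, 1.0, 1.0]
print(alias is params)         # True
```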
• Performing in-place operations is easy. We can assign the result of an operation to a previously allocated array with
slice notation, e.g., Y[:] = <expression>.
o For example, create a new matrix Z with the same shape as Y, using zeros_like to allocate a block of 0 entries.
Z = np.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))
• If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y to reduce the
memory overhead of the operation.
before = id(X)
X += Y
id(X) == before
Conversion to Other Python Objects
• Converting to a NumPy tensor, or vice versa, is easy. The converted result does not share memory.
A = X.asnumpy()
B = np.array(A)
type(A), type(B)
• To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python’s built-in functions.
a = np.array([3.5])
a, a.item(), float(a), int(a)
• The main interface to store and manipulate data for deep learning is the tensor (n-dimensional array).
It provides a variety of functionalities including basic mathematics operations, broadcasting, indexing, slicing,
memory saving, and conversion to other Python objects.
Data Preprocessing
• To apply deep learning to solving real-world problems, we often begin with preprocessing raw data.
• The pandas package is a commonly used data analysis tool.
o pandas can work together with tensors.
Reading the Dataset
• We begin by creating an artificial dataset that is stored in a csv (comma-separated values) file.
o The following mkdir_if_not_exist function ensures that the directory ../data exists.
o #@save is a special mark where the following function, class, or statements are saved in the d2l package
so later they can be directly invoked (e.g., d2l.mkdir_if_not_exist(path)) without being redefined.
import os
def mkdir_if_not_exist(path): #@save
    """Make a directory if it does not exist."""
    if not isinstance(path, str):
        path = os.path.join(*path)
    if not os.path.exists(path):
        os.makedirs(path)
• Below we write the dataset row by row into a csv file.
data_file = "../data/house_tiny.csv"
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n') # Column names
    f.write('NA,Pave,127500\n') # Each row represents a data example
• This dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”),
the alley type (“Alley”), and the price (“Price”) of a house.
• To load the raw dataset from the created csv file, we import the pandas package and invoke the read_csv function.
# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd
data = pd.read_csv(data_file)
Handling Missing Data
• “NaN” entries are missing values. To handle missing data, typical methods include imputation and deletion:
o Imputation replaces missing values with substituted ones (this is the method considered here).
o Deletion ignores missing values.
• By integer-location based indexing (iloc), we split data into inputs and outputs:
o inputs takes the first two columns.
o outputs only keeps the last column.
• For numerical values in inputs that are missing, replace the “NaN” entries with the mean value of the same column.
• For categorical or discrete values in inputs, we consider “NaN” as a category.
inputs, outputs = data.iloc[:,0:2], data.iloc[:,2]
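inputs = inputs.fillna(inputs.mean(numeric_only=True)) # Average numeric columns only; recent pandas raises an error on text columns otherwise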
• Since the “Alley” column only takes two types of categorical values “Pave” and “NaN”,
pd.get_dummies can automatically convert this column to two columns “Alley_Pave” and “Alley_nan”.
o A row whose alley type is “Pave” will set the values of “Alley_Pave” and “Alley_nan” to 1 and 0.
o A row whose alley type is “NaN” will set the values of “Alley_Pave” and “Alley_nan” to 0 and 1.
inputs = pd.get_dummies(inputs,dummy_na=True)
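The imputation and encoding steps can be traced end to end with a small sketch. It reads the csv text from memory through io.StringIO rather than from ../data, and every data row after the first is an illustrative assumption rather than a value taken from these notes.

```python
import io

import pandas as pd

# First data row as in the notes; the remaining three rows are assumed for illustration.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))  # "NA" is parsed as a missing value

# Split into inputs (first two columns) and outputs (last column).
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]

# Imputation: fill missing numeric entries with the column mean, (2 + 4) / 2 = 3.
inputs = inputs.fillna(inputs.mean(numeric_only=True))

# Treat a missing "Alley" as its own category via an extra indicator column.
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
```

Depending on the pandas version, the indicator columns may be printed as True/False rather than 1/0; the two forms carry the same information.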
Conversion to the Tensor Format
• Now that all the entries in inputs and outputs are numerical, they can be converted to the tensor format.
o Once data are in this format, they can be further manipulated with those tensor functionalities.
• Like many other extension packages in the vast ecosystem of Python, pandas can work together with tensors.
• Imputation and deletion can be used to handle missing data.
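The conversion to the tensor format can be sketched as follows. The example assumes standard NumPy arrays as the target format (an MXNet tensor would be built from the same values), and the small inputs and outputs tables are illustrative stand-ins for the preprocessed data.

```python
import numpy as np
import pandas as pd

# Illustrative, already-numerical tables (assumed values).
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# Convert the pandas objects to plain numerical arrays.
X = inputs.to_numpy(dtype=float)
y = outputs.to_numpy(dtype=float)

print(X.shape, y.shape)  # (4, 3) (4,)
print(X.sum(axis=0))     # tensor operations now apply, e.g. column sums
```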