FDS Unit 4

4-1 B.Tech CIVIL Regulation: R20 FDS: UNIT-4
UNIT-4
Python for Data Handling
Syllabus:
Python for Data Handling: Basics of NumPy arrays – aggregations – computations on arrays –
comparisons, masks, Boolean logic – fancy indexing – structured arrays. Data manipulation with
Pandas – data indexing and selection – operating on data – missing data – hierarchical indexing
– combining datasets – aggregation and grouping – pivot tables.
Introduction to NumPy Arrays
 Datasets can come from a wide range of sources and a wide range of
formats, including collections of documents, collections of images,
collections of sound clips, collections of numerical measurements, or nearly
anything else. Despite this apparent heterogeneity, it will help us to think of
all data fundamentally as arrays of numbers.
 For example, images—particularly digital images—can be thought of as
simply two dimensional arrays of numbers representing pixel brightness
across the area. Sound clips can be thought of as one-dimensional arrays of
intensity versus time. Text can be converted in various ways into numerical
representations, perhaps binary digits representing the frequency of certain
words or pairs of words. No matter what the data are, the first step in making
them analyzable will be to transform them into arrays of numbers.
 For this reason, efficient storage and manipulation of numerical arrays is
absolutely fundamental to the process of doing data science.
 NumPy (short for Numerical Python) provides an efficient interface to store
and operate on dense data buffers.
 In some ways, NumPy arrays are like Python’s built-in list type, but NumPy
arrays provide much more efficient storage and data operations as the arrays
grow larger in size.
 NumPy arrays form the core of nearly the entire ecosystem of data science
tools in Python.
 NumPy is a Python library for working with arrays; it was created in 2005
by Travis Oliphant.
 The NumPy library also has functions for working in the domains of Fourier
transforms, linear algebra, and matrices.
 In particular, NumPy arrays provide an efficient way of storing and
manipulating data. NumPy also includes a number of functions that make it
easy to perform mathematical operations on arrays. This can be really useful
for scientific or engineering applications.
Basics of Numpy Arrays
 Categories of basic array manipulations are:
1. Attributes of arrays: Determining the size, shape, memory consumption, and
data types of arrays
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur
2. Indexing of arrays: Getting and setting the value of individual array
elements
3. Slicing of arrays: Getting and setting smaller subarrays within a larger array
4. Reshaping of arrays: Changing the shape of a given array
5. Joining and splitting of arrays: Combining multiple arrays into one, and
splitting one array into many
NumPy Array Attributes
 Some useful array attributes are:
 ndim: the number of dimensions
 shape: the size of each dimension
 size: the total number of elements in the array
 dtype: the data type of the array
 itemsize: the size (in bytes) of each array element
 nbytes: the total size (in bytes) of the array
 In general, nbytes is equal to itemsize times size.
# Python program to demonstrate Attribute of arrays
import numpy as np
# Creating array object
arr = np.array( [[ 1, 2, 3],
[ 4, 5, 6]] )
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array Elements type: ", arr.dtype)
# Printing size of each elements in array
print("Size of array element: ", arr.itemsize, "bytes")
# Printing total size of array
print("Total size of array: ",arr.nbytes ,"bytes")
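The relationship between nbytes, itemsize, and size stated above can be checked directly. The following is a minimal sketch; the dtype is fixed to int32 here (an assumption made only so that the byte counts are the same on every platform).

```python
import numpy as np

# dtype is fixed explicitly so that itemsize is predictable (assumption for this sketch)
arr = np.array([[1, 2, 3],
                [4, 5, 6]], dtype=np.int32)

print(arr.itemsize)                            # 4 -> bytes per int32 element
print(arr.size)                                # 6 -> total number of elements
print(arr.nbytes)                              # 24 -> total bytes
print(arr.nbytes == arr.itemsize * arr.size)   # True
```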
Array Indexing: Accessing Single Elements
 Indexing in NumPy is quite similar to Python’s standard list indexing.
 In a one-dimensional array, we can access the ith value (counting from zero)
by specifying the desired index in square brackets, just as with Python lists.
Example:
import numpy as np
a = np.array([5, 0, 3, 3, 7, 9])
print(a[0])
print(a[4])
Output: 5
7
 To index from the end of the array, we can use negative indices:
print(a[-1])
print(a[-2])
Output:9
7
 In a multidimensional array, we access items using a comma-separated tuple
of indices:
Example:
import numpy as np
a=np.array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
print(a[0,0])
print(a[2,0])
print(a[2,-1])
Output: 3
1
7
 We can also modify values using any of the above index notation:
a[0, 0] = 12
print(a)
Output: [[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Array Slicing: Accessing Sub arrays
 Just as we can use square brackets to access individual array elements, we
can also use them to access sub arrays with the slice notation, marked by the
colon (:) character.
 The NumPy slicing syntax follows that of the standard Python list; to access
a slice of an array x, use this: x[start:stop:step] If any of these are
unspecified, they default to the values start=0, stop=size of dimension,
step=1.
 Example: One-dimensional sub arrays
import numpy as np
x = np.arange(10)
print(x)
print( x[:5]) # first five elements
print( x[5:]) # elements from index 5 onward
print(x[4:7]) # middle sub array
print(x[::2]) # every other element
print( x[1::2]) # every other element, starting at index 1
print(x[::-1]) # all elements, reversed
print( x[5::-2]) # reversed every other from index 5
Output:[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4]
[5 6 7 8 9]
[4 5 6]
[0 2 4 6 8]
[1 3 5 7 9]
[9 8 7 6 5 4 3 2 1 0]
[5 3 1]
 Example: Multidimensional sub arrays
import numpy as np
x2=np.array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
print( x2[:2, :3]) # two rows, three columns
print(x2[:3, ::2]) # all rows, every other column
print(x2[::-1, ::-1]) # sub array dimensions reversed together
Output:[[12 5 2]
[ 7 6 8]]
[[12 2]
[ 7 8]
[ 1 7]]
[[ 7 7 6 1]
[ 8 8 6 7]
[ 4 2 5 12]]
Accessing array rows and columns
 One commonly needed routine is accessing single rows or columns of an
array. We can do this by combining indexing and slicing, using an empty
slice marked by a single colon (:):
 Example:
print(x2[:, 0]) # first column of x2 [12 7 1]
print(x2[0, :]) # first row of x2 [12 5 2 4]
 In the case of row access, the empty slice can be omitted for a more compact
syntax:
print(x2[0]) # equivalent to x2[0, :] [12 5 2 4]
Sub arrays as no-copy views
 One important—and extremely useful—thing to know about array slices is
that they return views rather than copies of the array data.
 This is one area in which NumPy array slicing differs from Python list
slicing: in lists, slices will be copies.
 Consider our two-dimensional array from before:
print(x2) Output: [[12 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
x2_sub = x2[:2, :2]
print(x2_sub) Output: [[12 5] [ 7 6]]
 Now if we modify this sub array, we’ll see that the original array is changed.
Example:
x2_sub[0, 0] = 99
print(x2_sub) Output: [[99 5] [ 7 6]]
print(x2) Output: [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
 This default behavior is actually quite useful: it means that when we work
with large datasets, we can access and process pieces of these datasets
without the need to copy the underlying data buffer.
Creating copies of arrays
 Despite the nice features of array views, it is sometimes useful to instead
explicitly copy the data within an array or a sub array.
 This can be most easily done with the copy() method:
Example:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy) Output: [[99 5] [ 7 6]]
 If we now modify this subarray, the original array is not touched:
Example:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy) Output:[[42 5] [ 7 6]]
print(x2) Output: [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
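The difference between list slices (copies), array slices (views), and explicit copies can be seen side by side. The sketch below uses small illustrative arrays that are not part of the notes, and np.shares_memory to report whether two arrays use the same underlying buffer.

```python
import numpy as np

# Python list: a slice is an independent copy
lst = [1, 2, 3, 4]
lst_slice = lst[:2]
lst_slice[0] = 99
print(lst)                               # [1, 2, 3, 4] -> original list unchanged

# NumPy array: a slice is a view onto the same data buffer
arr = np.array([1, 2, 3, 4])
arr_view = arr[:2]
arr_copy = arr[:2].copy()                # explicit copy, taken before any change
arr_view[0] = 99
print(arr)                               # [99  2  3  4] -> original array changed
print(arr_copy)                          # [1 2] -> the copy is unaffected

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(arr, arr_view))   # True
print(np.shares_memory(arr, arr_copy))   # False
```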
Reshaping of Arrays
 Another useful type of operation is reshaping of arrays.
 The most flexible way of doing this is with the reshape() method.
 For example:
import numpy as np
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
Output: [[1 2 3]
[4 5 6]
[7 8 9]]
 Note that for this to work, the size of the initial array must match the size of
the reshaped array.
 Where possible, the reshape method will use a no-copy view of the initial
array, but with non-contiguous memory buffers this is not always the case.
 Another common reshaping pattern is the conversion of a one-dimensional
array into a two-dimensional row or column matrix. We can do this with the
reshape method, or more easily by making use of np.newaxis
within a slice operation.
Example:
import numpy as np
x = np.array([1, 2, 3])
# row vector via reshape
print(x.reshape((1, 3)))
# row vector via newaxis
print(x[np.newaxis, :])
# column vector via reshape
print(x.reshape((3, 1)))
# column vector via newaxis
print(x[:, np.newaxis])
Output:[[1 2 3]]
[[1 2 3]]
[[1]
[2]
[3]]
[[1]
[2]
[3]]
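The two notes on reshape made earlier (the sizes must match, and a no-copy view is returned only where possible) can be checked with a short sketch. The array names here are illustrative only.

```python
import numpy as np

x = np.arange(6)

# 6 elements fit a 2x3 grid; a contiguous array is reshaped without copying
grid = x.reshape((2, 3))
print(np.shares_memory(x, grid))       # True -> grid is a view of x

# the transpose is not contiguous in memory, so flattening it needs a copy
flat = grid.T.reshape(6)
print(np.shares_memory(grid, flat))    # False -> flat is a copy

# 6 elements cannot fill a 4x2 grid, so reshape raises ValueError
try:
    x.reshape((4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)                  # True
```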
Array Concatenation and Splitting
 It’s also possible to combine multiple arrays into one, and to conversely split
a single array into multiple arrays.
 Concatenation of arrays: Concatenation, or joining of two arrays in
NumPy, is primarily accomplished through the routines np.concatenate,
np.vstack, and np.hstack.
 np.concatenate takes a tuple or list of arrays as its first argument
 Example:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(np.concatenate([x, y]))
Output: [1 2 3 4 5 6]
 We can also concatenate more than two arrays at once:
Example:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
Output: [ 1 2 3 4 5 6 99 99 99]
 np.concatenate can also be used for two-dimensional arrays:
 Example:
import numpy as np
grid = np.array([[1, 2, 3], [4, 5, 6]])
# concatenate along the first axis
print(np.concatenate([grid, grid]))
Output: [[1 2 3]
[4 5 6]
[1 2 3]
[4 5 6]]
# concatenate along the second axis (zero-indexed)
print(np.concatenate([grid, grid], axis=1))
Output: [[1 2 3 1 2 3]
[4 5 6 4 5 6]]
 For working with arrays of mixed dimensions, it can be clearer to use the
np.vstack (vertical stack) and np.hstack (horizontal stack) functions:
Example:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7], [6, 5, 4]])
# vertically stack the arrays
print(np.vstack([x, grid]))
Output: [[1 2 3]
[9 8 7]
[6 5 4]]
# horizontally stack the arrays
y = np.array([[99], [99]])
print(np.hstack([grid, y]))
Output: [[ 9 8 7 99]
[ 6 5 4 99]]
 Similarly, np.dstack will stack arrays along the third axis.
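As a small illustration of stacking along the third axis, the sketch below stacks two 2×2 arrays (illustrative values, not from the notes) with np.dstack.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# the two arrays are placed side by side along a new third axis
stacked = np.dstack([a, b])
print(stacked.shape)    # (2, 2, 2)
print(stacked[0, 0])    # [1 5] -> a[0, 0] and b[0, 0]
```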
Splitting of arrays
 The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit.
 For each of these, we can pass a list of indices giving the split points:
Example:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
Output: [1 2 3] [99 99] [3 2 1]
 Notice that N split points lead to N + 1 subarrays.
 The related functions np.hsplit and np.vsplit are similar:
Example:
grid = np.arange(16).reshape((4, 4))
print(grid)
Output: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
Output: [[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)
print(right)
Output:
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
 Similarly, np.dsplit will split arrays along the third axis.
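Since splitting is the opposite of concatenation, stacking the pieces back together should give the original array. A minimal check, using the same 4×4 grid as above:

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])   # split between rows 1 and 2
left, right = np.hsplit(grid, [2])    # split between columns 1 and 2

# joining the pieces again restores the original grid
print(np.array_equal(np.vstack([upper, lower]), grid))   # True
print(np.array_equal(np.hstack([left, right]), grid))    # True
```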
Aggregations
 Aggregations are used to compute summary statistics for the data in
question.
 Perhaps the most common summary statistics are the mean and standard
deviation, which allow us to summarize the “typical” values in a dataset, but
other aggregates are useful as well (the sum, product, median, minimum and
maximum, quantiles, etc.).
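The aggregates listed above are all available as NumPy functions. The sketch below applies a few of them to a small illustrative array.

```python
import numpy as np

data = np.array([2, 4, 6, 8])

print(np.mean(data))             # 5.0
print(np.std(data))              # about 2.236 (square root of 5)
print(np.median(data))           # 5.0
print(np.prod(data))             # 384
print(np.percentile(data, 25))   # 3.5 (25th percentile, linear interpolation)
```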
Summing the Values in an Array
 np.sum() function is used for computing the sum of all values in an array.
 Example:
import numpy as np
arr = np.array([2,4,6,8])
b = np.sum(arr)
print(b)
Output: 20
Minimum and Maximum
 np.min(), np.max() are used to find minimum and maximum in given array
 Example:
import numpy as np
arr = np.array([2,4,6,8])
print(np.min(arr))
print(np.max(arr))
output:
2
8
Multidimensional aggregates
 One common type of aggregation operation is an aggregate along a row or
column.
 By default, each NumPy aggregation function will return the aggregate over
the entire array:
 Aggregation functions take an additional argument specifying the axis along
which the aggregate is computed.
 For example, we can find the minimum value within each column by
specifying axis=0. Similarly, we can find the maximum value within each
row by specifying axis=1.
 Example:
import numpy as np
a = np.array([[1,3,5,7],
[2,4,6,8]])
print(a.max())
print(np.max(a))
print(a.max(axis=0))
print(a.max(axis=1))
Output:
8
8
[2 4 6 8]
[7 8]
 The axis keyword specifies the dimension of the array that will be
collapsed, rather than the dimension that will be returned. So specifying
axis=0 means that the first axis will be collapsed: for two-dimensional
arrays, this means that values within each column will be aggregated.
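To make the "collapsed axis" idea concrete, the sketch below reuses the 2×4 array from the example and looks at the shape of each result.

```python
import numpy as np

a = np.array([[1, 3, 5, 7],
              [2, 4, 6, 8]])    # shape (2, 4)

# axis=0 collapses the rows: one result per column
print(a.sum(axis=0))            # [ 3  7 11 15]
print(a.min(axis=0))            # [1 3 5 7]
print(a.sum(axis=0).shape)      # (4,)

# axis=1 collapses the columns: one result per row
print(a.sum(axis=1))            # [16 20]
print(a.sum(axis=1).shape)      # (2,)
```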
Computation on NumPy Arrays: Universal Functions

 NumPy provides an easy and flexible interface to optimized computation
with arrays of data.
 Computation on NumPy arrays can be very fast, or it can be very slow. The
key to making it fast is to use vectorized operations, generally implemented
through NumPy’s universal functions (ufuncs).
 These universal functions (NumPy's mathematical functions) operate on NumPy
arrays and perform element-wise operations on the data values.
 For many types of operations, NumPy provides a convenient interface to
statically typed, compiled routines. Such an operation is known as a
vectorized operation.
 ufuncs are used to implement vectorization in NumPy, which is much faster
than iterating over elements in Python.
 This vectorized approach is designed to push the loop into the compiled
layer that underlies NumPy, leading to much faster execution.
 Vectorized operations in NumPy are implemented via ufuncs, whose main
purpose is to quickly execute repeated operations on values in NumPy
arrays. Ufuncs are extremely flexible.
 Another means of vectorizing operations is to use NumPy’s broadcasting
functionality. Broadcasting is simply a set of rules for applying binary
ufuncs (addition, subtraction, multiplication, etc.) on arrays of different
sizes.
 Computations using vectorization through ufuncs are nearly always more
efficient than their counterpart implemented through Python loops,
especially as the arrays grow in size.
 These functions contain standard trigonometric functions, arithmetic
operations, complex number handling, statistical functions, and so forth.
 The following are some of the characteristics of universal functions:
 These functions work with ndarray (N-dimensional array), which is
Numpy's array class.
 It provides quick array operations on elements.
 It provides a variety of functions such as array broadcasting, type
casting, and so on.
 Numpy universal functions are objects in the numpy.ufunc class.
 Python functions can also be made universal by utilizing the
np.frompyfunc library function.
 When the corresponding array arithmetic operator is applied, some
ufuncs are called automatically. When two arrays are added element
by element using the '+' operator, np.add() is called internally.
 Ufuncs exist in two flavors:
 unary ufuncs, which operate on a single input
 binary ufuncs, which operate on two inputs.
 Example:
import numpy as np
x=np.array([1,2,3,4])
y=np.array([4,5,6,7])
z=np.add(x,y)
print(z)
Output: [ 5 7 9 11]
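Several of the characteristics listed above can be checked in a few lines. The sketch below compares an explicit Python loop with the equivalent ufunc call, and wraps an ordinary Python function with np.frompyfunc; the helper add_then_double is hypothetical and exists only for this illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4, 5, 6, 7])

# element-by-element addition written as an explicit Python loop
loop_result = np.empty(len(x), dtype=x.dtype)
for i in range(len(x)):
    loop_result[i] = x[i] + y[i]

# the + operator and np.add give the same result without a Python-level loop
print(np.array_equal(loop_result, x + y))      # True
print(np.array_equal(x + y, np.add(x, y)))     # True
print(isinstance(np.add, np.ufunc))            # True

# np.frompyfunc(func, nin, nout) turns a Python function into a ufunc
def add_then_double(a, b):                     # hypothetical helper
    return 2 * (a + b)

doubled_sum = np.frompyfunc(add_then_double, 2, 1)
print(doubled_sum(x, y))                       # [10 14 18 22] (object dtype)
```

Note that a ufunc built with np.frompyfunc still calls the Python function once per element, so it offers the ufunc interface rather than compiled-code speed.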
Absolute value
 Just as NumPy understands Python’s built-in arithmetic operators, it also
understands Python’s built-in absolute value function:
Example:
import numpy as np
x = np.array([-2, -1, 0, 1, 2])
print(np.abs(x))
Output: [2 1 0 1 2]
Trigonometric functions
 NumPy provides a large number of useful ufuncs, and some of the most
useful for the data scientist are the trigonometric functions.
 Example:
theta = np.linspace(0, np.pi, 3)
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
Output:
theta = [ 0. 1.57079633 3.14159265]

sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16]
cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00]
tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16]
 Inverse trigonometric functions are also available:

Example:
import numpy as np
x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
Output:
x = [-1, 0, 1]
arcsin(x) = [-1.57079633 0. 1.57079633]
arccos(x) = [3.14159265 1.57079633 0. ]
arctan(x) = [-0.78539816 0. 0.78539816]
Exponents and logarithms
 Another common type of operation available as NumPy ufuncs is the
exponentials:
Example:
import numpy as np
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
Output:
x = [1, 2, 3]
e^x = [ 2.71828183 7.3890561 20.08553692]
2^x = [ 2. 4. 8.]
3^x = [ 3 9 27]
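The inverses of the exponentials, the logarithms, are also available as ufuncs. A minimal sketch with illustrative values:

```python
import numpy as np

x = np.array([1, 2, 4, 10])

print("ln(x)    =", np.log(x))     # natural logarithm
print("log2(x)  =", np.log2(x))    # base-2 logarithm
print("log10(x) =", np.log10(x))   # base-10 logarithm

# exp and log undo each other (up to floating-point rounding)
print(np.allclose(np.exp(np.log(x)), x))   # True
```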
Advanced Ufunc Features
Some of specialized features of ufuncs are:
Specifying output
 For large calculations, it is sometimes useful to be able to specify the array
where the result of the calculation will be stored. Rather than creating a
temporary array, we can use this to write computation results directly to the
required memory location where we would like them to be.
 Example:
import numpy as np
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
Output: [ 0. 10. 20. 30. 40.]
Aggregates
 For binary ufuncs, there are some interesting aggregates that can be
computed directly from the object.
 For example, if we’d like to reduce an array with a particular operation, we
can use the reduce method of any ufunc. A reduce repeatedly applies a given
operation to the elements of an array until only a single result remains.
 For example, calling reduce on the add ufunc returns the sum of all elements
in the array:
 Example:
import numpy as np
x = np.arange(1, 6)
print(np.add.reduce(x))
Output: 15
 If we’d like to store all the intermediate results of the computation, we can
instead use accumulate:
 Example:
import numpy as np
x = np.arange(1, 6)
print(np.add.accumulate(x))
Output: [ 1 3 6 10 15]
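The same two methods work for other binary ufuncs. As a sketch, reduce and accumulate on np.multiply give the product and the running product of the elements:

```python
import numpy as np

x = np.arange(1, 6)                  # [1 2 3 4 5]

print(np.multiply.reduce(x))         # 120 -> 1*2*3*4*5
print(np.multiply.accumulate(x))     # [  1   2   6  24 120]

# these match the dedicated NumPy functions np.prod and np.cumprod
print(np.multiply.reduce(x) == np.prod(x))                        # True
print(np.array_equal(np.multiply.accumulate(x), np.cumprod(x)))   # True
```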
Outer products
 Any binary ufunc can compute the output of all pairs of two different inputs
using the outer method.
Example:
x = np.arange(1, 6)
print(np.multiply.outer(x, x))
Output:
[[ 1 2 3 4 5]
[ 2 4 6 8 10]
[ 3 6 9 12 15]
[ 4 8 12 16 20]
[ 5 10 15 20 25]]
Broadcasting
 Broadcasting is another means of vectorizing operations: it is a set of rules
for applying binary ufuncs (addition, subtraction, multiplication, etc.) to
arrays of different shapes.
 Conceptually, the smaller array is “broadcast” (stretched or repeated) to the
shape of the larger array, so that the operation can be carried out element by
element.
 Using broadcasting allows for vectorization, a style of programming that
works with entire arrays instead of individual elements.
 Broadcasting is usually fast, since the looping occurs in optimized C code
instead of slower Python code. In addition, NumPy does not actually create
the repeated copies of the smaller array in memory.
 Example-1: We can add a scalar (think of it as a zero-dimensional array) to
an array:
import numpy as np
a = np.array([0, 1, 2])
print(a + 5)
Output: [5 6 7]
 We can think of this as an operation that stretches or duplicates the value 5
into the array [5, 5, 5], and adds the results.
 Example-2:
import numpy as np
a = np.array([0, 1, 2])
M = np.ones((3, 3))
print(M + a)
Output: [[1. 2. 3.]
[1. 2. 3.]
[1. 2. 3.]]
 The advantage of NumPy’s broadcasting is that this duplication of values
does not actually take place, but it is a useful mental model as we think
about broadcasting.
 In broadcasting, the smaller array is broadcast to the larger array to make
their shapes compatible with each other.
Visualization of NumPy broadcasting

In the above diagram, the light boxes represent the broadcasted values: this
extra memory is not actually allocated in the course of the operation, but it
can be useful conceptually to imagine that it is.
Rules of Broadcasting
 Broadcasting in NumPy follows a strict set of rules to determine the
interaction between the two arrays:
 Rule 1: If the two arrays differ in their number of dimensions, the
shape of the one with fewer dimensions is padded with ones on its
leading (left) side.
 Rule 2: If the shape of the two arrays does not match in any
dimension, the array with shape equal to 1 in that dimension is
stretched to match the other shape.
 Rule 3: If in any dimension the sizes disagree and neither is equal to
1, an error is raised.
 A set of arrays is said to be compatible with broadcasting (broadcastable)
if one of the following is true:
 Arrays have exactly the same shape.
 Arrays have the same number of dimensions and the length of each
dimension is either a common length or 1.
 An array with too few dimensions can have its shape prepended with
dimensions of length 1, so that the above stated property is true.
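The three rules can be traced on concrete shapes. The sketch below uses illustrative arrays: one pair that is broadcastable and one pair that is not.

```python
import numpy as np

a = np.arange(3).reshape((3, 1))   # shape (3, 1)
b = np.arange(3)                   # shape (3,)

# Rule 1: b has fewer dimensions, so its shape is padded to (1, 3)
# Rule 2: a is stretched to (3, 3) and b is stretched to (3, 3)
print((a + b).shape)               # (3, 3)
print(a + b)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

M = np.ones((3, 2))                # shape (3, 2)
# Rule 1: b -> (1, 3); Rule 2: b -> (3, 3) while M stays (3, 2)
# Rule 3: the last dimensions are 2 and 3, neither is 1, so an error is raised
try:
    M + b
    incompatible = False
except ValueError:
    incompatible = True
print(incompatible)                # True
```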
Uses of Broadcasting (Broadcasting in Practice)

Centering an array:
 One commonly seen example is centering an array of data.
 Example: Imagine we have an array of 10 observations, each of which
consists of 3 values. We will store this in a 10×3 array:
X = np.random.random((10, 3))
 We can compute the mean of each feature using the mean aggregate across
the first dimension:
Xmean = X.mean(0)
print(Xmean)
And now we can center the X array by subtracting the mean
X_centered = X - Xmean
To double-check that we’ve done this correctly, we can check that the
centered array has near zero mean:
print(X_centered.mean(0))
To within machine precision, the mean is now zero.
 The entire program is:
import numpy as np
X = np.random.random((10, 3))
Xmean = X.mean(0)
print(Xmean)
X_centered = X - Xmean
print(X_centered.mean(0))
Output (sample; the exact values vary from run to run because X is random):
[0.53514715 0.66567217 0.44385899]
[ 2.22044605e-17 -7.77156117e-17 -1.66533454e-17]
Plotting a two-dimensional function:
 One place that broadcasting is very useful is in displaying images based on
two dimensional functions.
 If we want to define a function z = f(x, y), broadcasting can be used to
compute the function across the grid:
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
 We can use Matplotlib to plot this two-dimensional array.
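Before plotting, it helps to confirm what broadcasting produced. The sketch below only inspects shapes and one grid value (no plotting library is used): x is a row of 50 values, y is a column of 50 values, and z is the full 50×50 grid.

```python
import numpy as np

x = np.linspace(0, 5, 50)                   # shape (50,)
y = np.linspace(0, 5, 50)[:, np.newaxis]    # shape (50, 1)
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

print(x.shape, y.shape, z.shape)            # (50,) (50, 1) (50, 50)

# z[i, j] holds the function evaluated at x[j] and y[i]
i, j = 3, 7
single = np.sin(x[j]) ** 10 + np.cos(10 + y[i, 0] * x[j]) * np.cos(x[j])
print(np.isclose(z[i, j], single))          # True
```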
Comparisons, Masks, and Boolean Logic
Comparisons:
 NumPy also implements all six of the standard comparison operators, such as <
(less than) and > (greater than), as element-wise ufuncs.
 The result of these comparison operators is always an array with Boolean
data type, called a Boolean array.
 As in the case of arithmetic operators, the comparison operators are
implemented as ufuncs in NumPy; for example, when we write x < 3,
internally NumPy uses np.less(x, 3).
 A summary of the comparison operators and their equivalent ufuncs is
shown here:
Operator    Equivalent ufunc
==          np.equal
!=          np.not_equal
<           np.less
<=          np.less_equal
>           np.greater
>=          np.greater_equal
 Example 1:
x = np.array([1, 2, 3, 4, 5])
print(x < 3) # less than operator
Output: [ True True False False False]
 Example 2:
x = np.array([1, 2, 3, 4, 5])
print(np.less(x,3)) # less than ufunc
Output: [ True True False False False]
Working with Boolean Arrays:
Counting entries:
 To count the number of True entries in a Boolean array, np.count_nonzero is
useful:
# how many values less than 6?
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x)
print(np.less(x,6))
print(np.count_nonzero(x < 6))
Output:
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
[[ True True True True]
[False False True True]
[ True True False False]]
8
 Another way to get at this information is to use np.sum; in this case, False is
interpreted as 0, and True is interpreted as 1:
print( np.sum(x < 6))
output: 8
 The benefit of sum() is that like with other NumPy aggregation functions,
this summation can be done along rows or columns as well:
Ex: # how many values less than 6 in each row?
print(np.sum(x < 6, axis=1))
Output: [4 2 2]
This counts the number of values less than 6 in each row of the matrix.
Boolean Operators
 We can combine the comparison operators using Python’s bitwise logic
operators, &, |, ^, and ~.
 Like with the standard arithmetic operators, NumPy overloads these as
ufuncs that work element-wise on (usually Boolean) arrays.
 The following table summarizes the bitwise Boolean operators and their
equivalent ufuncs:
Operator    Equivalent ufunc
&           np.bitwise_and
|           np.bitwise_or
^           np.bitwise_xor
~           np.bitwise_not
 Example:
import numpy as np
a =np.arange(10)
print(a)
# bitwise or operator
b=((a<=2) | (a>=8))
print(b)
d=np.sum(b)
print(d)
#bitwise or ufunc
c=np.bitwise_or(a<=2,a>=8)
print(c)
e=np.sum(c)
print(e)
Output:
[0 1 2 3 4 5 6 7 8 9]
[ True True True False False False False False True True]
5
[ True True True False False False False False True True]
5
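The other bitwise operators are used in the same way. The sketch below combines two comparisons with & and negates the result with ~ on the same array of 0 to 9.

```python
import numpy as np

a = np.arange(10)

# values strictly between 2 and 8
inside = (a > 2) & (a < 8)
print(a[inside])                   # [3 4 5 6 7]

# ~ flips every True/False entry, so this counts the values outside the range
print(np.sum(~inside))             # 5

# the operator form and the ufunc form agree
print(np.array_equal(inside, np.bitwise_and(a > 2, a < 8)))   # True
```

The parentheses around each comparison are required: & binds more tightly than < and >, so without them Python would evaluate 2 & a first and the expression would fail.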
Boolean Masking (Boolean Indexing)

(Boolean Arrays as Masks)
 Boolean masks are used to examine and manipulate values within NumPy
arrays.
 Masking comes up when we want to extract, modify, count, or otherwise
manipulate values in an array based on some criterion: for example,
counting all values greater than a certain value, or perhaps remove all
outliers that are above some threshold.
 In NumPy, Boolean masking is often the most efficient way to manipulate
values in an array based on some criterion.
 Boolean masking, also called Boolean indexing, lets us filter the values of a
NumPy array: it selects the elements that satisfy a certain condition.
 NumPy allows us to use an array of Boolean values as an index of another
array.
 Each element of the Boolean array indicates whether or not to select the
corresponding element of the array. If the value is True, the element at that
index is selected; if the value is False, the element at that index is not
selected.
 Example 1:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([True, True, False])
c = a[b]
print(c)
Output:[1 2]
 Example 2:
import numpy as np
a = np.arange(1, 10)
b=a>5
print(b)
c = a[b]
print(c)
Output:
[False False False False False True True True True]
[6 7 8 9]
 Example 3:
import numpy as np
# Create a NumPy array

arr = np.array([1, 2, 3, 4, 5])
# Create a boolean mask based on a condition

condition = arr > 3
# Use the boolean mask to filter elements from the array

filtered_arr = arr[condition]
# Display the original array

print("Original array:")
print(arr)
# Display the boolean mask

print("\nBoolean mask:")
print(condition)
# Display the filtered array

print("\nFiltered array (elements greater than 3):")
print(filtered_arr)
Output:
Original array:
[1 2 3 4 5]

Boolean mask:
[False False False True True]

Filtered array (elements greater than 3):
[4 5]
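Besides extracting values, a mask can be used to count, remove, or modify them, as mentioned at the start of this section. The sketch below uses made-up readings and a hypothetical threshold of 100 to treat large values as outliers.

```python
import numpy as np

values = np.array([3, 120, 7, 5, 300, 9])   # illustrative data
threshold = 100                             # hypothetical cut-off for outliers
outliers = values > threshold               # Boolean mask

# count the outliers
print(np.count_nonzero(outliers))           # 2

# remove the outliers by selecting with the inverted mask
cleaned = values[~outliers]
print(cleaned)                              # [3 7 5 9]

# modify only the masked positions (work on a copy to keep the original)
capped = values.copy()
capped[outliers] = threshold
print(capped)                               # [  3 100   7   5 100   9]
```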
Fancy Indexing
 Fancy indexing is another style of array indexing, in which we pass arrays of
indices in place of single scalars. This allows us to very quickly access and
modify complicated subsets of an array’s values.
 In other words, fancy indexing means passing an array (or list) of indices to
access multiple array elements at once.
 The elements are selected from a NumPy array according to the indices
provided (plain Python lists do not support this form of indexing). This can
be particularly useful for operations like data filtering or rearrangement.
 Example 1:
# Import NumPy for working with arrays

import numpy as np
arr = np.array([10, 20, 30, 40, 50])
# Create an array of indices for fancy indexing
indices = np.array([1, 3])
# Access elements using fancy indexing
selected_elements = arr[indices]
print("Original array:", arr)
print("Indices for fancy indexing:", indices)
print("Selected elements using fancy indexing:", selected_elements)
Output:
Original array: [10 20 30 40 50]
Indices for fancy indexing: [1 3]
Selected elements using fancy indexing: [20 40]
 Example 2:
import numpy as np # Import NumPy for working with arrays
my_array = np.array(['apple', 'banana', 'cherry', 'date', 'elderberry'])
# Create a list of indices for fancy indexing

array_indices = [0, 2, 4]
# Access elements using fancy indexing

selected_fruits = my_array[array_indices]
print("\nOriginal array:", my_array)

print("Indices for fancy indexing:", array_indices)
print("Selected fruits using fancy indexing:", selected_fruits)
Output:
Original array: ['apple' 'banana' 'cherry' 'date' 'elderberry']
Indices for fancy indexing: [0, 2, 4]
Selected fruits using fancy indexing: ['apple' 'cherry' 'elderberry']
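Two further properties of fancy indexing are worth seeing in code: the result takes the shape of the index array, and fancy indexing can also be used to modify values. A minimal sketch with illustrative data:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# a two-dimensional index array gives a two-dimensional result
idx = np.array([[0, 1],
                [3, 4]])
picked = arr[idx]
print(picked)
# [[10 20]
#  [40 50]]
print(picked.shape)    # (2, 2)

# assignment through fancy indexing changes the selected positions
arr[[1, 3]] = 0
print(arr)             # [10  0 30  0 50]
```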
Combined Indexing
(Combining fancy indexing with other indexing schemes)
 For even more powerful operations, fancy indexing can be combined with
the other indexing schemes
 We can combine fancy and simple indices:
import numpy as np
X = np.arange(12).reshape((3, 4))
print(X)
a= X[2, [2, 0, 1]]
print(a)
Output: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[10 8 9]
 We can also combine fancy indexing with slicing:
b=X[1:, [2, 0, 1]]
print(b)
Output:
[[ 6 4 5]
[10 8 9]]
 We can combine fancy indexing with masking:
mask = np.array([True, False, True, False])
row = np.array([0, 1, 2])
c= X[row[:, np.newaxis], mask]
print(c)
Output: [[ 0 2]
[ 4 6]
[ 8 10]]
 All of these indexing options combined lead to a very flexible set of
operations for accessing and modifying array values.
Differences between Boolean and Fancy indexing
1. Boolean indexing: Selection is based on conditions or boolean masks.
Fancy indexing: Selection is based on predefined indices or arrays of indices.
2. Boolean indexing: We create a boolean array of the same shape as the original
array, where each element corresponds to whether the condition is met or not.
Fancy indexing: We explicitly specify which elements to select using integer
arrays of indices.
3. Boolean indexing: Requires creating a boolean mask, which involves evaluating
a condition against the original array.
Fancy indexing: Requires a separate array (or arrays) of integer indices to
specify the elements to be selected.
4. Boolean indexing: The result is a one-dimensional array holding only the
selected elements (it is the mask, not the result, that has the same shape as
the original array).
Fancy indexing: The result has the shape of the index array, which can differ
from the shape of the original array.
5. Boolean indexing: Elements are selected based on whether the corresponding
boolean mask is True or False.
Fancy indexing: Elements are selected explicitly based on the specified integer
indices.
6. Boolean indexing: Useful for condition-based selection of elements. Commonly
used for filtering data based on conditions.
Fancy indexing: Useful for selecting elements at specific locations or indices.
Allows for more flexible and customized selection of elements.
Structured Arrays
 Structured arrays are arrays with compound data types.
 Structured arrays in NumPy allow us to work with structured data, where
each element of the array can have multiple fields with different data types,
similar to a table in a database or a structure in a programming language
like C or C++.
 Structured arrays are particularly useful when we need to handle
heterogeneous data, such as data from CSV files or other tabular data
sources.
 They provide efficient storage for compound, heterogeneous data.
 Structured arrays are handy for handling and manipulating structured data
within NumPy, making it easier to work with complex datasets.
 Example:
import numpy as np
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
#storing data in three separate arrays
name = np.array( ['Kumar', 'Rao', 'Ali', 'Singh'])
age = np.array([25, 45, 37, 19])
weight =np.array( [55.0, 85.5, 68.0, 61.5])
# filling the array with our lists of values
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
Output:
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
 The handy thing with structured arrays is that we can now refer to values
either by index or by name:
 Example:
# Get all names
print(data['name'])
# Get first row of data
print(data[0])
# Get the name from the last row
print(data[-1]['name'])
# Get names where age is under 30
print(data[data['age'] < 30]['name'])
Output:
['Kumar' 'Rao' 'Ali' 'Singh']
('Kumar', 25, 55.)
Singh
['Kumar' 'Singh']
Creating Structured Arrays
 Structured array data types can be specified in a number of ways.
Method 1: Dictionary method: We can create a structured array using a
compound data type specification:
struct = np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
Method2: Numerical types can be specified with Python types or NumPy
dtypes instead:
struct2 = np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
Method3: A compound type can also be specified as a list of tuples:
struct3 = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
 Example:
import numpy as np
# Method 1: dictionary of names and formats
struct = np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
data = np.zeros(4,struct)
# Method 2: Python types and NumPy dtypes
struct2 = np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
data2 = np.zeros(4,struct2)
# Method 3: list of tuples
struct3 = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data3 = np.zeros(4,struct3)
name = ['Kumar', 'Rao', 'Ali', 'Singh']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data['name'] = name
data['age'] = age
data['weight'] = weight
data2['name'] = name
data2['age'] = age
data2['weight'] = weight
data3['name'] = name
data3['age'] = age
data3['weight'] = weight
print(data)
print(data2)
print(data3)
Output (each of the three print calls displays the same four records):
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
Record Arrays: Structured Arrays with a Twist
 In NumPy, a record array is a special kind of structured array where each
element behaves like a record or a structured row, and we can access fields
using attributes (like object-oriented attributes) instead of dictionary-style
indexing. This can make the code more readable and similar to working with
structured data.
 NumPy also provides the np.recarray class, which is almost identical to the
structured arrays, but with one additional feature: fields can be accessed as
attributes rather than as dictionary keys.
 Recall that we can access the ages by writing: print(data['age'])
Output: [25 45 37 19]
 If we view our data as a record array instead, we can access this with
slightly fewer keystrokes:
data_rec = data.view(np.recarray)
print(data_rec.age)
Output: [25 45 37 19]
 The downside is that for record arrays, there is some extra overhead
involved in accessing the fields.
 Example program:
import numpy as np
name = ['Kumar','Rao','Ali','Singh']
age = [25,45,37,19]
weight = [55.0,85.5,68.0,61.5]
struct = np.dtype({'names':('name','age','weight'),
'formats':('U10','i4','f8')})
data = np.zeros(4,struct)
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
# accessing a field as a dictionary key
print(data['age'])
# accessing a field as an attribute
data_rec = data.view(np.recarray)
print(data_rec.age)
Output:
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
[25 45 37 19]
[25 45 37 19]
 Example: creating a record array directly and from a structured array
import numpy as np
# Creating a record array with fields 'name' and 'age'
data = np.rec.array([('Alice', 25), ('Bob', 30), ('Charlie', 35)],
dtype=[('name', 'U10'), ('age', int)])
print(data)
print(data.name)
# A plain structured array with the same fields
data2 = np.array([('Alice', 25), ('Bob', 30), ('Charlie', 35)],
dtype=[('name', 'U10'), ('age', int)])
print(data2)
print(data2['name'])
# Viewing the structured array as a record array
data_rec=data2.view(np.recarray)
print(data_rec.name)
Exercise: Create a structured array representing information about students. Each
student should have fields for "Name," "Age," "Grade," and "City." Populate the
array with data for at least 5 students. Access and print the names of all students in
the array. Find and print the average grade of all students.
import numpy as np
# Define the fields (dtype) of the structured array
struct = [('Name', 'U20'), ('Age', int), ('Grade', float), ('City', 'U20')]
data = [("Alice", 22, 90.5, "New York"),
("Bob", 21, 88.0, "Los Angeles"),
("Charlie", 25, 95.2, "Chicago"),
("David", 20, 87.3, "San Francisco"),
("Eva", 23, 91.8, "New York")]
students =np.array(data,struct)
# Access and print the names of all students
names = students['Name']
print("Names of all students:")
print(names)
# Find and print the average grade of all students

average_grade = np.mean(students['Grade'])
print("\nAverage grade of all students:", average_grade)
Pandas
 Pandas is a popular open-source Python library for data manipulation and
analysis. It provides data structures and functions for working with
structured and tabular data, making it a fundamental tool for data scientists,
analysts, and anyone working with data in Python.
 Pandas is a newer package built on top of NumPy, and provides an efficient
implementation of a DataFrame.
 DataFrames are essentially multidimensional arrays with attached row and
column labels, and often with heterogeneous types and/or missing data.
 As well as offering a convenient storage interface for labeled data, Pandas
implements a number of powerful data operations familiar to users of both
database frameworks and spreadsheet programs.
Pandas Objects
(Fundamental Pandas Data Structures)
 Three fundamental Pandas data structures are:
 Series
 DataFrame
 Index.
The Pandas Series Object
 A Pandas Series is a one-dimensional array of indexed data.
 Example:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Output:
0 0.25
1 0.50
2 0.75
3 1.00
 The Series wraps both a sequence of values and a sequence of indices,
which we can access with the values and index attributes. The index is an
array-like object of type pd.Index.
Example:
print(data.values)
print(data.index)
Output:
[0.25 0.5 0.75 1. ]
RangeIndex(start=0, stop=4, step=1)
 The essential difference between NumPy one-dimensional array and pandas
Series is the presence of the index: while the NumPy array has an implicitly
defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values.
 This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of
any desired type.
 Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
print(data)
Output:
a 0.25
b 0.50
c 0.75
d 1.00
 We can even use non-contiguous or non-sequential indices:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data)
Output:
2 0.25
5 0.50
3 0.75
7 1.00
Constructing Series objects
 The general syntax to create pandas Series object is
pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities.
 data can be a list or NumPy array, in which case index defaults to an
integer sequence
 data can be a scalar, which is repeated to fill the specified index
 data can be a dictionary, in which case index defaults to the dictionary
keys (in insertion order in current Pandas versions; older versions sorted
the keys)
 Example program:
import pandas as pd
import numpy as np
arr=np.arange(10,60,10)
li=[10,20,30,40,50]
s=10
dic={'1st':10,'2nd':20,'3rd':30,'4th':40,'5th':50}
ser1 = pd.Series(arr) #A one-dimensional ndarray
ser2 = pd.Series(li) # A Python list
ser3 = pd.Series(s) #A scalar value
ser4 =pd.Series(s,index=['a','b','c','d','e'])
ser5 = pd.Series(dic) #A Python dictionary
print(ser1)
print(ser2)
print(ser3)
print(ser4)
print(ser5)
Output:
0 10
1 20
2 30
3 40
4 50
0 10
1 20
2 30
3 40
4 50
0 10
a 10
b 10
c 10
d 10
e 10
1st 10
2nd 20
3rd 30
4th 40
5th 50
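When data is a dictionary, the optional index argument can also be passed explicitly. The sketch below, using a small illustrative marks dictionary that is not part of the program above, shows that only the listed keys are kept and that a label with no matching key yields a missing value.

```python
import pandas as pd

marks = {'Kumar': 89, 'Rao': 78, 'Ali': 67}   # illustrative data

# Only the keys named in index are kept, in the order given.
subset = pd.Series(marks, index=['Ali', 'Kumar'])

# A label with no matching key produces a missing value (NaN).
extended = pd.Series(marks, index=['Rao', 'Singh'])

print(subset)
print(extended)
```

The second Series is stored as floating point, because NaN is a floating-point value.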
The Pandas DataFrame Object
 The DataFrame can be thought of either as a generalization of a NumPy
array, or as a specialization of a Python dictionary.
 A DataFrame is an analog of a two-dimensional array with both flexible row
indices and flexible column names.
 We can think of a DataFrame as a sequence of aligned (they share the same
index) Series objects.
 Thus the DataFrame can be thought of as a generalization of a two-
dimensional NumPy array, where both the rows and columns have a
generalized index for accessing the data.
 Example:
import pandas as pd
df=pd.DataFrame([[10,20],[30,40],[50,60]])
print(df)
df=pd.DataFrame([[10,20],[30,40],[50,60]],columns=['col1', 'col2'])
print(df)
df=pd.DataFrame([[10,20],[30,40],[50,60]],index=['row1', 'row2', 'row3'])
print(df)
df=pd.DataFrame([[10,20],[30,40],[50,60]],columns=['col1', 'col2'],
index=['row1', 'row2', 'row3'])
print(df)
Output:
0 1
0 10 20
1 30 40
2 50 60
col1 col2
0 10 20
1 30 40
2 50 60
0 1
row1 10 20
row2 30 40
row3 50 60
col1 col2
row1 10 20
row2 30 40
row3 50 60
Series vs DataFrame
1. Dimensionality
Series: A Series is a one-dimensional labeled array capable of holding data of
any type.
DataFrame: A DataFrame is a two-dimensional, tabular data structure consisting
of rows and columns.
2. Analogy
Series: It can be thought of as a single column of data with an associated set
of labels, which is called the index.
DataFrame: It is similar to a spreadsheet or a SQL table.
3. Relationship
Series: A Series is similar to a NumPy array, but with the addition of an index
that allows for more flexible and meaningful data access.
DataFrame: We can think of a DataFrame as a collection of Series objects, where
each Series represents a column.
4. Data types
Series: All elements in a Series share one data type. For example, if we create
a Series with integers, all elements in that Series will be integers.
DataFrame: Different columns in a DataFrame can hold data of different data
types. For example, one column can contain integers, another can contain
strings, and so on.
5. Size
Series: The size (number of elements) of a Series is fixed upon creation. We
cannot add or remove elements without creating a new Series.
DataFrame: DataFrames can be modified in size by adding or removing rows and
columns. This makes them more flexible for data manipulation.
6. Creation
Series: We can create a Series from a list, array, or dictionary.
DataFrame: We can create a DataFrame from various data sources like lists,
dictionaries, NumPy arrays, other DataFrames, or by reading data from files
like CSV, Excel, SQL databases, etc.
Constructing DataFrame objects

 A Pandas DataFrame can be constructed in a variety of ways.
 From a single Series object
 From List of Dicts
 From a dictionary of Series objects
 From a two-dimensional NumPy array
 From a NumPy structured array
From a single Series object:
 A DataFrame is a collection of Series objects, and a single column
DataFrame can be constructed from a single Series:
Example:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
marks = pd.Series(markslist)
df= pd.DataFrame(marks,columns=['Marks'])
print(df)
Output:
Marks
kumar 89
Rao 78
Ali 67
Singh 96
From List of Dicts:
 Any list of dictionaries can be made into a DataFrame.
 Example:
import pandas as pd
import numpy as np
data = [{'a':i,'b':2*i} for i in range(3)]
print(pd.DataFrame(data))
#alternate way of defining
l1={'a':0,'b':0}
l2={'a':1,'b':2}
l3={'a':2,'b':4}
data = [l1,l2,l3]
print('\n',pd.DataFrame(data))
Output:
a b
0 0 0
1 1 2
2 2 4
a b
0 0 0
1 1 2
2 2 4
From a dictionary of Series objects:
 A DataFrame can be constructed from a dictionary of Series objects
 Example:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
marks = pd.Series(markslist)
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
ages = pd.Series(ageslist)
df = pd.DataFrame({'marks': marks,'ages': ages})
print(df)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
From a two-dimensional NumPy array.
 Given a two-dimensional array of data, we can create a DataFrame with any
specified column and index names. If either is omitted, an integer index
will be used in its place.
 Example:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(1,7,1).reshape(3,2),
columns=['col1', 'col2'],
index=['row1', 'row2', 'row3'])
print(df)
Output:
col1 col2
row1 1 2
row2 3 4
row3 5 6
From a NumPy structured array.
 A Pandas DataFrame operates much like a structured array, and can be
created directly from one:
Example:
import numpy as np
import pandas as pd
sa = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(pd.DataFrame(sa))
Output:
A B
0 0 0.0
1 0 0.0
2 0 0.0
Pandas Index Object
 Both the Series and DataFrame objects contain an explicit index using
which we reference and modify data.
 This Index object is an interesting structure in itself, and it can be thought of
either as an immutable array or as an ordered set.
Example:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
cind =pd.Index(['col1'])
ser = pd.Series([100,200,300,400],index=rind)
df = pd.DataFrame(ser,columns=cind)
print(df)
Output:
col1
row1 100
row2 200
row3 300
row4 400
Use of one Index object shared by several Series:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
ser1 = pd.Series([10,20,30,40],index=rind)
ser2 = pd.Series([50,60,70,80],index=rind)
frame={'col1':ser1,'col2':ser2}
df = pd.DataFrame(frame)
print(df)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
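The two views of an Index mentioned above, immutable array and ordered set, can be sketched as follows. The names indA and indB and their integer labels are illustrative assumptions.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Index as an array: indexing, slicing and array attributes work as usual.
print(indA[1], indA[::2], indA.size)

# Index as immutable: item assignment is rejected with a TypeError.
try:
    indA[1] = 0
    modified = True
except TypeError:
    modified = False

# Index as an ordered set: set-style operations between two Index objects.
common = indA.intersection(indB)
combined = indA.union(indB)
either = indA.symmetric_difference(indB)
print(common, combined, either)
```

Immutability makes it safe to share one Index between several Series and DataFrames, as in the example above, without side effects from accidental modification.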
Operating on Data in Pandas
 Pandas inherits much of its functionality from NumPy, including the ufuncs.
Pandas can therefore perform quick element-wise operations, both
with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and
logarithmic functions, etc.).
 For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
 For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
 Universal functions work on Series and DataFrames through
 Index preservation
 Index alignment
Index Preservation
 Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects.
 We can use all the arithmetic and special universal functions of NumPy on
Pandas objects. The index is preserved (maintained) in the output, as shown below.
import pandas as pd
import numpy as np
ser = pd.Series([10,20,30,40])
df = pd.DataFrame(np.arange(1,13,1).reshape(3,4),columns=['A', 'B', 'C',
'D'])
print(df)
print(np.add(ser,5)) # the indices preserved for series
print(np.add(df,10)) # the indices preserved for DataFrame
Index Alignment in series
 Pandas will align indices in the process of performing the operation. This is
very convenient when we are working with incomplete data.
 Suppose we are combining two different data sources; the indices will be
aligned accordingly, and any label present in only one source gives NaN.
 Example:
import numpy as np
import pandas as pd
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)
print(A.add(B)) #equivalent to A+B
#fill_value is used for any elements in A or B that might be missing
print(A.add(B,fill_value=0))
Output:
0 NaN
1 5.0
2 9.0
3 NaN
0 NaN
1 5.0
2 9.0
3 NaN
0 2.0
1 5.0
2 9.0
3 5.0
Index Alignment in DataFrame
A similar type of alignment takes place for both columns and indices when we are
performing operations on DataFrames.
Example:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.arange(1,5,1).reshape(2,2),columns =list('AB'))
B = pd.DataFrame(np.arange(1,10,1).reshape(3,3),columns =list('BAC'))
print(A)
print(B)
print(A+B)
print(A.add(B,fill_value=0))
fill = A.stack().mean()
print(A.add(B,fill_value=fill))
Output:
A B
0 1 2
1 3 4
B A C
0 1 2 3
1 4 5 6
2 7 8 9
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
A B C
0 3.0 3.0 3.0
1 8.0 8.0 6.0
2 8.0 7.0 9.0
A B C
0 3.0 3.0 5.5
1 8.0 8.0 8.5
2 10.5 9.5 11.5
Operations between DataFrame and Series
 When we are performing operations between a DataFrame and a Series, the
index and column alignment is similarly maintained.
 Operations between a DataFrame and a Series are similar to operations
between a two-dimensional and one-dimensional NumPy array.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20])
df = pd.DataFrame([[100,200],[300,400]])
print(ser)
print(df)
print(df.subtract(ser))
print(df.subtract(ser,axis=0))
Output:
0 10
1 20
0 1
0 100 200
1 300 400
0 1
0 90 180
1 290 380
0 1
0 90 190
1 280 380
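The parallel with NumPy broadcasting can be made explicit with a small sketch. The array A and the column names Q, R, S, T below are illustrative assumptions, not data from the example above.

```python
import numpy as np
import pandas as pd

A = np.array([[3, 8, 2, 4], [2, 6, 4, 8], [6, 1, 3, 8]])
df = pd.DataFrame(A, columns=list('QRST'))

# NumPy broadcasts the first row across all rows.
np_diff = A - A[0]

# Pandas does the same by default: the Series index is matched to the columns.
row_diff = df - df.iloc[0]

# axis=0 matches the Series index to the row labels instead.
col_diff = df.subtract(df['R'], axis=0)

# Labels are aligned, so a partial row only affects the matching columns.
half = df.iloc[0, ::2]          # columns Q and S of the first row
partial = df - half             # R and T have no match and become NaN

print(row_diff)
print(col_diff)
print(partial)
```

Because the operation is driven by labels rather than positions, the data keep their context even when the Series covers only some of the columns.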
Data Selection in DataFrame
DataFrame as a dictionary
Example1:
import pandas as pd
rows = ['row1','row2','row3','row4']
ser1 = pd.Series([10,20,30,40],index=rows)
ser2 = pd.Series([50,60,70,80],index=rows)
data = pd.DataFrame({'col1':ser1,'col2':ser2})
print(data)
print(data['col1']) # dict style
print(data.col1) # attribute style
data['sum'] = data['col1']+data['col2']
print(data)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
row1 10
row2 20
row3 30
row4 40
row1 10
row2 20
row3 30
row4 40
col1 col2 sum
row1 10 50 60
row2 20 60 80
row3 30 70 100
row4 40 80 120
Example2:
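import pandas as pd
marks = pd.Series({'kumar':89,'Rao':78,'Ali':67,'Singh':96})
ages = pd.Series({'kumar':21,'Rao':22,'Ali':19,'Singh':20})
data = pd.DataFrame({'marks': marks,'ages': ages})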
print(data)
print(data['marks'])
print(data.marks)
data['ratio'] = data['marks'] / data['ages']
print(data)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
kumar 89
Rao 78
Ali 67
Singh 96
kumar 89
Rao 78
Ali 67
Singh 96
marks ages ratio
kumar 89 21 4.238095
Rao 78 22 3.545455
Ali 67 19 3.526316
Singh 96 20 4.800000
DataFrame as two-dimensional array
Example1:
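import pandas as pd
rows = ['row1','row2','row3','row4']
ser1 = pd.Series([10,20,30,40],index=rows)
ser2 = pd.Series([50,60,70,80],index=rows)
data = pd.DataFrame({'col1':ser1,'col2':ser2})
print(data)
print(data.values)
print(data.T)
print(data.values[0]) # first row of the underlying array
print(data.iloc[:3,:1]) # implicit integer positions
print(data.loc[:'row3',:'col1']) # explicit labels
# data.ix has been removed from Pandas; use loc or iloc instead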
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
[[10 50]
[20 60]
[30 70]
[40 80]]
row1 row2 row3 row4
col1 10 20 30 40
col2 50 60 70 80
[10 50]
col1
row1 10
row2 20
row3 30
col1
row1 10
row2 20
row3 30
Example2:
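import pandas as pd
marks = pd.Series({'kumar':89,'Rao':78,'Ali':67,'Singh':96})
ages = pd.Series({'kumar':21,'Rao':22,'Ali':19,'Singh':20})
data = pd.DataFrame({'marks': marks,'ages': ages})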
print(data)
print(data.values)
print(data.T)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
[[89 21]
[78 22]
[67 19]
[96 20]]
kumar Rao Ali Singh
marks 89 78 67 96
ages 21 22 19 20
Operating on Data in Pandas

 Pandas inherits much of its functionality from NumPy, including the ufuncs.
Pandas can therefore perform quick element-wise operations, both
with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and
logarithmic functions, etc.).
 For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
 For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
 Index preservation
 Index Alignment
 Operation between DataFrame and Series
Index preservation:
 Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects.
 We can use all the arithmetic and special universal functions of NumPy on
Pandas objects.
 The result is another Pandas object with the indices preserved (maintained),
as shown below.
Example:
import numpy as np
import pandas as pd
s=pd.Series([10,20,30,40])
print(s)
df =pd.DataFrame(np.arange(1,13).reshape(3,4),columns=['A','B','C','D'])
print(df.to_string())
print(np.add(s,5))
print(np.add(df,10).to_string())
Output:
0 10
1 20
2 30
3 40
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
0 15
1 25
2 35
3 45
A B C D
0 11 12 13 14
1 15 16 17 18
2 19 20 21 22
Index Alignment in Data Frame:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.arange(1,5).reshape(2,2),columns=list('AB'))
B = pd.DataFrame(np.arange(1,10).reshape(3,3),columns=list('BAC'))
print(A)
print(B.to_string())
print(A.add(B).to_string())
print(A.add(B,fill_value=0).to_string())
print(A.add(B).to_string())
fill=A.stack().mean()
print(A.add(B,fill_value=fill).to_string())
Output:
A B
0 1 2
1 3 4
B A C
0 1 2 3
1 4 5 6
2 7 8 9
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
A B C
0 3.0 3.0 3.0
1 8.0 8.0 6.0
2 8.0 7.0 9.0
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
A B C
0 3.0 3.0 5.5
1 8.0 8.0 8.5
2 10.5 9.5 11.5
Handling Missing Data
 A number of schemes have been developed to indicate the presence of
missing data in a table or DataFrame.
 Generally, they revolve around one of two strategies: using a mask that
globally indicates missing values, or choosing a sentinel value that indicates
a missing entry.
 In the masking approach, the mask might be an entirely separate Boolean
array, or it may involve appropriation of one bit in the data representation to
locally indicate the null status of a value.
 In the sentinel approach, the sentinel value could be some data-specific
convention, such as indicating a missing integer value with –9999 or some
rare bit pattern, or it could be a more global convention, such as indicating a
missing floating-point value with NaN (Not a Number), a special value
which is part of the IEEE floating-point specification.
 Example:
import numpy as np
import pandas as pd
arr1 =np.array([1,2,3,4])
print(arr1)
print(arr1.sum())
arr2 =np.array([1,None,3,4])
print(arr2)
#print(arr2.sum())
arr3 =np.array([1,np.nan,3,4])
print(arr3)
print(arr3.sum())
print(np.nansum(arr3))
Output:
[1 2 3 4]
10
[1 None 3 4]
[ 1. nan 3. 4.]
nan
8.0
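The two strategies described above can be compared side by side. This is a minimal sketch, assuming a hypothetical readings array in which –9999 was used as a data-specific sentinel; it uses NumPy's masked arrays for the masking approach and NaN for the sentinel approach.

```python
import numpy as np

readings = np.array([12.5, -9999.0, 14.0, 13.5])   # hypothetical measurements

# Masking approach: a separate Boolean array marks the missing entries.
missing = readings == -9999.0
masked = np.ma.masked_array(readings, mask=missing)
print(masked.mean())          # mean over the three valid values only

# Sentinel approach: replace the data-specific code with NaN.
with_nan = np.where(missing, np.nan, readings)
print(with_nan.mean())        # nan, because NaN propagates
print(np.nanmean(with_nan))   # NaN-aware aggregate skips the sentinel
```

The mask costs an extra Boolean array, while the NaN sentinel needs no extra storage but only works for floating-point data and requires the NaN-aware functions.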
Missing Data in Pandas
 The way in which Pandas handles missing values is constrained by its
reliance on the NumPy package, which does not have a built-in notion of NA
values for non-floating-point data types.
 NumPy supports fourteen basic integer types once we account for available
precisions, signedness, and endianness of the encoding.
 Reserving a specific bit pattern in all available NumPy types would lead to
an unwieldy amount of overhead in special-casing various operations for
various types, likely even requiring a new fork of the NumPy package.
 Pandas chose to use sentinels for missing data, and further chose to use two
already-existing Python null values: the special floating-point NaN value,
and the Python None object.
 This choice has some side effects, as we will see, but in practice ends up
being a good compromise in most cases of interest.
None: Pythonic missing data

 The first sentinel value used by Pandas is None, a Python singleton object
that is often used for missing data in Python code. Because None is a Python
object, it cannot be used in any arbitrary NumPy/Pandas array, but only in
arrays with data type 'object' (i.e., arrays of Python objects)
 This dtype=object means that the best common type representation NumPy
could infer for the contents of the array is that they are Python objects.
NaN: Missing numerical data
 NaN is a special floating-point value recognized by all systems that use the
standard IEEE floating-point representation.
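The practical difference between the two sentinels can be seen in a short sketch. The variable names are illustrative; the behaviour shown is standard NumPy.

```python
import numpy as np

vals_obj = np.array([1, None, 3, 4])
print(vals_obj.dtype)            # object

# Aggregations on an object array containing None fail with a TypeError.
try:
    vals_obj.sum()
    summed = True
except TypeError:
    summed = False

vals_nan = np.array([1, np.nan, 3, 4])
print(vals_nan.dtype)            # float64
print(1 + np.nan, 0 * np.nan)    # NaN propagates through arithmetic
print(np.nansum(vals_nan), np.nanmin(vals_nan), np.nanmax(vals_nan))
```

An object array falls back to Python-level operations, which are slower and break on None, whereas a NaN-containing array keeps a native floating-point type and supports fast, NaN-aware aggregates.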
NaN and None in Pandas
 NaN and None both have their place, and Pandas is built to handle the two
of them nearly interchangeably.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([1,np.nan,2,None])
print(ser)
df = pd.DataFrame([[1,None],[3,np.nan],[None,6],[np.nan,8]])
print(df)
Output:
0 1.0
1 NaN
2 2.0
3 NaN
0 1
0 1.0 NaN
1 3.0 NaN
2 NaN 6.0
3 NaN 8.0
Operating on Null Values
 There are several useful methods for detecting, removing, and replacing null
values in Pandas data structures.
 They are:
 isnull() - Generate a Boolean mask indicating missing values
 notnull() - Opposite of isnull()
 dropna() - Return a filtered version of the data
 fillna() - Return a copy of the data with missing values filled or
imputed
Detecting null values

 Pandas data structures have two useful methods for detecting null data:
isnull() and notnull().
 Either one will return a Boolean mask(Boolean arrays) over the data.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([1,np.nan,'hello',None])
df = pd.DataFrame([[np.nan,10,'hai'],[20,30,'wow']])
print(ser)
print(ser.isnull())
print(ser.notnull())
print(df)
print(df.isnull())
print(df.notnull())
Output:
0 1
1 NaN
2 hello
3 None
0 False
1 True
2 False
3 True
0 True
1 False
2 True
3 False
0 1 2
0 NaN 10 hai
1 20.0 30 wow
0 1 2
0 True False False
1 False False False
0 1 2
0 False True True
1 True True True
Dropping Null values

 For dropping null values dropna() method (which removes NA values) is
used.
 We cannot drop single values from a DataFrame; we can only drop full rows
or full columns.
 By default, dropna() will drop all rows in which any null value is present:
 Alternatively, we can drop NA values along a different axis; axis=1 drops
all columns containing a null value.
 We can also drop rows or columns with all NA values, or a majority of NA
values. This can be specified through the how or thresh parameters, which
allow fine control of the number of nulls to allow through. The default is
how='any', such that any row or column (depending on the axis keyword)
containing a null value will be dropped. We can also specify how='all',
which will only drop rows/columns that are all null values:
 For finer-grained control, the thresh parameter can be used to specify a
minimum number of non-null values for the row/column to be kept:
 Example:
import numpy as np
import pandas as pd
ser = pd.Series([1,np.nan,'hello',None])
df = pd.DataFrame([[np.nan,10,'hai'],[20,30,'wow']])
print(ser)
print(df)
print(ser.dropna())
print(df.dropna())
print(df.dropna(axis =1))
print(df.dropna(axis ='columns')) #equivalent to axis =1
Output:
0 1
1 NaN
2 hello
3 None
0 1 2
0 NaN 10 hai
1 20.0 30 wow
0 1
2 hello
0 1 2
1 20.0 30 wow
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,10,'hai',None],[20,30,'wow',None]])
print(df)
print(df.dropna())
print(df.dropna(axis =1))
print(df.dropna(axis ='columns')) #equivalent to axis =1
print(df.dropna(axis ='columns',how='all'))
print(df.dropna(axis ='columns',thresh=2))
Output:
0 1 2 3
0 NaN 10 hai None
1 20.0 30 wow None
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
0 1 2
0 NaN 10 hai
1 20.0 30 wow
1 2
0 10 hai
1 30 wow
Filling null values
 We can replace NA values with a valid value. This value might be a single
number like zero, or it might be some sort of imputation or interpolation
from the good values.
 Pandas provides the fillna() method, which returns a copy of the array with
the null values replaced.
 Example: filling null values in Series
import numpy as np
import pandas as pd
ser = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(ser)
print(ser.fillna(0))
print(ser.ffill()) #forward fill; older Pandas: ser.fillna(method='ffill')
print(ser.bfill()) #backward fill; older Pandas: ser.fillna(method='bfill')
Output:
a 1.0
b NaN
c 2.0
d NaN
e 3.0
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
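Besides a constant or a forward/backward fill, the imputation and interpolation mentioned above can be sketched as follows. Filling with the mean and using linear interpolation are illustrative choices here, not the only options.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Imputation: replace each missing value with the mean of the valid values.
mean_filled = ser.fillna(ser.mean())

# Interpolation: estimate each missing value from its neighbours (linear).
interpolated = ser.interpolate()

print(mean_filled)
print(interpolated)
```

Mean imputation gives every gap the same value (2.0 here), while interpolation gives each gap a value that lies between its neighbours.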
 Example: Filling null values in DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, np.nan, 2,None],
[2, 3, 5, None],
[np.nan, 4, 6, None]])
print(df)
print(df.ffill(axis=1)) #older Pandas: df.fillna(method='ffill', axis=1)
print(df.bfill(axis=1))
print(df.ffill(axis=0))
print(df.bfill(axis=0))
Output:
0 1 2 3
0 1.0 NaN 2 None
1 2.0 3.0 5 None
2 NaN 4.0 6 None
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
0 1 2 3
0 1.0 2.0 2.0 NaN
1 2.0 3.0 5.0 NaN
2 4.0 4.0 6.0 NaN
0 1 2 3
0 1.0 NaN 2 None
1 2.0 3.0 5 None
2 2.0 4.0 6 None
0 1 2 3
0 1.0 3.0 2 None
1 2.0 3.0 5 None
2 NaN 4.0 6 None
Hierarchical Indexing
 Hierarchical indexing (also known as multi-indexing) is used to incorporate
multiple index levels within a single index.
 In this way, higher-dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame
objects.
 A Multiply Indexed Series: Here we represent two-dimensional data within
a one-dimensional Series.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20,30,40,50,60],index = [[1,1,1,2,2,2,],
['a','b','c','a','b','c']])
print(ser)
ser.index.names = ['ind1','ind2']
print(ser)
Output:
1 a 10
b 20
c 30
2 a 40
b 50
c 60
ind1 ind2
1 a 10
b 20
c 30
2 a 40
b 50
c 60
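Once a Series carries two index levels, values can be selected on either level and the data can be converted to and from a two-dimensional layout. The sketch below rebuilds the Series from the example above; the selections shown are illustrative.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50, 60],
                index=[[1, 1, 1, 2, 2, 2], ['a', 'b', 'c', 'a', 'b', 'c']])
ser.index.names = ['ind1', 'ind2']

single = ser.loc[1, 'b']        # one value, addressed by both levels
outer = ser.loc[1]              # partial indexing on the first level
inner = ser.loc[:, 'a']         # every first-level label, second level 'a'

table = ser.unstack()           # second level becomes the columns
restored = table.stack()        # back to a multiply indexed Series

print(single)
print(outer)
print(inner)
print(table)
```

The unstack() and stack() pair shows that the multiply indexed Series and the ordinary DataFrame are two representations of the same two-dimensional data.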
 A Multiply Indexed DataFrame:
Example:
import numpy as np
import pandas as pd
data = [[25,24],[28,26],[29,28],[27,26],[30,29],[28,27]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = ['DS','DO']
df = pd.DataFrame(data,index=ind,columns=col)
print(df)
df.index.names =['rollNo','mid']
print(df)
Output:
DS DO
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
DS DO
rollNo mid
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
Example:
Python program to create following table of data
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
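ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
print(df)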
Output:
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Example:
Python program to create following table:
Type Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
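ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
df.index.names =['RollNo','Mid']
df.columns.names =['Type','Sub']
print(df)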
Output:
Type Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
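The same table can also be built from explicit MultiIndex objects, and its values can then be selected through the hierarchical labels. This is a sketch that reuses the data above; the use of pd.MultiIndex.from_product and pd.MultiIndex.from_arrays is one possible construction, chosen here for illustration.

```python
import pandas as pd

# Row and column MultiIndex objects built explicitly (same labels as above).
index = pd.MultiIndex.from_product([['1201', '1264', '12C7'], ['mid1', 'mid2']],
                                   names=['RollNo', 'Mid'])
columns = pd.MultiIndex.from_arrays([['Dept', 'Dept', 'Other', 'Other'],
                                     ['DS', 'DO', 'MOB', 'EPC']],
                                    names=['Type', 'Sub'])
data = [[25, 24, 23, 15], [28, 26, 23, 21], [29, 28, 27, 26],
        [27, 26, 24, 25], [30, 29, 28, 27], [28, 27, 25, 26]]
df = pd.DataFrame(data, index=index, columns=columns)

ds_marks = df['Dept', 'DS']                 # one column, as a Series
student = df.loc['1264']                    # all rows for one roll number
one_cell = df.loc[('12C7', 'mid1'), ('Other', 'MOB')]

print(ds_marks)
print(student)
print(one_cell)
```

Selecting with only the outer label (df.loc['1264']) returns the sub-table for that roll number, indexed by the remaining level.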
Combining Datasets
 Some of the most interesting studies of data come from combining different
data sources.
 These operations can involve anything from very straightforward
concatenation of two different datasets, to more complicated database-style
joins and merges that correctly handle any overlaps between the dataset.
 These operations can be:
 simple concatenation of Series and DataFrames with the pd.concat
function
 in-memory merges and joins implemented in Pandas.
Simple Concatenation with pd.concat
 Pandas has a function, pd.concat(), which has a similar syntax to
np.concatenate but contains a number of other options
 pd.concat() can be used for a simple concatenation of Series or DataFrame
objects, just as np.concatenate() can be used for simple concatenations of
arrays
import pandas as pd
import numpy as np
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
print(pd.concat([ser1, ser2]))
Output:
1 A
2 B
3 C
4 D
5 E
6 F
 Concatenation in data frame:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['A','B'])
print(df1); print(df2); print(pd.concat([df1, df2]))
Output:
A B
1 10 20
2 30 40
A B
1 50 60
2 70 80
A B
1 10 20
2 30 40
1 50 60
2 70 80
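Note that the concatenated result above repeats the index labels 1 and 2, because pd.concat preserves the original indices. The sketch below shows three pd.concat options for dealing with this; the key names 'first' and 'second' are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]], index=[1, 2], columns=['A', 'B'])
df2 = pd.DataFrame([[50, 60], [70, 80]], index=[1, 2], columns=['A', 'B'])

# Option 1: discard the old labels and build a fresh integer index.
renumbered = pd.concat([df1, df2], ignore_index=True)

# Option 2: treat repeated labels as an error.
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicates_found = False
except ValueError:
    duplicates_found = True

# Option 3: tag each source with a key, giving a hierarchical index.
tagged = pd.concat([df1, df2], keys=['first', 'second'])

print(renumbered)
print(duplicates_found)
print(tagged)
```

Which option fits depends on whether the original labels carry meaning: ignore_index drops them, verify_integrity guards them, and keys keeps them while recording the source of each row.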
 By default, the concatenation takes place row-wise within the DataFrame
(i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis
along which concatenation will take place.
Example:
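import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['C','D'])
print(df1); print(df2);
print(pd.concat([df1, df2],axis=1).to_string())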
Output:
A B
1 10 20
2 30 40
C D
1 50 60
2 70 80
A B C D
1 10 20 50 60
2 30 40 70 80
 By default, the entries for which no data is available are filled with NA
values. To change this, we can use the join parameter of pd.concat (the
join_axes parameter found in older versions was removed in Pandas 1.0). By
default, the join is a union of the input columns (join='outer'), but we can
change this to an intersection of the columns using join='inner':
Example:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[1,2,3],[4,5,6]],index=[1,2],columns=['A','B','C'])
df2 =pd.DataFrame([[7,8,9],[10,11,12]],index=[1,2],columns=['B','C','D'])
print(df1.to_string()); print(df2.to_string())
print(pd.concat([df1, df2]).to_string())
print(pd.concat([df1, df2],join='inner'))
Output:
A B C
1 1 2 3
2 4 5 6
B C D
1 7 8 9
2 10 11 12
A B C D
1 1.0 2 3 NaN
2 4.0 5 6 NaN
1 NaN 7 8 9.0
2 NaN 10 11 12.0
B C
1 2 3
2 5 6
1 7 8
2 10 11
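 If the result should keep exactly the columns of one input, rather than the union or the intersection, one option is to reindex the other DataFrame before concatenating. A minimal sketch, repeating the df1 and df2 of the example above (the name result is illustrative only):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[1, 2], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], index=[1, 2], columns=['B', 'C', 'D'])

# Align df2 to df1's columns: 'D' is dropped and the missing 'A' becomes NaN
result = pd.concat([df1, df2.reindex(columns=df1.columns)])
print(result)
```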
The append() method
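 In older versions of Pandas, Series and DataFrame objects had an append
method that accomplished the same concatenation in fewer keystrokes: rather
than calling pd.concat([df1, df2]), we could simply call df1.append(df2).
 The append method was deprecated in Pandas 1.4 and removed in Pandas 2.0,
so current code should use pd.concat instead:
print(df1); print(df2); print(pd.concat([df1, df2]))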
Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join
and merge operations.
Categories of Joins
 One-to-one joins
 Many-to-one joins
 Many-to-many joins
One-to-One Join:
 A one-to-one join combines two DataFrames when there is a unique
matching key in both DataFrames. Each key appears only once in both
DataFrames.
 This type of join results in a DataFrame where each row from the left
DataFrame is combined with exactly one matching row from the right
DataFrame.
 Example of a one-to-one join:
import pandas as pd
# Create two DataFrames
left_df = pd.DataFrame({'key': ['A', 'B', 'C'],'value_left': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['A', 'B', 'D'],
                         'value_right': ['apple', 'banana', 'dog']})
# Perform a one-to-one join on the 'key' column
one_to_one_df = pd.merge(left_df, right_df, on='key', how='inner')
print(one_to_one_df)
Output:
key value_left value_right
0 A 1 apple
1 B 2 banana
Many-to-One Join:
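 A many-to-one join combines two DataFrames when the key is unique in one
DataFrame but appears multiple times in the other. In the example below, the
key 'A' is repeated in the left DataFrame, while each key appears only once
in the right DataFrame.
 This type of join results in a DataFrame where each row from the left
DataFrame is combined with its single matching row from the right
DataFrame, so entries from the right DataFrame are repeated as required.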
 Example of a many-to-one join:
import pandas as pd

left_df = pd.DataFrame({'key': ['A', 'B', 'A'], 'value_left': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['A', 'B'], 'value_right': ['apple', 'banana']})
# Perform a many-to-one join on the 'key' column
many_to_one_df = pd.merge(left_df, right_df, on='key', how='inner')
print(many_to_one_df)
Output:
key value_left value_right
0 A 1 apple
1 A 3 apple
2 B 2 banana
Many-to-Many Join:
 A many-to-many join combines two DataFrames when the key is repeated in
both DataFrames.
 This type of join results in a DataFrame where each row from the left
DataFrame is combined with all matching rows from the right DataFrame.
 Example of a many-to-many join:
import pandas as pd
left_df = pd.DataFrame({'key': ['A', 'B', 'A'], 'value_left': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['A', 'B', 'A'],
                         'value_right': ['apple', 'banana', 'apple']})
# Perform a many-to-many join on the 'key' column
many_to_many_df = pd.merge(left_df, right_df, on='key', how='inner')
print(many_to_many_df)
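Output:
key value_left value_right
0 A 1 apple
1 A 1 apple
2 A 3 apple
3 A 3 apple
4 B 2 banana
 Key 'A' appears twice in each DataFrame, so it contributes 2 x 2 = 4 rows;
key 'B' contributes one row. The order of the rows may vary with the Pandas
version.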
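 pd.merge does not need to be told which of the three categories applies; it works this out from the keys. When we want Pandas to check our expectation, the validate parameter of pd.merge can be used. A minimal sketch, repeating the DataFrames of the many-to-one example (the names checked and one_to_one_ok are illustrative only):

```python
import pandas as pd
from pandas.errors import MergeError

left_df = pd.DataFrame({'key': ['A', 'B', 'A'], 'value_left': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['A', 'B'], 'value_right': ['apple', 'banana']})

# Keys repeat on the left and are unique on the right: many-to-one passes
checked = pd.merge(left_df, right_df, on='key', validate='many_to_one')
print(len(checked))   # one output row per row of left_df

# Claiming one-to-one is wrong here, so pandas raises MergeError
try:
    pd.merge(left_df, right_df, on='key', validate='one_to_one')
    one_to_one_ok = True
except MergeError as err:
    one_to_one_ok = False
    print("MergeError:", err)
```

Such a check is useful when unexpected duplicate keys would otherwise silently multiply the number of rows.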
join() Method:
 join() is a DataFrame method that is more concise and primarily used for
combining DataFrames based on their indices.
 It defaults to a left join and only allows joining DataFrames on their indices
by default.
 It is more convenient when we want to join DataFrames that share the same
index or have common index labels.
 Example:
import pandas as pd
# Create two DataFrames with overlapping index labels
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']},
                   index=['index0', 'index1', 'index2', 'index3'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3']},
                   index=['index1', 'index2', 'index3', 'index4'])
# Join based on the index
joined_df = df1.join(df2, how='left')
print(joined_df)
Output:
A B
index0 A0 NaN
index1 A1 B0
index2 A2 B1
index3 A3 B2
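 The same index-based join can also be written with pd.merge by setting left_index and right_index. A minimal sketch, which also shows how='inner' keeping only the labels present in both indices (the names via_join, via_merge and inner are illustrative only):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']},
                   index=['index0', 'index1', 'index2', 'index3'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3']},
                   index=['index1', 'index2', 'index3', 'index4'])

# join() on the index ...
via_join = df1.join(df2, how='left')
# ... can also be expressed as merge() with both index flags set
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(via_join)
print(via_merge)

# An inner join keeps only the labels found in both indices
inner = df1.join(df2, how='inner')
print(inner)
```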
Specifying Set Arithmetic for Joins

(Types of joins based on common column)
 In pandas, we can perform various types of joins (also known as merge
operations) to combine data from multiple DataFrames based on a common
key or column. The common types of joins include inner join, outer join, left
join, and right join.
Inner Join:
 An inner join returns only the rows that have matching keys in both
DataFrames. Rows from either DataFrame that do not have a match in the
other DataFrame are excluded from the result.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
                    'C': ['C2', 'C3', 'C4']})
# Perform an inner join on the 'A' column
inner_join_result = pd.merge(df1, df2, on='A', how='inner')
# Display the result
print("Inner Join Result:")
print(inner_join_result)
Output:
Inner Join Result:
A B C
0 A2 B2 C2
Outer Join:
 An outer join returns all the rows from both DataFrames and fills in missing
values with NaN (or specified fill values) for columns that don't have a
match in the other DataFrame.
 Example:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
                    'C': ['C2', 'C3', 'C4']})
# Perform an outer join on the 'A' column
outer_join_result = pd.merge(df1, df2, on='A', how='outer')
# Display the result
print("Outer Join Result:")
print(outer_join_result)
Output:
Outer Join Result:
A B C
0 A0 B0 NaN
1 A1 B1 NaN
2 A2 B2 C2
3 A3 NaN C3
4 A4 NaN C4
 We can also perform left joins and right joins using the how='left' and
how='right' parameters, respectively. Left join includes all rows from
the left DataFrame and matches from the right DataFrame, while right
join includes all rows from the right DataFrame and matches from the
left DataFrame.
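 A minimal sketch of the left and right joins described above, using the same sample DataFrames as the inner and outer join examples. The final call uses the indicator option of pd.merge, which is not covered in these notes, to show which DataFrame each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'], 'C': ['C2', 'C3', 'C4']})

# Left join: every row of df1 is kept; C is NaN where df2 has no match
left_join_result = pd.merge(df1, df2, on='A', how='left')
print(left_join_result)

# Right join: every row of df2 is kept; B is NaN where df1 has no match
right_join_result = pd.merge(df1, df2, on='A', how='right')
print(right_join_result)

# indicator=True adds a _merge column recording where each row came from
outer = pd.merge(df1, df2, on='A', how='outer', indicator=True)
print(outer[['A', '_merge']])
```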
Aggregation and Grouping
 An essential piece of analysis of large data is efficient summarization:
computing aggregations like sum(), mean(), median(), min(), and max(), in
which a single number gives insight into the nature of a potentially large
dataset.
 Aggregation in pandas can be performed by:
 Simple Aggregation
 Operations based on the concept of a groupby.
Simple Aggregation in Pandas
 As with a one dimensional NumPy array, for a Pandas Series the aggregates
return a single value:
Example:
import pandas as pd
import numpy as np
ser = pd.Series([10,20,30,40,50])
print(ser.sum())
print(ser.mean())
Output:
150
30.0
 For a DataFrame, by default the aggregates return results within each column.
By specifying the axis argument, we can instead aggregate within each row.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.sum())
print(df.mean())
print(df.sum(axis ='columns'))
print(df.mean(axis = 'columns'))
Output:
A 15
B 150
dtype: int64
A 3.0
B 30.0
dtype: float64
0 11
1 22
2 33
3 44
4 55
dtype: int64
0 5.5
1 11.0
2 16.5
3 22.0
4 27.5
dtype: float64
 Pandas Series and DataFrames include all of the common aggregates. In
addition, there is a convenience method describe() that computes several
common aggregates for each column and returns the result.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.describe())
Output:
A B
count 5.000000 5.000000
mean 3.000000 30.000000
std 1.581139 15.811388
min 1.000000 10.000000
25% 2.000000 20.000000
50% 3.000000 30.000000
75% 4.000000 40.000000
max 5.000000 50.000000
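 Some of the other built-in Pandas aggregations are:
count() : total number of items
mean(), median() : mean and median
min(), max() : minimum and maximum
std(), var() : standard deviation and variance
prod() : product of all items
sum() : sum of all items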
GroupBy: Split, Apply, Combine

 In pandas, the groupby operation is used to group rows of a DataFrame
based on the values in one or more columns. Once the data is grouped, we
can perform various aggregate operations on each group, such as calculating
the mean, sum, count, or applying custom functions.
 The groupby operation allows us to compute aggregates on subsets of data
quickly and efficiently.
 The groupby operation is used to aggregate conditionally on some label or
index.
 The name “group by” comes from a command in the SQL database
language, but it is perhaps more illuminative to think of it in the terms first
coined by Hadley Wickham of Rstats fame: split, apply, combine.
 The split step involves breaking up and grouping a DataFrame
depending on the value of the specified key.
 The apply step involves computing some function, usually an
aggregate, transformation, or filtering, within the individual groups.
 The combine step merges the results of these operations into an
output array.
 Example program
import pandas as pd
import numpy as np
df = pd.DataFrame({'key':['A','B','C','A','B','C'],
'data':np.arange(1,7)},columns=['key','data'])
print(df)
print(df.groupby('key').sum())
Output:
key data
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
data
key
A 5
B 7
C 9
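 To make the three steps concrete, the groupby result above can be reproduced by hand. This is only an illustrative sketch of the idea (the names pieces, sums and manual are hypothetical), not how Pandas implements groupby internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': np.arange(1, 7)})

# Split: one sub-DataFrame per distinct key value
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one labelled object
manual = pd.Series(sums, name='data')
manual.index.name = 'key'
print(manual)

# groupby performs the same three steps in a single call
print(df.groupby('key')['data'].sum())
```

In practice the intermediate pieces are never built explicitly; groupby avoids that overhead, which is why it is preferred over a manual loop.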
Exercise program:
A DataFrame consists of three columns: 'Date', 'Product' and 'Sales'. Create the
DataFrame, group the data by product and calculate the sum of sales.
Solution:
import pandas as pd
data = {'Date': ['01-10-2023', '02-10-2023', '01-10-2023', '04-10-2023', '01-10-2023'],
        'Product': ['A', 'B', 'A', 'A', 'B'],
        'Sales': [100, 180, 120, 80, 200]}
df = pd.DataFrame(data)
# Select the Sales column so that the Date strings are not summed as well
product_sales = df.groupby('Product')['Sales'].sum()
print(product_sales)
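 The apply step is not limited to a single aggregate. A minimal sketch, reusing the sales figures of the exercise, of the agg, transform and filter methods of a GroupBy object (the threshold of 300 is an arbitrary value chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'A', 'B'],
                   'Sales': [100, 180, 120, 80, 200]})
grouped = df.groupby('Product')['Sales']

# Aggregate: several summary statistics per product at once
summary = grouped.agg(['sum', 'mean', 'max'])
print(summary)

# Transform: same length as the input, here each sale minus its group mean
centered = grouped.transform(lambda s: s - s.mean())
print(centered)

# Filter: keep only the rows of products whose total sales exceed 300
big = df.groupby('Product').filter(lambda g: g['Sales'].sum() > 300)
print(big)
```

Product A totals exactly 300, so only the rows of product B (total 380) pass the filter.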
Pivot Tables
 A pivot table in pandas is a data manipulation technique that allows us to
restructure and summarize data from a DataFrame, making it easier to
analyze and visualize. Pivot tables are commonly used for tasks like data
aggregation, summarization, and cross-tabulation.
 A pivot table is an operation similar to GroupBy that is commonly seen in
spreadsheets and other programs that operate on tabular data.
 The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional
summarization of the data.
 We can think of pivot tables as essentially a multidimensional version of
GroupBy aggregation, i.e., we split-apply-combine, but both the split
and the combine happen not across a one-dimensional index, but across a
two-dimensional grid.
 Pivot Table Syntax: The call signature of the pd.pivot_table function,
omitting a few less commonly used arguments, is as follows (the
DataFrame.pivot_table method takes the same arguments except data):
pd.pivot_table(data, values=None, index=None,
columns=None, aggfunc='mean',
fill_value=None, margins=False,
dropna=True, margins_name='All')
where
data : pandas DataFrame
index : column(s) whose values group the data into rows
values : column(s) to aggregate
columns : column(s) whose values are displayed horizontally across the top
of the resulting table
fill_value and dropna : control the handling of missing data
aggfunc : the type of aggregation applied, which is a mean by default
margins : when True, computes totals along each grouping
margins_name : label of the row and column holding those totals ('All' by
default)
 Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Kumar', 'Rao', 'Ali', 'Singh'],
                   'Job': ['FullTimeEmployee', 'Intern', 'PartTime Employee', 'FullTimeEmployee'],
                   'Dept': ['Admin', 'Tech', 'Admin', 'management'],
                   'YOJ': [2018, 2019, 2018, 2010],
                   'Sal': [20000, 50000, 10000, 20000]})
output = pd.pivot_table(data=df, index=['Job'], columns=['Dept'],
                        values='Sal', aggfunc='mean')
print('\n')
print(output.to_string())
Output:
Exercise program:
A DataFrame consists of three columns: 'Date', 'Product' and 'Sales'. Create the
DataFrame and display the total sales for each product on each date.
Solution:
import pandas as pd
# Sample data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 150, 200, 120, 180]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df)
# Create a pivot table
pivot_table = df.pivot_table(values='Sales', index='Date', columns='Product',
aggfunc='sum')
# Display the pivot table
print("\nPivot Table:")
print(pivot_table)
Output:
Original DataFrame:
Date Product Sales
0 2023-01-01 A 100
1 2023-01-01 B 150
2 2023-01-02 A 200
3 2023-01-02 B 120
4 2023-01-03 A 180
Pivot Table:
Product A B
Date
2023-01-01 100.0 150.0
2023-01-02 200.0 120.0
2023-01-03 180.0 NaN
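 Since a pivot table is a two-dimensional GroupBy, the same table can be built by grouping on both columns and unstacking one of them. A minimal sketch using the data of the solution above; it also shows the fill_value and margins options (the names by_groupby, by_pivot and with_totals are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
})

# GroupBy route: split on two keys, sum, then move Product into the columns
by_groupby = df.groupby(['Date', 'Product'])['Sales'].sum().unstack()
print(by_groupby)

# pivot_table route: the same summary in one call
by_pivot = df.pivot_table(values='Sales', index='Date', columns='Product',
                          aggfunc='sum')
print(by_pivot)

# fill_value replaces missing cells; margins=True adds 'All' totals
with_totals = df.pivot_table(values='Sales', index='Date', columns='Product',
                             aggfunc='sum', fill_value=0, margins=True)
print(with_totals)
```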
NumPy vs Pandas
NumPy | Pandas
1. NumPy stands for Numerical Python. | Pandas stands for PANel DAta Structure.
2. NumPy is primarily designed for numerical and mathematical operations and scientific computing. | Pandas is designed for data manipulation and analysis, particularly for working with structured and labeled data.
3. NumPy's core data structure is the ndarray (n-dimensional array), which is homogeneous (all elements have the same data type). | Pandas offers two main data structures: DataFrames and Series.
4. NumPy is best suited for numerical and scientific computing tasks, including linear algebra, statistical analysis, and mathematical operations on large arrays of numerical data. | Pandas is ideal for working with structured data in data science, business analytics, and machine learning applications.
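 The difference in data structures can be seen in a few lines. A minimal sketch with made-up values (the roll numbers and marks are illustrative only):

```python
import numpy as np
import pandas as pd

# A NumPy array is homogeneous: mixing types forces one common dtype
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)          # every element is stored as a float
print(arr[:, 1].mean())   # columns are addressed by integer position

# A DataFrame holds labelled columns, each with its own dtype
df = pd.DataFrame({'name': ['Kumar', 'Rao'], 'marks': [25, 28]},
                  index=['1201', '1264'])
print(df.dtypes)
print(df.loc['1264', 'marks'])   # rows and columns are addressed by label

# The numeric part of a DataFrame converts back to a NumPy array when needed
print(df[['marks']].to_numpy())
```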
Tutorial Questions
1. Illustrate different categories of basic array manipulations with examples.
2. What are universal functions in NumPy arrays? Explain the different advanced features of
universal functions.
3. Discuss and demonstrate some of the built-in aggregation functions in NumPy.
4. What is broadcasting in NumPy? Discuss the different rules of broadcasting with examples.
5. What is Boolean masking in NumPy? Explain with an example.
6. What is fancy indexing in NumPy? Discuss and demonstrate fancy indexing in NumPy.
7. Demonstrate the use of structured arrays and record arrays in NumPy.
8. How can fancy indexing be combined with other indexing schemes?
9. Illustrate different attributes of NumPy arrays with examples.
10. Write a short note on computation on NumPy arrays.
11. Explain the fundamental data objects in pandas and their construction.
12. Briefly explain hierarchical indexing with examples.
13. What is a pivot table? Explain it clearly.
14. Demonstrate data indexing and selection in Pandas Series and DataFrame objects.
15. Write a short note on operating on data in Pandas.
16. Demonstrate different methods of constructing a MultiIndex.
17. How is missing data handled in pandas?
18. Illustrate different approaches to combining data from multiple sources in pandas.
19. Explore aggregation and grouping in Pandas.
20. Briefly explore and demonstrate different methods for operating on null values.
Assignment Questions:
1. Write a Python program to demonstrate the attributes of arrays in NumPy.
2. Write a Python program to demonstrate the indexing of arrays in NumPy.
3. Write a Python program to demonstrate the slicing of arrays in NumPy.
4. Write a Python program to demonstrate the reshaping of arrays in NumPy.
5. Write a Python program to demonstrate the joining and splitting of arrays in NumPy.
6. Write a Python program to demonstrate the aggregation universal functions in NumPy.
7. Write a Python program to demonstrate broadcasting in NumPy.
8. Write a Python program to demonstrate Boolean masking in NumPy.
9. Write a Python program to demonstrate fancy indexing in NumPy.
10. Write a Python program to demonstrate the use of structured arrays and record arrays in
NumPy.
11. Write a Python program to illustrate different ways of creating a pandas Series.
12. Write a Python program to illustrate different ways of creating a pandas DataFrame.
13. Write a Python program to illustrate detecting null values in a pandas DataFrame.
14. Write a Python program to illustrate dropping null values in a pandas DataFrame.
15. Write a Python program to illustrate filling null values in a pandas DataFrame.
16. Write a Python program to illustrate different ways of creating a pandas MultiIndex.
17. Write a Python program to illustrate indexing, slicing, Boolean indexing and fancy indexing in
MultiIndex.
18. Write a Python program to illustrate merging two data sets with joins (inner, left and right) in
pandas.
19. Write a python program to illustrate GroupBy operation of pandas.
20. Write a python program to illustrate pivot table in pandas.
