FDS Unit 4
UNIT-4
Python for Data Handling
Syllabus:
Python for Data Handling: Basics of NumPy arrays – aggregations – computations on arrays –
comparisons, masks, Boolean logic – fancy indexing – structured arrays. Data manipulation with
Pandas – data indexing and selection – operating on data – missing data – hierarchical indexing
– combining datasets – aggregation and grouping – pivot tables.
Introduction to NumPy Arrays
Datasets can come from a wide range of sources and a wide range of
formats, including collections of documents, collections of images,
collections of sound clips, collections of numerical measurements, or nearly
anything else. Despite this apparent heterogeneity, it will help us to think of
all data fundamentally as arrays of numbers.
For example, images—particularly digital images—can be thought of as
simply two dimensional arrays of numbers representing pixel brightness
across the area. Sound clips can be thought of as one-dimensional arrays of
intensity versus time. Text can be converted in various ways into numerical
representations, perhaps binary digits representing the frequency of certain
words or pairs of words. No matter what the data are, the first step in making
them analyzable will be to transform them into arrays of numbers.
For this reason, efficient storage and manipulation of numerical arrays is
absolutely fundamental to the process of doing data science.
NumPy (short for Numerical Python) provides an efficient interface to store
and operate on dense data buffers.
In some ways, NumPy arrays are like Python’s built-in list type, but NumPy
arrays provide much more efficient storage and data operations as the arrays
grow larger in size.
NumPy arrays form the core of nearly the entire ecosystem of data science
tools in Python.
NumPy is a Python library for working with arrays; it was created in 2005
by Travis Oliphant.
The library also provides functions for Fourier transforms, linear algebra,
and matrix operations.
In particular, NumPy arrays provide an efficient way of storing and
manipulating data. NumPy also includes a number of functions that make it
easy to perform mathematical operations on arrays. This can be really useful
for scientific or engineering applications.
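As a rough illustration of how an array differs from a list, the sketch below doubles every value in both containers. The variable names are illustrative and not taken from these notes; the point is that the array operation is written once for the whole array instead of element by element.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# With a plain list, element-wise work needs an explicit loop or comprehension
doubled_list = [v * 2 for v in values]

# With a NumPy array, the same operation applies to every element at once
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)   # a Python list
print(doubled_arr)    # a NumPy array
print(arr.dtype)      # all elements of an array share one data type
```

Note that multiplying the list itself by 2 would repeat the list rather than double its values, which is one reason arrays are preferred for numerical work.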
Basics of Numpy Arrays
Categories of basic array manipulations are:
1. Attributes of arrays: Determining the size, shape, memory consumption, and
data types of arrays
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
4-1 B.Tech CIVIL Regulation: R20 FDS: UNIT-4
To index from the end of the array, we can use negative indices:
print(a[-1])
print(a[-2])
Output:9
7
In a multidimensional array, we access items using a comma-separated tuple
of indices:
Example:
import numpy as np
a=np.array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
print(a[0,0])
print(a[2,0])
print(a[2,-1])
Output: 3
1
7
We can also modify values using any of the above index notation:
a[0, 0] = 12
print(a)
Output: [[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Array Slicing: Accessing Sub arrays
Just as we can use square brackets to access individual array elements, we
can also use them to access sub arrays with the slice notation, marked by the
colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access
a slice of an array x, use this: x[start:stop:step] If any of these are
unspecified, they default to the values start=0, stop=size of dimension,
step=1.
Example: One-dimensional sub arrays
import numpy as np
x = np.arange(10)
print(x)
print( x[:5]) # first five elements
print( x[5:]) # elements from index 5 onward
print(x[4:7]) # middle sub array
print(x[::2]) # every other element
print( x[1::2]) # every other element, starting at index 1
print(x[::-1]) # all elements, reversed
print( x[5::-2]) # reversed every other from index 5
Output:[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4]
[5 6 7 8 9]
[4 5 6]
[0 2 4 6 8]
[1 3 5 7 9]
[9 8 7 6 5 4 3 2 1 0]
[5 3 1]
Example: Multidimensional sub arrays
import numpy as np
x2=np.array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
print( x2[:2, :3]) # two rows, three columns
print(x2[:3, ::2]) # all rows, every other column
print(x2[::-1, ::-1]) # sub array dimensions reversed together
Output:[[12 5 2]
[ 7 6 8]]
[[12 2]
[ 7 8]
[ 1 7]]
[[ 7 7 6 1]
[ 8 8 6 7]
[ 4 2 5 12]]
Accessing array rows and columns
One commonly needed routine is accessing single rows or columns of an
array. We can do this by combining indexing and slicing, using an empty
slice marked by a single colon (:):
Example:
print(x2[:, 0]) # first column of x2 [12 7 1]
print(x2[0, :]) # first row of x2 [12 5 2 4]
In the case of row access, the empty slice can be omitted for a more compact
syntax:
print(x2[0]) # equivalent to x2[0, :] [12 5 2 4]
Sub arrays as no-copy views
One important—and extremely useful—thing to know about array slices is
that they return views rather than copies of the array data.
This is one area in which NumPy array slicing differs from Python list
slicing: in lists, slices will be copies.
Consider our two-dimensional array from before:
print(x2)
Output: [[12 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
x2_sub = x2[:2, :2]
print(x2_sub)
Output: [[12 5] [ 7 6]]
Now if we modify this sub array, we’ll see that the original array is changed.
Example:
x2_sub[0, 0] = 99
print(x2_sub)
Output: [[99 5] [ 7 6]]
print(x2)
Output: [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
This default behavior is actually quite useful: it means that when we work
with large datasets, we can access and process pieces of these datasets
without the need to copy the underlying data buffer.
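The contrast between list slices (copies) and array slices (views) can be checked directly. The sketch below uses illustrative variable names and np.shares_memory to confirm that the array slice points at the same data buffer.

```python
import numpy as np

# List slicing copies: changing the slice leaves the original list alone
py_list = [1, 2, 3, 4]
list_slice = py_list[:2]
list_slice[0] = 99

# Array slicing returns a view: changing the slice changes the original
arr = np.array([1, 2, 3, 4])
arr_slice = arr[:2]
arr_slice[0] = 99

print(py_list)                            # original list is unchanged
print(arr)                                # original array sees the change
print(np.shares_memory(arr, arr_slice))   # the view shares the data buffer
```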
Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead
explicitly copy the data within an array or a sub array.
This can be most easily done with the copy() method:
Example:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
Output: [[99 5] [ 7 6]]
If we now modify this subarray, the original array is not touched:
Example:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
Output: [[42 5] [ 7 6]]
print(x2)
Output: [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]]
Reshaping of Arrays
Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the reshape() method.
For example:
import numpy as np
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
Output: [[1 2 3]
[4 5 6]
[7 8 9]]
Note that for this to work, the size of the initial array must match the size of
the reshaped array.
Where possible, the reshape method will use a no-copy view of the initial
array, but with non-contiguous memory buffers this is not always the case.
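A minimal sketch of both points, using illustrative names: reshaping a contiguous array gives a view, reshaping a transposed (non-contiguous) array forces a copy, and a size mismatch raises an error.

```python
import numpy as np

base = np.arange(6)

# Reshaping a contiguous array can reuse the same buffer (a view)
grid = base.reshape((2, 3))
print(np.shares_memory(base, grid))

# The transpose is a non-contiguous view; flattening it with reshape
# cannot be expressed as a view, so NumPy has to copy the data
flat_t = grid.T.reshape(6)
print(np.shares_memory(base, flat_t))
print(flat_t)

# The sizes must match: 6 elements cannot fill a 4x2 array
try:
    base.reshape((4, 2))
except ValueError as err:
    print("reshape failed:", err)
```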
Another common reshaping pattern is the conversion of a one-dimensional
array into a two-dimensional row or column matrix. We can do this with the
reshape method, or more easily by using np.newaxis within a slice
operation.
Example:
import numpy as np
x = np.array([1, 2, 3])
# row vector via reshape
print(x.reshape((1, 3)))
# row vector via newaxis
print(x[np.newaxis, :])
# column vector via reshape
print(x.reshape((3, 1)))
# column vector via newaxis
print(x[:, np.newaxis])
Output:[[1 2 3]]
[[1 2 3]]
[[1]
[2]
[3]]
[[1]
[2]
[3]]
Array Concatenation and Splitting
It’s also possible to combine multiple arrays into one, and to conversely split
a single array into multiple arrays.
Concatenation of arrays: Concatenation, or joining of two arrays in
NumPy, is primarily accomplished through the routines np.concatenate,
np.vstack, and np.hstack.
np.concatenate takes a tuple or list of arrays as its first argument.
Example:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(np.concatenate([x, y]))
Output: [1 2 3 4 5 6]
We can also concatenate more than two arrays at once:
Example:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
Output: [ 1 2 3 4 5 6 99 99 99]
np.concatenate can also be used for two-dimensional arrays:
Example:
import numpy as np
grid = np.array([[1, 2, 3], [4, 5, 6]])
# concatenate along the first axis
print(np.concatenate([grid, grid]))
Output: [[1 2 3] [4 5 6] [1 2 3] [4 5 6]]
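The other routines named above, np.vstack and np.hstack, along with concatenation along the second axis, can be sketched as follows. The arrays row and col are illustrative values chosen for this example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# concatenate along the second axis (columns side by side)
wide = np.concatenate([grid, grid], axis=1)

# vstack: stack a 1-D row on top of a 2-D array
row = np.array([7, 8, 9])
stacked_v = np.vstack([row, grid])

# hstack: attach a column to the right of a 2-D array
col = np.array([[99],
                [99]])
stacked_h = np.hstack([grid, col])

print(wide)
print(stacked_v)
print(stacked_h)
```

np.vstack and np.hstack are often clearer than np.concatenate when the arrays being joined have different numbers of dimensions.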
[[ 8 9 10 11]
[12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)
print(right)
Output:
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
Similarly, np.dsplit will split arrays along the third axis.
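A short sketch of the splitting routines, with illustrative arrays: np.split takes a list of split points, np.vsplit splits along rows, and np.dsplit needs a three-dimensional array.

```python
import numpy as np

# np.split takes a list of indices marking the split points
x = np.array([1, 2, 3, 99, 99, 3, 2, 1])
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

# np.vsplit splits a 2-D array along the first axis (rows)
grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

# np.dsplit splits along the third axis, so it needs a 3-D array
cube = np.arange(8).reshape((2, 2, 2))
front, back = np.dsplit(cube, [1])
print(front.shape, back.shape)
```

N split points always produce N + 1 sub arrays.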
Aggregations
Aggregations are used to compute summary statistics for the data in
question.
Perhaps the most common summary statistics are the mean and standard
deviation, which allow us to summarize the “typical” values in a dataset, but
other aggregates are useful as well (the sum, product, median, minimum and
maximum, quantiles, etc.).
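The aggregates listed above each have a NumPy function. A minimal sketch on a small illustrative array:

```python
import numpy as np

arr = np.array([2, 4, 6, 8])

print(np.mean(arr))            # arithmetic mean
print(np.std(arr))             # standard deviation (population form by default)
print(np.median(arr))          # median
print(np.prod(arr))            # product of all elements
print(np.percentile(arr, 50))  # 50th percentile, the same as the median here
```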
Summing the Values in an Array
np.sum() function is used for computing the sum of all values in an array.
Example:
import numpy as np
arr = np.array([2,4,6,8])
b = np.sum(arr)
print(b)
Output: 20
Minimum and Maximum
np.min() and np.max() are used to find the minimum and maximum of a given array.
Example:
import numpy as np
arr = np.array([2,4,6,8])
print(np.min(arr))
print(np.max(arr))
output:
2
8
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or
column.
By default, each NumPy aggregation function returns the aggregate over
the entire array.
Aggregation functions take an additional argument specifying the axis along
which the aggregate is computed.
For example, we can find the maximum value within each column by
specifying axis=0. Similarly, we can find the maximum value within each
row by specifying axis=1.
Example:
import numpy as np
a = np.array([[1,3,5,7],
[2,4,6,8]])
print(a.max())
print(np.max(a))
print(a.max(axis=0))
print(a.max(axis=1))
Output:
8
8
[2 4 6 8]
[7 8]
The axis keyword specifies the dimension of the array that will be
collapsed, rather than the dimension that will be returned. So specifying
axis=0 means that the first axis will be collapsed: for two-dimensional
arrays, this means that values within each column will be aggregated.
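Checking the shapes makes the "collapsed axis" idea concrete. The sketch below reuses the array from the example above; the names col_sums and row_sums are illustrative.

```python
import numpy as np

a = np.array([[1, 3, 5, 7],
              [2, 4, 6, 8]])   # shape (2, 4): 2 rows, 4 columns

col_sums = a.sum(axis=0)   # collapses the 2 rows, leaving one value per column
row_sums = a.sum(axis=1)   # collapses the 4 columns, leaving one value per row

print(a.shape)
print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```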
Example:
import numpy as np
x=np.array([1,2,3,4])
y=np.array([4,5,6,7])
z=np.add(x,y)
print(z)
Output: [ 5 7 9 11]
Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also
understands Python’s built-in absolute value function:
Example:
import numpy as np
x = np.array([-2, -1, 0, 1, 2])
print(np.abs(x))
Output: [2 1 0 1 2]
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most
useful for the data scientist are the trigonometric functions.
Example:
theta = np.linspace(0, np.pi, 3)
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
Output:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be
computed directly from the object.
For example, if we’d like to reduce an array with a particular operation, we
can use the reduce method of any ufunc. A reduce repeatedly applies a given
operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements
in the array:
Example:
import numpy as np
x = np.arange(1, 6)
print(np.add.reduce(x))
Output: 15
If we’d like to store all the intermediate results of the computation, we can
instead use accumulate:
Example:
import numpy as np
x = np.arange(1, 6)
print(np.add.accumulate(x))
Output: [ 1 3 6 10 15]
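The same two methods exist on every binary ufunc. As a sketch, applying them to np.multiply gives the product and the running product of the same array:

```python
import numpy as np

x = np.arange(1, 6)   # [1 2 3 4 5]

# reduce with multiply gives the product of all elements
product = np.multiply.reduce(x)

# accumulate keeps every intermediate product
running_product = np.multiply.accumulate(x)

print(product)
print(running_product)

# np.sum and np.prod give the same final results as the reduce calls
print(np.add.reduce(x) == np.sum(x), product == np.prod(x))
```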
Outer products
Any ufunc can compute the output of all pairs of two different inputs using
the outer method.
Example:
x = np.arange(1, 6)
print(np.multiply.outer(x, x))
Output:
[[ 1 2 3 4 5]
[ 2 4 6 8 10]
[ 3 6 9 12 15]
[ 4 8 12 16 20]
[ 5 10 15 20 25]]
Broadcasting
Broadcasting is a means of vectorizing operations.
It is a set of rules for applying binary ufuncs (addition, subtraction,
multiplication, etc.) to arrays of different shapes.
During an element-wise operation, NumPy conceptually expands the smaller
array to match the shape of the larger one.
In the above diagram, the light boxes represent the broadcasted values: this
extra memory is not actually allocated in the course of the operation, but it
can be useful conceptually to imagine that it is.
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the
interaction between the two arrays:
Rule 1: If the two arrays differ in their number of dimensions, the
shape of the one with fewer dimensions is padded with ones on its
leading (left) side.
Rule 2: If the shape of the two arrays does not match in any
dimension, the array with shape equal to 1 in that dimension is
stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to
1, an error is raised.
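The three rules can be traced on small arrays. In this sketch the array names and shapes are illustrative, and the comments state which rule applies in each case.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)          # shape (3,)

# Rule 1 pads a to shape (1, 3); Rule 2 stretches it to (2, 3)
print((M + a).shape)

# Both arrays are stretched: (3, 1) and (3,) -> (1, 3) give (3, 3)
col = np.arange(3).reshape((3, 1))
print((col + a).shape)

# Rule 3: (3, 2) and (3,) -> (1, 3) disagree in the last dimension (2 vs 3)
N = np.ones((3, 2))
try:
    N + a
except ValueError as err:
    print("broadcast error:", err)
```

When Rule 3 fails, reshaping the smaller array explicitly (for example with a[:, np.newaxis]) is the usual way to tell NumPy which axis should line up.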
Example 1:
x = np.array([1, 2, 3, 4, 5])
print(x < 3) # less than operator
Output: [True, True, False, False, False]
Example 2:
x = np.array([1, 2, 3, 4, 5])
print(np.less(x,3)) # less than ufunc
Output: [True, True, False, False, False]
Working with Boolean Arrays:
Counting entries:
To count the number of True entries in a Boolean array, np.count_nonzero is
useful:
# how many values less than 6?
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x)
print(np.less(x,6))
print(np.count_nonzero(x < 6))
Output:
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
[[ True True True True]
[False False True True]
[ True True False False]]
8
Another way to get at this information is to use np.sum; in this case, False is
interpreted as 0, and True is interpreted as 1:
print( np.sum(x < 6))
output: 8
The benefit of sum() is that like with other NumPy aggregation functions,
this summation can be done along rows or columns as well:
Ex: # how many values less than 6 in each row?
print(np.sum(x < 6, axis=1))
Output: [4, 2, 2]
This counts the number of values less than 6 in each row of the matrix.
Boolean Operators
We can combine the comparison operators using Python’s bitwise logic
operators, &, |, ^, and ~.
Like with the standard arithmetic operators, NumPy overloads these as
ufuncs that work element-wise on (usually Boolean) arrays.
The following table summarizes the bitwise Boolean operators and their
equivalent ufuncs:
Operator Equivalent ufunc
& np.bitwise_and
| np.bitwise_or
^ np.bitwise_xor
~ np.bitwise_not
Example:
import numpy as np
a =np.arange(10)
print(a)
#bitwise or operator
b=((a<=2) | (a>=8))
print(b)
d=np.sum(b)
print(d)
#bitwise or ufunc
c=np.bitwise_or(a<=2,a>=8)
print(c)
e=np.sum(c)
print(e)
Output:
[0 1 2 3 4 5 6 7 8 9]
[ True True True False False False False False True True]
5
[ True True True False False False False False True True]
5
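A Boolean array like the one built in this example can also be used as a mask, that is, as an index that keeps only the positions holding True. A minimal sketch on the same array, with mask as an illustrative name:

```python
import numpy as np

a = np.arange(10)

# Boolean array: True where the value is <= 2 or >= 8
mask = (a <= 2) | (a >= 8)

# Using the Boolean array as an index keeps only the True positions
print(a[mask])

# ~ inverts the mask, selecting the remaining values
print(a[~mask])

# & requires both conditions; the parentheses are needed because
# & binds more tightly than the comparison operators
print(a[(a > 2) & (a < 8)])
```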
Example 3:
import numpy as np
Boolean mask:
[False False False True True]
Output: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[10, 8, 9]
We can also combine fancy indexing with slicing:
b=X[1:, [2, 0, 1]]
print(b)
Output:
[[ 6, 4, 5],
[10, 8, 9]]
We can combine fancy indexing with masking:
mask = np.array([True, False, True, False])
row = np.array([0, 1, 2])
c= X[row[:, np.newaxis], mask]
print(c)
Output: [[ 0, 2],
[ 4, 6],
[ 8, 10]]
All of these indexing options combined lead to a very flexible set of
operations for accessing and modifying array values.
Differences between Boolean and Fancy indexing
1. Boolean indexing: Selection is based on conditions or Boolean masks.
   Fancy indexing: Selection is based on predefined indices or arrays of indices.
2. Boolean indexing: We create a Boolean array of the same shape as the original array, where each element records whether the condition is met.
   Fancy indexing: We explicitly specify which elements to select using integer arrays of indices.
3. Boolean indexing: Requires creating a Boolean mask, which involves evaluating a condition against the original array.
   Fancy indexing: Requires a separate array (or arrays) of integer indices to specify the elements to be selected.
4. Boolean indexing: The mask has the same shape as the original array; the result is a one-dimensional array holding only the selected elements.
   Fancy indexing: The result takes the shape of the index arrays, which can differ from the shape of the original array.
5. Boolean indexing: Elements are selected based on whether the corresponding Boolean mask value is True or False.
   Fancy indexing: Elements are selected explicitly based on the specified integer indices.
6. Boolean indexing: Useful for condition-based selection of elements. Commonly used for filtering data based on conditions.
   Fancy indexing: Useful for selecting elements at specific locations or indices. Allows for more flexible and customized selection of elements.
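The two indexing styles can be compared side by side. This sketch uses illustrative names, and its comments note the shape of each result: a full-shape Boolean mask returns a one-dimensional array, while fancy indexing returns the shape of the index arrays.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))

# Boolean indexing: the mask has the same shape as X,
# and the result is a 1-D array of the values where the mask is True
mask = X > 8
by_mask = X[mask]

# Fancy indexing: integer index arrays pick explicit positions,
# and the result takes the shape of the index arrays
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 3])
by_index = X[rows, cols]        # elements (0,2), (1,1), (2,3)

# A 2-D index array gives a 2-D result
idx = np.array([[0, 3],
                [1, 2]])
flat = np.arange(10, 14)
by_index_2d = flat[idx]

print(mask.shape, by_mask)
print(by_index)
print(by_index_2d)
```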
Structured Arrays
Structured arrays are arrays with compound data types.
Structured arrays in NumPy allow us to work with structured data, where
each element of the array can have multiple fields with different data types,
similar to a table in a database or a structure in a programming language
like C or C++.
Structured arrays are particularly useful when we need to handle
heterogeneous data, such as data from CSV files or other tabular data
sources.
They provide efficient storage for compound, heterogeneous data.
Structured arrays are handy for handling and manipulating structured data
within NumPy, making it easier to work with complex datasets.
Example:
import numpy as np
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
#storing data in three separate arrays
name = np.array( ['Kumar', 'Rao', 'Ali', 'Singh'])
age = np.array([25, 45, 37, 19])
weight =np.array( [55.0, 85.5, 68.0, 61.5])
# filling the array with our lists of values
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
Output:
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
The handy thing with structured arrays is that we can now refer to values
either by index or by name:
Example:
# Get all names
print(data['name'])
# Get first row of data
print(data[0])
# Get the name from the last row
print(data[-1]['name'])
# Get names where age is under 30
print(data[data['age'] < 30]['name'])
Output:
['Kumar' 'Rao' 'Ali' 'Singh']
('Kumar', 25, 55.)
Singh
['Kumar' 'Singh']
Creating Structured Arrays
Structured array data types can be specified in a number of ways.
Method 1: Dictionary method: We can create a structured array using a
compound data type specification:
struct = np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
Method 2: Numerical types can be specified with Python types or NumPy
dtypes instead:
struct2 = np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
Method 3: A compound type can also be specified as a list of tuples:
struct3 = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
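As a quick check of how the three specifications relate, the sketch below compares the resulting dtypes. It assumes nothing beyond the three definitions given above.

```python
import numpy as np

struct = np.dtype({'names': ('name', 'age', 'weight'),
                   'formats': ('U10', 'i4', 'f8')})
struct2 = np.dtype({'names': ('name', 'age', 'weight'),
                    'formats': ((np.str_, 10), int, np.float32)})
struct3 = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# Methods 1 and 3 describe the same layout
print(struct == struct3)

# All three share the same field names
print(struct.names)

# Method 2 uses different numeric types, so the field dtypes differ
print(struct['weight'], struct2['weight'])
```

The dictionary and list-of-tuples forms are interchangeable; what matters for storage is the format chosen for each field.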
Example:
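import numpy as np
name = ['Kumar', 'Rao', 'Ali', 'Singh']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
struct = np.dtype({'names':('name', 'age', 'weight'),
                   'formats':('U10', 'i4', 'f8')})
struct2 = np.dtype({'names':('name', 'age', 'weight'),
                    'formats':((np.str_, 10), int, np.float32)})
struct3 = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
# one zero-filled array per dtype specification
data = np.zeros(4, struct)
data2 = np.zeros(4, struct2)
data3 = np.zeros(4, struct3)
data['name'] = name
data['age'] = age
data['weight'] = weight
data2['name'] = name
data2['age'] = age
data2['weight'] = weight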
data3['name'] = name
data3['age'] = age
data3['weight'] = weight
print(data)
print(data2)
print(data3)
Output:
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. ) ('Singh', 19, 61.5)]
[('Kumar', 25, 55. ) ('Rao', 45, 85.5) ('Ali', 37, 68. )('Singh', 19, 61.5)]
Record Arrays: Structured Arrays with a Twist
In NumPy, a record array is a special kind of structured array where each
element behaves like a record or a structured row, and we can access fields
using attributes (like object-oriented attributes) instead of dictionary-style
indexing. This can make the code more readable and similar to working with
structured data.
NumPy also provides the np.recarray class, which is almost identical to
structured arrays, but with one additional feature: fields can be accessed as
attributes rather than as dictionary keys.
Recall that we can access the ages by writing: data['age']
Output: array([25, 45, 37, 19], dtype=int32)
If we view our data as a record array instead, we can access this with
slightly fewer keystrokes:
data_rec = data.view(np.recarray)
print(data_rec.age)
Output:array([25, 45, 37, 19], dtype=int32)
The downside is that for record arrays, there is some extra overhead
involved in accessing the fields.
Example program:
import numpy as np
name = ['Kumar','Rao','Ali','Singh']
age = [25,45,37,19]
weight = [55.0,85.5,68.0,61.5]
struct = np.dtype({'names':('name','age','weight'),
'formats':('U10','i4','f8')})
data = np.zeros(4,struct)
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
import numpy as np
# Creating a record array with fields 'name' and 'age'
data = np.rec.array([('Alice', 25), ('Bob', 30), ('Charlie', 35)],
                    dtype=[('name', 'U10'), ('age', int)])
print(data)
print(data.name)
# An ordinary structured array with the same fields
data2 = np.array([('Alice', 25), ('Bob', 30), ('Charlie', 35)],
                 dtype=[('name', 'U10'), ('age', int)])
print(data2)
print(data2['name'])
# Viewing the structured array as a record array
data_rec = data2.view(np.recarray)
print(data_rec.name)
Exercise: Create a structured array representing information about students. Each
student should have fields for "Name," "Age," "Grade," and "City." Populate the
array with data for at least 5 students. Access and print the names of all students in
the array. Find and print the average grade of all students.
import numpy as np
# Define the structured dtype and the student records
struct = [('Name', 'U20'), ('Age', int), ('Grade', float), ('City', 'U20')]
data = [("Alice", 22, 90.5, "New York"),
("Bob", 21, 88.0, "Los Angeles"),
("Charlie", 25, 95.2, "Chicago"),
("David", 20, 87.3, "San Francisco"),
("Eva", 23, 91.8, "New York")]
students =np.array(data,struct)
# Access and print the names of all students
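names = students['Name']
print("Names of all students:")
print(names)
# Find and print the average grade of all students
average_grade = students['Grade'].mean()
print("Average grade:", average_grade)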
Output:
[0.25 0.5 0.75 1. ]
RangeIndex(start=0, stop=4, step=1)
The essential difference between a NumPy one-dimensional array and a Pandas
Series is the presence of the index: while the NumPy array has an implicitly
defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of
any desired type.
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
print(data)
Output:
a 0.25
b 0.50
c 0.75
d 1.00
We can even use non-contiguous or non-sequential indices:
Example:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data)
Output:
2 0.25
5 0.50
3 0.75
7 1.00
Constructing Series objects
The general syntax to create pandas Series object is
pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities:
data can be a list or NumPy array, in which case index defaults to an
integer sequence.
data can be a scalar, which is repeated to fill the specified index.
data can be a dictionary, in which case index defaults to the dictionary
keys (current pandas versions keep insertion order; older versions sorted the keys).
Example program:
import pandas as pd
import numpy as np
arr=np.arange(10,60,10)
li=[10,20,30,40,50]
s=10
dic={'1st':10,'2nd':20,'3rd':30,'4th':40,'5th':50}
ser1 = pd.Series(arr) #A one-dimensional ndarray
ser2 = pd.Series(li) # A Python list
ser3 = pd.Series(s) #A scalar value
ser4 =pd.Series(s,index=['a','b','c','d','e'])
ser5 = pd.Series(dic) #A Python dictionary
print(ser1)
print(ser2)
print(ser3)
print(ser4)
print(ser5)
Output:
0 10
1 20
2 30
3 40
4 50
0 10
1 20
2 30
3 40
4 50
0 10
a 10
b 10
c 10
d 10
e 10
1st 10
2nd 20
3rd 30
4th 40
5th 50
The Pandas DataFrame Object
The DataFrame can be thought of either as a generalization of a NumPy
array, or as a specialization of a Python dictionary.
A DataFrame is an analog of a two-dimensional array with both flexible row
indices and flexible column names.
4. Series: All elements in a Series must be of the same data type. For example, if you create a Series with integers, all elements in that Series will be integers.
   DataFrame: Different columns in a DataFrame can hold data of different data types. For example, one column can contain integers, another can contain strings, and so on.
5. Series: The size (number of elements) of a Series is fixed upon creation. You cannot add or remove elements without creating a new Series.
   DataFrame: DataFrames can be modified in size by adding or removing rows and columns. This makes them more flexible for data manipulation.
6. Series: We can create a Series from a list, array, or dictionary.
   DataFrame: We can create a DataFrame from various data sources like lists, dictionaries, NumPy arrays, other DataFrames, or by reading data from files like CSV, Excel, SQL databases, etc.
1 1 2
2 2 4
From a dictionary of Series objects:
A DataFrame can be constructed from a dictionary of Series objects
Example:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
df = pd.DataFrame({'marks': marks,'ages': ages})
print(df)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
From a two-dimensional NumPy array.
Given a two-dimensional array of data, we can create a DataFrame with any
specified column and index names. If omitted, an integer index will be used
for each.
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(1,7,1).reshape(3,2),
columns=['col1', 'col2'],
index=['row1', 'row2', 'row3'])
print(df)
Output:
col1 col2
row1 1 2
row2 3 4
row3 5 6
From a NumPy structured array.
A Pandas DataFrame operates much like a structured array, and can be
created directly from one:
Example:
import numpy as np
import pandas as pd
sa = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(pd.DataFrame(sa))
Output:
A B
0 0 0.0
1 0 0.0
2 0 0.0
Pandas Index Object
Both the Series and DataFrame objects contain an explicit index using
which we reference and modify data.
This Index object is an interesting structure in itself, and it can be thought of
either as an immutable array or as an ordered set.
Example:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
cind =pd.Index(['col1'])
ser = pd.Series([100,200,300,400],index=rind)
df = pd.DataFrame(ser,columns=cind)
print(df)
Output:
col1
row1 100
row2 200
row3 300
row4 400
Use of the Index object:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
ser1 = pd.Series([10,20,30,40],index=rind)
ser2 = pd.Series([50,60,70,80],index=rind)
frame={'col1':ser1,'col2':ser2}
df = pd.DataFrame(frame)
print(df)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
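The description of the Index as both an immutable array and an ordered set can be illustrated with a short sketch. The two indices indA and indB are illustrative values, not taken from the examples above.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Like an array: indexing, slicing, and attributes such as size
print(indA[1], indA[::2], indA.size)

# Unlike an array: an Index cannot be modified in place
try:
    indA[1] = 0
except TypeError as err:
    print("immutable:", err)

# Like a set: intersection, union, and symmetric difference
print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))
```

Immutability makes it safe to share one Index between several Series and DataFrames, as the preceding example does with rind.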
Operating on Data in Pandas
Pandas inherits much of its element-wise functionality from NumPy and its
ufuncs. Pandas can therefore perform quick element-wise operations, both
with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and
logarithmic functions, etc.).
For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
Universal functions work on Series and DataFrames through:
Index preservation
Index alignment
Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects.
We can use all the arithmetic and special universal functions of NumPy on
Pandas objects. The index is preserved in the output, as shown below.
import pandas as pd
import numpy as np
ser = pd.Series([10,20,30,40])
df = pd.DataFrame(np.arange(1,13,1).reshape(3,4),columns=['A', 'B', 'C',
'D'])
print(df)
print(np.add(ser,5)) # the indices preserved for series
print(np.add(df,10)) # the indices preserved for DataFrame
Index Alignment in series
Pandas will align indices in the process of performing the operation. This is
very convenient when we are working with incomplete data.
Suppose we are combining two different data sources: the indices will be
aligned accordingly.
Exampe:
import numpy as np
import pandas as pd
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)
print(A.add(B)) #equivalent to A+B
# fill_value stands in for any element that is missing from A or B
print(A.add(B,fill_value=0))
Index Alignment in DataFrame
A similar type of alignment takes place for both columns and indices when we are
performing operations on DataFrames.
Example:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.arange(1,5,1).reshape(2,2),columns =list('AB'))
B = pd.DataFrame(np.arange(1,10,1).reshape(3,3),columns =list('BAC'))
print(A)
print(B)
print(A+B)
print(A.add(B,fill_value=0))
fill = A.stack().mean()
print(A.add(B,fill_value=fill))
Output:
A B
0 1 2
1 3 4
B ... C
0 1 ... 3
1 4 ... 6
2 7 ... 9
[3 rows x 3 columns]
A ... C
0 3.0 ... NaN
1 8.0 ... NaN
2 NaN ... NaN
[3 rows x 3 columns]
A ... C
0 3.0 ... 3.0
1 8.0 ... 6.0
2 8.0 ... 9.0
[3 rows x 3 columns]
A ... C
0 3.0 ... 5.5
1 8.0 ... 8.5
2 10.5 ... 11.5
[3 rows x 3 columns]
Operations between DataFrame and Series
When we are performing operations between a DataFrame and a Series, the
index and column alignment is similarly maintained.
Operations between a DataFrame and a Series are similar to operations
between a two-dimensional and one-dimensional NumPy array.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20])
df = pd.DataFrame([[100,200],[300,400]])
print(ser)
print(df)
print(df.subtract(ser))
print(df.subtract(ser,axis=0))
Output:
0 10
1 20
0 1
0 100 200
1 300 400
0 1
0 90 180
1 290 380
0 1
0 90 190
1 280 380
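The comparison with NumPy made above can be checked directly. The following is a minimal sketch, reusing the same numbers as the example; the variable names are only illustrative. It shows that the default DataFrame/Series operation behaves like row-wise broadcasting, while axis=0 behaves like broadcasting down the columns.

```python
import numpy as np
import pandas as pd

arr = np.array([[100, 200], [300, 400]])
vec = np.array([10, 20])

# NumPy broadcasts the 1-D array across each row of the 2-D array.
row_wise = arr - vec
# Adding a trailing axis makes the subtraction run down each column instead.
col_wise = arr - vec[:, np.newaxis]

df = pd.DataFrame(arr)
ser = pd.Series(vec)

# Pandas produces the same numbers, but it matches on labels, not positions.
same_row_wise = df.subtract(ser).to_numpy()
same_col_wise = df.subtract(ser, axis=0).to_numpy()

print(row_wise)
print(col_wise)
```

The difference from NumPy is that Pandas aligns the Series index with the column labels (or with the row labels when axis=0), so the result does not depend on the order in which the labels are stored.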
Data Selection in DataFrame
DataFrame as a dictionary
Example1:
import pandas as pd
ser1 = pd.Series([10,20,30,40],index = ['row1','row2','row3','row4'])
ser2 = pd.Series([50,60,70,80],index = ['row1','row2','row3','row4'])
data = pd.DataFrame({'col1':ser1,'col2':ser2})
print(data)
print(data['col1']) # dict style
print(data.col1) # attribute style
data['sum'] = data['col1']+data['col2']
print(data)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
row1 10
row2 20
row3 30
row4 40
row1 10
row2 20
row3 30
row4 40
[4 rows x 3 columns]
Example2:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
data = pd.DataFrame({'marks': marks,'ages': ages})
print(data)
print(data['marks'])
print(data.marks)
data['ratio'] = data['marks'] / data['ages']
print(data)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
kumar 89
Rao 78
Ali 67
Singh 96
kumar 89
Rao 78
Ali 67
Singh 96
[10 50]
col1
row1 10
row2 20
row3 30
col1
row1 10
row2 20
row3 30
Example2:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
data = pd.DataFrame({'marks': marks,'ages': ages})
print(data)
print(data.values)
print(data.T)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
[[89 21]
[78 22]
[67 19]
[96 20]]
kumar ... Singh
marks 89 ... 96
ages 21 ... 20
Index preservation:
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects.
We can use all arithmetic and special universal functions from NumPy on
Pandas objects.
The result is another Pandas object with the indices preserved (maintained),
as shown below.
Example:
import numpy as np
import pandas as pd
s=pd.Series([10,20,30,40])
print(s)
df =pd.DataFrame(np.arange(1,13).reshape(3,4),columns=['A','B','C','D'])
print(df.to_string())
print(np.add(s,5))
print(np.add(df,10).to_string())
Output:
0 10
1 20
2 30
3 40
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
0 15
1 25
2 35
3 45
A B C D
0 11 12 13 14
1 15 16 17 18
2 19 20 21 22
Index Alignment in Data Frame:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.arange(1,5).reshape(2,2),columns=list('AB'))
B = pd.DataFrame(np.arange(1,10).reshape(3,3),columns=list('BAC'))
print(A)
print(B.to_string())
print(A.add(B).to_string())
print(A.add(B,fill_value=0).to_string())
print(A.add(B).to_string())
fill=A.stack().mean()
print(A.add(B,fill_value=fill).to_string())
Output:
A B
0 1 2
1 3 4
B A C
0 1 2 3
1 4 5 6
2 7 8 9
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
A B C
0 3.0 3.0 3.0
1 8.0 8.0 6.0
2 8.0 7.0 9.0
A B C
0 3.0 3.0 NaN
1 8.0 8.0 NaN
2 NaN NaN NaN
A B C
0 3.0 3.0 5.5
1 8.0 8.0 8.5
2 10.5 9.5 11.5
Handling Missing Data
A number of schemes have been developed to indicate the presence of
missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: using a mask that
globally indicates missing values, or choosing a sentinel value that indicates
a missing entry.
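Pandas follows the sentinel approach and uses two existing values, Python's None and the floating-point NaN. The sketch below is an illustration under that assumption (the array contents are made up); it shows how the two sentinels behave in NumPy and how Pandas converts None to NaN for numeric data.

```python
import numpy as np
import pandas as pd

# None forces a NumPy array to the generic object dtype.
obj_arr = np.array([1, None, 3])

# NaN is a floating-point sentinel, so the array stays numeric.
nan_arr = np.array([1, np.nan, 3])

# NaN propagates through ordinary aggregates; nan-aware versions skip it.
plain_sum = nan_arr.sum()
safe_sum = np.nansum(nan_arr)

# Pandas converts None to NaN when the remaining data is numeric.
ser = pd.Series([1, None, 3])

print(obj_arr.dtype, nan_arr.dtype)
print(plain_sum, safe_sum)
print(ser)
```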
Output:
0 1.0
1 NaN
2 2.0
3 NaN
0 1
0 1.0 NaN
1 3.0 NaN
2 NaN 6.0
3 NaN 8.0
Operating on Null Values
There are several useful methods for detecting, removing, and replacing null
values in Pandas data structures.
They are:
isnull() - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna() - Return a filtered version of the data
fillna() - Return a copy of the data with missing values filled or
imputed
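The four methods listed above can be tried on a small Series. This is a minimal sketch; the data values and the replacement string 'missing' are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 'hello', None])

mask = ser.isnull()             # True where a value is missing
present = ser.notnull()         # the opposite of isnull()
kept = ser.dropna()             # only the non-null entries, original labels kept
filled = ser.fillna('missing')  # a copy with the nulls replaced

print(mask)
print(present)
print(kept)
print(filled)
```

None of these calls changes ser itself; each one returns a new object.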
0 1
1 NaN
2 hello
3 None
0 False
1 True
2 False
3 True
0 True
1 False
2 True
3 False
0 ... 2
0 NaN ... hai
1 20.0 ... wow
0 ... 2
0 True ... False
1 False ... False
0 ... 2
0 False ... True
1 True ... True
0 1
1 NaN
2 hello
3 None
0 ... 2
0 NaN ... hai
1 20.0 ... wow
[2 rows x 3 columns]
0 1
2 hello
0 ... 2
1 20.0 ... wow
[1 rows x 3 columns]
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,10,'hai',None],[20,30,'wow',None]])
print(df)
print(df.dropna())
print(df.dropna(axis =1))
print(df.dropna(axis ='columns')) #equivalent to axis =1
print(df.dropna(axis ='columns',how='all'))
print(df.dropna(axis ='columns',thresh=2))
Output:
0 ... 3
0 NaN ... None
1 20.0 ... None
[2 rows x 4 columns]
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
0 ... 2
[2 rows x 3 columns]
1 2
0 10 hai
1 30 wow
Filling null values
We can replace NA values with a valid value. This value might be a single
number like zero, or it might be some sort of imputation or interpolation
from the good values.
Pandas provides the fillna() method, which returns a copy of the array with
the null values replaced.
Example: filling null values in Series
import numpy as np
import pandas as pd
ser = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(ser)
print(ser.fillna(0))
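print(ser.ffill()) # forward fill; fillna(method='ffill') is deprecated in newer pandas
print(ser.bfill()) # backward fill; fillna(method='bfill') is deprecated in newer pandas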
Output:
a 1.0
b NaN
c 2.0
d NaN
e 3.0
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
Example: Filling null values in DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, np.nan, 2,None],
[2, 3, 5, None],
[np.nan, 4, 6, None]])
print(df)
print(df.ffill(axis=1)) # forward fill along each row
print(df.bfill(axis=1)) # backward fill along each row
print(df.ffill(axis=0)) # forward fill down each column
print(df.bfill(axis=0)) # backward fill down each column
Output:
0 1 2 3
0 1.0 NaN 2 None
1 2.0 3.0 5 None
2 NaN 4.0 6 None
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
0 1 2 3
0 1.0 2.0 2.0 NaN
1 2.0 3.0 5.0 NaN
2 4.0 4.0 6.0 NaN
0 1 2 3
0 1.0 NaN 2 None
1 2.0 3.0 5 None
2 2.0 4.0 6 None
0 1 2 3
0 1.0 3.0 2 None
1 2.0 3.0 5 None
2 NaN 4.0 6 None
Hierarchical Indexing
Hierarchical indexing (also known as multi-indexing) is used to incorporate
multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame
objects.
A Multiply Indexed Series: Here we represent two-dimensional data within
a one-dimensional Series.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20,30,40,50,60],index = [[1,1,1,2,2,2,],
['a','b','c','a','b','c']])
print(ser)
ser.index.names = ['ind1','ind2']
print(ser)
Output:
1 a 10
b 20
c 30
2 a 40
b 50
c 60
ind1 ind2
1 a 10
b 20
c 30
2 a 40
b 50
c 60
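Once a Series has a MultiIndex, it can be selected by a full key, by the outer level alone, or by the inner level alone, and it can be converted into an ordinary two-dimensional DataFrame. The sketch below reuses the Series from the example above; the result variable names are only illustrative.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50, 60],
                index=[[1, 1, 1, 2, 2, 2],
                       ['a', 'b', 'c', 'a', 'b', 'c']])
ser.index.names = ['ind1', 'ind2']

one_value = ser.loc[1, 'b']   # full key: a single element
outer = ser.loc[2]            # partial key: all rows with ind1 == 2
inner = ser.loc[:, 'a']       # slice on ind1, fixed ind2 == 'a'

# unstack() moves the inner level to the columns, giving a 2 x 3 table.
table = ser.unstack()

print(one_value)
print(outer)
print(inner)
print(table)
```

The unstacked table makes the point of the section visible: the Series holds the same information as a DataFrame whose rows are ind1 and whose columns are ind2.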
A Multiply Indexed DataFrame:
Example:
import numpy as np
import pandas as pd
data = [[25,24],[28,26],[29,28],[27,26],[30,29],[28,27]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = ['DS','DO']
df = pd.DataFrame(data,index=ind,columns=col)
print(df)
df.index.names =['rollNo','mid']
print(df)
Output:
DS DO
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
DS DO
rollNo mid
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
Example:
Python program to create following table of data
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
print(df.to_string())
Output:
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Example:
Python program to create following table:
Type
Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
df.index.names =['RollNo','Mid']
df.columns.names =['Type','Sub']
print(df.to_string())
Output:
Type Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Combining Datasets
Some of the most interesting studies of data come from combining different
data sources.
These operations can involve anything from very straightforward
concatenation of two different datasets to more complicated database-style
joins and merges that correctly handle any overlaps between the datasets.
These operations can be:
simple concatenation of Series and DataFrames with the pd.concat
function
in-memory merges and joins implemented in Pandas.
Simple Concatenation with pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to
np.concatenate() but offers a number of additional options.
pd.concat() can be used for a simple concatenation of Series or DataFrame
objects, just as np.concatenate() can be used for simple concatenations of
arrays.
import pandas as pd
import numpy as np
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
print(pd.concat([ser1, ser2]))
Output:
1 A
2 B
3 C
4 D
5 E
6 F
Concatenation in data frame:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['A','B'])
print(df1); print(df2); print(pd.concat([df1, df2]))
Output:
A B
1 10 20
2 30 40
A B
1 50 60
2 70 80
A B
1 10 20
2 30 40
1 50 60
2 70 80
By default, the concatenation takes place row-wise within the DataFrame
(i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis
along which concatenation will take place.
Example:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['C','D'])
print(df1); print(df2);
print(pd.concat([df1, df2],axis=1).to_string())
Output:
A B
1 10 20
2 30 40
C D
1 50 60
2 70 80
A B C D
1 10 20 50 60
2 30 40 70 80
By default, the entries for which no data is available are filled with NA
values. To change this, we can set the join parameter of pd.concat. (Older
pandas versions also accepted a join_axes parameter; it was removed in
pandas 1.0.) By default, the join is a union of the input columns
(join='outer'), but we can change this to an intersection of the columns
using join='inner':
Example:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[1,2,3],[4,5,6]],index=[1,2],columns=['A','B','C'])
df2 =pd.DataFrame([[7,8,9],[10,11,12]],index=[1,2],columns=['B','C','D'])
print(df1.to_string()); print(df2.to_string())
print(pd.concat([df1, df2]).to_string())
print(pd.concat([df1, df2],join='inner'))
Output:
A B C
1 1 2 3
2 4 5 6
B C D
1 7 8 9
2 10 11 12
A B C D
1 1.0 2 3 NaN
2 4.0 5 6 NaN
1 NaN 7 8 9.0
2 NaN 10 11 12.0
B C
1 2 3
2 5 6
1 7 8
2 10 11
The append() method
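In pandas versions before 2.0, Series and DataFrame objects had an append
method that performed the same concatenation in fewer keystrokes.
For example, rather than calling pd.concat([df1, df2]), we could simply call
df1.append(df2). The method was deprecated in pandas 1.4 and removed in
pandas 2.0, so current code should call pd.concat instead:
print(df1); print(df2); print(pd.concat([df1, df2])) # df1.append(df2) in pandas < 2.0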
Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join
and merge operations.
Categories of Joins
One-to-one joins
Many-to-one joins
Many-to-many joins
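The three categories differ only in how often a key value repeats in each input. The sketch below illustrates this with pd.merge; all four DataFrames, their column names and their values are hypothetical and are not taken from the examples in these notes.

```python
import pandas as pd

employees = pd.DataFrame({'name': ['Kumar', 'Rao', 'Ali', 'Singh'],
                          'dept': ['Admin', 'Tech', 'Admin', 'Tech']})
joining = pd.DataFrame({'name': ['Kumar', 'Rao', 'Ali', 'Singh'],
                        'YOJ': [2018, 2019, 2018, 2010]})
heads = pd.DataFrame({'dept': ['Admin', 'Tech'],
                      'head': ['Sharma', 'Reddy']})
skills = pd.DataFrame({'dept': ['Admin', 'Admin', 'Tech', 'Tech'],
                       'skill': ['filing', 'scheduling', 'coding', 'testing']})

# One-to-one: every name occurs exactly once in both inputs.
one_to_one = pd.merge(employees, joining, on='name')

# Many-to-one: 'dept' repeats on the left but is unique on the right,
# so each head is repeated for every employee of that department.
many_to_one = pd.merge(employees, heads, on='dept')

# Many-to-many: 'dept' repeats on both sides, so every matching pair is
# produced (2 employees x 2 skills per department = 8 rows).
many_to_many = pd.merge(employees, skills, on='dept')

print(one_to_one)
print(many_to_one)
print(many_to_many)
```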
One-to-One Join:
A one-to-one join combines two DataFrames when there is a unique
matching key in both DataFrames. Each key appears only once in both
DataFrames.
This type of join results in a DataFrame where each row from the left
DataFrame is combined with exactly one matching row from the right
DataFrame.
Example of a one-to-one join:
import pandas as pd
# Create two DataFrames
left_df = pd.DataFrame({'key': ['A', 'B', 'C'],'value_left': [1, 2, 3]})
import pandas as pd
print(many_to_many_df)
Output:
key value_left value_right
0 A 1 apple
1 A 1 apple
2 A 3 apple
3 B 2 banana
join() Method:
join() is a DataFrame method that is more concise and is primarily used for
combining DataFrames based on their indices.
By default it performs a left join and matches rows on the index of both
DataFrames.
It is more convenient when we want to join DataFrames that share the same
index or have common index labels.
Example:
import pandas as pd
print(joined_df)
Output:
A B
index0 A0 NaN
index1 A1 B0
index2 A2 B1
index3 A3 B2
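A table of this shape can be produced with join() as sketched below. The two input DataFrames are assumptions, chosen only so that a default (left) join gives the table printed above; they are not necessarily the inputs used to generate it.

```python
import pandas as pd

# Hypothetical inputs: the right frame has no row labelled 'index0'.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']},
                    index=['index0', 'index1', 'index2', 'index3'])
right = pd.DataFrame({'B': ['B0', 'B1', 'B2']},
                     index=['index1', 'index2', 'index3'])

# Default how='left': keep every row of left, fill unmatched labels with NaN.
joined_df = left.join(right)

# how='inner' keeps only the index labels present in both frames.
inner_df = left.join(right, how='inner')

print(joined_df)
print(inner_df)
```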
# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
'C': ['C2', 'C3', 'C4']})
Output:
Outer Join Result:
A B C
0 A0 B0 NaN
1 A1 B1 NaN
2 A2 B2 C2
3 A3 NaN C3
4 A4 NaN C4
We can also perform left joins and right joins using the how='left' and
how='right' parameters, respectively. Left join includes all rows from
the left DataFrame and matches from the right DataFrame, while right
join includes all rows from the right DataFrame and matches from the
left DataFrame.
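The join types described above can be compared side by side with pd.merge and its how parameter. The sketch below reuses the sample DataFrames df1 and df2 from this section and assumes column 'A' is the join key; the result variable names are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4'],
                    'C': ['C2', 'C3', 'C4']})

outer = pd.merge(df1, df2, on='A', how='outer')  # keys from either frame
left = pd.merge(df1, df2, on='A', how='left')    # all keys of df1
right = pd.merge(df1, df2, on='A', how='right')  # all keys of df2
inner = pd.merge(df1, df2, on='A')               # default: keys in both

print("Outer Join Result:")
print(outer)
print("Left Join Result:")
print(left)
print("Right Join Result:")
print(right)
print("Inner Join Result:")
print(inner)
```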
Aggregation and Grouping
An essential piece of analysis of large data is efficient summarization:
computing aggregations like sum(), mean(), median(), min(), and max(), in
which a single number gives insight into the nature of a potentially large
dataset.
Aggregation in pandas can be performed by:
Simple Aggregation
Operations based on the concept of a groupby.
Simple Aggregation in Pandas
As with a one dimensional NumPy array, for a Pandas Series the aggregates
return a single value:
Example:
import pandas as pd
import numpy as np
ser = pd.Series([10,20,30,40,50])
print(ser.sum())
print(ser.mean())
Output:
150
30.0
For a DataFrame, by default the aggregates return results within each column.
By specifying the axis argument, we can instead aggregate within each row.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.sum())
print(df.mean())
print(df.sum(axis ='columns'))
print(df.mean(axis = 'columns'))
Output:
A 15
B 150
dtype: int64
A 3.0
B 30.0
dtype: float64
0 11
1 22
2 33
3 44
4 55
dtype: int64
0 5.5
1 11.0
2 16.5
3 22.0
4 27.5
dtype: float64
Pandas Series and DataFrames include all of the common aggregates. In
addition, there is a convenience method describe() that computes several
common aggregates for each column and returns the result.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.describe())
Output:
A B
count 5.000000 5.000000
mean 3.000000 30.000000
std 1.581139 15.811388
min 1.000000 10.000000
25% 2.000000 20.000000
Output:
key data
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
data
key
A 5
B 7
C 9
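A grouped sum like the one in the output above can be obtained with groupby, which splits the rows by key, applies an aggregate to each group and combines the results. The code below is a sketch whose DataFrame is reconstructed from the printed output; it is not necessarily the original program.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(1, 7)})
print(df)

# Split by 'key', apply sum() to each group, combine into one DataFrame.
grouped = df.groupby('key')
totals = grouped.sum()
print(totals)

# agg() applies several aggregates to the same groups in one call.
stats = grouped['data'].agg(['min', 'max', 'mean'])
print(stats)
```

Nothing is computed when groupby() is called; the work happens only when an aggregate such as sum() or agg() is applied to the grouped object.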
Exercise program:
A DataFrame consists of three columns: 'Date', 'Product' and 'Sales'. Create the
DataFrame, group the data by product and calculate the sum of sales.
Solution:
import pandas as pd
data={'Date':['01-10-2023','02-10-2023','01-10-2023','04-10-2023','01-10-2023'],
'Product':['A','B','A','A','B'],
'Sales':[100,180,120,80,200] }
df =pd.DataFrame(data)
print(df.to_string())
# select the Sales column so that the Date strings are not summed as well
product_sales = df.groupby('Product')['Sales'].sum()
print(product_sales)
Pivot Tables
A pivot table in pandas is a data manipulation technique that allows us to
restructure and summarize data from a DataFrame, making it easier to
analyze and visualize. Pivot tables are commonly used for tasks like data
aggregation, summarization, and cross-tabulation.
A pivot table is an operation similar to GroupBy that is commonly seen in
spreadsheets and other programs that operate on tabular data.
The pivot table takes simple column-wise data as input and groups the
entries into a two-dimensional table that provides a multidimensional
summarization of the data.
We can think of pivot tables as essentially a multidimensional version of
GroupBy aggregation, i.e., we still split-apply-combine, but both the split
and the combine happen across a two-dimensional grid rather than a
one-dimensional index.
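The link between GroupBy and pivot tables can be made concrete: grouping by two columns and unstacking the inner level gives the same grid as a pivot table. The sketch below uses a small made-up sales table; the column names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-10-2023', '01-10-2023', '02-10-2023', '02-10-2023', '03-10-2023'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180]})

# GroupBy on two keys gives a Series with a two-level index;
# unstack() spreads the inner level (Product) across the columns.
via_groupby = df.groupby(['Date', 'Product'])['Sales'].sum().unstack()

# pivot_table expresses the same split-apply-combine in a single call.
via_pivot = df.pivot_table(values='Sales', index='Date',
                           columns='Product', aggfunc='sum')

print(via_groupby)
print(via_pivot)
```

Both results have one row per date and one column per product, with NaN where a product has no sales on a date.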
Pivot Table Syntax: The main parameters of the pd.pivot_table function are
as follows (the DataFrame.pivot_table method takes the same arguments
without data):
pd.pivot_table(data, values=None, index=None,
columns=None, aggfunc='mean',
fill_value=None, margins=False,
dropna=True, margins_name='All')
where
data : pandas DataFrame
index : column(s) used to group the data along the rows
values : column(s) to aggregate
columns: column(s) whose values are displayed horizontally across the top
of the resulting table
fill_value and dropna control how missing data is handled
aggfunc controls what type of aggregation is applied, which is a
mean by default.
margins: if True, compute totals along each grouping; margins_name is the
label used for those totals.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Kumar','Rao','Ali','Singh'],
'Job':['FullTimeEmployee','Intern','PartTimeEmployee','FullTimeEmployee'],
'Dept':['Admin','Tech','Admin','management'],
'YOJ':[2018,2019,2018,2010],
'Sal':[20000,50000,10000,20000]})
print(df.to_string())
output = pd.pivot_table(data=df,index=['Job'],columns = ['Dept'],
values ='Sal',aggfunc ='mean')
print('\n')
print(output.to_string())
Output:
Exercise program:
A DataFrame consists of three columns: 'Date', 'Product' and 'Sales'. Create the
DataFrame and display the total sales for each product on each date.
Solution:
import pandas as pd
# Sample data
data = {
'Date': ['01-10-2023', '01-10-2023', '02-10-2023', '02-10-2023', '03-10-2023'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 150, 200, 120, 180]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df.to_string())
# Create a pivot table
pivot_table = df.pivot_table(values='Sales', index='Date', columns='Product',
aggfunc='sum')
# Display the pivot table
print("\nPivot Table:")
print(pivot_table)
Output:
Original DataFrame:
Date Product Sales
0 01-10-2023 A 100
1 01-10-2023 B 150
2 02-10-2023 A 200
3 02-10-2023 B 120
4 03-10-2023 A 180
Pivot Table:
Product A B
Date
01-10-2023 100.0 150.0
02-10-2023 200.0 120.0
03-10-2023 180.0 NaN
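The NaN in the pivot table marks a product with no sales on that date. As a sketch of the fill_value and margins options from the syntax list, the same table can be built with missing cells shown as 0 and with row and column totals added; the variable name with_totals is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-10-2023', '01-10-2023', '02-10-2023', '02-10-2023', '03-10-2023'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180]})

# fill_value=0 replaces the missing cell; margins=True appends totals,
# labelled with margins_name ('All' is also the default label).
with_totals = df.pivot_table(values='Sales', index='Date', columns='Product',
                             aggfunc='sum', fill_value=0,
                             margins=True, margins_name='All')
print(with_totals)
```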
NumPy vs Pandas
1. NumPy: NumPy stands for Numerical Python.
   Pandas: Pandas stands for PANel DAta Structure.
2. NumPy: NumPy is primarily designed for numerical and mathematical
   operations and scientific computing.
   Pandas: Pandas is designed for data manipulation and analysis,
   particularly for working with structured and labeled data.
3. NumPy: NumPy's core data structure is the ndarray (n-dimensional array),
   which is homogeneous (all elements have the same data type).
   Pandas: Pandas offers two main data structures: DataFrames and Series.
4. NumPy: NumPy is best suited for numerical and scientific computing tasks,
   including linear algebra, statistical analysis, and mathematical
   operations on large arrays of numerical data.
   Pandas: Pandas is ideal for working with structured data in data science,
   business analytics, and machine learning applications.
Tutorial Questions
1. Illustrate different categories of basic array manipulations with examples.
2. What are universal functions in NumPy? Explain the different advanced features of
universal functions.
3. Discuss and demonstrate some of the built-in aggregation functions in NumPy.
4. What is broadcasting in NumPy? Discuss the different rules of broadcasting with examples.
5. What is Boolean masking in NumPy? Explain with an example.
6. What is fancy indexing in NumPy? Discuss and demonstrate fancy indexing in NumPy.
7. Demonstrate the use of structured arrays and record arrays in NumPy.
8. How can fancy indexing be combined with other indexing schemes?
9. Illustrate different attributes of NumPy arrays with example.
10. Write short note on Computation on NumPy arrays
11. Explain the fundamental data objects with its construction in pandas
12. Briefly explain the hierarchical indexing with examples
13. What is pivot table? Explain it clearly
14. Demonstrate data indexing and selection in Pandas Series and DataFrame objects.
15. Write short note on Operating on Data in Pandas
16. Demonstrate different methods of constructing MultiIndex.
17. How to handle missing data in pandas
18. Illustrate different approaches to combine data from multiple sources in pandas
19. Explore aggregation and grouping in Pandas
20. Briefly explore and demonstrate different methods for Operating on Null Values
Assignment Questions:
1. Write a python program to demonstrate the Attributes of Arrays in NumPy
2. Write a python program to demonstrate the Indexing of Arrays in NumPy
3. Write a python program to demonstrate the Slicing of Arrays in NumPy
4. Write a python program to demonstrate the Reshaping of Arrays in NumPy
5. Write a python program to demonstrate the Joining and Splitting of Arrays in NumPy
6. Write a python program to demonstrate the Aggregation Universal Functions in NumPy
7. Write a python program to demonstrate Broadcasting in NumPy
8. Write a python program to demonstrate Boolean Masking in NumPy
9. Write a python program to demonstrate Fancy Indexing in NumPy
10. Write a python program to demonstrate the use of structured arrays and record arrays in
NumPy
11. Write a python program to illustrate different ways of creating pandas Series
12. Write a python program to illustrate different ways of creating pandas DataFrame
13. Write a python program to illustrate detecting null values in pandas DataFrame
14. Write a python program to illustrate dropping null values in pandas DataFrame
15. Write a python program to illustrate filling null values in pandas DataFrame
16. Write a python program to illustrate different ways of creating a pandas MultiIndex
17. Write a python program to illustrate indexing, slicing, Boolean indexing and fancy indexing in
MultiIndex.
18. Write a python program to illustrate merging two data sets with joins (inner, left and right) in
pandas
19. Write a python program to illustrate GroupBy operation of pandas.
20. Write a python program to illustrate pivot table in pandas.