Ultimate Data Science Programming in Python
Master data science libraries with 300+ programs, 2 projects, and EDA GUI tools
Saurabh Chandrakar
www.bpbonline.com
First Edition 2025
ISBN: 978-93-65895-667
All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system, but may not be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.
All trademarks referred to in the book are acknowledged as properties of their respective owners but
BPB Publications cannot guarantee the accuracy of this information.
www.bpbonline.com
Dedicated to
My parents
Dr. Surendra Kumar Chandrakar and
Smt. Bhuneshwari Chandrakar
my brother
Shri Pranav Chandrakar
my sister-in-law
Smt. Silky Chandrakar
my wife
Smt. Priyanka Chandrakar
and my lovely son
Master Yathartha Chandrakar
About the Author
Saurabh Chandrakar is a Research and Development Engineer (Dy.
Manager) at Bharat Heavy Electricals Limited (BHEL) Hyderabad. He is
the winner of the best executive award in the Operations Division by BHEL
Hyderabad. He has been awarded the prestigious BHEL Excellence Award
under the Anusandhan category for the projects “Redundant Composite Monitoring System of Power Transformers” and “Innovation and
demonstration of Digital Sub Station with in-house developed IEC61850
compliant intelligent electronic devices and optical current transformers for
the year 2020-21”. He has 25 copyrights, 5 patents granted, and 2 patents
filed.
You can download the code bundle for the book from the following link:
https://rebrand.ly/bzpl6rq
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Ultimate-Data-Science-
Programming-in-Python. In case there’s an update to the code, it will be
updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos
available at https://github.com/bpbpublications. Check them out!
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect on and improve upon human errors, if any, that may have occurred during the publishing process. To help us maintain quality and reach out to any readers who might be having difficulties due to unforeseen errors, please write to us at:
errata@bpbonline.com
Your support, suggestions, and feedback are highly appreciated by the BPB Publications' Family.
Did you know that BPB offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.bpbonline.com and as a print book
customer, you are entitled to a discount on the eBook copy. Get in touch with us at :
business@bpbonline.com for more details.
At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet, we would be
grateful if you would provide us with the location address or website name. Please contact us at
business@bpbonline.com with a link to the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site
that you purchased it from? Potential readers can then see and use your unbiased opinion to make
purchase decisions. We at BPB can understand what you think about our products, and our
authors can see your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.
1. Introduction
2. Structure
3. Objectives
4. Introduction to Python
10. Conclusion
12. Questions
1. Introduction
2. Structure
3. Objectives
1. randint()
2. shuffle()
12. Conclusion
14. Questions
15. References
1. Introduction
2. Structure
3. Objectives
1. reshape()
2. resize()
3. flatten()
4. flat variable
5. ravel()
6. transpose()
1. concatenate()
2. stack()
3. vstack()
4. hstack()
5. dstack()
1. split()
2. vsplit()
3. hsplit()
4. dsplit()
5. array_split()
1. where()
1. insert()
2. append()
1. delete()
13. Conclusion
15. Questions
1. Introduction
2. Structure
3. Objectives
1. Metric
2. Force
3. Binary
4. Angle
5. Mass
6. Time
7. Pressure
8. Length
9. Volume
10. Speed
11. Temperature
12. Energy
13. Power
14. Area
6. Optimizers in scipy
8. Graphs in scipy
1. Adjacency matrix
2. connected_components method
3. Dijkstra method
4. Floyd_Warshall method
1. Triangulation function
10. Integrate in scipy
1. Single integration
2. Multiple integration
1. 1D interpolation
2. Spline interpolation
3. splrep function
4. splev function
12. Conclusion
14. Questions
1. Introduction
2. Structure
3. Objectives
1. Line plot
8. Linestyle property
9. Color property
1. mlc form
2. clm form
22. Conclusion
1. Introduction
2. Structure
3. Objectives
4. Barplot
1. Pie chart
2. Histogram
3. Scatter plot
4. Subplot in matplotlib
9. Conclusion
11. Questions
1. Introduction
2. Structure
3. Objectives
4. Pandas series
22. Conclusion
24. Questions
1. Introduction
2. Structure
3. Objectives
4. Pandas dataframe
7. Conclusion
8. Points to remember
9. Questions
1. Introduction
2. Structure
3. Objectives
15. Output:
23. Conclusion
1. Introduction
2. Structure
3. Objectives
1. Numeric group
2. Nested group
3. Temporal group
4. Others group
1. Series
2. Dataframe
8. Contexts in polars
2. GroupBy concept
13. Questions
1. Introduction
2. Structure
3. Objectives
1. Basic operators
2. Column selections
3. Functions
4. Casting
5. Strings
6. Aggregation
7. Missing data
8. Folds
6. Conclusion
7. Points to remember
8. Questions
1. Introduction
2. Structure
3. Objectives
1. Quantitative variables
2. Qualitative variables
2. Qualitative palettes
3. Sequential palettes
4. Diverging palettes
9. Conclusion
11. Questions
13. Crafting Seaborn Plots: KDE, Line, Violin, and Facets
1. Introduction
2. Structure
3. Objectives
9. Conclusion
11. Questions
1. Introduction
2. Structure
3. Objectives
9. Conclusion
11. Questions
1. Introduction
2. Structure
3. Objectives
8. Conclusion
9. Points to remember
10. Questions
1. Introduction
2. Structure
3. Objectives
4. Conclusion
5. Points to remember
6. Questions
17. Index
CHAPTER 1
Environmental Setup for Using Data Science Libraries in Python
Introduction
In this chapter, we will explore the data science libraries in Python. However, before learning about the various data science libraries, it is quite important to set up an environment for installing and using them. Setting up the environment for utilizing data science
libraries like NumPy, SciPy, Matplotlib, Pandas, and others in Python
ensures effective data analysis and modeling workflows. This entails
managing dependencies, version control, and package management to
guarantee project compatibility and reproducibility. By creating isolated
environments, potential conflicts between different library versions are
mitigated, facilitating seamless collaboration and reproducibility of results.
So, let us get into the intricate details of Python installation and Integrated
Development Environments (IDEs) like VSCode and Jupyter Notebook for
writing and executing the Python code.
Structure
Introduction to Python
Setup Installation in Windows for Jupyter Notebook
Insights of Jupyter Notebook
Demo program using Jupyter Notebook
Introduction to Data Science Libraries in Python
Objectives
Introduction to Python
1. Use the package manager: Most Linux distributions come with Python
pre-installed. However, if you need to install or upgrade Python, you
can use the package manager. For example, on Ubuntu, you can typically use the following commands (exact package names may vary by distribution):
sudo apt update
sudo apt install python3 python3-pip
3. On clicking the Next button of Figure 1.3, read the License Agreement
as shown in Figure 1.4:
5. Then, clicking the Next button (shown in Figure 1.5) will prompt us to
select a destination folder to install Anaconda. Install Anaconda to a
directory path that avoids spaces or unicode characters, as shown in
Figure 1.6:
6. Then, when clicking the Next button in Figure 1.6, we will see the
advanced installation option, as shown in Figure 1.7:
7. Clicking the Install button in Figure 1.7 will initiate the installation
process, as shown in Figure 1.8. The user may click the Show Details
button to observe the packages installed on Anaconda3.
9. Then, click the Next button of Figure 1.10, which displays the cloud
notebook service of Anaconda:
https://docs.anaconda.com/anaconda/install/mac-os/
https://docs.anaconda.com/anaconda/install/linux/
In Figure 1.15, we can see that a Jupyter notebook file named Untitled1.ipynb already exists. In the web app shown above, the user can create a new Notebook, File, Folder, and so on. Suppose a new Notebook file has to be created, such as Untitled.ipynb. Then, the user will click Notebook and be prompted to select the kernel, as shown in Figure 1.16:
Figure 1.17: Image depicting Kernel selection for executing Python code
within notebook
Observe Figure 1.18, which annotates the demo program for our current topic in the Jupyter notebook. A Jupyter notebook comprises several integral components:
Apart from these, there are two states or modes of a notebook, viz Edit
mode and Command mode:
mynum1 = 2
mynum2 = 3
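The cell that assigns mynum3 is not reproduced above; presumably it computes the product of the two values, for example:
mynum3 = mynum1 * mynum2   # assumed; consistent with the output value 6 shown in Figure 1.18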
In another code cell, we write only mynum3 and get the output 6 in the output area. Also, notice that there are two prompts, namely In[] and Out[], which indicate input and output, respectively, for code cells.
In[]: This prompt denotes an input cell where the user can write and
execute code. The number inside the square brackets indicates the
order in which the code cell was executed within the notebook. For
example, In[2] indicates that this is the second code cell executed in
the notebook.
Out[]: This prompt denotes an output cell that displays the result of the
code execution from the corresponding input cell (In[]). The number
inside the square brackets corresponds to the input cellʼs number (In[])
from which the output was generated. For example, Figure 1.18
Out[2] indicates that this is the output whose value is 6 corresponding
to the input cell In[2].
These prompts help to keep track of the execution order and the
associated input and output for code cells in a Jupyter Notebook. They
provide a clear indication of the codeʼs execution flow and the results
generated at each step.
In Jupyter Notebook, Shift + Enter and Ctrl + Enter are both keyboard
shortcuts used to execute code cells, but they operate differently:
Shift + Enter: This shortcut executes the current cell and moves the
focus to the next cell. If there is no subsequent cell, a new one is
created below. Itʼs commonly used when you want to execute a cell
and proceed to the next task or cell in the notebook.
Ctrl + Enter: Pressing Ctrl + Enter executes the current cell but keeps
the focus within the same cell after execution. Itʼs useful when you
want to execute a cell without advancing to the next one, allowing you
to stay focused on the current cell for further editing or analysis.
In essence, Shift + Enter executes and advances to the next cell, while
Ctrl + Enter executes without moving to the next cell, enabling you to
control the flow of execution based on your workflow in Jupyter
Notebook.
We also saw the usage of pwd in In[3], which prints the current working directory in the Jupyter notebook. So, the file Untitled.ipynb is saved in the following folder:
'C:\\Users\\SAURABH\\Desktop\\Demo_Jupyter'.
Figure 1.22: Image depicting Raw Terms and Conditions page during
Microsoft VS Code installation
3. Select the directory where we want to run Visual Studio Code. Weʼll
be prompted to browse for the location. Afterward, click on the Next
button to proceed. Here, the default selected path is
C:\Users\6146c\AppData\Local\Programs\Microsoft VS Code as
shown in Figure 1.23 as follows:
Figure 1.23: Image depicting Destination folder selection during
Microsoft VS Code installation
6. The installation starts and will take some time to complete.
During the installation, we might encounter the image shown in Figure
1.26 as follows:
https://code.visualstudio.com/docs/setup/mac
https://code.visualstudio.com/docs/setup/linux
Problems: This refers to the panel where we can view and manage
diagnostic messages, warnings, and errors reported by our code or
extensions. It helps us to identify and resolve issues in our codebase.
Output: The Output panel displays the output of tasks, extensions, and
debug sessions. It provides valuable information and feedback from
various processes running within Visual Studio Code.
Debug Console: The Debug Console is where we can interactively
debug your code during a debugging session. It allows us to view and
evaluate expressions, inspect variables, and execute commands within
the context of our debug
session.
Terminal: Visual Studio Code includes an integrated terminal that
allows us to run command-line tasks and interact with our operating
system directly within the editor. It supports various shells and can be
customized to suit our preferences.
Ports: Ports refer to network ports used for communication between
Visual Studio Code and external processes, such as debuggers or
language servers. Configuring and managing ports may be necessary
when working with certain features or extensions that require network
connectivity.
Now, a new Python file, helloworld.py, is created by clicking the New File icon under the EXPLORER section. We save Python files with the .py extension. It is a text file that contains Python code, and in this particular file we write only the following print statement, as shown in Figure 1.29:
print("HelloWorld")
Now, by simply running the pwd command, we can see the current working directory, as shown in Figure 1.20:
Figure 1.32: Image depicting error on running the Python command along
with file name
So, close VSCode, restart it, and run the command python .\helloworld.py under the Terminal tab. This time we should not get an error, and the output HelloWorld will be displayed to the user, as shown below:
Note: If you press the h key while the Jupyter notebook has focus and is in command mode (blue cell border), the keyboard shortcuts help menu pops up, as shown in Figure 1.34, listing the available keyboard shortcuts in a dialog.
Conclusion
Points to remember
Questions
Introduction
We are carrying forward our approach of explaining multiple concepts in the Python language. In this chapter, we will deal with a popular topic in the data science area: open-source data science libraries in Python.
The primary library essential for data science and machine learning is numpy, short for Numerical Python. While basic mathematical operations are handled by standard Python, numpy fills the gap for complex mathematical operations such as array creation and manipulation, differentiation, statistics, and integral calculus, which are not natively supported by standard Python. Numpy introduces the n-dimensional array (ndarray) as its fundamental data structure. Written in both Python and C, it offers high performance, making it the optimal choice for machine learning algorithms requiring efficient data structures akin to Python lists. Data is stored in array format within numpy, and the library provides an array data structure referred to as ndarray. We can install numpy using pip, that is, pip install numpy.
Structure
In this chapter, we will discuss the following topics:
Commonalities shared by lists and numpy arrays
Contrasts separating lists from numpy arrays
The process of creating numpy arrays
Using random module for ndarray creation
Exploring differences between View vs. Copy in numpy
Elements access of ndarray
Iterate elements of ndarray
Arithmetic operators in numpy
Broadcasting concept in numpy
Objectives
Readers will gain a clear understanding of the comparisons between lists
and numpy arrays and what they entail. We shall delve into the creation of
ndarrays through different methods, including utilizing list and tuple data
structures. We will also demonstrate various functions using Python code
snippets and peep into ndarray creation with random values using a module.
The distinction between view and copy in numpy will be explained using
examples which is crucial for memory efficiency. We will learn different
methods to access individual elements or subsets of elements in ndarrays.
We shall see how to iterate over elements of ndarrays using loops and
explore various arithmetic operators available in numpy. In the end, we will
learn how to use broadcasting to perform operations on ndarrays with
different shapes.
Note: We will be using mynp as an alias for the numpy library. In this and the forthcoming chapters, we will write the syntax of the methods pertaining to the library. In the code, print(help(aliasname.method_name)) will be mentioned as a comment. The reader is requested to view the parameter details along with the syntax so that repetition can be avoided in this book, keeping the page count down and explaining the concepts lucidly without redundant information. It is expected that readers will view the parameters of the syntax used.
randint()
Random integers are generated within a specified range, from low (inclusive) to high (exclusive), by using the randint() function. Refer to the following syntax:
mynp.random.randint(low, high=None, size=None, dtype=int)
low: Lowest integer value which is inclusive in the range.
high: Highest integer value which is exclusive in the range. The
range becomes [0,low) if unspecified.
size: output shape which can be an integer or tuple of integers.
dtype: output desired data type which is an integer by default here.
Let us view various examples for a better explanation of the above function
as demonstrated in the code program name: Chap2_Example2.19.py as
follows:
# np.random module examples
import numpy as mynp
import sys
# single random integer value generation in the range between 20 and 29
print(mynp.random.randint(20,30))
print('-'*50)
# 1-D ndarray creation of size 5 with random values from 10 to 19
print(mynp.random.randint(10,20, size=5))
print('-'*50)
# 2-D array with random values from 0 to 49 and shape (3,4)
print(mynp.random.randint(0,50, size=(3,4)))
print('-'*50)
# memory utilization is improved using dtype
a = mynp.random.randint(1,21,size=(30,40))
print(f"ndarray int32 size: {sys.getsizeof(a)}")
a = mynp.random.randint(1,21,size=(30,40),dtype='int8')
print(f"ndarray int8 size : {sys.getsizeof(a)}")
Output:
26
--------------------------------------------------
[17 15 11 15 16]
--------------------------------------------------
[[31 21 8 41]
[47 3 39 41]
[46 21 10 31]]
--------------------------------------------------
ndarray int32 size: 4928
ndarray int8 size : 1328
shuffle()
This function shuffles the array in place along axis-0, which corresponds to the first axis of a multi-dimensional array. The content remains the same; only the order of the sub-arrays changes.
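A minimal sketch of shuffle() on a 2-D array (illustrative only, assuming the usual import numpy as mynp alias used in this chapter):
import numpy as mynp
# shuffle() works in place and only reorders along axis-0 (the rows of a 2-D array)
myarr = mynp.arange(1, 13).reshape(4, 3)
mynp.random.shuffle(myarr)
print(myarr)   # same rows, possibly in a different order; the values inside each row are unchanged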
Let us view some examples as demonstrated in the code program name:
Chap2_Example2.20.py as follows:
# Example of view
import numpy as mynp
mynp1 = mynp.array([16,28,37,46,55])
mynp2 = mynp1.view()
print(f"The original array is: {mynp1}")
print(f"The view array is: {mynp2}")
print('-'*50)
mynp1[2] = 100
print(f"After changing 2nd index value of mynp1, mynp1 array value is changed as: {mynp1}")
print(f"After changing 2nd index value of mynp1, mynp2 array value is changed as: {mynp2}")
print('-'*50)
mynp2[-1] = 300
print(f"After changing last index value of mynp2, mynp1 array value is changed as: {mynp1}")
print(f"After changing last index value of mynp2, mynp2 array value is changed as: {mynp2}")
# Example of copy
print('*'*100)
mynp3 = mynp.array([111,132,136,147,158])
mynp4 = mynp3.copy()
print(f"The original array is: {mynp3}")
print(f"The copy array is: {mynp4}")
print('-'*50)
mynp3[2] = 110
print(f"After changing 2nd index value of mynp3, mynp3 array value is changed as: {mynp3}")
print(f"After changing 2nd index value of mynp3, mynp4 array value is retained as: {mynp4}")
print('-'*50)
mynp4[-1] = 310
print(f"After changing last index value of mynp4, mynp3 array value is retained as: {mynp3}")
print(f"After changing last index value of mynp4, mynp4 array value is changed as: {mynp4}")
Output:
The original array is: [16 28 37 46 55]
The view array is: [16 28 37 46 55]
--------------------------------------------------
After changing 2nd index value of mynp1, mynp1
array value is changed as: [ 16 28 100 46 55]
After changing 2nd index value of mynp1, mynp2
array value is changed as: [ 16 28 100 46 55]
--------------------------------------------------
After changing last index value of mynp2, mynp1
array value is changed as: [ 16 28 100 46 300]
After changing last index value of mynp2, mynp2
array value is changed as: [ 16 28 100 46 300]
**************************************************
**************************************************
The original array is: [111 132 136 147 158]
The copy array is: [111 132 136 147 158]
--------------------------------------------------
After changing 2nd index value of mynp3, mynp3
array value is changed as: [111 132 110 147 158]
After changing 2nd index value of mynp3, mynp4
array value is retained as: [111 132 136 147 158]
--------------------------------------------------
After changing last index value of mynp4, mynp3
array value is retained as: [111 132 110 147 158]
After changing last index value of mynp4, mynp4
array value is changed as: [111 132 136 147 310]
Let the mynp1 array have a shape of (2,3), a 2-D array, and the mynp2 array have a shape of (3,), a 1-D array. The two arrays have different dimensions, so by rule one we pad the lesser-dimension array (here mynp2) with leading 1s. So, the mynp1 array is of shape (2,3) and the mynp2 array is of shape (1,3).
Now the mynp1 array is of shape (2,3) and the mynp2 array is of shape (1,3). After applying rule two, the first coordinate of mynp2 (1) is stretched to match that of mynp1 (2), so the shape of both the mynp1 and mynp2 arrays becomes (2,3). Their sizes match at six elements, and both arrays are now 2-D.
Note: If the sizes in any dimension are not the same and are not equal
to one, we will receive an error because numpy is unable to do
broadcasting across those arrays.
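A minimal sketch of the broadcasting just described (illustrative only, using the mynp alias):
import numpy as mynp
mynp1 = mynp.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
mynp2 = mynp.array([10, 20, 30])             # shape (3,): padded to (1, 3), then stretched to (2, 3)
print(mynp1 + mynp2)
# [[11 22 33]
#  [14 25 36]]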
Conclusion
In this chapter, we initially explored the concept of numpy arrays.
Subsequently, we delved into comparing and contrasting lists and numpy
arrays through illustrative examples. We also examined different methods
for creating ndarrays, including 1-D array creation using lists and tuples in
Python code. For 2-D array creation, we explored approaches such as
nested lists and specifying data types. Additionally, we covered the creation
of 1-D arrays using the arange() function. Furthermore, we introduced
several other functions such as linspace(), zeros(), ones(), full(), eye(),
diag(), and empty(), along with their respective functionalities, followed by
Python code snippets. This chapter also emphasized the crucial distinction
between View and Copy in numpy for memory efficiency, supported by
illustrative examples. The reader learned about different methods to access
individual elements or subsets of elements in ndarrays and how to iterate
them using loops. Finally, we have explored the wide range of arithmetic
operators available in numpy and how to leverage broadcasting for
operations on ndarrays with different shapes.
In the next chapter, we shall learn about various array manipulation functions in numpy, followed by functions for sorting and searching, the dot function for matrix multiplication, and the linalg module for performing linear algebra, with various examples.
Points to remember
Numpy arrays are implemented in both C and Python languages.
Numpy arrays exhibit high performance and are the preferred choice
for machine learning algorithms.
Numpy arrays are faster and consume less memory compared to lists.
Numpy arrays contain homogeneous elements, while lists can contain
heterogeneous elements.
Vector operations can be performed on ndarrays, unlike lists.
1-D arrays can be created using various methods including lists,
tuples, arange(), linspace(), zeros(), ones(), full(), eye(), diag(),
and empty() functions.
Within a specified range, the randint() function generates random integers from low (inclusive) to high (exclusive).
The shuffle() function shuffles the order of the sub-arrays (along axis-0) of the given ndarray in place.
View is not a separate object but just a logical representation of the
existing array whereas Copy is a separate object.
We can access elements of ndarray using indexing, advanced
indexing, slicing, and condition-based selection.
We can iterate over ndarray elements using Python's loop concept, the nditer function, and the ndenumerate function.
Questions
1. What defines a numpy array, and why is it often considered the
optimal choice for machine learning applications?
2. Explain the commonalities shared between lists and numpy arrays.
3. What are the distinguishing features between lists and numpy arrays?
4. How can 1-D arrays be created using lists, tuples, and a Python code
snippet?
5. Could you provide a Python code snippet demonstrating the creation
of 2-D arrays using nested lists?
6. How is array creation achieved with a specific data type, and could
you illustrate this with a Python code snippet?
7. How can 1-D arrays be created using the arange() function, and
could you provide an example with a Python code snippet?
8. What is the purpose of the linspace() function, and can you
showcase its usage with a Python code snippet?
9. Could you provide Python code snippets demonstrating the
functionality of functions such as zeros(), ones(), full(), eye(),
diag(), and empty()?
References
Programming techniques using Python: A must for beginners
covering Core + Advanced Python + MCQs for Microsoft and
various other certification exams.
Python for everyone: Coverage of fundamental topics for Core+
Advanced Python along with some open-source data science libraries
Building Modern GUIs with tkinter and Python: Coverage of GUI
app creation using the Tkinter library in Python.
Python GUI with PyQt: Covered GUI app creation using the PyQt5
library in Python.
In the previous chapter, we learned what exactly numpy array is and how to
create ndarrays using various approaches along with accessing and iterating
elements of ndarray. In addition to it, numpy offers a comprehensive suite
of array manipulation functions that are indispensable for efficient data
processing and analysis. Understanding how to reshape arrays, concatenate or stack multiple arrays, and split them into meaningful segments is crucial for managing and preparing datasets.
ndarrays are essential for organizing and extracting relevant information,
while the ability to insert or delete elements dynamically enhances data
manipulation capabilities. Moreover, numpyʼs dot function for matrix
multiplication and the linalg module for linear algebra operations are
fundamental tools for handling complex mathematical computations. They
make numpy an indispensable resource for researchers, data scientists, and
engineers working with large-scale numerical data.
Structure
Objectives
By the end of this chapter, the reader will have an idea about various array
manipulation functions and variables available in numpy. We will
understand different methods for joining ndarrays. We shall explore how to
split ndarrays into smaller ones. Sorting of ndarrays in numpy will be covered, along with some of the search functions available in numpy and the insertion and deletion of elements into and from ndarrays. The usage of the dot function in numpy for matrix
multiplication will be explored. Finally, the linalg module in numpy for
linear algebra will be well understood with various examples. So, by the
end of this chapter, the reader will be able to appreciate the concept of
numpy arrays by looking into their properties, ways of creation, and their
importance for machine learning algorithms. In this chapter, we will also be
using an alias for numpy library as mynp.
reshape()
This function changes the shape of an array without changing its data. It is available as a function in the numpy module and as a method of the ndarray class. Just make sure that the sizes match, otherwise we will get a ValueError. Here, a View is created instead of a Copy, so if changes are made in the original array, they will be reflected in the reshaped array. Refer to the following syntax for a better understanding:
mynp.reshape(a, newshape, order='C')
or
a.reshape(newshape, order='C')
print("Eg-1 1-D array conversion to 2-D array with by default C style order
---------------------------")
myarr1 = mynp.arange(1,7)
myarr2 = myarr1.reshape(2,3)
print(f"myarr1 is : {myarr1}")
print(f"myarr2 is : {myarr2}")
myarr1[1] = 12
myarr3 = mynp.arange(1,7)
print(f"myarr3 is : {myarr3}")
print(f"myarr4 is : {myarr4}")
Output:
Eg-1 1-D array conversion to 2-D array with by default C style order -------
--------------------
myarr1 is : [1 2 3 4 5 6]
myarr2 is : [[1 2 3]
[4 5 6]]
myarr3 is : [1 2 3 4 5 6]
myarr4 is : [[ 1 3 5]
[12 4 6]]
resize()
This function resizes an array to a specified shape, which can either increase or decrease the array size. In the functional style, if the new size requires more elements, elements of the input array are repeated, and a new array object is created. In the object-oriented (method) style, the extra elements are filled with zeros and the existing array itself is modified. Refer to the following syntax for a better understanding:
mynp.resize(a, new_shape)
or
a.resize(new_shape, refcheck=True)
import numpy as mynp
# print(help(mynp.resize))
myarr1 = mynp.arange(1,5)
myarr2 = mynp.resize(myarr1,(2,3))   # functional style: repeats elements to fill the new shape
myarr3 = mynp.arange(1,5)
myarr3.resize((2,3))                 # method style: extra positions are filled with zeros
# print statements reconstructed to match the output below
print(f"myarr1 is: {myarr1}")
print(f"myarr2 is: {myarr2}")
print(f"myarr3 is: {myarr3}")
Output:
myarr1 is: [1 2 3 4]
myarr2 is: [[1 2 3]
[4 1 2]]
myarr3 is: [[1 2 3]
[4 0 0]]
flatten()
This function returns a copy of the array collapsed into one dimension. Since a Copy is returned, changes made to the original array afterwards are not reflected in the flattened array. Refer to the following syntax for a better understanding:
ndarray.flatten(order='C')
# print(help(mynp.ndarray.flatten))
myarr1 = mynp.arange(1,7).reshape(2,3)
myarr2 = myarr1.flatten()
print(f"myarr1 is {myarr1}")
print(f"myarr2 is {myarr2}")
print("Eg2 Changing the original array value will not be reflected in the flatten array------------------------------")
myarr1[0,0] = 13
print(f"myarr1 is {myarr1}")
print(f"myarr2 is {myarr2}")
myarr3 = myarr1.flatten('F')
Output:
myarr1 is [[1 2 3]
[4 5 6]]
myarr2 is [1 2 3 4 5 6]
Eg2 Changing the original array value will not be reflected in the flatten
array------------------------------
myarr1 is [[13 2 3]
[ 4 5 6]]
myarr2 is [1 2 3 4 5 6]
flat variable
This attribute is a 1-dimensional iterator over the array, which allows access to the elements of a multi-dimensional array as if it were flattened. Refer to the following syntax for a better understanding:
ndarray.flat
import numpy as mynp
#print(help(mynp.ndarray.flat))
print("Eg1 ------------------------------")
mynp1 = mynp.arange(1,7).reshape(2,3)
print(mynp1.flat)
# iterating over each element via the flat iterator
for loop in mynp1.flat:
    print(loop)
Output:
Eg1 ------------------------------
2
3
ravel()
This function is similar to the flatten method, except that it returns a View, whereas flatten returns a Copy. So, if any changes are made in the original array, they are reflected in the ravelled array too, and vice versa. Refer to the following syntax for a better understanding:
mynp.ravel(a, order='C')
or
a.ravel(order='C')
import numpy as mynp
#print(help(mynp.ndarray.ravel))
myarr1 = mynp.arange(1,7).reshape(2,3)
myarr2 = myarr1.ravel()
print(f"myarr1 is {myarr1}")
print(f"myarr2 is {myarr2}")
myarr2[0] = 11   # reflected in myarr1 as well, since ravel() returns a view
myarr3 = mynp.arange(1,7).reshape(2,3)
myarr4 = myarr3.ravel('F')   # reconstructed from the output: Fortran-order ravel
print(f"myarr3 is {myarr3}")
print(f"myarr4 is {myarr4}")
Output:
myarr1 is [[1 2 3]
[4 5 6]]
myarr2 is [1 2 3 4 5 6]
myarr3 is [[1 2 3]
[4 5 6]]
myarr4 is [1 4 2 5 3 6]
transpose()
This function interchanges the axes of an array; by default it reverses them. A new array is not created; it returns a View of the existing array. Using the T attribute, we can use a shortcut representation of the transpose. The total number of elements does not change; only the ordering of the dimensions does. Refer to the following syntax for a better understanding:
mynp.transpose(a, axes=None)
or
a.transpose(*axes)
#print(help(mynp.transpose))  # print statements below are reconstructed to match the output
a = mynp.arange(1,7).reshape(2,3)
myarr1 = a.transpose()
print(f"Original Array is :\n{a}")
print(f"Transposed Array is :\n{myarr1}")
a = mynp.arange(1,25).reshape(2,3,4)
myarr2 = mynp.transpose(a, axes=(1,2,0))
print(a)
print(myarr2)
print(a.shape, myarr2.shape, sep='\n')
print("Eg3 Using T variable------------------------------")
a = mynp.arange(1,25).reshape(2,3,4)
myarr3 = a.T
print(myarr3)
Output:
Original Array is :
[[1 2 3]
[4 5 6]]
Transposed Array is :
[[1 4]
[2 5]
[3 6]]
[[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
[[13 14 15 16]
[17 18 19 20]
[21 22 23 24]]]
[[[ 1 13]
[ 2 14]
[ 3 15]
[ 4 16]]
[[ 5 17]
[ 6 18]
[ 7 19]
[ 8 20]]
[[ 9 21]
[10 22]
[11 23]
[12 24]]]
(2, 3, 4)
(3, 4, 2)
Eg3 Using T variable------------------------------
[[[ 1 13]
[ 5 17]
[ 9 21]]
[[ 2 14]
[ 6 18]
[10 22]]
[[ 3 15]
[ 7 19]
[11 23]]
[[ 4 16]
[ 8 20]
[12 24]]]
This function will join arrays along a specified axis. Refer to the following syntax:
mynp.concatenate((a1, a2, ...), axis=0, out=None, dtype=None)
When using the concatenate() function in numpy, the choice of the axis parameter determines how arrays will be joined together. Here is what happens with different axis values:
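As a minimal illustration of the axis parameter (a sketch separate from the book's listing that follows, assuming the mynp alias):
import numpy as mynp
a = mynp.array([[1, 2, 3], [4, 5, 6]])
b = mynp.array([[10, 11, 12], [13, 14, 15]])
print(mynp.concatenate((a, b), axis=0))     # joins row-wise: shape (4, 3)
print(mynp.concatenate((a, b), axis=1))     # joins column-wise: shape (2, 6)
print(mynp.concatenate((a, b), axis=None))  # flattens both arrays before joining: shape (12,)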
#print(help(mynp.concatenate))
myarr1= mynp.arange(4)
myarr2= mynp.arange(5)
myarr4 = mynp.array([[10,11,12],[13,14,15]])
print('-'*50)
print('-'*50)
# the result in an array, but the result and out must be in the same shape.
myarr5 = mynp.arange(5)
myarr6 = mynp.arange(3)
Output:
Vertical concatenation: [[ 1 2 3]
[ 4 5 6]
[10 11 12]
[13 14 15]]
--------------------------------------------------
[ 4 5 6 13 14 15]]
--------------------------------------------------
out and dtype usage: ['0' '1' '2' '3' '4' '0' '1' '2']
stack()
Using the stack() function, a sequence of arrays is joined along a new axis. It receives an iterable sequence of arrays as input and outputs a new array with the inputs stacked along the new axis. The shapes of the input arrays must be the same, otherwise we will get an error. The stacked array resulting from the stack() function has one additional dimension compared to the input arrays. Refer to the following syntax:
mynp.stack(arrays, axis=0)
# print(help(mynp.stack))
myarr1= mynp.array([1,2,3])
myarr2 = mynp.array([10,12,13])
mystack_array = mynp.stack((myarr1,myarr2))
myarr3= mynp.arange(1,7).reshape(2,3)
myarr4 = mynp.array([[10,12,13],[14,15,17]])
# Take 1st row from 1st array and make it 1st row of the stacked array
# Take 2nd row from 2nd array and make it 2nd row of the stacked array
# Combine these 2 rows and repeat the remaining rows of the arrays
Output:
[10 12 13]]
[10 12 13]]
[[ 4 5 6]
[14 15 17]]]
[ 2 12]
[ 3 13]]
[[ 4 14]
[ 5 15]
[ 6 17]]]
vstack()
This function will stack arrays vertically. It takes a sequence of arrays as input, and a single array is returned by stacking them vertically. It is equivalent to concatenation along the first axis after reshaping 1-D arrays of shape (N,) to (1, N). An important point is that all the input arrays must have the same shape except along axis zero. Refer to the following syntax for a better understanding:
mynp.vstack(tup)
import numpy as mynp
myarr1 = mynp.array([1,2,3])   # reconstructed from the output below
myarr2 = mynp.array([11,12,13])
myarr3 = mynp.vstack((myarr1,myarr2))
print(f"Vstack array is {myarr3}")
a = mynp.arange(1,10).reshape(3,3)
b = mynp.arange(20,26).reshape(2,3)
print(f"vstack array is : {mynp.vstack((a,b))}")
Output:
Vstack array is [[ 1 2 3]
[11 12 13]]
vstack array is : [[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[20 21 22]
[23 24 25]]
Note: The preceding code is covered in (program name:
Chap3_Example3.9.py)
hstack()
This function will stack arrays horizontally. It takes a sequence of arrays as input, and a single array is returned by stacking them horizontally. An important point is that the input arrays must have the same shape along every axis except the concatenation axis. This is the same as concatenation along the second axis, except for 1-D arrays, where it concatenates along the first axis. Refer to the following syntax for a better understanding:
mynp.hstack(tup)
# print(help(mynp.hstack))
myarr2 = mynp.array([11,12,13,14])
myarr3 = mynp.hstack((myarr1,myarr2))
a = mynp.arange(1,10).reshape(3,3)
b = mynp.arange(1,7).reshape(3,2)
myarr4 = mynp.arange(1,5).reshape(2,2)
myarr5 = mynp.arange(1,7).reshape(3,2)
Output:
[4 5 6 3 4]
[7 8 9 5 6]]
ValueError: all the input array dimensions except for the concatenation axis
must match exactly, but along dimension 0, the array at index 0 has size 2
and the array at index 1 has size 3
dstack()
This function will stack arrays along axis two (the third axis). It takes a sequence of arrays as input, and a single array is returned by stacking them along the third axis. An important point to note is that the 1-D or 2-D input arrays must have the same shape. It is equivalent to concatenation along the third axis after reshaping 2-D arrays of shape (A, B) to (A, B, 1) and 1-D arrays of shape (A,) to (1, A, 1). Refer to the following syntax for a better understanding:
mynp.dstack(tup)
# print(help(mynp.dstack))
myarr2 = mynp.array([11,12,13])
myarr3 = mynp.dstack((myarr1,myarr2))
a = mynp.arange(11,17).reshape(3,2)
b = mynp.arange(1,7).reshape(3,2)
Output:
[ 3 13]]]
[12 2]]
[[13 3]
[14 4]]
[[15 5]
[16 6]]]
We can divide a ndarray into multiple smaller arrays along a specified axis
by performing split operations on ndarrays. Splitting of arrays in numpy is
essential for a variety of reasons. Firstly, it allows for efficient data
preprocessing by breaking down large datasets into smaller, manageable
chunks for analysis or manipulation. Secondly, splitting arrays facilitates
parallel computing, enabling different parts of an array to be processed
simultaneously across multiple computing units. Thirdly, it supports
organizing data for specific analytical tasks, such as cross-validation in
machine learning or time-series analysis in signal processing. Additionally,
array splitting provides flexibility in handling data with irregular or unequal
partitions, accommodating various computational requirements. Overall,
the ability to split arrays in numpy is crucial for optimizing computational
tasks, enhancing data organization, and improving overall efficiency in
numerical computing and data analysis workflows. We shall see direct
examples with their definitions and syntax.
split()
This function will split an array into multiple sub-arrays of equal size along a specified axis and will return a list of ndarray objects. If the split is specified as a number of sections, the sub-arrays must be of equal size (otherwise we will get an error). If the split is specified by indices, the sub-array sizes need not be the same. Refer to the following syntax for a better understanding:
mynp.split(ary, indices_or_sections, axis=0)
#print(help(mynp.split))
myarr1 = mynp.arange(1,7)
myarr2 = mynp.split(myarr1,3)
myarr3 = mynp.arange(1,19).reshape(6,3)
myarr4=mynp.arange(1,11)
myresult = mynp.split(myarr4,[3,6])
myarr5=mynp.arange(1,19).reshape(3,6)
Output:
[4, 5, 6],
myresult for 1-D after splitting is: [array([1, 2, 3]), array([4, 5, 6]), array([
7, 8, 9, 10])]
[ 7, 8, 9],
[10, 11],
[12],
[18]])]
vsplit()
This function will allow row-wise splitting of an array into multiple sub-
arrays vertically along axis-0. Here, the input array must be at least a 2-D array, as a 1-D array cannot be split vertically. Refer to the following syntax
for a better understanding:
mynp.vsplit(ary, indices_or_sections)
#print(help(mynp.vsplit))
myarr1= mynp.arange(1,17).reshape(4,4)
print(f"vsplit:2 {mynp.vsplit(myarr1,2)}")
print(f"vsplit:4 {mynp.vsplit(myarr1,4)}")
print(f"vsplit:2 {mynp.vsplit(myarr1,[2,3])}")
Output:
[5, 6, 7, 8]]), array([[ 9, 10, 11, 12]]), array([[13, 14, 15, 16]])]
hsplit()
This function will split an array into multiple sub-arrays horizontally (column-wise, along the second axis); for 1-D arrays, the split is done along axis-0. Refer to the following syntax for a better understanding:
mynp.hsplit(ary, indices_or_sections)
#print(help(mynp.hsplit))
myarr1= mynp.arange(1,9).reshape(2,4)
print(f"hsplit:2 {mynp.hsplit(myarr1,2)}")
print(f"hsplit: {mynp.hsplit(myarr1,[2,3])}")
Output:
[7, 8]])]
[7]]), array([[4],
[8]])]
Note: The preceding code is covered in (program name:
Chap3_Example3.14.py)
dsplit()
This function will split an array into multiple sub-arrays along axis-2 (depth-wise). The input array must be at least a 3-D array. Refer to the following syntax for a better understanding:
mynp.dsplit(ary, indices_or_sections)
# print(help(mynp.dsplit))
myarr1= mynp.arange(24).reshape(2,3,4)
print(f"dsplit: {mynp.dsplit(myarr1,2)}")
print(f"dsplit: {mynp.dsplit(myarr1,[1,3])}")
Output:
[ 4, 5],
[ 8, 9]],
[[12, 13],
[16, 17],
[ 6, 7],
[10, 11]],
[[14, 15],
[18, 19],
[22, 23]]])]
[ 4],
[ 8]],
[[12],
[16],
[ 5, 6],
[ 9, 10]],
[[13, 14],
[17, 18],
[ 7],
[11]],
[[15],
[19],
[23]]])]
array_split()
This function allows an unequal split of an array into sub-arrays when the number of splits does not divide the size of the specified axis evenly. For an array of length a that is to be divided into n parts, it returns a % n sub-arrays of size a//n + 1 and the rest of size a//n. Refer to the following syntax for a better understanding:
mynp.array_split(ary, indices_or_sections, axis=0)
# print(help(mynp.array_split))
myarr1 = mynp.arange(10)
myarraysplit = mynp.array_split(myarr1,3)
myarr2 = mynp.arange(10).reshape(5,2)
myarraysplit2 = mynp.array_split(myarr2,4)
Output:
Array split 1-D array result is: [array([0, 1, 2, 3]), array([4, 5, 6]), array([7,
8, 9])]
We can sort ndarray elements using the sort() function, which sorts elements along a specified axis of an array. Quicksort is the default sorting algorithm, and the sorting is in ascending order. For 2-D arrays, sorting is done along the last axis by default, so every row (1-D sub-array) is sorted. Let us view the following code for a better understanding:
# print(help(mynp.sort))
myarr1 = mynp.array([45,34,68,34,67,26,9,98,35])
print('-'*50)
myarr2 = mynp.array(['mat','hat','rat','bat','cat'])
print('-'*50)
myarr3 = mynp.array([[78,35,98],[21,11,9],[87,35,65]])
print('-'*50)
Output:
--------------------------------------------------
String sorted 1-D array in ascending order is ['bat' 'cat' 'hat' 'mat' 'rat']
String sorted 1-D array in descending order is ['rat' 'mat' 'hat' 'cat' 'bat']
--------------------------------------------------
[ 9 11 21]
[35 65 87]]
[21 11 9]
[87 65 35]]
--------------------------------------------------
where()
Using the where() function, if we simply specify a condition, the indices of the elements that satisfy the condition are returned; the where() function does not return the elements themselves. If we provide the condition together with x and y, then the items that satisfy the condition are replaced with x, and the rest of the elements are replaced with y. So it returns indices and can also serve as a replacement operator; when used as a replacement operator, it behaves just like the ternary operator. Refer to the following syntax for a better understanding:
mynp.where(condition, x, y)
# print(help(mynp.where))
myarr1= mynp.array([10,11,12,15,19,8,39,16,57,8])
mysearch = mynp.where(myarr1%2==0)
print(myarr1[mysearch])
myboolean_indexing = myarr1%2==0
print(myarr1[myboolean_indexing])
Output:
[10 12 8 16 8]
[10 12 8 16 8]
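The replacement (ternary-like) form of where() is not demonstrated in the listing above; a minimal sketch of it (illustrative only):
import numpy as mynp
myarr1 = mynp.array([10, 11, 12, 15, 19, 8, 39, 16, 57, 8])
# condition, x, y form: even elements become 0, odd elements are kept as-is
print(mynp.where(myarr1 % 2 == 0, 0, myarr1))
# [ 0 11  0 15 19  0 39  0 57  0]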
insert()
This function will insert values into an ndarray at the specified indices along a given axis. Refer to the following syntax for a better understanding:
mynp.insert(arr, obj, values, axis=None)
# print(help(mynp.insert))
myarr1= mynp.array([10,11,12,15,19,8,39,16,57,8])
myarr2 = mynp.insert(myarr1,[1,4],[87,89])
print(myarr3)
print(myarr4)
# If axis not defined for 2-D array, then 2D will be flattened to 1-D
myarr5 = mynp.arange(4).reshape(2,2)
myarr6 = mynp.insert(myarr5,2,10)
print(myarr6)
print(myarr7)
print('-'*50)
print(myarr8)
Output:
[10 87 11 12 15 89 19 8 39 16 57 8]
[10 11 33 12 15 19 8 39 16 57 8]
[10 11 1 12 15 19 8 39 16 57 8]
[ 0 1 10 2 3]
[[ 0 1]
[10 10]
[ 2 3]]
--------------------------------------------------
[[ 0 10 1]
[ 2 10 3]]
append()
This function will append values to the end of an ndarray along a given axis. If the axis is not specified, both the input array and the values are flattened before appending. Refer to the following syntax for a better understanding:
mynp.append(arr, values, axis=None)
myarr1 = mynp.arange(10)
print("In 1-D array, since element added is of float type, all the elements
will be converted to float type")
print(myarr2)
print(myarr3)
myarr4 = mynp.arange(6).reshape(2,3)
myarr5 = mynp.append(myarr4,10)
print(myarr5)
print("If I/P array is 2-D array and axis is specified, then appended array
must also be 2-D array otherwise error")
myarr6 = mynp.arange(10,16).reshape(2,3)
print(myarr7)
print(myarr8)
Output:
In 1-D array, since element added is of float type, all the elements will be
converted to float type
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.5]
['0' '1' '2' '3' '4' '5' '6' '7' '8' '9' '10.5']
[ 0 1 2 3 4 5 10]
If I/P array is 2-D array and axis is specified, then appended array must
also be 2-D array otherwise error
[[ 0 1 2]
[ 3 4 5]
[10 11 12]
[13 14 15]]
[[ 0 1 2 10 11 12]
[ 3 4 5 13 14 15]]
delete()
This function will remove array elements of ndarray along a specified axis.
For multi-dimensional arrays, the axis must be specified, otherwise, the
default value of axis=None will be used. In this scenario, the array is first
flattened to a 1-D array before deletion is carried out. Refer to the
following syntax for a better understanding:
mynp.delete(arr, obj, axis=None)
#print(help(mynp.delete))
myarr1 = mynp.arange(10)
myarr2 = mynp.delete(myarr1, 4)
print(myarr2)
print(myarr3)
myarr4 = mynp.arange(9).reshape(3,3)
print("Since no axis is specified. So, flattening will be done for 2-D array")
myarr5 = mynp.delete(myarr4,2)
print(myarr5)
print(myarr6)
print(myarr7)
Output:
[0 1 2 3 5 6 7 8 9]
[0 1 6 7 8 9]
Since no axis is specified. So, flattening will be done for 2-D array
[0 1 3 4 5 6 7 8]
[[0 1 2]
[6 7 8]]
[[0 2]
[3 5]
[6 8]]
The dot() function computes the dot product of two arrays; for 2-D arrays, this is matrix multiplication. Refer to the following syntax for a better understanding:
mynp.dot(a, b, out=None)
# print(help(mynp.dot))
myarr1 = mynp.arange(1,5).reshape(2,2)
myarr2 = mynp.arange(11,15).reshape(2,2)
print(myarr1)
print(myarr2)
print("using numpy function")
print(mynp.dot(myarr1,myarr2))
print(myarr1.dot(myarr2))
Output:
[[1 2]
[3 4]]
[[11 12]
[13 14]]
[[37 40]
[85 92]]
[[37 40]
[85 92]]
import numpy as mynp
myarr1 = mynp.arange(11,15).reshape(2,2)
# the inverse-related print statements are reconstructed to match the output below
print("Finding matrix inverse")
print(mynp.linalg.inv(myarr1))
print(mynp.linalg.matrix_power(myarr1,0))
print(mynp.linalg.matrix_power(myarr1,2))
print(mynp.linalg.matrix_power(myarr1,-2))   # inverse of the squared matrix
print(mynp.linalg.det(myarr1))
myarr2 = mynp.array([[1,1],[2,6]])
myarr3 = mynp.array([6,24])
print(mynp.linalg.solve(myarr2,myarr3))
Output:
Finding matrix inverse
[[-7. 6. ]
[ 6.5 -5.5]]
[[1 0]
[0 1]]
[[277 300]
[325 352]]
[[ 88. -75. ]
[-81.25 69.25]]
-1.9999999999999931
[3. 3.]
Note: Apart from all these application usages of numpy which we have
seen till now, we can also use numpy library for some basic statistics like
calculating the minimum, maximum, average, median, variance, standard
deviation, and so on. It is also helpful in computing mathematical
operations like sine, cos, log functions, and so on. So, depending on
application-to-application requirements, we can use this numpy library. We
have left the calculation of a minimum for you to explore these concepts as
they are quite easy and direct usage can be applied for both 1-D and 2-D
arrays.
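A minimal sketch of such statistical usage (illustrative only, not one of the book's numbered listings):
import numpy as mynp
myarr = mynp.array([[4, 9, 2], [7, 1, 6]])
print(mynp.min(myarr))           # minimum over the whole array
print(mynp.min(myarr, axis=0))   # column-wise minima
print(mynp.max(myarr), mynp.mean(myarr), mynp.median(myarr))
print(mynp.var(myarr), mynp.std(myarr))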
Conclusion
In the next chapter, we shall see about Scipy library which provides
additional functionality for scientific computing. It offers a wide range of
mathematical algorithms and functions for tasks such as optimization,
interpolation, linear algebra, signal processing, and many more.
Points to remember
Questions
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 4
Exploring Scipy Library for Data
Science in Python
Introduction
Scipy is a comprehensive scientific computing library built on top of
NumPy, extending its capabilities with a wide range of tools for tasks like
linear algebra, optimization, numerical integration, and signal processing.
Co-created by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy
combines Pythonʼs ease of use with performance-critical components
written in C, C++ and Fortran. As a collection of mathematical algorithms
and convenience functions, Scipy enhances the NumPy extension of
Python, making it a powerful tool for technical and scientific applications.
To install scipy, we can use the pip command: pip install scipy.
It is important to note the difference between pip and pip3. pip refers to the default Python package installer, which may be associated with Python 2 or Python 3 depending on your system configuration. To be sure the package is installed for Python 3, you can use pip3 explicitly:
pip3 install scipy
For specific version requirements, you can specify the version of the SciPy library using pip:
pip install scipy==<version_number>
Structure
The following topics will be discussed in the current chapter:
Difference between numpy and scipy
Exploring scipy constants
Optimizers in scipy
Sparse data and sparse matrix in scipy
Graphs in scipy
Spatial data in scipy
Integrate in scipy
Interpolation in scipy
Objectives
After exploring all the concepts of this chapter, the reader will initially have an idea about the difference between numpy and scipy. We aim to explore the scipy constants, which represent physical, mathematical, scientific, and other useful values. We will also delve into
optimizers in scipy where different optimization algorithms available will
be discussed with some examples like examples of finding the roots of an
equation and many more. Additionally, we will study sparse data and sparse
matrix representation in scipy, enabling efficient storage and manipulation
of large, mostly empty matrices. The seven different sparse matrix types in
scipy will be well explored. We will learn about graphs in scipy, where the
module provides a number of functions to effectively analyze and work
with sparse graphs. Spatial data analysis using scipy allows us to work with
geographic and geometric data. The chapter covers integration techniques
in scipy, which are useful for numerical integration of functions. We shall
explore single and multiple integration examples. Finally, we will delve into
interpolation methods in scipy, which enables the estimation of values
between known data points. We will explore 1D interpolation, spline
interpolation, and interpolation with radial basis function for example. We
will be using an alias for scipy library with modules as myscpy at various
places.
Metric
The metric (SI) prefixes are yotta, zetta, exa, peta, tera, giga, mega, kilo, hecto, deka, deci, centi, milli, micro, nano, pico, femto, atto, and zepto. These are the metric prefixes that we studied in Class-IX physics. Each constant returns the corresponding multiplier as a power of ten. The code (program name: Chap4_Example4.3.py) revealing the metric units is demonstrated as follows:
import scipy.constants as myscpy
print(myscpy.yotta) # 1e+24
print(myscpy.zetta) # 1e+21
print(myscpy.atto) # 1e-18
print(myscpy.zepto) # 1e-21
Output:
1e+24
1e+21
1e-18
1e-21
Force
The specified unit returned is Newton and the code program name:
Chap4_Example4.4.py revealing the force units is demonstrated as follows:
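The listing itself is not reproduced here; a minimal sketch using a few of the force constants available in scipy.constants (each value is expressed in newtons):
import scipy.constants as myscpy
print(myscpy.dyn)   # one dyne in newtons
print(myscpy.lbf)   # one pound-force in newtons
print(myscpy.kgf)   # one kilogram-force in newtons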
Angle
Radians is the specified unit which is returned. Refer to the following code
program name: Chap4_Example4.6.py depicting different angle units:
import scipy.constants as myscpy
print(myscpy.degree)
print(myscpy.arcmin)
print(myscpy.arcsec)
Output:
0.017453292519943295
0.0002908882086657216
4.84813681109536e-06
Mass
The specified unit returned is kg; for example, gram is returned with the value 0.001 kilograms. Refer to the following code program name:
Chap4_Example4.7.py depicting different mass units:
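The listing itself is not reproduced here; a minimal sketch using a few of the mass constants available in scipy.constants (each value is expressed in kilograms):
import scipy.constants as myscpy
print(myscpy.gram)         # 0.001
print(myscpy.metric_ton)   # 1000.0
print(myscpy.pound)        # one avoirdupois pound in kilograms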
Pressure
Pascals is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.9.py depicting different pressure units.
import scipy.constants as myscpy
print(myscpy.atm)
print(myscpy.bar)
print(myscpy.psi)
Output:
101325.0
100000.0
6894.757293168361
Length
Meters is the specified unit which is returned. Refer to the following code
program name: Chap4_Example4.10.py revealing different length units:
import scipy.constants as myscpy
print(myscpy.foot)
print(myscpy.mile)
print(myscpy.light_year)
Output:
0.30479999999999996
1609.3439999999998
9460730472580800.0
Volume
Cubic meters is the specified unit that is returned. Refer to the following
code program name: Chap4_Example4.11.py revealing different volume
units:
import scipy.constants as myscpy
print(myscpy.liter)
print(myscpy.gallon)
print(myscpy.barrel)
Output:
0.001
0.0037854117839999997
0.15898729492799998
Speed
Meters per second is the specified unit that is returned. Refer to the
following code program name: Chap4_Example4.12.py revealing different
speed units:
import scipy.constants as myscpy
print(myscpy.kmh)
print(myscpy.mph)
print(myscpy.speed_of_sound)
Output:
0.2777777777777778
0.44703999999999994
340.5
Temperature
Kelvin is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.13.py revealing different temperature units:
import scipy.constants as myscpy
print(myscpy.zero_Celsius)
print(myscpy.degree_Fahrenheit)
Output:
273.15
0.5555555555555556
Energy
Joules is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.14.py revealing different energy units:
import scipy.constants as myscpy
print(myscpy.eV)
print(myscpy.calorie)
Output:
1.602176634e-19
4.184
Power
Watts is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.15.py revealing different power units:
import scipy.constants as myscpy
print(myscpy.hp)
print(myscpy.horsepower)
Output:
745.6998715822701
745.6998715822701
Area
The square meter is the specified unit which is returned. Refer to the
following code program name: Chap4_Example4.16.py revealing different
area units:
import scipy.constants as myscpy
print(myscpy.hectare)
print(myscpy.acre)
Output:
10000.0
4046.8564223999992
Optimizers in scipy
Through the scipy.optimize subpackage, scipy offers several optimization
techniques. It offers a set of scipy-defined procedures for determining an
equationʼs root or the least value of a function. All machine learning
algorithms simply involve the minimization of a challenging equation using
the available data. The frequently employed optimizers in scipy are
mentioned as follows:
Global optimization routines are anneal(), basinhopping(), and
brute-force.
Minimization of multivariate scalar functions with minimize(), with and without constraints, can be done using a variety of techniques like Broyden–Fletcher–Goldfarb–Shanno (BFGS), Nelder-Mead simplex, Newton conjugate gradient, and Constrained Optimization BY Linear Approximation (COBYLA).
Least-squares minimization techniques are leastsq() and
curve_fit().
Note: The initial value provided for the optimization process is called
the initial guess.
Here, the code defines a function called myfunc_eqn which takes a variable x
and returns the value of a quadratic equation. It then uses the minimize
function from the optimize module in the scipy library to find the
minimum value of the quadratic equation. The method used for
optimization is BFGS, and the initial guess for the minimum value is set to
zero. The result of the optimization is stored in the variable myvarmin and is
displayed on the screen eventually.
Remember that fun is the objective function value, jac is the Jacobian, and message is the termination cause message. nfev and njev are the numbers of evaluations of the objective function and of the Jacobian, respectively, nit is the number of iterations, status and success describe the termination and exit status, and finally, x is the optimization solution.
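The listing being described is not reproduced above; a minimal sketch consistent with that description (the quadratic equation below is an assumption, not necessarily the book's equation):
from scipy import optimize

def myfunc_eqn(x):
    # an assumed quadratic; the book's exact equation is not shown in this excerpt
    return x**2 + 4*x + 4

# BFGS method with an initial guess of zero; the result is stored in myvarmin
myvarmin = optimize.minimize(myfunc_eqn, x0=0, method='BFGS')
print(myvarmin)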
Let us view the following example that shows optimization problems:
Consider the objective function: min x1² + 3*x1*x2
Constraints:
x1³ + 2*x1*x2 = 200
x1² + 2*x1*x2 ≥ 50
-200 ≤ x1, x2 ≤ 200
Minimize the function using the SLSQP method. Use initial x0 as [1,1].
Refer to the following code program name: Chap4_Example4.19.py of how
to minimize the function using SLSQP method:
import scipy as myscpy
# Define the objective function
def myobjective(myvar):
return myvar[0]**2 + 3*myvar[0]*myvar[1]
# Define the constraints
def myconstraint1(myvar):
return myvar[0]**3 + 2*myvar[0]*myvar[1] - 200
def myconstraint2(myvar):
return myvar[0]**2 + 2*myvar[0]*myvar[1] - 50
# Initial guess
x0 = [1, 1]
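# The remainder of the listing is not reproduced above; a possible completion
# (an assumption, not the book's exact code) that applies the SLSQP method:
from scipy import optimize
mycon1 = {'type': 'eq', 'fun': myconstraint1}    # x1**3 + 2*x1*x2 - 200 == 0
mycon2 = {'type': 'ineq', 'fun': myconstraint2}  # x1**2 + 2*x1*x2 - 50 >= 0
mybounds = [(-200, 200), (-200, 200)]
myresult = optimize.minimize(myobjective, x0, method='SLSQP',
                             bounds=mybounds, constraints=[mycon1, mycon2])
print(myresult)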
mycsc_matrix = myscpy.csc_matrix((mydata,
myrow_indices, mycol_pointers), shape=(3, 3))
print(mycsc_matrix)
Output:
(0, 0) 11
(1, 0) 12
(2, 1) 13
(0, 1) 14
(1, 2) 15
(2, 2) 16
Graphs in scipy
In scipy, the scipy.sparse module offers tools for working with sparse matrices, including sparse-matrix representations of graphs, and provides several sparse matrix formats that can effectively describe graphs. Working with sparse graphs and carrying out operations on them is made possible by the scipy.sparse.csgraph module. It contains routines for connectedness, shortest paths through a graph, and other operations, and provides several functions to effectively analyze and work with sparse graphs.
Adjacency matrix
The links or relationships between the nodes in a graph are represented by
the scipy adjacency matrix, which is a square matrix. Typically, each node
in a graph is represented by a row and a column in the matrix. Whether
there is an edge or connection between two nodes is indicated by the values
in the matrix. Consider the following graph with elements A, B, and C:
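The figure of the graph is not reproduced here. As an illustration, if A and B are connected by an edge of weight 1 and B and C by an edge of weight 2 (the same weights used in the listing later in this section), the adjacency matrix, with rows and columns ordered A, B, C, is:
[[0, 1, 0],
 [1, 0, 2],
 [0, 2, 0]]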
Further, we shall see some of the important methods to work with these
adjacency matrices.
connected_components method
The quantity of connected components in a graph can be found using
scipyʼs connected_components method. It accepts a sparse matrix representation of the graph as input and outputs the number of connected components, along with an array that assigns each node a label indicating the component it belongs to. Refer to the following syntax for a better understanding:
myscpy.csgraph.connected_components(csgraph, directed=True, connection='weak', return_labels=True)
Dijkstra method
Dijkstraʼs method is used to find the shortest path from a single source node
to all other nodes in a graph. Refer to the following syntax for a better
understanding:
myscpy.csgraph.dijkstra(csgraph, directed=True, indices=None, return_predecessors=False, unweighted=False, limit=np.inf, min_only=False)
Floyd_Warshall method
The Floyd-Warshall method will be used to discover the shortest path
between all pairs of nodes in a graph. Refer to the following syntax for a
better understanding:
myscpy.csgraph.floyd_warshall(csgraph, directed=True, return_predecessors=False, unweighted=False, overwrite=False)
The following code (program name: Chap4_Example4.27.py) demonstrates how to work with adjacency matrices:
import scipy.sparse as myscpy
import numpy as mynp
myarr = mynp.array([
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0]
])
mynewarr = myscpy.csr_matrix(myarr)  # numpy array conversion into a CSR matrix
print("Usage of connected_components method")
# type
print(help(myscpy.csgraph.connected_components))
# to determine the number of connected components in the CSR matrix
print(myscpy.csgraph.connected_components(mynewarr))
print('-'*50)
print("Usage of Dijkstra method")
# to find the shortest path from a given starting node (index 0)
# to all other nodes in a graph represented by the mynewarr array.
# here the return_predecessors parameter is set to True, which means
# that the algorithm will also return the predecessor nodes for each node in the shortest path
print(myscpy.csgraph.dijkstra(mynewarr, return_predecessors=True, indices=0))
print('-'*50)
# to discover the shortest path between all pairs of nodes
# in a graph represented by the mynewarr array.
print("Usage of Floyd_Warshall method")
print(myscpy.csgraph.floyd_warshall(mynewarr, return_predecessors=True))
Output:
Usage of connected_components method
(1, array([0, 0, 0]))
--------------------------------------------------
Usage of Dijkstra method
(array([0., 1., 3.]), array([-9999,     0,     1]))
--------------------------------------------------
Usage of Floyd_Warshall method
(array([[0., 1., 3.],
        [1., 0., 2.],
        [3., 2., 0.]]),
 array([[-9999,     0,     1],
        [    1, -9999,     1],
        [    1,     2, -9999]]))
There are other methods like bellman_ford, which finds shortest paths while also handling negative edge weights, as well as breadth_first_order and depth_first_order, which are left for you to explore.
Triangulation function
In scipy, the term triangulation refers to the process of building a
triangulated mesh out of a collection of points or vertices. It is frequently
employed in data visualization and computational geometry. A polygon
triangulation is the splitting up of a polygon into several triangles so that its
area can be computed.
When you triangulate with points, you build a surface out of triangles such that each of the given points is a vertex of at least one triangle.
The scipy.spatial module offers triangulation functionality within the scipy environment. The scipy.spatial.Delaunay class creates a Delaunay triangulation, a particular kind of triangulation in which no point lies inside the circumcircle of any triangle in the mesh.
The following code (program name: Chap4_Example4.28.py) demonstrates the Delaunay triangulation algorithm:
import numpy as mynp
import scipy.spatial as myscpy
import matplotlib.pyplot as myplt
myarray = mynp.array([[2,5],[3,5],[3,3],[2,2],[4,1]])
myvar_simplices = myscpy.Delaunay(myarray).simplices
myxcoordinate = myarray[:, 0]
myycoordinate = myarray[:, 1]
myplt.triplot(myxcoordinate, myycoordinate, myvar_simplices)
myplt.scatter(myxcoordinate, myycoordinate, color='b')
myplt.show()
Output:
myplt.scatter(myxcoordinate, myycoordinate, color='b')
for mypoints in myvar_simplices:
    myplt.plot(myarray[mypoints,0], myarray[mypoints,1], 'g-')
myplt.show()
Output:
myvar_simplices = myscpy.Delaunay(mypoints).simplices
myxcoordinate = mypoints[:, 0]
myycoordinate = mypoints[:, 1]
myplt.triplot(myxcoordinate, myycoordinate, myvar_simplices)
myplt.scatter(myxcoordinate, myycoordinate, color='b')
myplt.show()
Output:
Nearest neighbor value is as follows: [[5 6]
[3 6]]
Refer to the following figure to understand the output:
The nearest neighbors are retrieved from the mypoints array using the indices in myindices. These nearest neighbors are stored in the mynearest_neighbors array, and their values are displayed.
Single integration
To determine a single-variable functionʼs definite integral across a specified range, use single integration. Observe the following example: ∫₀¹ x² dx, whose answer is 1/3 ≈ 0.3333. The following code (program name: Chap4_Example4.31.py) demonstrates single integration in scipy. Refer to the following syntax for a better understanding:
scipy.integrate.quad(func, a, b, args=(), full_output=0, epsabs=1.49e-08, epsrel=1.49e-08, limit=50, points=None, weight=None, wvar=None, wopts=None, maxp1=50, limlst=50, complex_func=False)
import scipy.integrate as myscpy
def myfunc(x):
    return x**2
# print(help(myscpy.quad))
myresult, myerror = myscpy.quad(myfunc, 0, 1)
print(myresult, myerror)
The code defines a function called myfunc that takes a parameter x and
returns the square of x. It then uses the quad function from the myscpy
module to calculate the definite integral of myfunc over the interval [0, 1].
The result and the error of the integral calculation are then displayed.
Multiple integration
Calculating the definite integrals of functions with multiple variables is known as multiple integration in scipy. The nquad function from the scipy library allows for multiple integration. The functions dblquad(), tplquad(), and nquad() handle multiple integrals such as double and triple integration. Here, we will look at a double integral problem that scipy.integrate.dblquad(func, a, b, gfun, hfun) can resolve. Refer to the following syntax for a better understanding:
scipy.integrate.nquad(func, ranges, args=None, opts=None, full_output=False)
scipy.integrate.dblquad(func, a, b, gfun, hfun, args=(), epsabs=1.49e-08, epsrel=1.49e-08)
scipy.integrate.tplquad(func, a, b, gfun, hfun, qfun, rfun, args=(), epsabs=1.49e-08, epsrel=1.49e-08)
The following code (program name: Chap4_Example4.32.py) demonstrates multiple integration in scipy. Observe the following example: ∫₄⁶ ∫₁² (x/y²) dx dy. The answer to this example is 0.125. Refer to the following code for a better understanding:
import scipy.integrate as myscpy
#print(help(myscpy.nquad))
def myfunc(num1, num2):
    return num1 / num2**2
x_lower = 1
x_upper = 2
y_lower = 4
y_upper = 6
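The closing lines of this listing do not appear above. Based on the description that follows, the nquad call would look roughly as sketched below (the variable names myresult and myerror are taken from that description):
# innermost integration variable (num1) runs over [x_lower, x_upper],
# the outer variable (num2) over [y_lower, y_upper]
myresult, myerror = myscpy.nquad(myfunc, [[x_lower, x_upper], [y_lower, y_upper]])
print(myresult)  # approximately 0.125
print(myerror)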
A function called myfunc takes two arguments, num1 and num2, and returns the result of dividing num1 by the square of num2. It then sets the values of x_lower, x_upper, y_lower, and y_upper to 1, 2, 4, and 6 respectively. Next, the nquad
function is used from the myscpy library to calculate the multiple integrals of
myfunc over the range defined by [x_lower, x_upper] and [y_lower,
y_upper]. The result of the integral is stored in the variable myresult, and
the error of the integral is stored in the variable myerror. Finally, the code
prints the result and error of the integral.
Now, let us explore double integration. The following code (program name: Chap4_Example4.33.py) demonstrates double integration in scipy. Observe the following example: ∫₀¹ ∫₁^(1−x) −4(x + 5y) dy dx. The answer to this example is 8. Refer to the following code for a better understanding:
import scipy.integrate as myscpy
#print(help(myscpy.nquad))
def myfunc(x, y):
    return -4*(x + 5*y)
x_lower = 0
x_upper = 1
y_lower = lambda x:1
y_upper = lambda x: 1-x
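The closing lines of this listing are likewise not reproduced above; a minimal sketch of the dblquad call, under the same naming conventions, is shown below:
# dblquad integrates over x in [x_lower, x_upper] and, for each x,
# over y between the callables y_lower(x) and y_upper(x)
myresult, myerror = myscpy.dblquad(myfunc, x_lower, x_upper, y_lower, y_upper)
print(myresult)  # approximately 8.0
print(myerror)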
Interpolation in scipy
Interpolation is the technique of estimating values between known data
points in the context of the scipy library. It is frequently used to retrieve
values at key points in a dataset or to complete missing data. Scipy offers a
variety of interpolation techniques, including linear, polynomial, spline, and
others. Interpolation has various applications, and because missing data in a
dataset are frequently dealt with in machine learning, interpolation is
frequently used to replace those values. Imputation is the name for this process of filling in missing data. Beyond imputation, interpolation is also frequently used when discrete points in a dataset need to be smoothed. Scipy
gives us a module called scipy.interpolate that offers a wide range of
interpolation-related functions.
1D interpolation
A distribution with only one variable can be interpolated using the function interp1d(). It requires two inputs, x and y, which are arrays of values representing the coordinates of the function, along with a kind argument specifying the interpolation method to be used. It returns a callable function that, when called with a new x, returns the corresponding y. Refer to the following syntax for a better understanding:
scipy.interpolate.interp1d(x, y, kind='linear',
axis=-1, copy=True, bounds_error=None,
fill_value=nan, assume_sorted=False)
The following code (program name: Chap4_Example4.34.py) demonstrates 1D interpolation in scipy:
import scipy.interpolate as myscpy
import numpy as mynp
# print(help(myscpy.interp1d))
x = mynp.array([0, 2, 4, 6, 8, 10])
y = mynp.vectorize(lambda num: num**2)(x)
print(y)
myfunc = myscpy.interp1d(x, y, kind='linear')
print(myfunc(5))
Output:
[ 0 4 16 36 64 100]
26.0
Spline interpolation
Spline interpolation is a method for estimating values between known data points. It involves fitting the data to a piece-wise polynomial function known as a spline. So, unlike 1D interpolation, where the points are fitted to a single curve, the points here are fitted to a piece-wise function (a function that has different definitions over different ranges). The spline, consisting of multiple polynomial segments, results in a smooth curve by smoothly connecting the data points. Refer to the following syntax for a better understanding:
scipy.interpolate.UnivariateSpline(x, y, w=None,
bbox=[None, None], k=3, s=None, ext=0,
check_finite=False)
Let us view the following example of first-order spline interpolation, with the k value set to 1, in the following code (program name: Chap4_Example4.35.py):
import scipy.interpolate as myscpy
import numpy as mynp
import matplotlib.pyplot as myplt
#print(help(myscpy.UnivariateSpline))
print(mynew_y)
myplt.plot(mynew_x, mynew_y,'ro-')
myplt.scatter(mynew_x, mynew_y)
myplt.xlabel('mynew_x')
myplt.ylabel('mynew_y')
myplt.title('Spline Interpolation with k=1')
myplt.show()
Output:
[ 0.32559189 3.36574767 6.40590345 12.40590345
21.36574767 30.32559189]
Refer to the following figure for a better understanding:
print(mynew_y)
myplt.plot(mynew_x, mynew_y,'ro-')
myplt.scatter(mynew_x, mynew_y)
myplt.xlabel('x')
myplt.ylabel('y')
myplt.title('Quadratic Spline Interpolation with
k=2')
myplt.show()
Output:
[ 0.25 2.25 6.25 12.25 20.25 30.25]
Refer to the following figure for a better understanding:
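Since the data-preparation part of the preceding spline listings is not reproduced here, the following minimal sketch shows the general UnivariateSpline pattern with assumed sample data (the arrays below are illustrative, not the original values from the example):
import scipy.interpolate as myscpy
import numpy as mynp

# illustrative sample data (assumed values)
x = mynp.array([0, 1, 2, 3, 4, 5])
y = mynp.array([0, 1, 4, 9, 16, 25])

# first-order (k=1) spline fit, evaluated at new points
myspline = myscpy.UnivariateSpline(x, y, k=1, s=0)
mynew_x = mynp.linspace(0, 5, 11)
mynew_y = myspline(mynew_x)
print(mynew_y)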
splrep function
Till now we have seen that the two necessary steps for spline interpolation
are:
1. Initially, a spline representation of the curve is computed.
2. Subsequently, the spline is assessed at the desired points.
Using the function splrep(), the direct spline approach is used to determine the B-spline curve representation within a two-dimensional plane. With the input keyword k, the default spline order of cubic can be modified. We can parametrically define a curve in N-dimensional space using the splprep() function. The amount of smoothing to apply during the spline fit is specified using the keyword argument s. The splrep function takes a set of data points as input and represents the B-spline as a tuple (tck). The knot sequence, the B-spline coefficients, and the splineʼs degree are the three elements of this tuple. Refer to the following syntax for a better understanding:
scipy.interpolate.splrep(x, y, w=None, xb=None,
xe=None, k=3, task=0, s=None, t=None,
full_output=0, per=0, quiet=1)
splev function
In the scipy.interpolate module, the splev function is used to evaluate a B-spline or its derivatives at specified points. It receives the input of the B-
spline representation (tck tuple) from splrep and returns the corresponding
values. The evaluated values of the B-spline at the specified points are
returned in an array by the splev function. Refer to the following syntax for
a better understanding:
scipy.interpolate.splev(x, tck, der=0, ext=0)
Let us view the following code using these two functions for k = 1. The code (program name: Chap4_Example4.37.py) for understanding the splrep and splev functions is demonstrated as follows:
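The listing for Chap4_Example4.37.py does not appear in full above, so the following is a minimal sketch of the splrep/splev pattern with k = 1, using assumed sample data:
import scipy.interpolate as myscpy
import numpy as mynp

# illustrative sample data (assumed values)
x = mynp.array([0, 1, 2, 3, 4, 5])
y = mynp.array([0, 1, 8, 27, 64, 125])

# step 1: compute the B-spline representation (tck tuple) with k = 1
mytck = myscpy.splrep(x, y, k=1)

# step 2: evaluate the spline at the desired points
mynew_x = mynp.linspace(0, 5, 11)
mynew_y = myscpy.splev(mynew_x, mytck)
print(mynew_y)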
# Adding legend
myplt.legend()
mynew_x = 3.5
myestimated_y = myrbf(mynew_x)
print(myestimated_y)
Output:
12.17057812839015
Conclusion
In this chapter, we initially explained the difference between numpy and scipy. We explored the scipy constants, which represent physical quantities, and mathematical, scientific, and other useful values. The chapter also delved into optimizers in scipy, discussing various optimization algorithms with practical examples such as finding roots of equations. The concept of sparse data and sparse matrix representation in scipy was thoroughly examined, highlighting their efficiency in storing and manipulating large, mostly empty matrices. We further explored graphs in scipy, which offers functions for effective analysis and manipulation of sparse graphs. Spatial data analysis using scipy, which enables working with geographic and geometric data, was also explored. Integration techniques were covered, providing numerical integration methods for functions, including single and multiple integration examples. Lastly, interpolation methods in scipy were demonstrated with examples, allowing estimation of values between known data points, including 1D interpolation, spline interpolation, and interpolation with radial basis functions.
In the next chapter, we shall discuss libraries such as matplotlib and plotly. Matplotlib is a commonly used Python charting library that offers a range of functions and classes for constructing different types of plots, such as line plots, bar plots, scatter plots, histograms, and so on. Plotly is a powerful, interactive plotting library that supports multiple programming languages, including Python.
Points to remember
Scipy is built on top of numpy, is mostly written in Python, and provides functionalities for numerical integration, statistics, signal processing, and so on.
Scipy provides a wide range of mathematical and scientific constants,
such as pi, and e, and physical constants like the speed of light. These
constants can be accessed through the scipy.constants module.
Scipy offers various optimization algorithms to find the minimum or
maximum of a function. The optimization module provides functions
to perform unconstrained and constrained optimization, including
methods like BFGS, Newton-CG, and SLSQP.
Scipy supports sparse data structures and provides efficient methods
for working with sparse matrices. Sparse matrices are useful when
dealing with large datasets that have mostly zero values, as they save
memory and computational resources.
Scipyʼs csgraph module provides functionality for working with graphs and performing graph-related operations. It includes algorithms for finding shortest paths, connected components, and many more.
Scipy offers tools for working with spatial data, such as distance
calculations, convex hulls, and spatial transformations. The spatial
module can be used for tasks like spatial interpolation and nearest
neighbor search.
The integrate module in scipy provides functions for the numerical integration of functions (single and multiple integration) and for solving Ordinary Differential Equations (ODEs).
Scipyʼs interpolation module offers various interpolation techniques,
such as linear, polynomial, and spline interpolation. These methods
can be used to estimate values between known data points or to create
smooth curves from discrete data.
Questions
1. What is scipy? Who developed it?
2. Explain the difference between numpy and scipy.
3. Explain scipy constants in the following categories with an example:
a. Metric
b. Force
c. Binary
d. Angle
e. Mass
f. Time
g. Pressure
h. Length
i. Volume
j. Speed
k. Temperature
l. Energy
m. Power
n. Area
4. Write a Python code to find the roots of the equation 2*x + cos(x)
using scipy optimizers.
5. Explain different types of sparse matrix in scipy with examples.
6. Explain important methods to work with adjacency matrices with
examples.
7. Explain the triangulation function in scipy. Also, explain the
Delaunay method, and ConvexHull method with an example.
8. Explain simple and multiple integration in scipy with an example.
9. Explain scipy 1D, spline, and radial basis interpolation functions
with an example.
CHAPTER 5
Line Plot exploration with
Matplotlib Library
Introduction
The representation of data in a graphical or visual format is known as data
visualization. Data visualization aims to perform analysis of complicated
data sets. Patterns, trends, and relationships can be easily seen and expressed
when data is presented visually. Charts, graphs, tables, maps, and other types
of data visualization are only a few examples. Each kind of visualization works best with particular kinds of information or ideas. For instance, scatterplots can
be used to depict the relationship between two variables, and line charts are
frequently used to show patterns over time.
In a variety of disciplines, including business, finance, science, and social
sciences, data visualization is a crucial tool. It lets decision-makers swiftly
grasp and analyze data, which can lead to more informed judgments. Data
visualization is a vital tool for researchers and educators because it can make
complicated ideas understandable to a wide audience.
Structure
In this chapter, we will discuss the following topics:
Python data visualization tools
Line plot creation by passing 2 ndarrays
Adding title, xlabel and ylabel to the line plot
Advanced line plot
Linestyle property
Color property
Default color
Quick settings for color, marker, and linestyle
alpha property
linewidth and markersize property
markerfacecolor property
Customizing the figure size
Plotting multiple lines in a same plot
Line plot creation by passing a single ndarray
Addition of grid lines to a plot
Major and minor grid lines display
Legends display to a plot
Using xlim and ylim functions to a plot
Objectives
By the end of this chapter, the reader will get to know about Pythonʼs data
visualization tools, covering a wide array of techniques and concepts to
enhance our ability to represent and analyze data effectively. First, we will
be learning about line plots, where we will explore creating them by passing
two ndarrays, adding essential elements like titles, x-labels, and y-labels, and
advancing our skills with properties like linestyle, color, alpha, linewidth,
markersize, markerfacecolor, and figure size customization. We will also
explore how to plot multiple lines in a single plot, creating line plots with
single ndarrays, incorporating grid lines (both major and minor), and, in the end, effectively utilizing legends and the xlim and ylim functions in a line plot.
Note: In this chapter, we are displaying the line plot graphs in colors,
but the hard copy of the book will be printed in black color. So, do not
get confused. It is requested that all the readers view the Image bundle
simultaneously for a better understanding of the code and concepts.
So, matplotlib is a library, and pyplot is a module within it. With the help of the pyplot module, we can call functions such as plot, bar, pie, hist, and scatter to create line plots, bar charts, pie charts, histograms, and scatter plots.
Line plot
A cornerstone of data visualization, the line plot in Matplotlib efficiently
connects data points with straight lines, facilitating the visualization of
trends over time and relationships between variables. Its significance
transcends mere visualization, extending to data exploration, communication
of insights, dataset comparison, model evaluation, and forecasting.
Renowned for its simplicity and clarity, the line plot stands as a pivotal
element in data visualization, empowering users to extract valuable insights
and make well-informed decisions.
Figure 5.1 is a distinct window on the screen that Matplotlib uses to display
the graphs, serving as the output container for graphics.
The code is using the matplotlib library to plot a graph. It first imports the
necessary libraries, matplotlib.pyplot and numpy. Then, it creates an array
of values for the x-axis using the arange() function from numpy, with values
ranging from one to ten. It also creates an array for the y-axis by taking the
cube of each value in the x-axis array. Next, it uses the plot function from
matplotlib.pyplot to create a line plot of the x-axis values against the y-axis
values. Comments (1,1),(2,8), and (3,27) indicate the corresponding
coordinates of the first few points on the plot. Finally, it uses the show
function from matplotlib.pyplot to display the plot.
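The listing described above is not reproduced here; a minimal sketch that follows the description (x values from one to ten and y values as their cubes) might look as follows:
import matplotlib.pyplot as myplt
import numpy as mynp

x_axis = mynp.arange(1, 11)   # values 1 to 10
y_axis = x_axis**3            # (1,1), (2,8), (3,27), ...
myplt.plot(x_axis, y_axis)
myplt.show()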
Figure 5.2: Addition of title, xlabel and ylabel properties to the line plot
S No. Character Description
1 . point marker
2 , pixel marker
3 o circle marker
4 v triangle_down marker
5 ^ triangle_up marker
6 < triangle_left marker
7 > triangle_right marker
8 1 tri_down marker
9 2 tri_up marker
10 3 tri_left marker
11 4 tri_right marker
12 s square marker
13 p pentagon marker
14 * star marker
15 h hexagon1 marker
16 H hexagon2 marker
17 + plus marker
18 x x marker
19 D diamond marker
20 d thin_diamond marker
21 | vline marker
22 _ hline marker
Linestyle property
The linestyle property is specified as solid, dotted, or dashed. The linestyle keyword argument is used as follows:
myplt.plot(x_axis,y_axis,marker='o',linestyle='--')
The details against each linestyle character can be seen in the following Table 5.2:
S No. Character Description
1 - solid line style
2 -- dashed line style
3 -. dash-dot line style
4 : dotted line style
Color property
By utilizing the color keyword parameter, we can add colors to our plot and define the color we want for the line plot; we can use any named color or even a hex code.
The details against each color character can be seen in the following Table 5.3:
S No. Character Description
1 b blue
2 g green
3 r red
4 c cyan
5 m magenta
6 y yellow
7 k black
8 w white
Default color
The default color will be selected from the style cycle if we do not specify a color. We can write the following statement to get the default colors:
myplt.rcParams['axes.prop_cycle'].by_key()
# code for checking the default color
import matplotlib.pyplot as myplt
print(myplt.rcParams['axes.prop_cycle'].by_key())
Output:
{'color': ['#1f77b4', '#ff7f0e', '#2ca02c',
'#d62728', '#9467bd', '#8c564b', '#e377c2',
'#7f7f7f', '#bcbd22', '#17becf']}
The first default color is blue, the second is orange, the third is green, and
the fourth color is red. Observe the following code for a better
understanding:
# Line plot creation by passing 2 ndarrays with default color
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
myplt.plot(x_axis,x_axis) # blue
myplt.plot(x_axis,x_axis*2) # orange
myplt.plot(x_axis,x_axis*4) # green
myplt.plot(x_axis,x_axis*8) # red
myplt.title('Default color Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Display of various calculations-------')
myplt.show()
The output is shown in Figure 5.6:
Figure 5.6: Line plot creation using default color
mlc form
MLC encompasses marker, linestyle, and color, defining the visual
characteristics of lines or points in Matplotlib plots. These attributes enable
users to personalize markers, line styles, and colors, facilitating the creation
of customized and insightful visualizations. Refer to the following code for a
better understanding:
# mlc form
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:g')# mlc form
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.7:
Figure 5.7: Line plot creation using mlc form
clm form
CLM stands for color, linestyle, and marker, representing essential visual
characteristics adjustable in Matplotlib for customizing plot appearance.
These attributes enable users to define colors, line styles, and markers,
facilitating the creation of tailored visualizations. Refer to the following
code for a better understanding:
# clm form
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'g:o')# clm form
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.8:
Figure 5.8: Line plot creation using clm form
Alpha property
The graph plot transparency is adjusted using the alpha attribute, which denotes the opacity or transparency of the color. The value lies between 0.0 and 1.0. If alpha < 1, the graph plot becomes more transparent. Let us view the code
of line plot creation using mlc form with the alpha property as demonstrated
in program name: Chap5_Example5.11.py as follows:
# mlc form with alpha property
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:', alpha = 0.2)  # mlc form with alpha = 0.2
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.10:
Figure 5.10: Line plot creation using mlc form with alpha property
Markerfacecolor property
This property will set the face color of markers in plots. Let us view the code
of line plot creation using mfc property as demonstrated in code program
name: Chap5_Example5.13.py as follows:
# mfc property
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:g', lw=5, ms = 10, mfc='yellow')
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.12:
Figure 5.12: Line plot creation using mfc property
It creates an array of values for the x-axis using the numpy arange()
function, with values ranging from one to ten.
It creates three arrays for the y-axis, y_axis1, y_axis2, and y_axis3.
These arrays are generated by performing mathematical operations on
the x_axis array.
The code uses the plot() function from matplotlib.pyplot to plot the
three lines on the same plot. The first plot() call plots y_axis1 against
x_axis, using a red colored line with circle markers.
The second plot() call plots y_axis2 against x_axis, using a green
colored line with circle markers.
The third plot() call plots y_axis3 against x_axis, using a blue
colored line with circle markers. We can also plot three different lines
in a single plot function. Just uncomment it and observe the same
output.
The code sets a title for the plot using the title() function.
It sets labels for the x-axis and y-axis using the xlabel() and ylabel()
functions, respectively.
Finally, the show() function is called to display the plot.
Note: If we apply the linewidth (lw) property in the plot function, it will
be applicable for all three lines.
If we look at Case 1 in the following summary, for the code myplt.grid(), the grid lines will be displayed in the line plot. The cases are summarized as follows:
Case 1: myplt.grid() — grid lines are displayed in the line plot.
Case 2: myplt.grid() followed by a second myplt.grid() — grid lines are not displayed (the second call toggles them off).
Case 3: myplt.grid() followed by myplt.grid(color='y') — grid lines are displayed (since keyword arguments are provided).
Case 4: myplt.grid(visible=True) — grid lines are displayed.
Case 5: myplt.grid(visible=False) — grid lines are not displayed.
Now let us understand the code and how the grid lines are demonstrated:
The code uses the matplotlib library to create a figure with subplots
and display different cases of grid line visibility on each subplot.
The code imports the matplotlib.pyplot module as myplt and the numpy module as mynp. It creates an array called myarr1 with values [20, 40, 60, 80, 100].
It creates a figure with two rows and two columns of subplots using
the subplots() function and assigns the figure object to myfig and the
axes objects to myaxs.
The first subplot (0-0) is plotted with the values of myarr1 using red
circles connected by a solid line. The title is set to Subplot 0-0 and the
grid lines are displayed.
The second subplot (0-1) is plotted and styled the same way as the
first subplot. The title is set to Subplot 0-1 and the grid lines are not
displayed here.
The third subplot (1-0) is plotted and styled the same way as the first
subplot. The title is set to Subplot 1-0 and the grid lines are displayed.
The color of the grid lines is set to yellow.
The fourth subplot (1-1) is plotted and styled the same way as the first
subplot. The title is set to Subplot 1-1 and the grid lines are displayed.
The spacing between the subplots is adjusted to avoid overlapping
elements using the tight_layout() function.
The figure with the subplots is displayed using the show() function.
Figure 5.18: Line plot creation using major and minor grid lines
Now, let us understand the code and how the major and minor grid lines are
plotted:
The code imports the matplotlib.pyplot module as myplt and the
numpy module as mynp.
It then creates a numpy array myarr1 with values [20, 40, 60, 80, 100].
The code then plots myarr1 using the plot() function from myplt.
The plot is a line plot with markers represented by green dots
connected by a solid line.
The line width is set to five and the marker size is set to five.
The marker face color is set to magenta.
The code then adds a grid to the plot using the grid() function from
myplt.
The grid lines are colored red and have a line width of two.
The code enables minor ticks on the plot using the minorticks_on()
function from myplt.
The code then adds a minor grid to the plot using the grid() function
again.
The minor grid lines are colored yellow.
Finally, the code displays the plot using the show() function from
myplt.
Figure 5.19: Line plot creation with display of legends using different approaches
Note: The preceding code is covered in (program name: Chap5_Example5.21.py).
The code uses the matplotlib library to create a figure with three subplots. In
the first subplot, it plots the square, sine, and cosine functions of an array of
values. It adds a legend to this subplot. In the second subplot, it plots the
same functions without a legend. The third subplot adds a legend using
handles and labels for the functions. Finally, it displays the figure.
The legend location can also be decided in the plot by using loc argument.
The code used for different locations is displayed as follows:
S No. String location Code used
1 'best' 0
2 'upper right' 1
3 'upper left' 2
4 'lower left' 3
5 'lower right' 4
6 'right' 5
7 'center left' 6
8 'center right' 7
9 'lower center' 8
10 'upper center' 9
myleft,myright = myplt.xlim()
mybottom, mytop = myplt.ylim()
print('The Left limit value on the x-axis:',myleft)
print('The Right limit value on the x-axis:',myright)
print('The Bottom limit value on the y-axis:',mybottom)
print('The Top limit value on the y-axis:',mytop)
Output:
The Left limit value on the x-axis: -0.3141592653589793
Let us see an example of a change of values of left, right, top, and bottom
using xlim and ylim functions:
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,16)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:r', alpha = 0.6)  # mlc form with alpha = 0.6
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.grid()
myleft,myright = myplt.xlim()
mybottom, mytop = myplt.ylim()
print('The default Left limit value on the x-axis:',myleft)
print('The default Right limit value on the x-axis:',myright)
print('The default Bottom limit value on the y-axis:',mybottom)
print('The default Top limit value on the y-axis:',mytop)
# setting the new limit values
myplt.xlim(left = 0.7, right = 10)
myplt.ylim(bottom = 20, top = 3000)
myplt.show()
The output is shown in Figure 5.21:
Figure 5.21: Line plot creation using xlim and ylim functions
myplt.plot(x_axis,y_axis)
myplt.grid()
myplt.xscale('linear')
myplt.yscale('log')
The provided code uses the matplotlib library in Python to create a plot of
the cubic function. It generates an array of x-values from one to 20 using
numpyʼs arange function and calculates the corresponding y-values by
raising each x-value to the power of three. The code then creates a figure
object, an axis object, and plots the graph using the plot function. It sets the
properties of the axes, such as labels and title, and enables the grid. Finally,
it displays the plot.
Conclusion
In this chapter, we have learned about a wide spectrum of Python data
visualization tools and techniques that empower us to bring data to life. We
began our journey with a comprehensive understanding of line plot creation
using numpy. Then, after exploring the basics of creating line plots by passing two ndarrays, the chapter progressed to advanced techniques such as
customizing line styles, colors, and markers. Readers will learn to add
essential elements to their plots, including titles, axis labels, and legends,
enhancing the interpretability of their visualizations. The chapter explores
properties like alpha, linewidth, markersize, and markerfacecolor for finer
control over plot appearance. The inclusion of techniques for customizing
figure size, plotting multiple lines in the same plot, and incorporating grid
lines adds depth to the readerʼs repertoire. Additionally, the usage of the xlim and ylim functions allows us to focus on specific data ranges. The chapter
culminates in a practical overview of shortcuts for setting color, marker, and
linestyle properties efficiently. Overall, this chapter equips beginners with
the foundational knowledge needed to create and customize line plots for
data visualization.
In the next chapter, we shall learn about bar plots, pie charts, histograms, scatter plots, and subplots in matplotlib, which are essential for effective data visualization across different types of data and analysis tasks. Bar plots are
useful for comparing categorical data, pie charts for displaying proportions
or percentages, histograms for visualizing distributions of continuous
variables, scatter plots for exploring relationships between two continuous
variables and subplots for organizing multiple plots within a single figure.
Points to remember
Line plots visualize trends in continuous data, and the user can create
them using matplotlib.
Pass data as two ndarrays or single ndarrays to create line plots.
Customize line plots by adding titles, x-labels, and y-labels using
functions like myplt.title(), myplt.xlabel(), and myplt.ylabel().
Advanced line plots involve setting properties like linestyle, color,
and alpha to control line appearance.
You can change the default line color using the color parameter.
Set color, marker, and linestyle using shortcut codes.
Adjust line thickness with linewidth and marker size with markersize.
Control marker face color using markerfacecolor.
Customize the figure size using myplt.figure().
Plot multiple lines in the same plot using multiple myplt.plot()
commands.
Add grid lines with myplt.grid() and specify major and minor grid
lines with myplt.grid(which='both').
Display legends for labeled lines with myplt.legend().
Use myplt.xlim() and myplt.ylim() to set specific axis limits.
Questions
1. What are some popular Python data visualization tools?
2. How can you create line plots in Python?
3. What properties can be customized in an advanced line plot?
4. How do you add a title, xlabel, and ylabel to a line plot?
5. What is the purpose of the linestyle property in a line plot?
6. How can you change the color of lines in a line plot?
7. What is the default color for lines in a Python line plot?
8. How can you set color, marker, and linestyle in a shortcut way for a
line plot?
9. What does the alpha property control in a line plot?
10. How do you customize the figure size in a line plot?
Introduction
In the previous chapter, we have witnessed line plots. Now, in this chapter
on data visualization by using matplotlib, various types of plots will be
explored, starting with the vertical bar chart or simple bar chart. The
discussion will include customization options such as changing the color,
width, bottom position, and alignment of each bar, along with insights into
adding labels. The chapter also explores horizontal, stacked, and grouped
bar charts. The exploration will extend to pie charts for representing
proportions, histograms for visualizing data distribution, and scatter plots
for depicting relationships between variables. Subplots in matplotlib offer a
valuable tool for creating comprehensive visualizations with multiple plots
in a single figure, enhancing the clarity and coherence of the presented
information. Overall, the diverse set of topics fulfills the varied
requirements of data analysts and scientists in effectively communicating
and interpreting data. Each plot will provide a comprehensive overview,
ensuring a thorough understanding of the capabilities and customization
options available for effective data representation and analysis.
Structure
In this chapter, we will discuss the following topics:
Barplot
Vertical bar chart and simple bar chart
Horizontal bar chart
Stacked bar chart
Pie chart
Histogram
Scatter plot
Subplot in matplotlib
Objectives
By the end of this chapter, the readers will know about Bar plots where
various aspects such as changing the color, width, bottom position, left
alignment, and right alignment of individual bars as well as adding labels to
bars will be explained. We will also venture into horizontal, stacked (both
vertical and horizontal), and grouped bar charts, providing us with a
comprehensive understanding of bar chart variations. Additionally, we will
master plot creation, i.e., pie charts, histograms, scatter plots, and subplots in
matplotlib, equipping the reader with a versatile toolkit for data
visualization and analysis in Python.
Barplot
An example of a visualization for categorical data is a bar plot, commonly
referred to as a bar chart. Rectangular bars are used to represent the data,
and each barʼs length or height reflects the value it stands for. The
myplt.bar() function of the well-known Python data visualization toolkit
matplotlib can be used to produce a bar plot. The parameters required to
define the positions and heights of the bars are passed into this function. So,
in a barplot, the data is represented in the form of bars, unlike the line plot
where the data points are represented by markers that are connected by a
line.
In order to create a bar plot by using matplotlib, follow the steps given as
follows:
1. Import: The required libraries need to be imported.
2. Prepare the data: We need to have the data that we want to
visualize in the form of lists or arrays.
3. Create the bar plot: For creating the bar plot, use the myplt.bar()
function. The x-coordinates (positions) of the bars and the
corresponding heights are the parameters for this function. By
specifying more parameters, the plotʼs look can also be customized.
4. Customize the plot (optional): We can add labels and titles, change colors, and modify other aspects of the plot by using various functions provided by matplotlib.
5. Save or display the plot: Finally, use the myplt.savefig() to save it
to a file or myplt.show() function to display the plot.
Let us discuss the different types of bar charts with examples.
The code uses the matplotlib library to create a bar chart that represents the
number of centuries scored by different cricketers. We shall now explain the
code line by line:
The code imports the matplotlib.pyplot module as myplt.
The my_cricketers list contains the names of the cricketers, which
will be displayed on the x-axis of the bar chart.
The my_centuries list contains the number of centuries scored by
each cricketer, which will be represented by the height of the bars on
the y-axis.
The myplt.bar() function is called with the my_cricketers and
my_centuries lists as arguments to create the bar chart.
The myplt.xlabel() function is used to set the label for the x-axis,
with the color set to blue and the font size set to 15.
The myplt.ylabel() function is used to set the label for the y-axis,
with the color set to green and the font size set to 15.
The myplt.title() function is used to set the title of the bar chart,
with the color set to red and the font size set to 15.
The myplt.show() function is called to display the bar chart.
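The listing being walked through above is not reproduced here; a minimal sketch matching the description is given below, with illustrative cricketer names and century counts (the data values are assumptions):
import matplotlib.pyplot as myplt

# illustrative data (assumed values)
my_cricketers = ['Tendulkar', 'Kohli', 'Ponting', 'Sangakkara', 'Kallis']
my_centuries = [100, 80, 71, 63, 62]

myplt.bar(my_cricketers, my_centuries)
myplt.xlabel('Cricketers', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Centuries Scored by Cricketers', color='r', fontsize=15)
myplt.show()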
We can customize several elements in the bar plot which shall be explained
with some examples:
In the preceding code, we will be focusing on the two lines with the
explanation as follows:
mycolorlist = ['r','b','g','orange','k']: This line
creates a list called mycolorlist which contains the colors to be used
for the bars in the plot. The colors are represented by their respective
abbreviations: r for red, b for blue, g for green, orange for orange, and
k for black.
The following code snippet uses the xticks function from the
matplotlib.pyplot library to set the x-axis tick labels for a plot. The
rotation=30 parameter is used to rotate the tick labels by 30 degrees.
myplt.xticks(my_cricketers, rotation=30)
The tight_layout function is then called to automatically adjust the subplot
parameters to fit the plot figure area:
myplt.tight_layout()
The grid function is used to display grid lines on the y-axis of the plot, as
shown below:
myplt.grid(axis='y')
We shall discuss the two lines discussed in our present code. The first line
of code is the following:
myplt.barh(my_cricketers, my_centuries, color=mycolorlist)
This line of code creates a horizontal bar chart by using Matplotlibʼs barh function, where my_cricketers provides the labels on the y-axis, my_centuries determines the lengths of the bars, and mycolorlist determines the colors of the bars.
Another line of code where we need to concentrate is the following:
myplt.xticks(rotation=30)
Here, xticks is a function used to customize the tick positions and labels
on the x-axis.
rotation=30 is an argument passed to xticks, which specifies the
rotation angle (in degrees) for the x-axis tick labels. In this case, it rotates
the x-axis tick labels by 30 degrees.
This line of code rotates the x-axis tick labels by 30 degrees to make them more readable if they are long or overlap with each other. It is often used when the tick labels are strings, as in our case, where the cricketersʼ names are being displayed.
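Putting the barh and xticks snippets discussed above together, a minimal sketch of a horizontal bar chart might look as follows (the data arrays are the same assumed values used earlier):
import matplotlib.pyplot as myplt

# illustrative data (assumed values)
my_cricketers = ['Tendulkar', 'Kohli', 'Ponting', 'Sangakkara', 'Kallis']
my_centuries = [100, 80, 71, 63, 62]
mycolorlist = ['r', 'b', 'g', 'orange', 'k']

# horizontal bars: cricketers on the y-axis, centuries as bar lengths
myplt.barh(my_cricketers, my_centuries, color=mycolorlist)
myplt.xticks(rotation=30)
myplt.tight_layout()
myplt.show()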
The code we have provided uses the matplotlib library in Python to create
a bar chart. Here is a breakdown of what each line of code does:
import matplotlib.pyplot as myplt: This line imports the pyplot
module from the matplotlib library and assigns it with the alias
myplt.
myallround_cricketers: This is a list of strings that represents the
names of the cricketers.
my_runs: This is a list of integers that represents the number of runs
scored by each cricketer.
my_wickets: This is a list of integers that represents the number of
wickets taken by each cricketer.
myplt.bar(myallround_cricketers, my_runs, color='r',
label='Runs'): This line creates a bar chart with the names of the
cricketers on the x-axis, the number of runs on the y-axis, and assigns
the color red to the bars. It also adds the label Runs to the legend.
myplt.bar(myallround_cricketers, my_wickets, bottom=my_runs,
color='g', label='Wickets'): This line creates another set of bars
on top of the previous bars, representing the number of wickets. The
bottom parameter specifies that the bars should start from the height
of the previous bars. The green color is assigned to these bars, and the
label Wickets is added to the legend.
myplt.xticks(myallround_cricketers, rotation=90): This line sets
the x-axis tick labels to be the cricketersʼ names, with a rotation of 90
degrees to prevent overlapping.
for i in range(len(myallround_cricketers)): This line sets up a
loop that iterates over the indices of the myallround_cricketers list.
The len(myallround_cricketers) function returns the length of the
list, and range() generates a sequence of numbers from zero to
len(myallround_cricketers)-1.
myplt.text(myallround_cricketers[i], (my_runs[i]/2),
str(my_runs[i]), ha='center', color='black', weight=1000): This
line adds a text label to the bar chart. The myplt.text() function is
used to add the text label. The first parameter is the x-coordinate
(cricketerʼs name), the second parameter is the y-coordinate (half of
the number of runs), the third parameter is the text to be displayed
(number of runs), and the remaining parameters specify the
formatting options such as horizontal alignment(ha='center'), text
color(color='black'), and font-weight(weight=1000).
myplt.text(myallround_cricketers[i],
(my_runs[i]+my_wickets[i]/2), str(my_wickets[i]), ha='center',
color='black', weight=1000): This line adds another text label to the
bar chart. Similar to the previous line, it specifies the cricketerʼs
name as the x-coordinate, the y-coordinate as the sum of the number
of runs and half of the number of wickets, the text to be displayed as
the number of wickets, and the same formatting options.
myplt.legend(): This line adds a legend to the plot.
myplt.tight_layout(): This line adjusts the layout of the plot to
avoid overlapping elements.
myplt.show(): This line displays the plot on the screen.
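Assembling the lines walked through above, a minimal sketch of the vertical stacked bar chart might look as follows (the player names and run/wicket counts are illustrative assumptions):
import matplotlib.pyplot as myplt

# illustrative data for all-round cricketers (assumed values)
myallround_cricketers = ['Kallis', 'Jadeja', 'Shakib']
my_runs = [25534, 11000, 14000]
my_wickets = [577, 550, 690]

myplt.bar(myallround_cricketers, my_runs, color='r', label='Runs')
myplt.bar(myallround_cricketers, my_wickets, bottom=my_runs, color='g', label='Wickets')
myplt.xticks(rotation=90)

for i in range(len(myallround_cricketers)):
    # label for the runs segment (placed at half its height)
    myplt.text(myallround_cricketers[i], my_runs[i]/2, str(my_runs[i]),
               ha='center', color='black', weight=1000)
    # label for the wickets segment (placed above the runs segment)
    myplt.text(myallround_cricketers[i], my_runs[i] + my_wickets[i]/2, str(my_wickets[i]),
               ha='center', color='black', weight=1000)

myplt.legend()
myplt.tight_layout()
myplt.show()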
An example of a horizontal stacked bar chart is demonstrated in the code
program name: Chap6_Example6.11.py as follows:
import matplotlib.pyplot as myplt
fig, ax = myplt.subplots()
# Plot the runs
ax.barh(myallround_cricketers, my_runs, color='r', label='Runs')
ax.set_xlabel('Performance')
ax.set_title('Performance of All-Round Cricketers')
ax.set_xticks(range(0, 30001, 5000))
ax.invert_yaxis()  # Invert the y-axis to display the top performer at the top
ax.legend()
for i in range(len(myallround_cricketers)):
    ax.text(my_runs[i] + my_wickets[i] + 100, i, str(my_wickets[i]), ha='center', va='center', color='black', weight='bold')
    ax.text(my_runs[i] / 2, i, str(my_runs[i]), ha='center', va='center', color='black', weight='bold')
myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.11:
Figure 6.11: Output of Chap6_Example6.11.py
for i in range(len(myallround_cricketers)):
    myplt.text(myxpos[i], my_runs[i] + 10, my_runs[i], ha='center', color='black', weight=1000)
    myplt.text(myxpos[i]+mywidth, my_wickets[i] + 5, my_wickets[i], ha='center', color='black', weight=1000)
myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.12:
Figure 6.12: Output of Chap6_Example6.12.py
Pie chart
A pie chart is a circular statistical graphic divided into slices to represent
data proportions. The circular chart is divided into segments called wedges, each of which represents an individual category. The area of a wedge is proportional to the category value. In the context of matplotlib, the user can create pie charts by using the myplt.pie() function. A pie chart is very useful for comparing categories. Refer to the following syntax and the code demonstrated in program name: Chap6_Example6.13.py for a better understanding:
pie(x, explode=None, labels=None, colors=None,
autopct=None, pctdistance=0.6, shadow=False,
labeldistance=1.1, startangle=0, radius=1,
counterclock=True, wedgeprops=None,
textprops=None, center=(0, 0), frame=False,
rotatelabels=False, *, normalize=True, hatch=None,
data=None)
# addition of legend
myplt.legend(title='Languages Popularity Pie chart')
myplt.tight_layout()
myplt.show()
Note: In the above code, comments are written which are self-explanatory about what is happening in the entire code.
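Since only the closing lines of the pie-chart listing appear above, the following is a minimal sketch of how such a chart might be produced (the language names and popularity values are illustrative assumptions):
import matplotlib.pyplot as myplt

# illustrative data (assumed values)
mylanguages = ['Python', 'Java', 'C++', 'JavaScript']
mypopularity = [45, 25, 15, 15]

# autopct prints the percentage on each wedge; explode pulls out the first wedge
mywedges, mytexts, myautotexts = myplt.pie(mypopularity, labels=mylanguages,
                                           autopct='%1.1f%%',
                                           explode=[0.1, 0, 0, 0], startangle=90)
myplt.legend(mywedges, mylanguages, title='Languages Popularity Pie chart')
myplt.tight_layout()
myplt.show()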
Histogram
Histograms in matplotlib serve as graphical representations that depict the
distribution of a dataset. They accomplish this by breaking the data into
bins and displaying the frequency or count of observations within each bin.
The significance of histograms stems from their capacity to unveil the
inherent data distribution. They enable an understanding of the shape,
central tendency, and dispersion of the dataset. Moreover, histograms prove
invaluable for the detection of data patterns, outliers, and skewness. They
find frequent application in exploratory data analysis, assisting in data-
driven decision-making.
Furthermore, histograms enjoy widespread usage across fields such as
statistics, data science, and data visualization. They offer a rapid and
intuitive means of scrutinizing and comprehending data, rendering them
indispensable tools for researchers, analysts, and data enthusiasts.
The number of observations in the given interval is termed frequency
distribution. Histograms prove to be a valuable asset when dealing with
extensive datasets. To construct histograms, it is essential to partition the
entire set of input values into equitably sized groups or bins. Each bin
corresponds to a bar in the histogram, and the height of these bars is directly
proportional to the count of values falling within the respective bin or
interval. To generate a histogram, the hist() function can be employed
effectively. Refer to the following syntax for a better understanding:
hist(x, bins=None, range=None, density=False,
weights=None, cumulative=False, bottom=None,
histtype='bar', align='mid',
orientation='vertical', rwidth=None, log=False,
color=None, label=None, stacked=False, *,
data=None, **kwargs)
Now, we shall see a simple example of creating a histogram with the marks of students falling in the following ranges:
35 students got marks in the range zero to 34.
100 students got marks in the range 35 to 49.
45 students got marks in the range 50 to 59.
68 students got marks in the range 60 to 79.
27 students got marks in the range 80 to 100.
Refer to the following code program name: Chap6_Example6.14.py
for better clarity:
import matplotlib.pyplot as myplt
#print(help(myplt.hist))
# Data
# Two lists are created: mymarks_ranges contains the ranges of marks, and
# mynum_students contains the number of students falling within each range.
mymarks_ranges = ["0-34", "35-49", "50-59", "60-79", "80-100"]
mynum_students = [35, 100, 45, 68, 27]
# Create a histogram
# The hist() function is called to create the histogram. It takes several arguments:
# - mymidpoints: the values to be plotted on the x-axis.
# - bins: the range of values for each bin/bar in the histogram.
# - weights: the weights or frequencies of each value.
# - edgecolor: the color of the edges of the bars in the histogram.
# - color: the color of the bars in the histogram.
myplt.xlabel('Marks Range')
myplt.ylabel('Number of Students')
myplt.title('Distribution of Marks for 300 Students')
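The hist() call itself does not appear in the excerpt above; under the comments shown, a minimal sketch of the missing portion might look like this (the bin edges and midpoints below are assumptions derived from the stated mark ranges):
# assumed bin edges and representative midpoints for the stated ranges
mybins = [0, 35, 50, 60, 80, 100]
mymidpoints = [17, 42, 55, 70, 90]

# one bar per bin, with the bar height taken from mynum_students via weights
myplt.hist(mymidpoints, bins=mybins, weights=mynum_students,
           edgecolor='black', color='skyblue')
myplt.show()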
Scatter plot
A scatter plot, a graphical representation, is employed in data visualization
to illustrate the correlation between two numeric variables. This plot utilizes
a cartesian coordinate framework to position points based on two variable
values. In Pythonʼs prominent data visualization library, matplotlib, scatter
plots can be constructed through the 'scatter()' function. This function
accepts x and y values as inputs and renders them as individual points on
the chart. Furthermore, customization options, such as specifying color,
size, and point shape, are available for enhancing the scatter plotʼs
appearance.
The comparison of line and scatter plots is mentioned as follows:
Both line and scatter plots serve the purpose of visualizing the association between two numeric variables. Terminology such as x-axis and y-axis data is the same in both.
Both types employ a cartesian coordinate system for visual
representation.
Matplotlib, a Python library, supports the creation of both line and
scatter plots.
The differences between line and scatter plots are mentioned as follows:
Line plots utilize straight lines to connect data points, whereas scatter
plots do not establish connections between points.
Line plots are commonly employed to depict trends or patterns in
data across time or another continuous variable, whereas scatter plots
are used to portray the distribution or correlation between two
variables.
Line plots are suitable for illustrating continuous data, while scatter
plots are better suited for displaying discrete or categorical data.
In summary, although both line and scatter plots are valuable tools for
visualizing relationships between variables, they differ in terms of point
connection and the types of data they are best suited for. Refer to the following syntax and code (program name: Chap6_Example6.15.py) for a better understanding:
scatter(x, y, s=None, c=None, marker=None,
cmap=None, norm=None, vmin=None, vmax=None,
alpha=None, linewidths=None, *, edgecolors=None,
plotnonfinite=False, data=None, **kwargs)
# scatter function
myscatter = myplt.scatter(my_xaxis_data, my_yaxis_data, c=mycolors, s=mysize)
# creating a list of Patch objects representing each marker
# and use it as an argument in the legend() function.
mylegend_patches = [mpatches.Patch(color=color, label=label) for color, label in zip(mycolors, mylabels)]
We shall discuss these lines here, as the rest of the code is self-explanatory due to the comments. The line mylegend_patches = [mpatches.Patch(color=color, label=label) for color, label in zip(mycolors, mylabels)] builds one Patch object per color/label pair so that each marker gets its own legend entry.
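Since only fragments of the scatter-plot listing appear above, the following is a minimal self-contained sketch of the same pattern (the data arrays and labels are illustrative assumptions):
import matplotlib.pyplot as myplt
import matplotlib.patches as mpatches

# illustrative data (assumed values)
my_xaxis_data = [1, 2, 3, 4, 5]
my_yaxis_data = [10, 24, 36, 18, 50]
mycolors = ['r', 'g', 'b', 'y', 'k']
mylabels = ['A', 'B', 'C', 'D', 'E']
mysize = [50, 100, 150, 200, 250]

myscatter = myplt.scatter(my_xaxis_data, my_yaxis_data, c=mycolors, s=mysize)

# one Patch per color/label pair so that each marker gets a legend entry
mylegend_patches = [mpatches.Patch(color=color, label=label)
                    for color, label in zip(mycolors, mylabels)]
myplt.legend(handles=mylegend_patches)
myplt.show()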
Subplot in matplotlib
The subplot function in matplotlib is used to create a grid of subplots within
a single figure. By specifying the number of rows and columns in the
subplot grid, the user can easily control the layout and arrangement of the
subplots. It allows the user to display multiple plots or charts in a single
window, organized in a grid-like structure. This function is particularly
useful when the user wants to compare different datasets or visualize
different aspects of the data side by side.
First, we shall look into the subplot function:
Refer to the following syntax and code (program name: Chap6_Example6.16.py) for a better understanding:
subplot(*args, **kwargs)
myax2 = myplt.subplot(2,2,2)
myax2.plot(myxaxis_data,my_y2,color='b',marker='o')
myax2.set(xlabel='Xaxis data',ylabel='Yaxis data',title='Double of Xaxis')
myax3 = myplt.subplot(2,2,3)
myax3.plot(myxaxis_data,my_y3,color='y',marker='o')
myax3.set(xlabel='Xaxis data',ylabel='Yaxis data',title='Square function')
myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.16:
Figure 6.16: Output of Chap6_Example6.16.py
We are calling the subplot() function separately for every subplot. If a large number of subplots is required, then writing this subplot() call every time becomes tedious. To overcome this issue, we shall use the subplots() function, which returns all the axes objects at once. Refer to the following syntax:
subplots(nrows=1, ncols=1, *, sharex=False,
sharey=False, squeeze=True, width_ratios=None,
height_ratios=None, subplot_kw=None,
gridspec_kw=None, **fig_kw)
Just observe the following line of code:
myfig, myaxs = myplt.subplots(2,2)
Here, one figure object and an ndarray of four axes objects are returned. Now, we shall look into the code (program name: Chap6_Example6.17.py) for a better understanding:
import matplotlib.pyplot as myplt
import numpy as mynp
# print(help(myplt.subplots))
myxaxis_data = mynp.arange(1,11)
my_y1 = myxaxis_data
my_y2 = myxaxis_data *2
my_y3 = myxaxis_data **2
my_y4 = myxaxis_data **3
myfig,myax = myplt.subplots(2,2)
myax[0,0].plot(myxaxis_data,my_y1,color='r',marker='o')
myax[0,0].set(xlabel='Xaxis data',ylabel='Yaxis data',title='Same Value Linearity')
myax[0,1].plot(myxaxis_data,my_y2,color='b',marker='o')
myax[0,1].set(xlabel='Xaxis data',ylabel='Yaxis data',title='Double of Xaxis')
myax[1,0].plot(myxaxis_data,my_y3,color='y',marker='o')
myax[1,0].set(xlabel='Xaxis data',ylabel='Yaxis data',title='Square function')
myax[1,1].plot(myxaxis_data,my_y4,color='g',marker='o')
myax[1,1].set(xlabel='Xaxis data',ylabel='Yaxis data',title='Cubic function')
myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.17:
Figure 6.17: Output of Chap6_Example6.17.py
Conclusion
In this chapter, we have explored the versatility of bar plots, discovering
how to tweak colors, widths, and alignments for individual bars, along with
insights into creating horizontal and stacked bar charts. We also uncovered
the secrets of grouped bar charts, equipping us with an arsenal of bar chart
skills. Then, we explored pie charts, histograms, scatter plots, and subplots
in matplotlib, further enriching our data visualization repertoire. This
chapter has provided us with a robust foundation in Python data
visualization, enabling the reader to effectively communicate insights and
trends from data in various formats and styles.
In the next chapter, we shall learn about the pandas series in data science as
it is a fundamental data structure for handling one-dimensional labeled data,
which is common in many data analysis tasks. Pandas series offers powerful
functionalities for data manipulation, transformation, and analysis, making
it indispensable for tasks such as data cleaning, exploration, and pre-
processing. By mastering the pandas series, data scientists will gain the
ability to efficiently organize, access, and analyze data.
Points to remember
If we want to compare different categories of values, we can use a
vertical bar chart, created by using the bar() function. In addition, if
the labels are too long or there are multiple values to represent, we
should opt for a horizontal bar chart. This can be created by using the
barh() function.
Questions
1. How can you create a vertical bar chart (simple bar chart) in Python?
2. How do we add labels to bars in a bar chart?
3. What is a horizontal bar and stacked bar chart, and how do you
create them in Python?
4. Provide an example of a vertical stacked bar chart.
5. Provide an example of a horizontal stacked bar chart.
6. How do you create a grouped bar chart (clustered bar chart) in
Python?
7. How can we create a pie chart in Python?
8. What is a histogram, and how can you create one in Python?
9. How can we create a scatter plot in Python?
10. What is a subplot, and how can we use it in matplotlib for creating
multiple plots in one figure?
CHAPTER 7
Exploring Pandas Series for Data
Science in Python
Introduction
Pandas is a robust and popular Python package for data analysis and
manipulation. It offers simple-to-use data structures and data analysis
capabilities for working with structured data, including tabular, time series,
and heterogeneous data. Matplotlib, SciPy, and Scikit-learn are among the many additional data analysis libraries that Pandas is compatible with. Pandas is built on top of NumPy, which means the numpy library is required for Pandas to operate. Series and DataFrame are the two main data structures in Pandas. Similar to a column in a spreadsheet or a Structured Query Language (SQL) table, a series is a one-dimensional labeled array that can carry any data type (integers, floats, texts, and so on). A DataFrame is a two-dimensional labeled data structure whose columns can hold potentially different data types. The data analysis in pandas is based on numpy, whereas data visualization is based on matplotlib.
Python code conventionally imports the pandas library with an alias. In this chapter, we use the statement import pandas as mypd and refer to the library as mypd while writing the code.
Also, we include commented-out help() calls in some parts of the code snippets so that the reader can uncomment them and view the syntax, an explanation of the properties, and some examples of their usage.
Structure
By the end of this chapter, the readers will have an idea about a
comprehensive range of topics related to the pandas series in Python,
equipping them with the skills and knowledge needed for effective data
manipulation and analysis. We will start by exploring the pandas series,
covering fundamental aspects such as series creation, data access, slicing,
and filtering. We will also explore more advanced topics like using callable
objects for selection, leveraging useful attributes, applying functions with
the apply method, aggregating data, and performing basic arithmetic
operations. The chapter will further emphasize series transformations and
iteration for a well-rounded understanding. By the end of this chapter,
readers will have a solid foundation in working with pandas series for data
analysis and manipulation in Python.
Pandas series
A pandas series is created using the Series() constructor, which accepts the following parameters:
data: This is the information that will be kept in the series. It could be a scalar value, a dictionary, a list, or a numpy array.
index: This option details the series index labels. It could be a
dictionary, list, array, or range. The series will be given a default
integer index starting at zero if no index is given.
dtype: The data type for the values in the series is specified by this
argument. Pandas will attempt to determine the data type from the
input data if it is not provided.
name: The series name is specified by this argument.
copy: With this argument, the user can decide whether the input data should be copied.
fastpath: Most users should not use this argument because it is used
for internal optimization only.
mylist = [1, 2, 3, 4, 5]
mypd_series = mypd.Series(mylist)
print(mypd_series)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Note: The preceding code is covered in (program name:
Chap7_Example7.1.py).
In the preceding example, we can see that no index parameter has been provided, so pandas internally assigns default index values (0, 1, 2, 3, …) from a RangeIndex object.
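To see the other constructor parameters in action, the following is a minimal sketch (not one of the numbered book programs) that passes index, dtype, and name explicitly:
import pandas as mypd

mylist = [1, 2, 3, 4, 5]
# explicit index labels, float dtype, and a series name
mypd_series = mypd.Series(mylist, index=['a', 'b', 'c', 'd', 'e'], dtype=float, name='myseries')
print(mypd_series)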
A series can also be created from a dictionary, in which case the dictionary keys become the index labels:
mydict = {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5}
mypd_series = mypd.Series(mydict)
print(mypd_series)
Output:
key1 1
key2 2
key3 3
key4 4
key5 5
dtype: int64
Similarly, a series can be created from a numpy ndarray:
myndarray = mynp.array([11,12,13,14,15])
mypd_series = mypd.Series(myndarray)
print(mypd_series)
Output:
0 11
1 12
2 13
3 14
4 15
dtype: int32
For creating a series from scalar value, an index has to be provided. The
repetition of scalar value will be done as per index length as demonstrated.
Observe the code Chap7_Example7.4.py below for better understanding:
import pandas as mypd
myseries = mypd.Series(10, index=[1, 2, 3, 4])
print(myseries)
Output:
1 10
2 10
3 10
4 10
dtype: int64
An explicit RangeIndex can also be supplied as the index:
#print(help(mypd.RangeIndex))
mypd_series = mypd.Series([11, 12, 13, 14, 15], index=mypd.RangeIndex(start=6, stop=11))
print(mypd_series)
Output:
6 11
7 12
8 13
9 14
10 15
dtype: int64
print(mypd_series_float)
Output:
Index name
0 11.0
1 12.0
2 13.0
3 14.0
4 15.0
The copy parameter in the pandas series provides control over whether a
deep copy or a reference copy is made when manipulating or modifying
data. When copy=True, a deep copy is created, ensuring that changes to the
new series do not affect the original data. Conversely, setting copy=False
results in a reference copy, allowing changes to the new series to propagate
to the original data, potentially saving memory but requiring caution to
avoid unintended modifications. This parameter is valuable in scenarios
where data integrity and independence are crucial considerations during
exploratory data analysis and manipulation using pandas, as shown in the
following code Chap7_Example7.7.py:
import numpy as mynp
import pandas as mypd
myndarray = mynp.array([11,12,13,14])
myseries = mypd.Series(myndarray, copy=True)    # deep copy: the ndarray is not affected
myseries[1] = 32
print(myseries)
myndarray2 = mynp.array([11,12,13,14])
myseries2 = mypd.Series(myndarray2, copy=False) # reference copy: changes can propagate to the ndarray
myseries2[1] = 32
print(myseries2)
Output:
0 11
1 32
2 13
3 14
dtype: int32
0 11
1 32
2 13
3 14
dtype: int32
head() method: This method accesses the first five rows of a series, as the default value of n is five. If an argument n other than five is passed, then this method accesses the first n rows of the series.
head(n: 'int' = 5)
tail() method: This method accesses the last five rows of a series, as the default value of n is five. If an argument n other than five is passed, then this method accesses the last n rows of the series. If the n value is negative, then this method returns all rows except the first n rows, as shown in the following code Chap7_Example7.8.py:
import pandas as mypd
myseries = mypd.Series(range(20))
# head method
print(myseries.head())
print(myseries.head(n=2))
# tail method
print(myseries.tail())
print(myseries.tail(n=2))
Output:
0    0
1    1
2    2
3    3
4    4
dtype: int64
0    0
1    1
dtype: int64
15 15
16 16
17 17
18 18
19 19
dtype: int64
18 18
19 19
dtype: int64
mypd_series[x]:
Here, x can be an index value, for example, mypd_series[3], which returns the value present at index three. x can also be a list of indices, for example, mypd_series[[1,2,4]], which returns a series of the values present at indices one, two, and four. For slices, mypd_series[2:5] returns a series of the values from the second to the fourth index.
mypd_series = mypd.Series([11, 12, 13, 14, 15], index=['r', 's', 't', 'u', 'v'])
print(mypd_series['u'])
print(mypd_series[3])
Output:
14
14
print(mypd_series['r':'u'])
print(mypd_series[1:4])
Output:
r 11
s 12
t 13
u 14
dtype: int64
s 12
t 13
u 14
dtype: int64
The get() method returns the value for a given index or label; if the index or label is unavailable, it returns None (or a supplied default value) instead of raising an error:
print(mypd_series.get(10))
print(mypd_series.get('a'))
So, in the same code, if we add the above two lines, then we will get the output as None in both cases, and we will not encounter any errors.
print(mypd_series.get('a', default=22))
So, in the same code, we are adding the line shown above. The output we
are getting is 22 since the specified label is unavailable. If the specified
index/label is present and a default value is provided, the output will not be
the default value, as shown as follows:
print(mypd_series.get(2, default=32))
So, in the same code, if we add the line shown above, then the output we
are getting is 13; the specified index is available.
Values can also be selected by integer position using the iloc indexer and by index label using the loc indexer, as shown in the following syntax and code:
mypd_series.iloc[0]
mypd_series.iloc[[1,2,4]]
mypd_series[['labela']]
mypd_series[['labela', 'labelb']]
mypd_series = mypd.Series([11, 12, 13, 14, 15], index=['r', 's', 't', 'u', 'v'])
# using iloc---------------------------------------
# print(help(mypd_series.iloc))
print(mypd_series.iloc[1])
print(mypd_series.iloc[[1,3]])
print(mypd_series.iloc[1:3])
print(mypd_series.iloc[-1]) # we are able to access the last item
print('-'*50)
# using loc---------------------------------------
# print(help(mypd_series.loc))
print(mypd_series.loc['s'])
print(mypd_series.loc[['s','u']])
print(mypd_series.loc['s':'u'])
Output:
12
s 12
u 14
dtype: int64
s 12
t 13
dtype: int64
15
--------------------------------------------------
12
s 12
u 14
dtype: int64
s 12
t 13
u 14
dtype: int64
mypd_series = mypd.Series(mylist)
Output:
0 11
2 13
4 15
dtype: int64
0 11
2 13
4 15
dtype: int64
0 11
2 13
4 15
dtype: int64
0 11
2 13
4 15
dtype: int64
mypd_series = mypd.Series([11, 12, 13, 14, 15], index=['r', 's', 't', 'u', 'v'])
my_bool_filter = mypd_series > 3
my_filtered_mypdseries = mypd_series[my_bool_filter]
print(my_filtered_mypdseries)
Output:
r 11
s 12
t 13
u 14
v 15
dtype: int64
Note: The preceding code is covered in (program name:
Chap7_Example7.13.py).
The preceding code creates a pandas series called mypd_series using the
list mylist as data and the index ['r', 's', 't', 'u', 'v']. It then filters the pandas
series using Boolean indexing, where a new series called
my_filtered_mypdseries is created by selecting only the values from
mypd_series that are greater than three. Finally, it displays the filtered
series.
The callable object-like function can be used while selecting values from
the series. A callable object can be passed to the normal indexer, the loc,
and iloc indexers, and for get() method too. Just observe the following code
Chap7_Example7.14.py for better understanding:
import pandas as mypd
mypd_series = mypd.Series(range(10))
def my_even_number_selection(myseries):
    return myseries % 2 == 0
print("Approach-1--------------------------")
print(mypd_series[my_even_number_selection])
print("Approach-2--------------------------")
print(mypd_series.loc[my_even_number_selection])
print("Approach-3--------------------------")
print(mypd_series.iloc[lambda myseries: [True if loop%2==0 else False for loop in range(myseries.size)]])
print("Approach-4--------------------------")
print(mypd_series.get(my_even_number_selection))
Output:
Approach-1--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-2--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-3--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-4--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Some useful attributes of the pandas series object are mentioned as follows:
values: This attribute will return values present inside the series
object.
index: This attribute will return the series index (axis labels).
dtype: The dtype object in the underlying data is returned using this
attribute.
size: The number of elements in the underlying data is returned using
this attribute.
shape: A tuple of the shape of the underlying data is returned using this attribute. For a series object, it returns a one-element tuple containing the number of elements in the series object.
ndim: The number of dimensions of the underlying data is returned
using this attribute. Here, it will be always one by definition.
name: The series name is returned using this attribute. The default
value is None if not provided.
is_unique: If the values in the series object are unique, then True is
returned using this attribute. If the values in the series object are not
unique, that is, repeatability of the values is there and we want to have
unique values, then use nunique() method.
is_monotonic_increasing/ is_monotonic_decreasing: Monotonic
means that the values are consistently in some order, either increasing
or decreasing. If values in the series object are monotonically
increasing, then it will return True for is_monotonic_increasing.
If values in the series object are monotonically decreasing, then is_monotonic_decreasing will return True.
hasnans: This attribute will return True if the series contains NaNs. We
can use the above attribute to check whether some values are missing
or not, as shown as follows:
mypd_series = mypd.Series(
data=['Mango','Guava','Pear','Banana','Litchi'],
index = [1,2,3,4,5],
)
# values
print(mypd_series.values)
print(type(mypd_series.values))
# index
print(mypd_series.index)
# dtype
print(mypd_series.dtype)
# size
print(mypd_series.size)
# shape
print(mypd_series.shape)
# ndim
print(mypd_series.ndim)
# name
print(mypd_series.name)
# is_unique
print(mypd_series.is_unique)
print('-'*50)
mypd_series2 = mypd.Series(data=['Mango','Guava','Pear','Banana','Litchi','Pear','Banana', mypd.NA])
print(mypd_series2.is_unique)
print(mypd_series2.nunique())
# is_monotonic
print('-'*50)
mypd_series3 = mypd.Series([1,2,3,4,5])
mypd_series4 = mypd.Series([5,4,3,2,1])
print(mypd_series3.is_monotonic_increasing)
print(mypd_series3.is_monotonic_decreasing)
print(mypd_series4.is_monotonic_increasing)
print(mypd_series4.is_monotonic_decreasing)
print('-'*50)
# hasnans
mypd_series5 = mypd.Series(['Mango','Guava','Pear','Banana','Litchi','Pear','Banana', mypd.NA])
mypd_series6 = mypd.Series(['Mango','Guava','Pear','Banana','Litchi','Pear','Banana', None])
print(mypd_series.hasnans)
print(mypd_series5.hasnans)
print(mypd_series6.hasnans)
Output:
<class 'numpy.ndarray'>
object
(5,)
1-D data
True
--------------------------------------------------
False
--------------------------------------------------
True
False
False
True
--------------------------------------------------
False
True
True
A function can be applied to every element of a series using the apply() method, which returns a new series containing the results:
mypd_series = mypd.Series([11, 12, 13, 14, 15], index=['r', 's', 't', 'u', 'v'])
def mycube(mynum):
    return mynum ** 3
mycube_pdseries = mypd_series.apply(mycube)
print(mycube_pdseries)
#print(help(mypd_series.apply))
Output:
r 1331
s 1728
t 2197
u 2744
v 3375
dtype: int64
A pandas series can be aggregated using min, max, sum, mean, and median
methods, as shown in the following code Chap7_Example7.17.py:
Output:
Sum is: 65
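The aggregation program itself is not reproduced above; a minimal sketch of these methods, assuming the same illustrative series of values 11 through 15 (so that the sum is 65), is:
import pandas as mypd

mypd_series = mypd.Series([11, 12, 13, 14, 15])
print("Sum is:", mypd_series.sum())        # 65
print("Min is:", mypd_series.min())        # 11
print("Max is:", mypd_series.max())        # 15
print("Mean is:", mypd_series.mean())      # 13.0
print("Median is:", mypd_series.median())  # 13.0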
# print(help(mypd.read_csv))
mydf = mypd.read_csv('employees_salary.csv')
print(mydf)
print(mydf1)
Output:
2 Arnold 130000
10 Ronald 300000
11 Recky 400000
ndim: 2
0 John Deo
1 Max Ruin
2 Arnold
3 Krish Star
4 John Mike
5 Alex John
6 My John Rob
7 Asruid
8 Tes Qry
9 Big John
10 Ronald
11 Recky
# Read the CSV file and select both 'Employee Name' and 'Salary' columns
print(mydf)
mydf['Salary'] = mydf['Salary'].fillna(12345)
# Convert the DataFrame into a Series object with 'Employee Name' as the
index
print(myseries)
print(type(myseries))
Output:
2 Recky 400000.0
Employee Name
Donald 12345.0
Recky 400000.0
<class 'pandas.core.series.Series'>
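Only fragments of the program appear above; the following is a minimal sketch of the idea being described, assuming the employees_salary.csv file has 'Employee Name' and 'Salary' columns:
import pandas as mypd

# read only the two columns of interest
mydf = mypd.read_csv('employees_salary.csv', usecols=['Employee Name', 'Salary'])
# replace missing salaries with a placeholder value
mydf['Salary'] = mydf['Salary'].fillna(12345)
# convert the DataFrame into a Series with 'Employee Name' as the index
myseries = mydf.set_index('Employee Name')['Salary']
print(myseries)
print(type(myseries))  # <class 'pandas.core.series.Series'>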
print(mydf*2)
print(mys1+mys2)
print(mys3+mys4)
Output:
0 40
1 NaN
2 NaN
3 50
4 NaN
dtype: object
r     7
s     9
t    11
u    13
v    15
dtype: int64
r 7.0
s 9.0
t 11.0
u NaN
v NaN
w NaN
x NaN
dtype: float64
There is a fill_value parameter for methods like mul(), div(), add(), and sub(). If a matching index label is missing in one of the two series, the fill_value is used in place of the missing element, as shown in the following code Chap7_Example7.21.py:
print(mys1.add(mys2, fill_value=0))
Output:
r 7.0
s 9.0
t 11.0
u 4.0
v 5.0
w 9.0
x 10.0
y NaN
dtype: float64
If mys1 and mys2 are two series objects, then observe the following
symbols and their meaning reference to it:
+ → mys1.add(mys2)
- → mys1.sub(mys2)
* → mys1.mul(mys2)
/ → mys1.div(mys2)
mypd_series = mypd.Series([15,30,45,60])
mypd_series[1] = 25
mypd_series[2] = 55
print(mypd_series)
# Obs: We can see that we are using 'n' lines of code, in order to update 'n' elements
mypd_series2 = mypd.Series([15,30,45,60])
mypd_series2.update(mypd.Series([25,55], index=[1,2]))
print(mypd_series2)
#print(help(mypd_series2.update))
Output:
0    15
1    25
2    55
3    60
dtype: int64
0    15
1    25
2    55
3    60
dtype: int64
If the index labels match, then the old values will be replaced with the new values. If an index label in the argument series has no match in the original series, that extra element is ignored. The corresponding values are also not updated in the original series if the argument series contains NaN at those positions, as shown as follows.
The map() method maps the values of a series according to a dictionary, a function, or another series; values with no mapping become NaN. The program is only partially shown here, so the function body below is illustrative:
import pandas as mypd
import numpy as mynp
mypd_series = mypd.Series(['Green','Yellow','Blue','Violet'])
#print(help(mypd_series.map))
# Using dict
mynewpd_series = mypd_series.map({'Green':'Pink','Blue':'Magenta'})
print(mynewpd_series)
# Using a function
def myfunc(x):
    return str(x).upper()   # illustrative body
my_series_func = mypd_series.map(myfunc)
print(my_series_func)
# NaN values can be skipped with na_action='ignore'
mypd_series3 = mypd.Series(['Green','Yellow',mynp.NaN,'Violet'])
my_series_func2 = mypd_series3.map(myfunc, na_action='ignore')
print(my_series_func2)
Output:
0 Pink
1 NaN
2 Magenta
3 NaN
dtype: object
dtype: object
2 NaN
dtype: object
We can iterate the elements of the series using the items() method as shown
in the code Chap7_Example7.24.py as follows:
mypd_series = mypd.Series(['Green','Yellow','Blue','Violet'])
for myindex, myvalue in mypd_series.items():
    print(myindex, '--->', myvalue)
Output:
0 ---> Green
1 ---> Yellow
2 ---> Blue
3 ---> Violet
Conclusion
Points to remember
CHAPTER 8
Exploring Pandas Dataframe for
Data Science in Python
Introduction
In the previous chapter, we learned about the pandas series. In this chapter,
we will explore the pandas dataframe. A DataFrame is a two-dimensional labeled data structure that can carry a variety of data types in its columns. It resembles a spreadsheet or table, capable of accommodating various data types and facilitating operations on entire rows or columns. The pandas dataframe shines in managing structured, tabular data with multiple variables. The flexibility and functionality provided by the pandas dataframe, including support for handling missing data, make it particularly valuable for diverse and complex data analysis tasks. It is common to start
with a pandas series and then combine them into a dataframe to leverage the
benefits of both structures in handling different aspects of data manipulation
and analysis.
Let us learn some concepts on pandas dataframe.
Structure
The chapter will discuss the following topics:
Pandas dataframe constructor
Pandas methods and attributes for dataframe
Dropping dataframe rows with null values
Arithmetic operators for dataframes
New columns addition to dataframes
Usage of fillna() method to the dataframe
Sorting values of the dataframe
Sorting dataframe by multiple columns
Sorting dataframe based on index
Ranking series of values by using rank() method
Filter data from dataframe
Inclusion check in dataframe
Usage of isnull and notnull method
Usage of duplicated method
Usage of drop_duplicates method in dataframe
Usage of unique and nunique methods
Objectives
By the end of this chapter, the reader will have an idea about a
comprehensive range of topics related to pandas dataframes in Python,
equipping them with the skills and knowledge needed for effective data
manipulation and analysis. As we look at pandas dataframes, readers will gain insights into dataframe construction and the exclusive methods and attributes applicable only to dataframes. We will address the critical task of
handling missing data, arithmetic operations, and the addition of new
columns for dataframes. The usage of the fillna method for handling
missing values will also be covered. Sorting, ranking, and filtering data
within dataframes will be thoroughly explored, along with techniques for
checking data inclusion, identifying missing values, and handling
duplicates. Finally, the reader will be able to understand the concept of a pandas dataframe and differentiate it from a pandas series.
Pandas dataframe
A two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns) is called a pandas
dataframe. Data is aligned in a tabular fashion in rows and columns, which
is a two-dimensional data structure. The data, rows, and columns are the
three primary components of a pandas dataframe. The data representation is
like a SQL table or a spreadsheet. We can say that a dataframe is a series of
collection objects. Let us see some of the key concepts of pandas dataframe
in Python.
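As a quick illustration (a minimal sketch, not one of the numbered book programs), a dataframe can be built from a dictionary of lists, where each key becomes a column name:
import pandas as mypd

mydict = {
    'name': ['Rahul', 'Divya', 'Saurabh'],
    'age': [30, 34, 35],
    'city': ['Bhopal', 'Durg', 'Hyderabad']
}
mydf = mypd.DataFrame(mydict)
print(mydf)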
An important point to note is that the dtype attribute is applicable to the series object, whereas for the dataframe we use the dtypes attribute, which returns a series containing the data type of each column in the dataframe.
We can also create a dataframe from a dict of tuple objects, which we have left for you to practice. It can be solved simply by replacing [] with () in the preceding example.
# dataframe creation using ndarray
myndarray = mynp.array([[1,2,3],[4,5,6],[7,8,9]])
mydf2 = mypd.DataFrame(myndarray)
print(mydf2) # with default column names 0 1 and 2
print('-'*50)
# column names
mydf3 = mypd.DataFrame(myndarray, columns=["a","b","c"])
print(mydf3)
print(mydf3.columns) # index object containing column names
print('-'*50)
# axes
print(mydf3.axes) # list of axes objects is returned [rowindex, columnindex]
Output:
id int64
name object
class object
mark int64
gender object
dtype: object
object 3
int64 2
Name: count, dtype: int64
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Index(['a', 'b', 'c'], dtype='object')
[RangeIndex(start=0, stop=3, step=1), Index(['a',
'b', 'c'], dtype='object')]
In order to select only one required column from the dataframe, we can use mydf.column_name or mydf['column_name'] (the latter is mostly recommended because column names may contain spaces).
In order to select multiple columns from the dataframe, pass a list of column names: mydf[['column_name1', 'column_name2']].
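A minimal sketch of both forms of selection, reusing the mydf3 dataframe created above:
# single column: returns a Series
print(mydf3['a'])
print(mydf3.a)            # attribute access works when the name has no spaces
# multiple columns: returns a DataFrame
print(mydf3[['a', 'c']])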
Now, let us discuss some methods which will be used exclusively for
dataframes:
info() method: The summary information of dataframe is generated
using info() method. It is highly recommended to use this method
before using dataframe. The output of info() method can be
customized by using the parameters as highlighted in the following
syntax and code Chap8_Example8.3.py:
info(verbose: 'bool | None' = None, buf: 'WriteBuffer[str] | None' = None, max_cols: 'int | None' = None, memory_usage: 'bool | str | None' = None, show_counts: 'bool | None' = None)
The code reads a CSV file called student.csv using the pandas
library and assigns it to a dataframe called mydf. It then uses the
info() method to get information about the dataframe, such as the
number of rows, columns, and data types. Instead of displaying the
information to the console, it saves it to a buffer called mybuffer.
The contents of the buffer are then written to a text file called
df_information.txt using the write() method. Finally, a
message is printed to the console indicating that the details can be
found in the df_information.txt file.
sample() method: Whenever there is a requirement to get random
records from the dataframe, we will use the sample() method. A
random sample of items will be returned from an axis of the object,
as shown in the following syntax and in the code
Chap8_Example8.5.py as follows:
sample(n: 'int | None' = None, frac: 'float | None' = None, replace: 'bool_t' = False, weights=None, random_state: 'RandomState | None' = None, axis: 'Axis | None' = None, ignore_index: 'bool_t' = False)
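The numbered programs themselves are not reproduced above; the following is a minimal sketch of both methods just described, assuming the student.csv file mentioned earlier:
import io
import pandas as mypd

mydf = mypd.read_csv('student.csv')

# info(): write the dataframe summary into a buffer and save it to a text file
mybuffer = io.StringIO()
mydf.info(buf=mybuffer)
with open('df_information.txt', 'w') as myfile:
    myfile.write(mybuffer.getvalue())
print('Details can be found in the df_information.txt file')

# sample(): fetch random records from the dataframe
print(mydf.sample(n=3))        # three random rows
print(mydf.sample(frac=0.2))   # a random 20 percent of the rows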
We can see that the rows with at least one NaN value have been removed. We can use the how parameter to decide whether a row is dropped when any of its values are NaN or only when all of them are NaN. So far, we have seen rows dropped when any column contained NaN, because the default value of how is 'any'. Now, let us see the output with the 'all' value by looking at the following code Chap8_Example8.7.py:
import pandas as mypd
mydf = mypd.read_csv('sample_dropna.csv')
mydf1 = mydf.dropna(how='all')
print(mydf1)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur
8 Tes Qry NaN NaN
9 Big John 109000.0 NaN
Note: The preceding code is covered in (program name:
Chap8_Example8.7.py).
We can say that the last row was removed because all the columns
contained NaN values.
Till now, we saw that dropna method returns a new dataframe. But suppose
there is a requirement to use the same dataframe and remove the rows with
missing values, then we will be using inplace parameter by setting it as
True, as shown in the code Chap8_Example8.8.py as follows:
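The program is not reproduced above; a minimal sketch of the idea, assuming the same sample_dropna.csv file, is:
import pandas as mypd

mydf = mypd.read_csv('sample_dropna.csv')
# modify the same dataframe in place instead of returning a new one
mydf.dropna(inplace=True)
print(mydf)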
Till now, we have seen how to remove rows that contain missing values
because the default value for the axis parameter was zero. Rows were
dropped which contain missing values. Let us see what will happen when
axis=1. The columns will be dropped that contain the missing values. Refer
to the following code Chap8_Example8.10.py:
import pandas as mypd
mydf = mypd.read_csv('sample_axis.csv')
print(mydf)
print('-'*50)
mydf1 = mydf.dropna(axis=1) # any column that contains a missing value will be dropped
print(mydf1)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Tes Qry NaN NaN
2 Big John 109000.0 NaN
3 Johnny NaN NaN
--------------------------------------------------
Employee Name
0 John Deo
1 Tes Qry
2 Big John
3 Johnny
If we view the sample_fillna.csv file, we will find that the salary column
values are of int type. But due to the presence of NaN value in some rows,
we are getting the output of the column as float data type. So, memory will
be wasted as the int data type takes less memory when compared to the
float data type. So, we will convert the particular column to the required
data type with the help of the astype() method. The dataframe column
types will be converted using astype() method, thus giving the advantage
of memory utilization. So, first, use dropna() method and then convert it
into int type using astype() method, as shown in the code
Chap8_Example8.15.py as follows:
import pandas as mypd
mydf = mypd.read_csv('sample_fillna.csv')
mydf.dropna(inplace=True)
print(mydf)
print(mydf.info())
print('-'*50)
mydf['Salary']=mydf['Salary'].astype('int')
print(mydf) # int type
print(mydf.info())
Output:
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 2 non-null object
1 Salary 2 non-null float64
2 Location 2 non-null object
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes
None
--------------------------------------------------
Employee Name Salary Location
0 My John Rob 10000 Bhopal
1 Asruid 170000 Raipur
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 2 non-null object
1 Salary 2 non-null int32
2 Location 2 non-null object
dtypes: int32(1), object(2)
memory usage: 56.0+ bytes
None
From the current example, we can see that memory usage decreased when
we changed from float to int type, that is, from 64.0 + bytes to 56.0 + bytes.
Now, there may be a scenario in which a particular column contains many repeated values. If a separate copy of the value is created every time it occurs, memory usage increases and performance suffers.
For example, consider the following CSV file category_example.csv. We
will find that under the City column that contains multiple employee
names, we have city names such as Hyderabad, Bhopal, and Indore which
are repeated. So, if a copy is created every time, then memory usage will be
more. So, the better option will be to create a copy once and reuse the same
value multiple times where it is required instead of creating the same value
multiple times which will improve the memory utilization and performance.
This approach is possible only if we use the category type, as shown in the
code Chap8_Example8.16.py as follows:
import pandas as mypd
mydf = mypd.read_csv('category_example.csv')
print(mydf.info())
print('-'*50)
mydf['City'] = mydf['City'].astype('category')
print(mydf.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 648 non-null object
1 Salary 648 non-null int64
2 City 648 non-null object
dtypes: int64(1), object(2)
memory usage: 15.3+ KB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 648 non-null object
1 Salary 648 non-null int64
2 City 648 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 11.0+ KB
None
We can see that there are 648 entries in the file category_example.csv and the memory usage has improved from 15.3+ KB to 11.0+ KB. Also, we can observe that the City column is now of the category type instead of the string type (object). The category type is a pandas-specific data type.
Now, suppose there is a requirement such that the employee records will be
first sorted based on the alphabetical order of Employee Name in the
ascending order. If multiple employees have the same Employee Name, then
sorting will happen based on the descending order of Salary for those
employees. To do this, pass the list values first with the value as True and
then as False to ascending parameter, in the code
Chap8_Example8.19.py as follows:
import pandas as mypd
mydf = mypd.read_csv('sorting_example2.csv')
print(mydf)
print('-'*50)
mydf.sort_values(by = ['Employee Name','Salary'],
inplace=True , ascending=[True,False])
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 140000.0 Raipur
4 Ankit 10000.0 NaN
5 Saurabh 170000.0 Lara
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
4 Ankit 10000.0 NaN
2 Divya 130000.0 Durg
0 Rahul 100000.0 Bhopal
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
1 Saurabh 200000.0 Hyderabad
5 Saurabh 170000.0 Lara
3 Saurabh 140000.0 Raipur
Note: The preceding code is covered in (program name:
Chap8_Example8.19.py).
We can provide a separate column as Salary Rank and then display the
ranking, as demonstrated in the following code
Chap8_Example8.22.py:
import pandas as mypd
mydf = mypd.read_csv('rank_eg.csv').dropna()
print(mydf)
print('-'*50)
mydf['SalaryRank'] =
mydf['Salary'].rank(ascending=False)# Highest
salary has rank 1.0
mydf.sort_values(by='SalaryRank',inplace=True)
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 13000.0 Raipur
5 Saurabh 200000.0 Lara
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City SalaryRank
7 Saurabh 400000.0 Mahasamund 1.0
1 Saurabh 200000.0 Hyderabad 2.5
5 Saurabh 200000.0 Lara 2.5
2 Divya 130000.0 Durg 4.0
0 Rahul 100000.0 Bhopal 5.0
3 Saurabh 13000.0 Raipur 6.0
Now, let us see a scenario where employee names listed in one CSV file are selected based on the Air Quality Index (AQI) of their cities, which is mentioned in another CSV file. We will be selecting the names of the employees who reside in a city whose AQI > 105, as shown in the following code Chap8_Example8.25.py:
import pandas as mypd
mydf = mypd.read_csv('filter_eg.csv').dropna()
print(mydf)# displaying 1st dataframe
print('-'*50)
mydf2 = mypd.read_csv('AQI_eg.csv')
print(mydf2)# displaying 2nd dataframe
print('-'*50)
# selecting city name whose AQI > 105
myaqi = mydf2['AQI']>105
print(mydf2[myaqi])
print('-'*50)
print(mydf2[myaqi]['City'])
myaqicity = mydf2[myaqi]['City']
print('-'*50)
# selecting employee names who are residing in cities whose AQI > 105
myemp = mydf['City'].isin(myaqicity)
print(mydf[myemp]['Employee Name']) # getting the employee name only
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
5 Saurabh 200000.0 Lara
--------------------------------------------------
City AQI
0 Bhopal 100
1 Hyderabad 150
2 Durg 90
3 Raipur 120
4 Lara 110
--------------------------------------------------
City AQI
1 Hyderabad 150
3 Raipur 120
4 Lara 110
--------------------------------------------------
1 Hyderabad
3 Raipur
4 Lara
Name: City, dtype: object
--------------------------------------------------
1 Saurabh
3 Saurabh
5 Saurabh
Name: Employee Name, dtype: object
The between() method checks for inclusion within a range of values. Refer to the following syntax and the code Chap8_Example8.26.py for better clarity:
between(left, right, inclusive: "Literal['both',
'neither', 'left', 'right']" = 'both')
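The program itself is not shown above; a minimal sketch of between(), assuming a dataframe with a Salary column like the ones used earlier, is as follows:
import pandas as mypd

mydf = mypd.read_csv('filter_eg.csv').dropna()
# Boolean series: True where 100000 <= Salary <= 150000 (both ends included by default)
mycond = mydf['Salary'].between(100000, 150000)
print(mydf[mycond])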
The isnull() method returns True for each missing element, while the notnull() method is its opposite: it returns True for each element that is not missing and False otherwise. Refer to the following code Chap8_Example8.27.py:
import pandas as mypd
mydf = mypd.read_csv('isnull_notnull_eg.csv')
print(mydf)
print('-'*50)
# Applying on series object ---------------------------------------
print(mydf['Employee Name'].isnull()) # will return Boolean series
print('-'*50)
print(mydf[mydf['Employee Name'].isnull()]) # selecting rows with True
print('-'*50)
print(mydf['Employee Name'].notnull()) # will return Boolean series
print('-'*50)
print(mydf[mydf['Employee Name'].notnull()]) # selecting rows with True
# Applying on dataframe object ---------------------------------------
print(mydf.isnull())
print('-'*50)
print(mydf.notnull())
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Vikas 200000.0 Lara
2 Rohit NaN Durg
3 Saurabh 400000.0 NaN
4 NaN 310000.0 Indore
--------------------------------------------------
0 False
1 False
2 False
3 False
4 True
Name: Employee Name, dtype: bool
--------------------------------------------------
Employee Name Salary City
4 NaN 310000.0 Indore
--------------------------------------------------
0 True
1 True
2 True
3 True
4 False
Name: Employee Name, dtype: bool
--------------------------------------------------
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Vikas 200000.0 Lara
2 Rohit NaN Durg
3 Saurabh 400000.0 NaN
Employee Name Salary City
0 False False False
1 False False False
2 False True False
3 False False True
4 True False False
--------------------------------------------------
Employee Name Salary City
0 True True True
1 True True True
2 True False True
3 True True False
4 False True True
To remove all rows with duplicate Employee Names, including the first occurrence, use the following method:
mydf.drop_duplicates(subset=['Employee Name'],
keep=False, inplace=True)
Conclusion
In this chapter, we took a comprehensive journey through the intricate world of pandas dataframes in Python, equipping readers with a diverse skill set to effectively analyze and manipulate data. We provided readers with invaluable insights into dataframe construction and the exclusive methods
and attributes that come into play. Handling missing data was tackled
comprehensively, and readers were equipped with the skills needed for
arithmetic operations on dataframes and the dynamic addition of new
columns. The fillna method was introduced as a vital tool for addressing
missing values. Sorting, ranking, and filtering data within dataframes were
explored in-depth, and various techniques for data validation, including
checking for data inclusion, identifying missing values, and handling
duplicates were covered extensively. With this robust foundation, readers
are now well-prepared to tackle data analysis and different approaches to
manipulation tasks with confidence, armed with the power of pandas
dataframes in Python.
In the next chapter, we will explore advanced data manipulation techniques
in pandas. Topics covered will include setting customized columns as
indices, selecting rows using loc[] and index positions, and renaming index
and column names. We will also delve into filtering data using the query
method, handling spaces in column names, and utilizing operators like in
and not in. The chapter will cover data aggregation, text data handling,
modifying data, and advanced functions like apply(), nlargest(),
nsmallest(), and where() methods for powerful data analysis.
Points to remember
Pandas library provides two primary data structures: Series (1D) and
DataFrame (2D).
Access columns in a dataframe using square brackets:
mydf['column_name'].
Use the unique() and nunique() functions to find unique values and
their counts.
Pandas dataframe is a table of data with rows and columns.
Pandas dataframe can be created from many different types of data,
such as lists, dictionaries, numpy arrays, and CSV files.
Pandas dataframe has many functions for accessing and manipulating
data, such as slicing, filtering, and indexing.
Pandas dataframe can perform basic arithmetic operations, such as
addition, subtraction, multiplication, and division.
Pandas dataframe can be manipulated using a variety of methods,
such as sorting, ranking, and aggregating.
Questions
1. What is the purpose of the pandas dataframe constructor?
2. What are some methods and attributes that are exclusively applicable
to dataframes?
3. How do you drop rows with missing or null values in a dataframe?
4. What arithmetic operators can be used with dataframes?
5. How to add new columns to a dataframe?
6. What is the fillna method used for in a dataframe?
7. How to sort the values of a dataframe?
8. How to sort a dataframe based on multiple columns?
9. What is the method for sorting a dataframe based on its index?
10. How to rank a series of values in a dataframe using the rank method?
11. How to filter data from a dataframe?
12. What are the isnull and notnull methods used for?
13. How do you use the duplicated method and what does the
drop_duplicates do?
14. How to find unique values using the unique and nunique methods?
15. Explain inclusion check in a pandas dataframe?
CHAPTER 9
Advanced Dataframe Filtering
Techniques
Introduction
Structure
Objectives
By the end of this chapter, the reader will learn various methods for
renaming index labels and column names, as well as efficient ways to
delete rows and columns from dataframes. The chapter will introduce the
powerful query method for data filtering and explore additional advanced
techniques such as using apply to manipulate dataframe data and finding
the largest and smallest values with nlargest and nsmallest. Text data
handling within pandas will be addressed, demonstrating how to filter
dataframe rows with string methods. Practical examples of data
modification in dataframes will provide readers with hands-on experience
in real-world data manipulation. Various practical examples will be
explained with output and comments wherever necessary. Finally, readers will have a solid foundation in working with pandas dataframes for data analysis and manipulation in Python.
mydf = mypd.read_csv('unique_eg.csv')
print(mydf)
print('-'*50)
# Approach-1
print(mydf1)
print('-'*50)
# Approach-2
# print(help(mydf1.set_index))
print(mydf)
Output:
--------------------------------------------------
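Only fragments of the program are shown above; a minimal sketch of the two usual approaches to setting a column as the index (the column name 'Employee Name' is assumed here for illustration) is as follows:
import pandas as mypd

# Approach-1: set the index while reading the file
mydf1 = mypd.read_csv('unique_eg.csv', index_col='Employee Name')  # 'Employee Name' is an assumed column name
print(mydf1)

# Approach-2: read first, then promote the column to the index
mydf = mypd.read_csv('unique_eg.csv')
mydf.set_index('Employee Name', inplace=True)
print(mydf)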
We can reset the index from the column index to the default numerical
index by using reset_index() method, refer to the following syntax for the
same:
reset_index(level: 'IndexLabel | None' = None, *, drop: 'bool' = False, inplace: 'bool' = False, col_level: 'Hashable' = 0, col_fill: 'Hashable' = '', allow_duplicates: 'bool | lib.NoDefault' = <no_default>, names: 'Hashable | Sequence[Hashable] | None' = None)
An important point to note is that with drop=False the old index is retained as a regular column of the dataframe, whereas with drop=True the old index is deleted from the dataframe, which means that data is lost, as demonstrated in the following code Chap9_Example9.2.py:
print(mydf)
print('-'*50)
mydf.reset_index(drop=True, inplace=True)
print(mydf)
Output:
--------------------------------------------------
Salary City
0 200000.0 Lara
1 400000.0 NaN
2 210000.0 NaN
3 110000.0 Jhansi
4 NaN Anand
5 130000.0 Aligarh
Pandas loc[] indexer lets the user select rows and columns by their index
labels, instead of their position in the dataframe. We can pass a single label,
a list of labels, or a slice of labels to the loc[] indexer to select rows, as
shown in the following syntax:
mydf.loc[index_label]
If a single row matches the index label, then a series object is returned. A dataframe object is returned if multiple rows are associated with the specified index label.
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
# row selection with the matched index label --> single row associated so
series object
print(mypd_dataframe.loc['Divya'])
print('-'*50)
# now multiple rows associated with the index specified ---> we are getting
dataframe object
print(mypd_dataframe.loc['Saurabh'])
print('-'*50)
print(mypd_dataframe.loc['Johnny'])
Output:
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
--------------------------------------------------
myage 34
mycountry India
--------------------------------------------------
Saurabh 35 Finland
Saurabh 40 Italy
KeyError: 'Johnny'
mydf.loc[begin_index:end_index:step]
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe.loc['Mintoo':'Priyanka'])
print('-'*50)
print(mypd_dataframe.loc['Mintoo':'Priyanka':2])
print('-'*50)
# The begin index label is optional and default value is first index label
print('-'*50)
print(mypd_dataframe.loc[:'Priyanka'])
# The end index label is optional and default value is also end index label
print('-'*50)
print(mypd_dataframe.loc['Nilesh':])
print('-'*50)
print(mypd_dataframe.loc[:])
Output:
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
--------------------------------------------------
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
--------------------------------------------------
Mintoo 37 Australia
Priyanka 30 USA
--------------------------------------------------
--------------------------------------------------
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
--------------------------------------------------
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
--------------------------------------------------
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
Here, the argument is a list of index labels. All rows will be returned with
matched indexes. You might have the chance of getting KeyError, if at least
one index label is unavailable. Refer to the following code
Chap9_Example9.5.py for better understanding:
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe.loc[['Mintoo','Nilesh','Priyanka']])
print('-'*50)
print(mypd_dataframe.loc[['Mintoo','Nilesh','Priyanka','Nivesh']])
Output:
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
--------------------------------------------------
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
KeyError: "['Nivesh'] not in index"
mydf.iloc[index_position]
The row selected with a matched index position will be returned in the
form of a series object. Non-availability of specified index position results
in IndexError.
mydf.iloc[begin_index_position:end_ index_position:step]
Here, all rows are returned with matched index positions. The
unavailability of at least one index position will throw IndexError, refer to
the following code Chap9_Example9.6.py for better understanding:
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
# Syntax-1: single index position
print(mypd_dataframe.iloc[1])
print('-'*50)
# Syntax-2: slice of index positions
print(mypd_dataframe.iloc[1:4])
print('-'*50)
# Syntax-3: list of index positions
print(mypd_dataframe.iloc[[1,2,4]])
Output:
Divya 34 India
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
--------------------------------------------------
myage 37
mycountry Australia
--------------------------------------------------
Mintoo 37 Australia
Nilesh 43 UK
Priyanka 30 USA
--------------------------------------------------
Mintoo 37 Australia
Nilesh 43 UK
Saurabh 35 Finland
mydf.loc[index_label, column_name]
mydf.iloc[index_position, column_position]
Here, the first argument is used to select rows and the second argument is used to get the value of the specified column, as demonstrated in the following code Chap9_Example9.7.py:
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe.loc['Saurabh','mycountry'])
print('-'*50)
print('-'*50)
print(mypd_dataframe.loc[['Mintoo','Priyanka'],['mycountry','mysalary']]) #
will return rows with more than 1 column
print('-'*50)
print(mypd_dataframe.iloc[1,0])
print('-'*50)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
myname
Saurabh Finland
Saurabh Italy
--------------------------------------------------
Italy
--------------------------------------------------
--------------------------------------------------
37
--------------------------------------------------
Mintoo 37 20000
Priyanka 30 50000
--------------------------------------------------
Nilesh 43 UK
Priyanka 30 USA
Saurabh 35 Finland
Saurabh 40 Italy
Suppose there is a requirement to set a new value for a specified cell, then we shall be using the following syntax:
mydf.loc[index_label, column_name] = new_value
Wherever multiple values are required, we will provide them in the form of
a list which will be demonstrated in the following code
Chap9_Example9.8.py:
import pandas as mypd
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.loc['Nilesh','myage'] = 45
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.loc[['Divya','Priyanka'],['myage','mycountry']] =
[[42,'Russia'],[29,'France']]
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
print('-'*50)
# replacing name having Saurabh row with mycountry as Italy from its
salary as 60000 to 99999
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 45 UK 40000
Priyanka 30 USA 30000
--------------------------------------------------
Nilesh 45 UK 40000
--------------------------------------------------
Nilesh 45 UK 40000
--------------------------------------------------
myname myage mycountry mysalary
Nilesh 45 UK 40000
Index labels are related to rows, that is, axis-0 and column labels are related
to columns, that is, axis-1. The index labels and column names can be
renamed using the rename() method. The user is allowed to change the
name of one or more index labels or column names by providing a mapping
of old names to the new names. Refer to the following syntax for a better understanding:
rename(mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors='ignore')
mypd_dataframe = mypd.read_csv('rename_eg.csv',
index_col='myname')
mypd_dataframe.sort_index(inplace=True)
#print(help(mypd_dataframe.rename))
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.rename(mapper={'Saurabh':'Yathartha',
'Nilesh':'Ramesh'}, inplace=True)
# axis=0 is the default value. The same output is generated for axis='rows' or axis='index' ---> please try it on your own
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Ramesh 43 UK 40000
mypd_dataframe = mypd.read_csv('rename_eg.csv',
index_col='myname')
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
# Usage of index parameter ----------
mypd_dataframe.rename(index={'Saurabh':'Yathartha',
'Nilesh':'Ramesh'}, inplace=True)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Ramesh 43 UK 40000
The column can be renamed by using mapper and axis parameters and also
by using column parameters as demonstrated in the following code
Chap9_Example9.11.py:
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.rename(mapper={'myage':'my_age',
'mycountry':'my_country'}, axis=1, inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.rename(columns={'mysalary':'my_salary'}, inplace=True)
print(mypd_dataframe)
Output:
myname myage mycountry mysalary
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
Priyanka 30 USA 50000
mypd_dataframe.columns = ['my_age','my_country','my_salary']
But here, the length of the list must be the same as the number of columns in the dataframe. So, this technique is useful only when we want to change all the column names.
The drop() method in pandas will be used to delete rows and columns from
a dataframe object, as shown below:
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.drop))
print(mypd_dataframe)
print('-'*50)
# to delete rows where index label is 'Saurabh' ---> all will generate the
same output. Uncomment it and then check
# mypd_dataframe.drop(labels='Saurabh', inplace=True)
# mypd_dataframe.drop(labels='Saurabh',axis=0, inplace=True)
# mypd_dataframe.drop(labels='Saurabh',axis='index', inplace=True)
# mypd_dataframe.drop(labels='Saurabh',axis='rows', inplace=True)
mypd_dataframe.drop(index='Saurabh', inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.drop(index=['Mintoo','Nilesh'], inplace=True)
# mypd_dataframe.drop(labels=['Mintoo','Nilesh'], inplace=True)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
--------------------------------------------------
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.drop))
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.drop(labels='myage',axis=1, inplace=True)
# mypd_dataframe.drop(columns='myage', inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe.pop('mycountry')
print(mypd_dataframe)
del mypd_dataframe['mysalary']
print('-'*50)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh UK 40000
--------------------------------------------------
myname mysalary
Divya 10000
Mintoo 20000
Nilesh 40000
Priyanka 50000
Saurabh 30000
Saurabh 60000
--------------------------------------------------
Empty DataFrame
Columns: []
One of the most commonly used methods in pandas is the query() method. If the user wants to filter rows from a dataframe, the query() method can be used. The columns of a dataframe are queried with a Boolean expression, as shown in the following syntax:
query(expr, *, inplace=False, **kwargs)
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.query))
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe[mycond])
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Now, we will be exploring an example with the usage of the query method
which is demonstrated in the following code Chap9_Example9.15.py:
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
Output:
It is like SQL, where we get an output based on a query, but the expr argument must be a valid query string. Normal Python operators like ==, !=, >, >=, <, <=, &, and, |, or, in, and not in can be used.
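The query calls themselves are not fully reproduced in the surrounding fragments; a minimal sketch, assuming the same query_eg.csv file with myname, myage, mycountry, and mysalary columns, is as follows:
import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv')
# rows where the salary exceeds 30000
print(mypd_dataframe.query('mysalary > 30000'))
# combining conditions with and / &
print(mypd_dataframe.query('mysalary > 30000 and myage < 45'))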
We can filter results based on multiple values using the in and not in operators. The following code, Chap9_Example9.16.py, shows the usage of the in operator:
import pandas as mypd
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
Nilesh 43 UK 40000
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
The first approach is by using back quotes or backticks. The following code
Chap9_Example9.18.py displays the usage of back quotes (``):
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
Replace space with underscore symbol (_) in column name and then apply
query() method, as demonstrated in the following code
Chap9_Example9.19.py:
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe)
Output:
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
--------------------------------------------------
Nilesh 43 UK 40000
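Since only fragments of those programs appear above, here is a minimal sketch of both approaches, assuming a hypothetical dataframe whose column name 'my salary' contains a space:
import pandas as mypd

mypd_dataframe = mypd.DataFrame({'myname': ['Divya', 'Nilesh'], 'my salary': [10000, 40000]})

# Approach-1: wrap the column name in backticks inside the query string
print(mypd_dataframe.query('`my salary` > 20000'))

# Approach-2: replace the space with an underscore, then query normally
mypd_dataframe.rename(columns={'my salary': 'my_salary'}, inplace=True)
print(mypd_dataframe.query('my_salary > 20000'))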
This apply() method is similar to the Series class apply() method, except that it is applied along an axis of the dataframe (row-wise when axis='columns'). Refer to the following code Chap9_Example9.20.py for better understanding:
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.apply))
print(mypd_dataframe)
print('-'*50)
# tripling the salary of each name using series apply method since we are
selecting a column
mypd_dataframe['mysalary'] = mypd_dataframe['mysalary'].apply(lambda
sal: sal*3)
print(mypd_dataframe)
print('-'*50)
# Now, tripling the salary of country with 'India' using dataframe apply
method
def india_triple_salary(myrow):
    mycountry = myrow[1]
    mysalary = myrow[2]
    if mycountry == 'India':
        myrow[2] = mysalary * 3
    return myrow
print(mypd_dataframe.apply(india_triple_salary, axis='columns')) # to apply the function row-wise we have to use axis='columns'
Output:
--------------------------------------------------
3 Nilesh 43 UK 120000
--------------------------------------------------
3 Nilesh 43 UK 120000
mypd_dataframe = mypd.read_csv('query_eg.csv')
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.apply))
print(mypd_dataframe)
print('-'*50)
# Now, tripling the salary of country with 'India' using dataframe apply
method
def designation_func(myrow):
    mysalary = myrow[3]
    # the intermediate salary thresholds below are illustrative
    if mysalary < 25000:
        return 'Worker'
    elif mysalary < 35000:
        return 'Artisan'
    elif mysalary < 45000:
        return 'DGM'
    elif mysalary < 55000:
        return 'AGM'
    else:
        return 'GM'
mypd_dataframe['MyDesignation'] =
mypd_dataframe.apply(designation_func, axis='columns') # applying func
row wise
print(mypd_dataframe)
Output:
3 Nilesh 43 UK 40000
The top n rows with the largest values in a specified column are returned by the nlargest() method. The syntax of the nlargest() method is shown below:
nlargest(n, columns, keep='first')
The top n rows with the smallest values in a specified column are returned by the nsmallest() method. The syntax of the nsmallest() method is as follows:
nsmallest(n, columns, keep='first')
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.nlargest))
# print(help(mypd_dataframe.nsmallest))
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe.nlargest(n=4, columns=['mysalary']))
print('-'*50)
print(mypd_dataframe.nsmallest(n=4, columns=['mysalary']))
Output:
3 Nilesh 43 UK 40000
--------------------------------------------------
myname myage mycountry mysalary
3 Nilesh 43 UK 40000
--------------------------------------------------
3 Nilesh 43 UK 40000
Note: If more than one name contains the same value of salary, then use the
keep='all' parameter to display the top n largest or smallest values.
The where() method replaces every value for which the condition is False, that is, every value not meeting the specified condition, with another value whose default is NaN. As per the syntax, the other parameter can be a scalar, a Series/DataFrame, or a callable. The dataframe is returned with the original values where the condition is True and with NaN (or other) elsewhere. The syntax of the where() method is as follows:
where(cond, other=nan, *, inplace: 'bool_t' = False, axis: 'Axis | None' =
None, level: 'Level | None' = None)
mypd_dataframe = mypd.read_csv('query_eg.csv')
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.where))
print(mypd_dataframe)
print('-'*50)
print(mypd_dataframe.where(mycond))
Output:
3 Nilesh 43 UK 40000
--------------------------------------------------
myname myage mycountry mysalary
mypd_dataframe = mypd.read_csv('query_eg.csv')
mypd_dataframe.sort_index(inplace=True)
# print(help(mypd_dataframe.where))
print(mypd_dataframe)
print('-'*50)
mypd_dataframe['mysalary'] = mypd_dataframe['mysalary'].where(lambda x: x <= 45000, other='Upper limit')
print(mypd_dataframe)
Output (truncated; only the last row is reproduced):
3 Nilesh 43 UK 40000
--------------------------------------------------
3 Nilesh 43 UK 40000
A copy of a series or a dataframe object can be created using the copy() method:
myseries1 = myseries.copy()
mydf1 = mydf.copy()
The changes made in the original object are not reflected in the copy object, and vice versa.
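A minimal sketch of our own (assuming pandas is imported as mypd, as in the earlier examples) that confirms the two objects are independent:
import pandas as mypd
myseries = mypd.Series([10, 20, 30])
myseries1 = myseries.copy()   # deep copy by default
myseries1[0] = 999            # modify only the copy
print(myseries[0])            # 10 -> the original is unchanged
print(myseries1[0])           # 999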
We can handle text data in pandas too, for example removing spaces, converting text to the required case, and so on. Just like the string methods we learned in the basics of strings, the same methods can be used here: upper, title, lower, replace, lstrip, rstrip, strip, len, and so on. We use the str prefix (accessor) while calling all these methods, as shown in the following code Chap9_Example9.25.py:
import pandas as mypd
mypd_dataframe = mypd.read_csv('text_eg.csv')
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
mypd_dataframe['myname'] = mypd_dataframe['myname'].str.upper()
print(mypd_dataframe)
print('-'*50)
mypd_dataframe['myname'] = mypd_dataframe['myname'].str.strip().str.replace(' ', '_')
print(mypd_dataframe)
print('-'*50)
mypd_dataframe['mysalary'] = (mypd_dataframe['mysalary'].str.replace('Rs', '').astype(float)/1000).astype(str) + 'K'
print(mypd_dataframe)
Output: the dataframe is printed after each transformation (uppercased names, underscore-joined names, and salaries expressed in thousands with a 'K' suffix).
In order to filter dataframe rows using string methods, we can use the str.contains(), str.startswith(), and str.endswith() methods of the pandas str accessor, as shown in the following code:
mypd_dataframe = mypd.read_csv('df_string_eg.csv')
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
# the conditions below are illustrative; the original filter expressions are not shown
mycond = mypd_dataframe['myname'].str.contains('sh')
print(mypd_dataframe[mycond])
print('-'*50)
mycond = mypd_dataframe['myname'].str.startswith('N')
print(mypd_dataframe[mycond])
print('-'*50)
mycond = mypd_dataframe['myname'].str.endswith('a')
print(mypd_dataframe[mycond])
print('-'*50)
Output: the original dataframe is printed, followed by the rows matching each string condition.
mypd_dataframe = mypd.read_csv('df_string_eg.csv')
mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)
print('-'*50)
del mypd_dataframe['myname']
print(mypd_dataframe)
Output:
--------------------------------------------------
myfirstname mysecondname mysalary mystaffno
# shortcut
mypd_dataframe[['FirstName', 'LastName']] = mypd_dataframe['myname'].str.strip().str.split(expand=True)
print(mypd_dataframe)
By using indexing and slicing together with various other methods, we can modify the data in a pandas dataframe, as demonstrated in the following code Chap9_Example9.28.py (the dataframe construction and the intermediate assignments shown here are reconstructed to match the printed output):
import pandas as mypd
mypd_dataframe = mypd.DataFrame({
    'myname': ['Alex', 'John', 'Michael', 'Tom'],
    'myage': [35, 45, 55, 65],
    'mycountry': ['UK', 'USA', 'Australia', 'Finland']
})
print(mypd_dataframe)
print('-'*50)
# modifying a single cell using loc (John's country changes from USA to Russia)
mypd_dataframe.loc[1, 'mycountry'] = 'Russia'
print(mypd_dataframe)
print('-'*50)
# adding a new column, which is dropped again below (the values are illustrative)
mypd_dataframe['Hobby'] = ['Reading', 'Cricket', 'Music', 'Travel']
print(mypd_dataframe)
print('-'*50)
my_df_drop = mypd_dataframe.drop(columns=['Hobby'])
print(my_df_drop)
print('-'*50)
my_df_sort = mypd_dataframe.sort_values(by='myname', ascending=False)
print(my_df_sort)
Output:
myname myage mycountry
0 Alex 35 UK
1 John 45 USA
2 Michael 55 Australia
3 Tom 65 Finland
--------------------------------------------------
0 Alex 35 UK
1 John 45 Russia
2 Michael 55 Australia
3 Tom 65 Finland
--------------------------------------------------
--------------------------------------------------
0 Alex 35 UK
1 John 45 Russia
2 Michael 55 Australia
3 Tom 65 Finland
--------------------------------------------------
mydict = {
    'myname': ['Alex', 'John', 'Michael', 'Tom'],
    'myage': ['35', '45', '55', '65'],   # stored as strings so that to_numeric() below is meaningful
    'mycountry': ['UK', 'USA', 'Australia', 'Finland']
}
mypd_dataframe = mypd.DataFrame(mydict)
mypd_dataframe['myage'] = mypd.to_numeric(mypd_dataframe['myage'],
errors='coerce')
print(mypd_dataframe)
print('-' * 50)
my_mean_age = mypd_dataframe['myage'].mean()
print(my_mean_age)
print('-' * 50)
my_num_row_count = len(mypd_dataframe.index)
print(my_num_row_count)
print('-' * 50)
# Grouping the DataFrame by country and getting the mean age for each group
mygroup = mypd_dataframe.groupby(['mycountry'])['myage'].mean()
print(mygroup)
Output:
    myname myage  mycountry
0     Alex    35         UK
1     John    45        USA
2  Michael    55  Australia
3      Tom    65    Finland
--------------------------------------------------
50.0
--------------------------------------------------
4
--------------------------------------------------
mycountry
Australia 55.0
Finland 65.0
UK 35.0
USA 45.0
Conclusion
In this chapter, we explored various essential operations and techniques for
effectively manipulating pandas dataframes. We explored methods for
renaming index labels and column names, ensuring clarity and consistency
in data representation. Efficient ways to delete rows and columns from
dataframes were discussed, streamlining data management tasks. The
introduction of the powerful query method for data filtering enhanced data
exploration capabilities, while advanced techniques like using apply for
data manipulation and nlargest and nsmallest for identifying extreme values
provided deeper insights into dataframe operations. The chapter also
addressed text data handling within pandas, showcasing how to filter
dataframe rows with string methods. Through practical examples of data
modification, readers gained hands-on experience in real-world data
manipulation, with detailed explanations and outputs. As a result, readers
now possess a solid foundation in working with pandas dataframe for data
analysis and manipulation in Python.
In the next chapter, we shall explore polars and how it offers an alternative to pandas for data manipulation. We will discuss its features, performance benefits, and practical applications, in addition to its high-performance data structures, including series and dataframes, which offer a seamless experience for data manipulation and analysis.
Points to remember
To set the index of the pandas dataframe we can use either the
index_col parameter when reading data from a file or the set_index()
method to set a specific column as the index.
Use df.loc[label] to select rows by index label, inclusive of endpoints.
Access rows using iloc[], exclusive of the end position.
Use the rename() method to rename index labels or column names.
The drop() method in pandas will be used to delete rows and columns
from a dataframe object.
Utilize the query() method with direct condition expressions.
Employ in and not in operators within the query() method.
If a function is to be applied to every row, use the DataFrame class apply() method. If the function is to be applied to every value of a particular column, use the Series class apply() method.
The nlargest and nsmallest methods will be used to retrieve the rows
with the largest and smallest values, respectively, based on a specified
column or columns in the context of the pandas dataframe object.
For conditional filtering of dataframe elements, use the where()
method.
The text data can be handled in pandas using string methods.
Questions
In the previous chapter, we discussed the pandas library for data analysis and manipulation. However, pandas can be slow for large datasets and complex operations because it is written in Python, a dynamically typed language. Pandas can also run out of memory when working with large datasets because it stores dataframes entirely in memory. In addition, although some may argue that pandas is simple for beginners to learn because of its high-level functionality and its ability to perform complex data manipulations in relatively few lines of code, many users find it challenging. The complexity of its Application Programming Interface (API), with a vast array of functions and methods, can be overwhelming. Writing efficient code that handles large datasets without performance issues often requires a deeper understanding, making pandas less straightforward for many beginners.
On the other hand, polars is a newer DataFrame library that addresses some
of the limitations of Pandas. It is written in Rust, a compiled language that
is known for its speed and efficiency. It also uses a more efficient memory
layout for dataframes, which allows it to handle larger datasets than pandas.
Additionally, polars has a syntax that is similar to pandas, making it easy to
learn for users who are already familiar with pandas. Polars utilizes all the available cores on the machine and can handle datasets much larger than the available RAM. It has a consistent and predictable API, and in polars the data types should be known before running a query.
We can install polars by using the pip command, that is, pip install polars or
conda install polars.
In every code snippet, we will use mypl as the alias while importing polars, that is, import polars as mypl.
Structure
In this chapter, we will discuss the following topics:
Objectives
Some specific reasons why you might want to choose polars over pandas are mentioned below:
Community: pandas has a larger community and more documentation; the documentation for polars is comprehensive as well, but its community is smaller, so there may be fewer questions and answers on Stack Overflow and other sites.
Features: pandas has more built-in functions, whereas polars has fewer built-in functions.
Core computation: pandas works on a single core, whereas polars uses multiple cores.
For more details, refer to the polars user guide:
https://pola-rs.github.io/polars/user-guide
The data types in polars are based on Apache Arrow and are backed by Arrow memory arrays.
Numeric group
The numeric group consists of signed and unsigned integers and floating-point values. Refer to the following table for a better understanding:
S No.  Numeric type  Details                   Range
1      Int8          8-bit signed integer      -2^7 to 2^7 - 1
2      Int16         16-bit signed integer     -2^15 to 2^15 - 1
3      Int32         32-bit signed integer     -2^31 to 2^31 - 1
4      Int64         64-bit signed integer     -2^63 to 2^63 - 1
5      UInt8         8-bit unsigned integer    0 to 2^8 - 1
6      UInt16        16-bit unsigned integer   0 to 2^16 - 1
7      UInt32        32-bit unsigned integer   0 to 2^32 - 1
8      UInt64        64-bit unsigned integer   0 to 2^64 - 1
A 32-bit floating point (Float32), also known as single precision, has a 1-bit sign, an 8-bit exponent, and a 23-bit significand (also known as the mantissa). It provides about seven to eight decimal digits of precision and is suitable for a wide range of applications, including real-time graphics and many scientific computations.
A 64-bit floating point (Float64), also known as double precision, has a 1-bit sign, an 11-bit exponent, and a 52-bit significand (mantissa). It provides about 15 to 17 decimal digits of precision and is suitable for a wide range of scientific and engineering applications where high precision is required.
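The precision difference can be seen by casting the same value to both types; the following is a minimal sketch of our own (the value is chosen only for illustration):
import polars as mypl
mydf = mypl.DataFrame({"value": [0.123456789123456789]})
print(mydf.with_columns(
    mypl.col("value").cast(mypl.Float32).alias("as_float32"),   # roughly 7-8 significant digits survive
    mypl.col("value").cast(mypl.Float64).alias("as_float64"),   # roughly 15-17 significant digits survive
))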
Nested group
The nested group consists of nested values, such as the List and Struct data types, which can hold multiple values inside a single column.
Temporal group
The temporal group contains date- and time-related types. The Date type, for example, is represented internally as the number of days elapsed since the UNIX epoch, stored in a 32-bit signed integer.
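As a minimal sketch of our own, the day-count representation of a Date can be inspected through its physical representation:
from datetime import date
import polars as mypl
mydates = mypl.Series("mydate", [date(1970, 1, 1), date(1970, 1, 11)])
print(mydates.dtype)           # Date
print(mydates.to_physical())   # 0 and 10: days elapsed since the UNIX epoch (1970-01-01)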
Others group
The others group in polars data types encompasses Boolean, binary, UTF-8 (string), categorical, and object data, offering a comprehensive framework for handling diverse data formats and facilitating efficient analysis across various data types.
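As a quick illustration (a sketch of our own; the column names are not from the original listing), some of these types can be created as follows:
import polars as mypl
mydf_other = mypl.DataFrame({
    "mybool_col": [True, False, True],            # Boolean
    "mystring_col": ["alpha", "beta", "gamma"],   # UTF-8 string
}).with_columns(
    mypl.col("mystring_col").cast(mypl.Categorical).alias("mycat_col")  # Categorical
)
print(mydf_other.dtypes)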
import polars as mypl

# the column values below are illustrative; only the dtype of each dataframe matters
mydf_int8 = mypl.DataFrame({"int8_column": mypl.Series([11, 14, 17], dtype=mypl.Int8)})
mydf_int16 = mypl.DataFrame({"int16_column": mypl.Series([11, 14, 17], dtype=mypl.Int16)})
mydf_int32 = mypl.DataFrame({"int32_column": mypl.Series([11, 14, 17], dtype=mypl.Int32)})
mydf_int64 = mypl.DataFrame({"int64_column": mypl.Series([11, 14, 17], dtype=mypl.Int64)})
mydf_uint8 = mypl.DataFrame({"uint8_column": mypl.Series([11, 14, 17], dtype=mypl.UInt8)})
mydf_uint16 = mypl.DataFrame({"uint16_column": mypl.Series([11, 14, 17], dtype=mypl.UInt16)})
mydf_uint32 = mypl.DataFrame({"uint32_column": mypl.Series([11, 14, 17], dtype=mypl.UInt32)})
mydf_uint64 = mypl.DataFrame({"uint64_column": mypl.Series([11, 14, 17], dtype=mypl.UInt64)})
mydf_f32 = mypl.DataFrame({"float32_column": mypl.Series([11.23, 14.56, 17.89], dtype=mypl.Float32)})
mydf_f64 = mypl.DataFrame({"float64_column": mypl.Series([11.23, 14.56, 17.89], dtype=mypl.Float64)})
print("8-bit integers:")
print(mydf_int8)
print('-'*50)
print("\n16-bit integers:")
print(mydf_int16)
print('-'*50)
print("\n32-bit integers:")
print(mydf_int32)
print('-'*50)
print("\n64-bit integers:")
print(mydf_int64)
print('-'*50)
print("8-bit unsigned integers:")
print(mydf_uint8)
print('-'*50)
print(mydf_uint16)
print('-'*50)
print(mydf_uint32)
print('-'*50)
print(mydf_uint64)
print('-'*50)
print(mydf_f32)
print('-'*50)
print(mydf_f64)
Output (truncated; only the last dataframe, with the Float64 column, is reproduced):
shape: (3, 1)
┌────────────────┐
│ float64_column │
│ --- │
│ f64 │
╞════════════════╡
│ 11.23 │
│ 14.56 │
│ 17.89 │
└────────────────┘
Series
A 1-dimensional data structure in which, within the series, all the elements
have the same data type. We can think of it as a labeled array. Now, let us
view the code as demonstrated in program name: Chap10_Example10.2.py
for series creation using a list of integers in polars:
import polars as mypl
myseries = mypl.Series("myseries", [11, 12, 13, 14, 15])   # values match the output shown below
print(myseries)
print(myseries.sum())
Output:
shape: (5,)
11
12
13
14
15
65
Dataframe
A dataframe is a two-dimensional data structure backed by one or more series. The following code creates a small dataframe and inspects it; the column values here are illustrative, while the original listing builds a dataframe of shape (9, 5):
import polars as mypl
from datetime import date

mydf = mypl.DataFrame(
    {
        "mydate": [date(2023, 11, d) for d in range(1, 10)],
        "myvalue": [float(d * 10) for d in range(1, 10)],
    }
)
print(mydf)
# head: returns the first 5 rows of a dataframe by default; we can specify the number of rows by passing a value other than the default of 5
print(mydf.head(2))
# tail: returns the last 5 rows of a dataframe by default; we can specify the number of rows by passing a value other than the default of 5
print(mydf.tail(2))
# sample: returns 'n' random rows from the dataframe
print(mydf.sample(2))
print(mydf.describe())
Output: a dataframe of shape (9, 5) is printed in the original listing, followed by the head(2), tail(2), sample(2), and describe() summaries.
Contexts in polars allow users to load, transform, and analyze data by using various operations and functions. They provide a structured and cohesive way to work with data and support different tasks such as select, filter, and aggregate, which we shall discuss one by one. By creating a context in polars, we can create and manipulate dataframes, apply transformations, perform calculations, and execute queries on the data. A context acts as a container for data and provides a set of tools and methods to work with it effectively. The three main contexts are selection, filtering, and groupby/aggregation, which we will now discuss with an example using the famous iris.csv dataset.
Note: The iris.csv dataset is widely recognized and utilized in the domains
of data science and machine learning. It encompasses measurements of
diverse attributes associated with iris flowers, including sepal length, sepal
width, petal length, and petal width. These attributes are accompanied by
the respective species of each iris flower. This dataset finds extensive
application in tasks such as classification, clustering, and visualization
within the realm of data science. It serves as an ideal starting point for
beginners to practice and grasp various machine-learning algorithms and
techniques.
Typically, the iris.csv dataset comprises 150 rows, with each row
representing an individual iris flower and its corresponding attribute
measurements. Machine learning models are often trained and assessed by
using this dataset to predict the species of iris flowers based on their
attribute measurements. Due to its simplicity and versatility, the iris.csv
dataset is well-known and easily accessible. It is frequently employed as a
getting-started dataset for DS beginners.
Observe the step-by-step code written along with the operations being
performed as shown in the following example program name:
Chap10_Example10.4.py:
import polars as mypl
import matplotlib.pyplot as myplt

mydf = mypl.read_csv("iris.csv")
print('-'*50)
print(mydf.head().to_pandas())
print('-'*50)
print("This code will return the datatypes of the columns of the dataframe")
print(mydf.dtypes)
print('-'*50)
print(mydf[0, :])
print('-'*50)
print(mydf[:, ['sepal.length', 'petal.length']].head(2))
print('-'*50)
print(mydf.filter(mypl.col('petal.length') > 2.1).head(3))
print('-'*50)
print(mydf.sample(5).to_pandas())
print('-'*50)
print(mydf.select(['sepal.width', 'petal.width', 'variety']).head(4))
print('-'*50)
# Note: if an alias is not given, then the change will happen in petal.length itself
print(mydf.with_columns([
    (mypl.col('petal.length')*10).alias('Mux10_Petal.Length')
]).head(3))
print('-'*50)
# The is_in() function is called on the 'variety' column. It checks whether the values
# in the column are present in a given list; here, whether the 'variety' value is 'Setosa'.
print(mydf.filter(mypl.col('variety').is_in(['Setosa'])).tail())
print('-'*50)
print(mydf.select([
    mypl.col('sepal.length').n_unique()
]))
print('-'*50)
print(mydf.select([
    mypl.col('sepal.length').min().alias('min_sepal_length'),
    mypl.col('sepal.length').mean().alias('mean_sepal_length'),
    mypl.col('sepal.length').median().alias('median_sepal_length'),
    mypl.col('sepal.length').max().alias('max_sepal_length'),
    mypl.col('sepal.length').std().alias('std_sepal_length')
]))
print('-'*50)
print(mydf.select([mypl.col('sepal.length')]).describe())
print('-'*50)
mysepal_length = mydf.select([
    mypl.col('sepal.length')
])  # data processing
# Take a sample of 100 rows from mysepal_length and plot a histogram with 10 bins
# (the plotting lines are reconstructed; the original listing produces the figure shown below)
myplt.hist(mysepal_length.sample(100).to_series().to_numpy(), bins=10)
myplt.show()
Output:
Figure 10.3 (f): Image depicting histogram view with 10 bins for a sample
of 100 rows from mysepal_length dataset
We have also provided the same code in a Jupyter notebook under the filename context_demo.ipynb.
One major observation from the preceding example is that with_columns retains the original columns and adds new ones, whereas select keeps only the columns that are listed and drops the rest. The filter context filters the existing dataframe based on an arbitrary expression that evaluates to a Boolean data type.
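To make the distinction concrete, here is a minimal sketch of our own on the same iris.csv data; with_columns keeps the existing columns, select keeps only what is listed, and filter keeps only matching rows:
import polars as mypl
mydf = mypl.read_csv("iris.csv")
# with_columns: the original columns are retained and a new column is added
print(mydf.with_columns((mypl.col("petal.length") * 10).alias("petal_x10")).head(2))
# select: only the listed expressions survive
print(mydf.select((mypl.col("petal.length") * 10).alias("petal_x10")).head(2))
# filter: rows are kept only where the Boolean expression is True
print(mydf.filter(mypl.col("petal.length") > 2.1).head(2))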
GroupBy concept
The groupby context groups rows that share a value in one or more columns so that aggregations can be computed per group, as shown in the following code Chap10_Example10.5.py:
import polars as mypl
print("Reading the iris.csv dataset and storing the contents in the variable mydf")
mydf = mypl.read_csv("iris.csv")
print('-'*50)
print(mydf.groupby('variety').agg(mypl.count()))
'''
1. The code starts by calling the 'groupby' function on the dataframe 'mydf'
and passing the column name 'variety' as an argument.
2. This groups the dataframe based on the unique values in the 'variety'
column.
5. The 'agg' function applies this counting operation to each group and
returns the result.
6. The resulting dataframe will have the count of non-null values for each
column, grouped by the 'variety' column.
'''
print(mydf.groupby('variety').agg(mypl.count()).sort('count').with_columns([
    (mypl.col('count')/mypl.col('count').sum()).alias('mypercentage')
]))
Polars expressions are crucial because they let the user manipulate and compute the data in a variety of ways. Within the polars framework, they offer a versatile and potent approach to modify, filter, aggregate, and transform the data.
import polars as mypl
mydf = mypl.read_csv("iris.csv")
selected_data = mydf.select([
    mypl.col("sepal.length"),
    mypl.col("petal.length"),
    mypl.col("variety")
])
print(selected_data)
The lazy API in Polars refers to the lazy evaluation mechanism provided by
the library. It allows you to build a computation plan without executing it
immediately. Instead, the operations are executed only when the result is
explicitly requested or when an action is performed. The lazy API is useful
for several reasons:
Optimization: The lazy evaluation allows polars to optimize the
execution plan by rearranging and optimizing the operations. This can
lead to more efficient computation and improved performance.
Memory efficiency: With lazy evaluation, intermediate results are not
materialized unless explicitly requested. This reduces memory usage,
especially when dealing with large datasets or complex computations.
Flexibility: Building complex computing pipelines is made possible by
the lazy APIʼs flexibility. A series of operations and transformations
can be defined without having to carry them out immediately. This
makes it simple for the user to adjust or expand the computation
strategy as necessary.
Regarding its fundamental computational model and its method of data processing, the polars lazy API differs from pandas in several key ways, as explained in the following points:
Evaluation model: In pandas, operations are evaluated eagerly, which means that results are computed and materialized in memory immediately. Polars, on the other hand, uses lazy evaluation, in which operations are built into a computational plan rather than being carried out immediately. The plan is executed only when a result is specifically requested or when an action is taken.
Optimization: In order to optimize the execution plan, Polars makes
use of lazy evaluation. Based on the available optimizations, such as
predicate pushdown, and projection pushdown, it rearranges and
optimizes the processes. When compared to pandas, this may lead to
more effective computation and improved performance.
Lazy evaluation: It prevents intermediate results from materializing unless specifically requested. This uses less memory, particularly when working with large datasets or difficult calculations. Pandas, in contrast, eagerly materializes intermediate results, which may result in higher memory usage.
Parallelism: Polarsʼ lazy evaluation methodology facilitates parallelism in the execution of operations. Making use of modern multi-core CPUs, it can parallelize computations automatically across multiple threads or cores. When compared to pandas, this may result in faster processing of data, especially for computationally demanding tasks.
We have also provided the same code in a Jupyter notebook under the filename context_demo2.ipynb.
import polars as mypl
print("Reading the iris.csv dataset and storing the contents in the variable mydf")
mydf = mypl.read_csv("iris.csv")
mylazy_plan = mydf.lazy().select([
    mypl.col("sepal.length"),
    mypl.col("petal.length"),
    mypl.col("variety")
])
myresult = mylazy_plan.collect()
print(myresult)
In this example, we read the iris.csv dataset into a dataframe called mydf .
Then, we define a lazy computation plan by using the .lazy() method on the
dataframe. Operations such as select, filter, and sort can then be chained to form the computation plan. However, the plan is not executed immediately.
To execute the plan and obtain the result, we call .collect() on the lazy plan.
This triggers the execution of the computation plan, and the result is stored
in the myresult variable. Finally, we display the result. By using the lazy
API, we can define a computation plan without immediately executing it.
This allows polars to optimize the plan and execute it efficiently when
needed.
Note: This is true, but streaming also means that instead of processing the data all at once, polars can execute the query in batches, allowing you to process datasets that are larger than memory. Streaming is still in development, so it is not a stable feature and is not available for all functions/expressions/methods in the polars library. Streaming can be requested by passing streaming=True to collect():
myresult = mylazy_plan.collect(streaming=True)
Conclusion
In this chapter, readers embarked on a captivating exploration of Polars, a
versatile Python library for data manipulation, unveiling its vast potential
and functionalities. From the outset, we set out to distinguish between
pandas and polars, highlighting the unique strengths and capabilities that
polars offer. We then discussed polars data types, breaking them down into
numeric, nested, temporal, and other groups, providing a solid foundation
for understanding the underlying data structures. As the chapter progressed,
readers gained insights into polarsʼ core data structures, namely series and
dataframes. Armed with this knowledge, they acquired the skills necessary
to work with data effectively, setting the stage for more advanced
operations. We introduced the concept of contexts in polars, focusing on
essential operations, selection, and filtering techniques, ensuring readers
were well-prepared to manipulate data powerfully and efficiently. The
unveiling of the groupby concept further enhanced readersʼ data-handling
capabilities, allowing them to harness polars for data aggregation and
summarization, a crucial aspect of data analysis. Expressions emerged as a
key feature, enabling readers to grasp their significance in facilitating
complex data transformations. Finally, in the end, the chapter meticulously
demystified the Lazy API in polars, showing how it can significantly
improve performance and resource utilization in data operations.
In the next chapter, we will explore the concept of expressions in polars and the integration of numpy with polars. We will also present a comprehensive comparison of operations between pandas and polars to see which one is faster and better to use.
Points to remember
Questions
Structure
Objectives
By the end of this chapter, the reader will learn the subtleties of
expressions in polars along with basic operators, column selections,
functions, casting, string operations, aggregation, handling missing data,
and leveraging folds, lists, and arrays. We will also introduce the
integration of numpy in polars, offering a bridge between these two
powerful libraries. Finally, readers will gain a deep understanding of when
different operations will be performed with polars and pandas, thus helping
to make informed choices when working with data in Python. By the end of
this chapter, readers will have acquired a comprehensive understanding of
polars, their data manipulation capabilities, and their unique features,
positioning them to work with data confidently and efficiently in a Python
environment.
In polars, expressions are like instructions that tell polars what to do with the data. They make it easy to perform many different operations on the data, like adding numbers together or changing how it is organized. So, instead of writing long, complicated code, you can use expressions to get things done quickly.
Basic operators
import polars as mypl
print("Reading the iris.csv dataset and storing the contents in the variable mydf")
mydf = mypl.read_csv("iris.csv")
# numerical
mydf_numerical = mydf.select(
    mypl.col("sepal.length"),
    mypl.col("petal.length"),
    (mypl.col("sepal.length") * mypl.col("petal.length")).alias("sepal.length * petal.length"),
    (mypl.col("sepal.length") / mypl.col("petal.length")).alias("sepal.length / petal.length"),
)
print(mydf_numerical.head())
print('-'*50)
# logical
mydf_logical = mydf.select(
    mypl.col("sepal.width"),
    mypl.col("petal.width"),
    (mypl.col("sepal.width") > 3.2).alias("sepal.width_Greater"),
    (mypl.col("petal.width") == 0.2).alias("petal.width_check"),
)
print(mydf_logical.head())
Output:
In the code, the logical operators used are greater than (>), and equal to
(==). These operators are applied to the sepal.width and petal.width
columns. The result is new columns that represent the outcome of the
logical operation. For example, (mypl.col("sepal.width") > 3.2) creates a
new column named sepal.width_Greater, which contains boolean values
indicating whether each corresponding value in the sepal.width column is
greater than 3.2. The resulting dataframes, mydf_numerical and
mydf_logical, contain the original columns along with the computed
columns resulting from the numerical and logical operations, respectively.
The head() method is used to display the first few rows of each dataframe,
providing a glimpse of the computed values. Overall, the code
demonstrates how basic operators can be used in conjunction with
expressions in Polars to perform numerical and logical computations on the
data, resulting in new columns that capture the modified or evaluated
values.
Column selections
import polars as mypl
import polars.selectors as mysel   # the selectors module is assumed for the float() selector used below

mydf = mypl.read_csv("iris.csv")
# M-1
print(mydf.select(mypl.col('*')).head())
# M-2
print('-'*50)
print(mydf.select(mypl.col('*').exclude('sepal.length', 'petal.length')).head())
print('-'*50)
# M-3: selecting columns by data type using selectors
print(mydf.select(mysel.float()))
print('-'*50)
Output:
Functions
Polars expressions have a large number of built-in functions that allow the
user to create complex queries without the need for user-defined functions.
There are too many to go through here, but we will cover some of the more
popular use cases. Refer to the following example Chap11_Example11.3.py
for understanding some built-in functions:
import polars as mypl

mydf = mypl.read_csv("iris.csv")
print('Case1: Usage of try except block if trying to use the same column multiple times in the expression')
try:
    print(mydf.select(mypl.col('sepal.length') + 2, mypl.col('sepal.length') - 2))
except Exception as mye:
    print("Exception...")
    print(mye)
print('-'*50)
# The map_alias method modifies the column names of the dataframe mydf during the select operation.
print(mydf.select(mypl.col('*').exclude('variety', 'sepal.width', 'petal.width')
      .map_alias(lambda c: c.rstrip("length").replace(".", "").upper())))
'''
Also, the .lit function is used to create a literal value which will be used as
the result of the conditional expression within the when and otherwise clauses.
'''
mydf_conditional = mydf.select(
    mypl.col("sepal.length"),
    mypl.when(mypl.col("sepal.length") > 5.0)   # illustrative threshold; the original condition is not shown
    .then(mypl.lit(True))
    .otherwise(mypl.lit(False))
    .alias("Condition_Check_Sepal_Length"),
)
print(mydf_conditional.head())
Output:
Casting
import polars as mypl

mydf = mypl.DataFrame({
    "myint_col": [10, 0, 12, -11, 14],
    "mybigint_col": [10010000, 5, 30030000, 40040000, 11],
    "myfloat_col": [12.0, 15.0, 18.0, 21.0, 24.0],
    "myfloat_decimal": [12.11, 15.22, 18.33, 21.44, 24.55],
    "myfloat_string": ["12.11", "15.22", "18.33", "21.44", "24.55"],
    "mydate_string": [
        "2023-11-01",
        "2023-11-02",
        "2023-11-03",
        "2023-11-04",
        "2023-11-05",
    ],
})
print(mydf)
print('-'*50)
print(mydf.select(
    mypl.col("myint_col").cast(mypl.Float32).alias("myint_as_floats"),
    mypl.col("myfloat_decimal").cast(mypl.Int32).alias("myfloat_decimal_as_integers"),
))
print('-'*50)
try:
    print(mydf.select(mypl.col("mybigint_col").cast(mypl.Int16)))
except Exception as mye:
    print(mye)
print('-'*50)
print(mydf.select(mypl.col("mybigint_col").cast(mypl.Int16, strict=False)))
# The overflowing values will be assigned to null
print('-'*50)
print(mydf.select(
mypl.col("myfloat_string").cast(mypl.Float64),
mypl.col("myfloat_col").cast(mypl.Utf8)
))
print('-'*50)
print(mydf.select(
mypl.col("myint_col").cast(mypl.Boolean)
))
print('-'*50)
print(mydf.select(
mypl.col("mydate_string").str.to_datetime("%Y-%m-%d")
))
Output:
import polars as mypl

# illustrative data; the original dataframe definition is not shown
mydf = mypl.DataFrame({"Random_Data": ["Mango Man", "Banana", "Bana$", "Fear of Fear", "42 Apples"]})
print(mydf.select(
    mypl.col("Random_Data"),
    mypl.col("Random_Data").str.contains("Man|na").alias("Myregex"),
    mypl.col("Random_Data").str.contains("Bana$", literal=True).alias("Myliteral"),
    mypl.col("Random_Data").str.starts_with("Bana").alias("Mystarts_with"),
    mypl.col("Random_Data").str.ends_with("Fear").alias("Myends_with")))
print('-'*50)
print(mydf.select(
    mypl.col("Random_Data").str.extract_all(r"([^\d\s]+)").alias("extracted_chars"),
))
Output:
Aggregation
import polars as mypl

dtypes = {
    "first_name": mypl.Categorical,
    "gender": mypl.Categorical,
    "type": mypl.Categorical,
    "district": mypl.Categorical
}
mydataset = mypl.read_csv("election_data.csv", dtypes=dtypes).with_columns(
    mypl.col("birthday").str.to_date(strict=False)
)
print(mydataset)
print('-'*50)
mydf = (
    mydataset.lazy()
    .groupby("last_name")
    .agg(
        mypl.count(),
        mypl.col("gender"),
        mypl.first("first_name"),
    )
    .sort("count", descending=True)
    .limit(5)
)
print(mydf.collect())
Output:
First, the code imports the polars library as mypl and defines the data
types for the columns in the dataset by using the dtypes dictionary.
Then, it reads a CSV file called election_data.csv by using the
read_csv method and assigns it to the variable mydataset. The dtypes
argument is used to specify the data types for each column in the
dataset.
The code converts the birthday column in mydataset from a string to a date format by using the str.to_date(strict=False) method inside with_columns, storing the result back in the birthday column of mydataset.
The mydataset is printed to the console.
A new dataframe called mydf is created by performing a series of
operations on mydataset. The operations include grouping the data by
the last_name column, aggregating the count of rows, selecting the
gender column, and selecting the first occurrence of the first_name
column.
The resulting mydf DataFrame is sorted in descending order based on
the count column and is limited to the top five rows.
The mydf dataframe is printed to the console by using the collect()
method.
In summary, the code reads a CSV file, performs operations on the dataset
by using the polars library, and outputs the results.
Missing data
import polars as mypl

mydf = mypl.DataFrame({"values": [11, None, 0, False, None, 12]})
print(mydf)
print('-'*50)
mydf_null_count = mydf.null_count()   # counts the null entries per column
print(mydf_null_count)
print('-'*50)
print(mydf.select(
    mypl.col("values").is_null()
))
print('-'*50)
print(mydf.select(
    mypl.col("values").fill_null(mypl.lit(3))
))
print('-'*50)
print(mydf.select(
    mypl.col("values").fill_null(strategy='backward')
))
print('-'*50)
print(mydf.select(
    mypl.col("values").fill_null(mypl.mean("values"))  # (11 + 0 + 0 (False) + 12) / 4 = 5.75
))
print('-'*50)
print(mydf.select(
    mypl.col("values").interpolate())
)
Output:
Folds
Polars offer expressions and methods such as sum, min, mean, and so on,
for performing horizontal aggregations. However, in cases where more
complex aggregations are required, the default methods provided by polars
may not be enough. This is where folds become useful. The fold expression
operates on columns, providing maximum speed and efficient utilization of
the data layout. It often executes in a vectorized manner, further enhancing
performance. Refer to the following example Chap11_Example11.8.py for
understanding the fold concept in polars:
import polars as mypl

mydf = mypl.DataFrame({
    "list1": [4, 5, 6],
    "list2": [10, 100, 1000],
})
'''
A fold starts from an accumulator value and combines it with each column in turn.
The fold expressions below are illustrative reconstructions matching the aliases in the original listing.
'''
print(mydf.select(
    mypl.fold(acc=mypl.lit(1), function=lambda acc, x: acc * x, exprs=mypl.col("*"))
    .alias("mymul")
))
print('-'*50)
print(mydf.filter(
    mypl.fold(
        acc=mypl.lit(True),
        function=lambda acc, x: acc & x,
        exprs=mypl.all() > 5,
    )
))
print('-'*50)
print(mydf.select(mypl.concat_str(["list1", "list2"])))
print('-'*50)
print(mydf.select(mypl.sum(["list1", "list2"])))
Output:
import polars as mypl

mydf = mypl.DataFrame({
    "runs": [
        "0 35 16 6 12 10",
        "57 15 7 8 10 24",
        "14 16 22 24 45 103",
    ],
})
print(mydf)
print('-'*50)
print('Case1: Creating a list column')
print(mydf.with_columns(mypl.col("runs").str.split(" ")))
print('-'*50)
print(mydf.with_columns(mypl.col("runs").str.split(" ")).with_columns(
    mypl.col("runs").list.head(2).alias("MyTop2"),
))
print('-'*50)
mydf = mydf.with_columns(
    mypl.col("runs")
    .str.split(" ")
    .list.eval(mypl.element().cast(mypl.Int8, strict=False))
    .list.sum()
    .alias("Total"),
)
print(mydf.with_columns(mypl.col("Total").rank(method="dense").alias("Rank")))
print("Case5: Array data type contains same number of elements per row")
# the data and schema below are illustrative; the original listing is not fully shown
print(mypl.DataFrame(
    {"myarray": [[11, 12], [13, 14], [15, 16]]},
    schema={"myarray": mypl.Array(mypl.Int64, 2)},
))
Output:
import numpy as mynp
import polars as mypl

# illustrative data; the original dataframe definition is not shown
mydf = mypl.DataFrame({"mynum1": [1.0, 2.0, 3.0], "mynum2": [0.5, 1.5, 2.5]})
# Element-wise functions
print(mydf.select(
    mynp.exp(mypl.col("mynum1")).alias("_exp1"),
    mynp.exp(mypl.col("mynum2")).alias("_exp2")
))
print('-'*50)
print(mydf.select(
    mynp.sin(mypl.col("mynum1")).alias("_sin1"),
    mynp.sin(mypl.col("mynum2")).alias("_sin2")
))
print("-"*50)
# Vectorized function
print(mydf.select(
    mynp.degrees(mypl.col("mynum1")).alias("_degrees1"),
    mynp.degrees(mypl.col("mynum2")).alias("_degrees2")
))
# We can even perform aggregate functions like sum, min, max, mean, median
Output:
By observing the output, it can be interpreted that pandas took more time to read the CSV file than polars, indicating that polars performs better than pandas here. The execution time may change on each run, but polars will take less time to execute.
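The comparison described above can be reproduced with a simple timing sketch of our own (this is not the book's listing; iris.csv is reused from the earlier examples):
import time
import pandas as mypd
import polars as mypl
mystart = time.perf_counter()
mypd.read_csv("iris.csv")
print("pandas read_csv took", time.perf_counter() - mystart, "seconds")
mystart = time.perf_counter()
mypl.read_csv("iris.csv")
print("polars read_csv took", time.perf_counter() - mystart, "seconds")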
Conclusion
In the upcoming chapter, we will explore Seaborn, an essential tool for data
visualization tasks. Its ability to generate informative plots with minimal
code addresses the need to effectively explore and communicate insights
from complex datasets. Seabornʼs higher-level interface simplifies the
creation of intricate statistical plots compared to Matplotlib. Offering a
variety of plot types like Kernel Density Estimate (KDE) plots, Violin
plots, Line plots, Scatter plots, Joint plots, and Facet Grids, Seaborn caters
to diverse aspects of data analysis.
Points to remember
Questions
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 12
Exploring Seaborn Library for
Data Science in Python
Introduction
In this chapter, we will learn how we can install seaborn using the pip command, that is, pip install seaborn. We will be importing and using seaborn in Python scripts using import seaborn as mysns. Before working with seaborn, the following libraries are to be installed:
Python version 3.6 or higher.
Numpy with version >= 1.13.3. We are using version 1.25.0.
Scipy with version >= 1.0.1. We are using version 1.11.1.
Matplotlib with version >= 2.1.2. We are using version 3.7.1.
Pandas with version >= 0.22.0. We are using version 2.1.0.
We are using version 0.12.2 of seaborn.
Note: Kindly note that this chapter contains various example codes that produce multiple color images via plots. Since this book is a paperback edition, readers should refer to the image bundle in parallel so that any image they wish to view after going through the code can easily be looked up. Also, several examples use different colors, combining primary and secondary colors, which are difficult to describe with legends on each graph. So, readers are expected to keep the image bundle open in parallel to view the output.
Structure
In this chapter, we will discuss the following topics:
Some basic statistical terms to know
Built in seaborn datasets
Plot Styling in seaborn
Concept of color palette in seaborn
Heatmap plot in seaborn
Box plot in seaborn
Objectives
By the end of this chapter, the reader will learn the concept of some basic
statistical terms, such as quantitative and qualitative variables, mean, and
their significance in data analysis. They will understand the distinction
between variables that represent numerical measurements (quantitative) and
those that represent categories or labels (qualitative). The reader will also be
familiar with various built-in datasets available in seaborn, gaining practical
exposure to real-world datasets for data visualization and analysis.
Furthermore, the reader will comprehend the principles of plot styling in
seaborn, learning how to customize the aesthetics of plots for better
presentation and interpretation. The concept of color palettes in seaborn,
including qualitative, sequential, and diverging palettes, will be covered,
enabling the reader to choose appropriate color schemes for different types
of data and visualizations. Finally, the reader will acquire the skills to create
various advanced plots in seaborn using box plots for depicting the
distribution of data.
Quantitative variables
These variables are numerical values that represent counts or measures, for example, weight, percentages, and so on. These variables are of two types, namely discrete and continuous variables. Discrete variables are always numeric and have a finite number of values between any two values. On the other hand, continuous variables are numeric variables that have an infinite number of values between any two values.
Qualitative variables
These variables contain a finite number of categories and according to some
characteristics, they are placed into distinct categories.
According to the number of variables, statistical data is classified as
univariate and bivariate. Univariate data consists of only one variable, and
we concentrate on only one variable at a time, whereas Bivariate data
involves two different variables that are being studied to explore the
relationship between them.
The default context in Seaborn is the notebook style, which can be modified
based on our needs. Seaborn provides four preset contexts that determine the
plot size and enables us to tailor the plot according to its presentation. These
preset contexts, arranged by relative size, include paper, notebook, talk, and
poster. To modify the size of plot components such as labels, ticks, markers,
line widths, and so on, we can employ the functions plotting_context() and
set_context(). Both functions accept identical arguments.
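For example, a minimal sketch of our own (reusing the planets dataset that appears in the next listing) that changes the context before plotting:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns_dset = mysns.load_dataset('planets')
mysns.set_context('poster', font_scale=0.8)   # one of: paper, notebook, talk, poster
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
myplt.title('poster context')
myplt.show()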
The style and context parameters can be set in one step using set() function,
refer to the following code Chap12_Example12.4.py for the same:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns_dset = mysns.load_dataset('planets')
mysns.set(style='white', context='talk', rc={'axes.edgecolor': 'g', 'axes.titlesize': 10.0})
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
myplt.title('talk context')
myplt.show()
Output:
Figure 12.3: Plot of Mass vs. Distance for planets Dataset with talk context
Without passing any arguments, the set() function can be called if you want to switch to the default seaborn settings (font='sans-serif', font_scale=1, context='notebook', style='darkgrid', palette='deep', color_codes=True).
A plot has four spines (borders on the plot sides), and the top and right spines are removed from the plot by default when using the despine() function, as shown in the following code Chap12_Example12.5.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns_dset = mysns.load_dataset('planets')
mysns.set(style='ticks')
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
mysns.despine()
myplt.title('Only Left and Bottom spine')
myplt.show()
Output:
Figure 12.4: Mass vs. Distance for planets dataset with despine() function
Now we will discuss the color palette types that exist for data visualization.
Qualitative palettes
We will use qualitative palettes to illustrate categorical data without a
particular order among categories. The desired number of colors can be
passed by passing a value to the n_colors parameter, as shown in the
following code Chap12_Example12.7.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
mycolor_palette = mysns.color_palette(n_colors=6)
print(mycolor_palette)
# This line will plot by default 6 colors.
mysns.palplot(mycolor_palette)
myplt.show()
Output:
[(0.12156862745098039, 0.4666666666666667,
0.7058823529411765), (1.0, 0.4980392156862745,
0.054901960784313725), (0.17254901960784313,
0.6274509803921569, 0.17254901960784313),
(0.8392156862745098, 0.15294117647058825,
0.1568627450980392), (0.5803921568627451,
0.403921568627451, 0.7411764705882353),
(0.5490196078431373, 0.33725490196078434,
0.29411764705882354)]
Refer to the following figure for a better understanding:
The given code imports the seaborn library as mysns and the
matplotlib.pyplot library as myplt. It then generates a color palette using
the color_palette function from Seaborn, specifying that it should contain
six colors. The color palette is then printed. The code proceeds to plot the
color palette using seabornʼs palplot function, which displays the colors in
the palette. Finally, the plot is shown using matplotlib.pyplot library's show
function.
In qualitative palettes, there are hls and husl color models, refer to the following code Chap12_Example12.8.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
# By default n_colors=6 in hls_palette()
# A palette is generated with colors evenly distributed in the HLS (Hue, Lightness, Saturation) hue space.
# Values for the parameters h, l and s should range between 0 and 1.
mysns.palplot(mysns.hls_palette(h=0.5, l=0.3, s=0.5))
myplt.show()
Output:
Sequential palettes
With these palettes, numeric data can be effectively visualized, showcasing a progression from lower to higher values or vice versa using colors that transition from light to dark. Here, the main focus is on data that has larger values. An additional character s is appended to the color name passed to the color parameter for plotting the sequential palette, refer to the following code Chap12_Example12.10.py for clarity:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("Greens"))
myplt.show()
Output:
Figure 12.9: Sequential Color Palette Transitioning from Light Green to Bright Green
By appending the suffix _r to the palette name, the order of colors in the
palette will be reversed, refer to the following code
Chap12_Example12.11.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("Blues_r"))
myplt.show()
Output:
Figure 12.10: Sequential Color Palette Transitioning from a brighter shade of blue to lighter blue
Diverging palettes
By utilizing this color palette, we can distinguish between two different
colors to emphasize both high and low extreme values. Data used with this
color scheme has a natural mid-point such as zero, refer to the following
code Chap12_Example12.12.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("BrBG", n_colors=9))
myplt.show()
Output:
Figure 12.11: Depiction of the diverging color palette highlighting the extreme values: high and low
Note: The preceding code is covered in (program name:
Chap12_Example12.12.py).
The provided Python code utilizes the seaborn and matplotlib libraries to
create a heatmap visualization of a diamond dataset. The dataset is loaded
using Seabornʼs load_dataset function, and the first ten rows are displayed
using head(10). The code then creates a pivot table (mydiamonds) using the
pandas library, where the cut column serves as the index, the color column
as columns, and the depth column as values. Finally, a heatmap is generated
using Seabornʼs heatmap function, displaying different colors to represent
numerical values. The title, ylabel, and xlabel are set using Matplotlib.
Dark colors are used for low depth values (below 61.5), while light colors
represent higher depth values (63.5, 64, and 64.5). The resulting heatmap
visually represents the depth of diamonds based on their cut and color.
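The listing being described is not reproduced above, so the following is a sketch of the described steps under the same names used in the surrounding examples:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
diamonds_dataset = mysns.load_dataset('diamonds')
print(diamonds_dataset.head(10))
# pivot table: cut as the index, color as the columns, depth as the values
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
mysns.heatmap(data=mydiamonds)
myplt.title('diamonds data')
myplt.ylabel('cut')
myplt.xlabel('color')
myplt.show()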
Let us now customize the appearance of the heatmap with the change of
default settings as explained in the following code
Chap12_Example12.16.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
diamonds_dataset = mysns.load_dataset('diamonds')
# using the pivot_table function for converting into a format useful for analysis
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
mysns.heatmap(data=mydiamonds, vmin=50, vmax=80, annot=True, fmt='0.2f', linewidth=0.4)
myplt.title('diamonds data')
myplt.ylabel('cut')
myplt.xlabel('color')
myplt.show()
Output:
This Python code will use the seaborn library to generate a heatmap
visualizing the mydiamonds pivot table, where color intensity represents the
depth values within the specified range (vmin=50, vmax=80). The heatmap
includes annotations displaying the exact values and gridlines with a
linewidth of 0.4. The title diamonds data is set for the plot, with the y-axis
labeled as cut and the x-axis as color. Finally, the plot is displayed.
Also, a numerical value can be provided to the center argument, which will be used as the center of the colormap when plotting the data. Here, we use the loc attribute on the dataframe mydiamonds to pick that value, as shown in the following code Chap12_Example12.17.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
diamonds_dataset = mysns.load_dataset('diamonds')
# Use the pivot_table function for converting into a format useful for analysis
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
# Define a custom color palette ranging from dark blue to light blue
colors = mysns.color_palette("Blues", as_cmap=True)
# Plot the heatmap with the custom color palette
mysns.heatmap(data=mydiamonds, center=mydiamonds.loc['Good', 'H'], cmap=colors, annot=True, fmt="0.2f")
myplt.title('Diamonds Data')
myplt.ylabel('Cut')
myplt.xlabel('Color')
myplt.show()
Output:
Figure 12.16: Depiction of the output of heatmap with the diverging color scheme
Figure 12.17: Depiction of the output of heatmap with a display of lower triangle
Note: The preceding code is covered in (program name:
Chap12_Example12.18.py).
Also, we can display only the cells above the diagonal in the heatmap by masking the lower triangle with the tril() method of the numpy library; the code Chap12_Example12.19.py for this is shown as follows:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
import numpy as mynp
diamonds_dataset = mysns.load_dataset('diamonds')
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
colors = mysns.color_palette("Blues", as_cmap=True)
mytrildata = mynp.tril(mydiamonds)
print(mytrildata)
# values will be plotted only for the cells where the mask is False, i.e. 0
mysns.heatmap(data=mydiamonds, cmap=colors, annot=True, fmt="0.2f", cbar_kws={'orientation': 'horizontal'}, mask=mytrildata)
myplt.title('Diamonds Data')
myplt.ylabel('Cut')
myplt.xlabel('Color')
myplt.show()
Output:
[[61.6783698   0.          0.          0.          0.          0.          0.        ]
 [61.16924517 61.17590929  0.          0.          0.          0.          0.        ]
 [61.75016523 61.73025    61.72162662  0.          0.          0.          0.        ]
 [62.36570997 62.20375134 62.20231023 62.52663605  0.          0.          0.        ]
 [64.04846626 63.31964286 63.50801282 64.33980892 64.58514851  0.          0.        ]]
Refer to the following figure for a better understanding:
Figure 12.18: Depiction of the output of the heatmap with a display of the upper triangle
As shown in Figure 12.22, we can infer that the amount spent by the people of different countries during the period from 1970 to 2020 ranges from 123.99 USD, indicated by the left-hand side whisker, to approximately 8400 USD, indicated by the right-hand side whisker. The median is somewhere around 2296 USD, and it is closer to the lower end of the box, or Q1, which
So, the median is pulled towards the left-hand side of the box. Also, the
median is far away from the upper end of the box or Q3 which means that
the data points on the right-hand side of the median are scattered. Also, the
whisker on the upper end of the box is longer. So, the data distribution is
skewed in the direction of longer whiskers. Here, the distribution is
positively skewed or right-side skewed. Also, we can notice that the dots
located outside the right whisker which are called outliers, are usually large
values in the data.
The orientation can be made vertical by passing orient='v' to the boxplot function. Also, if we want to know the total expenditure incurred for a specific country like the USA, we can follow the approach explained in the following code Chap12_Example12.22.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('healthexp')
myfiltered_df = my_dataset[my_dataset['Country'] == 'USA']
print(myfiltered_df.head())
mysns.boxplot(data=myfiltered_df, x='Spending_USD', orient="v")
myplt.title('Expenditure Spend on Health')
myplt.show()
Output:
Year Country Spending_USD Life_Expectancy
4 1970 USA 326.961 70.9
9 1971 USA 357.988 71.2
12 1972 USA 397.097 71.2
15 1973 USA 439.302 71.4
18 1974 USA 495.114 72.0
Refer to the following figure to understand the example:
Figure 12.23: Box plot example with orientation vertical and filtering for USA only
Note: The preceding code is covered in (program name:
Chap12_Example12.22.py).
Conclusion
In this chapter, we initially saw the fundamental building blocks of statistical
analysis, discussing the key terms that form the bedrock of data
interpretation. By the end of this comprehensive exploration, readers now
possess a solid understanding of basic statistical terms, notably quantitative
and qualitative variables. The distinction between variables representing
numerical measurements and those embodying categories or labels has been
elucidated, providing readers with foundational knowledge, crucial for
navigating the complexities of statistical datasets. Moving beyond
theoretical concepts, readers have been introduced to the practical side of
data exploration through various built-in datasets seamlessly integrated into
seaborn. This hands-on exposure equips them with the skills to navigate and
analyze real-world datasets, setting the stage for effective data visualization.
The journey through this chapter extends beyond mere data representation; it
ventures into the art of presentation with Seabornʼs plot styling principles.
Readers now understand how to customize the aesthetics of their
visualizations, enhancing both interpretability and visual appeal. The
concept of color palettes, encompassing qualitative, sequential, and
diverging palettes, has been unveiled, empowering readers to choose suitable
color schemes tailored to different types of data and visualization goals. We
also explored the plots using heat maps and boxes with various examples.
In the next chapter, we shall explore additional seaborn plotting techniques
like KDE plot, scatter plot, violin plot, line plot, joint plot, and facet grid for
analyzing data trends and distributions using various examples.
Points to remember
Quantitative variables are numerical and measurable, representing
quantities or amounts, while qualitative variables are categorical,
representing qualities or characteristics. Seaborn provides convenient
datasets for practice and exploration, accessible directly through its
library, eliminating the need for external data sources.
Plot styling in seaborn enhances visual appeal and offers customizable
parameters for plot aesthetics, ensuring tailored and impactful
visualizations.
Seaborn offers qualitative palettes designed for categorical data,
ensuring clear distinctions between different categories. Sequential
palettes smoothly transition colors for ordered data, while diverging
palettes emphasize extremes using contrasting colors.
Heatmap plot in seaborn displays data in a matrix format, with the
intensity of colors representing values. It is useful for identifying
patterns and correlations in large datasets.
Box plot in seaborn illustrates the distribution of data where key
components include the median, quartiles, and outliers. It is helpful for
identifying skewness and outliers in the data.
Questions
1. What is a seaborn library and what is it used for?
2. What is the definition of quantitative variables in statistics?
3. How do qualitative variables differ from quantitative variables?
4. What are some advantages of using built-in datasets in seaborn for
data visualization?
5. Can you provide examples of situations where using built-in datasets
is beneficial?
6. How does plot styling contribute to the overall aesthetics of data
visualizations?
7. What are some common styling parameters available in seaborn for
customizing plots?
8. Explain the significance of using color palettes in data visualization.
9. Differentiate between qualitative, sequential, and diverging color
palettes in seaborn.
10. In what scenarios is a heatmap plot useful in data analysis?
11. How does seaborn handle the representation of data in a heatmap?
12. What insights can be gained from a box plot?
13. How does seaborn generate a box plot, and what key components
does it illustrate?
Introduction
Building on the foundation of quantitative and qualitative variables,
statistical concepts and basic seaborn plots from the previous chapter, this
chapter dives deeper into Seaborn visualizations. In the current chapter, it is
crucial to grasp additional seaborn plotting techniques like KDE plot, violin
plot, line plot, scatter plot, joint plot, and facet grid, as these tools offer
diverse perspectives for analyzing data trends, relationships, and
distributions, thus enabling comprehensive exploration and interpretation of
datasets across various dimensions.
Structure
In this chapter, we will discuss the following topics:
KDE plot in seaborn
The violin plot in seaborn
The line plot in seaborn
The scatter plot in seaborn
The joint plot in seaborn
Facet grid in seaborn
Objectives
By the end of this chapter, the reader will be equipped with the skills to create various advanced plots in seaborn: Kernel Density Estimation (KDE) plots for exploring the distribution of a single variable, violin plots for combining aspects of box plots and KDE plots, line plots for depicting trends, scatter plots for visually depicting the relationship between two variables, joint plots for combining different types of plots in a single grid, and FacetGrid for creating grids of subplots based on categorical variables. Readers will also gain proficiency in utilizing seaborn to visualize data and extract meaningful insights from various types of datasets.
On the Y-axis, the density is plotted by setting the parameter vertical to True:
mysns.kdeplot(data=my_dataset, x="total_bill", shade=True, vertical=True)
Output:
Figure 13.5: Univariate kde plot with plotting on Y-axis and shade=True
So, the advantage of the KDE plot is that it provides a continuous and
smooth representation of the data distribution, allowing for a more detailed
understanding of the underlying patterns and shapes of the data. It can reveal
features such as multiple modes, skewness, and other characteristics that
may not be easily visible in a histogram. Up to this point, we have observed
univariate plots based on a single variable. Now, we shall see a bivariate plot
as demonstrated in the following code Chap13_Example13.3.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
mysns.kdeplot(data=my_dataset.loc[my_dataset.species == "versicolor"], x="sepal_length", y="petal_length")
myplt.title('Bivariate kde Plot')
myplt.show()
Output:
Here, we have passed the value Greens to the cmap parameter. A contour plot can use a colormap to color the different levels of the contour lines. Greens is a sequential colormap that progresses from low to high density by gradually increasing the darkness and color saturation, which makes it well suited to numeric data. In Figure 13.7, the higher density values are represented by the darker colors, and the lower density values are represented by the lighter contour lines.
We can also increase the number of levels in the contour plot, so that
important details may not be missed. Also, note that a greater number of
levels will make the plot look cluttered, and a color bar may be added for
annotating the color mapping in a bivariate plot. Just observe the following
line of code where we added n_levels=20:
mysns.kdeplot(data=my_dataset.loc[my_dataset.species=="versicolor"],
              x="sepal_length", y="petal_length",
              cmap='Greens', cbar=True, n_levels=20)
Output:
Figure 13.8: Bivariate kde plot with cmap=’Greens’, cbar=True and n_levels=20
Now, let us see the univariate plot using distplot() function. Refer to the
following syntax for a better understanding:
distplot(a=None, bins=None, hist=True, kde=True,
rug=False, fit=None, hist_kws=None, kde_kws=None,
rug_kws=None, fit_kws=None, color=None,
vertical=False, norm_hist=False, axlabel=None,
label=None, ax=None, x=None)
As per the documentation of Seaborn, this function has been deprecated and
will be removed in Seaborn v0.14.0. Observe the following code
Chap13_Example13.4.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
#print(help(mysns.distplot))
mysns.distplot(my_dataset['petal_length'], hist=False)
myplt.title('Univariate kde Plot using distplot()
function')
myplt.show()
Output:
There is a box plot of the data inside each violin plot. The first and third
quartiles represent the left and right sides of the inner box. The median of
the data is represented by the small white circle that we see inside the box.
The minimum and maximum values in the data are represented by the thin
black lines extending from the box. Wider sections of the violin represent
higher probability, where a large cluster of values lies, whereas narrower
sections represent lower probability, where only a small cluster of values
lies. So, compared with the box plot, the violin plot depicts the detailed
distribution of the data, including its peaks and their positions.
Refer to the following syntax:
violinplot(data=None, *, x=None, y=None, hue=None,
order=None, hue_order=None, orient=None,
color=None, palette=None, saturation=0.75,
fill=True, inner='box', split=False, width=0.8,
dodge='auto', gap=0, linewidth=None,
linecolor='auto', cut=2, gridsize=100,
bw_method='scott', bw_adjust=1,
density_norm='area', common_norm=False,
hue_norm=None, formatter=None, log_scale=None,
native_scale=False, legend='auto',
scale=<deprecated>, scale_hue=<deprecated>,
bw=<deprecated>, inner_kws=None, ax=None, **kwargs)
Observe the following code Chap13_Example13.5.py for better
understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('exercise')
#print(help(mysns.violinplot))
mysns.violinplot(my_dataset['pulse'])
myplt.title('Violin Plot')
myplt.show()
Output:
This Python code uses the seaborn library to create a violin plot based on the
exercise dataset. Seaborn is imported as mysns, and the dataset is loaded
using the load_dataset() function with the argument exercise. A
commented-out line shows how to print the help documentation for the
violinplot() function. Subsequently, a violin plot is generated using the pulse
column of the dataset, which likely represents pulse rates. The violinplot()
function is used to visualize the distribution of pulse values, displaying a
combination of a kernel density estimate and a box plot. Finally, the plot is
given a title, Violin Plot, and is displayed to the user.
The violin plot is a fusion of a box plot and a density plot. We can observe
that on either side of the box plot, density plots are symmetric as they both
display the distribution of the column pulse and the y-axis displays the range
of values in the column pulse.
Let us analyze this violin plot with more detailed data representation to the
user with the help of the code Chap13_Example13.6.py mentioned as
follows:
import seaborn as mysns
import matplotlib.pyplot as myplt
# Load the exercise dataset from seaborn
my_dataset = mysns.load_dataset('exercise')
# Create a violin plot
mysns.violinplot(my_dataset['pulse'])
# Calculate summary statistics
summary_stats = my_dataset['pulse'].describe()
q1 = summary_stats['25%']
q3 = summary_stats['75%']
myiqr = q3 - q1
minimum = summary_stats['min']
maximum = summary_stats['max']
median = summary_stats['50%']
print(summary_stats['mean'])
# Calculate range excluding outliers
mylower_bound = q1 - 1.5 * myiqr
myupper_bound = q3 + 1.5 * myiqr
range_without_outliers = myupper_bound - mylower_bound
print(f"q1 value is {q1}")
print(f"q3 value is {q3}")
print(f"iqr value is {myiqr}")
print(f"minimum value is {minimum}")
print(f"maximum value is {maximum}")
print(f"median value is {median}")
print(f"lower_bound value is {mylower_bound}")
print(f"upper_bound value is {myupper_bound}")
print(f"range_without_outliers value is {range_without_outliers}")
# Add text annotations for summary statistics
# print(help(myplt.text))
myplt.text(0.1, q1 - 3, f'Q1: {q1}', ha='center', va='bottom', color='red')
myplt.text(0.1, q3 + 3, f'Q3: {q3}', ha='center', va='top', color='red')
myplt.text(0.1, q1 + 0.1 * myiqr, f'IQR: {myiqr}', ha='center', va='bottom', color='red')
myplt.text(0.1, minimum - 5, f'Min: {minimum}', ha='center', va='bottom', color='red')
myplt.text(0.1, maximum + 5, f'Max: {maximum}', ha='center', va='top', color='red')
myplt.text(0.1, median + 2, f'Median: {median}', ha='center', va='bottom', color='red')
myplt.text(0.1, myupper_bound + 2, f'Range (no outliers): {range_without_outliers}', ha='center', va='bottom', color='red')
# Set plot title
myplt.title('Violin Plot with Summary Statistics')
# Add grid
myplt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
myplt.show()
Output:
q1 value is 90.25
q3 value is 103.0
iqr value is 12.75
minimum value is 80.0
maximum value is 150.0
median value is 96.0
lower_bound value is 71.125
upper_bound value is 122.125
range_without_outliers value is 51.0
Refer to the following figure for a better understanding:
From Figure 13.13, the violin plot in blue color represents the sex category
of Males and Females who are no longer alive. On the other hand, the violin
plot in orange color represents the sex category of the ones that are alive.
If we provide inner parameter value say points as shown in the following
line of code:
mysns.violinplot(data=my_dataset, x='age', y='sex', hue='alive', inner='points')
All the data points in the distribution will be displayed as we can observe
different values in the data points of the dataset as shown in the following
figure:
If we do not want points or box plots, then set the inner parameter as None.
Figure 13.15: Fig depicting line plot with the iris dataset
We will be explaining the line of code with the tag name #F1.
The code calls the lineplot function from the mysns module. The data
parameter is set to my_dataset, which is the dataset used for the plot. The x
parameter is set to sepal_length, indicating that the values from the
sepal_length column in the dataset will be plotted on the x-axis. The y
parameter is set to petal_length, indicating that the values from the
petal_length column in the dataset will be plotted on the y-axis. The hue
parameter is set to species, indicating that different species will be
differentiated by different colors on the plot. The code executes and
generates the line plot based on the specified parameters.
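Based on that description, the call tagged #F1 likely looks along these lines (a minimal sketch; the title text is our own choice):
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
# F1: line plot of petal_length against sepal_length, colored by species
mysns.lineplot(data=my_dataset, x="sepal_length", y="petal_length", hue="species")  #F1
myplt.title('Line Plot')
myplt.show()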
This Python code utilizes seaborn to generate a joint plot of the iris dataset,
specifically examining the correlation between the sepal length and sepal
width of iris flowers. The seaborn library is imported as mysns, and the iris
dataset is loaded into the variable my_dataset. The jointplot() function is
then employed to generate a joint plot, where the x-axis represents sepal
length, the y-axis represents sepal width, and points are colored based on the
species of the iris flowers. The hue parameter allows for the differentiation
of species with different colors in the plot. Subsequently, the title, x-axis
label, and y-axis label are set using matplotlib (myplt), indicating that the
plot illustrates the joint distribution of sepal length and width. Finally, the
show() function is called to display the resulting joint plot. This type of
visualization is beneficial for exploring the correlation, distribution, and
individual characteristics of two variables simultaneously, providing insights
into the relationship between sepal length and width in different species of
iris flowers.
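A minimal sketch of the joint plot described here, with assumed title and label text, is:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
# Joint plot of sepal_length vs sepal_width, with species mapped to color
mysns.jointplot(x="sepal_length", y="sepal_width", data=my_dataset, hue="species")
# Title and axis labels set via matplotlib (exact wording is assumed)
myplt.title('Joint Distribution of Sepal Length and Width')
myplt.xlabel('sepal_length')
myplt.ylabel('sepal_width')
myplt.show()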
In this scenario, the primary plot consists of a scatter plot, while density
plots are positioned along both margins, illustrating the data distribution
separately for the three levels of hue variables. We can view the density
curves for those three levels of hue parameters plotted in different colors.
Now, suppose we pass the kind parameter as reg and remove the hue
parameter in the above line of code since the use of hue with kind='reg' is
not currently supported. Also, we have removed the title() function here:
mysns.jointplot(x="sepal_length", y="sepal_width", data=my_dataset, kind='reg')
Output:
Figure 13.18: Joint plot with iris dataset and kind parameter
Here, a regression line has been drawn on the scatter plot. This is the line of
best fit, which gives a visual representation of the relationship between the
dependent variable and one independent variable. The line is drawn so that
it lies as close as possible to all the data points in the plot. Outliers are easy
to spot on a scatter plot because they lie farther from the regression line,
and we can observe that there are many such outliers here. Moreover, we
shall discuss this in our Machine learning book which will be launched
shortly.
Figure 13.19: Facet grid with penguins dataset with a histogram of “bill_length_mm”
Note: The preceding code is covered in (program name:
Chap13_Example13.11.py).
Here, the facets are plotted in the histograms from three subsets. In
summary, the code uses Seabornʼs FacetGrid to create a grid of histograms,
each representing the distribution of penguin bill lengths for a specific
island. This allows for a visual comparison of how the distribution of bill
lengths varies across different islands. The histplot() function is employed
to create the histograms, and the resulting grid is displayed using
matplotlib. This type of visualization is useful for understanding the
distribution of a numerical variable across different categories or conditions
in the dataset. We shall focus on the following three lines of code:
my_dataset = mysns.load_dataset('penguins'): It
loads the penguins dataset, a built-in dataset in Seaborn, into the
variable my_dataset. This dataset contains information about various
penguin species, including their bill measurements.
myfacet_obj = mysns.FacetGrid(data=my_dataset,
col="island"): It creates a FacetGrid object named myfacet_obj.
This object will be used to generate a grid of subplots, with each
subplot corresponding to a different value of the island column in the
dataset. The grid will be arranged in columns.
myfacet_obj.map(mysns.histplot,
'bill_length_mm'): It maps the histplot function from seaborn
onto the FacetGrid. This means that a histogram of the
bill_length_mm variable will be plotted in each subplot of the grid,
with the data being split based on the values of the island column.
Now, we can even subset a dataset based on both columns and rows. The
FacetGrid function has both row and column parameters for specifying the
row name and column name. The graph will be divided into multiple facets
corresponding to the levels of categorical variables as demonstrated in the
following code Chap13_Example13.12.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('penguins')
myfacet_obj = mysns.FacetGrid(data=my_dataset,
row="island", col='sex')
myfacet_obj.map(mysns.histplot, 'bill_length_mm')
myplt.show()
Output:
In the code mentioned above, the variables island and sex have three and
two categories respectively, so the grid splits into six facets, one for each
combination of the categorical variables. Let us break down these three
lines of code by referring to the following points:
my_dataset = mysns.load_dataset('penguins'), this line loads the
penguins dataset using the load_dataset function present in the
seaborn library. The dataset contains information about different
penguin species, including measurements of bill length, flipper length,
body mass, and so on. The loaded dataset is stored in the variable
my_dataset.
Conclusion
In this chapter, readers gained proficiency in creating a diverse array of
advanced plots in Seaborn. From KDE plots for exploring single-variable
distributions to violin plots that harmonize aspects of box plots and KDE
plots, readers are now equipped with a repertoire of tools for nuanced data
exploration. The chapter's culmination lies in empowering readers to
navigate Seaborn's extensive capabilities, from crafting line plots to depict
trends and scatter plots to visualize variable relationships, to joint plots
amalgamating diverse plot types in a single grid and FacetGrids creating
grids of subplots based on categorical variables.
In essence, the chapter delivers on its objective to provide a comprehensive
understanding of basic statistical concepts, unveiling Seabornʼs
functionalities, and familiarizing readers with a spectrum of plots designed
for effective data exploration and presentation. Armed with this knowledge,
readers are now poised to leverage Seabornʼs capabilities, transforming
datasets into meaningful visual narratives that yield valuable insights. This
chapter not only imparts technical skills but also cultivates an appreciation
for the art and science of data visualization, ultimately fostering a data-savvy
mindset for those embarking on the journey of data analysis.
In the next chapter, we will integrate the data science libraries using
ChatGPT prompts. We will learn how to use ChatGPT to get our answers by
writing the actual prompts that are required.
Points to remember
KDE plot in seaborn utilizes Kernel Density Estimation for a smooth
representation of the underlying distribution of continuous data, aiding
in understanding the shape of the data distribution.
Violin plot in seaborn combines aspects of box plot and KDE to
display the distribution of data across different levels, making it
suitable for visualizing both central tendency and spread.
Line plot in seaborn displays data points connected by straight lines,
useful for showing trends and patterns over time or other ordered
categories.
Scatter plot in seaborn displays individual data points on a two-
dimensional plane, optimal for depicting the connections between two
continuous variables.
Joint plot in seaborn combines multiple plots to visualize
relationships between two variables, displaying scatter plots,
histograms, and regression lines simultaneously.
Facet grid in seaborn divides the data into subsets for separate
visualizations, facilitating comparison between different subsets and
exploring patterns within categorical variables.
Questions
1. What is the purpose of a KDE plot, and how does it differ from a
histogram?
2. Explain the kernel density estimation process in Seaborn.
3. When would you choose a violin plot over a box plot?
4. Describe the anatomy of a violin plot in Seaborn.
5. In what situations is a line plot preferable for data visualization?
6. How does Seaborn handle the creation of line plots?
7. What information can be derived from a scatter plot?
8. Discuss the parameters available in Seaborn for customizing scatter
plots.
9. How does a joint plot combine multiple types of visualizations?
10. Provide examples of scenarios where a joint plot is particularly
useful.
11. Explain the concept of Facet Grid in Seaborn.
12. What advantages does a Facet Grid offer in terms of data exploration
and presentation?
Introduction
ChatGPT, a specialized variant of OpenAIʼs Generative Pre-trained
Transformer (GPT) models, excels in generating human-like text responses
for conversational contexts. Its neural network architecture, trained on
extensive text data, facilitates tasks such as data augmentation for text-based
applications like classification and sentiment analysis. Additionally,
ChatGPT powers conversational AI systems and virtual assistants, and
facilitates text generation and summarization tasks, catering to a wide array
of needs from customer support to creative writing. Its adaptability through
fine-tuning enables a deeper understanding of language nuances, making it
an asset in various data science and machine learning projects, although its
frequency of use varies based on project requirements.
In data science, ChatGPT can be useful for tasks such as generating text
data, answering queries about data analysis, or explaining complex
concepts in a more understandable way. The frequency of using this tool for
data science would depend on the specific needs of your projects and how
often you require assistance with text-based tasks or explanations. In this
chapter, we shall discuss how to use ChatGPT with data science libraries.
Structure
In this chapter, we will discuss the following topics:
ChatGPT prompts for exploring data science libraries
Querying ChatGPT for Python code with data science libraries
Querying ChatGPT with Python error
Data science library cheat sheets using ChatGPT
Novice challenges querying Python code via ChatGPT
Objectives
This chapter will cover the topics to illuminate readers on the versatile
applications of ChatGPT in the exploration of open-source data science
libraries. Firstly, it aims to guide users in formulating effective prompts for
ChatGPT, tailored to each library, including numpy, scipy, matplotlib,
pandas, polars, and seaborn, to extract valuable insights and coding
solutions. We will cover various topics, including leveraging ChatGPT for
generating insightful prompts and queries to explore data science libraries,
seeking assistance for Python code utilizing these libraries, collaboratively
generating comprehensive cheat sheets, and addressing challenges faced by
novice users. By providing practical guidance and insights, this chapter
endeavors to empower readers to enhance their proficiency in data analysis
and visualization tasks while utilizing ChatGPT as a valuable resource in
their learning journey.
Loaded Array 2:
[[ 7 8 9]
[10 11 12]]
Loaded Array 3:
[[13 14 15]
[16 17 18]]
The conclusion drawn from the two approaches for saving and loading
NumPy ndarray objects to a binary file is that they offer different methods
with varying trade-offs:
Pickle approach: Using Pickle for serialization provides a flexible
and convenient way to save and load NumPy arrays. It is easy to use
and requires minimal code to implement. However, Pickle may not be
as efficient as other serialization methods, and there may be security
concerns when loading Pickle files from untrusted sources.
Numpy's save and load functions: Utilizing numpyʼs save and load
functions (for example, np.savez and np.load) offers a more
specialized approach tailored specifically for NumPy arrays. It
provides efficient storage and retrieval of NumPy arrays and is
optimized for performance. However, the saved files are specific to
NumPy and may not be compatible with other serialization libraries.
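To make the two approaches concrete, here is a minimal sketch of each, using small arrays and file names of our own choosing:
import pickle
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

# Approach 1: Pickle serialization of the arrays
with open('arrays.pkl', 'wb') as f:
    pickle.dump((arr1, arr2), f)
with open('arrays.pkl', 'rb') as f:
    loaded1, loaded2 = pickle.load(f)

# Approach 2: NumPy's savez/load, which stores several arrays in one .npz file
np.savez('arrays.npz', first=arr1, second=arr2)
data = np.load('arrays.npz')
print(data['first'])
print(data['second'])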
In summary, the choice between Pickle and numpyʼs functions depends on
factors such as ease of use, efficiency, compatibility, and security
requirements. Both approaches have their merits and can be used depending
on the specific needs of the application. Refer to another example:
ChatGPT Prompt8: Can you provide an example of interpolation
using SciPy and explain its application in real-world scenarios?
Please include a Python code demonstrating interpolation with SciPy
ChatGPT reply8: The code will be covered in
Chap14_Example14.3.py:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
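A minimal sketch of one-dimensional interpolation with interp1d, using sample data of our own choosing rather than the original listing's values, would look like this:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

# Known data points (assumed sample data)
x_known = np.array([0, 1, 2, 3, 4, 5])
y_known = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

# Build linear and cubic interpolating functions from the known points
f_linear = interp1d(x_known, y_known, kind='linear')
f_cubic = interp1d(x_known, y_known, kind='cubic')

# Evaluate both on a finer grid and compare them visually
x_new = np.linspace(0, 5, 50)
plt.plot(x_known, y_known, 'o', label='data')
plt.plot(x_new, f_linear(x_new), '-', label='linear')
plt.plot(x_new, f_cubic(x_new), '--', label='cubic')
plt.legend()
plt.show()
In practice, interpolation of this kind is used to estimate values between measured samples, for example resampling sensor readings onto a regular time grid. The next listing, covered in Chap14_Example14.4.py, combines a line plot and a bar plot in the same figure: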
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.arange(1, 6)
y_line = np.array([3, 5, 7, 4, 6])
y_bar = np.array([8, 6, 9, 7, 5])
# The middle of this listing is reconstructed from the description that
# follows; exact colors and label text may differ from the original
fig, ax = plt.subplots()
ax.bar(x, y_bar, color='skyblue', label='Bar data')
ax.plot(x, y_line, color='red', marker='o', linestyle='--', label='Line data')
ax.set_xlabel('X values')
ax.set_ylabel('Y values')
ax.set_title('Line and Bar Plots')
# Add legend
ax.legend()
# Show plot
plt.show()
Output:
Figure 14.2: Image depicting both line and bar plots using matplotlib
Note: The preceding code is covered in (program name:
Chap14_Example14.4.py).
In this code, we create two sets of sample data (y_line and y_bar) and plot
them on the same figure using Matplotlibʼs plot() and bar() functions. We
then customize the plot by adding markers, a linestyle, and colors. Finally,
we set labels for the axes and a title for the plot, and add a legend to
distinguish between the line plot and the bar plot. Refer to the following
example for a better understanding:
ChatGPT Prompt10: Provide a code snippet showcasing the
difference between pandas and polars.
ChatGPT reply10: The code will be covered in
Chap14_Example14.5.py:
import pandas as pd
import polars as pl
# Creating a DataFrame/Series
# Pandas
df_pandas = pd.DataFrame({'A': [1, 2, 3, 4, 5],
'B': [6, 7, 8, 9, 10]})
series_pandas = pd.Series([1, 2, 3, 4, 5])
# Polars
df_polars = pl.DataFrame({'A': [1, 2, 3, 4, 5],
'B': [6, 7, 8, 9, 10]})
series_polars = pl.Series([1, 2, 3, 4, 5])
# Accessing data
# Pandas DataFrame
print("Pandas DataFrame:")
print(df_pandas)
# Polars DataFrame
print("\nPolars DataFrame:")
print(df_polars)
# Pandas Series
print("\nPandas Series:")
print(series_pandas)
# Polars Series
print("\nPolars Series:")
print(series_polars)
# Modifying data
# Pandas
df_pandas['C'] = df_pandas['A'] + df_pandas['B']
series_pandas = series_pandas + 1
# Polars
df_polars = df_polars.with_columns([series_polars.rename("C")])
series_polars = series_polars + 1
# Display modified data
print("\nModified Pandas DataFrame:")
print(df_pandas)
print("\nModified Polars DataFrame:")
print(df_polars)
print("\nModified Pandas Series:")
print(series_pandas)
print("\nModified Polars Series:")
print(series_polars)
Output:
Pandas DataFrame:
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
Polars DataFrame:
shape: (5, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 6 │
│ 2 ┆ 7 │
│ 3 ┆ 8 │
│ 4 ┆ 9 │
│ 5 ┆ 10 │
└─────┴─────┘
Pandas Series:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Polars Series:
shape: (5,)
Series: '' [i64]
[
1
2
3
4
5
]
Modified Pandas DataFrame:
A B C
0 1 6 7
1 2 7 9
2 3 8 11
3 4 9 13
4 5 10 15
Modified Polars DataFrame:
shape: (5, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ 1 │
│ 2 ┆ 7 ┆ 2 │
│ 3 ┆ 8 ┆ 3 │
│ 4 ┆ 9 ┆ 4 │
│ 5 ┆ 10 ┆ 5 │
└─────┴─────┴─────┘
Modified Pandas Series:
0 2
1 3
2 4
3 5
4 6
dtype: int64
Modified Polars Series:
shape: (5,)
Series: '' [i64]
[
2
3
4
5
6
]
The code snippet provided demonstrates basic data manipulation tasks such
as creating DataFrames and Series, accessing data, and modifying data. By
comparing how these tasks are performed in both pandas and polars, users
can gain insight into the differences in syntax and functionality between the
two libraries. Refer to another example for a better understanding:
ChatGPT Prompt11: Can you demonstrate the creation of scatter
plots using both seaborn and matplotlib in Python? Please provide a
code example showcasing how to create scatter plots with each library
and highlight any differences in their syntax or features.
ChatGPT reply11: The code will be covered in
Chap14_Example14.6.py:
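A minimal sketch of such a comparison, using the tips dataset as an assumed example, is:
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Seaborn: a single call handles category coloring via the hue parameter
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day', ax=ax1)
ax1.set_title('Seaborn scatter plot')

# Matplotlib: plain scatter; grouping by category must be coded manually
ax2.scatter(tips['total_bill'], tips['tip'], color='teal')
ax2.set_title('Matplotlib scatter plot')
ax2.set_xlabel('total_bill')
ax2.set_ylabel('tip')

plt.tight_layout()
plt.show()
The key syntactic difference is that seaborn works directly with dataframe column names and categorical hues, while matplotlib expects explicit arrays and leaves the grouping logic to the user.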
To fix the issue, you can reshape arr2 to have the same number of
dimensions as arr1 before concatenating.
The code will be covered in Chap14_Example14.7.py:
import numpy as np
# Arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([7, 8, 9])
# Reshape arr2 to match the dimensions of arr1
arr2_reshaped = np.reshape(arr2, (1, 3))
# Concatenate along row axis
result = np.concatenate((arr1, arr2_reshaped),
axis=0)
print(result)
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
We have only explored two examples within this topic, yet there are
numerous other instances to consider. The primary purpose of this exercise is
to demonstrate that you can pose any query to ChatGPT, and it will respond
to the best of its ability based on its knowledge. While ChatGPT can easily
handle solutions for simpler problems like these two examples, it may
struggle with more complex issues. In such cases, obtaining a complete
solution may require persistent questioning, and providing all available
information regarding the specific problem until a satisfactory resolution is
reached. Otherwise, it becomes a repetitive cycle where the query is
repeatedly posed, and the solution provided by ChatGPT may not be entirely
accurate.
Conclusion
In conclusion, this chapter has explored the myriad applications of ChatGPT
in navigating data science libraries. By guiding users in formulating
effective prompts tailored to specific libraries, such as numpy, scipy,
matplotlib, pandas, polars, and seaborn, valuable insights and coding
solutions can be obtained. Additionally, the chapter has provided insights
into utilizing ChatGPT for querying Python code related to data science
tasks, along with strategies for seeking assistance when encountering
challenges. Furthermore, readers have learned how to collaboratively create
cheat sheets summarizing key functions and commands from data science
libraries using ChatGPT. Lastly, the chapter has addressed the common
challenges faced by novice users when querying Python code with ChatGPT,
offering practical strategies and insights to overcome these obstacles.
Overall, this chapter equips readers with the knowledge and tools necessary
to leverage ChatGPT effectively in their exploration and utilization of data
science libraries, enhancing their proficiency in data analysis and
visualization tasks. In the next chapter, we will explore automated EDA
libraries for machine learning and will delve into the capabilities of the
sweetviz, autoviz, pandas_visual_analysis, y-data profiling, dtale, and
dabl libraries, each providing distinct functionalities and visualization
approaches to enhance the understanding and interpretation of datasets.
Points to remember
When formulating prompts for ChatGPT to explore data science
libraries, ensure clarity and specificity to extract relevant insights
tailored to the library of interest, such as numpy, scipy, matplotlib,
pandas, polars, or seaborn.
While asking questions to ChatGPT for Python code utilizing data
science libraries, provide sufficient context and details about the
problem statement or desired outcome to receive accurate and helpful
responses.
When stuck in any Python code, utilize ChatGPT effectively by
breaking down the problem into smaller components and providing
clear and concise queries to solicit targeted assistance, especially when
dealing with data science libraries.
When creating cheat sheets of data science libraries using ChatGPT,
collaborate with others to ensure comprehensive coverage of essential
functions and commands, and regularly update the cheat sheets as new
features or functionalities are introduced in the libraries.
Be mindful of the common challenges faced by novices when
querying Python code using ChatGPT, such as formulating clear
questions, understanding error messages, providing adequate context,
and describing the logic of the code accurately. Practice effective
communication and problem-solving strategies to overcome these
challenges and make the most out of ChatGPTʼs capabilities in data
science tasks.
Questions
1. How can ChatGPT assist in exploring numpy functionalities?
2. Can ChatGPT help me in interpolating data using scipy?
3. How do we fix a syntax error in my Python code using ChatGPT?
4. What are the key components of a cheat sheet for matplotlib that
ChatGPT can provide?
5. Can ChatGPT generate a cheat sheet summarizing essential functions
in numpy?
6. How can novices improve their communication skills when
interacting with ChatGPT?
7. How can we plot a histogram in matplotlib using ChatGPT?
8. How can ChatGPT assist in optimizing my Python code for better
performance?
9. What strategies can we employ with ChatGPT to overcome runtime
errors in Python?
10. What resources and tools are available to novices for improving their
proficiency in querying Python code with ChatGPT?
Introduction
In the field of machine learning, Exploratory Data Analysis (EDA) plays a
crucial role in understanding data properties, recognizing patterns, and
deriving insights. Traditionally, EDA entails an iterative approach where data analysts
meticulously inspect datasets using visualization methods, statistical
summaries, and domain expertise. However, with the increase in size and
complexity of datasets, traditional EDA techniques are becoming more time-
consuming and labor-intensive. Automated EDA tools have emerged as a
solution to this challenge by harnessing algorithms and machine learning
methods to streamline the data exploration process.
These tools provide a variety of features, including data profiling,
visualization capabilities, feature engineering options, and anomaly
detection functionalities. This enables data scientists and analysts to quickly
gain insights from their datasets with minimal manual intervention.
Automated EDA libraries cater to the demand for data investigation by
managing intricate datasets in a scalable manner while maintaining
consistent analytical outcomes. By utilizing these methods, such tools reveal
concealed trends and streamline processes, thereby making data exploration
accessible to a wider range of users.
Structure
In this chapter, we will discuss the following topics:
Usage of PyGWalker visual library with a dataset
Usage of dataprep library with a dataset
Usage of autoviz library with a dataset
Usage of pandas_visual_analysis library with a dataset
Objectives
This chapter will introduce learners to a diverse array of Python libraries
tailored for data visualization and analysis, each uniquely contributing to an
enriched data exploration experience. Beginning with the PyGWalker visual
library, readers will discover how to seamlessly integrate it with datasets to
create visually appealing representations. The exploration continues with the
dataprep library, offering insights into efficient data preparation and
analysis. Subsequently, learners will delve into the capabilities of the
autoviz and pandas_visual_analysis libraries, each providing distinct
functionalities and visualization approaches to enhance the understanding
and interpretation of datasets. By the end of this module, participants will be
equipped with a versatile toolkit for effective data exploration and
visualization in Python.
Note: Our discussion on each visual library will cover the following
topics: introduction, installation method, code/how to use, and features
explanation.
We will be using all these libraries in Jupyter Notebook. Since this book is
not a colored edition, most of the images shown here are originally colored
and can be found in the soft image bundle. So, if we refer to any images
that are colored, you are requested to refer to the image bundle.
We can check the entries and inspect the dataframe in the Data tab. Here, the
numeric values (continuous columns) will automatically be labeled as
measures and categorical columns as dimensions in pygwalker. The
various column types are handled as follows:
Continuous columns are considered to be quantitative.
For categorical columns, the data will be either nominal or ordinal.
For date/time-related columns, the data type will be temporal.
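For reference, launching pygwalker on a dataframe in Jupyter Notebook is typically a one-liner; a minimal sketch, assuming the tips dataset is used as in the discussion that follows, is:
import seaborn as sns
import pygwalker as pyg

# Load the tips data into a dataframe (any dataframe works here)
mydf = sns.load_dataset('tips')

# Open the interactive pygwalker interface with its Data and Visualization tabs
walker = pyg.walk(mydf)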
In the Visualization tab, we can create different types of charts. The
available fields can be dragged and dropped into the x and y axes. Let us plot
the total bill amount by day. So, in the Y axis, let us take the total_bill
column, and in the X axis, let us take the day column. We can also resize the
chart by clicking the symbol highlighted in the black rectangle as shown in
Figure 15.3. A chart type will be automatically picked by default by
pygwalker for the user. We can additionally add charts by clicking +New in
pygwalker.
Figure 15.3: Image depicting chart plot with a single X and Y field in Chart 1
However, the chart types can be selected based on Figure 15.4 shown as
follows:
Figure 15.4: Image depicting chart types in pygwalker
We can even explore our data in a very imaginative way by adding some
extra fields into our chart either in the X or Y axis. Suppose we want to
know how much smokers spend each day. We can drag and drop the smoker
field and the day field on the X-axis against the total_bill column on the Y-
axis. The time field can be added under Color. Let us filter the sex too, that
is, only for female customers. We will only check female and uncheck male
here during selection. So, based on the selection, we have filtered the values,
and the plot is shown in Figure 15.5:
Figure 15.5: Image depicting chart plot with a multiple X and a single Y field in Chart2
Also, we can export it under three different formats either in PNG, SVG, or
BASE64 format as shown in Figure 15.6 as follows:
We can also export our code using export_code tool button as highlighted in
Figure 15.7 as follows:
Figure 15.7: Image depicting export_code button
On clicking this tool button, the following figure will pop-up in front of the
user:
Now, if we click the Copy to Clipboard button, the pop-up window will be
closed and if we paste the code in the new cell, we shall see the same
specification of our dashboard consisting of both the Charts on running.
There are more options in pygwalker, which we shall leave for you to
explore.
For more information, you may refer to the website:
https://pypi.org/project/pygwalker/
Figure 15.10: Image displaying different features using dataprep library in Jupyter Notebook
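A report of the kind shown in Figure 15.10 is generally produced with the create_report() function from dataprep.eda; a minimal sketch, assuming the penguins.csv file used in this section, is:
import pandas as pd
from dataprep.eda import create_report

# Read the dataset and generate an interactive EDA report in the notebook
mydf = pd.read_csv('penguins.csv')
create_report(mydf).show()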
Let us explain each of the features shown in the report one by one. Each
feature explained here will help us to understand the data quickly, but make
sure that your EDA fundamentals are very strong before practicing with
these types of libraries. The overview and variables are described below:
Overview: The overview of our penguins.csv dataset contains dataset
statistics and dataset insights here as shown in Figure 15.11. In dataset
statistics, we can find statistical details like the number of variables
which is nothing but the number of columns, the number of rows,
mussing cells, their percentage, duplicate rows if any, the memory
size, and the last important information which is differentiation of
numerical or categorical variables. The dataset insights display only
the percentage of missing values.
Figure 15.11: Image displaying Overview feature using dataprep library in Jupyter Notebook
Figure 15.12: Image displaying Categorical species variable using dataprep library
The species column is a categorical column within the dataset which
consists of three different penguin species: Adelie, Chinstrap, and Gentoo.
The species column serves as a pivotal factor for organizing and examining
penguin data, enabling researchers to discover the unique characteristics
associated with each species. We will notice the visually appealing bar plot
effectively represents categorical values. Upon clicking the Show details
button below the species, we shall find five different plots as shown in
Figure 15.13:
When we click the Show Details button, four tabs are shown to the user.
The first tab, towards the bottom, is related to Stats, where we come across
two statistical summaries of the data points: the quantile and descriptive
statistics, encompassing measures like mean, median, variance, and other
relevant statistical calculations, as shown in Figure 15.16:
Different plots are demonstrated to the user here, starting with the Kernel
Density Estimation (KDE) visualization, which serves as a vital tool for
assessing and interpreting Gaussian distribution properties, as shown in Figure 15.17:
Now, we shall explore the interactions option in dataprep visual library for
displaying scatter plots.
Interactions
This option enables the exploration of relationships between two features,
with one feature represented on the x-axis and the other on the y-axis. Here,
we are viewing the relationship between bill_length_mm on the X-axis and
body_mass_g on the Y-axis using a scatter plot as shown in Figure 15.18. We
can also change the X-axis and Y-axis variables based on our choice of
requirement. So, different features can be selected, and plotted values can be
viewed by the user.
Correlations
Typically, datasets contain correlated features, where the impact of one
feature on accuracy mirrors that of another correlated feature. Therefore, it is
essential to identify and analyze these correlations through EDA. Within this
option, we will encounter various types of correlation graphs, including
Pearson, Spearman, and KendallTau, accessible through a change of tabs
as shown in Figure 15.19:
Figure 15.19: Image displaying different graphs with Correlation feature
Missing values
Detecting missing values within a dataset can pose challenges for data
scientists. However, here, the missing values are visualized using a bar chart
and spectrum graph, providing a clear scale of missing versus present values.
Notably, missing values are distinctly highlighted in orange, aiding in their
identification and analysis within the dataset. We already stated that sex has
3.2% of missing values from the Overview feature. So, in Figure 15.21, we
are displaying the missing values using a bar chart and spectrum graph:
Figure 15.21: Image displaying Barplot and Spectrum plot for Missing Values feature
In Figure 15.21, we have displayed the missing values using a bar chart and
a spectrum chart. We can also write a small piece of code, as there is
another method inside the dataprep.eda module to search for missing
values, as shown in Figure 15.22. Here, we are displaying the missing
values using a heat map and a dendrogram. The heat map helps us
understand which columns are highly correlated with each other, with the
color displayed to the user based on the correlation. The dendrogram is
used for clustering similar kinds of data, which is mainly used in
hierarchical clustering, that is, grouping data points into clusters based on
their similarity or dissimilarity.
Figure 15.22: Image displaying HeatMap and Dendrogram plot for Missing Values feature
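The method referred to here is most likely plot_missing() from dataprep.eda; a minimal sketch of such a call, continuing with the penguins dataframe loaded earlier, is:
from dataprep.eda import plot_missing

# Visualize missing values; the interactive output includes bar chart,
# spectrum, heat map, and dendrogram views of missingness in the dataframe
plot_missing(mydf)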
Figure 15.24: Image displaying DQ issues against each column of iris.csv dataset
In AutoViz, the pair-wise scatter plot showcases the connections between all
continuous variables, offering a rich visual overview of their relationships.
This dynamic visualization aids in spotting patterns, correlations, and
emerging trends among variables, enabling a deeper understanding of the
datasetʼs underlying structure and interconnections. The pair wise scatter
plot is plotted between sepal_length vs. sepal_width, sepal_length vs.
petal_length, sepal_length vs. petal_width, sepal_width vs. petal_length,
sepal_width vs. petal_width, and petal_length vs. petal_width. We are
only displaying the image of sepal_length vs. sepal_width, and
sepal_length vs. petal_length as shown in Figure 15.25:
Figure 15.25: Image depicting scatter plot between sepal_length vs (sepal_width/ petal_length)
Also, we can view the distplot, boxplot, and probability plot for all
columns except the species column. Here, we are displaying all three plots
for the sepal_length column shown in Figure 15.26:
Figure 15.26: Image displaying Dist, Box, and Probability plot for sepal_length column
Also, other plots like Histogram, Violin (of all continuous variables),
HeatMap (of all numeric variables including target), and bar (For each
continuous column by each categorical variable) are also generated for the
dataset columns, which we leave for you to explore in the Jupyter Notebook.
The last message displayed will be the execution time to run AutoViz which
is four seconds displayed as follows:
Time to run AutoViz = 4 seconds
###################### AUTO VISUALIZATION
Completed ########################
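For reference, an AutoViz run of this kind is normally started along the following lines (a minimal sketch; the notebook's actual arguments may differ):
from autoviz.AutoViz_Class import AutoViz_Class

# Instantiate the AutoViz class and point it at the CSV file to be profiled
AV = AutoViz_Class()
report = AV.AutoViz('iris.csv')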
For more information, you may refer to the website:
https://pypi.org/project/autoviz/
Figure 15.30: Image displaying Subtractive selection type with highlighted black ellipse
Also, if we look at the bar chart of the displacement column, the removed
data points will be displayed in grey color as shown in Figure 15.32. The
rest of the data points will be in blue:
Figure 15.32: Image displaying Removed data points with highlighted grey area in the bar plot
Also, an important point to note is that on removing the data points, the
statistical data pertaining to each column will be affected as shown in Figure
15.33:
Figure 15.33: Image displaying each column data affected due to the removal of data points
So, we can conclude that due to the removal of this small set of data points,
there will be an impact on the graph.
Now, let us focus on our third selection type which is Additive. Here, we will
add whatever data we have removed and then visualize the images as shown
in Figure 15.34:
Figure 15.34: Image displaying after the addition of data points under the Additive Selection type
We shall observe that we have the same data visualization that we witnessed
before removing the data including the statistical information.
In conclusion, Pandas Visual Analysis aids in gauging the effects of
removing specific data points that deviate significantly from the datasetʼs
mean. By visualizing the impact of removing these outliers, users gain
insights into potential adjustments to the dataset. Additionally, the tool offers
functionality for feature normalization, further enhancing its utility in data
exploration and analysis.
We can also download a plot as PNG, zoom in or out a particular section,
pan, either box or lasso select, autoscale, or reset axis on any of the plots as
these options are available to the user as shown in Figure 15.35:
Figure 15.35: Image displaying multiple options available to the user
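For reference, the interactive widget used in this section is created with a single call; a minimal sketch, assuming a dataframe named mydf (for example, the mpg data, which has a displacement column), is:
import seaborn as sns
from pandas_visual_analysis import VisualAnalysis

# Any dataframe works; mpg is assumed here because it has a displacement column
mydf = sns.load_dataset('mpg')

# Launch the interactive analysis widget inside Jupyter Notebook
VisualAnalysis(mydf)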
So, in this chapter, we have covered the following four automated EDA
libraries, viz. pygwalker, dataprep, autoviz, and pandas_visual_analysis,
for seamless integration, EDA, and visualization.
Conclusion
In this chapter, we equipped learners with a diverse range of Python libraries
tailored for data visualization and analysis, each offering unique
contributions to enrich the data exploration process. Beginning with
PyGWalker Visual, readers gained insights into seamlessly integrating
datasets to create visually appealing representations. Moving forward, the
exploration continued with the dataprep library, offering efficient data
preparation and analysis techniques. Subsequently, learners explored autoviz
and pandas_visual_analysis, discovering distinct functionalities and
visualization approaches to deepen their understanding and interpretation of
datasets. As a result, participants now possess a versatile toolkit for effective
data exploration and visualization in Python, empowering them to extract
valuable insights from diverse datasets with confidence and ease. In the next
chapter, we shall view two case studies that apply the open data science
libraries we have studied so far, viz. pandas, matplotlib, seaborn, and
so on.
Points to remember
PyGWalker Visual library simplifies the integration of datasets,
enabling users to effortlessly craft visually captivating representations.
The walk method facilitates EDA, enabling users to visually
investigate and analyze the dataframe.
The dataprep library simplifies data preparation by providing tools
for tasks like detecting missing values and profiling data, ensuring
datasets are clean and prepared for analysis.
The create_report() function is called on the dataframe to
generate an EDA report, which is then displayed using the show()
method.
The autoviz library streamlines visualization tasks by automatically
generating a diverse array of insightful plots and charts, requiring
minimal user input. This automation accelerates the data exploration
process, enabling quick and thorough analysis.
The AutoViz report is generated by calling the AutoViz method with
the dataset filename as an argument, resulting in comprehensive
visualizations and insights presented in the report.
pandas_visual_analysis provides an interactive environment
for data exploration, offering a range of visualizations and tools to
interactively analyze and understand dataset characteristics. Users can
benefit from features such as normalization, and outlier detection,
enhancing their ability to uncover patterns and relationships within the
dataset.
VisualAnalysis widget allows for interactive exploration and
visualization of the data within mydf, offering various plotting
options and tools for gaining insights into the datasetʼs characteristics
and relationships.
Questions
1. How does PyGWalker Visual simplify the integration of datasets for
visualization purposes?
2. What are the key functionalities offered by the dataprep library for
data preparation?
3. How does the autoviz library automate the visualization process, and
what are its main advantages?
4. What makes pandas_visual_analysis an interactive tool for data
exploration, and how does it support feature-rich visualization?
5. Can you demonstrate how to create visually appealing representations
using PyGWalker Visual?
6. How does dataprep assist in identifying and handling missing values
in a dataset? What steps are involved in data profiling with dataprep?
7. What types of visualizations can be automatically generated using
autoviz, and how does it contribute to speeding up the analysis
process?
8. What are some common challenges in data preparation, and how does
dataprep help to address them?
9. In what scenarios would you recommend using PyGWalker Visual,
dataprep, autoviz, and pandas_visual_analysis over other tools for
data exploration and analysis?
10. Name the important methods while using PyGWalker Visual,
dataprep, autoviz, and pandas_visual_analysis libraries for EDA.
Structure
Objectives
Let us explore a case study on electrical fault classification. The case
study code is provided on the Jupyter Notebook file
Electrical_Fault_Classification.ipynb. The provided code displayed in
Figure 16.1 imports fundamental data science libraries, such as pandas for
data manipulation, NumPy for numerical computations, seaborn for
statistical visualization, and matplotlib for plotting.
# datetime is assumed to have been imported in Figure 16.1 (from datetime import datetime)
timestamp = datetime.now()
We are then writing the code for suppressing the warning messages for
ensuring cleaner output:
import warnings
warnings.filterwarnings('ignore')
mydf = mypd.read_csv("Fault_Class_Data.csv")
Python uses the pandas library to read data from a CSV file named
Fault_Class_Data.csv (dataset) and store it into a dataframe named mydf.
We are then displaying the first five rows of the dataset as shown in the
following
Figure 16.2:
Then, let us compute the sum of missing values across each column as
shown in
Figure 16.4:
Next, we will look into the quick overview of the size and structure of the
dataset to understand its scale and complexity by executing the statement
mydf.shape. A tuple is returned representing the dimensions of the
dataframe where the first element represents the number of rows, and the
second element returns the number of columns as shown in Figure 16.5:
Figure 16.5: Image checking null values in a dataset
Figure 16.6: Image checking for duplicates and displaying duplicate rows
This code snippet efficiently detects and displays duplicate rows within the
dataframe mydf, first by creating a Boolean mask to identify duplicates and
then filtering and presenting these duplicate rows in a new dataframe
named duplicate_rows. This process offers clear insight into any repetitive
entries within the dataset. You may observe the next row for the dropped
duplicates here.
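Based on that description, the detection step is likely along these lines (a sketch; the mask variable name is our own):
# Boolean mask that is True for rows that duplicate an earlier row
mask = mydf.duplicated()
# Keep only the duplicate rows in a new dataframe and display them
duplicate_rows = mydf[mask]
print(duplicate_rows)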
Then, we will be writing the code which efficiently removes duplicate rows
from the dataframe mydf and displays the resulting dataset without
duplicates as shown in
Figure 16.7:
mydf = myunique_df
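The myunique_df assigned here presumably comes from a drop_duplicates() call of this form (an assumption based on the variable name and the description above):
# Remove duplicate rows, keeping the first occurrence of each
myunique_df = mydf.drop_duplicates()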
Then, we will concatenate columns O_G, O_A, O_B, and O_C as strings
and assign them to a new column MyFault_Type in the dataframe mydf.
The astype('str') method ensures the conversion of columns to string type
before concatenation. Finally, mydf.head() is used to display the first five
rows of the dataframe mydf with the newly created column as shown in
Figure 16.9:
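Based on that description, the code is likely of the following form (a sketch built from the column names quoted above; the concatenation order is assumed):
# Concatenate the four binary fault-indicator columns into one string label
mydf['MyFault_Type'] = (mydf['O_G'].astype('str') + mydf['O_A'].astype('str') +
                        mydf['O_B'].astype('str') + mydf['O_C'].astype('str'))
mydf.head()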
Figure 16.10: Image displaying the binary data for Faults/No Faults
After this, we will randomly select and display five rows from the
dataframe, providing a brief snapshot of the data along with the dimensions
of the dataframe. In addition, the statistics for numerical columns in the
dataframe, including count, mean, standard deviation, minimum,
maximum, and quartile values, offering insights into the distribution and
summary statistics of the data will be displayed to the user as shown in
Figure 16.12:
Figure 16.12: Image displaying sample data, shape and summary statistics
of the dataframe
Then, we will calculate the frequency count of each unique value in the
MyFault_Type column of the dataframe mydf, sorting the counts in
descending order and thus providing a summary of fault types and their
occurrences as shown in Figure 16.13:
Figure 16.13: Image displaying the frequency count for fault of target
variable
Next, let us focus on data visualization of the current dataset. We shall view
the code segment which visualizes the distribution of fault types within the
dataset. In the first subplot, let us create a countplot displaying the
frequency of each fault type, while the second subplot will create a pie
chart illustrating the distribution of fault types as percentages.
Customizations such as title, font properties, and rotation are applied for
better visualization. Finally, it prints the current timestamp, likely for
documentation purposes. Together, these visualizations offer insights into
the prevalence of different fault types in the dataset as shown in Figure
16.14:
First , we will be plotting the count plot which is a type of plot that displays
the count of observations in each category of a categorical variable which is
displayed in Figure 16.15:
Figure 16.15: Image displaying count plot with bar labels indicating display
of value
Next, we shall observe the visualization using pie plot in Figure 16.16
shown as follows:
On the same concept, we have displayed the data visualization for Line A,
Line B and Line C faults.
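A sketch of the count plot and pie chart arrangement described above, continuing with the notebook's mysns/myplt aliases and the mydf dataframe (figure size and titles are our own choices), is:
from datetime import datetime
myplt.figure(figsize=(12, 5))
# Subplot 1: frequency of each fault type
myplt.subplot(1, 2, 1)
mysns.countplot(x='MyFault_Type', data=mydf)
myplt.title('Fault type counts')
myplt.xticks(rotation=45)
# Subplot 2: share of each fault type as a percentage
myplt.subplot(1, 2, 2)
mydf['MyFault_Type'].value_counts().plot.pie(autopct='%1.1f%%')
myplt.title('Fault type share')
myplt.show()
print(datetime.now())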
The next code segment will generate histograms for each numerical column in
the dataframe mydf, with bins set to ten and a figure size of 10x10. The
hist() function from pandas is used to create the histograms, which display
the distribution of values within each numerical column, providing insight
into their data distribution and range. The myplt.show() function displays
the histograms, allowing visual examination along with a display of the
current timestamp as shown in Figure 16.18:
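That is, the call is essentially of this form (a sketch, continuing with the mydf dataframe and myplt alias):
# Histograms of every numerical column, with 10 bins each
mydf.hist(bins=10, figsize=(10, 10))
myplt.show()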
The output is shown in Figure 16.20, for the display of various plots viz
Histogram:
Figure 16.20: Image displaying histogram plots in 3*2 grid layout for
distribution of voltage and load of line
The output is shown in Figure 16.21 for the display of various plots viz
kdeplot:
Figure 16.21: Image displaying kde plots in 3*2 grid layout for distribution
of voltage and load of line
The output is shown in Figure 16.22 for the display of various plots viz
boxplot:
Figure 16.22: Image displaying boxplot in 3*2 grid layout for distribution
of voltage and load of line
The forthcoming code segment will generate a plot featuring three lines,
each representing a distinct feature of Current ('Ir', 'Iy', and 'Ib') sourced
from the dataframe mydf. By employing the myplt.figure(figsize = (10,4))
command, it defines the dimensions of the plot to be displayed. Then,
utilizing myplt.plot(), it maps each feature against its respective index
values, distinguishing each line with distinct colors (r for red, y for yellow,
and b for blue) and then finally it will display the current timestamp as
shown in Figure 16.23:
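A sketch of that plotting pattern, with the legend as our own addition, is:
myplt.figure(figsize=(10, 4))
# Plot each current feature against the row index in red, yellow, and blue
myplt.plot(mydf['Ir'], 'r', label='Ir')
myplt.plot(mydf['Iy'], 'y', label='Iy')
myplt.plot(mydf['Ib'], 'b', label='Ib')
myplt.legend()
myplt.show()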
The same concept applies to the Voltage feature (Vr, Vy, and Vb) as shown
in Figure 16.24:
Now, we shall view the data visualization for a healthy system (that is, No
Fault).
Next, the code segment will create a figure with a size of 10x4 and set the
super title as No Fault Healthy System. It will then divide the figure into
two subplots arranged vertically by using subplot(2,1,1) and subplot(2,1,2).
The first subplot will plot three lines representing features Ir, Iy, and Ib
from the dataframe My_NF with colors r (red), y (yellow), and b (blue)
respectively. Similarly, in the second subplot, it plots features Vr, Vy, and
Vb from My_NF and finally it will display the current timestamp as shown
in Figure 16.25:
The same concept of code will be applied to the data visualization for
LineA to Ground Fault, LineA, LineB to Ground Fault and LineA, LineB,
LineC to Ground Fault which is displayed in the figures Figure 16.26,
Figure 16.27 and Figure 16.28 respectively.
In Figure 16.26, we are displaying the plot of Faulty system with LineA to
Ground Fault:
Figure 16.26: Image displaying plot of LineA to Ground Fault
From Figure 16.26, the observation is that at the time of a Line A to ground
fault, the current in line A increases roughly tenfold, to approximately 1000
amperes from the normal 100 amperes, and the voltage is reduced. Now, let
us observe our next figure for the display of a faulty system with Line A,
Line B to Ground Fault:
Now, we shall observe our next figure for the display of Fault system with
Line A, Line B, and Line C to Ground Fault:
We shall begin by importing the libraries, similar to the first case study, as
shown in Figure 16.29:
Figure 16.29: Image depicting importing of data science libraries for Case
Study2
mydf = mypd.read_csv('titanic.csv')
Python uses the pandas library to read data from a CSV file named
"titanic.csv" (dataset) and store it in a dataframe named mydf.
We are then displaying the first five rows of the dataset as shown in the
following Figure 16.30:
The existing features and the target variable in the Titanic dataset are
mentioned as follows:
Then, the next step that we need to focus on is finding the missing data in
the dataset. Here, we write mydf.isnull(), which serves to identify missing
values (null values) within the dataframe mydf. Upon
invocation, it generates a dataframe mirroring the structure of mydf, where
each cell contains either True or False. A True value signifies a missing
(NaN) entry in the original dataframe, while a False indicates the presence
of valid data.
However, this approach is not a practical way of finding the null values.
For a larger dataset, we would have to iterate over every cell, which is very
tedious. Also, we would need to scroll through the output in Jupyter
Notebook, which does not display all the data and skips many rows. So, we
need an alternative, visualization-based approach to the solution mentioned
above; the explanation is shown in Figure 16.33:
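That alternative is a seaborn heat map of the null mask; a minimal sketch of such a call, using the notebook's mysns and myplt aliases, is:
# Heat map of missing values: every NaN cell shows up as a contrasting band
mysns.heatmap(mydf.isnull(), yticklabels=False, cbar=False, cmap='viridis')
myplt.show()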
Figure 16.34: Image displaying the count of survived people in the dataset
Figure 16.35: Image displaying the count of survived people in the dataset
based on hue=’sex’ parameter
Figure 16.37: Image displaying distribution of ages who aboard the titanic
If the kde parameter is set to False, the kde plot will not be visible to the user.
In Figure 16.38, we are viewing the histogram creation for the age column
from the dataframe mydf, which separates the data into 20 bins and
presents it in a dark blue color with 60% transparency, annotations are
introduced atop each bar to display the count value of data points within
each bin. This is achieved by iterating over each bar, extracting its height
(representing the count), and placing annotations at the center of the bars.
These annotations, slightly offset vertically for clarity, provide numerical
context alongside the visual representation of the age distribution,
enhancing the histogramʼs interpretability. Observe Figure 16.38 for
histogram view by separating the data into 20 bins:
Figure 16.38: Image displaying histogram creation for the age column
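A sketch of that annotation pattern, built on the description above (variable names and the exact offsets are our own), is:
# Histogram of age with 20 bins, dark blue, 60% opacity; NaN ages are dropped
# here because the missing values have not yet been imputed at this point
counts, bins, bars = myplt.hist(mydf['age'].dropna(), bins=20,
                                color='darkblue', alpha=0.6)
# Annotate each bar with its count, slightly above the bar centre
for bar in bars:
    height = bar.get_height()
    myplt.text(bar.get_x() + bar.get_width() / 2, height + 0.5,
               str(int(height)), ha='center')
myplt.show()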
The next code segment, displayed in the following Figure 16.41, produces a
box plot comparing the distribution of ages among the different passenger
classes (pclass) aboard the Titanic. First, the code sets the figure size
using plt.figure(figsize=(8,7)) to ensure clarity in the visualization.
Then, a seaborn box plot (sns.boxplot) is created, with the x-axis
representing passenger class and the y-axis denoting age. Subsequently,
the code calculates the 25th, 50th (median), and 75th percentiles of age
for each passenger class using the quantile method. These percentile
values are stored in separate dataframes (p25, p50, p75).
Finally, annotations are added to the box plot to mark the positions of
these percentiles for each passenger class. The annotations display the
percentile values at the corresponding positions in the plot, as
demonstrated in the following Figure 16.41:
Figure 16.41: Image displaying box plot code creation using pclass vs age
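A sketch of what such a box plot with percentile annotations might look like; here the percentiles are held as Series produced by groupby, and the annotation placement is illustrative rather than the book's exact code:
myplt.figure(figsize=(8, 7))
mysns.boxplot(x='pclass', y='age', data=mydf)

# 25th, 50th (median) and 75th age percentiles for each passenger class
p25 = mydf.groupby('pclass')['age'].quantile(0.25)
p50 = mydf.groupby('pclass')['age'].quantile(0.50)
p75 = mydf.groupby('pclass')['age'].quantile(0.75)

# Annotate each class position (0, 1, 2 on the x-axis) with its percentiles
for i, pclass in enumerate(sorted(mydf['pclass'].unique())):
    for series in (p25, p50, p75):
        value = series[pclass]
        myplt.text(i, value, f'{value:.1f}', ha='center', va='bottom')

myplt.show()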
On executing the preceding code, we get the box plot with annotations, as
demonstrated in Figure 16.42:
Figure 16.42: Image displaying box plot comparing age distribution with
pclass
The function extracts two elements from mycols: the age (MyAge) and
passenger class (MyPClass). It then checks if the age value is null by using
mypd.isnull(MyAge). If it is null, the function assigns a default age based
on the passenger class: 37 for class 1, 29 for class 2, and 24 for class 3. If
the age value is not null, it returns the original age value.
Finally, the function is applied to the age and pclass columns of the
dataframe mydf by using the apply method. The function is applied row-
wise (axis=1), and the result is assigned back to the age column of mydf.
This process imputes missing age values in the dataframe based on the
passenger class, ensuring that each passenger has a non-null age value for
further analysis.
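A minimal sketch of such an imputation function, using the class-based default ages described above; the function name impute_age and the column access by label are assumptions following the book's style:
def impute_age(mycols):
    # Extract the age and passenger class from the row passed by apply
    MyAge = mycols['age']
    MyPClass = mycols['pclass']

    # If the age is missing, substitute a default based on passenger class
    if mypd.isnull(MyAge):
        if MyPClass == 1:
            return 37
        elif MyPClass == 2:
            return 29
        else:
            return 24
    # Otherwise keep the original age value
    return MyAge

# Apply row-wise over the age and pclass columns and write back to age
mydf['age'] = mydf[['age', 'pclass']].apply(impute_age, axis=1)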
After this, we again visualize the null values in each column using the
heatmap concept from the seaborn library, as shown in Figure 16.44. We can
observe that there are no longer any null values in the age column; only
the remaining three columns still contain null values. That is the benefit
of analyzing the missing values of a particular column in the dataset.
Figure 16.44: Image displaying visualization after handling the null values
in age column
Now, we can see that the deck column has the most NaN values. Replacing
these NaN values properly requires feature-engineering analysis, which we
shall cover in our next book on Machine Learning. So, for the present, we
will simply drop this column using the code segment shown in Figure 16.45:
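A one-line sketch of dropping the column; the in-place form is an assumption, and the book's code may instead reassign the result:
# Remove the deck column, which contains too many NaN values to fill here
mydf.drop('deck', axis=1, inplace=True)

# Optionally re-check the remaining null counts per column
print(mydf.isnull().sum())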
After this, we again visualize the null values in each column using the
heatmap concept from the seaborn library, as shown in Figure 16.46. We can
observe that null values remain only in the two columns embarked and
embark_town.
Figure 16.46: Image displaying visualization after handling the null values
in deck column
The code snippet provided in Figure 16.47 first fills the null values in
the embarked column of the dataframe mydf with the mode (most frequent
value) of that column. The mode is obtained using the mode() function, and
[0] is used to access the first mode in case there are multiple modes. The
fillna() function is then used to replace the null values with the
calculated mode value. The same process is repeated for the embark_town column.
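A minimal sketch of the mode-based filling described above:
# Fill missing embarked values with the most frequent value (mode)
mydf['embarked'] = mydf['embarked'].fillna(mydf['embarked'].mode()[0])

# Repeat the same treatment for the embark_town column
mydf['embark_town'] = mydf['embark_town'].fillna(mydf['embark_town'].mode()[0])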
In the code snippet in Figure 16.49, dummy variables are created for
categorical columns sex and embarked from the dataframe mydf by using
mypd.get_dummies(). The drop_first=True parameter is set to drop the first
dummy variable to avoid multicollinearity. After creating the dummy
variables, the original categorical columns sex, embarked, and
embark_town are dropped from the dataframe mydf by using the drop()
method along the columns axis (axis=1). This step removes the original
categorical columns since they are no longer needed after creating dummy
variables. Finally, mydf.head() is used to display the first five rows of the
modified dataframe mydf, which now contains dummy variables for the sex
and embarked columns instead of the original categorical data.
Figure 16.49: Image displaying after dropping 3 columns from the dataset
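A sketch of the dummy-variable creation and column removal described above; the variable names mysex and myembarked are assumptions:
# Create dummy variables, dropping the first level to avoid multicollinearity
mysex = mypd.get_dummies(mydf['sex'], drop_first=True)
myembarked = mypd.get_dummies(mydf['embarked'], drop_first=True)

# Drop the original categorical columns that are no longer needed
mydf.drop(['sex', 'embarked', 'embark_town'], axis=1, inplace=True)

# Inspect the modified dataframe
print(mydf.head())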
The next code segment, shown in Figure 16.50, horizontally combines the
original dataframe mydf with the dummy variables generated for the sex and
embarked columns. Using mypd.concat(), it appends these dummy variables to
the original dataframe, expanding its features for subsequent analysis or
modeling.
Figure 16.50: Image displaying final dataset after appending with dummy
variables for sex and embarked columns
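Finally, a sketch of the horizontal concatenation, reusing the mysex and myembarked dummy frames from the previous sketch:
# Append the dummy-variable columns alongside the existing features
mydf = mypd.concat([mydf, mysex, myembarked], axis=1)
print(mydf.head())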
Conclusion
Points to remember
Questions
Join the book's Discord workspace for the latest updates, offers, tech
happenings around the world, new releases, and sessions with the authors:
https://discord.bpbonline.com
Index
A
annotate() 183
append() 86
apply() 301
arange() 32
arrays, categories
concatenate() 69, 71
dstack() 74, 75
hstack() 73
stack() 71, 72
vstack() 72-74
array_split() 81
astype() 254
axes_style() 380
barh() 185
Barplot 174
Barplot, steps
customizing 174
importing 174
bellman_ford 115
Broadcasting 53
Broadcasting, points 53
callable 219
cast() 353
collect() 362
concatenate() 69
ConvexHull 117
create_report() 468
Dataframe, aspects
column name, renaming 292, 293
dataprep 467
delete() 87
diag() 40
distplot() 408
dot() 89
drop() 293
drop_duplicates() 273
dsplit() 79
dstack() 75
duplicated() 271
Elements Searching 83
where() 83
empty() 42
endswith() 309
FacetGrid 430
head() 212
hist() 496
Histograms 195-197
hsplit() 78
hstack() 73
insert() 85
Inserting Elements 84
append() 86, 87
insert() 85, 86
integrate 120
integrate, type
Interpolation 124
Interpolation, types
1D Interpolation 124, 125
iris.csv 330
isin() 266
isnull 269
jointplot() 426
correlation 427
EDA 427
Cell 13
Kernel 13
menu bar 13
output area 13
tool bar 13
command mode 13
edit mode 13
In[] 14
Out[] 14
kdeplot() 408
KDTree() 118
CLM 147
MLC 146
comparisons 423
interpolation 423
alpha 149
linewidth/markersize 150
markerfacecolor 151
linspace() 34
linspace(), parameters
dtype 34
endpoint 34
num 34
retstep 34
start 34
stop 34
loc[] 281
minimize() 102
mydf.isnull() 505
myfunc 123
myfunc_eqn 104
myplt.bar() 174
myplt.grid() 157
myplt.title() 139
myplt.yscale() 167
notnull 269
nquad 122
flatten() 63
flat variable 64
ravel() 65, 66
reshape() 60-62
resize() 62
transpose() 66, 69
randint() 43, 44
shuffle() 44-48
numpy/scipy, differences 94
array_split() 81
dsplit() 79, 80
hsplit() 78, 79
split() 76, 77
vsplit() 78
nunique() 274
axes 239
columns 239
dtype 239
columns 237
copy 237
data 237
dtype 237
index 237
Pandas, reasons
features 319
performance 318
scalability 318
dictionary 208
copy 208
data 207
dtype 207
fastpath 208
index 207
name 208
plot() 155
plot.pie() 496
plotting_context() 382
Polars Data, concepts
flexibility 341
optimization 341
parallelism 341
regarding 341
aggregation 339
filtering 339
Dataframe 328-330
aggregating 360-362
folds 366
functions 351-353
operators 346-348
strings 358-360
pyplot 137
Python 2
Altair 137
Bokeh 137
Matplotlib 136
Plotly 137
Seaborn 137
randint() 43
dtype 43
high 43
low 43
size 43
RangeIndex 210
rank() 262
ravel() 65
rename() 290
resize() 62
scatter() 455
clustering 425
scipy, categories
angle 97
area 101
binary 97
energy 100
force 97
length 99
mass 98
metric 96
power 101
pressure 99
speed 100
temperature 100
volume 99
Seaborn, palettes
Diverging 389-391
Qualitative 386-388
set_style() 380
show() 156
sort() 81
sort_index() 260
sort_values() 256
splev 128
split() 76
splrep() 128
stack() 71
startswith() 309
str() 185
subplot() 202
tail() 213
text() 182
tight_layout 182
title() 156
transpose() 66
unique() 274
VisualStudio
reference link 16
vsplit() 78
vstack() 72
W
where() 83, 305