
Ultimate Data Science Programming in Python

Master data science libraries with 300+ programs, 2 projects, and EDA GUI tools
Saurabh Chandrakar


www.bpbonline.com
First Edition 2025

Copyright © BPB Publications, India

ISBN: 978-93-65895-667

All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system but may not be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY


The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.

All trademarks referred to in the book are acknowledged as properties of their respective owners but
BPB Publications cannot guarantee the accuracy of this information.

www.bpbonline.com
Dedicated to

My parents
Dr. Surendra Kumar Chandrakar and
Smt. Bhuneshwari Chandrakar
my brother
Shri Pranav Chandrakar
my sister-in-law
Smt. Silky Chandrakar
my wife
Smt. Priyanka Chandrakar
and my lovely son
Master Yathartha Chandrakar
About the Author
Saurabh Chandrakar is a Research and Development Engineer (Dy.
Manager) at Bharat Heavy Electricals Limited (BHEL) Hyderabad. He is
the winner of the best executive award in the Operations Division by BHEL
Hyderabad. He has been awarded the prestigious BHEL Excellence Award
under the Anusandhan category for the projects “Redundant Composite
Monitoring System of Power Transformers project” and “Innovation and
demonstration of Digital Sub Station with in-house developed IEC61850
compliant intelligent electronic devices and optical current transformers for
the year 2020-21”. He has 25 copyrights, 5 patents granted, and 2 patents
filed.

Moreover, he has published six books with reputed publishers such as BPB Publications New Delhi (Programming Techniques using Python, Python for Everyone, Building Modern GUIs with tkinter and Python, and Python GUI with PyQt), Scitech Publications Chennai (Programming Techniques using Matlab), and IK International Publishers (Microcontrollers and Embedded System Design). Additionally, he has also launched one video course on BPB titled “First Time Play with Basic, Advanced Python Concepts and Complete Guide for different Python certification exams all under one umbrella.”
About the Reviewers
Nirakar Padhy is an experienced Senior Data Scientist with a strong
academic background, holding a Bachelor’s degree in Mathematics
from Mithibai College and a Master’s degree in Machine Learning and
Artificial Intelligence from Liverpool John Moores University.

He specializes in advanced Machine Learning (ML) and Artificial Intelligence (AI) algorithms, mathematics, and statistics, along with strong skills in Python programming, data analysis, and distributed computing. Nirakar is also highly proficient in ML engineering, MLOps, model deployment and monitoring, CI/CD pipelines, and cloud computing platforms such as Amazon Web Services (AWS).

Dedicated to mentoring the next generation of data professionals, Nirakar conducts workshops where he shares his extensive knowledge and practical insights, particularly in statistics, data science, machine learning, and MLOps.

Outside of his professional pursuits, Nirakar is an avid movie enthusiast with a keen interest in personal finance. He enjoys playing cricket to stay active and relax and is always on the lookout for new hobbies and interests to explore.

Prasenjeet Damodar Patil received a B.E. in E&TC Engineering from Sant Gadgebaba Amravati University and an M.Tech. from Walchand College of Engineering, Sangli, India. He received his PhD in E&TC Engineering from Sant Gadgebaba Amravati University. He has 15+ years of teaching experience. Currently, he is working as a Professor at the School of Computing, M.I.T A.D.T University, Pune. He has published more than 20 papers in reputed journals. His research interests include the Internet of Things and Digital Image Processing. He is also a certified Google Data Analytics Professional.
Acknowledgement
First and foremost, I would like to thank you all for selecting this book. It
has been written with a beginner reader in mind. I take this opportunity to
greet and thank my mentor “Prof. Nilesh Bahadure Sir” for motivating me
and always sharing his expertise on topics related to Python. I am very thankful to be his protégé. I appreciate his belief in me, for always standing behind me and pushing me to achieve more. The phrase “A journey of a thousand miles begins with a single step” is something he always reminds me of.

I thank my parents, Dr. Surendra Kumar Chandrakar and Smt. Bhuneshwari Chandrakar, my brother, Shri Pranav Chandrakar, my sister-in-law, Silky Chandrakar, my beloved wife, Mrs. Priyanka Chandrakar, my adorable son, Yathartha Chandrakar, and all of my friends who have inspired me and given me confidence over the years. Last but not least, I would like to express my sincere gratitude to the staff at BPB Publications for their contributions and insights that made parts of this book possible.
Preface

The purpose of this book is to introduce individuals to the dynamic field of


data science in Python, unraveling the capabilities of essential libraries that
drive data analysis and manipulation. For those with varied levels of
programming experience, this book serves as a gateway to the intricate
world of NumPy, SciPy, Matplotlib, Pandas, Polars, Seaborn, and usage of
ChatGPT for all these open-source libraries. Readers will embark on a
hands-on journey, navigating through a multitude of solved examples that
illuminate the practical application of each library. Beginning with
foundational concepts, the chapters progressively explore advanced
functionalities, empowering readers to harness the full potential of these
libraries. Whether it is mastering the manipulation of data with pandas,
visualizing insights with Matplotlib, or leveraging the performance-oriented
polars library, this book provides a comprehensive guide for both beginners
and seasoned data scientists. With a blend of theoretical understanding and
practical implementation, readers will acquire the skills necessary to tackle
real-world data challenges and unlock the vast possibilities within the
Python data science ecosystem. By mastering open-source libraries in data
science, readers will be able to apply this knowledge to solve real-world
problems and work on various useful projects according to their needs.
The first part of the book is dedicated to an in-depth exploration of the
NumPy library, laying the groundwork for proficient data manipulation in
Python. Readers will explore fundamental concepts such as the creation of
NumPy arrays, understanding the distinctions between lists and arrays, and
the application of arithmetic operations. The chapter also sheds light on
advanced topics like broadcasting and matrix multiplication. With practical
examples, this section ensures a comprehensive understanding of NumPy’s
capabilities, equipping readers with the skills necessary to efficiently handle
numerical data for various data science applications. Then, we will unfold
the powerful functionalities of both SciPy and Matplotlib, extending the
readers’ capabilities in the realm of scientific computing and data
visualization. The exploration of SciPy introduces concepts like optimizers,
sparse data handling, graph algorithms, and integration techniques, enabling
readers to tackle a broad spectrum of scientific and engineering challenges.
Transitioning seamlessly, Matplotlib is unveiled as a quintessential tool for
data visualization, covering an array of plots such as line plots, bar plots,
pie charts, histograms, scatter plots, and subplots. Various practical
examples in this section will provide a comprehensive understanding of
how to effectively communicate complex data through visually compelling
plots, setting the stage for advanced data exploration.
In the later part of the book, the focus shifts to pandas, a versatile library for
data manipulation and analysis. Readers will master the creation of pandas
series and dataframes, along with advanced techniques like filtering,
sorting, and aggregation. The exploration extends to the polars library,
emphasizing its modern approach to dataframe manipulation and its
performance advantages. The chapter on seaborn delves into statistical data
visualization, covering essential plots like heatmaps, box plots, and scatter
plots. Additionally, readers will discover the capabilities of ChatGPT in
conjunction with open libraries of data science. With a rich array of solved
examples in this section, readers will acquire a holistic skill set,
empowering them to tackle diverse data challenges and innovate in their
data science endeavors.
This book is divided into 16 chapters. Each chapter description is listed as
follows:
Chapter 1: Environmental Setup for Using Data Science Libraries in
Python - This chapter covers the key aspects of setting up an effective
Python programming environment, highlighting the importance of using an
Integrated Development Environment (IDE). It begins with a step-by-
step guide to installing Jupyter Notebook on a Windows platform, along
with an overview of its functionalities. The focus will then move to
installing Visual Studio Code (VSCode) for Python development and
exploring its features for coding. Finally, the chapter introduces essential
Python data science libraries, providing a foundation for learners to utilize
these powerful tools in their programming journey.
Chapter 2: Exploring Numpy Library for Data Science in Python - This
chapter covers a clear understanding of the comparisons between lists and
numpy arrays and what they entail. We will see the creation of ndarrays
through different methods, including utilizing list and tuple data structures.
We will also demonstrate various functions using Python code snippets and look into ndarray creation with random values using the random module. The
distinction between view and copy in numpy will be explained using
examples that are crucial for memory efficiency. We will learn different
methods to access individual elements or subsets of elements in ndarrays.
We will explore how to iterate over elements of ndarrays using loops and
explore various arithmetic operators available in numpy. In the end, we will
learn how to use broadcasting to perform operations on ndarrays with
different shapes.
Chapter 3: Exploring Array Manipulations in Numpy - This chapter covers the various array manipulation functions and variables available in numpy. We will understand different methods for joining
ndarrays and explore how to split ndarrays into smaller ones. Sorting of
ndarrays in numpy will also be discussed. Some of the search functions available in numpy will be explored, along with the insertion and deletion of elements into and from ndarrays. The usage of the dot
function in numpy for matrix multiplication will be explored. Finally, the
linalg module in numpy for linear algebra will be well understood with
various examples.
Chapter 4: Exploring Scipy Library for Data Science in Python - This chapter will cover the basic differences between numpy and scipy arrays. The
aim is to explore the scipy constants which represent physical quantities,
mathematical, scientific, and other useful values. We will also discuss the
optimizers in scipy where different optimization algorithms available will
be discussed with some examples, like finding the roots of an equation and
many more. Additionally, we will study sparse data and sparse matrix
representation in scipy, enabling efficient storage and manipulation of large,
mostly empty matrices. Different sparse matrix types in scipy will be well
explored. We will learn about graphs in scipy where the module provides
several functions to effectively analyze and work with sparse graphs. The
chapter covers integration techniques in scipy which are useful for
numerical integration of functions. In the end, we will discuss the
interpolation methods in scipy which enables the estimation of values
between known data points.
Chapter 5: Line Plot exploration with Matplotlib Library - This chapter
covers Python’s data visualization tools along with a wide array of
techniques and concepts to enhance our ability to represent and analyze data
effectively. First to learn will be line plots where we will explore the
creation of line plots by passing two ndarrays, adding essential elements
like titles, x-labels, and y-labels, and advancing our skills with properties
like linestyle, color, alpha, linewidth, markersize, markerfacecolor, and
figure size customization. We will also learn how to plot multiple lines in a single plot, create line plots from a single ndarray, incorporate grid lines (both major and minor), and effectively utilize legends, xlim, and ylim functions.
Chapter 6: Charting Data with Various Visuals Using Matplotlib - This
chapter covers the bar plots where various aspects such as changing the
color, width, bottom position, left alignment and right alignment of
individual bars, as well as adding labels to bars will be learned. We will also
venture into Horizontal Bar Charts, Stacked Bar Charts (both vertical and
horizontal), and Grouped Bar Charts, providing us with a comprehensive
understanding of bar chart variations. Additionally, we will master the art of
creating Pie Charts, Histograms, Scatter Plots, and subplots in Matplotlib,
equipping us with a versatile toolkit for data visualization and analysis in
Python.
Chapter 7: Exploring Pandas Series for Data Science in Python - This
chapter covers a comprehensive range of topics related to pandas series and
dataframes in Python, equipping readers with the skills and knowledge needed
for effective data manipulation and analysis. We will start by exploring
pandas series, covering fundamental aspects such as series creation, data
access, slicing, and filtering. We will also discuss more advanced topics like
using callable objects for selection, leveraging useful attributes, applying
functions with the ‘apply’ method, aggregating data, and performing basic
arithmetic operations. The chapter will further emphasize series
transformations and iteration for a well-rounded understanding.
Chapter 8: Exploring Pandas Dataframe for Data Science in Python -
This chapter covers pandas dataframes. Readers will gain insights into dataframe construction and the methods and attributes applicable
only to dataframes. We will address the critical task of handling missing
data, arithmetic operations for dataframes, and the addition of new columns.
The usage of the ‘fillna’ method for handling missing values will also be
covered. Sorting, ranking, and filtering data within dataframes will be
thoroughly explored along with techniques for checking data inclusion,
identifying missing values, and handling duplicates.
Chapter 9: Advanced Dataframe Filtering Techniques - This chapter
covers various methods for renaming index labels and column names, as
well as efficient ways to delete rows and columns from dataframes. It will
introduce the powerful ‘query’ method for data filtering and explore
additional advanced techniques such as using ‘apply’ to manipulate
dataframe data and finding the largest and smallest values with ‘nlargest’
and ‘nsmallest.’ Text data handling within pandas will be addressed,
demonstrating how to filter dataframe rows with string methods. Practical
examples of data modification in dataframes will provide readers with
hands-on experience in real-world data manipulation.
Chapter 10: Exploring Polars Library for Data Science in Python - This
chapter covers a comprehensive exploration of the polars data manipulation
library in Python. First, we will highlight the key differences between
pandas and polars, shedding light on the unique advantages and features
that polars brings to the table. Then, we will explore polars data types,
categorizing them into Numeric, Nested, Temporal, and Others Groups,
providing a solid understanding of the data structures. Readers will then
venture into the heart of polars, gaining insights into its data structures,
including series and dataframe, and learning how to work with them
effectively. Next, we will unveil the concept of contexts in polars, focusing
on essential operations, selection, and filtering techniques. The groupby
concept is demystified, enabling readers to harness the power of polars for
data aggregation and summarization. The concept of a Lazy API will be
well elucidated, demonstrating how it enhances performance and resource
utilization.
Chapter 11: Exploring Expressions in Polars - This chapter will cover
the subtleties of expressions in polars along with basic operators, column
selections, functions, casting, string operations, aggregation, handling
missing data, and leveraging folds, lists, and arrays. We will also introduce
the integration of numpy in polars, offering a bridge between these two
powerful libraries. Finally, readers will gain a deep understanding of how different operations are performed with polars compared to pandas, thus helping them make informed choices when working with data in Python.
Chapter 12: Exploring Seaborn Library for Data Science in Python -
This chapter covers some basic statistical terms, such as quantitative and qualitative variables and the mean, and their significance in data analysis. Readers will understand the distinction between variables that
represent numerical measurements (quantitative) and those that represent
categories or labels (qualitative). The reader will also be familiar with
various built-in datasets available in seaborn, gaining practical exposure to
real-world datasets for data visualization and analysis. Furthermore, the
reader will comprehend the principles of plot styling in seaborn, learning
how to customize the aesthetics of plots for better presentation and
interpretation. The concept of color palettes in seaborn, including
qualitative, sequential, and diverging palettes, will be covered, enabling the
reader to choose appropriate color schemes for different types of data and
visualizations.
Chapter 13: Crafting Seaborn Plots: KDE, Line, Violin, and Facets -
This chapter covers the skills to create various advanced plots in seaborn,
such as heatmap plots for visualizing matrices of data, KDE plots for
exploring the distribution of a single variable, violin plots for combining
aspects of box plots and KDE plots, line plots for depicting trends, scatter
plots for visualizing the relationship between two variables, joint plots for
combining different types of plots in a single grid, and FacetGrid for
creating grids of subplots based on categorical variables.
Chapter 14: Integrating Data Science Libraries with ChatGPT
Prompts - This chapter covers the topics to illuminate readers on the
versatile applications of ChatGPT in the exploration of open-source data
science libraries. Firstly, it aims to guide users in formulating effective
prompts for ChatGPT, tailored to each library, including Numpy, Scipy,
Matplotlib, Pandas, Polars, and Seaborn to extract valuable insights and
coding solutions. We will cover various topics, including leveraging
ChatGPT for generating insightful prompts and queries to explore data
science libraries, seeking assistance for Python code utilizing these
libraries, collaboratively generating comprehensive cheat sheets, and
addressing challenges faced by novice users.
Chapter 15: Exploring Automated EDA Libraries for Machine
Learning - This chapter will introduce learners to a diverse array of Python
libraries tailored for data visualization and analysis, each uniquely
contributing to an enriched data exploration experience. Beginning with the
PyGWalker Visual library, readers will discover how to seamlessly integrate
it with datasets to create visually appealing representations. The exploration
continues with the dataprep library, offering insights into efficient data
preparation and analysis. Subsequently, learners will explore the
capabilities of the autoviz and pandas_visual_analysis libraries, each
providing distinct functionalities and visualization approaches to enhance
the understanding and interpretation of datasets.
Chapter 16: Case Studies Using Data Science Libraries - This chapter
covers a practical application of data science methodologies by exploring
two distinct datasets. The first case study focuses on an Electrical Fault Classification dataset. This dataset is designed to help identify and
categorize various types of electrical faults in power systems. This dataset
typically includes features such as voltage, current, and other relevant
electrical parameters collected over time or during specific fault events. By
analyzing these features, we will explore no-fault and different fault
conditions. Libraries like Pandas and numpy will be used for data
manipulation and preprocessing. Visualization libraries such as Matplotlib
and Seaborn help in understanding the data distribution and model
performance. The second case study focuses on the Titanic dataset, utilizing
a range of data science libraries. Learners will employ popular Python
libraries such as pandas, numpy, and seaborn to preprocess and visualize the data, gaining insights into the factors influencing passenger survival while paying particular attention to null values.
Code Bundle and Coloured Images
Please follow the link to download the
Code Bundle and the Coloured Images of the book:

https://rebrand.ly/bzpl6rq
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Ultimate-Data-Science-
Programming-in-Python. In case there’s an update to the code, it will be
updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos
available at https://github.com/bpbpublications. Check them out!

Errata
We take immense pride in our work at BPB Publications and follow best
practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we
use their inputs to reflect and improve upon human errors, if any, that may
have occurred during the publishing processes involved. To let us maintain
the quality and help us reach out to any readers who might be having
difficulties due to any unforeseen errors, please write to us at :
errata@bpbonline.com
Your support, suggestions, and feedback are highly appreciated by the BPB Publications family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.bpbonline.com and as a print book
customer, you are entitled to a discount on the eBook copy. Get in touch with us at :
business@bpbonline.com for more details.
At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.

Piracy
If you come across any illegal copies of our works in any form on the internet, we would be
grateful if you would provide us with the location address or website name. Please contact us at
business@bpbonline.com with a link to the material.

If you are interested in becoming an author


If there is a topic that you have expertise in, and you are interested in either writing or
contributing to a book, please visit www.bpbonline.com. We have worked with thousands of
developers and tech professionals, just like you, to help them share their insights with the global
tech community. You can make a general application, apply for a specific hot topic that we are
recruiting an author for, or submit your own idea.

Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site
that you purchased it from? Potential readers can then see and use your unbiased opinion to make
purchase decisions. We at BPB can understand what you think about our products, and our
authors can see your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.

Join our book’s Discord space


Join the book’s Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
Table of Contents
1. 1. Environmental Setup for Using Data Science Libraries in Python

1. Introduction

2. Structure

3. Objectives

4. Introduction to Python

5. Setup installation in Windows for Jupyter Notebook

6. Insights of Jupyter Notebook

7. Demo program using Jupyter Notebook

8. Demo program using VSCode

9. Introduction to data science libraries in Python

10. Conclusion

11. Points to remember

12. Questions

2. 2. Exploring Numpy Library for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Commonalities shared by lists and numpy arrays


5. Contrasts separating lists from numpy arrays

6. The process of creating numpy arrays

1. Creating a 1-D array using a list

2. Creating a 1-D array using a tuple

3. Creating a 2-D array using nested lists

4. Creating an array with a specific dtype

5. Creating an array of object type

6. Creating a 1-D array with arange() function

7. Array creation using linspace() function

8. Utilizing the zeros() function in numpy

9. Utilizing the ones() function in numpy

10. Utilizing the full() function in numpy

11. Using eye() function

12. Utilizing diag() function

13. Utilizing empty() function

14. Contrast between zeros and empty function

7. Using random module for ndarray creation

1. randint()

2. shuffle()

8. Elements access of ndarray


9. Iterate elements of ndarray

10. Arithmetic operators in numpy

11. Broadcasting concept in numpy

12. Conclusion

13. Points to remember

14. Questions

15. References

3. 3. Exploring Array Manipulations in Numpy

1. Introduction

2. Structure

3. Objectives

4. Array manipulation functions/variables in numpy

1. reshape()

2. resize()

3. flatten()

4. flat variable

5. ravel()

6. transpose()

5. Multiple arrays joining into a single array

1. concatenate()
2. stack()

3. vstack()

4. hstack()

5. dstack()

6. Splitting of arrays in numpy

1. split()

2. vsplit()

3. hsplit()

4. dsplit()

5. array_split()

7. Sorting of ndarray in numpy

8. Elements searching of ndarray in numpy

1. where()

2. condition based selection

9. Elements insert into ndarray

1. insert()

2. append()

10. Elements delete from ndarray in numpy

1. delete()

11. Usage of dot function for matrix multiplication


12. Usage of linalg module in numpy

13. Conclusion

14. Points to remember

15. Questions

4. 4. Exploring Scipy Library for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Differences between numpy and scipy

5. Exploring scipy constants

1. Metric

2. Force

3. Binary

4. Angle

5. Mass

6. Time

7. Pressure

8. Length

9. Volume

10. Speed
11. Temperature

12. Energy

13. Power

14. Area

6. Optimizers in scipy

7. Sparse data and sparse matrix in scipy

1. Block sparse row matrix

2. Coordinate list matrix

3. Compressed sparse column matrix

4. Compressed sparse row matrix

5. Diagonal sparse matrix

6. Dictionary of keys sparse matrix

7. Row-based linked list sparse matrix

8. Graphs in scipy

1. Adjacency matrix

2. connected_components method

3. Dijkstra method

4. Floyd_Warshall method

9. Spatial data in scipy

1. Triangulation function
10. Integrate in scipy

1. Single integration

2. Multiple integration

11. Interpolation in scipy

1. 1D interpolation

2. Spline interpolation

3. splrep function

4. splev function

5. Interpolation with radial basis function

12. Conclusion

13. Points to remember

14. Questions

5. 5. Line Plot exploration with Matplotlib Library

1. Introduction

2. Structure

3. Objectives

4. Python data visualization tools

1. Line plot

5. Line plot creation by passing 2 ndarrays

6. Adding title, xlabel and ylabel to the line plot


7. Advanced line plot

8. Linestyle property

9. Color property

10. Default color

11. Quick settings for color, marker and linestyle

1. mlc form

2. clm form

3. When unspecified then default color is blue

12. Alpha property

13. Linewidth and markersize property

14. Markerfacecolor property

15. Customizing the figure size

16. Plotting multiple lines in a same plot

17. Line plot creation by passing a single ndarray

18. Addition of grid lines to a plot

19. Major and minor grid lines display

20. Legends display to a plot

21. Using xlim and ylim functions to a plot

22. Conclusion

23. Points to remember


24. Questions

6. 6. Charting Data with Various Visuals Using Matplotlib

1. Introduction

2. Structure

3. Objectives

4. Barplot

5. Vertical bar chart and simple bar chart

1. Color change of each bar

2. Width change of each bar

3. Bottom change of each bar

4. Left alignment of each bar

5. Right alignment of each bar

6. Enhancing Bar visualization with labels

6. Horizontal bar chart

7. Stacked bar chart

8. Other bar charts

1. Pie chart

2. Histogram

3. Scatter plot

4. Subplot in matplotlib
9. Conclusion

10. Points to remember

11. Questions

7. 7. Exploring Pandas Series for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Pandas series

5. Pandas series constructor

6. Creating pandas series by passing a list

7. Creating pandas series by passing a dictionary

8. Creating pandas series by passing a numpy array

9. Creating pandas series from scalar value

10. Creating pandas series using RangeIndex

11. Creating pandas series with dtype and name

12. Exploring pandas series using copy parameter

13. Accessing series values using head and tail methods

14. Accessing elements in the pandas series

15. Pandas series slicing

16. Values extraction using loc and iloc indexers


17. Masking using Booleans for condition-based selection

18. Pandas series filtering

19. Callable object usage in selecting elements

20. Usage of the apply method to Pandas series

21. Aggregating of pandas series

1. Series object transformation

2. Iterate elements of the series

22. Conclusion

23. Points to remember

24. Questions

8. 8. Exploring Pandas Dataframe for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Pandas dataframe

5. Pandas dataframe constructor

6. Pandas dataframe creation

1. Pandas methods and attributes for dataframe

2. Dropping dataframe rows with null values

3. Arithmetic operators for dataframes


4. New columns addition to dataframes

5. Usage of fillna() method to the dataframe

6. Sorting values of the dataframe

7. Sorting dataframe by multiple columns

8. Sorting dataframe based on index

9. Ranking series of values by using the rank() method

10. Filter data from dataframe

11. Inclusion check in dataframe

12. Usage of isnull and notnull method

13. Usage of duplicated method

14. Usage of drop_duplicates method in dataframe

15. Usage of unique and nunique methods

7. Conclusion

8. Points to remember

9. Questions

9. 9. Advanced Dataframe Filtering Techniques

1. Introduction

2. Structure

3. Objectives

4. Approaches of setting customized columns as index


5. Rows selection by index label with loc[] indexer

6. Accessing rows of dataframe by using index position

7. Renaming index and column names in dataframe

8. Renaming of index labels

9. Renaming of column names

10. Rows and columns deletion from the Dataframe

11. Filter dataframe using query method

12. Without using the query method

13. in and not in operators using the query method

14. Handling spaces in column names with query

15. Output:

16. Usage of apply() method to pandas dataframe

17. Using nlargest() and nsmallest() in pandas dataframe

18. Filtering dataframe with where() method

19. Handling of text data in Pandas

20. Filtering dataframe rows with string methods

21. Modifying data in pandas dataframe

22. Example of data aggregation in Pandas DataFrame

23. Conclusion

24. Points to remember


25. Questions

10. 10. Exploring Polars Library for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Preference of polars over pandas

5. Differences between pandas and polars

6. Polars data types

1. Numeric group

2. Nested group

3. Temporal group

4. Others group

7. Data structures in polars

1. Series

2. Dataframe

8. Contexts in polars

1. Basic operations, select and filter concept

2. GroupBy concept

9. Importance of expressions in polars

10. Concept of lazy API in polars


11. Conclusion

12. Points to remember

13. Questions

11. 11. Exploring Expressions in Polars

1. Introduction

2. Structure

3. Objectives

4. Concept of expressions in polars

1. Basic operators

2. Column selections

3. Functions

4. Casting

5. Strings

6. Aggregation

7. Missing data

8. Folds

9. List and arrays

10. Usage of numpy in polars

5. Operations comparison between pandas and polars

6. Conclusion
7. Points to remember

8. Questions

12. 12. Exploring Seaborn Library for Data Science in Python

1. Introduction

2. Structure

3. Objectives

4. Some basic statistical terms to know

1. Quantitative variables

2. Qualitative variables

5. Built in seaborn datasets

6. Plot styling in Seaborn

1. Concept of color palette in Seaborn

2. Qualitative palettes

3. Sequential palettes

4. Diverging palettes

7. Heatmap plot in Seaborn

8. The box plot in seaborn

9. Conclusion

10. Points to remember

11. Questions
13. 13. Crafting Seaborn Plots: KDE, Line, Violin, and Facets

1. Introduction

2. Structure

3. Objectives

4. KDE plot in seaborn

5. The violin plot in seaborn

6. Line plot in seaborn

7. The scatter plot in seaborn

8. Facet grid in seaborn

9. Conclusion

10. Points to remember

11. Questions

14. 14. Integrating Data Science Libraries with ChatGPT Prompts

1. Introduction

2. Structure

3. Objectives

4. ChatGPT prompts for exploring data science libraries

5. Querying ChatGPT for Python code with data science libraries

6. Querying ChatGPT with Python error

7. Data science library cheat sheets using ChatGPT


8. Novice challenges querying Python code via ChatGPT

9. Conclusion

10. Points to remember

11. Questions

15. 15. Exploring Automated EDA Libraries for Machine Learning

1. Introduction

2. Structure

3. Objectives

4. Usage of PyGWalker visual library with a dataset

5. Usage of dataprep library with a dataset

6. Usage of autoviz library with a dataset

7. Usage of pandas_visual_analysis library with a dataset

8. Conclusion

9. Points to remember

10. Questions

16. 16. Case Studies Using Data Science Libraries

1. Introduction

2. Structure

3. Objectives

1. Case study on electrical fault classification


2. Case study on titanic dataset

4. Conclusion

5. Points to remember

6. Questions

17. Index
CHAPTER 1
Environmental Setup for Using Data Science Libraries in Python
Introduction

In this chapter, we will explore the data science libraries in Python. However,
before learning about various data science libraries, it is quite important to
create an environmental setup for installing and using these data science
libraries in Python. Setting up the environment for utilizing data science
libraries like NumPy, SciPy, Matplotlib, Pandas, and others in Python
ensures effective data analysis and modeling workflows. This entails
managing dependencies, version control, and package management to
guarantee project compatibility and reproducibility. By creating isolated
environments, potential conflicts between different library versions are
mitigated, facilitating seamless collaboration and reproducibility of results.
So, let us get into the intricate details of Python installation and Integrated
Development Environments (IDEs) like VSCode and Jupyter Notebook for
writing and executing the Python code.
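
For instance, once an environment is set up, a quick way to confirm that the core data science libraries are importable, and to record their versions for reproducibility, is a short check like the sketch below. It assumes NumPy, SciPy, Matplotlib, and Pandas have already been installed with pip or conda:

# Quick environment check: confirm the key libraries import and print their versions
import numpy, scipy, matplotlib, pandas

for lib in (numpy, scipy, matplotlib, pandas):
    # each of these libraries exposes its version string as __version__
    print(lib.__name__, lib.__version__)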

Structure

In this chapter, we will discuss the following topics:

Introduction to Python
Setup Installation in Windows for Jupyter Notebook
Insights of Jupyter Notebook
Demo program using Jupyter Notebook
Introduction to Data Science Libraries in Python

Objectives

This chapter will discuss the essential aspects of setting up a conducive


programming environment for Python, emphasizing the significance of an
IDE. The discussion commences with the installation process for the
Jupyter Notebook on a Windows platform, guiding learners through the
setup and providing insights into its functionality. Subsequently, the focus
shifts to installing Visual Studio Code (VSCode) for Python development,
exploring its usage for code development. The chapter will conclude with
an introduction to fundamental data science libraries in Python, laying the
groundwork for learners to leverage these powerful tools in their
programming journey.

Introduction to Python

Before we explore this topic, it is essential to set up our development


environment. In this chapter, we will guide you through installing Python
on your system, ensuring you have everything you need to start writing and
executing Python code. Python is an open-source, high-level programming
language renowned for its simplicity, readability, and extensive support for
various programming paradigms. Whether you are a beginner taking your
first steps into programming or an experienced developer seeking a
powerful tool for web development, data analysis, machine learning, or
scientific computing, Python has something to offer. To begin harnessing
the power of Python, the first step is to install the Python interpreter on
your computer. The Python interpreter is the core component that executes
Python code and provides access to the vast array of libraries and tools
available in the Python ecosystem. In this chapter, we will walk you through the installation process for Python on Windows, macOS, and Linux operating systems.

The steps for installing Python on Windows are mentioned below:

1. Visit the Python website: https://www.python.org/downloads/ to


download the latest version of Python. Here, we have downloaded the
latest Python version 3.12.2.
2. We are using a 64-bit Windows OS and will run the downloaded Python installer file.
3. The following window will pop out, and the user can choose Install
Now or Customize installation. Here, we will be selecting the Install
Now option. Also, remember initially, there are two unchecked
checkboxes which are:
1. Use admin privileges when installing py.exe.
2. Add python.exe to PATH.
4. Use admin privileges when installing py.exe. This option grants the
Python installer administrative privileges during installation. It is
essential for installing Python in system-wide directories or when
installing packages that demand elevated permissions. Enabling this
ensures a smooth installation process without encountering
permission-related issues. This option can also be used to change the
Python installation folder.
5. Add python.exe to PATH. By selecting this option, the Python installer
adds the directory containing python.exe to the PATH environment
variable. This inclusion allows easy access to Python commands from
any command prompt or terminal window without specifying the full
path to the Python executable. It streamlines the usage of Python
across the system, enhancing convenience for running Python scripts
and commands.

The steps for installing Python on macOS are mentioned below:

1. Visit the Python website: Go to https://www.python.org/downloads/ to


download the latest version of Python for macOS.
2. Run the installer: Open the downloaded .pkg file and follow the
instructions in the installation wizard.
3. Verify the installation: Open a terminal and type python3 --version to
confirm that Python has been installed correctly.

The steps for installing Python on Linux are mentioned below:

1. Use the package manager: Most Linux distributions come with Python
pre-installed. However, if you need to install or upgrade Python, you
can use the package manager. For example, on Ubuntu, you can use
the following commands:

sudo apt update

sudo apt install python3

2. Verify the installation: Open a terminal and type python3 --version to


confirm that Python has been installed correctly.
We have used Python version 3.11.4 in this book; this micro release of Python 3.11 contains bug fixes and minor enhancements compared to earlier micro releases within the Python 3.11 series. However, to walk you through all the installation steps, we will be showing Python version 3.12.2, which is the latest version as of 6 February 2024. By the end of this
chapter, readers will have a fully functional Python environment set up on
their system, ready to embark on their programming journey. Whether you
are a student, professional, or hobbyist, Python offers a welcoming and
intuitive platform for turning your ideas into reality.
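
As a quick sanity check after installation, the interpreter version can also be confirmed from within Python itself; the snippet below is a minimal sketch that works the same way on Windows, macOS, and Linux:

import sys

# Prints the version of the running interpreter, for example 3.11.4 or 3.12.2
print(sys.version)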


In most cases, it is recommended that you leave these checkboxes checked


to ensure a smooth installation experience and convenient Python usage on
your system. Refer to the following figure for a better understanding:

Figure 1.1: Python installation on running the Python executable

Refer to the following steps for a better understanding:

1. We will be selecting the Install Now option as a recommended option


where the default installation path will be C:\Users\
[user]\AppData\Local\Programs\Python\Python[version] for the
existing user, which will include the IDLE, pip, and documentation
and thus create shortcuts and file associations as mentioned in Figure
1.1.
2. Once the installation is complete, the window shown in Figure 1.2 will pop up, and the user can close it.

Figure 1.2: Image depicting the successful installation of Python version


Once Python is successfully installed, the user can view the Python version
using IDLE (Python 3.12 64-bit), which is installed and can be searched in
Windows apps. Pythonʼs built-in IDE will be opened when IDLE is run.
Another way is that the user can navigate to the directory where Python is
installed on the system and double-click python.exe.

Setup installation in Windows for Jupyter Notebook

Now, we shall view the steps of installing Anaconda on Windows OS:

1. First, we will download the Anaconda installer by visiting the website


https://www.anaconda.com/download. Here, we have downloaded
Anaconda3 2024.02-1.
2. Once downloaded, we will run this installer file, and we are just
getting started with the pop-up of the following image file, as shown
in Figure 1.3:

Figure 1.3: Image depicting Step-1 of getting started of Anaconda3

3. On clicking the Next button of Figure 1.3, read the License Agreement
as shown in Figure 1.4:

Figure 1.4: Image depicting Step-2 of license agreement of Anaconda3

4. On clicking Next (refer to Figure 1.4), we will be prompted to choose an installation type: either Just Me (Recommended) or All Users (which requires admin privileges), as shown in Figure 1.5. The Just Me option installs Anaconda for the current user account only, while the All Users option installs it for all user accounts and requires administrator privileges.
Figure 1.5: Image depicting Step 3 of Installation Type of Anaconda3

5. Then, clicking the Next button (shown in Figure 1.5) will prompt us to
select a destination folder to install Anaconda. Install Anaconda to a
directory path that avoids spaces or unicode characters, as shown in
Figure 1.6:

Figure 1.6: Image depicting Step-4 of Destination folder selection of


Anaconda3

6. Then, when clicking the Next button in Figure 1.6, we will see the
advanced installation option, as shown in Figure 1.7:

Figure 1.7: Image depicting Step-5 of Advanced Installation options of


Anaconda3

This Figure 1.7 provides us with three checkbox options to select:

Create start menu shortcuts (supported packages only): This option


creates shortcuts in the Start menu for supported Anaconda packages,
making it easier to access Anaconda Navigator, Anaconda Prompt,
and other tools. You will use these tools frequently in your data science work.
Register Anaconda3 as the system Python 3.11: Enabling this sets
Anaconda Python as the default interpreter for the system, meaning
references to Python will point to the Anaconda installation. It is
helpful if you prefer using Anaconda Python for all your Python-
related tasks. However, if you have other Python installations or prefer
managing Python environments manually, you may choose to keep
this unchecked.
Clear the package cache upon completion: Enabling this option clears
the package cache after installation, freeing up disk space. This cache
stores downloaded packages and files used during installation. While
clearing it saves space, leaving it unchecked retains cached files,
which can reduce download times for future updates or reinstalls of
packages.

We are checking all three of the above options.

7. Clicking the Install button in Figure 1.7 will initiate the installation
process, as shown in Figure 1.8. The user may click the Show Details
button to observe the packages installed on Anaconda3.

Figure 1.8: Image depicting Step-6 of packages installation of Anaconda3

8. When the installation is complete, we will be prompted with the


following image where they can click the Next button, as shown in
Figure 1.9:

Figure 1.9: Image depicting complete installation of packages of


Anaconda3

9. Then, clicking the Next button brings up Figure 1.10, which displays the cloud notebook service of Anaconda:

Figure 1.10: Image depicting cloud notebook service of Anaconda3

10. The Anaconda Distribution Tutorial offers a comprehensive guide to


installing and utilizing Anaconda, a widely used platform for data
science and machine learning. This tutorial covers essential aspects,
including installation on various operating systems, setup of Python
environments, package management with conda and pip, integration
with popular IDEs like Jupyter Notebook and Spyder, practical usage
examples for data analysis and machine learning, and access to
community resources for support and learning. Getting Started with
Anaconda provides a beginner-friendly introduction, focusing on
installation, setup, and basic usage to empower users to begin their
data science journey effectively with Anacondaʼs robust tools and
libraries. Finally, check the following two checkboxes, as shown in
Figure 1.11:

Figure 1.11: Anaconda distribution tutorial and getting started guide

You may check or uncheck them as you wish. We are getting started with Anaconda and clicking the Finish button at the end, thus completing the installation of Anaconda3 2024.02-1 (64-bit).

In Figure 1.12, some pre-installed tools available on Anaconda Navigator


can be viewed, and the user can Launch a Jupyter Notebook on their default
browser by clicking it, as shown below:

Figure 1.12: Image depicting the Launch of Jupyter Notebook on Anaconda


Navigator

Note: If any reader is interested in installing Anaconda on macOS / Linux,


then you may refer to the following two links as references:

https://docs.anaconda.com/anaconda/install/mac-os/

https://docs.anaconda.com/anaconda/install/linux/

Insights of Jupyter Notebook

In the previous topic, we demonstrated how to launch a Jupyter Notebook


using the Anaconda Navigator. Now, we will use the Anaconda Prompt to launch it by typing jupyter notebook, as shown in Figure 1.13:

Figure 1.13: Image depicting the Launch of Jupyter Notebook using


anaconda prompt
Using the Jupyter Notebook web application, we can run code, embed explanatory text, and visualize results all under a single umbrella. On launching the Jupyter web application, the user will be prompted as depicted in Figure 1.14:

Figure 1.14: Image depicting Jupyter Notebook web app

A new folder called Demo1 is created on the Desktop, as shown in Figure


1.15:

Figure 1.15: Image depicting Desktop/Demo1 folder in Jupyter notebook

In Figure 1.15, we can see that a Jupyter notebook file named Untitled1.ipynb (with the .ipynb extension) already exists. In the web app shown above, the user can create a new Notebook, File, Folder, and so on. Suppose a new Notebook file, such as Untitled.ipynb, has to be created. Then, the user will click Notebook and be prompted to select the kernel, as shown in Figure 1.16:

Figure 1.16: Image depicting creation of new Notebook file


Desktop/Demo1 folder

In Jupyter Notebook, when we click New to create a new notebook, we are


prompted to select a kernel. A kernel, in this context, refers to the
computational engine that executes the code within the notebook. The
prompt Python3 (ipykernel) indicates that you are selecting the Python 3
kernel managed by the ipykernel package.

The meaning of each part is given below:

Python3: This specifies the kernelʼs programming language. In this


case, it is Python 3, indicating that the notebook will execute Python
code.
(ipykernel): This indicates the specific implementation of the Python
kernel. ipykernel is the package responsible for providing Jupyter with
IPython-compatible kernels. It enables the notebook to execute Python
code and manage interactions with the Python interpreter.

By selecting Python3 (ipykernel) as the kernel when creating a new


notebook, we specify that the notebook will use the Python 3 kernel
provided by the ipykernel package to execute Python code within the
notebook. We are checking the option Always start the preferred kernel.
The expanded image is shown in Figure 1.17:

Figure 1.17: Image depicting Kernel selection for executing Python code
within notebook

Demo program using Jupyter Notebook

Observe Figure 1.18, which shows an annotated demo program in a Jupyter notebook. A Jupyter notebook comprises several integral
components:

Cell: The fundamental units of a Jupyter notebook are cells, which


come in two primary types:
Code cell: These contain executable code written in Python, R, or
Julia. Code cells execute independently, with their outputs (such
as text, plots, or errors) displayed below each cell. In Figure 1.18,
we have discussed the code cell.
Markdown cells: These contain formatted text written in
Markdown syntax, enabling the creation of structured
documentation with features like headings, lists, links, and
images. So, if we write 4 + 4 and then press Shift + Enter, the
literal visual output will be 4 + 4, as shown in Figure 1.19.
Kernel: As the computational engine, the kernel executes code within
the notebookʼs cells. Each notebook is linked to a specific kernel,
which dictates the programming language and execution environment.
Additionally, the kernel manages the notebookʼs state, including
variable values and imported modules, as shown in Figure 1.18.
Tool bar: It offers swift access to various notebook interactions,
including saving, adding cells, running cells, and modifying cell types,
as shown in Figure 1.18. The user may click the icons in the toolbar
according to their needs.
Menu bar: It hosts dropdown menus and provides supplemental
functionalities for managing the notebook, such as cell insertion, type
adjustments, and kernel configuration, as shown in Figure 1.18. The
user may explore multiple options present in the Menu bar, such as
File, Edit, View, Insert, and so on.
Output area: Situated beneath code cells, the output area showcases
execution results, encompassing printed output, error messages, and
graphical plots generated by the code, as shown in Figure 1.18.

These components synergize to establish an interactive and adaptable


environment for data analysis, visualization, and documentation within
Jupyter notebooks.

Apart from these, there are two states or modes of a notebook, viz Edit
mode and Command mode:

Edit mode: When we are in Edit mode, signified by a green cell


border, we can directly modify the content of a cell. This mode allows
us to type code or text within the cell and perform editing actions like
copying, cutting, and pasting. To enter Edit mode, simply click inside
a cell or press Enter when a cell is selected.
Command mode: Indicated by a blue cell border, it enables users to
execute operations on cells without directly editing their content. In
this mode, the user can perform tasks like moving, deleting, and
creating cells by using keyboard shortcuts. To enter Command mode,
press Esc or click outside a cell after editing its content.

Refer to the following figure for a better understanding:

Figure 1.18: Image depicting Demo program on Jupyter notebook


Here, we have displayed a simple demo program using Jupyter notebook.
In the code cell, we are simply initializing the values of two variables,
multiplying these two variables, and storing the result in mynum3.

mynum1 = 2

mynum2 = 3

mynum3 = mynum1 * mynum2

In another code cell, we write only mynum3 and get the output 6 in the output area. Also, notice that there are two prompts, namely In[] and Out[], which indicate input and output, respectively, for code cells.

In[]: This prompt denotes an input cell where the user can write and
execute code. The number inside the square brackets indicates the
order in which the code cell was executed within the notebook. For
example, In[2] indicates that this is the second code cell executed in
the notebook.
Out[]: This prompt denotes an output cell that displays the result of the
code execution from the corresponding input cell (In[]). The number
inside the square brackets corresponds to the input cellʼs number (In[])
from which the output was generated. For example, Figure 1.18
Out[2] indicates that this is the output whose value is 6 corresponding
to the input cell In[2].

These prompts help to keep track of the execution order and the
associated input and output for code cells in a Jupyter Notebook. They
provide a clear indication of the codeʼs execution flow and the results
generated at each step.

In Jupyter Notebook, Shift + Enter and Ctrl + Enter are both keyboard
shortcuts used to execute code cells, but they operate differently:

Shift + Enter: This shortcut executes the current cell and moves the
focus to the next cell. If there is no subsequent cell, a new one is
created below. Itʼs commonly used when you want to execute a cell
and proceed to the next task or cell in the notebook.
Ctrl + Enter: Pressing Ctrl + Enter executes the current cell but keeps
the focus within the same cell after execution. Itʼs useful when you
want to execute a cell without advancing to the next one, allowing you
to stay focused on the current cell for further editing or analysis.

In essence, Shift + Enter executes and advances to the next cell, while
Ctrl + Enter executes without moving to the next cell, enabling you to
control the flow of execution based on your workflow in Jupyter
Notebook.

We also saw the usage of pwd in In[3], which displays the current working directory in the Jupyter notebook. So, the file Untitled.ipynb is saved in the following folder:

'C:\\Users\\SAURABH\\Desktop\\Demo_Jupyter'.
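
The same information can also be obtained from Python code itself; the following is a minimal sketch using the standard library, independent of the pwd magic shown above:

import os

# Returns the notebook's current working directory as a string
print(os.getcwd())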

Now, we shall explain examples of the Markdown, Raw NBConvert, and Heading cell types, which are available when the cell type is changed for a cell in the Jupyter notebook:

Figure 1.19: Image depicting Markdown Cell usage on Jupyter notebook

Raw NBConvert Cell: In Jupyter Notebook, a Raw NBConvert cell


allows you to insert unprocessed content, such as HTML, LaTeX, or
Markdown, which remains untouched during the conversion process.
For instance, including a Raw NBConvert cell with custom CSS styles
ensures specific formatting remains intact when exporting the
notebook to HTML or other formats, enhancing document
presentation without altering its content within the notebook, as shown
below:

Figure 1.20: Image depicting Raw NBConvert Cell usage on Jupyter notebook
Heading cell: Heading cells in Jupyter Notebook enable the creation of
structured section titles to organize content effectively. Using
Markdown syntax, preceded by # symbols, users designate various
heading levels, aiding in content hierarchy and navigation. For
instance, employing # Introduction for top-level sections and ##
Background for subtopics enhances readability and comprehension,
particularly when navigating extensive notebooks or generating
navigational aids like table of contents, as shown below:

Figure 1.21: Image depicting Heading cell usage on Jupyter notebook

The steps to set up VSCode on Windows are mentioned below:

1. First, visit the website https://code.visualstudio.com/download and download the installer file for Windows using any browser of your choice. There are also download options for Ubuntu and macOS; here, we are explaining the installation setup procedure in Windows.
2. Run the installer file, and you will be prompted to accept the agreement option, that is, the VSCode terms and conditions. We need to select the radio-button option I accept the agreement as shown in Figure 1.22:

Figure 1.22: Image depicting Terms and Conditions page during Microsoft VS Code installation

3. Select the directory where we want to run Visual Studio Code. Weʼll
be prompted to browse for the location. Afterward, click on the Next
button to proceed. Here, the default selected path is
C:\Users\6146c\AppData\Local\Programs\Microsoft VS Code as
shown in Figure 1.23 as follows:
Figure 1.23: Image depicting Destination folder selection during
Microsoft VS Code installation

4. Then, we will be prompted to select additional tasks as shown in Figure 1.24 as follows. Just check the options as displayed in the image and click the Next button.

Figure 1.24: Image depicting Additional Tasks selection during Microsoft VS Code installation

5. We will be prompted to start the installation setup as shown in Figure 1.25 as follows. Click on the Install button option and the installation procedure will begin:

Figure 1.25: Image depicting Installation prompt during Microsoft VS Code installation

6. The installation has started and will take some time to complete. During the installation, we might encounter the screen shown in Figure 1.26 as follows:

Figure 1.26: Image depicting Installation action during Microsoft VS Code installation

7. When the installation of VS Code is completed, check the Launch Visual Studio Code option and then click Finish as shown in Figure 1.27:

Figure 1.27: Image depicting completion of Microsoft VS Code installation
8. In the Visual Studio Code window, we have the option to create a new file and select the preferred programming language to kickstart our coding journey! Then press Ctrl + Shift + X and type Python under EXTENSIONS as shown in Figure 1.28 as follows. Click Install. We are installing the Python extension for Visual Studio Code:

Figure 1.28: Image depicting Python extension installation for VS Code

Note: If any reader is interested in installing VSCode on macOS/Linux, then you may refer to the following two links as references:

https://code.visualstudio.com/docs/setup/mac

https://code.visualstudio.com/docs/setup/linux

Demo program using VSCode

Now, we shall see a simple demo program to print HelloWorld, which is traditionally the first program we write before starting to learn any programming language. Click the EXPLORER icon, which is below the VS Code icon at the top left. To explain the basic demo program, I have installed VS Code on my PC desktop and created a Demo folder under Downloads by clicking the New Folder icon option, such that my present working directory is C:\Users\6146c\Downloads\Demo1. Now, on pressing Ctrl + `, we shall be viewing the following tabs:

Problems: This refers to the panel where we can view and manage
diagnostic messages, warnings, and errors reported by our code or
extensions. It helps us to identify and resolve issues in our codebase.
Output: The Output panel displays the output of tasks, extensions, and
debug sessions. It provides valuable information and feedback from
various processes running within Visual Studio Code.
Debug Console: The Debug Console is where we can interactively debug our code during a debugging session. It allows us to view and evaluate expressions, inspect variables, and execute commands within the context of our debug session.
Terminal: Visual Studio Code includes an integrated terminal that
allows us to run command-line tasks and interact with our operating
system directly within the editor. It supports various shells and can be
customized to suit our preferences.
Ports: Ports refer to network ports used for communication between
Visual Studio Code and external processes, such as debuggers or
language servers. Configuring and managing ports may be necessary
when working with certain features or extensions that require network
connectivity.

Now, a new Python file helloworld.py is created by clicking the New File icon under the EXPLORER section. We save Python files with the .py extension. It is a text file that contains Python code, and in this particular file we only write the print statement shown in Figure 1.29 as follows:

print("HelloWorld")

Figure 1.29: Image depicting helloworld.py file in VSCode

Now, by just writing the pwd command, we can know the current working directory as shown in Figure 1.30 as follows:

Figure 1.30: Image depicting current working directory in VSCode

By typing the ls command, the contents of the current directory, including files and directories, will be displayed as shown in Figure 1.31 as follows:

Figure 1.31: Image depicting ls command in VSCode


So, now just write the command python .\helloworld.py under the Terminal tab; we may get the error shown in Figure 1.32, as VS Code was unable to locate the Python interpreter.

Figure 1.32: Image depicting error on running the Python command along
with file name

So, just close VSCode, restart it, and run the command python .\helloworld.py under the Terminal tab again. This time we should not get an error, and the output HelloWorld will be displayed to the user, as shown below:

Figure 1.33: Image depicting HelloWorld output to the user

Note: Executing this command in a terminal or command prompt triggers the Python interpreter to run the code within the helloworld.py script, resulting in the generation of any output or behavior defined within the script.

Introduction to data science libraries in Python

The introductory overview of data science libraries in Python encompasses essential tools for data analysis, manipulation, and visualization.
Foundational libraries like NumPy, Pandas, Matplotlib, and Scipy cater to
distinct aspects of the data science workflow. NumPy excels in array
operations and mathematical functions, while Pandas offers versatile data
structures for flexible manipulation and analysis. Matplotlib aids in crafting
high-quality visualizations, and Scipy extends functionality with scientific
computing tools and algorithms. Proficiency in these libraries is paramount
for aspiring data scientists as they underpin Pythonʼs capabilities for data
exploration and analysis. Moreover, Polars and Seaborn emerge as valuable
additions to the data science toolkit. Polars, a fast and efficient DataFrame
library akin to Pandas, excels in handling large-scale data processing tasks
with improved performance. Conversely, Seaborn, built on Matplotlib,
delivers a high-level interface for creating captivating statistical graphics,
enabling users to generate various plots swiftly, including scatter plots,
histograms, and heatmaps, to glean insights and communicate findings
effectively. Integrating these libraries elevates Pythonʼs prowess in data
manipulation, analysis, and visualization, fostering more efficient and
insightful data-driven decision-making processes. So, in this book, we will
be covering these data science libraries with examples and various concepts
chapter-wise so that all the data science learners can grasp these concepts in
a well-structured manner.
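As a quick preview, the following minimal sketch shows how these libraries are commonly imported once they are installed (the alias names here are conventional community choices used for illustration, not anything mandated by the libraries):

import numpy as np               # arrays and mathematical functions
import pandas as pd              # DataFrame-based data manipulation
import matplotlib.pyplot as plt  # plotting and visualization
import scipy                     # scientific computing tools and algorithms
import seaborn as sns            # statistical graphics built on Matplotlib
import polars as pl              # fast DataFrame library akin to Pandas

print(np.__version__, pd.__version__)  # quick check that the imports work

Note that from the next chapter onwards this book imports numpy under the alias mynp; np is shown here only because it is the most common convention.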

Note: If you press the h key while focusing on a Jupyter notebook, the image shown in Figure 1.34 will pop up, giving access to the keyboard shortcuts help menu. When you press h while in command mode (blue cell border), a list of available keyboard shortcuts will be displayed in a popup dialog in the Jupyter notebook.

Figure 1.34: Image depicting shortcuts in Jupyter notebook

Conclusion

In conclusion, this chapter has provided a comprehensive exploration of setting up an optimal programming environment for Python, with a
particular emphasis on the importance of an IDE. Beginning with the
installation process for Jupyter Notebook on a Windows platform, learners
have been guided through the setup and gained insights into its
functionality. Subsequently, the focus shifted to the installation of VSCode
for Python development, offering an overview of its usage in code
development. Lastly, the chapter introduced fundamental data science
libraries in Python, laying a solid foundation for learners to harness these
powerful tools in their programming endeavors. Armed with these essential
skills and resources, learners are well-equipped to embark on their journey
into Python programming with confidence and proficiency.

Points to remember

Python is a versatile and beginner-friendly programming language widely used in various fields such as web development, data science,
and artificial intelligence.
Install Python and Jupyter Notebook using Anaconda distribution for
streamlined setup and management.
Jupyter Notebook provides an interactive environment for data
analysis, combining code execution, text documentation, and
visualizations in a single interface.
Create and execute Python scripts in Jupyter Notebook to demonstrate
programming concepts and experiment with data analysis techniques.
Install VSCode and the Python extension to set up an efficient Python
development environment on Windows.
Learn about essential data science libraries such as NumPy, Pandas,
Matplotlib, and SciPy, which are fundamental tools for data analysis
and visualization in Python.

Questions

1. What are the key features of Python that make it a popular programming language?
2. How does Python compare to other programming languages in terms
of simplicity and readability?
3. Why is the Anaconda distribution recommended for setting up Python
and Jupyter Notebook on Windows?
4. What is the purpose of Jupyter Notebook in data analysis and
programming workflows?
5. How does Jupyter Notebook facilitate interactive computing with code
execution, text formatting, and visualizations?
6. What are some keyboard shortcuts and features that enhance
productivity within Jupyter Notebook?
7. How can you configure VSCode for Python development, including
settings and extensions?
8. Create a new Python file in VSCode and write a simple Python script.
9. How do libraries such as NumPy, Pandas, Matplotlib, and SciPy
contribute to data analysis and visualization in Python?
CHAPTER 2
Exploring Numpy Library for Data
Science in Python

Introduction
We are carrying over the legacy of explaining multiple concepts in Python
language. In this chapter, we will be dealing with the current popular topic
in the data science area which will be related to open-source data science
libraries in Python.
The primary library essential for data science and machine learning is numpy, short for numerical Python. While basic mathematical operations are handled by standard Python, numpy fills the gap for complex mathematical operations such as array creation and manipulation, differentiation, statistics, and integral calculus; these functionalities are not supported by standard Python at all. Numpy introduces the n-dimensional
array (ndarray) as its fundamental data structure. Written in both Python
and C language, it offers high performance, making it the optimal choice for
machine learning algorithms requiring efficient data structures akin to
Python lists. Data is stored in array format within numpy, and the library
provides an array data structure referred to as ndarrays. We can install
numpy using pip, that is, pip install numpy.
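As a quick check (a minimal sketch, assuming pip and Python are already available on your system), installing and then verifying numpy may look like the following:

# Run once from a terminal or command prompt (not inside Python):
#   pip install numpy
# Then verify the installation from a Python session:
import numpy as mynp   # the alias used throughout this book

print(mynp.__version__)          # prints the installed numpy version
print(mynp.array([1, 2, 3]))     # a tiny ndarray to confirm numpy works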

Structure
In this chapter, we will discuss the following topics:
Commonalities shared by lists and numpy arrays
Contrasts separating lists from numpy arrays
The process of creating numpy arrays
Using random module for ndarray creation
Exploring differences between View vs. Copy in numpy
Elements access of ndarray
Iterate elements of ndarray
Arithmetic operators in numpy
Broadcasting concept in numpy

Objectives
Readers will gain a clear understanding of the comparisons between lists
and numpy arrays and what they entail. We shall delve into the creation of
ndarrays through different methods, including utilizing list and tuple data
structures. We will also demonstrate various functions using Python code
snippets and peep into ndarray creation with random values using a module.
The distinction between view and copy in numpy will be explained using
examples which is crucial for memory efficiency. We will learn different
methods to access individual elements or subsets of elements in ndarrays.
We shall see how to iterate over elements of ndarrays using loops and
explore various arithmetic operators available in numpy. In the end, we will
learn how to use broadcasting to perform operations on ndarrays with
different shapes.

Note: We will be using an alias for numpy library as mynp. In this and
the forthcoming chapters, we will be writing syntax of methods
pertaining to the library. In the code, print
(help(aliasname.method_name)) will be mentioned under comment. The
reader is requested to kindly view the parameter details along with
their syntax so that repeatability in this book can be avoided thus
decreasing the page size and lucidly explaining the concepts without
providing any redundant information. It is expected that readers will
view the parameters of the syntax used.

Commonalities shared by lists and numpy arrays


Lists and numpy arrays share several similarities, some of which are mentioned as follows (a short sketch illustrating them appears after the list):
Both the list and numpy arrays can store data.
They preserve the order of elements, allowing for indexing and
slicing operations.
The lists and numpy arrays are mutable, meaning their contents can
be changed.
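The following short sketch (with illustrative variable names) shows these shared behaviours side by side:

import numpy as mynp

my_list = [10, 20, 30, 40]              # built-in list
my_arr = mynp.array([10, 20, 30, 40])   # numpy ndarray

# Both preserve order and support indexing and slicing
print(my_list[1], my_arr[1])            # 20 20
print(my_list[1:3], my_arr[1:3])        # [20, 30] [20 30]

# Both are mutable: their contents can be changed in place
my_list[0] = 99
my_arr[0] = 99
print(my_list, my_arr)                  # [99, 20, 30, 40] [99 20 30 40]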

Contrasts separating lists from numpy arrays


Given below are the distinctions between lists and numpy arrays:
Numpy requires explicit installation, while lists are built-in to Python.
Numpy arrays contain homogeneous elements, while lists can have
heterogeneous elements.
Numpy arrays support vector operations, unlike lists.
See the following code snippet for reference. Let us explore the code
mentioned in the following program name: Chap2_Example2.1.py:
myl1 = [101,201,301,401]
myl1+3 # vector operations cannot be performed on list
Output:
TypeError: can only concatenate list (not "int") to list

Note: The preceding code is covered in (program name:


Chap2_Example2.1.py)
Observe the following code in the program name: Chap2_Example2.2.py:
import numpy as mynp
mynp1_arr = mynp.array([101,201,301,401,501])
print(mynp1_arr+21) # vector operations will be performed on ndarray
Output:
[122 222 322 422 522]

Note: The preceding code is covered in (program name:


Chap2_Example2.2.py).

Numpy arrays (ndarray) consume less memory compared to lists, as demonstrated in the code program name: Chap2_Example2.3.py below:
import numpy as mynp
import sys
myl1 = [loop for loop in range(11,31)]
myndarray = mynp.array(myl1)
print('The List size is --> ',sys.getsizeof(myl1))
print('The ndarray size is a --> ',sys.getsizeof(myndarray))
Output:
The List size is --> 248
The ndarray size is a --> 192
Note: The preceding code is covered in (program name:
Chap2_Example2.3.py).

Arrays outperform lists in terms of speed, as demonstrated in the code program name: Chap2_Example2.4.py as follows:
import numpy as mynp
from datetime import datetime
myarr_1 = mynp.array([14,15,16])
myarr_2 = mynp.array([11,12,13])

#conventional Python code
def my_dot_product(myarr_1,myarr_2):
    my_result = 0
    for _x,_y in zip(myarr_1,myarr_2):
        my_result += _x*_y
    return my_result
mybefore_time = datetime.now()
for myloop in range(2000000):
    my_dot_product(myarr_1,myarr_2)
myafter_time = datetime.now()
print('Time take to execute using conventional Python approach:',myafter_time-mybefore_time)

#code using numpy library
mybefore2_time = datetime.now()
for my_loop in range(2000000):
    mynp.dot(myarr_1,myarr_2) # using numpy
myafter2_time = datetime.now()
print('Time take to execute using Numpy Library:',myafter2_time-mybefore2_time)
Output:
Time take to execute using conventional Python approach: 0:00:02.861940
Time take to execute using Numpy Library: 0:00:02.358692

Note: The preceding code is covered in (program name:


Chap2_Example2.4.py).

The process of creating numpy arrays


In this section, we shall explore the creation of ndarrays. The process of creating numpy arrays involves choosing an appropriate method based on the desired array structure and content. By leveraging the flexibility and efficiency provided by the numpy library, users can tailor their arrays to meet the requirements of their data manipulation and analysis tasks effectively.

Creating a 1-D array using a list


A 1-D array can be created using the list, as demonstrated in the code
program name: Chap2_Example2.5.py as follows:
# creating a 1-D array using a list
import numpy as mynp
my_l1 = [113,213,313,413,567]
print(f'my_l1 type is: {type(my_l1)}')
mynd_arr = mynp.array(my_l1)
print('mynd_arr type is: --> '+
str(type(mynd_arr)))
print('mynd_arr -->'+ str(mynd_arr))
print('The Array dimensions is: '+
str(mynd_arr.ndim))
print('The data type of array elements is : '+
str(mynd_arr.dtype))
print('The array size is : '+ str(mynd_arr.size))
print('The array shape is : '+
str(mynd_arr.shape))
Output:
my_l1 type is: <class 'list'>
mynd_arr type is: --> <class 'numpy.ndarray'>
mynd_arr -->[113 213 313 413 567]
The Array dimensions is: 1
The data type of array elements is : int32
The array size is : 5
The array shape is : (5,)

Note: The preceding code is covered in (program name:


Chap2_Example2.5.py).

Creating a 1-D array using a tuple


A 1-D array can be created using a tuple, as demonstrated in the code
program name: Chap2_Example2.6.py as follows:
# Creating a 1-D array using tuple
import numpy as mynp
mytuple = ('Ric','Pic','Tic','Mic')
print('mytuple type is: ' + str(type(mytuple)))
mynd_arr = mynp.array(mytuple)
print('mynd_arr type is: --> ' +
str(type(mynd_arr)))
print('mynd_arr --> ' + str(mynd_arr))
Output:
mytuple type is: <class 'tuple'>
mynd_arr type is: --> <class 'numpy.ndarray'>
mynd_arr --> ['Ric' 'Pic' 'Tic' 'Mic']

Note: The preceding code is covered in (program name:


Chap2_Example2.6.py).

Creating a 2-D array using nested lists


A 2-D array can be created using nested lists, as demonstrated in the code
program name: Chap2_Example2.7.py as follows:
# creating a 2-D array using nested list
import numpy as mynp
my_nestedl1 = [[105,205,305],[405,505,605],
[705,805,905],[1005,1105,1205]]
mynd_arr = mynp.array(my_nestedl1)
print('mynd_arr type is: --> ' +
str(type(mynd_arr)))
print('mynd_arr -->' + str(mynd_arr))
print('The array dimensions is:'+
str(mynd_arr.ndim))
print('The data type of array elements is : '+
str(mynd_arr.dtype))
print('The array size is :'+ str(mynd_arr.size))
print('The array shape is : ' +
str(mynd_arr.shape))
Output:
mynd_arr type is: --> <class 'numpy.ndarray'>
mynd_arr -->[[ 105 205 305]
[ 405 505 605]
[ 705 805 905]
[1005 1105 1205]]
The array dimensions is:2
The data type of array elements is : int32
The array size is :12
The array shape is : (4, 3)

Note: The preceding code is covered in (program name:


Chap2_Example2.7.py).

It is crucial to remember that ndarrays contain homogeneous elements (of the same data type). If the list provided during creation contains heterogeneous elements (of different data types), ndarray upcasting will be performed, as demonstrated in the code program name: Chap2_Example2.8.py as follows:

# list containing elements of different data type


import numpy as mynp
my_list = [201,301,420.5]
mynd_arr = mynp.array(my_list) # upcasting to float data type
print('mynd_arr -->'+str(mynd_arr))
print('Elements data type is --> ' +
str(mynd_arr.dtype))
Output:
mynd_arr -->[201. 301. 420.5]
Elements data type is --> float64

Note: The preceding code is covered in (program name:


Chap2_Example2.8.py).

Creating an array with a specific dtype


Using a particular dtype, an array is created as demonstrated in the code
program name: Chap2_Example2.9.py as follows:
import numpy as mynp
# int type
mynd_arr = mynp.array([101,201,410.8,0],dtype=int)
print(mynd_arr)
# float type
mynd_arr = mynp.array([101,201,410.8,0],dtype=float)
print(mynd_arr)
# bool type : True for non-empty string and number, False for empty string and 0
mynd_arr = mynp.array([101,201,410.8, 0, '','BoolType','False'],dtype=bool)
print(mynd_arr)
# complex type
mynd_arr = mynp.array([101,201,410.8,0],dtype=complex)
print(mynd_arr)
# str type
mynd_arr = mynp.array([101,201,410.8,0],dtype=str)
print(mynd_arr)
Output:
[101 201 410 0]
[101. 201. 410.8 0. ]
[ True True True False False True True]
[101. +0.j 201. +0.j 410.8+0.j 0. +0.j]
['101' '201' '410.8' '0']

Note: The preceding code is covered in (program name:


Chap2_Example2.9.py).

Creating an array of object type


The parent category for data types like int, float, bool, complex, and str is
of the object type. In this context, while the elements may seem diverse,
their underlying data type remains an object, as demonstrated in the code
program name: Chap2_Example2.10.py as follows:
import numpy as mynp
mynd_arr = mynp.array([410,'Yathartha',410.5,True,3+7j,False], dtype=object)
print(mynd_arr)
print('The elements data type of mynd_Arr is:' +
str(mynd_arr.dtype))
Output:
[410 'Yathartha' 410.5 True (3+7j) False]
The elements data type of mynd_Arr is:object

Note: The preceding code is covered in (program name:


Chap2_Example2.10.py).

Creating a 1-D array with arange() function


A one-dimensional array can be generated using the arange() function, as
exemplified in the code program name: Chap2_Example2.11.py as follows:
import numpy as mynp
mynd_arr = mynp.arange(7)
print('mynd_arr is:' + str(mynd_arr))
print('The array dimensions is: ' +
str(mynd_arr.ndim))
print('The data type of elements of array is : '
+ str(mynd_arr.dtype))
print('The array size is : ' + str(mynd_arr.size))
print('The array shape is : ' +
str(mynd_arr.shape))
print('-'*50)
mynd_arr1 = mynp.arange(1,7)
print('mynd_arr1 is:' + str(mynd_arr1))
print('The array dimensions is: ' +
str(mynd_arr1.ndim))
print('The data type of elements of array is : '
+ str(mynd_arr1.dtype))
print('The array size is : ' +
str(mynd_arr1.size))
print('The array shape is : ' +
str(mynd_arr1.shape))
print('-'*50)
mynd_arr2 = mynp.arange(1,7,2)
print('mynd_arr is:' + str(mynd_arr2))
print('The array dimensions is: ' +
str(mynd_arr2.ndim))
print('The data type of elements of array is : '
+ str(mynd_arr2.dtype))
print('The array size is : ' +
str(mynd_arr2.size))
print('The array shape is : ' +
str(mynd_arr2.shape))
print('-'*50)
mynd_arr3 = mynp.arange(1,7,3,dtype=float)
print('mynd_arr is:' + str(mynd_arr3))
print('The array dimensions is: ' +
str(mynd_arr3.ndim))
print('The data type of elements of array is : '
+ str(mynd_arr3.dtype))
print('The array size is : ' +
str(mynd_arr3.size))
print('The array shape is : ' +
str(mynd_arr3.shape))
Output:
mynd_arr is:[0 1 2 3 4 5 6]
The array dimensions is: 1
The data type of elements of array is : int32
The array size is : 7
The array shape is : (7,)
--------------------------------------------------
mynd_arr1 is:[1 2 3 4 5 6]
The array dimensions is: 1
The data type of elements of array is : int32
The array size is : 6
The array shape is : (6,)
--------------------------------------------------
mynd_arr is:[1 3 5]
The array dimensions is: 1
The data type of elements of array is : int32
The array size is : 3
The array shape is : (3,)
--------------------------------------------------
mynd_arr is:[1. 4.]
The array dimensions is: 1
The data type of elements of array is : float64
The array size is : 2
The array shape is : (2,)

Note: The preceding code is covered in (program name:


Chap2_Example2.11.py).

Array creation using linspace() function


Linearly spaced values within the specified interval will be returned by
using linspace() function. The syntax is shown as follows:
linspace(start,
stop,
num = 50,
endpoint = True,
retstep = False,
dtype = None)
Some of the parameters of linspace() have been explained as follows:
start: Optional parameter with a default value of zero, indicating the
start of the interval range.
stop: Indicates the end of the interval range.
num: Optional parameter indicating the number of samples to
generate.
endpoint: If set to False, the stop value is excluded from the interval and the spacing is calculated as (stop-start) / num. When True, the spacing is calculated as (stop-start) / (num - 1).
retstep: If set to True, the spacing between the points is returned along with the samples; its default value is False.
dtype: Specifies the output array type.
The following code is mentioned to help you understand the linspace()
function which is demonstrated in the code program name:
Chap2_Example2.12.py as follows:

import numpy as mynp

# evenly spaced values with no. of samples as 50 between 0 and 2, which includes both 0 and 2
print(mynp.linspace(0,2))
print('-'*50)
# 3 evenly spaced values between 0 and 2, also including both 0 and 2
print(mynp.linspace(0,2,3))
print('-'*50)
# 3 evenly spaced values between 0 and 2, including 0 but excluding 2
print(mynp.linspace(0,2,3, endpoint=False))
print('-'*50)
# 3 evenly spaced values between 0 and 2, including 0 but excluding 2, also returning the spacing
print(mynp.linspace(0,2,3, endpoint=False, retstep=True))
print('-'*50)
# 5 equally spaced int type values between 2 and 50, including both 2 and 50, with spacing of 12.0
print(mynp.linspace(2,50,5, dtype=int, retstep=True))
Output:
[0. 0.04081633 0.08163265 0.12244898
0.16326531 0.20408163
0.24489796 0.28571429 0.32653061 0.36734694
0.40816327 0.44897959
0.48979592 0.53061224 0.57142857 0.6122449
0.65306122 0.69387755
0.73469388 0.7755102 0.81632653 0.85714286
0.89795918 0.93877551
0.97959184 1.02040816 1.06122449 1.10204082
1.14285714 1.18367347
1.2244898 1.26530612 1.30612245 1.34693878
1.3877551 1.42857143
1.46938776 1.51020408 1.55102041 1.59183673
1.63265306 1.67346939
1.71428571 1.75510204 1.79591837 1.83673469
1.87755102 1.91836735
1.95918367 2. ]
[0. 1. 2.]
[0. 0.66666667 1.33333333]
(array([0. , 0.66666667, 1.33333333]),
0.6666666666666666)
(array([ 2, 14, 26, 38, 50]), 12.0)

Note: The preceding code is covered in (program name:


Chap2_Example2.12.py).

We can conclude that the arange() function considers elements within the specified range based on the step value. In contrast, the linspace() function includes a specified number of values within the given range.
Consider the following distinctions (a short sketch follows the list):
0-D array: Represented by a single value, often termed a scalar.
1-D array: A collection of 0-D arrays, commonly referred to as a
vector.
2-D array: Consists of 1-D arrays, often referred to as a matrix.
3-D array: Comprises collections of 2-D arrays.
(20,): Represents a 1-D array containing 20 elements.
(10,2): Denotes a 2-D array with ten rows and two columns.
(1,2,3): Represents a 3-D array which is composed of the following:
One set of 2-D arrays.
Each 2-D array contains two rows.
Each 2-D array comprises three columns.
Size calculation: 1 * 2 * 3 = 6
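The following short sketch (with illustrative values) shows how these dimensions and shapes look in practice:

import numpy as mynp

my0d = mynp.array(7)            # 0-D array (scalar)
my1d = mynp.zeros(20)           # shape (20,)   -> 20 elements
my2d = mynp.zeros((10, 2))      # shape (10,2)  -> ten rows, two columns
my3d = mynp.zeros((1, 2, 3))    # shape (1,2,3) -> one 2x3 matrix

print(my0d.ndim, my1d.shape, my2d.shape)   # 0 (20,) (10, 2)
print(my3d.shape, my3d.size)               # (1, 2, 3) 6, since 1 * 2 * 3 = 6
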
Utilizing the zeros() function in numpy
The zeros() function in numpy generates arrays filled with zeros, with the
desired array shape specified as an input parameter. Consider the following
code for better understanding as shown in program name:
Chap2_Example2.13.py as follows:

import numpy as mynp


# Creating 1-D array with zeros
print(mynp.zeros(4))
print('-'*50)
# Creating 2-D array with zeros
print(mynp.zeros((4,4)))
print('-'*50)
# Creating 3-D array with zeros
print(mynp.zeros((2,3,2)))
print('-'*50)
# Creating 4-D array with zeros
print(mynp.zeros((3,1,2,3)))
Output:
[0. 0. 0. 0.]
--------------------------------------------------
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
--------------------------------------------------
[[[0. 0.]
[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]
[0. 0.]]]
--------------------------------------------------
[[[[0. 0. 0.]
[0. 0. 0.]]]
[[[0. 0. 0.]
[0. 0. 0.]]]
[[[0. 0. 0.]
[0. 0. 0.]]]]

Note: The preceding code is covered in (program name:


Chap2_Example2.13.py).

Utilizing the ones() function in numpy


The ones() function behaves similarly to the zeros() function. It returns a
new array with the specified shape and data type, where all elements are set
to one, as demonstrated in the code program name: Chap2_Example2.14.py
as follows:
import numpy as mynp
# Creating 1-D array with ones
print(mynp.ones(4))
print('-'*50)
# Creating 2-D array with ones
print(mynp.ones((4,2), dtype = int))
print('-'*50)
# Creating 3-D array with ones
print(mynp.ones((3,2,4), dtype = int))
Output:
[1. 1. 1. 1.]
--------------------------------------------------
[[1 1]
[1 1]
[1 1]
[1 1]]
--------------------------------------------------
[[[1 1 1 1]
[1 1 1 1]]
[[1 1 1 1]
[1 1 1 1]]
[[1 1 1 1]
[1 1 1 1]]]

Note: The preceding code is covered in (program name:


Chap2_Example2.14.py).
Utilizing the full() function in numpy
The numpy full function creates a new array with a specified shape and data
type, filled with a specific fill value. Simply use the following command for
a better understanding as demonstrated in the code program name:
Chap2_Example2.15.py as follows:
import numpy as mynp
print(help(mynp.full))
To examine the parameters and their functionalities, enter the following
command:
full(shape, fill_value, dtype=None, order='C', *,
like=None)
Observe the following code:
import numpy as mynp
# Creating 1-D array
print(mynp.full(shape=4,fill_value=2))
print('-'*50)
# Creating 2-D array
print(mynp.full((3,2),fill_value=2))
print('-'*50)
# Creating 3-D array
print(mynp.full((3,2,3),2)) # fill_value here is 2
Output:
[2 2 2 2]
--------------------------------------------------
[[2 2]
[2 2]
[2 2]]
--------------------------------------------------
[[[2 2 2]
[2 2 2]]
[[2 2 2]
[2 2 2]]
[[2 2 2]
[2 2 2]]]

Note: The preceding code is covered in (program name:


Chap2_Example2.15.py).

Using eye() function


The eye function in numpy generates an identity matrix, returning a 2-D
array with values of one along the diagonal and zero elsewhere.
Simply enter the following command to get a clear understanding of eye
function:
import numpy as mynp
print(help(mynp.eye))
To examine the parameters and their functionalities, enter the following
command:
eye(N, M=None, k=0, dtype=<class 'float'>,
order='C', *, like=None)
In this context, the parameters are described as follows:
N refers to the number of rows in the output.
M refers to the number of columns in the output.
k determines which diagonal should be filled with 1s, with a default
value of zero.
The 2-D arrays are consistently returned, with no requirement for the
number of rows and columns to be identical. If the M value is omitted, it will
default to the same value as N.
Observe the following code for a better understanding as demonstrated in
the code program name: Chap2_Example2.16.py as follows:
import numpy as mynp
# Creating 2-D array with default value k = 0
print(mynp.eye(2,4))
print('-'*50)
# Creating 2-D array with k = 1
print(mynp.eye(5, k=1))
print('-'*50)
# Creating 2-D array with k = 2
print(mynp.eye(5, k=2))
print('-'*50)
# Creating 2-D array with k = -1
print(mynp.eye(5, k=-1))
print('-'*50)
# Creating 2-D array with k = -2
print(mynp.eye(5, k=-2))
Output:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]]
--------------------------------------------------
[[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0.]]
--------------------------------------------------
[[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
--------------------------------------------------
[[0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]]
--------------------------------------------------
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]]

Note: The preceding code is covered in (program name:


Chap2_Example2.16.py).

Utilizing diag() function


A diagonal is extracted or a diagonal array is constructed using the diag()
function. Simply enter the following command for this function:
import numpy as mynp
print(help(mynp.diag))
To examine the parameters and their functionalities, enter the following
command:
diag(v, k=0) as demonstrated in the code program name:
Chap2_Example2.17.py as follows:

import numpy as mynp


# Extracting 2-D diagonal elements
myndarray = mynp.arange(17,33).reshape(4,4)
print("2-D Original array is : \n " +
str(myndarray))
print("The elements present at 0-diagonal : " +
str(mynp.diag(myndarray,k=0)))
print("The elements present at 1-diagonal : " +
str(mynp.diag(myndarray,k=1)))
print("The elements present at 2-diagonal : " +
str(mynp.diag(myndarray,k=2)))
print("The elements present at -1-diagonal : " +
str(mynp.diag(myndarray,k=-1)))
print("The elements present at -2-diagonal : " +
str(mynp.diag(myndarray,k=-2)))
print("The elements present at 3-diagonal : " +
str(mynp.diag(myndarray,k=3)))
print("The elements present at -3-diagonal : " +
str(mynp.diag(myndarray,k=-3)))
print('-'*50)

# 1-D: construct a 2-D array using the provided elements as a diagonal array;
# here the remaining elements are all filled with zeros.
mynd_arr1 = mynp.array([101,201,301,401,501])
print(mynp.diag(mynd_arr1,k=0))
print('-'*50)
mynd_arr2 = mynp.array([101,201,301,401,501])
print(mynp.diag(mynd_arr2,k=1))
print('-'*50)
mynd_arr3 = mynp.array([101,201,301,401,501])
print(mynp.diag(mynd_arr3,k=-1))
Output:
2-D Original array is :
[[17 18 19 20]
[21 22 23 24]
[25 26 27 28]
[29 30 31 32]]
The elements present at 0-diagonal : [17 22 27 32]
The elements present at 1-diagonal : [18 23 28]
The elements present at 2-diagonal : [19 24]
The elements present at -1-diagonal : [21 26 31]
The elements present at -2-diagonal : [25 30]
The elements present at 3-diagonal : [20]
The elements present at -3-diagonal : [29]
--------------------------------------------------
[[101 0 0 0 0]
[ 0 201 0 0 0]
[ 0 0 301 0 0]
[ 0 0 0 401 0]
[ 0 0 0 0 501]]
--------------------------------------------------
[[ 0 101 0 0 0 0]
[ 0 0 201 0 0 0]
[ 0 0 0 301 0 0]
[ 0 0 0 0 401 0]
[ 0 0 0 0 0 501]
[ 0 0 0 0 0 0]]
--------------------------------------------------
[[ 0 0 0 0 0 0]
[101 0 0 0 0 0]
[ 0 201 0 0 0 0]
[ 0 0 301 0 0 0]
[ 0 0 0 401 0 0]
[ 0 0 0 0 501 0]]

Note: The preceding code is covered in (program name:


Chap2_Example2.17.py).

Utilizing empty() function


The empty() function in NumPy generates a new array with a specified
shape and data type without initializing entries. It creates a dummy array.
Enter the following command in NumPy:
import numpy as mynp
print(help(mynp.empty))
To examine the parameters and their functionalities, enter the following
command:
empty(shape, dtype=float, order='C', *, like=None)
Observe the following snippet as demonstrated in the code program name:
Chap2_Example2.18.py as follows:

import numpy as mynp


print(mynp.empty((4,3)))
Output:
[[4.45057637e-308 1.78021527e-306 8.45549797e-307]
[1.37962049e-306 1.11260619e-306 1.78010255e-306]
[9.79054228e-307 4.45057637e-308 8.45596650e-307]
[9.34602321e-307 4.94065646e-322
0.00000000e+000]]

Note: The preceding code is covered in (program name:


Chap2_Example2.18.py).

Contrast between zeros and empty function


When considering zeros and empty functions, refer to the following list:
Choose zeros when requiring an array filled with zeros.
Opt for empty if you need an uninitialized array for future use
without concern for data.
The creation of an empty array with empty() is significantly faster
compared to zeros(). Therefore, if data integrity is not crucial,
empty() is preferred for improved performance.
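As a rough illustration of this performance difference (a minimal sketch; exact timings will vary by machine, and the array shape and repetition count here are arbitrary choices), a quick comparison might look like this:

import numpy as mynp
from timeit import timeit

# Time creating a large array with each function, 100 repetitions each
t_zeros = timeit(lambda: mynp.zeros((1000, 1000)), number=100)
t_empty = timeit(lambda: mynp.empty((1000, 1000)), number=100)

print(f"zeros(): {t_zeros:.4f} s")
print(f"empty(): {t_empty:.4f} s")   # usually faster, but contents are uninitialized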

Using random module for ndarray creation


We can create an ndarray with random data using the numpy random module mentioned above, which provides various functions like the randint() and shuffle() functions.

randint()
Random integers are generated within a specified range by using randint()
function that is from low (inclusive) to high (exclusive). Refer to the
following syntax:
mynp.random.randint(low, high=None, size=None,
dtype=int)
low: Lowest integer value which is inclusive in the range.
high: Highest integer value which is exclusive in the range. The
range becomes [0,low) if unspecified.
size: output shape which can be an integer or tuple of integers.
dtype: output desired data type which is an integer by default here.
Let us view various examples for a better explanation of the above function
as demonstrated in the code program name: Chap2_Example2.19.py as
follows:
# np.random module examples
import numpy as mynp
import sys
# single random integer value generation in the range between 20 to 29
print(mynp.random.randint(20,30))
print('-'*50)
# 1-D nd-array creation of size 5 with random values from 10 to 19
print(mynp.random.randint(10,20, size=5))
print('-'*50)
# 2-D array with high as None and random values from 0 to 49 with shape as (3,4)
print(mynp.random.randint(0,50, size=(3,4)))
print('-'*50)
# memory utilization is improved using dtype
a = mynp.random.randint(1,21,size=(30,40))
print(f"ndarray int32 size: {sys.getsizeof(a)}")
a = mynp.random.randint(1,21,size=(30,40),dtype='int8')
print(f"ndarray int8 size : {sys.getsizeof(a)}")
Output:
26
--------------------------------------------------
[17 15 11 15 16]
--------------------------------------------------
[[31 21 8 41]
[47 3 39 41]
[46 21 10 31]]
--------------------------------------------------
ndarray int32 size: 4928
ndarray int8 size : 1328

Note: The above-mentioned code is covered in (program name:


Chap2_Example2.19.py).

shuffle()
This function will shuffle the array along axis-0. It corresponds to the first
axis in a multi-dimensional array. Here, the content remains the same with a
change in sub-array order.
Let us view some examples as demonstrated in the code program name:
Chap2_Example2.20.py as follows:

import numpy as mynp


# print(help(mynp.random.shuffle)) # This will display the help of shuffle function

# shuffling of 1-D array


mynp1 = mynp.arange(6)
print(f'before shuffle mynp1 data is {mynp1}')
mynp.random.shuffle(mynp1)
print(f'after shuffle mynp1 data is {mynp1}')
print('-'*50)

# shuffling of 2-D array -- shuffling around axis0


mynp2 = mynp.random.randint(1,50,size=(4,3))
print(f'before shuffle mynp2 data is {mynp2}')
mynp.random.shuffle(mynp2)
print(f'after shuffle mynp2 data is {mynp2}')
Output:
before shuffle mynp1 data is [0 1 2 3 4 5]
after shuffle mynp1 data is [2 4 0 3 1 5]
--------------------------------------------------
before shuffle mynp2 data is [[35 30 49]
[12 43 37]
[13 22 16]
[28 17 22]]
after shuffle mynp2 data is [[28 17 22]
[12 43 37]
[13 22 16]
[35 30 49]]

Note: The preceding code is covered in (program name:


Chap2_Example2.20.py).

Some additional functions have been mentioned as follows:


rand(): Generates uniformly distributed float values in the range
(0,1).
uniform(): Generates uniformly distributed float values in the
given range (low, high).
randn(): Generates normally distributed float values with a mean
of zero and a standard deviation of one.
normal(): Generates normally distributed float values with a
specified mean and standard deviation.
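A brief sketch of these functions in action is shown below (the shapes and parameter values are arbitrary illustrative choices):

import numpy as mynp

print(mynp.random.rand(2, 3))              # uniform floats in [0, 1), shape (2,3)
print(mynp.random.uniform(5, 10, size=4))  # uniform floats in the range [5, 10)
print(mynp.random.randn(3))                # standard normal: mean 0, std dev 1
print(mynp.random.normal(50, 5, size=3))   # normal with mean 50 and std dev 5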

Exploring differences between View vs. Copy in numpy


Refer to the following table to learn the difference between View and Copy:

S.No. | View | Copy
1 | A logical representation of the existing array and not a separate object. | It is a separate object.
2 | Any changes made to the original array will be reflected in the view, and vice versa. | Any changes made to the original array will not be reflected in the copy, and vice versa.
3 | On using the view() method of the ndarray class, a view can be created explicitly. | On using the copy() method of the ndarray class, a copy of the existing ndarray can be created.

Table 2.1: Difference between View and Copy in numpy
Let us explore a simple example to explain their difference in detail as
demonstrated in the code program name: Chap2_Example2.21.py as follows:
import numpy as mynp

# Example of view
mynp1=mynp.array([16,28,37,46,55])
mynp2=mynp1.view()
print(f"The original array is: {mynp1}")
print(f"The view array is: {mynp2}")
print('-'*50)
mynp1[2]=100
print(f"After changing 2nd index value of mynp1, mynp1 array value is changed as: {mynp1}")
print(f"After changing 2nd index value of mynp1, mynp2 array value is changed as: {mynp2}")
print('-'*50)
mynp2[-1]=300
print(f"After changing last index value of mynp2, mynp1 array value is changed as: {mynp1}")
print(f"After changing last index value of mynp2, mynp2 array value is changed as: {mynp2}")

# Example of copy
print('*'*100)
mynp3=mynp.array([111,132,136,147,158])
mynp4=mynp3.copy()
print(f"The original array is: {mynp3}")
print(f"The copy array is: {mynp4}")
print('-'*50)
mynp3[2]=110
print(f"After changing 2nd index value of mynp3, mynp3 array value is changed as: {mynp3}")
print(f"After changing 2nd index value of mynp3, mynp4 array value is retained as: {mynp4}")
print('-'*50)
mynp4[-1]=310
print(f"After changing last index value of mynp4, mynp3 array value is retained as: {mynp3}")
print(f"After changing last index value of mynp4, mynp4 array value is changed as: {mynp4}")
Output:
The original array is: [16 28 37 46 55]
The view array is: [16 28 37 46 55]
--------------------------------------------------
After changing 2nd index value of mynp1, mynp1
array value is changed as: [ 16 28 100 46 55]
After changing 2nd index value of mynp1, mynp2
array value is changed as: [ 16 28 100 46 55]
--------------------------------------------------
After changing last index value of mynp2, mynp1
array value is changed as: [ 16 28 100 46 300]
After changing last index value of mynp2, mynp2
array value is changed as: [ 16 28 100 46 300]
**************************************************
**************************************************
The original array is: [111 132 136 147 158]
The copy array is: [111 132 136 147 158]
--------------------------------------------------
After changing 2nd index value of mynp3, mynp3
array value is changed as: [111 132 110 147 158]
After changing 2nd index value of mynp3, mynp4
array value is retained as: [111 132 136 147 158]
--------------------------------------------------
After changing last index value of mynp4, mynp3
array value is retained as: [111 132 110 147 158]
After changing last index value of mynp4, mynp4
array value is changed as: [111 132 136 147 310]

Note: The preceding code is covered in (program name:


Chap2_Example2.21.py).

Elements access of ndarray


We can access ndarray elements using either indexing, slicing, condition-
based selection or advanced indexing. Let us see an example for better
understanding. Refer to the following code as demonstrated in the code
program name: Chap2_Example2.22.py as follows:
import numpy as mynp
# We will be accessing elements of 3-D numpy array
mynd1 = mynp.array([
[[111,112,113],[114,115,116],[117,118,119]],
[[11,21,31],[41,51,61],[71,81,91]]
])
# 1. using indexing: single element access, zero based indexing supporting both +ve and -ve
print(f"Array shape is: ==> {mynd1.shape}")
print("To access the element 119 from the 3-D array")
print(f"mynd1[0][2][2] ==> {mynd1[0][2][2]}")
print(f"mynd1[-2][-1][-1] ==> {mynd1[-2][-1][-1]}")
print('-'*50)

# 2. using slicing: elements group which are in order
print("Accessing elements: 0th and 2nd row of 0th and 2nd column value of each 2-D array")
print(f"mynd1[:,::2,::2] ==> {mynd1[:,::2,::2]}")
print('-'*50)

# 3. using advanced indexing: elements group which are not ordered (arbitrary elements)
# In order to access 3-D array arbitrary elements use
# myndarray[[2-D array indices],[row indices],[column indices]]
print("To access elements 115 and 51 from the 3-D array")
print(mynd1[[0,1],[1,1],[1,1]])
print('-'*50)
# 4. using condition based selection: selecting array elements based on condition
mynp2=mynp.arange(11,23).reshape(3,4)
print(mynp2[mynp2%2 !=0]) # displaying odd numbers from an array, as the condition will return boolean values
# and array[condition] will return the result
Output:
Array shape is: ==> (2, 3, 3)
To access the element 119 from the 3-D array
mynd1[0][2][2] ==> 119
mynd1[-2][-1][-1] ==> 119
--------------------------------------------------
Accessing elements: 0th and 2nd row of 0th and 2nd column value of each 2-D array
mynd1[:,::2,::2] ==> [[[111 113]
[117 119]]
[[ 11 31]
[ 71 91]]]
--------------------------------------------------
To access elements 115 and 51 from the 3-D array
[115 51]
--------------------------------------------------
[ 11 13 15 17 19 21]

Note: The preceding code is covered in (program name:


Chap2_Example2.22.py).

Iterate elements of ndarray


We can iterate through all elements individually, accessing them one by
one. There are three ways to iterate through ndarrays, as demonstrated in
the code program name: Chap2_Example2.23.py as follows:
import numpy as mynp

# Approach1: iteration using Python loop
# we are iterating elements of 3-D array
myarr1 = mynp.array([[[34,44],[54,64]],[[74,84],[94,104]]])
for outer in myarr1: # iterating 2-D array with no scalar value
    for inner in outer: # iterating 1-D array with no scalar value
        for data in inner: # here data is a scalar value
            print(data)
print('-'*50)

# Approach2: iteration using nditer() function: class in numpy library with only one loop for iterating any n-D array
myarr2=mynp.arange(21,27).reshape(2,3)
for loop in mynp.nditer(myarr2):
    print(loop)
print('-'*50)
# Approach3: iteration using ndenumerate() function
# This function will return indexes in addition to elements, as a multidimensional
# index iterator will be returned, yielding pairs of index-tuple with the corresponding array values
for mypos,myelement in mynp.ndenumerate(myarr2):
    print(f'{myelement} is present at position:{mypos}')
Output:
34
44
54
64
74
84
94
104
--------------------------------------------------
21
22
23
24
25
26
--------------------------------------------------
21 is present at position:(0, 0)
22 is present at position:(0, 1)
23 is present at position:(0, 2)
24 is present at position:(1, 0)
25 is present at position:(1, 1)
26 is present at position:(1, 2)

Note: The above mentioned code is covered in (program name:


Chap2_Example2.23.py).

Arithmetic operators in numpy


In order to perform mathematical operations on arrays, numpy offers
several arithmetic
operators where the arrays must be in the same dimension, size, and shape
as demonstrated in the code program name: Chap2_Example2.24.py as
follows:
import numpy as mynp

myarr1 = mynp.array([ [21,31,41], [52,62,72] ])
myarr2 = mynp.arange(21,27).reshape(2,3)
print(f"Dimension of myarr1 is : {myarr1.ndim}, size of myarr1 is :{myarr1.size} and shape of myarr1 is : {myarr1.shape}")
print(f"Dimension of myarr2 is : {myarr2.ndim}, size of myarr2 is :{myarr2.size} and shape of myarr2 is : {myarr2.shape}")
print(f'Numpy array 1 is: {myarr1}')
print(f'Numpy array 2 is: {myarr2}')
# performing addition
print('-'*50)
print(f"Addition of 2 numpy arrays is {mynp.add(myarr1,myarr2)}")
# performing subtraction
print('-'*50)
print(f"Subtraction of 2 numpy arrays is {mynp.subtract(myarr1,myarr2)}")
# performing multiplication
print('-'*50)
print(f"Multiplication of 2 numpy arrays is {mynp.multiply(myarr1,myarr2)}")
# performing division
print('-'*50)
print(f"Division of 2 numpy arrays is {mynp.divide(myarr1,myarr2)}")
# performing floor division
print('-'*50)
print(f"Floor division of 2 numpy arrays is {mynp.floor_divide(myarr1,myarr2)}")
# performing modulus
print('-'*50)
print(f"Modulus of 2 numpy arrays is {mynp.mod(myarr1,myarr2)}")
# performing power
print('-'*50)
print(f"Power of 2 numpy arrays is {mynp.power(myarr1,myarr2)}")
Output:
Dimension of myarr1 is : 2, size of myarr1 is :6
and shape of myarr1 is : (2, 3)
Dimension of myarr2 is : 2, size of myarr2 is :6
and shape of myarr2 is : (2, 3)
Numpy array 1 is: [[21 31 41]
[52 62 72]]
Numpy array 2 is: [[21 22 23]
[24 25 26]]
--------------------------------------------------
Addition of 2 numpy arrays is [[42 53 64]
[76 87 98]]
--------------------------------------------------
Subtraction of 2 numpy arrays is [[ 0 9 18]
[28 37 46]]
--------------------------------------------------
Multiplication of 2 numpy arrays is [[ 441 682
943]
[1248 1550 1872]]
--------------------------------------------------
Division of 2 numpy arrays is [[1.
1.40909091 1.7826087 ]
[2.16666667 2.48 2.76923077]]
--------------------------------------------------
Floor division of 2 numpy arrays is [[1 1 1]
[2 2 2]]
--------------------------------------------------
Modulus of 2 numpy arrays is [[ 0 9 18]
[ 4 12 20]]
--------------------------------------------------
Power of 2 numpy arrays is [[ 878082373 1244764481
-118448423]
[ 0 1040187392 0]]

Note: The preceding code is covered in (program name:


Chap2_Example2.24.py).

Broadcasting concept in numpy


Broadcasting is a powerful concept in numpy that allows arrays of different
shapes to be used together in arithmetic operations. When performing
operations between arrays in NumPy, the arrays need to have compatible
shapes. Broadcasting in numpy enhances the flexibility and ease of use
when working with arrays of different shapes, making it a fundamental
feature for array manipulation and computation. Refer to the following
points for clarity:
In general, arithmetic operations are typically carried out between
two arrays that share the same dimension, shape, and size.
Even if there are variations in size, shape, and dimension, some
mathematical operations are permitted by broadcasting.
Numpy will perform broadcasting automatically; we are not obliged
to do it explicitly.
Not every situation will lend itself to broadcasting.
Numpy does broadcasting by adhering to certain principles. It is
solely performed internally during arithmetic operations if the
specified rules are met. This mechanism is necessary when dealing
with arrays of varying dimensions, shapes, or sizes.
The rules of broadcasting are mentioned as follows:
Rule-1: Ensure that the dimensions of both arrays are the same.
Numpy adjusts arrays to have equal dimensions if they are initially
different. This adjustment involves padding the shape of the array
with fewer dimensions on the left side with 1ʼs until both arrays have
the same dimensions. Refer to the following table for a better
understanding:
Before: Let the mynp1 array have a shape of (2,3), a 2-D array, and the mynp2 array have a shape of (3,), a 1-D array.
After: Both arrays here are of different dimensions, and by using rule-1 we will be padding the lesser-dimension array, here mynp2, with 1's on the left. So, the mynp1 array is of shape (2,3) and the mynp2 array is of shape (1,3).

Table 2.2: Table demonstrating incorporation of rule-1 on numpy array


Rule-2: If the sizes of the two arrays do not match in some dimension, the array whose size equals one in that dimension will be expanded to match the size of the other array. Refer to the following table for a better understanding:
Before: The mynp1 array is of shape (2,3) and the mynp2 array is of shape (1,3). The first coordinate of the mynp2 array (1) will be increased to match that of mynp1 (2).
After: After applying rule two, the shape of both the mynp1 and mynp2 arrays is (2,3). Their sizes are matched at six. Both arrays are 2-D.

Table 2.3: Table demonstrating incorporation of rule-2 on numpy arrays having the same dimensions

Note: If the sizes in any dimension are not the same and are not equal
to one, we will receive an error because numpy is unable to do
broadcasting across those arrays.
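For instance (a short hedged sketch of the failure case, with arbitrary shapes), attempting to add arrays with incompatible shapes raises an error:

import numpy as mynp

myarr_a = mynp.arange(6).reshape(2, 3)   # shape (2,3)
myarr_b = mynp.arange(4)                 # shape (4,): 3 != 4 and neither size is 1

try:
    print(myarr_a + myarr_b)
except ValueError as err:
    print("Broadcasting failed:", err)   # operands could not be broadcast together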

Let us understand the broadcast concept with the code demonstration program name: Chap2_Example2.25.py as follows:
import numpy as mynp

print("Broadcasting of 1-D array--------------")


# Broadcasting of 1-D array
myarr1=mynp.array([21,22,23])
myarr2= mynp.array([10])
print(f"myarr1 is {myarr1}")
print(f"myarr2 is {myarr2}")
print(f"myarr1+myarr2 is: {myarr1+myarr2}")
# myarr1 is [21 22 23]
# on broadcasting myarr2 becomes [10 10 10]
# answer becomes [31 32 33]

print("Broadcasting of 2-D array--------------")


myarr3= mynp.arange(21,27).reshape(2,3)
myarr4= mynp.array([10,12,13])
print(f"myarr3 is {myarr3}")
print(f"myarr4 is {myarr4}")
print(f"myarr3+myarr4 is: {myarr3+myarr4}")
# myarr3 is [21 22 23]
# [24 25 26]
# on broadcasting myarr4 becomes [10 12 13]
# [10 12 13]
# answer becomes [31 34 36]
# [34 37 39]
Output:
Broadcasting of 1-D array--------------
myarr1 is [21 22 23]
myarr2 is [10]
myarr1+myarr2 is: [31 32 33]
Broadcasting of 2-D array--------------
myarr3 is [[21 22 23]
[24 25 26]]
myarr4 is [10 12 13]
myarr3+myarr4 is: [[31 34 36]
[34 37 39]]

Note: The above-mentioned code is covered in (program name:


Chap2_Example2.25.py).

Conclusion
In this chapter, we initially explored the concept of numpy arrays.
Subsequently, we delved into comparing and contrasting lists and numpy
arrays through illustrative examples. We also examined different methods
for creating ndarrays, including 1-D array creation using lists and tuples in
Python code. For 2-D array creation, we explored approaches such as
nested lists and specifying data types. Additionally, we covered the creation
of 1-D arrays using the arange() function. Furthermore, we introduced
several other functions such as linspace(), zeros(), ones(), full(), eye(),
diag(), and empty(), along with their respective functionalities, followed by
Python code snippets. This chapter also emphasized the crucial distinction
between View and Copy in numpy for memory efficiency, supported by
illustrative examples. The reader learned about different methods to access
individual elements or subsets of elements in ndarrays and how to iterate
them using loops. Finally, we have explored the wide range of arithmetic
operators available in numpy and how to leverage broadcasting for
operations on ndarrays with different shapes.
In the next chapter, we shall learn about various array manipulation
functions in numpy followed by functions such as sorting, searching dot
function for matrix multiplication, and getting to know linalg module for
performing linear algebra with various examples.

Points to remember
Numpy arrays are implemented in both C and Python languages.
Numpy arrays exhibit high performance and are the preferred choice
for machine learning algorithms.
Numpy arrays are faster and consume less memory compared to lists.
Numpy arrays contain homogeneous elements, while lists can contain
heterogeneous elements.
Vector operations can be performed on ndarrays, unlike lists.
1-D arrays can be created using various methods including lists,
tuples, arange(), linspace(), zeros(), ones(), full(), eye(), diag(),
and empty() functions.
Within a specified range, the randint() function generates random integers, from low (inclusive) to high (exclusive).
The shuffle() function will shuffle the order of the elements in the given ndarray.
View is not a separate object but just a logical representation of the
existing array whereas Copy is a separate object.
We can access elements of ndarray using indexing, advanced
indexing, slicing, and condition-based selection.
We can iterate ndarray elements using the loop concept of Python, the
nditer function, and the ndenumerate function.

The arithmetic operators like addition, subtraction, multiplication, division, etc. can be performed using the numpy library.
If there are variations in size, shape, and dimension, some
mathematical operations are permitted by broadcasting.

Questions
1. What defines a numpy array, and why is it often considered the
optimal choice for machine learning applications?
2. Explain the commonalities shared between lists and numpy arrays.
3. What are the distinguishing features between lists and numpy arrays?
4. How can 1-D arrays be created using lists, tuples, and a Python code
snippet?
5. Could you provide a Python code snippet demonstrating the creation
of 2-D arrays using nested lists?
6. How is array creation achieved with a specific data type, and could
you illustrate this with a Python code snippet?
7. How can 1-D arrays be created using the arange() function, and
could you provide an example with a Python code snippet?
8. What is the purpose of the linspace() function, and can you
showcase its usage with a Python code snippet?
9. Could you provide Python code snippets demonstrating the
functionality of functions such as zeros(), ones(), full(), eye(),
diag(), and empty()?

10. Explain a random module in numpy with a Python snippet code.


11. Explain the difference between View vs Copy with a Python snippet
code.
12. Explain with a Python snippet code how to access elements of
ndarray.
13. Explain with a Python snippet code how to iterate elements of
ndarray.
14. Explain with a Python snippet code the arithmetic operators in
numpy.
15. Explain with a Python snippet code the broadcasting concept in
numpy.

References
Programming techniques using Python: A must for beginners
covering Core + Advanced Python + MCQs for Microsoft and
various other certification exams.
Python for everyone: Coverage of fundamental topics for Core+
Advanced Python along with some open-source data science libraries
Building Modern GUIs with tkinter and Python: Coverage of GUI
app creation using the Tkinter library in Python.
Python GUI with PyQt: Covered GUI app creation using the PyQt5
library in Python.

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 3
Exploring Array Manipulations in Numpy
Introduction

In the previous chapter, we learned what exactly numpy array is and how to
create ndarrays using various approaches along with accessing and iterating
elements of ndarray. In addition to it, numpy offers a comprehensive suite
of array manipulation functions that are indispensable for efficient data
processing and analysis. Understanding how to reshape arrays, concatenate
or stack multiple arrays, and split them into meaningful segments is crucial
for managing and preparing datasets. Sorting and searching operations on
ndarrays are essential for organizing and extracting relevant information,
while the ability to insert or delete elements dynamically enhances data
manipulation capabilities. Moreover, numpyʼs dot function for matrix
multiplication and the linalg module for linear algebra operations are
fundamental tools for handling complex mathematical computations. They
make numpy an indispensable resource for researchers, data scientists, and
engineers working with large-scale numerical data.

Structure

In this chapter, we will discuss the following topics:

Array manipulation functions/variables in numpy


Multiple arrays joining into a single array
Splitting of arrays in numpy
Sorting of ndarray in numpy
Elements searching of ndarray in numpy
Elements insert into ndarray
Elements delete from ndarray in numpy
Usage of dot function for matrix multiplication
Usage of linalg module in numpy

Objectives
By the end of this chapter, the reader will have an idea about various array
manipulation functions and variables available in numpy. We will
understand different methods for joining ndarrays. We shall explore how to
split ndarrays into smaller ones. Sorting of ndarrays in numpy will be
explored. Some of the search functions available in numpy will be explored
along with the insertion and deletion of elements into/from ndarrays will be
carried out. The usage of the dot function in numpy for matrix
multiplication will be explored. Finally, the linalg module in numpy for
linear algebra will be well understood with various examples. So, by the
end of this chapter, the reader will be able to appreciate the concept of
numpy arrays by looking into their properties, ways of creation, and their
importance for machine learning algorithms. In this chapter, we will also be
using an alias for numpy library as mynp.

Array manipulation functions/variables in numpy

Array manipulation functions and variables in numpy are essential for


efficient data manipulation and analysis in scientific computing and data
analysis tasks. These tools are necessary to reshape arrays, concatenate or
split them along specific axes, transpose arrays, and extract information
about array dimensions and shapes. Such operations are crucial for tasks
like data preprocessing, image processing, numerical simulations, and
statistical analysis, allowing for streamlined and optimized manipulation of
array data structures, which are foundational to many scientific and data-
driven applications. In numpy, we shall see now various functions/variables
for manipulating arrays efficiently. We shall see direct examples with their
definitions and syntax.

reshape()

This function will change the array shape without changing its data. It is a
function present in the numpy module, or a method present in the ndarray
class. Just make sure that the total sizes match, otherwise we will get a
ValueError. Here, a View is created instead of a Copy. So, if changes are
done in the original array, they will be reflected in the reshaped array. Refer to
the following syntax for a better understanding:

mynp.reshape(a, newshape, order='C') # functional style and a is array


or

a.reshape(newshape, order='C') # object-oriented style

Now, we shall view an example of reshape() function which is


demonstrated in the code program name: Chap3_Example3.1.py as follows:

import numpy as mynp

# print(help(mynp.reshape)) # --- will display help for reshape function

print("Eg-1 1-D array conversion to 2-D array with by default C style order
---------------------------")

myarr1 = mynp.arange(1,7)

myarr2 = myarr1.reshape(2,3)

print(f"myarr1 is : {myarr1}")

print(f"myarr2 is : {myarr2}")

print("Eg-2 modifying the 2nd index value of original array ------------------


---------")

myarr1[1] = 12

print(f"myarr1 after modification in the original array is : {myarr1}")

print(f"myarr2 after modification in the reshaped array is : {myarr2}")

print("Eg-3 F Style order ---------------------------")

myarr3 = mynp.arange(1,7)

myarr4 = myarr3.reshape((2,3), order='F')

print(f"myarr3 is : {myarr3}")
print(f"myarr4 is : {myarr4}")

Output:

Eg-1 1-D array conversion to 2-D array with by default C style order -------
--------------------

myarr1 is : [1 2 3 4 5 6]

myarr2 is : [[1 2 3]

[4 5 6]]

Eg-2 modifying the 2nd index value of original array --------------------------


-

myarr1 after modification in the original array is : [ 1 12 3 4 5 6]

myarr2 after modification in the reshaped array is : [[ 1 12 3]

[ 4 5 6]]

Eg-3 F Style order ---------------------------

myarr3 is : [1 2 3 4 5 6]

myarr4 is : [[ 1 3 5]

[12 4 6]]

Note: The preceding code is covered in (program name:


Chap3_Example3.1.py).

resize()

This function will resize an array to a specified shape, which can either
increase or decrease the array size. Using the functional style, if the new
size requires more elements, the elements of the input array are repeated,
and a new array object is created. Using the object-oriented style, the extra
elements are filled with zeros and the existing array itself gets modified.
Refer to the following syntax for a better understanding:

mynp.resize(a, new_shape) # functional style and a is array

Or

a.resize(new_shape, refcheck = True) # object-oriented style

Now, we shall view an example of resize() function which is demonstrated


in the code program name: Chap3_Example3.2.py as follows:

import numpy as mynp

# print(help(mynp.resize))

print("Eg1 Increasing the size using functional style--------------------------")

myarr1=mynp.arange(1,5)

myarr2=mynp.resize(myarr1,(2,3))

print(f"myarr1 is: {myarr1}")

print(f"myarr2 is: {myarr2}")

print("Eg2 Increasing the size using object-oriented style -------------------")

myarr3=mynp.arange(1,5)

myarr3.resize((2,3))

print(f"myarr3 is: {myarr3}")

Output:

Eg1 Increasing the size using functional style------------------------------

myarr1 is: [1 2 3 4]
myarr2 is: [[1 2 3]

[4 1 2]]

Eg2 Increasing the size using object-oriented style ------------------------------

myarr3 is: [[1 2 3]

[4 0 0]]

Note: The preceding code is covered in (program name:


Chap3_Example3.2.py).

flatten()

This method collapses the elements of a given multi-dimensional array into
a new 1-D array (its output). A copy of the array collapsed into one
dimension is returned. It is not a numpy function but a method present in the
ndarray class. Refer to the following syntax for a better understanding:

ndarray.flatten(order='C')

Now, we shall review an example of flatten() function which is


demonstrated in the code program name: Chap3_Example3.3.py as follows:

import numpy as mynp

# print(help(mynp.ndarray.flatten))

print("Eg1 Displaying flatten method example with default order='C'--------


-----")

myarr1=mynp.arange(1,7).reshape(2,3)

myarr2 = myarr1.flatten()

print(f"myarr1 is {myarr1}")

print(f"myarr2 is {myarr2}")
print("Eg2 Changing the original array value will not be reflected in the
flatten array------------------------------")

myarr1[0,0]=13

print(f"myarr1 is {myarr1}")

print(f"myarr2 is {myarr2}")

print("Eg3 with order='F'------------------------------")

myarr3=myarr1.flatten('F')

print(f"Flatten array myarr3 is {myarr3}")

Output:

Eg1 Displaying flatten method example with default order='C'----------------


--------------

myarr1 is [[1 2 3]

[4 5 6]]

myarr2 is [1 2 3 4 5 6]

Eg2 Changing the original array value will not be reflected in the flatten
array------------------------------

myarr1 is [[13 2 3]

[ 4 5 6]]

myarr2 is [1 2 3 4 5 6]

Eg3 with order='F'------------------------------

Flatten array myarr3 is [13 4 2 5 3 6]


Note: The preceding code is covered in (program name:
Chap3_Example3.3.py)

flat variable

This variable is a 1-dimensional iterator over the array, which allows access
to the elements of a multi-dimensional array. Refer to the following
syntax for a better understanding:

ndarray.flat

Now, we shall view an example of a flat variable which is demonstrated in


the code program name: Chap3_Example3.4.py as follows:

import numpy as mynp

#print(help(mynp.ndarray.flat))

print("Eg1 ------------------------------")

mynp1=mynp.arange(1,7).reshape(2,3)

print(mynp1.flat)

# iterating

for loop in mynp1.flat:

print(loop)

Output:

Eg1 ------------------------------

<numpy.flatiter object at 0x000002C5D3984520>

1
2
3
4
5
6

Note: The preceding code is covered in (program name:


Chap3_Example3.4.py).

ravel()

This function is similar to the flatten method except that it returns a View,
whereas flatten returns a Copy. So, if any changes are made in the original
array, the same will be reflected in the ravelled array too, and vice versa.
Refer to the following syntax for a better understanding:

mynp.ravel(a, order='C')

or

a.ravel(order='C')

Now, we shall view an example of ravel() function which is demonstrated


in the code program name: Chap3_Example3.5.py as follows:

import numpy as mynp

#print(help(mynp.ndarray.ravel))

print("Eg1 using ravel method------------------------------")

myarr1 = mynp.arange(1,7).reshape(2,3)

myarr2 = myarr1.ravel()

print(f"myarr1 is {myarr1}")
print(f"myarr2 is {myarr2}")

print("Eg2 Making changes to the ravel array------------------------------")

myarr2[0] = 11

print(f"myarr1 after making changes to ravel array is {myarr1}")

print(f"myarr2 after making changes to ravel array is {myarr2}")

print("Eg3 Using ravel function------------------------------")

myarr3 = mynp.arange(1,7).reshape(2,3)

myarr4 = mynp.ravel(myarr3, order='F')

print(f"myarr3 is {myarr3}")

print(f"myarr4 is {myarr4}")

Output:

Eg1 using ravel method------------------------------

myarr1 is [[1 2 3]

[4 5 6]]

myarr2 is [1 2 3 4 5 6]

Eg2 Making changes to the ravel array------------------------------

myarr1 after making changes to ravel array is [[11 2 3]

[ 4 5 6]]

myarr2 after making changes to ravel array is [11 2 3 4 5 6]

Eg3 Using ravel function------------------------------


myarr3 is [[1 2 3]

[4 5 6]]

myarr4 is [1 4 2 5 3 6]

Note: The preceding code is covered in (program name:


Chap3_Example3.5.py)

transpose()

This function will permute the axes of an ndarray, reversing them by
default. A new array is not created; it returns a View of the existing array.
Using the T variable, we can use a shortcut representation of the transpose.
The size of each individual dimension does not change, only their order.
Refer to the following syntax for a better understanding:

mynp.transpose(a, axes=None)

or

a.transpose(axes=None)

If we do not specify the axes parameter, then the dimensions will be
reversed, as explained with a few examples in the following table:

Original Array Shape | Transposed Array Shape with no axes mentioned
1-D: (3,) | (3,), no change.
2-D: (3,4) | (4,3) axis0 is 4 and axis1 is 3.
3-D: (3,4,5) | (5,4,3) axis0 is 5, axis1 is 4 and axis2 is 3.
4-D: (1,2,3,4) | (4,3,2,1) axis0 is 4, axis1 is 3, axis2 is 2 and axis3 is 1.

Table 3.1: Table depicting the reversing of dimensions when no axes
parameter is specified

Now, we shall view an example of transpose() function which is


demonstrated in the code program name: Chap3_Example3.6.py as follows:

import numpy as mynp

#print(help(mynp.transpose))

print("Eg1 transpose with non-specified axis parameter of 2-D array --------


----------------------")

a = mynp.arange(1,7).reshape(2,3)

myarr1 = a.transpose()

print(f"Original Array is : \n {a}")

print(f"Transposed Array is : \n {myarr1}")

print("Eg2 Transpose with specified axis parameter of 3-D array -------------


-----------------")

a = mynp.arange(1,25).reshape(2,3,4)

myarr2 = mynp.transpose(a,axes=(1,2,0))

print(f"Original 3-D Array is : \n {a}")

print(f"Transposed 3-D Array is : \n {myarr2}")

print(f"Shape of Original 3-D Array is : \n {a.shape}")


print(f"Shape of Transposed 3-D Array is : \n {myarr2.shape}")

print("Eg3 Using T variable------------------------------")

a = mynp.arange(1,25).reshape(2,3,4)

myarr3 = a.T

print(f"Transposed 3-D Array is : \n {myarr3}") # (4,3,2)

Output:

Eg1 transpose with non-specified axis parameter of 2-D array ----------------


--------------

Original Array is :

[[1 2 3]

[4 5 6]]

Transposed Array is :

[[1 4]

[2 5]

[3 6]]

Eg2 Transpose with specified axis parameter of 3-D array ---------------------


---------

Original 3-D Array is :

[[[ 1 2 3 4]

[ 5 6 7 8]

[ 9 10 11 12]]
[[13 14 15 16]

[17 18 19 20]

[21 22 23 24]]]

Transposed 3-D Array is :

[[[ 1 13]

[ 2 14]

[ 3 15]

[ 4 16]]

[[ 5 17]

[ 6 18]

[ 7 19]

[ 8 20]]

[[ 9 21]

[10 22]

[11 23]

[12 24]]]

Shape of Original 3-D Array is :

(2, 3, 4)

Shape of Transposed 3-D Array is :

(3, 4, 2)
Eg3 Using T variable------------------------------

Transposed 3-D Array is :

[[[ 1 13]

[ 5 17]

[ 9 21]]

[[ 2 14]

[ 6 18]

[10 22]]

[[ 3 15]

[ 7 19]

[11 23]]

[[ 4 16]

[ 8 20]

[12 24]]]

Note: The preceding code is covered in (program name:


Chap3_Example3.6.py). If there is a requirement to interchange only two
dimensions then, we should go for swapaxes function.
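As a quick illustration of that point, the following is a minimal sketch (not one of the numbered programs of this book) showing swapaxes(), which interchanges exactly two axes while leaving the others untouched:

import numpy as mynp

a = mynp.arange(1,25).reshape(2,3,4)
# interchange only axis-0 and axis-2; shape (2, 3, 4) becomes (4, 3, 2)
myswapped = mynp.swapaxes(a, 0, 2)
print(f"Shape of original array is : {a.shape}")
print(f"Shape of swapped array is : {myswapped.shape}")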

Multiple arrays joining into a single array

Whenever there is a requirement for easier data integration and analysis,
optimized array operations, simpler data manipulation and processing, or
improved efficiency, we can join multiple arrays into a single array using
the functions explained below. Let us see direct examples with their
definitions and syntax.
concatenate()

This function will join arrays along a specified axis. Refer to the following
syntax:

mynp.concatenate ((a1, a2, ...), axis=0, out=None, dtype=None,


casting="same_kind")

When using the concatenate() function in numpy, the choice of the axis
parameter determines how arrays will be joined together. Here is what
happens with different axis values:

If axis = 0, then vertical concatenation will be performed.


If axis = 1, then horizontal concatenation will be performed.
If axis = None, the arrays are first flattened to 1-D arrays and then
concatenation is performed on the resultant arrays.
Any number of arrays can be joined, but they must all have the same
dimension.
All axes sizes, except the concatenation axis, should be the same.
The shapes of the output array and the resultant array must match.

Now, we shall view an example of concatenate() function which is


demonstrated in the code program name: Chap3_Example3.7.py as follows:

import numpy as mynp

#print(help(mynp.concatenate))

print("Eg1 1-D array concatenation------------------------------")

myarr1= mynp.arange(4)

myarr2= mynp.arange(5)

print(f"Concatenation array 1-D is {mynp.concatenate((myarr1,myarr2))}")

print("Eg2 2-D array concatenation with/without axes parameter-------------


-----------------")
myarr3 = mynp.arange(1,7).reshape(2,3)

myarr4 = mynp.array([[10,11,12],[13,14,15]])

print(f"Vertical concatenation: {mynp.concatenate((myarr3,myarr4),


axis=0)}")

print('-'*50)

print(f"Horizontal concatenation: {mynp.concatenate((myarr3,myarr4),


axis=1)}")

print('-'*50)

print(f"Flat and then concatenation: {mynp.concatenate((myarr3,myarr4),


axis=None)}")

print("Eg3 out parameter and dtype usage------------------------------")

# After concatenation, using the 'out' parameter, we can store

# the result in an array, but the result and out must be in the same shape.

myarr5 = mynp.arange(5)

myarr6 = mynp.arange(3)

myempty = mynp.empty(8, dtype=str)

print(f"out and dtype usage: {mynp.concatenate((myarr5,myarr6),


out=myempty)}")

print(f"out and dtype usage: {mynp.concatenate((myarr5,myarr6),


out=myempty, dtype=str)}") # error in the above line

Output:

Eg1 1-D array concatenation------------------------------


Concatenation array 1-D is [0 1 2 3 0 1 2 3 4]

Eg2 2-D array concatenation with/without axes parameter---------------------


---------

Vertical concatenation: [[ 1 2 3]

[ 4 5 6]

[10 11 12]

[13 14 15]]

--------------------------------------------------

Horizontal concatenation: [[ 1 2 3 10 11 12]

[ 4 5 6 13 14 15]]

--------------------------------------------------

Flat and then concatenation: [ 1 2 3 4 5 6 10 11 12 13 14 15]

Eg3 out parameter and dtype usage------------------------------

out and dtype usage: ['0' '1' '2' '3' '4' '0' '1' '2']

TypeError: concatenate() only takes `out` or `dtype` as an argument, but


both were provided.

Note: The preceding code is covered in (program name:


Chap3_Example3.7.py)

stack()

Using the stack function, a sequence of arrays is joined along a new axis. It
receives an iterable series of arrays as input and outputs a new array with
the input stacked along the new axis. The shape of input arrays must be the
same otherwise we will get an error. The stacked array resulting from the
stack() function has an additional dimension compared to the input arrays.

Refer to the following syntax for a better understanding:

mynp.stack(arrays, axis=0, out=None, *, dtype=None,


casting='same_kind')

Here, 1-D + 1-D → 2-D (stacking 1-dimensional arrays leads to a two-dimensional array) and 2-D + 2-D → 3-D (stacking 2-dimensional arrays leads to a three-dimensional array).

Now, we shall view an example of stack() function which is demonstrated


in the code program name: Chap3_Example3.8.py as follows:

import numpy as mynp

# print(help(mynp.stack))

print("Eg1 axis-0(default) stacking in 1-D array ------------------------------")

myarr1= mynp.array([1,2,3])

myarr2 = mynp.array([10,12,13])

mystack_array = mynp.stack((myarr1,myarr2))

print(f"Stack array result is: {mystack_array }")

print(f"Shape of stack array is: {mystack_array.shape}")

print("Eg2 stacked array with axis-1 ------------------------------")

myarr3= mynp.arange(1,7).reshape(2,3)

myarr4 = mynp.array([[10,12,13],[14,15,17]])

mystack_array_3D = mynp.stack((myarr3,myarr4), axis=1)


print(f"Stack array result is: {mystack_array_3D }")

print(f"Shape of stack array is: {mystack_array_3D.shape}")

# Take 1st row from 1st array and make it 1st row of the stacked array

# Take 2nd row from 2nd array and make it 2nd row of the stacked array

# Combine these 2 rows and repeat the remaining rows of the arrays

# axis-1 means row wise in each 2-D array

print("Eg3 stacked array with axis-2 ------------------------------")

# axis-2 means column wise in each 2-D array

mystack_array_3D = mynp.stack((myarr3,myarr4), axis=2)

print(f"Stack array result is: {mystack_array_3D }")

print(f"Shape of stack array is: {mystack_array_3D.shape}")

Output:

Eg1 axis-0(default) stacking in 1-D array ------------------------------

Stack array result is: [[ 1 2 3]

[10 12 13]]

Shape of stack array is: (2, 3)

Eg2 stacked array with axis-1 ------------------------------

Stack array result is: [[[ 1 2 3]

[10 12 13]]

[[ 4 5 6]
[14 15 17]]]

Shape of stack array is: (2, 2, 3)

Eg3 stacked array with axis-2 ------------------------------

Stack array result is: [[[ 1 10]

[ 2 12]

[ 3 13]]

[[ 4 14]

[ 5 15]

[ 6 17]]]

Shape of stack array is: (2, 3, 2)

Note: The preceding code is covered in (program name:


Chap3_Example3.8.py)

vstack()

This function will stack arrays vertically which takes the sequence as input
and the single array is returned by stacking them vertically. Concatenation
along the first axis, following the reshaping of 1-D arrays of shape (N,) to
(1, N), is the equivalent of the vstack() function. An important point is that
all the input arrays must have the same shape except at axis zero. Refer to
the following syntax for a better understanding:

mynp.vstack(tup, *, dtype=None, casting='same_kind')

Now, we shall view an example of vstack() function which is demonstrated


in the code program name: Chap3_Example3.9.py as follows:

import numpy as mynp


# print(help(mynp.vstack))

print("Eg1 vstack for 1-D array------------------------------")

myarr1 = mynp.array([1, 2, 3,])

myarr2 = mynp.array([11,12,13])

myarr3 = mynp.vstack((myarr1,myarr2))

print(f"Vstack array is {myarr3}")

print("Eg2 2-D array shape with (3,3) and (2,3)------------------------------")

a = mynp.arange(1,10).reshape(3,3)

b = mynp.arange(20,26).reshape(2,3)

print(f" vstack array is : {mynp.vstack((a,b))}") # size of axis 1 of both the


arrays are same

Output:

Eg1 vstack for 1-D array------------------------------

Vstack array is [[ 1 2 3]

[11 12 13]]

Eg2 2-D array shape with (3,3) and (2,3)------------------------------

vstack array is : [[ 1 2 3]

[ 4 5 6]

[ 7 8 9]

[20 21 22]

[23 24 25]]
Note: The preceding code is covered in (program name:
Chap3_Example3.9.py)

hstack()

This function will stack arrays horizontally, which takes an array sequence as
input, and a single array is returned by stacking them horizontally. An
important point is that all the input arrays must have the same shape along
every axis except the second. This is the same as concatenation along the second axis, except
for 1-D arrays, where it concatenates along the first axis. Refer to the
following syntax for a better understanding:

mynp.hstack(tup, *, dtype=None, casting='same_kind')

Now, we shall view an example of hstack() function which is demonstrated


in the code program name: Chap3_Example3.10.py as follows:

import numpy as mynp

# print(help(mynp.hstack))

print("Eg1 hstack for 1-D array------------------------------")

myarr1 = mynp.array([1, 2, 3,])

myarr2 = mynp.array([11,12,13,14])

myarr3 = mynp.hstack((myarr1,myarr2))

print(f"Hstack array is {myarr3}")

print("Eg2 2-D array shape with (3,3) and (3,2)------------------------------")

a = mynp.arange(1,10).reshape(3,3)

b = mynp.arange(1,7).reshape(3,2)

print(f" hstack array is : {mynp.hstack((a,b))}")


print("Eg with error ------------------------------")

myarr4 = mynp.arange(1,5).reshape(2,2)

myarr5 = mynp.arange(1,7).reshape(3,2)

print(f" hstack array error for the above :


{mynp.hstack((myarr4,myarr5))}")

Output:

Eg1 hstack for 1-D array------------------------------

Hstack array is [ 1 2 3 11 12 13 14]

Eg2 2-D array shape with (3,3) and (3,2)------------------------------

hstack array is : [[1 2 3 1 2]

[4 5 6 3 4]

[7 8 9 5 6]]

Eg with error ------------------------------

ValueError: all the input array dimensions except for the concatenation axis
must match exactly, but along dimension 0, the array at index 0 has size 2
and the array at index 1 has size 3

Note: The preceding code is covered in (program name:


Chap3_Example3.10.py)

dstack()

This function will stack arrays along axis two, which is the third axis. It
takes an array sequence as input, and a single array is returned by stacking
them along the third axis. An important point to note is that the 1-D or 2-D
input arrays must have the same shape. This is equivalent to concatenation
along the third axis after reshaping 2-D arrays of shape (A, B) to (A, B, 1)
and 1-D arrays of shape (A,) to (1, A, 1). Refer to the following syntax for a better
understanding:

mynp.dstack(tup)

Now, we shall view an example of dstack() function which is demonstrated


in the code program name: Chap3_Example3.11.py as follows:

import numpy as mynp

# print(help(mynp.dstack))

print("Eg1 dstack for 1-D array------------------------------")

myarr1 = mynp.array([1, 2, 3,])

myarr2 = mynp.array([11,12,13])

myarr3 = mynp.dstack((myarr1,myarr2))

print(f"dstack array is {myarr3}")

print(f"dstack 1-D array shape is {myarr3.shape}")

print("d stack for 2-D array------------------------------")

a = mynp.arange(11,17).reshape(3,2)

b = mynp.arange(1,7).reshape(3,2)

print(f" dstack 2-D array is : {mynp.dstack((a,b))}")

print(f" dstack 2-D array shape is : {mynp.dstack((a,b)).shape}")

Output:

Eg1 dstack for 1-D array------------------------------

dstack array is [[[ 1 11]


[ 2 12]

[ 3 13]]]

dstack 1-D array shape is (1, 3, 2)

d stack for 2-D array------------------------------

dstack 2-D array is : [[[11 1]

[12 2]]

[[13 3]

[14 4]]

[[15 5]

[16 6]]]

dstack 2-D array shape is : (3, 2, 2)

Note: The preceding code is covered in (program name:


Chap3_Example3.11.py).

Splitting of arrays in numpy

We can divide a ndarray into multiple smaller arrays along a specified axis
by performing split operations on ndarrays. Splitting of arrays in numpy is
essential for a variety of reasons. Firstly, it allows for efficient data
preprocessing by breaking down large datasets into smaller, manageable
chunks for analysis or manipulation. Secondly, splitting arrays facilitates
parallel computing, enabling different parts of an array to be processed
simultaneously across multiple computing units. Thirdly, it supports
organizing data for specific analytical tasks, such as cross-validation in
machine learning or time-series analysis in signal processing. Additionally,
array splitting provides flexibility in handling data with irregular or unequal
partitions, accommodating various computational requirements. Overall,
the ability to split arrays in numpy is crucial for optimizing computational
tasks, enhancing data organization, and improving overall efficiency in
numerical computing and data analysis workflows. We shall see direct
examples with their definitions and syntax.

split()

This function will split an array into multiple sub-arrays of equal size along
a specified axis and will return a list of ndarray objects. If array splitting is
based on sections, then the sub-array sizes should be equal (otherwise we
will get an error). If splitting is based on indices, then the sub-array sizes
need not be the same. Refer to the following syntax for a better understanding:

mynp.split(ary, indices_or_sections, axis=0)

Now, we shall view an example of split() function which is demonstrated in


the code program name: Chap3_Example3.12.py as follows:

import numpy as mynp

#print(help(mynp.split))

print("Eg1 split for 1-D array based on sections------------------------------")

myarr1 = mynp.arange(1,7)

myarr2 = mynp.split(myarr1,3)

print(f"Array type is: {type(myarr2)}")

print(f"Subarrays are : {myarr2}")

print("Eg2 2-D array split based on sections------------------------------")

myarr3 = mynp.arange(1,19).reshape(6,3)

print(f"Array split ino 2 sections: {mynp.split(myarr3,2, axis=0)}")

print(f"Array split ino 6 sections: {mynp.split(myarr3,6, axis=0)}")


print("Eg3 Splitting 1-D array based on indices")

myarr4=mynp.arange(1,11)

myresult = mynp.split(myarr4,[3,6])

print(f"myresult for 1-D after splitting is: {myresult}") # splitting on


indexes: 0,1,2 and 3,4,5 and 6,7,8,9

print("Eg4 Splitting 2-D array based on indices")

myarr5=mynp.arange(1,19).reshape(3,6)

myresult = mynp.split(myarr5,[3,5], axis=1)

print(f"myresult for 2-D after splitting is: {myresult}")

Output:

Eg1 split for 1-D array based on sections------------------------------

Array type is: <class 'list'>

Subarrays are : [array([1, 2]), array([3, 4]), array([5, 6])]

Eg2 2-D array split based on sections------------------------------

Array split into 2 sections: [array([[1, 2, 3],

[4, 5, 6],

[7, 8, 9]]), array([[10, 11, 12],

[13, 14, 15],

[16, 17, 18]])]

Array split into 6 sections: [array([[1, 2, 3]]), array([[4, 5, 6]]), array([[7, 8, 9]]), array([[10, 11, 12]]), array([[13, 14, 15]]), array([[16, 17, 18]])]
Eg3 Splitting 1-D array based on indices

myresult for 1-D after splitting is: [array([1, 2, 3]), array([4, 5, 6]), array([
7, 8, 9, 10])]

Eg4 Splitting 2-D array based on indices

myresult for 2-D after splitting is: [array([[ 1, 2, 3],

[ 7, 8, 9],

[13, 14, 15]]), array([[ 4, 5],

[10, 11],

[16, 17]]), array([[ 6],

[12],

[18]])]

Note: The preceding code is covered in (program name:


Chap3_Example3.12.py).

vsplit()

This function will allow row-wise splitting of an array into multiple sub-
arrays vertically along axis-0. Here, the input array must be at least a 2-D
array, as a 1-D array cannot be split vertically. Refer to the following syntax
for a better understanding:

mynp.vsplit(ary, indices_or_sections)

Now, we shall view an example of vsplit() function which is demonstrated


in the code program name: Chap3_Example3.13.py as follows:

import numpy as mynp

#print(help(mynp.vsplit))
myarr1= mynp.arange(1,17).reshape(4,4)

print("Eg1: splitting based on sections ................")

print(f"vsplit:2 {mynp.vsplit(myarr1,2)}")

print(f"vsplit:4 {mynp.vsplit(myarr1,4)}")

print("Eg2: splitting based on indices ................")

print(f"vsplit:2 {mynp.vsplit(myarr1,[2,3])}")

Output:

Eg1: splitting based on sections ................

vsplit:2 [array([[1, 2, 3, 4],

[5, 6, 7, 8]]), array([[ 9, 10, 11, 12],

[13, 14, 15, 16]])]

vsplit:4 [array([[1, 2, 3, 4]]), array([[5, 6, 7, 8]]), array([[ 9, 10, 11, 12]]),


array([[13, 14, 15, 16]])]

Eg2: splitting based on indices ................

vsplit:2 [array([[1, 2, 3, 4],

[5, 6, 7, 8]]), array([[ 9, 10, 11, 12]]), array([[13, 14, 15, 16]])]

Note: The preceding code is covered in (program name:


Chap3_Example3.13.py).

hsplit()

This function will allow column-wise splitting of an array into multiple


sub-arrays horizontally along the axis-1. Refer to the following syntax for a
better understanding:
mynp.hsplit(ary, indices_or_sections)

Now, we shall view an example of hsplit() function which is demonstrated


in the code program name: Chap3_Example3.14.py as follows:

import numpy as mynp

#print(help(mynp.hsplit))

myarr1= mynp.arange(1,9).reshape(2,4)

print("Eg1: splitting based on sections ................")

print(f"hsplit:2 {mynp.hsplit(myarr1,2)}")

print("Eg2: splitting based on indices ................")

print(f"hsplit: {mynp.hsplit(myarr1,[2,3])}")

Output:

Eg1: splitting based on sections ................

hsplit:2 [array([[1, 2],

[5, 6]]), array([[3, 4],

[7, 8]])]

Eg2: splitting based on indices ................

hsplit: [array([[1, 2],

[5, 6]]), array([[3],

[7]]), array([[4],

[8]])]
Note: The preceding code is covered in (program name:
Chap3_Example3.14.py)

dsplit()

This function will allow third-dimension splitting of an array into multiple


sub-arrays along axis-2, which means depth split. Refer to the following
syntax for a better understanding:

mynp.dsplit(ary, indices_or_sections)

Now, we shall view an example of dsplit() function which is demonstrated


in the code program name: Chap3_Example3.15.py as follows:

import numpy as mynp

# print(help(mynp.dsplit))

myarr1= mynp.arange(24).reshape(2,3,4)

print("Eg1: splitting based on sections ................")

print(f"dsplit: {mynp.dsplit(myarr1,2)}")

print("Eg2: splitting based on indices ................")

print(f"dsplit: {mynp.dsplit(myarr1,[1,3])}")

Output:

Eg1: splitting based on sections ................

dsplit: [array([[[ 0, 1],

[ 4, 5],

[ 8, 9]],

[[12, 13],
[16, 17],

[20, 21]]]), array([[[ 2, 3],

[ 6, 7],

[10, 11]],

[[14, 15],

[18, 19],

[22, 23]]])]

Eg2: splitting based on indices ................

dsplit: [array([[[ 0],

[ 4],

[ 8]],

[[12],

[16],

[20]]]), array([[[ 1, 2],

[ 5, 6],

[ 9, 10]],

[[13, 14],

[17, 18],

[21, 22]]]), array([[[ 3],

[ 7],
[11]],

[[15],

[19],

[23]]])]

Note: The preceding code is covered in (program name:


Chap3_Example3.15.py)

array_split()

This function will allow you to have an unequal splitting of an array into
sub-arrays if the split number does not divide the size of the specified axis
evenly. For an array of length a that should be divided into n parts, it
returns a % n sub-arrays of size a//n + 1 and the rest of size a//n. Refer to the following
syntax for a better understanding:

mynp.array_split(ary, indices_or_sections, axis=0)

Now, we shall view an example of array_split() function which is


demonstrated in the code program name: Chap3_Example3.16.py as
follows:

import numpy as mynp

# print(help(mynp.array_split))

myarr1 = mynp.arange(10)

myarraysplit = mynp.array_split(myarr1,3)

# 10 %3 = 1 subarray having size as 10//3 + 1 = 4

# and rest 2 subarray of each size 10//3 = 3

# Concept: a % n sub-arrays of size a//n + 1, where a: length and n: sections

# and the rest of size a//n

print(f"Array split 1-D array result is: {myarraysplit}")

myarr2 = mynp.arange(10).reshape(5,2)

myarraysplit2 = mynp.array_split(myarr2,4)

# Concept: # 5 % 4 (1) sub-arrays of size 5//4 + 1 = 2

# and the rest (3) subarrays of size 5//4 = 1

print(f"Array split 2-D array result is: {myarraysplit2}")

Output:

Array split 1-D array result is: [array([0, 1, 2, 3]), array([4, 5, 6]), array([7,
8, 9])]

Array split 2-D array result is: [array([[0, 1],

[2, 3]]), array([[4, 5]]), array([[6, 7]]), array([[8, 9]])]

Note: The preceding code is covered in (program name:


Chap3_Example3.16.py)

Sorting of ndarray in numpy

We can sort ndarray elements using the sort() function, as it will sort
elements along a specified axis in an array. Quicksort is the default sorting
algorithm and the sorting will be in ascending order. For 2-D arrays, sorting
is done along the last axis by default, so every 1-D row is sorted
individually. Refer to the following syntax for a better understanding:

mynp.sort(a, axis=-1, kind=None, order=None)

Now, we shall view an example of sorting of ndarray which is


demonstrated in the code program name: Chap3_Example3.17.py as
follows:
import numpy as mynp

# print(help(mynp.sort))

myarr1 = mynp.array([45,34,68,34,67,26,9,98,35])

print(f"Original array is: {myarr1}")

print(f"sorted 1-D array in ascending order is {mynp.sort(myarr1)}") #


default sorting is ascending

print(f"sorted 1-D array in descending order is {-mynp.sort(-myarr1)}") #


default sorting is ascending

print('-'*50)

myarr2 = mynp.array(['mat','hat','rat','bat','cat'])

print(f"String sorted 1-D array in ascending order is {mynp.sort(myarr2)}")


# default sorting is ascending

print(f"String sorted 1-D array in descending order is {mynp.sort(myarr2)


[::-1]}")

print('-'*50)

myarr3 = mynp.array([[78,35,98],[21,11,9],[87,35,65]])

print(f"Sorted 2-D array in ascending order is {mynp.sort(myarr3)}")

print(f"Sorted 2-D array in descending order is {-mynp.sort(-myarr3)}")

print('-'*50)

print("Sorting based on same marks of Students and correct questions


attempted in ascendong order")

mydata = [('name', 'S10'), ('marks', float), ('correct_questions', int)]


myvalues = [('Saurabh', 89.5, 52), ('Pallavi', 67, 38),('Priyanka', 89.5, 53),
('Yathartha', 88, 51)]

mydatatype = mynp.array(myvalues, dtype=mydata)

mysort_marks_correct = mynp.sort(mydatatype, order=['marks',


'correct_questions'])

print(f"Original Array is :\n {mydata}")

print(f"Sorting carried out based on correct_questions :\n


{mysort_marks_correct}")

Output:

Original array is: [45 34 68 34 67 26 9 98 35]

sorted 1-D array in ascending order is [ 9 26 34 34 35 45 67 68 98]

sorted 1-D array in descending order is [98 68 67 45 35 34 34 26 9]

--------------------------------------------------

String sorted 1-D array in ascending order is ['bat' 'cat' 'hat' 'mat' 'rat']

String sorted 1-D array in descending order is ['rat' 'mat' 'hat' 'cat' 'bat']

--------------------------------------------------

Sorted 2-D array in ascending order is [[35 78 98]

[ 9 11 21]

[35 65 87]]

Sorted 2-D array in descending order is [[98 78 35]

[21 11 9]
[87 65 35]]

--------------------------------------------------

Sorting based on same marks of Students and correct questions attempted in ascending order

Original Array is :

[('name', 'S10'), ('marks', <class 'float'>), ('correct_questions', <class 'int'>)]

Sorting carried out based on correct_questions :

[(b'Pallavi', 67. , 38) (b'Yathartha', 88. , 51) (b'Saurabh', 89.5, 52)

(b'Priyanka', 89.5, 53)]

Note: The preceding code is covered in (program name:


Chap3_Example3.17.py)
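The kind parameter shown in the syntax above lets us request a different sorting algorithm than the default quicksort. The following is a minimal sketch (not one of the numbered programs of this book), assuming a small integer array:

import numpy as mynp

myarr = mynp.array([45,34,68,34,67,26,9,98,35])
# 'stable' keeps the relative order of equal elements (here the two 34s)
print(f"stable sort result is {mynp.sort(myarr, kind='stable')}")
print(f"heapsort result is {mynp.sort(myarr, kind='heapsort')}")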

Elements searching of ndarray in numpy

Element searching of ndarrays in numpy is essential for various reasons. It


allows for efficient querying and retrieval of specific elements or subsets of
data within large arrays, enabling tasks such as data filtering, indexing, and
conditional operations. Practical significance arises in numerous domains,
including scientific computing, data analysis, and machine learning. For
instance, in scientific simulations, element searching facilitates identifying
specific data points corresponding to certain conditions or criteria, aiding in
hypothesis testing and model validation. In data analysis, it enables quick
identification of outliers or specific patterns within datasets, contributing to
insights and decision-making. Moreover, in machine learning algorithms,
efficient element searching is critical for tasks such as feature selection,
model evaluation, and parameter tuning. Overall, element searching in
ndarrays enhances the versatility and usability of numpy arrays in diverse
computational tasks, enabling efficient data manipulation and analysis.
Elements searching in numpy can be done by using the following two
approaches:
where()

Using where function, the indices of the elements that satisfy the criterion
will be returned if we simply point out the condition. where() function does
not return the elements. If we provide the condition, x, and y, then the items
that satisfy the condition will be replaced with x, and the rest of the
elements will be replaced with y. It returns the indices and will serve as a
replacement operator as well. When used as a replacement operator, the
functioning is just like that of the ternary operator. Refer to the following
syntax for a better understanding:

mynp.where(condition, [x, y], /)
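Before moving to condition-based selection, the following is a minimal sketch (not one of the numbered programs of this book) of where() used as a replacement operator with the x and y arguments, assuming a small hypothetical marks array:

import numpy as mynp

mymarks = mynp.array([35, 80, 52, 29, 91])
# elements satisfying the condition become 'Pass', the rest become 'Fail'
myresult = mynp.where(mymarks >= 40, 'Pass', 'Fail')
print(myresult) # ['Fail' 'Pass' 'Pass' 'Fail' 'Pass']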

condition based selection

Elements of a numpy array can be selected using Boolean indexing based
on a condition.

Now, we shall view an example of where() function and condition-based


selection which is demonstrated in the code program name:
Chap3_Example3.18.py as follows:

Refer to the following code for a better understanding:

import numpy as mynp

# print(help(mynp.where))

# using where function

myarr1= mynp.array([10,11,12,15,19,8,39,16,57,8])

mysearch = mynp.where(myarr1%2==0)

print(myarr1[mysearch])

# condition based selection

myboolean_indexing = myarr1%2==0
print(myarr1[myboolean_indexing])

Output:

[10 12 8 16 8]

[10 12 8 16 8]

Note: The preceding code is covered in (program name:


Chap3_Example3.18.py)

Elements insert into ndarray

Inserting elements into ndarrays in numpy is crucial for dynamically


modifying array data, enabling flexibility in array manipulation and data
processing tasks. This functionality is particularly valuable in scenarios
where new data needs to be added to an existing array or when modifying
specific elements within an array based on certain conditions or
requirements. Practical applications include data augmentation in image
processing, dynamic resizing of arrays in numerical simulations and
updating datasets in real-time analytics. Elements can be inserted into an
ndarray by using the following two approaches:

insert()

This function allows elements to be inserted into an array along a given


axis at a specified position. Refer to the following syntax for a better
understanding:

mynp.insert(arr, obj, values, axis=None)

Now, we shall view an example of insert() function which is demonstrated


in the code program name: Chap3_Example3.19.py as follows:

import numpy as mynp

# print(help(mynp.insert))

myarr1= mynp.array([10,11,12,15,19,8,39,16,57,8])
myarr2 = mynp.insert(myarr1,[1,4],[87,89])

print(myarr2) # inserting 87 before index 1 and 89 before index 4

myarr3 = mynp.insert(myarr1,2,33.56) # float ---> int as the original array contains int values

print(myarr3)

myarr4 = mynp.insert(myarr1,2,True) # addition of value 1 before index position 2

print(myarr4)

# If axis not defined for 2-D array, then 2D will be flattened to 1-D

# and then insertion will happen

myarr5 = mynp.arange(4).reshape(2,2)

myarr6 = mynp.insert(myarr5,2,10)

print(myarr6)

# when axis = 0 or -2 is defined then insertion of rows after broadcasting will happen

myarr7 = mynp.insert(myarr5,1,10, axis=0)

print(myarr7)

print('-'*50)

# when axis = 1 or -1 is defined then insertion of columns after broadcasting will happen

myarr8 = mynp.insert(myarr5,1,10, axis=-1)

print(myarr8)
Output:

[10 87 11 12 15 89 19 8 39 16 57 8]

[10 11 33 12 15 19 8 39 16 57 8]

[10 11 1 12 15 19 8 39 16 57 8]

[ 0 1 10 2 3]

[[ 0 1]

[10 10]

[ 2 3]]

--------------------------------------------------

[[ 0 10 1]

[ 2 10 3]]

Note: The preceding code is covered in (program name:


Chap3_Example3.19.py)

append()

This function will allow elements to be appended at the end of an array


along a specified axis. Refer to the following syntax for a better
understanding:

mynp.append(arr, values, axis=None)

Now, we shall view an example of append() function which is


demonstrated in the code program name: Chap3_Example3.20.py as
follows:

import numpy as mynp


# print(help(mynp.append))

myarr1 = mynp.arange(10)

print("In 1-D array, since element added is of float type, all the elements
will be converted to float type")

myarr2 = mynp.append(myarr1, 10.5)

print(myarr2)

myarr3 = mynp.append(myarr1, '10.5')

print(myarr3)

print("2-D array: axis is not specified. So will flattened to 1-D array")

myarr4 = mynp.arange(6).reshape(2,3)

myarr5 = mynp.append(myarr4,10)

print(myarr5)

print("If I/P array is 2-D array and axis is specified, then appended array
must also be 2-D array otherwise error")

myarr6 = mynp.arange(10,16).reshape(2,3)

myarr7 = mynp.append(myarr4,myarr6, axis=0)

print(myarr7)

print("appending to columns for 2-D array")

myarr8 = mynp.append(myarr4,myarr6, axis=1)

print(myarr8)

Output:
In 1-D array, since element added is of float type, all the elements will be
converted to float type

[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.5]

['0' '1' '2' '3' '4' '5' '6' '7' '8' '9' '10.5']

2-D array: axis is not specified. So will flattened to 1-D array

[ 0 1 2 3 4 5 10]

If I/P array is 2-D array and axis is specified, then appended array must
also be 2-D array otherwise error

[[ 0 1 2]

[ 3 4 5]

[10 11 12]

[13 14 15]]

appending to columns for 2-D array

[[ 0 1 2 10 11 12]

[ 3 4 5 13 14 15]]

Note: The preceding code is covered in (program name:


Chap3_Example3.20.py)

Elements delete from ndarray in numpy

The ability to delete elements from ndarrays in numpy is essential for


dynamically modifying array data, enabling efficient memory management
and data manipulation. This functionality allows for the removal of specific
elements or entire rows/columns from an array based on certain conditions
or requirements, facilitating tasks such as data cleaning, filtering, and
preprocessing. Practical applications include removing outliers or irrelevant
data points in statistical analysis, eliminating noise in signal processing and
reducing dimensionality in machine learning feature sets. Elements can be
deleted from ndarray by using the delete() function.

delete()

This function will remove array elements of ndarray along a specified axis.
For multi-dimensional arrays, the axis must be specified, otherwise, the
default value of axis=None will be used. In this scenario, the array is first
flattened to a 1-D array before deletion is carried out. Refer to the
following syntax for a better understanding:

mynp.delete(arr, obj, axis=None)

Now, we shall view an example of delete() function which is demonstrated


in the code program name: Chap3_Example3.21.py as follows:

import numpy as mynp

#print(help(mynp.delete))

myarr1 = mynp.arange(10)

print("deleting element present at index 4")

myarr2 = mynp.delete(myarr1, 4)

print(myarr2)

print("deleting elements from indices 2 to 5")

myarr3 = mynp.delete(myarr1, mynp.s_[2:6])

print(myarr3)

myarr4 = mynp.arange(9).reshape(3,3)

print("Since no axis is specified. So, flattening will be done for 2-D array")
myarr5 = mynp.delete(myarr4,2)

print(myarr5)

print("axis-0 is specified. So deleting will be done for that row index")

myarr6 = mynp.delete(myarr4,1, axis=0)

print(myarr6)

print("axis-1 is specified. So deleting will be done for that column index")

myarr7 = mynp.delete(myarr4,1, axis=1)

print(myarr7)

Output:

deleting element present at index 4

[0 1 2 3 5 6 7 8 9]

deleting elements from indices 2 to 5

[0 1 6 7 8 9]

Since no axis is specified. So, flattening will be done for 2-D array

[0 1 3 4 5 6 7 8]

axis-0 is specified. So deleting will be done for that row index

[[0 1 2]

[6 7 8]]

axis-1 is specified. So deleting will be done for that column index

[[0 2]
[3 5]

[6 8]]

Note: The preceding code is covered in (program name:


Chap3_Example3.21.py)

Usage of dot function for matrix multiplication

The dot() function in numpy is essential for performing matrix


multiplication efficiently, which is a fundamental operation in various
fields including linear algebra, signal processing, and machine learning.
Unlike element-wise multiplication (* operator), the dot() function
computes the dot product of two arrays, resulting in matrix multiplication
when dealing with multi-dimensional arrays. This function is crucial for
tasks such as solving systems of linear equations, transforming data in
neural networks, and performing various matrix operations required in
scientific computing and data analysis. In order to perform the dot product
of two arrays which can be either matrices or vectors, the dot function will
be used for matrix multiplication. Refer to the following syntax for a better
understanding:

mynp.dot(a, b, out=None)

Now, we shall view an example of dot() function which is demonstrated in


the code program name: Chap3_Example3.22.py as follows:

import numpy as mynp

# print(help(mynp.dot))

myarr1 = mynp.arange(1,5).reshape(2,2)

myarr2 = mynp.arange(11,15).reshape(2,2)

print(myarr1)

print(myarr2)
print("using numpy function")

print(mynp.dot(myarr1,myarr2))

print("Using ndarray object")

print(myarr1.dot(myarr2))

Output:

[[1 2]

[3 4]]

[[11 12]

[13 14]]

using numpy function

[[37 40]

[85 92]]

Using ndarray object

[[37 40]

[85 92]]

Note: The preceding code is covered in (program name:


Chap3_Example3.22.py)
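Since the text above mentions that the dot product also works on vectors, the following is a minimal sketch (not one of the numbered programs of this book) showing the dot product of two 1-D arrays:

import numpy as mynp

myvec1 = mynp.array([1, 2, 3])
myvec2 = mynp.array([4, 5, 6])
# for 1-D arrays, dot returns the scalar inner product: 1*4 + 2*5 + 3*6 = 32
print(mynp.dot(myvec1, myvec2))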

Usage of linalg module in numpy

In order to perform linear algebra operations in numpy like finding matrix


inverse, power, determinant, and solving linear algebra equations, we can
use the linalg module in numpy.
Now, we shall view an example of the linalg module in numpy, which is
demonstrated in the code program name: Chap3_Example3.23.py as
follows:

import numpy as mynp

print("Finding matrix inverse")

myarr1 = mynp.arange(11,15).reshape(2,2)

print(mynp.linalg.inv(myarr1)) # only valid for square matrices

print("Finding power of a matrix with n=0") # only valid for square


matrices

print(mynp.linalg.matrix_power(myarr1,0))

print("Finding power of a matrix with n>0")

print(mynp.linalg.matrix_power(myarr1,2))

print("Finding power of a matrix with n<0")

print(mynp.linalg.matrix_power(myarr1,-2)) # first inverse then power

print("Finding matrix determinant") # only valid for square matrices

print(mynp.linalg.det(myarr1))

print("Finding Solving linear algebra equations")

myarr2 = mynp.array([[1,1],[2,6]])

myarr3 = mynp.array([6,24])

print(mynp.linalg.solve(myarr2,myarr3))

Output:
Finding matrix inverse

[[-7. 6. ]

[ 6.5 -5.5]]

Finding power of a matrix with n=0

[[1 0]

[0 1]]

Finding power of a matrix with n>0

[[277 300]

[325 352]]

Finding power of a matrix with n<0

[[ 88. -75. ]

[-81.25 69.25]]

Finding matrix determinant

-1.9999999999999931

Finding Solving linear algebra equations

[3. 3.]

Note: The preceding code is covered in (program name:


Chap3_Example3.23.py)

Note: Apart from all these application usages of numpy which we have
seen till now, we can also use numpy library for some basic statistics like
calculating the minimum, maximum, average, median, variance, standard
deviation, and so on. It is also helpful in computing mathematical
operations like sine, cos, log functions, and so on. So, depending on
application-to-application requirements, we can use this numpy library. We
have left the calculation of a minimum for you to explore these concepts as
they are quite easy and direct usage can be applied for both 1-D and 2-D
arrays.
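As a starting point for that exploration, the following is a minimal sketch (not one of the numbered programs of this book) of a few such statistical functions applied to a 1-D array; the minimum is left for you to try, as mentioned above:

import numpy as mynp

myarr = mynp.array([45, 34, 68, 34, 67])
print(f"maximum is {mynp.max(myarr)}")
print(f"average (mean) is {mynp.mean(myarr)}")
print(f"median is {mynp.median(myarr)}")
print(f"standard deviation is {mynp.std(myarr)}")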

Conclusion

In this chapter, we initially covered various array manipulation functions


and variables in numpy, as well as methods for joining and splitting
ndarrays. Sorting and searching functions in numpy were explored, and the
reader gained knowledge on how to insert and delete elements from
ndarrays. The usage of the dot function for matrix multiplication and the
linalg module for linear algebra in numpy was explained with practical
examples. By the end of the chapter, the reader might have developed a
deep appreciation for the concept of numpy arrays by understanding their
properties, creation methods, and their significance along with their usage
in machine learning algorithms.

In the next chapter, we shall see about Scipy library which provides
additional functionality for scientific computing. It offers a wide range of
mathematical algorithms and functions for tasks such as optimization,
interpolation, linear algebra, signal processing, and many more.

Points to remember

ndarrays can be manipulated using functions like reshape, resize,


flatten, ravel, and so on.
Multiple arrays can be joined into a single array using functions like
concatenate, stack, vstack, hstack, dstack, and so on.
Performing split operations on ndarrays allows for breaking down a
single ndarray into multiple smaller arrays along a specified axis.
ndarray elements can be sorted using the sort() function as it sorts
elements around a specified axis in an array.
Elements searching in numpy can be done using where and condition-
based selection.
Elements can be inserted into ndarray using the insert and append
function.
Elements can be deleted from ndarray using the delete function.
In order to perform the dot product of two arrays which can be either
matrices or vectors, the dot function will be used for matrix
multiplication.
linalg module in numpy is used to perform linear algebra operations.

Questions

1. Explain all the array manipulation function/variables in numpy with a


Python snippet code.
2. Explain how multiple arrays can be joined into a single array with a
Python snippet code.
3. Explain the splitting of arrays in numpy with a Python snippet code.
4. Explain the sorting of ndarray in numpy with a Python snippet code.
5. Explain the elements searching of ndarray in numpy with a Python
snippet code.
6. Explain how elements can be inserted into ndarray with a Python
snippet code.
7. Explain how to delete elements from ndarray in numpy with a Python
snippet code
8. Explain with a Python snippet code usage of the dot function for
matrix multiplication.
9. Explain the usage of linalg module in numpy with a Python snippet
code.

Join our book’s Discord space

Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:

https://discord.bpbonline.com
CHAPTER 4
Exploring Scipy Library for Data
Science in Python

Introduction
Scipy is a comprehensive scientific computing library built on top of
NumPy, extending its capabilities with a wide range of tools for tasks like
linear algebra, optimization, numerical integration, and signal processing.
Co-created by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy
combines Pythonʼs ease of use with performance-critical components
written in C, C++ and Fortran. As a collection of mathematical algorithms
and convenience functions, Scipy enhances the NumPy extension of
Python, making it a powerful tool for technical and scientific applications.
To install scipy, we can use the pip command: pip install scipy.

Itʼs important to note the difference between pip and pip3. pip refers to the
default Python package installer, which may be associated with Python 2 or
Python 3 depending on your system configuration. To make sure scipy is
installed for Python 3, you can use pip3 explicitly:
pip3 install scipy
For specific version requirements, you can specify the version of the SciPy
library using pip:
pip install scipy==<version_number>
Structure
The following topics will be discussed in the current chapter:
Difference between numpy and scipy
Exploring scipy constants
Optimizers in scipy
Sparse data and sparse matrix in scipy
Graphs in scipy
Spatial data in scipy
Integrate in scipy
Interpolation in scipy

Objectives
After exploring all the concepts of this chapter, the reader will initially have
an idea about the difference between numpy and scipy array. We aim to
explore the scipy constants which represent physical quantities,
mathematical, scientific, and other useful values. We will also delve into
optimizers in scipy, where the different optimization algorithms available
will be discussed with some examples, such as finding the roots of an
equation, and many more. Additionally, we will study sparse data and sparse
matrix representation in scipy, enabling efficient storage and manipulation
of large, mostly empty matrices. The seven different sparse matrix types in
scipy will be well explored. We will learn about graphs in scipy, where the
module provides a number of functions to effectively analyze and work
with sparse graphs. Spatial data analysis using scipy allows us to work with
geographic and geometric data. The chapter covers integration techniques
in scipy, which are useful for numerical integration of functions. We shall
explore single and multiple integration examples. Finally, we will delve into
interpolation methods in scipy, which enables the estimation of values
between known data points. We will explore 1D interpolation, spline
interpolation, and interpolation with radial basis function for example. We
will be using an alias for scipy library with modules as myscpy at various
places.

Differences between numpy and scipy


Refer to the following table to understand the differences:

Sno. | Purpose | Numpy | Scipy
1 | Contains | Basic operations and array data. | Numerical code.
2 | Functionality | Focuses on efficient array operations, which include sorting, indexing, shape manipulation, and so on. | Includes functionality beyond arrays such as statistics, numerical integration, signal processing, and so on.
3 | Module | It is a stand-alone package. | It is built on top of numpy, providing additional scientific computing capabilities.
4 | Other libraries integration | Works as a foundation for scientific computing libraries such as scipy, pandas, matplotlib, and so on. | Provides a comprehensive ecosystem for scientific computing and data analysis.

Table 4.1: Table depicting difference between numpy and scipy

Exploring scipy constants


Mathematical and physical constants are offered in SI units by the
scipy.constants module. Because many built-in scientific constants are
provided, these values can be utilized in a variety of simulations and
calculations in data science. In the following code (program name:
Chap4_Example4.1.py), we display one physical and one mathematical constant:
import scipy.constants as myscpy
print(myscpy.c)   # displaying the speed of light: a physical constant
print(myscpy.pi)  # displaying the value of pi: a mathematical constant
Output:
299792458.0
3.141592653589793

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.1.py).

In the Programming Techniques using Python book, we explained the dir
function, which returns a list of valid attributes of an object. The list of
all units available in the constants module, obtained using the dir function,
can be seen in program name: Chap4_Example4.2.py as follows:
import scipy.constants as myscpy
print(dir(myscpy))
Output:
['Avogadro', 'Boltzmann', 'Btu', 'Btu_IT',
'Btu_th', 'ConstantWarning', 'G', 'Julian_year',
'N_A', 'Planck', 'R', 'Rydberg',
'Stefan_Boltzmann', 'Wien', '__all__',
'__builtins__', '__cached__', '__doc__',
'__file__', '__loader__', '__name__',
'__package__', '__path__', '__spec__', '_codata',
'_constants', '_obsolete_constants', 'acre',
'alpha', 'angstrom', 'arcmin', 'arcminute',
'arcsec', 'arcsecond', 'astronomical_unit', 'atm',
'atmosphere', 'atomic_mass', 'atto', 'au', 'bar',
'barrel', 'bbl', 'blob', 'c', 'calorie',
'calorie_IT', 'calorie_th', 'carat', 'centi',
'codata', 'constants', 'convert_temperature',
'day', 'deci', 'degree', 'degree_Fahrenheit',
'deka', 'dyn', 'dyne', 'e', 'eV', 'electron_mass',
'electron_volt', 'elementary_charge', 'epsilon_0',
'erg', 'exa', 'exbi', 'femto', 'fermi', 'find',
'fine_structure', 'fluid_ounce', 'fluid_ounce_US',
'fluid_ounce_imp', 'foot', 'g', 'gallon',
'gallon_US', 'gallon_imp', 'gas_constant', 'gibi',
'giga', 'golden', 'golden_ratio', 'grain', 'gram',
'gravitational_constant', 'h', 'hbar', 'hectare',
'hecto', 'horsepower', 'hour', 'hp', 'inch', 'k',
'kgf', 'kibi', 'kilo', 'kilogram_force', 'kmh',
'knot', 'lambda2nu', 'lb', 'lbf', 'light_year',
'liter', 'litre', 'long_ton', 'm_e', 'm_n', 'm_p',
'm_u', 'mach', 'mebi', 'mega', 'metric_ton',
'micro', 'micron', 'mil', 'mile', 'milli',
'minute', 'mmHg', 'mph', 'mu_0', 'nano',
'nautical_mile', 'neutron_mass', 'nu2lambda',
'ounce', 'oz', 'parsec', 'pebi', 'peta',
'physical_constants', 'pi', 'pico', 'point',
'pound', 'pound_force', 'precision',
'proton_mass', 'psi', 'pt', 'quecto', 'quetta',
'ronna', 'ronto', 'short_ton', 'sigma', 'slinch',
'slug', 'speed_of_light', 'speed_of_sound',
'stone', 'survey_foot', 'survey_mile', 'tebi',
'tera', 'test', 'ton_TNT', 'torr', 'troy_ounce',
'troy_pound', 'u', 'unit', 'value', 'week',
'yard', 'year', 'yobi', 'yocto', 'yotta', 'zebi',
'zepto', 'zero_Celsius', 'zetta']

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.2.py).

We will be viewing the placements of these units in various categories as


shown as follows:

Metric
Units are yotta, zetta, exa, peta, tera, giga, mega, kilo, hecto, deka, deci,
centi, milli, micro, nano, pico, femto, atto, zepto, and yocto. These are the
metric prefixes that we studied in Class-IX physics. Each prefix returns its
multiplier relative to the base unit (for example, kilo returns 1000.0), and
the code program name: Chap4_Example4.3.py revealing the metric units is
demonstrated as follows:
import scipy.constants as myscpy
print(myscpy.yotta) # 1e+24
print(myscpy.zetta) # 1e+21
print(myscpy.atto)  # 1e-18
print(myscpy.zepto) # 1e-21
Output:
1e+24
1e+21
1e-18
1e-21

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.3.py).

Force
The specified unit returned is Newton and the code program name:
Chap4_Example4.4.py revealing the force units is demonstrated as follows:

import scipy.constants as myscpy


print(myscpy.dyne)
print(myscpy.kgf)
Output:
1e-05
9.80665

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.4.py).
Binary
The specified unit returned is bytes; for example, kibi returns the value
1024 bytes. Refer to the following code program name: Chap4_Example4.5.py
depicting different binary units:
import scipy.constants as myscpy
print(myscpy.kibi) # 1024
print(myscpy.mebi) # 1024**2
print(myscpy.zebi) # 1024**7
print(myscpy.yobi) # 1024**8
Output:
1024
1048576
1180591620717411303424
1208925819614629174706176

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.5.py).

Angle
Radians is the specified unit which is returned. Refer to the following code
program name: Chap4_Example4.6.py depicting different angle units:
import scipy.constants as myscpy
print(myscpy.degree)
print(myscpy.arcmin)
print(myscpy.arcsec)
Output:
0.017453292519943295
0.0002908882086657216
4.84813681109536e-06

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.6.py).

Mass
The specified unit returned is kilograms; for example, gram returns the value
0.001 kilograms. Refer to the following code program name:
Chap4_Example4.7.py depicting different mass units:

import scipy.constants as myscpy


print(myscpy.gram)
print(myscpy.pound)
print(myscpy.carat)
print(myscpy.atomic_mass)
Output:
0.001
0.45359236999999997
0.0002
1.6605390666e-27

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.7.py).
Time
Seconds is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.8.py depicting different time units:
import scipy.constants as myscpy
print(myscpy.minute)
print(myscpy.hour)
print(myscpy.day)
Output:
60.0
3600.0
86400.0

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.8.py).

Pressure
Pascals is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.9.py depicting different pressure units.
import scipy.constants as myscpy
print(myscpy.atm)
print(myscpy.bar)
print(myscpy.psi)
Output:
101325.0
100000.0
6894.757293168361

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.9.py).

Length
Meters is the specified unit which is returned. Refer to the following code
program name: Chap4_Example4.10.py revealing different length units:
import scipy.constants as myscpy
print(myscpy.foot)
print(myscpy.mile)
print(myscpy.light_year)
Output:
0.30479999999999996
1609.3439999999998
9460730472580800.0

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.10.py).

Volume
Cubic meters is the specified unit that is returned. Refer to the following
code program name: Chap4_Example4.11.py revealing different volume
units:
import scipy.constants as myscpy
print(myscpy.liter)
print(myscpy.gallon)
print(myscpy.barrel)
Output:
0.001
0.0037854117839999997
0.15898729492799998

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.11.py).

Speed
Meters per second is the specified unit that is returned. Refer to the
following code program name: Chap4_Example4.12.py revealing different
speed units:
import scipy.constants as myscpy
print(myscpy.kmh)
print(myscpy.mph)
print(myscpy.speed_of_sound)
Output:
0.2777777777777778
0.44703999999999994
340.5

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.12.py).

Temperature
Kelvin is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.13.py revealing different temperature units:
import scipy.constants as myscpy
print(myscpy.zero_Celsius)
print(myscpy.degree_Fahrenheit)
Output:
273.15
0.5555555555555556

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.13.py).

Energy
Joules is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.14.py revealing different energy units:
import scipy.constants as myscpy
print(myscpy.eV)
print(myscpy.calorie)
Output:
1.602176634e-19
4.184

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.14.py).

Power
Watts is the specified unit that is returned. Refer to the following code
program name: Chap4_Example4.15.py revealing different power units:
import scipy.constants as myscpy
print(myscpy.hp)
print(myscpy.horsepower)
Output:
745.6998715822701
745.6998715822701

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.15.py).

Area
The square meter is the specified unit which is returned. Refer to the
following code program name: Chap4_Example4.16.py revealing different
area units:
import scipy.constants as myscpy
print(myscpy.hectare)
print(myscpy.acre)
Output:
10000.0
4046.8564223999992

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.16.py).
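Beyond the ready-made values shown above, the scipy.constants module also exposes the CODATA database through physical_constants and a find() helper, along with a convert_temperature() function. The following is a small supplementary sketch (not one of the numbered chapter examples) illustrating these lookups:
import scipy.constants as myscpy
# search the CODATA database for constants whose name contains 'boltzmann'
print(myscpy.find('boltzmann'))
# each physical_constants entry is a (value, unit, uncertainty) tuple
print(myscpy.physical_constants['Boltzmann constant'])
# converting 100 degrees Celsius to Fahrenheit
print(myscpy.convert_temperature(100, 'Celsius', 'Fahrenheit'))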

Optimizers in scipy
Through the scipy.optimize subpackage, scipy offers several optimization
techniques. It provides a set of procedures for finding the roots of an
equation or the minimum value of a function. Most machine learning algorithms
ultimately boil down to minimizing some objective function over the available
data. The frequently employed optimizers in scipy are mentioned as follows:
Global optimization routines such as basinhopping(), brute(), and
differential_evolution() (the older anneal() routine has been removed from
recent SciPy releases).
Minimization of multivariate scalar functions with minimize(), with and
without constraints, using a variety of techniques like
Broyden-Fletcher-Goldfarb-Shanno (BFGS), Nelder-Mead simplex, Newton
conjugate gradient, and Constrained Optimization BY Linear Approximations
(COBYLA).
Least-squares minimization techniques such as leastsq() and curve_fit().
Scalar univariate function minimization and root finding with
minimize_scalar() and newton().

The syntax of the root function is as follows:


myscpy.optimize.root(fun, x0, args=(),
method='hybr', jac=None, tol=None, callback=None,
options=None)
We shall discuss an example of finding the roots of an equation by using
scipy in the following code program name: Chap4_Example4.17.py:
# finding the root of the equation: 2*x + cos(x) = 0
import scipy as myscpy
from math import cos
# print(help(myscpy.optimize.root))
def myeq_func(x):
    return 2*x + cos(x)
myrootvar = myscpy.optimize.root(myeq_func, 1)  # args: the function and an initial guess
print(myrootvar.x)
print('-'*50)
print(myrootvar)  # information about the whole root-finding result
Output:
[-0.45018361]
--------------------------------------------------
message: The solution converged.
success: True
status: 1
fun: [-3.109e-15]
x: [-4.502e-01]
nfev: 9
fjac: [[-1.000e+00]]
r: [-2.435e+00]
qtf: [-2.970e-09]

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.17.py).

The explanation of the code is mentioned as follows:


The myeq_func function is defined with a single parameter x.
Inside the function, the value of 2*x + cos(x) is calculated and
returned.
The root function has myeq_func as the first argument and one as the
initial guess. The function returns an object that contains information
regarding the solution.
The result of the root-finding process is stored in the variable
myrootvar.

The value of myrootvar.x is displayed, which represents the root of


the function. The returned objectʼs attribute x contains the actual
solution.
The entire myrootvar object is displayed, providing information about
the root and the root-finding process.

Note: The initial value provided for the optimization process is called
the initial guess.
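Since curve_fit() was listed earlier among the least-squares optimizers, the following is a minimal hedged sketch of fitting a straight line y = a*x + b to a handful of synthetic points (the data values here are made up purely for illustration):
import numpy as mynp
import scipy.optimize as myopt

def myline(x, a, b):
    return a * x + b   # the model to be fitted

myx = mynp.array([0.0, 1.0, 2.0, 3.0, 4.0])
myy = mynp.array([1.1, 3.0, 5.1, 6.9, 9.2])   # roughly 2*x + 1 with a little noise
myparams, mycov = myopt.curve_fit(myline, myx, myy)
print(myparams)   # estimated [a, b], close to [2, 1]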

Further, we shall learn about minimizing the function which represents a


curve that perfectly defines high points (maxima) and low points (minima).
In the whole curve, the highest point is called the global maxima, and the
lowest point is called the global minima. For minimizing the function, we
will be using scipy.optimize.minimize function in which the useful
arguments are fun (a function which represents an equation), x0 (root initial
guess), method having legal values as the BFGS. Refer to the following full
forms for a better understanding:
CG: Conjugate Gradient
L-BFGS-B: Limited-memory Broyden–Fletcher–Goldfarb–Shanno
with Bound Constraints
Newton-CG: Newton-Conjugate Gradient
TNC: Truncated Newton
SLSQP: Sequential Least Squares Programming
COBYLA: Constrained Optimization BY Linear Approximations
The callback is the function after each iteration of optimization. Its syntax is
mentioned as follows:
myscpy.optimize.minimize (fun, x0, args=(),
method=None, jac=None, hess=None, hessp=None,
bounds=None, constraints=(), tol=None,
callback=None, options=None)
Let us directly view an example for more clarity. We shall be viewing the
code program name: Chap4_Example4.18.py which shows how to minimize the
function x**2 + 3*x + 5 with BFGS:
import scipy as myscpy
# print(help(myscpy.optimize.minimize))
def myfunc_eqn(x):
    return x**2 + 3*x + 5
myvarmin = myscpy.optimize.minimize(myfunc_eqn, 0, method='BFGS')
print(myvarmin)
Output:
message: Optimization terminated successfully.
success: True
status: 0
fun: 2.75
x: [-1.500e+00]
nit: 2
jac: [ 0.000e+00]
hess_inv: [[ 5.000e-01]]
nfev: 6
njev: 3
Note: The code referenced here is mentioned as (program name:
Chap4_Example4.18.py).

Here, the code defines a function called myfunc_eqn which takes a variable x
and returns the value of a quadratic equation. It then uses the minimize
function from the optimize module in the scipy library to find the
minimum value of the quadratic equation. The method used for
optimization is BFGS, and the initial guess for the minimum value is set to
zero. The result of the optimization is stored in the variable myvarmin and is
displayed on the screen eventually.
Remember that fun is the objective function value at the solution, jac is the
Jacobian (gradient), and message describes the cause of termination. nfev and
njev are the number of evaluations of the objective function and of its
Jacobian respectively, nit is the number of iterations, status and success
report the termination and exit status, and finally, x is the optimization
solution.
Let us view the following example that shows an optimization problem:
Consider the objective function: min x1^2 + 3*x1*x2
Constraints:
x1^3 + 2*x1*x2 = 200
x1^2 + 2*x1*x2 ≥ 50
-200 ≤ x1, x2 ≤ 200
Minimize the function using the SLSQP method. Use initial x0 as [1,1].
Refer to the following code program name: Chap4_Example4.19.py showing how
to minimize the function using the SLSQP method:
import scipy as myscpy
# Define the objective function
def myobjective(myvar):
    return myvar[0]**2 + 3*myvar[0]*myvar[1]

# Define the constraints
def myconstraint1(myvar):
    return myvar[0]**3 + 2*myvar[0]*myvar[1] - 200

def myconstraint2(myvar):
    return myvar[0]**2 + 2*myvar[0]*myvar[1] - 50

# Define the bounds
mybounds = [(-200, 200), (-200, 200)]

# Define the constraints dictionary
myconstraints = [{'type': 'eq', 'fun': myconstraint1},
                 {'type': 'ineq', 'fun': myconstraint2}]

# Initial guess
x0 = [1, 1]

# Minimize the objective function using the SLSQP method
myresult = myscpy.optimize.minimize(myobjective, x0, method='SLSQP',
                                    bounds=mybounds, constraints=myconstraints)
# Print the optimized result
print(myresult)
Output:
message: Optimization terminated successfully
success: True
status: 0
fun: 58.93456942250782
x: [ 5.668e+00 1.576e+00]
nit: 6
jac: [ 1.607e+01 1.701e+01]
nfev: 19
njev: 6

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.19.py).

The explanation of the code is given as follows:


The objective function is defined as myobjective(myvar), which takes a
vector myvar as input and returns the value of the function
myvar[0]**2 + 3*myvar[0]*myvar[1]. This is the function that is to be
minimized.
Two constraint functions are defined as myconstraint1(myvar) and
myconstraint2(myvar). myconstraint1(myvar) represents the constraint
myvar[0]**3 + 2*myvar[0]*myvar[1] - 200 = 0, and myconstraint2(myvar)
represents the constraint myvar[0]**2 + 2*myvar[0]*myvar[1] - 50 ≥ 0. These
functions define the constraints that the variables myvar must satisfy.
The bounds for the variables x are defined as mybounds = [(-200,
200), (-200, 200)], which means that x[0] and x[1] can take values
between -200 and 200. These bounds restrict the search space for the
optimization.
The constraints are defined as a list of dictionaries in the
myconstraints variable.

The first constraint is an equality constraint ('type': 'eq') and is


defined by the function myconstraint1(x).
The second constraint is an inequality constraint ('type': 'ineq')
and is defined by the function myconstraint2(x). These constraints
restrict the feasible region for optimization.
The initial guess for the optimization is set as x0 = [1, 1]. This is the
starting point from which the optimization algorithm will begin the
search.
The optimization is performed using the minimize function from the
scipy library. The objective function, initial guess, method SLSQP,
bounds, and constraints are passed as arguments to the minimize
function.
The result of the optimization is stored in the myresult variable.
Finally, the optimized result is displayed using print(myresult). This
will display the optimized values of the variables x and the minimum
value of the objective function.

Sparse data and sparse matrix in scipy


Elements which do not carry information are referred to as sparse data. Most
datasets contain many such uninformative entries, and storing or processing
them naively can slow down machine learning algorithms. Observe the given
array of sparse data [11,0,0,0,0,12,0,0,0,13,0,0,0]. Here, the meaningful
values are 11, 12, and 13. So, a dataset in which most items are zero is
sparse data. In a dense array, most of the values are non-zero, which means
they carry some information.
For providing functions to deal with sparse data, scipy has a module named
as scipy.sparse. The sparse matrix types in scipy are mentioned in the
following section.

Block sparse row matrix


The Block Sparse Row (BSR) matrix format in scipy is made to handle
sparse matrices with a block structure quickly and effectively. It is used to
effectively store and handle block-structured matrices, where each block
represents a dense submatrix. Refer to the following syntax:
scipy.sparse.bsr_matrix(arg1, shape=None,
dtype=None, copy=False, blocksize=None)
The following code program name: Chap4_Example4.20.py for understanding the
BSR matrix is demonstrated as follows:
import numpy as mynp
import scipy.sparse as myscpy
# Creating a dense matrix
mydense_matrix = mynp.array([[11, 12, 0, 0],
                             [13, 14, 15, 0],
                             [0, 16, 71, 18],
                             [0, 0, 19, 100]])
# Converting the dense matrix to BSR format
mybsr_matrix = myscpy.bsr_matrix(mydense_matrix, blocksize=(2, 2))
print(mybsr_matrix)
Output:
(0, 0) 11
(0, 1) 12
(1, 0) 13
(1, 1) 14
(0, 2) 0
(0, 3) 0
(1, 2) 15
(1, 3) 0
(2, 0) 0
(2, 1) 16
(3, 0) 0
(3, 1) 0
(2, 2) 71
(2, 3) 18
(3, 2) 19
(3, 3) 100

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.20.py). Just type print(help(scipy.sparse.bsr_matrix))
to get more information about the arguments used.

Coordinate list matrix


When only a small portion of a matrixʼs elements are non-zero and their
positions are known, the Coordinate List Matrix (COO) can be used to
efficiently store and handle the sparse matrix. For sparse matrices, where
only non-zero values are stored along with their row and column indices,
the COO list matrix is frequently used. Refer to the following syntax for a
better understanding:
scipy.sparse.coo_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.21.py for understanding the
COO list matrix is demonstrated as follows:
import numpy as mynp
import scipy.sparse as myscpy
myrows = mynp.array([0, 0, 0, 1, 1, 1])
mycols = mynp.array([0, 1, 2, 0, 1, 2])
mydata = mynp.array([11, 12, 13, 14, 15, 16])
mymatrix = myscpy.coo_matrix((mydata, (myrows, mycols)))
print(mymatrix)
Output:
(0, 0) 11
(0, 1) 12
(0, 2) 13
(1, 0) 14
(1, 1) 15
(1, 2) 16

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.21.py).

Compressed sparse column matrix


The Compressed Sparse Column (CSC) matrix in scipy is a sparse matrix
stored in a column-major format. It implies that the matrixʼs columns are
stored consecutively in memory, while the row indices of non-zero elements
are stored in a separate array. The CSC format is widely preferred for sparse
matrices due to its high efficiency in column-oriented operations like matrix
multiplication and vector-matrix multiplication. Refer to the following
syntax for a better understanding:
scipy.sparse.csc_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.22.py for understanding the
CSC matrix is demonstrated as follows:
import scipy.sparse as myscpy
# Creating a 3x3 CSC matrix
mydata = [11, 12, 13, 14, 15, 16]   # non-zero values in column-wise order
myrow_indices = [0, 1, 2, 0, 1, 2]  # row indices of non-zero values
mycol_pointers = [0, 2, 4, 6]       # index pointers to start of columns

mycsc_matrix = myscpy.csc_matrix((mydata, myrow_indices, mycol_pointers), shape=(3, 3))
print(mycsc_matrix)
Output:
(0, 0) 11
(1, 0) 12
(2, 1) 13
(0, 1) 14
(1, 2) 15
(2, 2) 16

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.22.py).

Compressed sparse row matrix


The non-zero elements of the SciPy Compressed Sparse Row (CSR)
matrix are kept in three different arrays, data, indices, and indptr. The
non-zero values of the matrix are contained in the data array in row-major
order. The column indices for each non-zero value are kept in the indices
array. The starting and ending indices of rows in the indices and data arrays
are indicated by the indptr array. Refer to the following syntax for a better
understanding:
scipy.sparse.csr_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.23.py for understanding the
CSR matrix is demonstrated as follows:
import numpy as mynp
import scipy.sparse as myscpy
# print(help(myscpy.csr_matrix))
myarray = mynp.array([0, 0, 1, 0, 0, 2, 0, 3, 0])
print(myscpy.csr_matrix(myarray))
print('-'*50)
myarray2 = mynp.array([[0, 0, 1], [0, 0, 2], [0, 3, 0]])
print('Viewing non-zero items using data property')
print(myscpy.csr_matrix(myarray2).data)
print('-'*50)
print('Counting non-zero items using count_nonzero method')
print(myscpy.csr_matrix(myarray2).count_nonzero())
print('-'*50)
print('Eliminating zero items using eliminate_zeros method')
myvar = myscpy.csr_matrix(myarray2)
myvar.eliminate_zeros()
print(myvar)
Output:
(0, 2) 1
(0, 5) 2
(0, 7) 3
--------------------------------------------------
Viewing non-zero items using data property
[1 2 3]
--------------------------------------------------
Counting non-zero items using count_nonzero method
3
--------------------------------------------------
Eliminating zero items using eliminate_zeros
method
(0, 2) 1
(1, 2) 2
(2, 1) 3

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.23.py).

Diagonal sparse matrix


Diagonal (DIA) sparse matrix format in scipy is used to store matrices with
a large percentage of zero-valued members and non-zero components that
are only found on the diagonal. For diagonal matrices, this format is
extremely beneficial. Refer to the following syntax for a better
understanding:
scipy.sparse.dia_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.24.py for understanding the
DIA sparse matrix is demonstrated as follows:
import scipy.sparse as myscpy
# Create a diagonal matrix
mydiagonal_values = [11, 12, 13, 14]
mydiagonal_offsets = [0]  # The main diagonal is at offset 0
mydia_matrix = myscpy.dia_matrix((mydiagonal_values, mydiagonal_offsets), shape=(4, 4))
print(mydia_matrix)
Output:
(0, 0) 11
(1, 1) 12
(2, 2) 13
(3, 3) 14

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.24.py).

Dictionary of keys sparse matrix


One of the sparse matrix representations in the library is the Dictionary Of
Keys (DOK) where elements with non-zero values are stored in a
dictionary. It is implemented using a dictionary data structure in which the
keys correspond to row, and column indices and the values represent the
corresponding element values. Refer to the following syntax for a better
understanding:
scipy.sparse.dok_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.25.py for understanding the
DOK matrix is demonstrated as follows:
import numpy as mynp
import scipy.sparse as myscpy
# Create an empty DOK matrix with shape (3, 3)
mymatrix = myscpy.dok_matrix((3, 3), dtype=mynp.float32)
# Set some non-zero elements
mymatrix[0, 1] = 12.5
mymatrix[1, 2] = 11.3
mymatrix[2, 0] = 14.7
# elements access
print(mymatrix)
# Converting the DOK matrix to a dense matrix
mydense_matrix = mymatrix.toarray()
print(mydense_matrix)
Output:
(0, 1) 12.5
(1, 2) 11.3
(2, 0) 14.7
[[ 0. 12.5 0. ]
[ 0. 0. 11.3]
[14.7 0. 0. ]]

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.25.py).

Row-based linked list sparse matrix


An additional sparse matrix representation in the scipy package is the row-
based Linked List (LIL) sparse matrix which is implemented using linked
lists and is intended for efficient row-wise operations. The LIL matrices are
especially helpful when the matrix is built incrementally since they make
element insertion and deletion quick and easy. Refer to the following syntax
for a better understanding:
scipy.sparse.lil_matrix(arg1, shape=None,
dtype=None, copy=False)
The following code program name: Chap4_Example4.26.py for understanding the
row-based LIL matrix is demonstrated as follows:
import numpy as mynp
import scipy.sparse as myscpy
# Create an empty LIL matrix with shape (3, 3)
mymatrix = myscpy.lil_matrix((3, 3), dtype=mynp.float32)
# Set some non-zero elements
mymatrix[0, 1] = 12.5
mymatrix[1, 2] = 11.3
mymatrix[2, 0] = 14.7
# elements access
print(mymatrix[0, 1])  # Output: 12.5
# Converting the LIL matrix to a dense matrix
mydense_matrix = mymatrix.toarray()
print(mydense_matrix)
Output:
12.5
[[ 0. 12.5 0. ]
[ 0. 0. 11.3]
[14.7 0. 0. ]]

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.26.py).
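All of these formats can be converted into one another and support the usual matrix operations. The following is a small supplementary sketch, in the same style as the examples above, showing a CSR matrix being converted to CSC and multiplied by a dense vector:
import numpy as mynp
import scipy.sparse as myscpy

myarr = mynp.array([[0, 2, 0], [1, 0, 0], [0, 0, 3]])
mycsr = myscpy.csr_matrix(myarr)
mycsc = mycsr.tocsc()                         # converting CSR to CSC
myresult = mycsr.dot(mynp.array([1, 2, 3]))   # sparse matrix-vector product
print(mycsc)
print(myresult)                               # [4 1 9]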

Graphs in scipy
In scipy, the scipy.sparse module offers tools for working with sparse
matrices, including sparse-matrix representations of graphs. The module
provides several sparse matrix formats that can effectively describe graphs.
Working with sparse graphs and carrying out operations on them is made
possible by the scipy.sparse.csgraph module. It contains routines for
connectivity, shortest paths through a graph, and other operations, and
provides several functions to effectively analyze and work with sparse
graphs.

Adjacency matrix
The links or relationships between the nodes in a graph are represented by
the scipy adjacency matrix, which is a square matrix. Typically, each node
in a graph is represented by a row and a column in the matrix. Whether
there is an edge or connection between two nodes is indicated by the values
in the matrix. Consider the following graph with elements A, B, and C:

Figure 4.1: Adjacency matrix graph

Further, we shall see some of the important methods to work with these
adjacency matrices.

connected_components method
The number of connected components in a graph can be found using scipyʼs
connected_components method. It accepts a sparse matrix representation of the
graph as input and returns the number of connected components together with
an array that assigns each node a label indicating which component it belongs
to. Refer to the following syntax for a better
understanding:
myscpy.csgraph.connected_components(csgraph,
directed=True, connection='weak',
return_labels=True)

Dijkstra method
Dijkstraʼs method is used to find the shortest path from a single source node
to all other nodes in a graph. Refer to the following syntax for a better
understanding:
myscpy.csgraph.dijkstra(csgraph, directed=True,
indices=None, return_predecessors=False,
unweighted=False, limit=np.inf,
min_only=False)

Floyd_Warshall method
The Floyd-Warshall method will be used to discover the shortest path
between all pairs of nodes in a graph. Refer to the following syntax for a
better understanding:
myscpy.csgraph.floyd_warshall(csgraph,
directed=True, return_predecessors=False,
unweighted=False,
overwrite=False)
The following code program name: Chap4_Example4.27.py for understanding
adjacency matrices is demonstrated as follows:
import scipy.sparse as myscpy
import numpy as mynp
myarr = mynp.array([
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0]
])
mynewarr = myscpy.csr_matrix(myarr)  # numpy array conversion into a CSR matrix
print("Usage of connected_components method")
# print(help(myscpy.csgraph.connected_components))
# to determine the number of connected components in the CSR matrix
print(myscpy.csgraph.connected_components(mynewarr))
print('-'*50)
print("Usage of Dijkstra method")
# to find the shortest path from a given starting node (index 0)
# to all other nodes in a graph represented by the mynewarr array.
# The return_predecessors parameter is set to True, which means that the
# algorithm will also return the predecessor nodes for each node in the shortest path
print(myscpy.csgraph.dijkstra(mynewarr, return_predecessors=True, indices=0))
print('-'*50)
# to discover the shortest path between all pairs of nodes
# in a graph represented by the mynewarr array.
print("Usage of Floyd_Warshall method")
print(myscpy.csgraph.floyd_warshall(mynewarr, return_predecessors=True))
Output:
Usage of connected_components method
(1, array([0, 0, 0]))
--------------------------------------------------
Usage of Dijkstra method
(array([0., 1., 3.]), array([-9999, 0,
1]))
--------------------------------------------------
Usage of Floyd_Warshall method
(array([[0., 1., 3.],
[1., 0., 2.],
[3., 2., 0.]]), array([[-9999, 0,
1],
[ 1, -9999, 1],
[ 1, 2, -9999]]))

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.27.py).

There are other methods like bellman_ford, which discovers shortest paths
while also handling negative edge weights; breadth_first_order and
depth_first_order are left for you to explore.
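As a starting point for that exploration, the following is a minimal sketch (not one of the numbered chapter examples, using a small made-up weighted graph) of calling bellman_ford and breadth_first_order:
import numpy as mynp
import scipy.sparse as myscpy

myarr = mynp.array([[0, 1, 2],
                    [1, 0, 0],
                    [2, 0, 0]])
mygraph = myscpy.csr_matrix(myarr)
# shortest distances from node 0; bellman_ford also tolerates negative weights
print(myscpy.csgraph.bellman_ford(mygraph, indices=0))
# nodes in the order visited by a breadth-first traversal starting at node 0
print(myscpy.csgraph.breadth_first_order(mygraph, 0))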

Spatial data in scipy


In scipy, data with a spatial or geographic component is referred to as
spatial data. Usually, it conveys details regarding the physical locations or
spatial relationships between various objects or occurrences. To work with
spatial data, scipy offers several modules and functions, including the
scipy.spatial and scipy.spatial.distance modules. They provide
features for nearest neighbor search, spatial indexing, spatial distance
calculations, spatial data structures, and so on. They are frequently
employed in applications for spatial analysis, image processing, computer
vision, and Geographic Information Systems (GIS). Coordinates,
geometries, distances, geographical relationships, and attributes related to
particular locations or regions are all examples of spatial data in scipy. Let
us view some functions to work with spatial data.

Triangulation function
In scipy, the term triangulation refers to the process of building a
triangulated mesh out of a collection of points or vertices. It is frequently
employed in data visualization and computational geometry. A polygon
triangulation is the splitting up of a polygon into several triangles so that its
area can be computed.
When you triangulate with points, you build triangles out of surfaces where
each of the given points is on at least one of the trianglesʼ vertices.
The scipy.spatial module offers triangulation functionality within the scipy
environment. The scipy.spatial.Delaunay class creates a Delaunay
triangulation, a particular kind of triangulation in which no point lies
inside the circumcircle of any triangle in the mesh.
The following code program name: Chap4_Example4.28.py for understanding the
Delaunay triangulation algorithm is demonstrated as follows:
import numpy as mynp
import scipy.spatial as myscpy
import matplotlib.pyplot as myplt
myarray = mynp.array([[2, 5], [3, 5], [3, 3], [2, 2], [4, 1]])
myvar_simplices = myscpy.Delaunay(myarray).simplices
myxcoordinate = myarray[:, 0]
myycoordinate = myarray[:, 1]
myplt.triplot(myxcoordinate, myycoordinate, myvar_simplices)
myplt.scatter(myxcoordinate, myycoordinate, color='b')
myplt.show()
Output:

Figure 4.2: Output of Chap4_Example4.28.py


Note: The code referenced here is mentioned as (program name:
Chap4_Example4.28.py).

Let us understand the code with some steps:


1. The above code generates a plot of a set of points using the Delaunay
triangulation algorithm.
2. First, an array of points is created using the numpy library.
3. Then, the Delaunay function from the scipy library is used to
calculate the simplices (triangles) that form the Delaunay
triangulation of the points.
4. The x and y coordinates of the points are extracted from the array.
5. The triplot function from the matplotlib library is used to plot the
triangulation, with the x and y coordinates and the simplices as
inputs.
6. Finally, the scatter function is used to plot the original points as blue
dots. The resulting plot shows the Delaunay triangulation of the
points, with the original points overlaid.
If the requirement is to find the smallest polygon covering all the given
points, then we need to use the ConvexHull method. The following code program
name: Chap4_Example4.29.py for understanding the ConvexHull method is
demonstrated as follows:
import numpy as mynp
import scipy.spatial as myscpy
import matplotlib.pyplot as myplt

myarray = mynp.array([[3, 5], [4, 5], [4, 1], [3, 3], [5, 2], [2, 3],
                      [6, 1], [4, 2], [2, 3], [1, 3]])
myvar_simplices = myscpy.ConvexHull(myarray).simplices
myxcoordinate = myarray[:, 0]
myycoordinate = myarray[:, 1]

myplt.scatter(myxcoordinate, myycoordinate, color='b')
for mypoints in myvar_simplices:
    myplt.plot(myarray[mypoints, 0], myarray[mypoints, 1], 'g-')

myplt.show()
Output:

Figure 4.3: Output of Chap4_Example4.29.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.29.py).

Refer to the following explanation of the code:


myarray is defined as a numpy array containing a list of points in a 2D
space.
myvar_simplices is assigned the simplices (indices of points forming
the convex hull) of myarray using the ConvexHull function from the
scipy.spatial module.

myxcoordinate is assigned the x-coordinates of all points in myarray.


myycoordinate is assigned the y-coordinates of all points in myarray.
A scatter plot is created using myxcoordinate as the x-coordinates and
myycoordinate as the y-coordinates, with the points colored blue.

A loop is used to iterate over each simplex in myvar_simplices.


For each simplex, a line is plotted using the x-coordinates and y-
coordinates of the points in myarray corresponding to the indices in
the simplex. The line color is set to green.
The plot is displayed using the myplt.show().
Let us learn about KDTrees. A data structure called the KDTree in scipy
enables efficient nearest-neighbor searches in a multidimensional space.
K-Dimensional (KD) means it can handle data with any number of dimensions.
A KDTree is a binary tree that holds the data points; each node splits the
space into two parts with a splitting hyperplane, which is determined by the
median value of one of the dimensions at each level of the tree. A KDTree
object is returned by the KDTree() constructor. The locations of the nearest
neighbors and the distances to them are provided via the query() method.
The following code program name: Chap4_Example4.30.py for understanding
KDTrees is demonstrated as follows:
import scipy.spatial as myscpy
import numpy as mynp
import matplotlib.pyplot as myplt

mypoints = mynp.array([[2, 3], [4, 1], [3, 6], [5, 6], [7, 2], [8, 9]])
mykdtree = myscpy.KDTree(mypoints)
myquery_point = mynp.array([5, 6])
mydistances, myindices = mykdtree.query(myquery_point, k=2)
mynearest_neighbors = mypoints[myindices]
print("Nearest neighbor value is as follows:", mynearest_neighbors)

myvar_simplices = myscpy.Delaunay(mypoints).simplices
myxcoordinate = mypoints[:, 0]
myycoordinate = mypoints[:, 1]
myplt.triplot(myxcoordinate, myycoordinate, myvar_simplices)
myplt.scatter(myxcoordinate, myycoordinate, color='b')
myplt.show()
Output:
Nearest neighbor value is as follows: [[5 6]
[3 6]]
Refer to the following figure to understand the output:

Figure 4.4: Output of Chap4_Example4.30.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.30.py).

Refer to the following explanation of the code:


First, we are creating a numpy array called mypoints with six points
in 2D space. Each point is represented by a pair of coordinates.
Then a KDTree object called mykdtree is created using the mypoints
array. This KDTree will be used to efficiently find the nearest
neighbors.
Next is defining a query point myquery_point as [5, 6]. This is the
point for which we want to find the nearest neighbors.
The KDTree method is used to find the two nearest neighbors to the
query point. It stores the distances to these neighbors in the
mydistances array and their indices in the myindices array.

The nearest neighbors are retrieved from the mypoints array using the
indices in myindices. It stores these nearest neighbors in the
mynearest_neighbors array which displays the values.

A Delaunay triangulation object is created called myvar_simplices


using the mypoints array. This triangulation will be used to plot the
points and their connections.
We extract the x-coordinates and y-coordinates from the mypoints
array and store them in myxcoordinate and myycoordinate
respectively.
The triangulation is plotted using the x-coordinates, y-coordinates,
and simplices from myvar_simplices. This creates a plot with the
points connected by triangles and then we plot the points using the x-
coordinates and y-coordinates, with the color set as blue. This adds
the individual points to the plot.
In data science, a variety of distance metrics, including Euclidean distance
and cosine distance, are used to calculate different kinds of distances
between a pair of points.
The distance between two vectors can take many different forms, such as the
number of unit steps needed, the angle between them as seen from the origin,
or simply the length of a straight line between them.
The performance of many machine learning algorithms, for instance K-Nearest
Neighbours or K-Means, relies heavily on the choice of distance metric. For
now, we are leaving the distance measurement topic for you to practice; a
brief example of the ready-made metrics in scipy.spatial.distance follows.
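The following is a minimal sketch (the point values are made up purely for illustration, and the alias mydist is our own naming) using the euclidean, cityblock (Manhattan), and cosine distances from scipy.spatial.distance:
import scipy.spatial.distance as mydist

mypoint1 = [1, 2, 3]
mypoint2 = [4, 6, 3]
print(mydist.euclidean(mypoint1, mypoint2))   # straight-line distance: 5.0
print(mydist.cityblock(mypoint1, mypoint2))   # unit-step (Manhattan) distance: 7
print(mydist.cosine(mypoint1, mypoint2))      # 1 - cosine similarity of the two vectors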
Integrate in scipy
The integrate module in scipy offers functions for carrying out numerical
integration. The integration functionality provided by SciPy serves a critical
role in numerical analysis, scientific computing, and various engineering
disciplines. It offers a suite of functions for performing numerical
integration, enabling the approximation of definite integrals for a wide
range of functions and scenarios. The need for integration arises in
numerous real-world applications where analytical solutions are either
infeasible or unavailable. These applications span fields such as physics,
engineering, economics, and statistics. For instance, in physics and
engineering, integration is often required to compute quantities such as
area, volume, center of mass, moment of inertia, and work done. In
statistics and machine learning, integration facilitates the calculation of
probabilities, expectations, and other statistical measures. SciPyʼs
integration capabilities provide efficient and accurate numerical methods
for approximating definite integrals, including both adaptive and non-
adaptive techniques. These methods enable researchers, scientists, and
engineers to tackle complex problems that involve continuous functions,
thereby enhancing the scope and accuracy of computational analyses in
diverse domains. One way to get close to a functionʼs definite integral is
through numerical integration.

Single integration
To determine a single-variable functionʼs definite integral across a
specified range, we use single integration with the quad function. Refer to
the following syntax for a better understanding:
scipy.integrate.quad(func, a, b, args=(),
full_output=0, epsabs=1.49e-08, epsrel=1.49e-08,
limit=50, points=None, weight=None, wvar=None,
wopts=None, maxp1=50, limlst=50,
complex_func=False)
Observe the following example: ∫₀¹ x² dx, whose exact value is 1/3 ≈ 0.3333.
The following code program name: Chap4_Example4.31.py for understanding
single integration in scipy is demonstrated as follows:
import scipy.integrate as myscpy
def myfunc(x):
    return x**2
# print(help(myscpy.quad))
myresult, myerror = myscpy.quad(myfunc, 0, 1)

print("My answer of the simple integral is:", myresult)
print("the error is:", myerror)
Output:
My answer of the simple integral is: 0.33333333333333337
the error is: 3.700743415417189e-15

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.31.py).

The code defines a function called myfunc that takes a parameter x and
returns the square of x. It then uses the quad function from the myscpy
module to calculate the definite integral of myfunc over the interval [0, 1].
The result and the error of the integral calculation are then displayed.
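The quad function also accepts infinite limits via numpyʼs inf, which is handy when an integral runs to infinity. The following is a minimal sketch (not one of the numbered chapter examples) evaluating ∫₀^∞ e^(-x) dx, whose exact value is 1:
import numpy as mynp
import scipy.integrate as myscpy

# integrate exp(-x) from 0 to infinity
myresult, myerror = myscpy.quad(lambda x: mynp.exp(-x), 0, mynp.inf)
print(myresult)   # approximately 1.0
print(myerror)    # small estimated error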

Multiple integration
Calculating the definite integrals of functions of multiple variables is
known as multiple integration in scipy. The nquad function from the scipy
library allows integration over an arbitrary number of variables, while
dblquad() and tplquad() handle double and triple integrals respectively.
Here, we will look at double integral problems that
scipy.integrate.dblquad(func, a, b, gfun, hfun) can resolve. Refer to the
following syntax for a better understanding:
scipy.integrate.nquad(func, ranges, args=None,
opts=None, full_output=False)
scipy.integrate.dblquad(func, a, b, gfun, hfun,
args=(), epsabs=1.49e-08, epsrel=1.49e-08)
scipy.integrate.tplquad(func, a, b, gfun, hfun,
qfun, rfun, args=(), epsabs=1.49e-08,
epsrel=1.49e-08)
The following code program name: Chap4_Example4.32.py for understanding
multiple integration with nquad in scipy is demonstrated as follows.
Observe the following example: ∫₄⁶ ∫₁² (x / y²) dx dy.
The answer to the above example is 0.125. Refer to the following code for a
better understanding:
import scipy.integrate as myscpy
# print(help(myscpy.nquad))
def myfunc(num1, num2):
    return num1 / num2**2

x_lower = 1
x_upper = 2
y_lower = 4
y_upper = 6

myresult, myerror = myscpy.nquad(myfunc, [[x_lower, x_upper], [y_lower, y_upper]])
print("The result of the multiple integral using nquad is:", myresult)
print("The error of the multiple integral using nquad is:", myerror)
Output:
The result of the multiple integral using nquad
is: 0.125
The error of the multiple integral using nquad is:
1.3877787807814457e-15

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.32.py).

A function called myfunc takes two arguments, num1 and num2, and returns the
result of dividing num1 by the square of num2. The code then sets the values
of x_lower, x_upper, y_lower, and y_upper to 1, 2, 4, and 6 respectively.
Next, the nquad function from the myscpy alias is used to calculate the
multiple integral of myfunc over the range defined by [x_lower, x_upper] and
[y_lower, y_upper]. The result of the integral is stored in the variable
myresult, and the error estimate is stored in the variable myerror. Finally,
the code prints the result and the error of the integral.
Now, let us explore double integration. The following code program name:
Chap4_Example4.33.py for understanding double integration in scipy is
demonstrated as follows.
Observe the following example: ∫₀¹ ∫₁^(1-x) -4(x + 5y) dy dx.
The answer to the above example is 8. Refer to the following code for a
better understanding:
import scipy.integrate as myscpy
# print(help(myscpy.nquad))
def myfunc(x, y):
    return -4*(x + 5*y)

x_lower = 0
x_upper = 1
y_lower = lambda x: 1
y_upper = lambda x: 1 - x

myresult, myerror = myscpy.dblquad(myfunc, x_lower, x_upper, y_lower, y_upper)
print("The result of the multiple integral using dblquad is:", myresult)
print("The error of the multiple integral using dblquad is:", myerror)
Output:
The result of the multiple integral using dblquad
is: 8.0
The error of the multiple integral using dblquad
is: 2.4328570415526704e-13

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.33.py).

An explanation of the code is mentioned as follows:


The code imports the scipy.integrate module as myscpy.
The myfunc function is defined, which takes two variables x and y,
and returns the value -4*(x + 5*y).
The lower and upper limits for the integration are then defined: x_lower is
set to 0 and x_upper is set to 1. y_lower is defined as a lambda function
that takes x as input and returns 1, and y_upper is defined as a lambda
function that takes x as input and returns 1-x.
The myscpy.dblquad function is called with the following arguments: myfunc as
the function to be integrated, x_lower and x_upper as the limits for the
integration over x, and y_lower and y_upper as the limits for the integration
over y. The result and error of the integration are assigned to myresult and
myerror respectively.
The result and error of the integration are printed using the print
function.
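Although tplquad() was listed in the syntax above, the chapter does not show it in action; the following is a minimal hedged sketch (not one of the numbered chapter examples) integrating x*y*z over the unit cube, whose exact value is 1/8 = 0.125. Note that tplquad expects the integrand arguments in the order (z, y, x):
import scipy.integrate as myscpy

def myfunc(z, y, x):
    return x * y * z

# x from 0 to 1, y from 0 to 1, z from 0 to 1
myresult, myerror = myscpy.tplquad(myfunc, 0, 1,
                                   lambda x: 0, lambda x: 1,
                                   lambda x, y: 0, lambda x, y: 1)
print(myresult)   # approximately 0.125
print(myerror)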

Interpolation in scipy
Interpolation is the technique of estimating values between known data
points in the context of the scipy library. It is frequently used to retrieve
values at key points in a dataset or to complete missing data. Scipy offers a
variety of interpolation techniques, including linear, polynomial, spline, and
others. Interpolation has various applications, and because missing data in a
dataset are frequently dealt with in machine learning, interpolation is
frequently used to replace those values. Imputation is the name for this
process of filling in data. Interpolation is frequently used in place of
imputation when discrete points in a dataset need to be smoothed. Scipy
gives us a module called scipy.interpolate that offers a wide range of
interpolation-related functions.

1D interpolation
A distribution with only one variable can be interpolated using the function
interp1d(). It requires two arrays, x and y, containing the coordinates of
the known data points, and a kind argument specifying the interpolation
method to be used. It returns a callable function that, when called with a
new x value, returns the corresponding interpolated y value. Refer to the
following syntax for a better understanding:
scipy.interpolate.interp1d(x, y, kind='linear',
axis=-1, copy=True, bounds_error=None,
fill_value=nan, assume_sorted=False)
The following code program name: Chap4_Example4.34.py for understanding 1D
interpolation in scipy is demonstrated as follows:
import scipy.interpolate as myscpy
import numpy as mynp
# print(help(myscpy.interp1d))

x = mynp.array([0, 2, 4, 6, 8, 10])
y = mynp.vectorize(lambda num: num**2)(x)
print(y)
myfunc = myscpy.interp1d(x, y, kind='linear')
print(myfunc(5))
Output:
[ 0 4 16 36 64 100]
26.0

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.34.py).

Refer to the following explanation of the code:


A numpy array x was created with values [0, 2, 4, 6, 8, 10].
A lambda function was applied to each element of x using the
vectorize function from numpy, and the resulting squares were stored
in the numpy array y.
The array y, which contained the squares of the elements in x, was
printed.
The interp1d function from scipy was used to create a linear
interpolation function named myfunc using the values of x and y.
The result of calling myfunc with the value five was printed. This
interpolated the value of y at x=5 using the linear interpolation
function.
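The kind argument is not limited to linear interpolation; as a small supplementary sketch (using the same made-up points), nearest-neighbour and cubic interpolation only require changing that one keyword:
import numpy as mynp
import scipy.interpolate as myscpy

x = mynp.array([0, 2, 4, 6, 8, 10])
y = x ** 2
mynearest = myscpy.interp1d(x, y, kind='nearest')
mycubic = myscpy.interp1d(x, y, kind='cubic')
print(mynearest(5))   # value of the nearest known data point
print(mycubic(5))     # close to the true value 25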

Spline interpolation
Spline interpolation is a method for estimating values between known data
points. It involves fitting the data to a piece-wise polynomial function
known as a spline. So, unlike 1D interpolation, where the points are fitted
to a single curve, here the points are fitted to a piece-wise function (a
function that has different definitions over different ranges). A spline
consisting of multiple polynomial segments connects the data points smoothly,
resulting in a smooth curve.
scipy.interpolate.UnivariateSpline(x, y, w=None,
bbox=[None, None], k=3, s=None, ext=0,
check_finite=False)
Let us view the following example of first-order spline interpolation with k
value as 1 in the following code program name: Chap4_Example4.35.py:
import scipy.interpolate as myscpy
import numpy as mynp
import matplotlib.pyplot as myplt
# print(help(myscpy.UnivariateSpline))

myarr1 = mynp.array([0, 1, 2, 3, 4, 5, 6])
myarr2 = mynp.array([0, 1, 4, 9, 16, 25, 36])
myfunc = myscpy.UnivariateSpline(myarr1, myarr2, k=1)

mynew_x = mynp.array([.5, 1.5, 2.5, 3.5, 4.5, 5.5])
mynew_y = myfunc(mynew_x)

print(mynew_y)

myplt.plot(mynew_x, mynew_y, 'ro-')
myplt.scatter(mynew_x, mynew_y)
myplt.xlabel('mynew_x')
myplt.ylabel('mynew_y')
myplt.title('Spline Interpolation with k=1')
myplt.show()
Output:
[ 0.32559189 3.36574767 6.40590345 12.40590345
21.36574767 30.32559189]
Refer to the following figure for a better understanding:

Figure 4.5: Output of Chap4_Example4.35.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.35.py).

Refer to the explanation of the code:


Two arrays, myarr1 and myarr2, are created with values [0, 1, 2, 3, 4,
5, 6] and [0, 1, 4, 9, 16, 25, 36] respectively.
The UnivariateSpline function from the scipy library is used to
create a new function called myfunc. It takes myarr1 and myarr2 as
inputs and has a degree of 1 (k=1). This UnivariateSpline function is
used with k=1 to create a linear spline interpolation function.
A new array called mynew_x is created with values [0.5, 1.5, 2.5, 3.5, 4.5, 5.5].
The myfunc function is applied to mynew_x to find the corresponding
values in mynew_y.
The values in mynew_y are displayed.
We will plot a line graph and scatter plot of the data points
represented by the arrays mynew_x and mynew_y. It also adds labels to
the x and y axes and a title to the graph. Finally, it will display the
graph.
We shall see second-order spline interpolation with k value as 2. The
following code program name: Chap4_Example4.36.py for understanding
second order spline interpolation is demonstrated as follows:
import scipy.interpolate as myscpy
import numpy as mynp
import matplotlib.pyplot as myplt

# Creating sample data points
myarr1 = mynp.array([0, 1, 2, 3, 4, 5, 6])
myarr2 = mynp.array([0, 1, 4, 9, 16, 25, 36])

myfunc = myscpy.UnivariateSpline(myarr1, myarr2, k=2)

mynew_x = mynp.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
mynew_y = myfunc(mynew_x)

print(mynew_y)
myplt.plot(mynew_x, mynew_y, 'ro-')
myplt.scatter(mynew_x, mynew_y)
myplt.xlabel('x')
myplt.ylabel('y')
myplt.title('Quadratic Spline Interpolation with k=2')
myplt.show()
Output:
[ 0.25 2.25 6.25 12.25 20.25 30.25]
Refer to the following figure for a better understanding:

Figure 4.6: Output of Chap4_Example4.36.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.36.py).

splrep function
Till now we have seen that the two necessary steps for spline interpolation
are:
1. Initially, a spline representation of the curve is computed.
2. Subsequently, the spline is assessed at the desired points.
Using the function splrep(), the direct spline approach is used to determine
the B-spline curve representation within a two-dimensional plane. With the
input keyword k, the default spline order of cubic can be modified. We can
parametrically define a curve in N-dimensional space using the splprep()
function. The amount of smoothing to apply during the spline fit is specified
using the keyword argument s. The splrep function takes a set of data points
as input and represents the B-spline as a tuple (tck). The knot sequence, the
B-spline coefficients, and the splineʼs degree are the three elements of this
tuple. Refer to the following syntax for a
scipy.interpolate.splrep(x, y, w=None, xb=None,
xe=None, k=3, task=0, s=None, t=None,
full_output=0, per=0, quiet=1)

splev function
In the scipy.interpolate module, the splev function is used to assess a B-
spline or its derivatives at specified points. It receives the input of the B-
spline representation (tck tuple) from splrep and returns the corresponding
values. The evaluated values of the B-spline at the specified points are
returned in an array by the splev function. Refer to the following syntax for
a better understanding:
scipy.interpolate.splev(x, tck, der=0, ext=0)
Let us view the following code using these two functions for k = 1. The
following code program name: Chap4_Example4.37.py for understanding the
splrep and splev functions is demonstrated as follows:

import numpy as mynp
import scipy.interpolate as myscpy
import matplotlib.pyplot as myplt
# print(help(myscpy.splrep))
# print(help(myscpy.splev))

# Specifying the known data points
myarr1 = mynp.array([1, 2, 3, 4, 5])
myarr2 = mynp.array([2, 4, 1, 5, 3])
# Performing spline interpolation with splrep
mytck = myscpy.splrep(myarr1, myarr2, k=1)

# Generating new x values for interpolation
myx_interp = mynp.linspace(1, 5, 100)

# Evaluating the spline at the new x values
myy_interp = myscpy.splev(myx_interp, mytck)

# Plotting the known data points
myplt.scatter(myarr1, myarr2, color='red', label='Data known')

# Plotting the interpolated values
myplt.plot(myx_interp, myy_interp, label='Interpolated')

# Assigning plot labels and title
myplt.xlabel('My x-axis')
myplt.ylabel('My y-axis')
myplt.title('Spline Interpolation with k=1')

# Adding legend
myplt.legend()

# Displaying the plot
myplt.show()
Refer to the following output:

Figure 4.7: Output of Chap4_Example4.37.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.37.py).

Refer to the explanation of the code:


The code performs spline interpolation on a set of known data points.
First, two arrays myarr1 and myarr2 are created to specify the known
data points.
Then, the splrep function from the scipy.interpolate module is
used to perform spline interpolation.
The splrep function takes the myarr1 and myarr2 arrays as input, and
the k parameter is set to one to specify linear interpolation. The
resulting spline is stored in the mytck variable.
New x values for interpolation are generated using the linspace
function from the numpy module.
The linspace function creates an array of 100 evenly spaced values between
one and five, which are stored in the myx_interp variable.
The splev function from the scipy.interpolate module is then used
to evaluate the spline at the new x values.
The splev function takes the myx_interp array and the mytck spline as
input and returns the corresponding y values.
These y values are stored in the myy_interp variable.
The known data points are plotted using the scatter function from the
matplotlib.pyplot module.
The scatter function takes the myarr1 and myarr2 arrays as input, and
the color parameter is set to red to specify the color of the data
points.
The label parameter is set to 'Data known' to create a legend for the
data points.
The interpolated values are plotted using the plot function from the
matplotlib.pyplot module.

The plot function takes the myx_interp and myy_interp arrays as


input, and the label parameter is set to Interpolated to create a legend
for the interpolated values.
The x-axis label is set to My x-axis using the xlabel function, and the
y-axis label is set to My y-axis using the ylabel function.
The title of the plot is set to Spline Interpolation with k=1 using the
title function.
Finally, a legend is added to the plot using the legend function.
When the k value is changed to 2, the plot looks as the following figure:
Figure 4.8: Output of Chap4_Example4.38.py

Note: The code referenced here is mentioned as (program name:


Chap4_Example4.38.py) which is mentioned in the code bundle due to
the repeatability of the code. The only change in the code will be in the
following line:

mytck = myscpy.splrep(myarr1, myarr2, k=2)

Interpolation with radial basis function


Radial Basis Function (RBF) interpolation uses a weighted sum of radial
basis functions to interpolate functions or data. An RBF is a mathematical
function whose value depends only on the distance between a point and a
fixed reference point (its center). In the interpolation context, RBFs
specify the shape of the smooth function that passes through the known data
points. Refer to the following syntax for a better understanding:
scipy.interpolate.Rbf(*args, **kwargs)
The following code program name: Chap4_Example4.39.py for a better
understanding:
# Interpolation with radial basis function is demonstrated as follows.
import numpy as mynp
import scipy.interpolate as myscpy
#print(help(myscpy.Rbf))
myarr1 = mynp.array([0, 1, 2, 3, 4, 5])
myarr2 = mynp.array([0, 1, 4, 9, 16, 25])

myrbf = myscpy.Rbf(myarr1, myarr2)

mynew_x = 3.5
myestimated_y = myrbf(mynew_x)

print(myestimated_y)
Output:
12.17057812839015

Note: The code referenced here is mentioned as (program name: Chap4_Example4.39.py).

Refer to the explanation of the code:


The code creates two arrays, myarr1 and myarr2, with values [0, 1, 2, 3, 4, 5] and [0, 1, 4, 9, 16, 25], respectively.
We use the Rbf function from the scipy.interpolate module to create an RBF interpolation model using the values from myarr1 as input and the values from myarr2 as output.
After that, it assigns a new value of 3.5 to mynew_x and uses the RBF
model to estimate the corresponding output value, which is assigned
to myestimated_y.
Finally, it prints the value of myestimated_y, which is the estimated
output value for the given input value of 3.5.
So far, we have covered many important concepts in scipy. There are many more concepts, such as ndimage, orthogonal distance regression, and statistical significance, available in scipy. All the concepts that we have learned till now will be used in our forthcoming chapters.
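These additional modules are not demonstrated in this chapter. Purely as an illustration, a minimal sketch of the ndimage module is shown below, where the small 2D array myimage is a hypothetical input chosen only for demonstration:
# A minimal sketch of scipy.ndimage (illustrative only; myimage is a hypothetical input)
import numpy as mynp
from scipy import ndimage as myndi
myimage = mynp.array([[0., 0., 0., 0.],
                      [0., 10., 10., 0.],
                      [0., 10., 10., 0.],
                      [0., 0., 0., 0.]])
# Smoothing the array with a Gaussian filter; sigma controls the amount of blur
mysmoothed = myndi.gaussian_filter(myimage, sigma=1)
print(mysmoothed)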

Conclusion
In this chapter, we initially explained the difference between numpy and scipy. We have explored the scipy constants, which represent physical quantities and other useful mathematical and scientific values. The chapter
also delved into optimizers in scipy, discussing various optimization
algorithms with practical examples such as finding roots of equations. The
concept of sparse data and sparse matrix representation in scipy is
thoroughly examined, highlighting their efficiency in storing and
manipulating large, mostly empty matrices. We have further explored
graphs in scipy, offering functions for effective analysis and manipulation
of sparse graphs. Spatial data analysis using scipy, which enables working with geographic and geometric data, was also explored. Integration techniques
are covered, providing numerical integration methods for functions,
including single and multiple integration examples. Lastly, interpolation methods in scipy, which allow estimation of values between known data points, were demonstrated with examples, including 1D interpolation, spline interpolation, and interpolation with radial basis functions.
In the next chapter, we shall discuss libraries such as matplotlib and plotly. Matplotlib is a commonly used Python charting library that offers a
range of functions and classes for constructing different types of plots, such
as line plots, bar plots, scatter plots, histograms, and so on. A powerful and
interactive plotting library that supports multiple programming languages
including Python is called plotly.

Points to remember
Scipy is built on top of numpy which is mostly written in Python
language and provides functionalities for numerical integration,
statistics, signal processing, and so on.
Scipy provides a wide range of mathematical and scientific constants, such as pi and e, and physical constants like the speed of light. These constants can be accessed through the scipy.constants module.
Scipy offers various optimization algorithms to find the minimum or
maximum of a function. The optimization module provides functions
to perform unconstrained and constrained optimization, including
methods like BFGS, Newton-CG, and SLSQP.
Scipy supports sparse data structures and provides efficient methods
for working with sparse matrices. Sparse matrices are useful when
dealing with large datasets that have mostly zero values, as they save
memory and computational resources.
Scipyʼs graph module provides functionality for working with graphs
and performing graph-related operations. It includes algorithms for
finding shortest paths, computing centrality measures, and many
more.
Scipy offers tools for working with spatial data, such as distance
calculations, convex hulls, and spatial transformations. The spatial
module can be used for tasks like spatial interpolation and nearest
neighbor search.
The integrate module in scipy provides functions for the numerical
integration of functions (single and multiple integrations) and solving
Ordinary Differential Equations (ODEs).
Scipyʼs interpolation module offers various interpolation techniques,
such as linear, polynomial, and spline interpolation. These methods
can be used to estimate values between known data points or to create
smooth curves from discrete data.

Questions
1. What is scipy? Who discovered it?
2. Explain the difference between numpy and scipy.
3. Explain scipy constants in the following categories with an example:
a. Metric
b. Force
c. Binary
d. Angle
e. Mass
f. Time
g. Pressure
h. Length
i. Volume
j. Speed
k. Temperature
l. Energy
m. Power
n. Area

4. Write a Python code to find the roots of the equation 2*x + cos(x)
using scipy optimizers.
5. Explain different types of sparse matrix in scipy with examples.
6. Explain important methods to work with adjacency matrices with
examples.
7. Explain the triangulation function in scipy. Also, explain the
Delaunay method, and ConvexHull method with an example.
8. Explain simple and multiple integration in scipy with an example.
9. Explain scipy 1D, spline, and radial basis interpolation functions
with an example.
CHAPTER 5
Line Plot exploration with
Matplotlib Library

Introduction
The representation of data in a graphical or visual format is known as data
visualization. Data visualization aims to perform analysis of complicated
data sets. Patterns, trends, and relationships can be easily seen and expressed
when data is presented visually. Charts, graphs, tables, maps, and other types
of data visualization are only a few examples. Each kind of visualization works best with certain kinds of information or ideas. For instance, scatterplots can
be used to depict the relationship between two variables, and line charts are
frequently used to show patterns over time.
In a variety of disciplines, including business, finance, science, and social
sciences, data visualization is a crucial tool. It lets decision-makers swiftly
grasp and analyze data, which can lead to more informed judgments. Data
visualization is a vital tool for researchers and educators because it can make
complicated ideas understandable to a wide audience.

Structure
In this chapter, we will discuss the following topics:
Python data visualization tools
Line plot creation by passing 2 ndarrays
Adding title, xlabel and ylabel to the line plot
Advanced line plot
Linestyle property
Color property
Default color
Quick settings for color, marker, and linestyle
alpha property
linewidth and markersize property
markerfacecolor property
Customizing the figure size
Plotting multiple lines in the same plot
Line plot creation by passing a single ndarray
Addition of grid lines to a plot
Major and minor grid lines display
Legends display to a plot
Using xlim and ylim functions to a plot

Objectives
By the end of this chapter, the reader will get to know about Pythonʼs data
visualization tools, covering a wide array of techniques and concepts to
enhance our ability to represent and analyze data effectively. First, we will
be learning about line plots, where we will explore creating them by passing
two ndarrays, adding essential elements like titles, x-labels, and y-labels, and
advancing our skills with properties like linestyle, color, alpha, linewidth,
markersize, markerfacecolor, and figure size customization. We will also
explore how to plot multiple lines in a single plot, creating line plots with
single ndarrays, incorporating grid lines (both major and minor), and in the
end getting to handle effective utilization of legends, xlim, and ylim
functions to a line plot.

Note: In this chapter, we are displaying the line plot graphs in colors,
but the hard copy of the book will be printed in black color. So, do not
get confused. It is requested that all the readers view the Image bundle
simultaneously for a better understanding of the code and concepts.

Python data visualization tools


The most popular Python data visualization tools are as follows:
Matplotlib: It is a widely used Python package for making animated,
interactive, and static visualizations. Line charts, scatter plots, bar
charts, histograms, and other visualizations are among the many that
are offered by using the matplotlib.
Seaborn: Based on matplotlib, Seaborn is a Python data visualization
package that offers a more advanced interface for producing statistical
visualization. It offers a variety of sophisticated visualization
techniques including cluster maps, heatmaps, and violin plots.
Plotly: It is a Python visualization toolkit that offers customizable and
interactive visualization. Several other chart formats such as scatter
plots, bar charts, line charts, and more, are supported.
Bokeh: It is a Python library for building interactive web browser
visualization. It handles streaming and real-time data and offers a
variety of visualization formats, such as scatter plots, line charts, and
bar charts.
Altair: With just a few lines of code, the user can build interactive
visualization with Altair, a declarative visualization toolkit for Python.
It enables them to effortlessly modify and alter their visualization and
it supports a variety of chart styles, such as scatter plots, bar charts,
and line charts.
We will discuss matplotlib in this chapter to give you a taste of it.
In order to install matplotlib, type the following code:
pip install matplotlib
We will first use the following statement:
import matplotlib.pyplot as myplt
A group of functions in Python called matplotlib.pyplot offers a high-level
interface for making different visualizations, such as line charts, scatter
plots, bar charts, histograms, and more.
Python code frequently uses import matplotlib.pyplot as myplt statement
to import the pyplot module from the matplotlib library and alias it as myplt.
In this chapter, we are using myplt as an alias while writing the code. This
enables calling functions from the pyplot module with shorter and more
convenient names. For example, calling myplt.show() instead of
matplotlib.pyplot.show().

So, matplotlib is a library in which pyplot is a module name. With the help
of the pyplot module, we can create plots such as plot, bar, pie, hist, and
scatter for creating line plots, bar charts, pie charts, histograms, and scatter
plots.

Line plot
A cornerstone of data visualization, the line plot in Matplotlib efficiently
connects data points with straight lines, facilitating the visualization of
trends over time and relationships between variables. Its significance
transcends mere visualization, extending to data exploration, communication
of insights, dataset comparison, model evaluation, and forecasting.
Renowned for its simplicity and clarity, the line plot stands as a pivotal
element in data visualization, empowering users to extract valuable insights
and make well-informed decisions.

Line plot creation by passing 2 ndarrays


In the code program name: Chap5_Example5.1.py shown below, we shall demonstrate line plot creation by passing 2 ndarrays:
# Line plot creation by passing 2 ndarrays
import matplotlib.pyplot as myplt
import numpy as mynp
#print(help(myplt.plot))
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis) #(1,1),(2,8),(3,27)
myplt.show()
The output is shown in Figure 5.1:

Figure 5.1: Line Plot creation by passing 2 ndarrays

Note: The preceding code is covered in (program name: Chap5_Example5.1.py).

Figure 5.1 is a distinct window on the screen that Matplotlib uses to display
the graphs, serving as the output container for graphics.
The code is using the matplotlib library to plot a graph. It first imports the
necessary libraries, matplotlib.pyplot and numpy. Then, it creates an array
of values for the x-axis using the arange() function from numpy, with values
ranging from one to ten. It also creates an array for the y-axis by taking the
cube of each value in the x-axis array. Next, it uses the plot function from
matplotlib.pyplot to create a line plot of the x-axis values against the y-axis
values. Comments (1,1),(2,8), and (3,27) indicate the corresponding
coordinates of the first few points on the plot. Finally, it uses the show
function from matplotlib.pyplot to display the plot.

Adding title, xlabel and ylabel to the line plot


We may add a title, such as Cube Function Line Plot, to the line plot using the myplt.title() function.
In order to get the information related to the title, type the following
command:
import matplotlib.pyplot as myplt
help(myplt.title)
The various parameters of the title() function can be studied from the following signature:
title(label, fontdict=None, loc='center', pad=None,
**kwargs)
Observe the following code for a better understanding:
# Line plot creation by passing 2 ndarrays with title, xlabel and ylabel
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis) # line plot is drawn
myplt.title('Cube Function Line Plot') # title is provided to the line plot
myplt.xlabel('X axis-Value -------') # information is described about x-axis data
myplt.ylabel('Cube of Y axis-Value -------') # information is described about y-axis data
myplt.show() # Display the line plot
The output is shown in Figure 5.2:

Figure 5.2: Addition of title, xlabel and ylabel properties to the line plot

Note: The preceding code is covered in (program name: Chap5_Example5.2.py).

Advanced line plot


A line drawn on a graph can be distinguished by its color, style, width,
transparency, and so on. They can be customized to meet our requirements.
Data points on the line plot can be highlighted using the marker property.
We must provide a keyword argument for the marker as shown as follows:
myplt.plot(x_axis,y_axis,marker='o')
Here, o stands for circle.
The details against each marker character can be seen in the following Table
5.1:
S No. Character Description

1 . point marker

2 , pixel marker

3 o circle marker

4 v triangle_down marker

5 ^ triangle_up marker

6 < triangle_left marker

7 > triangle_right marker

8 1 tri_down marker

9 2 tri_up marker

10 3 tri_left marker

11 4 tri_right marker

12 s square marker

13 p pentagon marker

14 * star marker

15 h hexagon1 marker

16 H hexagon2 marker

17 + plus marker

18 x x marker

19 D diamond marker

20 d thin_diamond marker

21 | vline marker

22 _ hline marker

Table 5.1: Marker description against each character


Observe the following code:
# Line plot creation by passing 2 ndarrays with title, xlabel, ylabel and marker
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,marker='o')
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.3:
Figure 5.3: Line plot creation using xlabel, ylabel and marker property

Note: The preceding code is covered in (program name: Chap5_Example5.3.py).

Linestyle property
The linestyle property is specified as solid, dotted, or dashed. The linestyle keyword argument is used as follows:
myplt.plot(x_axis,y_axis,marker='o',linestyle='--')
The details against each linestyle character can be seen in the following
Table 5.2:
S No. Character Description

1 - solid line style (default value)


2 -- dashed line style

3 -. dash-dot line style

4 : dotted line style

Table 5.2: Linestyle description against each character


Observe the following code for a better understanding:
# Line plot creation by passing 2 ndarrays with title, xlabel, ylabel, marker and linestyle
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,marker='o',linestyle='-.')
# dash-dot linestyle
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.4:
Figure 5.4: Line plot creation using xlabel, ylabel , marker and linestyle property

Note: The preceding code is covered in (program name: Chap5_Example5.4.py).

Color property
By utilizing the color keyword parameter, we can add colors to our plot and define the color we want for the line plot, using any named color or even a hex code.
The details against each color character can be seen in the following Table
5.3:
S No. Character Description

1 b blue

2 g green

3 r red

4 c cyan

5 m magenta

6 y yellow

7 k black

8 w white

Table 5.3: Color description against each character


The color keyword argument is used as follows:
myplt.plot(x_axis,y_axis,marker='o',linestyle='--',color='g')
Observe the following code for a better understanding:
# Line plot creation by passing 2 ndarrays with title, xlabel, ylabel, marker, linestyle and color
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,marker='o',linestyle='-.',
color='g')
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.5:
Figure 5.5: Line plot creation using xlabel, ylabel , marker, linestyle and color property

Note: The preceding code is covered in (program name: Chap5_Example5.5.py).

Default color
The default color will be selected from the style cycle if we do not specify a color. We can write the following statement to get the default colors:
myplt.rcParams['axes.prop_cycle'].by_key()
# code for checking the default color
import matplotlib.pyplot as myplt
print(myplt.rcParams['axes.prop_cycle'].by_key())
Output:
{'color': ['#1f77b4', '#ff7f0e', '#2ca02c',
'#d62728', '#9467bd', '#8c564b', '#e377c2',
'#7f7f7f', '#bcbd22', '#17becf']}

Note: The preceding code is covered in (program name: Chap5_Example5.6.py).

The first default color is blue, the second is orange, the third is green, and
the fourth color is red. Observe the following code for a better
understanding:
# Line plot creation by passing 2 ndarrays with
default color
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
myplt.plot(x_axis,x_axis) # blue
myplt.plot(x_axis,x_axis*2) # orange
myplt.plot(x_axis,x_axis*4) # green
myplt.plot(x_axis,x_axis*8) # red
myplt.title('Default color Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Display of various calculations-------')
myplt.show()
The output is shown in Figure 5.6:
Figure 5.6: Line plot creation using default color

Note: The preceding code is covered in (program name: Chap5_Example5.7.py).

Quick settings for color, marker and linestyle


The shortcut notation is specified using either mlc or clm where m stands for
the marker, l stands for linestyle, and c stands for color.
We should utilize the shortcodes b, g, y, k, c, and so on, for color in this
shortcut method.
Full color names such as red and yellow are not permitted in this shortcut form; only the single-character codes are allowed. Let us explore the code of the mlc and clm forms.

mlc form
MLC encompasses marker, linestyle, and color, defining the visual
characteristics of lines or points in Matplotlib plots. These attributes enable
users to personalize markers, line styles, and colors, facilitating the creation
of customized and insightful visualizations. Refer to the following code for a
better understanding:
# mlc form
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:g')# mlc form
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.7:
Figure 5.7: Line plot creation using mlc form

Note: The preceding code is covered in (program name: Chap5_Example5.8.py).

clm form
CLM stands for color, linestyle, and marker, representing essential visual
characteristics adjustable in Matplotlib for customizing plot appearance.
These attributes enable users to define colors, line styles, and markers,
facilitating the creation of tailored visualizations. Refer to the following
code for a better understanding:
# clm form
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'g:o')# clm form
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.8:
Figure 5.8: Line plot creation using clm form

Note: The preceding code is covered in (program name: Chap5_Example5.9.py).

When unspecified, the default color is blue


When no color is specified, the default color will be blue as shown in the
following code:
# mlc form
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:')# mlc form
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.9:
Figure 5.9: Line plot creation using default color when unspecified in mlc form

Note: The preceding code is covered in (program name: Chap5_Example5.10.py).

Alpha property
The graph plot transparency is adjusted using the alpha attribute, which denotes the opacity or transparency of the color. The value is between 0.0 and 1.0.
If alpha < 1, the graph plot becomes more transparent. Let us view the code
of line plot creation using mlc form with the alpha property as demonstrated
in program name: Chap5_Example5.11.py as follows:
# mlc form with alpha property
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:', alpha = 0.2) # mlc form with alpha = 0.2
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.10:
Figure 5.10: Line plot creation using mlc form with alpha property

Note: The preceding code is covered in (program name: Chap5_Example5.11.py).

Linewidth and markersize property


The linewidth property denotes the thickness/width of the line in a plot and can be set using the lw shorthand.
The markersize property denotes the marker size in a plot and can be set using the ms shorthand. Let us view the code of line plot creation using mlc form with
linewidth and markersize property as demonstrated in program name:
Chap5_Example5.12.py as follows:

# mlc form with lw and ms


import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:g', lw=5, ms = 10)
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.11:
Figure 5.11: Line plot creation using mlc form with lw and ms property

Note: The preceding code is covered in (program name: Chap5_Example5.12.py).

Markerfacecolor property
This property will set the face color of markers in plots. Let us view the code
of line plot creation using mfc property as demonstrated in code program
name: Chap5_Example5.13.py as follows:
# mfc property
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:g', lw=5, ms = 10,
mfc='yellow')
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.12:
Figure 5.12: Line plot creation using mfc property

Note: The preceding code is covered in (program name: Chap5_Example5.13.py).

Customizing the figure size


In order to display all default settings in matplotlib, we can use the following
code:
print(myplt.rcParams)
print(myplt.rcParams.get('figure.figsize'))
Observe the following code:
# Getting the default figure size
import matplotlib.pyplot as myplt
print(myplt.rcParams.get('figure.figsize'))
print(myplt.rcParams['figure.figsize'])
Output:
[6.4, 4.8]
[6.4, 4.8]

Note: The preceding code is covered in (program name: Chap5_Example5.14.py).

In matplotlib, the figure size is specified as a tuple of width and height in inches. By default, the figure size is set to [6.4, 4.8] inches. We can
customize the size of the figure by setting the figsize parameter by using
the myplt.figure() function.
A new figure can be created, or an existing figure can be activated, by using the myplt.figure() function, whose important parameters are as follows:
num: It can be either int or str and specifies a unique identifier for the
figure.
figsize: It is a tuple that specifies the width and height of the figure in
inches.
Now, we shall be customizing the figure size for line plot creation in the
code program name: Chap5_Example5.15.py as follows:
import matplotlib.pyplot as myplt
import numpy as mynp
myplt.figure(num=1,figsize=(8,4),facecolor='green')
myndarray = mynp.arange(1,6)
myplt.plot(myndarray,myndarray,'o-r')
myplt.show()
The output is shown in Figure 5.13:

Figure 5.13: Line plot creation by customizing the figure size


Note: The preceding code is covered in (program name:
Chap5_Example5.15.py).

Observe the following line of code:


myplt.figure(num=1,figsize=(8,4),facecolor='green')
myplt.figure: This is a function from the matplotlib library which
creates a new figure.
num=1: This parameter sets the number of the figure. It is optional
and defaults to one if not specified.
figsize=(8,4): This parameter sets the width and height of the figure
in inches. In this case, the width is eight inches and the height is four
inches.
facecolor='green': This parameter sets the background color of the
figure. In this case, the background color is set to green. Overall, the
above line of code creates a figure with the number one, a size of 8x4
inches, and a green background color.
Now, we shall be customizing the figure with equal size for line plot creation
in the code program name: Chap5_Example5.16.py as follows:
import matplotlib.pyplot as myplt
import numpy as mynp
myplt.figure(figsize=(4,4),facecolor='green')
myndarray = mynp.arange(1,6)
myplt.plot(myndarray,myndarray,'o-r')
myplt.show()
The output is shown in Figure 5.14:
Figure 5.14: Output of Chap5_Example5.16.py

Note: The preceding code is covered in (program name: Chap5_Example5.16.py).

Plotting multiple lines in the same plot


Let us see how to plot multiple lines in the same plot using the matplotlib library, which shall be demonstrated in the code program name: Chap5_Example5.17.py as follows:
# plotting multiple lines in the same plot
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis1 = x_axis
y_axis2 = x_axis**2
y_axis3 = x_axis**3
myplt.plot(x_axis,y_axis1,'o:r')
myplt.plot(x_axis,y_axis2,'o:g')
myplt.plot(x_axis,y_axis3,'o:b')
# myplt.plot(x_axis,y_axis1,'o:r',x_axis,y_axis2,'o:g',x_axis,y_axis3,'o:b') # plotting 3 lines using a single plot() function
myplt.title('Plotting multiple lines in the same plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.15:
Figure 5.15: Line plot creation displaying multiple lines in the same plot

Note: The preceding code is covered in (program name: Chap5_Example5.17.py).

Now, let us explain the code for better understanding:


First, the code imports the necessary libraries, matplotlib.pyplot and
numpy.

It creates an array of values for the x-axis using the numpy arange()
function, with values ranging from one to ten.
It creates three arrays for the y-axis, y_axis1, y_axis2, and y_axis3.
These arrays are generated by performing mathematical operations on
the x_axis array.
The code uses the plot() function from matplotlib.pyplot to plot the
three lines on the same plot. The first plot() call plots y_axis1 against
x_axis, using a red colored line with circle markers.

The second plot() call plots y_axis2 against x_axis, using a green
colored line with circle markers.
The third plot() call plots y_axis3 against x_axis, using a blue
colored line with circle markers. We can also plot three different lines
in a single plot function. Just uncomment it and observe the same
output.
The code sets a title for the plot using the title() function.
It sets labels for the x-axis and y-axis using the xlabel() and ylabel()
functions, respectively.
Finally, the show() function is called to display the plot.

Note: If we apply the linewidth (lw) property in the plot function, it will
be applicable for all three lines.
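As a small illustrative sketch of this note (reusing the same arrays as the preceding example), passing lw once in the combined plot() call applies it to all three lines:
# A minimal sketch: one plot() call drawing three lines; lw=3 applies to all of them
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,11)
y_axis1 = x_axis
y_axis2 = x_axis**2
y_axis3 = x_axis**3
# The keyword arguments at the end (lw, linestyle) apply to every line created by this call
myplt.plot(x_axis,y_axis1,x_axis,y_axis2,x_axis,y_axis3, lw=3, linestyle=':')
myplt.show()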

Line plot creation by passing a single ndarray


A plot can be created using a single ndarray. Consider the following line of code:
myplt.plot(myarr1)
Here, myarr1 provides the y-axis values. If N is the number of data points in myarr1, matplotlib automatically generates the x-axis values from 0 to N-1. We shall demonstrate the code program name: Chap5_Example5.18.py below for line plot creation by passing a single ndarray:
import matplotlib.pyplot as myplt
import numpy as mynp
myarr1 = mynp.array([20,40,60,80,100])
myplt.figure(figsize=(7,4),facecolor='pink')
myplt.plot(myarr1,'o:r')
#print(help(myplt.title))
myplt.title('Line Plot creation by passing a single ndarray',
fontfamily='serif',
fontsize=18,
fontweight='bold',
loc='center',
pad=10,
color='white',
backgroundcolor='green')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Y axis-Value -------')
myplt.show()
The output is shown in Figure 5.16:

Figure 5.16: Line plot creation by passing a single ndarray

Note: The preceding code is covered in (program name: Chap5_Example5.18.py).

In this example, the title() function is called with the following parameters:
fontfamily: It specifies the font family of the title (in this case,
serif).
fontsize: It specifies the font size of the title (in this case, 18).
fontweight: It specifies the font weight of the title (in this case,
bold).
loc: It specifies the location of the title (in this case, center).
pad: It specifies the padding between the title and the plot (in this
case, 10).
color: It specifies the font color of the title (in this case, white)
backgroundcolor: It specifies the background color of the title (in
this case, green)
We can even customize xlabel and ylabel which we are leaving for you to
practice. Just type print(help(myplt.xlabel)) and
print(help(myplt.ylabel)) and play with their parameters.
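As a starting point for that practice, a minimal sketch of a few commonly used xlabel and ylabel parameters is shown below; the specific values chosen here (font size, colors, and labelpad) are only illustrative:
# A minimal sketch of customizing xlabel and ylabel (illustrative values only)
import matplotlib.pyplot as myplt
import numpy as mynp
myarr1 = mynp.array([20,40,60,80,100])
myplt.plot(myarr1,'o:r')
# fontsize, color and labelpad are accepted by xlabel() and ylabel(), similar to title()
myplt.xlabel('X axis-Value -------', fontsize=14, color='blue', labelpad=8)
myplt.ylabel('Y axis-Value -------', fontsize=14, color='green', labelpad=8)
myplt.show()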

Addition of grid lines to a plot


The grid lines can be added to the plot using myplt.grid() function. Just
type print(help(myplt.grid)) to view the parameters of the above function.
The grid may or may not be displayed in the following cases, which are explained in the following Table 5.4. For each case, the last column states whether the grid lines are displayed in the line plot.

For example, in Case No. 1 of the table, the code myplt.grid() is marked as Yes, which means that using this code the grid lines will be displayed in the line plot.
Case No. Code Are grid lines displayed in the line plot?

1 myplt.grid() Yes

2 myplt.grid() followed by myplt.grid() No

3 myplt.grid() followed by myplt.grid(color='y') Yes (since keyword arguments are provided)

4 myplt.grid(visible=True) Yes

5 myplt.grid(visible=False) No

Table 5.4: Display of grid lines case study in the plot


Now, we shall demonstrate a code program name: Chap5_Example5.19.py
below for the display of grid lines using various cases:
import matplotlib.pyplot as myplt
import numpy as mynp
#print(help(myplt.grid))
myarr1 = mynp.array([20,40,60,80,100])

# Create a figure with subplots


myfig, myaxs = myplt.subplots(2, 2)

# displaying case-1 of grid lines visibility


myaxs[0, 0].plot(myarr1,'o:r')
myaxs[0, 0].set_title('Subplot 0-0')
myaxs[0, 0].grid()

# displaying case-2 of grid lines visibility


myaxs[0, 1].plot(myarr1,'o:r')
myaxs[0, 1].set_title('Subplot 0-1')
myaxs[0, 1].grid()
myaxs[0, 1].grid()

# displaying case-3 of grid lines visibility


myaxs[1, 0].plot(myarr1,'o:r')
myaxs[1, 0].set_title('Subplot 1-0')
myaxs[1, 0].grid()
myaxs[1, 0].grid(color='y')

# displaying case-4 of grid lines visibility


myaxs[1, 1].plot(myarr1,'o:r')
myaxs[1, 1].set_title('Subplot 1-1')
myaxs[1, 1].grid(visible = True)

# Adjust the spacing between subplots


myplt.tight_layout()
myplt.show()
The output is shown in Figure 5.17:
Figure 5.17: Line plot creation with grid lines using various cases

Note: The preceding code is covered in (program name: Chap5_Example5.19.py).

Now let us understand the code and how the grid lines are demonstrated:
The code uses the matplotlib library to create a figure with subplots
and display different cases of grid line visibility on each subplot.
The code imports the matplotlib.pyplot module as myplt and the numpy module as mynp. It creates an array called myarr1 with values [20, 40, 60, 80, 100].
It creates a figure with two rows and two columns of subplots using
the subplots() function and assigns the figure object to myfig and the
axes objects to myaxs.
The first subplot (0-0) is plotted with the values of myarr1 using red circles connected by a dotted line. The title is set to Subplot 0-0 and the grid lines are displayed.
The second subplot (0-1) is plotted and styled the same way as the
first subplot. The title is set to Subplot 0-1 and the grid lines are not
displayed here.
The third subplot (1-0) is plotted and styled the same way as the first
subplot. The title is set to Subplot 1-0 and the grid lines are displayed.
The color of the grid lines is set to yellow.
The fourth subplot (1-1) is plotted and styled the same way as the first
subplot. The title is set to Subplot 1-1 and the grid lines are displayed.
The spacing between the subplots is adjusted to avoid overlapping
elements using the tight_layout() function.
The figure with the subplots is displayed using the show() function.

Major and minor grid lines display


Major and minor grid lines can be included in a line plot to improve the
visual representation of the data. The thicker lines extending across the plot
represent major grid lines. They are used to highlight important intervals or
divisions in the plot. Major ticks on the axes typically match major grid
lines. The thinner lines that are drawn in between the major grid lines are
known as minor grid lines. They offer further direction and aid in locating
plot intervals or smaller divisions. Minor ticks on the axes typically match
minor grid lines. The which parameter of the grid() function controls which grid lines are displayed; the allowed values are major, minor, or both, and the default value is major. Minor ticks can be displayed on the axes using
minorticks_on() function but displaying these minor ticks may reduce
performance. Now we shall demonstrate the code program name:
Chap5_Example5.20.py below for display of major and minor grid lines:

import matplotlib.pyplot as myplt


import numpy as mynp
myarr1 = mynp.array([20,40,60,80,100])
myplt.plot(myarr1,'o-.g',lw=5,markersize=5,
mfc='magenta')
myplt.grid(color='r',lw=2)
myplt.minorticks_on()
myplt.grid(which='minor', color='y')
myplt.show()
The output is shown in Figure 5.18:

Figure 5.18: Line plot creation using major and minor grid lines

Note: The preceding code is covered in (program name: Chap5_Example5.20.py).

Now, let us understand the code and how the major and minor grid lines are
plotted:
The code imports the matplotlib.pyplot module as myplt and the
numpy module as mynp.

It then creates a numpy array myarr1 with values [20, 40, 60, 80, 100].
The code then plots myarr1 using the plot() function from myplt.
The plot is a line plot with markers represented by green circles connected by a dash-dot line.
The line width is set to five and the marker size is set to five.
The marker face color is set to magenta.
The code then adds a grid to the plot using the grid() function from
myplt.
The grid lines are colored red and have a line width of two.
The code enables minor ticks on the plot using the minorticks_on()
function from myplt.
The code then adds a minor grid to the plot using the grid() function
again.
The minor grid lines are colored yellow.
Finally, the code displays the plot using the show() function from
myplt.

Legends display to a plot


It is crucial to add a legend to a line plot in matplotlib because it offers a
visual framework for understanding the data represented in the plot. The
legend aids in identifying various lines or markers used in the plot and links
them with specific data or categories. It is made simpler for viewers to
correctly understand the plot by labeling each line or marker with a
descriptive text in the legend. It facilitates the communication of information
on variables, units, or any other pertinent aspects of the data. When working
with complex plots that contain numerous datasets or variables, the legend
also makes it easier to compare different lines or markers in the plot. It
makes it possible to understand and analyze the plotted data more clearly.
The legend function has three different call signatures as explained as
follows:
legend(): This call signature recognizes the labeled plot elements
automatically and generates a legend using the labels that have been
assigned to those elements. When we want to display a legend with
default labels and the plot contains multiple lines or markers, it is
quite helpful.
legend(labels): We can specify custom labels for the plot elements using this call signature. We can assign specific names to each line or marker in the legend by passing a list of labels as an
argument. When there is a need to provide descriptive or meaningful
labels to the elements, this is advantageous.
legend(handles,labels): This call signature provides the
most flexibility as it allows the user to explicitly specify both the
handles (objects representing the lines or markers) and the
corresponding labels for the legend. By passing a list of handles and a
list of labels, we can precisely control the mapping between plot
elements and their labels in the legend.
Let us demonstrate the code program name: Chap5_Example5.21.py below
for the display of legends to a line plot:
import matplotlib.pyplot as myplt
import numpy as mynp
myarr1 = mynp.linspace(0, 2*mynp.pi,100)
# Create a figure with subplots
myfig, myaxs = myplt.subplots(3, 1)

myaxs[0].plot(myarr1, myarr1**2, label='MySquare')


myaxs[0].plot(myarr1, mynp.sin(myarr1),
label='MySine')
myaxs[0].plot(myarr1, mynp.cos(myarr1),
label='MyCos')
myaxs[0].legend()
myaxs[1].plot(myarr1, myarr1**2)
myaxs[1].plot(myarr1, mynp.sin(myarr1))
myaxs[1].plot(myarr1, mynp.cos(myarr1))
myaxs[1].legend(['MySquare','MySine','MyCos'])

# Create a list of handles and labels


myline1, = myplt.plot(myarr1, myarr1**2)
myline2, = myplt.plot(myarr1, mynp.sin(myarr1))
myline3, = myplt.plot(myarr1, mynp.cos(myarr1))
myhandles = [myline1,myline2,myline3]
mylabels = ['MySquare','MySine','MyCos']

# Add a legend with handles and labels


myaxs[2].legend(myhandles, mylabels)
myplt.show()
The output is shown in Figure 5.19:

Figure 5.19: Line plot creation with display of legends using different approaches

Note: The preceding code is covered in (program name: Chap5_Example5.21.py).

The code uses the matplotlib library to create a figure with three subplots. In
the first subplot, it plots the square, sine, and cosine functions of an array of
values and adds a legend to this subplot using the labels assigned in the plot() calls. In the second subplot, it plots the same functions and adds a legend by passing a list of labels. In the third subplot, it adds a legend by explicitly passing handles and labels for the functions. Finally, it displays the figure.
The legend location can also be decided in the plot by using loc argument.
The code used for different locations is displayed as follows:
S No. String location Code used

1 'best' 0

2 'upper right' 1

3 'upper left' 2

4 'lower left' 3

5 'lower right' 4
6 'right' 5

7 'center left' 6

8 'center right' 7

9 'lower center' 8

10 'upper center' 9

Table 5.5: Legend position display in the plot


The number of columns in the legend can be customized by using the ncol
argument. By, default the number of columns in the legend is one. Also, the
title can be added to the legend explicitly by using the title keyword
argument as demonstrated in the code program name: Chap5_Example5.22.py
as follows:
import matplotlib.pyplot as myplt
import numpy as mynp
myarr1 = mynp.linspace(0, 2*mynp.pi,100)

myplt.plot(myarr1, myarr1**2, label='MySquare')


myplt.plot(myarr1, mynp.sin(myarr1),
label='MySine')
myplt.plot(myarr1, mynp.cos(myarr1), label='MyCos')
myplt.legend(title = "3 different plots", loc=6,
ncol=3)
myplt.show()
The output is shown in Figure 5.20:
Figure 5.20: Line plot creation by addition of legend using title and ncol parameters

Note: The preceding code is covered in (program name: Chap5_Example5.22.py).

Observe the following line of code which is demonstrated in the example:


myplt.legend(title = "3 different plots", loc=6,
ncol=3)
The given code is used to add a legend to a plot in matplotlib. The legend
will have a title for three different plots and will be positioned at location six
on the plot. The legend will have three columns.
Using xlim and ylim functions to a plot
In order to specify the boundaries of the x-axis and y-axis in a plot, use the
matplotlib xlim and ylim functions. They give us the option to specify the
range of values that will be shown on each axis. The xlim is a function that
will set the limits of the x-axis in a plot. The left and right arguments of the
xlim function will specify the lower and upper bounds of the x-axis,
respectively. If no arguments are passed to the xlim function, it acts as a getter function; if arguments are passed, it acts as a setter function. The ylim is a function that sets the limits of the y-axis in a plot. The bottom and top arguments of the ylim function specify the lower and upper bounds of the y-axis, respectively. Similarly, if no arguments are passed to the ylim function, it acts as a getter function; if arguments are passed, it acts as a setter function.
Let us view the default values of left, right, top, and bottom using the xlim
and ylim functions.
Just add the following lines of code in Chap5_Example5.22.py before myplt.show():

myleft,myright = myplt.xlim()
mybottom, mytop = myplt.ylim()
print('The Left limit value on the x-axis:',myleft)
print('The Right limit value on the x-axis:',myright)
print('The Bottom limit value on the y-axis:',mybottom)
print('The Top limit value on the y-axis:',mytop)
Output:
The Left limit value on the x-axis: -0.3141592653589793

The Right limit value on the x-axis: 6.5973445725385655

The Bottom limit value on the y-axis: -3.0237887142754407


The Top limit value on the y-axis: 41.502332190958995

Note: The complete code is covered in (program name: Chap5_Example5.23.py).

Let us see an example of a change of values of left, right, top, and bottom
using xlim and ylim functions:
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(1,16)
y_axis = x_axis**3
myplt.plot(x_axis,y_axis,'o:r', alpha = 0.6) # mlc form with alpha = 0.6
myplt.title('Cube Function Line Plot')
myplt.xlabel('X axis-Value -------')
myplt.ylabel('Cube of Y axis-Value -------')
myplt.grid()
myleft,myright = myplt.xlim()
mybottom, mytop = myplt.ylim()
print('The default Left limit value on the x-axis:',myleft)
print('The default Right limit value on the x-axis:',myright)
print('The default Bottom limit value on the y-axis:',mybottom)
print('The default Top limit value on the y-axis:',mytop)
# setting the new limit values
myplt.xlim(left = 0.7, right = 10)
myplt.ylim(bottom = 20, top = 3000)
myplt.show()
The output is shown in Figure 5.21:

Figure 5.21: Line plot creation using xlim and ylim functions

The printed output showing the default limits is as follows:

The default Left limit value on the x-axis: 0.29999999999999993
The default Right limit value on the x-axis: 15.7
The default Bottom limit value on the y-axis: 167.70000000000002
The default Top limit value on the y-axis: 3543.7
Note: The preceding code is covered in (program name:
Chap5_Example5.24.py).

There is something called linear scaling, where there is a fixed difference between any two consecutive points on the given axis. Here, the data set values are spread over a small range. If the data set values are spread over a big range, then we shall go for logarithmic scaling. Here, there is no fixed difference between any two consecutive points on the given axis; successive points are in multiples of ten.
We will be using myplt.xscale() and myplt.yscale() functions to set the
scale for the x-axis and y-axis. These functions will allow us to control how
the data is displayed along each axis, by changing the scale of the axis as
demonstrated in the code program name: Chap5_Example5.25.py as follows:
import matplotlib.pyplot as myplt
import numpy as mynp
x_axis = mynp.arange(100000)
y_axis = mynp.arange(100000)

myplt.plot(x_axis,y_axis)
myplt.grid()
myplt.xscale('linear')
myplt.yscale('log')

myplt.title('Plot of Linear vs Log')


myplt.xlabel('Linear -------')
myplt.ylabel('Log -------')
myplt.show()
The output is shown in Figure 5.22:
Figure 5.22: Line plot creation using log and linear scale

Note: The preceding code is covered in (program name: Chap5_Example5.25.py).

Let us see an example of a line plot using the object-oriented programming (OOP) approach:
import matplotlib.pyplot as myplt
import numpy as mynp
my_x = mynp.arange(1,21)
my_y = my_x **3
# figure object creation ------------------------------
#print(help(myplt.figure))
myfig = myplt.figure()

# axis object creation ---------------------------
my_axes = myfig.add_subplot(111) # 1 row, 1 column, plot number 1

# graph plotting ---------------------------


my_axes.plot(my_x, my_y)

# axes properties setting ---------------------------
my_axes.set(xlabel='X axis values', ylabel="Y axis values", title='Cubic Plot')
my_axes.grid()
myplt.show()
The output is shown in Figure 5.23:

Figure 5.23: Line plot creation using OOP approach

Note: The preceding code is covered in (program name: Chap5_Example5.26.py).

The provided code uses the matplotlib library in Python to create a plot of
the cubic function. It generates an array of x-values from one to 20 using
numpyʼs arange function and calculates the corresponding y-values by
raising each x-value to the power of three. The code then creates a figure
object, an axis object, and plots the graph using the plot function. It sets the
properties of the axes, such as labels and title, and enables the grid. Finally,
it displays the plot.

Conclusion
In this chapter, we have learned about a wide spectrum of Python data
visualization tools and techniques that empower us to bring data to life. We
began our journey with a comprehensive understanding of line plot creation
using numpy. Then we explored the basics of creating line plots by passing
two ndarrays, the chapter progresses to advanced techniques such as
customizing line styles, colors, and markers. Readers will learn to add
essential elements to their plots, including titles, axis labels, and legends,
enhancing the interpretability of their visualizations. The chapter explores
properties like alpha, linewidth, markersize, and markerfacecolor for finer
control over plot appearance. The inclusion of techniques for customizing
figure size, plotting multiple lines in the same plot, and incorporating grid
lines adds depth to the readerʼs repertoire. Additionally, the usage of the xlim and ylim functions will allow readers to focus on specific data ranges. The chapter
culminates in a practical overview of shortcuts for setting color, marker, and
linestyle properties efficiently. Overall, this chapter equips beginners with
the foundational knowledge needed to create and customize line plots for
data visualization.
In the next chapter, we shall learn about bar plots, pie charts, histograms,
scatter plots and subplots in matplotlib which is essential for effective data
visualization across different types of data and analysis tasks. Bar plots are
useful for comparing categorical data, pie charts for displaying proportions
or percentages, histograms for visualizing distributions of continuous
variables, scatter plots for exploring relationships between two continuous
variables and subplots for organizing multiple plots within a single figure.

Points to remember
Line plots visualize trends in continuous data, and the user can create
them using matplotlib.
Pass data as two ndarrays or single ndarrays to create line plots.
Customize line plots by adding titles, x-labels, and y-labels using
functions like myplt.title(), myplt.xlabel(), and myplt.ylabel().
Advanced line plots involve setting properties like linestyle, color,
and alpha to control line appearance.
You can change the default line color using the color parameter.
Set color, marker, and linestyle using shortcut codes.
Adjust line thickness with linewidth and marker size with markersize.
Control marker face color using markerfacecolor.
Customize the figure size using myplt.figure().
Plot multiple lines in the same plot using multiple myplt.plot()
commands.
Add grid lines with myplt.grid() and specify major and minor grid
lines with myplt.grid(which='both').
Display legends for labeled lines with myplt.legend().
Use myplt.xlim() and myplt.ylim() to set specific axis limits.

Questions
1. What are some popular Python data visualization tools?
2. How can you create line plots in Python?
3. What properties can be customized in an advanced line plot?
4. How do you add a title, xlabel, and ylabel to a line plot?
5. What is the purpose of the linestyle property in a line plot?
6. How can you change the color of lines in a line plot?
7. What is the default color for lines in a Python line plot?
8. How can you set color, marker, and linestyle in a shortcut way for a
line plot?
9. What does the alpha property control in a line plot?
10. How do you customize the figure size in a line plot?

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 6
Charting Data With Various
Visuals Using Matplotlib

Introduction
In the previous chapter, we have witnessed line plots. Now, in this chapter
on data visualization by using matplotlib, various types of plots will be
explored, starting with the vertical bar chart or simple bar chart. The
discussion will include customization options such as changing the color,
width, bottom position, and alignment of each bar, along with insights into
adding labels. The chapter also explores horizontal, stacked, and grouped
bar charts. The exploration will extend to pie charts for representing
proportions, histograms for visualizing data distribution, and scatter plots
for depicting relationships between variables. Subplots in matplotlib offer a
valuable tool for creating comprehensive visualizations with multiple plots
in a single figure, enhancing the clarity and coherence of the presented
information. Overall, the diverse set of topics fulfills the varied
requirements of data analysts and scientists in effectively communicating
and interpreting data. Each plot type will be given a comprehensive overview, ensuring a thorough understanding of the capabilities and customization options available for effective data representation and analysis.

Structure
In this chapter, we will discuss the following topics:
Barplot
Vertical bar chart and simple bar chart
Horizontal bar chart
Stacked bar chart
Pie chart
Histogram
Scatter plot
Subplot in matplotlib

Objectives
By the end of this chapter, the readers will know about Bar plots where
various aspects such as changing the color, width, bottom position, left
alignment, and right alignment of individual bars as well as adding labels to
bars will be explained. We will also venture into horizontal, stacked (both
vertical and horizontal), and grouped bar charts, providing us with a
comprehensive understanding of bar chart variations. Additionally, we will
master the creation of pie charts, histograms, scatter plots, and subplots in matplotlib, equipping the reader with a versatile toolkit for data visualization and analysis in Python.

Barplot
An example of a visualization for categorical data is a bar plot, commonly
referred to as a bar chart. Rectangular bars are used to represent the data,
and each barʼs length or height reflects the value it stands for. The
myplt.bar() function of the well-known Python data visualization toolkit
matplotlib can be used to produce a bar plot. The parameters required to
define the positions and heights of the bars are passed into this function. So,
in a barplot, the data is represented in the form of bars, unlike the line plot
where the data points are represented by markers that are connected by a
line.
In order to create a bar plot by using matplotlib, follow the steps given as
follows:
1. Import: The required libraries need to be imported.
2. Prepare the data: We need to have the data that we want to
visualize in the form of lists or arrays.
3. Create the bar plot: For creating the bar plot, use the myplt.bar()
function. The x-coordinates (positions) of the bars and the
corresponding heights are the parameters for this function. By
specifying more parameters, the plotʼs look can also be customized.
4. The plot can be customized which is optional: We can add labels,
and titles, change colors, and modify other aspects of the plot by
using various functions provided by matplotlib.
5. Save or display the plot: Finally, use the myplt.savefig() function to save the plot to a file or the myplt.show() function to display it, as shown in the sketch after this list.
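The following is a minimal sketch of step 5, saving a bar plot to a file with savefig instead of displaying it; the file name mybar_chart.png and the sample data are hypothetical choices made only for illustration:
# A minimal sketch of saving a bar plot to a file (illustrative data and file name)
import matplotlib.pyplot as myplt
my_items = ['A','B','C'] # hypothetical categories
my_values = [10,25,15] # hypothetical heights
myplt.bar(my_items,my_values)
myplt.title('Saving a bar chart to a file')
# dpi controls the resolution and bbox_inches='tight' trims extra whitespace
myplt.savefig('mybar_chart.png', dpi=150, bbox_inches='tight')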
Let us discuss the different types of bar charts with examples.

Vertical bar chart and simple bar chart


Vertical bars will be used to represent the data. Here, an individual category
is represented by each vertical bar. The barʼs height or length is determined by the value it stands for.
The width of the bar is typically fixed; however, we can customize it. The
default width here is 0.8. By using the bar() function, a bar chart can be
created, whose syntax is as follows:
bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
Refer to the following code program name: Chap6_Example6.1.py for a better understanding:

import matplotlib.pyplot as myplt


#print(help(myplt.bar))
my_cricketers = ['Sachin','Virat','Ricky','Sangakarra','Jacques'] # x-axis values
my_centuries = [100,76,71,63,62] # height of bars, y-axis values
myplt.bar(my_cricketers,my_centuries)
myplt.xlabel('Cricketer Name',color='b',fontsize=15)
myplt.ylabel('Number of Centuries',color='g',fontsize=15)
myplt.title('Cricketer wise number of centuries',color='r',fontsize=15)
myplt.show()
The output is shown in Figure 6.1:
Figure 6.1: Output of Chap6_Example6.1.py

Note: The preceding code is covered in (program name: Chap6_Example6.1.py).

The code uses the matplotlib library to create a bar chart that represents the
number of centuries scored by different cricketers. We shall now explain the
code line by line:
The code imports the matplotlib.pyplot module as myplt.
The my_cricketers list contains the names of the cricketers, which
will be displayed on the x-axis of the bar chart.
The my_centuries list contains the number of centuries scored by
each cricketer, which will be represented by the height of the bars on
the y-axis.
The myplt.bar() function is called with the my_cricketers and
my_centuries lists as arguments to create the bar chart.

The myplt.xlabel() function is used to set the label for the x-axis,
with the color set to blue and the font size set to 15.
The myplt.ylabel() function is used to set the label for the y-axis,
with the color set to green and the font size set to 15.
The myplt.title() function is used to set the title of the bar chart,
with the color set to red and the font size set to 15.
The myplt.show() function is called to display the bar chart.
We can customize several elements in the bar plot which shall be explained
with some examples:

Color change of each bar


Let us look at an example for different color values of each bar which is
mentioned in the code program name: Chap6_Example6.2.py as follows:
import matplotlib.pyplot as myplt
my_cricketers = ['Sachin','Virat','Ricky','Sangakarra','Jacques'] # x-axis values
my_centuries = [100,76,71,63,62] # height of bars, y-axis values
mycolorlist = ['r','b','g','orange','k']
myplt.bar(my_cricketers,my_centuries,color =
mycolorlist)
myplt.xlabel('Cricketer
Name',color='b',fontsize=15)
myplt.ylabel('Number of
Centuries',color='g',fontsize=15)
myplt.title('Cricketer wise number of
centuries',color='r',fontsize=15)
myplt.show()
The output is shown in Figure 6.2:

Figure 6.2: Output of Chap6_Example6.2.py

Note: The preceding code is covered in (program name:


Chap6_Example6.2.py).

In the preceding code, we will be focusing on the two lines with the
explanation as follows:
mycolorlist = ['r','b','g','orange','k']: This line
creates a list called mycolorlist which contains the colors to be used
for the bars in the plot. The colors are represented by their respective
abbreviations: r for red, b for blue, g for green, orange for orange, and
k for black.

myplt.bar(my_cricketers, my_centuries, color =


mycolorlist): This line creates the bar plot by using the bar
function from the matplotlib library. It takes the following three
arguments:
my_cricketers: This is the list of cricketersʼ names to be
plotted on the x-axis.
my_centuries: This is the list of the number of centuries
scored by each cricketer, which will determine the height of the
bars.
color = mycolorlist: This specifies the colors to be used
for the bars. Each bar will be assigned a color from the
mycolorlist based on its position in the list.
We can see from the graph that the default alignment is the center for each
bar.

Width change of each bar


Let us look at an example for different width values of each bar which is
mentioned in the code program name: Chap6_Example6.3.py as follows:
import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
mywidthlist = [0.9, 0.8, 0.7, 0.6, 0.5]
myplt.bar(my_cricketers, my_centuries, width=mywidthlist)
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.show()
The output is shown in Figure 6.3:

Figure 6.3: Output of Chap6_Example6.3.py

Note: The preceding code is covered in (program name:


Chap6_Example6.3.py).
In the preceding code, we will be focusing on two lines whose explanation
is given as follows:
The mywidthlist list is defined with values [0.9, 0.8, 0.7, 0.6, 0.5].
This list determines the width of each bar in the bar chart.
myplt.bar(my_cricketers, my_centuries, width =
mywidthlist): This myplt.bar() function is called with three
parameters: my_cricketers, my_centuries, and width.
my_cricketers is a list containing the names of the cricketers.
my_centuries is a list containing the number of centuries scored by
each cricketer.
width is set to mywidthlist, which determines the width of each
bar in the chart.
The myplt.bar() function creates the bar chart using the provided
parameters. The height of each bar represents the number of centuries
scored by each cricketer, and the width is determined by the values in
mywidthlist.

Bottom change of each bar


We shall look at an example with different bottom values for each bar, which is mentioned in the code program name: Chap6_Example6.4.py as follows:
import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
mybottomlist = [40, 30, 20, 10, 0]
myplt.bar(my_cricketers, my_centuries, bottom=mybottomlist)
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.show()
The output is shown in Figure 6.4:

Figure 6.4: Output of Chap6_Example6.4.py


Note: The preceding code is covered in (program name:
Chap6_Example6.4.py).

In the preceding code, we will be focusing on the two lines with an explanation as follows:
The variable mybottomlist is created and assigned a list of values [40, 30, 20, 10, 0].
myplt.bar(my_cricketers, my_centuries, bottom = mybottomlist): The code then calls the bar function from the matplotlib library by passing in the variables my_cricketers and my_centuries as the data to be plotted. Here, the bottom parameter is set to mybottomlist, which means the bars will start at the values specified in the list.

Left alignment of each bar


Now, let us view an example of the left alignment of each bar which is
mentioned in the code program name: Chap6_Example6.5.py as
follows:
import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
myplt.bar(my_cricketers, my_centuries, align='edge')  # left alignment
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.show()
The output is shown in Figure 6.5:

Figure 6.5: Output of Chap6_Example6.5.py

Note: The preceding code is covered in (program name:


Chap6_Example6.5.py).

In the line of code myplt.bar(my_cricketers, my_centuries, align='edge'), the align parameter is set to edge, which means the bars will be left-aligned with the x-axis tick marks.

Right alignment of each bar


If we add width = - 0.8 to the line of code mentioned above, then the
bars will be right-aligned with the x-axis tick marks. Just observe the code
program name: Chap6_Example6.6.py as follows which demonstrates the
usage of right alignment:
import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
mycolorlist = ['r', 'b', 'g', 'pink', 'k']
myplt.bar(my_cricketers, my_centuries, width=-0.8, align='edge', color=mycolorlist)  # right alignment
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.xticks(my_cricketers, rotation=30)
myplt.tight_layout()
myplt.grid(axis='y')
myplt.show()
The output is shown in Figure 6.6:
Figure 6.6: Output of Chap6_Example6.6.py

Note: The complete code is covered in (program name:


Chap6_Example6.6.py).

The following code snippet uses the xticks function from the
matplotlib.pyplot library to set the x-axis tick labels for a plot. The
rotation=30 parameter is used to rotate the tick labels by 30 degrees.

myplt.xticks(my_cricketers, rotation=30)
The tight_layout function is then called to automatically adjust the subplot
parameters to fit the plot figure area:
myplt.tight_layout()
The grid function is used to display grid lines on the y-axis of the plot, as
shown below:
myplt.grid(axis='y')

Enhancing Bar visualization with labels


Now, we shall investigate some of the approaches where we can add labels
to the bar.

Using text() function


This function is used to add text to a plot in a matplotlib figure. The text()
function takes several parameters to specify the position, content, and style
of the text. Refer to the following code program name:
Chap6_Example6.7.py for better understanding:
text(x, y, s, fontdict=None, **kwargs)

import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
mycolorlist = ['r', 'b', 'g', 'pink', 'k']
myplt.bar(my_cricketers, my_centuries, color=mycolorlist)
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.xticks(my_cricketers, rotation=30)
myplt.tight_layout()
# print(help(myplt.text))
for loop in range(len(my_cricketers)):
    myplt.text(my_cricketers[loop], my_centuries[loop] + 1, my_centuries[loop], ha='center', color='brown')
myplt.show()
The output is shown in Figure 6.7:

Figure 6.7: Output of Chap6_Example6.7.py

Note: The preceding code is covered in (program name:


Chap6_Example6.7.py).

We are interested in the following code here:


for loop in range(len(my_cricketers)):
myplt.text(my_cricketers[loop],
my_centuries[loop]+1, my_centuries[loop],
ha='center', color='brown')
The given code uses a for loop to iterate over a range of values equal to the length of the list called my_cricketers. During each iteration, the code uses the text function from the myplt alias to add a text label to the plot. The label is the number of centuries taken from the my_centuries list, placed at the x-position of the corresponding cricketer from the my_cricketers list. The y-position is the number of centuries increased by one, so the label sits just above the bar for better visibility. The text is centered horizontally and colored brown.
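As a side note, recent matplotlib versions (3.4 and later) also provide a bar_label() helper that attaches such labels to all bars of a bar container in one call. The sketch below is an assumed alternative to the text() loop above, reusing the same my_cricketers and my_centuries lists; it is not part of the chapter's numbered example programs:

import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']
my_centuries = [100, 76, 71, 63, 62]

# bar() returns a BarContainer; bar_label() labels every bar in it at once
mybars = myplt.bar(my_cricketers, my_centuries)
myplt.bar_label(mybars, padding=3, color='brown')  # padding is the gap (in points) above each bar
myplt.show()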

Using annotate() function


This function is used to add an annotation to a plot in a matplotlib figure.
The annotate() function takes several parameters to specify the position,
content, and style of the annotation. Refer to the syntax and following code
program name: Chap6_Example6.8.py for better understanding:
annotate(text, xy, xytext=None, xycoords='data', textcoords=None, arrowprops=None, annotation_clip=None, **kwargs)

import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']  # x-axis values
my_centuries = [100, 76, 71, 63, 62]  # height of bars, y-axis values
mycolorlist = ['r', 'b', 'g', 'pink', 'k']
myplt.bar(my_cricketers, my_centuries, color=mycolorlist)
myplt.xlabel('Cricketer Name', color='b', fontsize=15)
myplt.ylabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer wise number of centuries', color='r', fontsize=15)
myplt.xticks(my_cricketers, rotation=30)
myplt.tight_layout()
# print(help(myplt.annotate))
for loop in range(len(my_centuries)):
    myplt.annotate(f'{my_centuries[loop]}C', (my_cricketers[loop], my_centuries[loop] + 5), ha='center', color='brown', backgroundcolor='yellow')
myplt.show()
The output is shown in Figure 6.8:
Figure 6.8: Output of Chap6_Example6.8.py

Note: The preceding code is covered in (program name:


Chap6_Example6.8.py).

Let us discuss the following code:


for loop in range(len(my_centuries)):
myplt.annotate(f'{my_centuries[loop]}C',
(my_cricketers[loop], my_centuries[loop]+5),
ha='center', color='brown',
backgroundcolor='yellow')
In the code here, an f-string is used in the myplt.annotate() function to format the annotation text. The f-string converts the number of centuries to its string form and appends the letter C to it, so the format f'{my_centuries[loop]}C' displays the number of centuries followed by C in the annotation text.
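The real strength of annotate() over text() is the optional arrowprops parameter, which draws an arrow from the annotation text to the annotated point. The following hedged sketch, which is not one of the chapter's numbered example programs, marks the tallest bar of the same dataset with an arrow; the numeric offsets are arbitrary choices:

import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']
my_centuries = [100, 76, 71, 63, 62]

mypos = range(len(my_cricketers))      # numeric x positions for the bars
myplt.bar(mypos, my_centuries)
myplt.xticks(mypos, my_cricketers)

# xy is the point being annotated (top of the first bar);
# xytext is where the label text is drawn; arrowprops draws the arrow between them
myplt.annotate('Highest century count',
               xy=(0, 100), xytext=(2, 110),
               arrowprops={'arrowstyle': '->', 'color': 'brown'},
               ha='center', color='brown')
myplt.ylim(0, 120)                     # headroom so the annotation text stays inside the axes
myplt.show()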
Horizontal bar chart
In matplotlib, a horizontal bar chart is frequently used to contrast various
categories or groups against a single numerical value. It is helpful when we
want to display the size or scope of each category or group visually. In
general, horizontal bar charts work well when we have a long category label
or want to draw attention to the differences between the values being
compared. We should use a horizontal bar chart rather than a vertical bar
chart if the labels are long or there are many numbers to depict. We will
employ the barh() function. Here, the data will be represented in the
form of horizontal bars. Each bar corresponds to an individual category.
Data values will be represented on the x-axis, while categories will be
plotted on the y-axis. The barʼs width and length are proportionate to the
value it stands for. Although we can customize this value, the default height
is 0.8.
So, there is swapping in the following elements:
Vertical to horizontal
width to height
height to width
bar to barh
bottom to left
Refer to the syntax and the following code program name:
Chap6_Example6.9.py for better understanding:

barh(y, width, height=0.8, left=None, *, align='center', data=None, **kwargs)

import matplotlib.pyplot as myplt

my_cricketers = ['Sachin', 'Virat', 'Ricky', 'Sangakarra', 'Jacques']
my_centuries = [100, 76, 71, 63, 62]
mycolorlist = ['r', 'b', 'g', 'pink', 'k']
myplt.barh(my_cricketers, my_centuries, color=mycolorlist)  # horizontal bar chart
myplt.ylabel('Cricketer Name', color='b', fontsize=15)
myplt.xlabel('Number of Centuries', color='g', fontsize=15)
myplt.title('Cricketer-wise Number of Centuries', color='r', fontsize=15)
myplt.xticks(rotation=30)
myplt.tight_layout()
for loop in range(len(my_centuries)):
    myplt.annotate(f'{my_centuries[loop]}C', (my_centuries[loop], loop), ha='center', color='brown', backgroundcolor='yellow')
myplt.show()
The output is shown in Figure 6.9:
Figure 6.9: Output of Chap6_Example6.9.py

Note: The preceding code is covered in (program name:


Chap6_Example6.9.py).

We shall discuss two lines from the present code. The first line of code is the following:
myplt.barh(my_cricketers, my_centuries, color=mycolorlist)
This line of code creates a horizontal bar chart by using Matplotlib's barh function, where my_cricketers provides the labels on the y-axis, my_centuries determines the length of the bars along the x-axis, and mycolorlist determines the colors of the bars.
Another line of code where we need to concentrate is the following:
myplt.xticks(rotation=30)
Here, xticks is a function used to customize the tick positions and labels
on the x-axis.
rotation=30 is an argument passed to xticks, which specifies the
rotation angle (in degrees) for the x-axis tick labels. In this case, it rotates
the x-axis tick labels by 30 degrees.
This line of code rotates the x-axis tick labels by 30 degrees to make them more readable if they are long or overlap with each other. It is often used when the tick labels are long strings; in this horizontal chart, it is applied to the numeric century counts shown on the x-axis.

Stacked bar chart


A stacked bar chart is a valuable tool in data visualization when there is a
need to compare various categories within a single bar. It is particularly
useful for illustrating the distribution or composition of different variables
in a dataset. Each subcategory here will be plotted on top of the other subcategories.
This chart type allows the user to visualize not only the total value of each
category but also the individual contributions of subcategories within that
total. This capability aids in comprehending the relative proportions and
patterns among the different variables.
For instance, a user could employ a stacked bar chart to depict the sales
performance of various products in each month. In this scenario, each bar
represents the total sales and is subdivided into segments, each representing
the sales of a specific product.
In summary, stacked bar charts excel at presenting both the comprehensive
and intricate aspects of multiple categories concurrently. This makes it
significantly easier to compare and analyze complex datasets, providing
valuable insights into the data.
The stacked bar chart can be used with either bar() or barh() function, that
is, either vertical or horizontal.
Example of the vertical stacked bar chart is demonstrated in the code
program name: Chap6_Example6.10.py as follows:
import matplotlib.pyplot as myplt

myallround_cricketers = ['Jacques Kallis', 'SIr Garifield Sobers', 'Imran Khan', 'Ian Botham', 'Kapil Dev', 'Sir Richard Hadlee', 'Sanath Jayasuriya', 'Andrew Flintoff', 'Shaun Pollock', 'Sakib Al Hasan']
my_runs = [25528, 8032, 7516, 7313, 9031, 4875, 21032, 7315, 7386, 11955]
my_wickets = [577, 236, 544, 528, 687, 589, 440, 400, 829, 576]
myplt.bar(myallround_cricketers, my_runs, color='r', label='Runs')
myplt.bar(myallround_cricketers, my_wickets, bottom=my_runs, color='g', label='Wickets')
myplt.xticks(myallround_cricketers, rotation=90)
for i in range(len(myallround_cricketers)):
    myplt.text(myallround_cricketers[i], (my_runs[i] / 2), str(my_runs[i]), ha='center', color='black', weight=1000)
    myplt.text(myallround_cricketers[i], (my_runs[i] + my_wickets[i] / 2), str(my_wickets[i]), ha='center', color='black', weight=1000)
myplt.legend()
myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.10:

Figure 6.10: Output of Chap6_Example6.10.py

Note: The preceding code is covered in (program name:


Chap6_Example6.10.py).

The code we have provided uses the matplotlib library in Python to create
a bar chart. Here is a breakdown of what each line of code does:
import matplotlib.pyplot as myplt: This line imports the pyplot
module from the matplotlib library and assigns it with the alias
myplt.
myallround_cricketers: This is a list of strings that represents the
names of the cricketers.
my_runs: This is a list of integers that represents the number of runs
scored by each cricketer.
my_wickets: This is a list of integers that represents the number of
wickets taken by each cricketer.
myplt.bar(myallround_cricketers, my_runs, color='r',
label='Runs'): This line creates a bar chart with the names of the
cricketers on the x-axis, the number of runs on the y-axis, and assigns
the color red to the bars. It also adds the label Runs to the legend.
myplt.bar(myallround_cricketers, my_wickets, bottom=my_runs,
color='g', label='Wickets'): This line creates another set of bars
on top of the previous bars, representing the number of wickets. The
bottom parameter specifies that the bars should start from the height
of the previous bars. The green color is assigned to these bars, and the
label Wickets is added to the legend.
myplt.xticks(myallround_cricketers, rotation=90): This line sets
the x-axis tick labels to be the cricketersʼ names, with a rotation of 90
degrees to prevent overlapping.
for i in range(len(myallround_cricketers)): This line sets up a
loop that iterates over the indices of the myallround_cricketers list.
The len(myallround_cricketers) function returns the length of the
list, and range() generates a sequence of numbers from zero to
len(myallround_cricketers)-1.

myplt.text(myallround_cricketers[i], (my_runs[i]/2),
str(my_runs[i]), ha='center', color='black', weight=1000): This
line adds a text label to the bar chart. The myplt.text() function is
used to add the text label. The first parameter is the x-coordinate
(cricketerʼs name), the second parameter is the y-coordinate (half of
the number of runs), the third parameter is the text to be displayed
(number of runs), and the remaining parameters specify the
formatting options such as horizontal alignment(ha='center'), text
color(color='black'), and font-weight(weight=1000).

myplt.text(myallround_cricketers[i],
(my_runs[i]+my_wickets[i]/2), str(my_wickets[i]), ha='center',
color='black', weight=1000): This line adds another text label to the
bar chart. Similar to the previous line, it specifies the cricketerʼs
name as the x-coordinate, the y-coordinate as the sum of the number
of runs and half of the number of wickets, the text to be displayed as
the number of wickets, and the same formatting options.
myplt.legend(): This line adds a legend to the plot.
myplt.tight_layout(): This line adjusts the layout of the plot to
avoid overlapping elements.
myplt.show(): This line displays the plot on the screen.
An example of a horizontal stacked bar chart is demonstrated in the code
program name: Chap6_Example6.11.py as follows:
import matplotlib.pyplot as myplt

myallround_cricketers = ['Jacques Kallis', 'SIr Garifield Sobers', 'Imran Khan', 'Ian Botham', 'Kapil Dev', 'Sir Richard Hadlee', 'Sanath Jayasuriya', 'Andrew Flintoff', 'Shaun Pollock', 'Sakib Al Hasan']
my_runs = [25528, 8032, 7516, 7313, 9031, 4875, 21032, 7315, 7386, 11955]
my_wickets = [577, 236, 544, 528, 687, 589, 440, 400, 829, 576]

fig, ax = myplt.subplots()
# Plot the runs
ax.barh(myallround_cricketers, my_runs, color='r', label='Runs')
# Plot the wickets on top of runs
ax.barh(myallround_cricketers, my_wickets, left=my_runs, color='g', label='Wickets')

ax.set_xlabel('Performance')
ax.set_title('Performance of All-Round Cricketers')
ax.set_xticks(range(0, 30001, 5000))
ax.invert_yaxis()  # Invert the y-axis to display the top performer at the top
ax.legend()

for i in range(len(myallround_cricketers)):
    ax.text(my_runs[i] + my_wickets[i] + 100, i, str(my_wickets[i]), ha='center', va='center', color='black', weight='bold')
    ax.text(my_runs[i] / 2, i, str(my_runs[i]), ha='center', va='center', color='black', weight='bold')

myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.11:
Figure 6.11: Output of Chap6_Example6.11.py

Note: The preceding code is covered in (program name:


Chap6_Example6.11.py).

The provided code creates a horizontal stacked bar chart by using


Matplotlib to visualize the performance of all-round cricketers. It compares
the runs and wickets statistics for each cricketer, with their names on the y-
axis. The chart displays runs in red and wickets in green, stacked on top of
each other. The y-axis is inverted to place the top performer at the top of the
chart. Text labels for runs and wickets are added to the bars, with adjusted
positions to prevent overlap. The chart includes labels, a title, and
customized tick marks on the x-axis to enhance readability and
understanding of the all-round performance of the cricketers.

Other bar charts


A clustered bar chart, crafted with the matplotlib library in Python, is a
visual representation that groups multiple bars side by side within each
category. Its value lies in its capacity to effectively present and contrast
various data sets across distinct categories. By clustering the bars, it
simplifies the examination and interpretation of relationships and
distinctions between these categories. This type of chart is especially
advantageous when the user wishes to compare data across diverse groups
or subcategories.
Consider having data about the sales performance of various products in
different regions. Utilizing a clustered bar chart permits the user to visually
assess the sales of products within each region, facilitating the identification
of patterns, trends, and fluctuations in the data.
Matplotlib offers a wide range of customization options for clustered bar
charts, allowing the user to fine-tune colors, labels, titles, and other visual
elements to align with specific requirements. This adaptability empowers
the user to wield matplotlib as a robust tool for both data visualization and
analysis. The clustered bar chart can be created by using either bar() or
barh() functions. Observe the following code Program Name:
Chap6_Example6.12.py as follows for clear understanding:

import matplotlib.pyplot as myplt
import numpy as mynp

myallround_cricketers = ['Jacques Kallis', 'SIr Garifield Sobers', 'Imran Khan', 'Ian Botham', 'Kapil Dev', 'Sir Richard Hadlee', 'Sanath Jayasuriya', 'Andrew Flintoff', 'Shaun Pollock', 'Sakib Al Hasan']
my_runs = [25528, 8032, 7516, 7313, 9031, 4875, 21032, 7315, 7386, 11955]
my_wickets = [577, 236, 544, 528, 687, 589, 440, 400, 829, 576]
myxpos = mynp.arange(len(myallround_cricketers))
mywidth = 0.3
myplt.bar(myxpos, my_runs, color='r', label='Runs', width=mywidth)
myplt.bar(myxpos + mywidth, my_wickets, color='g', label='Wickets', width=mywidth)
myplt.xticks(myxpos + (mywidth / 2), myallround_cricketers, rotation=90)
myplt.legend(['Runs', 'Wickets'])

for i in range(len(myallround_cricketers)):
    myplt.text(myxpos[i], my_runs[i] + 10, my_runs[i], ha='center', color='black', weight=1000)
    myplt.text(myxpos[i] + mywidth, my_wickets[i] + 5, my_wickets[i], ha='center', color='black', weight=1000)

myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.12:
Figure 6.12: Output of Chap6_Example6.12.py

Note: The preceding code is covered in (program name:


Chap6_Example6.12.py).

Let us focus on the following lines to explain the code:


myplt.bar(myxpos,my_runs,color='r',
label='Runs', width=mywidth)
This line creates a bar chart for the runs scored by each cricketer.
It specifies the x-positions, the data (my_runs), the bar color (r for
red), the label for the legend (Runs), and the bar width.
myplt.bar(myxpos + mywidth, my_wickets,
color='g', label='Wickets', width=mywidth)
This line creates a bar chart for the wickets taken by each
cricketer. It shifts the x-positions by mywidth to place the bars side-
by-side. It specifies the data (my_wickets), the bar color (g for
green), the label for the legend (Wickets), and the bar width.
myplt.xticks(myxpos+
(mywidth/2),myallround_cricketers,
rotation=90)
This line sets the x-axis labels to be the names of the all-round
cricketers. The rotation=90 argument rotates the labels 90 degrees
for better readability.
myplt.legend(['Runs','Wickets'])
This line adds a legend to the chart with labels Runs and Wickets.
for i in range(len(myallround_cricketers)):
myplt.text(myxpos[i], my_runs[i] + 10 ,
my_runs[i], ha='center', color='black',
weight=1000)
myplt.text(myxpos[i]+mywidth, my_wickets[i] + 5 ,
my_wickets[i], ha='center', color='black',
weight=1000)
The for loop iterates through each cricketer and uses myplt.text to add
the run and wicket values above their respective bars. This helps in
visualizing the exact values for each cricketer.

Pie chart
A pie chart is a circular statistical graphic divided into slices to represent data proportions. The circle is divided into segments called wedges, and each wedge represents an individual category. The wedge area is proportional to the category value. In the context of matplotlib, the user can create pie charts by using the myplt.pie() function. A pie chart is very useful for comparing categories. Refer to the following syntax and the code demonstrated in program name: Chap6_Example6.13.py for better understanding:
pie(x, explode=None, labels=None, colors=None,
autopct=None, pctdistance=0.6, shadow=False,
labeldistance=1.1, startangle=0, radius=1,
counterclock=True, wedgeprops=None,
textprops=None, center=(0, 0), frame=False,
rotatelabels=False, *, normalize=True, hatch=None,
data=None)


import matplotlib.pyplot as myplt
import numpy as mynp

# print(help(myplt.pie))
my_lang_popularity = mynp.array([40, 32, 18, 15, 12])  # some random values
# addition of labels to the wedges
mylang_labels = ['Python', 'Java', 'C#.Net', 'C', 'VB.Net']
# autopct: for labelling wedges with a numeric percentage and specifying its value
# using a formatted string: autopct = '%.2f' considers 2 digits after the decimal point;
# the % symbol is added using the wild character for formatted strings, so autopct = '%.2f%%'
myautopct = '%.2f%%'
# explode: for highlighting a particular category we will be using the explode argument
myexplode = [0.2, 0.0, 0.0, 0.0, 0.0]
# shadow: a shadow effect can be added to the pie chart by setting the value to True
myshadow = True
# colors: colors can be specified for the wedges using a list or array with color names or hex codes;
# if the number of colors is less than the number of wedges, the colors will be reused
my_list_color = ['r', 'b', 'g', 'pink', '#8B00FF']
# startangle: the wedge will start from 0 on the x-axis and move in the counter-clockwise direction;
# a number can be specified from which the wedge will start. Here, we are specifying the value 90
mystartangle = 90
# wedgeprops: pie chart wedges can be customized using wedgeprops, and
# keys in the form of key-value pairs can be edgecolor, linestyle or linewidth
mywedgeprop_para = {'edgecolor': 'k', 'linestyle': '--'}  # try increasing the linewidth, say 2, and check
# pie chart creation
myplt.pie(my_lang_popularity, labels=mylang_labels, autopct=myautopct, explode=myexplode,
          shadow=myshadow, colors=my_list_color, startangle=mystartangle,
          wedgeprops=mywedgeprop_para)
# addition of legend
myplt.legend(title='Languages Popularity Pie chart')
myplt.tight_layout()
myplt.show()

Note: In the above code, the comments are self-explanatory about what is happening at each step.

The output is shown in Figure 6.13:


Figure 6.13: Output of Chap6_Example6.13.py

Note: The preceding code is covered in (program name:


Chap6_Example6.13.py).

Histogram
Histograms in matplotlib serve as graphical representations that depict the
distribution of a dataset. They accomplish this by breaking the data into
bins and displaying the frequency or count of observations within each bin.
The significance of histograms stems from their capacity to unveil the
inherent data distribution. They enable an understanding of the shape,
central tendency, and dispersion of the dataset. Moreover, histograms prove
invaluable for the detection of data patterns, outliers, and skewness. They
find frequent application in exploratory data analysis, assisting in data-
driven decision-making.
Furthermore, histograms enjoy widespread usage across fields such as
statistics, data science, and data visualization. They offer a rapid and
intuitive means of scrutinizing and comprehending data, rendering them
indispensable tools for researchers, analysts, and data enthusiasts.
The number of observations falling in each interval forms the frequency distribution. Histograms prove to be a valuable asset when dealing with extensive datasets. To construct histograms, it is essential to partition the entire set of input values into equally sized groups or bins. Each bin corresponds to a bar in the histogram, and the height of these bars is directly proportional to the count of values falling within the respective bin or interval. To generate a histogram, the hist() function can be employed effectively. Refer to the following syntax for a better understanding:
hist(x, bins=None, range=None, density=False,
weights=None, cumulative=False, bottom=None,
histtype='bar', align='mid',
orientation='vertical', rwidth=None, log=False,
color=None, label=None, stacked=False, *,
data=None, **kwargs)
Now, we shall see a simple example of creating a histogram from the marks of students, grouped into the following ranges:
35 students got marks in the range zero to 34.
100 students got marks in the range 35 to 49.
45 students got marks in the range 50 to 59.
68 students got marks in the range 60 to 79.
27 students got marks in the range 80 to 100.
Refer to the following code program name: Chap6_Example6.14.py
for better clarity:
import matplotlib.pyplot as myplt

# print(help(myplt.hist))
# Data
# Two lists are created: mymarks_ranges contains the ranges of marks, and
# mynum_students contains the number of students falling within each range.
mymarks_ranges = ["0-34", "35-49", "50-59", "60-79", "80-100"]
mynum_students = [35, 100, 45, 68, 27]

# Calculate the midpoints of each range for the x-axis labels.
# The midpoints of each mark range are calculated and stored in the mymidpoints list.
mymidpoints = [(0 + 34) / 2, (35 + 49) / 2, (50 + 59) / 2, (60 + 79) / 2, (80 + 100) / 2]

# Create a histogram
# The hist() function is called to create the histogram. It takes several arguments:
# - mymidpoints: the values to be plotted on the x-axis.
# - bins: the range of values for each bin/bar in the histogram.
# - weights: the weights or frequencies of each value.
# - edgecolor: the color of the edges of the bars in the histogram.
# - color: the color of the bars in the histogram.
myplt.hist(mymidpoints, bins=[0, 35, 50, 60, 80, 100], weights=mynum_students, edgecolor='black', color='skyblue')

# Set axis labels and title
# The xlabel(), ylabel(), and title() functions are used to set the labels and title of the plot.
myplt.xlabel('Marks Range')
myplt.ylabel('Number of Students')
myplt.title('Distribution of Marks for 300 Students')

# Set x-axis labels
# The xticks() function is used to set the x-axis labels to the values in mymarks_ranges.
myplt.xticks(mymidpoints, mymarks_ranges)

# Display the histogram
myplt.show()
The output is shown in Figure 6.14:
Figure 6.14: Output of Chap6_Example6.14.py

Note: The preceding code is covered in (program name:


Chap6_Example6.14.py).

The code displayed here creates a histogram to visualize the distribution of marks for a group of students.
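The example above feeds pre-binned counts to hist() through the weights parameter. More commonly, hist() is given the raw observations and computes the counts itself. The following sketch is not one of the chapter's numbered example programs; the marks are randomly generated purely for illustration:

import matplotlib.pyplot as myplt
import numpy as mynp

# Generate 300 hypothetical marks between 0 and 100 just for demonstration
myrng = mynp.random.default_rng(0)
mymarks = myrng.integers(0, 101, size=300)

# hist() counts how many marks fall inside each bin boundary on its own
myplt.hist(mymarks, bins=[0, 35, 50, 60, 80, 100], edgecolor='black', color='skyblue')
myplt.xlabel('Marks')
myplt.ylabel('Number of Students')
myplt.title('Distribution of randomly generated marks')
myplt.show()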

Scatter plot
A scatter plot, a graphical representation, is employed in data visualization
to illustrate the correlation between two numeric variables. This plot utilizes
a cartesian coordinate framework to position points based on two variable
values. In Pythonʼs prominent data visualization library, matplotlib, scatter
plots can be constructed through the 'scatter()' function. This function
accepts x and y values as inputs and renders them as individual points on
the chart. Furthermore, customization options, such as specifying color,
size, and point shape, are available for enhancing the scatter plotʼs
appearance.
The comparison of line and scatter plots is mentioned as follows:
Both line and scatter plots serve the purpose of visualizing the
association between two numeric variables. Terminology such as x-axis data and y-axis data is the same in scatter plots.
Both types employ a cartesian coordinate system for visual
representation.
Matplotlib, a Python library, supports the creation of both line and
scatter plots.
The differences between line and scatter plots are mentioned as follows:
Line plots utilize straight lines to connect data points, whereas scatter
plots do not establish connections between points.
Line plots are commonly employed to depict trends or patterns in
data across time or another continuous variable, whereas scatter plots
are used to portray the distribution or correlation between two
variables.
Line plots are suitable for illustrating continuous data, while scatter
plots are better suited for displaying discrete or categorical data.
In summary, although both line and scatter plots are valuable tools for
visualizing relationships between variables, they differ in terms of point
connection and the types of data they are best suited for. Refer to the
following syntax and code program name: Chap6_Example6.15.py as follows
for better understanding:
scatter(x, y, s=None, c=None, marker=None,
cmap=None, norm=None, vmin=None, vmax=None,
alpha=None, linewidths=None, *, edgecolors=None,
plotnonfinite=False, data=None, **kwargs)

import matplotlib.pyplot as myplt
import numpy as mynp
import matplotlib.patches as mpatches

# print(help(myplt.scatter))

# X-axis and Y-axis data
my_xaxis_data = mynp.arange(1, 6)
my_yaxis_data = [20, 10, 60, 30, 80]

# for displaying a different color for each marker
# if a huge number of colors is required, cmap maps colors to numbers
mycolors = ['r', 'b', 'g', 'k', 'y']

# for changing the size of each marker
mysize = [100, 200, 300, 400, 500]

# Addition of labels using the annotate function
mylabels = ['L1', 'L2', 'L3', 'L4', 'L5']
for loop, label in enumerate(mylabels):
    myplt.annotate(label, (my_xaxis_data[loop] + 0.2, my_yaxis_data[loop]))

# scatter function
myscatter = myplt.scatter(my_xaxis_data, my_yaxis_data, c=mycolors, s=mysize)

# creating a list of Patch objects representing each marker
# and using it as an argument in the legend() function
mylegend_patches = [mpatches.Patch(color=color, label=label) for color, label in zip(mycolors, mylabels)]

# Displaying the legend
myplt.legend(handles=mylegend_patches)
myplt.show()
The output is shown in Figure 6.15:

Figure 6.15: Output of Chap6_Example6.15.py


Note: The preceding code is covered in (program name:
Chap6_Example6.15.py).

We shall discuss the two lines here as the rest of the code is self-explanatory
due to the comments:
The first line is mylegend_patches = [mpatches.Patch(color=color, label=label) for color, label in zip(mycolors, mylabels)]

mycolors and mylabels are lists containing colors and labels


respectively.
The code uses a list comprehension to create a list of legend
patches.
For each color and label pair in zip(mycolors, mylabels), a
mpatches.Patch object is created with the specified color and
label.
The resulting list of legend patches is stored in the variable
mylegend_patches.

Another one is displaying the legend


myplt.legend(handles=mylegend_patches)

myplt is the alias under which matplotlib.pyplot has been imported.
The legend function is called on myplt to display the legend.
The handles parameter is set to mylegend_patches, which contains
the list of legend patches created in the previous step.
This causes the legend to be displayed using the specified legend
patches.

Subplot in matplotlib
The subplot function in matplotlib is used to create a grid of subplots within
a single figure. By specifying the number of rows and columns in the
subplot grid, the user can easily control the layout and arrangement of the
subplots. It allows the user to display multiple plots or charts in a single
window, organized in a grid-like structure. This function is particularly
useful when the user wants to compare different datasets or visualize
different aspects of the data side by side.
First, we shall look into the subplot function:
Refer to the following syntax and code program name:
Chap6_Example6.16.py as follows for better understanding:
subplot(*args, **kwargs)

import matplotlib.pyplot as myplt
import numpy as mynp

# print(help(myplt.subplot))
myxaxis_data = mynp.arange(1, 11)
my_y1 = myxaxis_data
my_y2 = myxaxis_data * 2
my_y3 = myxaxis_data ** 2
my_y4 = myxaxis_data ** 3
myfig = myplt.figure(figsize=(8, 6), num=1)

myax1 = myplt.subplot(2, 2, 1)
myax1.plot(myxaxis_data, my_y1, color='r', marker='o')
myax1.set(xlabel='Xaxis data', ylabel='Yaxis data', title='Same Value Linearity')

myax2 = myplt.subplot(2, 2, 2)
myax2.plot(myxaxis_data, my_y2, color='b', marker='o')
myax2.set(xlabel='Xaxis data', ylabel='Yaxis data', title='Double of Xaxis')

myax3 = myplt.subplot(2, 2, 3)
myax3.plot(myxaxis_data, my_y3, color='y', marker='o')
myax3.set(xlabel='Xaxis data', ylabel='Yaxis data', title='Square function')

myax4 = myplt.subplot(224)  # commas can be removed
myax4.plot(myxaxis_data, my_y4, color='g', marker='o')
myax4.set(xlabel='Xaxis data', ylabel='Yaxis data', title='Cubic function')

myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.16:
Figure 6.16: Output of Chap6_Example6.16.py

Note: The preceding code is covered in (program name:


Chap6_Example6.16.py).

We are calling the subplot() function separately for every subplot. If a large number of subplots is required, writing this subplot() call every time becomes tedious. To overcome this, we shall use the subplots() function, which returns all the axes objects at once.
Another one is subplots function. Refer to the following syntax:
subplots(nrows=1, ncols=1, *, sharex=False,
sharey=False, squeeze=True, width_ratios=None,
height_ratios=None, subplot_kw=None,
gridspec_kw=None, **fig_kw)
Just observe the following line of code:
myfig, myaxs = myplt.subplots(2,2)
Here, one figure object and ndarray of four axes objects are returned. Now,
we shall look into the code program name: Chap6_Example6.17.py as
follows for better understanding:
import matplotlib.pyplot as myplt
import numpy as mynp

# print(help(myplt.subplots))
myxaxis_data = mynp.arange(1, 11)
my_y1 = myxaxis_data
my_y2 = myxaxis_data * 2
my_y3 = myxaxis_data ** 2
my_y4 = myxaxis_data ** 3

myfig, myax = myplt.subplots(2, 2)

myax[0, 0].plot(myxaxis_data, my_y1, color='r', marker='o')
myax[0, 0].set(xlabel='Xaxis data', ylabel='Yaxis data', title='Same Value Linearity')

myax[0, 1].plot(myxaxis_data, my_y2, color='b', marker='o')
myax[0, 1].set(xlabel='Xaxis data', ylabel='Yaxis data', title='Double of Xaxis')

myax[1, 0].plot(myxaxis_data, my_y3, color='y', marker='o')
myax[1, 0].set(xlabel='Xaxis data', ylabel='Yaxis data', title='Square function')

myax[1, 1].plot(myxaxis_data, my_y4, color='g', marker='o')
myax[1, 1].set(xlabel='Xaxis data', ylabel='Yaxis data', title='Cubic function')

myplt.tight_layout()
myplt.show()
The output is shown in Figure 6.17:
Figure 6.17: Output of Chap6_Example6.17.py

Note: The preceding code is covered in (program name:


Chap6_Example6.17.py).

We can also unpack the returned axes directly, as shown in the following alternative:
myfig, ((myaxs1, myaxs2), (myaxs3, myaxs4)) = myplt.subplots(2, 2)
This concludes this exciting chapter on matplotlib.

Conclusion
In this chapter, we have explored the versatility of bar plots, discovering
how to tweak colors, widths, and alignments for individual bars, along with
insights into creating horizontal and stacked bar charts. We also uncovered
the secrets of grouped bar charts, equipping us with an arsenal of bar chart
skills. Then, we explored pie charts, histograms, scatter plots, and subplots
in matplotlib, further enriching our data visualization repertoire. This
chapter has provided us with a robust foundation in Python data
visualization, enabling the reader to effectively communicate insights and
trends from data in various formats and styles.
In the next chapter, we shall learn about the pandas series in data science as
it is a fundamental data structure for handling one-dimensional labeled data,
which is common in many data analysis tasks. Pandas series offers powerful
functionalities for data manipulation, transformation, and analysis, making
it indispensable for tasks such as data cleaning, exploration, and pre-
processing. By mastering the pandas series, data scientists will gain the
ability to efficiently organize, access, and analyze data.

Points to remember
If we want to compare different categories of values, we can use a
vertical bar chart, created by using the bar() function. In addition, if
the labels are too long or there are multiple values to represent, we
should opt for a horizontal bar chart. This can be created by using the
barh() function.

If we want to compare different categories of values, and each category contains multiple subcategories that we want to represent on top of each other, we should use a stacked bar chart. It can be either vertical or horizontal. If multiple subcategories are represented side by side, we should use a clustered bar chart. It can be either vertical or horizontal.
Pie charts are suitable for displaying relative proportions of
categorical data.
Explode slices or emphasize specific segments by using the explode
parameter.
Add a legend to indicate the categories represented in the pie chart.
Scatter plots are used to visualize the relationship between two
continuous variables.
Histograms are used to visualize the distribution of continuous data.
Control the number of bins (intervals) to divide the data by using the
bins parameter.
Customize the appearance of the histogram bars, such as their color,
edge color, and transparency.
Subplots allow multiple plots within a single figure.
Create subplots by using the myplt.subplot() or myplt.subplots().
Define the grid layout with rows and columns.
Access and customize individual subplots by using variables like ax1,
ax2, and so on.

Use myplt.tight_layout() to improve spacing in multi-subplot figures.

Questions
1. How can you create a vertical bar chart (simple bar chart) in Python?
2. How do we add labels to bars in a bar chart?
3. What is a horizontal bar and stacked bar chart, and how do you
create them in Python?
4. Provide an example of a vertical stacked bar chart.
5. Provide an example of a horizontal stacked bar chart.
6. How do you create a grouped bar chart (clustered bar chart) in
Python?
7. How can we create a pie chart in Python?
8. What is a histogram, and how can you create one in Python?
9. How can we create a scatter plot in Python?
10. What is a subplot, and how can we use it in matplotlib for creating
multiple plots in one figure?
CHAPTER 7
Exploring Pandas Series for Data Science in Python
Introduction

Pandas is a robust and popular Python package for data analysis and
manipulation. It offers simple-to-use data structures and data analysis
capabilities for working with structured data, including tabular, time series,
and heterogeneous data. Matplotlib, SciPy, and Scikit-learn are among the
many additional data analysis libraries which Pandas is compatible with, it
is built on top of NumPy which means for operating Pandas, a numpy
library is required. Series and DataFrame are the two main data structures
in Pandas. Similar to a column in a spreadsheet or a Structured Query
Language (SQL) table, a series is a one-dimensional named array that can
carry any data type (integers, floats, texts, and so on). A two-dimensional
labeled data structure called a DataFrame can carry a variety of data kinds
in its columns of potentially different types. The data analysis in pandas is
based on numpy, whereas data visualization is based on matplotlib.

Data cleansing, merging and joining, reshaping, filtering, sorting, grouping, and aggregation are just a few of Pandas's many data manipulation and analysis features. Additionally, it supports complex data analysis activities including machine learning, statistical modeling, and time series analysis.

Data science, machine learning, finance, economics, social science, and many more fields use Pandas extensively. Its popularity is a result of its simplicity, adaptability, and potent capabilities, which make it a crucial tool for Python data analysis.

We can install pandas by typing the following pip command:

pip install pandas

Python code frequently uses the import pandas as mypd statement to import
the pandas library and alias it as mypd. In this chapter we are using mypd
as an alias while writing the code.
Also, we will be writing help in some parts of the code snippet so that the
reader can uncomment and then view the syntax, explanation of properties,
and some examples of its usage.

Note that a wealth of information on pandas is available at pandas.pydata.org, and we have tried our level best to explain the important concepts in a nutshell, focusing on the overall outcomes that will be crucial for data science projects.

Let us learn some concepts from the pandas series.

Structure

In this chapter, we will discuss the following topics:

Pandas series constructor


Creating pandas series by passing a list
Creating pandas series by passing a dictionary
Creating pandas series by passing a numpy array
Creating pandas series using RangeIndex
Creating pandas series with dtype and name
Exploring pandas series using copy parameter
Accessing series values using head and tail methods
Accessing elements in pandas series
Pandas series slicing
Values extraction using loc and iloc indexers
Masking using Booleans for condition-based selection
Pandas series filtering
Callable object usage in selecting elements
Some useful attributes of pandas series object
Usage of apply method to pandas series
Aggregating of pandas series
Basic arithmetic operations for series object
Arithmetic operations with scalar value
Arithmetic operations between 2 series objects
Series object transformation
Iterate elements of the series
Objectives

By the end of this chapter, the readers will have an idea about a
comprehensive range of topics related to the pandas series in Python,
equipping them with the skills and knowledge needed for effective data
manipulation and analysis. We will start by exploring the pandas series,
covering fundamental aspects such as series creation, data access, slicing,
and filtering. We will also explore more advanced topics like using callable
objects for selection, leveraging useful attributes, applying functions with
the apply method, aggregating data, and performing basic arithmetic
operations. The chapter will further emphasize series transformations and
iteration for a well-rounded understanding. By the end of this chapter,
readers will have a solid foundation in working with pandas series for data
analysis and manipulation in Python.

Pandas series

A Pandas series is a one-dimensional labeled array that stores any sort of data, including Python objects, floats, strings, or integers (only one column of data). We can compare the Pandas series to a column in an Excel sheet. The values in the series can be changed, that is, series objects are mutable. However, the series size is immutable, that is, the number of elements in the series cannot be changed. We shall see some of the key concepts of the Pandas series in Python.

Pandas series constructor

We can create a Pandas series using the following constructor:

Series(data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool | None' = None, fastpath: 'bool' = False)

The detailed explanation of the syntax is mentioned as follows:

data: This is the information that will be kept in the series. It could be
a scalar value, a dictionary, a list, or a numpy array.
index: This option details the series index labels. It could be a
dictionary, list, array, or range. The series will be given a default
integer index starting at zero if no index is given.
dtype: The data type for the values in the series is specified by this
argument. Pandas will attempt to determine the data type from the
input data if it is not provided.
name: The series name is specified by this argument.
copy: With this argument, the user can decide whether the input data should be copied.
fastpath: Most users should not use this argument because it is used
for internal optimization only.
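As a quick illustration of how these arguments fit together, the following hedged sketch passes data, index, dtype, and name in a single call; the values themselves are arbitrary placeholders, and each parameter is also demonstrated individually in the examples that follow:

import pandas as mypd

# data, index, dtype and name supplied together; the values are hypothetical
myseries = mypd.Series(data=[10, 20, 30],
                       index=['a', 'b', 'c'],
                       dtype='float64',
                       name='demo_series')
print(myseries)
# a    10.0
# b    20.0
# c    30.0
# Name: demo_series, dtype: float64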

Creating pandas series by passing a list

A pandas series can be created by passing a list as shown in the code


Chap7_Example7.1.py as follows:

import pandas as mypd

# creating a Pandas Series from a list

mylist = [1, 2, 3, 4, 5]

mypd_series = mypd.Series(mylist)

print(mypd_series)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
Note: The preceding code is covered in (program name:
Chap7_Example7.1.py).

In the preceding example, we can see that no index parameter has been provided, so pandas internally assigns default index values from a RangeIndex(start=0, stop=5, step=1) object.

Creating pandas series by passing a dictionary

A pandas series can be created by passing a dictionary as shown in the code


Chap7_Example7.2.py as follows:

import pandas as mypd

# creating a Pandas Series from a dictionary

mydict = {'key1':1, 'key2':2, 'key3':3, 'key4':4, 'key5':5}

mypd_series = mypd.Series(mydict)

print(mypd_series)

Output:

key1 1

key2 2

key3 3

key4 4

key5 5

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.2.py).
Creating pandas series by passing a numpy array

A pandas series can be created by passing a numpy array as shown in the


code Chap7_Example7.3.py as follows:

import pandas as mypd

import numpy as mynp

# creating a Pandas Series from a numpy array

myndarray = mynp.array([11,12,13,14,15])

mypd_series = mypd.Series(myndarray)

print(mypd_series)

Output:

0 11

1 12

2 13

3 14

4 15

dtype: int32

Note: The preceding code is covered in (program name:


Chap7_Example7.3.py).

Creating pandas series from scalar value

For creating a series from a scalar value, an index has to be provided. The scalar value will be repeated to match the index length, as demonstrated. Observe the code Chap7_Example7.4.py below for better understanding:
import pandas as mypd

# creating a Pandas Series from a scalar value

myseries = mypd.Series(10, index = [1,2,3,4])

print(myseries)

Output:

1 10

2 10

3 10

4 10

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.4.py).

Creating pandas series using RangeIndex

Pandas RangeIndex is a default index type that represents a sequence of


integers, commonly used for DataFrames and series when no index is
specified. Refer to the following syntax and code Chap7_Example7.5.py
for clarity:

RangeIndex(start=None, stop=None, step=None, dtype: 'Dtype | None' = None, copy: 'bool' = False, name: 'Hashable' = None)

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]


mypd_series = mypd.Series(mylist)

#print(help(mypd.RangeIndex))

mypd_series.index = mypd.RangeIndex(start=6, stop=11, step=1)

print(mypd_series)

Output:

6 11

7 12

8 13

9 14

10 15

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.5.py).

Creating pandas series with dtype and name

Creating a pandas series with a specified data type (dtype) is achieved by


providing the dtype parameter in the pd.Series() constructor, allowing
explicit control over the data type of the series elements. The name
parameter is used to assign a name to the Series, providing a label for the
data, as shown in the following code Chap7_Example7.6.py:

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]


mypd_series_float = mypd.Series(mylist, dtype='float')

# assigning name to the series

mypd_series_float.name = "Series name"

# assigning name to the indexes

mypd_series_float.index.name = "Index name"

print(mypd_series_float)

Output:

Index name

0 11.0

1 12.0

2 13.0

3 14.0

4 15.0

Name: Series name, dtype: float64

Note: The preceding code is covered in (program name:


Chap7_Example7.6.py).

Exploring pandas series using copy parameter

The copy parameter in the pandas series provides control over whether a
deep copy or a reference copy is made when manipulating or modifying
data. When copy=True, a deep copy is created, ensuring that changes to the
new series do not affect the original data. Conversely, setting copy=False
results in a reference copy, allowing changes to the new series to propagate
to the original data, potentially saving memory but requiring caution to
avoid unintended modifications. This parameter is valuable in scenarios
where data integrity and independence are crucial considerations during
exploratory data analysis and manipulation using pandas, as shown in the
following code Chap7_Example7.7.py:

import pandas as mypd

import numpy as mynp

myndarray = mynp.array([11,12,13,14])

myseries = mypd.Series(myndarray, copy=False)

myseries[1] = 32

print("When copy=False, no separate copy is created. myseries data is ")

print(myseries)

print(f"When copy=False, myndarray data is {myndarray}")

myndarray2 = mynp.array([11,12,13,14])

myseries2 = mypd.Series(myndarray2, copy=True)

myseries2[1] = 32

print("When copy=True, Separate copy is created. myseries data is ")

print(myseries2)

print(f"When copy=True, myndarray data is {myndarray2}")

Output:

When copy=False, no separate copy is created. myseries data is

0 11
1 32

2 13

3 14

dtype: int32

When copy=False, myndarray data is [11 32 13 14]

When copy=True, Separate copy is created. myseries data is

0 11

1 32

2 13

3 14

dtype: int32

When copy=True, myndarray data is [11 12 13 14]

Note: The preceding code is covered in (program name:


Chap7_Example7.7.py).

Accessing series values using head and tail methods

The head() method will access the first five rows of a series, as the default value is five. If an argument n other than five is passed, then this method will access the first n rows of the series.
Refer to the following syntax for clarity:
head(n: 'int' = 5)
The tail() method will access the last five rows of a series, as the default value is five. If an argument n other than five is passed, then this method will access the last n rows of the series. If the n value is negative, then this method will return all rows except the first n rows, as shown in the following Chap7_Example7.8.py:

import pandas as mypd

myseries=mypd.Series([loop for loop in range(20)])

# head method

print(myseries.head())

print(myseries.head(n=2))

# tail method

print(myseries.tail())

print(myseries.tail(n=-18)) # equivalent to print(myseries.tail(n=2)) , it will


return last 2 rows

Output:

0    0
1    1
2    2
3    3
4    4
dtype: int64
0    0
1    1
dtype: int64

15 15

16 16

17 17

18 18

19 19

dtype: int64

18 18

19 19

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.8.py).

Accessing elements in the pandas series

The individual elements of a pandas series can be accessed using either an


index label or the elementʼs integer location. Observe the following syntax
for clarity:

mypd_series[x]:

Here, x can be a single index value: mypd_series[3] will return the value present at index three. It can be a list of indices: mypd_series[[1, 2, 4]] will return a series of values present at indices one, two, and four. It can also be a slice: mypd_series[2:5] will return a series of values from the second to the fourth index.
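As a brief sketch of the list-of-indices form described above (this snippet is not one of the chapter's numbered example programs), selecting several positions at once returns a new series:

import pandas as mypd

mypd_series = mypd.Series([11, 12, 13, 14, 15])
print(mypd_series[[1, 2, 4]])
# 1    12
# 2    13
# 4    15
# dtype: int64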

Observe the following code Chap7_Example7.9.py for accessing elements


using index label and integer location:
import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# element access using index label

print(mypd_series['u'])

# element access using integer location

print(mypd_series[3])

Output:

14

14

Note: The preceding code is covered in (program name:


Chap7_Example7.9.py).

Pandas series slicing

A Pandas series can be sliced by using either index labels or integer


locations, as shown in the following code Chap7_Example7.10.py as
follows:

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])


# slicing the Pandas Series using index labels

print(mypd_series['r':'u'])

# slicing the Pandas Series using integer locations

print(mypd_series[1:4])

Output:

r 11

s 12

t 13

u 14

dtype: int64

s 12

t 13

u 14

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.10.py).

From the preceding example, there are two points to note:

In the label-based slicing, the end/stop attribute is inclusive. This is


the reason we can view that the index label u is inclusive.
In the index-based slicing, end/stop is not at all inclusive if specified.
That is the reason four is not inclusive in [1:4] and we can see that
values are returned from the first to third index.
However, there is a problem with this indexing approach. If we try to
access data at an index or label (position-based or label-based selection)
that is not available, we will get an error. For example, in the same code,
print(mypd_series[10]) would raise IndexError: index 10 is out of bounds for
axis 0 with size 5, and print(mypd_series['a']) would raise
KeyError: 'a'.

In order to overcome the problem, we will be using the get() method, as


shown as follows:

print(mypd_series.get(10))

print(mypd_series.get('a'))

So, in the same code, if we add the above two lines, then we will get an
output as None in both cases and we will not encounter any errors.

If the specified index/label is unavailable and we are specifying the default


value (the facility is provided to specify the default value in case of the
get() method), then we get the default value, as shown as follows:

print(mypd_series.get('a', default=22))

So, in the same code, we are adding the line shown above. The output we
are getting is 22 since the specified label is unavailable. If the specified
index/label is present and a default value is provided, the output will not be
the default value, as shown as follows:

print(mypd_series.get(2, default=32))

So, in the same code, if we add the line shown above, then the output we
are getting is 13; the specified index is available.
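For quick reference, the following minimal sketch (not one of the numbered chapter programs; the values mirror the series used above) combines the get() behaviors just discussed:

import pandas as mypd

mylist = [11, 12, 13, 14, 15]
mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# a missing positional index and a missing label return None instead of raising an error
print(mypd_series.get(10))
print(mypd_series.get('a'))

# the default value is returned only when the key is absent
print(mypd_series.get('a', default=22))
# the existing value is returned when the key is present, as described above
print(mypd_series.get(2, default=32))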

Values extraction using loc and iloc indexers

In order to extract values using label-based selection, we use the loc
indexer, and to extract values using position-based selection, we use the
iloc indexer (integer location), whose argument is always an integer
position. Compared with normal indexing, loc and iloc also tend to perform
better and are less ambiguous, because each indexer works with exactly one
kind of key: labels for loc and integer positions for iloc.

Examples of valid cases for iloc are mentioned in the following
code Chap7_Example7.11.py:

mypd_series.iloc[0]

mypd_series.iloc[[1,2,4]]

mypd_series.iloc[a:b]  # here, index b is not inclusive

Examples of valid cases for loc are mentioned as follows:

mypd_series.loc['labela']

mypd_series.loc['labela':'labelb']  # here, labelb is inclusive

mypd_series.loc[['labela', 'labelb']]

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# using iloc---------------------------------------

# print(help(mypd_series.iloc))

print(mypd_series.iloc[1])

print(mypd_series.iloc[[1,3]])

print(mypd_series.iloc[1:3])
print(mypd_series.iloc[-1]) # we are able to access the last item

print('-'*50)

# using loc---------------------------------------

# print(help(mypd_series.loc))

# getting value associated with label 's'

print(mypd_series.loc['s'])

# getting values associated with label 's' and 'u'

print(mypd_series.loc[['s','u']])

# getting values from label 's' to 'u'

print(mypd_series.loc['s':'u'])

Output:

12

s 12

u 14

dtype: int64

s 12

t 13

dtype: int64

15

--------------------------------------------------
12

s 12

u 14

dtype: int64

s 12

t 13

u 14

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.11.py).

These iloc and loc indexers are commonly used in dataframes.
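To preview how the same indexers carry over to dataframes (covered in detail in the next chapter), here is a minimal sketch; the column names and labels are purely illustrative:

import pandas as mypd

mydf = mypd.DataFrame({'myname': ['Alex', 'John', 'Tom'],
                       'myage': [35, 45, 65]},
                      index=['r', 's', 't'])

print(mydf.loc['s'])               # row selected by label
print(mydf.loc['r':'s', 'myage'])  # label slice restricted to one column
print(mydf.iloc[0])                # first row selected by position
print(mydf.iloc[0:2, 1])           # first two rows, second column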

Masking using Booleans for condition-based selection

An array of Boolean values is to be provided which will select values from


the series where True value is present. It will be applicable for iloc, loc, and
normal indexers as demonstrated in the following code
Chap7_Example7.12.py:

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist)

print(mypd_series[[True, False, True, False, True]])

print(mypd_series.iloc[[True, False, True, False, True]])


print(mypd_series.loc[[True, False, True, False, True]])

print(mypd_series.get([True, False, True, False, True]))

Output:

0 11

2 13

4 15

dtype: int64

0 11

2 13

4 15

dtype: int64

0 11

2 13

4 15

dtype: int64

0 11

2 13

4 15

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.12.py).
An important point to note is that the number of values in the series and the
number of Boolean values passed must match, otherwise we will get
an error. However, with the get() method, if the number of Boolean values
does not match the number of elements, the output will be None, as
illustrated in the sketch that follows.
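The following minimal sketch (illustrative values, not one of the numbered programs) demonstrates the length mismatch case described above:

import pandas as mypd

mypd_series = mypd.Series([11, 12, 13, 14, 15])

# the Boolean mask has 4 values while the series has 5 elements
try:
    print(mypd_series[[True, False, True, False]])
except IndexError as myerror:
    print('Indexer error:', myerror)

# get() does not raise an error; it simply returns None for the mismatched mask
print(mypd_series.get([True, False, True, False]))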

Pandas series filtering

A pandas series can be filtered by using Boolean indexing, as demonstrated


in the following code Chap7_Example7.13.py:

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# filtering the Pandas Series using Boolean indexing

my_bool_filter = mypd_series > 3

my_filtered_mypdseries = mypd_series[my_bool_filter]

print(my_filtered_mypdseries)

Output:

r 11

s 12

t 13

u 14

v 15

dtype: int64
Note: The preceding code is covered in (program name:
Chap7_Example7.13.py).

The preceding code creates a pandas series called mypd_series using the
list mylist as data and the index ['r', 's', 't', 'u', 'v']. It then filters the pandas
series using Boolean indexing, where a new series called
my_filtered_mypdseries is created by selecting only the values from
mypd_series that are greater than three. Finally, it displays the filtered
series.

Callable object usage in selecting elements

A callable object, such as a function, can be used while selecting values from
the series. A callable can be passed to the normal indexer, to the loc
and iloc indexers, and to the get() method too. Observe the following code
Chap7_Example7.14.py for a better understanding:

import pandas as mypd

# creating a Pandas Series

mypd_series = mypd.Series([loop for loop in range(10)])

def my_even_number_selection(myseries):
    return [True if loop%2==0 else False for loop in range(myseries.size)]

print("Approach-1--------------------------")

print(mypd_series[my_even_number_selection])

print("Approach-2--------------------------")

print(mypd_series.loc[lambda myseries: [True if loop%2==0 else False for


loop in range(myseries.size)]])

print("Approach-3--------------------------")
print(mypd_series.iloc[lambda myseries: [True if loop%2==0 else False for
loop in range(myseries.size)]])

print("Approach-4--------------------------")

print(mypd_series.get(lambda myseries: [True if loop%2==0 else False for


loop in range(myseries.size)]))

Output:

Approach-1--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-2--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-3--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64
Approach-4--------------------------
0    0
2    2
4    4
6    6
8    8
dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.14.py).

Some useful attributes of the pandas series object are mentioned as follows:

values: This attribute will return values present inside the series
object.
index: This attribute will return the series index (axis labels).
dtype: The dtype object in the underlying data is returned using this
attribute.
size: The number of elements in the underlying data is returned using
this attribute.
shape: A tuple of the shape of the underlying data is returned using
this attribute. For a series object, it returns a one-element tuple
containing the number of elements in the series.
ndim: The number of dimensions of the underlying data is returned
using this attribute. Here, it will be always one by definition.
name: The series name is returned using this attribute. The default
value is None if not provided.
is_unique: If the values in the series object are unique, then True is
returned using this attribute. If the values in the series object are not
unique, that is, repeatability of the values is there and we want to have
unique values, then use nunique() method.
is_monotonic_increasing/is_monotonic_decreasing: Monotonic
means that the values are consistently in some order, either increasing
or decreasing. If values in the series object are monotonically
increasing, then is_monotonic_increasing will return True.
If values in the series object are monotonically decreasing, then
is_monotonic_decreasing will return True.
hasnans: This attribute will return True if the series contains NaNs. We
can use this attribute to check whether any values are missing.

Observe the following code Chap7_Example7.15.py for a better


understanding of these concepts:

import pandas as mypd

# creating a Pandas Series

mypd_series = mypd.Series(

data=['Mango','Guava','Pear','Banana','Litchi'],

index = [1,2,3,4,5],

name = '1-D data'

)
# values

print(mypd_series.values)

print(type(mypd_series.values))

# index

print(mypd_series.index)

# dtype

print(mypd_series.dtype)

# size

print(mypd_series.size)

# shape

print(mypd_series.shape)

# ndim

print(mypd_series.ndim)

# name

print(mypd_series.name)

# is_unique

print(mypd_series.is_unique)

print('-'*50)

mypd_series2 = mypd.Series(

data=['Mango','Guava','Pear','Banana','Litchi','Pear','Banana', mypd.NA])
print(mypd_series2.is_unique)

print(mypd_series2.nunique())

print(mypd_series2.nunique(dropna=False)) # want to consider NA values


also the use dropna argument

# is_monotonic

print('-'*50)

mypd_series3 = mypd.Series([1,2,3,4,5])

mypd_series4 = mypd.Series([5,4,3,2,1])

print(mypd_series3.is_monotonic_increasing)

print(mypd_series3.is_monotonic_decreasing)

print(mypd_series4.is_monotonic_increasing)

print(mypd_series4.is_monotonic_decreasing)

print('-'*50)

# hasnans

mypd_series5 =
mypd.Series(['Mango','Guava','Pear','Banana','Litchi','Pear','Banana',
mypd.NA])

mypd_series6 =
mypd.Series(['Mango','Guava','Pear','Banana','Litchi','Pear','Banana', None])

print(mypd_series.hasnans)

print(mypd_series5.hasnans)

print(mypd_series6.hasnans)
Output:

['Mango' 'Guava' 'Pear' 'Banana' 'Litchi']

<class 'numpy.ndarray'>

Index([1, 2, 3, 4, 5], dtype='int64')

object
5
(5,)
1
1-D data

True

--------------------------------------------------

False

--------------------------------------------------

True

False

False

True

--------------------------------------------------
False

True

True

Note: The preceding code is covered in (program name:


Chap7_Example7.15.py).

Usage of the apply method to Pandas series

A function can be applied to each element of a pandas series using the


apply method. We will be using the apply method in the global
transformation of series. The function can be a normal Python function or a
numpy function. Refer to the following syntax and code
Chap7_Example7.16.py for clarity:

apply(func: 'AggFuncType', convert_dtype: 'bool | lib.NoDefault' =


<no_default>, args: 'tuple[Any, ...]' = (), *, by_row: "Literal[False,
'compat']" = 'compat', **kwargs)

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# applying a function to each element of the Pandas Series

def mycube(mynum):
    return mynum ** 3

mycube_pdseries = mypd_series.apply(mycube)

print(mycube_pdseries)
#print(help(mypd_series.apply))

Output:

r 1331

s 1728

t 2197

u 2744

v 3375

dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.16.py).
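Since the text above mentions that a numpy function can also be passed, the following minimal sketch (illustrative values, not one of the numbered programs) shows apply() with a numpy function and with an extra keyword argument:

import pandas as mypd
import numpy as mynp

mypd_series = mypd.Series([11, 12, 13, 14, 15], index=['r', 's', 't', 'u', 'v'])

# passing a numpy function to apply()
print(mypd_series.apply(mynp.sqrt))

# extra keyword arguments are forwarded to the applied function
def mypower(mynum, exponent=2):
    return mynum ** exponent

print(mypd_series.apply(mypower, exponent=3))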

Aggregating of pandas series

A pandas series can be aggregated using min, max, sum, mean, and median
methods, as shown in the following code Chap7_Example7.17.py:

import pandas as mypd

# creating a Pandas Series

mylist = [11, 12, 13, 14, 15]

mypd_series = mypd.Series(mylist, index=['r', 's', 't', 'u', 'v'])

# Minimum value calculation of the Series

print(f" Minimum value is: {mypd_series.min()}")

# Maximum value calculation of the Series

print(f" Maximum value is: {mypd_series.max()}")


# Calculating the sum of the Series

print(f" Sum is: {mypd_series.sum()}")

# Calculating the mean of the Series

print(f" Mean is: {mypd_series.mean()}")

# Calculating the median of the Series

print(f" Mean is: {mypd_series.median()}")

Output:

Minimum value is: 11

Maximum value is: 15

Sum is: 65

Mean is: 13.0

Median is: 13.0

Note: The preceding code is covered in (program name:


Chap7_Example7.17.py). A single describe() method generates
descriptive statistics like count, std, min, max, mean, and so on.
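A minimal sketch of describe() (illustrative values, not one of the numbered programs) is shown as follows:

import pandas as mypd

mypd_series = mypd.Series([11, 12, 13, 14, 15])

# count, mean, std, min, quartiles and max in a single call
print(mypd_series.describe())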

Now, let us view an example of how to use mypd.read_csv() function to


create a series object with the data from a csv file. An important point to
note is that the above function will read a comma-separated value file into
dataframe as the default return type is dataframe and not series object. So,
we are required to create a series object from the data of the csv file which
is employees_salary.csv. We can get series object directly by using usecols
parameter as shown in the following code Chap7_Example7.18.py:

import pandas as mypd

# print(help(mypd.read_csv))
mydf = mypd.read_csv('employees_salary.csv')

# we will get a dataframe object here

print('The return type is', type(mydf))

print(mydf)

print('ndim: ', mydf.ndim)

# converting into series object

mydf1 = mypd.read_csv('employees_salary.csv', usecols=['Employee Name'])['Employee Name']

print('The return type is', type(mydf1))

print(mydf1)

print('ndim: ', mydf1.ndim)

Output:

The return type is <class 'pandas.core.frame.DataFrame'>

Employee Name Salary

0 John Deo 100000

1 Max Ruin 200000

2 Arnold 130000

3 Krish Star 140000

4 John Mike 150000

5 Alex John 160000

6 My John Rob 10000


7 Asruid 170000

8 Tes Qry 108000

9 Big John 109000

10 Ronald 300000

11 Recky 400000

ndim: 2

The return type is <class 'pandas.core.series.Series'>

0 John Deo

1 Max Ruin

2 Arnold

3 Krish Star

4 John Mike

5 Alex John

6 My John Rob

7 Asruid

8 Tes Qry

9 Big John

10 Ronald

11 Recky

Name: Employee Name, dtype: object


ndim: 1

Note: The preceding code is covered in (program name:


Chap7_Example7.18.py).

Now, suppose we have NA as a value in some column. To tackle this, we
will use the fillna() method here, as shown in the following code
Chap7_Example7.19.py:

import pandas as mypd

# Read the CSV file and select both 'Employee Name' and 'Salary' columns

mydf = mypd.read_csv('employees_salary2.csv', usecols=['Employee Name', 'Salary'])

print(mydf)

# Replace 'NA' values with '12345' in the 'Salary' column

mydf['Salary'] = mydf['Salary'].fillna(12345)

# Convert the DataFrame into a Series object with 'Employee Name' as the
index

myseries = mydf.set_index('Employee Name')['Salary']

# Display the resulting Series object

print(myseries)

print(type(myseries))

Output:

Employee Name Salary

0 John Deo 100000.0


1 Donald NaN

2 Recky 400000.0

Employee Name

John Deo 100000.0

Donald 12345.0

Recky 400000.0

Name: Salary, dtype: float64

<class 'pandas.core.series.Series'>

Note: The preceding code is covered in (program name:


Chap7_Example7.19.py). If there is a requirement to drop NAs from the
series object, then use the dropna() method. If the user needs to drop NAs
from the existing object in place, then use dropna(inplace=True), as
illustrated in the sketch that follows.
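A minimal sketch of dropna() on a series (illustrative values, not one of the numbered programs) is shown as follows:

import pandas as mypd
import numpy as mynp

mypd_series = mypd.Series([11, mynp.nan, 13, None, 15])

# returns a new series without the missing values
print(mypd_series.dropna())

# removes the missing values from the existing object itself
mypd_series.dropna(inplace=True)
print(mypd_series)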

Basic arithmetic operations for series objects are mentioned as follows:

Arithmetic operations with scalar value: Arithmetic operations can be


performed between series objects and scalar values. For each element,
a scalar value will be broadcasted to perform the required operation.
Kindly note that if the value is NAN, then after performing a scalar
operation the result is always NAN.
Arithmetic operations between two series objects: Arithmetic
operations can also be performed between two series objects. These
operations are performed only for matched indexes; NaN will be
returned for unmatched indexes, as shown in the code
Chap7_Example7.20.py as follows:

import pandas as mypd

import numpy as mynp

mydf = mypd.Series([20, mypd.NA, mynp.NaN, 25, None])


# arithmetic operations with scalar value

print(mydf*2)

# arithmetic operations between 2 series objects

mys1 = mypd.Series(data = [1,2,3,4,5], index = ['r','s','t','u','v'])

mys2 = mypd.Series(data = [6,7,8,9,10], index = ['r','s','t','u','v'])

print(mys1+mys2)

mys3 = mypd.Series(data = [1,2,3,4,5], index = ['r','s','t','u','v'])

mys4 = mypd.Series(data = [6,7,8,9,10], index = ['r','s','t','w','x'])

print(mys3+mys4)

Output:

0 40

1 NaN

2 NaN

3 50

4 NaN

dtype: object

r     7
s     9
t    11
u    13
v    15

dtype: int64

r 7.0

s 9.0

t 11.0

u NaN

v NaN

w NaN

x NaN

dtype: float64

Note: The preceding code is covered in (program name:


Chap7_Example7.20.py).

There is a fill_value parameter for methods like mul, div, add, and sub. The
fill_value will be considered in the place of the missing element, if the
matched index is unavailable, as shown in the following code
Chap7_Example7.21.py:

import pandas as mypd

import numpy as mynp

mys1 = mypd.Series(data = [1,2,3,4,5,mynp.NaN], index =


['r','s','t','u','v','y'])

mys2 = mypd.Series(data = [6,7,8,9,10], index = ['r','s','t','w','x'])

print(mys1.add(mys2, fill_value=0))
Output:

r 7.0

s 9.0

t 11.0

u 4.0

v 5.0

w 9.0

x 10.0

y NaN

dtype: float64

Note: The preceding code is covered in (program name:


Chap7_Example7.21.py). For arithmetic operations, the Series class
provides equivalent methods.

If mys1 and mys2 are two series objects, then observe the following
symbols and their corresponding methods (a short sketch follows the list):

+ → mys1.add(mys2)
- → mys1.sub(mys2)
* → mys1.mul(mys2)
/ → mys1.div(mys2)
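A minimal sketch of this operator-to-method equivalence (illustrative values, not one of the numbered programs) is shown as follows:

import pandas as mypd

mys1 = mypd.Series([10, 20, 30], index=['r', 's', 't'])
mys2 = mypd.Series([1, 2, 3], index=['r', 's', 't'])

# the operator form and the method form produce the same result
print((mys1 + mys2).equals(mys1.add(mys2)))
print((mys1 - mys2).equals(mys1.sub(mys2)))
print((mys1 * mys2).equals(mys1.mul(mys2)))
print((mys1 / mys2).equals(mys1.div(mys2)))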

Series object transformation

As the name suggests, transforming means updating the values of the
series object. Let us learn the types of transformations:

Partial transformation: In partial transformation, operations will be


performed on a subset of a series/dataframe. This operation can be
performed using the update() method. Non-NAN values will be used
from the passed series to make updates, as shown in the following
syntax and code Chap7_Example7.22.py:

update(other: 'Series | Sequence | Mapping')

import pandas as mypd

# Transforming series object without using update method

# we can update a particular element using normal indexing or loc/iloc


indexers

mypd_series = mypd.Series([15,30,45,60])

mypd_series[1] = 25

mypd_series[2] = 55

print(mypd_series)

# Obs: We can see that we are using 'n' lines of code in order to update 'n' elements

# Transforming series object using the update method

mypd_series2 = mypd.Series([15,30,45,60])

mypd_series2.update(mypd.Series([25,55], index=[1,2]))

print(mypd_series2)

#print(help(mypd_series2.update))

Output:

0    15
1    25
2    55
3    60
dtype: int64
0    15
1    25
2    55
3    60
dtype: int64

Note: The preceding code is covered in (program name:


Chap7_Example7.22.py).

If indexes match, the old values are replaced with the new values. If an
index in the passed series has no match in the original series, that extra
element is ignored. Also, the corresponding values are not updated in the
original series if the argument series contains NaNs, as shown in the
following sketch.
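The following minimal sketch (illustrative values, not one of the numbered programs) demonstrates both edge cases:

import pandas as mypd
import numpy as mynp

mypd_series = mypd.Series([15, 30, 45, 60])

# index 5 does not exist in the original series, so that element is ignored;
# the NaN at index 2 does not overwrite the existing value 45
mypd_series.update(mypd.Series([25, mynp.nan, 99], index=[1, 2, 5]))
print(mypd_series)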

Global transformation: In global transformation, operations will be


performed on the entire series/dataframe. This operation can be
performed using either apply() or map() method.

We have already seen the global transformation of series using the


apply() method in Chap7_Example7.16.py. Now, we shall explore the
global transformation of series using the map() method.

It is a limited version of the apply() method where we cannot pass


extra arguments to the input function. Here, each value in a series will
be substituted with another value which may be derived from a
function, a dict, or a series. Refer to the following syntax and code
Chap7_Example7.23.py for better understanding:
map(arg: 'Callable | Mapping | Series', na_action: “Literal['ignore'] |
None” = None)

import pandas as mypd

import numpy as mynp

mypd_series = mypd.Series(['Green','Yellow','Blue','Violet'])

#print(help(mypd_series.map))

# Using dict

mynewpd_series = mypd_series.map({'Green':'Pink','Blue':'Magenta'})

print(mynewpd_series) # there is no replacement for Yellow and Violet, hence they are converted to NaN

# Using lambda function

my_series_func = mypd_series.map(lambda x: f" I love {x} color ")

print(my_series_func)

# using na_action parameter

def myfunc(x):
    return f'I love {x} color '

mypd_series3 = mypd.Series(['Green','Yellow',mynp.NaN,'Violet'])

my_series_func2 = mypd_series3.map(myfunc, na_action='ignore') # na_action='ignore' keeps missing values as NaN, avoiding applying the function to them

print(my_series_func2)

Output:
0 Pink

1 NaN

2 Magenta

3 NaN

dtype: object

0 I love Green color

1 I love Yellow color

2 I love Blue color

3 I love Violet color

dtype: object

0 I love Green color

1 I love Yellow color

2 NaN

3 I love Violet color

dtype: object

Note: The preceding code is covered in (program name:


Chap7_Example7.23.py).

In summary, while partial transformations target a specific subset of the


data, global transformations are applied to the entire data structure. Pandas
offer a wide array of built-in functions and methods to efficiently perform
both types of transformations, making them valuable tools for data
manipulation and analysis.
Iterate elements of the series

We can iterate the elements of the series using the items() method as shown
in the code Chap7_Example7.24.py as follows:

import pandas as mypd

mypd_series = mypd.Series(['Green','Yellow','Blue','Violet'])

for index,value in mypd_series.items():
    print(f'{index} ---> {value}')

Output:

0 ---> Green

1 ---> Yellow

2 ---> Blue

3 ---> Violet

Note: The preceding code is covered in (program name:


Chap7_Example7.24.py).

Conclusion

In this chapter, we explored the intricate world of the pandas series in


Python, equipping readers with a diverse skill set to effectively analyze and
manipulate data. We commenced our exploration by diving into the pandas
series, ensuring a strong grasp of fundamental concepts like series creation,
data access, slicing, and filtering. We then ventured into more advanced
territory, covering topics such as using callable objects, leveraging essential
attributes, applying functions with the versatile apply method, aggregating
data, and conducting basic arithmetic operations. Emphasis was also placed
on series transformations and effective iteration techniques for a holistic
understanding. With this strong foundation, readers are now well-prepared
to tackle data analysis and manipulation tasks with confidence, armed with
the power of the Pandas series in Python. In the next chapter, we will learn
about pandas dataframe in data science, as it serves as a primary data
structure for handling two-dimensional labeled data, resembling a
spreadsheet or SQL table. Pandas dataframe offers extensive capabilities
for data manipulation, exploration, and analysis, making it indispensable
for various data science tasks such as data cleaning, transformation,
aggregation, and visualization. In the next chapter, we will explore how to
view and inspect Pandas DataFrames, including methods for accessing
data, summarizing data, and understanding the structure and content of
DataFrames through various descriptive statistics and visualization
techniques.

Points to remember

Pandas is a popular Python library for data manipulation and analysis.


It provides two primary data structures: Series (1D) and DataFrame
(2D).
Use import pandas as mypd to import the library and commonly use
mypd as an alias.
Read data from various sources like CSV, Excel, SQL, and more using
functions like read_csv() and read_excel().
Perform basic data exploration with methods like head(), tail(), and
info().
Check data types with the dtypes attribute.
Aggregate data with functions like sum(), mean(), count(), and more.
Visualize data using pandasʼ built-in plotting functionality and
integration with Matplotlib.
Pandas series is a single column of data with labels.
Pandas series can be created from many different types of data, such
as lists, dictionaries, numpy arrays, and CSV files.
Pandas series have many functions for accessing and manipulating
data, such as slicing, filtering, and indexing.
Pandas series can perform basic arithmetic operations, such as
addition, subtraction, multiplication, and division.
Pandas series can be manipulated using a variety of methods, such as
sorting, ranking, and aggregating.
Questions

1. What is a pandas series and how is it different from a dataframe?


2. What is the pandas series constructor used for?
3. How to create a pandas series by passing a list, dictionary and a
numpy array?
4. What does the copy parameter do when exploring a pandas series?
5. How to access values from a pandas series using head and tail
methods?
6. What methods can be used to access elements in a pandas series?
7. What is pandas series slicing?
8. How to extract values using loc and iloc indexers in a pandas series?
9. How to use Booleans for condition-based selection in a pandas series?
10. What is involved in filtering a pandas series?
11. How can a callable object be used to select elements in a pandas
series?
12. What are some useful attributes of a pandas series object?
13. How is the apply method used with a pandas series?
14. How to aggregate data in a pandas series?
15. What are the basic arithmetic operations that can be performed on a
pandas series?
16. How to perform arithmetic operations with a scalar value on a pandas
series?
17. What is the process for performing arithmetic operations between two
pandas series objects?
18. How to transform a pandas series object?
19. What is the method to iterate through elements of a pandas series?

Join our book’s Discord space

Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:

https://discord.bpbonline.com
CHAPTER 8
Exploring Pandas Dataframe for
Data Science in Python

Introduction
In the previous chapter, we learned about the pandas series. In this chapter,
we will explore the pandas dataframe. A DataFrame is a two-dimensional
labeled data structure that can carry a variety of data types in its
columns. It resembles a spreadsheet or table and facilitates operations on
entire rows or columns.
Pandas dataframe shines in managing structured, tabular data with multiple
variables. The flexibility and functionality provided by pandas dataframe
include support for handling missing data, making them particularly
valuable for diverse and complex data analysis tasks. It is common to start
with a pandas series and then combine them into a dataframe to leverage the
benefits of both structures in handling different aspects of data manipulation
and analysis.
Let us learn some concepts on pandas dataframe.

Structure
The chapter will discuss the following topics:
Pandas dataframe constructor
Pandas methods and attributes for dataframe
Dropping dataframe rows with null values
Arithmetic operators for dataframes
New columns addition to dataframes
Usage of fillna() method to the dataframe
Sorting values of the dataframe
Sorting dataframe by multiple columns
Sorting dataframe based on index
Ranking series of values by using rank() method
Filter data from dataframe
Inclusion check in dataframe
Usage of isnull and notnull method
Usage of duplicated method
Usage of drop_duplicates method in dataframe
Usage of unique and nunique methods

Objectives
By the end of this chapter, the reader will have an idea about a
comprehensive range of topics related to pandas dataframes in Python,
equipping them with the skills and knowledge needed for effective data
manipulation and analysis. We will look at pandasʼ dataframes, readers will
gain insights into dataframe construction and exclusive methods and
attributes applicable only to dataframes. We will address the critical task of
handling missing data, arithmetic operations, and the addition of new
columns for dataframes. The usage of the fillna method for handling
missing values will also be covered. Sorting, ranking, and filtering data
within dataframes will be thoroughly explored, along with techniques for
checking data inclusion, identifying missing values, and handling
duplicates. Finally, the reader will be able to understand the concept of a
pandas dataframe and will be able to differentiate between pandas series.

Pandas dataframe
A two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns) is called a pandas
dataframe. Data is aligned in a tabular fashion in rows and columns, which
is a two-dimensional data structure. The data, rows, and columns are the
three primary components of a pandas dataframe. The data representation is
like a SQL table or a spreadsheet. We can say that a dataframe is a
collection of series objects. Let us see some of the key concepts of pandas dataframe
in Python.

Pandas dataframe constructor


We can create a pandas dataframe by the usage of the following constructor:
pandas.DataFrame(data=None, index=None,
columns=None, dtype=None, copy=False)
The explanation is given as follows:
data: The dataframeʼs data are specified using this argument. It might
be a dictionary, a list of dictionaries, a NumPy array, or another
dataframe, among other data structures. An empty dataframe is
created if no data is provided.
index: The row labels for the dataframe are specified using this
argument. It could be a list, an array, or another dataframe, among
other data structures. A default integer index is created if the index is
not specified.
columns: The dataframeʼs column labels are specified using this
option. It could be a list, an array, or another dataframe, among other
data structures. A default set of column labels is created if columns
are not specified.
dtype: The data type to force for the dataframe is specified
using the dtype option. Only a single data type, such as int or float,
is accepted; per-column data types can be set after creation with the
astype() method.
copy: In order to specify whether to copy the data, use the copy
parameter. A new copy of the data is made if copy=True. The original
data is used if copy=False. A short sketch covering these parameters
follows this list.
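Before the full example in the next section, here is a minimal sketch showing the index, columns, dtype, and copy parameters together (the names and values are purely illustrative):

import pandas as mypd
import numpy as mynp

myndarray = mynp.array([[1, 2], [3, 4]])

# labelled rows and columns, values forced to float, with an independent copy of the data
mydf = mypd.DataFrame(myndarray, index=['row1', 'row2'],
                      columns=['col1', 'col2'], dtype='float64', copy=True)
print(mydf)

mydf.loc['row1', 'col1'] = 100.0
print(myndarray)   # unchanged, because copy=True was used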

Pandas dataframe creation


Observe the following code Chap8_Example8.1.py for pandas
dataframe creation:
import pandas as mypd
mydict = {'myname': ['Alex', 'John', 'Michael',
'Tom'],
'myage': [35, 45, 55, 65],
'mycountry': ['UK', 'USA', 'Australia',
'Finland']}
mypd_dataframe = mypd.DataFrame(mydict)
print(mypd_dataframe)
print('-'*50)

# creating a DataFrame from a list of dictionaries


mylist = [{'myname': 'Alex', 'myage': 35,
'mycountry': 'UK'},
{'myname': 'John', 'myage': 45,
'mycountry': 'USA'},
{'myname': 'Michael', 'myage': 55,
'mycountry': 'Australia'},
{'myname': 'Tom', 'myage': 65,
'mycountry': 'Finland'}]
my_pd_dataframe2 = mypd.DataFrame(mylist)
print(my_pd_dataframe2)
print(my_pd_dataframe2.ndim)
print(my_pd_dataframe2.shape)
#print(help(mypd.DataFrame))
print('-'*50)
# creating a DataFrame from a dictionary of series
objects
myname = mypd.Series(['Alex', 'John', 'Michael',
'Tom'])
myage = mypd.Series([35, 45, 55, 65])
mycountry = mypd.Series(['UK', 'USA', 'Australia',
'Finland'])
mypd_dataframe = mypd.DataFrame({'myname':myname,'myage':myage,'mycountry':mycountry})
print(mypd_dataframe)
Output:
myname myage mycountry
0 Alex 35 UK
1 John 45 USA
2 Michael 55 Australia
3 Tom 65 Finland
--------------------------------------------------
myage mycountry myname
0 35 UK Alex
1 45 USA John
2 55 Australia Michael
3 65 Finland Tom
2
(4,3)
--------------------------------------------------
myname myage mycountry
0 Alex 35 UK
1 John 45 USA
2 Michael 55 Australia
3 Tom 65 Finland

Note: The preceding code is covered in (program name:


Chap8_Example8.1.py).

An important point to note is that the dtype attribute is applicable to the
series object, whereas for the DataFrame we use the dtypes attribute, which
returns a series containing the data type of each column.
We can also create a dataframe from a dict of tuple objects, which is left
for you to practice. Hint: it can be done by replacing [] with (), as in
the sketch that follows.
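A minimal sketch of the tuple-based variant (mirroring the earlier data; try it yourself before peeking) is shown as follows:

import pandas as mypd

# same data as before, but each column is supplied as a tuple instead of a list
mydict_of_tuples = {'myname': ('Alex', 'John', 'Michael', 'Tom'),
                    'myage': (35, 45, 55, 65),
                    'mycountry': ('UK', 'USA', 'Australia', 'Finland')}
mypd_dataframe = mypd.DataFrame(mydict_of_tuples)
print(mypd_dataframe)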

Pandas methods and attributes for dataframe


Most of the methods and attributes are applicable for series objects as well
as for dataframes like shape, ndim, index, values (attributes), and methods
like head(), tail(), and many more. Let us discuss methods and attributes
that will be applicable exclusively to dataframe objects:
dtypes attribute: Since a dataframe contains multiple columns, the
dtypes attribute is applicable to the dataframe object, which has no
dtype attribute. The dtypes attribute returns a series object, and its
value_counts() method returns the number of columns per data type.
columns attribute: This attribute will return the column labels of the
dataframe. It is important to note that column names are internally
represented as index objects. We can define our own column names.
axes attribute: This attribute will return the list representing the
dataframe axes: the row-wise index (axis=0) and the column-wise
index (axis=1).
Observe the code Chap8_Example8.2.py for a better understanding:
import pandas as mypd
import numpy as mynp
mydf = mypd.read_csv('student.csv')
print(mydf.dtypes)
print(mydf.dtypes.value_counts())

# column names
# dataframe creation using ndarray
myndarray = mynp.array([[1,2,3],[4,5,6],[7,8,9]])
mydf2 = mypd.DataFrame(myndarray)
print(mydf2) # with default column names 0 1 and 2
print('-'*50)
mydf3 = mypd.DataFrame(myndarray, columns=
["a","b","c"])
print(mydf3)
print(mydf3.columns) # index object containing column names
print('-'*50)
# axes
print(mydf3.axes) # list of axes objects is returned: [row index, column index]
Output:
id int64
name object
class object
mark int64
gender object
dtype: object
object 3
int64 2
Name: count, dtype: int64
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
--------------------------------------------------
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Index(['a', 'b', 'c'], dtype='object')
--------------------------------------------------
[RangeIndex(start=0, stop=3, step=1), Index(['a', 'b', 'c'], dtype='object')]

Note: The preceding code is covered in (program name:


Chap8_Example8.2.py).

In order to select a single column from the dataframe, we use
mydf.column_name or mydf['column_name'] (the bracket form is mostly
recommended because column names may contain spaces).
In order to select multiple columns from the dataframe, pass a list of
column names: mydf[['column_name1', 'column_name2']]. A short sketch of
both styles follows.
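A minimal sketch of both selection styles (illustrative column names, not one of the numbered programs) is shown as follows:

import pandas as mypd

mydf = mypd.DataFrame({'myname': ['Alex', 'John', 'Tom'],
                       'myage': [35, 45, 65],
                       'mycountry': ['UK', 'USA', 'Finland']})

# single column: returns a series object
print(mydf['myname'])
print(type(mydf['myname']))

# multiple columns: pass a list of names, which returns a dataframe
print(mydf[['myname', 'mycountry']])
print(type(mydf[['myname', 'mycountry']]))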
Now, let us discuss some methods which will be used exclusively for
dataframes:
info() method: The summary information of dataframe is generated
using info() method. It is highly recommended to use this method
before using dataframe. The output of info() method can be
customized by using the parameters as highlighted in the following
syntax and code Chap8_Example8.3.py:
info(verbose: 'bool | None' = None, buf:
'WriteBuffer[str] | None' = None, max_cols:
'int | None' = None, memory_usage: 'bool | str
| None' = None, show_counts: 'bool | None' =
None)

import pandas as mypd


# complete information of dataframe is
displayed
mydf = mypd.read_csv('student.csv', index_col
= ['id'])
print(mydf.info())
print('-'*50)
# calling info() again with default arguments
print(mydf.info())
# verbose=False, memory_usage=False
print('-'*50)
print(mydf.info(verbose=False,
memory_usage=False))
# show_counts=False
print('-'*50)
print(mydf.info(show_counts=False))
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 1 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 35 non-null object
1 class 35 non-null object
2 mark 35 non-null int64
3 gender 35 non-null object
dtypes: int64(1), object(3)
memory usage: 1.4+ KB
None
----------------------------------------------
----
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 1 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 35 non-null object
1 class 35 non-null object
2 mark 35 non-null int64
3 gender 35 non-null object
dtypes: int64(1), object(3)
memory usage: 1.4+ KB
None
----------------------------------------------
----
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 1 to 35
Columns: 4 entries, name to gender
dtypes: int64(1), object(3)
None
----------------------------------------------
----
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 1 to 35
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 name object
1 class object
2 mark int64
3 gender object
dtypes: int64(1), object(3)
memory usage: 1.4+ KB
None

Note: The preceding code is covered in (program name:


Chap8_Example8.3.py).

The code demonstrating the usage of the buf parameter is mentioned


in the code Chap8_Example8.4.py as follows:

import pandas as mypd


import io
mydf = mypd.read_csv('student.csv', index_col
= ['id'])
# mydf.info() # we already saw that the result will be displayed to the console
mybuffer = io.StringIO()
mydf.info(buf=mybuffer) # will not be displayed to the console
myvar = mybuffer.getvalue()
with open("df_information.txt", 'w', encoding='utf-8') as myfile:
    myfile.write(myvar)
print("View df_information.txt for reading the
details")
The output of df_information.txt is mentioned as follows:
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 1 to 35
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 35 non-null object
1 class 35 non-null object
2 mark 35 non-null int64
3 gender 35 non-null object
dtypes: int64(1), object(3)
memory usage: 1.4+ KB
Note: The preceding code is covered in (program name:
Chap8_Example8.4.py)

The code reads a CSV file called student.csv using the pandas
library and assigns it to a dataframe called mydf. It then uses the
info() method to get information about the dataframe, such as the
number of rows, columns, and data types. Instead of displaying the
information to the console, it saves it to a buffer called mybuffer.
The contents of the buffer are then written to a text file called
df_information.txt using the write() method. Finally, a
message is printed to the console indicating that the details can be
found in the df_information.txt file.
sample() method: Whenever there is a requirement to get random
records from the dataframe, we will use the sample() method. A
random sample of items will be returned from an axis of the object,
as shown in the following syntax and in the code
Chap8_Example8.5.py as follows:
sample(n: 'int | None' = None, frac: 'float |
None' = None, replace: 'bool_t' = False,
weights=None, random_state: 'RandomState |
None' = None, axis: 'Axis | None' = None,
ignore_index: 'bool_t' = False)

import pandas as mypd


mydf = mypd.read_csv('student.csv')
#print(help(mydf.sample))
print(mydf.sample()) # a random record will be displayed to the console
# Using 'n' parameter to get multiple sample records
print('-'*50)
print(mydf.sample(n=2)) # every time 2 different records
# Using 'frac' parameter to get a specified fraction of random records from the total
print('-'*50)
print(mydf.sample(frac=.2)) # 20% of total
Output:
id name class mark gender
28 29 Tess Played Seven 55 male
----------------------------------------------
----
id name class mark gender
31 32 Binn Rott Seven 90 female
5 6 Alex John Four 55 male
----------------------------------------------
----
id name class mark gender
10 11 Ronald Six 89 female
11 12 Recky Six 94 female
19 20 Jackly Nine 65 female
33 34 Gain Toe Seven 69 male
29 30 Reppy Red Six 79 female
22 23 Herod Eight 79 male
2 3 Arnold Three 55 male

Note: The preceding code is covered in (program name:


Chap8_Example8.5.py).

If there is a requirement to display two random columns from the


dataframe, then use the following line of code:
print(mydf.sample(n=2, axis=1))

Dropping dataframe rows with null values


The method which will be used to drop dataframe rows will be dropna()
method. Using this method, there are more number of options provided due
to multiple columns. The rows from the dataframe will be removed where
there will be at least one missing value. Refer to the following syntax and in
the code Chap8_Example8.6.py for a better understanding:
dropna(*, axis: 'Axis' = 0, how: 'AnyAll |
lib.NoDefault' = <no_default>, thresh: 'int |
lib.NoDefault' = <no_default>, subset: 'IndexLabel
| None' = None, inplace: 'bool' = False,
ignore_index: 'bool' = False)

import pandas as mypd


mydf = mypd.read_csv('sample_dropna.csv')
#print(help(mydf.dropna))
print(mydf)
print('-'*50)
mydf1 = mydf.dropna()
print(mydf1)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur
8 Tes Qry NaN NaN
9 Big John 109000.0 NaN
10 NaN NaN NaN
--------------------------------------------------
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur

Note: The preceding code is covered in (program name:


Chap8_Example8.6.py).

We can see that the row with at least one NaN value has been removed. We
can play with how parameters such that the row with any or all NaN values
can be dropped. We saw the row containing any column where NaN has
been dropped. The default value of how is any. Now, let us see the output
with the all value by looking at the following code
Chap8_Example8.7.py:
import pandas as mypd
mydf = mypd.read_csv('sample_dropna.csv')
mydf1 = mydf.dropna(how='all')
print(mydf1)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur
8 Tes Qry NaN NaN
9 Big John 109000.0 NaN
Note: The preceding code is covered in (program name:
Chap8_Example8.7.py).

We can say that the last row was removed because all the columns
contained NaN values.
Till now, we saw that dropna method returns a new dataframe. But suppose
there is a requirement to use the same dataframe and remove the rows with
missing values, then we will be using inplace parameter by setting it as
True, as shown in the code Chap8_Example8.8.py as follows:

import pandas as mypd


mydf = mypd.read_csv('sample_dropna.csv')
mydf.dropna(inplace=True)
print(mydf)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur

Note: The preceding code is covered in (program name:


Chap8_Example8.8.py).
Suppose there is a requirement to delete rows if a particular column has
missing values, then we shall be using a subset parameter, as shown as
follows:
mydf.dropna(subset=['Location'])
If the Location column contains the missing value, then we will delete that
row, as shown as follows:
mydf.dropna(subset=['Location', 'Salary'])
If either the Location or Salary column contains the missing value, then we
will only delete that row by using the following command as shown in the
code Chap8_Example8.9.py:
import pandas as mypd
mydf = mypd.read_csv('sample_dropna.csv')
mydf1 = mydf.dropna(subset=['Salary']) # if 'Salary' has a missing value, then only that row will be deleted
print(mydf1)
print('-'*50)
mydf2 = mydf.dropna(subset=['Location', 'Salary']) # if either 'Location' or 'Salary' has a missing value, then only that row will be deleted
print(mydf2)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur
9 Big John 109000.0 NaN
--------------------------------------------------
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Max Ruin 200000.0 Mumbai
2 Arnold 130000.0 Bengaluru
3 Krish Star 140000.0 Delhi
4 John Mike 150000.0 Chennai
5 Alex John 160000.0 Kolkata
6 My John Rob 10000.0 Bhopal
7 Asruid 170000.0 Raipur

Note: The preceding code is covered in (program name:


Chap8_Example8.9.py).

Till now, we have seen how to remove rows that contain missing values
because the default value for the axis parameter was zero. Rows were
dropped which contain missing values. Let us see what will happen when
axis=1. The columns will be dropped that contain the missing values. Refer
to the following code Chap8_Example8.10.py:
import pandas as mypd
mydf = mypd.read_csv('sample_axis.csv')
print(mydf)
print('-'*50)
mydf1 = mydf.dropna(axis=1) # columns containing at least one missing value will be dropped
print(mydf1)
Output:
Employee Name Salary Location
0 John Deo 100000.0 Hyderabad
1 Tes Qry NaN NaN
2 Big John 109000.0 NaN
3 Johnny NaN NaN
--------------------------------------------------
Employee Name
0 John Deo
1 Tes Qry
2 Big John
3 Johnny

Note: The preceding code is covered in (program name:


Chap8_Example8.10.py).

Arithmetic operators for dataframes


Arithmetic operations cannot be applied directly to the dataframe object
because it contains multiple columns. But arithmetic operations can be
performed by selecting a particular column of the dataframe since each
column of the dataframe is a series object, as shown in the following code
Chap8_Example8.11.py:
import pandas as mypd
mydf = mypd.read_csv('Arithmetic_Example.csv')
print(mydf)
print('-'*50)
print("Addition:------------------")
print(mydf['Salary'].add(1000)) # same as print(mydf['Salary'] + 1000); 1000 is broadcast to every value of the series
print("Subtraction:------------------")
print(mydf['Salary'].sub(1000))
print("Multiplication:------------------")
print(mydf['Salary'].mul(1000))
print("Division:------------------")
print(mydf['Salary'].div(1000))
Output:
Employee Name Salary
0 John Deo 10000
1 Donald 20000
2 Recky 40000
3 Depp 50000
--------------------------------------------------
Addition:------------------
0 11000
1 21000
2 41000
3 51000
Name: Salary, dtype: int64
Subtraction:------------------
0 9000
1 19000
2 39000
3 49000
Name: Salary, dtype: int64
Multiplication:------------------
0 10000000
1 20000000
2 40000000
3 50000000
Name: Salary, dtype: int64
Division:------------------
0 10.0
1 20.0
2 40.0
3 50.0
Name: Salary, dtype: float64

Note: The preceding code is covered in (program name:


Chap8_Example8.11.py).

New columns addition to dataframes


Let us look at the first approach, which uses mydf['column'] = value,
as demonstrated in the following code Chap8_Example8.12.py:
import pandas as mypd
mydf = mypd.read_csv('employees_salary2.csv')
print(mydf)
print('-'*50)
# if the specified column is already available, then the old value is replaced with the new value
mydf['salary'] = 20000
print(mydf)
print('-'*50)
# now trying to add a City column to the DataFrame object; all the rows under this column
# will have the value Durg
mydf['City'] = "Durg"
print(mydf)
print('-'*50)
# now we will add a Country column with a different country in each row; there are 3 rows of data
# so, we will be providing 3 different country names
mydf['Country'] = ['India','UK','Australia']
print(mydf)
print('-'*50)
# please note that if we provide only 2 values here instead of 3,
# then we will face a ValueError
mydf['Age'] = [10,20]
print(mydf)
Output:
Employee Name Salary
0 John Deo 100000.0
1 Donald NaN
2 Recky 400000.0
--------------------------------------------------
Employee Name Salary salary
0 John Deo 100000.0 20000
1 Donald NaN 20000
2 Recky 400000.0 20000
--------------------------------------------------
Employee Name Salary salary City
0 John Deo 100000.0 20000 Durg
1 Donald NaN 20000 Durg
2 Recky 400000.0 20000 Durg
--------------------------------------------------
Employee Name Salary salary City Country
0 John Deo 100000.0 20000 Durg India
1 Donald NaN 20000 Durg UK
2 Recky 400000.0 20000 Durg Australia
ValueError: Length of values (2) does not match
length of index (3)

Note: The preceding code is covered in (program name:


Chap8_Example8.12.py).

The second approach uses the insert() method:
mydf.insert(1, 'column', value)
The column can be inserted at our required location. Refer to the following
syntax and in the following code Chap8_Example8.13.py:
insert(loc: 'int', column: 'Hashable', value:
'Scalar | AnyArrayLike', allow_duplicates: 'bool |
lib.NoDefault' = <no_default>)

import pandas as mypd


mydf = mypd.read_csv('employees_salary2.csv')
# print(help(mydf.insert))
print(mydf)
print('-'*50)
mydf.insert(1,'City',['Durg','Raipur','Bhilai'])
print(mydf)
print('-'*50)
# ValueError
mydf.insert(1,'City',['Hyderabad','Bhopal','Indore']) # if allow_duplicates=True, then the new column City will be added
print(mydf)
Output:
Employee Name Salary
0 John Deo 100000.0
1 Donald NaN
2 Recky 400000.0
--------------------------------------------------
Employee Name City Salary
0 John Deo Durg 100000.0
1 Donald Raipur NaN
2 Recky Bhilai 400000.0
ValueError: cannot insert City, already exists

Note: The preceding code is covered in (program name:


Chap8_Example8.13.py).

Usage of fillna() method to the dataframe


When a dataframe or series contains missing or NaN values, the fillna
method in pandas can be used to replace them with a pre-determined value
or a calculated value. It aids in handling datasets with missing data. Refer to
the following syntax and in the following code
Chap8_Example8.14.py:
fillna(value: 'Hashable | Mapping | Series |
DataFrame | None' = None, *, method:
'FillnaOptions | None' = None, axis: 'Axis | None'
= None, inplace: 'bool_t' = False, limit: 'int |
None' = None, downcast: 'dict | None |
lib.NoDefault' = <no_default>)

import pandas as mypd


mydf = mypd.read_csv('sample_fillna.csv')
# print(help(mydf.fillna))
print(mydf)
print('-'*50)
mydf1 = mydf.fillna(0) # replace/fill missing values with 0
print(mydf1)
print('-'*50)
# But replacing missing values with 0 is not meaningful
# Recommended to provide meaningful values at missing positions
# Based on column data type, we can provide different fill values for different columns
# Now, we will be replacing every missing value of the Salary column with 20000
mydf['Salary'].fillna(20000, inplace=True)
print(mydf)
print('-'*50)
# Fill Employee Name with 'Dummy' value
mydf['Employee Name'].fillna('Dummy',
inplace=True)
print(mydf)
Output:
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
2 Tes Qry NaN NaN
3 Big John 109000.0 NaN
4 NaN NaN NaN
--------------------------------------------------
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
2 Tes Qry 0.0 0
3 Big John 109000.0 0
4 0 0.0 0
--------------------------------------------------
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
2 Tes Qry 20000.0 NaN
3 Big John 109000.0 NaN
4 NaN 20000.0 NaN
--------------------------------------------------
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
2 Tes Qry 20000.0 NaN
3 Big John 109000.0 NaN
4 Dummy 20000.0 NaN

Note: The preceding code is covered in (program name:


Chap8_Example8.14.py).

If we view the sample_fillna.csv file, we will find that the salary column
values are of int type. But due to the presence of NaN value in some rows,
we are getting the output of the column as float data type. So, memory will
be wasted as the int data type takes less memory when compared to the
float data type. So, we will convert the particular column to the required
data type with the help of the astype() method. The dataframe column
types will be converted using astype() method, thus giving the advantage
of memory utilization. So, first, use dropna() method and then convert it
into int type using astype() method, as shown in the code
Chap8_Example8.15.py as follows:
import pandas as mypd
mydf = mypd.read_csv('sample_fillna.csv')
mydf.dropna(inplace=True)
print(mydf)
print(mydf.info())
print('-'*50)
mydf['Salary']=mydf['Salary'].astype('int')
print(mydf) # int type
print(mydf.info())
Output:
Employee Name Salary Location
0 My John Rob 10000.0 Bhopal
1 Asruid 170000.0 Raipur
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 2 non-null object
1 Salary 2 non-null float64
2 Location 2 non-null object
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes
None
--------------------------------------------------
Employee Name Salary Location
0 My John Rob 10000 Bhopal
1 Asruid 170000 Raipur
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 2 non-null object
1 Salary 2 non-null int32
2 Location 2 non-null object
dtypes: int32(1), object(2)
memory usage: 56.0+ bytes
None

Note: The preceding code is covered in (program name:


Chap8_Example8.15.py).

From the current example, we can see that memory usage decreased when
we changed from float to int type, that is, from 64.0 + bytes to 56.0 + bytes.
Now, there may be a scenario in which a particular column contains many
repeated values. If we create a separate copy of the same value every time,
memory usage increases and both performance and memory utilization suffer.
For example, consider the following CSV file category_example.csv. We
will find that under the City column that contains multiple employee
names, we have city names such as Hyderabad, Bhopal, and Indore which
are repeated. So, if a copy is created every time, then memory usage will be
more. So, the better option will be to create a copy once and reuse the same
value multiple times where it is required instead of creating the same value
multiple times which will improve the memory utilization and performance.
This approach is possible only if we use the category type, as shown in the
code Chap8_Example8.16.py as follows:
import pandas as mypd
mydf = mypd.read_csv('category_example.csv')
print(mydf.info())
print('-'*50)
mydf['City'] = mydf['City'].astype('category')
print(mydf.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 648 non-null object
1 Salary 648 non-null int64
2 City 648 non-null object
dtypes: int64(1), object(2)
memory usage: 15.3+ KB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 648 entries, 0 to 647
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee Name 648 non-null object
1 Salary 648 non-null int64
2 City 648 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 11.0+ KB
None

Note: The preceding code is covered in (program name:


Chap8_Example8.16.py).

We can see that there are 648 entries in the file category_example.csv and
the memory usage has been improved from 15.3 +KB to 11.0 + KB. Also,
we can observe that the City column type is of the category instead of
string type(object). Pandas-specific data type is of category type.

Sorting values of the dataframe


In order to sort the values of the dataframe object, we will be using the
sort_values() method. The by parameter can be used to specify a column,
based on which sorting has to be done, as shown in the following syntax
and code Chap8_Example8.17.py:
sort_values(by: 'IndexLabel', *, axis: 'Axis' = 0,
ascending: 'bool | list[bool] | tuple[bool, ...]'
= True, inplace: 'bool' = False, kind: 'SortKind'
= 'quicksort', na_position: 'str' = 'last',
ignore_index: 'bool' = False, key: 'ValueKeyFunc |
None' = None)

import pandas as mypd


mydf = mypd.read_csv('sorting_example.csv')
# print(help(mydf.sort_values))
print(mydf)
print('-'*50)
mydf.sort_values(by = 'Salary', inplace=True, ascending=False) # by default ascending order, so this sorts in descending order
print(mydf)
print('-'*50)
# sorting values based on reverse of alphabetical order of City column
mydf2 = mydf.sort_values(by = "City", ascending=False)
print(mydf2)
print('-'*50)
# By default, NaN comes at the last position. But suppose we want to place NaNs first, we use na_position = 'first'
mydf3 = mydf.sort_values(by = "City", ascending=False, na_position='first')
print(mydf3)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Mintoo 140000.0 Raipur
4 Ankit 10000.0 NaN
5 Ashwin 170000.0 Lara
6 Rohit NaN Manuguru
7 Khali 109000.0 Bilaspur
8 Neeharika 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
8 Neeharika 400000.0 Mahasamund
1 Saurabh 200000.0 Hyderabad
5 Ashwin 170000.0 Lara
3 Mintoo 140000.0 Raipur
2 Divya 130000.0 Durg
7 Khali 109000.0 Bilaspur
0 Rahul 100000.0 Bhopal
4 Ankit 10000.0 NaN
6 Rohit NaN Manuguru
--------------------------------------------------
Employee Name Salary City
3 Mintoo 140000.0 Raipur
6 Rohit NaN Manuguru
8 Neeharika 400000.0 Mahasamund
5 Ashwin 170000.0 Lara
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
7 Khali 109000.0 Bilaspur
0 Rahul 100000.0 Bhopal
4 Ankit 10000.0 NaN
--------------------------------------------------
Employee Name Salary City
4 Ankit 10000.0 NaN
3 Mintoo 140000.0 Raipur
6 Rohit NaN Manuguru
8 Neeharika 400000.0 Mahasamund
5 Ashwin 170000.0 Lara
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
7 Khali 109000.0 Bilaspur
0 Rahul 100000.0 Bhopal

Note: The preceding code is covered in (program name: Chap8_Example8.17.py).
Sorting dataframe by multiple columns
Now, suppose several rows share the same value in the Employee Name column
and we want to break those ties by the descending order of Salary. In this
case, we need to provide a list of columns to the by parameter. The employee
records will first be sorted by Employee Name in descending alphabetical
order; if multiple employees have the same Employee Name, those rows are
then sorted by Salary in descending order. The CSV file is
sorting_example2.csv, as shown in the following code Chap8_Example8.18.py:
import pandas as mypd
mydf = mypd.read_csv('sorting_example2.csv')
print(mydf)
print('-'*50)
mydf.sort_values(by = ['Employee Name','Salary'], inplace=True, ascending=False) # by default ascending order, so this sorts in descending order
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 140000.0 Raipur
4 Ankit 10000.0 NaN
5 Saurabh 170000.0 Lara
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
7 Saurabh 400000.0 Mahasamund
1 Saurabh 200000.0 Hyderabad
5 Saurabh 170000.0 Lara
3 Saurabh 140000.0 Raipur
2 Divya 130000.0 Durg
0 Rahul 100000.0 Bhopal
4 Ankit 10000.0 NaN
6 Rohit NaN Manuguru

Note: The preceding code is covered in (program name: Chap8_Example8.18.py).

Now, suppose the requirement is that the employee records should first be
sorted by Employee Name in ascending alphabetical order, and if multiple
employees share the same Employee Name, those rows should then be sorted by
Salary in descending order. To do this, pass a list to the ascending
parameter with True for the first column and False for the second, as in the code
Chap8_Example8.19.py as follows:
import pandas as mypd
mydf = mypd.read_csv('sorting_example2.csv')
print(mydf)
print('-'*50)
mydf.sort_values(by = ['Employee Name','Salary'],
inplace=True , ascending=[True,False])
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 140000.0 Raipur
4 Ankit 10000.0 NaN
5 Saurabh 170000.0 Lara
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
4 Ankit 10000.0 NaN
2 Divya 130000.0 Durg
0 Rahul 100000.0 Bhopal
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
1 Saurabh 200000.0 Hyderabad
5 Saurabh 170000.0 Lara
3 Saurabh 140000.0 Raipur
Note: The preceding code is covered in (program name:
Chap8_Example8.19.py).
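The sort_values() syntax shown earlier also lists a key parameter, which accepts a function that receives each column as a series before sorting; this is handy for things like case-insensitive ordering. The following is a minimal sketch on the same sorting_example.csv file (the key parameter is available in pandas 1.1 and later):
import pandas as mypd
mydf = mypd.read_csv('sorting_example.csv')
# the key function receives the whole City column and must return a series of the same length
# here all values are already capitalized, so the order matches a plain sort, but the pattern matters for mixed-case data
mydf2 = mydf.sort_values(by='City', key=lambda mycol: mycol.str.lower())
print(mydf2)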

Sorting dataframe based on index


In order to sort the dataframe based on an index, we will be using the
sort_index() method, as shown in the following syntax and in the code
Chap8_Example8.20.py as follows:
sort_index(*, axis: 'Axis' = 0, level: 'IndexLabel
| None' = None, ascending: 'bool | Sequence[bool]'
= True, inplace: 'bool' = False, kind: 'SortKind'
= 'quicksort', na_position: 'NaPosition' = 'last',
sort_remaining: 'bool' = True, ignore_index:
'bool' = False, key: 'IndexKeyFunc | None' = None)
Here, first, we will be performing sorting based on the Employee Name in
ascending order, and then based on the index in descending order. The CSV
file is sorting_example2.csv, as shown as follows:
import pandas as mypd
mydf = mypd.read_csv('sorting_example2.csv')
# print(help(mydf.sort_index))
print(mydf)
print('-'*50)
# sorting based on ascending order of Employee Name
mydf.sort_values(by ='Employee Name', inplace=True)
print(mydf)
print('-'*50)
# sorting based on descending order of index
mydf.sort_index(ascending=False, inplace=True )
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 140000.0 Raipur
4 Ankit 10000.0 NaN
5 Saurabh 170000.0 Lara
6 Rohit NaN Manuguru
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
4 Ankit 10000.0 NaN
2 Divya 130000.0 Durg
0 Rahul 100000.0 Bhopal
6 Rohit NaN Manuguru
1 Saurabh 200000.0 Hyderabad
3 Saurabh 140000.0 Raipur
5 Saurabh 170000.0 Lara
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City
7 Saurabh 400000.0 Mahasamund
6 Rohit NaN Manuguru
5 Saurabh 170000.0 Lara
4 Ankit 10000.0 NaN
3 Saurabh 140000.0 Raipur
2 Divya 130000.0 Durg
1 Saurabh 200000.0 Hyderabad
0 Rahul 100000.0 Bhopal

Note: The preceding code is covered in (program name: Chap8_Example8.20.py).
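It is worth noting that sort_index() can also sort the column labels instead of the rows by passing axis=1. The following minimal sketch uses the same sorting_example2.csv file:
import pandas as mypd
mydf = mypd.read_csv('sorting_example2.csv')
# axis=1 sorts the column labels alphabetically: City, Employee Name, Salary
print(mydf.sort_index(axis=1))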

Ranking series of values by using the rank() method


Numerical data ranks (one through n) are computed along the axis using
the rank() method. If two values are equal, they share the average of their
ranks by default, as shown in the following syntax
and in the code Chap8_Example8.21.py as follows:
rank(axis: 'Axis' = 0, method: "Literal['average',
'min', 'max', 'first', 'dense']" = 'average',
numeric_only: 'bool_t' = False, na_option:
"Literal['keep', 'top', 'bottom']" = 'keep',
ascending: 'bool_t' = True, pct: 'bool_t' = False)

import pandas as mypd


mydf = mypd.read_csv('rank_eg.csv').dropna()
#print(help(mydf['Salary'].rank))
print(mydf['Salary'])
print('-'*50)
print(mydf['Salary'].rank()) # least salary has rank 1.0 in ascending order
print('-'*50)
print(mydf['Salary'].rank(ascending=False)) # highest salary has rank 1.0
Output:
0 100000.0
1 200000.0
2 130000.0
3 13000.0
5 200000.0
7 400000.0
Name: Salary, dtype: float64
--------------------------------------------------
0 2.0
1 4.5
2 3.0
3 1.0
5 4.5
7 6.0
Name: Salary, dtype: float64
--------------------------------------------------
0 5.0
1 2.5
2 4.0
3 6.0
5 2.5
7 1.0
Name: Salary, dtype: float64

Note: The preceding code is covered in (program name: Chap8_Example8.21.py).
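The preceding syntax also lists other tie-breaking strategies through the method parameter. As a minimal sketch on the same rank_eg.csv file, 'min' gives every tied value the lowest rank of the group, while 'dense' does the same but keeps the ranks consecutive:
import pandas as mypd
mydf = mypd.read_csv('rank_eg.csv').dropna()
print(mydf['Salary'].rank(method='min'))    # tied salaries share the lowest rank of the group
print(mydf['Salary'].rank(method='dense'))  # like 'min', but no gaps appear after the tie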

We can also store the ranking in a separate SalaryRank column and then
display it, as demonstrated in the following code
Chap8_Example8.22.py:
import pandas as mypd
mydf = mypd.read_csv('rank_eg.csv').dropna()
print(mydf)
print('-'*50)
mydf['SalaryRank'] = mydf['Salary'].rank(ascending=False) # Highest salary has rank 1.0
mydf.sort_values(by='SalaryRank',inplace=True)
print(mydf)
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 130000.0 Durg
3 Saurabh 13000.0 Raipur
5 Saurabh 200000.0 Lara
7 Saurabh 400000.0 Mahasamund
--------------------------------------------------
Employee Name Salary City SalaryRank
7 Saurabh 400000.0 Mahasamund 1.0
1 Saurabh 200000.0 Hyderabad 2.5
5 Saurabh 200000.0 Lara 2.5
2 Divya 130000.0 Durg 4.0
0 Rahul 100000.0 Bhopal 5.0
3 Saurabh 13000.0 Raipur 6.0

Note: The preceding code is covered in (program name: Chap8_Example8.22.py).

Filter data from dataframe


Filtering data from a pandas dataframe is essential for data analysis because
it allows the user to do the following tasks:
Extract specific subsets of data that meet certain conditions or
criteria, enabling more focused and effective analysis.
Clean the data by removing irrelevant or erroneous data, ensuring
accuracy in subsequent analysis.
Segment the data into different groups or categories, enabling the
examination of different patterns and trends.
Pre-process the data by reducing noise, handling missing values, and
removing outliers.
Create visual representations of the data based on specific subsets,
facilitating the communication of meaningful insights.
The data can be filtered from the dataframe based on some conditions, that
is, a subset of rows can be selected based on some selection criteria.
Boolean masking can be used internally for this filtration purpose, as
depicted in the code Chap8_Example8.23.py as follows:
import pandas as mypd
mydf = mypd.read_csv('filter_eg.csv').dropna()
print(mydf) # displaying the full dataframe
print('-'*50)
print(mydf['City']) # selecting only the City column
print('-'*50)
# Displaying True if City is Durg, else displaying False
print(mydf['City'] == 'Durg') # a boolean series is returned for boolean masking
print('-'*50)
# selection of all rows where True is returned
print(mydf[mydf['City'] == 'Durg'])
# we can even use the > , >= , < , <= operators also
print("-"*50)
print(mydf[mydf['Salary'] >= 130000])
# filtering data based on multiple conditions using the & and | operators
print("-"*50)
# selecting City as Durg and salary >= 120000
mycond1 = mydf['City'] == "Durg"
mycond2 = mydf['Salary'] >= 120000
print(mydf[mycond1 & mycond2])
# selecting employees whose City is Durg or whose salary >= 120000
print("-"*50)
print(mydf[mycond1 | mycond2])
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
5 Saurabh 200000.0 Lara
--------------------------------------------------
0 Bhopal
1 Hyderabad
2 Durg
3 Raipur
4 Durg
5 Lara
Name: City, dtype: object
--------------------------------------------------
0 False
1 False
2 True
3 False
4 True
5 False
Name: City, dtype: bool
--------------------------------------------------
Employee Name Salary City
2 Divya 140000.0 Durg
4 Ankit 110000.0 Durg
--------------------------------------------------
Employee Name Salary City
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
5 Saurabh 200000.0 Lara
--------------------------------------------------
Employee Name Salary City
2 Divya 140000.0 Durg
--------------------------------------------------
Employee Name Salary City
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
5 Saurabh 200000.0 Lara

Note: The preceding code is covered in (program name: Chap8_Example8.23.py).

Inclusion check in dataframe


The pandas isin() method checks if each element in a dataframe or series
is contained in a list of values. It returns a series of Boolean values
indicating whether each element is present in the list. Refer to the following
code Chap8_Example8.24.py as follows:
import pandas as mypd
mydf = mypd.read_csv('filter_eg.csv').dropna()
print(mydf[mydf['City'].isin(['Hyderabad','Durg','Lara'])])
Output:
Employee Name Salary City
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
4 Ankit 110000.0 Durg
5 Saurabh 200000.0 Lara

Note: The preceding code is covered in (program name: Chap8_Example8.24.py).
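The complementary selection, that is, the rows whose value is not in the list, can be obtained by negating the Boolean mask with the ~ operator. A minimal sketch on the same filter_eg.csv file is shown below:
import pandas as mypd
mydf = mypd.read_csv('filter_eg.csv').dropna()
# ~ inverts the mask, so only rows whose City is NOT in the list are kept
print(mydf[~mydf['City'].isin(['Hyderabad','Durg','Lara'])])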

Now, let us see a scenario where employee names from one CSV file are
selected based on the Air Quality Index (AQI) of their cities, which is
recorded in another CSV file. We will select the employees who reside in
cities whose AQI > 105, as shown in the following code Chap8_Example8.25.py:
import pandas as mypd
mydf = mypd.read_csv('filter_eg.csv').dropna()
print(mydf)# displaying 1st dataframe
print('-'*50)
mydf2 = mypd.read_csv('AQI_eg.csv')
print(mydf2)# displaying 2nd dataframe
print('-'*50)
# selecting city name whose AQI > 105
myaqi = mydf2['AQI']>105
print(mydf2[myaqi])
print('-'*50)
print(mydf2[myaqi]['City'])
myaqicity = mydf2[myaqi]['City']
print('-'*50)
# selecting employee names who are residing in cities whose AQI > 105
myemp = mydf['City'].isin(myaqicity)
print(mydf[myemp]['Employee Name']) # getting the employee names only
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Saurabh 200000.0 Hyderabad
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
5 Saurabh 200000.0 Lara
--------------------------------------------------
City AQI
0 Bhopal 100
1 Hyderabad 150
2 Durg 90
3 Raipur 120
4 Lara 110
--------------------------------------------------
City AQI
1 Hyderabad 150
3 Raipur 120
4 Lara 110
--------------------------------------------------
1 Hyderabad
3 Raipur
4 Lara
Name: City, dtype: object
--------------------------------------------------
1 Saurabh
3 Saurabh
5 Saurabh
Name: Employee Name, dtype: object

Note: The preceding code is covered in (program name: Chap8_Example8.25.py).

The between() method checks for inclusion within a range of values. Refer
to the following syntax and the following code
Chap8_Example8.26.py for better clarity:
between(left, right, inclusive: "Literal['both',
'neither', 'left', 'right']" = 'both')

import pandas as mypd


mydf = mypd.read_csv('filter_eg.csv').dropna()
# print(help(mydf['City'].between))
# select range of values between 100000 and 160000 without using the between method
mycond1 = mydf['Salary'] >= 100000
mycond2 = mydf['Salary'] <= 160000
print(mydf[mycond1 & mycond2])
print('-'*50)
# select range of values between 100000 and 150000 using the between method
print(mydf[mydf['Salary'].between(100000,150000)]) # by default inclusive = 'both'
print('-'*50)
# select range of values >=100000 and <150000 (150000 is exclusive) using the between method
print(mydf[mydf['Salary'].between(100000,150000, inclusive='left')])
print('-'*50)
# select range of values >100000 and <=150000 (100000 is exclusive) using the between method
print(mydf[mydf['Salary'].between(100000,150000, inclusive='right')])
print('-'*50)
# select range of values >100000 and <150000 (both 100000 and 150000 are exclusive) using the between method
print(mydf[mydf['Salary'].between(100000,150000, inclusive='neither')])
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
--------------------------------------------------
Employee Name Salary City
0 Rahul 100000.0 Bhopal
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
--------------------------------------------------
Employee Name Salary City
0 Rahul 100000.0 Bhopal
2 Divya 140000.0 Durg
4 Ankit 110000.0 Durg
--------------------------------------------------
Employee Name Salary City
2 Divya 140000.0 Durg
3 Saurabh 150000.0 Raipur
4 Ankit 110000.0 Durg
--------------------------------------------------
Employee Name Salary City
2 Divya 140000.0 Durg
4 Ankit 110000.0 Durg

Note: The preceding code is covered in (program name: Chap8_Example8.26.py).

Usage of isnull and notnull method


The methods isnull and notnull are complementary methods that will
check the missing and non-missing values. The isnull() method in pandas
returns a boolean mask indicating which elements are missing or null in a
series or dataframe object. It returns True for each missing element and
False otherwise.

The notnull() method is the opposite of isnull(): it returns True for each
element that is not missing and False otherwise. Refer to the following
code Chap8_Example8.27.py as follows:
import pandas as mypd
mydf = mypd.read_csv('isnull_notnull_eg.csv')
print(mydf)
print('-'*50)
# Applying on series object ---------------------------------------
print(mydf['Employee Name'].isnull()) # will return a Boolean series
print('-'*50)
print(mydf[mydf['Employee Name'].isnull()]) # selecting rows with True
print('-'*50)
print(mydf['Employee Name'].notnull()) # will return a Boolean series
print('-'*50)
print(mydf[mydf['Employee Name'].notnull()]) # selecting rows with True
# Applying on dataframe object -------------------------------------
print(mydf.isnull())
print('-'*50)
print(mydf.notnull())
Output:
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Vikas 200000.0 Lara
2 Rohit NaN Durg
3 Saurabh 400000.0 NaN
4 NaN 310000.0 Indore
--------------------------------------------------
0 False
1 False
2 False
3 False
4 True
Name: Employee Name, dtype: bool
--------------------------------------------------
Employee Name Salary City
4 NaN 310000.0 Indore
--------------------------------------------------
0 True
1 True
2 True
3 True
4 False
Name: Employee Name, dtype: bool
--------------------------------------------------
Employee Name Salary City
0 Rahul 100000.0 Bhopal
1 Vikas 200000.0 Lara
2 Rohit NaN Durg
3 Saurabh 400000.0 NaN
Employee Name Salary City
0 False False False
1 False False False
2 False True False
3 False False True
4 True False False
--------------------------------------------------
Employee Name Salary City
0 True True True
1 True True True
2 True False True
3 True True False
4 False True True

Note: The preceding code is covered in (program name: Chap8_Example8.27.py).
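A common companion to these methods is counting the missing values: since True is treated as 1 when summed, chaining sum() onto the Boolean mask gives the missing count. The following minimal sketch uses the same isnull_notnull_eg.csv file:
import pandas as mypd
mydf = mypd.read_csv('isnull_notnull_eg.csv')
print(mydf.isnull().sum())        # number of missing values per column
print(mydf.isnull().sum().sum())  # total number of missing cells in the dataframe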

Usage of duplicated method


Duplicate rows can be checked in series or dataframe objects using the
duplicated() method. A Boolean series is returned in which True
indicates a duplicated row. Use the following syntax and
observe the code Chap8_Example8.28.py as follows:
duplicated(keep: 'DropKeep' = 'first')

import pandas as mypd


mydf = mypd.read_csv('duplicated_eg.csv')
#print(help(mydf['Employee Name'].duplicated))
print(mydf)
print('-'*50)
# here duplicates will be True except for the first occurrence
print(mydf['Employee Name'].duplicated()) # default value of keep = 'first'
print('-'*50)
# here duplicates will be True except for the last occurrence
print(mydf['Employee Name'].duplicated(keep='last'))
print('-'*50)
# here all duplicates will be True
print(mydf['Employee Name'].duplicated(keep=False))
print('-'*50)
# select all rows without Employee Name duplicates
# and consider only the first occurrence of the Employee Name if employees have the same name
mycond1 = ~mydf['Employee Name'].duplicated()
print(mydf[mycond1])
print('-'*50)
# select all rows with Employee Name having unique values, i.e., where names are not repeated
mycond2 = ~mydf['Employee Name'].duplicated(keep = False)
print(mydf[mycond2])
Output:
Employee Name Salary City
0 Vikas 200000.0 Lara
1 Saurabh 400000.0 NaN
2 Vikas 210000.0 NaN
3 Mintoo 110000.0 Jhansi
4 Vikas NaN Anand
--------------------------------------------------
0 False
1 False
2 True
3 False
4 True
Name: Employee Name, dtype: bool
--------------------------------------------------
0 True
1 False
2 True
3 False
4 False
Name: Employee Name, dtype: bool
--------------------------------------------------
0 True
1 False
2 True
3 False
4 True
Name: Employee Name, dtype: bool
--------------------------------------------------
Employee Name Salary City
0 Vikas 200000.0 Lara
1 Saurabh 400000.0 NaN
3 Mintoo 110000.0 Jhansi
--------------------------------------------------
Employee Name Salary City
1 Saurabh 400000.0 NaN
3 Mintoo 110000.0 Jhansi

Note: The preceding code is covered in (program name: Chap8_Example8.28.py).

Usage of drop_duplicates method in dataframe


drop_duplicates() is a method in the pandas library of Python that is used
to remove duplicate rows from a dataframe object. By default, this method
will drop a row only if all of its column values are duplicated. The
following syntax depicts the method; observe the code
Chap8_Example8.29.py as follows:
drop_duplicates(subset: 'Hashable |
Sequence[Hashable] | None' = None, *, keep:
'DropKeep' = 'first', inplace: 'bool' = False,
ignore_index: 'bool' = False)

import pandas as mypd


mydf = mypd.read_csv('duplicated_eg.csv')
# print(help(mydf.drop_duplicates))
mydf.drop_duplicates(inplace=True) # considers all the columns, not any particular one
print(mydf)
print('-'*50)
# to delete records where the Employee Name column is duplicated.
# Except for the first occurrence, every row with a duplicate Employee Name will be deleted
mydf.drop_duplicates(subset=['Employee Name'], inplace=True)
print(mydf)
Output:
Employee Name Salary City
0 Vikas 200000.0 Lara
1 Saurabh 400000.0 NaN
2 Vikas 210000.0 NaN
3 Mintoo 110000.0 Jhansi
4 Vikas NaN Anand
--------------------------------------------------
Employee Name Salary City
0 Vikas 200000.0 Lara
1 Saurabh 400000.0 NaN
3 Mintoo 110000.0 Jhansi
Note: The preceding code is covered in (program name:
Chap8_Example8.29.py).

To remove all rows with duplicate Employee Names, including the first
occurrence, use the following method:
mydf.drop_duplicates(subset=['Employee Name'], keep=False, inplace=True)
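As a minimal sketch on the same duplicated_eg.csv file, only the rows whose Employee Name occurs exactly once remain after this call:
import pandas as mypd
mydf = mypd.read_csv('duplicated_eg.csv')
mydf.drop_duplicates(subset=['Employee Name'], keep=False, inplace=True)
print(mydf)  # every Vikas row is removed; only the Saurabh and Mintoo rows remain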

Usage of unique and nunique methods


In order to identify unique values and to count them, we can use the
unique() and nunique() methods. Refer to the following:
myseries.unique(): This method will return an array of the unique values of
the series object.
myseries.nunique(): This method will return the number of unique values,
ignoring NaN.
myseries.nunique(dropna=False): This method will return the number of
unique values, without ignoring NaN.
Observe the following code Chap8_Example8.30.py for viewing the
unique and nunique methods:

import pandas as mypd


mydf = mypd.read_csv('unique_eg.csv')
# Displaying list of unique values
print(mydf['Employee Name'].unique())
# Displaying no. of unique values excluding NaN
print(mydf['Employee Name'].nunique())
# Displaying no. of unique values including NaN
print(mydf['Employee Name'].nunique(dropna=False))
Output:
['Vikas' 'Saurabh' 'Mintoo' nan]
3
4

Note: The preceding code is covered in (program name: Chap8_Example8.30.py).

Conclusion
In this chapter, we took a comprehensive journey through the intricate
world of pandas dataframes in Python, equipping readers with a diverse
skill set to effectively analyze and manipulate data. Transitioning
seamlessly into the field of pandas dataframes, we provided readers with
invaluable insights into dataframe construction and the exclusive methods
and attributes that come into play. Handling missing data was tackled
comprehensively, and readers were equipped with the skills needed for
arithmetic operations on dataframes and the dynamic addition of new
columns. The fillna method was introduced as a vital tool for addressing
missing values. Sorting, ranking, and filtering data within dataframes were
explored in-depth, and various techniques for data validation, including
checking for data inclusion, identifying missing values, and handling
duplicates were covered extensively. With this robust foundation, readers
are now well-prepared to tackle data analysis and different approaches to
manipulation tasks with confidence, armed with the power of pandas
dataframes in Python.
In the next chapter, we will explore advanced data manipulation techniques
in pandas. Topics covered will include setting customized columns as
indices, selecting rows using loc[] and index positions, and renaming index
and column names. We will also delve into filtering data using the query
method, handling spaces in column names, and utilizing operators like in
and not in. The chapter will cover data aggregation, text data handling,
modifying data, and advanced functions like apply(), nlargest(),
nsmallest(), and where() methods for powerful data analysis.

Points to remember
Pandas library provides two primary data structures: Series (1D) and
DataFrame (2D).
Access columns in a dataframe using square brackets:
mydf['column_name'].

Use dot notation for column access: mydf.column_name if column


names do not have spaces or special characters.
Rename columns using the rename() method or by directly assigning
them to mydf.columns.
Filter rows based on conditions by using Boolean indexing.
Handle missing data with methods like dropna() and fillna().
Sort data using sort_values() and reset index with reset_index().
Rank data with the rank() method.
Handle duplicates with duplicated() and drop_duplicates()
methods.
Check for null values with isnull() and non-null values with
notnull().

Use the unique() and nunique() functions to find unique values and
their counts.
Pandas dataframe is a table of data with rows and columns.
Pandas dataframe can be created from many different types of data,
such as lists, dictionaries, numpy arrays, and CSV files.
Pandas dataframe has many functions for accessing and manipulating
data, such as slicing, filtering, and indexing.
Pandas dataframe can perform basic arithmetic operations, such as
addition, subtraction, multiplication, and division.
Pandas dataframe can be manipulated using a variety of methods,
such as sorting, ranking, and aggregating.

Questions
1. What is the purpose of the pandas dataframe constructor?
2. What are some methods and attributes that are exclusively applicable
to dataframes?
3. How do you drop rows with missing or null values in a dataframe?
4. What arithmetic operators can be used with dataframes?
5. How to add new columns to a dataframe?
6. What is the fillna method used for in a dataframe?
7. How to sort the values of a dataframe?
8. How to sort a dataframe based on multiple columns?
9. What is the method for sorting a dataframe based on its index?
10. How to rank a series of values in a dataframe using the rank method?
11. How to filter data from a dataframe?
12. What are the isnull and notnull methods used for?
13. How do you use the duplicated method and what does the
drop_duplicates do?

14. How to find unique values using the unique and nunique methods?
15. Explain inclusion check in a pandas dataframe?
CHAPTER 9Advanced Dataframe Filtering
Techniques
Introduction

In the previous chapter, we learned about pandas dataframes, including their
support for handling missing data, which makes them particularly valuable
for diverse and complex data analysis tasks. Now, we shall explore the
skills necessary to navigate and manipulate pandas dataframes effectively,
which will empower us to extract valuable insights from complex datasets.

Let us continue learning on pandas dataframe.

Structure

In this chapter, we will discuss the following topics:

Approaches of setting customized columns as index


Rows selection by index label with loc[] indexer
Accessing rows of dataframe by using index position
Renaming index and column names in dataframe
Rows and columns deletion from the dataframe
Filter dataframe using query method
in and not in operators using query method
Handling spaces in column names with query
Usage of apply() method to pandas dataframe
Using nlargest() and nsmallest() in pandas dataframe
Filtering dataframe with where() method
Handling of text data in pandas
Filtering dataframe rows with string methods
Modifying data in pandas dataframe
Example of data aggregation in pandas dataframe

Objectives
By the end of this chapter, the reader will learn various methods for
renaming index labels and column names, as well as efficient ways to
delete rows and columns from dataframes. The chapter will introduce the
powerful query method for data filtering and explore additional advanced
techniques such as using apply to manipulate dataframe data and finding
the largest and smallest values with nlargest and nsmallest. Text data
handling within pandas will be addressed, demonstrating how to filter
dataframe rows with string methods. Practical examples of data
modification in dataframes will provide readers with hands-on experience
in real-world data manipulation. Various practical examples will be
explained with output and comments wherever necessary. Finally, we can
say that readers will have a solid foundation in working with pandas
dataframes for data analysis and manipulation in Python.

Approaches of setting customized columns as index

Setting customized columns as the index in a dataframe allows for more


efficient data manipulation and retrieval. This approach enhances the
capability to handle and analyze data by leveraging the unique attributes of
specific columns.

Refer to the following approaches for better understanding:

We have seen till now that a default numerical index in ascending order
was generated in the output. However, it can be customized by setting a
particular column as the index, using either of the following two
approaches:
Approach one: By using the index_col parameter of the read_csv method.
Approach two: By using the set_index method, which sets the dataframe
index using one or more existing columns.

The syntax is mentioned below. Both approaches are mentioned in the


following code Chap9_Example9.1.py for better understanding:

set_index(keys, *, drop: 'bool' = True, append: 'bool' = False, inplace: 'bool' = False, verify_integrity: 'bool' = False)
import pandas as mypd

mydf = mypd.read_csv('unique_eg.csv')

print(mydf)

print('-'*50)

# Approach-1

mydf1 = mypd.read_csv('unique_eg.csv', index_col = "Employee Name")

print(mydf1)

print('-'*50)

# Approach-2

# print(help(mydf1.set_index))

mydf.set_index(keys='Employee Name', inplace=True)

print(mydf)

Output:

Employee Name Salary City

0 Vikas 200000.0 Lara

1 Saurabh 400000.0 NaN

2 Vikas 210000.0 NaN

3 Mintoo 110000.0 Jhansi

4 Vikas NaN Anand

5 NaN 130000.0 Aligarh


--------------------------------------------------

Employee Name Salary City

Vikas 200000.0 Lara

Saurabh 400000.0 NaN

Vikas 210000.0 NaN

Mintoo 110000.0 Jhansi

Vikas NaN Anand

NaN 130000.0 Aligarh

--------------------------------------------------

Employee Name Salary City

Vikas 200000.0 Lara

Saurabh 400000.0 NaN

Vikas 210000.0 NaN

Mintoo 110000.0 Jhansi

Vikas NaN Anand

NaN 130000.0 Aligarh

Note: The preceding code is covered in (program name: Chap9_Example9.1.py).
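The set_index() syntax above also shows a drop parameter. By default, the chosen column is removed from the data once it becomes the index; passing drop=False keeps it as an ordinary column as well. A minimal sketch on the same unique_eg.csv file is shown below:

import pandas as mypd

mydf = mypd.read_csv('unique_eg.csv')

# drop=False keeps Employee Name both as the index and as a regular data column
mydf.set_index(keys='Employee Name', drop=False, inplace=True)

print(mydf)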

We can reset the index from the column index to the default numerical
index by using reset_index() method, refer to the following syntax for the
same:
reset_index(level: 'IndexLabel | None' = None, *, drop: 'bool' = False,
inplace: 'bool' = False, col_level: 'Hashable' = 0, col_fill: 'Hashable' = '',
allow_duplicates: 'bool | lib.NoDefault' = <no_default>, names: 'Hashable |
Sequence[Hashable] | None' = None)

An important point to note is that if drop=False, the old index is not
deleted from the dataframe but is moved back into a regular column; if set
to True, the old index is deleted from the dataframe, meaning that data is
lost, as demonstrated in the following code Chap9_Example9.2.py:

import pandas as mypd

mydf = mypd.read_csv('unique_eg.csv', index_col = "Employee Name")

#print(help(mydf.reset_index)) # Employee Name acts as index column

print(mydf)

print('-'*50)

# Employee Name column will be deleted

mydf.reset_index(drop=True, inplace=True)

print(mydf)

Output:

Employee Name Salary City

Vikas 200000.0 Lara

Saurabh 400000.0 NaN

Vikas 210000.0 NaN

Mintoo 110000.0 Jhansi

Vikas NaN Anand


NaN 130000.0 Aligarh

--------------------------------------------------

Salary City

0 200000.0 Lara

1 400000.0 NaN

2 210000.0 NaN

3 110000.0 Jhansi

4 NaN Anand

5 130000.0 Aligarh

Note: The preceding code is covered in (program name: Chap9_Example9.2.py).
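Conversely, with the default drop=False, the old index is not lost; it is moved back into an ordinary column. A minimal sketch on the same unique_eg.csv file is shown below:

import pandas as mypd

mydf = mypd.read_csv('unique_eg.csv', index_col='Employee Name')

# drop=False (the default) turns the Employee Name index back into a regular column
mydf.reset_index(inplace=True)

print(mydf)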

Rows selection by index label with loc[] indexer

Pandas loc[] indexer lets the user select rows and columns by their index
labels, instead of their position in the dataframe. We can pass a single label,
a list of labels, or a slice of labels to the loc[] indexer to select rows, as
shown in the following syntax:

mydf.loc[index_label]

If the row is selected with the matched index label, then a series object will
be returned. dataframe object is returned if multiple rows are associated
with the specified index.

KeyError is returned if the specified index is unavailable, as demonstrated


in the following code Chap9_Example9.3.py:

import pandas as mypd


mypd_dataframe = mypd.read_csv('loc_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# row selection with the matched index label --> a single row is associated, so a series object is returned

print(mypd_dataframe.loc['Divya'])

print('-'*50)

# now multiple rows are associated with the specified index ---> we are getting a dataframe object

print(mypd_dataframe.loc['Saurabh'])

print('-'*50)

# unmatched index ---> KeyError

print(mypd_dataframe.loc['Johnny'])

Output:

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

Saurabh 35 Finland
Saurabh 40 Italy

--------------------------------------------------

myage 34

mycountry India

Name: Divya, dtype: object

--------------------------------------------------

myname myage mycountry

Saurabh 35 Finland

Saurabh 40 Italy

KeyError: 'Johnny'

Note: The preceding code is covered in (program name: Chap9_Example9.3.py).

To use the slice operator, refer to the following syntax:

mydf.loc[begin_index:end_index:step]

As compared to normal slicing, here both begin_index and end_index


labels are inclusive, as demonstrated in the following code
Chap9_Example9.4.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('loc_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)
print('-'*50)

print(mypd_dataframe.loc['Mintoo':'Priyanka'])

print('-'*50)

print(mypd_dataframe.loc['Mintoo':'Priyanka':2])

print('-'*50)

# The begin index label is optional and default value is first index label

print('-'*50)

print(mypd_dataframe.loc[:'Priyanka'])

# The end index label is optional and default value is also end index label

print('-'*50)

print(mypd_dataframe.loc['Nilesh':])

# selection all rows

print('-'*50)

print(mypd_dataframe.loc[:])

Output:

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA
Saurabh 35 Finland

Saurabh 40 Italy

--------------------------------------------------

myname myage mycountry

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

--------------------------------------------------

myname myage mycountry

Mintoo 37 Australia

Priyanka 30 USA

--------------------------------------------------

--------------------------------------------------

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

--------------------------------------------------

myname myage mycountry


Nilesh 43 UK

Priyanka 30 USA

Saurabh 35 Finland

Saurabh 40 Italy

--------------------------------------------------

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

Saurabh 35 Finland

Saurabh 40 Italy

Note: The preceding code is covered in (program name: Chap9_Example9.4.py).

Suppose there are multiple index labels as shown as follows:

mydf.loc[[index1, index2, index3, …]]

Here, the argument is a list of index labels. All rows with matched indexes
will be returned. A KeyError is raised if at least one index label is
unavailable. Refer to the following code
Chap9_Example9.5.py for better understanding:

import pandas as mypd

mypd_dataframe = mypd.read_csv('loc_eg.csv', index_col='myname')


mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

print(mypd_dataframe.loc[['Mintoo','Nilesh','Priyanka']])

print('-'*50)

# KeyError since Nivesh is unavailable

print(mypd_dataframe.loc[['Mintoo','Nilesh','Priyanka','Nivesh']])

Output:

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

Saurabh 35 Finland

Saurabh 40 Italy

--------------------------------------------------

myname myage mycountry

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA
KeyError: "['Nivesh'] not in index"

Note: The preceding code is covered in (program name: Chap9_Example9.5.py).

Accessing rows of dataframe by using index position

In order to access rows of a dataframe by using index position in Python,


we will be using the iloc indexer. The iloc indexer allows the user to access
rows and columns by their integer position. The first-row index position
will start from zero, with the following syntax:

mydf.iloc[index_position]

The row selected with a matched index position will be returned in the
form of a series object. Non-availability of specified index position results
in IndexError.

Now, observe the following syntax:

mydf.iloc[begin_index_position:end_ index_position:step]

Here, the end_index_position would not be considered.

With multiple index positions, we will be using the line of code as


mydf.iloc[[index_pos1, index_pos2, index_pos3,…]].

Here, all rows are returned with matched index positions. The
unavailability of at least one index position will throw IndexError, refer to
the following code Chap9_Example9.6.py for better understanding:

import pandas as mypd

mypd_dataframe = mypd.read_csv('loc_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)
print('-'*50)

# Syntax-1:

print(mypd_dataframe.iloc[1]) # returns the 2nd row in the form of a series object

# Please uncomment it and then try the below code

#print(mypd_dataframe.iloc[10]) # returns IndexError: single positional indexer is out-of-bounds

print('-'*50)

# Syntax-2:

print(mypd_dataframe.iloc[1:4]) # returns rows from 1st index position to 3rd index position

print('-'*50)

# Syntax-3:

print(mypd_dataframe.iloc[[1,2,4]]) # returns rows present at indexes 1,2,4

print(mypd_dataframe.iloc[[1,2,9]]) # returns IndexError

Output:

myname myage mycountry

Divya 34 India

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA
Saurabh 35 Finland

Saurabh 40 Italy

--------------------------------------------------

myage 37

mycountry Australia

Name: Mintoo, dtype: object

--------------------------------------------------

myname myage mycountry

Mintoo 37 Australia

Nilesh 43 UK

Priyanka 30 USA

--------------------------------------------------

myname myage mycountry

Mintoo 37 Australia

Nilesh 43 UK

Saurabh 35 Finland

Note: The preceding code is covered in (program name: Chap9_Example9.6.py).

Now, suppose that instead of getting all column values in the matched rows,
we want to get a particular column value; then we require one more
argument. Refer to the following syntax for a better understanding:
mydf.loc[index_label, column_name]

mydf.iloc[index_position, column_position]

We saw that the first argument was used to select rows and the second
argument will be used to get the value of the specified column, as
demonstrated in the following code Chap9_Example9.7.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('cell_loc_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

print(mypd_dataframe.loc['Saurabh','mycountry'])

print('-'*50)

print(mypd_dataframe.loc['Saurabh','mycountry'].iloc[1]) # Value of only 1 column

print('-'*50)

print(mypd_dataframe.loc[['Mintoo','Priyanka'],['mycountry','mysalary']]) # will return rows with more than 1 column

print('-'*50)

print(mypd_dataframe.iloc[1,0])

print('-'*50)

print(mypd_dataframe.iloc[[1,3],[0,2]]) # accessing 1st and 3rd rows and 0th, 2nd column data
print('-'*50)

print(mypd_dataframe.iloc[2:,:2]) # from 2nd row index to the last, and from 0th to 1st index column

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname

Saurabh Finland

Saurabh Italy

Name: mycountry, dtype: object

--------------------------------------------------

Italy

--------------------------------------------------

myname mycountry mysalary

Mintoo Australia 20000


Priyanka USA 50000

--------------------------------------------------

37

--------------------------------------------------

myname myage mysalary

Mintoo 37 20000

Priyanka 30 50000

--------------------------------------------------

myname myage mycountry

Nilesh 43 UK

Priyanka 30 USA

Saurabh 35 Finland

Saurabh 40 Italy

Note: The preceding code is covered in (program name: Chap9_Example9.7.py).
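For reading or writing a single cell, pandas also provides the closely related at[] and iat[] accessors, which take exactly one row label (or position) and one column label (or position). The following minimal sketch uses the same cell_loc_eg.csv file with its default numeric index for simplicity:

import pandas as mypd

mypd_dataframe = mypd.read_csv('cell_loc_eg.csv')

print(mypd_dataframe.at[0, 'mysalary'])  # label-based access: row label 0, column 'mysalary'

print(mypd_dataframe.iat[0, 3])          # position-based access: row 0, column 3 (mysalary)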

Suppose there is a requirement to set a new value for a specified cell;
then we shall be using the following syntax:

mydf.loc[index_label, column_name] = new_value

mydf.iloc[index_position, column_position] = new_value

Wherever multiple values are required, we will provide them in the form of
a list which will be demonstrated in the following code
Chap9_Example9.8.py:
import pandas as mypd

mypd_dataframe = mypd.read_csv('cell_loc_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# changing the value of Nilesh myage to 45 from 43

mypd_dataframe.loc['Nilesh','myage'] = 45

# changing Mintoo and Priyanka salary to 600000 and 30000

mypd_dataframe.loc[['Mintoo','Priyanka'],'mysalary'] = [600000, 30000]

print(mypd_dataframe)

print('-'*50)

# changing [Divya, Priyanka] age to [42,29] and country to [Russia, France]

mypd_dataframe.loc[['Divya','Priyanka'],['myage','mycountry']] = [[42,'Russia'],[29,'France']]

print(mypd_dataframe)

# replace every occurrence of 30000 to 77777 in myname column

print('-'*50)

mycond = mypd_dataframe['mysalary'] == 30000

# a new dataframe is not created and a view of the existing dataframe is returned, thus improving memory utilization
mypd_dataframe.loc[mycond,'mysalary'] = 77777

print(mypd_dataframe)

print('-'*50)

# replacing, for the row named Saurabh with mycountry as Italy, its salary of 60000 with 99999

mypd_dataframe.loc[(mypd_dataframe.index == 'Saurabh') &


(mypd_dataframe['mycountry'] == 'Italy'), 'mysalary'] = 99999

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 600000

Nilesh 45 UK 40000
Priyanka 30 USA 30000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 42 Russia 10000

Mintoo 37 Australia 600000

Nilesh 45 UK 40000

Priyanka 29 France 30000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 42 Russia 10000

Mintoo 37 Australia 600000

Nilesh 45 UK 40000

Priyanka 29 France 77777

Saurabh 35 Finland 77777

Saurabh 40 Italy 60000

--------------------------------------------------
myname myage mycountry mysalary

Divya 42 Russia 10000

Mintoo 37 Australia 600000

Nilesh 45 UK 40000

Priyanka 29 France 77777

Saurabh 35 Finland 77777

Saurabh 40 Italy 99999

Note: The preceding code is covered in (program name: Chap9_Example9.8.py).

Renaming index and column names in dataframe

Index labels are related to rows, that is, axis-0 and column labels are related
to columns, that is, axis-1. The index labels and column names can be
renamed using the rename() method. The user is allowed to change the
name of one or more index labels or column names by providing a mapping
of old names to the new names, refer to the following syntax for a better
understanding:

rename(mapper: 'Renamer | None' = None, *, index: 'Renamer | None' =


None, columns: 'Renamer | None' = None, axis: 'Axis | None' = None, copy:
'bool | None' = None, inplace: 'bool' = False, level: 'Level | None' = None,
errors: 'IgnoreRaise' = 'ignore')

Renaming of index labels

Renaming index labels in a dataframe is a crucial step for improving data


clarity and ensuring that the dataset is more readable and meaningful. By
assigning descriptive and intuitive names to the index labels, we can
streamline data analysis and make the data more accessible.
By using mapper and axis parameters: A mapper is a dictionary in
which existing index labels are keys and new values are values, as
demonstrated in the following code Chap9_Example9.9.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv',
index_col='myname')

mypd_dataframe.sort_index(inplace=True)

#print(help(mypd_dataframe.rename))

print(mypd_dataframe)

print('-'*50)

# Usage of mapper and axis parameters ----------

mypd_dataframe.rename(mapper={'Saurabh':'Yathartha',
'Nilesh':'Ramesh'}, inplace=True)

# axis=0 is the default value. This will generate the same output for axis='rows' or axis='index' ---> please try on your own

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000


Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Ramesh 43 UK 40000

Priyanka 30 USA 50000

Yathartha 35 Finland 30000

Yathartha 40 Italy 60000

Note: The preceding code is covered in (program name: Chap9_Example9.9.py).

By using the index parameter: Directly use the index parameter


without specifying the axis parameter because it always considers
index labels only, as demonstrated in the following code
Chap9_Example9.10.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv',
index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)
# Usage of index parameter ----------

mypd_dataframe.rename(index={'Saurabh':'Yathartha',
'Nilesh':'Ramesh'}, inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Ramesh 43 UK 40000

Priyanka 30 USA 50000

Yathartha 35 Finland 30000

Yathartha 40 Italy 60000


Note: The preceding code is covered in (program name:
Chap9_Example9.10.py).

Renaming of column names

The columns can be renamed by using the mapper and axis parameters, and also
by using the columns parameter, as demonstrated in the following code
Chap9_Example9.11.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# Usage of mapper and axis parameters ----------

mypd_dataframe.rename(mapper={'myage':'my_age',
'mycountry':'my_country'}, axis=1, inplace=True)

# Also, if we write axis='columns', the same output will be generated ---> please try on your own

print(mypd_dataframe)

print('-'*50)

# By using columns parameter

mypd_dataframe.rename(columns={'mysalary':'my_salary'}, inplace=True)

print(mypd_dataframe)

Output:
myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname my_age my_country mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname my_age my_country my_salary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000
Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

Note: The preceding code is covered in (program name: Chap9_Example9.11.py).

It is important to note that we can rename columns by using the shortcut


technique as shown below:

mypd_dataframe.columns = ['my_age','my_country','my_salary']

But here, the length of the list must be the same as the number of columns
in the dataframe. So, this technique is useful only if we want to change all
the column names.
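It is also worth knowing that rename() accepts a function instead of a dictionary; the function is applied to every label. A minimal sketch using the built-in str.upper on the same rename_eg.csv file is shown below:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv', index_col='myname')

# a callable mapper is applied to each column label in turn
mypd_dataframe.rename(columns=str.upper, inplace=True)

print(mypd_dataframe)  # columns become MYAGE, MYCOUNTRY, MYSALARY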

Rows and columns deletion from the dataframe

The drop() method in pandas will be used to delete rows and columns from
a dataframe object, as shown below:

drop(labels: 'IndexLabel | None' = None, *, axis: 'Axis' = 0, index:


'IndexLabel | None' = None, columns: 'IndexLabel | None' = None, level:
'Level | None' = None, inplace: 'bool' = False, errors: 'IgnoreRaise' = 'raise')

First, let us explore row deletion which is demonstrated in the following


code Chap9_Example9.12.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.drop))

print(mypd_dataframe)
print('-'*50)

# to delete rows where the index label is 'Saurabh' ---> all of the following will generate the same output. Uncomment and then check

# mypd_dataframe.drop(labels='Saurabh', inplace=True)

# mypd_dataframe.drop(labels='Saurabh',axis=0, inplace=True)

# mypd_dataframe.drop(labels='Saurabh',axis='index', inplace=True)

# mypd_dataframe.drop(labels='Saurabh',axis='rows', inplace=True)

mypd_dataframe.drop(index='Saurabh', inplace=True)

print(mypd_dataframe)

print('-'*50)

# to delete rows where index label is either 'Mintoo' or 'Nilesh'

# This will also generate the same output

mypd_dataframe.drop(index=['Mintoo','Nilesh'], inplace=True)

# mypd_dataframe.drop(labels=['Mintoo','Nilesh'], inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000


Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000

Priyanka 30 USA 50000

Note: The preceding code is covered in (program name: Chap9_Example9.12.py).

If there is a default numeric index, then we can delete rows, as shown


below:

mypd_dataframe.drop(labels = 1, inplace = True)

mypd_dataframe.drop(index = 1, inplace = True)

mypd_dataframe.drop(labels = [1,3,5], inplace = True)

mypd_dataframe.drop(index = [1,3,5], inplace = True)


Let us observe the code for column deletion by using the pandas dataframe,
as demonstrated in the following code Chap9_Example9.13.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.drop))

print(mypd_dataframe)

print('-'*50)

# to delete column name = 'myage'

mypd_dataframe.drop(labels='myage',axis=1, inplace=True)

# This will generate the same output

# mypd_dataframe.drop(columns='myage', inplace=True)

print(mypd_dataframe)

# By using the pop method: we can delete only one column at a time

print('-'*50)

mypd_dataframe.pop('mycountry')

print(mypd_dataframe)

# By using del keyword

del mypd_dataframe['mysalary']

print('-'*50)

print(mypd_dataframe)
Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 35 Finland 30000

Saurabh 40 Italy 60000

--------------------------------------------------

myname mycountry mysalary

Divya India 10000

Mintoo Australia 20000

Nilesh UK 40000

Priyanka USA 50000

Saurabh Finland 30000

Saurabh Italy 60000

--------------------------------------------------

myname mysalary

Divya 10000

Mintoo 20000
Nilesh 40000

Priyanka 50000

Saurabh 30000

Saurabh 60000

--------------------------------------------------

Empty DataFrame

Columns: []

Index: [Divya, Mintoo, Nilesh, Priyanka, Saurabh, Saurabh]

Note: The preceding code is covered in (program name: Chap9_Example9.13.py).
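The drop() syntax above also shows an errors parameter. With the default errors='raise', dropping a label that does not exist raises a KeyError; with errors='ignore', the missing label is silently skipped. A minimal sketch on the same rename_eg.csv file is shown below:

import pandas as mypd

mypd_dataframe = mypd.read_csv('rename_eg.csv', index_col='myname')

# 'Johnny' is not present in the index; errors='ignore' suppresses the KeyError
mypd_dataframe.drop(index='Johnny', errors='ignore', inplace=True)

print(mypd_dataframe)  # the dataframe is unchanged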

Filter dataframe using query method

One of the most commonly used methods in pandas is the query() method. If
the user wants to filter rows from a dataframe, the query() method can be
used. The columns of a dataframe are queried with a Boolean expression, as
shown in the following syntax:

query(expr: 'str', *, inplace: 'bool' = False, **kwargs)

The following points need to be considered while using the query()


method:

The expr parameter must be a valid string.
If the column names contain spaces, special care has to be taken.
kwargs allows an arbitrary number of keyword arguments, which are
passed as a dictionary.

We shall now explore the code both without and with the usage of the query method.

Without using the query method


First, let us observe an example that selects all names related to the
country India, in the following code Chap9_Example9.14.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.query))

print(mypd_dataframe)

print('-'*50)

mycond = mypd_dataframe['mycountry'] == 'India'

print(mypd_dataframe[mycond])

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Yathartha 35 India 30000

--------------------------------------------------

myname myage mycountry mysalary

Divya 34 India 10000


Yathartha 35 India 30000

Note: The preceding code is covered in (program name: Chap9_Example9.14.py).

Now, we will be exploring an example with the usage of the query method
which is demonstrated in the following code Chap9_Example9.15.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

mypd_dataframe.query(" mycountry == 'India' ", inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Yathartha 35 India 30000

Note: The preceding code is covered in (program name: Chap9_Example9.15.py).

This works like SQL, where we get an output based on a query; just as an
SQL query must be valid, the expr argument must be a valid string. Normal
Python operators like ==, !=, >, >=, <, <=, &, and, |, or, in, and not in
can be used, as shown in the sketch below.
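The following minimal sketch on the same query_eg.csv file combines two conditions with and and references an ordinary Python variable inside the expression using the @ prefix:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mylimit = 20000  # a normal Python variable, referenced inside the query string with @

print(mypd_dataframe.query("mycountry == 'India' and mysalary > @mylimit"))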

in and not in operators using the query method

We can filter results based on multiple values using the in and not in
operators. The following code Chap9_Example9.16.py shows the usage of the
in operator:
import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# to select all names where country is either 'India' or 'UK'

mypd_dataframe.query(" mycountry in ['India', 'UK'] ", inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Yathartha 35 India 30000

myname myage mycountry mysalary

Divya 34 India 10000

Nilesh 43 UK 40000

Yathartha 35 India 30000


Note: The preceding code is covered in (program name:
Chap9_Example9.16.py).

Now, we shall be exploring an example with the usage of not in operator as


demonstrated in the following code Chap9_Example9.17.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# to select all names where country is neither 'India' nor 'UK'

mypd_dataframe.query(" mycountry not in ['India', 'UK'] ", inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Yathartha 35 India 30000

myname myage mycountry mysalary


Mintoo 37 Australia 20000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Note: The preceding code is covered in (program name:


Chap9_Example9.17.py).

Handling spaces in column names with query

The first approach is by using back quotes or backticks. The following code
Chap9_Example9.18.py displays the usage of back quotes (``):

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

mypd_dataframe.rename(columns={'mysalary':'My Salary'}, inplace=True)

print(mypd_dataframe)

print('-'*50)

# mypd_dataframe.query("My salary <45000", inplace=True) ----> Invalid

mypd_dataframe.query("`My Salary` <45000", inplace=True) # ----> valid, here 'My Salary' is placed in backticks

print(mypd_dataframe)

Output:

myname myage mycountry My Salary

Divya 34 India 10000


Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Yathartha 35 India 30000

--------------------------------------------------

myname myage mycountry My Salary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Yathartha 35 India 30000

Note: The preceding code is covered in (program name:


Chap9_Example9.18.py).

The second approach is to replace the space with an underscore symbol (_) in the column name and then apply the query() method, as demonstrated in the following code Chap9_Example9.19.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

mypd_dataframe.rename(columns={'mysalary':'My Salary'}, inplace=True)

print(mypd_dataframe)
print('-'*50)

mypd_dataframe.columns = [mycol.replace(' ','_') for mycol in


mypd_dataframe.columns]

print(mypd_dataframe)

print('-'*50)

mypd_dataframe.query("`My_Salary` <45000", inplace=True)

print(mypd_dataframe)

Output:

myname myage mycountry My Salary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000

Saurabh 40 Italy 60000

Yathartha 35 India 30000

--------------------------------------------------

myname myage mycountry My_Salary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Priyanka 30 USA 50000


Saurabh 40 Italy 60000

Yathartha 35 India 30000

--------------------------------------------------

myname myage mycountry My_Salary

Divya 34 India 10000

Mintoo 37 Australia 20000

Nilesh 43 UK 40000

Yathartha 35 India 30000

Note: The preceding code is covered in (program name:


Chap9_Example9.19.py).

Usage of apply() method to pandas dataframe

This apply() method is exactly the same as the Series class apply() method, except that it can be applied row-wise or column-wise across the whole dataframe. Refer to the following syntax for a better understanding, followed by the code Chap9_Example9.20.py:

apply(func: 'AggFuncType', axis: 'Axis' = 0, raw: 'bool' = False,


result_type: "Literal['expand', 'reduce', 'broadcast'] | None" = None, args=(),
by_row: "Literal[False, 'compat']" = 'compat', **kwargs)

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv', index_col='myname')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.apply))

print(mypd_dataframe)
print('-'*50)

# tripling the salary of each name using series apply method since we are
selecting a column

mypd_dataframe['mysalary'] = mypd_dataframe['mysalary'].apply(lambda
sal: sal*3)

print(mypd_dataframe)

print('-'*50)

# Now, tripling the salary of country with 'India' using dataframe apply
method

def india_triple_salary(myrow):

mycountry = myrow[1]

mysalary = myrow[2]

if mycountry == 'India':

myrow[2] = mysalary * 3

return myrow # multiple columns data is returned so dataframe

print(mypd_dataframe.apply(india_triple_salary, axis='columns')) # to
apply function row wise we have to use axis='columns'

Output:

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000


3 Nilesh 43 UK 40000

4 Priyanka 30 USA 50000

5 Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

0 Divya 34 India 30000

1 Mintoo 37 Australia 60000

2 Yathartha 35 India 90000

3 Nilesh 43 UK 120000

4 Priyanka 30 USA 150000

5 Saurabh 40 Italy 180000

--------------------------------------------------

myname myage mycountry mysalary

0 Divya 34 India 90000

1 Mintoo 37 Australia 60000

2 Yathartha 35 India 270000

3 Nilesh 43 UK 120000

4 Priyanka 30 USA 150000

5 Saurabh 40 Italy 180000

Note: The preceding code is covered in (program name:


Chap9_Example9.20.py).
Let us view one more example where we will be adding a MyDesignation
column in which there will be some designation related to each name for
some range of salaries as demonstrated in the following code
Chap9_Example9.21.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.apply))

print(mypd_dataframe)

print('-'*50)

# Adding a MyDesignation column based on salary ranges using the dataframe apply method

def designation_func(myrow):

mysalary = myrow[3]

if mysalary <25000:

return 'Worker'

elif mysalary <35000:

return 'Artisan'

elif mysalary <45000:

return 'DGM'

elif mysalary <55000:

return 'AGM'
else:

return 'GM'

mypd_dataframe['MyDesignation'] =
mypd_dataframe.apply(designation_func, axis='columns') # applying func
row wise

print(mypd_dataframe)

Output:

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

4 Priyanka 30 USA 50000

5 Saurabh 40 Italy 60000

myname myage mycountry mysalary MyDesignation

0 Divya 34 India 10000 Worker

1 Mintoo 37 Australia 20000 Worker

2 Yathartha 35 India 30000 Artisan

3 Nilesh 43 UK 40000 DGM

4 Priyanka 30 USA 50000 AGM

5 Saurabh 40 Italy 60000 GM


Note: The preceding code is covered in (program name:
Chap9_Example9.21.py).

The observations on these functions are mentioned below:

If the function is to be applied to each row, use the DataFrame class's apply() method. If the function is to be applied to each value in a particular column, use the Series class's apply() method.
When using the apply() method in the DataFrame class, be cautious with the axis parameter. If axis=0, the function will be applied to each column. For axis=1, the function will be applied to each row, as illustrated in the short sketch below.
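A minimal sketch (using a small hypothetical dataframe rather than query_eg.csv) of the axis behaviour:

import pandas as mypd

mydf = mypd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (default): the function receives one column (a Series) at a time
print(mydf.apply(lambda mycol: mycol.sum(), axis=0))

# axis=1: the function receives one row (a Series) at a time
print(mydf.apply(lambda myrow: myrow['a'] + myrow['b'], axis=1))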

Using nlargest() and nsmallest() in pandas dataframe

In the context of a dataframe object, the nlargest and nsmallest methods


will be used to retrieve the rows with the largest and smallest values,
respectively, based on a specified column or columns.

The top n rows with the largest values in a specified column will be
returned by the nlargest() method. The syntax of nlargest() method is
shown below:

nlargest(n: 'int', columns: 'IndexLabel', keep: 'NsmallestNlargestKeep' =


'first')

The top n rows with the smallest values in a specified column will be
returned by the nsmallest() method. The syntax of nsmallest() method is
mentioned as follows:

nsmallest(n: 'int', columns: 'IndexLabel', keep: 'NsmallestNlargestKeep' =


'first')

Observe the following code Chap9_Example9.22.py for a better


understanding:

import pandas as mypd


mypd_dataframe = mypd.read_csv('query_eg.csv')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.nlargest))

# print(help(mypd_dataframe.nsmallest))

print(mypd_dataframe)

print('-'*50)

# finding 4 highest salaried names information

print(mypd_dataframe.nlargest(n=4, columns=['mysalary']))

print('-'*50)

# finding 4 lowest salaried names information

print(mypd_dataframe.nsmallest(n=4, columns=['mysalary']))

Output:

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

4 Priyanka 30 USA 50000

5 Saurabh 40 Italy 60000

--------------------------------------------------
myname myage mycountry mysalary

5 Saurabh 40 Italy 60000

4 Priyanka 30 USA 50000

3 Nilesh 43 UK 40000

2 Yathartha 35 India 30000

--------------------------------------------------

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

Note: The preceding code is covered in (program name:


Chap9_Example9.22.py).

Note: If more than one name contains the same value of salary, then use the
keep='all' parameter to display the top n largest or smallest values.
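The following minimal sketch (with a small hypothetical dataframe containing duplicate salary values) illustrates the keep='all' behaviour:

import pandas as mypd

mydf = mypd.DataFrame({'myname': ['A', 'B', 'C', 'D'],
                       'mysalary': [60000, 50000, 50000, 40000]})

# keep='first' (default) returns exactly n rows, dropping one of the tied salaries
print(mydf.nlargest(n=2, columns=['mysalary']))

# keep='all' keeps every row that ties with the last retained salary value
print(mydf.nlargest(n=2, columns=['mysalary'], keep='all'))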

Filtering dataframe with where() method

The where() method replaces each value for which the condition is False, that is, a value not meeting the specified condition is replaced with another value, whose default is NaN. As per the syntax, the other parameter can be a scalar, Series/DataFrame, or callable. The dataframe is returned with the original values where the condition is True, otherwise NaN. So, to get a full-sized dataframe back, use the where() method with the following syntax:
where(cond, other=nan, *, inplace: 'bool_t' = False, axis: 'Axis | None' =
None, level: 'Level | None' = None)

Just observe the following code Chap9_Example9.23.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.where))

print(mypd_dataframe)

print('-'*50)

# to return full dataframe we will be using where() method

mycond = mypd_dataframe['mycountry'] == 'India'

print(mypd_dataframe.where(mycond))

Output:

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

4 Priyanka 30 USA 50000

5 Saurabh 40 Italy 60000

--------------------------------------------------
myname myage mycountry mysalary

0 Divya 34.0 India 10000.0

1 NaN NaN NaN NaN

2 Yathartha 35.0 India 30000.0

3 NaN NaN NaN NaN

4 NaN NaN NaN NaN

5 NaN NaN NaN NaN

Note: The preceding code is covered in (program name:


Chap9_Example9.23.py).

This method is more meaningful on the series objects than on dataframe


objects. Now, let us look at an example in which if the salary of the
concerned person is more than 45000, we will replace it with Upper limit as
demonstrated in the following code Chap9_Example9.24.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('query_eg.csv')

mypd_dataframe.sort_index(inplace=True)

# print(help(mypd_dataframe.where))

print(mypd_dataframe)

print('-'*50)

# if salary of the concerned person is more than 45000, we will replace it


with 'Upper limit'

mypd_dataframe['mysalary'] = mypd_dataframe['mysalary'].where(lambda
x: x<=45000, other='Upper limit')
print(mypd_dataframe)

Output:

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

4 Priyanka 30 USA 50000

5 Saurabh 40 Italy 60000

--------------------------------------------------

myname myage mycountry mysalary

0 Divya 34 India 10000

1 Mintoo 37 Australia 20000

2 Yathartha 35 India 30000

3 Nilesh 43 UK 40000

4 Priyanka 30 USA Upper limit

5 Saurabh 40 Italy Upper limit

Note: The preceding code is covered in (program name:


Chap9_Example9.24.py).

Note: If myseries is a series object and mydf is a dataframe object, then in


order to create a separate independent isolated copy of the dataframe or
series, we will be using the copy() method as shown below:

myseries1 = myseries.copy()

mydf1 = mydf.copy()

The changes done in the original object would not be reflected in the copy
object and vice-versa.
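A minimal sketch illustrating this independence is shown below:

import pandas as mypd

mydf = mypd.DataFrame({'myage': [35, 45]})
mydf1 = mydf.copy()            # separate, independent, isolated copy

mydf1.loc[0, 'myage'] = 99     # change made only in the copy

print(mydf)    # original still shows 35 in the first row
print(mydf1)   # copy shows 99 in the first row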

Handling of text data in Pandas

We can handle text data in pandas too, for example removing spaces, converting text into the required case, and so on. Just like the string methods we have learned in the basics, the same methods can be used here. Methods like upper, title, lower, replace, lstrip, rstrip, strip, len, and so on can be used. We will be using the str prefix while using all these methods, as shown in the example demonstrated in the following code Chap9_Example9.25.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('text_eg.csv')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# Converting all the names in upper case

mypd_dataframe['myname'] = mypd_dataframe['myname'].str.upper()

print(mypd_dataframe)

print('-'*50)

# Replacing '$' with 'Rs' in salary column


mypd_dataframe['mysalary'] =
mypd_dataframe['mysalary'].str.replace('$','Rs')

print(mypd_dataframe)

print('-'*50)

# Replacing middle spaces with underscore symbol in myname

mypd_dataframe['myname'] =
mypd_dataframe['myname'].str.strip().str.replace(' ','_')

print(mypd_dataframe)

print('-'*50)

# Converting mysalary column to float type, dividing by 1000, and adding 'K' unit

mypd_dataframe['mysalary'] =
(mypd_dataframe['mysalary'].str.replace('Rs','').astype(float)/1000).astype(s
tr) + 'K'

print(mypd_dataframe)

Output:

myname mysalary mystaffno

0 Divya Sharma $10000 601490

1 Yathartha Chandrakar $20000 601491

2 Saurabh Chandrakar $30000 601492

3 Nilesh Bahadure $40000 601493

4 Priyanka Chandrakar $50000 601494


--------------------------------------------------

myname mysalary mystaffno

0 DIVYA SHARMA $10000 601490

1 YATHARTHA CHANDRAKAR $20000 601491

2 SAURABH CHANDRAKAR $30000 601492

3 NILESH BAHADURE $40000 601493

4 PRIYANKA CHANDRAKAR $50000 601494

--------------------------------------------------

myname mysalary mystaffno

0 DIVYA SHARMA Rs10000 601490

1 YATHARTHA CHANDRAKAR Rs20000 601491

2 SAURABH CHANDRAKAR Rs30000 601492

3 NILESH BAHADURE Rs40000 601493

4 PRIYANKA CHANDRAKAR Rs50000 601494

--------------------------------------------------

myname mysalary mystaffno

0 DIVYA_SHARMA Rs10000 601490

1 YATHARTHA_CHANDRAKAR Rs20000 601491

2 SAURABH_CHANDRAKAR Rs30000 601492

3 NILESH_BAHADURE Rs40000 601493


4 PRIYANKA_CHANDRAKAR Rs50000 601494

--------------------------------------------------

myname mysalary mystaffno

0 DIVYA_SHARMA 10.0K 601490

1 YATHARTHA_CHANDRAKAR 20.0K 601491

2 SAURABH_CHANDRAKAR 30.0K 601492

3 NILESH_BAHADURE 40.0K 601493

4 PRIYANKA_CHANDRAKAR 50.0K 601494

Note: The preceding code is covered in (program name:


Chap9_Example9.25.py).

Filtering dataframe rows with string methods

In order to filter dataframe rows using string methods, we can use the
contains(), startswith(), and endswith() methods from the pandas library,
refer to the following points for a better understanding:

contains(sub_string): True is returned if the string contains provided


sub_string irrespective of its position, the syntax is shown below:

Series.str.contains(pat, case=True, flags=0, na=None, regex=True)


where pat is the pattern to search for, case determines case sensitivity,
flags are regex flags, na specifies the return value for missing values,
and regex indicates whether pat is treated as a regex pattern.

startswith(sub_string): True is returned if the string starts with


provided sub_string.
endswith(sub_string): True is returned if the string ends with provided
sub_string.
Refer to the following code Chap9_Example9.26.py for better
understanding:

import pandas as mypd

mypd_dataframe = mypd.read_csv('df_string_eg.csv')

mypd_dataframe.sort_index(inplace=True)

print(mypd_dataframe)

print('-'*50)

# to filter all row names which starts with letter 'S'

mycond = mypd_dataframe['myname'].str.startswith('S') # case sensitive

print(mypd_dataframe[mycond])

print('-'*50)

# to filter all row names which ends with letter 'r'

mycond = mypd_dataframe['myname'].str.endswith('r') # case sensitive

print(mypd_dataframe[mycond])

print('-'*50)

# to filter all row names which contains 'y' (lower case)

mycond = mypd_dataframe['myname'].str.contains('y') # case sensitive

print(mypd_dataframe[mycond]) # Here Yathartha was not displayed since


it starts with UpperCase 'Y'

print('-'*50)

# if we want case insensitive data use case parameter


mycond = mypd_dataframe['myname'].str.contains('y',case=False) # case
insensitive

print(mypd_dataframe[mycond]) # Here Yathartha will also be displayed


since it starts with UpperCase 'Y'

Output:

myname mysalary mystaffno

0 Divya Sharma $10000 601490

1 Yathartha Chandrakar $20000 601491

2 Saurabh Chandrakar $30000 601492

3 Nilesh Bahadure $40000 601493

4 Priyanka Chandrakar $50000 601494

5 Soumya Chandrakar $60000 601495

--------------------------------------------------

myname mysalary mystaffno

2 Saurabh Chandrakar $30000 601492

5 Soumya Chandrakar $60000 601495

--------------------------------------------------

myname mysalary mystaffno

1 Yathartha Chandrakar $20000 601491

2 Saurabh Chandrakar $30000 601492

4 Priyanka Chandrakar $50000 601494


5 Soumya Chandrakar $60000 601495

--------------------------------------------------

myname mysalary mystaffno

0 Divya Sharma $10000 601490

4 Priyanka Chandrakar $50000 601494

5 Soumya Chandrakar $60000 601495

--------------------------------------------------

myname mysalary mystaffno

0 Divya Sharma $10000 601490

1 Yathartha Chandrakar $20000 601491

4 Priyanka Chandrakar $50000 601494

5 Soumya Chandrakar $60000 601495

Note: The preceding code is covered in (program name:


Chap9_Example9.26.py).

Now, suppose in the same CSV file df_string_eg.csv, we have a


requirement that from myname column we want to split the name into
myfirstname and mysecondname. Add these two column names in the
dataframe and remove myname column from the dataframe, as shown in
the following code Chap9_Example9.27.py:

import pandas as mypd

mypd_dataframe = mypd.read_csv('df_string_eg.csv')

mypd_dataframe.sort_index(inplace=True)
print(mypd_dataframe)

print('-'*50)

# expand=True --> to get a dataframe object, otherwise a series object containing lists of strings is returned; get(0) --> 0th column value

mypd_dataframe.insert(loc=1, column='myfirstname', value=mypd_dataframe['myname'].str.strip().str.split(expand=True).get(0).str.title())

# get(1) --> 1st column value

mypd_dataframe.insert(loc=2, column='mysecondname', value=mypd_dataframe['myname'].str.strip().str.split(expand=True).get(1).str.title())

del mypd_dataframe['myname']

print(mypd_dataframe)

Output:

myname mysalary mystaffno

0 Divya Sharma $10000 601490

1 Yathartha Chandrakar $20000 601491

2 Saurabh Chandrakar $30000 601492

3 Nilesh Bahadure $40000 601493

4 Priyanka Chandrakar $50000 601494

5 Soumya Chandrakar $60000 601495

--------------------------------------------------
myfirstname mysecondname mysalary mystaffno

0 Divya Sharma $10000 601490

1 Yathartha Chandrakar $20000 601491

2 Saurabh Chandrakar $30000 601492

3 Nilesh Bahadure $40000 601493

4 Priyanka Chandrakar $50000 601494

5 Soumya Chandrakar $60000 601495

Note: The preceding code is covered in (program name:


Chap9_Example9.27.py).

For the shortcut approach, refer to the following code:

# shortcut

mypd_dataframe[['FirstName','LastName']] =
mypd_dataframe['myname'].str.strip().str.split(expand=True)

print(mypd_dataframe)

Modifying data in pandas dataframe

By using indexing and slicing with various other methods, we can modify
the data in pandas dataframe, as demonstrated in the following code
Chap9_Example9.28.py:

import pandas as mypd

mydict = {'myname': ['Alex', 'John', 'Michael', 'Tom'],

'myage': [35, 45, 55, 65],

'mycountry': ['UK', 'USA', 'Australia', 'Finland']}


mypd_dataframe = mypd.DataFrame(mydict)

print(mypd_dataframe)

print('-'*50)

# modifying the mycountry of the second row

mypd_dataframe.loc[1, 'mycountry'] = 'Russia'

print(mypd_dataframe)

print('-'*50)

# adding a new column to the DataFrame

mypd_dataframe['Hobby'] = ['Playing Cricket', 'Listening Music', 'Reading


Books', 'Cooking']

print(mypd_dataframe)

print('-'*50)

# dropping the Hobby column from the DataFrame

my_df_drop = mypd_dataframe.drop(columns=['Hobby'])

print(my_df_drop)

print('-'*50)

# sorting the DataFrame by myname in descending order

my_df_sort = mypd_dataframe.sort_values(by='myname',
ascending=False)

print(my_df_sort)

Output:
myname myage mycountry

0 Alex 35 UK

1 John 45 USA

2 Michael 55 Australia

3 Tom 65 Finland

--------------------------------------------------

myname myage mycountry

0 Alex 35 UK

1 John 45 Russia

2 Michael 55 Australia

3 Tom 65 Finland

--------------------------------------------------

myname myage mycountry Hobby

0 Alex 35 UK Playing Cricket

1 John 45 Russia Listening Music

2 Michael 55 Australia Reading Books

3 Tom 65 Finland Cooking

--------------------------------------------------

myname myage mycountry

0 Alex 35 UK
1 John 45 Russia

2 Michael 55 Australia

3 Tom 65 Finland

--------------------------------------------------

myname myage mycountry Hobby

3 Tom 65 Finland Cooking

2 Michael 55 Australia Reading Books

1 John 45 Russia Listening Music

0 Alex 35 UK Playing Cricket

Note: The preceding code is covered in (program name:


Chap9_Example9.28.py).

Example of data aggregation in Pandas DataFrame

Data can be aggregated in a pandas dataframe by using methods such as


count, sum, mean, median, and so on. There is a groupby function,
available in the pandas library for Python, which enables the user to
categorize a dataframe based on one or more columns and execute
aggregate computations on these grouped data sets. For instance, the user
can group a sales data dataframe by the region column, subsequently
computing the total sales for each specific region. The outcome of the
groupby function is a fresh dataframe, where the groups are set as the
index, and the aggregate computations serve as the corresponding values.
Refer to the following code Chap9_Example9.29.py:

import pandas as mypd

mydict = {

'myname': ['Alex', 'John', 'Michael', 'Tom'],

'myage': [35, 45, 55, 65],

'mycountry': ['UK', 'USA', 'Australia', 'Finland']

}

mypd_dataframe = mypd.DataFrame(mydict)

# Convert 'myage' column to numeric data type

mypd_dataframe['myage'] = mypd.to_numeric(mypd_dataframe['myage'],
errors='coerce')

print(mypd_dataframe)

print('-' * 50)

# Getting the mean age of the DataFrame

my_mean_age = mypd_dataframe['myage'].mean()

print(my_mean_age)

print('-' * 50)

# Getting the total number of rows in the DataFrame

my_num_row_count = len(mypd_dataframe.index)

print(my_num_row_count)

print('-' * 50)

# Grouping the DataFrame by country and getting the mean age for each
group

mygroup = mypd_dataframe.groupby(['mycountry'])['myage'].mean()

print(mygroup)
Output:

myname myage mycountry

0 Alex 35 UK

1 John 45 USA

2 Michael 55 Australia

3 Tom 65 Finland

--------------------------------------------------

50.0

--------------------------------------------------

4

--------------------------------------------------

mycountry

Australia 55.0

Finland 65.0

UK 35.0

USA 45.0

Name: myage, dtype: float64

Note: The preceding code is covered in (program name:


Chap9_Example9.29.py).

Conclusion
In this chapter, we explored various essential operations and techniques for
effectively manipulating pandas dataframes. We explored methods for
renaming index labels and column names, ensuring clarity and consistency
in data representation. Efficient ways to delete rows and columns from
dataframes were discussed, streamlining data management tasks. The
introduction of the powerful query method for data filtering enhanced data
exploration capabilities, while advanced techniques like using apply for
data manipulation and nlargest and nsmallest for identifying extreme values
provided deeper insights into dataframe operations. The chapter also
addressed text data handling within pandas, showcasing how to filter
dataframe rows with string methods. Through practical examples of data
modification, readers gained hands-on experience in real-world data
manipulation, with detailed explanations and outputs. As a result, readers
now possess a solid foundation in working with pandas dataframe for data
analysis and manipulation in Python.

In the next chapter, we shall explore polars and how they offer an
alternative to pandas for data manipulation. We will also discuss their
features, performance benefits, and practical applications in addition to
their high-performance data structures, including series and dataframe,
which offer a seamless experience for data manipulation and analysis.

Points to remember

To set the index of the pandas dataframe we can use either the
index_col parameter when reading data from a file or the set_index()
method to set a specific column as the index.
Use df.loc[label] to select rows by index label, inclusive of endpoints.
Access rows using iloc[], exclusive of the end position.
Use the rename() method to rename index labels or column names.
The drop() method in pandas will be used to delete rows and columns
from a dataframe object.
Utilize the query() method with direct condition expressions.
Employ in and not in operators within the query() method.
If the function is to be applied to every row, then we will use the dataframe class apply() method. If the function is to be applied to every value of a particular column, then we will use the series class apply() method.
The nlargest and nsmallest methods will be used to retrieve the rows
with the largest and smallest values, respectively, based on a specified
column or columns in the context of the pandas dataframe object.
For conditional filtering of dataframe elements, use the where()
method.
The text data can be handled in pandas using string methods.

Questions

1. What approaches can be used to set a customized column as the index


in a dataframe?
2. How to select rows by index label by using the loc indexer?
3. How to access rows of a dataframe by using index position?
4. How to rename index labels and column names in a dataframe?
5. What is the process for deleting rows and columns from a dataframe?
6. How to filter a dataframe using the query method?
7. How to use in and not in operators with the query method?
8. How is text data handled in pandas for filtering dataframe rows with
string methods?
9. Provide an example of data aggregation within a pandas dataframe.
10. Write a Python snippet code to return all rows from 20th to the 25th by
using head and tail methods.
CHAPTER 10 Exploring Polars Library for Data Science in Python
Introduction

In the previous chapter, we discussed about pandas library for data analysis
and manipulation. However, pandas can be slow for large datasets and
complex operations because it is written in Python, a dynamically typed
language. Pandas can also run out of memory when working with large
datasets because it stores dataframes in memory. In addition, although some may argue that pandas is simpler for beginners to learn due to its high-level functionality and its ability to perform complex data manipulations with relatively few lines of code, many users find it challenging. The complexity
of its Application Programming Interface (API), with a vast array of
functions and methods, can be overwhelming. Sometimes, writing efficient
code to handle large datasets without performance issues requires a deeper
understanding, making pandas less straightforward for many beginners.

On the other hand, polars is a newer DataFrame library that addresses some
of the limitations of Pandas. It is written in Rust, a compiled language that
is known for its speed and efficiency. It also uses a more efficient memory
layout for dataframes, which allows it to handle larger datasets than pandas.
Additionally, polars has a syntax that is similar to pandas, making it easy to
learn for users who are already familiar with pandas. Polars utilizes all the available cores on the machine and can handle datasets much larger than the available RAM. It has a consistent and predictable API. In polars, the data types should be known before running a query.

We can install polars by using the pip command, that is, pip install polars or
conda install polars.

Here, on every snippet code, we will be using alias name as mypl while
importing polars. So, import polars as mypl.

Structure
In this chapter, we will discuss the following topics:

Preference of polars over pandas


Difference between pandas and polars.
Data structures in polars
Importance of expressions in polars
Concept of lazy API in polars

Objectives

By the end of this chapter, the reader will embark on a comprehensive


exploration of the polars data manipulation library in Python. First, we will
highlight the key differences between pandas and polars, shedding light on
the unique advantages and features that polars bring to the table. Then, we
will explore polars data types, categorizing them into numeric, nested,
temporal, and other groups, providing a solid understanding of the data
structures. Readers will then venture into the center of polars, gaining
insights into its data structures, including series and dataframe, and learning
how to work with them effectively. Next, we will unveil the concept of
contexts in polars, focusing on essential operations, selection, and filtering
techniques. The groupby concept is demystified, enabling readers to
harness the power of polars for data aggregation and summarization.
Expressions emerge as a pivotal aspect of polars, and readers will grasp
their significance in facilitating complex data transformations. In the end,
the concept of a Lazy API will be well elucidated, demonstrating how it
enhances performance and resource utilization.

Preference of polars over pandas

Some specific reasons why you might want to choose Polars over Pandas
are mentioned below:

Performance: Polars is generally faster than pandas, especially for large datasets and complex operations.
Scalability: Polars can handle larger datasets than pandas before running into out-of-memory errors.
Ease of use: Polars has a syntax that is similar to pandas, making it easy to learn for users who are already familiar with pandas.
Features: Polars has some features that pandas does not, such as lazy execution and streaming support.

Differences between pandas and polars

Refer to the following table to understand the difference:

S No. | Feature | Pandas | Polars
1 | Language | Python, C and Cython | Rust; can support many more parallel operations than pandas
2 | Performance | Slower | Faster
3 | Ease of use | Pandas is easier than polars | Polars is more consistent and efficient
4 | Community | Larger (more documentation) | Documentation is comprehensive for polars as well, but the community is smaller, so there might be fewer questions/answers on Stack Overflow and other sites
5 | Features | More built-in functions | Has fewer built-in functions
6 | Support | Supports eager evaluation, that is, an expression is evaluated as soon as it is encountered | Supports lazy execution, that is, queries are first examined and then optimized, looking for ways to accelerate the query or reduce memory usage
7 | Data representation | Internally uses numpy arrays | Internally uses apache arrow arrays
8 | Core computation | Single core | Multi core

Table 10.1: Table depicting difference between pandas and polars

The documentation on polars can be found in the URL link:

https://pola-rs.github.io/polars/user-guide

Polars data types

The data types in polars are based on arrow and backed by arrow memory
arrays.

Numeric group

The numeric group consists of signed and unsigned integers and floating-point values. Refer to the following table for a better understanding:

S No. | Numeric type | Details | Range
1 | Int8 | 8-bit signed integer | -2^7 to +2^7 - 1
2 | Int16 | 16-bit signed integer | -2^15 to +2^15 - 1
3 | Int32 | 32-bit signed integer | -2^31 to +2^31 - 1
4 | Int64 | 64-bit signed integer | -2^63 to +2^63 - 1
5 | UInt8 | 8-bit unsigned integer | 0 to 2^8 - 1
6 | UInt16 | 16-bit unsigned integer | 0 to 2^16 - 1
7 | UInt32 | 32-bit unsigned integer | 0 to 2^32 - 1
8 | UInt64 | 64-bit unsigned integer | 0 to 2^64 - 1
9 | Float32 | 32-bit floating point | close to zero up to approximately 3.4 x 10^38
10 | Float64 | 64-bit floating point | close to zero up to approximately 1.8 x 10^308

Table 10.2: Table depicting range of different numeric types

32-bit floating point, also known as single precision, has a 1-bit sign, an 8-bit exponent, and a 23-bit significand (also known as the mantissa). It provides about seven to eight decimal digits of precision and is suitable for a wide range of applications, including real-time graphics and many scientific computations.

64-bit floating point, also known as double precision, has a 1-bit sign, an 11-bit exponent, and a 52-bit significand (mantissa). It provides about 15 to 17 decimal digits of precision and is suitable for scientific and engineering applications where high precision is required.
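Note that when no data type is specified, polars infers one (typically 64-bit integers and floats for plain Python numbers). The following minimal sketch (with illustrative column names) shows one way to request a smaller numeric type explicitly, either at creation time or by casting:

import polars as mypl

# Series with an explicitly requested 8-bit signed integer dtype
myseries_int8 = mypl.Series("int8_column", [11, 12, 13], dtype=mypl.Int8)
print(myseries_int8.dtype)      # Int8

# DataFrame column cast from the inferred Int64 down to UInt16
mydf = mypl.DataFrame({"uint16_column": [1100, 1200, 1300]})
mydf = mydf.with_columns([mypl.col("uint16_column").cast(mypl.UInt16)])
print(mydf.dtypes)              # [UInt16]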

Nested group

The nested group consists of types whose values can themselves hold multiple values. Refer to the following table for a better understanding:

S No. | Nested type | Explanation
1 | Struct | It allows storing a collection of values of different types, that is, multiple values in a single column.
2 | List | It holds an ordered sequence of elements where each element can be of any data type.

Table 10.3: Table depicting nested group
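A minimal sketch of these nested types is shown below (the column names are only illustrative); a column of Python lists becomes a List column and a column of dictionaries becomes a Struct column:

import polars as mypl

mydf_nested = mypl.DataFrame({
    # a column of Python lists becomes a List column
    "mylist_column": [[1, 2], [3, 4, 5], [6]],
    # a column of dictionaries becomes a Struct column
    "mystruct_column": [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 3, "y": "c"}],
})

print(mydf_nested)
print(mydf_nested.dtypes)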

Temporal group

Temporal grouping within polar data types arranges data points


chronologically, enabling the examination of cyclical patterns and temporal
trends for insightful analysis. Refer to the following table for a better
understanding:

S No. | Temporal type | Explanation
1 | Date | Date representation counted in days since the UNIX epoch, stored internally in a 32-bit signed integer.
2 | Time | Time representation since midnight, internally represented as nanoseconds.
3 | Datetime | Datetime representation counted in microseconds since the UNIX epoch, stored internally in a 64-bit signed integer.
4 | Duration | A time-delta type created when subtracting date/datetime values, internally represented as microseconds.

Table 10.4: Table depicting temporal group
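A minimal sketch of the temporal types is shown below (the column names are only illustrative); subtracting two Datetime columns produces a Duration column:

import polars as mypl
from datetime import date, datetime

mydf_temporal = mypl.DataFrame({
    "mydate": [date(2023, 11, 11), date(2023, 11, 12)],                          # Date
    "mystart": [datetime(2023, 11, 11, 8, 0), datetime(2023, 11, 12, 9, 30)],    # Datetime
    "myend": [datetime(2023, 11, 11, 17, 0), datetime(2023, 11, 12, 18, 0)],     # Datetime
})

# subtracting two Datetime columns yields a Duration column
mydf_temporal = mydf_temporal.with_columns([
    (mypl.col("myend") - mypl.col("mystart")).alias("myduration")
])

print(mydf_temporal)
print(mydf_temporal.dtypes)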

Others group

The others group in polar data types encompasses Boolean, binary, UTF-8,
categorical, and object data, offering a comprehensive framework for
handling diverse data formats and facilitating efficient analysis across
various data types. Refer to the following table for a better understanding:

S No. | Data type | Explanation
1 | Boolean | Effectively a bit-packed Boolean type.
2 | Binary | Data is stored as bytes.
3 | Utf8 | Stored as string data.
4 | Categorical | Categorical encoding of a set of strings.
5 | Object | Can wrap any value; supported only in a limited way.

Table 10.5: Table depicting others group
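A minimal sketch covering some of these types is shown below (the column names are only illustrative); a string column can be converted to Categorical with the cast() method:

import polars as mypl

mydf_other = mypl.DataFrame({
    "mybool_column": [True, False, True],        # Boolean
    "mytext_column": ["red", "green", "red"],    # string (Utf8) data
})

# categorical encoding of the string column
mydf_other = mydf_other.with_columns([
    mypl.col("mytext_column").cast(mypl.Categorical).alias("mycat_column")
])

print(mydf_other)
print(mydf_other.dtypes)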

Let us view the following example as demonstrated in program name:


Chap10_Example10.1.py which depicts the usage of numeric group for
dataframe creation using polars:

import polars as mypl

# Polars DataFrame creation with 8-bit signed integers

mydf_int8 = mypl.DataFrame({

'int8_column': [11, 12, 13],

})

# Polars DataFrame creation with 16-bit signed integers

mydf_int16 = mypl.DataFrame({

'int16_column': [1100, 1200, 1300],

})

# Polars DataFrame creation with 32-bit signed integers

mydf_int32 = mypl.DataFrame({

'int32_column': [11000, 12000, 13000],


})

# Polars DataFrame creation with 64-bit signed integers

mydf_int64 = mypl.DataFrame({

'int64_column': [110000, 120000, 130000],

})

# Polars DataFrame creation with 8-bit unsigned integers

mydf_uint8 = mypl.DataFrame({

'int8_column': [11, 12, 13],

})

# Polars DataFrame creation with 16-bit unsigned integers

mydf_uint16 = mypl.DataFrame({

'int16_column': [1100, 1200, 1300],

})

# Polars DataFrame creation with 32-bit unsigned integers

mydf_uint32 = mypl.DataFrame({

'int32_column': [11000, 12000, 13000],

})

# Polars DataFrame creation with 64-bit unsigned integers

mydf_uint64 = mypl.DataFrame({

'int64_column': [110000, 120000, 130000],


})

mydf_f32 = mypl.DataFrame({

'float32_column': [11.23, 14.56, 17.89],

})

mydf_f64 = mypl.DataFrame({

'float64_column': [11.23, 14.56, 17.89],

})

# Display the DataFrames

print("8-bit integers:")

print(mydf_int8)

print('-'*50)

print("\n16-bit integers:")

print(mydf_int16)

print('-'*50)

print("\n32-bit integers:")

print(mydf_int32)

print('-'*50)

print("\n64-bit integers:")

print(mydf_int64)

print('-'*50)
print("8-bit unsigned integers:")

print(mydf_uint8)

print('-'*50)

print("\n16-bit unsigned integers:")

print(mydf_uint16)

print('-'*50)

print("\n32-bit unsigned integers:")

print(mydf_uint32)

print('-'*50)

print("\n64-bit unsigned integers:")

print(mydf_uint64)

print('-'*50)

print("\n32-bit floating point:")

print(mydf_f32)

print('-'*50)

print("\n64-bit floating point:")

print(mydf_f64)

Refer to the following figure for a better understanding:

Figure 10.1 (a): Output of program name: Chap10_Example10.1.py


The output is continued below:

Figure 10.1 (b): Output of program name: Chap10_Example10.1.py

Refer to the remaining output for a clear understanding:

Figure 10.1 (c): Output of program name: Chap10_Example10.1.py

Let us take a closer look at the final part of the output for a better understanding:

64-bit floating point:

shape: (3, 1)

┌────────────────┐

│ float64_column │

│ --- │

│ f64 │

╞════════════════╡

│ 11.23 │

│ 14.56 │

│ 17.89 │

└────────────────┘

Note: The preceding code is covered in (program name:


Chap10_Example10.1.py).

Data structures in polars


In polars, series and dataframe serve as robust data structures, offering
powerful tools for data manipulation and analysis with ease and efficiency.
We shall explain each one-by-one.

Series

A 1-dimensional data structure in which, within the series, all the elements
have the same data type. We can think of it as a labeled array. Now, let us
view the code as demonstrated in program name: Chap10_Example10.2.py
for series creation using a list of integers in polars:

import polars as mypl

# Create a Series from a list of integers

myseries = mypl.Series("mynumbers", [11, 12, 13, 14, 15])

# Access values and perform operations on the Series

print(myseries)

print(myseries.sum())

Output:

shape: (5,)

Series: 'mynumbers' [i64]

11

12

13

14
15

65

Note: The preceding code is covered in (program name:


Chap10_Example10.2.py).

Dataframe

A 2-dimensional data structure called a dataframe is backed by series and can be thought of as an abstraction of a group (such as a list) of series.
A Structured Query Language (SQL)-like query can be very closely
compared to the operations that can be carried out on a dataframe. In
addition to GROUP BY, JOIN, and PIVOT, we can specify custom
functions. Also, we can view data in a dataframe by using methods like
head, tail, sample, and describe as shown in the following example program
name: Chap10_Example10.3.py:

import polars as mypl

from datetime import datetime

mydf = mypl.DataFrame({

"myint": [11, 12, 13, 14, 15],

"mydate": [

datetime(2023, 11, 11),

datetime(2023, 11, 12),

datetime(2023, 11, 13),

datetime(2023, 11, 14),

datetime(2023, 11, 15),

],

"myfloat": [4.0, 5.0, 6.0, 7.0, 8.0],

"mycity": ['Hyderabad', 'Delhi', 'Durg', 'Bhilai', 'Raipur']

})

print(mydf)

# head: will return the first 5 rows of a dataframe as default and can specify
the number of rows we want by inserting some number other than default
value of 5

print(mydf.head(2))

# tail: will return the last 5 rows of a dataframe as default and can specify
the number of rows we want by inserting some number other than default
value of 5

print(mydf.tail(2))

# sample: will return 'n' number of random rows from the dataframe

print(mydf.sample(2))

# describe: will return quick summary statistics of the dataframe

print(mydf.describe())

Output:

Figure 10.2: Output of Program Name: Chap10_Example10.3.py


Refer to the textual output of describe() shown below for a clear understanding:

shape: (9, 5)

┌────────────┬──────────┬─────────────────
────┬──────────┬────────┐

│ describe ┆ myint ┆ mydate ┆ myfloat ┆ mycity │

│ --- ┆ --- ┆ --- ┆ --- ┆ --- │

│ str ┆ f64 ┆ str ┆ f64 ┆ str │

╞═══════════╪══════════╪══════════════════
═╪═════════╪════════╡

│ count ┆ 5.0 ┆ 5 ┆ 5.0 ┆ 5 │

│ null_count ┆ 0.0 ┆ 0 ┆ 0.0 ┆ 0 │

│ mean ┆ 13.0 ┆ null ┆ 6.0 ┆ null │

│ std ┆ 1.581139 ┆ null ┆ 1.581139 ┆ null │

│ min ┆ 11.0 ┆ 2023-11-11 00:00:00 ┆ 4.0 ┆ Bhilai│

│ max ┆ 15.0 ┆ 2023-11-15 00:00:00 ┆ 8.0 ┆ Raipur│

│ median ┆ 13.0 ┆ null ┆ 6.0 ┆ null │

│ 25% ┆ 12.0 ┆ null ┆ 5.0 ┆ null │

│ 75% ┆ 14.0 ┆ null ┆ 7.0 ┆ null │

└────────────┴──────────┴─────────────────
────┴──────────┴────────┘

Note: The preceding code is covered in (program name:


Chap10_Example10.3.py).
Contexts in polars

Users will be allowed to load, transform, and analyze data by using various
operations and functions with the help of contexts in polars. Using
contexts, a structured and cohesive way will be provided to work with data.
Different tasks will be performed like select, filter, and aggregate which we
shall discuss one by one. We can create and manipulate data frames, apply
transformations, perform calculations, and execute queries on the data by
creating a context in polars. It will act as a container for data which will
provide a set of tools and methods to work with data effectively. The three
main contexts are selection, filtering, and groupby/aggregation which we
will discuss now with an example. We will be using the famous iris.csv
dataset.

Note: The iris.csv dataset is widely recognized and utilized in the domains
of data science and machine learning. It encompasses measurements of
diverse attributes associated with iris flowers, including sepal length, sepal
width, petal length, and petal width. These attributes are accompanied by
the respective species of each iris flower. This dataset finds extensive
application in tasks such as classification, clustering, and visualization
within the realm of data science. It serves as an ideal starting point for
beginners to practice and grasp various machine-learning algorithms and
techniques.

Typically, the iris.csv dataset comprises 150 rows, with each row
representing an individual iris flower and its corresponding attribute
measurements. Machine learning models are often trained and assessed by
using this dataset to predict the species of iris flowers based on their
attribute measurements. Due to its simplicity and versatility, the iris.csv
dataset is well-known and easily accessible. It is frequently employed as a
getting-started dataset for DS beginners.

Basic operations, select and filter concept

Observe the step-by-step code written along with the operations being
performed as shown in the following example program name:
Chap10_Example10.4.py:
import polars as mypl

import matplotlib.pyplot as myplt

print("Analyzing the iris.csv dataset")

mydf = mypl.read_csv("iris.csv")

print(mydf.shape) # 150 rows and 5 columns

print("display of first 5 rows")

print(mydf.head()) # we can display any number of rows if we pass any


argument to head method

print('-'*50)

print("converting the result of head() into a pandas dataframe using


to_pandas() function")

print(mydf.head().to_pandas())

print('-'*50)

print(" This code will return the datatypes of the columns of the
dataframe")

print(mydf.dtypes)

print('-'*50)

print("--------------------------Selecting and Filtering Data-----------------------


-----------")

print("1. indexing with [ ------------------")

print("display of first row and all the columns")

print(mydf[0,:])
print('-'*50)

print("Requirement to display first 2 rows of column name sepal.length and


petal.length only")

print(mydf[:,['sepal.length','petal.length']].head(2))

print('-'*50)

print("Requirement to filter first 3 rows where petal.length is > 2.1 ")

print(mydf.filter(mypl.col('petal.length')>2.1).head(3))

print('-'*50)

print("2. Selecting data with idiomatic polars ---------------------------")

print("Random sample of 5 rows will be selected from a dataframe")

print(mydf.sample(5).to_pandas())

print('-'*50)

print("Selecting sepal.width , petal.width columns and variety columns and


display of first 4 rows only")

# Using sql select type of operation for just selecting columns

print(mydf.select(['sepal.width','petal.width','variety']).head(4))

print('-'*50)

print("Add a new column name say Mux10_Petal.Length which is 10 *


petal.length and display of only first 3 rows")

# Note if alias name is not written, then the change will happen in
petal.length itself

print(mydf.with_columns([
(mypl.col('petal.length')*10).alias('Mux10_Petal.Length')

]).head(3))

print('-'*50)

print("Display of last 5 rows with Setosa variety only")

# The is_in() function is called on the 'variety' column. The is_in() function
checks if the values

# in the column are present in a given list. In this case, it checks if the
'variety' column values

# are in the list ['Setosa'].

print(mydf.filter

(mypl.col('variety').is_in(['Setosa']))

.tail())

# Computation performing on the select context

print('-'*50)

print('Calculating the number of unique values in sepal.length column')

print(mydf.select([

mypl.col('sepal.length').n_unique()

]))

print('-'*50)

print('Performing statistics on the entire sepal.length column')


# This code will provide various statistical measures for the 'sepal.length'
column of a dataframe called 'mydf'.

# It calculates the minimum, mean, median, maximum, and standard


deviation of the 'sepal.length'

# column and renames each calculated value accordingly.

print(mydf.select([

mypl.col('sepal.length').min().alias('min_sepal_length'),

mypl.col('sepal.length').mean().alias('mean_sepal_length'),

mypl.col('sepal.length').median().alias('median_sepal_length'),

mypl.col('sepal.length').max().alias('max_sepal_length'),

mypl.col('sepal.length').std().alias('std_sepal_length')

]))

print('-'*50)

print('Computing statistics at a series level')

print(mydf.select([mypl.col('sepal.length')]).describe())

print('-'*50)

print('Performing data processing in polars and data visualization in


pandas')

mysepal_length = mydf.select([

mypl.col('sepal.length')

]) # data processing
# The given code takes a sample of 100 rows from the "mysepal_length"
dataset,

# converts it to a pandas dataframe, and then creates a histogram with 10


bins.

mysepal_length.sample(100).to_pandas().hist(bins=10) # data visualization

myplt.show()

Output:

Figure 10.3 (a): Output of Program Name: Chap10_Example10.4.py

Refer to the following figure for a better understanding:

Figure 10.3 (b): Output of Program Name: Chap10_Example10.4.py

The code is continued in the following figure:

Figure 10.3 (c): Output of Program Name: Chap10_Example10.4.py

Refer to the following figure to understand the output of the code:

Figure 10.3 (d): Output of Program Name: Chap10_Example10.4.py

Refer to the following figure for a better understanding of the output:

Figure 10.3 (e): Output of Program Name: Chap10_Example10.4.py


The following figure depicts the graph for clarity:

Figure 10.3 (f): Image depicting histogram view with 10 bins for a sample
of 100 rows from mysepal_length dataset

Note: The preceding code is covered in (program name:


Chap10_Example10.4.py).

We have also worked the same code in Jupyter notebook under the
filename context_demo.ipynb.

One major observation from the preceding example is that the original
columns will be retained and new ones will be added by using
with_columns whereas, the original columns will be dropped by using the
select statement. The existing dataframe is filtered by using the filter
context based on arbitrary expression that evaluates to a Boolean data type.

GroupBy concept

Expressions in the group_by context operate on groups and can provide


results of any length (a group may have numerous members) as shown in
the following example program name:

Chap10_Example10.5.py:

import polars as mypl

print("Reading the iris.csv dataset and storing the contents in the variable
mydf")

mydf = mypl.read_csv("iris.csv")

print('-'*50)

print(mydf.groupby('variety').agg(mypl.count()))

'''
1. The code starts by calling the 'groupby' function on the dataframe 'mydf'
and passing the column name 'variety' as an argument.

2. This groups the dataframe based on the unique values in the 'variety'
column.

3. Then, the 'agg' function is called on the grouped dataframe, and


'mypl.count()' is passed as an argument.

4. 'mypl.count()' is a function that counts the number of non-null values in


each column of the grouped dataframe.

5. The 'agg' function applies this counting operation to each group and
returns the result.

6. The resulting dataframe will have the count of non-null values for each
column, grouped by the 'variety' column.

'''

print('Computing count and % of Instances')

print(mydf.groupby('variety').agg(mypl.count()).sort('count').with_columns
([

(mypl.col('count')/mypl.col('count').sum()).alias('mypercentage')

]))

Refer to the following figure depicting usage of the groupby concept:

Figure 10.4: Image depicting usage of groupby concept in polars

Note: The preceding code is covered in (program name:


Chap10_Example10.5.py).
We have also worked on the same code in Jupyter notebook under the
filename context_demo2.ipynb.

Importance of expressions in polars

Polar expressions are crucial because they let the user manipulate and
compute the data in a variety of ways. Within the Polars framework, they
offer a versatile and potent approach to modify, filter, aggregate, and
transform the data.

Expressions are crucial in Polars for the following reasons:

Data transformation: By using different operations, such as arithmetic


computations, string manipulations, date/time operations, and more,
expressions allow the user to transform the data. This also enables the
creation of new columns or edit existing ones in accordance with the
needs.
Filtering and selection: Using expressions, the user can filter and
choose particular rows or columns from the dataset according to the
pre-defined criteria. Expressions can be used to create sophisticated
data selection rules and build complex filtering criteria.
Aggregation: The user can aggregate the data by using expressions to
compute sums, averages, counts, maximums, minimums, and other
aggregation actions. This helps produce aggregated statistics and data
summaries.
Joining and combining data: Expressions can be used to join various datasets together based on shared columns or to combine data from many sources by using column-wise procedures (a small join sketch is given after this list).
Performance optimization: Polars makes use of expressions' lazy evaluation to optimize and carry out computations more quickly. Polars may optimize the execution plan and carry out operations in a lazy and parallelized way by defining your operations as expressions, leading to faster and more effective data processing.
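As a small illustration of the joining point above, the following minimal sketch (using two hypothetical dataframes rather than the iris dataset) combines data on a shared column with the join() method:

import polars as mypl

mydf_left = mypl.DataFrame({"mykey": [1, 2, 3],
                            "myvalue": ["a", "b", "c"]})
mydf_right = mypl.DataFrame({"mykey": [2, 3, 4],
                             "myscore": [20, 30, 40]})

# inner join on the shared 'mykey' column; only keys 2 and 3 survive
print(mydf_left.join(mydf_right, on="mykey", how="inner"))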

Let us view an example as shown in the following example program name:


Chap10_Example10.6.py:

import polars as mypl


print("Reading the iris.csv dataset and storing the contents in the variable
mydf")

mydf = mypl.read_csv("iris.csv")

# Perform multiple expressions using mydf.select

selected_data = mydf.select([

mypl.col("sepal.length"),

mypl.col("petal.length"),

mypl.col("variety")

]).filter(mypl.col("sepal.length") > 5.0).sort("petal.length")

# Display the selected and filtered data

print(selected_data)

The following figure shows the output of the code:

Figure 10.5: Image depicting usage of expressions in polars

Note: The preceding code is covered in (program name:


Chap10_Example10.6.py).

Concept of lazy API in polars

The lazy API in Polars refers to the lazy evaluation mechanism provided by
the library. It allows you to build a computation plan without executing it
immediately. Instead, the operations are executed only when the result is
explicitly requested or when an action is performed. The lazy API is useful
for several reasons:
Optimization: The lazy evaluation allows polars to optimize the
execution plan by rearranging and optimizing the operations. This can
lead to more efficient computation and improved performance.
Memory efficiency: With lazy evaluation, intermediate results are not
materialized unless explicitly requested. This reduces memory usage,
especially when dealing with large datasets or complex computations.
Flexibility: Building complex computing pipelines is made possible by
the lazy APIʼs flexibility. A series of operations and transformations
can be defined without having to carry them out immediately. This
makes it simple for the user to adjust or expand the computation
strategy as necessary.
Regarding its fundamental computational model and method of data
processing, polars' lazy API differs from pandas in several key ways,
as explained in the following points:
Evaluation model: In pandas, operations are carried out eagerly, which means that outcomes are promptly computed and materialized in memory. Polars, on the other hand, uses lazy evaluation, in which operations are built into a computational plan rather than being carried out immediately. Only when a result is specifically requested or when an action is taken is the plan executed.
Optimization: In order to optimize the execution plan, Polars makes
use of lazy evaluation. Based on the available optimizations, such as
predicate pushdown, and projection pushdown, it rearranges and
optimizes the processes. When compared to pandas, this may lead to
more effective computation and improved performance.
Lazy evaluation: It prevents intermediate results from materializing
unless specifically requested. This uses less memory, particularly
when working with large data sets or difficult calculations. Pandas, in
contrast, eagerly materialize intermediate results, which may result in
higher memory usage.
Parallelism: Polarsʼ lazy evaluation methodology facilitates
parallelism in the execution of operations. Making use of modern
multi-core CPUs, it can parallelize computations automatically across
multiple threads or cores. When compared to pandas, this may result
in faster processing of data, especially for computationally demanding
tasks.

Let us view an example of lazy evaluation as demonstrated in the following


example program name: Chap10_Example10.7.py:

import polars as mypl

print("Reading the iris.csv dataset and storing the contents in the variable
mydf")

mydf = mypl.read_csv("iris.csv")

# Define a lazy computation plan

mylazy_plan = mydf.lazy().select([

mypl.col("sepal.length"),

mypl.col("petal.length"),

mypl.col("variety")

]).filter(mypl.col("sepal.length") > 5.0).sort("petal.length")

# Execute the computation plan and get the result

myresult = mylazy_plan.collect()

# Display the result

print(myresult)

The following figure shows the output of the code:

Figure 10.6: Image depicting usage of lazy evaluation in polars


Note: The preceding code is covered in (program name:
Chap10_Example10.7.py).

In this example, we read the iris.csv dataset into a dataframe called mydf .
Then, we define a lazy computation plan by using the .lazy() method on the
dataframe. We chained multiple operations such as select, filter, and sort to
form the computation plan. However, the plan is not executed immediately.
To execute the plan and obtain the result, we call .collect() on the lazy plan.
This triggers the execution of the computation plan, and the result is stored
in the myresult variable. Finally, we display the result. By using the lazy
API, we can define a computation plan without immediately executing it.
This allows polars to optimize the plan and execute it efficiently when
needed.

Now, a question arises: the previous example Chap10_Example10.6.py and the current example deliver the same output, so what is the difference? An important point to note is that the current example uses lazy evaluation and a separate computation plan, while the previous example performs the operations directly on the dataframe.

Note: Streaming means that instead of processing the data all at once, Polars can execute the query in batches, allowing you to process datasets that are larger than memory. However, streaming is still in development mode, so it is not a stable method and is not available for all functions/expressions/methods in the polars library.

In the given code Chap10_Example10.7.py, if we write the line as the


following:

myresult = mylazy_plan.collect(streaming=True)

Then, the collected results will be returned in a streaming fashion, allowing


for real-time processing or analysis of the data as it becomes available.
Overall, the code collects the results of a lazy execution plan in a streaming
manner, enabling real-time processing of the data.
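A minimal sketch of this streaming pattern is shown below; it assumes the same iris.csv file and uses scan_csv(), which builds the lazy query directly from the file instead of reading it eagerly first:

import polars as mypl

# scan_csv builds a lazy query straight from the file instead of reading eagerly
mylazy_plan = (
    mypl.scan_csv("iris.csv")
    .filter(mypl.col("sepal.length") > 5.0)
    .select([mypl.col("variety"), mypl.col("petal.length")])
)

# execute the optimized plan in batches (streaming) rather than all at once
myresult = mylazy_plan.collect(streaming=True)
print(myresult)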

Conclusion
In this chapter, readers embarked on a captivating exploration of Polars, a
versatile Python library for data manipulation, unveiling its vast potential
and functionalities. From the outset, we set out to distinguish between
pandas and polars, highlighting the unique strengths and capabilities that
polars offer. We then discussed polars data types, breaking them down into
numeric, nested, temporal, and other groups, providing a solid foundation
for understanding the underlying data structures. As the chapter progressed,
readers gained insights into polarsʼ core data structures, namely series and
dataframes. Armed with this knowledge, they acquired the skills necessary
to work with data effectively, setting the stage for more advanced
operations. We introduced the concept of contexts in polars, focusing on
essential operations, selection, and filtering techniques, ensuring readers
were well-prepared to manipulate data powerfully and efficiently. The
unveiling of the groupby concept further enhanced readersʼ data-handling
capabilities, allowing them to harness polars for data aggregation and
summarization, a crucial aspect of data analysis. Expressions emerged as a
key feature, enabling readers to grasp their significance in facilitating
complex data transformations. Finally, in the end, the chapter meticulously
demystified the Lazy API in polars, showing how it can significantly
improve performance and resource utilization in data operations.

In the next chapter, we will explore the concept of expressions in polars and the integration of numpy in polars. We will also see a comprehensive comparison of operations between pandas and polars to understand which one is faster and better suited for different tasks.

Points to remember

Polars is a high-performance DataFrame library, written in the Rust language, that stores data in a columnar format.
Polars is a newer data analysis library for Python that is designed to be faster and more efficient than pandas.
Polars supports a variety of data types, including numeric, nested, temporal, and other data types.
Polars uses contexts to manage the execution of operations. A context is a set of operations that are executed together.
Polars provides a variety of contexts, such as the select(), filter(), and groupby() contexts. The select() context is used to select columns from a dataframe. The filter() context is used to filter rows from a dataframe. The groupby() context is used to group rows in a dataframe by a common value.
Polars supports a variety of basic operators, such as arithmetic operators, comparison operators, and logical operators.

Questions

1. What are the key differences between pandas and polars?


2. What are the various polars data types?
3. How do polars categorize data into the numeric group?
4. What is the significance of the nested group in polars?
5. What does the temporal group refer to in polars data types?
6. How are data structures organized in polars, specifically series and
dataframe?
7. What are the fundamental concepts of basic operations and the select
and filter concept in polars?
8. How does the groupby concept work in polars?
9. Why are expressions important in polars data manipulation?
10. What is the concept of the Lazy API in Polars and how does it
optimize data operations?
CHAPTER 11Exploring Expressions in Polars
Introduction

In the previous chapter, we discussed the disparities between pandas and


polars, highlighting polars enhanced performance in handling large
datasets. We explored polarsʼ distinct data types including numeric, nested,
temporal, and other groups, along with its fundamental data structures like
series and dataframe. Contexts in polars were elucidated, focusing on basic
operations, the select and filter concept, and the groupby functionality. We
emphasized the significance of expressions in polars and introduced the
concept of Lazy API, which optimizes computation by postponing
execution until necessary. In the current chapter, understanding the concept
of expressions in polars becomes imperative as we explore basic operators,
column selections, functions, casting, string manipulation, aggregation
techniques, handling missing data, folds, and working with lists and arrays,
while also examining the integration of numpy in polars. Furthermore, we
undertake a comprehensive comparison of operations between pandas and
polars to underscore the advantages and limitations of each framework in
different scenarios, aiding in informed decision-making for data processing
tasks.

Structure

In this chapter, we will discuss the following topics:

Concept of expressions in polars


Operations comparison between pandas and polars

Objectives

By the end of this chapter, the reader will cover the subtleties of
expressions in polars along with basic operators, column selections,
functions, casting, string operations, aggregation, handling missing data,
and leveraging folds, lists, and arrays. We will also introduce the
integration of numpy in polars, offering a bridge between these two
powerful libraries. Finally, readers will gain a deep understanding of when
different operations will be performed with polars and pandas, thus helping
to make informed choices when working with data in Python. By the end of
this chapter, readers will have acquired a comprehensive understanding of
polars, their data manipulation capabilities, and their unique features,
positioning them to work with data confidently and efficiently in a Python
environment.

Concept of expressions in polars

In Polars, expressions are like instructions that tell Polars what to do with the data. They make it easy to perform many different operations on data, such as adding numbers together or changing how it is organized. So, instead of writing long, complicated code, you can use expressions to get things done quickly.
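
As a quick illustration, here is a minimal sketch using the iris.csv dataset assumed throughout this chapter: the expression below merely describes the computation "take sepal.length and average it", and it only runs when handed to a context such as select():

import polars as mypl

mydf = mypl.read_csv("iris.csv")
# The expression is only a description; select() is the context that evaluates it
print(mydf.select(mypl.col("sepal.length").mean().alias("avg_sepal_length")))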

Basic operators

In Polars, basic operators can be used in conjunction with expressions to perform numerical and logical computations on the data. Refer to the following example Chap11_Example11.1.py for understanding basic operators:

import polars as mypl
print("Reading the iris.csv dataset and storing the contents in the variable mydf")
mydf = mypl.read_csv("iris.csv")
# numerical
mydf_numerical = mydf.select(
    mypl.col("sepal.length"),
    mypl.col("petal.length"),
    (mypl.col("sepal.length") + 5).alias("sepal.length + 5"),
    (mypl.col("sepal.length") - 5).alias("sepal.length - 5"),
    (mypl.col("sepal.length") * mypl.col("petal.length")).alias("sepal.length * petal.length"),
    (mypl.col("sepal.length") / mypl.col("petal.length")).alias("sepal.length / petal.length"),
)
print(mydf_numerical.head())
print('-'*50)
# logical
mydf_logical = mydf.select(
    mypl.col("sepal.width"),
    mypl.col("petal.width"),
    (mypl.col("sepal.width") > 3.2).alias("sepal.width_Greater"),
    (mypl.col("petal.width") == 0.2).alias("petal.width_check"),
)
print(mydf_logical.head())

Output:

Figure 11.1 (a): Output of Program Name: Chap11_Example11.1.py

The code is continued in the following figure:


Figure 11.1 (b): Part-2 Output of Program Name: Chap11_Example11.1.py

Note: The preceding code is covered in (program name:


Chap11_Example11.1.py).

In the given code example, numerical operations are performed by using


operators like addition (+), subtraction (-), multiplication (*), and division
(/). These operations are applied to the sepal.length column, resulting in
new columns that represent the original column modified by the respective
numerical operation. For example, (mypl.col("sepal.length") + 5) adds 5 to
each value in the sepal.length column and creates a new column named
sepal.length + 5. Logical operations are also used in conjunction with
expressions.

In the code, the logical operators used are greater than (>), and equal to
(==). These operators are applied to the sepal.width and petal.width
columns. The result is new columns that represent the outcome of the
logical operation. For example, (mypl.col("sepal.width") > 3.2) creates a
new column named sepal.width_Greater, which contains boolean values
indicating whether each corresponding value in the sepal.width column is
greater than 3.2. The resulting dataframes, mydf_numerical and
mydf_logical, contain the original columns along with the computed
columns resulting from the numerical and logical operations, respectively.
The head() method is used to display the first few rows of each dataframe,
providing a glimpse of the computed values. Overall, the code
demonstrates how basic operators can be used in conjunction with
expressions in Polars to perform numerical and logical computations on the
data, resulting in new columns that capture the modified or evaluated
values.

Column selections

Depending on the dataframe schema, a single expression can expand into a list of expressions, allowing multiple columns to be selected at once. In the example shown below, we will look into various approaches to handling column selection. Refer to the following example Chap11_Example11.2.py for a better understanding of column selections:
import polars as mypl
import polars.selectors as mysel
mydf = mypl.read_csv("iris.csv")
print("Case1: selection of all columns using '*' and all() method")
# M-1
print(mydf.select(mypl.col('*')).head())
# M-2
# print(mydf.select(mypl.all()).head()) # Just uncomment, run and view the output
print('-'*50)
print("Case2: Some columns can be excluded using exclude method")
print(mydf.select(mypl.col('*').exclude('sepal.length','petal.length')).head())
print('-'*50)
print("Case3: selection of column using polar data type")
print(mydf.select(mypl.col(mypl.Utf8).n_unique())) # will return the number of unique values in the column
print('-'*50)
print('Case4: selecting the float columns only')
print(mydf.select(mysel.float()))
print('-'*50)
print('Case5: selecting the non-float columns only')
print(mydf.select(~mysel.float()))

Output:

Figure 11.2 (a): Output of Program Name: Chap11_Example11.2.py

The code is continued in the following figure:

Figure 11.2 (b): Output of Program Name: Chap11_Example11.2.py

Refer to the following figure for a better understanding of output:

Figure 11.2 (c): Output of Program Name: Chap11_Example11.2.py

Note: The preceding code is covered in (program name:


Chap11_Example11.2.py).

Functions

Polars expressions have a large number of built-in functions that allow the
user to create complex queries without the need for user-defined functions.
There are too many to go through here, but we will cover some of the more
popular use cases. Refer to the following example Chap11_Example11.3.py
for understanding some built-in functions:

import polars as mypl
mydf = mypl.read_csv("iris.csv")
print('Case1: Usage of try except block if trying to use the same column multiple times in the expression')
try:
    print(mydf.select(mypl.col('sepal.length')+2, mypl.col('sepal.length')-2))
except Exception as mye:
    print("Exception...")
    print(mye)
print('-'*50)
print('Case2: Removing length word from sepal.length and petal.length columns only and exclude other columns and converting it into upper case using map_alias')
# The map_alias method is used here to modify the column names of the DataFrame mydf during the select operation.
print(mydf.select(mypl.col('*').exclude('variety','sepal.width','petal.width')
    .map_alias(lambda c: c.rstrip("length").replace(".","").upper())))
print('Case3: Support of if-else condition in polars using when, then and otherwise syntax')
'''
The when clause contains the predicate,
and if it evaluates to true, the then expression is applied;
otherwise, the otherwise expression is applied.
Also, the .lit function is used to create a literal value which will be used as
the result of the conditional expression within the when and otherwise clauses.
'''
mydf_conditional = mydf.select(
    mypl.col("sepal.length"),
    mypl.when(mypl.col("sepal.length") > 4.7)
    .then(mypl.lit(True))
    .otherwise(mypl.lit(False))
    .alias("Condition_Check_Sepal_Length"),
)
print(mydf_conditional.head())

Output:

Figure 11.3: Output of Program Name: Chap11_Example11.3.py

Note: The preceding code is covered in (program name:


Chap11_Example11.3.py).

Casting

The cast() method in Polars is used to convert the underlying datatype of a


column to a new one. Polars relies on Arrow for managing data in memory
and utilizes compute kernels in the rust implementation to perform the
conversion. When using the cast() method, there is a strict parameter that
determines how Polars behaves when it encounters a value that cannot be
converted from the source datatype to the target datatype. By default,
strict=True, which means that polars will throw an error and provide details
on the values that could not be cast. Alternatively, if strict=False, any
values that cannot be converted will be quietly converted to null. Refer to
the following example Chap11_Example11.4.py for using cast() method
concept:

import polars as mypl
mydf = mypl.DataFrame(
    {
        "myint_col": [10, 0, 12, -11, 14],
        "mybigint_col": [10010000, 5, 30030000, 40040000, 11],
        "myfloat_col": [12.0, 15.0, 18.0, 21.0, 24.0],
        "myfloat_decimal": [12.11, 15.22, 18.33, 21.44, 24.55],
        "myfloat_string": ["12.11", "15.22", "18.33", "21.44", "24.55"],
        "mydate_string": [
            "2023-11-01",
            "2023-11-02",
            "2023-11-03",
            "2023-11-04",
            "2023-11-05",
        ],
    }
)
print(mydf)
print('-'*50)
print('Case1: Casting operations between integer and float')
print(mydf.select(
    mypl.col("myint_col").cast(mypl.Float32).alias("myint_as_floats"),
    mypl.col("myfloat_decimal").cast(mypl.Int32).alias("myfloat_decimal_as_integers"),
))
print('-'*50)
print('Case2: Checking whether it can be casted to a smaller data type or not')
try:
    print(mydf.select(mypl.col("mybigint_col").cast(mypl.Int16)))
except Exception as mye:
    print(mye)
print('-'*50)
print('Case3: Checking whether it can be casted to a smaller data type or not by making strict parameter to false')
print(mydf.select(mypl.col("mybigint_col").cast(mypl.Int16, strict=False))) # The overflowing values will be assigned to null
print('-'*50)
print('Case4: Casting numerical data types to string and vice-versa')
print(mydf.select(
    mypl.col("myfloat_string").cast(mypl.Float64),
    mypl.col("myfloat_col").cast(mypl.Utf8)
))
print('-'*50)
print('Case5: Casting integer values to boolean')
print(mydf.select(
    mypl.col("myint_col").cast(mypl.Boolean)
))
print('-'*50)
print('Case6: Convert string values to date')
print(mydf.select(
    mypl.col("mydate_string").str.to_datetime("%Y-%m-%d")
))

Output:

Figure 11.4 (a): Output of Program Name: Chap11_Example11.4.py

The code is mentioned in the following figure:

Figure 11.4 (b): Output of Program Name: Chap11_Example11.4.py

Refer to the following figure to understand the output better:

Figure 11.4 (c): Output of program name: Chap11_Example11.4.py

Note: The preceding code is covered in (program name:


Chap11_Example11.4.py).
Strings

In order to optimize string processing and overcome the inefficiency caused


by unpredictable memory size, Polars leverages Arrow as its backend. With
Arrow, all strings are stored in a contiguous memory block. This approach
ensures cache-optimized and predictable string traversal for the CPU,
reducing the need for accessing random memory locations. To access the
str namespace of a column with Utf8 data type, the user can utilize the .str
attribute. Just observe the following example Chap11_Example11.5.py for
understanding strings in polars:

import polars as mypl
mydf = mypl.DataFrame({"Random_Data": ["Mango", "Pear and Fear", "Bana$na", None, 'ab 12 cd', '34 efgh 56']})
print('Case1: Checking of data ')
print(mydf.select(
    mypl.col("Random_Data"),
    mypl.col("Random_Data").str.contains("Man|na").alias("Myregex"),
    mypl.col("Random_Data").str.contains("Bana$", literal=True).alias("Myliteral"),
    mypl.col("Random_Data").str.starts_with("Bana").alias("Mystarts_with"),
    mypl.col("Random_Data").str.ends_with("Fear").alias("Myends_with")))
# The literal=True parameter ensures that the search is done as a literal string instead of a regular expression.
print('-'*50)
print('Case2: Extracting the characters')
print(mydf.select(
    mypl.col("Random_Data").str.extract_all(r"([^\d\s]+)").alias("extracted_chars"),
))

Output:

Figure 11.5: Output of Program Name: Chap11_Example11.5.py

Note: The preceding code is covered in (program name:


Chap11_Example11.5.py).

Besides the examples mentioned earlier, Polars provides a range of other


string manipulation methods for tasks like formatting, stripping, splitting,
and more. To discover these additional methods, you can refer to the API
documentation of your preferred programming language for polars.
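
As a quick, hedged illustration of a few such methods (the method names below match the older Polars releases used in this book; newer releases have renamed some of them, for example str.lengths() is now str.len_chars() and str.strip() is now str.strip_chars()):

import polars as mypl

mydf = mypl.DataFrame({"fruit": ["  Mango ", "Pear and Fear", "Bana$na"]})
print(mydf.select(
    mypl.col("fruit").str.strip().alias("stripped"),        # remove surrounding whitespace
    mypl.col("fruit").str.to_uppercase().alias("upper"),    # convert to upper case
    mypl.col("fruit").str.split(" ").alias("words"),         # split into a list column
    mypl.col("fruit").str.lengths().alias("n_chars"),        # number of characters per string
))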

Aggregation

Observe the following code snippet Chap11_Example11.6.py for looking at


the aggregation concept in polars:

import polars as mypl
dtypes = {
    "first_name": mypl.Categorical,
    "gender": mypl.Categorical,
    "type": mypl.Categorical,
    "district": mypl.Categorical
}
mydataset = mypl.read_csv("election_data.csv", dtypes=dtypes).with_columns(
    mypl.col("birthday").str.to_date(strict=False)
)
print(mydataset)
print('-'*50)
mydf = (
    mydataset.lazy()
    .groupby("last_name")
    .agg(
        mypl.count(),
        mypl.col("gender"),
        mypl.first("first_name"),
    )
    .sort("count", descending=True)
    .limit(5)
)
print(mydf.collect())

Output:

Figure 11.6: Output of Program Name: Chap11_Example11.6.py


Note: The preceding code is covered in (program name:
Chap11_Example11.6.py).

The code explanation is given below:

First, the code imports the polars library as mypl and defines the data
types for the columns in the dataset by using the dtypes dictionary.
Then, it reads a CSV file called election_data.csv by using the
read_csv method and assigns it to the variable mydataset. The dtypes
argument is used to specify the data types for each column in the
dataset.
The code converts the birthday column in mydataset from a string to a
date format by using the str.to_date(strict=False) method and adds it as
a new column to mydataset.
The mydataset is printed to the console.
A new dataframe called mydf is created by performing a series of
operations on mydataset. The operations include grouping the data by
the last_name column, aggregating the count of rows, selecting the
gender column, and selecting the first occurrence of the first_name
column.
The resulting mydf DataFrame is sorted in descending order based on
the count column and is limited to the top five rows.
The mydf dataframe is printed to the console by using the collect()
method.

In summary, the code reads a CSV file, performs operations on the dataset
by using the polars library, and outputs the results.

Note: We can even apply conditional logic, filtering, and sorting within aggregations, which we leave for you to explore; a small sketch of a conditional aggregation follows below.
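
As a hedged sketch of what such a conditional aggregation could look like (this continues from the election_data.csv example above; the value "F" for the gender column is only an illustrative placeholder, since the actual category labels depend on the dataset):

mydf_conditional = (
    mydataset.lazy()
    .groupby("last_name")
    .agg(
        mypl.count().alias("total"),
        # conditional aggregation: count the rows matching a condition
        (mypl.col("gender") == "F").sum().alias("female_count"),
        # filtering inside an aggregation: first first_name among matching rows
        mypl.col("first_name").filter(mypl.col("gender") == "F").first().alias("first_female_name"),
    )
    .sort("total", descending=True)
)
print(mydf_conditional.collect())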

Missing data

The columns in a dataframe (or a series) are composed of arrow arrays or


collections of arrow arrays following the Apache arrow format. In arrow
and polars, missing data is denoted by a null value, which applies to all data
types, including numerical values.
Additionally, polars permits the presence of Not a Number (NaN) values specifically for float columns. These NaN values are treated as floating point data rather than missing data. A missing value in Python can be manually defined with the None value. Refer to the following example Chap11_Example11.7.py for handling missing data:

import polars as mypl
mydf = mypl.DataFrame({"values": [11, None, 0, False, None, 12]})
print('Case1: Displaying the dataframe object')
print(mydf)
print('-'*50)
print('Case2: First piece of metadata')
mydf_null_count = mydf.null_count() # display of number of rows with null values in the columns
print(mydf_null_count)
print('-'*50)
print('Case3: validity bitmap used by is_null method indicating whether each data value is null or not')
print(mydf.select(
    mypl.col("values").is_null()
))
print('-'*50)
print('Case4: Filling the missing data with a specified literal value [using mypl.lit] using fill_null method')
print(mydf.select(
    mypl.col("values").fill_null(mypl.lit(3))
))
print('-'*50)
print('Case5: Filling the missing data with a strategy by setting it as backward here using fill_null method')
print(mydf.select(
    mypl.col("values").fill_null(strategy='backward')
))
print('-'*50)
print('Case6: Filling the missing data with an expression using fill_null method')
print(mydf.select(
    mypl.col("values").fill_null(mypl.mean("values")) # (11+0+0(False)+12)/4
))
print('-'*50)
print('Case7: Filling the missing data with an interpolation using interpolate function')
print(mydf.select(
    mypl.col("values").interpolate())
)
Output:

Figure 11.7 (a): Output of Program Name: Chap11_Example11.7.py

The code is continued in the following figure:

Figure 11.7 (b): Output of Program Name: Chap11_Example11.7.py

Refer to the following figure to understand the code better:

Figure 11.7 (c): Output of Program Name: Chap11_Example11.7.py

Note: The preceding code is covered in (program name:


Chap11_Example11.7.py).

Folds

Polars offers expressions and methods such as sum, min, mean, and so on, for performing horizontal aggregations. However, in cases where more complex aggregations are required, the default methods provided by polars may not be enough. This is where folds become useful. The fold expression operates on columns, providing maximum speed and efficient utilization of the data layout. It often executes in a vectorized manner, further enhancing performance. Refer to the following example Chap11_Example11.8.py for understanding the fold concept in polars:

import polars as mypl
mydf = mypl.DataFrame(
    {
        "list1": [4, 5, 6],
        "list2": [10, 100, 1000],
    }
)
print('Case1: Manual multiplication')
'''
- acc=mypl.lit(1): This sets the initial value of the accumulator (acc) to 1.
  The accumulator is used to keep track of the intermediate result during the fold operation.
- function=lambda acc, y: acc * y: This defines the function to be applied during the fold
  operation. In this case, it is a lambda function that multiplies the accumulator (acc)
  with each element (y) in the expression.
- exprs=mypl.all(): This specifies that the fold operation should be applied to all columns
  in the DataFrame.
The result of the fold operation is then selected using mydf.select().
'''
print(mydf.select(
    mypl.fold(acc=mypl.lit(1), function=lambda acc, y: acc * y, exprs=mypl.all())
    .alias("mymul")
))
print('-'*50)
print('Case2: Filtering all rows where each column value is > 5')
print(mydf.filter(
    mypl.fold(
        acc=mypl.lit(True),
        function=lambda acc, x: acc & x,
        exprs=mypl.all() > 5,
    )
))
print('-'*50)
print('Case3: Performing concatenation element wise')
print(mydf.select(mypl.concat_str(["list1", "list2"])))
print('-'*50)
print('Case4: Performing addition element wise')
print(mydf.select(mypl.sum(["list1", "list2"])))

Output:

Figure 11.8 (a): Output of Program Name: Chap11_Example11.8.py

The output is mentioned in the following figure:

Figure 11.8 (b): Output of Program Name: Chap11_Example11.8.py

Note: The preceding code is covered in (program name:


Chap11_Example11.8.py).

List and arrays


Polars has first-class support for list columns, that is, columns where each row is a list of homogeneous elements of varying lengths. Polars also has an Array datatype, which is analogous to numpyʼs ndarray objects, where the length is identical across rows. Refer to the following example Chap11_Example11.9.py for understanding the list and array concepts in polars:

import polars as mypl
print("Creating a list of individual scores made by 5 batsman in 6 innings")
mydf = mypl.DataFrame(
    {
        "batsman": ["Batsman " + str(x) for x in range(5)],
        "runs": [
            "12 45 78 100 156 38",
            "18 116 101 12 70 40",
            "0 35 16 6 12 10",
            "57 15 7 8 10 24",
            "14 16 22 24 45 103",
        ],
    }
)
print(mydf)
print('-'*50)
print('Case1: Creating a list column')
print(mydf.with_columns(mypl.col("runs").str.split(" ")))
print('-'*50)
print('Case2: Performing operations on list columns')
print(mydf.with_columns(mypl.col("runs").str.split(" ")).with_columns(
    mypl.col("runs").list.head(2).alias("MyTop2"),
    mypl.col("runs").list.slice(-2, 2).alias("Mybottom_2"),)) # tail(2) can also be used
print('-'*50)
print('Case4: Computing the Rank of each batsman based on total number of runs')
mydf = mydf.with_columns(
    mypl.col("runs")
    .str.split(" ")
    # note: scores above 127 overflow Int8 and become null when strict=False
    .list.eval(mypl.element().cast(mypl.Int8, strict=False))
    .list.sum()
    .alias("Total"),
)
# Compute the rank of each batsman based on total number of runs
print(mydf.with_columns(mypl.col("Total").rank(method="dense").alias("Rank")))
print("Case5: Array data type contains same number of elements per row")
print(mypl.DataFrame(
    [
        mypl.Series("MyArr1", [[11, 31], [12, 15]]),
        mypl.Series("MyArr2", [[11, 17, 31], [81, 11, 10]]),
    ],
    schema={
        "MyArr1": mypl.Array(inner=mypl.Int8, width=2),
        "MyArr2": mypl.Array(inner=mypl.Int8, width=3),
    },
))

Output:

Figure 11.9 (a): Output of Program Name: Chap11_Example11.9.py

The output is continued in the following figure:

Figure 11.9 (b): Output of Program Name: Chap11_Example11.9.py

Refer to the following figure for a better understanding of the code output:

Figure 11.9 (c): Output of program name: Chap11_Example11.9.py


Note: The preceding code is covered in (program name:
Chap11_Example11.9.py).

Usage of numpy in polars

Polars supports NumPy ufuncs (Provided in the link:


https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs),
which means that if a function is not provided by polars, we can use numpy
and still benefit from fast columnar operations through the numpy API.
Refer to the following example Chap11_Example11.10.py for usage of
numpy concept in polars:

import polars as mypl
import numpy as mynp
mydf = mypl.DataFrame({"mynum1": [2, 4, 6], "mynum2": [1, 3, 5]})
# Element-wise functions
print(mydf.select(
    mynp.exp(mypl.col("mynum1")).alias("_exp1"),
    mynp.exp(mypl.col("mynum2")).alias("_exp2")
))
print('-'*50)
print(mydf.select(
    mynp.sin(mypl.col("mynum1")).alias("_sin1"),
    mynp.sin(mypl.col("mynum2")).alias("_sin2")
))
print("-"*50)
# Vectorized function
print(mydf.select(
    mynp.degrees(mypl.col("mynum1")).alias("_degrees1"),
    mynp.degrees(mypl.col("mynum2")).alias("_degrees2")
))
# We can even perform aggregate functions like sum, min, max, mean, median

Output:

Figure 11.10: Output of Program Name: Chap11_Example11.10.py

Note: The preceding code is covered in (program name:


Chap11_Example11.10.py).

Operations comparison between pandas and polars

The operations comparison between pandas and polars is illustrated in example Chap11_Example11.11.py. This example contains 11 cases of comparison between pandas and polars, along with the time taken to read a particular CSV file. Since the code is long, it is not reproduced in this book but is provided in the code bundle. The output of Chap11_Example11.11.py is provided in the text file Output_Chap11_Example11.11.txt.

Note: The preceding code is covered in (program name:


Chap11_Example11.11.py).

By observing the output, it can be seen that pandas took more time to read the CSV file than polars, indicating that polarsʼ performance is better than that of pandas. The execution time may change on each run, but polars will consistently take less time to execute. A minimal sketch of such a timing comparison is shown below.
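
For readers who want to reproduce a simple version of this comparison themselves, the following is a minimal sketch (not the code-bundle version); the file name sample.csv is a placeholder for any reasonably large CSV file on your machine:

import time
import pandas as mypd
import polars as mypl

# Time pandas reading the CSV file (placeholder file name)
mystart = time.perf_counter()
mydf_pandas = mypd.read_csv("sample.csv")
print("pandas read_csv took", time.perf_counter() - mystart, "seconds")

# Time polars reading the same file
mystart = time.perf_counter()
mydf_polars = mypl.read_csv("sample.csv")
print("polars read_csv took", time.perf_counter() - mystart, "seconds")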
Conclusion

In this chapter, readers delved into the intricacies of expressions in Polars,


covering a wide range of topics including basic operators, column
selections, functions, casting, string operations, aggregation, and handling
missing data. The integration of NumPy into Polars expanded its
capabilities, facilitating seamless interaction between these powerful
libraries. Through comparisons with Pandas, readers gained a deeper
understanding, empowering them to make informed decisions when
working with Python data. As the chapter concluded, readers emerged
equipped with a comprehensive grasp of Polars, its data manipulation
prowess, and unique functionalities.

In the upcoming chapter, we will explore Seaborn, an essential tool for data
visualization tasks. Its ability to generate informative plots with minimal
code addresses the need to effectively explore and communicate insights
from complex datasets. Seabornʼs higher-level interface simplifies the
creation of intricate statistical plots compared to Matplotlib. Offering a
variety of plot types like Kernel Density Estimate (KDE) plots, Violin
plots, Line plots, Scatter plots, Joint plots, and Facet Grids, Seaborn caters
to diverse aspects of data analysis.

Points to remember

Polars expressions are used to represent data transformations.


Expressions can be used to select columns, filter data, perform
aggregations, and more.
Polars expressions can be used to select columns from a dataframe.
Columns can be selected by name or by position.
Polars expressions can be used to perform aggregations on data.
Aggregations can be used to calculate summary statistics, such as the
mean, median, and mode.
The fold expression operates on columns, providing maximum speed
and efficient utilization of the data layout.
A deep understanding of the similarities and differences in data
manipulation operations between pandas and polars is to be
developed, aiding in informed decision-making when working with
data in Python.
Polars is a good choice for data analysis tasks when speed and memory usage are critical. It is also a good choice for users familiar with pandas, as the API is similar. However, polars is still under development, so it may not have all of pandasʼs features yet.

Questions

1. What are the key components of expressions in polars, including basic


operators, column selections, and functions?
2. How can casting be used within the context of polars data
manipulation?
3. How do polars handle string operations and aggregation of data?
4. What strategies are employed in polars to deal with missing data?
5. How are folds, lists, and arrays utilized in polars data manipulation?
6. How can numpy be integrated into polars, and what are its
applications within the library?
7. Explain data manipulation operations when comparing pandas and
polars?

Join our book’s Discord space

Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:

https://discord.bpbonline.com
CHAPTER 12
Exploring Seaborn Library for
Data Science in Python

Introduction
In this chapter, we will learn how to install seaborn using the pip command, that is, pip install seaborn. We will be importing and using seaborn in Python scripts using import seaborn as mysns. Before working with seaborn, the following libraries are to be installed:
Python version 3.6 or higher.
Numpy with version >= 1.13.3. We are using version 1.25.0.
Scipy with version >= 1.0.1. We are using version 1.11.1.
Matplotlib with version >= 2.1.2. We are using version 3.7.1.
Pandas with version >= 0.22.0. We are using version 2.1.0.
We are using version 0.12.2 of seaborn.

Note: Kindly note that in this chapter, there will be various example
codes with multiple color images via plots. Since this book will be a
paperback edition it is a must to kindly refer to the image bundle
parallelly, so that if any image the reader wants to view after going
through the code it can be easily referred. Also, there will be various
examples with different colors which will be a combination of primary
and secondary colors and will be difficult to explain with legends on
each graph. So, it is expected that the readers kindly open the image
bundle parallelly to view the output.

Structure
In this chapter, we will discuss the following topics:
Some basic statistical terms to know
Built in seaborn datasets
Plot Styling in seaborn
Concept of color palette in seaborn
Heatmap plot in seaborn
Box plot in seaborn

Objectives
By the end of this chapter, the reader will learn the concept of some basic
statistical terms, such as quantitative and qualitative variables, mean, and
their significance in data analysis. They will understand the distinction
between variables that represent numerical measurements (quantitative) and
those that represent categories or labels (qualitative). The reader will also be
familiar with various built-in datasets available in seaborn, gaining practical
exposure to real-world datasets for data visualization and analysis.
Furthermore, the reader will comprehend the principles of plot styling in
seaborn, learning how to customize the aesthetics of plots for better
presentation and interpretation. The concept of color palettes in seaborn,
including qualitative, sequential, and diverging palettes, will be covered,
enabling the reader to choose appropriate color schemes for different types
of data and visualizations. Finally, the reader will acquire the skills to create
various advanced plots in seaborn using box plots for depicting the
distribution of data.

Some basic statistical terms to know


It is important to know some basic terms in statistics. Here are some
fundamental statistical terms that provide a strong foundation for
understanding statistical methods and interpreting results. Let us have a look
at this.

Quantitative variables
These variables are numerical values that represent counts or measures, for example, weight, percentages, and so on. These variables are of two types, namely discrete and continuous variables. Discrete variables are always numeric and have a finite number of values between any two values. On the other hand, continuous variables are numeric variables that have an infinite number of values between any two values.

Qualitative variables
These variables contain a finite number of categories and according to some
characteristics, they are placed into distinct categories.
According to the number of variables, statistical data is classified as
univariate and bivariate. Univariate data consists of only one variable, and
we concentrate on only one variable at a time, whereas Bivariate data
involves two different variables that are being studied to explore the
relationship between them.

Built in seaborn datasets


There are built-in seaborn datasets that are loaded as pandas dataframes once seaborn is installed. Refer to the code Chap12_Example12.1.py for a better understanding:
import seaborn as mysns
print(mysns.get_dataset_names())
Output:
['anagrams', 'anscombe', 'attention',
'brain_networks', 'car_crashes', 'diamonds',
'dots', 'dowjones', 'exercise', 'flights', 'fmri',
'geyser', 'glue', 'healthexp', 'iris', 'mpg',
'penguins', 'planets', 'seaice', 'taxis', 'tips',
'titanic']

Note: The preceding code is covered in (program name:


Chap12_Example12.1.py).

Whenever there is a requirement to load built-in datasets from the seaborn library, we will be using the load_dataset() function.

The syntax of load_dataset() is mentioned below:


load_dataset(name, cache=True, data_home=None,
**kws)
Where each parameter description is as follows:
name: Specifies the name of the dataset to load.
data_home: Defines a custom directory path to store or read the
dataset from.
cache: Determines whether to cache the dataset locally for faster
future access (default is usually True).
**kws: Allows additional keyword arguments for dataset-specific
options or configurations.
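
As a brief, hedged usage sketch (the custom directory path shown is purely illustrative):

import seaborn as mysns

# Load the built-in tips dataset as a pandas dataframe, caching it locally
mytips = mysns.load_dataset("tips", cache=True)
print(mytips.head())

# Optionally, store/read the cached data from a custom directory (illustrative path)
# mytips = mysns.load_dataset("tips", cache=True, data_home="C:/seaborn_data")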

Plot styling in Seaborn


Seaborn splits the matplotlib parameters into two independent groups. The first group controls the aesthetic plot style, and the second group controls the scaling of various elements of the plot.
The first group is used to customize the look of the plot: the plot style can be used to change the background plot color, tick mark color, label color, type of font, color of the text, and so on, by using the set_style() and axes_style() functions.

Refer to the following example Chap12_Example12.2.py for the first group:
import seaborn as mysns
import matplotlib.pyplot as myplt
# print(help(mysns.load_dataset))
# axes_style() returns a dictionary of rc parameters containing the current style settings
# for seaborn plots. Here, rc parameters in seaborn are settings that control the appearance
# and style of plots by adjusting matplotlibʼs runtime configuration options.
print(mysns.axes_style())
'''
Using set_style() sets the plot theme with the default theme as darkgrid.
The other themes available are whitegrid, dark, white, and ticks.
'''
print('-'*50)
mysns_dset = mysns.load_dataset('geyser')  # EG2_T1
print(mysns_dset.head())  # EG2_T2
mysns.set_style(style='ticks', rc={'axes.edgecolor': 'r', 'text.color': 'g',
                                   'xtick.color': 'b', 'ytick.color': 'y'})  # EG2_T3
mysns.scatterplot(x='duration', y='waiting', data=mysns_dset)  # EG2_T4
myplt.title('Selecting Ticks theme')  # EG2_T5
myplt.show()  # EG2_T6
Output:
{'axes.facecolor': 'white', 'axes.edgecolor':
'black', 'axes.grid': False, 'axes.axisbelow':
'line', 'axes.labelcolor': 'black',
'figure.facecolor': 'white', 'grid.color':
'#b0b0b0', 'grid.linestyle': '-', 'text.color':
'black', 'xtick.color': 'black', 'ytick.color':
'black', 'xtick.direction': 'out',
'ytick.direction': 'out', 'lines.solid_capstyle':
<CapStyle.projecting: 'projecting'>,
'patch.edgecolor': 'black',
'patch.force_edgecolor': False, 'image.cmap':
'viridis', 'font.family': ['sans-serif'],
'font.sans-serif': ['DejaVu Sans', 'Bitstream Vera
Sans', 'Computer Modern Sans Serif', 'Lucida
Grande', 'Verdana', 'Geneva', 'Lucid', 'Arial',
'Helvetica', 'Avant Garde', 'sans-serif'],
'xtick.bottom': True, 'xtick.top': False,
'ytick.left': True, 'ytick.right': False,
'axes.spines.left': True, 'axes.spines.bottom':
True, 'axes.spines.right': True, 'axes.spines.top':
True}
--------------------------------------------------
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Refer to the following figure for a better understanding:
Figure 12.1: Output Plot of Duration vs. Waiting Time for geyser Dataset

Note: The preceding code is covered in (program name:


Chap12_Example12.2.py).

Now we shall view the code explanation of what we are performing:


In EG2_T1, we first load a dataset called geyser using the load_dataset function from the seaborn library and assign it to the variable mysns_dset.
In EG2_T2, the first few rows of the dataset are displayed using the head function.
In EG2_T3, the style of the plot is set to ticks by using the set_style function from the seaborn library. It also customizes the color of various elements of the plot by using the rc parameter.
In EG2_T4, a scatter plot is created using the scatterplot function from the seaborn library. It specifies the duration column as the x-axis and the waiting column as the y-axis and uses the mysns_dset dataset as the data source.
In EG2_T5, a title is added to the plot using the title function from the matplotlib library, specifying the title as Selecting Ticks theme.
In EG2_T6, the plot is displayed using the show function from the matplotlib library.
We will be covering all the themes in upcoming examples of this chapter.
The second group will be used for scaling the plot elements like changing of
line width, size of the text, label, and so on, by using functions like
set_context() and plotting_context() .

The default context in Seaborn is the notebook style, which can be modified
based on our needs. Seaborn provides four preset contexts that determine the
plot size and enables us to tailor the plot according to its presentation. These
preset contexts, arranged by relative size, include paper, notebook, talk, and
poster. To modify the size of plot components such as labels, ticks, markers,
line widths, and so on, we can employ the functions plotting_context() and
set_context(). Both functions accept identical arguments.

The plotting_context() function generates a dictionary of rc parameters that pertain to various plot elements, including the size of the label, tick, and marker. This function outputs an object that can be utilized within a with statement to temporarily modify the context parameters. The set_context() function is used when there is a need to set the plotting context parameters persistently.
Let us observe the code Chap12_Example12.3.py for exploring the above functions:
import seaborn as mysns
import matplotlib.pyplot as myplt
print(mysns.plotting_context())
mysns_dset = mysns.load_dataset('planets')  # EG3_P1
mysns.set_context(context='poster', rc={'axes.labelsize': 10.0, 'axes.titlesize': 15.0,
                                        'grid.linewidth': 3.0})  # EG3_P2
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)  # EG3_P3
myplt.title('poster context')  # EG3_P4
myplt.show()  # EG3_P5
Output:
{'font.size': 10.0, 'axes.labelsize': 'medium',
'axes.titlesize': 'large', 'xtick.labelsize':
'medium', 'ytick.labelsize': 'medium',
'legend.fontsize': 'medium',
'legend.title_fontsize': None, 'axes.linewidth':
0.8, 'grid.linewidth': 0.8, 'lines.linewidth': 1.5,
'lines.markersize': 6.0, 'patch.linewidth': 1.0,
'xtick.major.width': 0.8, 'ytick.major.width': 0.8,
'xtick.minor.width': 0.6, 'ytick.minor.width': 0.6,
'xtick.major.size': 3.5, 'ytick.major.size': 3.5,
'xtick.minor.size': 2.0, 'ytick.minor.size': 2.0}
Refer to the following figure for a better understanding:
Figure 12.2: Plot of Mass vs. Distance for planets Dataset with poster context

Note: The preceding code is covered in (program name:


Chap12_Example12.3.py).
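
Since Chap12_Example12.3.py only demonstrates set_context(), here is a small, hedged sketch of using plotting_context() as a temporary context through a with statement (the dataset and columns mirror the example above):

import seaborn as mysns
import matplotlib.pyplot as myplt

mysns_dset = mysns.load_dataset('planets')
# The context parameters apply only inside the with block and revert afterwards
with mysns.plotting_context("paper", font_scale=1.5):
    mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
    myplt.title('paper context (temporary)')
    myplt.show()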

The style and context parameters can be set in one step using the set() function; refer to the following code Chap12_Example12.4.py for the same:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns_dset = mysns.load_dataset('planets')
mysns.set(style='white', context='talk', rc={'axes.edgecolor': 'g', 'axes.titlesize': 10.0})
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
myplt.title('talk context')
myplt.show()
Output:
Figure 12.3: Plot of Mass vs. Distance for planets Dataset with talk context

Note: The preceding code is covered in (program name:


Chap12_Example12.4.py).

Without passing any arguments, the set() function can be called if you want to switch to the default seaborn settings (font='sans-serif', font_scale=1, context='notebook', style='darkgrid', palette='deep', color_codes=True).

A plot has four spines (borders on the plot sides), and by default the top and right spines are removed from the plot using the despine() function, as shown in the following code Chap12_Example12.5.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns_dset = mysns.load_dataset('planets')
mysns.set(style='ticks')
mysns.scatterplot(x='mass', y='distance', data=mysns_dset)
mysns.despine()
myplt.title('Only Left and Bottom spine')
myplt.show()
Output:
Figure 12.4: Mass vs. Distance for planets dataset with despine() function

Note: The preceding code is covered in (program name:


Chap12_Example12.5.py).

Concept of color palette in Seaborn


Let us explore some basic terms that any reader should know before
exploring the concept mentioned above. The set of available colors in
computer graphics is termed a palette. The set of pure colors namely Red,
Green, Blue, Orange, Purple, and Yellow within a color space is coined as
Hue (H). The strength or weakness of a color that depicts the color intensity
of an image is termed Saturation (S). The color saturation can vary from
zero (Grey) to 100% (bright). If brightness is zero, then the color is black
with 0% Lightness (L). If brightness is maximum, then the color is white
with 100% lightness. For 50% lightness, the color appears normal. A given
color is defined according to hue, saturation, and lightness components
through the HSL color model.
We can provide colors to the plots and add more aesthetic value to them with the help of the color_palette() function, which returns the list of colors (RGB tuples) defined in a color palette.
Refer to the following syntax and observe the code Chap12_Example12.6.py for its usage:
color_palette(palette=None, n_colors=None,
desat=None, as_cmap=False)

import seaborn as mysns
import matplotlib.pyplot as myplt
# print(help(mysns.color_palette))
mycolor_palette = mysns.color_palette()
print(mycolor_palette)
# This line calls the color_palette function from seaborn with the argument desat=0
# to create a color palette with no saturation. The palplot function then displays the
# color palette, plotting the colors array horizontally.
mysns.palplot(mysns.color_palette(desat=0))
myplt.show()
Output:
[(0.12156862745098039, 0.4666666666666667,
0.7058823529411765), (1.0, 0.4980392156862745,
0.054901960784313725), (0.17254901960784313,
0.6274509803921569, 0.17254901960784313),
(0.8392156862745098, 0.15294117647058825,
0.1568627450980392), (0.5803921568627451,
0.403921568627451, 0.7411764705882353),
(0.5490196078431373, 0.33725490196078434,
0.29411764705882354), (0.8901960784313725,
0.4666666666666667, 0.7607843137254902),
(0.4980392156862745, 0.4980392156862745,
0.4980392156862745), (0.7372549019607844,
0.7411764705882353, 0.13333333333333333),
(0.09019607843137255, 0.7450980392156863,
0.8117647058823529)]
The following figure depicts the output:

Figure 12.5: Depiction of the output when desat is set to value 0

Note: The preceding code is covered in (program name:


Chap12_Example12.6.py).

Now we will discuss the color palette types that exist for data visualization.

Qualitative palettes
We will use qualitative palettes to illustrate categorical data without a
particular order among categories. The desired number of colors can be obtained by passing a value to the n_colors parameter, as shown in the following code Chap12_Example12.7.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
mycolor_palette = mysns.color_palette(n_colors=6)
print(mycolor_palette)
# This line will plot by default 6 colors.
mysns.palplot(mycolor_palette)
myplt.show()
Output:
[(0.12156862745098039, 0.4666666666666667,
0.7058823529411765), (1.0, 0.4980392156862745,
0.054901960784313725), (0.17254901960784313,
0.6274509803921569, 0.17254901960784313),
(0.8392156862745098, 0.15294117647058825,
0.1568627450980392), (0.5803921568627451,
0.403921568627451, 0.7411764705882353),
(0.5490196078431373, 0.33725490196078434,
0.29411764705882354)]
Refer to the following figure for a better understanding:

Figure 12.6: Depiction of the output of Qualitative Palette

Note: The preceding code is covered in (program name:


Chap12_Example12.7.py).

The given code imports the seaborn library as mysns and the
matplotlib.pyplot library as myplt. It then generates a color palette using
the color_palette function from Seaborn, specifying that it should contain
six colors. The color palette is then printed. The code proceeds to plot the
color palette using seabornʼs palplot function, which displays the colors in
the palette. Finally, the plot is shown using matplotlib.pyplot library's show
function.
In qualitative palettes, there are hls and husl color models; refer to the following code Chap12_Example12.8.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
# By default n_colors=6 in hls_palette()
# A palette is generated with colors evenly distributed in the HLS (Hue, Lightness, Saturation) hue space.
# Values for parameters h, l and s should range between 0 and 1.
mysns.palplot(mysns.hls_palette(h=0.5, l=0.3, s=0.5))
myplt.show()
Output:

Figure 12.7: Depiction of the output with hls_palette() function

Note: The preceding code is covered in (program name:


Chap12_Example12.8.py).
Next, we shall view an alternative approach to the HSL color space. Refer to
the following example Chap12_Example12.9.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
# By default n_colors=6 in husl_palette()
# A palette is generated with colors evenly distributed in the HUSL (Hue, Saturation, Lightness) hue space.
# Values for parameters h, l and s should range between 0 and 1.
mysns.palplot(mysns.husl_palette(s=0.5, l=0.5))
myplt.show()
Output:

Figure 12.8: Depiction of the output with husl_palette() function

Note: The preceding code is covered in (program name:


Chap12_Example12.9.py).

Sequential palettes
With these palettes, numeric data can be effectively visualized, showcasing a progression from lower to higher values or vice versa using colors that transition from light to dark. Here, the main focus is on data that has larger values. An additional character s is appended to the color name passed to the color parameter for plotting the sequential palette; refer to the following code Chap12_Example12.10.py for clarity:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("Greens"))
myplt.show()
Output:

Figure 12.9: Sequential Color Palette Transitioning from Light Green to Bright Green

Note: The preceding code is covered in (program name:


Chap12_Example12.10.py).

By appending the suffix _r to the palette name, the order of colors in the
palette will be reversed, refer to the following code
Chap12_Example12.11.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("Blues_r"))
myplt.show()
Output:
Figure 12.10: Sequential Color Palette Transitioning from a brighter shade of blue to lighter blue

Note: The preceding code is covered in (program name:


Chap12_Example12.11.py).

Diverging palettes
By utilizing this color palette, we can distinguish between two different
colors to emphasize both high and low extreme values. Data used with this
color scheme has a natural mid-point such as zero, refer to the following
code Chap12_Example12.12.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
mysns.palplot(mysns.color_palette("BrBG",
n_colors=9))
myplt.show()
Output:

Figure 12.11: Depiction of the diverging color palette highlighting the extreme values: high and low
Note: The preceding code is covered in (program name:
Chap12_Example12.12.py).

Custom diverging palettes can also be generated by using the


diverging_palette() function as explained in the following code
Chap12_Example12.13.py along with the syntax.
Refer to the following syntax:
diverging_palette(h_neg, h_pos, s=75, l=50, sep=1,
n=6, center='light', as_cmap=False)

import seaborn as mysns


import matplotlib.pyplot as myplt
# print(help(mysns.diverging_palette))
mysns.palplot(mysns.diverging_palette(h_neg=250,
h_pos=300,n=7, s=60,l=50, sep=5,center='light'))
myplt.show()
Output:

Figure 12.12: Depiction of the diverging_palette() function usage

Note: The preceding code is covered in (program name:


Chap12_Example12.13.py).

The default matplotlib parameters can be changed using set_palette()


function to use the palette for all the plots as explained in the following code
Chap12_Example12.14.py. Refer to the following syntax for better
understanding:
set_palette(palette, n_colors=None, desat=None,
color_codes=False)
import seaborn as mysns
import matplotlib.pyplot as myplt
#print(help(mysns.set_palette))
mycolor_palette = mysns.hls_palette(h=0.5, s=0.6,
l=0.4)
mysns.set_palette(mycolor_palette, n_colors=7,
desat=0.7)
mysns_dset = mysns.load_dataset('diamonds')
print(mysns_dset.head())
mysns.barplot(x='cut',y='carat',data=mysns_dset)
myplt.show()
Output:
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
Refer to the following figure for clarity:
Figure 12.13: Depiction of the bar plot with set_palette() function usage

Note: The preceding code is covered in (program name:


Chap12_Example12.14.py).

Heatmap plot in Seaborn


Whenever there is a need to represent data values using colors by its
graphical representation, then we can go for a heatmap. Patterns can be
identified, and areas can be concentrated using a heatmap. We will be
viewing the relationship between two variables which can be plotted on the x
and y axis and can be either categorical or numerical values. We will be
using the heatmap() function of the Seaborn library for plotting. A color
scale is required that blends from one color to another to represent high and
low values when dealing with numerical data. On the other hand, categorical
color palettes will be required to represent categorical data. Using a single
cell contained in a rectangular grid, we can represent each data value in our
table. Observe the following code Chap12_Example12.15.py for its application.
Syntax of heatmap() is mentioned as follows:
heatmap(data, *, vmin=None, vmax=None, cmap=None,
center=None, robust=False, annot=None, fmt='.2g',
annot_kws=None, linewidths=0, linecolor='white',
cbar=True, cbar_kws=None, cbar_ax=None,
square=False, xticklabels='auto',
yticklabels='auto', mask=None, ax=None, **kwargs)
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
diamonds_datset = mysns.load_dataset('diamonds')
print(diamonds_datset.head(10))
print('-'*50)
# using pivot_table function for converting into a format useful for analysis
mydiamonds = mypd.pivot_table(data=diamonds_datset, index='cut', columns='color', values='depth')
print(mydiamonds)
print('-'*50)
# we will be having a heatmap that has transformed the numerical values
# in the mydiamonds dataframe into different colors
# print(help(mysns.heatmap))
mysns.heatmap(data=mydiamonds)
myplt.title('diamonds data')
myplt.ylabel('cut')  # The index values from Ideal to Fair are used as y tick labels
myplt.xlabel('color')  # The column names, i.e. color, are used as x tick labels
myplt.show()
# heatmap will use dark colors to display low values, that is, values below 61.5 are represented with dark colors
# values from 63.5, 64 and 64.5 are represented with light colors
Output:
   carat        cut color clarity  depth  table  price     x     y     z
0   0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
5   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48
6   0.24  Very Good     I    VVS1   62.3   57.0    336  3.95  3.98  2.47
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39
--------------------------------------------------
color              D          E          F          G          H          I          J
cut
Ideal      61.678370  61.686677  61.675562  61.700164  61.733194  61.793598  61.821987
Premium    61.169245  61.175909  61.259974  61.279480  61.322034  61.328641  61.390099
Very Good  61.750165  61.730250  61.721627  61.841279  61.968366  61.934551  61.902065
Good       62.365710  62.203751  62.202310  62.526636  62.499573  62.475096  62.395765
Fair       64.048466  63.319643  63.508013  64.339809  64.585149  64.220571  64.357143
The following figure depicts the output:

Figure 12.14: Depiction of the output of heatmap with default parameters

Note: The preceding code is covered in (program name:


Chap12_Example12.15.py).

The provided Python code utilizes the seaborn and matplotlib libraries to
create a heatmap visualization of a diamond dataset. The dataset is loaded
using Seabornʼs load_dataset function, and the first ten rows are displayed
using head(10). The code then creates a pivot table (mydiamonds) using the
pandas library, where the cut column serves as the index, the color column
as columns, and the depth column as values. Finally, a heatmap is generated
using Seabornʼs heatmap function, displaying different colors to represent
numerical values. The title, ylabel, and xlabel are set using Matplotlib.
Dark colors are used for low depth values (below 61.5), while light colors
represent higher depth values (63.5, 64, and 64.5). The resulting heatmap
visually represents the depth of diamonds based on their cut and color.
Let us now customize the appearance of the heatmap with the change of
default settings as explained in the following code
Chap12_Example12.16.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
diamonds_datset = mysns.load_dataset('diamonds')
# using pivot_table function for converting into a format useful for analysis
mydiamonds = mypd.pivot_table(data=diamonds_datset, index='cut', columns='color', values='depth')
mysns.heatmap(data=mydiamonds, vmin=50, vmax=80, annot=True, fmt='0.2f', linewidth=0.4)
myplt.title('diamonds data')
myplt.ylabel('cut')
myplt.xlabel('color')
myplt.show()
Output:

Figure 12.15: Depiction of the output of heatmap with customizable parameters

Note: The preceding code is covered in (program name:


Chap12_Example12.16.py).

This Python code will use the seaborn library to generate a heatmap
visualizing the mydiamonds pivot table, where color intensity represents the
depth values within the specified range (vmin=50, vmax=80). The heatmap
includes annotations displaying the exact values and gridlines with a
linewidth of 0.4. The title diamonds data is set for the plot, with the y-axis
labeled as cut and the x-axis as color. Finally, the plot is displayed.
Also, a numerical value can be provided to the center argument, which will be used as the center of the colormap when plotting the data. So, we will be using the loc attribute on the dataframe mydiamonds, as shown in the following code Chap12_Example12.17.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd

diamonds_dataset = mysns.load_dataset('diamonds')
# Use pivot_table function for converting into a format useful for analysis
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
# Define a custom color palette ranging from dark blue to light blue
colors = mysns.color_palette("Blues", as_cmap=True)
# Plot the heatmap with the custom color palette
mysns.heatmap(data=mydiamonds, center=mydiamonds.loc['Good', 'H'], cmap=colors, annot=True, fmt="0.2f")
myplt.title('Diamonds Data')
myplt.ylabel('Cut')
myplt.xlabel('Color')
myplt.show()
Output:

Figure 12.16: Depiction of the output of heatmap with the diverging color scheme

Note: The preceding code is covered in (program name: Chap12_Example12.17.py).
The cell corresponding to the cut Good and color H will be the new midpoint of the data. A diverging color scheme will be applied to this heatmap. This code defines a custom color palette ranging from dark blue to light blue using the seaborn library. It then plots a heatmap using the custom color palette, with the data specified as mydiamonds. The center of the heatmap is set to the value located at the row Good and column H in the mydiamonds dataset. The heatmap is annotated with the values from the dataset, rounded to two decimal places.
Also, we can infer from this dataset that the depth values for the Fair cut are higher than for the other cuts, while the Premium cut has the lowest values for each color, which is why the Premium row is color-coded in lighter shades.
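As a small variation (not one of the book's numbered programs), a ready-made diverging colormap such as Matplotlib's coolwarm can also be passed to cmap together with center. Assuming the mydiamonds pivot table and the imports from the previous example are still in scope, a minimal sketch would be:
# values below the chosen center appear in cool (blue) shades, values above in warm (red) shades
mysns.heatmap(data=mydiamonds, center=mydiamonds.loc['Good', 'H'], cmap='coolwarm', annot=True, fmt="0.2f")
myplt.title('Diamonds Data (diverging colormap)')
myplt.show()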

Now, suppose there is a requirement to display only the cells below the diagonal in the heatmap using the triu() method of the numpy library; refer to the following code Chap12_Example12.18.py for a better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
import numpy as mynp
diamonds_dataset = mysns.load_dataset('diamonds')
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
colors = mysns.color_palette("Blues", as_cmap=True)
mytriudata = mynp.triu(mydiamonds)
print(mytriudata)
# values will be plotted for those cells where the mask is False, i.e., 0
mysns.heatmap(data=mydiamonds, cmap=colors, annot=True, fmt="0.2f", cbar_kws={'orientation': 'horizontal'}, mask=mytriudata)
myplt.title('Diamonds Data')
myplt.ylabel('Cut')
myplt.xlabel('Color')
myplt.show()
Output:
[[61.6783698  61.68667692 61.67556194 61.7001638  61.73319422 61.79359771 61.82198661]
 [ 0.         61.17590929 61.25997426 61.27948016 61.3220339  61.32864146 61.39009901]
 [ 0.          0.         61.72162662 61.84127882 61.96836623 61.9345515  61.9020649 ]
 [ 0.          0.          0.         62.52663605 62.49957265 62.47509579 62.39576547]
 [ 0.          0.          0.          0.         64.58514851 64.22057143 64.35714286]]
Refer to the following figure for a better understanding:

Figure 12.17: Depiction of the output of heatmap with a display of lower triangle
Note: The preceding code is covered in (program name:
Chap12_Example12.18.py).

Also, we can display only the cells above the diagonal in the heatmap using the tril() method of the numpy library; the code Chap12_Example12.19.py for this is shown as follows:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
import numpy as mynp
diamonds_dataset = mysns.load_dataset('diamonds')
mydiamonds = mypd.pivot_table(data=diamonds_dataset, index='cut', columns='color', values='depth')
colors = mysns.color_palette("Blues", as_cmap=True)
mytrildata = mynp.tril(mydiamonds)
print(mytrildata)
# values will be plotted for the cells where the mask is False, i.e., 0
mysns.heatmap(data=mydiamonds, cmap=colors, annot=True, fmt="0.2f", cbar_kws={'orientation': 'horizontal'}, mask=mytrildata)
myplt.title('Diamonds Data')
myplt.ylabel('Cut')
myplt.xlabel('Color')
myplt.show()
Output:
[[61.6783698   0.          0.          0.          0.          0.          0.        ]
 [61.16924517 61.17590929  0.          0.          0.          0.          0.        ]
 [61.75016523 61.73025    61.72162662  0.          0.          0.          0.        ]
 [62.36570997 62.20375134 62.20231023 62.52663605  0.          0.          0.        ]
 [64.04846626 63.31964286 63.50801282 64.33980892 64.58514851  0.          0.        ]]
Refer to the following figure for a better understanding:

Figure 12.18: Depiction of the output of the heatmap with a display of the upper triangle

Note: The preceding code is covered in (program name: Chap12_Example12.19.py).

Let us perform the following example with the file mycovid_data.csv. We will find the country (on the y-axis) with the maximum confirmed cases, along with various other metrics on the x-axis, as shown in the following code Chap12_Example12.20.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
import pandas as mypd
mycovid_dataset = mypd.read_csv('mycovid_data.csv')
print(mycovid_dataset)
# Selecting only the desired columns
selected_columns = ['Confirmed', 'Deaths', 'Recovered', 'Active', 'Country']
selected_data = mycovid_dataset[selected_columns]
# Creating a pivot table for the heatmap
heatmap_data = selected_data.pivot_table(index='Country')
# Plotting the heatmap
mysns.heatmap(data=heatmap_data, cmap='YlOrRd', annot=True, fmt="0.2f", linewidth=0.4)
myplt.title('Covid Data HeatMap Analysis')
myplt.xlabel('Metrics')
myplt.ylabel('Country')
myplt.show()
Output:
     Country  Confirmed  Deaths  Recovered  ...  Confirmed last week  1 week change  1 week % increase             WHO Region
0    Iceland       1854      10       1823  ...                 1839             15               0.82                 Europe
1      India    1480073   33408     951166  ...              1155338         324735              28.11        South-East Asia
2  Indonesia     100303    4838      58173  ...                88214          12089              13.70        South-East Asia
3       Iran     293606   15912     255144  ...               276202          17404               6.30  Eastern Mediterranean
4       Iraq     112585    4458      77144  ...                94693          17892              18.89  Eastern Mediterranean
5    Ireland      25892    1764      23364  ...                25766            126               0.49                 Europe
6     Israel      63985     474      27133  ...                52003          11982              23.04                 Europe
7      Italy     246286   35112     198593  ...               244624           1662               0.68                 Europe

[8 rows x 15 columns]
Refer to the following figure for a better understanding:

Figure 12.19: Depiction of the Covid Data HeatMap Analysis

Note: The preceding code is covered in (program name: Chap12_Example12.20.py).
From the heatmap, we can conclude that India has the maximum number of confirmed cases and also the highest recoveries compared to the other countries. The color scheme progresses from yellow to orange and then to red: light yellow represents the minimum value and dark red represents the maximum value.
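If we want to verify this reading numerically rather than visually, we can ask pandas directly. This is an illustrative sketch, assuming the heatmap_data pivot table from Chap12_Example12.20.py is still available in the session:
# assumes heatmap_data from Chap12_Example12.20.py
print(heatmap_data['Confirmed'].idxmax())   # country with the maximum confirmed cases
print(heatmap_data['Recovered'].idxmax())   # country with the maximum recovered cases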

The box plot in seaborn


Box plots are used to graphically represent the spread or variability of a
distribution. The numerical data distribution can be visually represented with
a whisker or box plot. It measures the skewness, that is, the lack of
symmetry in data distribution. A boxplot in Seaborn is a graphical
representation of the distribution of a dataset. It displays the five-number
summary, which includes the minimum, first quartile (25th percentile),
median (50th percentile), third quartile (75th percentile), and maximum.
The median is a measure of central tendency in statistics. It is the middle
value of a data set when it is ordered from least to greatest. In other words, if
you have a set of numbers, the median is the one that falls exactly in the
middle when the numbers are arranged in numerical order.
If there is an even number of observations, then the average of the two
middle values will be the median. It is often used as an alternative to the
mean (average) when the data set has extreme values or outliers that could
skew the mean.
For example, in the set of numbers: 2, 5, 7, 10, 14, the median is 7 because it
is the middle value. If the set were 2, 5, 7, 10, the median would be the
average of 5 and 7, which is 6.
The box in the plot represents the InterQuartile Range (IQR), which is the
range between the first and third quartiles. The whiskers extend from the box
to the minimum and maximum values, excluding outliers.
Outliers in a box plot are data points that lie significantly outside the
whiskers. They are considered extreme values that do not follow the general
pattern of the dataset. Outliers may arise for a variety of reasons, including
measurement errors or truly unusual observations.
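To make these definitions concrete, the following minimal sketch (assuming only that numpy is installed; the sample values are purely illustrative) computes the quartiles, the IQR, and the conventional 1.5 × IQR whisker fences, and flags any value outside the fences as an outlier:
import numpy as mynp
mysample = mynp.array([2, 5, 7, 10, 14, 35])   # 35 is deliberately an extreme value
myq1, mymedian, myq3 = mynp.percentile(mysample, [25, 50, 75])
myiqr = myq3 - myq1
mylower_fence = myq1 - 1.5 * myiqr
myupper_fence = myq3 + 1.5 * myiqr
print(f"Q1={myq1}, Median={mymedian}, Q3={myq3}, IQR={myiqr}")
print(f"Whisker fences: [{mylower_fence}, {myupper_fence}]")
# any point outside the fences is treated as an outlier (here, 35)
print("Outliers:", mysample[(mysample < mylower_fence) | (mysample > myupper_fence)])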
In a box plot, a whisker is a line that extends from the box to represent the
range of the data. It typically indicates the minimum and maximum values
within a certain range. The length of the whisker can vary depending on the
specific rules or criteria used to determine the range of the data. Whiskers
can provide insights into the spread or variability of the dataset and help
identify potential outliers. Refer to the following figure for a better
understanding:

Figure 12.20: Depiction of a basic 5-number summary of boxplot

In a box plot, a symmetric distribution refers to a distribution where the median (50th percentile) is at the center of the box, and the whiskers are of equal length on both sides. This indicates that the data is evenly distributed around the median, resulting in a balanced or bell-shaped curve. Let us discuss skewed distribution.
A positively skewed or right-side distribution in a box plot means that the
tail of the distribution extends toward the right, indicating a longer tail on the
positive side. In this case, the median is closer to the lower end of the box,
that is, Q1, and the whisker on the right side is longer. This suggests that
there are more values on the lower end of the distribution, with a few
extreme values on the higher end. The data points on the left-hand side of
the median are clustered. The data points on the right-hand side of the
median are scattered.
Conversely, a negatively skewed or left-side distribution in a box plot has a
tail that extends toward the left, indicating a longer tail on the negative side.
Here, the median is closer to the upper end of the box, that is, Q3, and the
whisker on the left side is longer. This suggests that there are more values on
the higher end of the distribution, with a few extreme values on the lower
end.
In summary, the symmetry or skewness of a distribution in a box plot is
determined by the position of the median and the relative lengths of the
whiskers on each side. A symmetric distribution has an equal balance of values around the median, while positively or negatively skewed distributions have longer tails on one side, indicating an imbalance in the data.
Box plots are useful when comparing the distribution of different variables
or groups. They provide a visual summary of the dataʼs central tendency,
spread, and skewness. Box plots help identify outliers and understand the
overall shape of the distribution. They are particularly helpful in identifying
differences between groups or comparing distributions across different
categories. By displaying the five-number summary, box plots offer a
concise representation of the datasetʼs key statistical measures, making them
a valuable tool for data visualization and exploratory data analysis. Refer to
the following figure for a better understanding:

Figure 12.21: Depiction of the skewness of boxplot

Refer to the following syntax:


boxplot(data=None, *, x=None, y=None, hue=None,
order=None, hue_order=None, orient=None,
color=None, palette=None, saturation=0.75,
width=0.8, dodge=True, fliersize=5, linewidth=None,
whis=1.5, ax=None, **kwargs)
Let us explore the concept of a box plot with the following code
Chap12_Example12.21.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('healthexp')
#print(help(mysns.boxplot))
print(my_dataset.head())
mysns.boxplot(data=my_dataset,x='Spending_USD')
myplt.title('Expenditure Spend on Health')
myplt.show()
Output:
   Year        Country  Spending_USD  Life_Expectancy
0  1970        Germany       252.311             70.6
1  1970         France       192.143             72.2
2  1970  Great Britain       123.993             71.9
3  1970          Japan       150.437             72.0
4  1970            USA       326.961             70.9
The following figure shows the example:

Figure 12.22: Depiction of the box plot example

Note: The preceding code is covered in (program name: Chap12_Example12.21.py).

As shown in Figure 12.22, we can infer that the amount spent by the people of different countries during the period from 1970 to 2020 ranges from 123.99 USD, indicated by the left-hand side whisker, to approximately 8400 USD, indicated by the right-hand side whisker. The median is somewhere around 2296 USD, and it is closer to the lower end of the box, or Q1, which means that the data points on the left-hand side of the median are clustered. So, the median is pulled towards the left-hand side of the box. Also, the median is far away from the upper end of the box, or Q3, which means that the data points on the right-hand side of the median are scattered. Also, the whisker on the upper end of the box is longer, so the data distribution is skewed in the direction of the longer whisker. Here, the distribution is positively skewed or right-side skewed. Also, we can notice the dots located outside the right whisker, which are called outliers; they are the unusually large values in the data.
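As a quick numerical cross-check of this reading (not one of the book's numbered programs), the quartiles and extremes of the Spending_USD column can be printed directly. This is a minimal sketch, assuming the healthexp dataset is still loaded as my_dataset as in the previous program:
# assumes my_dataset = mysns.load_dataset('healthexp') from Chap12_Example12.21.py
print(my_dataset['Spending_USD'].quantile([0.25, 0.5, 0.75]))   # Q1, median, Q3
print(my_dataset['Spending_USD'].min(), my_dataset['Spending_USD'].max())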
Orientation can be made vertical by passing the parameter orient='v' to the boxplot function. Also, if we want to know the expenditure incurred for a specific country, such as the USA, we can filter the data as explained in the following code Chap12_Example12.22.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('healthexp')
myfiltered_df = my_dataset[my_dataset['Country'] == 'USA']
print(myfiltered_df.head())
mysns.boxplot(data=myfiltered_df, x='Spending_USD', orient="v")
myplt.title('Expenditure Spend on Health')
myplt.show()
Output:
Year Country Spending_USD Life_Expectancy
4 1970 USA 326.961 70.9
9 1971 USA 357.988 71.2
12 1972 USA 397.097 71.2
15 1973 USA 439.302 71.4
18 1974 USA 495.114 72.0
Refer to the following figure to understand the example:

Figure 12.23: Box plot example with orientation vertical and filtering for USA only
Note: The preceding code is covered in (program name:
Chap12_Example12.22.py).

Conclusion
In this chapter, we initially saw the fundamental building blocks of statistical
analysis, discussing the key terms that form the bedrock of data
interpretation. By the end of this comprehensive exploration, readers now
possess a solid understanding of basic statistical terms, notably quantitative
and qualitative variables. The distinction between variables representing
numerical measurements and those embodying categories or labels has been
elucidated, providing readers with foundational knowledge, crucial for
navigating the complexities of statistical datasets. Moving beyond
theoretical concepts, readers have been introduced to the practical side of
data exploration through various built-in datasets seamlessly integrated into
seaborn. This hands-on exposure equips them with the skills to navigate and
analyze real-world datasets, setting the stage for effective data visualization.
The journey through this chapter extends beyond mere data representation; it
ventures into the art of presentation with Seabornʼs plot styling principles.
Readers now understand how to customize the aesthetics of their
visualizations, enhancing both interpretability and visual appeal. The
concept of color palettes, encompassing qualitative, sequential, and
diverging palettes, has been unveiled, empowering readers to choose suitable
color schemes tailored to different types of data and visualization goals. We also explored heatmap and box plots with various examples.
In the next chapter, we shall explore additional seaborn plotting techniques
like KDE plot, scatter plot, violin plot, line plot, joint plot, and facet grid for
analyzing data trends and distributions using various examples.

Points to remember
Quantitative variables are numerical and measurable, representing quantities or amounts, while qualitative variables are categorical, representing qualities or characteristics.
Seaborn provides convenient datasets for practice and exploration, accessible directly through its library, eliminating the need for external data sources.
Plot styling in seaborn enhances visual appeal and offers customizable
parameters for plot aesthetics, ensuring tailored and impactful
visualizations.
Seaborn offers qualitative palettes designed for categorical data,
ensuring clear distinctions between different categories. Sequential
palettes smoothly transition colors for ordered data, while diverging
palettes emphasize extremes using contrasting colors.
Heatmap plot in seaborn displays data in a matrix format, with the
intensity of colors representing values. It is useful for identifying
patterns and correlations in large datasets.
Box plot in seaborn illustrates the distribution of data where key
components include the median, quartiles, and outliers. It is helpful for
identifying skewness and outliers in the data.

Questions
1. What is a seaborn library and what is it used for?
2. What is the definition of quantitative variables in statistics?
3. How do qualitative variables differ from quantitative variables?
4. What are some advantages of using built-in datasets in seaborn for
data visualization?
5. Can you provide examples of situations where using built-in datasets
is beneficial?
6. How does plot styling contribute to the overall aesthetics of data
visualizations?
7. What are some common styling parameters available in seaborn for
customizing plots?
8. Explain the significance of using color palettes in data visualization.
9. Differentiate between qualitative, sequential, and diverging color
palettes in seaborn.
10. In what scenarios is a heatmap plot useful in data analysis?
11. How does seaborn handle the representation of data in a heatmap?
12. What insights can be gained from a box plot?
13. How does seaborn generate a box plot, and what key components
does it illustrate?

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 13
Crafting Seaborn Plots: KDE, Line,
Violin and Facets

Introduction
Building on the foundation of quantitative and qualitative variables,
statistical concepts and basic seaborn plots from the previous chapter, this
chapter dives deeper into Seaborn visualizations. In the current chapter, it is
crucial to grasp additional seaborn plotting techniques like KDE plot, violin
plot, line plot, scatter plot, joint plot, and facet grid, as these tools offer
diverse perspectives for analyzing data trends, relationships, and
distributions, thus enabling comprehensive exploration and interpretation of
datasets across various dimensions.

Structure
In this chapter, we will discuss the following topics:
KDE plot in seaborn
The violin plot in seaborn
The line plot in seaborn
The scatter plot in seaborn
The joint plot in seaborn
Facet grid in seaborn

Objectives
By the end of this chapter, the reader will be equipped with the skills to create various advanced plots in seaborn: Kernel Density Estimation (KDE) plots for exploring the distribution of a single variable, violin plots for combining aspects of box plots and KDE plots, line plots for depicting trends, scatter plots for visually depicting the relationship between two variables, joint plots for combining different types of plots in a single grid, and FacetGrid for creating grids of subplots based on categorical variables. Readers will also gain proficiency in utilizing Seaborn to visualize data and extract meaningful insights from various types of datasets.

KDE plot in seaborn


A KDE plot is a visual tool used to estimate the probability density function
of a continuous random variable. It represents the distribution of the data by
smoothing out the individual data points and creating a smooth curve. The
advantage of a KDE plot over a histogram in Seaborn is that it provides a
smoother and more continuous representation of the data distribution. Unlike
a histogram, which divides the data into discrete bins, a KDE plot estimates
the underlying continuous distribution. This allows for a more nuanced
understanding of the data, especially when dealing with small or irregularly
spaced datasets.
It can be plotted using the kdeplot() function or the distplot() function by
making the hist parameter value false.
Let us view an example as demonstrated in Chap13_Example13.1.py
where we can depict why KDE plot is advantageous over histogram:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('tips')
mysns.histplot(data=my_dataset, x="total_bill",
kde=False)
myplt.title('Histogram Plot')
myplt.show()
Output:

Figure 13.1: Depiction of histogram plot of tips dataset

Note: The preceding code is covered in (program name: Chap13_Example13.1.py).

The syntax of histplot is shown as follows:


histplot(data=None, *, x=None, y=None, hue=None,
weights=None, stat='count', bins='auto',
binwidth=None, binrange=None, discrete=None,
cumulative=False, common_bins=True,
common_norm=True, multiple='layer', element='bars',
fill=True, shrink=1, kde=False, kde_kws=None,
line_kws=None, thresh=0, pthresh=None, pmax=None,
cbar=False, cbar_ax=None, cbar_kws=None,
palette=None, hue_order=None, hue_norm=None,
color=None, log_scale=None, legend=True, ax=None,
**kwargs)
In this example, we use the seaborn library to load the tips dataset. Then,
we plot a histogram of the total_bill column by using mysns.histplot().
The resulting plot will show bars representing the frequency of values falling
within each bin.
Let us increase the number of bins to 26 in the same example:
mysns.histplot(data=my_dataset, x="total_bill",
kde=False, bins=26)
Output:
Figure 13.2: hist plot with bins count as 26

We can observe that the histogram appearing in Figure 13.2 looks completely different from the histogram in Figure 13.1. So, changing the number of bins and the range affects the appearance of the histogram: adjusting the number of bins redistributes the data points, leading to a different visual representation and therefore to different interpretations of the same data. We can also conclude from Figure 13.2 that the bars in this histogram have gaps compared to Figure 13.1, so the plot can be misleading due to inappropriate starting and ending points on the x-axis. We will now use a KDE plot, removing this dependency on bin endpoints.
The syntax of kdeplot is mentioned along with the following code
Chap13_Example13.2.py:
kdeplot(data=None, *, x=None, y=None, hue=None,
weights=None, palette=None, hue_order=None,
hue_norm=None, color=None, fill=None,
multiple='layer', common_norm=True,
common_grid=False, cumulative=False,
bw_method='scott', bw_adjust=1, warn_singular=True,
log_scale=None, levels=10, thresh=0.05,
gridsize=200, cut=3, clip=None, legend=True,
cbar=False, cbar_ax=None, cbar_kws=None, ax=None,
**kwargs)
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('tips')
#print(help(mysns.kdeplot))
mysns.kdeplot(data=my_dataset, x="total_bill")
myplt.title('kded Plot')
myplt.show()
Output:
Figure 13.3: Depiction of univariate kde plot

Note: The preceding code is covered in (program name: Chap13_Example13.2.py).

In this updated code, we use mysns.kdeplot() instead of mysns.histplot() to create a KDE plot of the total_bill column. The resultant plot will exhibit a smooth curve, depicting the estimated probability density function of the data.
mysns.kdeplot(data=my_dataset, x="total_bill",
shade=True)
Output:
Figure 13.4: Univariate KDE plot with shade parameter set to True

On the Y-axis, the density is plotted by setting the parameter vertical to True.
mysns.kdeplot(data=my_dataset, x="total_bill",
shade=True, vertical=True)
Output:

Figure 13.5: Univariate kde plot with plotting on Y-axis and shade=True

So, the advantage of the KDE plot is that it provides a continuous and
smooth representation of the data distribution, allowing for a more detailed
understanding of the underlying patterns and shapes of the data. It can reveal
features such as multiple modes, skewness, and other characteristics that
may not be easily visible in a histogram. Up to this point, we have observed
univariate plots based on a single variable. Now, we shall see a bivariate plot
as demonstrated in the following code Chap13_Example13.3.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
mysns.kdeplot(data=my_dataset.loc[my_dataset.species == "versicolor"], x="sepal_length", y="petal_length")
myplt.title('Bivariate kde Plot')
myplt.show()
Output:

Figure 13.6: Bivariate kde plot

Note: The preceding code is covered in (program name: Chap13_Example13.3.py).

A contour plot displays a 2-D kernel density estimate. However, it can be challenging to discern which regions exhibit high density and which regions exhibit low density based solely on the plot. So, we will color code these contour lines by customizing this density plot.
Let us pass cmap = ‘Greens’ and then visualize the output:
mysns.kdeplot(data=my_dataset.loc[my_dataset.species == "versicolor"], x="sepal_length", y="petal_length", cmap='Greens')
Output:
Figure 13.7: Bivariate kde plot with cmap=’Greens’

Here, we have passed the color map Greens to the parameter cmap. In order to color the different levels of contour lines, a contour plot can use a color map. Greens is a sequential color map, well suited to numeric data, that progresses from low to high density by gradually increasing the darkness and color saturation. In Figure 13.7, the higher density values are represented by the darker colors, and the lower density values are represented by the contour lines with lighter colors.
We can also increase the number of levels in the contour plot, so that
important details may not be missed. Also, note that a greater number of
levels will make the plot look cluttered, and a color bar may be added for
annotating the color mapping in a bivariate plot. Just observe the following
line of code where we added n_levels=20:
mysns.kdeplot(data=my_dataset.loc[my_dataset.species == "versicolor"], x="sepal_length", y="petal_length", cmap='Greens', cbar=True, n_levels=20)
Output:

Figure 13.8: Bivariate kde plot with cmap=’Greens’, cbar=True and n_levels=20

Now, let us see the univariate plot using distplot() function. Refer to the
following syntax for a better understanding:
distplot(a=None, bins=None, hist=True, kde=True,
rug=False, fit=None, hist_kws=None, kde_kws=None,
rug_kws=None, fit_kws=None, color=None,
vertical=False, norm_hist=False, axlabel=None,
label=None, ax=None, x=None)
As per the documentation of Seaborn, this function has been deprecated and
will be removed in Seaborn v0.14.0. Observe the following code
Chap13_Example13.4.py for better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
#print(help(mysns.distplot))
mysns.distplot(my_dataset['petal_length'], hist=False)
myplt.title('Univariate kde Plot using distplot() function')
myplt.show()
Output:

Figure 13.9: Univariate kde plot using distplot() function

Note: The preceding code is covered in (program name: Chap13_Example13.4.py).

The violin plot in seaborn


Now, we know that a box plot shows the range, median, and quartiles, but the distribution of the dataset is not shown. If the distribution has multiple peaks, this cannot be figured out using a box plot, because a box plot cannot show how the data looks throughout its range.
A violin plot is a data visualization technique that merges elements of a box
plot and a kernel density plot (two symmetrical density plots on either side
of the box plot). A violin plot showcases the distribution of numerical data
among various categories or groups by illustrating the variations within the
data.
Unlike a box plot, a violin plot utilizes the width of the plot to offer a more
detailed depiction of the data distribution, thus providing an advantage in
visualizing the data. Summary statistics such as quartiles and outliers are
shown by a box plot. On the other hand, a violin plot also shows the kernel
density estimation, which provides information about the shape and density
of the data. This allows for a more comprehensive understanding of the data
distribution, including insights into multimodality (multiple peaks) and
asymmetry.
Additionally, a violin plot can handle uneven sample sizes and varying
widths of the distributions, making it useful for comparing distributions
across different categories or groups. It provides a visual summary of the
data distribution, allowing for easy identification of differences in central
tendency, spread, and shape.
Overall, the advantage of a violin plot is that it combines the summary
statistics of a box plot with the detailed density estimation of a kernel
density plot, providing a richer visualization of the data distribution. Refer to
the following figure for a better understanding:

Figure 13.10: Violin Plot

There is a box plot of the data inside each violin plot. The first and third
quartile represents the left and right side of the inner box. The median of the
data is represented by the small white circle which we see inside the box.
The minimum and maximum values in the data are represented by the thin
black lines extending from the box. The higher probability is represented by
the wider sections of the violin plot having a large cluster of values. Lower
probability is represented by the lower sections of the violin plot around a
small cluster of values. So, in the violin plot, the detailed distribution of the
data including peaks and positions is depicted as compared to the box plot.
Refer to the following syntax:
violinplot(data=None, *, x=None, y=None, hue=None,
order=None, hue_order=None, orient=None,
color=None, palette=None, saturation=0.75,
fill=True, inner='box', split=False, width=0.8,
dodge='auto', gap=0, linewidth=None,
linecolor='auto', cut=2, gridsize=100,
bw_method='scott', bw_adjust=1,
density_norm='area', common_norm=False,
hue_norm=None, formatter=None, log_scale=None,
native_scale=False, legend='auto', scale=
<deprecated>, scale_hue=<deprecated>, bw=
<deprecated>, inner_kws=None, ax=None, **kwargs)
Observe the following code Chap13_Example13.5.py for better
understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('exercise')
#print(help(mysns.violinplot))
mysns.violinplot(my_dataset['pulse'])
myplt.title('Violin Plot')
myplt.show()
Output:

Figure 13.11: Depiction of a violin plot

Note: The preceding code is covered in (program name: Chap13_Example13.5.py).

This Python code uses the seaborn library to create a violin plot based on the
exercise dataset. Seaborn is imported as mysns, and the dataset is loaded
using the load_dataset() function with the argument exercise. The code
then prints the help documentation for the violinplot() function, although it
is commented out. Subsequently, a violin plot is generated using the pulse
column of the dataset, which likely represents pulse rates. The violinplot()
function is used to visualize the distribution of pulse values, displaying a
combination of a kernel density estimate and a box plot. Finally, the plot is
given a title, Violin Plot, and is displayed to the user.
The violin plot is a fusion of a box plot and a density plot. We can observe
that on either side of the box plot, density plots are symmetric as they both
display the distribution of the column pulse and the y-axis displays the range
of values in the column pulse.
Let us analyze this violin plot with more detailed data representation to the
user with the help of the code Chap13_Example13.6.py mentioned as
follows:
import seaborn as mysns
import matplotlib.pyplot as myplt
# Load the exercise dataset from seaborn
my_dataset = mysns.load_dataset('exercise')
# Create a violin plot
mysns.violinplot(my_dataset['pulse'])
# Calculate summary statistics
summary_stats = my_dataset['pulse'].describe()
q1 = summary_stats['25%']
q3 = summary_stats['75%']
myiqr = q3 - q1
minimum = summary_stats['min']
maximum = summary_stats['max']
median = summary_stats['50%']
print(summary_stats['mean'])
# Calculate range excluding outliers
mylower_bound = q1 - 1.5 * myiqr
myupper_bound = q3 + 1.5 * myiqr
range_without_outliers = myupper_bound - mylower_bound
print(f"q1 value is {q1}")
print(f"q3 value is {q3}")
print(f"iqr value is {myiqr}")
print(f"minimum value is {minimum}")
print(f"maximum value is {maximum}")
print(f"median value is {median}")
print(f"lower_bound value is {mylower_bound}")
print(f"upper_bound value is {myupper_bound}")
print(f"range_without_outliers value is {range_without_outliers}")
# Add text annotations for summary statistics
# print(help(myplt.text))
myplt.text(0.1, q1 - 3, f'Q1: {q1}', ha='center', va='bottom', color='red')
myplt.text(0.1, q3 + 3, f'Q3: {q3}', ha='center', va='top', color='red')
myplt.text(0.1, q1 + 0.1 * myiqr, f'IQR: {myiqr}', ha='center', va='bottom', color='red')
myplt.text(0.1, minimum - 5, f'Min: {minimum}', ha='center', va='bottom', color='red')
myplt.text(0.1, maximum + 5, f'Max: {maximum}', ha='center', va='top', color='red')
myplt.text(0.1, median + 2, f'Median: {median}', ha='center', va='bottom', color='red')
myplt.text(0.1, myupper_bound + 2, f'Range (no outliers): {range_without_outliers}', ha='center', va='bottom', color='red')
# Set plot title
myplt.title('Violin Plot with Summary Statistics')
# Add grid
myplt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
myplt.show()
Output:
q1 value is 90.25
q3 value is 103.0
iqr value is 12.75
minimum value is 80.0
maximum value is 150.0
median value is 96.0
lower_bound value is 71.125
upper_bound value is 122.125
range_without_outliers value is 51.0
Refer to the following figure for a better understanding:

Figure 13.12: Violin plot detailed data representation

Note: The preceding code is covered in (program name: Chap13_Example13.6.py).
This Python code utilizes Seaborn and Matplotlib to create a violin plot of
the pulse data from the exercise dataset. Using Seabornʼs load_dataset()
function, the dataset is loaded, and a violin plot is generated to visualize the
distribution of pulse values. The code then calculates various summary
statistics, including the first quartile (Q1), third quartile (Q3), InterQuartile
Range (IQR), minimum, maximum, and median. Additionally, it calculates
the range excluding outliers based on a 1.5*IQR criterion. The calculated
values are printed for informational purposes. The Matplotlib library is used
to add text annotations to the violin plot, displaying the calculated statistics.
The plot is given a title, Violin Plot with Summary Statistics, and a grid is
added to enhance visualization. The final plot is displayed with the summary
statistics annotated for better interpretation.
Here, the median is closer to the lower end of the box, which means that the data points on the left-hand side of the median are clustered, and those on the right-hand side of the median are scattered. Overall, the distribution of the pulse
column is positively skewed since the whisker on the upper end of the box is
longer.
Here, the mean of the pulse column is 99.7 which is greater than the median
as it is obvious that the tail of the distribution pulls the mean due to the
presence of the large values in the pulse column. So, due to the presence of
extreme values mean is largely influenced as compared to median.
Violin plots are used for comparing distribution across different categories.
Let us explore one more example of the titanic dataset as mentioned in the
following code Chap13_Example13.7.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('titanic')
mysns.violinplot(data=my_dataset,
x='age',y='sex',hue='alive')
myplt.title('Violin Plot with comparison of
different categories')
myplt.show()
Output:

Figure 13.13: Violin plot with a comparison of different categories

Note: The preceding code is covered in (program name: Chap13_Example13.7.py).

From Figure 13.13, the violin plot in blue color represents the sex category
of Males and Females who are no longer alive. On the other hand, the violin
plot in orange color represents the sex category of the ones that are alive.
If we provide the inner parameter with the value points, as shown in the following line of code:
mysns.violinplot(data=my_dataset,
x='age',y='sex',hue='alive', inner='points')
All the data points in the distribution will be displayed, so we can observe the individual values in the dataset, as shown in the following figure:

Figure 13.14: Violin plot with inner parameter value as points

If we do not want points or box plots, then set the inner parameter as None.
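For instance, a minimal variation of the previous call (assuming the titanic dataset is still loaded as my_dataset and the imports from Chap13_Example13.7.py are in scope) would be:
# no inner markings: only the density outlines are drawn
mysns.violinplot(data=my_dataset, x='age', y='sex', hue='alive', inner=None)
myplt.show()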

Line plot in seaborn


In Seaborn, a line plot is a graphical representation that uses straight lines to
connect data points, facilitating visualization within a two-dimensional
context. It is especially helpful for displaying patterns and trends in data
across a time series or continuous interval. The lineplot() function in
Seaborn is used to construct the line plot.
The instances where line graphs come in handy include:
Time series data: They are typically plotted using line plots, in which
the variable of interest is plotted on the y-axis and time is represented
on the x-axis.
Trend analysis: Data trends can be found using line graphs. By
looking at the slope and direction of the lines, patterns that are
increasing, decreasing, or changing can be easily identified.
Comparisons: Line plots make it straightforward to compare multiple
lines, which helps to spot trends that differ or are similar among
various groups or conditions.
Interpolation: In order to provide a smooth representation of the data;
line plots can be used to estimate values between known data points.
Correlation and relationships: In order to see how two variables
relate to one another and comprehend how changes in one variable
affect changes in another, utilize line graphs.
When to use line plots in Seaborn depends on the nature of the data the user
has and the insights they want to gain.
This works well for identifying trends and patterns in continuous or
sequential data, such as time series. Plots such as bar or box plots may be
more suited if the user wishes to compare different groups or have
categorical data.
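For instance, for the time series case mentioned in the first bullet above, a minimal sketch using Seaborn's built-in flights dataset (which records monthly airline passengers by year) could look like this:
import seaborn as mysns
import matplotlib.pyplot as myplt
myflights = mysns.load_dataset('flights')
# by default, lineplot aggregates the twelve monthly values of each year into a mean line with a confidence band
mysns.lineplot(data=myflights, x='year', y='passengers')
myplt.title('Average monthly passengers per year (flights dataset)')
myplt.show()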
Refer to the following syntax for a better understanding:
lineplot(data=None, *, x=None, y=None, hue=None,
size=None, style=None, units=None, palette=None,
hue_order=None, hue_norm=None, sizes=None,
size_order=None, size_norm=None, dashes=True,
markers=None, style_order=None, estimator='mean',
errorbar=('ci', 95), n_boot=1000, seed=None,
orient='x', sort=True, err_style='band',
err_kws=None, legend='auto', ci='deprecated',
ax=None, **kwargs)
Observe the following code Chap13_Example13.8.py for demonstrating a line plot using seaborn:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
#print(help(mysns.lineplot))
# Create a lineplot
mysns.lineplot(data=my_dataset, x="sepal_length",
y="petal_length", hue="species") # F1
# Display the plot
myplt.show()
Output:

Figure 13.15: Fig depicting line plot with the iris dataset

Note: The preceding code is covered in (program name: Chap13_Example13.8.py).

We will be explaining the line of code with the tag name #F1.
The code calls the lineplot function from the mysns module. The data
parameter is set to my_dataset, which is the dataset used for the plot. The x
parameter is set to sepal_length, indicating that the values from the
sepal_length column in the dataset will be plotted on the x-axis. The y
parameter is set to petal_length, indicating that the values from the
petal_length column in the dataset will be plotted on the y-axis. The hue
parameter is set to species, indicating that different species will be
differentiated by different colors on the plot. The code executes and
generates the line plot based on the specified parameters.

The scatter plot in seaborn


A scatter plot is a graphical representation of the correlation between two
continuous variables. In Seaborn, a scatter plot can be created using the
scatterplot() function, refer to the following syntax:

scatterplot(data=None, *, x=None, y=None, hue=None, size=None, style=None, palette=None, hue_order=None, hue_norm=None, sizes=None, size_order=None, size_norm=None, markers=True, style_order=None, legend='auto', ax=None, **kwargs)
Scatter plots are useful when you want to understand the correlation or
pattern between two variables. They are particularly effective in identifying
trends, clusters, outliers, or any other patterns in the data.
Here are some common use cases for scatter plots:
Correlation analysis: Scatter plots can help the user determine the
strength and direction of the relationship between two variables. For
example, you can use scatter plots to examine how temperature
changes affect ice cream sales.
Outlier detection: Scatter plots can help identify outliers in the data.
Outliers are data points that deviate significantly from the overall
pattern. You can easily spot any unusual observations by visualizing
the data in a scatter plot.
Clustering analysis: Scatter plots can be useful in identifying clusters
or groups within the data. By examining the distribution of points in
the plot, you can identify any distinct groups or patterns.
Comparison between groups: Scatter plots can compare two groups
or categories. By using different colors or markers for each group, you
can easily distinguish between them and observe any differences or
similarities in their relationship.
In summary, scatter plots offer a visual depiction of the correlation between
two variables, allowing the user to gain insights and make informed
decisions based on the patterns observed in the plot.
Observe the following line of code Chap13_Example13.9.py for
demonstrating scatter plot using seaborn:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
#print(help(mysns.scatterplot))
# set the style as whitegrid
mysns.set(style="whitegrid") # Set the style of
the plot
# Usage of seaborn's scatterplot function
'''
This line of code is creating a scatter plot using
Seaborn, where the x-axis represents sepal length,
the y-axis represents sepal width, points are
colored based on the species, and different marker
styles
are used for different species. This allows for
the visualization of the relationships between
sepal length and width while distinguishing the
data points by species.
'''
mysns.scatterplot(x="sepal_length",
y="sepal_width", data=my_dataset, hue="species",
style="species")
# Set plot labels and title
myplt.title("Scatter Plot of Sepal Length vs Sepal
Width")
myplt.xlabel("Sepal Length (cm)")
myplt.ylabel("Sepal Width (cm)")
# Display the plot
myplt.show()
Output:

Figure 13.16: Scatter plot with the iris dataset

Note: The preceding code is covered in (program name: Chap13_Example13.9.py).

The joint plot in seaborn


A joint plot in seaborn is a statistical visualization that combines multiple
plot types to offer a comprehensive perspective of the relationship between
two variables. It typically includes a scatter plot to visualize the joint
distribution of the variables, along with marginal histograms or kernel
density plots to show the univariate distribution of each variable. The main
purpose of joint plots is to explore the correlation, distribution, and
individual variable characteristics simultaneously. The univariate and
bivariate plots will be allowed together in one figure using this plot.
Univariate plots focus on one variable which displays any pattern in data.
Bivariate plots explore the relationship between two variables and describe
the bonding of their relationship. By default, the jointplot() function
generates a scatter plot with two histograms positioned at the top and right
margins of the graph.
Refer to the following syntax for clarity:
jointplot(data=None, *, x=None, y=None, hue=None,
kind='scatter', height=6, ratio=5, space=0.2,
dropna=False, xlim=None, ylim=None, color=None,
palette=None, hue_order=None, hue_norm=None,
marginal_ticks=False, joint_kws=None,
marginal_kws=None, **kwargs)
The main plot illustrates the relationship between the dependent variable y
and the independent variable x. Additionally, a horizontal marginal plot is
positioned at the top of the bivariate graph, representing the distribution of
the independent variable x. Furthermore, a vertical marginal plot is placed
along the right margin of the bivariate graph, displaying the distribution of
the dependent variable y.
Here are some reasons why joint plots in seaborn are useful:
Visualizing relationships: Joint plots aid in comprehending the
association between two variables. The scatter plot in the center
provides a clear view of how the variables are distributed jointly,
showing patterns and trends.
Correlation exploration: By examining the scatter plot, you can get a
sense of the correlation between two variables. A concentrated pattern
along a diagonal line may suggest a positive correlation, while a
scattered pattern may indicate a weaker or no correlation.
Univariate distributions: The marginal histograms or kernel density
plots on the sides of the joint plot provide insights into the distribution
of each variable individually. This is valuable for understanding the
characteristics of each variable in isolation.
Outlier detection: Joint plots can help identify outliers and anomalies
in the data, especially in the context of bivariate distributions.
Customization and flexibility: Seabornʼs joint plots offer
customization options, allowing you to tweak the appearance of the
plot based on your preferences. You can choose different types of plots
for the marginal distributions, adjust color schemes, and more.
The following points will help you know when to use joint plots:
Exploratory Data Analysis (EDA): Joint plots are particularly useful
during the exploratory phase of data analysis when you want to
quickly visualize the relationships between two variables.
Comparing distributions: When we want to compare the
distributions of two variables simultaneously, joint plots provide a
concise way to do so.
Understanding correlation: If we are interested in understanding the
correlation between two variables and exploring how their values are
related, joint plots are a helpful tool.
Data presentation: Joint plots can be used in presentations or reports
to convey information about the joint and marginal distributions of
two variables in a visually appealing manner.
In summary, joint plots in seaborn are a versatile tool for visualizing
relationships between two variables, making them valuable in various data
analysis scenarios, especially during the exploratory phase of a data science
project.
Observe the following line of code Chap13_Example13.10.py for
demonstrating joint plot using seaborn:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('iris')
#print(help(mysns.jointplot))
mysns.jointplot(x="sepal_length", y="sepal_width", data=my_dataset, hue="species")
# Set plot labels and title
myplt.title("Joint Plot of Sepal Length vs Sepal Width")
myplt.xlabel("Sepal Length (cm)")
myplt.ylabel("Sepal Width (cm)")
# Display the plot
myplt.show()
Output:
Figure 13.17: Joint plot with iris dataset

Note: The preceding code is covered in (program name: Chap13_Example13.10.py).

This Python code utilizes seaborn to generate a joint plot of the iris dataset,
specifically examining the correlation between the sepal length and sepal
width of iris flowers. The seaborn library is imported as mysns, and the iris
dataset is loaded into the variable my_dataset. The jointplot() function is
then employed to generate a joint plot, where the x-axis represents sepal
length, the y-axis represents sepal width, and points are colored based on the
species of the iris flowers. The hue parameter allows for the differentiation
of species with different colors in the plot. Subsequently, the title, x-axis
label, and y-axis label are set using matplotlib (myplt), indicating that the
plot illustrates the joint distribution of sepal length and width. Finally, the
show() function is called to display the resulting joint plot. This type of
visualization is beneficial for exploring the correlation, distribution, and
individual characteristics of two variables simultaneously, providing insights
into the relationship between sepal length and width in different species of
iris flowers.
In this scenario, the primary plot consists of a scatter plot, while density
plots are positioned along both margins, illustrating the data distribution
separately for the three levels of hue variables. We can view the density
curves for those three levels of hue parameters plotted in different colors.
Now, suppose we pass the kind parameter as reg and remove the hue
parameter in the above line of code since the use of hue with kind='reg' is
not currently supported. Also, we have removed the title() function here:
mysns.jointplot(x="sepal_length", y="sepal_width",
data=my_dataset,kind='reg')
Output:

Figure 13.18: Joint plot with iris dataset and kind parameter
Here, a regression line has been drawn on the scatter plot. It is nothing but the line of best fit, which gives a visual representation of the relationship between the dependent variable and one independent variable. The line is drawn such that it is as close as possible to all the data points in the plot. We can easily spot outliers on a scatter plot as the points that lie farther from the regression line, and we can observe that there are many such outliers in this scatter plot. Moreover, we shall discuss this further in our Machine learning book, which will be launched shortly.
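Besides 'scatter' and 'reg', the kind parameter of jointplot() also accepts values such as 'kde' and 'hex'. The following is a minimal sketch (assuming the same iris dataset loaded as my_dataset, and the mysns and myplt imports from the earlier program) using hexagonal binning for the joint distribution:
# hexagonal bins in the joint area, with histograms in the margins
mysns.jointplot(x="sepal_length", y="sepal_width", data=my_dataset, kind='hex')
myplt.show()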

Facet grid in seaborn


FacetGrid is a class in the seaborn library that facilitates the creation of
multiple plots arranged in a grid, each corresponding to subsets of the
dataset based on the values of one or more categorical variables. It is
particularly useful class for visualizing relationships in complex datasets by
breaking them down into smaller, more manageable subsets.
The key features and uses of FacetGrid in Seaborn are mentioned below:
Grid of subplots: FacetGrid allows the user to create a grid of
subplots, where each subplot represents a different subset of the data
based on the values of categorical variables.
Categorical variables: It is designed to work well with categorical
variables, enabling the creation of a grid of plots for different levels or
combinations of categorical variables.
When using a conditioning variable in a FacetGrid in Seaborn, each subset of the data will correspond to a different level of that variable. The dataset will be split depending on the categories. Suppose the categorical variable is sex; then subset1 will be extracted by setting the categorical variable to Male, and subset2 will be extracted by setting it to Female. Graphs will be created based on the number of levels of this variable.
If the conditioning variable is numeric, its values are organized into bins, with each bin representing a distinct subset. So, the different values in a numerical variable such as size, say 2, 3, 4, 5, and so on, will be grouped into bins, and each bin will represent a subset.
Conditioning plots: FacetGrid is commonly used for creating
conditioned plots, where the data is divided into subsets, and a
separate plot is created for each subset. This is useful for
understanding how relationships or patterns vary across different
categories.
Customization and flexibility: FacetGrid provides flexibility in terms
of customization. The user can map different plots onto the grid,
specify the variables for row and column facets, and customize the
appearance of individual plots.
Pair plots and multi-variable analysis: FacetGrid is often used to
create pair plots, where multiple variables are compared
simultaneously across different facets of the grid. This is especially
valuable when exploring interactions between multiple variables.
Follow the points below to learn when you should use FacetGrid in Seaborn:
Exploring relationships across categories: When the user wishes to
examine how the correlation between two variables varies across
various categories or levels of a third variable.
Comparing multiple conditions: If the user has a dataset with
multiple conditions categories and wants to compare distributions or
relationships across these conditions, FacetGrid is a powerful tool.
Multi-variable analysis: When dealing with datasets with multiple
variables, FacetGrid helps create grid layouts that allow the user to
compare and analyze several variables simultaneously.
Pair plots and scatter plots: For creating pair plots or scatter plots
where the user wants to observe relationships between multiple pairs
of variables in a structured grid layout.
In summary, FacetGrid in Seaborn is used to create a grid of subplots based
on the values of categorical variables, providing an effective way to explore
and compare relationships within different subsets of a dataset. It is
particularly useful in situations where data can be naturally divided into
categories and want to analyze or visualize those categories independently.
The FacetGrid() function in Seaborn is used to create independent graphs.
Refer to the following syntax for clarity:
FacetGrid(data, *, row=None, col=None, hue=None,
col_wrap=None, sharex=True, sharey=True, height=3,
aspect=1, palette=None, row_order=None,
col_order=None, hue_order=None, hue_kws=None,
dropna=False, legend_out=True, despine=True,
margin_titles=False, xlim=None, ylim=None,
subplot_kws=None, gridspec_kws=None)
The crucial aspect to note is that after preparing the graphs, we must map
each facet with the plotting function using data subsets. This is achieved by
invoking the map method on the FacetGrid object, allowing the plotting
function to be applied to each facet. This map method can accept both
keyword and positional arguments. These arguments will be passed on to the
plotting function by using the map method. So, the syntax will be as follows:
FacetGrid.map(self, func, *args, **kwargs)
Here, the plotting function will be applied to each facetʼs subset of the data. Let us execute the code Chap13_Example13.11.py for a better understanding:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('penguins')
# print(help(mysns.FacetGrid))
myfacet_obj = mysns.FacetGrid(data=my_dataset,
col="island")
myfacet_obj.map(mysns.histplot, 'bill_length_mm')
myplt.show()
Output:

Figure 13.19: Facet grid with penguins dataset with a histogram of “bill_length_mm”
Note: The preceding code is covered in (program name:
Chap13_Example13.11.py).

Here, the facets are plotted in the histograms from three subsets. In
summary, the code uses Seabornʼs FacetGrid to create a grid of histograms,
each representing the distribution of penguin bill lengths for a specific
island. This allows for a visual comparison of how the distribution of bill
lengths varies across different islands. The histplot() function is employed
to create the histograms, and the resulting grid is displayed using
matplotlib. This type of visualization is useful for understanding the
distribution of a numerical variable across different categories or conditions
in the dataset. We shall focus on the following three lines of code:
my_dataset = mysns.load_dataset('penguins'): It
loads the penguin's dataset, a built-in dataset in Seaborn, into the
variable my_dataset. This dataset contains information about various
penguin species, including their bill measurements.
myfacet_obj = mysns.FacetGrid(data=my_dataset,
col="island"): It creates a FacetGrid object named myfacet_obj.
This object will be used to generate a grid of subplots, with each
subplot corresponding to a different value of the island column in the
dataset. The grid will be arranged in columns.
myfacet_obj.map(mysns.histplot,
'bill_length_mm'): It maps the histplot function from seaborn
onto the FacetGrid. This means that a histogram of the
bill_length_mm variable will be plotted in each subplot of the grid,
with the data being split based on the values of the island column.
Now, we can even subset a dataset based on both columns and rows. The
FacetGrid function has both row and column parameters for specifying the
row name and column name. The graph will be divided into multiple facets
corresponding to the levels of categorical variables as demonstrated in the
following code Chap13_Example13.12.py:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('penguins')
myfacet_obj = mysns.FacetGrid(data=my_dataset,
row="island", col='sex')
myfacet_obj.map(mysns.histplot, 'bill_length_mm')
myplt.show()
Output:

Figure 13.20: Facet grid with the penguins dataset split by island and sex

Note: The preceding code is covered in (program name:
Chap13_Example13.12.py).

In the code mentioned above, the variables island and sex have three and two
categories, respectively. The grid will therefore split into six facets, one for
each combination of the categorical variables. Let us break down these three
lines of code; refer to the following points:
my_dataset = mysns.load_dataset('penguins'), this line loads the
penguins dataset using the load_dataset function present in the
seaborn library. The dataset contains information about different
penguin species, including measurements of bill length, flipper length,
body mass, and so on. The loaded dataset is stored in the variable
my_dataset.
myfacet_obj = mysns.FacetGrid(data=my_dataset, row="island",
col='sex'), this line creates a FacetGrid object named myfacet_obj.
The FacetGrid is a seaborn class used for creating a grid of subplots
based on the values of categorical variables.
The data parameter specifies the dataset to be used (my_dataset in this
case).
The row parameter is set to island, indicating that the grid of
subplots will be organized by the values in the island column of the
dataset.
The col parameter is set to sex, indicating that there will be columns
in the grid for each unique value in the sex column of the dataset.
myfacet_obj.map(mysns.histplot, 'bill_length_mm'), this line maps
the histplot function from Seaborn onto the FacetGrid object
(myfacet_obj).
The histplot function is used to create histograms. In this case, it will
create a histogram of the bill_length_mm variable.
The grid of subplots will be filled with histograms, and each subplot
will represent a combination of values from the island (rows) and sex
(columns) columns.
In summary, these three lines of code load the penguins dataset, create a
FacetGrid object organized by the island in rows and sex in columns and
then use the histplot function to generate histograms of the bill_length_mm
variable within each subplot of the grid. This type of visualization allows
you to explore the distribution of penguin bill lengths across different islands
and genders. The grid structure helps in comparing these distributions based
on the specified categorical variables.
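As noted earlier, map() also forwards extra positional and keyword arguments to the
plotting function, so each facet can be customized without changing the grid itself.
The following is a minimal sketch built on the same example; the bins and color values
are illustrative choices and are not part of the book's program:
import seaborn as mysns
import matplotlib.pyplot as myplt
my_dataset = mysns.load_dataset('penguins')
myfacet_obj = mysns.FacetGrid(data=my_dataset, row="island", col='sex')
# Extra keyword arguments (bins, color) are forwarded by map() to histplot in every facet
myfacet_obj.map(mysns.histplot, 'bill_length_mm', bins=15, color='green')
myplt.show()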
With this, we have come to the end of this chapter. In the next chapter, on
the usage of ChatGPT with these libraries, we will cover how to generate
other plots, such as bar plots, through ChatGPT, along with many more examples.

Conclusion
In this chapter, readers progressively gained proficiency in creating a
diverse array of advanced plots in Seaborn. From KDE plots for exploring
single-variable distributions to violin plots that harmonize aspects of box
plots and KDE plots, readers are now equipped
with a repertoire of tools for nuanced data exploration. The chapterʼs
culmination lies in empowering readers to navigate Seabornʼs extensive
capabilities, from crafting line plots to depict trends, scatter plots to visualize
variable relationships, joint plots amalgamating diverse plot types in a single
grid, to FacetGrids creating grids of subplots based on categorical variables.
In essence, the chapter delivers on its objective to provide a comprehensive
understanding of basic statistical concepts, unveiling Seabornʼs
functionalities, and familiarizing readers with a spectrum of plots designed
for effective data exploration and presentation. Armed with this knowledge,
readers are now poised to leverage Seabornʼs capabilities, transforming
datasets into meaningful visual narratives that yield valuable insights. This
chapter not only imparts technical skills but also cultivates an appreciation
for the art and science of data visualization, ultimately fostering a data-savvy
mindset for those embarking on the journey of data analysis.
In the next chapter, we will integrate the data science libraries using
ChatGPT prompts. We will learn how to use ChatGPT to get our answers by
writing the actual prompts that are required.

Points to remember
KDE plot in seaborn utilizes Kernel Density Estimation for a smooth
representation of the underlying distribution of continuous data, aiding
in understanding the shape of the data distribution.
Violin plot in seaborn combines aspects of box plot and KDE to
display the distribution of data across different levels, making it
suitable for visualizing both central tendency and spread.
Line plot in seaborn displays data points connected by straight lines,
useful for showing trends and patterns over time or other ordered
categories.
Scatter plot in seaborn displays individual data points on a two-
dimensional plane, making it optimal for depicting the connections between
two continuous variables.
Joint plot in seaborn combines multiple plots to visualize
relationships between two variables, displaying scatter plots,
histograms, and regression lines simultaneously.
Facet grid in seaborn divides the data into subsets for separate
visualizations, facilitating comparison between different subsets and
exploring patterns within categorical variables.
Questions
1. What is the purpose of a KDE plot, and how does it differ from a
histogram?
2. Explain the kernel density estimation process in Seaborn.
3. When would you choose a violin plot over a box plot?
4. Describe the anatomy of a violin plot in Seaborn.
5. In what situations is a line plot preferable for data visualization?
6. How does Seaborn handle the creation of line plots?
7. What information can be derived from a scatter plot?
8. Discuss the parameters available in Seaborn for customizing scatter
plots.
9. How does a joint plot combine multiple types of visualizations?
10. Provide examples of scenarios where a joint plot is particularly
useful.
11. Explain the concept of Facet Grid in Seaborn.
12. What advantages does a Facet Grid offer in terms of data exploration
and presentation?

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 14
Integrating Data Science Libraries
with ChatGPT Prompts

Introduction
ChatGPT, a specialized variant of OpenAIʼs Generative Pre-trained
Transformer (GPT) models, excels in generating human-like text responses
for conversational contexts. Its neural network architecture, trained on
extensive text data, facilitates tasks such as data augmentation for text-based
applications like classification and sentiment analysis. Additionally,
ChatGPT powers conversational AI systems and virtual assistants, and
facilitates text generation and summarization tasks, catering to a wide array
of needs from customer support to creative writing. Its adaptability through
fine-tuning enables a deeper understanding of language nuances, making it
an asset in various data science and machine learning projects, although its
frequency of use varies based on project requirements.
In data science, ChatGPT can be useful for tasks such as generating text
data, answering queries about data analysis, or providing explanations of
complex concepts more understandably. The frequency of using this tool for
data science would depend on the specific needs of your projects and how
often you require assistance with text-based tasks or explanations. In this
chapter, we shall discuss how to use ChatGPT with data science libraries.

Structure
In this chapter, we will discuss the following topics:
ChatGPT prompts for exploring data science libraries
Querying ChatGPT for Python code with data science libraries
Querying ChatGPT with Python error
Data science library cheat sheets using ChatGPT
Novice challenges querying Python code via ChatGPT

Objectives
This chapter will cover the topics to illuminate readers on the versatile
applications of ChatGPT in the exploration of open-source data science
libraries. Firstly, it aims to guide users in formulating effective prompts for
ChatGPT, tailored to each library, including numpy, scipy, matplotlib,
pandas, polars, and seaborn, to extract valuable insights and coding
solutions. We will cover various topics, including leveraging ChatGPT for
generating insightful prompts and queries to explore data science libraries,
seeking assistance for Python code utilizing these libraries, collaboratively
generating comprehensive cheat sheets, and addressing challenges faced by
novice users. By providing practical guidance and insights, this chapter
endeavors to empower readers to enhance their proficiency in data analysis
and visualization tasks while utilizing ChatGPT as a valuable resource in
their learning journey.

ChatGPT prompts for exploring data science libraries


Before starting with ChatGPT prompts for exploring data science libraries,
we should first know what ChatGPT prompt means. A ChatGPT prompt
serves as the initial cue or context for the model to generate a response. It
functions much like presenting a query, statement, or scenario to the model,
upon which it produces a logical reply. The prompt aids the artificial
intelligence (AI) in grasping the userʼs intent and formulating appropriate
content. For example, if we ask ChatGPT to summarize a news article, the
prompt would be the text of the article or a brief description of what we want
to summarize. Similarly, if we ask ChatGPT for a creative writing prompt,
we might provide some initial details or a scenario as the prompt.
In essence, the prompt serves as the input or stimulus for the AI model,
guiding it in generating a response that is relevant and coherent concerning
the given context.
In this topic, we shall be writing ChatGPT prompts along with its reply.
Now, we shall be exploring each data science library by asking some
ChatGPT prompts:
ChatGPT Prompt1: What are the key functionalities and advantages
of using the numpy library for numerical computing, particularly in
the context of AI applications in Python?
ChatGPT reply1: Numpy stands as a cornerstone library for
numerical computing in Python, with a myriad of functionalities
tailored to meet the demands of AI applications. Its support for multi-
dimensional arrays and matrices, coupled with a vast array of
mathematical functions, streamlines complex computations inherent in
AI tasks, including matrix operations, linear algebra, and statistical
analysis.
Notably, numpyʼs optimized and vectorized operations deliver superior
performance compared to conventional Python lists, rendering it
indispensable for handling massive datasets and computationally intensive
AI tasks. Its seamless integration with other Python libraries such as SciPy,
pandas, and scikit-learn further amplifies its utility across diverse AI
workflows.
Numpyʼs broadcasting capability allows for implicit element-wise
operations on arrays of varying shapes, facilitating succinct and expressive
code development, particularly in data manipulation and preprocessing tasks.
Moreover, its rich collection of random number generators aids in
synthesizing datasets crucial for training and evaluating machine learning
models. With robust indexing and slicing capabilities, numpy empowers
efficient data extraction and manipulation, serving as the bedrock for
preprocessing datasets before model training.
In essence, NumPy emerges as an indispensable asset in the arsenal of tools
for AI practitioners, offering unparalleled versatility, performance, and
efficiency in numerical computing tasks.
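As a quick, hedged illustration of the broadcasting, vectorized operations, and random
number generation mentioned in the reply, consider the following minimal sketch (the
array values here are arbitrary and chosen only for demonstration):
import numpy as np
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
col_means = data.mean(axis=0)   # vectorized statistics computed per column
centered = data - col_means     # broadcasting: a (2, 3) array minus a (3,) array
noise = np.random.default_rng(0).normal(size=data.shape)  # reproducible random numbers
print(centered + noise)         # element-wise addition without explicit loops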
So, from the previous reply, we saw how ChatGPT responded to a query about
numpy, since we specifically asked about the key functionalities and
advantages of using the numpy library for numerical computing in AI
applications in Python. Based on that specific query, ChatGPT answered to the point.
Refer to another prompt for a better understanding:
ChatGPT prompt2: As a newcomer to Python, I'm eager to explore
the Scipy library. Could you please introduce me to Scipy, outlining its
prominent features and distinguishing it from NumPy? It would be
great if you could present this information concisely in a table format.
ChatGPT reply2:
Scipy | Numpy
Extends functionality beyond array manipulation. | Primarily focuses on array manipulation and basic linear algebra operations.
Provides specialized functions for scientific and technical computing tasks. | Offers fundamental tools for numerical computing.
Includes modules for optimization, integration, interpolation, signal processing, and more. | Focuses on mathematical operations, array manipulation, and linear algebra.
Offers efficient numerical routines for linear algebra, statistics, and sparse matrix operations. | Provides efficient array operations and mathematical functions.
Complements numpy by providing additional functionality for scientific computations. | Serves as the fundamental library for numerical computing in Python.

Table 14.1: ChatGPT prompt reply


This table provides a succinct comparison between scipy and numpy,
highlighting their key features and differences, which would be helpful for
someone new to Python and interested in learning about the scipy library.
In later topics, we will see how to use ChatGPT to write Python code when we
are stuck on some logic while using these data science libraries. Let us take
another example:
ChatGPT Prompt3: Can you provide an overview of Matplotlib,
highlighting its core functionalities and the various types of plots it
enables users to create?
ChatGPT reply3: Matplotlib is a powerful Python library for creating
static, interactive, and publication-quality visualizations. Its core
functionalities include the ability to create a wide range of plots, such
as line plots, scatter plots, bar plots, histograms, pie charts, and more.
Matplotlib provides users with extensive customization options for
controlling aspects like colors, labels, titles, axis limits, and styles,
allowing for the creation of highly customized and visually appealing
plots. Additionally, matplotlib supports various output formats,
including Portable Network Graphics (PNG), Portable Document
Format (PDF), Scalable Vector Graphics (SVG), and Encapsulated
PostScript (EPS), making it suitable for both web-based and print-
based applications. Its integration with other libraries like NumPy and
Pandas further enhances its capabilities, enabling seamless data
manipulation and visualization. Overall, matplotlib serves as an
essential tool for data visualization tasks in fields such as data science,
machine learning, finance, and scientific research, empowering users
to effectively communicate insights and trends from their data.
So, in this ChatGPT prompt, we asked for an overview of matplotlib, its core
functionalities, and the various plot types it supports. Let us refer to another example:
ChatGPT Prompt4: Could you provide an in-depth overview of
Pandas, including its data structures, and common operations for data
manipulation and analysis? (Around 60 words)
ChatGPT reply4: Pandas is a Python library for data manipulation
and analysis. Its core data structures are Series (one-dimensional) and
DataFrame (two-dimensional). Key functionalities include data
loading, cleaning, manipulation, and analysis. Common operations
include indexing, selection, grouping, aggregation, merging, and
reshaping data. Pandas simplifies complex data tasks and is widely
used in data science for its ease of use and versatility.
In this ChatGPT prompt, we restricted the word usage to around
60 words, and ChatGPT kept its reply to 59 words; you may verify this
yourself. So, by specifying a word limit, we can restrict the length of the
answer. If you remove the word limit from the prompt, it may instead reply
with a detailed, point-wise explanation.
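To complement the reply, here is a minimal, hedged sketch of the Series and DataFrame
structures and the grouping and aggregation operations it mentions (the column names
and values are made up purely for demonstration):
import pandas as pd
# A one-dimensional Series and a two-dimensional DataFrame
discounts = pd.Series([5, 10, 0, 15])
sales = pd.DataFrame({'region': ['North', 'South', 'North', 'South'],
                      'amount': [250, 300, 150, 400]})
# Grouping and aggregation: total and average amount per region
totals = sales.groupby('region')['amount'].agg(['sum', 'mean'])
print(discounts)
print(totals)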
Now, observe the other ChatGPT prompt on the polars:
ChatGPT Prompt5: Could you provide a comparison between polars
and pandas, focusing on their features, performance, and usage?
Present the information in a structured table format, highlighting the
differences and similarities between the two libraries. (Around 200-
300 words)
ChatGPT reply5: Here is a comprehensive comparison between
Polars and Pandas, presented in a structured table format:
Feature | Polars | Pandas
Performance | Optimized for speed and memory efficiency, particularly for large datasets and parallel processing. Supports multi-threaded and multi-core operations for faster computation. | Generally slower compared to Polars, especially for large datasets. Primarily operates in a single-threaded environment.
Data structures | Utilizes dataframes similar to pandas, offering similar functionalities for data manipulation and analysis. | Features dataframes as the core data structure, allowing for flexible data handling and manipulation.
Functionality | Provides a wide range of operations for data manipulation, including indexing, selection, filtering, grouping, aggregation, and joins. Offers efficient methods for handling missing data and performing advanced computations. | Offers comprehensive functionality for data manipulation, analysis, and visualization. Supports various operations like indexing, selection, filtering, grouping, aggregation, merging, and reshaping data.
API consistency | Strives for consistency in API design, aiming to provide a familiar interface for users transitioning from Pandas. | Known for its consistent and intuitive API design, making it easy for users to learn and work with.
Parallel processing | Built with parallel processing in mind, allowing for efficient data processing across multiple threads or cores. | Limited support for parallel processing; primarily operates in a single-threaded environment.

Table 14.2: ChatGPT prompt reply five


From the above point-wise response from ChatGPT, we can conclude that
Polars and Pandas are both powerful libraries for data manipulation and
analysis, with each offering its own set of advantages. While Polars excels in
terms of performance and parallel processing capabilities, Pandas boasts a
comprehensive ecosystem and a user-friendly API. The choice between the
two ultimately depends on the specific requirements of the project and the
desired balance between performance and ecosystem support. This prompt
was specifically asked so that whatever we have learned about pandas and
polars can be compared in a nutshell. Now, observe the next prompt:
ChatGPT prompt6: Can you please let me know when to choose
seaborn or matplotlib for data visualization?
ChatGPT reply6: Choosing between seaborn and matplotlib for data
visualization depends on several factors, including the complexity of
your visualization needs, your familiarity with each library, and your
preference for customization versus ease of use. Here is a guideline to
help you decide:
Choose seaborn if:
Statistical visualization: You need to create complex statistical
visualizations, such as distribution plots, regression plots, or
categorical plots. Seaborn provides high-level functions specifically
designed for these tasks, making it easier to generate informative plots
with minimal code.
Default aesthetics: You prioritize visually appealing plots with
pleasing default styles and color palettes. Seaborn comes with
attractive defaults, allowing you to create professional-looking plots
without extensive customization.
Integration with Pandas: You are working with pandas dataframes
and need seamless integration for data manipulation and visualization.
Seaborn works well with Pandas, allowing for direct input of
dataframe objects into plotting functions.
Choose matplotlib if:
Flexibility and customization: You require full control over every
aspect of your plots, from the axes to the colors to the annotations.
Matplotlib offers extensive customization options, making it suitable
for creating highly customized and complex visualizations.
General purpose plotting: You need to create a wide range of plots
beyond statistical visualization, such as 3D plots, polar plots, or
animations. Matplotlib provides a broader range of plotting functions
for various types of plots and applications.
Advanced plotting requirements: You need to create specialized
plots or implement custom plotting logic that is not directly supported
by seaborn. Matplotlib offers a lower-level API that allows for more
advanced plotting techniques and customizations.
So, from the above reply from ChatGPT, choose Seaborn for its ease of use,
statistical visualization capabilities, and default aesthetics. Choose
matplotlib for its flexibility, customization options, and a broader range of
plotting functions. Ultimately, the best choice depends on your specific
visualization needs and preferences.
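To illustrate the pandas integration point from the reply, the following minimal sketch
passes a dataframe directly to a Seaborn plotting function; the tips dataset is one of
Seaborn's built-in example datasets, and the column choices are only illustrative:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')  # returns a pandas dataframe
# Columns are referenced by name straight from the dataframe
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time')
plt.show()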

Querying ChatGPT for Python code with data science libraries


Now, let us explore some ChatGPT prompts where we shall ask it to write a
Python code while using data science libraries:
ChatGPT Prompt7: Write a Python code to create 3 ndarray objects.
I want to save these multiple ndarray objects to the binary file and
read and display ndarray objects from a file. Use 2 different
approaches to achieve this result.
ChatGPT reply7: The code will be covered in
Chap14_Example14.1.py and Chap14_Example14.2.py:

# Using Pickling approach-1
import pickle
import numpy as np
# Create three NumPy ndarray objects
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[7, 8, 9], [10, 11, 12]])
array3 = np.array([[13, 14, 15], [16, 17, 18]])
# Save the ndarray objects to a binary file using Pickle
with open('arrays_pickle.bin', 'wb') as f:
    pickle.dump(array1, f)
    pickle.dump(array2, f)
    pickle.dump(array3, f)
# Read and display ndarray objects from the file
with open('arrays_pickle.bin', 'rb') as f:
    loaded_array1 = pickle.load(f)
    loaded_array2 = pickle.load(f)
    loaded_array3 = pickle.load(f)
print("Loaded Array 1:")
print(loaded_array1)
print("\nLoaded Array 2:")
print(loaded_array2)
print("\nLoaded Array 3:")
print(loaded_array3)
Output:
Loaded Array 1:
[[1 2 3]
[4 5 6]]

Loaded Array 2:
[[ 7 8 9]
[10 11 12]]

Loaded Array 3:
[[13 14 15]
[16 17 18]]

Note: The preceding code is covered in (program name:


Chap14_Example14.1.py).

Another approach is mentioned in the following Python snippet code:


Chap14_Example14.2.py:

# Using NumPy's save and load functions approach-2
import numpy as np
# Create three NumPy ndarray objects
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[7, 8, 9], [10, 11, 12]])
array3 = np.array([[13, 14, 15], [16, 17, 18]])
# Save the ndarray objects to a binary file using NumPy's save function
np.savez('arrays_npz.npz', array1=array1, array2=array2, array3=array3)
# Read and display ndarray objects from the file using NumPy's load function
loaded_data = np.load('arrays_npz.npz')
loaded_array1 = loaded_data['array1']
loaded_array2 = loaded_data['array2']
loaded_array3 = loaded_data['array3']
print("Loaded Array 1:")
print(loaded_array1)
print("\nLoaded Array 2:")
print(loaded_array2)
print("\nLoaded Array 3:")
print(loaded_array3)
The output will be the same as that of Chap14_Example14.1.py.

Note: The preceding code is covered in (program name:


Chap14_Example14.2.py).

The conclusion drawn from the two approaches for saving and loading
NumPy ndarray objects to a binary file is that they offer different methods
with varying trade-offs:
Pickle approach: Using Pickle for serialization provides a flexible
and convenient way to save and load NumPy arrays. It is easy to use
and requires minimal code to implement. However, Pickle may not be
as efficient as other serialization methods, and there may be security
concerns when loading Pickle files from untrusted sources.
Numpy's save and load functions: Utilizing numpyʼs save and load
functions (for example, np.savez and np.load) offers a more
specialized approach tailored specifically for NumPy arrays. It
provides efficient storage and retrieval of NumPy arrays and is
optimized for performance. However, the saved files are specific to
NumPy and may not be compatible with other serialization libraries.
In summary, the choice between Pickle and numpyʼs functions depends on
factors such as ease of use, efficiency, compatibility, and security
requirements. Both approaches have their merits and can be used depending
on the specific needs of the application. Refer to another example:
ChatGPT Prompt8: Can you provide an example of interpolation
using SciPy and explain its application in real-world scenarios?
Please include a Python code demonstrating interpolation with SciPy
ChatGPT reply8: The code will be covered in
Chap14_Example14.3.py:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

# Generate sample data representing a time series
time = np.array([0, 1, 2, 3, 4, 5, 6])
temperature = np.array([20, 22, 25, 24, 23, 22, 21])

# Perform linear interpolation
f_linear = interp1d(time, temperature)

# Generate denser time points for plotting the interpolated curve
time_dense = np.linspace(0, 6, 100)
temperature_dense_linear = f_linear(time_dense)

# Plot original data points and interpolated curve
plt.figure(figsize=(8, 6))
plt.scatter(time, temperature, label='Original Data', color='blue')
plt.plot(time_dense, temperature_dense_linear, label='Linear Interpolation', color='orange')
plt.title('Temperature Interpolation')
plt.xlabel('Time (hours)')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.grid(True)
plt.show()
Output:
Figure 14.1: Image depicting Interpolation Visualization for Temperature Data

Note: The preceding code is covered in (program name:


Chap14_Example14.3.py).

In this scenario, we are working with temperature data collected at specific


time points. Our goal is to fill in the gaps between these measurements to
obtain temperature estimates at any given time. To achieve this, we employ
linear interpolation, a method that approximates values between known data
points. We utilize SciPyʼs interp1d function to perform this interpolation.
The resulting interpolated temperature curve is then plotted alongside the
original data points, aiding in visualizing the interpolation process.
Interpolation finds wide application across domains like time series analysis,
signal processing, and image processing, particularly in situations where
data points are unevenly distributed or missing, necessitating the estimation
of intermediate values. Refer to the following example:
ChatGPT Prompt9: Could you showcase the creation of both a
simple line plot and a bar plot in a single example using matplotlib in
Python code?
ChatGPT reply9: The code will be covered in
Chap14_Example14.4.py:

import numpy as np
import matplotlib.pyplot as plt

# Sample data
x = np.arange(1, 6)
y_line = np.array([3, 5, 7, 4, 6])
y_bar = np.array([8, 6, 9, 7, 5])

# Create a figure and axis object
fig, ax = plt.subplots()

# Plot the line plot
ax.plot(x, y_line, marker='o', linestyle='-', color='b', label='Line Plot')

# Plot the bar plot
ax.bar(x, y_bar, color='r', alpha=0.5, label='Bar Plot')

# Set labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Line and Bar Plot')

# Add legend
ax.legend()

# Show plot
plt.show()
Output:

Figure 14.2: Image depicting both line and bar plots using matplotlib
Note: The preceding code is covered in (program name:
Chap14_Example14.4.py).

In this code, we create two sets of sample data (y_line and y_bar) and plot
them on the same figure using Matplotlibʼs plot() and bar() functions. We
then customize the plot by adding markers, linestyles, and colors. Finally,
we set labels for the axes and a title for the plot, and add a legend to
distinguish between the line plot and the bar plot. Refer to the following
example for a better understanding:
ChatGPT Prompt10: Provide a code snippet showcasing the
difference between pandas and polars.
ChatGPT reply10: The code will be covered in
Chap14_Example14.5.py:

import pandas as pd
import polars as pl

# Creating a DataFrame/Series
# Pandas
df_pandas = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
series_pandas = pd.Series([1, 2, 3, 4, 5])

# Polars
df_polars = pl.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
series_polars = pl.Series([1, 2, 3, 4, 5])

# Accessing data
# Pandas DataFrame
print("Pandas DataFrame:")
print(df_pandas)
# Polars DataFrame
print("\nPolars DataFrame:")
print(df_polars)
# Pandas Series
print("\nPandas Series:")
print(series_pandas)
# Polars Series
print("\nPolars Series:")
print(series_polars)

# Modifying data
# Pandas
df_pandas['C'] = df_pandas['A'] + df_pandas['B']
series_pandas = series_pandas + 1
# Polars
df_polars = df_polars.with_columns([series_polars.rename("C")])
series_polars = series_polars + 1

# Display modified data
print("\nModified Pandas DataFrame:")
print(df_pandas)
print("\nModified Polars DataFrame:")
print(df_polars)
print("\nModified Pandas Series:")
print(series_pandas)
print("\nModified Polars Series:")
print(series_polars)
Output:
Pandas DataFrame:
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10

Polars DataFrame:
shape: (5, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 6 │
│ 2 ┆ 7 │


│ 3 ┆ 8 │
│ 4 ┆ 9 │
│ 5 ┆ 10 │
└─────┴─────┘
Pandas Series:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Polars Series:
shape: (5,)
Series: '' [i64]
[
1
2
3
4
5
]
Modified Pandas DataFrame:
A B C
0 1 6 7
1 2 7 9
2 3 8 11
3 4 9 13
4 5 10 15
Modified Polars DataFrame:
shape: (5, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ 1 │
│ 2 ┆ 7 ┆ 2 │
│ 3 ┆ 8 ┆ 3 │
│ 4 ┆ 9 ┆ 4 │
│ 5 ┆ 10 ┆ 5 │
└─────┴─────┴─────┘
Modified Pandas Series:
0 2
1 3
2 4
3 5
4 6
dtype: int64
Modified Polars Series:
shape: (5,)
Series: '' [i64]
[
2
3
4
5
6
]

Note: The preceding code is covered in (program name:


Chap14_Example14.5.py).

The code snippet provided demonstrates basic data manipulation tasks such
as creating DataFrames and Series, accessing data, and modifying data. By
comparing how these tasks are performed in both pandas and polars, users
can gain insight into the differences in syntax and functionality between the
two libraries. Refer to another example for a better understanding:
ChatGPT Prompt11: Can you demonstrate the creation of scatter
plots using both seaborn and matplotlib in Python? Please provide a
code example showcasing how to create scatter plots with each library
and highlight any differences in their syntax or features.
ChatGPT reply11: The code will be covered in
Chap14_Example14.6.py:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data
np.random.seed(0)
x = np.random.randn(100)
y = np.random.randn(100)

# Scatter plot using Seaborn
plt.figure(figsize=(6, 3))
sns.scatterplot(x=x, y=y, color='blue', label='Seaborn')
plt.title('Seaborn Scatter Plot')  # Change title for Seaborn plot
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Scatter plot using Matplotlib
plt.figure(figsize=(6, 3))
plt.scatter(x, y, color='red', label='Matplotlib')
plt.title('Matplotlib Scatter Plot')  # Change title for Matplotlib plot
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Show plots
plt.show()
The output will display two images, one below the other; the first uses the
Seaborn scatter plot:

Figure 14.3: Image depicting Scatter plot using seaborn

The following figure shows Matplotlib Scatter Plot:


Figure 14.4: Image depicting scatter plot using matplotlib

Note: The preceding code is covered in (program name:


Chap14_Example14.6.py).

Here, ChatGPT was asked a question related to the creation of
scatter plots using the seaborn and matplotlib libraries in Python. The prompt is
designed to elicit a response that demonstrates how to create scatter plots
with both libraries and highlights any differences in their syntax or features.
This prompt is intended to guide the conversation towards discussing the use
of seaborn and matplotlib for data visualization and to provide insights into
their respective functionalities.
In all the Python code we have seen so far, comments have been added at the
appropriate places to help the user understand the code. So, whenever code is
requested from ChatGPT, comments are written by default at the required
places. This is one of the benefits of using ChatGPT.

Querying ChatGPT with Python error


Now, we shall see some ChatGPT prompts where we shall ask to provide
solution to a problem when we are writing a Python code.
ChatGPT Prompt12: Hi ChatGPT, I am trying to concatenate two NumPy
arrays along the row axis using np.concatenate, but I am getting an error
message ValueError: all the input arrays must have same number of
dimensions.
Here is the code snippet:
import numpy as np
# Arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([7, 8, 9])
# Concatenate along row axis
result = np.concatenate((arr1, arr2), axis=0)
On executing the above code, we are getting the following:
ValueError: all the input arrays must have the same
number of dimensions, but the array at index 0 has
2 dimension(s) and the array at index 1 has 1
dimension(s)
Could you please help me understand what I am doing wrong and how to fix
this issue?
ChatGPT reply12: To resolve the issue described in the example question,
you need to ensure that the dimensions of the arrays youʼre concatenating
match along the specified axis. In this case, the error message indicates that
the arrays donʼt have the same number of dimensions, which is causing the
ValueError.

To fix the issue, you can reshape arr2 to have the same number of
dimensions as arr1 before concatenating.
The code will be covered in Chap14_Example14.7.py:
import numpy as np
# Arrays
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([7, 8, 9])
# Reshape arr2 to match the dimensions of arr1
arr2_reshaped = np.reshape(arr2, (1, 3))
# Concatenate along row axis
result = np.concatenate((arr1, arr2_reshaped),
axis=0)
print(result)
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]

Note: The preceding code is covered in (program name:


Chap14_Example14.7.py).

Let us explore another example.


ChatGPT Prompt13: Hi ChatGPT, I am trying to create a scatter plot
using both Matplotlib and Pandas, but I'm getting TypeError. Where is the
problem? The code snippet is as follows:
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [6,
7, 8, 9, 10]})
# Scatter plot
plt.scatter()
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
Output:
TypeError: scatter() missing 2 required positional
arguments: x and y
ChatGPT reply13: The scatter() function in matplotlib requires the x and
y arguments to specify the coordinates of the data points to be plotted. In
your code, youʼre calling plt.scatter() without providing these arguments,
which results in the TypeError.
To resolve this issue, you need to pass the x and y data from your DataFrame
to the scatter() function. Hereʼs how you can modify only this line of code
and get the job done:
plt.scatter(df['x'], df['y'])
Output:
Figure 14.5: Image depicting scatter plot using matplotlib and pandas

Note: The preceding code is covered in (program name:


Chap14_Example14.8.py).

We have only explored two examples within this topic, yet there are
numerous other instances to consider. The primary purpose of this exercise is
to demonstrate that you can pose any query to ChatGPT, and it will respond
to the best of its ability based on its knowledge. While ChatGPT can easily
handle solutions for simpler problems like these two examples, it may
struggle with more complex issues. In such cases, obtaining a complete
solution may require persistent questioning, and providing all available
information regarding the specific problem until a satisfactory resolution is
reached. Otherwise, it becomes a repetitive cycle where the query is
repeatedly posed, and the solution provided by ChatGPT may not be entirely
accurate.

Data science library cheat sheets using ChatGPT


Now, let us create a cheat sheet using ChatGPT. A cheat sheet serves as a
succinct and practical reference tool, offering a condensed overview of
crucial details, insights, or commands about a particular topic or area of
study. So, we shall explore some important functions that every data scientist
must know.
ChatGPT Prompt14: Create a brief cheat sheet of important functions used
in data science libraries like numpy, scipy, matplotlib, pandas, polars and
seaborn.
ChatGPT reply14: Here is a brief cheat sheet of important functions used in
popular data science libraries:
NumPy:
np.array(): Create a NumPy array.
np.zeros(), np.ones(): Create arrays of zeros or ones.
np.arange(), np.linspace(): Generate arrays with evenly spaced
values.
np.reshape(): Reshape arrays.
np.mean(), np.median(), np.std(): Compute statistical measures.
np.max(), np.min(): Find maximum and minimum values.
np.concatenate(), np.vstack(), np.hstack(): Combine arrays.
SciPy:
scipy.stats: Statistical functions like probability distributions,
hypothesis tests, and descriptive statistics.
scipy.optimize: Optimization algorithms.
scipy.interpolate: Interpolation and curve fitting.
scipy.signal: Signal processing functions.
scipy.linalg: Linear algebra operations.
Matplotlib:
plt.plot(), plt.scatter(), plt.bar(): Create various types of
plots.
plt.xlabel(), plt.ylabel(), plt.title(): Add labels and title to
plots.
plt.legend(), plt.colorbar(): Add legends and colorbars.
plt.savefig(): Save plots to files.
plt.subplots(): Create subplots.
plt.hist(): Plot histograms.
Pandas:
pd.DataFrame(): Create a DataFrame.
df.head(), df.tail(): View first or last rows of DataFrame.
df.info(), df.describe(): Display summary information and
statistics.
df.groupby(), df.pivot_table(): Group and aggregate data.
df.plot(), df.hist(): Plot data directly from DataFrame.
df.to_csv(), df.to_excel(): Write DataFrame to CSV or Excel.
Polars:
pl.DataFrame(): Create a Polars DataFrame.
df.head(), df.tail(): View first or last rows of DataFrame.
df.describe(): Display summary statistics.
df.filter(), df.select(): Filter rows and select columns.
df.groupby(), df.agg(): Group and aggregate data.
df.join(), df.merge(): Join DataFrames.
df.sort(), df.sort_by(): Sort DataFrame.
Seaborn:
sns.scatterplot(), sns.lineplot(), sns.barplot(): Create
different types of plots.
sns.histplot(), sns.kdeplot(): Plot histograms and kernel density
estimates.
sns.boxplot(), sns.violinplot(): Plot categorical data.
sns.pairplot(), sns.heatmap(): Plot pairwise relationships and
heatmaps.
sns.set_style(), sns.set_palette(): Set plot styles and color
palettes.
These are just a few essential functions of each library. There are many more
functions and capabilities available in each library to explore and utilize for
data analysis and visualization.

Novice challenges querying Python code via ChatGPT


Novices embarking on querying via ChatGPT for Python code may confront
various hurdles:
Articulating clear questions: Novices might grapple with expressing
their inquiries lucidly, resulting in communication ambiguity or
confusion.
Deciphering error messages: Novices could encounter difficulty
comprehending intricate error messages or tracebacks, impeding their
ability to effectively diagnose and resolve issues.
Contextual gaps: Novices may inadvertently omit crucial contextual
details about their code or problem, complicating ChatGPTʼs
provision of pertinent assistance.
Terminology unfamiliarity: Novices might lack familiarity with
programming terminology or concepts, hindering their capacity to
pose well-informed questions.
Limited knowledge base: Novices may possess a restricted
understanding of programming concepts, rendering it challenging to
grasp ChatGPTʼs responses or implement suggested solutions.
Articulating logic: Novices may encounter challenges in accurately
describing the logic or flow of their code, making it arduous for
ChatGPT to comprehend the issue and offer effective guidance.
In summary, novices may confront obstacles related to communication,
comprehension, and knowledge while querying with ChatGPT for Python
code. Consistent practice and exposure to programming concepts can aid in
overcoming these hurdles over time.
Please be aware that the preceding parts of this chapter involve queries posed to
ChatGPT, with its responses provided alongside. All of the responses were
generated by ChatGPT, as the main focus of this chapter is that you should
pose your own queries to ChatGPT and obtain the required solutions at your end. It is
important to note that the responses, except for the Python code, may vary based
on individual interactions.

Conclusion
In conclusion, this chapter has explored the myriad applications of ChatGPT
in navigating data science libraries. By guiding users in formulating
effective prompts tailored to specific libraries, such as numpy, scipy,
matplotlib, pandas, polars, and seaborn, valuable insights and coding
solutions can be obtained. Additionally, the chapter has provided insights
into utilizing ChatGPT for querying Python code related to data science
tasks, along with strategies for seeking assistance when encountering
challenges. Furthermore, readers have learned how to collaboratively create
cheat sheets summarizing key functions and commands from data science
libraries using ChatGPT. Lastly, the chapter has addressed the common
challenges faced by novice users when querying Python code with ChatGPT,
offering practical strategies and insights to overcome these obstacles.
Overall, this chapter equips readers with the knowledge and tools necessary
to leverage ChatGPT effectively in their exploration and utilization of data
science libraries, enhancing their proficiency in data analysis and
visualization tasks. In the next chapter, we will explore automated EDA
libraries for machine learning and will delve into the capabilities of the
sweetviz, autoviz, pandas_visual_analysis, y-data profiling, dtale, and
dabl libraries, each providing distinct functionalities and visualization
approaches to enhance the understanding and interpretation of datasets.

Points to remember
When formulating prompts for ChatGPT to explore data science
libraries, ensure clarity and specificity to extract relevant insights
tailored to the library of interest, such as numpy, scipy, matplotlib,
pandas, polars, or seaborn.
While asking questions to ChatGPT for Python code utilizing data
science libraries, provide sufficient context and details about the
problem statement or desired outcome to receive accurate and helpful
responses.
When stuck in any Python code, utilize ChatGPT effectively by
breaking down the problem into smaller components and providing
clear and concise queries to solicit targeted assistance, especially when
dealing with data science libraries.
When creating cheat sheets of data science libraries using ChatGPT,
collaborate with others to ensure comprehensive coverage of essential
functions and commands, and regularly update the cheat sheets as new
features or functionalities are introduced in the libraries.
Be mindful of the common challenges faced by novices when
querying Python code using ChatGPT, such as formulating clear
questions, understanding error messages, providing adequate context,
and describing the logic of the code accurately. Practice effective
communication and problem-solving strategies to overcome these
challenges and make the most out of ChatGPTʼs capabilities in data
science tasks.

Questions
1. How can ChatGPT assist in exploring numpy functionalities?
2. Can ChatGPT help me in interpolating data using scipy?
3. How do we fix a syntax error in my Python code using ChatGPT?
4. What are the key components of a cheat sheet for matplotlib that
ChatGPT can provide?
5. Can ChatGPT generate a cheat sheet summarizing essential functions
in numpy?
6. How can novices improve their communication skills when
interacting with ChatGPT?
7. How can we plot a histogram in matplotlib using ChatGPT?
8. How can ChatGPT assist in optimizing my Python code for better
performance?
9. What strategies can we employ with ChatGPT to overcome runtime
errors in Python?
10. What resources and tools are available to novices for improving their
proficiency in querying Python code with ChatGPT?

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 15
Exploring Automated EDA
Libraries for Machine Learning

Introduction
In the field of machine learning, Exploratory Data Analysis (EDA) plays a
key role in understanding data properties, recognizing patterns, and deriving
insights. Traditionally, EDA entails an iterative approach where data analysts
meticulously inspect datasets using visualization methods, statistical
summaries, and domain expertise. However, with the increase in size and
complexity of datasets, traditional EDA techniques are becoming more time-
consuming and labor-intensive. Automated EDA tools have emerged as a
solution to this challenge by harnessing algorithms and machine learning
methods to streamline the data exploration process.
These tools provide a variety of features, including data profiling,
visualization capabilities, feature engineering options, and anomaly
detection functionalities. This enables data scientists and analysts to quickly
gain insights from their datasets with minimal manual intervention. Automated
EDA libraries cater to the demand for rapid data investigation, managing
intricate datasets in a scalable manner while maintaining consistent analytical
outcomes. By utilizing such methods, these tools reveal concealed trends and
streamline processes, thereby making data exploration accessible to a wider
range of users.
Structure
In this chapter, we will discuss the following topics:
Usage of PyGWalker visual library with a dataset
Usage of dataprep library with a dataset
Usage of autoviz library with a dataset
Usage of pandas_visual_analysis library with a dataset

Objectives
This chapter will introduce learners to a diverse array of Python libraries
tailored for data visualization and analysis, each uniquely contributing to an
enriched data exploration experience. Beginning with the PyGWalker visual
library, readers will discover how to seamlessly integrate it with datasets to
create visually appealing representations. The exploration continues with the
dataprep library, offering insights into efficient data preparation and
analysis. Subsequently, learners will delve into the capabilities of the
autoviz and pandas_visual_analysis libraries, each providing distinct
functionalities and visualization approaches to enhance the understanding
and interpretation of datasets. By the end of this module, participants will be
equipped with a versatile toolkit for effective data exploration and
visualization in Python.

Note: Our discussion on each visual library will cover the following
topics: introduction, installation method, code/how to use, and features
explanation.

We will be using all these libraries in Jupyter Notebook. Since this book will
not be a colored edition, most of the images shown are available in color in
the soft image bundle. So, if we refer to any images that are colored, you are
requested to refer to the image bundle.

Usage of PyGWalker visual library with a dataset


The Python binding of Graphic Walker short form abbreviated as
PyGWalker comes with a variety of tools and functions to simplify the entire
process of data exploration and manipulation. It also offers various features
for the user to understand the data characteristics, patterns, and relationships
easily. PygWalker enables data scientists to effortlessly visualize, clean, and
annotate data through intuitive drag-and-drop actions and even natural
language queries. For data analysis in Jupyter Notebook, the pandas
dataframe can be turned into a Tableau-style UI using PyGWalker.
Installation method
We can install PyGWalker by using the following command:
pip install pygwalker
Once installed, run %pip show pygwalker in a cell; you shall see the
version and other details after installation, as shown in the
PyGWalker_Demo.ipynb notebook.

Code and how to use it


The dataset that we will be exploring is the tips.csv data. First, we import
pygwalker and pandas into our Jupyter Notebook, followed by loading the
tips.csv data into a pandas dataframe using the read_csv() method. Then, we
display the first five rows of the tips dataset as shown in Figure 15.1:

Figure 15.1: Image displaying loading of tips.csv dataset in Jupyter Notebook


Then we shall be using the walk method. This method facilitates EDA,
enabling users to visually investigate and analyze the dataframe mydf. It is
capable of generating diverse statistical summaries, visual representations,
and insights from the dataset, aiding users in comprehending its inherent
traits and trends.
So, we shall explore the data by writing the following line of code, which
displays a visual playground loaded with the dataframe:
mypg.walk(mydf, dark='light',
hideDataSourceConfig=True, vegaTheme='vega')
A detailed explanation of the arguments is given below, followed by a
complete sketch of the cell:
dark='light': A lighter color mode is used.
hideDataSourceConfig=True: When set to True, it indicates that the
exploration process will hide the specifics of the data source configuration,
such as file paths or database connection settings, ensuring a cleaner and
more focused display of the data analysis results.
vegaTheme='vega': It allows expressive and interactive visualization creation.
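Putting these pieces together, the complete notebook cell looks roughly like the
following sketch (assuming tips.csv is available in the working directory, as in
the accompanying notebook):
import pandas as pd
import pygwalker as mypg
# Load the tips dataset into a pandas dataframe
mydf = pd.read_csv('tips.csv')
# Launch the interactive PyGWalker playground on the dataframe
mypg.walk(mydf, dark='light', hideDataSourceConfig=True, vegaTheme='vega')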
Features explanation
Let us see the features with various images. On executing the walk method,
we will be getting a canvas with two tabs; Data and Visualization as shown
in Figure 15.2:
Figure 15.2: Image displaying pygwalker with two tabs

We can check the entries and inspect the dataframe in the Data tab. Here, the
numeric values (continuous columns) will automatically be labeled as
measures and categorical columns as dimensions in pygwalker. Various
column types like:
Continuous columns are considered to be quantitative.
For categorical columns, the data will be either nominal or ordinal.
For date/time-related columns, the data type will be temporal.
In the Visualization tab, we can create different types of charts. The
available fields can be dragged and dropped into the x and y axes. Let us plot
the total bill amount by day. So, in the Y axis, let us take the total_bill
column, and in the X axis, let us take the day column. We can also resize the
chart by clicking the symbol highlighted in the black rectangle as shown in
Figure 15.3. A chart type will be picked automatically by default by
pygwalker for the user. We can additionally add charts by clicking +New in
pygwalker.
Figure 15.3: Image depicting chart plot with a single X and Y field in Chart 1

However, the chart types can be selected based on Figure 15.4 shown as
follows:
Figure 15.4: Image depicting chart types in pygwalker

We can even explore our data in a very imaginative way by adding some
extra fields into our chart either in the X or Y axis. Suppose we want to
know how much smokers spend each day. We can drag and drop the smoker
field and the day field on the X-axis against the total_bill column on the Y-
axis. The time field can be added under Color. Let us filter the sex too, that
is, only for female customers. We will only check female and uncheck male
here during selection. So, based on the selection, we have filtered the values,
and the plot is shown in Figure 15.5:

Figure 15.5: Image depicting chart plot with a multiple X and a single Y field in Chart2

Also, we can export it under three different formats either in PNG, SVG, or
BASE64 format as shown in Figure 15.6 as follows:

Figure 15.6: Image depicting export option

We can also export our code using export_code tool button as highlighted in
Figure 15.7 as follows:
Figure 15.7: Image depicting export_code button

On clicking this tool button, the following figure will pop-up in front of the
user:

Figure 15.8: Image depicting the code on clicking export_code button

Now, if we click the Copy to Clipboard button, the pop-up window will be
closed and if we paste the code in the new cell, we shall see the same
specification of our dashboard consisting of both the Charts on running.
There are more options to explore in pygwalker which we will leave for you
to explore.
For more information, you may refer to the website:
https://pypi.org/project/pygwalker/

Note: All the explanations related to pygwalker are covered in


(PyGWalker_Demo.ipynb).

Usage of dataprep library with a dataset


Another automated library for performing EDA, and for collecting data from a
common data source, is the dataprep library. If we split the word
dataprep, it means data preparation. The DataPrep library makes it easier to
explore data by simplifying tasks into a few lines of code. With this tool, the
user can gain insights into their data without having to write complicated code.
When conducting EDA, the most efficient approach involves utilizing the
dataprep.eda module. This library is renowned for its speed and simplicity
in performing EDA tasks. Even if we were unfamiliar with dataprep.eda, it
offers a seamless and user-friendly experience, enabling us to comprehend
and explore the dataframe with just a few lines of code.
Installation method
We can install dataprep by using the following command:
pip install dataprep
Once installed, run %pip show dataprep in a cell; you shall see the
version and other details after installation, as shown in the
Dataprep_Demo.ipynb notebook.

Code and how to use it


The dataset that we will be exploring is penguins.csv data. First, we need to
import the create_report function from the library. Then, we load a CSV file
named penguins.csv into a pandas dataframe. The head() method displays
the first five rows of the penguin dataset for quick inspection as shown in
Figure 15.9:

Figure 15.9: Image displaying loading of penguins.csv dataset in Jupyter Notebook

Finally, the create_report() function is called on the dataframe to generate
an EDA report, which is then displayed using the show() method:
create_report(mydata).show()
This process enables concise and efficient exploration of the dataset,
providing statistical summaries and visualizations to aid in understanding the
data. The whole visualization will be created by using a single line of code.
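For reference, the complete cell looks roughly like the following sketch (assuming
penguins.csv is available in the working directory, as in the accompanying notebook):
import pandas as pd
from dataprep.eda import create_report
# Load the penguins dataset and take a quick look at the first five rows
mydata = pd.read_csv('penguins.csv')
print(mydata.head())
# Generate and display the EDA report with a single line of code
create_report(mydata).show()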
Features explanation
On executing the create_report() function, the dataprep library will display
a report of the dataset as shown in Figure 15.10:

Figure 15.10: Image displaying different features using dataprep library in Jupyter Notebook

Let us explain each of the features shown in the report one-by-one. Each
feature explained here will help us to understand the data quickly but make
sure that your EDA fundamentals are very strong before practicing all these
types of libraries. The overview and variables are mentioned below:
Overview: The overview of our penguins.csv dataset contains dataset
statistics and dataset insights, as shown in Figure 15.11. In dataset
statistics, we can find statistical details like the number of variables
(which is nothing but the number of columns), the number of rows,
missing cells and their percentage, duplicate rows if any, the memory
size, and, lastly, the important differentiation between numerical and
categorical variables. The dataset insights display only
the percentage of missing values.
Figure 15.11: Image displaying Overview feature using dataprep libary in Jupyter Notebook

Variables: Clicking on the variable option reveals comprehensive


information for each feature, accompanied by bar graphs. This
detailed display offers insights into the characteristics of each feature
and its potential impact on model accuracy. For instance, selecting any
feature allows for the observation of various plots and graphs,
providing a deeper understanding of its behavior and relationships
within the dataset. We shall explain each numerical and categorical
variable for better understanding to the user. We shall start with the
species variable as shown in Figure 15.12:

Figure 15.12: Image displaying Categorical species variable using dataprep library
The species column is a categorical column within the dataset which
consists of three different penguin species: Adelie, Chinstrap, and Gentoo.
The species column serves as a pivotal factor for organizing and examining
penguin data, enabling researchers to discover the unique characteristics
associated with each species. We will notice the visually appealing bar plot
effectively represents categorical values. Upon clicking the Show details
button below the species, we shall find five different plots as shown in
Figure 15.13:

Figure 15.13: Image displaying Stats details of species variable

In the stats section, we will find observations categorized by Length,


Sample, and Letter. As we navigate through the plots, various visualizations
such as pie charts, Word Cloud, Word frequency (count w.r.t. Male and
Female), and word length (Number of male and female) are there which
shall provide insights into the data distribution and relationships as shown in
Figure 15.14:
Figure 15.14: Image displaying visualization details of species variable

Now, let us look into a numerical variable which is body_mass_g. Following


are the observations from this variable.
In the header section, we will encounter several numerical computations
which include mean, max, zeros, missing data percentage, uniqueness,
memory size, and so on, which is accompanied by a histogram plot as shown
in Figure 15.15:

Figure 15.15: Image displaying numerical computation details of body_mass_g variable

When we click the Show Details button, four tabs are shown to the user. The
first tab is related to Stats; towards the bottom, we will come across two
statistical summaries of the data points: the quantile and descriptive
statistics, encompassing measures like mean, median, variance,
and other relevant statistical calculations as shown in Figure 15.16:

Figure 15.16: Image displaying Stats details of body_mass_g variable

Different plots are demonstrated to the user here, starting with Kernel
Density Estimation (KDE) visualizations, which serve as vital tools for assessing
and interpreting Gaussian distribution properties, as shown in Figure 15.17:

Figure 15.17: Image displaying Plot visualizations of body_mass_g variable

Now, we shall explore the interactions option in dataprep visual library for
displaying scatter plots.
Interactions
This option enables the exploration of relationships between two features,
with one feature represented on the x-axis and the other on the y-axis. Here,
we are viewing the relationship between bill_length_mm on the X-axis and
body_mass_g on the Y-axis using a scatter plot as shown in Figure 15.18. We
can also change the X-axis and Y-axis variables based on our requirements,
so different features can be selected and the plotted values can be viewed
by the user.

Figure 15.18: Image displaying scatter plot with Interactions feature

Correlations
Typically, datasets contain correlated features, where the impact of one
feature on accuracy mirrors that of another correlated feature. Therefore, it is
essential to identify and analyze these correlations through EDA. Within this
option, we will encounter various types of correlation graphs, including
Pearson, Spearman, and KendallTau, accessible through a change of tabs
as shown in Figure 15.19:
Figure 15.19: Image displaying different graphs with Correlation feature

In Figure 15.19, we have displayed Pearson and Spearman correlation. We


can also write a small code in dataprep as there is another method inside
dataprep.eda module to plot the correlation as shown in Figure 15.20,
where we are displaying the KendallTau correlation:
Figure 15.20: Image displaying KendallTau Correlation feature
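The small piece of code referred to above could look like the following minimal sketch, assuming the penguins dataframe loaded earlier is named mydata; plot_correlation() is the dataprep.eda method for correlation plots:

from dataprep.eda import plot_correlation

# Renders the correlation report for the dataframe, with Pearson, Spearman,
# and KendallTau available as tabs in the generated output
plot_correlation(mydata)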

Missing values
Detecting missing values within a dataset can pose challenges for data
scientists. However, here, the missing values are visualized using a bar chart
and spectrum graph, providing a clear scale of missing versus present values.
Notably, missing values are distinctly highlighted in orange, aiding in their
identification and analysis within the dataset. We already stated that sex has
3.2% of missing values from the Overview feature. So, in Figure 15.21, we
are displaying the missing values using a bar chart and spectrum graph:
Figure 15.21: Image displaying Barplot and Spectrum plot for Missing Values feature

In Figure 15.21 we have displayed missing values using a bar and spectrum
chart. We can also write a small code in dataprep as there is another method
inside dataprep.eda module to search for the missing values as shown in
Figure 15.22. Here, we are displaying the missing values using a heatmap
and a dendrogram. The heatmap helps us understand which columns are highly
correlated or not, and the color is displayed to the user based on the
correlation. The dendrogram is used for clustering similar kinds of data,
mainly in hierarchical clustering, that is, grouping data points into clusters
based on their similarity or dissimilarity.

Figure 15.22: Image displaying HeatMap and Dendrogram plot for Missing Values feature
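Similarly, a minimal sketch of the dataprep.eda call for exploring missing values (again assuming the dataframe is named mydata) is given below; the generated output includes the bar chart, spectrum, heatmap, and dendrogram views described above:

from dataprep.eda import plot_missing

# Visualizes the missing values of the dataframe across several linked plots
plot_missing(mydata)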

For more information, you may refer to the website:


https://pypi.org/project/dataprep/

Note: All the explanations related to dataprep are covered in


(Dataprep_Demo.ipynb).

Usage of autoviz library with a dataset


Another library that can automate data visualization in a single line of code
is autoviz. It excels in identifying and showcasing the most crucial features
by automatically selecting and plotting impactful visualizations. Its
remarkable speed allows for the creation of these visualizations within mere
seconds, enhancing efficiency in data exploration and analysis. Autoviz
simplifies visualization for all skill levels, offering speed, scalability, and
automation with just one line of code. Its user-friendly interface abstracts
away complexities, making it efficient for beginners and experts alike.
Additionally, Autoviz provides customizable options and default data quality
assessments, streamlining the process of data exploration and analysis.
Installation method
We can install autoviz by using the following command:
pip install autoviz
Once installed, run %pip show autoviz in the cell, you shall see the version
and other details after installation which is shown in Autoviz_Demo.ipynb
notebook.
Code and how to use it
First, we shall import the AutoViz_Class class from the AutoViz_Class
module within the autoviz package. This class is responsible for generating
the AutoViz report. Then, we create an instance of the AutoViz_Class class
and assign it to the variable myav. This instance will be used to generate the
AutoViz report as shown in Figure 15.23:

Figure 15.23: Image displaying importing AutoViz_Class and initializing it

We will be using matplotlib.pyplot module for plotting library in Python,


often used for data visualization. A magic command is used in Jupyter
Notebook to display matplotlib plots inline within the notebook. Finally, it
generates the AutoViz report by calling the AutoViz method with the dataset
filename as an argument, resulting in comprehensive visualizations and
insights presented in the report. Refer to the following code for a better
understanding:
from autoviz.AutoViz_Class import AutoViz_Class   # class generating the AutoViz report (Figure 15.23)
myav = AutoViz_Class()                             # instance created earlier (Figure 15.23)
import matplotlib.pyplot as myplt
%matplotlib inline
mydft = myav.AutoViz("iris.csv")
Features explanation
The AutoViz report provides a holistic overview of the datasetʼs structure
and features, presenting various plots such as bar charts, violin plots,
correlation matrices (heat maps), and pair plots. Its ability to generate such
comprehensive insights with just one line of code makes it an invaluable tool
for beginners looking to explore and understand their data effectively. On
executing the code containing AutoViz method, we are getting the following
result analysis:
Shape of your Data Set loaded: (150, 5)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S ####################
#######################################################################################
Classifying variables in data set...
5 Predictors classified...
No variables removed since no ID or low-information variables found in data set
To fix data quality issues automatically, import FixDQ from autoviz...
Alert: Dropping 1 duplicate rows can sometimes cause column data types to change to object. Double-check!
We can observe that there are 150 rows and five columns and different
information related to various columns is revealed such as Data Type,
Missing Values%, Unique Values%, Minimum Value, Maximum Value, and
DQ (Data Quality) Issue. DQ issue refers to any anomaly or inconsistency
within the dataset that could affect its integrity, accuracy, or usefulness for
analysis. These issues may include missing values, outliers, inconsistent
formatting, or erroneous data entries. AutoViz offers functionality to
automatically detect and address common data quality issues, helping users
identify and resolve potential data inconsistencies more efficiently as shown
in Figure 15.24:

Figure 15.24: Image displaying DQ issues against each column of iris.csv dataset

In AutoViz, the pair-wise scatter plot showcases the connections between all
continuous variables, offering a rich visual overview of their relationships.
This dynamic visualization aids in spotting patterns, correlations, and
emerging trends among variables, enabling a deeper understanding of the
datasetʼs underlying structure and interconnections. The pair wise scatter
plot is plotted between sepal_length vs. sepal_width, sepal_length vs.
petal_length, sepal_length vs. petal_width, sepal_width vs. petal_length,
sepal_width vs. petal_width, and petal_length vs. petal_width. We are
only displaying the image of sepal_length vs. sepal_width, and
sepal_length vs. petal_length as shown in Figure 15.25:

Figure 15.25: Image depicting scatter plot between sepal_length vs (sepal_width/ petal_length)

Also, we can view the distplot, boxplot, and probability plot for all
columns except the species column. Here, we are displaying all three plots
for the sepal_length column shown in Figure 15.26:
Figure 15.26: Image displaying Dist, Box, and Probability plot for sepal_length column

Also, other plots like Histogram, Violin (of all continuous variables),
HeatMap (of all numeric variables including target), and bar (For each
continuous column by each categorical variable) are also mentioned for the
dataset columns which we left for you to explore in the Jupyter Notebook.
The last message displayed will be the execution time to run AutoViz which
is four seconds displayed as follows:
Time to run AutoViz = 4 seconds
###################### AUTO VISUALIZATION Completed ########################
For more information, you may refer to the website:
https://pypi.org/project/autoviz/

Note: All the explanations related to autoviz are covered in


(Autoviz_Demo.ipynb).

Usage of pandas_visual_analysis library with a dataset


Pandas Visual Analysis is a Python package that facilitates interactive visual
analysis directly within Jupyter Notebook. It generates interactive widgets
for analyzing pandasʼ dataframes, displaying diverse graph types with
linked-brushing functionality. This simplifies data exploration, even with
intricate multivariate datasets, eliminating the need for manual plot creation
and styling by automating the entire data exploration process.
Installation method
We can install pandas_visual_analysis by using the following command:
pip install pandas_visual_analysis
Once installed, run %pip show pandas_visual_analysis in the cell, you shall
see the version and other details after installation which is shown in
PVanalysis_Demo.ipynb notebook.

Code and how to use it


The dataset which we shall be exploring will be mpg.csv data. First, we will
import the pandas dataframe and VisualAnalysis class from the
pandas_visual_analysis module into our Jupyter Notebook followed by
loading the mpg.csv data into the pandas dataframe by using the read_csv()
method. Then, we display the first five rows of the mpg dataset as shown in
Figure 15.27:

Figure 15.27: Image displaying loading of mpg.csv dataset in Jupyter Notebook

Now, we will be passing pandas dataframe mydf as an argument into


VisualAnalysis class. This instance represents a visual analysis widget
tailored for the provided dataframe. Once instantiated, the VisualAnalysis
widget allows for interactive exploration and visualization of the data within
mydf, offering various plotting options and tools for gaining insights into the
datasetʼs characteristics and relationships.
So, performing visual analysis on the dataframe object mydf, as shown
below:
VisualAnalysis(mydf)
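Putting the steps above together, a minimal sketch is given below; it assumes mpg.csv is available locally, as in the PVanalysis_Demo.ipynb notebook, and uses the pandas alias mypd followed elsewhere in this book:

import pandas as mypd
from pandas_visual_analysis import VisualAnalysis

# Load the dataset and inspect the first five rows
mydf = mypd.read_csv("mpg.csv")
mydf.head()

# Interactive widget offering the Standard, Additive, and Subtractive selection types
VisualAnalysis(mydf)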
Features explanation
On executing the last line of the code, we will be getting three impressive
selection types. The first is called standard, which essentially mirrors the
information we will get by using mydf.describe() on our dataset.
On running mydf.describe(), the output will be similar to what is shown in
Figure 15.28:

Figure 15.28: Image displaying mydf.describe() details

It provides a comprehensive overview of key statistics and characteristics of


the data, offering insights akin to the standard descriptive analysis as shown
in Figure 15.29:

Figure 15.29: Image displaying Standard selection type in Jupyter Notebook


Now, let us look at the Subtractive selection type before going into the
Additive selection type. Once we understand the Subtractive selection type, it
will be easy to understand the Additive selection type.
In the subtractive mode, we can select specific features to generate scatter
plots between them. This mode offers a unique feature: from the scatter plot,
we can manually select and temporarily remove certain data points. This
allows for an analysis of how those data points influence the dataset without
permanently altering it, serving purely for exploratory purposes.
Figure 15.30 indicates the process; initially, data points within the
highlighted black rectangle area are removed:

Figure 15.30: Image displaying Subtractive selection type with highlighted black ellipse

Subsequently, in Figure 15.31, the removed data points are displayed in a
rectangular portion, indicating their removal:
Figure 15.31: Image displaying subtractive selection type with highlighted grey area

Also, if we look at the bar chart of the displacement column, the removed
data points will be displayed in grey color as shown in Figure 15.32. The
rest of the other data points will be in blue:
Figure 15.32: Image displaying Removed data points with highlighted grey area in the bar plot

Also, an important point to note is that on removing the data points, the
statistical data pertaining to each column will be affected as shown in Figure
15.33:
Figure 15.33: Image displaying each column data affected due to the removal of data points

So, we can conclude that due to the removal of this small set of data points,
there will be an impact on the graph.
Now, let us focus on our third selection type which is Additive. Here, we will
add whatever data we have removed and then visualize the images as shown
in Figure 15.34:
Figure 15.34: Image displaying after the addition of data points under the Additive Selection type

We shall observe that we have the same data visualization that we witnessed
before removing the data including the statistical information.
In conclusion, Pandas Visual Analysis aids in gauging the effects of
removing specific data points that deviate significantly from the datasetʼs
mean. By visualizing the impact of removing these outliers, users gain
insights into potential adjustments to the dataset. Additionally, the tool offers
functionality for feature normalization, further enhancing its utility in data
exploration and analysis.
We can also download a plot as PNG, zoom in or out a particular section,
pan, either box or lasso select, autoscale, or reset axis on any of the plots as
these options are available to the user as shown in Figure 15.35:
Figure 15.35: Image displaying multiple options available to the user

For more information, you may refer to the website:


https://pypi.org/project/pandas-visual-analysis/

Note: All the explanations related to pandas_visual_analysis are covered


in (PVanalysis_Demo.ipynb)

So, we have covered the following four automated EDA libraries in this
chapter, viz. pygwalker, dataprep, autoviz, and pandas_visual_analysis, for
seamless integration, EDA, and visualization.

Conclusion
In this chapter, we equipped learners with a diverse range of Python libraries
tailored for data visualization and analysis, each offering unique
contributions to enrich the data exploration process. Beginning with
PyGWalker Visual, readers gained insights into seamlessly integrating
datasets to create visually appealing representations. Moving forward, the
exploration continued with the dataprep library, offering efficient data
preparation and analysis techniques. Subsequently, learners explored autoviz
and pandas_visual_analysis, discovering distinct functionalities and
visualization approaches to deepen their understanding and interpretation of
datasets. As a result, participants now possess a versatile toolkit for effective
data exploration and visualization in Python, empowering them to extract
valuable insights from diverse datasets with confidence and ease. In the next
chapter, we shall view two case studies applying the open data science
libraries which we have studied till now, viz. pandas, matplotlib, seaborn,
and so on.

Points to remember
PyGWalker Visual library simplifies the integration of datasets,
enabling users to effortlessly craft visually captivating representations.
The walk method facilitates EDA, enabling users to visually
investigate and analyze the dataframe.
The dataprep library simplifies data preparation by providing tools
for tasks like detecting missing values and profiling data, ensuring
datasets are clean and prepared for analysis.
The create_report() function is called on the dataframe to
generate an EDA report, which is then displayed using the show()
method.
The autoviz library streamlines visualization tasks by automatically
generating a diverse array of insightful plots and charts, requiring
minimal user input. This automation accelerates the data exploration
process, enabling quick and thorough analysis.
The AutoViz report is generated by calling the AutoViz method with
the dataset filename as an argument, resulting in comprehensive
visualizations and insights presented in the report.
pandas_visual_analysis provides an interactive environment
for data exploration, offering a range of visualizations and tools to
interactively analyze and understand dataset characteristics. Users can
benefit from features such as normalization, and outlier detection,
enhancing their ability to uncover patterns and relationships within the
dataset.
VisualAnalysis widget allows for interactive exploration and
visualization of the data within mydf, offering various plotting
options and tools for gaining insights into the datasetʼs characteristics
and relationships.

Questions
1. How does PyGWalker Visual simplify the integration of datasets for
visualization purposes?
2. What are the key functionalities offered by the dataprep library for
data preparation?
3. How does the autoviz library automate the visualization process, and
what are its main advantages?
4. What makes pandas_visual_analysis an interactive tool for data
exploration, and how does it support feature-rich visualization?
5. Can you demonstrate how to create visually appealing representations
using PyGWalker Visual?
6. How does dataprep assist in identifying and handling missing values
in a dataset? What steps are involved in data profiling with dataprep?
7. What types of visualizations can be automatically generated using
autoviz, and how does it contribute to speeding up the analysis
process?
8. What are some common challenges in data preparation, and how does
dataprep help to address them?
9. In what scenarios would you recommend using PyGWalker Visual,
dataprep, autoviz, and pandas_visual_analysis over other tools for
data exploration and analysis?
10. Name the important methods while using PyGWalker Visual,
dataprep, autoviz, and pandas_visual_analysis libraries for EDA.

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 16: Case Study Using Python Data Science Libraries
Introduction

Case studies employing Python data science libraries provide practical,


hands-on experience in manipulating, analyzing, visualizing, and modelling
data. By solving real-world problems, learners develop critical thinking
skills and reinforce theoretical concepts, while also building a portfolio that
showcases their proficiency across various domains. Collaborative learning
opportunities further enhance understanding and preparation for tackling
challenges in the field of data science.

Structure

In this chapter, we will discuss the following topics:

Case study on electrical fault classification


Case study on Titanic dataset

Objectives

This chapter will engage in a practical application of data science


methodologies by exploring two distinct datasets. The first case study shifts
attention to an electrical fault classification dataset, guiding participants
through a similar exploration by using diverse data science libraries. The
second case study focuses on the Titanic dataset, utilizing a range of data
science libraries and automated Exploratory Data Analysis (EDA) tools.
Learners will employ popular Python libraries such as pandas, NumPy, and
seaborn to preprocess and visualize the data, gaining insights into factors
influencing passenger survival. By the end of these case studies, readers
will not only acquire practical experience in applying data science
techniques but also develop a comprehensive understanding of how to
leverage different tools for varied datasets.
Note: You may refer to the image bundle for visualizing the images of the
above chapter.

Case study on electrical fault classification

Let us explore a case study on electrical fault classification. The case
study code is provided in the Jupyter Notebook file
Electrical_Fault_Classification.ipynb. The provided code displayed in
Figure 16.1 imports fundamental data science libraries, such as pandas for
data manipulation, NumPy for numerical computations, seaborn for
statistical visualization, and matplotlib for plotting.

Figure 16.1: Image depicting importing of fundamental data science libraries for case study 1

The %matplotlib inline command enables inline plotting directly within


Jupyter Notebooks. Furthermore, it imports the datetime class from the
datetime module to facilitate handling date and time data.
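The import cell of Figure 16.1 can be sketched as follows; the my-prefixed aliases mypd, mysns, and myplt appear later in this chapter, while mynp for NumPy is an assumption following the same convention:

import pandas as mypd               # data manipulation
import numpy as mynp                # numerical computations (alias assumed)
import seaborn as mysns             # statistical visualization
import matplotlib.pyplot as myplt   # plotting
%matplotlib inline                  # inline plots in Jupyter Notebook
from datetime import datetime       # handling date and time data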

We are then retrieving the current timestamp by using the following


statement:

# Get current timestamp

timestamp = datetime.now()

We are then writing the code for suppressing the warning messages for
ensuring cleaner output:

import warnings

warnings.filterwarnings('ignore')

The next line of code is the following:

mydf = mypd.read_csv("Fault_Class_Data.csv")
Python uses the pandas library to read data from a CSV file named
Fault_Class_Data.csv (dataset) and store it into a dataframe named mydf.

We are then displaying the first five rows of the dataset as shown in the
following
Figure 16.2:

Figure 16.2: Image depicting first 5 rows of the EFC dataset

Next, let us execute the statement mydf.info() method in Python for


grasping the fundamental structure and attributes of the dataset, aiding in
the detection of any anomalies such as missing data. When executed on a
pandas dataframe like mydf, it offers a succinct overview of the
dataframeʼs configuration and contents. It furnishes key details like the
count of entries (rows) and columns, the data type of each column, and the
count of non-null values within each column as shown in Figure 16.3:

Figure 16.3: Image executing mydf.info() method in a dataset

Then, let us compute the sum of missing values across each column as
shown in
Figure 16.4:

Figure 16.4: Image checking null values in a dataset

Next, we will look into the quick overview of the size and structure of the
dataset to understand its scale and complexity by executing the statement
mydf.shape. A tuple is returned representing the dimensions of the
dataframe where the first element represents the number of rows, and the
second element returns the number of columns as shown in Figure 16.5:
Figure 16.5: Image checking null values in a dataset
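As a minimal sketch, the two checks described above reduce to the following statements:

mydf.isnull().sum()   # per-column count of missing values (Figure 16.4)
mydf.shape            # (number of rows, number of columns) tuple (Figure 16.5)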

Now, observe the following Figure 16.6:

Figure 16.6: Image checking for duplicates and displaying duplicate rows

This code snippet efficiently detects and displays duplicate rows within the
dataframe mydf, first by creating a Boolean mask to identify duplicates and
then filtering and presenting these duplicate rows in a new dataframe
named duplicate_rows. This process offers clear insight into any repetitive
entries within the dataset. You may observe the next row for the dropped
duplicates here.

Then, we will be writing the code which efficiently removes duplicate rows
from the dataframe mydf and displays the resulting dataset without
duplicates as shown in
Figure 16.7:

Figure 16.7: Image displaying dataset with dropped duplicates

This code segment provides a comprehensive approach to managing


duplicate entries within the dataframe mydf. It begins by informing the user
about the number of duplicate values present in the dataset. Subsequently, it
removes these duplicate rows by using the drop_duplicates() method,
resulting in a new dataframe named myunique_df. The resulting dataset is
then presented, showcasing a refined version of the original data without
any redundant entries. This systematic approach enhances data quality and
facilitates clearer analysis.
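A minimal sketch of this duplicate handling, with variable names following the notebook description, is given below; the print wording is only illustrative:

# Boolean mask marking duplicate rows, then a view of those rows
duplicate_mask = mydf.duplicated()
duplicate_rows = mydf[duplicate_mask]
print("Number of duplicate rows:", duplicate_mask.sum())

# Remove the duplicates into a new dataframe and inspect the result
myunique_df = mydf.drop_duplicates()
myunique_df.head()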

In the next line of code, we are assigning the dataframe myunique_df to


mydf:

mydf = myunique_df

print("Current timestamp:", timestamp)


This code will replace mydf with the dataset devoid of duplicate rows.
Additionally, it prints the current timestamp for temporal reference. Now,
we shall explore our next section, which is about creating a target variable
using existing features:

Figure 16.8: Image displaying existing features in the dataset

Then, we will concatenate columns O_G, O_A, O_B, and O_C as strings
and assign them to a new column MyFault_Type in the dataframe mydf.
The astype('str') method ensures the conversion of columns to string type
before concatenation. Finally, mydf.head() is used to display the first five
rows of the dataframe mydf with the newly created column as shown in
Figure 16.9:

Figure 16.9: Image displaying first 5 rows with target variable MyFault_Type
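A minimal sketch of the concatenation step described above, with the column names as they appear in the dataset:

# Build the target variable by concatenating the four binary output columns as strings
mydf['MyFault_Type'] = (mydf['O_G'].astype('str') + mydf['O_A'].astype('str') +
                        mydf['O_B'].astype('str') + mydf['O_C'].astype('str'))
mydf.head()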

Next, we will categorize fault types based on predefined patterns and


update the MyFault_Type column accordingly, providing clarity and
organization to the dataset as shown in Figure 16.10:

Figure 16.10: Image displaying the binary data for Faults/No Faults

Let us understand the code shown in the figure step-by-step (a minimal equivalent sketch follows the list):

Sets NO Fault for rows where MyFault_Type equals 0000.


Sets Line A to Ground Fault for rows where MyFault_Type equals
1100.
Sets Line B to Line C Fault for rows where MyFault_Type equals
0011.
Sets Line A Line B to Ground Fault for rows where MyFault_Type
equals 1110.
Sets Line A Line B Line C Fault for rows where MyFault_Type equals
0111.
Sets Line A Line B Line C to Ground Fault for rows where
MyFault_Type equals 1111.
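The sketch below is one possible equivalent of the figure's code, mapping each binary pattern to its label with a dictionary and loc-based assignment; the notebook itself may use separate assignment statements:

# Map the concatenated binary patterns to readable fault labels
fault_labels = {
    '0000': 'NO Fault',
    '1100': 'Line A to Ground Fault',
    '0011': 'Line B to Line C Fault',
    '1110': 'Line A Line B to Ground Fault',
    '0111': 'Line A Line B Line C Fault',
    '1111': 'Line A Line B Line C to Ground Fault',
}
for pattern, label in fault_labels.items():
    mydf.loc[mydf['MyFault_Type'] == pattern, 'MyFault_Type'] = label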

Then, we shall filter and create separate dataframes (My_NF, My_LG,


My_LL, My_LLG, My_LLL, My_LLLG) based on different fault types
stored in the MyFault_Type column of dataframe mydf, facilitating
organized analysis and manipulation of fault-specific data subsets as shown
in Figure 16.11:

Figure 16.11: Image displaying new dataframes creation
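A minimal sketch of this subsetting step, assuming the fault labels assigned above:

# One subset dataframe per fault category, filtered on the target column
My_NF = mydf[mydf['MyFault_Type'] == 'NO Fault']
My_LG = mydf[mydf['MyFault_Type'] == 'Line A to Ground Fault']
My_LL = mydf[mydf['MyFault_Type'] == 'Line B to Line C Fault']
My_LLG = mydf[mydf['MyFault_Type'] == 'Line A Line B to Ground Fault']
My_LLL = mydf[mydf['MyFault_Type'] == 'Line A Line B Line C Fault']
My_LLLG = mydf[mydf['MyFault_Type'] == 'Line A Line B Line C to Ground Fault']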

After this, we will randomly select and display five rows from the
dataframe, providing a brief snapshot of the data along with the dimensions
of the dataframe. In addition, the statistics for numerical columns in the
dataframe, including count, mean, standard deviation, minimum,
maximum, and quartile values, will be displayed to the user, offering
insights into the distribution and summary statistics of the data, as shown in
Figure 16.12:

Figure 16.12: Image displaying sample data, shape and summary statistics
of the dataframe

Then, we will calculate the frequency count of each unique value in the
MyFault_Type column of the dataframe mydf, sorting the counts in
descending order and thus providing a summary of fault types and their
occurrences as shown in Figure 16.13:

Figure 16.13: Image displaying the frequency count for fault of target
variable
Next, let us focus on data visualization of the current dataset. We shall view
the code segment which visualizes the distribution of fault types within the
dataset. In the first subplot, let us create a countplot displaying the
frequency of each fault type, while the second subplot will create a pie
chart illustrating the distribution of fault types as percentages.
Customizations such as title, font properties, and rotation are applied for
better visualization. Finally, it prints the current timestamp, likely for
documentation purposes. Together, these visualizations offer insights into
the prevalence of different fault types in the dataset as shown in Figure
16.14:

Figure 16.14: Image displaying the data visualization code of target variable MyFault_Type
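A condensed sketch of the visualization code in Figure 16.14 is given below; titles, fonts, and rotation values are simplified placeholders, and datetime is assumed to be imported as shown earlier:

# Count plot and pie chart of the target variable side by side
myfig, myaxes = myplt.subplots(1, 2, figsize=(14, 6))

mysns.countplot(x='MyFault_Type', data=mydf, ax=myaxes[0])   # frequency of each fault type
myaxes[0].set_title('Count of each fault type')
myaxes[0].tick_params(axis='x', rotation=45)

mydf['MyFault_Type'].value_counts().plot.pie(autopct='%1.1f%%', ax=myaxes[1])  # percentage share
myaxes[1].set_title('Fault type distribution')

myplt.show()
print("Current timestamp:", datetime.now())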

First, we will be plotting the count plot, which is a type of plot that displays
the count of observations in each category of a categorical variable which is
displayed in Figure 16.15:

Figure 16.15: Image displaying count plot with bar labels indicating display
of value

Next, we shall observe the visualization using pie plot in Figure 16.16
shown as follows:

Figure 16.16: Image displaying the pie plot

In our next code, we shall construct a side-by-side comparison of the


distribution of ground fault occurrences within the dataset. In the first
subplot, we will generate a countplot by using the seabornʼs countplot
function, displaying the frequency of ground faults. Customizations such as
title, font properties, and rotation are applied for better visualization. In the
second subplot, a pie chart will be created using pandas plot.pie() function,
illustrating the distribution of ground faults as percentages. The pie chart
also includes explode, autopct, and shadow parameters for enhanced
presentation. Additionally, the current timestamp is printed, likely for
documentation purposes as shown in Figure 16.17:

Figure 16.17: Image displaying the data visualization of feature ground


fault O_G

On the same concept, we have displayed the data visualization for Line A,
Line B and Line C faults.

Next code segment will generate histograms for each numerical column in
the dataframe mydf, with bins set to ten and a figure size of 10x10. The
hist() function from pandas is used to create the histograms, which display
the distribution of values within each numerical column, providing insight
into their data distribution and range. The myplt.show() function displays
the histograms, allowing visual examination along with a display of the
current timestamp as shown in Figure 16.18:

Figure 16.18: Image displaying histogram for each numerical columns
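A minimal sketch matching this description:

# Histograms of every numerical column, ten bins each, on a 10x10 inch figure
mydf.hist(bins=10, figsize=(10, 10))
myplt.show()
print("Current timestamp:", datetime.now())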

Next, in Figure 16.19, we shall display the code for illustrating the
distributions of different features within the dataset by using histogram,
kdeplot, and boxplot.

Figure 16.19: Image displaying code for plots in 3*2 grid layout

The code in Figure 16.19 organizes these
visualizations into a 3x2 grid layout by using matplotlib subplots. Each
subplot focuses on a specific feature, such as voltage or load on a line,
displaying its distribution through histograms, Kernel Density Estimate
(KDE) plots, or box plots. Customizations like titles, labels, and colors are
applied for clarity and aesthetic appeal. Finally, it displays the
visualizations and prints the current timestamp, likely for documentation
purposes, marking the time when the visualizations were generated.

The output is shown in Figure 16.20, for the display of various plots viz
Histogram:

Figure 16.20: Image displaying histogram plots in 3*2 grid layout for
distribution of voltage and load of line

The output is shown in Figure 16.21 for the display of various plots viz
kdeplot:

Figure 16.21: Image displaying kde plots in 3*2 grid layout for distribution
of voltage and load of line

The output is shown in Figure 16.22 for the display of various plots viz
boxplot:

Figure 16.22: Image displaying boxplot in 3*2 grid layout for distribution
of voltage and load of line

Now, we shall see fault separation in different categories.

The forthcoming code segment will generate a plot featuring three lines,
each representing a distinct feature of Current ('Ir', 'Iy', and 'Ib') sourced
from the dataframe mydf. By employing the myplt.figure(figsize = (10,4))
command, it defines the dimensions of the plot to be displayed. Then,
utilizing myplt.plot(), it maps each feature against its respective index
values, distinguishing each line with distinct colors (r for red, y for yellow,
and b for blue) and then finally it will display the current timestamp as
shown in Figure 16.23:

Figure 16.23: Image displaying plot of 3 Current features Ir, Iy and Ib
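A minimal sketch of the plot described above (column names Ir, Iy, and Ib as in the dataset):

# Plot the three current features against the row index, one color per line
myplt.figure(figsize=(10, 4))
myplt.plot(mydf['Ir'], 'r')   # red
myplt.plot(mydf['Iy'], 'y')   # yellow
myplt.plot(mydf['Ib'], 'b')   # blue
myplt.show()
print("Current timestamp:", datetime.now())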

The same concept applies to the Voltage feature (Vr, Vy, and Vb) as shown
in Figure 16.24:

Figure 16.24: Image displaying plot of 3 Voltage features Vr, Vy and Vb

Now, we shall view the data visualization for a healthy system (that is, No
Fault).

Next, the code segment will create a figure with a size of 10x4 and set the
super title as No Fault Healthy System. It will then divide the figure into
two subplots arranged vertically by using subplot(2,1,1) and subplot(2,1,2).
The first subplot will plot three lines representing features Ir, Iy, and Ib
from the dataframe My_NF with colors r (red), y (yellow), and b (blue)
respectively. Similarly, in the second subplot, it plots features Vr, Vy, and
Vb from My_NF and finally it will display the current timestamp as shown
in Figure 16.25:

Figure 16.25: Image displaying plot of Healthy System

The same concept of code will be applied to the data visualization for
Line A to Ground Fault; Line A, Line B to Ground Fault; and Line A, Line B,
Line C to Ground Fault, which are displayed in Figure 16.26, Figure 16.27,
and Figure 16.28 respectively.

In Figure 16.26, we are displaying the plot of Faulty system with LineA to
Ground Fault:
Figure 16.26: Image displaying plot of LineA to Ground Fault

From Figure 16.26, we can observe that at the time of a Line A to ground fault,
the current in Line A increases roughly ten-fold, from the normal 100 Amperes
to approximately 1000 Amperes, while the voltage is reduced. Now, let us
observe our next figure for the display of the faulty system with Line A,
Line B to Ground Fault:

Figure 16.27: Image displaying plot of LineA, LineB to Ground Fault

Now, we shall observe our next figure for the display of the faulty system with
Line A, Line B, and Line C to Ground Fault:

Figure 16.28: Image displaying plot of LineA, LineB, LineC to Ground


Fault

Case study on Titanic dataset

Let us explore another case study on a famous dataset, which is a starting
point in any data science aspirantʼs machine learning journey. The case
study code is provided in the Jupyter Notebook file
EDA_TitanicDataset.ipynb.

We shall begin with importing the libraries which is similar to the first case
study as shown in Figure 16.29:

Figure 16.29: Image depicting importing of data science libraries for Case
Study2

The next line of code is presented as follows:

mydf = mypd.read_csv('titanic.csv')
Python uses the pandas library to read data from a CSV file named
"titanic.csv" (dataset) and store it in a dataframe named mydf.

We are then displaying the first five rows of the dataset as shown in the
following Figure 16.30:

Figure 16.30: Image displaying first five rows of titanic dataset

The existing features and the target variable in the Titanic dataset are
shown to the user as follows:

Figure 16.31: Image displaying existing features and target variable of


titanic dataset

Then, the next approach that we need to focus on is finding out the missing
data in the dataset. Here, we write the code mydf.isnull(), which serves to
identify missing values (null values) within the dataframe mydf. Upon
invocation, it generates a dataframe mirroring the structure of mydf, where
each cell contains either True or False. A True value signifies a missing
(NaN) entry in the original dataframe, while a False indicates the presence
of valid data.

In other words, mydf.isnull() allows the user to quickly identify the


locations of missing values in our titanic dataset as shown in Figure 16.32.
This information is valuable for data cleaning and pre-processing tasks, as
missing values may need to be handled appropriately before further
analysis.

Figure 16.32: Image displaying status of null values in each column

However, this approach might not be a proper way of finding the null values.
For larger datasets, we would have to iterate over each column, which is very
tedious. Also, we would need to scroll through the output in Jupyter
Notebook, which does not display all the data and skips many rows. So, we
need a visualization-based alternative in data science to the approach
mentioned above; the explanation is shown in Figure 16.33:

Figure 16.33: Image displaying null values using heatmap concept in


seaborn library

Focus on the following line of code for a better understanding:

mysns.heatmap(mydf.isnull(), yticklabels = False, cmap='viridis')

The code utilizes Seabornʼs heatmap function to visualize missing values


within the dataframe mydf. By leveraging the isnull() method on mydf, it
generates a binary grid where each cell represents the presence (True) or
absence (False) of missing data. The color map viridis is employed to shade
the cells, with darker hues indicating missing values. Suppressing the y-
axis tick labels with yticklabels=False enhances clarity in visualization.
From Figure 16.31 , we can conclude that four columns have missing
values which are age, embarked, deck and embark_town. There are many
missing values in the column age and deck. This succinct depiction aids in
promptly identifying areas of missing data, and streamlining subsequent
data cleaning and imputation tasks. Observe Figure 16.34 for clarity:

Figure 16.34: Image displaying the count of survived people in the dataset

The code snippet in Figure 16.34 utilizes seaborn library to visualize


survival data from the Titanic dataset. It first sets the plotting style to
whitegrid for clarity. Then, a countplot is created to display the distribution
of survival outcomes. Annotations are added to each bar by using
matplotlibʼs annotate function, indicating the count of occurrences for each
survival category. This combination of visual and numeric representation
enhances the interpretability of the data visualization, aiding in the
understanding of survival patterns aboard the Titanic.
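A minimal sketch of this annotated countplot is given below; column names follow titanic.csv, and the exact styling of Figure 16.34 may differ:

# Count plot of survival outcomes with the count written on top of each bar
mysns.set_style('whitegrid')
myax = mysns.countplot(x='survived', data=mydf)
for mypatch in myax.patches:
    height = mypatch.get_height()
    myax.annotate(f'{int(height)}',
                  (mypatch.get_x() + mypatch.get_width() / 2, height),
                  ha='center', va='bottom')
myplt.show()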
On the other hand, the code snippet in Figure 16.35, hue='sex' is used
within the countplot function from seaborn library. This parameter serves to
categorize the data based on the sex column, allowing for a visual
comparison of survival outcomes between different genders. By specifying
hue='sex', the countplot will generate separate bars for each gender
category (male and female) within each survival outcome (survived and not
survived), effectively illustrating how survival rates vary between males
and females. This enables a more nuanced analysis of survival patterns with
respect to gender aboard the Titanic ship, which is displayed in Figure
16.35:

Figure 16.35: Image displaying the count of survived people in the dataset
based on hue=’sex’ parameter

In the next code segment displayed in Figure 16.36, hue=’pclass’ is


employed within the countplot function from seaborn. This parameter
facilitates the segmentation of data based on the pclass column, allowing
for a visual comparison of survival outcomes across different passenger
classes. By specifying hue=’pclass’, the countplot generates separate bars
for each passenger class category (first, second, and third class) within each
survival outcome (survived and not survived) as shown in the following
figure:

Figure 16.36: Image displaying the count of survived people


in the dataset based on hue=’pclass’ parameter

The next line of code is mentioned as follows:

# determine the average age of the people who were on titanic

mysns.distplot(mydf['age'].dropna(), color='darkblue', bins=45)

The code utilizes seabornʼs distplot function to visualize the distribution of


ages among individuals who were aboard the Titanic. It first extracts the
age column from the dataframe mydf and drops any missing values by
using the dropna() method to ensure accurate representation. The resulting
non-null age values are then plotted as a histogram with a density curve
overlaid, depicted in a dark blue color. By specifying bins=45, the age
values are grouped into 45 bins, providing granularity to the distribution
plot as shown in Figure 16.37. This visualization aids in understanding the
central tendency and variability of ages among Titanic passengers. The kde
parameter which is not mentioned here has a value set to True. Observe the
following Figure 16.37 for the plot:

Figure 16.37: Image displaying distribution of ages who aboard the titanic

If the kde parameter is set to False, the KDE plot will not be visible to the user.

In Figure 16.38, we are viewing the histogram creation for the age column
from the dataframe mydf, which separates the data into 20 bins and
presents it in a dark blue color with 60% transparency. Annotations are
introduced atop each bar to display the count value of data points within
each bin. This is achieved by iterating over each bar, extracting its height
(representing the count), and placing annotations at the center of the bars.
These annotations, slightly offset vertically for clarity, provide numerical
context alongside the visual representation of the age distribution,
enhancing the histogramʼs interpretability. Observe Figure 16.38 for
histogram view by separating the data into 20 bins:

Figure 16.38: Image displaying histogram creation for the age column

In Figure 16.39, the next provided code segment generates a countplot by


using seabornʼs countplot function to visualize the distribution of the
number of siblings/spouses each passenger had aboard the Titanic. Each bar
in the plot represents a specific count of siblings/spouses, with the x-axis
denoting the number of siblings/spouses (sibsp) and the y-axis indicating
the frequency of occurrence where annotations are used to display the exact
count value at the top of each bar. Observe the Figure 16.39 for viewing
count plot using seabornʼs function:

Figure 16.39: Image displaying countplot of sibling/spouse aboard the


titanic

Now, in Figure 16.40, the next code segment generates a histogram by


using the fare column from the dataframe mydf. It divides the fare data into
30 bins and visualizes it as a histogram plot. The bars in the histogram are
styled with a dark blue color and a transparency level of 0.6, making them
partially see-through. This visualization helps in understanding the
distribution and range of fares paid by passengers aboard the Titanic, as
demonstrated in Figure 16.40:

Figure 16.40: Image displaying histogram of fare column

The next code segment displayed in the following Figure 16.41, produces a
box plot comparing the distribution of ages among different passenger
classes (pclass) aboard the Titanic. Firstly, the code sets the figure size by
using plt.figure(figsize=(8,7)) to ensure clarity in visualization. Then, a
seaborn box plot (sns.boxplot) is created, with the x-axis representing
passenger class and the y-axis denoting age. Subsequently, the code
calculates the 25th, 50th (median), and 75th percentiles of age for each
passenger class by using the quantile method. These percentile values are
stored in separate dataframes (p25, p50, p75).

Finally, annotations are added to the box plot to mark the positions of these
percentiles for each passenger class. The annotations display the percentile
values at the corresponding positions in the plot as demonstrated in the
following Figure 16.41:

Figure 16.41: Image displaying box plot code creation using pclass vs age
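A condensed sketch of the code in Figure 16.41 is shown below; the notebook stores the percentiles in separate dataframes, whereas this sketch uses groupby Series for brevity:

# Box plot of age per passenger class with the 25th/50th/75th percentiles annotated
myplt.figure(figsize=(8, 7))
myax = mysns.boxplot(x='pclass', y='age', data=mydf)

p25 = mydf.groupby('pclass')['age'].quantile(0.25)
p50 = mydf.groupby('pclass')['age'].quantile(0.50)
p75 = mydf.groupby('pclass')['age'].quantile(0.75)

for i, myclass in enumerate(sorted(mydf['pclass'].unique())):
    for q in (p25[myclass], p50[myclass], p75[myclass]):
        myax.annotate(f'{q:.1f}', (i, q), ha='left', va='center')
myplt.show()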
On executing the preceding code, we shall get box plot creation with
annotations as demonstrated in Figure 16.42:

Figure 16.42: Image displaying box plot comparing age distribution with
pclass

Next, the code segment in Figure 16.43 defines a function myinput_col that
takes a list mycols as an input argument:

Figure 16.43: Image displaying myinput_col function usage

The function extracts two elements from mycols: the age (MyAge) and
passenger class (MyPClass). It then checks if the age value is null by using
mypd.isnull(MyAge). If it is null, the function assigns a default age based
on the passenger class: 37 for class 1, 29 for class 2, and 24 for class 3. If
the age value is not null, it returns the original age value.

Finally, the function is applied to the age and pclass columns of the
dataframe mydf by using the apply method. The function is applied row-
wise (axis=1), and the result is assigned back to the age column of mydf.
This process imputes missing age values in the dataframe based on the
passenger class, ensuring that each passenger has a non-null age value for
further analysis.
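A minimal sketch of the imputation function described above, using the default ages stated for each class:

def myinput_col(mycols):
    MyAge = mycols[0]
    MyPClass = mycols[1]
    if mypd.isnull(MyAge):
        # Default age per passenger class when the age is missing
        if MyPClass == 1:
            return 37
        elif MyPClass == 2:
            return 29
        else:
            return 24
    return MyAge

# Apply row-wise on the age and pclass columns, writing the result back to age
mydf['age'] = mydf[['age', 'pclass']].apply(myinput_col, axis=1)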

After this, we are again visualizing the null values against each column by
using the heatmap concept in the seaborn library as shown in Figure 16.44.
We can observe that there are no null values in the age column and that only
the remaining three columns have null values. That is the benefit of analyzing
the missing values of a particular column in the dataset.

Figure 16.44: Image displaying visualization after handling the null values
in age column
Now, we can see that the deck column has many NaN values. In order to
replace the NaN values, we would have to apply feature engineering
concepts, which we shall cover in our next book on Machine Learning.
So, at present, we will just drop this column by using the code segment
shown in Figure 16.45:

Figure 16.45: Image displaying dropping of deck column

After this, we will again visualize the null values against each column by
using the heatmap concept in the seaborn library, as shown in Figure 16.46.
We can observe that there are null values only in the remaining two
columns, which are embarked and embark_town.

Figure 16.46: Image displaying visualization after handling the null values
in deck column

The provided code snippet in Figure 16.47, first fills null values in the
embarked column of the dataframe mydf with the mode (most frequent
value) of that column. The mode is obtained by using the mode() function,
and [0] is used to access the first mode in case there are multiple modes.
The fillna() function is then used to replace null values with the calculated
mode value. The same process is repeated for the embark_town column.
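A minimal sketch of this mode-based imputation:

# Fill the two remaining columns with their most frequent value (mode)
mydf['embarked'] = mydf['embarked'].fillna(mydf['embarked'].mode()[0])
mydf['embark_town'] = mydf['embark_town'].fillna(mydf['embark_town'].mode()[0])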

Subsequently, a heatmap is generated by using Seabornʼs heatmap function


to visualize the distribution of missing values within the dataframe mydf.
The heatmap represents null values as True (missing) and non-null values
as False (existing), allowing for easy identification of missing data patterns
across different columns. This visualization aids in assessing the
completeness of the dataset and identifying potential areas that require
further data cleaning or imputation. Refer to the following figure for better
understanding:
Figure 16.47: Image displaying no null values display using heatmap

We are then displaying the concise summary of the dataframeʼs structure
and content to the user by using the mydf.info() statement, which is displayed
in the Jupyter Notebook.

The next code segment in Figure 16.48, displays the mypd.get_dummies()


function, which when applied to a categorical column like
mydf['embarked'], generates dummy variables for each unique category in
the column. Each unique category becomes a new binary (0 or 1) column,
indicating the presence or absence of that category for each row in the
original dataframe. Setting drop_first=True drops the first dummy variable
to avoid multicollinearity in regression models. In this specific case, head()
is used to display the first five rows of the resulting dataframe. So, this
code generates dummy variables for the embarked column in mydf, and
head() displays the first few rows of the resulting dataframe.

Figure 16.48: Image displaying generation of dummy variable for each unique category in embarked column
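As a minimal sketch, the dummy-variable generation for the embarked column reads:

# Dummy variables for the embarked column; drop_first=True avoids multicollinearity
mypd.get_dummies(mydf['embarked'], drop_first=True).head()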

In the code snippet in Figure 16.49, dummy variables are created for
categorical columns sex and embarked from the dataframe mydf by using
mypd.get_dummies(). The drop_first=True parameter is set to drop the first
dummy variable to avoid multicollinearity. After creating the dummy
variables, the original categorical columns sex, embarked, and
embark_town are dropped from the dataframe mydf by using the drop()
method along the columns axis (axis=1). This step removes the original
categorical columns since they are no longer needed after creating dummy
variables. Finally, mydf.head() is used to display the first five rows of the
modified dataframe mydf, which now contains dummy variables for the sex
and embarked columns instead of the original categorical data.

Figure 16.49: Image displaying after dropping 3 columns from the dataset
The next code segment, shown in Figure 16.50, horizontally combines
the original dataframe mydf with the dummy variables
generated for the sex and embarked columns. By utilizing mypd.concat(), it
appends these dummy variables to the original dataframe, expanding its
features for subsequent analysis or modeling.

Figure 16.50: Image displaying final dataset after appending with dummy
variables for sex and embarked columns
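A minimal sketch of the overall flow, assuming the dummy dataframes are stored in variables (the names mysex and myembark are hypothetical) before the original columns are dropped:

# Create the dummy variables, drop the original categorical columns, then append the dummies
mysex = mypd.get_dummies(mydf['sex'], drop_first=True)
myembark = mypd.get_dummies(mydf['embarked'], drop_first=True)
mydf = mydf.drop(['sex', 'embarked', 'embark_town'], axis=1)
mydf = mypd.concat([mydf, mysex, myembark], axis=1)
mydf.head()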

Conclusion

In this chapter, we learned about the practical application of data science


methodologies by exploring two distinct datasets. The first case study shifts
attention to an Electrical Fault Classification dataset, guiding participants
through a similar exploration by using diverse data science libraries. The
second case study focuses on the Titanic dataset, utilizing a range of data
science libraries and automated EDA tools. Learners will employ popular
Python libraries such as pandas, numpy, and seaborn to preprocess and
visualize the data, gaining insights into factors influencing passenger
survival. By the end of these case studies, we are confident that the readers
have acquired practical experience in applying data science techniques and
also developed a comprehensive understanding of how to leverage different
tools for varied datasets.

Points to remember

Clearly define case study objectives and problem statement.


Collect and pre-process data to ensure quality and consistency.
Utilize EDA to understand dataset characteristics, distributions, and
relationships.
Apply EDA iteratively to validate assumptions and guide further
exploration.
Employ data visualization techniques to effectively communicate
findings to stakeholders.
Use appropriate plots and visualizations to represent different data
types and relationships.
Leverage Python libraries such as Pandas, NumPy, Matplotlib, and
Seaborn for EDA and visualization.
Document key findings, insights, and recommendations to support
decision-making processes.

Questions

1. Explore case studies for all the inbuilt seaborn datasets.


2. Write steps to perform EDA.
3. Do we need to perform data visualization before applying any
Machine Learning algorithm?
4. What is the need to explore columns with missing values in the
dataset?
5. Explore case studies on house prediction datasets by using data
science libraries and Automated EDA Tools.

Join our book’s Discord space

Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:

https://discord.bpbonline.com
Index
A

Anaconda, installing steps 5-10

annotate() 183

append() 86

apply() 301

arange() 32

arrays, categories

concatenate() 69, 71

dstack() 74, 75

hstack() 73

stack() 71, 72

vstack() 72-74

array_split() 81

astype() 254

axes_style() 380

bar chart, elements

annotate() function, using 183, 184


bottom, changing 179, 180

color, changing 176, 177

left alignment 180

right alignment 181

text() function, using 182, 183

width, changing 177, 178

barh() 185

Barplot 174

Barplot, steps

customizing 174

data, preparing 174

display, analyzing 174

importing 174

plot, creating 174

bellman_ford 115

Broadcasting 53

Broadcasting, points 53

Broadcasting, rules 53, 55

callable 219
cast() 353

ChatGPT Cheat Sheet, versioning 456-458

ChatGPT Prompts 438, 439

ChatGPT Prompts, challenges

articulating logic 458

contextual gaps 458

error message, deciphering 458

knowledge, implementing 458

question, articulating 458

unfamiliarity, terminology 458

ChatGPT Prompts Error, querying 454-456

ChatGPT Prompts Libraries, querying 443-453

ChatGPT Prompts/Numpy, comparing 439

clustered bar chart 191-193

collect() 362

concatenate() 69

ConvexHull 117

create_report() 468

Dataframe, aspects
column name, renaming 292, 293

index labels, renaming 290-292

index position, accessing 284-288

in/not, operators 298, 299

names with query, handling 299, 300

nlargest()/nsmallest(), using 304, 305

Pandas, handling 307-309

Pandas, modifying 312, 314

query, filtering 296, 297

row/columns, deletion 293-296

string method, filtering 309-312

dataprep 467

Data Science, terms

Qualitative Variables 379

Quantitative Variables 378

dataset with autoviz, utilizing 475-478

dataset with dataprep, utilizing 467-472

delete() 87

diag() 40

distplot() 408
dot() 89

drop() 293

drop_duplicates() 273

dsplit() 79

dstack() 75

duplicated() 271

electrical fault classification, case study 488-503

Elements Searching 83

Elements Searching, approaches

condition based, selecting 84

where() 83

empty() 42

endswith() 309

FacetGrid 430

FacetGrid, concepts 432-434

FacetGrid, key features 430

FacetGrid, key points 430

FacetGrid, use cases 431


flatten() 63

head() 212

hist() 496

Histograms 195-197

horizontal bar chart 185, 186

hsplit() 78

hstack() 73

insert() 85

Inserting Elements 84

Inserting Elements, approaches

append() 86, 87

insert() 85, 86

integrate 120

integrate, type

multiple integration 122, 123

single integration 121

Interpolation 124

Interpolation, types
1D Interpolation 124, 125

splev function 128-131

Spline Interpolation 125-128

splrep function 128

iris.csv 330

isin() 266

isnull 269

joint plot 426-429

jointplot() 426

joint plot, key points

correlation 427

data, presentating 427

distributions, comparing 427

EDA 427

joint plot, reasons 427

Jupyter Notebook, components

Cell 13

Kernel 13

menu bar 13
output area 13

tool bar 13

Jupyter Notebook, insights 11, 12

Jupyter Notebook, keyboard operations 14-16

Jupyter Notebook, states

command, mode 13

edit mode 13

Jupyter Notebook, variables

In[] 14

Out[] 14

kdeplot() 408

KDE Plot 408-415

KDTree() 118

line plot 137, 422, 423

line plot arrays, implementing 138

line plot figure size, customizing 152-54

line plot, form

CLM 147
MLC 146

line plot, graphs

comparisons 423

correlation, relationships 423

interpolation 423

time series data 423

trend, analyzing 423

line plot grid lines, optimizing 157-160

line plot minor grid, visualizing 160, 161

line plot property, types

alpha 149

color 143, 144

linestyle 141, 142

linewidth/markersize 150

markerfacecolor 151

line plot, utilizing 154-156

linspace() 34

linspace(), parameters

dtype 34

endpoint 34
num 34

retstep 34

start 34

stop 34

load_dataset() 379, 421

loc[] 281

minimize() 102

mydf.isnull() 505

myfunc 123

myfunc_eqn 104

myplt.bar() 174

myplt.grid() 157

myplt.title() 139

myplt.yscale() 167

ndarray elements, accessing 48, 49

ndarray elements, iterate 49-51

ndarray, sorting 81, 83

notnull 269
nquad 122

numpy arithmetic operators, analyzing 51, 52

numpy arrays, distinctions 26-28

numpy arrays, functions

flatten() 63

flat variable 64

ravel() 65, 66

reshape() 60-62

resize() 62

transpose() 66, 69

numpy arrays, module

randint() 43, 44

shuffle() 44-48

numpy arrays, process

1-D Array, creating 29

arange() function, using 32, 33

array with dtype, analyzing 31

diag() function, utilizing 40, 42

empty() function, utilizing 42

eye() function, using 39, 40


full() function, utilizing 38

linspace() function, using 34, 35

nested lists, using 30, 31

object type, optimizing 32

ones() function, utilizing 37

tuple, using 29, 30

zeros/empty function, contrast 43

zeros() function, utilizing 36, 37

numpy arrays, similarities 26

numpy linalg module, optimizing 90, 91

numpy/scipy, differences 94

numpy, splitting arrays

array_split() 81

dsplit() 79, 80

hsplit() 78, 79

split() 76, 77

vsplit() 78

nunique() 274

Pandas Boolean Values, preventing 217, 218

Pandas Data, filtering 264-266

Pandas Dataframe 236

Pandas Dataframe, attributes

axes 239

columns 239

dtype 239

Pandas Dataframe, constructor

columns 237

copy 237

data 237

dtype 237

index 237

Pandas Dataframe, key points

arithmetic operators, optimizing 248, 249

columns, adding 250-252

columns, sorting 258-260

duplicate method, using 271, 273

fillna() method 252-256

index, sorting 260, 261

null values with rows 244-247

rank() method 262, 263

values, sorting 256-258

Pandas Dataframe, methods

info() method 240-242

sample() method 243, 244

Pandas/Polars, comparing 319

Pandas, reasons

ease, use 318

features 319

performance 318

scalability 318

Pandas Series 207

Pandas Series, aggregating 224-226

Pandas Series, attributes 221

Pandas Series Elements, accessing 214

Pandas Series, filtering 218, 219

Pandas Series, key points

dictionary 208

dtype, configuring 211

head/tail, methods 212, 213

list, passing 208

numpy array 209

parameter, using 211, 212

RangeIndex, preventing 210

scalar value, optimizing 209

Pandas Series, slicing 214, 215

Pandas Series, syntax

copy 208

data 207

dtype 207

fastpath 208

index 207

name 208

Pandas Series, uses 223, 224

Pandas Series Value, extraction 216, 217

Pandas Visual Analysis 478-483

pie chart 193-195

plot() 155

plot.pie() 496

plotting_context() 382

Polars Data, concepts

flexibility 341

lazy, evaluating 341

memory, efficiency 341

model, evaluating 341

optimization 341

parallelism 341

regarding 341

Polars Data, operations 330-338

Polars Data, reasons

aggregation 339

data, combining 340

data transformation 339

filtering 339

performance, optimizing 340

Polars Data, structures

Dataframe 328-330

series 327, 328

Polars Data, types

nested group 320

numeric group 319, 320

others group 321-327

temporal group 321

Polars, key aspects

aggregating 360-362

array, collecting 362-366

casting 353, 354

column, visualizing 349-351

folds 366

functions 351-353

list, preventing 369-372

numpy, usage 372, 373

operators 346-348

strings 358-360

PyGWalker Dataset, using 462-467

pyplot 137

Python 2

Python Data Science, libraries 22, 23

Python Data Visualization, tools

Altair 137

Bokeh 137

Matplotlib 136

Plotly 137

Seaborn 137

Python, installing steps 2, 3

Python With Linux, utilizing 3-5

Python With macOS, utilizing 3

query() 296, 300

Radial Basis Function (RBF) 132

randint() 43

randint() function, syntax

dtype 43

high 43

low 43

size 43

RangeIndex 210

rank() 262

ravel() 65

rename() 290

resize() 62

scatter() 455

scatter plot 198, 199, 424

scatter plot, comparing 198

scatter plot, use cases

clustering 425

correlation, analyzing 425

groups, comparing 425

outlier, detecting 425

scipy, categories

angle 97

area 101

binary 97

energy 100

force 97

length 99

mass 98

metric 96

power 101

pressure 99

speed 100

temperature 100

volume 99

scipy graphs, functions

adjacency matrix 113

connected_components, method 113-115

Dijkstra, method 114

Floyd-Warshall, method 114

scipy optimizers, visualizing 101-105

scipy sparse matrix, types

Block Sparse Row (BSR) 107, 108

Compressed Sparse Column (CSC) 108, 109

Compressed Sparse Row (CSR) 109, 110

Coordinate List Matrix (COO) 108

Diagonal Sparse Matrix 110

Dictionary Of Keys (DOK) 111

Row-based Linked List (LIL) 112, 113

Seaborn, architectures 379-381

Seaborn Box Plot, configuring 400-404

Seaborn, concepts 384-386

Seaborn Heatmap Plot, configuring 391-395

Seaborn, palettes

Diverging 389-391

Qualitative 386-388

Sequential 388, 389

set_style() 380

show() 156

sort() 81

sort_index() 260

sort_values() 256

splev 128

split() 76

splrep() 128

stack() 71

stacked bar chart 187, 188

stacked bar chart, breakdown 188, 189

startswith() 309

str() 185

subplot() 202

tail() 213

text() 182

tight_layout 182

titanic dataset, case study 504-519

title() 156

transpose() 66

unique() 274

unique/nunique methods, uses 274, 275

violin plot 416-422

Visual Studio

reference link 16

VSCode, demo program 20, 21

VSCode, installing steps 16-20

vsplit() 78

vstack() 72

where() 83, 305

xlim 165, 166
