
Efficient Python for Data Scientists

By: Youssef Hosni

To Data & Beyond

Brief of Contents
1. Three Principles to Write Python Clean Code......................................................................11
2. Defining & Measuring Code Efficiency................................................................................20
3. Optimizing Your Python Code.............................................................................................. 36
4. 5 Tips to Write Efficient Python Functions......................................................................... 50
5. How to Use Caching to Speed Up Your Python Code & LLM Application?..................... 57
6. Best Practices To Use Pandas Efficiently As A Data Scientist......................................... 65
7. Make Your Pandas Code 1000 Times Faster With This Trick............................................ 88
8. Data Exploration Becomes Easier & Better With Pandas Profiling.................................. 92
9. Top 10 Pandas Mistakes to Steer Clear of in Your Code................................................. 101
Afterword.................................................................................................................................. 115
Table of Contents
1. Three Principles to Write Python Clean Code......................................................................11
1.1. Characteristics of High-Quality Production Code............................................................11
1.2. Naming Conventions........................................................................................................ 12
1.3. Using Nice White Space.................................................................................................. 16
1.3.1. Indentation............................................................................................................. 16
1.3.2. Maximum Line Length............................................................................................17
1.3.3. Blank Lines............................................................................................................ 17
1.4. Comments & Documentation..........................................................................................18
1.4.1. In-line Comments...................................................................................................18
1.4.2. Docstrings.............................................................................................................. 19
1.4.3. Documentation.......................................................................................................20
2. Defining & Measuring Code Efficiency................................................................................20
2.1. Defining Efficiency.......................................................................................................... 21
2.1.1. What is meant by efficient code?........................................................................... 21
2.1.2. Python Standard Libraries..................................................................................... 22
Python Functions....................................................................................................... 22
Python Modules......................................................................................................... 24
2.2. Python Code Timing and Profiling.................................................................................. 29
2.2.1. Python Runtime Investigation................................................................................ 29
2.2.2. Code profiling for runtime.......................................................................................31
2.2.3. Code profiling for memory use...............................................................................33
3. Optimizing Your Python Code.............................................................................................. 36
3.1. Making Your Code Efficient.............................................................................................36
3.1.1. Efficiently Combining, Counting, and Iterating....................................................... 36
Combining Objects.....................................................................................................36
Counting Objects....................................................................................................... 37
Itertools...................................................................................................................... 39
Combinations............................................................................................................. 39
3.2. Introduction to The Set Theory....................................................................................... 40
3.3. Eliminating Loops........................................................................................................... 43
3.4. Writing Better Loops....................................................................................................... 46
4. 5 Tips to Write Efficient Python Functions......................................................................... 50
4.1. Your Function Should Only Do One Thing.....................................................................50
4.2. Add Type Hints for Better Readability & Maintainability................................................. 51
4.3. Enforce Keyword-Only Arguments to Minimize Errors................................................... 53
4.4. Use only Arguments You Need.......................................................................................55
4.5. Use Generators to Return Lists...................................................................................... 56
5. How to Use Caching to Speed Up Your Python Code & LLM Application?..................... 57
5.1. What is a cache in programming?.................................................................................. 58
5.2. Why is Caching Helpful?.................................................................................................58
5.3. Common Uses for Caching.............................................................................................59
5.4. Common Caching Strategies..........................................................................................59
5.5. Python caching using a manual decorator......................................................................60
5.6. Python caching using LRU cache decorator...................................................................61
5.7. Function Calls Timing Comparison................................................................................. 62
5.8. Use Caching to Speed up Your LLM Application............................................................ 63
6. Best Practices To Use Pandas Efficiently As A Data Scientist......................................... 65
6.1. Why do We need Efficient Coding?................................................................................ 67
6.2. Selecting & Replacing Values Effectively........................................................................69
6.2.1. Selecting Rows & Columns Efficiently using .iloc[] & .loc[].................................... 69
6.2.2. Replacing Values in a DataFrame Effectively........................................................ 71
6.3. Iterate Effectively Through Pandas DataFrame..............................................................75
6.3.1. Looping effectively using .iterrows().......................................................................75
6.3.2. Looping effectively using .apply()...........................................................................77
6.3.3. Looping effectively using vectorization.................................................................. 79
6.4. Transforming Data Effectively With .groupby()............................................................... 82
6.4.1. Common functions used with .groupby()............................................................... 82
6.4.2. Missing value imputation using .groupby() & .transform()......................................84
6.4.3. Data filtration using the .groupby() & .filter().......................................................... 86
6.5. Summary of Best Practices............................................................................................ 87
7. Make Your Pandas Code 1000 Times Faster With This Trick............................................ 88
7.1. Create Dataset & Problem Statement............................................................................ 88
7.2. Level 1: Loops................................................................................................................ 89
7.3. Level 2: Apply Function.................................................................................................. 90
7.4. Level 3: Vectorization.......................................................................................................90
7.5. Measuring the Difference................................................................................................90
8. Data Exploration Becomes Easier & Better With Pandas Profiling.................................. 92
8.1. What is Pandas Profiling?...............................................................................................92
8.2. Installation of Pandas Profiling....................................................................................... 93
8.3. Pandas Profiling in Action...............................................................................................94
8.3.1. Overview................................................................................................................ 95
8.3.2. Variables................................................................................................................ 96
8.3.3. Interaction.............................................................................................................. 96
8.3.4. Correlations............................................................................................................97
8.3.5. Missing Values....................................................................................................... 98
8.3.6. Samples................................................................................................................. 99
8.4. Drawbacks of Pandas Profiling & How to Overcome It...................................................99
9. Top 10 Pandas Mistakes to Steer Clear of in Your Code................................................. 101
9.1. Having Column Names with Spaces............................................................................ 101
9.2. Not Using Query Method for Filtering........................................................................... 102
9.3. Not using @ Symbol when Writing Complex Queries.................................................. 102
9.4. Iterating over Dataframe instead of using Vectorization............................................... 103
9.5. Treating Slices of Dataframe as New Dataframe..........................................................105
9.6. Not Using Chain Commands for Multiple Transformations...........................................106
9.7. Not Setting Column dtypes Correctly............................................................................107
9.8. Not Using Pandas Plotting Builtin Function.................................................................. 107
9.9. Aggregation Manually instead of using .groupby()........................................................ 111
9.10. Saving Large Datasets as CSV File............................................................................113
Afterword.................................................................................................................................. 115
About this book
Efficient Python for Data Scientists is your practical companion to mastering the art of writing
clean, optimized, and high-performing Python code for data science. In this book, you'll explore
actionable insights and strategies to transform your Python workflows, streamline data analysis,
and maximize the potential of libraries like Pandas.

Through nine meticulously crafted chapters, this book covers everything from foundational
principles of clean code and efficiency to advanced techniques like caching and vectorization.
You'll learn how to write better functions, avoid common pitfalls, and elevate your data
exploration process using tools like Pandas Profiling.

Whether you're an aspiring data scientist or a seasoned professional, this book equips you with:

● Clear guidelines for writing clean and maintainable Python code.


● Techniques to measure and optimize code efficiency effectively.
● Practical advice for using Pandas to handle data tasks faster and smarter.
● Proven methods to avoid common coding mistakes and improve productivity.

Each chapter is packed with examples, tips, and tricks, ensuring you have a hands-on learning
experience. Dive into Efficient Python for Data Scientists and take the next step toward writing
Python code that is not just functional but exceptional.

Who Should Read This Book?


Efficient Python for Data Scientists is for anyone looking to improve their Python skills and
streamline their workflows in data science. Whether you're an aspiring data scientist building
foundational skills, a seasoned professional seeking advanced optimization techniques, or a
data analyst or engineer working to handle large datasets more efficiently, this book has
something for you. Python enthusiasts eager to write cleaner, faster, and more maintainable
code will find practical tips and insights, while students and academics can benefit from
improved coding practices for their projects and research. If you want to elevate your Python
coding efficiency and make the most of your data science projects, this book is for you.

About the code


To make it easy to follow along with the book, all the code examples in this book can be found
in this GitHub repository, along with the supplementary code used in the third
part of the book.
About the author
Youssef Hosni is a data scientist and machine learning
researcher who has been working in machine learning and AI for
more than half a decade. In addition to being a researcher and
data science practitioner, Youssef has a strong passion for
education. He is known for his leading data science and AI blog,
newsletter, and eBooks on data science and machine learning.

Youssef is a senior data scientist at Ment, focusing on building
Generative AI features for Ment products. He is also an applied AI
researcher at Aalto University, working on AI agents and
their applications. Before that, he worked as a researcher applying
deep learning and computer vision techniques to medical images.
1. Three Principles to Write Python Clean Code
Writing clean code is an essential skill for every programmer, and it’s not as easy as you might
think. Even experienced coders struggle to write clean code, and it often feels like a constant
battle to keep things tidy and organized. But how do you go about doing that?

Clean code is substantially more than just removing all your commented lines or keeping the
length of your functions to a minimum. It’s about making your code readable so that any other
coder coming to your project in the future will know exactly what you meant with a given piece of
code without having to dig through comments or documentation.

There are lots of principles, techniques, and best practices we can follow to write clean Python
code. Given below are some tips that will help you get started and make the process easier the
next time you write code.

1.1. Characteristics of High-Quality Production Code

In any software project, the code is one of the most important assets. The final production code
must be clean and easy to understand in order to facilitate its maintenance.

Reusing parts of code, modularity, and object orientation are some of the techniques used to
produce high-quality code.
In this section, I describe several characteristics that help identify high-quality production code.

These characteristics may not seem important at first glance but they have a major impact on
how efficiently developers can work with your project’s source code. Let’s take a look!

1. Production Code: software running on production servers to handle live users and data
of the intended audience. Note this is different from production quality code, which
describes code that meets expectations in reliability, efficiency, etc., for production.
Ideally, all code in production meets these expectations, but this is not always the case.
2. Clean: readable, simple, and concise. A characteristic of production quality code that is
crucial for collaboration and maintainability in software development. Clean code is a
very important characteristic of high-quality production code, and writing clean code will
lead to:
● Focused Code: Each function, class, or module should do one thing and do it
well.
● Easy to read code: According to Grady Booch, author of Object-Oriented
Analysis and Design with Applications, clean code reads like well-written prose.
● Easy to debug code: Clean code can easily be debugged and its errors fixed, as
it is easy to read and follow.
● Easy to maintain code: That is, it can easily be read and enhanced by other
developers.
3. Modular Code: logically broken up into functions and modules. Also an important
characteristic of production quality code; it makes your code more organized,
efficient, and reusable. Modules allow code to be reused by encapsulating it into files
that can be imported into other files.
4. Refactoring: Restructuring your code to improve its internal structure, without changing
its external functionality. This gives you a chance to clean and modularize your program
after you’ve got it working. Since it isn’t easy to write your best code while you’re still
trying to just get it working, allocating time to do this is essential to producing high-quality
code. Despite the initial time and effort required, this really pays off by speeding up your
development time in the long run.

So it is normal to first write code that works, and then refactor it to make it
clean. You become a much stronger programmer when you're constantly looking to
improve your code. The more you refactor, the easier it will be to structure and write good code
the first time.

1.2. Naming Conventions


Naming conventions are one of the most useful and important aspects of writing clean code.
When naming variables, functions, classes, etc, you should use meaningful names that are
descriptive and clear. And this means we would favor long descriptive names over short
ambiguous names. First, let’s start with the PEP 8 naming conventions:

● class names should be CamelCase (MyClass)


● variable names should be snake_case and all lowercase (first_name)
● function names should be snake_case and all lowercase (quick_sort())
● constants should be snake_case and all uppercase (PI = 3.14159)
● modules should have short, snake_case names and all lowercase (numpy)
● single quotes and double quotes are treated the same (just pick one and be consistent)

Here is a more detailed guide on how to choose descriptive and good names:

1. Variables
● Use long descriptive names that are easy to read: It is very important to make names
descriptive enough that they can be understood on their own. This removes the need for
explanatory comments:

# Not recommended
# The au variable is the number of active users
au = 105

# Recommended
total_active_users = 105

● Use descriptive intention revealing types: Your coworkers and developers should be
able to figure out what your variable type is and what it stores from the name. In a
nutshell, your code should be easy to read and reason about.

# Not recommended
c = ["UK", "USA", "UAE"]

for x in c:
    print(x)

# Recommended
cities_list = ["UK", "USA", "UAE"]
for city in cities_list:
    print(city)

● Always use the same vocabulary: Be consistent with your naming convention.
Maintaining a consistent naming convention is important to eliminate confusion when
other developers work on your code. And this applies to naming variables, files,
functions, and even directory structures.

# Not recommended
client_first_name = 'John'
customer_last_name = 'Doe'

# Recommended
client_first_name = 'John'
client_last_name = 'Doe'

# Another example:

# bad code
def fetch_clients(response, variable):
    # do something
    pass

def fetch_posts(res, var):
    # do something
    pass

# Recommended
def fetch_clients(response, variable):
    # do something
    pass

def fetch_posts(response, variable):
    # do something
    pass

● Don’t use magic numbers: Magic numbers are numbers with special, hardcoded
semantics that appear in code but do not have any meaning or explanation. Usually,
these numbers appear as literals in more than one location in our code.

import random

# Not recommended
def roll_dice():
    return random.randint(0, 4)  # what is 4 supposed to represent?

# Recommended
DICE_SIDES = 4

def roll_dice():
    return random.randint(0, DICE_SIDES)

2. Functions
● Long names != descriptive names: You should be descriptive, but only with relevant
information. For example, good function names describe what they do well without including
details about implementation or highly specific uses.

DICE_SIDES = 4

# Not recommended
def roll_dice_using_randint():
    return random.randint(0, DICE_SIDES)

# Recommended
def roll_dice():
    return random.randint(0, DICE_SIDES)

● Be consistent with your function naming convention: As seen with the variables
above, stick to a naming convention when naming functions. Using different naming
conventions would confuse other developers and colleagues.

# Not recommended

def fetch_user(id):
    # do something
    pass

def get_post(id):
    # do something
    pass

# Recommended

def fetch_user(id):
    # do something
    pass

def fetch_post(id):
    # do something
    pass

● Do not use flags or Boolean flags: Boolean flags are variables that hold a boolean
value — true or false. These flags are passed to a function and are used by the function
to determine its behavior.

text = "Python is a simple and elegant programming language."

# Not recommended
def transform_text(text, uppercase):
    if uppercase:
        return text.upper()
    else:
        return text.lower()

uppercase_text = transform_text(text, True)
lowercase_text = transform_text(text, False)

# Recommended
def transform_to_uppercase(text):
    return text.upper()

def transform_to_lowercase(text):
    return text.lower()

uppercase_text = transform_to_uppercase(text)
lowercase_text = transform_to_lowercase(text)

3. Classes
● Do not add redundant context: This can occur by adding unnecessary context to
attribute and variable names when working with classes.

# Not recommended
class Person:
    def __init__(self, person_username, person_email, person_phone,
                 person_address):
        self.person_username = person_username
        self.person_email = person_email
        self.person_phone = person_phone
        self.person_address = person_address

# Recommended
class Person:
    def __init__(self, username, email, phone, address):
        self.username = username
        self.email = email
        self.phone = phone
        self.address = address

1.3. Using Nice White Space

1.3.1. Indentation
Organize your code with consistent indentation; the standard is to use 4 spaces for each indent.
You can make this a default in your text editor. When using a hanging indent, the following
should be considered: there should be no arguments on the first line, and further indentation
should be used to clearly distinguish the continuation line:

# Correct:

# Aligned with opening delimiter.
foo = long_function_name(var_one, var_two,
                         var_three, var_four)

# Add 4 spaces (an extra level of indentation) to distinguish arguments
# from the rest.
def long_function_name(
        var_one, var_two, var_three,
        var_four):
    print(var_one)

# Hanging indents should add a level.
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)

# Wrong:

# Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
    var_three, var_four)

# Further indentation required as indentation is not distinguishable.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)

1.3.2. Maximum Line Length


Try to limit your lines to around 79 characters, which is the guideline given in the PEP 8 style
guide. In many good text editors, there is a setting to display a subtle line that indicates where
the 79-character limit is.

1.3.3. Blank Lines
Adding blank lines to your code will make it better, cleaner, and easier to follow. Here is a simple
guide on how to add blank lines to your code (a short example follows the list):

● Surround top-level function and class definitions with two blank lines.
● Method definitions inside a class are surrounded by a single blank line.
● Extra blank lines may be used (sparingly) to separate groups of related functions. Blank
lines may be omitted between a bunch of related one-liners (e.g. a set of dummy
implementations).
● Use blank lines in functions, sparingly, to indicate logical sections.
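
Below is a minimal sketch illustrating these blank-line conventions (the function and class names are placeholders, not taken from the book):

import math


def area(radius):
    """Top-level functions are separated by two blank lines."""
    return math.pi * radius ** 2


class Circle:
    """Class definitions are also surrounded by two blank lines."""

    def __init__(self, radius):
        self.radius = radius

    def area(self):
        # A single blank line separates method definitions inside a class.
        return area(self.radius)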

1.4. Comments & Documentation


No matter how hard we try to write clean code, there are still going to be parts of your program
that need additional explanation. Comments allow us to quickly tell other developers (and our
future selves) why we wrote it in the manner that we did. However, be careful that too many
comments can make your code messier than it would be without them.

1.4.1. In-line Comments


In-line comments are text following hash symbols throughout your code. They are used to
explain parts of your code, and really help future contributors understand your work.

One way comments are used is to document the major steps of complex code to help readers
follow. Then, you may not have to understand the code to follow what it does. However, others
would argue that this is using comments to justify bad code and that if code requires comments
to follow, it is a sign refactoring is needed.

Comments are valuable for explaining what the code itself cannot: why it was written a certain
way, or why certain values were selected. For example, the history behind why a certain
method was implemented in a specific way. Sometimes an unconventional or seemingly
arbitrary approach may be applied because of some obscure external variable causing side
effects. These things are difficult to explain with code.

Here are some tips to write good comments:


1. Don’t comment on bad code, rewrite it: Commenting on bad code will only help you in
the short term. Sooner or later one of your colleagues will have to work with your code
and they’ll end up rewriting it after spending multiple hours trying to figure out what it
does. Therefore it is better to rewrite the bad code from the beginning instead of just
commenting on it.
2. Do not add comments when there is no need to: If your code is readable enough you
don’t need comments. Adding useless comments will only make your code less
readable. Here’s a bad example:

# This checks if the user with the given ID doesn't exist.
if not User.objects.filter(id=user_id).exists():
    return Response({
        'detail': 'The user with this ID does not exist.',
    })

As a general rule, if you need to add comments, they should explain why you did something
rather than what is happening.
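
As an illustration (a made-up snippet, not from the book), compare a comment that merely restates the code with one that explains the reasoning behind it:

import time

# Bad: the comment just repeats what the code does.
# Sleep for half a second.
time.sleep(0.5)

# Good: the comment explains why the value was chosen.
# The external API allows at most 2 requests per second, so we wait
# half a second between calls to stay under the rate limit.
time.sleep(0.5)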

3. Don't leave commented-out, outdated code: The worst thing you can do is to leave
commented-out code in your programs. All the debug code or debug messages should be
removed before pushing to a version control system, otherwise, your colleagues will be
scared of deleting it and your commented-out code will stay there forever.

1.4.2. Docstrings
Docstrings, or documentation strings, are valuable pieces of documentation that explain the
functionality of any function or module in your code. Ideally, each of your functions should
always have a docstring. Docstrings are surrounded by triple quotes.

The first line of the docstring is a brief explanation of the function’s purpose. The next element of
a docstring is an explanation of the function’s arguments. Here you list the arguments, state
their purpose, and state what types the arguments should be. Finally, it is common to provide
some description of the output of the function. Every piece of the docstring is optional; however,
docstrings are a part of good coding practice. Below are two examples of docstrings for a
function. The first uses a single-line docstring, and the second uses a multi-line
docstring:

def population_density(population, land_area):
    """Calculate the population density of an area."""
    return population / land_area


def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population: int. The population of the area
        land_area: int or float. This function is unit-agnostic, if you pass in
            values in terms of square km or square miles the function will return a
            density in those units.

    Returns:
        population_density: population / land_area. The population density of a
        particular area.
    """
    return population / land_area

1.4.3. Documentation
Project documentation is essential for getting others to understand why and how your code is
relevant to them, whether they are potential users of your project or developers who may
contribute to your code.

A great first step in project documentation is your README file. It will often be the first
interaction most users will have with your project. Whether it’s an application or a package, your
project should absolutely come with a README file. At a minimum, this should explain what it
does, list its dependencies, and provide sufficiently detailed instructions on how to use it. You
want to make it as simple as possible for others to understand the purpose of your project, and
quickly get something working.

Translating all your ideas and thoughts formally on paper can be a little difficult, but you'll get
better over time, and it makes a significant difference in helping others realize the value of your
project. Writing this documentation can also help you improve the design of your code, as you’re
forced to think through your design decisions more thoroughly. This also allows future
contributors to know how to follow your original intentions.

2. Defining & Measuring Code Efficiency
In this chapter, we will discuss what efficient Python code is and how to use Python's different
built-in data structures, functions, and modules to write cleaner, faster, and more efficient code.

We'll explore how to time and profile code in order to find bottlenecks. Then, in the next chapter,
we will practice eliminating these bottlenecks, and other bad design patterns, using the Python
libraries most used by data scientists: NumPy and pandas.

2.1. Defining Efficiency

2.1.1. What is meant by efficient code?

Efficient code satisfies two key concepts. First, efficient code is fast and has a
small latency between execution and returning a result. Second, efficient code allocates
resources skillfully and isn't subject to unnecessary overhead. Although in general your
definition of fast runtime and small memory usage may differ depending on the task at hand, the
goal of writing efficient code is still to reduce both latency and overhead.

Python is a language that prides itself on code readability, and thus, it comes with its own set of
idioms and best practices. Writing Python code the way it was intended is often referred to as
Pythonic code. This means the code that you write follows the best practices and guiding
principles of Python. Pythonic code tends to be less verbose and easier to interpret. Although
Python supports code that doesn’t follow its guiding principles, this type of code tends to run
slower.

As an example, look at the non-Pythonic code below. Not only is this code more verbose than the
Pythonic version, but it also takes longer to run. We'll take a closer look at why this is the case
later on in the book, but for now, the main takeaway here is that Pythonic code is efficient
code!

numbers = [1, 2, 3, 4, 5]  # example input list

# Non-Pythonic
doubled_numbers = []

for i in range(len(numbers)):
    doubled_numbers.append(numbers[i] * 2)

# Pythonic
doubled_numbers = [x * 2 for x in numbers]

2.1.2. Python Standard Libraries

Python Standard Libraries are the built-in components and libraries of Python. These libraries
come with every Python installation and are commonly cited as one of Python’s greatest
strengths. Python has a number of built-in types.

It’s worth noting that Python’s built-ins have been optimized to work within the Python language
itself. Therefore, we should default to using a built-in solution (if one exists) rather than
developing our own.

We will focus on certain built-in types, functions, and modules. The types that we will focus on
are:

● Lists
● Tuples
● Sets
● Dicts

The built-in functions that we will focus on are:

● print()
● len()
● range()
● round()
● enumerate()
● map()
● zip()

Finally, the modules we will work with are listed below (note that NumPy and pandas are
third-party packages rather than built-in modules):

● os
● sys
● NumPy
● pandas
● itertools
● collections
● math

Python Functions

Let's start exploring some of the mentioned functions:

● range(): This is a handy tool whenever we want to create a sequence of numbers.


Suppose we wanted to create a list of integers from zero to ten. We could explicitly type
out each integer, but that is not very efficient. Instead, we can use range to accomplish
this task. We can provide a range with a start and stop value to create this sequence. Or,
we can provide just a stop value assuming that we want our sequence to start at zero.
Notice that the stop value is exclusive, or up to but not including this value. Also note the
range function returns a range object, which we can convert into a list and print. The
range function can also accept a start, stop, and step value (in that order).

# range(start,stop)
nums = range(0,11)
nums_list = list(nums)
print(nums_list)

# range(stop)
nums = range(11)
nums_list = list(nums)
print(nums_list)

# Using range() with a step value

even_nums = range(2, 11, 2)


even_nums_list = list(even_nums)
print(even_nums_list)

● enumerate(): Another useful built-in function is enumerate. enumerate creates an
index-item pair for each item in the object provided. For example, calling enumerate on the list
letters produces a sequence of indexed values. Similar to the range function, enumerate
returns an enumerate object, which can also be converted into a list and printed.

letters = ['a', 'b', 'c', 'd' ]


indexed_letters = enumerate(letters)
indexed_letters_list = list(indexed_letters)
print(indexed_letters_list)

We can also specify the starting index of enumerate with the keyword argument start. Here, we
tell enumerate to start the index at five by passing start equals five into the function call.

#specify a start value


letters = ['a', 'b', 'c', 'd' ]
indexed_letters2 = enumerate(letters, start=5)
indexed_letters2_list = list(indexed_letters2)
print(indexed_letters2_list)

● map(): The last notable built-in function we’ll cover is the map() function. map applies a
function to each element in an object. Notice that the map function takes two arguments;
first, the function you’d like to apply, and second, the object you’d like to apply that
function on. Here, we use a map to apply the built-in function round to each element of
the nums list.

nums = [1.5, 2.3, 3.4, 4.6, 5.0]


rnd_nums = map(round, nums)
print(list(rnd_nums))

The map function can also be used with a lambda, or, an anonymous function. Notice here, that
we can use the map function and a lambda expression to apply a function, which we’ve defined
on the fly, to our original list nums. The map function provides a quick and clean way to apply a
function to an object iteratively without writing a for loop.

# map() with lambda


nums = [1, 2, 3, 4, 5]
sqrd_nums = map(lambda x: x ** 2, nums)
print(list(sqrd_nums))

Python Modules

NumPy, or Numerical Python, is an invaluable Python package for Data Scientists. It is the
fundamental package for scientific computing in Python and provides a number of benefits for
writing efficient code. NumPy arrays provide a fast and memory-efficient alternative to Python
lists. Typically, we import NumPy as np and use np dot array to create a NumPy array.

# python list
nums_list = list(range(5))
print(nums_list)

# Using numpy alternative to Python lists


import numpy as np
nums_np = np.array(range(5))
print(nums_np)

NumPy arrays are homogeneous, which means that they must contain elements of the same
type. We can see the type of the elements using the .dtype attribute. Suppose we created an
array using a mixture of types. Here, we create the array nums_np_floats using the integers
[1, 3] and a float [2.5]. Can you spot the difference in the output? The integers now have a
trailing dot in the array. That's because NumPy converted the integers to floats to retain the
array's homogeneous nature. Using .dtype, we can verify that the elements in this array are
floats.

# NumPy array homogeneity


nums_np_ints = np.array([1, 2, 3])
print('integer numpy array',nums_np_ints)
print(nums_np_ints.dtype)

nums_np_floats = np.array([1, 2.5, 3])


print('float numpy array',nums_np_floats)
print(nums_np_floats.dtype)

Homogeneity allows NumPy arrays to be more memory efficient and faster than Python lists.
Requiring all elements to be the same type eliminates the overhead needed for data type
checking.

When analyzing data, you’ll often want to perform operations over entire collections of values
quickly. Say, for example, you’d like to square each number within a list of numbers. It’d be nice
if we could simply square the list, and get a list of squared values returned. Unfortunately,
Python lists don’t support these types of calculations.

# Python lists don't support broadcasting


nums = [-2, -1, 0, 1, 2]
nums ** 2

We could square the values using a list by writing a for loop or using a list comprehension as
shown in the code below. However, neither of these approaches is the most efficient way of
doing this. Here lies the second advantage of NumPy arrays — their broadcasting functionality.
NumPy arrays vectorize operations, so they are performed on all elements of an object at once.
This allows us to efficiently perform calculations over entire arrays.

Let's compare the computational time using these three approaches in the following code:

import time

# define a numerical list
nums = range(0,1000)
nums = list(nums)

# For loop (inefficient option)

# get the start time
st = time.time()

sqrd_nums = []
for num in nums:
    sqrd_nums.append(num ** 2)
#print(sqrd_nums)

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time using for loops over list:', elapsed_time, 'seconds')

# List comprehension (better option but not best)

# get the start time
st = time.time()

sqrd_nums = [num ** 2 for num in nums]
#print(sqrd_nums)

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time using list comprehension:', elapsed_time, 'seconds')

# using numpy array broadcasting

# define the numpy array
nums_np = np.arange(0,1000)

# get the start time
st = time.time()

nums_np ** 2

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time using numpy array broadcasting:', elapsed_time,
'seconds')

We can see that the first two approaches take roughly the same time, while using NumPy
broadcasting in the third approach cuts the computational time roughly in half.

Another advantage of NumPy arrays is their indexing capabilities. When comparing basic
indexing between a one-dimensional array and a list, the capabilities are identical. When using
two-dimensional arrays and lists, the advantages of arrays are clear.

To return the second item of the first row in our two-dimensional object, the array syntax is [0,1].
The analogous list syntax is a bit more verbose as you have to surround both the zero and one
with square brackets [0][1]. To return the first column values in the 2-D object, the array syntax
is [:,0]. Lists don’t support this type of syntax, so we must use a list comprehension to return
columns.

# 2-D list
nums2 = [[1, 2, 3],
         [4, 5, 6]]

# 2-D array
nums2_np = np.array(nums2)

# printing the second item of the first row
print(nums2[0][1])
print(nums2_np[0,1])

# printing the first column values
print([row[0] for row in nums2])
print(nums2_np[:,0])

NumPy arrays also have a special technique called boolean indexing. Suppose we wanted to
gather only positive numbers from the sequence listed here. With an array, we can create a
boolean mask using a simple inequality. Indexing the array is as simple as enclosing this
inequality in square brackets. However, to do this using a list, we need to write a for loop to filter
the list or use a list comprehension. In either case, using a NumPy array for indexing is less
verbose and has a faster runtime.

nums = [-2, -1, 0, 1, 2]


nums_np = np.array(nums)

# Boolean indexing
print(nums_np[nums_np > 0])

# No boolean indexing for lists


# For loop (inefficient option)
pos = []
for num in nums:
    if num > 0:
        pos.append(num)
print(pos)

# List comprehension (better option but not best)


pos = [num for num in nums if num > 0]
print(pos)

2.2. Python Code Timing and Profiling


In the second part of this chapter, you will learn how to gather and compare runtimes between
different coding approaches. You’ll practice using the line_profiler and memory_profiler
packages to profile your code base and spot bottlenecks. Then, you’ll put what you've learned
into practice by replacing these bottlenecks with efficient Python code.

2.2.1. Python Runtime Investigation


As mentioned in the previous section, efficient code means fast code. To be able to
measure how fast our code is, we need to be able to measure the code's runtime. Comparing
runtimes between two code bases, that effectively do the same thing, allows us to pick the code
with the optimal performance. By gathering and analyzing runtimes, we can be sure to
implement the code that is fastest and thus more efficient.

To compare runtimes, we need to be able to compute the runtime for a line or multiple lines of
code. IPython comes with some handy built-in magic commands we can use to time our code.
Magic commands are enhancements that have been added to the normal Python syntax.
These commands are prefixed with the percentage sign. If you aren’t familiar with magic
commands take a moment to review the documentation.

Let's start with this example: we want to inspect the runtime for selecting 1,000 random
numbers between zero and one using NumPy's random.rand() function. Using %timeit just
requires adding the magic command before the line of code we want to analyze. That's it! One
simple command to gather runtimes.

%timeit rand_nums = np.random.rand(1000)

As we can see %timeit provides an average of timing statistics. This is one of the advantages
of using %timeit. We also see that multiple runs and loops were generated. %timeit runs
through the provided code multiple times to estimate the code’s average execution time. This
provides a more accurate representation of the actual runtime rather than relying on just one
iteration to calculate the runtime. The mean and standard deviation displayed in the output is a
summary of the runtime considering each of the multiple runs.

The number of runs represents how many iterations you’d like to use to estimate the runtime.
The number of loops represents how many times you’d like the code to be executed per run. We
can specify the number of runs, using the -r flag, and the number of loops, using the -n flag.
Here, we use -r2, to set the number of runs to two and -n10, to set the number of loops to ten.
In this example, %timeit would execute our random number selection 20 times in order to
estimate runtime (2 runs each with 10 executions).

# Set the number of runs to 2 (-r2)


# Set the number of loops to 10 (-n10)
%timeit -r2 -n10 rand_nums = np.random.rand(1000)

Another cool feature of %timeit is its ability to run on single or multiple lines of code. When
using %timeit in line magic mode, or with a single line of code, one percentage sign is used and
we can run %timeit in cell magic mode (or provide multiple lines of code) by using two
percentage signs.

%%timeit
# Multiple lines of code
nums = []
for x in range(10):
    nums.append(x)

We can also save the output of %timeit into a variable using the -o flag. This allows us to dig
deeper into the output and see things like the time for each run, the best time for all runs, and
the worst time for all runs.

# Saving the output to a variable and exploring them

times = %timeit -o rand_nums = np.random.rand(1000)


print('The timings for all the 7 runs',times.timings)
print('The best timing is',times.best)
print('The worst timing is',times.worst)

2.2.2. Code profiling for runtime

We’ve covered how to time the code using the magic command %timeit, which works well with
bite-sized code. But, what if we wanted to time a large code base or see the line-by-line
runtimes within a function? In this part, we’ll cover a concept called code profiling that allows us
to analyze code more efficiently.

Code profiling is a technique used to describe how long, and how often, various parts of a
program are executed. The beauty of a code profiler is its ability to gather summary statistics on
individual pieces of our code without using magic commands like %timeit. We’ll focus on the
line_profiler package to profile a function’s runtime line-by-line. Since this package isn’t a part
of Python’s Standard Library, we need to install it separately. This can easily be done with a pip
install command as shown in the code below.

!pip install line_profiler

Let's explore using line_profiler with an example. Suppose we have a list of names along with
each person's height (in centimeters) and weight (in kilograms) loaded as NumPy arrays.

names = ['Ahmed', 'Mohammed', 'Youssef']


hts = np.array([188.0, 191.0, 185.0])
wts = np.array([ 95.0, 100.0, 75.0])

We will then develop a function called convert_units that converts each person’s height from
centimeters to inches and weight from kilograms to pounds.

def convert_units(names, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    data = {}
    for i,name in enumerate(names):
        data[name] = (new_hts[i], new_wts[i])
    return data

convert_units(names, hts, wts)

If we wanted to get an estimated runtime of this function, we could use %timeit. But, this will
only give us the total execution time. What if we wanted to see how long each line within the
function took to run? One solution is to use %timeit on each individual line of our convert_units
function. But, that’s a lot of manual work and not very efficient.

%timeit convert_units(names, hts, wts)

Instead, we can profile our function with the line_profiler package. To use this package, we first
need to load it into our session. We can do this using the command %load_ext followed by
line_profiler.

%load_ext line_profiler

Now, we can use the magic command %lprun, from line_profiler, to gather runtimes for
individual lines of code within the convert_units function. %lprun uses a special syntax. First,
we use the -f flag to indicate we’d like to profile a function. Then, we specify the name of the
function we’d like to profile. Note, that the name of the function is passed without any
parentheses. Finally, we provide the exact function call we’d like to profile by including any
arguments that are needed. This is shown in the code below:

%lprun -f convert_units convert_units(names, hts, wts)

The output from %lprun provides a nice table that summarizes the profiling statistics as shown
below. The first column (called Line ) specifies the line number followed by a column displaying
the number of times that line was executed (called the Hits column).

Next, the Time column shows the total amount of time each line took to execute. This column
uses a specific timer unit that can be found in the first line of the output. Here, the timer unit is
listed in 0.1 microseconds using scientific notation. We see that line two took 362 timer units, or,
roughly 36.2 microseconds to run.

The Per Hit column gives the average amount of time spent executing a single line. This is
calculated by dividing the Time column by the Hits column. Notice that line 6 was executed
three times and had a total run time of 15.4 microseconds, 5 microseconds per hit. The % Time
column shows the percentage of time spent on a line relative to the total amount of time spent in
the function. This can be a nice way to see which lines of code are taking up the most time
within a function. Finally, the source code is displayed for each line in the Line Contents column.

It is noteworthy to mention that the Total time reported when using %lprun and the time reported
from using %timeit do not match. Remember, %timeit uses multiple loops in order to calculate
an average and standard deviation of time, so the time reported from each of these magic
commands isn’t expected to match exactly.

2.2.3. Code profiling for memory use


We’ve defined efficient code as code that has a minimal runtime and a small memory footprint.
So far, we’ve only covered how to inspect the runtime of our code. In this section, we’ll cover a
few techniques on how to evaluate our code’s memory usage.

One basic approach for inspecting memory consumption is using Python’s built-in module sys.
This module contains system-specific functions, including the handy .getsizeof() method, which
returns the size of an object in bytes. sys.getsizeof() is a quick way to see the size of an object.

import sys
nums_list = [*range(1000)]
sys.getsizeof(nums_list)

nums_np = np.array(range(1000))
sys.getsizeof(nums_np)

We can see that the memory allocation of a list is almost double that of a NumPy array.
However, this method only gives us the size of an individual object. What if we wanted to
inspect the line-by-line memory usage of our code?

As with runtime profiling, we can use a code profiler. Just like we've used code profiling to gather
detailed stats on runtimes, we can also use code profiling to analyze the memory allocation for
each line of code in our code base. We’ll use the memory_profiler package which is very
similar to the line_profiler package. It can be downloaded via pip and comes with a handy
magic command (%mprun) that uses the same syntax as %lprun.

!pip install memory_profiler

To be able to apply %mprun to a function and calculate its memory allocation, the function
should be loaded from a separate physical file and not defined in the IPython console. So first we
will create a utils_funcs.py file and define the convert_units function in it (as sketched below), and
then we will load this function from the file and apply %mprun to it.
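
One way to create that file from within a notebook session is IPython's %%writefile cell magic; here is a minimal sketch, assuming the same convert_units function defined earlier:

%%writefile utils_funcs.py
def convert_units(names, heights, weights):
    """Convert heights from cm to inches and weights from kg to pounds."""
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    data = {}
    for i, name in enumerate(names):
        data[name] = (new_hts[i], new_wts[i])
    return data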

from utils_funcs import convert_units

%load_ext memory_profiler
%mprun -f convert_units convert_units(names, hts, wts)

%mprun output is similar to %lprun output. In the figure below we can see a line-by-line
description of the memory consumption for the function in question. The first column represents
the line number of the code that has been profiled. The second column (Mem usage) is the
memory used after that line has been executed.
Next, the Increment column shows the difference in memory of the current line with respect to
the previous one. This shows us the impact of the current line on total memory usage. Then the
Occurrences column shows the number of times this line was executed. The last column (Line
Contents) shows the source code that has been profiled.

Profiling a function with %mprun allows us to see what lines are taking up a large amount of
memory and possibly develop a more efficient solution.

It is noteworthy to mention that memory is reported in mebibytes. Although one mebibyte
(2^20 = 1,048,576 bytes) is not exactly the same as one megabyte (10^6 = 1,000,000 bytes),
for our purposes, we can assume they are close enough to mean the same thing.

Another point worth noting is that the memory_profiler package inspects memory consumption by
querying the operating system. This might be slightly different from the amount of memory that
is actually used by the Python interpreter. Thus, results may differ between platforms and even
between runs. Regardless, the important information can still be observed.

In this chapter, we discussed what efficient Python code is, and then we explored
some of the most important Python built-in standard libraries. After that, we discussed how to
measure your code's efficiency. In the next chapter, we will discuss how to optimize your code
based on the efficiency measurements we discussed in this chapter.

3. Optimizing Your Python Code
As a data scientist, you should spend most of your time working on gaining insights from data
not waiting for your code to finish running. Writing efficient Python code can help reduce runtime
and save computational resources, ultimately freeing you up to do the things that have more
impact.

In the previous chapter "Defining & Measuring Code Efficiency" I discussed what efficient Python
code is and how to use Python's different built-in data structures, functions, and modules
to write cleaner, faster, and more efficient code. I also explored how to time and profile code in
order to find bottlenecks.

In this chapter, we will practice eliminating these bottlenecks, and other bad design patterns,
using the Python libraries most used by data scientists: NumPy and pandas.

3.1. Making Your Code Efficient


In this part, we will cover more efficiency tips and tricks. You will learn a few useful built-in
modules for writing efficient code and practice using set theory. You’ll then learn about looping
patterns in Python and how to make them more efficient.

3.1.1. Efficiently Combining, Counting, and Iterating

Combining Objects

In this subsection, we'll cover combining, counting, and iterating over objects efficiently in
Python. Suppose we have two lists: one of names and the other of each person's age. We want
to combine these lists so that each name is stored next to its age. We can iterate over the
names list using enumerate and grab each name's corresponding age using the index
variable.

# combining objects
names = ['Ahmed', 'Youssef', 'Mohammed']
age = [25, 27, 40]
combined = []

for i,name in enumerate(names):
    combined.append((name, age[i]))
print(combined)

But Python’s built-in function zip provides a more elegant solution. The name “zip” describes
how this function combines objects like a zipper on a jacket (making two separate things
become one). zip returns a zip object that must be unpacked into a list and printed to see the
contents. Each item is a tuple of elements from the original lists.

# Combining objects with zip


combined_zip = zip(names, age)
print(type(combined_zip))

combined_zip_list = [*combined_zip]
print(combined_zip_list)

Python also comes with a number of efficient built-in modules. The collections module contains specialized datatypes that can be used as alternatives to standard dictionaries, lists, sets, and tuples. A few notable specialized datatypes are listed below, with a short sketch of a few of them after the list:

● namedtuple: tuple subclasses with named fields


● deque: list-like container with fast appends and pops
● Counter: dict for counting hashable objects
● OrderedDict: dict that retains the order of entries
● defaultdict: dict that calls a factory function to supply missing values
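
The following is a minimal sketch (using made-up values rather than the Pokémon dataset) of how a few of these datatypes behave:

from collections import namedtuple, deque, defaultdict

# namedtuple: a tuple subclass whose fields can be accessed by name
Point = namedtuple('Point', ['x', 'y'])
p = Point(2, 3)
print(p.x, p.y)             # 2 3

# deque: list-like container with fast appends and pops on both ends
queue = deque(['a', 'b'])
queue.appendleft('z')
print(queue.pop())          # 'b'

# defaultdict: supplies a default value for missing keys
letter_counts = defaultdict(int)
for letter in 'banana':
    letter_counts[letter] += 1
print(dict(letter_counts))  # {'b': 1, 'a': 3, 'n': 2}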

Counting Objects

Let's dig deeper into the Counter object. First, we will load the Pokemon dataset and print its first five rows, then we will count the number of Pokemon of each type using a for loop and then using the Counter function.

● Let's load the Pokemon dataset and print the first five rows:

import pandas as pd
pokemon = pd.read_csv('pokemon.csv')
pokemon.head()

● Now we will count the number of Pokemon of each type using a loop and calculate the
execution time:

%%timeit -r1 -n1


# Counting with for loop
poke_types = pokemon['Type 1']
type_counts = {}
for poke_type in poke_types:
if poke_type not in type_counts:
type_counts[poke_type] = 1
else:
type_counts[poke_type] += 1
print(type_counts)

● Finally, we will count the number of Pokemon of each type using the Counter function
and compare the time:

%%timeit -r1 -n1


# counting with collections.Counter()
from collections import Counter
type_counts = Counter(poke_types)
print(type_counts)

We can see that using the Counter function from the collections module is a more efficient
approach. Just import Counter and provide the object to be counted. No need for a loop!
Counter returns a Counter dictionary of key-value pairs. When printed, it’s ordered by highest to
lowest counts. If we compare the runtimes, we'd see that using Counter takes less time compared to the standard dictionary approach!

Itertools
Another built-in module, itertools, contains functional tools for working with iterators. A subset of these tools is listed below, with a brief sketch of a couple of them after the list:

● Infinite iterators: count, cycle, repeat


● Finite iterators: accumulate, chain, zip_longest, etc.
● Combination generators: product, permutations, combinations
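
Before focusing on the combinatoric generators, here is a brief sketch of a couple of the finite iterators (the small example lists are made up purely for illustration):

from itertools import chain, zip_longest

# chain: iterate over several iterables as if they were a single one
print([*chain([1, 2], [3, 4])])                        # [1, 2, 3, 4]

# zip_longest: like zip, but pads the shorter iterable with a fill value
print([*zip_longest('ab', [1, 2, 3], fillvalue='-')])  # [('a', 1), ('b', 2), ('-', 3)]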

We will focus on combinatoric generators. These generators efficiently yield Cartesian products, permutations, and combinations of objects. Let's explore an example.

Combinations
Suppose we want to gather all combination pairs of Pokémon types possible. We can do this
with a nested for loop that iterates over the poke_types list twice. Notice that a conditional
statement is used to skip pairs having the same type twice.

For example, if x is ‘Bug’ and y is ‘Bug’, we want to skip this pair. Since we’re interested in
combinations (where order doesn’t matter), another statement is used to ensure either order of
the pair doesn’t already exist within the combos list before appending it. For example, the pair
(‘Bug’, ‘Fire’) is the same as the pair (‘Fire’, ‘Bug’). We want one of these pairs, not both.

#Combinations with loop

poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']


combos = []
for x in poke_types:
for y in poke_types:
if x == y:
continue
if ((x,y) not in combos) & ((y,x) not in combos):
combos.append((x,y))
print(combos)

The combinations generator from itertools provides a more efficient solution. First, we import
combinations and then create a combinations object by providing the poke_types list and the
length of combinations we desire. Combinations return a combinations object, which we unpack
into a list and print to see the result. If comparing runtimes, we’d see using combinations is
significantly faster than the nested loop.

# combinations with itertools
poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']
from itertools import combinations
combos_obj = combinations(poke_types, 2)
print(type(combos_obj))

combos = [*combos_obj]
print(combos)

3.2. Introduction to Set Theory


Often, we’d like to compare two objects to observe similarities and differences between their
contents. When doing this type of comparison, it’s best to leverage a branch of mathematics
called set theory. As you know, Python comes with a built-in set data type. Sets come with
some handy methods we can use for comparing such as:

● intersection(): all elements that are in both sets


● difference(): all elements in one set but not the other
● symmetric_difference(): all elements in exactly one set
● union(): all elements that are in either set

When we’d like to compare objects multiple times and in different ways, we should consider
storing our data in sets to use these efficient methods. Another nice feature of Python sets is
their ability to quickly check if a value exists within its members. We call this membership testing
using the in operator. We will go through how using the in operator with a set is much faster than
using it with a list or tuple.

Suppose we had two lists of Pokémon names: list_a and list_b and we would like to compare
these lists to see which Pokémon appear in both lists. We could first use a nested for loop to
compare each item in list_a to each item in list_b and collect only those items that appear in
both lists. But, iterating over each item in both lists is extremely inefficient.

# Comparing objects with loops


list_a = ['Bulbasaur', 'Charmander', 'Squirtle']
list_b = ['Caterpie', 'Pidgey', 'Squirtle']

in_common = []
for pokemon_a in list_a:
for pokemon_b in list_b:
if pokemon_a == pokemon_b:

in_common.append(pokemon_a)
print(in_common)

However, a better way is to use Python’s set data type to compare these lists. By converting
each list into a set, we can use the dot-intersection method to collect the Pokémon shared
between the two sets. One simple line of code and no need for a loop!

# comparing objects using intersection()

list_a = ['Bulbasaur', 'Charmander', 'Squirtle']


list_b = ['Caterpie', 'Pidgey', 'Squirtle']
set_a = set(list_a)
set_b = set(list_b)
set_a.intersection(set_b)

We got the same answer with far fewer lines of code. We can also compare the runtimes to see how much faster using sets is than using loops.

%%timeit
in_common = []
for pokemon_a in list_a:
for pokemon_b in list_b:
if pokemon_a == pokemon_b:
in_common.append(pokemon_a)

%timeit in_common = set_a.intersection(set_b)

We can see that using sets is much faster than using for loops. We can also use a set method
to see Pokémon that exist in one set but not in another. To gather Pokémon that exist in set_a
but not in set_b, use set_a.difference(set_b).

set_a = {'Bulbasaur', 'Charmander', 'Squirtle'}


set_b = {'Caterpie', 'Pidgey', 'Squirtle'}
set_a.difference(set_b)

If we want the Pokémon in set_b, but not in set_a, we use set_b.difference(set_a).

set_a = {'Bulbasaur', 'Charmander', 'Squirtle'}


set_b = {'Caterpie', 'Pidgey', 'Squirtle'}
set_b.difference(set_a)

To collect Pokémon that exist in exactly one of the sets (but not both), we can use a method
called the symmetric difference.

set_a = {'Bulbasaur', 'Charmander', 'Squirtle'}


set_b = {'Caterpie', 'Pidgey', 'Squirtle'}
set_a.symmetric_difference(set_b)

Finally, we can combine these sets using the .union method. This collects all of the unique
Pokémon that appear in either or both sets.

set_a = {'Bulbasaur', 'Charmander', 'Squirtle'}


set_b = {'Caterpie', 'Pidgey', 'Squirtle'}
set_a.union(set_b)

Another nice efficiency gain when using sets is the ability to quickly check if a specific item is a
member of a set’s elements. Consider our collection of 720 Pokémon names stored as a list,
tuple, and set.

names_list = list(pokemon['Name'])
names_set = set(pokemon['Name'])
names_tuple = tuple(pokemon['Name'])

We want to check whether or not the character, Zubat, is in each of these data structures and
print the execution time for each data type:

%timeit 'Zubat' in names_list


%timeit 'Zubat' in names_tuple

%timeit 'Zubat' in names_set

When comparing runtimes, it’s clear that membership testing with a set is significantly faster
than a list or a tuple.

One final efficiency gain when using sets comes from the definition of the set itself. A set is
defined as a collection of distinct elements. Thus, we can use a set to collect unique items from
an existing object. Let’s define a primary_types list, which contains the primary types of each
Pokémon. If we wanted to collect the unique Pokémon types within this list, we could write a for
loop to iterate over the list and only append the Pokémon types that haven’t already been added
to the unique_types list.

primary_types = list(pokemon['Type 1'])


unique_types = []
for prim_type in primary_types:
if prim_type not in unique_types:
unique_types.append(prim_type)
print(unique_types)

Using a set makes this much easier. All we have to do is convert the primary_types list into a
set, and we have our solution: a set of distinct Pokémon types.

unique_types_set = set(primary_types)
print(unique_types_set)

3.3. Eliminating Loops


Although using loops when writing Python code isn’t necessarily a bad design pattern, using
extraneous loops can be inefficient and costly. Let’s explore some tools that can help us
eliminate the need to use loops in our code. Python comes with a few looping patterns that can
be used when we want to iterate over an object’s contents:

● For loops iterate over elements of a sequence piece-by-piece.

● While loops execute a loop repeatedly as long as some Boolean condition is met.
● Nested loops use multiple loops inside one another.

Although all of these looping patterns are supported by Python, we should be careful when
using them. Because most loops are evaluated in a piece-by-piece manner, they are often
inefficient solutions.

We should try to avoid looping as much as possible when writing efficient code. Eliminating
loops usually results in fewer lines of code that are easier to interpret. One of the idioms of
pythonic code is that “flat is better than nested.” Striving to eliminate loops in our code will help
us follow this idiom.

Suppose we have a list of lists, called poke_stats, that contains statistical values for each
Pokémon. Each row corresponds to a Pokémon, and each column corresponds to a Pokémon’s
specific statistical value. Here, the columns represent a Pokémon’s Health Points, Attack,
Defense, and Speed. We want to do a simple sum of each of these rows in order to collect the
total stats for each Pokémon. If we were to use a loop to calculate row sums, we would have to
iterate over each row and append the row’s sum to the totals list. We can accomplish the same
task, in fewer lines of code, with a list comprehension. Or, we could use the built-in map function
that we discussed in the previous chapter.

# List of HP, Attack, Defense, Speed


poke_stats = [
[90, 92, 75, 60],
[25, 20, 15, 90],
[65, 130, 60, 75],
]
%%timeit
totals = []
for row in poke_stats:
totals.append(sum(row))

%timeit totals_comp = [sum(row) for row in poke_stats]

%timeit totals_map = [*map(sum, poke_stats)]

Each of these approaches will return the same list, but using a list comprehension or the map
function takes one line of code, and has a faster runtime.

We've also covered a few built-in modules that can help us eliminate loops earlier in this chapter. Instead of using the nested for loop, we can use combinations from the itertools module
for a cleaner, more efficient solution.

poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']

# Nested for loop approach


combos = []
for x in poke_types:
for y in poke_types:
if x == y:
continue
if ((x,y) not in combos) & ((y,x) not in combos):
combos.append((x,y))
# Built-in module approach
from itertools import combinations
combos2 = [*combinations(poke_types, 2)]

Another powerful technique for eliminating loops is to use the NumPy package. Suppose we
had the same collection of statistics we used in a previous example but stored in a NumPy array
instead of a list of lists.

We’d like to collect the average stat value for each Pokémon (or row) in our array. We could use
a loop to iterate over the array and collect the row averages.

import numpy as np
poke_stats = np.array(poke_stats)  # store the stats in a NumPy array

%%timeit
avgs = []
for row in poke_stats:
    avg = np.mean(row)
    avgs.append(avg)

Eliminate Loops with NumPy

But, NumPy arrays allow us to perform calculations on entire arrays all at once. Here, we use
the dot-mean method and specify an axis equal to 1 to calculate the mean for each row

(meaning we calculate an average across the column values). This eliminates the need for a
loop and is much more efficient.

%timeit avgs = poke_stats.mean(axis=1)

When comparing runtimes, we see that using the dot-mean method on the entire array and
specifying an axis is significantly faster than using a loop.

3.4. Writing Better Loops


We’ve discussed how loops can be costly and inefficient. But, sometimes you can’t eliminate a
loop. In this section, we’ll explore how to make loops more efficient when looping is
unavoidable. Before diving in, some of the loops we’ll discuss can be eliminated using
techniques covered in previous lessons. For demonstrative purposes, we’ll assume the use
cases shown here are instances where a loop is unavoidable.

The best way to make a loop more efficient is to analyze what’s being done within the loop. We
want to make sure that we aren’t doing unnecessary work in each iteration. If a calculation is
performed for each iteration of a loop, but its value doesn’t change with each iteration, it’s best
to move this calculation outside (or above) the loop. If a loop is converting data types with each
iteration, it’s possible that this conversion can be done outside (or below) the loop using a map
function. Anything that can be done once should be moved outside of a loop. Let’s explore a few
examples.

Moving calculations above a loop

We have a list of Pokémon names and an array of each Pokémon’s corresponding attack value.
We’d like to print the names of each Pokémon with an attack value greater than the average of
all attack values. To do this, we’ll use a loop that iterates over each Pokémon and its attack
value. For each iteration, the total attack average is calculated by finding the mean value of all
attacks. Then, each Pokémon’s attack value is evaluated to see if it exceeds the total average.

import numpy as np

names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']


attacks = np.array([130, 70, 50, 50, 45])

for pokemon,attack in zip(names, attacks):


total_attack_avg = attacks.mean()

if attack > total_attack_avg:
print(
"{}'s attack: {} > average: {}!"
.format(pokemon, attack, total_attack_avg)
)

The inefficiency in this loop is the total_attack_avg variable being created with each iteration of
the loop. But, this calculation doesn’t change between iterations since it is an overall average.
We only need to calculate this value once. By moving this calculation outside (or above) the
loop, we calculate the total attack average only once. We get the same output, but this is a more
efficient approach.

import numpy as np

names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']


attacks = np.array([130, 70, 50, 50, 45])

# Calculate total average once (outside the loop)


total_attack_avg = attacks.mean()

for pokemon,attack in zip(names, attacks):


if attack > total_attack_avg:
print(
"{}'s attack: {} > average: {}!"
.format(pokemon, attack, total_attack_avg)
)

Let's compare the runtimes of both methods:

%%timeit

for pokemon,attack in zip(names, attacks):


total_attack_avg = attacks.mean()
if attack > total_attack_avg:
print(
"{}'s attack: {} > average: {}!"
.format(pokemon, attack, total_attack_avg)
)

%%timeit

# Calculate total average once (outside the loop)


total_attack_avg = attacks.mean()
for pokemon,attack in zip(names, attacks):
if attack > total_attack_avg:
print(
"{}'s attack: {} > average: {}!"
.format(pokemon, attack, total_attack_avg)
)

We see that keeping the total_attack_avg calculation within the loop takes more than 120 microseconds, noticeably longer than when the average is computed once outside the loop.

Holistic Conversions

Another way to make loops more efficient is to use holistic conversions outside (or below) the
loop. In the example below we have three lists from the 720 Pokémon dataset: a list of each
Pokémon’s name, a list corresponding to whether or not a Pokémon has a legendary status,
and a list of each Pokémon’s generation.
We want to combine these objects so that each name, status, and generation is stored in an
individual list. To do this, we’ll use a loop that iterates over the output of the zip function.
Remember, zip returns a collection of tuples, so we need to convert each tuple into a list since
we want to create a list of lists as our output. Then, we append each individual poke_list to our
poke_data output variable. By printing the result, we see our desired list of lists.

%%timeit
poke_data = []
for poke_tuple in zip(names_list, legend_status_list, generations_list):
poke_list = list(poke_tuple)
poke_data.append(poke_list)

However, converting each tuple to a list within the loop is not very efficient. Instead, we should
collect all of our poke_tuples together, and use the map function to convert each tuple to a list.
The loop no longer converts tuples to lists with each iteration. Instead, we moved this tuple to
list conversion outside (or below) the loop. That way, we convert data types all at once (or
holistically) rather than converting in each iteration.

%%timeit
poke_data_tuples = []
for poke_tuple in zip(names_list, legend_status_list, generations_list):
poke_data_tuples.append(poke_tuple)
poke_data = [*map(list, poke_data_tuples)]

Runtimes show that converting each tuple to a list outside of the loop is more efficient.

4. 5 Tips to Write Efficient Python Functions
Writing efficient and maintainable Python functions is crucial for producing high-quality code.
This chapter presents 5 essential tips to help refine your Python functions and improve code
readability, maintainability, and robustness.

First, adhering to the single responsibility principle by ensuring each function performs only one
task is emphasized. Next, the benefits of type hints for enhancing code clarity and long-term
maintainability are discussed.

The chapter then explores the use of keyword-only arguments, a Python feature that can
minimize errors by enforcing the explicit use of argument names. Another recommendation is to
utilize only the arguments necessary for a function, reducing complexity and potential bugs.

Finally, the chapter advocates for the use of generators, a memory-efficient technique for
returning iterable data, instead of constructing and returning entire lists.

By implementing these 5 tips, Python developers can write more efficient, readable, and
maintainable functions, ultimately leading to higher-quality code and improved developer
productivity.

4. 1. Your Function Should Only Do One Thing


The principle “Your Function Should Only Do One Thing” is a core tenet of clean code and
efficient programming. This principle, also known as the Single Responsibility Principle (SRP),
suggests that a function should have only one responsibility or task. This makes your code
easier to read, test, debug, and maintain. Here are some of the advantages of applying this
concept:

● Readability: When a function does only one thing, it’s easier to understand at a glance.
The function name can clearly describe its purpose, and the implementation is
straightforward.
● Reusability: Single-purpose functions can be reused in different parts of the program or
in other projects.
● Testability: It’s easier to write tests for a function that does one thing, and such tests are
more likely to be reliable.
● Maintainability: If a function is responsible for one task, changes in requirements
affecting that task will be localized, reducing the risk of bugs elsewhere in the code.

Let’s say you’re working on a Python program to process a list of numbers in which it will:

1. Filter out negative numbers.


2. Square the remaining numbers.
3. Calculate the sum of the squared numbers.

def filter_negative_numbers(numbers):
"""Filter out negative numbers from the list."""
return [num for num in numbers if num >= 0]

def square_numbers(numbers):
"""Return a list of squared numbers."""
return [num ** 2 for num in numbers]

def sum_numbers(numbers):
"""Return the sum of the numbers."""
return sum(numbers)

def process_numbers(numbers):
"""Process the list of numbers: filter, square, and sum."""
positive_numbers = filter_negative_numbers(numbers)
squared_numbers = square_numbers(positive_numbers)
total = sum_numbers(squared_numbers)
return total

# Example usage
numbers = [-2, -1, 0, 1, 2, 3]
result = process_numbers(numbers)
print(result) # Output: 14

4.2. Add Type Hints for Better Readability & Maintainability


Type hints are a feature in Python that allows you to specify the expected types of variables,
function parameters, and return values. Introduced in PEP 484, type hints do not affect the
program's runtime behavior but provide a way to statically type-check your code. They improve
code readability and help developers understand the expected input and output types, making
the code easier to maintain. Adding type hints improves the following:

1. Improved Readability: By clearly indicating what types of arguments a function expects


and what it returns, type hints make the code more readable and easier to understand at
a glance.
2. Error Detection: Tools like mypy can be used to statically check for type errors, catching
potential bugs early in the development process.
3. Documentation: Type hints serve as a form of documentation, providing valuable
information about how functions and methods should be used.

Here’s a simple example of a set of functions without type hints:

def filter_negative_numbers(numbers):
"""Filter out negative numbers from the list."""
return [num for num in numbers if num >= 0]

def square_numbers(numbers):
"""Return a list of squared numbers."""
return [num ** 2 for num in numbers]

def sum_numbers(numbers):
"""Return the sum of the numbers."""
return sum(numbers)

def process_numbers(numbers):
"""Process the list of numbers: filter, square, and sum."""
positive_numbers = filter_negative_numbers(numbers)
squared_numbers = square_numbers(positive_numbers)
total = sum_numbers(squared_numbers)
return total

# Example usage
numbers = [-2, -1, 0, 1, 2, 3]
result = process_numbers(numbers)
print(result) # Output: 14

Now, let’s add type hints to the same functions:

from typing import List

def filter_negative_numbers(numbers: List[int]) -> List[int]:


"""Filter out negative numbers from the list."""
return [num for num in numbers if num >= 0]

def square_numbers(numbers: List[int]) -> List[int]:


"""Return a list of squared numbers."""
return [num ** 2 for num in numbers]

def sum_numbers(numbers: List[int]) -> int:


"""Return the sum of the numbers."""
return sum(numbers)

def process_numbers(numbers: List[int]) -> int:

"""Process the list of numbers: filter, square, and sum."""
positive_numbers = filter_negative_numbers(numbers)
squared_numbers = square_numbers(positive_numbers)
total = sum_numbers(squared_numbers)
return total

# Example usage
numbers = [-2, -1, 0, 1, 2, 3]
result = process_numbers(numbers)
print(result) # Output: 14

Comparing both code snippets we can see that:


1. Readability
● Without Type Hints: When reading the function definitions, it’s not immediately
clear what types of arguments are expected or what type the function returns.
● With Type Hints: The function signatures clearly specify that they work with lists
of integers and return either a list of integers or single integers.

2. Error Detection
● Without Type Hints: Type-related errors may only be caught at runtime,
potentially causing bugs that are hard to trace.
● With Type Hints: Tools like mypy can check the types at compile time, catching
errors before the code is executed.

4.3. Enforce Keyword-Only Arguments to Minimize Errors


Enforcing keyword-only arguments is a technique in Python where certain function parameters
must be specified by name when the function is called.

This is done using a special syntax in the function definition that prevents these parameters
from being passed positionally. This approach can significantly improve code clarity and reduce
errors.

Enforcing keyword-only arguments in Python functions can significantly enhance the clarity and
correctness of your code. Keyword-only arguments are parameters that can only be specified
using their parameter names. This enforcement helps to:
● Prevent Errors: By requiring arguments to be passed by name, it reduces the risk of
passing arguments in the wrong order, which can lead to subtle bugs.
● Improve Readability: It makes function calls more readable by clearly indicating what
each argument represents.
● Enhance Flexibility: It allows you to add more parameters to functions in the future
without breaking existing code, as the arguments are explicitly named.
● Increase Clarity: It makes the intention of the code clearer, as the purpose of each

argument is specified at the call site.

Here’s an example. The send_email function takes in an optional cc string:

# example function for sending an email


def send_email(to: str, subject: str, body: str, cc: str = None):
print(f"Sending email to {to}...")
print(f"Subject: {subject}")
print(f"Body: {body}")
if cc:
print(f"CC: {cc}")

# Example usage
send_email("recipient@example.com", "Meeting Update", "The meeting is
rescheduled to 3 PM.", "cc@example.com")

Say you want to make the optional cc argument a keyword-only argument. Here’s how you can
do it:
# enforce keyword-only arguments to minimize errors
# make the optional `cc` argument keyword-only
def send_email(to: str, subject: str, body: str, *, cc: str = None):
print(f"Sending email to {to}...")
print(f"Subject: {subject}")
print(f"Body: {body}")
if cc:
print(f"CC: {cc}")

# Example usage
send_email("recipient@example.com", "Meeting Update", "The meeting is
rescheduled to 3 PM.", cc="cc@example.com")

Let’s take a sample function call:

send_email("recipient@example.com", "Meeting Update", "The meeting is


rescheduled to 3 PM.", cc="cc@example.com")

Sending email to recipient@example.com…


Subject: Meeting Update
Body: The meeting is rescheduled to 3 PM.
CC: cc@example.com

Now try passing in all arguments as positional:

# throws error as we try to pass in more positional args than allowed!
send_email("recipient@example.com", "Meeting Update", "The meeting is
rescheduled to 3 PM.", "cc@example.com")
You’ll get an error as shown:

Traceback (most recent call last):


File “example.py”, line 12, in <module>
send_email(“recipient@example.com”, “Meeting Update”, “The meeting is rescheduled
to 3 PM.”, “cc@example.com”)
TypeError: send_email() takes 3 positional arguments but 4 were given

4.4. Use only Arguments You Need


When defining a function, it’s crucial to limit the parameters to only those that are necessary for
the function’s operation. Including unnecessary parameters can lead to confusion, make the
function calls cumbersome, and complicate maintenance. Here are a few reasons why you
should use only the arguments you need:

● Improved Readability: Fewer arguments make function signatures simpler and easier
to understand.
● Enhanced Maintainability: Functions with fewer parameters are easier to refactor and
test.
● Reduced Errors: The likelihood of passing incorrect or redundant data is minimized
when functions only take essential arguments.

Here’s an example of a function that includes unnecessary arguments:

# example function for processing an order with unnecessary arguments


def process_order(order_id: int, customer_id: int, customer_name: str,
amount: float, discount: float = 0.0):
print(f"Processing order {order_id} for customer {customer_id} -
{customer_name}")
total_amount = amount - discount
print(f"Total amount after discount: {total_amount}")

# Example usage
process_order(1234, 5678, "John Doe", 100.0, 10.0)

In this example, the function process_order takes both customer_id and customer_name,
which might not be necessary for processing an order if all required information can be derived
from the order_id.

Now, let’s refactor the function to use only the essential arguments:

# example function for processing an order with only necessary arguments


def process_order(order_id: int, amount: float, discount: float = 0.0):
print(f"Processing order {order_id}")
total_amount = amount - discount
print(f"Total amount after discount: {total_amount}")

# Example usage
process_order(1234, 100.0, 10.0)

4.5. Use Generators to Return Lists


Generators are a type of iterable in Python that allow you to iterate over a sequence of values
without storing the entire sequence in memory at once. They are defined using functions and
the yield keyword, which allows the function to return a value and pause its state, resuming
when the next value is requested. This makes generators an efficient way to handle large data
sets or streams of data.

Here are a few reasons why you should use Generators :

● Memory Efficiency: Generators produce items one at a time and only when required,
which means they don’t need to store the entire sequence in memory. This is especially
useful for large datasets.
● Performance: Since generators yield items on the fly, they can provide a performance
boost by avoiding the overhead of creating and storing large data structures.
● Lazy Evaluation: Generators compute values as needed, which can lead to more
efficient and responsive programs.
● Simpler Code: Generators can simplify the code needed to create iterators, making it
easier to read and maintain.

Here’s an example of a function that returns a list without using generators:

# example function that returns a list without using a generator


def get_squares(n):
"""Return a list of squares from 0 to n-1."""
squares = []
for i in range(n):
squares.append(i * i)
return squares

# Example usage
squares_list = get_squares(10)
print(squares_list)

In this example, the function get_squares generates and stores all square numbers up to n-1 in
a list before returning it. Now, let’s modify the function to use a generator:

# example function that returns a generator


def get_squares_gen(n):
"""Yield squares from 0 to n-1."""
for i in range(n):
yield i * i

# Example usage
squares_gen = get_squares_gen(10)
print(list(squares_gen))

Using generators to return lists in Python provides significant advantages in terms of memory
efficiency, performance, and lazy evaluation. By generating values on the fly, you can handle
large datasets more effectively and write cleaner, more maintainable code.

The comparison between the two approaches clearly shows the benefits of using generators,
particularly for large or computationally intensive sequences.

5. How to Use Caching to Speed Up Your Python Code & LLM Application?
A seamless user experience is crucial for the success of any user-facing application.
Developers often aim to minimize application latencies to enhance this experience, with data
access delays typically being the main culprit.

By caching data, developers can drastically reduce these delays, resulting in faster load times
and happier users. This principle applies to web scraping as well, where large-scale projects
can see significant speed improvements.

But what exactly is caching, and how can it be implemented? This chapter will explore caching,
its purpose and benefits, and how to leverage it to speed up your Python code and also speed
up your LLM calls at a lower cost.

5.1. What is a cache in programming?


Caching is a mechanism for improving the performance of any application. In a technical sense,
caching is storing the data in a cache and retrieving it later.

A cache is a fast storage space (usually temporary) where frequently accessed data is kept to
speed up the system’s performance and decrease the access times. For example, a computer’s
cache is a small but fast memory chip (usually an SRAM) between the CPU and the main
memory chip (usually a DRAM).

The CPU first checks the cache when it needs to access the data. If it’s in the cache, a cache hit
occurs, and the data is thereby read from the cache instead of a relatively slower main memory.
It results in reduced access times and enhanced performance.

5.2. Why is Caching Helpful?


Caching can improve the performance of applications and systems in several ways. Here are
the primary reasons to use caching:

● Reduced access time: The primary goal of caching is to accelerate access to frequently
used data. By storing this data in a temporary, easily accessible storage area, caching
dramatically decreases access time. This leads to a notable improvement in the overall
performance of applications and systems.
● Reduced system load: Caching also alleviates system load by minimizing the number
of requests sent to external data sources, such as databases. By storing frequently
accessed data in cache storage, applications can retrieve this data directly from the
cache instead of repeatedly querying the data source. This reduces the load on the
external data source and enhances system performance.
● Improved user experience: Caching ensures rapid data retrieval, enabling more
seamless interactions with applications and systems. This is especially crucial for
real-time systems and web applications, where users expect instant responses.
Consequently, caching plays a vital role in enhancing the overall user experience.

5.3. Common Uses for Caching
Caching is a general concept and has several prominent use cases. You can apply it in any
scenario where data access has some patterns and you can predict what data will be
demanded next. You can prefetch the demanded data in the cache store and improve
application performance.

● Web Content: Frequently accessed web pages, images, and other static content are
often cached to reduce load times and server requests.
● Database Queries: Caching the results of common database queries can drastically
reduce the load on the database and speed up application responses.
● API Responses: External API call responses are cached to avoid repeated network
requests and to provide faster data access.
● Session Data: User session data is cached to quickly retrieve user-specific information
without querying the database each time.
● Machine Learning Models: Intermediate results and frequently used datasets are
cached to speed up machine learning workflows and inference times.
● Configuration Settings: Application configuration data is cached to avoid repeated
reading from slower storage systems.

5.4. Common Caching Strategies


Different caching strategies can be devised based on specific spatial or temporal data access patterns; a small sketch combining a couple of them follows the list below.

● Cache-Aside (Lazy Loading): Data is loaded into the cache only when it is requested.
If the data is not found in the cache (a cache miss), it is fetched from the source, stored
in the cache, and then returned to the requester.
● Write-Through: Every time data is written to the database, it is simultaneously written to
the cache. This ensures that the cache always has the most up-to-date data but may
introduce additional write latency.
● Write-Back (Write-Behind): Data is written to the cache and acknowledged to the
requester immediately, with the cache asynchronously writing the data to the database.
This improves write performance but risks data loss if the cache fails before the write to
the database completes.
● Read-Through: The application interacts only with the cache, and the cache is
responsible for loading data from the source if it is not already cached.
● Time-to-Live (TTL): Cached data is assigned an expiration time, after which it is
invalidated and removed from the cache. This helps to ensure that stale data is not used
indefinitely.
● Cache Eviction Policies: Strategies to determine which data to remove from the cache
when it reaches its storage limit. Common policies include:

● Last-In, First-Out (LIFO): The most recently added data is the first to be removed when
the cache needs to free up space. This strategy assumes that the oldest data will most
likely be required again soon.
● Least Recently Used (LRU): The least recently accessed data is the first to be
removed. This strategy works well when the most recently accessed data is more likely
to be reaccessed.
● Most Recently Used (MRU): The most recently accessed data is the first to be
removed. This can be useful in scenarios where the most recent data is likely to be used
only once and not needed again.
● Least Frequently Used (LFU): The data that is accessed the least number of times is
the first to be removed. This strategy helps in keeping the most frequently accessed data
in the cache longer.
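
To make a couple of these strategies concrete, here is a minimal sketch of a cache-aside lookup combined with a time-to-live expiry. The fetch_from_source function and the cache dictionary are hypothetical placeholders, not part of any library:

import time

CACHE_TTL_SECONDS = 60
cache = {}  # key -> (value, timestamp of when it was stored)

def fetch_from_source(key):
    """Placeholder for a slow lookup, e.g. a database query or an API call."""
    return f"value-for-{key}"

def get_with_cache(key):
    entry = cache.get(key)
    if entry is not None:
        value, stored_at = entry
        # Serve from the cache only if the entry has not expired yet
        if time.time() - stored_at < CACHE_TTL_SECONDS:
            return value
    # Cache miss (or expired entry): fetch from the source, store, then return
    value = fetch_from_source(key)
    cache[key] = (value, time.time())
    return value

print(get_with_cache("user:42"))  # first call: fetched from the source
print(get_with_cache("user:42"))  # second call: served from the cache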

5.5. Python caching using a manual decorator


A decorator in Python is a function that accepts another function as an argument and outputs a
new function. We can alter the behavior of the original function using a decorator without
changing its source code.

One common use case for decorators is to implement caching. This involves creating a
dictionary to store the function’s results and then saving them in the cache for future use.
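
Here is a minimal sketch of such a manual caching decorator; the memoize name and the slow_square example are illustrative and not part of any library:

from functools import wraps

def memoize(func):
    """Cache results of a function in a plain dictionary keyed by its arguments."""
    results = {}

    @wraps(func)
    def wrapper(*args):
        # Only hashable positional arguments are supported in this sketch
        if args not in results:
            results[args] = func(*args)
        return results[args]

    return wrapper

@memoize
def slow_square(n):
    """Stand-in for an expensive computation."""
    return n * n

print(slow_square(4))  # computed and stored in the cache
print(slow_square(4))  # returned directly from the cache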

Let’s code a function that computes the n-th Fibonacci number. Here’s the recursive
implementation of the Fibonacci sequence:

def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)

Without caching, the recursive calls result in redundant computations. If the values are cached,
it’d be much more efficient to look up the cached values. For this, you can use the @cache
decorator.

The @cache decorator from the functools module in Python 3.9+ is used to cache the results of
a function. It works by storing the results of expensive function calls and reusing them when the
function is called with the same arguments. Now let’s wrap the function with the @cache
decorator:
from functools import cache
@cache
def fibonacci(n):
if n <= 1:

return n
return fibonacci(n-1) + fibonacci(n-2)

5.6. Python caching using LRU cache decorator


Another method to implement caching in Python is to use the built-in @lru_cache decorator
from functools. This decorator implements cache using the least recently used (LRU) caching
strategy. This LRU cache is a fixed-size cache, which means it’ll discard the data from the cache
that hasn’t been used recently.

In LRU caching, when the cache is full and a new item needs to be added, the least recently used item in the cache is removed to make room for the new item. This ensures that the most recently used items are retained in the cache, while items that have not been accessed for a while are discarded.

The @lru_cache decorator is similar to @cache but allows you to specify the maximum
size—as the maxsize argument—of the cache. Once the cache reaches this size, the least
recently used items are discarded. This is useful if you want to limit memory usage.

Here, the fibonacci function caches up to 7 most recently computed values:

from functools import lru_cache


@lru_cache(maxsize=7) # Cache up to 7 most recent results
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
fibonacci(5) # Computes Fibonacci(5) and caches intermediate results
fibonacci(3) # Retrieves Fibonacci(3) from the cache

Here, the fibonacci function is decorated with @lru_cache(maxsize=7), specifying that it should cache up to the 7 most recent results. When fibonacci(5) is called, the results of fibonacci(5) and all of its recursive subcalls (fibonacci(4) down to fibonacci(0)) are cached.

When fibonacci(3) is called subsequently, fibonacci(3) is retrieved from the cache since it was
one of the seven most recently computed values, avoiding redundant computation.
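
As a side note, functions wrapped with @lru_cache (and with @cache, which behaves like lru_cache with an unbounded size) expose a cache_info() method reporting hits, misses, and the current cache size, which is a quick way to confirm the cache is actually being used:

print(fibonacci.cache_info())
# Prints something like: CacheInfo(hits=..., misses=..., maxsize=7, currsize=...)
fibonacci.cache_clear()  # empties the cache if you want to start over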

5.7. Function Calls Timing Comparison
Now let’s compare the execution times of the functions with and without caching. For this
example, we don’t set an explicit value for maxsize. So maxsize will be set to the default value
of 128:

from functools import cache, lru_cache


import timeit

# without caching
def fibonacci_no_cache(n):
if n <= 1:
return n
return fibonacci_no_cache(n-1) + fibonacci_no_cache(n-2)
# with cache
@cache
def fibonacci_cache(n):
if n <= 1:
return n
return fibonacci_cache(n-1) + fibonacci_cache(n-2)
# with LRU cache
@lru_cache
def fibonacci_lru_cache(n):
if n <= 1:
return n
return fibonacci_lru_cache(n-1) + fibonacci_lru_cache(n-2)

To compare the execution times, we’ll use the timeit function from the timeit module:

# Compute the n-th Fibonacci number


n = 35

no_cache_time = timeit.timeit(lambda: fibonacci_no_cache(n), number=1)


cache_time = timeit.timeit(lambda: fibonacci_cache(n), number=1)
lru_cache_time = timeit.timeit(lambda: fibonacci_lru_cache(n), number=1)
print(f"Time without cache: {no_cache_time:.6f} seconds")
print(f"Time with cache: {cache_time:.6f} seconds")
print(f"Time with LRU cache: {lru_cache_time:.6f} seconds")

Time without cache: 7.562178 seconds


Time with cache: 0.000062 seconds

Time with LRU cache: 0.000020 seconds

We see a significant difference in the execution times. The function call without caching takes
much longer to execute, especially for larger values of n. While the cached versions (both
@cache and @lru_cache) execute much faster and have comparable execution times.

5.8. Use Caching to Speed up Your LLM Application


We can use the same concept to speed up your LLM application. As your LLM application
grows in popularity and encounters higher traffic levels, the expenses related to LLM API calls
can become substantial. Additionally, LLM services might exhibit slow response times,
especially when dealing with many requests.

To tackle this challenge, you can use the GPTCache package dedicated to building a semantic
cache for storing LLM responses. GPTCache first performs embedding operations on the input
to obtain a vector and then conducts a vector approximation search in the cache storage. After
receiving the search results, it performs a similarity evaluation and returns when the set
threshold is reached.

To use it, we will start by initializing the cache to run GPTCache and importing openai from gptcache.adapter, which will automatically set the map data manager to match the exact cache. Note that you will need version 0.28.1 of the openai package.

After that, if you ask ChatGPT the exact same two questions, the answer to the second question
will be obtained from the cache without requesting ChatGPT again.

import time

def response_text(openai_resp):
return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")

# To use GPTCache, that's all you need


# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
start_time = time.time()
response = openai.ChatCompletion.create(
model='gpt-3.5-turbo',
messages=[
{
'role': 'user',
'content': question
}
],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')

Cache loading…..
Question: what’s github
Time consuming: 6.04s
Answer: GitHub is an online platform where developers can share and collaborate on
software development projects. It is used as a hub for code repositories and includes
features such as issue tracking, code review, and project management tools. GitHub can
be used for open source projects, as well as for private projects within organizations.
GitHub has become an essential tool within the software development industry and has
over 40 million registered users as of 2021.

Question: what’s github


Time consuming: 0.00s
Answer: GitHub is an online platform where developers can share and collaborate on
software development projects. It is used as a hub for code repositories and includes
features such as issue tracking, code review, and project management tools. GitHub can
be used for open source projects, as well as for private projects within organizations.
GitHub has become an essential tool within the software development industry and has
over 40 million registered users as of 2021.

We can see that the first question took 6.04 seconds, which is a typical response time for an LLM. But if we ask the exact same question again, the answer is served from the cache almost instantly, which also decreases the cost.

6. Best Practices To Use Pandas Efficiently As A
Data Scientist
As a data scientist, it is important to use the right tools and techniques to get the most out of the
data. The Pandas library is a great tool for data manipulation, analysis, and visualization, and it
is an essential part of any data scientist’s toolkit.
However, it can be challenging to use Pandas efficiently, and this can lead to wasted time and
effort. Fortunately, there are a few best practices that can help data scientists get the most out
of their Pandas experience.
From using vectorized operations to taking advantage of built-in functions, these best practices
will help data scientists quickly and accurately analyze and visualize data using Pandas.
Understanding and applying these best practices will help data scientists increase their
productivity and accuracy, allowing them to make better decisions faster.

Throughout this chapter, we will use three datasets:

● Poker card game dataset


● Popular baby names
● Restaurant dataset

The first dataset is the Poker card game dataset which is shown below.

poker_data = pd.read_csv('poker_hand.csv')
poker_data.head()

In each poker round, each player has five cards in hand, each one characterized by its symbol,

which can be either hearts, diamonds, clubs, or spades, and its rank, which ranges from 1 to 13.
The dataset consists of every possible combination of five cards one person can possess.

● Sn: symbol of the n-th card where: 1 (Hearts), 2 (Diamonds), 3 (Clubs), 4 (Spades)
● Rn: rank of the n-th card where: 1 (Ace), 2–10, 11 (Jack), 12 (Queen), 13 (King)

The second dataset we will work with is the Popular baby names dataset which includes the
most popular names that were given to newborns between 2011 and 2016. The dataset is
loaded and shown below:

names = pd.read_csv('Popular_Baby_Names.csv')
names.head()

The dataset includes among other information, the most popular names in the US by year,
gender, and ethnicity. For example, the name Chloe was ranked second in popularity among all
female newborns of Asian and Pacific Islander ethnicity in 2011. The third dataset we will use is
the Restaurant dataset. This dataset is a collection of people having dinner at a restaurant. The
dataset is loaded and shown below:

restaurant = pd.read_csv('restaurant_data.csv')
restaurant.head()

For each customer, we have various characteristics, including the total amount paid, the tips left

to the waiter, the day of the week, and the time of the day.

6.1. Why Do We Need Efficient Coding?


Efficient code is code that executes faster and consumes less memory. In this
subsection, we will use the time() function to measure the computational time. This function
measures the current time so we will assign it to a variable before the code execution and after
and then calculate the difference to know the computational time of the code. A simple example
is shown in the code below:

import time
# record time before execution
start_time = time.time()
# execute operation
result = 5 + 2
# record time after execution
end_time = time.time()
print("Result calculated in {} sec".format(end_time - start_time))

Let's see some examples of how applying efficient coding methods will improve the runtime and decrease the computational complexity of your code. We will calculate the square of each number from zero up to a million. First, we will use a list comprehension to execute this operation, and then repeat the same procedure using a for loop.

First using list comprehension:

# using list comprehension
list_comp_start_time = time.time()
result = [i*i for i in range(0, 1000000)]
list_comp_end_time = time.time()
print("Time using the list comprehension: {} sec".format(list_comp_end_time - list_comp_start_time))

Now we will use for loop to execute the same operation:

# Using For loop

for_loop_start_time= time.time()
result=[]
for i in range(0,1000000):
result.append(i*i)
for_loop_end_time= time.time()
print("Time using the for loop: {} sec".format(for_loop_end_time -
for_loop_start_time))

We can see that there is a big difference between them. We can calculate the difference between them as a percentage:

list_comp_time = list_comp_end_time - list_comp_start_time


for_loop_time = for_loop_end_time - for_loop_start_time
print("Difference in time: {} %".format((for_loop_time - list_comp_time)/
list_comp_time*100))

Here is another example to show the effect of writing efficient code. We would like to calculate the sum of all consecutive numbers from 1 to 1 million. There are two ways: the first is to use brute force, in which we add the numbers one by one up to a million.

def sum_brute_force(N):
res = 0
for i in range(1,N+1):
res+=i
return res

# Using brute force


bf_start_time = time.time()
bf_result = sum_brute_force(1000000)
bf_end_time = time.time()

print("Time using brute force: {} sec".format(bf_end_time - bf_start_time))

Another more efficient method is to use a formula to calculate it. When we want to calculate the
sum of all the integer numbers from 1 up to a number, let’s say N, we can multiply N by N+1,
and then divide by 2, and this will give us the result we want. This problem was actually given to
some students back in Germany in the 19th century, and a bright student called Carl-Friedrich
Gauss devised this formula to solve the problem in seconds.

def sum_formula(N):
return N*(N+1)/2

# Using the formula


formula_start_time = time.time()
formula_result = sum_formula(1000000)
formula_end_time = time.time()

print("Time using the formula: {} sec".format(formula_end_time -


formula_start_time))

After running both methods, we achieved a massive improvement with a magnitude of over
160,000%, which clearly demonstrates why we need efficient and optimized code, even for
simple tasks.

6.2. Selecting & Replacing Values Effectively


Let's start with two of the most common tasks that you will perform on your
DataFrame, especially in the data manipulation phase of a data science project. These two
tasks are selecting specific and random rows and columns efficiently and using the replace()
function to replace one or multiple values using lists and dictionaries.

6.2.1. Selecting Rows & Columns Efficiently using .iloc[] & .loc[]
In this subsection, we will introduce how to locate and select rows efficiently from dataframes
using the .iloc[] & .loc[] pandas functions. We will use iloc[] to locate rows by index number and loc[] to locate them by index label.

In the example below we will select the first 500 rows of the poker dataset. Firstly by using the
.loc[] function, and then by using the .iloc[] function.

# Specify the range of rows to select

rows = range(0, 500)


# Time selecting rows using .loc[]
loc_start_time = time.time()
poker_data.loc[rows]
loc_end_time = time.time()
print("Time using .loc[] : {} sec".format(loc_end_time - loc_start_time))

# Specify the range of rows to select


rows = range(0, 500)
# Time selecting rows using .iloc[]
iloc_start_time = time.time()
poker_data.iloc[rows]
iloc_end_time = time.time()
print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))

loc_comp_time = loc_end_time - loc_start_time


iloc_comp_time = iloc_end_time - iloc_start_time
print("Difference in time: {} %".format((loc_comp_time - iloc_comp_time)/
iloc_comp_time*100))

While these two methods have the same syntax, iloc[] performs almost 70% faster than loc[].
The .iloc[] function takes advantage of the order of the indices, which are already sorted, and is
therefore faster. We can also use them to select columns not only rows. In the next example, we
will select the first three columns using both methods.

iloc_start_time = time.time()
poker_data.iloc[:,:3]
iloc_end_time = time.time()
print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))

names_start_time = time.time()
poker_data[['S1', 'R1', 'S2']]
names_end_time = time.time()
print("Time using selection by name: {} sec".format(names_end_time -
names_start_time))

loc_comp_time = names_end_time - names_start_time


iloc_comp_time = iloc_end_time - iloc_start_time
print("Difference in time: {} %".format((loc_comp_time - iloc_comp_time)/
loc_comp_time*100))

We can also see that column indexing using iloc[] is still 80% faster. So it is better to use iloc[], as it is faster, unless it is more convenient to use loc[] to select certain columns by name.

6.2.2. Replacing Values in a DataFrame Effectively


Replacing values in a DataFrame is a very important task, especially in the data cleaning phase, since you will need to keep all the values that represent the same object consistent. Let's take a look at the popular baby names dataset we loaded before:

Let's have a closer look at the Gender feature and see the unique values they have:

names['Gender'].unique()

We can see that the female gender is represented with two values, both uppercase and lowercase. This is very common in real data, and an easy way to handle it is to replace one of the values with the other to keep it consistent throughout the whole dataset. There are two ways to do it: the first one is simply defining which values we want to replace, and then what we want to
replace them with. This is shown in the code below:

start_time = time.time()
names.loc[names['Gender'] == 'female', 'Gender'] = 'FEMALE'
end_time = time.time()

pandas_time = end_time - start_time


print("Replace values using .loc[]: {} sec".format(pandas_time))

The second method is to use the pandas built-in function .replace() as shown in the code
below:

start_time = time.time()
names['Gender'].replace('female', 'FEMALE', inplace=True)
end_time = time.time()
replace_time = end_time - start_time

print("Time using replace(): {} sec".format(replace_time))

We can see that there is a clear difference in runtime: the built-in function is about 157% faster than using the .loc[] method to find the row and column indices of the values and replace them.

print('The difference: {} %'.format((pandas_time - replace_time) / replace_time * 100))

We can also replace multiple values using lists. Our objective is to change all ethnicities
classified as WHITE NON-HISPANIC or WHITE NON-HISP to WNH. Using the .loc[] function,
we will locate babies of the ethnicities we are looking for, using an 'or' condition (which in pandas is expressed with the pipe operator |). We will then assign the new value. As always, we also

measure the CPU time needed for this operation.

start_time = time.time()

names.loc[(names["Ethnicity"] == 'WHITE NON HISPANIC') |
          (names["Ethnicity"] == 'WHITE NON HISP'), 'Ethnicity'] = 'WNH'

end_time = time.time()
pandas_time= end_time - start_time
print("Results from the above operation calculated in %s seconds"
%(pandas_time))

We can also do the same operation using the pandas built-in .replace() function as follows:

start_time = time.time()
names['Ethnicity'].replace(['WHITE NON HISPANIC','WHITE NON HISP'],
'WNH', inplace=True)

end_time = time.time()
replace_time = end_time - start_time

print("Time using .replace(): {} sec".format(replace_time))

We can see that again using the .replace() method is much faster than using the .loc[] method.
To get a better intuition of how much faster it is, let's run the code below:

print('The difference: {} %'.format((pandas_time - replace_time) / replace_time * 100))

The .replace() method is 87% faster than using the .loc[] method. If your data is huge and
needs a lot of cleaning, this tip will decrease the computational time of your data cleaning and make your pandas code much faster and hence more efficient.

Finally, we can also use dictionaries to replace both single and multiple values in your
DataFrame. This will be very helpful if you would like to have multiple replacing functions in one
command.

We’re going to use dictionaries to replace every male’s gender with BOY and every female’s

gender with GIRL.

names = pd.read_csv('Popular_Baby_Names.csv')
start_time = time.time()
names['Gender'].replace({'MALE':'BOY', 'FEMALE':'GIRL', 'female': 'girl'},
inplace=True)
end_time = time.time()
dict_time = end_time - start_time
print("Time using .replace() with dictionary: {} sec".format(dict_time))

names = pd.read_csv('Popular_Baby_Names.csv')

start_time = time.time()

names['Gender'].replace('MALE', 'BOY', inplace=True)


names['Gender'].replace('FEMALE', 'GIRL', inplace=True)
names['Gender'].replace('female', 'girl', inplace=True)

end_time = time.time()

list_time = end_time - start_time


print("Time using multiple .replace(): {} sec".format(list_time))

print('The difference: {} %'.format((list_time - dict_time)/dict_time*100))

We could do the same thing with lists, but it’s more verbose. If we compare both methods, we
can see that the dictionary version runs approximately 22% faster. In general, working with
dictionaries in Python is very efficient compared to lists: searching a list requires scanning every
element, while a dictionary lookup hashes the key and goes straight to the matching entry. The
comparison is a little unfair though, since the two structures serve different purposes.
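
To make this intuition concrete, here is a small, self-contained timing sketch (independent of the baby names example, with arbitrary sizes) comparing a membership test on a Python list with the same test on a dictionary:

import timeit

values = list(range(100_000))
lookup = dict.fromkeys(values, True)

# Membership test on a list scans the elements one by one: O(n) on average
list_time = timeit.timeit(lambda: 99_999 in values, number=1_000)

# Membership test on a dict hashes the key and jumps straight to it: O(1) on average
dict_time = timeit.timeit(lambda: 99_999 in lookup, number=1_000)

print("list lookup: {:.4f} sec, dict lookup: {:.6f} sec".format(list_time, dict_time))

On a typical machine the dictionary lookup is orders of magnitude faster, which is the same kind of effect that makes the dictionary-based .replace() call cheap.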

Using dictionaries allows you to replace the same values on several different columns. In all the
previous examples, we specified the column from which the values to replace came. We’re now
going to replace several values from the same column with one common value. We want to
classify all ethnicities into three big categories: Black, Asian, and White. The syntax again is
very simple. We use nested dictionaries here: the outer key is the column in which we want to
replace values. The value of this outer key is another dictionary, where the keys are the
ethnicities to replace, and the values are the new ethnicity (Black, Asian, or White).

start_time = time.time()
names = names.replace({'Ethnicity': {'ASIAN AND PACI': 'ASIAN', 'ASIAN AND PACIFIC ISLANDER': 'ASIAN',
                                     'BLACK NON HISPANIC': 'BLACK', 'BLACK NON HISP': 'BLACK',
                                     'WHITE NON HISPANIC': 'WHITE', 'WHITE NON HISP': 'WHITE'}})
print("Time using .replace() with dictionary: {} sec".format (time.time() -
start_time))

Here is a summary of the best practices for selecting and replacing values:
● Selecting rows and columns is faster using the .iloc[] function. So it is better to use it,
unless .loc[] is easier or more convenient and speed is not a priority, or you are only
doing it once.
● Using the built-in replace() function is much faster than just using conventional methods.
● Replacing multiple values using Python dictionaries is faster than using lists.

6.3. Iterate Effectively Through Pandas DataFrame


As a data scientist, you will need to iterate through your dataframe extensively, especially in the
data preparation and exploration phase, so it is important to be able to do this efficiently, as it
will save you much time and give space for more important work. We will walk through three
methods to make your loops much faster and more efficient:

● Looping using the .iterrows() function


● Looping using the .apply() function
● Vectorization

6.3.1. Looping effectively using .iterrows()


Before we talk about how to use the .iterrows() function to improve the looping process, let’s
refresh the notion of a generator function. Generators are a simple tool to create iterators. Inside
the body of a generator, instead of return statements, you will find only yield() statements.
There can be just one, or several yield() statements. Here, we can see a generator,
city_name_generator(), that produces four city names. We assign the generator to the variable
city_names for simplicity.

def city_name_generator():
    yield 'New York'
    yield 'London'
    yield 'Tokyo'
    yield 'Sao Paolo'

city_names = city_name_generator()

To access the elements that the generator yields we can use Python’s next() function. Each time
the next() command is used, the generator will produce the next value to yield, until there are no
more values to yield. We here have 4 cities. Let’s run the next command four times and see
what it returns:

next(city_names)

next(city_names)

next(city_names)

next(city_names)

As we can see, every time we run the next() function it returns a new city name. Let's go
back to the .iterrows() function. The .iterrows() method is available on every pandas
DataFrame. When called, it returns a generator that yields, for each row, a pair of two elements.
We will use this generator to iterate through each line of our poker DataFrame.

The first element is the index of the row, while the second element contains a pandas Series of
each feature of the row: the Symbol and the Rank for each of the five cards. It is very similar to
the notion of the enumerate() function, which when applied to a list, returns each element along
with its index. The most intuitive way to iterate through a Pandas DataFrame is to use the
range() function, which is often called crude looping. This is shown with the code below:

start_time = time.time()
for index in range(poker_data.shape[0]):
    pass
print("Time using range(): {} sec".format(time.time() - start_time))

One smarter way to iterate through a pandas DataFrame is to use the .iterrows() function,
which is optimized for this task. We simply define the ‘for’ loop with two iterators, one for the
index of each row and the other for its values. Inside the loop we use pass, a no-op statement,
since we only want to measure the cost of the iteration itself.

data_generator = poker_data.iterrows()
start_time = time.time()
for index, values in data_generator:
    pass
print("Time using .iterrows(): {} sec".format(time.time() - start_time))

Comparing the two computational times we can also notice that the use of .iterrows() does not
improve the speed of iterating through pandas DataFrame. It is very useful though when we
need a cleaner way to use the values of each row while iterating through the dataset.

6.3.2. Looping effectively using .apply()


Now we will use the .apply() function to be able to perform a specific task while iterating through
a pandas DataFrame. The .apply() function does exactly what it says; it applies another
function to the whole DataFrame.

The syntax of the .apply() function is simple: we create a mapping, using a lambda function in
this case, and then declare the function we want to apply to every cell. Here, we’re applying the
square root function to every cell of the DataFrame. In terms of speed, it matches the speed of
just using the NumPy sqrt() function over the whole DataFrame.

data_sqrt = poker_data.apply(lambda x: np.sqrt(x))


data_sqrt.head()

This is a simple example, since we apply the same function to every cell of the DataFrame.

But what happens when the function of interest takes more than one cell as an input? For
example, what if we want to calculate the sum of the rank of all the cards in each hand? In this
case, we will use the .apply() function the same way as we did before, but we need to add
‘axis=1’ at the end of the line to specify we’re applying the function to each row.

apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time
print("Time using .apply(): {} sec".format(time.time() - apply_start_time))

Then, we will use the .iterrows() function we saw previously, and compare their efficiency.

for_loop_start_time = time.time()
for ind, value in poker_data.iterrows():
    sum([value[1], value[3], value[5], value[7], value[9]])
for_loop_end_time = time.time()

for_loop_time = for_loop_end_time - for_loop_start_time

print("Time using .iterrows(): {} sec".format(for_loop_time))

Using the .apply() function is significantly faster than the .iterrows() function, with a magnitude
of around 400 percent, which is a massive improvement!

print('The difference: {} %'.format((for_loop_time - apply_time) / apply_time * 100))

As we did with rows, we can do exactly the same thing for the columns; apply one function to
each column. By replacing the axis=1 with axis=0, we can apply the sum function on every
column.

apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=0)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time

print("Time using .apply(): {} sec".format(apply_time))

By comparing the .apply() function with the native pandas .sum() function for the same
column-wise sum, we can see that the native function performs the operation faster.

pandas_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=0)
pandas_end_time = time.time()
pandas_time = pandas_end_time - pandas_start_time
print("Time using pandas: {} sec".format(pandas_time))

print('The difference: {} %'.format((apply_time - pandas_time) / pandas_time * 100))

In conclusion, we observe that the .apply() function performs faster when we want to iterate
through all the rows of a pandas DataFrame, but is slower when we perform the same operation
through a column.

6.3.3. Looping effectively using vectorization


To understand how we can reduce the amount of iteration performed by the function, recall that
the fundamental units of Pandas, DataFrames and Series, are both based on arrays. Pandas
performs more efficiently when an operation is applied to a whole array at once rather than to
each value separately or sequentially. This can be achieved through vectorization. Vectorization is the
process of executing operations on entire arrays.

In the code below we want to calculate the sum of the ranks of all the cards in each hand. In
order to do that, we slice the poker dataset keeping only the columns that contain the ranks of
each card. Then, we call the built-in .sum() property of the DataFrame, using the parameter axis
= 1 to denote that we want the sum for each row. In the end, we print the sum of the first five
rows of the data.
start_time_vectorization = time.time()

poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)


end_time_vectorization = time.time()

vectorization_time = end_time_vectorization - start_time_vectorization

print("Time using pandas vectorization: {} sec".format(vectorization_time))

We saw previously various methods that perform functions applied to a DataFrame faster than
simply iterating through all the rows of the DataFrame. Our goal is to find the most efficient
method to perform this task.

Using .iterrows() to loop through the DataFrame:

data_generator = poker_data.iterrows()

start_time_iterrows = time.time()

for index, value in data_generator:
    sum([value[1], value[3], value[5], value[7], value[9]])

end_time_iterrows = time.time()
iterrows_time = end_time_iterrows - start_time_iterrows
print("Time using .iterrows() {} seconds " .format(iterrows_time))

Using the .apply() method to loop through the DataFrame:

start_time_apply = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x),axis=1)
end_time_apply = time.time()

apply_time = end_time_apply - start_time_apply

print("Time using apply() {} seconds" .format(apply_time))

Comparing the time it takes to sum the ranks of all the cards in each hand using vectorization,
the .iterrows() function, and the .apply() function, we can see that the vectorization method
performs much better. We can also use another vectorization method to iterate effectively
through the DataFrame: using NumPy arrays to vectorize the DataFrame.

The NumPy library, which defines itself as a “fundamental package for scientific computing in
Python”, performs operations under the hood in optimized, pre-compiled C code. Similar to
pandas working with arrays, NumPy operates on arrays called ndarrays. A major difference
between Series and ndarrays is that ndarrays drop much of the overhead of a Series, such as
index alignment and per-operation data type checking. As a result, operations on NumPy arrays
can be significantly faster than operations on pandas Series. NumPy arrays can be used in place
of the pandas Series when the additional functionality offered by the pandas Series isn’t critical.

For the problems we explore in this chapter, we could use NumPy ndarrays instead of the
pandas series. The question at stake is whether this would be more efficient or not. Again, we
will calculate the sum of the ranks of all the cards in each hand. We convert our rank arrays
from pandas Series to NumPy arrays simply by using the .values attribute of a pandas Series,
which returns the underlying data as a NumPy ndarray. As with vectorization on the Series, the
operation is then applied to the entire array at once.

start_time = time.time()

poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].values.sum(axis=1)

print("Time using NumPy vectorization: {} sec" .format(time.time() -


start_time))

start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
print("Results from the above operation calculated in %s seconds" %
(time.time() - start_time))

At this point, we can see that vectorizing over the pandas series achieves the overwhelming
majority of optimization needs for everyday calculations. However, if speed is of the highest
priority, we can call in reinforcements in the form of the NumPy Python library. Compared to the
previous state-of-the-art method, the pandas optimization, we still get an improvement in the
operating time.

Here is a summarization of the best practices for looping through DataFrame:

● Using .iterrows() does not improve the speed of iterating through the DataFrame, but it
does give you a cleaner way to access the values of each row.
● The .apply() function performs faster when we want to iterate through all the rows of a
pandas DataFrame, but is slower when we perform the same operation through a
column.
● Vectorizing over the pandas series achieves the overwhelming majority of optimization
needs for everyday calculations. However, if speed is of the highest priority, we can call
in reinforcements in the form of the NumPy Python library.

6.4. Transforming Data Effectively With .groupby()


In this last section of the chapter, we will learn how to use the .groupby() function effectively to
group the entries of a DataFrame according to the values of a specific feature. The .groupby()
method is applied to a DataFrame and groups it according to a feature. Then, we can apply
some simple or more complicated functions on that grouped object. This is a very important tool
for every data scientist working on tabular or structured data as it will help you to manipulate
data easily and in a more effective way.

6.4.1. Common functions used with .groupby()


One of the simplest methods to apply to an aggregated group is the .count(). In the example
below we will apply this to the restaurant dataset. At first, we group the restaurant data
according to whether the customer was a smoker or not. Then, we apply the .count() method.
We obtain the count of smokers and non-smokers.

restaurant = pd.read_csv('restaurant_data.csv')

restaurant_grouped = restaurant.groupby('smoker')
print(restaurant_grouped.count())

It is no surprise that we get the same results for all the features, as the .count() method counts
the number of occurrences of each group in each feature. As there are no missing values in our
data, the results should be the same in all columns.

After grouping the entries of the DataFrame according to the values of a specific feature, we can
apply any kind of transformation we are interested in. Here, we are going to apply the z score, a
normalization transformation, which is the distance between each value and the mean, divided
by the standard deviation. This is a very useful transformation in statistics, often used with the
z-test in standardized testing. To apply this transformation to the grouped object, we just need to
call the .transform() method containing the lambda transformation we defined.

This time, we will group according to the type of meal: was it a dinner or a lunch? As the z-score
is a group-related transformation, the resulting table has the same shape as the original one: for
each element, we subtract the mean and divide by the standard deviation of the group it belongs
to. We can also see that numerical transformations are applied only to the numerical features of
the DataFrame.

zscore = lambda x: (x - x.mean() ) / x.std()

restaurant_grouped = restaurant.groupby('time')

restaurant_transformed = restaurant_grouped.transform(zscore)
restaurant_transformed.head()

While the transform() method simplifies things a lot, is it actually more efficient than using native
Python code? As we did before, we first group our data, this time according to sex. Then we
apply the z-score transformation we applied before, measuring its efficiency. We omit the code
for measuring the time of each operation here, as you are already familiar with this. We can see
that with the use of the transform() function, we achieve a massive speed improvement. On top
of that, we’re only using one line to perform the operation of interest.

restaurant.groupby('sex').transform(zscore)

mean_female = restaurant.groupby('sex').mean()['total_bill']['Female']
mean_male = restaurant.groupby('sex').mean()['total_bill']['Male']
std_female = restaurant.groupby('sex').std()['total_bill']['Female']
std_male = restaurant.groupby('sex').std()['total_bill']['Male']

for i in range(len(restaurant)):
    if restaurant.iloc[i, 2] == 'Female':
        restaurant.iloc[i, 0] = (restaurant.iloc[i, 0] - mean_female)/std_female
    else:
        restaurant.iloc[i, 0] = (restaurant.iloc[i, 0] - mean_male)/std_male

6.4.2. Missing value imputation using .groupby() & .transform()


Now that we have seen why and how to use the transform() function on a grouped pandas
object, we will address a very specific task: imputing missing values. Before we actually
see how we can use the transform() function for missing value imputation, let's check how many
missing values there are in our variable of interest in each of the groups. We can see below the
number of data points in each group of the “time” feature; they add up to 176+68 = 244.

prior_counts = restaurant.groupby('time')
prior_counts['total_bill'].count()

Next, we will create a restaurant_nan dataset, in which the total bill of a random 10% of the
observations is set to NaN using the code below:

import pandas as pd
import numpy as np

p = 0.1 #percentage missing data required

mask = np.random.choice([np.nan,1], size=len(restaurant), p=[p,1-p])


restaurant_nan = restaurant.copy()
restaurant_nan['total_bill'] = restaurant_nan['total_bill'] * mask

Now, let's print the number of data points in each group of the “time” feature; we can see that
they are now 155+62 = 217. Since the total number of data points is 244, the number of missing
data points is 27, which is roughly 10%.

nan_counts = restaurant_nan.groupby('time')
nan_counts['total_bill'].count()

After counting the number of missing values in our data, we will show how to fill these missing
values with a group-specific function. The most common choices are the mean and the median,
and the selection has to do with the skewness of the data. As we did before, we define a
lambda transformation using the fillna() function to replace every missing value with its group
average. As before, we group our data according to the time of the meal and then replace the
missing values by applying the pre-defined transformation.

# Missing value imputation

missing_trans = lambda x: x.fillna(x.mean())


restaurant_nan_grouped = restaurant_nan.groupby('time')['total_bill']
restaurant_nan_grouped.transform(missing_trans)

As we can see, the observations at index 0 and index 4 are exactly the same, which means that
their missing value has been replaced by their group’s mean. Also, we can see the computation
time using this method is 0.007 seconds.

Let's compare this with the conventional method:

start_time = time.time()
mean_din = restaurant_nan.loc[restaurant_nan.time == 'Dinner']['total_bill'].mean()
mean_lun = restaurant_nan.loc[restaurant_nan.time == 'Lunch']['total_bill'].mean()

for row in range(len(restaurant_nan)):
    if np.isnan(restaurant_nan.iloc[row]['total_bill']):
        if restaurant_nan.iloc[row]['time'] == 'Dinner':
            restaurant_nan.loc[row, 'total_bill'] = mean_din
        else:
            restaurant_nan.loc[row, 'total_bill'] = mean_lun
print("Results from the above operation calculated in %s seconds" %
(time.time() - start_time))

We can see that using the .transform() function applied on a grouped object performs faster
than the native Python code for this task.

6.4.3. Data filtration using the .groupby() & .filter()


Now we will discuss how we can use the .filter() function on a grouped pandas object. This
allows us to include only a subset of those groups, based on some specific conditions.

Often, after grouping the entries of a DataFrame according to a specific feature, we are
interested in including only a subset of those groups, based on some conditions. Some
examples of filtration conditions are the number of missing values, the mean of a specific
feature, or the number of occurrences of the group in the dataset.

We are interested in finding the mean amount of tips given, on the days when the mean amount
paid to the waiter is more than 20 USD. The .filter() function accepts a lambda function that
operates on a DataFrame of each of the groups. In this example, the lambda function selects
“total_bill” and checks that the mean() is greater than 20. If that lambda function returns True,
then the mean() of the tip is calculated. If we compare the total mean of the tips, we can see
that there is a difference between the two values, meaning that the filtering was performed
correctly.

restaurant_grouped = restaurant.groupby('day')
filter_trans = lambda x : x['total_bill'].mean() > 20
restaurant_filtered = restaurant_grouped.filter(filter_trans)
print(restaurant_filtered['tip'].mean())

print(restaurant['tip'].mean())

If we attempt to perform this operation without using groupby(), we end up with this inefficient
code. At first, we use a list comprehension to extract the entries of the DataFrame that refer to
days that have a mean meal greater than $20 and then use a for loop to append them into a list
and calculate the mean. It might seem very intuitive, but as we see, it’s also very inefficient.

t = [restaurant.loc[restaurant['day'] == i]['tip'] for i in restaurant['day'].unique()
     if restaurant.loc[restaurant['day'] == i]['total_bill'].mean() > 20]
restaurant_filtered = t[0]

for j in t[1:]:
    restaurant_filtered = restaurant_filtered.append(j, ignore_index=True)

6.5. Summary of Best Practices


● Selecting rows and columns is faster using the .iloc[] function. So it is better to use it,
unless .loc[] is easier or more convenient and speed is not a priority, or you are only
doing it once.
● Using the built-in .replace() function is much faster than just using conventional
methods.
● Replacing multiple values using python dictionaries is faster than using lists.
● Using .iterrows() does not improve the speed of iterating through the DataFrame, but it
does give you a cleaner way to access the values of each row.
● The .apply() function performs faster when we want to iterate through all the rows of a
pandas DataFrame, but is slower when we perform the same operation through a
column.
● Vectorizing over the pandas series achieves the overwhelming majority of optimization
needs for everyday calculations. However, if speed is of the highest priority, we can call
in reinforcements in the form of the NumPy Python library.
● Using .groupby() to group it according to a certain feature and then using other
functions to apply it to the data is much faster than using the conventional coding
method.

7. Make Your Pandas Code 1000 Times Faster With
This Trick
While the Pandas package is powerful and flexible, its performance can sometimes become a
bottleneck on large datasets. In this chapter, we will explore a trick to make your Pandas code run
much faster, increasing its efficiency by up to 1000 times.

Whether you are a beginner or an experienced Pandas user, this chapter will provide you with
valuable insights and practical tips for speeding up your code. So, if you want to boost the
performance of your Pandas code, read on!

7.1. Create Dataset & Problem Statement


First, let's create the data we will use throughout this chapter to compare the different methods.
Each row will have an age, the time spent in bed, the percentage of that time spent sleeping, a
favorite food, and a least favorite food. Let's build a function that generates the data given its size:

import pandas as pd
import numpy as np
def get_data(size=10000):
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_in_bed'] = np.random.randint(0, 9, size)
    df['pct_sleeping'] = np.random.rand(size)
    df['favorite_food'] = np.random.choice(['pizza', 'ice-cream', 'burger', 'rice'], size)
    df['hate_food'] = np.random.choice(['milk', 'vegetables', 'eggs'], size)
    return df

df = get_data()
df.head()

The task we will work on is a reward calculation based on the following rules: if a person was in
bed for more than 5 hours and slept more than 50% of that time, we give them their favorite food;
otherwise, we give them the food they hate. If they are over 90 years old, they get their favorite
food regardless. This can be represented using the following function:

def reward_cal(row):
    if row['age'] >= 90:
        return row['favorite_food']
    if (row['time_in_bed'] > 5) & (row['pct_sleeping'] > 0.5):
        return row['favorite_food']
    return row['hate_food']

7.2. Level 1: Loops


The first and straightforward approach is to use for loops to iterate over each row of the data
frame.

%%timeit

for index, row in df.iterrows():
    df.loc[index, 'reward'] = reward_cal(row)

2.54 s ± 28.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As we can see, iterating through every row of the data frame took about 2.5 s, even though the
data has only 10,000 rows, which is considered small. If the data had millions of rows, a single
task like this could take hours. Therefore, this is not the most efficient way to iterate through a
data frame, so let's discuss a second method that improves the time complexity.

7.3. Level 2: Apply Function
The .apply() method in pandas is used to apply a function to each element in a pandas
dataframe. It can be used to apply a custom function to each element in a specific column or to
apply a function along either axis (row-wise or column-wise) of the dataframe. Let's use it to
apply the reward calculation function to each row of the data frame and then calculate the
computational time:

%%timeit
df['reward'] = df.apply(reward_cal, axis = 1)

266 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The average time to apply the function to the 10,000 rows of the data frame is only 266 ms,
i.e. about 0.27 seconds. This is around 10 times faster than using loops. However, we are still
not done: we can still improve the speed and make it roughly 1000 times faster than the loop.
Let's see how!

7.4. Level3: Vectorization


Vectorization in pandas refers to the process of applying operations to entire arrays or
sequences of data, as opposed to applying them to individual elements one by one. This is
done for performance reasons, as vectorized operations are usually much faster than
non-vectorized operations, especially in large datasets. Let's apply this to the data using the
conditions stated above:

%%timeit

df['reward'] = df['hate_food']
df.loc[((df['pct_sleeping']>0.5) &(df['time_in_bed']>5))| (df['age']>90),
'reward'] = df['favorite_food']

2.1 ms ± 62.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can now see a tremendous decrease in the computation time compared to the previous two
methods. The computation time has decreased by a factor of roughly 1000 compared to the
loop-based approach. Let's have a look at the differences in a bar plot.

7.5. Measuring the Difference


Finally to have a better intuition of the difference between these different three methods. We will
plot a bar plot diagram using the code below:

results = pd.DataFrame(
    [
        ["Loop", 2690],
        ["apply", 268],
        ["vectorized", 2.32]
    ],
    columns=['method', 'run_time']
)

results.set_index('method')['run_time'].plot(kind='bar')

Looking at the bar plot we can get a better intuition of the huge difference between the different
computational times of the different methods used in this chapter.

8. Data Exploration Becomes Easier & Better With
Pandas Profiling
Data exploration is a crucial step in any data analysis and data science project. It allows you to
gain a deeper understanding of your data, identify patterns and relationships, and identify any
potential issues or outliers.

One of the most popular tools for data exploration is the Python library Pandas. The library
provides a powerful set of tools for working with data, including data cleaning, transformation,
and visualization. However, even with the powerful capabilities of Pandas, data exploration can
still be a time-consuming and tedious task. That’s where Pandas Profiling comes in.

With Pandas Profiling, you can easily generate detailed reports of your data, including summary
statistics, missing values, and correlations, making data exploration faster and more efficient.
This chapter will explore how Pandas Profiling can help you improve your data exploration
process and make it easier to understand your data.

8.1. What is Pandas Profiling?


Pandas profiling is a Python library that generates a comprehensive report of a DataFrame,
including information about the number of rows and columns, missing values, data types, and
other statistics. It can be used to quickly identify potential issues or outliers in the data, and can
also be used to generate summary statistics and visualizations of the data.

The report generated by the pandas profiling library typically includes a variety of information
about the dataset, including:

● Overview: Summary statistics for all columns, including the number of rows, missing
values, and data types.
● Variables: Information about each column, including the number of unique values,
missing values, and the top frequent values.
● Correlations: Correlation matrix and heatmap, showing the relationship between
different variables.
● Distribution: Histograms and kernel density plots for each column, show the distribution
of values.
● Categorical Variables: Bar plots for categorical variables, showing the frequency of
each category.
● Numerical Variables: Box plots for numerical variables, show the distribution of values
and outliers.
● Text: Information about text columns, including the number of characters and words.

● File: Information about file columns, including the number of files, and the size of each
file.
● High-Cardinality: Information about high-cardinality categorical variables, including their
most frequent values.
● Sample: A sample of the data, with the first and last few rows displayed.

It is worth noting that the report is interactive and you can drill down on each section for more
details.

8.2. Installation of Pandas Profiling

To install pandas-profiling, you can use the following command in your terminal or command
prompt.

pip install pandas-profiling

This will install the latest version of pandas-profiling and its dependencies. If you are using
Jupyter Notebook, you can also install it by running the following command in a cell.

!pip install pandas-profiling

You can also make the installation using the conda package manager with the command below.

conda install -c conda-forge pandas-profiling

If the installation does not work for you and gives you the error below,

ImportError: cannot import name 'to_html' from 'pandas_profiling.report'


(C:\Users\youss\anaconda3\lib\site-packages\pandas_profiling\report\__init__.py)

You can use this command in your Jupyter notebook:

import sys

!"{sys.executable}" -m pip install -U pandas-profiling[notebook]


!jupyter nbextension enable --py widgetsnbextension

Once installed, you can import and use pandas_profiling in your code as the following:

import pandas_profiling as pp

8.3. Pandas Profiling in Action


Let's put the pandas profiling into action and see how it works. We will use the popular baby
names dataset we used before.

Popular_baby_names_df = pd.read_csv('Popular_Baby_Names.csv')
Popular_baby_names_df.head()

Let's run the pandas profiling and observe the output:

profile = pp.ProfileReport(Popular_baby_names_df, title='Pandas Profiling


Report')

# display the report


profile.to_widgets()

As you can see, the report contains six main sections. Let's go through each section and
discuss them one by one.

8.3.1. Overview
The first section which is represented in the figure above is the overview. This section contains
three subsections which are overview, alerts, and reproduction.

The overview contains overview information of the data such as:

● number of the features


● number of observations
● missing datapoints
● number of duplicates rows
● total memory size
The information in this section is really important for feature engineering and processing,
especially the information related to the missing data points and the number of duplicate rows.
The second subsection of the overview section is the alerts subsection, which contains alerts about the data
as shown in the figure below.

This shows areas that you may need to take care of during data preparation. For example, if
certain features are highly correlated with each other, this will be flagged so you can drop one of
them or apply a dimensionality reduction algorithm to remove the redundant information.

The third subsection, reproduction, shows meta-information about the report, such as how long
it took to generate and when it was generated.

8.3.2. Variables
This section shows information about each feature together with a plot of its distribution. As you
can see, there is a subsection for each of the features in the data.

Let's take the plot for the first feature which is the Year of Birth. We can see all the statistical
information about the feature such as the min, and maximum values, the number of missing
values, and the mean. In addition to that, there is a histogram plot that shows the distribution of
the data.

This section is really useful for understanding the feature and will save you a lot of time in
feature processing.

8.3.3. Interaction
The third section in the report is the interaction section. This section shows the relation
between each numerical feature and the other features. For example, let’s have a look at the
interaction between the rank and the count feature:

We can see the relation between the Count and the Rank features; as expected, they are highly
(negatively) correlated.

8.3.4. Correlations
The fourth section of the report is the correlation section. This provides a heatmap that shows
the correlation between the features. In addition to the heatmap it also provides the correlation
value between each of the features as shown below:

8.3.5. Missing Values
The fifth section is the missing values section. This shows the fraction of values present in each
feature. Since we have no missing data here, all of the features have a value of 1 (i.e., 100% complete).

8.3.6. Samples
The final section of the report is the samples section. This shows you a sample of the data from
the first and the last rows. The figure below shows the first ten rows of the data.

8.4. Drawbacks of Pandas Profiling & How to Overcome It


Pandas Profiling is a great tool for quickly generating detailed reports of your data, but it does
have some drawbacks. One of the main drawbacks is that it can be memory intensive,
especially for large datasets. This can cause the tool to run slowly or even crash if you don’t
have enough memory available.

Another drawback is that Pandas Profiling can only be used with Pandas DataFrames. This
means that if you’re working with data in a different format, such as a CSV file or a SQL
database, you’ll need to first convert it to a Pandas DataFrame before you can use Pandas
Profiling.

Additionally, Pandas Profiling generates a lot of information and can be overwhelming to digest
if you don’t know what you’re looking for. To share or present the report, you will usually have to
export it to a file format like HTML, PDF, or Excel, which loses some of the interactivity of the
widget view.

To overcome these limitations, you can try the following:

● Use Pandas Profiling on a sample of your data rather than the entire dataset to reduce
memory usage (see the sketch after this list).
● Use Pandas to convert your data to a DataFrame before using Pandas Profiling.

● Use the options in Pandas Profiling to customize the report and only include the
information that you need.
● Use visualization libraries like Matplotlib, and Seaborn to make the report more
interactive and easy to understand.
● Use the report as a starting point for your data exploration, and then use other tools and
techniques to dive deeper into your data.
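
As an illustration of the first and third suggestions, here is a minimal sketch. It assumes a large DataFrame df is already loaded and that your pandas-profiling version supports the minimal option and the to_file() export; adjust the names to your own data:

import pandas_profiling as pp

# Profile a 10% random sample instead of the full dataset to reduce memory usage
sample_df = df.sample(frac=0.1, random_state=42)

# minimal=True skips the most expensive computations, producing a lighter report
profile = pp.ProfileReport(sample_df, title='Sampled Profiling Report', minimal=True)

# Export to a static HTML file that can be shared or opened in a browser
profile.to_file('sampled_profiling_report.html')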

9. Top 10 Pandas Mistakes to Steer Clear of in Your
Code
Pandas is a powerful and popular data analysis library in Python, widely used by data scientists
and analysts to manipulate and transform data. However, with great power comes great
responsibility, and it’s easy to fall into common pitfalls that can lead to inefficient code and slow
performance.

In this chapter, we’ll explore the top 10 mistakes to steer clear of when using Pandas, so you
can maximize your efficiency and get the most out of this powerful library. Whether you’re a
beginner or a seasoned Pandas user, these tips will help you write better code and avoid
common mistakes that can slow you down.

9.1. Having Column Names with Spaces


Having column names with spaces in Pandas can lead to issues when manipulating data or
trying to access column values. When a column name contains a space, it needs to be enclosed
in quotes or backticks every time it’s referenced in the code, which can be cumbersome and
error-prone. Additionally, you will not be able to use the dot function in pandas, Here’s an
example of how having a column name with a space can cause issues in Pandas:

For example, if you have a dataframe with a column name containing a space, say “Sales
Amount”, you cannot reference the column using the usual dot notation, like this:

df.Sales Amount

This will result in a syntax error, as the space in the column name causes confusion to Python’s
syntax parser. Instead, you would need to reference the column name using square brackets
and quotes, like this:

df['Sales Amount']

However, this can be cumbersome and error-prone, especially if you have a lot of column
names with spaces. Another issue with spaces in column names is that some functions in
pandas might not be able to handle them properly. For example, if you want to filter on a
column with a space in its name using the query() method, like this:

df.query('Sales Amount > 1000')

you will get a syntax error, as query() cannot parse a column name containing a space unless
you wrap it in backticks (`Sales Amount`). To avoid these issues, it’s best to avoid spaces in
column names altogether when working with pandas dataframes. Instead, you can use
underscores or camelCase to separate words in column names. For example, “Sales_Amount” or
“salesAmount”.
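
As a quick illustration, here is a minimal sketch (using a small, hypothetical DataFrame) of how you could normalize all column names in one go:

import pandas as pd

df = pd.DataFrame({'Sales Amount': [100, 250], 'Order Date': ['2022-01-01', '2022-01-02']})

# Replace spaces with underscores in every column name
df.columns = df.columns.str.replace(' ', '_')

print(df.columns.tolist())  # ['Sales_Amount', 'Order_Date']
print(df.Sales_Amount)      # dot notation now works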

9.2. Not Using Query Method for Filtering


The second mistake you should avoid is not using the query method when filtering the data. The
query() method in pandas is a useful tool for creating subsets of a dataframe based on specific
conditions. It allows you to filter rows based on a boolean expression that you provide and can
be a convenient way to create complex subsets without needing to chain multiple conditions
together.

Here’s an example of how to use the query() method to create a subset of a dataframe:

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David'],


'age': [25, 30, 35, 40]})

# Subset where age is greater than 30


subset = df.query('age > 30')

In this example, the query() method is used to create a subset of the dataframe df where the
age column is greater than 30. The resulting subset is stored in the variable subset.

You can also use the query() method to create subsets based on multiple conditions:

# Subset where age is between 30 and 40, inclusive


subset = df.query('age >= 30 and age <= 40')

In this example, the query() method is used to create a subset of the dataframe df where the
age column is between 30 and 40, inclusive. While the query() method can be a powerful tool,
it's important to keep in mind that it can be slower than other methods like boolean indexing or
the loc and iloc indexers. Additionally, it's important to be careful with variable scoping and
name collisions when using query(). Therefore, it's recommended to use query() judiciously
and to consider other methods when appropriate.

9.3. Not using @ Symbol when Writing Complex Queries


When using the query() method in pandas, it's important to keep in mind that using a string for
the query expression may not always be the most efficient or readable approach, especially for
complex queries.

An alternative approach is to use the @ symbol to reference variables in the query expression,
allowing you to write more readable and flexible code. Here’s an example of how to use the @
symbol with the query() method:

import pandas as pd

df = pd.read_csv('sales_data.csv')

product_category = 'Electronics'
start_date = '2022-01-01'
end_date = '2022-03-01'
min_sales_amount = 1000
min_quantity = 10

subset = df.query(
    'product_category == @product_category and '
    '@start_date <= order_date <= @end_date and '
    '(sales_amount >= @min_sales_amount or quantity >= @min_quantity)'
)

In this example, the @ symbol is used to reference variables in the query expression, making
the code more flexible and readable. The variables product_category, start_date, end_date,
min_sales_amount, and min_quantity are defined earlier in the code and can be easily modified
without needing to update the string expression.

By using the @ symbol to reference variables, you can write more concise and readable queries
without sacrificing performance or readability. This approach is especially useful for complex
queries involving multiple variables or conditions.

9.4. Iterating over Dataframe instead of using Vectorization


Vectorization is a powerful technique in data analysis that involves performing operations on
entire arrays or columns of data at once, rather than using loops to iterate over each individual
element. Using vectorization can often lead to much faster and more efficient code, particularly
when working with large datasets.

To illustrate how to use vectorization instead of looping over a data frame, let’s consider a
simple example. Suppose we have a data frame with two columns, “x” and “y”, and we want to
create a new column “z” that contains the product of “x” and “y”.

Here’s an example of how we might do this using a loop:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

for index, row in df.iterrows():
    df.loc[index, 'z'] = row['x'] * row['y']

This code iterates over each row of the data frame and calculates the product of “x” and “y” for
that row, then stores the result in the new “z” column. While this code works, it can be slow and
inefficient, particularly for large data frames. Instead, we can use vectorization to perform this
calculation much more efficiently. Here’s an example:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

df['z'] = df['x'] * df['y']

In this code, we simply use the “*” operator to multiply the “x” and “y” columns together and then
assign the result to the new “z” column. This code performs the calculation much more
efficiently than the loop-based approach.

9.5. Treating Slices of Dataframe as New Dataframe


When you create a new DataFrame from a slice of an existing DataFrame in pandas, it’s
important to note that the new DataFrame is actually a view of the original data, rather than a
copy. This means that any changes you make to the new DataFrame will also be reflected in the
original DataFrame.

To avoid this behavior and ensure that any edits you make to the new DataFrame do not affect
the original DataFrame, you can use the “copy” method to create a copy of the DataFrame.

Here’s an example:

# Create a sample DataFrame


data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'age': [25, 32, 18, 47, 29],
'gender': ['F', 'M', 'M', 'M', 'F']}
df = pd.DataFrame(data)

# Create a copy of the DataFrame


new_df = df.loc[df['age'] > 30, ['name', 'age']].copy()

# Modify the new DataFrame


new_df['age'] = new_df['age'] + 5

# Print both DataFrames

print(df)
print(new_df)

In this example, we first create a sample DataFrame with three columns: “name”, “age”, and
“gender”. We then select a slice of the DataFrame using the “.loc” indexer to select only the
rows where the “age” column is greater than 30, and only the “name” and “age” columns. We
then use the “copy” method to create a new DataFrame that is a copy of the slice, rather than a
view of the original data.

We then modify the “age” column of the new DataFrame by adding 5 to each value. Because we
used the “copy” method to create the new DataFrame, this modification does not affect the
original DataFrame. Finally, we print both DataFrames to confirm that the modification only
applies to the new DataFrame, and not the original DataFrame.

9.6. Not Using Chain Commands for Multiple Transformations


The sixth mistake you need to avoid is creating multiple intermediate dataframes when applying
multiple transformations. Instead, it is better to write the transformations as a chain of
commands, so that they are all applied in one statement.

In pandas, you can chain multiple transformations together in a single statement, which can be
a more concise and efficient way of applying multiple transformations to a DataFrame. Here’s an
example of how to transform a DataFrame using chain commands:

# Create a sample DataFrame


data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'age': [25, 32, 18, 47, 29],
'gender': ['F', 'M', 'M', 'M', 'F']}
df = pd.DataFrame(data)

# Apply multiple transformations in a single statement


new_df = (df
.loc[df['age'] > 30, ['name', 'age']]
.copy()
.assign(age_plus_5=lambda x: x['age'] + 5))

# Print the new DataFrame


print(new_df)
In this example, we first create a sample DataFrame with three columns: “name”, “age”, and
“gender”. We then chain multiple transformations together in a single statement to create a new
DataFrame:

1. Use the “.loc” indexer to select only the rows where the “age” column is greater than 30,
and only the “name” and “age” columns.
2. Use the “copy” method to create a new DataFrame that is a copy of the slice, rather
than a view of the original data.
3. Use the “assign” method to create a new column in the DataFrame called “age_plus_5”,
which is equal to the “age” column plus 5.

Finally, we print the new DataFrame to confirm that all of the transformations were applied
correctly.

9.7. Not Setting Column dtypes Correctly


Setting column dtypes in Pandas is an important step in data analysis and manipulation.
Dtypes, or data types, are the way Pandas store and represent data in memory. By specifying
the dtypes for each column, you can control the memory usage of your data and improve the
performance of your Pandas operations.

In many instances, you will have to manually set the dtype of a certain column correctly. This
will help you manipulate your data more efficiently; a short sketch follows the list below.
Here are some reasons why setting column dtypes is important:

1. Memory usage: By setting the right data types, you can reduce the memory usage of
your DataFrame. This is especially important when working with large datasets, as it can
help avoid running out of memory and crashing your program.
2. Data consistency: Setting column dtypes can help ensure that your data is consistent
and accurate. For example, if a column should contain only integers, setting its dtype to
“int” will prevent any non-integer values from being entered into that column.
3. Performance: Pandas operations can be much faster when the dtypes are set correctly.
For example, operations like sorting and filtering can be optimized when Pandas knows
the data types of the columns being operated on.
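
As a short illustration of these points, here is a minimal sketch showing two common ways to set dtypes: declaring them while reading a file and converting columns afterwards with astype(). The file and column names are hypothetical and only serve as an example:

import pandas as pd

# Option 1: declare dtypes while reading the file
df = pd.read_csv('sales_data.csv',
                 dtype={'region': 'category', 'quantity': 'int32'},
                 parse_dates=['order_date'])

# Option 2: convert columns of an existing DataFrame
df['sales_amount'] = df['sales_amount'].astype('float32')
df['product_category'] = df['product_category'].astype('category')

# Inspect dtypes and memory usage to confirm the effect
print(df.dtypes)
print(df.memory_usage(deep=True))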

9.8. Not Using Pandas Plotting Builtin Function


Pandas provides many built-in plotting functions that can be used to create a wide variety of
plots, including line plots, scatter plots, bar plots, histograms, and more. A common mistake is
that many people do not use them and instead write more verbose plotting code with other libraries.

Here are some examples of using Pandas plotting functions efficiently:

1. Line plot

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with some random data


df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Plot a line chart of column A


df['A'].plot(kind='line', color='red', title='Line Plot')
plt.xlabel('Index')
plt.ylabel('Value')
plt.show()

This code creates a DataFrame with 100 rows and 4 columns of random data and then plots a
line chart of column A using the “.plot” method. The “kind” parameter is set to “line” to create a
line plot, and the “color” parameter is set to “red” to change the color of the line. The title, xlabel,
and ylabel of the plot are also set using the standard Matplotlib functions.

2. Scatter plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with some random data


df = pd.DataFrame(np.random.randn(100, 2), columns=['X', 'Y'])

# Plot a scatter chart of columns X and Y

df.plot(kind='scatter', x='X', y='Y', color='blue', title='Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

This code creates a DataFrame with 100 rows and 2 columns of random data and then plots a
scatter chart of columns X and Y using the “.plot” method. The “kind” parameter is set to
“scatter” to create a scatter plot, and the “x” and “y” parameters are set to ‘X’ and ‘Y’
respectively to specify the columns to use as the x-axis and y-axis of the plot.

3. Bar plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with some random data


df = pd.DataFrame({'A': np.random.randint(1, 10, 5), 'B':
np.random.randint(1, 10, 5)})

# Plot a bar chart of columns A and B


df.plot(kind='bar', color=['blue', 'red'], title='Bar Plot')
plt.xlabel('Index')
plt.ylabel('Value')
plt.show()

This code creates a DataFrame with 5 rows and 2 columns of random data and then plots a bar
chart of columns A and B using the “.plot” method. The “kind” parameter is set to “bar” to create
a bar plot, and the “color” parameter is set to a list of colors to use for the bars.

4. Histogram

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with some random data


df = pd.DataFrame(np.random.randn(1000, 1), columns=['A'])

# Plot a histogram of column A


df['A'].plot(kind='hist', bins=20, color='green', title='Histogram')
plt.xlabel('Value')
plt.show()

This code creates a DataFrame with 1000 rows and 1 column of random data and then plots a
histogram of column A using the “.plot” method. The “kind” parameter is set to “hist” to create a
histogram, and the “bins” parameter is set to 20 to control the number of bins in the histogram.

9.9. Aggregation Manually instead of using .groupby()
Aggregation is a common operation in data analysis where we group data based on one or
more columns and apply a mathematical function to the remaining columns to get summarized
information about the data. Pandas provide a .groupby() method that simplifies the process of
aggregation, but manually performing aggregation can lead to inefficient code.

Here are some examples of manually aggregating data in Pandas and how it can be improved
using .groupby():

1. Manually aggregating by looping through each group:

import pandas as pd

df = pd.read_csv('sales_data.csv')

unique_regions = df['region'].unique()

for region in unique_regions:
    region_sales = df[df['region'] == region]['sales']
    total_sales = region_sales.sum()
    average_sales = region_sales.mean()
    max_sales = region_sales.max()
    min_sales = region_sales.min()
    print(f'{region}: Total Sales: {total_sales}, Average Sales: {average_sales}, '
          f'Max Sales: {max_sales}, Min Sales: {min_sales}')

This code loops through each unique region in the dataset, filters the data for that region, and
manually calculates the total, average, maximum, and minimum sales for that region. This
process can be simplified using .groupby():

import pandas as pd

df = pd.read_csv('sales_data.csv')

region_stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'max',


'min'])
print(region_stats)

This code groups the data by the region column and calculates the total, average, maximum,
and minimum sales for each region using the .agg() method.

2. Manually aggregating by creating multiple pivot tables

import pandas as pd

df = pd.read_csv('sales_data.csv')

region_sales = pd.pivot_table(df, index='region', values='sales',


aggfunc=['sum', 'mean', 'max', 'min'])
category_sales = pd.pivot_table(df, index='category', values='sales',
aggfunc=['sum', 'mean', 'max', 'min'])
product_sales = pd.pivot_table(df, index='product', values='sales',
aggfunc=['sum', 'mean', 'max', 'min'])

print(region_sales)
print(category_sales)
print(product_sales)

This code creates multiple pivot tables to calculate the total, average, maximum, and minimum
sales for each region, category, and product. This can be simplified using .groupby():

import pandas as pd

df = pd.read_csv('sales_data.csv')

sales_stats = df.groupby(['region', 'category',


'product'])['sales'].agg(['sum', 'mean', 'max', 'min'])

print(sales_stats)
This code groups the data by the region, category, and product columns and calculates the total,
average, maximum, and minimum sales for each combination of those columns using the .agg()
method.

Overall, manually aggregating data can be time-consuming and error-prone, especially for large
datasets. The .groupby() method simplifies the process and provides a more efficient and
reliable way to perform aggregation operations in Pandas.

9.10. Saving Large Datasets as CSV File


While saving large datasets as CSV files is a common and simple approach, it may not always
be the best option. Here are some reasons why:

1. Large file size: CSV files can be very large in size, especially when dealing with
datasets that have many columns and/or rows. This can cause problems with storage
and processing, especially if you have limited resources.
2. Limited data types: CSV files only support a limited range of data types, such as text,
numbers, and dates. If your dataset includes more complex data types, such as images
or JSON objects, then CSV may not be the best format to use.
3. Loss of metadata: CSV files do not support metadata, such as data types, column
names, or null values. This can cause problems when importing or exporting the data,
and can make it difficult to perform data analysis.
4. Performance issues: Reading and writing large CSV files can be slow and can put a
strain on system resources, especially when dealing with complex datasets.
5. No data validation: CSV files do not provide any built-in data validation or error
checking, which can lead to data inconsistencies and errors.

There are more efficient ways to save large dataframes than using CSV files. Some of the
options are:

1. Parquet: Parquet is a columnar storage format that is optimized for data processing on
large data sets. It can handle complex data types and supports compression, which
makes it a good choice for storing large dataframes.
2. Feather: Feather is a lightweight binary file format designed for fast read and write
operations. It supports both R and Python and can be used to store dataframes in a
compact and efficient way.
3. HDF5: HDF5 is a file format designed for storing large numerical data sets. It provides a
hierarchical structure that can be used to organize data and supports compression and
chunking, which makes it suitable for storing large dataframes.
4. Apache Arrow: Apache Arrow is a cross-language development platform for in-memory
data processing. It provides a standardized format for representing data that can be
used across different programming languages and supports zero-copy data sharing,
which makes it efficient for storing and processing large dataframes.

Each of these options has its own strengths and weaknesses, so the choice of which one to use
depends on your specific use case and requirements.
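
As a minimal sketch of the first two options (assuming a DataFrame df and that the required engine, e.g. pyarrow, is installed), saving and reloading with Parquet or Feather takes a single call each:

import pandas as pd

df = pd.DataFrame({'x': range(1_000_000), 'y': ['a', 'b'] * 500_000})

# Parquet: columnar, compressed, preserves dtypes and column metadata
df.to_parquet('data.parquet')
df_parquet = pd.read_parquet('data.parquet')

# Feather: lightweight binary format optimized for fast reads and writes
df.to_feather('data.feather')
df_feather = pd.read_feather('data.feather')

Compared to df.to_csv(), these formats usually produce smaller files and load back noticeably faster, although the exact gains depend on your data.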

Afterword
Thanks for purchasing and reading my book! If you have any questions, feedback or praise, you
can reach me at: Youssef.Hosni95@outlook.com

You can check my other books on my website. I would be happy if you connect with me
personally on LinkedIn. If you liked my writings, make sure to follow me on Medium. You are
also welcomed to subscribe to my newsletter To Data & Beyond to never miss any of my
writings.

What's inside the book?

Three Principles to Write Python Clean Code


Defining & Measuring Code Efficiency
Optimizing Your Python Code
5 Tips to Write Efficient Python Functions
How to Use Caching to Speed Up Your Python Code & LLM
Application?
Best Practices To Use Pandas Efficiently As A Data Scientist
Make Your Pandas Code 1000 Times Faster With This Trick
Data Exploration Becomes Easier & Better With Pandas Profiling
Top 10 Pandas Mistakes to Steer Clear of in Your Code

About the Author

Youssef Hosni is a data scientist and machine


learning researcher who has worked in machine
learning and AI for over half a decade.
In addition to being a researcher and data science
practitioner, Youssef has a strong passion for
education. He is known for his leading data science
and AI blog, newsletter, and eBooks on data science
and machine learning.

Youssef is a senior data scientist at Ment focusing on


building Generative AI features for Ment Products. He
is also an AI applied researcher at Aalto University
working on AI agents and their applications. Before
that, he worked as a researcher in which he applied
deep learning and computer vision techniques to
medical images.
