Efficient Python for Data Scientists
Through nine meticulously crafted chapters, this book covers everything from foundational
principles of clean code and efficiency to advanced techniques like caching and vectorization.
You'll learn how to write better functions, avoid common pitfalls, and elevate your data
exploration process using tools like Pandas Profiling.
Whether you're an aspiring data scientist or a seasoned professional, this book is written for you.
Each chapter is packed with examples, tips, and tricks, ensuring you have a hands-on learning
experience. Dive into Efficient Python for Data Scientists and take the next step toward writing
Python code that is not just functional but exceptional.
Clean code is substantially more than just removing all your commented lines or keeping the
length of your functions to a minimum. It’s about making your code readable so that any other
coder coming to your project in the future will know exactly what you meant with a given piece of
code without having to dig through comments or documentation.
There are many principles, techniques, and best practices we can follow to write clean Python code. Below are some tips that will help you get started and make the process easier the next time you write code.
In any software project, the code is one of the most important assets. The final production code
must be clean and easy to understand in order to facilitate its maintenance.
Reusing parts of code, modularity, and object orientation are some of the techniques used to
produce high-quality code.
In this section, I describe several characteristics that help identify high-quality production code.
These characteristics may not seem important at first glance but they have a major impact on
how efficiently developers can work with your project’s source code. Let’s take a look!
1. Production Code: software running on production servers to handle live users and data
of the intended audience. Note this is different from production quality code, which
describes code that meets expectations in reliability, efficiency, etc., for production.
Ideally, all code in production meets these expectations, but this is not always the case.
2. Clean: readable, simple, and concise. A characteristic of production quality code that is crucial for collaboration and maintainability in software development. Clean code is a very important characteristic of high-quality production code, and writing clean code will lead to:
● Focused Code: Each function, class, or module should do one thing and do it
well.
● Easy to read code: According to Grady Booch, author of Object-Oriented Analysis and Design with Applications, clean code reads like well-written prose.
● Easy to debug code: Clean code can be easily debugged and its errors fixed, because it is easy to read and follow.
● Easy to maintain: It can easily be read and enhanced by other developers.
3. Modular Code: logically broken up into functions and modules. This is another important characteristic of production quality code, and it makes your code more organized, efficient, and reusable. Modules allow code to be reused by encapsulating it into files that can be imported into other files.
4. Refactoring: Restructuring your code to improve its internal structure, without changing
its external functionality. This gives you a chance to clean and modularize your program
after you’ve got it working. Since it isn’t easy to write your best code while you’re still
trying to just get it working, allocating time to do this is essential to producing high-quality
code. Despite the initial time and effort required, this really pays off by speeding up your
development time in the long run.
So it is normal to first write code that works and then refactor it into a clean form. You become a much stronger programmer when you're constantly looking to improve your code. The more you refactor, the easier it will be to structure and write good code the first time.
Here is a more detailed guide on how to choose descriptive and consistent names:
1. Variables
● Use long descriptive names that are easy to read: Names should be descriptive enough to be understood on their own, which makes it unnecessary to write comments explaining them:
# Not recommended
# The au variable is the number of active users
au = 105
# Recommended
total_active_users = 105
● Use descriptive intention revealing types: Your coworkers and developers should be
able to figure out what your variable type is and what it stores from the name. In a
nutshell, your code should be easy to read and reason about.
# Not recommended
c = ["UK", "USA", "UAE"]
for x in c:
    print(x)

# Recommended
countries_list = ["UK", "USA", "UAE"]
for country in countries_list:
    print(country)
● Always use the same vocabulary: Be consistent with your naming convention.
Maintaining a consistent naming convention is important to eliminate confusion when
other developers work on your code. And this applies to naming variables, files,
functions, and even directory structures.
# Not recommended
client_first_name = 'John'
customer_last_name = 'Doe'

# Recommended
client_first_name = 'John'
client_last_name = 'Doe'

# Another example: mixing verbs for the same kind of operation
# Not recommended
def fetch_clients(response, variable):
    # do something
    pass

def grab_posts(response, variable):
    # do something
    pass

# Recommended
def fetch_clients(response, variable):
    # do something
    pass

def fetch_posts(response, variable):
    # do something
    pass
● Don’t use magic numbers: Magic numbers are numbers with special, hardcoded
semantics that appear in code but do not have any meaning or explanation. Usually,
these numbers appear as literals in more than one location in our code.
import random

# Not recommended
def roll_dice():
    return random.randint(0, 4)  # what is 4 supposed to represent?

# Recommended
DICE_SIDES = 4

def roll_dice():
    return random.randint(0, DICE_SIDES)
2. Functions
● Long names != descriptive names: You should be descriptive, but only with relevant information. For example, good function names describe what they do well without including implementation details or highly specific uses.
DICE_SIDES = 4

# Not recommended
def roll_dice_using_randint():
    return random.randint(0, DICE_SIDES)

# Recommended
def roll_dice():
    return random.randint(0, DICE_SIDES)
● Be consistent with your function naming convention: As seen with the variables
above, stick to a naming convention when naming functions. Using different naming
conventions would confuse other developers and colleagues.
# Not recommended
def fetch_user(id):
    # do something
    pass

def get_post(id):
    # do something
    pass

# Recommended
def fetch_user(id):
    # do something
    pass

def fetch_post(id):
    # do something
    pass
● Do not use flags or Boolean flags: Boolean flags are variables that hold a boolean
value — true or false. These flags are passed to a function and are used by the function
to determine its behavior.
# Not recommended
def transform_text(text, uppercase):
    if uppercase:
        return text.upper()
    else:
        return text.lower()
# Recommended
def transform_to_uppercase(text):
    return text.upper()

def transform_to_lowercase(text):
    return text.lower()

text = "Hello World"  # sample value so the calls below run
uppercase_text = transform_to_uppercase(text)
lowercase_text = transform_to_lowercase(text)
3. Classes
● Do not add redundant context: This happens when the class name is repeated as a prefix in attribute or variable names inside the class.
# Not recommended
class Person:
    def __init__(self, person_username, person_email, person_phone,
                 person_address):
        self.person_username = person_username
        self.person_email = person_email
        self.person_phone = person_phone
        self.person_address = person_address

# Recommended
class Person:
    def __init__(self, username, email, phone, address):
        self.username = username
        self.email = email
        self.phone = phone
        self.address = address
1.3.1. Indentation
Organize your code with consistent indentation; the standard is to use 4 spaces for each indent, and you can make this the default in your text editor. When using a hanging indent, the following should be considered: there should be no arguments on the first line, and further indentation should be used to clearly distinguish the continuation line:
# Correct: no arguments on the first line; the hanging indent clearly marks the continuation
foo = long_function_name(
    var_one, var_two,
    var_three, var_four)

# Wrong: arguments on the first line, and the continuation line is not clearly distinguished
foo = long_function_name(var_one, var_two,
    var_three, var_four)
1.3.3. Blank Lines
Adding blank lines to your code will make it better, cleaner, and easy to follow. Here is a simple
guide on how to add blank lines to your code:
● Surround top-level function and class definitions with two blank lines.
● Method definitions inside a class are surrounded by a single blank line.
● Extra blank lines may be used (sparingly) to separate groups of related functions. Blank
lines may be omitted between a bunch of related one-liners (e.g. a set of dummy
implementations).
● Use blank lines in functions, sparingly, to indicate logical sections.
One way comments are used is to document the major steps of complex code so that readers can follow what it does without having to understand every line. However, others would argue that this is using comments to justify bad code, and that if code requires comments to follow, it is a sign that refactoring is needed.
Comments are valuable for explaining what the code cannot: why it was written this way, or why certain values were selected. For example, the history behind why a certain method was implemented in a specific way. Sometimes an unconventional or seemingly arbitrary approach is applied because of some obscure external variable causing side effects. These things are difficult to explain with code alone.
# This checks if the user with the given ID doesn't exist.
if not User.objects.filter(id=user_id).exists():
    return Response({
        'detail': 'The user with this ID does not exist.',
    })
As a general rule, if you need to add comments, they should explain why you did something
rather than what is happening.
3. Don't leave commented-out, outdated code: The worst thing you can do is leave commented-out code in your programs. All debug code and debug messages should be removed before pushing to a version control system; otherwise, your colleagues will be scared to delete it and your commented code will stay there forever.
1.4.2. Docstrings
Docstrings, or documentation strings, are valuable pieces of documentation that explain the
functionality of any function or module in your code. Ideally, each of your functions should
always have a docstring. Docstrings are surrounded by triple quotes.
The first line of the docstring is a brief explanation of the function’s purpose. The next element of
a docstring is an explanation of the function’s arguments. Here you list the arguments, state
their purpose, and state what types the arguments should be. Finally, it is common to provide
some description of the output of the function. Every piece of the docstring is optional; however, docstrings are a part of good coding practice. Below are two examples of docstrings for a function: the first uses a single-line docstring, and the second uses a multi-line docstring.
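The original single-line example is not reproduced in this text, so the sketch below assumes the same population-density function that the multi-line example documents:

def population_density(population, land_area):
    """Calculate the population density of an area."""
    return population / land_area

The multi-line version follows: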
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population: int. The population of the area.
        land_area: int or float. This function is unit-agnostic; if you pass in
            values in terms of square km or square miles the function will return
            a density in those units.

    Returns:
        population_density: population / land_area. The population density of a
            particular area.
    """
    return population / land_area
1.4.3. Documentation
Project documentation is essential for getting others to understand why and how your code is
relevant to them, whether they are potential users of your project or developers who may
contribute to your code.
A great first step in project documentation is your README file. It will often be the first
interaction most users will have with your project. Whether it’s an application or a package, your
project should absolutely come with a README file. At a minimum, this should explain what it
does, list its dependencies, and provide sufficiently detailed instructions on how to use it. You
want to make it as simple as possible for others to understand the purpose of your project, and
quickly get something working.
Translating all your ideas and thoughts formally on paper can be a little difficult, but you'll get better over time, and it makes a significant difference in helping others realize the value of your
project. Writing this documentation can also help you improve the design of your code, as you’re
forced to think through your design decisions more thoroughly. This also allows future
contributors to know how to follow your original intentions.
2. Defining & Measuring Code Efficiency
In this chapter, we will discuss what efficient Python code is and how to use Python's different built-in data structures, functions, and modules to write cleaner, faster, and more efficient code. We'll also explore how to time and profile code to find bottlenecks. Then, in the next chapter, we will practice eliminating these bottlenecks, and other bad design patterns, using the Python libraries most used by data scientists: NumPy and pandas.
Efficient code refers to code that satisfies two key concepts. First, efficient code is fast and has a small latency between execution and returning a result. Second, efficient code allocates resources skillfully and isn't subject to unnecessary overhead. Although your definition of fast runtime and small memory usage may differ depending on the task at hand, the goal of writing efficient code is always to reduce both latency and overhead.
Python is a language that prides itself on code readability, and thus, it comes with its own set of
idioms and best practices. Writing Python code the way it was intended is often referred to as
Pythonic code. This means the code that you write follows the best practices and guiding
principles of Python. Pythonic code tends to be less verbose and easier to interpret. Although
Python supports code that doesn’t follow its guiding principles, this type of code tends to run
slower.
As an example, look at the non-Pythonic code below. Not only is this code more verbose than the Pythonic version, but it also takes longer to run. We'll take a closer look at why this is the case later in the book, but for now, the main takeaway is that Pythonic code is efficient code!
numbers = [1, 2, 3, 4, 5]  # sample data so both versions run

# Non-Pythonic
doubled_numbers = []
for i in range(len(numbers)):
    doubled_numbers.append(numbers[i] * 2)

# Pythonic
doubled_numbers = [x * 2 for x in numbers]
2.1.2. Python Standard Libraries
Python Standard Libraries are the built-in components and libraries of Python. These libraries come with every Python installation and are commonly cited as one of Python's greatest strengths. Python has a number of built-in types, functions, and modules.
It’s worth noting that Python’s built-ins have been optimized to work within the Python language
itself. Therefore, we should default to using a built-in solution (if one exists) rather than
developing our own.
We will focus on certain built-in types, functions, and modules. The built-in types we will focus on are:
● Lists
● Tuples
● Sets
● Dicts
The built-in functions we will focus on are:
● print()
● len()
● range()
● round()
● enumerate()
● map()
● zip()
The modules and packages we will focus on are:
● os
● sys
● itertools
● collections
● math
● NumPy
● pandas
Python Functions
Let's start exploring some of the mentioned functions:
# range(start, stop)
nums = range(0, 11)
nums_list = list(nums)
print(nums_list)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# range(stop)
nums = range(11)
nums_list = list(nums)
print(nums_list)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
We can also specify the starting index of enumerate with the keyword argument start. Here, we
tell enumerate to start the index at five by passing start equals five into the function call.
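Since the original enumerate snippets are not reproduced here, the following is a minimal sketch using an assumed letters list:

letters = ['a', 'b', 'c', 'd']

# default: the index starts at 0
indexed_letters = enumerate(letters)
print([*indexed_letters])          # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

# start the index at five
indexed_letters_start5 = enumerate(letters, start=5)
print([*indexed_letters_start5])   # [(5, 'a'), (6, 'b'), (7, 'c'), (8, 'd')]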
● map(): The last notable built-in function we’ll cover is the map() function. map applies a
function to each element in an object. Notice that the map function takes two arguments;
first, the function you’d like to apply, and second, the object you’d like to apply that
function on. Here, we use a map to apply the built-in function round to each element of
the nums list.
The map function can also be used with a lambda, or, an anonymous function. Notice here, that
we can use the map function and a lambda expression to apply a function, which we’ve defined
on the fly, to our original list nums. The map function provides a quick and clean way to apply a
function to an object iteratively without writing a for loop.
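Here is a sketch of the two map() patterns described above (the nums values are assumptions for illustration):

nums = [1.5, 2.3, 3.4, 4.6, 5.0]

# apply the built-in round() to each element
rnd_nums = [*map(round, nums)]
print(rnd_nums)    # [2, 2, 3, 5, 5]

# apply a lambda (anonymous function) defined on the fly
sqrd_nums = [*map(lambda x: x ** 2, nums)]
print(sqrd_nums)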
Python Modules
NumPy, or Numerical Python, is an invaluable Python package for Data Scientists. It is the
fundamental package for scientific computing in Python and provides a number of benefits for
writing efficient code. NumPy arrays provide a fast and memory-efficient alternative to Python
lists. Typically, we import NumPy as np and use np.array() to create a NumPy array.

# Python list
nums_list = list(range(5))
print(nums_list)  # [0, 1, 2, 3, 4]

# NumPy array
import numpy as np
nums_np = np.array(range(5))
print(nums_np)    # [0 1 2 3 4]
NumPy arrays are homogeneous, which means that they must contain elements of the same type. We can check the type of the elements using the .dtype attribute. Suppose we created an array using a mixture of types: here, we create the array nums_np_floats using the integers [1, 3] and a float [2.5]. Can you spot the difference in the output? The integers now have a dot after them in the array. That's because NumPy converted the integers to floats to retain the array's homogeneous nature. Using .dtype, we can verify that the elements in this array are floats.
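A minimal sketch of the example described above:

import numpy as np

nums_np_ints = np.array([1, 2, 3])
print(nums_np_ints.dtype)      # int64 (exact dtype is platform dependent)

nums_np_floats = np.array([1, 2.5, 3])
print(nums_np_floats)          # [1.  2.5 3. ]  (the integers were upcast to floats)
print(nums_np_floats.dtype)    # float64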
Homogeneity allows NumPy arrays to be more memory efficient and faster than Python lists.
Requiring all elements to be the same type eliminates the overhead needed for data type
checking.
When analyzing data, you’ll often want to perform operations over entire collections of values
quickly. Say, for example, you’d like to square each number within a list of numbers. It’d be nice
if we could simply square the list, and get a list of squared values returned. Unfortunately,
Python lists don’t support these types of calculations.
We could square the values using a list by writing a for loop or using a list comprehension as
shown in the code below. However, neither of these approaches is the most efficient way of
doing this. Here lies the second advantage of NumPy arrays — their broadcasting functionality.
NumPy arrays vectorize operations, so they are performed on all elements of an object at once.
This allows us to efficiently perform calculations over entire arrays.
Let's compare the computational time of these three approaches using the following code:

import time
import numpy as np

nums = list(range(1_000_000))  # sample data
nums_np = np.array(nums)

# Approach 1: for loop
st = time.time()
sqrd_nums = []
for num in nums:
    sqrd_nums.append(num ** 2)
print("For loop: {:.4f}s".format(time.time() - st))

# Approach 2: list comprehension
st = time.time()
sqrd_nums = [num ** 2 for num in nums]
print("List comprehension: {:.4f}s".format(time.time() - st))

# Approach 3: NumPy broadcasting
st = time.time()
sqrd_nums_np = nums_np ** 2
print("NumPy broadcasting: {:.4f}s".format(time.time() - st))
We can see that the first two approaches take roughly the same time, while using NumPy in the third approach decreases the computational time to about half.
Another advantage of NumPy arrays is their indexing capabilities. When comparing basic
indexing between a one-dimensional array and a list, the capabilities are identical. When using
two-dimensional arrays and lists, the advantages of arrays are clear.
To return the second item of the first row in our two-dimensional object, the array syntax is [0,1].
The analogous list syntax is a bit more verbose as you have to surround both the zero and one
with square brackets [0][1]. To return the first column values in the 2-D object, the array syntax
is [:,0]. Lists don’t support this type of syntax, so we must use a list comprehension to return
columns.
# 2-D list
nums2 = [[1, 2, 3],
         [4, 5, 6]]

# 2-D array
nums2_np = np.array(nums2)
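The indexing operations described above, as a short sketch using nums2 and nums2_np from the block above:

# second item of the first row
print(nums2[0][1])        # list syntax
print(nums2_np[0, 1])     # array syntax

# first column values
print([row[0] for row in nums2])   # list comprehension
print(nums2_np[:, 0])              # array syntax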
NumPy arrays also have a special technique called boolean indexing. Suppose we wanted to
gather only positive numbers from the sequence listed here. With an array, we can create a
boolean mask using a simple inequality. Indexing the array is as simple as enclosing this
inequality in square brackets. However, to do this using a list, we need to write a for loop to filter
the list or use a list comprehension. In either case, using a NumPy array to the index is less
verbose and has a faster runtime.
nums = [-2, -1, 0, 1, 2]  # sample data
nums_np = np.array(nums)

# Boolean indexing with an array
print(nums_np[nums_np > 0])

# The list equivalent requires a loop (or a list comprehension)
pos = []
for num in nums:
    if num > 0:
        pos.append(num)
print(pos)
To compare runtimes, we need to be able to compute the runtime for a line or multiple lines of
code. IPython comes with some handy built-in magic commands we can use to time our code.
Magic commands are enhancements that have been added to the normal Python syntax.
These commands are prefixed with the percentage sign. If you aren’t familiar with magic
commands take a moment to review the documentation.
Let's start with this example: we want to inspect the runtime for selecting 1,000 random numbers between zero and one using NumPy's random.rand() function. Using %timeit just requires adding the magic command before the line of code we want to analyze. That's it! One simple command to gather runtimes.
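For example, assuming NumPy has been imported as np, the call might look like this:

%timeit rand_nums = np.random.rand(1000)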
As we can see %timeit provides an average of timing statistics. This is one of the advantages
of using %timeit. We also see that multiple runs and loops were generated. %timeit runs
through the provided code multiple times to estimate the code’s average execution time. This
provides a more accurate representation of the actual runtime rather than relying on just one
iteration to calculate the runtime. The mean and standard deviation displayed in the output are a summary of the runtime across all of the runs.
The number of runs represents how many iterations you’d like to use to estimate the runtime.
The number of loops represents how many times you’d like the code to be executed per run. We
can specify the number of runs, using the -r flag, and the number of loops, using the -n flag.
Here, we use -r2, to set the number of runs to two and -n10, to set the number of loops to ten.
In this example, %timeit would execute our random number selection 20 times in order to
estimate runtime (2 runs each with 10 executions).
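That call might look like the following sketch:

%timeit -r2 -n10 rand_nums = np.random.rand(1000)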
Another cool feature of %timeit is its ability to run on single or multiple lines of code. When
using %timeit in line magic mode, or with a single line of code, one percentage sign is used and
we can run %timeit in cell magic mode (or provide multiple lines of code) by using two
percentage signs.
%%timeit
# Multiple lines of code
nums = []
for x in range(10):
    nums.append(x)
We can also save the output of %timeit into a variable using the -o flag. This allows us to dig
deeper into the output and see things like the time for each run, the best time for all runs, and
the worst time for all runs.
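A minimal sketch of saving and inspecting the output (the variable name times is arbitrary):

times = %timeit -o rand_nums = np.random.rand(1000)
print(times.timings)   # time for each run
print(times.best)      # best time of all runs
print(times.worst)     # worst time of all runs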
We’ve covered how to time the code using the magic command %timeit, which works well with
bite-sized code. But, what if we wanted to time a large code base or see the line-by-line
runtimes within a function? In this part, we’ll cover a concept called code profiling that allows us
to analyze code more efficiently.
Code profiling is a technique used to describe how long, and how often, various parts of a
program are executed. The beauty of a code profiler is its ability to gather summary statistics on
individual pieces of our code without using magic commands like %timeit. We’ll focus on the
line_profiler package to profile a function’s runtime line-by-line. Since this package isn’t a part
of Python’s Standard Library, we need to install it separately. This can easily be done with a pip
install command as shown in the code below.
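pip install line_profiler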
Let's explore line_profiler with an example. Suppose we have a list of names along with each person's height (in centimeters) and weight (in kilograms) loaded as NumPy arrays.
We will then develop a function called convert_units that converts each person’s height from
centimeters to inches and weight from kilograms to pounds.
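The exact dataset is not reproduced here, so the sketch below assumes a few sample values for names, hts, and wts:

import numpy as np

names = ['Ahmed', 'Youssef', 'Mohammed']   # sample names (assumed)
hts = np.array([180.0, 175.5, 168.2])      # heights in centimeters (assumed)
wts = np.array([81.5, 72.3, 65.4])         # weights in kilograms (assumed)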
def convert_units(names, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    data = {}
    for i, name in enumerate(names):
        data[name] = (new_hts[i], new_wts[i])
    return data

convert_units(names, hts, wts)
If we wanted to get an estimated runtime of this function, we could use %timeit. But, this will
only give us the total execution time. What if we wanted to see how long each line within the
function took to run? One solution is to use %timeit on each individual line of our convert_units
function. But, that’s a lot of manual work and not very efficient.
Instead, we can profile our function with the line_profiler package. To use this package, we first
need to load it into our session. We can do this using the command %load_ext followed by
line_profiler.
%load_ext line_profiler
Now, we can use the magic command %lprun, from line_profiler, to gather runtimes for
individual lines of code within the convert_units function. %lprun uses a special syntax. First,
we use the -f flag to indicate we’d like to profile a function. Then, we specify the name of the
function we'd like to profile. Note that the name of the function is passed without any parentheses. Finally, we provide the exact function call we'd like to profile, including any arguments that are needed. This is shown in the code below:
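%lprun -f convert_units convert_units(names, hts, wts)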
The output from %lprun provides a nice table that summarizes the profiling statistics as shown
below. The first column (called Line ) specifies the line number followed by a column displaying
the number of times that line was executed (called the Hits column).
Next, the Time column shows the total amount of time each line took to execute. This column
uses a specific timer unit that can be found in the first line of the output. Here, the timer unit is
listed in 0.1 microseconds using scientific notation. We see that line two took 362 timer units, or,
roughly 36.2 microseconds to run.
The Per Hit column gives the average amount of time spent executing a single line. This is
calculated by dividing the Time column by the Hits column. Notice that line 6 was executed
three times and had a total run time of 15.4 microseconds, 5 microseconds per hit. The % Time
column shows the percentage of time spent on a line relative to the total amount of time spent in
the function. This can be a nice way to see which lines of code are taking up the most time
within a function. Finally, the source code is displayed for each line in the Line Contents column.
It is noteworthy to mention that the Total time reported when using %lprun and the time reported
from using %timeit do not match. Remember, %timeit uses multiple loops in order to calculate
an average and standard deviation of time, so the time reported from each of these magic
commands isn’t expected to match exactly.
One basic approach for inspecting memory consumption is using Python's built-in module sys. This module contains system-specific functions, including sys.getsizeof(), which returns the size of an object in bytes and gives a quick way to check the size of an individual object.
import sys
nums_list = [*range(1000)]
sys.getsizeof(nums_list)

import numpy as np
nums_np = np.array(range(1000))
sys.getsizeof(nums_np)
We can see that the memory allocation of the list is almost double that of the NumPy array. However, this method only gives us the size of an individual object. What if we wanted to inspect the line-by-line memory footprint of our code?
As with runtime profiling, we can use a code profiler. Just as we used code profiling to gather detailed stats on runtimes, we can also use code profiling to analyze the memory allocation for each line of code in our code base. We'll use the memory_profiler package, which is very similar to the line_profiler package. It can be installed via pip and comes with a handy magic command (%mprun) that uses the same syntax as %lprun.
To apply %mprun to a function and measure its memory allocation, the function must be loaded from a separate physical file rather than defined in the IPython console. So first we will create a utils_funcs.py file and define the convert_units function in it, and then we will import the function from that file and apply %mprun to it.
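A minimal sketch of that workflow in IPython, using the %%writefile cell magic to create the file:

%%writefile utils_funcs.py
def convert_units(names, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    data = {}
    for i, name in enumerate(names):
        data[name] = (new_hts[i], new_wts[i])
    return data

# then, in a new cell, import the function from the file
from utils_funcs import convert_units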
%load_ext memory_profiler
%mprun -f convert_units convert_units(names, hts, wts)
%mprun output is similar to %lprun output. In the figure below we can see a line-by-line
description of the memory consumption for the function in question. The first column represents
the line number of the code that has been profiled. The second column (Mem usage) is the
memory used after that line has been executed.
Next, the Increment column shows the difference in memory of the current line with respect to
the previous one. This shows us the impact of the current line on total memory usage. Then the Occurrences column shows the number of times each line was executed. The last column (Line
Contents) shows the source code that has been profiled.
Profiling a function with %mprun allows us to see what lines are taking up a large amount of
memory and possibly develop a more efficient solution.
It is noteworthy to mention that memory is reported in mebibytes. Although one mebibyte is not
exactly the same as one megabyte, for our purposes, we can assume they are close enough to
mean the same thing.
Another point worth noting is that the memory_profiler package inspects memory consumption by querying the operating system. This might differ slightly from the amount of memory actually used by the Python interpreter, so results may vary between platforms and even between runs. Regardless, the important information can still be observed.
In this chapter, we discussed what efficient Python code is, explored some of the most important built-in types, functions, and modules, and then covered how to measure your code's efficiency. In the next chapter, we will discuss how to optimize your code based on the efficiency measurements introduced here.
3. Optimizing Your Python Code
As a data scientist, you should spend most of your time working on gaining insights from data
not waiting for your code to finish running. Writing efficient Python code can help reduce runtime
and save computational resources, ultimately freeing you up to do the things that have more
impact.
In the previous chapter, "Defining & Measuring Code Efficiency", I discussed what efficient Python code is and how to use Python's different built-in data structures, functions, and modules to write cleaner, faster, and more efficient code. I also explored how to time and profile code in order to find bottlenecks.
In this chapter, we will practice eliminating these bottlenecks, and other bad design patterns,
using the Python libraries most used by data scientists: NumPy and pandas.
Combining Objects
In this subsection, we’ll cover combining, counting, and iterating over objects efficiently in
Python. Suppose we have two lists: one with names and the other with the age of each person.
We want to combine these lists so that each name is stored next to its age. We can iterate over
the names list using enumerate and grab each name's corresponding age using the index
variable.
# Combining objects
names = ['Ahmed', 'Youssef', 'Mohammed']
age = [25, 27, 40]

combined = []
for i, name in enumerate(names):
    combined.append((name, age[i]))
print(combined)  # [('Ahmed', 25), ('Youssef', 27), ('Mohammed', 40)]
But Python’s built-in function zip provides a more elegant solution. The name “zip” describes
how this function combines objects like a zipper on a jacket (making two separate things
become one). zip returns a zip object that must be unpacked into a list and printed to see the
contents. Each item is a tuple of elements from the original lists.
combined_zip = zip(names, age)
combined_zip_list = [*combined_zip]
print(combined_zip_list)  # [('Ahmed', 25), ('Youssef', 27), ('Mohammed', 40)]
Python also comes with a number of efficient built-in modules. The collections module contains specialized datatypes that can be used as alternatives to standard dictionaries, lists, sets, and tuples. A few notable specialized datatypes are namedtuple, deque, Counter, OrderedDict, and defaultdict.
Counting Objects
Let's dig deeper into the Counter object. First, we will load the Pokémon dataset and print its first five rows, then we will count the number of Pokémon of each type, first using a for loop and then using Counter.
● Let's load the Pokemon dataset and print the first five rows:
import pandas as pd
pokemon = pd.read_csv('pokemon.csv')
pokemon.head()
● Now we will count the number of Pokémon of each type using a loop and calculate the execution time:
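A sketch of that loop-based count, assuming the primary type is stored in a 'Type 1' column:

%%timeit
type_counts = {}
for poke_type in pokemon['Type 1']:   # 'Type 1' column name is an assumption
    if poke_type not in type_counts:
        type_counts[poke_type] = 1
    else:
        type_counts[poke_type] += 1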
● Finally, we will count the number of Pokémon of each type using Counter and compare the time:
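And the Counter version, under the same column-name assumption:

from collections import Counter

%timeit type_counts = Counter(pokemon['Type 1'])
print(Counter(pokemon['Type 1']))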
We can see that using Counter from the collections module is the more efficient approach: just import Counter and provide the object to be counted. No need for a loop! Counter returns a Counter dictionary of key-value pairs which, when printed, is ordered from highest to lowest count. Comparing runtimes, we'd see that using Counter takes less time than the standard dictionary approach!
Itertools
Another built-in module, itertools, contains functional tools for working with iterators. A subset of these tools includes the infinite iterators count, cycle, and repeat; the finite iterators accumulate, chain, and zip_longest; and the combination generators product, permutations, and combinations.
Combinations
Suppose we want to gather all combination pairs of Pokémon types possible. We can do this
with a nested for loop that iterates over the poke_types list twice. Notice that a conditional
statement is used to skip pairs having the same type twice.
For example, if x is ‘Bug’ and y is ‘Bug’, we want to skip this pair. Since we’re interested in
combinations (where order doesn’t matter), another statement is used to ensure either order of
the pair doesn’t already exist within the combos list before appending it. For example, the pair
(‘Bug’, ‘Fire’) is the same as the pair (‘Fire’, ‘Bug’). We want one of these pairs, not both.
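A sketch of that nested-loop approach:

poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']
combos = []
for x in poke_types:
    for y in poke_types:
        if x == y:
            continue
        if ((x, y) not in combos) and ((y, x) not in combos):
            combos.append((x, y))
print(combos)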
The combinations generator from itertools provides a more efficient solution. First, we import
combinations and then create a combinations object by providing the poke_types list and the
length of combinations we desire. Combinations return a combinations object, which we unpack
into a list and print to see the result. If comparing runtimes, we’d see using combinations is
significantly faster than the nested loop.
# combinations with itertools
poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']
from itertools import combinations
combos_obj = combinations(poke_types, 2)
print(type(combos_obj))
combos = [*combos_obj]
print(combos)
When we’d like to compare objects multiple times and in different ways, we should consider
storing our data in sets to use these efficient methods. Another nice feature of Python sets is
their ability to quickly check if a value exists within its members. We call this membership testing
using the in operator. We will go through how using the in operator with a set is much faster than
using it with a list or tuple.
Suppose we had two lists of Pokémon names: list_a and list_b and we would like to compare
these lists to see which Pokémon appear in both lists. We could first use a nested for loop to
compare each item in list_a to each item in list_b and collect only those items that appear in
both lists. But, iterating over each item in both lists is extremely inefficient.
list_a = ['Bulbasaur', 'Charmander', 'Squirtle']  # sample data
list_b = ['Caterpie', 'Pidgey', 'Squirtle']       # sample data

in_common = []
for pokemon_a in list_a:
    for pokemon_b in list_b:
        if pokemon_a == pokemon_b:
            in_common.append(pokemon_a)
print(in_common)
However, a better way is to use Python’s set data type to compare these lists. By converting
each list into a set, we can use the dot-intersection method to collect the Pokémon shared
between the two sets. One simple line of code and no need for a loop!
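A minimal sketch of the set-based solution:

set_a = set(list_a)
set_b = set(list_b)
in_common = set_a.intersection(set_b)
print(in_common)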
We get the same answer with far fewer lines of code. We can also compare the runtimes to see how much faster sets are than loops.
%%timeit
in_common = []
for pokemon_a in list_a:
    for pokemon_b in list_b:
        if pokemon_a == pokemon_b:
            in_common.append(pokemon_a)
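And the set-based version for comparison:

%timeit in_common = set_a.intersection(set_b)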
We can see that using sets is much faster than using for loops. We can also use a set method
to see Pokémon that exist in one set but not in another. To gather Pokémon that exist in set_a
but not in set_b, use set_a.difference(set_b).
If we want the Pokémon in set_b, but not in set_a, we use set_b.difference(set_a).
To collect Pokémon that exist in exactly one of the sets (but not both), we can use a method
called the symmetric difference.
Finally, we can combine these sets using the .union method. This collects all of the unique
Pokémon that appear in either or both sets.
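Continuing with set_a and set_b from above:

print(set_a.difference(set_b))              # in set_a but not in set_b
print(set_b.difference(set_a))              # in set_b but not in set_a
print(set_a.symmetric_difference(set_b))    # in exactly one of the sets
print(set_a.union(set_b))                   # unique Pokémon in either or both sets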
Another nice efficiency gain when using sets is the ability to quickly check if a specific item is a
member of a set’s elements. Consider our collection of 720 Pokémon names stored as a list,
tuple, and set.
names_list = list(pokemon['Name'])
names_set = set(pokemon['Name'])
names_tuple = tuple(pokemon['Name'])
We want to check whether or not the Pokémon Zubat is in each of these data structures and print the execution time for each data type:
%timeit 'Zubat' in names_list
%timeit 'Zubat' in names_tuple
%timeit 'Zubat' in names_set
When comparing runtimes, it’s clear that membership testing with a set is significantly faster
than a list or a tuple.
One final efficiency gain when using sets comes from the definition of the set itself. A set is
defined as a collection of distinct elements. Thus, we can use a set to collect unique items from
an existing object. Let’s define a primary_types list, which contains the primary types of each
Pokémon. If we wanted to collect the unique Pokémon types within this list, we could write a for
loop to iterate over the list and only append the Pokémon types that haven’t already been added
to the unique_types list.
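A sketch of that loop, with an assumed primary_types list:

primary_types = ['Grass', 'Fire', 'Water', 'Fire', 'Grass', 'Bug']  # sample data (assumed)

unique_types = []
for prim_type in primary_types:
    if prim_type not in unique_types:
        unique_types.append(prim_type)
print(unique_types)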
Using a set makes this much easier. All we have to do is convert the primary_types list into a
set, and we have our solution: a set of distinct Pokémon types.
unique_types_set = set(primary_types)
print(unique_types_set)
Python supports several looping patterns:
● For loops iterate over the elements of a sequence one at a time.
● While loops execute a loop repeatedly as long as some Boolean condition is met.
● Nested loops use multiple loops inside one another.
Although all of these looping patterns are supported by Python, we should be careful when
using them. Because most loops are evaluated in a piece-by-piece manner, they are often
inefficient solutions.
We should try to avoid looping as much as possible when writing efficient code. Eliminating
loops usually results in fewer lines of code that are easier to interpret. One of the idioms of
pythonic code is that “flat is better than nested.” Striving to eliminate loops in our code will help
us follow this idiom.
Suppose we have a list of lists, called poke_stats, that contains statistical values for each
Pokémon. Each row corresponds to a Pokémon, and each column corresponds to a Pokémon’s
specific statistical value. Here, the columns represent a Pokémon’s Health Points, Attack,
Defense, and Speed. We want to do a simple sum of each of these rows in order to collect the
total stats for each Pokémon. If we were to use a loop to calculate row sums, we would have to
iterate over each row and append the row’s sum to the totals list. We can accomplish the same
task, in fewer lines of code, with a list comprehension. Or, we could use the built-in map function
that we’ve discussed previously in the previous chapter.
Each of these approaches will return the same list, but using a list comprehension or the map
function takes one line of code, and has a faster runtime.
We’ve also covered a few built-in modules that can help us eliminate loops in the previous
article. Instead of using the nested for loop, we can use combinations from the itertools module
for a cleaner, more efficient solution.
Another powerful technique for eliminating loops is to use the NumPy package. Suppose we
had the same collection of statistics we used in a previous example but stored in a NumPy array
instead of a list of lists.
We’d like to collect the average stat value for each Pokémon (or row) in our array. We could use
a loop to iterate over the array and collect the row averages.
%%timeit
avgs = []
for row in poke_stats:
    avg = np.mean(row)
    avgs.append(avg)
But, NumPy arrays allow us to perform calculations on entire arrays all at once. Here, we use the .mean() method and specify axis=1 to calculate the mean for each row (meaning we calculate an average across the column values). This eliminates the need for a loop and is much more efficient.
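Assuming poke_stats is now a NumPy array, as described above, the vectorized version is a single line:

%timeit avgs = poke_stats.mean(axis=1)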
When comparing runtimes, we see that using the .mean() method on the entire array and specifying an axis is significantly faster than using a loop.
The best way to make a loop more efficient is to analyze what’s being done within the loop. We
want to make sure that we aren’t doing unnecessary work in each iteration. If a calculation is
performed for each iteration of a loop, but its value doesn’t change with each iteration, it’s best
to move this calculation outside (or above) the loop. If a loop is converting data types with each
iteration, it’s possible that this conversion can be done outside (or below) the loop using a map
function. Anything that can be done once should be moved outside of a loop. Let’s explore a few
examples.
We have a list of Pokémon names and an array of each Pokémon’s corresponding attack value.
We’d like to print the names of each Pokémon with an attack value greater than the average of
all attack values. To do this, we’ll use a loop that iterates over each Pokémon and its attack
value. For each iteration, the total attack average is calculated by finding the mean value of all
attacks. Then, each Pokémon’s attack value is evaluated to see if it exceeds the total average.
import numpy as np

# example data (values are illustrative)
names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])

for pokemon, attack in zip(names, attacks):
    total_attack_avg = attacks.mean()  # recalculated on every iteration
    if attack > total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon, attack, total_attack_avg)
        )
The inefficiency in this loop is the total_attack_avg variable being created with each iteration of
the loop. But, this calculation doesn’t change between iterations since it is an overall average.
We only need to calculate this value once. By moving this calculation outside (or above) the
loop, we calculate the total attack average only once. We get the same output, but this is a more
efficient approach.
import numpy as np

# calculate the total attack average only once, outside the loop
total_attack_avg = attacks.mean()
for pokemon, attack in zip(names, attacks):
    if attack > total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon, attack, total_attack_avg)
        )
Timing both versions with %%timeit, we see that keeping the total_attack_avg calculation within the loop takes more than 120 microseconds, noticeably longer than calculating it once outside the loop.
Holistic Conversions
Another way to make loops more efficient is to use holistic conversions outside (or below) the
loop. In the example below we have three lists from the 720 Pokémon dataset: a list of each
Pokémon’s name, a list corresponding to whether or not a Pokémon has a legendary status,
and a list of each Pokémon’s generation.
We want to combine these objects so that each name, status, and generation is stored in an
individual list. To do this, we’ll use a loop that iterates over the output of the zip function.
Remember, zip returns a collection of tuples, so we need to convert each tuple into a list since
we want to create a list of lists as our output. Then, we append each individual poke_list to our
poke_data output variable. By printing the result, we see our desired list of lists.
%%timeit
poke_data = []
for poke_tuple in zip(names_list, legend_status_list, generations_list):
    poke_list = list(poke_tuple)
    poke_data.append(poke_list)
However, converting each tuple to a list within the loop is not very efficient. Instead, we should
collect all of our poke_tuples together, and use the map function to convert each tuple to a list.
The loop no longer converts tuples to lists with each iteration. Instead, we moved this tuple to
list conversion outside (or below) the loop. That way, we convert data types all at once (or
holistically) rather than converting in each iteration.
%%timeit
poke_data_tuples = []
for poke_tuple in zip(names_list, legend_status_list, generations_list):
    poke_data_tuples.append(poke_tuple)

poke_data = [*map(list, poke_data_tuples)]
Runtimes show that converting each tuple to a list outside of the loop is more efficient.
4. 5 Tips to Write Efficient Python Functions
Writing efficient and maintainable Python functions is crucial for producing high-quality code.
This chapter presents 5 essential tips to help refine your Python functions and improve code
readability, maintainability, and robustness.
First, adhering to the single responsibility principle by ensuring each function performs only one
task is emphasized. Next, the benefits of type hints for enhancing code clarity and long-term
maintainability are discussed.
The chapter then explores the use of keyword-only arguments, a Python feature that can
minimize errors by enforcing the explicit use of argument names. Another recommendation is to
utilize only the arguments necessary for a function, reducing complexity and potential bugs.
Finally, the chapter advocates for the use of generators, a memory-efficient technique for
returning iterable data, instead of constructing and returning entire lists.
By implementing these 5 tips, Python developers can write more efficient, readable, and
maintainable functions, ultimately leading to higher-quality code and improved developer
productivity.
● Readability: When a function does only one thing, it’s easier to understand at a glance.
The function name can clearly describe its purpose, and the implementation is
straightforward.
● Reusability: Single-purpose functions can be reused in different parts of the program or
in other projects.
● Testability: It’s easier to write tests for a function that does one thing, and such tests are
more likely to be reliable.
● Maintainability: If a function is responsible for one task, changes in requirements
affecting that task will be localized, reducing the risk of bugs elsewhere in the code.
Let's say you're working on a Python program that processes a list of numbers, in which it will: filter out the negative numbers, square the remaining numbers, and sum the results.
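Before the refactor, the whole pipeline might live in a single do-everything function. The sketch below is an assumed starting point, not code from the original text:

def process_numbers(numbers):
    """Filter, square, and sum the numbers, all in one function."""
    positive_numbers = [num for num in numbers if num >= 0]
    squared_numbers = [num ** 2 for num in positive_numbers]
    return sum(squared_numbers)

Applying the single responsibility principle, we can split this into small, focused functions: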
def filter_negative_numbers(numbers):
    """Filter out negative numbers from the list."""
    return [num for num in numbers if num >= 0]

def square_numbers(numbers):
    """Return a list of squared numbers."""
    return [num ** 2 for num in numbers]

def sum_numbers(numbers):
    """Return the sum of the numbers."""
    return sum(numbers)

def process_numbers(numbers):
    """Process the list of numbers: filter, square, and sum."""
    positive_numbers = filter_negative_numbers(numbers)
    squared_numbers = square_numbers(positive_numbers)
    total = sum_numbers(squared_numbers)
    return total

# Example usage
numbers = [-2, -1, 0, 1, 2, 3]
result = process_numbers(numbers)
print(result)  # Output: 14
2. Error Detection
● Without Type Hints: Type-related errors may only be caught at runtime,
potentially causing bugs that are hard to trace.
● With Type Hints: Tools like mypy can check the types at compile time, catching
errors before the code is executed.
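As a small illustration (the function and its names are assumptions, not from the original text), adding type hints turns an ambiguous signature into a self-documenting one, and a checker such as mypy can flag misuse before runtime:

# Python 3.9+ syntax for built-in generics
def calculate_total(prices: list[float], tax_rate: float) -> float:
    """Return the total cost including tax."""
    return sum(prices) * (1 + tax_rate)

calculate_total([10.0, 5.5], 0.2)          # OK
# calculate_total(["10.0", "5.5"], 0.2)    # mypy flags this as an incompatible argument type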
This is done using a special syntax in the function definition that prevents these parameters
from being passed positionally. This approach can significantly improve code clarity and reduce
errors.
Enforcing keyword-only arguments in Python functions can significantly enhance the clarity and
correctness of your code. Keyword-only arguments are parameters that can only be specified
using their parameter names. This enforcement helps to:
● Prevent Errors: By requiring arguments to be passed by name, it reduces the risk of
passing arguments in the wrong order, which can lead to subtle bugs.
● Improve Readability: It makes function calls more readable by clearly indicating what
each argument represents.
● Enhance Flexibility: It allows you to add more parameters to functions in the future
without breaking existing code, as the arguments are explicitly named.
● Increase Clarity: It makes the intention of the code clearer, as the purpose of each argument is specified at the call site.
Consider a send_email function whose optional cc parameter is an ordinary positional argument (this definition is reconstructed from the keyword-only version shown next):

def send_email(to: str, subject: str, body: str, cc: str = None):
    print(f"Sending email to {to}...")
    print(f"Subject: {subject}")
    print(f"Body: {body}")
    if cc:
        print(f"CC: {cc}")

# Example usage
send_email("recipient@example.com", "Meeting Update",
           "The meeting is rescheduled to 3 PM.", "cc@example.com")
Say you want to make the optional cc argument a keyword-only argument. Here’s how you can
do it:
# enforce keyword-only arguments to minimize errors
# make the optional `cc` argument keyword-only
def send_email(to: str, subject: str, body: str, *, cc: str = None):
    print(f"Sending email to {to}...")
    print(f"Subject: {subject}")
    print(f"Body: {body}")
    if cc:
        print(f"CC: {cc}")

# Example usage
send_email("recipient@example.com", "Meeting Update",
           "The meeting is rescheduled to 3 PM.", cc="cc@example.com")
# throws an error: we try to pass more positional arguments than allowed!
send_email("recipient@example.com", "Meeting Update",
           "The meeting is rescheduled to 3 PM.", "cc@example.com")
You’ll get an error as shown:
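The exact traceback is not reproduced in this text, but the failure comes from Python's standard positional-argument check and looks something like:

TypeError: send_email() takes 3 positional arguments but 4 were given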
● Improved Readability: Fewer arguments make function signatures simpler and easier
to understand.
● Enhanced Maintainability: Functions with fewer parameters are easier to refactor and
test.
● Reduced Errors: The likelihood of passing incorrect or redundant data is minimized
when functions only take essential arguments.
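The original definition of process_order is not shown in this text; a sketch of what it might look like (the amount and discount parameter names are assumptions) is:

def process_order(order_id: int, customer_id: int, customer_name: str,
                  amount: float, discount: float):
    """Process an order using several pieces of customer data."""
    total = amount - discount
    print(f"Order {order_id} for {customer_name} (customer {customer_id}): ${total}")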
# Example usage
process_order(1234, 5678, "John Doe", 100.0, 10.0)
In this example, the function process_order takes both customer_id and customer_name,
which might not be necessary for processing an order if all required information can be derived
from the order_id.
Now, let’s refactor the function to use only the essential arguments:
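A sketch of the refactored version, keeping only the essential arguments (same assumed parameter names as above):

def process_order(order_id: int, amount: float, discount: float):
    """Process an order using only the essential arguments."""
    total = amount - discount
    print(f"Order {order_id}: ${total}")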
# Example usage
process_order(1234, 100.0, 10.0)
● Memory Efficiency: Generators produce items one at a time and only when required,
which means they don’t need to store the entire sequence in memory. This is especially
useful for large datasets.
● Performance: Since generators yield items on the fly, they can provide a performance
boost by avoiding the overhead of creating and storing large data structures.
● Lazy Evaluation: Generators compute values as needed, which can lead to more
efficient and responsive programs.
● Simpler Code: Generators can simplify the code needed to create iterators, making it
easier to read and maintain.
def get_squares(n):
    """Return a list of the squares of 0 to n-1."""
    squares = []
    for i in range(n):
        squares.append(i ** 2)
    return squares

# Example usage
squares_list = get_squares(10)
print(squares_list)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In this example, the function get_squares generates and stores all square numbers up to n-1 in
a list before returning it. Now, let’s modify the function to use a generator:
def get_squares_gen(n):
    """Yield the squares of 0 to n-1, one at a time."""
    for i in range(n):
        yield i ** 2

# Example usage
squares_gen = get_squares_gen(10)
print(list(squares_gen))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Using generators to return lists in Python provides significant advantages in terms of memory
efficiency, performance, and lazy evaluation. By generating values on the fly, you can handle
large datasets more effectively and write cleaner, more maintainable code.
The comparison between the two approaches clearly shows the benefits of using generators,
particularly for large or computationally intensive sequences.
By caching data, developers can drastically reduce these delays, resulting in faster load times
and happier users. This principle applies to web scraping as well, where large-scale projects
can see significant speed improvements.
But what exactly is caching, and how can it be implemented? This chapter will explore caching,
its purpose and benefits, and how to leverage it to speed up your Python code and also speed
up your LLM calls at a lower cost.
A cache is a fast storage space (usually temporary) where frequently accessed data is kept to
speed up the system’s performance and decrease the access times. For example, a computer’s
cache is a small but fast memory chip (usually an SRAM) between the CPU and the main
memory chip (usually a DRAM).
The CPU first checks the cache when it needs to access the data. If it’s in the cache, a cache hit
occurs, and the data is thereby read from the cache instead of a relatively slower main memory.
It results in reduced access times and enhanced performance.
● Reduced access time: The primary goal of caching is to accelerate access to frequently
used data. By storing this data in a temporary, easily accessible storage area, caching
dramatically decreases access time. This leads to a notable improvement in the overall
performance of applications and systems.
● Reduced system load: Caching also alleviates system load by minimizing the number
of requests sent to external data sources, such as databases. By storing frequently
accessed data in cache storage, applications can retrieve this data directly from the
cache instead of repeatedly querying the data source. This reduces the load on the
external data source and enhances system performance.
● Improved user experience: Caching ensures rapid data retrieval, enabling more
seamless interactions with applications and systems. This is especially crucial for
real-time systems and web applications, where users expect instant responses.
Consequently, caching plays a vital role in enhancing the overall user experience.
5.3. Common Uses for Caching
Caching is a general concept and has several prominent use cases. You can apply it in any
scenario where data access has some patterns and you can predict what data will be
demanded next. You can prefetch the demanded data in the cache store and improve
application performance.
● Web Content: Frequently accessed web pages, images, and other static content are
often cached to reduce load times and server requests.
● Database Queries: Caching the results of common database queries can drastically
reduce the load on the database and speed up application responses.
● API Responses: External API call responses are cached to avoid repeated network
requests and to provide faster data access.
● Session Data: User session data is cached to quickly retrieve user-specific information
without querying the database each time.
● Machine Learning Models: Intermediate results and frequently used datasets are
cached to speed up machine learning workflows and inference times.
● Configuration Settings: Application configuration data is cached to avoid repeated
reading from slower storage systems.
● Cache-Aside (Lazy Loading): Data is loaded into the cache only when it is requested.
If the data is not found in the cache (a cache miss), it is fetched from the source, stored
in the cache, and then returned to the requester.
● Write-Through: Every time data is written to the database, it is simultaneously written to
the cache. This ensures that the cache always has the most up-to-date data but may
introduce additional write latency.
● Write-Back (Write-Behind): Data is written to the cache and acknowledged to the
requester immediately, with the cache asynchronously writing the data to the database.
This improves write performance but risks data loss if the cache fails before the write to
the database completes.
● Read-Through: The application interacts only with the cache, and the cache is
responsible for loading data from the source if it is not already cached.
● Time-to-Live (TTL): Cached data is assigned an expiration time, after which it is
invalidated and removed from the cache. This helps to ensure that stale data is not used
indefinitely.
● Cache Eviction Policies: Strategies to determine which data to remove from the cache
when it reaches its storage limit. Common policies include:
● Last-In, First-Out (LIFO): The most recently added data is the first to be removed when
the cache needs to free up space. This strategy assumes that the oldest data will most
likely be required again soon.
● Least Recently Used (LRU): The least recently accessed data is the first to be
removed. This strategy works well when the most recently accessed data is more likely
to be reaccessed.
● Most Recently Used (MRU): The most recently accessed data is the first to be
removed. This can be useful in scenarios where the most recent data is likely to be used
only once and not needed again.
● Least Frequently Used (LFU): The data that is accessed the least number of times is
the first to be removed. This strategy helps in keeping the most frequently accessed data
in the cache longer.
One common use case for decorators is to implement caching: the decorator creates a dictionary that stores the function's results so that subsequent calls with the same arguments can be served from the cache.
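A minimal sketch of such a memoizing decorator (the name memoize is an assumption for illustration):

from functools import wraps

def memoize(func):
    """Cache results in a dictionary keyed by the function's positional arguments."""
    results = {}

    @wraps(func)
    def wrapper(*args):
        if args not in results:
            results[args] = func(*args)
        return results[args]

    return wrapper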
Let’s code a function that computes the n-th Fibonacci number. Here’s the recursive
implementation of the Fibonacci sequence:
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
Without caching, the recursive calls result in redundant computations. If the values are cached,
it’d be much more efficient to look up the cached values. For this, you can use the @cache
decorator.
The @cache decorator from the functools module in Python 3.9+ is used to cache the results of
a function. It works by storing the results of expensive function calls and reusing them when the
function is called with the same arguments. Now let’s wrap the function with the @cache
decorator:
from functools import cache

@cache
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
In LRU caching, when the cache is full and a new item needs to be added, the least recently
used item in the cache is removed to make room for the new item. This ensures that the most
frequently used items are retained in the cache, while less frequently used items are discarded.
The @lru_cache decorator is similar to @cache but allows you to specify the maximum
size—as the maxsize argument—of the cache. Once the cache reaches this size, the least
recently used items are discarded. This is useful if you want to limit memory usage.
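For example, here is the same Fibonacci function wrapped with @lru_cache and an explicit maxsize, a small sketch matching the description that follows:

from functools import lru_cache

@lru_cache(maxsize=7)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)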
Here, the fibonacci function is decorated with @lru_cache(maxsize=7), specifying that it should
cache up to 7 most recent results. When fibonacci(5) is called, the results for fibonacci(4),
fibonacci(3), and fibonacci(2) are cached.
When fibonacci(3) is called subsequently, fibonacci(3) is retrieved from the cache since it was
one of the seven most recently computed values, avoiding redundant computation.
5.7. Function Calls Timing Comparison
Now let’s compare the execution times of the functions with and without caching. For this
example, we don’t set an explicit value for maxsize. So maxsize will be set to the default value
of 128:
from functools import cache, lru_cache

# without caching
def fibonacci_no_cache(n):
    if n <= 1:
        return n
    return fibonacci_no_cache(n-1) + fibonacci_no_cache(n-2)

# with cache
@cache
def fibonacci_cache(n):
    if n <= 1:
        return n
    return fibonacci_cache(n-1) + fibonacci_cache(n-2)

# with LRU cache
@lru_cache
def fibonacci_lru_cache(n):
    if n <= 1:
        return n
    return fibonacci_lru_cache(n-1) + fibonacci_lru_cache(n-2)
To compare the execution times, we’ll use the timeit function from the timeit module:
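import timeit

# A reconstruction sketch: the value of n (30) and the number of repetitions (1)
# are assumptions, since the original timing code is not shown.
n = 30
print("Time without cache: {:.6f} seconds".format(
    timeit.timeit(lambda: fibonacci_no_cache(n), number=1)))
print("Time with cache: {:.6f} seconds".format(
    timeit.timeit(lambda: fibonacci_cache(n), number=1)))
print("Time with LRU cache: {:.6f} seconds".format(
    timeit.timeit(lambda: fibonacci_lru_cache(n), number=1)))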
Time with LRU cache: 0.000020 seconds
We see a significant difference in the execution times. The function call without caching takes much longer to execute, especially for larger values of n, while the cached versions (both @cache and @lru_cache) execute much faster and have comparable execution times.
To tackle this challenge, you can use the GPTCache package, which is dedicated to building a semantic cache for storing LLM responses. GPTCache first performs an embedding operation on the input to obtain a vector, and then conducts an approximate vector search in the cache storage. After receiving the search results, it performs a similarity evaluation and returns the cached response when the similarity reaches the set threshold.
To use it, we start by initializing the cache to run GPTCache and importing openai from gptcache.adapter, which will automatically set the map data manager to match the exact cache. Note that this example requires version 0.28.1 of the openai package.
After that, if you ask ChatGPT the exact same question twice, the answer to the second request will be retrieved from the cache without calling ChatGPT again.
import time

from gptcache import cache
from gptcache.adapter import openai

def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")
cache.init()
cache.set_openai_key()
# -------------------------------------------------
question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
Cache loading…..
Question: what’s github
Time consuming: 6.04s
Answer: GitHub is an online platform where developers can share and collaborate on
software development projects. It is used as a hub for code repositories and includes
features such as issue tracking, code review, and project management tools. GitHub can
be used for open source projects, as well as for private projects within organizations.
GitHub has become an essential tool within the software development industry and has
over 40 million registered users as of 2021.
We can see that the first question took 6.04 seconds, which is a typical response time for an LLM. If we ask the same question again, the answer comes back almost instantly because it is served from the cache, which also reduces the API cost.
6. Best Practices To Use Pandas Efficiently As A
Data Scientist
As a data scientist, it is important to use the right tools and techniques to get the most out of the
data. The Pandas library is a great tool for data manipulation, analysis, and visualization, and it
is an essential part of any data scientist’s toolkit.
However, it can be challenging to use Pandas efficiently, and this can lead to wasted time and
effort. Fortunately, there are a few best practices that can help data scientists get the most out
of their Pandas experience.
From using vectorized operations to taking advantage of built-in functions, these best practices
will help data scientists quickly and accurately analyze and visualize data using Pandas.
Understanding and applying these best practices will help data scientists increase their
productivity and accuracy, allowing them to make better decisions faster.
The first dataset is the Poker card game dataset which is shown below.
poker_data = pd.read_csv('poker_hand.csv')
poker_data.head()
In each poker round, each player has five cards in hand, each one characterized by its symbol,
which can be either hearts, diamonds, clubs, or spades, and its rank, which ranges from 1 to 13.
The dataset consists of every possible combination of five cards one person can possess.
● Sn: symbol of the n-th card where: 1 (Hearts), 2 (Diamonds), 3 (Clubs), 4 (Spades)
● Rn: rank of the n-th card where: 1 (Ace), 2–10, 11 (Jack), 12 (Queen), 13 (King)
The second dataset we will work with is the Popular baby names dataset which includes the
most popular names that were given to newborns between 2011 and 2016. The dataset is
loaded and shown below:
names = pd.read_csv('Popular_Baby_Names.csv')
names.head()
The dataset includes among other information, the most popular names in the US by year,
gender, and ethnicity. For example, the name Chloe was ranked second in popularity among all
female newborns of Asian and Pacific Islander ethnicity in 2011. The third dataset we will use is
the Restaurant dataset. This dataset is a collection of people having dinner at a restaurant. The
dataset is loaded and shown below:
restaurant = pd.read_csv('restaurant_data.csv')
restaurant.head()
For each customer, we have various characteristics, including the total amount paid, the tips left
to the waiter, the day of the week, and the time of the day.
import time
# record time before execution
start_time = time.time()
# execute operation
result = 5 + 2
# record time after execution
end_time = time.time()
print("Result calculated in {} sec".format(end_time - start_time))
Let's see some examples of how applying efficient coding methods improves runtime and reduces computational complexity. We will calculate the square of each number from zero up to a million, first using a list comprehension and then repeating the same procedure with a for-loop.
list_comp_start_time = time.time()
result = [i*i for i in range(0,1000000)]
list_comp_end_time = time.time()
print("Time using the list_comprehension: {} sec".format(list_comp_end_time -
                                                         list_comp_start_time))
for_loop_start_time = time.time()
result = []
for i in range(0,1000000):
    result.append(i*i)
for_loop_end_time = time.time()
print("Time using the for loop: {} sec".format(for_loop_end_time -
                                               for_loop_start_time))
We can see that there is a big difference between them. We can also express this difference as a percentage:
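# One way to express the difference as a percentage, reusing the timing
# variables measured above (the exact formula used in the original is not shown).
list_comp_time = list_comp_end_time - list_comp_start_time
for_loop_time = for_loop_end_time - for_loop_start_time
print("The for loop took {:.0f}% more time than the list comprehension".format(
    (for_loop_time - list_comp_time) / list_comp_time * 100))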
Here is another example to show the effect of writing efficient code. We would like to calculate the sum of all consecutive numbers from 1 to 1 million. There are two ways: the first is brute force, in which we add the numbers one by one up to a million.
def sum_brute_force(N):
    res = 0
    for i in range(1, N+1):
        res += i
    return res
Another, more efficient method is to use a formula. When we want to calculate the sum of all the integers from 1 up to a number N, we can multiply N by N+1 and then divide by 2, and this will give us the result we want. This problem was famously given to a class of schoolchildren in Germany, and a bright student called Carl Friedrich Gauss devised this formula to solve it in seconds.
def sum_formula(N):
    return N * (N + 1) / 2
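To compare the two approaches, we can time both functions. Here is a minimal sketch; the original timing code is not shown, and N = 1,000,000 is an assumption:

N = 1000000

start_time = time.time()
sum_brute_force(N)
print("Time using brute force: {} sec".format(time.time() - start_time))

start_time = time.time()
sum_formula(N)
print("Time using the formula: {} sec".format(time.time() - start_time))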
After running both methods, we achieved a massive improvement with a magnitude of over
160,000%, which clearly demonstrates why we need efficient and optimized code, even for
simple tasks.
6.2.1. Selecting Rows & Columns Efficiently using .iloc[] & .loc[]
In this subsection, we will introduce how to locate and select rows efficiently from DataFrames using the .iloc[] and .loc[] pandas indexers. We use .iloc[] to select by integer position and .loc[] to select by index label.
In the example below, we will select the first 500 rows of the poker dataset, first using the .loc[] function and then using the .iloc[] function.
# Specify the range of rows to select
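# A reconstruction sketch: timing the selection of the first 500 rows with
# .loc[] and then with .iloc[] (the exact code is not shown in the original).
rows = list(range(0, 500))

loc_start_time = time.time()
poker_data.loc[rows]
loc_end_time = time.time()
print("Time using .loc[]: {} sec".format(loc_end_time - loc_start_time))

iloc_start_time = time.time()
poker_data.iloc[rows]
iloc_end_time = time.time()
print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))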
While these two methods have the same syntax, .iloc[] performs almost 70% faster than .loc[]. The .iloc[] function takes advantage of the order of the indices, which are already sorted, and is therefore faster. We can also use them to select columns, not only rows. In the next example, we will select the first three columns using both methods.
iloc_start_time = time.time()
poker_data.iloc[:,:3]
iloc_end_time = time.time()
print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))
names_start_time = time.time()
poker_data[['S1', 'R1', 'S2']]
names_end_time = time.time()
print("Time using selection by name: {} sec".format(names_end_time -
names_start_time))
We can also see that column indexing using .iloc[] is still about 80% faster. So it is better to use .iloc[], unless it is easier to use .loc[] to select certain columns by name.
Let's have a closer look at the Gender feature and see the unique values it has:
names['Gender'].unique()
We can see that the female gender is represented by two values, one uppercase and one lowercase. This is very common in real data, and an easy way to fix it is to replace one of the values with the other so the column is consistent throughout the whole dataset. There are two ways to do it: the first one is simply defining which values we want to replace, and then what we want to replace them with. This is shown in the code below:
start_time = time.time()
names['Gender'].loc[names.Gender=='female'] = 'FEMALE'
end_time = time.time()
The second method is to use the pandas built-in function .replace() as shown in the code below:
start_time = time.time()
names['Gender'].replace('female', 'FEMALE', inplace=True)
end_time = time.time()
replace_time = end_time - start_time
We can see that there is a clear difference in speed: the built-in function is 157% faster than using the .loc[] method to find the row and column indices of the values and replace them.
We can also replace multiple values using lists. Our objective is to change all ethnicities classified as WHITE NON HISPANIC or WHITE NON HISP to WNH. Using the .loc[] function, we will locate babies of the ethnicities we are looking for, using an OR condition (written in pandas boolean indexing with the pipe operator |). We will then assign the new value. As always, we also
measure the CPU time needed for this operation.
start_time = time.time()
names['Ethnicity'].loc[(names['Ethnicity'] == 'WHITE NON HISPANIC') |
                       (names['Ethnicity'] == 'WHITE NON HISP')] = 'WNH'
end_time = time.time()
pandas_time = end_time - start_time
print("Results from the above operation calculated in %s seconds"
%(pandas_time))
We can also do the same operation using the pandas built-in .replace() function, as follows:
start_time = time.time()
names['Ethnicity'].replace(['WHITE NON HISPANIC','WHITE NON HISP'],
'WNH', inplace=True)
end_time = time.time()
replace_time = end_time - start_time
We can see that, again, using the .replace() method is much faster than using the .loc[] method. To get a better intuition of how much faster it is, let's run the code below:
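# One way to express the speedup as a percentage, reusing the timing variables
# measured above (the exact formula used in the original is not shown).
print("The .replace() method is {:.0f}% faster than the .loc[] method".format(
    (pandas_time - replace_time) / replace_time * 100))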
The .replace() method is 87% faster than using the .loc[] method. If your data is huge and needs a lot of cleaning, this tip will decrease the computational time of your data cleaning and make your pandas code much faster and hence more efficient.
Finally, we can also use dictionaries to replace both single and multiple values in your DataFrame. This is very helpful if you would like to perform multiple replacements in one command. We're going to use dictionaries to replace every male's gender with BOY and every female's
gender with GIRL.
names = pd.read_csv('Popular_Baby_Names.csv')
start_time = time.time()
names['Gender'].replace({'MALE':'BOY', 'FEMALE':'GIRL', 'female': 'girl'},
inplace=True)
end_time = time.time()
dict_time = end_time - start_time
print("Time using .replace() with dictionary: {} sec".format(dict_time))
names = pd.read_csv('Popular_Baby_Names.csv')
start_time = time.time()
names['Gender'].replace(['MALE', 'FEMALE', 'female'], ['BOY', 'GIRL', 'girl'],
                        inplace=True)
end_time = time.time()
list_time = end_time - start_time
We could do the same thing with lists, but it's more verbose. If we compare both methods, we can see that dictionaries run approximately 22% faster. In general, working with dictionaries in Python is very efficient compared to lists: searching a list requires a pass over every element, while a dictionary lookup goes straight to the matching key. The comparison is a little unfair, though, since both structures serve different purposes.
Using dictionaries allows you to replace the same values on several different columns. In all the
previous examples, we specified the column from which the values to replace came. We’re now
going to replace several values from the same column with one common value. We want to
classify all ethnicities into three big categories: Black, Asian, and White. The syntax again is
very simple. We use nested dictionaries here: the outer key is the column in which we want to replace values. The value of this outer key is another dictionary, where the keys are the ethnicities to replace and the values are the new ethnicities (Black, Asian, or White).
start_time = time.time()
names.replace({'Ethnicity': {'ASIAN AND PACI': 'ASIAN',
                             'ASIAN AND PACIFIC ISLANDER': 'ASIAN',
                             'BLACK NON HISPANIC': 'BLACK',
                             'BLACK NON HISP': 'BLACK',
                             'WHITE NON HISPANIC': 'WHITE',
                             'WHITE NON HISP': 'WHITE'}})
print("Time using .replace() with dictionary: {} sec".format(time.time() -
                                                             start_time))
Here is a summary of the best practices for selecting and replacing values:
● Selecting rows and columns is faster using the .iloc[] function, so it is better to use it unless it is easier or more convenient to use .loc[] and speed is not a priority, or you are only doing the selection once.
● Using the built-in .replace() function is much faster than manually locating values with .loc[] and overwriting them.
● Replacing multiple values using Python dictionaries is faster than using lists.
def city_name_generator():
    yield 'New York'
    yield 'London'
    yield 'Tokyo'
    yield 'Sao Paolo'

city_names = city_name_generator()
To access the elements that the generator yields we can use Python’s next() function. Each time
the next() command is used, the generator will produce the next value to yield, until there are no
more values to yield. We here have 4 cities. Let’s run the next command four times and see
what it returns:
next(city_names)
next(city_names)
next(city_names)
next(city_names)
As we can see, every time we call next() it returns a new city name. Let's go back to the .iterrows() function. The .iterrows() method is available on every pandas DataFrame. When called, it returns a generator that yields, for each row, a pair of two elements. We will use this generator to iterate through each line of our poker DataFrame.
The first element is the index of the row, while the second element is a pandas Series containing the features of the row: the symbol and the rank of each of the five cards. It is very similar to the enumerate() function, which, when applied to a list, returns each element along with its index. The most intuitive way to iterate through a pandas DataFrame is to use the range() function, which is often called crude looping. This is shown in the code below:
start_time = time.time()
for index in range(poker_data.shape[0]):
    next
print("Time using range(): {} sec".format(time.time() - start_time))
One smarter way to iterate through a pandas DataFrame is to use the .iterrows() function, which is optimized for this task. We simply define the for loop with two loop variables, one for the row index and one for the row values. Inside the loop, the bare next acts as a placeholder (similar to pass): the loop body does nothing, so we measure only the cost of the iteration itself.
data_generator = poker_data.iterrows()
start_time = time.time()
for index, values in data_generator:
    next
print("Time using .iterrows(): {} sec".format(time.time() - start_time))
Comparing the two computational times, we notice that using .iterrows() does not improve the speed of iterating through the pandas DataFrame. It is very useful, though, when we need a cleaner way to use the values of each row while iterating through the dataset.
The syntax of the .apply() function is simple: we create a mapping, using a lambda function in this case, and then specify the function we want to apply to every cell. Here, we're applying the square root function to every cell of the DataFrame. In terms of speed, it matches the speed of just using the NumPy sqrt() function over the whole DataFrame. This is a simple example, since the function is applied to each cell independently; a sketch of this comparison is shown below.
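Here is a minimal sketch of that comparison (reconstructed, since the original code is not shown):

import numpy as np

apply_start_time = time.time()
poker_data.apply(lambda x: np.sqrt(x))
apply_end_time = time.time()
print("Time using .apply(): {} sec".format(apply_end_time - apply_start_time))

sqrt_start_time = time.time()
np.sqrt(poker_data)
sqrt_end_time = time.time()
print("Time using np.sqrt(): {} sec".format(sqrt_end_time - sqrt_start_time))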
But what happens when the function of interest takes more than one cell as input? For
example, what if we want to calculate the sum of the rank of all the cards in each hand? In this
case, we will use the .apply() function the same way as we did before, but we need to add
‘axis=1’ at the end of the line to specify we’re applying the function to each row.
apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time
print("Time using .apply(): {} sec".format(time.time() - apply_start_time))
Then, we will use the .iterrows() function we saw previously, and compare their efficiency.
for_loop_start_time = time.time()
for ind, value in poker_data.iterrows():
    sum([value[1], value[3], value[5], value[7], value[9]])
for_loop_end_time = time.time()
for_loop_time = for_loop_end_time - for_loop_start_time
print("Time using .iterrows(): {} sec".format(for_loop_time))
Using the .apply() function is significantly faster than the .iterrows() function, with a magnitude
of around 400 percent, which is a massive improvement!
As we did with rows, we can do exactly the same thing for the columns; apply one function to
each column. By replacing the axis=1 with axis=0, we can apply the sum function on every
column.
apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=0)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time
print("Time using .apply(): {} sec".format(apply_time))
By comparing the .apply() function with the native pandas method for the same column-wise sum, we can see that pandas' native .sum() performs the operation faster.
pandas_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=0)
pandas_end_time = time.time()
pandas_time = pandas_end_time - pandas_start_time
print("Time using pandas: {} sec".format(pandas_time))
In conclusion, we observe that the .apply() function performs much faster than .iterrows() when we want to iterate through all the rows of a pandas DataFrame, but for column-wise operations the native pandas methods are faster than .apply().
In the code below we want to calculate the sum of the ranks of all the cards in each hand. In
order to do that, we slice the poker dataset keeping only the columns that contain the ranks of
each card. Then, we call the built-in .sum() property of the DataFrame, using the parameter axis
= 1 to denote that we want the sum for each row. In the end, we print the sum of the first five
rows of the data.
start_time_vectorization = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
end_time_vectorization = time.time()
vectorization_time = end_time_vectorization - start_time_vectorization
print("Time using pandas vectorization: {} sec".format(vectorization_time))
We saw previously various methods that perform functions applied to a DataFrame faster than
simply iterating through all the rows of the DataFrame. Our goal is to find the most efficient
method to perform this task.
data_generator = poker_data.iterrows()
start_time_iterrows = time.time()
for index, values in data_generator:
    sum([values[1], values[3], values[5], values[7], values[9]])
end_time_iterrows = time.time()
iterrows_time = end_time_iterrows - start_time_iterrows
print("Time using .iterrows(): {} seconds".format(iterrows_time))
start_time_apply = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)
end_time_apply = time.time()
apply_time = end_time_apply - start_time_apply
print("Time using .apply(): {} seconds".format(apply_time))
Comparing the time it takes to sum the ranks of all the cards in each hand using vectorization, the .iterrows() function, and the .apply() function, we can see that the vectorization method performs much better. We can also use another vectorization method to iterate through the DataFrame efficiently: using NumPy arrays.
The NumPy library, which defines itself as a "fundamental package for scientific computing in Python", performs operations under the hood in optimized, pre-compiled C code. Like pandas, NumPy operates on arrays, called ndarrays. A major difference between Series and ndarrays is that ndarrays omit much of the overhead of Series, such as index alignment and per-operation data type checking. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series. NumPy arrays can be used in place of pandas Series when the additional functionality offered by the Series isn't critical.
For the problems we explore in this chapter, we could use NumPy ndarrays instead of the
pandas series. The question at stake is whether this would be more efficient or not. Again, we
will calculate the sum of the ranks of all the cards in each hand. We convert our rank arrays
from pandas Series to NumPy arrays simply by using the .values method of pandas Series,
which returns a pandas Series as a NumPy ndarray. As with vectorization on the series, passing
the NumPy array directly into the function will lead pandas to apply the function to the entire
vector.
start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].values.sum(axis=1)
print("Results from the above operation calculated in %s seconds" %
      (time.time() - start_time))
At this point, we can see that vectorizing over the pandas Series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of the highest priority, we can call in reinforcements in the form of the NumPy library. Compared to the previous state-of-the-art method, the pandas optimization, we still get an improvement in the operating time.
● Using .iterrows() does not improve the speed of iterating through the DataFrame, but it provides a cleaner way to access the values of each row.
● The .apply() function performs faster when we want to iterate through all the rows of a
pandas DataFrame, but is slower when we perform the same operation through a
column.
● Vectorizing over the pandas series achieves the overwhelming majority of optimization
needs for everyday calculations. However, if speed is of the highest priority, we can call
in reinforcements in the form of the NumPy Python library.
restaurant = pd.read_csv('restaurant_data.csv')
restaurant_grouped = restaurant.groupby('smoker')
print(restaurant_grouped.count())
It is no surprise that we get the same results for all the features, as the .count() method counts
the number of occurrences of each group in each feature. As there are no missing values in our
data, the results should be the same in all columns.
After grouping the entries of the DataFrame according to the values of a specific feature, we can
apply any kind of transformation we are interested in. Here, we are going to apply the z score, a
normalization transformation, which is the distance between each value and the mean, divided
by the standard deviation. This is a very useful transformation in statistics, often used with the
z-test in standardized testing. To apply this transformation to the grouped object, we just need to
call the .transform() method containing the lambda transformation we defined.
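The exact definition of the lambda is not shown in the original, but based on the description above it would look like this:

zscore = lambda x: (x - x.mean()) / x.std()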
This time, we will group according to the type of meal: was it dinner or lunch? As the z-score is a group-related transformation, the resulting table has the same shape as the original one: for each element, we subtract the mean and divide by the standard deviation of the group it belongs to. We can also see that numerical transformations are applied only to the numerical features of the DataFrame.
restaurant_grouped = restaurant.groupby('time')
restaurant_transformed = restaurant_grouped.transform(zscore)
restaurant_transformed.head()
While the transform() method simplifies things a lot, is it actually more efficient than using native
Python code? As we did before, we first group our data, this time according to sex. Then we
apply the z-score transformation we applied before, measuring its efficiency. We omit the code
for measuring the time of each operation here, as you are already familiar with this. We can see
that with the use of the transform() function, we achieve a massive speed improvement. On top
of that, we’re only using one line to perform the operation of interest.
restaurant.groupby('sex').transform(zscore)
mean_female = restaurant.groupby('sex').mean()['total_bill']['Female']
mean_male = restaurant.groupby('sex').mean()['total_bill']['Male']
std_female = restaurant.groupby('sex').std()['total_bill']['Female']
std_male = restaurant.groupby('sex').std()['total_bill']['Male']
for i in range(len(restaurant)):
    if restaurant.iloc[i][2] == 'Female':
        restaurant.iloc[i][0] = (restaurant.iloc[i][0] -
                                 mean_female)/std_female
    else:
        restaurant.iloc[i][0] = (restaurant.iloc[i][0] - mean_male)/std_male
prior_counts = restaurant.groupby('time')
prior_counts['total_bill'].count()
Next, we will create a restaurant_nan dataset, in which the total bill of 10% random observations
was set to NaN using the code below:
import pandas as pd
import numpy as np
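# A hypothetical sketch of how restaurant_nan could be built (the original code
# is not shown): copy the data and set 'total_bill' to NaN for 10% of the rows.
restaurant_nan = restaurant.copy()
random_rows = np.random.choice(restaurant_nan.index,
                               size=int(0.1 * len(restaurant_nan)),
                               replace=False)
restaurant_nan.loc[random_rows, 'total_bill'] = np.nan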
Now, let's print the number of data points for each value of the "time" feature: they are now 155 + 62 = 217. Since the full dataset has 244 data points, 27 values are now missing, which is roughly the 10% we set to NaN.
prior_counts = restaurant_nan.groupby('time')
prior_counts['total_bill'].count()
After counting the number of missing values in our data, we will show how to fill these missing
values with a group-specific function. The most common choices are the mean and the median,
and the selection has to do with the skewness of the data. As we did before, we define a
lambda transformation using the fillna() function to replace every missing value with its group
average. As before, we group our data according to the time of the meal and then replace the
missing values by applying the pre-defined transformation.
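A minimal sketch of this group-wise fill, reconstructed from the description above since the exact code is not shown:

missing_trans = lambda x: x.fillna(x.mean())

start_time = time.time()
restaurant_nan.groupby('time')['total_bill'].transform(missing_trans)
print("Results from the above operation calculated in %s seconds" %
      (time.time() - start_time))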
As we can see, the observations at index 0 and index 4 are exactly the same, which means that
their missing value has been replaced by their group’s mean. Also, we can see the computation
time using this method is 0.007 seconds.
start_time = time.time()
mean_din = restaurant_nan.loc[restaurant_nan.time ==
                              'Dinner']['total_bill'].mean()
mean_lun = restaurant_nan.loc[restaurant_nan.time ==
                              'Lunch']['total_bill'].mean()
for row in range(len(restaurant_nan)):
    if restaurant_nan.iloc[row]['time'] == 'Dinner':
        restaurant_nan.loc[row, 'total_time'] = mean_din
    else:
        restaurant_nan.loc[row, 'total_time'] = mean_lun
print("Results from the above operation calculated in %s seconds" %
      (time.time() - start_time))
We can see that using the .transform() function applied on a grouped object performs faster
than the native Python code for this task.
Often, after grouping the entries of a DataFrame according to a specific feature, we are
interested in including only a subset of those groups, based on some conditions. Some
examples of filtration conditions are the number of missing values, the mean of a specific
feature, or the number of occurrences of the group in the dataset.
We are interested in finding the mean amount of tips given on the days when the mean total bill is more than 20 USD. The .filter() function accepts a function (here a lambda) that operates on the DataFrame of each group. In this example, the lambda function selects "total_bill" and checks that its mean() is greater than 20. If that lambda function returns True,
then the mean() of the tip is calculated. If we compare the total mean of the tips, we can see
that there is a difference between the two values, meaning that the filtering was performed
correctly.
restaurant_grouped = restaurant.groupby('day')
filter_trans = lambda x : x['total_bill'].mean() > 20
restaurant_filtered = restaurant_grouped.filter(filter_trans)
print(restaurant_filtered['tip'].mean())
print(restaurant['tip'].mean())
If we attempt to perform this operation without using groupby(), we end up with this inefficient
code. At first, we use a list comprehension to extract the entries of the DataFrame that refer to
days that have a mean meal greater than $20 and then use a for loop to append them into a list
and calculate the mean. It might seem very intuitive, but as we see, it’s also very inefficient.
for j in t[1:]:
    restaurant_filtered = restaurant_filtered.append(j, ignore_index=True)
7. Make Your Pandas Code 1000 Times Faster With
This Trick
While the Pandas package is powerful and flexible, its performance can sometimes become a bottleneck on large datasets. In this chapter, we will explore a trick to make your Pandas code run much faster, increasing its efficiency by up to 1000 times.
Whether you are a beginner or an experienced Pandas user, this chapter will provide you with
valuable insights and practical tips for speeding up your code. So, if you want to boost the
performance of your Pandas code, read on!
import pandas as pd
import numpy as np

def get_data(size=10000):
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_in_bed'] = np.random.randint(0, 9, size)
    df['pct_sleeping'] = np.random.rand(size)
    df['favorite_food'] = np.random.choice(['pizza', 'ice-cream', 'burger', 'rice'], size)
    df['hate_food'] = np.random.choice(['milk', 'vegetables', 'eggs'], size)
    return df

df = get_data()
df.head()
The task we will work on is a reward calculation based on the following rules: if a person was in bed for more than 5 hours and slept more than 50% of that time, we give them their favorite food; otherwise, we give them the food they hate. If they are over 90 years old, they get their favorite food regardless. This can be represented using the following function.
def reward_cal(row):
    if row['age'] >= 90:
        return row['favorite_food']
    if (row['time_in_bed'] > 5) & (row['pct_sleeping'] > 0.5):
        return row['favorite_food']
    return row['hate_food']
%%timeit
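# A hypothetical reconstruction of the loop being timed (the original cell body
# is not shown): iterate row by row and store the reward for each row.
for index, row in df.iterrows():
    df.loc[index, 'reward'] = reward_cal(row)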
2.54 s ± 28.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As we can see, the computation time needed to iterate through every row of the data frame is about 2.5 seconds, even though the data has only 10,000 rows, which is considered small. If the data had millions of rows, a single task like this could take hours. Therefore, this is not the most efficient way to iterate through a data frame. Let's discuss the second method, which improves on this.
7.3. Level 2: Apply Function
The .apply() method in pandas is used to apply a function along an axis of a DataFrame (row-wise or column-wise) or to each element of a Series. Let's use it to apply the reward calculation function to each row of the data frame and then measure the computational time:
%%timeit
df['reward'] = df.apply(reward_cal, axis = 1)
266 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The average time to apply the function to the 10,000 rows of the data frame is only 266 ms, or about 0.27 seconds. This is around 10 times faster than using the loop. However, we are still not done: we can make it roughly 1000 times faster than the loop. Let's see how!
%%timeit
df['reward'] = df['hate_food']
df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5)) | (df['age'] >= 90),
       'reward'] = df['favorite_food']
2.1 ms ± 62.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can now see a tremendous decrease in the computation time compared to the previous two methods: relative to the loop, the computation time has decreased by a factor of at least 1000. Let's have a look at the differences in a bar plot.
results = pd.DataFrame(
    [
        ["Loop", 2690],
        ["apply", 268],
        ["vectorized", 2.32],
    ],
    columns=['method', 'run_time'],
)
results.set_index('method')['run_time'].plot(kind='bar')
Looking at the bar plot we can get a better intuition of the huge difference between the different
computational times of the different methods used in this chapter.
8. Data Exploration Becomes Easier & Better With
Pandas Profiling
Data exploration is a crucial step in any data analysis and data science project. It allows you to
gain a deeper understanding of your data, identify patterns and relationships, and identify any
potential issues or outliers.
One of the most popular tools for data exploration is the Python library Pandas. The library
provides a powerful set of tools for working with data, including data cleaning, transformation,
and visualization. However, even with the powerful capabilities of Pandas, data exploration can
still be a time-consuming and tedious task. That’s where Pandas Profiling comes in.
With Pandas Profiling, you can easily generate detailed reports of your data, including summary
statistics, missing values, and correlations, making data exploration faster and more efficient.
This chapter will explore how Pandas Profiling can help you improve your data exploration
process and make it easier to understand your data.
The report generated by the pandas profiling library typically includes a variety of information
about the dataset, including:
● Overview: Summary statistics for all columns, including the number of rows, missing
values, and data types.
● Variables: Information about each column, including the number of unique values,
missing values, and the top frequent values.
● Correlations: Correlation matrix and heatmap, showing the relationship between
different variables.
● Distribution: Histograms and kernel density plots for each column, showing the distribution of values.
● Categorical Variables: Bar plots for categorical variables, showing the frequency of
each category.
● Numerical Variables: Box plots for numerical variables, showing the distribution of values and outliers.
● Text: Information about text columns, including the number of characters and words.
● File: Information about file columns, including the number of files, and the size of each
file.
● High-Cardinality: Information about high-cardinality categorical variables, including their
most frequent values.
● Sample: A sample of the data, with the first and last few rows displayed.
It is worth noting that the report is interactive and you can drill down on each section for more
details.
To install pandas-profiling, you can run the following command in your terminal or command prompt:
pip install pandas-profiling
This will install the latest version of pandas-profiling and its dependencies. If you are using Jupyter Notebook, you can also install it by running the following command in a cell:
!pip install pandas-profiling
You can also install it with the conda package manager using the command below:
conda install -c conda-forge pandas-profiling
If the installation fails with an error, one common workaround is to install the package from within the notebook using the current Python interpreter:
import sys
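# A hypothetical completion of this workaround (the original command is not shown):
!{sys.executable} -m pip install pandas-profiling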
Once installed, you can import and use pandas_profiling in your code as follows:
import pandas as pd
import pandas_profiling as pp

Popular_baby_names_df = pd.read_csv('Popular_Baby_Names.csv')
Popular_baby_names_df.head()
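To generate the report, you can then create a ProfileReport from the DataFrame. A minimal sketch (the title and output file name are assumptions):

profile = pp.ProfileReport(Popular_baby_names_df, title='Popular Baby Names Report')
profile.to_file('popular_baby_names_report.html')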
As you can see, the report contains the main sections we mentioned before. Let's go through each section and discuss them one by one.
8.3.1. Overview
The first section which is represented in the figure above is the overview. This section contains
three subsections which are overview, alerts, and reproduction.
The alerts subsection highlights issues you may want to address during preprocessing. For example, if certain features are highly correlated with each other, this is flagged so you can drop one of them or apply a dimensionality reduction algorithm to remove the redundant information.
The reproduction subsection shows meta-information about the report itself, such as how long it took to generate and when it was generated.
8.3.2. Variables
This section will show information about each feature with a plot that represents their
distribution. Here is the plot for this section and as you can see there are subsections for each
of the features in the data.
Let's take the plot for the first feature, which is the Year of Birth. We can see all the statistical information about the feature, such as the minimum and maximum values, the number of missing values, and the mean. In addition, there is a histogram plot that shows the distribution of the data.
This section is really useful for understanding the feature and will save you a lot of time in
feature processing.
8.3.3. Interaction
The third section in the report is the interaction section. This section will show you the relation
between each numerical feature with the other features. For example, let’s have a look at the
interaction between the rank and the count feature:
We can see the relation between the Count and the Rank features, and the plot also shows that they are highly negatively correlated, as expected.
8.3.4. Correlations
The fourth section of the report is the correlation section. This provides a heatmap that shows
the correlation between the features. In addition to the heatmap it also provides the correlation
value between each of the features as shown below:
8.3.5. Missing Values
The fifth section is the missing values section. This shows, for each feature, the proportion of values that are present. Since we have no missing data here, all of the features have a value of 1 (fully complete).
8.3.6. Samples
The final section of the report is the samples section. This shows you a sample of the data from
the first and the last rows. The figure below shows the first ten rows of the data.
Another drawback is that Pandas Profiling can only be used with Pandas DataFrames. This
means that if you’re working with data in a different format, such as a CSV file or a SQL
database, you’ll need to first convert it to a Pandas DataFrame before you can use Pandas
Profiling.
Additionally, Pandas Profiling generates a lot of information and can be overwhelming to digest if you don't know what you're looking for. To share or present the report, you will also need to export it to a file format such as HTML, PDF, or Excel.
● Use Pandas Profiling on a sample of your data rather than the entire dataset to reduce
memory usage.
● Use Pandas to convert your data to a DataFrame before using Pandas Profiling.
● Use the options in Pandas Profiling to customize the report and only include the
information that you need.
● Use visualization libraries like Matplotlib and Seaborn to build additional plots that make your findings easier to understand.
● Use the report as a starting point for your data exploration, and then use other tools and
techniques to dive deeper into your data.
9. Top 10 Pandas Mistakes to Steer Clear of in Your
Code
Pandas is a powerful and popular data analysis library in Python, widely used by data scientists
and analysts to manipulate and transform data. However, with great power comes great
responsibility, and it’s easy to fall into common pitfalls that can lead to inefficient code and slow
performance.
In this chapter, we’ll explore the top 10 mistakes to steer clear of when using Pandas, so you
can maximize your efficiency and get the most out of this powerful library. Whether you’re a
beginner or a seasoned Pandas user, these tips will help you write better code and avoid
common mistakes that can slow you down.
For example, if you have a dataframe with a column name containing a space, say “Sales
Amount”, you cannot reference the column using the usual dot notation, like this:
df.Sales Amount
This will result in a syntax error, as the space in the column name causes confusion to Python’s
syntax parser. Instead, you would need to reference the column name using square brackets
and quotes, like this:
df['Sales Amount']
However, this can be cumbersome and error-prone, especially if you have a lot of column
names with spaces. Another issue with spaces in column names is that some functions in
pandas might not be able to handle them properly. For example, if you want to group by a
column with a space in the name, like this:
df.groupby('Sales Amount')['Quantity'].sum()
You will get a KeyError, as pandas cannot recognize the column name with the space. To avoid
these issues, it’s best to avoid spaces in column names altogether when working with pandas
dataframes. Instead, you can use underscores or camelCase to separate words in column
names. For example, “Sales_Amount” or “salesAmount”.
Here’s an example of how to use the query() method to create a subset of a dataframe:
import pandas as pd
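# A reconstruction sketch: the original dataframe is not shown, so a small
# hypothetical one with an 'age' column is created here.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol'],
                   'age': [25, 35, 45]})
subset = df.query('age > 30')
print(subset)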
In this example, the query() method is used to create a subset of the dataframe df where the
age column is greater than 30. The resulting subset is stored in the variable subset.
You can also use the query() method to create subsets based on multiple conditions:
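# a sketch, reusing the hypothetical df from the previous example
subset = df.query('age >= 30 and age <= 40')
print(subset)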
In this example, the query() method is used to create a subset of the dataframe df where the
age column is between 30 and 40, inclusive. While the query() method can be a powerful tool,
it's important to keep in mind that it can be slower than other methods like boolean indexing or
the loc and iloc indexers. Additionally, it's important to be careful with variable scoping and
name collisions when using query(). Therefore, it's recommended to use query() judiciously
and to consider other methods when appropriate.
An alternative approach is to use the @ symbol to reference variables in the query expression,
allowing you to write more readable and flexible code. Here’s an example of how to use the @
symbol with the query() method:
import pandas as pd
df = pd.read_csv('sales_data.csv')
product_category = 'Electronics'
start_date = '2022-01-01'
end_date = '2022-03-01'
min_sales_amount = 1000
min_quantity = 10
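# A sketch of the query using @ to reference the variables defined above.
# The column names (category, date, sales_amount, quantity) are assumptions,
# since the original query string is not shown.
subset = df.query('category == @product_category and '
                  'date >= @start_date and date <= @end_date and '
                  'sales_amount >= @min_sales_amount and '
                  'quantity >= @min_quantity')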
In this example, the @ symbol is used to reference variables in the query expression, making
the code more flexible and readable. The variables product_category, start_date, end_date,
min_sales_amount, and min_quantity are defined earlier in the code and can be easily modified
without needing to update the string expression.
By using the @ symbol to reference variables, you can write more concise and readable queries
without sacrificing performance or readability. This approach is especially useful for complex
queries involving multiple variables or conditions.
To illustrate how to use vectorization instead of looping over a data frame, let’s consider a
simple example. Suppose we have a data frame with two columns, “x” and “y”, and we want to
create a new column “z” that contains the product of “x” and “y”.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
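# A reconstruction sketch of the loop-based approach described below
# (the original loop is not shown).
df['z'] = 0
for i in range(len(df)):
    df.loc[i, 'z'] = df.loc[i, 'x'] * df.loc[i, 'y']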
This code iterates over each row of the data frame and calculates the product of “x” and “y” for
that row, then stores the result in the new “z” column. While this code works, it can be slow and
inefficient, particularly for large data frames. Instead, we can use vectorization to perform this
calculation much more efficiently. Here’s an example:
import pandas as pd
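# A reconstruction sketch of the vectorized approach described below.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df['z'] = df['x'] * df['y']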
In this code, we simply use the “*” operator to multiply the “x” and “y” columns together and then
assign the result to the new “z” column. This code performs the calculation much more
efficiently than the loop-based approach.
To avoid this behavior and ensure that any edits you make to the new DataFrame do not affect
the original DataFrame, you can use the “copy” method to create a copy of the DataFrame.
Here’s an example:
new_df['age'] = new_df['age'] + 5
In this example, we first create a sample DataFrame with three columns: “name”, “age”, and
“gender”. We then select a slice of the DataFrame using the “.loc” indexer to select only the
rows where the “age” column is greater than 30, and only the “name” and “age” columns. We
then use the “copy” method to create a new DataFrame that is a copy of the slice, rather than a
view of the original data.
We then modify the “age” column of the new DataFrame by adding 5 to each value. Because we
used the “copy” method to create the new DataFrame, this modification does not affect the
original DataFrame. Finally, we print both DataFrames to confirm that the modification only
applies to the new DataFrame, and not the original DataFrame.
print(df)
print(new_df)
In pandas, you can chain multiple transformations together in a single statement, which can be
a more concise and efficient way of applying multiple transformations to a DataFrame. Here’s an
example of how to transform a DataFrame using chain commands:
1. Use the “.loc” indexer to select only the rows where the “age” column is greater than 30,
and only the “name” and “age” columns.
2. Use the “copy” method to create a new DataFrame that is a copy of the slice, rather
than a view of the original data.
3. Use the “assign” method to create a new column in the DataFrame called “age_plus_5”,
which is equal to the “age” column plus 5.
Finally, we print the new DataFrame to confirm that all of the transformations were applied
correctly.
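A minimal sketch of these chained steps, assuming the same hypothetical df used in the copy example above:

new_df = (
    df.loc[df['age'] > 30, ['name', 'age']]
    .copy()
    .assign(age_plus_5=lambda d: d['age'] + 5)
)
print(new_df)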
In many instances, you will have to manually set the correct dtype for a column. Doing so helps you manipulate your data more efficiently, for three main reasons (a short example follows this list):
1. Memory usage: By setting the right data types, you can reduce the memory usage of
your DataFrame. This is especially important when working with large datasets, as it can
help avoid running out of memory and crashing your program.
2. Data consistency: Setting column dtypes can help ensure that your data is consistent
and accurate. For example, if a column should contain only integers, setting its dtype to
“int” will prevent any non-integer values from being entered into that column.
3. Performance: Pandas operations can be much faster when the dtypes are set correctly.
For example, operations like sorting and filtering can be optimized when Pandas knows
the data types of the columns being operated on.
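As a short example, here is a minimal sketch of setting dtypes explicitly with astype (the DataFrame and column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'age': ['25', '35', '45'],
                   'gender': ['F', 'M', 'F']})
df['age'] = df['age'].astype('int32')
df['gender'] = df['gender'].astype('category')
print(df.dtypes)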
1. Line plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
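# A reconstruction sketch based on the description below; the title and axis
# labels are assumptions.
df = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
df['A'].plot(kind='line', color='red')
plt.title('Line Plot')
plt.xlabel('Index')
plt.ylabel('A')
plt.show()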
This code creates a DataFrame with 100 rows and 4 columns of random data and then plots a
line chart of column A using the “.plot” method. The “kind” parameter is set to “line” to create a
line plot, and the “color” parameter is set to “red” to change the color of the line. The title, xlabel,
and ylabel of the plot are also set using the standard Matplotlib functions.
2. Scatter plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
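# A reconstruction sketch: 100 rows of random data for columns X and Y,
# as described below.
df = pd.DataFrame(np.random.randn(100, 2), columns=['X', 'Y'])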
df.plot(kind='scatter', x='X', y='Y', color='blue', title='Scatter Plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
This code creates a DataFrame with 100 rows and 2 columns of random data and then plots a
scatter chart of columns X and Y using the “.plot” method. The “kind” parameter is set to
“scatter” to create a scatter plot, and the “x” and “y” parameters are set to ‘X’ and ‘Y’
respectively to specify the columns to use as the x-axis and y-axis of the plot.
3. Bar plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
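# A reconstruction sketch based on the description below; the specific colors
# are assumptions.
df = pd.DataFrame(np.random.rand(5, 2), columns=['A', 'B'])
df.plot(kind='bar', color=['skyblue', 'orange'])
plt.show()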
This code creates a DataFrame with 5 rows and 2 columns of random data and then plots a bar
chart of columns A and B using the “.plot” method. The “kind” parameter is set to “bar” to create
a bar plot, and the “color” parameter is set to a list of colors to use for the bars.
4. Histogram
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
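# A reconstruction sketch based on the description below.
df = pd.DataFrame(np.random.randn(1000, 1), columns=['A'])
df['A'].plot(kind='hist', bins=20)
plt.show()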
This code creates a DataFrame with 1000 rows and 1 column of random data and then plots a
histogram of column A using the “.plot” method. The “kind” parameter is set to “hist” to create a
histogram, and the “bins” parameter is set to 20 to control the number of bins in the histogram.
9.9. Aggregation Manually instead of using .groupby()
Aggregation is a common operation in data analysis where we group data based on one or
more columns and apply a mathematical function to the remaining columns to get summarized
information about the data. Pandas provides a .groupby() method that simplifies the process of aggregation, whereas performing the aggregation manually can lead to inefficient code.
Here are some examples of manually aggregating data in Pandas and how it can be improved
using .groupby():
import pandas as pd
df = pd.read_csv('sales_data.csv')
unique_regions = df['region'].unique()
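# A reconstruction sketch of the manual aggregation described below
# (the 'region' and 'sales' column names are assumptions).
for region in unique_regions:
    region_data = df[df['region'] == region]
    total_sales = region_data['sales'].sum()
    avg_sales = region_data['sales'].mean()
    max_sales = region_data['sales'].max()
    min_sales = region_data['sales'].min()
    print(region, total_sales, avg_sales, max_sales, min_sales)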
This code loops through each unique region in the dataset, filters the data for that region, and manually calculates the total, average, maximum, and minimum sales for that region. This process can be simplified using .groupby():
import pandas as pd
df = pd.read_csv('sales_data.csv')
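# A sketch of the equivalent groupby/agg call (column names assumed as above).
region_sales = df.groupby('region')['sales'].agg(['sum', 'mean', 'max', 'min'])
print(region_sales)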
This code groups the data by the region column and calculates the total, average, maximum,
and minimum sales for each region using the .agg() method.
import pandas as pd
df = pd.read_csv('sales_data.csv')
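# A reconstruction sketch of the pivot tables described below
# (the column names are assumptions).
region_sales = pd.pivot_table(df, index='region', values='sales',
                              aggfunc=['sum', 'mean', 'max', 'min'])
category_sales = pd.pivot_table(df, index='category', values='sales',
                                aggfunc=['sum', 'mean', 'max', 'min'])
product_sales = pd.pivot_table(df, index='product', values='sales',
                               aggfunc=['sum', 'mean', 'max', 'min'])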
print(region_sales)
print(category_sales)
print(product_sales)
This code creates multiple pivot tables to calculate the total, average, maximum, and minimum
sales for each region, category, and product. This can be simplified using .groupby():
import pandas as pd
df = pd.read_csv('sales_data.csv')
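# A sketch of the combined groupby (column names assumed as above).
sales_stats = df.groupby(['region', 'category', 'product'])['sales'].agg(
    ['sum', 'mean', 'max', 'min'])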
print(sales_stats)
This code groups the data by the region, category, and product columns and calculates the total,
average, maximum, and minimum sales for each combination of those columns using the .agg()
method.
Overall, manually aggregating data can be time-consuming and error-prone, especially for large
datasets. The .groupby() method simplifies the process and provides a more efficient and
reliable way to perform aggregation operations in Pandas.
1. Large file size: CSV files can be very large in size, especially when dealing with
datasets that have many columns and/or rows. This can cause problems with storage
and processing, especially if you have limited resources.
2. Limited data types: CSV files only support a limited range of data types, such as text,
numbers, and dates. If your dataset includes more complex data types, such as images
or JSON objects, then CSV may not be the best format to use.
3. Loss of metadata: CSV files do not store metadata such as data types, index information, or explicit markers for null values. This can cause problems when importing or exporting the data and can make it difficult to perform data analysis.
4. Performance issues: Reading and writing large CSV files can be slow and can put a
strain on system resources, especially when dealing with complex datasets.
5. No data validation: CSV files do not provide any built-in data validation or error
checking, which can lead to data inconsistencies and errors.
There are more efficient ways to save large dataframes than using CSV files. Some of the
options are:
1. Parquet: Parquet is a columnar storage format that is optimized for data processing on
large data sets. It can handle complex data types and supports compression, which
makes it a good choice for storing large dataframes.
2. Feather: Feather is a lightweight binary file format designed for fast read and write
operations. It supports both R and Python and can be used to store dataframes in a
compact and efficient way.
3. HDF5: HDF5 is a file format designed for storing large numerical data sets. It provides a
hierarchical structure that can be used to organize data and supports compression and
chunking, which makes it suitable for storing large dataframes.
4. Apache Arrow: Apache Arrow is a cross-language development platform for in-memory
data processing. It provides a standardized format for representing data that can be
used across different programming languages and supports zero-copy data sharing,
which makes it efficient for storing and processing large dataframes.
Each of these options has its own strengths and weaknesses, so the choice of which one to use
depends on your specific use case and requirements.
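As a quick illustration, here is a minimal sketch of writing and reading a DataFrame df in some of these formats. The file names are hypothetical, the Parquet and Feather writers require pyarrow or fastparquet to be installed, and HDF5 requires PyTables:

df.to_parquet('sales_data.parquet')
df.to_feather('sales_data.feather')
df.to_hdf('sales_data.h5', key='sales', mode='w')

df_parquet = pd.read_parquet('sales_data.parquet')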
Afterword
Thanks for purchasing and reading my book! If you have any questions, feedback or praise, you
can reach me at: Youssef.Hosni95@outlook.com
You can check my other books on my website. I would be happy if you connect with me personally on LinkedIn. If you liked my writing, make sure to follow me on Medium. You are also welcome to subscribe to my newsletter To Data & Beyond so you never miss any of my writing.