Better Python Code
“You’ll not just be aspiring to be an expert anymore after practicing through Better Python
Code: A Guide for Aspiring Experts, you’ll be one of them! Learn from David Mertz, who’s
been making experts through his writing and training for the past 20 years.”
—Iqbal Abdullah, past Chair, PyCon Asia Pacific, and past board member, PyCon Japan
“In Better Python Code: A Guide for Aspiring Experts, David Mertz serves up bite-sized
chapters of Pythonic wisdom in this must-have addition to any serious Python
programmer’s collection. This book helps bridge the gap from beginner to advanced
Python user, but even the most seasoned Python programmer can up their game with
Mertz’s insight into the ins and outs of Python.”
—Katrina Riehl, President, NumFOCUS
“What separates ordinary coders from Python experts? It’s more than just knowing best
practices—it’s understanding the benefits and pitfalls of the many aspects of Python, and
knowing when and why to choose one approach over another. In this book David draws
on his more than 20 years of involvement in the Python ecosystem and his experience as a
Python author to make sure that the readers understand both what to do and why in a wide
variety of scenarios.”
—Naomi Ceder, past Chair, Python Software Foundation
“Like a Pythonic BBC, David Mertz has been informing, entertaining, and educating the
Python world for over a quarter of a century, and he continues to do so here in his own
pleasantly readable style.”
—Steve Holden, past Chair, Python Software Foundation
“Being expert means someone with a lot of experience. David’s latest book provides some
important but common problems that folks generally learn only after spending years of
doing and fixing. I think this book will provide a much quicker way to gather those
important bits and help many folks across the world to become better.”
—Kushal Das, CPython Core Developer and Director, Python Software Foundation
“This book is for everyone: from beginners, who want to avoid hard-to-find bugs, all the
way to experts looking to write more efficient code. David Mertz has compiled a great set
of useful idioms that will make your life as a programmer easier and your users happier.”
—Marc-André Lemburg, past Chair, EuroPython, and past Director, Python Software Foundation
Better Python Code
David Mertz
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and the publisher was aware of a trademark
claim, the designations have been printed with initial capital letters or in all capitals.
The author and publisher have taken care in the preparation of this book, but make no expressed or
implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed
for incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may
include electronic versions; custom cover designs; and content particular to your business, training goals,
marketing focus, or branding interests), please contact our corporate sales department
at corpsales@pearsoned.com or (800) 382-3419.
For questions about sales outside the U.S., please contact intlcs@pearson.com.
All rights reserved. This publication is protected by copyright, and permission must be obtained from the
publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or
by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding
permissions, request forms and the appropriate contacts within the Pearson Education Global Rights &
Permissions Department, please visit www.pearson.com/permissions.
ISBN-13: 978-0-13-832094-2
ISBN-10: 0-13-832094-2
Pearson’s Commitment to Diversity, Equity, and Inclusion
Pearson is dedicated to creating bias-free content that reflects the diversity of all learners.
We embrace the many dimensions of diversity, including but not limited to race, ethnicity,
gender, socioeconomic status, ability, age, sexual orientation, and religious or political
beliefs.
Education is a powerful force for equity and change in our world. It has the potential to
deliver opportunities that improve lives and enable economic mobility. As we work with
authors to create content for every product and service, we acknowledge our responsibility
to demonstrate inclusivity and incorporate diverse scholarship so that everyone can achieve
their potential through learning. As the world’s leading learning company, we have a duty
to help drive change and live up to our purpose to help more people create a better life for
themselves and to create a better world.
Our ambition is to purposefully contribute to a world where:
- Everyone has an equitable and lifelong opportunity to succeed through learning.
- Our educational products and services are inclusive and represent the rich diversity of learners.
- Our educational content accurately reflects the histories and experiences of the learners we serve.
- Our educational content prompts deeper discussions with learners and motivates them to expand their own learning (and worldview).

While we work hard to present unbiased content, we want to hear from you about any
concerns or needs with this Pearson product so that we can investigate and address them.

- Please contact us with concerns about any potential bias at https://www.pearson.com/report-bias.html.
This book is dedicated to my mother, Gayle Mertz, who always valued
ideas and the relentless criticism of existing reality.
Contents
Foreword xvii
Preface xix
Acknowledgments xxv
Introduction 1
8 Security 189
8.1 Kinds of Randomness 190
8.1.1 Use secrets for Cryptographic Randomness 190
8.1.2 Reproducible Random Distributions 192
8.2 Putting Passwords or Other Secrets in “Secure” Source Code 195
Index 245
Foreword
It was a pleasure for me to be asked to write a foreword for David’s new book, as I always
expect David to provide useful, insightful content.
Much as I began with high expectations, I am delighted to say that they were not just
met but exceeded: The book is an engaging read, offers a great deal of insight for anyone
at an intermediate or advanced level to improve their Python programming skill, and
includes copious sharing of precious experience practicing and teaching the language; it is
easy to read and conversational in style. In spite of all this, David manages to keep the
book short and concise enough to absorb quickly and fully.
Most of the book’s content reflects, and effectively teaches, what amounts to a
consensus among Python experts about best practices and mistakes to avoid. In a few cases
in which the author’s well-explained opinions on certain issues of style differ from those of
other experts, David carefully and clearly points out these cases so readers can weigh the
pros and cons and come to their own decisions.
Most of the book deals with Python-related issues at intermediate levels of experience
and skill. These include many instances in which programmers familiar with different
languages may adopt an inferior style in Python, simply because it appears to be a direct
“translation” of a style that’s appropriate for the languages that they know well.
An excellent example of the latter problem is writing APIs that expose getter and setter
methods: In Python, direct getting and setting of the attribute (often enabled via the
property decorator) should take their place. Reading hypothetical code like
widgets.set_count(widgets.get_count() + 1) — where experienced Pythonistas would
instead have used the direct, readable phrasing widgets.count += 1 — would clearly
show that the hypothetical coder is ignoring or unaware of Python “best practices.” David’s
book goes a long way toward addressing this and other common misunderstandings.
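As a minimal sketch of the property-based style described above (the Widgets class and its internal counter are hypothetical illustrations, not code from the book):

class Widgets:
    def __init__(self):
        self._count = 0

    @property
    def count(self):
        return self._count

    @count.setter
    def count(self, value):
        self._count = value

widgets = Widgets()
widgets.count += 1   # reads naturally; no get_count()/set_count() needed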
Despite its overall intermediate level, the book does not hesitate to address quite a few
advanced topics, including the danger of catastrophic backtracking in regular expressions,
some quirks in floating-point representations of numbers, “round-tripping” problems with
serialization approaches like JSON, etc. The coverage of such issues makes studying this
book definitely worthwhile, not just for Python programmers of intermediate skills, but
for advanced ones too.
—Alex Martelli
Preface
each other. Where it feels helpful, many discussions refer to other sections that might
provide background, or foreshadow elaboration in later sections.
In general, I am aiming at a reader who is an intermediate-level Python developer, or
perhaps an advanced beginner. I assume you know the basics of the Python programming
language; these discussions do not teach the most basic syntax and semantics that you
would find in a first course or first book on Python. Mostly I simply assume you have an
inquisitive mind and a wish to write code that is beautiful, efficient, and correct.
This book is written with Python 3.12 in mind, which was released in October 2023.
Code shown has been tested against 3.12 betas. The large majority of the code examples
will work in Python 3.8, which is the earliest version that has not passed end-of-life as of
mid-2023. In some cases, I note that code requires at least Python 3.10, which was
released on October 4, 2021; or occasionally at least Python 3.11, released on October 24,
2022. The large majority of the mistakes discussed within this book were mistakes already
in Python 3.8, although a few reflect improvements in later versions of Python.
Documents titled “What’s new in Python M.m.μ”1 have been maintained since at least
the Python 1.4 days (in 1996).2
Code Samples
Most of the code samples shown in this book use the Python REPL (Read-Evaluate-
Print-Loop). Or more specifically, they use the IPython (https://ipython.readthedocs.io)
enhanced REPL, but using the %doctest_mode magic to make the prompt and output
closely resemble the plain python REPL. One IPython “magic” that is used fairly
commonly in examples is %timeit; this wraps the standard library timeit module, but
provides an easy-to-use and adaptive way of timing an operation reliably. There are some
mistakes discussed in this book where a result is not per se wrong, but it takes orders of
magnitude longer to calculate than it should; this magic is used to illustrate that.
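A session using this magic might look like the following minimal illustration (the timing numbers themselves vary by machine, so none are shown here):

In [1]: %timeit sum(range(10_000))
...per-loop timing output, which varies by machine...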
When you write your own code, of course, interaction within a REPL—including
within Jupyter notebooks (https://jupyter.org) or other richly interactive
environments—will only be a small part of what you write. But the mistakes in this book
try to focus on samples of code that are as narrow as possible. An interactive shell is often a
good way to illustrate these mistakes; I encourage you to borrow the lessons you learn, and
copy them into full *.py files. Ideally these discussions can be adapted into rich codebases
after starting as mere snippets.
At times when presenting commands run in the operating system shell (i.e., running a
Python script to show results), I display the command prompt [BetterPython]$ to
provide a quick visual clue. This is not actually the prompt on my personal machine, but
rather something I could change it to if I wanted to do so. On Unix-like systems,
the $ is often (but not always) part of shell prompts.
1. Python does not strictly use Semantic Versioning (https://semver.org), so my implied nomenclature
“major.minor.micro” is not strictly accurate.
2. See https://docs.python.org/3/whatsnew/index.html for an index of past release notes.
Many developers who have come from other programming languages, or who
are just beginning programming in general, may not appreciate how amazingly
versatile and useful an interactive shell can be. More often than not, when I
wish to figure out how I might go about some programming task, I jump into a
Python, IPython, or Jupyter environment to get a more solid understanding of
how my imagined approach to a problem will work out.
A quick example of such a session, for me within a bash terminal, might
look like this:
[BetterPython]$ ipython
Python 3.11.0 | packaged by conda-forge |
(main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %doctest_mode # 1
Exception reporting mode: Plain
Doctest mode is: ON
>>> from collections import ChainMap # 2
>>> ChainMap? # 3
Init signature: ChainMap(*maps)
Docstring:
A ChainMap groups multiple dicts (or other mappings) together
to create a single, updateable view.
[...]
File: ~/miniconda3/lib/python3.11/collections/__init__.py
Type: ABCMeta
>>> dict1 = dict(foo=1, bar=2, baz=3)
>>> dict2 = {"bar": 7, "blam": 55}
>>> chain = ChainMap(dict1, dict2)
>>> chain["blam"], chain["bar"] # 4
(55, 2)
>>> !ls src/d*.adoc # 5
src/datastruct2.adoc src/datastruct.adoc
Different programming environments will treat copying/pasting code samples into them
differently. Within IPython itself, using the %paste magic will ignore the leading >>> or
... characters in an appropriate way. Various other shells, IDEs, and code editors will
behave differently. Many of the code samples that are presented outside a REPL, and also
many of the data files used, are available at https://gnosis.cx/better. Moreover, paths are
mostly simplified for presentation; files often live within the code/ or data/ subdirectories
of the book’s website, but those paths are usually not shown. In other words, the code
presented is used to explain concepts, not as reusable code I intend for you to copy directly.
(That said, you may use it, of course.) In particular, much of the code shown is code that
has foibles in it; for that code, I most certainly do not want you to use it in production.
All code blocks whose title includes “Source code of <filename>” are available for
download from https://gnosis.cx/better. In some cases, the code shown in this book is an
excerpt from a longer file that is named there. All other code blocks, whether titled to aid navigation
or untitled, are present only to explain concepts; of course, you are free to use them by
copying, retyping, or adapting for your purpose.
Opinions about using Black vary among Pythonistas. I have found that even if Black
occasionally formats code in a manner I wouldn’t entirely choose, enforcing consistency
when working with other developers aids the readability of shared code, especially on large
projects.
A very impressive recent project for linting and code formatting is Ruff
(https://beta.ruff.rs/docs/). Ruff covers most of the same linting rules as Flake8 and other
tools, but is written in Rust and runs several orders of magnitude faster than other linters.
As well, Ruff provides auto-formatting similar to Black, but cleans up many things that
Black does not address. (However, Black also cleans things that Ruff does not; they are
complementary.)
In modern Python development, type annotations and type-checking tools are in
relatively widespread use. The most popular of these tools are probably Mypy
(http://mypy-lang.org/), Pytype (https://google.github.io/pytype/), Pyright
(https://github.com/Microsoft/pyright), and Pyre (https://pyre-check.org/). All of these
tools have virtues, especially for large-scale projects, but this book generally avoids
discussion of the Python type-checking ecosystem. The kinds of mistakes that type
checking can detect are mostly disjoint from the semantic and stylistic issues that we
discuss herein.
Register your copy of Better Python Code on the InformIT site for convenient access
to updates and/or corrections as they become available. To start the registration process,
go to informit.com/register and log in or create an account. Enter the product ISBN
(9780138320942) and click Submit. Look on the Registered Products tab for an Access
Bonus Content link next to this product, and follow that link to access any available bonus
materials. If you would like to be notified of exclusive offers on new editions and updates,
please check the box to receive email from us.
About the Author
David Mertz, Ph.D., has been a member of the Python community for a long time:
about 25 years—long enough to remember what was new about Python 1.5 versus 1.4.
He has followed the development of the language closely, given keynote talks about many
of the changes over versions, and his writing has had a modicum of influence on the
directions Python and popular libraries for it have taken. David has taught Python to
scientists, developers coming from other languages, and programming neophytes.
You can find voluminous details about his publications at https://gnosis.cx/publish
/resumes/david-mertz-publications.pdf. You can learn more about where he has worked at
https://gnosis.cx/publish/resumes/david-mertz-resume.pdf.
Introduction
The slightly joking term Pythonic is widely used in the Python community. In a
general way it means “reflecting good programming practices for the Python
language.” But there is also something just slightly ineffable about the term in
a way similar to how other programmers use fragile, elegant, or robust in
describing particular code. You’ll see the terms Pythonic and unpythonic quite
a bit in this book.
In a related bit of Pythonic humor—the language was, after all, named after
the Monty Python comedy troupe—we often use the term Pythonista for
developers who have mastered Pythonic programming.
To a fairly large extent, being Pythonic is a goal to improve the readability of your
programs so that other users and contributors can easily understand your intention, the
behavior of the code, and indeed identify its bugs. There are, as well, many times when
being unpythonic leads to unexpected behavior, and hence harms functionality in edge cases
you may not have considered or tried out during initial usage.
In this book, I am not shy in being opinionated about good Python practices.
Throughout the discussions, I try to explain the motivations for these opinions, and reflect
on my long experience using, teaching, and writing about Python. It is a truly delightful
programming language, about which I have sincere enthusiasm.
Much of what we hope for in Python code is explained by The Zen of Python.
There are many topics within the world of Python programming that are very
important, but are not addressed in this short book. The appendix, Topics for Other Books,
gives some pointers to resources, and brief summaries of the ideas I think you would do
well to pursue when learning them.
1
Looping Over the Wrong Things
As in most procedural programming languages, Python has two kinds of loops: for and
while. The semantics of these loops is very similar to that in most analogous languages.
However, Python puts a special emphasis on looping over iterables—including lazy
iterables along with concrete collections—which many languages do not do. Many of the
mistakes in this chapter are “best practices” in other programming languages but are either
stylistically flawed or unnecessarily fragile when translated too directly into Python.
Technically, Python also allows for recursion, which is another way of “looping,” but
recursion depth is limited in Python, and tail-call optimization (see https://en.wikipedia.org
/wiki/Tail_call) is absent. Recursion can be enormously useful for problems that are
naturally subdivided as a means of reaching their solution, but it is rarely a good approach in
Python for constructing a mere sequence of similar actions.
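The recursion limit is easy to inspect (and, with care, adjust) via the sys module:

>>> import sys
>>> sys.getrecursionlimit()  # CPython's default limit
1000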
None of the mistakes in this chapter deal specifically with recursion, but programmers
coming from Lisp-family languages, ML-family languages, or perhaps Haskell, Rust, Lua,
Scala, or Erlang should note that while recursion may be a good habit in the language you
used to program, it can be a bad habit for Python.
Let’s suppose that we have a function called get_word() that will return a word each
time it is called, generally different words on different calls. For example, this function
might be responding to data sent over a wire in some manner or calculated dynamically
based on something else about the state of the program. For this toy function, if
get_word() returns None, its data source is depleted; moreover, a “word” for the purpose
of this example is a sequence of ASCII lowercase letters only.
It is straightforward, and commonplace, to write code similar to the following.
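The code block itself is not reproduced in this excerpt; given the description of get_word(), the commonplace version amounts to something like:

words = []
word = get_word()
while word is not None:  # None signals that the data source is depleted
    words.append(word)
    word = get_word()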
Readers of other discussions in this book might recognize the number of words
generated and guess the implementation of get_word() from that. But let’s assume that
the number of words and what they are can vary across each program run, and can vary
across multiple orders of magnitude.
In a bit of crude numerology, we assign a magic number to each word simply by
valuing 'a' as 1, 'b' as 2, and so on, through 'z' as 26, and adding those values. This
particular transformation isn’t important, but the idea of “calculate a value from each
datum” is very commonplace. We use the following function for this calculation.
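The function is not shown in this excerpt; a reconstruction matching the description (the name word_number is an assumption):

def word_number(word):
    # value 'a' as 1, 'b' as 2, ..., 'z' as 26, and sum the letter values
    return sum(ord(letter) - ord("a") + 1 for letter in word)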
Assuming that all we care about is this final generated chart, there is no reason we
needed to instantiate the full collection of words, only their magic numbers. Admittedly,
the toy example is too simple to show the full advantage of the refactoring, but a sensible
approach is to lazily construct a generator of only the data we actually care about and to
only utilize the intermediate data as it is needed. For example, this code produces the chart
shown in Figure 1.2.
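The code is likewise not reproduced here; a sketch of the lazy approach described (charting elided; two-argument iter() calls get_word() until it returns the sentinel None):

magic_numbers = (word_number(word) for word in iter(get_word, None))
numbers = list(magic_numbers)  # only the numbers are collected, never the words
# ... draw the chart from `numbers` ...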
The example shown in Figure 1.2 still needed to instantiate the list of numbers, but not
the list of actual words. If the “words” were some much larger, memory-consuming
object, this change would become more significant. For many scenarios, exclusively
incrementally processing each value from a generator individually, with no intermediate
collection at all, will suffice, and save considerable memory.
for i in range(len(items)):
    process(i, items[i])
Indeed, if you are not required to utilize the index position within a loop, utilizing the
index at all is generally a code smell in Python. A much more idiomatic option is simply:
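That is (assuming process() no longer needs the index at all):

for item in items:
    process(item)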
On those fairly common occasions when you need both the index and the underlying
item, using enumerate() is much more expressive and idiomatic:
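Namely:

for i, item in enumerate(items):
    process(i, item)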
In the relatively uncommon situations when I want the index but not the item itself, I
often use enumerate() anyway, and I use the Python convention of _ (single underscore)
representing “a value I don’t care about”:
for i, _ in enumerate(items):
    process(i, None)
An approach that I use from time to time, when I actually want to maintain several
increments, is to initialize the several counters prior to a loop, even if one of them could
derive from enumerate():
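A sketch of that pattern (is_foo() and is_bar() stand in for whatever predicates apply):

total = n_foo = n_bar = 0
for item in items:
    total += 1          # could equally derive from enumerate()
    if is_foo(item):
        n_foo += 1
    elif is_bar(item):
        n_bar += 1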
In the example, total could equally well be reset while enumerating the loop itself,
but you might wish to emphasize the parallel with n_foo and n_bar, and that is probably
better expressed as shown here.
Specifically, the following identity will always apply, for every dictionary (unless, with
great perversity, you could break this identity in a subclass of dict or in some custom
mapping):
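The identity itself is not shown in this excerpt, but it amounts to:

list(dct) == list(dct.keys())  # always True for any dict `dct`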
In other words, if you wish to loop over keys, you should just write:
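for key in dct:
    process(key)   # process() is a placeholder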
However, it is relatively uncommon to wish to loop only over the keys of a dictionary.
Even if you only rarely, in one branch of the code, actually use the value, it costs nearly
nothing to include it as a loop variable. Remember, Python objects are accessed by
reference; you just assign a reference to an existing object by binding a loop variable; you
don’t copy or create an object.
In other words, don’t bother with:
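for key in dct.keys():
    process(key)   # the .keys() adds nothing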
This problem is one that linters are likely to warn about—as was probably the prior one
about using enumerate()—but understanding the mechanisms of loops goes further than
just reading a warning.
This toy code is mostly pointless in itself, but we do a selective operation on only those
elements of an iterable that meet some predicate. One thing we could certainly do rather
than print off individual characters is re-aggregate those passing the filter into some new
collection. That approach is generally a perfectly good solution to all of the mutation
issues, so keep it in your pocket as an option.
Suppose we want to try something similar using mutable collections rather than an
immutable string.
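The code is omitted from this excerpt; a minimal reconstruction of the same failure (filtering out the letter "t", using a word with adjacent occurrences to trigger the misalignment):

>>> my_list = list("cattle")
>>> for i, ch in enumerate(my_list):
...     if ch == "t":
...         del my_list[i]  # shifts later elements left, desynchronizing i
...
>>> my_list
['c', 'a', 't', 'l', 'e']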
In this code, things appear superficially to work correctly. No exceptions are raised.
We genuinely do get a list or bytearray with some characters removed. However,
looking slightly more closely we see that one of the t characters that should be filtered out
remains in the mutated object. This happened because once an element was deleted, the
index position no longer aligned with the actual underlying sequence. A corresponding
problem would arise with insertion of new elements.
The correct way to approach this requirement is simply to create a brand-new object
based on the predicate applied and selectively append to it. An append is a cheap operation
on a Python list or bytearray (however, insertion into the middle of a new sequence
can easily hit quadratic complexity—a danger warned about in other parts of this book).
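In sketch form, continuing the reconstruction above:

>>> filtered = []
>>> for ch in "cattle":
...     if ch != "t":       # keep only elements passing the predicate
...         filtered.append(ch)
...
>>> filtered
['c', 'a', 'l', 'e']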
Recall that you can also make a (shallow) copy of a sequence simply by taking the null
slice of it. In slightly different scenarios, my_list[:] or my_ba[:] can often be useful as
easy syntax for creating a new sequence containing the same items.
1. Readers are welcome to try to guess what get_data() and predicate() are doing. Come prepared with a
deep understanding of the Mersenne Twister pseudo-random number generator (PRNG).
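The iterator consumed in the following loop is created by code not included in this excerpt; any iterator over the same primes reproduces the output, e.g.:

>>> iterator = iter([2, 3, 5, 7, 11])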
>>> try:
...     while True:
...         item = next(iterator)
...         print("Current item:", item)
... except StopIteration:
...     pass
...
Current item: 2
Current item: 3
Current item: 5
Current item: 7
Current item: 11
Obviously, you can do the same conditional branching, break, continue, or all the
other actions you might put inside a for loop in the preceding while construct.
Notwithstanding their formal equivalence (possibly with a small number of extra lines
to force it), it is far more common for a for loop to feel Pythonic than it is for a while
loop. This general advice has many exceptions, but you will find that almost always when
you loop in Python, it is either over a collection or over an iterable (such as a generator
function, generator comprehension, or custom iterable class). In many of the times when
this is not the case, it is a call to refactor the part of your code that provides data to operate
into an iterable.
It is not a mistake to use while, but whenever you find yourself writing it, you should
still ask yourself: “Can I do this as a for loop?” Ask yourself the same question for code
you are in a position to refactor. The answer may well be that the while loop is the most
expressive and clearest version, but the question should still have occurred to you.
Thinking in terms of (potentially infinite) sequences usually promotes clear and elegant
design in Python.
The repetition of the assignment to val both before the loop and within the loop body
just feels slightly wrong from a stylistic and code-clarity perspective (although, it’s not an
actual mistake).
A probably even less aesthetically pleasing variant on this pattern is to use break within
the body to avoid the repetition.
Since Python 3.8, we’ve had the option of using the so-called “walrus operator” to
simplify this structure. The operator is fancifully named for its resemblance to an emoticon
of a walrus with eyes and tusks. The walrus operator (:=) allows you to assign a value
within an expression rather than only as a statement.
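In sketch form (get_data() stands in for whatever produces successive values, as in the earlier footnote):

# Repetitive: assign before the loop and again inside its body
val = get_data()
while val is not None:
    process(val)
    val = get_data()

# Walrus operator: a single assignment inside the loop condition
while (val := get_data()) is not None:
    process(val)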
In both cases where the predicate is inside the while statement, the loop might be
entered as few as zero times. With a while True, the loop is always entered at least once,
but it might terminate early (the “and-a-half”) if some condition occurs.
Using the walrus operator within an if statement is very similar in both providing a
value and perhaps not running the suite based on that value.
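The setup for the next example is not included in this excerpt; it amounts to something like the following (the Station namedtuple matches the output below; all filenames other than station-names.txt are assumptions):

from collections import namedtuple
from pathlib import Path

Station = namedtuple("Station", "name latitude longitude elevation")

names = Path("station-names.txt").read_text().splitlines()
lats = Path("station-latitudes.txt").read_text().splitlines()   # hypothetical filename
lons = Path("station-longitudes.txt").read_text().splitlines()  # hypothetical filename
els = Path("station-elevations.txt").read_text().splitlines()   # hypothetical filename
assert len(names) == len(lats) == len(lons) == len(els) == 1255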
>>> from pprint import pprint
>>> stations = []
>>> for i in range(1255):
...     station = Station(names[i], lats[i], lons[i], els[i])
...     stations.append(station)
...
>>> pprint(stations[:4])
[Station(name='JAN MAYEN NOR NAVY', latitude='70.9333333',
longitude='-8.6666667', elevation='9.0'),
Station(name='SORSTOKKEN', latitude='59.791925',
longitude='5.34085', elevation='48.76'),
Station(name='VERLEGENHUKEN', latitude='80.05',
longitude='16.25', elevation='8.0'),
Station(name='HORNSUND', latitude='77.0',
longitude='15.5', elevation='12.0')]
The assertion in the example checks that all these files indeed have the same amount of
data. More robust error handling is possible, of course. The use of pathlib in the example
assures that files are closed after they are read in. Using pathlib gives you a similar
guarantee about proper cleanup to using context managers, which are discussed in
Chapter 3.
The prior code is not terrible, but it can be made more Pythonic. As one improvement,
we can notice that open file handles are themselves iterable. As the main point, we do not
need intermediate lists to perform this action, nor do we need to separately access
corresponding index positions within each. This calls back to several mistakes discussed in
this chapter of focusing on where a datum occurs in a collection rather than directly on the
data itself.
Cleaner code to build a list of station data namedtuples might look like this.
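That cleaner version is not reproduced in this excerpt; a sketch consistent with the surrounding discussion, zipping the open file handles directly (the parenthesized with-statement needs Python 3.10+; filenames other than station-names.txt are assumptions):

>>> with (
...     open("station-names.txt") as names,
...     open("station-latitudes.txt") as lats,
...     open("station-longitudes.txt") as lons,
...     open("station-elevations.txt") as els,
... ):
...     stations = [
...         Station(*(field.strip() for field in fields))
...         for fields in zip(names, lats, lons, els)
...     ]
...
>>> pprint(stations[:4])
[...]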
Station(name='HORNSUND', latitude='77.0',
longitude='15.5', elevation='12.0')]
>>> names
<_io.TextIOWrapper name='station-names.txt' mode='r' encoding='UTF-8'>
>>> next(names)
Traceback (most recent call last):
[...]
ValueError: I/O operation on closed file.
The assertion shown catches that the length of the generated list of objects isn’t exactly
1,255; but we would like code that is flexible enough to handle corresponding data with a
different number of items than that precise number.
There are two reasonable approaches we can take when we want to enforce a degree of
data consistency but do not necessarily know an exact data size to expect: requiring that all
the data files in fact are of matching length, or padding fields where data is not available.
Either is reasonable, depending on your purpose.
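A sketch of the first, strict-matching approach, using zip(strict=True) (available since Python 3.10):

stations = [
    Station(*(field.strip() for field in fields))
    for fields in zip(names, lats, lons, els, strict=True)
]
# raises ValueError as soon as any input has a different number of items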
This approach is very helpful in working independently of the length of the several
streams of data, merely enforcing that they are the same. And it is very much a “fail fast”
approach, which is almost universally desirable.
However, there are likewise definitely situations where imputing sentinel values for
missing data is more appropriate. A sentinel is a special value that can mark a “special”
situation for a data point. A very common sentinel, in many contexts, is None. Sometimes
you might use a value like -1 as a sentinel to indicate that “normal” values are positive. At
other times, you might include a defined name like my_sentinel = object() to
guarantee that this value is distinct from everything else in your program. Filling in
imputed values is easy with zip_longest().
In the case of zip_longest(), shorter iterables are simply filled in with some sentinel.
None is the default, but it is configurable using the argument fillvalue.
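For instance, in a minimal illustration:

>>> from itertools import zip_longest
>>> list(zip_longest("abc", [1, 2], fillvalue=None))
[('a', 1), ('b', 2), ('c', None)]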
Neither of the approaches in this section is flawless, of course. In particular, having
items from iterables correspond correctly is a much stricter requirement than having them
align correctly. If one series drops item 10 and another drops item 20, they could still
fortuitously be the same length overall. These functions are powerful, but they cannot
answer all the important questions about data consistency.
1.9 Wrapping Up
One of the loveliest elements of modern Python is its emphasis on looping over iterables,
including those that are not concrete collections. In some mistakes in Chapter 4, Advanced
Python Usage, we look at explicit “iterator algebra.” This chapter reflects patterns and
habits you will use nearly every time you write Python code; we have emphasized
Python’s focus on looping over the data you are actually concerned with rather than over
indirections towards it.
Beyond those mistakes that guide you to emphasize the right things to loop over, we
also looked at the danger of mutating concrete collections during iteration and at how
while loops, when they are the more elegant approach, can benefit from use of the newish
walrus operator.
2
Confusing Equality with Identity
Most objects in Python are mutable. Moreover, all objects in Python are accessed by
reference. For objects that are immutable, such as strings, numbers, and frozensets,
comparing for equality or inequality rarely raises a concern about whether those objects
are also identical. However, for mutable objects such as mutable collections, it becomes
very important to distinguish identity from mere equality.
In many programming languages, a distinction is made between pass by value, pass by
address, and pass by reference (occasionally pass by name occurs as well). Python behaves most
similarly to reference passing, but in Python lingo we often emphasize Python’s semantics
by calling the behavior pass by object reference. Being thoroughly object oriented, Python
always encapsulates its objects, regardless of the scope they exist within. It’s not the value
that is passed into functions, nor is it a memory address, nor is it a variable name, it’s
simply an object.
What becomes important to consider is whether an object passed into a function or
method is immutable. If it is, then it behaves very much like a passed value in other
languages (since it cannot be changed at all and therefore is not in the calling scope). The
particular name an object has in different scopes can vary, but the object remains the same
under each name. If that object is mutable, it might be mutated within a child scope,
changing it within the calling scope (or elsewhere within the runtime behavior of a
program).
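A minimal illustration of a mutable object changed within a child scope:

>>> def appender(lst):
...     lst.append(99)   # mutates the very object the caller passed
...
>>> nums = [1, 2]
>>> appender(nums)
>>> nums
[1, 2, 99]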
String interning
>>> e = "foobar"
>>> f = "foo" + "bar"
>>> e == f, e is f
(True, True)
>>> g = "flimflam"
>>> h = ".join(["flim", "flam"])
>>> g == h, g is h
(True, False)
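The closure example that the following paragraph and callouts describe is missing from this excerpt; a reconstruction consistent with the JavaScript comparison below (which prints 15, 105, and 1005):

>>> adders = []
>>> for n in [10, 100, 1000]:
...     adders.append(lambda m: m + n)  # 1
...
>>> for adder in adders:  # 2
...     print(adder(5))
...
1005
1005
1005

Each closure sees the final value of n (1000), not the value at the time the lambda was created.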
However, Python binds by name in this circumstance, rather than binding by value. The
value eventually utilized is the final value a variable takes on at the time the closure
function is eventually called.
1 The lambda does nothing special here; using a def adder inner function definition
produces the exact same behavior.
2 Notice that adders is a list of functions, each of which is called in the loop.
The term closure is a bit of computer science lingo that, while important,
might not be familiar to people who are new to programming or have not
studied theoretical aspects. Don’t worry, it’s not as bad as it seems.
In programming languages, a (lexical; i.e., nested scope) closure means
that variables defined outside the current scope of a function definition are
“closed over.” That is to say, the function itself in some manner captures those
variables and they can continue to be used later when the function is called.
As we will see, however, whereas many programming languages “capture”
variables as their values, Python captures them as their names.
// The start of this JavaScript snippet is lost in this excerpt; the
// reconstruction below is consistent with the output shown (run in the
// Node.js REPL, whence the trailing `undefined`).
> const adders = [];
> for (const n of [10, 100, 1000]) {
...     adders.push((m) => m + n);
... }
> for (const adder of adders) {
...     console.log(adder(5));
... };
15
105
1005
undefined
In the JavaScript comparison, the const keyword is forcing “expected” scoping, but we
can accomplish the same thing in Python by using keyword binding to force more obvious
scoping. To get the output that most newcomers—and probably most experienced Python
developers as well—expect, force early binding by assigning default arguments.
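A sketch of that fix applied to the reconstruction above:

>>> adders = []
>>> for n in [10, 100, 1000]:
...     adders.append(lambda m, n=n: m + n)  # n=n binds the current value now
...
>>> for adder in adders:
...     print(adder(5))
...
15
105
1005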
As a quick guide, numbers that are equal to zero are falsy. So are collections
that are empty. So are string-like objects of length zero. Naturally, so are the
singletons False and None. When you see those values in a “Boolean context”
they are equivalent to an actual False. Most other objects are truthy.
Well-known objects that are neither truthy nor falsy are NumPy arrays and
Pandas Series and DataFrames.
>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1, 2])
>>> bool(arr)
ValueError: The truth value of an array with more than one
element is ambiguous. Use a.any() or a.all()
Whereas is True and is False have a narrow edge case where they can make sense,
using obj == True or obj == False will always cause a feeling of unease among
Pythonistas since True and False are unique identities already. In Python, numbers that
aren’t zero and collections that aren’t empty are truthy, and zeros and empty collections are
falsy. This is as much as we want to know for most constructs.
Some variations you might see will try to check more than explicitly needed.
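For example, sketches of the over-explicit style (process() and do_thing() are placeholders):

if len(mylist) > 0:              # more Pythonic: if mylist:
    process(mylist)
if count != 0 and flag == True:  # more Pythonic: if count and flag:
    do_thing()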
For the most part, the mistake of coercing actual True or False values from merely
“truthy” or “falsy” is simply stylistic and does not harm program operation. But such use
has a strong code smell that should be avoided.
Often the habit of spelling out is True is borrowed from SQL, where database
columns might both be of Boolean type and be nullable.1 However, sometimes you
encounter a similar usage in existing Python code. In SQL, these checks actually do make
sense, as shown in the following code.
1 In many SQL dialects, we can get away with Python-like bare values, but best
practice in that language remains to be explicit.
Sometimes in Python code, you will see a sentinel used within a function that normally
returns an actual True or False. Often this sentinel is None, but sometimes other values
are used. The problem here, of course, is that a sentinel almost certainly has a truthiness
that can be deceptive in code using such a function.
If you are writing code from scratch, or refactoring it, you are better off using an
explicit enumeration, utilizing the well-designed enum standard library module. But in the
real world, you will probably encounter and need to use code that does not do this.
1. Effectively, the nullable Boolean type gives you a trinary, or “three-valued,” logic (https://en.wikipedia.org
/wiki/Three-valued_logic).
If you are free to redesign this function, you might define Vowel =
enum.Enum("Vowel", ["Yes", "No", "Maybe"]) and then return Vowel.Yes, Vowel.No,
or Vowel.Maybe within the function, as appropriate. Comparisons will require an explicit
identity (or equality) check, but that clarifies the intention better for this case anyway.
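A sketch of that redesign (the function and its treatment of "y" as the "sometimes" case are assumptions):

import enum

Vowel = enum.Enum("Vowel", ["Yes", "No", "Maybe"])

def categorize(letter):
    if letter.lower() in "aeiou":
        return Vowel.Yes
    elif letter.lower() == "y":   # "sometimes y"
        return Vowel.Maybe
    return Vowel.No

if categorize("y") is Vowel.Maybe:   # explicit identity check
    print("It depends...")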
1 Warnings in Python 3.10, 3.11, and 3.12 have gotten noticeably more
precise. This friendly reminder is a good example.
For your own classes, the “singleton pattern” is a poor choice in Python. It
is possible to implement, yes, but to accomplish all the same goals, Alex
Martelli’s “Borg idiom” is uniformly more Pythonic:
class Borg:
    _the_collective = {}

    def __init__(self):
        self.__dict__ = self._the_collective
Many Borg can exist, but every attribute and method is shared between
each of them. None, however, remains properly a singleton.
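For instance:

>>> borg1, borg2 = Borg(), Borg()
>>> borg1.x = 55
>>> borg2.x          # the attribute is shared across all Borg
55
>>> borg1 is borg2   # ...yet the instances are distinct objects
False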
2. In a provocative blog post title, Florimond Manca declared in 2018 that “Python Mutable Defaults Are The
Source of All Evil.” A great many other writers have given the same warning with somewhat less florid language.
I take a somewhat more sympathetic view of Python behavior around mutable values
used as named arguments than do many of my colleagues; however, I will probably admit
that my affection arises in overly large part from knowing the “trick” for a long time, and
having written about it in more-or-less positive terms in 2001. Moreover, in Python 2.1
where I first wrote about playing with this behavior, many alternatives that now exist had
not yet entered the language.
Let’s look at a simple function to illustrate the issue. We have several word list files on
disk:
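Neither the file listing nor the function is reproduced in this excerpt; a reconstruction of the kind of function under discussion (its name and the exact file format are assumptions):

def get_words(fname, initial_words=[]):   # the mutable default is the bug
    with open(fname) as fh:
        initial_words.extend(fh.read().split())
    return initial_words

get_words("a-words.txt")        # first pass: as expected
get_words("b-words.txt")        # second pass: surprisingly cumulative
get_words("b-words.txt", [])    # third pass: a fresh list, so not cumulative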
So far, so good. We might want to start with some initial list elements but add more
from a file. Straightforward enough. Let’s try it again.
At the first pass of just reading a-words.txt, all seems well. At the second pass of also
reading in b-words.txt, we notice with surprise that our results are becoming cumulative
rather than calls being independent. However, it gets even weirder on the third pass in
which we read in b-words.txt anew, but it stops being cumulative again.
Understanding what is occurring is not genuinely difficult once you think about the
execution model of Python. Default argument values are evaluated at definition time. All lines
of Python, within a given scope, are evaluated at definition time, so this should be
unsurprising. The list initial_words is defined once at definition time, and the same
object gets extended during each call (unless a different object is substituted for a call). But
OK, I get it. It’s weird behavior.
If we want statefulness in a function call (or in something equivalent) we have several
good approaches to doing that which don’t use the “mutable default value” hack.
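One such approach is an ordinary class that holds its state explicitly (a sketch; the exact API in the book may differ):

class WordList:
    def __init__(self, initial_words=None):
        self.words = list(initial_words) if initial_words else []

    def add_from_file(self, fname):
        with open(fname) as fh:
            self.words.extend(fh.read().split())
        return self.words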
You can easily tweak this API to your precise needs, but it clearly gets both statefulness
and easy comprehensibility.
We can control statefulness in this design simply by deciding whether or not to pass in a
current state to mutate or by skipping that argument for a fresh list result.
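In sketch form, the same reader as a plain function with a None default:

def get_words(fname, initial_words=None):
    words = [] if initial_words is None else initial_words
    with open(fname) as fh:
        words.extend(fh.read().split())
    return words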
Generator-based statefulness
>>> def word_injector(initial_words=None):
...     words = [] if initial_words is None else initial_words
...     while True:
...         fname = (yield words)
...         if fname is not None:
...             # Assumed body (not shown in this excerpt): read the named
...             # file and extend the word list with its words
...             with open(fname) as fh:
...                 words.extend(fh.read().split())
...

1 A plain next() call will always simply retrieve the current state of the word list.
2 Read about the .send() method on generators at https://docs.python.org/3/reference/expressions.html#generator.send.
This approach resembles functional programming paradigms. If we want multiple
stateful “instances” of a word list, we do not instantiate a class, but rather simply create
new generator objects from a generator function. All statefulness is purely internal to the
position of the generator within the while True loop.
If we really wanted to, we could use a sentinel like _RESET to inject (.send()) in place
of a filename; but that is not really necessary. It is easier simply to create a new generator
that is started with values from an existing generator using either next(old_words) or
old_words.send(newfile). Or, for that matter, you can simply start a new generator
with a list from any arbitrary code that might have created a word list by whatever means.
It seems like we have a nice grid, as we wished for. However, let’s try populating it:
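The grid-building code is not shown in this excerpt; the pattern at issue is list multiplication:

>>> grid = [[0] * 5] * 5
>>> grid[0][0] = 1
>>> grid
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]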
Rather than having created a grid, we’ve created a list of five references to the identical
object (a list in this case, but the same danger lurks for any mutable object type).
There are a number of ways you might fix this once you remember the problem.
Probably the easiest solution is to use comprehensions rather than the list multiplication
shortcut.
3. If you work with tabular data, however, do consider whether NumPy or Pandas, or another DataFrame library,
is a better choice for your purpose.
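In sketch form:

>>> grid = [[0] * 5 for _ in range(5)]
>>> grid[0][0] = 1
>>> grid[1]                            # other rows are unaffected
[0, 0, 0, 0, 0]
>>> len({id(row) for row in grid})     # five distinct list objects
5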
We have a list of length 5, each item being a distinct list (as indicated by their different
IDs) that can be modified independently.
A prior discussion looked at why you should never use x == None in your code; but
ultimately, that was a stylistic issue and one of Pythonicity. Ultimately, none of your
programs would break if you used that style violation. Interned values are different. You
might notice something like this:
>>> a = 5 * 5
>>> b = 21 + 4
>>> a is b, a == b
(True, True)
Thinking too cleverly along these lines, you might conclude that identity comparison is
probably faster than equality comparison. To a small extent you would be correct (at least
for Python 3.11, on my CPU and operating system):
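The timing session itself is not shown in this excerpt; it would look something like the following (the actual numbers vary by machine and are omitted):

In [1]: %timeit a is b   # identity: essentially a single pointer comparison
In [2]: %timeit a == b   # equality: must dispatch to __eq__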
The problem, of course, is that only some numbers (and some strings) that are equal are
also identical, and actual programs almost always need to make comparisons on values that
vary at runtime. Other than special singletons, or when you genuinely care whether two
custom objects (for example, objects at different positions in a collection) are identical,
stick with equality checks:
>>> c = 250 + 9
>>> d = 7 * 37
>>> c is d, c == d
(False, True)
2.7 Wrapping Up
The puzzles of equality and identity have stymied many accomplished programmers. In
Common Lisp, developers distinguish among eq, equal, eql, and equalp. In Scheme,
they settle for just =, eqv?, and equal?. In JavaScript, equality is notoriously, and
humorously, non-transitive. A well-known diagram (shown in Figure 2.1 as the
“theological trinity”) about JavaScript gives us a perspective into the comparative sanity of
Python, which maintains transitivity (absent deliberately pathological custom classes,
which can make all horrors possible).
In Python we do not have quite so many variations on a theme. Instead, we have a test
for identical objects with is and a test for equivalent objects with ==. The semantics are
relatively straightforward, but many mistakes still occur as developers try to decide what
they mean between these concepts.
A footnote to this chapter might add that while Python’s standard library has the very
useful function copy.deepcopy() to recursively copy nested collections, there does not
exist in the standard library any function for deepequality() that would, hypothetically,
recursively compare such nested collections. A great many recipes you can find online have
implemented such a thing, but they are each slightly different and none ascend to a
ubiquity meriting inclusion in these discussions. This provides an opportunity for you to
make your very own novel mistake.
3
A Grab Bag of Python Gotchas
This chapter looks at concerns one encounters—and mistakes one might often
make—strictly within the Python language itself. Like the prior two chapters on
respectively looping and equality versus identity, these discussions are about the core of the
Python language. Later chapters look at less common language constructs, less used or
more specialized standard library modules, and some very common third-party modules.
While the discussions in this chapter are somewhat heterogeneous, they also address a
few of the most common failings real-world Python code encounters, in my experience. A fair
number of the issues in this chapter reflect the use of habits developed for other
programming languages that are less well suited to Python code.
There are two hard things in computer science: cache invalidation, naming
things, and off-by-one errors.
In this section, we will look at where naming can go wrong. The kinds of mistakes
addressed in this section are somewhat heterogeneous, but all pertain in one way or
another to ways that choosing names badly can either cause your programs to break
outright, or at the least make them fragile, ugly, and unpythonic.
Unfortunately, it’s just complicated. This is not the book to explore the details of Python’s
import system, but a good summary write-up is available at https://docs.python.org/3
/library/sys_path_init.html.
The upshot of this complication is that developers are very well served by avoiding
filenames that conflict with the names of standard library modules—or indeed with the
names of any other third-party packages or modules they intend to use. Unfortunately,
there are a lot of names in the latter category especially, and conflicts can arise innocently.
If you are uncertain about a conflict, or fear one may occur as you add later
dependencies, use of relative imports can often avoid these mistakes.
Let’s take a look at a short shell session:
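The session is missing from this excerpt; a minimal reconstruction consistent with the description that follows (the script name is hypothetical; the local re.py shadows the standard library module):

[BetterPython]$ cat re.py
print("Program goes BOOM!")

[BetterPython]$ cat script.py
import re   # intends the standard library, but finds the local re.py

[BetterPython]$ python script.py
Program goes BOOM!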
There are many “magical” ways I could obtain this odd behavior, but the one I used is
really not all that magical.
Notice that the message “Program goes BOOM!” is completely absent from this script.
That’s because it lives in re.py; not the version of that file that comes with the Python
standard library, but the version that happens to be at /home/dmertz/git/BetterPython/code
/re.py on my computer.
Of course, if you use NumPy or Pandas, the same kind of conflict might occur with
naming a local module numpy.py or pandas.py. So simply looking at the standard library
module list does not assure absence of conflict. But there are a lot of ways to come up with
distinctive names for the files in your own project.
Let’s suppose, however, that you really want to use a particular name. For example,
calendar is a standard library module, but one that is very old and that you’ve probably
never even thought about using. However, it’s a pretty good, generic name, one that could
very easily be a good choice for a submodule name within your own brand-new project.
When I mention that calendar is old, I really mean it. It was in Python 0.9 with
largely the same capabilities, in 1991:
[BetterPython]$ grep '0\.9' \
/home/dmertz/miniconda3/envs/py0.9/README
This is version 0.9 (the first beta release), patchlevel 1.
[BetterPython]$ /home/dmertz/miniconda3/envs/py0.9/bin/python
>>> import calendar
>>> calendar.prmonth(2023, 2)
Mon Tue Wed Thu Fri Sat Sun
          1   2   3   4   5
  6   7   8   9  10  11  12
 13  14  15  16  17  18  19
 20  21  22  23  24  25  26
 27  28
1 One or more dots before a module name indicate a relative import (see
https://docs.python.org/3/reference/import.html).
2 Yes! The API changed modestly in the 32 years between Python 0.9 and Python
3.12.
This script uses both the global and the local calendar.py module (the standard
library provides TextCalendar; the local module provides this_year and this_month).
Let’s run it:
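The script is not reproduced in this excerpt; its shape, given the description, is something like the following (assuming it lives in a package next to a local calendar.py defining this_year and this_month):

# mypkg/show_month.py (hypothetical path)
from calendar import TextCalendar             # the standard library module
from .calendar import this_year, this_month   # the local module, via relative import

TextCalendar().prmonth(this_year, this_month)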
You can, of course, use relative imports for less trivial modules and subpackages,
including across multiple levels of a directory hierarchy. See https://docs.python.org/
3/reference/import.html#package-relative-imports for details.
Avoid using the same names as other libraries, including the standard library, wherever
it feels reasonable to do so. As a fallback, relative imports are a reasonable solution to the
problem.
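The import lines of the next example are lost in this excerpt. Each of the three scripts the following discussion compares runs the identical loop body shown below, preceded by the three star-imports in a different order, along the lines of:

from math import *
from cmath import *
from numpy import *

Permuting these lines changes which module's sqrt(), ceil(), and isfinite() survive the name collisions, which is exactly the point of the comparison.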
inf = float('inf')
for fn, num in zip([sqrt, ceil, isfinite], [-1, 4.5, inf*1j]):
    try:
        print(f"{fn.__name__}({num}) -> {fn(num)}")
    except Exception as err:
        print(err)
Clearly, we have used three different versions of sqrt() since we arrived at three
different answers. It is less clear what is occurring for ceil() and isfinite().
ceil() has produced two different answers, varying in datatype. But that might be two
implementations, and it might be three implementations. As it turns out, cmath lacks an
implementation of ceil(), so one of the implementations in math and numpy is active for
the differing scripts; those different implementations merely happen to produce the same
result in this example.
isfinite() has also produced two different answers, although one answer isn’t really a
result but rather an exception. In any case, it turns out that there are three different
implementations of isfinite() involved here, with the numpy version accepting a variety
of optional arguments and being happy to operate elementwise on arrays, as well as on
scalars.
It is, of course, possible to overwrite a name imported from one module with a later
import from another module, even if the names are specified. But explicitly including the
names makes reasoning about what is happening much more obvious.
In the preceding example, it jumps out that we are repeatedly overwriting the name
sqrt, and whatever definitions cmath or numpy might provide are inaccessible because
only the definition in math will be used. If that name had not been present in one of the
earlier modules, we would see an immediate ImportError. Of course, we could change
our imports to use the namespaced cmath.sqrt instead; or we could use from cmath
import sqrt as csqrt to provide an alternate name. Whatever choice we make
becomes apparent from the code itself.
If you want to use a few others, just add them to the list. I would make the
same comment about collections.abc, in which names like AsyncIterable
and MutableMapping are extremely unlikely to be accidentally reused by some
unrelated module (even a third-party module). There is nothing there where
import * is likely to cause harm.
There are some other modules where I also do not worry about name
conflicts very much, but the specific functionalities you want are very limited.
If you want collections.namedtuple, there is little reason you will
necessarily want the handful of other collections inside it. dataclasses.dataclass,
fractions.Fraction, and decimal.Decimal are nearly the only
names inside those modules. In the last case, however, decimal.getcontext,
decimal.setcontext, and decimal.localcontext are actually likely to be
useful; so probably decimal is another of the few modules where I personally
would not object to import *.
asciiDocTree(BaseException)
builtins.BaseException
  builtins.BaseExceptionGroup
    builtins.ExceptionGroup
  builtins.Exception
    builtins.ArithmeticError
      builtins.FloatingPointError
      builtins.OverflowError
      builtins.ZeroDivisionError
    builtins.AssertionError
    builtins.AttributeError
    builtins.BufferError
    builtins.EOFError
    builtins.ImportError
      builtins.ModuleNotFoundError
      zipimport.ZipImportError
    builtins.LookupError
      builtins.IndexError
      builtins.KeyError
      encodings.CodecRegistryError
    builtins.MemoryError
    builtins.NameError
      builtins.UnboundLocalError
    builtins.OSError
      builtins.BlockingIOError
      builtins.ChildProcessError
      builtins.ConnectionError
        builtins.BrokenPipeError
        builtins.ConnectionAbortedError
        builtins.ConnectionRefusedError
        builtins.ConnectionResetError
      builtins.FileExistsError
      builtins.FileNotFoundError
      builtins.InterruptedError
      builtins.IsADirectoryError
      builtins.NotADirectoryError
      builtins.PermissionError
      builtins.ProcessLookupError
      builtins.TimeoutError
      io.UnsupportedOperation
      signal.ItimerError
    builtins.ReferenceError
    builtins.RuntimeError
      builtins.NotImplementedError
      builtins.RecursionError
      _frozen_importlib._DeadlockError
    builtins.StopAsyncIteration
    builtins.StopIteration
    builtins.SyntaxError
      builtins.IndentationError
        builtins.TabError
    builtins.SystemError
      encodings.CodecRegistryError
    builtins.TypeError
    builtins.ValueError
      builtins.UnicodeError
        builtins.UnicodeDecodeError
        builtins.UnicodeEncodeError
        builtins.UnicodeTranslateError
      io.UnsupportedOperation
    builtins.Warning
      builtins.BytesWarning
      builtins.DeprecationWarning
      builtins.EncodingWarning
      builtins.FutureWarning
      builtins.ImportWarning
      builtins.PendingDeprecationWarning
      builtins.ResourceWarning
      builtins.RuntimeWarning
      builtins.SyntaxWarning
      builtins.UnicodeWarning
      builtins.UserWarning
    builtins.ExceptionGroup
    warnings._OptionError
  builtins.GeneratorExit
  builtins.KeyboardInterrupt
  builtins.SystemExit
try:
    ratios = []
    num_fh = open(numerators)  # 1
    den_fh = open(denominators)
    for string_a, string_b in zip(num_fh, den_fh, strict=True):
        a = float(string_a.strip())
        b = float(string_b.strip())
        ratios.append(a/b)
    print(ratios)
except:
    print("Unable to perform divisions")
finally:
    num_fh.close()
    den_fh.close()
1 Another section discusses why a context manager around open() might be better.
However, the finally will perform cleanup of the open files and illustrates this
concern better.
Something went wrong in the suite of the try block. We have little information about
what kind of problem occurred, though. Let’s try to get a bit better visibility by modifying
the except on lines 15-16 to use the following.
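The replacement snippet is not shown in this excerpt; it presumably amounts to catching and reporting the exception object, along the lines of:

except Exception as err:
    print(f"Unable to perform divisions: {err!r}")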
Floating point numbers displayed by this script are shown with no more than six digits
following the decimal point to make the presentation cleaner. You will see longer
representations if you run this code on your system. To duplicate the permission error
listed, you will probably need to run chmod -r denominators3.txt or the equivalent
for your operating system.
When we run this, we get insight into what went wrong that was impossible with a
bare except. Let’s try running it with several different sets of data files:
Several different distinct errors occurred. The program can be made less fragile by
treating each of these errant conditions more specifically. Here is an example.
import sys

numerators = sys.argv[1]
denominators = sys.argv[2]

try:
    ratios = []
    num_fh = open(numerators)
    den_fh = open(denominators)
    line = 0
    for string_a, string_b in zip(num_fh, den_fh, strict=True):
        line += 1
        a = float(string_a.strip())
        b = float(string_b.strip())
        ratios.append(a / b)
    print([f"{r:.3f}" for r in ratios])
except ZeroDivisionError:
    print(f"Partial results: {[f'{r:.3f}' for r in ratios]}")
    print(f"Attempt to divide by zero at input line {line}")
except ValueError as err:
    print(f"Partial results: {[f'{r:.3f}' for r in ratios]}")
    desc = err.args[0]
    if "zip()" in desc:
        print(desc)
    elif "could not convert" in desc:
        print(f"String is not numeric at input line {line}")
except PermissionError:
    print(f"Partial results: {[f'{r:.3f}' for r in ratios]}")
    print("Insufficient permission to file(s). Run as sudo?")
except FileNotFoundError as err:
    print(f"Partial results: {[f'{r:.3f}' for r in ratios]}")
    print(f"File {err.filename} does not exist")
except OSError as err:
    # Superclass after PermissionError and FileNotFoundError
    print(f"Partial results: {[f'{r:.3f}' for r in ratios]}")
    print(err)
finally:
    try:
        num_fh.close()
        den_fh.close()
    except NameError:
        # Opened in same order as closed; if open() failed on the
        # second file, the first will still get closed here.
        pass
In a more robust program, we would likely take corrective action within these except
suites rather than simply print some varying message based on the type of error
encountered.
One notable feature of the preceding code is that we catch PermissionError and
FileNotFoundError, but then also catch their parent class OSError later in the sequence
of except clauses. We have a specific behavior we want if those exact things went wrong,
but still recognize that there are other ways that open() might fail that we haven’t
specifically thought about.
In some suites, there might be a final except Exception that does generic logging of
“everything that we haven’t thought of that could go wrong.” Such a suite might either
decide to continue the rest of the program, or use a bare raise to re-raise the same
exception, depending on what served the program’s purpose better.
Let’s run the new version:
I have not bothered to display the contents of the several data files because those don’t
really matter. They are all text files with one number listed per line. However, do notice
that the code considers the case of a line containing a string that cannot be converted to a
float.
Unfortunately, the zip(strict=True) and the float("abc") cases both raise
ValueError rather than more specific subclasses. We have to tease them apart inside the
except suite by looking at the actual messages contained inside the exception object. This
remains slightly fragile because Python does not guarantee that the wording of error
messages will stay stable between versions, and we are simply looking for substrings that
currently occur inside those messages. Indeed, far from guaranteeing stability, CPython
has made a conscious effort to improve the quality of error messages since 3.10; and
“improve” obviously means “change.”
The title of this pitfall discussion includes a little bit of computer science
jargon. The word quadratic in mathematics refers to a polynomial with degree
two. It actually means exactly the same thing in computer science, but the
connection is perhaps not obvious immediately.
In computer science, we often talk about the “big-O” behavior of various
algorithms (see https://en.wikipedia.org/wiki/Big_O_notation). This concern
arises in a number of discussions in this book.
In quick synopsis, big-O complexity expresses the relationship between the
size of the data being operated on and the time a computer will take to
perform the operation. The best we can hope for is O(1), which means that
the code will take the same amount of time no matter what size the data is.
More commonly, O(N) is achievable, which means that the compute time
increases in lockstep with the data size. A bit worse, but frequently seen, is
O(N×log N); this says that the compute time is the size of the data multiplied
by the logarithm of that size.
We start to become worried when we see quadratic, that is, O(N²),
behavior. Worse behaviors are possible, though. Some computation might
take time equal to the cube, or the 4th power, of the data size. The worst
behavior we commonly encounter is called exponential, that is, O(2^N). These
algorithms become intractable very quickly as data size grows; some so-called
“hard problems” cannot be improved over exponential complexity.
Python string concatenation uses an intuitive plus operator, although this same operator
has a very different meaning for other types. For example, + (which under the hood calls
the “dunder”1 methods .__add__() or .__radd__() on the class of the objects involved)
means addition for numbers and concatenation for lists as well as strings.
Code such as this is intuitive, readable, and perfectly Pythonic:
firstname = "David"
lastname = "Mertz"
degree = "Ph.D."
1. The term dunder is commonly used by Python programmers to refer to names that have both two leading and
two trailing underscores. These are discussed in a number of places throughout the book.
fullname = firstname + " " + lastname  # 1
if degree:
    fullname += ", " + degree
1 String concatenation
So far, this code remains reasonably Pythonic, and as used I have no complaint about it.
As with other examples ancillary to the main point of a section, or requiring larger datasets,
the source code and data file can be found at https://gnosis.cx/better. All that matters for
the current discussion is that get_word() returns some string each time it is called.2
But what happens if we try to generate larger phrases with this code?
Going from 10 words to 1,000 words is still mostly dominated by the time it takes to
randomly select from the 267,752 available words. So rather than taking 100 times as long,
we increase to about 200 times as long. However, increasing the size of the concatenated
2. Picking random words from the SOWPODS English word list (https://en.wikipedia.org/wiki/Collins
_Scrabble_Words) may not have the specific letter-spacing distributions that typesetters like for “Lorem ipsum”
samples, but we don’t really care for the purposes within this book.
string by another 100 times (approximately; words vary in length) takes about 5,500 times
as long.
What is happening here is that immutable strings are being allocated and deallocated
with many of the concatenations. It’s not quite on every concatenation since CPython uses
some overallocation internally, but it is common. This leads to approximately quadratic
(i.e., O(N²) on the number of words) growth in the complexity.
It happens that for the toy code I show, there is a solution that involves almost no
change to the lorem_ipsum() function. However, this approach does not generalize if
you are doing much more than building one single long string. Python is optimized to
treat in-place string concatenation more like it does appending to a list, which has
amortized O(N) behavior (the section “Deleting or Adding Elements to the Middle of a
List” in Chapter 7, Misusing Data Structures, discusses amortized cost further).
>>> def lorem_ipsum(n_words=10):
...     words = []
...     for _ in range(n_words):
...         words.append(get_word())
...     return " ".join(words)
...
>>> %timeit lorem_ipsum()
4.55 µs ± 54.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
each)
>>> %timeit lorem_ipsum(1000)
426 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops
each)
>>> %timeit lorem_ipsum(100_000)
47.5 ms ± 917 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a final str.join() is a few percent faster, which is not particularly important
(and doesn’t necessarily generalize). But the important thing is that it maintains linear
scaling as the size of the list/string grows.
Another approach is to use an io.StringIO stream.
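A sketch of that approach, reusing the same get_word() helper as earlier (this is one
plausible shape for the elided listing, not the book’s exact code):

from io import StringIO

def lorem_ipsum(n_words=10):
    # Write words into an in-memory text stream rather than
    # repeatedly concatenating immutable strings
    buf = StringIO()
    for _ in range(n_words):
        buf.write(get_word())
        buf.write(" ")
    return buf.getvalue().rstrip()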
Again, io.StringIO has the same linear scaling we are looking for, and is within a few
percent of the same speed as the other approaches. Using streams might be just slightly
slower in the simple case, but having a file-like object lets you do operations like .seek(),
.tell(), and .readlines() that are often independently useful. Moreover, if you need
to “scale up” to using an actual filesystem (for persistence, for example), many file-like
objects can be a drop-in replacement within the code.
import os

n_words = 0
for root, dir_, files in os.walk("src"):
    for name in files:
        if name.endswith(".adoc"):
            filepath = os.path.join(root, name)
            fh = open(filepath)
            n_words += len(fh.read().split())
print(f"Total words in book draft: {n_words:,}")
With this script, which I wrote just a few minutes ago, I can check my progress in
writing this book. The many smaller files in nested directories that make up the book are
written in a textual format called AsciiDoc (which is similar to reStructuredText or
Markdown; the only important focus here is that it’s basically just text files):
[BetterPython]$ code/wordcount
Total words in book draft: 65,376
As a word-count algorithm it’s fairly crude. More relevantly for this discussion, I have
relied solely on implicit garbage collection by Python. This will probably work fine for this
limited purpose. The reference count on the open file object will drop to zero when fh is
repeatedly rebound, and the .__del__() method of the TextIOWrapper object (which
fh is an instance of) will be called during this cleanup, closing the file.
This reasoning can quickly become less clear in more complex programs, however,
especially ones that utilize concurrency. At least two dangers, explained in the following
subsections, exist when files might not be closed because neither does flow control actually
arrive at a call to fh.close() nor does scoping or program termination succeed in forcing
garbage collection. Flow control can fail because if/elif/else branches are not fully
analyzed (and similarly for match/case in Python 3.10+), but most often it fails because
uncaught exceptions prevent the program from ever reaching the explicit fh.close(),
leaving dangling file handles around.
Considerably fewer than 10,000 files were created. We could, in concept, adjust the
specific number using resource.setrlimit(), but at some point we will hit limits of the
operating system itself; moreover, increasing it will cause lag in other operations the
computer is performing. Trying to open 10,000 temporary files at once is simply not a
good idea. Rather, we should take an approach that uses only a few files at a time, and
reopens them when needed rather than in advance.
1 Under the presumption that the same index is repeated, append mode is likely better.
2 As in other sections, a sample implementation of this function is at https://gnosis.cx
/better; any function that returns varying strings is equally illustrative.
Obviously this program is a toy example. Notice, however, that it has a .close()
method call included (which is not reached):
fh.write("Hello, world!\n")
fh.write("Goodbye!\n")
os._exit(1)
Done correctly, all the fh.write() lines produce output to crash.txt. You can read
more about writing your own context managers at https://docs.python.org/3/reference/
datamodel.html#context-managers. The excellent description in the Python
documentation describes the “guts” of how context managers work internally.
Not all objects can be compared for less-than inequality, which can
occasionally have the surprising effect that the sortability of a sequence
depends on the original order of elements.
While this can possibly occur, far more often sorting heterogeneous
iterables simply fails with some variety of TypeError. Still, we can see
situations like this:
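(The definition of Strange is elided in this excerpt; one hedged sketch that behaves
this way claims an ordering against anything at all:)

class Strange:
    def __repr__(self):
        return "StrangeObject"
    def __lt__(self, other):
        return False   # claims not-less-than anything
    def __gt__(self, other):
        return True    # claims greater-than anything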
>>> sorted([5, Strange(), 1, Strange(), 2+3j])
[1, 5, StrangeObject, StrangeObject, (2+3j)]
>>> sorted([5, Strange(), 2+3j, Strange(), 1])
[1, 5, StrangeObject, StrangeObject, (2+3j)]
>>> sorted([5, Strange(), 1, 2+3j, Strange()])
Traceback (most recent call last):
[...]
TypeError: '<' not supported between instances of
'complex' and 'int'
To really understand when this will succeed and when it will fail, for a given
sequence of objects that are partially incompatible, you need to understand
the details of the Timsort algorithm (https://en.wikipedia.org/wiki/Timsort).
Doing so is a worthwhile goal, but is not required to understand anything in
this book.
A more useful “thing” would presumably have additional attributes and methods, but
this suffices to show the scaffolding needed (the .__repr__() is optional, but it makes for
a more attractive display).
If a developer is not aware of the optional keyword argument key, which can be passed
to sorted() or to list.sort(), the code they write is likely to perform inefficiently or
just plain wrongly. In particular, such flawed code can sometimes wind up sorting on a
basis other than the sort order that is useful for the objects involved.
For example, suppose we wanted to sort “Things” not based on their numeric order,
but rather based on their numeric order within a ring of a given modulus (called Zₙ).
A first inclination might be to subclass Thing to have this behavior.
There might well be additional reasons to attach the modulus to the class itself, but
supposing we only cared about sorting, we could achieve the same effect more easily using
the following.
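For example, a brief sketch (assuming, as in the surrounding discussion, that each
Thing exposes its number as a .value attribute):

things = [Thing(7), Thing(3), Thing(11), Thing(4)]
in_ring_order = sorted(things, key=lambda t: t.value % 5)  # order within Z₅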
Anything that can represent a consistent transformation of the underlying objects being
sorted is suitable as a key function. The decorate-sort-undecorate pattern is vastly more
efficient as a big-O complexity than using a comparison function between every pair of
items. See a discussion at https://en.wikipedia.org/wiki/Schwartzian_transform. The less
efficient comparison function idiom is still used in many other programming languages,
and was long ago used in Python, prior to version 2.4.
Lambda functions are absolutely appropriate to use in this context, even if in most other
places a named function would serve clarity better. But very often it is useful to use
operator.itemgetter or operator.attrgetter as faster and more expressive
functions than custom lambda functions. One place we see this need very commonly is in
manipulating deserialized JSON data, which tends to be highly nested.
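For example, a small sketch with invented records:

from operator import itemgetter

rows = [
    {"name": "Jane", "score": 88},
    {"name": "Xian", "score": 93},
    {"name": "John", "score": 75},
]
by_score = sorted(rows, key=itemgetter("score"))
# Multiple keys produce tuple-based sorting:
by_name_then_score = sorted(rows, key=itemgetter("name", "score"))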
For data held in classes operator.attrgetter is very similar, but simply accesses the
attribute that is passed as an argument for each instance being sorted.
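A brief sketch (the Student class here is invented for illustration):

from operator import attrgetter
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int

roster = [Student("Jane", 12), Student("Xian", 10)]
youngest_first = sorted(roster, key=attrgetter("age"))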
students = [
    dict(name="Xian", grade="A-", age=10),
    dict(name="Jane", grade="B", age=12),
    dict(name="John", grade="C", age=15),
    dict(name="Pema", age=14),
    dict(name="Thandiwe", grade="B+"),
]
We’d like to create a little report from our student list. A somewhat awkward approach
to the missing data might be the following.
Despite what I warn in Chapter 4, Advanced Python Usage, there are times when
forgiveness-not-permission makes code worse.
In this example, and in many where we work with dictionaries, neither of these
approaches is ideal. Both LBYL and EAFP require too much code and impede readability
for this task. The cleanest solution is simply “pick a default.”
Just-use-default approach
>>> print("| Name | Grade | Age")
... print("+-----------+-------+-----")
... for student in students:
... print(f"| {student.get('name', 'MISSING'):9s} "
... f"| {student.get('grade', 'PASS'):<4s} "
... f"| {student.get('age', '?')}")
...
| Name | Grade | Age
+-----------+-------+-----
| Xian | A- | 10
| Jane | B | 12
| John | C | 15
| Pema | PASS | 14
| Thandiwe | B+ | ?
Breaking the print() argument across lines means we only save one line (although a
few more characters are saved). More importantly, we simply express the intention to
default to certain values and avoid introducing extra variables within the loop.
3.6 Wrapping Up
In this chapter, we examined some very everyday features that can easily be used in ways
causing problems. These ordinary pitfalls range from code that gets the right results but
does so with far worse runtime expense than is needed, to ways that poor choices of names
can create ambiguities or conflicts, to failing to recognize the need for familiar and
idiomatic Pythonic approaches (which incidentally save you from problems in edge cases).
In this book we learn that many things are possible, and that many of them are
nonetheless best avoided. Chapter 5, Just Because You Can, It Doesn’t Mean You Should. . . ,
looks at what we might call “impulsive” uses of rather advanced features. The pitfalls in
this chapter, however, express tasks you perform almost every time you sit down to write a
Python program, or modify or expand an existing one. Even these simple things can be
done wrongly.
While such is not the only source of mistakes, often those in this chapter reflect habits
brought by developers from other programming languages they may have used earlier.
Adapting your Python habits to utilize Pythonic code shows kindness to yourself and to
your colleagues.
4 Advanced Python Usage
if type(seq1) == list:  # 2
    return [op(x, y) for x, y in zip(seq1, seq2, strict=True)]
if type(seq1) == tuple:  # 3
    return tuple(op(x, y)
                 for x, y in zip(seq1, seq2, strict=True))
if type(seq1) == str:  # 4
    nums1 = seq1.split()
    nums2 = seq2.split()
    new = (op(float(x), float(y))
           for x, y in zip(nums1, nums2, strict=True))
    return " ".join(str(n) for n in new)
This function is plausibly useful. It also will fail to work on many arguments that are
useful in an obvious way. Unless we genuinely need to avoid supporting subclasses (which
is unusual, but not inconceivable), a more general implementation based around
isinstance() is much more sensible. For example:
if isinstance(seq1, ByteString):
    as_str = " ".join(str(n) for n in new)
    return type(seq1)(as_str.encode("ascii"))  # 2
else:
    sep = type(seq1)(" ")  # 2
    return sep.join(type(seq1)(n) for n in new)
2 We defer to the type of the first sequence, where the two sequences are
“compatible” but distinct.
The new implementation is much more flexible while also being slightly shorter (if you
remove the extra comments). We can try it out:
1 If you wish for less flexibility in mixing non-subtypes, the code could be tweaked
easily enough.
There remains a little bit of magic in the second implementation in that we inspect
type(seq1) to decide on a precise class to use when constructing the result. A somewhat
less magical version might simply return a list whenever a mixture of Sequence types are
passed as arguments. However, a little bit of magic is not always unpythonic; at times this
power can be used wisely and powerfully.
A few names have migrated from the former to the latter, but for the most part Python
programmers do not regularly think about the distinction that much.
Whenever the Python interpreter starts (at least CPython, the reference and by far most
common implementation), everything in __builtins__ is loaded automatically. So on
the surface, it’s all just “a bunch of names.”
Let’s take a look at what the actual keywords in Python 3.12 are.
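Both lists are easy to inspect for yourself; a quick sketch using the standard library
keyword module:

>>> import keyword
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', ...]
>>> keyword.softkwlist  # soft keywords, as of Python 3.12
['_', 'case', 'match', 'type']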
In contrast, the built-in, but not technically reserved, names are a fair bit larger. The
dunders like __name__ and __package__ are a bit special in being set differently
depending on what code is being run. But the regular names are simply functionality you
will always have available by default. It’s a long list, but useful to skim over (you might learn
names you were not familiar with, or be reminded of ones you have partially forgotten).
There are a few oddballs like True and None that are both keywords and also in
__builtins__. The difference between these different “always available” names is simply
in whether they are assignable.
Few Python developers can necessarily rattle off a complete list of either keywords or
built-ins, but most of us have used almost all of these names from time to time.
The override in the preceding example is unlikely to be something you would do
inadvertently, and it will only cause problems relatively uncommonly. A FutureWarning is
used by a library that intends to deprecate some API “in the future.” However, unless you
put that override inside the library itself (either as author, or by monkey patching), the
library itself will retain the genuine version.
However, there are a handful of names in __builtins__ that are relatively easy to
overwrite simply by not giving the question a lot of thought. For many of these, I have
committed this “mistake” myself, sometimes in code that predated the very addition of
the name to __builtins__.
def total_receipts(receipts):
    sum = 0
    for receipt in receipts:
        sum += receipt.amount
    return sum
For uses like this, the “mistake” is probably not that important. If the built-in function
sum() is not used later on in the total_receipts() user function, the fact that we have
(locally) overwritten a standard name stops mattering once we leave the function scope.
Python uses a rule called LEGB (Local, Enclosing, Global, and Built-in) for
scoping variables. That is, when the runtime sees a name, it looks in each of
those scopes, in the order listed. Enclosing means an outer function or a class
in which a method is defined. Local means within the body of the function or
method where the code lives.
The important thing about scopes is that when you leave them, the names
they define are released, and do not affect the use of the same names in the
enclosing scope (nor in some other local scope you enter later).
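As a minimal illustration of these scoping rules:

def total():
    sum = 0          # Local: shadows the built-in sum only inside total()
    for n in (1, 2, 3):
        sum += n
    return sum

print(total())       # 6
print(sum([4, 5]))   # 9; the built-in sum() is unaffected out here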
In the case of overwriting id, the use feels natural, but is slightly more error-prone. A
loop like the one I constructed might occur within the body of a larger function or
method, and that same function might want to use the actual __builtins__.id()
function. That said, the function does remain available in the __builtins__ namespace
(which could, in principle, itself be overwritten, which is definitely a bad idea to do).
There are only a handful of names where a developer might naturally want to use that
name as a custom variable. However, a few of these do genuinely feel natural to use. Here
is an example.
class LinearCongruentialGenerator:
    def __init__(self, seed: int=123):          # 3
        self.__seed: int = seed                 # 1
        self._multiplier: int = 1_103_515_245   # 2
        self._modulus: int = 2**32
        self._increment: int = 1
        self._state = seed  # initial state taken from the seed

    @property
    def seed(self):
        return self.__seed

    def next(self):
        # Increment the state
        self._state = (
            (self._multiplier * self._state + self._increment)
            % self._modulus)
        return self._state / self._modulus
1 “Private” attribute
2 “Protected” attribute
3 Type annotations are not runtime enforced but document intention.
This class will produce a pretty good sequence of pseudo-random numbers, each
between 0 (closed) and 1 (open), and a different such sequence for nearly every different
integer seed below 2**32. Let’s take a quick look at using this class:
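For instance, a minimal usage sketch (the specific sample values depend on the seed
and are not asserted here):

lcg = LinearCongruentialGenerator(456)
print(lcg.seed)                            # 456
samples = [lcg.next() for _ in range(5)]
print(all(0 <= x < 1 for x in samples))    # True: values lie in [0, 1)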
For our purposes, we are happy for users to be able to examine the seed used by a given
generator, but definitely do not want them to modify the (purported) seed:
>>> lcg.seed
456
>>> lcg.seed = 789
Traceback (most recent call last):
[...]
AttributeError: property 'seed' of 'LinearCongruentialGenerator'
object has no setter
This pattern of using a read-only property to access a “private” attribute is a good one
to follow. As creator, I want instances of my class always to honestly report the seed that
was used to initialize them.
Those attributes that only begin with a single underscore (“protected”) are not quite as
sensitive. You might be able to change them and allow the functionality to remain
“reasonable” (but you might not be so able):
>>> lcg3._multiplier = 0 # 3
>>> [lcg3.next() for _ in range(2)]
[2.3283064365386963e-10, 2.3283064365386963e-10]
In the example, the very bad multiplier of zero causes the generator to produce the same
fixed number forever. However, other bad choices can merely weaken the distribution.
The specific details of good multiplier, modulus, and increment get into relatively heavy
number theory, with primality, relative primality, the divisibility of the multiplier by 4
specifically, and other factors coming into play. In other words, the ordinary users of the
class probably do not know the relevant considerations and should not touch these
attributes.
Note The first rule of underscores is “don’t talk about name mangling”
Python initial double underscores have a dirty little secret. They are not truly
private and users can get at them if they know about name mangling. Let’s
take a look:
>>> lcg.__seed
Traceback (most recent call last):
[...]
AttributeError: 'LinearCongruentialGenerator' object has no
attribute '__seed'
>>> lcg._LinearCongruentialGenerator__seed
456
>>> lcg._LinearCongruentialGenerator__seed = 123
>>> lcg.seed
123
If you ever find yourself breaking through the privacy of names this way,
you are making a mistake or are an actual core developer of a Python
framework. However, it can certainly be both things.
The other danger with using private or protected attributes is that the author of a
module or library is explicitly not promising they will continue to exist in the next version
of that software. They may very reasonably decide to redesign or refactor their code, and
only maintain backward compatibility with documented interfaces, such as lcg.seed and
lcg.next() in the example class.
Perhaps as author of my LCG, I decide that I definitely want to use a power-of-two
modulus. This is hard-coded into the implementation shown, but a subclass, for example,
might change that but preserve the official APIs. Given this decision, I might change my
implementation to:
class LinearCongruentialGenerator:
    def __init__(self, seed: int=123):
        self.__seed: int = seed
        self._multiplier: int = 1_103_515_245
        self._modpow: int = 32
        self._increment: int = 1
        # ...other code...

    def next(self):
        # Increment the state
        self._state = (
            (self._multiplier * self._state + self._increment)
            % 2**self._modpow)
        return self._state / 2**self._modpow
This new implementation has all the same documented behaviors. Given the same seed,
the new implementation will produce exactly the same sequence of numbers from the
PRNG. However, the protected attribute ._modulus has simply stopped existing in this
version. A downstream user who improperly relied on that attribute, even simply to check
its value, would find their code broken by the change. If their code actually modified the
protected attribute, the failure could be less obvious since Python instances can attach
attributes that are unused for any purpose; changing ._modulus would now have no effect
on the sequence produced, again probably surprising the misbehaving downstream user.
As a final thought, I will mention that modules are much like classes in regard to
leading underscores. Often modules will have such private or protected names inside them.
The mechanism of import makes these slightly less accessible, but not completely
invisible. Nonetheless, if a module author has requested that you not use certain names,
you should believe them, and only use the documented and exported names.
There are a few odd corners where the advice in this section is properly
violated, even within the Python standard library. The most notable of these is:
>>> from collections import namedtuple
>>> Person = namedtuple("Person", "first last handedness")
>>> david = Person("David", "Mertz", "Left")
>>> david
Person(first='David', last='Mertz', handedness='Left')
>>> david._fields
('first', 'last', 'handedness')
>>> david._asdict()
{'first': 'David', 'last': 'Mertz', 'handedness': 'Left'}
Knowledge is power. Or at least often it is: The next chapter looks at cases where a
better mantra might be “ignorance is bliss.”
import sys

numerators = sys.argv[1]
denominators = sys.argv[2]

try:
    ratios = []
    num_fh = open(numerators)
    den_fh = open(denominators)
    for string_a, string_b in zip(num_fh, den_fh, strict=True):
        a = float(string_a.strip())
        b = float(string_b.strip())
        ratios.append(a/b)
    print([f"{r:.3f}" for r in ratios])
except Exception as err:
    print("Unable to perform divisions")
As we saw with that earlier discussion utilizing almost this same small program, a great
many different things can go wrong in this script. In a few paragraphs we’ll see specific
causes of the errors this version reports only generically.
The example here, as in most discussions, is toy code. However, one might easily find
similar logic inside a function within a library or application being developed. In fact, this
function might be in use for a long while without any of these constructed error cases
being encountered. We would like to sort out what is going wrong in each example.
Using an actual debugger is absolutely a worthwhile approach.1 Moreover, sending the
more detailed information printed in the following code to a logging system is often
desirable. For the example, let’s just make the printed errors more informative.
1. I personally have a special fondness for PuDB (https://documen.tician.de/pudb/), which like IPython pdb
(https://github.com/gotcha/ipdb) is built on the standard library module pdb (https://docs.python.org/3
/library/pdb.html). Most developers, however, probably prefer the more graphical debuggers incorporated with
GUI editors or IDEs like VS Code or PyCharm.
num_fh = den_fh = string_a = string_b = None  # inferred from the output below

try:
    ratios = []
    num_fh = open(numerators)
    den_fh = open(denominators)
    for string_a, string_b in zip(num_fh, den_fh, strict=True):
        a = float(string_a.strip())
        b = float(string_b.strip())
        ratios.append(a/b)
    print([f"{r:.3f}" for r in ratios])
except Exception as err:
    print("Unable to perform divisions")
    print(f"{err=}\n{numerators=}\n{denominators=}\n"
          f"{num_fh=}\n{den_fh=}\n{string_a=}\n{string_b=}")
Trying out two of the broken cases, it becomes much easier to understand what goes
wrong. From there, we can decide how best to go about preventing or catching the
specific problems:
num_fh=<_io.TextIOWrapper name=... mode='r' encoding='UTF-8'>
den_fh=None
string_a=None
string_b=None
You are free to use the regular formatting options for f-strings in combination with the
special = marker. For example:
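For instance (a small sketch; any format spec after the = behaves the same way):

>>> pi = 3.141592653589793
>>> print(f"{pi=:.3f}")
pi=3.142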
In commenting that decorators are stricto sensu never required, I mean that
even more literally than I do of other convenient features one could do
without. For any decorator you use, you could rewrite it with a small and
mechanical translation into code that was decorator free.
Use of decorator
@some_decorator
def my_function(this, that):
    # do some stuff
    return a_value
With exactly the same semantics, and yet drawing attention to the meaning
less well, we could write the following.
Rewriting without a decorator
def my_function(this, that):
    # do some stuff
    return a_value

my_function = some_decorator(my_function)
Writing Decorators
In general, a decorator is simply a function that takes another function as an argument
and returns something. Within Python semantics, that something could be any Python
object. However, useful decorators for functions and methods almost always return a
callable, and this callable almost always has the same, or nearly the same, calling
signature as the decorated function or method. Useful decorators for classes return new
classes, but ones that serve a closely related purpose to the class written directly in the
code—merely “tweaked” or “enhanced” in some respect.
Very simple good and bad examples follow; read a bit later to understand a better way
to write even the “good” decorator.
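The decorator definition itself is elided in this excerpt; a hedged reconstruction that
produces the behavior shown below (the name vectorize is my assumption) might be:

from functools import wraps

def vectorize(fn):
    @wraps(fn)  # preserve __name__, docstring, and signature metadata
    def inner(*args):
        # A single list argument is treated as a list of argument tuples
        if len(args) == 1 and isinstance(args[0], list):
            return [fn(*a) for a in args[0]]
        return fn(*args)
    return inner

@vectorize
def fused_multiply_add(a, b, c):
    "Multiply and accumulate"
    return (a * b) + c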
>>> fused_multiply_add(4, 7, 3)
31
>>> fused_multiply_add([(4, 7, 3), (7, 2, 4), (12, 1, 5)])
[31, 18, 17]
>>> fused_multiply_add.__name__
'fused_multiply_add'
>>> fused_multiply_add? # 4
Signature: fused_multiply_add(a, b, c)
Docstring: Multiply and accumulate
File: ~/git/PythonFoibles/...
Type: function
The big win of this decorator is that we are completely free to reuse it on any function
that takes multiple arguments without having to rewrite the conditional logic about
whether to vectorize inside each of them. Every function we decorate can focus solely on
the numeric (or other) operation it needs to perform and leave the “vectorized” aspect to
the decorator.
Of course, NumPy “ufuncs” (https://numpy.org/doc/stable/reference/ufuncs.html) do
this same thing for sequences that are specifically NumPy arrays, with a more optimized
implementation for that case. But this version works for any Python sequence that might
be operated on in a vectorized manner.
In a similar spirit, suppose that we want to keep track of how often a function is called.
Counting calls to a function
>>> from functools import wraps
>>> def count_calls(fn):
...     @wraps(fn)
...     def inner(*args):
...         inner.num_calls += 1
...         return fn(*args)
...     inner.num_calls = 0
...     return inner
...
>>> @count_calls
... def fused_multiply_add(a, b, c):
... return (a * b) + c
...
>>> [fused_multiply_add(*args)
... for args in [(4, 7, 3), (7, 2, 4), (12, 1, 5)]]
[31, 18, 17]
>>> fused_multiply_add.num_calls
3
>>> fused_multiply_add(7, 6, 5)
47
>>> fused_multiply_add.num_calls
4
Python functions are perfectly free to have additional attributes attached to them, and
that serves us well for keeping state associated with the runtime use of the function.
Standard Decorators
The Python standard library includes a number of very useful decorators. The previous
section touched on using @property, which is a name in __builtins__. The decorators
@staticmethod and @classmethod are similar as ways of modifying the behavior of
methods within a class.
Early in this discussion, we had a chance to see how @functools.lru_cache can
speed up pure functions (ones that should always return the same answer given the same
arguments). An interesting standard library decorator is @dataclasses.dataclass,
which can “enhance” the behavior of a class used primarily to store “records.” Dataclasses
are discussed in Chapter 6, Picking the Right Data Structure.
Similar to @functools.lru_cache is @functools.cache, which was added in
Python 3.9. It is simply an unbounded variation of “least-recently-used” (LRU) caching.
There are trade-offs between the two: Unbounded caching can be faster, but can also
increase memory usage indefinitely.
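For instance, a minimal sketch:

from functools import cache

@cache
def fib(n):
    # Each distinct n is computed once; later calls are dictionary lookups
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(200)  # fast, despite the naive-looking double recursion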
A very interesting decorator for a class is functools.total_ordering. If we wish for
instances of a custom class to be sortable and support inequality and equality comparisons,
we need to implement .__lt__(), .__le__(), .__gt__(), or .__ge__() and
.__eq__(). That’s a lot of work that can be made easier using a decorator.
Comparable persons
>>> import random
>>> from functools import total_ordering, cached_property
>>> @total_ordering
... class Person:
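... # NOTE: the class body is elided in this excerpt; the following is a
... # hedged reconstruction consistent with the numbered callouts below.
...     def __init__(self, first, last):
...         self.first = first
...         self.last = last
...     def __eq__(self, other):                        # 1
...         return (self.first, self.last) == (other.first, other.last)
...     def __lt__(self, other):                        # 1
...         return (self.last, self.first) < (other.last, other.first)
...     @cached_property                                # 2
...     def lucky_number(self):
...         print(f"Generating for {self.first} {self.last}")
...         return random.randint(1, 100)
...
>>> person1 = Person("David", "Mertz")
>>> person2 = Person("Guido", "van Rossum")
>>> person1 <= person2                                  # 3
True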
>>> person1.lucky_number
Generating for David Mertz
88
>>> person1.lucky_number # 4
88
>>> person2.lucky_number
Generating for Guido van Rossum
17
>>> person2.lucky_number # 4
17
1 Any two of the comparison dunders may be implemented for the rest to be
built for us.
2 This property is only calculated once and then cached.
3 A comparison, .__le__(), we did not directly implement
4 Second access has no side effect, nor does it call randint() again
There are a good number of additional useful decorators scattered around Python’s
standard library, and many libraries and frameworks supply their own. Much of the time,
use of decorators provides a clean, minimal, and expressive way of creating code without
explicitly writing it. As with most things, decorators have their abuses; when well used,
they make code even more Pythonic.
Users new to the concept can often overlook the somewhat subtle distinction
between iterables and iterators. The fact that many iterables are their own
iterator does not help in sorting out that confusion. These types of things are
programmatically defined by collections.abc.Iterable and
collections.abc.Iterator, respectively.
In brief, an iterable is an object that, when passed to the built-in function
iter(), returns an iterator. An iterator is simply an object that, when passed
to the built-in function next(), either returns a value or raises
StopIteration. These two simple properties make iterables and iterators
play nice with loops, comprehensions, and other “iterable contexts.”
Generator functions return iterators when called. But iterables and
iterators may also be defined by custom classes. As usual, it is the presence
of dunder methods that allows objects to work with operators and built-in
functions. A typical such class might look like the following.
An iterable class that is its own iterator
>>> class Ones:
... def __next__(self):
... return 1
... def __iter__(self):
... return self
...
>>> list(zip(Ones(), "ABC"))
[(1, 'A'), (1, 'B'), (1, 'C')]
Since an instance of the Ones class returns itself when passed to iter()
(which occurs implicitly in any iterable context, such as in an argument to
zip()), Ones() is both an iterable and an iterator. We could spell this
equivalently as itertools.repeat(1), which likewise yields infinitely many
values (in the example, the number 1 infinitely many times).
Reading FASTA
Let’s look at a small program that uses iterators, then we’ll use itertools functions to
combine them. The FASTA format (https://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml)
is a textual format used in bioinformatics to describe nucleotide or amino acid sequences.
Very often, a single file with an extension like .fa, .fna, .fra, or .fasta will contain hundreds
of thousands of sequences, each containing tens of thousands of single-letter codes. For
example, such a file might represent an analysis of the genetic diversity in a milliliter of soil
or seawater, with each record describing a distinct bacterium species (diversity of
microscopic organisms is amazing).
from collections import namedtuple

# The definition of the FASTA record type is elided in this excerpt;
# the field names here are assumed.
FASTA = namedtuple("FASTA", ["description", "sequence"])

def read_fasta(filename):
    with open(filename) as fh:
        line = next(fh)
        if not line.startswith(">"):
            raise ValueError(
                f"{filename} does not appear to be in FASTA format")
        description = line[1:].rstrip()
        nucleic_acids = []
        for line in fh:
            if line.startswith(">"):
                yield FASTA(description, "".join(nucleic_acids))
                description = line[1:].rstrip()
                nucleic_acids = []
            else:
                nucleic_acids.append(line.rstrip())  # 1
        yield FASTA(description, "".join(nucleic_acids))
1 FASTA files typically, but not uniformly, have 80-width lines within sequence
blocks; but the newline is not meaningful.
Here is an example of reading a file that contains just two sequences, the putative
mRNA sequences for the Pfizer and Moderna COVID-19 vaccines, respectively:2
Let’s move on to doing something with this FASTA data that we can iterate over.
Run-Length Encoding
In data that is expected to have many consecutive occurrences of the same symbol,
run-length encoding (RLE) can often offer significant compression of the representation.
RLE is often used as one component within sophisticated compression algorithms.
In particular, we can write extremely compact implementations of run-length encoding
and run-length decoding by using itertools. These implementations also have the virtue
of being able to operate lazily. That is, both of the following functions can be consumed
one value at a time, and accept iterables as arguments, and hence operate efficiently on
large, or even infinite, streams of data.
Run-length functions
from collections.abc import Iterable, Iterator
from itertools import groupby, chain, repeat
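The function bodies are elided in this excerpt; a hedged sketch consistent with the
description below (assuming (value, run_length) pairs as the encoded form) could be:

def rle_encode(it: Iterable) -> Iterator:
    # Collapse each run of equal values into a (value, run_length) pair
    return ((k, len(list(g))) for k, g in groupby(it))

def rle_decode(pairs: Iterable) -> Iterator:
    # Lazily expand (value, run_length) pairs back into a flat stream
    return chain.from_iterable(repeat(k, n) for k, n in pairs)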
The functions are annotated only to emphasize that they operate lazily. They utilize the
itertools functions groupby(), chain.from_iterable(), and repeat() to keep the
code compact and lazy. The argument to chain.from_iterable() is itself a generator
comprehension, as well. Exactly why these work so elegantly is left, to some degree, as
an exercise for the reader, who should read the official documentation of each function
(and of the other handy functions in the module). It’s worth seeing these operate to
understand what’s going on:
1 The objects seq and first5_regions are lazy iterators, not concrete.
2 Only by creating a list from iterator do we allocate memory.
We can verify that our paired functions are symmetrical:
Spike-encoding_contig_assembled_from_BioNTech/Pfizer_BNT-162b2
<generator object rle_decode at 0x7f5441632810>
GAGAATAAACTAGTATTCTTCTGGTCCCCACAGACTCAGAGAGAACCCGCCACCATGTTC...
Spike-encoding_contig_assembled_from_Moderna_mRNA-1273
<generator object rle_decode at 0x7f54416f7440>
GGGAAATAAGAGAGAAAAGAAGAGTAAGAAGAAATATAAGACCCCGGCGCCGCCACCATG...
Notice that the functions rle_encode() and rle_decode() are not limited to
encoding or decoding characters; such is merely handy for the examples. In fact, any kind
of iterable of any kind of value that might be repeated successively will work equally well
in being encoded or decoded by these functions. They will also work on infinitely long
iterators of values to encode or decode, as long as you only ask for a finite number of the
values at once.
To illustrate how concise and powerful iterator algebra can be, let’s look at a somewhat
famous mathematical series. The alternating harmonic series converges to the natural log of 2.
It’s not an especially fast convergence, but an implementation can elegantly utilize
several iterator-combining functions:
∑_{n=1}^{∞} (−1)^{n+1}/n = 1 − 1/2 + 1/3 − 1/4 + 1/5 − ⋯
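One possible implementation in this spirit (a sketch, not the book’s exact listing; it
combines cycle(), count(), accumulate(), and islice()):

from itertools import accumulate, count, cycle, islice
from math import log

terms = (sign / n for sign, n in zip(cycle([1, -1]), count(1)))
partial_sums = accumulate(terms)             # lazy running totals
approx = next(islice(partial_sums, 100_000, None))
print(approx, log(2))                        # slowly approaches ln(2)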
Note
Readers who come from the world of JavaScript may be familiar with the
brilliantly useful library Lodash (https://lodash.com/), which adds back into
JavaScript many of the powerful features that I personally miss most when I
have to venture into that language from my more familiar, and friendlier,
Python. A great deal of what the Lodash library does, however, is implement
functions similar to those provided by itertools and more-itertools.
This is a book about Python, however, and you need not have ever used
JavaScript or Lodash to read this chapter.
This is nice, and we can see that the direct result of calling collapse() is a lazy iterator
(which can thereby be combined and massaged using everything else within itertools
and more-itertools).
What is even nicer is that this function works equally well on iterators that may
themselves yield other iterators, without ever having to concretize more than one element
at once.
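For instance (a small sketch; collapse() comes from the third-party more-itertools
package mentioned above):

from more_itertools import collapse

nested = [1, [2, (3, 4)], [[5], 6]]
flat = collapse(nested)       # a lazy iterator
print(list(flat))             # [1, 2, 3, 4, 5, 6]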
You may well decide to use type annotations. Whether you should so decide is
specifically not an opinion I will venture in this text.
Use of type annotations is significantly complex, although incremental or minimal use
is possible, and tools like Mypy, Pytype, Pyright, and Pyre implement gradual type
annotations.3 As Python’s standard library typing module and Python’s syntax have grown
new capabilities, both the sophistication of what can be expressed and the range of possible
mistakes have grown enormously since 2014’s PEP 484, titled “Type Hints.”4
The main thing to keep in mind here is that type annotations have zero effect on the runtime
behavior of Python. Mostly.
As with most topics, certain caveats are needed. Annotations are available to
Python programs at runtime, as is pretty much everything about Python that
you might potentially introspect. For example:
>>> def add(a: int, b: int) -> int:
... c: int = a + b # Annotation of `c` not exposed directly
... return c
...
>>> add.__annotations__
{'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}
>>> add.__code__.co_varnames
('a', 'b', 'c')
>>> import inspect
>>> inspect.getsource(add)
'def add(a: int, b: int) -> int:\n c: int = a + b\n
return c\n'
>>> import typing
>>> typing.get_type_hints(add)
{'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}
3. The concept of gradual typing was introduced by Jeremy Siek and Walid Taha in 2006. See
https://wphomes.soic.indiana.edu/jsiek/what-is-gradual-typing/ for background. A variety of programming
languages and tools have implemented this concept, even beyond those in the Python ecosystem.
4. With a certain minor pride or vanity, I am fairly confident that my 2015 keynote at PyCon Belarus, “Python’s
(future) type annotation system(s),” was the first public conference presentation of these ideas within the Python
world. That said, I have been no more than an observer of a trend, and have contributed nothing relevant to the
specific directions typing has taken in Python.
Using the static type analyzer Mypy, we can find certain problems in this script.
We can see that type-violation.py has typing issues both under static analysis and at
runtime. However, the errors are somewhat different from each other. Most likely, from
the name of the defined function, the developer did not intend for the function
fused_multiply_add() to be used with strings, but we cannot know that for certain
without either documentation or accurate annotations that are checked statically.
It is tempting to generalize by saying that static analysis is “more strict” than runtime
type checking. However, while that happens to be true in this example, it is not true
generally. Annotations might only be used gradually. As well, there are numerous edge
cases where static checkers will not catch issues arising dynamically. The set of
complaints—even strictly TypeError exceptions—can be more extensive either under
static analysis or under runtime checking. Moreover, the details of this comparison depend
on which static analysis tool is used, and in what version (which is very much outside the
scope of this book).
We can see why this particular example encounters a runtime type error:
>>> "foo" * 2
'foofoo'
>>> "foo" * 2.0
Traceback (most recent call last):
[...]
TypeError: can't multiply sequence by non-int of type 'float'
It happens that the static analysis tool would be perfectly happy to allow 2 and 2.0
interchangeably as floating-point-like numbers. This is what typing.SupportsFloat is
used for; it’s also why fused_multiply_add(1, 2, 3) did not raise any complaint by the
static analyzer. But in Python, while strings can be “multiplied” by an integer, they cannot
be multiplied by a floating point number, even one whose value is equal to an integer.
If we removed all the type annotations from the preceding program, the runtime type
checking and program behavior would not change whatsoever.
[...]
TypeError: unhashable type: 'slice'
The code accidentally “works” with a string as an argument. The result is likely not
what we want, but no exception is raised. However, for the dictionary argument, the
exception that occurs really is not related to the fact that the argument is not a sequence of
integers.
If we wanted actual runtime checks, we could write something more like this.
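A hedged sketch of what such runtime checking might look like (the function name and
messages here are assumptions):

from collections.abc import Sequence

def sum_of_squares(seq):
    # Here Sequence serves as a runtime class for isinstance(),
    # not as a static type hint
    if not isinstance(seq, Sequence) or isinstance(seq, (str, bytes)):
        raise TypeError("seq must be a non-string sequence")
    if not all(isinstance(n, int) for n in seq):
        raise TypeError("all elements must be integers")
    return sum(n * n for n in seq)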
Notice that the self-same object Sequence might play either a static or runtime role,
but the code is structured differently for the two purposes. Using Pydantic might provide a
path to unify these two roles, but the basic idea that “using the same object in different
contexts does different things” isn’t actually that hard to think about.
Some objects in typing definitely provide an attractive nuisance, however. For
example, an actual colleague of mine—who admittedly primarily develops in C++, but
has used Python on the side for more than a decade—wanted to have specialized integers
against which he could perform an isinstance() check elsewhere in his program. This
desire is eminently reasonable. Perhaps UserId is a special type of integer in the sense that
we’d like to make sure a generic integer isn’t used in certain places. Or maybe we want an
integer for a thermometer temperature, or for a percentile ranking, or for a distance
measure where we want to avoid confusing miles with kilometers. Knowing whether the
wrong kind of unit/value has snuck in is quite useful.
With such a need in mind, here is an obvious—and wrong—solution to the
requirement.
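A sketch of that wrong solution using typing.NewType (the exact error wording varies
a bit across Python versions):

>>> from typing import NewType
>>> UserId = NewType("UserId", int)
>>> isinstance(UserId(1234), UserId)
Traceback (most recent call last):
[...]
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union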
At least this fails quickly. But why did that not work? If we read the typing
documentation carefully, we discover that “these checks are enforced only by the static
type checker.”
The way we would actually solve this problem is with the plain-old, boring class
statement that we’ve had since Python 1.0. Albeit, here I use a few more recent features for
a more robust version.
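A sketch of such a class (the bounds and messages here are my assumptions):

class Celsius(float):
    def __new__(cls, value):
        # A float subclass works with isinstance() and adds a bounds check
        if float(value) < -273.15:
            raise ValueError("temperature below absolute zero")
        return super().__new__(cls, value)

t = Celsius(21.5)
print(isinstance(t, Celsius), isinstance(t, float))  # True True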
If you didn’t care about the bounds checking, it would be sufficient simply to write, for
example, class Celsius(float): pass.
4.5 Wrapping Up
In this chapter, we looked at a somewhat heterogeneous collection of mistakes, but still all
within the domain we might call “generic Python concerns.” Later chapters look at more
specialized (but not less important) concerns with matters such as testing, data structures,
security, and numeric computing.
Two mistakes here recapitulated the mistakes with good naming that started in
Chapter 3, A Grab Bag of Python Gotchas. Good names—and following Pythonic
conventions about naming—remain important as you become a more sophisticated
programmer.
Several more mistakes discussed were mistakes of ignorance or forgetfulness. Python, in
truth, has a few slightly obscure, yet enormously useful, corners within it. Knowing about
some capabilities of the Python syntax or standard library can often make your work easier,
and make your code friendlier for others to read.
A final two mistakes concerned a common confusion among newish Python
developers, and especially those more familiar with statically typed programming
languages. While the introductory material of this book discussed why this book does not
specifically delve into the world of static type analysis in Python, the presence of type
annotations can often mislead some programmers. This book itself, in fact, occasionally
uses examples with annotations; but it only does so in order to highlight correct reader
expectations; in other words, they are merely a form of documentation for most Python
users (albeit an often useful form).
5 Just Because You Can, It Doesn’t Mean You Should…
Python has a great many capabilities that exist for good reasons. However, many of these
features create what jurists call an “attractive nuisance.”1 This chapter is a mixture of
positive and negative advice, in truth. It presents some slightly unusual constructs that a
new Python developer often learns and becomes excited about, before finally stumbling
over the pitfalls those possibilities present. Hopefully, reading the mistakes here will
facilitate the learning part but discourage the misuse follow-up. In some other mistakes,
I simply discuss some techniques that people new to Python may not have known about
(but which it would be a mistake not to avail yourself of); in one case (and another half
case), people who have long used Python may also not be familiar with a newer feature.
5.1 Metaclasses
This section leans into slightly more arcane areas of Python’s design. If you’ve seen
metaclasses discussed, or if they are used in the codebases you work with, this is definitely
for you. If you haven’t, it’s still worth reading, but treat the topic as an amusement or an
enlightenment, rather than as a necessity. Libraries you use might use metaclasses, even if
you are not aware of the fact; knowing more is always handy.
Metaclasses push at the edge of the highly dynamic nature of Python. Even class objects
can themselves be created in different manners than their code describes by declaring a
metaclass. Moreover, the choice of metaclass can itself, in principle, be dynamically chosen.
1. Attractive nuisance: “a legal doctrine which makes a person negligent for leaving a piece of equipment or other
condition on property which would be both attractive and dangerous to curious children. These have included
tractors, unguarded swimming pools, open pits, and abandoned refrigerators” (from https://dictionary.law.com).
Some of the more obscure features in Python are certainly reminiscent of open pits and abandoned
refrigerators.
Before we get to showing what metaclasses can do, and showing why doing those things is
usually a mistake, let’s think about a comment from the early 2000s (when metaclasses
were introduced) that stands the test of 20 years:
Metaclasses are deeper magic than 99% of users should ever worry about. If
you wonder whether you need them, you don’t (the people who actually need
them know with certainty that they need them, and don’t need an explanation
about why).
— Tim Peters, Author of The Zen of Python (from the much-missed comp.lang.python
Usenet newsgroup)
Around the same time as Tim Peters’ quote, and not too long after they were
introduced to Python, I wrote a number of articles about Python metaclasses (often in
collaboration with Michele Simionato). Unfortunately, I probably elicited the commission
of many of the sins with metaclasses committed since.
Although I would almost universally advise against using metaclasses, there are some
prominent Python frameworks that make broad use of them (cough, cough, Django:
https://www.djangoproject.com/). Arguably, there must be occasions where they are the
best approach to choose. However, it’s probably informative that Python’s standard library
typing module initially used metaclasses widely but subsequently moved away from their
use (the module is largely the creation of Python creator Guido van Rossum).
Let’s take a look at a metaclass with an arguably good use. Suppose you have a large
code base, particularly one with some sort of plugin system in which components are
dynamically loaded. You’d like to be able to examine at runtime which classes have been
created in this dynamic arrangement.
import logging
from math import sqrt

class PluginLoggerMeta(type):  # 1
    def __init__(cls, name, bases, attrs):
        logging.info(f"Registering: {name}"
                     f"({', '.join(b.__name__ for b in bases)}): "
                     f"\n contains: {', '.join(attrs.keys())}")

class Plugin(metaclass=PluginLoggerMeta):
    pass
if True:  # 2
    class Point2d(tuple, Plugin):
        def __new__(self, x, y):
            return tuple.__new__(self, (x, y))

        @property
        def distance(self):
            return sqrt(self[0]**2 + self[1]**2)

if not False:  # 2
    class Point3d(tuple, Plugin):
        def __new__(self, x, y, z):
            return tuple.__new__(self, (x, y, z))

        @property
        def distance(self):
            return sqrt(self[0]**2 + self[1]**2 + self[2]**2)
However, this use of metaclasses is simply more fragile than other approaches to
accomplishing the same thing. Since our convention for plugins requires adding something,
such as inheriting from Plugin, to work, we could just as easily require something else.
For example, a decorator like @register_plugin could just as well be required by the
framework using plugins. All that decorator would need to do is log information about the
class object, then return the class unchanged. For example:
Usually, when you create a class in Python, what you are doing “behind the
scenes” is calling the constructor type(). For example, here is a subclass of
tuple (which needs to use .__new__() rather than .__init__() because
immutable tuples cannot be modified during initialization, when the instance
already exists):
>>> from math import sqrt
>>> class Point2d(tuple):
...     def __new__(self, x, y):
...         return tuple.__new__(self, (x, y))
...
...     def distance(self):
...         return sqrt(self[0]**2 + self[1]**2)
...
>>> point = Point2d(3, 4)
>>> point
(3, 4)
>>> point.distance()
5.0
>>> point.__class__.__name__
'Point2d'
2. To be very pedantic, a metaclass does not strictly need to be a subclass of type. In principle, any callable that
takes as arguments a name, a tuple of bases, and a dictionary of attributes/methods would work as a metaclass.
This could even be a plain function if it returned a class object.
def register_plugin(cls):
    logging.info(f"Registering: {cls.__name__} ...")
    return cls
As used, we’d see something like this (depending on the configuration of logging):
>>> @register_plugin
... class TestClass: pass
Registering: TestClass ...
Likewise, if you wanted to use inheritance, rather than using a metaclass, you could
simply include a logging .__new__() within the Plugin class. But why isn’t this simply a
minor style preference? Here’s one of the numerous places where the fragility of
metaclasses comes out:
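A minimal sketch of the kind of failure meant here, using the NullMeta and NullBase
names from the discussion that follows:

class NullMeta(type):
    pass

class NullBase(metaclass=NullMeta):
    pass

class Child(Plugin, NullBase):  # Plugin was built by PluginLoggerMeta
    pass

Running this raises:

TypeError: metaclass conflict: the metaclass of a derived class must be
a (non-strict) subclass of the metaclasses of all its bases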
The error message is pretty good, but in essence the problem is that Plugin and
NullBase imply different metaclasses to be used for class creation (PluginLoggerMeta
and NullMeta, respectively), and Python cannot decide which one to use. There is a
solution in creating a custom metaclass descending from both metaclasses, but this becomes
thorny and arcane quickly.
If custom metaclasses become commonplace in your code, conflicts during multiple
inheritance become almost inevitable. This is a danger that might be justified if it did
something you couldn’t do otherwise. However, as shown earlier, class decorators are almost always cleaner and more readable, and they can accomplish absolutely or very nearly everything that custom metaclasses can.
5.2 Monkeypatching
A powerful, and therefore easily misused, capability in highly dynamic languages like
Python is that they allow monkeypatching. With monkeypatching, even existing modules
and classes can be modified at runtime to exhibit different behaviors that might be better
specialized for the needs of a specific project.
The programming language Ruby is, in many ways, quite similar to Python. It is even
more dynamic, so a few things that are difficult or impossible to monkeypatch in Python
are easy in Ruby (mainly the possibility of dynamically altering even built-in types, which
Python does not allow). Partially as a consequence of such broad power, monkeypatching
became much more deeply incorporated into Ruby’s culture than it has in Python. In
turn, this cultural trend also largely became seen as an antipattern among Ruby developers;
however, a great deal of code (such as Ruby’s biggest mover, the Rails web framework)
included monkeypatching that could not be removed without breaking backward
compatibility.
An influential article in that ecosystem was Avdi Grimm’s 2008 “Monkey Patching Is
Destroying Ruby” (https://avdi.codes/why-monkeypatching-is-destroying-ruby/). You
can tell from the date that this discussion is not new. In fact, Grimm acknowledges Python
as the source of the term, and a predecessor language in allowing it.
Already in 2008, monkeypatching was well-known in Python, but not especially widely
used. In 2023, as I write this, that characterization remains true. You might certainly
encounter Python code that uses monkeypatching, but only moderately often. Deciding to
use monkeypatching in your own code is usually a mistake, but not always. Classes,
instances, source code itself, and a few other objects can be monkeypatched in Python. For
this example, though, I will present the most common use of the technique:
monkeypatching modules.
For a slightly contrived example that nonetheless genuinely exemplifies the situations
that tend to motivate Python developers to monkeypatch, suppose you have an application
that processes CSV data that comes from a vendor. A sample data file might look like this.
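The actual sample file lives in the book’s archive; a plausible stand-in with the same shape (balance, account number, owner) might be:

5100.17,1001,David
4300.00,1002,Sarita
5223.20,1003,Genevieve
6100.45,1004,Omar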
Your application might look something like a more sophisticated version of this:
import re
import sys

def get_5k_balances(rows):
    for row in rows:
        if re.match(r"5\d{3}\.\d{2}", row):
            balance, account_num, owner = row.split(",")
            yield (f"Account: {account_num}\n"
                   f"  Owner: {owner.strip()}\n"
                   f"Balance: ${float(balance):,.2f}\n")

if __name__ == "__main__":
    for account in get_5k_balances(open(sys.argv[1])):
        print(account)
By stipulation, there are actually many places in this code where we perform a
re.match() against balance numbers. When we run this script, it will loop through the
specified file and yield formatted versions of some of the rows (i.e., Sarita’s and Omar’s
accounts do not have balances in the range of interest).
After a while, our vendor introduces overdraft balances, and hence decides to change
the format so that balances are always shown with a leading plus or minus sign.
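In the new format, the same hypothetical rows would look like:

+5100.17,1001,David
+4300.00,1002,Sarita
-5223.20,1003,Genevieve
+6100.45,1004,Omar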
The program as written will not match any lines. So one might come up with a clever idea: rather than changing every re.match() call throughout the code, patch re.match() itself to expect the extra leading sign. It’s easy to do with a small module along the following lines (a sketch; the book’s actual monkey_plus module may differ in detail):
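# monkey_plus.py (a hedged reconstruction)
import re

_original_match = re.match

def match_positive(pattern, string, flags=0):
    # Demand a leading + or - in front of whatever pattern the caller supplied
    return _original_match(r"[-+]" + pattern, string, flags)

re.match = match_positive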
If we add the single line import monkey_plus to the top of our script, creating
process-accounts2.py, it now processes the new format (in all the hypothetical
functions within the script that utilize re.match(), not only in get_5k_balances()).
A clever programmer is tempted to notice that they control all of the code within
process-accounts2.py and are confident that the only place where re.match() is used
is in matching balances (or in any case, matching numbers preceded by a plus or a minus
sign).
So far, nothing broke. But the clever programmer of this narration decides that they
would like to add a switch to control which of the many functions available are used to
process the data files that match the new vendor format. A reasonable program might be
the following.
import argparse
import re
import sys

def get_5k_balances(rows):
    ...  # same as previously shown

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("datafile", type=str, help="CSV data file")
    parser.add_argument(
        "-f", "--function", action="store", default="get_5k_balances")
    args = parser.parse_args()
    func = eval(args.function)
    for account in func(open(args.datafile)):
        print(account)
It’s easy at this point to start down a wild goose chase of trying to figure out how the
arguments for argparse might be wrong. Or wonder whether the admittedly too clever
eval() to get the function object from its name is amiss.
None of those issues is the mistake here, though. What has happened is that somewhere
buried inside the “protected” (i.e., leading single underscore, and not exported) functions
used to implement argparse there are uses of re.match(). One might even miss this
looking at the source code since that module had used import re as _re, and therefore
the calls are actually to _re.match(). Even so, the monkeypatching has badly broken a
completely unrelated module—in this case, one in the standard library—in a way that is far
from obvious, in code we did not necessarily even suspect would use our altered function.
It’s the same module object that is mutated by monkeypatching; it doesn’t matter what name that module happens to be bound to in different indirect imports.
The example provided in this discussion is a bit artificial. You probably would not be
inclined to inject a new version of a function into a standard library module. However,
you might be inclined to inject a function (or class attribute, or method, module constant,
etc.) into a slightly obscure third-party library you are utilizing that does almost but not quite
what you want. Quite likely you might inject a version derived from the provided version,
as in the example. This risks unpleasant surprises when a completely different dependency
turns out to also utilize that library, in ways that do not know about your alteration.
The actual best approach in situations of this sort is to bite the bullet and simply replace all relevant uses of, for example, re.match() in your code. If you decide to define a function such as match_positive() within your codebase, it can copy the signature of the original version, and switching to it is a straightforward search-and-replace. Whenever possible (and that is almost always), it is better to leave the provided module function untouched.
5.3 Getters and Setters

In Python, writing explicit getter and setter methods is usually a mistake, although one sees such code written by programmers coming from other languages relatively often.
At first glance, my admonition against getters and setters might seem to contradict that
in Chapter 4, Advanced Python Usage, in the section “Directly Accessing a Protected
Attribute.” There I advised against directly modifying pseudo-protected (i.e., one leading
underscore) or pseudo-private (i.e., two leading underscores) instance attributes. I use
“pseudo-” in the prior sentence because, of course, in Python you are never truly
prevented from reading or modifying these (other than by convention and mild name
mangling).
In so-called “bondage-and-discipline languages” like Java, the intention is to prevent
users of a class from doing something they are not supposed to. In contrast, a common
saying among Pythonistas is “We’re all adults here.” That is, the creator of a class may
include access advisories by using single or double leading underscores, but these are not
access modifiers in the style of C++ and Java (besides “protected” and “private”, C++—and Visual Basic also—includes “friend” as an end run around other access modifiers).
If an attribute of a Python instance contains one or two leading underscores, it is not
because the creator of the class tries to guarantee you won’t access it, but simply as a way of
indicating that the attribute might disappear in the next version of their class, or its
meaning or values might be altered in ways that are not guaranteed by documented and
public APIs. Also, perhaps the meaning of that attribute with an access advisory might
actually do something subtly different than what you might initially assume. The class
author is making you no promises about that attribute.
Circling back to getters and setters, let’s look at a quick example of such an antipattern in Python. The following works, of course; it merely feels non-idiomatic.
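A minimal sketch of the GetterSetterExample class under discussion (reconstructed; the book’s version may differ in detail):

class GetterSetterExample:
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value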
Even if these extra methods took additional actions, this would not feel like good
Python. Yes, you should not muck with ._value, but if an indirect mechanism to have
guarded access to it is needed, properties provide a much more natural style.
For a slightly more fleshed out example than GetterSetterExample, let’s suppose we
want a Rectangle class, but with the special constraint that if we imagine the rectangle
having a lower-left corner at the origin, no part of the rectangle falls outside a circle with a
given radius. This is illustrated in Figure 5.1. In Pythonic style, we might write this as
follows.
Figure 5.1 Valid and invalid rectangles according to diagonal length rule.
from math import sqrt
from sys import stderr

class BoundedRectangle:
    def __init__(self, x, y, radius=1):
        # Both dimensions and the radius must be positive...
        assert x > 0 and y > 0 and radius > 0
        # ...and the far corner must lie within the circle
        assert sqrt(x**2 + y**2) <= radius
        self._x = x
        self._y = y
        self._radius = radius  # intended to be set only at initialization

    @property
    def x(self):
        return self._x

    @x.setter  # guarded write access to .x
    def x(self, new_x):
        if new_x < 0 or sqrt(new_x**2 + self._y**2) > self._radius:
            print("Rectangle bounds violated", file=stderr)
            return
        self._x = new_x

    @property
    def y(self):
        return self._y

    @y.setter
    def y(self, new_y):
        if new_y < 0 or sqrt(new_y**2 + self._x**2) > self._radius:
            print("Rectangle bounds violated", file=stderr)
            return
        self._y = new_y

    @property
    def area(self):
        return self._x * self._y
The quick summary of this section is that using properties—either read-only as with
.area or read/write as with .x and .y—provides a simpler and more Pythonic API for
users of classes. It remains the case that accessing “protected” attributes such as ._radius
violates the advice of the class creator and might produce unpredictable behavior (i.e., in
this particular class, the radius is intended to be set only on initialization).
5.4 It’s Easier to Ask for Forgiveness Than Permission
Easier to ask for forgiveness than permission. This common Python coding
style assumes the existence of valid keys or attributes and catches exceptions if
the assumption proves false. This clean and fast style is characterized by the
presence of many try and except statements. The technique contrasts with the
LBYL style common to many other languages such as C.
— Python Documentation, Glossary
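The function under discussion, and the problem objects passed to it, are not reproduced in this excerpt. A minimal reconstruction of the function, consistent with the failures below, might be (each problemN stand-in is described after the tracebacks):

def total_length(words_file):
    # Add up the length of every line in an (assumedly open) text file
    total = 0
    while line := words_file.readline():
        total += len(line.strip())
    return total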
It’s a simple function, but it does something sensible. However, there are a lot of ways
this function can go amiss:
>>> total_length(problem1)
Traceback (most recent call last):
[...]
ValueError: I/O operation on closed file.
>>> total_length(problem2)
Traceback (most recent call last):
[...]
AttributeError: 'PosixPath' object has no attribute 'readline'
>>> total_length(problem3)
Traceback (most recent call last):
[...]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in
position 2: invalid continuation byte
>>> total_length(problem4)
Traceback (most recent call last):
[...]
OSError: [Errno 5] Input/output error
3. My lovely colleague, and writer of this book’s Foreword, Alex Martelli, contributed these lovely acronyms early in the evolution of Python’s philosophy, perhaps a year or two on either side of 2000.
An LBYL approach would be to check for the problems before attempting to call
total_length() with the object you have in hand. This might start as:
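# A hedged sketch of how an LBYL version might begin (it cannot be
# finished, for the reasons discussed next):
if not hasattr(words_file, "readline"):
    print(f"words_file is not file-like: {type(words_file)}")
elif words_file.closed:
    print("words_file is not open for reading")
else:
    print(f"Aggregate length is {total_length(words_file)}")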
The problem here is that we have not even checked for those exceptions I have
specifically thought of and created in my REPL session, let alone all of those I have so far
failed to consider. I honestly do not know any mechanism, other than trying, to answer the question of whether some unread bytes will be decodable as UTF-8 (or as another encoding, if one is specified in open()). Nor do I know how to ask in advance whether I/O
might fail. In the above REPL, I created that condition by putting a file on a thumb drive
and physically removing it; similar issues occur on network filesystems and in other
circumstances.
The EAFP approach is simply more flexible and more Pythonic here. For example (the handlers beyond the first are reconstructed; the book’s full version differs in detail):

try:
    print(f"Aggregate length is {total_length(words_file)}")
except AttributeError as err:
    print(f"words_file is not file-like: {type(words_file)}")
except UnicodeDecodeError as err:
    print(f"words_file has undecodable bytes: {err}")
except Exception as err:
    print(f"Unanticipated problem: {err!r}")
This EAFP code handles more cases than does LBYL, but it also handles absolutely every
failure with a generic fallback. We are able to provide different remediation of the problem
for every exception that we know about, and some very general fallback for those
problems we did not think about when first writing the code.
5.6 Regular Expressions and Catastrophic Backtracking

Consider a data file of labeled, ascending two-digit number sequences like the following:
C: 01 07 14 19 27 30 34 36 36 38 44 47 47 50 51 54 58 60 61 62 82 83
83 95
D: 05 10 13 17 30 31 42 50 56 61 63 66 76 90 91 91 93
E: 03 21 23 24 26 31 31 31 33 36 38 38 39 42 49 55 68 79 81
F: 04 08 13 14 14 16 19 21 25 26 27 34 36 39 43 45 45 50 51 62 66 67
71 75 79 82 88
G: 03 10 27 49 51 64 70 71 82 86 94
H: 27 31 38 42 43 43 48 50 63 72 83 87 90 92
I: 12 16 18 19 38 39 40 43 54 55 63 73 74 74 75 77 78 79 88
Now let’s try to process this file using the pattern r"^(.+: )(.+ )+(?=9.)". Presumably in real code we would take some action using the groups in the match, beyond printing out the fact that it matched or failed.
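A hedged sketch of such a processing loop (the filename and the report format are assumptions):

import re
from time import perf_counter

pat = re.compile(r"^(.+: )(.+ )+(?=9.)")
for line in open("number-lines.txt"):
    t0 = perf_counter()
    match = pat.match(line)
    status = "matched" if match else "failed"
    print(f"Line {line[0]}: {status} in {perf_counter() - t0:.4f}s")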
We can notice a few things. In the first place, the actual matches always take little time
(less than a millisecond), while most failures take a moderate fraction of a second. Except
in the case of line F, which took almost 7 seconds. It may not be obvious, but in fact if line
F had one more number less than 90, it would take twice as long. And one more number
after that, another doubling. Exponential behavior gets bad quickly.
We might have a long number string that uses commas to separate at the
thousands, millions, billions, etc., according to a Western convention of
grouping by three digits. The pattern r"\d,\d+$" should satisfy this.
In the illustration, characters and regex subpatterns that match appear
underlined and with a green background. Characters and regex subpatterns
that fail appear overlined and with a red background.
In order to match, the regex engine scans forward to find a digit, then
expects a comma to follow. When that fails, it abandons the partial match,
and scans again for a later digit. Once it finds a digit-comma sequence, it
looks for repeated digits. Initially, it finds a subsequence of the match, but
fails when the end-of-string is not reached. So back to scanning for the initial
digit again, giving back part of the match.
Let’s think about what is happening in the number sequences example. Since the
subpattern r"(.+ )+" can actually match any characters, it first tries to match the entire
string, then finding no lookahead of r"(?=9.)" it backtracks to consider a bit less. Perhaps
after that backtracking, it finds that the next unmatched characters are "9.". If not, it tries
backtracking more as the one-or-more quantifier allows taking fewer repetitions. However,
after each backtracking, the pattern tries to search forward with more repetitions, keeping
the place of the last backtrack. When that doesn’t succeed, we unwind the tail, and
backtrack more on the head.
In a manner similar to the famous Tower of Hanoi puzzle (https://en.wikipedia.org
/wiki/Tower_of_Hanoi), every possible state of head versus tail needs to be explored
before the pattern can ultimately fail. One might have an intuition that the problem can be
solved by using a non-greedy repeating group. Intuitively, it feels like maybe the pattern
will try to backtrack less hard if we aren’t greedy in matching as much as possible with
the +. However, if we set the pattern to r"^(.+: )(.+? )+(?=9.)", the time actually
gets a little bit worse (about 10 seconds on my machine; but an exponential increase with
length either way). If you think again about the problem, you’ll realize that matching as
little as possible instead of as much as possible doesn’t actually prevent backtracking since a
non-match overall remains “not possible” for the regex engine. The same possibility space
still needs to be explored.
In this exact problem, there is actually an easy solution. If we use the pattern
r"^(.+: )(\d+? )+(?=9\d)", all the times become negligible. We match some digits
followed by space rather than any sequence including space. However, most of the time when
you hit this problem it is not as simple as the example. Rather, the most typical situation is
when you consider alternative sub-patterns but do not realize that they can sometimes
actually match the same strings (which is effectively what we did with the (.+ ) in which
the dot can also be a space).
The general form of the “overlapping-alternatives” problem is (meta-syntactically)
(«pat1»|«pat2»|«pat3»)+ where each of the patterns might themselves be complex
and not make it obvious they can match the same thing. As a simple example, consider a
subpattern like \b(a\S*|\S*f\S*|\S*z)\b; that is, words that start with “a”, words that
have an “f ” in the middle, and words that end with “z”. Writing it in English it’s easy to
see that “alferez” (a high-ranking official in medieval Iberia) matches every alternative. If
the string contains many repetitions of that word, a full match would backtrack across all
the options.
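A hedged demonstration of the blowup (the repetition count is deliberately kept small; each additional word multiplies the failing search time roughly by the number of overlapping alternatives):

import re

pat = re.compile(r"^(\b(a\S*|\S*f\S*|\S*z)\b ?)+$")
text = "alferez " * 12 + "nomatch"   # the final word defeats the match
print(pat.match(text))               # None, after combinatorial backtracking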
There is not one completely general solution to these kinds of problems or mistakes.
Sometimes a lookahead for an easier-to-match but less specific pattern can precede the real
search and allow it to fail quickly. For example, in the current goal, we might simply
generically assert a space-9 occurs somewhere before really matching: ^(?=.* 9)(.+: )
(.+ )+(?=9\d). This produces the correct answers and match groups, but always in a few
microseconds for each line.
In Python 3.11, a very nice new feature was introduced to the re module. Actually,
two closely related features: possessive quantifiers and atomic groups. In both cases, the
meaning is “match or fail once, but don’t backtrack.” Despite the related purpose of these
constructs, it’s not obvious that they lend themselves to the current goal. But becoming
aware of them will benefit you in rewriting many problem patterns.
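For instance, both spellings below consume the digits once and never revisit them:

import re

re.match(r"\d++x", "123x")     # possessive quantifier: no backtracking
re.match(r"(?>\d+)x", "123x")  # atomic group: same effect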
5.7 Wrapping Up
There is heterogeneity in the mistakes of this chapter. Python allows and enables you to do a great many things, even some that are generally unwise. Some of these techniques are nonetheless occasionally the best choice available; all of them warrant caution.

6
Picking the Right Data Structure
This chapter hopes to save its readers from unnecessarily difficult code by drawing
attention to a number of useful data structures that Python beginners, and even many
advanced language users, may be unaware of or merely tend to forget. Four of these data
structures are contained in the Python standard library, in collections, and another is in
dataclasses (sort of—we’ll discuss that). The last mistake discussed looks at several
additional, closely related, sequence (i.e., list-like) data structures that can sometimes be
more efficient than lists. These are all built-ins or in standard library modules.
This book’s longest discussion of mistakes comes in Chapter 7, Misusing Data Structures,
in the section “Rolling Your Own Data Structures.” That more advanced mistake goes
into a great many pros and cons of creating custom data structures. Until you’ve considered
all the standard library collections discussed in this chapter, and also the third-party
libraries sortedcontainers (https://grantjenks.com/docs/sortedcontainers/) and pyrsistent
(https://pyrsistent.readthedocs.io/en/latest/), you are most certainly premature in creating
your own custom data structures.
Some of the discussions in this chapter—each about one particular data structure
available in the Python standard library—are arranged a bit differently than other sections
in this book. For each of them, a simple task using one of these simple data structures is
shown; following that is the mistake represented by the longer, less readable, and sometimes
notably slower code that would be needed to achieve the same goal without these data
structures.
6.1 collections.defaultdict
In the very ancient history of Python (before 2.7), collections.defaultdict was
available but collections.Counter was not; the use case that Counter has addressed
since 2010 was met with a recommendation of “use defaultdict.”
Both of these collections are subclasses of dict that add specialized capabilities. A
default dictionary is only narrowly specialized, and in that sense is more general than
collections.Counter, collections.OrderedDict, or collections.UserDict. The
latter two of these are worth reading about in the Python documentation
(https://docs.python.org/3/library/collections.html), but this book does not specifically
address them.
The “mistake” fixed by using defaultdict is simply that of using a repetitive and
overly verbose pattern that can be simplified. I often make this mistake of inelegance, and
hence feel the reminder is worthwhile.
When you work with plain dictionaries you often want to modify a mutable value
associated with a key. But before you can perform that modification, you need to check
whether the dictionary has the needed key in the first place. For example, let’s work with
the SOWPODS wordlist used in various examples, and create a dictionary that maps first
letters to a collection of words starting with that letter:
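The loop itself is not reproduced in this excerpt; a hedged reconstruction (assume words holds a small random sample of SOWPODS, so that the output shown below stays readable):

>>> from random import sample
>>> words = sample([w.rstrip() for w in open("sowpods")], 25)
>>> by_letter = {}
>>> for word in words:
...     first = word[0]
...     if first not in by_letter:
...         by_letter[first] = set()
...     by_letter[first].add(word)
...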
A similar pattern is also often used with list and .append(), and sometimes with
other collections and their corresponding method for including more items.
This pattern is perfectly workable, but is made better with defaultdict:
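A hedged equivalent using defaultdict, where the missing-key check disappears entirely:

>>> from collections import defaultdict
>>> by_letter = defaultdict(set)
>>> for word in words:
...     by_letter[word[0]].add(word)
...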
>>> by_letter['r']
{'repositors', 'rotating', 'resectional', 'reflectometry'}
Not only collection types but any callable can be used as the default factory. For example, recall the fanciful numerological function word_number() defined in Chapter 1, Looping Over the Wrong Things, in the section “(Rarely) Generate a List for Iteration.” But here let’s pretend that this function is actually computationally expensive and we’d like to avoid running it more often than needed (the @functools.lru_cache and @functools.cache decorators also provide a useful way of achieving this task).
In this particular example, however, we might want to go ahead and perform the
calculation once to actually save the expensive calculation into the dictionary (the next
example works with a regular dict as well):
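A hedged sketch of that calculate-once pattern with a plain dict:

cache = {}

def cached_word_number(word):
    if word not in cache:
        cache[word] = word_number(word)  # expensive call happens only once
    return cache[word]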
To circle back to the first paragraph of this discussion, a Counter resembles a defaultdict with its factory set to int, but with additional methods added.
6.2 collections.Counter
One of the loveliest collections in the collections module of the Python standard
library is Counter. This is an implementation of a concept sometimes called a multiset, bag,
or mset (https://en.wikipedia.org/wiki/Multiset). A very common use for counters is in
conveniently creating what are often informally called histograms.1 For this section and the
next few, I present the solution (i.e., a Pythonic example) first, then the mistake.
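The utility itself is not reproduced in this excerpt; a hedged sketch of such a script (its name and switch handling are assumptions):

#!/usr/bin/env python
# count-items: print items of stdin from most to least common
import sys
from collections import Counter

text = sys.stdin.read()
if "--word" in sys.argv:          # crude switch handling (see note below)
    counter = Counter(text.split())
else:
    counter = Counter(text)
for item, count in counter.most_common():
    print(count, repr(item))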
1 argparse, click, docopt, or typer would allow more versatile switch handling.
I use this utility in manners such as the following (here combined with other shell
utilities, such as head):
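For instance (count-items is the hypothetical name from the sketch above; the actual output is omitted here):

% count-items --word < frontmatter | head -3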
1. To be technical, a histogram is a representation of continuous data divided into bins by numeric value ranges
(most commonly uniformly sized bins). Similar, but not identical, is a bar chart of the count or frequency of
categorical data. The utility shown in this chapter is the latter.
I might even create a visualization this way, as in Figure 6.1 (the similarly small barchart utility, which uses matplotlib, is available from the book’s website).
The data shown will not be precisely accurate for the final version of this book’s
frontmatter, but it will be similar. In my script, I merely initialize a counter with iterables
(either of letters or of words), and use the .most_common() method to order the objects
being counted. In larger programs, you are likely to call the method .update() repeatedly
on an existing counter, each time increasing the count for each object in the passed-in
iterable. The .subtract() method decrements these counts. For example:
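A hedged illustration of both methods:

>>> from collections import Counter
>>> letters = Counter("abracadabra")
>>> letters.update("alakazam")
>>> letters.subtract("banana")
>>> letters["a"], letters["n"], letters["q"]
(6, -2, 0)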
All the letters (or any hashable objects) that have never been seen are not present, but
each one that has been seen indicates the number of times it has been seen. Because the
code uses .subtract(), we might reach a zero or negative count. Since Counter is a
subclass of dict it still has all the usual dictionary behaviors, such as deleting a key, if we
wish to use those.
>>> import re
>>> from pathlib import Path
>>> from operator import itemgetter
>>> frontmatter = Path("frontmatter").read_text()
>>> # Normalize case and drop punctuation before counting (this
>>> # cleanup step is an assumption; the book's exact rule is not shown)
>>> cleaned = re.sub(r"[^a-z\s]", "", frontmatter.lower())
>>> hist = {}
>>> for word in cleaned.split():
...     if word in hist:
...         hist[word] += 1
...     else:
...         hist[word] = 1
...
>>> words_by_count = sorted(hist.items(), key=itemgetter(1), reverse=True)
6.3 collections.deque
A deque is a thread-safe generalization of stacks and queues. That is to say, while a
collections.deque has almost the same API as does the built-in list, it adds .*left()
versions of several methods of lists. Crucially, even though lists can do these same
things—for example, list.pop(0) does the same thing as deque.popleft(), and
list.insert(0, x) does the same thing as deque.appendleft(x)—a deque does them
efficiently. Both FIFO (first in, first out) and LIFO (last in, first out) operations work well
with deques.
In Chapter 7, Misusing Data Structures, the first two sections treat situations in which
ill-advised uses of lists can have quadratic complexity (see the note in the first section of
Chapter 3, A Grab Bag of Python Gotchas, and https://en.wikipedia.org/wiki
/Big_O_notation for discussions of “quadratic”). This section foreshadows those
discussions.
Underneath the implementation of collections.deque is a doubly linked list rather than
an array of references (as with list). This makes appending or popping from both ends
efficient, but it also makes indexing less efficient than on lists, and slices are unavailable
(because providing them would falsely imply efficiency in the eyes of the deque designer,
Raymond Hettinger). However, deque provides an interesting .rotate() method that
lists lack (also because they are inefficient there). Neither deque nor list is a “pure win,”
but rather provides a trade-off. The complexity of some common operations is presented
in Table 6.1.
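Table 6.1 is not reproduced here; a reconstruction of the comparison it draws:

Operation                      list              deque
append / pop at right end      O(1) amortized    O(1)
append / pop at left end       O(N)              O(1)
index an arbitrary element     O(1)              O(N)
insert/delete in the middle    O(N)              O(N)

(For list, the left-end operations are spelled insert(0, x) and pop(0).)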
For this section and the few around it, I present the solution (i.e., a Pythonic example)
first, then the mistake.
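The solution script is not reproduced in this excerpt; a hedged sketch of a deque-based moving average (window size 5, reading numbers from stdin; the helper numbers, which emits N random values, is likewise an assumption):

#!/usr/bin/env python
# moving-average-deque
import sys
from collections import deque

window = deque(maxlen=5)        # the oldest value falls off automatically
for line in sys.stdin:
    window.append(float(line))
    if len(window) == window.maxlen:
        print(sum(window) / window.maxlen)

As used:

% numbers 7 | moving-average-deque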
5.4
12.0
18.2
For small inputs, this is not important, and the behavior is the same:
% numbers 7 | moving-average-list
510.0
571.2
670.8
However, looking at a larger window and more data, we notice an important difference
in the two implementations of a moving average algorithm:

real 0m0.869s     (the deque version)
real 0m18.025s    (the list version)
1 The Unix-isms of time and /dev/null are incidental; we just want to time the
script, but don’t care about output.
Increasing efficiency by 20x is wonderful, of course. Along with it, though, we get
thread safety, which is very often important in a context of queues and stacks. When, for
example, we pop a value from a collection, a Python list does not guarantee that two
threads don’t pop the same item, whereas a deque does. When you write multithreaded
code, you need to think carefully about whether the data structures shared among threads
enable deadlocks, race conditions, and other concurrency pitfalls; picking the right data
structure does not eliminate these concerns, but picking the wrong one almost guarantees
them. Broader concerns around concurrency are outside the scope of this book, but the
appendix, Topics for Other Books, provides a very brief recap of the Python concurrency
ecosystem.
6.4 collections.ChainMap
The lovely object collections.ChainMap is a sort of virtual mapping. It provides an
abstraction for looking through multiple dictionaries (or other mappings) successively.
When using a ChainMap we avoid needing to copy any data and also avoid modification of
the underlying dictionaries outside of the proper context for doing so.
For this section, as with the last few, we will look at a Python solution first, followed by
the mistake it fixes.
This becomes even more useful if some of the mappings might change at runtime. For
example:
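The construction of settings is not shown in this excerpt; a reconstruction consistent with the .maps output below:

>>> from collections import ChainMap
>>> user = {"timeout-secs": 60, "env-file": "/home/david/.env-foobar"}
>>> application = {"timeout-secs": 15, "application-name": "FooBarMaker",
...                "env-file": "/user/local/foobar/.env"}
>>> default = {"timeout_secs": 30, "environment": "development",
...            "env-file": "/opt/.env-default"}
>>> settings = ChainMap(user, application, default)
>>> settings["application-name"]     # earliest map containing the key wins
'FooBarMaker'
>>> settings["username"] = "Dr. Mertz"          # writes go to the first map
>>> settings["application-name"] = "DavidMaker"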
>>> settings.maps
[{'timeout-secs': 60, 'env-file': '/home/david/.env-foobar',
'username': 'Dr. Mertz', 'application-name': 'DavidMaker'},
{'timeout-secs': 15, 'application-name': 'FooBarMaker', 'env-file':
'/user/local/foobar/.env'}, {'timeout_secs': 30, 'environment':
'development', 'env-file': '/opt/.env-default'}]
The mistake that ChainMap avoids is merging the configurations into a single dictionary with dict.update():
>>> default.update(application)
>>> default.update(user)
>>> default
{'timeout-secs': 60, 'environment': 'development', 'env-file':
'/home/david/.env-foobar', 'application-name': 'DavidMaker',
'username': 'Dr. Mertz'}
However, after merging application and user into default, we have lost a record of
what might have earlier been inside default and what might have been absent. For many
applications, retaining that information is valuable.
One might try to remedy this by using a brand-new name (assume all the initial
mappings are reset to the initial values shown earlier):
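A hedged guess at that variant:

>>> merged = {**default, **application, **user}   # later mappings win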
This is perhaps better than the last mistaken version since default has its initial
configuration maintained. Still, if the application updates one or more of the collected
mappings, that is not reflected in settings unless we use the same few lines each time
before checking settings.
6.5 Namedtuples and Dataclasses

This section discusses two closely related record-like datatypes. The first is namedtuples, which are simply a variant of tuples, but with named fields. The second is dataclasses, which are simply Python classes with many dunders and other useful behaviors generated automatically in a concise way.
The things we can do with dataclasses and namedtuples can also be accomplished with
the built-in types dict and tuple. Indeed, every namedtuple is simply a subclass of
tuple, so it is genuinely the same type. However, using datatypes that allow readers to
frame code very clearly as “records” improves readability and makes reasoning easier.
Very often, it is useful to work with a collection of data where each item has a fixed
collection of fields, but we work with many such items (comparing, sorting, aggregating,
etc.). Namedtuples and dataclasses have more parts than simple scalars—such as float,
int, str, or decimal.Decimal or other special types—but are more similar to datatypes
than to data structures. The most obvious difference between the two types this section
discusses is that dataclasses are mutable while namedtuples are immutable.
Let’s look at an example of record-oriented data. We often encounter such data in
relational database tables, in CSV or fixed-width files, in formats like Apache Parquet or
Apache Arrow, in JSON serialized records, and elsewhere. For extensive numeric analysis
especially, a dataframe library such as Pandas or Polars is often useful. However, for this
discussion we remain in pure-Python land.
The standard library module csv is often very useful for reading line-oriented,
delimited textual data files. Despite the acronym standing for “comma-separated values”
the module is perfectly happy to work with any delimiter you might have in your data
files. The csv module is especially useful when character escaping is needed (the delimiter,
and the escape and quoting characters themselves, have to be treated nonliterally). For the
example in this discussion, we avoid that concern, by stipulation and by the actual format
of the example file. In the archive for the book (https://gnosis.cx/better) we have a file
with some information on per-nation demographics:
[data]$ wc -l population-data.txt
236 population-data.txt
[data]$ head -5 population-data.txt
Name|Population|Pct_Change_2023|Net_Change|Density_km3|Area_km2
China|1,439,323,776|0.39|5,540,090|153|9,388,211
India|1,380,004,385|0.99|13,586,631|464|2,973,190
United States|331,002,651|0.59|1,937,734|36|9,147,420
Indonesia|273,523,615|1.07|2,898,047|151|1,811,570
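The code that reads this file is not reproduced in this excerpt; a hedged reconstruction (the record class name Demographics is an assumption) consistent with the REPL output below:

>>> import csv
>>> from collections import namedtuple
>>> rows = csv.reader(open("population-data.txt"), delimiter="|")
>>> Demographics = namedtuple("Demographics", next(rows))  # header row
>>> world_data = []
>>> for name, *numbers in rows:
...     numbers = [float(num.replace(",", "")) for num in numbers]
...     world_data.append(Demographics(name, *numbers))
...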
>>> world_data[37][5]
306230.0
>>> world_data[37].Area_km2
306230.0
>>> world_data[37]._fields
('Name', 'Population', 'Pct_Change_2023', 'Net_Change',
'Density_km3', 'Area_km2')
>>> world_data[37]._asdict()
{'Name': 'Poland', 'Population': 37846611.0,
'Pct_Change_2023': -0.11, 'Net_Change': -41157.0,
'Density_km3': 124.0, 'Area_km2': 306230.0}
Here is a plausible reconstruction of the Nation definition and the construction of poland (annotated fields via typing.NamedTuple, per the discussion of unenforced “declarations” below):

>>> from typing import NamedTuple
>>> class Nation(NamedTuple):
...     Name: str
...     Population: int
...     Pct_Change: float
...     Net_Change: int
...     Density: float
...     Area: float
...
>>> poland = Nation("Poland", 37_846_611, Pct_Change=-0.11,
...                 Net_Change=-41_157, Density=124.0, Area=306_230)
...
>>> poland
Nation(Name='Poland', Population=37846611, Pct_Change=-0.11,
Net_Change=-41157, Density=124.0, Area=306230)
We can use either positional or named parameters to create a new Nation object, so
the example uses a mixture (positional first, of course). As promised, the type declarations
are not enforced; for example, poland.Area is an integer (as in the actual source data)
even though it conceptually could be nonintegral as “declared.”
>>> from dataclasses import dataclass
>>> @dataclass
... class DataNation:
...     Name: str
...     Population: int = 0
...     Pct_Change: float = 0
...     Net_Change: int = 0
...     Density: float = 0
...     Area: float = 0
...
...     def project_next_year(self, new_people):
...         self.Population += new_people
...         self.Net_Change = new_people
...         self.Pct_Change = new_people / self.Population
...
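A construction consistent with the repr shown below (hedged; the book may have read Peru’s row from the data file instead):

>>> peru_2023 = DataNation("Peru", 32_971_854, 1.42, 461_401, 26, 1_280_000)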
>>> peru_2023
DataNation(Name='Peru', Population=32971854, Pct_Change=1.42,
Net_Change=461401, Density=26, Area=1280000)
>>> peru_2023.Density, peru_2023.Population
(26, 32971854)
6.6 Efficient Concrete Sequences

Python’s built-in and standard library “unboxed” sequence types include bytes, bytearray, and array.array. The last of these is a family of data structures, in the sense that it can be configured to hold elements of various bit sizes, all of a single numeric datatype.
There is a relatively narrow domain where it makes sense to use Python’s standard
library array module rather than “taking the next step” to use NumPy, but within that
narrow range of use cases, it is nice to have array.array available to avoid external
dependencies. Conceptually, bytearray is very similar to array.array("B") in that
both are mutable unboxed sequences of integer values between 0 and 255 (i.e., bytes).
However, the collection of methods each provides are distinct. bytearray has most of the
same methods as str, and is deliberately string-like; in contrast array.array (of every
datatype) has methods much closer to those in list.
Just as tuple is, in some sense, an “immutable version of list,” bytes is “an immutable
version of bytearray.” The analogy isn’t perfect. Tuples have a starkly constrained
collection of methods (i.e., exactly two, .count() and .index()) but bytes has many
methods, most in common with str. The built-in type bytearray in turn essentially has
a superset of the methods of bytes (some methods relevant to mutable sequences are only
in bytearray):
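A quick way to see the mutation-related extras, followed by a hedged reconstruction of the four half-gigabyte sequences the rest of this discussion compares (Python 3.11; the construction details are assumptions):

>>> sorted(set(dir(bytearray)) - set(dir(bytes)))
['__alloc__', '__delitem__', '__iadd__', '__imul__', '__setitem__',
 'append', 'clear', 'copy', 'extend', 'insert', 'pop', 'remove',
 'reverse']
>>> from array import array
>>> from secrets import token_bytes
>>> size = 2**29                        # 512 MiB of random byte values
>>> rand = token_bytes(size)            # bytes
>>> rand_bytearray = bytearray(rand)    # its mutable counterpart
>>> rand_array = array("B", rand)       # unsigned 8-bit array.array
>>> rand_list = list(rand)              # list of small (interned) ints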
>>> rand_list[:3]
[201, 217, 132]
Superficially, these four types of sequences seem similar. It took a while to run the loop
to make sure they have equal elements, but indeed they do. A first notable difference
among them is their memory usage. For the first three objects, we can ask about the
memory usage in a simple way:
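For instance (each reported size lands slightly above the 536,870,912-byte payload):

>>> import sys
>>> for seq in (rand, rand_bytearray, rand_array):
...     print(f"{type(seq).__name__}: {sys.getsizeof(seq):,}")
...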
These sizes are not quite identical since their headers vary (and array.array uses a
modest overallocation strategy, similar to list). However, all of these are close to the
536,870,912 bytes that are the minimal possible size to represent all of these random bytes.
The question is somewhat more complicated for rand_list. Lists use a relatively
aggressive overallocation of slots; but even apart from such overallocation of memory, each
slot is a pointer to an internal data structure used to represent a Python integer. For
byte-sized integers (i.e., between 0 and 255), this structure occupies 28 bytes. For larger
integers or wider floating point numbers, the size of the boxed number increases somewhat,
albeit slowly. In concept, a list of integers needs to contain both an array of pointer slots
(probably 64-bit on modern systems), of roughly the length of the array, and also, at a
different memory address, the underlying boxed number.
However, this is further complicated in the current example by the fact that small
integers are interned in CPython. This is discussed briefly at the start of Chapter 2, Confusing
Equality with Identity. It means that all the pointers to 8-bit integers have already been
allocated when Python starts up, and these pointers are simply reused for repeating list
slots. If we were working with larger numbers, we would have to multiply the number of
(non-duplicated) items by the size of the boxed number, and add that to the size of the
pointer array itself:
>>> sys.getsizeof(2)
28
>>> sys.getsizeof(2**31)
32
>>> sys.getsizeof(2**63)
36
>>> sys.getsizeof(2**200)
52
>>> f"{sys.getsizeof(rand_list):,}"
'4,294,967,352'
As a ballpark figure, the list version takes about 4 GiB, since 64-bit pointers are eight
times as large as 8-bit numbers. If we were dealing with non-interned numbers, the
memory story would lean much more heavily against list.
Often much more important than memory usage is runtime:
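A hedged sketch of the comparison, using IPython’s %timeit (concrete timings vary by machine and are omitted here):

In [1]: %timeit rand.count(42)            # bytes: fastest by far
In [2]: %timeit rand_bytearray.count(42)  # bytearray: nearly as fast
In [3]: %timeit rand_list.count(42)       # list: orders of magnitude slower
In [4]: %timeit rand_array.count(42)      # array.array: slightly worse than list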
It’s impressive how much faster counting on bytes or bytearray is than on list. It’s
disappointing that the similar data structure array.array doesn’t achieve similar results
(even coming in slightly worse than list). The reason for this is that in CPython,
array.array.count() still uses the same indexing machinery as other pure-Python
sequences. This has been a “known issue” in CPython since at least 2015 and Python
3.7—no core developers have felt it is one that needs to be fixed, however, since as soon as
you start asking this question, the answer is almost always “use NumPy” (which solves an
enormous number of other problems at the same time).
Of course, the example only used 8-bit unsigned integers. If you wanted to store 16-,
32-, or 64-bit signed or unsigned integers, or floating point numbers of system width
(usually 64-bit nowadays), bytearray would clearly not be an option. Also, in these cases,
array.array would pull modestly ahead of list, since larger values are not interned. As a redemption
for array.array, we can notice that it is still much faster than list in a range of other
situations where we work with sequences of numbers of the same underlying machine
type. For example:
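One such situation (a hedged example, not necessarily the book’s): serializing a million float64 values, where array.array can hand over its buffer in a single memory copy while a list of boxed floats must be packed element by element:

from array import array
import struct

values = array("d", (i / 3 for i in range(1_000_000)))
as_list = values.tolist()

buf1 = values.tobytes()                    # one fast memory copy
buf2 = struct.pack("1000000d", *as_list)   # packs a million boxed floats
assert buf1 == buf2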
6.7 Wrapping Up
I will, in fact, claim that the difference between a bad programmer and a good
one is whether he considers his code or his data structures more important.
Bad programmers worry about the code. Good programmers worry about data
structures and their relationships.
— Linus Torvalds, Creator of Linux
Choosing a data structure well suited to your tasks is one of the most important tasks in
programming. Python provides both ubiquitously used data structures—mainly dict,
list, tuple, and set built-in types—and some others that are more often overlooked.
Many of these overlooked gems live inside the collections module of the standard
library. A few live in other modules, or simply as less used built-ins.
On the Python Package Index (https://pypi.org/) you can probably find hundreds of
very specialized data structure modules. There are definitely times when one of these is
exactly what you need; however, much of what lives on PyPI is somewhat preliminary or
partial, being the hobby of one or a few programmers. The libraries sortedcontainers and
pyrsistent mentioned in the introduction to this chapter are definitely worth being aware
of. I’ve found specific use for pygtrie (https://pypi.org/project/pygtrie/) and R-tree
(https://pypi.org/project/Rtree/) as well, for example.
A good strategy when choosing a data structure for your project is to first think about
whether any of the “big four” listed in the preceding paragraph accomplish exactly what
you need (including good big-O behavior). Very often the answer is yes. If you feel like
there may be drawbacks to using those, consider using one of the other data structures in
the standard library collections, array, enum, dataclasses, or perhaps queue and a
few others. If it still feels like something is missing, think about sortedcontainers,
pyrsistent, NumPy, and Pandas. If that still falls short—and if it does, think very carefully
about whether you are genuinely asking the right questions—look for other data structure
projects in the open source world. And in the rare, but not nonexistent, circumstance
those earlier steps don’t satisfy your needs, think about creating custom data structures
(after reading the next chapter).
7
Misusing Data Structures
Python has extremely well-designed data structures and data representations, many of
which are discussed in the prior chapter. However, a few unfortunately common antipatterns can make the use of data structures dramatically inefficient or lead to unintended behavior in your code.
7.1 Quadratic Behavior of Repeated List Search

Python’s in operator works on many objects that are not concrete collections; consider the flags used by the re module:
>>> import re
>>> flags = re.VERBOSE | re.IGNORECASE | re.DOTALL | re.UNICODE
>>> type(flags)
<flag 'RegexFlag'>
>>> re.U in flags
True
>>> type(re.M)
<flag 'RegexFlag'>
In a commonsense way, the flag re.U (which is simply an alias for re.UNICODE) is
contained in the mask of several flags. A single flag is simply a mask that indicates only one
operational re modifier. Moreover, a few special objects that are not collections but
iterables also respond to in. For example, range is special in this way.
Part of the cleverness of range is that it does not need to do a linear search through its
items, even though it is in many respects list-like. A range object behaves like a realized
list in most ways, but only contains anything in a synthetic sense. In other words,
range(start, stop, step) has an internal representation similar to its call signature,
and operations like a slice or a membership test are calculated using a few arithmetic
operations. For example, n in my_range can simply check whether start <= n < stop and whether (n - start) % step == 0.
The time to check for membership of an element near the “start” of a range is almost
identical to that for membership of an element near the “end” because Python is not
actually searching the members.
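For example:

>>> big_range = range(0, 10**12, 7)
>>> 999_999_999_999 in big_range    # instant: arithmetic, not a search
True
>>> 999_999_999_998 in big_range
False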
Lists are a concrete and ordered collection of elements that can be appended to very
quickly, and have a few other internal optimizations. However, we have to watch where
the ordered part might bite us. The only generic way to tell if an element is contained in a
list is to do a linear search on the list. We might not find it until near the end of the search,
and if it isn’t there, we will have had to search the entire list.
We can use the bisect module in the standard library if we wish to speed this
greatly for lists we are happy to keep in sorted order (which is not all of our
lists, however). The sortedcontainers third-party library (https://grantjenks.
com/docs/sortedcontainers/) also provides a similar speedup when we can
live with the mentioned constraint.
We can see where checking containment within a list becomes unwieldy with a simple
example. I keep a copy of the 267,752-word SOWPODS (https://en.wikipedia.org/wiki
/Collins_Scrabble_Words) English wordlist on my own system. We can use that as an
example of a moderately large list (of strings, in this case).
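The timed code is not reproduced in this excerpt; a hedged sketch of its shape (in the book’s version, only some of the candidate words are actually present in the big list):

most_words = [w.rstrip() for w in open("sowpods")]   # 267,752 entries
some_words = most_words[::400]                       # hypothetical sample
found = [w for w in some_words if w in most_words]   # one O(N) scan per word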
Taking a full minute for this simple operation is terrible, and it gets worse quickly—at
approximately an O(N²) rate (to be precise, it is Ω(N×M) since it gets even worse as the hit
rate goes down for specific data).1
What we’ve shown is concise and superficially intuitive code to perform one linear scan
of most_words for every word in some_words. That is, we perform an O(N) scan
operation M different times (where N and M are the sizes of the respective lists). A quick
clue you can use in spotting such pitfalls is to look for multiple occurrences of the in
keyword in an expression or within a suite. Whether in an if expression or within a loop,
the complexity is similar.
Fortunately, Python gives us a very efficient way to solve exactly this problem by using
sets.
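A hedged sketch of the set-based version:

most_words_set = set(most_words)                        # one-time O(N) build
found = [w for w in some_words if w in most_words_set]  # O(1) per lookup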
That’s better than a 1000x speedup. We can see that the result is exactly the same. Even
assuming we needed to concretely look at where those words occur within our lists rather
1. The so-called big-O notation is commonly used in computer science when analyzing the complexity of an
algorithm. Wikipedia has a good discussion at https://en.wikipedia.org/wiki/Big_O_notation. There are
multiple symbols used for slightly different characterizations of asymptotic complexity: O, o, Ω, ω, and Θ.
Big-O is used most commonly, and indicates a worst-case behavior; Big-Theta indicates an asymptote for both
worst case and best case; Big-Omega indicates a best-case behavior. Small-o and Small-omega are used to express
the somewhat more complex concepts of one function dominating another rather than bounding another.
than merely count them or see what they are, 649 operations of some_words.index(word) is comparatively cheap relative to the three-orders-of-magnitude difference
encountered (looking through the shorter list is far faster, and typically we find the
different word after searching halfway).
If the particular problem discussed is genuinely close to the one you face, look
towards the third-party module pygtrie (https://pypi.org/project/pygtrie/),
which will probably get you even faster and more flexible behavior. For the
precise problem described, CharTrie is the class you’d want. In general, the
trie data structure (https://en.wikipedia.org/wiki/Trie) is very powerful for a
class of string search algorithms.
2. Ibid.
7.2 Deleting or Adding Elements to the Middle of a List
For lists, accessing an item at a given numeric position is O(1). Changing the
value at a numeric position is O(1). Perhaps surprisingly, list.append() and
list.pop() are also amortized O(1).
That is, adding more items to a list will intermittently require reallocating
memory to store their object references; but Python is clever enough to use
pre-allocated reserve space for items that might be added. Moreover, as the
size of a list grows, the pre-allocation padding also grows. The overall effect is
that reallocations become rarer, and their relative cost averages out to 0%
asymptotically. In CPython 3.11, we see the following behavior on an x86-64
architecture (but these details are not promised for a different Python
implementation, version, or chip architecture):
>>> from sys import getsizeof
>>> def pre_allocate():
...     lst = []
...     size = getsizeof(lst)
...     print(" Len   Size Alloc")
...     for i in range(1, 10_001):
...         lst.append('a')
...         newsize = getsizeof(lst)
...         if newsize > size:
...             print(f"{i:>4d}{newsize:>7d}{newsize-size:>6d}")
...             size = newsize
...
>>> pre_allocate()  # output reflowed into two columns to save space
Len Size Alloc | Len Size Alloc
1 88 32 | 673 6136 704
5 120 32 | 761 6936 800
9 184 64 | 861 7832 896
17 248 64 | 973 8856 1024
25 312 64 | 1101 10008 1152
33 376 64 | 1245 11288 1280
41 472 96 | 1405 12728 1440
53 568 96 | 1585 14360 1632
65 664 96 | 1789 16184 1824
77 792 128 | 2017 18232 2048
93 920 128 | 2273 20536 2304
109 1080 160 | 2561 23128 2592
129 1240 160 | 2885 26040 2912
149 1432 192 | 3249 29336 3296
173 1656 224 | 3661 33048 3712
201 1912 256 | 4125 37208 4160
233 2200 288 | 4645 41880 4672
1 The word deleted at initial index 3 was railbus, but on next deletion ciclatoun was at
that index.
2 The word wringings was inserted at index 3, but got moved to index 4 when awless
was inserted at index 1.
For the handful of items inserted and removed from the small list in the example, the
relative inefficiency is not important. However, even in the small example, keeping track
of where each item winds up by index becomes confusing.
As the number of operations gets large, this approach becomes notably painful. The
following toy function performs fairly meaningless insertions and deletions, always
returning five words at the end. But the general pattern it uses is one you might be
tempted towards in real-world code.
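A hedged sketch of such a toy function (get_word(), assumed to return one random word, and the exact shuffling are reconstructions):

from random import randrange

def insert_then_delete(num_words):
    words = []
    for _ in range(num_words):
        # Insert each new word at a random position (often near the middle)
        words.insert(randrange(len(words) + 1), get_word())
    while len(words) > 5:
        # Then delete from random positions until five words remain
        del words[randrange(len(words))]
    return words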
Going from 200 operations (counting each of insertion and deletion) to 20,000
operations takes on the order of 200x as long. At these sizes the lists themselves are small
enough to matter little; the time involved is dominated by the number of calls to
get_word(), or perhaps a bit to randrange(), although we still see a 2x proportional
slowdown from the list operations.
However, upon increasing the number of operations by another 100x, to 2 million,
linear scaling would see an increase from 20 ms to about 2 seconds. Instead it jumps to
nearly 2 minutes, or about a 55x slowdown from linear scaling. I watched my memory
usage during the 15 minutes that %timeit took to run the timing seven times, and it
remained steady.
It’s not that these operations actually use very much memory; rather, every time we
insert one word near the middle of a 1 million word list, that requires the interpreter to
move 500,000 pointers up one position in the list. Likewise, each deletion near the middle
of a 1 million word list requires us to move the top 500,000 pointers back down. This gets
much worse very quickly as the number of operations increases further.
>>> from pprint import pprint
>>> pprint(words_by_priority[:8])
[('sowarry', 28),
('actinobiology', 45),
('oneself', 62),
('subpanel', 68),
('alarmedly', 74),
('marbled', 98),
('dials', 120),
('dearing', 121)]
>>> pprint(words_by_priority[-5:])
[('overslow', 999976),
('ironings', 999980),
('tussocked', 999983),
('beaters', 999984),
('tameins', 999992)]
It’s possible—even likely—that the same priority occurs for multiple words,
occasionally. It’s also very uncommon that you actually care about exactly which order two
individual items come in out of 100,000 of them. However, even with duplicated
priorities, items are not dropped, they are merely ordered arbitrarily (but you could easily
enough impose an order if you have a reason to).
Deleting items from the words data structure is just slightly more difficult than was
del words[n] where it had been a list. To be safe, you’d want to do something like:
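For example (a hedged sketch; words here maps each word to its priority):

word = get_word()
try:
    priority = words.pop(word)
except KeyError:
    print(f"Not present: {word}")
else:
    print(f"Removed: {word} (priority {priority})")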
The extra print() calls and the else clause are just for illustration; presumably if this
approach is relevant to your requirements, you would omit them:
This approach remains fast and scalable, and is quite likely much closer to the actual
requirements of your software than was misuse of a list.
7.3 Strings Are Iterables of Strings
If you prefer LBYL (look before you leap) to EAFP (easier to ask forgiveness than
permission) you could write this as follows.
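Neither version survives in this excerpt; hedged reconstructions of both the EAFP original and its LBYL counterpart (note that neither treats strings specially, which is exactly the trap this section describes):

from collections.abc import Iterable

def flatten(o, result):
    try:                           # EAFP: just try to iterate
        for item in o:
            flatten(item, result)
    except TypeError:              # not iterable: a scalar leaf
        result.append(o)
    return result

def flatten2(o, result):
    if isinstance(o, Iterable):    # LBYL: check before iterating
        for item in o:
            flatten2(item, result)
    else:
        result.append(o)
    return result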
Either way, these are perfectly sensible functions to take a nested data structure with
scalar leaves, and return a linear sequence from them. These first two functions return a
concrete list, but they could equally well be written as a generator function such as the
following.
>>> from collections.abc import Iterable
>>> def flatten_gen(o):
...     if isinstance(o, Iterable):
...         for item in o:
...             yield from flatten_gen(item)
...     else:
...         yield o
...
>>> nested = [
... (1, 2, 3),
... {(4, 5, 6), 7, 8, frozenset([9, 10, 11])},
... [[[12, 13], [14, 15], 16], 17, 18]
... ]
>>> flatten(nested, []) # 1
[1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 18]
>>> flatten2(nested, []) # 1
[1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 18]
>>> for item in flatten_gen(nested):
... print(item, end=" ")
... print()
1 2 3 8 9 10 11 4 5 6 7 12 13 14 15 16 17 18
1 The same breakage occurs with a default depth of 1000, it just shows more lines of
traceback before doing so.
2 Recent Python shells simplify many tracebacks, but IPython does not by default.
The solution to these issues is to add some unfortunate ugliness to code, as in the
examples shown here.
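A hedged sketch of the fix, treating strings (and bytes) as scalars explicitly:

def flatten_fixed(o, result):
    if isinstance(o, Iterable) and not isinstance(o, (str, bytes)):
        for item in o:
            flatten_fixed(item, result)
    else:
        result.append(o)
    return result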
RED = "RED"
GREEN = "GREEN"
BLUE = "BLUE"
class Sprite:
def __init__(self, shape, color):
self.shape = shape
self.color = color
def process(sprite):
if sprite.shape == TRIANGLE and sprite.color == RED:
red_triangle_action(sprite)
elif something_else:
# ... other processing
3. C, Java, Go, Rust, C#, TypeScript, and most programming languages also have enums of varying stripes. But
the CONSTANT convention is nonetheless often seen in code in those languages.
In a highly dynamic language like Python, we can potentially redefine “constants” since
the capitalization is only a convention rather than in the syntax or semantics of the
language. If some later line of the program redefines SQUARE = 2, buggy behavior is likely
to emerge. More likely is that some other module that gets imported has redefined SQUARE
to something other than the expectation of the current module. This risk is minimal if
imports are within namespaces, but from othermod import SQUARE, CUBE,
TESSERACT is not necessarily unreasonable to have within the current module.
Programs written like the preceding one are not necessarily broken, and not even
necessarily mistakes, but it is certainly more elegant to use enums for constants that come
in sets.
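A hedged sketch of the enum version of the earlier Sprite processing:

from enum import Enum

class Shape(Enum):
    TRIANGLE = 1
    SQUARE = 2
    CIRCLE = 3

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

def process(sprite):
    if sprite.shape is Shape.TRIANGLE and sprite.color is Color.RED:
        red_triangle_action(sprite)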
It’s not impossible to get around the protection that an Enum provides, but you have to
work quite hard to do so rather than break it inadvertently. In effect, the attributes of an
enum are read-only. Therefore, reassigning to an immutable attribute raises an exception.
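For example (the exact wording of the message varies across Python versions):

>>> Shape.SQUARE = 2
Traceback (most recent call last):
[...]
AttributeError: cannot reassign member 'SQUARE'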
There are also “constants” that are not alternatives, but simply values; these likewise
cannot actually be enforced as constants in Python. Enums might still be reasonable
namespaces with slightly more enforcement against changes than modules have.
Overwriting constants
>>> import math
>>> radius = 2
>>> volume = 4/3 * math.pi * radius**3
>>> volume                      # the correct volume
33.510321638291124
>>> math.pi = 3.14              # monkeypatching the module attribute
>>> 4/3 * math.pi * radius**3
33.49333333333333
>>> from math import pi         # the import sees the patched value
>>> 4/3 * pi * radius**3
33.49333333333333
>>> pi = 3.1415                 # rebinding only the local name
>>> 4/3 * pi * radius**3
33.50933333333333
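A hedged sketch of Enum used as such a namespace of plain values:

>>> from enum import Enum
>>> class Math(Enum):
...     PI = 3.141592653589793
...     E = 2.718281828459045
...
>>> Math.PI.value * radius**2       # the area of our radius-2 circle
12.566370614359172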
This usage doesn’t really use Enum as a way of enumerating distinct values, but it does
carry with it a modest protection of “read-only” values.
7.5 Learn Less Common Dictionary Methods
Module dictionaries
>>> import re
>>> type(re.__dict__)
<class 'dict'>
>>> for key in re.__dict__.keys():
... print(key, end=" ")
...
__name__ __doc__ __package__ __loader__ __spec__ __path__ __file__
__cached__ __builtins__ enum _constants _parser _casefix _compiler
functools __all__ __version__ NOFLAG ASCII A IGNORECASE I LOCALE L
UNICODE U MULTILINE M DOTALL S VERBOSE X TEMPLATE T DEBUG RegexFlag
error match fullmatch search sub subn split findall finditer compile
purge template _special_chars_map escape Pattern Match _cache
_MAXCACHE _compile _compile_repl _expand _subx copyreg _pickle Scanner
The various functions and constants in a module are simply its dictionary. Built-in types
usually use a slightly different dictionary-like object.
>>> int.__dict__["numerator"]
<attribute 'numerator' of 'int' objects>
>>> (7).__class__.__dict__["numerator"]
<attribute 'numerator' of 'int' objects>
>>> (7).numerator
7
Custom classes also continue this pattern (their instances have either .__dict__ or
.__slots__, depending on how they are defined).
The first version works, but it uses five lines where one would be slightly faster and
distinctly clearer.
But with recent Python versions, even more elegant versions are:
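Plausibly, the one-liners meant here are the dictionary union operators added in Python 3.9 (dict_a and dict_b are hypothetical stand-ins):

merged = dict_a | dict_b    # new dict; right-hand values win on collision
dict_a |= dict_b            # in-place merge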
Making Copies
Another often-overlooked method is dict.copy(). However, I tend to feel that this
method is usually properly overlooked. The copy made by this method is a shallow copy, so
any mutable values might still be changed indirectly, leading to subtle and hard-to-find bugs.
Chapter 2, Confusing Equality with Identity, is primarily about exactly this kind of mistake.
Most of the time, a much better place to look is copy.deepcopy(). For example:
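A hedged illustration of the shallow-copy surprise:

>>> d1 = {"numbers": [1, 2, 3]}
>>> d2 = d1.copy()            # shallow: the inner list is shared
>>> d2["numbers"].append(4)
>>> d1["numbers"]             # d1 changed "at a distance"
[1, 2, 3, 4]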
This is confusing, and pretty much a bug magnet. Much better is:
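With a deep copy, the values are recursively copied as well:

>>> import copy
>>> d1 = {"vals": [1, 2, 3]}
>>> d3 = copy.deepcopy(d1)
>>> d3["vals"].append(4)
>>> d1                       # unchanged
{'vals': [1, 2, 3]}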
Dictionaries are an amazingly rich data structure in Python. As well as the usual
efficiency that hash maps or key/value stores have in most programming languages, Python
provides a moderate number of well-chosen “enhanced” methods. In principle, if
dictionaries only had key/value insertion, key deletion, and a method to list keys, that
would suffice to do everything the underlying data structure achieves. However, your code
can be much cleaner and more intuitive with strategic use of the additional methods
discussed.
obj == pickle.loads(pickle.dumps(obj))
There are exceptions here. File handles or open sockets cannot be sensibly serialized
and deserialized, for example. But most data structures, including custom classes, survive this
round-trip perfectly well.
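A quick demonstration of the invariant holding for a nested data structure:

>>> import pickle
>>> obj = {"name": ("David", "Mertz"), "langs": ["Python", "SQL"]}
>>> pickle.loads(pickle.dumps(obj)) == obj
True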
In contrast, this “invariant” is very frequently violated:
obj == json.loads(json.dumps(obj))
JSON is a very useful format in several ways. It is (relatively) readable pure text; it is
highly interoperable with services written in other programming languages with which a
Python program would like to cooperate; deserializing JSON does not introduce code
execution vulnerabilities.
Pickle (in its several protocol versions) is also useful. It is a binary serialization format
that is more compact than text. Or specifically, it is protocol 0, 1, 2, 3, 4, or 5, with each
successive version being improved in some respect, but all following that characterization.
Almost all Python objects can be serialized in a round-trippable way using the pickle
module. However, none of the services you might wish to interact with, written in
JavaScript, Go, Rust, Kotlin, C++, Ruby, or other languages, has any idea what to do
with Python pickles.
Here false, null, and true are literals, while object, array, number, and string are
textual patterns. To simplify, a JSON object is like a Python dictionary, with curly braces,
colons, and commas. An array is like a Python list, with square brackets and commas. A
number can take several formats, but the rules are almost the same as what defines
valid Python numbers. Likewise, JSON strings are almost the same as the spelling of
Python strings, but always with double quotation marks. Unicode numeric codes are
mostly the same between JSON and Python (edge cases concern very obscure surrogate
pair handling).
Let’s take a look at some edge cases. The Python standard library module json
“succeeds” in two cases by producing output that is not actually JSON:
1 The result of json.dumps() is a string; printing it just removes the extra quotes in
the echoed representation.
2 Neither NaN nor Infinity (under any spelling variation) are in the JSON standards.
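Both cases are easy to demonstrate:

>>> import json
>>> json.dumps("a string")        # the result itself includes quote marks
'"a string"'
>>> print(json.dumps(float("nan")), json.dumps(float("inf")))
NaN Infinity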
In some sense, this behavior is convenient for Python programmers, but it breaks
compatibility with (many) consumers of these serializations in other programming
languages. We can enforce more strictness with json.dumps(obj, allow_nan=False),
which would raise ValueError in the preceding lines. However, some other libraries in
some other programming languages also allow this almost-JSON convention.
Depending on what you mean by "round-trip," you might say this succeeded. Indeed it
does strictly within Python itself; but it fails when the round-trip involves talking with a
service written in a different programming language, and having it talk back. Let's look at some
failures within Python itself. The most obvious cases are in Python’s more diverse
collection types.
>>> print(json.dumps(obj))
{"1": ["David", "Mertz", "37\u2103"], "2": [[4.6, 3.2, 1.5],
[9.8, -1.2, 0.4]], "3": true, "4": null}
>>> json.loads(json.dumps(obj))
{'1': ['David', 'Mertz', '37°C'], '2': [[4.6, 3.2, 1.5],
[9.8, -1.2, 0.4]], '3': True, '4': None}
In JSON, Python’s True is spelled true, and None is spelled null, but those are
entirely literal spelling changes. Likewise, the Unicode character DEGREE CELSIUS can
perfectly well live inside a JSON string (or any Unicode character other than a quotation
mark, reverse solidus/backslash, and the control characters U+0000 through U+001F). For
some reason, Python's json module decided to substitute the numeric code, but this
substitution has no effect on the round-trip.
What got lost was that some data was inside a namedtuple called Person, and other
data was inside tuples. JSON only has arrays, that is, things in square brackets. The general
“meaning” of the data is still there, but we’ve lost important type information.
Moreover, in the serialization, only strings are permitted as object keys, and hence our
valid-in-Python integer keys were converted to strings. However, this is lossy since a
Python dictionary could, in principle (but it’s not great code), have both string and
numeric keys:
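For instance:

>>> import json
>>> json.dumps({1: "int key", "1": "str key"})
'{"1": "int key", "1": "str key"}'
>>> json.loads(json.dumps({1: "int key", "1": "str key"}))
{'1': 'str key'}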
Two or three things conspired against us here. Firstly, the JSON specification doesn’t
prevent duplicate keys from occurring. Secondly, the integer 1 is converted to the string
"1" when it becomes JSON. And thirdly, Python dictionaries always have unique keys, so
the second try at setting the "1" key overwrote the first try.
Another somewhat obscure edge case is that JSON itself can validly represent numbers
that Python does not support:
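For example (outputs here are from a platform with 64-bit floats):

>>> import json
>>> json.loads("1e400")
inf
>>> json.loads("0.12345678901234567890123456789")
0.12345678901234568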
This is not a case of crashing, nor failing to load numbers at all. But rather, one number
overflows to infinity since it is too big for float64, and the other is approximated to fewer
digits of precision than are provided.
A corner edge case is that JSON numbers that “look like Python integers” actually get
cast to int rather than float:
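For example:

>>> import json
>>> type(json.loads("10")), type(json.loads("10.0"))
(<class 'int'>, <class 'float'>)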
4. Perhaps even Donald Knuth’s “bible”: The Art of Computer Programming (various editions among its current five
volumes; but especially the 3rd edition of volume 1, Addison-Wesley, 1997).
7.7 Rolling Your Own Data Structures
The choice of which data structures to include as built-ins, or in the standard library, is
one that language designers debate, and which often leads to in-depth discussion and
analysis. Python’s philosophy is to include a relatively minimal, but extremely powerful and
versatile, collection of primitives with dict, list, tuple, set, frozenset, bytes, and
bytearray in __builtins__ (arguably, complex is a simple data structure as well).
Modules such as collections, queue, dataclasses, enum, array, and a few others
peripherally, include other data structures, but even there the number is much smaller than
for many programming languages.
A clear contrast with Python, in this regard, is Java. Whereas Python strives for
simplicity, Java strives to include every data structure users might ever want within its
standard library (i.e., the java.util namespace). Java has hundreds of distinct data
structures included in the language itself. For Pythonic programmers, this richness of
choice largely leads only to “analysis paralysis”
(https://en.wikipedia.org/wiki/Analysis_paralysis). Choosing among so many
only-slightly-different data structures
imposes a large cognitive burden, and the final decision made (after greater work) often
remains sub-optimal. Giving someone more hammers can sometimes provide little other
than more ways for them to hit their thumb.
A really lovely example of the design discussions that go into Python is in PEP
603 (https://peps.python.org/pep-0603/), and the python-dev mailing list
and Discourse thread among core developers that followed this PEP. The
proposal of this new data structure, made in September 2019, has been
neither rejected outright nor accepted so far.
Internally, CPython utilizes a data structure called a Hash Array Mapped
Trie (HAMT). This isn’t used widely, but there are specific places in the C code
implementing CPython where it is the best choice. A HAMT is a kind of
immutable dictionary, in essence. Since this structure already exists in the
CPython codebase, it would be relatively easy to expose it under a name like
frozenmap or frozendict; this would parallel the existing frozenset and
tuple in being the “immutable version of built-in mutable collections.”
HAMT is clearly a useful data structure for some purposes. If it were not,
the very talented CPython developers would not have utilized it. However, the
current tide of opinion among these developers is that HAMT is not general
purpose enough to add to the cognitive load of tens of millions of Python
developers who probably won’t need it.
When you reach quickly for an opportunity to use one of these data structures you have
learned—each of which genuinely does have concrete advantages in specific contexts—it
often reflects an excess of cleverness and eagerness more than it does good design instincts.
A reality is that Python itself is a relatively slow bytecode interpreter. Unlike compiled
programming languages, including just-in-time ( JIT) compiled languages, which produce
machine-native instructions, CPython is a giant bytecode dispatch loop. Every time an
instruction is executed, many levels of indirection are needed, and basic values are all
relatively complex wrappers around their underlying data (remember all those methods of
datatypes that you love so much?).
Counterbalancing the fact that Python is relatively slow, most of the built-in and standard
library data structures you might reach for are written in highly optimized C. Much the
same is true for the widely used library NumPy, which has a chapter of its own.
On the one hand, custom data structures such as those mentioned can have significant
big-O complexity advantages over those that come with Python.5 On the other hand,
these advantages need to be balanced against what is usually a (roughly) constant-factor
speed advantage that the optimized C implementations hold over pure Python.
1 The get_word() function available at this book’s website is used in many examples.
It simply returns a different word each time it is called.
2 Using the same random seed assures that we do exactly the same insertions for each
collection type.
The testing function performs however many insertions we ask it to, and we can time
that:
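The test function itself might look something like this sketch (get_word() is the helper mentioned in the notes above; the exact harness and timing code are assumptions):

import random
from time import perf_counter

def test_insertions(make_collection, n):
    random.seed("reproducible")    # identical insertions for every collection type
    collection = make_collection()
    for _ in range(n):
        position = random.randint(0, len(collection))
        collection.insert(position, get_word())
    return collection

start = perf_counter()
test_insertions(list, 100_000)
print(f"list: {perf_counter() - start:.2f} seconds")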
Without having yet said just what a CountingTree is, I can say that I spent more time
ironing out the bugs in my code than I entirely want to admit. It’s not a large amount of
code, as you’ll see, but the details are futzy.
Notable points are that even though I’ve created a data structure optimized for exactly
this task, it does worse than list for 100 items. CountingTree does worse than list for
10,000 items also, even by a slightly larger margin than for 100. However, my custom data
structure pulls ahead slightly for 100,000 items; and then hugely so for a million items.
It would be painful to use list for the million-item sequence, and increasingly worse if
I needed to do even more collection.insert() operations.
[Figure 7.1: A CountingTree. The root node E has subtree length 9; its children include G(5) and K(3).]
In Figure 7.1, each node contains a value that is a single letter; in parentheses we show
the length of each node with its subtree. Identical values can occur in multiple places
(unlike, e.g., for a set or a dictionary key). Finding the len() of this data structure is a
matter of reading a single attribute. But having this length available is what guides
insertions.
It is very easy to construct a sequence from a tree. It is simply a matter of choosing a
deterministic rule for how to order the nodes. For my code, I chose to use depth-first,
left-to-right; that’s not the only possible choice, but it is an obvious and common one. In
other words, every node value occurs at exactly one position in the sequence, and every
sequence position (up to the length) is occupied by exactly one value. Since our use case is
approximately random insertion points for new items, no extra work is needed for
rebalancing or enforcing any other invariants.
The code shown only implements insertions, our stated use case. A natural extension to
the data structure would be to implement deletions as well. Or changing values at a given
position. Or other capabilities that lists and other data structures have. Most of those
capabilities would remain inexpensive, but details would vary by the specific operation, of
course.
class CountingTree:
    def __init__(self, value=EMPTY):       # the EMPTY sentinel is defined below
        self.value = value
        self.left = EMPTY
        self.right = EMPTY
        self.length = 0 if value is EMPTY else 1

    def insert(self, index, value):
        if self.value is EMPTY:
            self.value = value
        elif index == self.length:         # appending past the last position
            if self.right is EMPTY:
                self.right = CountingTree(value)
            else:
                self.right.insert(
                    index - (self.left.length + 1), value)
        elif index == 0 and self.left is EMPTY:
            self.left = CountingTree(value)
        else:                              # subtree lengths pick the side
            if index > self.left.length:
                self.right.insert(
                    index - (self.left.length + 1), value)
            else:
                self.left.insert(index, value)
        self.length += 1
This much is all we actually need to run the benchmarks performed here. Calling
CountingTree.insert() repeatedly creates trees much like that in the figure. The .left
and .right attributes at each level might be occupied by the sentinel EMPTY, which the
logic can utilize for nodes without a given child.
It’s useful also to define a few other behaviors we’d like a collection to have.
    def __iter__(self):
        if self.left is not EMPTY:
            yield from self.left
        if self.value is not EMPTY:
            yield self.value
        if self.right is not EMPTY:
            yield from self.right

    def __repr__(self):
        return f"CountingTree({list(self)})"
    def __len__(self):
        return self.length


class Empty:
    length = 0                             # EMPTY children report length 0

    def __repr__(self):
        return "EMPTY"


EMPTY = Empty()
7.7.3 Takeaways
This section has had a long discussion. The takeaway you should leave with isn’t a simple
one. The lesson is “be subtle and accurate in your judgments” about when to create and
when to avoid creating custom data structures. It’s not a recipe, but more vaguely an
advocacy of a nuanced attitude.
As a general approach to making the right choice, I’d suggest following a few steps in
your thinking:
1. Try implementing the code using a widely used, standard Python data structure.
2. Run benchmarks to find out if any theoretical sub-optimality genuinely matters for
the use case your code is put to.
3. Research the wide range of data structures that exist in the world to see which, if
any, are theoretically optimal for your use case.
7.8 Wrapping Up
Sometimes a powerful object method or general technique can also lead you in the wrong
direction, even in seemingly ordinary uses. This wrong direction might cause bad
complexity behavior; at times it might work for initial cases but then fail in cases you had
not yet considered.
In this chapter we probed at some operations on lists—generally one of the best
optimized and flexible data structures Python has—where a different data structure is
simply better. We also looked at how recursive algorithms need to remember that strings
are both scalar and iterable, which means they often need to be special-cased in program
flow.
Two more mistakes in this chapter looked at “sins of omission” where a facility that
may be less familiar provides a more convenient and more readable approach to common
tasks. Specifically, two mistakes served as reminders of the enum module and of some of the
less widely used methods of dictionaries.
In the penultimate mistake of this chapter, the capabilities and limitations of the widely
used JSON format were explored. In particular, we saw how Python developers might
forget the (relatively minor) lossiness of JSON representations of data.
The final mistake discussed is one of nuance and complex decision-making. Often,
creating custom data structures is premature optimization; but at other times they can
significantly improve your code. The (long) discussion provides some guidance about
making this choice wisely.
8
Security
This book is not the one you should read to understand either cryptography or computer
security built on top of cryptographic primitives. Of course, in a very general way, it’s a
mistake to wire money to fraudulent entities phishing under various get-rich-quick or
tug-on-heartstrings pretenses. However, that’s life advice that might touch on using
computers, not Python advice.
The mistakes in this chapter simply discuss approaches to security concerns that
I have frequently seen Python developers handle badly. Often this reflects a
misunderstanding of some very general security concepts. At other times it reflects
unawareness of the appropriate modules and functions to use (standard library or third-party).
For an actual background on cryptography, Bruce Schneier’s Applied Cryptography:
Protocols, Algorithms, and Source Code in C is a quite old, but still classic, text (2nd edition,
John Wiley & Sons, 1996; it even has a few errata corrected from the first edition by your
current author). Schneier’s somewhat more recent 2010 text, with coauthors Niels
Ferguson and Tadayoshi Kohno, Cryptography Engineering: Design Principles and Practical
Applications, is also excellent (John Wiley & Sons).
There is a distinction to be made between cryptography and security, with the latter
being more broad. Cryptography is an important element of secure software designs, but it
is not the only concern, nor even the only building block. Cryptography concerns itself
with mathematical mechanisms of enforcing confidentiality, integrity, non-repudiation,
and authentication. Security more broadly concerns itself also with risk management,
access control, classification of information, business continuity, disaster recovery, and laws
and regulations. Addressing the broad security concerns often utilizes cryptographic
protocols, but puts them in the context of a general “threat model” against which
protection is sought.1
Threat modeling and security procedures involve many concerns beyond what Python
programmers can or cannot do. For example, it considers corporate training about when
1. The term threat model is a rather elegant one, to my eyes; but it’s also likely unfamiliar to many readers. It
basically amounts to posing a collection of “what if?” questions about potential (malicious) actions that might
cause a failure in computer systems. What are the ways the system can possibly go wrong? This is only occasionally
a matter of “someone breaks the encryption protocol.” Far more often concerns like social engineering
(convincing people to act unwisely through misrepresentation) or denial of service (causing systems to break
without an intruder per se getting access to information) are threats modeled.
and with whom employees should share secrets. It includes physical locks on buildings. It
includes procedures used for in-person or telephone verification of an identity. It includes
background checks before giving humans particular access rights. It includes physical
securing of transmission wires or server rooms.
You don’t need to read any of the recommended external texts to understand this
chapter, but they provide some general background to the mathematical and algorithmic
design of particular functions and modules this chapter discusses. The broader concerns
around “social engineering” (i.e., convincing people to do things that compromise
security) are even further outside the direct scope of this chapter.
However, for actual cryptographic needs, random falls short. When a seed is not
specified to random.seed(), a small number of truly random entropy bytes are used as a
seed. In many circumstances that makes a “random” token sufficiently unpredictable.
Many circumstances are not all circumstances, and for applications that care about
cryptographic security, it is better simply to start by using secrets in case your code
becomes used in a situation where vulnerabilities are exposed. The secrets module has
been available since Python 3.6, so it is not anything very new.
An excellent, fairly informal, analysis of vulnerabilities in the Mersenne Twister was
done by James Roper.5 The short summary is that if you can observe 624 consecutive
integers produced by the Mersenne Twister, you can reconstruct its complete state, and
thereby every output it will produce in the future. Even if fewer consecutive values can be
observed by a malicious intruder, indirect vulnerabilities can likely be exploited.
Besides the vulnerability Roper points out, we also commonly spin up virtual images
with little initial entropy, and/or reproduce an exact known state of the MT generator.
Running on your local machine, which has been in operation for hours or weeks, provides
sufficient strength in a generated seed. Running on a Docker image that Kubernetes spins
up frequently, on an AWS Lambda, or on a Google Cloud Function very well may not.
The code you run today in a circumstance where “random is fine” will probably be
ported to an environment where it is not tomorrow.
There are only a few functions in the secrets module. The most commonly used one
generates tokens of arbitrary length in a few different formats:
>>> secrets.token_bytes(12)
b'\xe7Wt\x96;\x829a\xc9\xbd\xe1\x94'
>>> secrets.token_hex(20)
'b397afc44c9cac5dba7900ef615ad48dd351d7e3'
>>> secrets.token_urlsafe(24)
'QYNBxUDVGO4feQUyetyih8V5vKKyy8nQ'
>>> for _ in range(9_999_999):
...     random.choice(words)
...
>>> random.choice(words)
'spekboom'
1 This loop performs a genuinely unknowable number of steps through the Mersenne
Twister generator.
2 This choice of “remotivations” will not occur if I run the code again, or if you run
it. However, “spekboom” will remain stable if you use my wordlist.
Not only was “spekboom” (a South African succulent commonly kept as a houseplant)
the 10 millionth word chosen following initialization of the MT generator with the seed I
used, but also the prior nine million, nine hundred ninety-nine thousand, nine hundred
ninety-nine words were the same (although they were not displayed in the output).
If we wish to save the state of the generator after these 10 million choices, that is
easy to do.
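The random module exposes this directly:

>>> saved_state = random.getstate()    # capture after the 10 million choices
>>> random.setstate(saved_state)       # later: resume from exactly this point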
Just as a quick reminder, even though the preceding example worked with
random.choice from a wordlist, simply to create memorable outputs, the same reproducibility and
APIs work the same way when drawing from any of the numeric distributions.
Perhaps the most obvious need for “repeatable randomness” is in creating unit tests or
functional tests of our software. We would like to ensure that our software—for example, a
long-running web server—given a particular sequence of inputs continues to behave the
same way.
For 10 million inputs, we could probably save those in a data file without outrageous
waste on modern computers. But if it were 10 billion inputs, saving a sequence of inputs
is enormously wasteful when one seed of a few-character string, or one state of
625 numbers, would suffice.
Another common related need is when you have a complex process that you believe
you can optimize, but want to ensure that identical behavior is retained. As a contrived
example, we have a function blackbox() that takes a string and an iterable of integers as
arguments, and returns a permutation of that string. Again, for short iterables, simply
saving them as static data is fine, but for long ones repeatability is relevant.
Let’s run the existing implementation.
We could create other test sequences of varying lengths, with varying seeds, and with
varying floor and ceiling of the integers produced. But across a reasonable collection of
such configurations, we would like our new blackbox_fast() function to produce the
same outputs as a previous slow implementation.
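A sketch of such a comparison (blackbox() and blackbox_fast() are the functions described above; the seed and sizes chosen here are arbitrary):

import random

def check_same_behavior(seed, n, low=0, high=1000):
    random.seed(seed)
    slow = blackbox("a string to permute",
                    (random.randint(low, high) for _ in range(n)))
    random.seed(seed)                  # regenerate the identical input stream
    fast = blackbox_fast("a string to permute",
                         (random.randint(low, high) for _ in range(n)))
    assert slow == fast

check_same_behavior("test-1", 10_000_000)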
We can see that the new implementation is considerably faster, while also remaining
consistent in behavior across our range of test cases. Constructing a collection of such tests
over large iterables would be impractical without “deterministic randomness.”
8.2 Putting Passwords or Other Secrets in "Secure" Source Code

def get_other_data():
    _username = "DavidMertz"
    _password = "jNeyQIyE6@pR"
    _localservice1 = "192.168.227.1"
    _localservice2 = "192.168.227.2"
    _localservice3 = "192.168.227.3"
    ...  # connect to the services with these credentials to obtain `data`
    return data
The same general principles apply for passwords, tokens, session keys, or any other
information that should generally be kept secure.
A first, and sometimes adequate, approach is to store secrets in environment variables.
import os

def get_other_data():
    _username = os.environ.get("LOCAL_SERVICE_USER")
    _password = os.environ.get("LOCAL_SERVICE_PASSWORD")
    _localservice1 = os.environ.get("LOCAL_SERVICE_ADDR_1")
    _localservice2 = os.environ.get("LOCAL_SERVICE_ADDR_2")
    _localservice3 = os.environ.get("LOCAL_SERVICE_ADDR_3")
    ...  # as before, use these to obtain and return `data`
This becomes a vulnerability only if an attacker can gain shell access, or equivalent, to a
system where the code is running. However, these secrets are visible in unencrypted form
within an OS shell. Developers (myself included) often lose track of which environment
variables were previously set, and thereby forget explicitly to unset them after an
application using them has terminated.
A step better is to use the “dotenv” approach. This style does keep secret information
within a file on the filesystem, usually with the special name .env. Specifically, this file
8.2 Putting Passwords or Other Secrets in “Secure” Source Code 197
must always be excluded from being kept under version control (e.g., in .gitignore), and
should be distributed by a separate secure channel, as needed. As well, permissions to a
.env file should be restricted to the specific user or group that has legitimate access rights.
import os
from dotenv import load_dotenv  # the third-party "python-dotenv" package

load_dotenv()                   # reads .env into this process's environment

def get_other_data():
    _username = os.environ.get("LOCAL_SERVICE_USER")
    _password = os.environ.get("LOCAL_SERVICE_PASSWORD")
    _localservice1 = os.environ.get("LOCAL_SERVICE_ADDR_1")
    _localservice2 = os.environ.get("LOCAL_SERVICE_ADDR_2")
    _localservice3 = os.environ.get("LOCAL_SERVICE_ADDR_3")
    ...  # as before, use these to obtain and return `data`
This looks very similar to directly loading environment variables; indeed, the function
body is identical. However, these environment variables are only loaded at the time this code runs,
and are not in the environment of the parent process.
The .env file used in this example would look like:
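Presumably along these lines, reusing the values from the hardcoded version shown earlier:

LOCAL_SERVICE_USER=DavidMertz
LOCAL_SERVICE_PASSWORD=jNeyQIyE6@pR
LOCAL_SERVICE_ADDR_1=192.168.227.1
LOCAL_SERVICE_ADDR_2=192.168.227.2
LOCAL_SERVICE_ADDR_3=192.168.227.3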
A better approach still is to use your operating system’s keyring service. This is handled
somewhat differently by macOS, Windows, Linux, and even Android, but all have secure
systems that do not store plaintext passwords. The module keyring wraps those
OS-specific details and provides a common interface in Python.
import keyring as kr

def get_other_data():
    _username = kr.get_password("data-service", "user") # 1
    _password = kr.get_password("data-service", "pw")
    _localservice1 = kr.get_password("data-service", "1")
    _localservice2 = kr.get_password("data-service", "2")
    _localservice3 = kr.get_password("data-service", "3")
    ...  # as before, use these to obtain and return `data`
Using the keyring module is straightforward, and will not keep unencrypted versions
of any secrets anywhere on your filesystem. If at all possible, use this final approach.
However, the first two solutions are still vastly better than putting secrets directly into
source code.
1 Under Python 3.11+ .to_bytes() has a default length of 1. For older Python
versions, you need to specify .to_bytes(length=1).
Here’s the thing. I’m not nearly smart enough to know how to attack this encryption
algorithm, even if I were given many plaintexts and many ciphertexts. However, my sliver
of intelligence tells me that there are actual cryptanalysts who are vastly better at such
attacks than I am. It is extremely likely that you readers are not better cryptographers than
I am.6 Just because you can’t think of how to compromise an algorithm, that doesn’t mean
that someone else can’t.
Rather than rely on amateur cryptography, the correct approach is to obtain the
third-party pyca/cryptography library (https://cryptography.io/en/latest/). This is
implemented correctly by people who genuinely understand security, and is updated
quickly if weaknesses are discovered.
The protocol setup is probably slightly cumbersome, but is well documented in the
module documentation. What I show can easily be wrapped in functions with simple
signatures similar to the amateur ones I created earlier.
>>> import secrets
>>> from cryptography.hazmat.primitives.ciphers import (
...     Cipher, algorithms, modes)
>>> key = secrets.token_bytes(32)        # a 256-bit AES key
>>> iv = secrets.token_bytes(16) # 2
>>> cipher = Cipher(algorithms.AES(key), modes.CBC(iv))
>>> encryptor = cipher.encryptor()
6. As an idle diversion, I have created a challenge at https://gnosis.cx/better. For this challenge, I have created
1000 ciphertexts, all of which are encrypted with the same key, using exactly the encryption code published here.
Moreover, all 1000 plaintexts these are encryptions of are sentences from a draft of this book (they may not occur
verbatim after later editing, though). If any reader can identify the key used, or even two of the plaintext
sentences, I will deliver to them a signed copy of the print version of this book.
8.4 Use SSL/TLS for Microservices
A number of lightweight Python web frameworks are more or less designed around
microservices. The heavyweight web framework, Django, has a side
project, Django REST framework (https://www.django-rest-framework.org/), as well.
As a toy example of a web service, I wrote this Flask application.
from flask import Flask

app = Flask(__name__)

@app.route("/")
async def hello_world(): # 1
    return "<p>Hello, World!</p>"
Read the Flask documentation for details on how to launch using a robust WSGI
server. You can also modify IP addresses and ports exposed. The key point for this mistake
is that we only published an http:// route by launching this way.
We can access the data this server provides from within the same localhost, for example:
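Using the third-party requests library from another shell on the same machine:

>>> import requests
>>> requests.get("http://127.0.0.1:5000").text
'<p>Hello, World!</p>'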
However, I cannot be entirely confident that this server will not be accessed
externally via exposed ports on my system (or on the hosting Docker image, for example).
It would be better to provide the data from this server only over an encrypted TLS channel.
Obtaining and installing SSL certificates is a somewhat cumbersome task. The
availability of the community project Let’s Encrypt (https://letsencrypt.org/) or
commercial company Cloudflare's free SSL/TLS, even on free accounts
(https://www.cloudflare.com/ssl/), makes this somewhat easier. However, your
employer or project is likely to have its own certificate creation and distribution
system, which you should, obviously, follow for production.

The discussion in this mistake describes the value and procedures for utilizing
TLS/SSL. Using this means that channels are protected against eavesdroppers.

However, in this example, no mechanism was created or described around
using authentication for access. Anyone with access to the IP address and
port used in the example can access the demonstration server. If we also
wanted to require credentials for access, that would be a separate discussion.
All the web servers we mention can certainly do that, but the specifics are not
documented herein.
Fortunately, for local development and testing purposes, using ad hoc certificates is very
simple (but not ideal; we will improve that shortly):
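With Flask's development server, for example, launching with an ad hoc certificate is a one-argument change (the "adhoc" option requires the pyOpenSSL package):

app.run(ssl_context="adhoc")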
At this point, no http:// route has been created. We can try connecting again, to both
the TLS and the unencrypted channel.
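The TLS channel answers, though with a warning about the certificate we cannot verify (the warning text varies by requests/urllib3 version):

>>> resp = requests.get("https://127.0.0.1:5000", verify=False)
InsecureRequestWarning: Unverified HTTPS request is being made to host
'127.0.0.1'. Adding certificate verification is strongly advised.
>>> resp.text
'<p>Hello, World!</p>'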
>>> try:
... resp = requests.get("http://127.0.0.1:5000", verify=False)
... except Exception as err:
... print(err)
...
('Connection aborted.',
ConnectionResetError(104, 'Connection reset by peer'))
The connection to the unencrypted route is simply refused. The unverified certificate
settles for a warning, but still provides its data. However, it would be best to heed this
warning. Doing so is not difficult. You will need openssl installed
(https://www.openssl.org/).
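A command along these lines generates a self-signed certificate, after which the server can be launched pointing at the generated files (the file names here are assumptions):

$ openssl req -x509 -newkey rsa:4096 -nodes \
      -out cert.pem -keyout key.pem -days 365

app.run(ssl_context=("cert.pem", "key.pem"))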
We are now able to connect to the local microservice, but only using the specific
domain name “localhost” and by pointing to the correct cert.pem. We could have
configured the FQDN as “127.0.0.1” instead if we wished, but using the symbolic name is
generally the recommended practice.
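That is, something like:

>>> resp = requests.get("https://localhost:5000", verify="code/cert.pem")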
>>> resp.text
'<p>Hello, World!</p>'
>>> try:
... resp = requests.get("http://localhost:5000",
... verify="code/other-cert.pem")
... except Exception as err:
... print(err)
...
('Connection aborted.',
ConnectionResetError(104, 'Connection reset by peer'))
>>> try:
... resp = requests.get("http://127.0.0.1:5000",
... verify="code/cert.pem")
... except Exception as err:
... print(err)
...
('Connection aborted.',
ConnectionResetError(104, 'Connection reset by peer'))
There is a very crucial part of the quoted documentation to pay attention to: “mostly
HTTP.” Using urllib.request with modern HTTPS websites (or microservices) is
exceedingly difficult. It is possible, but only just barely and after great frustration.
For example, a footnote in this chapter mentions challenges accompanying this book. It
happens that my website is proxied by Cloudflare (https://www.cloudflare.com/) to
provide it with SSL/TLS and content caching (under their free plan, for my limited needs);
the underlying content is contained on GitHub Pages (https://pages.github.com/). On my
local computer I use a VPN. None of this is an uncommon kind of configuration for static
pages, but it is indeed not a 2003-style page on a single server with a fixed DNS entry.
Let’s try to retrieve that page.
There is nothing difficult here, and I feel confident that the maintainers of requests
are following all relevant security and transport protocols. What about if we try to do the
same thing with the Python standard library?
-----BEGIN CERTIFICATE-----
[... PEM body of the certificate retrieved for the site, elided ...]
MAoGCCqGSM49BAMCA0cAMEQCIFv5r9ARdjfr5ctvjV57d2i18tOwGWRAsT9HwDr/
zyy8AiA4V5gjyLS5wRF24bqfyly64AnKQqOJyAMMCXy5HAK95A==
-----END CERTIFICATE-----
>>> import ssl, urllib.request
>>> try:
... ctx = ssl.create_default_context(cadata=cert)
... url = f"https://{address}:{port}/{resource}/"
... resp = urllib.request.urlopen(url, context=ctx)
... except Exception as err:
... print(f"{err.__class__.__name__}: {err}")
...
URLError: <urlopen error [SSL: WRONG_VERSION_NUMBER] wrong
version number (_ssl.c:992)>
I am confident that there genuinely is some magic incantation that would convince
Python’s standard library to fetch the content of this rather ordinary, static web page. With
a great deal more effort, I would probably eventually find it.
208 Chapter 8 Security
In the meanwhile, I will be happy and productive (and secure) using the third-party
requests library, as the CPython developers themselves recommend.
1 Per the mistake of putting secrets in source code, a good approach is used here.
2 The Psycopg 3 driver to connect to PostgreSQL
However, there may be quite a few words in this wordlist, so we’d like to limit it.
Specifically, we’d like to limit it based on a criterion provided by a user; in our case, users
may request to only see words having a certain prefix.
So far, it is all still straightforward and simple. But let’s try calling the function another
time.
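Suppose a malicious user supplies this as a "prefix":

>>> some_words()
[Prefix]? '';DELETE FROM wordlist;--
----------------------------------------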
The problem is not, of course, simply that nothing matches the strange user input. It’s
that we actually did the following:
. Select words matching ''
. Delete everything in the table
. Include a final comment of --'
An unsanitized user-provided input just deleted the entire content of our table!
Requesting the “e” prefix words (“edacity” and “entertains”) shows the empty table, as
does trying the “p” prefix again:
>>> some_words()
[Prefix]? e
----------------------------------------
>>> some_words()
[Prefix]? p
----------------------------------------
Randall Munroe’s brilliant comic strip XKCD, as often, captures the concern of
this mistake very pithily.7
Python itself also has a wonderful reference to XKCD. Try typing import
antigravity in a Python shell sometime.
Let’s repopulate the table with some new words, and try a safer version of the function.
Showing a special fondness for the letter “b”, I ran this query in the psql shell:
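For instance (the column name and the specific words here are assumptions):

wordlist=# INSERT INTO wordlist (word) VALUES
wordlist-#     ('barratry'), ('bedizen'), ('bloviate'), ('brachiate');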
7. Comic strip used with permission from Randall Munroe (xkcd), granted by The Gernert Company.
Before we do that, however, it is worth mentioning that the DB-API actually allows for
several different styles of parameter specification, which drivers are free to choose among.
These are specified in PEP 249 (https://peps.python.org/pep-0249/).
paramstyle  Meaning
qmark       Question mark style, e.g., …WHERE name=?
numeric     Numeric, positional style, e.g., …WHERE name=:1
named       Named style, e.g., …WHERE name=:name
format      ANSI C printf format codes, e.g., …WHERE name=%s
pyformat    Python extended format codes, e.g., …WHERE name=%(name)s
>>> psycopg.paramstyle
'pyformat'
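A corrected version of the function passes the user input as a bound parameter rather than interpolating it into the SQL string. This sketch assumes the connection details and table layout of the earlier listing:

import psycopg

def some_words():
    prefix = input("[Prefix]? ")
    with psycopg.connect("dbname=wordlist") as conn:
        cur = conn.execute(
            "SELECT word FROM wordlist WHERE word LIKE %s",
            (prefix + "%",))   # bound parameter; "%" here is the SQL wildcard
        print("-" * 40)
        for (word,) in cur.fetchall():
            print(word)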
1 The interaction between % as an SQL wildcard and in a Python string requires a bit
of special care for this example.
If we try a similar injection with this corrected code, nothing bad happens. We just get
zero results because '';DELETE FROM wordlist;-- isn’t a word in the wordlist table.
Or if it is an unusual word, we get the harmless result:
>>> some_words()
[Prefix]? '';DELETE FROM wordlist;--
----------------------------------------
'';DELETE FROM wordlist;--more stuff after prefix
def lcg(seed=123, multiplier=69069, increment=1, modulus=2**32):
    assert multiplier > 0 and modulus > 1  # subtler checks elided
    state = seed
    while True:
        state = (multiplier * state + increment) % modulus
        yield state / modulus
We can utilize the generator to obtain pseudo-random numbers in the interval [0,1):
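For example:

prng = lcg(seed=42)
samples = [next(prng) for _ in range(5)]    # five floats in [0, 1)
assert all(0 <= x < 1 for x in samples)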
The assertions made were concretely useful when I was first debugging this code.
Moreover, they communicate to other developers some of my expectations about the
reasonable preconditions for the algorithm to work (some are much more nuanced; what
makes a “good” modulus and multiplier would take full papers in number theory, not
single lines).
However, suppose that a naive developer wanted to enhance this code by having a
fallback to Python’s standard library random module when disallowed parameters were
passed in.
def bad_lcg(seed=123, multiplier=69069, increment=1, modulus=2**32):
    try:
        assert multiplier > 0 and modulus > 1
        state = seed
        while True:
            state = (multiplier * state + increment) % modulus
            yield state / modulus
    except AssertionError:
        # Naive fallback: silently substitute the stdlib generator
        import random
        while True:
            yield random.random()
if __name__ == '__main__':
    for x, _ in zip(bad_lcg(multiplier=-1), range(5)):
        print(x)
At first brush, this seems to work. At least in the sense of getting some pseudo-random
numbers (albeit not ones that are easily reproducible as in the non-fallback case):
This breaks down rather badly once we run the code with the optimization flag,
however:
2.8638169169425964e-08
0.9999999715946615
2.8638169169425964e-08
These pseudo-random numbers are very bad at their job. Specifically, the same two
numbers (close to the extremes of the unit interval) will alternate forever, which is not an
expected distribution. Moreover, if you are using this numeric stream for anything vaguely
security related, many attack vectors are opened up by failing to fall back to the alternate
code path. Depending on the context, of course, many other problems might occur as
well; for example, performance problems, errors in modeling, excessive collisions of, for
instance, hash keys, and so on.
The wisdom of this specific fallback is a separate matter; there are absolutely many
other contexts where a fallback is reasonable. The solution to this error is extremely
simple, happily. Simply use explicit checks for the conditions that are permitted or
prohibited within if/elif or match/case blocks, and use those to fall back to alternate
behavior. Using an explicit raise of an exception other than AssertionError within
those blocks is perfectly reasonable as well.
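For instance, a version with explicit checks keeps its guarantees under python -O (a sketch, using the same parameters as above):

def checked_lcg(seed=123, multiplier=69069, increment=1, modulus=2**32):
    if multiplier <= 0 or modulus <= 1:
        raise ValueError("multiplier and modulus must be positive")
    state = seed
    while True:
        state = (multiplier * state + increment) % modulus
        yield state / modulus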
8.8 Wrapping Up
In Python, as in every modern programming language, a background concern in the
creation of any software system is the vulnerabilities it might create. Software can do things
wrong in the sense of not producing the output or interactions that we had in mind when
writing it; but it can also do things wrong in the sense of allowing bad actors to cause
harm through the use of that software.
Genuine security analysis of software is a broad topic, and one requiring a great deal of
knowledge ranging over many things not in this book. However, there are a number of
often repeated mistakes that Python developers make, each of which can be relatively
straightforwardly rectified by following the advice in this chapter. These few things will
not address all possible concerns, but they will fix surprisingly much of what we Python
developers often get wrong.
For a more in-depth look at a number of security issues, the Open Worldwide
Application Security Project (OWASP; https://owasp.org/) is a good resource. They
discuss and provide resources for some of the vulnerabilities I discuss, and for a great many
others as well.
There are many security mistakes not specifically addressed in this chapter, of necessity.
Security is a broad concern, with many books wholly about it, and many professionals
employed to understand a range of concerns. However, a few mistakes that are already well
described in the Python documentation are still worth quickly mentioning.
. Unpickling pickles from uncontrolled sources can cause execution of arbitrary code.
. Loading YAML has the same concern as with pickles, but an easy solution is
yaml.safe_load().
. Loading XML can enable denial-of-service attacks; these are discussed in detail at
https://docs.python.org/3/library/xml.html.
. Temporary files created with tempfile.mktemp() are unsafe, but Python
documents this and provides tempfile.mkstemp() as a safe replacement.
Use the right libraries. Choose a few of the right APIs. Avoid some missteps that will
be obvious once you internalize the mistakes in this chapter. After those few small actions,
the so-called attack surface of the software you create will be vastly reduced.
9
Numeric Computation in Python
Working with numbers is, of course, one of the most common things we do in
programming languages. And yet, there are a great many ways that our numeric code can
go subtly wrong. For larger scale and more intensive work with numbers, the third-party
libraries NumPy and Pandas make a great many tasks easier and faster. Other important
numeric libraries also exist, but those two are so widespread in the Python ecosystem as to
merit independent discussion. Several such libraries are discussed briefly in the appendix,
Topics for Other Books.
While reaching out to “vectorized” or “unboxed” numeric libraries can often be
useful,1 we should and do perform much numeric computation in Python itself.
Many of the mistakes we can make in working with numbers derive from subtle edge
cases of the behavior of floating point numbers. In a general way, this class of issues applies
just as much to the third-party libraries as it does to pure Python. In fact, three of the four
mistakes discussed in the IEEE-754 section of this chapter apply almost identically to most
programming languages other than Python, but remain concerns for Python developers, so
are discussed here.
Some mistakes Python developers can make relate to other kinds of numeric domains,
perhaps with choosing the wrong one for a given task. The last two mistakes in this
chapter touch on decimal.Decimal and fractions.Fraction and when those are good
choices for numeric types.
This section looks at several common mistakes Python programmers make when
working with floating point numbers, which often occur in places that are unlikely to raise
immediate suspicion among developers who have not yet been bitten by them.
To be precise about things, a NaN is not a value, but rather a family of values
marked by a particular bit pattern. There are 16,777,214 32-bit NaNs, and
9,007,199,254,740,990 64-bit NaNs. Moreover, within this vast space of
possible NaNs, the bit patterns are split evenly between signaling and quiet
NaNs.
This is a lot of complication that the original designers of IEEE-754,
including main architect William Kahan, hoped would be utilized to carry
payloads describing the circumstances within a computation where a
non-representable floating point value arose. As things panned out, no widely
used computing system makes any use of the many NaN values, and all
basically treat every NaN as the same value.
Let’s dig a bit deeper into the weeds. These details apply to many concerns. A floating
point number is represented as a sign bit, followed by an exponent, followed by a
mantissa.4 A NaN is simply a bit pattern that has all of its exponent bits set to 1s, along
with at least one nonzero mantissa bit (an all-zeros mantissa would instead represent an
infinity). Whatever else happens to occur in the mantissa, it remains a NaN and is treated
in the same special manner. Let's see this in action.
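A quick look at the bits of the standard NaN (the 11 exponent bits after the sign bit are all set):

import math
import struct

bits = struct.unpack("<Q", struct.pack("<d", math.nan))[0]
print(f"{bits:064b}")
# 0111111111111000000000000000000000000000000000000000000000000000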
Python, like almost all other programming languages, simply uses one of the millions
(or quadrillions) of possible NaNs, regardless of how we arrived at it. The leading 1 in the
mantissa technically makes it a quiet (rather than signaling) NaN, but no Python library I know of pays any
attention to this fact. It’s technically possible in Python to construct a NaN with a different
bit pattern, but it requires arcana from the struct or ctypes modules; no “normal
Python” operation will do this, nor pay any attention to the special value.
Let’s try a comparison that at first, certainly, seems like it should succeed.
The bit patterns stored for the NaNs in a and b are, in fact, identical. Likewise for the
other mentioned mathematical constants. The problem is that according to IEEE-754, no
NaN will ever compare as equal to another, even if their payloads are identical:
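We can verify both halves of that claim:

>>> import struct
>>> struct.pack("<d", a) == struct.pack("<d", b)    # identical bits
True
>>> a == b                                          # but never equal
False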
1 Do not rely on the ordering of sets, even though this example is coincidentally order
preserving. Dicts preserve insertion order
2 An example where sets do not preserve their “order” during iteration
When we introduce NaNs, it becomes impossible to say what the “mean value” is,
regardless of which approach we are using.
Depending on where elements occur within a list of numbers, the median might be the
lowest number, the highest number, a NaN (with some orderings not shown), or in the
case of statistics.median_grouped(), some number that incorporates a fairly
meaningless mean into the overall result (for strictly non-NaN values, this “value between
the elements” is sometimes quite useful).
There are basically two ways we might go about fixing this. I have argued on the
python-ideas mailing list that these functions should grow an optional argument to clarify
behavior; I’ve never completely convinced the module’s main author. One approach is to
introduce NaN propagation, the other is to introduce NaN stripping. Notably, these two
approaches are the default behaviors of NumPy and Pandas, respectively (but varying from
each other).
Fortunately, choosing either of these behaviors is easy to achieve in Python; you just
have to remember to do it.
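For NaN propagation, a ternary clause such as the following works (it assumes any NaN in the data is the math.nan object itself):

import math
import statistics

data = [2.5, math.nan, 1.0]
result = math.nan if math.nan in data else statistics.median(data)  # nan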
Notice that this ternary clause works because in will do an identity check before it tries
an equality test. As we’ve seen, nan != nan, so the fact it checks this way is necessary for
the suggestion to work. Here is the code for the latter option.
Associativity and distributivity do not always fail, of course. We see examples in the
preceding code block of these properties both holding and failing, all with very ordinary
numbers. However, predicting exactly which series of operations will preserve exact
equality and which will not is exceedingly subtle.
The solution to this Gordian Knot, of course, is not to understand all the rounding
errors in a computation that might consist of thousands or millions of floating point
operations, but rather to settle for “plausible equality.” For what it’s worth, the problem
gets even more tangled in the face of concurrency, wherein you may not even be able to
predict the order in which operations are performed.
Both Python’s math.isclose() and NumPy’s numpy.isclose() provide such
plausible answers.
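For example:

>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False
>>> import math
>>> math.isclose((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))
True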
I had already previously commented in a 2003 book that “If you think you
understand just how complex IEEE 754 math is, you are not yet aware of all of
its subtleties.” In that ancient text, I noted that my friend, colleague, and
erstwhile professor of numeric computing, Alex Martelli, had written:
Anybody who thinks he knows what he’s doing when floating point
is involved is either naive, or Tim Peters (well, it could be W. Kahan
I guess).
Tim Peters (after whom "Timsort," the sorting algorithm used in Python and
in many other modern programming languages, is named), replied:
I find it’s possible to be both (wink). But nothing about fp comes
easily to anyone, and even Kahan works his butt off to come up
with the amazing things that he does.
Peters illustrated further by way of Donald Knuth (The Art of Computer
Programming, 3rd edition, Addison-Wesley, 1997: 229):
Many serious mathematicians have attempted to analyze a
sequence of floating point operations rigorously, but found the
task so formidable that they have tried to be content with
plausibility arguments instead.
We are able to say at this point that the module statistics does a good job of
averaging, and both the arithmetically obvious hand-rolled approach and NumPy do a
much worse job (that is, they are flatly wrong).
I suppose we might stop at commenting “solve your problem by using statistics.”
This is not terrible advice for those operations that statistics includes, and where the
operation assumes samples without an inherent order. We’ve seen in another puzzle that
this list does not include statistics.median() in the presence of NaNs. But for mean,
geometric mean, harmonic mean, mode, multimode, quantiles, standard deviation,
variance, linear regression, population variance, covariance, and a few other operations,
the advice is sound.
Let’s look deeper into this quandary. The underlying problem is that the structure of
floating point numbers, with a sign, an exponent, and a mantissa, causes the distribution of
representable numbers to be uneven. In particular, the gap between one floating point
number and the next representable floating point number can be larger than the entire
magnitude of another number in a sample collection.
1 The second argument indicates which direction to go for this “next” float.
Pedantically, any floating point number will work there, but in most cases positive or
negative infinity is used.
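For example, near 1e20 the spacing between representable floats is 16384:

>>> import math
>>> math.nextafter(1e20, math.inf) - 1e20
16384.0
>>> 1e20 + 1.0 == 1e20
True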
Since the gap between the closest floating point numbers is more than 1.0 in the region
of 1e20, adding or subtracting 1.0 can have no effect. The best representation remains the
number we started with. In fact, this example is based around the 64-bit floating point
numbers native to the system I am writing on; the problem is much worse for 32-bit
floating point numbers, and absurdly terrible for 16-bit numbers.
Although the heuristic I provide is worth keeping in mind, it is not what the
statistics module does. Instead (simplifying a bit), that module first
converts all numbers to numerically exact (but much slower) Fractions, then
down-converts back to floats as needed.
There is, unfortunately, currently no direct way of doing the equivalent stabilization in
NumPy. You can, of course, convert a 1-D NumPy array to a Python list, and back again,
but at the cost of orders-of-magnitude slowdowns.
Floating point numbers are both ubiquitous and enormously useful as approximations
of Real numbers from mathematics. However, since finite computers cannot truly
represent the Real number line completely, many compromises and errors are introduced
by this approximation. Python developers, like nearly all software developers, should stay
attuned to the places where the inevitable errors (in a mathematical sense) become mistakes
in a programming sense.
5. There are diverse methods of accounting for leap years in actual use by different lenders, and permitted by
different regulatory jurisdictions. An example of how complex this can be is discussed at the JS Coats Capital
LLC page “Interest Calculation Methods” (https://jscoats.com/interest-calculation-methods/).
However, the regulatory jurisdiction the bank falls under specifies the following rules.
. The daily balance must be stored internally in tenths-of-a-cents. These daily internal
balances must round fractional balance half-a-hundredth amounts to an even number
of tenths.
. The customer-available daily balance is based on the internal balance, but when
rounding half-a-tenth, the cents are rounded down.
. Exact daily records are audited by the regulatory agency on a recurring basis, and
errors face stiff fines.
We know that 64-bit floating point numbers have a great deal more precision than
these required amounts. A 64-bit float carries approximately 17 decimal digits of precision
with it, which is certainly quite a lot more than the two or three digits (cents or tenths)
that we apparently need. Perhaps we can get by with a program that simply stores
approximations as needed.
import sys

if __name__ == '__main__':
    deposit = float(sys.argv[1])
    interest_rate = float(sys.argv[2]) / 100
    print_daily_balance(deposit, interest_rate)
Day | Available | Internal
3 | 500.11 | 500.108
4 | 500.16 | 500.162
5 | 500.22 | 500.216
6 | 500.27 | 500.271 # 1
... | ... | ...
360 | 519.81 | 519.807 # 1
361 | 519.86 | 519.863 # 1
362 | 519.92 | 519.919 # 2
363 | 519.98 | 519.975 # 2
364 | 520.03 | 520.032 # 1
365 | 520.09 | 520.088 # 2
The divergence between the actually correct calculation and the purely floating point
one occurs slowly, and numeric error is hardly overwhelming. If these were scientific
calculations—or even if they were predictive models within finance—these numeric
divergences would be trivial. Under laws, treaties, and regulatory rules they are not,
however.
The decimal module treats decimal arithmetic correctly, including precision and
rounding rules. There are separate available rounding modes for ROUND_CEILING,
ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP,
ROUND_UP, and ROUND_05UP. The code that solves this specific problem utilizes two of
these, and would produce slightly different (wrong) results if it had not chosen exactly the
two it does.
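For this scenario, the two roundings might be applied along these lines (the balance values here are hypothetical):

from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_DOWN

internal = Decimal("500.1085").quantize(
    Decimal("0.001"), rounding=ROUND_HALF_EVEN)    # tie -> even: 500.108
available = Decimal("500.115").quantize(
    Decimal("0.01"), rounding=ROUND_HALF_DOWN)     # half -> down: 500.11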
import sys
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_DOWN

if __name__ == '__main__':
    deposit = Decimal(sys.argv[1])
    interest_rate = Decimal(sys.argv[2]) / 100
    print_daily_balance(deposit, interest_rate)
understanding the rounding and precision rules that are imposed by regulatory or
administrative concerns. The exact details of solutions will vary based on those concerns,
but the decimal module provides options for all such widely used rules.
6. The phrase “garden of forking paths” is borrowed from the translated title of Jorge Luis Borges’ wonderful
1941 short story “El Jardín de Senderos que se Bifurcan”.
Let’s take a look at an abstract base class. For this purpose, we wish to create
a class called BigNumber whose instances “contain” all big numbers (of all
comparable numeric types, in fact). The class doesn’t do much else, nor much
useful, but virtually contains all the big numbers (here defined as more than
one thousand):
>>> class BigNumber:
... def __contains__(self, value):
... return value > 1000
...
>>> big_numbers = BigNumber()
>>> 5.7 in big_numbers
False
>>> 1_000_000 in big_numbers
True
The abstract classes within the module numbers are “virtual parents” of various
concrete numeric classes. However, the particular parent–child relationships that exist
virtually are not necessarily the ones that make obvious sense.
Estranged children
>>> from fractions import Fraction
>>> from decimal import Decimal
>>> frac = Fraction(1, 1)
>>> dec = Decimal("1.0")
>>> 1 == 1.0 == Fraction(1, 1) == Decimal("1.0") # 1
True
>>> (1.0).as_integer_ratio() # 2
(1, 1)
>>> (0.3).as_integer_ratio() # 2
(5404319552844595, 18014398509481984)
Decimal numbers mostly refuse to engage in operations with other kinds of numbers,
but make an exception for integers. One might question this decision by Python since
decimal.Decimal already carries a specific precision, and could simply round even if a
result would be inexact, but the decision isn’t obviously wrong.
What seems worse is the tendency of float to take over almost anything else it
interacts with.
>>> f"{0.3:.17f}" # 1
'0.29999999999999999'
>>> Fraction(*(0.3).as_integer_ratio()) # 2
Fraction(5404319552844595, 18014398509481984)
>>> Fraction(*(0.3).as_integer_ratio()).limit_denominator(1000)
Fraction(3, 10)
>>> Fraction(*(0.3).as_integer_ratio()).limit_denominator(9)
Fraction(2, 7)
want in the abstract. In simple examples like that shown, the choice seems obvious; but
there is no mechanism to provide that in a completely general way.
Casting Down
Rather than allow floats to annex all the results from operations that combine Fraction
with float, we could create a custom class to do the reverse. Yes, we might need to
consider periodic approximation with Fraction.limit_denominator(), but the
rounding would be our explicit choice. For example, let’s start with this:
1 The Python shell performs some “friendly” rounding in its display, so we might
mistakenly think it is producing an exact result.
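A minimal sketch of such a class, sufficient for the behavior discussed (the original listing surely differed in details):

from fractions import Fraction

class Ratio(Fraction):
    def __add__(self, other):
        if isinstance(other, float):
            other = Fraction(*other.as_integer_ratio())
        return Ratio(Fraction(self) + other)

    def __repr__(self):
        return f"Ratio({self.numerator}, {self.denominator})"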
We’ve moved in the right direction. The Ratio class can cause addition with a float to
maintain a Ratio. However, we lost commutativity in the process. That was an easily
rectified oversight:
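Adding one line suffices:

    __radd__ = __add__      # added inside the Ratio class body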
>>> 0.3 + Ratio(3, 10)
Ratio(54043195528445951, 90071992547409920)
>>> Ratio(3, 10) + 0.3
Ratio(54043195528445951, 90071992547409920)
The problem is that we’ve only handled addition this way. Other operators obviously
exist as well:
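For example, multiplication still decays to float:

>>> type(Ratio(3, 10) * 0.3)
<class 'float'>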
Adding a full suite of dunder methods for all the operators would be straightforward,
merely slightly tedious. Start with .__mul__() and .__rmul__() and work your way
through the rest in similar fashion.
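A sketch of that pattern for multiplication, mirroring the illustrative addition above (0.5 is exactly representable in binary, so the result is clean):

>>> def _mul(self, other):
...     if isinstance(other, float):
...         other = Fraction(*other.as_integer_ratio())
...     return Ratio(Fraction(self) * other)
...
>>> Ratio.__mul__ = Ratio.__rmul__ = _mul
>>> Ratio(1, 2) * 0.5
Ratio(1, 4)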
Choosing the right numeric datatype is often a relevant concern for Python programmers,
and happily Python provides several options beyond floating point. Creating custom classes
with more specialized numeric datatypes is also relatively easy in Python, although only
occasionally needed.
9.3 Wrapping Up
In this chapter, we looked at a variety of “obvious” numeric operations that can go wrong,
both within the numeric domain of floating point numbers and in other domains. There are
numerous assumptions that developers tend to make about numbers—perhaps because we
learned about how they should work in elementary school or middle school—that become
less obvious when we write actual computer programs.
Appendix A: Topics for Other Books
There are, of course, many domains of Python programming in which you might make
mistakes, or face trade-offs. Not all such topics can fit within this book, but those not
touched on are not necessarily less important. For the most part, those topics not addressed
are simply big enough to warrant their own full books.
A.2 Concurrency
At some point during development of this book, I wanted to discuss a number of
concurrency pitfalls and trade-offs. It’s a big topic, though; too big for a chapter or two of
this moderately sized book to treat adequately. The fact that this is not the book for
discussion of concurrency does not diminish the importance of the topic, nor suggest that
the choices you make are not subject to pitfalls and trade-offs.
Within Python itself, there are three main approaches to concurrency. Python’s standard
library has a threading module to work with threads. Threads are famously subject to
gotchas like deadlocks, race conditions, priority inversion, data structure corruption, and
other glitches. Moreover, within pure Python, threading does not enable CPU parallelism
because of the infamous GIL (global interpreter lock). That said, many third-party
modules “release the GIL” and allow true parallelism.
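A minimal sketch of the kind of care threads demand; without the lock, the read-modify-write of this (hypothetical) counter can interleave across threads and lose updates:

import threading

counter = 0
lock = threading.Lock()

def increment(n=100_000):
    global counter
    for _ in range(n):
        with lock:  # guard the read-modify-write sequence
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # reliably 400000 because of the lock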
Python’s standard library also contains the multiprocessing module, which is largely
similar in API to threading, but works with processes rather than threads. This module
provides a means of running parallel tasks on multiple CPU cores, but is constrained both by the inability to share data directly among processes and by being “heavier weight.” In
general, in order to communicate, processes require message passing, usually by means of
pipes and queues (which are available in the multiprocessing module).
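A sketch of that message-passing style (the squaring worker is hypothetical; the __main__ guard is required on platforms that spawn rather than fork):

import multiprocessing as mp

def square(numbers, results):
    for n in numbers:
        results.put(n * n)  # communicate via the queue, not shared state

if __name__ == '__main__':
    results = mp.Queue()
    proc = mp.Process(target=square, args=([1, 2, 3], results))
    proc.start()
    proc.join()
    for _ in range(3):
        print(results.get())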
A useful and higher-level abstraction for both threading and multiprocessing is the
concurrent.futures module of the standard library. Many problems can be expressed more easily and more safely using the “futures” abstraction; where it applies, it is usually the simplest route to concurrency.
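For instance, a minimal sketch (the URL-measuring helper is hypothetical; substituting ProcessPoolExecutor would move the same code from threads to processes):

from concurrent.futures import ThreadPoolExecutor

def fetch_length(url):
    # Stand-in for real I/O-bound work, such as an HTTP request
    return len(url)

urls = ["https://python.org", "https://pypi.org"]
with ThreadPoolExecutor(max_workers=2) as pool:
    for length in pool.map(fetch_length, urls):
        print(length)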
A third abstraction in Python for concurrency is asynchronous programming with
coroutines. This is supported via the async and await keywords, and is managed by the
asyncio standard library module, or by third-party async event loops such as uvloop
(https://uvloop.readthedocs.io/), Trio (https://trio.readthedocs.io/en/stable/), Curio
(https://curio.readthedocs.io/en/latest/), and Twisted (https://twisted.org/). The general
idea behind coroutines is that async functions can yield control in the middle of their
operations, allowing an event loop to give attention to other coroutines within the same
thread and process. This is typically useful when operations are I/O bound (since I/O is
generally several orders of magnitude slower than CPU-bound computation).
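A small sketch of the idea; because each coroutine yields control at its await, the two sleeps overlap within a single thread:

import asyncio

async def greet(name, delay):
    await asyncio.sleep(delay)  # yield to the event loop during the wait
    print(f"Hello, {name}")

async def main():
    # Both coroutines run concurrently: total time is about 1 second, not 2
    await asyncio.gather(greet("world", 1), greet("asyncio", 1))

asyncio.run(main())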
The official Python documentation contains a good discussion of many of the trade-offs among different approaches to concurrency (see https://docs.python.org/3/library/concurrency.html to get started).
A.3 Packaging
A large part of the Python ecosystem is about packaging software for distribution. Actually,
pretty much the same is true of every programming language. When you write software,
whether executables, libraries, or other systems, you usually wish to share your work with
other users and developers.
For some languages newer than Python, the design of the language itself was
simultaneous with the design of its packaging system. So, for example, users of the Go
programming language will use go get … to install packages. Users of Rust will use
cargo and rustup. In Julia, it is a combination of using Pkg; Pkg.add(…). In R, it’s
generally install.packages(…). For these languages, there is one and only one
way to install a package, and pretty much exactly one way to publish the packages or tools
you have created. Other languages like Ruby have mostly congealed around gem, and
JavaScript is split between npm and yarn, but the two are mostly compatible.
Python is not as old as C, or Fortran, or even Perl, Bash, Haskell, or Forth. All of those
have, arguably, a greater disarray around packaging than Python does. But Python is pretty
old, having started in 1991, and going through numerous not-quite-compatible packaging
and installation systems over that time, while starting relatively late on putting serious
effort into this aspect. Over the past five to ten years, Python packaging has become solid
and relatively stable, but a number of competing tools remain, as do a number of package
formats.
Wheels are supported and endorsed by the Python Packaging Authority (https://www.pypa.io/en/latest/), but so are sdist archives for source-only packages. Numerous tools for
creating wheels are largely, but not entirely, compatible with one another. Conda packages
use a different format and a different build system, but allow completely non-Python
packages to be created, distributed, and installed. A large number of tools allow creation of
platform-specific executables for Python, including often the native packaging system of
operating system distributions. Moreover, especially with so much software being deployed
in “cloud native” or at least “cloud-centric” ways now, containerization, such as with
Docker and Kubernetes, has become a popular alternative approach as well.
This book simply does not attempt to address software packaging, but rather
recommends that you read some of the excellent online material on the topic, starting
with https://packaging.python.org/en/latest/overview.
A.4 Type Checking
The Python documentation on type hints is a good place to start if this topic interests you: https://docs.python.org/3/library/typing.html. The mypy project (https://mypy.readthedocs.io/) is the tool closest to an “official” type-checking tool for Python. The Pyre project (https://pyre-check.org/) is a popular type-checking tool, and is especially useful for large codebases. Pyright (https://microsoft.github.io/pyright) and Pytype (https://google.github.io/pytype/) likewise serve similar purposes. The PyCharm IDE (https://www.jetbrains.com/pycharm/) has excellent support for type checking and type inference, and is worth considering if you are looking for a Python IDE.
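A tiny illustration of what these checkers verify (a hypothetical function; a tool such as mypy flags the second call statically, before the code ever runs):

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])  # passes the type checker
mean("not a list")     # flagged: argument type is incompatible with list[float]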