This is a follow-up to Appreciating Python's match-case by parsing Python code. In that first post I discussed my adventures with Python's structural pattern matching.
I try not to work late at night, but I had an insight a few weeks ago that kept me on my computer long past my bedtime. I thought "could I make an automatic dataclass to regular class converter?" Spoilers: the answer was yes, by writing code that made sloppy-but-good-enough assumptions.
Pause. I want to note that I don't dislike dataclasses. I think dataclasses are great and I teach them often in my Python training sessions. In fact, it may seem counter-intuitive, but I'm hoping a dataclass remover will help me teach dataclasses more effectively.
When I teach Python's dataclasses I often show a "before" and "after", like an infomercial for a cleaning product.
Seeing the equivalent code for a dataclass helps us appreciate what dataclasses do for us. This process really drives home the point that dataclasses make friendly-to-use classes with less boilerplate code.
So I was up late writing a dataclass to regular class converter, which started as a script and eventually turned into a WebAssembly-powered web app. During this process I used some interesting Python features that I'd like to share.
How does the undataclass.py script work? Essentially, the undataclass.py script:
- parses Python code into an abstract syntax tree (using the ast module)
- finds each dataclass, along with its fields, its decorator options, and any __post_init__ methods
- generates the equivalent regular-class code for each dataclass it finds
I used some tricks I don't usually get to use in Python. I used:
- hairy match-case blocks which replaced even hairier if-elif blocks
- the textwrap.dedent utility, which I feel should be more widely known & used
- the ast module's unparse function to convert an abstract syntax tree into Python code
Let's take a look at some of the code.
Python's structural pattern matching can match iterables based on their length and content.
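If you haven't seen sequence patterns before, here's a tiny made-up sketch first (the describe function and its command lists are hypothetical, not from undataclass.py or Django):
def describe(command):
    """Describe a small command list (illustrative only)."""
    match command:
        case []:
            return "empty command"
        case ["help"]:
            return "help requested"
        case ["move", x, y]:
            return f"move to ({x}, {y})"
        case [verb, *rest]:
            return f"{verb} with {len(rest)} argument(s)"
print(describe(["move", 3, 4]))  # move to (3, 4)
print(describe(["quit"]))        # quit with 0 argument(s)
Each case matches a list of a specific length: literal strings like "help" must match exactly, while bare names like x and y capture whatever sits in that position.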
With structural pattern matching (using the match-case syntax added in Python 3.10) we can turn this code (from Django):
def do_get_available_languages(parser, token):
    args = token.contents.split()
    if len(args) != 3 or args[1] != "as":
        raise TemplateSyntaxError(
            f"'get_available_languages' requires 'as variable' (got {args})"
        )
    return GetAvailableLanguagesNode(args[2])
Into this:
def do_get_available_languages(parser, token):
    match token.contents.split():
        case [_, "as", variable]:
            return GetAvailableLanguagesNode(variable)
        case args:
            raise TemplateSyntaxError(
                f"'get_available_languages' requires 'as variable' (got {args})"
            )
Python's match-case also allows for deep type checking and attribute content assertions, which allowed me to turn this:
if isinstance(node, ast.Call):
    if (isinstance(node.func, ast.Attribute)
            and node.func.value.id == "dataclasses"
            and node.func.attr == "dataclass"):
        return True
    elif node.func.id == "dataclass":
        return True
elif (isinstance(node, ast.Attribute)
        and node.value.id == "dataclasses"
        and node.attr == "dataclass"):
    return True
elif isinstance(node, ast.Name) and node.id == "dataclass":
    return True
else:
    return False
Into this:
match node:
    case ast.Call(
        func=ast.Attribute(
            value=ast.Name(id="dataclasses"),
            attr="dataclass",
        ),
    ):
        return True
    case ast.Call(func=ast.Name(id="dataclass")):
        return True
    case ast.Attribute(
        value=ast.Name(id="dataclasses"),
        attr="dataclass"
    ):
        return True
    case ast.Name(id="dataclass"):
        return True
    case _:
        return False
Python's match-case statements tend to be very complex, but they're also much less visually dense than an equivalent if statement.
I ultimately ended up using 7 match-case statements throughout the undataclass.py script, each of which replaced an even more complex if-elif statement.
If you're interested in how I used match-case during this adventure, see part one of this two-part post on how I used match-case to parse Python code while writing undataclass.py.
While looping through the AST nodes in a dataclass, I needed to keep track of where the new methods (__init__, __repr__, __eq__, etc.) should be inserted.
It seemed most appropriate that these would be the first function definitions in our class, which means we'd insert these methods just before the first function definition we discovered.
Once I decided on my location-to-insert-methods, I needed a placeholder value to keep track of that location because I wouldn't actually have the methods-to-be-inserted until later on. But which value to use?
Objects that act as placeholders are often called "sentinel values".
A sentinel value is useful for indicating something that isn't real data.
In Python, the most common sentinel value is None.
But you can also invent your own sentinel values in Python.
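For example, here's a small made-up sketch (the MISSING and lookup names are hypothetical, not from undataclass.py) of the usual trick: call object() once and compare against it with is:
MISSING = object()  # a unique object: no other object can ever be this exact object
def lookup(mapping, key, default=MISSING):
    """Like dict.get, but raise KeyError when no default was given (illustrative only)."""
    if key in mapping:
        return mapping[key]
    if default is MISSING:  # distinguishes "no default passed" from default=None
        raise KeyError(key)
    return default
print(lookup({"a": 1}, "b", default=None))  # prints None (an explicitly-passed default)
Because nothing else is that exact object, an is check against the sentinel can never be fooled by real data (not even None).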
None didn't feel like an appropriate placeholder value to represent "the place dataclass-equivalent methods should go", so instead I made my own sentinel value.
I called object() to make a completely unique placeholder and then pointed the DATACLASS_STUFF_HERE variable to it (and yes, that variable name isn't great):
DATACLASS_STUFF_HERE = object()
Then I stuck that unique placeholder object in a new_body list, which I used to store all the new nodes that would overwrite the original nodes from the old dataclass:
match node:
    case ast.FunctionDef():
        if DATACLASS_STUFF_HERE not in new_body:
            new_body.append(DATACLASS_STUFF_HERE)
        if node.name == "__post_init__":
            post_init = node.body
        else:
            new_body.append(node)
But where did I replace this placeholder object with something useful? That's where slice assignment came in.
We'll get to slice assignment later. First let's talk about generating the AST nodes for those new methods.
My make_dataclass_methods function accepted the class name, the options provided to the dataclass decorator, the dataclass fields found, and a list of the AST nodes found in the __post_init__ method (if any).
This function then returned a list of AST nodes that represented the new methods we needed (__init__, __repr__, etc.).
dataclass_extras = make_dataclass_methods(
    dataclass_node.name,
    options,
    fields,
    post_init,
)
This make_dataclass_methods function is essentially a big chain of if statements which checked certain scenarios related to our dataclass options:
def make_dataclass_methods(class_name, options, fields, post_init):
    """Return AST nodes for all new dataclass attributes and methods."""
    nodes = []
    kw_only_fields = process_kw_only_fields(options, fields)
    init_fields, init_vars = process_init_vars(fields)
    if options.get("slots", False):
        nodes += ast.parse(make_slots(fields)).body
    if options.get("match_args", True):
        nodes += ast.parse(make_match_args(fields)).body
    if options.get("init", True):
        nodes += ast.parse(make_init(
            init_fields,
            post_init,
            init_vars,
            options.get("frozen", False),
            kw_only_fields,
        )).body
    if options.get("repr", True):
        nodes += ast.parse(make_repr(fields)).body
    if options.get("eq", True):
        nodes += ast.parse(make_order("==", class_name, fields)).body
    if options.get("order", False):
        nodes += ast.parse(make_order("<", class_name, fields)).body
    if (options.get("frozen", False) and options.get("eq", True)
            or options.get("unsafe_hash", False)):
        nodes += ast.parse(make_hash(fields)).body
    if options.get("frozen", False):
        nodes += ast.parse(make_setattr_and_delattr()).body
    if options.get("slots", False):
        nodes += ast.parse(make_setstate_and_getstate(fields)).body
    return nodes
This acts like a restaurant menu: it figures out which features we want and then gives us the AST nodes representing those features. It asks questions like this:
- slots=True set? Great, add nodes for __slots__.
- repr not set to False? Great, add nodes for __repr__.
- order set to True? Great, add nodes for __lt__.
Note that in each of these if statements we have a line that looks like this:
nodes += ast.parse(make_SOMETHING_OR_OTHER()).body
That make_SOMETHING_OR_OTHER function returns a string representing Python code.
Once we get that string, we use ast.parse to parse it and then grab the body attribute from the resulting node to get its subnodes.
We then use += to extend our nodes list with these new subnodes.
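If that ast.parse(...).body pattern is new to you, here's a tiny standalone example (not from undataclass.py): ast.parse returns a module node whose body attribute is a list of statement nodes.
import ast
tree = ast.parse("x = 3\nprint(x)")  # parsing a string of code gives an ast.Module node
print(type(tree).__name__)           # Module
print([type(node).__name__ for node in tree.body])  # ['Assign', 'Expr']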
If we used pdb to inspect the nodes list just before we return from this function, we might see something like this:
> undataclass.py(307)make_dataclass_methods()
-> return nodes
(Pdb) pp nodes
[<ast.Assign object at 0x7f2c03307070>,
<ast.FunctionDef object at 0x7f2c03307bb0>,
<ast.FunctionDef object at 0x7f2c03307eb0>,
<ast.FunctionDef object at 0x7f2c03307fa0>]
(Pdb) nodes[1].name
'__init__'
(Pdb) pp nodes[1].body
[<ast.Assign object at 0x7f2c03307b20>,
<ast.Assign object at 0x7f2c03307970>]
(Pdb) ast.unparse(nodes[1].body[0])
'self.x = x'
That second node (nodes[1]) represents an __init__ function.
So each of our make_SOMETHING_OR_OTHER functions needs to generate Python code.
But how do they do that?
Messily.
Each of the make_SOMETHING_OR_OTHER functions essentially made strings that represent bits of code and then glued those strings together with f-strings and the string join method.
Have you ever written strings that represent Python code from within Python? No? That's probably a good thing! This part was unavoidably very messy.
For example, here's the code that generates __slots__ (if slots=True is set):
def attr_name_tuple(fields):
    """Return code for a tuple of all field names (as strings)."""
    joined_names = ", ".join([
        repr(f.name)
        for f in fields
    ])
    if len(fields) == 1:  # Single item tuples need a trailing comma
        return f"({joined_names},)"
    else:
        return f"({joined_names})"

def make_slots(fields):
    """Return code of __slots__."""
    return f"__slots__ = {attr_name_tuple(fields)}"
This built up and returned a single line of Python code (that __slots__ string below):
>>> from types import SimpleNamespace as field
>>> from undataclass import make_slots
>>> make_slots([field(name="x"), field(name="y")])
"__slots__ = ('x', 'y')"
Here's the code that builds up our __repr__ method:
def make_repr(fields):
    """Return code for the __repr__ method."""
    repr_args = ", ".join([
        f"{f.name}={{self.{f.name}!r}}"
        for f in fields
        if f.repr
    ])
    return dedent("""
        def __repr__(self):
            cls = type(self).__name__
            return f"{{cls}}({repr_args})"
    """).format(repr_args=repr_args)
That make_repr function returns a string that represents the Python code needed for a friendly __repr__ method:
>>> from types import SimpleNamespace as field
>>> from undataclass import make_repr
>>> print(make_repr([field(name="x", repr=True), field(name="y", repr=True)]))
def __repr__(self):
    cls = type(self).__name__
    return f"{cls}(x={self.x!r}, y={self.y!r})"
Note that the returned code isn't indented, even though the multi-line string we wrote to generate this code is indented.
The magic here is in the textwrap.dedent utility.
Python's textwrap.dedent was super helpful for generating all the needed Python code.
Without that dedent call above, the output would look like this:
        def __repr__(self):
            cls = type(self).__name__
            return f"{cls}(x={self.x!r}, y={self.y!r})"
Instead of like this:
def __repr__(self):
cls = type(self).__name__
return f"{cls}(x={self.x!r}, y={self.y!r})"
I use dedent in lots of my own code that involves multi-line strings, and many Python Morsels exercises include solutions that use dedent.
If you ever need to remove indentation from a multi-line string in Python, I highly recommend taking a look at textwrap.dedent.
You can see a quick demo of textwrap.dedent in action below.
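It's a standalone sketch (not from undataclass.py) of dedent stripping the common leading whitespace from an indented multi-line string:
from textwrap import dedent
text = """
    def greet(name):
        print(f"Hello, {name}!")
"""
print(dedent(text))
The common four spaces of leading whitespace are removed from every line, while the relative indentation inside the function is kept.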
Even with dedent, we can't be saved from messy code here.
Code that generates Python code isn't pretty by its very nature.
But keep in mind that the alternative would have been creating lots of AST nodes manually.
Writing Python code within strings and then using ast.parse to parse those strings made for much more readable code.
After we call that make_dataclass_methods function, we'll have a list of AST nodes (pointed to by a dataclass_extras variable):
dataclass_extras = make_dataclass_methods(
    dataclass_node.name,
    options,
    fields,
    post_init,
)
What do we do with that list?
Remember that DATACLASS_STUFF_HERE sentinel value we used as a placeholder?
We need to replace it with all the nodes in our dataclass_extras list now.
We can use slice assignment to do that:
if DATACLASS_STUFF_HERE in new_body:
    index = new_body.index(DATACLASS_STUFF_HERE)
    new_body[index:index+1] = dataclass_extras
else:
    new_body += dataclass_extras
dataclass_node.body = new_body
If DATACLASS_STUFF_HERE was not in our new_body list, then we add all the nodes to the end of our list.
But if DATACLASS_STUFF_HERE was in our new_body list, then we find its position and replace it with all those new AST nodes we made.
We're doing that through slice assignment.
Did you know you can assign to a slice in Python? It's a somewhat strange thing to see, but it's super helpful during those rare times that it's useful:
>>> numbers = [2, 1, 11, 18]
>>> numbers[1:1] = [3, 4, 7]
>>> numbers
[2, 3, 4, 7, 1, 11, 18]
>>> numbers[2:] = [29, 47]
>>> numbers
[2, 3, 29, 47]
Note that I could have instead made a new list using slicing:
if DATACLASS_STUFF_HERE in new_body:
    index = new_body.index(DATACLASS_STUFF_HERE)
    new_body = new_body[:index] + dataclass_extras + new_body[index+1:]
else:
    new_body += dataclass_extras
dataclass_node.body = new_body
But that's not quite as fun, is it? 😜
Yes this adventure resulted in a useful tool for teaching dataclasses, but my primary motivation was to have fun doing something I don't normally get to do.
Now that we've modified each of our dataclass AST nodes to un-dataclass them, how do we generate the Python code that our abstract syntax tree represents?
We can use ast.unparse for that!
return ast.unparse(new_nodes)
When we inspected the nodes list generated by the make_dataclass_methods function earlier, if we'd called ast.unparse on that nodes list, we might have seen something like this:
(Pdb) print(ast.unparse(nodes))
__match_args__ = ('x', 'y')
def __init__(self, x: float, y: float) -> None:
    self.x = x
    self.y = y
def __repr__(self):
    cls = type(self).__name__
    return f'{cls}(x={self.x!r}, y={self.y!r})'
def __eq__(self, other):
    if not isinstance(other, Point):
        return NotImplemented
    return (self.x, self.y) == (other.x, other.y)
The unparse function accepts a tree of AST nodes and returns the Python code that those nodes represent.
Neat, huh?
The big downside to using ast.unparse is that we lose the original formatting of our code.
How many blank lines did we use?
How did we wrap our code?
And were there code comments?
We lose all of that!
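Here's a tiny standalone illustration (not from undataclass.py) of that loss: comments, blank lines, and line wrapping aren't stored in the abstract syntax tree, so they can't survive a parse/unparse round trip.
import ast
code = """
x = 1  # the answer, minus 41

y = (
    2 + 3
)
"""
print(ast.unparse(ast.parse(code)))
That prints just x = 1 and y = 2 + 3: the comment, the blank line, and the extra parentheses are all gone.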
But this tool isn't meant to generate exactly the replacement code we need.
The generated code is meant to be an example of what a non-dataclass version would look like.
For that purpose, ast.unparse is certainly good enough.
The code I ended up writing was very sloppy in the assumptions it made.
This dataclass converter isn't intended to automatically turn every possible dataclass into a fully functional regular class. That task simply isn't possible: some limits need to be set.
I decided that I would make fairly reasonable assumptions about the ways dataclasses are typically written and run with those assumptions. If later on I need to refactor a section that assumed a bit too much, so be it! That's either a problem for my future self or (more likely) a problem I'll never need to worry about.
I think Python's dataclasses are great.
Dataclasses encourage Python programmers to make classes that have a friendly __init__ method, a helpful string representation, and sensible equality checks.
I made this tool to demonstrate what dataclasses do for us.
You could also use this code to actually replace dataclasses, and that's sometimes helpful. All programming abstractions are a trade-off, and sometimes dataclasses become slightly more hassle than they're worth. At that point, you could consider diving deeper into another abstraction (attrs, for example) or creating your class manually by converting your dataclass to a regular class.
Regardless of whether, how, and when you use dataclasses, I hope you learned something from my adventures parsing Python code to turn dataclasses into regular classes (including the first part of this journey on appreciating Python's match-case by parsing Python code). And I hope this journey inspires you to write your own code to sloppily perform silly tasks.
Need to fill in gaps in your Python skills?
Sign up for my Python newsletter where I share one of my favorite Python tips every week.