Normalize JSON #7078

romanzdk · 2023-02-21T21:07:15Z

Problem description

It would be perfect to have similar functionality to pandas json_normalize - https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
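For readers who have not used it, the core behaviour being requested can be sketched in plain Python: nested objects are collapsed into a single level, with child keys joined to their parent by a separator. This is only an illustration of the idea, not how pandas or Polars implement it; `flatten_record` and its parameters are hypothetical names chosen for this sketch.

```python
def flatten_record(record, parent_key="", sep="."):
    """Collapse nested dicts into one level, joining keys with `sep`.

    Hypothetical helper for illustration only; lists are left untouched.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key along
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


records = [
    {"id": 1, "name": "Cole Volk", "fitness": {"height": 130, "weight": 60}},
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
]
rows = [flatten_record(r) for r in records]
print(rows)
```

Each row ends up with keys such as `fitness.height` and `fitness.weight` alongside the top-level ones.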


cmdlineluser · 2023-02-23T14:39:11Z

In case it's useful information: the last 2 examples from the pandas docs already "work" with .read_json():

import io
import polars as pl

df = pl.read_json(io.BytesIO(b"""
[
    {
        "id": 1,
        "name": "Cole Volk",
        "fitness": {"height": 130, "weight": 60}
    },
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
    {
        "id": 2,
        "name": "Faye Raker",
        "fitness": {"height": 130, "weight": 60}
    }
]
"""))

shape: (3, 3)
┌──────┬────────────┬───────────┐
│ id   | name       | fitness   │
│ ---  | ---        | ---       │
│ i64  | str        | struct[2] │
╞══════╪════════════╪═══════════╡
│ 1    | Cole Volk  | {130,60}  │
│ null | Mark Reg   | {130,60}  │
│ 2    | Faye Raker | {130,60}  │
└──────┴────────────┴───────────┘

You can then .unnest() the fitness column:

>>> df.unnest("fitness")
shape: (3, 4)
┌──────┬────────────┬────────┬────────┐
│ id   | name       | height | weight │
│ ---  | ---        | ---    | ---    │
│ i64  | str        | i64    | i64    │
╞══════╪════════════╪════════╪════════╡
│ 1    | Cole Volk  | 130    | 60     │
│ null | Mark Reg   | 130    | 60     │
│ 2    | Faye Raker | 130    | 60     │
└──────┴────────────┴────────┴────────┘

df = pl.read_json(io.BytesIO(b"""
[
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000}
        ]
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337}
        ]
    }
]
"""))

shape: (2, 4)
┌─────────┬───────────┬─────────────────┬─────────────────────────────────────┐
│ state   | shortname | info            | counties                            │
│ ---     | ---       | ---             | ---                                 │
│ str     | str       | struct[1]       | list[struct[2]]                     │
╞═════════╪═══════════╪═════════════════╪═════════════════════════════════════╡
│ Florida | FL        | {"Rick Scott"}  | [{"Dade",12345}, {"Broward",4000... │
│ Ohio    | OH        | {"John Kasich"} | [{"Summit",1234}, {"Cuyahoga",13... │
└─────────┴───────────┴─────────────────┴─────────────────────────────────────┘

Here you can combine .unnest() and .explode():

>>> df.unnest("info").explode("counties").unnest("counties")
shape: (5, 5)
┌─────────┬───────────┬─────────────┬────────────┬────────────┐
│ state   | shortname | governor    | name       | population │
│ ---     | ---       | ---         | ---        | ---        │
│ str     | str       | str         | str        | i64        │
╞═════════╪═══════════╪═════════════╪════════════╪════════════╡
│ Florida | FL        | Rick Scott  | Dade       | 12345      │
│ Florida | FL        | Rick Scott  | Broward    | 40000      │
│ Florida | FL        | Rick Scott  | Palm Beach | 60000      │
│ Ohio    | OH        | John Kasich | Summit     | 1234       │
│ Ohio    | OH        | John Kasich | Cuyahoga   | 1337       │
└─────────┴───────────┴─────────────┴────────────┴────────────┘

Seems like it could be useful if .unnest() had the option to prefix the resulting column names to avoid name clashes.

amzar96 · 2024-05-13T02:57:06Z

DuplicateError: column with name 'name' has more than one occurrences
Agree with @cmdlineluser. There should be an option to add a custom prefix as well.

cmdlineluser · 2024-05-13T07:28:17Z

@amzar96 .name.map_fields() has since been added, which can be used before unnesting.

daviewales · 2024-07-30T12:24:18Z

An example using .name.prefix_fields():

import io
import polars as pl

# au/nsw address data in newline delimited geojson format from openaddresses.io
address_data = io.BytesIO(b"""
{"type": "Feature", "properties": {"hash": "f3370787f7adc06a", "number": "15", "street": "ELLALONG STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336983"}, "geometry": {"type": "Point", "coordinates": [151.4857842, -32.8269847, 0.0]}}
{"type": "Feature", "properties": {"hash": "1c89c6cbf8236bff", "number": "14", "street": "MILLFIELD STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336984"}, "geometry": {"type": "Point", "coordinates": [151.4857532, -32.8264517, 0.0]}}
""")

df = pl.read_ndjson(address_data)
print(df)

# shape: (2, 3)
# ┌─────────┬─────────────────────────────────┬─────────────────────────────────┐
# │ type    ┆ properties                      ┆ geometry                        │
# │ ---     ┆ ---                             ┆ ---                             │
# │ str     ┆ struct[9]                       ┆ struct[2]                       │
# ╞═════════╪═════════════════════════════════╪═════════════════════════════════╡
# │ Feature ┆ {"f3370787f7adc06a","15","ELLA… ┆ {"Point",[151.485784, -32.8269… │
# │ Feature ┆ {"1c89c6cbf8236bff","14","MILL… ┆ {"Point",[151.485753, -32.8264… │
# └─────────┴─────────────────────────────────┴─────────────────────────────────┘

flattened_df = df.with_columns(
  pl.col("properties").name.prefix_fields("properties."),
  pl.col("geometry").name.prefix_fields("geometry."),
).unnest("properties", "geometry")
print(flattened_df)
# shape: (2, 12)
# ┌─────────┬──────────────────┬───────────────────┬───────────────────┬───┬─────────────────────┬───────────────┬───────────────┬───────────────────────────────┐
# │ type    ┆ properties.hash  ┆ properties.number ┆ properties.street ┆ … ┆ properties.postcode ┆ properties.id ┆ geometry.type ┆ geometry.coordinates          │
# │ ---     ┆ ---              ┆ ---               ┆ ---               ┆   ┆ ---                 ┆ ---           ┆ ---           ┆ ---                           │
# │ str     ┆ str              ┆ str               ┆ str               ┆   ┆ str                 ┆ str           ┆ str           ┆ list[f64]                     │
# ╞═════════╪══════════════════╪═══════════════════╪═══════════════════╪═══╪═════════════════════╪═══════════════╪═══════════════╪═══════════════════════════════╡
# │ Feature ┆ f3370787f7adc06a ┆ 15                ┆ ELLALONG STREET   ┆ … ┆                     ┆ 3336983       ┆ Point         ┆ [151.485784, -32.826985, 0.0] │
# │ Feature ┆ 1c89c6cbf8236bff ┆ 14                ┆ MILLFIELD STREET  ┆ … ┆                     ┆ 3336984       ┆ Point         ┆ [151.485753, -32.826452, 0.0] │
# └─────────┴──────────────────┴───────────────────┴───────────────────┴───┴─────────────────────┴───────────────┴───────────────┴───────────────────────────────┘

You can simplify this a bit by defining an expression function:

def prefix_field(field):
    return pl.col(field).name.prefix_fields(f"{field}.")

This lets you change the last bit to:

flattened_df = df.with_columns(
  prefix_field("properties"),
  prefix_field("geometry"),
).unnest("properties", "geometry")
print(flattened_df)

You can simplify this further with the following function, which will automatically unnest every struct column, and prefix the unnested fields with the parent column name:

def flatten(df):
    struct_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.Struct]
    return df.with_columns(*map(prefix_field, struct_cols)).unnest(*struct_cols)

So the last bit becomes:

flattened_df = flatten(df)
print(flattened_df)

Recursively flatten all list and struct fields

And finally, because I couldn't help myself, here's a recursive flatten function, which flattens both lists and structs, and automatically prefixes the flattened fields with the names of the parent columns:

def prefix_field(field):
    """Prefix struct fields with parent column name"""
    return pl.col(field).name.prefix_fields(f"{field}.")

def flatten(df):
    """Flatten one level of struct or list columns, and prefix flattened fields
    with parent column name
    """
    struct_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.Struct]
    list_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.List]

    return df.with_columns(
        *map(prefix_field, struct_cols),
        *map(
            lambda c: pl.col(c).list.to_struct(n_field_strategy="max_width", fields=lambda i: f"{c}.{i}"),
            list_cols,
        ),
    ).unnest(*struct_cols, *list_cols)

def recursively_flatten(df):
    """Recursively flatten list and struct columns"""
    while any(type(dtype) in (pl.Struct, pl.List) for dtype in df.dtypes):
        df = flatten(df)
    return df

Using the example data from above, it works as follows:

import io
import polars as pl

# au/nsw address data in newline delimited geojson format from openaddresses.io
address_data = io.BytesIO(b"""
{"type": "Feature", "properties": {"hash": "f3370787f7adc06a", "number": "15", "street": "ELLALONG STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336983"}, "geometry": {"type": "Point", "coordinates": [151.4857842, -32.8269847, 0.0]}}
{"type": "Feature", "properties": {"hash": "1c89c6cbf8236bff", "number": "14", "street": "MILLFIELD STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336984"}, "geometry": {"type": "Point", "coordinates": [151.4857532, -32.8264517, 0.0]}}
""")

df = recursively_flatten(pl.read_ndjson(address_data))
print(df)
# shape: (2, 14)
# ┌─────────┬──────────────────┬───────────────────┬───────────────────┬───┬───────────────┬────────────────────────┬────────────────────────┬────────────────────────┐
# │ type    ┆ properties.hash  ┆ properties.number ┆ properties.street ┆ … ┆ geometry.type ┆ geometry.coordinates.0 ┆ geometry.coordinates.1 ┆ geometry.coordinates.2 │
# │ ---     ┆ ---              ┆ ---               ┆ ---               ┆   ┆ ---           ┆ ---                    ┆ ---                    ┆ ---                    │
# │ str     ┆ str              ┆ str               ┆ str               ┆   ┆ str           ┆ f64                    ┆ f64                    ┆ f64                    │
# ╞═════════╪══════════════════╪═══════════════════╪═══════════════════╪═══╪═══════════════╪════════════════════════╪════════════════════════╪════════════════════════╡
# │ Feature ┆ f3370787f7adc06a ┆ 15                ┆ ELLALONG STREET   ┆ … ┆ Point         ┆ 151.485784             ┆ -32.826985             ┆ 0.0                    │
# │ Feature ┆ 1c89c6cbf8236bff ┆ 14                ┆ MILLFIELD STREET  ┆ … ┆ Point         ┆ 151.485753             ┆ -32.826452             ┆ 0.0                    │
# └─────────┴──────────────────┴───────────────────┴───────────────────┴───┴───────────────┴────────────────────────┴────────────────────────┴────────────────────────┘

GuillaumePressiat · 2024-08-16T18:48:54Z

@daviewales for the past few days I've been working on a task where I read highly nested XML files (XML > xmltodict > JSON, then pola.rs), and I was wondering whether code to automatically flatten / rename all structs already existed before writing it myself. So thank you.

romanzdk added the enhancement (New feature or an improvement of an existing feature) label Feb 21, 2023

cmdlineluser mentioned this issue Mar 6, 2023: pd.json_normalize() / automatic flattening of nested data #7374 (Closed)

evbo mentioned this issue Apr 17, 2023: Method to drill down through JSON hierachy to table data #5091 (Closed)

This was referenced Nov 9, 2023:
Equivalent to json_normalize() from Pandas #12219 (Open)
.unnest_all() #12353 (Open)
