Normalize JSON #7078

Open
romanzdk opened this issue Feb 21, 2023 · 5 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@romanzdk

Problem description

It would be perfect to have similar functionality to pandas json_normalize - https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
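To make the request concrete, here is a minimal pure-Python sketch of the kind of flattening json_normalize performs on nested dicts. The `flatten_record` helper is hypothetical (it is not part of pandas or polars); it only assumes the dotted naming that json_normalize uses by default.

```python
from typing import Any


def flatten_record(record: dict[str, Any], sep: str = ".", prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as prefix
            flat.update(flatten_record(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


records = [
    {"id": 1, "name": "Cole Volk", "fitness": {"height": 130, "weight": 60}},
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
]
rows = [flatten_record(r) for r in records]
print(rows[0])
# {'id': 1, 'name': 'Cole Volk', 'fitness.height': 130, 'fitness.weight': 60}
```

Lists are left untouched in this sketch; only nested objects are expanded.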

@romanzdk romanzdk added the enhancement New feature or an improvement of an existing feature label Feb 21, 2023
@cmdlineluser
Contributor

In case it's useful: the last two examples from the pandas documentation already "work" with .read_json()

import io
import polars as pl

df = pl.read_json(io.BytesIO(b"""
[
    {
        "id": 1,
        "name": "Cole Volk",
        "fitness": {"height": 130, "weight": 60}
    },
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
    {
        "id": 2,
        "name": "Faye Raker",
        "fitness": {"height": 130, "weight": 60}
    }
]
"""))
shape: (3, 3)
┌──────┬────────────┬───────────┐
│ id   | name       | fitness   │
│ ---  | ---        | ---       │
│ i64  | str        | struct[2] │
╞══════╪════════════╪═══════════╡
│ 1    | Cole Volk  | {130,60}  │
│ null | Mark Reg   | {130,60}  │
│ 2    | Faye Raker | {130,60}  │
└──────┴────────────┴───────────┘

You can then .unnest() the fitness column:

>>> df.unnest("fitness")
shape: (3, 4)
┌──────┬────────────┬────────┬────────┐
│ id   | name       | height | weight │
│ ---  | ---        | ---    | ---    │
│ i64  | str        | i64    | i64    │
╞══════╪════════════╪════════╪════════╡
│ 1    | Cole Volk  | 130    | 60     │
│ null | Mark Reg   | 130    | 60     │
│ 2    | Faye Raker | 130    | 60     │
└──────┴────────────┴────────┴────────┘
df = pl.read_json(io.BytesIO(b"""
[
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000}
        ]
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337}
        ]
    }
]
"""))
shape: (2, 4)
┌─────────┬───────────┬─────────────────┬─────────────────────────────────────┐
│ state   | shortname | info            | counties                            │
│ ---     | ---       | ---             | ---                                 │
│ str     | str       | struct[1]       | list[struct[2]]                     │
╞═════════╪═══════════╪═════════════════╪═════════════════════════════════════╡
│ Florida | FL        | {"Rick Scott"}  | [{"Dade",12345}, {"Broward",4000... │
│ Ohio    | OH        | {"John Kasich"} | [{"Summit",1234}, {"Cuyahoga",13... │
└─────────┴───────────┴─────────────────┴─────────────────────────────────────┘

This one needs both .unnest() and .explode():

>>> df.unnest("info").explode("counties").unnest("counties")
shape: (5, 5)
┌─────────┬───────────┬─────────────┬────────────┬────────────┐
│ state   | shortname | governor    | name       | population │
│ ---     | ---       | ---         | ---        | ---        │
│ str     | str       | str         | str        | i64        │
╞═════════╪═══════════╪═════════════╪════════════╪════════════╡
│ Florida | FL        | Rick Scott  | Dade       | 12345      │
│ Florida | FL        | Rick Scott  | Broward    | 40000      │
│ Florida | FL        | Rick Scott  | Palm Beach | 60000      │
│ Ohio    | OH        | John Kasich | Summit     | 1234       │
│ Ohio    | OH        | John Kasich | Cuyahoga   | 1337       │
└─────────┴───────────┴─────────────┴────────────┴────────────┘

Seems like it could be useful if .unnest() had the option to prefix the resulting column names to avoid name clashes.

@amzar96

amzar96 commented May 13, 2024

DuplicateError: column with name 'name' has more than one occurrences
I agree with @cmdlineluser. There should be an option to add a custom prefix as well.

@cmdlineluser
Contributor

@amzar96 .name.map_fields() has since been added; it can be used to rename the struct fields before unnesting.

@daviewales

daviewales commented Jul 30, 2024

An example using .name.prefix_fields():

import io
import polars as pl

# au/nsw address data in newline delimited geojson format from openaddresses.io
address_data = io.BytesIO(b"""
{"type": "Feature", "properties": {"hash": "f3370787f7adc06a", "number": "15", "street": "ELLALONG STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336983"}, "geometry": {"type": "Point", "coordinates": [151.4857842, -32.8269847, 0.0]}}
{"type": "Feature", "properties": {"hash": "1c89c6cbf8236bff", "number": "14", "street": "MILLFIELD STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336984"}, "geometry": {"type": "Point", "coordinates": [151.4857532, -32.8264517, 0.0]}}
""")

df = pl.read_ndjson(address_data)
print(df)

# shape: (2, 3)
# ┌─────────┬─────────────────────────────────┬─────────────────────────────────┐
# │ type    ┆ properties                      ┆ geometry                        │
# │ ---     ┆ ---                             ┆ ---                             │
# │ str     ┆ struct[9]                       ┆ struct[2]                       │
# ╞═════════╪═════════════════════════════════╪═════════════════════════════════╡
# │ Feature ┆ {"f3370787f7adc06a","15","ELLA… ┆ {"Point",[151.485784, -32.8269… │
# │ Feature ┆ {"1c89c6cbf8236bff","14","MILL… ┆ {"Point",[151.485753, -32.8264… │
# └─────────┴─────────────────────────────────┴─────────────────────────────────┘

flattened_df = df.with_columns(
  pl.col("properties").name.prefix_fields("properties."),
  pl.col("geometry").name.prefix_fields("geometry."),
).unnest("properties", "geometry")
print(flattened_df)
# shape: (2, 12)
# ┌─────────┬──────────────────┬───────────────────┬───────────────────┬───┬─────────────────────┬───────────────┬───────────────┬───────────────────────────────┐
# │ type    ┆ properties.hash  ┆ properties.number ┆ properties.street ┆ … ┆ properties.postcode ┆ properties.id ┆ geometry.type ┆ geometry.coordinates          │
# │ ---     ┆ ---              ┆ ---               ┆ ---               ┆   ┆ ---                 ┆ ---           ┆ ---           ┆ ---                           │
# │ str     ┆ str              ┆ str               ┆ str               ┆   ┆ str                 ┆ str           ┆ str           ┆ list[f64]                     │
# ╞═════════╪══════════════════╪═══════════════════╪═══════════════════╪═══╪═════════════════════╪═══════════════╪═══════════════╪═══════════════════════════════╡
# │ Feature ┆ f3370787f7adc06a ┆ 15                ┆ ELLALONG STREET   ┆ … ┆                     ┆ 3336983       ┆ Point         ┆ [151.485784, -32.826985, 0.0] │
# │ Feature ┆ 1c89c6cbf8236bff ┆ 14                ┆ MILLFIELD STREET  ┆ … ┆                     ┆ 3336984       ┆ Point         ┆ [151.485753, -32.826452, 0.0] │
# └─────────┴──────────────────┴───────────────────┴───────────────────┴───┴─────────────────────┴───────────────┴───────────────┴───────────────────────────────┘

You can simplify this a bit by defining an expression function:

def prefix_field(field):
    return pl.col(field).name.prefix_fields(f"{field}.")

This lets you change the last bit to:

flattened_df = df.with_columns(
  prefix_field("properties"),
  prefix_field("geometry"),
).unnest("properties", "geometry")
print(flattened_df)

You can simplify this further with the following function, which will automatically unnest every struct column, and prefix the unnested fields with the parent column name:

def flatten(df):
    struct_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.Struct]
    return df.with_columns(*map(prefix_field, struct_cols)).unnest(*struct_cols)

So the last bit becomes:

flattened_df = flatten(df)
print(flattened_df)

Recursively flatten all list and struct fields

And finally, because I couldn't help myself, here's a recursive flatten function, which flattens both lists and structs, and automatically prefixes the flattened fields with the names of the parent columns:

def prefix_field(field):
    """Prefix struct fields with parent column name"""
    return pl.col(field).name.prefix_fields(f"{field}.")

def flatten(df):
    """Flatten one level of struct or list columns, and prefix flattened fields
    with parent column name
    """
    struct_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.Struct]
    list_cols = [col for col, dtype in zip(df.columns, df.dtypes) if type(dtype) is pl.List]

    return df.with_columns(
        *map(prefix_field, struct_cols),
        *map(
            # assumes n_field_strategy="max_width" (one field per element of the longest list)
            lambda c: pl.col(c).list.to_struct(
                n_field_strategy="max_width", fields=lambda i: f"{c}.{i}"
            ),
            list_cols,
        ),
    ).unnest(*struct_cols, *list_cols)

def recursively_flatten(df):
    """Recursively flatten list and struct columns"""
    while any(type(dtype) in (pl.Struct, pl.List) for dtype in df.dtypes):
        df = flatten(df)
    return df

Using the example data from above, it works as follows:

import io
import polars as pl

# au/nsw address data in newline delimited geojson format from openaddresses.io
address_data = io.BytesIO(b"""
{"type": "Feature", "properties": {"hash": "f3370787f7adc06a", "number": "15", "street": "ELLALONG STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336983"}, "geometry": {"type": "Point", "coordinates": [151.4857842, -32.8269847, 0.0]}}
{"type": "Feature", "properties": {"hash": "1c89c6cbf8236bff", "number": "14", "street": "MILLFIELD STREET", "unit": "", "city": "PELAW MAIN", "district": "", "region": "", "postcode": "", "id": "3336984"}, "geometry": {"type": "Point", "coordinates": [151.4857532, -32.8264517, 0.0]}}
""")

df = recursively_flatten(pl.read_ndjson(address_data))
print(df)
# shape: (2, 14)
# ┌─────────┬──────────────────┬───────────────────┬───────────────────┬───┬───────────────┬────────────────────────┬────────────────────────┬────────────────────────┐
# │ type    ┆ properties.hash  ┆ properties.number ┆ properties.street ┆ … ┆ geometry.type ┆ geometry.coordinates.0 ┆ geometry.coordinates.1 ┆ geometry.coordinates.2 │
# │ ---     ┆ ---              ┆ ---               ┆ ---               ┆   ┆ ---           ┆ ---                    ┆ ---                    ┆ ---                    │
# │ str     ┆ str              ┆ str               ┆ str               ┆   ┆ str           ┆ f64                    ┆ f64                    ┆ f64                    │
# ╞═════════╪══════════════════╪═══════════════════╪═══════════════════╪═══╪═══════════════╪════════════════════════╪════════════════════════╪════════════════════════╡
# │ Feature ┆ f3370787f7adc06a ┆ 15                ┆ ELLALONG STREET   ┆ … ┆ Point         ┆ 151.485784             ┆ -32.826985             ┆ 0.0                    │
# │ Feature ┆ 1c89c6cbf8236bff ┆ 14                ┆ MILLFIELD STREET  ┆ … ┆ Point         ┆ 151.485753             ┆ -32.826452             ┆ 0.0                    │
# └─────────┴──────────────────┴───────────────────┴───────────────────┴───┴───────────────┴────────────────────────┴────────────────────────┴────────────────────────┘

@GuillaumePressiat

@daviewales For the past few days I've been working on a task where I read highly nested XML files (XML > xmltodict > JSON, then pola.rs), and I was wondering whether code to automatically flatten and rename all structs already existed before writing it myself. So thank you.
