Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): follow function params & aliases for left-most name #13819

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Jan 18, 2024

Resolves #13817 and resolves #13820.

Not completely sure this is the proper resolution but it fixes the issue. This does cause one test to fail so @reswqa I may need your help here. The failing test is this:

import polars as pl
df = (
    pl.DataFrame(
        {
            "col1": pl.int_range(0, 7, eager=True),
            "col2": pl.int_range(0, 7, eager=True).reverse(),
        }
    )
).with_columns(
    [
        pl.when(pl.int_range(0, pl.len(), eager=False) < 2)
        .then(None)
        .otherwise(pl.all())
        .name.suffix("_nulls")
    ]
)
polars.exceptions.ComputeError: The name: 'literal_nulls' passed to `LazyFrame.with_columns` is duplicate

Edit: ok, the reason is because pl.when() now dives into the left-most arg, whereas it looks like before the pl.all() from the otherwise branch populated the names by default. What is supposed to happen?

@mcrumiller mcrumiller changed the title fix(rust): follow function params for left-most name fix(rust): follow function params & aliases for left-most name Jan 18, 2024
@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

When deciding the schema for Ternary, we let the name of truthy branch be the name of Ternary expression. We should also follow this with RenameAlias, so I think it's indeed a bug to bypass truthy but get falsy name, and this will be fixed by this PR.

In addition, without this suffix, the following query would raise now. I don't think adding a suffix should prevent it from giving an duplicate columns error.

        with_columns(
            pl.when(pl.int_range(0, pl.len(), eager=False) < 2)
            .then(None)
            .otherwise(pl.all())
    )

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 19, 2024

When deciding the schema for Ternary, we let the name of truthy branch be the name of Ternary expression. We should also follow this with RenameAlias, so I think it's indeed a bug to bypass truthy but get falsy name, and this will be fixed by this PR.

Question: when you say "truthy" branch, do you mean the first input (in the when case)? I don't understand otherwise, what if the result of the when clause is sometimes True and sometimes False? Which branch is truthy in that case?

With that aside....doesn't this mean that we can never do the following pattern?

pl.when(pl.some_function()).then(something).otherwise(pl.all())

In other words, pl.all() is invalid in the otherwise case, which procludes us from using a single column to affect multiple other columns. Is there a way around this? This seems like it could break a lot of stuff.

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

That's the purpose of name.keep, right?

pl.when(pl.col.c).then(None).otherwise(pl.all()).name.keep()

@mcrumiller
Copy link
Contributor Author

That's the purpose of name.keep, right?

Indeed! Thanks. Let me update the failing test.

@mcrumiller
Copy link
Contributor Author

Hmm, it appears we can't do .name.keep().name.suffix():

>>> pl.DataFrame({"a": [1, 2, 3]}).with_columns(pl.all().name.keep().name.suffix("_2"))
pyo3_runtime.PanicException: no `name.keep` expected at this point

@mcrumiller
Copy link
Contributor Author

Also, how does .name.keep() know which expressions' names we want to keep?

@mcrumiller
Copy link
Contributor Author

@reswqa I've added a workaround to the renaming in the unit test, along with a note. It's a bit messier (now have to create seprate dfs and concat them) but it works.

I still don't quite understand how .name.keep() is supposed to work when there are multiple expressions present. I assume that it, like everything else, looks for the left-most expression, which m eans that .name.keep() should be failing as well, just like .name.suffix() now does too.

@mcrumiller
Copy link
Contributor Author

A theoretical workaround: when using a pl.col in an expression, allow an extra tag that says "this is the column whose name I want to keep". Somethine like this:

from polars import col
df.with_columns(
    col("a").expression_chain().random_function(col("x", "y", keep_name=True)).more_functions()
)
# output: df with output columns `x` and `y`

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

I assume that it, like everything else, looks for the left-most expression, which means that .name.keep() should be failing as well, just like .name.suffix() now does too.

The key is that name.keep only looks for leaf Expr::Column, but name.suffix is expected to suffix the real output name, it has to consider more things like Alias.

@mcrumiller
Copy link
Contributor Author

Alright. So should we just proceed as if this is now correct, and wait until someone runs into an issue in the future with the when/then case?

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

until someone runs into an issue

What do you mean by issue? The test case you have to workaround?

I can open a PR(It might be a little late because I have some things to do this weekend :)) that lets us support the following usage, Is this enough to address your concerns.

         pl.when(pl.int_range(0, pl.len(), eager=False) < 2)
            .then(None)
            .otherwise(pl.all())
            .name.keep()
            .name.suffix("_nulls")

And it's results should be:

┌──────┬──────┬────────────┬────────────┐
│ col1col2col1_nullscol2_nulls │
│ ------------        │
│ i64i64i64i64        │
╞══════╪══════╪════════════╪════════════╡
│ 06nullnull       │
│ 15nullnull       │
│ 2424          │
│ 3333          │
│ 4242          │
│ 5151          │
│ 6060          │
└──────┴──────┴────────────┴────────────┘

@mcrumiller
Copy link
Contributor Author

Hmm, yes I think that works. I'm still struggling a bit to follow the expected behavior though. I understand that name.keep() under the hood finds the first Column in the expression chain and keeps that, but the documentation says:

Keep the original root name of the expression

The "root name" (I think) should be what's in the pl.when expression (as of this PR), which is now "literal". So .keep_name after when/then chain should keep whatever is in the when clause, which now is literal. So, it appears there is a bit of disconnect between the documentation of .name.keep and the implementation. Do you agree with that assessment? Perhaps the documentation should be amended to say:

Keep the name of the first named column(s) used in the expression

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

The "root name" (I think) should be what's in the pl.when expression (as of this PR), which is now "literal"

Depending on how you define the root expr for Ternary, we actually define it as the expr inside then() branch. After all, the final result will only come from the then/otherwise expression, not the when() expression, right? Note: It is a separate Expr, not Expr::Function.

If you accept that the first child-expr traversed is the root expr of this expression, we actually traverse when first for Expr::Ternary. Of course you can argue that the order should not be defined in this way.

@mcrumiller
Copy link
Contributor Author

After all, the final result will only come from the then/otherwise expression, not the when() expression, right?

Duh, you are absolutely right.

Note: It is a separate Expr, not Expr::Function.

Why, then, does my PR appear to break this? when().then().otherwise(pl.all()).name.with_suffix() worked before, now it does not.

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

Why, then, does my PR appear to break this? when().then().otherwise(pl.all()).name.with_suffix() worked before, now it does not.

The traversal order is then->otherwise->when. We have then (literal), before your change, it does not match anything in get_single_leaf. So we go to otherwise, which matches Expr::Column. But after this PR, then can match Expr::Literal directly. The result is that we are suffix literal.

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

But I think your changes is correct. suffix should take into account both Literal and Alias IMO.

@mcrumiller
Copy link
Contributor Author

Thanks Weijie, that makes perfect sense. I think my PR title then covers that case properly as well. I think your keep PR fix for the suffix after keep_name would be very appropriate after this. Make sure to update the docs since it says:

Notes

Due to implementation constraints, this method can only be called as the last expression in a chain.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 19, 2024

An alternative thought would be to add a suffix parameter to keep_name, which may reduce complexity with regards to the positioning of keep_name.

@reswqa
Copy link
Collaborator

reswqa commented Jan 19, 2024

An alternative thought would be to add a suffix parameter to keep_name, which may reduce complexity with regards to the positioning of keep_name.

Well, this would really make things easier, mind setting up an issue to discuss this? I thought it was worth discussing before I started making changes.

Update: Oh, to do so, it probably should take a function: Callable[[str], str] argument. Because we need to support all name mapping instead of only suffix. If people don't like this, I can support .name.keep().name.map() then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pl.int_range().name.suffix ComputeError pl.len().alias + .name.suffix discards the name given to alias()
2 participants