Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expr.shuffle uses different order per column #18233

Open
Firionus opened this issue Aug 16, 2024 · 2 comments
Open

Expr.shuffle uses different order per column #18233

Firionus opened this issue Aug 16, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@Firionus
Copy link

Description

If Expr.shuffle is used on multiple columns, by default it doesn't keep the rows intact:

>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
>>> df.select(pl.all().shuffle())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 3   │
│ 3   ┆ 1   │
│ 1   ┆ 2   │
└─────┴─────┘

Note that columns "a" and "b" were previously always the same in each row but are not after using Expr.shuffle.

The behavior I would expect (rows shuffled instead of each column) can be achieved by using a fixed seed:

>>> df.select(pl.all().shuffle(seed=1))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
│ 1   ┆ 1   │
│ 3   ┆ 3   │
└─────┴─────┘

Note that now column "a" always equals column "b" as in the original DataFrame.

Suggestion

  1. Include my example in the documentation with/without seed to make the behavior clear
  2. I'm not too familiar with the concepts in Polars, but maybe the wording in the documentation could be changed from

Shuffle the contents of this expression.

to

Shuffle each column in this expression.

Though maybe that doesn't quite cover what different things you can do with Expr, I don't know.

Other possibly affected functions

  • Expr.sort: currently says "Sort this column.", when "Sort each column in this expression" might be better (?).
  • Expr.sample currently says "Sample from this expression." when it might better say "Sample from each column in this expression" (?).
  • Probably more cases to be found, but I haven't discovered them yet.

Used Versions

>>> pl.__version__
'0.20.18'
>>> sys.version
'3.11.8 (main, Feb 25 2024, 03:41:44) [MSC v.1929 64 bit (AMD64)]'

Link

https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.shuffle.html

@Firionus Firionus added the documentation Improvements or additions to documentation label Aug 16, 2024
@Julian-J-S
Copy link
Contributor

Julian-J-S commented Aug 16, 2024

agree that this might be problematic / unexpected.

Hoewever, for the use case of shuffling an entire df the better solution is probably using df.sample with shuffle=True 😉
https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html

Might even be worth it to create an DataFrame.shuffle method 🤔

@Firionus
Copy link
Author

Hoewever, for the use case of shuffling an entire df the better solution is probably using df.sample with shuffle=True 😉

Agreed, and DataFrame.shuffle might be more intuitive there than calling DataFrame.sample. But I stumbled over this in an aggregation context, so it wouldn't have helped.

Thanks for the super quick response 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants