`Expr.shuffle` uses different order per column #18233

Firionus · 2024-08-16T11:08:56Z

Description

If Expr.shuffle is used on multiple columns, by default it doesn't keep the rows intact:

>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
>>> df.select(pl.all().shuffle())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 3   │
│ 3   ┆ 1   │
│ 1   ┆ 2   │
└─────┴─────┘

Note that columns "a" and "b" were previously always the same in each row but are not after using Expr.shuffle.

The behavior I would expect (rows shuffled instead of each column) can be achieved by using a fixed seed:

>>> df.select(pl.all().shuffle(seed=1))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
│ 1   ┆ 1   │
│ 3   ┆ 3   │
└─────┴─────┘

Note that now column "a" always equals column "b" as in the original DataFrame.

Suggestion

Include my example in the documentation with/without seed to make the behavior clear
I'm not too familiar with the concepts in Polars, but maybe the wording in the documentation could be changed from

Shuffle the contents of this expression.

to

Shuffle each column in this expression.

Though maybe that doesn't quite cover what different things you can do with Expr, I don't know.

Other possibly affected functions

Expr.sort: currently says "Sort this column.", when "Sort each column in this expression" might be better (?).
Expr.sample currently says "Sample from this expression." when it might better say "Sample from each column in this expression" (?).
Probably more cases to be found, but I haven't discovered them yet.

Used Versions

>>> pl.__version__
'0.20.18'
>>> sys.version
'3.11.8 (main, Feb 25 2024, 03:41:44) [MSC v.1929 64 bit (AMD64)]'

Link

https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.shuffle.html

The text was updated successfully, but these errors were encountered:

Julian-J-S · 2024-08-16T12:01:16Z

agree that this might be problematic / unexpected.

Hoewever, for the use case of shuffling an entire df the better solution is probably using df.sample with shuffle=True 😉
https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html

Might even be worth it to create an DataFrame.shuffle method 🤔

Firionus · 2024-08-16T15:00:47Z

Hoewever, for the use case of shuffling an entire df the better solution is probably using df.sample with shuffle=True 😉

Agreed, and DataFrame.shuffle might be more intuitive there than calling DataFrame.sample. But I stumbled over this in an aggregation context, so it wouldn't have helped.

Thanks for the super quick response 👍

Firionus added the documentation Improvements or additions to documentation label Aug 16, 2024

cmdlineluser mentioned this issue Aug 16, 2024

.struct.field() after shuffle() seems to produce incorrect results #18234

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Expr.shuffle` uses different order per column #18233

`Expr.shuffle` uses different order per column #18233

Firionus commented Aug 16, 2024

Julian-J-S commented Aug 16, 2024 •

edited

Loading

Firionus commented Aug 16, 2024

Expr.shuffle uses different order per column #18233

Expr.shuffle uses different order per column #18233

Comments

Firionus commented Aug 16, 2024

Description

Suggestion

Other possibly affected functions

Used Versions

Link

Julian-J-S commented Aug 16, 2024 • edited Loading

Firionus commented Aug 16, 2024

`Expr.shuffle` uses different order per column #18233

`Expr.shuffle` uses different order per column #18233

Julian-J-S commented Aug 16, 2024 •

edited

Loading