Getting a "ComputeError: cannot evaluate two Series of different lengths" with straight forward Lazy expressions. #18124

Closed
2 tasks done
dalejung opened this issue Aug 9, 2024 · 2 comments · Fixed by #18177
Assignees: ritchie46
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

dalejung commented Aug 9, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

    import polars as pl

    df = pl.from_repr(
        """
        ┌─────────────────────┬───────┬─────┐
        │ datetime            ┆ alpha ┆ num │
        │ ---                 ┆ ---   ┆ --- │
        │ datetime[μs]        ┆ str   ┆ i64 │
        ╞═════════════════════╪═══════╪═════╡
        │ 2022-01-01 11:00:00 ┆ A     ┆ 0   │
        │ 2022-01-01 11:01:00 ┆ B     ┆ 1   │
        │ 2022-01-01 11:02:00 ┆ C     ┆ 2   │
        │ 2022-01-01 11:03:00 ┆ D     ┆ 3   │
        │ 2022-01-01 11:04:00 ┆ E     ┆ 4   │
        │ 2022-01-01 11:05:00 ┆ F     ┆ 5   │
        │ 2022-01-01 11:06:00 ┆ G     ┆ 6   │
        │ 2022-01-01 11:07:00 ┆ H     ┆ 7   │
        │ 2022-01-01 11:08:00 ┆ I     ┆ 8   │
        │ 2022-01-01 11:09:01 ┆ J     ┆ 9   │
        │ 2022-01-02 11:00:00 ┆ A     ┆ 0   │
        │ 2022-01-02 11:01:00 ┆ B     ┆ 1   │
        │ 2022-01-02 11:02:00 ┆ C     ┆ 2   │
        │ 2022-01-02 11:03:00 ┆ D     ┆ 3   │
        │ 2022-01-02 11:04:00 ┆ E     ┆ 4   │
        │ 2022-01-02 11:05:00 ┆ F     ┆ 5   │
        │ 2022-01-02 11:06:00 ┆ G     ┆ 6   │
        │ 2022-01-02 11:07:00 ┆ H     ┆ 7   │
        │ 2022-01-02 11:08:00 ┆ I     ┆ 8   │
        │ 2022-01-02 11:09:01 ┆ J     ┆ 9   │
        └─────────────────────┴───────┴─────┘
        """
    )

    ts_tolerance = pl.duration(seconds=-1)
    ts_col = 'datetime'

    out = (
        df.lazy()
        .with_columns(
            ts_diff=pl.col(ts_col).diff(),
            ts_diff_after=pl.col(ts_col).diff(5).shift(-5),
        )
        .with_columns(
            ts_diff_sign=pl.col('ts_diff') > pl.duration(seconds=0),
            ts_diff_after_sign=pl.col('ts_diff_after') > pl.duration(seconds=0),
        )
        .filter(pl.col('ts_diff') > ts_tolerance)
    ).collect()

    ComputeError: cannot evaluate two Series of different lengths (19 and 2)

Log output

No response

Issue description

Code that previously worked now raises ComputeError: cannot evaluate two Series of different lengths.

The example lazy frame is straightforward. Switching to the eager API avoids the error.

Turning predicate_pushdown off fixes the error. Commenting out ts_diff_after_sign or removing the filter() also avoids it, which is confusing since they don't touch any common columns other than the original datetime.

Expected behavior

No error.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-6.9.6-arch1-1-x86_64-with-glibc2.40
Python:               3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               0.9.2.post8+g4cb29ba
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               3.0.0.dev0+756.ge8e6be071c
pyarrow:              18.0.0.dev15+gd745fd7d1
pydantic:             2.7.2
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
dalejung added the bug, needs triage, and python labels on Aug 9, 2024
cmdlineluser (Contributor) commented

Can reproduce.

Error originated in expression: 
'[(col("ts_diff")) > (col("__POLARS_CSER_0x4c59a5dbe447f401"))]'

The error expression suggests this is a CSE (common subexpression elimination) optimizer issue: the query also runs with CSE disabled, via .collect(comm_subexpr_elim=False)

(Which may explain the confusing behaviour.)

ritchie46 self-assigned this on Aug 14, 2024
ritchie46 (Member) commented

Taking a look.
