polars performance with simple array arithmetic much slower than NumPy? #18088

Closed
wesm opened this issue Aug 7, 2024 · 4 comments · Fixed by #18148


wesm commented Aug 7, 2024

Description

I was working on a histogram implementation for polars in Positron and I stumbled on surprising performance differences for operations with float64 arrays:

In [8]: arr = np.random.randn(1000000)

In [9]: arrp = pl.Series(arr)

In [10]: timeit arr - 1
397 µs ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: timeit arrp - 1
3.4 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I haven't looked too deeply yet. I am on an AMD Ryzen 7 PRO 7840U, which has AVX2/AVX-512 extensions, so an almost 10x difference in performance surprised me. For now I will just convert polars arrays to NumPy arrays and do the work there, but I would be interested in diagnosing the problem further to develop an intuition for when I should write polars code vs. dropping down to NumPy for numerical operations.
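
For reference, the workaround looks roughly like this (a minimal sketch; Series.to_numpy and the pl.Series constructor are existing polars APIs, and to_numpy can often avoid a copy for contiguous float data without nulls):

import numpy as np
import polars as pl

arrp = pl.Series(np.random.randn(1_000_000))

# Drop down to NumPy for the arithmetic, then wrap the result back up.
result = pl.Series(arrp.to_numpy() - 1)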

@wesm added the enhancement label Aug 7, 2024

coastalwhite commented Aug 8, 2024

This is quite similar to #17414.

Interesting to see that this is happening on AMD, however.


orlp commented Aug 8, 2024

We looked into it further and have narrowed it down 100% to our use of jemallocator. The following toy example shows exactly the same slowdown we see when using Polars, and no slowdown when jemallocator is commented out. time(1) shows that when jemallocator is used, almost all of the time is spent not in userland but in the system, which points to either expensive kernel calls that jemallocator makes or extra page faults that jemallocator causes.

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// Allocates a fresh Vec on every call, so each iteration exercises the allocator.
fn sub_one(x: &[f64]) -> Vec<f64> {
    x.iter().map(|x| *x - 1.0).collect()
}

fn main() {
    let v = vec![0.0; 1_000_000];
    let start = std::time::Instant::now();
    // black_box keeps the compiler from optimizing the work and allocations away.
    for _ in 0..10_000 {
        std::hint::black_box(sub_one(std::hint::black_box(&v)));
    }
    dbg!(start.elapsed() / 10_000);
}

We will have to figure out if this is something we can resolve by configuring jemalloc or whether we need to switch to a different allocator altogether.

@orlp added the bug, performance, and accepted labels and removed the enhancement label Aug 9, 2024

orlp commented Aug 9, 2024

Ok, we figured out that the slowdown is caused by page faults that the default jemalloc options cause.

Setting the jemalloc option muzzy_decay_ms to -1, which disables MADV_DONTNEED entirely, solves the issue, but it might cause overly large memory usage statistics to be reported on some machines (since only MADV_FREE will be emitted, not MADV_DONTNEED).

Strangely enough, setting muzzy_decay_ms to a finite value like 1000 (a one-second delay) still causes the problem, even when there isn't a one-second delay between accesses. I reported the issue to jemalloc here: jemalloc/jemalloc#2688.
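
For anyone experimenting, a sketch of how that option could be baked into the toy program above (an assumption-laden example, not the fix that was shipped: it assumes jemallocator's default _rjem_ symbol prefix, under which jemalloc reads compile-time options from a _rjem_malloc_conf symbol):

// Sketch: compile-time jemalloc configuration for the toy example.
// jemalloc reads options from the malloc_conf symbol, which the
// jemallocator crate's prefixed build exposes as _rjem_malloc_conf.
// The option string must be NUL-terminated.
#[allow(non_upper_case_globals)]
#[export_name = "_rjem_malloc_conf"]
pub static malloc_conf: &[u8] = b"muzzy_decay_ms:-1\0";

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;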


ritchie46 commented Aug 9, 2024

Can people try setting _RJEM_MALLOC_CONF="background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1" before Polars is imported? This will resolve the issue. I am curious what the memory behavior is like for your use cases.
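
From Python that would look something like this (a minimal sketch; the variable has to be set before the polars import, since jemalloc reads it when the native extension loads):

import os

# Must happen before `import polars`: jemalloc reads _RJEM_MALLOC_CONF
# once, when the native library is loaded.
os.environ["_RJEM_MALLOC_CONF"] = (
    "background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1"
)

import polars as pl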
