polars performance with simple array arithmetic much slower than NumPy? #18088

Closed
wesm opened this issue Aug 7, 2024 · 4 comments · Fixed by #18148


wesm commented Aug 7, 2024

Description

I was working on a histogram implementation for polars in Positron and I stumbled on surprising performance differences for operations with float64 arrays:

In [8]: arr = np.random.randn(1000000)

In [9]: arrp = pl.Series(arr)

In [10]: timeit arr - 1
397 µs ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: timeit arrp - 1
3.4 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I haven't looked too deeply yet. I am on an AMD Ryzen 7 PRO 7840U, which has AVX2/AVX-512 extensions, so an almost 10x difference in performance surprised me. For now I will just convert polars arrays to NumPy arrays and do the work there, but I would be interested in diagnosing the problem further to develop an intuition for when I should write polars code vs. dropping down to NumPy for numerical operations.
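
For reference, the workaround looks roughly like this (a minimal sketch; Series.to_numpy and the pl.Series constructor are existing polars APIs, and to_numpy can often avoid a copy for contiguous float data without nulls):

import numpy as np
import polars as pl

arrp = pl.Series(np.random.randn(1_000_000))

# Drop down to NumPy for the arithmetic, then wrap the result back up.
result = pl.Series(arrp.to_numpy() - 1)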

@wesm added the enhancement label Aug 7, 2024

coastalwhite commented Aug 8, 2024

This is quite similar to #17414.

Interesting to see that this is happening on AMD, however.


orlp commented Aug 8, 2024

We looked into it further and have narrowed it down 100% to our use of jemallocator. The following toy example shows exactly the same slowdown we see when using Polars, and no slowdown when jemallocator is commented out. time(1) shows that when jemallocator is used, almost all of the time is spent not in userland but in the system, which points to either expensive kernel calls that jemallocator makes or extra page faults that jemallocator causes.

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// Allocates a fresh Vec on every call, so each iteration exercises the allocator.
fn sub_one(x: &[f64]) -> Vec<f64> {
    x.iter().map(|x| *x - 1.0).collect()
}

fn main() {
    let v = vec![0.0; 1_000_000];
    let start = std::time::Instant::now();
    // black_box keeps the compiler from optimizing the work and allocations away.
    for _ in 0..10_000 {
        std::hint::black_box(sub_one(std::hint::black_box(&v)));
    }
    dbg!(start.elapsed() / 10_000);
}

We will have to figure out if this is something we can resolve by configuring jemalloc or whether we need to switch to a different allocator altogether.

@orlp added the bug, performance, and accepted labels and removed the enhancement label Aug 9, 2024

orlp commented Aug 9, 2024

Ok, we figured out that the slowdown is caused by page faults that the default jemalloc options cause.

Setting the jemalloc option muzzy_decay_ms to -1, which disables MADV_DONTNEED entirely, solves the issue, but it might cause overly large memory usage statistics to be reported on some machines (since only MADV_FREE will be emitted, not MADV_DONTNEED).

Strangely enough, setting muzzy_decay_ms to a finite value like 1000 (a one-second delay) still causes the problem, even when there isn't a one-second delay between accesses. I reported the issue to jemalloc here: jemalloc/jemalloc#2688.
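
For anyone experimenting, a sketch of how that option could be baked into the toy program above (an assumption-laden example, not the fix that was shipped: it assumes jemallocator's default _rjem_ symbol prefix, under which jemalloc reads compile-time options from a _rjem_malloc_conf symbol):

// Sketch: compile-time jemalloc configuration for the toy example.
// jemalloc reads options from the malloc_conf symbol, which the
// jemallocator crate's prefixed build exposes as _rjem_malloc_conf.
// The option string must be NUL-terminated.
#[allow(non_upper_case_globals)]
#[export_name = "_rjem_malloc_conf"]
pub static malloc_conf: &[u8] = b"muzzy_decay_ms:-1\0";

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;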


ritchie46 commented Aug 9, 2024

Can people try setting _RJEM_MALLOC_CONF="background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1" before Polars is imported? This will resolve the issue. I am curious what the memory behavior is like for your use cases.
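
From Python that would look something like this (a minimal sketch; the variable has to be set before the polars import, since jemalloc reads it when the native extension loads):

import os

# Must happen before `import polars`: jemalloc reads _RJEM_MALLOC_CONF
# once, when the native library is loaded.
os.environ["_RJEM_MALLOC_CONF"] = (
    "background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1"
)

import polars as pl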
