-
Notifications
You must be signed in to change notification settings - Fork 715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293
Conversation
8bcf1ae
to
f730f91
Compare
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1293 +/- ##
============================================
+ Coverage 70.69% 70.79% +0.10%
============================================
Files 114 118 +4
Lines 63161 63478 +317
============================================
+ Hits 44650 44938 +288
- Misses 18511 18540 +29
|
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results? |
The algorithms are verified by comparing the results between scalar code and simd code with random input. |
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
This is so cool! 😎
I think we need to test this in our repo in some way. The binary representation can't change, because things like replicas will not understand it, so we should verify the binary representation. A hyperloglog key is actually a string so we can use the GET command to get the binary representation. In a TCL test case, we can use GET and compare the reply to the binary data we have stored earlier. Alternatively we can use DUMP. Can you add it? |
@lipzhu You're the performance expert. Do you want to review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nugine Great job, the performance is impressive.
Echo @zuiderkwast @xbasel .
Maybe a unit test needed to verify the results is totally same W/ AVX2 instructions and make sure binary file will not changed when RDB presentence is enabled?
WIP: unit tests for verifying the results produced by non-AVX2 and AVX2 implementations. I have other things to do these days. So it may take a while. Edit: Done! |
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths. Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated. AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6). `PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL unit test checks that the results produced by non-AVX2 and AVX2 implementations are exactly equal. When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version. ``` pfcount key1 key2 key3 pfmerge keyall key1 key2 key3 ``` ``` ====================================================================================================== Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ------------------------------------------------------------------------------------------------------ PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60 PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96 ------------------------------------------------------------------------------------------------------ PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71 PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74 ------------------------------------------------------------------------------------------------------ scalar: valkey:unstable 2df56d8 avx2: Nugine:hll-simd 8f9adc3 CPU: 13th Gen Intel® Core™ i9-13900H × 20 Memory: 32.0 GiB OS: Ubuntu 22.04.5 LTS ``` Experiment repo: https://github.com/Nugine/redis-hyperloglog Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
The bug was introduced in #13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from valkey-io/valkey#1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>
The bug was introduced in #13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from valkey-io/valkey#1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>
This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).
PFDEBUG SIMD (ON|OFF)
subcommand is added for unit tests. A new TCL unit test checks that the results produced by non-AVX2 and AVX2 implementations are exactly equal.When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp