Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Nugine · 2024-11-12T07:52:46Z

This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.

Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated.

AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).

PFDEBUG SIMD (ON|OFF) subcommand is added for unit tests. A new TCL unit test checks that the results produced by non-AVX2 and AVX2 implementations are exactly equal.

When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.

pfcount key1 key2 key3
pfmerge keyall key1 key2 key3

======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5665.56        35.29839        32.25500        63.99900        67.58300       608.60
PFCOUNT-avx2     72377.83         2.75834         2.67100         5.34300         6.81500      7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9851.29        20.28806        20.09500        36.86300        39.16700       615.71
PFMERGE-avx2    125621.89         1.59126         1.55100         3.11900         4.70300     15702.74
------------------------------------------------------------------------------------------------------

scalar: valkey:unstable  2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2:   Nugine:hll-simd  8f9adc34021080d96e60bd0abe06b043f3ed0275

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS

Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

codecov · 2024-11-12T09:20:22Z

Codecov Report

Attention: Patch coverage is 98.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.79%. Comparing base (2df56d8) to head (454dbfc).
Report is 50 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/hyperloglog.c	98.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1293      +/-   ##
============================================
+ Coverage     70.69%   70.79%   +0.10%     
============================================
  Files           114      118       +4     
  Lines         63161    63478     +317     
============================================
+ Hits          44650    44938     +288     
- Misses        18511    18540      +29

Files with missing lines	Coverage Δ
src/hyperloglog.c	`92.23% <98.00%> (+0.99%)`	⬆️

... and 41 files with indirect coverage changes

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

src/hyperloglog.c

xbasel · 2024-11-14T14:01:25Z

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

Nugine · 2024-11-14T14:07:57Z

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input.
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

zuiderkwast · 2024-11-18T10:26:44Z

This is so cool! 😎

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input. https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

I think we need to test this in our repo in some way. The binary representation can't change, because things like replicas will not understand it, so we should verify the binary representation.

A hyperloglog key is actually a string so we can use the GET command to get the binary representation. In a TCL test case, we can use GET and compare the reply to the binary data we have stored earlier. Alternatively we can use DUMP. Can you add it?

zuiderkwast · 2024-11-19T09:50:58Z

@lipzhu You're the performance expert. Do you want to review this?

lipzhu

@Nugine Great job, the performance is impressive.

Echo @zuiderkwast @xbasel .
Maybe a unit test needed to verify the results is totally same W/ AVX2 instructions and make sure binary file will not changed when RDB presentence is enabled?

src/hyperloglog.c

Nugine · 2024-11-20T12:11:46Z

WIP: unit tests for verifying the results produced by non-AVX2 and AVX2 implementations.

I have other things to do these days. So it may take a while.

Edit: Done!

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

tests/unit/hyperloglog.tcl

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

zuiderkwast

Thanks @Nugine and also thanks for helping review @xbasel @lipzhu!

I don't fully understand the AVX code, but the test case is great and with the additional reviews I'm OK with merging this.

This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths. Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated. AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6). `PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL unit test checks that the results produced by non-AVX2 and AVX2 implementations are exactly equal. When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version. ``` pfcount key1 key2 key3 pfmerge keyall key1 key2 key3 ``` ``` ====================================================================================================== Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec ------------------------------------------------------------------------------------------------------ PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60 PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96 ------------------------------------------------------------------------------------------------------ PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71 PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74 ------------------------------------------------------------------------------------------------------ scalar: valkey:unstable 2df56d8 avx2: Nugine:hll-simd 8f9adc3 CPU: 13th Gen Intel® Core™ i9-13900H × 20 Memory: 32.0 GiB OS: Ubuntu 22.04.5 LTS ``` Experiment repo: https://github.com/Nugine/redis-hyperloglog Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

The bug was introduced in #13558 . When merging dense hll structures, `hllDenseCompress` writes to wrong location and the result will be zero. The unit tests didn't cover this case. This PR + fixes the bug + adds `PFDEBUG SIMD (ON|OFF)` for unit tests + adds a new TCL test to cover the cases Synchronized from valkey-io/valkey#1293 --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn> Co-authored-by: debing.sun <debing.sun@redis.com>

Nugine force-pushed the hll-simd branch 2 times, most recently from 8bcf1ae to f730f91 Compare November 12, 2024 07:56

Nugine mentioned this pull request Nov 12, 2024

Send PRs Nugine/redis-hyperloglog#1

Open

3 tasks

Optimize PFCOUNT, PFMERGE command by SIMD acceleration

8f9adc3

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

Nugine force-pushed the hll-simd branch from f730f91 to 8f9adc3 Compare November 12, 2024 08:41

run clang-format-18

48bfb1d

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

xbasel reviewed Nov 14, 2024

View reviewed changes

src/hyperloglog.c Show resolved Hide resolved

xbasel reviewed Nov 14, 2024

View reviewed changes

src/hyperloglog.c Show resolved Hide resolved

add comments about shuffle

a6136ca

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

xbasel approved these changes Nov 18, 2024

View reviewed changes

zuiderkwast requested a review from SoftlyRaining November 18, 2024 19:36

lipzhu reviewed Nov 20, 2024

View reviewed changes

src/hyperloglog.c Show resolved Hide resolved

src/hyperloglog.c Show resolved Hide resolved

Nugine added 3 commits December 1, 2024 17:19

add PFDEBUG SIMD (ON|OFF)

188572d

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

add unit test for PFMERGE

39ae878

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

fix bug in pfmergeCommand

3c1ccb0

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

Nugine mentioned this pull request Dec 1, 2024

Fix bug in PFMERGE command redis/redis#13672

Merged

Nugine added 2 commits December 1, 2024 18:10

fix

cf01f4f

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

fix

45534e0

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

xbasel reviewed Dec 1, 2024

View reviewed changes

tests/unit/hyperloglog.tcl Outdated Show resolved Hide resolved

change test

454dbfc

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>

lipzhu approved these changes Dec 2, 2024

View reviewed changes

zuiderkwast approved these changes Dec 2, 2024

View reviewed changes

zuiderkwast added release-notes This issue should get a line item in the release notes performance labels Dec 2, 2024

zuiderkwast merged commit 3df609e into valkey-io:unstable Dec 2, 2024
48 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Nugine commented Nov 12, 2024 •

edited

Loading

codecov bot commented Nov 12, 2024 •

edited

Loading

xbasel commented Nov 14, 2024

Nugine commented Nov 14, 2024 •

edited

Loading

zuiderkwast commented Nov 18, 2024

zuiderkwast commented Nov 19, 2024

lipzhu left a comment

Nugine commented Nov 20, 2024 •

edited

Loading

zuiderkwast left a comment

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Conversation

Nugine commented Nov 12, 2024 • edited Loading

codecov bot commented Nov 12, 2024 • edited Loading

Codecov Report

xbasel commented Nov 14, 2024

Nugine commented Nov 14, 2024 • edited Loading

zuiderkwast commented Nov 18, 2024

zuiderkwast commented Nov 19, 2024

lipzhu left a comment

Choose a reason for hiding this comment

Nugine commented Nov 20, 2024 • edited Loading

zuiderkwast left a comment

Choose a reason for hiding this comment

Nugine commented Nov 12, 2024 •

edited

Loading

codecov bot commented Nov 12, 2024 •

edited

Loading

Nugine commented Nov 14, 2024 •

edited

Loading

Nugine commented Nov 20, 2024 •

edited

Loading