README.md

ZSTD on DPU

Port of ZSTD decompressor for a DPU-based solution.

make should do the trick. Setting DPU_SIMPLE=1 will use the simple implementation instead of the standard one.

A DPU can decompress 12 files in parallel.

File type	File size	DPU duration	DPU throughput @267MHz	System throughput
Executable	1028152 B each	921253408 cycles	3.58 MB/s	3.58 GB/s
Executable	1660152 B each	1250792384 cycles	4.25 MB/s	4.25 GB/s

The figures are for a 1024-DPU system.

The diversity of benchmarks should be extended to have more precise measures on different usecases (eg. small files).
The host overhead is not taken into account in the current performance measures.
Performances are limited in various ways:
- the implementation is currently MRAM-bound (only 0.797 instruction per cycle); this can be hard to improve because of the structures used in the implementation that have to be located in MRAM (notably the 4096B random-access array used to decode symbols)
- 12 threads are used (in order to fit their data in the Work RAM). This is the minimum to fully use the pipeline (when the execution is not slowed down by the MRAM accesses)
- the dictionary feature is not implemented, which could be an improvement for small files
- the lto feature and the O2 optimization level are not enabled in order to fit in the instruction memory (IRAM)
The DPU implementation does not support the checksum feature.
The DPU inputs and outputs are limited to 2GB each, which means that the compression implementation needs to be adapted to limit the frame size.
The DPU only accepts 12 inputs at a time. For small files, this is not efficient and should be changed to support a list of inputs.