Using S3 Select with MinIO object storage can deliver up to 100x performance improvements over AWS S3. S3 Select offloads data filtering to the storage layer and supports CSV, JSON, and Parquet. MinIO accelerated its S3 Select implementation in two steps, zero-copy parsing and SIMD, each credited with roughly a 10x speedup. Ongoing work (Parquet and JSON support, AVX-512) could yield further speedups over AWS S3 Select.
Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
1. Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Frank Wessels
CTO, MinIO
2. S3 Select
▪ Recent addition to S3 API
○ Offload filtering to storage
○ Formats: CSV, JSON, Parquet
▪ Advantages
○ Faster
○ Less network traffic
○ Smaller compute nodes
■ S3 Select for Spark
○ https://github.com/minio/spark-select
[Figure: before/after diagram. Before: Applications. After: Applications with S3 SELECT. Up to 400% faster, up to 80% cheaper.]
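To make "offload filtering to storage" concrete, here is a minimal sketch of an S3 Select call from Python. It assumes a boto3-style client whose `select_object_content` method takes the keyword arguments shown; the helper names `build_select_request` and `run_select`, the bucket name, the endpoint, and the column used in the query are all hypothetical and not taken from the talk.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for an S3 Select call on a CSV object.

    The SQL expression is evaluated by the storage server, so only the
    matching rows travel over the network.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV line as a header so columns can be named.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(client, request):
    """Run the request and concatenate the returned record chunks."""
    response = client.select_object_content(**request)
    chunks = []
    for event in response["Payload"]:
        # The response is an event stream; only "Records" events carry data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)


# Usage (assumption: boto3 is installed and a server is reachable at the
# hypothetical endpoint below; the column name "state" is also hypothetical):
#
#   import boto3
#   client = boto3.client("s3", endpoint_url="http://localhost:9000")
#   request = build_select_request(
#       "datasets", "parking-citations.csv",
#       "SELECT * FROM S3Object s WHERE s.\"state\" = 'CA'")
#   print(run_select(client, request).decode())
```

Passing the client in as a parameter keeps the sketch independent of any particular SDK setup; the same request shape should work against AWS S3 or an S3-compatible endpoint.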
3. Introduction to MinIO
MinIO is a high-performance, distributed object storage server, designed for petascale data infrastructure.
▪ S3-Compatible
▪ Scalable
▪ Performant
▪ Simple
▪ Optimized for Intel/ARM/Power9 CPUs
8. First 10X Acceleration: Zero Copy
▪ Manage memory allocations: garbage collected vs. non-garbage collected
Source: https://bitbucket.org/ewanhiggs/csv-game
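One common reading of "zero copy" in a CSV parser is that fields are handed out as views into the input buffer instead of freshly allocated strings, which keeps pressure off the allocator and garbage collector. The sketch below illustrates that idea in Python with `memoryview`; it is not MinIO's implementation, and the function name `split_fields` and the sample record are hypothetical.

```python
def split_fields(line: memoryview, delim: int = ord(",")):
    """Yield each field of one CSV record as a view into the input buffer.

    Slicing a memoryview returns another memoryview over the same memory,
    so no field bytes are copied. Quoting and escaping are ignored here.
    """
    start = 0
    for i in range(len(line)):
        if line[i] == delim:
            yield line[start:i]
            start = i + 1
    # The last field runs to the end of the record.
    yield line[start:]


record = b"1001,2019-01-01,METER EXPIRED"
fields = list(split_fields(memoryview(record)))

# Every field still refers to the original buffer.
assert all(f.obj is record for f in fields)

# Copy only the field that is actually needed.
print(fields[2].tobytes())
```

A query that filters on one column can then compare bytes in place and materialize a row only when it matches.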
9. Second 10X Acceleration: SIMD
▪ SIMD = Single Instruction Multiple Data
○ Intel: AVX2
▪ Process 32 bytes in parallel
○ delimiter / separator detection
○ bitmap handling & parsing
○ string compares
▪ Performance (single core)
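The delimiter detection and bitmap handling steps can be sketched as follows. Python has no SIMD intrinsics, so this is a scalar emulation of the data flow only: `delimiter_mask` computes, one byte at a time, the same 32-bit mask that an AVX2 byte compare (`_mm256_cmpeq_epi8`) followed by a movemask (`_mm256_movemask_epi8`) would produce in a couple of instructions. The function names are hypothetical and this is not the select-simd code.

```python
CHUNK = 32  # bytes held by one 256-bit AVX2 register


def delimiter_mask(chunk: bytes, delim: int) -> int:
    """Return a bitmap with bit i set when chunk[i] equals the delimiter."""
    mask = 0
    for i, b in enumerate(chunk):
        if b == delim:
            mask |= 1 << i
    return mask


def delimiter_positions(data: bytes, delim: int = ord(",")) -> list:
    """Find all delimiter offsets by walking the set bits of each chunk mask."""
    positions = []
    for base in range(0, len(data), CHUNK):
        mask = delimiter_mask(data[base:base + CHUNK], delim)
        while mask:
            low = mask & -mask                      # isolate the lowest set bit
            positions.append(base + low.bit_length() - 1)
            mask ^= low                             # clear it and continue
    return positions


sample = b"a,bb,,ccc" + b"x" * 30 + b",z"
print(delimiter_positions(sample))
```

The point of the bitmap form is that chunks without delimiters cost a single zero test, and chunks with delimiters are consumed bit by bit without re-reading the bytes.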
10. Results using select-simd
▪ Same queries as before
○ MinIO with select-simd vs AWS S3
11. Demo
■ Source data
○ parking-citations.csv (25M rows / 3.5 GB)
■ AWS region
○ us-east-1
■ MinIO with select-simd-integration branch running on a single instance: c5.2xlarge (8 vCPUs)
■ mc client running in same region on c5.large instance
12. Status and what’s next
▪ Work in progress
○ Initial focus on CSV
▪ Next: add support for
○ Parquet
○ JSON: https://github.com/lemire/simdjson
▪ Investigate AVX-512
○ erasure coding
▫ AVX-512 4x speedup over AVX2
○ k-registers are great
○ 2 KB on-core register space (32 ZMM registers × 64 bytes)
▪ Dynamic code generation (think LLVM)