Using S3 Select to Deliver 100X Performance Improvements Versus the Public Cloud
Frank Wessels
CTO, MinIO
S3 Select
▪ Recent addition to S3 API
○ Offload filtering to storage
○ Formats: CSV, JSON, Parquet
▪ Advantages
○ Faster
○ Less network traffic
○ Smaller compute nodes
■ S3 Select for Spark
○ https://github.com/minio/spark-select
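As a rough sketch of what "offload filtering to storage" looks like from a client, the snippet below builds the arguments for boto3's `select_object_content` call and collects the filtered bytes from the response event stream. The bucket, key, column names, and query are hypothetical placeholders, not taken from the talk, and the object is assumed to be a CSV file with a header row.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for boto3's S3 client select_object_content()."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the object is a CSV file whose first line names the columns.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(s3_client, request):
    """Send the request and return the filtered bytes produced on the storage side."""
    response = s3_client.select_object_content(**request)
    chunks = []
    for event in response["Payload"]:
        # The response is an event stream; only "Records" events carry data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)


# Hypothetical bucket, key, and column names, for illustration only.
request = build_select_request(
    "example-bucket",
    "data/example.csv",
    "SELECT s.name FROM S3Object s WHERE s.city = 'Boston'",
)
```

With boto3 installed, `run_select(boto3.client("s3"), request)` would issue the query; for an S3-compatible server, the client can be created with an `endpoint_url` argument. Only the matching rows travel over the network, which is where the reduced traffic comes from.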
[Figure: Before vs. After (Applications, S3 SELECT). Up to 400% faster, up to 80% cheaper.]
Introduction to MinIO
MinIO is a high performance, distributed object storage server, designed for peta-scale data infrastructure.
▪ S3-Compatible
▪ Scalable
▪ Performant
▪ Simple
▪ Optimized for Intel/ARM/Power9 CPUs
Global Scale
Focus on Performance
S3 Select Performance on AWS
Format     Time (s)   Records    Throughput
csv        5.46       733K/s     94 MB/s
json       14.28      280K/s     98 MB/s
parquet    32.25      124K/s     4.3 MB/s
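A quick consistency check on the AWS figures: multiplying each time by its record rate gives roughly the same record count for every format (about 4 million), which suggests the three runs covered the same logical dataset. The sketch below only redoes that arithmetic from the numbers on the slide; the interpretation is an inference, not something the slides state.

```python
# (format, time in seconds, records per second, MB per second), copied from the slide.
rows = [
    ("csv", 5.46, 733e3, 94.0),
    ("json", 14.28, 280e3, 98.0),
    ("parquet", 32.25, 124e3, 4.3),
]


def implied_totals(time_s, records_per_s, mb_per_s):
    """Return (total records, total MB) implied by a duration and two rates."""
    return time_s * records_per_s, time_s * mb_per_s


for fmt, time_s, rps, mbps in rows:
    records, megabytes = implied_totals(time_s, rps, mbps)
    print(f"{fmt:8s} {records / 1e6:5.2f}M records  {megabytes:7.1f} MB")
```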
Accelerating S3 Select on minio
Evaluation (“where”)
Processing (“select”)
CSV: Parsing / JSON: Parsing / Parquet: Loading
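The three stages named on this slide (parsing, evaluating the "where" clause, processing the "select" list) can be pictured as a small pipeline. The following is a minimal illustration using Python's standard `csv` module, with hypothetical data and column names; it is meant to show the stages only and says nothing about how minio implements them.

```python
import csv
import io


def parse(text):
    """Parsing stage: turn CSV text into dict rows keyed by the header."""
    return csv.DictReader(io.StringIO(text))


def where(rows, predicate):
    """Evaluation stage: keep only rows for which the predicate holds."""
    return (row for row in rows if predicate(row))


def select(rows, columns):
    """Processing stage: project each surviving row onto the requested columns."""
    return ([row[c] for c in columns] for row in rows)


# Hypothetical data and query, for illustration only.
text = "id,state,fine\n1,CA,50\n2,NY,65\n3,CA,73\n"
result = list(select(where(parse(text), lambda r: r["state"] == "CA"), ["id", "fine"]))
```

Because every row must be parsed before it can be evaluated, parsing cost tends to dominate a simple scan, which is why the accelerations that follow target it.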
First 10X Acceleration: Zero Copy
Manage memory allocations: garbage collected vs. non-garbage collected
Source: https://bitbucket.org/ewanhiggs/csv-game
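One reading of "zero copy" is that field values are handed out as views into the original buffer instead of being copied into freshly allocated strings. The sketch below illustrates that idea with Python's `memoryview`; it is an assumption about the general technique, not MinIO's code, and it ignores quoted fields and multi-byte delimiters.

```python
def split_fields(buf, delimiter=b",", newline=b"\n"):
    """Yield each record as a list of memoryview slices into buf (no byte copies)."""
    view = memoryview(buf)
    delim, nl = delimiter[0], newline[0]
    fields, start = [], 0
    for i, byte in enumerate(buf):
        if byte == delim or byte == nl:
            fields.append(view[start:i])  # a view, not a copy
            start = i + 1
            if byte == nl:
                yield fields
                fields = []
    # Emit a final record that is not terminated by a newline.
    if start < len(buf) or fields:
        fields.append(view[start:])
        yield fields


# Hypothetical data, for illustration only.
data = b"id,state\n1,CA\n2,NY\n3,CA\n"
matches = sum(1 for rec in split_fields(data) if rec[1] == b"CA")
```

The comparison `rec[1] == b"CA"` works directly on the view, so a row that fails the filter never has its fields materialized as separate objects holding copied bytes.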
Second 10X Acceleration: SIMD
▪ SIMD = Single Instruction Multiple Data
○ Intel: AVX2
▪ Process 32 bytes in parallel
○ delimiter / separator detection
○ bitmap handling & parsing
○ string compares
▪ Performance (single core)
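To make "delimiter detection" and "bitmap handling" concrete, here is a scalar emulation in Python of the usual pattern: compare 32 bytes against the delimiter, collapse the result into a 32-bit mask, then walk the set bits. On AVX2 the compare-and-collapse step is typically a `_mm256_cmpeq_epi8` followed by `_mm256_movemask_epi8`; the inner loop below stands in for that pair. This is a sketch of the general technique, not the select-simd implementation, and the function names are made up for illustration.

```python
CHUNK = 32  # bytes covered by one 256-bit AVX2 register


def delimiter_mask(chunk, delimiter):
    """Return an int whose bit k is set when chunk[k] == delimiter."""
    mask = 0
    for k, byte in enumerate(chunk):
        if byte == delimiter:
            mask |= 1 << k
    return mask


def delimiter_positions(buf, delimiter=ord(",")):
    """Return the offsets of every delimiter byte in buf, found chunk by chunk."""
    positions = []
    for base in range(0, len(buf), CHUNK):
        mask = delimiter_mask(buf[base:base + CHUNK], delimiter)
        while mask:
            lowest = mask & -mask  # isolate the lowest set bit
            positions.append(base + lowest.bit_length() - 1)
            mask ^= lowest  # clear it and move to the next set bit
    return positions
```

In real SIMD code the mask for a chunk costs a couple of instructions regardless of content, and the bit-walking loop runs once per delimiter rather than once per byte.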
Results using select-simd
▪ Same queries as before
○ minio with select-simd vs AWS S3
Demo
■ Source data
○ parking-citations.csv (25M rows / 3.5 GB)
■ AWS region
○ us-east-1
■ minio with select-simd-integration branch running on a single instance: c5.2xlarge (8 vCPUs)
■ mc client running in same region on c5.large instance
Status and what’s next
▪ Works in progress
○ Initial focus on CSV
▪ Next: add support for
○ Parquet
○ JSON: https://github.com/lemire/simdjson
▪ Investigate AVX-512
○ erasure coding
▫ AVX-512 4x speedup over AVX2
○ k-registers are great / 2KB on-core register space
▪ Dynamic code generation (think LLVM)
High performance object storage
Power9 CPUs
PCIe Gen4
24x NVMe
Dual Mellanox CX5 (4x 100 GbE)
S3 Select benefits for Spark
▪ Benefits
○ Faster queries
○ Less network traffic
○ Smaller compute needs
▪ Stay tuned for overall impact
○ S3 “plain” vs S3 Select
○ minio/simd-select vs AWS S3 Select
Questions?
Visit our booth #509
@minio
https://github.com/minio/minio
https://slack.minio.io
https://minio.io
