Bigdam is a planet-scale data ingestion pipeline for Treasure Data. It addresses issues with the traditional pipeline: PerfectQueue throughput limitations, latency until records collected via event collectors become queryable, difficulty maintaining event-collector code, and many small temporary and imported files. The redesigned pipeline consists of Bigdam-Gateway for HTTP endpoints, Bigdam-Pool for distributed buffer storage, Bigdam-Scheduler for scheduling import tasks, Bigdam-Queue as a high-throughput queue, and Bigdam-Import for data conversion and import. Consistency is ensured through an at-least-once design, with deduplication performed at the end of the pipeline for simplicity and reliability. All components are designed to scale out horizontally.
Planet-scale Data Ingestion Pipeline: Bigdam
1. Planet-scale Data Ingestion Pipeline
Bigdam
PLAZMA TD Internal Day 2018/02/19
#tdtech
Satoshi Tagomori (@tagomoris)
6. Data Ingestion in Treasure Data
• Accept requests from clients
• td-agent
• TD SDKs (incl. HTTP requests w/ JSON)
• Format data into MPC1
• Store MPC1 files into Plazmadb
[Diagram: clients send json / msgpack.gz to the Data Ingestion Pipeline, which writes MPC1 files into Plazmadb, queried by Presto and Hive]
7. Traditional Pipeline
• Streaming Import API for td-agent
• API Server (RoR), Temporary Storage (S3)
• Import task queue (perfectqueue), workers (Java)
• 1 msgpack.gz file in request → 1 MPC1 file on Plazmadb
[Diagram: td-agent → api-import (RoR) → msgpack.gz on S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
8. Traditional Pipeline: Event Collector
• APIs for TD SDKs
• Event Collector nodes (hosted Fluentd)
• Built on top of the Streaming Import API
• 1 MPC1 file on Plazmadb per 3min. per Fluentd process
[Diagram: TD SDKs send json to event-collector (Fluentd), which forwards msgpack.gz to api-import (RoR) → S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
9. Growing Traffic on the Traditional Pipeline
• Throughput of perfectqueue
• Latency until queries via Event-Collector
• Maintaining Event-Collector code
• Many small temporary files on S3
• Many small imported files on Plazmadb on S3
[Diagram: td-agent and TD SDKs (via event-collector) → api-import (RoR) → S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
10. Perfectqueue Throughput Issue
• Perfectqueue
• "PerfectQueue is a highly available distributed queue built on top of
RDBMS."
• Fair scheduling
• https://github.com/treasure-data/perfectqueue
• Perfectqueue is NOT "perfect"...
• Needs a wide lock on the table: poor concurrency
11. Latency until Queries via Event-Collector
• Event-collector buffers data in its storage
• 3min. + α
• Customers have to wait 3+ min. until a record becomes visible on Plazmadb
• Halving the buffering time doubles the number of MPC1 files
12. Maintaining Event-Collector Code
• Mitsu says: "No problem about maintaining event-collector code"
• :P
• Event-collector processes HTTP requests in Ruby code
• Hard to test it
13. Many Small Temporary Files on S3
• api-import uploads all requested msgpack.gz files to S3
• An S3 outage is a critical issue
• AWS S3 outage in us-east-1 on Feb 28th, 2017
• Many uploaded files drive up costs
• costs per object
• costs per operation
14. Many Small Imported Files on Plazmadb on S3
• 1 MPC1 file on Plazmadb from 1 msgpack.gz file
• on Plazmadb realtime storage
• https://www.slideshare.net/treasure-data/td-techplazma
• Many MPC1 files:
• S3 request cost to store
• S3 request cost to fetch (from Presto, Hive)
• Performance regression when fetching many small files in queries
(256MB expected vs. 32MB actual)
16. Make "Latency" Shorter (1)
• Clients to our endpoints
• JS SDK on customers' page sends data to our endpoints
from mobile devices
• Longer latency increases % of dropped records
• Many endpoints on the Earth: US, Asia + others
• Plazmadb in us-east-1 as "central location"
• Many geographically separated "edge locations"
17. Make "Latency" Shorter (2)
• Shorter waiting time to query records
• Flexible import task scheduling - better if configurable
• Decouple buffers from endpoint server processes
• More frequent import with aggregated buffers
[Diagram BEFORE: each endpoint process holds its own buffer, and each buffer becomes its own MPC1 file. AFTER: endpoints write to shared buffers decoupled from the endpoint processes, which are merged into fewer MPC1 files]
18. Redesigning Queues
• Fair scheduling is not required for import tasks
• Import tasks are FIFO (First In, First Out)
• Small payload - (apikey, account_id, database, table)
• More throughput
• Using Queue service + RDBMS
• Queue service for enqueuing/dequeuing
• RDBMS to provide at-least-once
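As a rough illustration of how small the task payload is, here is a minimal Java sketch of an import task; the field names are assumptions, not the actual Bigdam schema:

```java
// Import task payload: tiny and FIFO, so no fair scheduling is needed
// and a standard queue service plus an RDBMS is enough.
// Field names are illustrative only.
public record ImportTask(String apikey, long accountId, String database, String table) {}
```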
19. S3-free Temporary Storage
• Make the pipeline free from S3 outage
• Distributed storage cluster as buffer for uploaded data (w/ replication)
• Buffer transferring between edge and central locations
[Diagram: clients → endpoints in an edge location → Storage Cluster (replicated buffers) → buffers transferred to the Storage Cluster in the central location → MPC1 files on Plazmadb]
20. Merging Temporary Buffers into a File on Plazmadb
• Non-1-by-1 conversion from msgpack.gz to MPC1
• Buffers can be gathered using secondary index
• primary index: buffer_id
• secondary index: account_id, database, table, apikey
[Diagram: the same BEFORE/AFTER buffer-merging figure as slide 17]
21. Should It Provide Read-After-Write Consistency?
• BigQuery provides Read-After-Write consistency
• Pros: an inserted record can be queried immediately
• Cons:
• Much longer latency (especially from non-US regions)
• Much more expensive to host API servers for longer HTTP sessions
• Much more expensive to host Query nodes for smaller files on Plazmadb
• Many more troubles
• Say "No!" to it
Appendix
26. Bigdam-Gateway (mruby on h2o)
• HTTP Endpoint servers
• Rack-like API for mruby handlers
• Easy to write, easy to test (!)
• Async HTTP requests from mruby, managed by h2o using Fiber
• HTTP/2 capability in the future
• Handles all requests from td-agent and TD SDKs
• decode/authorize requests
• send data to storage nodes in parallel (to replicate)
28. Bigdam-Pool (Java)
• Distributed Storage for buffering
• Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent)
• Append data into a buffer
• Query buffers using secondary index
• Transfer buffers from edge to central
[Diagram: chunks are appended into buffers at the edge location; a buffer is committed by size or timeout, then transferred to the central location over the Internet using HTTPS or HTTP/2; import workers query buffers by account_id, database, table]
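A minimal sketch of the buffer-storage API implied by the bullets above, written as a Java interface; the method names and signatures are assumptions, not the actual Bigdam-Pool API:

```java
import java.util.List;

// Hypothetical client-side view of bigdam-pool, sketching the three
// operations described above: append, query by secondary index, transfer.
public interface BufferPoolClient {
    // Append a chunk (1KB json ~ 32MB msgpack.gz) into the buffer
    // identified by its primary index, the buffer_id.
    void append(String bufferId, byte[] chunk);

    // Query committed buffers using the secondary index
    // (account_id, database, table), e.g. from import workers.
    List<String> queryBuffers(long accountId, String database, String table);

    // Transfer a committed buffer from an edge location to the
    // central location over HTTPS or HTTP/2.
    void transferToCentral(String bufferId);
}
```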
30. Bigdam-Scheduler (Golang)
• Scheduler server
• Bigdam-pool nodes request bigdam-scheduler to schedule import tasks
(many times per second)
• Bigdam-scheduler enqueues import tasks into bigdam-queue
(once per configured interval: default 1 min.)
[Diagram: many bigdam-pool nodes notify bigdam-scheduler for every committed buffer; bigdam-scheduler enqueues into bigdam-queue once a minute per account/db/table]
31. account_id, database, table, apikey
1. bigdam-pool nodes request to schedule an import task for every committed buffer
2. The requested task is added to the scheduler entries, if missing (e.g. account1, db1, table1, apikeyA)
3. Each entry is scheduled to be enqueued after a timeout from entry creation
4. An import task is enqueued into bigdam-queue
5. The entry is removed from the scheduler if enqueuing succeeded
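The real scheduler is written in Go, but the debouncing logic of steps 1-5 is compact enough to sketch. The following Java sketch uses hypothetical names and a fixed one-minute delay:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the scheduler's debouncing behaviour: many notifications per
// second from bigdam-pool collapse into one import task per
// (account, database, table, apikey) per interval.
public class SchedulerSketch {
    public record Entry(long accountId, String database, String table, String apikey) {}

    public interface QueueClient { boolean enqueue(Entry task); }

    private final Map<Entry, Instant> entries = new ConcurrentHashMap<>();
    private final Duration delay = Duration.ofMinutes(1); // configurable in the real system

    // Steps 1-2: called by bigdam-pool for every committed buffer;
    // the entry is added only if missing.
    public void schedule(Entry entry) {
        entries.putIfAbsent(entry, Instant.now());
    }

    // Steps 3-5: called periodically; enqueue entries whose timeout has
    // passed since creation, and remove them only if enqueuing succeeded.
    public void flush(QueueClient queue) {
        Instant now = Instant.now();
        for (Map.Entry<Entry, Instant> e : entries.entrySet()) {
            if (Duration.between(e.getValue(), now).compareTo(delay) >= 0) {
                if (queue.enqueue(e.getKey())) {
                    entries.remove(e.getKey(), e.getValue());
                }
            }
        }
    }
}
```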
33. Bigdam-Queue (Java)
• High throughput queue for import tasks
• Enqueue/dequeue using AWS SQS (standard queue)
• Task state management using AWS Aurora
• Roughly ordered, At-least-once
[Diagram — enqueue tasks: bigdam-scheduler sends tasks to the bigdam-queue server (Java), which 1. INSERTs the task as "enqueued" into AWS Aurora and 2. enqueues it into AWS SQS (standard).
Request to dequeue a task: bigdam-import asks the bigdam-queue server, which 1. dequeues from SQS and 2. UPDATEs the task state to "running" in Aurora.
Finish: the bigdam-queue server 1. DELETEs the finished task from Aurora]
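A minimal Java sketch of the enqueue/dequeue/finish flow above, using the AWS SDK v2 for SQS and JDBC for Aurora. The table and column names, the payload format, and the exact moment the SQS message is deleted are assumptions; the recovery path that re-enqueues stale tasks is omitted:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class QueueServerSketch {
    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl;
    private final Connection aurora;

    public QueueServerSketch(String queueUrl, String jdbcUrl) throws Exception {
        this.queueUrl = queueUrl;
        this.aurora = DriverManager.getConnection(jdbcUrl);
    }

    // Enqueue: 1. INSERT the task as "enqueued" into Aurora, 2. enqueue into SQS.
    public void enqueue(String taskId, String payloadJson) throws Exception {
        try (PreparedStatement st = aurora.prepareStatement(
                "INSERT INTO tasks (task_id, payload, state) VALUES (?, ?, 'enqueued')")) {
            st.setString(1, taskId);
            st.setString(2, payloadJson);
            st.executeUpdate();
        }
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl).messageBody(taskId).build());
    }

    // Dequeue: 1. receive from SQS, 2. UPDATE the task state to "running".
    public String dequeue() throws Exception {
        for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                .queueUrl(queueUrl).maxNumberOfMessages(1).build()).messages()) {
            String taskId = m.body();
            try (PreparedStatement st = aurora.prepareStatement(
                    "UPDATE tasks SET state = 'running' WHERE task_id = ?")) {
                st.setString(1, taskId);
                st.executeUpdate();
            }
            sqs.deleteMessage(DeleteMessageRequest.builder()
                    .queueUrl(queueUrl).receiptHandle(m.receiptHandle()).build());
            return taskId;
        }
        return null; // nothing to do right now
    }

    // Finish: DELETE the task row. Tasks still present in Aurora but no longer
    // in flight can be re-enqueued later, which gives at-least-once delivery.
    public void finish(String taskId) throws Exception {
        try (PreparedStatement st = aurora.prepareStatement(
                "DELETE FROM tasks WHERE task_id = ?")) {
            st.setString(1, taskId);
            st.executeUpdate();
        }
    }
}
```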
35. Bigdam-Import (Java)
• Import worker
• Convert source (json/msgpack.gz) to MPC1
• Execute import tasks in parallel
• Dequeue tasks from bigdam-queue
• Query and download buffers from bigdam-pool
• Make a list of chunk ids and put it into bigdam-dddb
• Execute deduplication to determine chunks to be imported
• Make MPC1 files and put them into Plazmadb
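A minimal sketch of one iteration of this worker loop in Java; all client interfaces here are hypothetical stand-ins for the real bigdam-queue, bigdam-pool, bigdam-dddb, and Plazmadb clients:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of one iteration of the import worker, wiring together the
// components described above.
public class ImportWorkerSketch {
    interface Chunk { String id(); byte[] payload(); }
    interface ImportTask { String id(); long accountId(); String database(); String table(); }
    interface QueueClient { ImportTask dequeue(); void finish(String taskId); }
    interface PoolClient { List<Chunk> downloadBuffers(long accountId, String database, String table); }
    interface DddbClient {
        void putChunkIds(String taskId, List<String> chunkIds);
        Set<String> chunkIdsImportedInPast(long accountId, String database, String table);
    }
    interface PlazmaClient { void putMpc1(String database, String table, byte[] mpc1); }
    interface Mpc1Converter { byte[] convert(List<Chunk> chunks); }

    private final QueueClient queue;
    private final PoolClient pool;
    private final DddbClient dddb;
    private final PlazmaClient plazma;
    private final Mpc1Converter converter;

    ImportWorkerSketch(QueueClient q, PoolClient p, DddbClient d, PlazmaClient pl, Mpc1Converter c) {
        this.queue = q; this.pool = p; this.dddb = d; this.plazma = pl; this.converter = c;
    }

    void runOnce() {
        ImportTask task = queue.dequeue();
        if (task == null) return; // nothing to import right now

        // Query and download buffers for this account/database/table.
        List<Chunk> chunks = pool.downloadBuffers(task.accountId(), task.database(), task.table());

        // Record this task's chunk-id list, then deduplicate against past tasks.
        List<String> chunkIds = chunks.stream().map(Chunk::id).collect(Collectors.toList());
        dddb.putChunkIds(task.id(), chunkIds);
        Set<String> imported = dddb.chunkIdsImportedInPast(task.accountId(), task.database(), task.table());
        List<Chunk> toImport = chunks.stream()
                .filter(c -> !imported.contains(c.id()))
                .collect(Collectors.toList());

        // Convert json/msgpack.gz chunks into one MPC1 file and put it into Plazmadb.
        plazma.putMpc1(task.database(), task.table(), converter.convert(toImport));

        queue.finish(task.id());
    }
}
```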
37. Bigdam-Dddb (Java)
• Database service for deduplication
• Based on AWS Aurora and S3
• Stores unique chunk ids per import task so that the same chunk is not imported twice
[Diagram — for a small list of chunk ids: bigdam-import 1. stores the chunk-id list with the bigdam-dddb server (Java), which 2. INSERTs (task-id, list-of-chunk-ids) into AWS Aurora.
For a huge list of chunk ids: bigdam-import 1. uploads the encoded chunk ids to AWS S3, then 2. stores the task-id and S3 object path with the bigdam-dddb server, which 3. INSERTs (task-id, path-of-ids) into Aurora.
To fetch chunk-id lists imported in the past: bigdam-import 1. queries the lists of past tasks; the bigdam-dddb server 2. SELECTs from Aurora and 3. downloads from S3 if needed]
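A minimal Java sketch of the small-vs-huge decision above, using JDBC for Aurora and the AWS SDK v2 for S3; the table schema, bucket name, and size threshold are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Store the chunk-id list for an import task: small lists go straight
// into Aurora, huge lists are uploaded to S3 and only the object path
// is stored in Aurora.
public class DddbStoreSketch {
    private static final int INLINE_LIMIT = 64 * 1024; // assumed threshold (bytes)

    private final Connection aurora;
    private final S3Client s3 = S3Client.create();
    private final String bucket = "bigdam-dddb-chunk-ids"; // hypothetical bucket

    public DddbStoreSketch(Connection aurora) { this.aurora = aurora; }

    public void storeChunkIds(String taskId, byte[] encodedChunkIds) throws Exception {
        if (encodedChunkIds.length <= INLINE_LIMIT) {
            // Small list: INSERT (task-id, list-of-chunk-ids) into Aurora.
            try (PreparedStatement st = aurora.prepareStatement(
                    "INSERT INTO chunk_ids (task_id, ids) VALUES (?, ?)")) {
                st.setString(1, taskId);
                st.setBytes(2, encodedChunkIds);
                st.executeUpdate();
            }
        } else {
            // Huge list: upload the encoded chunk ids to S3, then INSERT
            // (task-id, path-of-ids) into Aurora.
            String key = "chunk-ids/" + taskId;
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromBytes(encodedChunkIds));
            try (PreparedStatement st = aurora.prepareStatement(
                    "INSERT INTO chunk_id_paths (task_id, s3_path) VALUES (?, ?)")) {
                st.setString(1, taskId);
                st.setString(2, "s3://" + bucket + "/" + key);
                st.executeUpdate();
            }
        }
    }
}
```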
39. Executing Deduplication at the End of the Pipeline
• Make it simple & reliable
[Diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with scheduler, queue and dddb alongside; at-least-once everywhere, deduplication (transaction + retries) at the import worker]
40. At-Least-Once: Bigdam-pool Data Replication
• Client-side replication (for large chunks, 1MB~): the client uploads 3 replicas to 3 nodes in parallel
• Server-side replication (for small chunks, ~1MB): the primary node appends chunks to the existing buffer and replicates them, so contents and checksums stay equal across nodes
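A minimal sketch of how a pool client might choose between the two strategies at the 1MB boundary; the interface and method names are hypothetical:

```java
import java.util.List;

// Pick the replication strategy per chunk size, as described above:
// large chunks are uploaded by the client to 3 nodes in parallel,
// small chunks go to a primary node which appends and replicates
// server-side so buffer contents and checksums stay identical.
public class ReplicationSketch {
    private static final int LARGE_CHUNK_BYTES = 1 << 20; // 1MB boundary

    interface PoolNode { void append(String bufferId, byte[] chunk); }

    public void upload(String bufferId, byte[] chunk, List<PoolNode> replicaNodes) {
        if (chunk.length >= LARGE_CHUNK_BYTES) {
            // Client-side replication: send the chunk to all replica nodes in parallel.
            replicaNodes.parallelStream().forEach(n -> n.append(bufferId, chunk));
        } else {
            // Server-side replication: send only to the primary; it appends to
            // the existing buffer and replicates to the other nodes itself.
            replicaNodes.get(0).append(bufferId, chunk);
        }
    }
}
```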
43. Scaling-up Just For A Case: Scheduler
• The scheduler needs to collect notifications for all buffers
• and cannot easily be parallelized across nodes
• Solution: a high-performance singleton server: 90k+ reqs/sec
[Diagram: the same pipeline as slide 39, with the scheduler shown as a singleton server]