Bigdam is a planet-scale data ingestion pipeline for Treasure Data. It addresses issues with the traditional pipeline: PerfectQueue throughput limitations, latency until records collected via event collectors become queryable, difficulty maintaining event-collector code, and many small temporary and imported files. The redesigned pipeline consists of Bigdam-Gateway for HTTP endpoints, Bigdam-Pool for distributed buffer storage, Bigdam-Scheduler for scheduling import tasks, Bigdam-Queue as a high-throughput queue, and Bigdam-Import for data conversion and import. Consistency is ensured through an at-least-once design, with deduplication performed at the end of the pipeline for simplicity and reliability. All components are designed to scale out horizontally.
Planet-scale Data Ingestion Pipeline: Bigdam
1. Planet-scale Data Ingestion Pipeline
Bigdam
PLAZMA TD Internal Day 2018/02/19
#tdtech
Satoshi Tagomori (@tagomoris)
6. Data Ingestion in Treasure Data
• Accept requests from clients
• td-agent
• TD SDKs (incl. HTTP requests w/ JSON)
• Format data into MPC1
• Store MPC1 files into Plazmadb
[Diagram: clients send json / msgpack.gz to the Data Ingestion Pipeline, which writes MPC1 files into Plazmadb, queried by Presto and Hive]
7. Traditional Pipeline
• Streaming Import API for td-agent
• API Server (RoR), Temporary Storage (S3)
• Import task queue (perfectqueue), workers (Java)
• 1 msgpack.gz file in request → 1 MPC1 file on Plazmadb
[Diagram: td-agent → api-import (RoR) → msgpack.gz on S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
8. Traditional Pipeline: Event Collector
• APIs for TD SDKs
• Event Collector nodes (hosted Fluentd)
• Built on top of the Streaming Import API
• 1 MPC1 file on Plazmadb per 3min. per Fluentd process
[Diagram: TD SDKs send json to event-collector (Fluentd), which forwards msgpack.gz to api-import (RoR) → S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
9. Growing Traffic on the Traditional Pipeline
• Throughput of perfectqueue
• Latency until queries via Event-Collector
• Maintaining Event-Collector code
• Many small temporary files on S3
• Many small imported files on Plazmadb on S3
[Diagram: td-agent and TD SDKs (via event-collector) → api-import (RoR) → S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
10. Perfectqueue Throughput Issue
• Perfectqueue
• "PerfectQueue is a highly available distributed queue built on top of
RDBMS."
• Fair scheduling
• https://github.com/treasure-data/perfectqueue
• Perfectqueue is NOT "perfect"...
• Needs a wide lock on the table: poor concurrency
11. Latency until Queries via Event-Collector
• Event-collector buffers data in its storage
• 3min. + α
• Customers have to wait 3+ min. until a record becomes visible on Plazmadb
• Halving the buffering time doubles the number of MPC1 files
12. Maintaining Event-Collector Code
• Mitsu says: "No problem about maintaining event-collector code"
• :P
• Event-collector processes HTTP requests in Ruby code
• Hard to test it
13. Many Small Temporary Files on S3
• api-import uploads all requested msgpack.gz files to S3
• An S3 outage is a critical issue
• AWS S3 outage in us-east-1 on Feb 28th, 2017
• Many uploaded files drive up costs
• costs per object
• costs per operation
14. Many Small Imported Files on Plazmadb on S3
• 1 MPC1 file on Plazmadb from 1 msgpack.gz file
• on Plazmadb realtime storage
• https://www.slideshare.net/treasure-data/td-techplazma
• Many MPC1 files:
• S3 request cost to store
• S3 request cost to fetch (from Presto, Hive)
• Performance regression when fetching many small files in queries
(256MB expected vs. 32MB actual)
16. Make "Latency" Shorter (1)
• Clients to our endpoints
• JS SDK on customers' page sends data to our endpoints
from mobile devices
• Longer latency increases % of dropped records
• Many endpoints on the Earth: US, Asia + others
• Plazmadb in us-east-1 as "central location"
• Many geographically separated "edge locations"
17. Make "Latency" Shorter (2)
• Shorter waiting time to query records
• Flexible import task scheduling - better if configurable
• Decouple buffers from endpoint server processes
• More frequent import with aggregated buffers
[Diagram BEFORE: each endpoint process holds its own buffer, and each buffer becomes its own MPC1 file. AFTER: endpoints write to shared buffers decoupled from the endpoint processes, which are merged into fewer MPC1 files]
18. Redesigning Queues
• Fair scheduling is not required for import tasks
• Import tasks are FIFO (First In, First Out)
• Small payload - (apikey, account_id, database, table)
• More throughput
• Using Queue service + RDBMS
• Queue service for enqueuing/dequeuing
• RDBMS to provide at-least-once
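As a rough illustration of how small the task payload is, here is a minimal Java sketch of an import task; the field names are assumptions, not the actual Bigdam schema:

```java
// Import task payload: tiny and FIFO, so no fair scheduling is needed
// and a standard queue service plus an RDBMS is enough.
// Field names are illustrative only.
public record ImportTask(String apikey, long accountId, String database, String table) {}
```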
19. S3-free Temporary Storage
• Make the pipeline free from S3 outage
• Distributed storage cluster as buffer for uploaded data (w/ replication)
• Buffer transferring between edge and central locations
[Diagram: clients → endpoints in an edge location → Storage Cluster (replicated buffers) → buffers transferred to the Storage Cluster in the central location → MPC1 files on Plazmadb]
20. Merging Temporary Buffers into a File on Plazmadb
• Non-1-by-1 conversion from msgpack.gz to MPC1
• Buffers can be gathered using secondary index
• primary index: buffer_id
• secondary index: account_id, database, table, apikey
[Diagram: the same BEFORE/AFTER buffer-merging figure as slide 17]
21. Should It Provide Read-After-Write Consistency?
• BigQuery provides Read-After-Write consistency
• Pros: an inserted record can be queried immediately
• Cons:
• Much longer latency (especially from non-US regions)
• Much more expensive to host API servers for longer HTTP sessions
• Much more expensive to host Query nodes for smaller files on Plazmadb
• Many more troubles
• Say "No!" to it
Appendix
26. Bigdam-Gateway (mruby on h2o)
• HTTP Endpoint servers
• Rack-like API for mruby handlers
• Easy to write, easy to test (!)
• Async HTTP requests from mruby, managed by h2o using Fiber
• HTTP/2 capability in the future
• Handles all requests from td-agent and TD SDKs
• decode/authorize requests
• send data to storage nodes in parallel (to replicate)
28. Bigdam-Pool (Java)
• Distributed Storage for buffering
• Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent)
• Append data into a buffer
• Query buffers using secondary index
• Transfer buffers from edge to central
[Diagram: chunks are appended into buffers at the edge location; a buffer is committed by size or timeout, then transferred to the central location over the Internet using HTTPS or HTTP/2; import workers query buffers by account_id, database, table]
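A minimal sketch of the buffer-storage API implied by the bullets above, written as a Java interface; the method names and signatures are assumptions, not the actual Bigdam-Pool API:

```java
import java.util.List;

// Hypothetical client-side view of bigdam-pool, sketching the three
// operations described above: append, query by secondary index, transfer.
public interface BufferPoolClient {
    // Append a chunk (1KB json ~ 32MB msgpack.gz) into the buffer
    // identified by its primary index, the buffer_id.
    void append(String bufferId, byte[] chunk);

    // Query committed buffers using the secondary index
    // (account_id, database, table), e.g. from import workers.
    List<String> queryBuffers(long accountId, String database, String table);

    // Transfer a committed buffer from an edge location to the
    // central location over HTTPS or HTTP/2.
    void transferToCentral(String bufferId);
}
```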
30. Bigdam-Scheduler (Golang)
• Scheduler server
• Bigdam-pool nodes request bigdam-scheduler to schedule import tasks
(many times per second)
• Bigdam-scheduler enqueues import tasks into bigdam-queue
(once per configured interval: default 1 min.)
[Diagram: many bigdam-pool nodes notify bigdam-scheduler for every committed buffer; bigdam-scheduler enqueues into bigdam-queue once a minute per account/db/table]
31. account_id, database, table, apikey
1. bigdam-pool nodes request to schedule an import task for every committed buffer
2. The requested task is added to the scheduler entries, if missing (e.g. account1, db1, table1, apikeyA)
3. Each entry is scheduled to be enqueued after a timeout from entry creation
4. An import task is enqueued into bigdam-queue
5. The entry is removed from the scheduler if enqueuing succeeded
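The real scheduler is written in Go, but the debouncing logic of steps 1-5 is compact enough to sketch. The following Java sketch uses hypothetical names and a fixed one-minute delay:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the scheduler's debouncing behaviour: many notifications per
// second from bigdam-pool collapse into one import task per
// (account, database, table, apikey) per interval.
public class SchedulerSketch {
    public record Entry(long accountId, String database, String table, String apikey) {}

    public interface QueueClient { boolean enqueue(Entry task); }

    private final Map<Entry, Instant> entries = new ConcurrentHashMap<>();
    private final Duration delay = Duration.ofMinutes(1); // configurable in the real system

    // Steps 1-2: called by bigdam-pool for every committed buffer;
    // the entry is added only if missing.
    public void schedule(Entry entry) {
        entries.putIfAbsent(entry, Instant.now());
    }

    // Steps 3-5: called periodically; enqueue entries whose timeout has
    // passed since creation, and remove them only if enqueuing succeeded.
    public void flush(QueueClient queue) {
        Instant now = Instant.now();
        for (Map.Entry<Entry, Instant> e : entries.entrySet()) {
            if (Duration.between(e.getValue(), now).compareTo(delay) >= 0) {
                if (queue.enqueue(e.getKey())) {
                    entries.remove(e.getKey(), e.getValue());
                }
            }
        }
    }
}
```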
33. Bigdam-Queue (Java)
• High throughput queue for import tasks
• Enqueue/dequeue using AWS SQS (standard queue)
• Task state management using AWS Aurora
• Roughly ordered, At-least-once
[Diagram — enqueue tasks: bigdam-scheduler sends tasks to the bigdam-queue server (Java), which 1. INSERTs the task as "enqueued" into AWS Aurora and 2. enqueues it into AWS SQS (standard).
Request to dequeue a task: bigdam-import asks the bigdam-queue server, which 1. dequeues from SQS and 2. UPDATEs the task state to "running" in Aurora.
Finish: the bigdam-queue server 1. DELETEs the finished task from Aurora]
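A minimal Java sketch of the enqueue/dequeue/finish flow above, using the AWS SDK v2 for SQS and JDBC for Aurora. The table and column names, the payload format, and the exact moment the SQS message is deleted are assumptions; the recovery path that re-enqueues stale tasks is omitted:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class QueueServerSketch {
    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl;
    private final Connection aurora;

    public QueueServerSketch(String queueUrl, String jdbcUrl) throws Exception {
        this.queueUrl = queueUrl;
        this.aurora = DriverManager.getConnection(jdbcUrl);
    }

    // Enqueue: 1. INSERT the task as "enqueued" into Aurora, 2. enqueue into SQS.
    public void enqueue(String taskId, String payloadJson) throws Exception {
        try (PreparedStatement st = aurora.prepareStatement(
                "INSERT INTO tasks (task_id, payload, state) VALUES (?, ?, 'enqueued')")) {
            st.setString(1, taskId);
            st.setString(2, payloadJson);
            st.executeUpdate();
        }
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl).messageBody(taskId).build());
    }

    // Dequeue: 1. receive from SQS, 2. UPDATE the task state to "running".
    public String dequeue() throws Exception {
        for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                .queueUrl(queueUrl).maxNumberOfMessages(1).build()).messages()) {
            String taskId = m.body();
            try (PreparedStatement st = aurora.prepareStatement(
                    "UPDATE tasks SET state = 'running' WHERE task_id = ?")) {
                st.setString(1, taskId);
                st.executeUpdate();
            }
            sqs.deleteMessage(DeleteMessageRequest.builder()
                    .queueUrl(queueUrl).receiptHandle(m.receiptHandle()).build());
            return taskId;
        }
        return null; // nothing to do right now
    }

    // Finish: DELETE the task row. Tasks still present in Aurora but no longer
    // in flight can be re-enqueued later, which gives at-least-once delivery.
    public void finish(String taskId) throws Exception {
        try (PreparedStatement st = aurora.prepareStatement(
                "DELETE FROM tasks WHERE task_id = ?")) {
            st.setString(1, taskId);
            st.executeUpdate();
        }
    }
}
```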
35. Bigdam-Import (Java)
• Import worker
• Convert source (json/msgpack.gz) to MPC1
• Execute import tasks in parallel
• Dequeue tasks from bigdam-queue
• Query and download buffers from bigdam-pool
• Make a list of chunk ids and put it into bigdam-dddb
• Execute deduplication to determine chunks to be imported
• Make MPC1 files and put them into Plazmadb
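A minimal sketch of one iteration of this worker loop in Java; all client interfaces here are hypothetical stand-ins for the real bigdam-queue, bigdam-pool, bigdam-dddb, and Plazmadb clients:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of one iteration of the import worker, wiring together the
// components described above.
public class ImportWorkerSketch {
    interface Chunk { String id(); byte[] payload(); }
    interface ImportTask { String id(); long accountId(); String database(); String table(); }
    interface QueueClient { ImportTask dequeue(); void finish(String taskId); }
    interface PoolClient { List<Chunk> downloadBuffers(long accountId, String database, String table); }
    interface DddbClient {
        void putChunkIds(String taskId, List<String> chunkIds);
        Set<String> chunkIdsImportedInPast(long accountId, String database, String table);
    }
    interface PlazmaClient { void putMpc1(String database, String table, byte[] mpc1); }
    interface Mpc1Converter { byte[] convert(List<Chunk> chunks); }

    private final QueueClient queue;
    private final PoolClient pool;
    private final DddbClient dddb;
    private final PlazmaClient plazma;
    private final Mpc1Converter converter;

    ImportWorkerSketch(QueueClient q, PoolClient p, DddbClient d, PlazmaClient pl, Mpc1Converter c) {
        this.queue = q; this.pool = p; this.dddb = d; this.plazma = pl; this.converter = c;
    }

    void runOnce() {
        ImportTask task = queue.dequeue();
        if (task == null) return; // nothing to import right now

        // Query and download buffers for this account/database/table.
        List<Chunk> chunks = pool.downloadBuffers(task.accountId(), task.database(), task.table());

        // Record this task's chunk-id list, then deduplicate against past tasks.
        List<String> chunkIds = chunks.stream().map(Chunk::id).collect(Collectors.toList());
        dddb.putChunkIds(task.id(), chunkIds);
        Set<String> imported = dddb.chunkIdsImportedInPast(task.accountId(), task.database(), task.table());
        List<Chunk> toImport = chunks.stream()
                .filter(c -> !imported.contains(c.id()))
                .collect(Collectors.toList());

        // Convert json/msgpack.gz chunks into one MPC1 file and put it into Plazmadb.
        plazma.putMpc1(task.database(), task.table(), converter.convert(toImport));

        queue.finish(task.id());
    }
}
```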
37. Bigdam-Dddb (Java)
• Database service for deduplication
• Based on AWS Aurora and S3
• Stores unique chunk ids per import task so that the same chunk is not imported twice
[Diagram — for a small list of chunk ids: bigdam-import 1. stores the chunk-id list with the bigdam-dddb server (Java), which 2. INSERTs (task-id, list-of-chunk-ids) into AWS Aurora.
For a huge list of chunk ids: bigdam-import 1. uploads the encoded chunk ids to AWS S3, then 2. stores the task-id and S3 object path with the bigdam-dddb server, which 3. INSERTs (task-id, path-of-ids) into Aurora.
To fetch chunk-id lists imported in the past: bigdam-import 1. queries the lists of past tasks; the bigdam-dddb server 2. SELECTs from Aurora and 3. downloads from S3 if needed]
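A minimal Java sketch of the small-vs-huge decision above, using JDBC for Aurora and the AWS SDK v2 for S3; the table schema, bucket name, and size threshold are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Store the chunk-id list for an import task: small lists go straight
// into Aurora, huge lists are uploaded to S3 and only the object path
// is stored in Aurora.
public class DddbStoreSketch {
    private static final int INLINE_LIMIT = 64 * 1024; // assumed threshold (bytes)

    private final Connection aurora;
    private final S3Client s3 = S3Client.create();
    private final String bucket = "bigdam-dddb-chunk-ids"; // hypothetical bucket

    public DddbStoreSketch(Connection aurora) { this.aurora = aurora; }

    public void storeChunkIds(String taskId, byte[] encodedChunkIds) throws Exception {
        if (encodedChunkIds.length <= INLINE_LIMIT) {
            // Small list: INSERT (task-id, list-of-chunk-ids) into Aurora.
            try (PreparedStatement st = aurora.prepareStatement(
                    "INSERT INTO chunk_ids (task_id, ids) VALUES (?, ?)")) {
                st.setString(1, taskId);
                st.setBytes(2, encodedChunkIds);
                st.executeUpdate();
            }
        } else {
            // Huge list: upload the encoded chunk ids to S3, then INSERT
            // (task-id, path-of-ids) into Aurora.
            String key = "chunk-ids/" + taskId;
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromBytes(encodedChunkIds));
            try (PreparedStatement st = aurora.prepareStatement(
                    "INSERT INTO chunk_id_paths (task_id, s3_path) VALUES (?, ?)")) {
                st.setString(1, taskId);
                st.setString(2, "s3://" + bucket + "/" + key);
                st.executeUpdate();
            }
        }
    }
}
```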
39. Executing Deduplication at the End of the Pipeline
• Make it simple & reliable
[Diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with scheduler, queue and dddb alongside; at-least-once everywhere, deduplication (transaction + retries) at the import worker]
40. At-Least-Once: Bigdam-pool Data Replication
• Client-side replication (for large chunks, 1MB~): the client uploads 3 replicas to 3 nodes in parallel
• Server-side replication (for small chunks, ~1MB): the primary node appends chunks to the existing buffer and replicates them, so contents and checksums stay equal across nodes
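A minimal sketch of how a pool client might choose between the two strategies at the 1MB boundary; the interface and method names are hypothetical:

```java
import java.util.List;

// Pick the replication strategy per chunk size, as described above:
// large chunks are uploaded by the client to 3 nodes in parallel,
// small chunks go to a primary node which appends and replicates
// server-side so buffer contents and checksums stay identical.
public class ReplicationSketch {
    private static final int LARGE_CHUNK_BYTES = 1 << 20; // 1MB boundary

    interface PoolNode { void append(String bufferId, byte[] chunk); }

    public void upload(String bufferId, byte[] chunk, List<PoolNode> replicaNodes) {
        if (chunk.length >= LARGE_CHUNK_BYTES) {
            // Client-side replication: send the chunk to all replica nodes in parallel.
            replicaNodes.parallelStream().forEach(n -> n.append(bufferId, chunk));
        } else {
            // Server-side replication: send only to the primary; it appends to
            // the existing buffer and replicates to the other nodes itself.
            replicaNodes.get(0).append(bufferId, chunk);
        }
    }
}
```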
43. Scaling-up Just For A Case: Scheduler
• The scheduler needs to collect notifications for all buffers
• and cannot easily be parallelized across nodes
• Solution: a high-performance singleton server: 90k+ reqs/sec
[Diagram: the same pipeline as slide 39, with the scheduler shown as a singleton server]