Yieldbot Tech Talk, Sept 20, 2012

Yieldbot Tech Talk – MongoDB to k/v

© 2012 Yieldbot

What We Do
• Yieldbot technology creates marketplaces where
advertisers target realtime consumer intent flowing
through premium publishers.
• At a high level: Analytics + Ad Serving
– Geo-distributed
• Data collection
• Realtime ad matching
– Cascalog batch analytics
– Rich Analytics Results visualizations

Why MongoDB (Dec 2009)
• Needed to be manageable by the dev team (1 person!)
• Flexible
• Easy to get started, run on laptop or deploy
• Scale wasn’t initially the biggest concern
• Could focus on other stuff
– Lucene
– Analytics
– Ad serving dynamics

How MongoDB Was Used Initially
• Configuration
– Publisher profiles, ad matching rules, etc.
• Data collection
– Pageviews, impressions, clicks
• Analytics results
• Task state tracking
• Lookup tables for ad serving
• Real-time ad stats

A Couple of Aspects of Note
• Master/Slave
– Convenient for simple durability
– Convenient for geo distribution
– Not unique to Mongo; we now run a similar redis topology
• Indexing
– Easy to set up, but eventually a RAM scaling issue
– Initially great for efficient views of data in the UI
– Later moved analytics results to key/value access in Mongo
• Durable sharded configuration (replica sets) is expensive

Data Collection
• Mongo: collections for pageviews, impressions, clicks
– Wasn’t archived anywhere else
– Not where you want to infinitely scale
• Now flows through redis, to files, to S3

Data Collection with redis Assist
• redis lists populated as events come in
• Daemons pull off lists and write to files
• Periodically compress and archive files to S3
• S3 files used for input later
– Hadoop (Cascalog) batch analytics
– Advertising Stats Calculations
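
The flow above can be sketched roughly as follows. This is a minimal illustration, not Yieldbot's code: a `collections.deque` stands in for the redis list (append on the producer side, pop on the daemon side), the function names and file layout are made up for the example, and the S3 upload is left out entirely.

```python
import gzip
import json
import tempfile
from collections import deque
from pathlib import Path

# Assumption: a deque stands in for a redis list shared by producers and daemons.
event_queue = deque()

def record_event(kind, payload):
    """Producer side: push one serialized event onto the list."""
    event_queue.append(json.dumps({"type": kind, **payload}))

def drain_to_file(path, max_events=1000):
    """Daemon side: pop up to max_events and append them to a local file."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        while event_queue and written < max_events:
            out.write(event_queue.popleft() + "\n")
            written += 1
    return written

def compress_for_archive(path):
    """Gzip a finished file; uploading the result to S3 is not shown."""
    gz_path = Path(str(path) + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())
    return gz_path

# Usage: two events flow from the queue to a file to a compressed archive.
workdir = Path(tempfile.mkdtemp())
log_path = workdir / "events.log"
record_event("pageview", {"url": "/a"})
record_event("click", {"ad": 7})
count = drain_to_file(log_path)
archive = compress_for_archive(log_path)
with gzip.open(archive, "rt", encoding="utf-8") as fh:
    restored = [json.loads(line) for line in fh]
```

The point of the design is that the in-memory list only buffers events briefly; the durable copy is the archived file, which later batch jobs can read.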

Matching Lookup Tables
• Mongo: collections for different lookup types
– E.g., geo, URL
– Built periodically, updated on config change
– Lookup in each, correlate results
• redis
– Ability to pipeline operations in single server call
– Set intersection across lookup dimensions, with one response back
– Same master/slave as Mongo for distribution
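
As a rough sketch of the set-intersection idea, the example below uses plain Python sets in place of redis sets. The index contents and the `match_ads` name are hypothetical. In redis the same operation would typically be a single `SINTER` across one set key per dimension, which is what allows one response to come back instead of one per lookup.

```python
# Hypothetical lookup tables: dimension value -> set of candidate ad ids.
geo_index = {"US": {1, 2, 3}, "CA": {2, 4}}
url_index = {"/sports": {2, 3, 5}, "/news": {1, 4}}

def match_ads(geo, url):
    """Intersect the candidate sets from each lookup dimension."""
    candidates = [geo_index.get(geo, set()), url_index.get(url, set())]
    return set.intersection(*candidates)

# Usage: only ads present in both the geo and URL sets survive.
matched = match_ads("US", "/sports")
```

Compared with looking up each dimension separately and correlating results in the application, the intersection happens in one place and only the final set is returned.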

Configuration
• Mongo
– Database per publisher
– Collections for objects
– Denormalized where possible
– Manual Foreign Keys
– Obviously best candidate for relational model
• History and versioning were paramount to us
– Rolled our own: HeroDB

HeroDB
• History and granular versioning highest goal
• Database built on top of git
– Golden database is a bare repo
– Can clone to anywhere, make changes, push
– Changes in single commit are atomic
• How, when, and who changed it
• Ability to set to specific previous state of DB
• Much more to do, in production 6+ months
– Recent change, caching
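
To make the properties listed above concrete (atomic commits, who/what/when history, reverting to an earlier state), here is a toy in-memory model. It is an assumption-heavy illustration only: it does not use git, it is not HeroDB's API, and the class, method, key, and author names are invented for the example.

```python
import copy
import time

class VersionedStore:
    """Toy in-memory model of commit-style versioning (not HeroDB, not git)."""

    def __init__(self):
        self._commits = [{"id": 0, "author": None, "message": "initial",
                          "timestamp": time.time(), "state": {}}]

    def commit(self, author, message, changes):
        """Apply all changes as one new version, or none if any is invalid."""
        state = copy.deepcopy(self._commits[-1]["state"])
        for key, value in changes.items():
            if not isinstance(key, str):
                raise TypeError("keys must be strings")
            state[key] = value
        commit_id = len(self._commits)
        self._commits.append({"id": commit_id, "author": author,
                              "message": message, "timestamp": time.time(),
                              "state": state})
        return commit_id

    def state_at(self, commit_id):
        """Return a copy of the whole database as of a given commit."""
        return copy.deepcopy(self._commits[commit_id]["state"])

    def log(self):
        """Who changed what, oldest first."""
        return [(c["id"], c["author"], c["message"]) for c in self._commits[1:]]

# Usage with hypothetical keys and authors.
store = VersionedStore()
first = store.commit("alice", "add publisher", {"pub/1/name": "Example"})
second = store.commit("bob", "rename", {"pub/1/name": "Example Inc"})
```

Building on git gives these properties for free, plus cloning and pushing between repositories, which this toy model does not attempt.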

Analytics Results
• ARCv1, Mongo: indexed collections
– Very easy to code to
– Initially with everything else in same server
– Moved out to dedicated server
– Memory became an issue
• Indexes bigger than data itself
– Overhead of importing Cascalog results
• Pull json files from S3 to local disk
• mongoimport files into DB

Analytics Results Cont’d
• ARCv2, Mongo: paged data, key/value
– Migrated app to key/value access pattern
– Much better memory usage
– Application sharded, publishers spread around
– DB per day per publisher, most recent 7 held
– Still overhead of importing Hadoop results
• Pull json files from S3 to local disk
• mongoimport files into DB
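
A minimal sketch of the "DB per day per publisher, most recent 7 held" layout might look like the following. The naming scheme, the CRC-based server choice, and the server names are assumptions made for illustration; the slides do not say how names or shard assignments were actually derived.

```python
import zlib
from datetime import date, timedelta

RETAINED_DAYS = 7  # from the slide: most recent 7 held

def db_name(publisher, day):
    """Hypothetical naming scheme: one database per publisher per day."""
    return f"arc_{publisher}_{day:%Y%m%d}"

def shard_for(publisher, servers):
    """Hypothetical application-level sharding: stable hash of the publisher."""
    return servers[zlib.crc32(publisher.encode("utf-8")) % len(servers)]

def days_to_keep(today):
    """The retained window: today plus the six days before it."""
    return {today - timedelta(days=n) for n in range(RETAINED_DAYS)}

def databases_to_drop(publisher, existing_days, today):
    """Names of per-day databases that fall outside the retained window."""
    keep = days_to_keep(today)
    return [db_name(publisher, d) for d in sorted(existing_days) if d not in keep]

# Usage: nine days of data exist, so the two oldest days are dropped.
today = date(2012, 9, 20)
existing = [today - timedelta(days=n) for n in range(9)]
stale = databases_to_drop("pub42", existing, today)
server = shard_for("pub42", ["arc-a", "arc-b", "arc-c"])
```

One appeal of a database per day is that expiring old data becomes a whole-database drop rather than a bulk delete inside a live collection.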

Analytics Results - ElephantDB
• Cascalog support to directly write EDB format
– Berkeley DB or LevelDB
• Ring Topology
– Shards distributed around ring, consistent hashing
– Configurable replication factor
– Request to any node, forwards as necessary
– Incrementally increase ring size
• Import from S3 efficient
– Copy shard from S3 to local disk
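
The ring idea can be illustrated with a generic consistent-hashing sketch. This is not ElephantDB's implementation: the `Ring` class, the MD5-based placement, and the `points_per_node` parameter are assumptions chosen to show how shards map to nodes with a configurable replication factor.

```python
import bisect
import hashlib

def _position(label):
    """Map a string to a point on the ring using a stable hash."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Generic consistent-hash ring (illustrative, not ElephantDB's code)."""

    def __init__(self, nodes, replication=2, points_per_node=64):
        self.replication = replication
        self._points = sorted(
            (_position(f"{node}#{i}"), node)
            for node in nodes for i in range(points_per_node)
        )
        self._keys = [p for p, _ in self._points]
        self._node_count = len(set(nodes))

    def nodes_for(self, shard_id):
        """Walk clockwise from the shard's position, collecting distinct nodes."""
        start = bisect.bisect(self._keys, _position(str(shard_id)))
        want = min(self.replication, self._node_count)
        owners = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == want:
                break
        return owners

# Usage: three nodes, each shard held by two of them.
ring = Ring(["node-a", "node-b", "node-c"], replication=2)
owners = ring.nodes_for(5)
bigger = Ring(["node-a", "node-b", "node-c", "node-d"], replication=2)
```

When a node is added, a shard's first owner either stays the same or becomes the new node, which is what makes growing the ring incrementally cheap.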

Real-time Ad Stats
• Mongo: DB per day, collection by entity type
– Document per entity instance
– stat_type.hour.minute nested values, atomic increment
– Never had a good story for aggregating over larger timeframes
• Enter redis again
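
The nested-counter layout can be sketched as below. MongoDB's `$inc` operator accepts dotted field paths, so one update can bump a single minute bucket atomically. Here `apply_inc` is only a local stand-in for what the server does, and the function names and the `ad-7` document are hypothetical.

```python
from datetime import datetime

def inc_update(stat_type, when, amount=1):
    """Build a MongoDB-style $inc update for a stat_type.hour.minute path."""
    return {"$inc": {f"{stat_type}.{when.hour}.{when.minute}": amount}}

def apply_inc(doc, update):
    """Local stand-in for the server applying $inc to dotted paths."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = node.get(leaf, 0) + amount
    return doc

def hourly_total(doc, stat_type, hour):
    """Aggregating means walking every minute bucket on the client side."""
    return sum(doc.get(stat_type, {}).get(str(hour), {}).values())

# Usage: three clicks recorded against one hypothetical entity document.
entity = {"_id": "ad-7"}
apply_inc(entity, inc_update("clicks", datetime(2012, 9, 20, 13, 5)))
apply_inc(entity, inc_update("clicks", datetime(2012, 9, 20, 13, 5)))
apply_inc(entity, inc_update("clicks", datetime(2012, 9, 20, 13, 6)))
total = hourly_total(entity, "clicks", 13)
```

Writes are cheap in this layout, but any rollup over hours or days has to read and sum the minute buckets, which fits the aggregation pain point noted above.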

Real-time Ad Stats Cont’d
• redis has robust access patterns
– More pipelining
• Initially realtime and aggregated kept in redis
• Issue with redis scaling is DB has to fit in memory
• Time-period aggregations now kept in HBase
• Only most recent hours kept in redis

Task State Tracking
• The last holdout
• Collection of tasks
– Each task is a document
– Indexed as needed
– Mongo query and update syntax convenient
• In static code, and also interactively from the Python or Mongo REPL

Honorable Mention
• redis for the celery backend, used for task messaging
infrastructure
• but was never mongo anyway...

MongoDB Migration Summary
• Configuration → HeroDB
• Data Collection → to S3 via redis
• Analytics Results → ElephantDB
• Task State Tracking → still Mongo
• Matcher Lookup Tables → redis
• Real-time Ad Stats → redis/HBase
