Shopify's Big Data Platform
Jason White
Team Lead, Data Team
11 Feb 2015
A bit about me
Worked at Shopify > 2 years
80th percentile; that's crazy
Team Lead, Business Data Team
5 Data Analysts, focused on internal
business metrics
Python, Ruby, SQL, Coffeescript, Pig, .NET
Where We Started
Extract
Custom Ruby application pulled from
production sources
Load
The same application loaded data into an HP Vertica
database
Transform
Custom SQL queries embedded in all reports
Views in Vertica encapsulated some business
logic
Extractor also did some simple
transformations
Extract-Load-Transform
Pros                 Cons
Simple to set up     Fragile
Easily extensible    Difficult to test
Flexibility          Inconsistency
Dimensional Modelling
Standard DB design is optimized for
transactional integrity
In the analytics world, this is the wrong
problem
We need to optimize for:
Analytical Consistency
Analytical Speed
Business Users (humans that are not
developers)
User trust is the central problem
Throw everything you know about 3NF out the window
Dimensional Modelling
Conformed Dimensions
Conformed Fact Tables
Measurables
What to measure, count, add, average
Use monoids (sketch below)
As fine-grained as possible (transactional
grain)
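A monoid here just means a measure with an identity element and an associative combine operation, so partial aggregates computed per partition can be merged in any order. A minimal Python sketch; the OrderRevenue measure and its fields are illustrative, not Shopify's actual schema:

from dataclasses import dataclass

# A measurable modelled as a monoid: an identity (the defaults) plus an
# associative combine(), so per-partition partial results merge in any order.
@dataclass(frozen=True)
class OrderRevenue:
    order_count: int = 0   # identity: zero orders
    total_cents: int = 0   # identity: zero revenue

    def combine(self, other: "OrderRevenue") -> "OrderRevenue":
        return OrderRevenue(
            self.order_count + other.order_count,
            self.total_cents + other.total_cents,
        )

# Partial results from two partitions merge into the same answer
# regardless of grouping, which is what makes distributed aggregation safe.
a = OrderRevenue(2, 5000)
b = OrderRevenue(3, 9900)
print(a.combine(b))  # OrderRevenue(order_count=5, total_cents=14900)

Sums, counts, min/max, and set unions all combine this way, which is why they distribute cleanly across a cluster.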
Dimensional Modelling
This is just a taste of dimensional modelling
Sacrifices some flexibility for consistency and
reliability
Very powerful, but requires a principled
approach
Starscream
The T in our ETL, HDFS -> HDFS
Reads raw data and other pre-processed
data
Stores results in our front room
High-quality, curated datasets
Fact tables & reusable dimensions
Runs on Apache Spark
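Concretely, a Starscream-style transform might look something like the PySpark sketch below. The paths, field layout, and fact-table schema are hypothetical, just to show the shape of an HDFS-to-HDFS job:

from pyspark import SparkContext

sc = SparkContext(appName="orders_fact")

# Read raw order lines from HDFS; the path and the tab-separated layout
# (order_id, shop_id, created_at, total_cents) are made up for illustration.
raw = sc.textFile("hdfs:///raw/orders/2015-02-11")

def to_fact(line):
    order_id, shop_id, created_at, total_cents = line.split("\t")
    # Transactional grain: one row per order, keyed to conformed dimensions.
    return (int(order_id), int(shop_id), created_at[:10], int(total_cents))

facts = raw.map(to_fact).filter(lambda row: row[3] >= 0)

# Write the curated fact table back to HDFS for the front room.
facts.map(lambda row: "\t".join(map(str, row))) \
     .saveAsTextFile("hdfs:///frontroom/facts/orders/2015-02-11")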
Starscream
Contracts help ensure consistency throughout the
pipeline
Each transform is bookended with Contracts
Each input passes through input contract
Output is checked against an output contract
Usage of contracts is mandatory
Catches many, many errors for us
Upstream data changes
Field names, types changed
NULLs where we weren't expecting them
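Roughly, each transform declares the shape of the data it expects and the shape it promises to emit, and fails fast when reality drifts. The sketch below is a hypothetical illustration; the Contract class, field names, and transform are assumptions, not Starscream's actual API:

class ContractViolation(Exception):
    pass

class Contract:
    # Checks that a record has the expected fields, the expected types,
    # and no NULLs.
    def __init__(self, fields):
        self.fields = fields  # {field_name: expected_type}

    def check(self, record):
        for name, expected in self.fields.items():
            value = record.get(name)
            if value is None:
                raise ContractViolation(f"unexpected NULL in {name!r}")
            if not isinstance(value, expected):
                raise ContractViolation(
                    f"{name!r} is {type(value).__name__}, expected {expected.__name__}")
        return record

input_contract = Contract({"order_id": int, "total_cents": int})
output_contract = Contract({"order_id": int, "total_dollars": float})

def transform(record):
    # Bookended: check the input contract, transform, check the output contract.
    record = input_contract.check(record)
    out = {"order_id": record["order_id"],
           "total_dollars": record["total_cents"] / 100.0}
    return output_contract.check(out)

print(transform({"order_id": 42, "total_cents": 1999}))  # {'order_id': 42, 'total_dollars': 19.99}

A renamed field, a changed type, or an unexpected NULL upstream now fails loudly at the contract instead of silently corrupting downstream fact tables.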
Apache Spark
The Resilient Distributed Dataset (RDD) is Spark's defining
abstraction
Each RDD has:
Partitions
Dependencies
Computation
Output:
Shuffled for use in another RDD as input,
Serialized to storage, or
Returned to driver
Shuffles use local worker memory or disk as
necessary
Apache Spark
In [1]: rdd1 = sc.parallelize([(1, "hello"), (2, "goodbye")])
In [2]: rdd2 = sc.parallelize([(1, "world"), (1, "everyone"), (2, "cruel world")])
In [3]: rdd1.join(rdd2).collect()
Out[3]:
[(1, (u'hello', u'everyone')),
 (1, (u'hello', u'world')),
 (2, (u'goodbye', u'cruel world'))]
Approximate Counting
Standard term counting is more precise than
we need
What do we actually need?
All keys whose count exceeds some threshold
False positives are OK (sketch below)
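One structure with exactly this trade-off is a count-min sketch (see the sketching reference below): estimates only ever over-count, so keys reported above the threshold may include false positives, but a genuinely frequent key is never missed. A minimal Python sketch, assuming a known candidate set to query; an illustration, not the implementation used here:

import hashlib

# Count-min sketch: collisions can inflate counts but never deflate them,
# so "estimate >= threshold" yields false positives, never false negatives.
class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._buckets(key))

sketch = CountMinSketch()
for term in ["shop", "shop", "cart", "shop", "checkout"]:
    sketch.add(term)

hot = [t for t in ["shop", "cart", "checkout"] if sketch.estimate(t) >= 3]
print(hot)  # ['shop'], plus possible false positives once the sketch saturates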
Future Work
Model ALL the things!
Machine Learning algorithms on our nice,
clean datasets
Forecasting
New externally-facing products!
References
The Data Warehouse Toolkit (2nd Edition), Kimball
& Ross
Advanced Spark Training, Reynold Xin (2014)
http://lkozma.net/blog/sketching-data-structures/