
Shopify's Big Data Platform
Jason White
Team Lead, Data Team
11 Feb 2015

A bit about me
Worked at Shopify > 2 years
80th percentile - that's crazy
Team Lead, Business Data Team
5 Data Analysts, focused on internal business metrics
Python, Ruby, SQL, CoffeeScript, Pig, .NET

Where We Started
Extract
Custom Ruby application pulled from production sources
Load
Same application loaded into HP Vertica database
Transform
Custom SQL queries embedded in all reports
Views in Vertica containerized some business logic
Extractor also did some simple transformations

Extract-Load-Transform
Pros:
Simple to set up
Worked for small teams
Easily extensible
Quick iteration cycles
Flexibility

Cons:
Fragile
Stopped working at scale
Difficult to test
Restated history all the time
Inconsistency

ELT worked for a long time, until it didn't


Needed something more testable, reliable, and scalable
Time to move to ETL

Onwards and Upwards


Extract
Longboat: dumb Extractor, as few Transforms as possible
JRuby application using classic Hadoop M/R
Stores in HDFS
Transform
Starscream: PySpark application
Dimensional Modelling approach, using the Kimball methodology
HDFS -> HDFS transformations
Load
Canonical truth is on HDFS
Load to Redshift as a dumb caching layer

Onwards and Upwards


Reporting
Tableau Desktop & Web read from Redshift
Hive available for developers
0xDBE for SQL access to Redshift
Other Data Consumers
Haven't really figured this part out yet
Some sort of API or library TBD

Dimensional Modelling
Standard DB design is optimized for transactional integrity
In the analytics world, this is the wrong problem
We need to optimize for:
Analytical Consistency
Analytical Speed
Business Users (humans who are not developers)
User trust is the central problem
Throw everything about 3NF out the window

Dimensional Modelling

Processes are central


Every table has strict, explicit grain
Nearly always have time as a dimension
Dimensions - How to slice & dice?
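
As a toy illustration only (not Shopify's schema; the table and field names below are made up), here is what a strict, explicit grain of "one row per order line" with time carried as a date dimension key could look like:

# Hypothetical fact table at an explicit grain: one row per order line.
# Time is a dimension, referenced through a date_key (YYYYMMDD).
fact_order_line = [
    {"order_id": 1001, "line_no": 1, "shop_key": 42, "date_key": 20150211, "amount": 19.99},
    {"order_id": 1001, "line_no": 2, "shop_key": 42, "date_key": 20150211, "amount": 5.00},
    {"order_id": 1002, "line_no": 1, "shop_key": 7,  "date_key": 20150210, "amount": 250.00},
]

# Slicing & dicing = grouping by dimension keys and aggregating the measures.
from collections import defaultdict
sales_by_day = defaultdict(float)
for row in fact_order_line:
    sales_by_day[row["date_key"]] += row["amount"]
print(dict(sales_by_day))   # roughly {20150211: 24.99, 20150210: 250.0}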

Dimensional Modelling

Conformed Dimensions
Conformed Fact Tables
Measurables
What to measure, count, add, average
Use monoids
As fine-grained as possible (transactional grain)
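
To make "use monoids" concrete: a measure should combine with an associative operation that has an identity, so partial aggregates can be rolled up in any order (across partitions, days, shops, ...). A minimal sketch in plain Python, with made-up numbers:

import functools

# A measure modeled as a monoid: (count, total) pairs with an associative
# combine() and an identity element, so they can be summed in any grouping order.
IDENTITY = (0, 0.0)

def combine(a, b):
    # Associative: combine(combine(x, y), z) == combine(x, combine(y, z))
    return (a[0] + b[0], a[1] + b[1])

line_amounts = [19.99, 5.00, 250.00]                 # transactional grain
measures = [(1, amount) for amount in line_amounts]
count, total = functools.reduce(combine, measures, IDENTITY)
print(count, total, total / count)                   # 3, ~274.99, ~91.66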

Dimensional Modelling
This is just a taste of dimensional modelling
Sacrifices some flexibility for consistency and reliability
Very powerful, but you must be principled in your approach

Starscream
The T in our ETL, HDFS -> HDFS
Reads raw data and other pre-processed data
Stores data in our frontroom
High-quality, curated datasets
Fact tables & reusable dimensions
Runs on Apache Spark

Starscream
Contracts help ensure consistency throughout the pipeline
Each transform is bookended with Contracts
Each input passes through input contract
Output is checked against an output contract
Usage of contracts is mandatory
Catches many, many errors for us:
Upstream data changes
Field names or types changed
NULLs where we weren't expecting them
Starscream data changes
A transform was modified, but a consuming transform was missed
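
Starscream's contract API is internal, so the following is only a hypothetical sketch of the idea: every transform checks its input rows against an input contract and its output rows against an output contract (field presence, type, no unexpected NULLs). The names ORDERS_IN, ORDERS_OUT and check_contract are invented for illustration.

# Hypothetical sketch of the contract idea (not Starscream's actual API).
ORDERS_IN  = {"order_id": int, "shop_id": int, "amount": float}
ORDERS_OUT = {"shop_id": int, "total": float}

def check_contract(row, contract):
    for field, expected_type in contract.items():
        assert field in row, "missing field: %s" % field
        assert row[field] is not None, "unexpected NULL in %s" % field
        assert isinstance(row[field], expected_type), "wrong type for %s" % field
    return row

def shop_totals(orders_rdd):
    # Input contract on the way in, output contract on the way out.
    checked = orders_rdd.map(lambda r: check_contract(r, ORDERS_IN))
    totals = (checked
              .map(lambda r: (r["shop_id"], r["amount"]))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: {"shop_id": kv[0], "total": kv[1]}))
    return totals.map(lambda r: check_contract(r, ORDERS_OUT))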

Apache Spark
The Resilient Distributed Dataset (RDD) is the defining characteristic
Each RDD has:
Partitions
Dependencies
Computation
Output:
Shuffled for use in another RDD as input,
Serialized to storage, or
Returned to driver
Shuffles use local worker memory or disk as necessary
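
A small PySpark sketch of those pieces (assuming a SparkContext sc, as in the join example on the next slide): map and flatMap are narrow dependencies that stay inside each partition, reduceByKey is a wide dependency that triggers a shuffle, and collect() returns the result to the driver.

lines = sc.parallelize(["a b a", "b c"], numSlices=4)   # an RDD with 4 partitions

# Narrow dependencies: each output partition depends on one input partition.
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Wide dependency: reduceByKey shuffles so that each key ends up on one partition.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.getNumPartitions())   # partitions of the shuffled RDD
print(counts.collect())            # returned to the driver, e.g. [('a', 2), ('b', 2), ('c', 1)]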

Apache Spark

In [1]: rdd1 = sc.parallelize([(1, "hello"), (2, "goodbye")])
In [2]: rdd2 = sc.parallelize([(1, "world"), (1, "everyone"), (2, "cruel world")])
In [3]: rdd1.join(rdd2).collect()
Out[3]:
[(1, (u'hello', u'everyone')),
 (1, (u'hello', u'world')),
 (2, (u'goodbye', u'cruel world'))]

Joining Data in PySpark


What if 1 key has 1 billion entries?

Joining Data in PySpark


The answer is to use another strategy: Broadcasting
Download the complete smaller RDD to the driver
Upload the complete smaller RDD to each executor
Now join == map
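
A minimal sketch of that broadcast (map-side) join, with toy data, assuming the smaller RDD fits comfortably in driver and executor memory:

small = sc.parallelize([(1, "world"), (2, "cruel world")])
large = sc.parallelize([(1, "hello"), (1, "hi"), (2, "goodbye")])

# Collect the small RDD to the driver, then broadcast it to every executor.
small_map = sc.broadcast(small.collectAsMap())

# The join is now just a map over the large RDD: no shuffle of the large side.
joined = large.map(lambda kv: (kv[0], (kv[1], small_map.value.get(kv[0]))))
print(joined.collect())
# [(1, ('hello', 'world')), (1, ('hi', 'world')), (2, ('goodbye', 'cruel world'))]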

Joining Data in PySpark


Broadcasting has size limitations
Downloading & uploading entire sets of data
Only useful when one of the datasets is relatively small
Trick: Horizontal Partitioning
Identify the high-frequency keys
Partition both datasets using these keys
High-frequency set: use Broadcast Join
Low-frequency set: use Standard Join
Union results together
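
A sketch of the trick, assuming the set of high-frequency keys (hot_keys here) has already been identified, for example with the counting approach on the next slide:

# Assumes `hot_keys` is a small set of high-frequency keys (see the next slide).
hot_keys = {2}

big   = sc.parallelize([(1, "a"), (2, "b"), (2, "c"), (3, "d")])
small = sc.parallelize([(1, "x"), (2, "y"), (3, "z")])

big_hot    = big.filter(lambda kv: kv[0] in hot_keys)
big_cold   = big.filter(lambda kv: kv[0] not in hot_keys)
small_hot  = small.filter(lambda kv: kv[0] in hot_keys)
small_cold = small.filter(lambda kv: kv[0] not in hot_keys)

# High-frequency keys: broadcast join, so the huge key groups are never shuffled.
hot_map = sc.broadcast(small_hot.collectAsMap())
joined_hot = big_hot.map(lambda kv: (kv[0], (kv[1], hot_map.value.get(kv[0]))))

# Low-frequency keys: a standard shuffle join is fine.
joined_cold = big_cold.join(small_cold)

result = joined_hot.union(joined_cold)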

Joining Data in PySpark


How to identify high-frequency terms?
Easy solution: standard term-counting problem
Map each row to (key, 1)
Reduce with add function
Filter above threshold
Collect to driver
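
Those four steps as a PySpark sketch, with toy data and a toy threshold, using operator.add as the reduce function:

from operator import add

threshold = 1   # toy value; in practice this would be much larger

rows = sc.parallelize([(1, "a"), (2, "b"), (2, "c"), (2, "d"), (3, "e")])

hot_keys = set(rows
               .map(lambda kv: (kv[0], 1))            # map each row to (key, 1)
               .reduceByKey(add)                      # reduce with add
               .filter(lambda kc: kc[1] > threshold)  # keep keys above the threshold
               .keys()
               .collect())                            # collect only the hot keys to the driver
print(hot_keys)   # {2}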

Approximate Counting
Standard term counting is more precise than we need
What do we actually need?
All keys with count > threshold
False positives are OK

Count Min Sketch


CMS vastly improved partitioning performance
Observed 2x speed of standard count for large RDDs
Data being shuffled went from GBs -> MBs

Standard triad of probabilistic data structures:
Count-Min Sketch
HyperLogLog
Bloom Filters
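
For intuition, a minimal Count-Min Sketch in plain Python (a toy sketch, not the implementation used at Shopify): a depth x width grid of counters, one hashed cell per row per key, with the estimate taken as the minimum cell, so counts are never underestimated and a threshold filter gives false positives but no false negatives.

import hashlib

class CountMinSketch:
    # Minimal Count-Min Sketch: `depth` hash rows, `width` counters per row.
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One counter per row, chosen by a row-specific hash of the key.
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, key)).encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over the rows never under-counts; collisions can only
        # over-count, which is exactly the "false positives are OK" trade-off.
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for key in [2, 2, 2, 1, 3]:
    cms.add(key)
print(cms.estimate(2))   # >= 3

Two sketches built with the same hash functions merge by adding their tables cell-wise, so each partition can build its own small sketch and only the tables need to be shuffled, which is presumably where the GBs -> MBs reduction comes from.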

Future Work
Model ALL the things!
Machine Learning algorithms on our nice, clean datasets
Forecasting
New externally-facing products!

References
The Data Warehouse Toolkit (2nd Edition), Kimball & Ross
Advanced Spark Training, Reynold Xin (2014)
http://lkozma.net/blog/sketching-data-structures/
