
Shopify's Big Data Platform
Jason White
Team Lead, Data Team
11 Feb 2015

A bit about me
Worked at Shopify > 2 years
80th percentile - that's crazy
Team Lead, Business Data Team
5 Data Analysts, focused on internal business metrics
Python, Ruby, SQL, CoffeeScript, Pig, .NET

Where We Started
Extract
Custom Ruby application pulled from production sources
Load
Same application loaded into HP Vertica database
Transform
Custom SQL queries embedded in all reports
Views in Vertica containerized some business logic
Extractor also did some simple transformations

Extract-Load-Transform
Pros:
Simple to set up
Worked for small teams
Easily extensible
Quick iteration cycles
Flexibility

Cons:
Fragile
Stopped working at scale
Difficult to test
Restated history all the time
Inconsistency

ELT worked for a long time, until it didn't


Needed something more testable, reliable, and scalable
Time to move to ETL

Onwards and Upwards


Extract
Longboat: dumb Extractor, as few Transforms as possible
JRuby application using classic Hadoop M/R
Stores in HDFS
Transform
Starscream: PySpark application
Dimensional Modelling approach, using the Kimball methodology
HDFS -> HDFS transformations
Load
Canonical truth is on HDFS
Load to Redshift as a dumb caching layer

Onwards and Upwards


Reporting
Tableau Desktop & Web read from Redshift
Hive available for developers
0xDBE for SQL access to Redshift
Other Data Consumers
Haven't really figured this part out yet
Some sort of API or library TBD

Dimensional Modelling
Standard DB design is optimized for transactional integrity
In the analytics world, this is the wrong problem
We need to optimize for:
Analytical Consistency
Analytical Speed
Business Users (humans who are not developers)
User trust is the central problem
Throw everything about 3NF out the window

Dimensional Modelling

Processes are central


Every table has strict, explicit grain
Nearly always have time as a dimension
Dimensions - How to slice & dice?
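
As a toy illustration only (not Shopify's schema; the table and field names below are made up), here is what a strict, explicit grain of "one row per order line" with time carried as a date dimension key could look like:

# Hypothetical fact table at an explicit grain: one row per order line.
# Time is a dimension, referenced through a date_key (YYYYMMDD).
fact_order_line = [
    {"order_id": 1001, "line_no": 1, "shop_key": 42, "date_key": 20150211, "amount": 19.99},
    {"order_id": 1001, "line_no": 2, "shop_key": 42, "date_key": 20150211, "amount": 5.00},
    {"order_id": 1002, "line_no": 1, "shop_key": 7,  "date_key": 20150210, "amount": 250.00},
]

# Slicing & dicing = grouping by dimension keys and aggregating the measures.
from collections import defaultdict
sales_by_day = defaultdict(float)
for row in fact_order_line:
    sales_by_day[row["date_key"]] += row["amount"]
print(dict(sales_by_day))   # roughly {20150211: 24.99, 20150210: 250.0}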

Dimensional Modelling

Conformed Dimensions
Conformed Fact Tables
Measurables
What to measure, count, add, average
Use monoids
As fine-grained as possible (transactional grain)
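
To make "use monoids" concrete: a measure should combine with an associative operation that has an identity, so partial aggregates can be rolled up in any order (across partitions, days, shops, ...). A minimal sketch in plain Python, with made-up numbers:

import functools

# A measure modeled as a monoid: (count, total) pairs with an associative
# combine() and an identity element, so they can be summed in any grouping order.
IDENTITY = (0, 0.0)

def combine(a, b):
    # Associative: combine(combine(x, y), z) == combine(x, combine(y, z))
    return (a[0] + b[0], a[1] + b[1])

line_amounts = [19.99, 5.00, 250.00]                 # transactional grain
measures = [(1, amount) for amount in line_amounts]
count, total = functools.reduce(combine, measures, IDENTITY)
print(count, total, total / count)                   # 3, ~274.99, ~91.66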

Dimensional Modelling
This is just a taste of dimensional modelling
Sacrifices some flexibility for consistency and reliability
Very powerful, but you must be principled in your approach

Starscream
The T in our ETL, HDFS -> HDFS
Reads raw data and other pre-processed data
Stores data in our frontroom
High-quality, curated datasets
Fact tables & reusable dimensions
Runs on Apache Spark

Starscream
Contracts help ensure consistency throughout the pipeline
Each transform is bookended with Contracts
Each input passes through input contract
Output is checked against an output contract
Usage of contracts is mandatory
Catches many, many errors for us:
Upstream data changes
Field names or types changed
NULLs where we weren't expecting them
Starscream data changes
A transform was modified, but a consuming transform was missed
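
Starscream's contract API is internal, so the following is only a hypothetical sketch of the idea: every transform checks its input rows against an input contract and its output rows against an output contract (field presence, type, no unexpected NULLs). The names ORDERS_IN, ORDERS_OUT and check_contract are invented for illustration.

# Hypothetical sketch of the contract idea (not Starscream's actual API).
ORDERS_IN  = {"order_id": int, "shop_id": int, "amount": float}
ORDERS_OUT = {"shop_id": int, "total": float}

def check_contract(row, contract):
    for field, expected_type in contract.items():
        assert field in row, "missing field: %s" % field
        assert row[field] is not None, "unexpected NULL in %s" % field
        assert isinstance(row[field], expected_type), "wrong type for %s" % field
    return row

def shop_totals(orders_rdd):
    # Input contract on the way in, output contract on the way out.
    checked = orders_rdd.map(lambda r: check_contract(r, ORDERS_IN))
    totals = (checked
              .map(lambda r: (r["shop_id"], r["amount"]))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: {"shop_id": kv[0], "total": kv[1]}))
    return totals.map(lambda r: check_contract(r, ORDERS_OUT))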

Apache Spark
The Resilient Distributed Dataset (RDD) is the defining characteristic
Each RDD has:
Partitions
Dependencies
Computation
Output:
Shuffled for use in another RDD as input,
Serialized to storage, or
Returned to driver
Shuffles use local worker memory or disk as necessary
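
A small PySpark sketch of those pieces (assuming a SparkContext sc, as in the join example on the next slide): map and flatMap are narrow dependencies that stay inside each partition, reduceByKey is a wide dependency that triggers a shuffle, and collect() returns the result to the driver.

lines = sc.parallelize(["a b a", "b c"], numSlices=4)   # an RDD with 4 partitions

# Narrow dependencies: each output partition depends on one input partition.
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Wide dependency: reduceByKey shuffles so that each key ends up on one partition.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.getNumPartitions())   # partitions of the shuffled RDD
print(counts.collect())            # returned to the driver, e.g. [('a', 2), ('b', 2), ('c', 1)]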

Apache Spark

In [1]: rdd1 = sc.parallelize([(1, "hello"), (2, "goodbye")])
In [2]: rdd2 = sc.parallelize([(1, "world"), (1, "everyone"), (2, "cruel world")])
In [3]: rdd1.join(rdd2).collect()
Out[3]:
[(1, (u'hello', u'everyone')),
 (1, (u'hello', u'world')),
 (2, (u'goodbye', u'cruel world'))]

Joining Data in PySpark


What if 1 key has 1 billion entries?

Joining Data in PySpark


The answer is to use another strategy: Broadcasting
Download the complete smaller RDD to the driver
Upload the complete smaller RDD to each executor
Now join == map
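
A minimal sketch of that broadcast (map-side) join, with toy data, assuming the smaller RDD fits comfortably in driver and executor memory:

small = sc.parallelize([(1, "world"), (2, "cruel world")])
large = sc.parallelize([(1, "hello"), (1, "hi"), (2, "goodbye")])

# Collect the small RDD to the driver, then broadcast it to every executor.
small_map = sc.broadcast(small.collectAsMap())

# The join is now just a map over the large RDD: no shuffle of the large side.
joined = large.map(lambda kv: (kv[0], (kv[1], small_map.value.get(kv[0]))))
print(joined.collect())
# [(1, ('hello', 'world')), (1, ('hi', 'world')), (2, ('goodbye', 'cruel world'))]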

Joining Data in PySpark


Broadcasting has size limitations
Downloading & uploading entire sets of data
Only useful when one of the datasets is relatively small
Trick: Horizontal Partitioning
Identify the high-frequency keys
Partition both datasets using these keys
High-frequency set: use Broadcast Join
Low-frequency set: use Standard Join
Union results together
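
A sketch of the trick, assuming the set of high-frequency keys (hot_keys here) has already been identified, for example with the counting approach on the next slide:

# Assumes `hot_keys` is a small set of high-frequency keys (see the next slide).
hot_keys = {2}

big   = sc.parallelize([(1, "a"), (2, "b"), (2, "c"), (3, "d")])
small = sc.parallelize([(1, "x"), (2, "y"), (3, "z")])

big_hot    = big.filter(lambda kv: kv[0] in hot_keys)
big_cold   = big.filter(lambda kv: kv[0] not in hot_keys)
small_hot  = small.filter(lambda kv: kv[0] in hot_keys)
small_cold = small.filter(lambda kv: kv[0] not in hot_keys)

# High-frequency keys: broadcast join, so the huge key groups are never shuffled.
hot_map = sc.broadcast(small_hot.collectAsMap())
joined_hot = big_hot.map(lambda kv: (kv[0], (kv[1], hot_map.value.get(kv[0]))))

# Low-frequency keys: a standard shuffle join is fine.
joined_cold = big_cold.join(small_cold)

result = joined_hot.union(joined_cold)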

Joining Data in PySpark


How to identify high-frequency terms?
Easy solution: standard term-counting problem
Map each row to (key, 1)
Reduce with add function
Filter above threshold
Collect to driver
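
Those four steps as a PySpark sketch, with toy data and a toy threshold, using operator.add as the reduce function:

from operator import add

threshold = 1   # toy value; in practice this would be much larger

rows = sc.parallelize([(1, "a"), (2, "b"), (2, "c"), (2, "d"), (3, "e")])

hot_keys = set(rows
               .map(lambda kv: (kv[0], 1))            # map each row to (key, 1)
               .reduceByKey(add)                      # reduce with add
               .filter(lambda kc: kc[1] > threshold)  # keep keys above the threshold
               .keys()
               .collect())                            # collect only the hot keys to the driver
print(hot_keys)   # {2}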

Approximate Counting
Standard term counting is more precise than we need
What do we actually need?
All keys with count > threshold
False positives are OK

Count Min Sketch


CMS vastly improved partitioning performance
Observed 2x speed of standard count for large RDDs
Data being shuffled went from GBs -> MBs

Standard triad of probabilistic data structures:
Count-Min Sketch
HyperLogLog
Bloom Filters
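
For intuition, a minimal Count-Min Sketch in plain Python (a toy sketch, not the implementation used at Shopify): a depth x width grid of counters, one hashed cell per row per key, with the estimate taken as the minimum cell, so counts are never underestimated and a threshold filter gives false positives but no false negatives.

import hashlib

class CountMinSketch:
    # Minimal Count-Min Sketch: `depth` hash rows, `width` counters per row.
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One counter per row, chosen by a row-specific hash of the key.
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, key)).encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over the rows never under-counts; collisions can only
        # over-count, which is exactly the "false positives are OK" trade-off.
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for key in [2, 2, 2, 1, 3]:
    cms.add(key)
print(cms.estimate(2))   # >= 3

Two sketches built with the same hash functions merge by adding their tables cell-wise, so each partition can build its own small sketch and only the tables need to be shuffled, which is presumably where the GBs -> MBs reduction comes from.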

Future Work
Model ALL the things!
Machine Learning algorithms on our nice, clean datasets
Forecasting
New externally-facing products!

References
The Data Warehouse Toolkit (2nd Edition), Kimball & Ross
Advanced Spark Training, Reynold Xin (2014)
http://lkozma.net/blog/sketching-data-structures/
