2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem, explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts), and show how to standardize data effectively in a growing organization to enable (rather than slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API, as a protective and enabling shield. Along the way, they share observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets, and offer an overview of LinkedIn's governance model: the tools, processes, and teams that ensure its data ecosystem can handle change and sustain #DataScienceHappiness.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn
1. Building a healthy data ecosystem
around Kafka and Hadoop:
Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
2. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
5. Achieving Data Democracy
"… helping everybody to access and understand data… breaking down silos… providing access to data when and where it is needed at any given moment."
- Collect, flow, and store as much data as you can
- Provide efficient access to data in all its stages of evolution
6. The forms of data
Key-Value++ | Message Bus | Fast-OLAP | Search | Graph | Crunchable
Systems: Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob / Data Lake Storage
7. The forms of data: In Motion vs. At Rest
(Diagram: the same systems split by whether data is in motion or at rest: Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob / Data Lake Storage)
8. The forms of data: scale
In Motion: O(10) clusters, ~1.7 trillion messages, ~450 TB
At Rest: O(10) clusters, ~10K machines, ~100 PB
10. Data Integration: key requirements
- Source and sink diversity
- Batch + streaming
- Data quality
So, we built…
11. Simplifying Data Integration @LinkedIn
Sources: SFTP, JDBC, REST, Azure Blob / Data Lake Storage, and more
- Hundreds of TB per day
- Thousands of datasets
- ~30 different source systems
- 80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet, and many more…
Apache incubation under way
19. How does LinkedIn build data-driven products?
A Data Scientist, PM, Designer, and Engineer collaborate:
"We should enable users to filter connection suggestions by company."
"How much do people utilize existing filter capabilities?"
"Let's see how users send connection invitations today."
21. Tracking data records user activity (powers metrics and data products)
Example: InvitationClickEvent()
Scale fact: ~1,000 tracking event types; hundreds of metrics & data products
23. Tracking Data Lifecycle & Teams
Produce → Transport → Consume
- Product or app teams (PMs, developers, TestEng): production code
- Infra teams (Hadoop, Kafka, DWH, ...): user engagement tracking data
- Data science teams (Analytics, ML engineers, ...): metric scripts
Outputs: member-facing data products (members) and business-facing decision making (execs)
24. How do we calculate a metric: ProfileViews

PageViewEvent Record 1:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "profile_page",
  "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}

Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page)

PageViewEvent Record 101:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "new_profile_page",
  "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}

The filter now has to become "where pagekey = profile_page or new_profile_page", but the producer forgot to notify the metric owners. Undesirable.
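The fragility on this slide can be sketched in a few lines of Python (a toy illustration, not LinkedIn's actual metric pipeline; field values follow the records shown):

```python
# Toy sketch: a metric that filters on a hard-coded pageKey silently
# undercounts once the producer renames the page without telling anyone.

events = [
    {"memberId": 12345, "pageKey": "profile_page"},      # old app version
    {"memberId": 67890, "pageKey": "new_profile_page"},  # rewritten app
]

def profile_views(records, page_keys):
    """Count PageViewEvents whose pageKey is in the accepted set."""
    return sum(1 for r in records if r["pageKey"] in page_keys)

# Original metric definition: misses the renamed page entirely.
print(profile_views(events, {"profile_page"}))                      # 1
# Patched definition, after the (belated) notification.
print(profile_views(events, {"profile_page", "new_profile_page"}))  # 2
```

Nothing crashes in the first call, which is exactly the problem: the metric just drifts down until someone notices.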
25. Metrics ecosystem at LinkedIn: 3 years ago
- Operational challenges for infra teams
- Diminished trust due to multiple sources of truth
26. What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed.
2. "Naming things is hard": different semantics and conventions across each team's events → had to email around to figure out the correct and complete logic to use → inefficient and potentially wrong.
3. Discrepant metric logic: duplicate tech allowed duplicate logic, which allowed discrepant metric logic.
So how did we solve this?
27. Data Modeling Tip
Say no to fragile formats or schema-free data.
Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc.
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
We chose Avro as our format.
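To see why a declared schema acts as a contract, here is a toy sketch in plain Python (hypothetical field list; a real pipeline would use the Avro/Protobuf/Thrift tooling itself, which also handles evolution and binary encoding):

```python
# Toy schema enforcement: reject bad records at write time instead of
# letting them break downstream metric scripts. Field names are
# illustrative, mirroring the PageViewEvent records earlier in the deck.

PAGE_VIEW_SCHEMA = {
    "memberId": int,
    "time": int,
    "appName": str,
    "pageKey": str,
}

def validate(record, schema):
    """Check a record against the declared contract before writing it."""
    for field, ftype in schema.items():
        if field not in record:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(record[field], ftype):
            raise TypeError(f"field {field} must be {ftype.__name__}")
    return record

ok = validate(
    {"memberId": 12345, "time": 1454745292951,
     "appName": "LinkedIn", "pageKey": "profile_page"},
    PAGE_VIEW_SCHEMA,
)

try:
    validate({"memberId": "12345"}, PAGE_VIEW_SCHEMA)  # wrong type
except (ValueError, TypeError) as err:
    print("rejected at write time:", err)
```

The point of the slide stands even in this sketch: the producer, not the consumer, finds out first when data stops matching the contract.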
28. Sometimes you need a committee: the Data Model Review Committee (DMRC)
- Leads from product and infra teams
- Reviews each new data model
- Ensures it follows our conventions, patterns, and best practices across the entire data lifecycle
- Tooling to codify conventions
"Always be reviewing"
29. Unified Metrics Platform
A single source of truth for all business metrics at LinkedIn:
- a metrics processing platform as a service
- a metrics computation template
- a set of tools and processes to facilitate the metrics life-cycle
Workflow (metric owner, with the central team and relevant stakeholders): 1. iterate in a sandbox → 2. create the metric definition → 3. review → 4. check in to the code repo → build & deploy (core metrics job, system jobs)
Scale: 5,000 metrics daily
30. Unified Metrics Platform: Pipeline
Input: tracking + database + other raw data on HDFS
UMP Harness (metrics logic): incremental aggregate, backfill, auto-join
Output: aggregated data (HDFS, Pinot) → Raptor dashboards, experiment analysis, machine learning, anomaly detection, ad-hoc queries
31. Tracking Platform: standardizing production
Client-side tracking from frontend services flows into Kafka, with schema-compatibility checks, time audit, and supporting tools.
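A minimal sketch of the kind of schema-compatibility gate such a platform can run before accepting a new event schema (a hypothetical rule set, not the real Avro resolution rules): a new version may add fields, but may not drop or re-type existing ones.

```python
# Hypothetical backward-compatibility check over two schema dicts
# (field name -> type name). Additive changes pass; destructive ones fail.

def breaking_changes(old, new):
    """List the changes in `new` that would break existing consumers."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"dropped field: {field}")
        elif new[field] != ftype:
            problems.append(f"re-typed field: {field} ({ftype} -> {new[field]})")
    return problems  # fields only in `new` are allowed: additive change

v1 = {"memberId": "long", "pageKey": "string"}
v2 = {"memberId": "long", "pageKey": "string", "locale": "string"}  # additive
v3 = {"memberId": "string", "locale": "string"}                     # breaking

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))
```

Running this in CI for every schema change is one way to turn "please notify consumers" from a social norm into an enforced check.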
33. What was still causing unhappiness?
1. Old bad data sticks around (e.g. old mobile app versions)
2. No clear contract for data production: producers unaware of consumers' concerns
3. Never a good time to pay down this tech debt
We started from the bottom.
34. (Recap of the lifecycle diagram from slide 23: product/app teams produce, infra teams transport, data science teams consume; outputs serve members and execs.)
3. Never a good time to pay down this "data" debt: features are waiting to ship to members, and some of this work is invisible. But what is the cost of not doing it?
#victimsOfTheData → #DataScienceHappiness, via proactively forging our own data destiny.
35. The Big Problem Opportunity in 2015
Launch a completely rewritten LinkedIn mobile app
37. There were two options:
1. Keep the old tracking:
   a. Cost: producers (try to) replicate it, rewriting bad old code from scratch
   b. Save: consumers avoid migrating
2. Evolve:
   a. Cost: time on clean data modeling and on consumer migration to new tracking events
   b. Save: pays down data-modeling tech debt
38. How much work would it be? #DataScienceHappiness
40. 2. No clear contract existed for data production
Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence.
Data producers (PMs, app developers) own the artifact that feeds the data ecosystem (and data scientists!); data consumers (DS) depend on it.
#victimsOfTheData → #DataScienceHappiness, via proactive joint requirements definition
41. 2a. Ensure dialogue between producers & consumers
• Awareness: train teams on the end-to-end data pipeline and data modeling
• Instill a communication and collaborative-ownership process between all: a step-by-step playbook for who develops and owns tracking, and how
42. 2b. Standardized core data entities
• Event types and names: Page, Action, Impression
• Framework-level client-side tracking: views, clicks, flows
• For all else (custom): a guide for when to create a new event
Examples: Navigation, Page View, Control Interaction
43. 2c. Created clear, maintainable data-production contracts
A tracking specification with monitoring and alerting for adherence: a clear, visual, consistent contract.
Tooling is needed to support the culture and process shift: "Always be tooling".
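One way to make such a spec machine-checkable (a hypothetical sketch, not LinkedIn's actual tracking-specification tool): express the contract as data, then alert when live events drift from it.

```python
# Hypothetical tracking spec: required keys and allowed values per event
# type. A monitor runs check_event over sampled production traffic and
# alerts on any non-empty violation list.

SPEC = {
    "PageViewEvent": {
        "required": {"memberId", "pageKey"},
        "allowed_page_keys": {"profile_page", "new_profile_page"},
    },
}

def check_event(name, payload, spec=SPEC):
    """Return the spec violations for one event (empty list = adherent)."""
    rules = spec.get(name)
    if rules is None:
        return [f"unknown event type: {name}"]
    violations = [f"missing {k}" for k in rules["required"] - payload.keys()]
    if payload.get("pageKey") not in rules["allowed_page_keys"]:
        violations.append(f"unexpected pageKey: {payload.get('pageKey')}")
    return violations

print(check_event("PageViewEvent", {"memberId": 1, "pageKey": "profile_page"}))  # []
print(check_event("PageViewEvent", {"pageKey": "home"}))
```

The spec file itself then becomes the reviewable, versioned contract between producers and consumers.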
49. We built Dali to solve this
A Data Access Layer for LinkedIn: abstract away the underlying physical details so that users can focus solely on the logical concerns.
- Logical tables + views
- Logical filesystem
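The abstraction the slide describes can be sketched like this (a toy illustration with hypothetical names and paths; Dali itself is built on Hive views and a logical filesystem, not this code):

```python
# Toy data-access layer: consumers address a logical dataset name, and
# the catalog resolves the physical location and view logic behind it.

CATALOG = {
    "tracking.profile_views": {
        "path": "/data/tracking/PageViewEvent/daily",   # hypothetical path
        "view_filter": lambda r: r["pageKey"] in {"profile_page",
                                                  "new_profile_page"},
    },
}

def read_view(logical_name, records_by_path):
    """Resolve a logical view: locate the physical data, apply the view."""
    entry = CATALOG[logical_name]
    rows = records_by_path[entry["path"]]  # stand-in for a filesystem read
    return [r for r in rows if entry["view_filter"](r)]

physical = {"/data/tracking/PageViewEvent/daily": [
    {"pageKey": "profile_page"},
    {"pageKey": "home"},
    {"pageKey": "new_profile_page"},
]}
print(len(read_view("tracking.profile_views", physical)))  # 2
```

When the physical path, format, or pageKey set changes, only the catalog entry is updated; every consumer of the logical view keeps working unchanged.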
54. A Few Hard Problems
- Versioning: views and UDFs; mapping to Hive metastore entities
- Development lifecycle: Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment
55. State of the world today
~300 views; pretty much all new UMP metrics use Dali data sources:
ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, ...
(Diagram: data-processing frameworks reading Dali views over data at rest)
56. Now brewing: Dali on Kafka
Can we take the same views and run them seamlessly on Kafka as well?
Stream data via standard streaming APIs:
- Samza System Consumer
- Kafka Consumer
57. What’s next for Dali?
- Selective materialization
- Open source
- Hive is an implementation detail, not a long-term bet
58. Dali: When are we done dreaming?
(Vision: Dali spanning data both at rest and in motion, underneath all data-processing frameworks)
62. Our Journey towards #DataScienceHappiness
2010 →: basic data infrastructure for data democracy (Kafka, Hadoop, Gobblin, Pinot)
2013 →: platforms and process to standardize produce + consume (Tracking, UMP, DMRC)
2015 →: tech + process to sustain a healthy data ecosystem (Dali, Dialogue)
2015 →: evangelize investing in #DataScienceHappiness
63. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
Thank You!
to be continued…