2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem, explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts), and show how to standardize data effectively in a growing organization to enable (rather than slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API, as a protective and enabling shield. Along the way, they share observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets, and offer an overview of LinkedIn's governance model: the tools, processes, and teams that ensure its data ecosystem can handle change and sustain #DataScienceHappiness.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn
1. Building a healthy data ecosystem
around Kafka and Hadoop:
Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
2. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
5. Achieving Data Democracy
"… helping everybody to access and understand data… breaking down silos… providing access to data when and where it is needed at any given moment."
- Collect, flow, and store as much data as you can
- Provide efficient access to data in all its stages of evolution
6. The forms of data
Key-Value++ | Message Bus | Fast-OLAP | Search | Graph | Crunchable
Systems: Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob / Data Lake Storage
7. The forms of data: In Motion vs. At Rest
(Diagram: the same systems split by whether data is in motion or at rest: Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob / Data Lake Storage)
8. The forms of data: scale
In Motion: O(10) clusters, ~1.7 trillion messages, ~450 TB
At Rest: O(10) clusters, ~10K machines, ~100 PB
10. Data Integration: key requirements
- Source and sink diversity
- Batch + streaming
- Data quality
So, we built…
11. Simplifying Data Integration @LinkedIn
Sources: SFTP, JDBC, REST, Azure Blob / Data Lake Storage, and more
- Hundreds of TB per day
- Thousands of datasets
- ~30 different source systems
- 80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet, and many more…
Apache incubation under way
19. How does LinkedIn build data-driven products?
A Data Scientist, PM, Designer, and Engineer collaborate:
"We should enable users to filter connection suggestions by company."
"How much do people utilize existing filter capabilities?"
"Let's see how users send connection invitations today."
21. Tracking data records user activity (powers metrics and data products)
Example: InvitationClickEvent()
Scale fact: ~1,000 tracking event types; hundreds of metrics & data products
23. Tracking Data Lifecycle & Teams
Produce → Transport → Consume
- Product or app teams (PMs, developers, TestEng): production code
- Infra teams (Hadoop, Kafka, DWH, ...): user engagement tracking data
- Data science teams (Analytics, ML engineers, ...): metric scripts
Outputs: member-facing data products (members) and business-facing decision making (execs)
24. How do we calculate a metric: ProfileViews

PageViewEvent Record 1:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "profile_page",
  "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}

Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page)

PageViewEvent Record 101:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "new_profile_page",
  "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}

The filter now has to become "where pagekey = profile_page or new_profile_page", but the producer forgot to notify the metric owners. Undesirable.
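The fragility on this slide can be sketched in a few lines of Python (a toy illustration, not LinkedIn's actual metric pipeline; field values follow the records shown):

```python
# Toy sketch: a metric that filters on a hard-coded pageKey silently
# undercounts once the producer renames the page without telling anyone.

events = [
    {"memberId": 12345, "pageKey": "profile_page"},      # old app version
    {"memberId": 67890, "pageKey": "new_profile_page"},  # rewritten app
]

def profile_views(records, page_keys):
    """Count PageViewEvents whose pageKey is in the accepted set."""
    return sum(1 for r in records if r["pageKey"] in page_keys)

# Original metric definition: misses the renamed page entirely.
print(profile_views(events, {"profile_page"}))                      # 1
# Patched definition, after the (belated) notification.
print(profile_views(events, {"profile_page", "new_profile_page"}))  # 2
```

Nothing crashes in the first call, which is exactly the problem: the metric just drifts down until someone notices.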
25. Metrics ecosystem at LinkedIn: 3 years ago
- Operational challenges for infra teams
- Diminished trust due to multiple sources of truth
26. What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed.
2. "Naming things is hard": different semantics and conventions across each team's events → had to email around to figure out the correct and complete logic to use → inefficient and potentially wrong.
3. Discrepant metric logic: duplicate tech allowed duplicate logic, which allowed discrepant metric logic.
So how did we solve this?
27. Data Modeling Tip
Say no to fragile formats or schema-free data.
Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc.
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
We chose Avro as our format.
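To see why a declared schema acts as a contract, here is a toy sketch in plain Python (hypothetical field list; a real pipeline would use the Avro/Protobuf/Thrift tooling itself, which also handles evolution and binary encoding):

```python
# Toy schema enforcement: reject bad records at write time instead of
# letting them break downstream metric scripts. Field names are
# illustrative, mirroring the PageViewEvent records earlier in the deck.

PAGE_VIEW_SCHEMA = {
    "memberId": int,
    "time": int,
    "appName": str,
    "pageKey": str,
}

def validate(record, schema):
    """Check a record against the declared contract before writing it."""
    for field, ftype in schema.items():
        if field not in record:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(record[field], ftype):
            raise TypeError(f"field {field} must be {ftype.__name__}")
    return record

ok = validate(
    {"memberId": 12345, "time": 1454745292951,
     "appName": "LinkedIn", "pageKey": "profile_page"},
    PAGE_VIEW_SCHEMA,
)

try:
    validate({"memberId": "12345"}, PAGE_VIEW_SCHEMA)  # wrong type
except (ValueError, TypeError) as err:
    print("rejected at write time:", err)
```

The point of the slide stands even in this sketch: the producer, not the consumer, finds out first when data stops matching the contract.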
28. Sometimes you need a committee: the Data Model Review Committee (DMRC)
- Leads from product and infra teams
- Reviews each new data model
- Ensures it follows our conventions, patterns, and best practices across the entire data lifecycle
- Tooling to codify conventions
"Always be reviewing"
29. Unified Metrics Platform
A single source of truth for all business metrics at LinkedIn:
- a metrics processing platform as a service
- a metrics computation template
- a set of tools and processes to facilitate the metrics life-cycle
Workflow (metric owner, with the central team and relevant stakeholders): 1. iterate in a sandbox → 2. create the metric definition → 3. review → 4. check in to the code repo → build & deploy (core metrics job, system jobs)
Scale: 5,000 metrics daily
30. Unified Metrics Platform: Pipeline
Input: tracking + database + other raw data on HDFS
UMP Harness (metrics logic): incremental aggregate, backfill, auto-join
Output: aggregated data (HDFS, Pinot) → Raptor dashboards, experiment analysis, machine learning, anomaly detection, ad-hoc queries
31. Tracking Platform: standardizing production
Client-side tracking from frontend services flows into Kafka, with schema-compatibility checks, time audit, and supporting tools.
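A minimal sketch of the kind of schema-compatibility gate such a platform can run before accepting a new event schema (a hypothetical rule set, not the real Avro resolution rules): a new version may add fields, but may not drop or re-type existing ones.

```python
# Hypothetical backward-compatibility check over two schema dicts
# (field name -> type name). Additive changes pass; destructive ones fail.

def breaking_changes(old, new):
    """List the changes in `new` that would break existing consumers."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"dropped field: {field}")
        elif new[field] != ftype:
            problems.append(f"re-typed field: {field} ({ftype} -> {new[field]})")
    return problems  # fields only in `new` are allowed: additive change

v1 = {"memberId": "long", "pageKey": "string"}
v2 = {"memberId": "long", "pageKey": "string", "locale": "string"}  # additive
v3 = {"memberId": "string", "locale": "string"}                     # breaking

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))
```

Running this in CI for every schema change is one way to turn "please notify consumers" from a social norm into an enforced check.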
33. What was still causing unhappiness?
1. Old bad data sticks around (e.g. old mobile app versions)
2. No clear contract for data production: producers unaware of consumers' concerns
3. Never a good time to pay down this tech debt
We started from the bottom.
34. (Recap of the lifecycle diagram from slide 23: product/app teams produce, infra teams transport, data science teams consume; outputs serve members and execs.)
3. Never a good time to pay down this "data" debt: features are waiting to ship to members, and some of this work is invisible. But what is the cost of not doing it?
#victimsOfTheData → #DataScienceHappiness, via proactively forging our own data destiny.
35. The Big Problem Opportunity in 2015
Launch a completely rewritten LinkedIn mobile app
37. There were two options:
1. Keep the old tracking:
   a. Cost: producers (try to) replicate it, rewriting bad old code from scratch
   b. Save: consumers avoid migrating
2. Evolve:
   a. Cost: time on clean data modeling and on consumer migration to new tracking events
   b. Save: pays down data-modeling tech debt
38. How much work would it be? #DataScienceHappiness
40. 2. No clear contract existed for data production
Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence.
Data producers (PMs, app developers) own the artifact that feeds the data ecosystem (and data scientists!); data consumers (DS) depend on it.
#victimsOfTheData → #DataScienceHappiness, via proactive joint requirements definition
41. 2a. Ensure dialogue between producers & consumers
• Awareness: train teams on the end-to-end data pipeline and data modeling
• Instill a communication and collaborative-ownership process between all: a step-by-step playbook for who develops and owns tracking, and how
42. 2b. Standardized core data entities
• Event types and names: Page, Action, Impression
• Framework-level client-side tracking: views, clicks, flows
• For all else (custom): a guide for when to create a new event
Examples: Navigation, Page View, Control Interaction
43. 2c. Created clear, maintainable data-production contracts
A tracking specification with monitoring and alerting for adherence: a clear, visual, consistent contract.
Tooling is needed to support the culture and process shift: "Always be tooling".
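One way to make such a spec machine-checkable (a hypothetical sketch, not LinkedIn's actual tracking-specification tool): express the contract as data, then alert when live events drift from it.

```python
# Hypothetical tracking spec: required keys and allowed values per event
# type. A monitor runs check_event over sampled production traffic and
# alerts on any non-empty violation list.

SPEC = {
    "PageViewEvent": {
        "required": {"memberId", "pageKey"},
        "allowed_page_keys": {"profile_page", "new_profile_page"},
    },
}

def check_event(name, payload, spec=SPEC):
    """Return the spec violations for one event (empty list = adherent)."""
    rules = spec.get(name)
    if rules is None:
        return [f"unknown event type: {name}"]
    violations = [f"missing {k}" for k in rules["required"] - payload.keys()]
    if payload.get("pageKey") not in rules["allowed_page_keys"]:
        violations.append(f"unexpected pageKey: {payload.get('pageKey')}")
    return violations

print(check_event("PageViewEvent", {"memberId": 1, "pageKey": "profile_page"}))  # []
print(check_event("PageViewEvent", {"pageKey": "home"}))
```

The spec file itself then becomes the reviewable, versioned contract between producers and consumers.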
49. We built Dali to solve this
A Data Access Layer for LinkedIn: abstract away the underlying physical details so that users can focus solely on the logical concerns.
- Logical tables + views
- Logical filesystem
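The abstraction the slide describes can be sketched like this (a toy illustration with hypothetical names and paths; Dali itself is built on Hive views and a logical filesystem, not this code):

```python
# Toy data-access layer: consumers address a logical dataset name, and
# the catalog resolves the physical location and view logic behind it.

CATALOG = {
    "tracking.profile_views": {
        "path": "/data/tracking/PageViewEvent/daily",   # hypothetical path
        "view_filter": lambda r: r["pageKey"] in {"profile_page",
                                                  "new_profile_page"},
    },
}

def read_view(logical_name, records_by_path):
    """Resolve a logical view: locate the physical data, apply the view."""
    entry = CATALOG[logical_name]
    rows = records_by_path[entry["path"]]  # stand-in for a filesystem read
    return [r for r in rows if entry["view_filter"](r)]

physical = {"/data/tracking/PageViewEvent/daily": [
    {"pageKey": "profile_page"},
    {"pageKey": "home"},
    {"pageKey": "new_profile_page"},
]}
print(len(read_view("tracking.profile_views", physical)))  # 2
```

When the physical path, format, or pageKey set changes, only the catalog entry is updated; every consumer of the logical view keeps working unchanged.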
54. A Few Hard Problems
- Versioning: views and UDFs; mapping to Hive metastore entities
- Development lifecycle: Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment
55. State of the world today
~300 views; pretty much all new UMP metrics use Dali data sources:
ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, ...
(Diagram: data-processing frameworks reading Dali views over data at rest)
56. Now brewing: Dali on Kafka
Can we take the same views and run them seamlessly on Kafka as well?
Stream data via standard streaming APIs:
- Samza System Consumer
- Kafka Consumer
57. What’s next for Dali?
- Selective materialization
- Open source
- Hive is an implementation detail, not a long-term bet
58. Dali: When are we done dreaming?
(Vision: Dali spanning data both at rest and in motion, underneath all data-processing frameworks)
62. Our Journey towards #DataScienceHappiness
2010 →: basic data infrastructure for data democracy (Kafka, Hadoop, Gobblin, Pinot)
2013 →: platforms and process to standardize produce + consume (Tracking, UMP, DMRC)
2015 →: tech + process to sustain a healthy data ecosystem (Dali, Dialogue)
2015 →: evangelize investing in #DataScienceHappiness
63. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
Thank You!
to be continued…