STREAM PROCESSING IN UBER MARKETPLACE
~ 68 countries / 350+ cities
Transportation as reliable as running water, everywhere, for everyone
Agenda
What’s on the menu?
• Use Cases
• Problem Space
• Overall Architecture
• Choices & Tradeoffs
• Q & A
Use Case: Realtime OLAP
There is always a need for quick exploration
How many open cars are there in the world, NOW?
Streaming Processing in Uber Marketplace for Kafka Summit 2016
How many UberXs were driving clients in SF in the past 10 minutes by hexagons?
Driving time and other metrics over time by hexagonal area
Use Case: Complex Event Processing
There are patterns in event streams
How many drivers cancel requests more than 3 times in a row within a 10-minute window?
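The cancellation pattern above can be sketched as a small stateful check per driver. This is only an illustration, not the rule engine used in the talk: the event layout `(driver_id, ts, event_type)`, the `"cancel"` label, and the function name are all assumptions, and events are assumed to arrive in timestamp order per driver.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60   # 10-minute window from the slide
THRESHOLD = 3              # "more than 3 times in a row"

def drivers_with_repeated_cancels(events):
    """Return drivers that cancel more than THRESHOLD times in a row
    within WINDOW_SECONDS. Events are hypothetical
    (driver_id, ts, event_type) tuples, time-ordered per driver."""
    streaks = defaultdict(deque)   # driver_id -> timestamps of the current cancel streak
    flagged = set()
    for driver_id, ts, event_type in events:
        streak = streaks[driver_id]
        if event_type != "cancel":
            streak.clear()         # any other event breaks the streak
            continue
        streak.append(ts)
        # Drop cancels that have fallen out of the window.
        while ts - streak[0] > WINDOW_SECONDS:
            streak.popleft()
        if len(streak) > THRESHOLD:
            flagged.add(driver_id)
    return flagged
```

Keeping only the current streak per driver bounds the state to a handful of timestamps per key, which is what makes this kind of rule cheap to run on a stream.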
Which riders request pickups 100 miles apart within a half-hour window?
IF
This —>
Then that —>
● Sigma is similar - but for offline/batch applications
Complex Event Processing
Use Case: Supply Positioning
Clusters Of Supply & Demand
Predicted Health Metrics
Actual Health Metrics
Monitor Marketplace Health
Challenges
OLAP of Geo-spatial Temporal Data
Reasonably Large Scale
Near Real Time
• Indexing, Lookup, Rendering
• Symmetric Neighbors
• Convex & Compact Regions
• Equal Areas
• Equal Shape
Hexagons
Scale
Geo Space x Vehicle Types x Time x Status
Granular Geo Areas
Over 10,000 hexagons in a city
Multiple Vehicle Types
7 vehicle types
Minute-level Time Buckets
1440 minutes in a day
Many Driver States
13 driver states
Many Cities
300 cities
Granular Data
1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
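The arithmetic behind that figure can be checked directly. The dimension sizes below are the ones quoted on the preceding slides; the dictionary keys are just illustrative labels.

```python
from math import prod

# Dimension sizes taken from the slides above.
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations_per_day = prod(dimensions.values())
print(f"{combinations_per_day:,}")  # 393,120,000,000
```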
Unknown Query Patterns
Any combination of dimensions
Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
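As a rough illustration of the aggregations listed above, here is a minimal sketch over in-memory data. The function names and input shapes are assumptions made for the example; a real OLAP store would compute these over indexed data rather than Python collections.

```python
from collections import Counter

def top_n(counts, n):
    """Top N keys by count, highest first."""
    return Counter(counts).most_common(n)

def histogram(values, bucket_width):
    """Count values per bucket; each key is the bucket's lower bound."""
    buckets = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(buckets.items()))

def percent(part, total):
    """Share of total as a percentage; 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0
```

For example, `top_n` over per-hexagon counts gives the busiest areas, and `histogram` over trip durations with a 5-minute bucket width gives a duration distribution.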
Large Data Volume
• Hundreds of thousands of events per second

• At least dozens of fields in each event
Multiple Topics
Rider States Driver States
Let’s build a stream processing pipeline
Accurate Statistics
• E.g., can’t overcount
Pipeline Template
Event Collection
Multiple Event Types with Different Volume
Hundreds of Thousands of Events Per Second
Events Should Be Available Under a Second
Events Should Rarely Get Lost
Multiple Consumers
Natural Choice: Apache Kafka
- Low latency and high throughput
- Persistent events
- Distributes a topic by partitions
- Groups consumers by consumer groups
Event Processing
Transformation
Event Transformation Example
(Lat, Long) -> (zipcode, hexagon, S2)
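To make the hexagon part of this transformation concrete, here is a minimal sketch that maps a coordinate to a cell in a flat hexagonal grid. It is not the indexing scheme used in the talk and does not cover zipcode or S2 lookup: the equirectangular projection, the 200 m cell size, the axial `(q, r)` cell id, and all function names are assumptions chosen to keep the example self-contained.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def to_local_meters(lat, lon, origin_lat, origin_lon):
    """Equirectangular approximation; adequate only near the origin."""
    x = EARTH_RADIUS_M * math.radians(lon - origin_lon) * math.cos(math.radians(origin_lat))
    y = EARTH_RADIUS_M * math.radians(lat - origin_lat)
    return x, y

def hexagon_id(x, y, size_m):
    """Axial (q, r) coordinates of the pointy-top hexagon containing (x, y).
    size_m is the center-to-corner distance."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size_m
    r = (2 / 3 * y) / size_m
    # Round in cube coordinates (x + y + z = 0), then recompute the
    # component with the largest rounding error to restore the constraint.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def transform(event, origin, size_m=200.0):
    """Return a copy of the event with a 'hexagon' field added."""
    x, y = to_local_meters(event["lat"], event["lon"], *origin)
    return {**event, "hexagon": hexagon_id(x, y, size_m)}
```

Doing this once at ingestion time means every downstream stage can group by a discrete cell id instead of repeating geometric lookups.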
Pre-aggregation
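A minimal sketch of what pre-aggregation could look like, assuming each event is a dict with hypothetical `driver_id`, `hexagon`, `vehicle_type`, `status`, and `ts` (epoch seconds) fields. It counts distinct drivers per bucket rather than raw events, so that several location updates from one driver in the same minute are not counted more than once.

```python
from collections import defaultdict

def pre_aggregate(events):
    """Distinct drivers per (hexagon, vehicle_type, status, minute)."""
    drivers = defaultdict(set)
    for e in events:
        minute = e["ts"] // 60 * 60   # start of the minute, in epoch seconds
        key = (e["hexagon"], e["vehicle_type"], e["status"], minute)
        drivers[key].add(e["driver_id"])
    return {key: len(ids) for key, ids in drivers.items()}
```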
Joining Multiple Streams
Sessionization
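Sessionization can be illustrated with a gap-based rule: events from the same user belong to one session until the user is inactive for longer than some timeout. The sketch below works on a finite list for clarity, whereas a streaming job would keep the last-seen timestamp per user in its state store. The `(user_id, ts)` layout and the 30-minute gap are assumptions.

```python
def sessionize(events, gap_seconds=30 * 60):
    """Split (user_id, ts) events into per-user sessions separated by
    more than gap_seconds of inactivity."""
    sessions = {}   # user_id -> list of sessions, each a list of timestamps
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)     # continues the current session
        else:
            user_sessions.append([ts])       # gap exceeded: start a new session
    return sessions
```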
Multi-Staged Processing
Minimum Requirements
- State Management
- Checkpointing
- Automatic Resource Management
- Multi-staged processing
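To show why checkpointing is on this list, here is a minimal sketch of offset checkpointing. It is not how any particular framework implements it: a Python list stands in for a partition log, and the JSON file format and function names are assumptions. After a crash, processing resumes from the last saved offset, so messages handled after that checkpoint are processed again (at-least-once behavior).

```python
import json
import os

def save_checkpoint(path, next_offset):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, path)

def process_with_checkpoint(messages, checkpoint_path, handle, every=100):
    """Process messages starting at the last checkpointed offset and
    save the offset every `every` messages and at the end of the log."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]
    for offset in range(start, len(messages)):
        handle(messages[offset])
        if (offset + 1) % every == 0 or offset + 1 == len(messages):
            save_checkpoint(checkpoint_path, offset + 1)
```

A larger `every` reduces checkpoint overhead but increases how much work is repeated after a restart.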
Apache Samza
Why Apache Samza?
- DAG on Kafka
- Excellent integration with Kafka
- Built-in checkpointing
- Built-in state management
- Excellent support from our data team
Samza Is Conceptually Simple
Slightly Expanded Version
Applications
Dashboard of Realtime Business Metrics
Ad-Hoc Queries
Visualization with Streaming
LocationUpdate where city = X
LocationUpdate where city = Y and vehicle = ‘UberX’
[Figure labels: 100%, 100%, 100%, 10%, 5%]
Visualization with Streaming
LocationUpdate where city = ‘SF’
LocationUpdate where city = ‘LA’ and vehicle
[Figure labels: 10%, 5%, 100%, 100%]
Ad-hoc Exploration
A Few Trade-Offs
Lambda vs Kappa
We Use Lambda
- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but
- We may need to reprocess data from far back when business requirements change
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance
Processing by Event Time Is Not Always Easy
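One reason event-time processing is hard is that events can arrive out of order. The sketch below shows a generic allowed-lateness approach, which is an assumption for illustration and not necessarily what the speakers did: events are counted in the minute bucket of their own timestamp, and events that arrive more than `allowed_lateness` seconds behind the newest event seen are dropped and tallied. The class and field names are hypothetical.

```python
from collections import Counter

class EventTimeCounter:
    """Counts events per event-time minute and tolerates out-of-order
    arrival up to allowed_lateness seconds behind the newest event seen."""

    def __init__(self, allowed_lateness=120):
        self.allowed_lateness = allowed_lateness
        self.counts = Counter()    # minute start (epoch seconds) -> event count
        self.max_event_ts = 0
        self.dropped = 0

    def add(self, event_ts):
        self.max_event_ts = max(self.max_event_ts, event_ts)
        if event_ts < self.max_event_ts - self.allowed_lateness:
            self.dropped += 1      # too late: its bucket is treated as closed
            return
        self.counts[event_ts // 60 * 60] += 1
```

The trade-off is visible in the single parameter: a larger lateness bound loses fewer events but delays the point at which a bucket can be considered final.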
Leverage The Storage Layer
Dealing with Limitations of Samza
- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can’t have an arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. We built an in-house code-gen and deployment solution
Thank You
