Streaming Processing in Uber Marketplace for Kafka Summit 2016

1. STREAM PROCESSING IN UBER MARKETPLACE

2. ~ 68 countries / 350+ cities. Transportation as reliable as running water, everywhere, for everyone

3. Agenda: What’s on the menu? • Use Cases • Problem Space • Overall Architecture • Choices & Tradeoffs • Q & A

4. Use Case: Realtime OLAP

5. There is always a need for quick exploration

6. How many open cars in the world, NOW?

8. How many UberXs were driving clients in SF in the past 10 minutes by hexagons?

10. Driving time and other metrics over time by hexagonal area

12. Use Case: Complex Event Processing

13. There are patterns in event streams

14. How many drivers cancel requests more than 3 times in a row within a 10-minute window?
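The cancellation pattern above can be sketched in a few lines. This is only an illustration, not the deck's actual CEP engine: the event shape `(timestamp_seconds, driver_id, event_type)`, the `"cancel"` label, and the function name are assumptions made for the example, and "more than 3 times in a row" is read as at least 4 consecutive cancels.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # the 10-minute window from the slide
MIN_CANCELS = 4           # "more than 3 times in a row" read as >= 4


def drivers_with_cancel_streak(events):
    """Return drivers with MIN_CANCELS consecutive cancels inside the window.

    `events` is an iterable of (timestamp_seconds, driver_id, event_type)
    tuples sorted by timestamp (a hypothetical event shape).
    """
    streaks = defaultdict(deque)  # driver_id -> timestamps of consecutive cancels
    flagged = set()
    for ts, driver, kind in events:
        streak = streaks[driver]
        if kind != "cancel":
            streak.clear()  # any other event breaks the run
            continue
        streak.append(ts)
        # Forget cancels that are older than the window.
        while ts - streak[0] > WINDOW_SECONDS:
            streak.popleft()
        if len(streak) >= MIN_CANCELS:
            flagged.add(driver)
    return flagged
```

Keeping one small deque per driver means the state grows with the number of active drivers rather than with the number of events.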

15. Report riders requesting pickups 100 miles apart within a half-hour window
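A rough sketch of this rule follows, assuming each pickup request arrives as a `(timestamp_seconds, rider_id, lat, lng)` tuple and that great-circle (haversine) distance is an acceptable measure of "apart". The function names and the tuple layout are hypothetical, not taken from the deck.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def riders_with_distant_pickups(requests, min_miles=100, window_seconds=30 * 60):
    """Flag riders with two pickup requests >= min_miles apart within the window.

    `requests` is an iterable of (timestamp_seconds, rider_id, lat, lng)
    tuples sorted by timestamp (a hypothetical event shape).
    """
    recent = {}  # rider_id -> [(ts, lat, lng)] still inside the window
    flagged = set()
    for ts, rider, lat, lng in requests:
        history = [h for h in recent.get(rider, []) if ts - h[0] <= window_seconds]
        if any(haversine_miles(lat, lng, hlat, hlng) >= min_miles
               for _, hlat, hlng in history):
            flagged.add(rider)
        history.append((ts, lat, lng))
        recent[rider] = history
    return flagged
```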

16. Complex Event Processing: IF this -> THEN that (Sigma is similar, but for offline/batch applications)

17. Use Case: Supply Positioning

18. Clusters Of Supply & Demand

19. Monitor Marketplace Health: Predicted Health Metrics, Actual Health Metrics

20. Challenges

21. OLAP of Geo-spatial Temporal Data: Reasonably Large Scale, Near Real Time

22. Hexagons • Indexing, Lookup, Rendering • Symmetric Neighbors • Convex & Compact Regions • Equal Areas • Equal Shape

23. Scale: Geo Space x Vehicle Types x Time x Status

24. Granular Geo Areas

25. Granular Geo Areas: Over 10,000 hexagons in a city

26. Multiple Vehicle Types: 7 vehicle types

27. Minute-level Time Buckets: 1440 minutes in a day

28. Many Driver States: 13 driver states

29. Many Cities: 300 cities

30. Granular Data: 1 day of data has 300 x 10,000 x 7 x 1440 x 13 ≈ 393 billion possible combinations
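The figure on this slide is the product of the dimension sizes listed on the preceding slides. A quick check, with variable names chosen only for this illustration:

```python
# Dimension sizes as stated on the slides.
cities = 300
hexagons_per_city = 10_000
vehicle_types = 7
minutes_per_day = 1440
driver_states = 13

combinations = (
    cities * hexagons_per_city * vehicle_types * minutes_per_day * driver_states
)
print(f"{combinations:,}")  # 393,120,000,000
```

That is roughly 393 billion distinct (city, hexagon, vehicle type, minute, driver state) cells for a single day, before any events are counted.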

31. Unknown Query Patterns: Any combination of dimensions

32. Variety of Aggregations - Heatmap - Top N - Histogram - count(), avg(), sum(), percent(), geo

33. Large Data Volume • Hundreds of thousands of events per second • At least dozens of fields in each event

34. Multiple Topics: Rider States, Driver States

35. Let’s build a stream processing pipeline

36. Accurate Statistics • E.g., can’t overcount

37. Pipeline Template

38. Event Collection

39. Multiple Event Types with Different Volume

40. Hundreds of Thousands of Events Per Second

41. Events Should Be Available Under a Second

42. Events Should Rarely Get Lost

43. Multiple Consumers

45. Natural Choice: Apache Kafka - Low latency and high throughput - Persistent events - Distributes a topic by partitions - Groups consumers by consumer groups
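The last two points, partitioned topics and consumer groups, can be illustrated with a toy model. This is a pure-Python sketch and does not call Kafka: the CRC32 hash and the round-robin assignment are stand-ins chosen for the example, not Kafka's own partitioner or group-assignment strategy, and the function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Spread a topic's partitions round-robin over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Each consumer group sees every partition of the topic independently,
# so two groups can read the same events at their own pace.
olap_group = assign_partitions(4, ["olap-1", "olap-2"])
cep_group = assign_partitions(4, ["cep-1"])
```

Events with the same key always land in the same partition, so per-key ordering is preserved for whichever consumer owns that partition.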

47. Event Processing

48. Transformation

49. Event Transformation Example (Lat, Long) -> (zipcode, hexagon, S2)
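As an illustration of the hexagon part of this transformation, the sketch below maps a coordinate to the axial `(q, r)` index of a pointy-top hexagon using cube rounding. It is a simplification under stated assumptions: latitude and longitude are treated as planar coordinates, the cell size `size_deg` is arbitrary, and the function names are hypothetical. It is not the indexing scheme used in the deck, and it does not cover the zipcode or S2 lookups.

```python
import math


def _cube_round(q: float, r: float) -> tuple:
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so that
    # q + r + s == 0 still holds.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def hexagon_id(lat: float, lng: float, size_deg: float = 0.01) -> tuple:
    """Return the axial (q, r) index of the hexagon containing (lat, lng)."""
    x, y = lng, lat  # planar approximation (assumption)
    q = (math.sqrt(3) / 3 * x - y / 3) / size_deg
    r = (2 / 3 * y) / size_deg
    return _cube_round(q, r)


def transform(event: dict) -> dict:
    """Enrich an event that has 'lat' and 'lng' fields with a 'hexagon' field."""
    return {**event, "hexagon": hexagon_id(event["lat"], event["lng"])}
```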

50. Pre-aggregation
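A minimal sketch of what pre-aggregation could look like for the dimensions listed earlier (city, hexagon, vehicle type, minute bucket, status). The event field names and the choice of a simple count are assumptions for the example; the deck does not show its aggregation code.

```python
from collections import Counter


def pre_aggregate(events):
    """Count events per (city, hexagon, vehicle_type, minute, status) cell.

    Each event is a dict with hypothetical fields 'ts' (epoch seconds),
    'city', 'hexagon', 'vehicle_type' and 'status'.
    """
    counts = Counter()
    for event in events:
        minute_bucket = event["ts"] // 60  # minute-level time bucket
        key = (
            event["city"],
            event["hexagon"],
            event["vehicle_type"],
            minute_bucket,
            event["status"],
        )
        counts[key] += 1
    return counts
```

Downstream queries can then sum these per-minute cells instead of scanning raw events.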

51. Joining Multiple Streams

52. Sessionization
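One common way to sessionize a stream is to split each key's events wherever the gap between consecutive events exceeds a threshold. The sketch below assumes that approach; the 30-minute gap, the `(timestamp_seconds, user_id)` event shape, and the function name are illustrative choices, not details from the deck.

```python
def sessionize(events, gap_seconds=30 * 60):
    """Group each user's events into sessions separated by idle gaps.

    `events` is an iterable of (timestamp_seconds, user_id) tuples sorted by
    timestamp. Returns {user_id: [[ts, ...], ...]}, one inner list per session.
    """
    sessions = {}
    for ts, user in events:
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])    # idle gap exceeded: start a new session
    return sessions
```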

53. Multi-Staged Processing

54. Minimum Requirements - State Management - Checkpointing - Automatic Resource Management - Multi-staged processing

55. Apache Samza

56. Why Apache Samza? - DAG on Kafka - Excellent integration with Kafka - Built-in checkpointing - Built-in state management - Excellent support from our data team

57. Samza Is Conceptually Simple

58. Complex Event Processing: IF this -> THEN that (Sigma is similar, but for offline/batch applications)

59. Complex Event Processing (Sigma is similar, but for offline/batch applications)

63. Slightly Expanded Version (Sigma is similar, but for offline/batch applications)

68. Applications

69. Dashboard of Realtime Business Metrics

70. Ad-Hoc Queries

71. Visualization with Streaming

72. Visualization with Streaming: LocationUpdate where city = X; LocationUpdate where city = Y and vehicle = ‘UberX’ (figure labels: 100%, 100%, 100%, 10%, 5%)

77. Visualization with Streaming: LocationUpdate where city = ‘SF’; LocationUpdate where city = ‘LA’ and vehicle (figure labels: 10%, 5%, 100%, 100%)

78. Ad-hoc Exploration

79. A Few Trade-Offs

80. Lambda vs Kappa

81. We Use Lambda - Spark + HDFS/S3 for batch processing - Yes, it is painful, but - We may need to go way back due to changes in business requirements - Batch processes can run faster; they scale differently - It was not easy to start a new stream processing instance

82. Processing by Event Time Is Not Always Easy

83. Leverage The Storage Layer

84. Dealing with Limitations of Samza - No broadcasting. We have to override SystemStreamPartitionGrouper - No dynamic topology. Can’t have an arbitrary number of nested CEP queries - Tedious configuration and deployment of jobs. In-house code-gen and deployment solution

85. Thank You
