1

Sawtooth Windows
Zipline - Feature Engineering Framework
Nikhil Simha
nikhil.simha@airbnb.com

2

Features in context
• Machine Learning
• Supervised
• Structured data – database records, event streams
• Not unstructured data – images, video, audio, text
• Not labels

3

[Diagram: ML workflow. Stage labels: Exploration, Problem, Feature Creation, Model Training, Model Serving, Feature Serving, Application, Labeling]

4

Rules of thumb
• Complex models > simple models
• Can learn complicated relationships within data

5

Rules of thumb
• Good data >> bad data
• Labels: true, balanced
• Features:
• Consistent
• Real-time
• Stable

6

Rules of thumb
• Simple models + good data >> complex models + bad data
• Effort spent on better data >> effort spent on a better model
• Real-time features are hard
• Windowed aggregations are unsupported/inefficient
• Training/serving consistency

7

Hardness of real-time features
• Inadequate data sources
• Event sources: don’t go back in history
• Database sources: range scans are very expensive
• Skill gap
• ML vs. systems engineering
• Missing backfills – slow iteration

8

Goal
• Features should be real-time
• Features are aggregations
• Most aggregations should be windowed
• Sawtooth windows

9

Example
● Restaurant recommendation
● Ratings of the restaurant over the last year
● Check-ins of the user by cuisine in the last month
● Latest cuisine check-in by user

10

[Figure: timeline (Time axis) of Checkins and Ratings events, with prediction points P1 and P2 and labels L that make up the training data set]

11

Contract
● Serving
● User, Restaurant -> avg_restaurant_rating_1yr, cuisine_visits_30d
● Training
● Labeled Data: (User, Restaurant, timestamp, label)
● Enrich with features
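
The training side of this contract can be sketched as a point-in-time lookup: for each labeled row, the feature is computed only from events that happened before the row's timestamp. This is a minimal plain-Python illustration, not Zipline's API; the function names, tuple layouts, and sample data are all hypothetical.

```python
from datetime import datetime, timedelta

def avg_rating_1yr(ratings, restaurant, as_of):
    """Average rating of `restaurant` over the 365 days ending at `as_of`.

    `ratings` is a list of (restaurant, timestamp, rating) tuples
    (a hypothetical layout). Only events strictly before `as_of` are
    used, so a training row never sees data from its own future.
    """
    start = as_of - timedelta(days=365)
    values = [rating for (rest, ts, rating) in ratings
              if rest == restaurant and start <= ts < as_of]
    return sum(values) / len(values) if values else None

def enrich(labeled_rows, ratings):
    """Attach the feature to each (user, restaurant, timestamp, label) row."""
    return [
        (user, rest, ts, label, avg_rating_1yr(ratings, rest, ts))
        for (user, rest, ts, label) in labeled_rows
    ]

# Hypothetical sample data.
ratings = [
    ("r1", datetime(2019, 1, 10), 4),
    ("r1", datetime(2019, 3, 5), 2),
    ("r1", datetime(2019, 6, 1), 3),  # after the label timestamp: ignored
]
labeled = [("u1", "r1", datetime(2019, 4, 1), 1)]
enriched = enrich(labeled, ratings)
```

Serving would evaluate the same definition with `as_of` set to the request time, which is what keeps training and serving consistent.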

12

Data sources
● Events
● Timestamped – user_txn stream
● Entities
● As served by microservices, etc.
● Based on DB
● user_balance table
● Or non-real-time: dim/fct tables

13

[Diagram: data sources. Labels: Service Fleet, Production Database, DB Snapshot, Event log, Change Capture Stream, Event Stream, Change capture log, Message Bus, Data Lake, Live, Derived Data, Media]

14

Feature Set Example

15

Feature Set Example

16

Feature Set Example

17

API – Philosophy
• SQL is two languages
• Keep the expression language
• CAST(get_json_object(response, '$.age') AS BIGINT)
• Control the structural language
• GROUP BY, JOIN, HAVING, SELECT, WHERE, FROM

18

API – Philosophy
Windows are first class
Source equivalence: topic ~ table ~ mutations
Data Models are first class
Entity (dim)
Events (fact, timestamped)

19

API – Internals
• Python -> Thrift JSON -> Spark + Scala
• Versioned
• Driven by Airflow

20

Aggregation Math

21

Aggregations – SUM
• Commutative: a + b = b + a
• Order independent
• Associative: (a + b) + c = a + (b + c)
• Parallelizable
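
A small sketch of why these two properties matter: the input can be split into chunks, each chunk summed independently (for example on different workers), and the partial sums combined in any order. The helper name `chunked` and the sample values are illustrative only.

```python
from functools import reduce
import operator

def chunked(values, n):
    """Split `values` into consecutive chunks of at most `n` items."""
    return [values[i:i + n] for i in range(0, len(values), n)]

values = [3, 1, 4, 1, 5, 9, 2, 6]

# Each chunk could be summed on a different worker (associativity).
partials = [sum(chunk) for chunk in chunked(values, 3)]

# The partial sums can be combined in any order (commutativity);
# here they are deliberately combined in reverse.
total = reduce(operator.add, reversed(partials), 0)
```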

22

Aggregations – AVG
• One not-so-clever trick
• Operate on “Intermediate Representation” / IR
• Factors into (sum, count)
• Finalized by a division: (sum/count)
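
The IR idea can be sketched as three small functions: update an IR with a value, merge two IRs, and finalize. The function names are hypothetical, not Zipline's API. Merging (sum, count) pairs gives the correct average, whereas averaging two averages generally does not.

```python
def update(ir, value):
    """Fold one value into a (sum, count) intermediate representation."""
    s, c = ir
    return (s + value, c + 1)

def merge(a, b):
    """Combine two IRs; commutative and associative like plain SUM."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(ir):
    """Turn the IR into the user-facing average."""
    s, c = ir
    return s / c if c else None

EMPTY = (0, 0)

# Two partitions of the same rating stream (sample data).
ir_a = update(update(EMPTY, 4), 2)   # ratings 4 and 2
ir_b = update(EMPTY, 5)              # rating 5

merged = merge(ir_a, ir_b)
correct_avg = finalize(merged)                       # 11 / 3
avg_of_avgs = (finalize(ir_a) + finalize(ir_b)) / 2  # 4.0, which is wrong
```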

23

Aggregations
• Constant memory / Bounded IR
• Two classes of aggregations
• Sum, Avg, Count
• Min/Max, Approx Unique, percentiles, topK
• Mutations – updates, deletes etc.
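
One plausible reading of the two classes (an assumption, since the slide does not spell it out) is that sum, avg, and count have an exact inverse, so a delete can be applied by subtracting from the IR, while min/max and similar aggregations cannot be "un-applied" and need the remaining values to be re-aggregated. A minimal sketch with hypothetical helper names:

```python
def sum_after_delete(sum_ir, deleted_value):
    """SUM is invertible: a delete is just a subtraction on the IR."""
    return sum_ir - deleted_value

def max_after_delete(remaining_values):
    """MAX has no inverse: the IR alone cannot tell what the next-largest
    value was, so the remaining values must be scanned again."""
    return max(remaining_values) if remaining_values else None

events = [3, 7, 5]          # sample data
sum_ir = sum(events)        # 15
max_ir = max(events)        # 7

# Delete the event with value 7.
new_sum = sum_after_delete(sum_ir, 7)
new_max = max_after_delete([v for v in events if v != 7])
```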

24

Windows – Hopping

25

Windows – Hopping
• Staleness
• As stale as the hop size
• Memory Efficient
• One partial per hop
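
A hopping window can be sketched as one partial aggregate per hop, with queries reading only completed hops. That is where the staleness comes from: an event is invisible until its hop closes. The class name, the integer timestamps in seconds, and the parameters are assumptions made for illustration.

```python
from collections import defaultdict

class HoppingSum:
    """Windowed sum that keeps one partial sum per hop."""

    def __init__(self, hop, window_hops):
        self.hop = hop                    # hop size in seconds
        self.window_hops = window_hops    # window size = window_hops * hop
        self.partials = defaultdict(int)  # hop index -> partial sum

    def add(self, ts, value):
        self.partials[ts // self.hop] += value

    def query(self, now):
        head = now // self.hop  # index of the in-progress hop (excluded)
        return sum(self.partials.get(h, 0)
                   for h in range(head - self.window_hops, head))

w = HoppingSum(hop=60, window_hops=3)
for ts, value in [(10, 1), (70, 2), (130, 4), (185, 8)]:
    w.add(ts, value)

stale = w.query(190)   # the event at t=185 is not visible yet
later = w.query(240)   # visible once its hop has completed
```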

26

Windows – Sliding
• Freshness
• Memory intensive
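
For contrast, a sliding window can be sketched by buffering every raw event and evicting events as they age out. Results are always fresh, but memory grows with the number of events in the window rather than the number of hops. The class name and integer timestamps are assumptions for illustration.

```python
from collections import deque

class SlidingSum:
    """Exact sum over the interval (now - window, now]."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # every raw (ts, value) still in the window
        self.total = 0

    def add(self, ts, value):
        # Assumes events arrive in timestamp order.
        self.events.append((ts, value))
        self.total += value

    def query(self, now):
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, value = self.events.popleft()
            self.total -= value
        return self.total

s = SlidingSum(window=180)
for ts, value in [(10, 1), (70, 2), (130, 4), (185, 8)]:
    s.add(ts, value)

at_190 = s.query(190)  # the event at t=10 has just aged out
at_260 = s.query(260)  # the event at t=70 has aged out too
```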

27

Windows – Sawtooth
• Freshness
• Writes are taken into account immediately
• Memory
• Partial aggregates per hop
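
A sawtooth window, as read from these slides, can be sketched as a hopping window whose head includes the in-progress hop: writes are visible immediately, while the tail still advances one hop at a time. The span of time covered therefore oscillates between the window size and the window size plus one hop, which presumably gives the sawtooth shape. The class and method names are hypothetical, and events are assumed never to carry a timestamp later than the query time.

```python
from collections import defaultdict

class SawtoothSum:
    """Windowed sum: hop-aligned tail, real-time head."""

    def __init__(self, hop, window_hops):
        self.hop = hop
        self.window_hops = window_hops
        self.partials = defaultdict(int)  # hop index -> partial sum

    def add(self, ts, value):
        self.partials[ts // self.hop] += value

    def query(self, now):
        head = now // self.hop
        # window_hops completed hops plus the in-progress hop.
        return sum(self.partials.get(h, 0)
                   for h in range(head - self.window_hops, head + 1))

    def covered_seconds(self, now):
        """How much time the window spans at `now`."""
        tail_start = (now // self.hop - self.window_hops) * self.hop
        return now - tail_start

w = SawtoothSum(hop=60, window_hops=3)
for ts, value in [(10, 1), (70, 2), (130, 4), (185, 8)]:
    w.add(ts, value)

fresh = w.query(190)           # the event at t=185 is already included
span_190 = w.covered_seconds(190)
span_240 = w.covered_seconds(240)  # the tail has just hopped forward
```

Memory stays at one partial per hop, as with hopping windows.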

28

Windows – Sawtooth

29

Windows – Sawtooth
• Catch
• sum/count vs others
• Consistency

30

Serving Architecture
[Diagram: Feature Declaration, Streaming aggregates, Batch aggregates, Feature Store, Feature Client, Model, Model Server, Application Server]

31

Windows – Lambda
• Points of change

32

Windows – Lambda

33

Choosing hops
• Automatically chosen
• Hop size < x% of window size
• Daily, hourly, 5-minute
• x ≈ 8.34%
• Caching – a variety of window sizes can reuse the same hops
• 90d, 30d
• Across windows & across queries
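
The selection rule can be sketched as: pick the largest available hop that is smaller than x% of the window. With x ≈ 8.34%, a 5-minute hop just qualifies for a 1-hour window (5/60 ≈ 8.33%). The function name and the fallback to the smallest hop for very short windows are assumptions, not something the slides state.

```python
# Available hop sizes in seconds, largest first.
HOPS = [("daily", 86_400), ("hourly", 3_600), ("5-minute", 300)]
MAX_FRACTION = 0.0834  # x from the slide

def choose_hop(window_seconds):
    """Return the name of the largest hop smaller than x% of the window."""
    for name, size in HOPS:
        if size < MAX_FRACTION * window_seconds:
            return name
    # Assumption: windows too short for any hop use the smallest one.
    return HOPS[-1][0]

DAY = 86_400
hop_90d = choose_hop(90 * DAY)
hop_30d = choose_hop(30 * DAY)
hop_1d = choose_hop(DAY)
hop_1h = choose_hop(3_600)
```

Because the 90d and 30d windows both resolve to daily hops, the same daily partials can be cached and reused across both.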

34

Questions


Sawtooth Windows for Feature Aggregations