Sequence-Aware Recommender Systems
• Dietmar Jannach
– University of Klagenfurt, Austria
• Massimo Quadrana
– Pandora, Italy
Today’s tutorial is based on
• M. Quadrana, P. Cremonesi, D. Jannach,
“Sequence-Aware Recommender Systems”
ACM Computing Surveys, 2018
• Link to paper:
bit.ly/sequence-aware-rs
About you?
Agenda
• 13:30 – 15:00 Part 1
– Introduction, Problem definition
• 15:00 – 15:30 Coffee break
• 15:30 – 16:15 Part 2
– Algorithms
• 16:15 – 16:45 Part 3
– Evaluation
• 16:45 – 17:00
– Closing, Discussion
Part I: Introduction
Recommender Systems
• A central part of our daily user experience
– They help us locate potentially interesting things
– They serve as filters in times of information overload
– They have an impact on user behavior and business
Recommendations everywhere
A field with a tradition
The last 18 years in Recsys research:
• 2000
recommendation = matrix completion
• 2010
recommendation = learning to rank
Common problem abstraction (1):
matrix completion
• Goal
– Learn missing ratings
• Quality assessment
– Error between true and estimated ratings
Real-world problem situations
• User intent
• Short-term intent/context vs. long term taste
• Order can matter
• Order matters (constraints)
• Interest drift
• Reminders
• Repeated purchases
User intent
• Our user searched and listened to
“Last Christmas” by Wham!
• Should we, …
– Play more songs by Wham!?
– More pop Christmas songs?
– More popular songs from the 1980s?
– Play more songs with
controversial user feedback?
User intent
• Knowing the user’s intention can be crucial
Short-term intent/context vs.
long term taste
• Here’s what the customer purchased during the
last few weeks
Short-term intent/context vs.
long term taste
• What to recommend?
• Some plausible options
– Only shoes
– Mostly Nike shoes
– Maybe also some T-shirts
[Figure: products purchased in the past vs. products browsed in the current session]
Short-term intent/context vs.
long term taste
• Using the matrix completion formulation
– One trains a model based only on past actions
– Without the context, the algorithm will probably
recommend mostly (Nike) T-shirts and some trousers
– Is this what you would expect?
[Figure: products purchased in the past vs. products browsed in the current session]
Order can matter
• Next track recommendation
• What to recommend next should suit the
previous tracks
Order matters (order constraints)
• Is it meaningful to recommend Star Wars III, if
the user has not seen the previous episodes?
Interest drift
Reminders
• Should we recommend items that the user
already knows?
• Amazon does
Repeated purchases
• When should we remind users
(through recommendations)?
Algorithm choice depends on …
• The choice of the best recommender algorithm
depends on
– Goal (user task, application domain, context, …)
– Data available
– Target quality
Examples based on goal …
• Movie recommendation
– if you watched an SF movie …
Examples based on goal …
• Movie recommendation
– if you watched an SF movie …
– … you recommend another SF movie
• Algorithm: most similar item
Examples based on goal …
• On-line store recommendations
– if you bought a coffee maker …
Examples based on goal …
• On-line store recommendations
– if you bought a coffee maker …
– … you recommend cups or coffee beans (not
another coffee maker)
• Algorithm: frequently bought together
users who bought … also bought
Examples based on data …
• If you have
– all items with many ratings
– no item attributes
• You need CF
• If you have
– no ratings
– all items with many attributes
• You need CBF
Examples for
sequence-aware algorithms …
• Goal
– next track recommendation, …
• Data
– new users, …
Implications for research
• Many of the scenarios cannot be addressed by a
matrix completion or learning-to-rank problem
formulation
– No short-term context
– Only one single user-item interaction pair
– Often only one type of interaction (e.g., ratings)
– Rating timestamps sometimes exist, but might be
disconnected from actually experiencing/using the
item
Sequence-Aware Recommenders
• A family of recommenders that
– uses different input data,
– often bases the recommendations on certain types
of sequential patterns in the data,
– addresses the mentioned practical problem settings
Problem Characterization
Characterizing Sequence-Aware
Recommender Systems
Inputs
• Ordered (and time-stamped) set of user actions
– Known or anonymous
(beyond the session)
Inputs
• Actions are usually connected with items
– Exceptions: search terms,
category navigation, …
Inputs
• Different types of actions
– Item purchase/consumption, item view, add-to-catalog,
add-to-wish-list, …
Inputs
• Attributes of actions: user/item details, context
– Dwelling times, item discounts, etc.
Inputs
• Typical: Enriched clickstream data
Inputs: differences from traditional RSs
• for each user, all sequences contain
one single action (item)
Inputs: differences from traditional RSs (with time-stamps)
• for each user, there is
one single sequence
with several actions (items)
Output (1)
• One (or more) ordered list of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Usual item-ranking
tasks
– list of alternatives
for a given item
– complements or
accessories
Output (2)
• An ordered list of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Suggested sequence
of actions
– next-track music
recommendations
Output (3)
• An ordered list of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Strict sequence
of actions
– course learning
recommendations
Typical computational tasks
• Find sequence-related patterns in the data, e.g.,
– co-occurrence patterns
– sequential patterns
– distance patterns
• Reasoning about order constraints
– weak and strong constraints
• Relate patterns with user profile and current
point in time
– e.g., items that match the current session
Abstract problem characterization:
Item-ranking or list-generation
• Some definitions
– $U$: users
– $I$: items
– $L$: ordered list of items of length $k$
– $L^*$: set of all possible lists $L$ of length up to $k$
– $f(u, L)$: utility function, with $u \in U$ and $L \in L^*$

$L_u = \operatorname*{argmax}_{L \in L^*} f(u, L)$, for each $u \in U$
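To make the abstraction concrete, here is a rough sketch (not from the survey) that approximates the argmax greedily instead of enumerating all lists in $L^*$; `score_item` and the popularity data are hypothetical stand-ins for a learned utility:

```python
# Greedy sketch of L_u = argmax_{L in L*} f(u, L). Enumerating all lists
# in L* is intractable, so the list is built one best item at a time.
# `score_item` is a hypothetical stand-in for any learned utility model.

def score_item(user, item, current_list, popularity):
    # Toy utility: global popularity, ignoring the user and list so far.
    return popularity.get(item, 0)

def recommend(user, items, k, popularity):
    chosen, candidates = [], set(items)
    for _ in range(min(k, len(candidates))):
        best = max(candidates,
                   key=lambda i: score_item(user, i, chosen, popularity))
        chosen.append(best)
        candidates.remove(best)
    return chosen

print(recommend("u1", ["a", "b", "c"], k=2,
                popularity={"a": 5, "b": 9, "c": 1}))  # ['b', 'a']
```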
On recommendation purposes
• Often, researchers are not explicit about the
purpose
– Traditionally, could be information filtering or
discovery, with conflicting goals
• Commonly used abstraction
– e.g., predict hidden rating
• For sequence-aware recommenders
– often: predict next (hidden) action, e.g., for a given
session beginning
Relation to other areas
• Implicit feedback recommender systems
– Sequence-aware recommenders are often built on
implicit feedback signals (action logs)
– Problem formulation is however not based on matrix
completion
• Context-aware recommender systems
– Sequence-aware recommenders often are special
forms of context-aware systems
– Here: Interactional context is relevant, which is only
implicitly defined through the user’s actions
Relation to other areas
• Time-Aware RSs
– Sequence-aware recommenders do not necessarily
need explicit timestamps
– Time-Aware RSs use explicit time
• e.g., to detect long-term user drifts
• Other:
– interest drift
– user-modeling
Categorization
Categorization of tasks
• Four main categories
– Context adaptation
– Trend detection
– Repeated recommendation
– Consideration of order constraints and sequential
patterns
• Notes
– Categories are not mutually exclusive
– All types of problems based on the same problem
characterization, but with different utility functions,
and using the data in different ways
Context adaptation
• Traditional context-aware recommenders are
often based on the representational context
– defined set of variables and observations, e.g.,
weather, time of the day etc.
• Here, the interactional context is relevant
– no explicit representation of the variables
– contextual situation has to be inferred from user
actions
How much past information is
considered?
• Last-N interactions based recommendation:
– Often used in Next-Point-Of-Interest
recommendation scenarios
– In many cases only the very last visited location is
considered
– Also: “Customers who bought … also bought”
• Reasons to limit oneself:
– Not more information available
– Previous information not relevant
How much past information is
considered?
• Session-based recommendation:
– Short-term only
– Only last sequence of actions of the current user is
known
– User might be anonymous
• Session-aware recommendation:
– Short-term + Long-term
– In addition, past sessions of the current user are
known
– Allows for personalized session-based
recommendation
Quadrana, M., Karatzoglou, A., Hidasi, B., Cremonesi, P.: “Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks”. Proceedings ACM RecSys 2017: 130-137
Session-based recommendation
[Figure: independent sessions of anonymous users (Anonym 1, Anonym 2, Anonym 3) along a timeline; inferred session topics: Soccer, Cartoons, NBA]
Session-aware recommendation
[Figure: sessions of known users along a timeline: User 1 (Soccer), User 2 (Cartoons), User 1 (NBA); User 1’s long-term interest “Sports!” can be inferred across sessions, even when a later session (Cupcakes) deviates from it]
What to find?
• Next
• Alternatives
• Complements
• Continuations
What to pick?
• One
• All
Trend detection
• Less explored than context adaptation
• Community trends:
– Consider the recent or seasonal popularity of items,
e.g., in the fashion domain and, in particular, in the
news domain
• Individual trends:
– E.g., natural interest drift
• Over time, because of influence of other people, because of
a recent purchase, because something new was discovered
(e.g., a new artist)
Jannach, D., Ludewig, M. and Lerche, L.: "Session-based Item Recommendation in E-Commerce: On Short-Term Intents, Reminders, Trends, and Discounts". User Modeling and User-Adapted Interaction, Vol. 27(3-5). Springer, 2017, pp. 351-392
Repeated Recommendation
• Identifying repeated user behavior patterns:
– Recommendation of repeat purchases or actions
• E.g., ink for a printer, next app to open after call on mobile
– Patterns can be mined from the individual or the
community as a whole
• Repeated recommendations as reminders
– Remind users of things they found interesting in the past
• To remind them of things they might have forgotten
• As navigational shortcuts, e.g., in a decision situation
• Timing is an interesting question in both situations
Consideration of Order Constraints
and Observed Sequential Patterns
• Two types of sequentiality information
– External domain knowledge: strict or weak ordering
constraints
• Strict, e.g., sequence of learning courses
• Weak, e.g., when recommending sequels to movies
– Information that is mined from the user behavior
• Learn that one movie is always consumed after another
• Predict next web page, e.g., using sequential pattern mining
techniques
Review of existing works (1)
• 100+ papers reviewed
– Focus on last N interactions common
• But more emphasis on session-based / session-aware
approaches in recent years
Review of existing works (2)
• 100+ papers reviewed
– Very limited work for other problems:
• Repeated recommendation, trend detection, consideration
of constraints
Review of existing works (3)
• Application domains
Summary of first part
• Matrix completion abstraction not well suited
for many practical problems
• In reality, rich user interaction logs are available
• Different types of information can be derived
from the sequential information in the logs
– and used for special recommendation tasks, in
particular for the prediction of the next action
• Coming next:
– Algorithms and evaluation
Part II: Algorithms
Agenda
• 13:30 – 15:00 Part 1
– Introduction, Problem definition
• 15:00 – 15:30 Coffee break
• 15:30 – 16:15 Part 2
– Algorithms
• 16:15 – 16:45 Part 3
– Evaluation
• 16:45 – 17:00
– Closing, Discussion
Taxonomy
• Sequence Learning
– Frequent Pattern Mining
– Sequence Modeling
– Distributed Item Representations
– Supervised Models with Sliding Window
• Sequence-aware Matrix Factorization
• Hybrids
– Factorized Markov Chains
– LDA/Clustering + sequence learning
• Others
– Graph-based, Discrete-optimization
Sequence Learning (SL)
• Useful in application domains where input data
has an inherent sequential nature
– Natural Language Processing
– Time-series prediction
– DNA modelling
– Sequence-Aware Recommendation
SL: Frequent Pattern Mining (FPM)
1. Discover user consumption patterns
• e.g., Association Rules, (Contiguous) Sequential Patterns
2. Look for patterns matching partial transactions
3. Rank items by confidence of matched rules
• FPM Applications
– Page prefetching and recommendations
– Personalized FPM for next-item recommendation
– Next-app prediction
SL: Frequent Pattern Mining (FPM)
• Pros
– Easy to implement
– Explainable predictions
• Cons
– Choice of the minimum support/confidence
thresholds
– Data sparsity
– Limited scalability
SL: Sequence Modeling
• Sequences of past user actions as
time series with discrete observations
– Timestamps used only to order user actions
• SM aims to learn models from past observations
(user actions) to predict future ones
• Categories of SM models
– Markov Models
– Reinforcement Learning
– Recurrent Neural Networks
SL: Sequence Modeling
Markov Models
• Stochastic processes over discrete random
variables
– Finite history (= order of the model): user actions
depend on a limited # of most recent actions
• Applications
– Online shops
– Playlist generation
– Variable Order Markov Models for news
recommendation
– Hidden Markov Models for contextual next track
prediction
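A minimal sketch of a first-order Markov model for next-item prediction; representing sessions as plain lists of item IDs and the toy data are illustrative choices:

```python
from collections import defaultdict

def fit_transitions(sessions):
    # Count item-to-item transitions over all training sessions.
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    # Normalize into conditional probabilities P(next | prev).
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

def predict_next(model, last_item, k=3):
    # First-order model: only the single most recent action matters.
    probs = model.get(last_item, {})
    return sorted(probs, key=probs.get, reverse=True)[:k]

model = fit_transitions([["a", "b", "c"], ["a", "b", "d"], ["b", "c"]])
print(predict_next(model, "b"))  # ['c', 'd'] - 'c' follows 'b' twice
```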
SL: Sequence Modeling
Reinforcement Learning
• Learn by sequential interactions with the
environment
• Generate recommendations (actions)
and collect user feedback (reward)
• Markov Decision Processes (MDPs)
• Applications
– Online e-commerce services
– Sequential relationships between the attributes of
items explored in a user session
SL: Sequence Modeling
Recurrent Neural Networks (RNN)
• Distributed real-valued hidden state models
with non-linear dynamics
– Hidden state: latent representation of user state
within/across sessions
– Update the hidden state from the current input and its
previous value, then use it to predict the probability
of the next action
• Trained end-to-end with gradient descent
• Applications
– Next-click prediction with RNNs
SL:
Distributed Item Representations
• Dense, lower-dimensional representations of
the items
– derived from sequences of events
– preserve the sequential relationships between items
• Similar to latent factor models
– “similar” items are projected into similar vectors
– every item is associated with a real-valued
embedding vector
– its projection into a lower-dimensional space
– certain item transition properties are preserved
• e.g., co-occurrence of items in similar contexts
SL:
Distributed Item Representations
• Different approaches are possible
– Prod2Vec
– Latent Markov Embedding (LME)
– …
SL: Distributed Item
Representations - Prod2Vec
vi = vector representing item i
ik = item at step k of the session
Sequence-aware Matrix
Factorization
• Sequence information usually derived from
timestamps
Hybrid Methods
• Sequence Learning +
Matrix Completion (CF or CBF)
Statistics
[Figure: bar chart (y-axis 0–14); category labels not recoverable from the extracted text]
Going deep: FPM
FPM
• Association Rule Mining
– items co-occurring within the same sessions
• no check on order
• if you like A and B, you also like C (aka: learning to rank)
• Sequential Pattern Mining
– Items co-occurring in the same order
• no check on distance
• If you watch A and later watch B, you will later watch C
• Contiguous Sequential Pattern Mining
– Items co-occurring in the same order and at the same distance
• If you watch A and B one after the other, you will now watch C
FPM
• Two-step approach
1. Offline: rule mining
2. Online: rule matching (with current user session)
• Rules have
– Confidence: conditional probability
– Support: number of supporting examples (main parameter)
• threshold choice is a trade-off
– high threshold, few rules: difficult to find rules matching a session
– low threshold, many rules: noisy rules (low quality)
FPM
• More constrained FPM methods produce fewer
rules and are computationally more complex
– CSPM -> SPM -> ARM
• Efficiency and number of mined rules decrease
with the number N of past user actions (items)
considered in the mining
– N=1 : users who like A also like B
– N=2 : users who like A and B also like C
– …
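A compact sketch of the two-step approach for plain association rules (ARM only, with no order or distance check; thresholds and data are illustrative):

```python
from collections import Counter
from itertools import combinations

def mine_rules(sessions, min_support=2, min_confidence=0.1):
    # Offline step: mine size-2 co-occurrence rules with thresholds.
    item_count, pair_count = Counter(), Counter()
    for s in sessions:
        unique = set(s)
        item_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))
    rules = {}  # antecedent item -> {consequent item: confidence}
    for (a, b), c in pair_count.items():
        if c < min_support:
            continue  # support threshold: too few supporting sessions
        for ante, cons in ((a, b), (b, a)):
            confidence = c / item_count[ante]  # P(cons | ante)
            if confidence >= min_confidence:
                rules.setdefault(ante, {})[cons] = confidence
    return rules

def recommend(rules, session, k=3):
    # Online step: match rules to the current session, rank by confidence.
    scores = Counter()
    for item in session:
        for cons, confidence in rules.get(item, {}).items():
            if cons not in session:
                scores[cons] += confidence
    return [i for i, _ in scores.most_common(k)]

rules = mine_rules([["a", "b", "c"], ["a", "b"], ["b", "c"]])
print(recommend(rules, ["a"]))  # ['b']
```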
Going deep: RNN
Simple Recurrent Neural Network
• Hidden state used to predict the output
– Computed from the current input and previous hidden state
• Three weight matrices
– $h_t = f(x_t^T W_x + h_{t-1}^T W_h + b_h)$
– $y_t = f(h_t^T W_y)$
[Diagram: unrolled RNN with inputs x1, x2, hidden states h0, h1, h2 and outputs y1, y2]
Simple Recurrent Neural Network
• Item of an event as a 1-of-N coded vector
• Example (5 items):
– input event 2 is item 3: x2 = [0 0 1 0 0]
– output event 2 is item 4: y2 = [0 0 0 1 0]
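A sketch of the forward pass above with 1-of-N coded inputs, choosing tanh and softmax as concrete instances of $f$ (the dimensions and random, untrained weights are illustrative):

```python
import numpy as np

n_items, n_hidden = 5, 8
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(n_items, n_hidden))
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
W_y = rng.normal(scale=0.1, size=(n_hidden, n_items))

def step(x, h_prev):
    h = np.tanh(x @ W_x + h_prev @ W_h + b_h)   # h_t
    logits = h @ W_y                            # y_t before normalization
    y = np.exp(logits) / np.exp(logits).sum()   # distribution over next items
    return h, y

h = np.zeros(n_hidden)
for item in [2, 3]:                 # a session: items 3 and 4 (0-based 2, 3)
    x = np.zeros(n_items)
    x[item] = 1.0                   # 1-of-N coding of the current event
    h, y = step(x, h)
print(y.argmax())                   # most likely next item (untrained model)
```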
Distributed Item Representations: Prod2Vec
vi = vector representing item i
ik = item at step k of the session
[Diagram: prod2vec architecture]
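Prod2vec essentially applies word2vec to sessions; a sketch using gensim's skip-gram implementation (assuming the gensim 4.x API; the sessions and hyperparameters are toy values):

```python
from gensim.models import Word2Vec

# Treat each session as a "sentence" of item IDs and learn skip-gram
# embeddings, so items appearing in similar session contexts end up
# with similar vectors v_i.
sessions = [["i1", "i2", "i3"], ["i1", "i3", "i4"], ["i2", "i3", "i4"]]
model = Word2Vec(sessions, vector_size=32, window=3, min_count=1, sg=1)

v_i3 = model.wv["i3"]                       # embedding vector for item i3
print(model.wv.most_similar("i3", topn=2))  # nearest items in the space
```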
Multiple sessions
• Naïve solution: concatenation
• Whole user history as a single sequence
• Trivial implementation but limited effectiveness
[Diagram: a single RNN unrolled over the concatenation of User 1’s Session 1 and Session 2]
Architecture
[Diagram: session-level RNN over Session 1 of User 1]
• Across sessions, update the user state:
$c_m = \mathrm{RNN}_{\mathrm{usr}}(s_m, c_{m-1})$
with $s_m$ the last session-state and $c_{m-1}$ the previous user-state
Architecture – HRNN Init
[Diagram: the user-state initializes the hidden state of the next session’s RNN (User 1, Session 1 → Session 2)]
Architecture – HRNN All
[Diagram: the user-state is additionally propagated to every step of the next session’s RNN (User 1, Session 1 → Session 2)]
Architecture – Complete
[Diagram: the complete hierarchical architecture over Session 1 and Session 2 of User 1]
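A condensed PyTorch sketch of the HRNN-Init idea (layer sizes and names are illustrative, not the paper's exact configuration): the session-level GRU is initialized from the user state, and the user state is updated from the final session state:

```python
import torch
import torch.nn as nn

class HRNNInit(nn.Module):
    def __init__(self, n_items, d=64):
        super().__init__()
        self.emb = nn.Embedding(n_items, d)
        self.session_rnn = nn.GRU(d, d, batch_first=True)  # within a session
        self.user_cell = nn.GRUCell(d, d)  # c_m = RNN_usr(s_m, c_{m-1})
        self.out = nn.Linear(d, n_items)

    def forward(self, session_items, user_state):
        # session_items: (batch, seq_len) item IDs of the current session;
        # user_state: (batch, d) user representation c_{m-1}.
        h0 = user_state.unsqueeze(0)        # init session RNN from user state
        h_seq, h_last = self.session_rnn(self.emb(session_items), h0)
        new_user_state = self.user_cell(h_last.squeeze(0), user_state)
        return self.out(h_seq), new_user_state  # per-step next-item scores

model = HRNNInit(n_items=100)
session = torch.randint(0, 100, (1, 5))     # one session of 5 events
scores, user_state = model(session, torch.zeros(1, 64))
```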
Number of hidden layers
• One is enough
Including attributes
• Naïve solutions fail
[Diagram: parallel network with two input subnets (the next ItemID as one-hot vector, and the item features), combined via weights wID and wfeatures into a weighted output of scores on the items]
Training Parallel RNN
• Straight backpropagation fails → alternative
training methods
• Train one subnet at a time and freeze the others
(like in ALS!)
• Depending on the frequency of alternation
– Residual training (train fully)
– Alternating training (per epoch)
– Interleaving training (per mini-batch)
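A sketch of alternating training with subnet freezing; the toy parallel model and subnet names are illustrative:

```python
import torch
import torch.nn as nn

class ParallelNet(nn.Module):
    # Toy parallel RNN: one subnet over item IDs, one over item features.
    def __init__(self, n_items=10, n_feats=4, d=8):
        super().__init__()
        self.id_subnet = nn.GRU(n_items, d, batch_first=True)
        self.feature_subnet = nn.GRU(n_feats, d, batch_first=True)
        self.out = nn.Linear(2 * d, n_items)

    def forward(self, x_id, x_feat):
        h_id, _ = self.id_subnet(x_id)
        h_ft, _ = self.feature_subnet(x_feat)
        return self.out(torch.cat([h_id, h_ft], dim=-1))

def set_trainable(module, flag):
    # "Freezing" a subnet = disabling gradients for its parameters.
    for p in module.parameters():
        p.requires_grad_(flag)

model = ParallelNet()
for epoch in range(4):
    # Alternating training: switch the trained subnet once per epoch
    # (switching per mini-batch instead gives interleaving training).
    set_trainable(model.id_subnet, epoch % 2 == 0)
    set_trainable(model.feature_subnet, epoch % 2 == 1)
    # ... run one epoch of standard backpropagation here ...
```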
Part III: Evaluation and Datasets
Agenda
• 13:30 – 15:00 Part 1
– Introduction, Problem definition
• 15:00 – 15:30 Coffee break
• 15:30 – 16:15 Part 2
– Algorithms
• 16:15 – 16:45 Part 3
– Evaluation
• 16:45 – 17:00
– Closing, Discussion
Traditional evaluation approaches
• Off-line evaluation
– evaluation metrics
• Error metrics: RMSE, MAE
• Classification metrics: Precision, Recall, …
• Ranking metrics: MAP, MRR, NDCG, …
– dataset partitioning (hold-out, leave-one-out, …)
– available datasets
• On-line evaluation
– user studies
– field tests
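For reference, minimal sketches of two of the ranking metrics above, for a single hidden target item:

```python
def recall_at_k(recommended, target, k):
    # Was the hidden item among the top-k recommendations?
    return 1.0 if target in recommended[:k] else 0.0

def mrr(recommended, target):
    # Reciprocal rank of the hidden item (0 if not recommended at all).
    return 1.0 / (recommended.index(target) + 1) if target in recommended else 0.0

print(recall_at_k(["a", "b", "c"], "b", k=2))  # 1.0
print(mrr(["a", "b", "c"], "c"))               # 0.333...
```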
Dataset partitioning
• Splitting the data into training and test sets:
event-level vs. session-level
[Diagram: users × time (events) grid; for each user, earlier events form the profile/training data and later events form the test set]
Dataset partitioning
• Community-level vs. user-level splits
[Diagram: community-level and user-level splits of the users × time (events) grid into training, test, and profile regions]
Dataset partitioning
• session-based or session-aware task
→ use session-level partitioning
– better mimics real systems trained on past sessions
• session-based task
→ use session-level + user-level partitioning
– to test for new users
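A minimal sketch of a time-ordered, session-level split; the data layout, (session_start_time, item list) pairs, is illustrative:

```python
def session_level_split(sessions, test_fraction=0.2):
    # Entire sessions (not single events) go to training or test,
    # ordered by start time, so test sessions follow training sessions.
    ordered = sorted(sessions, key=lambda s: s[0])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

sessions = [(3, ["c"]), (1, ["a", "b"]), (2, ["b", "c"])]
train, test = session_level_split(sessions)
print(len(train), len(test))  # 2 1
```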
Split size
• Fixed
– e.g., 80% training / 20% test
• Time rolling
– Move the time-split forward in time
Target Items
• Sequence-agnostic prediction
– e.g., similar to matrix completion
• Given-N next item prediction
– e.g., next track prediction
• Given-N next item prediction with look-ahead
– e.g., predict a sequence of events
• Given-N next item prediction with look-back
– e.g., how many events are required to predict a final
product
[Diagram: timeline of events indicating which events are given and which are hidden target items under each protocol]
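A sketch of the given-N next item protocol as an evaluation loop; `recommend` stands in for any model returning a ranked item list (here a toy popularity ranking):

```python
def evaluate_given_n(test_sessions, recommend, n=2, k=3):
    # Reveal the first n events of each test session, hide the rest,
    # and check whether the immediate next item appears in the top-k.
    hits = trials = 0
    for session in test_sessions:
        if len(session) <= n:
            continue
        prefix, target = session[:n], session[n]
        hits += target in recommend(prefix)[:k]
        trials += 1
    return hits / max(trials, 1)

popular = ["a", "b", "c"]  # toy global popularity ranking
hit_rate = evaluate_given_n([["a", "b", "c", "d"], ["b", "a", "c"]],
                            lambda prefix: popular)
print(hit_rate)  # 1.0
```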
Thank you for your attention