Deep Learning for Personalized Search and Recommender Systems

Ganesh Venkataraman
Airbnb
Nadia Fawaz, Saurabh Kataria, Benjamin Le, Liang Zhang
LinkedIn

Tutorial Outline
• Part I (45min) Deep Learning Key concepts
• Part II (45min) Deep learning for Search and Recommendations at Scale
• Coffee break (30 min)
• Deep Learning Case Studies
• Part III (45min) Jobs You May Be Interested In (JYMBII) at LinkedIn
• Part IV (45min) Job Search at LinkedIn
Q&A at the end of each part

Motivation – Why Recommender Systems?
• Recommendation systems are everywhere. Some examples of impact:
• “Netflix values recommendations at half a billion dollars to the company”
[netflix recsys]
• “LinkedIn job matching algorithms improve performance by 50%” [San Jose Mercury News]
• “Instagram switches to using algorithmic feed” [Instagram blog]

Motivation – Why Search?
PERSONALIZED SEARCH
Query = “things to do in halifax”
• Search view: this is a classic IR problem
• Recommendations view: for this query, what are the recommended results?

Why Deep Learning? Why now?
• Many of the fundamental algorithmic techniques have existed since the 80s or before
• 2.5 exabytes of data produced per day, or 530,000,000 songs, or 150,000,000 iPhones

Why Deep Learning?
Image classification
eCommerce fraud
Search
Recommendations
NLP
Deep learning is eating the world

Why Deep Learning and Recommender
Systems?
• Features
• Semantic understanding of words/sentences possible with embeddings
• Better classification of images (identifying cats in YouTube videos)
• Modeling
• Can we cast matching problems into a deep (and possibly wide) net and learn a family of functions?

Part I – Representation Learning and Deep
Learning: Key Concepts

Deep Learning and AI
http://www.deeplearningbook.org/contents/intro.html

Part I Outline
• Shallow Models for Embedding Learning
• Word2Vec
• Deep Architectures
• FF, CNN, RNN
• Training Deep Neural Networks
• SGD, Backpropagation, Learning Rate Schedule, Regularization, Pre-Training

Representation learning for automated feature generation
• Natural Language Processing
• Word embedding: word2vec, GloVe
• Sequence modeling using RNNs and LSTMs
• Graph Inputs
• DeepWalk
• Hierarchy of features at multiple granularities of semantic meaning with deep networks

Example Application of Representation
Learning - Understanding Text
• One of the keys to any content-based recommender system is understanding text
• What does “understanding” mean?
• How similar/dissimilar are any two words?
• What does the word represent? (Named Entity Recognition)
• “Abraham Lincoln, the 16th President ...”
• “My cousin drives a Lincoln”

How to represent a word?
• Vocabulary – run, jog, math
• Simple representation:
• [1, 0, 0], [0, 1, 0], [0, 0, 1]
• No representation of meaning
• Cooccurrence in a word/document matrix

How to represent a word?
• Trouble with cooccurrence matrix
• Large dimension, lots of memory
• Dimensionality reduction using SVD
• High computational cost: SVD of an n x m matrix takes O(mn^2) time (for n <= m)
• Adding a new word => redo everything

Word embeddings taking context
• Key Conjecture
• Context matters.
• Words that convey a certain context occur together
• “Abraham Lincoln was the 16th President of the United States”
• Bigram model
• P(“Lincoln” | “Abraham”)
• Skip Gram Model
• Consider all words within context and ignore position
• P(Context | Word)

Word2Vec: Skip Gram Model
• Basic notations:
• $w$ represents a word, $C(w)$ represents all the context around a word
• $\theta$ represents the parameter space
• $D$ represents all the $(w, c)$ pairs
• $p(c \mid w; \theta)$ represents the probability of context $c$ given word $w$, parametrized by $\theta$
• The probability of all the context appearing given a word is given by:
• $\prod_{c \in C(w)} p(c \mid w; \theta)$
• The training objective then becomes:
• $\arg\max_{\theta} \prod_{(w, c) \in D} p(c \mid w; \theta)$

Word2vec details
• Let $v_w$ and $v_c$ represent the current word and context. Note that $v_c$ and $v_w$ are parameters we want to learn
• $p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$
• $C$ represents the set of all available contexts
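As a minimal sketch of the softmax above, the following computes p(c | w) from dot products of word and context vectors. The 2-dimensional vectors and the three-word context set are made-up illustration values, not trained embeddings, and the function name is hypothetical.

```python
import math

def context_probability(v_w, context_vectors, c):
    """p(c | w) = exp(v_c . v_w) / sum over d in C of exp(v_d . v_w)."""
    scores = {d: sum(a * b for a, b in zip(v_d, v_w))
              for d, v_d in context_vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {d: math.exp(s - max_score) for d, s in scores.items()}
    return exp_scores[c] / sum(exp_scores.values())

# Hypothetical vectors for the word "lincoln" and three candidate contexts.
v_lincoln = [1.0, 0.5]
contexts = {
    "abraham": [0.9, 0.4],
    "president": [0.8, 0.6],
    "drives": [-0.7, 0.2],
}
probs = {c: context_probability(v_lincoln, contexts, c) for c in contexts}
```

The denominator sums over every available context, which is the cost that negative sampling is meant to avoid.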

Negative Sampling – basic intuition
$p(c \mid w; \theta) = \dfrac{e^{v_c \cdot v_w}}{\sum_{d \in C} e^{v_d \cdot v_w}}$
• Sample a few contexts from the unigram distribution instead of summing over all contexts in the denominator
• Word2vec itself is a shallow model and can be used to initialize a deep model
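A minimal sketch of the negative sampling idea, assuming the usual word2vec formulation in which the observed (word, context) pair is scored against k contexts drawn from the unigram distribution. The counts, vectors and function names below are hypothetical; the slide does not spell out the loss.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_negatives(unigram_counts, k, rng):
    """Draw k contexts with probability proportional to their unigram counts."""
    words = list(unigram_counts)
    weights = [unigram_counts[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

def negative_sampling_loss(v_w, v_c, negative_vectors):
    """-log sigma(v_c . v_w) - sum over negatives n of log sigma(-v_n . v_w)."""
    loss = -math.log(sigmoid(dot(v_c, v_w)))
    for v_n in negative_vectors:
        loss -= math.log(sigmoid(-dot(v_n, v_w)))
    return loss

# Hypothetical counts and vectors, for illustration only.
rng = random.Random(0)
unigram_counts = {"the": 50, "of": 30, "president": 5, "drives": 3}
vectors = {"the": [0.1, 0.0], "of": [0.0, 0.1],
           "president": [0.8, 0.6], "drives": [-0.7, 0.2]}
v_lincoln = [1.0, 0.5]
negatives = sample_negatives(unigram_counts, k=3, rng=rng)
loss = negative_sampling_loss(v_lincoln, vectors["president"],
                              [vectors[n] for n in negatives])
```

Each update touches only k + 1 context vectors rather than the whole context set.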

Deep Architectures
FF, CNN, RNN

Neuron: Computational Unit
• Input vector: x = [x1, x2, …, xn]
• Neuron
• Weight vector: W
• Bias: b
• Activation function: f
• Output
• $a = f(W^T x + b)$
[Diagram: inputs x1, x2, x3, x4 feed a neuron with weights W, bias b and activation f, producing output $a = f(W^T x + b)$]

Activation Functions
• Tanh: ℝ → (-1,1)
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• Sigmoid: ℝ → (0,1)
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• ReLU: ℝ → [0, +∞)
$f(x) = \max(0, x) = x^{+}$
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/

Layer
• Layer l: $n_l$ neurons
• weight matrix: $W = [W_1, \dots, W_{n_l}]$
• bias vector: $b = [b_1, \dots, b_{n_l}]$
• activation function: f
• output vector
• $a = f(W^T x + b)$
[Diagram: inputs x1, x2, x3, x4 feed three neurons with parameters (W_1, b_1), (W_2, b_2), (W_3, b_3); neuron j outputs $a_j = f(W_j^T x + b_j)$]

Layer: Matrix Notation
• Layer l: $n_l$ neurons
• weight matrix: W
• bias vector: b
• activation function: f
• output vector
• $a = f(W^T x + b)$
• more compact notation
• fast linear algebra routines for quick computations in network
[Diagram: input x = (x1, x2, x3, x4) passes through a layer with parameters W, b, f, producing output $a = f(W^T x + b)$]

Feed Forward Network
• Depth: L layers
• Activation at layer l+1
$a^{(l+1)} = f(W^{(l)T} a^{(l)} + b^{(l)})$
• Output: prediction in supervised learning
• goal: approximate y = F(x)
[Diagram: network of depth L = 4. Input Layer 1 (x1, x2, x3, x4), Hidden Layer 2 with activations a^(2), Hidden Layer 3 with activations a^(3), Output Layer 4 (prediction layer) with activations a^(L); the parameters between consecutive layers are (W^(1), b^(1), f^(1)), (W^(2), b^(2), f^(2)), (W^(3), b^(3), f^(3))]
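The layer recursion a(l+1) = f(W(l)^T a(l) + b(l)) can be sketched in a few lines. This is an illustrative toy, assuming each weight matrix stores one column per neuron as in the layer slides; the weights, layer sizes and function names are made up.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def layer_forward(W, b, a, f):
    """Compute f(W^T a + b); W[i][j] is the weight from input i to neuron j."""
    z = [sum(W[i][j] * a[i] for i in range(len(a))) + b[j]
         for j in range(len(b))]
    return f(z)

def feed_forward(layers, x):
    """Apply each (W, b, f) triple in turn, starting from the input x."""
    a = x
    for W, b, f in layers:
        a = layer_forward(W, b, a, f)
    return a

# Hypothetical network: 4 inputs -> 2 hidden units (ReLU) -> 1 output (linear).
layers = [
    ([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], [0.0, -6.0], relu),
    ([[1.0], [1.0]], [1.0], identity),
]
prediction = feed_forward(layers, [1.0, 2.0, 3.0, 4.0])
```

Here the hidden pre-activations are 5 and -1, so ReLU keeps only the first unit and the output is 5 + 0 + 1 = 6.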

Why CNN: Convolutional Neural Networks?
• Large size grid structured data
• 1D: time series
• 2D: image
• Convolution to extract features from image (e.g. edges, texture)
• Local connectivity
• Parameter sharing
• Equivariance to translation: translating the input translates the output by the same amount

Convolution example
https://docs.gimp.org/en/plug-in-convmatrix.html
Edge detect kernel Sharpen kernel

2D convolution
http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/
[Figure: 2D convolution of an input matrix with a 3x3 kernel; a 2x2 kernel matrix with entries W1, W2, W3, W4]

• Fully connected
• hidden unit connected to all input units
• computationally expensive
• Large image NxN pixels and hidden layer with K features
• Number of parameters: ~$K N^2$
• Locally connected
• hidden unit connected to some contiguous input units
• no parameter sharing
• Convolution
• locally connected
• kernel: parameter sharing
• 1D Kernel vector [W1, W2]
• 1D Toeplitz weight matrix W
• Scaling to large input, images
• Equivariance to translation
Weight matrix W (fully connected):
W11 W12 W13 W14
W21 W22 W23 W24
W31 W32 W33 W34
Weight matrix W (locally connected, no parameter sharing):
W11 W12 0 0
0 W22 W23 0
0 0 W33 W34
Weight matrix W (convolution with kernel vector [W1, W2]):
W1 W2 0 0
0 W1 W2 0
0 0 W1 W2
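To make the Toeplitz view concrete, the sketch below builds the banded weight matrix for a 1D kernel and checks that multiplying by it gives the same result as sliding the kernel over the input. It uses the deep learning convention (no kernel flipping); the kernel and input values and the function names are arbitrary illustrations.

```python
def toeplitz_matrix(kernel, input_size):
    """Banded matrix whose row r holds the kernel shifted r positions right."""
    k = len(kernel)
    rows = input_size - k + 1
    return [[kernel[col - row] if 0 <= col - row < k else 0.0
             for col in range(input_size)]
            for row in range(rows)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def conv1d_valid(kernel, x):
    """Slide the kernel over x without padding (no kernel flipping)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

kernel = [1.0, -1.0]          # plays the role of [W1, W2]
x = [1.0, 3.0, 6.0, 10.0]
W = toeplitz_matrix(kernel, len(x))
via_matrix = matvec(W, x)
via_sliding = conv1d_valid(kernel, x)
```

The 3x4 matrix has only two distinct non-zero parameters, which is the parameter sharing the slide refers to.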

Pooling
• Summary statistics
• Aggregate over region
• Reduce size
• Less overfitting
• Translation invariance
• Max, mean
http://ufldl.stanford.edu/tutorial/supervised/Pooling/

CNN: Convolutional Neural Network
Combination
• Convolutional layers
• Pooling layers
• Fully connected layers
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
[LeCun et al., 1998]

CNN example for image recognition: ImageNet [Krizhevsky et al., 2012]
Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
[Figure: network architecture split across the 1st GPU and the 2nd GPU; filters learned by the first CNN layer]

Why RNN: Recurrent Neural Network?
• Sequential data processing
• ex: predict next word in sentence: “I was born in France. I can speak…”
• RNN
• Persist information through feedback loop
• loop passes information from one step to the next
• Parameter sharing across time indexes
• output unit depends on previous output units through same update rule
[Diagram: recurrent unit taking input x_t and previous hidden state h_{t-1}, producing hidden state h_t]
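A minimal sketch of the feedback loop, assuming the common vanilla RNN update h_t = tanh(w_x x_t + w_h h_{t-1} + b) with a scalar hidden state. The slide does not give the update rule, so the formula, the weights and the function names here are assumptions for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One application of the shared update rule."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, h0, w_x, w_h, b):
    """Unfold the loop over the sequence, reusing the same parameters each step."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# The input is non-zero only at the first step; later states still carry it.
states = run_rnn([1.0, 0.0, 0.0], h0=0.0, w_x=1.0, w_h=0.5, b=0.0)
```

The same three parameters are used at every time index, and the last state is still positive although the last two inputs are zero.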

Unfolded RNN
• Copies of NN passing feedback to one another
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM: Long Short Term Memory [Hochreiter et al., 1997]
• Long term dependencies
• large gap between relevant information and where it is needed
• Cell state: long-term memory
• Can remember relevant information over long period of time
• Avoid vanishing or exploding gradient
• Cell state updates regulated by gates
• Forget: how much info from cell state to let through
• Input: which cell state components to update
• Tanh: values to add to cell state
• Output: select component values to output
picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Examples of RNN application
• Speech recognition [Graves et al., 2013]
• Language modeling [Mikolov, 2012]
• Machine translation [Kalchbrenner et al., 2013][Sutskever et al., 2014]
• Image captioning [Vinyals et al., 2014]

Training a Deep Neural Network

Cost Function
• m training samples (feature vector, label)
$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
• Per sample cost: error between label and output from prediction layer
$J(W, b; x^{(i)}, y^{(i)}) = \| a^{(L)}(x^{(i)}) - y^{(i)} \|^2$
• Minimize cost function over parameters: weights W and biases b
$J(W, b) = \dfrac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \dfrac{\lambda}{2} \sum_{l=1}^{L} \| W^{(l)} \|_F^2$
(first term: average error; second term: regularization)

Gradient Descent
• Random parameter initialization: symmetry breaking
• Gradient descent step: update for every parameter $W_{ij}^{(l)}$ and $b_i^{(l)}$
$\theta = \theta - \alpha \nabla_\theta \mathbb{E}[J(\theta)]$
• Gradient computed by Backpropagation
• High cost of backpropagation over full training set

Stochastic Gradient Descent (SGD)
• SGD: follow negative gradient after
• single sample
$\theta = \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
• a few samples: mini-batch (256)
• Epoch: full pass through training set
• Randomly shuffle data prior to each training epoch
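The SGD loop above (shuffle each epoch, step after every mini-batch) can be sketched on a toy problem. The example fits y = 2x + 1 with a one-parameter linear model rather than a neural network; the data, learning rate, batch size and function name are illustrative assumptions.

```python
import random

def sgd_linear_regression(data, alpha=0.05, batch_size=4, epochs=200, seed=0):
    """Mini-batch SGD on squared error for the model y_hat = w * x + b."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle before every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over the mini-batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Noise-free synthetic data from y = 2x + 1, for illustration only.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = sgd_linear_regression(data)
```

Each epoch performs five parameter updates here, one per mini-batch of four samples.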

Backpropagation [Rumelhart et al., 1986]
Goal: evaluate the gradient numerically at low cost (exactly via the chain rule, not by finite differences)
Recursively apply chain rule for derivative of composition of functions
Let $y = g(x)$ and $z = f(y) = f(g(x))$, then
$\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y} \dfrac{\partial y}{\partial x} = f'(g(x))\, g'(x)$
Backpropagation steps
1. Feedforward pass: compute all activations
2. Output error: measures node contribution to output error
3. Backpropagate error through all layers
4. Compute partial derivatives
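The four backpropagation steps can be traced on a tiny network: one tanh hidden unit followed by a linear output, with squared error cost. This is a sketch under those assumptions (the architecture, parameter values and function names are made up), and it checks the result against central finite differences.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)   # hidden activation
    a = w2 * h + b2              # output activation
    return h, a

def cost(params, x, y):
    _, a = forward(params, x)
    return (a - y) ** 2

def backprop(params, x, y):
    w1, b1, w2, b2 = params
    h, a = forward(params, x)                       # 1. feedforward pass
    delta_out = 2.0 * (a - y)                       # 2. output error dJ/da
    delta_hidden = delta_out * w2 * (1.0 - h * h)   # 3. backpropagate (tanh' = 1 - tanh^2)
    # 4. partial derivatives, in the order [w1, b1, w2, b2]
    return [delta_hidden * x, delta_hidden, delta_out * h, delta_out]

def finite_difference(params, x, y, eps=1e-6):
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(plus, x, y) - cost(minus, x, y)) / (2.0 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]
analytic = backprop(params, x=0.7, y=1.5)
numeric = finite_difference(params, x=0.7, y=1.5)
```

Backpropagation needs one forward and one backward pass, while the finite difference check needs two cost evaluations per parameter, which is why the latter is used only for verification.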

Training optimization
• Learning Rate Schedule
• Changing learning rate as learning progresses
• Pre-training
• Goal: training simple model on simple task before training desired model to perform desired task
• Greedy supervised pre-training: pre-train for task on subset of layers as initialization for final network
• Regularization to curb overfitting
• Goal: reduce generalization error
• Penalize parameter norm: L2, L1
• Augment dataset: train on more data
• Early stopping: return parameter set at point in time with lowest validation error
• Dropout [Srivastava, 2013]: train ensemble of all subnetworks formed by removing non-output units
• Gradient clipping to avoid exploding gradient
• norm clipping
• element-wise clipping
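The two clipping variants differ in whether they preserve the gradient direction. A minimal sketch, with hypothetical function names and thresholds:

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm (keeps direction)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clip_elementwise(grad, limit):
    """Clamp each component to [-limit, limit] independently (may change direction)."""
    return [max(-limit, min(limit, g)) for g in grad]

grad = [3.0, 4.0]                            # L2 norm is 5
norm_clipped = clip_by_norm(grad, 1.0)       # approximately [0.6, 0.8]
elem_clipped = clip_elementwise(grad, 3.5)   # [3.0, 3.5]
```

Norm clipping keeps the ratio between components at 3:4, whereas element-wise clipping changes it.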

Part II – Deep Learning for Personalized
Recommender Systems at Scale

Examples of Personalized Recommender Systems

Job Search

Personalized Recommender Systems
• User i visits with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …)
• Algorithm selects item j from a set of candidates
• (i, j): response y_ij (action or not, e.g. click, like, share, apply…)
Which item(s) should we recommend to the user?
• The item(s) with the best expected utility
• Utility examples:
• CTR, Revenue, Job Apply rates, Ads conversion rates, …
• Can be a combination of the above for trade-offs

An Example Architecture of
Personalized Recommender
Systems

An example of Recommender System Architecture
[Diagram. Components, split between an Offline System and an Online System: User Interaction Logs; Offline Modeling Workflow + User / Item derived features; User; User Feature Store; Item Store + Features; Recommendation Ranking; Ranking Model Store; Additional Re-ranking Steps. Labels on the arrows: steps 1 to 5, and "Item derived features".]

An example of Personalized Search System Architecture
[Diagram. Components: User Interaction Logs; Offline Modeling Workflow + User / Item derived features; User; Query Construction; Search-based Candidate Selection & Retrieval; User Feature Store; Search Index of Items; Recommendation Ranking; Ranking Model Store; Additional Re-ranking Steps. Labels on the arrows: steps 1 to 7, and "Item derived features".]

Key Components – Offline Modeling
• Train the model offline (e.g. Hadoop)
• Push model to online ranking model store
• Pre-generate user / item derived features for online systems to consume
• E.g. user / item embeddings from word2vec / DNNs based on the raw features

Key Components – Candidate Selection
• Personalized Search (With user query):
• Form a query to the index based on user query annotation [Arya et al., 2016]
• Example: Panda Express Sunnyvale -> +restaurant:panda express +location:sunnyvale
• Recommender system (Optional):
• Can help dramatically reduce the number of items to score in ranking steps [Cheng et al., 2016, Borisyuk et al., 2016]
• Form a query based on the user features
• Goal: Fetch only the items with at least some match with user features
• Example: a user with title software engineer -> +title:software engineer for jobs recommendation

Key Components - Ranking
• Recommendation Ranking
• The main ML model that ranks items retrieved by candidate selection based on the expected utility
• Additional Re-ranking Steps
• Often for user experience optimization related to business rules, e.g.
• Diversification of the ranking results
• Recency boost
• Impression discounting
• …

Integration of Deep Learning Models
into Personalized Recommender
Systems at Scale

Literature: Deep Learning for Recommendation Systems
• RBM for Collaborative Filtering [Salakhutdinov et al., 2007]
• Deep Belief Networks [Hinton et al., 2006]
• Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016]
• Neural Collaborative Filtering [He, et al., 2017]
• Siamese networks for user item matching [Huang et al., 2013]
• Deep Belief Networks with Pre-training [Hinton et al., 2006]
• Collaborative Deep Learning [Wang et al., 2015]

[Diagram repeated from the earlier slide: personalized search system architecture with User Interaction Logs; Offline Modeling Workflow + User / Item derived features; User; Query Construction; Search-based Candidate Selection & Retrieval; User Feature Store; Search Index of Items; Recommendation Ranking; Ranking Model Store; Additional Re-ranking Steps; steps 1 to 7]

Offline Modeling + User / Item Embeddings
[Diagram: User Features are mapped to a User Embedding Vector and Item Features to an Item Embedding Vector; the two vectors are compared with Sim(U,I). Stores shown: User Feature Store; Item Store / Index with Features]

Query Formulation & Candidate Selection
• Issues of using raw text: Noisy or incorrect query tagging due to
• Failure to capture semantic meaning
• Ex. Query: Apple watch -> +food:apple +product:watch or +product:apple watch?
• Multilingual text
• Query: 熊猫快餐 -> +restaurant:panda express
• Cross-domain understanding
• People search vs job search

Query Formulation & Candidate Selection
• Represent Query as an embedding
• Expand query to similar queries in a semantic space
• KNN search in dense feature space with Inverted Index [Cheng et al., 2016]
[Figure: query Q = “Apple Watch” and documents D = “iphone”, D = “Orange Swatch”, D = “ipad” placed in a semantic space]
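A minimal sketch of nearest-neighbor retrieval in an embedding space, using brute-force cosine similarity rather than the inverted index mentioned above. The 2-dimensional vectors are made up for illustration, chosen so that the query lands near the semantically related documents; the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def k_nearest(query_vec, doc_vectors, k):
    """Return the k document names with the highest cosine similarity to the query."""
    ranked = sorted(doc_vectors,
                    key=lambda name: cosine(query_vec, doc_vectors[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings, not produced by a trained model.
query = [0.9, 0.1]                 # "Apple Watch"
docs = {
    "iphone": [0.8, 0.2],
    "Orange Swatch": [0.1, 0.9],
    "ipad": [0.7, 0.3],
}
neighbors = k_nearest(query, docs, k=2)
```

With these vectors the lexically similar "Orange Swatch" is ranked last, which is the behavior term matching alone cannot provide. A production system would replace the full scan with an approximate or index-backed search.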

Recommendation Ranking Models
• Wide and Deep Models to capture all possible signals [Cheng et al., 2016]
https://arxiv.org/pdf/1606.07792.pdf
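A minimal sketch of how a wide and deep model combines its two parts at prediction time, assuming the form used in [Cheng et al., 2016]: the wide logit (a linear model over sparse cross features) and the deep logit (a linear readout of the last hidden layer) are summed with a bias and passed through a sigmoid. All feature names, weights and function names below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_and_deep_predict(wide_features, wide_weights,
                          dense_input, hidden_layers, deep_weights, bias):
    # Wide part: sum the weights of the sparse cross features that fire.
    wide_logit = sum(wide_weights.get(name, 0.0) for name in wide_features)
    # Deep part: ReLU layers; each row of W holds the weights of one hidden unit.
    a = dense_input
    for W, b in hidden_layers:
        a = [max(0.0, sum(w_i * a_i for w_i, a_i in zip(row, a)) + b_j)
             for row, b_j in zip(W, b)]
    deep_logit = sum(w * a_i for w, a_i in zip(deep_weights, a))
    return sigmoid(wide_logit + deep_logit + bias)

cross = "AND(user=Comp Sci. Student, job=Software Engineer)"
wide_weights = {cross: 1.2}
hidden_layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
p_with_cross = wide_and_deep_predict({cross}, wide_weights,
                                     [0.5, -0.2], hidden_layers, [0.8, 0.3], -1.0)
p_without_cross = wide_and_deep_predict(set(), wide_weights,
                                        [0.5, -0.2], hidden_layers, [0.8, 0.3], -1.0)
```

In this toy setting the memorized cross feature adds 1.2 to the logit on top of the 0.4 contributed by the deep part.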

Challenges & Open Problems for Deep Learning in Recommender Systems
• Distributed training on very large data
• TensorFlowOnSpark (https://github.com/yahoo/TensorFlowOnSpark)
• CNTK (https://github.com/Microsoft/CNTK)
• MXNet (http://mxnet.io/)
• Caffe (http://caffe.berkeleyvision.org/)
• …
• Latency Issues from Online Scoring
• Pre-generation of user / item embeddings
• Multi-layer scoring (simple models => complex)
• Batch vs online training

Part III – Case Study: Jobs You May Be
Interested In (JYMBII)

Outline
• Introduction
• Generating Embeddings via Word2vec
• Generating Embeddings via Deep Networks
• Tree Feature Transforms in Deep + Wide Framework

Introduction: Problem Formulation
• Rank jobs by $P(\text{User } u \text{ applies to Job } j \mid u, j)$
• Model response given:
• User: career history, skills, education, connections
• Job: job title, description, location, company

Introduction: JYMBII Modeling - Generalization
• Model should learn general rules to predict which jobs to recommend to a member.
• Learn generalizations based on similarity in title, skill, location, etc. between profile and job posting

Introduction: JYMBII Modeling - Memorization
• Model should memorize exceptions to the rules
• Learn exceptions based on frequent co-occurrence of features

Introduction: Baseline Features
• Dense BoW Similarity Features for Generalization
• e.g. similarity in title text is a good predictor of response
• Vector BoW Similarity Feature: Sim(User Title BoW, Job Title BoW)
• Sparse Two-Depth Cross Features for Memorization
• e.g. memorize that computer science students will transition to entry engineering roles
• Sparse Cross Features:
• AND(user = Comp Sci. Student, job = Software Engineer)
• AND(user = In Silicon Valley, job = In Austin, TX)
• AND(user = ML Engineer, job = UX Designer)

Introduction: Issues
• BoW Features don’t capture semantic similarity between user/job
• Cosine Similarity between Application Developer and Software Engineer is 0
• Generating three-depth, four-depth cross features won’t scale
• e.g. memorizing that Factory Workers from Detroit are applying to Fracking jobs in Pennsylvania
• Hand-engineered features are time consuming and will have low coverage
• The number of three-depth, four-depth cross features grows exponentially
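The zero similarity between "Application Developer" and "Software Engineer" follows directly from the bag-of-words representation: the two titles share no tokens, so their count vectors are orthogonal. A small sketch (whitespace tokenization and the function name are simplifying assumptions):

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

no_overlap = bow_cosine("Application Developer", "Software Engineer")
some_overlap = bow_cosine("Software Engineer", "Senior Software Engineer")
```

The second pair shares two of three tokens and scores about 0.82, while the first pair scores exactly 0 despite describing closely related roles.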

Introduction: Deep + Wide for JYMBII
• BoW Features don’t capture semantic similarity between user/job
• Generate embeddings to capture Generalization through semantic similarity
• Deep + Wide model for JYMBII [Cheng et al., 2016]
• Semantic Similarity Feature: Sim(User Embedding, Job Embedding)
• Vector BoW Similarity Feature: Sim(User Title BoW, Job Title BoW)
• Global Model Cross Feature: AND(user = In Silicon Valley, job = In Austin, TX), AND(user = ML Engineer, job = UX Designer)
• User Model Cross Feature: AND(user = User 2, job = Job Latent Feature 1)
• Job Model Cross Feature: AND(user = User Latent Feature, job = Job 1)

Generating Embeddings via Word2vec:
Training Word Vectors
• Key Ideas
• Same users (context) apply to similar jobs (target)
• Similar users (target) will apply to the same jobs (context)
• Train word vectors via word2vec skip-gram architecture
• Concatenate user’s current title and the applied job’s title as input
• Example: Application Developer (user title) => Software Engineer (applied job title)

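Model Structure
[Diagram, bottom to top, with parallel User and Job branches:
1. Tokenized Titles: (Application, Developer) for the user, (Software, Engineer) for the job
2. Word Embedding Lookup from Pre-trained Word Vectors
3. Word Vectors
4. Entity Embeddings via Average Pooling
5. Cosine Similarity between the user and job embeddings
6. Response Prediction (Logistic Regression)]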

Results and Next Steps
• Receiver Operating Characteristic – Area Under Curve for evaluation
• Response prediction is binary classification: Apply or don’t Apply
• Highly skewed data: Low CTR for Apply Action
• Good metric for ranking quality: Focus on discriminatory ability of model
• Marginal 0.87% ROC AUC Gain
• How to improve quality of embeddings?
• Optimize embeddings for prediction task with supervised training
• Leverage richer context about user and job

Generating Embeddings via Deep Networks:
Model Structure
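[Diagram, bottom to top, with parallel User and Job branches: Sparse Features (Title, Skill, Company) -> Embedding Layer -> Hidden Layer -> Entity Embedding; the user and job entity embeddings are combined with a Hadamard Product (Elementwise Product)]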

Hyper Parameters, Lots of Knobs!
• Optimizer Used
• SGD w/ Momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam)
• Learning Rate
• $10^{-5}$ to $10^{-3}$ ($10^{-4}$)
• Embedding Layer Size
• 50 to 200 (100)
• Dropout
• 0% to 50% dropout (0% dropout)
• Sharing Parameter Space for both user/job embeddings
• Assumes commutative property of recommendations (a + b = b + a) (No shared parameter space)
• Hidden Layer Sizes
• 0 to 2 Hidden Layers (200 -> 200 Hidden Layer Size)
• Activation Function
• ReLU vs. Tanh (ReLU)

Training Challenges
• Millions of rows of training data are impossible to store all in memory
• Stream data incrementally directly from files into a fixed size example pool
• Add shuffling by randomly sampling from example pool for training batches
• Extreme dimensionality of company sparse feature
• Reduce dimensionality of company feature from millions -> tens of thousands
• Perform feature selection by frequency in training set
• Hyper parameter tuning
• Distribute grid search through parallel modeling in single driver Spark jobs
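The example pool idea can be sketched as a generator: examples stream in one at a time, and once the pool is full each new arrival causes one randomly chosen example to be emitted. This is an illustrative reading of the approach, not the authors' implementation; the function name, pool size and batch size are hypothetical.

```python
import random

def shuffled_batches(example_stream, pool_size, batch_size, seed=0):
    """Yield approximately shuffled batches while holding at most pool_size examples."""
    rng = random.Random(seed)
    pool, batch = [], []
    for example in example_stream:
        pool.append(example)
        if len(pool) < pool_size:
            continue
        # Pool is full: move a random example to the end and emit it.
        idx = rng.randrange(len(pool))
        pool[idx], pool[-1] = pool[-1], pool[idx]
        batch.append(pool.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Stream exhausted: drain what is left in the pool in random order.
    rng.shuffle(pool)
    for example in pool:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(shuffled_batches(range(100), pool_size=10, batch_size=8))
flat = [example for batch in batches for example in batch]
```

A larger pool gives a better approximation of a full shuffle at the cost of more memory; with a pool of one example the stream order is preserved.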

Results
Model | ROC AUC
Baseline Model | 0.753
Deep + Wide Model | 0.790 (+4.91%***)
*** For reference, a previous major JYMBII modeling improvement with a 20% lift in ROC AUC resulted in a 30% lift in Job Applications

The Current Deep + Wide Model
[Diagram: Deep Embedding Features (Feed Forward NN) combined with Wide Sparse Cross Features (Two-Depth)]
• Generating three-depth, four-depth cross features won’t scale
• Smart feature selection required

Tree Feature Transforms: Feature Selection via
Gradient Boosted Decision Trees
• Each tree outputs a path from root to leaf encoding a combination of feature crosses [He et al., 2014]
• GBDTs select the most useful combinations of feature crosses for memorization
[Diagram: decision trees whose internal nodes are Yes/No tests such as Member Seniority: Vice President, Member Industry: Banking, Member Location: Silicon Valley, Member Skill: Statistics, Job Seniority: CXO, Job Title: ML Engineer]
Tree Feature Transforms: The Full Picture
How to train both the NN model and GBDT model jointly with each other?
[Diagram: Deep Embedding Features (Feed Forward NN) alongside Wide Sparse Cross Features (GBDT)]
Tree Feature Transforms: Joint Training via
Block-wise Cyclic Coordinate Descent
• Treat NN model and GBDT model as separate block-wise coordinates
• Implemented by
1. Training the NN until convergence
2. Training GBDT w/ fixed NN embeddings
3. Training the regression layer weights w/ generated cross features from GBDT
4. Training the NN until convergence w/ fixed cross features
5. Cycle steps 2-4 until a global convergence criterion is met

Tree Feature Transforms: Train NN Until
Convergence
Initially no trees are in our forest
[Diagram: Deep Embedding Features (Feed Forward NN) alongside Wide Sparse Cross Features (GBDT)]

Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin

Tree Feature Transforms: Train GBDT w/ NN

Tree Feature Transforms: Train Regression Layer Weights

Tree Feature Transforms: Train NN w/ GBDT

Tree Feature Transforms: Block-wise
Coordinate Descent Results
Model | ROC AUC
Baseline Model | 0.753
Deep + Wide Model | 0.790 (+4.91%)
Deep + Wide Model w/ GBDT Iteration 1 | 0.792 (+5.18%)

JYMBII Deep + Wide: Future Direction
• Generating Embeddings w/ LSTM Networks
• Leverage sequential career history data
• Promising results in NEMO: Next Career Move Prediction with Contextual Embedding [Li et al., 2017]
• Semi-Supervised Training
• Leverage pre-trained title, skill, and company embeddings on profile data
• Replace Hadamard Product for entity embedding similarity function
• Deep Crossing [Shan et al., 2016]
• Add even richer context
• e.g. Location, Education, and Network features

Part IV – Case Study: Deep Learning Networks
for Job Search

Outline
• Introduction
• Representations via Word2vec
• Robust Representations via DSSM

Introduction: Search Architecture
[Diagram components: User Query; Query Understanding; Indexer; Index; Top-K retrieval; Offline Training / Model; Result Ranking; Results]

Introduction: Query Understanding -
Segmentation and Tagging
• First divide the search query into segments
• Tag query segments based on recognized entity tags
Query Segmentations
• Oracle | Java | Application Developer
• Oracle | Java Application Developer
Query Tagging
• COMPANY = Oracle, SKILL = Java, TITLE = Application Developer
• COMPANY = Oracle, TITLE = Java Application Developer

Introduction: Query Understanding –
Expansion
• Task of adding additional synonyms/related entities to the query to improve recall
• Current Approach: Curated dictionary for common synonyms and related entities
• COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
• SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
• TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …
(On the slide, synonyms are shown in green and related entities in blue)

Introduction: Query Understanding - Retrieval
and Ranking
• COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
• SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
• TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …
[Diagram: the expanded query fields are matched against job fields labeled Title, Title, Skills, Company]

Introduction: Issues – Retrieval and Ranking
• Term retrieval has limitations
• Cross language retrieval
• Softwareentwickler vs. Software developer
• Word Inflections
• Engineering Management vs. Engineering Manager
• Query expansion via curated dictionary of synonyms is not scalable
• Expensive to refresh and store synonyms for all possible entities
• Heavy reliance on query tagging is not robust enough
• Novel title, skill, and company entities will not be tagged correctly
• Upstream errors propagate to poor retrieval and ranking

Introduction: Solution – Deep Learning for
Query and Document Representations
• Query and document representations
• Map queries and document text to vectors in semantic space
• Robust to Handle Out of Vocabulary words
• Term retrieval has limitations
• Query expansion via curated dictionary of synonyms is not scalable
• Map synonyms, translations and inflections to similar vectors in semantic space
• Term retrieval on cluster id or KNN based retrieval
• Heavy reliance on query tagging is not robust enough
• Complement structured query representations with semantic representations

Representations via Word2vec:
Leverage JYMBII Work
• Key Ideas
• Same users (context) apply to similar jobs (target)
• Similar users (target) will apply to the same jobs (context)
• Train word vectors via word2vec skip-gram architecture
• Concatenate user’s current title and the applied job’s title as input
• Example: Application Developer (user title) => Software Engineer (applied job title)

Word2vec in Ranking
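[Diagram, bottom to top, with parallel Query and Job branches:
1. Tokenized Text: (Application, Developer) for the query, (Software, Engineer) for the job
2. Word Embedding Lookup from Pre-trained Word Vectors
3. Word Vectors
4. Entity Embeddings via Average Pooling
5. Cosine Similarity between the query and job embeddings
6. Learning to Rank Model (NDCG Loss)]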

Ranking Model Results
Model | Normalized Discounted Cumulative Gain@5 (NDCG@5) | CTR@5 Lift (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
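For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k, assuming the common formulation with gain 2^rel - 1 and a log2 position discount. The slides do not state which variant was used, so treat the exact gain function as an assumption; the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results (rank 0 is the top result)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the results in ranked order (1 = applied/clicked, 0 = not).
clicked_first = ndcg_at_k([1, 0, 0, 0, 0], 5)
clicked_second = ndcg_at_k([0, 1, 0, 0, 0], 5)
```

Moving the single relevant job from position 1 to position 2 lowers NDCG@5 from 1.0 to about 0.63, which shows how strongly the metric rewards getting the top of the ranking right.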

Optimize Embeddings for Job Search Use Case
• Leverage apply and click feedback to guide learning of embeddings
• Fine tune embeddings for task using supervised feedback
• Handle out of vocabulary words and scale to query vocabulary size
• Compared to JYMBII, query vocabulary is much larger and less well-formed
• Misspellings
• Word Inflections
• Free text search
• Need to make representations more robust for these free text queries

Robust Representations via DSSM:
Deep Structured Semantic Model [Huang et al., 2013]
[Diagram, bottom to top, with three parallel branches:
• Raw Text: Query “Application Developer”; Applied Job (Positive) “Software Engineer”; Randomly Sampled Applied Job (Negative) “Hairdresser”
• Tri-letter Hashing: #Ap, App, ppl…; #So, Sof, oft…; #Ha, Hai, air…
• Hidden Layer 1, Hidden Layer 2, Hidden Layer 3
• Cosine Similarity between the query and each job
• Softmax w/ Cross Entropy Loss]
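The top of the DSSM architecture (cosine similarity followed by softmax with cross entropy) can be sketched as follows, assuming the formulation of [Huang et al., 2013] in which similarities are scaled by a smoothing factor before the softmax. The vectors stand in for the outputs of the last hidden layer; their values, the value of gamma and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dssm_loss(query_vec, positive_vec, negative_vecs, gamma=10.0):
    """Cross entropy of a softmax over scaled cosine similarities; index 0 is the positive."""
    sims = [cosine(query_vec, positive_vec)]
    sims += [cosine(query_vec, n) for n in negative_vecs]
    logits = [gamma * s for s in sims]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

query = [1.0, 0.0]
applied_job = [0.9, 0.1]
sampled_jobs = [[0.0, 1.0], [-1.0, 0.2]]
loss_good = dssm_loss(query, applied_job, sampled_jobs)
loss_bad = dssm_loss(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
```

The loss is small when the applied job is closer to the query than the randomly sampled jobs, and large when a sampled job is closer.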

Tri-letter Hashing
• Tri-letter Hashing Example
• Engineer -> #en, eng, ngi, gin, ine, nee, eer, er#
• Benefits of Tri-letter Hashing
• More compact Bag of Tri-letters vs. Bag of Words representation
• 700K Word Vocabulary -> 75K Tri-letters
• Can generalize for out of vocabulary words
• Tri-letter hashing robust to minor misspellings and inflections of words
• Engneer -> #en, eng, ngn, gne, nee, eer, er#
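The tri-letter examples above can be reproduced with a few lines. This sketch only extracts the tri-letters (lowercasing and adding the # word boundaries); mapping them to indices in a tri-letter vocabulary is omitted, and the function name is hypothetical.

```python
def tri_letters(word):
    """Return the letter trigrams of a word padded with '#' boundary markers."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

correct = tri_letters("Engineer")
misspelled = tri_letters("Engneer")
shared = set(correct) & set(misspelled)
```

The misspelling still shares five tri-letters with the correct word (#en, eng, nee, eer, er#), so the two inputs map to overlapping sparse representations, whereas a word-level vocabulary would treat "Engneer" as unknown.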

Training Details
• Parameter Sharing Helps
• Better and faster convergence
• Model size is reduced
• Regularization
• L2 performs better than dropout
• Toolkit Comparisons (CNTK vs TensorFlow)
• CNTK: Faster convergence and better model quality
• TensorFlow: Easy to implement and better community support. Comparative model quality
[Figure: training performance with/without parameter sharing]

Lessons in Production Environment
• Bottlenecks in Production Environment
• Latency due to extra computation
• Latency due to GC activity
• Fat Jars in JVM environment
• Practical Lessons
• Avoid JVM Heap while serving the model
• Caching most accessed entities’ embedding
[Figure: chart with labels + 100%, + 70%, + 40%]

DSSM Qualitative Results
Head query | Most similar queries
Software Engineer | Engineer Software, Software Engineers, Software Engineering
Data Mining | Data Miner, Machine Learning Engineer, Microsoft Research
LinkedIn | Google, Software Engineers, Software Engineer
Softwareentwickler | Software, Software Engineer, Engineer Software
For qualitative results, only top head queries are taken to analyze similarity to each other

DSSM Metric Results
Model | Normalized Discounted Cumulative Gain@5 (NDCG@5) | CTR@5 Lift (%)
Baseline Model | 0.582 | +0.0%
Baseline Model + Word2Vec Feature | 0.595 (+2.2%) | +1.6%
Baseline Model + DSSM Feature | 0.602 (+3.4%) | +3.2%

DSSM Future Direction
• Leverage Current Query Understanding in DSSM Model
• Query tag entity information for richer context embeddings
• Query segmentation structure can be incorporated into the network design
• Deep Crossing for Similarity Layer [Shan et al., 2016]
• Convolutional DSSM [Shen et al., 2014]

Conclusion
• Recommender Systems and personalized search are very similar problems
• Deep Learning is here to stay and can have significant impact on both
• Understanding and constructing queries
• Ranking
• Deep learning and more traditional techniques are *not* mutually exclusive (hint: Deep + Wide)

Appendix – Backup slides

Difference between parameter sharing in 1-D
convolution and RNN?
• CNN Kernel: output unit depends on small number of neighboring input units
through same kernel
• RNN update rule: output unit depends on previous output units through same
update rule. Deeper computational graph.
