RecSys 2015 Tutorial
Scalable Recommender Systems
Where Machine Learning
Meets Search!
Diana Hu
Senior Data Scientist
Joaquin Delgado, PhD.
Director of Engineering
The content of this presentation consists of the
authors’ personal statements and does not
officially represent their employer’s view in
any way. The included content is especially not
intended to convey the views of OnCue or Verizon.
1. Introduction
1. What to expect?
2. Scaling recommender systems is hard
2. Recommender System Problem as a Search Problem
1. Representing queries as recommendations
3. Introduction to Search and Information Retrieval
1. Scalability in search
2. Introduction to Elasticsearch
4. Overview of Machine Learning Techniques for Recommender Systems
1. Learning to rank
2. Scalability in machine learning
3. ML software frameworks
5. Re-writing the ranking function
1. Writing a new ranking/scoring function in Elasticsearch
2. Training a Spark model as an Elasticsearch plugin for a custom ranking/scoring function
6. References
1. Introduction
What to expect from this tutorial?
• The focus is on practical examples of how
to implement scalable recommender
systems using search and learning-to-rank
(machine learning) techniques
• What it is not
• Deep dive into any specific areas (Search,
RecSys, Learning to rank, or Machine learning)
• Algorithmic survey
• Comparative Analysis
Finding commonalities
What is a recommendation?
Beyond rating prediction
Paradigms of recommender systems
• Reduce information load by estimating
• Ranking Approaches:
• Collaborative filtering: “Tell me what is popular
amongst my peers”
• Content Based: “Show me more of what I liked”
• Knowledge Based: “Tell me what fits my needs”
• Hybrid
Model Type: Collaborative
Pros:
• No metadata engineering
• Serendipity of results
• Learns market segments
Cons:
• Requires rating feedback
• Cold start for new users and new items
Model Type: Content-based
Pros:
• No community required
• Comparison between items possible
Cons:
• Content descriptions
• Cold start for new users
• No serendipity
Model Type: Knowledge-based
Pros:
• Deterministic
• Assured quality
• No cold-start
• Interactive user sessions
Cons:
• Knowledge engineering effort to bootstrap
• Static
• Does not react to short-term
Scaling recommender systems is hard!
• Millions of users
• Millions of items
• Cold start for an ever-increasing catalog
and newly added users
• Imbalanced Datasets – power law
distribution is quite common
• Many algorithms have not been fully tested
at “Internet Scale”
2. Recommender System Problem as a
Search Problem
Content-based methods inspired by IR
• Rec Task: Given a user profile find the best matching
items by their attributes
• Similarity calculation: based on keyword overlap
between user/items
• Neighborhood method (i.e. nearest neighbor)
• Query-based retrieval (e.g., Rocchio’s method)
• Probabilistic methods (classical text classification)
• Explicit decision models
• Feature representation: based on content analysis
• Vector space model
• Topic Modeling
Search queries as content-based
• Exact matching (Boolean)
• Relevant or not relevant (no ranking)
• Ranking by similarity to query (Vector
Space Model)
• Text similarity: Bag of words, TF-IDF, Incidence
• Ranking by importance (e.g. PageRank)
Content-based similarity measures
• Simple match
• Dice’s Coefficient
• Jaccard’s Coefficient
• Cosine Coefficient
• Overlap Coefficient
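The coefficients listed above can be sketched in their set-based (binary term weight) forms. This is a minimal illustration, assuming each user profile and item is reduced to a non-empty set of keywords; the sample term sets are hypothetical.

```python
import math

# Set-based forms of the listed coefficients.
# Assumption: a and b are non-empty sets of keywords (binary term weights).

def simple_match(a, b):
    """Number of shared terms."""
    return len(a & b)

def dice(a, b):
    """Dice: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine for binary vectors: |A∩B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def overlap(a, b):
    """Overlap: |A∩B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical user profile and item description.
user_terms = {"olympic", "sports", "documentary"}
item_terms = {"olympic", "sports", "games", "history"}

for measure in (simple_match, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(user_terms, item_terms), 3))
```

All five share the same numerator (the term overlap) and differ only in how they normalize for profile and item size.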
3D Term Vector Space
Knowledge-based methods inspired by IR
• Rec Task: Given explicit recommendation rules find the
best matches between user’s requirements and item’s
characteristics (i.e., which item should be recommended in
which context?)
• Similarity calculation: based on constraint satisfaction
problem and distance similarity requirements<->attributes
• Conjunctive queries
• Similarity metrics for item retrieval
• Feature representation: based on query representation
• User defined preferences
• Utility-based preferences
• Conjoint analysis
Search queries as knowledge-based
• Constraint satisfaction problem (CSP) is a tuple
• V – set of variables
• D – set of finite domains for V
• C – set of constraints of possible V permutations
• Recommendation as CSP:
(V, D, C) => (Vi ∪ Vu, D, Cr ∪ Ci ∪ Cf ∪ REQ)
• Vu – user properties (possible user’s requirements)
• Vi – item properties
• Cr – compatibility constraints (possible Vc permutations)
• Ci – Item constraints (conjunction fully defines an item)
• Cf – filter conditions (define Vu<->Vi relationships)
• REQ – user’s requirements
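As a rough sketch of recommendation as constraint satisfaction, the snippet below keeps only the items whose properties satisfy every filter condition given the user's requirements. The catalog, property names, and conditions are all hypothetical; compatibility constraints (Cr) are left out for brevity.

```python
# Hypothetical catalog: each item is one assignment to the item properties (Vi).
items = [
    {"id": 1, "genre": "Documentary", "price": 20, "min_age": 0},
    {"id": 2, "genre": "Sports", "price": 20, "min_age": 0},
    {"id": 3, "genre": "Action", "price": 25, "min_age": 16},
    {"id": 4, "genre": "Children", "price": 30, "min_age": 0},
]

# Cf: filter conditions relating user properties (Vu) to item properties (Vi).
filter_conditions = [
    lambda req, item: item["price"] <= req["max_price"],
    lambda req, item: item["min_age"] <= req["age"],
    lambda req, item: req.get("genre") is None or item["genre"] == req["genre"],
]

def recommend(requirements, catalog, conditions):
    """Return the items that satisfy all conditions for the given REQ."""
    return [
        item for item in catalog
        if all(cond(requirements, item) for cond in conditions)
    ]

# REQ: one user's requirements (hypothetical values).
strict = {"max_price": 25, "age": 10, "genre": "Sports"}
relaxed = {"max_price": 25, "age": 10}

print([item["id"] for item in recommend(strict, items, filter_conditions)])
print([item["id"] for item in recommend(relaxed, items, filter_conditions)])
```

In search-engine terms, each condition maps naturally onto a filter clause of a conjunctive query.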
3. Introduction to Search and Information Retrieval
Search is about finding specific things that are either
known or assumed to exist. Discovery is about
helping the user encounter what he/she didn’t
even know existed.
Both Search and Discovery can be achieved through
a query based data/information system.
Predicate Logic and Declarative Languages Rock!
Examples of query based systems
• Focused on Search
• Search engines
• Database systems
• Focus on Discovery
• Recommender systems
• Advertising systems
IR: The science behind search!
Information Retrieval (IR) is query-based
data retrieval + relevance ranking (scoring),
usually applied to unstructured data (e.g., text
documents and fields); it is often referred to as
full-text or keyword search.
Have you heard of Bag-of-Words?
Vector Space Representation?
What about TF-IDF?
IR Architecture
[Figure: IR architecture diagram. Surviving labels: Input Query, Query Representation, Doc Representation, Matched Hits, Retrieved Documents; (*) marks optional components.]
Retrieval Models
Model Type Query
Boolean • Boolean
• Connected by
• Set of keywords
• Bag of words
• Binary term weight
• Exact match
• Binary relevance
• No ranking
• Vector
• Desired terms with optional
• Vectors
• Bag of words with weight based on TF-IDF scheme
• Similarity score
• Output documents are ranked
• Relevance feedback support
• Similarity with
• Document
• Ranks documents in decreasing probability of
Ranking in the Vector Space Model
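A minimal sketch of vector-space ranking, assuming one common TF-IDF variant (raw term frequency times log(N/df)) and cosine similarity between query and document vectors. The weighting variant, the whitespace tokenizer, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) and the idf table for a corpus."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs sorted by decreasing similarity."""
    vectors, idf = tf_idf_vectors(docs)
    query_tf = Counter(query.lower().split())
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in query_tf.items()}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical mini-corpus.
corpus = [
    "olympic sports documentary",
    "yankees winning the world series",
    "once upon a time fairy tale",
]
ranking = rank("olympic sports", corpus)
print(ranking)
```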
Search Engines: the big hammer!
• Search engines are largely used to solve
non-IR search problems, and here is why:
• Widely available
• Fast and scalable distributed systems
• Integrates well with existing data stores (SQL and NoSQL)
But are we using the right tool?
• Search Engines were originally designed
for IR.
• More complex non-IR search/discovery
tasks sometimes require a multi-phase,
multi-system approach
Filter + Scoring: Two Phase Approach
Filter Rank
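The two-phase approach can be pictured as a cheap Boolean filter that narrows the candidate set, followed by a scoring function applied only to the survivors. The sketch below is an in-memory illustration with a hypothetical catalog and hypothetical field names; it is not how a search engine implements the phases internally.

```python
def two_phase(items, predicate, score, k):
    """Phase 1: keep items passing the filter. Phase 2: rank them, keep top k."""
    candidates = [item for item in items if predicate(item)]
    return sorted(candidates, key=score, reverse=True)[:k]

# Hypothetical catalog with an age restriction and a popularity signal.
catalog = [
    {"title": "A", "min_age": 0, "votes": 120},
    {"title": "B", "min_age": 18, "votes": 900},
    {"title": "C", "min_age": 0, "votes": 300},
    {"title": "D", "min_age": 12, "votes": 50},
]

viewer_age = 13
top = two_phase(
    catalog,
    predicate=lambda item: item["min_age"] <= viewer_age,  # business filter
    score=lambda item: item["votes"],                      # ranking signal
    k=2,
)
print([item["title"] for item in top])
```

Because the filter runs first, the (possibly expensive) scoring function never sees items that business rules would exclude anyway.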
• What is Elasticsearch?
• Elasticsearch is an open-source search engine
• Elasticsearch is written in Java
• Built on top of Apache Lucene™
• A distributed real-time document store where every field is
indexed and searchable out of the box
• A distributed search engine with real-time analytics
• Has a plugin architecture that facilitates extending the
core system
• Written with near-real-time (NRT) and cloud support in mind
• Easy creation of indices, shards and replicas on a live cluster
• Has Optimistic Concurrency Control
Examples of scaling challenges
• More than 50 million documents a day
• Real time search
• Less than 200ms average query latency
• Throughput of at least 1000 QPS
• Multilingual indexing
• Multilingual querying
Who uses ES?
• Wikipedia
• Uses Elasticsearch to provide full-text search with highlighted
search snippets, and search-as-you-type and did-you-mean
• The Guardian
• Uses Elasticsearch to combine visitor logs with social-network
data to provide real-time feedback to its editors about the
public’s response to new articles.
• Stack Overflow
• Combines full-text search with geo-location queries and uses
more-like-this to find related questions and answers.
• GitHub
• Uses Elasticsearch to query 130 billion lines of code.
How does ES scale?
• Sharding and Replicas
• Several indices (at least one index for each day of
• Indices divided into multiple shards
• Multiple replicas of a single shard
• Real-time, synchronous replication
• Near-real-time index refresh (1 to 30 seconds)
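To illustrate how documents end up on shards: Elasticsearch's documentation describes routing as a hash of the routing value (by default the document ID) taken modulo the number of primary shards. The sketch below mimics that idea only; `zlib.crc32` is a stand-in hash chosen for the example (Elasticsearch uses a different hash function), and the function name is hypothetical.

```python
import zlib
from collections import Counter

def shard_for(routing_key, num_primary_shards):
    """Pick a shard as hash(routing_key) % num_primary_shards.

    Assumption: crc32 is only a stand-in for the engine's real hash.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

num_shards = 5
doc_ids = [str(i) for i in range(1, 1001)]
placement = Counter(shard_for(doc_id, num_shards) for doc_id in doc_ids)

# The same ID always maps to the same shard, so reads can find the document.
print(shard_for("42", num_shards) == shard_for("42", num_shards))
print(sorted(placement.items()))
```

This also shows why the number of primary shards is fixed at index creation: changing the modulus would move existing documents to different shards.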
Indexing the data
Querying ES
[Figure: ES Index distributed over Node 1 to Node 8.]
Using Search Engines for RS
• It’s not just about rating prediction and ranking
• Business filtering logic
• Age restrictions
• Catalog navigation context (e.g. e-commerce)
• Promotional materials
• Low latency and scale
• SLAs on response times including query, responses
and presentation
• Actual time for computing recommendations is just a
small fraction of total allocated time
Stacking things up
Visualization / UI
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Contextual Post Filtering
Ranking in Elasticsearch
4. Overview of Machine Learning
Techniques for Recommender Systems
Machine Learning
Machine learning, in particular supervised learning,
refers to techniques used to learn how to classify or
score previously unseen objects based on a training
dataset.
Inference and Generalization are the Key!
Recommendations as data mining
Amatriain, Xavier, et al. "Data mining methods for recommender systems." Recommender Systems Handbook. Springer US, 2011. 39-71.
Learning to rank
• Formulate the problem as standard
supervised learning
• Training data can be cardinal or binary
• Various approaches:
• Pointwise: Typically approximated by regression
• Pairwise: Approximated via binary classifier
• Listwise: Directly optimize whole list (difficult!)
• A trick with ES is to include the raw scores
returned by ES in the feature vector
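A minimal sketch of the pairwise approach with the ES raw score included as a feature. Each training row is a hypothetical feature vector `[es_score, genre_match]` with a graded relevance label; pairs of rows are turned into difference vectors and a plain perceptron (chosen here only for brevity, any binary classifier would do) learns the weights.

```python
def pairwise_examples(rows):
    """Turn (features, relevance) rows for one query into labelled difference vectors."""
    pairs = []
    for xi, yi in rows:
        for xj, yj in rows:
            if yi > yj:
                diff = [a - b for a, b in zip(xi, xj)]
                pairs.append((diff, 1))                 # i should rank above j
                pairs.append(([-d for d in diff], -1))  # mirrored example
    return pairs

def train_perceptron(pairs, epochs=20, lr=0.1):
    """Learn weights w so that w . (xi - xj) > 0 whenever i is more relevant than j."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for x, y in pairs:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def rank(rows, w):
    """Return row indices ordered by decreasing learned score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x, _ in rows]
    return sorted(range(len(rows)), key=lambda i: -scores[i])

# Hypothetical rows for one query: ([es_score, genre_match], relevance).
rows = [
    ([2.0, 0.0], 0),  # high text score, wrong genre, not relevant
    ([1.5, 1.0], 2),  # good text score, right genre, most relevant
    ([0.5, 1.0], 1),  # weak text score, right genre, somewhat relevant
]
weights = train_perceptron(pairwise_examples(rows))
order = rank(rows, weights)
print(weights, order)
```

Here the ES score alone would rank the first row on top; the learned weights combine it with the contextual feature instead of discarding it.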
Learning to rank with ES
Elastic Search
Contextual features
Ranked Results
Web scale ML challenges
• Massive amount of examples
• Billions of features
• Big models don’t fit in a single machine’s memory
• Variety of algorithms that need to be scaled up
A Note of Caution….
“Invariably, simple models and
a lot of data trump more elaborate
models based on less data.”
Alon Halevy, Peter Norvig, and
Fernando Pereira, Google
Scalability in Machine Learning
• Distributed systems – Fault tolerance,
Throughput vs. latency
• Parallelization Strategies – Hashing, trees
• Processing – Map reduce variants, MPI,
graph parallel
• Databases – Key/Value Stores, NoSQL
What is Spark?
Fast, expressive cluster computing system
[Figure: Spark stack. Surviving labels: approx queries, Spark SQL, structured data, Spark Core.]
What is Spark?
• Work on distributed collections like local ones
• RDD:
• Immutable
• Parallel transforms
• Resilient and configurable persistence
• Operations
• Transforms: Lazy operations (map, filter, join,…)
• Actions: Return/write results (collect, save, count,…)
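The difference between lazy transforms and actions can be shown with a toy, single-machine stand-in. The `LazyCollection` class below is hypothetical and is not the Spark API; it only mimics the idea that `map` and `filter` record work on an immutable collection, and nothing runs until an action such as `collect` or `count` is called.

```python
class LazyCollection:
    """Toy immutable collection: transforms are recorded, actions execute them."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)
        self._ops = tuple(ops)

    def map(self, fn):
        # Transform: returns a new collection, computes nothing yet.
        return LazyCollection(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transform: returns a new collection, computes nothing yet.
        return LazyCollection(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: runs the recorded pipeline and returns the results.
        return list(self._iterate())

    def count(self):
        # Action: runs the recorded pipeline and counts the results.
        return sum(1 for _ in self._iterate())

calls = []

def square(x):
    calls.append(x)  # record when the function is actually executed
    return x * x

numbers = LazyCollection(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: only transforms so far
result = even_squares.collect()   # the action triggers the computation
print(calls_before_action, result)
```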
ML Software Framework: Spark MLlib
• Subproject with ML primitives
• Building blocks (as a framework vs. library)
• Large scale statistics
• Classification
• Regression
• Clustering
• Matrix factorization
• Optimization
• Frequent pattern mining
• Dimensionality reduction
What is ML-Scoring?
• Creates an Elasticsearch (ES) document index of
• Trains a supervised learning ML model from a dataset
of instances + labels
• Generates an Elasticsearch plugin that uses the trained
ML model to score documents at query time
An Open Source POC!
Remember the elephant?
Visualization / UI
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Contextual Post Filtering
Simplifying the Stack!
Visualization / UI
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Contextual Post Filtering
Elastic Search
ML-Scoring Architecture
+ Labels
Trainer +
ML Model
5. Re-writing the ranking function
Using ML-Scoring
• Creating an ES Index
• Boolean queries
• More-Like-This queries
• Built-in scoring functions
• Scoring script
• Scoring plugin
• ML-Score evaluator using Spark
• ML-Score query
Creating an Index in ES
POST /my_movie_catalog/movies/_bulk
{ "index": { "_id": 1 }}
{ "genre": "Documentary", "productID": "XHDK-A-1293-#fJ3", "title": "Olympic Sports", "content": "Olympic greatness…", "price": 20 }
{ "index": { "_id": 2 }}
{ "genre": "Sports", "productID": "KDKE-B-9947-#kL5", "title": "NY Yankees: Winning the World Series", "content": "There is no better team than the NY…", "price": 20 }
{ "index": { "_id": 3 }}
{ "genre": "Action", "productID": "JODL-X-1937-#pV7", "title": "Rambo III", "content": "Sylvester Stallone is evermore…", "price": … }
{ "index": { "_id": 4 }}
{ "genre": "Children", "productID": "QQPX-R-3956-#aD8", "title": "Fairy Tale", "content": "Once upon a time…", "price": 30 }
Boolean queries
• SQL representation
SELECT movie
FROM movies
WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3")
AND (price != 30)
GET /my_movie_catalog/movies/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "should" : [
            { "term" : {"price" : 20}},
            { "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
          ],
          "must_not" : {
            "term" : {"price" : 30}
          }
        }
      }
    }
  }
}
Content based similarity queries (MLT)
{
  "more_like_this" : {
    "fields" : ["title", "description"],
    "like_text" : "Once upon a time",
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
• The More Like This Query (MLT Query) finds documents
that are "like" a given set of documents. In order to do so,
MLT selects a set of representative terms of these input
documents, forms a query using these terms, executes the
query and returns the results.
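The three steps described above (select representative terms, form a query, execute it) can be sketched as follows. This is a simplified illustration, assuming a TF-IDF-style term score and whitespace tokenization; the real term selection and scoring in Lucene differ, and the corpus is hypothetical. The parameter names mirror `min_term_freq` and `max_query_terms` from the query.

```python
import math
from collections import Counter

def select_terms(like_text, corpus, min_term_freq=1, max_query_terms=12):
    """Pick the highest-scoring terms of the input text (TF-IDF-style score)."""
    n = len(corpus)
    doc_tokens = [set(doc.lower().split()) for doc in corpus]
    term_freq = Counter(like_text.lower().split())
    scored = []
    for term, freq in term_freq.items():
        if freq < min_term_freq:
            continue
        df = sum(1 for tokens in doc_tokens if term in tokens)
        if df == 0:
            continue  # term never occurs in the index, so it cannot match
        scored.append((freq * math.log(1 + n / df), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]

def more_like_this(like_text, corpus, **kwargs):
    """Form an OR query from the selected terms and rank docs by matched terms."""
    terms = select_terms(like_text, corpus, **kwargs)
    hits = []
    for index, doc in enumerate(corpus):
        tokens = set(doc.lower().split())
        matched = sum(1 for term in terms if term in tokens)
        if matched:
            hits.append((matched, index))
    return [index for _, index in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Hypothetical indexed documents.
corpus = [
    "once upon a time in a kingdom",
    "a time to win the world series",
    "olympic greatness",
]
print(select_terms("Once upon a time", corpus))
print(more_like_this("Once upon a time", corpus))
print(more_like_this("Once upon a time", corpus, max_query_terms=2))
```

Lowering `max_query_terms` keeps only the rarest terms, which trades recall for precision.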
Similar to a given document
{
  "more_like_this" : {
    "fields" : ["title", …],
    "docs" : [
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "1"
      },
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "2"
      }
    ],
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
Built-in functions
• Suppose we want to boost movies by
popularity (base-line of many RS)
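As a sketch of what popularity boosting computes: with a `function_score` query using `field_value_factor` on a `votes` field and the `log1p` modifier, the Elasticsearch documentation describes the factor as the common logarithm of one plus the field value, combined with the query score by multiplication by default. The Python function names below are hypothetical and only reproduce that arithmetic.

```python
import math

def field_value_factor_log1p(votes, factor=1.0):
    """log1p modifier: common logarithm of (1 + factor * field value)."""
    return math.log10(1 + factor * votes)

def boosted_score(query_score, votes, factor=1.0):
    """Default combination: multiply the text-match score by the popularity factor."""
    return query_score * field_value_factor_log1p(votes, factor)

# Same text-match score, different popularity.
for votes in (0, 9, 99, 999):
    print(votes, round(boosted_score(1.5, votes), 3))
```

The logarithm dampens the effect of very large vote counts, so the first few votes matter far more than the thousandth. Note that a document with zero votes ends up with a score of zero under multiplication.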
Popularity-based boosting
GET /my_movie_catalog/movies/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": { "query": "popularity",
          "fields": [ "title", "content" ]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p"
      }
    }
  }
}
• Suppose we want to build a location-aware
recommender system
Decay functions
• Supported decay
• Linear
• Gauss
• Exp
• Also supported
• random_score
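For intuition about the `gauss` decay, the sketch below follows the formula given in the Elasticsearch documentation: the score is 1 within `offset` of the `origin`, and falls to `decay` (0.5 by default) at a further distance of `scale`. The function name is hypothetical; the example values (origin 50, offset 50, scale 20) are the ones used for the price clause in this section's query.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 inside the offset, `decay` at offset + scale."""
    # Variance chosen so that the score equals `decay` at distance `scale`.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

# Price clause: origin 50, offset 50, scale 20.
for price in (0, 100, 120, 140):
    print(price, round(gauss_decay(price, origin=50, scale=20, offset=50), 4))
```

Prices between 0 and 100 are all treated as equally good; beyond that the score drops smoothly rather than cutting off.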
GET /_search
{ "query": {
    "function_score": {
      "functions": [
        { "gauss": {
            "location": {
              "origin": { "lat": … },
              "offset": "2km", "scale": "3…"
            }
        }},
        { "gauss": {
            "price": {
              "origin": "50",
              "offset": "50",
              "scale": "20"
            }
          },
          "weight": 2
        }
      ]
    }
}}
ES scoring script
• Trickier pricing and margin based scoring
if (price < threshold) {
  profit = price * margin
} else {
  profit = price * (1 - discount) * margin
}
return profit / target
ES Scoring Script
GET /_search
{
  "function_score": {
    "functions": [
      { ...location clause... },
      { ...price clause... },
      {
        "script_score": {
          "params": { "threshold": 80, "discount": 0.1, "target": 10 },
          "script": "price = doc['price'].value; margin = doc['margin'].value; if (price < threshold) { return price * margin / target }; return price * (1 - discount) * margin / target;"
        }
      }
    ]
  }
}
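The scoring script's logic can be mirrored offline to sanity-check values before deploying it. The sketch below reuses the parameter values from the query (threshold 80, discount 0.1, target 10); the function name and the sample prices and margins are hypothetical.

```python
def profit_score(price, margin, threshold=80, discount=0.1, target=10):
    """Profit relative to target; prices at or above the threshold are discounted."""
    if price < threshold:
        profit = price * margin
    else:
        profit = price * (1 - discount) * margin
    return profit / target

# Hypothetical items: (price, margin).
for price, margin in [(50, 0.2), (100, 0.2), (100, 0.5)]:
    print(price, margin, round(profit_score(price, margin), 3))
```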
Limitations of ranking using ES
practical scoring function
• Stateless computation
• Meant primarily for text search
• Hard to represent context and history
• Limited complexity (simple math functions only)
• Nevertheless, the original score should not be
discarded, as it may come in handy!
Scoring plugin in ES
public class PredictorPlugin extends AbstractPlugin {
    public String name() {
        return getClass().getName();
    }
    public String description() {
        return "Simple plugin to predict values.";
    }
    public void onModule(ScriptModule module) {
        // …
    }
}
ML-Scoring evaluator using Spark
class SparkPredictorEngine[M](val readPath: String, val spHelp: SparkModelHelpers[M]) extends PredictorEngine {
  private var _model: ModelData[M] = ModelData[M]()
  override def getPrediction(values: Collection[IndexValue]) = {
    if (_model.clf.nonEmpty) {
      val v = ReadUtil.cIndVal2Vector(values, _model.mapper)
      // …
    } else {
      throw new PredictionException("Empty model")
    }
  }
  def readModel() = _model = spHelp.readSparkModel(readPath)
  def getModel: ModelData[M] = _model
}
ML-Scoring query
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "script_score": {
            "script": "search-predictor",
            "lang": "native",
            "params": {}
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
Potential issues
• Performance
• It may be a problem if the search space is very
large and/or the computation too intensive
• Operations
• Code running on a key infrastructure
• Versioning and binary compatibility
• Importance of the whole picture: RS seen through the lens
of the whole elephant
• RS research is a new field in comparison to IR
• Scalability is hard! Why not learn from all of RS’s cousins:
• Search
• Distributed systems
• Databases
• Machine learning
• Content analysis
• …
• Bridging the gap between research and engineering is an
ongoing effort
6. References
• Baeza-Yates, R., & Ribeiro-Neto, B. 2011. Modern information retrieval. New York:
ACM press.
• Chirita, P. A., Firan, C. S., & Nejdl, W. 2007. Personalized query expansion for the web.
In Proceedings of the 30th annual international ACM SIGIR conference on Research
and development in information retrieval (pp. 7-14). ACM.
• Croft, W. B., Metzler, D., & Strohman, T. 2010. Search engines: Information retrieval in
practice. Reading: Addison-Wesley.
• Dunning, T. 1993. Accurate methods for the statistics of surprise and
coincidence. Computational linguistics, 19(1), 61-74.
• Elastic, Elasticsearch: RESTful, Distributed Search & Analytics. 2015.
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. 2009.
The WEKA data mining software: an update. ACM SIGKDD explorations
newsletter, 11(1), 10-18.
• Ihaka, R., & Gentleman, R. 1996. R: a language for data analysis and graphics. Journal
of computational and graphical statistics, 5(3), 299-314.
• Kantor, P. B., Rokach, L., Ricci, F., & Shapira, B. 2011. Recommender systems handbook.
• Manning, C. D., Raghavan, P., & Schütze, H. 2008. Introduction to information retrieval.
Cambridge: Cambridge university press.
• Qiu, F., & Cho, J. 2006. Automatic identification of user interest for personalized search.
In Proceedings of the 15th international conference on World Wide Web (pp. 727-736).
• Sun, J. T., Zeng, H. J., Liu, H., Lu, Y., & Chen, Z. 2005. Cubesvd: a novel approach to
personalized web search. In Proceedings of the 14th international conference on World
Wide Web (pp. 382-390). ACM.
• Xing, B., & Lin, Z. 2006. The impact of search engine optimization on online advertising
market. In Proceedings of the 8th international conference on Electronic commerce: The
new e-commerce: innovations for conquering current barriers, obstacles and limitations to
conducting successful business on the internet (pp. 519-529). ACM.
• Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. 2010. Spark: cluster
computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in
cloud computing (Vol. 10, p. 10)
Additional Credits
• Doug Kang
• Data Scientist, Verizon OnCue
• Federico Ponte
• System Engineer from Mahisoft
• Yessika Labrador
• Data Engineer from Mahisoft

  18. There is more to Recsys than algorithms and ranking - Retrieval - User Interface & Feedback - Data - AB Testing - Systems & Architectures
  20. Performance: It may be a problem if the search space is very large and/or the computation too intensive. Operations: Code running on a key infrastructure. People are more hesitant to touch an infrastructure/DB component such as Elasticsearch; similar concerns exist surrounding DB stored procedures. There is no way to sandbox a native plugin. Requires strong automated regression and performance testing. How to handle versioning and binary compatibility? Potential deployment issues: upgrades to Elasticsearch, the plugin code and/or the models may present challenges.