Applied Recommender Systems 
Bob Brehm 
Presentation Topics 
 Hadoop MapReduce Overview 
 Mahout Overview 
 Hive Overview 
 Review recommender systems 
 Introduction to Spring XD 
 Demonstrations as we go
Hadoop Overview 
 History [7] 
 2003: Apache Nutch (open-source web 
search engine) was created by Doug 
Cutting and Mike Cafarella. 
 2003-2004: Google File System (2003) and 
MapReduce (2004) papers published. 
 2005: Hadoop was created in Nutch as 
an open-source implementation of GFS 
and MapReduce.
Hadoop Overview 
 Today Hadoop is an independent Apache 
Project consisting of 4 modules: [6] 
 Hadoop common 
 HDFS – distributed, scalable file system 
 YARN (V2) – job scheduling and cluster 
resource management 
 MapReduce – system for parallel 
processing of large data sets 
 Hadoop market size is over $3 billion!
Hadoop Overview 
 Other Hadoop Related projects include 
 Hive – data warehouse infrastructure 
 Mahout – Machine learning library 
 While there are many more projects, the 
rest of the talk will focus on these two 
as well as MapReduce and HDFS.
Hadoop Overview 
 NameNode – HDFS master; tracks file metadata and all DataNodes 
 JobTracker – main scheduler for MapReduce jobs 
 DataNode – stores HDFS data blocks on a worker machine 
 TaskTracker – runs the map and reduce tasks assigned to its node by the JobTracker
Hadoop Overview 
 HDFS basic command examples: 
 Put – copies from local to HDFS 
 hadoop fs -put localfile 
 Mkdir – makes a directory 
 hadoop fs -mkdir /user/hadoop/dir1 
 Tail – Displays last kilobyte of file 
 hadoop fs -tail pathname 
 Very similar to Linux commands
Hadoop Overview 
 Input data – wrangling can be difficult 
 Mapper – split data into key value pairs 
 Sort – sort values by key 
 Reducer – Combine values by key
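The three phases above can be imitated in plain Java, without Hadoop, to see how data moves between them. This is only an in-memory sketch: the class name MapReducePhases and its method names are made up for illustration, and a real job distributes each phase across the cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map, sort, and reduce phases (not Hadoop code).
class MapReducePhases {

    // Map phase: split one input line into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Sort phase: group the values by key; a TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> sortByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine all values for one key, here by summing them.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Chain the three phases over a list of input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        sortByKey(pairs).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(Arrays.asList(
                "Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
    }
}
```

Summing in the reduce step is what turns this into a word count; swapping in a different reduce function changes the job without touching the map or sort steps.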
Hadoop Overview 
 Wordcount (HelloWorld) – Counts 
occurrences of each word in a document 
 Half of TF-IDF
Hadoop Overview 
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Hadoop Overview 
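public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    // Mapper, reducer, and output type settings are not shown on this slide.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // submit the job and wait for it to finish
}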
Hadoop Overview 
 Setup the data: 
 /usr/joe/wordcount/input - input directory in HDFS 
 /usr/joe/wordcount/output - output directory in HDFS 
 $ hadoop fs -ls /usr/joe/wordcount/input/ 
 /usr/joe/wordcount/input/file01 
 /usr/joe/wordcount/input/file02 
 $ hadoop fs -cat /usr/joe/wordcount/input/file01 
 Hello World Bye World 
 $ hadoop fs -cat /usr/joe/wordcount/input/file02 
 Hello Hadoop Goodbye Hadoop
Hadoop Overview 
 Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is 
the Hadoop version installed, compile WordCount.java and create a jar: 
 $ mkdir wordcount_classes 
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ . 
 Run the application: 
 $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount 
/usr/joe/wordcount/input /usr/joe/wordcount/output 
 Output: 
 $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 
 Bye 1 
 Goodbye 1 
 Hadoop 2 
 Hello 2 
 World 2
Hadoop Overview 
 Interesting facts about MapReduce 
 MapReduce can run on any type of file including 
 Hadoop Streaming allows other languages, 
such as Python, R, and Ruby, to use MapReduce. 
 Can include a Combiner that pre-aggregates map 
output locally, reducing data sent to the reducers 
 A Reducer is optional; map-only jobs suit 
image processing and ETL 
 Hadoop includes a JobTracker web UI 
 MRUnit – JUnit-based test framework for MapReduce
Hadoop Overview 
 Spring for Apache Hadoop project 
 Configure and run MapReduce jobs as 
container managed objects 
 Provide template helper classes for 
HDFS, HBase, Pig and Hive. 
 Use standard Spring approach for 
 Access all Spring goodies – Messaging, 
Persistence, Security, Web Services, etc.
 Hive is an alternative to writing MapReduce 
jobs. Hive compiles to MapReduce. 
 Hive programs are written in HiveQL, 
which is similar to SQL. 
 Examples: 
 Create table: hive> CREATE TABLE pokes (foo 
INT, bar STRING); 
 Loading data: hive> LOAD DATA LOCAL 
INPATH './examples/files/kv1.txt' OVERWRITE 
 Examples (cont): 
 Getting data out of hive: INSERT OVERWRITE 
DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM 
invites a WHERE a.ds='2008-08-15'; 
 Join: FROM pokes t1 JOIN invites t2 ON (t1.bar 
= t2.bar) INSERT OVERWRITE TABLE events 
SELECT t1.bar, t1.foo, t2.foo; 
 Hive may reduce the amount of code you have to write 
when you are doing data wrangling. 
 It's a tool that has its place and is useful to know.
 Started as a subproject of Lucene in 2008. 
 Idea behind Mahout is that it provides a 
framework for the development and 
deployment of Machine Learning 
 Currently it has three distinct capabilities: 
 Classification 
 Clustering 
 Recommenders
 Support for recommenders includes: 
 DataModel – provides access to the preference data 
 UserSimilarity – computes similarity between users 
 ItemSimilarity – computes similarity between items 
 UserNeighborhood – find a neighborhood (mini cluster) of 
like-minded users. 
 Recommender – the producer of recommendations. 
 Algorithms!
Intro to Recommenders
What is a recommender? 
 Wikipedia [3]: 
 A subclass of [an] information filtering system that seek to 
predict the 'rating' or 'preference' that user would give to 
an item 
 My addition: A subclass of machine-learning. 
 Recommender model [2]: 
 Users 
 Items 
 Ratings 
 Community
What is a recommender? [2]
Recommender types 
 Non-personalized [2] 
 Content-based filtering (user-item) [2] 
 Hybrid [3] 
 Collaborative filtering (user-user, item-item) 
Collaborative Filtering 
 We will now look at item-item collaborative 
filtering as the recommendation algorithm. 
 Answers the question: what items are similar 
to the ones you like? 
 Popularized by Amazon, who found that 
item-item scales better, can be done in real 
time, and generates high-quality results. [8] 
 Specifically we will look at the Pearson 
correlation coefficient as the similarity measure.
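Once item-item similarities are known, one common way to turn them into a prediction is a similarity-weighted average of the ratings the user has already given. The sketch below assumes that formulation; it is not necessarily what Mahout does internally, and the class name, method name, and sample numbers are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Item-item prediction sketch: weight a user's known ratings by how similar
// each rated item is to the target item. All names here are illustrative.
class ItemItemSketch {

    // userRatings: itemId -> rating the user gave that item.
    // similarityToTarget: itemId -> similarity between that item and the target item.
    static double predict(Map<String, Double> userRatings,
                          Map<String, Double> similarityToTarget) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Double sim = similarityToTarget.get(rated.getKey());
            if (sim == null || sim <= 0) {
                continue; // assumption: ignore unknown or non-positive similarities
            }
            weighted += sim * rated.getValue();
            totalWeight += sim;
        }
        // NaN signals that no similar item was rated, so no prediction is possible.
        return totalWeight == 0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Double> ratings = new HashMap<>();
        ratings.put("268", 4.0);
        ratings.put("345", 2.0);

        Map<String, Double> simTo312 = new HashMap<>();
        simTo312.put("268", 0.9); // made-up similarity values
        simTo312.put("345", 0.1);

        // Roughly 3.8: the prediction leans toward the rating of the more similar movie.
        System.out.println(predict(ratings, simTo312));
    }
}
```

Because the similarities can be computed offline and only the weighted average runs per request, this split is one reason item-item filtering is practical in real time.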
Collaborative Filtering 
 Pearson's correlation coefficient - defined as the 
covariance of the two variables divided by the product of 
their standard deviations.
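The definition above translates directly into code. The following is a minimal sketch, assuming the two arrays hold ratings of two items by the same users in the same order; the class name PearsonSketch is made up, and Mahout ships its own implementation that should be preferred in practice.

```java
// Pearson correlation between two equally long rating vectors.
class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two vectors of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate unnormalized covariance and variances; the 1/n factors
        // cancel in the ratio, so they are left out.
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            return Double.NaN; // undefined when either vector is constant
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Ratings that rise together are perfectly correlated: prints 1.0
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
    }
}
```

The result ranges from -1 (ratings move in opposite directions) to 1 (ratings move together), which is why it can serve as an item-item similarity score.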
Collaborative Filtering 
 Idea is to examine a log file for users' 
movie ratings. Data looks like this: 
 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
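To make the extraction concrete, the sketch below pulls the user, movie, and rating out of one such log line with java.util.regex. It assumes each record is a single line (the slide wraps them) and emits a comma-separated triple; the class name RatingLogParser and that output format are illustrative choices, not part of the original pipeline.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts "user,movie,rating" from an access-log line like the ones above.
class RatingLogParser {

    // The '?' and '.' are escaped so they match literally in the request string.
    private static final Pattern REQUEST =
            Pattern.compile("GET /rate\\?movie=(\\d+)&rating=(\\d) HTTP/1\\.1");
    private static final Pattern USER = Pattern.compile("\"USER=(\\d+)\"");

    // Returns null when the line is not a rating request or has no user cookie.
    static String parse(String logLine) {
        Matcher request = REQUEST.matcher(logLine);
        Matcher user = USER.matcher(logLine);
        if (!request.find() || !user.find()) {
            return null;
        }
        return user.group(1) + "," + request.group(1) + "," + request.group(2);
    }

    public static void main(String[] args) {
        String line = " - - [06/Jan/1998:01:48:18 -0500] "
                + "\"GET /rate?movie=268&rating=4 HTTP/1.1\" 200 7 "
                + "\"http://clouderamovies.com/\" "
                + "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0\" "
                + "\"USER=286\"";
        System.out.println(parse(line)); // prints 286,268,4
    }
}
```

Lines that are not rating requests return null, so a caller can filter them out before handing the data to the recommender.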
Collaborative Filtering 
 Steps used for the analysis: 
 Run a hive script to extract the user data 
from a log file 
 Run Mahout command from the 
command line (could be done 
programmatically as well). 
 Examine the contents.
Collaborative filtering 
<hive-runner id="hiveRunner"> 
SELECT cookie as user, 
regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) as movie, 
CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) as double) as rating 
WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != ""; 
Collaborative filtering 
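public class HiveApp { 
    private static final Log log = LogFactory.getLog(HiveApp.class); 

    public static void main(String[] args) throws Exception { 
        // Load the Spring context that defines the hiveRunner bean. 
        AbstractApplicationContext context = new ClassPathXmlApplicationContext( 
                "/META-INF/spring/hive-context.xml", HiveApp.class); 
        context.registerShutdownHook(); 
        HiveRunner runner = context.getBean(HiveRunner.class); 
        runner.call(); // run the Hive script configured on the bean 
    } 
} 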
Collaborative Filtering 
 Hive output looks like this (This is 
the format that Mahout requires): 
UserId, MovieID, relationship strength 
Collaborative filtering 
 Rerun Mahout with a different correlation 
 Do A/B comparison in production 
 Gather statistics over time 
 See if one algorithm is better than others.
Spring XD 
 XD - a Spring.io project that extends the 
Spring Data team's work on the Spring for 
Apache Hadoop project. 
 High throughput distributed data ingestion into HDFS from a 
variety of input sources. 
 Real-time analytics at ingestion time, e.g. gathering metrics and 
counting values. 
 Hadoop workflow management via batch jobs that combine 
interactions with standard enterprise systems (e.g. RDBMS) as 
well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive 
or Cascading). 
 High throughput data export, e.g. from HDFS to a RDBMS or 
NoSQL database.
Spring XD 
 Configure a stream using XD. Simple case:
Spring XD 
 More typical Corporate Use Case Stream:
Spring XD 
 Admin UI
 [1] Introduction to recommender systems. Joseph Konstan. 
 [2] Intro to recommendations. Coursera. 
 [3] Recommender system. Wikipedia. 
 [4] An Algorithmic Framework for Performing Collaborative Filtering. 
 [5] Hybrid Web Recommender Systems. 
 [6] Hadoop web site. 
 [7] Apache Hadoop. Wikipedia. 
 [8] Amazon.com Recommendations paper. cs.umd.edu. 
 [9] Cloudera Data Science Training. Cloudera.

