Applied Recommender Systems 
Bob Brehm 
5/20/2014
Presentation Topics 
 Hadoop MapReduce Overview 
 Mahout Overview 
 Hive Overview 
 Review recommender systems 
 Introduction to Spring XD 
 Demonstrations as we go
Hadoop Overview 
 History [7] 
 2003: Apache Nutch (open-source web search engine) was created by Doug Cutting and Mike Cafarella. 
 2004: Google File System and MapReduce papers published. 
 2005: Hadoop was created in Nutch as an open-source implementation of GFS and MapReduce.
Hadoop Overview 
 Today Hadoop is an independent Apache 
Project consisting of 4 modules: [6] 
 Hadoop common 
 HDFS – distributed, scalable file system 
 YARN (V2) – job scheduling and cluster 
resource management 
 MapReduce – system for parallel 
processing of large data sets 
 Hadoop market size is over $3 billion!
Hadoop Overview 
 Other Hadoop-related projects include: 
 Hive – data warehouse infrastructure 
 Mahout – machine learning library 
 While there are many more projects, the rest of the talk focuses on these two, along with MapReduce and HDFS.
Hadoop Overview 
 NameNode – keeps track of HDFS metadata and all DataNodes 
 JobTracker – main MapReduce job scheduler 
 DataNode – stores the actual data blocks on each node 
 TaskTracker – runs map and reduce tasks on each node, reporting to the JobTracker
Hadoop Overview 
 HDFS basic command examples: 
 Put – copies from local to HDFS 
 hadoop fs -put localfile 
/user/hadoop/hadoopfile 
 Mkdir – makes a directory 
 hadoop fs -mkdir /user/hadoop/dir1 
/user/hadoop/dir2 
 Tail – Displays last kilobyte of file 
 hadoop fs -tail pathname 
 Very similar to Linux commands
Hadoop Overview 
 Input data – wrangling can be difficult 
 Mapper – splits input into key-value pairs 
 Shuffle/Sort – groups values by key 
 Reducer – combines the values for each key
Hadoop Overview 
 WordCount (the "Hello World" of MapReduce) – counts 
occurrences of each word in a document 
 The term-frequency half of TF-IDF
Hadoop Overview 
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Emit (word, 1) for every token in the input line
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
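 The slides show only the Mapper; the driver on the next slide also references a Reduce class. A minimal sketch of that reducer, using the same old-style org.apache.hadoop.mapred API as the Map fragment above (plus java.util.Iterator):

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum the per-word counts emitted by the mappers (and the combiner)
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}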
Hadoop Overview 
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
Hadoop Overview 
 Set up the data: 
 /usr/joe/wordcount/input - input directory in HDFS 
 /usr/joe/wordcount/output - output directory in HDFS 
 $ hadoop fs -ls /usr/joe/wordcount/input/ 
 /usr/joe/wordcount/input/file01 
 /usr/joe/wordcount/input/file02 
 $ hadoop fs -cat /usr/joe/wordcount/input/file01 
 Hello World Bye World 
 $ hadoop fs -cat /usr/joe/wordcount/input/file02 
 Hello Hadoop Goodbye Hadoop
Hadoop Overview 
 Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is 
the Hadoop version installed, compile WordCount.java and create a jar: 
 $ mkdir wordcount_classes 
 $ javac -classpath ${HADOOP_HOME}/hadoop- 
${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ . 
 Run the application: 
 $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount 
/usr/joe/wordcount/input /usr/joe/wordcount/output 
 Output: 
 $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 
 Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
Hadoop Overview 
 Interesting facts about MapReduce 
 MapReduce can run on any type of file, including images. 
 Hadoop Streaming allows other languages (Python, R, Ruby, ...) to write MapReduce jobs. 
 An optional Combiner can reduce network traffic between the map and reduce phases. 
 A Reducer is not required (e.g. image processing, ETL). 
 Hadoop includes a JobTracker web UI. 
 MRUnit – JUnit-based test framework for MapReduce
Hadoop Overview 
 Spring for Apache Hadoop project 
 Configure and run MapReduce jobs as container-managed objects 
 Provides template/helper classes for HDFS, HBase, Pig and Hive 
 Use the standard Spring approach for Hadoop! 
 Access all the Spring goodies – Messaging, 
Persistence, Security, Web Services, etc.
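 A rough sketch of what that looks like, using the hdp XML namespace from Spring for Apache Hadoop (the paths, class names, and fs URL below are illustrative, mirroring the WordCount example above):

<hdp:configuration>
  <!-- Hadoop cluster connection properties (assumed local single-node setup) -->
  fs.default.name=hdfs://localhost:8020
</hdp:configuration>

<!-- Declare the WordCount job as a container-managed bean -->
<hdp:job id="wordcountJob"
         input-path="/usr/joe/wordcount/input"
         output-path="/usr/joe/wordcount/output"
         mapper="org.myorg.WordCount.Map"
         reducer="org.myorg.WordCount.Reduce"/>

<!-- Run the job when the application context starts -->
<hdp:job-runner id="wordcountRunner" job-ref="wordcountJob" run-at-startup="true"/>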
Hive 
 Hive is an alternative to writing MapReduce 
jobs; Hive queries compile to MapReduce. 
 Hive programs are written in HiveQL, 
which is similar to SQL. 
 Examples: 
 Create table: hive> CREATE TABLE pokes (foo 
INT, bar STRING); 
 Loading data: hive> LOAD DATA LOCAL 
INPATH './examples/files/kv1.txt' OVERWRITE 
INTO TABLE pokes;
Hive 
 Examples (cont): 
 Getting data out of hive: INSERT OVERWRITE 
DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM 
invites a WHERE a.ds='2008-08-15'; 
 Join: FROM pokes t1 JOIN invites t2 ON (t1.bar 
= t2.bar) INSERT OVERWRITE TABLE events 
SELECT t1.bar, t1.foo, t2.foo; 
 Hive may reduce the amount of code you have to write 
when you are doing data wrangling. 
 It's a tool that has its place and is useful to know.
Mahout 
 Started as a subproject of Lucene in 2008. 
 The idea behind Mahout is that it provides a 
framework for the development and 
deployment of machine learning 
algorithms. 
 Currently it has three distinct capabilities: 
 Classification 
 Clustering 
 Recommenders
Mahout 
 Support for recommenders includes: 
 DataModel – provides connections to the data 
 UserSimilarity – computes similarity between users 
 ItemSimilarity – computes similarity between items 
 UserNeighborhood – finds a neighborhood (mini cluster) of 
like-minded users 
 Recommender – the producer of recommendations 
 Algorithms!
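 A minimal sketch of how these pieces wire together using Mahout's Taste API; the file name and user ID are assumptions, matching the userID,itemID,rating format produced by the Hive step later in the talk:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // DataModel: a userID,itemID,rating CSV file (name assumed here)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // ItemSimilarity: Pearson correlation between item rating vectors
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Recommender: item-item collaborative filtering
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
    // Top 5 recommendations for user 943 (an ID from the sample output)
    List<RecommendedItem> items = recommender.recommend(943L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}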
Intro to Recommenders
What is a recommender? 
 Wikipedia [3]: 
 A subclass of information filtering system that seeks to 
predict the 'rating' or 'preference' that a user would give to 
an item 
 My addition: A subclass of machine-learning. 
 Recommender model [2]: 
 Users 
 Items 
 Ratings 
 Community
What is a recommender? [2]
Recommender types 
 Non-personalized [2] 
 Content-based filtering (user-item) [2] 
 Hybrid [3] 
 Collaborative filtering (user-user, item-item) 
[2]
Recommender types 
 Non-personalized [2] 
 Content-based filtering (user-item) [2] 
 Hybrid [3] 
 Collaborative filtering (user-user, item-item) 
[2]
Collaborative Filtering 
 We will now look at item-item collaborative 
filtering as the recommendation algorithm. 
 It answers the question: which items are similar 
to the ones you like? 
 Popularized by Amazon, which found that 
item-item CF scales better, can be done in real 
time, and generates high-quality results. [8] 
 Specifically, we will use the Pearson 
correlation coefficient as the similarity measure.
Collaborative Filtering 
 Pearson's correlation coefficient - defined as the 
covariance of the two variables divided by the product of 
their standard deviations.
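 In symbols, for two items' rating vectors $x$ and $y$ over the users who rated both:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

 For example, rating vectors $x = (1, 2, 3)$ and $y = (6, 4, 2)$ give $r_{xy} = -1$: the second item is rated in exactly the opposite pattern from the first.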
Collaborative Filtering 
 The idea is to examine a web server log file of users' 
movie ratings. The data looks like this: 
 109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
Collaborative Filtering 
 Steps used for the analysis: 
 Run a Hive script to extract the rating data 
from the log file 
 Run the Mahout item-similarity job from the 
command line (it could also be invoked 
programmatically); an example command 
appears after the Hive output below 
 Examine the results.
Collaborative filtering 
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS
    SELECT cookie as user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) as movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) as double) as rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
Collaborative filtering 
public class HiveApp {

  private static final Log log = LogFactory.getLog(HiveApp.class);

  public static void main(String[] args) throws Exception {
    AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        "/META-INF/spring/hive-context.xml", HiveApp.class);
    context.registerShutdownHook();
    HiveRunner runner = context.getBean(HiveRunner.class);
    runner.call();
  }
}
Collaborative Filtering 
 Hive output looks like this (this is 
the format that Mahout requires): 
userID,movieID,rating 
943,373,3.0 
943,391,2.0 
943,796,3.0 
943,237,4.0 
943,840,4.0 
943,230,1.0 
943,229,2.0 
943,449,1.0 
943,450,1.0 
943,228,3.0
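 With the ratings in this format, the Mahout item-based recommender job can be run from the command line. A sketch of the invocation (the HDFS paths are hypothetical; the similarity class name is the Pearson measure discussed above):

$ mahout recommenditembased \
    --input /user/hive/warehouse/mahout_input_a \
    --output /user/joe/recommendations \
    --similarityClassname SIMILARITY_PEARSON_CORRELATION \
    --numRecommendations 10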
Collaborative filtering 
 Rerun Mahout with a different similarity measure, 
say SIMILARITY_EUCLIDEAN_DISTANCE 
 Do an A/B comparison in production 
 Gather statistics over time 
 See if one algorithm performs better than the others.
Spring XD 
 Spring XD – a Spring.io project that extends the work 
the Spring Data team did on the Spring for 
Apache Hadoop project. Its goals include: 
 High throughput distributed data ingestion into HDFS from a 
variety of input sources. 
 Real-time analytics at ingestion time, e.g. gathering metrics and 
counting values. 
 Hadoop workflow management via batch jobs that combine 
interactions with standard enterprise systems (e.g. RDBMS) as 
well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive 
or Cascading). 
 High throughput data export, e.g. from HDFS to a RDBMS or 
NoSQL database.
Spring XD 
 Configure a stream using XD. Simple case:
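 (The slide's screenshot is not reproduced here.) A minimal example of defining and deploying streams from the Spring XD shell; the stream names and module choices are illustrative:

xd:> stream create --name ticktock --definition "time | log" --deploy
xd:> stream create --name weblogs --definition "http --port=9000 | hdfs" --deploy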
Spring XD 
 More typical Corporate Use Case Stream:
Spring XD 
 Admin UI
?
Thanks!
References 
 [1] Introduction to recommender systems. Joseph Konstan. 
 [2] Intro to recommendations. Coursera. 
 [3] Recommender system. Wikipedia. 
 [4] An Algorithmic Framework for Performing Collaborative Filtering. 
 [5] Hybrid Web Recommender Systems. 
 [6] Hadoop web site. 
 [7] Apache Hadoop. Wikipedia. 
 [8] Amazon.com Recommendations paper. cs.umd.edu. 
 [9] Cloudera Data Science Training. Cloudera.
