This document provides an overview of applied recommender systems. It discusses Hadoop, MapReduce, Hive, Mahout, and collaborative filtering recommender algorithms. Hadoop is used to process large datasets in parallel across clusters; MapReduce is the programming model and Hive provides a SQL-like interface. Mahout contains machine learning libraries, including collaborative filtering algorithms for generating recommendations. The Pearson correlation coefficient is discussed as the similarity measure for item-item collaborative filtering.
2. Presentation Topics
Hadoop MapReduce Overview
Mahout Overview
Hive Overview
Review recommender systems
Introduction to Spring XD
Demonstrations as we go
3. Hadoop Overview
History [7]
2003: Apache Nutch (an open-source web search engine) was created by Doug Cutting and Mike Cafarella.
2004: The Google File System and MapReduce papers were published.
2005: Hadoop was created within Nutch as an open-source implementation of GFS and MapReduce.
4. Hadoop Overview
Today Hadoop is an independent Apache project consisting of 4 modules: [6]
Hadoop Common – shared utilities supporting the other modules
HDFS – distributed, scalable file system
YARN (v2) – job scheduling and cluster resource management
MapReduce – system for parallel processing of large data sets
The Hadoop market size is over $3 billion!
5. Hadoop Overview
Other Hadoop Related projects include
Hive – data warehouse infrastructure
Mahout – Machine learning library
While there are many more projects, the rest of the talk focuses on these two, along with MapReduce and HDFS.
6. Hadoop Overview
NameNode – master node that tracks the file system namespace and which DataNodes hold each block
JobTracker – main MapReduce scheduler; assigns tasks across the cluster
DataNode – stores the actual data blocks on each worker node
TaskTracker – runs map and reduce tasks on a node and reports progress to the JobTracker
7. Hadoop Overview
HDFS basic command examples:
Put – copies from local to HDFS
hadoop fs -put localfile
/user/hadoop/hadoopfile
Mkdir – makes a directory
hadoop fs -mkdir /user/hadoop/dir1
/user/hadoop/dir2
Tail – Displays last kilobyte of file
hadoop fs -tail pathname
Very similar to Linux commands
8. Hadoop Overview
Input data – wrangling it into a usable form can be difficult
Mapper – transforms input records into key/value pairs
Sort/shuffle – groups the mapped values by key
Reducer – combines the values for each key
9. Hadoop Overview
WordCount (the "Hello World" of MapReduce) – counts occurrences of each word in a document
The term-frequency half of TF-IDF
10. Hadoop Overview
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Tokenize the input line and emit (word, 1) for every token
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
11. Hadoop Overview
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  // Key/value types of the job output
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  // The reducer doubles as the combiner for local pre-aggregation
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  // Input and output paths come from the command line
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
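The driver above references a Reduce class that does not appear in this extraction (the deck jumps from slide 11 to 13). A minimal sketch of the matching old-API reducer, not the deck's original code:

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum the 1s emitted by the mapper for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}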
13. Hadoop Overview
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is
the Hadoop version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
14. Hadoop Overview
Interesting facts about MapReduce
MapReduce can run on any type of file, including images.
Hadoop streaming lets jobs be written in other languages (Python, R, Ruby) by passing records through stdin/stdout.
A Combiner can be included to pre-aggregate map output and reduce network traffic.
A Reducer is not required (e.g. image processing, ETL).
Hadoop includes a JobTracker web UI.
MRUnit – a JUnit-based test framework for MapReduce (a sketch follows this list)
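A minimal MRUnit sketch for the WordCount mapper shown earlier. This assumes MRUnit's old-API MapDriver; the test class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOnePerToken() throws Exception {
    // Drive the mapper with one input line and assert the expected (word, 1) pairs
    MapDriver.newMapDriver(new WordCount.Map())
        .withInput(new LongWritable(0), new Text("Hello World"))
        .withOutput(new Text("Hello"), new IntWritable(1))
        .withOutput(new Text("World"), new IntWritable(1))
        .runTest();
  }
}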
15. Hadoop Overview
Spring for Apache Hadoop project
Configure and run MapReduce jobs as container-managed objects
Provides template helper classes for HDFS, HBase, Pig and Hive
Use the standard Spring approach for Hadoop!
Access all the Spring goodies – messaging, persistence, security, web services, etc.
16. Hive
Hive is an alternative to writing MapReduce jobs directly; Hive queries compile to MapReduce.
Hive programs are written in HiveQL, which is similar to SQL.
Examples:
Create table: hive> CREATE TABLE pokes (foo INT, bar STRING);
Loading data: hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
17. Hive
Examples (cont):
Getting data out of hive: INSERT OVERWRITE
DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM
invites a WHERE a.ds='2008-08-15';
Join: FROM pokes t1 JOIN invites t2 ON (t1.bar
= t2.bar) INSERT OVERWRITE TABLE events
SELECT t1.bar, t1.foo, t2.foo;
Hive may reduce the amount of code you have to write
when you are doing data wrangling.
It's a tool that has its place and is useful to know.
18. Mahout
Started as a subproject of Lucene in 2008.
The idea behind Mahout is that it provides a framework for the development and deployment of machine learning algorithms.
Currently it has three distinct capabilities:
Classification
Clustering
Recommenders
19. Mahout
Support for recommenders includes (a minimal wiring sketch follows this list):
DataModel – provides access to the underlying preference data
UserSimilarity – computes similarity between users
ItemSimilarity – computes similarity between items
UserNeighborhood – finds a neighborhood (mini cluster) of like-minded users
Recommender – the producer of recommendations
Algorithms!
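As a rough sketch of how these pieces fit together, using Mahout's Taste API for an item-based recommender (the file name and user ID are illustrative, not from the slides):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderExample {
  public static void main(String[] args) throws Exception {
    // DataModel over a CSV in Mahout's userID,itemID,rating format
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // ItemSimilarity using Pearson correlation between item rating vectors
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Recommender combines the data model and the similarity measure
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // Top 5 recommendations for one user
    List<RecommendedItem> items = recommender.recommend(943L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}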
21. What is a recommender?
Wikipedia [3]:
A subclass of [an] information filtering system that seeks to predict the 'rating' or 'preference' that a user would give to an item.
My addition: a subclass of machine learning.
Recommender model [2]:
Users
Items
Ratings
Community
25. Collaborative Filtering
We will now look at item-item collaborative filtering as the recommendation algorithm.
It answers the question: which items are similar to the ones you like?
Popularized by Amazon, which found that item-item scales better, can be done in real time, and generates high-quality results. [8]
Specifically, we will look at the Pearson correlation coefficient.
26. Collaborative Filtering
Pearson's correlation coefficient - defined as the
covariance of the two variables divided by the product of
their standard deviations.
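In symbols (the formula itself is not reproduced on the slide):

r_{xy} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}

For item-item filtering, x_i and y_i are the ratings given to the two items by the users who rated both.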
27. Collaborative Filtering
The idea is to examine a web server log file for users' movie ratings. The data looks like this:
109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64;
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64;
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64;
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
28. Collaborative Filtering
Steps used for the analysis:
Run a Hive script to extract the user rating data from the log file
Run Mahout from the command line (it could also be done programmatically) – see the example below
Examine the contents of the output.
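A hedged example of such a command line, using Mahout's distributed item-based recommender job (the input and output paths are placeholders):

mahout recommenditembased \
  --input /user/hive/warehouse/mahout_input_a \
  --output /user/analysis/recommendations \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION \
  --numRecommendations 10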
29. Collaborative filtering
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS
    SELECT cookie AS user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) AS movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) AS double) AS rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
30. Collaborative filtering
public class HiveApp {
  private static final Log log = LogFactory.getLog(HiveApp.class);

  public static void main(String[] args) throws Exception {
    // Bootstrap the Spring context that defines the hiveRunner bean
    AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        "/META-INF/spring/hive-context.xml", HiveApp.class);
    context.registerShutdownHook();
    // Execute the HiveQL script configured in hive-context.xml
    HiveRunner runner = context.getBean(HiveRunner.class);
    runner.call();
  }
}
31. Collaborative Filtering
Hive output looks like this (this is the comma-separated format Mahout requires):
userID, movieID (itemID), preference/rating strength
943,373,3.0
943,391,2.0
943,796,3.0
943,237,4.0
943,840,4.0
943,230,1.0
943,229,2.0
943,449,1.0
943,450,1.0
943,228,3.0
32. Collaborative filtering
Rerun Mahout with a different similarity measure, say SIMILARITY_EUCLIDEAN_DISTANCE (see the one-line change below)
Do an A/B comparison in production
Gather statistics over time
See if one algorithm performs better than the others.
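Programmatically, swapping the measure is a one-line change in the earlier wiring sketch; EuclideanDistanceSimilarity is another similarity implementation in Mahout's Taste API:

// Replace Pearson with Euclidean distance and compare the resulting recommendations
ItemSimilarity similarity = new EuclideanDistanceSimilarity(model);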
33. Spring XD
Spring XD is a Spring.io project that extends the Spring Data team's work on the Spring for Apache Hadoop project:
High throughput distributed data ingestion into HDFS from a
variety of input sources.
Real-time analytics at ingestion time, e.g. gathering metrics and
counting values.
Hadoop workflow management via batch jobs that combine
interactions with standard enterprise systems (e.g. RDBMS) as
well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive
or Cascading).
High throughput data export, e.g. from HDFS to a RDBMS or
NoSQL database.
34. Spring XD
Configure a stream using XD. Simple case:
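The original slide shows this as a diagram; in the XD shell a simple ingest stream looks roughly like this (the stream name and modules are illustrative):

xd:> stream create --name weblogs --definition "http | hdfs" --deploy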
35. Spring XD
More typical Corporate Use Case Stream:
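Also shown as a diagram on the original slide; a hedged sketch of a stream that filters rating requests before landing them in HDFS (the module options are illustrative):

xd:> stream create --name movieRatings --definition "http --port=9000 | filter --expression=payload.contains('/rate') | hdfs" --deploy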