Applied Recommender Systems 
Bob Brehm 
5/20/2014
Presentation Topics 
 Hadoop MapReduce Overview 
 Mahout Overview 
 Hive Overview 
 Review recommender systems 
 Introduction to Spring XD 
 Demonstrations as we go
Hadoop Overview 
 History [7] 
 2003: Apache Nutch (open-source web search engine) was created by Doug Cutting and Mike Cafarella. 
 2004: Google File System and MapReduce papers published. 
 2005: Hadoop was created in Nutch as an open-source implementation of GFS and MapReduce.
Hadoop Overview 
 Today Hadoop is an independent Apache 
Project consisting of 4 modules: [6] 
 Hadoop common 
 HDFS – distributed, scalable file system 
 YARN (V2) – job scheduling and cluster 
resource management 
 MapReduce – system for parallel 
processing of large data sets 
 Hadoop market size is over $3 billion!
Hadoop Overview 
 Other Hadoop-related projects include: 
 Hive – data warehouse infrastructure 
 Mahout – machine learning library 
 While there are many more projects, the rest of the talk focuses on these two, along with MapReduce and HDFS.
Hadoop Overview 
 NameNode – keeps track of HDFS metadata and all DataNodes 
 JobTracker – main MapReduce job scheduler 
 DataNode – stores the actual data blocks on each node 
 TaskTracker – runs map and reduce tasks on each node, reporting to the JobTracker
Hadoop Overview 
 HDFS basic command examples: 
 Put – copies from local to HDFS 
 hadoop fs -put localfile 
/user/hadoop/hadoopfile 
 Mkdir – makes a directory 
 hadoop fs -mkdir /user/hadoop/dir1 
/user/hadoop/dir2 
 Tail – Displays last kilobyte of file 
 hadoop fs -tail pathname 
 Very similar to Linux commands
Hadoop Overview 
 Input data – wrangling can be difficult 
 Mapper – splits input into key-value pairs 
 Shuffle/Sort – groups values by key 
 Reducer – combines the values for each key
Hadoop Overview 
 WordCount (the "Hello World" of MapReduce) – counts 
occurrences of each word in a document 
 The term-frequency half of TF-IDF
Hadoop Overview 
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Emit (word, 1) for every token in the input line
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
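 The slides show only the Mapper; the driver on the next slide also references a Reduce class. A minimal sketch of that reducer, using the same old-style org.apache.hadoop.mapred API as the Map fragment above (plus java.util.Iterator):

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum the per-word counts emitted by the mappers (and the combiner)
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}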
Hadoop Overview 
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
Hadoop Overview 
 Set up the data: 
 /usr/joe/wordcount/input - input directory in HDFS 
 /usr/joe/wordcount/output - output directory in HDFS 
 $ hadoop fs -ls /usr/joe/wordcount/input/ 
 /usr/joe/wordcount/input/file01 
 /usr/joe/wordcount/input/file02 
 $ hadoop fs -cat /usr/joe/wordcount/input/file01 
 Hello World Bye World 
 $ hadoop fs -cat /usr/joe/wordcount/input/file02 
 Hello Hadoop Goodbye Hadoop
Hadoop Overview 
 Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is 
the Hadoop version installed, compile WordCount.java and create a jar: 
 $ mkdir wordcount_classes 
 $ javac -classpath ${HADOOP_HOME}/hadoop- 
${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java 
 $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ . 
 Run the application: 
 $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount 
/usr/joe/wordcount/input /usr/joe/wordcount/output 
 Output: 
 $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 
 Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
Hadoop Overview 
 Interesting facts about MapReduce 
 MapReduce can run on any type of file, including images. 
 Hadoop Streaming allows other languages (Python, R, Ruby, ...) to write MapReduce jobs. 
 An optional Combiner can reduce network traffic between the map and reduce phases. 
 A Reducer is not required (e.g. image processing, ETL). 
 Hadoop includes a JobTracker web UI. 
 MRUnit – JUnit-based test framework for MapReduce
Hadoop Overview 
 Spring for Apache Hadoop project 
 Configure and run MapReduce jobs as container-managed objects 
 Provides template/helper classes for HDFS, HBase, Pig and Hive 
 Use the standard Spring approach for Hadoop! 
 Access all the Spring goodies – Messaging, 
Persistence, Security, Web Services, etc.
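 A rough sketch of what that looks like, using the hdp XML namespace from Spring for Apache Hadoop (the paths, class names, and fs URL below are illustrative, mirroring the WordCount example above):

<hdp:configuration>
  <!-- Hadoop cluster connection properties (assumed local single-node setup) -->
  fs.default.name=hdfs://localhost:8020
</hdp:configuration>

<!-- Declare the WordCount job as a container-managed bean -->
<hdp:job id="wordcountJob"
         input-path="/usr/joe/wordcount/input"
         output-path="/usr/joe/wordcount/output"
         mapper="org.myorg.WordCount.Map"
         reducer="org.myorg.WordCount.Reduce"/>

<!-- Run the job when the application context starts -->
<hdp:job-runner id="wordcountRunner" job-ref="wordcountJob" run-at-startup="true"/>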
Hive 
 Hive is an alternative to writing MapReduce 
jobs; Hive queries compile to MapReduce. 
 Hive programs are written in HiveQL, 
which is similar to SQL. 
 Examples: 
 Create table: hive> CREATE TABLE pokes (foo 
INT, bar STRING); 
 Loading data: hive> LOAD DATA LOCAL 
INPATH './examples/files/kv1.txt' OVERWRITE 
INTO TABLE pokes;
Hive 
 Examples (cont): 
 Getting data out of hive: INSERT OVERWRITE 
DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM 
invites a WHERE a.ds='2008-08-15'; 
 Join: FROM pokes t1 JOIN invites t2 ON (t1.bar 
= t2.bar) INSERT OVERWRITE TABLE events 
SELECT t1.bar, t1.foo, t2.foo; 
 Hive may reduce the amount of code you have to write 
when you are doing data wrangling. 
 It's a tool that has its place and is useful to know.
Mahout 
 Started as a subproject of Lucene in 2008. 
 The idea behind Mahout is that it provides a 
framework for the development and 
deployment of machine learning 
algorithms. 
 Currently it has three distinct capabilities: 
 Classification 
 Clustering 
 Recommenders
Mahout 
 Support for recommenders includes: 
 DataModel – provides connections to the data 
 UserSimilarity – computes similarity between users 
 ItemSimilarity – computes similarity between items 
 UserNeighborhood – finds a neighborhood (mini cluster) of 
like-minded users 
 Recommender – the producer of recommendations 
 Algorithms!
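 A minimal sketch of how these pieces wire together using Mahout's Taste API; the file name and user ID are assumptions, matching the userID,itemID,rating format produced by the Hive step later in the talk:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // DataModel: a userID,itemID,rating CSV file (name assumed here)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // ItemSimilarity: Pearson correlation between item rating vectors
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Recommender: item-item collaborative filtering
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
    // Top 5 recommendations for user 943 (an ID from the sample output)
    List<RecommendedItem> items = recommender.recommend(943L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}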
Intro to Recommenders
What is a recommender? 
 Wikipedia [3]: 
 A subclass of information filtering system that seeks to 
predict the 'rating' or 'preference' that a user would give to 
an item 
 My addition: A subclass of machine-learning. 
 Recommender model [2]: 
 Users 
 Items 
 Ratings 
 Community
What is a recommender? [2]
Recommender types 
 Non-personalized [2] 
 Content-based filtering (user-item) [2] 
 Hybrid [3] 
 Collaborative filtering (user-user, item-item) 
[2]
Recommender types 
 Non-personalized [2] 
 Content-based filtering (user-item) [2] 
 Hybrid [3] 
 Collaborative filtering (user-user, item-item) 
[2]
Collaborative Filtering 
 We will now look at item-item collaborative 
filtering as the recommendation algorithm. 
 It answers the question: which items are similar 
to the ones you like? 
 Popularized by Amazon, which found that 
item-item CF scales better, can be done in real 
time, and generates high-quality results. [8] 
 Specifically, we will use the Pearson 
correlation coefficient as the similarity measure.
Collaborative Filtering 
 Pearson's correlation coefficient - defined as the 
covariance of the two variables divided by the product of 
their standard deviations.
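 In symbols, for two items' rating vectors $x$ and $y$ over the users who rated both:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

 For example, rating vectors $x = (1, 2, 3)$ and $y = (6, 4, 2)$ give $r_{xy} = -1$: the second item is rated in exactly the opposite pattern from the first.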
Collaborative Filtering 
 The idea is to examine a web server log file of users' 
movie ratings. The data looks like this: 
 109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286" 
 109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 
HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; 
rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
Collaborative Filtering 
 Steps used for the analysis: 
 Run a Hive script to extract the rating data 
from the log file 
 Run the Mahout item-similarity job from the 
command line (it could also be invoked 
programmatically); an example command 
appears after the Hive output below 
 Examine the results.
Collaborative filtering 
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS
    SELECT cookie as user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) as movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) as double) as rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
Collaborative filtering 
public class HiveApp {

  private static final Log log = LogFactory.getLog(HiveApp.class);

  public static void main(String[] args) throws Exception {
    AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        "/META-INF/spring/hive-context.xml", HiveApp.class);
    context.registerShutdownHook();
    HiveRunner runner = context.getBean(HiveRunner.class);
    runner.call();
  }
}
Collaborative Filtering 
 Hive output looks like this (this is 
the format that Mahout requires): 
userID,movieID,rating 
943,373,3.0 
943,391,2.0 
943,796,3.0 
943,237,4.0 
943,840,4.0 
943,230,1.0 
943,229,2.0 
943,449,1.0 
943,450,1.0 
943,228,3.0
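 With the ratings in this format, the Mahout item-based recommender job can be run from the command line. A sketch of the invocation (the HDFS paths are hypothetical; the similarity class name is the Pearson measure discussed above):

$ mahout recommenditembased \
    --input /user/hive/warehouse/mahout_input_a \
    --output /user/joe/recommendations \
    --similarityClassname SIMILARITY_PEARSON_CORRELATION \
    --numRecommendations 10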
Collaborative filtering 
 Rerun Mahout with a different similarity measure, 
say SIMILARITY_EUCLIDEAN_DISTANCE 
 Do an A/B comparison in production 
 Gather statistics over time 
 See if one algorithm performs better than the others.
Spring XD 
 Spring XD – a Spring.io project that extends the work 
the Spring Data team did on the Spring for 
Apache Hadoop project. Its goals include: 
 High throughput distributed data ingestion into HDFS from a 
variety of input sources. 
 Real-time analytics at ingestion time, e.g. gathering metrics and 
counting values. 
 Hadoop workflow management via batch jobs that combine 
interactions with standard enterprise systems (e.g. RDBMS) as 
well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive 
or Cascading). 
 High throughput data export, e.g. from HDFS to a RDBMS or 
NoSQL database.
Spring XD 
 Configure a stream using XD. Simple case:
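 (The slide's screenshot is not reproduced here.) A minimal example of defining and deploying streams from the Spring XD shell; the stream names and module choices are illustrative:

xd:> stream create --name ticktock --definition "time | log" --deploy
xd:> stream create --name weblogs --definition "http --port=9000 | hdfs" --deploy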
Spring XD 
 More typical Corporate Use Case Stream:
Spring XD 
 Admin UI
?
Thanks!
References 
 [1] Introduction to recommender systems. Joseph Konstan. 
 [2] Intro to recommendations. Coursera. 
 [3] Recommender system. Wikipedia. 
 [4] An Algorithmic Framework for Performing Collaborative Filtering. 
 [5] Hybrid Web Recommender Systems. 
 [6] Hadoop web site. 
 [7] Apache Hadoop. Wikipedia. 
 [8] Amazon.com Recommendations paper. cs.umd.edu. 
 [9] Cloudera Data Science Training. Cloudera.
