Prerequisites: Single Node Setup, Cluster Setup
Overview
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input
and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executing any failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture
Guide) are running on the same set of nodes. This configuration allows the framework
to effectively schedule tasks on the nodes where data is already present, resulting in
very high aggregate bandwidth across the cluster.
Although the Hadoop framework is implemented in Java™, MapReduce applications
need not be written in Java:
• Hadoop Streaming is a utility which allows users to create and run jobs with any
executables (e.g. shell utilities) as the mapper and/or the reducer (see the example
after this list).
• Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications
(non JNI™ based).
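For instance, a minimal Streaming invocation might look like the following sketch. The
exact path to the streaming jar varies by Hadoop version and installation layout, the
/usr/joe/streaming/* paths are hypothetical placeholders, and /bin/cat and /usr/bin/wc
stand in for any mapper and reducer executables:
$ bin/hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar \
    -input /usr/joe/streaming/input \
    -output /usr/joe/streaming/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc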
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
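In the WordCount example below these types are concrete: <k1, v1> is a
<LongWritable byte offset, Text line of input>, the map emits <k2, v2> pairs of
<Text word, IntWritable 1>, and the reduce produces <k3, v3> pairs of
<Text word, IntWritable total count>.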
Example: WordCount v1.0
WordCount is a simple application that counts the number of occurrences of each word
in a given input set.
Source Code
WordCount.java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits a <word, 1> pair for every token in the input line.
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts for each word and emits <word, total>.
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // the reducer doubles as a local combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the
Hadoop version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Assuming that:
• /usr/joe/wordcount/input - input directory in HDFS
• /usr/joe/wordcount/output - output directory in HDFS
Sample text-files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
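Note that because the job also registers Reduce as a combiner
(conf.setCombinerClass(Reduce.class) above), each map's output is locally
pre-aggregated before being shipped to the reducers; the final counts are the same
either way, since summing counts is associative.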