
Prerequisites

Ensure that Hadoop is installed, configured and is running. More details:

• Single Node Setup for first-time users.
• Cluster Setup for large, distributed clusters.

Overview
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input
and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture
Guide) are running on the same set of nodes. This configuration allows the framework
to effectively schedule tasks on the nodes where data is already present, resulting in
very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job-client.
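In code, this configuration-and-submission step amounts to a handful of calls on the org.apache.hadoop.mapred API. The following is only a minimal sketch, not part of the tutorial's example: MinimalJob is a hypothetical class, and the library's IdentityMapper/IdentityReducer stand in for the application's own map and reduce implementations (the full working pattern is the WordCount driver shown later).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical minimal driver: configure a job and submit it to the JobTracker.
public class MinimalJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MinimalJob.class);
    conf.setJobName("minimal");

    conf.setOutputKeyClass(LongWritable.class);   // key type of the job's output
    conf.setOutputValueClass(Text.class);         // value type of the job's output

    // Stand-ins for the application's own Mapper/Reducer implementations.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input location
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output location

    // Submit to the JobTracker and monitor progress until the job completes.
    JobClient.runJob(conf);
  }
}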

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

• Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
• Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI™ based).

Inputs and Outputs


The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a
set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement
the WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
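To make the Writable/WritableComparable requirement concrete, here is a minimal sketch of a custom key type. WordPair is a hypothetical class, not part of this tutorial's example; a value-only type would need to implement just Writable.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: write/readFields let the framework serialize it,
// and compareTo lets the framework sort keys during the shuffle.
public class WordPair implements WritableComparable<WordPair> {
  private Text first = new Text();
  private Text second = new Text();

  public void set(String f, String s) {
    first.set(f);
    second.set(s);
  }

  public void write(DataOutput out) throws IOException {    // serialize
    first.write(out);
    second.write(out);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    first.readFields(in);
    second.readFields(in);
  }

  public int compareTo(WordPair other) {                     // sort order
    int cmp = first.compareTo(other.first);
    return (cmp != 0) ? cmp : second.compareTo(other.second);
  }

  public int hashCode() {  // used by the default HashPartitioner
    return first.hashCode() * 163 + second.hashCode();
  }
}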

Example: WordCount v1.0


Before we jump into the details, let's walk through an example MapReduce application to get a flavour for how they work.

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop installation (Single Node Setup).

Source Code
WordCount.java

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  // Mapper: emits <word, 1> for every word in each input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Assuming that:

• /usr/joe/wordcount/input - input directory in HDFS
• /usr/joe/wordcount/output - output directory in HDFS

Sample text-files as input:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
