MapReduce Tutorial

MapReduce Tutorial: Introduction

In this MapReduce Tutorial blog, I am going to introduce you to MapReduce, which is one of the core building blocks of processing in the Hadoop framework. Before moving ahead, I would suggest you get familiar with HDFS concepts, which I have covered in my previous HDFS tutorial blog. This will help you understand the MapReduce concepts quickly and easily.

Before we begin, let us have a brief understanding of the following.

What is Big Data?

Big Data refers to collections of data so colossal that they can hardly be processed using traditional data processing systems. Familiar examples of Big Data are the volumes of data generated by currently trending social media platforms such as Facebook, Instagram, WhatsApp and YouTube.

What is Hadoop?
Hadoop is an open-source Big Data framework from the Apache Software Foundation. It is a software utility that works across a network of computers in parallel to store Big Data and process it using the MapReduce programming model.

Google released a paper on MapReduce technology in December 2004. This became the genesis of the
Hadoop Processing Model. So, MapReduce is a programming model that allows us to perform parallel and
distributed processing on huge data sets. The topics that I have covered in this MapReduce tutorial blog are as
follows:

 Traditional Way for parallel and distributed processing
 What is MapReduce?
 MapReduce Example
 MapReduce Advantages
 MapReduce Program
 MapReduce Program Explained
 MapReduce Use Case: KMeans Algorithm

MapReduce Tutorial: Traditional Way


Let us understand how parallel and distributed processing used to happen in the traditional way, before the MapReduce framework existed. Take an example where I have a weather log containing the daily average temperatures for the years 2000 to 2015. Here, I want to find the day with the highest temperature in each year.

So, in the traditional way, I will split the data into smaller parts or blocks and store them on different machines. Then, I will find the highest temperature in each part stored on the corresponding machine. Finally, I will combine the results received from each of the machines to produce the final output. Let us look at the challenges associated with this traditional approach:

1. Critical path problem: This is the amount of time taken to finish the job without delaying the next milestone or the actual completion date. If any one machine delays its part of the job, the whole job gets delayed.
2. Reliability problem: What if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even share of data to work with? In other words, how do I divide the data equally so that no individual machine is overloaded or underutilized?
4. A single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance capability of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results generated by each of the machines to produce the final output.

These are the issues that I will have to take care of individually while performing parallel processing of huge data sets using the traditional approach.
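To make this concrete, here is a minimal sketch, in plain Java, of the traditional split-compute-combine approach for the temperature example. The class name, the toy readings and the thread-pool layout are my own illustration, not part of the original example:

import java.util.*;
import java.util.concurrent.*;

// A rough sketch of the "traditional" approach: split the readings into chunks,
// compute each chunk's maximum in parallel, then combine the partial results
// by hand. Stragglers, machine failures, uneven splits and aggregation all
// have to be handled manually in code like this.
public class TraditionalMaxTemperature {
    public static void main(String[] args) throws Exception {
        List<Double> readings = Arrays.asList(21.4, 35.2, 30.1, 28.9, 33.7, 25.0); // toy data
        int chunks = 3;
        int chunkSize = (int) Math.ceil(readings.size() / (double) chunks);

        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        List<Future<Double>> partials = new ArrayList<>();
        for (int i = 0; i < readings.size(); i += chunkSize) {
            final List<Double> chunk = readings.subList(i, Math.min(i + chunkSize, readings.size()));
            // Each chunk plays the role of "one machine" working on its block of data
            Callable<Double> task = () -> Collections.max(chunk);
            partials.add(pool.submit(task));
        }

        double max = Double.NEGATIVE_INFINITY;
        for (Future<Double> f : partials) {
            // Manual aggregation: a single failed or slow chunk stalls the whole job
            max = Math.max(max, f.get());
        }
        pool.shutdown();
        System.out.println("Highest temperature: " + max);
    }
}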

To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance. MapReduce therefore gives you the flexibility to write your code logic without caring about the design issues of the system.

MapReduce Tutorial: What is MapReduce?


MapReduce is a programming framework that allows us to perform distributed and parallel processing on large
data sets in a distributed environment.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output; this flow is summarized in key-value notation right below.
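In key-value notation, the overall flow is commonly summarized as follows (a generic schematic, not specific to any particular job):

(input) <k1, v1>  →  map  →  <k2, v2>  →  shuffle & sort  →  <k2, list(v2)>  →  reduce  →  <k3, v3>  (output)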

Let us understand more about MapReduce and its components. A MapReduce program majorly has the following three classes:

Mapper Class

The first stage in data processing using MapReduce is the Mapper class. Here, RecordReader processes each input record and generates the respective key-value pair. Hadoop stores this intermediate mapper output on the local disk.

 Input Split

It is the logical representation of data. It represents a block of work that is processed by a single map task in the MapReduce program.
 RecordReader

It interacts with the input split and converts the data it reads into key-value pairs, as illustrated below.
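For instance, with the default TextInputFormat, the RecordReader hands each map call the byte offset of a line as the key and the line itself as the value. For a file whose first two lines are "Dear Bear River" and "Car Car River", the generated pairs would look roughly like this (the offsets are illustrative):

(0, "Dear Bear River")
(16, "Car Car River")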

Reducer Class

The intermediate output generated from the mapper is fed to the reducer, which processes it and generates the final output that is then saved in HDFS.

Driver Class

The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and the respective job name.


MapReduce Tutorial: A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
 First, we divide the input into three splits. This will distribute the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
 After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1], etc.
 Now, each reducer counts the values which are present in that list of values. For instance, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in the list and gives the final output as – Bear, 2.
 Finally, all the output key-value pairs are collected and written to the output file. The complete flow for example.txt is traced below.
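Putting these steps together for example.txt, and assuming the three splits are "Dear Bear River", "Car Car River" and "Deer Car Bear" (one plausible way of splitting the sentence above), the flow looks like this:

Map:     Dear Bear River  →  (Dear,1) (Bear,1) (River,1)
         Car Car River    →  (Car,1) (Car,1) (River,1)
         Deer Car Bear    →  (Deer,1) (Car,1) (Bear,1)
Shuffle: Bear → [1,1]   Car → [1,1,1]   Dear → [1]   Deer → [1]   River → [1,1]
Reduce:  (Bear,2) (Car,3) (Dear,1) (Deer,1) (River,2)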

MapReduce Tutorial: Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

In MapReduce, we are dividing the job among multiple nodes, and each node works on a part of the job simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process it is reduced tremendously.

Fig.: Traditional Way vs. MapReduce Way – MapReduce Tutorial

2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in the MapReduce
Framework. In the traditional system, we used to bring data to the processing unit and process it. But, as the
data grew and became very huge, bringing this huge amount of data to the processing unit posed the following
issues:

 Moving huge amounts of data to the processing unit is costly and degrades network performance.
 Processing takes time, as the data is processed by a single unit which becomes the bottleneck.
 The master node can get overburdened and may fail.

Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data. The data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:

 It is very cost-effective to move the processing unit to the data.
 The processing time is reduced as all the nodes work on their part of the data in parallel.
 Every node gets a part of the data to process, so there is no chance of a node getting overburdened.

MapReduce Tutorial: MapReduce Example Program

Before jumping into the details, let us have a glance at a MapReduce example program to get a basic idea of how things practically work in a MapReduce environment. I have taken the same word count example, where I have to find the number of occurrences of each word. And don't worry if you don't understand the code when you look at it for the first time; just bear with me while I walk you through each part of the MapReduce code.
MapReduce Tutorial: Explanation of MapReduce Program

The entire MapReduce program can be fundamentally divided into three parts:

 Mapper Phase Code
 Reducer Phase Code
 Driver Code

We will understand the code for each of these three parts sequentially.

Mapper code:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}
 We have created a class Map that extends the class Mapper, which is already defined in the MapReduce framework.
 We define the data types of the input and output key-value pairs after the class declaration using angle brackets.
 Both the input and output of the Mapper are key-value pairs.
 Input:
 The key is nothing but the offset of each line in the text file: LongWritable
 The value is each individual line: Text
 Output:
 The key is the tokenized word: Text
 The value is the hardcoded count, which in our case is 1: IntWritable
 Example – Dear 1, Bear 1, etc.
 We have written Java code that tokenizes each word and assigns it a hardcoded value equal to 1.

Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

 We have created a class Reduce which extends the class Reducer, just like the Mapper.
 We define the data types of the input and output key-value pairs after the class declaration using angle brackets, as done for the Mapper.
 Both the input and the output of the Reducer are key-value pairs.
 Input:
 The key is nothing but one of the unique words generated after the sorting and shuffling phase: Text
 The value is the list of integers corresponding to that key: IntWritable
 Example – Bear, [1, 1], etc.
 Output:
 The key is one of the unique words present in the input text file: Text
 The value is the number of occurrences of that unique word: IntWritable
 Example – Bear, 2; Car, 3, etc.
 We have aggregated the values present in the list corresponding to each key and produced the final answer.
 In general, the reduce method is invoked once for each unique key; the number of reducer tasks can be configured separately, for example in mapred-site.xml or per job, as sketched below.
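For example, a minimal way to override the reducer count for this particular job from the driver code is shown below; the equivalent cluster-wide property is mapreduce.job.reduces in mapred-site.xml. This line is an optional addition, not part of the original program:

// Optional: run the word count with two reduce tasks instead of the default
job.setNumReduceTasks(2);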

Driver Code:

Configuration conf = new Configuration();
Job job = new Job(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

 In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
 We specify the name of the job and the data types of the input/output of the mapper and reducer.
 We also specify the names of the mapper and reducer classes.
 The paths of the input and output folders are also specified.
 The method setInputFormatClass() is used to specify how a Mapper will read the input data, i.e. what the unit of work will be. Here, we have chosen TextInputFormat so that the mapper reads a single line at a time from the input text file.
 The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job.

Source code:

package co.edureka.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path outputPath = new Path(args[1]);
        //Configuring the input/output path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        //deleting the output path automatically from hdfs so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath);
        //exiting the job only if the flag value becomes false
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run the MapReduce code:

The command for running the MapReduce code is:

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
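Once the job completes, the counts can be inspected directly from HDFS. Assuming the default single reducer, the output lands in a part file along these lines:

hadoop fs -cat /sample/output/part-r-00000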
Now, we will look into a use case based on the MapReduce algorithm.

Use case: KMeans Clustering using Hadoop’s MapReduce.

The KMeans algorithm is one of the simplest unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes.

Executing the KMeans algorithm using Python on a smaller dataset or a .csv file is easy. But when it comes to datasets at the level of Big Data, the normal procedure is no longer practical.

That is exactly when you deal with Big Data using Big Data tools, namely Hadoop's MapReduce. The following code snippets are the components of MapReduce performing the Mapper, Reducer and Driver jobs.
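As a quick illustration of what each MapReduce iteration computes (toy numbers, not from any particular dataset): suppose the points are 1, 2, 10 and 11 and the current centers are 2 and 10. The mappers emit (nearest center, point) pairs, namely (2, 1), (2, 2), (10, 10) and (10, 11), and each reducer recomputes its center as the mean of the points assigned to it, giving new centers 1.5 and 10.5. The driver then repeats the job until the centers stop moving; in the code below, it stops once every center changes by at most 0.1.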

//Mapper Class

// map() method of the Mapper class (the full class, including configure(), is in the complete source below)
public void map(LongWritable key, Text value, OutputCollector<DoubleWritable, DoubleWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    double point = Double.parseDouble(line);
    double min1, min2 = Double.MAX_VALUE, nearest_center = mCenters.get(0);
    // Find the center nearest to this point
    for (double c : mCenters) {
        min1 = c - point;
        if (Math.abs(min1) < Math.abs(min2)) {
            nearest_center = c;
            min2 = min1;
        }
    }
    // Emit the nearest center as the key and the point as the value
    output.collect(new DoubleWritable(nearest_center), new DoubleWritable(point));
}
//Reducer Class

public static class Reduce extends MapReduceBase implements Reducer<DoubleWritable, DoubleWritable, DoubleWritable, Text> {
    @Override
    public void reduce(DoubleWritable key, Iterator<DoubleWritable> values, OutputCollector<DoubleWritable, Text> output, Reporter reporter) throws IOException {
        double newCenter;
        double sum = 0;
        int no_elements = 0;
        String points = "";
        while (values.hasNext()) {
            double d = values.next().get();
            points = points + " " + Double.toString(d);
            sum = sum + d;
            ++no_elements;
        }
        // The new center is the mean of all points assigned to this center
        newCenter = sum / no_elements;
        // Emit the new center and the points belonging to it
        output.collect(new DoubleWritable(newCenter), new Text(points));
    }
}
//Driver Class

public static void run(String[] args) throws Exception {
    IN = args[0];
    OUT = args[1];
    String input = IN;
    String output = OUT + System.nanoTime();
    String again_input = output;
    int iteration = 0;
    boolean isdone = false;
    while (isdone == false) {
        JobConf conf = new JobConf(KMeans.class);
        if (iteration == 0) {
            // First iteration: seed the centers from the initial centroid file
            Path hdfsPath = new Path(input + CENTROID_FILE_NAME);
            DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
        } else {
            // Later iterations: reuse the centers produced by the previous job
            Path hdfsPath = new Path(again_input + OUTPUT_FILE_NAME);
            DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
        }
        conf.setJobName(JOB_NAME);
        conf.setMapOutputKeyClass(DoubleWritable.class);
        conf.setMapOutputValueClass(DoubleWritable.class);
        conf.setOutputKeyClass(DoubleWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(input + DATA_FILE_NAME));
        FileOutputFormat.setOutputPath(conf, new Path(output));
        JobClient.runJob(conf);
        // Read the newly computed centers from the job output
        Path ofile = new Path(output + OUTPUT_FILE_NAME);
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(ofile)));
        List<Double> centers_next = new ArrayList<Double>();
        String line = br.readLine();
        while (line != null) {
            String[] sp = line.split(SPLITTER);
            double c = Double.parseDouble(sp[0]);
            centers_next.add(c);
            line = br.readLine();
        }
        br.close();
        // Read the centers that were used by this iteration for comparison
        String prev;
        if (iteration == 0) {
            prev = input + CENTROID_FILE_NAME;
        } else {
            prev = again_input + OUTPUT_FILE_NAME;
        }
        Path prevfile = new Path(prev);
        FileSystem fs1 = FileSystem.get(new Configuration());
        BufferedReader br1 = new BufferedReader(new InputStreamReader(fs1.open(prevfile)));
        List<Double> centers_prev = new ArrayList<Double>();
        String l = br1.readLine();
        while (l != null) {
            String[] sp1 = l.split(SPLITTER);
            double d = Double.parseDouble(sp1[0]);
            centers_prev.add(d);
            l = br1.readLine();
        }
        br1.close();
        // Sort both lists so corresponding centers line up, then check convergence
        Collections.sort(centers_next);
        Collections.sort(centers_prev);
        Iterator<Double> it = centers_prev.iterator();
        for (double d : centers_next) {
            double temp = it.next();
            if (Math.abs(temp - d) <= 0.1) {
                isdone = true;
            } else {
                isdone = false;
                break;
            }
        }
        ++iteration;
        again_input = output;
        output = OUT + System.nanoTime();
    }
}
Now, we will go through the complete executable code.

//Source Code

import java.io.IOException;
import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.Reducer;

@SuppressWarnings("deprecation")
public class KMeans {
    public static String OUT = "outfile";
    public static String IN = "inputlarger";
    public static String CENTROID_FILE_NAME = "/centroid.txt";
    public static String OUTPUT_FILE_NAME = "/part-00000";
    public static String DATA_FILE_NAME = "/data.txt";
    public static String JOB_NAME = "KMeans";
    public static String SPLITTER = "\t| ";
    public static List<Double> mCenters = new ArrayList<Double>();

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, DoubleWritable, DoubleWritable> {
        @Override
        public void configure(JobConf job) {
            try {
                // Read the current centers from the file placed in the Distributed Cache
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                if (cacheFiles != null && cacheFiles.length > 0) {
                    String line;
                    mCenters.clear();
                    BufferedReader cacheReader = new BufferedReader(
                            new FileReader(cacheFiles[0].toString()));
                    try {
                        while ((line = cacheReader.readLine()) != null) {
                            String[] temp = line.split(SPLITTER);
                            mCenters.add(Double.parseDouble(temp[0]));
                        }
                    } finally {
                        cacheReader.close();
                    }
                }
            } catch (IOException e) {
                System.err.println("Exception reading DistributedCache: " + e);
            }
        }

        @Override
        public void map(LongWritable key, Text value, OutputCollector<DoubleWritable,
                DoubleWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            double point = Double.parseDouble(line);
            double min1, min2 = Double.MAX_VALUE, nearest_center = mCenters.get(0);
            // Find the center nearest to this point
            for (double c : mCenters) {
                min1 = c - point;
                if (Math.abs(min1) < Math.abs(min2)) {
                    nearest_center = c;
                    min2 = min1;
                }
            }
            // Emit the nearest center and the point
            output.collect(new DoubleWritable(nearest_center), new DoubleWritable(point));
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<DoubleWritable, DoubleWritable, DoubleWritable, Text> {
        @Override
        public void reduce(DoubleWritable key, Iterator<DoubleWritable> values,
                OutputCollector<DoubleWritable, Text> output, Reporter reporter) throws IOException {
            double newCenter;
            double sum = 0;
            int no_elements = 0;
            String points = "";
            while (values.hasNext()) {
                double d = values.next().get();
                points = points + " " + Double.toString(d);
                sum = sum + d;
                ++no_elements;
            }
            // The new center is the mean of the points assigned to it
            newCenter = sum / no_elements;
            // Emit the new center and the points belonging to it
            output.collect(new DoubleWritable(newCenter), new Text(points));
        }
    }

    public static void main(String[] args) throws Exception {
        run(args);
    }

    public static void run(String[] args) throws Exception {
        IN = args[0];
        OUT = args[1];
        String input = IN;
        String output = OUT + System.nanoTime();
        String again_input = output;
        int iteration = 0;
        boolean isdone = false;
        while (isdone == false) {
            JobConf conf = new JobConf(KMeans.class);
            if (iteration == 0) {
                // First iteration: seed the centers from the initial centroid file
                Path hdfsPath = new Path(input + CENTROID_FILE_NAME);
                DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
            } else {
                // Later iterations: reuse the centers produced by the previous job
                Path hdfsPath = new Path(again_input + OUTPUT_FILE_NAME);
                DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
            }
            conf.setJobName(JOB_NAME);
            conf.setMapOutputKeyClass(DoubleWritable.class);
            conf.setMapOutputValueClass(DoubleWritable.class);
            conf.setOutputKeyClass(DoubleWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(input + DATA_FILE_NAME));
            FileOutputFormat.setOutputPath(conf, new Path(output));
            JobClient.runJob(conf);
            // Read the newly computed centers from the job output
            Path ofile = new Path(output + OUTPUT_FILE_NAME);
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(ofile)));
            List<Double> centers_next = new ArrayList<Double>();
            String line = br.readLine();
            while (line != null) {
                String[] sp = line.split(SPLITTER);
                double c = Double.parseDouble(sp[0]);
                centers_next.add(c);
                line = br.readLine();
            }
            br.close();
            // Read the centers that were used by this iteration for comparison
            String prev;
            if (iteration == 0) {
                prev = input + CENTROID_FILE_NAME;
            } else {
                prev = again_input + OUTPUT_FILE_NAME;
            }
            Path prevfile = new Path(prev);
            FileSystem fs1 = FileSystem.get(new Configuration());
            BufferedReader br1 = new BufferedReader(new InputStreamReader(fs1.open(prevfile)));
            List<Double> centers_prev = new ArrayList<Double>();
            String l = br1.readLine();
            while (l != null) {
                String[] sp1 = l.split(SPLITTER);
                double d = Double.parseDouble(sp1[0]);
                centers_prev.add(d);
                l = br1.readLine();
            }
            br1.close();
            // Sort both lists so corresponding centers line up, then check convergence
            Collections.sort(centers_next);
            Collections.sort(centers_prev);
            Iterator<Double> it = centers_prev.iterator();
            for (double d : centers_next) {
                double temp = it.next();
                if (Math.abs(temp - d) <= 0.1) {
                    isdone = true;
                } else {
                    isdone = false;
                    break;
                }
            }
            ++iteration;
            again_input = output;
            output = OUT + System.nanoTime();
        }
    }
}
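The iterative job above can be launched in the same way as the word count example. Assuming the input directory contains the data.txt and centroid.txt files expected by the constants in the code, and using placeholder jar and path names, the command would look something like:

hadoop jar kmeans.jar KMeans /kmeans/input /kmeans/output

Here args[0] is the directory holding the data and initial centroids, and args[1] is used as a prefix for the per-iteration output directories.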