Lab 4 - Installation of Hadoop and MapReduce WordCount Example
CMPN451 Big Data Analytics Lab 4 - MapReduce with Hadoop
ssh localhost
exit
3. Download Hadoop by navigating to the following link and downloading the tar.gz
file for Hadoop version 3.3.0 (or a later version if you wish). The file is about 478 MB.
https://hadoop.apache.org/release/3.3.0.html
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_space/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_space/hdfs/datanode</value>
</property>
</property>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
14. Now, run the following commands on the terminal to create the Hadoop space
directory and, inside it, the NameNode and DataNode directories.
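A minimal sketch of this step, using the directory paths that appear in the dfs.namenode.name.dir and dfs.datanode.data.dir values above. The helper name make_hadoop_dirs is hypothetical, and the sudo and chown lines in the usage comment are an assumption about a typical machine where /usr/local is owned by root.

```shell
# Hypothetical helper (not part of Hadoop): creates the NameNode and DataNode
# directories under the base directory given as the first argument.
make_hadoop_dirs() {
    base="$1"
    mkdir -p "$base/hdfs/namenode" "$base/hdfs/datanode"
}

# Usage on the lab machine. Assumption: /usr/local is root-owned, so the base
# directory is created with sudo and handed to the current user first.
#   sudo mkdir -p /usr/local/hadoop_space
#   sudo chown -R "$USER" /usr/local/hadoop_space
#   make_hadoop_dirs /usr/local/hadoop_space
```

The -p flag creates any missing parent directories and does not complain if a directory already exists, so the commands can be repeated safely.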
16. Before starting the Hadoop Distributed File System (HDFS), make sure that
the rcmd type reported by the following command is “ssh”, not “rsh”:
pdsh -q -w localhost
17. If the rcmd type is “rsh” as in the above figure, type the following
commands. If it is already “ssh”, skip this step.
export PDSH_RCMD_TYPE=ssh
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Then run Step 16 again to check that the rcmd type is now ssh.
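The check can also be reduced to a single word instead of reading the full pdsh listing. The sketch below assumes the listing contains a line that starts with "Rcmd type" followed by the value, which is how pdsh usually prints it; the helper name rcmd_type is hypothetical.

```shell
# Hypothetical helper: reads `pdsh -q` output on stdin and prints the last
# field of the line that starts with "Rcmd type" (expected: ssh or rsh).
rcmd_type() {
    awk 'tolower($0) ~ /^rcmd type/ { print $NF }'
}

# Usage:
#   pdsh -q -w localhost | rcmd_type
```

Note that export only affects the current terminal session. Appending the line export PDSH_RCMD_TYPE=ssh to ~/.bashrc would keep the setting for new terminals, assuming bash is the shell in use.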
start-yarn.sh
20. Type the following command. You should see an output similar to the one
in the following figure.
jps
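As a rough aid for reading the jps listing, the sketch below reports which daemons are missing. It assumes a single-node setup in which both HDFS and YARN were started, so the expected processes are NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The helper name check_daemons is hypothetical.

```shell
# Hypothetical helper: reads `jps` output on stdin and reports which of the
# daemons expected on a single-node setup are missing.
check_daemons() {
    listing="$(cat)"
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
        if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   jps | check_daemons && echo "all daemons are running"
```

If a daemon is reported as missing, the log files under the logs directory of the Hadoop installation are usually the first place to look.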
21. Go to localhost:9870 from the browser. You should expect the following
2. Create a directory on the Desktop named Lab and inside it create two folders;
one called “Input” and the other called “tutorial_classes”.
[You can do this step using GUI normally or through terminal commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
3. Add the file “WordCount.java” attached with this document to the directory Lab.
4. Add the file “input.txt” attached with this document to the directory Lab/Input.
5. Type the following command to export the hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
6. It is time to create these directories on HDFS rather than locally. Type the
following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
7. Go to localhost:9870 from the browser, open “Utilities → Browse File System”,
and you should see the directories and files we placed in the file system.
8. Now go back to the local machine to compile the WordCount.java file,
assuming we are currently in the Desktop directory.
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (there is a dot at the end of the command).
jar -cvf WordCount.jar -C tutorial_classes .
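As an optional sanity check, the word counts can be estimated locally with standard Unix tools and compared with the MapReduce output later. This is only a sketch: it assumes the attached WordCount.java splits lines on whitespace and treats words as case-sensitive, as the standard Hadoop example does. The helper name local_wordcount is hypothetical.

```shell
# Hypothetical helper: counts whitespace-separated words in the file given as
# the first argument and prints "word<TAB>count" lines sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |   # "map": put one word on each line
        grep -v '^$' |                # drop the empty line left by leading whitespace
        LC_ALL=C sort |               # "shuffle": bring identical words together
        uniq -c |                     # "reduce": count each group
        awk '{ print $2 "\t" $1 }'
}

# Usage, from inside the Lab directory:
#   local_wordcount Input/input.txt
```

The three pipeline stages mirror the map, shuffle and reduce phases of the job, which is why the result should match as long as the tokenization assumption holds.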
Requirement:
The data science team at Vodafone is analyzing customer data, which includes
the customers’ personal information, the prepaid cards they purchased, and the
timestamps at which they registered the prepaid amounts on their Vodafone accounts,
among other information.
The details of the customers are omitted, and you are only provided with a file “in.csv”,
which includes two columns:
1. Customer ID. (Each ID maps to a certain customer, whose data is hidden for
confidentiality.)
2. Prepaid Card Amount.
Disclaimer: Thanks to the Vodafone DS team, who provided us with this real customer
data.