Word Count Exercise Instructions
Assignment Instructions
Instructions:
Software required:
1. PuTTY: http://www.putty.org/
2. WinSCP: https://winscp.net/eng/download.php
3. Oracle VirtualBox: https://www.virtualbox.org/
4. Cloudera: http://www.cloudera.com/downloads/quickstart_vms/5-8.html
For a detailed description of how to install Cloudera, watch this video:
https://www.youtube.com/watch?v=L0lPPC5qeyU
By default, Cloudera comes with Eclipse and the Hadoop packages installed, which can be used
to write MapReduce programs. Cloudera runs a single-node cluster. Use Cloudera to test your
code on small inputs. For large inputs, use the DSBA cluster. Once you are confident that your
code works correctly, run it on the cluster.
******************************************************************************
To log in to the DSBA Hadoop cluster, follow the instructions below:
For example:
4. To move files from the client to the cluster, use the following command:
hadoop fs -put path-of-the-file-in-client path-of-the-destination-folder-in-cluster
NOTE: The destination folder should not exist before running this command. To
remove a folder, use the following command:
hadoop fs -rm -r path-of-the-folder
6. To get the output folder back to the client, use the following command:
7. Repeat steps 4-6 for the Mammals book text file and return all lines of text that contain the
word “mammal”. Download the Mammals book text file here:
http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt
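Before running the job on the cluster, the filtering logic of step 7 can be checked locally on a handful of lines. The following is a minimal sketch in plain Python, not Hadoop code. The function name `lines_containing` and the sample sentences are hypothetical, and the case-insensitive substring match is an assumption, since the exercise does not say whether matching should be case-sensitive or whole-word.

```python
def lines_containing(lines, word="mammal", ignore_case=True):
    """Yield every input line that contains `word` as a substring."""
    needle = word.lower() if ignore_case else word
    for line in lines:
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            # Strip only the trailing newline so the line text is preserved.
            yield line.rstrip("\n")


# Hypothetical sample input, standing in for a few lines of the book.
sample = [
    "Mammals are warm-blooded.\n",
    "Birds lay eggs.\n",
    "The largest mammal is the blue whale.\n",
]
matches = list(lines_containing(sample))
```

With the default settings both the "Mammals" and the "mammal" lines are kept. Passing `ignore_case=False` keeps only the line with the lowercase word, which is a quick way to see how the matching rule changes the output.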
To understand how MapReduce works, see the following link, which includes an example:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html - for Hadoop version 1.0
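As a rough illustration of the MapReduce model, the sketch below simulates the three phases of a word count (map, shuffle, reduce) on an in-memory list. It is plain Python rather than Hadoop code. The function names and the whitespace tokenization are assumptions made for this example and are not taken from the tutorial.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in order on a small in-memory input."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["hello hadoop", "hello world"])
```

In a real job the framework performs the shuffle between the mapper and the reducer. It is written out here only to make the grouping step visible.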
All basic HDFS commands can be found here:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
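The upload in step 4 and the NOTE about the destination folder can be combined into one small helper, sketched below in Python. It relies only on `hadoop fs -test -e`, `hadoop fs -rm -r` and `hadoop fs -put`. The function names and the paths are hypothetical, and the sketch assumes the `hadoop` command is on the PATH of the machine where it runs.

```python
import subprocess


def hdfs_exists(path):
    """Return True if `path` exists in HDFS (`hadoop fs -test -e` exits with 0)."""
    return subprocess.run(["hadoop", "fs", "-test", "-e", path]).returncode == 0


def upload_commands(local_path, hdfs_dest, dest_exists):
    """Build the command sequence for a clean upload without running it."""
    commands = []
    if dest_exists:
        # The destination must not exist before -put, so remove it first.
        commands.append(["hadoop", "fs", "-rm", "-r", hdfs_dest])
    commands.append(["hadoop", "fs", "-put", local_path, hdfs_dest])
    return commands


def upload(local_path, hdfs_dest):
    """Run the commands in order and stop at the first failure."""
    for command in upload_commands(local_path, hdfs_dest, hdfs_exists(hdfs_dest)):
        subprocess.run(command, check=True)
```

A call such as `upload("book.txt", "/user/yourname/input")` (both paths are placeholders) would delete an existing destination before uploading, so it should only be pointed at folders that are safe to remove.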
Submit all your source code, all your outputs and output files, and a comparison write-up for
TASK-2 and TASK-3. We need the following outputs:
1. GREP command output of ListOfActionRules file
2. GREP command output of Mammals book
3. WordCount v2.0
4. Modified WordCount