Word Count Exercise Instructions

The document provides instructions on logging into a Hadoop cluster and running basic commands like GREP to process text files. It also describes tasks for running the WordCount and a modified version of WordCount MapReduce programs.

The steps described are to use FTP/SFTP to copy files to and from the cluster, use Putty to log into the cluster and run commands, and examples of GREP commands to search text files for words. It also provides commands for moving files between the local system and HDFS.

Task 2 describes running the WordCount MapReduce program. The basic steps are to transfer the input file to HDFS, run the MapReduce job using the jar file while specifying the input, output and jar file paths, and then transfer the output back from HDFS.

ITCS 6161/ITCS 8162: Knowledge Discovery in Databases

Assignment Instructions

Instructions:

Software required:
1. Putty: http://www.putty.org/
2. WinSCP: https://winscp.net/eng/download.php
3. Oracle Virtual Box: https://www.virtualbox.org/
4. Cloudera: http://www.cloudera.com/downloads/quickstart_vms/5-8.html
For a detailed description of how to install Cloudera, watch this video:
https://www.youtube.com/watch?v=L0lPPC5qeyU
By default, the Cloudera VM comes with Eclipse and the Hadoop packages installed, which you can
use to write MapReduce programs. Cloudera runs a single-node cluster. Use Cloudera to test your
code on small inputs. For large inputs, use the DSBA cluster. Once you are confident that your code
works correctly, run it on the cluster.

******************************************************************************
To log in to the DSBA Hadoop cluster, follow the instructions below:

TASK – 1: Logging into Hadoop cluster and running simple commands


1. Log in to the Hadoop cluster via an FTP client (in order to copy and paste data and to view files):

Open your FTP client (WinSCP).
Choose Session | New Session.
File protocol: SFTP
Host name: dsba-hadoop.uncc.edu
Type your user name and password.
Click Save and check the Save Password checkbox.

2. Log in to dsba-hadoop.uncc.edu via PuTTY (in order to run commands).


3. Run a sample text-processing job on ListOfInputActionRules using the GREP command.
ListOfInputActionRules is a text file containing one action rule per line.

For example:

(a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
(a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
(a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
(a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]

4. To move files from the client to the cluster, use the following command:
hadoop fs -put path-of-the-file-in-client path-of-the-destination-folder-in-cluster

5. Run the following GREP command on ListOfInputActionRules to return all lines of text
(action rules) that contain the string 'a1':

hadoop org.apache.hadoop.examples.Grep input-path-of-ListOfInputActionRules-file path-of-destination-folder ".*a1.*"

NOTE: The destination folder must not exist before you run this command. To
remove a folder, use the following command:
hadoop fs -rm -r path-of-the-folder
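
Before submitting the job, it can help to preview the pattern locally. The sketch below is a sanity check under stated assumptions, not part of the assignment: it writes the four sample rules from step 3 to a hypothetical file named rules_sample.txt and runs the ordinary Unix grep with the same regular expression. The Hadoop example job formats its output in its own way, so use this only to confirm which lines the pattern should match.

```shell
# Hypothetical local file holding the four sample action rules from step 3.
cat > rules_sample.txt <<'EOF'
(a, a1->a2) ^ (c = c2) -> (f, f1->f0) [2, 50%]
(a, a1->a3) ^ (b, ->b1) -> (f, f1->f0) [3, 75%]
(a, a1->a3) ^ (c = c2) -> (f, f1->f0) [1, 80%]
(a, ->a3) ^ (b, b2->b1) -> (f, f1->f0) [3, 50%]
EOF

# Same regular expression as in the Hadoop Grep command, run with local grep.
matched_lines=$(grep '.*a1.*' rules_sample.txt)
matched_count=$(grep -c '.*a1.*' rules_sample.txt)

printf '%s\n' "$matched_lines"
echo "lines matched: $matched_count"
```

Only the first three sample rules contain a1, so the fourth rule is not printed.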

6. To get the output folder back to the client, use the following command:

hadoop fs -get path-of-the-output-folder-in-cluster path-of-the-folder-in-client
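
Steps 4 to 6 can be chained into one small script. The following is a minimal sketch, not a prescribed workflow: every path (ListOfInputActionRules.txt, grep_input, grep_output, grep_output_local) is a hypothetical placeholder, and the run helper and DRY_RUN switch are illustrative additions. By default the script only prints the commands; set DRY_RUN=0 on the cluster to execute them.

```shell
#!/bin/sh
# Hypothetical paths; replace them with your own.
LOCAL_INPUT="ListOfInputActionRules.txt"   # file on the client
HDFS_INPUT="grep_input"                    # input folder in HDFS
HDFS_OUTPUT="grep_output"                  # output folder in HDFS (must not exist yet)
LOCAL_OUTPUT="grep_output_local"           # folder on the client for the results

# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 also executes it.
DRY_RUN=${DRY_RUN:-1}

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Step 4: copy the input file from the client into HDFS.
run hadoop fs -mkdir -p "$HDFS_INPUT"
run hadoop fs -put "$LOCAL_INPUT" "$HDFS_INPUT"

# Step 5: remove any stale output folder, then run the example Grep job.
run hadoop fs -rm -r -f "$HDFS_OUTPUT"
run hadoop org.apache.hadoop.examples.Grep "$HDFS_INPUT" "$HDFS_OUTPUT" ".*a1.*"

# Step 6: copy the output folder from HDFS back to the client.
run hadoop fs -get "$HDFS_OUTPUT" "$LOCAL_OUTPUT"
```

Printing each command before running it makes it easy to confirm the input and output paths before anything is deleted or overwritten.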

7. Repeat steps 4-6 for the Mammals book text file and return all lines of text that contain the
word “mammal”. Download the Mammals book text file here:

http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt

For TASK-2 and TASK-3, use the Mammals book as the input file.

TASK – 2: Running WordCount


Read the "MapReduce Tutorial" from
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Basic procedure to follow when executing a MapReduce program on a Hadoop cluster:
1. Transfer the inputs from the local system to HDFS.
2. The JAR file can reside on the FTP client side (i.e., in WinSCP).
3. The output of the MapReduce program is written to HDFS, from where it can be transferred back to
the local system.

To understand how MapReduce works, see the following link, which includes an example:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (for Hadoop version 1.0)
All basic HDFS commands can be found here:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html

1. Create a new Java project in Cloudera Eclipse.
2. Cloudera Eclipse contains a sample MapReduce project, which includes all the
required MapReduce .jar files. Import all those .jar files into your project.
3. Copy WordCount v2.0 from https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
into your project.
4. Export your project as a .jar file.
5. Run the .jar file on the cluster and produce the output.
6. Use the following command to run the .jar file:
hadoop jar path-of-the-jar-file path-of-input-folder path-of-output-folder
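
To check that a WordCount run behaves as expected, you can compute the counts for a tiny input locally and compare them with the job's output for the same input. The sketch below is illustrative only: sample_input.txt and its contents are made up, and it assumes that words are split on whitespace and that case is preserved, which you should verify against the WordCount v2.0 source you copied.

```shell
# Hypothetical one-line input file for a quick local check.
printf 'The mammal saw the other mammal\n' > sample_input.txt

# Count every whitespace-separated token and print "word<TAB>count", sorted by word.
local_counts=$(awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
                    END { for (w in count) print w "\t" count[w] }' sample_input.txt |
               LC_ALL=C sort)

printf '%s\n' "$local_counts"
```

Here "The" and "the" are counted separately because the comparison is case-sensitive, and "mammal" is counted twice.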

TASK – 3: Running modified version of WordCount

1. Download the .jar file from
https://github.com/Keval17/Hadoop-Map-Reduce-with-Modified-Map-function-for-efficient-word-counts
2. Save it on the client.
3. Run it and produce the output.

TASK – 4: Write-up comparing the results of TASK-2 and TASK-3

Submit all your source code, all your outputs and output files, and a write-up comparing
TASK-2 and TASK-3. We need the following outputs:
1. GREP command output for the ListOfInputActionRules file
2. GREP command output for the Mammals book
3. WordCount v2.0 output
4. Modified WordCount output
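
For the comparison write-up, one possible starting point is to list the words whose counts differ between the two runs. The sketch below uses made-up sample data and hypothetical file names (wordcount_v2.tsv, wordcount_modified.tsv), and it assumes each line has the form word, tab, count. In practice you would point it at the part files inside the output folders you fetched with hadoop fs -get.

```shell
# Made-up sample outputs standing in for the two real result files.
printf 'mammal\t2\nother\t1\nthe\t3\n' > wordcount_v2.tsv
printf 'mammal\t2\nother\t1\nthe\t4\n' > wordcount_modified.tsv

# Load the first file into a table, then print every word from the second file
# whose count differs: word<TAB>count-in-first<TAB>count-in-second.
# Words that appear only in the first file are not reported by this sketch.
differences=$(awk -F'\t' '
  NR == FNR { first[$1] = $2; next }
  first[$1] != $2 { print $1 "\t" first[$1] "\t" $2 }
' wordcount_v2.tsv wordcount_modified.tsv)

printf '%s\n' "$differences"
```

An empty result means that every word in the second file has the same count in the first.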
