Running A Mapreduce Program On Cloudera Quickstart VM: Requirements
Requirements:
1. Cloudera QuickStart VM must be installed and running on the system.
1. Start your virtual machine and open Eclipse, which comes pre-installed with the Cloudera QuickStart VM.
2. Right-click in the Package Explorer window and select New Java Project. See Figure 1.
3. Enter the name of the project in the Project Name field and click Next. See Figure 2.
Figure 2: Creating Java Project
4. Click the “Libraries” tab and select the Add External JARs option. See Figure 3.
Figure 3: Creating Java Project
5. The external jars which need to be added are present in the following directories:
/usr/lib/hadoop/client-0.20 and
/usr/lib/hadoop/lib
6. Select the client-0.20 directory inside the hadoop directory. Many jar files are present in this directory. Not all of them are required, but to be on the safe side it is advisable to add all the jar files to your project. See Figure 4.
Figure 4: Creating Java Project
7. Find and select the jar file named commons-httpclient-3.1.jar. See Figure 5.
Figure 5: Creating Java Project
8. All the jar files required to run your MapReduce program have now been added to the project. Click Finish to complete the process.
9. A new project with the name specified in the Project Name field is created and appears in the “Package Explorer” window. Now it is time to link the Java classes to this project.
1. Open a source file and find the package name at the top of the code.
2. In the Package Explorer section of Eclipse, go to your project, right-click on src, go to New and select Package.
3. In the Name field of the “New Java Package” window, enter the same package name as the one written in the source code.
4. Next, add the source files to the package by selecting them all and dragging and dropping them into the newly created package.
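For example, assuming the package created above was named wordcount (a hypothetical name), every source file dropped into it must begin with a matching package declaration. A minimal sketch:

```java
// Hypothetical example: "wordcount" is an assumed package name; it must match
// the Eclipse package the file is placed in under src, or Eclipse reports an
// error.
package wordcount;

public class WordCountDriver {
    public static void main(String[] args) {
        // Eclipse refers to the class by its fully qualified name,
        // i.e. <package name>.<class name>.
        System.out.println(WordCountDriver.class.getName());
    }
}
```

Running this class prints the fully qualified name, the same form that appears later in the run configuration.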
Creating new source files if source files are not available: right-click the newly created package, go to New and select Class, then type the source code into the file that Eclipse creates.
Once the project is created and the source files are added to it, you are ready to run the program in standalone mode. Follow these steps:
1. Right-click on the project, select “Run As” → “Run Configurations…”, and click the “New launch configuration” button in the upper left corner. See Figure 6 and Figure 7.
2. Enter the name of the class containing the main function in the “Main Class” field. Click the Search button and select the main class. It is displayed in the format <driver class> - <package name>. See Figure 8. Once you select the correct option, <package name>.<class name> appears in the “Main Class” field. See Figure 9.
Figure 8: Running MapReduce
Figure 9: Running MapReduce
3. If the program takes its input as arguments, switch to the “Arguments” tab and enter the required arguments in the correct sequence.
4. Ensure that the input file exists in the package folder inside the “workspace”
folder.
5. Click Run.
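To see how the values typed into the “Arguments” tab reach the program, here is a minimal, Hadoop-free sketch; the default file names used here are hypothetical:

```java
// Minimal sketch (no Hadoop dependencies) of how the space-separated values
// entered in the "Arguments" tab arrive in main(); the default file names
// below are hypothetical.
public class ArgsDemo {
    static String describe(String[] args) {
        // In a typical MapReduce driver, args[0] is the input path and
        // args[1] is the output path, in the order entered in the tab.
        String input = args.length > 0 ? args[0] : "input.txt";
        String output = args.length > 1 ? args[1] : "output";
        return "input=" + input + ", output=" + output;
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

In a real driver these two values would be handed on to the job's input and output path settings, which is why their order in the Arguments tab matters.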
1. The class files obtained after compilation need to be packaged into a jar file before the program can be run in pseudo-distributed mode. Right-click on the project and select Export. See Figure 10.
2. Specify the export destination where the jar file is to be stored, together with the name of the jar file, in the “Select the export destination:” field. There is no constraint on the choice of export destination or the name of the jar file. Click Next. See Figure 12.
Figure 12: Running MapReduce in Standalone mode
3. Click Finish. Compilation may take some time. When the process is complete, a dialog box appears indicating that the jar export has finished. Click OK. The jar file with the specified name is created in the chosen folder. See Figure 13.
Figure 13: Running MapReduce in standalone mode
4. If your program requires an input file, the file should be stored on HDFS (Hadoop Distributed File System). The command for copying a file from the local file system to HDFS is as follows:
hdfs dfs -put <input file with path> <path in HDFS>
5. Go to the directory where the jar file is stored, using the following command:
cd <path>
6. Run the jar file stored in the folder using the following command. Before doing so, ensure that your jar file contains the Driver, Mapper and Reducer class files.
hadoop jar <jar_file_name> <driver_class_path> <arguments>
The arguments are typically the path of the input file stored on HDFS and the path of a directory on HDFS where the output is to be stored. The output files generated by the program will also be stored on HDFS. Ensure that no directory with the name specified for the output folder exists before running the jar file; otherwise the job fails.
7. After completion of the MapReduce job, check the output files generated in the specified directory on HDFS using read commands such as hdfs dfs -cat.
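The command-line steps above can be sketched as one shell sequence. Every name below is hypothetical (wordcount.jar, wordcount.WordCountDriver, the /user/cloudera paths and the workspace folder). The hdfs and hadoop commands only work on the QuickStart VM, so they are shown commented, with the assembled job command echoed at the end:

```shell
# All names here are hypothetical: replace the jar, driver class and HDFS
# paths with the ones from your own project.
JAR=wordcount.jar
DRIVER=wordcount.WordCountDriver
INPUT=/user/cloudera/input.txt
OUTPUT=/user/cloudera/wc_output

# On the QuickStart VM the full sequence would be:
#   hdfs dfs -put input.txt "$INPUT"                # copy the input to HDFS
#   cd ~/workspace/WordCount                        # go to the jar's folder
#   hadoop jar "$JAR" "$DRIVER" "$INPUT" "$OUTPUT"  # run the job
#   hdfs dfs -cat "$OUTPUT/part-r-00000"            # read the generated output
echo "hadoop jar $JAR $DRIVER $INPUT $OUTPUT"
```

Note that the output directory named in OUTPUT must not already exist on HDFS when the job is launched.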