Lab Manual
Course: Big Data Analytics Lab
Scheme: 2017
List of Experiments
1. Perform Hadoop setup in Local and Pseudo (pseudo-distributed) mode and monitor it through the web-based UI.
6. Write Pig Latin scripts using the DESCRIBE, FOREACH and ORDER BY operators.
Video Tutorials
https://www.youtube.com/channel/UC_6mhzMATOtsC1UXO0sHpwA
Pseudo mode
Step Details
1. Prerequisites: a) VMware b) Ubuntu 18.04 c) JDK 8 d) Hadoop 2.10.0
2. Open Terminal and type in the following command
sudo apt-get install openjdk-8-jdk
34. cd /home/hduser
35. sudo mkdir -p hadoop_tmp/hdfs/namenode
36. sudo mkdir -p hadoop_tmp/hdfs/datanode
37. sudo chmod 777 -R hadoop_tmp/hdfs/namenode
38. sudo chmod 777 -R hadoop_tmp/hdfs/datanode
39. sudo chown -R hduser hadoop_tmp/hdfs/datanode
40. hdfs namenode -format
41. start-dfs.sh
42. start-yarn.sh
43. jps
The jps command should list the running Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager); the web UIs for monitoring them are listed after these steps.
44. To stop all hadoop daemon services, use the following command
stop-dfs.sh
stop-yarn.sh
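While the daemons are running (that is, before step 44), the cluster can also be monitored through the web-based UI. Assuming the default Hadoop 2.x ports have not been changed, open a browser at:
NameNode web UI: http://localhost:50070
ResourceManager web UI: http://localhost:8088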
Delete a directory
hadoop fs -rm -r URI
Example: hadoop fs -rm -r /user/hadoop/dir1
Removes the given directory and its contents recursively.
Move files
hadoop fs -mv URI [URI ...] <dest>
Example: hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
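A short example session tying the two commands together; the paths are placeholders used only for illustration. Two files are moved into a directory, which is then removed recursively:
hadoop fs -mkdir -p /user/hadoop/dir1
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir1
hadoop fs -rm -r /user/hadoop/dir1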
Step Details
1. Prerequisites:
a) VMware or VirtualBox b) Cloudera (CDH5)
2. In Eclipse: File -> New -> Java Project -> Project Name: WordCount -> Libraries ->
Add External JARs (add the Hadoop client JARs)
3. Open Terminal
cat > /home/cloudera/inputFile.txt
(enter a few words, then press Ctrl+D to save the file)
4. hdfs dfs -mkdir /inputnew
hdfs dfs -put /home/cloudera/inputFile.txt /inputnew/
5. hdfs dfs -cat /inputnew/inputFile.txt
Source code (WordCount.java):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input file and output directory are taken from the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
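To run the job on the input created above, export the project from Eclipse as a JAR file (the next experiment assumes it is saved as /home/cloudera/Wordcount.jar) and submit it; the output directory name /outputnew below is only an example:
hadoop jar /home/cloudera/Wordcount.jar WordCount /inputnew/inputFile.txt /outputnew
hdfs dfs -cat /outputnew/part-r-00000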
Step Details
1. Prerequisites:
a) VMware or VirtualBox b) Cloudera (CDH5)
2. Download the Gutenberg dataset and paste it into the gutenbergdata folder (a wget command is shown after these steps):
http://www.gutenberg.org/cache/epub/4300/pg4300.txt
3. Follow the same steps as in the WordCount MapReduce program above.
4. Open Terminal
5. Type the command:
hdfs dfs -mkdir /guteninput
6. hdfs dfs -put /home/cloudera/gutenbergdata/pg4300.txt /guteninput/
7. hadoop jar /home/cloudera/Wordcount.jar WordCount
/guteninput/pg4300.txt /gutenoutput
8. hdfs dfs -cat /gutenoutput/part-r-00000
9. You can also use the command hdfs dfs -cat /gutenoutput/*
instead of the command in step 8.
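For step 2, the dataset can be fetched directly from the terminal, assuming the VM has network access and wget installed:
mkdir -p /home/cloudera/gutenbergdata
wget -P /home/cloudera/gutenbergdata http://www.gutenberg.org/cache/epub/4300/pg4300.txt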
Source code: the same WordCount program listed in the previous experiment.
Step Details
1. Prerequisites:
a) VMware or VirtualBox b) Cloudera (CDH5)
2. Download the dataset (save it in the weatherdata folder) and the jar file:
https://drive.google.com/file/d/0B-ur4R5mlgGLcVRZMTZGekRpZWM/view
https://drive.google.com/file/d/0B-ur4R5mlgGLMzVyTmdITTVmbjA/view
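The source code below assumes each line of the dataset holds one day of readings for one station: a station-prefixed date (the reducer checks the first two characters, e.g. "CA"), followed by alternating time and temperature columns, all tab-separated. A made-up line purely for illustration, not taken from the actual dataset:
CA_25-Jan-2014  00:12:345  15.7  01:19:345  23.1  02:34:542  12.3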
Source code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
// Mapper (a static nested class of the driver): one input line = one day of readings for one station
public static class WhetherForcastMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object keyOffset, Text dayReport, Context con)
            throws IOException, InterruptedException {
        // Columns are tab-separated: the date first, then alternating time and temperature values
        StringTokenizer strTokens = new StringTokenizer(dayReport.toString(), "\t");
        int counter = 0;
        float minTemp = Float.MAX_VALUE, maxTemp = -Float.MAX_VALUE;
        String date = null, currentTime = null;
        String minTempANDTime = null, maxTempANDTime = null;
        while (strTokens.hasMoreElements()) {
            if (counter == 0) {
                date = strTokens.nextToken();            // first column: date
            } else if (counter % 2 == 1) {
                currentTime = strTokens.nextToken();     // odd columns: time of reading
            } else {
                float currnetTemp = Float.parseFloat(strTokens.nextToken());
                if (currnetTemp < minTemp) {             // coldest reading so far and its time
                    minTemp = currnetTemp;
                    minTempANDTime = minTemp + "AND" + currentTime;
                }
                if (currnetTemp > maxTemp) {             // hottest reading so far and its time
                    maxTemp = currnetTemp;
                    maxTempANDTime = maxTemp + "AND" + currentTime;
                }
            }
            counter++;
        }
        // Emit two records per date: (date, maxTempANDTime) and (date, minTempANDTime)
        Text temp = new Text();
        Text dateText = new Text();
        try {
            temp.set(maxTempANDTime);
            dateText.set(date);
            con.write(dateText, temp);
            temp.set(minTempANDTime);
            con.write(dateText, temp);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
public static class WhetherForcastReducer extends Reducer<Text, Text, Text, Text> {
    MultipleOutputs<Text, Text> mos;

    public void setup(Context context) {
        // One reducer writes to several named outputs, one per station
        mos = new MultipleOutputs<Text, Text>(context);
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int counter = 0;
        String[] reducerInputStr = null;
        String f1 = null, f1Time = null;   // first value: max temperature AND its time
        String f2 = null, f2Time = null;   // second value: min temperature AND its time
        String fileName = null;
        for (Text value : values) {
            if (counter == 0) {
                reducerInputStr = value.toString().split("AND");
                f1 = reducerInputStr[0];
                f1Time = reducerInputStr[1];
            } else {
                reducerInputStr = value.toString().split("AND");
                f2 = reducerInputStr[0];
                f2Time = reducerInputStr[1];
            }
            counter = counter + 1;
        }
        // The named output is selected from the station prefix of the key; the "CA" branch
        // is shown, and analogous else-if branches assign nyOutputName, njOutputName,
        // ausOutputName, bosOutputName and balOutputName.
        if (key.toString().substring(0, 2).equals("CA")) {
            fileName = CalculateMaxAndMinTemeratureTime.calOutputName;
        }
        // The record format written here is illustrative
        mos.write(fileName, key, new Text("MinTemp: " + f2 + " at " + f2Time
                + "\tMaxTemp: " + f1 + " at " + f1Time));
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
public static void main(String[] args) throws IOException,
        ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);
    job.setMapperClass(WhetherForcastMapper.class);
    job.setReducerClass(WhetherForcastReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Register one named output per station (calOutputName ... balOutputName are
    // String constants holding the per-station output names)
    MultipleOutputs.addNamedOutput(job, calOutputName, TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, nyOutputName, TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, njOutputName, TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, bosOutputName, TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, ausOutputName, TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, balOutputName, TextOutputFormat.class, Text.class, Text.class);
    // Input file and output directory on HDFS
    Path pathInput = new Path(
            "hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
    Path pathOutputDir = new Path(
            "hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
    FileInputFormat.addInputPath(job, pathInput);
    FileOutputFormat.setOutputPath(job, pathOutputDir);
    try {
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
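To run the job, submit the downloaded jar with the driver class name; the jar path below is a placeholder for wherever the downloaded jar was saved, and the output location is the one hard-coded in the driver:
hadoop jar /home/cloudera/weatherdata/weather.jar CalculateMaxAndMinTemeratureWithTime
hdfs dfs -ls /user/hduser1/testfs/output_mapred3
hdfs dfs -cat /user/hduser1/testfs/output_mapred3/*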
06. Write Pig Latin scripts on the DESCRIBE, FOREACH and ORDER BY operators
Operator    Description
DESCRIBE    The DESCRIBE operator is used to view the schema of a relation.
            Usage: DESCRIBE relationname;
FOREACH     The FOREACH operator is used to generate specified data transformations based on the column data.
            Usage: relationname2 = FOREACH relationname1 GENERATE (required columndata);
ORDER BY    The ORDER BY operator sorts a relation by one or more fields, in ascending (ASC) or descending (DESC) order.
            Usage: relationname2 = ORDER relationname1 BY fieldname ASC|DESC;
Step Details
1. Prerequisites:
a) VMware or VirtualBox b) Cloudera (CDH5)
2. Open Terminal and type the command: pig
3. gprec_data = LOAD 'gprec.txt' USING PigStorage(',') AS (branchid:int, branch:chararray, strength:int);
This assumes gprec.txt already contains data in this format; a sample file is shown after these steps.
4. DUMP gprec_data;
5. DESCRIBE gprec_data;
6. foreach_opr = FOREACH gprec_data GENERATE branch,strength;
7. DUMP foreach_opr;
8. foreach_opr2 = FOREACH gprec_data GENERATE LOWER(branch);
DUMP foreach_opr2;
9. orderby_opr = ORDER gprec_data BY strength DESC;
10. DUMP orderby_opr;
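For step 3, gprec.txt must be reachable by Pig: in the user's HDFS home directory when pig runs in the default MapReduce mode, or in the local working directory when started with pig -x local. A minimal made-up sample matching the declared schema (branchid, branch, strength):
1,CSE,120
2,ECE,180
3,EEE,60
4,MECH,90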
07. Write Pig Latin scripts to perform set and sort operations
UNION Operation
The UNION operator merges the contents of two relations with compatible schemas into one relation.
Syntax:
grunt> relationname3 = UNION relationname1, relationname2;
JOIN Operation
Self-join is used to join a table with itself, as if the table were two relations; in Pig this means loading the same data under two different aliases and joining them.
Syntax: Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Inner Join
An inner join returns only the tuples whose join keys match in both relations.
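A small illustration of a set operation and an inner join; the file names, aliases and schemas here are made up for this sketch and are not part of the manual:
grunt> custs = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> custs2019 = LOAD 'customers_2019.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> allcusts = UNION custs, custs2019;
grunt> orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, custid:int, amount:int);
grunt> cust_orders = JOIN custs BY id, orders BY custid;
grunt> DUMP cust_orders;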
SORT Operation
Assume the file raw_sales.txt has the following contents:
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140
grunt> rawSales = LOAD 'raw_sales.txt' USING PigStorage(',') AS (category:
chararray, product: chararray, sales: long, total_sales_category: long);
grunt> DUMP rawSales;
grunt> grpByCatTotals = GROUP rawSales BY (total_sales_category, category);
grunt> DUMP grpByCatTotals;
grunt> sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;
grunt> DUMP sortGrpByCatTotals;
grunt> topSalesCats = LIMIT sortGrpByCatTotals 2;
grunt> DUMP topSalesCats;
HIVE:
hive> CREATE TABLE Employee (empid INT, empname STRING, empcity STRING);
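A short hypothetical session to verify the table before dropping it; the inserted values are made up, and INSERT ... VALUES requires Hive 0.14 or later (which CDH5 ships):
hive> SHOW TABLES;
hive> DESCRIBE Employee;
hive> INSERT INTO TABLE Employee VALUES (1, 'Ravi', 'Kurnool');
hive> SELECT * FROM Employee;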
hive> DROP TABLE Employee;
HBASE:
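A minimal hbase shell session covering table creation, insertion, retrieval and deletion; the table name and column family below are placeholders chosen for this sketch:
hbase shell
create 'employee', 'personal'
put 'employee', '1', 'personal:name', 'Ravi'
put 'employee', '1', 'personal:city', 'Kurnool'
get 'employee', '1'
scan 'employee'
disable 'employee'
drop 'employee'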