Hadoop Lab Programs
Week 1,2:
1. Implement the following Data structures in Java
a)Linked Lists b) Stacks c) Queues d) Set e) Map
Week 3, 4:
2. (i) Set up and install Hadoop in its three operating modes:
Standalone,
Pseudo distributed,
Fully distributed
(ii) Use web-based tools to monitor your Hadoop setup.
Week 5:
3. Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting files
Week 6:
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
Week 7:
5. Write a Map Reduce program that mines weather data. Weather sensors collecting data every
hour at many locations across the globe gather a large volume of log data, which is a good
candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Week 8:
6. Implement Matrix Multiplication with Hadoop Map Reduce
Week 9,10:
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Week 11,12:
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.
1. Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map
a) Implementation of LinkedList
import java.util.*;
public class Test
{
public static void main(String args[])
{
// Creating object of class linked list
LinkedList<String> object = new LinkedList<String>();
// Adding elements to the linked list
object.add("A");
object.add("B");
object.addLast("C");
object.addFirst("D");
object.add(2, "E");
object.add("F");
object.add("G");
System.out.println("Linked list : " + object);
// Removing elements from the linked list
object.remove("B");
object.remove(3);
object.removeFirst();
object.removeLast();
System.out.println("Linked list after deletion: " + object);
// Finding an element in the linked list
if (object.contains("E"))
System.out.println("List contains the element 'E'");
// Size of the linked list
System.out.println("Size of linked list = " + object.size());
// Getting and replacing an element
System.out.println("Element returned by get() : " + object.get(2));
object.set(2, "Y");
System.out.println("Linked list after change : " + object);
}
}
Expected Output :
Linked list : [D, A, E, B, C, F, G]
Linked list after deletion: [A, E, F]
List contains the element 'E'
Size of linked list = 3
Element returned by get() : F
Linked list after change : [A, E, Y]
b) Stack implementation
import java.util.*;
class MyStack
{
// Pushing elements on the top of the stack
static void stack_push(Stack<Integer> stack)
{
for(int i = 0; i < 5; i++)
stack.push(i);
}
// Popping elements from the top of the stack
static void stack_pop(Stack<Integer> stack)
{
System.out.println("Pop :");
for(int i = 0; i < 5; i++)
System.out.println(stack.pop());
}
// Displaying the element on the top of the stack
static void stack_peek(Stack<Integer> stack)
{
System.out.println("Element on stack top : " + stack.peek());
}
// Searching for an element in the stack
static void stack_search(Stack<Integer> stack, int element)
{
int pos = stack.search(element);
if(pos == -1)
System.out.println("Element not found");
else
System.out.println("Element is found at position " + pos);
}
public static void main(String args[])
{
Stack<Integer> stack = new Stack<Integer>();
stack_push(stack);
stack_pop(stack);
stack_push(stack);
stack_peek(stack);
stack_search(stack, 2);
stack_search(stack, 6);
}
}
Expected Output :
Pop :
4
3
2
1
0
Element on stack top : 4
Element is found at position 3
Element not found
c) Implementation of Queues
import java.util.LinkedList;
import java.util.Queue;
public class QueueExample
{
public static void main(String[] args)
{
Queue<Integer> q = new LinkedList<>();
// Adds elements {0, 1, 2, 3, 4} to queue
for (int i = 0; i < 5; i++)
q.add(i);
// Display contents of the queue
System.out.println("Elements of queue : " + q);
// Remove the head of the queue and show the remaining elements
System.out.println("Removed element : " + q.remove());
System.out.println("Queue after removal : " + q);
// View the head of the queue without removing it
System.out.println("Head of queue : " + q.peek());
}
}
Expected Output :
Elements of queue : [0, 1, 2, 3, 4]
Removed element : 0
Queue after removal : [1, 2, 3, 4]
Head of queue : 1
d) Set implementation
import java.util.*;
class Test
{
public static void main(String[]args)
{
HashSet<String> h = new HashSet<String>();
// adding into HashSet
h.add("India");
h.add("Australia");
h.add("South Africa");
h.add("India");// adding duplicate elements
// printing HashSet
System.out.println(h);
System.out.println("List contains India or not:" +
h.contains("India"));
// Removing an item
h.remove("Australia");
System.out.println("List after removing Australia:"+h);
// Iterating over hash set items
System.out.println("Iterating over list:");
Iterator<String> i = h.iterator();
while (i.hasNext())
System.out.println(i.next());
}
}
Expected Output :
e) Map implementation
import java.util.*;
class HashMapDemo
{
public static void main(String args[])
{
HashMap<String, Integer> hm = new HashMap<String, Integer>();
hm.put("a", new Integer(100));
hm.put("b", new Integer(200));
hm.put("c", new Integer(300));
hm.put("d", new Integer(400));
// Returns Set view
Set<Map.Entry<String, Integer>> st = hm.entrySet();
for (Map.Entry<String, Integer> me : st)
{
System.out.print(me.getKey()+":");
System.out.println(me.getValue());
}
}
}
Output:
a:100
b:200
c:300
d:400
EXPERIMENT-3, 4
Big Data is the keyword for present-day data trends. Data is growing rapidly, but the hardware is not
able to process such huge volumes with sufficient velocity, so users have to wait a long time for
operations on the data to complete.
The main challenges of Big Data are:
Velocity:- Users require immediate results for their actions, so the data should be processed and
transferred through the network quickly.
Volume:- Volume is the storage issue. Transactional data accumulated over years can reach a huge
volume, although the storage cost is small when compared to the cost of processing the data.
Variety:- There are many varieties of data, such as transactional, textual and multimedia data, and
the system should be able to process any kind of information.
Veracity:- Trustworthiness of the data.
b. MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A MapReduce (Fig.4) job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the
maps, which are then input to the reduce tasks. Typically both the input and the output of the job are
stored in a file-system. The framework takes care of scheduling tasks, monitoring them and
re-executing the failed tasks.
Fig.4. MapReduce
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System run on the same set of nodes. MapReduce is a
simple programming paradigm for processing large datasets in parallel on several systems. The
framework creates multiple instances of the MapReduce program, one for each block of data, and all
the instances run in parallel, each on its own block. The reducer then combines all the intermediate
results into a single output. This is efficient because the data does not have to travel through the
network every time: the MapReduce program itself is small, so distributing the program to all nodes
is far easier than moving the data. The data also never travels through the namenode. When a user
requests to store data, the namenode consults its metadata and returns the available block
information, and the user then writes directly to those blocks.
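To make the paradigm concrete (and to cover the basic Word Count program listed as program 4 above), a minimal WordCount sketch in the same new-API style as the later listings in this manual is shown below; the class names and the use of a combiner are choices made here, not a prescribed solution.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
// Mapper: emit (word, 1) for every word in the input line
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
// Reducer: sum the counts emitted for each word
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
It can be compiled and run in the same way as the weather example under Experiment-5 (the jar name wc.jar and the HDFS paths are assumptions):
Command: hadoop com.sun.tools.javac.Main WordCount.java
Command: jar cvf wc.jar WordCount*.class
Command: hadoop jar wc.jar WordCount wordcount/input wordcount/output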
Hadoop can run on three modes
a) Standalone mode
b) Pseudo mode
c) Fully distributed mode
The software requirements for hadoop installation are
Java Development Kit
Hadoop framework
Secured shell
A) STANDALONE MODE:
Installation of jdk 7
Command: sudo apt-get install openjdk-7-jdk
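In standalone (local) mode Hadoop runs as a single Java process against the local filesystem, so after installing the JDK it is enough to unpack a Hadoop release and point it at Java. A sketch of the remaining steps (the 1.2.1 release and the /usr/lib/hadoop path are assumptions chosen to match the configuration paths used below):
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Command: tar xvf hadoop-1.2.1.tar.gz
Command: sudo mv hadoop-1.2.1 /usr/lib/hadoop
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
Command: /usr/lib/hadoop/bin/hadoop version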
B) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine; the namenode and datanode daemons run
on the same machine. The installation and configuration steps are given below:
Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
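After the configuration files are saved, the HDFS filesystem is formatted once and the daemons are started. A sketch, assuming the Hadoop 1.x layout under /usr/lib/hadoop used above with its bin directory on the PATH:
Command: hadoop namenode -format
Command: start-all.sh
Command: jps
The jps listing should show the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker processes.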
C) FULLY DISTRIBUTED MODE:
In fully distributed mode Hadoop runs on a cluster of machines, with the namenode on the master
node and datanodes on the slave nodes.
NOTE: Verify the passwordless ssh environment from the namenode to all datanodes as the "huser" user.
Log in to each node from the master to verify passwordless access:
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5
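If any of these logins prompts for a password, key-based access can be set up first. A sketch, reusing the "huser" account and the pcetcse hostnames from the commands above:
Command: ssh-keygen -t rsa
Command: ssh-copy-id huser@pcetcse1
Command: ssh-copy-id huser@pcetcse2
Command: ssh-copy-id huser@pcetcse3
Command: ssh-copy-id huser@pcetcse4
Command: ssh-copy-id huser@pcetcse5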
Add the JAVA_HOME directory to the hadoop-env.sh file on all nodes/machines
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
JobTracker web UI
http://localhost:50030/
NameNode web UI and HDFS logs
http://localhost:50070/
http://localhost:50070/logs/
TaskTracker web UI
http://localhost:50060/
EXPERIMENT-5
AIM: Write a MapReduce program that mines weather data.
PROCEDURE:
PROGRAM:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text>
{
@Override
public void map(LongWritable arg0, Text Value, Context context) throws IOException, InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
String date = line.substring(6, 14);
float temp_Min = Float.parseFloat(line.substring(22, 28).trim());
float temp_Max = Float.parseFloat(line.substring(32, 36).trim());
if (temp_Max > 35.0) {
context.write(new Text("Hot Day " + date), new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 10) {
context.write(new Text("Cold Day " + date), new Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text Key, Iterable<Text> Values, Context context) throws IOException, InterruptedException {
// Emit the first recorded temperature for each key
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Compiling and creating the jar file for the Hadoop MapReduce Java program:
Command: hadoop com.sun.tools.javac.Main MyMaxMin.java
Command: jar cvf we.jar MyMaxMin*.class
Running the weather dataset MapReduce jar file on Hadoop:
Command: hadoop jar we.jar MyMaxMin weather/input weather/output
output:
EXPERIMENT-8
AIM: Implement Matrix Multiplication with Hadoop Map Reduce
PROGRAM:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMul {
/*******************Mapper class**********************/
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
// Each input line has the form: matrixName,rowIndex,columnIndex,value
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}
/*******************Reducer class**********************/
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
// Collect the A and B entries that contribute to this output cell (i,k)
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
String[] value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
for (int j = 0; j < n; j++) {
float a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
float b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Dimensions of A (m x n) and B (n x p); set these to match the input matrices
conf.set("m", "2");
conf.set("n", "2");
conf.set("p", "2");
Job job = new Job(conf, "MatrixMul");
job.setJarByClass(MatrixMul.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
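The program is compiled and run in the same way as the weather example (the jar name mm.jar and the HDFS paths are assumptions; the input directory is expected to contain both matrices in the A/B,row,column,value format read by the mapper):
Command: hadoop com.sun.tools.javac.Main MatrixMul.java
Command: jar cvf mm.jar MatrixMul*.class
Command: hadoop jar mm.jar MatrixMul matrix/input matrix/output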
EXPERIMENT-9, 10
AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
PROCEDURE:
Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
Set Path for pig
Command: sudo gedit $HOME/.bashrc
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
pig.properties file
In the conf folder of Pig there is a file named pig.properties, in which various Pig parameters can be set. The available properties can be listed with:
Command: pig -h properties
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful,
you will get the version of Apache Pig as shown below.
Command: pig -version
Joining of data:
Run the Pig Latin JOIN operator on Hadoop MapReduce. Load the customers and orders files from HDFS and join them on the customer id:
grunt>
customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
customer_orders = JOIN customers BY id, orders BY customer_id;
Dump customer_orders;
Sorting of data:
Run the Pig Latin ORDER BY operator on Hadoop MapReduce. Sorting the student_details relation (loaded as shown in the filtering section below) by age in descending order and dumping it produces:
grunt> order_by_data = ORDER student_details BY age DESC;
grunt> Dump order_by_data;
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Filtering of data:
Run pig script program of FILTER on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as shown
below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
Output
It produces the following output, displaying the contents of the relation filter_data:
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
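The remaining tasks from the aim, grouping and projection, can be run on the same student_details relation. A sketch (the relation names group_data and project_data are arbitrary):
grunt> group_data = GROUP student_details BY age;
grunt> Dump group_data;
grunt> project_data = FOREACH student_details GENERATE id, firstname, city;
grunt> Dump project_data;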
EXPERIMENT -11, 12
AIM: Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes
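Before configuring the metastore, download and extract Hive and put it on the PATH, mirroring the Pig installation above. A sketch (the exact release, mirror URL and /usr/lib/hive path are assumptions):
Command: wget https://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz
Command: tar xvf apache-hive-0.14.0-bin.tar.gz
Command: sudo mv apache-hive-0.14.0-bin /usr/lib/hive
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin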
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
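With the metastore configured, databases and tables can be created from the Hive shell. A minimal sketch (the database name userdb and the tab-delimited employee schema are assumptions chosen to match sample.txt below):
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> USE userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, designation String)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;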
Example
We will insert the following data into the employee table. It is a text file named sample.txt in the
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin
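The file is then loaded into the employee table created above (a sketch; LOCAL tells Hive to read from the local filesystem rather than HDFS):
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
    > OVERWRITE INTO TABLE employee;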
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
Example
Let us take an example for view. Assume employee table as given below, with the fields
Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query creates the view for the above scenario:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
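The alter and drop operations required by the aim follow the same pattern. A sketch, reusing the names from the examples above:
hive> DROP INDEX IF EXISTS index_salary ON employee;
hive> DROP VIEW IF EXISTS emp_30000;
hive> ALTER TABLE employee RENAME TO emp;
hive> ALTER TABLE emp ADD COLUMNS (location String);
hive> DROP TABLE IF EXISTS emp;
hive> DROP DATABASE IF EXISTS userdb CASCADE;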