
HADOOP AND BIGDATA LAB

Week 1,2:
1. Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map
Week 3, 4:
2. (i) Perform setting up and Installing Hadoop in its three operating modes:
 Standalone,
 Pseudo distributed,
 Fully distributed
(ii) Use web-based tools to monitor your Hadoop setup.
Week 5:
3. Implement the following file management tasks in Hadoop:
 Adding files and directories
 Retrieving files
 Deleting files
Week 6:
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
Week 7:
5. Write a Map Reduce program that mines weather data. Weather sensors collecting data every
hour at many locations across the globe gather a large volume of log data, which is a good
candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Week 8:
6. Implement Matrix Multiplication with Hadoop Map Reduce
Week 9,10:
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Week 11,12:
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.
1. Implement the following Data structures in Java
a) Linked Lists
b) Stacks
a) Implementation of LinkedList

import java.util.*;
public class Test
{
public static void main(String args[])
{
// Creating object of class linked list
LinkedList<String> object = new LinkedList<String>();
// Adding elements to the linked list

object.add("A");
object.add("B");
object.addLast("C");
object.addFirst("D");
object.add(2, "E");

// similarly add F and G
object.add("F");
object.add("G");

System.out.println("Linked list : " + object);


// Removing elements from the linked list
object.remove("B");
object.remove(3);
object.removeFirst();
object.removeLast();
System.out.println("Linked list after deletion: " + object);
// Finding elements in the linked list
boolean status = object.contains("E");
if(status)
System.out.println("List contains the element 'E' ");
else
System.out.println("List doesn't contain the element 'E'");
// Number of elements in the linked list

int size = object.size();


System.out.println("Size of linked list = " + size);

// Get and set elements from linked list


Object element = object.get(2);
System.out.println("Element returned by get() : " + element);
object.set(2, "Y");
System.out.println("Linked list after change : " + object);
}
}

Expected Output :
Linked list : [D, A, E, B, C, F, G]
Linked list after deletion: [A, E, F]
List contains the element 'E'
Size of linked list = 3
Element returned by get() : F
Linked list after change : [A, E, Y]
b) Implementation of Stack

import java.io.*;
import java.util.*;
class MyStack
{
// Pushing element on the top of the stack
static void stack_push(Stack<Integer> stack)
{
for(int i = 0; i < 5; i++)
{
stack.push(i);
}
}

// Popping element from the top of the stack


static void stack_pop(Stack<Integer> stack)
{
System.out.println("Pop :");
for(int i = 0; i < 5; i++)
{
Integer y = (Integer) stack.pop();
System.out.println(y);
}
}

// Displaying element on the top of the stack


static void stack_peek(Stack<Integer> stack)
{
Integer element = (Integer) stack.peek();
System.out.println("Element on stack top : " + element);
}

// Searching element in the stack


static void stack_search(Stack<Integer> stack, int element)
{
Integer pos = (Integer) stack.search(element);
if(pos == -1)
System.out.println("Element not found");
else
System.out.println("Element is found at position " + pos);
}
public static void main (String[] args)
{
Stack<Integer> stack = new Stack<Integer>();
stack_push(stack);
stack_pop(stack);
stack_push(stack);
stack_peek(stack);
stack_search(stack, 2);
stack_search(stack, 6);
}
}

Expected Output :
Pop :
4
3
2
1
0
Element on stack top : 4
Element is found at position 3
Element not found

2. Implement the following Data structures in Java


a) Queues
b) Set
c) Map

a) Implementation of Queues

import java.util.LinkedList;
import java.util.Queue;
public class QueueExample
{
public static void main(String[] args)
{
Queue<Integer> q = new LinkedList<>();
// Adds elements {0, 1, 2, 3, 4} to queue
for (int i=0; i<5; i++)
q.add(i);

// Display contents of the queue.


System.out.println("Elements of queue-"+q);
// To remove the head of queue.
int removedele = q.remove();
System.out.println("removed element-" + removedele);
System.out.println(q);
// To view the head of queue
int head = q.peek();
System.out.println("head of queue-" + head);
// All other methods of the Collection interface,
// like size() and contains(), can also be used
// with this implementation.
int size = q.size();
System.out.println("Size of queue-" + size);
}
}
Expected Output:
Elements of queue-[0, 1, 2, 3, 4]
removed element-0
[1, 2, 3, 4]
head of queue-1
Size of queue-4

b) Implementation of Set

// Java code for adding elements in Set


import java.util.*;
public class Set_example
{
public static void main(String[] args)
{
// Set demonstration using HashSet
Set<String> hash_Set = new HashSet<String>();
hash_Set.add("srinivas");
hash_Set.add("shoba");
hash_Set.add("srithan");
hash_Set.add("krithik");
hash_Set.add("shoba");
System.out.print("Set output without the duplicates");
System.out.println(hash_Set);
// Set demonstration using TreeSet
System.out.print("Sorted Set after passing into TreeSet");
Set<String> tree_Set = new TreeSet<String>(hash_Set);
System.out.println(tree_Set);
}
}
Output:
Set output without the duplicates[srinivas, shoba, srithan, krithik]
Sorted Set after passing into TreeSet[krithik, shoba, srinivas, srithan]

Working with HashSet:

import java.util.*;
class Test
{
public static void main(String[]args)
{
HashSet<String> h = new HashSet<String>();
// adding into HashSet
h.add("India");
h.add("Australia");
h.add("South Africa");
h.add("India");// adding duplicate elements
// printing HashSet
System.out.println(h);
System.out.println("List contains India or not:" +
h.contains("India"));
// Removing an item
h.remove("Australia");
System.out.println("List after removing Australia:"+h);
// Iterating over hash set items
System.out.println("Iterating over list:");
Iterator<String> i = h.iterator();
while (i.hasNext())
System.out.println(i.next());
}
}
Expected Output :

[Australia, South Africa, India]


List contains India or not:true
List after removing Australia:[South Africa, India]
Iterating over list:
South Africa
India

c) Working with the Map interface

import java.util.*;
class HashMapDemo
{
public static void main(String args[])
{
HashMap< String,Integer> hm =
new HashMap< String,Integer>();
hm.put("a", new Integer(100));
hm.put("b", new Integer(200));
hm.put("c", new Integer(300));
hm.put("d", new Integer(400));
// Returns Set view
Set< Map.Entry< String,Integer> > st = hm.entrySet();
for (Map.Entry< String,Integer> me:st)
{
System.out.print(me.getKey()+":");
System.out.println(me.getValue());
}
}
}

Output:
a:100
b:200
c:300
d:400
EXPERIMENT-3, 4

Big Data is the term for the present data trend: data is growing rapidly, but conventional hardware
cannot process such huge volumes with sufficient velocity, so users have to wait longer for
operations on the data to complete.
The main challenges facing Big Data are:
 Velocity:- Users expect immediate results for their actions, so the data should be processed
and transferred through the network quickly.
 Volume:- Volume is the storage issue. Transactional data accumulates over years and can
grow to a huge size, although the storage cost is negligible compared with the cost of
processing the data.
 Variety:- Data comes in many varieties, such as transactional, textual and multimedia data,
and the system should be able to process any kind of information.
 Veracity:- Trustworthiness of the data.

Fig.1. Where does the data come from?


Hadoop:
Hadoop is a framework from the Apache organization that is used to process large datasets on
commodity hardware. Hadoop has two core modules:
a. HDFS (Hadoop Distributed File System)
b. MapReduce
a. HDFS
HDFS is a block-structured file system: individual files are broken into blocks of a fixed size.
These blocks are stored across a cluster of one or more machines with data storage capacity.
Individual machines in the cluster are referred to as DataNodes. A file can be made of several blocks,
and they are not necessarily stored on the same machine; the target machines which hold each block
are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of
multiple machines, but HDFS supports file sizes far larger than a single-machine DFS could handle;
individual files can require more space than a single hard drive could hold. The default block size of HDFS is 64MB. The
default replication factor for HDFS is 3. For Example, to store 512MB of data we need 8 blocks (as
block size is 64MB) and every block has 3 replications. Finally, we should have 24 blocks to store
512MB of data in HDFS.
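
As a quick back-of-the-envelope check of that 512MB example, the block and replica counts can be
computed as below (a plain Java sketch; the sizes are simply the defaults quoted above):

public class HdfsBlockCount
{
    public static void main(String[] args)
    {
        long fileSizeMB = 512;   // size of the file to be stored
        long blockSizeMB = 64;   // default HDFS block size
        long replication = 3;    // default replication factor
        // ceiling division: number of blocks needed for one copy of the file
        long blocksPerCopy = (fileSizeMB + blockSizeMB - 1) / blockSizeMB;
        System.out.println("Blocks per copy           : " + blocksPerCopy);               // 8
        System.out.println("Blocks including replicas : " + blocksPerCopy * replication); // 24
    }
}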
Fig.2. Dividing large datasets into small chunks
Hadoop is designed as a master-slave architecture. The namenode acts as the master and all
datanodes act as slaves. Jobs are assigned to the slaves through the jobtracker. The datanodes are
responsible for storing the data in blocks. All slaves communicate frequently with the master and
send reports periodically. The small chunks produced above (Fig.2) are stored in the blocks of the
datanodes.

Fig.3. Hadoop Architecture


Figure 3 shows the communication between the namenode and the datanodes on the master and
slave machines. The slave machines run a tasktracker and a datanode. The master machine runs the
jobtracker and the namenode; sometimes the master can also act as a slave.
The NameNode and DataNode are designed to run on commodity machines. The namenode
stores the metadata of HDFS. These machines typically run a Linux operating system (OS).
HDFS is built using the Java language; any machine that supports Java can run the NameNode or the
DataNode. Usage of the highly portable Java language means that HDFS can be deployed on a wide
range of machines. A typical deployment has a dedicated machine that runs only the NameNode.
Each of the other machines in the cluster runs one instance of the DataNode.
The jobtracker runs alongside the namenode on the master. The jobtracker assigns jobs to
the slaves through the tasktrackers, using the metadata stored in the namenode. All the daemons
communicate with each other continuously over TCP/IP. There is another master that periodically
stores the logs of the namenode; it is called the secondary namenode and acts as a mirror of the
namenode.

b. MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A MapReduce (Fig.4) job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the
maps, which are then input to the reduce tasks. Typically both the input and the output of the job are
stored in a file-system. The framework takes care of scheduling tasks, monitoring them and
re-executing failed tasks.
Fig.4. MapReduce

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System run on the same set of nodes. MapReduce is a
simple programming paradigm for processing large datasets in parallel on several systems. The
MapReduce program is instantiated multiple times, depending on the number of blocks of data, and
all the instances run in parallel, one per block. The combiner then joins all the intermediate results
and reduces them to a single output. This is efficient because the data does not have to travel
through the network every time; the MapReduce program itself is small, so distributing it to all nodes
is far easier than moving the data. The data never travels through the namenode: when a user
requests to store data, the namenode uses its metadata to return the available block information,
and the user then writes directly to those blocks.
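
As a rough local illustration of the map-group-reduce idea (plain Java with no Hadoop classes; the
word-count logic mirrors the real MapReduce program given in Experiment 6, and the sample lines
are arbitrary):

import java.util.*;

public class MiniMapReduce
{
    public static void main(String[] args)
    {
        String[] lines = { "deer bear river", "car car river", "deer car bear" };
        // "Map" phase: emit a (word, 1) pair for every word; pairs are grouped by key
        Map<String, List<Integer>> intermediate = new TreeMap<String, List<Integer>>();
        for (String line : lines)
        {
            for (String word : line.split(" "))
            {
                if (!intermediate.containsKey(word))
                    intermediate.put(word, new ArrayList<Integer>());
                intermediate.get(word).add(1);
            }
        }
        // "Reduce" phase: sum the list of ones collected for each key
        for (Map.Entry<String, List<Integer>> entry : intermediate.entrySet())
        {
            int sum = 0;
            for (int one : entry.getValue())
                sum += one;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

Running it prints bear 2, car 3, deer 2 and river 2, which is exactly what the Hadoop job computes,
only at cluster scale.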
Hadoop can run on three modes
a) Standalone mode
b) Pseudo mode
c) Fully distributed mode
The software requirements for hadoop installation are
 Java Development Kit
 Hadoop framework
 Secured shell

A) STANDALONE MODE:
 Installation of jdk 7
Command: sudo apt-get install openjdk-7-jdk

 Download and extract Hadoop


Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop

 Set the path for java and hadoop


Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
 Checking of java and hadoop
Command: java -version
Command: hadoop version

B) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine; the namenode and datanode daemons all
run on that one machine. The installation and configuration steps are given below:

 Installation of secured shell


Command: sudo apt-get install openssh-server

 Create a ssh key for passwordless ssh configuration


Command: ssh-keygen -t rsa -P ""

 Moving the key to authorized key


Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

/**************RESTART THE COMPUTER********************/

 Checking of secured shell login


Command: ssh localhost

 Add JAVA_HOME directory in hadoop-env.sh file


Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Creating namenode and datanode directories for hadoop


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

 Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>

 Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh

 To check if Hadoop started correctly


Command: jps
namenode
secondarynamenode
datanode
jobtracker
tasktracker
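
 To stop the daemons when finished (optional; stop-dfs.sh and stop-mapred.sh ship alongside the
start scripts in Hadoop 1.x)

Command: stop-mapred.sh
Command: stop-dfs.sh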

C) FULLY DISTRIBUTED MODE:


All the daemons, such as the namenode and datanodes, run on different machines. The data is
replicated across the slave machines according to the replication factor. The secondary namenode
periodically stores mirror images of the namenode. The namenode holds the metadata describing
where the blocks are stored and how many replicas exist on the slave machines. The slaves and the
master communicate with each other periodically. The configuration of a multinode cluster is given below:

 Configure the hosts in all nodes/machines


Command: sudo gedit /etc/hosts/
192.168.1.58 pcetcse1
192.168.1.4 pcetcse2
192.168.1.5 pcetcse3
192.168.1.7 pcetcse4
192.168.1.8 pcetcse5

 Passwordless Ssh Configuration

 Create ssh key on namenode/master.


Command: ssh-keygen -t rsa -P ""

 Copy the generated public key to all datanodes/slaves.

Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5

/**************RESTART ALL NODES/COMPUTERS/MACHINES ************/

NOTE: Verify the passwordless ssh environment from namenode to all datanodes as “huser” user.
 Verify passwordless login from the master node to every node
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5
 Add JAVA_HOME directory in hadoop-env.sh file in all nodes/machines
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Creating namenode directory in namenode/master


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode

 Creating datanode directory in datanodes/slaves


Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://pcetcse1:8020</value>
</property>

 Configure hdfs-site.xml in namenode/master


Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>

 Configure hdfs-site.xml in datanodes/slaves


Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>

 Configure mapred-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>pcetcse1:8021</value>
</property>

 Configure the masters file on the namenode/master (give the secondary namenode hostname)


Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse2
 Configure the masters file on all datanodes/slaves (give the namenode hostname)
Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse1

 Configure slaves in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/slaves
pcetcse2
pcetcse3
pcetcse4
pcetcse5

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh

 To check if Hadoop started correctly check in all the nodes/machines


huser@pcetcse1:$ jps
namenode
jobtracker
huser@pcetcse2:$ jps
secondarynamenode
tasktracker
datanode
huser@pcetcse3:$ jps
datanode
tasktracker
huser@pcetcse4:$ jps
datanode
tasktracker
huser@pcetcse5:$ jps
datanode
tasktracker

Using HDFS monitoring UI


 HDFS Namenode UI
http://localhost:50070/
 HDFS Live Nodes list

 MapReduce Jobtracker UI
http://localhost:50030/
 HDFS Logs
http://localhost:50070/logs/

 MapReduce Tasktracker UI
http://localhost:50060/
EXPERIMENT-5

HDFS basic Command-line file operations

1. Create a directory in HDFS at given path(s):


Command: hadoop fs -mkdir <paths>
2. List the contents of a directory:
Command: hadoop fs -ls <args>
3. Upload and download a file in HDFS:
Upload:
Command: hadoop fs -put <localsrc> <HDFS_dest_path>
Download:
Command: hadoop fs -get <HDFS_src> <localdst>
4. See contents of a file:
Command: hadoop fs -cat <path[filename]>
5. Copy a file from source to destination:
Command: hadoop fs -cp <source> <dest>
6. Copy a file between the local file system and HDFS:
Command: hadoop fs -copyFromLocal <localsrc> URI
Command: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
7. Move file from source to destination:
Command: hadoop fs -mv <src> <dest>
8. Remove a file or directory in HDFS:
Remove files specified as argument. Delete directory only when it is empty.
Command: hadoop fs -rm <arg>
Recursive version of delete
Command: hadoop fs -rmr <arg>
9. Display last few lines of a file:
Command: hadoop fs -tail <path[filename]>
10. Display the aggregate length of a file:
Command: hadoop fs -du <path>
11. Getting help:
Command: hadoop fs -help

Adding files and directories:


 Creating a directory
Command: hadoop fs -mkdir input/
 Copying files from the local file system to HDFS
Command: hadoop fs -put inp/file01 input/
Retrieving files:
Command: hadoop fs -get input/file01 localfs
Deleting files and directories:
Command: hadoop fs -rmr input/file01
Experiment-6
AIM: Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.

PROCEDURE:

 WordCount MapReduce Program

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
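
As an illustration of what the job computes, suppose file.txt (created in the steps below) contains
the single hypothetical line "hello world hello hadoop". The map phase emits a (word, 1) pair for
every token and the reduce phase sums the pairs per word, so the job output would be:

hadoop 1
hello 2
world 1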
 Create the temporary content file in the input directory
Command: sudo mkdir input
Command: sudo gedit input/file.txt
 Type some text on that file, save the file and close

 Put the file.txt into hdfs


Command: hadoop fs -mkdir input
Command: hadoop fs -put input/file.txt input/
 Create jar file WordCount Program
Command: hadoop com.sun.tools.javac.Main WordCount.java
Command: jar cf wc.jar WordCount*.class
 Run WordCount jar file on input directory
Command: hadoop jar wc.jar WordCount input output

 To see the output


Command: hadoop fs -cat output/*
EXPERIMENT - 7
AIM: Write a Map Reduce program that mines weather data.
Weather sensors collecting data every hour at many locations across the globe gather a large
volume of log data, which is a good candidate for analysis with MapReduce, since it is
semi-structured and record-oriented.

PROGRAM:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text>
{
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
// fixed-width record: date field, then minimum and maximum temperature fields
String date = line.substring(6, 14);
float temp_Min = Float.parseFloat(line.substring(22, 28).trim());
float temp_Max = Float.parseFloat(line.substring(32, 36).trim());
if (temp_Max > 35.0) {
context.write(new Text("Hot Day " + date), new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 10) {
context.write(new Text("Cold Day " + date), new Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
// the values parameter must be Iterable (not Iterator) so that this method
// actually overrides Reducer.reduce()
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path OutputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

sample input dataset:

 Compiling and creating jar file for hadoop mapreduce java program:
Command: hadoop com.sun.tools.javac.Main MyMaxMin.java
Command: jar cvf we.jar MyMaxMin*.class
 Running weather dataset mapreduce jar file on hadoop
Command: hadoop jar we.jar MyMaxMin weather/input weather/output
output:
EXPERIMENT-8
AIM: Implement Matrix Multiplication with Hadoop Map Reduce
PROGRAM:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMul {
/******************* Mapper class **********************/
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
// each input line has the form matrixName,rowIndex,columnIndex,value
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}

/************************* Reducer class *************************/
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String[] value;
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float a_ij;
float b_jk;
// dot product of row i of A and column k of B for this (i,k) key
for (int j = 0; j < n; j++) {
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}
}
}

/******************** Driver (main) function ********************/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// A is an m-by-n matrix; B is an n-by-p matrix.
conf.set("m", "8");
conf.set("n", "8");
conf.set("p", "8");

Job job = Job.getInstance(conf, "MatrixMultiplication");


job.setJarByClass(MatrixMul.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
}
}
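
The mapper above expects every line of the input file in the form matrixName,rowIndex,columnIndex,value,
where matrix A is m-by-n and matrix B is n-by-p. A few hypothetical sample lines in that format (the
actual 8x8 values are whatever you enter in matrix.txt below):

A,0,0,1
A,0,1,2
B,0,0,3
B,1,0,4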

 Create the temporary content file in the input directory


Command: sudo mkdir input
Command: sudo gedit input/matrix.txt
 Enter the 8x8 matrix data into that file
Sample matrix 8x8 matrix dataset
 Put the matrix input into HDFS
Command: hadoop fs -mkdir inputMatrix
Command: hadoop fs -put input/matrix.txt inputMatrix/
 Create jar file MatrixMultiplication Program
Command: hadoop com.sun.tools.javac.Main MatrixMul.java
Command: jar cvf mc.jar MatrixMul*.class
 Run mc jar file on input directory
Command: hadoop jar mc.jar MatrixMul inputMatrix/matrix.txt out1
 To see the output, browse the HDFS file system (the out1 directory)
EXPERIMENT-9, 10

AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
PROCEDURE:
 Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
 Set Path for pig
Command: sudo gedit $HOME/.bashrc
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
 pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
 Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful,
you will get the version of Apache Pig as shown below.
Command: pig -version

Local mode:
Command: $ pig -x local
15/09/28 10:13:03 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443415383991.log
2015-09-28 10:13:04,838 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

MapReduce mode:
Command: $ pig -x mapreduce
15/09/28 10:28:46 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443416326123.log
2015-09-28 10:28:46,427 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
Grouping Of Data:

 Put the dataset into HDFS


Command: hadoop fs -put pig/input/data.txt pig_data/
 Run pig script program of GROUP on hadoop mapreduce
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
group_data = GROUP student_details by age;
Dump group_data;
Output:

Joining Of Data:
 Run pig script program of JOIN on hadoop mapreduce
grunt>
customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
USING PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);
orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt' USING
PigStorage(',')as (oid:int, date:chararray, customer_id:int, amount:int);

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;


 Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
 Output
You will get the following output, which displays the contents of the relation named
coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Sorting of Data:
 Run pig script program of SORT on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as shown
below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now sort the relation in descending order of the students' age and store the result in
another relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
 Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
 Output
It will produce the following output, displaying the contents of the relation order_by_data as
follows.

(8,Bharathi,Nambiayar,24,9848022333,Chennai)

(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Filtering of data:
 Run pig script program of FILTER on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as shown
below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
 Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
 Output
It will produce the following output, displaying the contents of the relation filter_data as
follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
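Projection of data:
 The AIM also asks for projection; a minimal sketch using the FOREACH...GENERATE operator
(assuming student_details is still loaded as above, so the field names are the same):
grunt> student_names = FOREACH student_details GENERATE firstname, city;
grunt> Dump student_names;
This keeps only the firstname and city fields of every tuple, for example (Rajiv,Hyderabad).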
EXPERIMENT -11, 12
AIM: Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes

 Download and extract Hive:


Command: wget https://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz
Command: tar zxvf apache-hive-0.14.0-bin.tar.gz
Command: sudo mv apache-hive-0.14.0-bin /usr/lib/hive
Command: sudo gedit $HOME/.bashrc
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/*.jar
export CLASSPATH=$CLASSPATH:/usr/lib/hive/lib/*.jar
Command: cd $HIVE_HOME/conf
Command: sudo cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/lib/hadoop

 Downloading Apache Derby


The following command is used to download Apache Derby. It takes some time to download.
Command: wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
Command: tar zxvf db-derby-10.4.2.0-bin.tar.gz
Command: sudo mv db-derby-10.4.2.0-bin /usr/lib/derby
Command: sudo gedit $HOME/.bashrc
export DERBY_HOME=/usr/lib/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:$DERBY_HOME/lib/derbyclient.jar
Command: sudo mkdir $DERBY_HOME/data
Command: cd $HIVE_HOME/conf
Command: sudo cp hive-default.xml.template hive-site.xml
Command: sudo gedit $HIVE_HOME/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>

 Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Command: $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
Command: $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Command: hive
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….

hive> show tables;


OK
Time Taken: 2.798 seconds

 Database and table creation, dropping:


hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
default
userdb
hive> DROP DATABASE IF EXISTS userdb;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
> salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;

Example
We will insert the following data into the table. It is a text file named sample.txt in
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


> OVERWRITE INTO TABLE employee;
hive> SELECT * FROM employee WHERE Salary>=40000;

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+

hive> ALTER TABLE employee RENAME TO emp;


hive> DROP TABLE IF EXISTS employee;
Functions:

Return Type | Signature | Description
BIGINT | round(double a) | Returns the rounded BIGINT value of the double.
BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT | ceil(double a) | Returns the minimum BIGINT value that is equal to or greater than the double.
double | rand(), rand(int seed) | Returns a random number that changes from row to row.
string | concat(string A, string B, ...) | Returns the string resulting from concatenating B after A.
string | substr(string A, int start) | Returns the substring of A starting from the start position till the end of string A.
string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length.
string | upper(string A) | Returns the string resulting from converting all characters of A to upper case.
string | ucase(string A) | Same as above.
string | lower(string A) | Returns the string resulting from converting all characters of A to lower case.
hive> SELECT round(2.6) from temp;
3.0
 Views:

Example
Let us take an example for view. Assume employee table as given below, with the fields
Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
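
The AIM also covers dropping views; assuming the view above exists, it can be removed with:
hive> DROP VIEW IF EXISTS emp_30000;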

 Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
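
To complete the "drop indexes" part of the AIM, the index can be listed and removed as sketched
below (syntax for Hive 0.14, where indexes are still supported):
hive> SHOW INDEX ON employee;
hive> DROP INDEX IF EXISTS index_salary ON employee;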
