
HADOOP AND BIGDATA LAB

Week 1,2:
1. Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map
Week 3, 4:
2. (i) Perform setting up and Installing Hadoop in its three operating modes:
 Standalone,
 Pseudo distributed,
 Fully distributed
(ii) Use web-based tools to monitor your Hadoop setup.
Week 5:
3. Implement the following file management tasks in Hadoop:
 Adding files and directories
 Retrieving files
 Deleting files
Week 6:
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
Week 7:
5. Write a Map Reduce program that mines weather data. Weather sensors collecting data every
hour at many locations across the globe gather a large volume of log data, which is a good
candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Week 8:
6. Implement Matrix Multiplication with Hadoop Map Reduce
Week 9,10:
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Week 11,12:
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.
1. Implement the following Data structures in Java
a) Linked Lists
b) Stacks
a) Implementation of LinkedList

import java.util.*;
public class Test
{
public static void main(String args[])
{
// Creating object of class linked list
LinkedList<String> object = new LinkedList<String>();
// Adding elements to the linked list

object.add("A");
object.add("B");
object.addLast("C");
object.addFirst("D");
object.add(2, "E");

// similarly add F and G
object.add("F");
object.add("G");

System.out.println("Linked list : " + object);


// Removing elements from the linked list
object.remove("B");
object.remove(3);
object.removeFirst();
object.removeLast();
System.out.println("Linked list after deletion: " + object);
// Finding elements in the linked list
boolean status = object.contains("E");
if(status)
System.out.println("List contains the element 'E' ");
else
System.out.println("List doesn't contain the element 'E'");
// Number of elements in the linked list

int size = object.size();


System.out.println("Size of linked list = " + size);

// Get and set elements from linked list


Object element = object.get(2);
System.out.println("Element returned by get() : " + element);
object.set(2, "Y");
System.out.println("Linked list after change : " + object);
}
}

Expected Output :
Linked list : [D, A, E, B, C, F, G]
Linked list after deletion: [A, E, F]
List contains the element 'E'
Size of linked list = 3
Element returned by get() : F
Linked list after change : [A, E, Y]
b) Implementation of Stack

import java.io.*;
import java.util.*;
class MyStack
{
// Pushing element on the top of the stack
static void stack_push(Stack<Integer> stack)
{
for(int i = 0; i < 5; i++)
{
stack.push(i);
}
}

// Popping element from the top of the stack


static void stack_pop(Stack<Integer> stack)
{
System.out.println("Pop :");
for(int i = 0; i < 5; i++)
{
Integer y = (Integer) stack.pop();
System.out.println(y);
}
}

// Displaying element on the top of the stack


static void stack_peek(Stack<Integer> stack)
{
Integer element = (Integer) stack.peek();
System.out.println("Element on stack top : " + element);
}

// Searching element in the stack


static void stack_search(Stack<Integer> stack, int element)
{
Integer pos = (Integer) stack.search(element);
if(pos == -1)
System.out.println("Element not found");
else
System.out.println("Element is found at position " + pos);
}
public static void main (String[] args)
{
Stack<Integer> stack = new Stack<Integer>();
stack_push(stack);
stack_pop(stack);
stack_push(stack);
stack_peek(stack);
stack_search(stack, 2);
stack_search(stack, 6);
}
}

Expected Output :
Pop :
4
3
2
1
0
Element on stack top : 4
Element is found at position 3
Element not found

2. Implement the following Data structures in Java


a) Queues
b) Set
c) Map

a) Implementation of Queues

import java.util.LinkedList;
import java.util.Queue;
public class QueueExample
{
public static void main(String[] args)
{
Queue<Integer> q = new LinkedList<>();
// Adds elements {0, 1, 2, 3, 4} to queue
for (int i=0; i<5; i++)
q.add(i);

// Display contents of the queue.


System.out.println("Elements of queue-"+q);
// To remove the head of queue.
int removedele = q.remove();
System.out.println("removed element-" + removedele);
System.out.println(q);
// To view the head of queue
int head = q.peek();
System.out.println("head of queue-" + head);
// All other methods of the Collection interface,
// like size() and contains(), can also be used
// with this implementation.
int size = q.size();
System.out.println("Size of queue-" + size);
}
}
Expected Output:
Elements of queue-[0, 1, 2, 3, 4]
removed element-0
[1, 2, 3, 4]
head of queue-1
Size of queue-4

b) Implementation of Set

// Java code for adding elements in Set


import java.util.*;
public class Set_example
{
public static void main(String[] args)
{
// Set demonstration using HashSet
Set<String> hash_Set = new HashSet<String>();
hash_Set.add("srinivas");
hash_Set.add("shoba");
hash_Set.add("srithan");
hash_Set.add("krithik");
hash_Set.add("shoba");
System.out.print("Set output without the duplicates");
System.out.println(hash_Set);
// Set demonstration using TreeSet
System.out.print("Sorted Set after passing into TreeSet");
Set<String> tree_Set = new TreeSet<String>(hash_Set);
System.out.println(tree_Set);
}
}
Output:
Set output without the duplicates[srinivas, shoba, srithan, krithik]
Sorted Set after passing into TreeSet[krithik, shoba, srinivas, srithan]

Working with HashSet:

import java.util.*;
class Test
{
public static void main(String[]args)
{
HashSet<String> h = new HashSet<String>();
// adding into HashSet
h.add("India");
h.add("Australia");
h.add("South Africa");
h.add("India");// adding duplicate elements
// printing HashSet
System.out.println(h);
System.out.println("List contains India or not:" +
h.contains("India"));
// Removing an item
h.remove("Australia");
System.out.println("List after removing Australia:"+h);
// Iterating over hash set items
System.out.println("Iterating over list:");
Iterator<String> i = h.iterator();
while (i.hasNext())
System.out.println(i.next());
}
}
Expected Output :

[Australia, South Africa, India]


List contains India or not:true
List after removing Australia:[South Africa, India]
Iterating over list:
South Africa
India

c) Working with the Map interface

import java.util.*;
class HashMapDemo
{
public static void main(String args[])
{
HashMap< String,Integer> hm =
new HashMap< String,Integer>();
hm.put("a", new Integer(100));
hm.put("b", new Integer(200));
hm.put("c", new Integer(300));
hm.put("d", new Integer(400));
// Returns Set view
Set< Map.Entry< String,Integer> > st = hm.entrySet();
for (Map.Entry< String,Integer> me:st)
{
System.out.print(me.getKey()+":");
System.out.println(me.getValue());
}
}
}

Output:
a:100
b:200
c:300
d:400
EXPERIMENT-3, 4

Big Data is the term for the present data trend: data is growing rapidly, but conventional hardware
cannot process such huge volumes with sufficient velocity, so users have to wait longer for
operations on the data to complete.
The main challenges facing Big Data are:
 Velocity:- Users expect immediate results for their actions, so the data should be processed
and transferred through the network quickly.
 Volume:- Volume is the storage issue. Transactional data accumulates over years and can
grow to a huge size, although the storage cost is negligible compared with the cost of
processing the data.
 Variety:- Data comes in many varieties, such as transactional, textual and multimedia data,
and the system should be able to process any kind of information.
 Veracity:- Trustworthiness of the data.

Fig.1. Where does the data come from?


Hadoop:
Hadoop is a framework from the Apache organization that is used to process large datasets on
commodity hardware. Hadoop has two core modules:
a. HDFS (Hadoop Distributed File System)
b. MapReduce
a. HDFS
HDFS is a block-structured file system: individual files are broken into blocks of a fixed size.
These blocks are stored across a cluster of one or more machines with data storage capacity.
Individual machines in the cluster are referred to as DataNodes. A file can be made of several blocks,
and they are not necessarily stored on the same machine; the target machines which hold each block
are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of
multiple machines, but HDFS supports file sizes far larger than a single-machine DFS could handle;
individual files can require more space than a single hard drive could hold. The default block size of HDFS is 64MB. The
default replication factor for HDFS is 3. For Example, to store 512MB of data we need 8 blocks (as
block size is 64MB) and every block has 3 replications. Finally, we should have 24 blocks to store
512MB of data in HDFS.
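
As a quick back-of-the-envelope check of that 512MB example, the block and replica counts can be
computed as below (a plain Java sketch; the sizes are simply the defaults quoted above):

public class HdfsBlockCount
{
    public static void main(String[] args)
    {
        long fileSizeMB = 512;   // size of the file to be stored
        long blockSizeMB = 64;   // default HDFS block size
        long replication = 3;    // default replication factor
        // ceiling division: number of blocks needed for one copy of the file
        long blocksPerCopy = (fileSizeMB + blockSizeMB - 1) / blockSizeMB;
        System.out.println("Blocks per copy           : " + blocksPerCopy);               // 8
        System.out.println("Blocks including replicas : " + blocksPerCopy * replication); // 24
    }
}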
Fig.2. Dividing large datasets into small chunks
Hadoop is designed as a master-slave architecture. The namenode acts as the master and all
datanodes act as slaves. Jobs are assigned to the slaves through the jobtracker. The datanodes are
responsible for storing the data in blocks. All slaves communicate frequently with the master and
send reports periodically. The small chunks produced above (Fig.2) are stored in the blocks of the
datanodes.

Fig.3. Hadoop Architecture


Figure 3 shows the communication between the namenode and the datanodes on the master and
slave machines. The slave machines run a tasktracker and a datanode. The master machine runs the
jobtracker and the namenode; sometimes the master can also act as a slave.
The NameNode and DataNode are designed to run on commodity machines. The namenode
stores the metadata of HDFS. These machines typically run a Linux operating system (OS).
HDFS is built using the Java language; any machine that supports Java can run the NameNode or the
DataNode. Usage of the highly portable Java language means that HDFS can be deployed on a wide
range of machines. A typical deployment has a dedicated machine that runs only the NameNode.
Each of the other machines in the cluster runs one instance of the DataNode.
The jobtracker runs alongside the namenode on the master. The jobtracker assigns jobs to
the slaves through the tasktrackers, using the metadata stored in the namenode. All the daemons
communicate with each other continuously over TCP/IP. There is another master that periodically
stores the logs of the namenode; it is called the secondary namenode and acts as a mirror of the
namenode.

b. MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A MapReduce (Fig.4) job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the
maps, which are then input to the reduce tasks. Typically both the input and the output of the job are
stored in a file-system. The framework takes care of scheduling tasks, monitoring them and
re-executing failed tasks.
Fig.4. MapReduce

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System run on the same set of nodes. MapReduce is a
simple programming paradigm for processing large datasets in parallel on several systems. The
MapReduce program is instantiated multiple times, depending on the number of blocks of data, and
all the instances run in parallel, one per block. The combiner then joins all the intermediate results
and reduces them to a single output. This is efficient because the data does not have to travel
through the network every time; the MapReduce program itself is small, so distributing it to all nodes
is far easier than moving the data. The data never travels through the namenode: when a user
requests to store data, the namenode uses its metadata to return the available block information,
and the user then writes directly to those blocks.
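
As a rough local illustration of the map-group-reduce idea (plain Java with no Hadoop classes; the
word-count logic mirrors the real MapReduce program given in Experiment 6, and the sample lines
are arbitrary):

import java.util.*;

public class MiniMapReduce
{
    public static void main(String[] args)
    {
        String[] lines = { "deer bear river", "car car river", "deer car bear" };
        // "Map" phase: emit a (word, 1) pair for every word; pairs are grouped by key
        Map<String, List<Integer>> intermediate = new TreeMap<String, List<Integer>>();
        for (String line : lines)
        {
            for (String word : line.split(" "))
            {
                if (!intermediate.containsKey(word))
                    intermediate.put(word, new ArrayList<Integer>());
                intermediate.get(word).add(1);
            }
        }
        // "Reduce" phase: sum the list of ones collected for each key
        for (Map.Entry<String, List<Integer>> entry : intermediate.entrySet())
        {
            int sum = 0;
            for (int one : entry.getValue())
                sum += one;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

Running it prints bear 2, car 3, deer 2 and river 2, which is exactly what the Hadoop job computes,
only at cluster scale.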
Hadoop can run on three modes
a) Standalone mode
b) Pseudo mode
c) Fully distributed mode
The software requirements for hadoop installation are
 Java Development Kit
 Hadoop framework
 Secured shell

A) STANDALONE MODE:
 Installation of jdk 7
Command: sudo apt-get install openjdk-7-jdk

 Download and extract Hadoop


Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop

 Set the path for java and hadoop


Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
 Checking of java and hadoop
Command: java -version
Command: hadoop version

B) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine; the namenode and datanode daemons all
run on that one machine. The installation and configuration steps are given below:

 Installation of secured shell


Command: sudo apt-get install openssh-server

 Create a ssh key for passwordless ssh configuration


Command: ssh-keygen -t rsa -P ""

 Moving the key to authorized key


Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

/**************RESTART THE COMPUTER********************/

 Checking of secured shell login


Command: ssh localhost

 Add JAVA_HOME directory in hadoop-env.sh file


Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Creating namenode and datanode directories for hadoop


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

 Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>

 Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh

 To check if Hadoop started correctly


Command: jps
namenode
secondarynamenode
datanode
jobtracker
tasktracker
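
 To stop the daemons when finished (optional; stop-dfs.sh and stop-mapred.sh ship alongside the
start scripts in Hadoop 1.x)

Command: stop-mapred.sh
Command: stop-dfs.sh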

C) FULLY DISTRIBUTED MODE:


All the daemons, such as the namenode and datanodes, run on different machines. The data is
replicated across the slave machines according to the replication factor. The secondary namenode
periodically stores mirror images of the namenode. The namenode holds the metadata describing
where the blocks are stored and how many replicas exist on the slave machines. The slaves and the
master communicate with each other periodically. The configuration of a multinode cluster is given below:

 Configure the hosts in all nodes/machines


Command: sudo gedit /etc/hosts/
192.168.1.58 pcetcse1
192.168.1.4 pcetcse2
192.168.1.5 pcetcse3
192.168.1.7 pcetcse4
192.168.1.8 pcetcse5

 Passwordless Ssh Configuration

 Create ssh key on namenode/master.


Command: ssh-keygen -t rsa -P ""

 Copy the generated public key to all datanodes/slaves.

Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5

/**************RESTART ALL NODES/COMPUTERS/MACHINES ************/

NOTE: Verify the passwordless ssh environment from namenode to all datanodes as “huser” user.
 Verify passwordless login from the master node to every node
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5
 Add JAVA_HOME directory in hadoop-env.sh file in all nodes/machines
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Creating namenode directory in namenode/master


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode

 Creating datanode directory in datanodes/slaves


Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://pcetcse1:8020</value>
</property>

 Configure hdfs-site.xml in namenode/master


Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>

 Configure hdfs-site.xml in datanodes/slaves


Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>

 Configure mapred-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>pcetcse1:8021</value>
</property>

 Configure the masters file on the namenode/master (give the secondary namenode hostname)


Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse2
 Configure the masters file on all datanodes/slaves (give the namenode hostname)
Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse1

 Configure slaves in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/slaves
pcetcse2
pcetcse3
pcetcse4
pcetcse5

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh

 To check if Hadoop started correctly check in all the nodes/machines


huser@pcetcse1:$ jps
namenode
jobtracker
huser@pcetcse2:$ jps
secondarynamenode
tasktracker
datanode
huser@pcetcse3:$ jps
datanode
tasktracker
huser@pcetcse4:$ jps
datanode
tasktracker
huser@pcetcse5:$ jps
datanode
tasktracker

Using HDFS monitoring UI


 HDFS Namenode UI
http://localhost:50070/
 HDFS Live Nodes list

 MapReduce Jobtracker UI
http://localhost:50030/
 HDFS Logs
http://localhost:50070/logs/

 MapReduce Tasktracker UI
http://localhost:50060/
EXPERIMENT-5

HDFS basic Command-line file operations

1. Create a directory in HDFS at given path(s):


Command: hadoop fs -mkdir <paths>
2. List the contents of a directory:
Command: hadoop fs -ls <args>
3. Upload and download a file in HDFS:
Upload:
Command: hadoop fs -put <localsrc> <HDFS_dest_path>
Download:
Command: hadoop fs -get <HDFS_src> <localdst>
4. See contents of a file:
Command: hadoop fs -cat <path[filename]>
5. Copy a file from source to destination:
Command: hadoop fs -cp <source> <dest>
6. Copy a file between the local file system and HDFS:
Command: hadoop fs -copyFromLocal <localsrc> URI
Command: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
7. Move file from source to destination:
Command: hadoop fs -mv <src> <dest>
8. Remove a file or directory in HDFS:
Remove files specified as argument. Delete directory only when it is empty.
Command: hadoop fs -rm <arg>
Recursive version of delete
Command: hadoop fs -rmr <arg>
9. Display last few lines of a file:
Command: hadoop fs -tail <path[filename]>
10. Display the aggregate length of a file:
Command: hadoop fs -du <path>
11. Getting help:
Command: hadoop fs -help

Adding files and directories:


 Creating a directory
Command: hadoop fs -mkdir input/
 Copying files from the local file system to HDFS
Command: hadoop fs -put inp/file01 input/
Retrieving files:
Command: hadoop fs -get input/file01 localfs
Deleting files and directories:
Command: hadoop fs -rmr input/file01
Experiment-6
AIM: Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.

PROCEDURE:

 WordCount MapReduce Program

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
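
As an illustration of what the job computes, suppose file.txt (created in the steps below) contains
the single hypothetical line "hello world hello hadoop". The map phase emits a (word, 1) pair for
every token and the reduce phase sums the pairs per word, so the job output would be:

hadoop 1
hello 2
world 1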
 Create the temporary content file in the input directory
Command: sudo mkdir input
Command: sudo gedit input/file.txt
 Type some text on that file, save the file and close

 Put the file.txt into hdfs


Command: hadoop fs -mkdir input
Command: hadoop fs -put input/file.txt input/
 Create jar file WordCount Program
Command: hadoop com.sun.tools.javac.Main WordCount.java
Command: jar cf wc.jar WordCount*.class
 Run WordCount jar file on input directory
Command: hadoop jar wc.jar WordCount input output

 To see the output


Command: hadoop fs -cat output/*
EXPERIMENT - 7
AIM: Write a Map Reduce program that mines weather data.
Weather sensors collecting data every hour at many locations across the globe gather a large
volume of log data, which is a good candidate for analysis with MapReduce, since it is
semi-structured and record-oriented.

PROGRAM:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text>
{
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
// fixed-width record: date field, then minimum and maximum temperature fields
String date = line.substring(6, 14);
float temp_Min = Float.parseFloat(line.substring(22, 28).trim());
float temp_Max = Float.parseFloat(line.substring(32, 36).trim());
if (temp_Max > 35.0) {
context.write(new Text("Hot Day " + date), new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 10) {
context.write(new Text("Cold Day " + date), new Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
// the values parameter must be Iterable (not Iterator) so that this method
// actually overrides Reducer.reduce()
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path OutputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

sample input dataset:

 Compiling and creating jar file for hadoop mapreduce java program:
Command: hadoop com.sun.tools.javac.Main MyMaxMin.java
Command: jar cvf we.jar MyMaxMin*.class
 Running weather dataset mapreduce jar file on hadoop
Command: hadoop jar we.jar MyMaxMin weather/input weather/output
output:
EXPERIMENT-8
AIM: Implement Matrix Multiplication with Hadoop Map Reduce
PROGRAM:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMul {
/******************* Mapper class **********************/
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
// each input line has the form matrixName,rowIndex,columnIndex,value
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}

/************************* Reducer class *************************/
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String[] value;
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float a_ij;
float b_jk;
// dot product of row i of A and column k of B for this (i,k) key
for (int j = 0; j < n; j++) {
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}
}
}

/******************** Driver (main) function ********************/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// A is an m-by-n matrix; B is an n-by-p matrix.
conf.set("m", "8");
conf.set("n", "8");
conf.set("p", "8");

Job job = Job.getInstance(conf, "MatrixMultiplication");


job.setJarByClass(MatrixMul.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
}
}
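
The mapper above expects every line of the input file in the form matrixName,rowIndex,columnIndex,value,
where matrix A is m-by-n and matrix B is n-by-p. A few hypothetical sample lines in that format (the
actual 8x8 values are whatever you enter in matrix.txt below):

A,0,0,1
A,0,1,2
B,0,0,3
B,1,0,4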

 Create the temporary content file in the input directory


Command: sudo mkdir input
Command: sudo gedit input/matrix.txt
 Enter the 8x8 matrix data into that file
Sample matrix 8x8 matrix dataset
 Put the matrix input into HDFS
Command: hadoop fs -mkdir inputMatrix
Command: hadoop fs -put input/matrix.txt inputMatrix/
 Create jar file MatrixMultiplication Program
Command: hadoop com.sun.tools.javac.Main MatrixMul.java
Command: jar cvf mc.jar MatrixMul*.class
 Run mc jar file on input directory
Command: hadoop jar mc.jar MatrixMul inputMatrix/matrix.txt out1
 To see the output, browse the HDFS file system (the out1 directory)
EXPERIMENT-9, 10

AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
PROCEDURE:
 Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
 Set Path for pig
Command: sudo gedit $HOME/.bashrc
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
 pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can
set various parameters as given below.
pig -h properties
 Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful,
you will get the version of Apache Pig as shown below.
Command: pig -version

Local mode:
Command: $ pig -x local
15/09/28 10:13:03 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443415383991.log
2015-09-28 10:13:04,838 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

MapReduce mode:
Command: $ pig -x mapreduce
15/09/28 10:28:46 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443416326123.log
2015-09-28 10:28:46,427 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
Grouping Of Data:

 Put the dataset into HDFS


Command: hadoop fs -put pig/input/data.txt pig_data/
 Run pig script program of GROUP on hadoop mapreduce
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
group_data = GROUP student_details by age;
Dump group_data;
Output:

Joining Of Data:
 Run pig script program of JOIN on hadoop mapreduce
grunt>
customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
USING PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);
orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt' USING
PigStorage(',')as (oid:int, date:chararray, customer_id:int, amount:int);

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;


 Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
 Output
You will get the following output, which displays the contents of the relation named
coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Sorting of Data:
 Run pig script program of SORT on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as shown
below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now sort the relation in descending order of the students' age and store the result in
another relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
 Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
 Output
It will produce the following output, displaying the contents of the relation order_by_data as
follows.

(8,Bharathi,Nambiayar,24,9848022333,Chennai)

(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Filtering of data:
 Run pig script program of FILTER on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as shown
below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
 Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
 Output
It will produce the following output, displaying the contents of the relation filter_data as
follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
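Projection of data:
 The AIM also asks for projection; a minimal sketch using the FOREACH...GENERATE operator
(assuming student_details is still loaded as above, so the field names are the same):
grunt> student_names = FOREACH student_details GENERATE firstname, city;
grunt> Dump student_names;
This keeps only the firstname and city fields of every tuple, for example (Rajiv,Hyderabad).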
EXPERIMENT -11, 12
AIM: Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions,
and indexes

 Download and extract Hive:


Command: wget https://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz
Command: tar zxvf apache-hive-0.14.0-bin.tar.gz
Command: sudo mv apache-hive-0.14.0-bin /usr/lib/hive
Command: sudo gedit $HOME/.bashrc
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/*.jar
export CLASSPATH=$CLASSPATH:/usr/lib/hive/lib/*.jar
Command: cd $HIVE_HOME/conf
Command: sudo cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/lib/hadoop

 Downloading Apache Derby


The following command is used to download Apache Derby. It takes some time to download.
Command: wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
Command: tar zxvf db-derby-10.4.2.0-bin.tar.gz
Command: sudo mv db-derby-10.4.2.0-bin /usr/lib/derby
Command: sudo gedit $HOME/.bashrc
export DERBY_HOME=/usr/lib/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:$DERBY_HOME/lib/derbyclient.jar
Command: sudo mkdir $DERBY_HOME/data
Command: cd $HIVE_HOME/conf
Command: sudo cp hive-default.xml.template hive-site.xml
Command: sudo gedit $HIVE_HOME/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>

 Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Command: $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
Command: $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Command: hive
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….

hive> show tables;


OK
Time Taken: 2.798 seconds

 Database and table creation, dropping:


hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
default
userdb
hive> DROP DATABASE IF EXISTS userdb;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
> salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;

Example
We will insert the following data into the table. It is a text file named sample.txt in
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


> OVERWRITE INTO TABLE employee;
hive> SELECT * FROM employee WHERE Salary>=40000;

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+

hive> ALTER TABLE employee RENAME TO emp;


hive> DROP TABLE IF EXISTS employee;
Functions:

Return Type | Signature | Description
BIGINT | round(double a) | Returns the rounded BIGINT value of the double.
BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT | ceil(double a) | Returns the minimum BIGINT value that is equal to or greater than the double.
double | rand(), rand(int seed) | Returns a random number that changes from row to row.
string | concat(string A, string B, ...) | Returns the string resulting from concatenating B after A.
string | substr(string A, int start) | Returns the substring of A starting from the start position till the end of string A.
string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length.
string | upper(string A) | Returns the string resulting from converting all characters of A to upper case.
string | ucase(string A) | Same as above.
string | lower(string A) | Returns the string resulting from converting all characters of A to lower case.
hive> SELECT round(2.6) from temp;
3.0
 Views:

Example
Let us take an example for view. Assume employee table as given below, with the fields
Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
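
The AIM also covers dropping views; assuming the view above exists, it can be removed with:
hive> DROP VIEW IF EXISTS emp_30000;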

 Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
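
To complete the "drop indexes" part of the AIM, the index can be listed and removed as sketched
below (syntax for Hive 0.14, where indexes are still supported):
hive> SHOW INDEX ON employee;
hive> DROP INDEX IF EXISTS index_salary ON employee;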
