
BDA record


REG NO : 411622243038

Ex.No: 01  Downloading and installing Hadoop; Understanding different Hadoop modes. Start-up scripts, Configuration files.
Date:

AIM:
To Install Apache Hadoop.
Hadoop software can be installed in three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Hadoop is a Java-based programming framework that supports the processing and
storage of extremely large datasets on a cluster of inexpensive machines. It was
the first major open source project in the big data playing field and is sponsored
by the Apache Software Foundation.
Hadoop-2.7.3 comprises four main layers:
➢ Hadoop Common is the collection of utilities and libraries that support other
Hadoop modules.
➢ HDFS, which stands for Hadoop Distributed File System, is responsible for
persisting data to disk.
➢ YARN, short for Yet Another Resource Negotiator, is the "operating system"
for HDFS.
➢ Map Reduce is the original processing model for Hadoop clusters. It
distributes work within the cluster or map, then organizes and reduces the
results from the nodes into a response to a query. Many other processing
models are available for the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-
alone mode which is suitable for learning about Hadoop, performing simple
operations, and debugging.
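
The exercise title also covers configuration files. As an illustrative sketch only
(the file locations assume the usual %HADOOP_HOME%\etc\hadoop layout and the
values are typical defaults, not taken from this record), a minimal
pseudo-distributed configuration touches core-site.xml and hdfs-site.xml:

<!-- core-site.xml: default filesystem URI (illustrative value) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor of 1 for a single-node cluster -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In stand-alone mode these files are left at their defaults and Hadoop runs as a
single local Java process.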


ALGORITHM:
1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS
If Apache Hadoop 2.2.0 is not already installed then follow the post Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS.
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and
Node Manager).

PROGRAM:

Run the following commands in Command Prompt:


C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons
Namenode, Datanode, Resource Manager and Node Manager will start in a few
minutes, and the Single Node (pseudo-distributed mode) cluster will then be ready
to execute Hadoop MapReduce jobs.
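
To confirm that the daemons have started, the jps command from the JDK can be
used (an illustrative check, not part of the original procedure):

C:\hadoop>jps

The listing should include the NameNode, DataNode, ResourceManager and
NodeManager processes.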


Run wordcount MapReduce job


Now we'll run wordcount MapReduce job available in
%HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-
examples-2.2.0.jar
Create a text file (say C:\file1.txt) with some content. We'll pass this file as input
to the wordcount MapReduce job for counting words. The contents of C:\file1.txt are:
Install Hadoop
Run Hadoop Wordcount Mapreduce Example
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to
be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file(say 'file1.txt') from local disk to the newly created 'input'
directory in HDFS.
C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input

Check content of the copied file.


C:\hadoop>hdfs dfs -ls input


Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup 55 2014-02-03 13:19 input/file1.txt
C:\hadoop>bin\hdfs dfs -cat input/file1.txt
Install Hadoop
Run Hadoop Wordcount Mapreduce Example
Run the wordcount MapReduce job provided in
%HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-
2.2.0.jar
C:\hadoop>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-
2.2.0.jar wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1391412385921_0002
14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application
application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:04 INFO mapreduce.Job: The url to track the job:
http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job:
job_1391412385921_0002
14/02/03 13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running
in uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%


14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002


completed successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=160142
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0

HDFS: Number of bytes read=171


HDFS: Number of bytes written=59
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5657
Total time spent by all reduces in occupied slots (ms)=6128
Map-Reduce Framework
Map input records=2
Map output records=7
Map output bytes=82
Map output materialized bytes=89
Input split bytes=116
Combine input records=7
Combine output records=6


Reduce input groups=6


Reduce shuffle bytes=89
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=145
CPU time spent (ms)=1418
Physical memory (bytes) snapshot=368246784
Virtual memory (bytes) snapshot=513716224
Total committed heap usage (bytes)=307757056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=55
File Output Format Counters

Bytes Written=59


OUTPUT:

RESULT:
We've installed Hadoop in stand-alone mode and verified it by running an
example program it provided.


Ex.No: 02  Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files and Deleting files
Date:

AIM:
To implement file management tasks in Hadoop, such as adding files and
directories, retrieving files, and deleting files.

PROCEDURE:

Using Hadoop CLI:


Hadoop CLI provides several commands for file management

1. Adding files and Directories

This command copies files or directories from the local filesystem to HDFS (see
the example below). Replace <local_path> with the path of the file or directory
on your local machine, and <hdfs_path> with the desired destination path
in HDFS.
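
A typical form of this command (paths shown are placeholders) is:

hdfs dfs -mkdir <hdfs_path>
hdfs dfs -put <local_path> <hdfs_path>

-copyFromLocal can be used in place of -put, exactly as in Ex.No: 01.
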
2. Retrieving Files:

This command copies files or directories from HDFS to the local filesystem (see
the example below). Replace <hdfs_path> with the path of the file or directory in
HDFS, and <local_path> with the desired destination path on your local machine.
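
A typical form of this command (paths shown are placeholders) is:

hdfs dfs -get <hdfs_path> <local_path>

-copyToLocal can be used in place of -get.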


3. Deleting Files
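
The hdfs dfs -rm command removes a file from HDFS, and -rm -r removes a
directory recursively (paths shown are placeholders):

hdfs dfs -rm <hdfs_path>
hdfs dfs -rm -r <hdfs_directory_path>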

Using Hadoop JAVA APIs:
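
The same tasks can be performed programmatically through the Hadoop FileSystem
Java API. The following is a minimal sketch (the class name and paths are
illustrative assumptions, and the Configuration is expected to pick up the
cluster settings from the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileManagement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/input"));                      // add a directory
        fs.copyFromLocalFile(new Path("c:/file1.txt"),           // add a file
                new Path("/user/input/file1.txt"));
        fs.copyToLocalFile(new Path("/user/input/file1.txt"),    // retrieve a file
                new Path("c:/retrieved_file1.txt"));
        fs.delete(new Path("/user/input/file1.txt"), false);     // delete a file (non-recursive)

        fs.close();
    }
}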

RESULT:
Thus, the file management tasks of adding files and directories, retrieving
files, and deleting files have been completed successfully.

Ex.No: 03  Implementation of Matrix Multiplication with Hadoop Map Reduce
Date:

AIM:
To Develop a Map Reduce program to implement Matrix Multiplication.
Matrix multiplication or matrix product is a binary operation in mathematics that
creates a matrix from two matrices. It is influenced by linear equations and vector
transformations, which have applications in applied mathematics, physics, and
engineering. For example, if A is an n × m matrix and B is an m × p matrix, their
matrix product AB is an n × p matrix. The matrix product represents the
composition of two linear transformations represented by matrices.


ALGORITHM:

Algorithm for Map Function:

a. For each element m_ij of M, produce (key, value) pairs as ((i,k), (M, j, m_ij)),
for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element n_jk of N, produce (key, value) pairs as ((i,k), (N, j, n_jk)),
for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs such that each key (i,k) has a list with
values (M, j, m_ij) and (N, j, n_jk) for all possible values of j.

Algorithm for Reduce Function:

d. For each key (i,k), do:
e. Sort the values beginning with M by j in listM and the values beginning with N
by j in listN; multiply m_ij and n_jk for the j-th value of each list.
f. Sum up the products m_ij x n_jk and return ((i,k), Σ_j m_ij x n_jk).
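
In other words, for an n x m matrix M and an m x p matrix N, the reduce step
computes each entry of the product as

c_{ik} = \sum_{j=1}^{m} m_{ij} \, n_{jk}, \qquad 1 \le i \le n, \; 1 \le k \le p.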

Step 1. Download the hadoop jar files with these links.


Download Hadoop Common Jar files: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop Mapreduce Jar File: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar

Step 2. Creating Mapper file for Matrix Multiplication.


import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;


import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
class Element implements Writable {
int tag;
int index;
double value;
Element() {
tag = 0;
index = 0;
value = 0.0;
}
Element(int tag, int index, double value) {
this.tag = tag;
this.index = index;
this.value = value;
}
@Override
public void readFields(DataInput input) throws IOException {
tag = input.readInt();
index = input.readInt();


value = input.readDouble();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(tag);
output.writeInt(index);
output.writeDouble(value);
}
}
class Pair implements WritableComparable<Pair> {
int i;
int j;
Pair() {
i = 0;
j = 0;
}
Pair(int i, int j) {
this.i = i;
this.j = j;
}
@Override
public void readFields(DataInput input) throws IOException {
i = input.readInt();
j = input.readInt();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(i);
output.writeInt(j);


}
@Override
public int compareTo(Pair compare) {
if (i > compare.i) {
return 1;
} else if ( i < compare.i) {
return -1;
} else {
if(j > compare.j) {
return 1;
} else if (j < compare.j) {
return -1;
}
}
return 0;
}
public String toString() {
return i + " " + j + " ";
}
}
public class Multiply
{
public static class MatriceMapperM extends
Mapper<Object,Text,IntWritable,Element>
{
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();


String[] stringTokens = readLine.split(",");


int index = Integer.parseInt(stringTokens[0]);
double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(0, index, elementValue);
IntWritable keyValue = new IntWritable(Integer.parseInt(stringTokens[1]));
context.write(keyValue, e);
}
}
public static class MatriceMapperN extends
Mapper<Object,Text,IntWritable,Element> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] stringTokens = readLine.split(",");
int index = Integer.parseInt(stringTokens[1]);
double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(1,index, elementValue);
IntWritable keyValue = new IntWritable(Integer.parseInt(stringTokens[0]));
context.write(keyValue, e);
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJobName("MapIntermediate");
job.setJarByClass(Multiply.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
MatriceMapperM.class);


MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class,


MatriceMapperN.class);
job.setReducerClass(ReducerMxN.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Element.class);
job.setOutputKeyClass(Pair.class);
job.setOutputValueClass(DoubleWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
Job job2 = Job.getInstance();
job2.setJobName("MapFinalOutput");
job2.setJarByClass(Multiply.class);
job2.setMapperClass(MapMxN.class);
job2.setReducerClass(ReduceMxN.class);
job2.setMapOutputKeyClass(Pair.class);
job2.setMapOutputValueClass(DoubleWritable.class);
job2.setOutputKeyClass(Pair.class);
job2.setOutputValueClass(DoubleWritable.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job2, new Path(args[2]));
FileOutputFormat.setOutputPath(job2, new Path(args[3]));
job2.waitForCompletion(true);
}
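// NOTE (added sketch): main() above references ReducerMxN, MapMxN and ReduceMxN, whose
// listings (the omitted Steps 3 and 4) are missing from this record. The nested classes
// below are an illustrative reconstruction based only on the key/value types configured
// above; they assume the default single reducer and are not the original source code.
public static class ReducerMxN extends Reducer<IntWritable, Element, Pair, DoubleWritable> {
@Override
public void reduce(IntWritable key, Iterable<Element> values, Context context)
throws IOException, InterruptedException {
ArrayList<Element> M = new ArrayList<Element>();
ArrayList<Element> N = new ArrayList<Element>();
Configuration conf = context.getConfiguration();
for (Element element : values) {
// Hadoop reuses the Writable instance, so copy each element before storing it.
Element temp = ReflectionUtils.newInstance(Element.class, conf);
ReflectionUtils.copy(conf, element, temp);
if (temp.tag == 0) { M.add(temp); } else { N.add(temp); }
}
// Emit the partial products m_ij * n_jk keyed by the output cell (i,k).
for (Element m : M) {
for (Element n : N) {
context.write(new Pair(m.index, n.index), new DoubleWritable(m.value * n.value));
}
}
}
}
public static class MapMxN extends Mapper<Object, Text, Pair, DoubleWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
// Intermediate lines look like "i j<TAB>value" (Pair.toString() plus TextOutputFormat).
String[] tokens = value.toString().trim().split("\\s+");
Pair p = new Pair(Integer.parseInt(tokens[0]), Integer.parseInt(tokens[1]));
context.write(p, new DoubleWritable(Double.parseDouble(tokens[2])));
}
}
public static class ReduceMxN extends Reducer<Pair, DoubleWritable, Pair, DoubleWritable> {
@Override
public void reduce(Pair key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException {
double sum = 0.0;
for (DoubleWritable value : values) {
sum += value.get();
}
context.write(key, new DoubleWritable(sum));
}
}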
}
Step 5. Compiling the program (in a folder named 'operation'):
#!/bin/bash
rm -rf multiply.jar classes


module load hadoop/2.6.0


mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath`
Multiply.java
jar cf multiply.jar -C classes .
echo "end"

Step 6. Running the program (from the folder named 'operation'):


export HADOOP_CONF_DIR=/home/$USER/cometcluster
module load hadoop/2.6.0
myhadoop-configure.sh
start-dfs.sh
start-yarn.sh
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put M-matrix-large.txt /user/$USER/M-matrix-large.txt
hdfs dfs -put N-matrix-large.txt /user/$USER/N-matrix-large.txt
hadoop jar multiply.jar edu.uta.cse6331.Multiply /user/$USER/M-matrix-
large.txt /user/$USER/N-matrix-large.txt /user/$USER/intermediate
/user/$USER/output
rm -rf output-distr
mkdir output-distr
hdfs dfs -get /user/$USER/output/part* output-distr
stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh


OUTPUT:

module load hadoop/2.6.0


rm -rf output intermediate
hadoop --config $HOME jar multiply.jar edu.uta.cse6331.Multiply M-matrix-
small.txt N-matrix-small.txt intermediate output

RESULT:
Thus, the program to implement matrix multiplication with Hadoop Map
Reduce has been executed successfully.


Ex.No: 04  Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
Date:

AIM:
Run a basic Word Count MapReduce program to understand MapReduce
paradigm: Count words in a given file. View the output file. Calculate the
execution time
About MapReduce
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). The Reduce task takes the output from a map as an input and
combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is merely
a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

Below are the steps for MapReduce data flow:

Step 1: One block is processed by one mapper at a time. In the mapper, a


developer can specify his own business logic as per the requirements. In this
manner, Map runs on all the nodes of the cluster and processes the data blocks in
parallel.

Step 2: Output of Mapper also known as intermediate output is written to the


local disk. The output of the mapper is not stored on HDFS, as this is temporary
data and writing it on HDFS would create unnecessarily many copies.

Step 3: Output of the mapper is shuffled to the reducer node (a normal slave
node on which the reduce phase runs, hence called the reducer node). The
shuffling/copying is a physical movement of data which is done over the network.

Step 4: Once all the mappers are finished and their output is shuffled to the
reducer nodes, this intermediate output is merged and sorted, and then provided
as input to the reduce phase.

Step 5: Reduce is the second phase of processing where the user can specify his
own custom business logic as per the requirements. An input to a reducer is
provided from all the mappers. An output of reducer is the final output, which is
written on HDFS.


PROGRAM:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class wordCount {


public static class Map extends Mapper<LongWritable, Text, Text,
IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)


throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}


public static class Reduce extends Reducer<Text, IntWritable, Text,


IntWritable> {

public void reduce(Text key, Iterable <IntWritable> values, Context context)


throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");


job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
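
To build the jar and run the job (the jar name and the HDFS paths below are
illustrative assumptions), the following commands can be used:

javac -cp `hadoop classpath` -d wc_classes wordCount.java
jar cf wordcount.jar -C wc_classes .
hdfs dfs -mkdir -p input
hdfs dfs -put file1.txt input
hadoop jar wordcount.jar wordCount input output
hdfs dfs -cat output/part-r-00000

The execution time can be calculated from the timestamps printed in the job's
console log (the difference between the "Running job" and "completed
successfully" lines), or on Linux by prefixing the hadoop jar command with the
shell's time utility.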


OUTPUT:

RESULT:
Thus, the basic Word Count MapReduce program to understand the
MapReduce paradigm has been executed successfully.


Ex.No: 05(a) To install and run Hive


Date:

AIM:
To install and run the Apache Hive
PROCEDURE:
1. Downloading Apache Hive binaries
In order to download Apache Hive binaries, you should go to the following
website: https://downloads.apache.org/hive/hive-3.1.2/. Then, download the
apache-hive-3.1.2-bin.tar.gz file.

When the download is complete, we should extract (twice, as mentioned above) the
apache-hive-3.1.2-bin.tar.gz archive into the “E:\hadoop-env\apache-hive-3.1.2”
directory (since we decided to use “E:\hadoop-env\” as the installation directory
for all technologies used in the previous guide).

2. Setting environment variables

After extracting Derby and Hive archives, we should go to Control Panel


> System and Security > System. Then Click on “Advanced system settings”.


In the advanced system settings dialog, click on “Environment variables” button.


Now we should add the following user variables:

• HIVE_HOME: “E:\hadoop-env\apache-hive-3.1.2\”

• DERBY_HOME: “E:\hadoop-env\db-derby-10.14.2.0\”

• HIVE_LIB: “%HIVE_HOME%\lib”

• HIVE_BIN: “%HIVE_HOME%\bin”

• HADOOP_USER_CLASSPATH_FIRST: “true”


Besides, we should add the following system variable:


• HADOOP_USER_CLASSPATH_FIRST: “true”
Now, we should edit the Path user variable to add the following paths:
• %HIVE_BIN%
• %DERBY_HOME%\bin

3. Configuring Hive
3.1. Copy Derby libraries
Now, we should go to the Derby libraries directory (E:\hadoop-env\db-derby-
10.14.2.0\lib) and copy all *.jar files.


Then, we should paste them within the Hive libraries directory (E:\hadoop-
env\apache-hive-3.1.2\lib).


3.2. Configuring hive-site.xml

Now, we should go to the Apache Hive configuration directory


(E:\hadoop-env\apache-hive-3.1.2\conf) and create a new file “hive-site.xml”. We
should paste the following XML code within this file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <description>Enable user impersonation for HiveServer2</description>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    <description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: Custom authentication provider (use with property hive.server2.custom.authentication.class)</description>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>True</value>
  </property>
</configuration>

4. Starting Services
4.1. Hadoop Services
To start Apache Hive, open the command prompt utility as administrator.
Then, start the Hadoop services using start-dfs and start-yarn commands (as
illustrated in the Hadoop installation guide).

4.2. Derby Network Server


Then, we should start the Derby network server on the localhost using the
following command:
E:\hadoop-env\db-derby-10.14.2.0\bin\StartNetworkServer -h 0.0.0.0

5. Starting Apache Hive


Now, let us open a command prompt, go to the Hive binaries
directory (E:\hadoop-env\apache-hive-3.1.2\bin), and execute the following
command:
hive
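
Once the Hive prompt appears, a quick sanity check (illustrative, not part of the
original procedure) is to list the existing databases; the built-in default
database should be shown:

hive> SHOW DATABASES;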

RESULT:
Thus, Apache Hive has been installed and run successfully.


Ex.No: 05(b) Hive Operations


Date:

AIM:
To perform Hive operations.
ALGORITHM:

Step 1: Create a Database (if not exists)


Database name (`userdb`).

Step 2: Create a Table (if not exists)


Table name (`employee`), columns (`eid`, `name`, `salary`, `designation`),
delimiters (`'\t'`, `'\n'`), and storage location (`'/user/input'`).

Step 3: Load Data into the Table


Path to the local data file (`inputdata.txt`).

Step 4: Create a View


Input: View name (`writer_editor`), condition (`designation='Writer' OR
designation='Editor'`).

Step 5: Create an Index (with Deferred Rebuild)


Index name (`index_salary`), indexed column (`salary`), index handler
(`'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'`).

Step 6: Query Data


Retrieve and display all records from the `employee` table.
Retrieve and display records from the `writer_editor` view.


PROGRAM:

CREATE DATABASE IF NOT EXISTS userdb;

CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary
String, designation String) COMMENT 'Employee details' ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/input';

LOAD DATA LOCAL INPATH 'inputdata.txt' OVERWRITE INTO TABLE


employee;

CREATE VIEW writer_editor AS SELECT * FROM employee WHERE


designation='Writer' or designation='Editor';

CREATE INDEX index_salary ON TABLE employee(salary) AS


'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH
DEFERRED REBUILD;

SELECT * from employee;

SELECT * from writer_editor;
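
A small sample inputdata.txt matching the table definition (tab-separated fields;
the records are purely illustrative) could be:

1201	Gopal	45000	Technical manager
1202	Manisha	45000	Writer
1203	Kiran	40000	Editor
1204	Kranthi	30000	Op Admin

With such data, the writer_editor view would return only the second and third
records.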


OUTPUT:

RESULT:
Thus, the Hive operations have been performed successfully.


Ex.No: 06  Installation of HBase, Installing thrift along with Practice examples
Date:

AIM:
To install HBase on Windows and install Thrift, along with practice examples.
PROCEDURE:
Step-1: (Extraction of files)
Extract all the files in C drive

Step-2:(Creating Folder)
Create folders named "hbase" and "zookeeper."


Step-3: (Deleting line in HBase.cmd)


Open hbase.cmd in any text editor.
Search for line %HEAP_SETTINGS% and remove it.

Step-4: (Add lines in hbase-env.cmd)


Now open hbase-env.cmd, which is in the conf folder, in any text editor and add the following lines:
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-
XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true

set HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false" "-


Dcom.sun.management.jmxremote.authenticate=false"


set HBASE_MASTER_OPTS=%HBASE_JMX_BASE% "-


Dcom.sun.management.jmxremote.port=10101"
set HBASE_REGIONSERVER_OPTS=%HBASE_JMX_BASE% "-
Dcom.sun.management.jmxremote.port=10102"
set HBASE_THRIFT_OPTS=%HBASE_JMX_BASE% "-
Dcom.sun.management.jmxremote.port=10103"
set HBASE_ZOOKEEPER_OPTS=%HBASE_JMX_BASE% "-
Dcom.sun.management.jmxremote.port=10104"
set HBASE_REGIONSERVERS=%HBASE_HOME%\conf\regionservers
set HBASE_LOG_DIR=%HBASE_HOME%\logs
set HBASE_IDENT_STRING=%USERNAME%
set HBASE_MANAGES_ZK=true

Step-6: (Setting Environment Variables)


Now set up the environment variables.
Search "System environment variables."


Now click on " Environment Variables."

Then click on "New."


Variable name: HBASE_HOME


Variable Value: Put the path of the Hbase folder.
We have completed the HBase Setup on Windows procedure.

Step 7: Install Apache Thrift

Download Thrift:
Visit the Apache Thrift website: https://thrift.apache.org/download.
Download and extract Thrift.
Build and Install Thrift:
./configure
make
sudo make install
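
Before running the Java example in the next step, the HBase Thrift server must be
running. Assuming HBase's bin directory is on the PATH, it can typically be
started (listening on the default port 9090) with:

hbase thrift start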


Step 8: Practice Examples (Using Java with HBase and Thrift)


Below is a simple Java example demonstrating how to use Apache Thrift to
interact with HBase:
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.hadoop.hbase.thrift.generated.Hbase;

public class HBaseThriftExample {

public static void main(String[] args) {


TTransport transport = new TSocket("localhost", 9090);
try {
transport.open();

// Create Thrift client


Hbase.Client client = new Hbase.Client(new
TBinaryProtocol(transport));

// Perform operations
// ... add your HBase Thrift operations here ...

// Close the transport


transport.close();
} catch (TException e) {
e.printStackTrace();
}
}
}


Ensure that your HBase Thrift server is running and accessible at the specified
host and port. Also, make sure the necessary HBase Thrift libraries are included
in your Java project's classpath.

The provided Java code connects to an HBase Thrift server, performs unspecified
operations (indicated by comments), and handles exceptions. Since the actual
operations are not specified in the code, the output would depend on what
operations you perform within the try block.

If everything runs successfully (meaning the HBase Thrift server is running and
reachable, and your operations execute without errors), the program will
terminate without any output.
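
As a purely illustrative operation (assuming the thrift1 Hbase client generated
from Hbase.thrift), the try block could, for example, list the existing table
names:

// Illustrative only: print the names of all tables known to HBase
for (java.nio.ByteBuffer name : client.getTableNames()) {
    System.out.println(java.nio.charset.StandardCharsets.UTF_8
            .decode(name.duplicate()).toString());
}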

RESULT:
Thus, the installation of HBase and Thrift, along with the practice examples, has been completed successfully.


Ex.No: 07  Practice importing and exporting data from various databases.
Date:

AIM:
To perform importing and exporting of data across various systems such as
HDFS, Apache Hive, and Apache Spark.

PROCEDURE:

Importing Data:

1. Hadoop Distributed File System (HDFS):

• Use the Hadoop hdfs dfs command-line tool or Hadoop File System API to
copy data from a local file system or another location to HDFS. For
example:

$ hdfs dfs -put local_file.txt /hdfs/path

• This command uploads the local_file.txt from the local file system to the
HDFS path /hdfs/path.

2. Apache Hive:

• Hive supports data import from various sources, including local files,
HDFS, and databases. You can use the LOAD DATA statement to import
data into Hive tables. For example:

LOAD DATA INPATH '/hdfs/path/data.txt' INTO TABLE my_table;

• This statement loads data from the HDFS path /hdfs/path/data.txt into the


Hive table my_table.

3. Apache Spark:

• Spark provides rich APIs for data ingestion. You can use the
DataFrameReader or SparkSession APIs to read data from different sources
such as CSV files, databases, or streaming systems. For example:

val df = spark.read.format("csv").load("/path/to/data.csv")

• This code reads data from the CSV file located at /path/to/data.csv into a
DataFrame in Spark.

Exporting Data:

1. Hadoop Distributed File System (HDFS):

• Use the Hadoop hdfs dfs command-line tool or Hadoop File System API to
copy data from HDFS to a local file system or another location. For
example:

$ hdfs dfs -get /hdfs/path/file.txt local_file.txt

• This command downloads the file /hdfs/path/file.txt from HDFS and saves
it as local_file.txt in the local file system.

2. Apache Hive:

• Exporting data from Hive can be done in various ways, depending on the
desired output format. You can use the INSERT OVERWRITE statement
to export data from Hive tables to files or other Hive tables. For example:

INSERT OVERWRITE LOCAL DIRECTORY '/path/to/output'
SELECT * FROM my_table;

• This statement exports the data from the Hive table my_table to the local
directory /path/to/output.

3. Apache Spark:

• Spark provides flexible options for data export. You can use the
DataFrameWriter API (obtained via df.write) to write data to different file
formats, databases, or streaming systems. For example:

df.write.format("parquet").save("/path/to/output")

• This code saves the DataFrame df in Parquet format to the specified output
directory.

RESULT:
Thus, importing and exporting data across various systems has been performed successfully.


Ex.No: 08  MapReduce to find the maximum electrical consumption in each year
Date:

AIM:

To Develop a MapReduce to find the maximum electrical consumption in


each year given electrical consumption for each month in each year.
PROCEDURE:

Given below is the data regarding the electrical consumption of an
organization. It contains the monthly electrical consumption and the annual
average for various years.

If the above data is given as input, we have to write applications to process it and
produce results such as finding the year of maximum usage, the year of minimum
usage, and so on. This is straightforward for programmers when the number of
records is finite: they simply write the logic to produce the required output and
pass the data to the application. But think of data representing the electrical
consumption of all the large-scale industries of a particular state since its
formation.

When we write applications to process such bulk data,


• They will take a lot of time to execute.
• There will be a heavy network traffic when we move data from source to
network server and so on.
To solve these problems, we have the MapReduce framework


Input Data
The above data is saved as sample.txt and given as input. The input file looks as
shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

PROGRAM:
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */ Text, /*Input value Type*/
Text, /*Output key Type*/ IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException
{


String line = value.toString(); String lasttoken = null;


StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens())
{
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable>
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}


}}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
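
A typical way to compile and run this job (the jar name and HDFS paths are
illustrative assumptions; sample.txt is the input file described above) is:

javac -cp `hadoop classpath` -d units_classes ProcessUnits.java
jar cf units.jar -C units_classes .
hdfs dfs -mkdir -p /user/$USER/eleunits
hdfs dfs -put sample.txt /user/$USER/eleunits
hadoop jar units.jar ProcessUnits /user/$USER/eleunits /user/$USER/eleunits_out
hdfs dfs -cat /user/$USER/eleunits_out/part-00000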


OUTPUT:
Input:
Kolkata,56
Jaipur,45
Delhi,43
Mumbai,34
Goa,45
Kolkata,35
Jaipur,34
Delhi,32
Output:
Kolkata 56
Jaipur 45
Delhi 43
Mumbai 34

RESULT:
Thus, the MapReduce program to find the maximum electrical consumption
in each year has been executed successfully.


Ex.No: 09 MapReduce program to analyze Uber data set


Date:

AIM:
To Develop a MapReduce program to analyze Uber data set to find the
days on which each basement has more trips using the following dataset.
PROCEDURE:

Problem Statement 1: In this problem statement, we will find the days on


which each basement has more trips.
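
The mapper below assumes comma-separated input records of the form
dispatching_base_number,date,active_vehicles,trips (the field names are an
assumption about the data set; the lines below are illustrative):

B02512,1/1/2015,190,1132
B02765,1/1/2015,225,1765
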
PROGRAM:
Mapper Class:

public static class TokenizerMapper


extends Mapper<Object, Text, Text, IntWritable>{
java.text.SimpleDateFormat format = new
java.text.SimpleDateFormat("MM/dd/yyyy");
String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"};
private Text basement = new Text();
Date date = null;
private int trips;
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line = value.toString();
String[] splits = line.split(",");
basement.set(splits[0]);
try {
date = format.parse(splits[1]);
} catch (ParseException e) {

// TODO Auto-generated catch block


e.printStackTrace();
}
trips = new Integer(splits[3]);
String keys = basement.toString()+ " "+days[date.getDay()];
context.write(new Text(keys), new IntWritable(trips));
}
}
Reducer Class:
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}


Whole Source Code:

import java.io.IOException;
import java.text.ParseException;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Uber1 {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
java.text.SimpleDateFormat format = new
java.text.SimpleDateFormat("MM/dd/yyyy");
String[] days ={"Sun","Mon","Tue","Wed","Thu","Fri","Sat"};
private Text basement = new Text();
Date date = null;
private int trips;
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] splits = line.split(",");
basement.set(splits[0]);
try {


date = format.parse(splits[1]);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
trips = new Integer(splits[3]);
String keys = basement.toString()+ " "+days[date.getDay()];
context.write(new Text(keys), new IntWritable(trips));
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Uber1");
job.setJarByClass(Uber1.class);
job.setMapperClass(TokenizerMapper.class);


job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Running the Program:

First, we need to build a jar file for the above program and we need to run it as a
normal Hadoop program by passing the input dataset and the output file path as
shown below.
hadoop jar uber1.jar /uber /user/output1
In the output directory, a part file is created, containing the output shown below.


OUTPUT:

B02512 Sat 15026


B02512 Sun 10487
B02512 Thu 15809
B02512 Tue 12041
B02512 Wed 12691
B02598 Fri 93126
B02598 Mon 60882
B02598 Sat 94588
B02598 Sun 66477
B02598 Thu 90333
B02598 Tue 63429
B02598 Wed 71956
B02617 Fri 125067
B02617 Mon 80591
B02617 Sat 127902
B02617 Sun 91722
B02617 Thu 118254
B02617 Tue 86602
B02617 Wed 94887
B02682 Fri 114662
B02682 Mon 74939
B02682 Sat 120283
B02682 Sun 82825
B02682 Thu 106643
B02682 Tue 76905
B02682 Wed 86252
B02764 Fri 326968


B02764 Mon 214116


B02764 Sat 356789
B02764 Sun 249896
B02764 Thu 304200
B02764 Tue 221343
B02764 Wed 241137
B02765 Fri 34934
B02765 Mon 21974
B02765 Sat 36737

RESULT:
Thus, the MapReduce program to analyze the Uber data set has been
executed successfully.


Ex.No: 10  MapReduce program to find the grades of students
Date:

AIM:
To develop a MapReduce program to find the grades of students.
ALGORITHM:

Step 1: Input Marks and Calculate Average


1. Initialize an array `marks[]` of size 6 and variables `i` for iteration and
`total` to store the total marks.
2. Create a `Scanner` object `scanner` to read user input.
3. Use a loop to iterate from `i=0` to `i<6`:
- Prompt the user for marks of Subject `(i+1)`.
- Read the input marks and store them in `marks[i]`.
- Add `marks[i]` to `total`.
4. Calculate the average marks by dividing `total` by 6:
float avg = total / 6;
Step 2: Determine Grade
1. Use `if-else if-else` statements to determine the grade based on `avg`.
- If `avg` is >= 80, print "A".
- Else if `avg` is >= 60, print "B".
- Else if `avg` is >= 40, print "C".
- Else, print "D".
Step 3: Output the Result and Close Scanner
1. Print the student's grade calculated in Step 2.
2. Close the `Scanner` object using `scanner.close()` to release resources.
Step 4: End
1. End the algorithm.


PROGRAM:

import java.util.Scanner;
public class JavaExample
{
public static void main(String args[])
{
int marks[] = new int[6];
int i;
float total=0, avg;
Scanner scanner = new Scanner(System.in);
for(i=0; i<6; i++) {
System.out.print("Enter Marks of Subject"+(i+1)+":");
marks[i] = scanner.nextInt();
total = total + marks[i];
}
scanner.close();
//Calculating average here
avg = total/6;
System.out.print("The student Grade is: ");
if(avg>=80)
{
System.out.print("A");
}
else if(avg>=60 && avg<80)
{
System.out.print("B");
}
else if(avg>=40 && avg<60)
{
System.out.print("C");
}
else
{
System.out.print("D");
}
}
}


OUTPUT:
Enter Marks of Subject1:40
Enter Marks of Subject2:80
Enter Marks of Subject3:80
Enter Marks of Subject4:40
Enter Marks of Subject5:60
Enter Marks of Subject6:60
The student Grade is: B

RESULT:
Thus, the MapReduce program to find the grades of students has been
executed successfully.