
Assignment 10



Aim: Hadoop and HBase installation on single node.

Software required:
1. Ubuntu 18.04
2. Hadoop 3.0.0 and above
Theory:
Hadoop
Hadoop is an open source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware. All
the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual
machines or racks of machines) are common and thus should be automatically handled in software by the
framework.

Traditional Approach

In this approach, an enterprise has a computer to store and process big data. The data is stored
in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software is written to
interact with the database, process the required data, and present it to users for analysis.

Limitation

This approach works well where the volume of data is modest enough to be accommodated by standard
database servers, or within the limit of the processor handling the data. But when it comes to
dealing with huge amounts of data, processing through a traditional database server becomes a
tedious bottleneck.

Google’s Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into
small parts and assigns those parts to many computers connected over the network, and collects the
results to form the final result dataset.
The commodity hardware in such a cluster can range from single-CPU machines to servers with
higher capacity.

Hadoop

Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an Open Source
Project called HADOOP in 2005 and Doug named it after his son's toy elephant. Now Apache Hadoop is
a registered trademark of the Apache Software Foundation.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on
different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that
run on clusters of computers and perform complete statistical analysis of huge amounts of data.

Hadoop Architecture

The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These
libraries provide filesystem- and OS-level abstractions and contain the necessary Java files and
scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput
access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

MapReduce
Hadoop MapReduce is a software framework for easily writing applications that process large amounts
of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:

 The Map Task: This is the first task; it takes the input data and converts it into an
intermediate set of data, where individual elements are broken down into tuples (key/value pairs).
 The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.

Typically both the input and the output are stored in a file system. The framework takes care of
scheduling tasks, monitoring them, and re-executing any failed tasks.
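The map, shuffle, and reduce stages can be imitated with ordinary Unix pipes (a loose analogy only, not Hadoop itself): `tr` plays the map task, `sort` plays the shuffle, and `uniq -c` plays the reduce, counting words across lines of input:

```shell
# Word count in the MapReduce style, using shell pipes as an analogy:
RESULT=$(printf 'big data\nbig clusters\n' |
  tr -s ' ' '\n' |   # "map": emit one word (key) per line
  sort |             # "shuffle": bring identical keys together
  uniq -c)           # "reduce": count each group of identical keys
echo "$RESULT"
```

This prints a count of 2 for `big` and 1 each for `clusters` and `data`, which is exactly the per-key aggregation a reduce task performs after the shuffle groups identical keys.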

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource
consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them
and re-executing any failed tasks. The slave TaskTrackers execute the tasks as directed by the
master and periodically report task status back to the master.

The JobTracker is a single point of failure for the Hadoop MapReduce service: if the JobTracker
goes down, all running jobs are halted. (Note that this JobTracker/TaskTracker design is the
classic MRv1 architecture; in Hadoop 2 and later, YARN's ResourceManager and NodeManagers take
over these roles, though the division of work is analogous.)

Hadoop Distributed File System

Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS,
and others, but the most common file system used by Hadoop is the Hadoop Distributed File System
(HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system designed to run on large clusters (thousands of machines) of commodity
hardware in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture where master consists of a single NameNode that manages the file
system metadata and one or more slave DataNodes that store the actual data.
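As a back-of-the-envelope illustration of what the NameNode's metadata tracks: files are split into fixed-size blocks that the DataNodes store. The file size below is assumed for the example; 128 MB is the default `dfs.blocksize` in Hadoop 2.x/3.x and is configurable:

```shell
# Ceiling division: a 300 MB file at the default 128 MB block size needs
# 3 blocks (two full blocks plus one partial). Both numbers are
# illustrative; the block size is configurable via dfs.blocksize.
FILE_MB=300
BLOCK_MB=128
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "$BLOCKS blocks"   # → 3 blocks
```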

How Does Hadoop Work?

Stage 1

A user/application can submit a job to Hadoop (via the Hadoop job client) for processing by
specifying the following items:

1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
3. The job configuration by setting different parameters specific to the job.
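The three items above correspond to a single `hadoop jar` invocation. The jar name, driver class, and HDFS paths below are hypothetical placeholders, and the command is only printed, not executed, since actual submission requires a running cluster:

```shell
# Sketch of a job submission command (jar name, driver class, and HDFS
# paths are hypothetical placeholders, not defined in this assignment):
JAR=wordcount.jar               # item 2: jar containing the map/reduce classes
MAIN=WordCount                  # item 2: driver class inside the jar
INPUT=/user/student/input       # item 1: input directory in HDFS
OUTPUT=/user/student/output     # item 1: output directory in HDFS
# item 3: job configuration passed as -D key=value pairs (this form needs
# a driver that uses GenericOptionsParser):
CMD="~/hadoop/bin/hadoop jar $JAR $MAIN -D mapreduce.job.reduces=2 $INPUT $OUTPUT"
echo "$CMD"   # printed only; running it requires a live cluster
```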

Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the JobTracker which
then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks
and monitoring them, providing status and diagnostic information to the job-client.

Stage 3

The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the
output of the reduce function is stored in output files on the file system.

Advantages of Hadoop

 The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines, in turn
utilizing the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather
Hadoop library itself has been designed to detect and handle failures at the application layer.

a) Single Node:
Steps for Compilation & Execution
 sudo apt-get update
 sudo apt-get install openjdk-8-jre-headless
 sudo apt-get install openjdk-8-jdk
 sudo apt-get install ssh
 sudo apt-get install rsync

# Download hadoop from:
https://archive.apache.org/dist/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
 # copy and extract hadoop-3.0.0.tar.gz in the home folder
 # rename the extracted folder from hadoop-3.0.0 to hadoop
 readlink -f /usr/bin/javac
 gedit ~/hadoop/etc/hadoop/hadoop-env.sh
 # add following line in it
 # for 32 bit ubuntu
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
 # for 64 bit ubuntu
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
 # save and exit the file
 # to display the usage documentation for the hadoop script try next command
 ~/hadoop/bin/hadoop
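Rather than choosing the 32-bit or 64-bit line by hand, the JAVA_HOME value can be derived from the `readlink` output above. A small sketch, using the 64-bit path as the example input:

```shell
# The readlink command above resolves /usr/bin/javac to the real JDK path,
# e.g. /usr/lib/jvm/java-8-openjdk-amd64/bin/javac on 64-bit Ubuntu.
# JAVA_HOME is that path with the trailing /bin/javac stripped:
JAVAC_PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/javac   # example readlink output
JAVA_HOME_DIR=${JAVAC_PATH%/bin/javac}                   # strip the /bin/javac suffix
echo "export JAVA_HOME=$JAVA_HOME_DIR"                   # the line for hadoop-env.sh
```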
#Setup passphraseless/passwordless ssh
 ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 export HADOOP_PREFIX=/home/your_user_name/hadoop
 ssh localhost
# type exit in the terminal to close the ssh connection (very important)
 exit
# The following instructions are to run a MapReduce job locally.
 Format the filesystem (do it only once):
~/hadoop/bin/hdfs namenode -format

 Start NameNode daemon and DataNode daemon:


~/hadoop/sbin/start-dfs.sh

 Browse the web interface for the NameNode; by default it is available at:
http://localhost:9870/ (Hadoop 3.x; older Hadoop 2.x releases used http://localhost:50070/)
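Once the daemons are up, a quick smoke test confirms HDFS is writable (a sketch assuming the ~/hadoop layout used above; it skips itself when no local install is present):

```shell
# Smoke test for HDFS after start-dfs.sh: create a home directory, upload
# a file, and list it back. Assumes the ~/hadoop layout from the steps above.
HDFS=~/hadoop/bin/hdfs
if [ -x "$HDFS" ]; then
    "$HDFS" dfs -mkdir -p "/user/$USER"                              # create an HDFS home dir
    "$HDFS" dfs -put ~/hadoop/etc/hadoop/core-site.xml "/user/$USER" # upload a sample file
    "$HDFS" dfs -ls "/user/$USER"                                    # verify it landed
else
    echo "hdfs not found at $HDFS; complete the steps above first"
fi
```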

Conclusion: In this way, Hadoop was installed and configured on Ubuntu for Big Data processing.
Questions:

Q1) What are the various daemons in Hadoop and their roles in a Hadoop cluster?
Q2) What does the jps command do?
Q3) What is the difference between an RDBMS and Hadoop?
Q4) What is YARN? Explain its components.
Q5) Explain HDFS and its components.
