Big Data and Hadoop Guide
Table of Contents
Chapter: 1 Important Definitions
Chapter: 2 MapReduce
Chapter: 3 HDFS
Chapter: 4 Pig
Chapter: 5 HBase Components
Chapter: 6 Cloudera
Chapter: 7 Sqoop
Chapter: 8 Hadoop Ecosystem
Chapter: 1 Important Definitions
Big Data
Big Data refers to data sets whose size makes it difficult for commonly used data-capturing software tools to interpret, manage, and process them within a reasonable time frame.
VMware Player
VMware Player is a free software package offered by VMware, Inc., which is used to create and run virtual machines on a desktop.
Hadoop Architecture
Hadoop is a master and slave architecture that includes the NameNode as the master and the DataNodes as the slaves.
HDFS
HDFS (Hadoop Distributed File System) is a distributed file system that shares many of the features of other distributed file systems. It is used for storing and retrieving unstructured data.
MapReduce
MapReduce is a core component of Hadoop and is responsible for processing jobs in distributed mode.
Apache Hadoop
One of the primary technologies ruling the field of Big Data is Apache Hadoop.
Ubuntu Server
Ubuntu is a leading open-source platform for scale-out. Ubuntu helps in utilizing the infrastructure at its optimum level, irrespective of whether users want to deploy a cloud, a web farm, or a Hadoop cluster.
Pig
Apache Pig is a platform for analyzing large datasets; it includes a high-level language for expressing data analysis programs. Pig is one of the components of the Hadoop ecosystem.
Hive
Hive is an open-source data warehousing system used to analyze large datasets stored in Hadoop files. It has three key functions: data summarization, query, and analysis.
SQL
SQL (Structured Query Language) is the standard language used to query and manage data in relational database systems.
Metastore
The Metastore is the component that stores the system catalog and metadata about tables, columns, partitions, and so on. The metadata is stored in a traditional RDBMS format.
Driver
The driver manages the lifecycle of a HiveQL statement as it moves through compilation, optimization, and execution.
Query compiler
The query compiler is one of the driver components. It is responsible for compiling the Hive script and checking it for errors.
Query optimizer
The query optimizer optimizes Hive scripts for faster execution. It consists of a chain of transformations.
Execution engine
The role of the execution engine is to execute the tasks produced by the compiler in proper dependency order.
Hive server
The Hive Server is the main component responsible for providing an interface to clients, such as the JDBC/ODBC driver, so that they can submit queries to Hive.
Client components
Developers use the client components to work with Hive. The client components include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.
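As a brief illustration of the JDBC client component, the sketch below connects to a HiveServer2 instance and runs a simple summarization query. It assumes the Hive JDBC driver is on the classpath; the host, port, credentials, and the employees table are placeholders for this example, not part of the guide.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of querying Hive through the JDBC client component.
public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver class shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 listens on port 10000 by default; host and
        // credentials below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement()) {

            // A simple summarization query: row counts per department
            // in a hypothetical employees table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}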
Apache HBase
HBase is a distributed, column-oriented database built on top of HDFS (Hadoop Distributed File System). It can scale horizontally to thousands of commodity servers and petabytes of indexed storage.
ZooKeeper
ZooKeeper is a centralized management service used for maintaining configuration information and providing distributed synchronization. In HBase, it is used for performing region assignment.
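As a small illustration of ZooKeeper as a centralized management service, the sketch below uses the ZooKeeper Java client to store a piece of shared configuration as a znode and read it back. The connection string, znode path, and configuration value are placeholders chosen for this example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// A client connects to the ZooKeeper ensemble, writes shared
// configuration as a znode, and reads it back.
public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // "localhost:2181" is a placeholder connection string;
        // 2181 is ZooKeeper's default client port.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create a persistent znode holding a configuration value.
        zk.create("/app-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node in the cluster can now read the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}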
Cloudera
Cloudera is a commercial tool for deploying Hadoop in an enterprise setup.
Sqoop
Sqoop is a tool that extracts data from non-Hadoop sources and formats it so that the data can later be used by Hadoop.
Chapter: 2 MapReduce
The MapReduce component of Hadoop is responsible for processing jobs in distributed mode. A key feature of MapReduce is the aggregation of output: the output of the map phase is aggregated by a user-defined reduce phase that runs after the map process.
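The classic word-count job, sketched below with the Hadoop MapReduce Java API, illustrates this aggregation: the mapper emits (word, 1) pairs and the user-defined reducer sums the counts for each word. The input and output paths are taken from the command line; this is a minimal sketch rather than a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs and the user-defined
// reduce phase aggregates the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // map output: (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();               // aggregate the map output
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}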
Chapter: 3 HDFS
HDFS, the Hadoop Distributed File System, is used for storing and retrieving unstructured data across the nodes of a Hadoop cluster.
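A minimal sketch of storing and retrieving a small file through the HDFS FileSystem Java API follows. The NameNode URI (hdfs://localhost:9000) and the file path are placeholders; in a real cluster the fs.defaultFS setting normally comes from core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Storing and retrieving a file in HDFS through the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set explicitly here only to keep the example self-contained.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt");

        // Store: write a small text file into HDFS.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Retrieve: read the same file back.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}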
Chapter: 4 Pig
This chapter compares Pig and SQL in terms of their definitions, differences, and examples.
Chapter: 5 HBase Components
The main HBase components are ZooKeeper, the HBase Master, and multiple RegionServers; each RegionServer manages its regions (for example, /hbase/region1 and /hbase/region2) using an in-memory Memstore, HFiles on disk, and a WAL (Write Ahead Log).
The RegionServers act as availability servers that maintain a part of the complete data, which is stored in HDFS according to the requirements of the user. They do this using the HFile and the WAL (Write Ahead Log) service. The RegionServers always stay in sync with the HBase Master, and it is ZooKeeper's responsibility to ensure that they remain in sync.
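The sketch below writes and reads a single cell through the HBase client Java API; the client locates the right region via ZooKeeper and the HBase Master, and the owning RegionServer records the write in its WAL and MemStore. The table name, column family, and ZooKeeper quorum are placeholders chosen for this example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writing and reading a single cell through the HBase client API.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        // Placeholder ZooKeeper quorum used to locate the cluster.
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Put a value into row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Get the same value back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}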
Chapter: 6 Cloudera
Cloudera is a commercial tool for deploying Hadoop in an enterprise setup.
Chapter: 7 Sqoop
Sqoop is an Apache Hadoop ecosystem project responsible for import and export operations between relational databases, such as MySQL, MSSQL, and Oracle, and HDFS. Following are the reasons for using Sqoop:
SQL servers are deployed worldwide and are the primary means of accepting data from users.
Nightly processing has been done on SQL servers for years.
It is essential to have a mechanism to move data from traditional SQL databases to Hadoop HDFS.
Transferring the data using hand-written automated scripts is inefficient and time-consuming.
Traditional databases support the reporting, data visualization, and other applications built in enterprises, but handling very large data requires an ecosystem such as Hadoop.
Sqoop also satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.
Chapter: 8 Hadoop Ecosystem
The Hadoop ecosystem comprises several components. The base of all the components is the Hadoop Distributed File System (HDFS). Above this component is YARN MapReduce v2, the framework component used for distributed processing in a Hadoop cluster.
The next component is Flume. Flume is used for collecting logs across a cluster. Sqoop is used for data exchange between a relational
database and Hadoop HDFS.
The ZooKeeper component is used for coordinating the nodes in a cluster. The next ecosystem component is Oozie. This component
is used for creating, executing, and modifying the workflow of a MapReduce job. The Pig component is used for performing scripting
for MapReduce applications.
The next component is Mahout, which is used for machine learning. R Connectors are used for generating statistics of the nodes in a cluster. Hive is used for interacting with Hadoop using SQL-like queries. The next component is HBase, which is used for slicing large data.
The last component is Ambari. This component is used for provisioning, managing, and monitoring Hadoop clusters.