Big Data and Hadoop Guide
Table of Contents
Chapter: 1 Important Definitions
Chapter: 2 MapReduce
Chapter: 3 HDFS
Chapter: 4 Pig
Chapter: 5 HBase Components
Chapter: 6 Cloudera
Chapter: 7 Sqoop
Chapter: 8 Hadoop Ecosystem
Chapter: 1 Important Definitions
Big Data
Big Data refers to data sets whose size makes it difficult for commonly used data-capturing software tools to interpret, manage, and process them within a reasonable time frame.
VMware Player
VMware Player is a free software package offered by VMware, Inc., which is used to create and run virtual machines on a desktop.
Hadoop Architecture
Hadoop is a master and slave architecture that includes the NameNode as the master and the DataNodes as the slaves.
HDFS
HDFS (Hadoop Distributed File System) is a distributed file system that shares many of the features of other distributed file systems. It is used for storing and retrieving unstructured data.
MapReduce
MapReduce is a core component of Hadoop and is responsible for processing jobs in distributed mode.
Apache Hadoop
One of the primary technologies ruling the field of Big Data is Apache Hadoop.
Ubuntu Server
Ubuntu is a leading open-source platform for scale-out. Ubuntu helps in utilizing the infrastructure at its optimum level, irrespective of whether users want to deploy a cloud, a web farm, or a Hadoop cluster.
Pig
Apache Pig is a platform for analyzing large datasets; it includes a high-level language for expressing data analysis programs. Pig is one of the components of the Hadoop ecosystem.
Hive
Hive is an open-source data warehousing system used to analyze large datasets stored in Hadoop files. It has three key functions: data summarization, query, and analysis.
SQL
SQL (Structured Query Language) is the standard language used to query and manage data in relational database systems.
Metastore
The Metastore is the component that stores the system catalog and metadata about tables, columns, partitions, and so on. The metadata is stored in a traditional RDBMS format.
Driver
The driver manages the lifecycle of a HiveQL statement as it moves through compilation, optimization, and execution.
Query compiler
The query compiler is one of the driver components. It is responsible for compiling the Hive script and checking it for errors.
Query optimizer
The query optimizer optimizes Hive scripts for faster execution. It consists of a chain of transformations.
Execution engine
The role of the execution engine is to execute the tasks produced by the compiler in proper dependency order.
Hive server
The Hive Server is the main component responsible for providing an interface to clients, such as the JDBC/ODBC driver, so that they can submit queries to Hive.
Client components
Developers use the client components to work with Hive. The client components include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.
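As a brief illustration of the JDBC client component, the sketch below connects to a HiveServer2 instance and runs a simple summarization query. It assumes the Hive JDBC driver is on the classpath; the host, port, credentials, and the employees table are placeholders for this example, not part of the guide.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of querying Hive through the JDBC client component.
public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver class shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 listens on port 10000 by default; host and
        // credentials below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement()) {

            // A simple summarization query: row counts per department
            // in a hypothetical employees table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}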
Apache HBase
HBase is a distributed, column-oriented database built on top of HDFS (Hadoop Distributed File System). It can scale horizontally to thousands of commodity servers and petabytes of indexed storage.
ZooKeeper
ZooKeeper is a centralized management service used for maintaining configuration information and providing distributed synchronization. In HBase, it is used for performing region assignment.
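As a small illustration of ZooKeeper as a centralized management service, the sketch below uses the ZooKeeper Java client to store a piece of shared configuration as a znode and read it back. The connection string, znode path, and configuration value are placeholders chosen for this example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// A client connects to the ZooKeeper ensemble, writes shared
// configuration as a znode, and reads it back.
public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // "localhost:2181" is a placeholder connection string;
        // 2181 is ZooKeeper's default client port.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create a persistent znode holding a configuration value.
        zk.create("/app-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node in the cluster can now read the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}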
Cloudera
Cloudera is a commercial tool for deploying Hadoop in an enterprise setup.
Sqoop
Sqoop is a tool that extracts data from non-Hadoop sources and formats it so that the data can later be used by Hadoop.
Chapter: 2 MapReduce
The MapReduce component of Hadoop is responsible for processing jobs in distributed mode. A key feature of MapReduce is the aggregation of output: the output of the map phase is aggregated by a user-defined reduce phase that runs after the map process.
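The classic word-count job, sketched below with the Hadoop MapReduce Java API, illustrates this aggregation: the mapper emits (word, 1) pairs and the user-defined reducer sums the counts for each word. The input and output paths are taken from the command line; this is a minimal sketch rather than a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs and the user-defined
// reduce phase aggregates the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // map output: (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();               // aggregate the map output
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}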
Chapter: 3 HDFS
HDFS, the Hadoop Distributed File System, is used for storing and retrieving unstructured data across the nodes of a Hadoop cluster.
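A minimal sketch of storing and retrieving a small file through the HDFS FileSystem Java API follows. The NameNode URI (hdfs://localhost:9000) and the file path are placeholders; in a real cluster the fs.defaultFS setting normally comes from core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Storing and retrieving a file in HDFS through the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set explicitly here only to keep the example self-contained.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt");

        // Store: write a small text file into HDFS.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Retrieve: read the same file back.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}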
Chapter: 4 Pig
This chapter compares Pig and SQL in terms of their definitions, differences, and examples.
Chapter: 5 HBase Components
The main HBase components are ZooKeeper, the HBase Master, and multiple RegionServers; each RegionServer manages its regions (for example, /hbase/region1 and /hbase/region2) using an in-memory Memstore, HFiles on disk, and a WAL (Write Ahead Log).
The RegionServers act as availability servers that maintain a part of the complete data, which is stored in HDFS according to the requirements of the user. They do this using the HFile and the WAL (Write Ahead Log) service. The RegionServers always stay in sync with the HBase Master, and it is ZooKeeper's responsibility to ensure that they remain in sync.
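The sketch below writes and reads a single cell through the HBase client Java API; the client locates the right region via ZooKeeper and the HBase Master, and the owning RegionServer records the write in its WAL and MemStore. The table name, column family, and ZooKeeper quorum are placeholders chosen for this example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writing and reading a single cell through the HBase client API.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        // Placeholder ZooKeeper quorum used to locate the cluster.
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Put a value into row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);

            // Get the same value back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}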
Chapter: 6 Cloudera
Cloudera is a commercial tool for deploying Hadoop in an enterprise setup.
Chapter: 7 Sqoop
Sqoop is an Apache Hadoop ecosystem project responsible for import and export operations between relational databases, such as MySQL, MSSQL, and Oracle, and HDFS. Following are the reasons for using Sqoop:
SQL servers are deployed worldwide and are the primary means of accepting data from users.
Nightly processing has been done on SQL servers for years.
It is essential to have a mechanism to move data from traditional SQL databases to Hadoop HDFS.
Transferring the data using hand-written automated scripts is inefficient and time-consuming.
Traditional databases support the reporting, data visualization, and other applications built in enterprises, but handling very large data requires an ecosystem such as Hadoop.
Sqoop also satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.
Chapter: 8 Hadoop Ecosystem
The Hadoop ecosystem comprises several components. The base of all the components is the Hadoop Distributed File System (HDFS). Above this component is YARN MapReduce v2, the framework component used for distributed processing in a Hadoop cluster.
The next component is Flume. Flume is used for collecting logs across a cluster. Sqoop is used for data exchange between a relational
database and Hadoop HDFS.
The ZooKeeper component is used for coordinating the nodes in a cluster. The next ecosystem component is Oozie. This component
is used for creating, executing, and modifying the workflow of a MapReduce job. The Pig component is used for performing scripting
for MapReduce applications.
The next component is Mahout, which is used for machine learning. R Connectors are used for generating statistics of the nodes in a cluster. Hive is used for interacting with Hadoop using SQL-like queries. The next component is HBase, which is used for slicing large data.
The last component is Ambari. This component is used for provisioning, managing, and monitoring Hadoop clusters.