Lesson 6 NoSQL Databases HBase

The document discusses HBase, an open-source NoSQL database that provides big data storage and access across clusters of servers. It explains the architecture and components of HBase, including how it uses a master node and region servers to partition and store data across nodes in a Hadoop cluster. Key differences between HBase and relational databases are also outlined, such as HBase's use of dynamic schemas and horizontal scaling for high performance and scalability.

Uploaded by

Keerthi Uma Mahesh

Big Data Hadoop and Spark Developer

NoSQL Databases: HBase

Learning Objectives

By the end of this lesson, you will be able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Distinguish HBase from RDBMS

NoSQL Introduction
NoSQL Database

NoSQL is a form of unstructured storage. A conventional database (DB) stores structured data, whereas a NoSQL database stores unstructured data.
Why NoSQL?

With the explosion of social media sites, such as Facebook and Twitter, the demand to manage large data has grown tremendously.

NoSQL options for such data include key-value pair databases, document databases, and column-based data stores.
Types of NoSQL

Type            Examples
Key-Value       Oracle NoSQL, Redis Server, Scalaris
Document-Based  MongoDB, CouchDB, OrientDB, RavenDB
Column-Based    BigTable, Cassandra, HBase, Hypertable
Graph-Based     Neo4J, InfoGrid, Infinite Graph, FlockDB

[Figure: a graph records nodes and relationships; relationships organize nodes; both nodes and relationships have properties.]
RDBMS vs. NoSQL

The differences between RDBMS and NoSQL databases are as follows:

Feature       RDBMS     NoSQL Databases
Data Storage  Tabular   Variable
Schema        Fixed     Dynamic
Performance   Low       High
Scalability   Vertical  Horizontal
Reliability   Good      Poor

Assisted Practice

YARN Tuning Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to tune YARN so that HBase runs smoothly without being starved of resources.

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
HBase Overview
What Is HBase?

HBase is a database management system designed in 2007 by Powerset, a company that Microsoft later acquired.

HBase rests on top of HDFS and enables real-time analysis of data.

It can store huge amounts of data in tabular format for extremely fast reads and writes.

HBase is mostly used in scenarios that require regular and consistent inserting and overwriting of data.
Why HBase?

HDFS stores, processes, and manages large amounts of data efficiently.

However, it supports only batch processing, and the data is accessed sequentially.

Therefore, a solution is required to access, read, or write data at any time, regardless of its sequence in the clusters of data.
Characteristics of HBase

HBase is a type of NoSQL database and is classified as a key-value store. In HBase:

• A value is identified with a key.
• Both key and value are ByteArrays.
• Values are stored in key order.
• Values can be quickly accessed by their keys.

HBase is a database in which tables have no schema. At the time of table creation, column families are defined, not columns.
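
The key-value characteristics listed above can be illustrated with a small sketch. This is not HBase code: `SortedByteStore` and its methods are hypothetical names, and the sketch only assumes that keys and values are byte strings kept in sorted key order, which is what makes both point lookups and range scans cheap.

```python
import bisect


class SortedByteStore:
    """Toy key-value store: keys and values are bytes, keys kept sorted."""

    def __init__(self):
        self._keys = []    # sorted list of byte-string keys
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        # Point lookup by key.
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedByteStore()
for k in (b"row3", b"row1", b"row2"):   # inserted out of order on purpose
    store.put(k, b"value-" + k)

scan_result = list(store.scan(b"row1", b"row3"))
```

Because the keys are sorted, the scan returns `row1` and `row2` in order even though they were inserted out of order.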
HBase: Real-Life Connect

Facebook's messenger platform needs to store over 135 billion messages every month.

This data falls into two kinds: a rarely accessed dataset and a highly volatile dataset.

Where do they store such data?

HBase Architecture
HBase Architecture

HBase has two types of nodes: Master and RegionServer. Their characteristics are as follows:

Master
• A single Master node runs at a time.
• Manages cluster operations.
• Is not a part of the read or write path.

RegionServer
• One or more RegionServers run at a time.
• Hosts tables, performs reads, and buffers writes.
• Clients communicate with a RegionServer in order to read and write.

A region in HBase is a subset of a table's rows. The Master node detects the status of the RegionServers and assigns regions to them.
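
As a rough illustration of the Master's role, the sketch below assigns regions only to the RegionServers that are reported alive. The round-robin policy, the function name `assign_regions`, and the server and region names are all assumptions made for this example; the actual HBase balancer is more sophisticated.

```python
from itertools import cycle


def assign_regions(regions, server_status):
    """Round-robin the regions over the servers currently reported alive."""
    live = sorted(name for name, alive in server_status.items() if alive)
    if not live:
        raise RuntimeError("no live RegionServers")
    assignment = {name: [] for name in live}
    for region, server in zip(regions, cycle(live)):
        assignment[server].append(region)
    return assignment


# Hypothetical cluster state: rs2 is down, so it receives no regions.
regions = ["region-1", "region-2", "region-3", "region-4", "region-5"]
status = {"rs1": True, "rs2": False, "rs3": True}
assignment = assign_regions(regions, status)
```

Note that the Master only decides placement here; reads and writes would still go to the RegionServers, which matches the point that the Master is not on the read or write path.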
HBase Components

The HBase components include HBase Master and multiple RegionServers.

ZooKeeper is used for coordination or monitoring. The HBase Master assigns regions and handles load balancing.

[Figure: HBase Cluster Architecture. A ZooKeeper quorum of ZooKeeper peers sits alongside the HMaster and the RegionServers. Each RegionServer has an HLog and hosts regions; each region contains stores; each store has a MemStore and StoreFiles (HFiles). All files are kept in HDFS.]
Storage Model of HBase

The two major components of the storage model are as follows:

Partitioning:
• A table is horizontally partitioned into regions.
• Each region is managed by a RegionServer.
• A RegionServer may hold multiple regions.

Persistence and data availability:

• HBase stores its data in HDFS, does not replicate RegionServers,
and relies on HDFS replication for data availability.
• Updates and reads are served from the in-memory cache called
MemStore.
Row Distribution of Data between RegionServers

The distribution of rows of structured data using HBase is illustrated here:

[Figure: the logical view of all rows in a table (A1, A2, A22, A3, ..., K4, ..., O90, ..., Z30, Z55) is split by row-key range into five regions, which are hosted across three RegionServers.]

Region: null → A3
Region: A3 → F34
Region: F34 → K80
Region: K80 → O95
Region: O95 → null
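
The region boundaries in the figure suggest how a row key can be mapped to its region with a binary search over the region start keys. The sketch below is an illustration only: the placement of regions on `rs1`, `rs2`, and `rs3` is hypothetical, and plain string comparison stands in for HBase's byte-wise ordering (the two agree for ASCII keys).

```python
import bisect

# Region start keys from the figure; "" stands for the open (null) lower bound.
REGION_START_KEYS = ["", "A3", "F34", "K80", "O95"]

# Hypothetical placement of the five regions on three RegionServers.
REGION_TO_SERVER = ["rs1", "rs1", "rs2", "rs2", "rs3"]


def locate(row_key: str):
    """Return (region index, server) for the region whose range holds row_key."""
    # The owning region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_TO_SERVER[idx]


locations = {key: locate(key) for key in ("A1", "A22", "A3", "K4", "O90", "Z55")}
```

Keys are compared character by character, so `A22` sorts before `A3` and lands in the first region, just as in the figure.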
Data Storage in HBase

Data is stored in files called HFiles or StoreFiles that are usually saved in HDFS.

HFile is a key-value map.

When data is added, it is written to a log called the Write Ahead Log, and it is stored in memory, in the MemStore.

HFiles are immutable, since HDFS does not support updates to an existing file.

HBase periodically performs data compactions to control the number of HFiles and to keep the cluster well-balanced.
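
The write path described above can be sketched as a toy model. Everything here is a simplification under stated assumptions: `ToyStore`, the flush threshold, and the in-memory lists standing in for the Write Ahead Log and HFiles are hypothetical, and real HBase also truncates the log, handles deletes, and keeps multiple versions.

```python
class ToyStore:
    """Toy model: log each write, buffer it in memory, flush to immutable files."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # stand-in for the Write Ahead Log (append-only)
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # each "HFile" is an immutable, sorted tuple of pairs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # the log is written first
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a new immutable file; never modify old files.
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):   # newest file wins
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all files into one, keeping the newest value for each key.
        merged = {}
        for hfile in self.hfiles:             # oldest to newest
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


store = ToyStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)    # second write triggers a flush
store.put("a", 10)   # overwrite goes to a new file, not the old one
store.put("c", 3)    # triggers another flush
files_before = len(store.hfiles)
store.compact()
```

The overwrite of `a` never touches the first file; the older value simply disappears when compaction merges the files.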
Data Model
Data Model

Following are the features of the data model in HBase:

• One column family can have any number of columns.
• Cells are multi-versioned.
• Cells within a column family are sorted physically.
• The table is very sparse, as most cells have NULL values.
• Everything except table names is stored as ByteArrays.

[Figure: each row is identified by a rowkey and holds cells such as CF1:C1, CF1:C2, CF1:C3, CF2:C1, and CF1:C8, each kept in multiple versions.]

Data Model: Features

[Figure: a row key maps to Column family 1 (qualifier1, qualifier2) and Column family 2 (qualifier1, qualifier2, qualifier3). Each qualifier holds values indexed by timestamp, for example Timestamp 1 → value1.]
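
One way to picture this data model is as a nested map: row key, then column (family and qualifier), then timestamp, then value. The sketch below assumes that reading; `ToyTable` and its methods are hypothetical names, and plain strings are used instead of ByteArrays to keep the example short.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed at table creation; qualifiers are not.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column][ts] = value   # a new timestamp is a new version

    def get(self, row, column, ts=None):
        """Return the newest version, or the newest at or before ts."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None


table = ToyTable(["cf1", "cf2"])
table.put("row1", "cf1:qualifier1", "value1", ts=1)
table.put("row1", "cf1:qualifier1", "value2", ts=2)   # second version, same cell
table.put("row1", "cf2:qualifier3", "value3", ts=1)

latest = table.get("row1", "cf1:qualifier1")
older = table.get("row1", "cf1:qualifier1", ts=1)
empty = table.get("row1", "cf2:qualifier1")           # never written
```

Cells that were never written take up no space in the nested map, which is one way to read the statement that the table is sparse.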

When to Use HBase?

Use HBase in the following scenarios:

• With a variable schema
• With enough data, in millions or billions of rows
• For random selects and range scans by key
• With sufficient commodity hardware, with at least five nodes
HBase vs. RDBMS

The table shows a comparison between HBase and a Relational Database Management System (RDBMS):

HBase                                                             RDBMS
Automatic partitioning                                            Usually manual and admin-driven partitioning
Scales linearly and automatically with new nodes                  Usually scales vertically by adding more hardware resources
Uses commodity hardware                                           Relies on expensive servers
Has fault tolerance                                               Fault tolerance may or may not be present
Leverages batch processing with MapReduce distributed processing  Relies on multiple threads or processes rather than MapReduce distributed processing
Connecting to HBase
Connecting to HBase

Clients can connect to HBase through the following interfaces:

• MapReduce
• Hive/Pig/HCatalog/Hue
• REST/Thrift Gateway
• Java Application

[Figure: layers from top to bottom: the clients listed above, the Java API, ZooKeeper, HBase, and HDFS.]
HBase Shell Commands

Common commands include, but are not limited to, the following:

Create a table. Pass the table name, a dictionary of specifications per column family, and, optionally, a dictionary of table configuration:
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'

Describe the named table:
hbase> describe 't1'

Start disabling the named table:
hbase> disable 't1'

Drop the named table. The table must first be disabled:
hbase> drop 't1'

List all tables in HBase. An optional regular expression parameter can be used to filter the output:
hbase> list
HBase Shell Commands

Command  Purpose
Delete   Deleting a cell value
Put      Putting a cell value
Count    Counting the number of rows in a table
Get      Getting the contents of a row or a cell
Scan     Scanning a table's values
Unassisted Practice

HBase Shell Duration: 15 mins

Problem Statement: Create a sample HBase table on the cluster, enter some data, query the table, then clean up the data and exit.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
Unassisted Practice

Steps to Perform
• HBase Shell

# Start the HBase shell (run this from the operating system command line)
hbase shell

# Create a table called simplilearn with one column family named stats
create 'simplilearn', 'stats'

# Verify the table creation by listing all tables
list

# Add a test value to the daily column in the stats column family for row 1
put 'simplilearn', 'row1', 'stats:daily', 'test-daily-value'

# Add a test value to the weekly column in the stats column family for row 1
put 'simplilearn', 'row1', 'stats:weekly', 'test-weekly-value'

# Add a test value to the weekly column in the stats column family for row 2
put 'simplilearn', 'row2', 'stats:weekly', 'test-weekly-value'

# Display the contents of the table
scan 'simplilearn'

# Display the contents of row 1
get 'simplilearn', 'row1'

# Disable the table
disable 'simplilearn'

# Drop the table and delete all data
drop 'simplilearn'

# Exit the HBase shell
exit
NoSQL Graph Database
NoSQL Graph Database

A graph database is designed to treat the relationships between data as equally important to the data itself.

It is intended to hold data without constricting it to a predefined model.

It focuses on the relationships between entities and is able to infer new knowledge out of existing information.
Why Graph Databases?

Accessing nodes and relationships in a native graph database is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second per core.

Independent of the total size of your dataset, graph databases excel at managing highly connected data and complex queries.
Property Graph Model

Nodes
Nodes are the entities in the graph. Nodes can be tagged with labels, representing their different roles in your domain.

Relationships
Relationships provide directed, named, semantically relevant connections between two node entities. A relationship always has a direction, a type, a start node, and an end node.
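
The property graph model can be sketched with two small data classes. This is not the API of any particular graph database: `Node`, `Relationship`, `outgoing`, and the sample people and company are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, tagged with labels and carrying properties."""
    node_id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from a start node to an end node."""
    rel_type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def outgoing(node, relationships, rel_type=None):
    """Return the end nodes of relationships that start at the given node."""
    return [
        r.end
        for r in relationships
        if r.start is node and (rel_type is None or r.rel_type == rel_type)
    ]


alice = Node(1, {"Person"}, {"name": "Alice"})
bob = Node(2, {"Person"}, {"name": "Bob"})
acme = Node(3, {"Company"}, {"name": "Acme"})

rels = [
    Relationship("KNOWS", alice, bob, {"since": 2020}),
    Relationship("WORKS_AT", alice, acme),
]

alice_neighbours = [n.properties["name"] for n in outgoing(alice, rels)]
```

Direction matters in this model: the `KNOWS` relationship starts at Alice, so it shows up among her outgoing relationships but not among Bob's.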
Assisted Practice

NoSQL Graph Database Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to create a NoSQL graph database.

You are now able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Differentiate HBase from RDBMS

Knowledge Check
Knowledge Check 1
Which of the following are the nodes of HBase?

a. Spooldir and Master

b. Syslog and RegionServer

c. Master and RegionServer

d. None of the above

The correct answer is c.

Master and RegionServer are the nodes of HBase, whereas the other options are parts of Flume.
Knowledge Check 2
In which of the following scenarios can we use HBase?

a. For random selects and range scans by key

b. For sufficient commodity hardware with at least five nodes

c. In variable schema

d. All of the above

The correct answer is d.

HBase can be used for random selects and range scans by key, for sufficient commodity hardware with at least five nodes, and in variable schema.
Lesson-End Project
Problem Statement:

Global Transport Private Limited works in transport analytics and is keen to ensure the safety of people. As the population increases, accidents are becoming more and more frequent. Accidents occur mostly when the route is long, the driver is drunk, or the roads are damaged. The company collects data on all accidents and provides important insights that can reduce the number of accidents. The company wants to create a public portal where anyone can see aggregated accident data.

Your task is to suggest a suitable database and design a schema that can cover most of the use cases.

You are given a file that contains details about the various parameters of accidents.
The column details are as follows:
1. Year
2. TYPE
3. 0-3 hrs. (Night)
4. 3-6 hrs. (Night)
5. 6-9 hrs (Day)
6. 9-12 hrs (Day)
7. 12-15 hrs (Day)
8. 15-18 hrs (Day)
9. 18-21 hrs (Night)
10. 21-24 hrs (Night)
11. Total

You have to save the given data in HBase in such a way that you can answer the queries below.
State what you are selecting as the row key and why.

1. Get the total number of accidents when you are given

a. Year
b. Type of Accident
c. Time Duration

2. Get the total number of accidents when you are given

a. Year
b. Type of Accident

3. Get the total number of accidents in a given year
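
One possible design, offered only as a sketch, is a composite row key of the form year|type|time slot. The first query then becomes a point lookup, while the second and third become prefix scans over sorted keys. The class `AccidentStore`, the separator, the accident types, and all the counts below are made-up illustrations, not values from the given file.

```python
import bisect

SEP = "|"   # assumed separator; it must not occur inside any key part


def row_key(year, accident_type, slot):
    """Build a composite row key: year|type|time slot."""
    return SEP.join([str(year), accident_type, slot])


class AccidentStore:
    """Toy sorted store keyed by the composite row key."""

    def __init__(self):
        self._keys = []     # sorted row keys
        self._counts = {}   # row key -> number of accidents

    def put(self, year, accident_type, slot, count):
        key = row_key(year, accident_type, slot)
        if key not in self._counts:
            bisect.insort(self._keys, key)
        self._counts[key] = count

    def total(self, year, accident_type=None, slot=None):
        parts = [str(year)]
        if accident_type is not None:
            parts.append(accident_type)
            if slot is not None:
                parts.append(slot)
        if len(parts) == 3:
            # Query 1: full key known, so this is a point lookup.
            return self._counts.get(SEP.join(parts), 0)
        # Queries 2 and 3: prefix scan over the contiguous run of matching keys.
        prefix = SEP.join(parts) + SEP
        start = bisect.bisect_left(self._keys, prefix)
        total = 0
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            total += self._counts[key]
        return total


# Made-up sample values, for illustration only.
accidents = AccidentStore()
accidents.put(2014, "Road", "0-3", 10)
accidents.put(2014, "Road", "3-6", 20)
accidents.put(2014, "Rail", "0-3", 5)
accidents.put(2015, "Road", "0-3", 7)
```

Putting the year first keeps each year's rows together, so all three queries touch one contiguous key range. The trade-off is that a query by accident type alone, across all years, would need a full scan or a second table.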

Thank You
