Lesson 6 NoSQL Databases HBase

The document discusses HBase, an open-source NoSQL database that provides big data storage and access across clusters of servers. It explains the architecture and components of HBase, including how it uses a master node and region servers to partition and store data across nodes in a Hadoop cluster. Key differences between HBase and relational databases are also outlined, such as HBase's use of dynamic schemas and horizontal scaling for high performance and scalability.

Big Data Hadoop and Spark Developer

NoSQL Databases: HBase


Learning Objectives

By the end of this lesson, you will be able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Distinguish HBase from RDBMS


NoSQL Introduction
NoSQL Database

NoSQL is a form of unstructured storage.

DB: Structured
NoSQL: Unstructured
Why NoSQL?

With the explosion of social media sites, such as Facebook and Twitter, the demand to manage
large volumes of data has grown tremendously.

• Key-Value Pair Databases
• Document Databases
• Column-Based Data Stores
Types of NoSQL

Key-Value
Example: Oracle NoSQL, Redis Server, Scalaris

Document-Based
Example: MongoDB, CouchDB, OrientDB, RavenDB

Column-Based
Example: BigTable, Cassandra, HBase, Hypertable

Graph-Based
Example: Neo4J, InfoGrid, Infinite Graph, FlockDB

[Figure: a graph records nodes and relationships; relationships organize nodes; both nodes and relationships have properties.]
RDBMS vs. NoSQL

The differences between RDBMS and NoSQL databases are as follows:

Feature       RDBMS     NoSQL Databases
Data Storage  Tabular   Variable
Schema        Fixed     Dynamic
Performance   Low       High
Scalability   Vertical  Horizontal
Reliability   Good      Poor


Assisted Practice

YARN Tuning
Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to tune YARN so that HBase runs
smoothly without being resource starved.

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
HBase Overview
What Is HBase?

HBase is a database management system designed in 2007 by Powerset, a company later acquired by Microsoft.

HBase rests on top of HDFS and enables real-time analysis of data.


It can store huge amounts of data in a tabular format for extremely fast reads and writes.

HBase is mostly used in scenarios that require regular and consistent inserting and overwriting of data.
Why HBase?

HDFS stores and manages large amounts of data efficiently.

However, it supports only batch processing, and the data is accessed sequentially.

Therefore, a solution is required to access, read, or write data at any time, regardless of its position in the
clusters of data.
Characteristics of HBase

HBase is a type of NoSQL database and is classified as a key-value store. In HBase:

• A value is identified with a key.
• Both the key and the value are ByteArrays.
• Values are stored in key order.
• Values are quickly accessed by their keys.

HBase is a database in which tables have no fixed schema. At the time of table creation, column families are
defined, not columns.
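
The characteristics above can be sketched with a small in-memory model. This is only an illustration in plain Python, not the HBase API: the class name SortedByteStore and its methods are hypothetical. It shows byte-array keys and values, storage in key order, and fast lookup and range scans by key.

```python
import bisect


class SortedByteStore:
    """Toy key-value store: keys and values are bytes, keys are kept sorted."""

    def __init__(self):
        self._keys = []    # sorted list of byte-string keys
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        # Direct lookup by key; returns None if the key is absent.
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedByteStore()
for k in (b"row3", b"row1", b"row2"):   # inserted out of order
    store.put(k, b"value-for-" + k)

print(store.get(b"row2"))
print([k for k, _ in store.scan(b"row1", b"row3")])
```

Because keys are compared as raw bytes, the order is lexicographic, which is why row key design matters for range scans.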
HBase: Real-Life Connect

Facebook's messaging platform needs to store over 135 billion messages every month.

The data falls into two groups: a rarely accessed dataset and a highly volatile dataset.

Where do they store such data?


HBase Architecture
HBase Architecture

HBase has two types of nodes: Master and RegionServer. Their characteristics are as follows:

Master
• A single Master node runs at a time
• Manages cluster operations
• Is not a part of the read or write path

RegionServer
• One or more RegionServers run at a time
• Hosts tables, performs reads, and buffers writes
• Clients communicate with RegionServers to read and write

A region in HBase is a subset of a table's rows. The Master node detects the status of RegionServers and
assigns regions to them.
HBase Components

The HBase components include HBase Master and multiple RegionServers.

ZooKeeper is used for coordination or monitoring.

The HBase Master assigns regions and handles load balancing.

[Figure: HBase cluster architecture. A ZooKeeper quorum of ZooKeeper peers works with the HMaster. Each RegionServer holds an HLog and one or more regions. Each region contains stores; each store has a MemStore and StoreFiles (HFiles). The HLogs and HFiles are kept in HDFS.]
Storage Model of HBase

The two major components of the storage model are as follows:

Partitioning:
• A table is horizontally partitioned into regions.
• Each region is managed by a RegionServer.
• A RegionServer may hold multiple regions.

Persistence and data availability:


• HBase stores its data in HDFS, does not replicate RegionServers,
and relies on HDFS replication for data availability.
• Updates are buffered in, and recent data is read from, the in-memory store called
MemStore.
Row Distribution of Data between RegionServers

The distribution of rows of structured data using HBase is illustrated here:

Logical view (all rows in a table, sorted by row key):
A1, A2, A22, A3, …, K4, …, O90, …, Z30, Z55

Regions (start key → end key):
• null → A3
• A3 → F34
• F34 → K80
• K80 → O95
• O95 → null

The regions are distributed across three RegionServers.
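
The mapping from a row key to its region can be sketched as a binary search over the region start keys. This is a minimal illustration in plain Python, not HBase code: the names REGION_START_KEYS, find_region, and region_range are hypothetical, and the boundaries are read from the figure on the assumption that "095" there stands for "O95".

```python
import bisect

# Region start keys taken from the figure; b"" stands for the open ("null") start.
REGION_START_KEYS = [b"", b"A3", b"F34", b"K80", b"O95"]


def find_region(row_key: bytes) -> int:
    """Return the index of the region whose [start, end) range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position contains the row.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def region_range(index: int):
    """Return (start, end) for a region; end is None for the last region."""
    start = REGION_START_KEYS[index]
    is_last = index + 1 == len(REGION_START_KEYS)
    end = None if is_last else REGION_START_KEYS[index + 1]
    return start, end


for key in (b"A1", b"A22", b"A3", b"K4", b"O90", b"Z55"):
    print(key, "->", region_range(find_region(key)))
```

Note that keys compare byte by byte, so A22 sorts before A3 and lands in the first region.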
Data Storage in HBase

Data is stored in files called HFiles or StoreFiles, which are usually saved in HDFS.

An HFile is a key-value map.

When data is added, it is written to a log called the Write Ahead Log (WAL), and it is also stored in memory, in the MemStore.

HFiles are immutable, since HDFS does not support updates to an existing file.

HBase periodically performs data compactions to control the number of HFiles and to keep the cluster well-balanced.
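
The write path described above can be sketched as a toy model. This is a simplified illustration in plain Python, not how HBase is implemented: the class ToyStore and the flush_threshold parameter are hypothetical, and real HBase also trims the log after a flush, which this sketch omits.

```python
class ToyStore:
    """Toy model of one store: write-ahead log, MemStore, immutable HFiles."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log of (key, value) edits
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable, sorted tuples of (key, value) pairs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log the edit first
        self.memstore[key] = value      # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new immutable, sorted file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:              # newest data is in memory
            return self.memstore[key]
        for hfile in reversed(self.hfiles):   # then newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all files into one, keeping the newest value per key."""
        merged = {}
        for hfile in self.hfiles:   # oldest to newest, so later values win
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


store = ToyStore(flush_threshold=2)
store.put(b"row1", b"v1")
store.put(b"row2", b"v2")        # second write triggers a flush
store.put(b"row1", b"v1-new")
store.put(b"row3", b"v3")        # triggers another flush
files_before = len(store.hfiles)
store.compact()
print(files_before, len(store.hfiles), store.get(b"row1"))
```

An overwrite never edits an existing file; the newer value simply shadows the older one until compaction discards it.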
Data Model
Data Model

Following are the features of the data model in HBase:

• One column family can have any number of columns.
• Cells are multi-versioned.
• Cells within a column family are sorted physically.
• The table is very sparse, as most cells have NULL values.
• Everything except table names is stored as ByteArrays.

[Figure: each row key maps to columns named family:qualifier, such as CF1:C1, CF1:C2, CF1:C3, CF2:C1, and CF1:C8; different rows can populate different columns.]


Data Model: Features

Each cell is addressed by the following hierarchy:

Row Key
  Column family 1: qualifier1, qualifier2
  Column family 2: qualifier1, qualifier2, qualifier3
    Each qualifier holds one or more timestamped values (for example, Timestamp 1 → value1).
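
The hierarchy of row key, column family, qualifier, timestamp, and value can be sketched with nested dictionaries. This is an illustrative model in plain Python, not the HBase client API: ToyTable and its method signatures are hypothetical. It shows that column families are fixed at creation, that qualifiers are free-form, and that a cell keeps several timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """Toy table: row key -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self, column_families):
        # Only the column families are declared up front, not the columns.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # A new timestamp adds a version instead of replacing the old one.
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None   # sparse: a missing cell simply has no entry
        if timestamp is None:
            timestamp = max(versions)   # default to the newest version
        return versions.get(timestamp)


table = ToyTable(["cf1", "cf2"])
table.put(b"row1", "cf1:qualifier1", b"value1", timestamp=1)
table.put(b"row1", "cf1:qualifier1", b"value2", timestamp=2)
table.put(b"row1", "cf2:anything", b"value3", timestamp=1)

print(table.get(b"row1", "cf1:qualifier1"))
print(table.get(b"row1", "cf1:qualifier1", timestamp=1))
print(table.get(b"row2", "cf1:qualifier1"))
```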


When to Use HBase?

Use HBase:

• When there is enough data, in millions or billions of rows
• In a variable schema
• For random selects and range scans by key
• When there is sufficient commodity hardware, with at least five nodes
HBase vs. RDBMS

The table shows a comparison between HBase and a Relational Database Management System (RDBMS):

HBase                                                             RDBMS
Automatic partitioning                                            Usually manual and admin-driven partitioning
Scales linearly and automatically with new nodes                  Usually scales vertically by adding more hardware resources
Uses commodity hardware                                           Relies on expensive servers
Has fault tolerance                                               Fault tolerance may or may not be present
Leverages batch processing with MapReduce distributed processing  Relies on multiple threads or processes rather than MapReduce distributed processing
Connecting to HBase
Connecting to HBase

HBase can be accessed through the following interfaces:

• MapReduce
• Hive, Pig, HCatalog, and Hue
• REST/Thrift gateway
• Java applications

[Figure: these clients sit on top of the Java API, followed in the stack by ZooKeeper, HBase, and HDFS.]
HBase Shell Commands

Common commands include, but are not limited to, the following:

Create a table. Pass the table name, a dictionary of specifications per column family, and optionally a
dictionary of table configuration:
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'

Describe the table named t1:
hbase> describe 't1'

Start the disabling of the table named t1:
hbase> disable 't1'

Drop the table named t1. The table must first be disabled:
hbase> drop 't1'

List all tables in HBase. An optional regular expression parameter can be used to filter the output:
hbase> list
Commands for working with data:

• Put: Putting a cell value
• Get: Getting the contents of a row or a cell
• Scan: Scanning a table's values
• Count: Counting the number of rows in a table
• Delete: Deleting a cell value
Unassisted Practice

HBase Shell
Duration: 15 mins

Problem Statement: Create a sample HBase table on the cluster, enter some data, query the table, then
clean up the data and exit.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Steps to Perform
• HBase Shell

// Start the HBase shell:
hbase shell

// Create a table called simplilearn with one column family named stats:
create 'simplilearn', 'stats'

// Verify the table creation by listing everything:
list

// Add a test value to the daily column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:daily', 'test-daily-value'

// Add a test value to the weekly column in the stats column family for row 1:
put 'simplilearn', 'row1', 'stats:weekly', 'test-weekly-value'

// Add a test value to the weekly column in the stats column family for row 2:
put 'simplilearn', 'row2', 'stats:weekly', 'test-weekly-value'

// Display the contents of the table:
scan 'simplilearn'

// Display the contents of row 1:
get 'simplilearn', 'row1'

// Disable the table:
disable 'simplilearn'

// Drop the table and delete all data:
drop 'simplilearn'

// Exit the HBase shell:
exit
NoSQL Graph Database
NoSQL Graph Database

A graph database is designed to treat the relationships between data as equally important to the data itself.

It is intended to hold data without constricting it to a predefined model.

It focuses on the relationships between entities and is able to infer new knowledge
out of existing information.
Why Graph Databases?

Accessing nodes and relationships in a native graph database is an efficient, constant-time operation
that allows you to quickly traverse millions of connections per second per core.

Independent of the total size of your dataset, graph databases excel at managing
highly connected data and complex queries.
Property Graph Model

Nodes
Nodes are the entities in the graph. Nodes can be tagged with labels, representing their
different roles in your domain.

Relationships
Relationships provide directed, named, semantically relevant connections between two node entities.
A relationship always has a direction, a type, a start node, and an end node.
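
The property graph model described above can be sketched with two small data classes. This is an illustration in plain Python, not the API of any graph database: the Node and Relationship classes, the outgoing helper, and the sample people and companies are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, tagged with labels and carrying properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from a start node to an end node."""
    type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def outgoing(node, relationships):
    """Return the relationships that start at the given node."""
    return [r for r in relationships if r.start is node]


# Made-up sample data for illustration only.
alice = Node({"Person"}, {"name": "Alice"})
acme = Node({"Company"}, {"name": "Acme"})
works_at = Relationship("WORKS_AT", start=alice, end=acme, properties={"since": 2020})

relationships = [works_at]
print([r.type for r in outgoing(alice, relationships)])
print([r.type for r in outgoing(acme, relationships)])
```

Direction matters here: the relationship is found when traversing from the start node, but not from the end node.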
Assisted Practice

NoSQL Graph Database
Duration: 15 mins

Problem Statement: In this demonstration, you will learn how to create a NoSQL graph database.

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Key Takeaways

You are now able to:

Understand the need for NoSQL databases

Analyze the HBase architecture and components

Differentiate HBase from RDBMS


Knowledge Check
Knowledge Check 1: Which of the following are the nodes of HBase?

a. Spooldir and Master

b. Syslog and RegionServer

c. Master and RegionServer

d. None of the above

The correct answer is c.

Master and RegionServer are the nodes of HBase, whereas the other options are parts of Flume.
Knowledge Check 2: In which of the following scenarios can we use HBase?

a. For random selects and range scans by key

b. For sufficient commodity hardware with at least five nodes

c. In variable schema

d. All of the above

The correct answer is d.

HBase can be used for random selects and range scans by key, for sufficient commodity hardware with at least five
nodes, and in variable schema.
Lesson-End-Project
Problem Statement:

Global Transport Private Limited works in transport analytics and is keen to ensure the
safety of people. As the population increases, accidents are becoming
more and more frequent. Accidents occur mostly when the route is long, the driver is drunk,
or the roads are damaged. The company collects data on all accidents and provides
important insights that can reduce the number of accidents. The company wants to create a
public portal where anyone can see the aggregated accident data.

Your task is to suggest a suitable database and design a schema that covers most of the
use cases.

You are given a file that contains details about various parameters of accidents.
The column details are as follows:
1. Year
2. TYPE
3. 0-3 hrs. (Night)
4. 3-6 hrs. (Night)
5. 6-9 hrs (Day)
6. 9-12 hrs (Day)
7. 12-15 hrs (Day)
8. 15-18 hrs (Day)
9. 18-21 hrs (Night)
10. 21-24 hrs (Night)
11. Total
You have to save the given data in HBase in such a way that you can answer the queries below.
State what you select as the row key and why.

1. Get the total number of accidents when you are given


a. Year
b. Type of Accident
c. Time Duration

2. Get the total number of accidents when you are given


a. Year
b. Type of Accident

3. Get the total number of accidents in a given year
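
One possible way to think about the row key is sketched below. This is only one candidate design modeled in plain Python, not an official solution and not HBase code: the composite key format, the "counts" column family, the helper names, and all sample numbers are assumptions made for illustration.

```python
def make_row_key(year: int, accident_type: str) -> bytes:
    """Composite row key: zero-padded year, a separator, then the type."""
    return f"{year:04d}|{accident_type}".encode("utf-8")


# Made-up sample rows, NOT taken from the real accident file.
# Column names use a hypothetical "counts" column family.
table = {
    make_row_key(2013, "Road"): {"counts:0-3": 4, "counts:3-6": 6, "counts:total": 10},
    make_row_key(2014, "Rail"): {"counts:0-3": 1, "counts:3-6": 2, "counts:total": 3},
    make_row_key(2014, "Road"): {"counts:0-3": 5, "counts:3-6": 7, "counts:total": 12},
}


def accidents(year, accident_type, column="counts:total"):
    """Queries 1 and 2: one lookup by row key, then pick a column."""
    return table[make_row_key(year, accident_type)][column]


def accidents_in_year(year):
    """Query 3: prefix scan over all row keys that start with the year."""
    prefix = f"{year:04d}|".encode("utf-8")
    return sum(
        row["counts:total"]
        for key, row in sorted(table.items())
        if key.startswith(prefix)
    )


print(accidents(2014, "Road", "counts:3-6"))   # year + type + time duration
print(accidents(2014, "Road"))                 # year + type
print(accidents_in_year(2014))                 # year only
```

Putting the year first keeps all rows for one year adjacent in key order, so the year-only query becomes a short range scan rather than a full table scan.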


Thank You
