HBase in Practice
Lars George, Partner and Co-Founder @ OpenCore
DataWorks Summit 2017 - Munich
NoSQL is no SQL is SQL?
About Me
• Partner & Co-Founder at OpenCore
• Before that
• EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer & Apache Member
• HBase (also in PMC)
• O'Reilly author of "HBase: The Definitive Guide"
• Contact
• lars.george@opencore.com
• @larsgeorge
Website: www.opencore.com
• Brief Intro To Core Concepts
• Access Options
• Data Modelling
• Performance Tuning
• Use-Cases
• Summary
Introduction To Core Concepts
HBase Tables
• From the user's perspective, HBase is similar to a database or a spreadsheet
• There are rows and columns, storing values
• By default, asking for a specific row/column combination returns the current value (that is, the last value stored there)
HBase Tables
• HBase can have a different schema per row
• Could be called
• Primary access by the user-given row key and column
• Sorting of rows and columns by their key (aka names)
HBase Tables
• Each row/column coordinate is tagged with a version number, allowing multi-versioned values
• Version is usually the current time (as epoch)
• API lets user ask for versions (specific, by count, or by ranges)
• Up to 2B versions
HBase Tables
• Table data is cut into pieces to distribute over the cluster
• Regions split table into shards at size boundaries
• Families split within regions to group sets of columns
• At least one of each is needed
Scalability - Regions as Shards
• A region is served by exactly one region server
• Every region server serves many regions
• Table data is spread over servers
• Distribution of I/O
• Assignment is based on configurable logic
• Balancing cluster load
• Clients talk directly to region servers
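The region-to-server mapping above can be sketched as a lookup over sorted region start keys. This is only an illustration: the split points and server names below are made up, and a real client resolves locations through the hbase:meta catalog table rather than a hard-coded list.

```python
import bisect

# Hypothetical table split into four regions. Each region covers
# [start_key, next_start_key); the first region starts at the empty key.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
# Hypothetical assignment: one server per region, a server may hold several.
REGION_SERVERS = ["rs1.example.com", "rs2.example.com",
                  "rs1.example.com", "rs3.example.com"]


def locate_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: bytes) -> str:
    """Return the (hypothetical) server currently serving row_key."""
    return REGION_SERVERS[locate_region(row_key)]
```

Because every row key falls into exactly one range, a client that has cached this mapping can send each request straight to the responsible server.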
Column Family-Oriented
• Group multiple columns into physically separated locations
• Apply different properties to each
• TTL, compression, versions, ...
• Useful to separate distinct data sets that are related
• Also useful to separate larger blobs from meta data
Data Management
• What is available is tracked in three locations
• System catalog table hbase:meta
• Files in HDFS directories
• Open region instances on servers
• System aligns these locations
• Sometimes (very rarely) a repair may be needed using HBase Fsck
• Redundant information is useful to repair corrupt tables
HBase really is...
• A distributed Hash Map
• Imagine a complex, concatenated key made of the user-given row key, the column name, and the timestamp (version)
• Complex key points to actual value, that is, the cell
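A toy model of this idea, assuming a single process and in-memory storage: cells live under a composite key (row, column, timestamp) that is kept sorted, with newer versions first. The class and method names are invented for illustration and are not the HBase API.

```python
import bisect


class TinyTable:
    """Toy model: sorted composite keys (row, column, -timestamp) -> value."""

    def __init__(self):
        self._keys = []    # composite keys, kept sorted
        self._values = {}  # composite key -> value

    def put(self, row, column, timestamp, value):
        # Negate the timestamp so that newer versions sort first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        start = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        result = []
        for key in self._keys[start:]:
            if key[:2] != (row, column) or len(result) == max_versions:
                break
            result.append((-key[2], self._values[key]))
        return result
```

Asking for one version returns the current value; asking for more walks further along the same sorted run of keys.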
Fold, Store, and Shift
• Logical rows in tables are really stored as flat key-value pairs
• Each carries full coordinates
• Pertinent information can be freely placed in cell to improve lookup
• HBase is a column-family grouped key-value store
HFile Format Information
• All data is stored in a custom (open-source) format, called HFile
• Data is stored in blocks (64KB default)
• Trade-off between lookups and I/O throughput
• Compression, encoding applied _after_ limit check
• Index, filter and meta data are stored in separate blocks
• Fixed trailer allows traversal of file structure
• Newer versions introduce multilayered index and filter structures
• Only load master index and load partial index blocks on demand
• Reading data requires deserialization of block into cells
• Kind of Amdahl's Law applies
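The block and index layout can be illustrated with a small sketch. It assumes, for simplicity, that block size is counted in cells rather than bytes, and that the index is a single flat list of first keys; the function names are hypothetical and do not reflect the actual HFile implementation.

```python
import bisect


def build_blocks(sorted_cells, block_size):
    """Split sorted (key, value) cells into blocks of at most block_size cells.

    Returns (blocks, index), where index holds the first key of each block.
    """
    blocks = [sorted_cells[i:i + block_size]
              for i in range(0, len(sorted_cells), block_size)]
    index = [block[0][0] for block in blocks]
    return blocks, index


def lookup(blocks, index, key):
    """Find key by consulting the index, then scanning exactly one block."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    # Scanning the block stands in for deserializing it into cells.
    for cell_key, value in blocks[pos]:
        if cell_key == key:
            return value
    return None
```

Smaller blocks mean a larger index but less data to decode per point lookup; larger blocks favor sequential throughput, which is the trade-off named above.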
HBase Architecture
• One Master and many Worker servers
• Clients mostly communicate with workers
• Workers store actual data
• Memstore for accruing
• HFile for persistence
• WAL for fail-safety
• Data provided as regions
• HDFS is backing store
• But could be another
HBase Architecture (cont.)
• Based on Log-Structured Merge-Trees (LSM-Trees)
• Inserts are written to the write-ahead log first
• Data is stored in memory and flushed to disk at regular intervals or based on size
• Small flushes are merged in the background to keep the number of files small
• Reads check the memory stores first and the disk-based files second
• Deletes are handled with "tombstone" markers
• Atomicity on row level no matter how many columns
• Keeps locking model easy
Merge Reads
• Read Memstore & StoreFiles using separate scanners
• Merge matching cells into single row "view"
• Deletes mask existing data
• Bloom filters help skip store files
• Reads may have to span many files
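The merge read can be sketched as a k-way merge over several sorted sources. This is a simplified model under stated assumptions: each source is a plain list sorted by (row, column, newest timestamp first), and a tombstone masks every older version of its column. The names `TOMBSTONE` and `merge_read` are invented for this example.

```python
import heapq

TOMBSTONE = object()  # marker written by a delete


def merge_read(row, sources):
    """Merge one row from several stores into a single column -> value view.

    Each source is a list of (row, column, timestamp, value) cells sorted by
    (row, column, newest timestamp first), like a memstore or a store file.
    """
    def scanner(cells):
        for r, column, ts, value in cells:
            if r == row:
                # Sort key: column ascending, newest timestamp first.
                yield (column, -ts), value

    view = {}
    seen = set()
    merged = heapq.merge(*(scanner(s) for s in sources),
                         key=lambda item: item[0])
    for (column, _neg_ts), value in merged:
        if column in seen:
            continue  # a newer cell or tombstone already decided this column
        seen.add(column)
        if value is not TOMBSTONE:
            view[column] = value
    return view
```

Every additional store file adds one more scanner to the merge, which is why reads get slower as files accumulate and why compactions matter.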
APIs and Access Options
HBase Clients
• Native Java Client/API
• Non-Java Clients
• REST server
• Thrift server
• Jython, Groovy DSL
• Spark
• TableInputFormat/TableOutputFormat for MapReduce
• HBase as MapReduce source and/or target
• Also available for table snapshots
• HBase Shell
• JRuby shell adding get, put, scan etc. and admin calls
• Phoenix, Impala, Hive, ...
Java API
From Wikipedia:
• CRUD: "In computer programming, create, read, update, and delete are the four basic functions of persistent storage."
• Other variations of CRUD include
• BREAD (Browse, Read, Edit, Add, Delete)
• MADS (Modify, Add, Delete, Show)
• DAVE (Delete, Add, View, Edit)
• CRAP (Create, Retrieve, Alter, Purge)
Java API (cont.)
• CRUD
• put: Create and update a row (CU)
• get: Retrieve an entire, or partial row (R)
• delete: Delete a cell, column, columns, or row (D)
• scan: Scan any number of rows (S)
• increment: Increment a column value (I)
• Atomic compare-and-swap (CAS)
• Combined get, check, and put operation
• Helps to overcome lack of full transactions
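The combined get, check, and put can be illustrated with a toy in-process store. This is a sketch of the semantics only: the class and method names are hypothetical, and a single lock stands in for the server-side row lock.

```python
import threading


class TinyRowStore:
    """Toy single-process store with a row-level compare-and-swap."""

    def __init__(self):
        self._rows = {}                # row key -> {column: value}
        self._lock = threading.Lock()  # stands in for the server-side row lock

    def get(self, row, column):
        with self._lock:
            return self._rows.get(row, {}).get(column)

    def check_and_put(self, row, check_column, expected, put_column, new_value):
        """Write new_value only if check_column currently equals expected.

        expected=None means "the column must not exist yet".
        Returns True if the write was applied, False otherwise.
        """
        with self._lock:
            current = self._rows.get(row, {}).get(check_column)
            if current != expected:
                return False
            self._rows.setdefault(row, {})[put_column] = new_value
            return True
```

Because the check and the write happen under one lock, two clients racing to create the same cell cannot both succeed, which covers many cases that would otherwise need a transaction.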
Java API (cont.)
• Batch Operations
• Support Get, Put, and Delete
• Reduce network round-trips
• If possible, batch operations to the server to gain better overall throughput
• Filters
• Can be used with Get and Scan operations
• Server-side hinting
• Reduce data transferred to client
• Filters are no guarantee for fast scans
• Still full table scan in worst-case scenario
• Might have to implement your own
• Filters can hint next row key
Data Modeling
Where's your data at?
Key Cardinality
• The best performance is gained from using row keys
• Time range bound reads can skip store files
• So can Bloom Filters
• Selecting column families reduces the amount of data to be scanned
• Pure value based access is a full table scan
• Filters often are too, but reduce network traffic
Key/Table Design
• Crucial to gain best performance
• Why do I need to know? Because an RDBMS, too, only works well when columns are indexed and the query plan is OK
• Absence of secondary indexes forces use of row key or column name
• Transfer multiple indexes into one
• Generate large table -> Good since it fits the architecture and spreads across the cluster
• DDI
• Stands for Denormalization, Duplication and Intelligent Keys
• Needed to overcome trade-offs of architecture
• Denormalization -> Replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and sorting, optimize reads
Pre-materialize Everything
• Achieve one read per customer request if possible
• Otherwise keep at lowest number
• Reads between 10ms (cache miss) and 1ms (cache hit)
• Use MapReduce or Spark to compute exacts in batch
• Store and merge updates live
• Use increment() methods
-> Motto: "Design for Reads"
Tall-Narrow vs. Flat-Wide Tables
• Rows do not split
• Might end up with one row per region
• Same storage footprint
• Put more details into the row key
• Sometimes dummy column only
• Make use of partial key scans
• Tall with Scans, Wide with Gets
• Atomicity only on row level
• Examples
• Large graphs, stored as adjacency matrix (narrow)
• Message inbox (wide)
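A partial key scan over a tall-narrow layout can be sketched as follows. The row key format "<user>|<message id>" and the sample keys are assumptions made for this example; the scan itself is modeled as a range lookup over sorted keys, and the prefix is assumed to be non-empty.

```python
import bisect

# Tall-narrow layout: one row per message, row key "<user>|<message id>".
tall_rows = sorted([
    "alice|0001", "alice|0002",
    "bob|0001", "bob|0002", "bob|0003",
    "carol|0001",
])


def prefix_scan(sorted_keys, prefix):
    """Partial key scan: return all row keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # The stop key is the prefix with its last character incremented,
    # so the scan covers exactly the keys sharing the prefix.
    stop_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    stop = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[start:stop]
```

In the flat-wide alternative the same inbox would be a single row per user with one column per message, fetched with one get, at the cost that the row can never be split across regions.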
Sequential Keys
<timestamp><more key>: {CF: {CQ: {TS : Val}}}
• Hotspotting on regions is bad!
• Instead do one of the following:
• Salting
• Prefix <timestamp> with distributed value
• Binning or bucketing rows across regions
• Key field swap/promotion
• Move <more key> before the timestamp (see OpenTSDB)
• Randomization
• Move <timestamp> out of key or prefix with MD5 hash
• Might also be mitigated by overall spread of workloads
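The three options can be sketched as small key-building functions. Everything here is illustrative: the bucket count, the "|" separator, the zero-padded timestamp, and the function names are assumptions, not a prescribed layout.

```python
import hashlib

NUM_BUCKETS = 8  # assumed number of salt buckets, e.g. one per presplit region


def salted_key(timestamp_ms: int, more_key: str) -> bytes:
    """Salting: prefix the key with a bucket derived from the key itself."""
    body = f"{timestamp_ms:013d}|{more_key}".encode()
    bucket = hashlib.md5(body).digest()[0] % NUM_BUCKETS
    return f"{bucket:02d}|".encode() + body


def promoted_key(timestamp_ms: int, more_key: str) -> bytes:
    """Field promotion: move <more key> in front of the timestamp."""
    return f"{more_key}|{timestamp_ms:013d}".encode()


def hashed_key(timestamp_ms: int, more_key: str) -> bytes:
    """Randomization: prefix the key with an MD5 hash of the key."""
    body = f"{timestamp_ms:013d}|{more_key}".encode()
    return hashlib.md5(body).hexdigest().encode() + b"|" + body
```

The price of salting is on the read side: a time range scan has to fan out into one scan per bucket and merge the results, whereas field promotion keeps all entries for one <more key> value contiguous.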
Key Design Choices
• Based on access pattern, either use sequential or random keys
• Often a combination of both is needed
• Overcome architectural limitations
• Neither is necessarily bad
• Use bulk import for sequential keys and
• Random keys are good for random access
• Design for Use-Case
• Read, Write, or Both?
• Avoid Hotspotting
• Hash leading key part, or use salting/bucketing
• Use bulk loading where possible
• Monitor your servers!
• Presplit tables
• Try prefix encoding when values are small
• Otherwise use compression (or both)
• For Reads: Restrict yourself
• Specify what you need, i.e. columns, families, time range
• Shift details to appropriate position
• Composite Keys
• Column Qualifiers
Performance Tuning
1000 knobs to turn... 20 are important?
Everything is Pluggable
• Cell
• Memstore
• Flush Policy
• Compaction
• Cache
• WAL
• RPC handling
• ...
Cluster Tuning
• First, tune the global settings
• Heap size and GC algorithm
• Memory share for reads and writes
• Enable Block Cache
• Number of RPC handlers
• Load Balancer
• Default flush and compaction strategy
• Thread pools (10+)
• Next, tune the per-table and family settings
• Region sizes
• Block sizes
• Compression and encoding
• Compactions
• ...
Region Balancer Tuning
• A background process in the HBase Master tracks load on servers
• The load balancer moves regions
• Multiple implementations exist
• Simple counts number of regions
• Stochastic determines cost
• Favored Node pins HDFS block
• Can be tuned further
• Cluster-wide setting!
RPC Tuning
• Default is one queue for all types of requests
• Can be split into separate queues for reads and writes
• Read queue can be further split into reads and scans
-> Stricter resource limits, but may avoid cross-
Key Tuning
• Design keys to match use-case
• Sequential, salted, or random
• Use sorting to convey meaning
• Colocate related data
• Spread load over all servers
• Clever key design can make use of distribution: aging-out regions
Compaction Tuning
• Default compaction settings are aggressive
• Set for update use-case
• For insert use-cases, Blooms are effective
• Allows tuning down compactions
• Saves resources by reducing write amplification
• More store files also enable faster full table scans when the scans are time range bound
• Server can ignore older files
• Large regions may be eligible for advanced compaction strategies
• Stripe or date-tiered compactions
• Reduce rewrites to fraction of region size
What works well, what does not, and what is so-so
Placing the Use-Case
• HBase chooses to work best for random access
• You can optimize a table to prefer scans over gets
• Fewer columns with larger payload
• Larger HFile block sizes (maybe even duplicate data in two differently configured column families)
• After that is the realm of hybrid systems
• For fastest scans use brute force HDFS and native query engine with a columnar format
Big Data Workloads
[Figure: workloads placed along an access-pattern axis labeled Random Access, Short Scan, and Full Scan. Technology choices shown: HBase + MR/Spark; HBase + Snapshots -> HDFS + MR/Spark; HDFS + MR/Spark. Example workloads shown: current metrics, graph data, simple entities, hybrid entity time series + rollup serving, hybrid entity time series + rollup generation, entity time series, index building, analytic archive.]
Wrapping it up...
Mostly Inserts Use-Cases
• Tune down compactions
• Compaction ratio, max store file size
• Use Bloom Filters
• On by default for row keys
Mostly Update Use-Cases
• Batch updates if possible
Mostly Serial Keys
• Use bulk loading or salting
Mostly Random Keys
• Hash key with MD5 prefix
Mostly Random Reads
• Decrease HFile block size
• Use random keys
Mostly Scans
• Increase HFile (and HDFS) block size
• Reduce columns and increase cell sizes
What matters...
• For optimal performance, two things need to be considered:
• Optimize the cluster and table settings
• Choose the matching key schema
• Ensure load is spread over tables and cluster nodes
• HBase works best for random access and bound scans
• HBase can be optimized for larger scans, but its sweet spot is short burst scans (can be parallelized too) and random point gets
• Java heap space limits addressable space
• Play with region sizes, compaction strategies, and key design to maximize result
• Using HBase for a suitable use-case will make for a happy customer...
• Conversely, forcing it into non-suitable use-cases may be cause for trouble
Thank You!
