NoSQL & HBase overview
Big Data – 4 V’s
NoSQL 
• NoSQL is all about scalability 
• Scaling to size 
• Scaling to complexity 
• Delivers heavy read/write (R/W) workloads 
• Data duplication and denormalization are first-class citizens
RDBMS vs NoSQL
No SQL Types
Database Chart
CAP Theorem
Re-check 
• What is the CAP theorem? 
• Does NoSQL support transactions? 
• What are the NoSQL types?
HBase 
• Scalable, distributed data store 
• Sorted map of maps / key-value store 
• Open-source avatar of Google’s Bigtable 
• Sparse 
• Multidimensional 
• Tightly integrated with Hadoop 
• Not an RDBMS
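The "sorted map of maps" view can be sketched in plain Java. This is a simulation with `TreeMap`, not the HBase client API; the row and column names here are made up for illustration:

```java
import java.util.TreeMap;

public class SortedMapOfMaps {
    public static void main(String[] args) {
        // rowkey -> (column -> value); both levels stay sorted,
        // mirroring how HBase sorts rows by rowkey and columns within a row
        TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();

        table.computeIfAbsent("row2", k -> new TreeMap<>()).put("cf:ip", "20.20.20.20");
        table.computeIfAbsent("row1", k -> new TreeMap<>()).put("cf:ip", "10.10.10.10");

        // Rows come back in rowkey order regardless of insertion order
        System.out.println(table.firstKey());               // row1
        System.out.println(table.get("row2").get("cf:ip")); // 20.20.20.20
    }
}
```

Sorted order is what makes range scans by rowkey cheap, which is why rowkey design matters so much in HBase.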
Architecture 
• HDFS (DataNodes): storage 
• ZooKeeper: membership management 
• RegionServers: serve the regions 
• HBase Masters: janitorial work
Column-oriented
Distributed
Variable number of columns
Important Terms 
• Table 
• Consists of rows and columns 
• Row 
• Contains a set of columns 
• Identified by a rowkey (primary key) 
• Column Qualifier 
• Dynamic column name 
• Column Family 
• Column groups - logical and physical (Similar access pattern) 
• Cell 
• The actual element that contains the data for a row-column insertion 
• Version 
• Every cell can hold multiple timestamped versions
Logical v/s Physical: Tall v/s Wide tables, storage footprint 
     CF1                CF2 
r1   c1:v1              c1:v9  c6:v2 
r2   c1:v2  c3:v6 
r3   c2:v3              c5:v6 
r4   c2:v4 
r5   c1:v1  c3:v5       c7:v8 
HFile for CF1 HFile for CF2 
r1:CF1:c1:t1:v1 
r2:CF1:c1:t2:v2 
r2:CF1:c3:t3:v6 
r3:CF1:c2:t1:v3 
r4:CF1:c2:t1:v4 
r5:CF1:c1:t2:v1 
r5:CF1:c3:t3:v5 
r1:CF2:c1:t1:v9 
r1:CF2:c6:t4:v2 
r3:CF2:c5:t4:v6 
r5:CF2:c7:t3:v8 
Result object returned for a Get() on row r5 
r5:CF1:c1:t2:v1 
r5:CF1:c3:t3:v5 
r5:CF2:c7:t3:v8 
KeyValue objects 
Row Key | Col Fam | Col Qual | Time Stamp | Value 
The Key portion of a KeyValue is Row Key + Col Fam + Col Qual + Time Stamp; the Value portion holds the Cell’s data. 
Logical representation of an HBase table. 
We'll look at what it means to Get() row r5 from this table. 
Actual physical storage of the table 
Structure of a KeyValue object
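The KeyValue layout above can be modeled in plain Java. This is a sketch of the fields only; the real class is `org.apache.hadoop.hbase.KeyValue`, which stores all of these as byte arrays:

```java
// Plain-Java sketch of the fields a KeyValue carries (not the real HBase class)
public class KeyValueSketch {
    final String rowKey, columnFamily, columnQualifier;
    final long timestamp;
    final String value;

    KeyValueSketch(String rowKey, String cf, String cq, long ts, String value) {
        this.rowKey = rowKey;
        this.columnFamily = cf;
        this.columnQualifier = cq;
        this.timestamp = ts;
        this.value = value;
    }

    // The "key" part is everything except the value
    String key() {
        return rowKey + ":" + columnFamily + ":" + columnQualifier + ":t" + timestamp;
    }

    public static void main(String[] args) {
        KeyValueSketch kv = new KeyValueSketch("r5", "CF1", "c1", 2, "v1");
        System.out.println(kv.key() + ":" + kv.value); // r5:CF1:c1:t2:v1
    }
}
```

Note how the printed form matches the physical-storage rows shown above (e.g. `r5:CF1:c1:t2:v1`): each cell is stored as one fully qualified key plus its value.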
(J)Ruby Shell Commands 
• General 
• DDL 
• Create 
• Describe 
• Namespace 
• DML 
• Put 
• Get 
• Scan 
• Delete 
• Tools 
• Replication 
• Snapshot 
• Security 
• Visibility 
Creating Table: 
create 'DEVICE_DETAIL','BASIC_INFO','CONTRACT_INFO' 
Data Generation: 
put 'DEVICE_DETAIL','Device1','BASIC_INFO:IP_ADDR','10.10.10.10' 
put 'DEVICE_DETAIL','Device2','BASIC_INFO:IP_ADDR','20.20.20.20' 
Describing Table: 
describe 'DEVICE_DETAIL' 
Altering Table: 
alter 'DEVICE_DETAIL',{NAME => 'CONTRACT_INFO',VERSIONS => 3 } 
Update Data: 
put 'DEVICE_DETAIL','Device2','CONTRACT_INFO:CONTRACT_NUMBER','22222222' 
Multi-Version Example: 
get 'DEVICE_DETAIL','Device2', {COLUMN=>'CONTRACT_INFO:CONTRACT_NUMBER', VERSIONS=>2} 
Scan Info: 
scan 'DEVICE_DETAIL' 
Scan with Filter: 
scan 'DEVICE_DETAIL', { COLUMNS => 'CONTRACT_INFO:STATUS', LIMIT => 10, FILTER => 
"ValueFilter( =, 'binary:IN_ACTIVE' )" } 
Delete Info: 
delete 'DEVICE_DETAIL','Device2','CONTRACT_INFO:STATUS'
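The multi-version behavior exercised by the `alter ... VERSIONS => 3` and `get ... VERSIONS => 2` commands above can be sketched in plain Java. This is a simulation of how a cell keeps its newest N timestamped versions, not the HBase client API; the contract numbers are made up:

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CellVersions {
    public static void main(String[] args) {
        int maxVersions = 3; // as set by: alter ... VERSIONS => 3
        // timestamp -> value, newest first (versions are ordered by descending timestamp)
        TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

        long ts = 0;
        for (String v : new String[]{"11111111", "22222222", "33333333", "44444444"}) {
            versions.put(++ts, v);
            while (versions.size() > maxVersions) {
                versions.remove(versions.lastKey()); // oldest version is dropped
            }
        }

        // A plain get returns only the newest version
        System.out.println(versions.firstEntry().getValue()); // 44444444

        // get ... VERSIONS => 2 returns the two newest versions
        List<String> latestTwo =
                versions.values().stream().limit(2).collect(Collectors.toList());
        System.out.println(latestTwo); // [44444444, 33333333]
    }
}
```

Once the configured version count is exceeded, the oldest version becomes eligible for removal, which is why the first put ("11111111") is no longer retrievable here.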
Java API 
• HTable 
• HBaseAdmin 
• HTablePool 
• Get 
• Put 
• Delete 
• Scan 
• Increment 
• HTableDescriptor 
• HTableInterface 
• Result 
• ResultScanner 
• KeyValue 
HTable table = new HTable(configuration, hbaseTableName); 

// write one cell into the row identified by rowKey 
Put row = new Put(Bytes.toBytes(rowKey)); 
row.add(Bytes.toBytes(columnFamily), Bytes.toBytes(qualifier), 
        Bytes.toBytes(value)); 
table.put(row); 

// read the row back 
Get get = new Get(Bytes.toBytes(rowKey)); 
Result result = table.get(get);
Spark HBase 
// create configuration 
val config = HBaseConfiguration.create() 
config.set("hbase.zookeeper.quorum", "localhost") 
config.set("hbase.zookeeper.property.clientPort", "2181") 
config.set(TableInputFormat.INPUT_TABLE, "hbaseTableName") // "hbase.mapreduce.inputtable" 
// read data (org.apache.hadoop.hbase.mapreduce.TableInputFormat) 
val hbaseData = sparkContext.newAPIHadoopRDD(config, classOf[TableInputFormat], 
  classOf[ImmutableBytesWritable], classOf[Result]) 
// count rows 
println(hbaseData.count)
HBase Architecture
Write & Read Logic
SQL
Re-check 
• What is a column family? 
• What are the HBase components? 
• Name a few shell commands 
• How do versions work in HBase?
Reference Slides
Use Case 
• Canonical use case: storing crawl data and indices for search 

Web Search powered by Bigtable 

Indexing the Internet 
1. Crawlers constantly scour the Internet for new pages. Those pages are stored as individual records in Bigtable. 
2. A MapReduce job runs over the entire table, generating search indexes for the Web Search application. 

Searching the Internet 
3. The user initiates a Web Search request. 
4. The Web Search application queries the Search Indexes and retrieves matching documents directly from Bigtable. 
5. Search results are presented to the user. 

(Diagram: Crawlers → Bigtable → MapReduce → Search Indexes → Web Search → user)
HBase Architecture
Replications
CAP Theorem


Editor's Notes

  1. Most NoSQL stores lack true ACID transactions, although a few recent systems, such as FairCom c-treeACE, Google Spanner (technically a NewSQL database), and FoundationDB, have made them central to their designs. Eventual consistency is a consistency model used in distributed computing to achieve high availability; it informally guarantees that, if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value. Eventually consistent services are often classified as providing BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.
  2. http://blog.monitis.com/2011/05/22/picking-the-right-nosql-database-tool/
  3. Eric Brewer’s CAP theorem says that if you want consistency, availability, and partition tolerance, you have to settle for two out of three. Consistency means that each client always has the same view of the data. Availability means that all clients can always read and write. Partition tolerance means that the system continues to work across physical network partitions; a few nodes can fail and the system keeps going, unless there is a total network failure.
  4. http://localhost:60010/master-status