Assignment 10
HBase
Key-value data stores: Key-value NoSQL databases emphasize simplicity and are very
useful in accelerating an application to support high-speed read and write processing
of non-transactional data. Stored values can be any type of binary object (text, video,
JSON document, etc.) and are accessed via a key. The application has complete
control over what is stored in the value, making this the most flexible NoSQL model.
Data is partitioned and replicated across a cluster to get scalability and availability.
For this reason, key-value stores often do not support transactions. However, they
are highly effective at scaling applications that deal with high-velocity, non-
transactional data.
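The partitioning and replication described above can be sketched as follows. This is a minimal illustration, not the scheme of any particular product: the node names, the replication factor, and the hash-modulo placement rule are all assumptions made for the example.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
REPLICAS = 2                            # assumed replication factor


def owners(key: bytes) -> list:
    """Pick a primary node by hashing the key, then the next node(s) as replicas."""
    digest = hashlib.md5(key).digest()
    primary = int.from_bytes(digest[:4], "big") % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]


class KeyValueStore:
    """Toy key-value store: opaque byte values, looked up only by key."""

    def __init__(self):
        self.data = {node: {} for node in NODES}

    def put(self, key: bytes, value: bytes) -> None:
        # Write the value to every node that owns the key.
        for node in owners(key):
            self.data[node][key] = value

    def get(self, key: bytes) -> bytes:
        # Read from the primary owner; the store never inspects the value.
        return self.data[owners(key)[0]][key]


store = KeyValueStore()
store.put(b"user:1", b'{"name": "example"}')
```

Because every operation touches only the nodes that own one key, no coordination across keys is needed, which is also why multi-key transactions are hard to offer in this model.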
Document stores: Document databases typically store self-describing JSON, XML,
and BSON documents. They are similar to key-value stores, but in this case, a value is
a single document that stores all data related to a specific key. Popular fields in the
document can be indexed to provide fast retrieval without knowing the key. Each
document can have the same or a different structure.
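As a rough sketch of indexing a field so that documents can be found without knowing their key, consider the following. The documents, field names, and the `build_index` helper are hypothetical and exist only for illustration.

```python
# Three documents stored by key; note that their structures differ.
documents = {
    "u1": {"name": "Asha", "city": "Pune"},
    "u2": {"name": "Ravi", "city": "Delhi", "phone": "555-0100"},
    "u3": {"name": "Meera", "city": "Pune"},
}


def build_index(docs, field):
    """Map each value of `field` to the keys of the documents containing it."""
    index = {}
    for key, doc in docs.items():
        if field in doc:  # documents lacking the field are simply skipped
            index.setdefault(doc[field], []).append(key)
    return index


city_index = build_index(documents, "city")
pune_keys = city_index["Pune"]  # lookup by field value, no document key needed
```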
Wide-column stores: Wide-column NoSQL databases store data in tables with rows
and columns similar to RDBMS, but names and formats of columns can vary from
row to row across the table. Wide-column databases group columns of related data
together. A query can retrieve related data in a single operation because only the
columns associated with the query are retrieved. In an RDBMS, the data would be in
different rows stored in different places on disk, requiring multiple disk operations
for retrieval.
Graph stores: Graph databases use graph structures to store, map, and query
relationships. They provide index-free adjacency, so that adjacent elements are
linked together without using an index.
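Index-free adjacency can be pictured as each node holding direct references to its neighbours, so a traversal follows pointers instead of consulting a lookup table. The `Node` class and the sample names below are assumptions for this sketch only.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references to adjacent nodes

    def connect(self, other):
        # Undirected edge: each node points straight at the other.
        self.neighbors.append(other)
        other.neighbors.append(self)


def friends_of_friends(node):
    """Two-hop traversal that only follows neighbour references."""
    return sorted({n2.name
                   for n1 in node.neighbors
                   for n2 in n1.neighbors
                   if n2 is not node})


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)
```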
The table shows two column families: CustomerName and ContactInfo. When creating a
table in HBase, the developer or administrator is required to define one or more column
families using printable characters.
Generally, column families remain fixed throughout the lifetime of an HBase table but new
column families can be added by using administrative commands.
The more column families we add, the more MemStores are created (one per column family
in each Region) and the more frequent MemStore flushes become, which degrades performance.
5. Why are columns not defined at the time of table creation in HBase?
Ans: HBase has a dynamic schema and uses a query-first schema design: all possible queries are
identified first and the schema model is designed accordingly. Columns are added on a need basis
at run time, unlike in a traditional database where the columns are predefined. This gives HBase
the flexibility that, whenever a new requirement comes up, we can add columns to meet it
without changing the complete model of the
database.
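The difference between fixed column families and run-time columns can be sketched with a small in-memory model. It reuses the family names CustomerName and ContactInfo from the earlier example; the `Table` class, the row keys, and the qualifiers (FirstName, Email, Phone) are hypothetical and are not HBase's implementation.

```python
class Table:
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # declared at table creation
        self.rows = {}                 # row key -> family -> qualifier -> value

    def put(self, row_key, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # The qualifier was never declared; it comes into existence on first write.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def columns(self, row_key):
        return sorted(f"{fam}:{qual}"
                      for fam, quals in self.rows.get(row_key, {}).items()
                      for qual in quals)


customers = Table("customers", ["CustomerName", "ContactInfo"])
customers.put("row1", "CustomerName:FirstName", "John")
customers.put("row1", "ContactInfo:Email", "john@example.com")
customers.put("row2", "ContactInfo:Phone", "555-0100")  # a column row1 does not have
```

Each row ends up with its own set of columns, while a write to an undeclared family is rejected.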
Ans: HBase stores data in the form of distributed, sorted, multidimensional, persistent maps
called Tables. ...The HBase data model consists of tables containing rows. Data is
organized into column families, which group the columns in each row. Just as in a relational
database, data in HBase is stored in Tables, and these Tables are stored in Regions. When
a Table becomes too big, it is partitioned into multiple Regions. These Regions are
assigned to Region Servers across the cluster. Each Region Server hosts roughly the same
number of Regions.
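Because rows are kept sorted by row key, each Region can cover a contiguous key range, and finding the Region for a row reduces to a search over the sorted start keys. The boundaries and the server assignment below are made up for this sketch.

```python
import bisect

# Hypothetical Regions: each covers [start_key, next_start_key).
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs2"]  # assumed, roughly balanced assignment


def locate(row_key: bytes):
    """Return (region index, region server) for the Region whose range holds row_key."""
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]
```

For example, `locate(b"apple")` falls in the first Region and `locate(b"zebra")` in the last one.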
Ans: When we put data into HBase, a timestamp is required. The timestamp can be generated
automatically by the RegionServer or supplied by the client. The timestamp must be unique per
version of a given cell, because the timestamp identifies the version. To modify a previous version of
a cell, for instance, we issue a Put with a different value for the data itself but the same timestamp.
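A simplified model of this behaviour treats a cell as a map from timestamp to value. The `Cell` class and the sample values are assumptions made for illustration and are not part of the HBase client API.

```python
import time


class Cell:
    def __init__(self):
        self.versions = {}  # timestamp -> value

    def put(self, value, timestamp=None):
        if timestamp is None:
            # Stand-in for a server-generated timestamp (milliseconds).
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value  # same timestamp replaces that version
        return timestamp

    def get(self, timestamp=None):
        if timestamp is None:
            timestamp = max(self.versions)  # default to the newest version
        return self.versions[timestamp]


cell = Cell()
cell.put("chrome", timestamp=100)
cell.put("firefox", timestamp=200)   # new timestamp, so a new version
cell.put("chromium", timestamp=100)  # same timestamp, so version 100 is modified
```

After these three puts the cell still has two versions: the value at timestamp 100 has changed, and the newest version is untouched.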
When the client issues a Put request, the first step is to write the data to the
write-ahead log, the WAL:
- Edits are appended to the end of the WAL file that is stored on disk.
- The WAL is used to recover not-yet-persisted data in case a server crashes.
HBase Write Steps (2)
Once the data is written to the WAL, it is placed in the MemStore. Then, the put request
acknowledgement returns to the client.
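The two write steps, plus recovery from the WAL, can be sketched as follows. This is a toy model under stated assumptions: the class name is hypothetical, and an in-memory list stands in for the WAL file on disk.

```python
class RegionServerSketch:
    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []  # stand-in for the on-disk WAL file
        self.memstore = {}                         # in-memory write buffer

    def put(self, row, column, value):
        self.wal.append((row, column, value))  # step 1: append the edit to the WAL
        self.memstore[(row, column)] = value   # step 2: place the data in the MemStore
        return "ack"                           # then acknowledge the client

    @classmethod
    def recover(cls, wal):
        """Rebuild the MemStore after a crash by replaying the WAL in order."""
        server = cls(wal=list(wal))
        for row, column, value in wal:
            server.memstore[(row, column)] = value
        return server


server = RegionServerSketch()
result = server.put("IP", "hits:browser", "chrome")
server.put("IP", "hits:browser", "firefox")

# Simulated crash: the MemStore is lost, but the WAL survives.
recovered = RegionServerSketch.recover(server.wal)
```

Writing to the WAL before the MemStore is what makes the acknowledgement safe: any edit the client was told about can be replayed.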
All of the above are subcomponents of the Region Server, which manages the data in HBase.
Task 2:
a. Create an HBase table named 'clicks' with a column family 'hits' such that it is
able to store the last 5 values of the qualifiers inside the 'hits' column family.
b. Add a few records to the table and update some of them. Use the IP address as the row key. Scan the
table to check whether all the previous versions are displayed.
Ans:
Commands:
create 'clicks',{NAME=>'hits', VERSIONS=>5}
Explanation :
Creates a table named 'clicks' with column family 'hits', which keeps five versions
of each column value in the 'hits' column family.
OR
create 'clicks','hits'
alter 'clicks',{NAME=>'hits', VERSIONS=>5}
Explanation :
Creates a table named 'clicks' with column family 'hits', then
alters the table attributes to retain five versions of the column values in the
'hits' column family.
put 'clicks','IP','hits:hostname','gpu_vdi_01'
put 'clicks','IP','hits:location','India'
put 'clicks','IP','hits:browser','chrome'
Explanation :
Inserts values into the columns hostname, location and browser of column family
'hits' for row key 'IP' in table 'clicks'.
scan 'clicks'
Explanation :
Reads the contents of table 'clicks' to check that the values were inserted.
put 'clicks','IP','hits:browser','firefox'
put 'clicks','IP','hits:browser','safari'
put 'clicks','IP','hits:browser','IExplorer'
put 'clicks','IP','hits:browser','Opera'
put 'clicks','IP','hits:browser','Chromium'
put 'clicks','IP','hits:browser','Netscape'
Explanation :
Changing or updating values of browser column to check if table retains the last five
versions of the updated values.
scan 'clicks'
Explanation:
A plain scan shows only the latest value of browser, i.e. 'Netscape', although the table
still retains the earlier versions.
scan 'clicks',{COLUMN=>'hits:browser',VERSIONS=>5}
Explanation:
VERSIONS=>5 shows the last five values that were written to column
browser in column family 'hits'.
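The retention behaviour seen in the scans can be modelled with a bounded history. This is only a sketch of the visible result: it uses a Python deque rather than anything HBase does internally, and it ignores when old versions are physically removed from storage.

```python
from collections import deque

MAX_VERSIONS = 5  # mirrors VERSIONS=>5 on the 'hits' column family

# Values written to hits:browser, in the order of the put commands above.
writes = ["chrome", "firefox", "safari", "IExplorer",
          "Opera", "Chromium", "Netscape"]

history = deque(maxlen=MAX_VERSIONS)
for browser in writes:
    history.appendleft(browser)  # newest first; the oldest falls off once full

latest = history[0]        # what a plain scan shows
retained = list(history)   # what a scan with VERSIONS=>5 shows
```

Seven values were written, so the two oldest ('chrome' and 'firefox') are no longer returned.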