
Learning HBase
Ebook · 692 pages · 3 hours


About this ebook

Apache HBase is a nonrelational (NoSQL) database management system that runs on top of HDFS. It is an open source, distributed, versioned, column-oriented store that provides random, real-time read/write access to Big Data, with the benefit of linear scalability on the fly.
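The blurb's key terms (column-oriented, versioned) map onto a simple logical model: every value lives at a coordinate of row key, column (family:qualifier), and timestamp, and a read returns the newest version by default. A minimal Python sketch of that logical model follows; it is a toy for illustration only, not the real HBase API, and all names in it are invented.

```python
from collections import defaultdict

class ToyHBaseTable:
    """Toy model of HBase's logical layout: a sparse map of
    row -> column (family:qualifier) -> timestamp -> value."""

    def __init__(self):
        # Sparse map: a column that was never written simply does not
        # exist for that row (no NULL storage, unlike an RDBMS).
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # Each write at a new timestamp adds a new version of the cell.
        self.rows[row][column][ts] = value

    def get(self, row, column, versions=1):
        # Return the newest `versions` values, newest first.
        cells = self.rows[row][column]
        return [cells[t] for t in sorted(cells, reverse=True)[:versions]]

t = ToyHBaseTable()
t.put("user1", "info:city", "Delhi", ts=1)
t.put("user1", "info:city", "Bangalore", ts=2)
print(t.get("user1", "info:city"))              # newest version only
print(t.get("user1", "info:city", versions=2))  # both stored versions
```

The point of the sketch is the shape of the coordinate system: "column-oriented" and "versioned" are properties of this map, not of a fixed table schema.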

This book will take you through a series of core tasks in HBase. The introductory chapter will give you all the information you need about the HBase ecosystem. Furthermore, you'll learn how to configure, create, verify, and test clusters. The book also explores the different parameters of Hadoop and HBase that need to be considered for optimization and for trouble-free operation of the cluster. It then focuses on HBase's data model, storage, and structure layout. You will also get to know the different options that can be used to speed up the operation and functioning of HBase. The book will also teach you basic and advanced-level coding in Java for HBase. By the end of the book, you will have learned how to use HBase with large data sets and integrate it with Hadoop.

Language: English
Release date: Nov 25, 2014
ISBN: 9781783985951


    Book preview

    Learning HBase - Shashwat Shriparv

    Table of Contents

    Learning HBase

    Credits

    About the Author

    Acknowledgments

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Understanding the HBase Ecosystem

    HBase layout on top of Hadoop

    Comparing architectural differences between RDBMS and HBase

    HBase features

    HBase in the Hadoop ecosystem

    Data representation in HBase

    Hadoop

    Core daemons of Hadoop

    Comparing HBase with Hadoop

    Comparing functional differences between RDBMS and HBase

    Logical view of row-oriented databases

    Logical view of column-oriented databases

    Pros and cons of column-oriented databases

    About the internal storage architecture of HBase

    Getting started with HBase

    When it started

    HBase components and functionalities

    ZooKeeper

    Why an odd number of ZooKeepers?

    HMaster

    If a master node goes down

    RegionServer

    Components of a RegionServer

    Client

    Catalog tables

    Who is using HBase and why?

    When should we think of using HBase?

    When not to use HBase

    Understanding some open source HBase tools

    The Hadoop-HBase version compatibility table

    Applications of HBase

    HBase pros and cons

    Summary

    2. Let's Begin with HBase

    Understanding HBase components in detail

    HFile

    Region

    Scalability – understanding the scale up and scale out processes

    Scale in

    Scale out

    Reading and writing cycle

    Write-Ahead Logs

    MemStore

    HBase housekeeping

    Compaction

    Minor compaction

    Major compaction

    Region split

    Region assignment

    Region merge

    RegionServer failovers

    The HBase delete request

    The reading and writing cycle

    List of available HBase distributions

    Prerequisites and capacity planning for HBase

    The forward DNS resolution

    The reverse DNS resolution

    Java

    SSH

    Domain Name Server

    Using Network Time Protocol to keep your node on time

    OS-level changes and tuning up OS for HBase

    Summary

    3. Let's Start Building It

    Downloading Java on Ubuntu

    Considering host configurations

    Host file based

    Command based

    File based

    DNS based

    Installing and configuring SSH

    Installing SSH on Ubuntu/Red Hat/CentOS

    Configuring SSH

    Installing and configuring NTP

    Performing capacity planning

    Installing and configuring Hadoop

    core-site.xml

    hdfs-site.xml

    yarn-site.xml

    mapred-site.xml

    hadoop-env.sh

    yarn-env.sh

    Slaves file

    Hadoop start up steps

    Configuring Apache HBase

    Configuring HBase in the standalone mode

    Configuring HBase in the distributed mode

    hbase-site.xml

    hbase-env.sh

    regionservers

    Installing and configuring ZooKeeper

    Installing Cloudera Hadoop and HBase

    Downloading the required RPM packages

    Installing Cloudera in an easier way

    Installing the Hadoop and MapReduce packages

    Installing Hadoop on Windows

    Summary

    4. Optimizing the HBase/Hadoop Cluster

    Setup types for Hadoop and HBase clusters

    Recommendations for CDH cluster configuration

    Capacity planning

    Hadoop optimization

    General optimization tips

    Optimizing Java GC

    Optimizing Linux OS

    Optimizing the Hadoop parameter

    Optimizing MapReduce

    Rack awareness in Hadoop

    Number of Map and Reduce limits in configuration files

    Considering and deciding the maximum number of Map and Reduce tasks

    Optimizing HBase

    Hadoop

    Memory

    Java

    OS

    HBase

    Optimizing ZooKeeper

    Important files in Hadoop

    Important files in HBase

    Summary

    5. The Storage, Structure Layout, and Data Model of HBase

    Data types in HBase

    Storing data in HBase – logical view versus actual physical view

    Namespace

    Commands available for namespaces

    Services of HBase

    Row key

    Column family

    Column

    Cell

    Version

    Timestamp

    Data model operations

    Get

    Put

    Scan

    Delete

    Versioning and why

    Deciding the number of the version

    Lower bound of versions

    Upper bound of versions

    Schema designing

    Types of table designs

    Benefits of Short Wide and Tall-Thin design patterns

    Composite key designing

    Real-time use case of schema in an HBase table

    Schema change operations

    Calculating the data size stored in HBase

    Summary

    6. HBase Cluster Maintenance and Troubleshooting

    Hadoop shell commands

    Types of Hadoop shell commands

    Administration commands

    User commands

    File system-related commands

    Difference between copyToLocal/copyFromLocal and get/put

    HBase shell commands

    HBase administration tools

    hbck – HBase check

    HBase health check script

    Writing HBase shell scripts

    Using the Hadoop tool or JARs for HBase

    Connecting HBase with Hive

    HBase region management

    Compaction

    Merge

    HBase node management

    Commissioning

    Decommissioning

    Implementing security

    Secure access

    Requirement

    Kerberos KDC

    Client-side security configuration

    Client-side security configuration for thrift requests

    Server-side security configuration

    Simple security

    Server-side configuration

    Client-side configuration

    The tag security feature

    Access control in HBase

    Server-side access control

    Cell-level access using tags

    Configuring ZooKeeper for security

    Troubleshooting the most frequent HBase errors and their explanations

    What might fail in cluster

    Monitoring HBase health

    HBase web UI

    Master

    RegionServer

    ZooKeeper command line

    Linux tools

    Summary

    7. Scripting in HBase

    HBase backup and restore techniques

    Offline backup / full-shutdown backup

    Backup

    Restore

    Online backup

    The HBase snapshot

    Online

    Offline

    The HBase replication method

    Setting up cluster replication

    Backup and restore using Export and Import commands

    Export

    Import

    Miscellaneous utilities

    CopyTable

    HTable API

    Backup using a Mozilla tool

    HBase on Windows

    Scripting in HBase

    The .irbrc file

    Getting the HBase timestamp from HBase shell

    Enabling debugging shell

    Enabling the debug level in HBase shell

    Enabling SQL in HBase

    Contributing to HBase

    Summary

    8. Coding HBase in Java

    Setting up the environment for development

    Building a Java client to code in HBase

    Data types

    Data model Java operations

    Read

    Get()

    Constructors

    Supported methods

    Scan()

    Constructors

    Methods

    Write

    Put()

    Constructors

    Methods

    Modify

    Delete()

    Constructors

    Methods

    HBase filters

    Types of filters

    Client APIs

    Summary

    9. Advance Coding in Java for HBase

    Interfaces, classes, and exceptions

    Code related to administrative tasks

    Data operation code

    MapReduce and HBase

    RESTful services and Thrift services interface

    REST service interfaces

    Thrift

    Coding for HDFS operations

    Some advance topics in brief

    Coprocessors

    Types of coprocessors

    Bloom filters

    The Lily project

    Features

    Summary

    10. HBase Use Cases

    HBase in industry today

    The future of HBase against relational databases

    Some real-world project examples' use cases

    HBase at Facebook

    Choosing HBase

    Storing in HBase

    The architecture of a Facebook message

    Facts and figures

    HBase at Pinterest

    The layout architecture

    HBase at Groupon

    The layout architecture

    HBase at LongTail Video

    The layout architecture

    HBase at Aadhaar (UIDAI)

    The layout architecture

    Useful links and references

    Summary

    Index


    Learning HBase

    Copyright © 2014 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: November 2014

    Production reference: 1181114

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78398-594-4

    www.packtpub.com

    Credits

    Author

    Shashwat Shriparv

    Reviewers

    Ashutosh Bijoor

    Chhavi Gangwal

    Henry Garner

    Nitin Pawar

    Jing Song

    Arun Vasudevan

    Commissioning Editor

    Akram Hussain

    Acquisition Editor

    Kevin Colaco

    Content Development Editor

    Prachi Bisht

    Technical Editor

    Pankaj Kadam

    Copy Editors

    Janbal Dharmaraj

    Sayanee Mukherjee

    Project Coordinator

    Sageer Parkar

    Proofreaders

    Bridget Braund

    Maria Gould

    Lucy Rowland

    Indexer

    Tejal Soni

    Graphics

    Ronak Dhruv

    Production Coordinator

    Aparna Bhagat

    Cover Work

    Aparna Bhagat

    About the Author

    Shashwat Shriparv was born in Muzaffarpur, Bihar. He did his schooling in Muzaffarpur and in Shillong, Meghalaya. He received his BCA degree from IGNOU, Delhi, and his MCA degree from Cochin University of Science and Technology, Kerala (C-DAC Trivandrum).

    He was introduced to Big Data technologies in early 2010, when he was asked to perform a proof of concept (POC) on using Big Data technologies to store and process logs. He was also given another project, in which he was required to store and process huge binary files with variable headers. At this time, he started configuring, setting up, and testing Hadoop and HBase clusters and writing sample code for them. After performing a successful POC, he initiated serious development using Java REST and SOAP web services, building a system that stored and processed logs in Hadoop via web services and then stored these logs in HBase using a homemade schema, reading the data back through HBase APIs and HBase-Hive mapped queries. Shashwat successfully implemented the project, and then moved on to work on huge binary files of 1 to 3 TB in size, processing the headers and storing the metadata in HBase and the files on HDFS.

    Shashwat started his career as a software developer at C-DAC Cyber Forensics, Trivandrum, building mobile-related software for forensics analysis. He then moved to Genilok Computer Solutions, where he worked on cluster computing, HPC technologies, and web technologies. After this, he moved from Trivandrum to Bangalore and joined PointCross, where he started working with Big Data technologies, developing software using Java, web services, and Big Data platforms. At PointCross, he worked on many projects revolving around Big Data technologies such as Hadoop, HBase, Hive, Pig, Sqoop, and Flume. From here, he moved to HCL Infosystems Ltd. to work on the UIDAI project, one of the most prestigious projects in India, which provides a unique identification number to every resident of India. Here, he worked on technologies such as HBase, Hive, Hadoop, Pig, and Linux: writing scripts, managing HBase and Hadoop clusters, automating tasks and processes, and building dashboards for monitoring clusters.

    Currently, he is working with Cognilytics, Inc. on Big Data technologies, HANA, and other high-performance technologies.

    You can find out more about him at https://github.com/shriparv and http://helpmetocode.blogspot.com. You can connect with him on LinkedIn at http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9. You can also e-mail him at .

    Shashwat has worked as a reviewer on the book Pig Design Pattern, Pradeep Pasupuleti, Packt Publishing. He also contributed to his college magazine, InfinityTech, as an editor.

    Acknowledgments

    First, I would like to thank a few people from Packt Publishing: Kevin for encouraging me to write this book, Prachi for assisting and guiding me throughout the writing process, Pankaj for helping me out in technical editing, and all other contributors to this book.

    I would like to thank all the developers, contributors, and forums of Hadoop, HBase, and Big Data technologies for giving the industry such awesome technologies and contributing to it continuously. Thanks to Lars and Noll for their contribution towards HBase and Hadoop, respectively.

    I would like to thank some people who helped me to learn from life, including teachers at my college—Roshani ma'am (Principal), Namboothari sir, Santosh sir, Manjush ma'am, Hudlin Leo ma'am, and my seniors Jitesh sir, Nilanchal sir, Vaidhath sir, Jwala sir, Ashutosh sir, Anzar sir, Kishor sir, and all my friends in Batch 6. I dedicate this book to my friend, Nikhil, who is not in this world now. Special thanks to Ratnakar Mishra and Chandan Jha for always being with me and believing in me. Thanks also go out to Vineet, Shashi bhai, Shailesh, Rajeev, Pintu, Darshna, Priya, Amit, Manzar, Sunil, Ashok bhai, Pradeep, Arshad, Sujith, Vinay, Rachana, Ashwathi, Rinku, Pheona, Lizbeth, Arun, Kalesh, Chitra, Fatima, Rajesh, Jasmin, and all my friends from C-DAC Trivendrum college. I thank all my juniors, seniors, and friends in college. Thanks to all my colleagues at C-DAC Cyber Forensic: Sateesh sir, my project manager; Anwer Reyaz. J, an enthusiast who is always encouraging; Bibin bhai sahab; Ramani sir; Potty sir; Bhadran sir; Thomas sir; Satish sir; Nabeel sir; Balan sir; Abhin sir; and others. I would also like to thank Mani sir; Raja sir; my friends and teammates: Maruthi Kiran, Chethan, Alok, Tariq, Sujatha, Bhagya, and Mukesh; Sri Gopal sir, my team leader; and all my other colleagues from PointCross. I thank Ramesh Krishnan sir, Manoj sir, Vinod sir, Nand Kishor sir, and my teammates Varun bhai sahab, Preeti Gupta, Kuldeep bhai sahab, and all my colleagues at HCL Infosystems Ltd. and UIDAI. I would also like to thank Satish sir; Sudipta sir; my manager, Atul sir; Pradeep; Nikhil; Mohit; Brijesh; Kranth; Ashish Chopara; Sudhir; and all my colleagues at Cognilytics, Inc.

    Last but not least, I would like to thank papa, Dr. Rameshwar Dwivedi; mummy, Smt. Rewa Dwivedi; bhai, Vishwas Priambud; sister-in-law, Ragini Dwivedi; sweet sister, Bhumika; brother-in-law, Chandramauli Dwivedi; and the new members of my family, Vasu and Atmana.

    If I missed any names, it does not mean that I am not thankful to them; they are all in my heart, and I am thankful to everyone who has come into my life and left their mark. Also, the thanks are not in any particular order.

    About the Reviewers

    Ashutosh Bijoor (Ash) is Chief Technology Officer at Accion Labs India Private Limited. He has over 20 years of experience in the technology industry with customers ranging from start-ups to large multinationals in a wide range of industries, including high tech, engineering, software, insurance, banking, chemicals, pharmaceuticals, healthcare, media, and entertainment. He is experienced in leading and managing cross-functional teams through an entire product development life cycle.

    Ashutosh is skilled in emerging technologies, software architectures, framework design, and agile process definition. He has implemented enterprise solutions as well as commercial products in domains such as Big Data, business intelligence, graphics and image processing, sound and video processing, and advanced text search and analytics.

    His e-mail ID is <ashutosh.bijoor@accionlabs.com>. You can also visit his website at http://bijoor.me.

    Chhavi Gangwal is currently associated with Impetus Infotech (India) Pvt. Ltd. as a technical lead. With over 7 years of experience in the IT industry, she has worked on various dimensions of social media and the Web and witnessed the rise of Big Data firsthand.

    Presently, Chhavi is leading the development of Kundera, a JPA 2.0-compliant object-datastore mapping library for NoSQL data stores. She is also actively involved in the product management and development of a multitude of Big Data tools. Apart from a working knowledge of several NoSQL data stores, Java, PHP, and different JavaScript frameworks, her passion lies in product design and learning the latest technologies. Connect with Chhavi at https://www.linkedin.com/profile/view?id=58308893.

    Nitin Pawar started his career as a release engineer with Veritas Systems, and so the quality of software systems has always been the main goal in his approach to work. He was lucky to work in multiple profiles at companies such as Yahoo!, where, over almost 5 years, he learned a lot about the Hadoop ecosystem. After this, he worked with start-ups in the analytics and Big Data domains, helping them design backend analytics infrastructures and platforms.

    He enjoys solving problems and helping others facing technical issues. Reviewing this book gave him a better understanding of the HBase system, and he hopes that the readers will like it too.

    He has also reviewed the book Securing Hadoop, Sudheesh Narayanan, and the video Building Hadoop Clusters [Video], Sean Mikha, both by Packt Publishing.

    Jing Song has been working in the software industry as an engineer for more than 14 years since she graduated. She enjoys solving problems and learning about new technologies in the Computer Science domain. Her interests and experience span multiple tiers, from web frontend GUIs to middleware, and from middleware to backend SQL RDBMS and NoSQL data storage. In the last 5 years, she has mainly focused on enterprise application performance and cloud computing. Jing currently works for Apple as a tech lead, leading various Java applications from design to implementation and performance tuning.

    Arun Vasudevan is a technical lead at Accion Labs India Private Limited. He specializes in business analytics and visualization and has worked on solutions in various industry verticals, including insurance, telecom, and retail. He develops applications on Big Data technologies, including the Hadoop stack, Cloud technologies, and NoSQL databases, and also has expertise in cloud infrastructure setup and management using OpenStack and the AWS APIs.

    Arun is skilled in Java J2EE, JavaScript, relational databases, NoSQL technologies, and visualization using custom-built JavaScript visualization tools such as D3JS. Arun manages a team that delivers business analytics and visualization solutions.

    His e-mail address is <arun.vasudevan@accionlabs.com>. You can also visit his LinkedIn account at https://www.linkedin.com/profile/view?id=40201159.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    I would like to thank god for giving me this opportunity. I dedicate this book to baba, dadi, nana, and nani.

    Preface

    This book provides a top-down approach to learning HBase that will be useful for both novices and experts. You will start with configuration and move through coding, maintenance, and troubleshooting: a kind of all-in-one HBase knowledge bank. It is a step-by-step guide that will help you work on HBase, covering day-to-day HBase administration activities as well as the implementation of Hadoop plus an HBase cluster setup from the ground up. The book covers a complete list of use cases, with explanations, for implementing HBase as an effective Big Data tool, and it will also help you understand the layout and structure of HBase. There are lots of books on HBase available on the market, but each lacks something: some focus more on configuration and some on coding. This book provides a start-to-end approach that will be useful to anyone, from a person with zero knowledge of HBase to a person proficient in it. It is a complete guide to HBase administration and development, with real-time scenarios and an operation guide.

    This book will provide an understanding of what HBase is, where it came from, who is involved, why you should consider using it, why people are using it, when to use it, and how to use it. It gives overall information about the HBase ecosystem: it's more of an HBase-confusion-buster book, a book to read and then apply in real life. The book has in-depth theory and practical examples of HBase features, an approach that clears doubts on Hadoop and HBase. It provides complete guidance on the configuration, management, and troubleshooting of HBase clusters and their operations. The book targets both the administration and development aspects of HBase: administration with troubleshooting and setup, and development with the client and server APIs. It also enables you to design schemas, code in Java, and write shell scripts to work with HBase.

    What this book covers

    Chapter 1, Understanding the HBase Ecosystem, introduces HBase in detail and discusses its features, its evolution, and its architecture. We will compare HBase with traditional databases and look at its add-on features, its various underlying components, and its uses in the industry.

    Chapter 2, Let's Begin with HBase, deals with the HBase components in detail: their internal architecture, the communication between them, and how HBase provides scalability. It also covers the HBase reading and writing cycle, HBase housekeeping tasks, region-related operations, the different components needed for an HBase cluster configuration, and some basic OS tuning.
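The write cycle this chapter describes (durability through the Write-Ahead Log, buffering in the MemStore, flushes to immutable HFiles) can be caricatured in a few lines. The sketch below is a deliberately simplified Python toy; the class name and flush threshold are invented for illustration and do not reflect real HBase internals or defaults.

```python
class ToyRegionServer:
    """Caricature of the HBase write path: append to the WAL first,
    then buffer in the MemStore; flush to an immutable "HFile" when full."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the Write-Ahead Log on HDFS
        self.memstore = {}     # in-memory buffer, sorted only on flush
        self.hfiles = []       # each flush produces one immutable file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # durability first: log the edit
        self.memstore[key] = value      # then update the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # HFiles are written sorted by key; this ordering is what makes
        # scans and compactions efficient.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

rs = ToyRegionServer()
for k in ["b", "a", "c", "d"]:
    rs.put(k, k.upper())
print(rs.hfiles)    # one flushed, key-sorted "HFile"
print(rs.memstore)  # the last edit is still buffered in memory
```

Compaction, in this picture, is the step that later merges several of those small sorted files into fewer larger ones.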

    Chapter 3, Let's Start Building It, lets us proceed with building an HBase cluster. In this chapter, you will find information on the various components and where to get them from. We will start configuring the cluster, considering all the parameters and optimization tweaks while building the Hadoop and HBase cluster. One section of the chapter focuses on the various component-level and OS-level parameters for an optimized cluster.

    Chapter 4, Optimizing the HBase/Hadoop Cluster, teaches us to optimize the HBase cluster for the production environment and to run cluster troubleshooting tasks. We will look at optimizing hardware, OS, software, and network parameters. This chapter will also teach us how to optimize Hadoop for a better HBase.

    Chapter 5, The Storage, Structure Layout, and Data Model of HBase, discusses HBase's data model and its various data model operations for fetching and writing data in HBase tables. We will also consider some use cases in order to design schema in HBase.
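The data model operations this chapter covers (Get, Put, Scan, Delete) all hinge on rows being stored in lexicographic row-key order, which is also why the chapter's row key and composite key design topics matter: a Scan is a range read over that ordering. Here is a small hypothetical Python sketch of that idea; it is not the HBase API, and the `user#`-style composite keys are invented examples.

```python
import bisect

class ToyStore:
    """Rows kept sorted by row key, as in HBase. Scan is a range read
    over [start_row, stop_row); Delete removes a whole row here, a
    simplification of HBase's tombstone-based delete."""

    def __init__(self):
        self.keys = []   # row keys, kept in sorted order
        self.data = {}

    def put(self, row, value):
        if row not in self.data:
            bisect.insort(self.keys, row)
        self.data[row] = value

    def delete(self, row):
        if row in self.data:
            del self.data[row]
            self.keys.remove(row)

    def scan(self, start_row, stop_row):
        # stop_row is exclusive, matching HBase's Scan convention.
        lo = bisect.bisect_left(self.keys, start_row)
        hi = bisect.bisect_left(self.keys, stop_row)
        return [(k, self.data[k]) for k in self.keys[lo:hi]]

s = ToyStore()
for row in ["user#3", "user#1", "user#2", "item#9"]:
    s.put(row, row)
s.delete("user#2")
print(s.scan("user#", "user#~"))  # only user rows, in key order
```

This is why a composite key that puts the most frequently filtered attribute first turns common queries into cheap contiguous scans.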

    Chapter 6, HBase Cluster Maintenance and Troubleshooting, covers all aspects of HBase cluster management, operation, and maintenance. Once a cluster is built and in operation, we need to look after it, continuously tune it, and troubleshoot it in order to have a healthy HBase cluster. We will also study the commands available in the HBase and Hadoop shells.

    Chapter 7, Scripting in HBase, explains an automation process using HBase and shell scripts. We will learn to write scripts as an administrator or developer to automate various data-model-related tasks. We will also read about various backup and restore options available in HBase and how to perform them.

    Chapter 8, Coding HBase in Java, teaches Java coding for HBase. We will start with basic Java coding and learn about the Java APIs available for client requests. You will also learn to build a basic Java client that can be used to contact an HBase cluster for various operations.

    Chapter 9, Advance Coding in Java for HBase, focuses in more depth on Java coding for HBase. It takes a more detailed look at all the different kinds of APIs, classes, methods,
