This document summarizes Facebook's use cases and architecture for integrating Apache Hive and HBase. It discusses loading data from Hive into HBase tables using INSERT statements, querying HBase tables from Hive using SELECT statements, and maintaining low latency access to dimension tables stored in HBase while performing analytics on fact data stored in Hive. The architecture involves writing a storage handler and SerDe to map between the two systems and executing Hive queries by generating MapReduce jobs that read from or write to HBase.
Hadoop, HBase and Hive - Bay Area Hadoop User Group
2. Agenda
- Use Cases
- Architecture
- Storage Handler
- Load via INSERT
- Query Processing
- Bulk Load
- Q & A
Facebook
3. Motivations
- Data, data, and more data: 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009 (about 8x increase per year)
- Queries, queries, and more queries: more than 200 unique users querying per day; 7500+ queries per day on the production cluster, a mixture of ad-hoc and ETL/reporting queries
- They want it all and they want it now: users expect faster response time on fresher data; sampled subsets aren't always good enough
4. How Can HBase Help?
- Replicate dimension tables from transactional databases with low latency and without sharding (fact data can stay in Hive since it is append-only)
- Only move changed rows: a "full scrape" is too slow and doesn't scale as data keeps growing
- Hive by itself is not good at row-level operations
- Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
- Multiversioning for snapshot consistency?
5. Use Case 1: HBase As ETL Data Target
(Diagram: source files/tables feed Hive, which writes into HBase via INSERT ... SELECT ...)
6. Use Case 2: HBase As Data Source
(Diagram: Hive combines HBase tables with other files/tables via SELECT ... JOIN ... GROUP BY ... to produce query results)
7. Use Case 3: Low Latency Warehouse
(Diagram: HBase receives continuous updates and periodic loads alongside other files/tables; Hive queries span both)
13. Column Mapping
- The first column in the table is always the row key
- Other columns can be mapped to either an HBase column (any Hive type) or an HBase column family (must be MAP type in Hive)
- Multiple Hive columns can map to the same HBase column or family
- Limitations: currently no control over type mapping (always string in HBase); currently no way to map the HBase timestamp attribute
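The mapping described above is declared when the table is created. A minimal sketch of the DDL, following the Hive/HBase integration wiki linked at the end of this deck (table, column, and family names here are illustrative):

```sql
-- ":key" maps the first Hive column to the HBase row key;
-- "info:name" and "info:notes" map to individual HBase columns;
-- "attrs:" (a family with no qualifier) maps to a Hive MAP column.
CREATE TABLE hbase_users(
  userid STRING,
  name STRING,
  notes STRING,
  attrs MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,info:name,info:notes,attrs:')
TBLPROPERTIES ('hbase.table.name' = 'users');
```

The one-to-one column entries and the family-to-MAP entry can be mixed freely in a single mapping string.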
14. Load Via INSERT
INSERT OVERWRITE TABLE users SELECT * FROM ...;
- The Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
- HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
- Multiple rows with the same key -> only one row written
- Limitations: no write atomicity yet; no way to delete rows; write parallelism is query-dependent (map vs. reduce)
15. Map-Reduce Job for INSERT
(Diagram: map-reduce data flow into HBase; image from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
17. Query Processing
SELECT name, notes FROM users WHERE userid='xyz';
- Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
- HBase determines the splits (one per table region)
- HBaseSerDe produces lazy rows/maps for RowResults
- Column selection is pushed down
- Any SQL can be used (join, aggregation, union, ...)
- Limitations: currently no filter pushdown; how do we achieve locality?
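Because the HBase-backed table behaves like any other table in Hive's query plan, it can be joined against plain Hive tables. A sketch of the "any SQL" point above (table and column names are illustrative):

```sql
-- Join the HBase-backed dimension table against an append-only
-- Hive fact table, then aggregate; Hive compiles this into
-- map/reduce jobs whose map tasks scan HBase regions.
SELECT u.name, COUNT(*) AS purchases
FROM hbase_users u
JOIN fact_purchases f ON (u.userid = f.userid)
GROUP BY u.name;
```

Only the name and notes columns requested by a query are fetched from HBase, per the column-selection pushdown noted above.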
18. Metastore Integration
- DDL can be used to create metadata in Hive and HBase simultaneously and consistently
- CREATE EXTERNAL TABLE: registers an existing HBase table
- DROP TABLE: drops the HBase table too, unless the table was created as EXTERNAL
- Limitations: no two-phase commit for DDL operations; ALTER TABLE is not yet implemented; partitioning is not yet defined; no secondary indexing
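Registering a pre-existing HBase table uses the same storage-handler DDL plus the EXTERNAL keyword, which also changes the DROP TABLE semantics noted above. A sketch (names are illustrative):

```sql
-- EXTERNAL: Hive only registers metadata for an HBase table that
-- already exists; DROP TABLE removes the Hive metadata but leaves
-- the underlying HBase table intact.
CREATE EXTERNAL TABLE hbase_users_ext(
  userid STRING,
  name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');
```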
19. Bulk Load
Ideally...
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT ...;
But for now, you have to do some work and issue multiple Hive commands:
- Sample the source data for range partitioning and save the sampling results to a file
- Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts the data, producing a large number of region files)
- Import the HFiles into HBase (HBase can merge files if necessary)
20. Range Partitioning During Sort
(Diagram: TotalOrderPartitioner splits the sorted keys into ranges A-G, H-Q, and R-Z at boundary keys H and R; loadtable.rb imports the resulting files into HBase)
21. Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges. Sampling a single bucket yields about 5000 user_ids; keeping every 501st row of the sorted sample produces roughly 9 evenly spaced boundary keys.

select user_id
from (
  select user_id
  from hive_user_table
  tablesample(bucket 1 out of 1000 on user_id) s
  order by user_id
) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
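row_sequence() is not a Hive built-in; the bulk-load wiki page cited at the end of this deck registers it from Hive's contrib jar before running the sampling query. A sketch (the jar path is illustrative):

```sql
-- Register the row-numbering UDF used by the sampling query above.
ADD JAR /path/to/hive_contrib.jar;
CREATE TEMPORARY FUNCTION row_sequence AS
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
```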
22. Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, ...
from hive_user_table
cluster by user_id;
23. Deployment
- Latest Hive trunk (will be in Hive 0.6.0)
- Requires Hadoop 0.20+
- Tested with HBase 0.20.3 and ZooKeeper 3.2.2
- 20-node hbtest cluster at Facebook
- No performance numbers yet; currently setting up tests with about 6 TB (gz compressed)
24. Questions?
[email_address]
[email_address]
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
Special thanks to Samuel Guo for the early versions of the integration code
25. Hey, What About HBQL?
- HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables; it is not intended for heavy-duty SQL processing such as joins and aggregations
- HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs