This document summarizes Facebook's use cases and architecture for integrating Apache Hive and HBase. It discusses loading data from Hive into HBase tables using INSERT statements, querying HBase tables from Hive using SELECT statements, and maintaining low latency access to dimension tables stored in HBase while performing analytics on fact data stored in Hive. The architecture involves writing a storage handler and SerDe to map between the two systems and executing Hive queries by generating MapReduce jobs that read from or write to HBase.
Hadoop, HBase and Hive - Bay Area Hadoop User Group
2. Agenda
- Use Cases
- Architecture
- Storage Handler
- Load via INSERT
- Query Processing
- Bulk Load
- Q & A
Facebook
3. Motivations
- Data, data, and more data: 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009 (about 8x increase per year)
- Queries, queries, and more queries: more than 200 unique users querying per day; 7500+ queries per day on the production cluster, a mixture of ad-hoc and ETL/reporting queries
- They want it all and they want it now: users expect faster response time on fresher data; sampled subsets aren't always good enough
4. How Can HBase Help?
- Replicate dimension tables from transactional databases with low latency and without sharding (fact data can stay in Hive since it is append-only)
- Only move changed rows: a "full scrape" is too slow and doesn't scale as data keeps growing
- Hive by itself is not good at row-level operations
- Integrate into Hive's map/reduce query execution plans for full parallel distributed processing
- Multiversioning for snapshot consistency?
5. Use Case 1: HBase As ETL Data Target
(Diagram: source files/tables feed Hive, which writes into HBase via INSERT ... SELECT ...)
6. Use Case 2: HBase As Data Source
(Diagram: Hive combines HBase tables with other files/tables via SELECT ... JOIN ... GROUP BY ... to produce query results)
7. Use Case 3: Low Latency Warehouse
(Diagram: HBase receives continuous updates and periodic loads alongside other files/tables; Hive queries span both)
13. Column Mapping
- The first column in the table is always the row key
- Other columns can be mapped to either an HBase column (any Hive type) or an HBase column family (must be MAP type in Hive)
- Multiple Hive columns can map to the same HBase column or family
- Limitations: currently no control over type mapping (always string in HBase); currently no way to map the HBase timestamp attribute
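The mapping described above is declared when the table is created. A minimal sketch of the DDL, following the Hive/HBase integration wiki linked at the end of this deck (table, column, and family names here are illustrative):

```sql
-- ":key" maps the first Hive column to the HBase row key;
-- "info:name" and "info:notes" map to individual HBase columns;
-- "attrs:" (a family with no qualifier) maps to a Hive MAP column.
CREATE TABLE hbase_users(
  userid STRING,
  name STRING,
  notes STRING,
  attrs MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,info:name,info:notes,attrs:')
TBLPROPERTIES ('hbase.table.name' = 'users');
```

The one-to-one column entries and the family-to-MAP entry can be mixed freely in a single mapping string.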
14. Load Via INSERT
INSERT OVERWRITE TABLE users SELECT * FROM ...;
- The Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
- HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
- Multiple rows with the same key -> only one row written
- Limitations: no write atomicity yet; no way to delete rows; write parallelism is query-dependent (map vs. reduce)
15. Map-Reduce Job for INSERT
(Diagram: map-reduce data flow into HBase; image from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
17. Query Processing
SELECT name, notes FROM users WHERE userid='xyz';
- Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
- HBase determines the splits (one per table region)
- HBaseSerDe produces lazy rows/maps for RowResults
- Column selection is pushed down
- Any SQL can be used (join, aggregation, union, ...)
- Limitations: currently no filter pushdown; how do we achieve locality?
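Because the HBase-backed table behaves like any other table in Hive's query plan, it can be joined against plain Hive tables. A sketch of the "any SQL" point above (table and column names are illustrative):

```sql
-- Join the HBase-backed dimension table against an append-only
-- Hive fact table, then aggregate; Hive compiles this into
-- map/reduce jobs whose map tasks scan HBase regions.
SELECT u.name, COUNT(*) AS purchases
FROM hbase_users u
JOIN fact_purchases f ON (u.userid = f.userid)
GROUP BY u.name;
```

Only the name and notes columns requested by a query are fetched from HBase, per the column-selection pushdown noted above.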
18. Metastore Integration
- DDL can be used to create metadata in Hive and HBase simultaneously and consistently
- CREATE EXTERNAL TABLE: registers an existing HBase table
- DROP TABLE: drops the HBase table too, unless the table was created as EXTERNAL
- Limitations: no two-phase commit for DDL operations; ALTER TABLE is not yet implemented; partitioning is not yet defined; no secondary indexing
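Registering a pre-existing HBase table uses the same storage-handler DDL plus the EXTERNAL keyword, which also changes the DROP TABLE semantics noted above. A sketch (names are illustrative):

```sql
-- EXTERNAL: Hive only registers metadata for an HBase table that
-- already exists; DROP TABLE removes the Hive metadata but leaves
-- the underlying HBase table intact.
CREATE EXTERNAL TABLE hbase_users_ext(
  userid STRING,
  name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');
```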
19. Bulk Load
Ideally...
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT ...;
But for now, you have to do some work and issue multiple Hive commands:
- Sample the source data for range partitioning and save the sampling results to a file
- Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts the data, producing a large number of region files)
- Import the HFiles into HBase (HBase can merge files if necessary)
20. Range Partitioning During Sort
(Diagram: TotalOrderPartitioner splits the sorted keys into ranges A-G, H-Q, and R-Z at boundary keys H and R; loadtable.rb imports the resulting files into HBase)
21. Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges. Sampling a single bucket yields about 5000 user_ids; keeping every 501st row of the sorted sample produces roughly 9 evenly spaced boundary keys.

select user_id
from (
  select user_id
  from hive_user_table
  tablesample(bucket 1 out of 1000 on user_id) s
  order by user_id
) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
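row_sequence() is not a Hive built-in; the bulk-load wiki page cited at the end of this deck registers it from Hive's contrib jar before running the sampling query. A sketch (the jar path is illustrative):

```sql
-- Register the row-numbering UDF used by the sampling query above.
ADD JAR /path/to/hive_contrib.jar;
CREATE TEMPORARY FUNCTION row_sequence AS
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
```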
22. Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, ...
from hive_user_table
cluster by user_id;
23. Deployment
- Latest Hive trunk (will be in Hive 0.6.0)
- Requires Hadoop 0.20+
- Tested with HBase 0.20.3 and ZooKeeper 3.2.2
- 20-node hbtest cluster at Facebook
- No performance numbers yet; currently setting up tests with about 6 TB (gz compressed)
24. Questions?
[email_address]
[email_address]
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
Special thanks to Samuel Guo for the early versions of the integration code
25. Hey, What About HBQL?
- HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables; it is not intended for heavy-duty SQL processing such as joins and aggregations
- HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs