Hive/HBase Integration, or: MaybeSQL? April 2010, John Sichi, Facebook
Agenda
- Use Cases
- Architecture
- Storage Handler
- Load via INSERT
- Query Processing
- Bulk Load
- Q & A
Motivations
- Data, data, and more data: 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009 (about 8x increase per year)
- Queries, queries, and more queries: more than 200 unique users querying per day; 7500+ queries per day on the production cluster, a mixture of ad-hoc and ETL/reporting queries
- They want it all and they want it now: users expect faster response times on fresher data, and sampled subsets aren't always good enough
How Can HBase Help?
- Replicate dimension tables from transactional databases with low latency and without sharding (fact data can stay in Hive since it is append-only)
- Only move changed rows: a "full scrape" is too slow and doesn't scale as data keeps growing, and Hive by itself is not good at row-level operations
- Integrate into Hive's map/reduce query execution plans for fully parallel distributed processing
- Multiversioning for snapshot consistency?
Use Case 1: HBase As ETL Data Target (diagram: Hive runs INSERT ... SELECT ... from source files/tables into HBase)
Use Case 2: HBase As Data Source (diagram: Hive runs SELECT ... JOIN ... GROUP BY ... over the HBase table and other files/tables to produce a query result)
Use Case 3: Low Latency Warehouse (diagram: continuous updates flow into HBase, periodic loads into other files/tables, and Hive queries run over both)
HBase Architecture (diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
Hive Architecture (diagram)
All Together Now! (diagram)
Hive CLI With HBase
Minimum configuration needed:
hive --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar -hiveconf hbase.zookeeper.quorum=zk1,zk2,...
hive> create table ...
Storage Handler
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
Column Mapping
- The first column in the table is always the row key
- Other columns can be mapped to either an HBase column (any Hive type) or an HBase column family (must be a MAP type in Hive)
- Multiple Hive columns can map to the same HBase column or family
- Limitations: currently no control over type mapping (always string in HBase); currently no way to map the HBase timestamp attribute
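The mapping rules above can be sketched in a few lines. This is an illustrative model, not the actual HBaseSerDe: it takes a Hive row under the example mapping "small:name,small:email,large:notes", uses the first column as the row key, and emits (family, qualifier, value) cells with every value converted to a string, as the slides describe. The sample row values are hypothetical.

```python
# Sketch of column mapping: one Hive row -> HBase row key + cells.
# Assumes the mapping string from the CREATE TABLE example; all values
# become strings because that is the only supported type mapping today.

def row_to_cells(row, mapping="small:name,small:email,large:notes"):
    """Return (row_key, [(family, qualifier, value), ...]) for one Hive row."""
    specs = mapping.split(",")
    row_key = str(row[0])                 # first column is always the row key
    cells = []
    for spec, value in zip(specs, row[1:]):
        family, qualifier = spec.split(":")
        cells.append((family, qualifier, str(value)))   # always string in HBase
    return row_key, cells

key, cells = row_to_cells((42, "Alice", "alice@example.com", "likes HBase"))
```

Note that two Hive columns mapped to the same family simply produce two cells in that family, which is why multiple Hive columns may share a family.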
Load Via INSERT
INSERT OVERWRITE TABLE users SELECT * FROM ...;
- The Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
- HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
- Multiple rows with the same key -> only one row written
- Limitations: no write atomicity yet; no way to delete rows; write parallelism is query-dependent (map-only vs. map-reduce)
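The "multiple rows with same key -> only one row written" point can be modeled as key-addressed puts: each write for a row key replaces whatever was there. A caveat on this sketch: the slide only guarantees that one row survives, so treating the last-seen duplicate as the winner is an assumption for illustration.

```python
# Illustrative sketch: HBase writes are keyed by row key, so duplicate keys
# collapse to a single row. Which duplicate wins (last-seen here) is an
# assumption; the deck only states that one row is written per key.

def collapse_by_key(rows):
    """rows: iterable of (row_key, value); returns one value per key."""
    table = {}
    for key, value in rows:
        table[key] = value        # overwrite: later put replaces earlier one
    return table

result = collapse_by_key([("u1", "a"), ("u2", "b"), ("u1", "c")])
# exactly one row remains per distinct key
```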
Map-Reduce Job for INSERT (diagram from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png)
Map-Only Job for INSERT (diagram)
Query Processing
SELECT name, notes FROM users WHERE userid='xyz';
- Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
- HBase determines the splits (one per table region)
- HBaseSerDe produces lazy rows/maps for RowResults
- Column selection is pushed down
- Any SQL can be used (join, aggregation, union, ...)
- Limitations: currently no filter pushdown; how do we achieve locality?
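The "one split per table region" behavior can be sketched as follows: given the sorted start keys of a table's regions, each region's key range becomes one map-task input split. The region keys below are hypothetical, and this is a simplified model of what TableInputFormatBase derives from HBase region metadata.

```python
# Sketch of split generation: one input split per HBase region, so the
# number of map tasks tracks the number of regions. Keys are hypothetical;
# None marks the open-ended final region.

def region_splits(start_keys):
    """start_keys: sorted region start keys; returns [(start, end), ...]."""
    splits = []
    for i, start in enumerate(start_keys):
        end = start_keys[i + 1] if i + 1 < len(start_keys) else None
        splits.append((start, end))
    return splits

splits = region_splits(["", "h", "r"])    # 3 regions -> 3 map splits
```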
Metastore Integration
- DDL can be used to create metadata in Hive and HBase simultaneously and consistently
- CREATE EXTERNAL TABLE: register an existing HBase table
- DROP TABLE: drops the HBase table too, unless the table was created as EXTERNAL
- Limitations: no two-phase commit for DDL operations; ALTER TABLE is not yet implemented; partitioning is not yet defined; no secondary indexing
Bulk Load
Ideally... SET hive.hbase.bulk=true; INSERT OVERWRITE TABLE users SELECT ... ;
But for now, you have to do some work and issue multiple Hive commands:
- Sample the source data for range partitioning, and save the sampling results to a file
- Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts the data, producing a large number of region files)
- Import the HFiles into HBase (HBase can merge files if necessary)
Range Partitioning During Sort (diagram: TotalOrderPartitioner routes sorted keys into ranges A-G, H-Q, R-Z at boundary keys (H) and (R); loadtable.rb imports the resulting files into HBase)
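The partitioning step above can be sketched with a binary search over the sampled boundary keys: TotalOrderPartitioner routes each record to the reducer whose range contains its key, so every reducer emits one fully sorted, non-overlapping range (one HFile region file per range). The boundary keys below are hypothetical.

```python
# Sketch of total-order partitioning: N-1 sorted boundary keys define N
# contiguous ranges; each record key is assigned to the reducer owning its
# range via binary search. Mirrors TotalOrderPartitioner's role in the
# bulk-load sort; boundary keys are made up for illustration.

import bisect

def partition(key, boundaries):
    """boundaries: sorted list of N-1 split keys -> reducer index 0..N-1."""
    return bisect.bisect_right(boundaries, key)

boundaries = ["h", "r"]               # 2 boundaries -> 3 ranges / 3 reducers
assignments = [partition(k, boundaries) for k in ["a", "h", "m", "z"]]
```

Because assignment is by range rather than by hash, concatenating the reducers' sorted outputs in reducer order yields one globally sorted key space, which is what lets HBase adopt the files directly as regions.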
Sampling Query For Range Partitioning
Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id
from (select user_id
      from hive_user_table tablesample(bucket 1 out of 1000 on user_id) s
      order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
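The arithmetic behind that query checks out as follows: one bucket yields a sorted sample of 5000 user_ids, and keeping every row whose 1-based sequence number is divisible by 501 selects rows 501, 1002, ..., 4509, i.e. exactly 9 boundary keys for 10 nearly-equal ranges. The synthetic ids below stand in for real user_ids.

```python
# Verify the modulo-501 trick from the sampling query: a 5000-row sorted
# sample filtered on (row_sequence() % 501) == 0 yields exactly 9 boundary
# keys. Ids are synthetic, zero-padded so lexical order matches numeric order.

sample = sorted(f"user{i:04d}" for i in range(5000))   # one sorted bucket

boundaries = [uid for seq, uid in enumerate(sample, start=1) if seq % 501 == 0]
# rows 501, 1002, ..., 4509 -> 9 boundary keys for 10 ranges
```

Using 501 rather than 500 is what keeps the count at 9: a step of 500 would also select row 5000, producing a tenth (spurious) boundary.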
Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, ...
from hive_user_table
cluster by user_id;
Deployment
- Latest Hive trunk (will be in Hive 0.6.0); requires Hadoop 0.20+
- Tested with HBase 0.20.3 and ZooKeeper 3.2.2
- 20-node hbtest cluster at Facebook
- No performance numbers yet; currently setting up tests with about 6 TB (gz compressed)
Questions?
[email_address] [email_address]
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad
Special thanks to Samuel Guo for the early versions of the integration code
Hey, What About HBQL?
- HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables; it is not intended for heavy-duty SQL processing such as joins and aggregations
- HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
