Apache Iceberg
Scott Shaw
© 2021 Cloudera, Inc. All rights reserved.
What is Apache Iceberg?
• Efficient Table Format
– Hidden Partitioning
– Schema Evolution
– Time Travel
• Presto, Hive, Spark
• Created at Netflix (2017).
• Used at Adobe, Apple, LinkedIn, Experian
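Hidden partitioning means the table derives partition values from a regular column, so queries filter on the column itself rather than on a separate partition column. The sketch below illustrates the idea with a day-granularity transform in plain Python. It is a toy model, not the Iceberg API: the rows, `day_transform`, `partition_rows`, and `prune` are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Map a timestamp to whole days since 1970-01-01 UTC (toy partition transform)."""
    return (ts - EPOCH).days


def partition_rows(rows):
    """Group row ids by the partition value derived from each row's timestamp."""
    partitions = {}
    for row in rows:
        partitions.setdefault(day_transform(row["ts"]), []).append(row["id"])
    return partitions


def prune(partitions, start: datetime, end: datetime):
    """Return ids from only those partitions that can hold timestamps in [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return [i for day, ids in sorted(partitions.items()) if lo <= day <= hi for i in ids]


# Hypothetical sample rows; the writer never supplies a partition column.
rows = [
    {"id": 1, "ts": datetime(2021, 3, 1, 23, 59, tzinfo=timezone.utc)},
    {"id": 2, "ts": datetime(2021, 3, 2, 0, 1, tzinfo=timezone.utc)},
    {"id": 3, "ts": datetime(2021, 3, 2, 12, 0, tzinfo=timezone.utc)},
]
partitions = partition_rows(rows)

# A filter on the timestamp column is enough to skip the 2021-03-01 partition.
march_2 = prune(
    partitions,
    datetime(2021, 3, 2, tzinfo=timezone.utc),
    datetime(2021, 3, 2, 23, 59, tzinfo=timezone.utc),
)
```

The reader only ever filters on `ts`; the mapping from timestamp to partition stays inside the table layer, which is what keeps the partitioning "hidden".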
What are the Challenges?
• Data Scalability
• Atomicity
• Performance Degradation
• Complexity
• Object Stores
• Storage and Compute
• File System (Listing)
ARCHITECTURE
Architecture
[Diagram: compute engines (Spark, Presto), the Iceberg table format, and storage (HDFS, object store)]
Architecture
[Diagram: Snapshot (01) and Snapshot (02) each reference a Manifest List; each Manifest List references Manifests, which in turn reference data Files]
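The snapshot, manifest list, manifest, and file hierarchy can be sketched as a small in-memory model. This is a simplified illustration under stated assumptions, not Iceberg's implementation: the `Table`, `Snapshot`, and `Manifest` classes and the file names are hypothetical, and real metadata carries far more detail (statistics, partition data, delete tracking).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Manifest:
    """Immutable list of data files."""
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """Table state at one point in time; manifest_list plays the role of the manifest list."""
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]

    def files(self) -> List[str]:
        return [f for m in self.manifest_list for f in m.data_files]


class Table:
    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_id: Optional[int] = None

    def append(self, new_files: List[str]) -> int:
        """Commit new files as a new snapshot that reuses the parent's manifests."""
        parent = self.snapshots.get(self.current_id)
        inherited = parent.manifest_list if parent else ()
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, inherited + (Manifest(tuple(new_files)),))
        self.current_id = new_id  # the commit is a single pointer update
        return new_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Plan a read from metadata alone; no directory listing is involved."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid].files()


table = Table()
first = table.append(["a.parquet"])
second = table.append(["b.parquet", "c.parquet"])

current_files = table.scan()      # reads the latest snapshot
old_files = table.scan(first)     # "time travel" to the first snapshot
```

Because snapshots and manifests are never modified after they are written, an older snapshot stays readable after later commits, and a reader finds its files through metadata rather than by listing the file system.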
WORKING WITH ICEBERG
Initial Setup
• Catalogs
– Working with SQL
– System Information
Spark
Adding a Catalog
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
Creating a Table
CREATE TABLE local.db.table (id bigint, data string) USING iceberg
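The catalog name chosen in the configuration (`local` on this slide) is the same prefix used in the table identifier `local.db.table`. The sketch below assembles the same `spark-sql` arguments in Python to make that relationship explicit. The helper `spark_sql_command` and the `/tmp/warehouse` path are hypothetical; the package coordinates and property names are copied from the slide.

```python
import shlex

ICEBERG_PACKAGE = "org.apache.iceberg:iceberg-spark3-runtime:0.11.0"


def spark_sql_command(warehouse: str, catalog: str = "local") -> list:
    """Build the spark-sql argument list for a Hadoop-type Iceberg catalog."""
    conf = {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog":
            "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        # Every property of the extra catalog is keyed by its name.
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }
    args = ["spark-sql", "--packages", ICEBERG_PACKAGE]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


command = spark_sql_command("/tmp/warehouse")  # hypothetical warehouse path
print(shlex.join(command))  # shell-quoted, ready to paste into a terminal
```

Passing a different `catalog` value changes the property keys, and tables created in that session would then be addressed with that name as the leading part of the identifier.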
Hive
Add the jar file
add jar /path/to/iceberg-hive-runtime.jar;
Create an External Table
CREATE EXTERNAL TABLE table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_bucket/some_path/table_a';
REFERENCES
References
Apache Iceberg: https://iceberg.apache.org/
Project Nessie: https://projectnessie.org/
Hive/Iceberg Integration: https://github.com/ExpediaGroup/hiveberg
Partitioning: https://developer.ibm.com/technologies/artificial-intelligence/articles/the-why-and-how-of-partitioning-in-apache-iceberg/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform
Iceberg Explained: https://thenewstack.io/apache-iceberg-a-different-table-design-for-big-data/
Apache Iceberg Presentation for the St. Louis Big Data IDEA
