Big SQL is a SQL engine for Hadoop that delivers high performance and scalability under high concurrency. Big SQL complements and integrates with Apache Hive for both data and metadata. An architecture that separates compute from storage allows Big SQL to support multiple open data formats natively. Until recently, Parquet provided a significant performance advantage over other data formats for SQL on Hadoop. The landscape changed when ORC became a top-level Apache project independent of Hive. Gone were the days of reading ORC files using slow, single-row-at-a-time Hive SerDes. The new vectorized APIs in the Apache ORC libraries make it possible to ingest ORC data at blazing speed. This talk is about the journey that led to ORC taking the crown of best-performing data format for Big SQL away from Parquet. We'll look under the hood at the architecture of the Big SQL ORC readers and at how to tune them. We'll share lessons learned in walking the fine line between maximizing performance at scale and avoiding the dreaded Java out-of-memory errors (OOMs). You'll learn the techniques that SQL engines use for fast data ingestion, so that you can leverage the full potential of Apache ORC in any application.
Speaker:
Gustavo Arocena, Big Data Architect, IBM