Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hortonworks

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Tez
Bikas Saha @bikassaha

Apache Hadoop YARN and HDFS
• Flexible
– Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
• Efficient
– Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
• Shared
– Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.x
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• LOG STORE: Kafka
• STREAMING: Storm
• IN-MEMORY: Spark
• GRAPH: Giraph
• SAS: LASR, HPA
• ONLINE: HBase, Accumulo
• OTHERS
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management

Tez
• APIs and libraries to create data processing applications on YARN
• Customizable and adaptable DAG definition
• Orchestration framework to execute the DAG in a Hadoop cluster
• NOT a general purpose execution engine
Open Source
Apache Project

Tez – Goals
• Tez solves the hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain specific problems
• Tez instantiates the physical execution structure. App fills in logic and behavior
• API targets data processing specified as a data flow graph
• App
– Custom application logic
– Custom data format
– Custom data transfer technology
• Tez
– Distributed parallel execution
– Negotiating resources from the Hadoop framework
– Fault tolerance and recovery
– Shared library of ready-to-use components
– Built-in performance optimizations
– Hadoop Security

Tez – Adoption
• Apache Hive
– Most popular SQL-like interface for data in Hadoop
• Apache Pig
– Scripting language used in some of the largest Hadoop installations
• Apache Flink (Stratosphere project from TU Berlin)
– General purpose engine with language integrated data processing API
• Cascading + Scalding
– Language integrated data processing API in Java/Scala
• Commercial Products
– Datameer, Syncsort and others in progress

Tez – Performance benefits
• Apache Hive
– Order of magnitude improvement in performance
– Speed up mainly from flexible DAG definition and runtime graph reconfiguration
– Performance oriented orchestration layer and shared library components
[Figure: logical DAG for Hive TPC-DS Query 64]

Tez – Scale and Reliability
• Apache Pig
– Predominant number of data processing jobs at Yahoo with up to 5000 node clusters
– Multi-Petabyte jobs
– On track for using Pig with Tez for all production Pig jobs
– Already use Hive with Tez for large scale analytics
• Hortonworks customers
– All new customers default on Hive with Tez
• Cascading + Scalding
– Cascading 3.0 released with Tez integration
– Very promising results with beta users
http://scalding.io/2015/05/scalding-cascading-tez-♥/

Tez – DAG API
// Define DAG
DAG dag = DAG.create();
// Define Vertex
Vertex Scan1 = Vertex.create(Processor.class);
// Define Edge
Edge edge = Edge.create(Scan1, Partition1,
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);
// Connect them
dag.addVertex(Scan1).addEdge(edge)….
Defines the global logical processing flow
[Diagram: Scan1 feeds Partition1 and Scan2 feeds Partition2; both partitions feed Join. Edges are labeled Scatter-Gather.]
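The snippet on this slide is abbreviated, so the following is a minimal, self-contained sketch of the same idea rather than the Tez API. The classes `SketchDag`, `SketchEdge` and `Movement`, the processor names, and the choice of scatter-gather for every edge are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not Tez classes.
enum Movement { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

final class SketchEdge {
    final String from;
    final String to;
    final Movement movement;

    SketchEdge(String from, String to, Movement movement) {
        this.from = from;
        this.to = to;
        this.movement = movement;
    }
}

final class SketchDag {
    // Vertex name -> processor name, kept in insertion order.
    final Map<String, String> vertices = new LinkedHashMap<>();
    final List<SketchEdge> edges = new ArrayList<>();

    SketchDag addVertex(String name, String processor) {
        if (vertices.putIfAbsent(name, processor) != null) {
            throw new IllegalArgumentException("duplicate vertex: " + name);
        }
        return this;
    }

    SketchDag addEdge(String from, String to, Movement movement) {
        if (!vertices.containsKey(from) || !vertices.containsKey(to)) {
            throw new IllegalArgumentException("unknown vertex: " + from + " -> " + to);
        }
        edges.add(new SketchEdge(from, to, movement));
        return this;
    }

    // Names of the vertices whose output feeds the given vertex.
    List<String> inputsOf(String name) {
        List<String> sources = new ArrayList<>();
        for (SketchEdge edge : edges) {
            if (edge.to.equals(name)) {
                sources.add(edge.from);
            }
        }
        return sources;
    }

    // The two-scan join drawn on the slide; processor names are made up.
    static SketchDag scanPartitionJoin() {
        return new SketchDag()
            .addVertex("Scan1", "ScanProcessor")
            .addVertex("Scan2", "ScanProcessor")
            .addVertex("Partition1", "PartitionProcessor")
            .addVertex("Partition2", "PartitionProcessor")
            .addVertex("Join", "JoinProcessor")
            .addEdge("Scan1", "Partition1", Movement.SCATTER_GATHER)
            .addEdge("Scan2", "Partition2", Movement.SCATTER_GATHER)
            .addEdge("Partition1", "Join", Movement.SCATTER_GATHER)
            .addEdge("Partition2", "Join", Movement.SCATTER_GATHER);
    }
}
```

Calling `SketchDag.scanPartitionJoin().inputsOf("Join")` lists the two partition vertices, which is the logical flow the slide describes. Rejecting edges between unknown vertices is a design choice of this sketch, meant to mirror the idea that the DAG is validated before it runs.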

Tez – Logical DAG expansion at Runtime
[Diagram: the logical vertices Scan1, Scan2, Partition1, Partition2 and Join, each expanded into parallel tasks at runtime]
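As a rough illustration of what expansion means for a single logical edge, the sketch below enumerates task-to-task connections once the parallelism of both vertices is known. `EdgeKind` and `TaskExpansion` are hypothetical names; only the two routing patterns (every producer to every consumer for scatter-gather, i-th to i-th for one-to-one) are modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration; not part of the Tez API.
enum EdgeKind { SCATTER_GATHER, ONE_TO_ONE }

final class TaskExpansion {
    // Returns {sourceTask, destinationTask} index pairs for one logical edge.
    static List<int[]> connections(EdgeKind kind, int sourceTasks, int destinationTasks) {
        List<int[]> pairs = new ArrayList<>();
        switch (kind) {
            case SCATTER_GATHER:
                // Every source task sends one shard to every destination task.
                for (int s = 0; s < sourceTasks; s++) {
                    for (int d = 0; d < destinationTasks; d++) {
                        pairs.add(new int[] {s, d});
                    }
                }
                break;
            case ONE_TO_ONE:
                // The i-th source task feeds only the i-th destination task.
                if (sourceTasks != destinationTasks) {
                    throw new IllegalArgumentException("one-to-one needs equal parallelism");
                }
                for (int i = 0; i < sourceTasks; i++) {
                    pairs.add(new int[] {i, i});
                }
                break;
        }
        return pairs;
    }
}
```

With 3 source tasks and 2 destination tasks, a scatter-gather edge expands to 6 physical connections, which is why the physical graph can be much larger than the logical one.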

Tez – Task Composition
[Diagram: logical DAG in which V-A feeds V-B over Edge AB and V-C over Edge AC. Task A runs Processor-A with Output-1 and Output-3, Task B runs Processor-B with Input-2, and Task C runs Processor-C with Input-4.]
V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }
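The composition described above (a task is a processor plus the inputs and outputs contributed by its edges) can be sketched as follows. `RecordInput`, `RecordOutput` and `ComposedTask` are hypothetical and far simpler than Tez's real input, processor and output interfaces; records are plain strings purely for brevity.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical interfaces for illustration; Tez's real APIs are richer.
interface RecordInput {
    List<String> read();
}

interface RecordOutput {
    void write(String record);
}

final class ComposedTask {
    private final List<RecordInput> inputs;
    private final UnaryOperator<String> processor;
    private final List<RecordOutput> outputs;

    ComposedTask(List<RecordInput> inputs,
                 UnaryOperator<String> processor,
                 List<RecordOutput> outputs) {
        this.inputs = inputs;
        this.processor = processor;
        this.outputs = outputs;
    }

    // Reads every record from every input, applies the processor,
    // and writes the result to every output.
    void run() {
        for (RecordInput input : inputs) {
            for (String record : input.read()) {
                String result = processor.apply(record);
                for (RecordOutput output : outputs) {
                    output.write(result);
                }
            }
        }
    }
}
```

In this sketch, Task A from the slide would be a `ComposedTask` with no edge inputs and two outputs, while Task B would have one input and no edge outputs. Swapping an input or output implementation does not touch the processor, which is the point of the composable model.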

Tez – Composable Task Model
• Adopt: Hive Processor
– Inputs: HDFS Input, Remote File Server Input
– Outputs: HDFS Output, Local Disk Output
• Evolve: Custom Processor
– Inputs: HDFS Input, Remote File Server Input
– Outputs: HDFS Output, Local Disk Output
• Optimize: Custom Processor
– Inputs: RDMA Input, Native DB Input
– Outputs: Kafka Pub-Sub Output, Amazon S3 Output

Tez – Customizable Core Engine
[Diagram: Vertex-1 and Vertex-2 are started; each vertex's Vertex Manager starts tasks, which get a priority from the DAG Scheduler and a container from the Task Scheduler]
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start
• DAG Scheduler
– Determines priority of task
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks
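To make "determines when tasks in a vertex can start" concrete, here is a minimal sketch of one possible policy: release downstream tasks gradually as upstream tasks complete. The class `SlowStartPolicy` and its two threshold fractions are hypothetical and chosen for illustration; this is not the Tez `VertexManagerPlugin` interface.

```java
// Hypothetical policy for illustration; not the Tez vertex manager API.
final class SlowStartPolicy {
    private final double minFraction; // below this, no task may start
    private final double maxFraction; // at or above this, all tasks may start

    SlowStartPolicy(double minFraction, double maxFraction) {
        if (minFraction < 0 || maxFraction > 1 || minFraction > maxFraction) {
            throw new IllegalArgumentException("need 0 <= min <= max <= 1");
        }
        this.minFraction = minFraction;
        this.maxFraction = maxFraction;
    }

    // Number of this vertex's tasks that may start, given upstream progress.
    int tasksAllowedToStart(int completedSourceTasks, int totalSourceTasks, int totalTasks) {
        if (totalSourceTasks == 0) {
            return totalTasks; // nothing to wait for
        }
        double done = (double) completedSourceTasks / totalSourceTasks;
        if (done < minFraction) {
            return 0;
        }
        if (done >= maxFraction) {
            return totalTasks;
        }
        // Scale linearly between the two thresholds.
        double progress = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.floor(progress * totalTasks);
    }
}
```

Starting consumers before all producers finish lets data transfer overlap with upstream computation, while the lower threshold avoids holding containers that would sit idle waiting for input.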

Tez – Customizable core engine: graph reconfiguration
Hive – Dynamic Partition Pruning
[Diagram: Vertex 1 tasks send filtering values to the Input Initializer + Vertex Manager in the App Master, which applies the filter to prune Vertex 2 input partitions and reconfigures the vertex through the Vertex State Machine]
• Event Model
– Map tasks send data statistics events to the Reduce Vertex Manager
• Vertex Manager
– Pluggable application logic that understands the data statistics and can formulate the correct parallelism
– Advises the vertex controller on parallelism
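A minimal sketch of the parallelism side of this flow, assuming the statistics events carry output sizes in bytes: the manager accumulates what the map tasks report and advises a task count sized to a target amount of data per task. `ParallelismAdvisor`, its fields and the bytes-per-task target are hypothetical and do not come from Tez.

```java
// Hypothetical advisor for illustration; not a Tez class.
final class ParallelismAdvisor {
    private final int configuredTasks;       // upper bound set when the DAG was built
    private final long desiredBytesPerTask;  // assumed sizing target
    private long observedBytes;

    ParallelismAdvisor(int configuredTasks, long desiredBytesPerTask) {
        if (configuredTasks < 1 || desiredBytesPerTask < 1) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        this.configuredTasks = configuredTasks;
        this.desiredBytesPerTask = desiredBytesPerTask;
    }

    // Called once per statistics event sent by an upstream task.
    void onStatistics(long outputBytes) {
        observedBytes += outputBytes;
    }

    // Ceiling of observed bytes over the target, clamped to [1, configuredTasks].
    int advisedParallelism() {
        long needed = (observedBytes + desiredBytesPerTask - 1) / desiredBytesPerTask;
        return (int) Math.max(1, Math.min(configuredTasks, needed));
    }
}
```

The advisor only ever lowers the configured parallelism in this sketch, on the assumption that the configured value is a safe upper bound and the runtime statistics reveal how much of it is actually needed.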

Tez – Engineering optimizations
•Container re-use
•Support for user sessions
•Event-based control flow
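Container re-use can be pictured as a small pool: a finished task hands its container back, and the next task takes an idle container before a new one is requested. The sketch below is an assumption-laden illustration with a hypothetical `ContainerPool` class and integer container ids; it ignores locality, resource sizes and idle timeouts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool for illustration; not how Tez's task scheduler is implemented.
final class ContainerPool {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int allocated;

    // Returns an idle container id if one exists, otherwise allocates a new id.
    int acquire() {
        Integer reused = idle.pollFirst();
        if (reused != null) {
            return reused;
        }
        return allocated++;
    }

    // Called when a task finishes so its container can serve another task.
    void release(int containerId) {
        idle.addLast(containerId);
    }

    // Total number of containers ever allocated.
    int allocatedCount() {
        return allocated;
    }
}
```

Two tasks that run back to back share one container in this model, so the cost of requesting a container and starting a JVM is paid once instead of twice.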

Tez – Developer tools – Local Mode
• Fast prototyping – no Hadoop setup required
• Quick turnaround in unit testing – no overhead for allocating resources or launching JVMs
• Easy debuggability – Single JVM
• Scheduling / RPC invocations skipped

Tez – Developer Tools - Tez UI
• View status and progress of DAG/Vertex
• Diagnostics on failure
• View counters for DAG/Vertex
• View and compare counters across tasks/attempts
• View app specific information

Tez – Job Analysis tools - Swimlanes
• “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”

Tez – Job Analysis tools – Shuffle performance
• View shuffle performance between nodes

Tez – Hybrid Execution
• Run “compute where it’s most efficient”
• Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments
• Hive LLAP daemons can run initial scans, map joins etc. while large joins can run in YARN containers
• Best of both worlds and the pattern can be repeated for Apache Phoenix or your MPP database
[Diagram: Vertex 1, Vertex 2 and Vertex 3 spread across MPP daemons (Scan/Filter) and YARN containers (Join)]

Tez – How can you help?
• Improve core Tez infrastructure
– Apache open source project. Your use cases and code are welcome
• Port DB ideas to Hive+Tez world
– Evolve distributed query optimization and execution
• Use Tez hybrid execution
– Use the Hive-LLAP pattern to get the best of both worlds with your execution environment
• Integrate your project with Tez
– Get benefits similar to Hive, Pig, Cascading, Flink. Takes between 1-6 months depending on the complexity of the target project

Tez – How to contribute
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: dev@tez.apache.org
– User list: user@tez.apache.org
– Issues list: issues@tez.apache.org

Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://goo.gl/BL67o7
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha