Big data and Kubernetes

Data Processing and Kubernetes
Anirudh Ramanathan (Google Inc.)

Agenda
• Basics of Kubernetes & Containers
• Motivation
• Apache Spark and HDFS on Kubernetes
• Data Processing Ecosystem
• Future Work

Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

Containers
• Repeatable Builds and Workflows
• Application Portability
• High Degree of Control over Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
[Figure: four containers, each with its own app and libs, sharing one kernel]

Statistics
• Based on Google's experience running containers in production for over 15 years
• Large OSS Community - 1200+ contributors and 45k+ commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+ stars

At a Glance
[Figure: cluster architecture. Users reach the master through the CLI, API, or UI; the master runs the apiserver, scheduler, controllers, and etcd; each node runs a kubelet.]

Nodes and Pods
• A pod is a set of co-located containers
• Created by declarative specification
• Each pod has a distinct IP address
• Volumes are local or network-attached
[Figure: two pods, each with containers and a volume; ports labeled 8080]
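
The declarative specification mentioned above can be sketched as a plain Python dictionary that mirrors a Pod manifest. This is only a minimal sketch: the helper `build_pod_manifest`, the pod name, the image names, and the volume name `shared-data` are hypothetical and not taken from the talk. The field names (`apiVersion`, `kind`, `metadata`, `spec`, `containers`, `volumes`, `emptyDir`, `volumeMounts`, `containerPort`) follow the Kubernetes Pod API.

```python
import json


def build_pod_manifest(name, images, port=8080):
    """Build a Pod manifest with co-located containers sharing one volume.

    The pod name, image names, and volume name are hypothetical examples.
    """
    containers = []
    for index, image in enumerate(images):
        containers.append({
            "name": f"{name}-c{index}",
            "image": image,
            # Containers in one pod share the pod's IP address,
            # so each container listens on its own port.
            "ports": [{"containerPort": port + index}],
            "volumeMounts": [{"name": "shared-data", "mountPath": "/data"}],
        })
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": containers,
            # emptyDir is a node-local volume; a network-attached
            # volume type could be declared here instead.
            "volumes": [{"name": "shared-data", "emptyDir": {}}],
        },
    }


manifest = build_pod_manifest("web", ["example/app:1.0", "example/sidecar:1.0"])
print(json.dumps(manifest, indent=2))
```

The printed JSON describes the desired pod; the cluster, not the caller, is responsible for making it exist.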

Controllers
● Drive current state -> desired state
● Act independently
● Recurring pattern in the system
Examples:
● Deployment
● DaemonSet
● StatefulSet
[Figure: control loop: observe -> diff -> act]
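
The observe, diff, act pattern can be sketched as a toy replica-count controller. This is an illustrative sketch only, not how any Kubernetes controller is implemented: `reconcile`, `run_controller`, and the `pod-N` naming scheme are hypothetical, and the "cluster" is just a Python list.

```python
def reconcile(desired_replicas, current_pods):
    """One observe -> diff -> act pass; returns the actions to take."""
    # observe: look at what currently exists
    observed = len(current_pods)
    # diff: compare observed state with desired state
    delta = desired_replicas - observed
    # act: emit create or delete actions to close the gap
    if delta > 0:
        existing = set(current_pods)
        names = []
        counter = 0
        while len(names) < delta:
            candidate = f"pod-{counter}"
            if candidate not in existing:
                names.append(candidate)
            counter += 1
        return [("create", name) for name in names]
    if delta < 0:
        # Remove the most recently listed pods first.
        return [("delete", pod) for pod in current_pods[delta:]]
    return []


def run_controller(desired_replicas, current_pods, max_rounds=10):
    """Repeat reconcile passes until no further action is needed."""
    pods = list(current_pods)
    for _ in range(max_rounds):
        actions = reconcile(desired_replicas, pods)
        if not actions:
            break
        for verb, pod in actions:
            if verb == "create":
                pods.append(pod)
            else:
                pods.remove(pod)
    return pods


print(run_controller(3, ["pod-0"]))
```

Because each pass only looks at current versus desired state, the loop can be restarted at any point and still converge, which is what lets controllers act independently.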

Why Kubernetes?
• Resource sharing between batch, serving and stateful workloads
– Streamlined developer experience
– Reduced operational costs
– Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
– Lots of add-on services: third-party logging, monitoring, and security tools
– For example, the Istio project, announced May 24 by IBM, Google and Lyft

Cluster Administration
[Figure labels: Namespaces, Resource Accounting, Logging, Monitoring, Resource Quota, Pluggable Authorization, Admission Control, RBAC]
• Launch Jobs as a particular user into a specific namespace
• RBAC and Namespace-level resource quotas
• Audit logging for clusters
• Several monitoring solutions to see node, cluster and pod-level statistics
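
The idea behind namespace-level resource quotas can be sketched as a simple admission check: a request is admitted only if the namespace's current usage plus the request stays within its limit. This is a hedged illustration, not the Kubernetes admission controller; the `Quota` type, the `admit` function, the `analytics` namespace, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    """CPU (millicores) and memory (MiB) for a namespace or a request."""
    cpu_millicores: int
    memory_mib: int


def admit(request, used, limit):
    """Return True if the request fits within the remaining quota."""
    fits_cpu = used.cpu_millicores + request.cpu_millicores <= limit.cpu_millicores
    fits_memory = used.memory_mib + request.memory_mib <= limit.memory_mib
    return fits_cpu and fits_memory


# Hypothetical namespace with a 4-core / 8 GiB quota, partly consumed.
limits = {"analytics": Quota(cpu_millicores=4000, memory_mib=8192)}
usage = {"analytics": Quota(cpu_millicores=3000, memory_mib=4096)}

small_job = Quota(cpu_millicores=500, memory_mib=1024)
large_job = Quota(cpu_millicores=2000, memory_mib=1024)

print(admit(small_job, usage["analytics"], limits["analytics"]))
print(admit(large_job, usage["analytics"], limits["analytics"]))
```

The small job fits within the remaining budget while the large job would exceed the CPU limit, so only the first is admitted.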

Spark on Kubernetes
• Beta recently announced at Spark Summit 2017
• Contributors include Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, and Red Hat, and the list is growing.
https://github.com/apache-spark-on-k8s/spark
[Figure: Spark Core runs on a cluster manager (Kubernetes, Standalone, YARN, Mesos) and underpins GraphX, SparkSQL, MLlib, and Streaming]

Spark on Kubernetes
Several ways of running Spark Jobs along with their dependencies on Kubernetes:
• Container images with dependencies baked in
• Files from GCS/S3/HDFS/HTTP
• File Staging Server: staged files and JARs
[Figure label: Kubernetes Integration]

Spark on Kubernetes
[Figure: Spark Core, the Kubernetes Scheduler Backend, and the Kubernetes Cluster; labels: configuration, new executors, remove executors]
• Resource Requests
• Authnz
• Communication with K8s
State of Spark
Timeline:
• Nov 2016: Design
• Dec 2016: Development Began
• Mar 2017: Alpha Release
• June 2017: Beta Release
Features shown: Cluster Mode, Client Mode, Java/Scala Support, Python/R support, Dynamic Allocation, Local File Staging, High Availability, Spark Streaming, Spark Shell, Spark SQL, GraphX, MLlib
[Legend: one marker = supported but untested; another = not yet supported. The per-feature markers are lost.]

HDFS on Kubernetes
• Community-driven effort to get HDFS running well on Kubernetes
• Uses a Helm chart to install onto a cluster
• Identified and solved several problems around data locality when running Spark Jobs
https://github.com/apache-spark-on-k8s/kubernetes-HDFS

HDFS on Kubernetes
[Figure: node A and node B hosting a Driver Pod, Executor Pod 1, Executor Pod 2, a Namenode Pod, Datanode Pod 1, and Datanode Pod 2; addresses shown: 10.0.0.2, 10.0.0.3, 10.0.1.2, 196.0.0.5, 196.0.0.6]
Source: HDFS on Kubernetes -- Lessons Learned [Public], Kimoon Kim (Pepperdata)
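
One way to picture the data locality problem is as an address mismatch. The sketch below assumes that executors are identified by pod IPs while HDFS block locations are reported as node addresses, so a direct comparison never matches and the pod must first be resolved to its node. That reading of the figure, the function `locality_level`, and the pod-to-node assignments are all assumptions for illustration; `NODE_LOCAL` and `ANY` are borrowed from Spark's task locality levels.

```python
def locality_level(executor_pod_ip, block_hosts, pod_to_node):
    """Classify locality of an executor relative to a block's datanodes.

    block_hosts: addresses of the nodes whose datanodes hold the block.
    pod_to_node: maps an executor pod IP to the address of its node.
    """
    # Direct comparison: a pod IP normally differs from every node
    # address, so this check alone would miss co-located data.
    if executor_pod_ip in block_hosts:
        return "NODE_LOCAL"
    # Resolve the pod to the node it runs on, then compare again.
    node = pod_to_node.get(executor_pod_ip)
    if node is not None and node in block_hosts:
        return "NODE_LOCAL"
    return "ANY"


# Hypothetical placement: which pod runs on which node is assumed here.
pod_to_node = {"10.0.0.3": "196.0.0.5", "10.0.1.2": "196.0.0.6"}
block_hosts = {"196.0.0.5"}

print(locality_level("10.0.0.3", block_hosts, pod_to_node))
print(locality_level("10.0.1.2", block_hosts, pod_to_node))
print(locality_level("10.0.0.3", block_hosts, {}))
```

With the mapping in place, the first executor is recognized as co-located with the block; without it, the same executor falls back to `ANY`.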

State of HDFS
• HDFS with basic data locality works!
• Future Work
– Remaining data locality issues -- rack locality, node preference, etc.
– Performance benchmarks and testing
– Kerberos support
– Namenode HA

Data Pipelines are complicated!
• Pipelines feature many other components.
• All of the below must run well on K8s
– Cassandra
– Kafka
– Zookeeper
– Elasticsearch, Kibana, etc.

Cassandra, Kafka and Zookeeper
• Cassandra: https://github.com/kubernetes/examples/tree/master/cassandra
• Kafka: https://github.com/kubernetes/contrib/tree/master/statefulsets/kafka
• Zookeeper: https://github.com/kubernetes/charts/tree/master/incubator/zookeeper
• zetcd: https://github.com/coreos/zetcd
• Elasticsearch Operator: https://github.com/upmc-enterprises/elasticsearch-operator

Future Work
• Batch Scheduling and Resource Sharing
– Priorities and Preemption
• Storage
– Local Storage Provisioning
• Extensibility
– Kubernetes CustomResources (formerly ThirdPartyResources)
– UI and Dashboard Improvements
• Cluster Federation and Multi-cloud deployments

Future Work
• Get involved! https://github.com/kubernetes/community/tree/master/sig-big-data
• SIG BigData weekly meeting open to all (10am PT on Wednesdays) via Zoom: http://zoom.us/my/sig.big.data
