DataWorks Summit 2018, San Jose, CA
What’s the ‘Hadoop-la’ about Kubernetes?
Today’s Speakers
Anant Chintamaneni (@AnantCman) – Vice President of Products, BlueData Software
Nanda Vijaydev (@NandaVijaydev) – Sr. Director of Solutions, BlueData Software
Agenda
• Market Dynamics (with containers)
• What is Kubernetes – Why should you care?
• Requirements for Stateful Hadoop Clusters
• Key gaps in Kubernetes for running Hadoop
• What will it take to go from here to there?
• Q & A
The “Promised Land”
Single “container” platform for multiple application patterns….
Infra-agnostic workloads:
• Stateless (web frontends, servers)
• Stateful (databases, queues)
• Daemons (log collection, monitoring)
• Others?
Target infrastructure:
• Public cloud infrastructure
• On-prem infrastructure
And the winner is……..
Kubernetes (K8s) – Key Points..
| Open source “platform” for containerized workloads
| Platform building blocks vs. turnkey platform
– https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not
| Top use case is stateless/microservices deployments
| Evolving to support stateful and other workloads
Kubernetes (K8s) – Key Concepts
| Kubernetes: a platform for application patterns
| Pod: a single instance of an application in Kubernetes
| Controller: manages replicated pods for an application pattern
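To make the pod/controller relationship concrete, here is a minimal sketch of a controller (a Deployment) that keeps three replicas of an application pod running; the name and image are placeholders, not anything from this deck:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example-app          # placeholder name
  spec:
    replicas: 3                # the controller keeps 3 pod replicas running
    selector:
      matchLabels:
        app: example-app
    template:                  # pod template: a single instance of the application
      metadata:
        labels:
          app: example-app
      spec:
        containers:
        - name: app
          image: nginx:1.14    # placeholder image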
Kubernetes (K8s) – Master/Worker
Kubernetes (K8s) – Pods
Kubernetes (K8s) – Controller
Kubernetes (K8s) – Service
Kubernetes (K8s) – Storage
| Volume: ephemeral storage; tied to the lifecycle of a pod
| Persistent Volume: networked storage; independent of any pod
| Persistent Volume Claim: a request for a specific amount of storage (sketch below)
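For example, a minimal persistent volume claim requesting a fixed amount of networked storage might look like this (name and size are illustrative):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: example-claim        # illustrative name
  spec:
    accessModes:
    - ReadWriteOnce            # mountable read-write by a single node
    resources:
      requests:
        storage: 10Gi          # the requested amount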
Kubernetes (K8s) - Controller Patterns
Reality Check…. K8s challenges
source: https://www.cncf.io/blog/2017/06/28/survey-shows-kubernetes-leading-orchestration-platform/
Why Hadoop/Spark on Containers
Infrastructure
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and cloud)
• Higher resource utilization
Applications
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
Not to be confused with……..
This is not about using containers to run Hadoop/Spark tasks on YARN.
Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Hadoop in Docker Containers
This is about running entire Hadoop clusters in containers.
Attributes of Hadoop Clusters
• Not exactly monolithic applications, but close
• Multiple cooperating services with dynamic APIs
– Service start-up / tear-down ordering requirements
– Different sets of services running on different hosts (nodes)
– Tricky service interdependencies impact scalability
• Lots of configuration (aka state)
– Host name, IP address, ports, etc.
– Big metadata: Hadoop and Spark service-specific configurations
Hadoop itself is clustered….
Legend: RM = YARN ResourceManager, NM = YARN NodeManager, NN = HDFS NameNode, DN = HDFS DataNode
(Diagram: a Master Node running NN and RM, several Worker Nodes each running DN and NM, and HiveServer2 with its data and metadata.)
ACK! There is seemingly no end to these services & versions…
And lots of services to keep in sync. Legend:
RM = YARN ResourceManager
NM = YARN NodeManager
NN = HDFS NameNode
DN = HDFS DataNode
JHS = Job History Server
JN = JournalNode
ZK = ZooKeeper
HFS = HttpFS Service
HM = HBase Master
HRS = HBase RegionServer
Hue = Hue
OZ = Oozie
SHS = Spark History Server
Ambari = Ambari Server
DB = MySQL/Postgres
GW = Gateway
FA = Flume Agent
Tez = Tez Service
SS = Solr Server
Hive on LLAP = Hive on LLAP
RA = Ranger
HS = Hive Server
HSS = Hive Metastore Service
Managing and Configuring Hadoop
• Use a Hadoop manager
– Hortonworks: Ambari
– Cloudera: Cloudera Manager
– MapR: MapR Control System (MCS)
• Follow common deployment pattern
• Ensures distro supportability
And we want multiple Hadoop clusters
Data Engineering, SQL Analytics, Machine Learning – all on a shared “containerized” platform:
• Multiple evaluation teams
• Evaluate different business use cases (e.g. ETL, machine learning)
• Use different services (e.g. Hive, Pig, SparkR) and different distributions / versions (e.g. 2.1, 2.2, 2.5, 2.6, 2.7)
• Shared ‘containerized’ infrastructure with petabyte-scale data/storage
Multiple distributions, services, and tools on shared, cost-effective infrastructure
Requirements for success
• Hadoop won’t change
• Resource management (YARN)
• Master services always running
• Hadoop service dependencies & endpoints
• State persistence (data + metadata)
Hadoop Clusters on Kubernetes
Challenges and Gaps
• The existing, available Controller patterns are insufficient
• Hadoop service inter-communication via K8s Services (ClusterIP, NodePort, etc.) is not trivial
• The persistent volume (PV) and persistent volume claim (PVC) approach needs to adapt to Hadoop’s requirements for state persistence
So is it possible to run Hadoop in all its glory on Kubernetes (K8s)?
It’s a journey
We started with a BlueData custom controller on K8s 12 months ago – we learned a lot!
https://www.bluedata.com/blog/2017/12/big-data-container-orchestration-kubernetes-k8s/
Custom Controller – Architecture
(Diagram: a K8s cluster running the K8s API Server, K8s Scheduler, and K8s Controller Manager, with pod networking via, e.g., Calico. A BlueData namespace contains the Custom Controller pod and an HDP cluster: one pod running Ambari, NN, and RM, plus multiple pods each running DN and NM. Other pods run in the default namespace.)
Our ‘Custom Controller’ Approach..
• Launch StatefulSets for defined roles
• Configure and start services in the right sequence
• Make the services available to end users – network and port mapping
• Secure the services with existing enterprise policies (e.g. LDAP / AD)
• Maintain Big Data performance goals
Launching HDP on K8s with Ambari
Each role is a StatefulSet; 4 StatefulSets for this cluster.
Launch via the BlueData UI or API:
- Cluster metadata: manifest file
- Node roles: StatefulSets
- Node count: number of pods per role
- Node services: list of services and ports
HDP cluster running on K8s with BlueData
A NodePort service is created per pod, exposing all endpoints of each pod (a sketch follows).
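As a sketch of what one such per-pod service could look like (the service name, pod name, and port here are hypothetical; the selector uses the pod-name label that K8s 1.9+ adds to StatefulSet pods):

  apiVersion: v1
  kind: Service
  metadata:
    name: hdp-controller-0-svc        # hypothetical: one service per pod
  spec:
    type: NodePort                    # exposes the port on every K8s node
    selector:
      # label automatically added to StatefulSet pods (K8s 1.9+),
      # letting a service target exactly one pod
      statefulset.kubernetes.io/pod-name: hdp-controller-0
    ports:
    - name: ambari
      port: 8080                      # hypothetical Ambari endpoint
      targetPort: 8080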
Statefulset definition – details
• Persistent storage (see the sketch below)
– Volume claim template
– Preserve / (root) to enable restarts and migration
– Both the init and app containers mount the same “subPath” from the dynamically provisioned volume
– The initContainer sets up /var, /opt, and /etc on the dynamically provisioned volume for use by the app container
• Container access setup
– Leverage the K8s postStart hook to set up authorized_keys & /etc/resolv.conf
• Ease of use
– Added the concept of a flavor definition for CPU, memory, storage, etc.
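A condensed sketch of such a StatefulSet, assuming illustrative names, paths, and sizes (the image tag is borrowed from the CRD example later in this deck; the postStart script name is hypothetical):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: hdp-worker                    # illustrative role name
  spec:
    serviceName: hdp-worker
    replicas: 3
    selector:
      matchLabels:
        role: hdp-worker
    template:
      metadata:
        labels:
          role: hdp-worker
      spec:
        initContainers:
        - name: init-state
          image: bluedata/hdp26:v0.0.1  # image from the CRD example below
          # seed /var, /opt, /etc onto the dynamically provisioned volume
          command: ["sh", "-c", "cp -a /var/. /mnt/var/ && cp -a /opt/. /mnt/opt/ && cp -a /etc/. /mnt/etc/"]
          volumeMounts:
          - name: state
            mountPath: /mnt
            subPath: root               # same subPath as the app container
        containers:
        - name: hdp
          image: bluedata/hdp26:v0.0.1
          lifecycle:
            postStart:                  # set up authorized_keys, resolv.conf
              exec:
                command: ["sh", "-c", "/usr/local/bin/setup-access.sh"]   # hypothetical script
          volumeMounts:                 # mount the seeded state over / paths
          - { name: state, mountPath: /var, subPath: root/var }
          - { name: state, mountPath: /opt, subPath: root/opt }
          - { name: state, mountPath: /etc, subPath: root/etc }
    volumeClaimTemplates:               # one PVC per pod; survives restarts
    - metadata:
        name: state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi              # illustrative size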
Key Gaps (Custom Controller)
Functional gaps
• Authentication and authorization were handled by the controller itself
• Limited to a single namespace; lacked mapping to K8s multi-tenancy
Usability gaps
• Inability to use native kubectl commands for all operations
• Unable to use Helm charts and other community projects
So, what’s next to make it more K8s-native and address these gaps?
Available Approaches….
• Use kubectl commands for simple deployment
• Use Helm charts for dependency management
• Use Operators for managing complex actions during and
after deployment
Operator = Custom Resource Definition (CRD) + Custom Controller
Creating Hadoop “Custom Operator”
(Diagram: the Hadoop CRD is registered with the API Server and stored in etcd; a Hadoop cluster is created with “kubectl create hadoopcluster”; the custom Hadoop controller observes, assesses, and acts alongside the standard Scheduler and Controller.)
Hadoop Operator:
1. Create StatefulSets
2. Configure services
3. Map ports
4. Scale up / scale down
5. Migrate to ensure fault tolerance (FT)
Custom Operator – CRD
• Native extension to standard K8s APIs
• Uses the same authentication, authorization, and audit logging
• Use kubectl commands to operate on the CRD object (e.g. create a hadoopcluster)
• API request objects are stored in etcd
Example – CRD Registration and Usage

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: hadoopclusters.hadoop.example.com   # must be <plural>.<group>
spec:
  group: hadoop.example.com
  version: v1alpha1
  scope: Namespaced
  names:
    plural: hadoopclusters
    singular: hadoopcluster
    kind: HadoopCluster

• Create: kubectl create -f <CRD>.yaml
• New REST API endpoints: /apis/hadoop.example.com/v1alpha1/namespaces/*/hadoopclusters/...
Example – New objects using CRD

apiVersion: "hadoop.example.com/v1alpha1"
kind: HadoopCluster
metadata:
  name: my-new-hdp-cluster
spec:
  image: bluedata/hdp26:v0.0.1
  roles:
  - name: master
    replicas: 1
    resources: …
  - name: worker
    replicas: 4
    resources: …
  …

Create: kubectl create -f <request>.yaml
Manage: kubectl get hadoopcluster
Custom Operator – Controller
• Watches instances of objects whose type is defined in the CRD
– Example: Create an HDP cluster with Hive and Oozie
• Runs scripts and services to coordinate activities between the different pods of a cluster
– Example: Start HDFS, start HiveServer2
• Modification and scaling logic is applied via custom controller watch events
– Example: Expand and shrink a cluster
• The same controller handles requests for multiple instances of the custom object
– Example: Create and monitor multiple HDP clusters
Review Hadoop “Custom Operator”
(Diagram, repeated from earlier: register the Hadoop CRD with the API Server and etcd, create a cluster with “kubectl create hadoopcluster”, and the custom Hadoop controller observes, assesses, and acts.)
Additional Configuration
• Lightweight Directory Access Protocol (LDAP) service
• Active Directory (AD) service
• Domain Name System (DNS)
• Kerberos Key Distribution Center (KDC)
• Key Management Service (KMS)
Network and access to services
• Networking
– Used Calico for our testing
• Storage
– Persistent external storage (Gluster; see the sketch below)
• This approach allows us to run on any standard K8s installation (1.9 and higher)
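For the storage piece, a minimal sketch of dynamic provisioning against Gluster using the in-tree GlusterFS provisioner (the Heketi REST URL is hypothetical):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: gluster-dynamic
  provisioner: kubernetes.io/glusterfs          # in-tree GlusterFS provisioner
  parameters:
    resturl: "http://heketi.example.com:8080"   # hypothetical Heketi endpoint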
Key Takeaways
• Kubernetes is still best suited for stateless services
• Complex stateful services like Hadoop require significant work
• StatefulSets are a key enabler – necessary, but not sufficient
• New innovations and K8s contributions are needed to run Big Data
BlueData will simplify onboarding of Hadoop products to K8s
Thank You
For more information:
www.bluedata.com
Booth # S5