DataWorks Summit 2018, San Jose, CA
What’s the ‘Hadoop-la’ about Kubernetes?
Today’s Speakers
Anant Chintamaneni (@AnantCman) – Vice President of Products, BlueData Software
Nanda Vijaydev (@NandaVijaydev) – Sr. Director of Solutions, BlueData Software
Agenda
• Market Dynamics (with containers)
• What is Kubernetes – Why should you care?
• Requirements for Stateful Hadoop Clusters
• Key gaps in Kubernetes for running Hadoop
• What will it take to go from here to there?
• Q & A
The “Promised Land”
Single “container” platform for multiple application patterns….
Infra-agnostic workloads:
• Stateless (web frontends, servers)
• Stateful (databases, queues)
• Daemons (log collection, monitoring)
• Others?
Target infrastructure:
• Public cloud infrastructure
• On-prem infrastructure
And the winner is……..
Kubernetes (K8s) – Key Points..
| Open source “platform” for containerized workloads
| Platform building blocks vs. turnkey platform
– https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not
| Top use case is stateless/microservices deployments
| Evolving to support stateful and other workloads
Kubernetes (K8s) – Key Concepts
| Kubernetes: a platform for application patterns
| Pod: a single instance of an application in Kubernetes
| Controller: manages replicated pods for an application pattern
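To make the pod/controller relationship concrete, here is a minimal sketch of a controller (a Deployment) that keeps three replicas of an application pod running; the name and image are placeholders, not anything from this deck:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example-app          # placeholder name
  spec:
    replicas: 3                # the controller keeps 3 pod replicas running
    selector:
      matchLabels:
        app: example-app
    template:                  # pod template: a single instance of the application
      metadata:
        labels:
          app: example-app
      spec:
        containers:
        - name: app
          image: nginx:1.14    # placeholder image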
Kubernetes (K8s) – Master/Worker
Kubernetes (K8s) – Pods
Kubernetes (K8s) – Controller
Kubernetes (K8s) – Service
Kubernetes (K8s) – Storage
| Volume: ephemeral storage; tied to the lifecycle of a pod
| Persistent Volume: networked storage; independent of any pod
| Persistent Volume Claim: a request for a specific amount of storage (sketch below)
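For example, a minimal persistent volume claim requesting a fixed amount of networked storage might look like this (name and size are illustrative):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: example-claim        # illustrative name
  spec:
    accessModes:
    - ReadWriteOnce            # mountable read-write by a single node
    resources:
      requests:
        storage: 10Gi          # the requested amount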
Kubernetes (K8s) - Controller Patterns
Reality Check…. K8s challenges
source: https://www.cncf.io/blog/2017/06/28/survey-shows-kubernetes-leading-orchestration-platform/
Why Hadoop/Spark on Containers
Infrastructure
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and cloud)
• Higher resource utilization
Applications
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
Not to be confused with……..
This is not about using containers to run Hadoop/Spark tasks on YARN.
Source: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications
Hadoop in Docker Containers
This is about running entire Hadoop clusters in containers.
Attributes of Hadoop Clusters
• Not exactly monolithic applications, but close
• Multiple cooperating services with dynamic APIs
– Service start-up / tear-down ordering requirements
– Different sets of services running on different hosts (nodes)
– Tricky service interdependencies impact scalability
• Lots of configuration (aka state)
– Host name, IP address, ports, etc.
– Big metadata: Hadoop and Spark service-specific configurations
Hadoop itself is clustered….
Legend: RM = YARN ResourceManager, NM = YARN NodeManager, NN = HDFS NameNode, DN = HDFS DataNode
(Diagram: a Master Node running NN and RM, several Worker Nodes each running DN and NM, and HiveServer2 with its data and metadata.)
ACK! There is seemingly no end to these services & versions…
And lots of services to keep in sync. Legend:
RM = YARN ResourceManager
NM = YARN NodeManager
NN = HDFS NameNode
DN = HDFS DataNode
JHS = Job History Server
JN = JournalNode
ZK = ZooKeeper
HFS = HttpFS Service
HM = HBase Master
HRS = HBase RegionServer
Hue = Hue
OZ = Oozie
SHS = Spark History Server
Ambari = Ambari Server
DB = MySQL/Postgres
GW = Gateway
FA = Flume Agent
Tez = Tez Service
SS = Solr Server
Hive on LLAP = Hive on LLAP
RA = Ranger
HS = Hive Server
HSS = Hive Metastore Service
Managing and Configuring Hadoop
• Use a Hadoop manager
– Hortonworks: Ambari
– Cloudera: Cloudera Manager
– MapR: MapR Control System (MCS)
• Follow common deployment pattern
• Ensures distro supportability
And we want multiple Hadoop clusters
Data Engineering, SQL Analytics, Machine Learning – all on a shared “containerized” platform:
• Multiple evaluation teams
• Evaluate different business use cases (e.g. ETL, machine learning)
• Use different services (e.g. Hive, Pig, SparkR) and different distributions / versions (e.g. 2.1, 2.2, 2.5, 2.6, 2.7)
• Shared ‘containerized’ infrastructure with petabyte-scale data/storage
Multiple distributions, services, and tools on shared, cost-effective infrastructure
Requirements for success
• Hadoop won’t change
• Resource management (YARN)
• Master services always running
• Hadoop service dependencies & endpoints
• State persistence (data + metadata)
Hadoop Clusters on Kubernetes
Challenges and Gaps
• The existing, available Controller patterns are insufficient
• Hadoop service inter-communication via K8s Services (ClusterIP, NodePort, etc.) is not trivial
• The persistent volume (PV) and persistent volume claim (PVC) approach needs to adapt to Hadoop’s requirements for state persistence
So is it possible to run Hadoop in all its glory on Kubernetes (K8s)?
It’s a journey
We started with a BlueData custom controller on K8s 12 months ago – we learned a lot!
https://www.bluedata.com/blog/2017/12/big-data-container-orchestration-kubernetes-k8s/
Custom Controller – Architecture
(Diagram: a K8s cluster running the K8s API Server, K8s Scheduler, and K8s Controller Manager, with pod networking via, e.g., Calico. A BlueData namespace contains the Custom Controller pod and an HDP cluster: one pod running Ambari, NN, and RM, plus multiple pods each running DN and NM. Other pods run in the default namespace.)
Our ‘Custom Controller’ Approach..
• Launch StatefulSets for defined roles
• Configure and start services in the right sequence
• Make the services available to end users – network and port mapping
• Secure the services with existing enterprise policies (e.g. LDAP / AD)
• Maintain Big Data performance goals
Launching HDP on K8s with Ambari
Each role is a StatefulSet; 4 StatefulSets for this cluster.
Launch via the BlueData UI or API:
- Cluster metadata: manifest file
- Node roles: StatefulSets
- Node count: number of pods per role
- Node services: list of services and ports
HDP cluster running on K8s with BlueData
A NodePort service is created per pod, exposing all endpoints of each pod (a sketch follows).
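As a sketch of what one such per-pod service could look like (the service name, pod name, and port here are hypothetical; the selector uses the pod-name label that K8s 1.9+ adds to StatefulSet pods):

  apiVersion: v1
  kind: Service
  metadata:
    name: hdp-controller-0-svc        # hypothetical: one service per pod
  spec:
    type: NodePort                    # exposes the port on every K8s node
    selector:
      # label automatically added to StatefulSet pods (K8s 1.9+),
      # letting a service target exactly one pod
      statefulset.kubernetes.io/pod-name: hdp-controller-0
    ports:
    - name: ambari
      port: 8080                      # hypothetical Ambari endpoint
      targetPort: 8080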
Statefulset definition – details
• Persistent storage (see the sketch below)
– Volume claim template
– Preserve / (root) to enable restarts and migration
– Both the init and app containers mount the same “subPath” from the dynamically provisioned volume
– The initContainer sets up /var, /opt, and /etc on the dynamically provisioned volume for use by the app container
• Container access setup
– Leverage the K8s postStart hook to set up authorized_keys & /etc/resolv.conf
• Ease of use
– Added the concept of a flavor definition for CPU, memory, storage, etc.
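A condensed sketch of such a StatefulSet, assuming illustrative names, paths, and sizes (the image tag is borrowed from the CRD example later in this deck; the postStart script name is hypothetical):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: hdp-worker                    # illustrative role name
  spec:
    serviceName: hdp-worker
    replicas: 3
    selector:
      matchLabels:
        role: hdp-worker
    template:
      metadata:
        labels:
          role: hdp-worker
      spec:
        initContainers:
        - name: init-state
          image: bluedata/hdp26:v0.0.1  # image from the CRD example below
          # seed /var, /opt, /etc onto the dynamically provisioned volume
          command: ["sh", "-c", "cp -a /var/. /mnt/var/ && cp -a /opt/. /mnt/opt/ && cp -a /etc/. /mnt/etc/"]
          volumeMounts:
          - name: state
            mountPath: /mnt
            subPath: root               # same subPath as the app container
        containers:
        - name: hdp
          image: bluedata/hdp26:v0.0.1
          lifecycle:
            postStart:                  # set up authorized_keys, resolv.conf
              exec:
                command: ["sh", "-c", "/usr/local/bin/setup-access.sh"]   # hypothetical script
          volumeMounts:                 # mount the seeded state over / paths
          - { name: state, mountPath: /var, subPath: root/var }
          - { name: state, mountPath: /opt, subPath: root/opt }
          - { name: state, mountPath: /etc, subPath: root/etc }
    volumeClaimTemplates:               # one PVC per pod; survives restarts
    - metadata:
        name: state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi              # illustrative size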
Key Gaps (Custom Controller)
Functional gaps
• Authentication and authorization were handled by the controller itself
• Limited to a single namespace; lacked mapping to K8s multi-tenancy
Usability gaps
• Inability to use native kubectl commands for all operations
• Unable to use Helm charts and other community projects
So, what’s next to make it more K8s-native and address these gaps?
Available Approaches….
• Use kubectl commands for simple deployment
• Use Helm charts for dependency management
• Use Operators for managing complex actions during and
after deployment
Operator = Custom Resource Definition (CRD) + Custom Controller
Creating Hadoop “Custom Operator”
(Diagram: the Hadoop CRD is registered with the API Server and stored in etcd; a Hadoop cluster is created with “kubectl create hadoopcluster”; the custom Hadoop controller observes, assesses, and acts alongside the standard Scheduler and Controller.)
Hadoop Operator:
1. Create StatefulSets
2. Configure services
3. Map ports
4. Scale up / scale down
5. Migrate to ensure fault tolerance (FT)
Custom Operator – CRD
• Native extension to standard K8s APIs
• Uses the same authentication, authorization, and audit logging
• Use kubectl commands to operate on the CRD object (e.g. create a hadoopcluster)
• API request objects are stored in etcd
Example – CRD Registration and Usage

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: hadoopclusters.hadoop.example.com   # must be <plural>.<group>
spec:
  group: hadoop.example.com
  version: v1alpha1
  scope: Namespaced
  names:
    plural: hadoopclusters
    singular: hadoopcluster
    kind: HadoopCluster

• Create: kubectl create -f <CRD>.yaml
• New REST API endpoints: /apis/hadoop.example.com/v1alpha1/namespaces/*/hadoopclusters/...
Example – New objects using CRD

apiVersion: "hadoop.example.com/v1alpha1"
kind: HadoopCluster
metadata:
  name: my-new-hdp-cluster
spec:
  image: bluedata/hdp26:v0.0.1
  roles:
  - name: master
    replicas: 1
    resources: …
  - name: worker
    replicas: 4
    resources: …
  …

Create: kubectl create -f <request>.yaml
Manage: kubectl get hadoopcluster
Custom Operator – Controller
• Watches instances of objects whose type is defined in the CRD
– Example: Create an HDP cluster with Hive and Oozie
• Runs scripts and services to coordinate activities between the different pods of a cluster
– Example: Start HDFS, start HiveServer2
• Modification and scaling logic is applied via custom controller watch events
– Example: Expand and shrink a cluster
• The same controller handles requests for multiple instances of the custom object
– Example: Create and monitor multiple HDP clusters
Review Hadoop “Custom Operator”
(Diagram, repeated from earlier: register the Hadoop CRD with the API Server and etcd, create a cluster with “kubectl create hadoopcluster”, and the custom Hadoop controller observes, assesses, and acts.)
Additional Configuration
• Lightweight Directory Access Protocol (LDAP) service
• Active Directory (AD) service
• Domain Name System (DNS)
• Kerberos Key Distribution Center (KDC)
• Key Management Service (KMS)
Network and access to services
• Networking
– Used Calico for our testing
• Storage
– Persistent external storage (Gluster; see the sketch below)
• This approach allows us to run on any standard K8s installation (1.9 and higher)
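For the storage piece, a minimal sketch of dynamic provisioning against Gluster using the in-tree GlusterFS provisioner (the Heketi REST URL is hypothetical):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: gluster-dynamic
  provisioner: kubernetes.io/glusterfs          # in-tree GlusterFS provisioner
  parameters:
    resturl: "http://heketi.example.com:8080"   # hypothetical Heketi endpoint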
Key Takeaways
• Kubernetes is still best suited for stateless services
• Complex stateful services like Hadoop require significant work
• StatefulSets are a key enabler – necessary, but not sufficient
• New innovations and K8s contributions are needed to run Big Data
BlueData will simplify onboarding of Hadoop products to K8s
Thank You
For more information:
www.bluedata.com
Booth # S5