This presentation will provide technical design and development insights for running a secured Spark job in a Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine-learning or ETL Spark job in Kubernetes and how to access data in HDFS using a Kerberos principal and delegation token.
The first part of this presentation will lay out the design and best practices for deploying and running Spark in Kubernetes integrated with HDFS: creating an on-demand multi-node Spark cluster at job submission, installing and resolving software dependencies (packages), executing and monitoring the workload, and finally disposing of the resources on job completion. The second part covers the design and development details for setting up a Spark+Kubernetes cluster that supports long-running jobs accessing data in secured HDFS storage by seamlessly creating and renewing Kerberos delegation tokens from the end user's Kerberos principal.
All the techniques covered in this presentation are essential for setting up a Spark+Kubernetes compute cluster that accesses data securely from a distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
Speaker
Joy Chakraborty, Data Architect
1. Running secured Spark job in
Kubernetes compute cluster and
integrating with Kerberized HDFS
Joy Chakraborty
June 21, 2018
Email: joychak1@[yahoo/gmail].com
2. Essentially, what we will be doing:
• Run a Spark job on an elastic compute platform (Kubernetes), accessing data stored in HDFS
3. Who am I?
I learn and apply …
- Design and write software for a living
- … for the last 18 years
5. Agenda
1. Kubernetes as elastic compute
2. HDFS as secured distributed storage
3. Configuring Spark to run in Kubernetes & accessing HDFS
4. Demo: build & set up the Kubernetes/Spark/HDFS environment
8. Compute Requirements
• Support elasticity
• Flexibility and variability in workload
• Process massive amounts of data in parallel
• Support reliability & multitenancy
• Ease of accessibility without compromising security
15. Kubernetes High-Level Architecture
• The Kubernetes Master is a collection
of three processes that run on a single
node in your cluster, which is
designated as the master node.
• kube-apiserver
• kube-controller-manager
• kube-scheduler
• Each individual non-master node in your cluster runs two processes:
• kubelet, which communicates with
the Kubernetes Master
• kube-proxy, a network proxy which
reflects Kubernetes networking
services on each node.
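A quick way to see these components in a running cluster (a minimal sketch; it assumes kubectl is already configured to talk to the cluster):

  # List the nodes; the master node is labeled with its role
  kubectl get nodes -o wide
  # Check the health of the master processes (scheduler, controller-manager, etcd)
  kubectl get componentstatuses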
20. Kubernetes Objects/Resources (out of the box)
• Namespace
• Pod (a basic unit of work)
• Service
• Volume
• ReplicaSet
• Deployment
• Job
• RBAC
*** Custom Resource (user-defined; not available out of the box)
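These objects are created and inspected with kubectl; a minimal sketch (the namespace and deployment names are illustrative):

  # Create a namespace and a deployment inside it
  kubectl create namespace spark-demo
  kubectl create deployment web --image=nginx -n spark-demo
  # List the object kinds discussed above
  kubectl get pods,services,deployments,replicasets,jobs -n spark-demo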
21. Kubernetes Pod
• A Pod is the basic building block of Kubernetes:
• the smallest and simplest unit in the Kubernetes object model
that you create or deploy
• represents a running process on your cluster.
• encapsulates an application container (or, in some cases,
multiple containers), storage resources, a unique network IP,
and options that govern how the container(s) should run
• Docker is the most common container runtime used in a Pod,
but Pods support other container runtimes as well
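A single-container Pod can be launched straight from the command line; a minimal sketch (the Pod name and image are illustrative):

  # Launch a bare Pod, not managed by a Deployment or ReplicaSet
  kubectl run web-pod --image=nginx --restart=Never
  # Inspect the Pod: its IP, container(s), and volumes
  kubectl describe pod web-pod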
23. How to Interact with Kubernetes
• REST API: curl or a browser
• Command-line interface: kubectl
• Kube config: created during cluster creation
• Programmatic: Go, Python, Scala
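For example (a sketch; kubectl proxy provides an authenticated local endpoint so curl does not need to handle certificates):

  # kubectl reads credentials from ~/.kube/config by default
  kubectl get pods --all-namespaces
  # Open a local proxy to the API server, then use the REST API with curl
  kubectl proxy --port=8001 &
  curl http://localhost:8001/api/v1/namespaces/default/pods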
26. Secured/Kerberized (keytab) HDFS
[Diagram: Kerberos authentication flow between a client application, the KDC (AS/TGS, backed by Active Directory), and the HDFS Namenode/Datanodes: (1) service principals/keys are registered with the KDC; (2) the client application requests a ticket; (3) the KDC sends a TGT; the client then sends a service ticket to the Namenode and requests authentication; the user is authenticated using the service principal/key, and the user's roles/permissions are retrieved from Active Directory.]
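On a Kerberized cluster the client typically authenticates from a keytab before touching HDFS; a minimal sketch (the principal, realm, and keytab path are illustrative):

  # Obtain a TGT non-interactively from a keytab
  kinit -kt /etc/security/keytabs/joy.keytab joy@EXAMPLE.COM
  # Verify the ticket, then access HDFS
  klist
  hdfs dfs -ls /user/joy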
27. Secured/Kerberized (delegation token) HDFS
[Diagram: the same flow, but the client requests a delegation token: after obtaining the TGT, the client sends the service ticket to the Namenode and requests a delegation token; the user is authenticated using the service principal/key, with roles/permissions retrieved from Active Directory; the Namenode then sends back a delegation token.]
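A delegation token is fetched once with valid Kerberos credentials and can then be used by processes that hold no keytab; a minimal sketch (the paths and renewer principal are illustrative):

  # While holding a Kerberos ticket, ask the Namenode for a delegation token
  hdfs fetchdt --renewer joy@EXAMPLE.COM /tmp/joy.dt
  # A process without Kerberos credentials can authenticate with the token
  export HADOOP_TOKEN_FILE_LOCATION=/tmp/joy.dt
  hdfs dfs -ls /user/joy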
31. Spark Running in Kubernetes
[Diagram: spark-submit sends the Spark job to the Kube Master, which launches the Spark Master/Driver Pod and the Worker Pods inside the Kubernetes cluster.]
32. Spark Running in Kubernetes
[Diagram build: the pods pull the Spark Docker images from Docker Hub.]
33. Spark Running in Kubernetes
[Diagram build: the Driver and Worker Pods read data from the HDFS Namenode and Datanodes.]
34. Spark Running in Kubernetes
[Diagram build: the Kerberos keytab is delivered to the pods as a Kube Secret; see the sketch below.]
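A minimal sketch of the keytab-as-Secret step (the Secret name and mount path are illustrative; spark.kubernetes.*.secrets is the Spark 2.3 option for mounting Secrets into the driver and executor pods):

  # Store the keytab in a Kubernetes Secret
  kubectl create secret generic hdfs-keytab --from-file=joy.keytab
  # At submit time, mount the Secret into the driver and executor pods:
  #   --conf spark.kubernetes.driver.secrets.hdfs-keytab=/mnt/secrets
  #   --conf spark.kubernetes.executor.secrets.hdfs-keytab=/mnt/secrets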
37. What to use
• To install Kubernetes:
• vagrant-kubeadm (https://github.com/c9s/vagrant-kubeadm)
• Run a local Docker hub (https://docs.docker.com/registry/deploying)
• Run the Docker registry container and map localhost to some IP (10.0.2.2)
• Build the Spark Docker image from the Dockerfile shipped in the Spark download under the kubernetes directory (spark-2.3.0-bin-hadoop2.7/kubernetes)
• Publish the image to the Docker hub
• Run spark-submit (see the sketch below)
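Putting the steps together, a sketch of the demo commands (the registry address, image tag, and API-server URL are illustrative; the flags are the Spark 2.3 Kubernetes options):

  # Run a local Docker registry, then build and publish the Spark image
  docker run -d -p 5000:5000 --name registry registry:2
  cd spark-2.3.0-bin-hadoop2.7
  docker build -t 10.0.2.2:5000/spark:2.3.0 -f kubernetes/dockerfiles/spark/Dockerfile .
  docker push 10.0.2.2:5000/spark:2.3.0
  # Submit a job against the Kubernetes API server
  bin/spark-submit \
    --master k8s://https://<api-server-host>:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=10.0.2.2:5000/spark:2.3.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar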