Running secured Spark job in
Kubernetes compute cluster and
integrating with Kerberized HDFS
Joy Chakraborty
June 21, 2018
Email: joychak1@[yahoo/gmail].com
• To run Spark job on Elastic
Compute platform (Kubernetes)
accessing data stored in HDFS
Essentially what we will be doing -
3
Who am I ???
I learn and apply ….
- Design and write software for a living
- for the last 18 years …
4
Disclaimer
We will just be scratching the surface !!!
Agenda
5
1. Kubernetes as Elastic Compute
2. HDFS as secured distributed storage
3. Configuring Spark to run in Kubernetes & accessing HDFS
4. Demo - Build & setup the Kubernetes/Spark/HDFS environment
6
Why Kubernetes?
7
Compute Requirements
8
Compute Requirements
• Support Elasticity
• Flexibility and variability in workload
• Process massive amounts of data in parallel
• Support Reliability & Multitenancy
• Ease of Accessibility without compromising Security
9
Distributed
Containerized
10
• Container Management
• Scheduling
• Resource Management
Distributed Containerized System
11
• Container Management
• Scheduling
• Resource Management
Kubernetes
Distributed Containerized System
12
what?
13
Kubernetes
An open-source system for automating deployment, scaling,
and management of containerized applications.
• Portable
• Extensible: modular, pluggable, composable
• Self-healing: auto-placement, auto-restart, auto-replication, auto-scaling
• Latest Release: 1.8.13
• Completely community driven
• Written in Go
14
Kubernetes High Level Architecture
15
Kubernetes High Level Architecture
• The Kubernetes Master is a collection
of three processes that run on a single
node in your cluster, which is
designated as the master node.
• kube-apiserver
• kube-controller-manager
• kube-scheduler
• Each individual non-master node in
your cluster runs two processes:
• kubelet, which communicates with
the Kubernetes Master
• kube-proxy, a network proxy which
reflects Kubernetes networking
services on each node
16
Kubernetes - basic
components
17
Kubernetes Object/Resource (out of the box)
• Namespace
• Pod (a basic unit of work)
• Service
• Volume
18
Kubernetes Object/Resource (out of the box)
• Namespace
• Pod (a basic unit of work)
• Service
• Volume
• ReplicaSet
• Deployment
• Job
19
Kubernetes Object/Resource (out of the box)
• Namespace
• Pod (a basic unit of work)
• Service
• Volume
• ReplicaSet
• Deployment
• Job
• RBAC
20
Kubernetes Object/Resource (out of the box)
• Namespace
• Pod (a basic unit of work)
• Service
• Volume
• ReplicaSet
• Deployment
• Job
• RBAC
*** Custom Resource
21
Kubernetes Pod
• A Pod is the basic building block of Kubernetes:
• the smallest and simplest unit in the Kubernetes object model
that you create or deploy
• represents a running process on your cluster
• encapsulates an application container (or, in some cases,
multiple containers), storage resources, a unique network IP,
and options that govern how the container(s) should run
• Docker is the most common container runtime used in a Pod,
but Pods support other container runtimes as well
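For concreteness, a minimal Pod manifest illustrating the points above; the name, namespace, and image are placeholders, not from the deck:

```yaml
# A hypothetical single-container Pod; `kubectl apply -f pod.yaml` creates it.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # placeholder name
  namespace: default
spec:
  containers:
    - name: app            # the application container the Pod encapsulates
      image: nginx:1.15    # placeholder image
      ports:
        - containerPort: 80   # the Pod gets its own unique network IP
```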
22
How to Interact with Kubernetes
23
How to Interact with Kubernetes
• REST API
• curl or browser
• Command Line Interface
• kubectl
• Kube Config
• Created during cluster
creation
• Programmatic
• Go, Python, Scala
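As a sketch, the REST and CLI routes hit the same API server endpoint; the API server address and namespace below are placeholder assumptions, and the actual calls are left commented out so you can substitute your own cluster details:

```shell
# Placeholder values -- substitute your own cluster's endpoint (assumptions).
APISERVER="https://10.0.2.15:6443"   # assumed kube-apiserver address
NAMESPACE="default"

# Both interaction styles resolve to the same resource URL:
PODS_URL="$APISERVER/api/v1/namespaces/$NAMESPACE/pods"

# REST API: list pods with curl (bearer-token auth assumed)
# curl -sk -H "Authorization: Bearer $TOKEN" "$PODS_URL"

# CLI: the same listing via kubectl (reads the kube config created
# during cluster creation, typically ~/.kube/config)
# kubectl get pods -n "$NAMESPACE"

echo "$PODS_URL"
```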
24
How does Secured
(Kerberized) HDFS
work?
25
Secured/Kerberized HDFS
[Diagram: a client application on a client machine; a KDC (AS + TGS) backed by Active Directory; an HDFS cluster with a Namenode and four Datanodes]
26
Secured/Kerberized (keytab) HDFS
[Diagram components: KDC (AS + TGS), Active Directory, client machine/application, HDFS Namenode and Datanodes]
1. Service principals/keys
2. Client app requests a ticket
3. KDC sends a TGT
5. User authenticated using the service principal/key (Active Directory retrieves user roles/permissions)
6. Client sends the service ticket and requests authentication
27
Secured/Kerberized (delegation token) HDFS
[Diagram components: KDC (AS + TGS), Active Directory, client machine/application, HDFS Namenode and Datanodes]
1. Service principals/keys
2. Client app requests a ticket
3. KDC sends a TGT
6. Client sends the service ticket and requests a delegation token
7. User authenticated using the service principal/key (Active Directory retrieves user roles/permissions)
8. Namenode sends the delegation token
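The keytab-based client side of this flow can be sketched from the command line; the principal, realm, and file paths below are assumptions, not from the deck, and the privileged commands are commented out:

```shell
# Assumed Kerberos principal and keytab location (placeholders).
PRINCIPAL="hdfsuser@EXAMPLE.COM"
KEYTAB="/etc/security/hdfsuser.keytab"

# Steps 2-3: obtain a TGT from the KDC using the keytab (no password prompt)
# kinit -kt "$KEYTAB" "$PRINCIPAL"

# Steps 6-8: HDFS client calls now authenticate with the service ticket;
# a delegation token can also be fetched explicitly for later use:
# hdfs fetchdt --renewer "$PRINCIPAL" /tmp/hdfs.dt
# hdfs dfs -ls /

KINIT_CMD="kinit -kt $KEYTAB $PRINCIPAL"
echo "$KINIT_CMD"
```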
28
Spark + Kubernetes +
HDFS
29
Spark Running in Kubernetes
[Diagram: an empty Kubernetes cluster with its Kube Master]
30
Spark Running in Kubernetes
[Diagram: spark-submit (--master k8s://…) talks to the Kube Master of the Kubernetes cluster]
31
Spark Running in Kubernetes
[Diagram: spark-submit talks to the Kube Master; inside the cluster, the Spark job runs as a driver pod (Spark master) plus four worker pods]
32
Spark Running in Kubernetes
[Diagram: as before, driver pod plus worker pods; the Spark Docker images for the pods are pulled from Docker Hub]
33
Spark Running in Kubernetes
[Diagram: driver and worker pods with images from Docker Hub; the job reads from an external HDFS cluster (Namenode + Datanodes)]
34
Spark Running in Kubernetes
[Diagram: driver and worker pods, Docker Hub images, the HDFS cluster; a Kube Secret holds the keytab]
35
Spark Running in Kubernetes
[Diagram: same as the previous slide: driver and worker pods, Docker Hub images, the HDFS cluster, and the keytab held in a Kube Secret]
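Staging the keytab as a Kube Secret and mounting it into the Spark pods can be sketched as follows; the secret name, keytab path, and mount path are assumptions, and the kubectl call is commented out:

```shell
# Assumed keytab location and secret name (placeholders).
KEYTAB="/etc/security/hdfsuser.keytab"
SECRET_NAME="hdfs-keytab"

# Create the secret from the keytab file:
# kubectl create secret generic "$SECRET_NAME" --from-file="$KEYTAB"

# Spark 2.3 can mount a named secret into driver/executor pods via
# spark.kubernetes.{driver,executor}.secrets.<secret name>=<mount path>;
# the driver then reads the keytab from the mount path.
DRIVER_CONF="spark.kubernetes.driver.secrets.$SECRET_NAME=/mnt/secrets"
echo "$DRIVER_CONF"
```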
36
Let's build the
Kubernetes/Spark/HDFS
environment !!!
37
What to use
• To install Kubernetes
• vagrant-kubeadm (https://github.com/c9s/vagrant-kubeadm)
• Run a local Docker registry (https://docs.docker.com/registry/deploying )
• Run the Docker registry container and map localhost to
an IP (10.0.2.2)
• Build the Spark Docker image from the Dockerfile shipped in the Spark
download under the kubernetes directory (spark-2.3.0-bin-hadoop2.7/kubernetes)
• Publish the image to the Docker registry
• Run spark-submit
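The steps above can be sketched as a submission against the cluster; the API server endpoint, registry address, and image tag are assumptions (the deck only says the local registry is mapped to 10.0.2.2), and the long-running commands are commented out:

```shell
# Assumed kube-apiserver endpoint and image pushed to the local registry.
K8S_MASTER="k8s://https://10.0.2.15:6443"
IMAGE="10.0.2.2:5000/spark:2.3.0"

# Build and push the Spark image from the bundled Dockerfile (Spark 2.3+
# ships bin/docker-image-tool.sh for this):
# ./bin/docker-image-tool.sh -r 10.0.2.2:5000 -t 2.3.0 build
# ./bin/docker-image-tool.sh -r 10.0.2.2:5000 -t 2.3.0 push

# Submit the bundled SparkPi example in cluster mode:
# spark-submit \
#   --master "$K8S_MASTER" \
#   --deploy-mode cluster \
#   --name spark-pi \
#   --class org.apache.spark.examples.SparkPi \
#   --conf spark.executor.instances=2 \
#   --conf spark.kubernetes.container.image="$IMAGE" \
#   local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

echo "master=$K8S_MASTER image=$IMAGE"
```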
38
Demo !!!
Email: joychak1@[yahoo/gmail].com
Thank You
39
