DevOps, continuous delivery and modern architectural trends can dramatically speed up the software development process. Big Data applications are no exception and need to keep the same pace.
Extending DevOps to Big Data Applications with Kubernetes
2. ABOUT ME
NICOLA FERRARO
Software Engineer at Red Hat
Working on Apache Camel, Fabric8, JBoss Fuse, Fuse Integration Services for
Openshift
Previously Solution Architect for Big Data systems at Engineering (IT)
@ni_ferraro
3. SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and Openshift Origin
Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
4. BIG DATA
Big Data ≈ Machine Learning at Scale
Big Data systems are capable of
handling data with high:
Volume
Terabytes, petabytes, ...
Log files
Transactions
Velocity
Streaming data
Micro batching
Near real-time
Variety
Structured data
Images
Videos
Free text
[Venn diagram of Volume, Velocity and Variety: Big Data sits at the intersection, on clusters of > 100 machines]
5. EVOLUTION OF BIG DATA SOFTWARE
http://www.slideshare.net/nicolaferraro/a-brief-history-of-big-data-48525942
Hadoop → Pig → Hive → (a whole zoo) → Spark
6. EVOLUTION OF BIG DATA INFRASTRUCTURE
Commodity hardware (really?) → Specialized hardware (appliances) → The Cloud
7. EVOLUTION OF ARCHITECTURES: BATCH
The first Big Data architecture was heavily
based on Hadoop: MapReduce + HDFS.
Some marketing motivations:
Extract meaningful information from raw data: "companies analyze around <<put-a-random-number-between-0-and-20-here>>% of the data they produce"
More data = more precision for machine
learning algorithms = better insights
Assist the decision making process: predict
the future using all the available data
SELECT page, count(*) FROM logs GROUP BY page
9. EVOLUTION OF ARCHITECTURES: HYBRID
Second generation architectures focused on improving the user experience through machine learning:
Not everything could be executed in
streaming, for performance reasons
Execute heavyweight steps offline in
batches (generate the "view", e.g. a
machine learning model)
Execute lightweight steps online with
streaming applications
The view must be refreshed periodically
11. EVOLUTION OF ARCHITECTURES: STREAMING
The new generation of architectures is streaming-only:
Provide immediate feedback to the user
No need to execute offline work
Provide a personalized experience to each
user
Here we focus (mainly) on
streaming applications ...
12. ARCHITECTURES: STREAMING (KAPPA)
[Diagram: Client → App Server → Streaming Processing on Hadoop nodes, backed by NoSQL + messaging + other stores]
13. APACHE SPARK
One of the main reasons for Spark's success is that, by design, it offers a unified model for:
Batch processing and streaming (MapReduce / Storm)
Multi-language: core in Scala, can be also
used in Java, Python and R
Declarative semantics (SQL), procedural
and functional programming (Hive /
MapReduce / Pig)
General purpose data processing and
machine learning (MapReduce / Mahout)
[Chart: logistic regression runtime, Spark vs. Hadoop (a 2013 benchmark)]
And the performance is really impressive!
15. SPARK ARCHITECTURE: DATA DISTRIBUTION
[Diagram: a Spark driver coordinating multiple executors; executors read/write HDFS / S3 and consume Kafka partitions]
Why distribution? ... to spread the work across thousands of workers
16. SPARK DEMO
A simple quickstart showing a batch
application doing a word count.
The purpose is to show which parts of the software are executed in the driver and which are sent to the remote executors.
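As a reference, the word count could look like this minimal Java sketch (the input path is an assumption): the SparkContext lives in the driver, while the lambdas passed to flatMap, mapToPair and reduceByKey are serialized and shipped to the executors.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Created in the driver: it plans the job and coordinates the executors
        SparkConf conf = new SparkConf().setAppName("wordcount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // assumed path

        // These lambdas are serialized and executed on the remote executors
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // take() moves only a small sample back to the driver (unlike collect())
        counts.take(10).forEach(t -> System.out.println(t._1() + ": " + t._2()));
        sc.stop();
    }
}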
17. SPARK: WHAT CAN GO WRONG ???
Programming with Spark is awesome, but there
are a lot of problems you need to solve:
Most of the problems do not show up on a standalone cluster and only appear on a real one (serialization issues, classloading issues, network issues, ...)
Something that works well with a small dataset
hangs forever in production: you must test it
earlier with a lot of data!
Spark applications must be tuned: decide
caching steps, decide partitioning, reduce the
number of shuffles, do not "collect()"
For streaming apps:
Exactly-once message semantics cannot be guaranteed at all times: you have to write idempotent operations to cope with duplicate messages
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCl
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$Closur
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:1
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationSco
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationSco
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.map(RDD.scala:323)
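The stack trace above typically comes from a lambda that accidentally captures a non-serializable enclosing object. A minimal sketch of the problem and a common fix (class and field names are hypothetical):

import org.apache.spark.api.java.JavaRDD;

public class LineTagger { // hypothetical class, not Serializable
    private final String prefix = ">> ";

    public JavaRDD<String> tag(JavaRDD<String> lines) {
        // Broken: referencing the field captures 'this', i.e. the whole LineTagger
        // return lines.map(line -> prefix + line);

        // Fix: copy the field into a local variable, so only the String is shipped
        final String p = prefix;
        return lines.map(line -> p + line);
    }
}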
18. AND SPARK IS THE MINOR PROBLEM
Every operation is executed by different software distributed across hundreds of machines.
Production differs from the development environment, and it is difficult to keep software aligned (patches and upgrades of NoSQL databases, message brokers, ...): you need to test the integration with different versions of external systems
The whole system randomly doesn't work: e.g.
you expect a message in a queue or you expect
a result in a query, and your expectations are
not satisfied
Debugging is hard: it's better to fix bugs
earlier, but bugs appear late in production!
There are consistency problems in some
NoSQL databases: the CAP theorem
Too many steps from development to release, with different teams involved: development, QA, operations.
I can continue...
19. THE 2 MAIN PROBLEMS
1. The local development environment is usually too different from production (replication and versions), so you find bugs late
2. Big Data applications are difficult to test effectively, because configuring a good testing environment is a hard and long task: it's difficult to test and debug everything together
Easy solution:
Test early
Test more
Test the production software
How? We can apply DevOps practices with Kubernetes ...
20. SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and Openshift Origin
Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
21. DEVOPS
No common definition
Combination of software development, QA and operations: working together to release working software frequently
Applying common dev and agile techniques
to operations
It's a cultural change, rather than just a
change in development and administration
tools
Common practices:
Infrastructure as code
Continuous integration
Continuous delivery
22. WHY DEVOPS
Increased communication and
collaboration among: dev, ops, qa,
business
Fast time to market: accommodating business and customer needs quickly to gain a competitive advantage
Embrace change, do not fear it
Embrace innovation (and team
engagement), through usage of new
technologies
Improved quality of software through
automated testing
23. A DEVOPS WORKFLOW
The key point of DevOps is being able to
collaborate on the infrastructure the same
way you collaborate on software:
Common shared repository (SCM)
Pull requests
Peer review
Automated checks (correctness, quality)
Continuous integration
Continuous delivery
Testing the infrastructure like you test
software: scalability, fault tolerance, backup,
recovery options.
DevOps practices become difficult to apply, and ineffective, when your infrastructure does not follow the software release cycle.
24. DOCKER
Docker has revolutionized the way we build
and ship software.
Containers can be created using a
Dockerfile (example on the right)
There are at least 10 different ways to
create Docker containers
A container is not a VM (even if it seems
really close to it)
A container hosts a single application
Containers need to be orchestrated to
deliver their full power
FROM ubuntu:16.04
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:
RUN echo "deb http://repo.mongodb.org/apt/ubuntu $(cat
RUN apt-get update && apt-get install -y mongodb-org
# Create the MongoDB data directory
RUN mkdir -p /data/db
# Expose port #27017 from the container to the host
EXPOSE 27017
# Set /usr/bin/mongod as the dockerized entry-point app
ENTRYPOINT ["/usr/bin/mongod"]
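For reference, building and running the image above could look like this (the image name is an assumption):

docker build -t my-mongodb .
docker run -d -p 27017:27017 my-mongodb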
25. DOCKER ARCHITECTURE
Docker is the building block of infrastructure as code:
The image is completely described by the Dockerfile
Any container is stateless (although stateful data can be attached with volumes).
This means:
You don't ssh into your server and change the configuration
You rebuild a new version of the application when you need to change something
"Deployment" means basically replacing a container with a different container
26. KUBERNETES
A cloud platform to orchestrate (Docker) containers, with several deployment options (including local and private cloud):
Born at Google
Production ready (Pokemon Go!)
Provides:
Application composition in namespaces
Virtual networks
Service discovery
Load balancing
Auto (and manual) scaling
Automatic recovery
Health checking
Rolling upgrades
Monitoring
Logging
....
kubectl run --image=nginx nginx-app --port=80
oc run --image=nginx nginx-app --port=80
# or better..
# oc new-app nginx --name nginx-app2
BOTH Kubernetes and Openshift Origin are Open Source projects!
27. KUBERNETES ARCHITECTURE
Very high level architecture of Kubernetes.
Note: it resembles the Spark architecture seen in previous slides (with the Kubernetes master playing the
role of the cluster manager), but we want to deploy a whole Spark application inside Kubernetes
(including the driver).
28. KUBERNETES FOR LOCAL DEV
Minikube: https://github.com/kubernetes/minikube
OPENSHIFT FOR LOCAL DEV
Minishift ("oc cluster up"): https://github.com/minishift/minishift
$ minikube start
Starting local Kubernetes cluster...
Running pre-create checks...
Creating machine...
Starting local Kubernetes cluster...
$ kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
$ minishift start
...
$ oc cluster up
-- Checking OpenShift client ... OK
-- Checking Docker client ... OK
...
29. SUMMARY
Overview of Big Data Systems and Apache Spark
DevOps, Docker, Kubernetes and Openshift Origin
Demo: Oshinko, Kafka, Spring-Boot and Fabric8 tools
30. OPENSHIFT ORIGIN DEMO
A simple recommender system using collaborative filtering on Openshift Origin.
Deploy:
a Spring-Boot microservice
a Kafka broker
a Spark cluster
into Openshift Origin.
Source code:
https://github.com/nicolaferraro/voxxed-bigdata-spark
[Diagram: clients send ratings and receive recommendations through the microservice, backed by Kafka and the Spark cluster]
31. SPARK ON KUBERNETES
The problem:
A Spark application needs a cluster of n
worker machines to run
You need 1 cluster manager node
The application runs on 1 driver node
We want to deploy everything
inside Kubernetes or Openshift Origin
Basically, every piece of the Spark architecture
should become a POD.
There are multiple solutions. I'm going to
introduce the Oshinko project.
http://radanalytics.io/
https://github.com/radanalyticsio
[Diagram: recall the Spark architecture — the driver, cluster manager and workers each become PODs]
32. OSHINKO
Making Apache Spark Cloud Native on Openshift Origin.
[Diagram: two Spark clusters, each with a driver (D), cluster manager (CM) and workers (W)]
One cluster per application, deployed in the application namespace, managed through the Oshinko console.
33. DEMO TIME
We will deploy the Oshinko rest server and then deploy the spark
streaming application. It will connect to Kafka to retrieve ratings
and return personalized (not really) recommendations.
Sources:
https://github.com/nicolaferraro/voxxed-bigdata-spark
I used a basic implementation of the slope one algorithm (https://en.wikipedia.org/wiki/Slope_One) for collaborative filtering.
It's just an example of a "real life" custom Spark application that sometimes doesn't work :)
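Not the actual demo code, but a rough sketch of how a Spark Streaming application can attach to Kafka with the direct stream API (topic name, broker address and group id are assumptions):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RatingsStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("recommender");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092"); // assumed service name
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "recommender");

        JavaInputDStream<ConsumerRecord<String, String>> ratings =
                KafkaUtils.createDirectStream(ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Arrays.asList("ratings"), kafkaParams)); // assumed topic

        // The real logic (slope one) would go here; we just print the raw ratings
        ratings.map(ConsumerRecord::value).print();

        ssc.start();
        ssc.awaitTermination();
    }
}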
34. TESTING
One of the major advantages of making a cloud native Spark application is the
possibility of executing system tests on software identical to production.
You can create virtual test environments on the fly and run assertions.
[Diagram: production pods, optional testing tools or load generators, and a test driver (inside or outside the namespace)]
Demo: using fabric8-arquillian and the fabric8 kubernetes client
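As an illustration, a simple assertion with the Fabric8 Kubernetes Client might look like this sketch (the namespace name is an assumption):

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class EnvironmentCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Check that every pod of the (hypothetical) test namespace is running
            for (Pod pod : client.pods().inNamespace("test-env").list().getItems()) {
                String phase = pod.getStatus().getPhase();
                System.out.println(pod.getMetadata().getName() + " -> " + phase);
                if (!"Running".equals(phase)) {
                    throw new IllegalStateException("Pod not running: " + pod.getMetadata().getName());
                }
            }
        }
    }
}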
35. TESTING: SOME SCENARIOS
Some tests that you may want to run:
Deploy the production infrastructure (or part of it) and run a suite of functional
tests to ensure every piece works as expected to produce the result
Deploy the production infrastructure and do performance tests using external
or injected data
Deploy the production infrastructure (or part of it) and check that every piece is
healthy. Then randomly kill pods while sending a load to test the system
availability (system tests)
The Chaos Monkey!
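A toy version of the pod-killing idea, again with the Fabric8 Kubernetes Client (the namespace name is an assumption); Kubernetes should recreate the pod through its replication controller:

import java.util.List;
import java.util.Random;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class TinyChaosMonkey {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            List<Pod> pods = client.pods().inNamespace("test-env").list().getItems();
            if (!pods.isEmpty()) {
                // Delete a random pod, then the test can assert the system recovers
                Pod victim = pods.get(new Random().nextInt(pods.size()));
                System.out.println("Deleting pod " + victim.getMetadata().getName());
                client.pods().inNamespace("test-env")
                      .withName(victim.getMetadata().getName()).delete();
            }
        }
    }
}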
36. FABRIC8
We have used some tools from Fabric8:
Fabric8 Maven Plugin: to build Kubernetes resources for Java projects
Docker Maven Plugin: used by the fabric8 maven plugin to create docker images
Fabric8 Arquillian: to create system tests from artifacts using the fabric8 plugin
Fabric8 Kubernetes Client: to interact with Kubernetes resources (pods, services, etc.)
Fabric8 is a complete development platform for
Kubernetes!
Uses Jenkins Pipelines to provide:
Continuous integration
Continuous delivery
Out of the box!
https://fabric8.io/
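For example, with the Fabric8 Maven Plugin configured in a project, a simplified flow could be:

mvn fabric8:build    # build the Docker image for the application
mvn fabric8:deploy   # generate the Kubernetes resources and deploy them to the cluster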
38. Q/A
What about storage and data locality? In cloud native applications, using a storage
system like S3 has many advantages over HDFS (costs, availability, usage patterns in
testing). HDFS performance is superior (6x), but you can use larger clusters for a
limited amount of time to achieve the same performance. More info: http://qr.ae/TAF4cN
Can you scale the application directly from Openshift, or do you always need to
change the code? Sure, you can use Openshift to scale the application. Beware that, in
a pure DevOps approach, changing the configuration of a system at runtime is not
something you should do, especially in production. You lose the benefits of having all
the software and configuration in an "executable" state, and you lose the effectiveness of
system tests, because tests will no longer run in an environment identical to production. As an
alternative, you can use the auto-scaling feature (enabling it in the code).
Is the Spark demo using a machine learning algorithm? The demo is using a basic
collaborative filtering algorithm called "slope one" (the focus of this talk is not machine
learning). Spark provides many algorithms out of the box, including some for collaborative
filtering. Many of them are meant to be executed in a batch application, rather than in a
streaming app.