BUILDING A REST JOB SERVER

FOR INTERACTIVE SPARK
AS A SERVICE
Romain Rigaux - Cloudera
Erick Tryzelaar - Cloudera
WHY?
NOTEBOOKS

• Easy access from anywhere
• Share Spark contexts and RDDs
• Build apps
• Spark magic
• …
WHY SPARK AS A SERVICE?

Married with the full Hadoop ecosystem

WHY SPARK IN HUE?
HISTORY

V1: OOZIE

THE GOOD
• It works
• Code snippets

THE BAD
• Submit through Oozie
• Shell action
• Very slow
• Batch only

[Diagram: workflow.xml wraps snippet.py; output is read back from stdout.]
HISTORY

V2: SPARK IGNITER

THE GOOD
• It works better

THE BAD
• Compile a jar
• Batch only, no shell
• No Python, R
• Security
• Single point of failure

[Diagram: implement the interface, compile a Scala jar, upload, run as a batch, read JSON output; in the style of the Ooyala job server.]
HISTORY

V3: NOTEBOOK

THE GOOD
• Like spark-submit / Spark shells
• Scala / Python / R shells
• Jar / Python batch jobs
• Notebook UI
• YARN

THE BAD
• Beta?

[Diagram: code snippets and batch jobs both go through Livy.]
GENERAL ARCHITECTURE

[Diagram: clients call a REST API on the Livy server; Livy manages multiple Spark instances running on YARN.]
LIVY SPARK SERVER

• REST web server in Scala for Spark submissions
• Interactive shell sessions or batch jobs
• Backends: Scala, Java, Python, R
• No dependency on Hue
• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
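The whole API is plain HTTP, so any client works. A minimal sketch in Python, assuming a Livy server on localhost:8998 (the port used in the examples later in this deck) and the requests library; the field names follow the Livy session API:

    import requests

    LIVY = 'http://localhost:8998'

    # List the interactive sessions the server currently manages.
    resp = requests.get(LIVY + '/sessions')
    resp.raise_for_status()
    for session in resp.json()['sessions']:
        print(session['id'], session['kind'], session['state'])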
ARCHITECTURE

• Standard web service: a wrapper around spark-submit / the Spark shells
• YARN mode: Spark drivers run inside the cluster (resilient to crashes)
• No need to inherit any interface or compile code
• Extensible to additional backends
LIVY WEB SERVER ARCHITECTURE

LOCAL "DEV" MODE vs. YARN MODE
LOCAL MODE

[Diagram, built up over several slides: the Livy Server (Scalatra, Session Manager, Session) forks a local Spark Client process that hosts the Spark Interpreter and Spark Context. Steps 1-5 trace a statement from the Livy Server through the Spark Client into the interpreter and back.]
YARN-CLUSTER MODE

PRODUCTION SCALABLE

[Diagram, built up over several slides: the Livy Server (Scalatra, Session Manager, Session) uses a Spark Client to submit to the YARN Master; the Spark Context and Spark Interpreter run inside a YARN node (the driver), with Spark Workers on other YARN nodes. Steps 1-7 trace a statement from the Livy Server through YARN into the interpreter and back.]
SESSION CREATION AND EXECUTION

% curl -XPOST localhost:8998/sessions \
    -d '{"kind": "spark"}'
{
  "id": 0,
  "kind": "spark",
  "log": [...],
  "state": "idle"
}

% curl -XPOST localhost:8998/sessions/0/statements -d '{"code": "1+1"}'
{
  "id": 0,
  "output": {
    "data": { "text/plain": "res0: Int = 2" },
    "execution_count": 0,
    "status": "ok"
  },
  "state": "available"
}
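In practice a client polls until the session and the statement are ready. A sketch of the same flow in Python, assuming requests; the endpoints and field names mirror the curl output above:

    import time
    import requests

    LIVY = 'http://localhost:8998'

    # Create an interactive Scala session and wait until the shell is idle.
    session = requests.post(LIVY + '/sessions', json={'kind': 'spark'}).json()
    session_url = '%s/sessions/%s' % (LIVY, session['id'])
    while requests.get(session_url).json()['state'] != 'idle':
        time.sleep(1)

    # Submit a statement, then poll it until its result is available.
    stmt = requests.post(session_url + '/statements', json={'code': '1+1'}).json()
    stmt_url = '%s/statements/%s' % (session_url, stmt['id'])
    while stmt['state'] != 'available':
        time.sleep(1)
        stmt = requests.get(stmt_url).json()

    print(stmt['output']['data']['text/plain'])   # res0: Int = 2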
BATCH OR INTERACTIVE

[Diagram: jar and Python files are submitted to /batches; Scala, Python, and R snippets go to /sessions; Livy runs both as Spark applications on YARN.]
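The batch side looks the same over the wire: you POST a file instead of code snippets. A hedged sketch against the /batches endpoint, using the "file" / "className" / "args" fields of the Livy batch API; the jar path and class name here are made up:

    import requests

    LIVY = 'http://localhost:8998'

    # Submit a compiled jar as a batch job (path and class are hypothetical).
    batch = requests.post(LIVY + '/batches', json={
        'file': 'hdfs:///user/demo/wordcount.jar',
        'className': 'com.example.WordCount',
        'args': ['shakespeare.txt'],
    }).json()

    print(batch['id'], batch['state'])   # then poll /batches/<id> for progress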
SHELL OR BATCH?

[Diagram: the same YARN-cluster layout; what runs inside the driver's YARN node differs.]

SHELL

[Diagram: for interactive sessions, the YARN node runs a shell (e.g. pyspark) hosting the Spark Context.]

BATCH

[Diagram: for batch jobs, the YARN node runs spark-submit hosting the Spark Context.]
LIVY INTERPRETERS
Scala, Python, R…

REMEMBER?

[Diagram: the Spark Interpreter sits next to the Spark Context inside the driver's YARN node, behind the Livy Server's Session.]
INTERPRETERS

• Pipe stdin/stdout to a running shell (see the sketch below)
• Execute the code / send it to the Spark workers
• Perform magic operations
• One interpreter per language
• "Swappable" with other kernels (Python, Spark…)

[Diagram: the Livy Server pipes println(1 + 1) into the interpreter and reads 2 back.]
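A toy illustration of that first bullet, not Livy's actual code: keep one long-lived interpreter process and feed it code over stdin, reading results back from stdout. The __RUN__/__DONE__ sentinels are our own invention:

    import subprocess
    import sys

    CHILD = r'''
    import sys
    buf = []
    for line in sys.stdin:
        if line.strip() == "__RUN__":
            try:
                exec("".join(buf), globals())
            except Exception as e:
                print("error:", e)
            buf = []
            print("__DONE__")
            sys.stdout.flush()
        else:
            buf.append(line)
    '''

    # One persistent child interpreter, code in via stdin, results out via stdout.
    shell = subprocess.Popen(
        [sys.executable, '-u', '-c', CHILD],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def execute(code):
        shell.stdin.write(code + '\n__RUN__\n')
        shell.stdin.flush()
        out = []
        for line in shell.stdout:
            if line.strip() == '__DONE__':
                break
            out.append(line)
        return ''.join(out)

    print(execute('print(1 + 1)'))          # 2
    print(execute('x = 40\nprint(x + 2)'))  # 42: state persists across calls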
INTERPRETER FLOW

[Built up over several slides: the client POSTs {"code": "1+1"} to the Livy Server; the server pipes 1+1 to the Interpreter, checking for magics on the way; the interpreter evaluates > 1 + 1, prints 2, and the server wraps the result as:]

{
  "data": {
    "application/json": "2"
  }
}
INTERPRETER FLOW CHART

[Flow chart: receive lines, split into chunks, then while chunks are left: if the chunk is a magic chunk, run the magic, otherwise execute the chunk; on success send the output to the server, on error send the error to the server. Example of parsing below.]
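A runnable sketch of that loop; the helper names and the one-line-per-chunk parsing are ours, not Livy's:

    import json

    def split_into_chunks(code):
        # One chunk per non-empty line; the real parser is more involved.
        return [line for line in code.splitlines() if line.strip()]

    MAGICS = {
        'json': lambda value: {'application/json': json.dumps(value)},
    }

    def run(code, scope):
        output = None
        for chunk in split_into_chunks(code):
            try:
                if chunk.startswith('%'):                    # magic chunk?
                    name, _, arg = chunk[1:].partition(' ')
                    output = MAGICS[name](eval(arg, scope))  # e.g. %json counts
                else:
                    exec(chunk, scope)                       # plain code chunk
                    output = {'text/plain': ''}
            except Exception as e:
                return {'status': 'error', 'evalue': str(e)} # error to server
        return {'status': 'ok', 'data': output}              # output to server

    scope = {}
    run('counts = [1, 2, 3]', scope)
    print(run('%json counts', scope))
    # {'status': 'ok', 'data': {'application/json': '[1, 2, 3]'}}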
INTERPRETER MAGIC
• table	
• json	
• plotting	
• ...
NO MAGIC

> 1 + 1

Interpreter

1+1
sparkIMain.interpret("1+1")

{
  "id": 0,
  "output": {
    "application/json": 2
  }
}
JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) =>
    Map("word" -> w, "count" -> c)
  }
%json counts

> counts
[('', 506610), ('the', 23407), ('I', 19540)... ]

Interpreter

sparkIMain.valueOfTerm("counts").toJson()

{
  "id": 0,
  "output": {
    "application/json": [
      { "count": 506610, "word": "" },
      { "count": 23407, "word": "the" },
      { "count": 19540, "word": "I" },
      ...
    ]
  ...
}
TABLE MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) =>
    Map("word" -> w, "count" -> c)
  }
%table counts

> counts
[('', 506610), ('the', 23407), ('I', 19540)... ]

Interpreter

sparkIMain.valueOfTerm("counts").guessHeaders().toList()

"application/vnd.livy.table.v1+json": {
  "headers": [
    { "name": "count", "type": "BIGINT_TYPE" },
    { "name": "name", "type": "STRING_TYPE" }
  ],
  "data": [
    [ 23407, "the" ],
    [ 19540, "I" ],
    [ 18358, "and" ],
    ...
  ]
}
PLOT MAGIC

...
barplot(sorted_data$count, names.arg=sorted_data$value,
  main="Resource hits", las=2,
  col=colfunc(nrow(sorted_data)),
  ylim=c(0,300))

> png('/tmp/..')
> barplot
> dev.off()

Interpreter

sparkIMain.interpret("png('/tmp/plot.png') barplot dev.off()")
File('/tmp/plot.png').read().toBase64()

{
  "data": {
    "image/png": "iVBORw0KGgoAAAANSUhEU
      ...
  }
  ...
}
PLUGGABLE INTERPRETERS

• Pluggable backends
• Livy's Spark backends
  – Scala
  – pyspark
  – R
• IPython / Jupyter support coming soon

JUPYTER BACKEND

• Re-using it
• Generic framework for interpreters
• 51 kernels

SPARK AS A SERVICE
REMEMBER AGAIN?

[Diagram: the YARN-cluster layout once more; the Spark Interpreter and Spark Context live in a YARN node behind the Livy Server's Session.]
MULTI USERS

[Diagram: one Livy Server; the Session Manager holds one Session per user, each driving its own Spark Client, Spark Context, and Spark Interpreter on its own YARN node.]
SHARED CONTEXTS?

[Diagram: several Spark Clients now point at a single Session, sharing one Spark Context and Interpreter on one YARN node.]
SHARED RDD?

[Diagram: the shared Spark Context holds an RDD that all clients can reach.]
SHARED RDDS?

[Diagram: several named RDDs live inside the shared Spark Context.]
SECURE IT?

[Diagram: the same shared-context layout; every client goes through the Livy Server, making it the single place to enforce security.]

SPARK AS SERVICE

[Diagram: clients talk only to the Livy Server, which fronts the Spark instances.]
SHARING RDDS

[Built up over several slides: a PySpark shell holds a shared RDD; plain Python shells reach it through Livy.]

Inside the PySpark session:

r = sc.parallelize([])
srdd = ShareableRdd(r)

The RDD holds records such as {'ak': 'Alaska'} and {'ca': 'California'}.

Any other shell can read it through the REST API:

curl -XPOST /sessions/0/statements -d {
  'code': srdd.get('ak')
}

Or with a small client-side wrapper:

states = SharedRdd('host/sessions/0', 'srdd')
states.get('ak')
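A hypothetical sketch of the two halves (the demo repo linked below has the real code): ShareableRdd wraps a pair RDD inside the session, and SharedRdd proxies get() through the statement API shown above:

    import json
    import requests

    # Server side, inside the PySpark session: wrap an RDD of (key, value)
    # pairs so that other sessions can query it by key.
    class ShareableRdd(object):
        def __init__(self, rdd):
            self.rdd = rdd

        def set(self, key, value):
            self.rdd = self.rdd.union(
                self.rdd.context.parallelize([(key, value)]))

        def get(self, key):
            return self.rdd.filter(lambda kv: kv[0] == key).collect()

    # Client side, from any plain Python shell: proxy get() through Livy's
    # statement API, as in the curl call above.
    class SharedRdd(object):
        def __init__(self, session_url, name):
            self.session_url = session_url  # e.g. 'http://host:8998/sessions/0'
            self.name = name                # variable name inside the session

        def get(self, key):
            code = '%s.get(%s)' % (self.name, json.dumps(key))
            resp = requests.post(self.session_url + '/statements',
                                 json={'code': code})
            return resp.json()  # then poll /statements/<id> until available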
DEMO TIME

https://github.com/romainr/hadoop-tutorials-examples/tree/master/notebook/shared_rdd
SECURITY

• SSL support
• Persistent sessions
• Kerberos
SPARK MAGIC

• From Microsoft
• Python magics for working with remote Spark clusters
• Open Source: https://github.com/jupyter-incubator/sparkmagic
FUTURE

• Move to an external repo?
• Security
• IPython/Jupyter backends and file format
• Shared named RDDs / contexts?
• Share data
• Spark specific, language generic, or both?
• Leverage Hue 4
https://issues.cloudera.org/browse/HUE-2990
LIVY'S CHEAT SHEET

• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
• Scala, Java, Python, R
• Type introspection for visualization
• YARN-cluster or local modes
• Code snippets / compiled
• REST API
• Pluggable backends
• Magic keywords
• Failure resilient
• Security
THANK YOU!

TWITTER
@gethue

USER GROUP
hue-user@

WEBSITE
http://gethue.com

LEARN
http://learn.gethue.com
