BUILDING A REST JOB SERVER

FOR INTERACTIVE SPARK
AS A SERVICE
Romain Rigaux - Cloudera
Erick Tryzelaar - Cloudera
WHY?
NOTEBOOKS

• Easy access from anywhere
• Share Spark contexts and RDDs
• Build apps
• Spark magic
• …
WHY SPARK AS A SERVICE?

Married with the full Hadoop ecosystem

WHY SPARK IN HUE?
HISTORY

V1: OOZIE

THE GOOD
• It works
• Code snippets

THE BAD
• Submit through Oozie
• Shell action
• Very slow
• Batch only

[Diagram: workflow.xml wraps snippet.py; output is read back from stdout.]
HISTORY

V2: SPARK IGNITER

THE GOOD
• It works better

THE BAD
• Compile a jar
• Batch only, no shell
• No Python, R
• Security
• Single point of failure

[Diagram: implement the interface, compile a Scala jar, upload, run as a batch, read JSON output; in the style of the Ooyala job server.]
HISTORY

V3: NOTEBOOK

THE GOOD
• Like spark-submit / Spark shells
• Scala / Python / R shells
• Jar / Python batch jobs
• Notebook UI
• YARN

THE BAD
• Beta?

[Diagram: code snippets and batch jobs both go through Livy.]
GENERAL ARCHITECTURE

[Diagram: clients call a REST API on the Livy server; Livy manages multiple Spark instances running on YARN.]
LIVY SPARK SERVER

• REST web server in Scala for Spark submissions
• Interactive shell sessions or batch jobs
• Backends: Scala, Java, Python, R
• No dependency on Hue
• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
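The whole API is plain HTTP, so any client works. A minimal sketch in Python, assuming a Livy server on localhost:8998 (the port used in the examples later in this deck) and the requests library; the field names follow the Livy session API:

    import requests

    LIVY = 'http://localhost:8998'

    # List the interactive sessions the server currently manages.
    resp = requests.get(LIVY + '/sessions')
    resp.raise_for_status()
    for session in resp.json()['sessions']:
        print(session['id'], session['kind'], session['state'])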
ARCHITECTURE

• Standard web service: a wrapper around spark-submit / the Spark shells
• YARN mode: Spark drivers run inside the cluster (resilient to crashes)
• No need to inherit any interface or compile code
• Extensible to additional backends
LIVY WEB SERVER ARCHITECTURE

LOCAL "DEV" MODE vs. YARN MODE
LOCAL MODE

[Diagram, built up over several slides: the Livy Server (Scalatra, Session Manager, Session) forks a local Spark Client process that hosts the Spark Interpreter and Spark Context. Steps 1-5 trace a statement from the Livy Server through the Spark Client into the interpreter and back.]
YARN-CLUSTER MODE

PRODUCTION SCALABLE

[Diagram, built up over several slides: the Livy Server (Scalatra, Session Manager, Session) uses a Spark Client to submit to the YARN Master; the Spark Context and Spark Interpreter run inside a YARN node (the driver), with Spark Workers on other YARN nodes. Steps 1-7 trace a statement from the Livy Server through YARN into the interpreter and back.]
SESSION CREATION AND EXECUTION

% curl -XPOST localhost:8998/sessions \
    -d '{"kind": "spark"}'
{
  "id": 0,
  "kind": "spark",
  "log": [...],
  "state": "idle"
}

% curl -XPOST localhost:8998/sessions/0/statements -d '{"code": "1+1"}'
{
  "id": 0,
  "output": {
    "data": { "text/plain": "res0: Int = 2" },
    "execution_count": 0,
    "status": "ok"
  },
  "state": "available"
}
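In practice a client polls until the session and the statement are ready. A sketch of the same flow in Python, assuming requests; the endpoints and field names mirror the curl output above:

    import time
    import requests

    LIVY = 'http://localhost:8998'

    # Create an interactive Scala session and wait until the shell is idle.
    session = requests.post(LIVY + '/sessions', json={'kind': 'spark'}).json()
    session_url = '%s/sessions/%s' % (LIVY, session['id'])
    while requests.get(session_url).json()['state'] != 'idle':
        time.sleep(1)

    # Submit a statement, then poll it until its result is available.
    stmt = requests.post(session_url + '/statements', json={'code': '1+1'}).json()
    stmt_url = '%s/statements/%s' % (session_url, stmt['id'])
    while stmt['state'] != 'available':
        time.sleep(1)
        stmt = requests.get(stmt_url).json()

    print(stmt['output']['data']['text/plain'])   # res0: Int = 2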
BATCH OR INTERACTIVE

[Diagram: jar and Python files are submitted to /batches; Scala, Python, and R snippets go to /sessions; Livy runs both as Spark applications on YARN.]
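The batch side looks the same over the wire: you POST a file instead of code snippets. A hedged sketch against the /batches endpoint, using the "file" / "className" / "args" fields of the Livy batch API; the jar path and class name here are made up:

    import requests

    LIVY = 'http://localhost:8998'

    # Submit a compiled jar as a batch job (path and class are hypothetical).
    batch = requests.post(LIVY + '/batches', json={
        'file': 'hdfs:///user/demo/wordcount.jar',
        'className': 'com.example.WordCount',
        'args': ['shakespeare.txt'],
    }).json()

    print(batch['id'], batch['state'])   # then poll /batches/<id> for progress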
SHELL OR BATCH?

[Diagram: the same YARN-cluster layout; what runs inside the driver's YARN node differs.]

SHELL

[Diagram: for interactive sessions, the YARN node runs a shell (e.g. pyspark) hosting the Spark Context.]

BATCH

[Diagram: for batch jobs, the YARN node runs spark-submit hosting the Spark Context.]
LIVY INTERPRETERS
Scala, Python, R…

REMEMBER?

[Diagram: the Spark Interpreter sits next to the Spark Context inside the driver's YARN node, behind the Livy Server's Session.]
INTERPRETERS

• Pipe stdin/stdout to a running shell (see the sketch below)
• Execute the code / send it to the Spark workers
• Perform magic operations
• One interpreter per language
• "Swappable" with other kernels (Python, Spark…)

[Diagram: the Livy Server pipes println(1 + 1) into the interpreter and reads 2 back.]
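A toy illustration of that first bullet, not Livy's actual code: keep one long-lived interpreter process and feed it code over stdin, reading results back from stdout. The __RUN__/__DONE__ sentinels are our own invention:

    import subprocess
    import sys

    CHILD = r'''
    import sys
    buf = []
    for line in sys.stdin:
        if line.strip() == "__RUN__":
            try:
                exec("".join(buf), globals())
            except Exception as e:
                print("error:", e)
            buf = []
            print("__DONE__")
            sys.stdout.flush()
        else:
            buf.append(line)
    '''

    # One persistent child interpreter, code in via stdin, results out via stdout.
    shell = subprocess.Popen(
        [sys.executable, '-u', '-c', CHILD],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def execute(code):
        shell.stdin.write(code + '\n__RUN__\n')
        shell.stdin.flush()
        out = []
        for line in shell.stdout:
            if line.strip() == '__DONE__':
                break
            out.append(line)
        return ''.join(out)

    print(execute('print(1 + 1)'))          # 2
    print(execute('x = 40\nprint(x + 2)'))  # 42: state persists across calls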
INTERPRETER FLOW

[Built up over several slides: the client POSTs {"code": "1+1"} to the Livy Server; the server pipes 1+1 to the Interpreter, checking for magics on the way; the interpreter evaluates > 1 + 1, prints 2, and the server wraps the result as:]

{
  "data": {
    "application/json": "2"
  }
}
INTERPRETER FLOW CHART

[Flow chart: receive lines, split into chunks, then while chunks are left: if the chunk is a magic chunk, run the magic, otherwise execute the chunk; on success send the output to the server, on error send the error to the server. Example of parsing below.]
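A runnable sketch of that loop; the helper names and the one-line-per-chunk parsing are ours, not Livy's:

    import json

    def split_into_chunks(code):
        # One chunk per non-empty line; the real parser is more involved.
        return [line for line in code.splitlines() if line.strip()]

    MAGICS = {
        'json': lambda value: {'application/json': json.dumps(value)},
    }

    def run(code, scope):
        output = None
        for chunk in split_into_chunks(code):
            try:
                if chunk.startswith('%'):                    # magic chunk?
                    name, _, arg = chunk[1:].partition(' ')
                    output = MAGICS[name](eval(arg, scope))  # e.g. %json counts
                else:
                    exec(chunk, scope)                       # plain code chunk
                    output = {'text/plain': ''}
            except Exception as e:
                return {'status': 'error', 'evalue': str(e)} # error to server
        return {'status': 'ok', 'data': output}              # output to server

    scope = {}
    run('counts = [1, 2, 3]', scope)
    print(run('%json counts', scope))
    # {'status': 'ok', 'data': {'application/json': '[1, 2, 3]'}}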
INTERPRETER MAGIC
• table	
• json	
• plotting	
• ...
NO MAGIC

> 1 + 1

Interpreter

1+1
sparkIMain.interpret("1+1")

{
  "id": 0,
  "output": {
    "application/json": 2
  }
}
JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) =>
    Map("word" -> w, "count" -> c)
  }
%json counts

> counts
[('', 506610), ('the', 23407), ('I', 19540)... ]

Interpreter

sparkIMain.valueOfTerm("counts").toJson()

{
  "id": 0,
  "output": {
    "application/json": [
      { "count": 506610, "word": "" },
      { "count": 23407, "word": "the" },
      { "count": 19540, "word": "I" },
      ...
    ]
  ...
}
TABLE MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) =>
    Map("word" -> w, "count" -> c)
  }
%table counts

> counts
[('', 506610), ('the', 23407), ('I', 19540)... ]

Interpreter

sparkIMain.valueOfTerm("counts").guessHeaders().toList()

"application/vnd.livy.table.v1+json": {
  "headers": [
    { "name": "count", "type": "BIGINT_TYPE" },
    { "name": "name", "type": "STRING_TYPE" }
  ],
  "data": [
    [ 23407, "the" ],
    [ 19540, "I" ],
    [ 18358, "and" ],
    ...
  ]
}
PLOT MAGIC

...
barplot(sorted_data$count, names.arg=sorted_data$value,
  main="Resource hits", las=2,
  col=colfunc(nrow(sorted_data)),
  ylim=c(0,300))

> png('/tmp/..')
> barplot
> dev.off()

Interpreter

sparkIMain.interpret("png('/tmp/plot.png') barplot dev.off()")
File('/tmp/plot.png').read().toBase64()

{
  "data": {
    "image/png": "iVBORw0KGgoAAAANSUhEU
      ...
  }
  ...
}
PLUGGABLE INTERPRETERS

• Pluggable backends
• Livy's Spark backends
  – Scala
  – pyspark
  – R
• IPython / Jupyter support coming soon

JUPYTER BACKEND

• Re-using it
• Generic framework for interpreters
• 51 kernels

SPARK AS A SERVICE
REMEMBER AGAIN?

[Diagram: the YARN-cluster layout once more; the Spark Interpreter and Spark Context live in a YARN node behind the Livy Server's Session.]
MULTI USERS

[Diagram: one Livy Server; the Session Manager holds one Session per user, each driving its own Spark Client, Spark Context, and Spark Interpreter on its own YARN node.]
SHARED CONTEXTS?

[Diagram: several Spark Clients now point at a single Session, sharing one Spark Context and Interpreter on one YARN node.]
SHARED RDD?

[Diagram: the shared Spark Context holds an RDD that all clients can reach.]
SHARED RDDS?

[Diagram: several named RDDs live inside the shared Spark Context.]
SECURE IT?

[Diagram: the same shared-context layout; every client goes through the Livy Server, making it the single place to enforce security.]

SPARK AS SERVICE

[Diagram: clients talk only to the Livy Server, which fronts the Spark instances.]
SHARING RDDS

[Built up over several slides: a PySpark shell holds a shared RDD; plain Python shells reach it through Livy.]

Inside the PySpark session:

r = sc.parallelize([])
srdd = ShareableRdd(r)

The RDD holds records such as {'ak': 'Alaska'} and {'ca': 'California'}.

Any other shell can read it through the REST API:

curl -XPOST /sessions/0/statements -d {
  'code': srdd.get('ak')
}

Or with a small client-side wrapper:

states = SharedRdd('host/sessions/0', 'srdd')
states.get('ak')
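A hypothetical sketch of the two halves (the demo repo linked below has the real code): ShareableRdd wraps a pair RDD inside the session, and SharedRdd proxies get() through the statement API shown above:

    import json
    import requests

    # Server side, inside the PySpark session: wrap an RDD of (key, value)
    # pairs so that other sessions can query it by key.
    class ShareableRdd(object):
        def __init__(self, rdd):
            self.rdd = rdd

        def set(self, key, value):
            self.rdd = self.rdd.union(
                self.rdd.context.parallelize([(key, value)]))

        def get(self, key):
            return self.rdd.filter(lambda kv: kv[0] == key).collect()

    # Client side, from any plain Python shell: proxy get() through Livy's
    # statement API, as in the curl call above.
    class SharedRdd(object):
        def __init__(self, session_url, name):
            self.session_url = session_url  # e.g. 'http://host:8998/sessions/0'
            self.name = name                # variable name inside the session

        def get(self, key):
            code = '%s.get(%s)' % (self.name, json.dumps(key))
            resp = requests.post(self.session_url + '/statements',
                                 json={'code': code})
            return resp.json()  # then poll /statements/<id> until available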
DEMO TIME

https://github.com/romainr/hadoop-tutorials-examples/tree/master/notebook/shared_rdd
SECURITY

• SSL support
• Persistent sessions
• Kerberos
SPARK MAGIC

• From Microsoft
• Python magics for working with remote Spark clusters
• Open Source: https://github.com/jupyter-incubator/sparkmagic
FUTURE

• Move to an external repo?
• Security
• IPython/Jupyter backends and file format
• Shared named RDDs / contexts?
• Share data
• Spark specific, language generic, or both?
• Leverage Hue 4
https://issues.cloudera.org/browse/HUE-2990
LIVY'S CHEAT SHEET

• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
• Read about it: http://gethue.com/spark/
• Scala, Java, Python, R
• Type introspection for visualization
• YARN-cluster or local modes
• Code snippets / compiled
• REST API
• Pluggable backends
• Magic keywords
• Failure resilient
• Security
THANK YOU!

TWITTER
@gethue

USER GROUP
hue-user@

WEBSITE
http://gethue.com

LEARN
http://learn.gethue.com
