Getting Started Running Apache Spark on Apache Mesos, 2014-01-24
Paco Nathan
liber118.com/pxn
@pacoid
Spark on Mesos, 2014-01-24
• what is Apache Mesos?
• launch a Mesos cluster in the cloud
• configure and run Spark on Mesos
• run jobs in Spark
• further resources…
Datacenter Computing
Google has been doing datacenter computing for years, to address the complexities of large-scale data workflows:
• leveraging the modern kernel: isolation in lieu of VMs
• “most (>80%) jobs are batch jobs, but the majority of resources (55–80%) are allocated to service jobs”
• mixed workloads, multi-tenancy
• among the top 10 Linux kernel OSS contributors: cgroups
• relatively high utilization rates
• JVM? not so much…

take-aways: scheduling batch is not so difficult; scheduling services is hard and expensive
Google describes the business case…
Taming Latency Variability
Jeff Dean
plus.google.com/u/0/+ResearchatGoogle/posts/C1dPhQhcDRv

“Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/googleborg-twitter-mesos

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html

2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
Google describes the technology…
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Mesos – open source datacenter computing
a common substrate for cluster computing
mesos.apache.org
heterogeneous assets in your datacenter or cloud made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation (pluggable) for CPU, RAM, I/O, FS, etc.
• fault-tolerant leader election based on ZooKeeper
• APIs in C++, Java, Python, Go
• web UI for inspecting cluster state
• available for Linux, OpenSolaris, Mac OS X
Mesos – architecture
[diagram: layered stack labeled Workloads (services, batch), Apps, Frameworks, Kernel (distributed resources: CPU, RAM, I/O, FS, rack locality, etc.), DFS (distributed file system), Cluster; components shown: Scalding, MPI, Impala, Hadoop, Shark, Spark, MySQL, Kafka, JBoss, Django, Chronos, Storm, Rails, Marathon]
Mesos – architecture
apps: HA services, web apps, batch jobs, scripts, etc.
frameworks: Spark, Storm, MPI, Jenkins, etc.
task schedulers: Chronos, etc.
meta-frameworks: Aurora, Marathon
APIs: C++, JVM, Py, Go
Mesos, distrib kernel
HDFS, distrib file system
Linux: libcgroup, libprocess, libev, etc.
Mesos – dynamics
[diagram labels: scheduled apps; HA services; distrib frameworks; Marathon, distrib init.d; Chronos, distrib cron; Mesos, distrib kernel]
Mesos – dynamics
[diagram labels: distributed framework (Scheduler, Executors); Mesos slaves; Mesos master; resource offers; available resources; distributed kernel]
Production Deployments (public)
Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”
Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
wired.com/gadgetlab/2013/11/qa-with-chris-fry/
• key services run in production: analytics, typeahead, ads
• allows services to scale and leverage a shared pool of servers across datacenters efficiently
• reduces the time between prototyping and launching
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think about resources like CPU, memory and disk
launch a Mesos cluster in the cloud
http://elastic.mesosphere.io
launch a Mesos cluster in the Amazon AWS cloud in three simple steps, given:
• AWS credentials
• SSH public key
• email address
configure and run Spark on Mesos
http://mesosphere.io/learn/run-spark-on-mesos/
configure and run Spark on a Mesos cluster on AWS, in a seven-step tutorial…
step 1: ssh to master
ssh -l ubuntu <master>

step 2: install git, jdk-7
sudo aptitude -y install git
sudo aptitude -y install openjdk-7-jdk

step 3: download spark
wget http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz
tar xzf spark-0.8.0-incubating-bin-cdh4.tgz
cd spark-0.8.0-incubating-bin-cdh4/

step 4: sbt clean assembly
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.4.0 sbt/sbt clean assembly

step 5: make distro, cp to HDFS
./make-distribution.sh --hadoop 2.0.0-mr1-cdh4.4.0
mv dist spark-0.8.0-2.0.0-mr1-cdh4.4.0
tar czf spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz spark-0.8.0-2.0.0-mr1-cdh4.4.0

hadoop fs -mkdir /tmp
hadoop fs -put spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz /tmp
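
The archive name built in step 5 has to match the path later given to SPARK_EXECUTOR_URI. A minimal sketch, assuming a POSIX shell, that derives both from two variables so they cannot drift apart. The variable names and the `executor_uri` helper are hypothetical (they are not Spark settings or part of the tutorial), and the namenode host is passed in as a placeholder argument.

```shell
# Hypothetical variables, not Spark settings: the versions used in steps 3-5.
spark_version="0.8.0"
hadoop_version="2.0.0-mr1-cdh4.4.0"

# Directory and archive names as produced in step 5.
dist_name="spark-${spark_version}-${hadoop_version}"
dist_archive="${dist_name}.tgz"

# Hypothetical helper: print the HDFS URI of the archive uploaded to /tmp,
# given the namenode host (the <nn> placeholder used in the tutorial).
executor_uri() {
  printf 'hdfs://%s/tmp/%s\n' "$1" "$dist_archive"
}
```

For example, `executor_uri namenode.example.com` (an invented host name) prints `hdfs://namenode.example.com/tmp/spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz`.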

step 6: config env
cd conf/
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# in spark-env.sh, set:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://<nn>/tmp/spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz
export MASTER=zk://<master>:2181/mesos

cat spark-env.sh
cd ..

./spark-shell
et voilà!
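
Before launching the shell, it can help to confirm that spark-env.sh really contains the three exports from step 6. A minimal sketch, assuming a POSIX shell and that each variable is set on its own `export NAME=` line; `check_spark_env` is a hypothetical helper, not something shipped with Spark or Mesos.

```shell
# Hypothetical helper: check that the given file exports the three variables
# from step 6. Prints one line per missing name; returns 1 if any is absent.
check_spark_env() {
  env_file="$1"
  missing=0
  for name in MESOS_NATIVE_LIBRARY SPARK_EXECUTOR_URI MASTER; do
    # Only matches lines that start with "export NAME=".
    if ! grep -q "^export ${name}=" "$env_file"; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use from the Spark directory: `check_spark_env conf/spark-env.sh && ./spark-shell`, which starts the shell only when all three names are present. It checks for presence only and does not validate the values.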
run jobs in Spark
http://spark.incubator.apache.org/examples.html
run an example job in Spark, to filter an RDD of integers, in two steps at the REPL…
step 1: create an RDD
val data = 1 to 10000
val distData = sc.parallelize(data)

step 2: run the filter
distData.filter(_ < 10).collect()
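
As a local cross-check of what the filter should select, the same predicate can be applied to the same range without a cluster. This is only a sketch: it uses `seq` and `awk` on one machine rather than Spark, and `filter_below` is a hypothetical helper name.

```shell
# Local stand-in for the Spark job: generate 1..10000 and keep the values
# below the given limit. Mirrors the predicate in the REPL filter, but runs
# in a single process with no RDD involved.
filter_below() {
  limit="$1"
  seq 1 10000 | awk -v limit="$limit" '$1 < limit'
}
```

Running `filter_below 10` prints the integers 1 through 9, one per line, which are the values the Spark filter is expected to keep.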
further resources…
Join us!
O’Reilly Strata, Santa Clara, Feb 11-13
strataconf.com/strata2014
Mesos tutorial, Tue 2/11 1:30pm
BOF lunch, Wed 2/12 12:10pm
Mesos session, Thu 2/13 2:20pm
office hours, Thu 2/13 3:15pm
More insights…
Monthly newsletter for events, conf summaries, workshops, etc.:
liber118.com/pxn/
collected Mesos notes:
goo.gl/jPtTP