This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.
2. An Exciting Year for Spark
Very fast community growth
1.0 release in May
7+ distributors, 20+ apps
3. Project Activity
                          June 2013    June 2014
total contributors               68          255
companies contributing           17           50
total lines of code          63,000      175,000
6. Compared to Other Projects
[Bar charts: activity in the past 6 months, measured in Commits and in Lines of Code Changed, for MapReduce, YARN, HDFS, Storm, and Spark]
Spark is now the most active project in the Hadoop ecosystem
7. Compared to Other Projects
Spark is one of top 3 most active projects at Apache
More active than “general” data-processing projects
like NumPy, matplotlib, and scikit-learn
10. Last Summit
Last Summit we said we’d focus on two things:
• Standard libraries
• Enterprise features
New libraries: Spark SQL, MLlib (machine learning),
GraphX (graph processing)
Enterprise features: security, monitoring, HA
11. Spark SQL
Enables loading & querying structured data in Spark
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
{"text": "hi",
 "user": {
  "name": "matei",
  "id": 123
 }}
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
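Outside Spark, the nested access that "user.name" performs on JSON records can be illustrated with plain Python's json module. This is a stand-alone sketch of the data shape, not Spark SQL itself; the record copies the tweets.json example above.

```python
import json

# Stand-alone illustration (not Spark SQL): parse one record shaped like
# the tweets.json example and read the nested user.name field.
record = json.loads('{"text": "hi", "user": {"name": "matei", "id": 123}}')

text = record["text"]               # the top-level "text" column
user_name = record["user"]["name"]  # nested field, like user.name in the query
```

Spark SQL infers a schema over many such records, so the same dotted path works across a whole table rather than one dictionary at a time.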
12. Spark SQL
Integrates closely with Spark’s language APIs
c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access: SQL, Python, Scala, and Java APIs over Hive, Parquet, JSON, Cassandra, and more
44 contributors in the past year
13. Machine Learning Library (MLlib)
Standard library of machine learning algorithms
Now includes 15+ algorithms
• New in 1.0: decision trees, SVD, PCA, L-BFGS
• In development: non-negative matrix factorization, LDA,
Lanczos, multiclass trees, ADMM
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
40 contributors in the past year
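KMeans.train above runs k-means at cluster scale. As a rough illustration of the algorithm itself, here is a simplified single-machine sketch in plain Python; it is not MLlib's implementation, and the points and parameters are made up for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    # Plain k-means sketch: repeatedly assign each point to its nearest
    # center, then move each center to the mean of its assigned points.
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:  # keep the old center if no points were assigned to it
                centers[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centers

# Two well-separated groups of 2-D points: expect one center near each group.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers = sorted(kmeans(pts, 2))
```

MLlib distributes the assignment and averaging steps across the cluster, which is what makes the same idea practical on large datasets.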
14. Java 8 API
Enables concise programming in Java similar to
Scala and Python
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
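The same map/reduce dataflow can be sketched in plain Python, with a small in-memory list standing in for the file contents. This illustrates the pattern the lambdas express, not the distributed Spark API.

```python
from functools import reduce

# Hypothetical stand-in for the contents of data.txt
lines = ["spark", "is", "fast"]

# like lines.map(s -> s.length())
line_lengths = [len(s) for s in lines]

# like lineLengths.reduce((a, b) -> a + b)
total_length = reduce(lambda a, b: a + b, line_lengths)  # 5 + 2 + 4 = 11
```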
16. 1. Unified Platform for Big Data Apps
[Diagram: batch, interactive, and streaming workloads sharing one Spark API, running over Hadoop, Cassandra, Mesos, cloud providers, and more]
Uniform API for diverse workloads over diverse storage systems and runtimes
17. Why a Platform Matters
Good for developers: one system to learn
Good for users: take apps anywhere
Good for distributors: more applications
18. 2. Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this
[Diagram: Spark Core beneath ML and graph libraries, accessed from Python, Scala, Java, R, SQL, and more]
Much of future activity will be in these libraries
19. Databricks & Spark
At Databricks, we are working to keep Spark 100%
open source and compatible across vendors
All our work on Spark is at Apache
Check out project-specific talks to see what’s next!