© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices for Using
Apache Spark on AWS
Jonathan Fritz, Amazon EMR
Sr. Product Manager
August 11, 2016
Agenda
• Why Spark?
• Deploying Spark with Amazon EMR
• EMRFS and connectivity to AWS data stores
• Spark on YARN and DataFrames
• Spark security overview
Spark 2.0 available on Amazon EMR 5.0
Spark moves at interactive speed
[Diagram: DAG execution of an RDD job across Stages 1-3, with map, join, filter, and groupBy transformations over RDDs and cached partitions]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffles
Spark 2.0 – Performance Enhancements
• Second-generation Tungsten engine
• Whole-stage code generation to create optimized bytecode at runtime
• Improvements to the Catalyst optimizer for query performance
• New vectorized Parquet decoder to increase throughput
Spark components to match your use case
Spark speaks your language
Datasets and DataFrames
• Abstraction for selecting, filtering, aggregating, and plotting structured data
• Optimized for query execution using the Catalyst query planner (see the sketch below)
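For example, a minimal PySpark sketch of the DataFrame API (the S3 path and column names are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Hypothetical source path and columns
df = spark.read.json("s3://my-bucket/events/")
(df.select("user", "page", "duration")
   .filter(F.col("duration") > 10)
   .groupBy("page")
   .agg(F.avg("duration").alias("avg_duration"))
   .show())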
Datasets and DataFrames (Spark 2.0)
• Datasets
  • Distributed collection of data
  • Strong typing, ability to use lambda functions
  • Object-oriented operations (similar to the RDD API)
  • Optimized encoders that increase performance and minimize serialization/deserialization overhead
  • Compile-time type safety for more robust applications
• DataFrames
  • A Dataset organized into named columns
  • Represented as a Dataset of rows
DataFrames/Datasets/RDDs and Fault Tolerance
Spark tracks the transformations used to build them (their lineage) to recompute lost data. E.g.:

messages = sc.textFile(...).filter(lambda s: "ERROR" in s) \
                           .map(lambda s: s.split("\t")[2])

HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
Easily create DataFrames from many formats
• Convert existing RDDs or read from many data formats
• Additional libraries for Spark SQL Data Sources at spark-packages.org
Load data with the Spark SQL Data Sources API
• Connectors for many stores, including Amazon Redshift (see the sketch below)
• Additional libraries at spark-packages.org
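A minimal sketch of loading DataFrames through built-in and third-party data sources (paths, credentials, and table names are illustrative assumptions; the spark-redshift package must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasources").getOrCreate()

# Built-in sources ship with Spark
json_df = spark.read.json("s3://my-bucket/events/")
parquet_df = spark.read.parquet("s3://my-bucket/events-pq/")

# Third-party source from spark-packages.org (spark-redshift)
redshift_df = (spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
    .option("dbtable", "events")
    .option("tempdir", "s3n://my-bucket/tmp/")
    .load())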
Use DataFrames for machine learning
• Spark ML libraries (replacing MLlib) use the DataFrame API as input/output for models instead of RDDs
• Create ML pipelines with a variety of distributed algorithms
• Pipeline persistence to save and load models and full pipelines to Amazon S3 (see the sketch below)
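A minimal sketch of an ML pipeline with persistence to S3 (the training data, feature columns, and model path are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

# Tiny hypothetical training set
train_df = spark.createDataFrame(
    [(34.0, 120.0, 1.0), (22.0, 15.0, 0.0)],
    ["age", "spend", "label"])

assembler = VectorAssembler(inputCols=["age", "spend"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(train_df)

# Persist the fitted pipeline to S3 and load it back later
model.save("s3://my-bucket/models/churn-v1")
reloaded = PipelineModel.load("s3://my-bucket/models/churn-v1")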
Spark 2.0 – ML Updates
• Additional distributed algorithms in SparkR, including K-Means, Generalized Linear Models, and Naive Bayes
• ML pipeline persistence is now supported across all languages
DataFrames on streaming data (Spark 1.x)
• Access data in a Spark Streaming DStream
• Create a SQLContext on the SparkContext used for the Spark Streaming application for ad hoc queries
• Incorporate DataFrames in a Spark Streaming application
• Checkpoint streaming jobs for disaster recovery
Spark 2.0 – Structured Streaming
• The Structured Streaming API is an extension to the DataFrame/Dataset API (instead of DStreams)
• SparkSession is the new entry point for streaming (see the sketch below)
• Better merges processing of static and streaming datasets, abstracting away the velocity of the data
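A minimal Structured Streaming sketch (the API is still new in Spark 2.0; the S3 paths are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

# Treat a directory of files as an unbounded table
lines = spark.readStream.text("s3://my-bucket/incoming/")
errors = lines.filter(lines.value.contains("ERROR"))

# Continuously write matching rows to Parquet, checkpointing progress
query = (errors.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://my-bucket/errors/")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/errors/")
    .start())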
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR shell (SparkR support in Zeppelin 0.6.1 coming soon to EMR)
• Comparable performance to Python and Scala DataFrames
Spark SQL (Spark 2.0)
• SparkSession replaces the old SQLContext and HiveContext (see the sketch below)
• Seamlessly mix SQL with Spark programs
• ANSI SQL parser and subquery support
• HiveQL compatibility; can directly use tables in the Hive metastore
• Connect through JDBC/ODBC using the Spark Thrift server
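A minimal sketch of the new entry point (the Hive table name is an illustrative assumption):

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL read tables from the Hive metastore
spark = (SparkSession.builder
    .appName("spark-sql-demo")
    .enableHiveSupport()
    .getOrCreate())

hits = spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
hits.createOrReplaceTempView("page_hits")  # mix SQL and DataFrame code freely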
Creating Spark Clusters with Amazon EMR
Focus on deriving insights from your data instead of manually configuring clusters
• Easy to install and configure Spark
• Secured
• Submit jobs via spark-submit, Oozie, or the Zeppelin UI
• Quickly add and remove capacity
• Hourly, reserved, or EC2 Spot pricing
• Use S3 to decouple compute and storage
Create a fully configured cluster with the latest version of Spark in minutes, using:
• AWS Management Console
• AWS Command Line Interface (CLI)
• An AWS SDK directly with the Amazon EMR API, as in the sketch below
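A minimal boto3 sketch of launching a Spark cluster through the EMR API (the cluster name, instance types, counts, and roles are illustrative assumptions):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo",                      # hypothetical cluster name
    ReleaseLabel="emr-5.0.0",               # EMR 5.0 ships Spark 2.0
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])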
Choice of multiple instances
Options to Submit Spark Jobs – Off Cluster
• Amazon EMR Step API: submit a Spark application to your cluster (see the sketch below)
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
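A minimal boto3 sketch of submitting a Spark application through the EMR Step API (the cluster ID and S3 path are illustrative assumptions):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",            # hypothetical cluster ID
    Steps=[{
        "Name": "spark-app",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",    # runs spark-submit on the master
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/app.py"],
        },
    }],
)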
Options to Submit Spark Jobs – On Cluster
• Web UIs: Zeppelin notebooks, RStudio, and more!
• Connect with ODBC/JDBC using the Spark Thrift server (start it with start-thriftserver.sh)
• Use Spark Actions in your Apache Oozie workflow to create DAGs of Spark jobs
• Other:
  • Use the Spark Job Server for a REST interface and shared DataFrames across jobs
  • Use the Spark shell on your cluster
Monitoring and Debugging
• Log pushing to S3
  • Logs produced by driver and executors on each node
  • Can browse through log folders in the EMR console
• Spark UI
  • Job performance, task breakdown of jobs, information about cached DataFrames, and more
• Ganglia monitoring
• CloudWatch metrics in the EMR console
DEMO
Some of our customers running Spark on EMR
Amazon.com Personalization Team Using Spark + DSSTNE
http://blogs.aws.amazon.com/bigdata/
Using Amazon S3 as persistent storage for Spark
Decouple compute and storage by using S3 as your data layer
[Diagram: multiple Amazon EMR clusters on EC2 instances reading and writing Amazon S3, with intermediates stored in memory, on local disk, or in HDFS]
• S3 is designed for eleven 9s of durability and is massively scalable
EMR File System (EMRFS)
• S3 connector for EMR (implements the Hadoop FileSystem interface)
• Improved performance and error handling options
• Transparent to applications: just read/write to "s3://"
• Consistent view feature for consistent listing
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
Partitions, compression, and file formats
• Avoid key names in lexicographical order
  • Improves throughput and S3 list performance
  • Use hashing/random prefixes or reverse the date-time
• Compress data sets to minimize bandwidth from S3 to EC2
  • Make sure you use splittable compression, or have each file be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads (see the sketch below)
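A minimal sketch of writing compressed, partitioned, columnar output (the source path, partition column, and partition count are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.read.json("s3://my-bucket/events/")   # hypothetical source

(df.repartition(64)                     # match file count to cluster parallelism
   .write
   .option("compression", "snappy")     # compressed per block, so still splittable
   .partitionBy("dt")                   # "dt" is a hypothetical date column
   .parquet("s3://my-bucket/events-pq/"))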
Use RDS for an external Hive metastore
• Host the Hive metastore, with the schema for tables in S3, on Amazon RDS
• Set the metastore location in hive-site (see the sketch below)
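A minimal sketch of a hive-site classification pointing EMR at an external metastore, passed in the Configurations parameter to run_job_flow (the RDS endpoint and credentials are illustrative assumptions):

hive_site = {
    "Classification": "hive-site",
    "Properties": {
        # Hypothetical RDS MySQL endpoint and credentials
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://mydb.xxxx.us-east-1.rds.amazonaws.com:3306/hive",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "secret",
    },
}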
Using Spark with other data stores in AWS
Many storage layers to choose from
• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC Data Source with Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Spark-Redshift connector
• Amazon S3: EMR File System (EMRFS)
Spark architecture
• Run the Spark driver in client or cluster mode
• A Spark application runs as a YARN application
• The SparkContext runs as a library in your program, one instance per Spark application
• Spark executors run in YARN containers on NodeManagers in your cluster
Amazon EMR runs Spark on YARN
• Dynamically share and centrally configure the same pool of cluster resources across engines
• Schedulers for categorizing, isolating, and prioritizing workloads
• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)
• Kerberos authentication
[Diagram: applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL) running on batch MapReduce and in-memory Spark, with YARN for cluster resource management over S3 and HDFS storage]
YARN Schedulers - CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
  • A single queue is set by default
  • Can create additional queues for workloads based on multitenancy requirements
• Capacity guarantees
  • Set minimum resources for each queue
  • Programmatically assign free resources to queues
• Adjust these settings using the capacity-scheduler classification in an EMR configuration object, as in the sketch below
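A minimal sketch of a capacity-scheduler classification defining two queues (the queue names and capacity splits are illustrative assumptions):

capacity_scheduler = {
    "Classification": "capacity-scheduler",
    "Properties": {
        "yarn.scheduler.capacity.root.queues": "prod,adhoc",   # hypothetical queues
        "yarn.scheduler.capacity.root.prod.capacity": "70",    # minimum share (%)
        "yarn.scheduler.capacity.root.adhoc.capacity": "30",
    },
}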
What is a Spark Executor?
• Processes that store data and run tasks for your Spark application
• Specific to a single Spark application (no shared executors across applications)
• Executors run in YARN containers managed by YARN NodeManager daemons
Inside a Spark Executor on YARN
• Max container size on a node: yarn.nodemanager.resource.memory-mb (classification: yarn-site)
• The executor container consists of:
  • Memory overhead: spark.yarn.executor.memoryOverhead (classification: spark-defaults)
  • Spark executor memory: spark.executor.memory (classification: spark-defaults), of which the fraction used for execution and cache is spark.memory.fraction (classification: spark-defaults)
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
Properties Related to Dynamic Allocation

Property                                                  Value
spark.dynamicAllocation.enabled                           true
spark.shuffle.service.enabled                             true
spark.dynamicAllocation.minExecutors                      5 (optional)
spark.dynamicAllocation.maxExecutors                      17 (optional)
spark.dynamicAllocation.initialExecutors                  0 (optional)
spark.dynamicAllocation.executorIdleTimeout               60s (optional)
spark.dynamicAllocation.schedulerBacklogTimeout           5s (optional)
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout  5s (optional)
Easily override spark-defaults
Configuration object (entered in the EMR console or passed via the CLI/API):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "15g",
      "spark.executor.cores": "4"
    }
  }
]

Configuration precedence: (1) SparkConf object, (2) flags passed to spark-submit, (3) spark-defaults.conf
When to set executor configuration
• Need to fit larger partitions in memory
• GC is too high (though this is being resolved through work in Project Tungsten)
• Long-running, single-tenant Spark applications
• Static executors recommended for Spark Streaming
• Could be good for multitenancy, depending on the YARN scheduler being used
More Options for Executor Configuration
• When creating your cluster, specify maximizeResourceAllocation to create one large executor per node; Spark will use all of the executors for each application submitted
• Adjust the Spark executor settings using an EMR configuration object when creating your cluster
• Pass in configuration overrides when running your Spark application with spark-submit, as in the sketch below
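A minimal sketch of per-application overrides passed through spark-submit in an EMR step (the values and S3 path are illustrative assumptions):

step = {
    "Name": "tuned-spark-app",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--conf", "spark.executor.memory=10g",   # hypothetical override
            "--conf", "spark.executor.cores=2",
            "s3://my-bucket/jobs/app.py",
        ],
    },
}
# Pass [step] to add_job_flow_steps, as in the earlier Step API sketch.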
DataFrame optimizations
• Use columnar formats like Parquet to scan less data
• More partitions give you more parallelism
  • Automatic partition discovery when using Parquet
  • Can repartition a DataFrame
  • You can also adjust parallelism with spark.default.parallelism
• Data is serialized when cached or shuffled
  • Use Kryo serialization (up to 10x faster than Java serialization)
  • Does not support all Serializable types
  • Register the class in advance, as in the sketch below
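A minimal sketch of enabling Kryo and registering classes up front (the class names and path are illustrative assumptions):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Hypothetical application classes to register with Kryo in advance
    .set("spark.kryo.classesToRegister", "com.example.Click,com.example.User"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Kryo applies when Spark serializes JVM objects, e.g. during shuffles
df = spark.read.parquet("s3://my-bucket/events-pq/").repartition(200)
df.cache()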
Spark Security on Amazon EMR
VPC private subnets to isolate the network
• Use Amazon S3 endpoints for connectivity to S3
• Use Managed NAT for connectivity to other services or the Internet
• Control the traffic using Security Groups
  • ElasticMapReduce-Master-Private
  • ElasticMapReduce-Slave-Private
  • ElasticMapReduce-ServiceAccess
Spark on EMR security overview
• Encryption at rest
  • HDFS transparent encryption (AES-256)
  • Local disk encryption for temporary files using LUKS encryption
  • EMRFS support for Amazon S3 client-side and server-side encryption
• Encryption in flight
  • Secure communication with SSL from S3 to EC2 (nodes of the cluster)
  • HDFS blocks encrypted in transit when using HDFS encryption
  • SASL encryption (DIGEST-MD5) for the Spark shuffle
• Permissions: IAM roles, Kerberos, and IAM users
• Access: VPC private subnet support, Security Groups, and SSH keys
• Auditing: AWS CloudTrail and S3 object-level auditing
Thank you!
Jonathan Fritz - jonfritz@amazon.com
https://aws.amazon.com/emr/spark
http://blogs.aws.amazon.com/bigdata