Apache Spark Fundamentals

Eren Avşaroğulları

Data Science and Engineering Club Meetup
Dublin - December 9, 2017
Agenda

• What is Apache Spark?
• Spark Ecosystem & Terminology
• RDDs & Operation Types (Transformations & Actions)
• RDD Lineage
• Job Lifecycle
• RDD Evolution (DataFrames and DataSets)
• Persistency
• Clustering / Spark on YARN

[code icon] shows code samples
Bio

• B.Sc. & M.Sc. in Electronics & Control Engineering
• Apache Spark Contributor since v2.0.0
• Sr. Software Engineer @
• Currently works on Data Analytics: Data Transformations / Cleaning

erenavsarogullari
What is Apache Spark?

• Distributed Compute Engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) released in June 2012
• Moved to the Apache Software Foundation in 2013
• Supported Languages: Java, Scala, Python and R
• 1,100+ contributors / 14K+ forks on GitHub
• spark-packages.org => ~380 Extensions
Spark Ecosystem

[Stack diagram] Libraries: Spark SQL, Spark Streaming, MLlib and GraphX, all built on the Spark Core Engine.
Deployment: Cluster Mode (Standalone, YARN, Mesos) or Local Mode (Local).
Terminology

• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. RDD dependency graph).
• Application: An instance of SparkContext. Single per JVM.
• Job: An action operator triggering computation.
• Driver: The program/process running the job over the Spark engine.
• Executor: The process executing a task.
• Worker: The node running executors.
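A tiny sketch of how these terms relate at runtime. It assumes an already-created SparkContext named sc (for example, the one built in the RDD creation sketch later in the deck); the variable names are illustrative:

```scala
// One SparkContext instance = one application (single per JVM).
// Transformations only extend the DAG; each action submits one job,
// which the driver splits into stages and tasks that run on executors.
val numbers = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions => up to 4 parallel tasks
val evens   = numbers.filter(_ % 2 == 0)              // transformation: no computation yet
val count   = evens.count()                           // action => Job 0
val sum     = evens.reduce(_ + _)                     // action => Job 1
```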
  
	
  
How to create an RDD?

• Collection parallelize
• By loading a file
• Transformations

Let's see the sample => Application-1 (a minimal sketch follows below)
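A minimal sketch of the three creation paths above, run against a local SparkContext; the file path and variable names are illustrative, not taken from the talk's Application-1:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a local SparkContext (in spark-shell it is already available as sc)
val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
val sc   = new SparkContext(conf)

// 1. Collection parallelize: distribute a local collection across partitions
val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. By loading a file: each line becomes one element (path is illustrative)
val linesRdd = sc.textFile("data/people.txt")

// 3. Transformations: derive a new RDD from an existing one
val doubledRdd = numbersRdd.map(_ * 2)

println(doubledRdd.collect().mkString(", ")) // 2, 4, 6, 8, 10
```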
  
RDD Operation Types

Two types of Spark operations on RDDs:

• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value

[Flow diagram] Data => Transformations => RDD => ... => RDD => Actions => Value
Transformations

• map(func)
• flatMap(func)
• filter(func)
• union(dataset)
• join(dataset, usingColumns: Seq[String])
• intersect(dataset)
• coalesce(numPartitions)
• repartition(numPartitions)

Full list: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
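A small sketch of chaining a few of these transformations, continuing with the sc from the creation sketch above; because transformations are lazy, none of these lines trigger any computation:

```scala
val numbers  = sc.parallelize(1 to 10)
val evens    = numbers.filter(_ % 2 == 0)   // filter(func)
val scaled   = evens.map(_ * 10)            // map(func)
val combined = scaled.union(numbers)        // union(dataset)
val twoParts = combined.coalesce(2)         // coalesce(numPartitions)

// At this point Spark has only recorded a DAG of transformations;
// no data has been read or processed yet.
```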
  
	
  
Actions

• first()
• take(n)
• collect()
• count()
• saveAsTextFile(path)

Full list: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Let's see the sample => Application-2 (a short sketch follows below)
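A short sketch of actions forcing evaluation, continuing with the twoParts RDD from the transformations sketch above (the output path is illustrative):

```scala
println(twoParts.first())                  // first(): returns one element
println(twoParts.take(3).mkString(", "))   // take(n): up to n elements
val all   = twoParts.collect()             // collect(): whole RDD on the driver (use with care)
val total = twoParts.count()               // count(): number of elements

// Writes one part file per partition under the given directory
twoParts.saveAsTextFile("output/numbers")
```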
  
	
  
RDD Dependencies (Lineage)

[Lineage diagram: RDDs 1-7 linked by map, union, sort and join, grouped into Stage 0, Stage 1 and Stage 3. Narrow transformations keep RDDs within a stage; wide transformations require shuffles and mark stage boundaries.]
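The lineage can be inspected at runtime with toDebugString. A minimal sketch, reusing the sc from the earlier examples, that builds a small lineage containing both narrow and wide dependencies:

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // narrow: parallelize
val doubled = pairs.mapValues(_ * 2)                              // narrow: mapValues
val grouped = doubled.groupByKey()                                // wide: introduces a shuffle

// Prints the RDD dependency graph; indented sections correspond to separate stages
println(grouped.toDebugString)
```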
  
Job Lifecycle

[Job lifecycle diagram]
RDD Evolution

RDD (v1.0, 2011) => DataFrame (v1.3, 2013) => DataSet (v1.6, 2015)

• RDD: low-level data structure of Java objects; to work with unstructured data.
• DataFrame: untyped API; schema-based, tabular; SQL support; to work with semi-structured (csv, json) and structured (jdbc) data.
• DataSet: typed API [T]; tabular; SQL support; to work with semi-structured (csv, json) and structured (jdbc) data.

Project Tungsten + Catalyst Optimizer => two-tier optimizations.
How to create a DataFrame?

• By loading a file: spark.read.format("csv").load()
• SparkSession.createDataFrame(RDD, schema)

Let's see the code – Application-3 (a minimal sketch follows below)
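A minimal sketch of both creation paths, assuming a local SparkSession; the file path, column names and schema are illustrative, not taken from Application-3:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("dataframe-creation")
  .master("local[*]")
  .getOrCreate()

// 1. By loading a file
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("data/people.csv")

// 2. From an RDD of Rows plus an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowsRdd  = spark.sparkContext.parallelize(Seq(Row("Jane", 30), Row("John", 25)))
val peopleDf = spark.createDataFrame(rowsRdd, schema)

peopleDf.printSchema()
peopleDf.show()
```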
  
How to create a DataSet?

• By loading a file: spark.read.format("csv").load()
• SparkSession.createDataset(collection or RDD)

Let's see the code – Application-4-1 and Application-4-2 (a minimal sketch follows below)
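A minimal sketch, reusing the spark session from the DataFrame sketch above; Person is a hypothetical case class used to obtain a typed Dataset[Person], and the csv path/options are illustrative:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._   // encoders for case classes and primitive types

case class Person(name: String, age: Int)

// 1. By loading a file and converting it to a typed Dataset
val peopleDs: Dataset[Person] = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/people.csv")
  .as[Person]

// 2. From a local collection (or an RDD) via createDataset
val inlineDs: Dataset[Person] =
  spark.createDataset(Seq(Person("Jane", 30), Person("John", 25)))

inlineDs.filter(_.age > 26).show()
```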
  
	
  
Persistency

Storage Mode                       | Details
MEMORY_ONLY                        | Store RDD as deserialized Java objects in the JVM.
MEMORY_AND_DISK                    | Store RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are spilled to disk.
MEMORY_ONLY_SER                    | Store RDD as serialized Java objects (Kryo serialization can be considered).
MEMORY_AND_DISK_SER                | Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
DISK_ONLY                          | Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2   | Same as the levels above, but replicate each partition on two cluster nodes.

• RDD / DF.persist(newStorageLevel: StorageLevel)
• RDD.unpersist() => unpersists the RDD from memory and disk

Unpersist should be called explicitly in long-lived applications to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts cached blocks using a Least Recently Used (LRU) policy by default.
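A minimal sketch of persist/unpersist, reusing the peopleDf from the DataFrame sketch above; the storage level choice is illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// Cache as serialized records in memory, spilling to disk when memory is tight
peopleDf.persist(StorageLevel.MEMORY_AND_DISK_SER)

peopleDf.count()   // the first action materializes and caches the data
peopleDf.show()    // later actions read from the cache

// Release the cached blocks explicitly once they are no longer needed;
// otherwise eviction only happens under memory pressure (LRU by default)
peopleDf.unpersist()
```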
  
Clustering / Spark on YARN

[Diagram: YARN Client Mode]

Q & A

Thanks
References

• https://spark.apache.org/docs/latest/
• https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• High Performance Spark by Holden Karau & Rachel Warren
