
Spark Interview


What is Spark?

Spark is a scheduling, monitoring and distributing engine for big data. It is a cluster computing platform designed
to be fast and general purpose. Spark extends the popular MapReduce model. One of the main features Spark
offers for speed is the ability to run computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.

What is Standalone mode?

In standalone mode, Spark uses a Master daemon which coordinates the efforts of the Workers, which run the
executors. Standalone mode is the default, but it cannot be used on secure clusters. When you submit an
application, you can choose how much memory its executors will use, as well as the total number of cores
across all executors.

What is YARN mode?

In YARN mode, the YARN ResourceManager performs the functions of the Spark Master. The functions of the
Workers are performed by the YARN NodeManager daemons, which run the executors. YARN mode is slightly
more complex to set up, but it supports security.

What are client mode and cluster mode?

Each application has a driver process which coordinates its execution. This process can run in the foreground
(client mode) or in the background (cluster mode). Client mode is a little simpler, but cluster mode allows you to
easily log out after starting a Spark application without terminating the application.

What do you understand by Lazy Evaluation?

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given
dataset, it heeds the instructions and makes a note of them, so that it does not forget - but it does nothing unless
asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed
immediately. Transformations in Spark are not evaluated till you perform an action. This helps optimize the
overall data processing workflow.
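
A minimal sketch of lazy evaluation (assuming a SparkContext sc and an illustrative file path):

val lines = sc.textFile("/path/to/data.txt")       // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))     // transformation is only recorded, still nothing executed
errors.count()                                     // action: only now is the file read and the filter applied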

What are benefits of Spark over MapReduce?

• Due to the availability of in-memory processing, Spark executes processing around 10-100x faster than
Hadoop MapReduce, which uses persistent storage for all of its data processing tasks.
• Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, such as batch
processing, streaming, machine learning and interactive SQL queries. Hadoop, by contrast, only supports batch
processing.
• Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
• Spark is capable of performing computations multiple times on the same dataset. This is called iterative
computation; there is no iterative computing implemented by Hadoop.

What are Partitions?

A partition, also known as a 'split' in HDFS, is a logical chunk of a data set, which may be in the range of terabytes
or petabytes and is distributed across the cluster.
By default, Spark creates one partition for each block of the file (for HDFS).
The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2), and so is the split
size.
However, one can explicitly specify the number of partitions to be created.
Partitions are basically used to speed up data processing.


Example1:

Number of partitions is not specified:


val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Example2:

The following code creates an RDD with 10 partitions, since we specify the number of partitions.

val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

One can query the number of partitions in the following way:

rdd1.partitions.length

OR

rdd1.getNumPartitions

The best-case scenario is to create RDDs in the following way:

number of cores in the cluster = number of partitions

PARTITIONER :

# An object that defines how the elements in a key-value pair RDD are partitioned by key. It maps each key to a
partition ID, from 0 to (number of partitions - 1).
# A partitioner captures the data distribution at the output. A scheduler can optimize future operations based on
the type of partitioner (i.e. if we perform an operation, say a transformation or action, that requires shuffling
across nodes, we may need a partitioner; see the reduceByKey() transformation).

# Basically there are three types of partitioners in Spark :

(1) Hash-Partitioner (2) Range-Partitioner (3) Custom Partitioner (one can define their own)
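
As a rough illustration of the Hash-Partitioner (the sample data and partition count are arbitrary):

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val partitioned = pairs.partitionBy(new HashPartitioner(4))  // each key is mapped to a partition ID based on its hash
partitioned.partitioner                                      // Some(org.apache.spark.HashPartitioner@...)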

Property Name : spark.default.parallelism

Default Value : For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a
parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
• Local mode: number of cores on the local machine
• Mesos fine-grained mode: 8
• Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning : Default number of partitions in RDDs returned by transformations like join

What is an RDD?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data coming
into the system in object format. RDDs are used for in-memory computations on large clusters, in a fault-tolerant
manner. RDDs are read-only, partitioned collections of records that are –
Immutable – RDDs cannot be altered.
Resilient – If a node holding the partition fails, another node takes over the data.

What are the major operations support on RDD?


• Transformations
• Actions
What do you understand by Transformations in Spark?

Transformations are lazily evaluated operations on RDDs that create one or many new RDDs, e.g. map, filter,
reduceByKey, join, cogroup, randomSplit. Transformations are functions which take an RDD as the input and
produce one or many RDDs as output. They don't change the input RDD, as RDDs are immutable and hence
cannot be changed or modified, but always produce new RDDs by applying computations on them. By applying
transformations you incrementally build an RDD lineage with all the ancestor RDDs of the final RDD(s).

Transformations are lazy, i.e. they are not executed immediately. Transformations are executed only when
actions are called. After executing a transformation, the resulting RDD(s) will always be different from their
ancestor RDDs and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian), the
same size (e.g. map), or they can vary in size.
RDDs allow you to create dependencies between RDDs. Dependencies are the steps for producing results, i.e. a
program. Each RDD in the lineage chain (string of dependencies) has a function for operating on its data and a
pointer (dependency) to its ancestor RDD. Spark will divide RDD dependencies into stages and tasks and then
send those to workers for execution.
Or

Transformations are functions applied on an RDD, resulting in another RDD. They do not execute until an action
occurs. map() and filter() are examples of transformations, where the former applies the function passed to it on
each element of the RDD and results in another RDD. The filter() creates a new RDD by selecting elements from
the current RDD that pass the function argument.
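
A small sketch of both transformations (the values are illustrative):

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)         // transformation: nothing computed yet
val bigOnes = doubled.filter(_ > 4)   // transformation: keeps elements greater than 4
bigOnes.collect()                     // action: Array(6, 8, 10)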

Or
In order to identify the type of operation, one needs to look at its return type.
If the operation returns a new RDD, it is a 'Transformation'.
If the operation returns any type other than an RDD, it is an 'Action'.

Hence,
a Transformation constructs a new RDD from an existing (previous) one, while an Action computes the result
based on the applied transformations and either returns the result to the driver program or saves it to external
storage.

Explain an action with one example

An action helps in bringing back the data from an RDD to the local machine.

An action's execution is the result of all previously created transformations. reduce() is an action that
applies the function passed to it again and again until one value is left. take(n) is an action that returns the first
n values from the RDD to the local node.

Actions are RDD operations whose values are returned back to the Spark driver program, which kicks off a job to
execute on a cluster. A transformation's output is an action's input. reduce, collect, takeSample, take, first,
saveAsTextFile, saveAsSequenceFile, countByKey and foreach are common actions in Apache Spark.
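
A quick sketch of two of the actions mentioned above:

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
nums.reduce(_ + _)   // Int = 15 - the function is applied until one value is left
nums.take(3)         // Array(1, 2, 3) - the first 3 elements returned to the driver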

What is RDD Lineage?

Spark does not support data replication in memory; if any data is lost, it is rebuilt using RDD lineage. RDD
lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to
build itself from other datasets.
Or
As you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies
between different RDDs, called the lineage graph. It uses this information to compute each RDD on demand
and to recover lost data if part of a persistent RDD is lost.
Or
The lineage graph is the graph of all the parent RDDs of an RDD.
Applying different transformations on an RDD results in a lineage graph.
When one derives a new RDD from an existing (previous) RDD using a transformation, Spark keeps track of all
the dependencies between the RDDs; this is called the lineage graph.
The lineage graph is useful for the scenarios mentioned below:
(1) When there is a demand to compute a new RDD.
(2) To recover lost data if part of a persisted RDD is lost.

In other words, the lineage graph is a graph of all the transformation operations that need to be executed when an
action operation is called.

How does Spark achieve fault tolerance?

Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault
tolerance whereas Spark uses a different data storage model, RDDs. RDDs achieve fault tolerance through the
notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that
partition. This removes the need for replication to achieve fault tolerance.

How to create RDD?

Spark provides two methods to create RDD:


• By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method

val IntellipaatData = Array(2,4,6,8,10)


val distIntellipaatData = sc.parallelize(IntellipaatData)

• By loading an external dataset from external storage like HDFS, HBase, shared file system
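
For example, loading an external dataset might look like this (the path and cluster address are placeholders):

val lines = sc.textFile("hdfs://namenode:8020/user/data/input.txt")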

What do you understand by Pair RDD?

Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to
as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that
collects data based on each key and a join() method that combines different RDDs together, based on the
elements having the same key.
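
A rough sketch of both methods (the sample data is illustrative; result order may vary):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
pairs.join(other).collect()          // Array((a,(1,x)), (a,(3,x)), (b,(2,y)))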

How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function

What is persist()?

Spark's RDDs are by default recomputed each time you run an action on them. If you would like to reuse an
RDD in multiple actions, you can ask Spark to persist it using RDD.persist(). After computing it the first time,
Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them
in future actions. Persisting RDDs on disk instead of memory is also possible.
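
A minimal sketch of persisting an RDD (the file path is a placeholder):

import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("/path/to/data.txt")
rdd.persist(StorageLevel.MEMORY_ONLY)   // equivalent to rdd.cache()
rdd.count()                             // first action computes the RDD and caches it
rdd.count()                             // later actions reuse the cached partitions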

What is the difference between persist() and cache()


There are two methods to persist the data: persist() and cache(). Different storage level options exist, such as
MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more. Both persist() and cache() use different
options depending on the task.
persist() allows the user to specify the storage level, whereas cache() uses the default storage level
(MEMORY_ONLY).

What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often
suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various
persistence levels to store the RDDs on disk or in memory, or as a combination of both with different replication
levels.

The various storage/persistence levels in Spark are -


MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP

Does Apache Spark provide check pointing?

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the
RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist.
However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the
lineage graphs are long and have wide dependencies.

How can you achieve high availability in Apache Spark?
• Implementing single-node recovery with the local file system
• Using Standby Masters with Apache ZooKeeper

Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage.
An RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost due
to failure, lineage helps rebuild only that particular lost partition.

Explain about the core components of a distributed Spark application.

Driver – The process that runs the main() method of the program to create RDDs and perform transformations
and actions on them.
Executor – The worker processes that run the individual tasks of a Spark job.
Cluster Manager – A pluggable component in Spark, used to launch Executors and Drivers. The cluster manager
allows Spark to run on top of other external managers like Apache Mesos or YARN.

What is the use of Spark driver, where it gets executed on the cluster ?

The Spark driver is the program that defines transformations and actions on RDDs. It runs on the master node.
The Spark driver creates the SparkContext, connected to a given Spark Master. The driver also delivers the RDD
graphs to the Master, where the standalone cluster manager runs.

Or

The Spark driver is the program that defines the transformations and actions on RDDs of data and
submits requests to the master. The Spark driver is a program that runs on the master node of the machine and
declares transformations and actions on data RDDs.

In easy terms, the driver in Spark creates the SparkContext, connected to a given Spark Master. It also delivers
the RDD graphs to the Master, where the standalone cluster manager runs.
Or
The driver program is responsible for launching various parallel operations on the cluster.
The driver program contains the application's main() function.
It is the process running the user code, which in turn creates the SparkContext object, creates RDDs and
performs transformation and action operations on them.
The driver program accesses Spark through a SparkContext object, which represents a connection to the computing
cluster (from Spark 2.0 onwards we can access the SparkContext object through the SparkSession).
The driver program is responsible for converting the user program into units of physical execution called tasks.
It also defines distributed datasets on the cluster, and we can apply different operations on those datasets
(transformation and action).
The Spark program creates a logical plan called a Directed Acyclic Graph (DAG), which is converted into a physical
execution plan by the driver when the driver program runs.

What is Spark Executor?

When the SparkContext connects to a cluster manager, it acquires Executors on nodes in the cluster. Executors
are Spark processes that run computations and store the data on the worker nodes. The final tasks from the
SparkContext are transferred to executors for their execution.
Or
Spark executors are worker processes responsible for running the individual tasks in a given Spark job.
Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of
an application. Executors have two roles. First, they run the tasks that make up the application and return
results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs.

Or

Every Spark application has the same fixed heap size and fixed number of cores for a Spark executor. The heap
size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory
property or the --executor-memory flag. Every Spark application will have one executor on each worker node.
The executor memory is basically a measure of how much memory of the worker node the application will
utilize.

What is Executor memory?

You can configure this using the --executor-memory argument to spark-submit. Each application will have at
most one executor on each worker, so this setting controls how much of that worker's memory the application
will claim. By default, this setting is 1 GB, so you will likely want to increase it on most servers.

How to define partition in a spark?


A partition is a smaller and logical division of data, similar to a 'split' in MapReduce. Partitioning is the process of
deriving logical units of data to speed up processing. Everything in Spark is a partitioned RDD.

Spark uses the map-reduce API to partition the data. In the input format we can create a number of partitions. By
default the HDFS block size is the partition size (for best performance), but it is possible to change the partition
size, like a split.

How can we split single HDFS block into partitions RDD?

When we create the RDD from a file stored in HDFS


data = context.textFile("/user/dastu/myfile.txt")

by default one partition is created for one block, i.e. if we have a file of size 1280 MB (with a 128 MB block size)
there will be 10 HDFS blocks, hence a similar number of partitions (10) will be created.

If you want to create more partitions than the number of blocks, you can specify that while creating the RDD:
data = context.textFile("/user/dataflair/file-name", 20)

It will create 20 partitions for the file, i.e. for each block 2 partitions will be created.

NOTE: it is often recommended to have more partitions than blocks, as it improves performance.

Define the functions of SparkCore.

Serving as the base engine, SparkCore performs various important functions like memory management, monitoring
jobs, fault tolerance, job scheduling and interaction with storage systems.

How does Spark store the data?

Spark is a processing engine; there is no storage engine. It can retrieve data from any storage engine like
HDFS, S3 and other data sources.

How to execute Hive on Spark?

Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.

What file systems Spark support?

• Hadoop Distributed File System (HDFS)


• Local File system
What is a Parquet file?

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both
read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

List the functions of Spark SQL.

Spark SQL is capable of:


• Loading data from a variety of structured sources
• Querying data using SQL statements, both inside a Spark program and from external tools that connect to
Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence
tools like Tableau
• Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs
and SQL tables, expose custom functions in SQL, and more

Name types of Cluster Managers in Spark.

The Spark framework supports three major types of Cluster Managers:


• Standalone: a basic manager to set up a cluster
• Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other
applications
• Yarn: responsible for resource management in Hadoop

How to run spark in Standalone cluster mode?

spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10

How to run spark in YARN client mode?

spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode client \
--master yarn \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10

How to run spark in YARN cluster mode?

spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master yarn \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10

How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable
manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

Using Broadcast Variables – Broadcast variables enhance the efficiency of joins between small and large
RDDs.
Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.

What is Broadcast Variables?


Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a
copy of it with tasks. Spark supports two types of shared variables: broadcast variables (like the Hadoop
distributed cache) and accumulators (like Hadoop counters). Broadcast variables are stored as Array Buffers, which
send read-only values to the worker nodes.
Spark's second type of shared variable, broadcast variables, allows the program to efficiently send a large,
read-only value to all the worker nodes for use in one or more Spark operations. They come in handy, for
example, if your application needs to send a large, read-only lookup table to all the nodes.
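
A small sketch of a broadcast lookup table (the table contents are made up for illustration):

val countryNames = Map("es" -> "Spain", "fr" -> "France")   // hypothetical small lookup table
val bcNames = sc.broadcast(countryNames)
val codes = sc.parallelize(Seq("es", "fr", "es"))
codes.map(code => bcNames.value.getOrElse(code, "unknown")).collect()
// Array(Spain, France, Spain)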

What are Accumulators?


Spark provides off-line debuggers called accumulators. Spark accumulators are similar to Hadoop counters; you
can use accumulators to count the number of events and what's happening during a job. Only the driver program
can read an accumulator's value, not the tasks.

Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.
One of the most common uses of accumulators is to count events that occur during job execution for debugging
purposes.
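
A minimal sketch counting blank lines with an accumulator (uses the Spark 2.x longAccumulator API; older versions use sc.accumulator(0); the file path is a placeholder):

val blankLines = sc.longAccumulator("blankLines")
val lines = sc.textFile("/path/to/data.txt")
lines.foreach(line => if (line.trim.isEmpty) blankLines.add(1))
println(blankLines.value)   // only the driver can read the accumulator's value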

Why is there a need for broadcast variables when working with Apache Spark?
These are read-only variables, present in an in-memory cache on every machine. When working with Spark, the
use of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be
processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances
retrieval efficiency when compared to an RDD lookup().

What are the steps that occur when you run a Spark application on a cluster?

The user submits an application using spark-submit.

Spark-submit launches the driver program and invokes the main() method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD actions and transformations in the
program, the driver sends work to executors in the form of tasks.
Tasks are run on executor processes to compute and save results.
If the driver's main() method exits or it calls SparkContext.stop(), it will terminate the executors and release
resources from the cluster manager.

Or

In the real world, Spark operates in a master/slave fashion with one central coordinator and many distributed
workers.
The central coordinator is called the 'driver', while a distributed worker is called an 'executor'.
The driver communicates with a large number of executors.
The driver program runs in its own Java process, while each executor runs in its own Java process.
The driver and executors together are known as the 'Spark application'.
The Spark application is launched on a cluster using a cluster manager.
Spark has its own in-built cluster manager called the Standalone Cluster Manager.
However, one can also run Spark on two popular open source cluster managers: Hadoop YARN and Apache
Mesos.

Spark Driver --> Cluster Manager (Standalone, YARN, Mesos) --> Worker (executor)
In reality, there are many workers below the Cluster Manager, but for simplicity just one executor is shown.

What is a schema RDD/DataFrame?


A SchemaRDD is an RDD composed of Row objects with additional schema information of the types in each
column. Row objects are just wrappers around arrays of basic types (e.g., integers and strings).

What are Row objects?

Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields. Row objects
have a number of getter functions to obtain the value of each field given its index. The standard getter, get (or
apply in Scala), takes a column number and returns an Object type (or Any in Scala) that we are responsible for
casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short, and String, there is a getType()
method, which returns that type. For example, getString(0) would return field 0 as a string.
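
A hedged sketch of these getters (assuming a DataFrame df whose first column is a String and second an Int):

val row = df.first()          // an org.apache.spark.sql.Row
val name = row.getString(0)   // field 0 as a String
val age  = row.getInt(1)      // field 1 as an Int
val raw  = row.get(0)         // generic getter: returns Any, caller must cast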

Differentiate between describe and describe extended.


Describe database/schema- This query displays the name of the database, the root location on the file system
and comments if any.
Describe extended database/schema- Gives the details of the database or schema in a detailed manner.

How Spark Streaming works?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed
by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level
abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can
be created either from input data streams from sources such as Kafka and Flume, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Explain Spark Streaming Architecture?

Spark Streaming uses a “micro-batch” architecture, where Spark Streaming receives data from various input
sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of
each time interval a new batch is created, and any data that arrives during that interval gets added to that
batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a
parameter called the batch interval. Each input batch forms an RDD, and is processed using Spark jobs to
create other RDDs. The processed results can then be pushed out to external systems in batches.

What are DStreams?


Much like Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or
discretized streams. A DStream is a sequence of data arriving over time. Internally, each DStream is
represented as a sequence of RDDs arriving at each time step. DStreams can be created from various input
sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which
yield a new DStream, and output operations, which write data to an external system.
Or
A Discretized Stream is a sequence of Resilient Distributed Datasets that represent a stream of data. DStreams
can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two
operations –

Transformations that produce a new DStream.


Output operations that write data to an external system.

Or

As Spark Core is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or
discretized streams.
A DStream is a sequence of data arriving over time.
Each DStream is represented as a sequence of RDDs arriving at repeated / configured time steps.
DStreams can be created from various input sources like TCP sockets, Kafka, Flume, HDFS etc.
DStreams offer two types of operation: transformations, which generate another DStream, and output operations,
which write the data to an external system.
One can perform the basic operations of RDDs over a DStream, in addition to new operations related to time,
like sliding windows, since DStreams are derived from RDDs.
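
A minimal streaming word count sketch (the host, port and batch interval are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)   // input DStream from a TCP socket
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // transformations
counts.print()                                        // output operation
ssc.start()
ssc.awaitTermination()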

Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon (now Alluxio)

What are the types of Transformations on DStreams?

In stateless transformations the processing of each batch does not depend on the data of its previous
batches. They include the common RDD transformations like map(), filter(), and reduceByKey().

Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the
results of the current batch. They include transformations based on sliding windows and on tracking state
across time.

What is Receiver in Spark Streaming?

Every input DStream is associated with a Receiver object which receives the data from a source and stores it in
Spark’s memory for processing.
What is the significance of Sliding Window operation?
The Spark Streaming library provides windowed computations, where the transformations on RDDs are applied
over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are
combined and operated upon to produce new RDDs of the windowed DStream.

How Spark handles monitoring and logging in Standalone mode?


Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and
job statistics. The log output for each job is written to the work directory of the slave nodes.

Is it possible to overwrite Hadoop MapReduce configuration in Hive?


Yes, Hadoop MapReduce configuration can be overwritten by changing the Hive conf settings file.

Explain API createOrReplaceTempView()


It is a basic Dataset function.
It is under org.apache.spark.sql

def createOrReplaceTempView(viewName: String): Unit


Creates a temporary view using the given name.
The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
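
A short usage sketch (assuming a SparkSession spark and a Dataset/DataFrame df already exist):

df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT * FROM people WHERE age > 18")
adults.show()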

Explain values() operation


It returns an RDD of the values only.
Values can appear more than once in the data set (duplicates are kept).
val rdd1 = sc.parallelize(Seq((2,4),(3,6),(4,8),(2,6),(4,12),(5,10),(5,40),(10,40)))
val rdd2 = rdd1.keys
rdd2.collect
Output :
Array[Int] = Array(2, 3, 4, 2, 4, 5, 5, 10)

val rdd3 = rdd1.values


rdd3.collect
Output :
Array[Int] = Array(4, 6, 8, 6, 12, 10, 40, 40)

Why do we need compression, and what are the different compression formats supported?

In Big Data, using compression saves storage space and reduces network overhead.
One can specify a compression codec while writing data to HDFS (Hadoop format).
One can also read compressed data; for that too we use a compression codec.
The following compression formats are supported in Big Data:
* gzip
* lzo
* bzip2
* zlib
* Snappy

textFile vs wholeTextFiles in Spark?

textFile() :

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]


Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system
URI, and return it as an RDD of Strings.
For example, sc.textFile("/home/hdadmin/wc-data.txt") will create an RDD in which each individual line is an
element.

wholeTextFiles() :

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]


Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported
file system URI.
Rather than creating a basic RDD, wholeTextFiles() returns a pair RDD.
For example, if you have a few files in a directory, then by using the wholeTextFiles() method,
it creates a pair RDD with the filename with path as the key,
and the value being the whole file as a string.

val myfilerdd = sc.wholeTextFiles("/home/hdadmin/MyFiles")


val keyrdd = myfilerdd.keys
keyrdd.collect
val filerdd = myfilerdd.values
filerdd.collect

Explain cogroup() operation

It's a transformation.
It's in package org.apache.spark.rdd.PairRDDFunctions

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K,
(Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

For each key k in this or other1 or other2 or other3, return a resulting RDD that contains a tuple with the list of
values for that key in this, other1, other2 and other3.

val myrdd1 = sc.parallelize(List((1,"spark"),(2,"HDFS"),(3,"Hive"),(4,"Flink"),(6,"HBase")))


val myrdd2 = sc.parallelize(List((4,"RealTime"),(5,"Kafka"),(6,"NOSQL"),(1,"stream"),(1,"MLlib")))
val result = myrdd1.cogroup(myrdd2)
result.collect

Output :
Array[(Int, (Iterable[String], Iterable[String]))] =
Array((4,(CompactBuffer(Flink),CompactBuffer(RealTime))),
(1,(CompactBuffer(spark),CompactBuffer(stream, MLlib))),
(6,(CompactBuffer(HBase),CompactBuffer(NOSQL))),
(3,(CompactBuffer(Hive),CompactBuffer())),
(5,(CompactBuffer(),CompactBuffer(Kafka))),
(2,(CompactBuffer(HDFS),CompactBuffer())))

Explain pipe() operation


def pipe(command: String): RDD[String]
Return an RDD created by piping elements to a forked external process.

> In general, Spark programs are written in Scala, Java or Python. However, if that is not enough and one wants to
pipe (inject) data through code written in another language, such as R, Spark provides a general mechanism in
the form of the pipe() method.

> Spark provides the pipe() method on RDDs.

> With Spark's pipe() method, one can write a transformation of an RDD that reads each element of the RDD
from standard input as a String.
> It writes the results as Strings to standard output.
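
A rough sketch, assuming a Unix tr command is available on the worker nodes:

val rdd = sc.parallelize(List("hello", "world", "spark"), 2)
val piped = rdd.pipe("tr a-z A-Z")   // each element is fed to the external process as one stdin line
piped.collect()                      // Array(HELLO, WORLD, SPARK)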

Explain coalesce() operation


> It is a transformation.
> It's in a package org.apache.spark.rdd.ShuffledRDD

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]

Return a new RDD that is reduced into numPartitions partitions.

This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a
shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking
place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass
shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a
small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling
coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.

It changes the number of partitions where data is stored. It combines the original partitions into a new number of
partitions, so it reduces the number of partitions. It is an optimized version of repartition() that minimizes data
movement, but only when you are decreasing the number of RDD partitions. It lets operations run more efficiently
after filtering a large dataset.

Example :

val myrdd1 = sc.parallelize(1 to 1000, 15)
myrdd1.partitions.length
val myrdd2 = myrdd1.coalesce(5,false)
myrdd2.partitions.length

Output :
Int = 15
Int = 5

Explain the repartition() operation


repartition() is a transformation.
This function changes the number of partitions to the value specified in the numPartitions parameter (numPartitions: Int).
It's in package org.apache.spark.rdd.ShuffledRDD

def repartition(numPartitions: Int)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]


Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid
performing a shuffle.

Example :

val rdd1 = sc.parallelize(1 to 100, 3)


rdd1.getNumPartitions
val rdd2 = rdd1.repartition(6)
rdd2.getNumPartitions

Output :
Int = 3
Int = 6

Explain fullOuterJoin() operation

> It is a transformation.
> It's in package org.apache.spark.rdd.PairRDDFunctions

def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]

Perform a full outer join of this and other.


For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in
other,
or the pair (k, (Some(v), None)) if no elements in other have key k.
Similarly, for each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), Some(w)))
for v in this,
or the pair (k, (None, Some(w))) if no elements in this have key k.
Hash-partitions the resulting RDD using the existing partitioner/ parallelism level.
Example :

val frdd1 = sc.parallelize(Seq(("Spark",35),("Hive",23),("Spark",45),("HBase",89)))


val frdd2 = sc.parallelize(Seq(("Spark",74),("Flume",12),("Hive",14),("Kafka",25)))
val fullouterjoinrdd = frdd1.fullOuterJoin(frdd2)
fullouterjoinrdd.collect

Output :
Array[(String, (Option[Int], Option[Int]))] = Array((Spark,(Some(35),Some(74))), (Spark,(Some(45),Some(74))),
(Kafka,(None,Some(25))), (Flume,(None,Some(12))), (Hive,(Some(23),Some(14))), (HBase,(Some(89),None)))

Explain leftOuterJoin() and rightOuterJoin() operations

> Both leftOuterJoin() and rightOuterJoin() are transformations.


> Both in package org.apache.spark.rdd.PairRDDFunctions

leftOuterJoin() :
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all
pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions
the output using the existing partitioner/parallelism level.

leftOuterJoin() performs a join between two RDDs where the keys must be present in the first RDD.

Example :

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))


val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
val leftjoinrdd = rdd1.leftOuterJoin(rdd2)
leftjoinrdd.collect

Output :
Array[(String, (Int, Option[Int]))] = Array((s,(59,Some(61))), (s,(59,Some(62))), (s,(54,Some(61))), (s,
(54,Some(62))), (e,(57,None)), (e,(58,None)), (m,(55,Some(60))), (m,(55,Some(65))), (m,(56,Some(60))), (m,
(56,Some(65))))

rightOuterJoin() :
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]

Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain
all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Hash-partitions
the resulting RDD using the existing partitioner/parallelism level.

It performs the join between two RDDs where the key must be present in the other RDD.

Example :

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))


val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
val rightjoinrdd = rdd1.rightOuterJoin(rdd2)
rightjoinrdd.collect

Output :
Array[(String, (Option[Int], Int))] = Array((s,(Some(59),61)), (s,(Some(59),62)), (s,(Some(54),61)), (s,(Some(54),62)), (h,(None,63)), (h,(None,64)), (m,(Some(55),60)), (m,(Some(55),65)), (m,(Some(56),60)), (m,(Some(56),65)))

Explain join() operation
> join() is a transformation.
> It's in package org.apache.spark.rdd.PairRDDFunctions

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Return an RDD containing all pairs of elements with matching keys in this and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other.
Performs a hash join across the cluster.

It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W))
pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin,
and fullOuterJoin.

Example1:

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))


val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect

Output :
Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,
(54,61)), (s,(54,62)))

Example2:

val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
val myrdd2 = sc.parallelize(Seq((3,9)))
val myjoinedrdd = myrdd1.join(myrdd2)
myjoinedrdd.collect

Output:
Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))

Explain the top() and takeOrdered() operation

> Both are actions.


> Both return the n elements of the RDD based on the default ordering, or on a custom ordering provided by the user

def top(num: Int)(implicit ord: Ordering[T]): Array[T]

Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T] and
maintains the ordering. This does the opposite of takeOrdered.

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and
maintains the ordering. This does the opposite of top.

Example :

val myrdd1 = sc.parallelize(List(5,7,9,13,51,89))


myrdd1.top(3)
myrdd1.takeOrdered(3)
myrdd1.top(3)

Output :
Array[Int] = Array(89, 51, 13)
Array[Int] = Array(5, 7, 9)
Array[Int] = Array(89, 51, 13)
Explain first() operation

> It's an action.

> It returns the first element of the RDD. It is similar to take(1).
Example :
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
rdd1.count
rdd1.first

Output :
Long: 20
String : DataFlair is the leading technology training provider

Explain the mapPartitions() and mapPartitionsWithIndex()


mapPartitions() can be used as an alternative to map() and foreach().
mapPartitions() is called once for each partition, while map() and foreach() are called for each element in an
RDD.
Hence one can do initialization on a per-partition basis rather than on a per-element basis.

mapPartitionsWithIndex() :

mapPartitionsWithIndex() is similar to mapPartitions(), but it provides a second parameter, the index, which keeps
track of the partition.
It takes two parameters: the first parameter is the index and the second is an iterator through all the items
within that partition (Int, Iterator<T>).
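
A minimal sketch of both operations (assuming a SparkContext sc):

val rdd = sc.parallelize(1 to 10, 2)                   // two partitions: 1-5 and 6-10
rdd.mapPartitions(iter => Iterator(iter.sum)).collect  // Array(15, 40) - one sum per partition
rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect
// Array((0,1), (0,2), ..., (1,6), ..., (1,10)) - each element tagged with its partition index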

Explain sum(), max(), min() operation

sum() :

> It adds up the values in an RDD

> It is in the package org.apache.spark.rdd.DoubleRDDFunctions
> Its return type is Double

Example:

val rdd1 = sc.parallelize(1 to 20)


rdd1.sum

Output:
Double = 210.0

max() :

> It returns the max value from the RDD elements, as defined by the implicit ordering (element order)
> It is in the package org.apache.spark.rdd

Example:

val rdd1 = sc.parallelize(List(1,5,9,0,23,56,99,87))


rdd1.max

Output:
Int = 99
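
The question also mentions min(), which follows the same pattern (a quick sketch):

min() :

> It returns the min value from the RDD elements, as defined by the implicit ordering

Example:

val rdd1 = sc.parallelize(List(1,5,9,0,23,56,99,87))
rdd1.min

Output:
Int = 0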

Explain countByValue() operation

> It is an action
> It returns the count of each unique value in the RDD as a local Map (i.e. a Map returned to the driver program) of
(value, count of values) pairs
> Care must be taken when using this API, since it returns the values to the driver program, so it is suitable only for
small data sets

Example:

val rdd1 = sc.parallelize(Seq(("HR",5),("RD",4),("ADMIN",5),("SALES",4),("SER",6),("MAN",8)))


rdd1.countByValue

Output:
scala.collection.Map[(String, Int),Long] = Map((HR,5) -> 1, (RD,4) -> 1, (SALES,4) -> 1, (ADMIN,5) -> 1,
(MAN,8) -> 1, (SER,6) -> 1)

val rdd2 = sc.parallelize{Seq(10,4,3,3)}


rdd2.countByValue

Output:
scala.collection.Map[Int,Long] = Map(4 -> 1, 3 -> 2, 10 -> 1)

Explain the lookup() operation

> It is an action
> It returns the list of values in the RDD for the key 'key'

val rdd1 = sc.parallelize(Seq(("Spark",78),("Hive",95),("spark",15),("HBase",25),("spark",39),("BigData",78),


("spark",49)))
rdd1.lookup("spark")
rdd1.lookup("Hive")
rdd1.lookup("BigData")

Output:
Seq[Int] = WrappedArray(15, 39, 49)
Seq[Int] = WrappedArray(95)
Seq[Int] = WrappedArray(78)

Parquet file merging or other optimisation tips

There are a couple of SQL optimizations I recommend you consider.

1) Making use of partitions for your table may help if you frequently only
access data from certain days at a time. There's a notebook in the
Databricks Guide called "Partitioned Tables" with more detail.
2) If your files are really small, it is true that you may get better
performance by consolidating those files into a smaller number. You can do
that easily in Spark with a command like this:

sqlContext.parquetFile(SOME_INPUT_FILEPATTERN)
  .coalesce(SOME_SMALLER_NUMBER_OF_DESIRED_PARTITIONS)
  .write.parquet(SOME_OUTPUT_DIRECTORY)

or
Having a large number of small files or folders can significantly deteriorate the
performance of loading the data. The best way is to keep the folders/files
merged so that each file is around 64 MB in size. There are different ways to
achieve this: your writer process can either buffer them in memory and
write only after reaching a size, or as a second phase you can read the
temp directory, consolidate the files and write them out to a
different location. If you want to do the latter, you can read each of your
input directories as a DataFrame, union them, repartition the result to the number
of files you want and dump it back. A code snippet in Scala would be:

import scala.collection.mutable.MutableList
import org.apache.spark.sql.DataFrame

val dfSeq = MutableList[DataFrame]()
sourceDirsToConsolidate.map(dir => { val df = sqlContext.parquetFile(dir); dfSeq += df })
val masterDf = dfSeq.reduce((df1, df2) => df1.unionAll(df2))
masterDf.coalesce(numOutputFiles).write.mode(saveMode).parquet(destDir)

The DataFrame API is the same in Python, so you should be able to convert this easily.
