Spark Interview
Spark is a scheduling, monitoring and distributing engine for big data. It is a cluster computing platform designed
to be fast and general purpose. Spark extends the popular MapReduce model. One of the main features Spark
offers for speed is the ability to run computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.
In standalone mode, Spark uses a Master daemon which coordinates the efforts of the Workers, which run the
executors. Standalone mode is the default, but it cannot be used on secure clusters. When you submit an
application, you can choose how much memory its executors will use, as well as the total number of cores
across all executors.
In YARN mode, the YARN ResourceManager performs the functions of the Spark Master. The functions of the
Workers are performed by the YARN NodeManager daemons, which run the executors. YARN mode is slightly
more complex to set up, but it supports security.
Each application has a driver process which coordinates its execution. This process can run in the foreground
(client mode) or in the background (cluster mode). Client mode is a little simpler, but cluster mode allows you to
easily log out after starting a Spark application without terminating the application.
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given
dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless
asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed
immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the
overall data processing workflow.
• Due to the availability of in-memory processing, Spark performs processing around 10-100x faster than
Hadoop MapReduce. MapReduce makes use of persistent storage for all of its data processing tasks.
• Unlike Hadoop, Spark provides built-in libraries to perform multiple kinds of tasks from the same core, such as
batch processing, streaming, machine learning and interactive SQL queries. Hadoop, however, only supports
batch processing.
• Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
• Spark is capable of performing computations multiple times on the same dataset. This is called iterative
computation; Hadoop has no built-in support for iterative computing.
A partition, also known as a 'split' in HDFS, is a logical chunk of a data set, which may range up to terabytes or
petabytes and is distributed across the cluster.
By default, Spark creates one partition for each block of the file (for HDFS).
The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2), and the split size
follows the block size.
However, one can explicitly specify the number of partitions to be created.
Partitions are basically used to speed up data processing.
Example1:
The following code creates an RDD with 10 explicitly specified partitions and then checks the partition count.
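The creation step itself is only described, not shown; a minimal sketch, assuming a small in-memory collection as the data source, is:

// Explicitly request 10 partitions when creating the RDD.
val rdd1 = sc.parallelize(1 to 100, 10)

Either of the following then reports the number of partitions: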
rdd1.partitions.length
OR
rdd1.getNumPartitions
PARTITIONER :
# An object that defines how the elements in a key-value pair RDD are partitioned by key. It maps each key to a
partition ID, from 0 to (number of partitions - 1).
# A partitioner captures the data distribution at the output. The scheduler can optimize future operations based
on the type of partitioner; i.e. if we perform an operation, say a transformation or action, that requires shuffling
data across nodes, a partitioner may be needed. (Please refer to the reduceByKey() transformation in the forum.)
Transformations are lazily evaluated operations on an RDD that create one or many new RDDs, e.g. map, filter,
reduceByKey, join, cogroup, randomSplit. Transformations are functions which take an RDD as the input and
produce one or many RDDs as output. They don't change the input RDD, since RDDs are immutable and hence
cannot be changed or modified; they always produce new RDDs by applying computations to the existing ones.
By applying transformations you incrementally build an RDD lineage with all the ancestor RDDs of the
final RDD(s).
Transformations are lazy, i.e. they are not executed immediately; they are executed only when actions are
called. After executing a transformation, the resulting RDD(s) will always be different from their ancestor RDDs
and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian), the same size (e.g. map),
or of varying size.
RDDs allow you to create dependencies between RDDs. Dependencies are the steps for producing results, i.e. a
program. Each RDD in the lineage chain (the string of dependencies) has a function for operating on its data and
a pointer (dependency) to its ancestor RDD. Spark divides RDD dependencies into stages and tasks and then
sends those to workers for execution.
Or
Transformations are functions applied on an RDD, resulting in another RDD. They do not execute until an action
occurs. map() and filter() are examples of transformations: the former applies the function passed to it to each
element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the
current RDD that pass the function argument.
Or
In order to identify the type of an operation, one needs to look at its return type.
If the operation returns a new RDD, the operation is a 'Transformation'.
If the operation returns any type other than an RDD, the operation is an 'Action'.
Hence,
a transformation constructs a new RDD from an existing (previous) one, while an action computes a result
based on the applied transformations and either returns the result to the driver program or saves it to external
storage.
An action helps in bringing back the data from an RDD to the local machine.
An action's execution is the result of all previously created transformations. reduce() is an action that
applies the function passed to it again and again until one value is left. take(n) is an action that brings the first n
values from the RDD to the local node.
Actions are RDD operations whose return values are sent back to the Spark driver program, which kicks off a
job to execute on a cluster. A transformation's output is the input of actions. reduce, collect, takeSample, take,
first, saveAsTextFile, saveAsSequenceFile, countByKey and foreach are common actions in Apache Spark.
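A minimal sketch contrasting lazy transformations with actions, using a made-up numeric RDD:

// Transformations return new RDDs; nothing is computed yet.
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)        // 12, 14, 16, 18, 20 (still lazy)

// Actions trigger the actual computation and return values to the driver.
val total = bigOnes.reduce(_ + _)           // 80
val firstThree = bigOnes.take(3)            // Array(12, 14, 16)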
Spark does not support data replication in memory; if any data is lost, it is rebuilt using RDD lineage. RDD
lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to
build itself from other datasets.
Or
As you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies
between different RDDs, called the lineage graph. It uses this information to compute each RDD on demand
and to recover lost data if part of a persistent RDD is lost.
Or
A lineage graph is the graph of all the parent RDDs of an RDD.
Applying different transformations on an RDD results in a lineage graph.
When one derives a new RDD from an existing (previous) RDD using a transformation, Spark keeps track of all
the dependencies between the RDDs; this record is called the lineage graph.
The lineage graph is useful for the scenarios mentioned below:
(1) When there is a demand to compute a new RDD.
(2) To recover lost data if part of a persisted RDD is lost.
In other words, the lineage graph is a graph of all the transformation operations that need to be executed when
an action operation is called.
How does Spark achieve fault tolerance?
Spark stores data in memory, whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault
tolerance, whereas Spark uses a different data storage model, the RDD. RDDs achieve fault tolerance through the
notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that
partition. This removes the need for replication to achieve fault tolerance.
• By loading an external dataset from external storage like HDFS, HBase, shared file system
Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to
as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that
aggregates data based on each key and a join() method that combines different RDDs together, based on the
elements having the same key.
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
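A short sketch of these pair-RDD methods, using made-up data:

val sales = sc.parallelize(Seq(("apple", 3), ("banana", 2), ("apple", 4)))
val stale = sc.parallelize(Seq(("banana", 1)))

// reduceByKey: aggregate the values per key.
sales.reduceByKey(_ + _).collect()        // Array((apple,7), (banana,2))

// join: combine two pair RDDs on matching keys.
sales.join(stale).collect()               // Array((banana,(2,1)))

// subtractByKey: drop every pair whose key also appears in the other RDD.
sales.subtractByKey(stale).collect()      // Array((apple,3), (apple,4))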
What is persist()?
Spark's RDDs are by default recomputed each time you run an action on them. If you would like to reuse an
RDD in multiple actions, you can ask Spark to persist it using RDD.persist(). After computing it the first time,
Spark will store the RDD contents in memory (partitioned across the machines in your cluster) and reuse them
in future actions. Persisting RDDs on disk instead of memory is also possible.
Apache Spark automatically persists the intermediate data from various shuffle operations; however, it is often
suggested that users call persist() on an RDD if they plan to reuse it. Spark has various persistence levels to
store RDDs on disk or in memory, or as a combination of both, with different replication levels.
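A minimal sketch of reusing a persisted RDD across two actions (the HDFS path is a placeholder):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///path/to/logs")
val errors = logs.filter(_.contains("ERROR"))

// Keep the filtered RDD around so the two actions below reuse it
// instead of re-reading and re-filtering the file each time.
errors.persist(StorageLevel.MEMORY_AND_DISK)

val errorCount = errors.count()     // first action: computes and caches
val firstTen = errors.take(10)      // second action: served from the cache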
Lineage graphs are always useful for recovering RDDs from a failure, but this is generally time-consuming if the
RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist.
However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the
lineage graphs are long and have wide dependencies.
How can you achieve high availability in Apache Spark?
Implementing single node recovery with local file system
Using StandBy Masters with Apache ZooKeeper.
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage.
An RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost
due to failure, lineage helps rebuild only that particular lost partition.
Driver- The process that runs the main() method of the program to create RDDs and perform transformations
and actions on them.
Executor –The worker processes that run the individual tasks of a Spark job.
Cluster Manager-A pluggable component in Spark, to launch Executors and Drivers. The cluster manager
allows Spark to run on top of other external managers like Apache Mesos or YARN.
What is the use of the Spark driver, and where does it get executed on the cluster?
The Spark driver is the program that defines the transformations and actions on RDDs; it runs on the master
node. The Spark driver creates the SparkContext, connected to a given Spark Master. The driver also delivers
the RDD graphs to the Master, where the standalone cluster manager runs.
Or
The Spark driver is the program that defines the transformations and actions on RDDs of data and
submits requests to the master. The Spark driver is a program that runs on the master node of the machine and
declares transformations and actions on data RDDs.
In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark Master. It also delivers
the RDD graphs to the Master, where the standalone cluster manager runs.
Or
The driver program is responsible for launching various parallel operations on the cluster.
The driver program contains the application's main() function.
It is the process that runs the user code, which in turn creates the SparkContext object, creates RDDs and
performs transformation and action operations on RDDs.
The driver program accesses Spark through a SparkContext object, which represents a connection to the
computing cluster (from Spark 2.0 onwards we can access the SparkContext object through SparkSession).
The driver program is responsible for converting the user program into units of physical execution called tasks.
It also defines distributed datasets on the cluster, and we can apply different operations on these datasets
(transformations and actions).
The Spark program creates a logical plan called a Directed Acyclic Graph (DAG), which is converted into a
physical execution plan by the driver when the driver program runs.
When the SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors
are Spark processes that run computations and store data on the worker nodes. The final tasks from the
SparkContext are transferred to executors for execution.
Or
Spark executors are worker processes responsible for running the individual tasks in a given Spark job.
Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of
an application. Executors have two roles. First, they run the tasks that make up the application and return
results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs.
Or
Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The
heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory
property or the --executor-memory flag. Every Spark application will have one executor on each worker node.
The executor memory is basically a measure of how much memory of the worker node the application will
utilize.
You can configure this using the --executor-memory argument to spark-submit. Each application will have at
most one executor on each worker, so this setting controls how much of that worker's memory the application
will claim. By default, this setting is 1 GB; you will likely want to increase it on most servers.
Spark uses the MapReduce API to partition the data. In the input format we can set the number of partitions. By
default, the HDFS block size is the partition size (for best performance), but it is possible to change the partition
size, like a split.
By default, one partition is created for one block, i.e. if we have a file of size 1280 MB (with a 128 MB block size)
there will be 10 HDFS blocks, and hence the same number of partitions (10) will be created.
If you want to create more partitions than the number of blocks, you can specify that while creating the RDD:
data = context.textFile("/user/dataflair/file-name", 20)
This will create 20 partitions for the file, i.e. 2 partitions per block.
NOTE: it is often recommended to have more partitions than blocks, as it improves performance.
Spark Core performs various important functions like memory management, monitoring jobs, fault tolerance, job
scheduling and interaction with storage systems.
Spark is a processing engine; there is no storage engine. It can retrieve data from any storage engine like
HDFS, S3 and other data sources.
Hive contains significant support for Apache Spark, wherein Hive execution can be configured to use Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both
read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode client \
--master yarn \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master yarn \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable
manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Using broadcast variables – broadcast variables enhance the efficiency of joins between small and large
RDDs.
Using accumulators – accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.
Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.
One of the most common uses of accumulators is to count events that occur during job execution for debugging
purposes.
Why is there a need for broadcast variables when working with Apache Spark?
These are read-only variables, kept in an in-memory cache on every machine. When working with Spark, the use
of broadcast variables eliminates the necessity of shipping copies of a variable with every task, so data can be
processed faster. Broadcast variables help store a lookup table in memory, which enhances the
retrieval efficiency when compared to an RDD lookup().
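A minimal sketch of both mechanisms, assuming Spark 2.x (sc.longAccumulator) and made-up lookup data:

// Broadcast a small lookup table once to every executor instead of
// shipping it with every task.
val countryNames = Map("IN" -> "India", "US" -> "United States")
val bcNames = sc.broadcast(countryNames)

// Accumulator counting unknown codes while a transformation runs.
val unknownCodes = sc.longAccumulator("unknownCodes")

val codes = sc.parallelize(Seq("IN", "US", "XX"))
val resolved = codes.map { code =>
  bcNames.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
}
resolved.collect()
println(unknownCodes.value)   // number of codes that were not in the lookup table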
What are the steps that occur when you run a Spark application on a cluster?
Spark-submit launches the driver program and invokes the main() method specified by the user.
The driver program contacts the cluster manager to ask for resources to launch executors.
The cluster manager launches executors on behalf of the driver program.
The driver process runs through the user application. Based on the RDD actions and transformations in the
program, the driver sends work to executors in the form of tasks.
Tasks are run on executor processes to compute and save results.
If the driver's main() method exits or it calls SparkContext.stop(), it will terminate the executors and release
resources from the cluster manager.
Or
In the real world, Spark operates in a master/slave fashion with one central coordinator and many distributed
workers.
The central coordinator is called the 'driver', while the distributed workers are called 'executors'.
The driver communicates with a large number of executors.
The driver program runs in its own Java process, while each executor runs in its own Java process.
The driver and executors together are known as the 'Spark application'.
The Spark application is launched on a cluster using a cluster manager.
Spark has its own built-in cluster manager, called the Standalone Cluster Manager.
However, one can also run Spark on two popular open source cluster managers, Hadoop YARN and Apache
Mesos.
Spark Driver --> Cluster Manager (Standalone, YARN, Mesos) --> Worker (executor)
In reality, there are many workers below the cluster manager, but for simplicity just one executor is shown.
Row objects represent records inside SchemaRDDs and are simply fixed-length arrays of fields. Row objects
have a number of getter functions to obtain the value of each field given its index. The standard getter, get (or
apply in Scala), takes a column number and returns an Object type (or Any in Scala) that we are responsible for
casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short, and String, there is a getType()
method which returns that type. For example, getString(0) would return field 0 as a string.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed
by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level
abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can
be created either from input data streams from sources such as Kafka and Flume, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Spark Streaming uses a “micro-batch” architecture, where Spark Streaming receives data from various input
sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of
each time interval a new batch is created, and any data that arrives during that interval gets added to that
batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a
parameter called the batch interval. Each input batch forms an RDD, and is processed using Spark jobs to
create other RDDs. The processed results can then be pushed out to external systems in batches.
Or
As Spark Core is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or
discretized streams.
A DStream is a sequence of data arriving over time.
Each DStream is represented as a sequence of RDDs arriving at repeated / configured time steps.
DStreams can be created from various input sources like TCP sockets, Kafka, Flume, HDFS, etc.
DStreams offer two types of operations: transformations, which generate another DStream, and output
operations, which write data to an external system.
Since DStreams are derived from RDDs, one can perform the basic RDD operations on a DStream, in addition to
new time-related operations such as sliding windows.
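A minimal word-count sketch of the DStream API, assuming a local run and a placeholder socket source on localhost:9999:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 2-second batch interval: every 2 seconds the received data becomes one RDD.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))

// DStream from a TCP socket (hostname and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

// Transformations on the DStream apply to every underlying RDD/batch.
val wordPairs = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)

wordCounts.print()   // output operation
ssc.start()
ssc.awaitTermination()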
Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon (now known as Alluxio).
In stateless transformations the processing of each batch does not depend on the data of its previous
batches. They include the common RDD transformations like map(), filter(), and reduceByKey().
Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the
results of the current batch. They include transformations based on sliding windows and on tracking state
across time.
Every input DStream is associated with a Receiver object which receives the data from a source and stores it in
Spark’s memory for processing.
What is the significance of the sliding window operation?
A sliding window controls which batches of the data stream are combined into a single result. The Spark
Streaming library provides windowed computations in which the transformations on RDDs are applied over a
sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are
combined and operated upon to produce new RDDs of the windowed DStream.
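A sketch of a windowed computation, assuming the wordPairs DStream and ssc from the streaming sketch above; the checkpoint directory is a placeholder, and checkpointing is required for the incremental form used here:

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
ssc.checkpoint("/tmp/spark-checkpoint")
val windowedCounts = wordPairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add counts for batches entering the window
  (a: Int, b: Int) => a - b,   // subtract counts for batches leaving the window
  Seconds(30),                 // window duration
  Seconds(10)                  // slide duration
)
windowedCounts.print()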
Why do we need compression, and what are the different compression formats supported?
In big data, compression saves storage space and reduces network overhead.
One can specify the compression codec while writing data to HDFS (Hadoop format).
One can also read compressed data; for that too a compression codec can be used.
The following compression formats are supported in big data:
* gzip
* lzo
* bzip2
* zlib
* Snappy
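A minimal sketch of writing and reading gzip-compressed text from an RDD (the output path is a placeholder):

import org.apache.hadoop.io.compress.GzipCodec

// Write the RDD as gzip-compressed part files.
val rdd = sc.parallelize(Seq("spark", "hadoop", "flink"))
rdd.saveAsTextFile("/tmp/compressed-output", classOf[GzipCodec])

// Reading it back: sc.textFile decompresses automatically based on the
// file extension, so no codec needs to be specified.
val readBack = sc.textFile("/tmp/compressed-output")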
textFile() : Reads a text file (from HDFS, the local file system, or any Hadoop-supported URI) and returns an RDD
with one element per line.
wholeTextFiles() : Reads a directory of text files and returns a pair RDD with one (file path, file content) element
per file.
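A short sketch of both, with placeholder paths:

// textFile: one RDD element per line of the file(s).
val lines = sc.textFile("hdfs:///data/input.txt")

// wholeTextFiles: one element per file, as a (filePath, fileContent) pair,
// which is handy when many small files must each be kept whole.
val files = sc.wholeTextFiles("hdfs:///data/small-files/")
files.map { case (path, content) => (path, content.length) }.collect()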
cogroup() :
It's a transformation.
It's in the package org.apache.spark.rdd.PairRDDFunctions
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K,
(Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
For each key k in this or other1 or other2 or other3, return a resulting RDD that contains a tuple with the list of
values for that key in this, other1, other2 and other3.
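The original example code is not shown; the following is a sketch of two hypothetical input pair RDDs chosen to be consistent with the output below:

val rdd1 = sc.parallelize(Seq((1, "spark"), (2, "HDFS"), (3, "Hive"), (4, "Flink"), (6, "HBase")))
val rdd2 = sc.parallelize(Seq((1, "stream"), (1, "MLlib"), (4, "RealTime"), (5, "Kafka"), (6, "NOSQL")))

// cogroup groups the values of both RDDs by key; a key present in only one
// RDD gets an empty Iterable on the other side.
rdd1.cogroup(rdd2).collect()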
Output :
Array[(Int, (Iterable[String], Iterable[String]))] =
Array((4,(CompactBuffer(Flink),CompactBuffer(RealTime))),
(1,(CompactBuffer(spark),CompactBuffer(stream, MLlib))),
(6,(CompactBuffer(HBase),CompactBuffer(NOSQL))),
(3,(CompactBuffer(Hive),CompactBuffer())),
(5,(CompactBuffer(),CompactBuffer(Kafka))),
(2,(CompactBuffer(HDFS),CompactBuffer())))
> In general, Spark programs are written in Scala, Java or Python. However, if that is not enough, and one wants
to pipe (inject) data through code written in another language like 'R', Spark provides a general mechanism in
the form of the pipe() method.
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a
shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking
place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass
shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).
Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a
small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling
coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
coalesce() changes the number of partitions in which the data is stored. It combines the original partitions into a
new number of partitions, so it reduces the number of partitions. It is an optimized version of repartition that
avoids full data movement, but only if you are decreasing the number of RDD partitions. It lets subsequent
operations run more efficiently after filtering down a large dataset.
Example :
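The example's input is not shown; a sketch consistent with the partition counts in the output below, assuming an RDD created with 15 partitions:

val rdd1 = sc.parallelize(1 to 100, 15)
rdd1.partitions.length              // 15

// coalesce() merges existing partitions without a full shuffle.
val rdd2 = rdd1.coalesce(5)
rdd2.partitions.length              // 5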
Output :
Int = 15
Int = 5
Example :
Output :
Int = 3
Int = 6
fullOuterJoin() :
> It is a transformation.
> It's in the package org.apache.spark.rdd.PairRDDFunctions
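The example code is not shown; a sketch of two hypothetical input pair RDDs consistent with the output below:

val rdd1 = sc.parallelize(Seq(("Spark", 35), ("Spark", 45), ("Hive", 23), ("Hbase", 89)))
val rdd2 = sc.parallelize(Seq(("Spark", 74), ("Kafka", 25), ("Flume", 12), ("Hive", 14)))

// fullOuterJoin keeps keys from both sides; a side with no match becomes None.
rdd1.fullOuterJoin(rdd2).collect()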
Output :
Array[(String, (Option[Int], Option[Int]))] = Array((Spark,(Some(35),Some(74))), (Spark,(Some(45),Some(74))),
(Kafka,(None,Some(25))), (Flume,(None,Some(12))), (Hive,(Some(23),Some(14))), (Hbase,(Some(89),None)))
leftOuterJoin() :
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all
pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions
the output using the existing partitioner/parallelism level.
leftOuterJoin() performs a join between two RDDs in which every key of the first (left) RDD appears in the result.
Example :
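The example's input is not shown; a sketch of two hypothetical pair RDDs consistent with the output below:

val left = sc.parallelize(Seq(("s", 59), ("s", 54), ("e", 57), ("e", 58), ("m", 55), ("m", 56)))
val right = sc.parallelize(Seq(("s", 61), ("s", 62), ("m", 60), ("m", 65)))

// Every key of the left RDD is kept; keys absent from the right RDD get None.
left.leftOuterJoin(right).collect()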
Output :
Array[(String, (Int, Option[Int]))] = Array((s,(59,Some(61))), (s,(59,Some(62))), (s,(54,Some(61))), (s,
(54,Some(62))), (e,(57,None)), (e,(58,None)), (m,(55,Some(60))), (m,(55,Some(65))), (m,(56,Some(60))), (m,
(56,Some(65))))
rightOuterJoin() :
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain
all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Hash-partitions
the resulting RDD using the existing partitioner/parallelism level.
It performs a join between two RDDs in which every key of the other (right) RDD appears in the result.
Example :
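No example output was given in the original; the following is a hypothetical sketch with made-up data:

val left = sc.parallelize(Seq(("m", 55), ("e", 57)))
val right = sc.parallelize(Seq(("m", 60), ("k", 25)))

// Every key of the right RDD is kept; keys absent from the left RDD get None.
left.rightOuterJoin(right).collect()
// Array((m,(Some(55),60)), (k,(None,25)))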
join() :
Return an RDD containing all pairs of elements with matching keys in this and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other.
Performs a hash join across the cluster.
join() joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W))
pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin,
and fullOuterJoin.
Example1:
Output :
Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,
(54,61)), (s,(54,62)))
Example2:
val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
val myrdd2 = sc.parallelize(Seq((3,9)))
val myjoinedrdd = myrdd1.join(myrdd2)
myjoinedrdd.collect
Output:
Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
top() :
Returns the top k (largest) elements from this RDD, as defined by the specified implicit Ordering[T], and
maintains the ordering. This does the opposite of takeOrdered.
takeOrdered() :
Returns the first k (smallest) elements from this RDD, as defined by the specified implicit Ordering[T], and
maintains the ordering. This does the opposite of top.
Example :
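The example's input is not shown; a sketch of a hypothetical RDD consistent with the three outputs below:

val rdd = sc.parallelize(List(5, 7, 9, 13, 51, 89))

rdd.top(3)                                  // largest three, descending
rdd.takeOrdered(3)                          // smallest three, ascending
rdd.takeOrdered(3)(Ordering[Int].reverse)   // reversed ordering: behaves like top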
Output :
Array[Int] = Array(89, 51, 13)
Array[Int] = Array(5, 7, 9)
Array[Int] = Array(89, 51, 13)
Explain the first() operation.
first() is an action that returns the first element of the RDD; it is similar to take(1).
Output :
Long: 20
String : DataFlair is the leading technology training provider
mapPartitions() :
mapPartitions() is similar to map(), but it calls the supplied function once per partition, passing an iterator over
that partition's elements, rather than once per element.
mapPartitionsWithIndex() :
mapPartitionsWithIndex() is similar to mapPartitions(), but it provides a second parameter, the partition index,
which keeps track of the partition.
It takes two parameters: the first parameter is the index and the second is an iterator through all the items within
this partition (Int, Iterator<T>).
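A short sketch of both, using a made-up two-partition RDD:

val rdd = sc.parallelize(1 to 6, 2)

// mapPartitions: the function receives an iterator over one whole partition.
rdd.mapPartitions(iter => Iterator(iter.sum)).collect()
// Array(6, 15) with this 2-partition layout: (1,2,3) and (4,5,6)

// mapPartitionsWithIndex: additionally receives the partition index.
rdd.mapPartitionsWithIndex((index, iter) => iter.map(x => (index, x))).collect()
// Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6))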
sum() :
It is an action that returns the sum of the elements of a numeric RDD (as a Double).
Example:
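The example's input is not shown; one input consistent with the output below (1 + 2 + ... + 20 = 210):

val rdd = sc.parallelize(1 to 20)
rdd.sum()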
Output:
Double = 210.0
max() :
> It returns the maximum value among the RDD elements, as defined by the implicit ordering (element order)
> It is in the package org.apache.spark.rdd
Example:
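The example's input is not shown; a hypothetical input consistent with the output below:

val rdd = sc.parallelize(List(12, 7, 99, 45))
rdd.max()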
Output:
Int = 99
countByValue() :
> It is an action
> It returns the count of each unique value in the RDD as a local Map (i.e. a Map returned to the driver program)
of (value, count of value) pairs
> Care must be taken when using this API, since it returns the values to the driver program, so it is suitable only
for small result sets
Example:
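The example's inputs are not shown; hypothetical inputs consistent with the two outputs below:

// Each (department, size) tuple occurs exactly once.
val depts = sc.parallelize(Seq(("HR", 5), ("RD", 4), ("SALES", 4), ("ADMIN", 5), ("MAN", 8), ("SER", 6)))
depts.countByValue()

// 3 occurs twice, 4 and 10 once each.
val nums = sc.parallelize(Seq(4, 3, 3, 10))
nums.countByValue()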
Output:
scala.collection.Map[(String, Int),Long] = Map((HR,5) -> 1, (RD,4) -> 1, (SALES,4) -> 1, (ADMIN,5) -> 1,
(MAN,8) -> 1, (SER,6) -> 1)
Output:
scala.collection.Map[Int,Long] = Map(4 -> 1, 3 -> 2, 10 -> 1)
lookup() :
> It is an action
> It returns the list of values in the RDD for the given key
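The example's input is not shown; a hypothetical pair RDD (with made-up keys "a", "b" and "c") consistent with the three outputs below:

val rdd = sc.parallelize(Seq(("a", 15), ("a", 39), ("a", 49), ("b", 95), ("c", 78)))
rdd.lookup("a")   // all values stored under key "a"
rdd.lookup("b")
rdd.lookup("c")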
Output:
Seq[Int] = WrappedArray(15, 39, 49)
Seq[Int] = WrappedArray(95)
Seq[Int] = WrappedArray(78)
1) Making use of partitions for your table may help if you frequently only
access data from certain days at a time. There's a notebook in the
Databricks Guide called "Partitioned Tables" with more detail.
2) If your files are really small, it is true that you may get better
performance by consolidating those files into a smaller number. You can do
that easily in Spark with a command like this:
sqlContext.parquetFile( SOME_INPUT_FILEPATTERN )
.coalesce(SOME_SMALLER_NUMBER_OF_DESIRED_PARTITIONS)
.write.parquet(SOME_OUTPUT_DIRECTORY)
or
Having a large number of small files or folders can significantly deteriorate the
performance of loading the data. The best way is to keep the folders/files
merged so that each file is around 64 MB in size. There are different ways to
achieve this: your writer process can either buffer them in memory and
write only after reaching a size, or, as a second phase, you can read the
temp directory, consolidate the files and write them out to a
different location. If you want to do the latter, you can read each of your
input directories as a DataFrame, union them, repartition the result to the number
of files you want and dump it back. A code snippet in Scala would be:
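The snippet itself was omitted; a sketch of the read-union-repartition-write approach described above, assuming the Spark 2.x DataFrame API and placeholder paths and file count:

val inputDirs = Seq("/data/tmp/part1", "/data/tmp/part2", "/data/tmp/part3")

val merged = inputDirs
  .map(dir => spark.read.parquet(dir))       // one DataFrame per input directory
  .reduce((a, b) => a.union(b))              // union them all together

merged
  .repartition(10)                           // target number of output files
  .write
  .parquet("/data/consolidated")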