Cloudera CCD-333 Cloudera Certified Developer For Apache Hadoop
Key/Values output from the reducer in a binary format, which can then be read back in by a
second MR job using SequenceFileInputFormat.
see answer 1 and then see the comment #1 for it)
In a MapReduce job, you want each of your input files processed by a single map task. How do you
configure a MapReduce job so that a single map task processes each input file regardless of how
many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.
Answer: D
Explanation: Note:
* // Do not allow splitting.
protected boolean isSplitable(JobContext context, Path filename) {
    return false;
}
* InputSplits: An InputSplit describes a unit of work that comprises a single map task in a
MapReduce program. A MapReduce program applied to a data set, collectively referred to as a
Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a
whole file; they often involve reading only part of a file. By default, the FileInputFormat and
its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can
control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by
overriding the parameter in the JobConf object used to submit a particular MapReduce job. By
processing a file in chunks, we allow several map tasks to operate on a single file in
parallel. If the file is very large, this can improve performance significantly through
parallelism. Even more importantly, since the various blocks that make up the file may be
spread across several nodes in the cluster, it allows tasks to be scheduled on each of these
different nodes; the individual blocks are thus all processed locally, instead of needing to be
transferred from one node to another. Of course, while log files can be processed in this
piece-wise fashion, some file formats are not amenable to chunked processing. By writing a
custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.
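The effect of refusing to split a file can be illustrated with a toy model. The sketch below is plain Java with no Hadoop dependency; SplitSketch and computeSplits are hypothetical names, and the fixed-size carving is a simplification of what FileInputFormat does, not its actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of carving one file into splits; not Hadoop's real split logic. */
final class SplitSketch {

    /**
     * Returns splits as {start, length} pairs. When splitable is false the
     * whole file becomes a single split, so a single map task would read it.
     */
    static List<long[]> computeSplits(long fileLength, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // empty file: nothing to process in this model
        }
        if (!splitable) {
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with 64 MB splits: several map tasks when splitting is allowed.
        System.out.println(computeSplits(200 * mb, 64 * mb, true).size());
        // The same file when the format refuses to split: one map task.
        System.out.println(computeSplits(200 * mb, 64 * mb, false).size());
    }
}
```

In this model the number of splits equals the number of map tasks, which is why returning false from isSplitable gives one map task per file regardless of how many blocks the file occupies.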
Which of the following best describes the workings of TextInputFormat?
A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
B. The input file is split exactly at the line breaks, so each RecordReader will read a series of
complete lines.
C. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.
D. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the end of the broken line.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReader of the split that contains the beginning of the broken line.
Answer: E
Explanation: As the Map operation is parallelized, the input file set is first split into several
pieces called FileSplits. If an individual file is so large that it will affect seek time, it will
be split into several Splits. The splitting does not know anything about the input file's internal
logical structure; for example, line-oriented text files are split on arbitrary byte boundaries.
Then a new map task is created per FileSplit.
When an individual map task starts, it will open a new output writer per configured reduce task. It
will then proceed to read its FileSplit using the RecordReader it gets from the specified
InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also
handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read
the last line of the FileSplit past the split boundary and, when reading other than the first
FileSplit, TextInputFormat ignores the content up to the first newline.
Reference: How Map and Reduce operations are actually carried out (second paragraph)
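The boundary rule described in the explanation (read the last line past the split end, skip up to the first newline in every split but the first) can be sketched in plain Java. LineSplitSketch and readSplit are hypothetical names, and the sketch works on an in-memory byte array; it illustrates the rule and is not Hadoop's actual RecordReader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy line reader over a byte range; not Hadoop's actual RecordReader. */
final class LineSplitSketch {

    /** Returns the lines owned by the split [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: discard everything up to and including the
            // first newline, because the previous split's reader consumes it.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // Keep reading while a line starts at or before the split end, so the
        // last line is read to completion even if it runs past the boundary.
        while (pos <= end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return lines;
    }

    private static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return pos + 1;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 16));  // [charlie]
        System.out.println(readSplit(data, 16, 20)); // []
    }
}
```

In this run the line "bravo" straddles the boundary at byte 8 and is returned only by the first split, the one that contains the beginning of the line; every line is read exactly once.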
Which of the following statements most accurately describes the relationship between MapReduce
and Pig?
A. Pig provides additional capabilities that allow certain types of data manipulation not possible
with MapReduce.
B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as
MapReduce jobs via the Pig interpreter.
C. Pig programs rely on MapReduce but are extensible, allowing developers to do
processing not provided by MapReduce.
D. Pig provides the additional capability of allowing you to control the flow of multiple
MapReduce jobs.
Answer: D
Explanation: In addition to providing many relational and data flow operators, Pig Latin provides
ways for you to control how your jobs execute on MapReduce. It allows you to set values that
control your environment and to control details of MapReduce such as how your data is
You need to import a portion of a relational database every day as files to HDFS, and generate
Java classes to interact with your imported data. Which of the following tools should you use to
accomplish this?
A. Pig
B. Hue
C. Hive
D. Flume
E. Sqoop
F. Oozie
G. fuse-dfs
Answer: E
Explanation: Sqoop (SQL-to-Hadoop) is a straightforward command-line tool with the following
capabilities:
Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse
Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or
relational tables can be accessed from within a MapReduce map function.
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports
individual tables or entire databases into HDFS files. The tool also generates Java classes that
support interaction with the imported data. Sqoop supports all relational databases over JDBC,
and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to
data residing in Oracle databases.
Reference: Data Movement between Hadoop and relational databases (second paragraph)
You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to
run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing
software built on top of Apache Hadoop that defines a simple SQL-like query language well suited
for this kind of user?
A. Pig
B. Hue
C. Hive
D. Sqoop
E. Oozie
F. Flume
G. Hadoop Streaming
Answer: C
Explanation: Hive defines a simple SQL-like query language, called QL, that enables users
familiar with SQL to query the data. At the same time, this language also allows programmers who
are familiar with the MapReduce framework to plug in their custom mappers and
reducers to perform more sophisticated analysis that may not be supported by the built-in
capabilities of the language. QL can also be extended with custom scalar functions
by the Google File System, Apache HBase provides Bigtable-like capabilities on top of
and HDFS.
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary
encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or
via JMX
Reference: would I use HBase? First sentence)
Which of the following utilities allows you to create and run MapReduce jobs with any executable
or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
Answer: D
Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility
allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or
the reducer.
Reference: Hadoop Streaming (second sentence)
What is the preferred way to pass a small number of configuration parameters to a mapper or
reducer?
A. As key-value pairs in the JobConf object.
B. As a custom input key-value pair passed to each mapper or reducer.
C. Using a plain text file via the DistributedCache, which each mapper or reducer reads.
D. Through a static variable in the MapReduce driver class (i.e., the class that submits the
MapReduce job).
Answer: A
Explanation: In Hadoop, it is sometimes difficult to pass arguments to mappers and reducers. If
the number of arguments is huge (e.g., big arrays), DistributedCache might be a good choice.
However, here we are discussing small arguments, usually a handful of configuration parameters.
In fact, the way to configure these parameters is simple. When you initialize the JobConf object
to launch a MapReduce job, you can set the parameter by using the set method like:
JobConf job = (JobConf) getConf();
job.set("NumberOfDocuments", args[0]);
Here, NumberOfDocuments is the name of the parameter and its value is read from args[0], a
command line argument.
Reference: Passing Parameters and Arguments to Mapper and Reducer in Hadoop
Given a Mapper, Reducer, and Driver class packaged into a jar, which is the correct way of
submitting the job to the cluster?
A. jar MyJar.jar
B. jar MyJar.jar MyDriverClass inputdir outputdir
C. hadoop jar MyJar.jar MyDriverClass inputdir outputdir
D. hadoop jar class MyJar.jar MyDriverClass inputdir outputdir
Answer: C
Explanation: Example:
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input
What is the difference between a failed task attempt and a killed task attempt?
A. A failed task attempt is a task attempt that threw an unhandled exception. A killed task attempt
is one that was terminated by the JobTracker.
B. A failed task attempt is a task attempt that did not generate any key-value pairs. A killed task
attempt is a task attempt that threw an exception, and was thus killed by the execution framework.
C. A failed task attempt is a task attempt that completed, but with an unexpected status value. A
killed task attempt is a duplicate copy of a task attempt that was started as part of speculative
execution.
D. A failed task attempt is a task attempt that threw a RuntimeException (i.e., the task fails). A
killed task attempt is a task attempt that threw any other type of exception (e.g., IOException); the
execution framework catches these exceptions and reports them as killed.
Answer: C
Explanation: Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first
one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks on its own:
a) The task does not report progress during the timeout (default is 10 minutes).
b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or
queue (CapacityScheduler).
c) Speculative execution causes the results of the task not to be needed since it has completed
elsewhere.
Reference: Difference failed tasks vs killed tasks
Custom programmer-defined counters in MapReduce are:
A. Lightweight devices for bookkeeping within MapReduce programs.
B. Lightweight devices for ensuring the correctness of a MapReduce program. Mappers increment
counters, and reducers decrement counters. If at the end of the program the counters read zero,
then you are sure that the job completed correctly.
C. Lightweight devices for synchronization within MapReduce programs. You can use counters to
coordinate execution between a mapper and a reducer.
Answer: A
Explanation: Counters are a useful channel for gathering statistics about the job, for quality
control or for application-level statistics. They are also useful for problem diagnosis. Hadoop
maintains some built-in counters for every job, which report various metrics for your job.
Hadoop MapReduce also allows the user to define a set of user-defined counters that can be
incremented (or decremented by specifying a negative value as the parameter) by the
mapper or the reducer.
Answer: B
Explanation: Hadoop has a distributed cache mechanism to make files that may be needed by
Map/Reduce jobs available locally.
Use Case
Let's understand our use case in a bit more detail so that we can follow the code.
We have a key-value file that we need to use in our map jobs. For simplicity, let's say we need to
replace all keywords that we encounter during parsing with some other value.
So what we need is:
A key-values file (let's use a Properties file)
The Mapper code that uses it
Write the Mapper code that uses it
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {
    Properties cache;
    protected void setup(Context context) throws IOException, InterruptedException {
        // Assumption: the right-hand side is missing in the source; this call is a likely reconstruction.
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localCacheFiles != null) {
            // expecting only single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }
    public void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(<value>)
        // do some action or replace with something else
    }
}
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files
(text, archives,
jars etc.) needed by applications.
Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf.
DistributedCache assumes that the files specified via hdfs:// URLs are already present on the
FileSystem at the path specified by the URL.
Reference: Using Hadoop Distributed Cache
What types of algorithms are difficult to express in MapReduce?
A. Algorithms that require global, shared state.
B. Large-scale graph algorithms that require one-step link traversal.
B. Yes, reducers can communicate with each other by dispatching intermediate key
value pairs
that get shuffled to another reduce
C. Yes, reducers running on the same machine can communicate with each other
through shared
memory, but not reducers on different machines.
D. No, each reducer runs independently and in isolation.
Answer: D
Explanation: MapReduce programming model does not allow reducers to
communicate with each
other. Reducers run in isolation.
Reference:24 Interview Questions & Answers for Hadoop MapReduce developers
question no.
Which of the following best describes the map method input and output?
A. It accepts a single key-value pair as input and can emit only one key-value pair as output.
B. It accepts a list of key-value pairs as input but can emit only one key-value pair as output.
C. It accepts a single key-value pair as input and emits a single key and list of corresponding
values as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as
output, including zero.
Answer: D
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends Object
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The
transformed intermediate records need not be of the same type as the input records. A given input
pair may map to zero or many output pairs.
Reference: org.apache.hadoop.mapreduce
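The "zero or many output pairs" behaviour can be sketched without Hadoop. In the example below, TokenMapSketch and its map method are hypothetical stand-ins for a word-count style mapper: one (offset, line) input pair yields one (word, 1) pair per token, which may be none at all.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a map function; it returns pairs instead of writing to a Context. */
final class TokenMapSketch {

    /** Maps one (offset, line) input pair to zero or more (word, 1) output pairs. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "the quick fox").size()); // three output pairs
        System.out.println(map(14L, "   ").size());          // zero output pairs
    }
}
```

A blank line produces no output pairs and a three-word line produces three, which matches option D.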
You have written a Mapper which invokes the following five calls to the OutputCollector.collect
method:
[Image untitled.JPG: the five calls are not preserved in this copy]
How many times will the Reducer's reduce method be invoked?
A. 0
B. 1
C. 3
D. 5
E. 6
Answer: C
org.apache.hadoop.mapred Interface OutputCollector<K,V>
Collects the <key, value> pairs output by Mappers and Reducers.
OutputCollector is the generalization of the facility provided by the Map-Reduce
framework to
collect data output by either the Mapper or the Reducer i.e. intermediate outputs or the
output of
the job.
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. At least 500.
B. Exactly 500.
C. At most 500.
D. Between 500 and 1000.
E. It depends on the number of reducers in the job.
Answer: A
From Cloudera Training Course:
Task attempt is a particular instance of an attempt to execute a task
There will be at least as many task attempts as there are tasks
If a task attempt fails, another will be started by the JobTracker
Speculative execution can also result in more task attempts than completed tasks
The Hadoop framework provides a mechanism for coping with machine issues such as faulty
configuration or impending hardware failure. MapReduce detects that one or a number of
machines are performing poorly and starts more copies of a map or reduce task. All the tasks run
simultaneously and the tasks that finish first are used. This is called:
A. Combiner
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution
Answer: E
Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the
tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program.
For example, if one node has a slow disk controller, then it may be reading its input at only 10% the
speed of all the other nodes. So when 99 map tasks are already complete, the system is still
waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their
inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore,
the same input can be processed multiple times in parallel, to exploit differences in machine
capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule
redundant copies of the remaining tasks across several nodes which do not have other work to
perform. This process is known as speculative execution. When tasks complete, they announce
this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If
other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks
and discard their outputs. The Reducers then receive their inputs from whichever Mapper
completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce
Which of the following best describes the lifecycle of a Mapper?
A. The TaskTracker spawns a new Mapper to process each key-value pair.
B. The JobTracker spawns a new Mapper to process all records in a single file.
C. The TaskTracker spawns a new Mapper to process all records in a single input split.
D. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally
its close() method.
Answer: C
Explanation: For each map instance that runs, the TaskTracker creates a new instance of your
mapper.
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The
mapper may perform a number of extraction and transformation functions on the Key/Value pair
before ultimately outputting none, one or many Key/Value pairs of the same or different types.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class.
This class defines an 'Identity' map function by default: every input Key/Value pair obtained from
the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default
implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op
method.
Your client application submits a MapReduce job to your Hadoop cluster. The Hadoop framework
looks for an available slot to schedule the MapReduce operations on which of the following
Hadoop computing daemons?
A. DataNode
B. NameNode
C. JobTracker
D. TaskTracker
E. Secondary NameNode
Answer: D
Explanation: JobTracker is the daemon service for submitting and tracking MapReduce jobs in
Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in
its own JVM process. In a typical production cluster it runs on a separate machine. Each slave
node is configured with the JobTracker node location. The JobTracker is a single point of failure
for the Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in
Hadoop performs the following actions (from the Hadoop Wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they
are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid,
and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
Reference:24 Interview Questions & Answers for Hadoop MapReduce developers,What
is a
JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
Which MapReduce daemon runs on each slave node and participates in job execution?
A. TaskTracker
B. JobTracker
C. NameNode
D. Secondary NameNode
Answer: A
Explanation: A single instance of a TaskTracker runs on each slave node. The TaskTracker runs as
a separate JVM process.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is the
configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
(answer to question no. 5)
What is the standard configuration of slave nodes in a Hadoop cluster?
A. Each slave node runs a JobTracker and a DataNode daemon.
B. Each slave node runs a TaskTracker and a DataNode daemon.
C. Each slave node either runs a TaskTracker or a DataNode daemon, but not both.
D. Each slave node runs a DataNode daemon, but only a fraction of the slave nodes run
TaskTrackers.
E. Each slave node runs a TaskTracker, but only a fraction of the slave nodes run
DataNode daemons.
Answer: B
Explanation: A single instance of a TaskTracker runs on each slave node. The TaskTracker runs as
a separate JVM process.
A single instance of a DataNode daemon runs on each slave node. The DataNode daemon runs as a
separate JVM process.
One or multiple task instances run on each slave node. Each task instance runs as
a separate JVM process. The number of task instances can be controlled by configuration.
Typically a high-end machine is configured to run more task instances.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is the
configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
What happens if the NameNode crashes?
A. HDFS becomes unavailable until the NameNode is restored.
B. The Secondary NameNode seamlessly takes over and there is no service interruption.
C. HDFS becomes unavailable to new MapReduce jobs, but running jobs will continue
D. HDFS becomes temporarily unavailable until an administrator starts redirecting client
to the Secondary NameNode.
Answer: A
Explanation: The NameNode is a single point of failure for the HDFS cluster. When the
NameNode goes down, the file system goes offline.
Reference:24 Interview Questions & Answers for Hadoop MapReduce developers,What
is a
NameNode? How many instances of NameNode run on a Hadoop Cluster?
You are running a job that will process a single InputSplit on a cluster which has no
other jobs
currently running. Each node has an equal number of open Map slots. On which node
will Hadoop
first attempt to run the Map task?
A. The node with the most memory
B. The node with the lowest system load
C. The node on which this InputSplit is stored
D. The node with the most free local disk space
Answer: C
Explanation: The TaskTrackers send out heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that they are still alive. These messages also inform the
JobTracker of the number of available slots, so the JobTracker can stay up to date with where in
the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a
task within the MapReduce operations, it first looks for an empty slot on the same server that
hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the
same rack.
How does the NameNode detect that a DataNode has failed?
A. The NameNode does not need to know that a DataNode has failed.
B. When the NameNode fails to receive periodic heartbeats from the DataNode, it
considers the
DataNode as failed.
C. The NameNode periodically pings the DataNode. If the DataNode does not respond, the
NameNode considers the DataNode as failed.
D. When HDFS starts up, the NameNode tries to communicate with the DataNode and considers
the DataNode as failed if it does not respond.
Answer: B
In the reducer, the MapReduce API provides you with an iterator over Writable values.
Calling the
next () method:
A. Returns a reference to a different Writable object each time.
B. Returns a reference to a Writable object from an object pool.
C. Returns a reference to the same writable object each time, but populated with
different data.
D. Returns a reference to a Writable object. The API leaves unspecified whether this is
a reused
object or a new object.
E. Returns a reference to the same Writable object if the next value is the same as the previous
value, or a new Writable object otherwise.
Answer: C
Explanation: Calling next() will always return the same exact Writable instance,
with the contents of that instance replaced with the next value.
Reference: manipulating iterator in mapreduce
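The practical consequence of object reuse is that holding on to the returned reference is a bug. The sketch below imitates the behaviour in plain Java: Box is a hypothetical mutable holder standing in for a Writable, and reusingIterator is a hypothetical iterator that overwrites one shared instance on each call to next(). It is an illustration of the pattern, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReuseSketch {

    /** Mutable value holder standing in for a Writable such as IntWritable. */
    static final class Box {
        int value;
    }

    /** Iterator that hands out one shared Box, overwriting its contents on each next(). */
    static Iterator<Box> reusingIterator(final int[] data) {
        final Box shared = new Box();
        return new Iterator<Box>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < data.length;
            }

            @Override
            public Box next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[i++];
                return shared;
            }
        };
    }

    /** Buggy pattern: stores references, so every stored element ends up showing the last value. */
    static List<Integer> keepReferences(int[] data) {
        List<Box> kept = new ArrayList<>();
        for (Iterator<Box> it = reusingIterator(data); it.hasNext();) {
            kept.add(it.next());
        }
        List<Integer> seen = new ArrayList<>();
        for (Box b : kept) {
            seen.add(b.value);
        }
        return seen;
    }

    /** Safe pattern: copies the contents out before calling next() again. */
    static List<Integer> copyValues(int[] data) {
        List<Integer> seen = new ArrayList<>();
        for (Iterator<Box> it = reusingIterator(data); it.hasNext();) {
            seen.add(it.next().value);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(new int[] {1, 2, 3})); // [3, 3, 3]
        System.out.println(copyValues(new int[] {1, 2, 3}));     // [1, 2, 3]
    }
}
```

In a real reducer the equivalent of copyValues is to copy the value's contents (or create a new Writable from it) before advancing the iterator.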
What is a Writable?
A. Writable is an interface that all keys and values in MapReduce must implement. Classes
implementing this interface must implement methods for serializing and deserializing.
B. Writable is an abstract class that all keys and values in MapReduce must extend. Classes
extending this abstract base class must implement methods for serializing and deserializing.
C. Writable is an interface that all keys, but not values, in MapReduce must implement. Classes
implementing this interface must implement methods for serializing and deserializing.
D. Writable is an abstract class that all keys, but not values, in MapReduce must extend. Classes
extending this abstract base class must implement methods for serializing and deserializing.
Answer: A
Explanation: public interface Writable
A serializable object which implements a simple, efficient, serialization protocol, based on
DataInput and DataOutput.
Any key or value type in the Hadoop Map-Reduce framework implements this interface.
Implementations typically implement a static read(DataInput) method which constructs a new
instance, calls readFields(DataInput) and returns the instance.
Reference: Interface Writable
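The serialization contract can be sketched with the JDK's own DataInput and DataOutput. The class below does not implement Hadoop's Writable interface (so it compiles without Hadoop on the classpath); it only mirrors the write and readFields method shapes, and adds compareTo as a key type would need for sorting. YearKey and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Key type mirroring the write/readFields shape of a Writable, using JDK types only. */
final class YearKey implements Comparable<YearKey> {
    private int year;

    YearKey() {
        // No-argument constructor so an empty instance can be filled by readFields.
    }

    YearKey(int year) {
        this.year = year;
    }

    int year() {
        return year;
    }

    /** Serializes this key's fields to the output. */
    void write(DataOutput out) throws IOException {
        out.writeInt(year);
    }

    /** Overwrites this key's fields from the input, allowing the instance to be reused. */
    void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);
    }

    /** Serializes a key to bytes and reads it back into a fresh instance. */
    static YearKey roundTrip(YearKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearKey copy = new YearKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as YearKey.roundTrip(new YearKey(2012)).year() returns the original year, showing that the two methods are inverses of each other.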
In a MapReduce job, the reducer receives all values associated with the same key. Which
statement is most accurate about the ordering of these values?
A. The values are in sorted order.
B. The values are arbitrarily ordered, and the ordering may vary from run to run of the
MapReduce job.
C. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will
have the same ordering.
D. Since the values come from mapper outputs, the reducers will receive contiguous
sections of
sorted values.
Answer: D
* The Mapper outputs are sorted and then partitioned per Reducer.
* The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len,
value) format.
* Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the
relevant partition of the output of all the mappers, via HTTP.
* A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps,
which are then input to the reduce tasks.
* The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
Reference:MapReduce Tutorial
All keys used for intermediate output from mappers must do which of the following:
A. Override isSplitable
B. Implement WritableComparable
C. Be a subclass of FileInputFormat
D. Use a comparator for speedy sorting
E. Be compressed using a splittable compression algorithm.
Answer: B
Explanation: The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
Reference:MapReduce Tutorial
You have the following key value pairs as output from your Map task:
(The, 1)
(Fox, 1)
(Runs, 1)
(Faster, 1)
(Than, 1)
(The, 1)
(Dog, 1)
How many keys will be passed to the reducer?
A. One
B. Two
C. Three
D. Four
E. Five
F. Six
Answer: F
Explanation: Only one key will be passed for the two (The, 1) key-value pairs, so the seven
pairs yield six distinct keys.
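The count can be checked by grouping the pairs by key, which is roughly what the shuffle does before the reducer is called. The sketch below is plain Java; GroupSketch and groupByKey are hypothetical names, and a sorted in-memory map stands in for the framework's sort and group step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for the shuffle: groups map output values by key. */
final class GroupSketch {

    static Map<String, List<Integer>> groupByKey(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] keys = {"The", "Fox", "Runs", "Faster", "Than", "The", "Dog"};
        int[] ones = {1, 1, 1, 1, 1, 1, 1};
        Map<String, List<Integer>> grouped = groupByKey(keys, ones);
        System.out.println(grouped.size());     // 6 distinct keys
        System.out.println(grouped.get("The")); // [1, 1]
    }
}
```

The key "The" appears once with two values, so six keys reach the reducer.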
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses
TextInputFormat and the IdentityReducer: the mapper applies a regular expression over
input values and emits key-value pairs with the key consisting of the matching text, and
the value containing the filename and byte offset. Determine the difference between
setting the number of reducers to one and setting the number of reducers to zero.
A. There is no difference in output between the two settings.
B. With zero reducers, no reducer runs and the job throws an exception. With one reducer,
instances of matching patterns are stored in a single file on HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one
file on HDFS. With one reducer, instances of matching patterns are stored in multiple
files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on
HDFS. With one reducer, all instances of matching patterns are gathered together in one
file on HDFS.
Answer: D
Explanation: *It is legal to set the number of reduce-tasks to zero if no reduction is
desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set by setOutputPath(Path). The framework does not sort the map-outputs
before writing them out to the FileSystem.
*Often, you may want to process input data using a map function only. To do this, simply
set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer
tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter)
method is
called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via
OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status
messages and
update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
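The effect on the number of output files can be sketched as follows. This is a simplified
model, not Hadoop code: OutputFileCountSketch is a hypothetical name, the file names follow
the part-m-NNNNN / part-r-NNNNN pattern of the newer mapreduce API (treat the exact names as
an assumption), and the test values assume one map task per input file.

```java
import java.util.ArrayList;
import java.util.List;

final class OutputFileCountSketch {
    // Returns the output file names a job would leave behind in this simplified model.
    // With zero reducers each map task writes its own file; otherwise each reducer writes one.
    static List<String> outputFiles(int mapTasks, int reduceTasks) {
        List<String> files = new ArrayList<>();
        if (reduceTasks == 0) {
            for (int m = 0; m < mapTasks; m++) {
                files.add(String.format("part-m-%05d", m));
            }
        } else {
            for (int r = 0; r < reduceTasks; r++) {
                files.add(String.format("part-r-%05d", r));
            }
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(outputFiles(100, 0).size()); // one file per map task
        System.out.println(outputFiles(100, 1));        // a single reducer file
    }
}
```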
For each intermediate key, each reducer task can emit:
A. One final key value pair per key; no restrictions on the type.
B. One final key-value pair per value associated with the key; no restrictions on the
type.
C. As many final key-value pairs as desired, as long as all the keys have the same type
and all the values have the same type.
D. As many final key-value pairs as desired, but they must have the same type as the
intermediate key-value pairs.
E. As many final key-value pairs as desired. There are no restrictions on the types of
those key-value pairs (i.e., they can be heterogeneous).
Answer: C
Explanation: Reducer reduces a set of intermediate values which share a key to a smaller
set of values.
Reference: Hadoop Map-Reduce Tutorial
For each input key-value pair, mappers can emit:
A. One intermediate key value pair, of a different type.
B. One intermediate key value pair, but of the same type.
C. As many intermediate key-value pairs as desired, but they cannot be of the same
type as the
input key-value pair.
D. As many intermediate key value pairs as desired, as long as all the keys have the
same type
and all the values have the same type.
E. As many intermediate key-value pairs as desired. There are no restrictions on the
types of
those key-value pairs (i.e., they can be heterogeneous).
Answer: D
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input
records. A given input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial
During the standard sort and shuffle phase of MapReduce, keys and values are passed to
reducers. Which of the following is true?
A. Keys are presented to a reducer in sorted order; values for a given key are not
sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in
ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not
sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in
ascending order.
Answer: A
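As a rough illustration of what the standard shuffle guarantees, the plain-Java sketch below
groups pairs by key. ShuffleOrderSketch is a hypothetical name; the TreeMap stands in for the
framework's key sort, and each value list simply keeps arrival order, which models the fact
that values for a key are not sorted unless a secondary sort is added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleOrderSketch {
    // Groups (key, value) pairs by key. Keys come out sorted; the values for
    // each key stay in whatever order they arrived, so they are not sorted.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(group(new String[] {"b", "a", "b"}, new int[] {9, 1, 3}));
    }
}
```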
What is the behavior of the default partitioner?
A. The default partitioner assigns key value pairs to reducers based on an internal
random number generator.
B. The default partitioner implements a round robin strategy, shuffling the key value
pairs to each
reducer in turn. This ensures an even partition of the key space.
C. The default partitioner computes the hash of the key. Hash values between specific
ranges are
associated with different buckets, and each bucket is assigned to a specific reducer.
D. The default partitioner computes the hash of the key and divides that value modulo
the number
of reducers. The result determines the reducer assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that
value with the
number of reducers. The result determines the reducer assigned to process the key
value pair.
Answer: D
Explanation: The default partitioner computes a hash value for the key and assigns the
partition based on this result.
The default Partitioner implementation is called HashPartitioner. It uses the hashCode()
method of the key objects modulo the number of partitions total to determine which
partition to send a given (key, value) pair to.
In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to
determine which partition (and thus which reducer) the record belongs in. The number of
partitions is equal to the number of reduce tasks for the job.
Reference: Getting Started With (Customized) Partitioning
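The hash-then-modulo rule can be written as a one-line function. The sketch below is plain
Java with a hypothetical class name (HashPartitionSketch); the expression mirrors the logic
of HashPartitioner, where the bit mask clears the sign bit so that a negative hashCode()
cannot produce a negative partition number.

```java
final class HashPartitionSketch {
    // Partition index for a key: non-negative hash of the key, modulo the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"The", "Fox", "Dog"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, 4));
        }
    }
}
```

Because the result depends only on the key, every pair with the same key lands on the same
reducer, regardless of which mapper emitted it.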
Which statement best describes the data path of intermediate key-value pairs (i.e.,
output of the mappers)?
A. Intermediate key-value pairs are written to HDFS. Reducers read the intermediate
data from HDFS.
data to the
local disks of the machines running the reduce tasks.
C. Intermediate key-value pairs are written to the local disks of the machines running
the map
tasks, and then copied to the machine running the reduce tasks.
D. Intermediate key-value pairs are written to the local disks of the machines running
the map
tasks, and are then copied to HDFS. Reducers read the intermediate data from HDFS.
Answer: C
Explanation: The mapper output (intermediate data) is stored on the local file system (not
HDFS) of each individual mapper node. This is typically a temporary directory location
which can be set up in the configuration by the Hadoop administrator. The intermediate
data is cleaned up after the Hadoop job completes.
*Reducers start copying intermediate key-value pairs from the mappers as soon as they are
available. The progress calculation also takes into account the data transfer done by the
reduce process, so reduce progress starts showing up as soon as any intermediate
key-value pair for a mapper is available to be transferred to the reducer. Although the
reducer progress is updated, the programmer-defined reduce method is called only after
all the mappers have finished.
*The input to the Reducer is the grouped output of a Mapper. In this phase the framework,
for each Reducer, fetches the relevant partition of the output of all the Mappers, via
HTTP.
*Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input
records. A given input pair may map to zero or many output pairs.
*All intermediate values associated with a given output key are subsequently grouped by
the framework, and passed to the Reducer(s) to determine the final output.
Reference: Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output
(intermediate key-value data) stored?
You've written a MapReduce job that will process 500 million input records and generate
million key-value pairs. The data is not uniformly distributed. Your MapReduce job will
create a significant amount of intermediate data that it needs to transfer between mappers
and reducers, which is a potential bottleneck. A custom implementation of which of the
following interfaces is most likely to reduce the amount of intermediate data transferred
across the network?
A. Writable
B. WritableComparable
C. InputFormat
D. OutputFormat
E. Combiner
F. Partitioner
Answer: E
Explanation: Users can optionally specify a combiner, via
JobConf.setCombinerClass(Class), to
perform local aggregation of the intermediate outputs, which helps to cut down the
amount of data
transferred from the Mapper to the Reducer.
Reference:Map/Reduce Tutorial, 9th
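The local aggregation a combiner performs can be sketched in plain Java for the word count
case. CombinerSketch is a hypothetical name and this is not the Hadoop Combiner API; it only
shows how summing counts on the map side shrinks the number of pairs that have to cross the
network.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CombinerSketch {
    // Sums the per-word counts produced by one mapper, so that a single
    // (word, count) pair per distinct word leaves that mapper.
    static Map<String, Integer> combine(List<String> wordsSeenByOneMapper) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : wordsSeenByOneMapper) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "fox", "the", "the");
        // Four (word, 1) pairs before combining, two pairs after.
        System.out.println(words.size() + " pairs -> " + combine(words).size() + " pairs");
    }
}
```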
If you run the word count MapReduce program with m mappers and r reducers, how
many output
files will you get at the end of the job? And how many key-value pairs will there be in
each file?
Assume k is the number of unique words in the input files.
A. There will be r files, each with exactly k/r key-value pairs.
B. There will be r files, each with approximately k/m key-value pairs.
C. There will be r files, each with approximately k/r key-value pairs.
D. There will be m files, each with exactly k/m key value pairs.
E. There will be m files, each with approximately k/m key-value pairs.
Answer: C
*A MapReduce job with m mappers and r reducers involves up to m*r distinct copy
operations, since each mapper may have intermediate output going to every reducer.
*In the canonical example of word counting, a key-value pair is emitted for every word
found. For
example, if we had 1,000 words, then 1,000 key-value pairs will be emitted from the
mappers to
the reducer(s).
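A small plain-Java sketch shows why each of the r output files holds roughly, not exactly,
k/r keys. It assumes hash partitioning of the unique words; OutputSizeSketch and the sample
words are hypothetical.

```java
import java.util.Arrays;

final class OutputSizeSketch {
    // Counts how many of the k unique words land in each of the r reducers
    // (and therefore in each of the r output files) under hash partitioning.
    static int[] keysPerReducer(String[] uniqueWords, int r) {
        int[] counts = new int[r];
        for (String word : uniqueWords) {
            counts[(word.hashCode() & Integer.MAX_VALUE) % r]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"the", "fox", "runs", "faster", "than", "dog"};
        // r entries that sum to k; the split is usually uneven.
        System.out.println(Arrays.toString(keysPerReducer(words, 3)));
    }
}
```

The counts always add up to k, but nothing forces the hash function to spread the keys
evenly, so "exactly k/r" cannot be promised.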
You have a large dataset of key-value pairs, where the keys are strings, and the values
are integers. For each unique key, you want to identify the largest integer. In writing a
MapReduce program to accomplish this, can you take advantage of a combiner?
A. No, a combiner would not be useful in this case.
B. Yes.
C. Yes, but the number of unique keys must be known in advance.
D. Yes, as long as all the keys fit into memory on each node.
E. Yes, as long as all the integer values that share the same key fit into memory on
each node.
Answer: B
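A combiner is safe here because taking a maximum is associative and commutative: the maximum
of per-mapper maxima equals the maximum over all values. The plain-Java sketch below checks
that property on sample data; MaxCombinerSketch and the numbers are hypothetical.

```java
final class MaxCombinerSketch {
    // Largest of the given values; usable both as the combiner and as the reducer logic.
    static int maxOf(int... values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        int fromMapper1 = maxOf(3, 9, 4);   // combiner output on the first mapper
        int fromMapper2 = maxOf(7, 2);      // combiner output on the second mapper
        // The reducer sees only the two partial maxima, yet gets the same answer.
        System.out.println(maxOf(fromMapper1, fromMapper2) == maxOf(3, 9, 4, 7, 2));
    }
}
```

Only one running value per key is kept, so no memory constraint on keys or values applies.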
What happens in a MapReduce job when you set the number of reducers to zero?
A. No reducer executes, but the mappers generate no output.
B. No reducer executes, and the output of each mapper is written to a separate file in
HDFS.
C. No reducer executes, but the outputs of all the mappers are gathered together and
written to a
single file in HDFS.
D. Setting the number of reducers to zero is invalid, and an exception is thrown.
Answer: B
Explanation: *It is legal to set the number of reduce-tasks to zero if no reduction is
desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set by setOutputPath(Path). The framework does not sort the map-outputs
before writing them out to the FileSystem.
*Often, you may want to process input data using a map function only. To do this, simply
set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer
tasks.
as soon as they are available. The programmer defined reduce method is called only
after all the
mappers have finished.
Reference:24 Interview Questions & Answers for Hadoop MapReduce
developers,When is the
reducers are started in a MapReduce job?
no. 17)
What happens in a MapReduce job when you set the number of reducers to one?
A. A single reducer gathers and processes all the output from all the mappers. The
output is
written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The
output is
written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the
number of reducers as specified by the programmer is used as a reference value only, the
MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.
Answer: B
Explanation: *It is legal to set the number of reduce-tasks to zero if no reduction is
desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set by setOutputPath(Path). The framework does not sort the map-outputs
before writing them out to the FileSystem.
*Often, you may want to process input data using a map function only. To do this, simply
set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer
tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
In the standard word count MapReduce algorithm, why might using a combiner reduce
the overall
Job running time?
A. Because combiners perform local aggregation of word counts, thereby allowing the
mappers to
process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the
number of
HDFS are those that deal with large data sets. These applications write their data only
once but
they read it one or more times and require these reads to be satisfied at streaming
speeds. HDFS
supports write-once-read-many semantics on files.
*Hadoop Distributed File System: A distributed file system that provides high-throughput
access to
application data.
*DFS is designed to support very large files.
Reference:24 Interview Questions & Answers for Hadoop MapReduce developers
You need to create a GUI application to help your company's sales people add and edit
customer information. Would HDFS be appropriate for this customer information file?
A. Yes, because HDFS is optimized for random access writes.
B. Yes, because HDFS is optimized for fast retrieval of relatively small amounts of data.
C. No, because HDFS can only be accessed by MapReduce applications.
D. No, because HDFS is optimized for write-once, streaming access for relatively large
files.
Answer: D
Explanation: HDFS is designed to support very large files. Applications that are
compatible with
HDFS are those that deal with large data sets. These applications write their data only
once but
they read it one or more times and require these reads to be satisfied at streaming
speeds. HDFS
supports write-once-read-many semantics on files.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is HDFS?
How it is different from traditional file systems?
Which of the following describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns
the block
location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the
requested data
responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries
the DataNodes for block locations. The DataNodes respond to the NameNode, and the
NameNode redirects the client to the DataNode that holds the requested data block(s). The
client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts
the DataNode that holds the requested data block. Data is transferred from the DataNode
to the NameNode, and then from the NameNode to the client.
Answer: A
Explanation: The Client communication to HDFS happens using the Hadoop HDFS API. Client
applications talk to the NameNode whenever they wish to locate a file, or when they want
to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by
returning a list of relevant DataNode servers where the data lives. Client applications
can talk directly to a DataNode, once the NameNode has provided the location of the data.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client
communicates with HDFS?
You need to create a job that does frequency analysis on input data. You will do this by
writing a Mapper that uses TextInputFormat and splits each value (a line of text from an
input file) into individual characters. For each one of these characters, you will emit
the character as a key and an IntWritable as the value. Since this will produce
proportionally more intermediate data than input data, which resources could you expect
to be likely bottlenecks?
A. Processor and RAM
B. Processor and disk I/O
C. Disk I/O and network I/O
D. Processor and network I/O
Answer: C
Which of the following statements best describes how a large (100 GB) file is stored in
HDFS?
A. The file is divided into variable size blocks, which are stored on multiple datanodes.
Each block is replicated three times by default.
B. The file is replicated three times by default. Each copy of the file is stored on a
separate datanode.
C. The master copy of the file is stored on a single datanode. The replica copies are
divided into fixed-size blocks, which are stored on multiple datanodes.
D. The file is divided into fixed-size blocks, which are stored on multiple datanodes.
Each block is replicated three times by default. Multiple blocks from the same file might
reside on the same datanode.
E. The file is divided into fixed-size blocks, which are stored on multiple datanodes.
Each block is replicated three times by default. HDFS guarantees that different blocks
from the same file are never on the same datanode.
Answer: D
Explanation: HDFS is designed to reliably store very large files across machines in a
large cluster. It stores each file as a sequence of blocks; all blocks in a file except
the last block are the same size. The blocks of a file are replicated for fault
tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in HDFS are write-once
and have strictly one writer at any time. The NameNode makes all decisions regarding
replication of blocks. HDFS uses a rack-aware replica placement policy. In the default
configuration there are a total of 3 copies of a data block on HDFS; 2 copies are stored
on datanodes on the same rack and the 3rd copy on a different rack.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the HDFS
Blocks are replicated?
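To make the block arithmetic concrete, the sketch below counts the fixed-size blocks a file
occupies. It is plain Java with a hypothetical class name, and it assumes the 64 MB block
size mentioned elsewhere in this material together with the default replication factor of 3.

```java
final class BlockCountSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Number of blocks needed for a file: every block is full except possibly the last.
    static long blocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blocks = blocks(100 * GB, 64 * MB);
        System.out.println(blocks);       // blocks in the file
        System.out.println(blocks * 3);   // block replicas stored cluster-wide
    }
}
```

With far more blocks than datanodes in a typical cluster, several blocks of the same file
necessarily share a datanode.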
Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your
disk capacity for HDFS, reserving none for MapReduce. You implement default replication
settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?
A. about 3 TB
B. about 5 TB
C. about 10 TB
D. about 11 TB
Answer: A
Explanation: In the default configuration there are a total of 3 copies of a data block on
HDFS; 2 copies are stored on datanodes on the same rack and the 3rd copy on a different
rack. With 10 TB of raw disk and three copies of every block, the usable capacity is
about 10 / 3, or roughly 3.3 TB.
Note: HDFS is designed to reliably store very large files across machines in a large
cluster. It stores each file as a sequence of blocks; all blocks in a file except the
last block are the same size. The blocks of a file are replicated for fault tolerance. The
block size and replication factor are configurable per file. An application can specify
the number of replicas of a file. The replication factor can be specified at file creation
time and can be changed later. Files in HDFS are write-once and have strictly one writer
at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses
a rack-aware replica placement policy.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the HDFS
Blocks are replicated?
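The capacity estimate is simple division, sketched here in plain Java. CapacitySketch is a
hypothetical name, and the calculation ignores non-HDFS overhead such as the operating
system and temporary space.

```java
final class CapacitySketch {
    // Usable HDFS capacity in TB: raw disk across all DataNodes divided by the replication factor.
    static double usableTb(int dataNodes, double tbPerNode, int replication) {
        return dataNodes * tbPerNode / replication;
    }

    public static void main(String[] args) {
        // 10 DataNodes with 1 TB each and default replication of 3.
        System.out.printf("%.1f TB%n", usableTb(10, 1.0, 3));
    }
}
```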
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size
of 64 MB.
Just after this command has finished writing 200 MB of this file, what would another
user see
when trying to access this file?
A. They would see no content until the whole file is written and closed.
B. They would see the content of the file through the last completed block.
C. They would see the current state of the file, up to the last bit written by the command.
D. They would see Hadoop throw a concurrentFileAccessException when they try to
access this file.
Answer: A
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also
reads input from stdin and writes to destination filesystem.