Hadoop – Schedulers and Types of Schedulers

Last Updated : 15 Feb, 2022

In Hadoop, a cluster can receive multiple jobs from different clients. The MapReduce framework runs these jobs as parallel tasks across the cluster so that large datasets can be processed quickly. Prior to Hadoop 2, this MapReduce framework was itself responsible for scheduling and monitoring the tasks submitted by different clients.

Now in Hadoop 2 we have YARN (Yet Another Resource Negotiator). YARN uses separate daemons for job scheduling, monitoring, and resource management: the Application Master, the Node Manager, and the Resource Manager respectively.

Here, the Resource Manager is the master daemon responsible for tracking and granting the resources required by any application in the cluster, while the Node Manager is the per-node worker (slave) daemon that monitors the resources used by the applications running on its node and reports back to the Resource Manager.

The Scheduler and the Applications Manager are the two major components of the Resource Manager. The Scheduler in YARN is dedicated purely to scheduling jobs; it does not track or monitor the status of applications. It allocates cluster resources to jobs on the basis of the resources they require.
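The scheduler is a pluggable part of the Resource Manager. As a minimal sketch (the property name below is the standard yarn-site.xml setting; which class is the default depends on your Hadoop version and distribution), the scheduler implementation can be selected like this:

    <!-- yarn-site.xml: choose the scheduler implementation used by the Resource Manager -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <!-- alternatives include
           org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler and
           org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>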

[Figure: Hadoop – Schedulers and Types of Schedulers]

There are mainly 3 types of Schedulers in Hadoop:  

  1. FIFO (First In First Out) Scheduler.
  2. Capacity Scheduler.
  3. Fair Scheduler.

These schedulers are essentially the algorithms used to schedule tasks in a Hadoop cluster when requests are received from different clients.

A job queue is simply the collection of tasks received from the various clients. The tasks wait in the queue and must be scheduled according to our requirements.

[Figure: A job queue in a Hadoop scheduler]

1. FIFO Scheduler

As the name suggests, FIFO (First In First Out) serves the tasks or applications in the order they arrive: the job submitted first is served first. It is the simplest scheduler and was the original default in Hadoop. Tasks are placed in a queue and executed in their submission order. Once a job is scheduled, no intervention is allowed, so a high-priority job may have to wait a long time because the priority of a task is not considered in this method.

Advantage: 

  • No need for configuration
  • First Come First Serve
  • Simple to execute

Disadvantage:  

  • Task priority is not considered, so high-priority jobs have to wait
  • Not suitable for a shared cluster
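Because no queues or capacities have to be defined, the only configuration FIFO needs is to be selected at all. A minimal sketch, using the same yarn-site.xml property shown earlier:

    <!-- yarn-site.xml: explicitly select the FIFO scheduler -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
    </property>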

[Figure: FIFO Scheduler]

2. Capacity Scheduler

In the Capacity Scheduler we have multiple job queues for scheduling tasks. It allows multiple tenants to share a large Hadoop cluster. For each job queue we provide some slots or cluster resources (a capacity) for performing job operations, and each queue has its own slots in which to run its tasks. If only one queue has tasks to perform, its tasks can also use the free slots of the other queues; when a new task arrives in one of those other queues, the borrowed slots are handed back so that queue can run its own jobs again.

The Capacity Scheduler also provides visibility into which tenant is using more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary share of the cluster. The Capacity Scheduler mainly contains three types of queues: root, parent, and leaf, which represent the cluster, an organization or subgroup, and the point of application submission respectively.
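As a rough sketch of how this queue hierarchy might be declared (the queue names dev and prod and the percentages are made-up examples; the property names follow the standard capacity-scheduler.xml format), two leaf queues under root could look like this:

    <!-- capacity-scheduler.xml: two leaf queues under root sharing the cluster -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>dev,prod</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.capacity</name>
      <value>40</value>   <!-- dev is guaranteed 40% of the cluster -->
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.prod.capacity</name>
      <value>60</value>   <!-- prod is guaranteed 60% of the cluster -->
    </property>
    <property>
      <!-- lets dev borrow idle capacity, up to 75% of the cluster -->
      <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
      <value>75</value>
    </property>

Because dev's maximum-capacity is above its guaranteed share, it can use prod's idle slots, and those borrowed resources are returned as its containers finish (or earlier if preemption is enabled).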

Advantage: 

  • Best for working with multiple clients or priority jobs in a Hadoop cluster
  • Maximizes throughput in the Hadoop cluster

Disadvantage:  

  • More complex
  • Not easy to configure for everyone

[Figure: Capacity Scheduler]

3. Fair Scheduler

The Fair Scheduler is very similar to the Capacity Scheduler, and the priority of the job is taken into consideration. With the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and those resources are assigned dynamically, so there is no need to reserve capacity in advance. Resources are distributed in such a way that, over time, all applications in the cluster get a fair (roughly equal) share. By default the Fair Scheduler makes its scheduling decisions on the basis of memory, but it can be configured to take CPU into account as well.

As mentioned above, it is similar to the Capacity Scheduler, but the key difference is that in the Fair Scheduler, whenever a high-priority job arrives in the same queue, it is processed in parallel by taking over some portion of the slots already dedicated to running jobs.
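The Fair Scheduler reads its queue definitions from a separate allocation file (referenced by yarn.scheduler.fair.allocation.file in yarn-site.xml). The sketch below uses made-up queue names and values; weight, schedulingPolicy, and maxRunningApps are standard elements of that file:

    <!-- fair-scheduler.xml: two queues sharing the cluster by weight -->
    <allocations>
      <queue name="analytics">
        <weight>2.0</weight>                     <!-- gets roughly twice the share of adhoc -->
        <schedulingPolicy>drf</schedulingPolicy> <!-- consider CPU as well as memory -->
      </queue>
      <queue name="adhoc">
        <weight>1.0</weight>
        <maxRunningApps>10</maxRunningApps>      <!-- cap on concurrently running applications -->
      </queue>
    </allocations>

The take-back behaviour described above (a starved queue reclaiming containers from an over-served one) is preemption, which is off by default and can be enabled with yarn.scheduler.fair.preemption in yarn-site.xml.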

Advantages: 

  • Resources assigned to each application depend upon its priority.
  • It can limit the number of concurrently running tasks in a particular pool or queue.

Disadvantage:

  • It requires configuration.

[Figure: Fair Scheduler]

 


