Project Linux Scheduler 2.6.32
Version 2.6.32
Author: Thang Minh Le
1. INTRODUCTION
2. OVERVIEW
3. SCHEDULING TASK
3.1. DATA STRUCTURE
STRUCT TASK_STRUCT
STRUCT RQ
STRUCT SCHED_CLASS: fair_sched_class & rt_sched_class
3.2. SCHEDULER
schedule()
context_switch()
sched_fork()
3.3. LOAD BALANCER
3.4. CFS SCHEDULE CLASS
4. CONCLUSION
5. APPENDIX
1. INTRODUCTION
The Linux scheduler has gone through several major improvements since kernel version 2.4. There
were many complaints about the interactivity of the 2.4 scheduler. In that version, the
scheduler was implemented with a single running queue shared by all available processors. At
every scheduling point, this queue was locked and every task on it had its timeslice updated.
This design performed poorly, particularly as the number of tasks and processors grew. The
scheduler algorithm and supporting code went through a large rewrite early in the 2.5 kernel
development series. The new scheduler was designed to run in O(1) time regardless of the number
of runnable tasks in the system. To achieve this, each processor has its own running queue,
which greatly reduces lock contention. A priority array structure was introduced, with an active
array and an expired array that keep track of the runnable tasks. The O(1) running time comes
primarily from this data structure. The scheduler puts all expired processes into the expired
array. When no process is left in the active array, it swaps the two arrays, so that the active
array becomes the expired array and vice versa. Some refinements were added to optimize further,
for example putting an expired task back into the active array instead of the expired array in
some cases. The O(1) scheduler uses a heuristic calculation to update the dynamic priority of
tasks based on their interactivity (I/O bound versus CPU bound).
The industry was happy with this scheduler until Con Kolivas introduced his new scheduler
named Rotating Staircase Deadline (RSDL) and, later, Staircase Deadline (SD). His schedulers
showed that fair scheduling among processes can be achieved without any complex computation.
His scheduler was designed to run in O(n), but its performance exceeded that of the O(1)
scheduler then in use.
The results achieved by the SD scheduler surprised kernel developers and designers. Its fair
scheduling approach encouraged Ingo Molnar to write a new Linux scheduler named the Completely
Fair Scheduler (CFS). CFS was a big improvement over the existing scheduler, not only in
performance and interactivity but also in simplifying the scheduling logic and making the
scheduler code more modular. CFS was merged into mainline in version 2.6.23. Since then, there
have been minor improvements to CFS in areas such as optimization, load balancing and group
scheduling. This study focuses on the CFS scheduler in Linux kernel 2.6.32.
3. SCHEDULING TASK
3.1. DATA STRUCTURE
The most important data structures used by the Linux scheduler are struct task_struct and
struct rq. Many functions in the kernel scheduler work with these two data structures.
STRUCT TASK_STRUCT
Each process in the system is represented by a task_struct. Since task_struct must capture all
information about a process, it is relatively large, around 1.7KB in size. Figure 2a shows only
some important fields that are frequently used by the scheduler. When a process or thread is
created, the kernel allocates a new task_struct for it. The kernel then stores this task_struct
in a circular linked list called the task list. The convenient macro current obtains the
currently running process. The macros next_task and prev_task give the next and the previous
task of a process in this list, respectively. The Linux scheduler uses many fields of
task_struct. The most important fields are:
state: this field describes the current state of the process.
    TASK_RUNNING: The process is runnable; it is either currently
    running or on a running queue waiting to run.
    TASK_INTERRUPTIBLE: The process is sleeping (that is, it is
    blocked), waiting for some condition to exist.
    TASK_UNINTERRUPTIBLE: The process is sleeping. It does
    not wake up and become runnable if it receives a signal.
    EXIT_ZOMBIE (TASK_ZOMBIE in older kernels): The task has terminated,
    but its parent has not yet issued a wait() system call.
    TASK_STOPPED: Process execution has stopped; the task is not
    running nor is it eligible to run.
static_prio: the static priority of a process. The scheduler itself never modifies this value;
it changes only when the user adjusts it (for example through nice()). Static priority
corresponds to the nice value, which ranges from -20 to 19 and is stored internally as 100 to
139. Before 2.6.23, the Linux scheduler depended heavily on this field for the heuristic
calculations it performed while scheduling. Under the CFS scheduling policy, the field is no
longer used for such heuristics; it only determines the load weight of the task.
prio: this field holds the dynamic priority of a process. Prior to 2.6.23, this field was
calculated as a function of the static priority (static_prio) and the task's interactivity.
This calculation was done by effective_prio() in sched.c. Since the CFS scheduling policy was
introduced, the Linux scheduler no longer uses the value stored in this field in scheduling
tasks.
normal_prio: holds the expected priority of a process. In most cases, for non real-time
processes, the values of normal_prio and static_prio are the same. For real-time processes,
the value of normal_prio might be boosted to avoid deadlock when accessing critical sections.
rt_priority: this field is used for real-time processes. It holds the real-time priority value.
Competition among real-time tasks is strictly based upon rt_priority.
sched_class: a pointer to the scheduling class of the task.
se (struct sched_entity): the CFS scheduling entity, embedded in task_struct.
rt (struct sched_rt_entity): the RT scheduling entity, embedded in task_struct.
policy: holds the scheduling policy of the task.
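The relationship between nice values, static priorities and the real-time range can be sketched in a few lines of C. The constants follow the 2.6.32 headers (MAX_RT_PRIO is 100, MAX_PRIO is 140); the helper function names are illustrative stand-ins for the kernel macros NICE_TO_PRIO(), PRIO_TO_NICE() and rt_prio(), not the kernel's own code.

```c
#include <assert.h>

/* Constants mirror the 2.6.32 headers; the helper names are illustrative. */
#define MAX_RT_PRIO 100                 /* priorities 0..99 are real-time   */
#define MAX_PRIO    (MAX_RT_PRIO + 40)  /* priorities 100..139 are "normal" */

/* Map a nice value (-20..19) to a static priority (100..139). */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping: static priority back to a nice value. */
static int prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio < MAX_PRIO);
    return prio - MAX_RT_PRIO - 20;
}

/* A priority below MAX_RT_PRIO marks a real-time task. */
static int is_rt_prio(int prio)
{
    return prio < MAX_RT_PRIO;
}
```

A nice value of 0 therefore maps to static priority 120, which is why 120 shows up as the default priority of a normal task.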
STRUCT RQ
At boot time, the kernel creates a running queue of type struct rq for each available processor.
Each running queue holds these data:
nr_running: number of runnable tasks.
nr_switches: number of context switches.
cfs: CFS running queue structure.
rt: RT running queue structure.
next_balance: timestamp of the next load balance check.
curr: pointer to the task currently running on this running queue.
idle: pointer to the idle task of this running queue.
lock: spin lock of the running queue. task_rq_lock() and task_rq_unlock() can be used to
lock the running queue on which a specific task runs. When locking multiple queues, it is
important to always obtain the locks in the same order, to avoid deadlock.
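The lock-ordering rule can be illustrated with a toy model. The sketch below is not kernel code: struct toy_rq, toy_lock() and toy_double_lock() are hypothetical names, and the "lock" is a plain flag rather than a spin lock. It only shows the idea of always taking the lock of the lower-addressed queue first, so that two CPUs locking the same pair of queues can never wait on each other in a cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a run queue: only the lock state matters here. */
struct toy_rq {
    int locked;    /* 1 while the queue is "held"                 */
    int lock_seq;  /* global sequence number of the last locking  */
};

static int toy_lock_counter;

static void toy_lock(struct toy_rq *rq)
{
    assert(!rq->locked);
    rq->locked = 1;
    rq->lock_seq = ++toy_lock_counter;
}

static void toy_unlock(struct toy_rq *rq)
{
    assert(rq->locked);
    rq->locked = 0;
}

/*
 * Lock two queues, lower address first. Both queues are assumed to be
 * elements of the same array so that the pointer comparison is valid.
 */
static void toy_double_lock(struct toy_rq *a, struct toy_rq *b)
{
    if (a == b) {
        toy_lock(a);
    } else if (a < b) {
        toy_lock(a);
        toy_lock(b);
    } else {
        toy_lock(b);
        toy_lock(a);
    }
}

static void toy_double_unlock(struct toy_rq *a, struct toy_rq *b)
{
    toy_unlock(a);
    if (a != b)
        toy_unlock(b);
}
```

Whichever way the caller passes the two queues, the queue at the lower address is locked first.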
STRUCT SCHED_CLASS: fair_sched_class & rt_sched_class
struct sched_class was introduced in kernel 2.6.23. The intention is to move the logic of the
individual scheduling policies out of the main sched.c file, which makes the code more modular.
Some of the important methods defined in sched_class are:
enqueue_task(): called when a task enters a runnable state. It puts the scheduling entity
(task) into the red-black tree and increments the nr_running variable.
dequeue_task(): when a task is no longer runnable, this function is called to take the
corresponding scheduling entity out of the red-black tree. It decrements the nr_running
variable.
yield_task(): this function is basically just a dequeue followed by an enqueue, unless the
compat_yield sysctl is turned on; in that case, it places the scheduling entity at the
rightmost end of the red-black tree.
check_preempt_curr(): this function checks whether a task that entered the runnable state
should preempt the currently running task.
pick_next_task(): this function chooses the most appropriate task eligible to run next.
set_curr_task(): this function is called when a task changes its scheduling class or
changes its task group.
task_tick(): this function is mostly called from the timer tick functions; it might lead to
a process switch. This drives the preemption of running tasks.
task_new(): the core scheduler gives the scheduling module an opportunity to manage
new task startup. The CFS scheduling module uses it for group scheduling, while the
scheduling module for real-time tasks does not use it.
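The way a scheduling class bundles its hooks behind function pointers can be sketched as follows. This is a reduced toy model, not the kernel's definition: the toy_ names are hypothetical, only three hooks are shown, and a small array stands in for the red-black tree.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq;
struct toy_task;

/* Per-policy hooks: a reduced version of the method table described above. */
struct toy_sched_class {
    void (*enqueue_task)(struct toy_rq *rq, struct toy_task *p);
    void (*dequeue_task)(struct toy_rq *rq, struct toy_task *p);
    struct toy_task *(*pick_next_task)(struct toy_rq *rq);
};

struct toy_task {
    unsigned long long vruntime;
    int on_rq;
    const struct toy_sched_class *sched_class;
};

#define TOY_MAX_TASKS 8

struct toy_rq {
    unsigned long nr_running;
    struct toy_task *tasks[TOY_MAX_TASKS]; /* stand-in for the red-black tree */
};

/* Task became runnable: add it and bump nr_running. */
static void toy_enqueue_fair(struct toy_rq *rq, struct toy_task *p)
{
    assert(rq->nr_running < TOY_MAX_TASKS && !p->on_rq);
    rq->tasks[rq->nr_running++] = p;
    p->on_rq = 1;
}

/* Task is no longer runnable: remove it and drop nr_running. */
static void toy_dequeue_fair(struct toy_rq *rq, struct toy_task *p)
{
    for (unsigned long i = 0; i < rq->nr_running; i++) {
        if (rq->tasks[i] == p) {
            rq->tasks[i] = rq->tasks[--rq->nr_running];
            p->on_rq = 0;
            return;
        }
    }
}

/* Choose the runnable task with the smallest vruntime. */
static struct toy_task *toy_pick_next_fair(struct toy_rq *rq)
{
    struct toy_task *best = NULL;

    for (unsigned long i = 0; i < rq->nr_running; i++)
        if (!best || rq->tasks[i]->vruntime < best->vruntime)
            best = rq->tasks[i];
    return best;
}

static const struct toy_sched_class toy_fair_class = {
    .enqueue_task   = toy_enqueue_fair,
    .dequeue_task   = toy_dequeue_fair,
    .pick_next_task = toy_pick_next_fair,
};
```

The core code only ever calls through p->sched_class, so a new policy can be added by supplying another table of this shape without touching the callers.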
3.2. SCHEDULER
schedule()
schedule() is the most important function of the scheduler. It is responsible for picking the
next task and switching from the current task to it. It is executed whenever:
A process voluntarily yields the CPU.
A process waits for a signal to occur or wants to sleep.
A timer tick occurs, through scheduler_tick().
And in other cases.
Step 1: disables preemption. While the scheduling logic executes, process preemption must be
disabled, because a spin lock will be obtained later.
Step 2: retrieves the running queue of the current processor. The code calls
smp_processor_id() to get the current CPU and cpu_rq() to retrieve the rq of this CPU. It then
calls release_kernel_lock() to release the kernel lock held by the current task and obtains the
lock of the current rq.
Step 3: executes pre_schedule(). Recall that pre_schedule() is defined in struct sched_class.
It is the hook that lets a scheduling policy execute any logic before the scheduler calls
pick_next_task(). Currently, only the RT schedule class implements this method. This is because
Linux favors real-time tasks over normal tasks. Implementing pre_schedule() in the RT schedule
class allows it to verify whether there are any runnable real-time tasks at this point. If there
is a runnable real-time task in the system, the RT schedule class makes this task the current
task.
Step 4: executes pick_next_task(). The core scheduler asks the schedule classes in priority
order, RT first, then CFS, then idle, and takes the first task that a class returns. If a
real-time task is runnable, pick_next_task() of the RT schedule class is executed, which follows
the POSIX real-time standard requirements. Otherwise, pick_next_task() of the CFS schedule class
is executed, which follows the Completely Fair Scheduling algorithm to pick the next task.
Step 5: checks whether the current task is the same as the next task. If they are the same, no
context switch is needed; the code simply releases the lock of the running queue and executes
post_schedule(). Otherwise, a context switch is required to switch from the current task to the
next task. When the switch is done, the code calls post_schedule().
Step 6: post_schedule() is a hook which allows the RT schedule class to push waiting real-time
tasks from the current processor to other processors.
Step 7: reacquires the kernel lock for the current task, which might be the original task or
the next task, and enables preemption. The code then checks whether it needs to reschedule
again; if so, it goes back to the corresponding label and redoes the scheduling.
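The selection in Step 4 can be sketched as a walk over a chain of classes linked through their next pointers, as the .next fields in the appendix suggest. In the 2.6.32 sources the core pick_next_task() consults the classes in priority order and takes the first task offered. The model below is a simplification under stated assumptions: the toy_ names are hypothetical, tasks are reduced to integer ids, and -1 means "this class has nothing to run".

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq {
    int rt_ready;    /* id of a runnable real-time task, or -1 */
    int fair_ready;  /* id of a runnable normal task, or -1    */
};

struct toy_class {
    const struct toy_class *next;              /* next lower-priority class */
    int (*pick_next_task)(struct toy_rq *rq);  /* returns task id or -1     */
};

#define TOY_IDLE_TASK 0

static int toy_pick_rt(struct toy_rq *rq)   { return rq->rt_ready; }
static int toy_pick_fair(struct toy_rq *rq) { return rq->fair_ready; }
static int toy_pick_idle(struct toy_rq *rq) { (void)rq; return TOY_IDLE_TASK; }

/* Chain: RT -> fair -> idle, highest priority first. */
static const struct toy_class toy_idle_class = { NULL,            toy_pick_idle };
static const struct toy_class toy_fair_class = { &toy_idle_class, toy_pick_fair };
static const struct toy_class toy_rt_class   = { &toy_fair_class, toy_pick_rt };

/* Ask each class in priority order; the idle class always has a task. */
static int toy_pick_next_task(struct toy_rq *rq)
{
    const struct toy_class *cls;

    assert(rq != NULL);
    for (cls = &toy_rt_class; cls; cls = cls->next) {
        int id = cls->pick_next_task(rq);
        if (id >= 0)
            return id;
    }
    return TOY_IDLE_TASK; /* not reached: the idle class never returns -1 */
}
```

With this arrangement a runnable real-time task always wins over any normal task, and the idle task runs only when both other classes are empty.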
context_switch()
context_switch() is called from schedule() to switch from the current task to the next task.
It does the machine-specific work of switching to the new address space (memory mapping) and to
the register state of the new thread.
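The address-space half of a context switch can be sketched as below, loosely following the mm handling visible in the context_switch() listing in the appendix. It is a simplified model: the toy_ names and the switch counter are hypothetical, register switching is left out, and setting next->active_mm for a user task is an assumption made to keep the toy state consistent.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm { int id; };

struct toy_task {
    struct toy_mm *mm;         /* NULL for a kernel thread            */
    struct toy_mm *active_mm;  /* address space actually in use       */
};

/* Number of real address-space switches performed (illustrative counter). */
static int toy_mm_switches;

static void toy_switch_address_space(struct toy_task *prev, struct toy_task *next)
{
    struct toy_mm *mm, *oldmm;

    assert(prev != NULL && next != NULL);
    mm = next->mm;
    oldmm = prev->active_mm;

    if (!mm) {
        /* Kernel thread: borrow the previous address space, no switch. */
        next->active_mm = oldmm;
    } else {
        /* User task: switch only if the address space really changes. */
        next->active_mm = mm;
        if (mm != oldmm)
            toy_mm_switches++;
    }

    if (!prev->mm)
        prev->active_mm = NULL; /* a kernel thread hands back the borrowed mm */
}
```

Switching from a user task to a kernel thread therefore costs no address-space switch, which is the point of the lazy handling.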
sched_fork()
sched_fork() is called during the fork()/clone() system call. It initializes all the
scheduling-related fields defined in struct task_struct.
Step 1: retrieves the current CPU through get_cpu(), which disables preemption and returns the
CPU id on an SMP system.
Step 2: initializes the fields of the schedule entity of the new task_struct.
Step 3: sets the state to TASK_RUNNING.
Step 4: if reset on fork is requested, resets the policy and priority of the new task_struct
to their defaults. It then sets the priority of the new task.
Step 5: sets the schedule class of the new task based on its priority. If its priority is a
real-time priority, the schedule class is the RT schedule class. Otherwise, its schedule class
is set to the CFS schedule class.
Step 6: stores the CPU id in the cpu field of the task_struct.
Step 7: initializes the priority-list node of the new task by calling plist_node_init().
Step 8: calls put_cpu() to enable preemption again.
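Steps 4 and 5 can be sketched as a small function that derives the child's policy, priority and class from the parent. This is an illustrative model with hypothetical toy_ names, not the kernel routine. One deliberate simplification is labelled in the code: the child's dynamic priority is taken from its own normal_prio after the reset, whereas the listing in the appendix assigns current->normal_prio.

```c
#include <assert.h>

#define MAX_RT_PRIO  100
#define DEFAULT_PRIO 120   /* static priority of a nice-0 task */

enum toy_policy { TOY_SCHED_NORMAL, TOY_SCHED_FIFO, TOY_SCHED_RR };
enum toy_class  { TOY_CLASS_RT, TOY_CLASS_FAIR };

struct toy_task {
    enum toy_policy policy;
    int static_prio, normal_prio, prio;
    int sched_reset_on_fork;
    enum toy_class sched_class;
};

/* Initialise the child's scheduling fields from the parent. */
static void toy_sched_fork(struct toy_task *child, const struct toy_task *parent)
{
    assert(child != parent);
    *child = *parent;

    if (child->sched_reset_on_fork) {
        /* Real-time policies fall back to the normal policy. */
        if (child->policy == TOY_SCHED_FIFO || child->policy == TOY_SCHED_RR) {
            child->policy = TOY_SCHED_NORMAL;
            child->normal_prio = child->static_prio;
        }
        /* A negative nice value falls back to nice 0. */
        if (child->static_prio < DEFAULT_PRIO) {
            child->static_prio = DEFAULT_PRIO;
            child->normal_prio = child->static_prio;
        }
        child->sched_reset_on_fork = 0;
    }

    /* Simplification: use the child's own normal_prio, never a boosted prio. */
    child->prio = child->normal_prio;

    child->sched_class = (child->prio < MAX_RT_PRIO) ? TOY_CLASS_RT
                                                     : TOY_CLASS_FAIR;
}
```

A SCHED_FIFO parent that asked for reset on fork thus produces an ordinary CFS child, while without the flag the child inherits the real-time class.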
3.3. LOAD BALANCER
load_balance() is called as part of the scheduling process. It looks into the CPU domain and
tries to balance tasks between the busiest CPUs and idle CPUs. The load balancer performs these
steps:
Step 1: sets all available CPUs in the CPU mask. For each CPU that is present, its bit in the
mask is set to 1.
Step 2: using this CPU mask, the code attempts to find the busiest group. find_busiest_group()
returns the busiest group within the sched_domain if there is an imbalance. If there isn't an
imbalance, and the user has opted for power savings, it returns a group whose CPUs can be put to
idle by rebalancing those tasks elsewhere, if such a group exists. It also calculates the amount
of weighted load which should be moved to restore balance.
Step 3: checks whether a busiest group was found. If yes, the method continues. Otherwise, the
execution returns.
Step 4: having found the busiest group, finds the busiest running queue within this group.
Step 5: obtains both locks in order: the current running queue and the busiest queue.
Step 6: begins moving tasks. move_tasks() tries to move up to max_load_move weighted load from
busiest to this_rq, as part of a balancing operation within domain sd. It returns 1 if
successful and 0 otherwise.
Step 7: releases the locks obtained in step 5, in the reverse order of acquisition.
Step 8: checks whether moving tasks in step 6 was successful. If it was, the code resets the
balance interval time. Otherwise, it wakes up the migration task so that load balancing will be
kicked off later by this task.
Step 9: finally, the code updates the CPU information after balancing.
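The core of these steps, finding the busiest queue and pulling tasks until the imbalance is gone, can be sketched with plain task counts. This is a toy model under stated assumptions: the toy_ names are hypothetical, load is measured as a task count rather than as weighted load, there are no scheduling groups or domains, and locking is only mentioned in a comment.

```c
#include <assert.h>

#define TOY_NR_CPUS 4

struct toy_rq { unsigned long nr_running; };

/*
 * Return the index of the most loaded queue other than this_cpu, or -1 if
 * no queue is busier than this one by more than one task.
 */
static int toy_find_busiest(const struct toy_rq rqs[], int this_cpu)
{
    int busiest = -1;
    unsigned long max;

    assert(this_cpu >= 0 && this_cpu < TOY_NR_CPUS);
    max = rqs[this_cpu].nr_running + 1;
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (cpu != this_cpu && rqs[cpu].nr_running > max) {
            max = rqs[cpu].nr_running;
            busiest = cpu;
        }
    }
    return busiest;
}

/*
 * Pull tasks towards this_cpu until the two queues differ by at most one.
 * Returns the number of tasks moved (0 means already balanced).
 */
static unsigned long toy_load_balance(struct toy_rq rqs[], int this_cpu)
{
    int busiest = toy_find_busiest(rqs, this_cpu);
    unsigned long moved = 0;

    if (busiest < 0)
        return 0;

    /* The real code takes both queue locks here, in a fixed order. */
    while (rqs[busiest].nr_running > rqs[this_cpu].nr_running + 1) {
        rqs[busiest].nr_running--;
        rqs[this_cpu].nr_running++;
        moved++;
    }
    return moved;
}
```

For example, with queue lengths 0, 5, 2 and 1, balancing on CPU 0 pulls two tasks from CPU 1 and leaves the lengths at 2, 3, 2 and 1; a second call finds nothing to do.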
3.4. CFS SCHEDULE CLASS
In theory, the CFS scheduler should be slower than its predecessor, since operations on its
tree take O(log n) time rather than O(1). In practice, however, CFS achieves better performance
and interactivity. The improvement comes from the fact that CFS does not need to perform any of
the heuristic calculations that burdened the old scheduler. Let us recall the implementation of
the previous scheduler. The old scheduler works on two priority arrays, each of size MAX_PRIO.
One is the active array, which contains all active tasks. The other is the expired array, which
holds all expired tasks. A task becomes expired when it runs out of its allocated timeslice.
Once the active array has no tasks left, the scheduler swaps the two arrays, so that the active
array becomes the expired array and vice versa. The key to the O(1) running time of this
scheduler is the use of priority arrays of fixed size MAX_PRIO and the swapping of the expired
array with the active array when needed.
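The active/expired mechanism can be sketched as follows. This is a toy model, not the old kernel code: the toy_ names are hypothetical, each priority level keeps only a task count where the kernel keeps a list of tasks, and a byte per level stands in for the bitmap that the kernel scans with a find-first-bit operation. The scan is bounded by MAX_PRIO, so the cost does not depend on the number of tasks.

```c
#include <assert.h>
#include <string.h>

#define MAX_PRIO 140

/* One priority array: which levels are non-empty, and how many tasks each has. */
struct toy_prio_array {
    unsigned int nr_active;
    unsigned char bitmap[MAX_PRIO];
    unsigned int queue[MAX_PRIO];
};

struct toy_runqueue {
    struct toy_prio_array arrays[2];
    struct toy_prio_array *active, *expired;
};

static void toy_rq_init(struct toy_runqueue *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

static void toy_add(struct toy_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    a->queue[prio]++;
    a->bitmap[prio] = 1;
    a->nr_active++;
}

/*
 * Return the highest-priority (lowest-numbered) runnable level, swapping the
 * two arrays first if the active one is empty. Returns -1 if nothing runs.
 */
static int toy_pick(struct toy_runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct toy_prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    for (int prio = 0; prio < MAX_PRIO; prio++)
        if (rq->active->bitmap[prio])
            return prio;
    return -1;
}

/* A task at level prio used up its timeslice: move it to the expired array. */
static void toy_expire(struct toy_runqueue *rq, int prio)
{
    struct toy_prio_array *a = rq->active;

    assert(prio >= 0 && prio < MAX_PRIO && a->queue[prio] > 0);
    if (--a->queue[prio] == 0)
        a->bitmap[prio] = 0;
    a->nr_active--;
    toy_add(rq->expired, prio);
}
```

Once every task has expired, the next pick swaps the two pointers and all tasks become active again without any per-task work.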
The weakness of this scheduler is the overhead of calculating the priority and timeslice of a
process. Fairness and interactivity depend on the values of priority and timeslice. The idea is
that the scheduler must accurately classify a process as either I/O bound or CPU bound. Since
I/O bound processes do not hog the CPU, they appear to execute quickly. Delaying an I/O bound
process might make the system feel unresponsive to users. For this reason, the scheduler favors
I/O bound processes over CPU bound processes. The question is how the scheduler can determine
whether a process is I/O bound. The solution is a heuristic calculation (implemented in
effective_prio()) based on the task's static priority and the amount of time it sleeps versus
the time it runs. The result is scaled down to a number that becomes the task's dynamic
priority. The overhead in this scheduler is the work needed to maintain each task's dynamic
priority so that its value accurately reflects the behavior of the task.
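A heuristic of this kind can be sketched as below. This is a simplified illustration modeled on the idea of effective_prio(), not a copy of the old kernel code: the function name, the bonus range of plus or minus five levels and the one-second cap on the sleep average are assumptions chosen for the example.

```c
#include <assert.h>

#define MAX_RT_PRIO 100
#define MAX_PRIO    140

#define TOY_MAX_BONUS        10    /* assumed: bonus spans -5..+5 levels   */
#define TOY_MAX_SLEEP_AVG_MS 1000  /* assumed cap on the sleep average, ms */

/*
 * Derive a dynamic priority from the static priority and the recent sleep
 * average. Tasks that sleep a lot (I/O bound) get a numerically lower, that
 * is better, priority; tasks that never sleep (CPU bound) get a worse one.
 */
static int toy_effective_prio(int static_prio, unsigned int sleep_avg_ms)
{
    int bonus, prio;

    assert(static_prio >= MAX_RT_PRIO && static_prio < MAX_PRIO);
    if (sleep_avg_ms > TOY_MAX_SLEEP_AVG_MS)
        sleep_avg_ms = TOY_MAX_SLEEP_AVG_MS;

    bonus = (int)(sleep_avg_ms * TOY_MAX_BONUS / TOY_MAX_SLEEP_AVG_MS)
            - TOY_MAX_BONUS / 2;
    prio = static_prio - bonus;

    /* Keep the result inside the range of normal priorities. */
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

The scheduler has to keep the sleep average up to date for every task, which is the maintenance overhead the text refers to.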
The CFS scheduler was designed with a fresh approach that has nothing to do with priority
arrays. Its picking logic is based on the sched_entity.vruntime value and is thus very simple:
it always tries to run the task with the smallest p->se.vruntime value (i.e., the task which
executed least so far). CFS always tries to split up CPU time between runnable tasks as close to
"ideal multitasking hardware" as possible. For its internal data structure, CFS uses a red-black
(RB) tree, which is a self-balancing tree. The use of an RB tree in CFS has two advantages:
It has no "array switch" artifacts (by which both the previous vanilla scheduler and
RSDL/SD are affected).
It relies on the self-balancing property of the RB tree to reduce complexity in
CFS maintains a time-ordered RB tree, where all runnable tasks are sorted by the vruntime key
(there is a subtraction using rq->cfs.min_vruntime to account for possible wraparounds). CFS
picks the "leftmost" task from this tree and sticks to it. As the system progresses forwards, the
executed tasks are put into the tree more and more to the right slowly but surely giving a chance
for every task to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.
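The vruntime bookkeeping can be sketched in a few lines. This is a toy model under stated assumptions: the toy_ names are hypothetical, a linear scan stands in for taking the leftmost node of the RB tree, and the weight scaling uses 1024 as the weight of a nice-0 task. The comparison helper shows the subtraction trick mentioned above: comparing the signed difference keeps the ordering correct even if the 64-bit counter wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024   /* weight of a nice-0 task */

struct toy_entity {
    uint64_t vruntime;      /* weighted run time, in ns                */
    unsigned long weight;   /* larger weight means a larger CPU share  */
};

/* Wraparound-safe "a is before b": compare the signed difference. */
static int toy_vruntime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Charge delta_exec ns of real run time, scaled inversely by the weight. */
static void toy_update_curr(struct toy_entity *se, uint64_t delta_exec)
{
    assert(se->weight > 0);
    se->vruntime += delta_exec * TOY_NICE_0_LOAD / se->weight;
}

/* Linear scan standing in for "take the leftmost node of the RB tree". */
static struct toy_entity *toy_pick_leftmost(struct toy_entity *se, size_t n)
{
    struct toy_entity *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!best || toy_vruntime_before(se[i].vruntime, best->vruntime))
            best = &se[i];
    return best;
}
```

A task with twice the weight accumulates vruntime at half the rate, so it stays further to the left and is picked more often for the same amount of real run time.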
4. CONCLUSION
This study covered the most important aspects of the Linux scheduler in kernel 2.6.32. The
kernel scheduler is one of the most frequently executed components of a Linux system. Hence, it
has received a lot of attention from kernel developers, who have striven to put the most
optimized algorithms and code into the scheduler. Different algorithms used in the kernel
scheduler were discussed in this study. The CFS scheduler achieves good performance and
responsiveness while being relatively simple compared with the previous algorithm. CFS exceeds
performance expectations in some workloads, but it still shows weaknesses in others. There are
some complaints about the unresponsiveness of the CFS scheduler in 3D games. It is difficult to
make one scheduler fit all performance purposes. There are discussions around the idea of
making more than one scheduler available to user space in a Linux system. At the time of
writing, there is a new rival, the Brain Fuck Scheduler (BFS), which claims to achieve better
performance than the current CFS scheduler while being much simpler.
5. APPENDIX
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    /* runqueue lock: */
    spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned long nr_running;
    #define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX];
    unsigned long last_tick_seen;
    unsigned char in_nohz_recently;
    /* capture load from *all* tasks on this cpu: */
    struct load_weight load;
    unsigned long nr_load_updates;
    u64 nr_switches;
    u64 nr_migrations_in;

    /* list of leaf cfs_rq on this cpu: */
    struct list_head leaf_cfs_rq_list;
    struct list_head leaf_rt_rq_list;

    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    u64 clock;

    atomic_t nr_iowait;

    struct root_domain *rd;
    struct sched_domain *sd;

    unsigned long avg_load_per_task;

    u64 rt_avg;
    u64 age_stamp;

    int hrtick_csd_pending;
    struct call_single_data hrtick_csd;
    struct hrtimer hrtick_timer;

    /* latency stats */
    struct sched_info rq_sched_info;
    unsigned long long rq_cpu_time;
    /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

    /* sys_sched_yield() stats */
    unsigned int yld_count;

    /* schedule() stats */
    unsigned int sched_switch;
    unsigned int sched_count;
    unsigned int sched_goidle;

    /* try_to_wake_up() stats */
    unsigned int ttwu_count;
    unsigned int ttwu_local;

    /* BKL stats */
    unsigned int bkl_count;
};

struct sched_class {
    const struct sched_class *next;

    int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);

    void (*moved_group)(struct task_struct *p);
};

/*
 * All the scheduling class methods:
 */
static const struct sched_class fair_sched_class = {
    .next               = &idle_sched_class,
    .enqueue_task       = enqueue_task_fair,
    .dequeue_task       = dequeue_task_fair,
    .yield_task         = yield_task_fair,
    .check_preempt_curr = check_preempt_wakeup,
    .pick_next_task     = pick_next_task_fair,
    .put_prev_task      = put_prev_task_fair,
    .select_task_rq     = select_task_rq_fair,
    .load_balance       = load_balance_fair,
    .move_one_task      = move_one_task_fair,
    .set_curr_task      = set_curr_task_fair,
    .task_tick          = task_tick_fair,
    .task_new           = task_new_fair,
    .prio_changed       = prio_changed_fair,
    .switched_to        = switched_to_fair,
    .get_rr_interval    = get_rr_interval_fair,
    .moved_group        = moved_group_fair,
};

static const struct sched_class rt_sched_class = {
    .check_preempt_curr = check_preempt_curr_rt,
    .pick_next_task     = pick_next_task_rt,
    .put_prev_task      = put_prev_task_rt,
    .select_task_rq     = select_task_rq_rt,
    .load_balance       = load_balance_rt,
    .move_one_task      = move_one_task_rt,
    .set_cpus_allowed   = set_cpus_allowed_rt,
    .rq_online          = rq_online_rt,
    .rq_offline         = rq_offline_rt,
    .pre_schedule       = pre_schedule_rt,
    .post_schedule      = post_schedule_rt,
    .task_wake_up       = task_wake_up_rt,
    .switched_from      = switched_from_rt,
    .set_curr_task      = set_curr_task_rt,
    .task_tick          = task_tick_rt,
    .get_rr_interval    = get_rr_interval_rt,
    .prio_changed       = prio_changed_rt,
    .switched_to        = switched_to_rt,
};

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);       /* run queue of this CPU */
    rcu_sched_qs(cpu);      /* note an RCU quiescent state for this CPU */
    prev = rq->curr;        /* currently running task */
    switch_count = &prev->nivcsw;

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    put_prev_task(rq, prev);
    next = pick_next_task(rq);

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next, cpu);

        rq->curr = next;
    }

    if (need_resched())
        goto need_resched;
}

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    mm = next->mm;
    oldmm = prev->active_mm;

    if (unlikely(!mm)) {
        /* kernel thread: borrow the previous address space */
        next->active_mm = oldmm;
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (unlikely(!prev->mm)) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }

    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}

/*
 * fork()/clone()-time setup:
 */
void sched_fork(struct task_struct *p, int clone_flags)
{
    int cpu = get_cpu();

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
            p->policy = SCHED_NORMAL;
            p->normal_prio = p->static_prio;
        }

        if (PRIO_TO_NICE(p->static_prio) < 0) {
            p->static_prio = NICE_TO_PRIO(0);
            p->normal_prio = p->static_prio;
        }

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }

    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;

    if (!rt_prio(p->prio))
        p->sched_class = &fair_sched_class;

    cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
    set_task_cpu(p, cpu);

    put_cpu();      /* re-enable preemption (Step 8) */
}

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum cpu_idle_type idle,
                        int *balance)
{
    int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
    struct sched_group *group;
    unsigned long imbalance;
    struct rq *busiest;
    unsigned long flags;
    struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

    /*
     * When power savings policy is enabled for the parent domain, idle
     * sibling can pick up load irrespective of busy siblings. In this case,
     * let the state of idle sibling percolate up as CPU_IDLE, instead of
     * portraying it as CPU_NOT_IDLE.
     */
    if (idle != CPU_NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
        !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
        sd_idle = 1;

    schedstat_inc(sd, lb_count[idle]);

    group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
                               cpus, balance);

    if (*balance == 0)
        goto out_balanced;

    if (!group) {
        schedstat_inc(sd, lb_nobusyg[idle]);
        goto out_balanced;
    }

    busiest = find_busiest_queue(group, idle, imbalance, cpus);

    BUG_ON(busiest == this_rq);

    ld_moved = 0;
    if (busiest->nr_running > 1) {
        /*
         * Attempt to move tasks. If find_busiest_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. ld_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        double_rq_lock(this_rq, busiest);
        ld_moved = move_tasks(this_rq, this_cpu, busiest,
                              imbalance, sd, idle, &all_pinned);
        double_rq_unlock(this_rq, busiest);

        /*
         * some other cpu did the load balance for us.
         */
        if (ld_moved && this_cpu != smp_processor_id())
            resched_cpu(this_cpu);
    }

    if (!ld_moved) {
        schedstat_inc(sd, lb_failed[idle]);

        spin_lock_irqsave(&busiest->lock, flags);
        if (!busiest->active_balance) {
            busiest->active_balance = 1;
            busiest->push_cpu = this_cpu;
            active_balance = 1;
        }
        spin_unlock_irqrestore(&busiest->lock, flags);
        if (active_balance)
            wake_up_process(busiest->migration_thread);

        /*
         * We've kicked active balancing, reset the failure
         * counter.
         */
        sd->nr_balance_failed = sd->cache_nice_tries+1;
    } else
        sd->nr_balance_failed = 0;

    if (likely(!active_balance)) {
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * move_tasks).
         */
        if (sd->balance_interval < sd->max_interval)
            sd->balance_interval *= 2;
    }

    goto out;

out_balanced:
    schedstat_inc(sd, lb_balanced[idle]);

    sd->nr_balance_failed = 0;

    /* tune up the balancing interval */
    if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) ||
        (sd->balance_interval < sd->max_interval))
        sd->balance_interval *= 2;

    ld_moved = 0;
out:
    if (ld_moved)
        update_shares(sd);
    return ld_moved;
}