sched_ext: Enhance built-in idle selection with allowed CPUs
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>, Changwoo Min <changwoo@igalia.com>
Cc: Joel Fernandes <joelagnelf@nvidia.com>, linux-kernel@vger.kernel.org
Subject: [PATCHSET v7] sched_ext: Enhance built-in idle selection with allowed CPUs
Date: Sat, 05 Apr 2025 15:39:20 +0200
Message-ID: <20250405134041.13778-1-arighi@nvidia.com>
Many scx schedulers implement their own hard or soft-affinity rules to support topology characteristics, such as heterogeneous architectures (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on specific properties (e.g., running certain tasks only on a subset of CPUs).

Currently, there is no mechanism that allows applying the built-in idle CPU selection policy to an arbitrary subset of CPUs. As a result, schedulers often implement their own idle CPU selection policies, which are typically similar to one another, leading to a lot of code duplication.

To address this, extend the built-in idle CPU selection policy by introducing the concept of allowed CPUs. With this concept, BPF schedulers can apply the built-in idle CPU selection policy to a subset of allowed CPUs, allowing them to implement their own hard/soft-affinity rules while still using the topology optimizations of the built-in policy, preventing code duplication across different schedulers.

To implement this, introduce a new helper kfunc scx_bpf_select_cpu_and() that accepts a cpumask of allowed CPUs:

   s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu,
                              u64 wake_flags,
                              const struct cpumask *cpus_allowed, u64 flags);

Example usage
=============

   s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                      s32 prev_cpu, u64 wake_flags)
   {
           const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
           s32 cpu;

           cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
           if (cpu >= 0) {
                   scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                   return cpu;
           }

           return prev_cpu;
   }

Results
=======

Load distribution on a 4 sockets / 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses the new helper scx_bpf_select_cpu_and() with 0xff00 as the allowed domain:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
  0[                          0.0%]   8[||||||||||||||||||||||||100.0%]
  1[                          0.0%]   9[||||||||||||||||||||||||100.0%]
  2[                          0.0%]  10[||||||||||||||||||||||||100.0%]
  3[                          0.0%]  11[||||||||||||||||||||||||100.0%]
  4[                          0.0%]  12[||||||||||||||||||||||||100.0%]
  5[                          0.0%]  13[||||||||||||||||||||||||100.0%]
  6[                          0.0%]  14[||||||||||||||||||||||||100.0%]
  7[                          0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.

ChangeLog v6 -> v7:
 - use NULL instead of p->cpus_ptr when the caller doesn't specify an
   additional "and" cpumask
 - handle per-CPU tasks in scx_bpf_select_cpu_and(): since this API can
   be called also from ops.enqueue(), per-CPU tasks are not excluded
 - if prev_cpu isn't in the allowed CPUs, skip optimizations and try to
   pick any idle CPU in the subset of allowed CPUs
 - do not deprecate scx_bpf_select_cpu_dfl(), as there's no need to
   convert everyone to use the new scx_bpf_select_cpu_and() API

ChangeLog v5 -> v6:
 - prevent redundant cpumask_subset() + cpumask_equal() checks in all
   patches
 - remove the cpumask_subset() + cpumask_and() combo with local
   cpumasks, as cpumask_and() alone is generally more efficient
 - clean up patches to prevent unnecessary function renames

ChangeLog v4 -> v5:
 - simplify code to compute the temporary task's cpumasks (and)

ChangeLog v3 -> v4:
 - keep the p->nr_cpus_allowed optimizations (skip cpumask operations
   when the task can run on all CPUs)
 - allow calling scx_bpf_select_cpu_and() also from ops.enqueue() and
   modify the kselftest to cover this case as well
 - rebase to the latest sched_ext/for-6.15

ChangeLog v2 -> v3:
 - incrementally refactor scx_select_cpu_dfl() to accept idle flags and
   an arbitrary allowed cpumask
 - build scx_bpf_select_cpu_and() on top of the existing logic
 - re-arrange the scx_select_cpu_dfl() prototype, aligning the first
   three arguments with select_task_rq()
 - do not use "domain" for the allowed cpumask to avoid potential
   ambiguity with sched_domain

ChangeLog v1 -> v2:
 - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and
   always select idle CPUs strictly within the allowed domain
 - rename preferred CPUs -> allowed CPUs
 - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
 - deprecate scx_bpf_select_cpu_dfl() in favor of
   scx_bpf_select_cpu_and() and provide all the required backward
   compatibility boilerplate

Andrea Righi (5):
  sched_ext: idle: Extend topology optimizations to all tasks
  sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
  sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
  sched_ext: idle: Introduce scx_bpf_select_cpu_and()
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and()

 kernel/sched/ext.c                                 |   3 +-
 kernel/sched/ext_idle.c                            | 188 +++++++++++++++++----
 kernel/sched/ext_idle.h                            |   3 +-
 tools/sched_ext/include/scx/common.bpf.h           |   2 +
 tools/testing/selftests/sched_ext/Makefile         |   1 +
 .../testing/selftests/sched_ext/allowed_cpus.bpf.c | 121 +++++++++++++
 tools/testing/selftests/sched_ext/allowed_cpus.c   |  57 +++++++
 7 files changed, 342 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c