sched_ext: Enhance built-in idle selection with allowed CPUs
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>, Changwoo Min <changwoo@igalia.com>
Cc: Joel Fernandes <joelagnelf@nvidia.com>, linux-kernel@vger.kernel.org
Subject: [PATCHSET v7] sched_ext: Enhance built-in idle selection with allowed CPUs
Date: Sat, 05 Apr 2025 15:39:20 +0200
Message-ID: <20250405134041.13778-1-arighi@nvidia.com>
Many scx schedulers implement their own hard or soft-affinity rules to support topology characteristics, such as heterogeneous architectures (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on specific properties (e.g., running certain tasks only on a subset of CPUs).

Currently, there is no mechanism that allows applying the built-in idle CPU selection policy to an arbitrary subset of CPUs. As a result, schedulers often implement their own idle CPU selection policies, which are typically similar to one another, leading to a lot of code duplication.

To address this, extend the built-in idle CPU selection policy by introducing the concept of allowed CPUs. With this concept, BPF schedulers can apply the built-in idle CPU selection policy to a subset of allowed CPUs, allowing them to implement their own hard/soft-affinity rules while still using the topology optimizations of the built-in policy, preventing code duplication across different schedulers.

To implement this, introduce a new helper kfunc scx_bpf_select_cpu_and() that accepts a cpumask of allowed CPUs:

   s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu,
                              u64 wake_flags,
                              const struct cpumask *cpus_allowed, u64 flags);

Example usage
=============

   s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                      s32 prev_cpu, u64 wake_flags)
   {
           const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
           s32 cpu;

           cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
           if (cpu >= 0) {
                   scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                   return cpu;
           }

           return prev_cpu;
   }

Results
=======

Load distribution on a 4 sockets / 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses the new helper scx_bpf_select_cpu_and() with 0xff00 as the allowed domain:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
  0[                          0.0%]   8[||||||||||||||||||||||||100.0%]
  1[                          0.0%]   9[||||||||||||||||||||||||100.0%]
  2[                          0.0%]  10[||||||||||||||||||||||||100.0%]
  3[                          0.0%]  11[||||||||||||||||||||||||100.0%]
  4[                          0.0%]  12[||||||||||||||||||||||||100.0%]
  5[                          0.0%]  13[||||||||||||||||||||||||100.0%]
  6[                          0.0%]  14[||||||||||||||||||||||||100.0%]
  7[                          0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.

ChangeLog v6 -> v7:
 - use NULL instead of p->cpus_ptr when the caller doesn't specify an
   additional "and" cpumask
 - handle per-CPU tasks in scx_bpf_select_cpu_and(): since this API can
   be called also from ops.enqueue(), per-CPU tasks are not excluded
 - if prev_cpu isn't in the allowed CPUs, skip optimizations and try to
   pick any idle CPU in the subset of allowed CPUs
 - do not deprecate scx_bpf_select_cpu_dfl(), as there's no need to
   convert everyone to use the new scx_bpf_select_cpu_and() API

ChangeLog v5 -> v6:
 - prevent redundant cpumask_subset() + cpumask_equal() checks in all
   patches
 - remove the cpumask_subset() + cpumask_and() combo with local
   cpumasks, as cpumask_and() alone is generally more efficient
 - clean up patches to prevent unnecessary function renames

ChangeLog v4 -> v5:
 - simplify code to compute the temporary task's cpumasks (and)

ChangeLog v3 -> v4:
 - keep the p->nr_cpus_allowed optimizations (skip cpumask operations
   when the task can run on all CPUs)
 - allow calling scx_bpf_select_cpu_and() also from ops.enqueue() and
   modify the kselftest to cover this case as well
 - rebase to the latest sched_ext/for-6.15

ChangeLog v2 -> v3:
 - incrementally refactor scx_select_cpu_dfl() to accept idle flags and
   an arbitrary allowed cpumask
 - build scx_bpf_select_cpu_and() on top of the existing logic
 - re-arrange the scx_select_cpu_dfl() prototype, aligning the first
   three arguments with select_task_rq()
 - do not use "domain" for the allowed cpumask to avoid potential
   ambiguity with sched_domain

ChangeLog v1 -> v2:
 - rename scx_bpf_select_cpu_pref() to scx_bpf_select_cpu_and() and
   always select idle CPUs strictly within the allowed domain
 - rename preferred CPUs -> allowed CPUs
 - drop %SCX_PICK_IDLE_IN_PREF (not required anymore)
 - deprecate scx_bpf_select_cpu_dfl() in favor of
   scx_bpf_select_cpu_and() and provide all the required backward
   compatibility boilerplate

Andrea Righi (5):
  sched_ext: idle: Extend topology optimizations to all tasks
  sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
  sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
  sched_ext: idle: Introduce scx_bpf_select_cpu_and()
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and()

 kernel/sched/ext.c                                 |   3 +-
 kernel/sched/ext_idle.c                            | 188 +++++++++++++++++----
 kernel/sched/ext_idle.h                            |   3 +-
 tools/sched_ext/include/scx/common.bpf.h           |   2 +
 tools/testing/selftests/sched_ext/Makefile         |   1 +
 .../testing/selftests/sched_ext/allowed_cpus.bpf.c | 121 +++++++++++++
 tools/testing/selftests/sched_ext/allowed_cpus.c   |  57 +++++++
 7 files changed, 342 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/allowed_cpus.c