Skip to content

bpf: cgroup: fix sysctl new-value handling in __cgroup_bpf_run_filter_sysctl#12232

Open
kernel-patches-daemon-bpf[bot] wants to merge 3 commits into
bpf-next_basefrom
series/1102583=>bpf-next
Open

bpf: cgroup: fix sysctl new-value handling in __cgroup_bpf_run_filter_sysctl#12232
kernel-patches-daemon-bpf[bot] wants to merge 3 commits into
bpf-next_basefrom
series/1102583=>bpf-next

Conversation

@kernel-patches-daemon-bpf
Copy link
Copy Markdown

Pull request for series with
subject: bpf: cgroup: fix sysctl new-value handling in __cgroup_bpf_run_filter_sysctl
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1102583

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: 9a720e0
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1102583
version: 2

@kernel-patches-review-bot
Copy link
Copy Markdown

commit e0dceb780c938513d0c789ea7a3a3f13393d645d
Author: Dawei Feng <dawei.feng@seu.edu.cn>

bpf: cgroup: NUL-terminate replaced sysctl value

This appends a NUL terminator after the replacement sysctl value written
by bpf_sysctl_set_new_value(), so downstream proc handlers do not read
past the end of the buffer.  A KASAN slab-out-of-bounds read was
reproduced via the cgroup/sysctl replacement path before the fix.

> When writing to sysctls, proc_sys_call_handler() guarantees that the
> buffer passed to proc handlers is NUL-terminated. If
> bpf_sysctl_set_new_value() replaces the pending sysctl value, it can
> hand a replacement buffer directly to proc handlers. However, the
> helper currently copies only buf_len
> bytes into that buffer without appending a NUL terminator, leaving
> downstream parsers vulnerable to out-of-bounds access.
>
> Fix this by appending a '\0' after the replaced value to restore the
> expected sysctl semantics. Since the helper already rejects buf_len
> greater than PAGE_SIZE - 1, there is always room for the extra byte.
>
> Reproduced in a QEMU x86_64 guest booted with KASAN while exercising
> the sysctl replacement path with a cgroup/sysctl BPF program.
>
> [ ... KASAN report snipped ... ]
>
> Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>

This fixes an out-of-bounds read in bpf_sysctl_set_new_value().  The
missing NUL terminator after the memcpy() of the replacement value looks
like it was introduced by the commit that added the helper.  Should this
carry:

  Fixes: 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: cgroup: NUL-terminate replaced sysctl value
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26616337824

@kernel-patches-review-bot
Copy link
Copy Markdown

commit dcbe395f223e5524109eb40fa57257eec6d887d5
Author: Dawei Feng <dawei.feng@seu.edu.cn>

bpf: cgroup: restore sysctl new-value replacement

This changes the success check in __cgroup_bpf_run_filter_sysctl() from
ret == 1 to !ret so that a cgroup/sysctl program can replace the pending
sysctl write buffer again, since bpf_prog_run_array_cg() returns 0 on
success.

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index a0b5f8cd8b10..3f06e2270f5c 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1935,7 +1935,7 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
>
>  	kfree(ctx.cur_val);
>
> -	if (ret == 1 && ctx.new_updated) {
> +	if (!ret && ctx.new_updated) {
>  		kvfree(*buf);
>  		*buf = ctx.new_val;
>  		*pcount = ctx.new_len;

With the success check now testing !ret, does the kerneldoc above
__cgroup_bpf_run_filter_sysctl() still describe the right convention?  It
currently reads:

 * This function will return %-EPERM if an attached program is found and
 * returned value != 1 during execution. In all other cases 0 is returned.

Since bpf_prog_run_array_cg() returns 0 on success rather than 1, should
that line say "returned value != 0" to match the new check?

This point was raised in v1 review but does not appear to have been
changed in v2.

Now for the Fixes: tag.  The commit message states:

> Fixes: 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")

Is this the commit that actually introduced the problem?  At 4e63acdff864
the runner was BPF_PROG_RUN_ARRAY, which returned 1 on success, so the
ret == 1 check was correct at that time.

There appear to be two candidate commits that changed the return
convention to 0-on-success and left the ret == 1 check stale:

  c4dcfdd406aa ("bpf: Move getsockopt retval to struct bpf_cg_run_ctx")
  f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean")

Should the Fixes: tag point at one of these instead of 4e63acdff864?

The f10d05966196 suggestion was raised by the bpf-ci review in v1 but the
tag was unchanged in v2.



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: cgroup: restore sysctl new-value replacement
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26616337824

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Dawei Feng added 3 commits May 28, 2026 22:01
proc_sys_call_handler() allocates its temporary sysctl buffer with
kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since
kvzalloc() may fall back to vmalloc() for large allocations, freeing
that buffer with kfree() is wrong and can corrupt memory.

Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc
allocations.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc5.

Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with
KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the
test tree also included the accompanying fix for the stale ret == 1
check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines
failslab injections to the proc_sys_call_handler() range, uses
stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to
/proc/sys/kernel/domainname from a task in the target cgroup. Under
that setup, fail-nth=1 triggered the fault:

  BUG: unable to handle page fault for address: ffffeb0200024d48
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000  SMP KASAN NOPTI
  CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4  PREEMPT(lazy)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:kfree+0x6e/0x510
  Code: 80 48 01 ef 0f 82 ae 04 00 00 48 c7 c0 00 00 00 80 48 2b 05 04 1b 23 04 48 01 c7 48 c1 ef 0c 48 c1 e7 06 48 03 3d e2 1a 23 04 <4c> 8b 57 08 4c 89 d0 83 e0 01 48 83 e8 01 49 09 c2 49 >
  RSP: 0018:ffff888108de7ab8 EFLAGS: 00010282
  RAX: 0000777f80000000 RBX: ffff88815af398c0 RCX: 0000000000000080
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffeb0200024d40
  RBP: ffffc90000935000 R08: 0000000000000001 R09: 0000000000000001
  R10: ffffffff86b4b297 R11: 0000000000000000 R12: ffffffff819b71fd
  R13: 0000000000000001 R14: ffff888108de7cc0 R15: 0000000000000000
  FS:  00007f8988cc2b80(0000) GS:ffff8881d3256000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffeb0200024d48 CR3: 0000000101d6b000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   ? __cgroup_bpf_run_filter_sysctl+0x626/0xc30
   __cgroup_bpf_run_filter_sysctl+0x74d/0xc30
   ? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? __kvmalloc_node_noprof+0x345/0x870
   ? proc_sys_call_handler+0x250/0x480
   ? srso_return_thunk+0x5/0x5f
   proc_sys_call_handler+0x3a2/0x480
   ? __pfx_proc_sys_call_handler+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? selinux_file_permission+0x39f/0x500
   ? srso_return_thunk+0x5/0x5f
   ? lock_is_held_type+0x9e/0x120
   vfs_write+0x98e/0x1000
   ? srso_return_thunk+0x5/0x5f
   ? kmem_cache_free+0x308/0x550
   ? __pfx_vfs_write+0x10/0x10
   ? __pfx_do_sys_openat2+0x10/0x10
   ksys_write+0xf2/0x1d0
   ? __pfx_ksys_write+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? trace_irq_enable.constprop.0+0x110/0x140
   do_syscall_64+0x115/0x690
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f8988dd8907
   Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8  01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 >
   RSP: 002b:00007fff4069b878 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8988dd8907
   RDX: 0000000000001fff RSI: 0000564f97ef46b0 RDI: 0000000000000005
   RBP: 0000564f97ef46b0 R08: 0000000000000000 R09: 0000564f97ef46b0
   R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000001fff R14: 0000000000000005 R15: 0000000000000001
   </TASK>
With this fix applied on top of the same test setup, rerunning the
reproducer with fail-nth=1 yields no corresponding Oops reports.

Fixes: 4508943 ("proc: use kvzalloc for our kernel buffer")
Cc: stable@vger.kernel.org

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
When writing to sysctls, proc_sys_call_handler() guarantees that the
buffer passed to proc handlers is NUL-terminated. If
bpf_sysctl_set_new_value() replaces the pending sysctl value, it can
hand a replacement buffer directly to proc handlers. However, the
helper currently copies only buf_len
bytes into that buffer without appending a NUL terminator, leaving
downstream parsers vulnerable to out-of-bounds access.

Fix this by appending a '\0' after the replaced value to restore the
expected sysctl semantics. Since the helper already rejects buf_len
greater than PAGE_SIZE - 1, there is always room for the extra byte.

Reproduced in a QEMU x86_64 guest booted with KASAN while exercising
the sysctl replacement path with a cgroup/sysctl BPF program. The
reproducer targets `/proc/sys/net/core/flow_limit_cpu_bitmap`, fills
the original user write buffer with non-zero bytes, and overrides the
sysctl value so the replacement buffer lacks a terminating NUL. Under
that setup, the pre-fix kernel reported:

  BUG: KASAN: slab-out-of-bounds in strnchrnul+0x72/0x90
  Read of size 1 at addr ffff88800de57000 by task repro_patch3/66
  CPU: 0 UID: 0 PID: 66 Comm: repro_patch3 Not tainted 7.1.0-rc3-00269-g8370ca1f87cc #6 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x68/0xa0
   print_report+0xcb/0x5e0
   ? __virt_addr_valid+0x21d/0x3f0
   ? strnchrnul+0x72/0x90
   ? strnchrnul+0x72/0x90
   kasan_report+0xca/0x100
   ? strnchrnul+0x72/0x90
   strnchrnul+0x72/0x90
   bitmap_parse+0x37/0x2e0
   flow_limit_cpu_sysctl+0xc6/0x840
   ? __pfx_flow_limit_cpu_sysctl+0x10/0x10
   ? __kvmalloc_node_noprof+0x5ba/0x870
   proc_sys_call_handler+0x31d/0x480
   ? __pfx_proc_sys_call_handler+0x10/0x10
   ? selinux_file_permission+0x39f/0x500
   ? lock_is_held_type+0x9e/0x120
   vfs_write+0x98e/0x1000
   ? kmem_cache_free+0x308/0x550
   ? __pfx_vfs_write+0x10/0x10
   ? __pfx_do_sys_openat2+0x10/0x10
   ksys_write+0xf2/0x1d0
   ? __pfx_ksys_write+0x10/0x10
   ? trace_irq_enable.constprop.0+0x110/0x140
   do_syscall_64+0x115/0x690
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x447f37
   Code: ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
   RSP: 002b:00007fff01ade608 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000447f37
   RDX: 0000000000001fff RSI: 00000000172b1780 RDI: 0000000000000005
   RBP: 00000000172b1780 R08: 00000000004ca1b0 R09: 00000000172b1780
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000001fff
   R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000003
   </TASK>
  The buggy address is located 0 bytes to the right of
  allocated 4096-byte region [ffff88800de56000, ffff88800de57000)

With this fix applied, rerunning the same sysctl-targeted path yields
no corresponding KASAN reports.

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Commit 4e63acd ("bpf: Introduce bpf_sysctl_{get,set}_new_value
helpers") changed the success return value to 0, but failed to update the
corresponding check in __cgroup_bpf_run_filter_sysctl(). Since
bpf_prog_run_array_cg() now returns 0 on success, the legacy ret == 1
condition is never satisfied. As a result, the modified value is ignored,
and bpf_sysctl_set_new_value() fails to replace the write buffer.

Fix this by checking for a return value of 0 instead, so cgroup/sysctl
programs can correctly replace the pending sysctl buffer.

This bug was discovered during a manual code review. Tested via a
cgroup/sysctl BPF reproducer overriding writes to a target sysctl.
Pre-fix, bpf_sysctl_set_new_value("foo") was silently ignored: the write
returned 8192 and the value remained "600". Post-fix, the BPF replacement
buffer properly propagates: the write returns 3 and the value updates to
"foo".

Fixes: 4e63acd ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")
Cc: stable@vger.kernel.org

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
@kernel-patches-daemon-bpf
Copy link
Copy Markdown
Author

Upstream branch: 9a720e0
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1102583
version: 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants