Lock down benchmark manager memory during kernel compile #38
Pull request overview
This PR hardens the benchmark runner against in-process tampering during kernel import/compilation by moving kernel import into a separate thread and temporarily making the BenchmarkManager memory inaccessible, backed by a per-thread seccomp filter to prevent re-mapping/protection changes to that memory. It also adds an optional “exploit canary” and several exploit scripts to validate the hardening.
Changes:
- Add `BenchmarkManager::get_kernel()`, which `mprotect(PROT_NONE)`s the manager's pages during kernel import in a separate thread and installs a thread-specific seccomp filter for that page range.
- Introduce `csrc/seccomp.cpp`, implementing a small BPF builder + `seccomp_protect_page_range()`.
- Add the `ENABLE_EXPLOIT_TARGET` build option, a canary field, and exploit submissions to test writeability/overwrite attempts.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `csrc/manager.cpp` | Runs kernel import in a separate thread while `BenchmarkManager` pages are `PROT_NONE`; adds exploit-canary-driven output behavior. |
| `csrc/manager.h` | Declares `get_kernel()` and (optionally) an exploit canary member. |
| `csrc/seccomp.cpp` | New seccomp BPF filter generator to block memory syscalls targeting a protected address range. |
| `CMakeLists.txt` | Adds the `ENABLE_EXPLOIT_TARGET` option and compiles `csrc/seccomp.cpp`. |
| `exploits/submission_test_canary_direct.py` | New exploit script attempting direct overwrite of the canary by scanning writable mappings. |
| `exploits/submission_test_canary_make_writeable.py` | New exploit script attempting `mprotect` then overwrite of the canary. |
| `exploits/submission_event_array.py` | New exploit script attempting event-handle tampering by scanning memory for vectors and overwriting handles. |
```cpp
nb::callable kernel;
// make the BenchmarkManager inaccessible
protect_range(reinterpret_cast<void*>(lo), hi - lo, PROT_NONE);
// TODO make stack inaccessible (may be impossible) or read-only during the call
// call the python kernel generation function from a different thread.

std::thread make_kernel_thread([&]() {
    // new thread, new seccomp.
    seccomp_protect_page_range(lo, hi - lo);
    nb::gil_scoped_acquire guard;
    kernel = kernel_from_qualname(qualname);
});

make_kernel_thread.join();
// make it accessible again. This is in the original thread, so the tightened seccomp
// policy does not apply here.
protect_range(reinterpret_cast<void*>(lo), hi - lo, PROT_READ | PROT_WRITE);
return kernel;
```
get_kernel() is not exception-safe w.r.t. memory protections: after mprotect(..., PROT_NONE), any failure (e.g., std::thread constructor throwing, or an early exception path you add later) can leave the BenchmarkManager pages permanently PROT_NONE, leading to hard-to-debug crashes. Use an RAII guard/scope-exit to always restore the original protections (and consider restoring before rethrowing thread errors).
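The suggested fix can be sketched as an RAII guard; names here are illustrative, not the PR's code, and `protect_range` is assumed to wrap `mprotect` as in the diff above:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical scope guard: makes a page range PROT_NONE on construction and
// always restores PROT_READ | PROT_WRITE on destruction, including during
// stack unwinding, so the pages can never be left permanently inaccessible.
struct ProtectionGuard {
    void*  addr;
    size_t len;
    ProtectionGuard(void* a, size_t l) : addr(a), len(l) {
        mprotect(addr, len, PROT_NONE);
    }
    ~ProtectionGuard() {
        // runs even if an exception propagates out of the protected scope
        mprotect(addr, len, PROT_READ | PROT_WRITE);
    }
    ProtectionGuard(const ProtectionGuard&) = delete;
    ProtectionGuard& operator=(const ProtectionGuard&) = delete;
};
```

In `get_kernel()`, constructing such a guard before spawning `make_kernel_thread` would restore the original protections even if the `std::thread` constructor throws.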
```cpp
// prevent messing with the protected page range
b.block_syscall_if_arg_in_range(__NR_mprotect, 0, lo, hi);
b.block_syscall_if_arg_in_range(__NR_mmap, 0, lo, hi);
b.block_syscall_if_arg_in_range(__NR_mremap, 0, lo, hi);
b.block_syscall_if_arg_in_range(__NR_munmap, 0, lo, hi);
b.block_syscall_if_arg_in_range(__NR_madvise, 0, lo, hi);
b.block_syscall_if_arg_in_range(__NR_remap_file_pages, 0, lo, hi);
```
The seccomp checks only test whether the start address argument falls in [lo, hi). For range-based syscalls like mprotect(addr, len, ...), munmap(addr, len), madvise(addr, len, ...), an attacker can bypass this by choosing addr < lo with a len large enough to overlap into the protected pages (e.g., mprotect(lo - 4096, 8192, ...)). Update the filter logic to block any syscall whose affected range overlaps [lo, hi), i.e., (addr < hi) && (addr+len > lo), using BPF ALU ops / additional arg checks as needed.
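The overlap condition the reviewer proposes can be expressed as a plain predicate before being translated into BPF arg checks; this is a sketch of the logic only (the function name is made up), since encoding `addr + len` in classic BPF additionally requires loading both argument words and an ALU add:

```cpp
#include <cassert>
#include <cstdint>

// Block a syscall iff the range it affects, [addr, addr + len), intersects
// the protected range [lo, hi). A start-address-only check misses calls such
// as mprotect(lo - 4096, 8192, ...), which starts below lo but reaches in.
bool overlaps_protected(uint64_t addr, uint64_t len, uint64_t lo, uint64_t hi) {
    return addr < hi && addr + len > lo;
}
```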
```cpp
auto prog = b.build();
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```
seccomp_protect_page_range() ignores failures from prctl(PR_SET_NO_NEW_PRIVS, ...) and the seccomp syscall. If either call fails, the thread will continue without the intended protection. Check return codes and throw/abort with a clear error so the caller can fail closed rather than silently running unprotected.
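A fail-closed version might look like the following sketch (function name and error handling are assumptions, not the PR's code):

```cpp
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Check both return codes and throw, so a thread never silently continues
// running without the filter it asked for.
void install_filter_or_throw(const sock_fprog* prog) {
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        throw std::runtime_error(std::string("PR_SET_NO_NEW_PRIVS failed: ")
                                 + std::strerror(errno));
    if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, prog) != 0)
        throw std::runtime_error(std::string("seccomp(SET_MODE_FILTER) failed: ")
                                 + std::strerror(errno));
}
```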
```cpp
}

struct sock_fprog build() {
    return { .len = (unsigned short)Instructions.size(), .filter = Instructions.data() };
}
```
BpfBuilder::build() uses C++20 designated initializers ({ .len = ..., .filter = ... }), but the project is configured for C++17 (CMAKE_CXX_STANDARD 17). This will fail to compile on conforming C++17 toolchains. Use aggregate initialization without designators (in field order) or explicitly assign into a sock_fprog local before returning.
Suggested change:
```diff
- return { .len = (unsigned short)Instructions.size(), .filter = Instructions.data() };
+ return { static_cast<unsigned short>(Instructions.size()), Instructions.data() };
```
This moves kernel compilation to a separate thread. While that thread is running, we can make the BenchmarkManager's memory completely inaccessible, and use seccomp to ensure that it stays that way. Because seccomp is per-thread, the original thread (with no user code running yet) can unprotect the memory before it is used.
We should be able to set up a similar scheme for warmup as well. For the actual measurements, I suspect we need to be less strict, because we do need to record information to memory.
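The core of the scheme (protect in the original thread, run other code in a second thread, restore in the original thread afterwards) can be demonstrated without the seccomp half; the following is a minimal sketch with made-up names, using an anonymous mapping as a stand-in for the BenchmarkManager's pages:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <thread>

// Returns the byte stored before the lockdown, read back after access is
// restored; -1 on mmap failure. The per-thread seccomp filter from the PR
// is omitted here for brevity.
int run_lockdown_demo() {
    const size_t page = 4096;
    void* raw = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return -1;
    char* mgr = static_cast<char*>(raw);  // stand-in for the manager's pages
    mgr[0] = 7;

    // original thread: lock the pages down before other code runs
    mprotect(mgr, page, PROT_NONE);

    std::thread untrusted([] {
        // in the PR, seccomp_protect_page_range() would be installed here so
        // this thread cannot mprotect/mmap the range writable again; any
        // direct access to the pages from this thread now faults
    });
    untrusted.join();

    // back in the original thread (no seccomp filter): restore access
    mprotect(mgr, page, PROT_READ | PROT_WRITE);
    int value = mgr[0];  // contents survived the PROT_NONE window
    munmap(mgr, page);
    return value;
}
```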