fix(sandbox): restore GPU procfs baseline#1522
Conversation
|
🌿 Preview your docs: https://nvidia-preview-pr-1522.docs.buildwithfern.com/openshell |
96a1caa to
59e399a
Compare
12bde4d to
d73e6de
Compare
pimlock
left a comment
There was a problem hiding this comment.
LGTM with a few nits and questions.
| fn allow_cuda_procfs_writes_allows_descendant_comm_write() { | ||
| if std::env::var_os(PROCFS_WRITE_HELPER_ENV).is_some() { | ||
| return; | ||
| } |
There was a problem hiding this comment.
From what I can tell the allow_cuda_procfs_writes_allows_descendant_comm_write launches the current test binary with PROCFS_WRITE_HELPER_ENV, and allow_cuda_procfs_writes_helper only runs real helper logic in that child process.
Could we add a short comment or rename the helper test to make that harness pattern clearer? I assume a subprocess is needed because Landlock restrictions are irreversible for the current process.
There was a problem hiding this comment.
I'm going to remove the implicit broadening of the permissions from this PR and defer it to #1628. We can look at improving the readability of the tests there.
| // TLS handshakes. | ||
| grpc_retry("Policy discovery sync", || { | ||
| grpc_client::discover_and_sync_policy(endpoint, id, sandbox, &discovered) | ||
| }) | ||
| .await? | ||
| }; | ||
|
|
||
| // Ensure baseline filesystem paths are present for proxy-mode | ||
| // sandboxes. If the policy was enriched, sync the updated version | ||
| // back to the gateway so users can see the effective policy. | ||
| let enriched = enrich_proto_baseline_paths(&mut proto_policy); | ||
| let enriched = enrich_proto_baseline_paths(&mut proto_policy)?; | ||
| if enriched | ||
| && let Some(sandbox_name) = sandbox.as_deref() | ||
| && let Err(e) = grpc_client::sync_policy(endpoint, sandbox_name, &proto_policy).await | ||
| { | ||
| warn!( | ||
| error = %e, | ||
| "Failed to sync enriched policy back to gateway (non-fatal)" | ||
| ); | ||
| } |
There was a problem hiding this comment.
This is outside of changes here, but the comment in the BaselinePolicySource got me here (with how the enrich_proto_baseline_paths uses Custom).
I think the flow in this function could be improved:
- line 2278, where we got policy from the server: lines 2313-2325 should likely live there, because that part is only applicable to when we get the policy from the gateway, as if we got the policy as default/from the image, we already enriched it, so there is no need to do that again?
- unless there is a situation in which this is still useful, even after the policy was initially enriched? Maybe safer to leave it, but maybe worth determining the source in this function, e.g. having the
ifin 2277 to capture both the policy and source?
| let result = ruleset | ||
| .add_rule(PathBeneath::new( | ||
| path_fd, | ||
| make_bitflags!(AccessFs::{WriteFile}), |
There was a problem hiding this comment.
How much narrower is this WriteFile from a standard read_write access? I'm wondering mostly because with /proc being in added with read_write to the policy, the user analyzing what's allowed sees it in the policy. It's more risky, but explicit.
With this, the /proc is in the policy as read-only, but some narrower scope is allowed, but without any mention in the policy.
So it's between "wider, but explicit" vs "narrower, but implicit". I'm not sure what are the risks of the former, but I think that may be a better option. WDYT?
There was a problem hiding this comment.
The read-write permissions that OpenShell applies are taken from AccessFs::from_all: https://docs.rs/landlock/0.4.4/src/landlock/fs.rs.html#104
This includes the from_read: https://docs.rs/landlock/0.4.4/src/landlock/fs.rs.html#116-118
And from_write: https://docs.rs/landlock/0.4.4/src/landlock/fs.rs.html#130-143 (which is ABI version dependent).
Assuming V1 this means that the additional permissions that promoting to read-write provide are:
WriteFileRemoveDirRemoveFileMakeCharMakeDirMakeRegMakeSockMakeFifoMakeBlockMakeSym
I think that in the case of OpenShell, where auditing what is allowed/accessed is important, explicit behaviour should be favoured over implicit behaviour. Does this maybe mean that we should keep the broader behaviour for now (the first commit) and then look at how we can better surface the permissions that we are adding? Perhaps adding a finer granularity for the policies makes sense?
There was a problem hiding this comment.
I have created #1628 to track looking into this properly instead of adding this as a "quick" fix on top of this PR.
2f3b5b2 to
a0171ff
Compare
|
Thanks for your initial review @pimlock. After the initial back and forth, I realised that there were a number of edge cases that I was not considering. I believe I was trying to detect user intent with insufficient signal and as such have updated this PR to ALWAYS promote |
Signed-off-by: Evan Lezar <elezar@nvidia.com>
a0171ff to
c828f23
Compare
Summary
Restore CUDA GPU startup compatibility by promoting
/procfromfilesystem_policy.read_onlytofilesystem_policy.read_writewhen/procis part of the active GPU runtime baseline.
This keeps the change intentionally narrow. The existing baseline enrichment
already places
/procin the GPU read-write baseline because CUDA writes/proc/<pid>/task/<tid>/commduring initialization. The missing behavior wasthat an existing read-only
/procentry caused enrichment to skip theread-write baseline path. This PR restores that promotion and emits an
informational log message when it happens.
Broader handling for user-supplied policy conflicts and explicit baseline
conflict controls is left to follow-up work such as #1629.
Related Issue
Fixes #1486
Related follow-up: #1629
Changes
/procfromread_onlytoread_writewhen the GPU read-writebaseline requires it.
/procis promoted for GPU runtimecompatibility.
policy.
Testing
mise exec -- cargo fmt --allmise exec -- cargo test -p openshell-sandbox --lib baseline_tests -- --nocapturemise run pre-commitcompleted Helm lint, Rust format, Rust check, Rust clippy, markdown lint, and license checks;python:protofailed in the parallel run becausegrpc_toolswas missing after.venvrecreation.mise run python:protoChecklist