Skip to content

fix: Bound resource usage and make git transports in MCP server safer#69

Open
Pringled wants to merge 10 commits intomainfrom
security-issues
Open

fix: Bound resource usage and make git transports in MCP server safer#69
Pringled wants to merge 10 commits intomainfrom
security-issues

Conversation

@Pringled
Copy link
Copy Markdown
Member

@Pringled Pringled commented May 5, 2026

This PR addresses several issues from #59:

  • Reject unsafe git transport schemes (ssh://, file://, SCP-form) in MCP repo argument
  • Cap _IndexCache at 10 entries (LRU) to prevent unbounded memory growth
  • Skip files over 1 MB during indexing to prevent DoS via large files
  • Add 60s timeout to git clone to prevent hung MCP server
  • Fix MCP server instructions that encouraged the LLM to hallucinate GitHub URLs

The first concern (file:// URLs allow arbitrary local-repo read) is not addressed yet, as this changes the functionality of Semble (and arguably makes it worse). This is something for a followup PR.

@Pringled Pringled requested a review from stephantul May 5, 2026 07:59
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/semble/index/create.py 100.00% <100.00%> (ø)
src/semble/index/index.py 100.00% <100.00%> (ø)
src/semble/mcp.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/semble/index/index.py
cmd = ["git", "clone", "--depth", "1", *(["--branch", ref] if ref else []), "--", url, tmp_dir]
try:
result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL, timeout=60)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider making this a constant, that makes clear this is in seconds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also make this configurable using a flag? It could prevent users with very big repositories from using semble.

Comment thread src/semble/mcp.py
"""Initialise an empty cache with a shared embedding model."""
self._model = model
self._tasks: dict[str, asyncio.Task[SembleIndex]] = {}
self._tasks: OrderedDict[str, asyncio.Task[SembleIndex]] = OrderedDict()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated change right? I don't think the old version was actually different, or do you rely on ordering?

Comment thread src/semble/mcp.py
source = repo or default_source
if not source:
raise ValueError(
"No repo specified and no default index. Pass an https:// git URL or local directory path as `repo`."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this error only mentions https because it's actually the preferred option? Same below, you only instruct the LLM to pass https, but http is also accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants