feat(gateway): /fro-bot add-project slash command#676
Conversation
3c0e4d5 to
7341986
Compare
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
Can merge after addressing the one blocking issue below.
Blocking issues
1. : stale guild cache means can miss recently-created channels (race window)
snapshots once at the top of the function (line 47), then calls against that frozen snapshot for every candidate. If two concurrent add-project invocations race through PRE_FLIGHT simultaneously, both will see the same snapshot — both will think a given candidate is free, both will attempt to create() it, and one will get an unexpected rejection that gets silently swallowed as a "transient error" before the loop advances to the next suffix. The winner gets name, the loser silently gets name-2 — with no indication that this happened, and the binding may end up pointing to the "wrong" suffixed channel.
The fix is straightforward: refresh the cache lookup from guild.channels.cache inside each iteration (not from a captured snapshot), so a creation that lands between two loop iterations is visible on the next nameExists check. This doesn't close the window entirely (Discord's cache is eventually consistent), but it tightens it significantly and prevents the "silent suffix escalation" behaviour.
This is blocking because it directly affects the correctness of the binding-to-channel association under concurrent use.
Non-blocking concerns
A. is module-level state — process restart silently resets all per-user counters
This is v1-acceptable (acknowledged in the PR description), but worth noting: a gateway crash and restart while a user is mid-rate-limit window will silently reset their counter. An operator who kills and restarts the process during an attempted abuse window gets no protection. Consider at least a comment in the code noting this limitation so future readers don't accidentally "fix" the restart without replacing the backing store.
B. returns on non-permission errors and the loop silently advances to the next candidate
At , any Discord error that doesn't match the permission regex causes to return , and treats that as "try the next name". A transient network error, rate-limit from Discord (429), or an unexpected error code during channel creation will silently burn through all 10 suffix candidates and return — giving the operator a misleading error message. These cases should be surfaced with a distinct error rather than being conflated with name collisions.
C. accepts and slugs with no further validation
The regex strips suffix and captures anything between slashes, including slugs with consecutive hyphens, leading/trailing hyphens, or component lengths that GitHub doesn't allow (e.g. a 40-character all-hex owner that looks like a SHA). This reaches the workspace agent unvalidated. This is low-risk for v1 since the GitHub App auth step will reject invalid repos, but the error message the operator sees in that case ("GitHub App not installed") will be confusing for a typo like . Not blocking, but a TODO comment would be appropriate.
D. Interaction window guard placement
The 14-minute guard at the start of CLONING (line 296) fires only if PRE_FLIGHT itself takes >14 minutes, which is essentially impossible (it's all in-process or fast GitHub API calls). The more realistic expiry scenario is PRE_FLIGHT + CLONING > 14 minutes, but there's no guard entering CREATING_CHANNEL that checks . The guard at line 366 (entering WRITING_BINDING via CREATING_CHANNEL) correctly catches that window. The first guard is harmless but misleading — worth removing or adding a comment explaining what it's actually protecting against.
E. defaults to (plaintext)
This is documented in the PR as v1-acceptable for the internal compose network. The default is appropriate, but a startup log line noting that plaintext HTTP is in use (when the URL is not HTTPS) would help operators notice misconfiguration in non-compose deployments.
Missing tests
- No test for concurrent invocations racing through channel creation — the stale-snapshot bug above is untested. A test with two simultaneous calls sharing the same guild mock would catch the issue.
- No test for returning due to a non-permission Discord error (e.g. 429 or a generic network error during ). The current tests only cover the permission-denied (403) path and explicit transient-failure-then-success paths. A test that makes all 10
createcalls throw a 429-shaped error should verify the returned error kind is meaningful (currently it would return , which is wrong). - : no test for (exact name only, no suffixes attempted).
Risk assessment: MED
The stale-cache race condition in channel creation is the primary risk: under concurrent use (two operators running add-project simultaneously), one invocation can silently land on a suffixed channel name with no operator-visible warning, and the binding will point to that name permanently. Given the v1 operator audience is small this is unlikely to trigger in practice, but the silent nature of the failure (no error returned, just a "wrong" channel name) makes it worth fixing before shipping.
Security posture is strong: IAT never logs, URL parsing doesn't echo user input into log fields, channel name hostile-character rejection is correct, and the captured-logger test suite covers the key paths. No auth bypass or injection vectors found.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26528328556 |
| Cache | hit |
| Session | ses_19574fa7effeEB2i6dUTcHUnSk |
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
Can merge after addressing the one blocking issue below.
Blocking issues
1. channels.ts: stale guild cache means nameExists() can miss recently-created channels (race window)
createChannelWithCollisionSuffix snapshots guild.channels.cache once at the top of the function (line 47), then calls nameExists() against that frozen snapshot for every candidate. If two concurrent add-project invocations race through PRE_FLIGHT simultaneously, both will see the same snapshot -- both will think a given candidate is free, both will attempt to create() it, and one will get an unexpected rejection that gets silently swallowed as a transient error before the loop advances to the next suffix. The winner gets the exact name, the loser silently gets name-2 with no indication that this happened, and the binding may end up pointing to the wrong suffixed channel.
The fix: refresh the cache lookup from guild.channels.cache inside each iteration rather than from a captured snapshot, so a creation that lands between two loop iterations is visible on the next nameExists check.
This is blocking because it directly affects the correctness of the binding-to-channel association under concurrent use.
Non-blocking concerns
A. rateLimitMap is module-level state -- process restart silently resets all per-user counters
V1-acceptable per the PR description, but worth a comment in the code noting this limitation so future readers do not accidentally fix the restart without replacing the backing store.
B. tryCreate returns null on non-permission errors and the loop silently advances
At channels.ts:88-91, any Discord error that does not match the permission regex causes tryCreate to return null, and createChannelWithCollisionSuffix treats that as try the next name. A transient network error, a 429 rate-limit from Discord, or any unexpected error during channel creation will silently burn through all 10 suffix candidates and return collision-exhausted -- giving the operator a misleading error message. These cases should surface a distinct create-failed error rather than being conflated with name collisions.
C. parseGitHubUrl accepts owner/repo slugs with no further validation
The regex strips .git and captures anything between slashes. Invalid GitHub slugs (consecutive hyphens, all-hex 40-char owner, etc.) reach the workspace agent unvalidated. Low-risk since GitHub App auth rejects invalid repos, but the resulting error message (GitHub App not installed) is confusing for a typo. A TODO comment would help.
D. First interaction-window guard is effectively unreachable
The 14-minute guard at the start of CLONING (add-project.ts:296) fires only if PRE_FLIGHT itself takes more than 14 minutes -- essentially impossible. The guard is harmless but misleading; consider removing it or adding a comment.
E. workspaceAgentUrl defaults to plaintext HTTP
V1-acceptable. A startup log line noting that plaintext HTTP is in use when the URL scheme is not HTTPS would help operators notice misconfiguration in non-compose deployments.
Missing tests
- No test for concurrent add-project invocations racing through channel creation -- the stale-snapshot bug above is untested. A test with two simultaneous calls sharing the same guild mock would catch it.
- No test for tryCreate returning null due to a non-permission Discord error (e.g. a 429-shaped error). The current suite only covers permission-denied (403) and explicit transient-failure-then-success paths. A test that makes all 10 create calls throw a 429-shaped error should assert the returned error kind is meaningful; currently it would return collision-exhausted, which is wrong.
- channels.ts: no test for maxSuffix: 1 (exact name only, no suffixes attempted).
Risk assessment: MED
The stale-cache race condition in channel creation is the primary risk: under concurrent use, one invocation can silently land on a suffixed channel name with no operator-visible warning, and the binding will point to that name permanently. Given the v1 operator audience is small this is unlikely to trigger in practice, but the silent failure mode makes it worth fixing before shipping.
Security posture is strong: IAT never logs, URL parsing does not echo user input into log fields, channel name hostile-character rejection is correct, and the captured-logger test suite covers the key error paths. No auth bypass or injection vectors found.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26528328556 |
| Cache | hit |
| Session | ses_19574fa7effeEB2i6dUTcHUnSk |
Wires PR A bindings store + PR B GitHub App auth + PR C workspace-agent into the first user-facing gateway feature. Operator runs `/fro-bot add-project url:<git-url> [channel:<name>]` and the gateway authenticates, clones the repo into the workspace container, creates a Discord channel, and writes a binding to S3. The /fro-bot parent command is refactored into a factory (createFroBotCommand) that takes injected dependencies and dispatches ping + add-project subcommands. Permission checks use interaction.appPermissions (always populated) instead of guild member cache (requires GuildMembers intent we don't hold). Workspace-api client parses structured CloneFailure bodies from any HTTP status, preserving error codes like disk-full, repo-exists, overloaded, clone-timeout through to operator-facing recovery copy. Native fetch with AbortSignal.timeout(5min). Never logs request/response body. Captured-logger test asserts no ghs_* token leaks across all error paths. Channel collision logic iterates name, name-2, ..., name-10; never returns an existing channel by name (would silently bind a new repo to an unrelated existing channel). Returns collision-exhausted if all 10 candidates exist. 273/273 gateway tests pass. Plan doc deepened with adversarial + feasibility + security review (16 findings applied).
7341986 to
c4278b0
Compare
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
Both blocking issues from the prior review have been addressed. Approving.
Blocking issues
None. The two issues flagged in the prior CHANGES_REQUESTED review are resolved:
-
Stale cache race (channels.ts) -- Fixed. The snapshot-once approach is replaced with a per-iteration cache.find() call (line 55), with an explanatory comment referencing the exact race. The new concurrent-race test at channels.test.ts:256 validates the live-cache behaviour by sharing a mutable channels array between two simultaneous invocations.
-
tryCreate silent-swallow -- Fixed. The function now returns a proper discriminated union (ok | permission-denied | name-taken | create-failed), and create-failed short-circuits immediately rather than burning suffix candidates. The new test at channels.test.ts:232 verifies that a 429-shaped error returns create-failed after exactly one create() call.
Non-blocking concerns
The non-blocking items from the prior review (rateLimitMap comment, slug validation TODO, unreachable window guard, HTTP plaintext log line) remain unaddressed, but they were explicitly flagged as non-blocking and are acceptable for v1.
Missing tests
None remaining. All three missing-test items from the prior review are covered:
- Concurrent race test: channels.test.ts:256
- 429/create-failed short-circuit test: channels.test.ts:232
- maxSuffix:1 boundary test: channels.test.ts:307
Risk assessment: LOW
The blocking race condition is fixed with the correct approach. The discriminated result type on tryCreate is strictly better than the prior null-return semantics. Security invariants (IAT never logged, hostile-char rejection, canonicalization) are unchanged and correct.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26529107464 |
| Cache | hit |
| Session | ses_19574fa7effeEB2i6dUTcHUnSk |
Ships the first user-facing gateway feature. Operator runs
/fro-bot add-project url:<git-url> [channel:<name>]and the gateway authenticates to GitHub, clones the repo into the workspace container, creates a Discord channel for it, and writes a binding record to S3 so future@fro-botmentions in that channel know which repo to act against.What changed
New surface —
/fro-bot add-projectslash commandFive-phase orchestration:
owner/repoto lowercase, validate channel name (hostile-char rejection), check existing binding, mint installation access token via the GitHub App/cloneendpoint withAbortSignal.timeout(5min)name,name-2, ...,name-10collision suffix logic/fro-botparent command refactorfro-bot.tsis now a factory (createFroBotCommand(deps)) that builds the parentSlashCommandBuilderwith bothpingandadd-projectsubcommands and dispatches oninteraction.options.getSubcommand().ping.tsandadd-project.tsexport subcommand handlers consumed by the parent. Dependencies are injected at registry construction time (no module-global state).Workspace-api client (
packages/gateway/src/workspace-api/)Native fetch with 5-minute timeout. Mirrors the workspace-agent's exact type contract including all 17
CloneErrorCodevariants. Parses structuredCloneFailurebodies on any HTTP status, preserving error semantics (disk-full,repo-exists,overloaded,clone-timeout, etc.) through to operator-facing recovery copy. Response-path integrity check (must end with/{owner}/{repo}). Captured-logger test asserts no token-shaped string ever appears in any log line across happy and all error paths.Discord channel helper (
packages/gateway/src/discord/channels.ts)findOrCreateChannel(guild, name, {maxSuffix: 10})— never returns an existing channel by name (preventing silent binding of new repos to unrelated existing channels). Iteratesname,name-2, ...,name-10to find the first non-existing candidate and creates it. Returnscollision-exhaustedwhen all 10 candidates are taken.Permission check uses
interaction.appPermissionsappPermissionsis reliably populated by Discord for every guild interaction regardless of intent configuration. The gateway runs without theGuildMembersprivileged intent, soguild.members.cache.get()returnsundefinedfor the bot itself — defending against this false-negative is whatappPermissionssolves.Per-user rate limit
In-memory bucket: 5 invocations per 60 seconds per Discord user. Sixth+ invocation gets ephemeral "rate-limited; try again in N seconds" without entering any phase.
Plan + design
docs/plans/2026-05-23-001-feat-gateway-unit-5-add-project-plan.mdcarries the deepened PR D spec (adversarial + feasibility + security review applied) and the v1 acceptance call-outs:sandbox-net: plaintext HTTP gateway↔workspace-agent is acceptable v1 only because the network is internal compose-only with no published ports. HMAC request signing required if that ever changes.phase,outcome,errorKind,durationMs), token never logged.Files
New (8):
workspace-api/{types,client,client.test}.ts,discord/channels{,.test}.ts,discord/commands/{fro-bot,add-project,add-project.test}.tsModified (5):
discord/commands/{ping,ping.test,index,index.test}.ts(factory pattern + dispatch),config.ts(+workspaceAgentUrl),program.ts(compose + inject deps),program.test.ts,github/app-client.ts,discord/mentions.test.tsVerification
ghs_*token appears in any log call