Skip to content

feat(gateway): /fro-bot add-project slash command#676

Open
marcusrbrown wants to merge 1 commit into
mainfrom
feat/gateway-add-project-command
Open

feat(gateway): /fro-bot add-project slash command#676
marcusrbrown wants to merge 1 commit into
mainfrom
feat/gateway-add-project-command

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

Ships the first user-facing gateway feature. Operator runs /fro-bot add-project url:<git-url> [channel:<name>] and the gateway authenticates to GitHub, clones the repo into the workspace container, creates a Discord channel for it, and writes a binding record to S3 so future @fro-bot mentions in that channel know which repo to act against.

What changed

New surface — /fro-bot add-project slash command

Five-phase orchestration:

  1. PRE_FLIGHT — defer ephemerally, parse + canonicalize owner/repo to lowercase, validate channel name (hostile-char rejection), check existing binding, mint installation access token via the GitHub App
  2. CLONING — POST to the workspace-agent's /clone endpoint with AbortSignal.timeout(5min)
  3. CREATING_CHANNEL — defensive permission re-check, then create the channel with name, name-2, ..., name-10 collision suffix logic
  4. WRITING_BINDING — persist to S3 with two-key schema (primary + by-channel-id index)
  5. READY — confirmation message in the source thread, welcome embed posted in the new channel

/fro-bot parent command refactor

fro-bot.ts is now a factory (createFroBotCommand(deps)) that builds the parent SlashCommandBuilder with both ping and add-project subcommands and dispatches on interaction.options.getSubcommand(). ping.ts and add-project.ts export subcommand handlers consumed by the parent. Dependencies are injected at registry construction time (no module-global state).

Workspace-api client (packages/gateway/src/workspace-api/)

Native fetch with 5-minute timeout. Mirrors the workspace-agent's exact type contract including all 17 CloneErrorCode variants. Parses structured CloneFailure bodies on any HTTP status, preserving error semantics (disk-full, repo-exists, overloaded, clone-timeout, etc.) through to operator-facing recovery copy. Response-path integrity check (must end with /{owner}/{repo}). Captured-logger test asserts no token-shaped string ever appears in any log line across happy and all error paths.

Discord channel helper (packages/gateway/src/discord/channels.ts)

findOrCreateChannel(guild, name, {maxSuffix: 10}) — never returns an existing channel by name (preventing silent binding of new repos to unrelated existing channels). Iterates name, name-2, ..., name-10 to find the first non-existing candidate and creates it. Returns collision-exhausted when all 10 candidates are taken.

Permission check uses interaction.appPermissions

appPermissions is reliably populated by Discord for every guild interaction regardless of intent configuration. The gateway runs without the GuildMembers privileged intent, so guild.members.cache.get() returns undefined for the bot itself — defending against this false-negative is what appPermissions solves.

Per-user rate limit

In-memory bucket: 5 invocations per 60 seconds per Discord user. Sixth+ invocation gets ephemeral "rate-limited; try again in N seconds" without entering any phase.

Plan + design

docs/plans/2026-05-23-001-feat-gateway-unit-5-add-project-plan.md carries the deepened PR D spec (adversarial + feasibility + security review applied) and the v1 acceptance call-outs:

  • Orchestration durability: in-memory only. Process crash mid-flow leaves orphaned state with operator-facing manual recovery instructions per partial-failure shape (clone-no-channel, channel-no-binding, binding-no-welcome). Persistent setup-state is deferred.
  • Trust model for sandbox-net: plaintext HTTP gateway↔workspace-agent is acceptable v1 only because the network is internal compose-only with no published ports. HMAC request signing required if that ever changes.
  • Observability contract: correlationId from Discord interaction ID, per-phase structured log lines (phase, outcome, errorKind, durationMs), token never logged.

Files

New (8): workspace-api/{types,client,client.test}.ts, discord/channels{,.test}.ts, discord/commands/{fro-bot,add-project,add-project.test}.ts

Modified (5): discord/commands/{ping,ping.test,index,index.test}.ts (factory pattern + dispatch), config.ts (+workspaceAgentUrl), program.ts (compose + inject deps), program.test.ts, github/app-client.ts, discord/mentions.test.ts

Verification

  • 273/273 gateway tests pass (+25 net)
  • 0 lint errors
  • Types clean
  • Captured-logger security test covers happy + 7 error paths (timeout, clone-error, response-mismatch, parse-error, network-error, http-error, malformed-success); no ghs_* token appears in any log call

@marcusrbrown marcusrbrown requested a review from fro-bot as a code owner May 27, 2026 17:25
@fro-bot fro-bot added the agent: working Agent is currently working on this label May 27, 2026
@marcusrbrown marcusrbrown force-pushed the feat/gateway-add-project-command branch from 3c0e4d5 to 7341986 Compare May 27, 2026 17:43
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.


Blocking issues

1. : stale guild cache means can miss recently-created channels (race window)

snapshots once at the top of the function (line 47), then calls against that frozen snapshot for every candidate. If two concurrent add-project invocations race through PRE_FLIGHT simultaneously, both will see the same snapshot — both will think a given candidate is free, both will attempt to create() it, and one will get an unexpected rejection that gets silently swallowed as a "transient error" before the loop advances to the next suffix. The winner gets name, the loser silently gets name-2 — with no indication that this happened, and the binding may end up pointing to the "wrong" suffixed channel.

The fix is straightforward: refresh the cache lookup from guild.channels.cache inside each iteration (not from a captured snapshot), so a creation that lands between two loop iterations is visible on the next nameExists check. This doesn't close the window entirely (Discord's cache is eventually consistent), but it tightens it significantly and prevents the "silent suffix escalation" behaviour.

This is blocking because it directly affects the correctness of the binding-to-channel association under concurrent use.


Non-blocking concerns

A. is module-level state — process restart silently resets all per-user counters

This is v1-acceptable (acknowledged in the PR description), but worth noting: a gateway crash and restart while a user is mid-rate-limit window will silently reset their counter. An operator who kills and restarts the process during an attempted abuse window gets no protection. Consider at least a comment in the code noting this limitation so future readers don't accidentally "fix" the restart without replacing the backing store.

B. returns on non-permission errors and the loop silently advances to the next candidate

At , any Discord error that doesn't match the permission regex causes to return , and treats that as "try the next name". A transient network error, rate-limit from Discord (429), or an unexpected error code during channel creation will silently burn through all 10 suffix candidates and return — giving the operator a misleading error message. These cases should be surfaced with a distinct error rather than being conflated with name collisions.

C. accepts and slugs with no further validation

The regex strips suffix and captures anything between slashes, including slugs with consecutive hyphens, leading/trailing hyphens, or component lengths that GitHub doesn't allow (e.g. a 40-character all-hex owner that looks like a SHA). This reaches the workspace agent unvalidated. This is low-risk for v1 since the GitHub App auth step will reject invalid repos, but the error message the operator sees in that case ("GitHub App not installed") will be confusing for a typo like . Not blocking, but a TODO comment would be appropriate.

D. Interaction window guard placement

The 14-minute guard at the start of CLONING (line 296) fires only if PRE_FLIGHT itself takes >14 minutes, which is essentially impossible (it's all in-process or fast GitHub API calls). The more realistic expiry scenario is PRE_FLIGHT + CLONING > 14 minutes, but there's no guard entering CREATING_CHANNEL that checks . The guard at line 366 (entering WRITING_BINDING via CREATING_CHANNEL) correctly catches that window. The first guard is harmless but misleading — worth removing or adding a comment explaining what it's actually protecting against.

E. defaults to (plaintext)

This is documented in the PR as v1-acceptable for the internal compose network. The default is appropriate, but a startup log line noting that plaintext HTTP is in use (when the URL is not HTTPS) would help operators notice misconfiguration in non-compose deployments.


Missing tests

  • No test for concurrent invocations racing through channel creation — the stale-snapshot bug above is untested. A test with two simultaneous calls sharing the same guild mock would catch the issue.
  • No test for returning due to a non-permission Discord error (e.g. 429 or a generic network error during ). The current tests only cover the permission-denied (403) path and explicit transient-failure-then-success paths. A test that makes all 10 create calls throw a 429-shaped error should verify the returned error kind is meaningful (currently it would return , which is wrong).
  • : no test for (exact name only, no suffixes attempted).

Risk assessment: MED

The stale-cache race condition in channel creation is the primary risk: under concurrent use (two operators running add-project simultaneously), one invocation can silently land on a suffixed channel name with no operator-visible warning, and the binding will point to that name permanently. Given the v1 operator audience is small this is unlikely to trigger in practice, but the silent nature of the failure (no error returned, just a "wrong" channel name) makes it worth fixing before shipping.

Security posture is strong: IAT never logs, URL parsing doesn't echo user input into log fields, channel name hostile-character rejection is correct, and the captured-logger test suite covers the key paths. No auth bypass or injection vectors found.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26528328556
Cache hit
Session ses_19574fa7effeEB2i6dUTcHUnSk

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.


Blocking issues

1. channels.ts: stale guild cache means nameExists() can miss recently-created channels (race window)

createChannelWithCollisionSuffix snapshots guild.channels.cache once at the top of the function (line 47), then calls nameExists() against that frozen snapshot for every candidate. If two concurrent add-project invocations race through PRE_FLIGHT simultaneously, both will see the same snapshot -- both will think a given candidate is free, both will attempt to create() it, and one will get an unexpected rejection that gets silently swallowed as a transient error before the loop advances to the next suffix. The winner gets the exact name, the loser silently gets name-2 with no indication that this happened, and the binding may end up pointing to the wrong suffixed channel.

The fix: refresh the cache lookup from guild.channels.cache inside each iteration rather than from a captured snapshot, so a creation that lands between two loop iterations is visible on the next nameExists check.

This is blocking because it directly affects the correctness of the binding-to-channel association under concurrent use.


Non-blocking concerns

A. rateLimitMap is module-level state -- process restart silently resets all per-user counters

V1-acceptable per the PR description, but worth a comment in the code noting this limitation so future readers do not accidentally fix the restart without replacing the backing store.

B. tryCreate returns null on non-permission errors and the loop silently advances

At channels.ts:88-91, any Discord error that does not match the permission regex causes tryCreate to return null, and createChannelWithCollisionSuffix treats that as try the next name. A transient network error, a 429 rate-limit from Discord, or any unexpected error during channel creation will silently burn through all 10 suffix candidates and return collision-exhausted -- giving the operator a misleading error message. These cases should surface a distinct create-failed error rather than being conflated with name collisions.

C. parseGitHubUrl accepts owner/repo slugs with no further validation

The regex strips .git and captures anything between slashes. Invalid GitHub slugs (consecutive hyphens, all-hex 40-char owner, etc.) reach the workspace agent unvalidated. Low-risk since GitHub App auth rejects invalid repos, but the resulting error message (GitHub App not installed) is confusing for a typo. A TODO comment would help.

D. First interaction-window guard is effectively unreachable

The 14-minute guard at the start of CLONING (add-project.ts:296) fires only if PRE_FLIGHT itself takes more than 14 minutes -- essentially impossible. The guard is harmless but misleading; consider removing it or adding a comment.

E. workspaceAgentUrl defaults to plaintext HTTP

V1-acceptable. A startup log line noting that plaintext HTTP is in use when the URL scheme is not HTTPS would help operators notice misconfiguration in non-compose deployments.


Missing tests

  • No test for concurrent add-project invocations racing through channel creation -- the stale-snapshot bug above is untested. A test with two simultaneous calls sharing the same guild mock would catch it.
  • No test for tryCreate returning null due to a non-permission Discord error (e.g. a 429-shaped error). The current suite only covers permission-denied (403) and explicit transient-failure-then-success paths. A test that makes all 10 create calls throw a 429-shaped error should assert the returned error kind is meaningful; currently it would return collision-exhausted, which is wrong.
  • channels.ts: no test for maxSuffix: 1 (exact name only, no suffixes attempted).

Risk assessment: MED

The stale-cache race condition in channel creation is the primary risk: under concurrent use, one invocation can silently land on a suffixed channel name with no operator-visible warning, and the binding will point to that name permanently. Given the v1 operator audience is small this is unlikely to trigger in practice, but the silent failure mode makes it worth fixing before shipping.

Security posture is strong: IAT never logs, URL parsing does not echo user input into log fields, channel name hostile-character rejection is correct, and the captured-logger test suite covers the key error paths. No auth bypass or injection vectors found.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26528328556
Cache hit
Session ses_19574fa7effeEB2i6dUTcHUnSk

@fro-bot fro-bot removed the agent: working Agent is currently working on this label May 27, 2026
Wires PR A bindings store + PR B GitHub App auth + PR C workspace-agent
into the first user-facing gateway feature. Operator runs
`/fro-bot add-project url:<git-url> [channel:<name>]` and the gateway
authenticates, clones the repo into the workspace container, creates a
Discord channel, and writes a binding to S3.

The /fro-bot parent command is refactored into a factory
(createFroBotCommand) that takes injected dependencies and dispatches
ping + add-project subcommands. Permission checks use
interaction.appPermissions (always populated) instead of guild member
cache (requires GuildMembers intent we don't hold).

Workspace-api client parses structured CloneFailure bodies from any HTTP
status, preserving error codes like disk-full, repo-exists, overloaded,
clone-timeout through to operator-facing recovery copy. Native fetch
with AbortSignal.timeout(5min). Never logs request/response body.
Captured-logger test asserts no ghs_* token leaks across all error
paths.

Channel collision logic iterates name, name-2, ..., name-10; never
returns an existing channel by name (would silently bind a new repo to
an unrelated existing channel). Returns collision-exhausted if all 10
candidates exist.

273/273 gateway tests pass. Plan doc deepened with adversarial +
feasibility + security review (16 findings applied).
@marcusrbrown marcusrbrown force-pushed the feat/gateway-add-project-command branch from 7341986 to c4278b0 Compare May 27, 2026 17:58
@fro-bot fro-bot added the agent: working Agent is currently working on this label May 27, 2026
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

Both blocking issues from the prior review have been addressed. Approving.

Blocking issues

None. The two issues flagged in the prior CHANGES_REQUESTED review are resolved:

  1. Stale cache race (channels.ts) -- Fixed. The snapshot-once approach is replaced with a per-iteration cache.find() call (line 55), with an explanatory comment referencing the exact race. The new concurrent-race test at channels.test.ts:256 validates the live-cache behaviour by sharing a mutable channels array between two simultaneous invocations.

  2. tryCreate silent-swallow -- Fixed. The function now returns a proper discriminated union (ok | permission-denied | name-taken | create-failed), and create-failed short-circuits immediately rather than burning suffix candidates. The new test at channels.test.ts:232 verifies that a 429-shaped error returns create-failed after exactly one create() call.

Non-blocking concerns

The non-blocking items from the prior review (rateLimitMap comment, slug validation TODO, unreachable window guard, HTTP plaintext log line) remain unaddressed, but they were explicitly flagged as non-blocking and are acceptable for v1.

Missing tests

None remaining. All three missing-test items from the prior review are covered:

  • Concurrent race test: channels.test.ts:256
  • 429/create-failed short-circuit test: channels.test.ts:232
  • maxSuffix:1 boundary test: channels.test.ts:307

Risk assessment: LOW

The blocking race condition is fixed with the correct approach. The discriminated result type on tryCreate is strictly better than the prior null-return semantics. Security invariants (IAT never logged, hostile-char rejection, canonicalization) are unchanged and correct.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26529107464
Cache hit
Session ses_19574fa7effeEB2i6dUTcHUnSk

@fro-bot fro-bot removed the agent: working Agent is currently working on this label May 27, 2026
@fro-bot fro-bot mentioned this pull request May 27, 2026
46 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants