Skip to content

feat(mt#1028): KB activity source — GitHub Issues/PRs provider#710

Open
minsky-ai[bot] wants to merge 2 commits into
mainfrom
task/mt-1028
Open

feat(mt#1028): KB activity source — GitHub Issues/PRs provider#710
minsky-ai[bot] wants to merge 2 commits into
mainfrom
task/mt-1028

Conversation

@minsky-ai
Copy link
Copy Markdown
Contributor

@minsky-ai minsky-ai Bot commented Apr 22, 2026

Summary

Ships the GitHub half of KB activity sources. Linear provider deferred to mt#1117 — same split pattern as mt#1027 Layer 3 / mt#1108 Layer 4: subagent hit its turn limit mid-implementation; main agent finished tests + schema wiring + service registration and committed a shippable GitHub-only unit.

What's in this PR

  • github-activity-provider.ts (407 lines) — KnowledgeSourceProvider with sourceType: "github-activity". Indexes issue bodies + issue comments + PR review comments. Attribution headers per segment (**Issue body** (by @alice), **Comment** (by @bob, 2026-03-01)). Document ID format: {owner}/{repo}#{number}.
  • isRetryableGitHubError predicate (rate-limit aware) added to embedding-service-openai.ts. Retryable: 429, 403 with rate-limit language, 5xx. Not retryable: 401, 404, 422, 403 without rate-limit reason.
  • Default bot filter: github-actions[bot], dependabot[bot], renovate[bot], etc. Users can add more via excludeAuthors.

Config schema

  • knowledgeBaseEntrySchema.type enum extended with "github-activity" + "linear" (Linear stub for mt#1117).
  • New entry fields: owner, repo, states?: "open"|"closed"|"all", labels?, excludeLabels?, excludeAuthors?, maxAgeDays?.
  • sensitivity: "public" | "internal" | "restricted" (default "public") — restricted sources excluded from default search results. Shared across all future activity + conversational sources.

Service wiring

  • knowledge-service.ts createProvider switch: new "github-activity" case, lazy-loads via dynamic import. Credential + owner/repo validation mirrors the Notion pattern.

Tests (8 tests, all pass)

  • listDocuments yields documents with attribution headers for body + comments
  • Default bot-author exclusion (body + comment)
  • excludeLabels drops issues with any matching label
  • labels requires all listed labels present
  • maxAgeDays filters by updated_at
  • fetchDocument by owner/repo#number id
  • Malformed id rejection
  • 429 retry succeeds on second attempt

Out of scope (explicitly deferred to follow-ups)

  • Linear providermt#1117 (filed; Linear half with GraphQL API + same schema shape)
  • Task-backend overlap docs note in docs/architecture.md §9 — will consolidate into the mt#1117 docs rollup (same pattern as mt#1027 → mt#1108 docs consolidation)
  • CLI warning when a repo is configured as both task backend AND activity source — deferred to mt#1117 (low-risk; information-only)
  • Jira provider, GitHub Discussions — separate future tasks

Action required

Merging this closes mt#1028 as scope-reduced. Linear (mt#1117) ships separately. The sensitivity config field + the activity-track type-enum + the schema shape are all in place, so mt#1117 is a pure provider + tests task.

Test plan

  • 8 unit tests pass locally
  • Pre-commit hook: format, tsc, lint, tests, secrets scan all pass
  • lint:strict (max-warnings=0) clean
  • End-to-end verification against a real GitHub repo → deferred (requires token + real issues; will ride along with mt#1117)

🤖 Implementation by a subagent through the provider file, finished from main after the subagent hit its turn limit. Main agent wrote the 8 unit tests, extended types.ts + knowledge-bases.ts schemas, registered the provider in knowledge-service.ts, fixed three non-null-assertion lint warnings, and filed mt#1117 for the Linear half.

edobry added 2 commits April 22, 2026 19:15
Ships the GitHub half of the KB activity-sources track. Linear provider
deferred to a sibling follow-up task (mt#1028 scope-reduced at subagent
turn-limit; GitHub half is a coherent ship).

- providers/github-activity-provider.ts: KnowledgeSourceProvider
  implementation for GitHub Issues + PRs. Indexes issue bodies + comments
  + PR review comments with attribution headers. Default bot-author
  filter (github-actions, dependabot, renovate, etc). Label include +
  exclude filters. maxAgeDays cutoff. @octokit-compatible shapes but
  uses direct fetch + IntelligentRetryService for test injection.
  isRetryableGitHubError predicate (rate-limit aware) in
  embedding-service-openai.ts.
- providers/github-activity-provider.test.ts: 8 tests covering body +
  comment ingestion with attribution, default bot exclusion, label
  include/exclude filters, maxAgeDays cutoff, fetchDocument by id,
  malformed-id rejection, 429 retry.
- types.ts: KnowledgeSourceConfig type enum extended with
  github-activity + linear (Linear stub for upcoming sibling).
  New fields: owner, repo, states, labels, excludeLabels,
  excludeAuthors, maxAgeDays, sensitivity.
- configuration/schemas/knowledge-bases.ts: matching Zod schema
  extension + sensitivity enum (default public). Passthrough retained.
- knowledge-service.ts: new github-activity case in createProvider
  switch, lazy-loads via dynamic import; credential + owner/repo
  validation mirrors Notion pattern.

Out of scope (deferred to follow-up):
- Linear provider implementation
- Task-backend overlap docs note in docs/architecture.md §9
- CLI warning when repo is both task backend + activity source

8/8 tests pass locally. typecheck clean. lint:strict clean.
@minsky-ai minsky-ai Bot added the authorship/co-authored Co-authored by human and AI agent label Apr 22, 2026
Copy link
Copy Markdown

@minsky-reviewer minsky-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown


Findings

[BLOCKING] Silent data corruption on fetchDocument when ID owner/repo differs from provider config

  • File: src/domain/knowledge/providers/github-activity-provider.ts: approx. 380–407 (buildDocument return object)
  • Evidence: fetchDocument accepts any id of the form owner/repo#number (lines ~267–296) and fetches the issue for those owner/repo values. However, buildDocument constructs the returned document using this.owner and this.repo:
    • id: ${this.owner}/${this.repo}#${issue.number}
    • title: ${this.owner}/${this.repo} #${issue.number}: ${issue.title}
    • metadata.owner: this.owner
    • metadata.repo: this.repo
  • Failure mode: If fetchDocument is called with an ID that does not match the provider’s configured owner/repo, the returned document will be incorrectly labeled (IDs and metadata will point to the provider’s configured repo, not the fetched issue’s repo). This is a correctness and attribution bug that can lead to cross-repo document contamination.
  • Remediation options:
    • Enforce that parsed owner/repo from the ID matches this.owner/this.repo and throw if not.
    • Or pass the parsed owner/repo through to buildDocument (and use them when constructing id/title/metadata) so the document reflects the real repository.

[BLOCKING] lastModified omits PR review comments’ timestamps

  • File: src/domain/knowledge/providers/github-activity-provider.ts: approx. 360–370 (within buildDocument)
  • Evidence: lastModified is computed as max of issue.updated_at and issue/PR comments’ updated_at:
    const allDates = [new Date(issue.updated_at), ...comments.map((c) => new Date(c.updated_at))];
    It does not include reviewComments when isPR is true.
  • Failure mode: For PRs where the latest activity is a review comment, lastModified will be stale. This affects sync freshness, staleness classification, and any incremental sync that relies on lastModified.
  • Remediation: Include PR reviewComments’ updated_at in the allDates aggregation.

[NON-BLOCKING] Retry predicate misses common network error strings (fetch failed), risking no retry on transient network faults

  • File: src/domain/ai/embedding-service-openai.ts: approx. 15–45 (isRetryableGitHubError)
  • Evidence: The function retries when the message matches:
    • /^GitHub API error: (429|5\d\d)/i or 429|rate.limit|503|Service Unavailable|ECONNRESET|ETIMEDOUT
  • In Node/Bun, transient network errors are frequently thrown as TypeError: fetch failed or similar messages that do not include ECONNRESET/ETIMEDOUT text. Those would not match this predicate, so IntelligentRetryService would not retry, even though they are transient.
  • Suggestion: Extend the predicate to include generic fetch failures (e.g., /\bfetch failed\b/i, /\bnetwork error\b/i, /\bsocket hang up\b/i) and/or detect TypeError instances from fetch.

[NON-BLOCKING] Service type/schema inconsistency for linear

  • Files:
    • src/domain/configuration/schemas/knowledge-bases.ts: type enum includes "linear"
    • src/domain/knowledge/types.ts: KnowledgeSourceConfig.type includes "linear"
    • src/domain/knowledge/knowledge-service.ts: createProvider switch has no "linear" case; default error message lists “notion”, “google-docs”, “github-activity” only
  • Risk: A user configuring a linear source (allowed by schema/types per PR) will get a runtime “Unsupported knowledge source type” error. The PR description calls Linear “stub for mt#1117,” but the schema admission without service wiring is a papercut.
  • Suggestion: Either remove “linear” from the enum until wired, or add a temporary helpful error (explicit case "linear" with a clear “not implemented yet, see mt#1117” message) to avoid confusion.

[NON-BLOCKING] Test gaps

  • File: src/domain/knowledge/providers/github-activity-provider.test.ts
  • Gaps:
    • No test that PR review comments are included in content and affect lastModified (would have caught the lastModified omission).
    • No test asserting fetchDocument rejects IDs for other owner/repo or at least that it preserves/validates owner/repo consistency.
    • No test for states filter (open/closed/all) being propagated to the issues query and respected.
    • No test for 403 primary rate limit retry path (message includes “rate limit”).
    • No test for custom excludeAuthors beyond the default bot list.
  • Suggestion: Add tests covering these behaviors, particularly the review comment freshness and owner/repo mismatch cases.

[PRE-EXISTING] “confluence” present in type enums but not supported by knowledge-service

  • Files:
    • src/domain/knowledge/types.ts and src/domain/configuration/schemas/knowledge-bases.ts: include "confluence"
    • src/domain/knowledge/knowledge-service.ts: no “confluence” case in createProvider and default error message omits it
  • This appears to predate the PR; mentioning for awareness only.

Spec verification

  • Provider implemented: GitHub Issues/PRs indexing with attribution headers
    • Met (implementation present). Note: review comments are included in content but not covered by tests, and lastModified excludes review comment timestamps (see blocking).
  • isRetryableGitHubError added with rate-limit awareness
    • Met (function present). Caveat: see network error gap above.
  • Default bot filter and excludeAuthors override
    • Met (DEFAULT_BOT_AUTHORS list; excludeAuthors merged). Tests cover bot exclusion for body and comments.
  • Config schema extended (type enum includes "github-activity", "linear"; fields owner, repo, states, labels, excludeLabels, excludeAuthors, maxAgeDays; sensitivity)
    • Met.
  • Service wiring: knowledge-service adds "github-activity" case, dynamic import, token/owner/repo validation
    • Met. Note: “linear” not wired; see non-blocking inconsistency above.
  • Tests (8)
    • listDocuments yields attribution headers for body + comments: Met
    • Default bot-author exclusion: Met
    • excludeLabels drops issues: Met
    • labels requires all: Met
    • maxAgeDays filters by updated_at: Met
    • fetchDocument by id: Met
    • Malformed id rejection: Met
    • 429 retry succeeds: Met

Documentation impact

  • Needs updates:
    • Document the new “github-activity” provider configuration and semantics (fields: owner, repo, states, labels/excludeLabels, excludeAuthors, maxAgeDays).
    • Document the new sensitivity field and how it affects default search behavior across sources.
    • Note that “linear” appears in the schema but is not yet supported (or remove from schema until mt#1117 ships).
  • PR description states some docs are deferred; please ensure a tracking issue links to the required docs changes.

Conclusion

REQUEST_CHANGES

Blocking issues to address before merge:

  • Fix fetchDocument to either enforce owner/repo match with provider config or construct the document using the owner/repo parsed from the ID (and ensure metadata/id/title reflect the actual repo).
  • Include PR review comments’ updated_at in lastModified computation. Add a test to prevent regression.

Non-blocking improvements are noted above (retry predicate resilience, linear wiring clarity, and test coverage).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

authorship/co-authored Co-authored by human and AI agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant