feat(mt#1028): KB activity source — GitHub Issues/PRs provider#710
feat(mt#1028): KB activity source — GitHub Issues/PRs provider#710minsky-ai[bot] wants to merge 2 commits into
Conversation
Ships the GitHub half of the KB activity-sources track. Linear provider deferred to a sibling follow-up task (mt#1028 scope-reduced at subagent turn-limit; GitHub half is a coherent ship). - providers/github-activity-provider.ts: KnowledgeSourceProvider implementation for GitHub Issues + PRs. Indexes issue bodies + comments + PR review comments with attribution headers. Default bot-author filter (github-actions, dependabot, renovate, etc). Label include + exclude filters. maxAgeDays cutoff. @octokit-compatible shapes but uses direct fetch + IntelligentRetryService for test injection. isRetryableGitHubError predicate (rate-limit aware) in embedding-service-openai.ts. - providers/github-activity-provider.test.ts: 8 tests covering body + comment ingestion with attribution, default bot exclusion, label include/exclude filters, maxAgeDays cutoff, fetchDocument by id, malformed-id rejection, 429 retry. - types.ts: KnowledgeSourceConfig type enum extended with github-activity + linear (Linear stub for upcoming sibling). New fields: owner, repo, states, labels, excludeLabels, excludeAuthors, maxAgeDays, sensitivity. - configuration/schemas/knowledge-bases.ts: matching Zod schema extension + sensitivity enum (default public). Passthrough retained. - knowledge-service.ts: new github-activity case in createProvider switch, lazy-loads via dynamic import; credential + owner/repo validation mirrors Notion pattern. Out of scope (deferred to follow-up): - Linear provider implementation - Task-backend overlap docs note in docs/architecture.md §9 - CLI warning when repo is both task backend + activity source 8/8 tests pass locally. typecheck clean. lint:strict clean.
There was a problem hiding this comment.
Independent adversarial review (Chinese-wall)
Reviewer: minsky-reviewer[bot] via openai:gpt-5
Tier: unknown
Findings
[BLOCKING] Silent data corruption on fetchDocument when ID owner/repo differs from provider config
- File: src/domain/knowledge/providers/github-activity-provider.ts: approx. 380–407 (buildDocument return object)
- Evidence: fetchDocument accepts any id of the form owner/repo#number (lines ~267–296) and fetches the issue for those owner/repo values. However, buildDocument constructs the returned document using this.owner and this.repo:
- id:
${this.owner}/${this.repo}#${issue.number} - title:
${this.owner}/${this.repo} #${issue.number}: ${issue.title} - metadata.owner: this.owner
- metadata.repo: this.repo
- id:
- Failure mode: If fetchDocument is called with an ID that does not match the provider’s configured owner/repo, the returned document will be incorrectly labeled (IDs and metadata will point to the provider’s configured repo, not the fetched issue’s repo). This is a correctness and attribution bug that can lead to cross-repo document contamination.
- Remediation options:
- Enforce that parsed owner/repo from the ID matches this.owner/this.repo and throw if not.
- Or pass the parsed owner/repo through to buildDocument (and use them when constructing id/title/metadata) so the document reflects the real repository.
[BLOCKING] lastModified omits PR review comments’ timestamps
- File: src/domain/knowledge/providers/github-activity-provider.ts: approx. 360–370 (within buildDocument)
- Evidence: lastModified is computed as max of issue.updated_at and issue/PR comments’ updated_at:
const allDates = [new Date(issue.updated_at), ...comments.map((c) => new Date(c.updated_at))];
It does not include reviewComments when isPR is true. - Failure mode: For PRs where the latest activity is a review comment, lastModified will be stale. This affects sync freshness, staleness classification, and any incremental sync that relies on lastModified.
- Remediation: Include PR reviewComments’ updated_at in the allDates aggregation.
[NON-BLOCKING] Retry predicate misses common network error strings (fetch failed), risking no retry on transient network faults
- File: src/domain/ai/embedding-service-openai.ts: approx. 15–45 (isRetryableGitHubError)
- Evidence: The function retries when the message matches:
- /^GitHub API error: (429|5\d\d)/i or 429|rate.limit|503|Service Unavailable|ECONNRESET|ETIMEDOUT
- In Node/Bun, transient network errors are frequently thrown as TypeError: fetch failed or similar messages that do not include ECONNRESET/ETIMEDOUT text. Those would not match this predicate, so IntelligentRetryService would not retry, even though they are transient.
- Suggestion: Extend the predicate to include generic fetch failures (e.g., /\bfetch failed\b/i, /\bnetwork error\b/i, /\bsocket hang up\b/i) and/or detect TypeError instances from fetch.
[NON-BLOCKING] Service type/schema inconsistency for linear
- Files:
- src/domain/configuration/schemas/knowledge-bases.ts: type enum includes "linear"
- src/domain/knowledge/types.ts: KnowledgeSourceConfig.type includes "linear"
- src/domain/knowledge/knowledge-service.ts: createProvider switch has no "linear" case; default error message lists “notion”, “google-docs”, “github-activity” only
- Risk: A user configuring a linear source (allowed by schema/types per PR) will get a runtime “Unsupported knowledge source type” error. The PR description calls Linear “stub for mt#1117,” but the schema admission without service wiring is a papercut.
- Suggestion: Either remove “linear” from the enum until wired, or add a temporary helpful error (explicit case "linear" with a clear “not implemented yet, see mt#1117” message) to avoid confusion.
[NON-BLOCKING] Test gaps
- File: src/domain/knowledge/providers/github-activity-provider.test.ts
- Gaps:
- No test that PR review comments are included in content and affect lastModified (would have caught the lastModified omission).
- No test asserting fetchDocument rejects IDs for other owner/repo or at least that it preserves/validates owner/repo consistency.
- No test for states filter (open/closed/all) being propagated to the issues query and respected.
- No test for 403 primary rate limit retry path (message includes “rate limit”).
- No test for custom excludeAuthors beyond the default bot list.
- Suggestion: Add tests covering these behaviors, particularly the review comment freshness and owner/repo mismatch cases.
[PRE-EXISTING] “confluence” present in type enums but not supported by knowledge-service
- Files:
- src/domain/knowledge/types.ts and src/domain/configuration/schemas/knowledge-bases.ts: include "confluence"
- src/domain/knowledge/knowledge-service.ts: no “confluence” case in createProvider and default error message omits it
- This appears to predate the PR; mentioning for awareness only.
Spec verification
- Provider implemented: GitHub Issues/PRs indexing with attribution headers
- Met (implementation present). Note: review comments are included in content but not covered by tests, and lastModified excludes review comment timestamps (see blocking).
- isRetryableGitHubError added with rate-limit awareness
- Met (function present). Caveat: see network error gap above.
- Default bot filter and excludeAuthors override
- Met (DEFAULT_BOT_AUTHORS list; excludeAuthors merged). Tests cover bot exclusion for body and comments.
- Config schema extended (type enum includes "github-activity", "linear"; fields owner, repo, states, labels, excludeLabels, excludeAuthors, maxAgeDays; sensitivity)
- Met.
- Service wiring: knowledge-service adds "github-activity" case, dynamic import, token/owner/repo validation
- Met. Note: “linear” not wired; see non-blocking inconsistency above.
- Tests (8)
- listDocuments yields attribution headers for body + comments: Met
- Default bot-author exclusion: Met
- excludeLabels drops issues: Met
- labels requires all: Met
- maxAgeDays filters by updated_at: Met
- fetchDocument by id: Met
- Malformed id rejection: Met
- 429 retry succeeds: Met
Documentation impact
- Needs updates:
- Document the new “github-activity” provider configuration and semantics (fields: owner, repo, states, labels/excludeLabels, excludeAuthors, maxAgeDays).
- Document the new sensitivity field and how it affects default search behavior across sources.
- Note that “linear” appears in the schema but is not yet supported (or remove from schema until mt#1117 ships).
- PR description states some docs are deferred; please ensure a tracking issue links to the required docs changes.
Conclusion
REQUEST_CHANGES
Blocking issues to address before merge:
- Fix fetchDocument to either enforce owner/repo match with provider config or construct the document using the owner/repo parsed from the ID (and ensure metadata/id/title reflect the actual repo).
- Include PR review comments’ updated_at in lastModified computation. Add a test to prevent regression.
Non-blocking improvements are noted above (retry predicate resilience, linear wiring clarity, and test coverage).
Summary
Ships the GitHub half of KB activity sources. Linear provider deferred to mt#1117 — same split pattern as mt#1027 Layer 3 / mt#1108 Layer 4: subagent hit its turn limit mid-implementation; main agent finished tests + schema wiring + service registration and committed a shippable GitHub-only unit.
What's in this PR
github-activity-provider.ts(407 lines) —KnowledgeSourceProviderwithsourceType: "github-activity". Indexes issue bodies + issue comments + PR review comments. Attribution headers per segment (**Issue body** (by @alice),**Comment** (by @bob, 2026-03-01)). Document ID format:{owner}/{repo}#{number}.isRetryableGitHubErrorpredicate (rate-limit aware) added toembedding-service-openai.ts. Retryable: 429, 403 with rate-limit language, 5xx. Not retryable: 401, 404, 422, 403 without rate-limit reason.github-actions[bot],dependabot[bot],renovate[bot], etc. Users can add more viaexcludeAuthors.Config schema
knowledgeBaseEntrySchema.typeenum extended with"github-activity"+"linear"(Linear stub for mt#1117).owner,repo,states?: "open"|"closed"|"all",labels?,excludeLabels?,excludeAuthors?,maxAgeDays?.sensitivity: "public" | "internal" | "restricted"(default"public") — restricted sources excluded from default search results. Shared across all future activity + conversational sources.Service wiring
knowledge-service.tscreateProviderswitch: new"github-activity"case, lazy-loads via dynamic import. Credential +owner/repovalidation mirrors the Notion pattern.Tests (8 tests, all pass)
excludeLabelsdrops issues with any matching labellabelsrequires all listed labels presentmaxAgeDaysfilters byupdated_atfetchDocumentbyowner/repo#numberidOut of scope (explicitly deferred to follow-ups)
docs/architecture.md§9 — will consolidate into the mt#1117 docs rollup (same pattern as mt#1027 → mt#1108 docs consolidation)Action required
Merging this closes mt#1028 as scope-reduced. Linear (mt#1117) ships separately. The
sensitivityconfig field + the activity-track type-enum + the schema shape are all in place, so mt#1117 is a pure provider + tests task.Test plan
lint:strict(max-warnings=0) clean🤖 Implementation by a subagent through the provider file, finished from main after the subagent hit its turn limit. Main agent wrote the 8 unit tests, extended
types.ts+knowledge-bases.tsschemas, registered the provider inknowledge-service.ts, fixed three non-null-assertion lint warnings, and filed mt#1117 for the Linear half.