Context
Advance the predictive workflow in plans/feature-risk-learning.md, teaching IssueTriage to forecast risk for new items and retrieve similar historical work.
Goals
- Build data pipeline exporting historical issue/PR metrics, likeness summaries, and embeddings.
- Train baseline heuristics and classical ML model for risk level/score prediction.
- Implement similarity index service returning top related issues with evidence.
- Integrate predicted metrics and matches into assessment prompts and panel UI.
Feature Plan: Historical Risk Learning
Goal
Enable IssueTriage to forecast risk metrics for new issues or pull requests by learning from historical repository activity and feeding those insights into the assessment workflow.
MVP Implementation Decisions (Finalized)
Data & Storage:
- Use the connected repository's local SQLite cache as the primary data source
- Export to co-located SQLite bundle for training (schema defined below)
- Implement in TypeScript within
src/services/ and export scripts in scripts/
Embeddings & Search:
- Use OpenAI
text-embedding-3-large (1536 dimensions) for batch training exports
- Use OpenAI
text-embedding-3-small (512 dimensions) for real-time similarity queries
- Deploy
sqlite-vss extension for vector storage and ANN search
- Fallback to brute-force SQL cosine when extension unavailable or <200 vectors
Model Training:
- Manual training workflow triggered via "ML Training" tab →
Train now button
- Start with baseline heuristics (historical averages by label/component)
- Package trained models as local ONNX artifacts alongside the extension
- Defer automated retraining and server-side aggregation to post-MVP
API & Quotas:
- Daily token budget: 200k tokens (configurable via settings)
- Batch size: 50 summaries per export run
- Warning threshold: 80% of daily budget
- Rate limiting: 50ms jittered delays, exponential backoff on 429/5xx (3 retries max)
Integration:
SimilarityService returns top-5 matches with blended cosine + structural scores
PredictedRiskService outputs: {riskLevel, riskScore, confidence, drivers[], similarIssues[]}
- Surface predictions in IssueTriage panel alongside existing risk intelligence
- Feed top-3 similar issues into assessment prompts as historical context
Acceptance Criteria:
- Export <5 minutes for 500 closed issues
- Validation rejects if >5% missing risk snapshots
- Embedding count matches issue count for
title, body, likeness_summary
- Manifest persists with schema version, token usage, and validation report
Target Outcomes
- Predict likelihood of high, medium, or low risk for a new triage item before linked work exists.
- Generate estimated drivers (e.g., expected change volume, review friction) to seed readiness guidance.
- Surface similar historical issues/PRs to provide reviewers with evidence for the prediction.
Similarity Discovery Strategy
- Capture per-issue similarity snapshots when issues close by persisting:
- Final label set, assignees, milestone, linked PR metrics, and risk outcomes.
- Generated "likeness summaries" that describe components touched, change scope, and notable risk drivers.
- Compute semantic embeddings (sentence-transformer or Azure OpenAI) for title, body, and likeness summary; store vectors alongside structured fingerprints (e.g., label hashes, hot-file counts).
- Maintain an incremental similarity index (Faiss, Pinecone, Redis, or in-process ANN) keyed by issue id with timestamps for temporal filtering.
- Serve a
SimilarityService that blends cosine similarity with structured overlap metrics (label Jaccard, shared files) to rank top matches.
- Surface retrieved matches inside assessments, webview UI, and as prompt context for downstream LLM/ML predictions.
Data Requirements
- Source repositories: choose one or more public repos with rich history (e.g.,
microsoft/vscode, numpy/numpy).
- Data types:
- Issue metadata: titles, bodies, labels, milestones, timestamps, author.
- Linked PR data: churn metrics, review state counts, merge outcomes, time-to-merge.
- Post-merge signals: subsequent bug reports, reverts, security advisories (optional but valuable).
- Assessment + risk records generated by IssueTriage once instrumentation exists.
- Similarity captures: likeness summaries, embedding vectors, structured fingerprints saved when issues close or PRs merge.
- Collection tooling: extend
GitHubClient or create scripts to export historical snapshots, respecting rate limits and caching.
- Storage: structured dataset (Parquet/SQLite/Postgres) with normalized tables for issues, PRs, risk outcomes, derived features, plus a similarity index table storing vectors and metadata for retrieval.
Existing Risk Intelligence Signals
- Risk intelligence comment: each closed issue captures
riskLevel (e.g., Low), riskScore (e.g., 15), and the lastUpdated timestamp from the risk engine. We store each comment in the local SQLite cache and will standardize the markdown block with a dedicated header tag (e.g., <!-- risk-intelligence -->) so exports can target the most recent instance unambiguously.
- Operational metrics: parsed from the same comment block and include
linkedCommits, filesTouched, linesChanged, and reviewFrictionSignals (currently integer counts). These metrics become typed columns in the dataset for model consumption without secondary parsing.
- Composite model assessment: initial assessment comment contains
compositeScore (e.g., 62.5), per-dimension scores (requirements, complexity, security, business), the modelId (e.g., openai/gpt-5-mini), and runTimestamp. The block will use a matching tag (e.g., <!-- assessment-intelligence -->) and the exporter selects the latest occurrence.
- Qualitative guidance: textual summary (e.g., "Add missing context then reassess."), model provenance statement, and structured question prompts. We retain the raw summary as a
TEXT column and expose a summary_embedding vector in the embeddings table for similarity retrieval.
- Issue context: canonical issue title, number, author, milestone, labels, and comment stream provide additional supervised signals and retrieval evidence; comments unrelated to our tagged blocks remain available to the similarity engine.
Comment Tagging Specification
- Goals: keep comments visually friendly while giving exporters deterministic anchors; tags should survive manual edits and support versioning.
- Wrapper syntax: each machine-readable block is enclosed by paired HTML comments with required attributes:
- Risk comment wrapper:
<!-- risk-intel:start version=1 --> ... <!-- risk-intel:end -->
- Assessment comment wrapper:
<!-- assessment-intel:start version=1 --> ... <!-- assessment-intel:end -->
- Display payload: inside the wrapper, keep human-readable headings and bullet lists so reviewers can skim without tooling. Example:
<!-- risk-intel:start version=1 issued-at="2025-10-19T20:25:33Z" source="IssueTriage" -->
Risk Intelligence
Low risk · Score 15
Last updated 10/19/2025, 8:25:33 PM
Key metrics
- 1 linked commits
- 3 files touched
- 15 lines changed
- 0 review friction signals
<!-- risk-intel:data
{
"riskLevel": "Low",
"riskScore": 15,
"linkedCommits": 1,
"filesTouched": 3,
"linesChanged": 15,
"reviewFrictionSignals": 0,
"analysisId": "abc123"
}
-->
<!-- risk-intel:end -->
- Embedded data node: the
<!-- risk-intel:data ... --> tag holds canonical JSON for exporters. Renderers ignore it while parsers read the JSON payload. The wrapper attributes provide quick filters (versioning, timestamps, emit source).
- Assessment block: follow the same pattern with fields tailored to composite scoring:
<!-- assessment-intel:start version=1 issued-at="2025-10-19T19:51:48Z" source="IssueTriage" -->
Composite Assessment
Composite 62.5 · Model openai/gpt-5-mini
Dimensions
- Requirements: 60.0
- Complexity: 80.0
- Security: 45.0
- Business: 65.0
Summary
Add missing context then reassess.
<!-- assessment-intel:data
{
"compositeScore": 62.5,
"modelId": "openai/gpt-5-mini",
"dimensions": {
"requirements": 60.0,
"complexity": 80.0,
"security": 45.0,
"business": 65.0
},
"summary": "Add missing context then reassess.",
"analysisRunId": "run-20251019-195148"
}
-->
<!-- assessment-intel:end -->
- Multiple runs: when updates occur, append a fresh tagged block to the issue comments. The exporter selects the latest
version and issued-at instance, while older blocks remain for audit history.
- Error handling: if the JSON subtag is malformed, the exporter falls back to cache data and logs a warning so operators can fix the comment manually.
Likeness Summary & Similarity Strategy
- Objective: encode each closed issue into a concise, semantically rich summary that drives LLM embeddings and human review of related work.
- Authoring: generate the likeness summary with an LLM prompt that incorporates change metadata (labels, touched files, churn metrics) and risk outcomes. Anchor the prompt to a strict template to keep outputs consistent.
- Recommended template: 2–3 sentences covering problem statement, implementation touchpoints, and notable risk/complexity signals, followed by a bullet list of key keywords. Example prompt sketch:
Summarize the closed issue for historical risk learning.
Include:
1. Intent (component + change objective).
2. Implementation footprint (files or subsystems, approx change size).
3. Notable risk drivers (dependencies, security surfaces, coordination).
End with a line 'Keywords: <comma-separated tokens>'. Max 120 tokens.
-
Locked prompts:
-
Summary generation prompt (system):
You are IssueTriage, producing likeness summaries for closed GitHub issues. Respond in English.
Output two short sentences (<=120 tokens total) describing intent, implementation footprint, and risk drivers.
End with a line exactly formatted as 'Keywords: <comma-separated tokens>'.
Use present-perfect phrasing ("Updated", "Refactored") and avoid markdown headings.
-
Summary generation prompt (user template):
Issue title: {title}
Labels: {labels_csv}
Files touched: {files_csv}
Lines changed: {lines_changed}
Linked commits: {linked_commits}
Risk level: {risk_level}
Risk score: {risk_score}
Description:
{body_excerpt}
-
Similarity justification prompt (system):
You explain why two historical issues are similar for risk assessment. Respond with one sentence <=60 tokens, referencing concrete overlaps.
-
Similarity justification prompt (user template):
Candidate issue summary: {candidate_summary}
Query issue summary: {query_summary}
Shared attributes: Labels[{shared_labels_csv}] Files[{shared_files_csv}]
-
Embedding pipeline:
- Run the LLM locally or via API to produce the likeness summary when an issue closes or during the export batch.
- Feed the summary text into a sentence embedding model (OpenAI text-embedding-3, Azure embed, or open-source alternative) and store the resulting vector in the
embeddings table with source='likeness_summary'.
-
Similarity scoring:
- Primary score: cosine similarity between summary embeddings.
- Signal blend: linearly combine the cosine score with structured overlaps (label Jaccard, shared file hotspots, delta in risk levels) to avoid purely semantic matches.
- Re-ranking: for the top-k cosine matches, ask a lightweight LLM prompt to justify similarity, producing an evidence snippet for the UI (e.g., "Both touched auth middleware and involved large refactors").
-
Inference flow:
- For a new issue, create a provisional likeness summary using available metadata + user description.
- Compute embedding and query the ANN index (Faiss/SQLite FTS + vector extension).
- Return top-k candidates with blended scores and optional LLM-generated rationales.
- Cache results and update as additional context (linked PRs, telemetry) arrives.
-
API guardrails:
- Default to batching likeness summaries during manual export runs to minimize per-issue request overhead (e.g., 50 summaries per run).
- Enforce a daily token budget (e.g., 200k tokens) configurable via settings; surface warnings in the UI when usage approaches 80%.
- Cache generated summaries and justifications in SQLite with
analysis_run_id to avoid repeat calls.
- Provide a "dry run" mode to estimate token spend without calling the LLM, using average tokens per prompt (summary ≈ 300 input, 70 output; justification ≈ 200 input, 40 output).
- Respect API rate limits by inserting jittered delays between requests (default 50 ms) and retrying with exponential backoff on 429/5xx responses up to 3 attempts.
Embedding & Retrieval Options
- Embedding providers:
- OpenAI/Azure OpenAI (MVP choice): standardize on
text-embedding-3-large for training exports and text-embedding-3-small for real-time queries. This keeps us on the same vendor path as existing assessment models while leaving room to swap hosts (OpenAI vs Azure OpenAI) if governance requires.
- Fallback: Hugging Face sentence transformers: keep a documented alternative (
all-mpnet-base-v2) for offline mode, but defer implementation until we face network constraints.
- Future: fine-tuned embeddings: revisit once labeled similarity pairs exist; plan to package via ONNX so we can self-host if API policies shift.
- Vector storage:
- SQLite extensions: use pgvector-like add-ons (
sqlite-vss, sqlite-vec) to keep embeddings beside other issue data for MVP; good for small to medium corpora.
- Faiss on-disk index: efficient approximate nearest neighbor for thousands to millions of vectors; integrate with Python export job and persist index artifacts.
- Managed services: Pinecone, Weaviate Cloud, Azure Cognitive Search with vector fields; useful once we need multi-repo scale or shared inference endpoints.
- Similarity search modes:
- Pure cosine/inner product: baseline ranking directly on embedding vectors.
- Hybrid semantic + keyword: combine vector search with SQLite full-text search over summaries/comments to balance lexical signals.
- Graph-aware rerankers: build secondary filters using shared files, label overlap, or risk level differences to refine top results.
- Title & comment enhancements:
- Maintain separate embeddings for issue titles, likeness summaries, and curated comment snippets; blend scores via weighted average or reciprocal rank fusion.
- Normalize titles before indexing (lowercase, strip prefixes like "Bug:"/"Feature:"/issue IDs) to improve both lexical and semantic matching.
- Run an FTS index on titles plus tagged comment excerpts so exact keyword hits can seed candidates prior to vector re-ranking.
- Extract prominent entities (components, file paths, dependency names) from comments and persist them as auxiliary keywords to power filters and UI facets.
- Storage lifecycle:
- Persist embeddings alongside the
manifest and export_run_id so retraining jobs can sync versions.
- Regenerate embeddings when the likeness prompt/schema changes; version via
embedding_model and embedding_run_id columns.
- Archive older Faiss/ANN indexes for reproducibility while the live index tracks the latest export.
- ANN implementation (MVP):
- Choose
sqlite-vss extension to embed vectors directly in the existing SQLite export, enabling SELECT ... ORDER BY vector_distance_limit queries without external services.
- Batch index updates: after serialization, insert embeddings in chunks of 200 rows to avoid long transactions; vacuum the VSS index once per export run.
- Fallback to brute-force cosine search via SQL when the extension is unavailable or vector count < 200 (auto-detected), logging a warning but keeping results.
- Error handling: if any issue lacks one of the required embeddings (
title, likeness_summary), skip it during indexing and append a warning row in the validation report; ensure query path filters out missing vectors gracefully.
- Versioning: store the VSS index parameters (
dimension, metric, lists) in the manifest so later runs can reproduce settings.
Historical Dataset Schema
- Storage target: single SQLite bundle co-located with the extension for MVP training runs, with Parquet export hooks for future server-side scaling.
- Import contract: exporter normalizes comment-derived metrics into typed columns so downstream ML jobs avoid brittle parsing logic.
| Table |
Key Fields |
Description |
issues |
issue_id (PK), repo_slug, number, title, body, state, author, created_at, closed_at, milestone_id, labels (array/json) |
Canonical issue record and categorical features. |
risk_intelligence_snapshots |
snapshot_id (PK), issue_id (FK), risk_level, risk_score, last_updated, linked_commits, files_touched, lines_changed, review_friction_signals, comment_tag |
Structured view of the "Risk Intelligence" comment payload captured at close with tag provenance. |
analysis_scores |
analysis_id (PK), issue_id (FK), composite_score, requirements_score, complexity_score, security_score, business_score, model_id, model_run_timestamp, analysis_summary, comment_tag |
Quantitative and textual outputs from the initial assessment block with tag provenance. |
activity_metrics |
issue_id (PK/FK), linked_pr_ids, time_to_first_response, cycle_time, comment_count, assignee_count, hot_file_score |
Aggregated behavioral signals computed during export. |
comments |
comment_id (PK), issue_id (FK), author, created_at, body, is_system_generated |
Free-form context for similarity search and qualitative supervision. |
embeddings |
embedding_id (PK), issue_id (FK), source (title/body/likeness_summary), vector, model, created_at |
Semantic vectors aligned with similarity retrieval strategy. |
- Derived views: materialize
issue_risk_training_view joining the above tables and flattening categorical features into model-ready columns. Labels and linked PR IDs stay as JSON arrays to preserve multi-value similarity signals, avoiding extraneous audit metadata.
- Versioning: add
export_run_id and source_comment_hash columns on snapshot tables to track provenance and enable reprocessing when comment formats change; manual exports track runs through a separate manifest table.
Import Workflow
-
Initiation: researcher triggers a "Analyze closed issues" command inside the extension UI, which prompts the local engine to scan for closed issues lacking tagged analysis/risk comments and backfill them before export.
-
Extraction: exporter reads from the local SQLite cache (primary data source) and, if needed, pulls additional GitHub data using the authenticated session associated with the connected repository.
-
Normalization: parser maps extracted metrics to typed schema fields, applies defaulting for missing values, and stores numeric metrics as integers/floats for ML readiness.
-
Validation: run contracts that confirm counts (e.g., risk_score within 0-100, linked_commits >= 0) and emit anomalies for manual review.
-
Serialization: batch writes normalized rows into the local SQLite dataset and emits a manifest containing schema version, export timestamp, record counts, and export_run_id.
-
Loading: ML experiments consume the SQLite file directly for MVP; future pipelines may mirror to Parquet/postgres but the local file remains the authoritative source during manual training cycles.
-
Exporter acceptance criteria:
- Manifest schema includes:
export_run_id, repo_slug, issues_exported, snapshots_exported, analysis_exported, export_started_at, export_completed_at, schema_version, embedding_model, token_usage_summary.
- Validation thresholds: reject export if >5% of closed issues lack risk snapshots, or if any numeric metric falls outside expected bounds (
risk_score 0-100, lines_changed >=0, linked_commits >=0). Provide warning-only for 1-5% gaps.
- Integrity checks: ensure every row in
risk_intelligence_snapshots has a matching analysis_scores entry (within the same export) unless explicitly flagged analysis_missing=true.
- Verify embeddings count equals number of issues in manifest for sources
title, body, and likeness_summary; log discrepancies.
- Persist validation report with summary counts and error list in manifest directory for audit.
Operational Scope
- Repository coverage: MVP targets the user-selected repository connected through the extension; the export runs on demand rather than on a fixed schedule.
- Access model: leverage the existing local SQLite engine and authenticated GitHub client to request batch analyses for any closed issues missing the standardized comment blocks.
- Refresh cadence: training remains a manual process for now; operators trigger exports and subsequent model retraining as needed until an automated cadence is justified.
- Scalability plan: design schema and manifest format to support eventual migration to a server-side store (e.g., Postgres) once dataset size or multi-repo aggregation requires it, without blocking the local inference prototype.
UI Additions (ML Training Tab)
- Navigation: add an "ML Training" tab to the IssueTriage panel alongside existing assessment views.
- Primary action: include a prominent
Train now button that executes the export pipeline and, upon success, kicks off local model training/inference rebuild.
- Secondary info: show last export timestamp, issues processed, and most recent token usage summary; disable the button with tooltip if no closed issues require analysis.
- Progress feedback: display step-by-step status (Backfill → Validate → Embed → Persist) with spinner, turning each step to a checkmark on completion; surface warnings inline with links to the validation report.
- Post-run: provide quick links to the generated manifest and allow user to open the latest similarity search results within the tab.
Feature Engineering
- Text embeddings from issue titles/bodies (OpenAI/Azure, sentence-transformers) for similarity and semantic features.
- Numeric aggregates: historical churn averages per label/component, team cadence, time-based trend features.
- Graph features: author collaboration networks, file hot-spots, PR dependency counts.
- Labels for supervised learning: existing
riskLevel, riskScore, binary outcomes (e.g., high-risk vs non-high-risk).
Modeling Strategy
- Baseline Heuristics
- Rule-based scoring using historical averages per label/component.
- Serves as benchmark and fallback.
- Classical ML
- Train gradient boosting or random forest classifiers/regressors on engineered features.
- Predict risk level (classification) and risk score (regression).
- Evaluate with cross-validation, precision/recall for high-risk, MAE for scores.
- LLM-Assisted
- Prompt LLM with structured historical context plus new issue details to obtain predicted metrics.
- Optionally fine-tune small models (LLaMA-family) with instruction data generated from historical pairs.
- Compare performance vs ML models; consider ensemble (LLM justification + ML probability).
Evaluation Plan
- Split dataset by time (train on past periods, test on future intervals) to mimic real deployment.
- Metrics:
- Classification: F1, recall on high-risk, ROC-AUC.
- Regression: MAE / RMSE on riskScore.
- Similarity retrieval: precision@k for finding truly related past issues.
- Baseline comparison: ensure model outperforms simple heuristics.
- Statistical significance tests when combining multiple repos.
Integration Steps
- Data Pipeline: batch exporter to populate historical dataset; schedule periodic refresh.
- Model Serving: package trained model (e.g., ONNX or JSON weights) accessible from VS Code extension via local service or remote API.
- Extension Wiring:
- Add
PredictedRiskService to request predictions when issue loads.
- Merge outputs with existing
RiskIntelligenceService summaries.
- Display confidence, key drivers, and similar past items in the IssueTriage panel.
- Integrate
SimilarityService to return top-k matches from precomputed embeddings/fingerprints for UI and prompt augmentation.
- Prompt Augmentation: feed predicted metrics and retrieved historical snippets into the assessment payload for OpenRouter.
Testing Strategy
- Offline unit tests for feature extractors and data loaders.
- Regression tests validating model predictions against a frozen evaluation set.
- Integration tests simulating extension calls against a mock service.
- Manual validation using selected public repo issues to check qualitative usefulness.
MVP Test Plan
- Unit tests
- Exporter parsers convert tagged comments and SQLite cache rows into schema-compliant DTOs (cover missing field defaults and malformed JSON handling).
- Manifest builder enforces required fields and fails when validation thresholds are exceeded.
- Likeness summary prompt wrapper ensures output matches keyword line regex and token limit logic.
- Embedding pipeline stubs produce deterministic vectors and raise errors if required sources absent.
- Integration tests
- End-to-end export run using a fixture repository snapshot (seed SQLite + mock GitHub) yields manifest, snapshot tables, and embeddings with counts matching expectations.
- Similarity query against the sqlite-vss index returns blended results, including fallback path when vector extension is disabled.
Train now command publishes progress events and final status to the UI mock.
- Manual validation
- Run the ML Training tab against a known repo (e.g., sample dataset) and confirm comment tagging, manifest output, and similarity results align with expectations.
- Inspect validation report warnings, retry with intentionally missing embeddings, and verify fallback behavior.
- Smoke test API quotas by triggering consecutive exports until hitting 80% budget warning.
- Performance spot-check
- Measure export runtime on ~500 closed issues, ensuring end-to-end process completes within the target (e.g., <5 minutes) and document observed token consumption.
Tooling & Infrastructure
- Python or TypeScript data pipeline (consider
scripts/ folder) leveraging GitHub REST/GraphQL.
- ML stack: scikit-learn/lightGBM for classical models, Hugging Face transformers or Azure OpenAI for embeddings/LLM prompts.
- Experiment tracking (MLflow, Weights & Biases) to log runs and metrics.
- Versioned model artifacts stored in storage bucket or repository releases.
Risks & Mitigations
- Sparse labeled data: bootstrap with heuristic labels, progressively replace with real assessment outcomes.
- API limits: implement caching/backoff, consider GitHub Archival datasets (GH Archive, BigQuery).
- Model drift: schedule retraining; monitor telemetry from live predictions.
- Privacy: ensure no sensitive repository data stored without consent; default to public repos for initial models.
Timeline (High-Level)
- Weeks 1-2: Data audit, repository selection, pipeline scaffold.
- Weeks 3-5: Feature engineering, baseline heuristics, initial ML model.
- Weeks 6-8: LLM prompt experiments, similarity retrieval integration.
- Weeks 9-10: Extension integration prototype, UI surfacing, offline evaluation suite.
- Week 11+: Feedback loop, telemetry instrumentation, incremental deployment.
Next Actions
- Confirm pilot repository list and secure API tokens.
- Define schema for historical dataset and implement initial export script.
- Choose experiment tracking approach and spin up environment (local or cloud notebook).
- Draft specification for
PredictedRiskService interface inside the extension.
- Design likeness summary generator and similarity index schema; prototype offline ANN lookup against exported data.
Context
Advance the predictive workflow in
plans/feature-risk-learning.md, teaching IssueTriage to forecast risk for new items and retrieve similar historical work.Goals
Feature Plan: Historical Risk Learning
Goal
Enable IssueTriage to forecast risk metrics for new issues or pull requests by learning from historical repository activity and feeding those insights into the assessment workflow.
MVP Implementation Decisions (Finalized)
Data & Storage:
src/services/and export scripts inscripts/Embeddings & Search:
text-embedding-3-large(1536 dimensions) for batch training exportstext-embedding-3-small(512 dimensions) for real-time similarity queriessqlite-vssextension for vector storage and ANN searchModel Training:
Train nowbuttonAPI & Quotas:
Integration:
SimilarityServicereturns top-5 matches with blended cosine + structural scoresPredictedRiskServiceoutputs:{riskLevel, riskScore, confidence, drivers[], similarIssues[]}Acceptance Criteria:
title,body,likeness_summaryTarget Outcomes
Similarity Discovery Strategy
SimilarityServicethat blends cosine similarity with structured overlap metrics (label Jaccard, shared files) to rank top matches.Data Requirements
microsoft/vscode,numpy/numpy).GitHubClientor create scripts to export historical snapshots, respecting rate limits and caching.Existing Risk Intelligence Signals
riskLevel(e.g., Low),riskScore(e.g., 15), and thelastUpdatedtimestamp from the risk engine. We store each comment in the local SQLite cache and will standardize the markdown block with a dedicated header tag (e.g.,<!-- risk-intelligence -->) so exports can target the most recent instance unambiguously.linkedCommits,filesTouched,linesChanged, andreviewFrictionSignals(currently integer counts). These metrics become typed columns in the dataset for model consumption without secondary parsing.compositeScore(e.g., 62.5), per-dimension scores (requirements,complexity,security,business), themodelId(e.g., openai/gpt-5-mini), andrunTimestamp. The block will use a matching tag (e.g.,<!-- assessment-intelligence -->) and the exporter selects the latest occurrence.TEXTcolumn and expose asummary_embeddingvector in the embeddings table for similarity retrieval.Comment Tagging Specification
<!-- risk-intel:start version=1 -->...<!-- risk-intel:end --><!-- assessment-intel:start version=1 -->...<!-- assessment-intel:end --><!-- risk-intel:data ... -->tag holds canonical JSON for exporters. Renderers ignore it while parsers read the JSON payload. The wrapper attributes provide quick filters (versioning, timestamps, emit source).versionandissued-atinstance, while older blocks remain for audit history.Likeness Summary & Similarity Strategy
Locked prompts:
Summary generation prompt (system):
Summary generation prompt (user template):
Similarity justification prompt (system):
Similarity justification prompt (user template):
Embedding pipeline:
embeddingstable withsource='likeness_summary'.Similarity scoring:
Inference flow:
API guardrails:
analysis_run_idto avoid repeat calls.Embedding & Retrieval Options
text-embedding-3-largefor training exports andtext-embedding-3-smallfor real-time queries. This keeps us on the same vendor path as existing assessment models while leaving room to swap hosts (OpenAI vs Azure OpenAI) if governance requires.all-mpnet-base-v2) for offline mode, but defer implementation until we face network constraints.sqlite-vss,sqlite-vec) to keep embeddings beside other issue data for MVP; good for small to medium corpora.manifestandexport_run_idso retraining jobs can sync versions.embedding_modelandembedding_run_idcolumns.sqlite-vssextension to embed vectors directly in the existing SQLite export, enablingSELECT ... ORDER BY vector_distance_limitqueries without external services.title,likeness_summary), skip it during indexing and append a warning row in the validation report; ensure query path filters out missing vectors gracefully.dimension,metric,lists) in the manifest so later runs can reproduce settings.Historical Dataset Schema
issuesissue_id(PK),repo_slug,number,title,body,state,author,created_at,closed_at,milestone_id,labels(array/json)risk_intelligence_snapshotssnapshot_id(PK),issue_id(FK),risk_level,risk_score,last_updated,linked_commits,files_touched,lines_changed,review_friction_signals,comment_taganalysis_scoresanalysis_id(PK),issue_id(FK),composite_score,requirements_score,complexity_score,security_score,business_score,model_id,model_run_timestamp,analysis_summary,comment_tagactivity_metricsissue_id(PK/FK),linked_pr_ids,time_to_first_response,cycle_time,comment_count,assignee_count,hot_file_scorecommentscomment_id(PK),issue_id(FK),author,created_at,body,is_system_generatedembeddingsembedding_id(PK),issue_id(FK),source(title/body/likeness_summary),vector,model,created_atissue_risk_training_viewjoining the above tables and flattening categorical features into model-ready columns. Labels and linked PR IDs stay as JSON arrays to preserve multi-value similarity signals, avoiding extraneous audit metadata.export_run_idandsource_comment_hashcolumns on snapshot tables to track provenance and enable reprocessing when comment formats change; manual exports track runs through a separate manifest table.Import Workflow
Initiation: researcher triggers a "Analyze closed issues" command inside the extension UI, which prompts the local engine to scan for closed issues lacking tagged analysis/risk comments and backfill them before export.
Extraction: exporter reads from the local SQLite cache (primary data source) and, if needed, pulls additional GitHub data using the authenticated session associated with the connected repository.
Normalization: parser maps extracted metrics to typed schema fields, applies defaulting for missing values, and stores numeric metrics as integers/floats for ML readiness.
Validation: run contracts that confirm counts (e.g.,
risk_scorewithin 0-100,linked_commits>= 0) and emit anomalies for manual review.Serialization: batch writes normalized rows into the local SQLite dataset and emits a manifest containing schema version, export timestamp, record counts, and
export_run_id.Loading: ML experiments consume the SQLite file directly for MVP; future pipelines may mirror to Parquet/postgres but the local file remains the authoritative source during manual training cycles.
Exporter acceptance criteria:
export_run_id,repo_slug,issues_exported,snapshots_exported,analysis_exported,export_started_at,export_completed_at,schema_version,embedding_model,token_usage_summary.risk_score0-100,lines_changed>=0,linked_commits>=0). Provide warning-only for 1-5% gaps.risk_intelligence_snapshotshas a matchinganalysis_scoresentry (within the same export) unless explicitly flaggedanalysis_missing=true.title,body, andlikeness_summary; log discrepancies.Operational Scope
UI Additions (ML Training Tab)
Train nowbutton that executes the export pipeline and, upon success, kicks off local model training/inference rebuild.Feature Engineering
riskLevel,riskScore, binary outcomes (e.g., high-risk vs non-high-risk).Modeling Strategy
Evaluation Plan
Integration Steps
PredictedRiskServiceto request predictions when issue loads.RiskIntelligenceServicesummaries.SimilarityServiceto return top-k matches from precomputed embeddings/fingerprints for UI and prompt augmentation.Testing Strategy
MVP Test Plan
Train nowcommand publishes progress events and final status to the UI mock.Tooling & Infrastructure
scripts/folder) leveraging GitHub REST/GraphQL.Risks & Mitigations
Timeline (High-Level)
Next Actions
PredictedRiskServiceinterface inside the extension.