Feature: Historical risk learning and similarity service

## Context
Advance the predictive workflow in `plans/feature-risk-learning.md`, teaching IssueTriage to forecast risk for new items and retrieve similar historical work.

## Goals
- Build data pipeline exporting historical issue/PR metrics, likeness summaries, and embeddings.
- Train baseline heuristics and classical ML model for risk level/score prediction.
- Implement similarity index service returning top related issues with evidence.
- Integrate predicted metrics and matches into assessment prompts and panel UI.

# Feature Plan: Historical Risk Learning

## Goal
Enable IssueTriage to forecast risk metrics for new issues or pull requests by learning from historical repository activity and feeding those insights into the assessment workflow.

## MVP Implementation Decisions (Finalized)

**Data & Storage:**
- Use the connected repository's local SQLite cache as the primary data source
- Export to co-located SQLite bundle for training (schema defined below)
- Implement in TypeScript within `src/services/` and export scripts in `scripts/`

**Embeddings & Search:**
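- Use OpenAI `text-embedding-3-large` for batch training exports, shortened to 1536 dimensions via the API's `dimensions` parameter (the model's native size is 3072)
- Use OpenAI `text-embedding-3-small` for real-time similarity queries, shortened to 512 dimensions (native size is 1536); vectors from different models or dimension settings are not comparable, so a query must be embedded with the same model and dimensions as the index it searches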
- Deploy `sqlite-vss` extension for vector storage and ANN search
- Fall back to brute-force SQL cosine search when the extension is unavailable or the index holds fewer than 200 vectors

**Model Training:**
- Manual training workflow triggered via "ML Training" tab → `Train now` button
- Start with baseline heuristics (historical averages by label/component)
- Package trained models as local ONNX artifacts alongside the extension
- Defer automated retraining and server-side aggregation to post-MVP

**API & Quotas:**
- Daily token budget: 200k tokens (configurable via settings)
- Batch size: 50 summaries per export run
- Warning threshold: 80% of daily budget
- Rate limiting: 50 ms jittered delays, exponential backoff on 429/5xx (3 retries max)
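
The quota rules above could be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: `BudgetConfig`, `estimateRunTokens`, and `checkBudget` are hypothetical names, and only the numeric defaults (200k daily budget, 80% warning threshold, and the per-prompt token averages listed under API guardrails) come from this plan.

```typescript
// Hypothetical types and names; only the numeric defaults come from the plan.
interface BudgetConfig {
  dailyTokenBudget: number; // plan default: 200,000 tokens
  warningRatio: number;     // plan default: warn at 80% of the budget
}

type BudgetStatus = "ok" | "warning" | "exceeded";

const DEFAULT_BUDGET: BudgetConfig = { dailyTokenBudget: 200000, warningRatio: 0.8 };

// Average tokens per call, taken from the plan's dry-run estimates.
const AVG_TOKENS = {
  summary: { input: 300, output: 70 },
  justification: { input: 200, output: 40 },
};

/** Estimate token spend for a run without calling the LLM (dry-run mode). */
function estimateRunTokens(summaries: number, justifications: number): number {
  const perSummary = AVG_TOKENS.summary.input + AVG_TOKENS.summary.output;
  const perJustification = AVG_TOKENS.justification.input + AVG_TOKENS.justification.output;
  return summaries * perSummary + justifications * perJustification;
}

/** Classify today's usage plus the projected run against the daily budget. */
function checkBudget(
  usedToday: number,
  projected: number,
  config: BudgetConfig = DEFAULT_BUDGET,
): BudgetStatus {
  const total = usedToday + projected;
  if (total > config.dailyTokenBudget) return "exceeded";
  if (total >= config.dailyTokenBudget * config.warningRatio) return "warning";
  return "ok";
}
```

Under these assumptions, a default batch of 50 summaries is estimated before any request is sent, and the UI can show the warning state once projected usage crosses the threshold.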

**Integration:**
- `SimilarityService` returns top-5 matches with blended cosine + structural scores
- `PredictedRiskService` outputs: `{riskLevel, riskScore, confidence, drivers[], similarIssues[]}`
- Surface predictions in IssueTriage panel alongside existing risk intelligence
- Feed top-3 similar issues into assessment prompts as historical context
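
As an illustration of the output shape and the top-3 prompt context, here is a small sketch. The top-level keys follow the `PredictedRiskService` output listed above; the fields inside `SimilarIssue` and the `buildHistoricalContext` helper are assumptions made for this example.

```typescript
// Field names inside SimilarIssue are assumptions; the plan only fixes the top-level keys.
type RiskLevel = "low" | "medium" | "high";

interface SimilarIssue {
  issueNumber: number;
  title: string;
  score: number;     // blended cosine + structural score
  evidence?: string; // optional one-sentence justification
}

interface PredictedRisk {
  riskLevel: RiskLevel;
  riskScore: number;
  confidence: number;
  drivers: string[];
  similarIssues: SimilarIssue[]; // up to 5 matches from SimilarityService
}

/** Format the highest-scoring matches as historical context for an assessment prompt. */
function buildHistoricalContext(prediction: PredictedRisk, limit: number = 3): string {
  const top = prediction.similarIssues
    .slice() // copy so the caller's array keeps its order
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
  const lines = top.map((issue) => {
    const evidence = issue.evidence ? ` (${issue.evidence})` : "";
    return `- #${issue.issueNumber} ${issue.title}, similarity ${issue.score.toFixed(2)}${evidence}`;
  });
  const header =
    `Predicted risk: ${prediction.riskLevel} ` +
    `(score ${prediction.riskScore}, confidence ${prediction.confidence.toFixed(2)})`;
  return [header, "Similar historical issues:", ...lines].join("\n");
}
```

Keeping the prompt context to three matches while the panel shows five trades some recall for a shorter prompt.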

**Acceptance Criteria:**
- Export <5 minutes for 500 closed issues
- Validation rejects if >5% missing risk snapshots
- Embedding count matches issue count for `title`, `body`, `likeness_summary`
- Manifest persists with schema version, token usage, and validation report

## Target Outcomes
- Predict likelihood of high, medium, or low risk for a new triage item before linked work exists.
- Generate estimated drivers (e.g., expected change volume, review friction) to seed readiness guidance.
- Surface similar historical issues/PRs to provide reviewers with evidence for the prediction.

## Similarity Discovery Strategy
- Capture per-issue similarity snapshots when issues close by persisting:
   - Final label set, assignees, milestone, linked PR metrics, and risk outcomes.
   - Generated "likeness summaries" that describe components touched, change scope, and notable risk drivers.
- Compute semantic embeddings (sentence-transformer or Azure OpenAI) for title, body, and likeness summary; store vectors alongside structured fingerprints (e.g., label hashes, hot-file counts).
- Maintain an incremental similarity index (Faiss, Pinecone, Redis, or in-process ANN) keyed by issue id with timestamps for temporal filtering.
- Serve a `SimilarityService` that blends cosine similarity with structured overlap metrics (label Jaccard, shared files) to rank top matches.
- Surface retrieved matches inside assessments, webview UI, and as prompt context for downstream LLM/ML predictions.
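
The blend of cosine similarity with structured overlap described above might look like the sketch below. The weights in `WEIGHTS` and the `Fingerprint` shape are illustrative assumptions; the plan does not fix them, and they would need tuning against real data.

```typescript
/** Cosine similarity of two equal-length vectors; returns 0 when either has zero norm. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets score 0. */
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let intersection = 0;
  setA.forEach((item) => {
    if (setB.has(item)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Hypothetical fingerprint shape for one issue.
interface Fingerprint {
  issueId: number;
  embedding: number[];
  labels: string[];
  files: string[];
}

// Assumed weights (sum to 1); not specified by the plan.
const WEIGHTS = { cosine: 0.6, labels: 0.25, files: 0.15 };

/** Linear blend of semantic similarity with label and file overlap. */
function blendedScore(query: Fingerprint, candidate: Fingerprint): number {
  return (
    WEIGHTS.cosine * cosineSimilarity(query.embedding, candidate.embedding) +
    WEIGHTS.labels * jaccard(query.labels, candidate.labels) +
    WEIGHTS.files * jaccard(query.files, candidate.files)
  );
}

/** Rank candidates by blended score and keep the best k. */
function rankTopK(query: Fingerprint, candidates: Fingerprint[], k: number = 5) {
  return candidates
    .map((c) => ({ issueId: c.issueId, score: blendedScore(query, c) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The structural terms keep two issues with similar wording but unrelated labels and files from ranking above an issue that touched the same code.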

## Data Requirements
- **Source repositories**: choose one or more public repos with rich history (e.g., `microsoft/vscode`, `numpy/numpy`).
- **Data types**:
  - Issue metadata: titles, bodies, labels, milestones, timestamps, author.
  - Linked PR data: churn metrics, review state counts, merge outcomes, time-to-merge.
  - Post-merge signals: subsequent bug reports, reverts, security advisories (optional but valuable).
  - Assessment + risk records generated by IssueTriage once instrumentation exists.
  - Similarity captures: likeness summaries, embedding vectors, structured fingerprints saved when issues close or PRs merge.
- **Collection tooling**: extend `GitHubClient` or create scripts to export historical snapshots, respecting rate limits and caching.
- **Storage**: structured dataset (Parquet/SQLite/Postgres) with normalized tables for issues, PRs, risk outcomes, derived features, plus a similarity index table storing vectors and metadata for retrieval.

## Existing Risk Intelligence Signals
- **Risk intelligence comment**: each closed issue captures `riskLevel` (e.g., Low), `riskScore` (e.g., 15), and the `lastUpdated` timestamp from the risk engine. We store each comment in the local SQLite cache and will standardize the markdown block with a dedicated header tag (e.g., ``) so exports can target the most recent instance unambiguously.
- **Operational metrics**: parsed from the same comment block and include `linkedCommits`, `filesTouched`, `linesChanged`, and `reviewFrictionSignals` (currently integer counts). These metrics become typed columns in the dataset for model consumption without secondary parsing.
- **Composite model assessment**: initial assessment comment contains `compositeScore` (e.g., 62.5), per-dimension scores (`requirements`, `complexity`, `security`, `business`), the `modelId` (e.g., openai/gpt-5-mini), and `runTimestamp`. The block will use a matching tag (e.g., ``) and the exporter selects the latest occurrence.
- **Qualitative guidance**: textual summary (e.g., "Add missing context then reassess."), model provenance statement, and structured question prompts. We retain the raw summary as a `TEXT` column and expose a `summary_embedding` vector in the embeddings table for similarity retrieval.
- **Issue context**: canonical issue title, number, author, milestone, labels, and comment stream provide additional supervised signals and retrieval evidence; comments unrelated to our tagged blocks remain available to the similarity engine.

## Comment Tagging Specification
- **Goals**: keep comments visually friendly while giving exporters deterministic anchors; tags should survive manual edits and support versioning.
- **Wrapper syntax**: each machine-readable block is enclosed by paired HTML comments with required attributes:
   - Risk comment wrapper: `` ... ``
   - Assessment comment wrapper: `` ... ``
- **Display payload**: inside the wrapper, keep human-readable headings and bullet lists so reviewers can skim without tooling. Example:

```markdown

Risk Intelligence
Low risk · Score 15

Last updated 10/19/2025, 8:25:33 PM

Key metrics
- 1 linked commits
- 3 files touched
- 15 lines changed
- 0 review friction signals



```

- **Embedded data node**: the `` tag holds canonical JSON for exporters. Renderers ignore it while parsers read the JSON payload. The wrapper attributes provide quick filters (versioning, timestamps, emit source).
- **Assessment block**: follow the same pattern with fields tailored to composite scoring:

```markdown

Composite Assessment
Composite 62.5 · Model openai/gpt-5-mini

Dimensions
- Requirements: 60.0
- Complexity: 80.0
- Security: 45.0
- Business: 65.0

Summary
Add missing context then reassess.



```

- **Multiple runs**: when updates occur, append a fresh tagged block to the issue comments. The exporter selects the latest `version` and `issued-at` instance, while older blocks remain for audit history.
- **Error handling**: if the JSON subtag is malformed, the exporter falls back to cache data and logs a warning so operators can fix the comment manually.
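
The block selection and malformed-JSON handling could be sketched as below, operating on blocks that have already been located in the comment stream. `TaggedBlock` and `selectLatestBlock` are hypothetical names, and two details are assumptions: `version` is treated as a number, and ties on `version` are broken by the later `issued-at` timestamp.

```typescript
// Hypothetical representation of one tagged block already extracted from a comment.
interface TaggedBlock {
  version: number;     // wrapper `version` attribute (assumed numeric)
  issuedAt: string;    // wrapper `issued-at` attribute (assumed ISO 8601)
  jsonPayload: string; // raw text of the embedded data node
}

interface ParsedBlock<T> {
  version: number;
  issuedAt: string;
  data: T;
}

/**
 * Pick the block with the highest version, breaking ties by the latest issued-at.
 * Returns null when there are no blocks or the chosen payload is malformed,
 * which signals the caller to fall back to cache data.
 */
function selectLatestBlock<T>(
  blocks: TaggedBlock[],
  warn: (message: string) => void = () => {},
): ParsedBlock<T> | null {
  if (blocks.length === 0) return null;
  const latest = blocks
    .slice()
    .sort(
      (a, b) =>
        b.version - a.version || Date.parse(b.issuedAt) - Date.parse(a.issuedAt),
    )[0];
  try {
    const data = JSON.parse(latest.jsonPayload) as T;
    return { version: latest.version, issuedAt: latest.issuedAt, data };
  } catch {
    warn(`Malformed JSON in block issued at ${latest.issuedAt}; falling back to cache data`);
    return null;
  }
}
```

Older blocks are never modified here, so the audit history in the issue comments stays intact.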

## Likeness Summary & Similarity Strategy
- **Objective**: encode each closed issue into a concise, semantically rich summary that drives LLM embeddings and human review of related work.
- **Authoring**: generate the likeness summary with an LLM prompt that incorporates change metadata (labels, touched files, churn metrics) and risk outcomes. Anchor the prompt to a strict template to keep outputs consistent.
- **Recommended template**: 2–3 sentences covering problem statement, implementation touchpoints, and notable risk/complexity signals, followed by a bullet list of key keywords. Example prompt sketch:

```
Summarize the closed issue for historical risk learning.
Include:
1. Intent (component + change objective).
2. Implementation footprint (files or subsystems, approx change size).
3. Notable risk drivers (dependencies, security surfaces, coordination).
End with a line 'Keywords: <comma-separated tokens>'. Max 120 tokens.
```

- **Locked prompts**:
   - *Summary generation prompt* (system):

      ```
      You are IssueTriage, producing likeness summaries for closed GitHub issues. Respond in English.
      Output two short sentences (<=120 tokens total) describing intent, implementation footprint, and risk drivers.
      End with a line exactly formatted as 'Keywords: <comma-separated tokens>'.
      Use past-tense phrasing ("Updated", "Refactored") and avoid markdown headings.
      ```

   - *Summary generation prompt* (user template):

      ```
      Issue title: {title}
      Labels: {labels_csv}
      Files touched: {files_csv}
      Lines changed: {lines_changed}
      Linked commits: {linked_commits}
      Risk level: {risk_level}
      Risk score: {risk_score}
      Description:
      {body_excerpt}
      ```

   - *Similarity justification prompt* (system):

      ```
      You explain why two historical issues are similar for risk assessment. Respond with one sentence <=60 tokens, referencing concrete overlaps.
      ```

   - *Similarity justification prompt* (user template):

      ```
      Candidate issue summary: {candidate_summary}
      Query issue summary: {query_summary}
      Shared attributes: Labels[{shared_labels_csv}] Files[{shared_files_csv}]
      ```

- **Embedding pipeline**:
   - Run the LLM locally or via API to produce the likeness summary when an issue closes or during the export batch.
   - Feed the summary text into a sentence embedding model (OpenAI text-embedding-3, Azure embed, or open-source alternative) and store the resulting vector in the `embeddings` table with `source='likeness_summary'`.
- **Similarity scoring**:
   - Primary score: cosine similarity between summary embeddings.
   - Signal blend: linearly combine the cosine score with structured overlaps (label Jaccard, shared file hotspots, delta in risk levels) to avoid purely semantic matches.
   - Re-ranking: for the top-k cosine matches, ask a lightweight LLM prompt to justify similarity, producing an evidence snippet for the UI (e.g., "Both touched auth middleware and involved large refactors").
- **Inference flow**:
   1. For a new issue, create a provisional likeness summary using available metadata + user description.
   2. Compute embedding and query the ANN index (Faiss/SQLite FTS + vector extension).
   3. Return top-k candidates with blended scores and optional LLM-generated rationales.
   4. Cache results and update as additional context (linked PRs, telemetry) arrives.

- **API guardrails**:
   - Default to batching likeness summaries during manual export runs to minimize per-issue request overhead (e.g., 50 summaries per run).
   - Enforce a daily token budget (e.g., 200k tokens) configurable via settings; surface warnings in the UI when usage approaches 80%.
   - Cache generated summaries and justifications in SQLite with `analysis_run_id` to avoid repeat calls.
   - Provide a "dry run" mode to estimate token spend without calling the LLM, using average tokens per prompt (summary ≈ 300 input, 70 output; justification ≈ 200 input, 40 output).
   - Respect API rate limits by inserting jittered delays between requests (default 50 ms) and retrying with exponential backoff on 429/5xx responses up to 3 attempts.
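
A rough sketch of the rate-limit guardrail follows. It assumes the 50 ms default doubles on each retry and that jitter scales the delay to between 50% and 100% of that value; neither detail is fixed by the plan. `HttpResult`, `backoffDelayMs`, and `withRetry` are hypothetical names.

```typescript
const BASE_DELAY_MS = 50; // plan default for jittered delays
const MAX_RETRIES = 3;    // plan: up to 3 retries on 429/5xx

/** Only rate-limit and server errors are retried. */
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

/**
 * Delay before retry number `attempt` (0-based): the base delay doubled per
 * attempt, then scaled by a jitter factor between 0.5 and 1.0 (assumed range).
 */
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const exponential = BASE_DELAY_MS * Math.pow(2, attempt);
  return Math.round(exponential * (0.5 + 0.5 * random()));
}

// Hypothetical minimal response shape.
interface HttpResult<T> {
  status: number;
  body?: T;
}

/** Run `request`, retrying retryable statuses with exponential backoff. */
async function withRetry<T>(
  request: () => Promise<HttpResult<T>>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<HttpResult<T>> {
  let result = await request();
  for (let attempt = 0; attempt < MAX_RETRIES && isRetryable(result.status); attempt++) {
    await sleep(backoffDelayMs(attempt));
    result = await request();
  }
  return result; // may still carry a retryable status after the final retry
}
```

Injecting `random` and `sleep` keeps the delay logic deterministic in unit tests.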

## Embedding & Retrieval Options
- **Embedding providers**:
   - *OpenAI/Azure OpenAI (MVP choice)*: standardize on `text-embedding-3-large` for training exports and `text-embedding-3-small` for real-time queries. This keeps us on the same vendor path as existing assessment models while leaving room to swap hosts (OpenAI vs Azure OpenAI) if governance requires.
   - *Fallback: Hugging Face sentence transformers*: keep a documented alternative (`all-mpnet-base-v2`) for offline mode, but defer implementation until we face network constraints.
   - *Future: fine-tuned embeddings*: revisit once labeled similarity pairs exist; plan to package via ONNX so we can self-host if API policies shift.
- **Vector storage**:
   - *SQLite extensions*: use pgvector-like add-ons (`sqlite-vss`, `sqlite-vec`) to keep embeddings beside other issue data for MVP; good for small to medium corpora.
   - *Faiss on-disk index*: efficient approximate nearest neighbor for thousands to millions of vectors; integrate with Python export job and persist index artifacts.
   - *Managed services*: Pinecone, Weaviate Cloud, Azure Cognitive Search with vector fields; useful once we need multi-repo scale or shared inference endpoints.
- **Similarity search modes**:
   - *Pure cosine/inner product*: baseline ranking directly on embedding vectors.
   - *Hybrid semantic + keyword*: combine vector search with SQLite full-text search over summaries/comments to balance lexical signals.
   - *Graph-aware rerankers*: build secondary filters using shared files, label overlap, or risk level differences to refine top results.
- **Title & comment enhancements**:
   - Maintain separate embeddings for issue titles, likeness summaries, and curated comment snippets; blend scores via weighted average or reciprocal rank fusion.
   - Normalize titles before indexing (lowercase, strip prefixes like "Bug:"/"Feature:"/issue IDs) to improve both lexical and semantic matching.
   - Run an FTS index on titles plus tagged comment excerpts so exact keyword hits can seed candidates prior to vector re-ranking.
   - Extract prominent entities (components, file paths, dependency names) from comments and persist them as auxiliary keywords to power filters and UI facets.
- **Storage lifecycle**:
   - Persist embeddings alongside the `manifest` and `export_run_id` so retraining jobs can sync versions.
   - Regenerate embeddings when the likeness prompt/schema changes; version via `embedding_model` and `embedding_run_id` columns.
   - Archive older Faiss/ANN indexes for reproducibility while the live index tracks the latest export.
- **ANN implementation (MVP)**:
   - Choose the `sqlite-vss` extension to embed vectors directly in the existing SQLite export, enabling nearest-neighbor `SELECT` queries through the extension's `vss_search()` function without external services.
   - Batch index updates: after serialization, insert embeddings in chunks of 200 rows to avoid long transactions; vacuum the VSS index once per export run.
   - Fall back to brute-force cosine search via SQL when the extension is unavailable or vector count < 200 (auto-detected), logging a warning but keeping results.
   - Error handling: if any issue lacks one of the required embeddings (`title`, `likeness_summary`), skip it during indexing and append a warning row in the validation report; ensure query path filters out missing vectors gracefully.
   - Versioning: store the VSS index parameters (`dimension`, `metric`, `lists`) in the manifest so later runs can reproduce settings.
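
The batching and fallback rules for the MVP index can be illustrated as below. Note one substitution: the plan describes the brute-force path as SQL, while this sketch computes cosine in process for clarity. All function and type names are hypothetical.

```typescript
const MIN_VECTORS_FOR_ANN = 200; // below this the plan prefers brute force
const INSERT_CHUNK_SIZE = 200;   // rows per insert transaction

interface StoredVector {
  issueId: number;
  vector: Float32Array | null; // null models a missing embedding
}

/** Split rows into insert batches so no single transaction grows too long. */
function chunkRows<T>(rows: T[], size: number = INSERT_CHUNK_SIZE): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

/** Decide whether to skip the ANN index for this query. */
function shouldUseBruteForce(extensionAvailable: boolean, vectorCount: number): boolean {
  return !extensionAvailable || vectorCount < MIN_VECTORS_FOR_ANN;
}

/** Cosine similarity; returns 0 when either vector has zero norm. */
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Exhaustive top-k search that skips missing or wrongly sized vectors. */
function bruteForceTopK(
  query: Float32Array,
  rows: StoredVector[],
  k: number,
): Array<{ issueId: number; score: number }> {
  const scored: Array<{ issueId: number; score: number }> = [];
  for (const row of rows) {
    if (row.vector === null || row.vector.length !== query.length) continue;
    scored.push({ issueId: row.issueId, score: cosine(query, row.vector) });
  }
  return scored.sort((x, y) => y.score - x.score).slice(0, k);
}
```

For a few hundred vectors an exhaustive scan is cheap, which is why the fallback can keep results rather than fail the query.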

## Historical Dataset Schema
- **Storage target**: single SQLite bundle co-located with the extension for MVP training runs, with Parquet export hooks for future server-side scaling.
- **Import contract**: exporter normalizes comment-derived metrics into typed columns so downstream ML jobs avoid brittle parsing logic.

| Table | Key Fields | Description |
| --- | --- | --- |
| `issues` | `issue_id` (PK), `repo_slug`, `number`, `title`, `body`, `state`, `author`, `created_at`, `closed_at`, `milestone_id`, `labels` (array/json) | Canonical issue record and categorical features.
| `risk_intelligence_snapshots` | `snapshot_id` (PK), `issue_id` (FK), `risk_level`, `risk_score`, `last_updated`, `linked_commits`, `files_touched`, `lines_changed`, `review_friction_signals`, `comment_tag` | Structured view of the "Risk Intelligence" comment payload captured at close with tag provenance.
| `analysis_scores` | `analysis_id` (PK), `issue_id` (FK), `composite_score`, `requirements_score`, `complexity_score`, `security_score`, `business_score`, `model_id`, `model_run_timestamp`, `analysis_summary`, `comment_tag` | Quantitative and textual outputs from the initial assessment block with tag provenance.
| `activity_metrics` | `issue_id` (PK/FK), `linked_pr_ids`, `time_to_first_response`, `cycle_time`, `comment_count`, `assignee_count`, `hot_file_score` | Aggregated behavioral signals computed during export.
| `comments` | `comment_id` (PK), `issue_id` (FK), `author`, `created_at`, `body`, `is_system_generated` | Free-form context for similarity search and qualitative supervision.
| `embeddings` | `embedding_id` (PK), `issue_id` (FK), `source` (`title`/`body`/`likeness_summary`), `vector`, `model`, `created_at` | Semantic vectors aligned with similarity retrieval strategy.

- **Derived views**: materialize `issue_risk_training_view` joining the above tables and flattening categorical features into model-ready columns. Labels and linked PR IDs stay as JSON arrays to preserve multi-value similarity signals, avoiding extraneous audit metadata.
- **Versioning**: add `export_run_id` and `source_comment_hash` columns on snapshot tables to track provenance and enable reprocessing when comment formats change; manual exports track runs through a separate manifest table.

## Import Workflow
- **Initiation**: researcher triggers an "Analyze closed issues" command inside the extension UI, which prompts the local engine to scan for closed issues lacking tagged analysis/risk comments and backfill them before export.
- **Extraction**: exporter reads from the local SQLite cache (primary data source) and, if needed, pulls additional GitHub data using the authenticated session associated with the connected repository.
- **Normalization**: parser maps extracted metrics to typed schema fields, applies defaulting for missing values, and stores numeric metrics as integers/floats for ML readiness.
- **Validation**: run contracts that check value ranges (e.g., `risk_score` within 0-100, `linked_commits` >= 0) and emit anomalies for manual review.
- **Serialization**: batch writes normalized rows into the local SQLite dataset and emits a manifest containing schema version, export timestamp, record counts, and `export_run_id`.
- **Loading**: ML experiments consume the SQLite file directly for MVP; future pipelines may mirror to Parquet/Postgres but the local file remains the authoritative source during manual training cycles.

- **Exporter acceptance criteria**:
   - Manifest schema includes: `export_run_id`, `repo_slug`, `issues_exported`, `snapshots_exported`, `analysis_exported`, `export_started_at`, `export_completed_at`, `schema_version`, `embedding_model`, `token_usage_summary`.
   - Validation thresholds: reject export if >5% of closed issues lack risk snapshots, or if any numeric metric falls outside expected bounds (`risk_score` 0-100, `lines_changed` >=0, `linked_commits` >=0). Provide warning-only for 1-5% gaps.
   - Integrity checks: ensure every row in `risk_intelligence_snapshots` has a matching `analysis_scores` entry (within the same export) unless explicitly flagged `analysis_missing=true`.
   - Verify embeddings count equals number of issues in manifest for sources `title`, `body`, and `likeness_summary`; log discrepancies.
   - Persist validation report with summary counts and error list in manifest directory for audit.
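
The validation thresholds above could be expressed along these lines. This is a sketch under assumptions: `SnapshotRow` carries only the fields needed for the bounds checks, gaps under 1% are treated as a pass, and any out-of-bounds metric rejects the export.

```typescript
// Hypothetical row shape: just the fields the bounds checks need.
interface SnapshotRow {
  issueId: number;
  riskScore: number;
  linesChanged: number;
  linkedCommits: number;
}

type ValidationStatus = "pass" | "warn" | "reject";

interface ValidationReport {
  status: ValidationStatus;
  missingSnapshotRatio: number;
  errors: string[];
}

const REJECT_RATIO = 0.05; // reject when more than 5% of closed issues lack snapshots
const WARN_RATIO = 0.01;   // warn for gaps between 1% and 5%

function validateExport(closedIssueIds: number[], snapshots: SnapshotRow[]): ValidationReport {
  const errors: string[] = [];
  const covered = new Set<number>();
  for (const row of snapshots) {
    covered.add(row.issueId);
    if (row.riskScore < 0 || row.riskScore > 100) {
      errors.push(`issue ${row.issueId}: risk_score ${row.riskScore} outside 0-100`);
    }
    if (row.linesChanged < 0) {
      errors.push(`issue ${row.issueId}: lines_changed ${row.linesChanged} is negative`);
    }
    if (row.linkedCommits < 0) {
      errors.push(`issue ${row.issueId}: linked_commits ${row.linkedCommits} is negative`);
    }
  }
  const missing = closedIssueIds.filter((id) => !covered.has(id)).length;
  const ratio = closedIssueIds.length === 0 ? 0 : missing / closedIssueIds.length;

  let status: ValidationStatus = "pass";
  if (errors.length > 0 || ratio > REJECT_RATIO) {
    status = "reject";
  } else if (ratio >= WARN_RATIO) {
    status = "warn";
  }
  return { status, missingSnapshotRatio: ratio, errors };
}
```

The returned report has the same shape whether the export passes or fails, so it can be persisted next to the manifest in either case.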

## Operational Scope
- **Repository coverage**: MVP targets the user-selected repository connected through the extension; the export runs on demand rather than on a fixed schedule.
- **Access model**: leverage the existing local SQLite engine and authenticated GitHub client to request batch analyses for any closed issues missing the standardized comment blocks.
- **Refresh cadence**: training remains a manual process for now; operators trigger exports and subsequent model retraining as needed until an automated cadence is justified.
- **Scalability plan**: design schema and manifest format to support eventual migration to a server-side store (e.g., Postgres) once dataset size or multi-repo aggregation requires it, without blocking the local inference prototype.

## UI Additions (ML Training Tab)
- **Navigation**: add an "ML Training" tab to the IssueTriage panel alongside existing assessment views.
- **Primary action**: include a prominent `Train now` button that executes the export pipeline and, upon success, kicks off local model training/inference rebuild.
- **Secondary info**: show last export timestamp, issues processed, and most recent token usage summary; disable the button with tooltip if no closed issues require analysis.
- **Progress feedback**: display step-by-step status (Backfill → Validate → Embed → Persist) with spinner, turning each step to a checkmark on completion; surface warnings inline with links to the validation report.
- **Post-run**: provide quick links to the generated manifest and allow user to open the latest similarity search results within the tab.

## Feature Engineering
- Text embeddings from issue titles/bodies (OpenAI/Azure, sentence-transformers) for similarity and semantic features.
- Numeric aggregates: historical churn averages per label/component, team cadence, time-based trend features.
- Graph features: author collaboration networks, file hot-spots, PR dependency counts.
- Labels for supervised learning: existing `riskLevel`, `riskScore`, binary outcomes (e.g., high-risk vs non-high-risk).

## Modeling Strategy
1. **Baseline Heuristics**
   - Rule-based scoring using historical averages per label/component.
   - Serves as benchmark and fallback.
2. **Classical ML**
   - Train gradient boosting or random forest classifiers/regressors on engineered features.
   - Predict risk level (classification) and risk score (regression).
   - Evaluate with cross-validation, precision/recall for high-risk, MAE for scores.
3. **LLM-Assisted**
   - Prompt LLM with structured historical context plus new issue details to obtain predicted metrics.
   - Optionally fine-tune small models (LLaMA-family) with instruction data generated from historical pairs.
   - Compare performance vs ML models; consider ensemble (LLM justification + ML probability).
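
For the first stage, a baseline built on per-label historical averages might look like the sketch below. The level thresholds (34 and 67), the confidence formula, and all names are assumptions made for illustration; the plan specifies only that the baseline uses historical averages per label/component.

```typescript
type RiskLevel = "low" | "medium" | "high";

interface HistoricalIssue {
  labels: string[];
  riskScore: number;
}

interface LabelStats {
  mean: number;
  count: number;
}

interface BaselinePrediction {
  riskLevel: RiskLevel;
  riskScore: number;
  confidence: number;
  drivers: string[];
}

/** Average historical risk score per label. */
function buildLabelStats(history: HistoricalIssue[]): Map<string, LabelStats> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const issue of history) {
    for (const label of issue.labels) {
      const entry = sums.get(label) ?? { total: 0, count: 0 };
      entry.total += issue.riskScore;
      entry.count += 1;
      sums.set(label, entry);
    }
  }
  const stats = new Map<string, LabelStats>();
  sums.forEach((entry, label) => {
    stats.set(label, { mean: entry.total / entry.count, count: entry.count });
  });
  return stats;
}

// Assumed cut-offs on a 0-100 score; not defined by the plan.
function toRiskLevel(score: number): RiskLevel {
  if (score >= 67) return "high";
  if (score >= 34) return "medium";
  return "low";
}

/** Predict from the mean of known label averages; unknown labels fall back to the global mean. */
function predictBaseline(
  labels: string[],
  stats: Map<string, LabelStats>,
  globalMean: number,
): BaselinePrediction {
  const known = labels
    .map((label) => ({ label, stat: stats.get(label) }))
    .filter((x): x is { label: string; stat: LabelStats } => x.stat !== undefined);
  if (known.length === 0) {
    return { riskLevel: toRiskLevel(globalMean), riskScore: globalMean, confidence: 0, drivers: [] };
  }
  const riskScore = known.reduce((sum, x) => sum + x.stat.mean, 0) / known.length;
  const support = known.reduce((sum, x) => sum + x.stat.count, 0);
  const drivers = known
    .slice()
    .sort((a, b) => b.stat.mean - a.stat.mean)
    .map((x) => `${x.label}: historical mean ${x.stat.mean.toFixed(1)} over ${x.stat.count} issues`);
  // Assumed heuristic: confidence grows with supporting issue count and saturates at 20.
  return { riskLevel: toRiskLevel(riskScore), riskScore, confidence: Math.min(1, support / 20), drivers };
}
```

Because it needs no training step, a baseline like this can serve as the fallback when no model artifact is available.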

## Evaluation Plan
- Split dataset by time (train on past periods, test on future intervals) to mimic real deployment.
- Metrics:
  - Classification: F1, recall on high-risk, ROC-AUC.
  - Regression: MAE / RMSE on riskScore.
  - Similarity retrieval: precision@k for finding truly related past issues.
- Baseline comparison: ensure model outperforms simple heuristics.
- Statistical significance tests when combining multiple repos.
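
The time-based split and a few of the metrics listed above can be computed with small helpers such as these. The function names and the `closedAt` field are assumptions for this sketch; `precisionAtK` divides by `k` even when fewer than `k` results are returned, which is one common convention.

```typescript
/** Split rows by close date: strictly before the cutoff trains, the rest tests. */
function timeSplit<T extends { closedAt: string }>(
  rows: T[],
  cutoffIso: string,
): { train: T[]; test: T[] } {
  const cutoff = Date.parse(cutoffIso);
  return {
    train: rows.filter((r) => Date.parse(r.closedAt) < cutoff),
    test: rows.filter((r) => Date.parse(r.closedAt) >= cutoff),
  };
}

/** Mean absolute error between predicted and actual risk scores. */
function meanAbsoluteError(predicted: number[], actual: number[]): number {
  if (predicted.length !== actual.length || predicted.length === 0) {
    throw new Error("Inputs must be non-empty and of equal length");
  }
  let total = 0;
  for (let i = 0; i < predicted.length; i++) {
    total += Math.abs(predicted[i] - actual[i]);
  }
  return total / predicted.length;
}

/** Fraction of the top-k retrieved issue ids that are truly related. */
function precisionAtK(retrieved: number[], relevant: Set<number>, k: number): number {
  if (k <= 0) throw new Error("k must be positive");
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

/** Recall for one class: true positives / (true positives + false negatives). */
function recallForClass(predicted: string[], actual: string[], target: string): number {
  if (predicted.length !== actual.length) throw new Error("Inputs must be of equal length");
  let truePositives = 0;
  let falseNegatives = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] !== target) continue;
    if (predicted[i] === target) truePositives++;
    else falseNegatives++;
  }
  const positives = truePositives + falseNegatives;
  return positives === 0 ? 0 : truePositives / positives;
}
```

Splitting by close date rather than at random keeps information from later issues out of the training set.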

## Integration Steps
1. **Data Pipeline**: batch exporter to populate historical dataset; run on demand for the MVP, with a scheduled refresh deferred until an automated cadence is justified.
2. **Model Serving**: package trained model (e.g., ONNX or JSON weights) accessible from VS Code extension via local service or remote API.
3. **Extension Wiring**:
   - Add `PredictedRiskService` to request predictions when issue loads.
   - Merge outputs with existing `RiskIntelligenceService` summaries.
   - Display confidence, key drivers, and similar past items in the IssueTriage panel.
   - Integrate `SimilarityService` to return top-k matches from precomputed embeddings/fingerprints for UI and prompt augmentation.
4. **Prompt Augmentation**: feed predicted metrics and retrieved historical snippets into the assessment payload for OpenRouter.

## Testing Strategy
- Offline unit tests for feature extractors and data loaders.
- Regression tests validating model predictions against a frozen evaluation set.
- Integration tests simulating extension calls against a mock service.
- Manual validation using selected public repo issues to check qualitative usefulness.

### MVP Test Plan
- **Unit tests**
   - Exporter parsers convert tagged comments and SQLite cache rows into schema-compliant DTOs (cover missing field defaults and malformed JSON handling).
   - Manifest builder enforces required fields and fails when validation thresholds are exceeded.
   - Likeness summary prompt wrapper ensures output matches keyword line regex and token limit logic.
   - Embedding pipeline stubs produce deterministic vectors and raise errors if required sources absent.
- **Integration tests**
   - End-to-end export run using a fixture repository snapshot (seed SQLite + mock GitHub) yields manifest, snapshot tables, and embeddings with counts matching expectations.
   - Similarity query against the sqlite-vss index returns blended results, including fallback path when vector extension is disabled.
   - `Train now` command publishes progress events and final status to the UI mock.
- **Manual validation**
   - Run the ML Training tab against a known repo (e.g., sample dataset) and confirm comment tagging, manifest output, and similarity results align with expectations.
   - Inspect validation report warnings, retry with intentionally missing embeddings, and verify fallback behavior.
   - Smoke test API quotas by triggering consecutive exports until hitting 80% budget warning.
- **Performance spot-check**
   - Measure export runtime on ~500 closed issues, ensuring end-to-end process completes within the target (e.g., <5 minutes) and document observed token consumption.
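
The likeness summary check named in the unit tests above could be implemented roughly as follows. The regular expression and the `checkLikenessSummary` name are assumptions, and the token count is approximated by whitespace-separated words, which will differ from a real tokenizer's count.

```typescript
// Assumed pattern for the final line: "Keywords: a, b, c" with at least one keyword.
const KEYWORDS_LINE = /^Keywords: [^,\n]+(?:, ?[^,\n]+)*$/;

interface SummaryCheck {
  ok: boolean;
  problems: string[];
}

/** Validate an LLM-generated likeness summary against the locked prompt's format rules. */
function checkLikenessSummary(text: string, maxTokens: number = 120): SummaryCheck {
  const problems: string[] = [];
  const trimmed = text.trim();
  const lines = trimmed.split("\n");
  const lastLine = lines[lines.length - 1] ?? "";

  if (!KEYWORDS_LINE.test(lastLine.trim())) {
    problems.push("Last line is not formatted as 'Keywords: <comma-separated tokens>'");
  }

  // Rough proxy for token count: whitespace-separated words.
  const approxTokens = trimmed.split(/\s+/).filter((word) => word.length > 0).length;
  if (approxTokens > maxTokens) {
    problems.push(`Approximate token count ${approxTokens} exceeds limit ${maxTokens}`);
  }

  if (/^#{1,6} /m.test(trimmed)) {
    problems.push("Contains a markdown heading");
  }
  return { ok: problems.length === 0, problems };
}
```

Summaries that fail the check can be regenerated or flagged in the validation report rather than embedded as-is.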

## Tooling & Infrastructure
- Python or TypeScript data pipeline (consider `scripts/` folder) leveraging GitHub REST/GraphQL.
- ML stack: scikit-learn/LightGBM for classical models, Hugging Face transformers or Azure OpenAI for embeddings/LLM prompts.
- Experiment tracking (MLflow, Weights & Biases) to log runs and metrics.
- Versioned model artifacts stored in storage bucket or repository releases.

## Risks & Mitigations
- **Sparse labeled data**: bootstrap with heuristic labels, progressively replace with real assessment outcomes.
- **API limits**: implement caching/backoff, consider public GitHub archive datasets (e.g., GH Archive, queryable through BigQuery).
- **Model drift**: schedule retraining; monitor telemetry from live predictions.
- **Privacy**: ensure no sensitive repository data stored without consent; default to public repos for initial models.

## Timeline (High-Level)
1. **Weeks 1-2**: Data audit, repository selection, pipeline scaffold.
2. **Weeks 3-5**: Feature engineering, baseline heuristics, initial ML model.
3. **Weeks 6-8**: LLM prompt experiments, similarity retrieval integration.
4. **Weeks 9-10**: Extension integration prototype, UI surfacing, offline evaluation suite.
5. **Week 11+**: Feedback loop, telemetry instrumentation, incremental deployment.

## Next Actions
- Confirm pilot repository list and secure API tokens.
- Define schema for historical dataset and implement initial export script.
- Choose experiment tracking approach and spin up environment (local or cloud notebook).
- Draft specification for `PredictedRiskService` interface inside the extension.
- Design likeness summary generator and similarity index schema; prototype offline ANN lookup against exported data.
