252 changes: 252 additions & 0 deletions docs/designs/owasp-pane-of-glass.md
@@ -0,0 +1,252 @@
# RFC: The OpenCRE Scraper & Indexer (Project OIE)
Collaborator


Change name to OWASP Agent. Position it as promise first: the why, not the how. So not: 'scraper and indexer'


| Meta | Details |
| :---------------- | :--------------------------------------- |
| **Status** | `Draft` / `Review Pending` |
| **Target System** | OpenCRE.org / OWASP Chatbot |
| **Focus** | Automation, Knowledge Graph, Low-Ops ETL |
| **Authors** | Spyros Gasteratos |
| **Date** | 2026-02-01 |
---

## 1. Context & Motivation

**Problem Statement**
OWASP produces an immense amount of high-value security knowledge, but it is fragmented.
A developer looking for "JWT Security" might find a *Cheat Sheet*, but miss the corresponding *ASVS* requirement, *Testing Guide* techniques, and relevant *AppSec Global* talks that explain bypasses and defences.

**Current State**
OpenCRE currently maps standards (NIST, ISO, OWASP) well.
However, it fails to capture the "living" knowledge of the community—repo updates, new chapters, events, and blog posts.

**Proposed Solution**
We can build a reliable ETL (Extract, Transform, Load) pipeline that acts as a **Scraper & Indexer**.
It will autonomously ingest raw content, filter out noise, and link it to the OpenCRE or "community/owasp info" graphs, making it queryable via the existing chat interface.

---

## 2. Architecture Overview

The system consists of 4 autonomous modules.

**Info for contributors: DO NOT write production code for a module until its "Pre-Code Experiment" has passed.**

### Module A: Information Harvesting & Normalization

**Goal:** Connect to configured sources, **backfill** unseen resources, **incrementally** detect changes, then **normalize and chunk** content before anything reaches Module B.

Module A owns **acquire + first-read + structure**, not relevance filtering (B) or CRE linking (C).

```ascii
[ GitHub Actions ] (Trigger: 02:00 UTC, or manual backfill)
|
v
[ A.1 Config Reader ] (sources.yaml: repos, feeds, URLs)
|
v
[ A.2 Artifact Registry ] (Have we seen this artifact before?)
|
+--(new / unknown)----> [ A.3a Backfill Reader ] --> full resource body
|
+--(known)------------> [ A.3b Incremental Fetcher ] --> git diff / poll / hash
|
v
[ A.4 Normalize ] (HTML/Markdown -> clean text; strip boilerplate)
|
v
[ A.5 Chunk ] (heading-aware splits; token/size limits per chunk)
|
v
[ Ingest Bucket ] (ArtifactIngestEvent + IngestChunk[], JSONL)
```

* Difficulty: ⭐⭐⭐ (Medium–High)
* Vibe Coding Potential: Low for connectors/registry; medium for chunking tuning.
* Tech Stack: Python (requests, PyGithub, markdown/html parsers), GitHub Actions (Cron), object storage for ingest batches.
* **Harvest modes**
* **Backfill:** First time an artifact appears in the registry — read the **full** resource (whole file, feed entry body, page HTML), normalize, chunk entirely.
* **Incremental:** Known artifact — fetch only what changed since last `content_hash` / commit window; re-chunk affected sections or whole artifact if structure shifted.
* **Connectors (adapters):** Pluggable per `source.type` (`github`, `rss`, `url`, …). If you have already played with **LlamaIndex** in Colab, a familiar pattern is: load files with a reader → split into chunks → write JSON lines. Field names in `docs/owasp-graph/apis/` are chosen to feel natural after that (see hints on `IngestChunk` / `ArtifactIngestEvent`).
* **Chunking (owned by A):** Markdown: split on headings with max chunk size; HTML: readability/main-content extraction then heading or paragraph splits; cap chunk size for downstream LLM cost. Each chunk gets a stable `chunk_id` under its `artifact_id`.
* MVP Logic: Backfill all `*.md` in ASVS + WSTG once; nightly incremental for git sources; emit `IngestBatch` to ingest bucket.
* Pre-Code Experiment (Do This First)
  * Sifting through "Junk": Manually inspect the file structure of 10 random OWASP repositories.
    * Task: Identify common junk files (e.g., package-lock.json, CNAME, _config.yml).
    * Goal: Create a Regex Exclusion List that eliminates 90% of noise without downloading the files.
  * "Diff" Simulation: Pick a large Markdown file (e.g., in wstg). Modify one paragraph.
    * Task: Fetch the modified paragraph as clean text (not raw `<<<<` markers).
    * Success Criteria: Incremental run emits only affected chunk(s), or flags full re-chunk when headings move.
  * **Backfill drill:** Pick one repo path never ingested before. Run backfill once; verify chunk count > 0 and `event_type: discovered`. Run again with no edits; verify **no duplicate** chunks emitted (registry + content hash).

* Bonus / Pro-Mode: LLM Diff Judge:
For ambiguous git diffs, ask a lightweight LLM whether meaning changed before re-chunking.
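
The heading-aware chunking described for A.5 can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Chunk` fields, the `chunk_markdown` name, the character-based size cap (a stand-in for a token limit), and the `chk:{artifact_id}:{index}` id layout are all assumptions made for the example.

```python
import hashlib
import re
from dataclasses import dataclass

HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass(frozen=True)
class Chunk:
    # Hypothetical shape; the real IngestChunk schema lives in docs/owasp-graph/apis/.
    chunk_id: str
    artifact_id: str
    heading: str
    text: str
    content_hash: str


def chunk_markdown(artifact_id: str, markdown: str, max_chars: int = 2000) -> list[Chunk]:
    """Split on headings, then cap each piece at max_chars characters."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines such as "# comment" inside a code fence are not headings.
        if not in_fence and HEADING.match(line):
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)

    chunks: list[Chunk] = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        # An empty body yields no pieces, so heading-only sections are skipped.
        for start in range(0, len(body), max_chars):
            piece = body[start:start + max_chars]
            chunks.append(Chunk(
                chunk_id=f"chk:{artifact_id}:{len(chunks)}",  # assumed id layout
                artifact_id=artifact_id,
                heading=heading,
                text=piece,
                content_hash=hashlib.sha256(piece.encode("utf-8")).hexdigest(),
            ))
    return chunks
```

Position-based ids are deterministic for unchanged input, but they shift when a section is inserted earlier in the file. That is one reason an incremental run may need to fall back to a full re-chunk when headings move.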

### Module B: Noise/Relevance Filter

**Goal:** Filter out bureaucracy (formatting, linting) cheaply.

```ascii
[ Ingest Bucket ] (IngestChunk units from A)
|
v
[ B.1 Regex Filter ] (Reject *.css, lockfiles, tests/)
|
v
[ B.2 LLM API ] (Gemini Flash / GPT-4o-mini)
Prompt: "Is this security knowledge? JSON Bool."
|
+---(No)---> [ Discard ]
|
+---(Yes)--> [ Knowledge Queue ]
```

* Difficulty: ⭐ (Low / Entry Level)
* Vibe Coding Potential: High. You can "vibe code" the prompts. Tweak the system prompt until it "feels right."
* Tech Stack: Python (langchain or raw API calls), Managed LLM APIs.
* MVP Logic: Regex list first (free), then Managed LLM API (cheap).
* Pre-Code Experiment (Do This First)
  * Human Benchmark:
    * Build a set of 100 **`IngestChunk`-sized text samples** (same shape Module B receives from A) drawn from OWASP repos and feeds. Include admin/formatting noise and real security prose.
    * Manually tag each sample: Relevant (Security Info) vs Noise (Typos, Admin, Formatting).
    * Run these 100 items through your proposed LLM Prompt on `chunk.text`.
    * Success Criteria: The LLM must match your tags >97% of the time. If it flags "Updated Code of Conduct" as "Security Knowledge," your prompt failed.
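
The free B.1 regex stage could look like the sketch below. The patterns are drawn from the examples in this RFC (`*.css`, lockfiles, `tests/`, `CNAME`, `_config.yml`); the `passes_regex_filter` name and the idea of matching on the chunk's source path are assumptions for illustration, and the real exclusion list should come out of the Pre-Code Experiment.

```python
import re

# Illustrative exclusion list; extend it from the "junk" survey of OWASP repos.
EXCLUDE_PATTERNS = [
    r"\.css$",
    r"(^|/)package-lock\.json$",
    r"(^|/)tests?/",
    r"(^|/)CNAME$",
    r"(^|/)_config\.yml$",
]
_EXCLUDE = re.compile("|".join(f"(?:{p})" for p in EXCLUDE_PATTERNS))


def passes_regex_filter(path: str) -> bool:
    """Return True when the path matches none of the exclusion patterns."""
    return _EXCLUDE.search(path) is None
```

Only paths that pass this check would be sent on to the paid B.2 LLM call.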

### Module C: The Librarian (The "Smart" Part)

**Goal:** Accurately map text to CRE nodes (handling the "Negation Problem") and detect updates to existing content.

```ascii
[ Knowledge Queue ]
|
v
[ C.1 Initial Retrieval ] (Vector Search / Pgvector)
-> "Get top 20 candidates"
|
v
[ C.2 The Cross-Encoder ] (Local Re-Ranking)
-> Model: ms-marco-MiniLM-L-6-v2
-> "Compare Input vs Candidate. Output Score."
|
v
[ C.3 Update Detection ] (New Logic)
-> Check if content is an update to existing content.
-> Implement security gates to detect adversarial updates or contradictions to previous content.
|
v
[ C.4 Threshold Check ]
-> Score > 0.8? Link to CRE.
-> Score <= 0.8? Flag for Human Review.
```

* Difficulty: ⭐⭐⭐ (Hard)
* Vibe Coding Potential: Medium. Prompts are vibe-based, but Vector Search requires strict math/logic.
* Tech Stack: sentence-transformers (HuggingFace), pgvector, Python.
* Prerequisites: Understanding of Embeddings, Bi-Encoders vs Cross-Encoders.
* MVP Logic: Retrieve top 20 with Cosine Similarity, Re-rank top 5 with Cross-Encoder, and implement update detection.
* Pre-Code Experiment (Do This First)
  * ASVS Re-Classify Challenge:
    * Select 50 random ASVS requirements (e.g., "Verify password complexity...").
    * Strip their metadata so you only have text.
    * Feed them into a basic Vector Search (Cosine Similarity).
    * Check: Does it map to the correct CRE node?
    * Compare: Now run them through a Cross-Encoder.
    * Success Criteria: The Cross-Encoder must show a 20% accuracy improvement over basic Cosine Similarity, specifically for "Negative" requirements (e.g., "Do NOT use MD5").
    * You can use the existing CRE database as ground truth. You can repeat this with WSTG and NIST items too.

* Bonus / Pro-Mode: Hybrid Search
Don't rely just on vectors. Use Hybrid Search (Vector + Keyword/BM25).
Why: Vectors are bad at exact keyword matches (e.g., specific CVE IDs).
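
The C.1 to C.4 flow (retrieve by cosine similarity, re-score, then apply the threshold) can be sketched in plain Python. This is a toy under stated assumptions: `top_k` and `route` are hypothetical names, the in-memory dict stands in for pgvector, and `score_pair` is an injected callable standing in for the cross-encoder. The sketch assumes scores are already mapped to the 0 to 1 range; depending on the library and model configuration, a cross-encoder may return raw logits, in which case the 0.8 threshold would need recalibrating.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          candidates: dict[str, Sequence[float]],
          k: int = 20) -> list[str]:
    """C.1: return the ids of the k candidates closest to the query vector."""
    ranked = sorted(candidates,
                    key=lambda cid: cosine(query_vec, candidates[cid]),
                    reverse=True)
    return ranked[:k]


def route(text: str,
          candidate_ids: Sequence[str],
          cre_texts: dict[str, str],
          score_pair: Callable[[str, str], float],
          threshold: float = 0.8) -> tuple[str, Optional[str], float]:
    """C.2 + C.4: re-score each candidate pair, then link or send to review."""
    scored = sorted(((score_pair(text, cre_texts[cid]), cid) for cid in candidate_ids),
                    reverse=True)
    if not scored:
        return ("review_required", None, 0.0)
    best_score, best_id = scored[0]
    # Assumption: a score exactly at the threshold goes to human review.
    status = "linked" if best_score > threshold else "review_required"
    return (status, best_id, best_score)
```

Passing the scorer in as a callable keeps the routing logic testable without loading a model, which fits the rule that experiments come before production code.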

### Module D: HITL & Logging
Collaborator


Please make the workflow more clear. thanks

Contributor

@PRAteek-singHWY PRAteek-singHWY Feb 24, 2026


Thank you @robvanderveer that makes a lot of sense.
I’ll rename this to OWASP Agent and adjust the introduction to focus first on the problem and the promise it delivers, before going into the implementation details.

I’ll also rework the workflow section to make the end-to-end flow clearer and more explicit, especially around module responsibilities and how data moves between ingestion, hybrid retrieval, semantic reasoning, human validation, and the master database.
I’ll iterate on the document accordingly.


**Goal:** Simple human oversight without db bloat.

```ascii
[ Flagged for Review ] ---> [ D.1 Admin UI ] ---> [ Maintainer ]
|
v
[ S3 / Blob ]
(Appends to corrections.jsonl)
```

* Difficulty: ⭐ (Low)
* Vibe Coding Potential: High. Standard CRUD web app. Ideal for junior devs or frontend contributors.
* Tech Stack: Flask/React, S3/MinIO.
* MVP Logic: Simple Admin UI. Logs corrections to a JSONL file.
* Pre-Code Experiment (Do This First)
  * The "Click-Speed" Prototype: Draw a wireframe on paper or build a 10-line HTML prototype.
    * Test: Can a user review, approve/reject, and save a correction in under 3 seconds per item?
    * Goal: If the UI requires 5 clicks to approve one item, volunteers will quit. Optimize for "Tinder-swipe" speed (Keybind 'y' for yes, 'n' for no).

* Bonus / Pro-Mode: Loss Warehousing
Capture the "Loss Event" (Input + Wrong Prediction + Correct Label) in a structured format.
Why: Allows future researchers to "Retrain on Loss."
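
Appending a decision to `corrections.jsonl` might look like the sketch below. The field names (`predicted_cre`, `correct_cre`, `approved`, `reviewer`, `decided_at`) and the `record_decision` helper are hypothetical; the actual `HumanDecision` schema is defined in `docs/owasp-graph/apis/`. A local file stands in for the S3 / Blob target.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_decision(path: Path, chunk_id: str, predicted_cre: str,
                    correct_cre: str, reviewer: str) -> dict:
    """Append one human decision as a single JSON line and return it."""
    decision = {
        "schema_version": "0.2.0",
        "chunk_id": chunk_id,
        "predicted_cre": predicted_cre,   # what Module C proposed
        "correct_cre": correct_cre,       # what the maintainer chose
        "approved": predicted_cre == correct_cre,
        "reviewer": reviewer,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(decision, ensure_ascii=False) + "\n")
    return decision
```

Each line carries the input reference, the wrong prediction, and the correct label together, which is the shape the "Loss Warehousing" bonus asks for. Many object stores do not support in-place appends, so a deployed version may need to write one object per decision or batch lines before upload.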

## 2.5 Pipeline contracts (normative detail)

Module boundaries above are summarized here; **JSON schemas and field tables** live in `docs/owasp-graph/apis/` (`schema_version` `0.2.0`).

| Stage | Primary types | Transport (MVP) |
| --- | --- | --- |
| A → B | `IngestChunkRecord` (wraps `IngestChunk`) | `ingest-chunks.jsonl` |
| B → C | `KnowledgeItem` | `knowledge-queue/` JSONL |
| C → D | `ReviewItem` | `review-queue/` JSONL |
| C → OpenCRE | `LinkProposal` | existing ingest path |
| D → feedback | `HumanDecision` | `corrections.jsonl` |

## 3. Agent-Ready CI Pipeline (New way of code review)

Since we expect AI-generated PRs, we cannot rely solely on human code review. We will build the following:

* Strict Linting: enforced automatically. No style arguments.
* Regression Eval: PRs come with tests. PRs with test coverage under 70% are rejected.
* Dataset tests for Modules B & C:
  * We maintain a `golden_dataset.json` (100 samples of known-good inputs/outputs).
  * Any PR touching Module B or C runs this dataset.
  * Failure Condition: If accuracy drops by >2% compared to main, the PR is blocked automatically.
* Mandatory Tests: If code coverage drops, the PR is rejected.
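
The golden-dataset gate could be expressed as a small check like the one below. It is a sketch under assumptions: `accuracy` and `gate` are hypothetical helper names, and ">2%" is read here as two percentage points of absolute accuracy rather than a relative drop.

```python
from typing import Sequence


def accuracy(predictions: Sequence[object], expected: Sequence[object]) -> float:
    """Fraction of golden samples where the prediction equals the expected output."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equal in length")
    return sum(p == e for p, e in zip(predictions, expected)) / len(expected)


def gate(main_accuracy: float, pr_accuracy: float, max_drop: float = 0.02) -> bool:
    """Return True when the PR may proceed, False when it should be blocked."""
    # Small epsilon so a drop of exactly max_drop is not blocked by float rounding.
    return (main_accuracy - pr_accuracy) <= max_drop + 1e-9
```

A CI job would run the dataset once against `main` and once against the PR branch, then fail the build when `gate` returns `False`.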


## 4. Implementation Roadmap

Phase 1: Foundation (Week 1-2)

[ ] Run Experiments

[ ] Set up Ingest -> Process -> Store interfaces.

[ ] Build the new CI Pipeline & Golden Dataset. Note: We build the tests before the code.

Phase 2: Ingestion & Filtering (Week 3-4)

[ ] Implement Module A (registry, backfill, connectors, normalize/chunk, GitHub Action Cron).

[ ] Implement Module B (LLM Client; consumes IngestChunk).

Phase 3: Intelligence (Week 5-6)

[ ] Implement Module C (sentence-transformers integration).

[ ] Tune the Cross-Encoder threshold against the Golden Dataset.

Phase 4: Dashboard (Week 7)

[ ] Build simple Admin UI for Module D.

## 5. Call for Contributors

We are looking for distributed teams to own these modules.

Backend Engineers: Owner for Module A. Needs Python, connector APIs, HTML/Markdown parsing, and chunking experience.

Prompt/AI Engineers: Owner for Module B. Needs experience with prompting.

Data Scientists: Owner for Module C. Needs understanding of Bi-Encoders vs Cross-Encoders.

Fullstack Devs: Owner for Module D. Simple Flask/React UI work.

To Contribute: Please reply to this RFC with the Module you wish to claim and provide a link to your working experiments.
If you are using AI tools (Cursor/Windsurf), please confirm you have read Section 3.
21 changes: 21 additions & 0 deletions docs/owasp-graph/README.md
@@ -0,0 +1,21 @@
# OWASP Graph (Project OIE)

Living-knowledge ETL for OWASP content: harvest → filter → link → human review.

| Doc | Purpose |
| --- | --- |
| [RFC: Pane of Glass](../designs/owasp-pane-of-glass.md) | Motivation, four modules, experiments |
| [APIs & handovers](./apis/README.md) | Contracts between modules |

## Pipeline (logical)

```
Module A Module B Module C Module D
(Harvest+chunk) → (Filter) → (Librarian) → (HITL)
Ingest Bucket Knowledge Link proposals Human
(chunks) Queue + review flags corrections
```

Module A **backfills** new artifacts (full read + chunk) and **incrementally** updates known ones. Each **chunk** is the unit Module B classifies and Module C links to CREs.

Handovers are versioned JSON (MVP: object storage / JSONL).
67 changes: 67 additions & 0 deletions docs/owasp-graph/apis/README.md
@@ -0,0 +1,67 @@
# OIE module handover APIs

Draft contracts for data passed between the four OIE modules. These are **internal pipeline APIs**, not the public OpenCRE REST API. Shapes follow the [RFC](../../designs/owasp-pane-of-glass.md) module boundaries.

## Conventions

| Rule | Detail |
| --- | --- |
| **Format** | JSON documents; batches may be JSONL (one envelope per line) |
| **Versioning** | Every envelope includes `schema_version` (semver string, e.g. `0.2.0`) |
| **IDs** | `artifact_id` = stable resource; `chunk_id` = stable text unit for B/C; `event_id` = one harvest pass |
| **Correlation** | `pipeline_run_id` groups all items from one scheduled run |
| **Time** | ISO-8601 UTC |
| **Text** | Module A emits **normalized, chunked** plain text |
| **Idempotency** | Consumers tolerate duplicate `chunk_id`; registry + content hash prevent re-emit |
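
The idempotency rule (registry plus content hash prevent re-emit) can be illustrated with an in-memory registry. This is a sketch only: `ArtifactRegistry` and `classify` are hypothetical names, the `unchanged` and `updated` labels are assumptions (only `discovered` appears in the RFC), and a real registry would persist its state between runs.

```python
import hashlib


class ArtifactRegistry:
    """Remembers the last content hash seen for each artifact_id."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def classify(self, artifact_id: str, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        previous = self._seen.get(artifact_id)
        self._seen[artifact_id] = digest
        if previous is None:
            return "discovered"   # first sighting: backfill the whole artifact
        if previous == digest:
            return "unchanged"    # same bytes: emit nothing
        return "updated"          # content changed: run the incremental path
```

With this in place, a second harvest pass over unedited content returns `unchanged`, so no duplicate chunks are emitted.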

## Boundaries

| From → To | Bucket / queue name (MVP) | Envelope type | Doc |
| --- | --- | --- | --- |
| **A → B** | `ingest-bucket/` | `IngestChunkRecord` (JSONL) or `IngestBatch` | [module-a-harvesting.md](./module-a-harvesting.md) |
| **B → C** | `knowledge-queue/` | `KnowledgeItem` (per `chunk_id`) | [module-b-filter.md](./module-b-filter.md) |
| **C → D** | `review-queue/` | `ReviewItem` | [module-c-librarian.md](./module-c-librarian.md) |
| **C → OpenCRE** | (existing ingest path) | `LinkProposal` | [module-c-librarian.md](./module-c-librarian.md) |
| **D → (feedback)** | `corrections.jsonl` | `HumanDecision` | [module-d-hitl.md](./module-d-hitl.md) |

## MVP transport (not normative)

1. **Files** — `s3://…/oie/{pipeline_run_id}/ingest-chunks.jsonl`
2. **Pull API** — `GET /internal/oie/v0/ingest/runs/{id}/chunks` (future)
3. **Queue** — topic per boundary with the same JSON body

## Shared types

JSON Schemas live in [schemas/](./schemas/).

```json
{
  "schema_version": "0.2.0",
  "chunk_id": "chk:art:OWASP/ASVS:…:0",
  "artifact_id": "art:OWASP/ASVS:4.0/en/0x12-V3-Authentication.md",
  "event_id": "evt_20260201_00001",
  "pipeline_run_id": "20260201T020000Z",
  "source": { "type": "github", "repo": "OWASP/ASVS", "commit_sha": "abc123", "committed_at": "2026-02-01T01:00:00Z" },
  "locator": { "kind": "repo_path", "id": "4.0/en/0x12-V3-Authentication.md", "path": "4.0/en/0x12-V3-Authentication.md" }
}
```
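
A consumer might sanity-check these shared fields before processing an envelope. The sketch below is illustrative: `envelope_errors` is a hypothetical helper, treating every shared field as required is an assumption, and the semver pattern is simplified (no pre-release or build suffixes). The JSON Schemas in `schemas/` remain the authoritative definition.

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # simplified: MAJOR.MINOR.PATCH only
REQUIRED_FIELDS = (
    "schema_version", "chunk_id", "artifact_id",
    "event_id", "pipeline_run_id", "source", "locator",
)


def envelope_errors(envelope: dict) -> list[str]:
    """Return a list of problems; an empty list means the envelope looks usable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in envelope]
    version = envelope.get("schema_version")
    if version is not None and not (isinstance(version, str) and SEMVER.match(version)):
        errors.append("schema_version is not a semver string")
    return errors
```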

**Deprecated 0.1.0:** `change_id` + `RawChangeItem` — use `chunk_id` + `IngestChunkRecord`.

## Status codes (logical, all modules)

| Code | Meaning |
| --- | --- |
| `accepted` | Passed this stage; forwarded |
| `rejected` | Dropped intentionally |
| `deferred` | Held for retry or human |
| `linked` | Module C auto-linked above threshold |
| `review_required` | Module C below threshold or policy flag |
| `corrected` | Module D human override recorded |

## Module docs

- [Module A — Harvest & chunk](./module-a-harvesting.md)
- [Module B — Filter](./module-b-filter.md)
- [Module C — Librarian](./module-c-librarian.md)
- [Module D — HITL](./module-d-hitl.md)