GSoC 2026 Module B — Week 1: input contract, schemas, and benchmark dataset#913

Open: manshusainishab wants to merge 6 commits into OWASP:main from manshusainishab:module_b_w1

Conversation

@manshusainishab
Contributor

Summary

  • Establishes the Module A → Module B input contract (Pydantic v2 schemas matching A's actual emission shape) + B-side compute_content_hash() (Module A does not emit a hash, so B computes its own for knowledge_queue dedup).
  • Adds a stand-in OWASP-commit harvester that produces records in Module A's exact shape — lets B iterate on its classifier independently of Module A's delivery timeline, and the harvester itself is part of the deliverable for reproducibility.
  • Ships a 100-record hand-labeled benchmark (labeled_data.json) for prompt iteration and Stage-2 evaluation.

Part of the GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B (Noise/Relevance Filter).

What this PR adds

| File | Purpose |
| --- | --- |
| `application/utils/noise_filter/__init__.py` | Package marker + contract docstring |
| `application/utils/noise_filter/schemas.py` | Pydantic v2 models: ChangeRecord (top), Source discriminated union (GithubSource/RssSource), Span, Locator, + internal ClassifyResult/QueuePayload |
| `application/utils/noise_filter/hashing.py` | normalize_text() (NFC + line endings + whitespace + code-fence preservation, idempotent) + compute_content_hash() (SHA-256 of normalized text) |
| `application/tests/noise_filter/schemas_test.py` | 22 unittest cases (round-trip, discriminated union, hash determinism, normalization, fence preservation) |
| `application/tests/noise_filter/fixtures/module_a_mock.jsonl` | Module A's canonical 20-record mock as JSONL |
| `application/tests/noise_filter/fixtures/candidate_commits.json` | 100-record harvest of OWASP commits in Module A's shape; 25 per repo (WSTG/ASVS/CheatSheetSeries/SAMM); all Pydantic-valid |
| `application/tests/noise_filter/fixtures/labeled_data.json` | Same 100 records with KNOWLEDGE/NOISE/UNCERTAIN labels under the recall-first rule (55/40/5) |
| `docs/gsoc_2026_module_b/module_a_contract.schema.json` | JSON Schema generated from ChangeRecord.model_json_schema(); source of truth for cross-module CI validation |
| `scripts/build_labeled_dataset.py` | PyGithub-based harvester; fence-aware stack-based markdown chunker tracking heading_path + char/line offsets; idempotent; reproducible |
| `scripts/label_dataset.py` | Resumable interactive labeling TUI; atomic-writes per keystroke; embeds the recall-first definition |

Test plan

  • make test — full suite passes locally: 271 tests, 0 failures, 0 errors, 1 skip (was 249 before this PR; we added 22 new tests under application/tests/noise_filter/).
  • application/utils/noise_filter/schemas.py validates Module A's mock JSONL round-trip (20/20 records).
  • compute_content_hash is deterministic, idempotent, and correctly preserves whitespace inside code fences while collapsing it in prose.
  • All 100 records in candidate_commits.json and labeled_data.json pass ChangeRecord.model_validate().
  • scripts/build_labeled_dataset.py runs end-to-end with a GITHUB_TOKEN env var and regenerates candidate_commits.json deterministically.
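
The hash properties listed in the test plan (deterministic, idempotent, fence-preserving) can be illustrated with a simplified, stdlib-only sketch. This is not the code in hashing.py: it only recognizes triple-backtick fences, and the regexes and helper names here are assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

FENCE = "`" * 3  # triple-backtick marker, built this way to keep the example readable

# One capture group, so re.split() returns fenced blocks at odd indices.
_FENCE_RE = re.compile("(" + FENCE + ".*?" + FENCE + ")", re.DOTALL)
_WS_RE = re.compile(r"[ \t]+")


def normalize_text(text: str) -> str:
    """Simplified normalizer: NFC, LF line endings, whitespace collapsed in prose only."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for i, part in enumerate(_FENCE_RE.split(text)):
        if i % 2 == 1:
            out.append(part)  # fenced block: keep internal whitespace verbatim
        else:
            lines = [_WS_RE.sub(" ", ln).rstrip() for ln in part.split("\n")]
            out.append("\n".join(lines))
    return "".join(out).strip()


def compute_content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
```

Two inputs that differ only in Unicode composition, line endings, or prose spacing hash to the same value under this sketch, while spacing inside a fence still changes the hash.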

Notes for reviewers

  • Per-week PR strategy: this is module_b_w1; future weeks will land as module_b_w2, etc. Per maintainer discussion.
  • Out of this PR (intentional):
    • Stage 1 regex filter (regex_filter.py + noise_patterns.yaml) — Week 2 deliverable. Empirical findings from this Week 1 labeled set (e.g. SAMM's Website/**, Supporting Resources/meetings/** consistently produced NOISE) will inform noise_patterns.yaml.
    • Stage 2 LLM classifier wrapping PromptHandler (LiteLLM-backed) — Week 3 deliverable.
    • KnowledgeQueueItem SQLAlchemy model + Alembic migration — Week 5 deliverable.
  • Module A coordination: the input contract spec (module_a_contract.md v0.3) is shared directly with the Module A contributor via Slack #project-opencre rather than committed (project's .gitignore excludes *.md). This PR includes the machine-readable module_a_contract.schema.json artifact for CI validation.
  • Labeling rule: records were labeled under a recall-first rule agreed with @northdpole: KNOWLEDGE for any chunk with a security signal; NOISE only for pure organizational/non-security content. Rationale: NOISE rows are dropped before Module C, so a misclassified security chunk is lost forever, whereas false positives at Stage 2 only waste downstream compute because Module C re-judges them.
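
The recall-first rationale above suggests scoring a classifier against the benchmark mainly by recall on KNOWLEDGE. A minimal sketch of that metric follows; the function name is hypothetical, and skipping UNCERTAIN gold rows is an assumption, not something the PR specifies.

```python
from typing import Iterable, Tuple


def knowledge_recall_precision(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """pairs are (gold_label, predicted_label); returns (recall, precision) for KNOWLEDGE."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        if gold == "UNCERTAIN":
            continue  # assumption: ambiguous gold rows are left out of the score
        if gold == "KNOWLEDGE" and pred == "KNOWLEDGE":
            tp += 1
        elif gold == "KNOWLEDGE":
            fn += 1  # security chunk not passed on: the costly, unrecoverable error
        elif pred == "KNOWLEDGE":
            fp += 1  # noise passed downstream: wasted compute only
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

Under the rule as stated, a change that raises recall at some cost in precision would usually be the preferred trade.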

…contract

Establishes the data contract Module B consumes from Module A. ChangeRecord
is a Pydantic v2 model matching A's actual emission shape: nested source
(discriminated union on type for github/rss), span (chunk position +
heading_path + char/line offsets), and locator (addressing scheme). Internal
models ClassifyResult and QueuePayload prep for later stages.
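
To show what discrimination on `type` means for the nested source, here is a stdlib-only sketch using dataclasses. The PR itself uses Pydantic v2 models; the field names below (`repo`, `commit_sha`, `feed_url`) and the `parse_source` helper are hypothetical and not taken from schemas.py.

```python
from dataclasses import dataclass, fields
from typing import Any, Literal, Mapping, Union


@dataclass(frozen=True)
class GithubSource:
    repo: str        # hypothetical field
    commit_sha: str  # hypothetical field
    type: Literal["github"] = "github"


@dataclass(frozen=True)
class RssSource:
    feed_url: str    # hypothetical field
    type: Literal["rss"] = "rss"


Source = Union[GithubSource, RssSource]

# The discriminator value selects the concrete class.
_SOURCE_BY_TAG = {"github": GithubSource, "rss": RssSource}


def parse_source(raw: Mapping[str, Any]) -> Source:
    """Pick the source class from raw["type"], ignoring unknown extra keys."""
    tag = raw.get("type")
    cls = _SOURCE_BY_TAG.get(tag)
    if cls is None:
        raise ValueError(f"unknown source type: {tag!r}")
    known = {f.name for f in fields(cls)}
    kwargs = {k: v for k, v in raw.items() if k in known and k != "type"}
    return cls(**kwargs)
```

Dropping unknown keys mirrors the idea of forward compatibility: a newer producer can add fields without breaking an older consumer.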

hashing.py provides normalize_text + compute_content_hash since Module A
does not emit content_hash; B computes its own (SHA-256 of normalized text)
for use as the knowledge_queue dedup key.

22 unittest cases cover the round-trip, the discriminated union, hash
determinism, normalization rules, code-fence preservation, and idempotency.
Full make test: 271 passing, no regressions.

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
…ifact

module_a_mock.jsonl: Module A's canonical 20-record mock shared 2026-05-29,
saved as JSONL (one record per line per the contract). Becomes a permanent
integration-test fixture for B's parser and a reference shape for the
Module A contributor.

module_a_contract.schema.json: JSON Schema generated from B's Pydantic
ChangeRecord model via model_json_schema(). 246 lines covering ChangeRecord
and its four nested types (GithubSource, RssSource, Span, Locator).
Source of truth for cross-module CI validation.
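
A generated schema artifact is easier to validate across modules when it declares its own dialect. The sketch below shows one possible post-processing step, assuming the schema is already available as a plain dict (in the PR it comes from model_json_schema()); `pin_dialect` and `render_schema` are hypothetical helper names.

```python
import json
from typing import Any, Dict

DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"


def pin_dialect(schema: Dict[str, Any], dialect: str = DRAFT_2020_12) -> Dict[str, Any]:
    """Return a copy of the schema with "$schema" as its first key."""
    pinned: Dict[str, Any] = {"$schema": dialect}
    pinned.update({k: v for k, v in schema.items() if k != "$schema"})
    return pinned


def render_schema(schema: Dict[str, Any]) -> str:
    """Serialize with stable formatting so regenerating the file gives a clean diff."""
    return json.dumps(pin_dialect(schema), indent=2, ensure_ascii=False) + "\n"
```

Running the same step on every regeneration keeps the declaration from silently disappearing.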

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
build_labeled_dataset.py: PyGithub-based harvester that acts as Module A's
stand-in for producing benchmark data. Fetches recent commits from 4 OWASP
repos (WSTG, ASVS, CheatSheetSeries, SAMM), applies the contract's
normalization rules, splits into chunks at markdown heading boundaries
with a fence-aware stack-based walker that tracks heading_path + char/line
offsets, and emits records in Module A's actual nested shape. Pluggable
via GITHUB_TOKEN env var. Reproducible: python scripts/build_labeled_dataset.py
regenerates the candidate set.
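
The fence-aware, stack-based walk described above can be sketched roughly as follows. This is a simplified illustration, not the harvester's code: it tracks line offsets only (no char offsets), handles only triple-backtick fences, and the `Chunk` dataclass is a hypothetical stand-in for the real record shape.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

FENCE = "`" * 3  # triple-backtick marker
_HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: Tuple[str, ...]
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_markdown(text: str) -> List[Chunk]:
    chunks: List[Chunk] = []
    stack: List[Tuple[int, str]] = []  # (level, title) of currently open headings
    path: Tuple[str, ...] = ()
    buf: List[str] = []
    buf_start = 1
    in_fence = False

    def flush(end_line: int) -> None:
        # Emit the buffered lines as one chunk unless they are all blank.
        if any(ln.strip() for ln in buf):
            chunks.append(Chunk(path, buf_start, end_line, "\n".join(buf)))

    for lineno, line in enumerate(text.split("\n"), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A "#" line inside a fence is code, not a heading.
        m = None if in_fence else _HEADING_RE.match(line)
        if m:
            flush(lineno - 1)
            level = len(m.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # close headings at the same or deeper level
            stack.append((level, m.group(2)))
            path = tuple(title for _, title in stack)
            buf, buf_start = [line], lineno
        else:
            buf.append(line)
    flush(lineno)
    return chunks
```

Each chunk starts at its heading line, so the heading_path always ends with the heading that opened the chunk.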

label_dataset.py: resumable interactive TUI for manual classification.
Atomic-writes labeled_data.json after every keystroke; lookup by chunk_id
for resume. Embeds the recall-first definition (agreed with maintainer
2026-06-01) so labelers see the rule front-of-mind: KNOWLEDGE for any
chunk with security signal, NOISE only for pure organizational content.
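
Writing the file after every keystroke is only safe if a crash mid-write cannot leave a truncated file behind. A common way to get that property, shown here as a sketch and not as the script's actual code, is to write to a temporary file in the same directory and rename it over the target; `atomic_write_json` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(path: Path, data: Any) -> None:
    """Write JSON to a temp file next to the target, then rename it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes reach disk before the rename
        os.replace(tmp_name, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        # The previous file is untouched; only the partial temp file is removed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader of the target path sees either the old complete file or the new complete file, never a half-written one.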

candidate_commits.json: 100 records, 25 per repo, all Pydantic-valid
against ChangeRecord. 90/100 have non-empty heading_path; 10 multi-chunk
artifacts captured.

labeled_data.json: 100/100 labeled by hand under the recall-first rule.
Distribution 55 KNOWLEDGE / 40 NOISE / 5 UNCERTAIN. Per-repo skew is
visible: CheatSheetSeries 92% K, SAMM 0% K (the SAMM commits sampled
landed entirely on Website/Sponsorship/meetings paths -- empirical input
for Week 2's noise_patterns.yaml).

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Warning

Review limit reached

@manshusainishab, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 48 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8a7698ad-0682-4445-a26f-cdf3c77669a0

📥 Commits

Reviewing files that changed from the base of the PR and between 8cf4dd1 and 19c0fc3.

📒 Files selected for processing (6)
  • application/tests/noise_filter/schemas_test.py
  • application/utils/noise_filter/hashing.py
  • application/utils/noise_filter/schemas.py
  • docs/gsoc_2026_module_b/module_a_contract.schema.json
  • scripts/build_labeled_dataset.py
  • scripts/label_dataset.py

Walkthrough

This PR implements the Week 1 foundation of Module B (noise filter) in the OpenCRE scraper pipeline: data contracts in Pydantic and JSON Schema, content-hashing utilities, validation tests, a GitHub dataset harvesting script with heading-aware markdown chunking, and an interactive record-labeling CLI tool.

Changes

Module B Noise Filter Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Data contracts and Pydantic schemas: `application/utils/noise_filter/__init__.py`, `application/utils/noise_filter/schemas.py`, `docs/gsoc_2026_module_b/module_a_contract.schema.json` | ChangeRecord and supporting models (GithubSource, RssSource, Span, Locator, ClassifyResult, QueuePayload) define Module A→B and B→C data contracts with discriminated source union, forward-compatible extra-field handling, and matching JSON Schema. |
| Content hashing and text normalization: `application/utils/noise_filter/hashing.py` | compute_content_hash and normalize_text provide deterministic SHA-256 hashing with fence-aware whitespace handling (NFC, line-ending normalization, prose-only whitespace collapsing, code-fence/pre-tag preservation, blank-line stripping). |
| Schema validation and hashing tests: `application/tests/noise_filter/schemas_test.py`, `application/tests/noise_filter/fixtures/module_a_mock.jsonl` | ModuleAMockTests, ChangeRecordTests, SourceUnionTests, ContentHashTests validate parsing, field constraints, forward compatibility, normalization idempotence, and hash determinism using a 20-record JSONL fixture. |
| GitHub dataset harvesting script: `scripts/build_labeled_dataset.py` | Fetches OWASP repository files from commit history via GitHub API, normalizes and chunks content into position-aware segments (tracking heading paths, character/line offsets), builds ChangeRecord objects, deduplicates by chunk_id, and atomically persists candidates to JSON. |
| Interactive record labeling CLI: `scripts/label_dataset.py` | Loads candidate records, presents each pending record in a TUI flow with key bindings (k/n/u for label, s to skip, q to quit), captures optional labeling rationale, persists labels with timestamp and metadata, and tracks progress across labeled/pending/total. |

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.26% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main PR objective: establishing Module B's input contract, schemas, and benchmark dataset for GSoC 2026 Week 1. |
| Description check | ✅ Passed | The description is detailed and directly related to the changeset, covering the Module A contract establishment, schema implementation, hashing utilities, tests, fixtures, and tooling scripts. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (3)
scripts/build_labeled_dataset.py (1)

92-122: ⚡ Quick win

Consider importing from hashing.py or adding a sync test.

The duplicated normalization logic creates drift risk — if application/utils/noise_filter/hashing.py is updated, this script's normalization could diverge, causing chunk_id mismatches between harvested data and Module B's deduplication. The upstream tests validate hashing.py, not this copy.

Options:

  1. Add sys.path manipulation to import normalize_text from the application module
  2. Add a unit test asserting both implementations produce identical output on a sample corpus
  3. At minimum, add a comment noting which commit of hashing.py this was copied from
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_labeled_dataset.py` around lines 92 - 122, This script
duplicates normalize_text (and helper functions _process_prose, _process_fence)
from application/utils/noise_filter/hashing.py which risks divergence; either
import the implementation from that module (e.g., add sys.path manipulation so
you can from application.utils.noise_filter.hashing import normalize_text and
remove the local copies) or add a unit test that compares this script's
normalize_text output against the one in hashing.py across representative inputs
to prevent drift; if importing is infeasible, at minimum add a clear comment
documenting the exact commit/sha of hashing.py this was copied from and a TODO
to remove duplication.
scripts/label_dataset.py (1)

187-189: 💤 Low value

Clarify skip behavior in docstring.

The docstring (line 16) says s = SKIP (drop this record from the dataset entirely), but skipped records are not persisted and will reappear on the next run. This may be intentional (allows reconsideration), but the wording implies permanent removal.

Suggested docstring update
-    s = SKIP        (drop this record from the dataset entirely)
+    s = SKIP        (skip for now; record will reappear on next run)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/label_dataset.py` around lines 187 - 189, Update the module/function
docstring that currently states "s = SKIP (drop this record from the dataset
entirely)" to accurately reflect the behavior in the interactive loop where ans
== "s" only skips persisting the item for the current run (it is not saved and
will reappear on the next run); reference the interactive branch handling ans
(the if ans == "s" branch) and explicitly state that this skip is temporary and
does not permanently remove the record, and note how to permanently remove
records if there is an alternative action or manual step.
application/utils/noise_filter/hashing.py (1)

32-35: ⚡ Quick win

Guard against the duplicated normalizer drifting.

application/utils/noise_filter/hashing.py and scripts/build_labeled_dataset.py currently have the v0.2 normalization block (_FENCE_RE/_PROSE_WS_RE, normalize_text, _process_prose, _process_fence) byte-for-byte identical—but any future edit to only one copy would desync the content_hash dedup key. Add a small CI parity test that normalizes a few fixed fixtures through both implementations and asserts equality so drift fails fast.
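
A parity test of the kind suggested here could look roughly like this. The two functions below are stand-ins defined inline so the example runs on its own; in the repository they would be the normalize_text from hashing.py and the copy in the harvester script, and how the script copy gets imported is left open. The fixture names and contents are made up for illustration.

```python
import re
import unittest


def normalize_a(text: str) -> str:
    """Stand-in for the normalizer in hashing.py."""
    return re.sub(r"[ \t]+", " ", text.replace("\r\n", "\n")).strip()


def normalize_b(text: str) -> str:
    """Stand-in for the duplicated normalizer in the harvester script."""
    return re.sub(r"[ \t]+", " ", text.replace("\r\n", "\n")).strip()


# Small fixed corpus; each entry is (name -> raw input).
FIXTURES = {
    "crlf": "a\r\nb",
    "spaces": "a   b\t\tc",
    "empty": "",
}


class NormalizerParityTest(unittest.TestCase):
    def test_outputs_match(self) -> None:
        for name, raw in FIXTURES.items():
            # subTest reports every drifting fixture, not just the first one.
            with self.subTest(fixture=name):
                self.assertEqual(
                    normalize_a(raw),
                    normalize_b(raw),
                    f"normalizer drift on fixture {name!r}",
                )
```

Because the assertion message carries the fixture name, a CI failure points directly at the input that exposed the drift.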

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/noise_filter/hashing.py` around lines 32 - 35, There are
two identical normalization implementations (_FENCE_RE, _PROSE_WS_RE,
normalize_text, _process_prose, _process_fence) in different places that must
stay in sync; add a CI parity test that runs a small set of fixed fixtures
through both implementations and asserts their outputs are byte-for-byte equal
so any future drift fails. Implement the test to import both normalization
functions (the one in hashing.py and the other implementation used by
build_labeled_dataset.py), feed each fixture string to both
normalize_text/_process_prose/_process_fence entry points as appropriate, and
assert equality for each fixture; fail the test on mismatch and include the
fixture name in the assertion message for faster debugging. Ensure the test is
lightweight, added to the test suite, and runs in CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/utils/noise_filter/hashing.py`:
- Around line 93-96: The __all__ export list in hashing.py is not alphabetized
causing a Ruff RUF022 warning; reorder the entries in the __all__ list
alphabetically (e.g., "compute_content_hash" before "normalize_text") so the
list is sorted, leaving the rest of the file and the function names
(compute_content_hash, normalize_text) unchanged.

In `@application/utils/noise_filter/schemas.py`:
- Around line 146-155: The __all__ list in this module is not alphabetically
sorted (RUF022); reorder the exported names in isort-style alphabetical order so
the list reads: "ChangeRecord", "ClassifyResult", "GithubSource", "Locator",
"QueuePayload", "RssSource", "Source", "Span" (or follow your project's exact
alphabetical convention) and save the file; update the __all__ variable (the one
containing ChangeRecord, Source, GithubSource, RssSource, Span, Locator,
ClassifyResult, QueuePayload) to that sorted ordering to satisfy Ruff.

In `@docs/gsoc_2026_module_b/module_a_contract.schema.json`:
- Around line 1-2: The generated JSON Schema currently begins with "$defs" and
lacks a top-level "$schema" declaration; update the generation or add a
post-processing step that injects a top-level "$schema":
"https://json-schema.org/draft/2020-12/schema" (or the appropriate draft URI) so
the artifact explicitly pins the JSON Schema dialect, and ensure the generator
call (model_json_schema()) or its post-hook consistently writes this "$schema"
key to prevent drift on regeneration.

In `@scripts/build_labeled_dataset.py`:
- Around line 180-184: chunk_markdown only toggles in_fence for triple-backtick
fences but normalize_text/_FENCE_RE also handles <pre>...</pre>, so add tracking
for <pre> blocks inside chunk_markdown: introduce an in_pre boolean and update
it by testing each line with the existing/predefined regexes (e.g., _PRE_OPEN_RE
and _PRE_CLOSE_RE) when iterating lines in chunk_markdown, and change the
heading detection (the _HEADING_RE match) to only run when not (in_fence or
in_pre); update references to in_fence, add in_pre initialization, and ensure
_PRE_OPEN_RE/_PRE_CLOSE_RE are imported/defined where chunk_markdown runs.
- Around line 245-262: The offset advance unconditionally adds 2 for a "\n\n"
separator (cursor_char = end_char + 2 and cursor_line = end_line + 1) which is
wrong for hard-splits that have no separator; update the cursor advancement to
inspect the original chunk.text between the sub-chunk boundaries and compute the
actual separator length/newlines: map end_char back to the local index using
chunk.start_char_idx, read the substring after the sub-chunk to determine
sep_len and sep_newlines (0 for contiguous, 1 or 2 for "\n" or "\n\n"), then use
cursor_char = end_char + sep_len and cursor_line = end_line + sep_newlines so
offsets follow the real separator instead of always assuming "\n\n" (affects the
loop using out_texts, cursor_char, cursor_line, and chunk).

In `@scripts/label_dataset.py`:
- Around line 73-78: The load_labeled function may crash on a malformed
labeled_data.json; wrap the LABELED_PATH.read_text() + json.loads(...) in a
try/except that catches json.JSONDecodeError (and optionally
ValueError/Exception), and on error produce a clear, actionable message
including the LABELED_PATH and suggested remediation (e.g., delete or restore
the file), then either return an empty dict or exit with a user-friendly error;
update load_labeled to perform this defensive handling so callers of
load_labeled get a clear diagnostic instead of a traceback.
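
The defensive loading described in the last item can be sketched as below. The file location and the file shape (a JSON list of records, each with a `chunk_id`) are assumptions for illustration, as is the choice to exit with a message instead of returning an empty dict.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict

LABELED_PATH = Path("labeled_data.json")  # hypothetical location


def load_labeled(path: Path = LABELED_PATH) -> Dict[str, Any]:
    """Return labeled records keyed by chunk_id; an absent file means nothing labeled yet."""
    if not path.exists():
        return {}
    try:
        records = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # Exit with a readable hint instead of a raw traceback.
        sys.exit(
            f"{path} is not valid JSON ({exc.msg} at line {exc.lineno}). "
            "Restore it from version control or delete it to start over."
        )
    return {rec["chunk_id"]: rec for rec in records}
```

Exiting is the more conservative choice here: silently returning an empty dict could lead the tool to overwrite a damaged but recoverable file on the next save.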

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 29b2621a-4763-4c60-bf5d-170c0f66b3b2

📥 Commits

Reviewing files that changed from the base of the PR and between e93ce92 and 8cf4dd1.

📒 Files selected for processing (11)
  • application/tests/noise_filter/__init__.py
  • application/tests/noise_filter/fixtures/candidate_commits.json
  • application/tests/noise_filter/fixtures/labeled_data.json
  • application/tests/noise_filter/fixtures/module_a_mock.jsonl
  • application/tests/noise_filter/schemas_test.py
  • application/utils/noise_filter/__init__.py
  • application/utils/noise_filter/hashing.py
  • application/utils/noise_filter/schemas.py
  • docs/gsoc_2026_module_b/module_a_contract.schema.json
  • scripts/build_labeled_dataset.py
  • scripts/label_dataset.py

Comment thread application/utils/noise_filter/hashing.py
Comment thread application/utils/noise_filter/schemas.py
Comment thread docs/gsoc_2026_module_b/module_a_contract.schema.json
Comment thread scripts/build_labeled_dataset.py
Comment thread scripts/build_labeled_dataset.py
Comment thread scripts/label_dataset.py Outdated
Super-Linter (Black 24.4.2) flagged 4 files in the previous push.
Applied `black` (same pinned version) to bring them in line with the
repo's formatting standard. Cosmetic changes only: blank lines around
section-separator comments, one multi-line dict join. No behavior or
test changes -- `make test` remains 271 passing, 1 skip.
@manshusainishab
Contributor Author

@coderabbitai can you re-review after the new commit.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

@manshusainishab Sure! I'll kick off a fresh review of the PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

- Sort __all__ lists in hashing.py and schemas.py to satisfy
  Ruff RUF022.
- Declare JSON Schema dialect ($schema = draft 2020-12, which is
  what Pydantic v2 model_json_schema() emits) on the contract artifact.
- Wrap load_labeled() in scripts/label_dataset.py with try/except so a
  corrupted labeled_data.json prints an actionable hint instead of a
  raw JSONDecodeError stack trace.

Deferred to Week 2 (will be addressed when we touch the harvester):
- chunker should also track <pre> open/close, not just ``` fences
- _split_chunk_by_size cursor arithmetic assumes a \n\n separator even
  on hard-split sub-chunks

Tests: 271 passing, 1 skip (unchanged). Black: clean.