GSoC 2026 Module B — Week 1: input contract, schemas, and benchmark dataset#913
manshusainishab wants to merge 6 commits into
…contract
Establishes the data contract Module B consumes from Module A.
ChangeRecord is a Pydantic v2 model matching A's actual emission shape: nested source (discriminated union on type for github/rss), span (chunk position + heading_path + char/line offsets), and locator (addressing scheme). Internal models ClassifyResult and QueuePayload prep for later stages.
hashing.py provides normalize_text + compute_content_hash since Module A does not emit content_hash; B computes its own (SHA-256 of normalized text) for use as the knowledge_queue dedup key.
22 unittest cases cover the round-trip, the discriminated union, hash determinism, normalization rules, code-fence preservation, and idempotency. Full make test: 271 passing, no regressions.
Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
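The normalization and hashing described in this commit can be illustrated with a minimal sketch. It is not the code in `hashing.py`: the exact whitespace rules, the regexes, and the helper bodies below are assumptions reconstructed from the description (NFC, line endings, prose whitespace collapse, code-fence preservation, SHA-256 of the normalized text).

```python
import hashlib
import re
import unicodedata

# Assumed pattern: fenced code (```...```) and <pre>...</pre> blocks are kept verbatim.
_FENCE_RE = re.compile(r"(```.*?```|<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)
# Assumed pattern: runs of spaces/tabs in prose collapse to a single space.
_PROSE_WS_RE = re.compile(r"[ \t]+")


def _process_prose(text: str) -> str:
    """Collapse horizontal whitespace and drop trailing whitespace per line."""
    lines = [_PROSE_WS_RE.sub(" ", line).rstrip() for line in text.split("\n")]
    return "\n".join(lines)


def normalize_text(text: str) -> str:
    """NFC-normalize, unify line endings, and clean prose outside code blocks."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # With one capturing group, re.split puts the code blocks at odd indices.
    parts = _FENCE_RE.split(text)
    cleaned = [part if i % 2 else _process_prose(part) for i, part in enumerate(parts)]
    return "".join(cleaned).strip()


def compute_content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text, usable as a dedup key."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
```

Under these assumed rules, two inputs that differ only in line endings or prose spacing hash to the same value, while whitespace inside a fence still affects the hash.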
…ifact
module_a_mock.jsonl: Module A's canonical 20-record mock shared 2026-05-29, saved as JSONL (one record per line per the contract). Becomes a permanent integration-test fixture for B's parser and a reference shape for the Module A contributor.
module_a_contract.schema.json: JSON Schema generated from B's Pydantic ChangeRecord model via model_json_schema(). 246 lines covering ChangeRecord and its four nested types (GithubSource, RssSource, Span, Locator). Source of truth for cross-module CI validation.
Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
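Because the schema artifact is regenerated from the model, any post-processing has to be repeatable. The sketch below shows one way a regeneration step could pin the JSON Schema dialect on the generated dictionary. The helper names `pin_dialect` and `render_schema` are hypothetical, and the input is assumed to be the plain `dict` returned by `ChangeRecord.model_json_schema()`.

```python
import json

# Dialect URI for JSON Schema draft 2020-12.
JSON_SCHEMA_DIALECT = "https://json-schema.org/draft/2020-12/schema"


def pin_dialect(schema: dict, dialect: str = JSON_SCHEMA_DIALECT) -> dict:
    """Return a copy of the schema with "$schema" as its first key."""
    pinned = {"$schema": dialect}
    pinned.update({key: value for key, value in schema.items() if key != "$schema"})
    return pinned


def render_schema(schema: dict) -> str:
    """Serialize the pinned schema with stable formatting for committing."""
    return json.dumps(pin_dialect(schema), indent=2, ensure_ascii=False) + "\n"
```

A regeneration script could then write `render_schema(ChangeRecord.model_json_schema())` to the artifact path, so the dialect declaration survives every regeneration.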
build_labeled_dataset.py: PyGithub-based harvester that acts as Module A's stand-in for producing benchmark data. Fetches recent commits from 4 OWASP repos (WSTG, ASVS, CheatSheetSeries, SAMM), applies the contract's normalization rules, splits into chunks at markdown heading boundaries with a fence-aware stack-based walker that tracks heading_path + char/line offsets, and emits records in Module A's actual nested shape. Pluggable via GITHUB_TOKEN env var. Reproducible: python scripts/build_labeled_dataset.py regenerates the candidate set.
label_dataset.py: resumable interactive TUI for manual classification. Atomic-writes labeled_data.json after every keystroke; lookup by chunk_id for resume. Embeds the recall-first definition (agreed with maintainer 2026-06-01) so labelers see the rule front-of-mind: KNOWLEDGE for any chunk with security signal, NOISE only for pure organizational content.
candidate_commits.json: 100 records, 25 per repo, all Pydantic-valid against ChangeRecord. 90/100 have non-empty heading_path; 10 multi-chunk artifacts captured.
labeled_data.json: 100/100 labeled by hand under the recall-first rule. Distribution 55 KNOWLEDGE / 40 NOISE / 5 UNCERTAIN. Per-repo skew is visible: CheatSheetSeries 92% K, SAMM 0% K (the SAMM commits sampled landed entirely on Website/Sponsorship/meetings paths -- empirical input for Week 2's noise_patterns.yaml).
Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
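A fence-aware, stack-based heading walker of the kind described for the harvester might look like the sketch below. It is an illustration only: the `Chunk` fields, the regexes, and the decision to treat only triple-backtick fences as code are assumptions, not the script's actual implementation.

```python
import re
from dataclasses import dataclass

# Assumed patterns: ATX headings (# to ######) and triple-backtick fence lines.
_HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")
_FENCE_LINE_RE = re.compile(r"^\s*```")


@dataclass
class Chunk:
    heading_path: list[str]
    start_line: int  # 1-based, inclusive
    end_line: int  # 1-based, inclusive
    start_char: int  # 0-based offset into the source text
    end_char: int  # exclusive
    text: str


def chunk_markdown(text: str) -> list[Chunk]:
    """Split at headings outside code fences, tracking the heading stack."""
    chunks: list[Chunk] = []
    stack: list[tuple[int, str]] = []  # (heading level, title)
    path: list[str] = []
    in_fence = False
    start_line, start_char = 1, 0
    offset = 0  # char offset of the first character of the current line

    def emit(end_line: int, end_char: int) -> None:
        body = text[start_char:end_char]
        if body.strip():
            chunks.append(Chunk(path, start_line, end_line, start_char, end_char, body))

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line

    for lineno, line in enumerate(lines, start=1):
        if _FENCE_LINE_RE.match(line):
            in_fence = not in_fence
        match = None if in_fence else _HEADING_RE.match(line)
        if match:
            emit(lineno - 1, offset)  # close the chunk that ends before this heading
            level, title = len(match.group(1)), match.group(2)
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave sibling and deeper sections
            stack.append((level, title))
            path = [name for _, name in stack]
            start_line, start_char = lineno, offset
        offset += len(line) + 1  # +1 for the newline removed by split
    emit(len(lines), len(text))
    return chunks
```

Each chunk's `text` is the exact slice `source[start_char:end_char]`, so offsets can be checked against the source directly. This sketch does not track `<pre>` blocks.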
Walkthrough
This PR implements the complete Module B (noise filter) component of the OpenCRE scraper pipeline: data contracts in Pydantic and JSON Schema, content-hashing utilities, comprehensive validation tests, a GitHub dataset harvesting script with heading-aware markdown chunking, and an interactive record-labeling CLI tool.
Changes
Module B Noise Filter Implementation
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (3)
scripts/build_labeled_dataset.py (1)
92-122: ⚡ Quick win. Consider importing from `hashing.py` or adding a sync test.
The duplicated normalization logic creates drift risk: if `application/utils/noise_filter/hashing.py` is updated, this script's normalization could diverge, causing chunk_id mismatches between harvested data and Module B's deduplication. The upstream tests validate `hashing.py`, not this copy.
Options:
- Add `sys.path` manipulation to import `normalize_text` from the application module
- Add a unit test asserting both implementations produce identical output on a sample corpus
- At minimum, add a comment noting which commit of `hashing.py` this was copied from
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/build_labeled_dataset.py` around lines 92 - 122, This script duplicates normalize_text (and helper functions _process_prose, _process_fence) from application/utils/noise_filter/hashing.py which risks divergence; either import the implementation from that module (e.g., add sys.path manipulation so you can from application.utils.noise_filter.hashing import normalize_text and remove the local copies) or add a unit test that compares this script's normalize_text output against the one in hashing.py across representative inputs to prevent drift; if importing is infeasible, at minimum add a clear comment documenting the exact commit/sha of hashing.py this was copied from and a TODO to remove duplication.
scripts/label_dataset.py (1)
187-189: 💤 Low value. Clarify skip behavior in docstring.
The docstring (line 16) says `s = SKIP (drop this record from the dataset entirely)`, but skipped records are not persisted and will reappear on the next run. This may be intentional (allows reconsideration), but the wording implies permanent removal.
Suggested docstring update:
    - s = SKIP (drop this record from the dataset entirely)
    + s = SKIP (skip for now; record will reappear on next run)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/label_dataset.py` around lines 187 - 189, Update the module/function docstring that currently states "s = SKIP (drop this record from the dataset entirely)" to accurately reflect the behavior in the interactive loop where ans == "s" only skips persisting the item for the current run (it is not saved and will reappear on the next run); reference the interactive branch handling ans (the if ans == "s" branch) and explicitly state that this skip is temporary and does not permanently remove the record, and note how to permanently remove records if there is an alternative action or manual step.
application/utils/noise_filter/hashing.py (1)
32-35: ⚡ Quick win. Guard against the duplicated normalizer drifting.
`application/utils/noise_filter/hashing.py` and `scripts/build_labeled_dataset.py` currently have the v0.2 normalization block (`_FENCE_RE`/`_PROSE_WS_RE`, `normalize_text`, `_process_prose`, `_process_fence`) byte-for-byte identical, but any future edit to only one copy would desync the `content_hash` dedup key. Add a small CI parity test that normalizes a few fixed fixtures through both implementations and asserts equality so drift fails fast.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/noise_filter/hashing.py` around lines 32 - 35, There are two identical normalization implementations (_FENCE_RE, _PROSE_WS_RE, normalize_text, _process_prose, _process_fence) in different places that must stay in sync; add a CI parity test that runs a small set of fixed fixtures through both implementations and asserts their outputs are byte-for-byte equal so any future drift fails. Implement the test to import both normalization functions (the one in hashing.py and the other implementation used by build_labeled_dataset.py), feed each fixture string to both normalize_text/_process_prose/_process_fence entry points as appropriate, and assert equality for each fixture; fail the test on mismatch and include the fixture name in the assertion message for faster debugging. Ensure the test is lightweight, added to the test suite, and runs in CI.
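The parity test suggested in these comments could be structured roughly as follows. This is a sketch under assumptions: the fixture strings and the helper names `find_drift` and `assert_parity` are hypothetical, and the two normalizers are passed in as plain callables because how the script's copy would be imported is not settled in the review.

```python
from typing import Callable, Mapping

Normalizer = Callable[[str], str]

# Hypothetical fixtures covering the normalization rules named in the PR.
PARITY_FIXTURES: dict[str, str] = {
    "crlf_prose": "alpha  beta\r\ngamma\r\n",
    "fenced_code": "intro\n```\nx  =  1\n```\n",
    "pre_block": "<pre>\n  keep   me\n</pre>\n",
    "decomposed_unicode": "cafe\u0301",
}


def find_drift(
    impl_a: Normalizer,
    impl_b: Normalizer,
    fixtures: Mapping[str, str] = PARITY_FIXTURES,
) -> list[str]:
    """Return the names of fixtures on which the two normalizers disagree."""
    return [name for name, raw in fixtures.items() if impl_a(raw) != impl_b(raw)]


def assert_parity(
    impl_a: Normalizer,
    impl_b: Normalizer,
    fixtures: Mapping[str, str] = PARITY_FIXTURES,
) -> None:
    """Fail with the offending fixture names if the implementations drift."""
    drifted = find_drift(impl_a, impl_b, fixtures)
    assert not drifted, f"normalizers disagree on fixtures: {drifted}"
```

In a real test case, `impl_a` would be `normalize_text` from `hashing.py` and `impl_b` the copy in `scripts/build_labeled_dataset.py`. Naming the fixture in the failure message follows the review's request for faster debugging.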
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/utils/noise_filter/hashing.py`:
- Around line 93-96: The __all__ export list in hashing.py is not alphabetized
causing a Ruff RUF022 warning; reorder the entries in the __all__ list
alphabetically (e.g., "compute_content_hash" before "normalize_text") so the
list is sorted, leaving the rest of the file and the function names
(compute_content_hash, normalize_text) unchanged.
In `@application/utils/noise_filter/schemas.py`:
- Around line 146-155: The __all__ list in this module is not alphabetically
sorted (RUF022); reorder the exported names in isort-style alphabetical order so
the list reads: "ChangeRecord", "ClassifyResult", "GithubSource", "Locator",
"QueuePayload", "RssSource", "Source", "Span" (or follow your project's exact
alphabetical convention) and save the file; update the __all__ variable (the one
containing ChangeRecord, Source, GithubSource, RssSource, Span, Locator,
ClassifyResult, QueuePayload) to that sorted ordering to satisfy Ruff.
In `@docs/gsoc_2026_module_b/module_a_contract.schema.json`:
- Around line 1-2: The generated JSON Schema currently begins with "$defs" and
lacks a top-level "$schema" declaration; update the generation or add a
post-processing step that injects a top-level "$schema":
"https://json-schema.org/draft/2020-12/schema" (or the appropriate draft URI) so
the artifact explicitly pins the JSON Schema dialect, and ensure the generator
call (model_json_schema()) or its post-hook consistently writes this "$schema"
key to prevent drift on regeneration.
In `@scripts/build_labeled_dataset.py`:
- Around line 180-184: chunk_markdown only toggles in_fence for triple-backtick
fences but normalize_text/_FENCE_RE also handles <pre>...</pre>, so add tracking
for <pre> blocks inside chunk_markdown: introduce an in_pre boolean and update
it by testing each line with the existing/predefined regexes (e.g., _PRE_OPEN_RE
and _PRE_CLOSE_RE) when iterating lines in chunk_markdown, and change the
heading detection (the _HEADING_RE match) to only run when not (in_fence or
in_pre); update references to in_fence, add in_pre initialization, and ensure
_PRE_OPEN_RE/_PRE_CLOSE_RE are imported/defined where chunk_markdown runs.
- Around line 245-262: The offset advance unconditionally adds 2 for a "\n\n"
separator (cursor_char = end_char + 2 and cursor_line = end_line + 1) which is
wrong for hard-splits that have no separator; update the cursor advancement to
inspect the original chunk.text between the sub-chunk boundaries and compute the
actual separator length/newlines: map end_char back to the local index using
chunk.start_char_idx, read the substring after the sub-chunk to determine
sep_len and sep_newlines (0 for contiguous, 1 or 2 for "\n" or "\n\n"), then use
cursor_char = end_char + sep_len and cursor_line = end_line + sep_newlines so
offsets follow the real separator instead of always assuming "\n\n" (affects the
loop using out_texts, cursor_char, cursor_line, and chunk).
In `@scripts/label_dataset.py`:
- Around line 73-78: The load_labeled function may crash on a malformed
labeled_data.json; wrap the LABELED_PATH.read_text() + json.loads(...) in a
try/except that catches json.JSONDecodeError (and optionally
ValueError/Exception), and on error produce a clear, actionable message
including the LABELED_PATH and suggested remediation (e.g., delete or restore
the file), then either return an empty dict or exit with a user-friendly error;
update load_labeled to perform this defensive handling so callers of
load_labeled get a clear diagnostic instead of a traceback.
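For the sub-chunk offset comment above, one alternative to measuring the separator is to locate each sub-chunk in the original chunk text and derive offsets from where it actually sits. The sketch below assumes every sub-chunk is a verbatim, in-order substring of the chunk text. `SubSpan` and `locate_sub_chunks` are hypothetical names, not functions from the script.

```python
from dataclasses import dataclass


@dataclass
class SubSpan:
    text: str
    start_char: int  # absolute offset, inclusive
    end_char: int  # absolute offset, exclusive
    start_line: int
    end_line: int


def locate_sub_chunks(
    chunk_text: str,
    sub_texts: list[str],
    base_char: int = 0,
    base_line: int = 1,
) -> list[SubSpan]:
    """Compute offsets from the real positions, whatever separates the pieces."""
    spans: list[SubSpan] = []
    cursor = 0  # local index into chunk_text; never moves backwards
    for sub in sub_texts:
        local_start = chunk_text.find(sub, cursor)
        if local_start < 0:
            raise ValueError("sub-chunk is not an in-order substring of the chunk")
        local_end = local_start + len(sub)
        start_line = base_line + chunk_text.count("\n", 0, local_start)
        # The last character sits on the final line unless it is itself a newline.
        end_line = start_line + sub.count("\n") - (1 if sub.endswith("\n") else 0)
        spans.append(
            SubSpan(sub, base_char + local_start, base_char + local_end, start_line, end_line)
        )
        cursor = local_end
    return spans
```

A hard split with no separator, a single newline, and a blank-line separator all go through the same path, because the gap between consecutive pieces is never assumed.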
📒 Files selected for processing (11)
- application/tests/noise_filter/__init__.py
- application/tests/noise_filter/fixtures/candidate_commits.json
- application/tests/noise_filter/fixtures/labeled_data.json
- application/tests/noise_filter/fixtures/module_a_mock.jsonl
- application/tests/noise_filter/schemas_test.py
- application/utils/noise_filter/__init__.py
- application/utils/noise_filter/hashing.py
- application/utils/noise_filter/schemas.py
- docs/gsoc_2026_module_b/module_a_contract.schema.json
- scripts/build_labeled_dataset.py
- scripts/label_dataset.py
Super-Linter (Black 24.4.2) flagged 4 files in the previous push. Applied `black` (same pinned version) to bring them in line with the repo's formatting standard. Cosmetic changes only: blank lines around section-separator comments, one multi-line dict join. No behavior or test changes -- `make test` remains 271 passing, 1 skip.
@coderabbitai can you re-review after the new commit?
✅ Actions performed: Review triggered.
- Sort __all__ lists in hashing.py and schemas.py to satisfy Ruff RUF022.
- Declare JSON Schema dialect ($schema = draft 2020-12, which is what Pydantic v2 model_json_schema() emits) on the contract artifact.
- Wrap load_labeled() in scripts/label_dataset.py with try/except so a corrupted labeled_data.json prints an actionable hint instead of a raw JSONDecodeError stack trace.
Deferred to Week 2 (will be addressed when we touch the harvester):
- chunker should also track <pre> open/close, not just ``` fences
- _split_chunk_by_size cursor arithmetic assumes \n\n separator even on hard-split sub-chunks
Tests: 271 passing, 1 skip (unchanged). Black: clean.
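The defensive load described in this commit, together with the atomic write mentioned for the labeling tool, can be sketched as below. The function bodies, the error wording, and the assumption that labels are stored as a JSON object keyed by chunk_id are illustrative and not taken from `scripts/label_dataset.py`.

```python
import contextlib
import json
import os
import sys
import tempfile
from pathlib import Path


def load_labeled(path: Path) -> dict:
    """Load existing labels, exiting with a hint if the file is corrupted."""
    if not path.exists():
        return {}
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        sys.exit(
            f"{path} is not valid JSON ({exc}). "
            "Restore it from version control or delete it to start over."
        )


def save_labeled(path: Path, labeled: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(labeled, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Keeping the temporary file in the target's directory lets `os.replace` swap it in atomically, so an interrupted session leaves either the previous file or the new one, never a partial write.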
Summary
- … `compute_content_hash()` (Module A does not emit a hash, so B computes its own for `knowledge_queue` dedup).
- … (`labeled_data.json`) for prompt iteration and Stage-2 evaluation.
Part of the GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B (Noise/Relevance Filter).
What this PR adds
- `application/utils/noise_filter/__init__.py`
- `application/utils/noise_filter/schemas.py`: `ChangeRecord` (top), `Source` discriminated union (`GithubSource` / `RssSource`), `Span`, `Locator`, + internal `ClassifyResult` / `QueuePayload`
- `application/utils/noise_filter/hashing.py`: `normalize_text()` (NFC + line endings + whitespace + code-fence preservation, idempotent) + `compute_content_hash()` (SHA-256 of normalized text)
- `application/tests/noise_filter/schemas_test.py`
- `application/tests/noise_filter/fixtures/module_a_mock.jsonl`
- `application/tests/noise_filter/fixtures/candidate_commits.json`
- `application/tests/noise_filter/fixtures/labeled_data.json`
- `docs/gsoc_2026_module_b/module_a_contract.schema.json`: … `ChangeRecord.model_json_schema()`; source of truth for cross-module CI validation
- `scripts/build_labeled_dataset.py`: … `heading_path` + char/line offsets; idempotent; reproducible
- `scripts/label_dataset.py`
Test plan
- `make test`: full suite passes locally: 271 tests, 0 failures, 0 errors, 1 skip (was 249 before this PR; we added 22 new tests under `application/tests/noise_filter/`).
- `application/utils/noise_filter/schemas.py` validates Module A's mock JSONL round-trip (20/20 records).
- `compute_content_hash` is deterministic, idempotent, and correctly preserves whitespace inside code fences while collapsing it in prose.
- `candidate_commits.json` and `labeled_data.json` pass `ChangeRecord.model_validate()`.
- `scripts/build_labeled_dataset.py` runs end-to-end with a `GITHUB_TOKEN` env var and regenerates `candidate_commits.json` deterministically.
Notes for reviewers
- … `module_b_w1`; future weeks will land as `module_b_w2`, etc. Per maintainer discussion.
- … (`regex_filter.py` + `noise_patterns.yaml`): Week 2 deliverable. Empirical findings from this Week 1 labeled set (e.g. SAMM's `Website/**`, `Supporting Resources/meetings/**` consistently produced NOISE) will inform `noise_patterns.yaml`.
- `PromptHandler` (LiteLLM-backed): Week 3 deliverable.
- `KnowledgeQueueItem` SQLAlchemy model + Alembic migration: Week 5 deliverable.
- … (`module_a_contract.md` v0.3) is shared directly with the Module A contributor via Slack `#project-opencre` rather than committed (project's `.gitignore` excludes `*.md`). This PR includes the machine-readable `module_a_contract.schema.json` artifact for CI validation.