fix(docx-core): harden heading detection (#157 Phase 1)#178
Open
stevenobiajulu wants to merge 1 commit intomainfrom
Open
fix(docx-core): harden heading detection (#157 Phase 1)#178stevenobiajulu wants to merge 1 commit intomainfrom
stevenobiajulu wants to merge 1 commit intomainfrom
Conversation
Three concrete bugs in `read_file(format="json")` heading signals, verified against `tests/test_documents/nvca-regression/source.docx`: 1. False negative on the SPA's center-aligned ALL-CAPS bold title (`SERIES […] PREFERRED STOCK PURCHASE AGREEMENT`). It produced no heading signal because none of the existing detectors targeted centered-title formatting. 2. False positives in `extractHeaderInfo()`: the long-paragraph regex branch returned `title_bare` for body prose that began with a capitalized word, surfacing the leading word as `header_text` (e.g. `Termination` from `Termination of Section 2.2(d)(i) shall not affect …`). 3. False positives in `detectRunInHeader()`: paragraphs whose visible runs were entirely bold/underline (signature labels, defined-term lead-in sentences) were classified as `run_in_header` because the detector never required a real header-prefix → body transition. Why this scope (Phase 1 only — no `is_heading` schema yet): peer review (gemini + codex) flagged that the existing signals were too noisy to bless with a top-level verdict field. Across the 367-paragraph NVCA SPA, the `run_in_header` count drops from 12 (all false positives, each carrying a body-prose `header_text`) to 0 with the transition guard. With the signals now correct, derived `is_heading` / `heading_level` fields can be designed against a stable foundation in a follow-up issue. Changes: - `extractHeaderInfo()`: long-paragraph regex branch returns null when no explicit `.` or `:` terminator was matched. Short-paragraph branch (≤ 50 chars) and the explicit-terminator paths are unchanged — the existing `Governing Law and Venue: …` test still passes. - `detectRunInHeader()`: adds a formatting-transition invariant (`headerCharCount < totalVisibleChars`). Whole-paragraph bold/underline blocks are no longer surfaced as run-in headers; a real bold-prefix → non-bold-body transition still works. - New `detectTitleCapsCentered()`: final fallback for centered, ALL-CAPS, bold standalone titles. Strict gates (no lowercase, no list label, not in a table cell, ≤ 60 chars). Emits `header_style: 'title_caps_centered'`. - Documents the canonical heading-detection precedence rule in `skills/docx-editing/SKILL.md` and adds a brief pointer in `packages/docx-mcp/README.md`. - Regression tests: 5 new branch-coverage tests in `document_view.branch.test.ts` (synthetic XML) and 1 end-to-end JSON-shape test in `nvca_spa_regression.test.ts` against the actual SPA fixture. The Google Docs renderer (`packages/google-docs-core`) is structurally cast to `DocumentViewNode`; its schema is unchanged. Phase 2 design (the deferred `is_heading` / `heading_level` fields) will be tracked in a follow-up issue. Ref: #157
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
6 tasks
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 1 of #157: tighten the existing heading-detection signals in
read_file(format=\"json\"), document the canonical precedence rule, and add a centered-title detector. Phase 2 (derivedis_heading/heading_levelschema fields) is deferred to a follow-up issue.Three concrete bugs fixed against
tests/test_documents/nvca-regression/source.docx:SERIES […] PREFERRED STOCK PURCHASE AGREEMENT) now surfaces asheader_style: 'title_caps_centered'instead of producing a null heading signal.extractHeaderInfo(): the long-paragraph regex branch no longer returnstitle_barewhen no explicit.or:terminator is matched — body prose likeTermination of Section 2.2(d)(i) shall not affect …andExcept as described in Section 2.2(d)(i), …no longer surface a leading-wordheader_text.detectRunInHeader(): a formatting-transition invariant (headerCharCount < totalVisibleChars) ensures the bold/underline prefix is followed by non-header body. Whole-paragraph bold/underline blocks (signature labels, defined-term lead-in sentences) no longer get classified as run-in headers.Smoke results across the 367-paragraph NVCA SPA
run_in_headerheader_text= whole body prose)title_with_periodtitle_with_colonrun_in_headermis-classification)title_baretitle_caps_centeredWhy Phase 1 only (not Phase 2)
Peer-reviewed with both Gemini and Codex CLI before implementation. Codex flagged that:
run_in_headerentries on main are exactly that noise).headerstring field is wired into toon header-stripping (document_view.ts:363) and grep locators (packages/docx-mcp/src/tools/grep.ts:56) — repurposing it would be more breaking than adding new fields later.DocumentViewNodeatpackages/docx-mcp/src/tools/gdocs/read_file.ts:21, so any new top-level field is immediately a cross-renderer contract.Phase 1 lands the cleaner signals and the documented precedence rule first; the Phase 2 schema (
heading: null | { text, source, level }per Codex's preference, vs. nestedheading: { is_heading, tier, level, source, source_detail }per Gemini's) will be designed against stable signals in a follow-up issue.Documentation
skills/docx-editing/SKILL.md— new "Heading Detection (read_file JSON output)" subsection with the canonical precedence table.packages/docx-mcp/README.md— brief paragraph documentinglist_metadata.header_text/header_style(previously undocumented), pointing to SKILL.md.Out of scope (tracked in follow-up)
is_heading/heading_levelschema work.Heading 1–Heading 6style detection (e.g., theHeadingPara1/HeadingPara2footgun in the NVCA fixture where styles inherit from heading styles but behave like body).Test plan
npm -w @usejunior/docx-core test— 1137 pass, 1 skipped, 0 fail (5 new branch tests added).npm -w @usejunior/docx-mcp test— 675 pass, 0 fail (1 new end-to-end JSON-shape test against the SPA fixture).npm run lint:workspaces— 0 errors (1 pre-existing warning inedit.test.tsunchanged).title_caps_centered,Termination of Section …andExcept as described in Section …no longer surface a non-nullheader_text, all 12 main-branchrun_in_headerfalse positives suppressed.Governing Law and Venue: …style inline headers still detected astitle_with_colon(no regression indocument_view.branch.test.ts:150).run_in_headerstill fires on legitimate cases.Ref: #157