Skip to content

fix(docx-core): harden heading detection (#157 Phase 1)#178

Open
stevenobiajulu wants to merge 1 commit intomainfrom
157-heading-detection-20260505
Open

fix(docx-core): harden heading detection (#157 Phase 1)#178
stevenobiajulu wants to merge 1 commit intomainfrom
157-heading-detection-20260505

Conversation

@stevenobiajulu
Copy link
Copy Markdown
Member

Summary

Phase 1 of #157: tighten the existing heading-detection signals in read_file(format=\"json\"), document the canonical precedence rule, and add a centered-title detector. Phase 2 (derived is_heading / heading_level schema fields) is deferred to a follow-up issue.

Three concrete bugs fixed against tests/test_documents/nvca-regression/source.docx:

  1. False negative: the SPA's center-aligned, ALL-CAPS, bold title (SERIES […] PREFERRED STOCK PURCHASE AGREEMENT) now surfaces as header_style: 'title_caps_centered' instead of producing a null heading signal.
  2. False positives in extractHeaderInfo(): the long-paragraph regex branch no longer returns title_bare when no explicit . or : terminator is matched — body prose like Termination of Section 2.2(d)(i) shall not affect … and Except as described in Section 2.2(d)(i), … no longer surface a leading-word header_text.
  3. False positives in detectRunInHeader(): a formatting-transition invariant (headerCharCount < totalVisibleChars) ensures the bold/underline prefix is followed by non-header body. Whole-paragraph bold/underline blocks (signature labels, defined-term lead-in sentences) no longer get classified as run-in headers.

Smoke results across the 367-paragraph NVCA SPA

Header style Main This PR Net
run_in_header 12 (all false positives, header_text = whole body prose) 0 -12 false positives
title_with_period 12 12 unchanged
title_with_colon 9 12 +3 (recovered from former run_in_header mis-classification)
title_bare 90 88 -2 (Termination/Except now correctly null)
title_caps_centered 0 3 +3 NEW true positives (SPA title + 2 others)

Why Phase 1 only (not Phase 2)

Peer-reviewed with both Gemini and Codex CLI before implementation. Codex flagged that:

  • The existing signals are too noisy to bless with a top-level verdict field today (the 12 false-positive run_in_header entries on main are exactly that noise).
  • The existing header string field is wired into toon header-stripping (document_view.ts:363) and grep locators (packages/docx-mcp/src/tools/grep.ts:56) — repurposing it would be more breaking than adding new fields later.
  • The Google Docs renderer is structurally cast to DocumentViewNode at packages/docx-mcp/src/tools/gdocs/read_file.ts:21, so any new top-level field is immediately a cross-renderer contract.

Phase 1 lands the cleaner signals and the documented precedence rule first; the Phase 2 schema (heading: null | { text, source, level } per Codex's preference, vs. nested heading: { is_heading, tier, level, source, source_detail } per Gemini's) will be designed against stable signals in a follow-up issue.

Documentation

  • skills/docx-editing/SKILL.md — new "Heading Detection (read_file JSON output)" subsection with the canonical precedence table.
  • packages/docx-mcp/README.md — brief paragraph documenting list_metadata.header_text / header_style (previously undocumented), pointing to SKILL.md.

Out of scope (tracked in follow-up)

  • Top-level is_heading / heading_level schema work.
  • Refining Heading 1Heading 6 style detection (e.g., the HeadingPara1 / HeadingPara2 footgun in the NVCA fixture where styles inherit from heading styles but behave like body).
  • Mirroring on the Google Docs renderer (its structural cast already auto-mirrors any new top-level fields, so no Phase 1 changes break it).

Test plan

  • npm -w @usejunior/docx-core test — 1137 pass, 1 skipped, 0 fail (5 new branch tests added).
  • npm -w @usejunior/docx-mcp test — 675 pass, 0 fail (1 new end-to-end JSON-shape test against the SPA fixture).
  • npm run lint:workspaces — 0 errors (1 pre-existing warning in edit.test.ts unchanged).
  • Manual smoke against the actual NVCA SPA fixture: SPA title detected as title_caps_centered, Termination of Section … and Except as described in Section … no longer surface a non-null header_text, all 12 main-branch run_in_header false positives suppressed.
  • Governing Law and Venue: … style inline headers still detected as title_with_colon (no regression in document_view.branch.test.ts:150).
  • New regression test for a real bold-prefix + body-suffix paragraph confirms run_in_header still fires on legitimate cases.

Ref: #157

Three concrete bugs in `read_file(format="json")` heading signals,
verified against `tests/test_documents/nvca-regression/source.docx`:

1. False negative on the SPA's center-aligned ALL-CAPS bold title
   (`SERIES […] PREFERRED STOCK PURCHASE AGREEMENT`). It produced
   no heading signal because none of the existing detectors targeted
   centered-title formatting.
2. False positives in `extractHeaderInfo()`: the long-paragraph regex
   branch returned `title_bare` for body prose that began with a
   capitalized word, surfacing the leading word as `header_text`
   (e.g. `Termination` from `Termination of Section 2.2(d)(i) shall not
   affect …`).
3. False positives in `detectRunInHeader()`: paragraphs whose visible
   runs were entirely bold/underline (signature labels, defined-term
   lead-in sentences) were classified as `run_in_header` because the
   detector never required a real header-prefix → body transition.

Why this scope (Phase 1 only — no `is_heading` schema yet): peer
review (gemini + codex) flagged that the existing signals were too
noisy to bless with a top-level verdict field. Across the 367-paragraph
NVCA SPA, the `run_in_header` count drops from 12 (all false positives,
each carrying a body-prose `header_text`) to 0 with the transition
guard. With the signals now correct, derived `is_heading` /
`heading_level` fields can be designed against a stable foundation in
a follow-up issue.

Changes:
- `extractHeaderInfo()`: long-paragraph regex branch returns null
  when no explicit `.` or `:` terminator was matched. Short-paragraph
  branch (≤ 50 chars) and the explicit-terminator paths are unchanged
  — the existing `Governing Law and Venue: …` test still passes.
- `detectRunInHeader()`: adds a formatting-transition invariant
  (`headerCharCount < totalVisibleChars`). Whole-paragraph
  bold/underline blocks are no longer surfaced as run-in headers; a
  real bold-prefix → non-bold-body transition still works.
- New `detectTitleCapsCentered()`: final fallback for centered,
  ALL-CAPS, bold standalone titles. Strict gates (no lowercase, no
  list label, not in a table cell, ≤ 60 chars). Emits
  `header_style: 'title_caps_centered'`.
- Documents the canonical heading-detection precedence rule in
  `skills/docx-editing/SKILL.md` and adds a brief pointer in
  `packages/docx-mcp/README.md`.
- Regression tests: 5 new branch-coverage tests in
  `document_view.branch.test.ts` (synthetic XML) and 1 end-to-end
  JSON-shape test in `nvca_spa_regression.test.ts` against the actual
  SPA fixture.

The Google Docs renderer (`packages/google-docs-core`) is structurally
cast to `DocumentViewNode`; its schema is unchanged. Phase 2 design
(the deferred `is_heading` / `heading_level` fields) will be tracked
in a follow-up issue.

Ref: #157
@vercel
Copy link
Copy Markdown

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
site Ready Ready Preview, Comment May 8, 2026 5:15pm

Request Review

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 84.12698% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
packages/docx-core/src/primitives/document_view.ts 84.12% 6 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant