fix(docx-core): harden heading detection (#157 Phase 1) by stevenobiajulu · Pull Request #178 · UseJunior/safe-docx

stevenobiajulu · 2026-05-08T17:14:55Z

Summary

Phase 1 of #157: tighten the existing heading-detection signals in read_file(format=\"json\"), document the canonical precedence rule, and add a centered-title detector. Phase 2 (derived is_heading / heading_level schema fields) is deferred to a follow-up issue.

Three concrete bugs fixed against tests/test_documents/nvca-regression/source.docx:

False negative: the SPA's center-aligned, ALL-CAPS, bold title (SERIES […] PREFERRED STOCK PURCHASE AGREEMENT) now surfaces as header_style: 'title_caps_centered' instead of producing a null heading signal.
False positives in extractHeaderInfo(): the long-paragraph regex branch no longer returns title_bare when no explicit . or : terminator is matched — body prose like Termination of Section 2.2(d)(i) shall not affect … and Except as described in Section 2.2(d)(i), … no longer surface a leading-word header_text.
False positives in detectRunInHeader(): a formatting-transition invariant (headerCharCount < totalVisibleChars) ensures the bold/underline prefix is followed by non-header body. Whole-paragraph bold/underline blocks (signature labels, defined-term lead-in sentences) no longer get classified as run-in headers.

Smoke results across the 367-paragraph NVCA SPA

Header style	Main	This PR	Net
`run_in_header`	12 (all false positives, `header_text` = whole body prose)	0	-12 false positives
`title_with_period`	12	12	unchanged
`title_with_colon`	9	12	+3 (recovered from former `run_in_header` mis-classification)
`title_bare`	90	88	-2 (Termination/Except now correctly null)
`title_caps_centered`	0	3	+3 NEW true positives (SPA title + 2 others)

Why Phase 1 only (not Phase 2)

Peer-reviewed with both Gemini and Codex CLI before implementation. Codex flagged that:

The existing signals are too noisy to bless with a top-level verdict field today (the 12 false-positive run_in_header entries on main are exactly that noise).
The existing header string field is wired into toon header-stripping (document_view.ts:363) and grep locators (packages/docx-mcp/src/tools/grep.ts:56) — repurposing it would be more breaking than adding new fields later.
The Google Docs renderer is structurally cast to DocumentViewNode at packages/docx-mcp/src/tools/gdocs/read_file.ts:21, so any new top-level field is immediately a cross-renderer contract.

Phase 1 lands the cleaner signals and the documented precedence rule first; the Phase 2 schema (heading: null | { text, source, level } per Codex's preference, vs. nested heading: { is_heading, tier, level, source, source_detail } per Gemini's) will be designed against stable signals in a follow-up issue.

Documentation

skills/docx-editing/SKILL.md — new "Heading Detection (read_file JSON output)" subsection with the canonical precedence table.
packages/docx-mcp/README.md — brief paragraph documenting list_metadata.header_text / header_style (previously undocumented), pointing to SKILL.md.

Out of scope (tracked in follow-up)

Top-level is_heading / heading_level schema work.
Refining Heading 1–Heading 6 style detection (e.g., the HeadingPara1 / HeadingPara2 footgun in the NVCA fixture where styles inherit from heading styles but behave like body).
Mirroring on the Google Docs renderer (its structural cast already auto-mirrors any new top-level fields, so no Phase 1 changes break it).

Test plan

npm -w @usejunior/docx-core test — 1137 pass, 1 skipped, 0 fail (5 new branch tests added).
npm -w @usejunior/docx-mcp test — 675 pass, 0 fail (1 new end-to-end JSON-shape test against the SPA fixture).
npm run lint:workspaces — 0 errors (1 pre-existing warning in edit.test.ts unchanged).
Manual smoke against the actual NVCA SPA fixture: SPA title detected as title_caps_centered, Termination of Section … and Except as described in Section … no longer surface a non-null header_text, all 12 main-branch run_in_header false positives suppressed.
Governing Law and Venue: … style inline headers still detected as title_with_colon (no regression in document_view.branch.test.ts:150).
New regression test for a real bold-prefix + body-suffix paragraph confirms run_in_header still fires on legitimate cases.

Ref: #157

Three concrete bugs in `read_file(format="json")` heading signals, verified against `tests/test_documents/nvca-regression/source.docx`: 1. False negative on the SPA's center-aligned ALL-CAPS bold title (`SERIES […] PREFERRED STOCK PURCHASE AGREEMENT`). It produced no heading signal because none of the existing detectors targeted centered-title formatting. 2. False positives in `extractHeaderInfo()`: the long-paragraph regex branch returned `title_bare` for body prose that began with a capitalized word, surfacing the leading word as `header_text` (e.g. `Termination` from `Termination of Section 2.2(d)(i) shall not affect …`). 3. False positives in `detectRunInHeader()`: paragraphs whose visible runs were entirely bold/underline (signature labels, defined-term lead-in sentences) were classified as `run_in_header` because the detector never required a real header-prefix → body transition. Why this scope (Phase 1 only — no `is_heading` schema yet): peer review (gemini + codex) flagged that the existing signals were too noisy to bless with a top-level verdict field. Across the 367-paragraph NVCA SPA, the `run_in_header` count drops from 12 (all false positives, each carrying a body-prose `header_text`) to 0 with the transition guard. With the signals now correct, derived `is_heading` / `heading_level` fields can be designed against a stable foundation in a follow-up issue. Changes: - `extractHeaderInfo()`: long-paragraph regex branch returns null when no explicit `.` or `:` terminator was matched. Short-paragraph branch (≤ 50 chars) and the explicit-terminator paths are unchanged — the existing `Governing Law and Venue: …` test still passes. - `detectRunInHeader()`: adds a formatting-transition invariant (`headerCharCount < totalVisibleChars`). Whole-paragraph bold/underline blocks are no longer surfaced as run-in headers; a real bold-prefix → non-bold-body transition still works. - New `detectTitleCapsCentered()`: final fallback for centered, ALL-CAPS, bold standalone titles. Strict gates (no lowercase, no list label, not in a table cell, ≤ 60 chars). Emits `header_style: 'title_caps_centered'`. - Documents the canonical heading-detection precedence rule in `skills/docx-editing/SKILL.md` and adds a brief pointer in `packages/docx-mcp/README.md`. - Regression tests: 5 new branch-coverage tests in `document_view.branch.test.ts` (synthetic XML) and 1 end-to-end JSON-shape test in `nvca_spa_regression.test.ts` against the actual SPA fixture. The Google Docs renderer (`packages/google-docs-core`) is structurally cast to `DocumentViewNode`; its schema is unchanged. Phase 2 design (the deferred `is_heading` / `heading_level` fields) will be tracked in a follow-up issue. Ref: #157

vercel · 2026-05-08T17:15:02Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
site	Ready	Preview, Comment	May 8, 2026 5:15pm

codecov · 2026-05-08T17:20:59Z

Codecov Report

❌ Patch coverage is 84.12698% with 10 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
packages/docx-core/src/primitives/document_view.ts	84.12%	6 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

github-actions Bot added the fix label May 8, 2026

stevenobiajulu mentioned this pull request May 8, 2026

Heading detection Phase 2: derived is_heading / heading_level (deferred from #157) #179

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(docx-core): harden heading detection (#157 Phase 1)#178

fix(docx-core): harden heading detection (#157 Phase 1)#178
stevenobiajulu wants to merge 1 commit intomainfrom
157-heading-detection-20260505

stevenobiajulu commented May 8, 2026

Uh oh!

vercel Bot commented May 8, 2026

Uh oh!

codecov Bot commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stevenobiajulu commented May 8, 2026

Summary

Smoke results across the 367-paragraph NVCA SPA

Why Phase 1 only (not Phase 2)

Documentation

Out of scope (tracked in follow-up)

Test plan

Uh oh!

vercel Bot commented May 8, 2026

Uh oh!

codecov Bot commented May 8, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant