feat(mcp): MS-OI29500 implementation notes ingest + ooxml_behavior tool by caio-pizzol · Pull Request #5 · superdoc-dev/ooxml-dev

caio-pizzol · 2026-04-28T18:11:58Z

Summary

Adds Word/Office implementation behavior to the MCP. Each MS-OI29500 implementation note is parsed from Microsoft Learn's native markdown into a behavior_notes row, attached to a schema symbol when one fits. The new ooxml_behavior tool exposes them by qname, section, source anchor, free text, app, or claim type. ooxml_element and ooxml_type also inline-render notes when a top-level symbol has any.

ECMA-376 says what the spec allows, the XSD graph says what's structurally legal, but neither captures Word's actual behavior. MS-OI29500 fills that gap with ~3,667 "spec says X / Word does Y" claims across ~1,640 implementation pages.

- Parser input is the native Microsoft Learn markdown (?accept=text/markdown). It is first-party, free, and structurally clean, and it beat Firecrawl and Jina Reader in a head-to-head bakeoff. The pinned per-doc PDF sha256 backs citation provenance.
- ~24% of rows attach to a schema symbol; ~76% carry only a target_ref because the source vocabulary isn't in our XSD set (SML/PML/VML), the name is genuinely ambiguous, or the entry describes a field code / overview. The resolver is conservative: no wrong attachments by design.
- Per-row source_commit (page git_commit_id from the markdown frontmatter) makes a re-ingest reproducible from data/sources.json.
- Renames scripts/ingest-{pdf,xsd}/ to scripts/ingest-ecma-376-{pdfs,xsds}/ for parity with the new scripts/ingest-ms-oi29500/; folder names now answer "what is being ingested?".
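
The "attach only when one symbol fits" rule can be illustrated with a minimal sketch. This is not the code in scripts/ingest-ms-oi29500/resolve.ts; the `SchemaSymbol` shape, the `resolveConservatively` name, and the reason strings are all hypothetical, and the real resolver also handles cases (such as multi-vocab DML disambiguation) that this sketch ignores.

```typescript
// Hypothetical shapes for illustration only; the real schema may differ.
interface SchemaSymbol {
  id: number;
  localName: string;
  topLevel: boolean; // true for top-level declarations, false for local ones
}

interface Resolution {
  symbolId: number | null; // null means "leave the row unattached"
  reason: "attached" | "no_match" | "ambiguous";
}

// Attach only when exactly one top-level candidate matches; otherwise
// refuse to guess, so a wrong attachment cannot happen by construction.
function resolveConservatively(candidates: SchemaSymbol[]): Resolution {
  const topLevel = candidates.filter((c) => c.topLevel);
  const [only] = topLevel;
  if (topLevel.length === 1 && only !== undefined) {
    return { symbolId: only.id, reason: "attached" };
  }
  return {
    symbolId: null,
    reason: topLevel.length === 0 ? "no_match" : "ambiguous",
  };
}
```

An unattached row still keeps its textual target_ref, so the cost of refusing to guess is a lower attachment rate rather than lost notes.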

Test plan

bun run test — 93 pass, 0 fail (db 8, ingest-ecma-376-xsds 13, ingest-ms-oi29500 37, mcp-server 35).
bun run lint — clean on new files (one pre-existing unused-import warning unchanged).
bun run ms:ingest against prod — 3,667 rows inserted; same count on idempotent re-run.
Smoke-tested both surfaces against prod data: inline notes on ooxml_type w:ST_Jc; ooxml_behavior filters by qname / section_id / source_anchor / query / app / claim_type.

Review: pay attention to three things. First, the resolver's conservative attachment rules in scripts/ingest-ms-oi29500/resolve.ts (top-level vs local, single-vocab vs multi-vocab DML disambiguation). Second, the natural-key index in migration 0007: the partial unique index uses claim_index because claim_label collides for multi-bullet claims. Third, the qname word-boundary regex in apps/mcp-server/src/ooxml-queries.ts:fetchBehaviorNotes: a substring target_ref ILIKE would have let qname=tbl pull in tblPr notes, and a ~ regex match with \W-style boundaries fixes that. The migration files (0006, 0007, 0008) have already been applied to prod and are idempotent.
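
To make the tbl / tblPr problem concrete, here is a small sketch that contrasts a substring check with a boundary-delimited one. It is a JavaScript-regex analogue of the idea, not the SQL in fetchBehaviorNotes; the function names are hypothetical, and the case-sensitive boundary match is an assumption.

```typescript
// Escape regex metacharacters so the name is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Analogue of a case-insensitive substring match (ILIKE '%name%').
function substringMatch(targetRef: string, localName: string): boolean {
  return targetRef.toLowerCase().includes(localName.toLowerCase());
}

// Boundary-aware match: the name must be delimited on both sides by the
// start/end of the string or by a non-word character.
function boundaryMatch(targetRef: string, localName: string): boolean {
  const re = new RegExp(`(^|\\W)${escapeRegExp(localName)}(\\W|$)`);
  return re.test(targetRef);
}

console.log(substringMatch("w:tblPr", "tbl")); // true  (the false positive)
console.log(boundaryMatch("w:tblPr", "tbl")); // false
console.log(boundaryMatch("w:tbl", "tbl")); // true
```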

Adds Word/Office implementation behavior to the MCP. Each MS-OI29500 implementation note is parsed from Microsoft Learn's native markdown into a behavior_notes row, attached to a schema symbol when one fits. The new ooxml_behavior tool exposes them by qname, section_id, source_anchor, free-text, app, or claim_type; ooxml_element / ooxml_type also inline-render notes when a top-level symbol has any.

Why: ECMA-376 says what the spec allows, the XSD graph says what's structurally legal, but neither captures Word's actual behavior. MS-OI29500 fills that gap with ~3,667 'spec says X / Word does Y' claims across ~1,640 implementation pages.

- Markdown via ?accept=text/markdown is the parser input. First-party, free, structurally clean; it beat Firecrawl and Jina Reader in a head-to-head bakeoff. The pinned per-doc PDF sha256 backs citation provenance.
- ~24% of rows attach to a schema symbol; ~76% carry a target_ref because the source vocabulary is not in our XSD set (SML/PML/VML), the name is genuinely ambiguous, or the entry is a field code / overview. Resolver is conservative: no wrong attachments by design.
- Per-row source_commit (page git_commit_id from the markdown frontmatter) makes a re-ingest reproducible from data/sources.json.
- Renames scripts/ingest-{pdf,xsd}/ to ingest-ecma-376-{pdfs,xsds}/ for parity with the new ingest-ms-oi29500/ directory; folder names now answer 'what is being ingested?'.

Verified: bun run ms:ingest → 3,667 rows on prod (idempotent on re-run); 93 tests pass.

Five fixes raised on PR #5:

- Always populate target_ref, even on resolved rows. Migration 0006 changed the symbol_id FK to ON DELETE SET NULL, so a future xsd:ingest can null out symbol_id on attached rows. Without target_ref to fall back on, those notes become unreachable via ooxml_behavior(qname=...) until a full ms:ingest reruns. The qname word-boundary fallback now has something to match.
- Accept x: (SpreadsheetML) and p: (PresentationML) prefixes in parseQName. ~30% of behavior_notes are SML/PML target_ref-only; users could not query them by qname before because the parser rejected the prefix.
- Make Excel/PowerPoint variants consistent in the claim-type classifier. The layout_behavior and requires_despite_optional regexes now include all four apps; ignores/writes already did.
- Filter symbol resolution by profile in fetchBehaviorNotes. Other queries already scope to p.name='transitional'; this one was missing the join, which would let notes leak across profiles once non-transitional profiles exist.
- Replace em dashes with hyphens in user-facing strings per the global style rule.

Re-ingested prod: 3,667 rows; with_target_ref now 3,667 (was 2,799).
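
The prefix fix can be pictured with a small parser sketch. This is not the project's parseQName: the `parseQNameSketch` name, the return shape, and the accepted-prefix set are assumptions made for illustration (the notes only say that x: and p: were added; w: is assumed to have been accepted already, and the real list is probably longer).

```typescript
interface QName {
  prefix: string;
  localName: string;
}

// Assumed prefix set: w: (WordprocessingML) plus the two added by the fix.
const ACCEPTED_PREFIXES = new Set(["w", "x", "p"]);

// Returns null for anything that is not prefix:localName with a known prefix.
function parseQNameSketch(input: string): QName | null {
  const m = /^([A-Za-z][A-Za-z0-9]*):([A-Za-z_][A-Za-z0-9_.-]*)$/.exec(input.trim());
  if (!m) return null;
  const [, prefix, localName] = m;
  if (prefix === undefined || localName === undefined) return null;
  if (!ACCEPTED_PREFIXES.has(prefix)) return null;
  return { prefix, localName };
}

console.log(parseQNameSketch("x:sheetData")); // { prefix: "x", localName: "sheetData" }
console.log(parseQNameSketch("zz:foo")); // null
```

Because every row now carries a target_ref, a parsed x: or p: qname has something to match even though no SML/PML symbol exists in the XSD set.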

…layer

Reframes the MS-OI29500 surface as 'documented, not verified' and adds the ground-truth verification layer that Phase 4 dogfooding showed was needed.

Renames `ooxml_behavior` to `ooxml_implementation_notes` so the name doesn't claim authority it doesn't have. The tool description now leads with 'Microsoft-documented Office implementation notes... NOT necessarily verified against the live Word binary.'

Adds three tables (migration 0009) and a new MCP tool to back the rename:

- `word_fixtures`: one row per authored .docx, with sha256, generator script, and Word version so observations are reproducible.
- `word_observations`: per-fixture findings with before/after XML fragments. The XML diff is the proof of the finding.
- `behavior_note_observations`: join table linking docs claims to observations with a status (confirmed / refined / contradicted / not_reproducible). 'Refined' is the important new one: a documented claim that's directionally correct but glosses over Word's actual repair path.
- `ooxml_word_behavior` MCP tool: filter observations by fixture name, scenario, free-text query, or status.

Why: Phase 4 dogfooding ran four MS-OI29500 claims through real Word fixtures via the Word API MCP. One was contradicted (Word writes `<w:b/>` for cs/rtl runs even though the doc implies it shouldn't), one was refined (Word silently strips `<w:trHeight w:val=0 w:hRule=exact/>` rather than enforcing the constraint as the doc suggests), and two were confirmed. That signal told us the docs alone overpromise.

Inline output on `ooxml_element` / `ooxml_type` and the dedicated tool now stamp every note with `[confirmed]` / `[refined]` / `[contradicted]` / `[not_reproducible]` / `[unverified]` and surface the linked observation finding inline. Users can see at a glance which claims have ground truth behind them.

Seeds prod with the four observations recorded today so the verification layer ships with real data.

Tests: 99 pass (was 93). New test file `ooxml-word-behavior.test.ts` covers the new tool and verification badges.
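
As a rough illustration of the badge stamping described above, the sketch below renders a note with its verification status. The five badge values come from the commit message; the `renderNote` name, the `LinkedObservation` shape, and the exact output layout are hypothetical, and the sketch handles at most one linked observation per claim.

```typescript
type VerificationStatus =
  | "confirmed"
  | "refined"
  | "contradicted"
  | "not_reproducible";

// Hypothetical view of one behavior_note_observations link.
interface LinkedObservation {
  status: VerificationStatus;
  finding: string;
}

// A claim with no linked observation is stamped [unverified]; otherwise the
// link's status becomes the badge and the finding is shown beneath the claim.
function renderNote(claimText: string, link?: LinkedObservation): string {
  if (!link) return `[unverified] ${claimText}`;
  return `[${link.status}] ${claimText}\n  observed: ${link.finding}`;
}

console.log(renderNote("Example documented claim"));
// [unverified] Example documented claim
```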

Two correctness fixes raised on PR #5:

- The seed script linked every observation to claim 'a' on its source page. Two of the four observations are actually about claim 'c' (cols/num and trHeight val=0), so prod was tagging the wrong sub-claims [confirmed]/[refined] while leaving the actually-tested claims [unverified]. Each link entry now carries an explicit claimLabel, and the seed deletes prior links per observation before re-inserting so a corrected re-seed cleans up stale rows.
- ooxml_word_behavior applied the status filter AFTER the LIMIT pulled the latest N observations. If the matching confirmed/refined rows were older than that window, status=confirmed could return empty incorrectly. Moved the filter into the SQL via EXISTS so it runs before ORDER BY ... LIMIT.

Re-seeded prod: cols-test now tags claim c (Word's "num must match col count" rule), rowheight-val-zero tags claim c (the val!=0 with hRule=exact rule). Verified the badges land on the right claims.

Tests: 100 pass (was 99). New regression test inserts noise observations after the seeded ones and asserts status=confirmed limit=2 still finds the older confirmed observation.
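
The filter-after-LIMIT bug is easy to reproduce in memory. The sketch below is an array-based analogue, not the project's SQL; the `Observation` shape and both function names are hypothetical. It mirrors the regression test's scenario: newer noise rows push an older confirmed observation out of the window.

```typescript
interface Observation {
  id: number;
  observedAt: number; // larger = more recent
  statuses: string[]; // statuses of linked claims, possibly empty
}

const newestFirst = (a: Observation, b: Observation) => b.observedAt - a.observedAt;

// Buggy order: take the latest N rows, then filter by status.
function limitThenFilter(rows: Observation[], status: string, limit: number): Observation[] {
  return [...rows].sort(newestFirst).slice(0, limit).filter((r) => r.statuses.includes(status));
}

// Fixed order: filter first, then sort and take N.
function filterThenLimit(rows: Observation[], status: string, limit: number): Observation[] {
  return rows.filter((r) => r.statuses.includes(status)).sort(newestFirst).slice(0, limit);
}

const rows: Observation[] = [
  { id: 1, observedAt: 1, statuses: ["confirmed"] }, // older, confirmed
  { id: 2, observedAt: 2, statuses: [] }, // newer noise
  { id: 3, observedAt: 3, statuses: [] }, // newer noise
];

console.log(limitThenFilter(rows, "confirmed", 2).map((r) => r.id)); // []
console.log(filterThenLimit(rows, "confirmed", 2).map((r) => r.id)); // [ 1 ]
```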

caiopizzol added 4 commits April 28, 2026 15:11

caio-pizzol wants to merge 4 commits into main from caio/ms-oi29500-behavior-notes

2 participants
