feat(mcp): MS-OI29500 implementation notes ingest + ooxml_behavior tool#5
Open
caio-pizzol wants to merge 4 commits into main
Adds Word/Office implementation behavior to the MCP. Each MS-OI29500 implementation note is parsed from Microsoft Learn's native markdown into a behavior_notes row, attached to a schema symbol when one fits. The new ooxml_behavior tool exposes them by qname, section_id, source_anchor, free-text, app, or claim_type; ooxml_element / ooxml_type also inline-render notes when a top-level symbol has any.
Why: ECMA-376 says what the spec allows, the XSD graph says what's structurally legal, but neither captures Word's actual behavior. MS-OI29500 fills that gap with ~3,667 'spec says X / Word does Y' claims across ~1,640 implementation pages.
- Markdown via ?accept=text/markdown is the parser input. First-party, free, structurally clean — beat Firecrawl and Jina Reader in a head-to-head bakeoff. The pinned per-doc PDF sha256 backs citation provenance.
- ~24% of rows attach to a schema symbol; ~76% carry a target_ref because the source vocabulary is not in our XSD set (SML/PML/VML), the name is genuinely ambiguous, or the entry is a field code / overview. Resolver is conservative — no wrong attachments by design.
- Per-row source_commit (page git_commit_id from the markdown frontmatter) makes a re-ingest reproducible from data/sources.json.
- Renames scripts/ingest-{pdf,xsd}/ to ingest-ecma-376-{pdfs,xsds}/ for parity with the new ingest-ms-oi29500/ directory; folder names now answer 'what is being ingested?'.
Verified: bun run ms:ingest → 3,667 rows on prod (idempotent on re-run); 93 tests pass.
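The conservative attachment rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/ingest-ms-oi29500/resolve.ts`: the `SchemaSymbol` and `Resolution` shapes and the `resolveConservatively` name are hypothetical, and the sketch keeps the raw name on every row, which matches the later fix on this PR rather than the first commit.

```typescript
// Hypothetical shapes for illustration; the real resolver may model symbols differently.
interface SchemaSymbol {
  id: number;
  vocabulary: string; // which XSD vocabulary the symbol belongs to
  localName: string;
  topLevel: boolean;
}

interface Resolution {
  symbolId: number | null; // set only when exactly one candidate fits
  targetRef: string; // raw name from the note, kept for fallback lookup
}

function resolveConservatively(name: string, symbols: SchemaSymbol[]): Resolution {
  // Only top-level symbols with an exact local-name match are candidates.
  const candidates = symbols.filter((s) => s.topLevel && s.localName === name);
  const first = candidates[0];
  // Attach on a unique match only; zero or several candidates stay unattached.
  const symbolId = candidates.length === 1 && first !== undefined ? first.id : null;
  return { symbolId, targetRef: name };
}
```

Under these assumptions an ambiguous name yields `symbolId: null` rather than a guess, which is the trade-off the resolver makes: fewer attachments, but none that are wrong.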
Five fixes raised on PR #5:
- Always populate target_ref, even on resolved rows. Migration 0006 changed symbol_id FK to ON DELETE SET NULL, so a future xsd:ingest can null out symbol_id on attached rows. Without target_ref to fall back on, those notes become unreachable via ooxml_behavior(qname=...) until a full ms:ingest reruns. The qname word-boundary fallback now has something to match.
- Accept x: (SpreadsheetML) and p: (PresentationML) prefixes in parseQName. ~30% of behavior_notes are SML/PML target_ref-only; users could not query them by qname before because the parser rejected the prefix.
- Make Excel/PowerPoint variants consistent in the claim-type classifier. layout_behavior and requires_despite_optional regexes now include all four apps; ignores/writes already did.
- Filter symbol resolution by profile in fetchBehaviorNotes. Other queries already scope to p.name='transitional'; this one was missing the join, which would let notes leak across profiles once non-transitional profiles exist.
- Replace em dashes with hyphens in user-facing strings per the global style rule.

Re-ingested prod: 3,667 rows; with_target_ref now 3,667 (was 2,799).
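As a rough illustration of the prefix fix, a qname parser that accepts the SpreadsheetML and PresentationML prefixes might look like the sketch below. The `KNOWN_PREFIXES` set, the `QName` shape, and the `parseQNameSketch` name are assumptions made for this example; the actual `parseQName` may accept more prefixes and return a different structure.

```typescript
// Assumed prefix set: the PR states only that x: and p: are now accepted; w: is assumed here.
const KNOWN_PREFIXES = new Set<string>(["w", "x", "p"]);

interface QName {
  prefix: string;
  localName: string;
}

function parseQNameSketch(input: string): QName | null {
  // prefix:localName, where both parts start with a letter (or underscore for the local name).
  const match = /^([A-Za-z][A-Za-z0-9]*):([A-Za-z_][\w.-]*)$/.exec(input.trim());
  if (match === null) return null;
  const prefix = match[1];
  const localName = match[2];
  if (prefix === undefined || localName === undefined) return null;
  // Reject prefixes outside the known vocabularies instead of guessing.
  if (!KNOWN_PREFIXES.has(prefix)) return null;
  return { prefix, localName };
}
```

With a parser of this shape, a query such as `x:sheetData` reaches the target_ref fallback instead of being rejected before any lookup happens.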
…layer

Reframes the MS-OI29500 surface as 'documented, not verified' and adds the ground-truth verification layer that Phase 4 dogfooding showed was needed.

Renames `ooxml_behavior` to `ooxml_implementation_notes` so the name doesn't claim authority it doesn't have. The tool description now leads with 'Microsoft-documented Office implementation notes... NOT necessarily verified against the live Word binary.'

Adds three tables (migration 0009) and a new MCP tool to back the rename:
- `word_fixtures`: one row per authored .docx, with sha256, generator script, and Word version so observations are reproducible.
- `word_observations`: per-fixture findings with before/after XML fragments. The XML diff is the proof of the finding.
- `behavior_note_observations`: join table linking docs claims to observations with a status (confirmed / refined / contradicted / not_reproducible). 'Refined' is the important new one: a documented claim that's directionally correct but glosses over Word's actual repair path.
- `ooxml_word_behavior` MCP tool: filter observations by fixture name, scenario, free-text query, or status.

Why: Phase 4 dogfooding ran four MS-OI29500 claims through real Word fixtures via the Word API MCP. One was contradicted (Word writes `<w:b/>` for cs/rtl runs even though the doc implies it shouldn't), one was refined (Word silently strips `<w:trHeight w:val=0 w:hRule=exact/>` rather than enforcing the constraint as the doc suggests), and two were confirmed. That signal told us the docs alone overpromise.

Inline output on `ooxml_element` / `ooxml_type` and the dedicated tool now stamp every note with `[confirmed]` / `[refined]` / `[contradicted]` / `[not_reproducible]` / `[unverified]` and surface the linked observation finding inline. Users can see at a glance which claims have ground truth behind them.

Seeds prod with the four observations recorded today so the verification layer ships with real data.

Tests: 99 pass (was 93). New test file `ooxml-word-behavior.test.ts` covers the new tool and verification badges.
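The badge stamping described in this commit could be sketched along these lines. The `NoteLink` shape and the `badgeFor` helper are hypothetical names for illustration, and picking the first link when a note has several is an assumption, not something the commit states.

```typescript
// The four link statuses named in the commit; "unverified" is the absence of a link.
type LinkStatus = "confirmed" | "refined" | "contradicted" | "not_reproducible";

// Hypothetical in-memory view of a behavior_note_observations row.
interface NoteLink {
  noteId: number;
  status: LinkStatus;
  finding: string; // short text of the linked observation
}

function badgeFor(noteId: number, links: NoteLink[]): string {
  // Assumption: the first matching link wins when a note has several.
  const link = links.find((l) => l.noteId === noteId);
  return link === undefined ? "[unverified]" : `[${link.status}]`;
}
```

A note with no row in the join table falls through to `[unverified]`, so the default makes no claim about ground truth.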
Two correctness fixes raised on PR #5:
- The seed script linked every observation to claim 'a' on its source page. Two of the four observations are actually about claim 'c' (cols/num and trHeight val=0), so prod was tagging the wrong sub-claims [confirmed]/[refined] while leaving the actually-tested claims [unverified]. Each link entry now carries an explicit claimLabel, and the seed deletes prior links per observation before re-inserting so a corrected re-seed cleans up stale rows.
- ooxml_word_behavior applied the status filter AFTER the LIMIT pulled the latest N observations. If the matching confirmed/refined rows were older than that window, status=confirmed could return empty incorrectly. Moved the filter into the SQL via EXISTS so it runs before ORDER BY ... LIMIT.

Re-seeded prod: cols-test now tags claim c (Word's "num must match col count" rule), rowheight-val-zero tags claim c (the val!=0 with hRule=exact rule). Verified the badges land on the right claims.

Tests: 100 pass (was 99). New regression test inserts noise observations after the seeded ones and asserts status=confirmed limit=2 still finds the older confirmed observation.
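The filter-ordering bug is easiest to see on an in-memory list. The sketch below is an illustration only: the real fix lives in SQL (an EXISTS clause ahead of ORDER BY ... LIMIT), and the `Observation` shape and both function names are hypothetical.

```typescript
// Hypothetical in-memory stand-in for a word_observations row plus its link statuses.
interface Observation {
  id: number;
  createdAt: number; // larger means more recent
  statuses: string[];
}

// Buggy order: take the latest N first, then filter. Older matches fall outside the window.
function filterAfterLimit(rows: Observation[], status: string, limit: number): Observation[] {
  return [...rows]
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit)
    .filter((o) => o.statuses.includes(status));
}

// Fixed order: filter first, then take the latest N of the matching rows.
function filterBeforeLimit(rows: Observation[], status: string, limit: number): Observation[] {
  return rows
    .filter((o) => o.statuses.includes(status))
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
}
```

With one old confirmed observation followed by two newer unlinked ones and `limit = 2`, the first function returns nothing while the second still returns the confirmed row, which mirrors the scenario the regression test describes.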
Summary
Adds Word/Office implementation behavior to the MCP. Each MS-OI29500 implementation note is parsed from Microsoft Learn's native markdown into a `behavior_notes` row, attached to a schema symbol when one fits. The new `ooxml_behavior` tool exposes them by qname, section, source anchor, free text, app, or claim type. `ooxml_element` and `ooxml_type` also inline-render notes when a top-level symbol has any.

ECMA-376 says what the spec allows, the XSD graph says what's structurally legal, but neither captures Word's actual behavior. MS-OI29500 fills that gap with ~3,667 "spec says X / Word does Y" claims across ~1,640 implementation pages.

- Markdown from Microsoft Learn (`?accept=text/markdown`) is the parser input: first-party, free, structurally clean; beat Firecrawl and Jina Reader in a head-to-head bakeoff. The pinned per-doc PDF sha256 backs citation provenance.
- ~24% of rows attach to a schema symbol; ~76% carry a `target_ref` because the source vocabulary isn't in our XSD set (SML/PML/VML), the name is genuinely ambiguous, or the entry describes a field code / overview. Resolver is conservative: no wrong attachments by design.
- Per-row `source_commit` (page `git_commit_id` from the markdown frontmatter) makes a re-ingest reproducible from `data/sources.json`.
- Renames `scripts/ingest-{pdf,xsd}/` to `scripts/ingest-ecma-376-{pdfs,xsds}/` for parity with the new `scripts/ingest-ms-oi29500/`; folder names now answer "what is being ingested?".

Test plan
- `bun run test`: 93 pass, 0 fail (db 8, ingest-ecma-376-xsds 13, ingest-ms-oi29500 37, mcp-server 35).
- `bun run lint`: clean on new files (one pre-existing unused-import warning unchanged).
- `bun run ms:ingest` against prod: 3,667 rows inserted; same count on idempotent re-run.
- `ooxml_type w:ST_Jc`; `ooxml_behavior` filters by qname / section_id / source_anchor / query / app / claim_type.

Review: pay attention to the resolver's conservative attachment rules in `scripts/ingest-ms-oi29500/resolve.ts` (top-level vs local, single-vocab vs multi-vocab DML disambiguation), the natural-key index in migration 0007 (the partial unique index uses `claim_index` because `claim_label` collides for multi-bullet claims), and the qname word-boundary regex in `apps/mcp-server/src/ooxml-queries.ts:fetchBehaviorNotes` (substring `target_ref ILIKE` would have let `qname=tbl` pull in `tblPr` notes; `~` with `\W`-style boundaries fixes that). The migration files (0006, 0007, 0008) have already been applied to prod and are idempotent.
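To make the word-boundary point concrete, here is a small sketch of the same idea in TypeScript. The production query uses a Postgres regex match in `fetchBehaviorNotes`; the helper names below are hypothetical and the pattern is an approximation of that rule, not a copy of it.

```typescript
// Escape regex metacharacters so a local name is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when localName appears in targetRef bounded by a non-word character or a string edge.
function matchesLocalName(targetRef: string, localName: string): boolean {
  const pattern = new RegExp(`(^|\\W)${escapeRegExp(localName)}($|\\W)`);
  return pattern.test(targetRef);
}

// Naive substring check, shown only for contrast with the bounded match.
function matchesSubstring(targetRef: string, localName: string): boolean {
  return targetRef.toLowerCase().includes(localName.toLowerCase());
}
```

For a `target_ref` of `w:tblPr`, the substring check accepts `tbl` while the bounded match rejects it, because `tbl` is followed by the word character `P` rather than a boundary.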