Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7511 +/- ##
=======================================
Coverage 99.61% 99.61%
=======================================
Files 283 285 +2
Lines 11877 11947 +70
Branches 2898 2914 +16
=======================================
+ Hits 11831 11901 +70
Misses 46 46 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This pull request normalizes quote selector creation and anchoring for both HTML and PDF documents to ensure that stored selectors match what users see in rendered text. The key change is that line breaks (from <br> tags and block elements in HTML, or newlines in PDF) are now converted to spaces, and consecutive whitespace is collapsed to single spaces. This prevents issues where text like <p>foo<br>bar</p> was previously stored as "foobar" but is now correctly stored as "foo bar" to match the visual rendering.
Changes:
- Introduced
rendered-text.tsmodule that builds normalized text from HTML DOM with offset mappings between raw and normalized positions - Updated
TextQuoteAnchorto use normalized text when creating and matching selectors - Applied consistent PDF text normalization in selector creation and anchoring
- Normalized quote display in the UI component to match the stored format
- Updated test fixtures and baselines to reflect normalized selector output
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/annotator/anchoring/rendered-text.ts | New module providing HTML text normalization with offset mapping for converting between raw and normalized positions |
| src/annotator/anchoring/types.ts | Updated TextQuoteAnchor to use normalized text for selector creation and matching |
| src/annotator/anchoring/pdf.ts | Applied consistent PDF text normalization in describe() and anchor() paths |
| src/sidebar/components/Annotation/AnnotationQuote.tsx | Normalized quote display in UI to match stored format |
| src/annotator/anchoring/test/rendered-text-test.js | New tests for the rendered-text normalization module |
| src/annotator/anchoring/test/types-test.js | Updated test expectations to match normalized selector format and relaxed some assertions |
| src/annotator/anchoring/test/pdf-test.js | Updated test expectations and relaxed some assertions to accommodate normalization |
| src/annotator/anchoring/test/html-test.js | Added normalization helpers and updated tests to compare normalized selectors |
| src/annotator/anchoring/test/html-baselines/wikipedia-regression-testing.json | Updated baseline expectations with normalized prefix/suffix values |
| src/annotator/anchoring/test/html-baselines/minimal.json | Updated baseline expectations with normalized prefix/suffix values |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * Rendered/normalized text of the root: collapsed whitespace, with a single | ||
| * space inserted at each `<br>` and block-tag boundary. |
There was a problem hiding this comment.
Don't collapse whitespace or block-level tags. Preserve the same behavior as root.textContent, but add a space for <br> elements
| // whitespace differences when matching, and keeping them raw preserves | ||
| // backward compatibility with selectors stored before the rendered-text | ||
| // normalization landed. | ||
| const prefix = rawText.slice(Math.max(0, rawStart - contextLen), rawStart); |
There was a problem hiding this comment.
We should normalize this
Summary
Normalize quote selectors so we store and match what users see.
<p>foo<br>bar</p>becomes"foo bar", not"foobar". Prefix/suffix are normalized too.Details
HTML (
TextQuoteAnchor)DOM walk produces rendered text (spaces at
<br>and block boundaries, whitespace collapsed) plus forward/reverse offset maps between rawtextContentand rendered text.fromRangeuses the maps to sliceexact/prefix/suffixfrom rendered text.toPositionAnchormatches against rendered text and maps offsets back to raw DOM coordinates forTextPositionSelector.Why use explicit maps instead of reusing
translateOffsetstranslateOffsetsaligns by counting non-whitespace chars, which works for PDF where both strings have the same characters with different spacing. Our rendered text contains synthesized characters (the space at a<br>has no source intextContent); counting can't align those, and usingtranslateOffsetsshifted real anchors by one character. The maps record the correspondence during the walk, so synthesized characters are tracked correctly.PDF
Selectors and page text normalized via
normalizePDFText; match offsets translated back viatranslateOffsets(no synthesized characters here, so it's the right tool).isSpace/isNotSpacelifted intoutil/normalize.tsto share with HTML. Redundant[\r\n]+innormalizePDFTextdropped (\s+covers it).