Remove RTL markers from verse token data #378

Enkidu93 · 2026-01-22T18:06:56Z

This was the cleanest place to make the edit, but I know you don't like to edit these classes if you don't have to. I'm open to alternatives but the token data itself needs to be sanitized in order for this to work when updating usfm. Also, added tests.

Fixes sillsdev/machine.py#250. This will need to be ported.

This change is

codecov-commenter · 2026-01-22T18:08:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.73%. Comparing base (7e848e3) to head (83931a0).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #378   +/-   ##
=======================================
  Coverage   72.72%   72.73%           
=======================================
  Files         423      423           
  Lines       36021    36029    +8     
  Branches     4968     4969    +1     
=======================================
+ Hits        26197    26205    +8     
  Misses       8733     8733           
  Partials     1091     1091

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ddaspit

@ddaspit reviewed 3 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

                                    null,
                                    null,
                                    SanitizeVerseData(GetNextWord(usfm, ref index, preserveWhitespace))

I don't know if this is the right place to sanitize the verse data. It might be better to do this in UsfmParser when we create the verse ref.

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):

        private static string SanitizeVerseData(string verseData)
        {
            return verseData.Replace("‏", "");

I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e. \u200f.

ddaspit

@ddaspit made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't know if this is the right place to sanitize the verse data. It might be better to do this in UsfmParser when we create the verse ref.

Generally, the tokenizer should try to preserve the original USFM as much as possible.

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Generally, the tokenizer should try to preserve the original USFM as much as possible.

That makes sense. The problem is that I need to edit the token data itself. I considered, like you said, sanitizing it in the UsfmParser but the UsfmParser doesn't currently do anything like that (the state token etc are all read-only). I could create a new method or change an access modifier to enable updating the token data . This could also be used, for example, to not carry through non-Latin verse numbers when updating.

The other option is to leave it as-is on the parsing side and instead sanitize it when we do the updating. I've pushed a commit with this option for reference. I could see a similar case being made that even the parser should leave the token untouched. Whatever you think is best!

Enkidu93

@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e. \u200f.

Done.

ddaspit

Are RTL marks handled properly in text corpora? If so, then this .

@ddaspit reviewed 3 files and all commit messages, made 2 comments, and resolved 2 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That makes sense. The problem is that I need to edit the token data itself. I considered, like you said, sanitizing it in the UsfmParser but the UsfmParser doesn't currently do anything like that (the state token etc are all read-only). I could create a new method or change an access modifier to enable updating the token data . This could also be used, for example, to not carry through non-Latin verse numbers when updating.

The other option is to leave it as-is on the parsing side and instead sanitize it when we do the updating. I've pushed a commit with this option for reference. I could see a similar case being made that even the parser should leave the token untouched. Whatever you think is best!

The best place to sanitize the data is in the handler. I just wasn't sure how feasible it would be.

Enkidu93

Could you elaborate?

@Enkidu93 made 2 comments.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The best place to sanitize the data is in the handler. I just wasn't sure how feasible it would be.

Perfect 👌

ddaspit

I see that you have a test that checks UsfmMemoryText. Can you use the unicode literal for the RTL mark in the test? It would make it clearer that you are testing the RTL marks.

@ddaspit made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Enkidu93

I see now. Done.

@Enkidu93 made 1 comment.
Reviewable status: 3 of 4 files reviewed, all discussions resolved (waiting on @ddaspit).

ddaspit

@ddaspit reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).

Use ignore line endings Move verse token data sanitization to updater Add unicode literal

Enkidu93 requested a review from ddaspit January 22, 2026 18:06

ddaspit requested changes Jan 22, 2026

View reviewed changes

ddaspit reviewed Jan 22, 2026

View reviewed changes

Enkidu93 commented Jan 23, 2026

View reviewed changes

Enkidu93 requested a review from ddaspit January 26, 2026 16:39

ddaspit approved these changes Jan 26, 2026

View reviewed changes

Enkidu93 commented Jan 26, 2026

View reviewed changes

Enkidu93 requested a review from ddaspit January 27, 2026 13:35

ddaspit requested changes Jan 27, 2026

View reviewed changes

Enkidu93 commented Jan 27, 2026

View reviewed changes

ddaspit approved these changes Jan 27, 2026

View reviewed changes

Remove RTL markers from verse token data; add tests

83931a0

Use ignore line endings Move verse token data sanitization to updater Add unicode literal

Enkidu93 force-pushed the strip_rtl_markers branch from 75a401d to 83931a0 Compare January 27, 2026 18:50

Enkidu93 merged commit 44d11a0 into master Jan 27, 2026
3 of 4 checks passed

Enkidu93 deleted the strip_rtl_markers branch January 27, 2026 18:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove RTL markers from verse token data #378

Remove RTL markers from verse token data #378

Uh oh!

Enkidu93 commented Jan 22, 2026 •

edited by ddaspit

Loading

Uh oh!

codecov-commenter commented Jan 22, 2026 •

edited

Loading

Uh oh!

ddaspit left a comment

Uh oh!

ddaspit left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

ddaspit left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

ddaspit left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

ddaspit left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Remove RTL markers from verse token data #378

Remove RTL markers from verse token data #378

Uh oh!

Conversation

Enkidu93 commented Jan 22, 2026 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Enkidu93 commented Jan 22, 2026 •

edited by ddaspit

Loading

codecov-commenter commented Jan 22, 2026 •

edited

Loading