-
-
Notifications
You must be signed in to change notification settings - Fork 17
Remove RTL markers from verse token data #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #378 +/- ##
=======================================
Coverage 72.72% 72.73%
=======================================
Files 423 423
Lines 36021 36029 +8
Branches 4968 4969 +1
=======================================
+ Hits 26197 26205 +8
Misses 8733 8733
Partials 1091 1091 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit reviewed 3 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):
null, null, SanitizeVerseData(GetNextWord(usfm, ref index, preserveWhitespace))
I don't know if this is the right place to sanitize the verse data. It might be better to do this in UsfmParser when we create the verse ref.
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):
private static string SanitizeVerseData(string verseData) { return verseData.Replace("", "");
I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e. \u200f.
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't know if this is the right place to sanitize the verse data. It might be better to do this in
UsfmParserwhen we create the verse ref.
Generally, the tokenizer should try to preserve the original USFM as much as possible.
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Generally, the tokenizer should try to preserve the original USFM as much as possible.
That makes sense. The problem is that I need to edit the token data itself. I considered, like you said, sanitizing it in the UsfmParser but the UsfmParser doesn't currently do anything like that (the state token etc are all read-only). I could create a new method or change an access modifier to enable updating the token data . This could also be used, for example, to not carry through non-Latin verse numbers when updating.
The other option is to leave it as-is on the parsing side and instead sanitize it when we do the updating. I've pushed a commit with this option for reference. I could see a similar case being made that even the parser should leave the token untouched. Whatever you think is best!
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 made 1 comment.
Reviewable status: 1 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 568 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I assume that there is an RTL mark character in the string. It would be helpful to specify the character as a Unicode character literal, i.e.
\u200f.
Done.
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are RTL marks handled properly in text corpora? If so, then this .
@ddaspit reviewed 3 files and all commit messages, made 2 comments, and resolved 2 discussions.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
That makes sense. The problem is that I need to edit the token data itself. I considered, like you said, sanitizing it in the
UsfmParserbut theUsfmParserdoesn't currently do anything like that (the state token etc are all read-only). I could create a new method or change an access modifier to enable updating the token data . This could also be used, for example, to not carry through non-Latin verse numbers when updating.The other option is to leave it as-is on the parsing side and instead sanitize it when we do the updating. I've pushed a commit with this option for reference. I could see a similar case being made that even the parser should leave the token untouched. Whatever you think is best!
The best place to sanitize the data is in the handler. I just wasn't sure how feasible it would be.
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate?
@Enkidu93 made 2 comments.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmTokenizer.cs line 178 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
The best place to sanitize the data is in the handler. I just wasn't sure how feasible it would be.
Perfect 👌
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you have a test that checks UsfmMemoryText. Can you use the unicode literal for the RTL mark in the test? It would make it clearer that you are testing the RTL marks.
@ddaspit made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. Done.
@Enkidu93 made 1 comment.
Reviewable status: 3 of 4 files reviewed, all discussions resolved (waiting on @ddaspit).
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
Use ignore line endings Move verse token data sanitization to updater Add unicode literal
75a401d to
83931a0
Compare
This was the cleanest place to make the edit, but I know you don't like to edit these classes if you don't have to. I'm open to alternatives but the token data itself needs to be sanitized in order for this to work when updating usfm. Also, added tests.
Fixes sillsdev/machine.py#250. This will need to be ported.
This change is