Skip to content

Conversation

@lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Dec 8, 2025

Closes #714

Why is this the best possible solution? Were any other approaches considered?

For description of approach see commit messages (mainly 64a2a37). PR also includes a couple of minor tidy-ups.

An alternative might be to pass through context info such as row number to the SurveyElement / Survey classes, or assume the sheet name from the class type (e.g. Questions come from the survey sheet), or the column name from the attribute (e.g. Question.hint comes from hint (per language though)) . The main problem with that is that it would significantly change the internal data structure expected to pass through builder.py which would affect Ona who uses that as a storage format. Also generally have been trying to move validation (especially user-facing) out of SurveyElement into xls2json where the source data context is readily available, and to separate that validation from the transformations (to XML).

What are the regression risks?

The tests for the main #714 related changes are mainly additive so should be minimal. In 0a7bb72 the expressions lexer/parser was replaced (see commit message for details) but a test suite for the old one was added in 8b41399 and the new one meets that plus all other tests - so that approach should mitigate regression risk.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

- this error is caught by survey.py normally but in that scope it is
  not possible to provide a user-friendly error message with the source
  of the problem i.e. sheet name, column name, and row number.
- overall changes:
  - move string validation func `clean_text_values` from xls2json
    to sheet_headers, which (via dealias_and_group_headers) is already
    iterating over the sheet data (to avoid introducing an extra
    iteration over the sheet data) before the main round of processing
    in workbook_to_json.
  - move pyxform ref validation the end of `workbook_to_json` so that
    the (mostly final) list of question names can be used for the
    validation (excludes elements added for `loops` which is done in
    `builder.py` and refactoring that may be significant)
  - try to reduce the amount of re-parsing, or parsing for no reason,
    by adding some selectivity to in pyxform_reference.py. For example
    references don't work in external_choices at all, and aren't
    resolved in a useful way for extra (choice filter) columns (i.e.
    the reference goes into element text with no `output` wrapper).
- added tests to the main test modules for each validated sheet.
- move "toParseString" into func signature instead of iterating through
  kwargs to find it for special handling
- similarly set "tag" via the first named arg rather than slicing args
  - ideally the same sort of thing would be done for the "unicode_args"
    by adding a "text: str" kwarg, but that would require a larger
    refactor since the text arg isn't necessarily always the second arg.
- in python 3.13.10 a validation was added to re.Scanner to raise an
  error if a re.Scanner lexicon included a rule with more than one
  capturing group. This error was reverted in 3.13.11 but the relevant
  PRs/issues indicated that it will be permanent in python 3.14. The
  lexer replaced here was using multiple capturing groups but probably
  should have been avoiding named groups altogether since that's part of
  how re.Scanner identifies tokens. Also while researching the problem
  it seemed like the cpython developers are not supportive of anyone
  using re.Scanner. So this commit moves the lexing over to an
  equivalent `lark` grammar, which coincidentally is compiled into regex
  anyway. The grammar could be improved with rules to replace a) some
  compound terminals/tokens (e.g. FUNC_CALL) and b) external parsing
  functions (e.g. in instance_expression.py and pyxform_reference.py).
  - lark was selected since it seems to be a reasonably robust and
    actively supported library, with minimal dependencies (as of 1.3.1
    there are none), with adequate features, and decent performance.
- L791 does a str() conversion so question_name can never be bytes
@lindsay-stevens
Copy link
Contributor Author

lindsay-stevens commented Dec 10, 2025

@lognaturel this PR will probably need to be merged before any other PRs since the change in 0a7bb72 (see commit message for background) resolves a CI test failure. The commit 271e6ff failed on the "verify" actions workflow run on 2025-12-09, specifically the job "test (3.13, macos-latest, false)" failed (see here) because that macos job ran on py3.13.10 - the other 3.13 passed because they ran with py3.13.9. The commits for the lexer change could probably be broken off into a separate PR if anything about the changes for #714 seems like it needs discussion/revision.

Copy link
Contributor

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I carefully looked at every change at 15ff687 but test coverage looks sufficient. I think our best course of action is merging and seeing if anyone reports any issues.

This will be a really nice change, I just ran into the issue again a couple days ago.

@lognaturel lognaturel merged commit fa24bf3 into XLSForm:master Dec 15, 2025
14 checks passed
@lindsay-stevens lindsay-stevens deleted the pyxform-714 branch December 16, 2025 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Indicate which sheet a ${} replacement error comes from

2 participants