v1.8.0 updates: improve parser.c size, fix dotted statements edge cases and nested #IF 0 edge cases #41

Open
hkimura-intersys wants to merge 4 commits into intersystems:main from hkimura-intersys:v1.8.0-updates

@hkimura-intersys
Collaborator

Overview

The key changes from this PR include:

  1. Reduced all parser.c file sizes by over 30 MB (see the "Parser.c size decrease" block below for details).
  2. Fixed dotted statements edge cases (see the "Dotted Statements Fix" block below for details).
  3. Fixed the #IF 0 edge case (see the "IF 0 edge case fix" block below for details).
  4. Pinned windows-2022 in the CI workflow, because windows-latest contains a vscode image that doesn't work with the Node versions that work with tree-sitter (they must be <24).

Testing

First, all test cases pass here (see workflow).
Additionally, I tested this on all .rtn and .mac files from //projects/sql/databases/sys/rtn/routine/ and //projects/sql/databases/sys/rtn/sql/. These tests are not included in this repo because the files are not open source; I ran them as part of my local testing.

Detailed Explanation of Changes

Parser.c size decrease

I was able to decrease the parser.c size mainly by:

  • creating reusable functions and calling those from nodes (ex: _statements_block, build_legacy_version).
  • scanner adjustments (see the other scanner changes block below)
  • removing unreachable nodes

All of these efforts were done in the core grammar, so there is still room for improvement for the udl grammar, specifically the keywords.js file. Although the parser.c size isn't notably larger in udl, the wasm file is, so that would be a good thing to optimize next.

Dotted Statements Fix (multiline statements and tags)

My first fix was correctly parsing dotted statements that included multiline commands, such as:
image

Previously, the scanner expected each dotted statement to be confined to a single line.

Additionally, this now correctly parses dotted statements with tags. It is valid to have a tag before a dotted statement, like this:
image

Lastly, this can now parse dotted statements that are not part of a do command. Statements in such a block never actually run, but since the code compiles, I decided to add support for it.
ex:
. w "hi"

I decided against treating this as a comment even though it doesn't run, because the compiler requires the content to be valid statements after the dot.
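
To make the three line shapes above concrete, here is a minimal Python sketch that splits one line into an optional tag, a dot count, and the statement that follows. It is only a model for illustration, not the scanner's logic: the regular expression, the tag character set, the allowance for space-separated dots, and the name split_dotted_line are all assumptions.

```python
import re

# Assumed line shape (illustrative only, not the real scanner rules):
#   [tag + whitespace] <dots, optionally separated by spaces> <statement>
DOTTED_LINE = re.compile(
    r"^(?:(?P<tag>[%A-Za-z][A-Za-z0-9]*)[ \t]+|[ \t]*)"  # optional tag, then whitespace
    r"(?P<dots>\.(?:[ \t]*\.)*)"                          # one or more dots
    r"[ \t]*(?P<rest>.*)$"                                # statement after the dots
)


def split_dotted_line(line):
    """Return (tag, dot_count, statement), or None if the line has no leading dots."""
    match = DOTTED_LINE.match(line)
    if match is None:
        return None
    return match.group("tag"), match.group("dots").count("."), match.group("rest")


print(split_dotted_line('. w "hi"'))       # dotted statement with no tag
print(split_dotted_line("tag . . s x=1"))  # tag followed by two dots
print(split_dotted_line(' w "hi"'))        # not a dotted statement
```

A multiline command inside a dotted block would simply produce several consecutive lines with the same dot count, which is why a single-line assumption breaks down.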

IF 0 edge case fix

There is a special edge case of #IF: an #IF 0 doesn't actually need an #ENDIF, because it is valid to terminate it with #ELSE. However, this does not apply to nested #IF 0 scenarios. If there is an #IF 0 within an #IF 0, the nested #IF 0 still requires an #ENDIF. This rule has been added, along with tests that reflect it.

Example (note that everything within the outer #IF 0 is treated as a comment, as it will never be reached):
image
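
The nesting rule can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the scanner code: it works on whole lines, matches directives case-insensitively, tracks only nested #IF 0 blocks (other nested directives are ignored), and the name skipped_region_end is hypothetical.

```python
import re

# Assumption: directives start a line (after optional whitespace) and are case-insensitive.
IF0 = re.compile(r"^\s*#if\s+0\b", re.IGNORECASE)
ELSE = re.compile(r"^\s*#else\b", re.IGNORECASE)
ENDIF = re.compile(r"^\s*#endif\b", re.IGNORECASE)


def skipped_region_end(lines):
    """Return the index of the directive that ends the outer #IF 0 block.

    lines[0] must be the outer #IF 0. Returns None if the block never ends.
    """
    if not lines or not IF0.match(lines[0]):
        raise ValueError("expected the first line to be '#IF 0'")
    nested = 0  # nested #IF 0 blocks still waiting for their #ENDIF
    for index, line in enumerate(lines[1:], start=1):
        if IF0.match(line):
            nested += 1
        elif ENDIF.match(line):
            if nested == 0:
                return index  # #ENDIF closes the outer block
            nested -= 1       # #ENDIF closes a nested block
        elif ELSE.match(line) and nested == 0:
            return index      # only the outer #IF 0 may end at #ELSE
    return None


source = [
    "#IF 0",
    " w 1",
    "#IF 0",
    " w 2",
    "#ELSE",   # belongs to the nested #IF 0, so it does not end the outer block
    " w 3",
    "#ENDIF",  # required to close the nested #IF 0
    "#ELSE",   # ends the outer #IF 0
    " w 4",
]
print(skipped_region_end(source))
```

In this model the inner #ELSE is skipped over because a nested block is still open, and only the #ELSE that follows the nested #ENDIF ends the outer region.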

Other Scanner Changes

I rewrote all of the logic for _termination, _argumentless_command_end, BOL, _immediate_single_whitespace.., _argumentless_loop, because these scenarios really should be evaluated together, and there were so many separate blocks that it was hard to trace the flow. With this rewrite, I also decided to change a few things:

  1. When terminating a single statement, the scanner now looks for the _statement_termination token. Before, I allowed either _argumentless_command_end or _termination, but that overcomplicated things and sometimes led to newlines and whitespace being wrongly consumed. In all scenarios, _statement_termination is a zero-width token.
  2. The BOL check now also evaluates tags, since it is valid to have a tag as part of a dotted statement.
  3. Rather than sometimes wrongly consuming comments as whitespace, there is a new token INLINE_COMMENT that represents a comment within a command, specifically a comment that is usually between the if expression and the statements block that follows.
  4. BOL_EXTRA represents the dots of a dotted statement that is not part of a do command.
  5. Removed _XECUTE_ARG_INVALID and ZBREAK_DEVICE_TERMINATION, as they didn't add anything but complexity.
  6. For argumentless commands, a comment counts as a termination, so the _statement_termination token is returned.
  7. The special #IF 0 #else case is only considered for outer #IF 0 cases, as nested #IF 0 cases still require #ENDIF.
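
As an illustration of what "zero width" means for the argumentless case, the Python sketch below decides whether a termination is present by looking ahead without consuming anything. It is a simplified model rather than the scanner's implementation: it covers only the end-of-line and comment cases, the comment markers listed are an assumed subset, and the function name is hypothetical.

```python
# Assumed subset of comment markers; the real scanner may recognize more.
COMMENT_STARTS = ("//", ";", "/*")


def terminated_by_comment_or_eol(text, pos):
    """Report whether an argumentless command ending at `pos` is terminated
    by end of input, a newline, or a comment.

    The check is pure lookahead: `pos` is never advanced, which mirrors a
    zero-width token that leaves whitespace and comments for later tokens.
    """
    rest = text[pos:].lstrip(" \t")  # look past spaces without consuming them
    return rest == "" or rest.startswith("\n") or rest.startswith(COMMENT_STARTS)


line = "quit  // done"
print(terminated_by_comment_or_eol(line, 4))      # comment follows the command
print(terminated_by_comment_or_eol("quit:x", 4))  # a postconditional follows instead
```

Because nothing is consumed, the comment is still available afterwards to be matched as its own token instead of being swallowed as whitespace.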

…nts with blocks, and dotted statements with tags

dotted statement fix includes:
- ability to parse multiline blocks (ex: if blocks within dotted statements);
- ability to parse dotted statements that are part of tags, as it is valid
  for a dotted statement to come after a tag
- ability to parse dotted statements that are not part of a do command, this is
  valid and compiles in iris even though it doesn't actually run.

other scanner changes include:
- a comment is a statement termination for argumentless commands
- the special #IF 0 #else case is only considered for outer #IF 0 cases, as nested #IF 0
  cases still require #ENDIF
- fixed parsing dotted statements within #IF cases
- separated argumentless statement termination tokens from termination tokens. _statement_termination is used for
  a singular statement, and _termination is used to terminate something that could have multiple statements, like
  an if command or dotted statement.

other changes:
- rewrote many parts of the grammar, so the parser.c size has now decreased from 75MB to 44MB.