-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Python: Update tree-sitter dependency
#19929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updates the Python extractor to depend on version 0.24.7 of tree-sitter (and 0.12.0 of tree-sitter-graph). A few changes were needed in order to make the code build and run after updating the dependencies: - In `main.rs`, the `Language` parameter is now passed as a reference. - In `python.tsg`, many queries had captures that were not actually used in the body of the stanza. This is no longer allowed (unless the captures start with an underscore), as it may indicate an error. To fix this, I added underscores in the appropriate places (and verified that none of these unused captures were in fact bugs).
It seems that with a newer version of tree-sitter, we no longer parse the (not actually valid!) syntax `Spam[**P2]` as if the `**` is an exponentiation operation (with a missing left operand).
Rather than relying on matching arbitrary nodes inside tree-sitter-graph and then checking whether they are of type ERROR or MISSING (which seems to have stopped working in later versions of tree-sitter), we now explicitly go through the tree-sitter tree, locating all of the error and missing nodes along the way. We then add these on to the graph output in the same format as was previously produced by tree-sitter-graph. Note that it's very likely that some of the syntax errors will move around a bit as a consequence of this change. In general, we don't expect syntax errors to have stable locations, as small changes in the grammar can cause an error to appear in a different position, even if the underlying (erroneous) code has not changed.
b90a358 to
bda5220
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the Python extractor to use newer versions of tree-sitter dependencies (0.24.7 for tree-sitter and 0.12.0 for tree-sitter-graph) to modernize the codebase and fix compatibility issues.
Key changes:
- Updated dependency versions in Cargo.toml files and build configurations
- Adapted code to new API requirements, including passing Language parameter as reference
- Implemented a new approach for syntax error detection through manual tree traversal since the old method no longer works
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| python/extractor/tsg-python/tsp/Cargo.toml | Updates tree-sitter version to 0.24.7 |
| python/extractor/tsg-python/src/main.rs | Adds manual syntax error detection and fixes API call to pass language by reference |
| python/extractor/tsg-python/python.tsg | Adds underscores to unused captures to comply with new tree-sitter-graph requirements |
| python/extractor/tsg-python/Cargo.toml | Updates both tree-sitter and tree-sitter-graph versions |
| python/extractor/tests/parser/types_new.py | Comments out invalid syntax that was causing issues |
| misc/bazel/3rdparty/py_deps/* | Updates build files and dependency configurations for new versions |
| self.nodes_to_visit | ||
| .extend((0..node.child_count()).rev().filter_map(|i| node.child(i))); |
Copilot
AI
Sep 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider collecting the children into a vector before extending to avoid multiple allocations. This can be more efficient for nodes with many children.
| self.nodes_to_visit | |
| .extend((0..node.child_count()).rev().filter_map(|i| node.child(i))); | |
| let children: Vec<_> = (0..node.child_count()).rev().filter_map(|i| node.child(i)).collect(); | |
| self.nodes_to_visit.extend(children); |
| .map(move |node| { | ||
| let start_pos = node.start_position(); | ||
| let end_pos = node.end_position(); | ||
| let text = &source[node.byte_range()]; |
Copilot
AI
Sep 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The byte range indexing could panic if the node's byte range is invalid or extends beyond the source string. Consider using source.get(node.byte_range()) and handling the None case appropriately.
| let text = &source[node.byte_range()]; | |
| let text = source.get(node.byte_range()).unwrap_or(""); |
|
The suggestions from Copilot seem reasonable. I'll add a separate commit containing them. |
|
DCA looks uneventful. Some changes in a few syntax error alerts, but that is to be expected. |
IdrissRio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the PR commit by commit. Except from a very minor comment, everything looks good. DCA is uneventful.
| # TypeAliases | ||
| type Foo2[T15, U1 = str] = Bar1[T15, U1] | ||
| type Baz2[**P2 = [int, str]] = Spam[**P2] | ||
| # type Baz2[**P2 = [int, str]] = Spam[**P2] # From the PEP, but this is not actually valid syntax! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d either remove it or update the comment to explain why it’s commented out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like "this is not actually valid syntax!" adequately explains why it's commented out -- if you uncommented it, then it would be a syntax error. 😅
Updates the Python extractor to depend on version 0.24.7 of tree-sitter (and 0.12.0 of tree-sitter-graph).
A few changes were needed in order to make the code build and run after updating the dependencies:
main.rs, theLanguageparameter is now passed as a reference.python.tsg, many queries had captures that were not actually used in the body of the stanza. This is no longer allowed (unless the captures start with an underscore), as it may indicate an error. To fix this, I added underscores in the appropriate places (and verified that none of these unused captures were in fact bugs).Finally, something in the newer version of tree-sitter/tree-sitter-graph made it so that we were no longer able to detect syntax errors correctly. We previously did this by matching (in tree-sitter-graph) against all nodes, and then checking if their node type was either
ERRORorMISSING.As this no longer works, we now instead traverse the tree in a separate pass (but only if errors are present) and record all nodes that correspond to syntax errors, and then manually add these to the graph output.
Finally, A new option
--dump-errorswas added totsg-pythonto make it easy to dump a list of the syntax errors encountered.