Skip to content

Conversation

@tausbn
Copy link
Contributor

@tausbn tausbn commented Jun 30, 2025

Updates the Python extractor to depend on version 0.24.7 of tree-sitter (and 0.12.0 of tree-sitter-graph).

A few changes were needed in order to make the code build and run after updating the dependencies:

  • In main.rs, the Language parameter is now passed as a reference.
  • In python.tsg, many queries had captures that were not actually used in the body of the stanza. This is no longer allowed (unless the captures start with an underscore), as it may indicate an error. To fix this, I added underscores in the appropriate places (and verified that none of these unused captures were in fact bugs).

Finally, something in the newer version of tree-sitter/tree-sitter-graph made it so that we were no longer able to detect syntax errors correctly. We previously did this by matching (in tree-sitter-graph) against all nodes, and then checking if their node type was either ERROR or MISSING.

As this no longer works, we now instead traverse the tree in a separate pass (but only if errors are present) and record all nodes that correspond to syntax errors, and then manually add these to the graph output.

Finally, A new option --dump-errors was added to tsg-python to make it easy to dump a list of the syntax errors encountered.

@tausbn tausbn added the no-change-note-required This PR does not need a change note label Jun 30, 2025
Updates the Python extractor to depend on version 0.24.7 of tree-sitter
(and 0.12.0 of tree-sitter-graph).

A few changes were needed in order to make the code build and run after
updating the dependencies:

- In `main.rs`, the `Language` parameter is now passed as a reference.
- In `python.tsg`, many queries had captures that were not actually used
in the body of the stanza. This is no longer allowed (unless the
captures start with an underscore), as it may indicate an error. To fix
this, I added underscores in the appropriate places (and verified that
none of these unused captures were in fact bugs).
It seems that with a newer version of tree-sitter, we no longer parse
the (not actually valid!) syntax `Spam[**P2]` as if the `**` is an
exponentiation operation (with a missing left operand).
Rather than relying on matching arbitrary nodes inside tree-sitter-graph
and then checking whether they are of type ERROR or MISSING (which seems
to have stopped working in later versions of tree-sitter), we now
explicitly go through the tree-sitter tree, locating all of the error
and missing nodes along the way. We then add these on to the graph
output in the same format as was previously produced by
tree-sitter-graph.

Note that it's very likely that some of the syntax errors will move
around a bit as a consequence of this change. In general, we don't
expect syntax errors to have stable locations, as small changes in the
grammar can cause an error to appear in a different position, even if
the underlying (erroneous) code has not changed.
@tausbn tausbn force-pushed the tausbn/python-update-tree-sitter-dependency branch from b90a358 to bda5220 Compare September 2, 2025 12:51
@tausbn tausbn marked this pull request as ready for review September 2, 2025 12:57
@tausbn tausbn requested review from a team as code owners September 2, 2025 12:57
Copilot AI review requested due to automatic review settings September 2, 2025 12:57
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the Python extractor to use newer versions of tree-sitter dependencies (0.24.7 for tree-sitter and 0.12.0 for tree-sitter-graph) to modernize the codebase and fix compatibility issues.

Key changes:

  • Updated dependency versions in Cargo.toml files and build configurations
  • Adapted code to new API requirements, including passing Language parameter as reference
  • Implemented a new approach for syntax error detection through manual tree traversal since the old method no longer works

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python/extractor/tsg-python/tsp/Cargo.toml Updates tree-sitter version to 0.24.7
python/extractor/tsg-python/src/main.rs Adds manual syntax error detection and fixes API call to pass language by reference
python/extractor/tsg-python/python.tsg Adds underscores to unused captures to comply with new tree-sitter-graph requirements
python/extractor/tsg-python/Cargo.toml Updates both tree-sitter and tree-sitter-graph versions
python/extractor/tests/parser/types_new.py Comments out invalid syntax that was causing issues
misc/bazel/3rdparty/py_deps/* Updates build files and dependency configurations for new versions

Comment on lines 501 to 502
self.nodes_to_visit
.extend((0..node.child_count()).rev().filter_map(|i| node.child(i)));
Copy link

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider collecting the children into a vector before extending to avoid multiple allocations. This can be more efficient for nodes with many children.

Suggested change
self.nodes_to_visit
.extend((0..node.child_count()).rev().filter_map(|i| node.child(i)));
let children: Vec<_> = (0..node.child_count()).rev().filter_map(|i| node.child(i)).collect();
self.nodes_to_visit.extend(children);

Copilot uses AI. Check for mistakes.
.map(move |node| {
let start_pos = node.start_position();
let end_pos = node.end_position();
let text = &source[node.byte_range()];
Copy link

Copilot AI Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The byte range indexing could panic if the node's byte range is invalid or extends beyond the source string. Consider using source.get(node.byte_range()) and handling the None case appropriately.

Suggested change
let text = &source[node.byte_range()];
let text = source.get(node.byte_range()).unwrap_or("");

Copilot uses AI. Check for mistakes.
@IdrissRio IdrissRio self-requested a review September 2, 2025 13:29
@tausbn
Copy link
Contributor Author

tausbn commented Sep 2, 2025

The suggestions from Copilot seem reasonable. I'll add a separate commit containing them.

@tausbn
Copy link
Contributor Author

tausbn commented Sep 5, 2025

DCA looks uneventful. Some changes in a few syntax error alerts, but that is to be expected.

Copy link
Contributor

@IdrissRio IdrissRio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the PR commit by commit. Except from a very minor comment, everything looks good. DCA is uneventful.

# TypeAliases
type Foo2[T15, U1 = str] = Bar1[T15, U1]
type Baz2[**P2 = [int, str]] = Spam[**P2]
# type Baz2[**P2 = [int, str]] = Spam[**P2] # From the PEP, but this is not actually valid syntax!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d either remove it or update the comment to explain why it’s commented out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like "this is not actually valid syntax!" adequately explains why it's commented out -- if you uncommented it, then it would be a syntax error. 😅

@tausbn tausbn merged commit f5a06be into main Sep 17, 2025
30 checks passed
@tausbn tausbn deleted the tausbn/python-update-tree-sitter-dependency branch September 17, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants