CI failure seems to be unrelated to my changes.
Pull request overview
This PR introduces YEAST, a Rust library for declarative AST cleanup/desugaring on top of tree-sitter parse trees, and integrates it into the shared tree-sitter extractor so extraction can optionally run on a rewritten AST and/or validate against an alternate output node-types schema.
Changes:
- Add new `shared/yeast` + `shared/yeast-macros` crates implementing the rule/query/template system, plus tests and documentation.
- Extend the shared tree-sitter extractor to optionally run YEAST rules and to support separate `output_node_types` for schema generation and TRAP validation.
- Update Bazel vendored Rust deps to include YEAST dependencies (and bump `tree-sitter`, `cc`, etc.).
Show a summary per file
| File | Description |
|---|---|
| shared/yeast/tests/test.rs | End-to-end tests for parsing, query matching, tree building, and desugaring rules. |
| shared/yeast/tests/node-types.yml | Test output schema in the new YAML node-types format. |
| shared/yeast/src/visitor.rs | Converts a tree-sitter Tree into a YEAST Ast. |
| shared/yeast/src/tree_builder.rs | Fresh identifier generation support for templates/rules. |
| shared/yeast/src/schema.rs | Schema representation for kinds/fields (language-derived or YAML-derived). |
| shared/yeast/src/range.rs | Serde helpers for (de)serializing tree_sitter::Range. |
| shared/yeast/src/query.rs | Query AST and matching engine (captures, repetition, named/unnamed semantics). |
| shared/yeast/src/print.rs | Debug printer for walking a YEAST AST. |
| shared/yeast/src/node_types_yaml.rs | YAML ↔ JSON node-types conversion + schema construction from YAML. |
| shared/yeast/src/lib.rs | Core YEAST types (Ast, Node, Rule, Runner) and rewrite application logic. |
| shared/yeast/src/dump.rs | Human-readable AST dump utility used by tests. |
| shared/yeast/src/cursor.rs | Cursor trait abstraction used by traversal/extractor integration. |
| shared/yeast/src/captures.rs | Capture storage and utilities (single/repeated/optional). |
| shared/yeast/src/build.rs | BuildCtx used by tree!/trees! macros to build synthetic nodes. |
| shared/yeast/src/bin/node_types_yaml.rs | CLI tool to convert YAML node-types ↔ JSON node-types. |
| shared/yeast/src/bin/main.rs | Minimal YEAST CLI for parsing and printing. |
| shared/yeast/doc/yeast.md | Main YEAST documentation (architecture, query/template language, integration). |
| shared/yeast/doc/node-types-yaml.md | Specification for the YAML node-types format and CLI usage. |
| shared/yeast/Cargo.toml | New yeast crate manifest and dependencies. |
| shared/yeast/Cargo.lock | Lockfile for the standalone shared/yeast crate. |
| shared/yeast/BUILD.bazel | Bazel target for the yeast Rust library. |
| shared/yeast/.gitkeep | Placeholder file for directory tracking. |
| shared/yeast/.gitignore | Ignores shared/yeast/target. |
| shared/yeast/.envrc | Direnv config for local development. |
| shared/yeast-macros/src/parse.rs | Proc-macro parsing and codegen for query!, tree!, trees!, rule!. |
| shared/yeast-macros/src/lib.rs | Proc-macro entry points and user-facing macro docs. |
| shared/yeast-macros/Cargo.toml | New yeast-macros proc-macro crate manifest. |
| shared/yeast-macros/BUILD.bazel | Bazel target for the yeast-macros proc-macro crate. |
| shared/tree-sitter-extractor/tests/multiple_languages.rs | Updates tests to include output_node_types in LanguageSpec. |
| shared/tree-sitter-extractor/tests/integration_test.rs | Updates tests to include output_node_types in LanguageSpec. |
| shared/tree-sitter-extractor/src/generator/mod.rs | Generator uses output_node_types when provided. |
| shared/tree-sitter-extractor/src/generator/language.rs | Adds output_node_types to generator Language. |
| shared/tree-sitter-extractor/src/extractor/simple.rs | Uses output_node_types for schema validation in the simple extractor. |
| shared/tree-sitter-extractor/src/extractor/mod.rs | Adds optional YEAST desugaring path and AstNode abstraction. |
| shared/tree-sitter-extractor/Cargo.toml | Adds a path dependency on shared/yeast. |
| shared/tree-sitter-extractor/BUILD.bazel | Adds Bazel dep on //shared/yeast. |
| ruby/extractor/src/generator.rs | Populates output_node_types: None for Ruby/Erb generator languages. |
| ruby/extractor/src/extractor.rs | Updates shared extractor invocation with new extract(...) params. |
| ql/extractor/src/generator.rs | Populates output_node_types: None for QL generator languages. |
| ql/extractor/src/extractor.rs | Populates output_node_types: None for QL simple extractor languages. |
| MODULE.bazel | Adds/upgrades vendored crates (notably tree-sitter and new deps). |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/defs.bzl | Adds YEAST + YEAST-macros crates and bumps vendored dependencies. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.zstd-sys-2.0.16+zstd.1.5.7.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-ruby-0.23.1.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-ql-0.23.1.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-python-0.23.6.bazel | Adds vendoring/build definitions for tree-sitter-python. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-json-0.24.8.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-embedded-template-0.25.0.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.tree-sitter-0.26.8.bazel | Bumps vendored tree-sitter to 0.26.8 and updates cc reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.iana-time-zone-haiku-0.1.2.bazel | Updates cc dependency reference. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.find-msvc-tools-0.1.9.bazel | Updates vendored find-msvc-tools version metadata. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.cc-1.2.61.bazel | Updates vendored cc version metadata and dependencies. |
| misc/bazel/3rdparty/tree_sitter_extractors_deps/BUILD.bazel | Adds aliases for serde_yaml and tree-sitter-python; bumps tree-sitter alias. |
| Cargo.toml | Adds shared/yeast and shared/yeast-macros to the workspace. |
| Cargo.lock | Workspace lock updates (adds yeast crates; bumps tree-sitter, cc, etc.). |
Copilot's findings
Comments suppressed due to low confidence (1)
shared/yeast/src/node_types_yaml.rs:303
`schema_from_yaml_with_language` also registers YAML `unnamed:` tokens using `schema.register_kind(name)`, which only affects the named kind map. If the YAML adds any unnamed tokens not present in the tree-sitter language, `QueryNode::UnnamedNode` lookups will still fail because `unnamed_kind_ids` is never updated.
This should use an unnamed-kind registration path (updating `unnamed_kind_ids`) rather than `register_kind`.
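The concern in this finding can be illustrated with a toy schema. This is only a sketch under assumptions, not the code in `node_types_yaml.rs`: the `Schema` layout, the `named_kind_ids` map, the `u16` ids, and the `register_unnamed_kind` and `lookup_unnamed` helpers are all hypothetical; only `register_kind` and `unnamed_kind_ids` are names taken from the finding.

```rust
use std::collections::HashMap;

/// Toy schema with separate id maps for named and unnamed kinds.
/// (Hypothetical layout; the real `Schema` may differ.)
#[derive(Default)]
struct Schema {
    named_kind_ids: HashMap<String, u16>,
    unnamed_kind_ids: HashMap<String, u16>,
    next_id: u16,
}

impl Schema {
    fn fresh_id(&mut self) -> u16 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    /// Registers a kind in the named map only.
    fn register_kind(&mut self, name: &str) -> u16 {
        if let Some(&id) = self.named_kind_ids.get(name) {
            return id;
        }
        let id = self.fresh_id();
        self.named_kind_ids.insert(name.to_string(), id);
        id
    }

    /// Hypothetical counterpart that updates the unnamed map instead.
    fn register_unnamed_kind(&mut self, name: &str) -> u16 {
        if let Some(&id) = self.unnamed_kind_ids.get(name) {
            return id;
        }
        let id = self.fresh_id();
        self.unnamed_kind_ids.insert(name.to_string(), id);
        id
    }

    /// What a lookup for an unnamed token would consult in this model.
    fn lookup_unnamed(&self, name: &str) -> Option<u16> {
        self.unnamed_kind_ids.get(name).copied()
    }
}
```

In this model, registering the token `"=>"` through `register_kind` leaves `lookup_unnamed("=>")` returning `None`; only the unnamed registration path makes the lookup succeed.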
- Files reviewed: 52/55 changed files
- Comments generated: 6
> The `Runner` applies rules by walking the tree top-down. At each node, it
> tries each rule in order. If a rule's query matches, the node is replaced by
> the transform's output, and the rules are re-applied to the result. If no
> rule matches, the node is kept and its children are processed recursively.
How does this work when that input and output node-types are distinct? My thinking is:
- If no rule applies, we can't just return the node as it is, because it has the wrong type (the node kind is not valid in the output node-types)
- A rule cannot apply multiple times, because the output AST cannot generally be matched by a query that operates on the input AST.
- If a given node name appears in both the input and output ASTs, but means something completely different in the two ASTs, will we accidentally re-apply rules on the output node? (misinterpreting it to be an input node of that kind)
The short answer is: it doesn't. The behaviour described where unknown nodes are passed through without modification only applies to the case where the input language is a subset of the output language.
As for your second point, I think it should work to match against both input and output types -- I'll add a test to make sure. The original intent was that one could have two phases of desugaring: the first one turns the input AST into a nicer output AST (this is the part we care more about right now), the second desugars complex constructs into simpler ones (this is the part we focused on in the hackathon).
However, there is one subtlety that we might want to address: currently we rewrite the parent node before we descend into its children (and if no rewrite applies, we don't go back to it later). This makes actually implementing the second phase awkward -- when we see an output node we want to desugar, its children will still be in the input AST. It would probably be better to use a traversal where child nodes are handled before their parent. (However, for simple cleanup transformations, or more generally for transformations where we don't inspect the structure of the child nodes, it doesn't make a difference.)
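To make the traversal-order point concrete, here is a small sketch that is independent of YEAST's actual `Runner`. Everything in it is hypothetical: the `Node` type, the `rename` cleanup rule that maps `in_*` kinds to `out_*` kinds, and the log that records which child kinds are visible at the moment a parent is rewritten.

```rust
#[derive(Debug, Clone)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

/// Hypothetical cleanup rule: input kinds `in_*` become output kinds `out_*`.
fn rename(kind: &str) -> String {
    match kind.strip_prefix("in_") {
        Some(rest) => format!("out_{}", rest),
        None => kind.to_string(),
    }
}

fn child_kinds(children: &[Node]) -> String {
    children
        .iter()
        .map(|c| c.kind.as_str())
        .collect::<Vec<_>>()
        .join(", ")
}

/// Parent first: when the parent is rewritten, its children are still input nodes.
fn top_down(node: &Node, log: &mut Vec<String>) -> Node {
    log.push(format!("{} sees [{}]", node.kind, child_kinds(&node.children)));
    let mut children = Vec::new();
    for child in &node.children {
        children.push(top_down(child, log));
    }
    Node { kind: rename(&node.kind), children }
}

/// Children first: when the parent is rewritten, its children are already output nodes.
fn bottom_up(node: &Node, log: &mut Vec<String>) -> Node {
    let mut children = Vec::new();
    for child in &node.children {
        children.push(bottom_up(child, log));
    }
    log.push(format!("{} sees [{}]", node.kind, child_kinds(&children)));
    Node { kind: rename(&node.kind), children }
}
```

Both functions produce the same tree for a rule like `rename` that never inspects children, which matches the remark that simple cleanup transformations are unaffected. The difference only shows up in what a rule on the parent could observe.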
Finally, if the same node type appears both in the input and output node types, then yes, we could accidentally re-apply rules to the output node. However, merely having the same node type isn't enough. The entire query has to match. Thus, if we do something like, say, change the name of a field, the query will stop matching, and we won't loop.
However, there is (at least) one case where we would loop. Consider the rule
(foo (_) @children) => (foo bar: {..children})
(that is, moving all unnamed children into a field). In this case, we would match repeatedly. The first time around we move all of the children into the bar field. The second time around we can still match, capturing an empty list of children in @children, and then overwrite the bar field, and so on. This continues until we hit the recursion depth limit (currently 100).
To mitigate this, what we could do is enforce that a given rule is only applied once, either globally or on a rule-by-rule basis. For simple AST cleanup, I don't currently see any issues with enforcing this behaviour globally. For more advanced desugaring, it might be detrimental.
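The looping case and the apply-once mitigation can be modelled in a few lines. This is a toy model rather than YEAST's implementation: `Node`, `move_children_into_bar`, and `rewrite` are hypothetical names, the rule is assumed to match even when there are no unnamed children left (as described above), and the depth limit is a parameter here.

```rust
/// Toy node: unnamed children plus the contents of a `bar` field.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: String,
    children: Vec<Node>,
    bar: Vec<Node>,
}

fn leaf(kind: &str) -> Node {
    Node { kind: kind.to_string(), children: vec![], bar: vec![] }
}

/// Models the rule that moves all unnamed children of `foo` into `bar`.
/// It matches any `foo` node, including one with zero unnamed children.
fn move_children_into_bar(node: &Node) -> Option<Node> {
    if node.kind != "foo" {
        return None;
    }
    Some(Node {
        kind: "foo".to_string(),
        children: vec![],
        // Overwrites whatever `bar` held before.
        bar: node.children.clone(),
    })
}

/// Re-applies the rule to its own output until it stops matching.
/// Returns the final node and the number of applications, or an error
/// once `max_depth` applications have been made and the rule still matches.
fn rewrite(node: Node, max_depth: usize, apply_once: bool) -> Result<(Node, usize), String> {
    let mut current = node;
    let mut applications = 0;
    loop {
        if apply_once && applications == 1 {
            return Ok((current, applications));
        }
        match move_children_into_bar(&current) {
            None => return Ok((current, applications)),
            Some(next) => {
                if applications == max_depth {
                    return Err(format!("recursion depth limit {} reached", max_depth));
                }
                current = next;
                applications += 1;
            }
        }
    }
}
```

Without the apply-once flag, a `foo` node never reaches a fixed point in this model and the call ends with the depth-limit error; with the flag set, the rule fires once and the children stay in `bar`.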
a0a0e9e demonstrates that output->output transformations are indeed possible.
YEAST (YEAST Elaborates Abstract Syntax Trees) is a framework for transforming tree-sitter parse trees before CodeQL extraction.

Core components:
- shared/yeast/ (Ast, Node, Schema, query matching engine, captures, FreshScope, BuildCtx)
- shared/yeast-macros/ (proc macros: query!, tree!, trees!, rule!)

The query language is inspired by tree-sitter queries:

    (assignment left: (_) @lhs right: (_) @rhs)

Templates support embedded Rust ({expr}), splicing ({..expr}), computed literals (#{expr}), and fresh identifiers ($name).

The rule! macro combines query and transform:

    rule!((for pattern: (_) @pat ...) => (call receiver: {val} ...))

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Human-friendly YAML alternative to tree-sitter node-types.json with three sections: supertypes, named, unnamed. Supports bidirectional conversion and building Schema objects from YAML. Includes CLI binary (node_types_yaml) and documentation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Produces indented text showing node kinds, named fields, and leaf content. Unnamed tokens are hidden unless inside a named field. Used by tests for readable assertions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 tests covering parsing, queries, tree building, desugaring rules, cursor navigation, and the shorthand rule! syntax. Tests use a custom output node-types.yml with named fields for all children (parameter, stmt, index), loaded via schema_from_yaml_with_language. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Covers architecture, query language, template language (tree!/trees!/rule!), capture semantics, fresh identifiers, and extractor integration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
extract() gains a rules parameter. When empty, uses tree-sitter native traversal (no behavior change). When non-empty, runs yeast desugaring and extracts via traverse_yeast. Adds AstNode trait abstracting over tree_sitter::Node and yeast::Node, with minimal changes to existing Visitor methods (Node -> &N in 6 signatures). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
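As a rough illustration of the kind of abstraction this commit describes, the sketch below shows one visitor function working over two node representations through a shared trait. It is an assumption-laden stand-in: the trait methods, `NativeNode`, `RewrittenNode`, and `collect_kinds` are hypothetical and do not reflect the real `AstNode` trait, `tree_sitter::Node`, or `yeast::Node`.

```rust
/// Hypothetical, minimal node abstraction (the real trait may look different).
trait AstNode {
    fn kind(&self) -> &str;
    fn children(&self) -> Vec<&Self>;
}

/// Stand-in for a node coming straight from the parser.
struct NativeNode {
    kind: &'static str,
    children: Vec<NativeNode>,
}

/// Stand-in for a node in the rewritten tree.
struct RewrittenNode {
    kind: String,
    children: Vec<RewrittenNode>,
}

impl AstNode for NativeNode {
    fn kind(&self) -> &str {
        self.kind
    }
    fn children(&self) -> Vec<&Self> {
        self.children.iter().collect()
    }
}

impl AstNode for RewrittenNode {
    fn kind(&self) -> &str {
        &self.kind
    }
    fn children(&self) -> Vec<&Self> {
        self.children.iter().collect()
    }
}

/// A visitor written once against `&N`, usable with either representation.
fn collect_kinds<N: AstNode>(node: &N, out: &mut Vec<String>) {
    out.push(node.kind().to_string());
    for child in node.children() {
        collect_kinds(child, out);
    }
}
```

Making the visitor generic over the node type is what would let existing traversal code stay mostly unchanged while the source of nodes varies.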
Language and LanguageSpec gain optional output_node_types field. When set, the generator produces dbscheme/QL from the output types and the extractor validates TRAP against them. All existing extractors pass None (no behavior change). Ruby extract() calls gain vec![] for the new rules parameter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add BUILD.bazel files for the yeast and yeast-macros crates, register them as dependencies of the shared tree-sitter extractor, and refresh the vendored crate dependencies via update_tree_sitter_extractors_deps.sh.
Adds a regression test verifying that desugaring rules can chain across output-only node kinds: a first rule rewrites an input kind to an output-only kind, and a second rule then rewrites that output-only kind into another output-only kind. This exercises the schema lookup for query patterns whose root kind is not present in the input tree-sitter grammar.
This PR adds a cleaned-up prototype of the YEAST library that was developed in a hackathon a few years ago.
YEAST is intended to be a lightweight layer for performing various kinds of AST cleanup and desugaring directly on the parse tree produced by a `tree-sitter` parser. Rewrite rules are specified declaratively, with a query language that approximates that of `tree-sitter`, though notably with no alternation or anchors (and also with greedy semantics -- no backtracking). I expect that this will be sufficient for most uses.

Output templates also look like `tree-sitter` trees, with embedded Rust blocks for specifying code that calculates an AST based on the given input.

Because the output AST may be an entirely different language from the input AST, this PR also adds a new `node-types.yml` format -- a lightweight reformulation of `node-types.json` intended for human consumption (unlike the latter).

Of note: the output format disallows having field-less child nodes. The `node-types.yml` format supports them, but YEAST itself will silently throw them away.

There's a lot of code in this PR, but it's just a prototype, so don't feel compelled to review it in detail.

DO, however, look at the documentation, and also the changes to the existing `tree-sitter` extractor infrastructure (the final two commits).