Skip to content

Conversation

@wackywendell
Copy link
Contributor

@wackywendell wackywendell commented Sep 15, 2025

Summary

Begins addressing #367 and #342 by adding support for parsing the types in YAML Simple Extension Files into Rustic types - with validity enforced. This includes a string text parser handling built-in types, compound types, named structs, custom types, and validated parameter constraints in the Simple Extension YAML files.

Scope

  • Types-only: no functions or call validation yet.
  • Public API exposes parsed types (ExtensionFile, Registry, CustomType, ConcreteType) and enforces validation of those on creation / read.

Key Changes

  • Type system
    • New BuiltinType, CompoundType, ConcreteType, CustomType with Display/round‑trip support for alias and named‑struct structures.
    • Parameter constraints: data type, integer (with min/max), enum (validated/deduped), boolean, string.
    • Parsing to and from the YAML structures (TryFrom<TypeParamDefsItem>, Parse<RawType>)
  • File/Registry type: abstraction for handling YAML files
  • Context and proto glue
    • Separates out ProtoContext from Context, to distinguish between things needed for Protobuf parsing (ProtoContext)
  • Type expression parser
    • Parses simple, user‑defined (u!Name) and type variables; visits extension references for linkage bookkeeping.
  • Build/CI
    • parse feature includes serde_yaml; include!(extensions.in) is gated behind extensions feature.
    • Aligns actions/checkout to v4, updates Cargo dependency set, and bumps the substrait submodule.

Compatibility Notes

  • New trait bound ProtoContext on proto parsing that previously required only Context.
  • extensions.in now compiled only with features=["extensions"].
  • Minimal, types-only round‑trip implemented; other sections remain empty when converting back to text.

Testing

  • New unit tests cover:
    • Type parsing and round‑trip for alias and named‑struct.
    • Parameter constraint handling including enum validation and integer bounds (with current truncation behavior).
    • Registry creation and type lookup; core registry smoke test behind features=["extensions"].

@wackywendell wackywendell changed the title Add Initial Extension Support feat: Add Initial Extension Support Oct 10, 2025
@wackywendell wackywendell changed the title feat: Add Initial Extension Support feat: add initial extension support Oct 10, 2025
@wackywendell wackywendell marked this pull request as ready for review October 20, 2025 15:32
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments but still haven't gone through most of the PR. Just wanted to flush what I have so far, but I will review more later. Thanks!

Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments, but excited to get this in! Sorry that this PR has been sitting so long 😅

Refactored ProtoContext trait (removed), by replacing with a concrete type.
Some other small refactoring simplifications and code cleanup.
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments. Had to run so I didn't get to entirely finish the review but will revisit later. Thanks

/// Integer parameter (e.g., precision, scale)
Integer(i64),
/// Type parameter (nested type)
Type(ConcreteType),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to point out that the spec technically allows for other type parameters:

  message Parameter {
    oneof parameter {
      // Explicitly null/unspecified parameter, to select the default value (if
      // any).
      google.protobuf.Empty null = 1;

      // Data type parameters, like the i32 in LIST<i32>.
      Type data_type = 2;

      // Value parameters, like the 10 in VARCHAR<10>.
      bool boolean = 3;
      int64 integer = 4;
      string enum = 5;
      string string = 6;
    }
  }

That being said, I would prefer you leave what you have as is since the PR is fairly large already. Just a note for future work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to this effect.

/// Parameterized builtin types that require non-type parameters, e.g. numbers
/// or enum
#[derive(Clone, Debug, PartialEq)]
pub enum BuiltinParameterized {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to separate out builtin parameterized types which don't take in type parameters from those that do? Not saying this is necessarily wrong, just curious to understand why. Substrait-java treats both e.g. decimal (which takes only an int param) and list as implementors of BaseParameterizedType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now there's a separation between three things:

  1. Primitive types (PrimitiveType, i.e. Simple Types): no type parameters
  2. Simple compound types (BuiltinKind, i.e. Compound Types with non-type parameters): has type parameters, but only literal ones; no recursion
  3. Advanced compound types (seen in ConcreteKind, i.e. Compound Types with parameters that can be types.

I think I separated (1) and (2) based on the division in the docs; (2) and (3) are separated to handle the recursive type management in (3) separately.

On reflection - I think it simplifies things to merge (1) and (2); I don't see much reason not to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at it - I did manage to merge (1) and (2), and also drop the BuiltinKind naming as well. It's an overall negative diff by >100 LOC, so I think its a net improvement.

I left the recursive types in ConcreteType. We could perhaps combine those in as well... but keeping them in their own place might be easier to work with.

/// Sub-second precision digits
precision: i32,
},
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these wonderful comments 🙏

/// Type parameters (e.g., for generic types)
pub parameters: Vec<TypeParam>,
/// Concrete structure definition, if any
pub structure: Option<ConcreteType>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note (haven't checked yet). It would be good to make sure there is a test specifically for structure.

This is kind of a rarer feature which in which we only recently introduced handling for substrait-go and is still a WIP for substrait-java.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah there kind of implicitly is a test for this because of the extension_types.yaml file in the core extensions, which the registry.rs at least checks is able to load everything without an error.

}

impl<C: Context> Parse<C> for proto::Version {
impl Parse<ExtensionAnchors> for proto::Version {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Does parsing Version require ExtensionAnchors context, or any context at all? Could imagine either passing around () for empty context, or having two traits, one called something like Parse and another called ParseWith.

I'm just dumping some thoughts though :) this doesn't have to be in this PR, especially considering that Parse already existed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you could do that; although then you have version.parse(&mut ()) which is pretty awkward and surprising on its own. You could perhaps fix it by having a ContextFreeParse trait, implemented automatically for all Parse<()> objects, and then you could have version.parse().

But... that feels like it's getting a bit out of scope for this PR, so I added a note in the code 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further thought, isn't ContextFreeParse really just the From or TryFrom traits? Probably can just leverage those standard traits instead of introducing new ones in a later PR.

}
other => panic!("unexpected error type: {other:?}"),
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked myself yet if it already exists but just noting that it would be good to have tests somewhere that ensure all of the core extension files parse correctly. Could be just as simple as showing they don't result in an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, you already got it in the registry file, nice!

Add TODO referencing Substrait type proto for unsupported type
parameters not yet implemented.
Use a single type for non-recursive builtin types.
Also change get_simple_extension_urn to just return an option, and update some comments.
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I once again have to run before getting to fully finish everything. I actually feel quite confident in everything except for the types.rs file, which is the one file I just didn't get to finish going through. But everything is looking great!

Thank you for bearing with my (sometimes excessive) comments 😄

//! [here](https://github.com/substrait-io/substrait/blob/5f031b69ed211e1ec307be3db7989d64c65d33a2/grammar/SubstraitType.g4).
// Ideally, this module would be generated from the grammar, but there are no
// well-maintained ANTLR grammar modules for Rust availableat the time of this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// well-maintained ANTLR grammar modules for Rust availableat the time of this
// well-maintained ANTLR grammar modules for Rust available at the time of this


//! Parsed type AST used by the simple extensions type parser.
//!
//! This module is based on the offical substrait type grammar defined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This module is based on the offical substrait type grammar defined
//! This module is based on the official substrait type grammar defined

min,
max,
options: param.param_type.raw_options(),
optional: None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this always be None?

@@ -0,0 +1,137 @@
// SPDX-License-Identifier: Apache-2.0

//! Parsing context for extension processing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a more accurate comment here is:

Suggested change
//! Parsing context for extension processing.
//! Validated simple extensions: [`SimpleExtensions`].
//!
//! Currently only type definitions are supported; function parsing will be
//! added in a future update.

?

/// A context for parsing simple extensions, tracking what type names are
/// resolved or unresolved.
#[derive(Debug, Default)]
pub struct TypeContext {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub struct TypeContext {
pub(crate) struct TypeContext {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on renaming this file to type_expr.rs? I feel like that is a little more revealing of what this file does considering that the primary export is the TypeExpr enum.


use crate::parse::text::simple_extensions::types::is_builtin_type_name;

/// A parsed type expression from a type string, with lifetime tied to the original string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have such a poor understanding of lifetimes at the current moment, so maybe this is more of a question for my personal learning.

But what is it about this type that necessitated introducing lifetimes here, considering we haven't had to use them much before?

I tried dropping them all locally and I am able to get things to compile and get tests to pass fine. Is it giving us some kind of performance enhancement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly: did you have the parse feature enabled? This code is only enabled with that feature. I have been running cargo test --features parse,serde,extensions to cover everything here. And with that - I would expect this to fail without lifetimes.

The TypeExpr here does not copy the strings out of the type_str, but instead contains pointers into that parsed string; that's what the &'a str in Simple and UserDefined is - it's a pointer to a substring of the original, parsed string.

Given that it contains pointers, for memory safety reasons, the TypeExpr can't outlive the string it is pointing to - or those would become dangling pointers, leading to potential 'use after free' bugs. So to make sure that they don't, the TypeExpr comes with a lifetime as part of its type, where the lifetime of any instance of TypeExpr has to be a subset of the lifetime of the string that was parsed.

So - to remove the lifetimes here, we'd have to copy all the strings. We could do that; it would be a performance hit of some sort, but perhaps not a big deal - after all, how much are we ever actually parsing, and how much? I doubt anyone has a server somewhere parsing 100s of simple extension files a second. But... lifetimes are fairly common in Rust for ephemeral types, and this TypeExpr is rather ephemeral - it's an AST we build for just long enough to parse it into a ConcreteType, and in the conversion from TypeExpr to ConcreteType, the string is indeed copied (via .to_ascii_lowercase(), which copies; see code).

That said - I'm inclined to leave it as is; the AST is a structured representation of the strings in the file, and so pointers to that string seem reasonable - that's how nom, pest, and lalrpop work, common Rust libraries for parsing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I spoke a bit unclearly. I didn't mean to say that I merely deleted the lifetime annotations. I also had to introduce some string copying. But regardless, I appreciate the very thorough explanation!

I agree then that leaving this implementation is reasonable, especially given the precedent in other libraries. Thanks!

"timestamp_tz?",
TypeExpr::Simple("timestamp_tz", vec![], true),
),
("time", TypeExpr::Simple("time", vec![], false)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
("time", TypeExpr::Simple("time", vec![], false)),
("time", TypeExpr::Simple("time", vec![], false)),
("any", TypeExpr::Simple("any", vec![], false)),

Just because it wasn't entirely obvious to me if any was treated as a TypeVariable or a Simple type.


#[test]
fn test_user_defined_and_parameters() {
let cases = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome tests, these inspire a lot of confidence in the parser.

}

/// Check if a name corresponds to any built-in type (scalar or container)
pub fn is_builtin_type_name(name: &str) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are some inconsistencies with the meaning of BuiltinType. This function suggests that all of the enum types inside of BuiltinType + list + map + struct are considered builtin types, which feels like it conflicts with the name BuiltinType being used exclusively for non-container types.

Not sure what the cleanest fix here would be. Could rename BuiltinTypes to ScalarBuiltinTypes, PrimitiveBuiltinTypes, etc...

Or we could rename this function to is_reserved_type_name? I'm not sure honestly...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that the comment makes the Scalar vs Container distinction. What do you think about using those in the actual type names? To be clear, I can go either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveBuiltinType seems good, since I notice that below there is a function called primitive_builtin which is for builtin types without type parameters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants