-
Notifications
You must be signed in to change notification settings - Fork 28
feat: add initial extension support #395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
07c6293 to
95270ab
Compare
01172ec to
3e58a42
Compare
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments but still haven't gone through most of the PR. Just wanted to flush what I have so far, but I will review more later. Thanks!
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments, but excited to get this in! Sorry that this PR has been sitting so long 😅
Refactored ProtoContext trait (removed), by replacing with a concrete type. Some other small refactoring simplifications and code cleanup.
5fae289 to
d3e3627
Compare
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments. Had to run so I didn't get to entirely finish the review but will revisit later. Thanks
| /// Integer parameter (e.g., precision, scale) | ||
| Integer(i64), | ||
| /// Type parameter (nested type) | ||
| Type(ConcreteType), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to point out that the spec technically allows for other type parameters:
message Parameter {
oneof parameter {
// Explicitly null/unspecified parameter, to select the default value (if
// any).
google.protobuf.Empty null = 1;
// Data type parameters, like the i32 in LIST<i32>.
Type data_type = 2;
// Value parameters, like the 10 in VARCHAR<10>.
bool boolean = 3;
int64 integer = 4;
string enum = 5;
string string = 6;
}
}That being said, I would prefer you leave what you have as is since the PR is fairly large already. Just a note for future work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to this effect.
| /// Parameterized builtin types that require non-type parameters, e.g. numbers | ||
| /// or enum | ||
| #[derive(Clone, Debug, PartialEq)] | ||
| pub enum BuiltinParameterized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to separate out builtin parameterized types which don't take in type parameters from those that do? Not saying this is necessarily wrong, just curious to understand why. Substrait-java treats both e.g. decimal (which takes only an int param) and list as implementors of BaseParameterizedType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now there's a separation between three things:
- Primitive types (
PrimitiveType, i.e. Simple Types): no type parameters - Simple compound types (
BuiltinKind, i.e. Compound Types with non-type parameters): has type parameters, but only literal ones; no recursion - Advanced compound types (seen in
ConcreteKind, i.e. Compound Types with parameters that can be types.
I think I separated (1) and (2) based on the division in the docs; (2) and (3) are separated to handle the recursive type management in (3) separately.
On reflection - I think it simplifies things to merge (1) and (2); I don't see much reason not to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looking at it - I did manage to merge (1) and (2), and also drop the BuiltinKind naming as well. It's an overall negative diff by >100 LOC, so I think its a net improvement.
I left the recursive types in ConcreteType. We could perhaps combine those in as well... but keeping them in their own place might be easier to work with.
| /// Sub-second precision digits | ||
| precision: i32, | ||
| }, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these wonderful comments 🙏
| /// Type parameters (e.g., for generic types) | ||
| pub parameters: Vec<TypeParam>, | ||
| /// Concrete structure definition, if any | ||
| pub structure: Option<ConcreteType>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note (haven't checked yet). It would be good to make sure there is a test specifically for structure.
This is kind of a rarer feature which in which we only recently introduced handling for substrait-go and is still a WIP for substrait-java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah there kind of implicitly is a test for this because of the extension_types.yaml file in the core extensions, which the registry.rs at least checks is able to load everything without an error.
| } | ||
|
|
||
| impl<C: Context> Parse<C> for proto::Version { | ||
| impl Parse<ExtensionAnchors> for proto::Version { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Does parsing Version require ExtensionAnchors context, or any context at all? Could imagine either passing around () for empty context, or having two traits, one called something like Parse and another called ParseWith.
I'm just dumping some thoughts though :) this doesn't have to be in this PR, especially considering that Parse already existed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you could do that; although then you have version.parse(&mut ()) which is pretty awkward and surprising on its own. You could perhaps fix it by having a ContextFreeParse trait, implemented automatically for all Parse<()> objects, and then you could have version.parse().
But... that feels like it's getting a bit out of scope for this PR, so I added a note in the code 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further thought, isn't ContextFreeParse really just the From or TryFrom traits? Probably can just leverage those standard traits instead of introducing new ones in a later PR.
| } | ||
| other => panic!("unexpected error type: {other:?}"), | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked myself yet if it already exists but just noting that it would be good to have tests somewhere that ensure all of the core extension files parse correctly. Could be just as simple as showing they don't result in an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, you already got it in the registry file, nice!
Add TODO referencing Substrait type proto for unsupported type parameters not yet implemented.
Use a single type for non-recursive builtin types.
Also change get_simple_extension_urn to just return an option, and update some comments.
benbellick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I once again have to run before getting to fully finish everything. I actually feel quite confident in everything except for the types.rs file, which is the one file I just didn't get to finish going through. But everything is looking great!
Thank you for bearing with my (sometimes excessive) comments 😄
| //! [here](https://github.com/substrait-io/substrait/blob/5f031b69ed211e1ec307be3db7989d64c65d33a2/grammar/SubstraitType.g4). | ||
| // Ideally, this module would be generated from the grammar, but there are no | ||
| // well-maintained ANTLR grammar modules for Rust availableat the time of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // well-maintained ANTLR grammar modules for Rust availableat the time of this | |
| // well-maintained ANTLR grammar modules for Rust available at the time of this |
|
|
||
| //! Parsed type AST used by the simple extensions type parser. | ||
| //! | ||
| //! This module is based on the offical substrait type grammar defined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| //! This module is based on the offical substrait type grammar defined | |
| //! This module is based on the official substrait type grammar defined |
| min, | ||
| max, | ||
| options: param.param_type.raw_options(), | ||
| optional: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this always be None?
| @@ -0,0 +1,137 @@ | |||
| // SPDX-License-Identifier: Apache-2.0 | |||
|
|
|||
| //! Parsing context for extension processing. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a more accurate comment here is:
| //! Parsing context for extension processing. | |
| //! Validated simple extensions: [`SimpleExtensions`]. | |
| //! | |
| //! Currently only type definitions are supported; function parsing will be | |
| //! added in a future update. |
?
| /// A context for parsing simple extensions, tracking what type names are | ||
| /// resolved or unresolved. | ||
| #[derive(Debug, Default)] | ||
| pub struct TypeContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub struct TypeContext { | |
| pub(crate) struct TypeContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on renaming this file to type_expr.rs? I feel like that is a little more revealing of what this file does considering that the primary export is the TypeExpr enum.
|
|
||
| use crate::parse::text::simple_extensions::types::is_builtin_type_name; | ||
|
|
||
| /// A parsed type expression from a type string, with lifetime tied to the original string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have such a poor understanding of lifetimes at the current moment, so maybe this is more of a question for my personal learning.
But what is it about this type that necessitated introducing lifetimes here, considering we haven't had to use them much before?
I tried dropping them all locally and I am able to get things to compile and get tests to pass fine. Is it giving us some kind of performance enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly: did you have the parse feature enabled? This code is only enabled with that feature. I have been running cargo test --features parse,serde,extensions to cover everything here. And with that - I would expect this to fail without lifetimes.
The TypeExpr here does not copy the strings out of the type_str, but instead contains pointers into that parsed string; that's what the &'a str in Simple and UserDefined is - it's a pointer to a substring of the original, parsed string.
Given that it contains pointers, for memory safety reasons, the TypeExpr can't outlive the string it is pointing to - or those would become dangling pointers, leading to potential 'use after free' bugs. So to make sure that they don't, the TypeExpr comes with a lifetime as part of its type, where the lifetime of any instance of TypeExpr has to be a subset of the lifetime of the string that was parsed.
So - to remove the lifetimes here, we'd have to copy all the strings. We could do that; it would be a performance hit of some sort, but perhaps not a big deal - after all, how much are we ever actually parsing, and how much? I doubt anyone has a server somewhere parsing 100s of simple extension files a second. But... lifetimes are fairly common in Rust for ephemeral types, and this TypeExpr is rather ephemeral - it's an AST we build for just long enough to parse it into a ConcreteType, and in the conversion from TypeExpr to ConcreteType, the string is indeed copied (via .to_ascii_lowercase(), which copies; see code).
That said - I'm inclined to leave it as is; the AST is a structured representation of the strings in the file, and so pointers to that string seem reasonable - that's how nom, pest, and lalrpop work, common Rust libraries for parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I spoke a bit unclearly. I didn't mean to say that I merely deleted the lifetime annotations. I also had to introduce some string copying. But regardless, I appreciate the very thorough explanation!
I agree then that leaving this implementation is reasonable, especially given the precedent in other libraries. Thanks!
| "timestamp_tz?", | ||
| TypeExpr::Simple("timestamp_tz", vec![], true), | ||
| ), | ||
| ("time", TypeExpr::Simple("time", vec![], false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ("time", TypeExpr::Simple("time", vec![], false)), | |
| ("time", TypeExpr::Simple("time", vec![], false)), | |
| ("any", TypeExpr::Simple("any", vec![], false)), |
Just because it wasn't entirely obvious to me if any was treated as a TypeVariable or a Simple type.
|
|
||
| #[test] | ||
| fn test_user_defined_and_parameters() { | ||
| let cases = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome tests, these inspire a lot of confidence in the parser.
| } | ||
|
|
||
| /// Check if a name corresponds to any built-in type (scalar or container) | ||
| pub fn is_builtin_type_name(name: &str) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some inconsistencies with the meaning of BuiltinType. This function suggests that all of the enum types inside of BuiltinType + list + map + struct are considered builtin types, which feels like it conflicts with the name BuiltinType being used exclusively for non-container types.
Not sure what the cleanest fix here would be. Could rename BuiltinTypes to ScalarBuiltinTypes, PrimitiveBuiltinTypes, etc...
Or we could rename this function to is_reserved_type_name? I'm not sure honestly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that the comment makes the Scalar vs Container distinction. What do you think about using those in the actual type names? To be clear, I can go either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimitiveBuiltinType seems good, since I notice that below there is a function called primitive_builtin which is for builtin types without type parameters.
Summary
Begins addressing #367 and #342 by adding support for parsing the types in YAML Simple Extension Files into Rustic types - with validity enforced. This includes a string text parser handling built-in types, compound types, named structs, custom types, and validated parameter constraints in the Simple Extension YAML files.
Scope
ExtensionFile,Registry,CustomType,ConcreteType) and enforces validation of those on creation / read.Key Changes
BuiltinType,CompoundType,ConcreteType,CustomTypewith Display/round‑trip support for alias and named‑struct structures.TryFrom<TypeParamDefsItem>,Parse<RawType>)ProtoContextfromContext, to distinguish between things needed for Protobuf parsing (ProtoContext)u!Name) and type variables; visits extension references for linkage bookkeeping.parsefeature includesserde_yaml;include!(extensions.in)is gated behindextensionsfeature.actions/checkoutto v4, updates Cargo dependency set, and bumps thesubstraitsubmodule.Compatibility Notes
ProtoContexton proto parsing that previously required onlyContext.extensions.innow compiled only withfeatures=["extensions"].Testing
features=["extensions"].