Skip to content

Nested sourced_from support#1216

Draft
ellisandrews-toast wants to merge 3 commits into
block:mainfrom
ellisandrews-toast:nested-sourced-from
Draft

Nested sourced_from support#1216
ellisandrews-toast wants to merge 3 commits into
block:mainfrom
ellisandrews-toast:nested-sourced-from

Conversation

@ellisandrews-toast
Copy link
Copy Markdown
Contributor

TODO

Copy link
Copy Markdown
Collaborator

@myronmarston myronmarston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this spike up. I don't have any major concerns--the architecture and approach is sound. I left some comments throughout.

Looking forward to seeing individual PRs spun off this!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On past projects, I've found it useful to update the schema artifacts structure (e.g. to add new fields like you are here) as a stand-alone PR before anything uses the new fields. Consider doing that for your PR stack.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@parent_relationship_config = {
parent_type_name: parent_type_name,
parent_relationship_name: parent_relationship_name
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's define a Data class for this config. A hash is quick-and-dirty but is more prone to silently ignoring mispelled keys, etc--data_hash[:mispelled] returns nil whereas data_object.mispelled throws an exception.

The data class could be defined within this relationship class, e.g.:

class Relationship < DelegateClass(Field)
  # ...

  ParentRef = ::Data.define(:type_ref, :relationship_name)
end

Also, instead of just storing the parent type name as a string, can we store it as a TypeRef? While we always accept types as strings via the APIs called by EG users, we generally convert them to type refs early on because type refs are more capable than strings :).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented!


# @return [Hash, nil] configuration for parent relationship in a nested sourced_from chain
# @private
attr_reader :parent_relationship_config
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this parent_relationship_ref? To me it's less configuration and more just a reference to the parent relationship.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Results class is pretty large and I'd like to keep it from getting larger. It'd be nice to put your logic into another class which Results can delegate to. Actually, maybe we should preemptively extract identify_extra_update_targets_by_object_type_name into a separate class?

Maybe we can move that into SchemaUpdateTargetResolver which sits besides https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into a class called SourcedUpdateTargetsResolver. That class finds all sourced_from relationships (both top-level and nested), resolves them into UpdateTarget objects.


# Identifies update targets for sourced_from fields on non-indexed embedded types
# that use parent_relationship chains.
def identify_nested_sourced_update_targets(object_type, extra_update_targets_by_type_name, errors)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be moved into https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb? IIRC that's the main spot where the update target logic for a single object type lives...which seems to be what you're working with here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into the new SourcedUpdateTargetsResolver that delegates to both UpdateTargetResolver (for top-level) and NestedUpdateTargetResolver (for nested).

end

# Find the parent type
parent_type = @schema_def_state.object_types_by_name[config[:parent_type_name]]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of looking up the type like this, if you use a type ref you can just do type_ref.as_object_type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — storing a TypeRef on ParentRef and using ref.type_ref.as_object_type for lookup.


// Splits a composite nested element key into a list of parts.
List splitNestedElementKey(String nestedElementKey) {
return Arrays.asList(nestedElementKey.splitOnToken(":"));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another comment I mentioned the danger of assuming : won't be in any keys. Since we can pass arbitrary JSON in our parameters, can we just pass this as a list?

Or, if you are encoding/decoding a list of strings for storage in a map key can you encode it as a JSON string (being careful to apply some "canonical" formatting)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per a previous comment above, I struggled to come up with a good solution here (but I bet we can find one). Implemented a naive length:value encoding for now that technically works.

Tried using Lists as map keys (which works in-memory in Painless), but it breaks when the document is serialized to JSON for storage and reloaded — JSON object keys must be strings.

Can spend more time on this aspect later.

parts.add(segment.get("object"));
}
}
return String.join(":", parts);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another comment I mentioned the danger of assuming : won't be in any keys. We should figure out a way to encode nested keys that avoids making assumptions about what could be in the key values...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. Mentioned some thoughts and current workaround in other comments.

"because this element was previously sourced from a different event (" + previousSourceIds + "). " +
"Each nested element can only be sourced from one source document."
);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/nit can we unify these two exceptions messages into one message that makes sense for both cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// Applies nested sourced data from the __nested_sourced_data buffer to matched nested elements.
// Reads path config from the document itself — no external params needed.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting--dose that mean we are storing some static path config on each document? Seems like that would be inefficient to store the same path config on billions of docs rather than pass it in as params...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on Zoom, was doing this because it was awkward to thread this static config through to the script the "normal" way.

I decided to instead store this on the IndexDefinition (the per-index runtime metadata) and pass it to the painless script as a param (we are already doing this kind of thing for __counts). I think this makes sense because the path config is static configuration that describes the structure of nested relationships within an index — it's the same for every document in that index, and needs to be known by any update event targeting that index. The script reads it from params.nestedSourcedPaths at execution time. The __nested_sourced_data structure on the document now only stores the actual sourced field values (the data that varies per nested element), not the path navigation config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants