feat(datastore): Decorate and log errors thrown during DataStore entrypoint initialization#27327
feat(datastore): Decorate and log errors thrown during DataStore entrypoint initialization#27327markfields wants to merge 7 commits into
Conversation
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (251 lines, 7 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
This PR adds telemetry for DataStore entry point initialization failures so component load issues include Fluid-specific context like DataStore identity and package path.
Changes:
- Adds a shared helper for code-artifact-tagged DataStore telemetry properties.
- Logs and annotates errors thrown while resolving
FluidDataStoreRuntime.entryPoint. - Reuses the shared telemetry helper for DataStore realization errors and adds tests for entry point failure logging.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
packages/utils/telemetry-utils/src/error.ts |
Aligns DataProcessingError wrapping factory usage. |
packages/runtime/runtime-utils/src/index.ts |
Exports the new DataStore telemetry helper. |
packages/runtime/runtime-utils/src/dataStoreHelpers.ts |
Adds shared DataStore identity telemetry property helper. |
packages/runtime/datastore/src/test/dataStoreRuntime.spec.ts |
Adds coverage for lazy entry point initialization and failure telemetry. |
packages/runtime/datastore/src/dataStoreRuntime.ts |
Wraps, annotates, logs, and rethrows entry point initialization failures. |
packages/runtime/container-runtime/src/dataStoreContext.ts |
Reuses the shared DataStore telemetry helper for realization errors. |
Fleet Review — CleanNo issues found across the reviewer fleet for this run. |
| error, | ||
| "entryPointInitialization", | ||
| ); | ||
| errorWrapped.addTelemetryProperties( |
There was a problem hiding this comment.
Note: In testing I found that these telemetry props get stripped off on the way up by roundtripping through exceptionToResponse and responseToException. A bit curious to look into that flow and if we could cut out those tranforms.
There was a problem hiding this comment.
Does this have to do with privacy concerns? I know there can sometimes be policies to strip all error information away in case it contains user data or identifiable information.
There was a problem hiding this comment.
Nope, it has to do with legacy architectural direction (URI-based routing within and across containers)
There was a problem hiding this comment.
These properties are correctly tagged as "code artifacts", so hosts can decide if they're safe to log under different circumstances
ChumpChief
left a comment
There was a problem hiding this comment.
Love wrapping at the exit to custom code! Few organizational/API suggestions.
| new LazyPromise(async () => provideEntryPoint(this)), | ||
| new LazyPromise(async () => | ||
| provideEntryPoint(this).catch((error) => { | ||
| const errorWrapped = DataProcessingError.wrapIfUnrecognized( |
| new LazyPromise(async () => provideEntryPoint(this)), | ||
| new LazyPromise(async () => | ||
| provideEntryPoint(this).catch((error) => { | ||
| const errorWrapped = DataProcessingError.wrapIfUnrecognized( |
There was a problem hiding this comment.
I am wondering if DataProcessingError is appropriate here. This runs application entry point logic and has nothing to do with Fluid right? So, maybe this is a UsageError or just a generic error?
There was a problem hiding this comment.
Thinking more about this - we should separate out the path that loads the data store / DDS data from the path that loads the application logic. And we should be able to tell these apart from telemetry. I would suggest to not use data process error for the latter because that should not cause any problems with the data (it does fail summaries today but that is an unexpected scenario and will be fixed soon)
There was a problem hiding this comment.
I think we have different impression of what DPE means. I think it's any error that happens on a critical load or process path, regardless of the source. Asking Copilot to audit usage, it buckets about 70% as FF-root-cause and 30% as app callback or hook code.
Practically speaking it doesn't really matter, because we'd expect an application to want to monitor these closely. Root cause is a second step.
PErhaps we can subdivide the errorType like you're thinking, but I think that's a bigger effort.
As for the feasibility of doing so here (capturing DDS load v. app code), I don't think the entryPoint abstraction affords us that. That really depends on the implementation. Even PureDataObject doesn't know. It happens that we typically do DDS (at least root) loading in a shared base class like DataObject or MultiFormatDataObject (moved to another repo), but even that is not a hard requirement.
Let's keep talking and we can refine this, I'm probably glossing over something you're seeing, but I'm confident that this is a reasonable use of DPE for now.
Description
Component load failures are one of the top scenarios our partners need to be able to monitor and diagnose in production. This PR closes a long-standing gap where errors thrown during entryPoint initialization (e.g. for
PureDataObject, that includes the initalization hooks) are not logged to FF tables.While host applications can and do log these errors, there is context that's only readily available within FF that is really important for investigation - All the common ContainerRuntime props (notably clientType - is it the Summarizer?), as well as the DataStore's packagePath (indicates the component type) and id.
This PR logs an error event with all these props, and additionally annotates the error object* with the dataStore package/id info.
* Note: In testing I found that these telemetry props get stripped off on the way up by roundtripping through
exceptionToResponseandresponseToException. A bit curious to look into that flow and if we could cut out those tranforms.