feat: local telemetry storage #8400
Conversation
> - `StandardNode.to_db()` automatically serializes Pydantic models to JSON via `model_dump_json()` and deserializes them in `from_db()` via `ujson.loads()`. This is proven by existing test fixtures (`PadanticStdNode` in test_node_standard.py).
> - The telemetry payload is ~3-5 KB of JSON. Storing it as a single JSON string avoids fragmenting it across multiple Neo4j nodes/relationships.
> - The `TelemetryData` model is already well-typed with nested Pydantic models (`TelemetryWorkerData`, `TelemetryBranchData`, `TelemetrySchemaData`, `TelemetryDatabaseData`, `TelemetryPrefectData`).
> - Using the existing model type in the StandardNode field enables automatic Pydantic validation on read.
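For illustration, a minimal sketch of the round-trip those notes describe, using real Pydantic v2 and `ujson` calls; the `to_db`/`from_db` helpers here paraphrase `StandardNode`'s behavior rather than reproduce its actual implementation:

```python
import ujson
from pydantic import BaseModel


class Payload(BaseModel):  # stand-in for the telemetry payload model
    workers: int


def to_db(value: Payload) -> str:
    # Serialize the Pydantic model to a single JSON string for storage
    return value.model_dump_json()


def from_db(raw: str) -> Payload:
    # Parse the stored JSON and re-validate it through the model
    return Payload(**ujson.loads(raw))
```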
I think we need to be careful with that: over time the format of this data will change and we'll have old data in the database. We need to make sure that we can still read and export the old data, not just the latest format.
Good point... We should probably create a TelemetryData Pydantic class for each format version, so that it gets correctly (de)serialized. I will add it to the plan.
Updated the spec: instead of versioned Pydantic classes, the `data` field is now stored as a plain `dict[str, Any]` (JSON). The `TelemetryData` Pydantic model is still used at collection time for validation, then `model.model_dump()` converts it to a plain dict before storage. On read, `data` comes back as a raw dict, so no deserialization is needed. The `payload_format` field identifies which version of the schema produced the data, so consumers can interpret differences if needed. This keeps old snapshots always readable regardless of how the model evolves.
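A minimal sketch of that flow, assuming a simplified `TelemetryData` model and a hypothetical version tag; only `model_dump()` and the `payload_format`/`data` field names come from the comment above:

```python
from typing import Any

from pydantic import BaseModel


class TelemetryData(BaseModel):
    # Illustrative subset; the real model nests TelemetryWorkerData,
    # TelemetryBranchData, TelemetrySchemaData, and friends
    workers: int = 0


PAYLOAD_FORMAT = "2"  # hypothetical schema version tag


def build_snapshot(model: TelemetryData) -> dict[str, Any]:
    # Validate through the typed model at collection time, then store a
    # plain dict so old snapshots stay readable as the schema evolves
    return {"payload_format": PAYLOAD_FORMAT, "data": model.model_dump()}
```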
> ### R-006: REST API Design — Programmatic Interface
It's not clearly mentioned, but I think we should have a sane default limit for this API to avoid returning everything each time. A pagination system would also be helpful to avoid returning hundreds of responses at once.
A pagination system will be implemented as described in contracts/rest-api.md.
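For example, a client might page through snapshots like this; the endpoint path and the `limit`/`offset` parameter names come from the contract discussion later in this thread, while the base URL and `httpx` usage are assumptions:

```python
import httpx

BASE_URL = "http://localhost:8000"  # assumed local deployment

# Fetch one page of snapshots instead of relying on the server default
resp = httpx.get(
    f"{BASE_URL}/api/telemetry/snapshots",
    params={"limit": 50, "offset": 100},
)
page = resp.json()
print(f"{page['count']} snapshots in this page")  # count is the page size, per review below
```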
Force-pushed from bfd456b to bcf3264
This PR has been inactive for 2 weeks and has been marked as stale. If you're still working on it, add new activity to keep it open. The stale label will be removed automatically when there's new activity.
Force-pushed from dfc2fc4 to bbbfdd4
ajtmccarty left a comment
looks pretty good to me
```python
payload_format: str     # Format version
deployment_id: str      # Deployment identifier
infrahub_version: str   # Product version
data: dict[str, Any]    # Full telemetry payload
```
Might be nice for this to be structured data too, but not sure if it really matters in this case. I see the research section explains why the data is unstructured: to stay backward compatible with old payload formats when the format changes going forward.
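To make that backward-compatibility point concrete, a hypothetical consumer might branch on `payload_format` when older snapshots used different keys; the version tags and key names here are invented for illustration:

```python
from typing import Any


def load_snapshot_data(snapshot: dict[str, Any]) -> dict[str, Any]:
    # data is a plain dict, so any stored format remains readable;
    # only interpretation needs to care about the version
    data: dict[str, Any] = snapshot["data"]
    if snapshot["payload_format"] == "1":
        # hypothetical migration: an old format used a different key name
        data = {**data, "workers": data.pop("worker_count", 0)}
    return data
```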
Force-pushed from bbbfdd4 to 79d9576
Thanks for the review @ajtmccarty, I've addressed your comments and pushed the actual implementation.
Force-pushed from f1bed15 to bbc2784
Force-pushed from bbc2784 to 2a2c697
Force-pushed from 6adb50c to 5a964a9
4 issues found across 23 files
<file name="dev/specs/fac-001-local-telemetry-storage/contracts/rest-api.md">
<violation number="1" location="dev/specs/fac-001-local-telemetry-storage/contracts/rest-api.md:75">
P2: Type `remote_send_status` as `Literal["sent", "skipped", "failed", "pending"]` (or a `StrEnum`) instead of plain `str`. This gives Pydantic validation and self-documents the allowed values in the contract.
(Based on your team's feedback about preferring enums and typed values over literals.) [FEEDBACK_USED]</violation>
<violation number="2" location="dev/specs/fac-001-local-telemetry-storage/contracts/rest-api.md:121">
P1: The `export` command lacks `--limit`/`--offset` options and there's no mention of automatic pagination. Since the REST API defaults to `limit=1000`, an export from a long-running deployment would silently truncate results. Either add explicit pagination options (matching `telemetry list`) or specify that the command auto-paginates through all results.</violation>
</file>
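A sketch of the first suggestion, reusing the field names quoted earlier in the thread; the model name and exact contract are assumptions:

```python
from typing import Any, Literal

from pydantic import BaseModel

RemoteSendStatus = Literal["sent", "skipped", "failed", "pending"]


class TelemetrySnapshot(BaseModel):  # hypothetical contract model
    payload_format: str
    deployment_id: str
    infrahub_version: str
    data: dict[str, Any]
    remote_send_status: RemoteSendStatus  # validated set of allowed values
```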
<file name="backend/infrahub/telemetry/tasks.py">
<violation number="1" location="backend/infrahub/telemetry/tasks.py:126">
P2: `database` is only assigned inside the first `try` block. If `get_database()` fails, subsequent blocks reference an unbound `database` variable. Move `database = await get_database()` before the first `try`, or re-obtain it in each block that needs it, so the variable is always defined when used.</violation>
</file>
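A sketch of the suggested fix, with the surrounding flow logic reduced to hypothetical helpers; only `get_database` is named in the review itself:

```python
import logging

logger = logging.getLogger(__name__)


async def send_telemetry_push() -> None:
    # Bind `database` before the first try block so it is always defined
    # in the later blocks, even if an earlier step raises
    database = await get_database()  # named in the review; signature assumed

    try:
        await store_snapshot(database)  # hypothetical helper
    except Exception:
        logger.exception("failed to store telemetry snapshot")

    try:
        await push_snapshot(database)  # hypothetical helper
    except Exception:
        logger.exception("failed to push telemetry snapshot")
```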
<file name="backend/infrahub/api/telemetry.py">
<violation number="1" location="backend/infrahub/api/telemetry.py:83">
P2: `count` returns the current page size, not the total number of matching snapshots. For a paginated endpoint (with `limit`/`offset`), this makes it impossible for clients to determine whether more pages exist. Either rename this to clarify it's the page size (e.g., `returned`), or add a total-count query and return the true total here.</violation>
</file>
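One way to resolve that ambiguity, per the suggestion above; the field names are illustrative, not the actual contract:

```python
from typing import Any

from pydantic import BaseModel


class SnapshotListResponse(BaseModel):
    returned: int  # number of snapshots in this page
    total: int     # total matching snapshots across all pages
    snapshots: list[dict[str, Any]]
```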
polmichel left a comment
This looks good! Just a comment worth really looking into regarding the `end_date` bound in the query when querying "today's" data.
```diff
@@ -96,21 +104,64 @@ async def post_telemetry_data(url: str, payload: dict[str, Any]) -> None:
 @flow(name="anonymous_telemetry_send", flow_run_name="Send anonymous telemetry")
 async def send_telemetry_push() -> None:
```
I was just on my dependency injection soapbox yesterday for a different PR making some changes to a flow, and I'm going to do it again.
To be more SOLID (particularly the D, for dependency inversion), this flow should just instantiate a new TelemetrySnapshotService that encapsulates all of the below logic. Something like this:
```python
async def send_telemetry_push() -> None:
    db = await get_database()  # assumed; the original sketch left `db` unbound
    telemetry_repository = TelemetryRepository(db=db)
    telemetry_service = TelemetrySnapshotService(
        db=db,
        telemetry_repository=telemetry_repository,
    )
    data = await gather_anonymous_telemetry_data()
    await telemetry_service.save_and_send(data)
```
And then in your tests, you don't have to patch a bunch of stuff. You can create mocks using `telemetry_repo = AsyncMock(spec=TelemetryRepository)` or `create_autospec(TelemetryRepository, spec_set=True)`, and you can assert they are called with `telemetry_repo.assert_awaited_once_with()`, `telemetry_repo.assert_has_awaits()`, or another method on the mock.
I think it improves tests (accurate interfaces for the mocks instead of "anything goes") and makes them easier to read (no more patch context managers). A sketch of what such a test might look like follows below.
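A minimal sketch of such a test, assuming pytest-asyncio is installed, that the `TelemetrySnapshotService`/`TelemetryRepository` classes from the sketch above exist, and that the repository exposes a `save()` coroutine (as later review comments suggest):

```python
from unittest.mock import AsyncMock, create_autospec

import pytest


@pytest.mark.asyncio
async def test_save_and_send_stores_snapshot() -> None:
    # autospec keeps the mock's interface identical to the real class
    telemetry_repo = create_autospec(TelemetryRepository, spec_set=True, instance=True)
    service = TelemetrySnapshotService(db=AsyncMock(), telemetry_repository=telemetry_repo)

    await service.save_and_send(data={"payload_format": "2"})  # hypothetical call shape

    # assert directly on awaits, no patch context managers needed
    telemetry_repo.save.assert_awaited_once()
```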
That said, this would probably require some more refactoring to turn `gather_anonymous_telemetry_data` into a `TelemetryDataGatherer` and `post_telemetry_data` into a `TelemetryDataPoster` to get the full benefits, but it's something to consider, and an LLM can probably make pretty quick work of it.
Thanks, I'll try to tackle that refactor in another PR (to not block 1.9).
1 issue found across 12 files (changes from recent commits).
<file name="backend/infrahub/telemetry/tasks.py">
<violation number="1" location="backend/infrahub/telemetry/tasks.py:127">
P1: Unhandled exception: `repository.save()` on the opt-out path lost its `try/except` during refactoring. The other two `repository.save()` calls in this function are protected, but a DB failure here will crash the flow. Wrap this in the same error-handling pattern used by the adjacent save calls.</violation>
</file>
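A sketch of the fix, mirroring the pattern described for the adjacent save calls; the function wrapper, parameter types, and logger are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


async def save_opt_out_snapshot(repository, snapshot) -> None:  # hypothetical wrapper
    # Mirror the error handling used by the other repository.save() calls
    try:
        await repository.save(snapshot)
    except Exception:
        logger.exception("failed to store telemetry snapshot on opt-out path")
```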
Force-pushed from 9187fae to 37c3533
Signed-off-by: Fatih Acar <fatih@opsmill.com>
Force-pushed from c7116a4 to 6151156
As per the quickstart.md file:
### What This Feature Does

Infrahub now stores a daily telemetry snapshot locally in the database, regardless of whether remote telemetry reporting is enabled. This ensures that all deployments, including air-gapped and opted-out ones, retain usage data for support, auditing, and license compliance.

### Key Changes

- `infrahubctl telemetry export` exports snapshots to a JSON file.
- `infrahubctl telemetry list` shows a summary table of stored snapshots.
- `GET /api/telemetry/snapshots` provides programmatic access.
- Access requires the `READ_TELEMETRY` global permission.
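For reference, typical invocations of the commands above might look like the following; exact flags and output handling are not specified in this thread and are assumptions:

```bash
infrahubctl telemetry list     # summary table of stored snapshots
infrahubctl telemetry export   # export snapshots to a JSON file
```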