Skip to content

fix(telemetry): hash file paths in payloads#717

Draft
typpo wants to merge 3 commits intomainfrom
codex/telemetry-file-hashes
Draft

fix(telemetry): hash file paths in payloads#717
typpo wants to merge 3 commits intomainfrom
codex/telemetry-file-hashes

Conversation

@typpo
Copy link
Contributor

@typpo typpo commented Mar 16, 2026

Use hashed identifiers for telemetry file and URL fields instead of raw values.
Update the telemetry regression tests and changelog wording.

Summary by CodeRabbit

  • Bug Fixes

    • Telemetry payloads no longer include raw file paths, URLs, or other direct location identifiers to better protect sensitive file and location information.
    • Anonymous metadata (e.g., file/type summaries, counts) is preserved without leaking raw identifiers.
  • Tests

    • Test suite updated to assert absence of path/URL identifiers and raw location values.
  • Documentation

    • Changelog updated to note the privacy-preserving telemetry change.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

Walkthrough

Telemetry payloads were simplified to omit raw file paths and URL identifiers. Two helper builders, _build_path_properties and _build_url_properties, were added and multiple telemetry recorders were refactored to use them. Tests and the changelog were updated to reflect the privacy-preserving payload shape.

Changes

Cohort / File(s) Summary
Telemetry core
modelaudit/telemetry.py
Added _build_path_properties(path) and _build_url_properties(url); refactored recorders (record_scan_started, record_scan_completed, record_file_type_detected, record_issue_found, record_download_started, record_download_completed) to use these builders and removed raw file_path, path_identifiers, and URL identifier fields from payloads.
Tests
tests/test_telemetry.py
Updated tests to assert privacy-preserving payloads: removed expectations for raw paths/URLs and identifiers, verified sensitive values are absent in serialized telemetry, and adjusted canonical issue references.
Changelog
CHANGELOG.md
Added a Fixed entry noting telemetry payload simplification to omit file path identifiers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hop through code with nimble feet,
Paths hidden where the quiet hashes meet,
Raw URLs and files I softly hide,
Telemetry whispers, tidy and wide,
A rabbit dances — privacy's delight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(telemetry): hash file paths in payloads' accurately describes the main objective of the PR: replacing raw file paths and URLs with hashed identifiers in telemetry payloads.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/telemetry-file-hashes
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 544-551: The code coerces issue.get("location", "") to str,
turning None into the string "None" and causing _hash_identifier to produce
synthetic IDs; change issue_location to keep the original value (e.g.,
issue_location = issue.get("location")) and only convert to string or call
_hash_identifier / _extract_model_name / _classify_path when the value is
non-null/non-empty (e.g., if issue_location: location_str = str(issue_location);
use location_str for classification but call _hash_identifier(location_str) only
when issue_location is truthy, otherwise set location_identifier/model_name to
None). Apply the same fix to the other block around the references to
_hash_identifier/_classify_path/_extract_model_name noted at 572-576.

In `@tests/test_telemetry.py`:
- Around line 370-372: Add a return type annotation "-> None" to the test method
signature for test_path_and_url_fields_use_hashed_identifiers_only so the method
reads "def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:";
update only the function signature to include the annotation to satisfy the
repository typing rules for tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6352dee5-7c86-42e4-86fb-b473b01282c1

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 26588e7.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

Comment on lines 544 to 551
issue_location = str(issue.get("location", ""))
issue_details.append(
{
"type": issue_type,
"severity": severity,
"location_type": self._classify_path(issue_location),
"file_path": issue_location,
"location_identifier": self._hash_identifier(issue_location) if issue_location else None,
"model_name": self._extract_model_name(issue_location) if issue_location else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid hashing "None" for missing locations/paths.

Current str(...) coercion turns None into "None" (truthy), so missing location/path values get a synthetic identifier instead of being omitted.

🔧 Proposed fix
-                issue_location = str(issue.get("location", ""))
+                raw_issue_location = issue.get("location")
+                issue_location = raw_issue_location if isinstance(raw_issue_location, str) else ""
                 issue_details.append(
                     {
                         "type": issue_type,
                         "severity": severity,
                         "location_type": self._classify_path(issue_location),
                         "location_identifier": self._hash_identifier(issue_location) if issue_location else None,
                         "model_name": self._extract_model_name(issue_location) if issue_location else None,
                     }
                 )
@@
-                "file_identifiers": [
-                    self._hash_identifier(asset_path)
-                    for asset in assets
-                    if (asset_path := str(asset.get("path", "")))
-                ],
+                "file_identifiers": [
+                    self._hash_identifier(asset_path)
+                    for asset in assets
+                    if isinstance((asset_path := asset.get("path")), str) and asset_path
+                ],

Also applies to: 572-576

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/telemetry.py` around lines 544 - 551, The code coerces
issue.get("location", "") to str, turning None into the string "None" and
causing _hash_identifier to produce synthetic IDs; change issue_location to keep
the original value (e.g., issue_location = issue.get("location")) and only
convert to string or call _hash_identifier / _extract_model_name /
_classify_path when the value is non-null/non-empty (e.g., if issue_location:
location_str = str(issue_location); use location_str for classification but call
_hash_identifier(location_str) only when issue_location is truthy, otherwise set
location_identifier/model_name to None). Apply the same fix to the other block
around the references to _hash_identifier/_classify_path/_extract_model_name
noted at 572-576.

Comment on lines 370 to 372
def test_path_and_url_fields_use_hashed_identifiers_only(self):
"""Path and URL fields should not include raw sensitive values."""
mock_posthog = MagicMock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add -> None return annotation on this test method.

This keeps test typing consistent with repository rules.

✏️ Proposed fix
-    def test_path_and_url_fields_use_hashed_identifiers_only(self):
+    def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:

As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 370 - 372, Add a return type annotation
"-> None" to the test method signature for
test_path_and_url_fields_use_hashed_identifiers_only so the method reads "def
test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:"; update
only the function signature to include the annotation to satisfy the repository
typing rules for tests.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_telemetry.py`:
- Around line 273-274: The test method
test_scan_completed_uses_top_level_results_schema is missing a return type
annotation; update its definition to include "-> None" (i.e., change the
function signature for test_scan_completed_uses_top_level_results_schema to
declare a None return type) to comply with the tests/**/*.py typing guideline.
- Around line 335-336: The test method definition
test_scan_started_includes_per_path_fields is missing a return type annotation;
update its signature to include the explicit return annotation "-> None" (i.e.,
change "def test_scan_started_includes_per_path_fields(self):" to "def
test_scan_started_includes_per_path_fields(self) -> None:") so the test follows
the project's typing guideline for tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c05982f2-472b-459c-82de-24148c75fb41

📥 Commits

Reviewing files that changed from the base of the PR and between 26588e7 and 1d911a5.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

Comment on lines 273 to +274
def test_scan_completed_uses_top_level_results_schema(self):
"""Test that scan completion telemetry aggregates from top-level issues/assets."""
"""Test that scan completion telemetry aggregates without file identifiers."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add -> None return annotation on this test method.

-    def test_scan_completed_uses_top_level_results_schema(self):
+    def test_scan_completed_uses_top_level_results_schema(self) -> None:

As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_scan_completed_uses_top_level_results_schema(self):
"""Test that scan completion telemetry aggregates from top-level issues/assets."""
"""Test that scan completion telemetry aggregates without file identifiers."""
def test_scan_completed_uses_top_level_results_schema(self) -> None:
"""Test that scan completion telemetry aggregates without file identifiers."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 273 - 274, The test method
test_scan_completed_uses_top_level_results_schema is missing a return type
annotation; update its definition to include "-> None" (i.e., change the
function signature for test_scan_completed_uses_top_level_results_schema to
declare a None return type) to comply with the tests/**/*.py typing guideline.

Comment on lines 335 to +336
def test_scan_started_includes_per_path_fields(self):
"""Scan started events include per-path arrays and model names."""
"""Scan started events include coarse path metadata but no identifiers."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add -> None return annotation on this test method.

-    def test_scan_started_includes_per_path_fields(self):
+    def test_scan_started_includes_per_path_fields(self) -> None:

As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_scan_started_includes_per_path_fields(self):
"""Scan started events include per-path arrays and model names."""
"""Scan started events include coarse path metadata but no identifiers."""
def test_scan_started_includes_per_path_fields(self) -> None:
"""Scan started events include coarse path metadata but no identifiers."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 335 - 336, The test method definition
test_scan_started_includes_per_path_fields is missing a return type annotation;
update its signature to include the explicit return annotation "-> None" (i.e.,
change "def test_scan_started_includes_per_path_fields(self):" to "def
test_scan_started_includes_per_path_fields(self) -> None:") so the test follows
the project's typing guideline for tests.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 84: Update the CHANGELOG entry that currently reads "- **telemetry:**
simplify telemetry payloads to omit file path identifiers" to accurately reflect
the implementation: if the code replaces paths with hashed values, change the
wording to "hash file path identifiers in telemetry payloads" or "replace raw
file path identifiers with hashed values"; if the code actually removes them,
keep "omit file path identifiers"; ensure the changelog text matches the
implementation and mentions "telemetry" and "file path identifiers" so it's easy
to locate and verify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9743f77e-460c-423f-944e-a75319921bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 1d911a5 and 8f68169.

📒 Files selected for processing (1)
  • CHANGELOG.md


### Fixed

- **telemetry:** simplify telemetry payloads to omit file path identifiers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify whether file paths are hashed or omitted.

The changelog entry states that file path identifiers are "omitted," but the PR title says "hash file paths in payloads." These describe different approaches:

  • Hashing means raw values are replaced with hashed identifiers (data is transformed but still present)
  • Omitting means the values are completely removed (data is absent)

For clarity and accuracy, align the changelog wording with what the implementation actually does. If paths are being hashed, consider: "replace raw file path identifiers with hashed values" or "hash file path identifiers in telemetry payloads."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` at line 84, Update the CHANGELOG entry that currently reads "-
**telemetry:** simplify telemetry payloads to omit file path identifiers" to
accurately reflect the implementation: if the code replaces paths with hashed
values, change the wording to "hash file path identifiers in telemetry payloads"
or "replace raw file path identifiers with hashed values"; if the code actually
removes them, keep "omit file path identifiers"; ensure the changelog text
matches the implementation and mentions "telemetry" and "file path identifiers"
so it's easy to locate and verify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants