docs: add anonymizer Claude Code skill and supporting concept docs by lipikaramaswamy · Pull Request #153 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-05-11T16:58:50Z

Summary

Claude Code skill at skills/anonymizer/ — elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in interactive.md.
docs/concepts/choosing-a-strategy.md (215L) — decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
docs/troubleshooting.md (210L) — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
README.md — new "Using with Claude Code" section pointing at the skills.sh installer.
mkdocs.yml — adds the two new docs to navigation.
src/anonymizer/__init__.py — exports PrivacyGoal at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
docs/concepts/detection.md — minor wording polish.

Testing notes

Tested end-to-end with 20 PMC-Patients clinical case reports. The interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced and were fixed during testing:

Provider configuration wasn't checked proactively → added step 1 environment verification + step 3 provider question (commit 8ccd277)
Trace data wasn't persisted by default → script now saves preview.parquet for inspection (commit 5c5d5a0)

The skill loader was also verified: npx skills add (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work

chore: use COL_* constants for intermediate columns in llm_replace_workflow.py #148 — replace string-literal column names in llm_replace_workflow.py:53-55 with COL_* constants (existing STYLEGUIDE rule violation)
bug: result.trace_dataframe is not persistable via to_parquet #152 — blocks the skill's preview.parquet save from being readable until fixed. Involves the same 3 lines as #148.

Test plan

CI passes (format-check, copyright-check, mkdocs build --strict)
from anonymizer import PrivacyGoal succeeds
Skill discoverable via npx skills add against this branch
Skill workflow produces a sensible script for a sample dataset (verified manually with PMC-Patients data)

greptile-apps · 2026-05-11T17:03:24Z

Greptile Summary

This PR adds a Claude Code skill at skills/anonymizer/ along with two new concept documents (docs/concepts/choosing-a-strategy.md, docs/troubleshooting.md), a README section, and a PrivacyGoal top-level export in src/anonymizer/__init__.py. The skill elicits dataset context, guides mode/strategy selection, and produces a runnable Python script following an interactive workflow.

New docs: choosing-a-strategy.md is a full decision guide covering mode, strategy, detection knobs, and privacy-goal phrasing; troubleshooting.md is a symptom-first guide for dropped rows, leakage, low utility, and pipeline failures. Both are wired into mkdocs.yml navigation.
PrivacyGoal export: Previously accessible only via anonymizer.config.rewrite, it is now exported at anonymizer.__init__ and properly listed in __all__.
Known tracked issue: result.trace_dataframe.to_parquet("preview.parquet") in the skill's output template will raise a serialization error until issue #152 (bug: result.trace_dataframe is not persistable via to_parquet) is resolved. This is acknowledged in the PR description and has already been flagged in a previous review comment.

Confidence Score: 4/5

Safe to merge for docs and the PrivacyGoal export; the skill preview path has a known crash on every first run that is tracked but not yet fixed.

The PrivacyGoal export, all doc content, and the interactive workflow are solid. The one live defect is in the skill template: result.trace_dataframe.to_parquet raises an unhandled exception before the failure-first guard and quality summary can execute on every preview run.

skills/anonymizer/SKILL.md — the output template preview path crashes on trace_dataframe serialization until issue #152 is fixed.
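
The ordering problem described above can be illustrated with a small sketch. This is not the skill's template: `FakeTrace` and `save_trace` are hypothetical stand-ins, and the exception type is arbitrary, since the thread does not state what #152 actually raises. The sketch only shows that an unguarded save failure stops everything after it, while a best-effort save lets later steps (the failure-first guard, the quality summary) still run.

```python
class FakeTrace:
    """Hypothetical stand-in for result.trace_dataframe while #152 is open."""

    def to_parquet(self, path: str) -> None:
        # Simulate a serialization failure; the real error type is an assumption.
        raise TypeError(f"cannot serialize trace columns to {path}")


def save_trace(trace, path: str = "preview.parquet") -> bool:
    """Try to persist the trace; report failure instead of raising."""
    try:
        trace.to_parquet(path)
    except Exception as exc:  # broad on purpose: the save is best-effort
        print(f"warning: could not save {path}: {exc}")
        return False
    return True


saved = save_trace(FakeTrace())
# Anything placed after this point still executes even though the save failed.
print("trace saved:", saved)
```

Whether a best-effort save is the right trade-off is a design choice for the PR author; fixing #152 removes the need for it.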

Important Files Changed

Filename	Overview
skills/anonymizer/SKILL.md	New skill definition with output template; the preview path calls result.trace_dataframe.to_parquet() which is non-serializable per issue #152, causing a crash before the failure-first guard runs (already flagged in previous review).
src/anonymizer/__init__.py	Adds PrivacyGoal re-export from anonymizer.config.rewrite; import path is correct, __all__ updated correctly.
docs/concepts/choosing-a-strategy.md	New decision guide; leakage/tolerance values match code constants, DEFAULT_ENTITY_LABELS unpacking syntax is correct, all cross-doc links resolve properly.
docs/troubleshooting.md	New symptom-first guide; leakage threshold table matches RiskToleranceBundle values in config/rewrite.py exactly, relative links are correct.
skills/anonymizer/workflows/interactive.md	Interactive workflow steps are consistent with the API; Anonymizer(model_providers=...) parameter verified against the constructor signature.
AGENTS.md	Adds a pointer to the new skill and a note to check SKILL.md before shipping public-API changes.
README.md	Adds a Using with Claude Code section with install command; straightforward documentation addition.
mkdocs.yml	Adds two new nav entries for choosing-a-strategy and troubleshooting; both files exist and are in the right directories.
docs/concepts/detection.md	Minor wording change; no logic change.

Reviews (8): Last reviewed commit: "docs: mention bundled agent skill in AGE..."

lipikaramaswamy · 2026-05-11T17:07:20Z

Once #149 merges, this PR needs a follow-up edit to AGENTS.md:

The dev-vs-user redirect at the top should also mention the bundled skill at skills/anonymizer/
A note that public-API changes (especially imports referenced in skills/anonymizer/SKILL.md's output template) may require corresponding skill updates.

asteier2026 · 2026-05-11T20:49:22Z

+- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
+- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
+- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
+- Internal: `employee_id`, `cost_center`, `internal_project_codename`


mrn is in the default list, as is employee-id and possibly case_number and court_name

Good catch — fixed in de0ab26. Cross-checked all examples against the actual DEFAULT_ENTITY_LABELS list and dropped redundant ones (mrn, court_name, employee_id). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so medical record number and medical_record_number would have been treated as different labels.

Note on case_number: verified it's not in DEFAULT_ENTITY_LABELS, so I kept that one
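
To make the label-normalization point concrete, here is a minimal sketch. `normalize_label` is a hypothetical function that models only the behavior described in the reply above (strip and lowercase); it is not the library's validator.

```python
def normalize_label(label: str) -> str:
    """Assumed validator behavior per the reply above: strip and lowercase only."""
    return label.strip().lower()


spaced = normalize_label("  Medical Record Number ")
snake = normalize_label("medical_record_number")

print(spaced)           # medical record number
print(snake)            # medical_record_number
print(spaced == snake)  # False: spaces are not converted to underscores
```

Under that assumption the two spellings stay distinct labels, which is why the examples were switched to snake_case.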

@asteier2026

Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of domain-specific labels to *extend* the default list with, but all three are already in DEFAULT_ENTITY_LABELS (as `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings ("medical record number", "internal project codename") while the default list uses snake_case. Spaces are not normalized by the validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`, `diagnosis_code`, `case_number`, `internal_project_codename`) and add one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

alexahaushalter · 2026-05-12T15:47:50Z

+
+Symptom-first guide to common problems and how to fix them. Each entry says how to diagnose, what knob to turn, and what to verify after.
+
+When something looks wrong, **first confirm the run completed cleanly** (no dropped rows). Once you know the pipeline ran, **run `preview` on the failing rows** — the trace columns it produces are how you'll diagnose nine cases out of ten.


Are dropped rows and failed rows different? My first assumption is they are the same, and then it is confusing to me to make sure there were no dropped rows but to then investigate any that failed.

Good catch — they were the same thing. Unified in 0c46cc0: the intro now uses one term ("rows that didn't make it through the pipeline, usually a rate-limit / infra issue") and explicitly contrasts with quality-issue rows (high leakage_mass, low utility_score, needs_human_review).

…g tone

Critical fix flagged by greptile review on PR #153: `DEFAULT_ENTITY_LABELS` is exported as a `tuple[str, ...]`, so the pattern `DEFAULT_ENTITY_LABELS + ["clinical_facility"]` raises `TypeError: can only concatenate tuple (not "list") to tuple` at runtime. Every user copying the documented domain-extension pattern would hit this. Switched all four occurrences (SKILL.md prose tip, SKILL.md template, choosing-a-strategy.md code block, troubleshooting.md code block) to `[*DEFAULT_ENTITY_LABELS, "clinical_facility"]` (clean unpacking that produces a list).

Tone polish in choosing-a-strategy.md: now that the doc is exposed in user-facing mkdocs nav, three places that read as agent-facing ("ask the user", "if the user hasn't specified", `"User signal"` table header) were reframed to address the reader directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
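
The failure and the fix from this commit can be reproduced without the library. In the sketch below, `DEFAULT_ENTITY_LABELS` is a short stand-in tuple (the real export is longer); only the tuple type matters for the demonstration.

```python
# Stand-in for the real export, which is a longer tuple[str, ...].
DEFAULT_ENTITY_LABELS: tuple[str, ...] = (
    "medical_record_number",
    "court_name",
    "employee_id",
)

# Original documented pattern: tuple + list is not defined in Python.
try:
    labels = DEFAULT_ENTITY_LABELS + ["clinical_facility"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Fixed pattern: unpack the tuple into a new list, then append the extra label.
labels = [*DEFAULT_ENTITY_LABELS, "clinical_facility"]
print(labels)
```

Unpacking works regardless of whether the default is a tuple or a list, so the documented pattern would keep working if the export type changed.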

greptile-apps · 2026-05-13T01:06:44Z

- skills/anonymizer/ — Claude Code skill (SKILL.md + interactive workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md — decision guide for mode (Replace vs Rewrite), strategy, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- mkdocs.yml — add the two new docs to navigation.
- README.md — add "Using with Claude Code" section pointing at the skills.sh installer.
- src/anonymizer/__init__.py — export PrivacyGoal at the top level (referenced from the skill's output template).
- docs/concepts/detection.md — minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

Surfaced during manual skill testing: the workflow jumped from "verify install" straight to data inspection and only mentioned provider / API-key setup in the reactive Troubleshooting section. The agent would discover a missing provider only after the user spent time on data inspection and clarification, then watched preview fail.

- Extend step 1 to also verify provider config exists (API key env var + providers.yaml) and STOP with a pointer at docs/concepts/models.md if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults or a custom providers.yaml, so the generated script can pass the path via Anonymizer(model_providers=...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
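
A proactive check of the kind this commit describes could look like the sketch below. The function name, the `EXAMPLE_API_KEY` variable, and the `providers.yaml` location are all hypothetical; the commit does not say which environment variable or path the skill actually inspects.

```python
import os
from pathlib import Path


def missing_provider_config(env_var: str, providers_file: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not os.environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    if not providers_file.is_file():
        problems.append(f"{providers_file} not found")
    return problems


# Hypothetical names, chosen only for illustration.
problems = missing_provider_config("EXAMPLE_API_KEY", Path("providers.yaml"))
for problem in problems:
    print("STOP:", problem, "(see docs/concepts/models.md)")
```

Running the check before data inspection means a missing provider is reported up front rather than after a failed preview.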

Surfaced during manual skill testing: the script the agent generates prints summary stats but doesn't persist any trace data. Investigating 'why was this entity kept/dropped?' or 'why did this row never converge during repair?' required re-running an 8-minute preview, paying tokens again.

- Output template now writes result.trace_dataframe to preview.parquet on every preview run. trace_dataframe is a superset of the user-facing dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency. trace columns include dict/list values that don't round-trip through CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the saved file instead of suggesting an interactive Python re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
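
The CSV round-trip problem mentioned in this commit is easy to reproduce with the standard library alone. The row below is invented for illustration and does not reflect the real trace schema.

```python
import csv
import io

# Invented row: one plain column and one column holding a list of dicts.
row = {"text": "hello", "entities": [{"label": "first_name", "value": "Ana"}]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)  # non-string values are written via str()

buffer.seek(0)
restored = next(csv.DictReader(buffer))

print(type(row["entities"]).__name__)       # list
print(type(restored["entities"]).__name__)  # str
```

The nested value comes back as its string representation, so any downstream code has to re-parse it. A typed columnar format avoids that step, which matches the commit's stated reason for writing a single parquet file.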

- Rewrite the augmenter-steering bullet in active voice, leading with what data_summary actually is for. Drops the "(Augmenter-only)" parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls (beyond shared detection)" to make explicit that detection runs in both modes and these numbers exclude it — so the 0 for Redact/Annotate/Hash is correct (no additional LLM work beyond detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

… per review

- Mode (Replace vs Rewrite) reframed as "ask the user" rather than defaulting to Rewrite; applied across choosing-a-strategy.md, SKILL.md, and interactive.md.
- Replace-strategy default made explicit: Substitute when unspecified (choosing-a-strategy.md, SKILL.md, interactive.md).
- strict_entity_protection wording calls out "default False" and that being in a regulated domain alone isn't reason to flip True.
- gliner_threshold paragraph reworked to make both directions' costs explicit (lowering pays latency/tokens; raising trades for recall risk).
- data_summary section reframed: improves detection (precursor to both modes), additionally feeds Rewrite's rewriter for meaning preservation.
- "What to leave out" of data_summary now flags that Substitute behavior instructions belong in Substitute(instructions).
- Cheat sheet: 3 distinct Substitute rows (biographies, survey responses, support transcripts) instead of one.
- Troubleshooting: unified "dropped rows" / "failed records" terminology vs quality-issue rows; explained that extending entity_labels switches to strict mode, with the DEFAULT_ENTITY_LABELS + [...] pattern.
- Replaced "ringing the bell" latent-identifier example with the unambiguous "during her third round of chemo".
- Nits: tune→adjust, mrn spelled out, across-rows→across-documents, US spellings (prioritized, anonymized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

…ullet, wording

- troubleshooting.md: rewrite the "Hash output isn't stable across runs" section to distinguish the digest (text-only) from the output wrapper (templated with label), so the previously-contradictory bullets read consistently. Include the format_template="<HASH_{digest}>" workaround.
- troubleshooting.md: reorder the Substitute cross-row consistency bullet to lead with the fix (use Hash or post-process _replacement_map) before the per-row vs cross-row background.
- SKILL.md: polish the "LLM calls failing" line ("an auth issue, network problem, or wrong base URL" rather than "auth / network / wrong").
- SKILL.md: "for cost" → "for cost savings" in the gliner_threshold comment to make the direction explicit.
- README.md: "Replace vs Rewrite plus a strategy" → "Rewrite, or Replace with a strategy" so the strategy clause is unambiguously tied to Replace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
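
The digest-versus-wrapper distinction in the first bullet can be sketched as follows. Everything here is an assumption made for illustration: `render_hash` is a hypothetical helper, SHA-256 with a 12-character truncation is an arbitrary choice, and the label-bearing template is invented. Only the `<HASH_{digest}>` template comes from the commit message.

```python
import hashlib


def render_hash(text: str, label: str, format_template: str) -> str:
    """Hash the text only, then wrap the digest using the given template."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    # str.format ignores unused keyword arguments, so a template without
    # {label} simply drops the label from the output.
    return format_template.format(digest=digest, label=label)


labelled = "<HASH_{label}_{digest}>"  # hypothetical label-bearing wrapper
label_free = "<HASH_{digest}>"        # workaround template from the commit

a = render_hash("Jane Doe", "first_name", labelled)
b = render_hash("Jane Doe", "person", labelled)
c = render_hash("Jane Doe", "first_name", label_free)
d = render_hash("Jane Doe", "person", label_free)

print(a == b)  # False: same digest, but the wrapper embeds different labels
print(c == d)  # True: label-free wrapper depends on the text alone
```

Under these assumptions, output that looks unstable across runs can come from the detected label changing rather than from the digest itself, which is the situation the label-free template addresses.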

…ift risk

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

lipikaramaswamy requested review from a team as code owners May 11, 2026 16:58

asteier2026 reviewed May 11, 2026

Comment thread docs/concepts/choosing-a-strategy.md Outdated

asteier2026 reviewed May 11, 2026

Comment thread docs/concepts/choosing-a-strategy.md

asteier2026 reviewed May 11, 2026

Comment thread skills/anonymizer/workflows/interactive.md Outdated

asteier2026 reviewed May 11, 2026

Comment thread skills/anonymizer/SKILL.md Outdated

greptile-apps Bot reviewed May 12, 2026

Comment thread skills/anonymizer/SKILL.md