
docs: add anonymizer Claude Code skill and supporting concept docs #153

Merged: lipikaramaswamy merged 10 commits into main from lipikaramaswamy/docs/anonymizer-skill on May 14, 2026

Conversation

@lipikaramaswamy
Collaborator

Summary

  • Claude Code skill at skills/anonymizer/ — elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in interactive.md.
  • docs/concepts/choosing-a-strategy.md (215L) — decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
  • docs/troubleshooting.md (210L) — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
  • README.md — new "Using with Claude Code" section pointing at the skills.sh installer.
  • mkdocs.yml — adds the two new docs to navigation.
  • src/anonymizer/__init__.py — exports PrivacyGoal at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
  • docs/concepts/detection.md — minor wording polish.

Testing notes

Tested end-to-end with 20 PMC-Patients clinical case reports. The full interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced and were fixed during testing:

  • Provider configuration wasn't checked proactively → added step 1 environment verification + step 3 provider question (commit 8ccd277)
  • Trace data wasn't persisted by default → script now saves preview.parquet for inspection (commit 5c5d5a0)

The skill loader was also verified: npx skills add (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work

Test plan

  • CI passes (format-check, copyright-check, mkdocs build --strict)
  • from anonymizer import PrivacyGoal succeeds
  • Skill discoverable via npx skills add against this branch
  • Skill workflow produces a sensible script for a sample dataset (verified manually with PMC-Patients data)

@lipikaramaswamy lipikaramaswamy requested review from a team as code owners May 11, 2026 16:58
@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds a Claude Code skill at skills/anonymizer/ along with two new concept documents (docs/concepts/choosing-a-strategy.md, docs/troubleshooting.md), a README section, and a PrivacyGoal top-level export in src/anonymizer/__init__.py. The skill elicits dataset context, guides mode/strategy selection, and produces a runnable Python script following an interactive workflow.

  • New docs: choosing-a-strategy.md is a full decision guide covering mode, strategy, detection knobs, and privacy-goal phrasing; troubleshooting.md is a symptom-first guide for dropped rows, leakage, low utility, and pipeline failures. Both are wired into mkdocs.yml navigation.
  • PrivacyGoal export: Previously accessible only via anonymizer.config.rewrite, it is now exported at anonymizer.__init__ and properly listed in __all__.
  • Known tracked issue: result.trace_dataframe.to_parquet("preview.parquet") in the skill's output template will raise a serialization error until issue #152 (bug: result.trace_dataframe is not persistable via to_parquet) is resolved; this is acknowledged in the PR description and has already been flagged in a previous review comment.

Confidence Score: 4/5

Safe to merge for docs and the PrivacyGoal export; the skill preview path has a known crash on every first run that is tracked but not yet fixed.

The PrivacyGoal export, all doc content, and the interactive workflow are solid. The one live defect is in the skill template: result.trace_dataframe.to_parquet raises an unhandled exception before the failure-first guard and quality summary can execute on every preview run.

skills/anonymizer/SKILL.md — the output template preview path crashes on trace_dataframe serialization until issue #152 is fixed.
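
One way to keep the failure-first guard and quality summary reachable while issue #152 is open is to treat trace persistence as best-effort. The sketch below is only an illustration: `FakeTrace` and `persist_trace` are hypothetical names, and the stub stands in for `result.trace_dataframe` so the example runs without the library. The exact exception type raised by the real call is not stated in this thread, so the handler is deliberately broad.

```python
class FakeTrace:
    """Hypothetical stand-in for result.trace_dataframe; to_parquet always fails."""

    def to_parquet(self, path: str) -> None:
        raise TypeError("column holds objects that cannot be serialized")


def persist_trace(trace, path: str = "preview.parquet") -> bool:
    """Try to save the trace; report the failure instead of aborting the script."""
    try:
        trace.to_parquet(path)
    except Exception as exc:  # broad on purpose: the real error type is an unknown here
        print(f"warning: could not write {path}: {exc}")
        return False
    return True


saved = persist_trace(FakeTrace())
# The failure-first guard and quality summary would run here, whether or not
# the trace file was written.
print("trace saved:", saved)
```

With this shape, a serialization failure costs the saved trace file but not the rest of the preview output.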

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/anonymizer/SKILL.md | New skill definition with output template; the preview path calls result.trace_dataframe.to_parquet() which is non-serializable per issue #152, causing a crash before the failure-first guard runs (already flagged in previous review). |
| src/anonymizer/__init__.py | Adds PrivacyGoal re-export from anonymizer.config.rewrite; import path is correct, __all__ updated correctly. |
| docs/concepts/choosing-a-strategy.md | New decision guide; leakage/tolerance values match code constants, DEFAULT_ENTITY_LABELS unpacking syntax is correct, all cross-doc links resolve properly. |
| docs/troubleshooting.md | New symptom-first guide; leakage threshold table matches RiskToleranceBundle values in config/rewrite.py exactly, relative links are correct. |
| skills/anonymizer/workflows/interactive.md | Interactive workflow steps are consistent with the API; Anonymizer(model_providers=...) parameter verified against the constructor signature. |
| AGENTS.md | Adds a pointer to the new skill and a note to check SKILL.md before shipping public-API changes. |
| README.md | Adds a Using with Claude Code section with install command; straightforward documentation addition. |
| mkdocs.yml | Adds two new nav entries for choosing-a-strategy and troubleshooting; both files exist and are in the right directories. |
| docs/concepts/detection.md | Minor wording change; no logic change. |

Reviews (8): Last reviewed commit: "docs: mention bundled agent skill in AGE..." | Re-trigger Greptile

@lipikaramaswamy
Collaborator Author

Once #149 merges, this PR needs a follow-up edit to AGENTS.md:

  1. The dev-vs-user redirect at the top should also mention the bundled skill at skills/anonymizer/
  2. A note that public-API changes (especially imports referenced in skills/anonymizer/SKILL.md's output template) may require corresponding skill updates.

Comment thread docs/concepts/choosing-a-strategy.md Outdated
- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `employee_id`, `cost_center`, `internal_project_codename`
Contributor

mrn is in the default list, as is employee-id and possibly case_number and court_name

Collaborator Author

Good catch — fixed in de0ab26. Cross-checked all examples against the actual DEFAULT_ENTITY_LABELS list and dropped redundant ones (mrn, court_name, employee_id). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so medical record number and medical_record_number would have been treated as different labels.

Note on case_number: verified it's not in DEFAULT_ENTITY_LABELS, so I kept that one
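
To make the normalization point concrete, here is a minimal sketch assuming the validator does nothing beyond stripping whitespace and lowercasing, as described in the reply above. `normalize_label` is a hypothetical helper written for this illustration, not the library's validator.

```python
def normalize_label(label: str) -> str:
    """Hypothetical validator step: strip whitespace and lowercase, nothing else."""
    return label.strip().lower()


spaced = normalize_label("  Medical Record Number ")
snake = normalize_label("medical_record_number")

print(repr(spaced))
print(repr(snake))
# Spaces are not converted to underscores, so the two forms stay distinct labels.
print("same label:", spaced == snake)
```

Under that assumption the two spellings never collapse into one label, which is why the examples were moved to snake_case.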

lipikaramaswamy added a commit that referenced this pull request May 12, 2026
Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of
  domain-specific labels to *extend* the default list with, but all
  three are already in DEFAULT_ENTITY_LABELS (as
  `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings
  ("medical record number", "internal project codename") while the
  default list uses snake_case. Spaces are not normalized by the
  validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`,
`diagnosis_code`, `case_number`, `internal_project_codename`) and add
one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Comment thread docs/troubleshooting.md Outdated

Symptom-first guide to common problems and how to fix them. Each entry says how to diagnose, what knob to turn, and what to verify after.

When something looks wrong, **first confirm the run completed cleanly** (no dropped rows). Once you know the pipeline ran, **run `preview` on the failing rows** — the trace columns it produces are how you'll diagnose nine cases out of ten.
Contributor

Are dropped rows and failed rows different? My first assumption is they are the same, and then it is confusing to me to make sure there were no dropped rows but to then investigate any that failed.

Collaborator Author

Good catch — they were the same thing. Unified in 0c46cc0: the intro now uses one term ("rows that didn't make it through the pipeline, usually a rate-limit / infra issue") and explicitly contrasts with quality-issue rows (high leakage_mass, low utility_score, needs_human_review).
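
The distinction between dropped rows and quality-issue rows can be sketched as two separate checks. This is an illustration under stated assumptions: the column names `leakage_mass`, `utility_score`, and `needs_human_review` come from the reply above, while the `id` field, the sample rows, and both cut-off values are hypothetical.

```python
# Hypothetical cut-offs for illustration only; the real thresholds live in the library.
MAX_LEAKAGE = 1.0
MIN_UTILITY = 0.5

input_ids = [0, 1, 2, 3]

# Rows returned by a hypothetical preview run; row 2 never came back.
output_rows = [
    {"id": 0, "leakage_mass": 0.2, "utility_score": 0.9, "needs_human_review": False},
    {"id": 1, "leakage_mass": 2.5, "utility_score": 0.8, "needs_human_review": True},
    {"id": 3, "leakage_mass": 0.1, "utility_score": 0.3, "needs_human_review": False},
]

# Dropped rows: present in the input but missing from the output entirely.
returned = {row["id"] for row in output_rows}
dropped = [i for i in input_ids if i not in returned]

# Quality-issue rows: made it through the pipeline but look wrong on a trace metric.
quality_issues = [
    row["id"]
    for row in output_rows
    if row["leakage_mass"] > MAX_LEAKAGE
    or row["utility_score"] < MIN_UTILITY
    or row["needs_human_review"]
]

print("dropped:", dropped)
print("quality issues:", quality_issues)
```

Checking for dropped rows first matters because a row that never returned has no trace columns to inspect.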

lipikaramaswamy added a commit that referenced this pull request May 13, 2026
…g tone

Critical fix flagged by greptile review on PR #153:
`DEFAULT_ENTITY_LABELS` is exported as a `tuple[str, ...]`, so the
pattern `DEFAULT_ENTITY_LABELS + ["clinical_facility"]` raises
`TypeError: can only concatenate tuple (not "list") to tuple` at runtime.
Every user copying the documented domain-extension pattern would hit
this. Switched all four occurrences (SKILL.md prose tip, SKILL.md
template, choosing-a-strategy.md code block, troubleshooting.md code
block) to `[*DEFAULT_ENTITY_LABELS, "clinical_facility"]` (clean
unpacking that produces a list).

Tone polish in choosing-a-strategy.md: now that the doc is exposed
in user-facing mkdocs nav, three places that read as agent-facing
("ask the user", "if the user hasn't specified", `"User signal"`
table header) were reframed to address the reader directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
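
The tuple-concatenation failure described in this commit message can be reproduced without the library. In the sketch below, `DEFAULT_ENTITY_LABELS` is a three-element stand-in for the real constant (the real tuple holds more labels); only its type, `tuple[str, ...]`, is taken from the commit message.

```python
# Stand-in for the library constant; the real tuple is longer.
DEFAULT_ENTITY_LABELS: tuple[str, ...] = (
    "medical_record_number",
    "court_name",
    "employee_id",
)

# Broken pattern: a tuple cannot be concatenated with a list.
error = None
try:
    labels = DEFAULT_ENTITY_LABELS + ["clinical_facility"]
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Fixed pattern: unpack the tuple into a new list, then append the extra label.
labels = [*DEFAULT_ENTITY_LABELS, "clinical_facility"]
print(labels)
```

Unpacking also leaves the original constant untouched, so the extension stays local to the script that needs it.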
lipikaramaswamy and others added 10 commits May 12, 2026 18:15
- skills/anonymizer/ — Claude Code skill (SKILL.md + interactive
  workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md — decision guide for mode
  (Replace vs Rewrite), strategy, privacy goal phrasing, and
  detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md — symptom-first guide for dropped rows,
  leakage, low utility, and pipeline failures.
- mkdocs.yml — add the two new docs to navigation.
- README.md — add "Using with Claude Code" section pointing at the
  skills.sh installer.
- src/anonymizer/__init__.py — export PrivacyGoal at the top level
  (referenced from the skill's output template).
- docs/concepts/detection.md — minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Surfaced during manual skill testing: the workflow jumped from "verify
install" straight to data inspection and only mentioned provider /
API-key setup in the reactive Troubleshooting section. The agent would
discover a missing provider only after the user spent time on data
inspection and clarification, then watched preview fail.

- Extend step 1 to also verify provider config exists (API key env var
  + providers.yaml) and STOP with a pointer at docs/concepts/models.md
  if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults
  or a custom providers.yaml, so the generated script can pass the path
  via Anonymizer(model_providers=...).
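
A proactive provider check along the lines of the extended step 1 might look like the sketch below. It is a rough illustration, not the skill's actual logic: `check_provider_config` is a hypothetical helper, `EXAMPLE_API_KEY` is a placeholder for whichever environment variable the chosen provider expects, and the `providers.yaml` location is assumed to be the working directory.

```python
import os
from pathlib import Path


def check_provider_config(env_var: str, providers_path: str) -> list[str]:
    """Return a list of problems; an empty list means both prerequisites exist."""
    problems = []
    if not os.environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    if not Path(providers_path).is_file():
        problems.append(f"{providers_path} not found")
    return problems


# Placeholder names: substitute the real variable and path for your provider.
problems = check_provider_config("EXAMPLE_API_KEY", "providers.yaml")
if problems:
    print("STOP: provider configuration incomplete; see docs/concepts/models.md")
    for problem in problems:
        print(" -", problem)
else:
    print("provider configuration looks present")
```

Running a check like this before data inspection surfaces a missing key in seconds rather than after a failed preview.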

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Surfaced during manual skill testing: the script the agent generates
prints summary stats but doesn't persist any trace data. Investigating
'why was this entity kept/dropped?' or 'why did this row never converge
during repair?' required re-running an 8-minute preview, paying tokens
again.

- Output template now writes result.trace_dataframe to preview.parquet
  on every preview run. trace_dataframe is a superset of the user-facing
  dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency.
  trace columns include dict/list values that don't round-trip through
  CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the
  saved file instead of suggesting an interactive Python re-run.
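
The CSV round-trip problem mentioned above is easy to reproduce with the standard library. The sketch uses a made-up row whose `entities` column holds a list of dicts, loosely imitating a trace column; the field names are illustrative and not the library's schema.

```python
import csv
import io
import json

# Illustrative row: one plain-text column and one nested column.
row = {"text": "hello", "entities": [{"label": "first_name", "start": 0}]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "entities"])
writer.writeheader()
writer.writerow(row)  # the list is written as its Python repr

buffer.seek(0)
restored = next(csv.DictReader(buffer))

# The nested value comes back as a string, not a list.
print(type(restored["entities"]).__name__)

# The repr uses single quotes, so it is not valid JSON either.
try:
    json.loads(restored["entities"])
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print("parses as JSON:", parsed_ok)
```

A columnar format that stores nested types natively avoids this lossy stringification, which is the motivation for writing a single parquet file.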

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
- Rewrite the augmenter-steering bullet in active voice, leading with
  what data_summary actually is for. Drops the "(Augmenter-only)"
  parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls
  (beyond shared detection)" to make explicit that detection runs in
  both modes and these numbers exclude it — so the 0 for
  Redact/Annotate/Hash is correct (no additional LLM work beyond
  detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
… per review

- Mode (Replace vs Rewrite) reframed as "ask the user" rather than
  defaulting to Rewrite; applied across choosing-a-strategy.md,
  SKILL.md, and interactive.md.
- Replace-strategy default made explicit: Substitute when unspecified
  (choosing-a-strategy.md, SKILL.md, interactive.md).
- strict_entity_protection wording calls out "default False" and that
  being in a regulated domain alone isn't reason to flip True.
- gliner_threshold paragraph reworked to make both directions' costs
  explicit (lowering pays latency/tokens; raising trades for recall risk).
- data_summary section reframed: improves detection (precursor to both
  modes), additionally feeds Rewrite's rewriter for meaning preservation.
- "What to leave out" of data_summary now flags that Substitute
  behavior instructions belong in Substitute(instructions).
- Cheat sheet: 3 distinct Substitute rows (biographies, survey
  responses, support transcripts) instead of one.
- Troubleshooting: unified "dropped rows" / "failed records"
  terminology vs quality-issue rows; explained that extending
  entity_labels switches to strict mode, with the
  DEFAULT_ENTITY_LABELS + [...] pattern.
- Replaced "ringing the bell" latent-identifier example with the
  unambiguous "during her third round of chemo".
- Nits: tune→adjust, mrn spelled out, across-rows→across-documents,
  US spellings (prioritized, anonymized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
…ullet, wording

- troubleshooting.md: rewrite the "Hash output isn't stable across runs"
  section to distinguish the digest (text-only) from the output wrapper
  (templated with label), so the previously-contradictory bullets read
  consistently. Include the format_template="<HASH_{digest}>" workaround.
- troubleshooting.md: reorder the Substitute cross-row consistency bullet
  to lead with the fix (use Hash or post-process _replacement_map) before
  the per-row vs cross-row background.
- SKILL.md: polish the "LLM calls failing" line ("an auth issue, network
  problem, or wrong base URL" rather than "auth / network / wrong").
- SKILL.md: "for cost" → "for cost savings" in the gliner_threshold
  comment to make the direction explicit.
- README.md: "Replace vs Rewrite plus a strategy" → "Rewrite, or Replace
  with a strategy" so the strategy clause is unambiguously tied to
  Replace.
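
The digest-versus-wrapper distinction from the first bullet can be illustrated with `hashlib`. Everything here except the `<HASH_{digest}>` template is an assumption: the thread does not say which hash function, digest length, or default template the Hash strategy uses, so SHA-256 truncated to 12 hex characters and the labeled template below are placeholders, and `digest_of` and `render` are hypothetical helpers.

```python
import hashlib


def digest_of(text: str) -> str:
    """Digest depends on the text only (assumed: SHA-256, first 12 hex chars)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def render(text: str, label: str, format_template: str) -> str:
    """Wrap the digest in a template that may or may not mention the label."""
    return format_template.format(label=label, digest=digest_of(text))


LABELED = "<HASH_{label}_{digest}>"  # hypothetical label-bearing wrapper
TEXT_ONLY = "<HASH_{digest}>"        # workaround template named in the commit message

# Same text detected under two different labels.
a = render("Jane Doe", "first_name", LABELED)
b = render("Jane Doe", "person", LABELED)
c = render("Jane Doe", "first_name", TEXT_ONLY)
d = render("Jane Doe", "person", TEXT_ONLY)

print("labeled outputs match:", a == b)
print("text-only outputs match:", c == d)
```

The digest is identical in all four cases; only the wrapper changes, which is why dropping the label from the template makes the visible output stable when the detected label varies.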

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
…ift risk

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
lipikaramaswamy force-pushed the lipikaramaswamy/docs/anonymizer-skill branch from 19e8d84 to 242ace4 on May 13, 2026 01:18
lipikaramaswamy merged commit a0669ee into main on May 14, 2026 (11 checks passed)
lipikaramaswamy deleted the lipikaramaswamy/docs/anonymizer-skill branch on May 14, 2026 19:24
3 participants