docs: add anonymizer Claude Code skill and supporting concept docs #153
Greptile Summary
This PR adds a Claude Code skill at
Confidence Score: 4/5
Safe to merge for docs and the PrivacyGoal export; the skill preview path has a known crash on every first run that is tracked but not yet fixed.
The PrivacyGoal export, all doc content, and the interactive workflow are solid. The one live defect is in the skill template: result.trace_dataframe.to_parquet raises an unhandled exception before the failure-first guard and quality summary can execute on every preview run.
skills/anonymizer/SKILL.md: the output template preview path crashes on trace_dataframe serialization until issue #152 is fixed.
Reviews (8): Last reviewed commit: "docs: mention bundled agent skill in AGE..."
Once #149 merges, this PR needs a follow-up edit to AGENTS.md:
- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `employee_id`, `cost_center`, `internal_project_codename`
mrn is in the default list, as is employee-id and possibly case_number and court_name
Good catch — fixed in de0ab26. Cross-checked all examples against the actual DEFAULT_ENTITY_LABELS list and dropped redundant ones (mrn, court_name, employee_id). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so medical record number and medical_record_number would have been treated as different labels.
Note on case_number: verified it's not in DEFAULT_ENTITY_LABELS, so I kept that one
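The normalization point in the reply above can be illustrated with a small sketch. `normalize_label` is a hypothetical stand-in for the validator, written under the assumption (taken from the reply) that it only strips surrounding whitespace and lowercases; it is not the library's actual function.

```python
def normalize_label(label):
    """Hypothetical stand-in: strip surrounding whitespace and lowercase only."""
    return label.strip().lower()


spaced = normalize_label(" Medical Record Number ")
snake = normalize_label("medical_record_number")

# Inner spaces survive this kind of normalization, so the two spellings
# remain distinct labels rather than collapsing into one.
labels_match = spaced == snake
print(repr(spaced), repr(snake), labels_match)
```

Under that assumption, a space-separated label would be treated as a new custom label rather than as the existing snake_case default.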
Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of domain-specific labels to *extend* the default list with, but all three are already in DEFAULT_ENTITY_LABELS (as `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings ("medical record number", "internal project codename") while the default list uses snake_case. Spaces are not normalized by the validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`, `diagnosis_code`, `case_number`, `internal_project_codename`) and add one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Symptom-first guide to common problems and how to fix them. Each entry says how to diagnose, what knob to turn, and what to verify after.
When something looks wrong, **first confirm the run completed cleanly** (no dropped rows). Once you know the pipeline ran, **run `preview` on the failing rows**: the trace columns it produces are how you'll diagnose nine cases out of ten.
Are dropped rows and failed rows different? My first assumption is they are the same, and then it is confusing to me to make sure there were no dropped rows but to then investigate any that failed.
Good catch — they were the same thing. Unified in 0c46cc0: the intro now uses one term ("rows that didn't make it through the pipeline, usually a rate-limit / infra issue") and explicitly contrasts with quality-issue rows (high leakage_mass, low utility_score, needs_human_review).
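The distinction drawn in the reply above (rows that never made it through the pipeline versus rows that completed but look wrong on quality) can be sketched in plain Python. The column names `leakage_mass`, `utility_score`, and `needs_human_review` come from the reply; the `id` field, the row shapes, and both thresholds are illustrative assumptions, not library defaults.

```python
# Hypothetical preview data; only the three quality column names are
# taken from the discussion, everything else is made up for illustration.
input_ids = [0, 1, 2, 3]
output_rows = [
    {"id": 0, "leakage_mass": 0.1, "utility_score": 0.9, "needs_human_review": False},
    {"id": 1, "leakage_mass": 1.4, "utility_score": 0.8, "needs_human_review": True},
    {"id": 3, "leakage_mass": 0.2, "utility_score": 0.3, "needs_human_review": False},
]

LEAKAGE_MAX = 1.0  # assumed cutoff
UTILITY_MIN = 0.5  # assumed cutoff

# Dropped rows: present in the input, absent from the output.
seen = {row["id"] for row in output_rows}
dropped = [i for i in input_ids if i not in seen]

# Quality-issue rows: completed, but flagged by at least one quality signal.
quality_issues = [
    row["id"]
    for row in output_rows
    if row["leakage_mass"] > LEAKAGE_MAX
    or row["utility_score"] < UTILITY_MIN
    or row["needs_human_review"]
]
print("dropped:", dropped, "quality issues:", quality_issues)
```

Checking the dropped set first matches the order the troubleshooting intro recommends: confirm the run completed, then look at quality.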
…g tone

Critical fix flagged by greptile review on PR #153: `DEFAULT_ENTITY_LABELS` is exported as a `tuple[str, ...]`, so the pattern `DEFAULT_ENTITY_LABELS + ["clinical_facility"]` raises `TypeError: can only concatenate tuple (not "list") to tuple` at runtime. Every user copying the documented domain-extension pattern would hit this.

Switched all four occurrences (SKILL.md prose tip, SKILL.md template, choosing-a-strategy.md code block, troubleshooting.md code block) to `[*DEFAULT_ENTITY_LABELS, "clinical_facility"]` (clean unpacking that produces a list).

Tone polish in choosing-a-strategy.md: now that the doc is exposed in user-facing mkdocs nav, three places that read as agent-facing ("ask the user", "if the user hasn't specified", `"User signal"` table header) were reframed to address the reader directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
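The failure and the fix described in this commit can be reproduced without the library installed. In the sketch below, `DEFAULT_ENTITY_LABELS` is a local stand-in tuple with placeholder contents, assumed only to share the real constant's type.

```python
# Stand-in for the library constant; the commit states it is a tuple of
# strings. The two entries here are placeholders.
DEFAULT_ENTITY_LABELS = ("medical_record_number", "court_name")

# Old documented pattern: tuple + list is not defined in Python.
try:
    broken = DEFAULT_ENTITY_LABELS + ["clinical_facility"]
except TypeError as exc:
    error_message = str(exc)

# New pattern: unpack the tuple into a fresh list, then append.
fixed = [*DEFAULT_ENTITY_LABELS, "clinical_facility"]

print(error_message)
print(fixed)
```

Unpacking also leaves the original tuple untouched, so the defaults cannot be mutated by accident.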
- skills/anonymizer/: Claude Code skill (SKILL.md + interactive workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md: decision guide for mode (Replace vs Rewrite), strategy, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md: symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- mkdocs.yml: add the two new docs to navigation.
- README.md: add "Using with Claude Code" section pointing at the skills.sh installer.
- src/anonymizer/__init__.py: export PrivacyGoal at the top level (referenced from the skill's output template).
- docs/concepts/detection.md: minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Surfaced during manual skill testing: the workflow jumped from "verify install" straight to data inspection and only mentioned provider / API-key setup in the reactive Troubleshooting section. The agent would discover a missing provider only after the user spent time on data inspection and clarification, then watched preview fail.

- Extend step 1 to also verify provider config exists (API key env var + providers.yaml) and STOP with a pointer at docs/concepts/models.md if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults or a custom providers.yaml, so the generated script can pass the path via Anonymizer(model_providers=...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
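A fail-fast provider check of the kind this commit adds to step 1 might look roughly like the sketch below. The helper name, the environment variable `EXAMPLE_API_KEY`, and the `providers.yaml` location are all placeholders chosen for illustration; the skill's real check may differ.

```python
import os
from pathlib import Path


def missing_provider_config(api_key_var, providers_path):
    """Return human-readable problems; an empty list means it is safe to continue."""
    problems = []
    if not os.environ.get(api_key_var):
        problems.append(f"environment variable {api_key_var} is not set")
    if not Path(providers_path).is_file():
        problems.append(f"{providers_path} not found")
    return problems


# Placeholder names: substitute the variable and path your setup actually uses.
problems = missing_provider_config("EXAMPLE_API_KEY", "providers.yaml")
if problems:
    print("STOP, see docs/concepts/models.md:", "; ".join(problems))
```

Running the check before data inspection means a missing key costs seconds instead of a failed preview.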
Surfaced during manual skill testing: the script the agent generates prints summary stats but doesn't persist any trace data. Investigating 'why was this entity kept/dropped?' or 'why did this row never converge during repair?' required re-running an 8-minute preview, paying tokens again.

- Output template now writes result.trace_dataframe to preview.parquet on every preview run. trace_dataframe is a superset of the user-facing dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency. trace columns include dict/list values that don't round-trip through CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the saved file instead of suggesting an interactive Python re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
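The CSV round-trip concern in this commit can be seen with the standard library alone. The row below is invented for illustration and is not the real trace schema; it only assumes that trace columns hold nested list/dict values, as the commit says.

```python
import csv
import io

# Invented trace-like row with a nested value.
row = {"text": "hello", "entities": [{"label": "name", "span": [0, 5]}]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

buffer.seek(0)
restored = next(csv.DictReader(buffer))

# The nested value comes back as a string, not as a list of dicts.
print(type(restored["entities"]).__name__)
print(restored["entities"] == row["entities"])
```

A columnar format that preserves nested types avoids this lossy stringification, which is the rationale the commit gives for a single parquet file.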
- Rewrite the augmenter-steering bullet in active voice, leading with what data_summary actually is for. Drops the "(Augmenter-only)" parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls (beyond shared detection)" to make explicit that detection runs in both modes and these numbers exclude it, so the 0 for Redact/Annotate/Hash is correct (no additional LLM work beyond detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
… per review

- Mode (Replace vs Rewrite) reframed as "ask the user" rather than defaulting to Rewrite; applied across choosing-a-strategy.md, SKILL.md, and interactive.md.
- Replace-strategy default made explicit: Substitute when unspecified (choosing-a-strategy.md, SKILL.md, interactive.md).
- strict_entity_protection wording calls out "default False" and that being in a regulated domain alone isn't reason to flip True.
- gliner_threshold paragraph reworked to make both directions' costs explicit (lowering pays latency/tokens; raising trades for recall risk).
- data_summary section reframed: improves detection (precursor to both modes), additionally feeds Rewrite's rewriter for meaning preservation.
- "What to leave out" of data_summary now flags that Substitute behavior instructions belong in Substitute(instructions).
- Cheat sheet: 3 distinct Substitute rows (biographies, survey responses, support transcripts) instead of one.
- Troubleshooting: unified "dropped rows" / "failed records" terminology vs quality-issue rows; explained that extending entity_labels switches to strict mode, with the DEFAULT_ENTITY_LABELS + [...] pattern.
- Replaced "ringing the bell" latent-identifier example with the unambiguous "during her third round of chemo".
- Nits: tune→adjust, mrn spelled out, across-rows→across-documents, US spellings (prioritized, anonymized).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
…ullet, wording
- troubleshooting.md: rewrite the "Hash output isn't stable across runs"
section to distinguish the digest (text-only) from the output wrapper
(templated with label), so the previously-contradictory bullets read
consistently. Include the format_template="<HASH_{digest}>" workaround.
- troubleshooting.md: reorder the Substitute cross-row consistency bullet
to lead with the fix (use Hash or post-process _replacement_map) before
the per-row vs cross-row background.
- SKILL.md: polish the "LLM calls failing" line ("an auth issue, network
problem, or wrong base URL" rather than "auth / network / wrong").
- SKILL.md: "for cost" → "for cost savings" in the gliner_threshold
comment to make the direction explicit.
- README.md: "Replace vs Rewrite plus a strategy" → "Rewrite, or Replace
with a strategy" so the strategy clause is unambiguously tied to
Replace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
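The digest-versus-wrapper distinction from the Hash troubleshooting rewrite above can be sketched as follows. Everything here is an assumption made for illustration: the SHA-256 choice, the truncation length, the `{label}` placeholder, and the shape of the label-bearing template. Only the `"<HASH_{digest}>"` template string comes from the commit message.

```python
import hashlib


def digest(text, length=12):
    """Digest depends on the entity text only (assumed algorithm and length)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def render(text, label, format_template):
    """Wrap the digest in a template; unused placeholders are simply ignored."""
    return format_template.format(digest=digest(text), label=label)


labeled_template = "<HASH_{label}_{digest}>"  # assumed label-bearing wrapper
digest_only_template = "<HASH_{digest}>"      # workaround from the commit message

# Same text, detected under two different labels on two runs.
with_label_a = render("Jane Doe", "first_name", labeled_template)
with_label_b = render("Jane Doe", "person", labeled_template)
stable_a = render("Jane Doe", "first_name", digest_only_template)
stable_b = render("Jane Doe", "person", digest_only_template)

print(with_label_a != with_label_b, stable_a == stable_b)
```

Under these assumptions the digest itself never changes; only a wrapper that embeds the label can make the final output differ between runs.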
…ift risk

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
19e8d84 to 242ace4
Summary
- `skills/anonymizer/`: elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in `interactive.md`.
- `docs/concepts/choosing-a-strategy.md` (215L): decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- `docs/troubleshooting.md` (210L): symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- `README.md`: new "Using with Claude Code" section pointing at the `skills.sh` installer.
- `mkdocs.yml`: adds the two new docs to navigation.
- `src/anonymizer/__init__.py`: exports `PrivacyGoal` at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
- `docs/concepts/detection.md`: minor wording polish.

Testing notes
Tested end-to-end with 20 PMC-Patients clinical case reports. The full interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced and were fixed during testing:
- … 8ccd277)
- … `preview.parquet` for inspection (commit 5c5d5a0)

The skill loader was also verified:
- `npx skills add` (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work
- … `llm_replace_workflow.py:53-55` with `COL_*` constants (existing STYLEGUIDE rule violation)
- … `result.trace_dataframe` not persistable via `to_parquet`. Blocks the skill's `preview.parquet` save from being readable until fixed. Same 3 lines as "chore: use COL_* constants for intermediate columns in llm_replace_workflow.py" #148.

Test plan
- … `format-check`, `copyright-check`, `mkdocs build --strict`)
- `from anonymizer import PrivacyGoal` succeeds
- `npx skills add` against this branch
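The top-level export item in the test plan could be checked with a small smoke test along these lines. The helper `exports` is a hypothetical name, and the sketch only approximates `from anonymizer import PrivacyGoal` by checking that the attribute exists on the imported package.

```python
import importlib


def exports(module_name, attr):
    """Return True if the module imports and exposes `attr` at its top level."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


# Prints False if the package is not installed in the current environment.
print(exports("anonymizer", "PrivacyGoal"))
```

Returning a boolean instead of raising keeps the check usable in an environment where the package has not been installed yet.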