Merged
4 changes: 3 additions & 1 deletion AGENTS.md
@@ -4,7 +4,7 @@
# AGENTS.md

This file is for agents **developing** NeMo Anonymizer — the codebase you are working in.
If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) instead.
If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) or the bundled agent skill at [`skills/anonymizer/`](skills/anonymizer/SKILL.md) instead.

**NeMo Anonymizer** detects and protects PII through context-aware entity replacement and LLM-powered rewriting. Users supply a text dataset and a strategy; Anonymizer detects entities and transforms the text.

@@ -27,6 +27,8 @@ If you are an agent helping a user **anonymize data**, use the [product document

NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. `NddAdapter.run_workflow()` is the engine boundary for *executing* DataDesigner workflows — engine sub-workflows may declare DataDesigner column configs (e.g. `LLMStructuredColumnConfig`), but they do not call `DataDesigner.create()` or `preview()` directly.

Public-API changes (re-exports, signatures, kwargs, default values) may require a matching update to the bundled agent skill at [`skills/anonymizer/SKILL.md`](skills/anonymizer/SKILL.md), whose output template imports and instantiates these symbols. Check the template before shipping a change to anything in `anonymizer/__init__.py` or the public surface of `Detect`, `Substitute`, `Redact`, `Annotate`, `Hash`, `Rewrite`, `PrivacyGoal`, or `AnonymizerInput`.

## Core Concepts

- **Entity** — a detected span of text with a label (e.g. `"Alice"` → `first_name`) and character offsets
14 changes: 14 additions & 0 deletions README.md
@@ -126,6 +126,20 @@ AnonymizerConfig(replace=Hash(algorithm="sha256", digest_length=8))

---

## Using with Claude Code

This repo ships a Claude Code skill at [`skills/anonymizer/`](skills/anonymizer/SKILL.md) that elicits your dataset's privacy requirements, recommends Rewrite or Replace with a strategy, and drafts a runnable script for you to iterate on. The skill should work with other coding agents that support skills, but development and testing have focused on Claude Code at this stage.

Install via [skills.sh](https://skills.sh):

```bash
npx skills add NVIDIA-NeMo/Anonymizer
```

After installation, invoke it with `/anonymizer` from within Claude Code, or describe what you want to anonymize and let it auto-trigger.

---

## Development

```bash
222 changes: 222 additions & 0 deletions docs/concepts/choosing-a-strategy.md
@@ -0,0 +1,222 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Choosing a strategy

This guide walks through the decisions you make when configuring an `AnonymizerConfig` for a real dataset. Use it to go from "I have data + a goal" to a concrete starting config in a few minutes.

It is also the primary reference for AI agents that drive the Anonymizer skill: every decision below is something the agent has to make on the user's behalf.

---

## Decision flow

```
1. (Detection) Describe the data? → AnonymizerInput(data_summary=...)
2. (Detection) Detection knobs? → Detect(entity_labels=..., gliner_threshold=...)
3. Replace or Rewrite? → AnonymizerConfig(replace=...) vs AnonymizerConfig(rewrite=...)
4. (Replace) Which strategy? → Substitute(...) | Redact(...) | Annotate(...) | Hash(...)
5. (Rewrite) Privacy goal? → Rewrite(privacy_goal=PrivacyGoal(protect=..., preserve=...))
6. (Rewrite) Risk tolerance? → Rewrite(risk_tolerance="minimal" | "low" | "moderate" | "high")
```

Steps 1–2 govern the [detection](detection.md) stage that runs first in **both** modes — usually the highest-leverage way to improve overall quality on a new dataset. Steps 3–6 shape the mode-specific transformation that follows detection.

---

## 1. (Detection) `data_summary`

`AnonymizerInput.data_summary` is an optional one-line description that flows into LLM prompts. It is the single cheapest quality lever you have. It improves **detection** — which runs first in both Replace and Rewrite modes, so it's a precursor to any transformation. In Rewrite mode it additionally provides context on what the data contains, helping the rewriter preserve meaning.

```python
from anonymizer import AnonymizerInput

data = AnonymizerInput(
source="patient_notes.csv",
text_column="note",
data_summary="De-identified inpatient progress notes from a US oncology service",
)
```

What to include:

- The domain (clinical, legal, financial, customer support, etc.)
- The genre (notes, transcripts, opinions, biographies)
- Anything about the source the engine couldn't infer from a single record (e.g. "transcribed phone calls — expect disfluencies")
- A soft do-not-tag list for the augmenter when `entity_labels=None`. In that mode the augmenter is free to invent labels beyond `DEFAULT_ENTITY_LABELS`, and `data_summary` is the only place to tell the LLM what *not* to tag (e.g. "do not tag generic anatomical terms, medication class names, or job titles as PII").

What to leave out:

- Lists of entity types **you want detected** (those go in `Detect.entity_labels`)
- Privacy/utility goals (those go in `Rewrite.privacy_goal`)
- Substitute behavior instructions such as "names should remain Portuguese" or "preserve numeric magnitude" (those go in `Substitute(instructions=...)`)
- Generic phrasing ("text data" adds no signal)

---

## 2. (Detection) Detection knobs

For most datasets the [detection](detection.md) defaults work. The main reason to adjust `entity_labels` is when your data has **domain-specific entities that can be described in plain English** — GLiNER is zero-shot, so any concept you can name (e.g. `"clinical_facility"`, `"internal_project_codename"`) becomes an entity it can find. Match the snake_case convention of `DEFAULT_ENTITY_LABELS`. If the entities you care about aren't in the default list, write them down and add them. Adjust `gliner_threshold` only when you see a specific recall or precision problem in preview.

### `entity_labels`

| Setting | Behavior | Use when |
|---|---|---|
| `None` (default) | Detect all `DEFAULT_ENTITY_LABELS`; the augmenter LLM can also infer new labels not in the default set | General-purpose — almost always the right starting point |
| Explicit list | **Strict mode** — only the labels you list are detected, augmenter cannot invent new ones | You have a domain-specific entity that the defaults don't cover, or you want to *narrow* detection to a known short list |

Common ways to extend the default list:

- Healthcare: `clinical_facility`, `diagnosis_code`, `medication_name`, `lab_test_code`
- Legal: `case_number`, `docket_number`, `statute_citation`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `cost_center`, `internal_project_codename`, `experiment_id`

```python
from anonymizer import DEFAULT_ENTITY_LABELS, Detect

detect = Detect(entity_labels=[*DEFAULT_ENTITY_LABELS, "clinical_facility", "diagnosis_code", "medication_name"])
```
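
Custom labels start life as plain-English descriptions, so it can help to normalize them to the snake_case convention before adding them to the list. A minimal sketch, assuming a hypothetical helper named `to_snake_case_label` (it is not part of the Anonymizer API and uses only the standard library):

```python
import re


def to_snake_case_label(description: str) -> str:
    """Turn a plain-English entity name into a snake_case label.

    Hypothetical helper for illustration; not part of the Anonymizer API.
    """
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    label = re.sub(r"[^a-z0-9]+", "_", description.strip().lower())
    # Drop underscores left over from leading or trailing punctuation.
    return label.strip("_")


names = ("Clinical facility", "Internal project-codename", "Diagnosis Code")
custom_labels = [to_snake_case_label(name) for name in names]
print(custom_labels)
```

The resulting strings can then be appended to `DEFAULT_ENTITY_LABELS` in the same way as the hand-written labels in the example above.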

### `gliner_threshold`

Default `0.3`. The validator catches false positives downstream, so erring low is safe.

| Symptom | Move | Try |
|---|---|---|
| Entities are being missed | Lower | `0.2` or even `0.15` |
| Validator is slow / expensive — it's being handed a huge candidate list | Raise | `0.4`–`0.5` |

The trade-off is asymmetric. **Lowering** the threshold doesn't hurt accuracy: the validator runs in batches of `validation_max_entities_per_call` (default `100`, tunable on `Detect`), so a long candidate list becomes more validator calls but not a worse validator. The cost of `gliner_threshold=0.2` is latency and tokens, not precision. **Raising** the threshold trades that cost for *recall risk*: GLiNER stops surfacing borderline candidates and you're relying on the augmenter LLM alone to fill the gap. Default `0.3` errs low; raise only when validator cost is hurting you, and verify with an `Annotate` preview before trusting a high-threshold setup.
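
To make the cost side concrete, the sketch below estimates how many validator calls one candidate list turns into under batching. It is a back-of-the-envelope illustration only: `validator_calls` is a hypothetical helper, the candidate counts per threshold are invented for the example, and how the engine actually groups candidates (per row, per chunk) is not specified here.

```python
import math


def validator_calls(num_candidates: int, max_entities_per_call: int = 100) -> int:
    """Estimate validator batches for one candidate list (hypothetical helper)."""
    if max_entities_per_call < 1:
        raise ValueError("max_entities_per_call must be at least 1")
    # Each call handles at most max_entities_per_call candidates.
    return math.ceil(num_candidates / max_entities_per_call)


# Invented candidate counts: a lower threshold surfaces more candidates.
for threshold, candidates in [(0.5, 80), (0.3, 140), (0.2, 260)]:
    calls = validator_calls(candidates)
    print(f"gliner_threshold={threshold}: {candidates} candidates, {calls} validator call(s)")
```

Under these assumed numbers, a lower threshold means more calls (latency and tokens), while each call still sees at most 100 candidates.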

---

## 3. Replace vs Rewrite

Both modes start from the same [detection](detection.md) pipeline. The difference is what happens after entities are detected.

| Question | Replace | Rewrite |
|---|---|---|
| Is the goal "scrub the entities and keep everything else"? | ✅ | — |
| Is the goal "produce a privacy-safe version of this text that downstream models can train on"? | — | ✅ |
| Are there inferable / latent identifiers that aren't explicitly stated (e.g. "during her third round of chemo" → cancer treatment)? | ❌ leaves them | ✅ removes them |
| Additional LLM calls (beyond shared detection) | ~1 (Substitute) or 0 (Redact/Annotate/Hash) | Many (domain → disposition → QA → rewrite → evaluate → repair → judge) |
| Output text length | ≈ same as input | Often shorter / restructured |
| Best for | Structured records, log scrubbing, known-list redaction | Free-text data with implicit identifiers (clinical notes, biographies, depositions, support transcripts) |

**Picking between them.** If your data has inferable identifiers that survive entity-only scrubbing (clinical notes, biographies, depositions), Rewrite is the right fit. For structured records, logs, or single-cell PII, Replace is faster and preserves shape. If you're unsure, walk through a few sample rows before deciding.

---

## 4. (Replace) Which strategy

The four strategies are summarised in [Replace](replace.md#strategy-comparison). The decision rule:

| You want… | Use | Why |
|---|---|---|
| Realistic-looking text safe for sharing or training | **Substitute** | LLM-generated synthetic values preserve readability |
| Clear visual marking that an entity was removed | **Redact** | `[REDACTED_FIRST_NAME]` is unambiguous |
| To inspect what was detected without losing the original | **Annotate** | Original text is preserved next to the label — **not privacy-safe on its own** |
| Deterministic linking across documents (same value → same token) | **Hash** | Same input always produces the same hash digest |

**If you're not sure which to pick, use `Substitute`.** It's the most general-purpose choice and matches the bulk of production usage.

### Writing `Substitute.instructions`

`Substitute` accepts free-form `instructions` that are passed to the replacement-generator LLM. Use them when the default behavior produces values that don't match your domain or downstream constraints.

| Pattern | When to use | Example |
|---|---|---|
| Format constraint | The original has a structural shape that must be preserved | `"Replacement IDs must keep the same prefix as the original (e.g. ACME-12345 → ACME-XXXXX)."` |
| Domain hint | Entities are domain-specific and need plausible domain values | `"Replacement names should be plausible Brazilian Portuguese names."` |
| Negative constraint | Avoid certain values | `"Do not use any name that appears in the original text."` |

Keep instructions short (one or two sentences). Long instructions compete with the per-entity context and degrade quality.

!!! note "Substitute is per-row, not per-dataset"

Within a single row, repeated mentions of the same value get one consistent replacement (entities are grouped by value before the LLM call). **Across rows the LLM has no shared memory** — each row is an independent call, so "Alice" in row 1 and "Alice" in row 47 will likely get different replacements. If you need stable cross-row mappings (e.g. to re-join records by an identifier), use `Hash` instead, or post-process `result.trace_dataframe["_replacement_map"]`.
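
The property that makes hashing suitable for cross-row joins can be illustrated with the standard library alone. This is a sketch of the idea, not Anonymizer's implementation: `hash_token` is a hypothetical function, and the exact token format that `Hash` emits (prefixes, salting, label handling) is an assumption left out here.

```python
import hashlib


def hash_token(value: str, digest_length: int = 16) -> str:
    """Deterministic stand-in token: a truncated SHA-256 hex digest.

    Hypothetical illustration; Anonymizer's Hash strategy may format tokens differently.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:digest_length]


# The same value maps to the same token no matter which row it appears in.
rows = ["Alice called on Monday.", "Ticket reopened by Alice."]
tokens = [hash_token("Alice") for _ in rows]
print(tokens[0] == tokens[1])
```

Because no state is shared between calls, the mapping stays stable across rows and across runs, which is what an LLM-generated substitute cannot guarantee.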

---

## 5. (Rewrite) Privacy goal

`Rewrite` ships with sensible defaults for `protect` and `preserve` (auto-populated when you pass `Rewrite()` with no arguments). Override them when you can be more specific than the generic defaults.

### How to write `protect`

`protect` answers: **"What should not appear in the output, even by inference?"**

| Pattern | Example |
|---|---|
| Direct identifiers + quasi-identifiers | `"All patient names, medical record numbers, dates of birth, and any combinations of attributes that could re-identify an individual"` |
| Explicit category list | `"Names, addresses, phone numbers, employer names, and any references to specific institutions"` |
| Inferable signals to suppress | `"Direct identifiers and any contextual phrases that could imply a specific medical condition or diagnosis"` |
| Domain-specific identifiers | `"Case numbers, court names, judge names, and any geographic identifiers below the state level"` |

### How to write `preserve`

`preserve` answers: **"What does the rewritten text need to keep so it's still useful?"**

| Pattern | Example |
|---|---|
| Domain content | `"Clinical findings, treatment plans, and medical terminology"` |
| Structural properties | `"The narrative flow, approximate timeline, and emotional tone of the conversation"` |
| Statistical properties | `"The age range and approximate location at country level so downstream demographics analysis remains valid"` |
| Task-relevant signals | `"Argument structure, citations to legal precedent, and the procedural posture of the case"` |

!!! tip "Be specific, but stay short"

Both fields must be 10–1000 characters and at least 3 words. Aim for 1–3 sentences. The more concrete you are, the more reliably the rewriter targets the right things.
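
A quick pre-flight check can catch drafts that fall outside these limits before you build a config. A minimal sketch, assuming the limits exactly as stated above; `check_goal_text` is a hypothetical helper, and whether the library trims whitespace before counting is an assumption.

```python
def check_goal_text(text: str) -> list[str]:
    """Return a list of problems; an empty list means the text passes both limits.

    Hypothetical pre-flight check, not part of the Anonymizer API.
    """
    problems = []
    stripped = text.strip()  # assumption: surrounding whitespace is not counted
    if not 10 <= len(stripped) <= 1000:
        problems.append(f"length {len(stripped)} is outside 10-1000 characters")
    if len(stripped.split()) < 3:
        problems.append("fewer than 3 words")
    return problems


print(check_goal_text("Clinical findings, treatment plans, and medical terminology"))
print(check_goal_text("names"))
```

Running the same check on both `protect` and `preserve` drafts gives early feedback without constructing a `PrivacyGoal`.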

### When to set `strict_entity_protection=True`

**Default: `False`.** Only set `True` when explicitly required by compliance or audit policy — not just because the data is medical, legal, or financial.

By default, low-risk quasi-identifiers may be left unchanged when the engine judges them safe in context. Set `strict_entity_protection=True` to force every detected entity into an active protection method.

Use it when:

- A documented compliance or audit policy *mandates* that every detected entity be actively protected (e.g. HIPAA Safe Harbor with strict interpretation, internal "zero unchanged identifiers" rule)
- You're producing data for external sharing where any unchanged identifier is a compliance risk
- Audit requires "every entity was actively protected"

Being in a regulated domain (medical / legal / financial) is **not** by itself a reason to set this to `True` — most regulated-domain processing tolerates the default behavior. Don't use it when utility matters more than blanket protection — it tends to increase modifications and can lower `utility_score`.

---

## 6. (Rewrite) Risk tolerance

`risk_tolerance` is a rewrite-only knob — it selects a coherent bundle of repair and review thresholds. The full table is in [Rewrite > Risk tolerance](rewrite.md#risk-tolerance); the choice rule is below.

| Goal | Pick |
|---|---|
| "Medical / legal / financial / external release" | `minimal` |
| Default for most privacy-sensitive data | `low` |
| "I want utility prioritized, this is internal-only" | `moderate` |
| "I just want to see the system run, will fix things by hand" | `high` |

Notes:

- `minimal` and `low` differ mostly in how aggressively repair triggers. Both auto-repair on any high-sensitivity leak.
- `high` does **not** auto-repair single high-sensitivity leaks. Use only when you have downstream review.
- `max_repair_iterations` (default 3) caps cost. Set to 0 to skip repair entirely while still computing leakage / utility metrics — useful for audits.

---

## Goal → starting config cheat sheet

A starting point for common scenarios. Always run `preview` and iterate from here.

| Goal | Mode | Strategy / config |
|---|---|---|
| "Scrub PII from logs for retention" | Replace | `Redact()` |
| "De-identify clinical notes for research sharing" | Rewrite | `Rewrite(privacy_goal=PrivacyGoal(protect="all PHI and any context that could imply a specific patient or facility", preserve="clinical findings, treatments, and outcomes"), risk_tolerance="minimal", strict_entity_protection=True)` |
| "Produce realistic-looking biographies for demos" | Replace | `Substitute(instructions="Names and locations should remain plausible for the original cultural context.")` |
| "Anonymize survey responses before sharing the dataset" | Replace | `Substitute()` |
| "Anonymize customer support transcripts for fine-tuning a model" | Replace | `Substitute(instructions="Preserve domain-specific terminology and locale.")` |
| "Anonymize legal opinions for an SFT dataset" | Rewrite | `Rewrite(privacy_goal=PrivacyGoal(protect="party names, case numbers, judge names, and locations below the state level", preserve="argument structure and procedural posture"), risk_tolerance="low")` |
| "Allow re-joining records by identifier without keeping the identifier" | Replace | `Hash(algorithm="sha256", digest_length=16)` |
| "I just want to see what the detector finds" | Replace | `Annotate()` (preview only — never ship Annotate output as anonymized data) |

Once you have a starting config, run `anonymizer validate <config>`, then `anonymizer preview --num-records 5 <config>`, then iterate. See [Troubleshooting](../troubleshooting.md) for what to change when preview shows a problem.
2 changes: 1 addition & 1 deletion docs/concepts/detection.md
@@ -69,7 +69,7 @@ Pools also act as **failover**. If a chunk's assigned alias can't complete the c

#### What happens when a row can't be validated

If validation can't get a complete answer for a row — every alias in the pool has failed on at least one of that row's chunks — the row is **dropped from the output** rather than passed through with some entities unvalidated. This is deliberate: the alternative would be writing the original text back out with those entities still un-scrubbed, which is exactly the outcome you're trying to avoid.
If validation can't get a complete answer for a row — every alias in the pool has failed on at least one of that row's chunks — the row is **dropped from the output** rather than passed through with some entities unvalidated. This is deliberate: the alternative would be writing the original text back out with those entities still un-scrubbed, which is an undesired outcome.

Dropped rows show up on `result.failed_records` with `step="detection"`, so you can tell which inputs didn't make it through by comparing input IDs against output IDs and reprocess those on a follow-up pass.
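
The ID comparison described above can be sketched in a few lines. The record IDs are invented for illustration and `find_dropped_ids` is a hypothetical helper; in practice the two lists would come from your own input and output datasets.

```python
def find_dropped_ids(input_ids, output_ids):
    """Return input IDs that are missing from the output, in input order."""
    produced = set(output_ids)
    return [record_id for record_id in input_ids if record_id not in produced]


# Invented IDs: rows r2 and r4 did not make it through.
input_ids = ["r1", "r2", "r3", "r4"]
output_ids = ["r1", "r3"]
retry_ids = find_dropped_ids(input_ids, output_ids)
print(retry_ids)
```

The resulting list is the set of rows to feed into the follow-up pass.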
