3 changes: 3 additions & 0 deletions .gitignore
@@ -103,6 +103,9 @@ CLAUDE.local.md
.claude/settings.local.json
ai/tmp/

# Claude worktrees
.claude/worktrees/

# Anonymizer execution artifacts
.anonymizer-artifacts/
docs/notebook_source/data/synth_bios_sample10_anonymized.csv
118 changes: 118 additions & 0 deletions AGENTS.md
@@ -0,0 +1,118 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# AGENTS.md

This file is for agents **developing** NeMo Anonymizer — the codebase you are working in.
If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) instead.

**NeMo Anonymizer** detects and protects PII through context-aware entity replacement and LLM-powered rewriting. Users supply a text dataset and a strategy; Anonymizer detects entities and transforms the text.

## Agent compatibility

`AGENTS.md` is the canonical instruction file for coding agents working in this repository. Keep it tool-neutral:

- Use plain Markdown and repository-relative links.
- Do not rely on vendor-specific include syntax, slash commands, MCP names, or IDE-only behavior.
- Put tool-specific adapter instructions in thin wrapper files such as `CLAUDE.md`.

## Module Map

`nemo-anonymizer` is a single package with three primary subpackages plus top-level public utilities:

- **`anonymizer.config`** — user-facing configuration: `AnonymizerConfig`, `AnonymizerInput`, replace strategies (`Substitute`, `Redact`, `Annotate`, `Hash`), and rewrite config (`Rewrite`, `EvaluationCriteria`, `RiskTolerance`). New user-facing knobs go here.
- **`anonymizer.engine`** — internal pipeline implementation: detection, replacement, and rewrite sub-workflows, the NDD adapter, prompt utilities, and all `COL_*` column constants. Never imported directly by users.
- **`anonymizer.interface`** — user-facing entry points: the `Anonymizer` class, CLI, `AnonymizerResult`, `PreviewResult`, and canonical error types. Thin layer that wires config → engine and exposes results.
- **`anonymizer.logging`** — public logging configuration (`LoggingConfig`, `configure_logging`) used by the API, CLI, and examples.

NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. `NddAdapter.run_workflow()` is the engine boundary for *executing* DataDesigner workflows — engine sub-workflows may declare DataDesigner column configs (e.g. `LLMStructuredColumnConfig`), but they do not call `DataDesigner.create()` or `preview()` directly.

## Core Concepts

- **Entity** — a detected span of text with a label (e.g. `"Alice"` → `first_name`) and character offsets
- **Latent entity** — an entity detected in rewrite mode that is sensitive but not directly named; used to guide rewriting without explicit replacement
- **Replacement map** — a per-record dict mapping entity text → substitute value, built by `LlmReplaceWorkflow` and injected into rewrite prompts
- **Leakage mass** — a weighted score measuring how much sensitive information survives in a rewritten record; drives the repair loop
- **Utility score** — a 0–1 score measuring how much semantic content the rewritten record preserves
- **RiskTolerance** — a preset (`minimal` / `low` / `moderate` / `high`) that bundles the leakage threshold, repair behaviour, and human-review flags into a single user-facing knob
- **Repair loop** — the evaluate → repair → re-evaluate cycle in `RewriteWorkflow`; runs up to `max_repair_iterations` times on failing rows
- **FailedRecord** — a record that was dropped by an NDD workflow; surfaced explicitly rather than silently lost

## Pipelines

### Replace mode — `AnonymizerConfig(replace=...)`

```
input_df
  → EntityDetectionWorkflow.run()          # engine/detection/detection_workflow.py
      GLiNER detection
      → parse + tag
      → LLM augmentation (add entities GLiNER missed)
      → LLM validation (keep / drop candidates)
      → merge + finalize → COL_DETECTED_ENTITIES, COL_FINAL_ENTITIES
  → ReplacementWorkflow.run()              # engine/replace/replace_runner.py
      Redact / Annotate / Hash → applied locally, no LLM
      Substitute               → LlmReplaceWorkflow → NddAdapter
  → output: {text_col}_replaced, {text_col}_with_spans, final_entities
```

### Rewrite mode — `AnonymizerConfig(rewrite=...)`

```
input_df
  → EntityDetectionWorkflow.run()          # same as above, plus latent entity tagging
  → RewriteWorkflow.run()                  # engine/rewrite/rewrite_workflow.py
      LlmReplaceWorkflow.generate_map_only()   # build replacement map for prompt
      → single NDD adapter call (pipeline_columns):
          DomainClassificationWorkflow   → _domain, _domain_supplement
          SensitivityDispositionWorkflow → _sensitivity_disposition
          QAGenerationWorkflow           → _quality_qa, _privacy_qa
          RewriteGenerationWorkflow      → _rewritten_text
      → evaluate-repair loop (up to max_repair_iterations):
          EvaluateWorkflow → leakage_mass, utility_score, _needs_repair
          RepairWorkflow   → _rewritten_text (failing rows only)
      → FinalJudgeWorkflow (non-critical) → _judge_evaluation, needs_human_review
  → output: {text_col}_rewritten, utility_score, leakage_mass, needs_human_review, …
```

Records with no detected entities skip all LLM sub-workflows and pass through with default metrics (utility=1.0, leakage=0.0).
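
The evaluate-repair loop from the rewrite diagram can be pictured with a minimal sketch. Everything in it is hypothetical (the `Row` container, the `evaluate` and `repair` callables, the `leakage_threshold` parameter); it illustrates only the control flow of "evaluate, repair failing rows, re-evaluate, at most `max_repair_iterations` times" and is not the actual `RewriteWorkflow` code.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for one record in the rewrite pipeline."""

    text: str
    leakage_mass: float = 0.0
    needs_repair: bool = False


def run_repair_loop(
    rows: list[Row],
    evaluate: Callable[[Row], float],
    repair: Callable[[Row], str],
    leakage_threshold: float,
    max_repair_iterations: int,
) -> int:
    """Evaluate every row, then repair and re-evaluate only the failing rows.

    Returns the number of repair iterations that actually ran.
    """
    for row in rows:
        row.leakage_mass = evaluate(row)
        row.needs_repair = row.leakage_mass > leakage_threshold

    iterations = 0
    while iterations < max_repair_iterations:
        failing = [row for row in rows if row.needs_repair]
        if not failing:
            break  # every row passes, so stop early
        for row in failing:
            row.text = repair(row)
            row.leakage_mass = evaluate(row)
            row.needs_repair = row.leakage_mass > leakage_threshold
        iterations += 1
    return iterations
```

Rows that already pass are never handed to `repair`, which mirrors the "failing rows only" note in the diagram; rows still failing after the last iteration keep `needs_repair` set so a later step can flag them.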

## Config Pattern

`AnonymizerConfig.rewrite` is the user-facing `Rewrite` model. The engine never receives `Rewrite` directly — it receives `EvaluationCriteria` via the `Rewrite.evaluation` property. See that property's docstring for the sync contract (how `risk_tolerance` and `max_repair_iterations` flow into the engine, why production code should not duplicate the mapping).

## NDD Adapter

`NddAdapter.run_workflow()` (`engine/ndd/adapter.py`) is the engine boundary for *executing* DataDesigner workflows. See its docstring for the contract (input/output shapes, `FailedRecord` semantics).

## Prompt Conventions

NDD prompts are inline triple-quoted strings in the workflow file that uses them; there is no separate registry. For DataFrame column references inside templates, use `_jinja()`; for dynamic prompt values, use `substitute_placeholders()`. See each function's docstring for details.

## Structural Invariants

Code conventions enforced in review (future-annotations import, absolute imports, type annotations, SPDX headers, column-name constants) live in [STYLEGUIDE.md](STYLEGUIDE.md).

One pipeline-specific fact worth knowing: `COL_TEXT` is the internal name for the input text column; it's renamed to the user's original column name in final output.

## What NOT To Do

- **Don't duplicate the `Rewrite` → `EvaluationCriteria` mapping** when production code starts from a `Rewrite`; route it through `Rewrite.evaluation`.
- **Don't execute DataDesigner workflows directly** — call `DataDesigner.create()` / `.preview()` only via `NddAdapter.run_workflow()`. Declaring column configs (`LLMStructuredColumnConfig`, etc.) is fine.
- **Don't use string literals for column names** — use `COL_*` constants from `engine/constants.py`
- **Don't add a domain to only one supplement map** — see `engine/rewrite/domain_classification.py` for the sync invariant
- **Don't hardcode `gliner_threshold`** — it belongs in `Detect` config (default 0.3)

## Development

```bash
make test # run all tests
make bootstrap # install dev dependencies
make format # ruff format + sort imports
make format-check # read-only lint check (used in CI)
make typecheck # ty type check (advisory)
make docs-serve # local MkDocs server at http://127.0.0.1:8000
```

For contributor workflow and branch naming see [CONTRIBUTING.md](CONTRIBUTING.md).
For code style and naming conventions see [STYLEGUIDE.md](STYLEGUIDE.md).
8 changes: 8 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,8 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Claude Code instructions

Canonical agent instructions live in [AGENTS.md](AGENTS.md).

@AGENTS.md
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -208,6 +208,17 @@ The `main` branch has the following protections:
- All `src` and `tests` files: `@NVIDIA-NeMo/anonymizer-reviewers`
- All remaining files (`pyproject.toml`, `uv.lock`, `SECURITY.md`, `LICENSE`, `.github/`, etc.): `@NVIDIA-NeMo/anonymizer-maintainers`

### Agent-Assisted Development

When automating edits with coding agents (IDE assistants, CLI tools, or hosted models), follow the standard [Pull Request Process](#pull-request-process) plus these additions:

1. **For non-trivial changes, draft a plan first.** Non-trivial includes: changes spanning more than one of the `config` / `engine` / `interface` subsystems, introducing a new public API, or modifying an invariant called out in [AGENTS.md](AGENTS.md) or [STYLEGUIDE.md](STYLEGUIDE.md).
   - Write a Markdown file detailing the approach, trade-offs considered, affected subsystems, and delivery strategy, with enough detail for reviewers to evaluate the design before implementation begins. (Have the agent draft it; review and refine before submitting.)
   - Save it at `plans/<issue-number>/<short-name>.md` and submit it as its own PR for review.
   - Once the plan is approved, implement it in a follow-up PR.

2. **Implement following [AGENTS.md](AGENTS.md) and [STYLEGUIDE.md](STYLEGUIDE.md).** Both capture pipeline structure, naming conventions, and invariants ruff and ty cannot enforce. Implementers — human or agentic — should read these before non-trivial changes.

## Issues and Discussions

### Issue Templates
236 changes: 236 additions & 0 deletions STYLEGUIDE.md
@@ -0,0 +1,236 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Style Guide

Code and documentation conventions for NeMo Anonymizer that ruff and ty cannot enforce. Architecture boundaries and agent workflow rules live in [AGENTS.md](AGENTS.md). Read before adding a new module, workflow, or config class.

NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. References to NDD below mean that library.

For architecture and pipeline identity, see [AGENTS.md](AGENTS.md).
For contribution workflow and branch naming, see [CONTRIBUTING.md](CONTRIBUTING.md).

---

## Pydantic vs Dataclasses

**Pydantic** for config, validation, and serialization. **Dataclasses** for simple typed containers in the engine.

| Need | Use |
|------|-----|
| User-facing config, validation, JSON schema | `BaseModel` |
| Private/internal frozen value object (e.g. `WorkflowRunResult`, `_RiskToleranceBundle`) | `@dataclass(frozen=True)` |

```python
# Config — Pydantic
class Detect(BaseModel):
    gliner_threshold: float = Field(default=0.3, ge=0.0, le=1.0)

# Internal result — dataclass
@dataclass(frozen=True)
class WorkflowRunResult:
    dataframe: pd.DataFrame
    failed_records: list[FailedRecord]
```

Use `Field()` only when you need constraints (`ge`, `le`), descriptions, or `default_factory`. Use bare defaults for simple flags and strings.

---

## Error Handling

Wrap exceptions from NDD and other third-party calls at module boundaries into canonical types from `interface/errors.py`. Callers should never see raw NDD exceptions.

Preserve the traceback:

```python
# Good
try:
    run_results = self._data_designer.create(...)
except Exception as exc:
    raise AnonymizerWorkflowError(f"Workflow failed: {exc}") from exc

# Bad — swallows the traceback
except Exception as exc:
    raise AnonymizerWorkflowError("Workflow failed")
```
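
To see what `from exc` buys, here is a small self-contained sketch. `AnonymizerWorkflowError` is redefined locally as a stand-in (the canonical type lives in `interface/errors.py`), and `call_third_party` is a hypothetical failing call; neither is the project's real code.

```python
class AnonymizerWorkflowError(Exception):
    """Local stand-in for the canonical error type."""


def call_third_party() -> None:
    """Hypothetical third-party call that always fails."""
    raise RuntimeError("upstream timeout")


def run_wrapped_call() -> None:
    """Wrap the third-party failure at the module boundary."""
    try:
        call_third_party()
    except Exception as exc:
        # `from exc` stores the original exception on __cause__, so the
        # rendered traceback shows both errors and the message keeps the detail.
        raise AnonymizerWorkflowError(f"Workflow failed: {exc}") from exc
```

A caller that catches `AnonymizerWorkflowError` can still inspect `err.__cause__` to reach the original `RuntimeError` without depending on it directly.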

Don't use defensive `try/except` on trusted internal calls that shouldn't fail — only catch at module boundaries. `RewriteWorkflow._run_final_judge` is the intentional exception: it's explicitly non-critical and catches broadly, logging with `exc_info=True` and substituting safe defaults.
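
The non-critical pattern described above (catch broadly, log with `exc_info=True`, substitute safe defaults) might look roughly like the following. The function name, the `judge` callable, and the default values are all assumptions made for illustration; this is not the body of `RewriteWorkflow._run_final_judge`.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)

# Hypothetical safe defaults; the real keys and values may differ.
SAFE_JUDGE_DEFAULTS: dict[str, object] = {
    "judge_evaluation": None,
    "needs_human_review": True,
}


def run_non_critical_judge(
    judge: Callable[[str], dict[str, object]],
    text: str,
) -> dict[str, object]:
    """Run an optional judging step without letting its failure abort the pipeline."""
    try:
        return judge(text)
    except Exception:
        # exc_info=True keeps the full traceback in the log even though
        # the exception is deliberately not re-raised.
        logger.warning("Final judge failed; substituting safe defaults", exc_info=True)
        return dict(SAFE_JUDGE_DEFAULTS)
```

Returning a copy of the defaults keeps one failed row from sharing mutable state with another. Defaulting to "needs human review" is a conservative guess at what "safe" means here.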

**Error messages** must identify the actual bad value. Use `!r` to make interpolated values unambiguous:

```python
# Good
raise ValueError(f"Unsupported strategy: {strategy!r}")

# Bad
raise ValueError("Invalid strategy")
```

**No `assert` for validation in production/library code** — `assert` statements are stripped when Python runs with `-O`. Use `if/raise` instead. Pytest assertions in tests are fine.

```python
# Good
if not isinstance(config, AnonymizerConfig):
    raise TypeError(f"Expected AnonymizerConfig, got {type(config)!r}")

# Bad
assert isinstance(config, AnonymizerConfig)
```

---

## Column Names

All column names are constants in `engine/constants.py`. Never use string literals for column names.

```python
# Good
df[COL_DETECTED_ENTITIES]

# Bad
df["_detected_entities"]
```

Internal (intermediate) columns are prefixed with `_`. User-facing output columns use clean names (`final_entities`, `utility_score`). The input text column is always `COL_TEXT` internally and renamed to the user's original column name in `Anonymizer._rename_output_columns()`.

---

## Prompt Construction

**`_jinja(col, key=None)`** from `engine/constants.py` — use for **shared DataFrame column references** in NDD prompt templates. Never format shared column names directly into prompt strings; `_jinja` keeps them grep-able. Local Jinja loop variables (e.g. `entity.value` inside `{% for entity in ... %}`) are scoped to the prompt and don't need `_jinja`.

```python
# Good
f"The text is: {_jinja(COL_TEXT)}"

# Bad
f"The text is: {{{{ {COL_TEXT} }}}}"
```
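
For orientation, a helper with the `_jinja(col, key=None)` signature could plausibly be as small as the sketch below. The body is an assumption: the `{{ column }}` output is inferred from the "Bad" example above, and the dotted `key` access is a guess. The real helper in `engine/constants.py` may render differently.

```python
from typing import Optional


def jinja_ref_sketch(col: str, key: Optional[str] = None) -> str:
    """Render a Jinja2 reference to a column, optionally to one key inside it.

    Hypothetical re-implementation for illustration only.
    """
    reference = col if key is None else f"{col}.{key}"
    return "{{ " + reference + " }}"


# Hypothetical column constant; the real value is defined in engine/constants.py.
COL_TEXT_EXAMPLE = "text"
PROMPT_EXAMPLE = f"The text is: {jinja_ref_sketch(COL_TEXT_EXAMPLE)}"
```

Routing every reference through one function means a search for the helper name finds every prompt that depends on a given column, which is the grep-ability argument made above.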

**`substitute_placeholders(template, replacements)`** from `engine/prompt_utils.py` — use for dynamic prompt values. The `<<PLACEHOLDER>>` format avoids collisions with Jinja2 syntax. Never use f-strings or `.format()` for prompt templates with dynamic values; single-pass substitution prevents a replacement value from being interpreted as a placeholder.
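
The single-pass property can be illustrated with a short sketch built on `re.sub`. This is not the implementation in `engine/prompt_utils.py`; the placeholder character set, the bare-name dictionary keys, and the `KeyError` on a missing name are all assumptions.

```python
import re
from functools import partial

# Assumed placeholder shape: upper-case names wrapped in << >>.
_PLACEHOLDER_PATTERN = re.compile(r"<<([A-Z0-9_]+)>>")


def substitute_placeholders_sketch(template: str, replacements: dict[str, str]) -> str:
    """Replace each <<NAME>> in one left-to-right pass over the template."""
    return _PLACEHOLDER_PATTERN.sub(partial(_lookup_replacement, replacements), template)


def _lookup_replacement(replacements: dict[str, str], match: re.Match[str]) -> str:
    """Return the replacement for one matched placeholder."""
    name = match.group(1)
    if name not in replacements:
        raise KeyError(f"No replacement supplied for placeholder {name!r}")
    return replacements[name]
```

Because `re.sub` never rescans text it has already emitted, a replacement value that happens to contain `<<NOTE>>` stays literal. A chain of `str.replace` calls would instead expand it on the next call, which is the failure mode the paragraph above warns about. Jinja2 `{{ ... }}` references pass through untouched.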

Prompts live as inline triple-quoted strings in the workflow file that uses them. There is no separate prompt registry.

---

## Type Annotations

Type annotations are required on all functions, methods, and class attributes, including those in tests.

Use `TYPE_CHECKING` blocks for imports needed *only* in type annotations. This prevents circular imports and avoids loading heavy libraries at import time:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd
```

If a module uses `pandas` at runtime — calls `pd.DataFrame`, indexes a DataFrame in a function body, etc. — import it at the top level. A `TYPE_CHECKING` import raises `NameError` if you reference it at runtime. `pandas` is import-time expensive, so keep top-level imports of it limited to modules that genuinely need it.
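
The difference between annotation-only and runtime use can be shown in a few lines. The sketch uses the standard-library `decimal` module as a stand-in for `pandas` so it runs anywhere; the function names are hypothetical.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by the type checker; skipped when the module actually runs.
    from decimal import Decimal


def describe_amount(amount: Decimal) -> str:
    """Annotation-only use: safe, because the annotation is never evaluated at runtime."""
    return f"amount={amount}"


def parse_amount(raw: str) -> Decimal:
    """Runtime use: raises NameError, because Decimal was never imported for real."""
    return Decimal(raw)
```

Calling `describe_amount` works, while calling `parse_amount` fails with `NameError`. A module shaped like `parse_amount` needs the import at top level.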

---

## Import Style

- **ALWAYS** use absolute imports, never relative imports (enforced by `TID`)
- Place imports at module level, not inside functions (exception: a deferred import that is unavoidable for performance reasons)
- Import sorting is handled by `ruff`'s `isort` rules; imports should be grouped and sorted:
  1. Standard library imports
  2. Third-party imports
  3. First-party imports (`anonymizer`)
- Use standard import conventions (enforced by `ICN`)

```python
# Good
from anonymizer.config.anonymizer_config import AnonymizerConfig

# Bad - relative import (will cause linter errors)
from .anonymizer_config import AnonymizerConfig

# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)

# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path
    path = Path(filename)
```

---

## Code Organization

- When adding new symbols, prefer public functions and methods before private (`_`-prefixed) ones within a module or class
- Define helpers at module or class level — avoid nested functions. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is a closure that genuinely needs to capture local state.
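
A short sketch of the helper rule above, using hypothetical function names: the reusable logic sits at module level where it can be imported and tested directly, and the one nested function is a closure that has to capture a local value.

```python
from collections.abc import Callable


def normalize_labels(labels: list[str]) -> list[str]:
    """Public entry point; delegates to a module-level helper."""
    return [normalize_label(label) for label in labels]


def normalize_label(label: str) -> str:
    """Module-level helper: importable and testable on its own."""
    return label.strip().lower().replace(" ", "_")


def make_threshold_filter(threshold: float) -> Callable[[float], bool]:
    """Acceptable closure: the inner function must capture `threshold`."""

    def passes(score: float) -> bool:
        return score >= threshold

    return passes
```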

---

## Naming

- Functions and variables: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_SNAKE_CASE`
- Function names start with a verb: `run_workflow`, `build_entity_id`, not `entity_id` or `workflow`

---

## Comments

Only add a comment when the WHY is non-obvious — a hidden constraint, a subtle invariant, a workaround for a specific bug. Don't narrate what the code already says:

```python
# Good — explains a non-obvious invariant
# uuid5 is deterministic so input/output IDs match for missing-record tracking.

# Bad — narrates what the code does
# Loop through the records and append to list
for record in records:
    results.append(record)
```

---

## Future Annotations

Every Python file must include `from __future__ import annotations` after the license header. This defers annotation evaluation, enables forward references, and keeps behavior consistent across the codebase.
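
A minimal illustration of the forward-reference benefit, using a hypothetical `EntitySpan` class that is not part of the codebase:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Hypothetical value object used only for this illustration."""

    start: int
    end: int

    def merge(self, other: EntitySpan) -> EntitySpan:
        # `EntitySpan` is referenced inside its own class body. With the
        # future import the annotation is stored as a string and never
        # evaluated here; on interpreters that evaluate annotations eagerly,
        # omitting the import would raise NameError at definition time.
        return EntitySpan(min(self.start, other.start), max(self.end, other.end))
```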

---

## License Headers

Every Python and Markdown file requires an SPDX header at the top (enforced by `tools/codestyle/copyright_fixer.py --check`, run via `make copyright-check`). Files listed in `.copyrightignore` are exempt.

---

## Docstrings

Google style (`Args:`, `Returns:`, `Raises:`). Public API classes and methods get docstrings; private helpers (`_`-prefixed) only when the logic is non-obvious. Don't restate the signature — explain why or what, not what the type annotation already says.
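
As an illustration of the Google layout, here is a docstring on a hypothetical helper (`build_span_key` and its signature are invented for this example). Each section adds something the signature does not already say.

```python
def build_span_key(record_id: str, label: str, start: int) -> str:
    """Build a stable key for one detected span.

    The key depends only on its inputs, so the same span gets the same key
    across runs.

    Args:
        record_id: Identifier of the record the span was found in.
        label: Entity label, for example ``first_name``.
        start: Character offset where the span begins.

    Returns:
        The key in ``record_id:label:start`` form.

    Raises:
        ValueError: If ``start`` is negative.
    """
    if start < 0:
        raise ValueError(f"Span offset must be non-negative, got {start!r}")
    return f"{record_id}:{label}:{start}"
```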

---

## Design Principles

**DRY**

- Extract shared logic into pure helper functions rather than duplicating across similar call sites
- Rule of thumb: tolerate duplication until the third occurrence, then extract

**KISS**

- Prefer flat, obvious code over clever abstractions: two similar lines are better than a premature helper
- When in doubt between DRY and KISS, favor readability over deduplication

**YAGNI**

- Don't add parameters, config, or abstraction layers for hypothetical future use cases
- Don't generalize until the third caller appears

**SOLID**

- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
- Use `Protocol` for contracts between layers
- One function, one job — separate logic from I/O
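
As a sketch of the `Protocol` point, the example below defines a contract and a class that satisfies it structurally. `TextDetector`, `KeywordDetector`, and `count_entities` are hypothetical names chosen for illustration, not types from the codebase.

```python
from typing import Protocol


class TextDetector(Protocol):
    """Hypothetical contract between a calling layer and a detection layer."""

    def detect(self, text: str) -> list[str]: ...


class KeywordDetector:
    """Satisfies TextDetector structurally; no inheritance is required."""

    def __init__(self, keywords: list[str]) -> None:
        self._keywords = keywords

    def detect(self, text: str) -> list[str]:
        return [word for word in self._keywords if word in text]


def count_entities(detector: TextDetector, texts: list[str]) -> int:
    """Depends only on the protocol, not on any concrete detector."""
    return sum(len(detector.detect(text)) for text in texts)
```

Since `count_entities` only names the protocol, a test can pass in a tiny fake detector without importing the real engine.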