docs: add AGENTS.md, STYLEGUIDE.md, agent-assisted contribution (#114) #149
Merged
lipikaramaswamy
merged 7 commits into
main
from
lipikaramaswamy/docs/114-agents-styleguide
May 13, 2026
48b8f76
chore: ignore Claude worktrees
lipikaramaswamy 29699d5
docs: add AGENTS.md, STYLEGUIDE.md, and agent-assisted contribution w…
lipikaramaswamy 3f5ef3d
docs: add SPDX headers to AGENTS.md, STYLEGUIDE.md, and CLAUDE.md
lipikaramaswamy 13e7902
docs: clarify TYPE_CHECKING guidance and align DataDesigner naming
lipikaramaswamy 1f734c6
docs: add Import Style and Design Principles sections to STYLEGUIDE.md
lipikaramaswamy e122b4f
docs: refine AGENTS.md, STYLEGUIDE.md, CLAUDE.md, and source docstrings
lipikaramaswamy cf49c17
docs: add Agent compatibility section to AGENTS.md
lipikaramaswamy
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| <!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> | ||
| <!-- SPDX-License-Identifier: Apache-2.0 --> | ||
|
|
||
| # AGENTS.md | ||
|
|
||
| This file is for agents **developing** NeMo Anonymizer — the codebase you are working in. | ||
| If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) instead. | ||
|
|
||
| **NeMo Anonymizer** detects and protects PII through context-aware entity replacement and LLM-powered rewriting. Users supply a text dataset and a strategy; Anonymizer detects entities and transforms the text. | ||
|
|
||
| ## Agent compatibility | ||
|
|
||
| `AGENTS.md` is the canonical instruction file for coding agents working in this repository. Keep it tool-neutral: | ||
|
|
||
| - Use plain Markdown and repository-relative links. | ||
| - Do not rely on vendor-specific include syntax, slash commands, MCP names, or IDE-only behavior. | ||
| - Put tool-specific adapter instructions in thin wrapper files such as `CLAUDE.md`. | ||
|
|
||
| ## Module Map | ||
|
|
||
| `nemo-anonymizer` is a single package with three primary subpackages plus top-level public utilities: | ||
|
|
||
| - **`anonymizer.config`** — user-facing configuration: `AnonymizerConfig`, `AnonymizerInput`, replace strategies (`Substitute`, `Redact`, `Annotate`, `Hash`), and rewrite config (`Rewrite`, `EvaluationCriteria`, `RiskTolerance`). New user-facing knobs go here. | ||
| - **`anonymizer.engine`** — internal pipeline implementation: detection, replacement, and rewrite sub-workflows, the NDD adapter, prompt utilities, and all `COL_*` column constants. Never imported directly by users. | ||
| - **`anonymizer.interface`** — user-facing entry points: the `Anonymizer` class, CLI, `AnonymizerResult`, `PreviewResult`, and canonical error types. Thin layer that wires config → engine and exposes results. | ||
| - **`anonymizer.logging`** — public logging configuration (`LoggingConfig`, `configure_logging`) used by the API, CLI, and examples. | ||
|
|
||
| NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. `NddAdapter.run_workflow()` is the engine boundary for *executing* DataDesigner workflows — engine sub-workflows may declare DataDesigner column configs (e.g. `LLMStructuredColumnConfig`), but they do not call `DataDesigner.create()` or `preview()` directly. | ||
|
|
||
| ## Core Concepts | ||
|
|
||
| - **Entity** — a detected span of text with a label (e.g. `"Alice"` → `first_name`) and character offsets | ||
| - **Latent entity** — an entity detected in rewrite mode that is sensitive but not directly named; used to guide rewriting without explicit replacement | ||
| - **Replacement map** — a per-record dict mapping entity text → substitute value, built by `LlmReplaceWorkflow` and injected into rewrite prompts | ||
| - **Leakage mass** — a weighted score measuring how much sensitive information survives in a rewritten record; drives the repair loop | ||
| - **Utility score** — a 0–1 score measuring how much semantic content the rewritten record preserves | ||
| - **RiskTolerance** — a preset (`minimal` / `low` / `moderate` / `high`) that bundles the leakage threshold, repair behaviour, and human-review flags into a single user-facing knob | ||
| - **Repair loop** — the evaluate → repair → re-evaluate cycle in `RewriteWorkflow`; runs up to `max_repair_iterations` times on failing rows | ||
| - **FailedRecord** — a record that was dropped by an NDD workflow; surfaced explicitly rather than silently lost | ||
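The repair loop described in the list above can be sketched roughly as follows. This is a minimal illustration, not the `RewriteWorkflow` implementation: `run_repair_loop`, `evaluate`, `repair`, and `leakage_threshold` are hypothetical names, and the toy leakage metric simply counts occurrences of a name.

```python
from __future__ import annotations

from collections.abc import Callable


def run_repair_loop(
    text: str,
    evaluate: Callable[[str], float],
    repair: Callable[[str], str],
    leakage_threshold: float,
    max_repair_iterations: int,
) -> tuple[str, float, int]:
    """Evaluate once, then repair and re-evaluate while leakage is too high."""
    leakage = evaluate(text)
    iterations = 0
    while leakage > leakage_threshold and iterations < max_repair_iterations:
        text = repair(text)
        leakage = evaluate(text)
        iterations += 1
    return text, leakage, iterations


def count_leaks(text: str) -> float:
    # Toy stand-in for a leakage score: number of times "Alice" survives.
    return float(text.count("Alice"))


def mask_one(text: str) -> str:
    # Toy stand-in for a repair step: masks a single occurrence per call.
    return text.replace("Alice", "[NAME]", 1)


result = run_repair_loop("Alice met Alice and Alice", count_leaks, mask_one, 0.0, 2)
print(result)
```

With a budget of two iterations the loop stops while one occurrence still leaks, which is the kind of row a human-review flag would be expected to catch.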
|
|
||
| ## Pipelines | ||
|
|
||
| ### Replace mode — `AnonymizerConfig(replace=...)` | ||
|
|
||
| ``` | ||
| input_df | ||
|   → EntityDetectionWorkflow.run() # engine/detection/detection_workflow.py | ||
|       GLiNER detection | ||
|       → parse + tag | ||
|       → LLM augmentation (add entities GLiNER missed) | ||
|       → LLM validation (keep / drop candidates) | ||
|       → merge + finalize → COL_DETECTED_ENTITIES, COL_FINAL_ENTITIES | ||
|   → ReplacementWorkflow.run() # engine/replace/replace_runner.py | ||
|       Redact / Annotate / Hash → applied locally, no LLM | ||
|       Substitute → LlmReplaceWorkflow → NddAdapter | ||
|   → output: {text_col}_replaced, {text_col}_with_spans, final_entities | ||
| ``` | ||
|
|
||
| ### Rewrite mode — `AnonymizerConfig(rewrite=...)` | ||
|
|
||
| ``` | ||
| input_df | ||
|   → EntityDetectionWorkflow.run() # same as above, plus latent entity tagging | ||
|   → RewriteWorkflow.run() # engine/rewrite/rewrite_workflow.py | ||
|       LlmReplaceWorkflow.generate_map_only() # build replacement map for prompt | ||
|       → single NDD adapter call (pipeline_columns): | ||
|           DomainClassificationWorkflow → _domain, _domain_supplement | ||
|           SensitivityDispositionWorkflow → _sensitivity_disposition | ||
|           QAGenerationWorkflow → _quality_qa, _privacy_qa | ||
|           RewriteGenerationWorkflow → _rewritten_text | ||
|       → evaluate-repair loop (up to max_repair_iterations): | ||
|           EvaluateWorkflow → leakage_mass, utility_score, _needs_repair | ||
|           RepairWorkflow → _rewritten_text (failing rows only) | ||
|       → FinalJudgeWorkflow (non-critical) → _judge_evaluation, needs_human_review | ||
|   → output: {text_col}_rewritten, utility_score, leakage_mass, needs_human_review, … | ||
| ``` | ||
|
|
||
| Records with no detected entities skip all LLM sub-workflows and pass through with default metrics (utility=1.0, leakage=0.0). | ||
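The pass-through rule above might look like this in miniature. It is a sketch under stated assumptions: records are plain dicts here rather than DataFrame rows, and `rewrite_all`, `fake_rewrite`, and the `"entities"` / `"rewritten"` keys are hypothetical, not the engine's column constants.

```python
from __future__ import annotations

from collections.abc import Callable


def rewrite_all(records: list[dict], rewrite: Callable[[dict], dict]) -> list[dict]:
    """Send only records with entities to the rewrite step; pass the rest through."""
    results: list[dict] = []
    for record in records:
        if record["entities"]:
            results.append(rewrite(record))
        else:
            # Nothing to protect: keep the text and attach the default metrics.
            results.append(
                {**record, "rewritten": record["text"], "utility_score": 1.0, "leakage_mass": 0.0}
            )
    return results


def fake_rewrite(record: dict) -> dict:
    # Stand-in for the LLM sub-workflows; uppercases so the call is visible.
    return {**record, "rewritten": record["text"].upper(), "utility_score": 0.9, "leakage_mass": 0.1}


rows = rewrite_all(
    [{"text": "hello", "entities": []}, {"text": "hi Alice", "entities": ["Alice"]}],
    fake_rewrite,
)
```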
|
|
||
| ## Config Pattern | ||
|
|
||
| `AnonymizerConfig.rewrite` is the user-facing `Rewrite` model. The engine never receives `Rewrite` directly — it receives `EvaluationCriteria` via the `Rewrite.evaluation` property. See that property's docstring for the sync contract (how `risk_tolerance` and `max_repair_iterations` flow into the engine, why production code should not duplicate the mapping). | ||
|
|
||
| ## NDD Adapter | ||
|
|
||
| `NddAdapter.run_workflow()` (`engine/ndd/adapter.py`) is the engine boundary for *executing* DataDesigner workflows. See its docstring for the contract (input/output shapes, `FailedRecord` semantics). | ||
|
|
||
| ## Prompt Conventions | ||
|
|
||
| NDD prompts are inline triple-quoted strings in the workflow file that uses them; there is no separate registry. For DataFrame column references inside templates, use `_jinja()`; for dynamic prompt values, use `substitute_placeholders()`. See each function's docstring for details. | ||
|
|
||
| ## Structural Invariants | ||
|
|
||
| Code conventions enforced in review (future-annotations import, absolute imports, type annotations, SPDX headers, column-name constants) live in [STYLEGUIDE.md](STYLEGUIDE.md). | ||
|
|
||
| One pipeline-specific fact worth knowing: `COL_TEXT` is the internal name for the input text column; it's renamed to the user's original column name in final output. | ||
|
|
||
| ## What NOT To Do | ||
|
|
||
| - **Don't duplicate the `Rewrite` → `EvaluationCriteria` mapping** when production code starts from a `Rewrite`; route it through `Rewrite.evaluation`. | ||
| - **Don't execute DataDesigner workflows directly** — call `DataDesigner.create()` / `.preview()` only via `NddAdapter.run_workflow()`. Declaring column configs (`LLMStructuredColumnConfig`, etc.) is fine. | ||
| - **Don't use string literals for column names** — use `COL_*` constants from `engine/constants.py` | ||
| - **Don't add a domain to only one supplement map** — see `engine/rewrite/domain_classification.py` for the sync invariant | ||
| - **Don't hardcode `gliner_threshold`** — it belongs in `Detect` config (default 0.3) | ||
|
|
||
| ## Development | ||
|
|
||
| ```bash | ||
| make test # run all tests | ||
| make bootstrap # install dev dependencies | ||
| make format # ruff format + sort imports | ||
| make format-check # read-only lint check (used in CI) | ||
| make typecheck # ty type check (advisory) | ||
| make docs-serve # local MkDocs server at http://127.0.0.1:8000 | ||
| ``` | ||
|
|
||
| For contributor workflow and branch naming see [CONTRIBUTING.md](CONTRIBUTING.md). | ||
| For code style and naming conventions see [STYLEGUIDE.md](STYLEGUIDE.md). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| <!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> | ||
| <!-- SPDX-License-Identifier: Apache-2.0 --> | ||
|
|
||
| # Claude Code instructions | ||
|
|
||
| Canonical agent instructions live in [AGENTS.md](AGENTS.md). | ||
|
|
||
| @AGENTS.md |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,236 @@ | ||
| <!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> | ||
| <!-- SPDX-License-Identifier: Apache-2.0 --> | ||
|
|
||
| # Style Guide | ||
|
|
||
| Code and documentation conventions for NeMo Anonymizer that ruff and ty cannot enforce. Architecture boundaries and agent workflow rules live in [AGENTS.md](AGENTS.md). Read before adding a new module, workflow, or config class. | ||
|
|
||
| NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. References to NDD below mean that library. | ||
|
|
||
| For architecture and pipeline identity, see [AGENTS.md](AGENTS.md). | ||
| For contribution workflow and branch naming, see [CONTRIBUTING.md](CONTRIBUTING.md). | ||
|
|
||
| --- | ||
|
|
||
| ## Pydantic vs Dataclasses | ||
|
|
||
| **Pydantic** for config, validation, and serialization. **Dataclasses** for simple typed containers in the engine. | ||
|
|
||
| | Need | Use | | ||
| |------|-----| | ||
| | User-facing config, validation, JSON schema | `BaseModel` | | ||
| | Private/internal frozen value object (e.g. `WorkflowRunResult`, `_RiskToleranceBundle`) | `@dataclass(frozen=True)` | | ||
|
|
||
| ```python | ||
| # Config — Pydantic | ||
| class Detect(BaseModel): | ||
|     gliner_threshold: float = Field(default=0.3, ge=0.0, le=1.0) | ||
|
|
||
| # Internal result — dataclass | ||
| @dataclass(frozen=True) | ||
| class WorkflowRunResult: | ||
|     dataframe: pd.DataFrame | ||
|     failed_records: list[FailedRecord] | ||
| ``` | ||
|
|
||
| Use `Field()` only when you need constraints (`ge`, `le`), descriptions, or `default_factory`. Use bare defaults for simple flags and strings. | ||
|
|
||
| --- | ||
|
|
||
| ## Error Handling | ||
|
|
||
| Wrap exceptions from NDD and other third-party calls at module boundaries into canonical types from `interface/errors.py`. Callers should never see raw NDD exceptions. | ||
|
|
||
| Preserve the traceback: | ||
|
|
||
| ```python | ||
| # Good | ||
| try: | ||
|     run_results = self._data_designer.create(...) | ||
| except Exception as exc: | ||
|     raise AnonymizerWorkflowError(f"Workflow failed: {exc}") from exc | ||
|
|
||
| # Bad — swallows the traceback | ||
| except Exception as exc: | ||
|     raise AnonymizerWorkflowError("Workflow failed") | ||
| ``` | ||
|
|
||
| Don't use defensive `try/except` on trusted internal calls that shouldn't fail — only catch at module boundaries. `RewriteWorkflow._run_final_judge` is the intentional exception: it's explicitly non-critical and catches broadly, logging with `exc_info=True` and substituting safe defaults. | ||
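The non-critical pattern described above (catch broadly, log with `exc_info=True`, substitute safe defaults) could be sketched like this. It is an illustration only, not the body of `RewriteWorkflow._run_final_judge`: `run_non_critical_judge`, `SAFE_JUDGE_DEFAULTS`, and the choice to default `needs_human_review` to `True` are assumptions.

```python
from __future__ import annotations

import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)

# Hypothetical safe defaults: no judge verdict, and flag the row for human review.
SAFE_JUDGE_DEFAULTS: dict[str, object] = {"judge_evaluation": None, "needs_human_review": True}


def run_non_critical_judge(
    judge: Callable[[str], dict[str, object]], text: str
) -> dict[str, object]:
    """Run an optional step; on any failure, log the traceback and fall back."""
    try:
        return judge(text)
    except Exception:
        # exc_info=True attaches the full traceback to the log record.
        logger.warning("Final judge failed; using safe defaults", exc_info=True)
        return dict(SAFE_JUDGE_DEFAULTS)


def broken_judge(text: str) -> dict[str, object]:
    raise RuntimeError("judge backend unavailable")


fallback = run_non_critical_judge(broken_judge, "some rewritten text")
```

Returning a copy of the defaults keeps callers from mutating the shared module-level dict.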
|
|
||
| **Error messages** must identify the actual bad value. Use `!r` to make interpolated values unambiguous: | ||
|
|
||
| ```python | ||
| # Good | ||
| raise ValueError(f"Unsupported strategy: {strategy!r}") | ||
|
|
||
| # Bad | ||
| raise ValueError("Invalid strategy") | ||
| ``` | ||
|
|
||
| **No `assert` for validation in production/library code** — `assert` statements are stripped when Python runs with `-O`. Use `if/raise` instead. Pytest assertions in tests are fine. | ||
|
|
||
| ```python | ||
| # Good | ||
| if not isinstance(config, AnonymizerConfig): | ||
|     raise TypeError(f"Expected AnonymizerConfig, got {type(config)!r}") | ||
|
|
||
| # Bad | ||
| assert isinstance(config, AnonymizerConfig) | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Column Names | ||
|
|
||
| All column names are constants in `engine/constants.py`. Never use string literals for column names. | ||
|
|
||
| ```python | ||
| # Good | ||
| df[COL_DETECTED_ENTITIES] | ||
|
|
||
| # Bad | ||
| df["_detected_entities"] | ||
| ``` | ||
|
|
||
| Internal (intermediate) columns are prefixed with `_`. User-facing output columns use clean names (`final_entities`, `utility_score`). The input text column is always `COL_TEXT` internally and renamed to the user's original column name in `Anonymizer._rename_output_columns()`. | ||
|
|
||
| --- | ||
|
|
||
| ## Prompt Construction | ||
|
|
||
| **`_jinja(col, key=None)`** from `engine/constants.py` — use for **shared DataFrame column references** in NDD prompt templates. Never format shared column names directly into prompt strings; `_jinja` keeps them grep-able. Local Jinja loop variables (e.g. `entity.value` inside `{% for entity in ... %}`) are scoped to the prompt and don't need `_jinja`. | ||
|
|
||
| ```python | ||
| # Good | ||
| f"The text is: {_jinja(COL_TEXT)}" | ||
|
|
||
| # Bad | ||
| f"The text is: {{{{ {COL_TEXT} }}}}" | ||
| ``` | ||
|
|
||
| **`substitute_placeholders(template, replacements)`** from `engine/prompt_utils.py` — use for dynamic prompt values. The `<<PLACEHOLDER>>` format avoids collisions with Jinja2 syntax. Never use f-strings or `.format()` for prompt templates with dynamic values; single-pass substitution prevents a replacement value from being interpreted as a placeholder. | ||
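To see why single-pass substitution matters, here is a minimal sketch. It is not the code in `engine/prompt_utils.py`: `substitute_once` is a hypothetical stand-in, and the assumption that replacement keys are the bare names inside `<<...>>` is mine.

```python
from __future__ import annotations

import re

# Matches <<NAME>>-style placeholders; the allowed character set is an assumption.
_PLACEHOLDER = re.compile(r"<<([A-Z0-9_]+)>>")


def substitute_once(template: str, replacements: dict[str, str]) -> str:
    """Replace each placeholder in one scan; inserted values are never rescanned."""
    return _PLACEHOLDER.sub(
        lambda match: replacements.get(match.group(1), match.group(0)), template
    )


template = "Name: <<NAME>>, Note: <<NOTE>>"
replacements = {"NAME": "<<NOTE>>", "NOTE": "ok"}

single_pass = substitute_once(template, replacements)

# Chained str.replace rescans earlier output, so the injected "<<NOTE>>" is expanded.
chained = template.replace("<<NAME>>", "<<NOTE>>").replace("<<NOTE>>", "ok")

print(single_pass)  # Name: <<NOTE>>, Note: ok
print(chained)      # Name: ok, Note: ok
```

In the single-pass version a value that happens to look like a placeholder stays literal, which is the property the paragraph above relies on.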
|
|
||
| Prompts live as inline triple-quoted strings in the workflow file that uses them. There is no separate prompt registry. | ||
|
|
||
| --- | ||
|
|
||
| ## Type Annotations | ||
|
|
||
| Type annotations are required on all functions, methods, and class attributes, including in tests. | ||
|
|
||
| Use `TYPE_CHECKING` blocks for imports needed *only* in type annotations. This prevents circular imports and avoids loading heavy libraries at import time: | ||
|
|
||
| ```python | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| if TYPE_CHECKING: | ||
|     import pandas as pd | ||
| ``` | ||
|
|
||
| If a module uses `pandas` at runtime — calls `pd.DataFrame`, indexes a DataFrame in a function body, etc. — import it at the top level. A `TYPE_CHECKING` import raises `NameError` if you reference it at runtime. `pandas` is import-time expensive, so keep top-level imports of it limited to modules that genuinely need it. | ||
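The difference between annotation-only and runtime use can be demonstrated with a small sketch. To keep it runnable without third-party packages, `decimal.Decimal` stands in for `pandas`; `describe` and `build` are hypothetical functions.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers execute this import; at runtime TYPE_CHECKING is False.
    from decimal import Decimal  # stand-in for a heavy import such as pandas


def describe(value: Decimal) -> str:
    # Annotation-only use is safe: deferred annotations are never evaluated here.
    return f"value={value}"


def build() -> Decimal:
    # Runtime use fails: the name Decimal was never bound in this module.
    return Decimal("1.5")


print(describe(3))
try:
    build()
except NameError as exc:
    print(f"runtime use failed: {exc}")
```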
|
|
||
| --- | ||
|
|
||
| ## Import Style | ||
|
|
||
| - **ALWAYS** use absolute imports, never relative imports (enforced by `TID`) | ||
| - Place imports at module level, not inside functions (exception: when a deferred import is unavoidable for performance reasons) | ||
| - Import sorting is handled by `ruff`'s `isort` — imports should be grouped and sorted: | ||
| 1. Standard library imports | ||
| 2. Third-party imports | ||
| 3. First-party imports (`anonymizer`) | ||
| - Use standard import conventions (enforced by `ICN`) | ||
|
|
||
| ```python | ||
| # Good | ||
| from anonymizer.config.anonymizer_config import AnonymizerConfig | ||
|
|
||
| # Bad - relative import (will cause linter errors) | ||
| from .anonymizer_config import AnonymizerConfig | ||
|
|
||
| # Good - imports at module level | ||
| from pathlib import Path | ||
|
|
||
| def process_file(filename: str) -> None: | ||
|     path = Path(filename) | ||
|
|
||
| # Bad - import inside function | ||
| def process_file(filename: str) -> None: | ||
|     from pathlib import Path | ||
|     path = Path(filename) | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Code Organization | ||
|
|
||
| - When adding new symbols, prefer public functions and methods before private (`_`-prefixed) ones within a module or class | ||
| - Define helpers at module or class level — avoid nested functions. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is a closure that genuinely needs to capture local state. | ||
|
|
||
|
|
||
| --- | ||
|
|
||
| ## Naming | ||
|
|
||
| - Functions and variables: `snake_case` | ||
| - Classes: `PascalCase` | ||
| - Constants: `UPPER_SNAKE_CASE` | ||
| - Function names start with a verb: `run_workflow`, `build_entity_id`, not `entity_id` or `workflow` | ||
|
|
||
| --- | ||
|
|
||
| ## Comments | ||
|
|
||
| Only add a comment when the WHY is non-obvious — a hidden constraint, a subtle invariant, a workaround for a specific bug. Don't narrate what the code already says: | ||
|
|
||
| ```python | ||
| # Good — explains a non-obvious invariant | ||
| # uuid5 is deterministic so input/output IDs match for missing-record tracking. | ||
|
|
||
| # Bad — narrates what the code does | ||
| # Loop through the records and append to list | ||
| for record in records: | ||
|     results.append(record) | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Future Annotations | ||
|
|
||
| Every Python file must include `from __future__ import annotations` after the license header. This defers annotation evaluation, enables forward references, and keeps behavior consistent across the codebase. | ||
|
|
||
| --- | ||
|
|
||
| ## License Headers | ||
|
|
||
| Every Python and Markdown file requires an SPDX header at the top (enforced by `tools/codestyle/copyright_fixer.py --check`, run via `make copyright-check`). Files listed in `.copyrightignore` are exempt. | ||
|
|
||
| --- | ||
|
|
||
| ## Docstrings | ||
|
|
||
| Google style (`Args:`, `Returns:`, `Raises:`). Public API classes and methods get docstrings; private helpers (`_`-prefixed) only when the logic is non-obvious. Don't restate the signature — explain why or what, not what the type annotation already says. | ||
|
|
||
| --- | ||
|
|
||
| ## Design Principles | ||
|
|
||
| **DRY** | ||
|
|
||
| - Extract shared logic into pure helper functions rather than duplicating across similar call sites | ||
| - Rule of thumb: tolerate duplication until the third occurrence, then extract | ||
|
|
||
| **KISS** | ||
|
|
||
| - Prefer flat, obvious code over clever abstractions: two similar lines are better than a premature helper | ||
| - When in doubt between DRY and KISS, favor readability over deduplication | ||
|
|
||
| **YAGNI** | ||
|
|
||
| - Don't add parameters, config, or abstraction layers for hypothetical future use cases | ||
| - Don't generalize until the third caller appears | ||
|
|
||
| **SOLID** | ||
|
|
||
| - Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals | ||
| - Use `Protocol` for contracts between layers | ||
| - One function, one job — separate logic from I/O | ||
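As one possible reading of "use `Protocol` for contracts between layers", the sketch below defines a structural contract that a caller depends on without importing any concrete implementation. `EntityDetector`, `KeywordDetector`, and `count_entities` are hypothetical names, not classes from the repository.

```python
from __future__ import annotations

from typing import Protocol


class EntityDetector(Protocol):
    """Contract the calling layer depends on; any class with this method fits."""

    def detect(self, text: str) -> list[str]: ...


class KeywordDetector:
    """Toy implementation; note that it does not inherit from EntityDetector."""

    def __init__(self, keywords: list[str]) -> None:
        self._keywords = keywords

    def detect(self, text: str) -> list[str]:
        return [word for word in self._keywords if word in text]


def count_entities(detector: EntityDetector, text: str) -> int:
    # Depends only on the protocol, so implementations can be swapped freely.
    return len(detector.detect(text))


total = count_entities(KeywordDetector(["Alice", "Bob"]), "Alice called")
print(total)
```

Because the match is structural, a test double only needs a `detect` method with the same signature to satisfy a type checker.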