docs: add AGENTS.md, STYLEGUIDE.md, agent-assisted contribution (#114) by lipikaramaswamy · Pull Request #149 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-05-09T01:03:17Z

Summary

- Add AGENTS.md: architecture overview, pipeline diagrams, and structural invariants for agents working in the codebase.
- Add STYLEGUIDE.md: code conventions ruff and ty cannot enforce (Pydantic vs dataclass, error handling, column-name constants, prompt construction).
- Add CLAUDE.md: 3-line redirect to AGENTS.md so Claude Code picks it up.
- Add an "Agent-Assisted Development" subsection to CONTRIBUTING.md establishing the plans/<issue-number>/<short-name>.md convention for non-trivial changes.
- Ignore .claude/worktrees/ (Claude Code session worktrees).

Filed #148 as a follow-up for three pre-existing column-name string-literal violations in
llm_replace_workflow.py:53-55 that the new STYLEGUIDE rule covers.

Closes #114.

Test plan

- CI passes (format-check, mkdocs build --strict)
- Reviewer can navigate the AGENTS.md ↔ STYLEGUIDE.md ↔ CONTRIBUTING.md cross-links
- git check-ignore .claude/worktrees/foo returns the path (worktrees properly ignored)

Local Claude Code session worktrees under .claude/worktrees/ shouldn't be tracked. Matches the convention already in the DataDesigner repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…orkflow (#114)

- AGENTS.md: architecture overview, pipeline diagrams, structural invariants
- STYLEGUIDE.md: code conventions ruff and ty cannot enforce
- CLAUDE.md: 3-line redirect to AGENTS.md
- CONTRIBUTING.md: new Agent-Assisted Development subsection establishing the plans/<issue-number>/<short-name>.md convention for non-trivial changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

greptile-apps · 2026-05-09T01:05:54Z

Greptile Summary

This PR adds developer and agent-oriented documentation (AGENTS.md, STYLEGUIDE.md, CLAUDE.md) and an "Agent-Assisted Development" subsection to CONTRIBUTING.md, establishing the plans/<issue-number>/<short-name>.md convention for non-trivial changes. It also expands docstrings on five engine/config functions and adds .claude/worktrees/ to .gitignore.

AGENTS.md provides a module map, pipeline diagrams, core-concept glossary, structural invariants, and development commands, all consistent with the actual codebase structure.
STYLEGUIDE.md captures conventions that linters cannot enforce: Pydantic vs. dataclass choice, error handling patterns, column-name constant usage, prompt construction, import style, and design principles.
Python file changes are purely docstring improvements — expanded explanations, cross-references, and conversion from RST double-backtick style to Markdown single-backtick style; no logic is altered.

Confidence Score: 5/5

Safe to merge — all changes are documentation and docstring improvements with no modifications to runtime logic.

Every Python change in this PR is a docstring expansion or style update (RST → Markdown backticks). No control flow, data handling, or public API signatures are touched. The new Markdown files accurately reflect the codebase structure and cross-link correctly. The .gitignore addition is a single, scoped exclusion for Claude Code session worktrees.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .gitignore | Adds `.claude/worktrees/` exclusion so Claude Code session worktrees are not tracked by git. |
| AGENTS.md | New agent-oriented architecture reference: module map, pipeline ASCII diagrams, core concepts, structural invariants, and dev commands; accurate and consistent with the source code. |
| CLAUDE.md | Thin vendor adapter file that redirects Claude Code to AGENTS.md via both a Markdown link and the `@AGENTS.md` include directive. |
| CONTRIBUTING.md | Adds 'Agent-Assisted Development' subsection establishing the `plans/<issue-number>/<short-name>.md` plan-file convention and cross-links to AGENTS.md and STYLEGUIDE.md. |
| STYLEGUIDE.md | New style guide covering Pydantic vs dataclasses, error handling, column-name constants, prompt construction, import style, type annotations, naming, comments, future annotations, and design principles. |
| src/anonymizer/config/anonymizer_config.py | Expanded the `Rewrite.evaluation` property docstring with the sync contract and guidance on when to use the property vs. construct `EvaluationCriteria` directly. |
| src/anonymizer/config/rewrite.py | Docstring style update on `EvaluationCriteria`: double-backtick RST formatting replaced with single-backtick Markdown to match the Google-style guide. |
| src/anonymizer/engine/constants.py | Expanded `_jinja()` docstring to clarify when to use the helper (shared DataFrame column refs) vs. when not to (local Jinja loop variables). |
| src/anonymizer/engine/ndd/adapter.py | Expanded `NddAdapter.run_workflow()` docstring with full Args/Returns sections and the engine boundary contract (sub-workflows must not call `DataDesigner.create()` / `.preview()` directly). |
| src/anonymizer/engine/prompt_utils.py | Updated `substitute_placeholders()` docstring to clarify its role (dynamic prompt values, not DataFrame column refs) and converted RST backticks to Markdown style. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Developer or Agent] -->|reads| B[AGENTS.md\nArchitecture and invariants]
    A -->|reads| C[STYLEGUIDE.md\nCode conventions]
    A -->|reads| D[CONTRIBUTING.md\nWorkflow and PR process]

    B -->|cross-links| C
    B -->|cross-links| D
    C -->|cross-links| B
    C -->|cross-links| D

    D -->|new section| E[Agent-Assisted Development]
    E --> F{Non-trivial change?}
    F -->|yes| G[plans/issue/name.md\nDraft plan PR first]
    F -->|no| H[Implement directly]
    G -->|approved| H
    H -->|follows| B
    H -->|follows| C

    I[CLAUDE.md] --> B

Reviews (7): Last reviewed commit: "docs: add Agent compatibility section to..."

Required by the copyright-check CI step (tools/codestyle/copyright_fixer.py). Matches the HTML-comment header format used by docs/concepts/*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

Address greptile-apps review feedback on PR #149:

- Rewrite the TYPE_CHECKING paragraph to explicitly split type-hint-only imports (TYPE_CHECKING block) from runtime use (top-level), and call out the NameError failure mode. Avoids the misread where an agent could wrap all heavy-library imports in TYPE_CHECKING.
- Use "DataDesigner" instead of "NeMo Data Designer" so the display name matches AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
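The NameError failure mode called out in that commit can be reproduced in a few lines. This is only a minimal sketch: the standard-library `decimal` module stands in for a heavy dependency, the function names `describe` and `parse` are hypothetical, and quoted annotations are used here in place of whatever annotation convention STYLEGUIDE.md actually prescribes.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; this import never executes at runtime.
    from decimal import Decimal


def describe(value: "Decimal") -> str:
    # Annotation-only use: safe, because the quoted annotation is not evaluated.
    return f"value={value}"


def parse(raw: str) -> "Decimal":
    # Runtime use: Decimal was never imported, so this call raises NameError.
    return Decimal(raw)
```

Calling `describe(...)` works, while `parse("1.5")` fails at call time, which is why a name that is used at runtime needs a top-level import rather than a `TYPE_CHECKING` one.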

binaryaaron

Trying something out: here's my feedback in prompt format.

Prompt for PR author

Please revise the agent/developer documentation in this branch with one main goal: make the agent instructions tool-neutral, cross-agent compatible, and less likely to drift from the code.

The current direction is strong, but AGENTS.md, CLAUDE.md, and STYLEGUIDE.md blur a few boundaries:

- AGENTS.md mixes a canonical agent entry point with detailed implementation docs and style rules.
- CLAUDE.md is a valid Claude Code shim, but non-Claude readers only see @AGENTS.md.
- STYLEGUIDE.md should be the source of truth for review-enforced coding conventions, but some of those conventions are duplicated in AGENTS.md.
- A few rules are phrased as absolute when the current code has local exceptions.

Please update the docs along these lines.

1. Make `AGENTS.md` the canonical tool-neutral entry point

Keep AGENTS.md focused on:

who the file is for
where product users should go instead
the high-level package map
the architectural guardrails agents must not violate
links to STYLEGUIDE.md, CONTRIBUTING.md, and deeper developer docs

Avoid vendor-specific syntax or IDE-specific assumptions in AGENTS.md. Do not put Claude @... includes, Cursor rule paths, MCP names, slash commands, or agent-product instructions there. Adapter files such as CLAUDE.md can carry those details.

Suggested addition near the top:

## Agent compatibility

`AGENTS.md` is the canonical instruction file for coding agents working in this repository. Keep it tool-neutral:

- Use plain Markdown and repository-relative links.
- Do not rely on vendor-specific include syntax, slash commands, MCP names, or IDE-only behavior.
- Put tool-specific adapter instructions in thin wrapper files such as `CLAUDE.md`.
- If an adapter file and this file disagree, this file wins.

2. Fix the module map wording

Current wording says the package has “three modules.” That is too narrow. The code has three primary subpackages plus top-level public logging utilities.

Please change the wording to something like:

`nemo-anonymizer` is a single package with three primary subpackages plus top-level public utilities:

Then keep the existing anonymizer.config, anonymizer.engine, and anonymizer.interface bullets, and add a compact logging bullet:

- **`anonymizer.logging`** — public logging configuration (`LoggingConfig`, `configure_logging`) used by the API, CLI, and examples.

3. Make `CLAUDE.md` a readable shim

Keep the Claude Code include line, but add a human-readable fallback so GitHub reviewers and non-Claude tools understand the file.

Suggested replacement:

# Claude Code instructions

Canonical agent instructions live in [AGENTS.md](AGENTS.md).

@AGENTS.md

Do not duplicate the full agent instructions in CLAUDE.md.

4. Move detailed implementation material out of `AGENTS.md`

AGENTS.md should not be the only source for detailed pipeline contracts. Move or reduce these sections:

- Pipeline diagrams: move the full replace/rewrite diagrams to docs/concepts/replace.md, docs/concepts/rewrite.md, or a new developer architecture page. In AGENTS.md, keep only a short pointer.
- NddAdapter.run_workflow() behavior: document the detailed contract in the method/class docstring.
- Rewrite.evaluation and EvaluationCriteria sync behavior: document the contract on the Rewrite.evaluation property.
- Prompt helper behavior: keep the formal contract in _jinja() / substitute_placeholders() docstrings and the style rule in STYLEGUIDE.md.
- Future imports, type annotations, SPDX headers, prompt construction, and column-name conventions: keep these in STYLEGUIDE.md; in AGENTS.md, link to the style guide rather than repeating the rules.

Good target shape: AGENTS.md as a short index and guardrail file, not a full architecture document.

5. Narrow over-broad rules

Several current statements are directionally right but too absolute for the existing code. Please calibrate them so agents do not churn valid code or infer false invariants.

NDD adapter wording

Current claim: NddAdapter is the only place the DataDesigner dependency crosses.

Issue: DataDesigner config types and decorators appear outside the adapter, and Anonymizer constructs DataDesigner.

Better wording:

`NddAdapter.run_workflow()` is the engine boundary for executing DataDesigner workflows. Engine workflows declare DataDesigner column configs, but they do not call `DataDesigner.create()` or `preview()` directly.

`EvaluationCriteria` wording

Current claim: never construct EvaluationCriteria.

Issue: tests and engine-level code construct or pass it directly. The production risk is manually duplicating the Rewrite to EvaluationCriteria mapping.

Better wording:

When production code starts from the user-facing `Rewrite` config, pass `Rewrite.evaluation` into the engine. Do not manually duplicate the `Rewrite` to `EvaluationCriteria` mapping.
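As a rough illustration of why the mapping should live in one place, the sketch below derives engine criteria from a user-facing config through a single property. Every class, field, and threshold here is hypothetical (`RewriteSketch`, `CriteriaSketch`, `strict`, `min_quality`, and so on); none of them are taken from the real `Rewrite` or `EvaluationCriteria` definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriteriaSketch:
    # Hypothetical engine-side settings; not the real EvaluationCriteria fields.
    min_quality: float
    max_attempts: int


@dataclass(frozen=True)
class RewriteSketch:
    # Hypothetical user-facing options; not the real Rewrite fields.
    strict: bool = False
    max_attempts: int = 3

    @property
    def evaluation(self) -> CriteriaSketch:
        # The only place where user-facing options are mapped to engine criteria.
        return CriteriaSketch(
            min_quality=0.9 if self.strict else 0.7,
            max_attempts=self.max_attempts,
        )


def run_engine(criteria: CriteriaSketch) -> str:
    # Stand-in for an engine entry point that consumes the derived criteria.
    return f"quality>={criteria.min_quality}, attempts<={criteria.max_attempts}"
```

A caller that starts from the user-facing config would pass `RewriteSketch(strict=True).evaluation` to `run_engine` instead of rebuilding `CriteriaSketch` by hand, so a change to the mapping happens in exactly one property.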

Column-name constants

Current claim: all column names must be constants.

Issue: LlmReplaceWorkflow currently uses local scratch columns such as "_entity_examples" and "_entities_for_replace".

Either promote those scratch columns to COL_* constants, or narrow the rule:

Shared pipeline columns, trace columns, public output columns, NDD column names, prompt column references, and merge/join keys must use `COL_*` constants from `engine/constants.py`. Local scratch columns may be literal strings only when they do not cross workflow boundaries.
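The narrowed rule could look roughly like this in practice. The sketch uses plain dictionaries rather than DataFrames to stay dependency-free, and the constants `COL_TEXT` and `COL_WORD_COUNT`, the scratch name `_tokens`, and the function `count_words` are all hypothetical stand-ins, not names from `engine/constants.py`.

```python
# Hypothetical stand-ins for constants that would live in engine/constants.py.
COL_TEXT = "text"
COL_WORD_COUNT = "word_count"


def count_words(rows: list[dict]) -> list[dict]:
    # Local scratch column: a literal is tolerable only because the column
    # is created and dropped inside this function.
    scratch = "_tokens"
    out = []
    for row in rows:
        working = {**row, scratch: row[COL_TEXT].split()}
        # Shared output column: always referenced through its constant.
        working[COL_WORD_COUNT] = len(working[scratch])
        del working[scratch]
        out.append(working)
    return out
```

The scratch key never appears in the returned rows, which is the property that keeps it from crossing a workflow boundary.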

Prompt `_jinja()` usage

Current claim: all column references in NDD prompt templates go through _jinja().

Issue: local Jinja loop variables and local scratch prompt variables are not the same as shared DataFrame column references.

Better wording:

Shared DataFrame column references in NDD prompt templates should use `_jinja(COL_*)`. Local Jinja loop variables and explicitly local scratch prompt variables may remain local, but do not hardcode shared column names into prompt strings.
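The distinction between shared column references and local loop variables can be sketched as follows. The helper `jinja_ref` is a hypothetical stand-in written for this example; the real `_jinja()` in `engine/constants.py` may have a different signature or output, and the two `COL_*` constants are likewise assumed.

```python
# Hypothetical shared column constants.
COL_TEXT = "text"
COL_ENTITIES = "entities"


def jinja_ref(column: str) -> str:
    # Assumed behavior: wrap a shared column name in Jinja expression braces.
    return "{{ " + column + " }}"


PROMPT = (
    "Rewrite the following text:\n"
    + jinja_ref(COL_TEXT)
    # The iterable is a shared column, so its name comes from the constant.
    + "\n{% for entity in " + COL_ENTITIES + " %}"
    # `entity` is a local loop variable, so it stays a plain literal.
    + "- {{ entity }}\n"
    + "{% endfor %}"
)
```

Renaming a shared column then only requires changing the constant, while the loop variable remains local to the template.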

`assert` wording

If STYLEGUIDE.md says “No assert for validation,” clarify that this applies to production/library validation. Pytest assertions in tests are fine.

Suggested wording:

In production/library code, do not use `assert` for validation. Pytest assertions in tests are fine.
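One reason for this rule is that Python strips `assert` statements when run with the `-O` flag, so an assert-based check can silently disappear. A minimal sketch of the preferred pattern, using a hypothetical `validate_threshold` helper that is not part of the codebase:

```python
def validate_threshold(value: float) -> float:
    # Explicit check: still enforced when Python runs with -O,
    # unlike `assert 0.0 <= value <= 1.0`.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in [0, 1], got {value}")
    return value
```

In a pytest test, plain `assert validate_threshold(0.5) == 0.5` remains the idiomatic way to check the result.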

6. Clarify `STYLEGUIDE.md`

Please make STYLEGUIDE.md the canonical home for review-enforced conventions.

Suggested intro adjustment:

Code and documentation conventions for NeMo Anonymizer that ruff and ty cannot enforce. Architecture boundaries and agent workflow rules live in [AGENTS.md](AGENTS.md).

Specific clarifications to consider:

- In “Pydantic vs Dataclasses,” say dataclasses are for private/internal frozen value objects, including private config helpers such as _RiskToleranceBundle, not only engine containers.
- In “Error Handling,” point the final-judge exception example at RewriteWorkflow._run_final_judge, since the broad catch lives in rewrite_workflow.py.
- In “Column Names,” narrow the absolute rule or promote local scratch columns to constants.
- In “Prompt Construction,” distinguish shared DataFrame column references from local Jinja variables.
- In “Code Organization,” consider “Prefer public functions before private helpers when adding new symbols” instead of an absolute rule, unless the repo already satisfies it everywhere.

7. Keep `CONTRIBUTING.md` vendor-neutral

The current “Claude Code, Cursor, Codex” list is understandable but may age poorly. Consider neutral wording:

When automating edits with coding agents (IDE assistants, CLI tools, or hosted models), follow the standard Pull Request Process plus these additions:

Also consider replacing “The agent should read...” with language that applies to both humans and automated tools:

Implementers, human or automated, should read `AGENTS.md` and `STYLEGUIDE.md` before non-trivial changes.

Verification notes

These recommendations were checked against the current source:

- Rewrite and Rewrite.evaluation live in src/anonymizer/config/anonymizer_config.py.
- EvaluationCriteria, RiskTolerance, and _RiskToleranceBundle live in src/anonymizer/config/rewrite.py.
- NddAdapter lives in src/anonymizer/engine/ndd/adapter.py.
- _jinja() lives in src/anonymizer/engine/constants.py.
- substitute_placeholders() lives in src/anonymizer/engine/prompt_utils.py.
- configure_logging and LoggingConfig are public exports from anonymizer.
- Current code has local scratch DataFrame columns in LlmReplaceWorkflow, so the column constant rule needs either a code change or narrower wording.

The main outcome should be a cleaner split:

AGENTS.md      = tool-neutral orientation and guardrails
CLAUDE.md      = Claude-specific adapter shim
STYLEGUIDE.md  = review-enforced code/documentation conventions
code docstrings = callable contracts and implementation-local invariants
developer docs = pipeline diagrams, concepts, and longer architecture explanations

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

- AGENTS.md: add anonymizer.logging to module map; narrow NDD adapter and EvaluationCriteria wording (used to claim too much); clarify _jinja scope (DataFrame columns vs local Jinja loop variables); shrink NDD Adapter, Config Pattern, Prompt Conventions, and Structural Invariants sections to pointers (contracts now live in docstrings or STYLEGUIDE.md).
- CLAUDE.md: readable shim header so non-Claude readers understand the file's purpose.
- CONTRIBUTING.md: vendor-neutral language in the Agent-Assisted Development section (don't enumerate specific agents; "human or agentic" implementers).
- STYLEGUIDE.md: add License Headers section; clarify that the assert ban is for production code (pytest assertions in tests are fine); scope "public before private" to newly added symbols; reframe dataclasses for private/internal frozen value objects; point error-handling example at RewriteWorkflow._run_final_judge.
- Docstrings expanded with the contracts that previously lived in AGENTS.md: NddAdapter.run_workflow, Rewrite.evaluation, _jinja, substitute_placeholders. EvaluationCriteria docstring backticks normalized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

lipikaramaswamy · 2026-05-12T05:18:44Z

@binaryaaron Thanks, I went through all 7 themes.

1. Agent compatibility section: applied (cf49c17).
2. Module map fix: applied (e122b4f). Added anonymizer.logging and reframed the lead-in as "three primary subpackages plus top-level public utilities".
3. CLAUDE.md readable shim: applied (e122b4f).
4. Move detail out of AGENTS.md:
   - Pipeline diagrams: kept in AGENTS.md. They reference engine internals (EntityDetectionWorkflow.run(), NddAdapter, COL_*), which are dev-facing; moving them to docs/concepts/ would conflate audiences there.
   - NddAdapter.run_workflow → docstring (e122b4f)
   - Rewrite.evaluation → property docstring (e122b4f)
   - _jinja / substitute_placeholders → expanded docstrings (e122b4f)
   - Structural Invariants → pointer to STYLEGUIDE.md, with a new License Headers section added there (e122b4f)
5. Narrow over-broad rules:
   - NDD adapter wording: narrowed (e122b4f)
   - EvaluationCriteria wording: narrowed (e122b4f)
   - Column-name absolutism: kept as stated. The LlmReplaceWorkflow scratch columns you flagged are tracked in #148 (chore: use COL_* constants for intermediate columns in llm_replace_workflow.py) and #152 (bug: result.trace_dataframe is not persistable via to_parquet), with the fix in #154 (fix(replace): drop workflow-internal columns and use COL_* constants). Softening preemptively would invite the same drift elsewhere.
   - _jinja scope: clarified to "DataFrame columns vs local Jinja loop variables" (e122b4f)
   - assert scope: clarified (production vs tests) (e122b4f)
6. STYLEGUIDE.md clarifications: applied (e122b4f). Intro scoped vs AGENTS.md; dataclasses framing for private/internal frozen value objects; error-handling example pointed at RewriteWorkflow._run_final_judge; "public before private" scoped to new symbols.
7. Vendor-neutral CONTRIBUTING.md: applied (e122b4f). Dropped the "Claude Code, Cursor, Codex" enumeration; "human or agentic" implementers.

lipikaramaswamy and others added 2 commits May 8, 2026 16:56

chore: ignore Claude worktrees

48b8f76

lipikaramaswamy requested a review from a team as a code owner May 9, 2026 01:03

lipikaramaswamy changed the title ~~docs: add AGENTS.md, STYLEGUIDE.md, and agent-assisted contribution workflow (#114)~~ docs: add AGENTS.md, STYLEGUIDE.md, agent-assisted contribution (#114) May 9, 2026

greptile-apps Bot reviewed May 9, 2026

Comment thread STYLEGUIDE.md Outdated

Comment thread STYLEGUIDE.md Outdated

lipikaramaswamy and others added 2 commits May 8, 2026 18:07

This was referenced May 11, 2026

bug: result.trace_dataframe is not persistable via to_parquet #152

Closed

docs: add anonymizer Claude Code skill and supporting concept docs #153

Merged

nabinchha reviewed May 11, 2026

Comment thread CONTRIBUTING.md

binaryaaron reviewed May 11, 2026

nabinchha reviewed May 11, 2026

Comment thread AGENTS.md

nabinchha reviewed May 11, 2026

Comment thread STYLEGUIDE.md

lipikaramaswamy and others added 2 commits May 12, 2026 01:09

docs: add Import Style and Design Principles sections to STYLEGUIDE.md

1f734c6

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

lipikaramaswamy requested a review from a team as a code owner May 12, 2026 05:10

docs: add Agent compatibility section to AGENTS.md

cf49c17

Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>

binaryaaron approved these changes May 12, 2026

lipikaramaswamy merged commit f6b8a57 into main May 13, 2026
11 checks passed

lipikaramaswamy deleted the lipikaramaswamy/docs/114-agents-styleguide branch May 13, 2026 01:13

asteier2026 pushed a commit that referenced this pull request May 15, 2026

docs: add AGENTS.md, STYLEGUIDE.md, agent-assisted contribution (#149)

4f5d180

lipikaramaswamy merged 7 commits into main from lipikaramaswamy/docs/114-agents-styleguide
