docs: add AGENTS.md, STYLEGUIDE.md, agent-assisted contribution (#114)#149
Conversation
Local Claude Code session worktrees under .claude/worktrees/ shouldn't be tracked. Matches the convention already in the DataDesigner repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
β¦orkflow (#114) - AGENTS.md: architecture overview, pipeline diagrams, structural invariants - STYLEGUIDE.md: code conventions ruff and ty cannot enforce - CLAUDE.md: 3-line redirect to AGENTS.md - CONTRIBUTING.md: new Agent-Assisted Development subsection establishing the plans/<issue-number>/<short-name>.md convention for non-trivial changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Greptile SummaryThis PR adds developer and agent-oriented documentation (
Confidence Score: 5/5Safe to merge β all changes are documentation and docstring improvements with no modifications to runtime logic. Every Python change in this PR is a docstring expansion or style update (RST β Markdown backticks). No control flow, data handling, or public API signatures are touched. The new Markdown files accurately reflect the codebase structure and cross-link correctly. The No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Developer or Agent] -->|reads| B[AGENTS.md\nArchitecture and invariants]
A -->|reads| C[STYLEGUIDE.md\nCode conventions]
A -->|reads| D[CONTRIBUTING.md\nWorkflow and PR process]
B -->|cross-links| C
B -->|cross-links| D
C -->|cross-links| B
C -->|cross-links| D
D -->|new section| E[Agent-Assisted Development]
E --> F{Non-trivial change?}
F -->|yes| G[plans/issue/name.md\nDraft plan PR first]
F -->|no| H[Implement directly]
G -->|approved| H
H -->|follows| B
H -->|follows| C
I[CLAUDE.md] --> B
Reviews (7): Last reviewed commit: "docs: add Agent compatibility section to..." | Re-trigger Greptile |
Required by the copyright-check CI step (tools/codestyle/copyright_fixer.py). Matches the HTML-comment header format used by docs/concepts/*.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Address greptile-apps review feedback on PR #149: - Rewrite the TYPE_CHECKING paragraph to explicitly split type-hint-only imports (TYPE_CHECKING block) from runtime use (top-level), and call out the NameError failure mode. Avoids the misread where an agent could wrap all heavy-library imports in TYPE_CHECKING. - Use "DataDesigner" instead of "NeMo Data Designer" so the display name matches AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
binaryaaron
left a comment
There was a problem hiding this comment.
Trying something out- here's my feedback in prompt format.
Prompt for PR author
Please revise the agent/developer documentation in this branch with one main goal: make the agent instructions tool-neutral, cross-agent compatible, and less likely to drift from the code.
The current direction is strong, but AGENTS.md, CLAUDE.md, and STYLEGUIDE.md blur a few boundaries:
AGENTS.mdmixes a canonical agent entry point with detailed implementation docs and style rules.CLAUDE.mdis a valid Claude Code shim, but non-Claude readers only see@AGENTS.md.STYLEGUIDE.mdshould be the source of truth for review-enforced coding conventions, but some of those conventions are duplicated inAGENTS.md.- A few rules are phrased as absolute when the current code has local exceptions.
Please update the docs along these lines.
1. Make AGENTS.md the canonical tool-neutral entry point
Keep AGENTS.md focused on:
- who the file is for
- where product users should go instead
- the high-level package map
- the architectural guardrails agents must not violate
- links to
STYLEGUIDE.md,CONTRIBUTING.md, and deeper developer docs
Avoid vendor-specific syntax or IDE-specific assumptions in AGENTS.md. Do not put Claude @... includes, Cursor rule paths, MCP names, slash commands, or agent-product instructions there. Adapter files such as CLAUDE.md can carry those details.
Suggested addition near the top:
## Agent compatibility
`AGENTS.md` is the canonical instruction file for coding agents working in this repository. Keep it tool-neutral:
- Use plain Markdown and repository-relative links.
- Do not rely on vendor-specific include syntax, slash commands, MCP names, or IDE-only behavior.
- Put tool-specific adapter instructions in thin wrapper files such as `CLAUDE.md`.
- If an adapter file and this file disagree, this file wins.2. Fix the module map wording
Current wording says the package has βthree modules.β That is too narrow. The code has three primary subpackages plus top-level public logging utilities.
Please change the wording to something like:
`nemo-anonymizer` is a single package with three primary subpackages plus top-level public utilities:Then keep the existing anonymizer.config, anonymizer.engine, and anonymizer.interface bullets, and add a compact logging bullet:
- **`anonymizer.logging`** β public logging configuration (`LoggingConfig`, `configure_logging`) used by the API, CLI, and examples.3. Make CLAUDE.md a readable shim
Keep the Claude Code include line, but add a human-readable fallback so GitHub reviewers and non-Claude tools understand the file.
Suggested replacement:
# Claude Code instructions
Canonical agent instructions live in [AGENTS.md](AGENTS.md).
@AGENTS.mdDo not duplicate the full agent instructions in CLAUDE.md.
4. Move detailed implementation material out of AGENTS.md
AGENTS.md should not be the only source for detailed pipeline contracts. Move or reduce these sections:
- Pipeline diagrams: move the full replace/rewrite diagrams to
docs/concepts/replace.md,docs/concepts/rewrite.md, or a new developer architecture page. InAGENTS.md, keep only a short pointer. NddAdapter.run_workflow()behavior: document the detailed contract in the method/class docstring.Rewrite.evaluationandEvaluationCriteriasync behavior: document the contract on theRewrite.evaluationproperty.- Prompt helper behavior: keep the formal contract in
_jinja()/substitute_placeholders()docstrings and the style rule inSTYLEGUIDE.md. - Future imports, type annotations, SPDX headers, prompt construction, and column-name conventions: keep these in
STYLEGUIDE.md; inAGENTS.md, link to the style guide rather than repeating the rules.
Good target shape: AGENTS.md as a short index and guardrail file, not a full architecture document.
5. Narrow over-broad rules
Several current statements are directionally right but too absolute for the existing code. Please calibrate them so agents do not churn valid code or infer false invariants.
NDD adapter wording
Current claim: NddAdapter is the only place the DataDesigner dependency crosses.
Issue: DataDesigner config types and decorators appear outside the adapter, and Anonymizer constructs DataDesigner.
Better wording:
`NddAdapter.run_workflow()` is the engine boundary for executing DataDesigner workflows. Engine workflows declare DataDesigner column configs, but they do not call `DataDesigner.create()` or `preview()` directly.EvaluationCriteria wording
Current claim: never construct EvaluationCriteria.
Issue: tests and engine-level code construct or pass it directly. The production risk is manually duplicating the Rewrite to EvaluationCriteria mapping.
Better wording:
When production code starts from the user-facing `Rewrite` config, pass `Rewrite.evaluation` into the engine. Do not manually duplicate the `Rewrite` to `EvaluationCriteria` mapping.Column-name constants
Current claim: all column names must be constants.
Issue: LlmReplaceWorkflow currently uses local scratch columns such as "_entity_examples" and "_entities_for_replace".
Either promote those scratch columns to COL_* constants, or narrow the rule:
Shared pipeline columns, trace columns, public output columns, NDD column names, prompt column references, and merge/join keys must use `COL_*` constants from `engine/constants.py`. Local scratch columns may be literal strings only when they do not cross workflow boundaries.Prompt _jinja() usage
Current claim: all column references in NDD prompt templates go through _jinja().
Issue: local Jinja loop variables and local scratch prompt variables are not the same as shared DataFrame column references.
Better wording:
Shared DataFrame column references in NDD prompt templates should use `_jinja(COL_*)`. Local Jinja loop variables and explicitly local scratch prompt variables may remain local, but do not hardcode shared column names into prompt strings.assert wording
If STYLEGUIDE.md says βNo assert for validation,β clarify that this applies to production/library validation. Pytest assertions in tests are fine.
Suggested wording:
In production/library code, do not use `assert` for validation. Pytest assertions in tests are fine.6. Clarify STYLEGUIDE.md
Please make STYLEGUIDE.md the canonical home for review-enforced conventions.
Suggested intro adjustment:
Code and documentation conventions for NeMo Anonymizer that ruff and ty cannot enforce. Architecture boundaries and agent workflow rules live in [AGENTS.md](AGENTS.md).Specific clarifications to consider:
- In βPydantic vs Dataclasses,β say dataclasses are for private/internal frozen value objects, including private config helpers such as
_RiskToleranceBundle, not only engine containers. - In βError Handling,β point the final-judge exception example at
RewriteWorkflow._run_final_judge, since the broad catch lives inrewrite_workflow.py. - In βColumn Names,β narrow the absolute rule or promote local scratch columns to constants.
- In βPrompt Construction,β distinguish shared DataFrame column references from local Jinja variables.
- In βCode Organization,β consider βPrefer public functions before private helpers when adding new symbolsβ instead of an absolute rule, unless the repo already satisfies it everywhere.
7. Keep CONTRIBUTING.md vendor-neutral
The current βClaude Code, Cursor, Codexβ list is understandable but may age poorly. Consider neutral wording:
When automating edits with coding agents (IDE assistants, CLI tools, or hosted models), follow the standard Pull Request Process plus these additions:Also consider replacing βThe agent should read...β with language that applies to both humans and automated tools:
Implementers, human or automated, should read `AGENTS.md` and `STYLEGUIDE.md` before non-trivial changes.Verification notes
These recommendations were checked against the current source:
RewriteandRewrite.evaluationlive insrc/anonymizer/config/anonymizer_config.py.EvaluationCriteria,RiskTolerance, and_RiskToleranceBundlelive insrc/anonymizer/config/rewrite.py.NddAdapterlives insrc/anonymizer/engine/ndd/adapter.py._jinja()lives insrc/anonymizer/engine/constants.py.substitute_placeholders()lives insrc/anonymizer/engine/prompt_utils.py.configure_loggingandLoggingConfigare public exports fromanonymizer.- Current code has local scratch DataFrame columns in
LlmReplaceWorkflow, so the column constant rule needs either a code change or narrower wording.
The main outcome should be a cleaner split:
AGENTS.md = tool-neutral orientation and guardrails
CLAUDE.md = Claude-specific adapter shim
STYLEGUIDE.md = review-enforced code/documentation conventions
code docstrings = callable contracts and implementation-local invariants
developer docs = pipeline diagrams, concepts, and longer architecture explanations
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
- AGENTS.md: add anonymizer.logging to module map; narrow NDD adapter and EvaluationCriteria wording (used to claim too much); clarify _jinja scope (DataFrame columns vs local Jinja loop variables); shrink NDD Adapter, Config Pattern, Prompt Conventions, and Structural Invariants sections to pointers (contracts now live in docstrings or STYLEGUIDE.md). - CLAUDE.md: readable shim header so non-Claude readers understand the file's purpose. - CONTRIBUTING.md: vendor-neutral language in the Agent-Assisted Development section (don't enumerate specific agents; "human or agentic" implementers). - STYLEGUIDE.md: add License Headers section; clarify that the assert ban is for production code (pytest assertions in tests are fine); scope "public before private" to newly added symbols; reframe dataclasses for private/internal frozen value objects; point error-handling example at RewriteWorkflow._run_final_judge. - Docstrings expanded with the contracts that previously lived in AGENTS.md: NddAdapter.run_workflow, Rewrite.evaluation, _jinja, substitute_placeholders. EvaluationCriteria docstring backticks normalized. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
|
@binaryaaron Thanks, I went through all 7 themes.
|
Summary
AGENTS.mdβ architecture overview, pipeline diagrams, and structural invariants for agents working inthe codebase.
STYLEGUIDE.mdβ code conventions ruff and ty cannot enforce (Pydantic vs dataclass, error handling,column-name constants, prompt construction).
CLAUDE.mdβ 3-line redirect toAGENTS.mdso Claude Code picks it up.CONTRIBUTING.mdestablishing theplans/<issue-number>/<short-name>.mdconvention for non-trivial changes..claude/worktrees/(Claude Code session worktrees).Filed #148 as a follow-up for three pre-existing column-name string-literal violations in
llm_replace_workflow.py:53-55that the new STYLEGUIDE rule covers.Closes #114.
Test plan
mkdocs build --strict)git check-ignore .claude/worktrees/fooreturns the path (worktrees properly ignored)