Skip to content

Conversation

@Jerryguan777
Copy link

@Jerryguan777 Jerryguan777 commented Jan 15, 2026

Implemented an iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. Achieved 70% success rate (7/10) in initial evaluation.

  • Add IterativeAgent
  • Add config_iterative.yml
  • Add git tools
  • Add SweBenchPredictorIterativeConfig
  • Register iterative predictor and git tool
  • Update README.md

How Has This Been Tested?

export ANTHROPIC_API_KEY=sk-xxxxxx

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml
Running 10 instances...
10 ran successfully, 0 failed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [04:30<00:00, 27.07s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 10
Instances submitted: 10
Instances completed: 10
Instances incomplete: 0
Instances resolved: 7
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 2
Report written to nv_predictor.nat_iterative_1.json
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:194 - Completed swe_bench run nat_iterative_1
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:200 - SWE_bench report and logs written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_reports directory
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:275 - Profiler is not enabled. Skipping profiling.
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:366 - Original config file copied to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_original.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:384 - Effective config (with overrides) saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_effective.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:427 - Configuration metadata saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_metadata.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:449 - Workflow output written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/workflow_output.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:460 - Evaluation results written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_output.json

=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 480.70s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |         0.7 | swe_bench_output.json |

Description

Closes #1397

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added an iterative agent predictor for SWE-bench that uses LLM-driven stepwise command execution to iteratively build and submit patches or final outputs.
    • Added a git-repo workspace tool to provision per-instance repositories, checkout commits, and perform cleanup.
    • New configurable options: selectable LLM backends (including OpenAI), step limits, command timeouts, and workspace cleanup.
  • Documentation

    • README and example workflow/config updated to document the new "iterative" predictor and usage.

✏️ Tip: You can customize this high-level summary in your review settings.

- Add IterativeAgent
- Add config_iterative.yml
- Add git tools
- Add SweBenchPredictorIterativeConfig
- Register iterative predictor and git tool
- Update README.md

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@Jerryguan777 Jerryguan777 requested a review from a team as a code owner January 15, 2026 02:26
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link

coderabbitai bot commented Jan 15, 2026

Walkthrough

Adds an iterative SWE-bench predictor and supporting tools: an LLM-driven IterativeAgent that proposes and executes single bash commands in an isolated git workspace, observes results, iterates until completion or limits, and returns a git-diff patch; includes config, example YAML, repo tooling, tool registration, and README entry.

Changes

Cohort / File(s) Summary
Documentation
examples/evaluation_and_profiling/swe_bench/README.md
Added "iterative" predictor entry describing the iterative, step-by-step bash-based agent.
Config schema
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
Added SweBenchPredictorIterativeConfig (name="iterative") with llm_name, step_limit, timeout; extended SweBenchPredictorConfig union and updated docstrings/imports.
Config example
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
New YAML example defining multiple LLM backends, iterative predictor settings (step_limit: 100, timeout: 60), git_repo_tool, dataset, evaluator, concurrency, and output paths.
Iterative predictor implementation
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
New IterativeAgent, IterativeAgentConfig, exception hierarchy, and SweBenchPredictor (registered "iterative"): prompt/LLM loop, single-command parsing, async execution with timeout, observation embedding, truncation handling, and final git-diff emission.
Git workspace tool
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Added RepoContext dataclass and RepoManager with async setup_repository (clone/checkout), cleanup, and helper async clone/checkout functions using GitPython; workspace path isolation.
Tool registration
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
Added GitRepoToolConfig and git_repo_tool registered function exposing JSON setup/cleanup operations backed by RepoManager, returning FunctionInfo and honoring cleanup_on_exit.
Predictor registry wiring
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
Imported iterative predictor (SweBenchPredictor as IterativePredictor) to expose it in the predictor registry.
Tool registry exposure
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
Imported git_repo_tool so the git repo tool is available to the tools registry.
Package initializers
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py, .../tools/__init__.py
Added package initializer files with SPDX/Apache-2.0 license headers.

Sequence Diagram(s)

sequenceDiagram
    participant Client as SWE-Bench Client
    participant Predictor as SweBenchPredictor
    participant Repo as RepoManager
    participant Agent as IterativeAgent
    participant LLM as LLM Backend
    participant Executor as Command Executor

    Client->>Predictor: predict_fn(swebench_input)
    Predictor->>Repo: setup_repository(repo_url, commit)
    Repo-->>Predictor: RepoContext
    Predictor->>Agent: instantiate(config, builder)
    Predictor->>Agent: run(task_description, repo_path)

    loop until COMPLETE or limits
        Agent->>LLM: _query_llm(prompt/messages)
        LLM-->>Agent: response (one bash code block)
        Agent->>Executor: _execute_action(bash_command)
        Executor->>Repo: run command in repo workspace
        Repo-->>Executor: stdout/stderr/return_code
        Executor-->>Agent: observation (truncated if needed)
        Agent->>Agent: add_message(assistant,response)
        Agent->>Agent: add_message(user,observation)
        Agent->>Agent: check for COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
    end

    alt Completed
        Agent-->>Predictor: (patch, status)
    else Error/Timeout/Limits
        Agent-->>Predictor: (error_message, status)
    end

    Predictor->>Repo: cleanup()
    Predictor-->>Client: final patch or error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is descriptive, follows imperative mood with the 'feat' prefix, clearly summarizes the main change (adding an iterative predictor), and is within the 72-character limit at 66 characters.
Linked Issues check ✅ Passed The PR implements all core coding requirements from issue #1397: iterative reasoning-action-observation loop, step-by-step execution with feedback, error recovery, dynamic feedback loops, and SweBenchPredictorBase integration.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the iterative predictor feature. README updates, config files, predictor implementation, git tools, registration, and a minor CI fix for async behavior are all aligned with the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 1-6: Add the standard SPDX Apache-2.0 license header as the very
first lines of the YAML (before the "llms" key); update the top of the file
containing the "llms" / "claude_sonnet_llm" entries to begin with the SPDX
Apache-2.0 header so the file complies with the repo policy.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 76-79: The async function clone_repository uses the synchronous
blocking call Repo.clone_from which will block the event loop; change the
implementation to run Repo.clone_from in a background thread (e.g., via
asyncio.to_thread) and await that result so the function remains async and
non-blocking. Locate the clone_repository function and replace the direct call
to Repo.clone_from(repo_url, target_path) with an awaited asyncio.to_thread call
(or equivalent executor) that invokes Repo.clone_from, and keep the logger.info
call as-is.
- Around line 82-85: The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 41-53: The git_operations function lacks input validation and
error handling: catch JSONDecodeError around json.loads(args_str) and return or
raise a clear error message, validate presence of 'operation' and for operation
== "setup" ensure required keys 'repo_url' and 'base_commit' exist before
calling repo_manager.setup_repository (raise ValueError or return a descriptive
error if missing), and wrap the repo_manager.setup_repository and
repo_manager.cleanup calls to catch and log exceptions so callers receive
actionable error messages referencing git_operations and
repo_manager.setup_repository/cleanup.
🧹 Nitpick comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

24-25: Unused import: FunctionRef

FunctionRef is imported but not used in this file. Only LLMRef is used for the llm_name field.

🧹 Remove unused import
 from nat.data_models.common import TypedBaseModel
-from nat.data_models.component_ref import FunctionRef
 from nat.data_models.component_ref import LLMRef
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)

37-67: Add type hints per coding guidelines.

The class is missing type hints on __init__, active_repos, and cleanup(). Per coding guidelines, all public APIs require type hints.

📝 Add type hints
 class RepoManager:
+    active_repos: dict[str, RepoContext]
 
-    def __init__(self, workspace_dir: str):
+    def __init__(self, workspace_dir: str) -> None:
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
 
     # ... setup_repository unchanged ...
 
-    async def cleanup(self):
+    async def cleanup(self) -> None:
         """Clean up all managed repositories."""

25-34: Misleading docstring: not a context manager.

The docstring states "Context manager for repository operations" but RepoContext is a plain dataclass without __enter__/__exit__ methods. Consider updating the docstring to reflect its actual purpose as a data container.

📝 Fix docstring
 `@dataclass`
 class RepoContext:
-    """Context manager for repository operations."""
+    """Data container holding repository state and paths."""
     repo_url: str
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

25-29: Redundant _type field.

The _type field is redundant since TypedBaseModel (parent of FunctionBaseConfig) already manages the type discriminator via name="git_repo_tool". This creates potential confusion with two type fields.

🧹 Remove redundant field
 class GitRepoToolConfig(FunctionBaseConfig, name="git_repo_tool"):
     """Configuration for git repository management tool."""
-    _type: typing.Literal["git_repo_tool"] = "git_repo_tool"
     workspace_dir: str = "./.workspace"  # Base directory for cloning repositories
     cleanup_on_exit: bool = True  # Whether to clean up repos after use

32-60: Unused builder parameter is acceptable for interface consistency.

The builder parameter is unused (as flagged by static analysis) but is likely required by the register_function decorator's expected signature. The cleanup pattern using try/finally is well implemented.

Consider adding a return type hint for the async generator:

📝 Add return type hint
+from collections.abc import AsyncGenerator
+
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]:
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (6)

76-81: Consider adding docstrings for the configuration fields.

The dataclass lacks documentation for its fields. While the class docstring exists, individual field descriptions would improve clarity.

📝 Suggested improvement
 `@dataclass`
 class IterativeAgentConfig:
     """Configuration for the iterative agent."""
-    step_limit: int = 250
-    timeout: int = 60
-    max_output_length: int = 10000
+    step_limit: int = 250  # Maximum number of agent steps before termination
+    timeout: int = 60  # Command execution timeout in seconds
+    max_output_length: int = 10000  # Maximum characters before output truncation

105-110: Add type hint for llm parameter.

The llm parameter lacks a type annotation. Per coding guidelines, all public APIs require type hints on parameters.

📝 Suggested fix
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig):
         self.llm = llm
         self.repo_path = repo_path
         self.config = config
-        self.messages: list = []
+        self.messages: list[SystemMessage | HumanMessage | AIMessage] = []
         self.n_steps = 0

Note: Add import typing at the top if not already present. Ideally, use the actual LLM interface type if available from the framework.


360-363: Chain exception and use explicit conversion.

Per coding guidelines, use raise ... from err to preserve the exception chain and use explicit conversion flag instead of str(e).

🔧 Proposed fix
         except Exception as e:
             logger.error("LLM invocation failed: %s", e, exc_info=True)
-            # recoverable error, let the agent continue
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            # recoverable error, let the agent continue
+            raise NonTerminatingException(f"LLM call failed: {e!s}") from e

414-427: Chain exceptions and narrow the exception type.

Multiple issues flagged by static analysis:

  1. Missing exception chaining at lines 425 and 427
  2. Catching broad Exception at line 426 masks specific errors
🔧 Proposed fix
         except (TimeoutError, subprocess.TimeoutExpired) as e:
             # Extract output from exception if available (only subprocess.TimeoutExpired has output attribute)
             if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and e.output:
                 output = e.output.decode("utf-8", errors="replace")
             else:
                 output = ""
             # Format timeout message using template
             timeout_message = self._TIMEOUT_TEMPLATE.format(
                 action=command,
                 output=output
             )
-            raise ExecutionTimeoutError(timeout_message)
-        except Exception as e:
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise ExecutionTimeoutError(timeout_message) from e
+        except OSError as e:
+            raise NonTerminatingException(f"Error executing command: {e!s}") from e

Using OSError (or subprocess.SubprocessError) is more appropriate than catching all exceptions, as it covers typical subprocess failures without masking unexpected errors.


462-464: Remove redundant exception object from logger.exception.

When using logger.exception(), the exception info is automatically included. Including e as an argument is redundant (TRY401).

🔧 Proposed fix
         except Exception as e:
-            logger.exception("Failed to setup repository: %s", e)
-            return f"Error: Failed to setup repository - {str(e)}"
+            logger.exception("Failed to setup repository")
+            return f"Error: Failed to setup repository - {e!s}"

493-495: Remove redundant exception object and use explicit conversion.

Same pattern as above - logger.exception() automatically includes exception info.

🔧 Proposed fix
         except Exception as e:
-            logger.exception(f"Error processing {swebench_input.instance_id}: {e}")
-            return f"Error: {str(e)}"
+            logger.exception("Error processing %s", swebench_input.instance_id)
+            return f"Error: {e!s}"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e15df42 and f5771ff.

📒 Files selected for processing (10)
  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/__init__.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{md,mdx}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx}: Use 'NVIDIA NeMo Agent toolkit' for full name (first use), 'NeMo Agent toolkit' or 'the toolkit' for subsequent references, and 'Toolkit' (capital T) in titles/headings, 'toolkit' (lowercase t) in body text
Never use deprecated names: 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' in documentation; update any occurrences unless intentionally referring to deprecated versions or implementing compatibility layers

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
**/*.{md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx,rst}: Documentation must be clear, comprehensive, and free of TODOs, FIXMEs, placeholder text, offensive or outdated terms, and spelling mistakes
Do not use words listed in 'ci/vale/styles/config/vocabularies/nat/reject.txt' in documentation
Words listed in 'ci/vale/styles/config/vocabularies/nat/accept.txt' are acceptable even if they appear to be spelling mistakes

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧠 Learnings (1)
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (2)
src/nat/data_models/common.py (3)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (2)
  • FunctionRef (94-102)
  • LLMRef (116-124)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (3)
src/nat/builder/function_info.py (2)
  • FunctionInfo (290-625)
  • from_fn (552-625)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (37-67)
  • setup_repository (44-58)
  • cleanup (60-67)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
  • SweBenchPredictor (431-502)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
  • Builder (84-811)
src/nat/builder/framework_enum.py (1)
  • LLMFrameworkEnum (19-27)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
  • SweBenchWorkflowConfig (51-52)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


53-53: Avoid specifying long messages outside the exception class

(TRY003)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


359-359: Consider moving this statement to an else block

(TRY300)


363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


363-363: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)


379-379: subprocess call with shell=True identified, security issue

(S602)


412-412: Consider moving this statement to an else block

(TRY300)


425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


426-426: Do not catch blind exception: Exception

(BLE001)


427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Avoid specifying long messages outside the exception class

(TRY003)


427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)


463-463: Redundant exception object included in logging.exception call

(TRY401)


464-464: Use explicit conversion flag

Replace with conversion flag

(RUF010)


494-494: Redundant exception object included in logging.exception call

(TRY401)


495-495: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🔇 Additional comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

40-49: LGTM!

The SweBenchPredictorIterativeConfig follows the established pattern, with appropriate type hints and field descriptions. The discriminated union is correctly extended to include the new iterative variant.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

70-73: LGTM!

Simple helper with proper type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py (1)

19-19: LGTM!

The import correctly triggers registration of the git_repo_tool via its decorator, following the established pattern in this file.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)

20-20: The import follows the established pattern for predictor registration.

The addition of IterativePredictor on line 20 mirrors the existing GoldPredictor import on line 19, with proper flake8: noqa directives to allow unused imports (which are intentionally present to trigger registration side-effects). The file maintains proper Apache 2.0 licensing and copyright headers.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (4)

1-14: LGTM!

License header is correctly formatted with SPDX identifier and Apache 2.0 license text.


52-74: LGTM!

Well-structured exception hierarchy with clear separation between recoverable (NonTerminatingException) and terminal (TerminatingException) conditions. This enables clean control flow in the agent loop.


376-389: Acknowledged: shell=True is intentional for this agent design.

The static analysis flags shell=True as a security concern (S602). However, this is expected behavior since the agent is explicitly designed to execute arbitrary bash commands provided by the LLM. The commands are sandboxed within the repository workspace directory.


497-502: LGTM!

The _build_task_description method is clean and well-typed with appropriate type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (2)

23-35: LGTM!

Workflow and functions configuration is well-structured with appropriate iterative predictor settings and git tool configuration. The 2-space indentation follows repository conventions. Based on learnings, this matches the existing YAML style in the repository.


37-70: LGTM!

Evaluation configuration is properly structured with dataset filtering for focused testing. The single-instance allowlist is appropriate for initial development/testing.

examples/evaluation_and_profiling/swe_bench/README.md (1)

159-159: LGTM!

Documentation entry for the iterative predictor follows the established pattern and provides clear, concise description with appropriate links to implementation and configuration files.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 82 to 85
async def checkout_commit(repo: Repo, commit_hash: str):
"""Checkout a specific commit in the repository."""
logger.info("Checking out commit %s", commit_hash)
repo.git.checkout(commit_hash)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Same blocking I/O issue; also missing return type hint.

repo.git.checkout() is synchronous. Wrap with asyncio.to_thread(). Per coding guidelines, add return type hint.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
     logger.info("Checking out commit %s", commit_hash)
-    repo.git.checkout(commit_hash)
+    await asyncio.to_thread(repo.git.checkout, commit_hash)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 82 - 85, The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py`:
- Around line 69-74: Docstring for SweBenchWorkflowConfig incorrectly lists a
"full" predictor option that doesn't exist; update the docstring to reflect the
actual predictor options defined in SweBenchPredictorConfig (e.g., "gold,
skeleton, or iterative") or enumerate the exact union variants from
SweBenchPredictorConfig so the docs and code are consistent; locate the class
SweBenchWorkflowConfig and its docstring and replace "full, gold, skeleton, or
iterative" with the correct set of predictor types from SweBenchPredictorConfig.
- Around line 24-25: Remove the unused import FunctionRef from the top of the
module: delete the "FunctionRef" import token in the import statement that
currently reads "from nat.data_models.component_ref import FunctionRef" so only
LLMRef remains imported (referenced symbol: FunctionRef).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 71-74: get_repo_path currently builds a path from only the repo
name causing collisions; update get_repo_path to parse the repo URL and extract
the owner/organization component (e.g., the segment immediately preceding the
repo name for HTTPS and the part after ":" for SSH forms) and return
Path(workspace_dir) / owner / repo_name so repositories with the same name under
different orgs are distinct; ensure you handle URLs like
"https://host/org/repo.git" and "git@host:org/repo.git" and strip ".git" from
repo_name.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 32-38: The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

16-28: Duplicate llms key causes configuration to be overwritten.

The YAML has two separate llms: keys (lines 16 and 23). In YAML, duplicate keys at the same level cause the second to overwrite the first, meaning nim_llm will be silently discarded and only claude_sonnet_llm will be available.

Additionally, nim_llm uses 1-space indentation while claude_sonnet_llm uses 2-space indentation. Per learnings, the repository uses 2-space indentation for YAML files.

🔧 Proposed fix - merge into single llms block with consistent 2-space indentation
-llms:
- nim_llm:
-   _type: nim
-   model_name: mistralai/mistral-nemotron
-   temperature: 0.6
-   max_tokens: 4096    
-
-llms:
-  claude_sonnet_llm:
-    _type: litellm
-    model_name: anthropic/claude-sonnet-4-5-20250929
-    temperature: 0.0
-    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
+llms:
+  nim_llm:
+    _type: nim
+    model_name: mistralai/mistral-nemotron
+    temperature: 0.6
+    max_tokens: 4096
+
+  claude_sonnet_llm:
+    _type: litellm
+    model_name: anthropic/claude-sonnet-4-5-20250929
+    temperature: 0.0
+    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
🧹 Nitpick comments (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)

84-88: Add return type hint for checkout_commit.

Per coding guidelines, all public APIs require type hints on return values. This function returns None implicitly.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
     logger.info("Checking out commit %s", commit_hash)
     # Use asyncio.to_thread to avoid blocking the event loop during checkout
     await asyncio.to_thread(repo.git.checkout, commit_hash)

38-43: Add type hint for active_repos dictionary.

Per coding guidelines, type hints are required. The dictionary maps repo paths to RepoContext objects.

🔧 Proposed fix
 class RepoManager:

     def __init__(self, workspace_dir: str):
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

80-82: Remove trailing blank lines.

YAMLlint reports too many blank lines at the end of the file. Files should end with a single newline.

🔧 Proposed fix
   evaluators:
     swe_bench:
       _type: swe_bench
       run_id: nat_iterative_1
       clean: true
-
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5771ff and a39f138.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (38-68)
  • setup_repository (45-59)
  • cleanup (61-68)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
  • BaseModelRegistryTag (91-93)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (2)
  • FunctionRef (94-102)
  • LLMRef (116-124)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)


[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

77-81: LGTM - blocking I/O properly handled.

The use of asyncio.to_thread correctly offloads the blocking Repo.clone_from call to a thread pool, preventing event loop blocking.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Input validation improvements look good.

The JSON parsing is now wrapped in a try-except, and required keys are validated before use. This addresses the past review feedback.


61-66: Good resource cleanup pattern with finally block.

The use of try/finally ensures repo_manager.cleanup() is called even if an exception occurs during the tool's lifecycle, preventing orphaned repository directories.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

51-61: LGTM - SweBenchPredictorIterativeConfig is well-structured.

The configuration properly uses LLMRef for the LLM reference, and Field with descriptions for step_limit and timeout. The defaults (250 steps, 60s timeout) are reasonable for iterative problem solving.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

1-14: LGTM - SPDX license header is present.

The required Apache-2.0 license header has been added, addressing the past review feedback.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +32 to +38
@register_function(config_type=GitRepoToolConfig)
async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
"""Git repository management tool for SWE Bench."""
import json

from .git_tool import RepoManager
repo_manager = RepoManager(tool_config.workspace_dir)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find and read the register.py file
fd "register.py" -p "examples/evaluation_and_profiling/swe_bench"

# Get the full content to understand context
find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py" -type f -exec wc -l {} \;

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 450


🏁 Script executed:

# Search for all `@register_function` decorated functions to check builder parameter usage
rg "@register_function" -A 3 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 50381


🏁 Script executed:

# Find the definition of register_function decorator
rg "def register_function" -B 2 -A 10 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 6591


🏁 Script executed:

cat -n examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3277


Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is not used in this function. Follow the established pattern in the codebase by renaming it to _builder to indicate intentional non-use.

🔧 Suggested fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
🧰 Tools
🪛 Ruff (0.14.11)

33-33: Unused function argument: builder

(ARG001)

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 32 - 38, The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@Jerryguan777 Jerryguan777 force-pushed the feat/iterative-predictor branch from a39f138 to dbc2dd6 Compare January 15, 2026 06:29
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 16-28: The YAML defines two separate top-level llms mappings which
causes the first (nim_llm) to be overwritten by the second (claude_sonnet_llm);
merge both entries under a single llms key so both nim_llm and claude_sonnet_llm
are present, and fix nim_llm’s indentation to match the file’s 2-space style;
locate the nim_llm and claude_sonnet_llm blocks and combine them into one llms
mapping preserving their model_name, temperature, max_tokens and api_key fields.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 33-35: RepoContext.__post_init__ sets repo_path = base_path /
repo_name which omits the organization and disagrees with get_repo_path; update
__post_init__ so repo_path includes the org (e.g., repo_path = base_path /
self.org_name / self.repo_name) or call the existing get_repo_path logic to
compute it, ensuring RepoContext.repo_path matches the path used by
setup_repository and clone operations.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

88-92: Add missing return type hint.

Per coding guidelines, all public functions require type hints. This async function returns None.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
🧹 Nitpick comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

81-82: Remove extra trailing blank line.

YAMLlint reports too many blank lines at the end. Files should end with exactly one newline.

🔧 Proposed fix
       clean: true
-
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

1-1: Copyright year should be updated to 2025-2026.

Other files in this PR use 2025-2026 in the copyright header. This file uses only 2025.

🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

32-34: Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is unused. Follow the codebase convention by renaming to _builder.

🔧 Proposed fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

1-1: Copyright year should be updated to 2025-2026.

For consistency with other files in this PR.

🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a39f138 and dbc2dd6.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
  • BaseModelRegistryTag (91-93)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (1)
  • LLMRef (116-124)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (38-68)
  • setup_repository (45-59)
  • cleanup (61-68)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)


[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (6)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)

50-60: LGTM! Well-structured iterative predictor configuration.

The new SweBenchPredictorIterativeConfig is properly defined with appropriate type hints and sensible defaults. The LLMRef type correctly references the component reference pattern used elsewhere in the codebase.

Minor nit: Line 57 has trailing whitespace after the closing triple quotes.


62-66: Discriminated union correctly extended.

The SweBenchPredictorConfig union properly includes the new iterative config using the established Tag/Discriminator pattern.


68-74: Docstring now correctly lists available predictor types.

The docstring accurately reflects the available options: "gold, skeleton, or iterative".

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Good input validation and error handling.

The function properly validates JSON input, checks for required keys on the setup operation, and provides clear error messages. This addresses the concerns from prior reviews.


61-66: Correct async generator pattern with guaranteed cleanup.

The try/finally ensures cleanup() runs when the tool is disposed, respecting the cleanup_on_exit config flag.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

81-85: LGTM! Blocking I/O correctly wrapped with asyncio.to_thread.

The clone operation now properly yields to the event loop, addressing the prior review concern.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@Jerryguan777 Jerryguan777 changed the title feat(swe-bench): implement iterative predictor for SWE-bench feat(swe-bench): add an iterative predictor with ~70% success rate Jan 16, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 1-2: Update the SPDX copyright header to include 2026 by modifying
the SPDX-FileCopyrightText line (and/or SPDX-License-Identifier line if needed)
so the year range reads "2025, 2026" or "2025-2026" as per project convention;
locate the top-of-file SPDX header in predict_iterative.py and adjust the year
accordingly.
- Around line 414-425: The except block handling subprocess.TimeoutExpired
should guard the type of e.output before decoding: in the except (TimeoutError,
subprocess.TimeoutExpired) handler (inside the predict_iterative code path),
check if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and
e.output, then if isinstance(e.output, bytes) decode with .decode("utf-8",
errors="replace") else treat e.output as the string directly; assign that to the
local output variable used in self._TIMEOUT_TEMPLATE and raise
ExecutionTimeoutError(timeout_message) as before to avoid calling .decode on a
str when text=True was used.
- Around line 449-451: The parsing of swebench_input.instance_id into repo_name
using split('-')[0] breaks for hyphenated org/repo names; update the extraction
to take the suffix after the last hyphen (use rsplit('-', 1)) so repo_name
contains the full org__repo string, then continue to split into org, repo and
build repo_url as before (look for repo_name, org, repo, repo_url in
predict_iterative.py and replace the split logic accordingly).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 1-2: Update the SPDX header to include 2026 by changing the
copyright year range in the top-of-file comments in git_tool.py (the two SPDX
comment lines); for example replace the single year "2025" with a range like
"2025-2026" (or include "2026" alongside 2025) so the SPDX-FileCopyrightText and
SPDX-License-Identifier lines reflect the 2026 change.
- Around line 76-98: The get_repo_path function incorrectly parses SSH-style
URLs like "git@github.com:org/repo.git"; update get_repo_path to normalize
repo_url first (handle git@...:org/repo.git by replacing the first colon after
host with a slash or by extracting the substring after the last ':' when an '@'
is present) so that parts = repo_url.rstrip('/').split('/') yields correct
segments; keep existing logic using repo_name, org_name, workspace_dir and
instance_id unchanged but derive org_name and repo_name from the normalized URL
to support both SSH and HTTP(S) formats.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 1-2: Update the SPDX header to include 2026 by modifying the
existing copyright SPDX lines at the top of the file (the lines containing
"SPDX-FileCopyrightText" and "SPDX-License-Identifier") so the copyright year
range includes 2026 (e.g., change "2025" to "2025, 2026" or "2025-2026")
ensuring the SPDX tags remain intact.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)

32-33: Rename unused builder parameter to _builder.

The parameter is required by the decorator but unused. Prefixing with _ clarifies intent and avoids lint warnings.

🔧 Suggested fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
🧹 Nitpick comments (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

66-73: Avoid blocking the event loop in cleanup.

shutil.rmtree is synchronous and can block. Run it in a thread (or make cleanup sync) to keep async responsiveness. Based on learnings, avoid blocking I/O in async paths.

♻️ Proposed fix
     async def cleanup(self):
         """Clean up all managed repositories."""
         import shutil
         for repo_path_str in list(self.active_repos.keys()):
             repo_path = Path(repo_path_str)
             if repo_path.exists():
-                shutil.rmtree(repo_path)
+                await asyncio.to_thread(shutil.rmtree, repo_path)
         self.active_repos.clear()
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

105-110: Add a type hint for llm in IterativeAgent.__init__.

Public APIs should carry parameter type hints; use typing.Any (or a protocol) to make intent explicit. As per coding guidelines, add missing type annotations.

🔧 Proposed fix
-import asyncio
-import json
-import logging
-import re
-import subprocess
+import asyncio
+import json
+import logging
+import re
+import subprocess
+import typing
@@
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig):
         self.llm = llm
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbc2dd6 and 2ed8b41.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
  • Builder (84-811)
src/nat/builder/framework_enum.py (1)
  • LLMFrameworkEnum (19-27)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
  • SweBenchWorkflowConfig (68-74)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (34-73)
  • setup_repository (41-64)
  • cleanup (66-73)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


359-359: Consider moving this statement to an else block

(TRY300)


363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


363-363: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)


379-379: subprocess call with shell=True identified, security issue

(S602)


412-412: Consider moving this statement to an else block

(TRY300)


425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


426-426: Do not catch blind exception: Exception

(BLE001)


427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Avoid specifying long messages outside the exception class

(TRY003)


427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)


464-464: Redundant exception object included in logging.exception call

(TRY401)


465-465: Use explicit conversion flag

Replace with conversion flag

(RUF010)


495-495: Redundant exception object included in logging.exception call

(TRY401)


496-496: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


63-63: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

1-77: Config layout and defaults look good.

LLM backends, iterative predictor settings, tool registration, and evaluator settings are coherent and well-structured.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Add instance_id to workspace path to prevent git conflicts between
parallel instances. Each instance now uses .workspace/{instance_id}/org/repo
instead of sharing .workspace/org/repo.

Performance: 10 instances in ~8 min (vs ~30 min sequential)
Results: 7/10 resolved (70%) with Claude Sonnet
Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
@Jerryguan777 Jerryguan777 force-pushed the feat/iterative-predictor branch from 2ed8b41 to 1abbd6e Compare January 16, 2026 05:40
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 105-113: Update the public methods __init__ and add_message to
include explicit type hints and Google-style docstrings: annotate llm with
typing.Any (or the concrete LLM interface), add return type None for both
__init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a
suitable typed container) and self.n_steps as int; update the __init__ signature
(def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) ->
None) and add_message signature (def add_message(self, role: str, content: str)
-> None), and add Google-style docstrings for both methods including an Args
section for parameters and a Returns section (even if None).
- Around line 478-481: The SweBenchPredictorIterativeConfig is missing a
max_output_length field and thus the IterativeAgentConfig isn't receiving a
custom truncation limit; add max_output_length: int = Field(default=10000,
description="Maximum output length for command results") to the
SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the
IterativeAgentConfig construction in predict_iterative.py (the agent_config =
IterativeAgentConfig(...) block) to include
max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or
self.config.predictor.max_output_length) so the value is passed through to
IterativeAgentConfig.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 16-18: Add a Google-style module docstring at the top of
register.py (immediately after the existing first line) describing the module as
a public registry for tools used by the full predictor, its purpose, what it
exports/registry behavior, and any important usage notes; ensure the docstring
follows Google docstring conventions (short summary line, blank line, more
detailed description and any public symbols) so the module is properly
documented for public consumption and tooling.
♻️ Duplicate comments (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

33-33: Rename unused builder to _builder.

This parameter is required by the decorator interface but unused; prefixing with _ matches the existing convention.

🔧 Suggested fix
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]:
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder) -> AsyncIterator[FunctionInfo]:

32-34: Add return type hint to git_repo_tool.

The function is a public API (decorated with @register_function) and yields FunctionInfo, but lacks a return type hint. Per coding guidelines, all public APIs require type hints on return values.

♻️ Suggested update
-import typing
+import typing
+from collections.abc import AsyncIterator
@@
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncIterator[FunctionInfo]:
🧹 Nitpick comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

360-427: Preserve exception context when wrapping errors.

Chaining with from e keeps the original stack trace for debugging.

🔧 Suggested change
@@
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            raise NonTerminatingException(f"LLM call failed: {str(e)}") from e
@@
-            raise ExecutionTimeoutError(timeout_message)
+            raise ExecutionTimeoutError(timeout_message) from e
@@
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise NonTerminatingException(f"Error executing command: {str(e)}") from e

As per coding guidelines, preserve exception context when wrapping errors.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed8b41 and 1abbd6e.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (1)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Confirm that copyright years are up-to-date whenever a file is changed

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
  • Builder (84-811)
src/nat/builder/framework_enum.py (1)
  • LLMFrameworkEnum (19-27)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
  • SweBenchWorkflowConfig (68-74)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (4)
src/nat/builder/function_info.py (2)
  • FunctionInfo (290-625)
  • from_fn (552-625)
src/nat/agent/react_agent/output_parser.py (1)
  • _type (103-104)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (35-74)
  • setup_repository (42-65)
  • cleanup (67-74)
src/nat/runtime/runner.py (1)
  • context (93-94)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


359-359: Consider moving this statement to an else block

(TRY300)


363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


363-363: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)


379-379: subprocess call with shell=True identified, security issue

(S602)


412-412: Consider moving this statement to an else block

(TRY300)


425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


426-426: Do not catch blind exception: Exception

(BLE001)


427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Avoid specifying long messages outside the exception class

(TRY003)


427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)


464-464: Redundant exception object included in logging.exception call

(TRY401)


465-465: Use explicit conversion flag

Replace with conversion flag

(RUF010)


495-495: Redundant exception object included in logging.exception call

(TRY401)


496-496: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


63-63: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

25-29: Config model looks good.

Clear defaults and simple surface area for the tool configuration.


40-63: Input validation and error messages are solid.

Good defensive handling for malformed JSON and unknown operations.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

76-81: Config defaults look sensible.

The step limit, timeout, and output limits are reasonable starting points.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +105 to +113
def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
self.llm = llm
self.repo_path = repo_path
self.config = config
self.messages: list = []
self.n_steps = 0

def add_message(self, role: str, content: str):
"""Add a message to the conversation and print it for debugging."""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and read the relevant section
find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" -type f

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 182


🏁 Script executed:

# Read the file to see the current state around lines 105-113
cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '100,120p'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1222


🏁 Script executed:

# Check the import statements at the top of the file
head -30 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1169


🏁 Script executed:

# Check for typing imports in the file
head -50 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | tail -25

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 744


🏁 Script executed:

# Find the class definition and verify this is a public class
grep -n "^class " "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 456


🏁 Script executed:

# Get more context around the __init__ and add_message methods
cat -n "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py" | sed -n '60,125p'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3254


Add type hints and Google‑style docstrings for public methods.

__init__ and add_message are public methods that require complete type hints and Google‑style docstrings per coding guidelines. Currently:

  • __init__ is missing type hint for llm and return type annotation
  • add_message is missing return type annotation and complete Google‑style docstring (needs Args section)
✏️ Proposed update
@@
-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Any
@@
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None:
+        """Initialize the iterative agent.
+
+        Args:
+            llm: LLM instance used for chat completion.
+            repo_path: Path to the repository workspace.
+            config: Agent configuration.
+        """
@@
-    def add_message(self, role: str, content: str):
-        """Add a message to the conversation and print it for debugging."""
+    def add_message(self, role: str, content: str) -> None:
+        """Add a message to the conversation and print it for debugging.
+
+        Args:
+            role: Message role ("system", "user"/"human", or "assistant"/"ai").
+            content: Message text.
+        """
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 105 - 113, Update the public methods __init__ and add_message to
include explicit type hints and Google-style docstrings: annotate llm with
typing.Any (or the concrete LLM interface), add return type None for both
__init__ and add_message, and type self.messages as List[Tuple[str, str]] (or a
suitable typed container) and self.n_steps as int; update the __init__ signature
(def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) ->
None) and add_message signature (def add_message(self, role: str, content: str)
-> None), and add Google-style docstrings for both methods including an Args
section for parameters and a Returns section (even if None).

Comment on lines +478 to +481
agent_config = IterativeAgentConfig(
step_limit=getattr(self.config.predictor, 'step_limit', 250),
timeout=getattr(self.config.predictor, 'timeout', 60)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "max_output_length" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 389


🏁 Script executed:

# Find the IterativeAgentConfig definition
fd -t f "*.py" examples/evaluation_and_profiling/swe_bench/src | xargs rg -l "class IterativeAgentConfig"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 481


🏁 Script executed:

# Search for max_output_length more broadly in the codebase
rg -n "max_output_length" --type py | head -50

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 389


🏁 Script executed:

head -100 examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | tail -30

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1225


🏁 Script executed:

sed -n '70,90p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 656


🏁 Script executed:

sed -n '470,490p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 851


🏁 Script executed:

# Find IterativeAgentConfig definition
rg -A 10 "class IterativeAgentConfig" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1580


🏁 Script executed:

# Find the predictor config model definition
rg -B 5 -A 20 "class.*PredictorConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51


🏁 Script executed:

# Also search for config field definitions related to max_output_length
rg -B 3 -A 3 "max_output_length" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2114


🏁 Script executed:

sed -n '1,100p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3486


🏁 Script executed:

# Search for config class definitions in the file
rg "^class.*Config" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py -A 5

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 229


🏁 Script executed:

# Look for where self.config is defined
rg -B 5 "self.config.predictor" examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py | head -30

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 601


🏁 Script executed:

# Look for config model definitions
rg -B 2 -A 15 "class SweBenchWorkflowConfig" examples/evaluation_and_profiling/swe_bench/src -g "*.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1030


🏁 Script executed:

# Search for config files and their definitions
fd -t f "config.py" examples/evaluation_and_profiling/swe_bench/src

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 140


🏁 Script executed:

# Check what's in the config module
cat examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3137


🏁 Script executed:

# Verify max_output_length is actually used in IterativeAgent
sed -n '395,410p' examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 851


Add max_output_length to the predictor config model and pass it through to IterativeAgentConfig.

The predictor config (SweBenchPredictorIterativeConfig) is missing the max_output_length field, so the suggested getattr fix alone won't enable custom truncation limits. You need to add this field to the config model first, then pass it through here.

🔧 Required changes

In nat_swe_bench/config.py, add to SweBenchPredictorIterativeConfig:

max_output_length: int = Field(default=10000, description="Maximum output length for command results")

Then in predict_iterative.py at lines 478-481:

             agent_config = IterativeAgentConfig(
                 step_limit=getattr(self.config.predictor, 'step_limit', 250),
-                timeout=getattr(self.config.predictor, 'timeout', 60)
+                timeout=getattr(self.config.predictor, 'timeout', 60),
+                max_output_length=getattr(self.config.predictor, 'max_output_length', 10000),
             )
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 478 - 481, The SweBenchPredictorIterativeConfig is missing a
max_output_length field and thus the IterativeAgentConfig isn't receiving a
custom truncation limit; add max_output_length: int = Field(default=10000,
description="Maximum output length for command results") to the
SweBenchPredictorIterativeConfig in nat_swe_bench/config.py, then update the
IterativeAgentConfig construction in predict_iterative.py (the agent_config =
IterativeAgentConfig(...) block) to include
max_output_length=getattr(self.config.predictor, 'max_output_length', 10000) (or
self.config.predictor.max_output_length) so the value is passed through to
IterativeAgentConfig.

Comment on lines +16 to +18
# Register all the tools needed by the full predictor without loading the dependencies.
import typing

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check if this file exists and read its content
cd examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools
ls -la register.py
wc -l register.py
head -50 register.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2189


🏁 Script executed:

# Check the coding guidelines for examples directory
cat .cursor/rules/general.mdc | grep -A 20 "examples/\*\*/\*"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51


🏁 Script executed:

# Find Python files in examples directory and check their docstrings
find examples -name "*.py" -type f | head -10

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 822


🏁 Script executed:

# Check a few example Python files to see their docstring patterns
fd -e py examples | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && head -20 {}'

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 51


🏁 Script executed:

# Check specific example files for module docstrings
echo "=== A2A register.py ===" && head -25 examples/A2A/math_assistant_a2a/src/nat_math_assistant_a2a/register.py
echo ""
echo "=== HITL register.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/register.py
echo ""
echo "=== extract_por_tool.py ===" && head -25 examples/HITL/por_to_jiratickets/src/nat_por_to_jiratickets/extract_por_tool.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2989


🏁 Script executed:

# Check the .cursor/rules/general.mdc file more thoroughly for any examples directory guidance
cat .cursor/rules/general.mdc | tail -50

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2989


🏁 Script executed:

# Also check if there are any other examples modules with docstrings
grep -r '"""' examples --include="*.py" | grep -A 2 "^examples" | head -20

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1960


Add a module docstring for this public registry.

Public modules require Google-style docstrings. This file currently starts with a comment only; add a docstring following the first line to describe the module's purpose.

✏️ Proposed docstring
@@
-# Register all the tools needed by the full predictor without loading the dependencies.
+"""Register iterative SWE-bench tool integrations."""
+
+# Register all the tools needed by the full predictor without loading the dependencies.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 16 - 18, Add a Google-style module docstring at the top of
register.py (immediately after the existing first line) describing the module as
a public registry for tools used by the full predictor, its purpose, what it
exports/registry behavior, and any important usage notes; ensure the docstring
follows Google docstring conventions (short summary line, blank line, more
detailed description and any public symbols) so the module is properly
documented for public consumption and tooling.

_type: swe_bench
predictor:
_type: iterative
llm_name: "claude_sonnet_llm" # "nim_llm" or "claude_sonnet_llm" or "openai_llm"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have the default be openai_llm or nim_llm ?

llms:
nim_llm:
_type: nim
model_name: mistralai/mistral-nemotron
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we switch this to the new(er) Nemotron Nano 3 model?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @willkill07,
Thanks for the feedback. I've set the default LLM to openai_llm. I also tried the latest nvidia/nemotron-3-nano-30b-a3b, but response.content is always empty. Any suggestion?
btw, I also tried nvidia/llama-3.3-nemotron-super-49b-v1.5 and it worked well. Successfully resolved one issue in 19 steps.

Comment on lines 28 to 29
_type: litellm
model_name: openai/gpt-5.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_type: litellm
model_name: openai/gpt-5.2
_type: openai
model_name: gpt-5.2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All files (even if no code) must contain the license headers

@@ -0,0 +1,505 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was introduced in 2026

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

@@ -0,0 +1,70 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

@@ -0,0 +1,118 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need license header

@@ -0,0 +1,78 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`:
- Around line 353-363: The exception wrapper in _query_llm currently raises
NonTerminatingException without chaining the original exception; change the
raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so
the original traceback is preserved, and apply the same fix to the other
NonTerminatingException raise site in this module (the other error-handling
block that currently does "raise NonTerminatingException(...)" in the same
class/file).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 35-41: Add a Google-style class docstring to the public class
RepoManager describing its responsibility (managing local git repositories
within a workspace), what it does on init (creates workspace directory), and
list key arguments/attributes: Args: workspace_dir (str) — path to workspace;
Attributes: workspace (Path) and active_repos (dict). Place the docstring
directly below the class RepoManager declaration and keep it concise (one-line
summary + short paragraph + Args and Attributes sections).
- Around line 1-16: Add a Google-style module docstring immediately below the
SPDX header in git_tool.py that briefly describes the module's purpose (public
Git tooling utilities used by the iterative predictor), lists the primary public
symbols it provides (e.g., any public functions or classes in this module), and
any usage/behavior notes; ensure it follows Google docstring conventions and
sits at the top of the file before imports (i.e., above the existing import
asyncio).
- Around line 77-104: The get_repo_path function must validate repo_url and
sanitise instance_id to prevent path traversal and index errors: ensure repo_url
yields at least two path components before accessing parts[-2] (raise ValueError
if malformed), and validate instance_id (if provided) to be a safe token—trim
whitespace and reject values containing path separators ('/', '\'), '..', or any
characters outside an allowed set such as alphanumerics, hyphen and underscore
(or use a simple regex like ^[A-Za-z0-9_-]+$); only use the sanitized
instance_id when constructing the Path and document/raise a clear error for
invalid inputs so callers cannot escape workspace directories via crafted inputs
in get_repo_path, repo_name, org_name, and instance_id.
- Around line 67-74: The cleanup method lacks a return type and performs
blocking filesystem work; update async def cleanup(self) to async def
cleanup(self) -> None and offload synchronous shutil.rmtree calls to a thread
using asyncio.to_thread (import asyncio if not already), e.g. await
asyncio.to_thread(shutil.rmtree, repo_path) while keeping the loop over
self.active_repos and clearing it afterward; ensure shutil is imported/import
available in the scope where cleanup is defined (or keep the local import) and
await the to_thread calls for each repo_path in the cleanup method.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 33-41: The docstring for git_repo_tool currently describes the
wrapper but should document the logical JSON interface presented to the LLM:
update the git_repo_tool docstring to explain the JSON payload accepted by the
inner async function git_operations (keys: "operation" with allowed operations,
"repo_url" for the git repository, "base_commit" for the commit/branch/ref, and
"instance_id" if applicable), sketch expected types/values and example payloads,
and mention that RepoManager (constructed from tool_config.workspace_dir) will
perform the requested operations; do not describe builder or wrapper internals,
only the LLM-visible JSON interface.
♻️ Duplicate comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

1-18: Add a module docstring for the tool registry.

This public module should expose a concise Google-style docstring after the SPDX header.

✏️ Proposed fix
@@
-# Register all the tools needed by the full predictor without loading the dependencies.
+"""Register iterative SWE-bench tool integrations."""
+
+# Register all the tools needed by the full predictor without loading the dependencies.
As per coding guidelines, public modules need Google-style docstrings.

32-34: Prefix unused builder parameter with underscore.

This parameter is required by the decorator but unused; underscore it to signal intent.

🔧 Proposed fix
@@
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (2)

105-130: Add type hints and Google-style docstrings for public methods.

__init__ and add_message are public; they should include explicit type hints and Google-style docstrings, plus typed attributes.

✏️ Proposed fix
@@
-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Any
@@
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: Any, repo_path: Path, config: IterativeAgentConfig) -> None:
+        """Initialize the iterative agent.
+
+        Args:
+            llm: LLM instance used for chat completion.
+            repo_path: Path to the repository workspace.
+            config: Agent configuration.
+        """
         self.llm = llm
         self.repo_path = repo_path
         self.config = config
-        self.messages: list = []
-        self.n_steps = 0
+        self.messages: list[AIMessage | HumanMessage | SystemMessage] = []
+        self.n_steps: int = 0
@@
-    def add_message(self, role: str, content: str):
-        """Add a message to the conversation and print it for debugging."""
+    def add_message(self, role: str, content: str) -> None:
+        """Add a message to the conversation and print it for debugging.
+
+        Args:
+            role: Message role ("system", "user"/"human", or "assistant"/"ai").
+            content: Message text.
+        """
As per coding guidelines, public methods need type hints and Google-style docstrings.

477-481: Pass through max_output_length to the agent config.

Right now the predictor ignores any configured output limit; pass the value through so it can be tuned.

🔧 Proposed fix
@@
             agent_config = IterativeAgentConfig(
-                step_limit=getattr(self.config.predictor, 'step_limit', 250),
-                timeout=getattr(self.config.predictor, 'timeout', 60)
+                step_limit=getattr(self.config.predictor, "step_limit", 250),
+                timeout=getattr(self.config.predictor, "timeout", 60),
+                max_output_length=getattr(self.config.predictor, "max_output_length", 10000),
             )
🧹 Nitpick comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)

379-389: Run LLM-provided commands inside a hardened sandbox.

Because commands are executed with shell=True from model output, ensure this runs in a tightly isolated environment (drop privileges, restrict filesystem scope, and disable network) to contain potential misuse.

Comment on lines +353 to +363
async def _query_llm(self) -> str:
"""Query LLM and return response content."""
try:
response = await self.llm.ainvoke(self.messages)
content = response.content if hasattr(response, 'content') else str(response)
self.add_message("assistant", content)
return content
except Exception as e:
logger.error("LLM invocation failed: %s", e, exc_info=True)
# recoverable error, let the agent continue
raise NonTerminatingException(f"LLM call failed: {str(e)}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Preserve exception context when wrapping errors.

Wrapping exceptions without from e loses the original traceback and makes debugging harder; chain the cause explicitly.

♻️ Proposed fix
@@
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            raise NonTerminatingException(f"LLM call failed: {e}") from e
@@
-            raise ExecutionTimeoutError(timeout_message)
+            raise ExecutionTimeoutError(timeout_message) from e
@@
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise NonTerminatingException(f"Error executing command: {e}") from e
As per coding guidelines, preserve stack traces when re-raising.

Also applies to: 414-427

🧰 Tools
🪛 Ruff (0.14.13)

359-359: Consider moving this statement to an else block

(TRY300)


363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


363-363: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py`
around lines 353 - 363, The exception wrapper in _query_llm currently raises
NonTerminatingException without chaining the original exception; change the
raise to "raise NonTerminatingException(f'LLM call failed: {str(e)}') from e" so
the original traceback is preserved, and apply the same fix to the other
NonTerminatingException raise site in this module (the other error-handling
block that currently does "raise NonTerminatingException(...)" in the same
class/file).

Comment on lines +1 to +16
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import asyncio
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a module docstring for public git tooling.

This public module lacks a Google-style docstring; add one just below the SPDX header.

✏️ Proposed fix
@@
 # limitations under the License.
-
+"""Git repository utilities for the iterative SWE-bench predictor."""
+
As per coding guidelines, public modules need Google-style docstrings.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 1 - 16, Add a Google-style module docstring immediately below the
SPDX header in git_tool.py that briefly describes the module's purpose (public
Git tooling utilities used by the iterative predictor), lists the primary public
symbols it provides (e.g., any public functions or classes in this module), and
any usage/behavior notes; ensure it follows Google docstring conventions and
sits at the top of the file before imports (i.e., above the existing import
asyncio).

Comment on lines +35 to +41
class RepoManager:

def __init__(self, workspace_dir: str):
self.workspace = Path(workspace_dir)
self.workspace.mkdir(parents=True, exist_ok=True)
self.active_repos = {}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document RepoManager as a public class.

Add a concise Google-style class docstring describing its responsibility.

✏️ Proposed fix
@@
-class RepoManager:
+class RepoManager:
+    """Manage cloned repositories within a workspace."""
As per coding guidelines, public classes need Google-style docstrings.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 35 - 41, Add a Google-style class docstring to the public class
RepoManager describing its responsibility (managing local git repositories
within a workspace), what it does on init (creates workspace directory), and
list key arguments/attributes: Args: workspace_dir (str) — path to workspace;
Attributes: workspace (Path) and active_repos (dict). Place the docstring
directly below the class RepoManager declaration and keep it concise (one-line
summary + short paragraph + Args and Attributes sections).

Comment on lines +67 to +74
async def cleanup(self):
"""Clean up all managed repositories."""
import shutil
for repo_path_str in list(self.active_repos.keys()):
repo_path = Path(repo_path_str)
if repo_path.exists():
shutil.rmtree(repo_path)
self.active_repos.clear()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Check if file exists
if [ -f "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py" ]; then
  echo "=== File found. Reading content ==="
  wc -l "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"
else
  echo "=== File not found ==="
  exit 1
fi

# Show the specific lines in question plus context
echo "=== Lines around 67-74 ==="
sed -n '60,80p' "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"

# Check for SPDX header
echo ""
echo "=== First 15 lines (checking SPDX header) ==="
head -15 "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"

# Check for asyncio import
echo ""
echo "=== Checking for asyncio import ==="
grep -n "import asyncio" "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py" || echo "No direct asyncio import found"

# Check all imports
echo ""
echo "=== All imports in file ==="
grep -n "^import\|^from" "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py" | head -20

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1956


🏁 Script executed:

#!/bin/bash

# Get more context around the cleanup method, including class definition
echo "=== Class and method context (lines 40-80) ==="
sed -n '40,80p' "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"

# Look for the class definition
echo ""
echo "=== Full cleanup method and surrounding code ==="
sed -n '65,85p' "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"

# Check if there are type hints on other methods
echo ""
echo "=== Check method signatures for type hints ==="
grep -n "def \|async def" "examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py"

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 2750


Add return type hint and offload blocking filesystem work to a thread.

The async def cleanup(self): method has two issues:

  1. Missing return type hint (-> None) — required per guidelines for all public APIs with Python 3.11+ type hints.
  2. shutil.rmtree is synchronous and blocks the event loop. Offload to asyncio.to_thread to comply with the guideline to use async/await for I/O-bound work.
♻️ Proposed fix
-    async def cleanup(self):
+    async def cleanup(self) -> None:
         """Clean up all managed repositories."""
         import shutil
         for repo_path_str in list(self.active_repos.keys()):
             repo_path = Path(repo_path_str)
             if repo_path.exists():
-                shutil.rmtree(repo_path)
+                await asyncio.to_thread(shutil.rmtree, repo_path)
         self.active_repos.clear()
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 67 - 74, The cleanup method lacks a return type and performs
blocking filesystem work; update async def cleanup(self) to async def
cleanup(self) -> None and offload synchronous shutil.rmtree calls to a thread
using asyncio.to_thread (import asyncio if not already), e.g. await
asyncio.to_thread(shutil.rmtree, repo_path) while keeping the loop over
self.active_repos and clearing it afterward; ensure shutil is imported/import
available in the scope where cleanup is defined (or keep the local import) and
await the to_thread calls for each repo_path in the cleanup method.

Comment on lines +77 to +104
def get_repo_path(workspace_dir: str, repo_url: str, instance_id: str | None = None) -> Path:
"""Generate a unique path for the repository.
Args:
workspace_dir: Base workspace directory
repo_url: URL of the repository
instance_id: Optional instance ID for unique workspace isolation
Returns:
Path to the repository. If instance_id is provided, returns
workspace_dir/instance_id/org/repo for complete isolation.
Otherwise returns workspace_dir/org/repo.
"""
if "://" in repo_url:
path = urlparse(repo_url).path
else:
# SSH form: git@host:org/repo.git
path = repo_url.split(":", 1)[-1]
parts = path.strip("/").split("/")
repo_name = parts[-1].replace('.git', '')
org_name = parts[-2] # Organization name

# If instance_id is provided, create isolated workspace per instance
if instance_id:
return Path(workspace_dir) / instance_id / org_name / repo_name

# Default: workspace_dir/org/repo
return Path(workspace_dir) / org_name / repo_name
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Sanitise instance_id and validate repo_url before building paths.

Untrusted instance_id can escape the workspace via path separators or .., and malformed URLs can raise IndexError. Validate inputs and reject unsafe values.

🔒 Proposed fix
@@
-    parts = path.strip("/").split("/")
-    repo_name = parts[-1].replace('.git', '')
-    org_name = parts[-2]  # Organization name
+    parts = [p for p in path.strip("/").split("/") if p]
+    if len(parts) < 2:
+        raise ValueError(f"Invalid repo URL: {repo_url}")
+    repo_name = parts[-1].removesuffix(".git")
+    org_name = parts[-2]  # Organization name
@@
-    if instance_id:
-        return Path(workspace_dir) / instance_id / org_name / repo_name
+    if instance_id:
+        safe_instance_id = Path(instance_id).name
+        if safe_instance_id != instance_id or safe_instance_id in {".", ".."}:
+            raise ValueError("instance_id must be a single safe path segment")
+        return Path(workspace_dir) / safe_instance_id / org_name / repo_name
As per coding guidelines, validate and sanitise user input.
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 77 - 104, The get_repo_path function must validate repo_url and
sanitise instance_id to prevent path traversal and index errors: ensure repo_url
yields at least two path components before accessing parts[-2] (raise ValueError
if malformed), and validate instance_id (if provided) to be a safe token—trim
whitespace and reject values containing path separators ('/', '\'), '..', or any
characters outside an allowed set such as alphanumerics, hyphen and underscore
(or use a simple regex like ^[A-Za-z0-9_-]+$); only use the sanitized
instance_id when constructing the Path and document/raise a clear error for
invalid inputs so callers cannot escape workspace directories via crafted inputs
in get_repo_path, repo_name, org_name, and instance_id.

Comment on lines +33 to +41
async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
"""Git repository management tool for SWE Bench."""
import json

from .git_tool import RepoManager
repo_manager = RepoManager(tool_config.workspace_dir)

# Simple async function that accepts a JSON string
async def git_operations(args_str: str) -> str:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Describe the tool interface in the docstring.

The registered tool docstring is surfaced to the LLM; describe the JSON payload (operation, repo_url, base_commit, instance_id) rather than the wrapper parameters.

✏️ Proposed fix
@@
-async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
-    """Git repository management tool for SWE Bench."""
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
+    """Manage git repositories for SWE-bench.
+
+    Accepts a JSON string with:
+        - operation: "setup" or "cleanup"
+        - repo_url: Repository URL (required for "setup")
+        - base_commit: Commit hash (required for "setup")
+        - instance_id: Optional workspace isolation identifier
+    """
Based on learnings, tool docstrings should describe the logical tool interface presented to the LLM.
🧰 Tools
🪛 Ruff (0.14.13)

33-33: Unused function argument: builder

(ARG001)

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 33 - 41, The docstring for git_repo_tool currently describes the
wrapper but should document the logical JSON interface presented to the LLM:
update the git_repo_tool docstring to explain the JSON payload accepted by the
inner async function git_operations (keys: "operation" with allowed operations,
"repo_url" for the git repository, "base_commit" for the commit/branch/ref, and
"instance_id" if applicable), sketch expected types/values and example payloads,
and mention that RepoManager (constructed from tool_config.workspace_dir) will
perform the requested operations; do not describe builder or wrapper internals,
only the LLM-visible JSON interface.

@AnuradhaKaruppiah
Copy link
Contributor

AnuradhaKaruppiah commented Jan 21, 2026

Achieved 70% success rate (7/10) in initial evaluation.

70% is very impressive. If you have the compute resources do you think you would be able to get a run on all 500?

@AnuradhaKaruppiah AnuradhaKaruppiah added non-breaking Non-breaking change feature request New feature or request labels Jan 21, 2026
@AnuradhaKaruppiah
Copy link
Contributor

/ok to test 4a1f4af

@Jerryguan777
Copy link
Author

Hi @AnuradhaKaruppiah ,
I run the evaluation with GPT 5.2 on hf://datasets/princeton-nlp/SWE-bench_Lite/data/test-00000-of-00001.parquet, here is a summary:

Success rate: 53.7% (158/294)
Note: Some failures were due to eval environment issues, the actually success rate may be higher.

Patch generation stage:

Totally 300 instances, 6 skipped, 294 finished

Evaluation Stage

    "total_instances": 294,
    "submitted_instances": 294,
    "completed_instances": 274,
    "resolved_instances": 158,
    "unresolved_instances": 116,
    "empty_patch_instances": 0,
    "error_instances": 20,

Achieved 70% success rate (7/10) in initial evaluation.

70% is very impressive. If you have the compute resources do you think you would be able to get a run on all 500?

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 109-115: The pre-clone cleanup in clone_repository currently calls
blocking shutil.rmtree inside an async function; move that call into a
background thread using asyncio.to_thread so the event loop isn't blocked. Wrap
the existing os.path.exists(target_path) check and shutil.rmtree(target_path)
invocation in await asyncio.to_thread(...) (or call asyncio.to_thread only for
rmtree if you prefer to keep the exists check synchronous), then proceed to
await asyncio.to_thread(Repo.clone_from, repo_url, target_path) as already done
to perform the clone.
♻️ Duplicate comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (4)

1-15: Add a Google-style module docstring.
Public modules should declare their purpose and key symbols; insert a short docstring under the SPDX header. As per coding guidelines, add a Google-style module docstring.

✏️ Proposed fix
 # limitations under the License.
 
+"""Git repository utilities for the iterative SWE-bench predictor.
+
+Provides `RepoContext`, `RepoManager`, `get_repo_path`, `clone_repository`, and
+`checkout_commit`.
+"""
+
 import asyncio

37-42: Document RepoManager as a public class.
Add a concise Google-style class docstring with key responsibilities and attributes. As per coding guidelines.

✏️ Proposed fix
 class RepoManager:
+    """Manage cloned repositories within a workspace.
+
+    Args:
+        workspace_dir: Base workspace directory for clones.
+
+    Attributes:
+        workspace: Workspace directory as a `Path`.
+        active_repos: Mapping of repo paths to `RepoContext`.
+    """
 
     def __init__(self, workspace_dir: str):

69-76: Avoid blocking shutil.rmtree in async cleanup.
shutil.rmtree is blocking; offload to a thread to keep the event loop responsive. Based on learnings.

♻️ Proposed fix
     async def cleanup(self):
         """Clean up all managed repositories."""
-        import shutil
         for repo_path_str in list(self.active_repos.keys()):
             repo_path = Path(repo_path_str)
             if repo_path.exists():
-                shutil.rmtree(repo_path)
+                await asyncio.to_thread(shutil.rmtree, repo_path)
         self.active_repos.clear()

92-103: Sanitise instance_id and validate repo_url parts.
Prevent path traversal and malformed URLs from escaping the workspace. As per coding guidelines.

🔒 Proposed fix
-    parts = path.strip("/").split("/")
-    repo_name = parts[-1].replace('.git', '')
-    org_name = parts[-2]  # Organization name
+    parts = [p for p in path.strip("/").split("/") if p]
+    if len(parts) < 2:
+        raise ValueError(f"Invalid repo URL: {repo_url}")
+    repo_name = parts[-1].removesuffix(".git")
+    org_name = parts[-2]  # Organization name
@@
-    if instance_id:
-        return Path(workspace_dir) / instance_id / org_name / repo_name
+    if instance_id:
+        safe_instance_id = Path(instance_id).name
+        if safe_instance_id != instance_id or safe_instance_id in {".", ".."}:
+            raise ValueError("instance_id must be a single safe path segment")
+        return Path(workspace_dir) / safe_instance_id / org_name / repo_name

Signed-off-by: Jerry Guan <jerryguan777@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Iterative Predictor for Improved SWE-bench Issue Resolution

3 participants