@xingyaoww
Contributor

Summary

This PR adds comprehensive documentation for the new experimental Critic feature in the OpenHands SDK.

What's Added

A new guide at sdk/guides/critic.mdx that covers:

Core Concepts

  • What is a Critic? - Explanation of the LLM-based evaluation system
  • When to Use Critics - Use cases for quality monitoring, early intervention, and performance analysis
  • Evaluation Modes - Two modes: finish_and_message (default) and all_actions
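
As a rough illustration of how the two modes might differ, here is a minimal sketch. Only the mode names `finish_and_message` and `all_actions` come from this PR. The `CriticMode` enum, the `should_evaluate` helper, the event labels, and the semantics assigned to each mode are assumptions made for illustration and are not the SDK's API.

```python
from enum import Enum


class CriticMode(str, Enum):
    # Mode names are taken from the PR description; behavior below is assumed.
    FINISH_AND_MESSAGE = "finish_and_message"
    ALL_ACTIONS = "all_actions"


def should_evaluate(mode: CriticMode, event_kind: str) -> bool:
    """Decide whether a hypothetical event is sent to the critic.

    event_kind is an illustrative label: "finish", "message", or "action".
    """
    if mode is CriticMode.ALL_ACTIONS:
        # Assumed: every event is evaluated in this mode.
        return True
    # Assumed: the default mode only looks at finish actions and messages.
    return event_kind in {"finish", "message"}


if __name__ == "__main__":
    for kind in ("action", "message", "finish"):
        print(kind, should_evaluate(CriticMode.FINISH_AND_MESSAGE, kind))
```

If the assumed semantics hold, the default mode trades coverage for fewer critic calls, while `all_actions` evaluates more often at higher latency and cost.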

Implementation Guide

  • Setting Up APIBasedCritic - Complete example with configuration
  • Configuration Options - All parameters explained (server_url, api_key, model_name, mode)
  • Understanding Results - How to interpret CriticResult scores and feedback
  • Visualizing Results - Color-coded output in the conversation visualizer
  • Programmatic Access - How to access critic results in callbacks

Technical Details

  • How It Works - Step-by-step evaluation flow
  • Chat Template Format - Qwen3-4B-Instruct-2507 template explanation
  • Security - API key handling with SecretStr
  • Performance Considerations - Latency, cost, and parallelization details
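
To make the configuration and security items above more concrete, the following sketch groups the four documented parameters (`server_url`, `api_key`, `model_name`, `mode`) and keeps the key out of `repr` output. `CriticSettings` is a hypothetical stand-in written with the standard library only; it is not `APIBasedCritic`, and the guide itself refers to `SecretStr` for key handling rather than this approach.

```python
from dataclasses import dataclass, field

# Mode names come from the PR description.
VALID_MODES = ("finish_and_message", "all_actions")


@dataclass(frozen=True)
class CriticSettings:
    """Hypothetical settings holder, not the SDK's APIBasedCritic."""

    server_url: str
    model_name: str
    # repr=False keeps the key out of repr() and therefore out of most logs.
    api_key: str = field(repr=False)
    mode: str = "finish_and_message"

    def __post_init__(self) -> None:
        # Reject unknown modes early instead of failing on the first request.
        if self.mode not in VALID_MODES:
            raise ValueError(
                f"mode must be one of {VALID_MODES}, got {self.mode!r}"
            )


if __name__ == "__main__":
    settings = CriticSettings(
        server_url="https://critic.example.invalid",  # placeholder URL
        model_name="example-model",  # placeholder model name
        api_key="not-a-real-key",
    )
    print(settings)  # the api_key field is omitted from the output
```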

Advanced Usage

  • Custom Critic Implementations - Extending CriticBase with custom logic
  • Built-in Critics - PassCritic, AgentFinishedCritic, EmptyPatchCritic
  • Troubleshooting - Common issues and solutions
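
The idea of extending a base class with custom logic can be sketched as follows. Everything here is a self-contained stand-in: `CriticSketch`, `ResultSketch`, and the `evaluate(patch)` signature are assumptions and will likely differ from the real `CriticBase` and `CriticResult`. The two subclasses only loosely mirror what the names `PassCritic` and `EmptyPatchCritic` suggest.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Hypothetical result: a score plus human-readable feedback."""

    score: float
    message: str


class CriticSketch(ABC):
    """Hypothetical base class; the real CriticBase interface may differ."""

    @abstractmethod
    def evaluate(self, patch: str) -> ResultSketch:
        """Score the agent's output, represented here as a patch string."""


class AlwaysPass(CriticSketch):
    # Accepts everything; useful as a no-op default in this sketch.
    def evaluate(self, patch: str) -> ResultSketch:
        return ResultSketch(score=1.0, message="pass")


class NonEmptyPatch(CriticSketch):
    # Fails when the patch contains nothing but whitespace.
    def evaluate(self, patch: str) -> ResultSketch:
        if patch.strip():
            return ResultSketch(score=1.0, message="patch has content")
        return ResultSketch(score=0.0, message="patch is empty")


if __name__ == "__main__":
    print(NonEmptyPatch().evaluate("diff --git a/x b/x"))
    print(NonEmptyPatch().evaluate("   "))
```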

Example Code

The guide includes the full example from examples/01_standalone_sdk/34_critic_model_example.py, with:

  • Auto-configuration for All-Hands LLM proxy
  • Manual configuration fallback
  • Running instructions

⚠️ Experimental Status

The guide includes prominent warnings that this feature is:

  • Highly experimental
  • Not recommended for production use without thorough testing
  • Likely to see API and behavior changes based on feedback

Related PR

This documentation corresponds to OpenHands/software-agent-sdk#1269, which implements the Critic feature.

Preview

The guide follows the same structure and style as existing SDK guides, including:

  • Clear warnings about experimental status
  • Code examples with syntax highlighting
  • Step-by-step instructions
  • Troubleshooting section
  • Links to related guides

Checklist

  • Added comprehensive guide for Critic feature
  • Included clear experimental warnings
  • Provided complete code examples
  • Added troubleshooting section
  • Documented all configuration options
  • Linked to example code in repository
  • Followed existing documentation style and format

This guide documents the experimental API-based Critic feature for
real-time evaluation of agent actions and messages using an external LLM.

Key topics covered:
- Overview of what critics are and when to use them
- Two evaluation modes: finish_and_message and all_actions
- Configuration and setup with APIBasedCritic
- Understanding and visualizing critic results
- Technical details including chat template format
- Custom critic implementations
- Built-in critic types
- Troubleshooting common issues

The guide includes clear warnings that this is an experimental feature
subject to change and not recommended for production use without
thorough testing.

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Jan 15, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • .github/workflows/sync-docs-code-blocks.yml
    • .github/workflows/sync-agent-sdk-openapi.yml

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #263 at branch `xw/critic-model`

Feel free to include any additional details that might help me get this PR into a better state.

@xingyaoww xingyaoww marked this pull request as draft January 15, 2026 22:04