feat(redteam): add smart extraction for grading guidance #7212
yash2998chhabria wants to merge 18 commits into main from
Conversation
Adds LLM-powered semantic extraction for file-based grading guidance. Instead of injecting the full guidance document into every plugin, this extracts only the relevant portions for each plugin.

Key changes:
- New guidanceExtractor.ts with semantic matching logic
- Chunking support for large documents (>100K chars)
- Parallel processing of chunks with smart merging
- Full grader rubric passed to LLM for better context

Set SMART_GRADING_GUIDANCE_EXTRACTION=false for legacy behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
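A minimal sketch of how the chunking step might look. The 100K-character threshold comes from the commit message; the function name and the paragraph-boundary heuristic are assumptions, not the PR's actual code.

```typescript
// Illustrative sketch of chunking a large guidance document. The real
// implementation lives in guidanceExtractor.ts and may differ.
const CHUNK_SIZE = 100_000;

function chunkDocument(text: string, chunkSize: number = CHUNK_SIZE): string[] {
  if (text.length <= chunkSize) {
    return [text]; // small documents are processed as a single chunk
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      // Prefer to cut at a paragraph boundary so passages stay intact.
      const breakAt = text.lastIndexOf('\n\n', end);
      if (breakAt > start) {
        end = breakAt;
      }
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```

Each chunk would then be sent to the LLM independently, with the per-plugin results merged afterwards.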
I reviewed the new grading guidance extraction feature and found a high-severity prompt injection vulnerability. The feature loads external guidance documents and injects them into grading prompts with "HIGHEST PRIORITY" status, but malicious instructions in these documents could manipulate security evaluation outcomes.
Minimum severity threshold: 🟡 Medium | To re-scan after changes, comment @promptfoo-scanner
Start the LLM extraction early and run it in PARALLEL with test synthesis. The extraction results are awaited only after synthesis completes, right before writing the output config. This means total time = max(synthesis, extraction) instead of sum. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
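The overlap described above can be sketched as follows. The function and its signature are hypothetical; the pattern — start the extraction promise first, await it only after synthesis completes — is the one the commit describes.

```typescript
// Sketch: total time becomes max(synthesis, extraction) instead of
// their sum, because both promises run concurrently.
async function generateWithGuidance<T, G>(
  synthesize: () => Promise<T>,
  extractGuidance: () => Promise<G>,
): Promise<{ tests: T; guidance: G }> {
  const guidancePromise = extractGuidance(); // starts immediately
  const tests = await synthesize();          // extraction runs in parallel
  const guidance = await guidancePromise;    // awaited only at the end
  return { tests, guidance };
}
```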
Add alternative extraction approach using Claude Agent SDK with Grep/Read tools, toggled via GRADING_GUIDANCE_EXTRACTION_MODE environment variable.

Two extraction modes:
- "llm" (default): LLM-based chunking approach - exhaustive, comprehensive
- "agent": Claude Agent SDK with tools - focused, concise extractions

Usage:
  # Default LLM chunking
  npm run local -- redteam generate -c config.yaml
  # Agent-based with tools
  GRADING_GUIDANCE_EXTRACTION_MODE=agent npm run local -- redteam generate -c config.yaml

Agent approach benefits:
- More focused extractions (~3x shorter)
- Uses Grep to search, Read to examine relevant sections
- No document chunking needed

Trade-offs:
- Requires ANTHROPIC_API_KEY
- May need up to 100 turns for large documents
- LLM chunking more reliable for production use

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ance

Adds a third extraction mode using OpenAI Agents SDK alongside the existing LLM chunking and Claude Agent SDK approaches. The mode is controlled via GRADING_GUIDANCE_EXTRACTION_MODE environment variable:
- "llm" (default): LLM chunking approach - comprehensive, thorough
- "agent": Claude Agent SDK with Grep/Read tools - focused, concise
- "openai-agent": OpenAI Agents SDK with custom tools - alternative agent

Key implementation details:
- Custom tools created with @openai/agents tool() function and Zod schemas
- grep_file: searches document with comma-separated keyword OR matching
- read_file_section: reads specific line ranges
- get_file_info: returns document metadata
- Tools handle common agent behaviors like passing multiple keywords

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
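The comma-separated OR matching that grep_file is described as doing might look roughly like this. The function name mirrors the tool, but the parameter defaults and return shape are illustrative assumptions.

```typescript
// Sketch of the grep_file tool's core logic: comma-separated keywords
// are OR-ed, and each hit returns the line plus surrounding context.
function grepFile(
  document: string,
  keywords: string, // e.g. "privacy, pii, personal data"
  contextLines = 10,
  maxMatches = 30,
): string[] {
  const terms = keywords
    .split(',')
    .map((k) => k.trim().toLowerCase())
    .filter(Boolean);
  const lines = document.split('\n');
  const matches: string[] = [];
  for (let i = 0; i < lines.length && matches.length < maxMatches; i++) {
    const line = lines[i].toLowerCase();
    if (terms.some((t) => line.includes(t))) {
      // Include context lines on either side of the matching line.
      const start = Math.max(0, i - contextLines);
      const end = Math.min(lines.length, i + contextLines + 1);
      matches.push(lines.slice(start, end).join('\n'));
    }
  }
  return matches;
}
```

Accepting a comma-separated string (rather than an array) matches the note above that the tools tolerate how agents commonly pass multiple keywords.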
The OpenAI agent-based grading guidance extraction was timing out with "Max turns (30) exceeded" when processing large documents. Increased the default maxTurns to 100 to allow the agent sufficient iterations to search and extract guidance from comprehensive documents like Claude's constitution (~192KB). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key improvements:
- Increase context_lines default from 3 to 10 for broader passage capture
- Increase match limit from 15 to 30 to see more relevant sections
- Add explicit minimum extraction length requirement (300-500 chars)
- Add synonym search guidance (pii→privacy, competitors→other products)
- Add good vs bad extraction examples in prompt
- Emphasize thorough per-plugin searching with multiple keywords
- Update run command to reinforce comprehensive extraction

Results improved from 4/8 plugins (50%) to 7/8 plugins (87.5%) with average extraction length nearly doubled (~230→~450 chars).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds file-based caching with 24-hour TTL for extraction results:
- Cache key based on document hash + plugin list + extraction mode
- In-memory session cache for fast repeated access
- File cache in ~/.promptfoo/cache/guidance-extraction/
- Avoids redundant API calls when re-running with same document

This significantly improves UX for iterative development where users run generate multiple times with the same grading guidance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
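A sketch of the cache key described above, combining the three stated ingredients: document hash, plugin list, and extraction mode. The hashing scheme, sorting, and function name are assumptions; only the ingredients come from the commit message.

```typescript
import { createHash } from 'crypto';

// Hypothetical cache-key derivation: same document + same plugins +
// same mode must always yield the same key.
function guidanceCacheKey(
  documentText: string,
  pluginIds: string[],
  mode: string,
): string {
  const docHash = createHash('sha256').update(documentText).digest('hex');
  // Sort plugins so the key is stable regardless of config ordering.
  const plugins = [...pluginIds].sort().join(',');
  return createHash('sha256')
    .update(`${docHash}|${plugins}|${mode}`)
    .digest('hex');
}
```

Keying on a hash of the document (rather than its path) means edits to the guidance file naturally invalidate the cache.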
…tion

Adds plugin-specific keyword mappings that translate plugin IDs to the actual terminology commonly used in guidance documents:
- hijacking → jailbreak, prompt injection, role-play, bypass
- excessive-agency → agentic, autonomous, unilateral, oversight
- harmful:cybercrime → cyberweapon, malicious code, malware
- harmful:violent-crime → violence, weapon, bioweapon
- harmful:self-harm → suicide, self-harm, safe messaging
- And more...

Also improves agent prompts to be more thorough without overfitting to specific character counts.

Results: All three modes now achieve 8/8 plugin coverage (previously Claude Agent was 7/8, OpenAI Agent was 5/8).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PLUGIN_EXTRACTION_HINTS for plugin-specific search guidance
- Rewrite Claude Agent prompt with PURPOSE, EXTRACTION STRATEGY, QUALITY CRITERIA
- Rewrite OpenAI Agent prompt with structured extraction approach
- Add verification step to ensure extracted content enables pass/fail judgment
- Upgrade OpenAI Agent to use gpt-5-2025-08-07 model

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update prompts to explicitly request PASS/FAIL conditions
- Add structured extraction categories: FAIL CONDITIONS, PASS CONDITIONS, EXAMPLES
- Update PLUGIN_EXTRACTION_HINTS with specific pass/fail guidance per plugin
- Add search phrases for grading-specific terms

Significantly improves extraction quality - OpenAI agent now produces structured "PASS if X; FAIL if Y" criteria aligned with grader rubrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The extracted grading guidance was being added to plugin.config AFTER test cases were generated, so test metadata never received the guidance. This caused graders to ignore the extracted guidance entirely.

Fix: After guidance extraction completes, update each test case's pluginConfig metadata to include externalGradingGuidance from its plugin. This ensures graders receive the extracted guidance with HIGHEST PRIORITY per the existing base.ts logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
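The fix described above amounts to a post-extraction pass over the generated test cases. The TestCase shape below is a simplified stand-in for promptfoo's real types, not the actual interface.

```typescript
// Simplified stand-in for promptfoo's test-case metadata.
interface TestCase {
  metadata: { pluginId: string; pluginConfig?: Record<string, unknown> };
}

// After extraction completes, copy each plugin's extracted guidance
// into the metadata of the test cases that plugin generated, so the
// grader can read it at evaluation time.
function applyExtractedGuidance(
  tests: TestCase[],
  guidanceByPlugin: Map<string, string>,
): void {
  for (const test of tests) {
    const guidance = guidanceByPlugin.get(test.metadata.pluginId);
    if (guidance) {
      // Merge so any existing pluginConfig keys are preserved.
      test.metadata.pluginConfig = {
        ...test.metadata.pluginConfig,
        externalGradingGuidance: guidance,
      };
    }
  }
}
```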
Remove hardcoded keyword mappings and extraction hints in favor of letting the agent autonomously decide search strategy based on:
- Plugin ID and name
- Plugin description from metadata
- Grader rubric definition

Changes:
- Remove PLUGIN_KEYWORD_MAPPINGS constant (~10 entries)
- Remove PLUGIN_EXTRACTION_HINTS constant (~100 entries)
- Remove extractKeywordsFromPlugin() function
- Update agent prompts to encourage autonomous search decisions
- Fix OpenAI agent serialization bug ([object Object] output)

Benefits:
- Zero maintenance when new plugins are added
- ~150 fewer lines of code
- Same extraction quality in testing
- 50% fewer grading tokens vs LLM chunked method

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract duplicate normalization logic to normalizeExtractionResult()
- Add object serialization handling for Claude agent (consistency)
- Add PDF support for grading guidance file loading
- Regenerate JSON schema

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔄 Security Review: ✅ No issues in new changes.
Last updated: 2026-01-27T00:00:00Z | Reviewing: 3d45660
I reviewed the grading guidance feature and found a high-severity prompt injection vulnerability. The feature loads user-provided documents and injects them into LLM grading prompts with "HIGHEST PRIORITY" designation, allowing malicious guidance to override security evaluation criteria. This could cause security violations to be incorrectly marked as passing.
📝 Walkthrough: This pull request introduces a comprehensive external grading guidance feature for the redteam module.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 3 passed
👍 All Clear
I reviewed this PR which adds grading guidance extraction functionality using LLMs and AI agents. The code processes user-provided guidance documents to extract relevant evaluation criteria for red-team plugins. I examined data flows from document inputs through LLM extraction to grading prompt injection, with particular attention to agent permissions and tool capabilities.
- Change permission_mode from 'bypassPermissions' to 'dontAsk' with detailed security comments explaining why this is safe (read-only tools, temp dir)
- Fix brittle path detection: use fs.existsSync() to verify file paths instead of just checking file extensions (avoids false positives like "This guidance is for test.txt files")
- Remove non-null assertions: assign guidanceText to const after truthy check for proper TypeScript type narrowing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix dynamic import issue for Claude Agent SDK and OpenAI Codex SDK when config file is outside the promptfoo directory (e.g., /tmp/).

Changes:
- SDK resolution now tries multiple base paths in order:
  1. cliState.basePath (config directory)
  2. process.cwd() (current working directory)
  3. promptfoo package root (via getDirectory())
- Updated @anthropic-ai/claude-agent-sdk to 0.2.20

This allows configs in external directories to still resolve SDKs installed in promptfoo's node_modules.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
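The multi-base-path resolution order can be sketched as below. resolveSdkPath is a hypothetical stand-in; cliState.basePath and getDirectory() from the commit are represented as plain strings the caller passes in.

```typescript
import { createRequire } from 'module';
import * as path from 'path';

// Try each candidate base directory in order until the module
// resolves; only throw once every base has been exhausted.
function resolveSdkPath(
  moduleName: string,
  basePaths: Array<string | undefined>,
): string {
  for (const base of basePaths) {
    if (!base) continue; // e.g. cliState.basePath may be unset
    const req = createRequire(path.join(base, 'noop.js'));
    try {
      return req.resolve(moduleName);
    } catch {
      // not installed under this base; fall through to the next one
    }
  }
  throw new Error(`Cannot resolve ${moduleName} from any candidate base path`);
}
```

A caller would pass the three bases from the commit in order: config directory, process.cwd(), then the promptfoo package root.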
Add ability to select the extraction mode via YAML config instead of
just environment variable:
```yaml
redteam:
gradingGuidance:
file: ./guidance.md
extractionMode: claude_agent # or openai_agent, openai_chunking
```
Three extraction modes:
- openai_chunking (default): LLM-based chunking, no extra deps
- openai_agent: OpenAI Agents SDK with custom tools
- claude_agent: Claude Agent SDK with read-only tools
Config takes precedence over GRADING_GUIDANCE_EXTRACTION_MODE env var.
Legacy env var values (llm, agent, openai-agent) still supported for
backwards compatibility.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
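Config-over-env precedence with legacy aliases could be implemented roughly like this; the alias table maps the old env values onto the new mode names per the commit message, and the function name is illustrative.

```typescript
type ExtractionMode = 'openai_chunking' | 'openai_agent' | 'claude_agent';

// Legacy GRADING_GUIDANCE_EXTRACTION_MODE values mapped to the new
// config mode names (per the backwards-compatibility note above).
const LEGACY_ALIASES: Record<string, ExtractionMode> = {
  llm: 'openai_chunking',
  agent: 'claude_agent',
  'openai-agent': 'openai_agent',
};

function resolveExtractionMode(
  configMode?: string,
  envMode?: string,
): ExtractionMode {
  // YAML config takes precedence over the environment variable.
  const raw = configMode ?? envMode ?? 'openai_chunking';
  // Translate legacy aliases; pass new-style names through unchanged.
  return LEGACY_ALIASES[raw] ?? (raw as ExtractionMode);
}
```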
…idance-extraction

# Conflicts:
#   package-lock.json
#   package.json
Update @anthropic-ai/claude-agent-sdk from ^0.2.19 to ^0.2.20 in optionalDependencies to match devDependencies and fix CI consistency check. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Adds smart extraction for file-based grading guidance with three extraction modes. Instead of injecting the full guidance document into every plugin, this extracts only the relevant portions for each plugin based on semantic matching.
Architecture Overview
How Extraction Modes Work
Detailed Mode Architectures
1. LLM Chunking Mode (Default)
How it works:
Architecture:
Pros:
Cons:
2. Claude Agent Mode (GRADING_GUIDANCE_EXTRACTION_MODE=agent)
How it works:
Architecture:
Pros:
Cons:
3. OpenAI Agent Mode (GRADING_GUIDANCE_EXTRACTION_MODE=openai-agent)
How it works:
Tools: grep_file, read_file_section, get_file_info
Architecture:
Pros:
Cons:
Performance Comparison
Test: 6 plugins, ~5KB grading guidance document
Cost Analysis
Approximate costs per extraction (6 plugins, typical document):
Costs are approximate and depend on model pricing and document size
Quality Comparison
Recommendation
Dynamic Agent Extraction
The agent-based extraction is truly dynamic - agents autonomously decide what to search for based on plugin context, with zero hardcoded hints.
Agent receives only:
- Plugin ID and name (e.g. harmful:violent-crime)
- Plugin description from metadata
- Grader rubric definition

Removed in this PR:
- PLUGIN_KEYWORD_MAPPINGS - No pre-computed keywords
- PLUGIN_EXTRACTION_HINTS - No hardcoded PASS/FAIL hints (~100 entries)
- extractKeywordsFromPlugin() - No keyword extraction function

Files Changed
- src/redteam/extraction/guidanceExtractor.ts
- src/redteam/commands/generate.ts
- src/redteam/types.ts - gradingGuidance type definitions
- src/redteam/plugins/base.ts - externalGradingGuidance config
- src/validators/redteam.ts
- src/providers/claude-agent-sdk.ts
- src/providers/openai/codex-sdk.ts
- site/static/config-schema.json

Test Commands
Security Fixes
This PR addresses security review feedback:
- permission_mode: 'bypassPermissions' → permission_mode: 'dontAsk' with security comments
- fs.existsSync() for path detection (prevents false positives like "This guidance is for test.txt files")

Bug Fix: SDK Resolution
Fixed dynamic import issue for Claude Agent SDK and OpenAI Codex SDK when config file is outside the promptfoo directory:
- Allows configs in /tmp/ or other external directories to still resolve SDKs installed in promptfoo's node_modules