feat(redteam): add smart extraction for grading guidance#7212

Closed
yash2998chhabria wants to merge 18 commits into main from feat/smart-grading-guidance-extraction

Conversation


@yash2998chhabria yash2998chhabria commented Jan 23, 2026

Summary

Adds smart extraction for file-based grading guidance with three extraction modes. Instead of injecting the full guidance document into every plugin, this extracts only the relevant portions for each plugin based on semantic matching.

Architecture Overview

How Extraction Modes Work

┌─────────────────────────────────────────────────────────────────────────────┐
│                        GRADING GUIDANCE EXTRACTION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  INPUT: Full guidance document (e.g., 5KB-100KB markdown)                  │
│  OUTPUT: Plugin-specific excerpts (e.g., 200-800 chars per plugin)         │
│                                                                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐           │
│  │  LLM Chunking   │   │  Claude Agent   │   │  OpenAI Agent   │           │
│  │    (default)    │   │    (agent)      │   │ (openai-agent)  │           │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘           │
│           │                     │                     │                     │
│           ▼                     ▼                     ▼                     │
│    Split document         Agent decides         Agent decides              │
│    into chunks            search strategy       search strategy            │
│           │                     │                     │                     │
│           ▼                     ▼                     ▼                     │
│    LLM extracts           Grep/Read tools       Custom tools               │
│    from each chunk        search document       search document            │
│           │                     │                     │                     │
│           ▼                     ▼                     ▼                     │
│    Merge & dedupe         Extract PASS/FAIL     Extract PASS/FAIL          │
│    results                criteria              criteria                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Detailed Mode Architectures

1. LLM Chunking Mode (Default)

How it works:

  1. Split: Document is split into overlapping chunks (~100K chars each, 2K overlap)
  2. Process: Each chunk is sent to an LLM with plugin context
  3. Extract: LLM extracts relevant PASS/FAIL criteria for each plugin
  4. Merge: Results from chunks are merged and deduplicated

Architecture:

Document (N chars) → Split into chunks → [Chunk 1] [Chunk 2] [Chunk N]
                                              ↓         ↓         ↓
                                           LLM call  LLM call  LLM call
                                              ↓         ↓         ↓
                                         Results    Results    Results
                                              ↓         ↓         ↓
                                         ──────── Merge & Dedupe ────────
                                                      ↓
                                              Final extraction
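For illustration, the split step can be sketched as follows. The function name and defaults are illustrative, not the PR's exact code; per the description, chunks are roughly 100K characters with a 2K-character overlap:

```typescript
// Illustrative sketch of the chunking step (names and defaults are hypothetical).
// Overlapping windows ensure criteria that straddle a chunk boundary are not lost.
function splitIntoChunks(
  text: string,
  chunkSize = 100_000,
  overlap = 2_000,
): string[] {
  if (text.length <= chunkSize) {
    return [text];
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) {
      break;
    }
    // Advance by chunkSize minus overlap so consecutive chunks share context.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Each chunk is then sent to the LLM independently, which is what makes the per-chunk calls parallelizable before the merge step.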

Pros:

  • Works with any document size
  • Comprehensive extraction (sees all content)
  • No external SDK dependencies

Cons:

  • Verbose output (includes redundant context)
  • More tokens consumed
  • Can include duplicate content from overlapping chunks

2. Claude Agent Mode (GRADING_GUIDANCE_EXTRACTION_MODE=agent)

How it works:

  1. Setup: Document placed in temporary directory
  2. Agent: Claude Agent SDK spawned with read-only tools (Grep, Read, Glob, LS)
  3. Search: Agent autonomously decides what keywords to search based on plugin context
  4. Extract: Agent greps for relevant sections, reads context, extracts PASS/FAIL
  5. Cleanup: Temporary directory deleted

Architecture:

Document → Temp Dir → Claude Agent (with tools)
                           │
                           ├─→ Grep("violence", "harm")  → Matches
                           ├─→ Read(line 50-100)         → Context
                           ├─→ Grep("FAIL", "PASS")      → Criteria
                           │
                           ↓
                      Agent reasons about results
                           ↓
                      Structured JSON output
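As a rough sketch, the per-plugin agent prompt could be assembled like this. The helper name, field names, and wording are hypothetical; the real agent is spawned via the Claude Agent SDK with read-only tools, as described above:

```typescript
// Hypothetical sketch of the per-plugin context and prompt assembly.
// Only the plugin ID, description, and grader rubric are provided; the agent
// decides its own search strategy from these (no hardcoded keyword hints).
interface PluginContext {
  id: string;          // e.g. "harmful:violent-crime"
  description: string; // from plugin metadata
  rubric: string;      // grader rubric from the plugin definition
}

function buildAgentPrompt(plugin: PluginContext, docPath: string): string {
  return [
    `Search ${docPath} for grading guidance relevant to plugin "${plugin.id}".`,
    `Plugin description: ${plugin.description}`,
    `Grader rubric: ${plugin.rubric}`,
    'Use Grep to locate relevant sections and Read to examine them,',
    'then extract explicit PASS/FAIL criteria as concise markdown.',
  ].join('\n');
}

// The agent itself would then be started with read-only tools only, e.g.:
// query({ prompt, options: { allowedTools: ['Grep', 'Read', 'Glob', 'LS'], cwd: tempDir } })
```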

Pros:

  • Focused, concise extraction
  • Lowest token usage (50% less than LLM mode)
  • Agent can iterate and search for synonyms
  • Clean PASS/FAIL format

Cons:

  • Requires Claude Agent SDK
  • May miss content if search terms don't match

3. OpenAI Agent Mode (GRADING_GUIDANCE_EXTRACTION_MODE=openai-agent)

How it works:

  1. Setup: Document placed in temporary directory
  2. Agent: OpenAI Agents SDK spawned with custom file tools
  3. Tools: grep_file, read_file_section, get_file_info
  4. Search: Agent autonomously searches using its tool calls
  5. Extract: Returns structured JSON with plugin → guidance mapping
  6. Cleanup: Temporary directory deleted

Architecture:

Document → Temp Dir → OpenAI Agent (with custom tools)
                           │
                           ├─→ get_file_info()                → File size
                           ├─→ grep_file("cybercrime")        → Line numbers
                           ├─→ read_file_section(35, 50)      → Text excerpt
                           │
                           ↓
                      Agent tool loop (up to 64 iterations)
                           ↓
                      JSON output with structured guidance
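The core of the grep_file tool might look like the standalone function below. This is a sketch: in the PR the logic is wrapped with the @openai/agents tool() helper and a Zod parameter schema, and exact defaults may differ. Comma-separated keywords are treated as an OR match:

```typescript
// Sketch of grep_file's core logic (standalone for clarity; hypothetical names).
// Comma-separated keywords are OR-matched, case-insensitively, against each line.
function grepFile(
  content: string,
  keywords: string, // e.g. "cybercrime,malware,hacking"
  maxMatches = 30,
): { line: number; text: string }[] {
  const terms = keywords
    .split(',')
    .map((k) => k.trim().toLowerCase())
    .filter(Boolean);
  const matches: { line: number; text: string }[] = [];
  const lines = content.split('\n');
  for (let i = 0; i < lines.length && matches.length < maxMatches; i++) {
    if (terms.some((t) => lines[i].toLowerCase().includes(t))) {
      // 1-based line numbers so the agent can feed them to read_file_section.
      matches.push({ line: i + 1, text: lines[i] });
    }
  }
  return matches;
}
```

Returning line numbers rather than raw excerpts is what lets the agent follow up with targeted read_file_section calls.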

Pros:

  • Structured JSON output format
  • Tool-based approach is debuggable
  • Good for complex documents

Cons:

  • More API calls (tool loop overhead)
  • Higher cost than Claude Agent
  • Requires OpenAI API key

Performance Comparison

Test: 6 plugins, ~5KB grading guidance document

| Metric | LLM Chunking | Claude Agent | OpenAI Agent |
|---|---|---|---|
| Extraction Time | 85.6s | 73s | 106.2s |
| Plugins Extracted | 6/6 | 6/6 | 6/6 |
| Total Chars Extracted | 6,642 | 2,240 | 2,569 |
| Avg Chars/Plugin | 1,107 | 373 | 428 |
| API Calls | 1 (per chunk) | ~10-20 (tools) | ~20-50 (tools) |

Cost Analysis

Approximate costs per extraction (6 plugins, typical document):

| Mode | Input Tokens | Output Tokens | Est. Cost |
|---|---|---|---|
| LLM Chunking | ~8,000 | ~2,500 | ~$0.03 |
| Claude Agent | ~6,000 | ~1,500 | ~$0.02 |
| OpenAI Agent | ~15,000 | ~3,000 | ~$0.05 |

Costs are approximate and depend on model pricing and document size


Quality Comparison

| Aspect | LLM Chunking | Claude Agent | OpenAI Agent |
|---|---|---|---|
| Extraction Accuracy | High | High | High |
| Output Conciseness | Verbose | Concise | Moderate |
| Duplication | Yes (overlap) | Minimal | Minimal |
| Format Consistency | Variable | Clean markdown | Structured JSON |
| Success Rate | 100% (6/6) | 100% (6/6) | 100% (6/6) |

Recommendation

| Use Case | Recommended Mode |
|---|---|
| Production (default) | LLM Chunking - no extra dependencies |
| Best quality | Claude Agent - fastest, lowest tokens, cleanest output |
| Structured output | OpenAI Agent - JSON format |
| Large documents (>50KB) | LLM Chunking - handles chunking automatically |

Dynamic Agent Extraction

The agent-based extraction is truly dynamic: agents autonomously decide what to search for based on plugin context, with zero hardcoded hints.

Agent receives only:

  • Plugin ID (e.g., harmful:violent-crime)
  • Plugin description (from metadata)
  • Grader rubric (from plugin definition)

Removed in this PR:

  • PLUGIN_KEYWORD_MAPPINGS - No pre-computed keywords
  • PLUGIN_EXTRACTION_HINTS - No hardcoded PASS/FAIL hints (~100 entries)
  • extractKeywordsFromPlugin() - No keyword extraction function

Files Changed

| File | Change |
|---|---|
| src/redteam/extraction/guidanceExtractor.ts | Dynamic extraction, 3 modes |
| src/redteam/commands/generate.ts | Mode toggle, fs.existsSync for paths |
| src/redteam/types.ts | gradingGuidance type definitions |
| src/redteam/plugins/base.ts | externalGradingGuidance config |
| src/validators/redteam.ts | Validation schema |
| src/providers/claude-agent-sdk.ts | Multi-path resolution for SDK loading |
| src/providers/openai/codex-sdk.ts | Multi-path resolution for SDK loading |
| site/static/config-schema.json | JSON schema regenerated |

Test Commands

```bash
# Claude Agent Mode (recommended for quality)
GRADING_GUIDANCE_EXTRACTION_MODE=agent npm run local -- redteam generate \
  -c config.yaml -o output.yaml --env-file .env

# OpenAI Agent Mode
GRADING_GUIDANCE_EXTRACTION_MODE=openai-agent npm run local -- redteam generate \
  -c config.yaml -o output.yaml --env-file .env

# LLM Mode (default, no extra dependencies)
npm run local -- redteam generate -c config.yaml -o output.yaml --env-file .env
```

Security Fixes

This PR addresses security review feedback:

  • Changed permission_mode: 'bypassPermissions' to permission_mode: 'dontAsk', with security comments
  • Added fs.existsSync() for path detection (prevents false positives like "This guidance is for test.txt files")
  • Replaced non-null assertions with proper TypeScript type narrowing
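A minimal sketch of the hardened path detection, assuming an extension check plus an existence check (the function name and accepted extensions are illustrative):

```typescript
import * as fs from 'fs';

// Sketch: a guidance value is only treated as a file reference if it both looks
// like a path AND exists on disk. Inline guidance that merely mentions a file
// name ("This guidance is for test.txt files") is kept as literal text.
function isGuidanceFilePath(value: string): boolean {
  const candidate = value.trim();
  if (!/\.(md|txt|pdf)$/i.test(candidate)) {
    return false;
  }
  return fs.existsSync(candidate);
}
```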

Bug Fix: SDK Resolution

Fixed dynamic import issue for Claude Agent SDK and OpenAI Codex SDK when config file is outside the promptfoo directory:

  • SDK resolution now tries multiple base paths: config directory → current working directory → promptfoo package root
  • This allows configs in /tmp/ or other external directories to still resolve SDKs installed in promptfoo's node_modules
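The fallback order can be sketched with an injected resolver. The names here are illustrative; the real code dynamically imports the SDK against each base path in turn:

```typescript
// Sketch of multi-base-path SDK resolution: try each candidate base path in
// priority order (config dir -> cwd -> package root) and return the first hit.
function resolveSdkPath(
  sdkName: string,
  basePaths: (string | undefined)[],
  resolve: (name: string, basePath: string) => string | null,
): string | null {
  for (const base of basePaths) {
    if (!base) continue;
    const resolved = resolve(sdkName, base);
    if (resolved) return resolved;
  }
  return null;
}
```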

Adds LLM-powered semantic extraction for file-based grading guidance.
Instead of injecting the full guidance document into every plugin,
this extracts only the relevant portions for each plugin.

Key changes:
- New guidanceExtractor.ts with semantic matching logic
- Chunking support for large documents (>100K chars)
- Parallel processing of chunks with smart merging
- Full grader rubric passed to LLM for better context

Set SMART_GRADING_GUIDANCE_EXTRACTION=false for legacy behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@promptfoo-scanner promptfoo-scanner Bot left a comment


I reviewed the new grading guidance extraction feature and found a high-severity prompt injection vulnerability. The feature loads external guidance documents and injects them into grading prompts with "HIGHEST PRIORITY" status, but malicious instructions in these documents could manipulate security evaluation outcomes.

Minimum severity threshold: 🟡 Medium | To re-scan after changes, comment @promptfoo-scanner
Learn more

Comment thread src/redteam/extraction/guidanceExtractor.ts
Start the LLM extraction early and run it in PARALLEL with test
synthesis. The extraction results are awaited only after synthesis
completes, right before writing the output config.

This means total time = max(synthesis, extraction) instead of sum.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@yash2998chhabria yash2998chhabria force-pushed the feat/smart-grading-guidance-extraction branch from 3e79b8b to 41bc9c2 on January 23, 2026 02:41
yash2998chhabria and others added 11 commits January 23, 2026 12:52
Add alternative extraction approach using Claude Agent SDK with Grep/Read
tools, toggled via GRADING_GUIDANCE_EXTRACTION_MODE environment variable.

Two extraction modes:
- "llm" (default): LLM-based chunking approach - exhaustive, comprehensive
- "agent": Claude Agent SDK with tools - focused, concise extractions

Usage:
  # Default LLM chunking
  npm run local -- redteam generate -c config.yaml

  # Agent-based with tools
  GRADING_GUIDANCE_EXTRACTION_MODE=agent npm run local -- redteam generate -c config.yaml

Agent approach benefits:
- More focused extractions (~3x shorter)
- Uses Grep to search, Read to examine relevant sections
- No document chunking needed

Trade-offs:
- Requires ANTHROPIC_API_KEY
- May need up to 100 turns for large documents
- LLM chunking more reliable for production use

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ance

Adds a third extraction mode using OpenAI Agents SDK alongside the existing
LLM chunking and Claude Agent SDK approaches. The mode is controlled via
GRADING_GUIDANCE_EXTRACTION_MODE environment variable:

- "llm" (default): LLM chunking approach - comprehensive, thorough
- "agent": Claude Agent SDK with Grep/Read tools - focused, concise
- "openai-agent": OpenAI Agents SDK with custom tools - alternative agent

Key implementation details:
- Custom tools created with @openai/agents tool() function and Zod schemas
- grep_file: searches document with comma-separated keyword OR matching
- read_file_section: reads specific line ranges
- get_file_info: returns document metadata
- Tools handle common agent behaviors like passing multiple keywords

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The OpenAI agent-based grading guidance extraction was timing out with
"Max turns (30) exceeded" when processing large documents. Increased
the default maxTurns to 100 to allow the agent sufficient iterations
to search and extract guidance from comprehensive documents like
Claude's constitution (~192KB).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key improvements:
- Increase context_lines default from 3 to 10 for broader passage capture
- Increase match limit from 15 to 30 to see more relevant sections
- Add explicit minimum extraction length requirement (300-500 chars)
- Add synonym search guidance (pii→privacy, competitors→other products)
- Add good vs bad extraction examples in prompt
- Emphasize thorough per-plugin searching with multiple keywords
- Update run command to reinforce comprehensive extraction

Results improved from 4/8 plugins (50%) to 7/8 plugins (87.5%)
with average extraction length nearly doubled (~230→~450 chars).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds file-based caching with 24-hour TTL for extraction results:
- Cache key based on document hash + plugin list + extraction mode
- In-memory session cache for fast repeated access
- File cache in ~/.promptfoo/cache/guidance-extraction/
- Avoids redundant API calls when re-running with same document

This significantly improves UX for iterative development where
users run generate multiple times with the same grading guidance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
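The cache key described in the commit above could be derived roughly as follows (names are illustrative; the commit only specifies that the key combines document hash, plugin list, and extraction mode):

```typescript
import { createHash } from 'crypto';

// Sketch: any change to the document, plugin set, or mode yields a new key,
// so stale extractions are never served. Plugin IDs are sorted so ordering
// in the config does not affect the key.
function guidanceCacheKey(
  document: string,
  pluginIds: string[],
  mode: string,
): string {
  return createHash('sha256')
    .update(document)
    .update([...pluginIds].sort().join(','))
    .update(mode)
    .digest('hex');
}
```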
…tion

Adds plugin-specific keyword mappings that translate plugin IDs to
the actual terminology commonly used in guidance documents:

- hijacking → jailbreak, prompt injection, role-play, bypass
- excessive-agency → agentic, autonomous, unilateral, oversight
- harmful:cybercrime → cyberweapon, malicious code, malware
- harmful:violent-crime → violence, weapon, bioweapon
- harmful:self-harm → suicide, self-harm, safe messaging
- And more...

Also improves agent prompts to be more thorough without overfitting
to specific character counts.

Results: All three modes now achieve 8/8 plugin coverage
(previously Claude Agent was 7/8, OpenAI Agent was 5/8).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PLUGIN_EXTRACTION_HINTS for plugin-specific search guidance
- Rewrite Claude Agent prompt with PURPOSE, EXTRACTION STRATEGY, QUALITY CRITERIA
- Rewrite OpenAI Agent prompt with structured extraction approach
- Add verification step to ensure extracted content enables pass/fail judgment
- Upgrade OpenAI Agent to use gpt-5-2025-08-07 model

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update prompts to explicitly request PASS/FAIL conditions
- Add structured extraction categories: FAIL CONDITIONS, PASS CONDITIONS, EXAMPLES
- Update PLUGIN_EXTRACTION_HINTS with specific pass/fail guidance per plugin
- Add search phrases for grading-specific terms

Significantly improves extraction quality - OpenAI agent now produces
structured "PASS if X; FAIL if Y" criteria aligned with grader rubrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The extracted grading guidance was being added to plugin.config AFTER
test cases were generated, so test metadata never received the guidance.
This caused graders to ignore the extracted guidance entirely.

Fix: After guidance extraction completes, update each test case's
pluginConfig metadata to include externalGradingGuidance from its plugin.

This ensures graders receive the extracted guidance with HIGHEST PRIORITY
per the existing base.ts logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
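The back-fill fix described in the commit above can be sketched as follows, with simplified, hypothetical types:

```typescript
// Sketch of the fix: after extraction completes, update each generated test
// case's metadata with the guidance extracted for its plugin, so graders
// actually receive it (shapes simplified relative to the real types).
interface TestCase {
  metadata: { pluginId: string; pluginConfig?: Record<string, unknown> };
}

function applyExtractedGuidance(
  tests: TestCase[],
  guidanceByPlugin: Map<string, string>,
): void {
  for (const test of tests) {
    const guidance = guidanceByPlugin.get(test.metadata.pluginId);
    if (guidance) {
      test.metadata.pluginConfig = {
        ...test.metadata.pluginConfig,
        externalGradingGuidance: guidance,
      };
    }
  }
}
```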
Remove hardcoded keyword mappings and extraction hints in favor of
letting the agent autonomously decide search strategy based on:
- Plugin ID and name
- Plugin description from metadata
- Grader rubric definition

Changes:
- Remove PLUGIN_KEYWORD_MAPPINGS constant (~10 entries)
- Remove PLUGIN_EXTRACTION_HINTS constant (~100 entries)
- Remove extractKeywordsFromPlugin() function
- Update agent prompts to encourage autonomous search decisions
- Fix OpenAI agent serialization bug ([object Object] output)

Benefits:
- Zero maintenance when new plugins are added
- ~150 fewer lines of code
- Same extraction quality in testing
- 50% fewer grading tokens vs LLM chunked method

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract duplicate normalization logic to normalizeExtractionResult()
- Add object serialization handling for Claude agent (consistency)
- Add PDF support for grading guidance file loading
- Regenerate JSON schema

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions Bot commented Jan 27, 2026

🔄 Security Review ✅

No issues in new changes.

The incremental changes only update the @anthropic-ai/claude-agent-sdk dependency from ^0.2.19 to ^0.2.20 in package.json and package-lock.json. This is a minor version bump of an optional dependency with no security implications.


Last updated: 2026-01-27T00:00:00Z | Reviewing: 3d45660


@promptfoo-scanner promptfoo-scanner Bot left a comment


I reviewed the grading guidance feature and found a high-severity prompt injection vulnerability. The feature loads user-provided documents and injects them into LLM grading prompts with "HIGHEST PRIORITY" designation, allowing malicious guidance to override security evaluation criteria. This could cause security violations to be incorrectly marked as passing.


Comment thread src/redteam/plugins/base.ts
@yash2998chhabria yash2998chhabria marked this pull request as draft January 27, 2026 15:37
@yash2998chhabria yash2998chhabria marked this pull request as ready for review January 27, 2026 15:37

coderabbitai Bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive external grading guidance feature for the redteam module. The changes add configuration schema support for a new gradingGuidance property that accepts inline text or file references; implement a multi-strategy extraction system that processes guidance documents and extracts plugin-specific guidance using LLM, Claude agent, or OpenAI agent approaches; integrate guidance extraction into the command generation workflow with caching and background processing; and incorporate extracted guidance into the plugin grading evaluation pipeline with highest priority. The implementation spans schema validation, extraction logic with chunking and deduplication, and integration points across configuration, type systems, and plugin grading flows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(redteam): add smart extraction for grading guidance' accurately and clearly summarizes the main change: adding intelligent extraction of grading guidance for the redteam feature. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 94.12%, which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description is comprehensive and directly related to the changeset. It explains the feature (smart extraction for grading guidance), the three extraction modes, architectural overview, performance comparisons, and all related file changes. |



@promptfoo-scanner promptfoo-scanner Bot left a comment


👍 All Clear

I reviewed this PR which adds grading guidance extraction functionality using LLMs and AI agents. The code processes user-provided guidance documents to extract relevant evaluation criteria for red-team plugins. I examined data flows from document inputs through LLM extraction to grading prompt injection, with particular attention to agent permissions and tool capabilities.




- Change permission_mode from 'bypassPermissions' to 'dontAsk' with detailed
  security comments explaining why this is safe (read-only tools, temp dir)
- Fix brittle path detection: use fs.existsSync() to verify file paths
  instead of just checking file extensions (avoids false positives like
  "This guidance is for test.txt files")
- Remove non-null assertions: assign guidanceText to const after truthy
  check for proper TypeScript type narrowing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@yash2998chhabria yash2998chhabria marked this pull request as draft January 27, 2026 15:52
yash2998chhabria and others added 2 commits January 27, 2026 08:31
Fix dynamic import issue for Claude Agent SDK and OpenAI Codex SDK when
config file is outside the promptfoo directory (e.g., /tmp/).

Changes:
- SDK resolution now tries multiple base paths in order:
  1. cliState.basePath (config directory)
  2. process.cwd() (current working directory)
  3. promptfoo package root (via getDirectory())
- Updated @anthropic-ai/claude-agent-sdk to 0.2.20

This allows configs in external directories to still resolve SDKs
installed in promptfoo's node_modules.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add ability to select the extraction mode via YAML config instead of
just environment variable:

```yaml
redteam:
  gradingGuidance:
    file: ./guidance.md
    extractionMode: claude_agent  # or openai_agent, openai_chunking
```

Three extraction modes:
- openai_chunking (default): LLM-based chunking, no extra deps
- openai_agent: OpenAI Agents SDK with custom tools
- claude_agent: Claude Agent SDK with read-only tools

Config takes precedence over GRADING_GUIDANCE_EXTRACTION_MODE env var.
Legacy env var values (llm, agent, openai-agent) still supported for
backwards compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
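The precedence and legacy-value mapping described in the commit above can be sketched like this (function and constant names are illustrative; the mode names follow the commit message):

```typescript
// Sketch: config-specified mode wins over the env var, and legacy env values
// (llm, agent, openai-agent) map onto the newer canonical mode names.
const LEGACY_MODES: Record<string, string> = {
  llm: 'openai_chunking',
  agent: 'claude_agent',
  'openai-agent': 'openai_agent',
};

function resolveExtractionMode(configMode?: string, envMode?: string): string {
  const raw = configMode ?? envMode ?? 'openai_chunking';
  return LEGACY_MODES[raw] ?? raw;
}
```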
@yash2998chhabria yash2998chhabria marked this pull request as ready for review January 27, 2026 17:10
yash2998chhabria and others added 2 commits January 27, 2026 09:12
…idance-extraction

# Conflicts:
#	package-lock.json
#	package.json
Update @anthropic-ai/claude-agent-sdk from ^0.2.19 to ^0.2.20 in
optionalDependencies to match devDependencies and fix CI consistency check.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>