Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.
Updated May 5, 2026 · HTML
Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/
AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.
Scenario Testing for AI Agents
A reasoning benchmark runner for comparing LLMs the way OpenClaw agents use them: 52 prompts, 3 eval sets, 11 traps, LLM-as-judge scoring, and a tier-based leaderboard.
Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust
Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.
Score how similar N agent outputs are — exact match, Jaccard token overlap, divergence point, composite 0-1 score. Stdlib-only.
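The scorers this entry names are straightforward to sketch with the standard library alone. Below is a minimal, hypothetical illustration (function names and the composite weighting are assumptions, not the repo's actual API): exact match short-circuits to 1.0, Jaccard overlap compares token sets, and the divergence point finds the first differing token.

```python
# Hypothetical stdlib-only sketch of similarity scoring for N agent outputs.
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-token sets, in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # two empty outputs are identical
    return len(ta & tb) / len(ta | tb)

def divergence_point(a: str, b: str) -> int:
    """Index of the first token position where the two outputs differ."""
    wa, wb = a.split(), b.split()
    for i, (x, y) in enumerate(zip(wa, wb)):
        if x != y:
            return i
    return min(len(wa), len(wb))  # one output is a prefix of the other

def similarity_score(outputs: list[str]) -> float:
    """Composite 0-1 score: 1.0 on exact match, else mean pairwise Jaccard."""
    if len(set(outputs)) == 1:
        return 1.0
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, `similarity_score(["a b c", "a b c"])` returns 1.0, while `jaccard("a b", "b c")` returns 1/3 (one shared token out of three distinct).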
Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.