Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.
Updated May 5, 2026 · HTML
Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/
AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.
Scenario Testing for AI Agents
A reasoning benchmark runner for comparing LLMs the way OpenClaw agents use them: 52 prompts, 3 eval sets, 11 traps, LLM-as-judge scoring, and a tier-based leaderboard.
Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust
Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.
Score how similar N agent outputs are — exact match, Jaccard token overlap, divergence point, composite 0-1 score. Stdlib-only.
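The scorers this entry names are straightforward to sketch with the standard library alone. Below is a minimal, hypothetical illustration (function names and the composite weighting are assumptions, not the repo's actual API): exact match short-circuits to 1.0, Jaccard overlap compares token sets, and the divergence point finds the first differing token.

```python
# Hypothetical stdlib-only sketch of similarity scoring for N agent outputs.
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-token sets, in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # two empty outputs are identical
    return len(ta & tb) / len(ta | tb)

def divergence_point(a: str, b: str) -> int:
    """Index of the first token position where the two outputs differ."""
    wa, wb = a.split(), b.split()
    for i, (x, y) in enumerate(zip(wa, wb)):
        if x != y:
            return i
    return min(len(wa), len(wb))  # one output is a prefix of the other

def similarity_score(outputs: list[str]) -> float:
    """Composite 0-1 score: 1.0 on exact match, else mean pairwise Jaccard."""
    if len(set(outputs)) == 1:
        return 1.0
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, `similarity_score(["a b c", "a b c"])` returns 1.0, while `jaccard("a b", "b c")` returns 1/3 (one shared token out of three distinct).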
Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.