Close the loop: six prompt-optimization algorithms, any LLM, any metric.
Part of the Future AGI open-source platform for making AI agents reliable.
Try Cloud (Free) · Docs · Colab · Blog · Discord · Discussions
Prompts are how ambiguity sneaks into an agent. You can tweak one by hand. You can't tweak a hundred, and you definitely can't re-tweak them every time the model behind them changes. agent-opt does the tweaking for you: pick an algorithm, pick a metric, feed it a dataset, and it returns a prompt that beats the one you wrote.
Six algorithms, one API. Plug in any LLM via LiteLLM. Score against any of the 50+ metrics from ai-evaluation, or write your own. Production traces feed back in as training data.
- **Six real algorithms, one API.** Not one toy loop with six labels: Random Search, Bayesian (Optuna), ProTeGi (textual gradients), Meta-Prompt, PromptWizard (mutate-critique-refine), and GEPA (evolutionary Pareto). Pick by problem shape.
- **Any model, any metric.** LiteLLM under the hood, so OpenAI, Anthropic, Gemini, Bedrock, Azure, Groq, and self-hosted all just work. Score with BLEU, ROUGE, embedding similarity, LLM-as-judge, or any of the 50+ metrics in ai-evaluation.
- **Trace-driven.** Optimize against production traces captured by traceAI.
```bash
pip install agent-opt
```

Requirements: Python ≥ 3.10 · ai-evaluation ≥ 0.2.2 · litellm ≥ 1.80 · optuna ≥ 3.6 · gepa ≥ 0.0.17
Optimize a RAG prompt against BLEU in 60 seconds.
```python
from fi.opt.optimizers import BayesianSearchOptimizer
from fi.opt.datamappers import BasicDataMapper
from fi.opt.base.evaluator import Evaluator
from fi.evals.metrics import BLEUScore

dataset = [
    {"context": "Paris is the capital of France.",
     "question": "What is the capital of France?", "answer": "Paris"},
    # ... more examples
]

evaluator = Evaluator(BLEUScore())
mapper = BasicDataMapper(key_map={
    "response": "generated_output",
    "expected_response": "answer",
})

optimizer = BayesianSearchOptimizer(
    inference_model_name="gpt-4o-mini",
    teacher_model_name="gpt-4o",
    n_trials=10,
)

result = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=mapper,
    dataset=dataset,
    initial_prompts=["Given the context: {context}, answer: {question}"],
)

print(f"Best score: {result.final_score:.4f}")
print(f"Best prompt: {result.best_generator.get_prompt_template()}")
```

Full walkthrough: examples/FutureAGI_Agent_Optimizer.ipynb · Open in Colab
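Datasets are plain lists of dicts, so you can build one from whatever you already log. A minimal sketch parsing JSONL lines into that shape (the helper name and example file are ours, not part of the library):

```python
import json

def rows_from_jsonl_lines(lines) -> list[dict]:
    # One JSON object per line -> the list-of-dicts shape optimize() expects.
    return [json.loads(line) for line in lines if line.strip()]

# e.g. dataset = rows_from_jsonl_lines(open("rag_eval_set.jsonl", encoding="utf-8"))
# Each row should carry every field your prompt template and key_map reference:
# {"context": "...", "question": "...", "answer": "..."}
```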
Each algorithm is a drop-in optimize() call. Swap without touching your dataset, evaluator, or data mapper.
| Algorithm | Best for | Key idea |
|---|---|---|
| Random Search | Baselines and sanity checks | Random prompt variations around a seed |
| Bayesian Search | Few-shot example selection | Optuna TPE over example subsets and ordering |
| ProTeGi | Iterative refinement | Textual gradients from error analysis, beam-searched |
| Meta-Prompt | Teacher-model rewrites | Strong teacher analyzes failures, rewrites the prompt |
| PromptWizard | Multi-stage pipelines | Mutate → critique → refine, N rounds |
| GEPA | Complex solution spaces | Genetic Pareto evolution across multiple objectives |
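However the candidates are proposed, every row above closes the same loop: generate candidate prompts, score them against the dataset, keep the best. A toy sketch of that shape in plain Python (not the library's internals), with a trivial placeholder check standing in for a real metric:

```python
def score(prompt: str, dataset: list[dict]) -> float:
    # Stand-in metric: reward prompts that keep every placeholder the rows need.
    needed = {"{context}", "{question}"}
    hits = sum(all(p in prompt for p in needed) for _ in dataset)
    return hits / max(len(dataset), 1)

def search(seed_prompt: str, variations: list[str], dataset: list[dict]) -> str:
    # The skeleton every optimizer shares: evaluate candidates, keep the argmax.
    candidates = [seed_prompt] + variations
    return max(candidates, key=lambda p: score(p, dataset))

dataset = [{"context": "...", "question": "..."}]
best = search(
    "Answer: {question}",
    ["Given {context}, answer {question} concisely.", "{question}?"],
    dataset,
)
# best is the variation that preserves both placeholders
```

The real optimizers differ in how `variations` are produced (random mutation, Optuna trials, textual gradients, teacher rewrites, evolutionary crossover), not in this outer loop.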
Quick snippets for each
```python
from fi.opt.optimizers import (
    RandomSearchOptimizer, BayesianSearchOptimizer,
    ProTeGi, MetaPromptOptimizer,
    PromptWizardOptimizer, GEPAOptimizer,
)
from fi.opt.generators import LiteLLMGenerator

teacher = LiteLLMGenerator(model="gpt-4o", prompt_template="{prompt}")

# Random — fastest baseline
RandomSearchOptimizer(generator=teacher, teacher_model="gpt-4o", num_variations=5)

# Bayesian — few-shot selection via Optuna
BayesianSearchOptimizer(min_examples=2, max_examples=8, n_trials=20,
                        inference_model_name="gpt-4o-mini", teacher_model_name="gpt-4o")

# ProTeGi — textual gradient refinement
ProTeGi(teacher_generator=teacher, num_gradients=4, beam_size=4)

# Meta-Prompt — teacher-driven rewrites
MetaPromptOptimizer(teacher_generator=teacher, num_rounds=5)

# PromptWizard — mutate / critique / refine
PromptWizardOptimizer(teacher_generator=teacher, mutate_rounds=3, refine_iterations=2)

# GEPA — evolutionary Pareto
GEPAOptimizer(reflection_model="gpt-5", generator_model="gpt-4o-mini")
```

Execute a prompt, return a response. LiteLLMGenerator works with every LiteLLM-supported provider.
```python
from fi.opt.generators import LiteLLMGenerator

generator = LiteLLMGenerator(
    model="gpt-4o-mini",
    prompt_template="Summarize this text: {text}",
)
```

Score a generated output. Three flavors (heuristic, LLM-as-judge, and the Future AGI platform's pre-built templates), all behind one Evaluator API.
```python
from fi.opt.base.evaluator import Evaluator

# Heuristic
from fi.evals.metrics import BLEUScore
evaluator = Evaluator(BLEUScore())

# LLM-as-judge
from fi.evals.llm import LiteLLMProvider
from fi.evals.metrics import CustomLLMJudge

judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "correctness_judge",
        "grading_criteria": (
            "Score 1.0 if 'response' is semantically equivalent to "
            "'expected_response'. 0.0 if incorrect. Partial credit OK."
        ),
    },
    model="gemini/gemini-2.5-flash",
    temperature=0.4,
)
evaluator = Evaluator(metric=judge)

# Future AGI platform — 50+ pre-built templates
evaluator = Evaluator(
    eval_template="summary_quality",
    eval_model_name="turing_flash",
    fi_api_key="...", fi_secret_key="...",
)
```

Translate your dataset's shape into the keys the evaluator expects.
```python
from fi.opt.datamappers import BasicDataMapper

mapper = BasicDataMapper(key_map={
    "output": "generated_output",  # from the generator
    "input": "question",           # from the dataset row
    "ground_truth": "answer",      # from the dataset row
})
```

Need a metric the library doesn't ship? Subclass BaseMetric:

```python
from fi.evals.metrics.base_metric import BaseMetric

class ExactMatchWithNormalization(BaseMetric):
    @property
    def metric_name(self):
        return "exact_match_norm"

    def compute_one(self, inputs):
        return float(inputs["response"].strip().lower()
                     == inputs["expected_response"].strip().lower())
```

Control how few-shot examples are folded into the candidate prompt with a custom prompt builder:

```python
def builder(base_prompt: str, few_shot: list[str]) -> str:
    return f"{base_prompt}\n\nExamples:\n" + "\n\n".join(few_shot)

BayesianSearchOptimizer(prompt_builder=builder, ...)
```

Turn on logging to follow each trial:

```python
from fi.opt.utils import setup_logging
import logging

setup_logging(level=logging.INFO,
              log_to_console=True, log_to_file=True,
              log_file="optimization.log")
```

Provider keys go in environment variables:

```bash
export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."   # if using Gemini
export FI_API_KEY="..."       # for Future AGI platform evaluators
export FI_SECRET_KEY="..."
```

simulate → evaluate → control → monitor → optimize. This SDK is the optimize step.

- traceAI captures production traces of every LLM call.
- ai-evaluation scores them with 50+ metrics.
- agent-opt turns those scored traces into a better prompt.
- The Agent Command Center ships the new prompt behind an OpenAI-compatible endpoint.
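Closing the loop from production is mostly reshaping: pull the fields a scored trace carries into a dataset row. A sketch under a hypothetical trace schema (this is not the actual traceAI record format; map whatever your tracing records):

```python
def trace_to_row(trace: dict) -> dict:
    # Hypothetical trace fields -> the row shape optimize() consumes.
    return {
        "question": trace["inputs"]["question"],
        "context": trace["inputs"].get("context", ""),
        "answer": trace["expected"],  # e.g. a human label or gold reference
    }

# dataset = [trace_to_row(t) for t in scored_traces if t["score"] < 0.5]
# Keeping only low-scoring traces focuses optimization on current failure modes.
```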
Use one SDK or all of them. Each is independently packaged and Apache 2.0-licensed.
```
src/fi/opt/
├── base/        # Abstract base classes (Evaluator, Optimizer, …)
├── datamappers/ # Dataset-shape → evaluator-key translators
├── generators/  # LiteLLM-backed LLM callers
├── optimizers/  # Random, Bayesian, ProTeGi, Meta-Prompt, PromptWizard, GEPA
├── utils/       # Logging, IO, small helpers
└── types.py     # Shared type defs
```
Roadmap stages: Shipped · In progress · Coming up · Exploring.
Bug fixes, new algorithms, new metrics, docs, examples: all welcome.
- Browse the `good first issue` label.
- Read the main repo Contributing Guide: same CLA, same workflow.
- Say hi on Discord or Discussions.
| Channel | Use it for |
|---|---|
| 💬 Discord | Real-time help from the team and community |
| 🗨️ GitHub Discussions | Ideas, questions, roadmap input |
| 📝 Blog | Engineering & research posts |
| 📧 support@futureagi.com | Cloud account / billing |
| 🔐 security@futureagi.com | Private vulnerability disclosure — see SECURITY.md |
Licensed under the Apache License 2.0. See LICENSE and NOTICE.
Part of the Future AGI open-source ecosystem.
Built by the Future AGI team and contributors.
If agent-opt helps you ship better agents, a ⭐ helps more teams find us.
🌐 futureagi.com · 📖 docs.futureagi.com · ☁️ app.futureagi.com