
Conversation

@YihengWang828
Collaborator

Adds the initial agent evaluation framework with a SmolAgents-based example and the BrowseCompZH dataset. Introduces the agent runner, the agent base class and record structures, and the dataset implementation, along with package exports, to demonstrate the end-to-end evaluation workflow and trajectory/score handling.

@gemini-code-assist

Summary of Changes

Hello @YihengWang828, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for a robust agent evaluation system. It provides the necessary infrastructure to define, run, and assess AI agents against various datasets, starting with a SmolAgents integration and the BrowseCompZH dataset. The changes enable a complete pipeline for evaluating agent performance, capturing detailed execution trajectories, and generating aggregate scores.

Highlights

  • Agent Evaluation Framework: Introduces a foundational framework for evaluating AI agents, including core components for running, recording, and scoring agent performance.
  • SmolAgents Integration: Integrates a SmolAgents-based example, demonstrating how to use the framework with existing agent implementations and tool-calling capabilities.
  • BrowseCompZH Dataset: Adds the BrowseCompZH dataset, a Chinese browsing comprehension benchmark, complete with LLM-based judging for evaluation.
  • End-to-End Workflow: Establishes an end-to-end evaluation workflow, covering agent execution, trajectory storage, and result aggregation for comprehensive analysis.
  • Modular Design: Implements a modular design with abstract base classes for agents and standardized evaluation samples, promoting extensibility and reusability.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new agent evaluation framework, including the agent runner, base classes, record-keeping structures, and an example agent implementation. The changes are extensive and lay a good foundation. My review focuses on improving correctness, security, and maintainability. Key feedback includes addressing a critical security vulnerability in a custom decryption function, fixing type mismatches, improving the robustness of LLM response parsing, and refactoring duplicated or brittle logic. Overall, this is a solid contribution that will be even better with these refinements.

Comment on lines +16 to +27
def _derive_key(password: str, length: int) -> bytes:
    hasher = hashlib.sha256()
    hasher.update(password.encode())
    key = hasher.digest()
    return key * (length // len(key)) + key[: length % len(key)]


def _decrypt(ciphertext_b64: str, password: str) -> str:
    encrypted = base64.b64decode(ciphertext_b64)
    key = _derive_key(password, len(encrypted))
    decrypted = bytes(a ^ b for a, b in zip(encrypted, key))
    return decrypted.decode()


critical

The custom encryption implemented in _derive_key and _decrypt is not cryptographically secure. The key derivation simply repeats a hash, and using a simple XOR with a reused key (keystream) is vulnerable to attacks. If the intent is strong encryption, it is highly recommended to use standard, well-vetted cryptographic libraries and algorithms, such as AES with a proper key derivation function like PBKDF2 (hashlib.pbkdf2_hmac). If the goal is merely obfuscation, this should be explicitly stated in a comment to avoid misunderstanding its security level.
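
If strong encryption is genuinely required, a minimal sketch of the suggested direction is below. It assumes the payload would be re-encoded as base64(salt || nonce || AES-GCM ciphertext), uses hashlib.pbkdf2_hmac for key derivation, and relies on the third-party cryptography package; none of this matches the current on-disk format, so it is illustrative only.

import base64
import hashlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def _derive_key(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 with a per-message salt and a high iteration count,
    # instead of repeating a single SHA-256 digest to the ciphertext length.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000, dklen=32)


def _decrypt(ciphertext_b64: str, password: str) -> str:
    # Assumed layout for this sketch: 16-byte salt, 12-byte nonce, then ciphertext+tag.
    blob = base64.b64decode(ciphertext_b64)
    salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
    key = _derive_key(password, salt)
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()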

Comment on lines +191 to +195
judge_kwargs={
    "model": getattr(args, "judge", None),
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "api_base": os.environ.get("OPENAI_API_BASE", ""),
},


high

The judge_kwargs dictionary is populated with API keys from environment variables using os.environ.get(..., ""), which defaults to an empty string if the variable is not set. If a judge model is specified and requires an API key, this could lead to an authentication error much later in the execution. It's better to check if the required keys (OPENAI_API_KEY) are present when a judge is used and fail early with a clear error message.
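
A minimal fail-fast check, sketched against the names in the snippet above (assuming the judge is only required when args.judge is set):

judge_model = getattr(args, "judge", None)
api_key = os.environ.get("OPENAI_API_KEY")
if judge_model and not api_key:
    # Fail before any samples are run rather than deep inside judging.
    raise SystemExit("A judge model was specified but OPENAI_API_KEY is not set.")

judge_kwargs = {
    "model": judge_model,
    "api_key": api_key or "",
    "api_base": os.environ.get("OPENAI_API_BASE", ""),
}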

Comment on lines +35 to +36
elif os.path.exists(item):
    images.append(Image.open(item))


high

Loading an image from a file path specified in the input data (elif os.path.exists(item): images.append(Image.open(item))) can be a security risk. If the path is not properly sanitized, it could allow for path traversal attacks, enabling an attacker to read arbitrary files from the local filesystem. It's recommended to validate that the path is within an expected base directory.
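
One hedged sketch of such a check, assuming image files are expected under a dedicated assets directory (ALLOWED_IMAGE_ROOT is hypothetical and would need to match the dataset layout; Path.is_relative_to requires Python 3.9+):

from pathlib import Path

from PIL import Image

# Hypothetical base directory for sample images.
ALLOWED_IMAGE_ROOT = Path("data/images").resolve()


def _load_image_if_allowed(item: str):
    candidate = Path(item).resolve()
    # Only open files that resolve to a location inside the allowed root.
    if candidate.is_file() and candidate.is_relative_to(ALLOWED_IMAGE_ROOT):
        return Image.open(candidate)
    return None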


if message.role == "tool-response":
    info = message.content[0]["text"]
    step_result = StepResult(role="assistant", content="")


high

There is a type mismatch here. The StepResult constructor expects content to be of type Optional[List[Dict[str, Any]]], but an empty string "" is being passed. This will likely cause a TypeError or unexpected behavior later when this content attribute is accessed. It should be content=[] or content=None to match the expected type.

Suggested change
step_result = StepResult(role="assistant", content="")
step_result = StepResult(role="assistant", content=[])

        self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
    return self.judger

def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:


high

The return type hint for score_agent_sample is Dict[str, float], but the function actually returns a dictionary containing values of other types as well, such as strings (question, answer, judge_decision) and floats (accuracy). This mismatch is misleading and can cause issues with static analysis tools. The type hint should be updated to reflect the actual return type, for example, Dict[str, Any].

Suggested change
def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:
def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, Any]:

Comment on lines +128 to +131
for i in range(len(agent.memory.steps) - 1, -1, -1):
    if agent.memory.steps[i].error is None:
        last_step = agent.memory.steps[i]
        break


medium

This loop to find the last successful step can be written more concisely using a generator expression with next().

        last_step = next((step for step in reversed(agent.memory.steps) if step.error is None), None)

Comment on lines +145 to +156
unknown_function_idx = 0
if steps_skeleton:
    for step in steps_skeleton:
        for tc in step.tool_calling:
            if tc.tool_name == "unknown" and tc.tool_input == "unknown":
                if unknown_function_idx < len(function_calls):
                    tc.tool_name = [f["name"] for f in function_calls[unknown_function_idx]]
                    tc.tool_input = [
                        f["arguments"] for f in function_calls[unknown_function_idx]
                    ]
                    unknown_function_idx += 1
        eval_result.add_step(step)


medium

This block of code to patch tool_name and tool_input for 'unknown' tool calls is quite complex and appears brittle. It relies on the index unknown_function_idx to match tool calls from different sources, which could easily break if the order or number of items changes. This suggests a potential design issue in how the trajectory is being reconstructed. Consider refactoring to extract all necessary tool call information in a more direct and robust way, if the underlying smolagents library provides a mechanism for it.


eval_cached = store.load_eval(idx) if reuse else None
if eval_cached is not None:
    cached_score = eval_cached.get("score", eval_cached)


medium

The fallback value in eval_cached.get("score", eval_cached) is confusing. If eval_cached is a dictionary that doesn't contain a 'score' key, this expression will return the entire eval_cached dictionary, which is then assigned to cached_score. This could lead to unexpected behavior downstream. If score can be absent, it would be safer to handle it explicitly, for example by getting None and then checking for it.
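
A more explicit version of the same lookup, assuming a missing score should trigger re-scoring (the re-scoring call and its arguments are hypothetical, named after score_agent_sample above):

eval_cached = store.load_eval(idx) if reuse else None
cached_score = eval_cached.get("score") if eval_cached is not None else None
if cached_score is None:
    # No usable cached score: re-score the sample instead of silently passing
    # the entire cached record downstream.
    cached_score = dataset.score_agent_sample(idx, model_response, **judge_kwargs)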

Comment on lines +149 to +159
def _get_judger(self) -> _LLM:
    if self.judger is None:
        load_env()
        resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
        resolved_base = _normalize_base_url(
            self.api_base
            or os.environ.get("OPENAI_API_BASE", "")
            or os.environ.get("OPENAI_BASE_URL", "")
        )
        self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
    return self.judger


medium

The logic for initializing the _LLM judger is duplicated between the __init__ method and the _get_judger method. This can lead to inconsistencies if one is updated and the other is not. This logic should be extracted into a single helper method that both __init__ and _get_judger can call to ensure the judger is initialized consistently.
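
One way to remove the duplication, sketched against the names visible in this snippet (whether __init__ builds the judger eagerly or lazily is an assumption):

def _build_judger(self) -> _LLM:
    # Single place that resolves credentials and constructs the judge LLM.
    load_env()
    resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
    resolved_base = _normalize_base_url(
        self.api_base
        or os.environ.get("OPENAI_API_BASE", "")
        or os.environ.get("OPENAI_BASE_URL", "")
    )
    return _LLM(self.judge_model, resolved_key, resolved_base)


def _get_judger(self) -> _LLM:
    if self.judger is None:
        self.judger = self._build_judger()
    return self.judger

__init__ can then call the same helper, or simply leave self.judger as None and defer construction to _get_judger.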

Comment on lines +170 to +171
match = re.search(r"correct: (yes|no)", judge_response)
final_decision = match.group(1) if match else "no"


medium

Parsing the LLM's response using a regular expression (re.search) is brittle. The evaluation will fail if the model's output format deviates even slightly from the expected correct: (yes|no) pattern. A more robust approach is to instruct the LLM to return its judgment in a structured format like JSON. This would make parsing more reliable and less prone to errors from minor formatting changes.
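
A sketch of the structured alternative: instruct the judge to reply with a small JSON object and parse that, keeping the current regex only as a fallback (the prompt wording and the "correct" field name are assumptions):

import json
import re

JUDGE_FORMAT_INSTRUCTION = (
    'Respond with only a JSON object of the form {"correct": "yes"} or {"correct": "no"}.'
)


def parse_judge_decision(judge_response: str) -> str:
    try:
        decision = json.loads(judge_response).get("correct", "no")
        return decision if decision in ("yes", "no") else "no"
    except (json.JSONDecodeError, AttributeError):
        # Fall back to the existing regex if the model ignored the JSON format.
        match = re.search(r"correct:\s*(yes|no)", judge_response)
        return match.group(1) if match else "no"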
