
Conversation

@YihengWang828
Collaborator

Adds the initial agent evaluation framework with a SmolAgents-based example and the BrowseCompZH dataset. Introduces the agent runner, the agent base class and record structures, and the dataset implementation, along with package exports, to demonstrate the end-to-end evaluation workflow and trajectory/score handling.

@gemini-code-assist

Summary of Changes

Hello @YihengWang828, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for a robust agent evaluation system. It provides the necessary infrastructure to define, run, and assess AI agents against various datasets, starting with a SmolAgents integration and the BrowseCompZH dataset. The changes enable a complete pipeline for evaluating agent performance, capturing detailed execution trajectories, and generating aggregate scores.

Highlights

  • Agent Evaluation Framework: Introduces a foundational framework for evaluating AI agents, including core components for running, recording, and scoring agent performance.
  • SmolAgents Integration: Integrates a SmolAgents-based example, demonstrating how to use the framework with existing agent implementations and tool-calling capabilities.
  • BrowseCompZH Dataset: Adds the BrowseCompZH dataset, a Chinese browsing comprehension benchmark, complete with LLM-based judging for evaluation.
  • End-to-End Workflow: Establishes an end-to-end evaluation workflow, covering agent execution, trajectory storage, and result aggregation for comprehensive analysis.
  • Modular Design: Implements a modular design with abstract base classes for agents and standardized evaluation samples, promoting extensibility and reusability.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new agent evaluation framework, including the agent runner, base classes, record-keeping structures, and an example agent implementation. The changes are extensive and lay a good foundation. My review focuses on improving correctness, security, and maintainability. Key feedback includes addressing a critical security vulnerability in a custom decryption function, fixing type mismatches, improving the robustness of LLM response parsing, and refactoring duplicated or brittle logic. Overall, this is a solid contribution that will be even better with these refinements.

Comment on lines +16 to +27
def _derive_key(password: str, length: int) -> bytes:
    hasher = hashlib.sha256()
    hasher.update(password.encode())
    key = hasher.digest()
    return key * (length // len(key)) + key[: length % len(key)]


def _decrypt(ciphertext_b64: str, password: str) -> str:
    encrypted = base64.b64decode(ciphertext_b64)
    key = _derive_key(password, len(encrypted))
    decrypted = bytes(a ^ b for a, b in zip(encrypted, key))
    return decrypted.decode()


critical

The custom encryption implemented in _derive_key and _decrypt is not cryptographically secure. The key derivation simply repeats a hash, and using a simple XOR with a reused key (keystream) is vulnerable to attacks. If the intent is strong encryption, it is highly recommended to use standard, well-vetted cryptographic libraries and algorithms, such as AES with a proper key derivation function like PBKDF2 (hashlib.pbkdf2_hmac). If the goal is merely obfuscation, this should be explicitly stated in a comment to avoid misunderstanding its security level.
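
If strong encryption is genuinely required, a minimal sketch of the suggested direction is below. It assumes the payload would be re-encoded as base64(salt || nonce || AES-GCM ciphertext), uses hashlib.pbkdf2_hmac for key derivation, and relies on the third-party cryptography package; none of this matches the current on-disk format, so it is illustrative only.

import base64
import hashlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def _derive_key(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 with a per-message salt and a high iteration count,
    # instead of repeating a single SHA-256 digest to the ciphertext length.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000, dklen=32)


def _decrypt(ciphertext_b64: str, password: str) -> str:
    # Assumed layout for this sketch: 16-byte salt, 12-byte nonce, then ciphertext+tag.
    blob = base64.b64decode(ciphertext_b64)
    salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
    key = _derive_key(password, salt)
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()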

Comment on lines +191 to +195
judge_kwargs={
    "model": getattr(args, "judge", None),
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "api_base": os.environ.get("OPENAI_API_BASE", ""),
},


high

The judge_kwargs dictionary is populated with API keys from environment variables using os.environ.get(..., ""), which defaults to an empty string if the variable is not set. If a judge model is specified and requires an API key, this could lead to an authentication error much later in the execution. It's better to check if the required keys (OPENAI_API_KEY) are present when a judge is used and fail early with a clear error message.
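
A minimal fail-fast check, sketched against the names in the snippet above (assuming the judge is only required when args.judge is set):

judge_model = getattr(args, "judge", None)
api_key = os.environ.get("OPENAI_API_KEY")
if judge_model and not api_key:
    # Fail before any samples are run rather than deep inside judging.
    raise SystemExit("A judge model was specified but OPENAI_API_KEY is not set.")

judge_kwargs = {
    "model": judge_model,
    "api_key": api_key or "",
    "api_base": os.environ.get("OPENAI_API_BASE", ""),
}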

Comment on lines +35 to +36
elif os.path.exists(item):
    images.append(Image.open(item))


high

Loading an image from a file path specified in the input data (elif os.path.exists(item): images.append(Image.open(item))) can be a security risk. If the path is not properly sanitized, it could allow for path traversal attacks, enabling an attacker to read arbitrary files from the local filesystem. It's recommended to validate that the path is within an expected base directory.
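
One hedged sketch of such a check, assuming image files are expected under a dedicated assets directory (ALLOWED_IMAGE_ROOT is hypothetical and would need to match the dataset layout; Path.is_relative_to requires Python 3.9+):

from pathlib import Path

from PIL import Image

# Hypothetical base directory for sample images.
ALLOWED_IMAGE_ROOT = Path("data/images").resolve()


def _load_image_if_allowed(item: str):
    candidate = Path(item).resolve()
    # Only open files that resolve to a location inside the allowed root.
    if candidate.is_file() and candidate.is_relative_to(ALLOWED_IMAGE_ROOT):
        return Image.open(candidate)
    return None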


if message.role == "tool-response":
    info = message.content[0]["text"]
    step_result = StepResult(role="assistant", content="")


high

There is a type mismatch here. The StepResult constructor expects content to be of type Optional[List[Dict[str, Any]]], but an empty string "" is being passed. This will likely cause a TypeError or unexpected behavior later when this content attribute is accessed. It should be content=[] or content=None to match the expected type.

Suggested change
step_result = StepResult(role="assistant", content="")
step_result = StepResult(role="assistant", content=[])

        self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
    return self.judger

def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:


high

The return type hint for score_agent_sample is Dict[str, float], but the function actually returns a dictionary containing values of other types as well, such as strings (question, answer, judge_decision) and floats (accuracy). This mismatch is misleading and can cause issues with static analysis tools. The type hint should be updated to reflect the actual return type, for example, Dict[str, Any].

Suggested change
def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:
def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, Any]:

Comment on lines +128 to +131
for i in range(len(agent.memory.steps) - 1, -1, -1):
    if agent.memory.steps[i].error is None:
        last_step = agent.memory.steps[i]
        break


medium

This loop to find the last successful step can be written more concisely using a generator expression with next().

        last_step = next((step for step in reversed(agent.memory.steps) if step.error is None), None)

Comment on lines +145 to +156
unknown_function_idx = 0
if steps_skeleton:
    for step in steps_skeleton:
        for tc in step.tool_calling:
            if tc.tool_name == "unknown" and tc.tool_input == "unknown":
                if unknown_function_idx < len(function_calls):
                    tc.tool_name = [f["name"] for f in function_calls[unknown_function_idx]]
                    tc.tool_input = [
                        f["arguments"] for f in function_calls[unknown_function_idx]
                    ]
                    unknown_function_idx += 1
        eval_result.add_step(step)


medium

This block of code to patch tool_name and tool_input for 'unknown' tool calls is quite complex and appears brittle. It relies on the index unknown_function_idx to match tool calls from different sources, which could easily break if the order or number of items changes. This suggests a potential design issue in how the trajectory is being reconstructed. Consider refactoring to extract all necessary tool call information in a more direct and robust way, if the underlying smolagents library provides a mechanism for it.


eval_cached = store.load_eval(idx) if reuse else None
if eval_cached is not None:
    cached_score = eval_cached.get("score", eval_cached)


medium

The fallback value in eval_cached.get("score", eval_cached) is confusing. If eval_cached is a dictionary that doesn't contain a 'score' key, this expression will return the entire eval_cached dictionary, which is then assigned to cached_score. This could lead to unexpected behavior downstream. If score can be absent, it would be safer to handle it explicitly, for example by getting None and then checking for it.
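
A more explicit version of the same lookup, assuming a missing score should trigger re-scoring (the re-scoring call and its arguments are hypothetical, named after score_agent_sample above):

eval_cached = store.load_eval(idx) if reuse else None
cached_score = eval_cached.get("score") if eval_cached is not None else None
if cached_score is None:
    # No usable cached score: re-score the sample instead of silently passing
    # the entire cached record downstream.
    cached_score = dataset.score_agent_sample(idx, model_response, **judge_kwargs)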

Comment on lines +149 to +159
def _get_judger(self) -> _LLM:
    if self.judger is None:
        load_env()
        resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
        resolved_base = _normalize_base_url(
            self.api_base
            or os.environ.get("OPENAI_API_BASE", "")
            or os.environ.get("OPENAI_BASE_URL", "")
        )
        self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
    return self.judger


medium

The logic for initializing the _LLM judger is duplicated between the __init__ method and the _get_judger method. This can lead to inconsistencies if one is updated and the other is not. This logic should be extracted into a single helper method that both __init__ and _get_judger can call to ensure the judger is initialized consistently.
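
One way to remove the duplication, sketched against the names visible in this snippet (whether __init__ builds the judger eagerly or lazily is an assumption):

def _build_judger(self) -> _LLM:
    # Single place that resolves credentials and constructs the judge LLM.
    load_env()
    resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
    resolved_base = _normalize_base_url(
        self.api_base
        or os.environ.get("OPENAI_API_BASE", "")
        or os.environ.get("OPENAI_BASE_URL", "")
    )
    return _LLM(self.judge_model, resolved_key, resolved_base)


def _get_judger(self) -> _LLM:
    if self.judger is None:
        self.judger = self._build_judger()
    return self.judger

__init__ can then call the same helper, or simply leave self.judger as None and defer construction to _get_judger.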

Comment on lines +170 to +171
match = re.search(r"correct: (yes|no)", judge_response)
final_decision = match.group(1) if match else "no"


medium

Parsing the LLM's response using a regular expression (re.search) is brittle. The evaluation will fail if the model's output format deviates even slightly from the expected correct: (yes|no) pattern. A more robust approach is to instruct the LLM to return its judgment in a structured format like JSON. This would make parsing more reliable and less prone to errors from minor formatting changes.
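
A sketch of the structured alternative: instruct the judge to reply with a small JSON object and parse that, keeping the current regex only as a fallback (the prompt wording and the "correct" field name are assumptions):

import json
import re

JUDGE_FORMAT_INSTRUCTION = (
    'Respond with only a JSON object of the form {"correct": "yes"} or {"correct": "no"}.'
)


def parse_judge_decision(judge_response: str) -> str:
    try:
        decision = json.loads(judge_response).get("correct", "no")
        return decision if decision in ("yes", "no") else "no"
    except (json.JSONDecodeError, AttributeError):
        # Fall back to the existing regex if the model ignored the JSON format.
        match = re.search(r"correct:\s*(yes|no)", judge_response)
        return match.group(1) if match else "no"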
