construct the framework of agent-v1 #5
base: main
Conversation
Summary of Changes

Hello @YihengWang828, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request lays the groundwork for a robust agent evaluation system. It provides the necessary infrastructure to define, run, and assess AI agents against various datasets, starting with a SmolAgents integration and the BrowseCompZH dataset. The changes enable a complete pipeline for evaluating agent performance, capturing detailed execution trajectories, and generating aggregate scores.
Code Review
This pull request introduces a new agent evaluation framework, including the agent runner, base classes, record-keeping structures, and an example agent implementation. The changes are extensive and lay a good foundation. My review focuses on improving correctness, security, and maintainability. Key feedback includes addressing a critical security vulnerability in a custom decryption function, fixing type mismatches, improving the robustness of LLM response parsing, and refactoring duplicated or brittle logic. Overall, this is a solid contribution that will be even better with these refinements.
def _derive_key(password: str, length: int) -> bytes:
    hasher = hashlib.sha256()
    hasher.update(password.encode())
    key = hasher.digest()
    return key * (length // len(key)) + key[: length % len(key)]


def _decrypt(ciphertext_b64: str, password: str) -> str:
    encrypted = base64.b64decode(ciphertext_b64)
    key = _derive_key(password, len(encrypted))
    decrypted = bytes(a ^ b for a, b in zip(encrypted, key))
    return decrypted.decode()
The custom encryption implemented in _derive_key and _decrypt is not cryptographically secure. The key derivation simply repeats a hash, and using a simple XOR with a reused key (keystream) is vulnerable to attacks. If the intent is strong encryption, it is highly recommended to use standard, well-vetted cryptographic libraries and algorithms, such as AES with a proper key derivation function like PBKDF2 (hashlib.pbkdf2_hmac). If the goal is merely obfuscation, this should be explicitly stated in a comment to avoid misunderstanding its security level.
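If strong encryption really is the intent, a minimal sketch of a standards-based replacement could look like the following (illustration only, not code from this PR: it assumes a blob layout of salt || nonce || ciphertext, uses the standard library's hashlib.pbkdf2_hmac for key derivation and the third-party cryptography package for AES-GCM, and the salt size, nonce size, and iteration count are placeholder choices):

import base64
import hashlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

_PBKDF2_ITERATIONS = 600_000  # illustrative; pick a value appropriate for the threat model


def _derive_key(password: str, salt: bytes) -> bytes:
    # 32-byte key for AES-256-GCM, derived with a real KDF instead of a repeated SHA-256 digest
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, _PBKDF2_ITERATIONS, dklen=32)


def _decrypt(ciphertext_b64: str, password: str) -> str:
    # assumed layout: 16-byte salt || 12-byte nonce || AES-GCM ciphertext + tag
    blob = base64.b64decode(ciphertext_b64)
    salt, nonce, ct = blob[:16], blob[16:28], blob[28:]
    key = _derive_key(password, salt)
    return AESGCM(key).decrypt(nonce, ct, None).decode()

If the data only needs obfuscation, a one-line comment on the existing functions stating that would avoid readers assuming they provide real confidentiality.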
judge_kwargs={
    "model": getattr(args, "judge", None),
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "api_base": os.environ.get("OPENAI_API_BASE", ""),
},
The judge_kwargs dictionary is populated with API keys from environment variables using os.environ.get(..., ""), which defaults to an empty string if the variable is not set. If a judge model is specified and requires an API key, this could lead to an authentication error much later in the execution. It's better to check if the required keys (OPENAI_API_KEY) are present when a judge is used and fail early with a clear error message.
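A small fail-fast check along these lines would surface the problem at startup rather than mid-run (sketch only; the argument name judge comes from the snippet above, and the choice of exception is an assumption):

judge_model = getattr(args, "judge", None)
if judge_model and not os.environ.get("OPENAI_API_KEY"):
    # abort before any samples run instead of hitting an auth error during evaluation
    raise SystemExit("OPENAI_API_KEY must be set when a judge model is specified")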
elif os.path.exists(item):
    images.append(Image.open(item))
Loading an image from a file path specified in the input data (elif os.path.exists(item): images.append(Image.open(item))) can be a security risk. If the path is not properly sanitized, it could allow for path traversal attacks, enabling an attacker to read arbitrary files from the local filesystem. It's recommended to validate that the path is within an expected base directory.
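One way to constrain the paths, assuming images are expected under a known directory (the base-directory name below is illustrative, not part of this PR):

import os

ALLOWED_IMAGE_DIR = "data/images"  # illustrative base directory


def _is_within_base(path: str, base: str = ALLOWED_IMAGE_DIR) -> bool:
    # resolve symlinks and ".." segments before comparing, so traversal cannot escape the base
    resolved = os.path.realpath(path)
    resolved_base = os.path.realpath(base)
    return os.path.commonpath([resolved, resolved_base]) == resolved_base

The existing branch could then become elif os.path.exists(item) and _is_within_base(item):.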
if message.role == "tool-response":
    info = message.content[0]["text"]
    step_result = StepResult(role="assistant", content="")
There is a type mismatch here. The StepResult constructor expects content to be of type Optional[List[Dict[str, Any]]], but an empty string "" is being passed. This will likely cause a TypeError or unexpected behavior later when this content attribute is accessed. It should be content=[] or content=None to match the expected type.
Suggested change:
-    step_result = StepResult(role="assistant", content="")
+    step_result = StepResult(role="assistant", content=[])
self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
return self.judger


def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:
The return type hint for score_agent_sample is Dict[str, float], but the function actually returns a dictionary containing values of other types as well, such as strings (question, answer, judge_decision) and floats (accuracy). This mismatch is misleading and can cause issues with static analysis tools. The type hint should be updated to reflect the actual return type, for example, Dict[str, Any].
Suggested change:
-    def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, float]:
+    def score_agent_sample(self, idx: int, model_response: str, **judge_kwargs) -> Dict[str, Any]:
for i in range(len(agent.memory.steps) - 1, -1, -1):
    if agent.memory.steps[i].error is None:
        last_step = agent.memory.steps[i]
        break
unknown_function_idx = 0
if steps_skeleton:
    for step in steps_skeleton:
        for tc in step.tool_calling:
            if tc.tool_name == "unknown" and tc.tool_input == "unknown":
                if unknown_function_idx < len(function_calls):
                    tc.tool_name = [f["name"] for f in function_calls[unknown_function_idx]]
                    tc.tool_input = [
                        f["arguments"] for f in function_calls[unknown_function_idx]
                    ]
                    unknown_function_idx += 1
        eval_result.add_step(step)
This block of code to patch tool_name and tool_input for 'unknown' tool calls is quite complex and appears brittle. It relies on the index unknown_function_idx to match tool calls from different sources, which could easily break if the order or number of items changes. This suggests a potential design issue in how the trajectory is being reconstructed. Consider refactoring to extract all necessary tool call information in a more direct and robust way, if the underlying smolagents library provides a mechanism for it.
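Short of a deeper refactor, even making the pairing explicit would help. A sketch using only the names already visible in this hunk (how to react to a count mismatch is an assumed policy, not something this PR specifies):

# consume recorded function calls with an explicit iterator, so a count mismatch
# raises instead of silently leaving later tool calls as "unknown"
pending_calls = iter(function_calls)
for step in steps_skeleton or []:
    for tc in step.tool_calling:
        if tc.tool_name == "unknown" and tc.tool_input == "unknown":
            try:
                calls = next(pending_calls)
            except StopIteration:
                raise ValueError("more 'unknown' tool calls than recorded function calls") from None
            tc.tool_name = [f["name"] for f in calls]
            tc.tool_input = [f["arguments"] for f in calls]
    eval_result.add_step(step)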
eval_cached = store.load_eval(idx) if reuse else None
if eval_cached is not None:
    cached_score = eval_cached.get("score", eval_cached)
The fallback value in eval_cached.get("score", eval_cached) is confusing. If eval_cached is a dictionary that doesn't contain a 'score' key, this expression will return the entire eval_cached dictionary, which is then assigned to cached_score. This could lead to unexpected behavior downstream. If score can be absent, it would be safer to handle it explicitly, for example by getting None and then checking for it.
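For example, reusing the names from the hunk above (treating a missing score as a cache miss is an assumed policy):

cached_score = eval_cached.get("score")
if cached_score is None:
    # a cache entry without a score is treated as a cache miss here,
    # instead of silently passing the whole dict downstream as the score
    eval_cached = None

Whether a missing score should mean a cache miss or an error depends on how the surrounding code uses eval_cached.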
def _get_judger(self) -> _LLM:
    if self.judger is None:
        load_env()
        resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
        resolved_base = _normalize_base_url(
            self.api_base
            or os.environ.get("OPENAI_API_BASE", "")
            or os.environ.get("OPENAI_BASE_URL", "")
        )
        self.judger = _LLM(self.judge_model, resolved_key, resolved_base)
    return self.judger
The logic for initializing the _LLM judger is duplicated between the __init__ method and the _get_judger method. This can lead to inconsistencies if one is updated and the other is not. This logic should be extracted into a single helper method that both __init__ and _get_judger can call to ensure the judger is initialized consistently.
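One way to fold the two code paths together, reusing only names from the hunk above (the helper name _build_judger is made up for illustration):

def _build_judger(self) -> _LLM:
    # single place for env loading, key/base resolution, and _LLM construction
    load_env()
    resolved_key = self.api_key or os.environ.get("OPENAI_API_KEY", "")
    resolved_base = _normalize_base_url(
        self.api_base
        or os.environ.get("OPENAI_API_BASE", "")
        or os.environ.get("OPENAI_BASE_URL", "")
    )
    return _LLM(self.judge_model, resolved_key, resolved_base)


def _get_judger(self) -> _LLM:
    if self.judger is None:
        self.judger = self._build_judger()
    return self.judger

__init__ could then call the same helper, or simply leave self.judger as None and rely on _get_judger.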
match = re.search(r"correct: (yes|no)", judge_response)
final_decision = match.group(1) if match else "no"
Parsing the LLM's response using a regular expression (re.search) is brittle. The evaluation will fail if the model's output format deviates even slightly from the expected correct: (yes|no) pattern. A more robust approach is to instruct the LLM to return its judgment in a structured format like JSON. This would make parsing more reliable and less prone to errors from minor formatting changes.
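A minimal sketch of the JSON approach, keeping the current regex as a fallback (the "correct" field name and the prompt wording are assumptions, not part of this PR):

import json
import re

# the judge prompt would ask for something like:
#   respond only with JSON: {"correct": "yes" or "no", "reason": "..."}
try:
    verdict = json.loads(judge_response)
    final_decision = "yes" if str(verdict.get("correct", "no")).lower() == "yes" else "no"
except (json.JSONDecodeError, AttributeError):
    # judge ignored the requested format: fall back to the existing regex parsing
    match = re.search(r"correct: (yes|no)", judge_response)
    final_decision = match.group(1) if match else "no"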
Adds the initial agent evaluation framework with a SmolAgents-based example and the BrowseCompZH dataset. Introduces the agent runner, the agent base classes and record structures, and the dataset implementation, along with package exports, to demonstrate the end-to-end evaluation workflow and trajectory/score handling.