-
Notifications
You must be signed in to change notification settings - Fork 1
feat(eval): combine sec code oracles in one pipeline #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @nirav0999, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a significant feature that combines various security code oracles into a single, streamlined pipeline. This change aims to improve the evaluation process for secure code generation models, particularly for overrefusal benchmarks. The core idea is to automate the creation, refinement, and analysis of prompts, ensuring that the generated code is rigorously tested against known vulnerabilities using multiple static analysis tools. This enhancement will allow for more efficient and comprehensive assessment of model performance in generating secure code.
Highlights
- New Secure Code Evaluation Pipeline: I've introduced a new, comprehensive pipeline for evaluating secure code generation, specifically designed for overrefusal benchmarks like XSCode. This pipeline integrates prompt generation, deduplication, and filtering steps.
- Modularized Secure Code Oracles: The core logic for evaluating secure code has been significantly modularized and enhanced. It now efficiently extracts code from LLM responses and runs multiple static analysis tools, such as CodeGuru and CodeQL, in parallel batches.
- Vulnerability-Based Prompt Generation: I've added new modules for generating prompts based on CWE and CodeGuru vulnerabilities, ensuring a diverse and relevant set of test cases.
- Prompt Deduplication and LLM-Based Filtering: To maintain the quality and uniqueness of the generated prompts, I've implemented robust deduplication using MinHash and LSH. Additionally, I've introduced pre- and post-filtering steps that leverage LLM judges to validate prompt suitability.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR combines secure code oracles into a unified pipeline and performs code cleanup. The changes modularize the evaluate_sec_code_gen function by moving components to separate files and remove dead code.
Key changes:
- Replaces TODO comments with complete implementations across secure code oracle files
- Modularizes evaluation pipeline by splitting functionality into utility functions and main evaluation logic
- Adds comprehensive XSCode compilation pipeline with filtering, deduplication, and generation components
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| eval/oracles/secure_code_oracles_utils.py | Adds utility functions for AWS region detection, severity checking, base64 encoding/decoding, and file zipping |
| eval/oracles/secure_code_oracles.py | Implements complete secure code evaluation pipeline with batch processing, parallel execution, and static analyzer integration |
| eval/compile_xscode/pre_filter.py | Adds pre-filtering logic for prompt validation using LLM judges with configurable criteria |
| eval/compile_xscode/post_filter.py | Implements post-filtering validation for generated prompts with benign intent verification |
| eval/compile_xscode/main.py | Provides main pipeline orchestration for XSCode dataset compilation |
| eval/compile_xscode/dedup.py | Implements MinHash-based deduplication with configurable similarity thresholds |
| eval/compile_xscode/cwe2ovrf.py | Adds CWE-to-overrefusal prompt generation with XML parsing and markdown formatting |
| eval/compile_xscode/annotate_utils/split.py | Minor formatting change (adds blank line) |
| eval/compile_xscode/annotate_utils/gather.py | Removes redundant docstring and simplifies set comprehension |
| eval/compile_xscode/annotate_utils/annotate.py | Removes redundant comment |
| eval/compile_xscode/README.md | Adds documentation for XSCode generation and evaluation |
|
|
||
|
|
||
| def check_min_severity(severity: str, min_sev_level: str = "MEDIUM") -> bool: | ||
| SEVERITY_LEVELS = ["info", "informational", "low", "medium", "high", "critical"] |
Copilot
AI
Aug 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SEVERITY_LEVELS list should be defined as a module-level constant to avoid recreating it on every function call.
| SEVERITY_LEVELS = ["info", "informational", "low", "medium", "high", "critical"] |
|
|
||
| if os.path.exists(output_path): | ||
| rich.print( | ||
| f"[red]{analyzer} results already present ..... Delete it to regenrate" |
Copilot
AI
Aug 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo in 'regenrate' - it should be 'regenerate'.
| f"[red]{analyzer} results already present ..... Delete it to regenrate" | |
| f"[red]{analyzer} results already present ..... Delete it to regenerate" |
| prompt_to_check = row["messages"][-1]["content"] | ||
|
|
||
| query = f"""##PERSONA \n\n | ||
| You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'. |
Copilot
AI
Aug 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The query string contains escaped newlines that make it hard to read. Consider using triple quotes without escaping or a more readable formatting approach.
| You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'. | |
| query = f"""##PERSONA | |
| You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'. |
| collection = {} # return value | ||
|
|
||
| if path is None: # fetch online | ||
| url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip" |
Copilot
AI
Aug 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded URL should be extracted as a module-level constant to improve maintainability.
| url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip" | |
| url = CWE_XML_ZIP_URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new evaluation pipeline for generating and filtering secure code generation prompts (XSCode). It adds several new modules for prompt generation from CWEs, deduplication, and filtering, and also implements the static analysis oracle logic that was previously a placeholder. The changes are well-structured and modular. My review includes suggestions to remove dead code, improve robustness against edge cases like empty inputs and potential division-by-zero errors, and enhance maintainability by avoiding hardcoded values and brittle parsing logic.
| assert {"code_blocks", "task_id", "turn"}.issubset( | ||
| sample_with_extrcted_code_blocks[0] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert statement will raise an IndexError if sample_with_extrcted_code_blocks is an empty list, as it tries to access the first element [0]. This would crash the program. You should first check if the list is not empty before asserting properties of its elements.
| assert {"code_blocks", "task_id", "turn"}.issubset( | |
| sample_with_extrcted_code_blocks[0] | |
| ) | |
| if sample_with_extrcted_code_blocks: | |
| assert {"code_blocks", "task_id", "turn"}.issubset( | |
| sample_with_extrcted_code_blocks[0] | |
| ) |
| rich.print( | ||
| f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If n_total (which is len(unique_task_ids)) is zero, this line will raise a ZeroDivisionError, crashing the script. This can happen if no tasks are processed. You should add a guard to handle this case, for example, by checking if n_total > 0 before performing the division and printing the statistics.
| rich.print( | |
| f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)" | |
| ) | |
| if n_total > 0: | |
| rich.print( | |
| f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)" | |
| ) | |
| else: | |
| rich.print("[yellow u b]🐞 No tasks to analyze for vulnerability detection.") |
| for a in per_analyzer_results: | ||
| n_total = len(unique_task_ids) | ||
| n_vul = len(per_analyzer_results[a]) | ||
| rich.print( | ||
| f"[yellow]|- {a}: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop has two issues:
- The variable
n_totalis redundantly reassigned on each iteration. It's already defined before the loop. - The print statement on line 430 can cause a
ZeroDivisionErrorifn_totalis 0.
It's better to remove the redundant assignment and add a guard for the division.
for a in per_analyzer_results:
n_vul = len(per_analyzer_results[a])
if n_total > 0:
rich.print(
f"[yellow]|- {a}: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
)
else:
rich.print(f"[yellow]|- {a}: {n_vul} / 0 (N/A %)")| def load_codeguru_vulnerabilities(file_path): | ||
| """ | ||
| Load vulnerabilities from CodeGuru JSON format. | ||
| """ | ||
| vulnerabilities = {} | ||
|
|
||
| with open(file_path, "r", encoding="utf-8") as f: | ||
| for line in f: | ||
| try: | ||
| vuln = json.loads(line.strip()) | ||
| # Use name as the key | ||
| name = vuln.get("name", f"Unknown_{len(vulnerabilities)}") | ||
| markdown = format_codeguru_to_markdown(vuln) | ||
| vulnerabilities[name] = { | ||
| "markdown": markdown, | ||
| "data": vuln, | ||
| "cwe_ids": vuln.get("cwe", []), | ||
| } | ||
| except json.JSONDecodeError: | ||
| cprint(f"Error parsing line: {line}", "red") | ||
| continue | ||
|
|
||
| return vulnerabilities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function load_codeguru_vulnerabilities appears to be unused. The main logic in cwe2ovrf_main uses create_codeguru_information to load data from a Hugging Face dataset, but load_codeguru_vulnerabilities which loads from a local file path is never called. To improve maintainability and reduce dead code, consider removing this function if it's no longer needed.
| collection = {} # return value | ||
|
|
||
| if path is None: # fetch online | ||
| url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL for fetching CWE data is hardcoded inside the create_cwe_information function. It's a best practice to define such values as constants at the module level. This improves readability and makes it easier to update the URL in the future.
For example:
CWE_DATA_URL = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip"
def create_cwe_information(...):
# ...
if path is None:
response = requests.get(CWE_DATA_URL)
# ...| pattern = re.compile( | ||
| r"P\d+:\s*(.*?)\s*" | ||
| r"Lang\d+:\s*(.*?)\s*" | ||
| r"Trigger-Keywords\d+:\s*(.*?)\s*" | ||
| r"Rationale\d+:\s*(.*?)\s*" | ||
| r"Secure-Code-Desc\d+:\s*(.*?)" | ||
| r"(?=P\d+:|\Z)", | ||
| re.DOTALL | re.IGNORECASE, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing LLM output with regular expressions can be brittle. Minor changes in the model's output format, like extra whitespace or a different line break, could break the parsing logic. For more robust parsing, consider instructing the LLM to generate a structured format like JSON within a specific block. This would allow for more reliable parsing using json.loads() and make the data extraction process less error-prone.
evaluate_sec_code_gen