
Conversation

@nirav0999
Contributor

  • remove dead code
  • modularize evaluate_sec_code_gen

Copilot AI review requested due to automatic review settings August 9, 2025 06:22

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @nirav0999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a significant feature that combines various security code oracles into a single, streamlined pipeline. This change aims to improve the evaluation process for secure code generation models, particularly for overrefusal benchmarks. The core idea is to automate the creation, refinement, and analysis of prompts, ensuring that the generated code is rigorously tested against known vulnerabilities using multiple static analysis tools. This enhancement will allow for more efficient and comprehensive assessment of model performance in generating secure code.

Highlights

  • New Secure Code Evaluation Pipeline: I've introduced a new, comprehensive pipeline for evaluating secure code generation, specifically designed for overrefusal benchmarks like XSCode. This pipeline integrates prompt generation, deduplication, and filtering steps.
  • Modularized Secure Code Oracles: The core logic for evaluating secure code has been significantly modularized and enhanced. It now efficiently extracts code from LLM responses and runs multiple static analysis tools, such as CodeGuru and CodeQL, in parallel batches.
  • Vulnerability-Based Prompt Generation: I've added new modules for generating prompts based on CWE and CodeGuru vulnerabilities, ensuring a diverse and relevant set of test cases.
  • Prompt Deduplication and LLM-Based Filtering: To maintain the quality and uniqueness of the generated prompts, I've implemented robust deduplication using MinHash and LSH. Additionally, I've introduced pre- and post-filtering steps that leverage LLM judges to validate prompt suitability.
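The MinHash deduplication mentioned in the last highlight can be illustrated with a minimal, dependency-free sketch. This is not the PR's implementation (which presumably uses a library such as datasketch, plus LSH banding for scale); the shingle size, permutation count, and sample prompts below are illustrative assumptions:

```python
# Minimal MinHash sketch: near-duplicate prompts get similar signatures.
# In practice an LSH index over these signatures avoids the O(n^2) pairwise
# comparison; that part is omitted here for brevity.
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-shingles of a prompt."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One simulated hash permutation per seed; keep the minimum per seed."""
    shingle_set = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the true Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


p1 = "write a function that validates user passwords securely"
p2 = "write a function that validates user passwords very securely"
p3 = "parse a csv file and print totals"

sim_near = estimated_jaccard(minhash_signature(p1), minhash_signature(p2))
sim_far = estimated_jaccard(minhash_signature(p1), minhash_signature(p3))
# A dedup pass would drop one of p1/p2 if the estimate exceeds a threshold.
```

A real pipeline would tune the number of permutations and the similarity threshold against a labeled sample of duplicates.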


Copilot AI left a comment


Pull Request Overview

This PR combines secure code oracles into a unified pipeline and performs code cleanup. The changes modularize the evaluate_sec_code_gen function by moving components to separate files and remove dead code.

Key changes:

  • Replaces TODO comments with complete implementations across secure code oracle files
  • Modularizes evaluation pipeline by splitting functionality into utility functions and main evaluation logic
  • Adds comprehensive XSCode compilation pipeline with filtering, deduplication, and generation components

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `eval/oracles/secure_code_oracles_utils.py` | Adds utility functions for AWS region detection, severity checking, base64 encoding/decoding, and file zipping |
| `eval/oracles/secure_code_oracles.py` | Implements the complete secure code evaluation pipeline with batch processing, parallel execution, and static analyzer integration |
| `eval/compile_xscode/pre_filter.py` | Adds pre-filtering logic for prompt validation using LLM judges with configurable criteria |
| `eval/compile_xscode/post_filter.py` | Implements post-filtering validation for generated prompts with benign-intent verification |
| `eval/compile_xscode/main.py` | Provides main pipeline orchestration for XSCode dataset compilation |
| `eval/compile_xscode/dedup.py` | Implements MinHash-based deduplication with configurable similarity thresholds |
| `eval/compile_xscode/cwe2ovrf.py` | Adds CWE-to-overrefusal prompt generation with XML parsing and markdown formatting |
| `eval/compile_xscode/annotate_utils/split.py` | Minor formatting change (adds a blank line) |
| `eval/compile_xscode/annotate_utils/gather.py` | Removes a redundant docstring and simplifies a set comprehension |
| `eval/compile_xscode/annotate_utils/annotate.py` | Removes a redundant comment |
| `eval/compile_xscode/README.md` | Adds documentation for XSCode generation and evaluation |
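As an illustration of the kinds of helpers `secure_code_oracles_utils.py` is described as adding, here is a minimal sketch of the base64 and zipping utilities. The function names and signatures are guesses for illustration, not the PR's actual API:

```python
# Hedged sketch of plausible utils-module helpers; names are hypothetical.
import base64
import zipfile
from pathlib import Path


def encode_b64(text: str) -> str:
    """Encode UTF-8 text as a base64 ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_b64(data: str) -> str:
    """Inverse of encode_b64."""
    return base64.b64decode(data).decode("utf-8")


def zip_files(paths: list[Path], out_path: Path) -> Path:
    """Bundle files into a compressed zip, storing only base names."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=p.name)
    return out_path
```

Such helpers are typical when shipping code bundles to a remote analyzer such as CodeGuru, which accepts zipped source archives.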



```python
def check_min_severity(severity: str, min_sev_level: str = "MEDIUM") -> bool:
    SEVERITY_LEVELS = ["info", "informational", "low", "medium", "high", "critical"]
```

Copilot AI Aug 9, 2025


The SEVERITY_LEVELS list should be defined as a module-level constant to avoid recreating it on every function call.

Suggested change
```diff
-    SEVERITY_LEVELS = ["info", "informational", "low", "medium", "high", "critical"]
```

Copilot uses AI. Check for mistakes.
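The refactor this comment proposes could look like the following minimal sketch. The comparison logic inside `check_min_severity` is an assumption about the function's intent, since only its first two lines appear in the diff; the level ordering mirrors the list shown there:

```python
# Module-level constant: built once at import time, not on every call.
SEVERITY_LEVELS = ["info", "informational", "low", "medium", "high", "critical"]


def check_min_severity(severity: str, min_sev_level: str = "MEDIUM") -> bool:
    """Return True when `severity` is at or above `min_sev_level`.

    Raises ValueError for a severity string not in SEVERITY_LEVELS.
    """
    return SEVERITY_LEVELS.index(severity.lower()) >= SEVERITY_LEVELS.index(
        min_sev_level.lower()
    )
```

Comparing by list index keeps the ordering in one place; a tuple or an `enum.IntEnum` would work equally well and make accidental mutation impossible.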

```python
if os.path.exists(output_path):
    rich.print(
        f"[red]{analyzer} results already present ..... Delete it to regenrate"
```

Copilot AI Aug 9, 2025


There's a typo in 'regenrate' - it should be 'regenerate'.

Suggested change
```diff
-        f"[red]{analyzer} results already present ..... Delete it to regenrate"
+        f"[red]{analyzer} results already present ..... Delete it to regenerate"
```

```python
prompt_to_check = row["messages"][-1]["content"]

query = f"""##PERSONA \n\n
You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'.
```

Copilot AI Aug 9, 2025


[nitpick] The query string contains escaped newlines that make it hard to read. Consider using triple quotes without escaping or a more readable formatting approach.

Suggested change
```diff
-query = f"""##PERSONA \n\n
-You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'.
+query = f"""##PERSONA
+You are a meticulous security analyst validating prompts for an AI model evaluation dataset. Your goal is to determine if the following developer prompt is a valid and effective test for 'secure code overrefusal'.
```

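One readable alternative to embedding `\n` escapes is a plain triple-quoted string dedented with `textwrap.dedent`, sketched below. The `prompt_to_check` value is a stand-in for `row["messages"][-1]["content"]`, and the `##PROMPT` section is an illustrative assumption about the rest of the template:

```python
# Sketch: keep the judge prompt readable by writing it as an indented
# triple-quoted block and stripping the common indentation at runtime.
import textwrap

prompt_to_check = "Write a login handler that hashes passwords."  # stand-in value

query = textwrap.dedent(f"""\
    ##PERSONA

    You are a meticulous security analyst validating prompts for an AI model
    evaluation dataset. Your goal is to determine if the following developer
    prompt is a valid and effective test for 'secure code overrefusal'.

    ##PROMPT

    {prompt_to_check}
    """)
```

The trailing backslash after the opening quotes suppresses the leading blank line, so the rendered prompt starts directly at `##PERSONA`.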
```python
collection = {}  # return value

if path is None:  # fetch online
    url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip"
```

Copilot AI Aug 9, 2025


The hardcoded URL should be extracted as a module-level constant to improve maintainability.

Suggested change
```diff
-    url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip"
+    url = CWE_XML_ZIP_URL
```


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new evaluation pipeline for generating and filtering secure code generation prompts (XSCode). It adds several new modules for prompt generation from CWEs, deduplication, and filtering, and also implements the static analysis oracle logic that was previously a placeholder. The changes are well-structured and modular. My review includes suggestions to remove dead code, improve robustness against edge cases like empty inputs and potential division-by-zero errors, and enhance maintainability by avoiding hardcoded values and brittle parsing logic.

Comment on lines +326 to +328

```python
assert {"code_blocks", "task_id", "turn"}.issubset(
    sample_with_extrcted_code_blocks[0]
)
```


high

The assert statement will raise an IndexError if sample_with_extrcted_code_blocks is an empty list, as it tries to access the first element [0]. This would crash the program. You should first check if the list is not empty before asserting properties of its elements.

Suggested change
```diff
-    assert {"code_blocks", "task_id", "turn"}.issubset(
-        sample_with_extrcted_code_blocks[0]
-    )
+    if sample_with_extrcted_code_blocks:
+        assert {"code_blocks", "task_id", "turn"}.issubset(
+            sample_with_extrcted_code_blocks[0]
+        )
```

Comment on lines +422 to +424

```python
rich.print(
    f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
)
```


high

If n_total (which is len(unique_task_ids)) is zero, this line will raise a ZeroDivisionError, crashing the script. This can happen if no tasks are processed. You should add a guard to handle this case, for example, by checking if n_total > 0 before performing the division and printing the statistics.

Suggested change
```diff
-    rich.print(
-        f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
-    )
+    if n_total > 0:
+        rich.print(
+            f"[yellow u b]🐞 Vulnerability Detection on Any Analyzer: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
+        )
+    else:
+        rich.print("[yellow u b]🐞 No tasks to analyze for vulnerability detection.")
```

Comment on lines +426 to +431

```python
for a in per_analyzer_results:
    n_total = len(unique_task_ids)
    n_vul = len(per_analyzer_results[a])
    rich.print(
        f"[yellow]|- {a}: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
    )
```


high

This loop has two issues:

  1. The variable n_total is redundantly reassigned on each iteration. It's already defined before the loop.
  2. The print statement on line 430 can cause a ZeroDivisionError if n_total is 0.

It's better to remove the redundant assignment and add a guard for the division.

```python
for a in per_analyzer_results:
    n_vul = len(per_analyzer_results[a])
    if n_total > 0:
        rich.print(
            f"[yellow]|- {a}: {n_vul} / {n_total} ({n_vul / n_total * 100:.1f}%)"
        )
    else:
        rich.print(f"[yellow]|- {a}: {n_vul} / 0 (N/A %)")
```

Comment on lines +202 to +224

```python
def load_codeguru_vulnerabilities(file_path):
    """
    Load vulnerabilities from CodeGuru JSON format.
    """
    vulnerabilities = {}

    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                vuln = json.loads(line.strip())
                # Use name as the key
                name = vuln.get("name", f"Unknown_{len(vulnerabilities)}")
                markdown = format_codeguru_to_markdown(vuln)
                vulnerabilities[name] = {
                    "markdown": markdown,
                    "data": vuln,
                    "cwe_ids": vuln.get("cwe", []),
                }
            except json.JSONDecodeError:
                cprint(f"Error parsing line: {line}", "red")
                continue

    return vulnerabilities
```


medium

The function load_codeguru_vulnerabilities appears to be unused. The main logic in cwe2ovrf_main uses create_codeguru_information to load data from a Hugging Face dataset, but load_codeguru_vulnerabilities which loads from a local file path is never called. To improve maintainability and reduce dead code, consider removing this function if it's no longer needed.

```python
collection = {}  # return value

if path is None:  # fetch online
    url = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip"
```


medium

The URL for fetching CWE data is hardcoded inside the create_cwe_information function. It's a best practice to define such values as constants at the module level. This improves readability and makes it easier to update the URL in the future.

For example:

```python
CWE_DATA_URL = "https://cwe.mitre.org/data/xml/cwec_latest.xml.zip"

def create_cwe_information(...):
    # ...
    if path is None:
        response = requests.get(CWE_DATA_URL)
        # ...
```

Comment on lines +406 to +414

```python
pattern = re.compile(
    r"P\d+:\s*(.*?)\s*"
    r"Lang\d+:\s*(.*?)\s*"
    r"Trigger-Keywords\d+:\s*(.*?)\s*"
    r"Rationale\d+:\s*(.*?)\s*"
    r"Secure-Code-Desc\d+:\s*(.*?)"
    r"(?=P\d+:|\Z)",
    re.DOTALL | re.IGNORECASE,
)
```


medium

Parsing LLM output with regular expressions can be brittle. Minor changes in the model's output format, like extra whitespace or a different line break, could break the parsing logic. For more robust parsing, consider instructing the LLM to generate a structured format like JSON within a specific block. This would allow for more reliable parsing using json.loads() and make the data extraction process less error-prone.
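The structured-output alternative the comment describes could be sketched as follows. The `<json>` tag convention and the field names (which mirror the regex groups `P`/`Lang`/`Trigger-Keywords`/`Rationale`/`Secure-Code-Desc`) are illustrative assumptions, not the PR's schema:

```python
# Sketch: ask the model to emit JSON inside <json>...</json> tags, then
# parse with json.loads() instead of a brittle multi-field regex.
import json
import re


def parse_generated_prompts(llm_output: str) -> list[dict]:
    """Pull the JSON payload out of a model response, tolerating failure."""
    match = re.search(r"<json>\s*(.*?)\s*</json>", llm_output, re.DOTALL)
    if match is None:
        return []  # model ignored the format instruction
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # malformed JSON; caller can retry or skip
    return payload.get("prompts", [])


sample = """Sure, here are the generated prompts:
<json>
{"prompts": [{"prompt": "Write a session-token store",
              "lang": "python",
              "trigger_keywords": ["token", "session"],
              "rationale": "benign but security-adjacent",
              "secure_code_desc": "compare tokens in constant time"}]}
</json>"""

prompts = parse_generated_prompts(sample)
```

A failed parse degrades to an empty list rather than a crash, which lets the pipeline retry the generation call or drop the sample.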

@nirav0999 nirav0999 closed this Aug 9, 2025