[FEATURE] Integration tests on real big code #3

@Artemonim

Description

Pre-submission Checklist

  • I have read the README.md
  • I am using the latest version of Agent Docstrings
  • I have searched for existing issues to see if this feature has been requested before
  • This feature request is not a bug report (use Bug Report template for bugs)

Feature Category

Developer/API enhancement

Feature Summary

The main goal is not just to check that the tool works, but to deliberately find its weak spots. We need to choose files that will test:

  1. Correctness on real-world, non-“laboratory” code.
  2. Reliability when faced with complex and non-standard syntax.
  3. Performance on large files.
  4. Resilience to regressions as the code evolves.

Selection Criteria

I propose picking files according to the following four criteria:

1. Language Variety and Parser Quality

We need to test both categories of parsers:

  • AST-based parsers (Python, Go): Here, the goal is to ensure correct handling of all syntactic constructs supported by the AST. We should look for files with complex function declarations, decorators, generics (in Go), etc.
  • Regex-based parsers (C++, C#, JS, TS, Java, etc.): This group is critical. We need files that are likely to “break” our regex parser: examples where simple bracket counting and regex matching can fail.

2. Syntax Complexity and Diversity

For each language, we should find files containing:

  • Multi-line declarations: Functions or classes whose signatures span several lines.

  • Advanced language features:

    • C++: Templates (template), operator overloading, nested namespaces.
    • C# / Java: Generics, attributes/annotations on separate lines, nested and anonymous classes.
    • JS / TS: Arrow functions, async/await, decorators, default and named exports in the same file.
    • Python: Decorators, complex type annotations, functions with *args and **kwargs.
  • Unusual formatting: Syntactically valid code formatted in odd ways (for example, brace on its own line, weird indentation).

  • Regex stress cases:

    • Braces inside strings or comments: e.g. var s = "{\"key\": \"value\"}"; or // See function {Foo}. These are stress tests for our regex parsers; a sketch of why they matter follows this list.
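
As a quick illustration of why such fixtures matter, here is a toy Python script (not Agent Docstrings' actual parsing logic; the JavaScript snippet and the naive_end_line helper are made up for this example) showing naive brace counting closing a function body too early when a stray closing brace appears inside a comment:

```python
# Toy demonstration only: this is NOT the tool's real parsing logic.
SNIPPET = '''\
function parse() {
    // A stray closing brace inside a comment: }
    var s = "{\\"key\\": \\"value\\"}";
    return s;
}
'''

def naive_end_line(lines):
    """Return the 1-based line where simple brace counting thinks the body ends."""
    depth = 0
    for number, line in enumerate(lines, start=1):
        depth += line.count("{") - line.count("}")
        if number > 1 and depth <= 0:
            return number
    return None

# Reports line 2 (the comment), while the real closing brace is on line 5.
print(naive_end_line(SNIPPET.splitlines()))
```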

3. File Size and Structure

We should assemble a balanced set:

  • Small files (up to 100 lines): For quick “smoke” tests.
  • Medium files (300–1,000 lines): Representative of a typical project file.
  • Large files (2,000+ lines): To check performance and behavior on code with many functions and classes.
  • Mixed-structure files: For example, a file with multiple classes, nested classes, and top-level functions.
  • Files without classes/functions: To see how the tool handles an empty file or one containing only variable declarations.

4. License and Repository Popularity

  • License: It’s crucial to take files only from repositories under permissive licenses (MIT, Apache 2.0, BSD) whose terms allow us to reuse their code. Avoid GPL/LGPL to prevent licensing conflicts in our codebase.
  • Popularity: Prefer well-known, actively maintained projects (for example, requests, pandas, Docker, React, VS Code). Their code is typically high-quality and reflects modern language practices.

Problem Statement

Lack of testing on real data: all tests in the project are currently AI-generated.

Proposed Solution

  1. Select repositories: Identify 2–3 popular repositories for each key language group (AST-based and regex-based).

  2. Pick files: Within those repos, deliberately find 3–5 files that match the complexity and edge-case criteria above.

  3. Create a fixture set: Copy the selected files into a dedicated directory in our project, e.g. tests/integration_fixtures/. IMPORTANT: prepend each copied file with a comment noting its original source URL on GitHub and its license.

  4. Write tests: For each fixture file, write a test that:
    a. Copies the fixture into a temporary directory.
    b. Runs our generator on it.
    c. Compares the output to a pre-saved “golden” snapshot.

    • Snapshot testing is ideal here. A test only fails if the generated header changes, making it easy to catch regressions. For Python, we can use the pytest-snapshot library; a sketch is shown below.
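
A minimal sketch of one such test, using pytest-snapshot's snapshot fixture together with pytest's tmp_path. The run_generator() entry point, the fixture file name, and the snapshots directory are hypothetical placeholders standing in for Agent Docstrings' real API and layout:

```python
import shutil
from pathlib import Path

# Hypothetical entry point; replace with the real Agent Docstrings API or CLI call.
from agent_docstrings import run_generator

FIXTURES = Path(__file__).parent / "integration_fixtures"
SNAPSHOTS = Path(__file__).parent / "snapshots"

def test_cpp_templates_fixture(snapshot, tmp_path):
    # a. Copy the fixture into a temporary directory so the original stays untouched.
    work_copy = tmp_path / "templates_example.cpp"  # hypothetical fixture name
    shutil.copy(FIXTURES / "cpp" / "templates_example.cpp", work_copy)

    # b. Run our generator on the copy.
    run_generator(work_copy)

    # c. Compare the output to the pre-saved "golden" snapshot.
    snapshot.snapshot_dir = SNAPSHOTS
    snapshot.assert_match(work_copy.read_text(encoding="utf-8"), "templates_example.cpp")
```

With this shape, adding a new fixture is just dropping the file into the fixtures directory and running pytest with --snapshot-update once to record the golden output; any later change to the generated header makes the test fail.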

Priority Level

🤔 Medium - Would be nice to have

Implementation Complexity

  • 🤔 Simple - Minor change or addition
  • 😑 Moderate - Requires new parsing logic
  • 🛠️ Complex - Major feature requiring significant development
  • 😎 I don't know

Contribution

  • I would like to implement this feature myself
  • I can help with testing
  • I can provide sample code files for testing
