blog: add Legacy Bench blog post #877

Closed
l30t wants to merge 1 commit into main from blog/legacy-bench

l30t commented Apr 1, 2026

Summary

Adds the Legacy Bench announcement as a new blog post on the docs site. Legacy Bench is the first benchmark designed to measure frontier AI agent capabilities on legacy software engineering tasks spanning COBOL, Fortran, Java 7, BASIC, C89, and Assembly.

Changes

  • New blog post: docs/blog/legacy-bench.mdx -- full blog post converted from the Notion draft
  • Chart images: 3 benchmark result charts added to docs/images/
  • Navigation: Added a new "Blog" tab to docs.json with a "Research" group

Content highlights

  • 100-task benchmark across 6 legacy language families
  • Results from 12 model-agent combinations (16.9%--42.5% pass rates)
  • Analysis of failure patterns, language-specific challenges, and agent comparison
  • Benchmark construction methodology and limitations
  • Developed by Factory in collaboration with Parsewave

Notes

  • The agent comparison chart placeholder from the draft ([PLACEHOLDER: Agent comparison chart]) was intentionally omitted -- the text description stands on its own
  • GitHub repo link placeholder was filled with https://github.com/factory-ai/legacy-bench
  • Contact CTA links to https://factory.ai/contact

Adds the Legacy Bench announcement as a new blog post covering the first
benchmark for evaluating AI agents on legacy software engineering tasks
(COBOL, Fortran, Java 7, etc). Includes benchmark results charts and a
new Blog tab in the docs navigation.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

factory-droid bot commented Apr 1, 2026

Droid finished @l30t's task.



## What is Legacy Bench

Legacy Bench consists of 100 tasks spanning six legacy language families and real enterprise domains. The full benchmark is used for evaluation, with ten representative tasks publicly available as open samples.

[P1] Task total is inconsistent across the post and charts

The post states "Legacy Bench consists of 100 tasks" (line 16), but the per-language chart shows "OVERALL (99)" while the per-language counts sum to 100. Please reconcile the benchmark total across the narrative and charts (and adjust any derived percentages) so readers aren’t left unsure whether results are for 99 or 100 tasks.

| Language | % of Benchmark | Domains |
| --- | --- | --- |
| **COBOL** | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
| **Java 7** | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |

[P2] Java 7 share in the table doesn’t match the task-count chart

The language table lists Java 7 as 32% (line 21), but the per-language chart shows "Java 7 (33)" tasks. If the benchmark is 100 tasks, that row should read 33%; if the benchmark is 99 tasks, the percentage needs to be recalculated. Please update the table (or the chart) so the share and the counts agree.
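
As a quick sanity check on the arithmetic, here is a minimal sketch that recomputes the Java 7 share under both candidate totals. The helper name `share_percent` is hypothetical and not part of the repo; the only input taken from the post is the chart label "Java 7 (33)".

```python
def share_percent(task_count: int, total_tasks: int) -> float:
    """Return a language's share of the benchmark as a percentage."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    return 100.0 * task_count / total_tasks


# Task count taken from the per-language chart label "Java 7 (33)".
JAVA7_TASKS = 33

# The two totals that currently appear in the post: 100 (narrative) and 99 (chart).
for total in (100, 99):
    share = share_percent(JAVA7_TASKS, total)
    print(f"total={total}: Java 7 share = {share:.1f}%")
```

Under either total the share rounds to 33%, so the 32% in the table does not follow from a count of 33.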

factory-droid bot left a comment

The post looks good overall, but the benchmark totals/percentages are internally inconsistent between the narrative/table and the charts; reconciling those figures will prevent reader confusion.
