blog: add Legacy Bench blog post by l30t · Pull Request #877 · Factory-AI/factory

l30t · 2026-04-01T18:00:18Z

Summary

Adds the Legacy Bench announcement as a new blog post on the docs site. Legacy Bench is the first benchmark designed to measure frontier AI agent capabilities on legacy software engineering tasks spanning COBOL, Fortran, Java 7, BASIC, C89, and Assembly.

Changes

New blog post: docs/blog/legacy-bench.mdx -- full blog post converted from the Notion draft
Chart images: 3 benchmark result charts added to docs/images/
Navigation: Added a new "Blog" tab to docs.json with a "Research" group

Content highlights

100-task benchmark across 6 legacy language families
Results from 12 model-agent combinations (16.9%--42.5% pass rates)
Analysis of failure patterns, language-specific challenges, and agent comparison
Benchmark construction methodology and limitations
Developed by Factory in collaboration with Parsewave

Notes

The agent comparison chart placeholder from the draft ([PLACEHOLDER: Agent comparison chart]) was intentionally omitted -- the text description stands on its own
GitHub repo link placeholder was filled with https://github.com/factory-ai/legacy-bench
Contact CTA links to https://factory.ai/contact

Adds the Legacy Bench announcement as a new blog post covering the first benchmark for evaluating AI agents on legacy software engineering tasks (COBOL, Fortran, Java 7, etc.). Includes benchmark result charts and a new Blog tab in the docs navigation.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

factory-droid · 2026-04-01T18:00:43Z

Droid finished @l30t's task.

factory-droid · 2026-04-01T18:07:55Z

docs/blog/legacy-bench.mdx

+
+## What is Legacy Bench
+
+Legacy Bench consists of 100 tasks spanning six legacy language families and real enterprise domains. The full benchmark is used for evaluation, with ten representative tasks publicly available as open samples.


[P1] Task total is inconsistent across the post and charts

The post states "Legacy Bench consists of 100 tasks" (line 16), but the per-language chart shows "OVERALL (99)" while the per-language counts sum to 100. Please reconcile the benchmark total across the narrative and charts (and adjust any derived percentages) so readers aren’t left unsure whether results are for 99 or 100 tasks.
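
A small consistency check could catch this kind of drift before publishing. The sketch below is illustrative only: the function name `find_total_mismatches` and the shape of the two dictionaries are assumptions, not part of the PR. The Java 7 count comes from the chart label, and the other five families are lumped into one remainder entry so that the per-language sum is 100, as described in the comment above.

```python
def find_total_mismatches(per_language_counts, stated_totals):
    """Return every stated total that disagrees with the summed per-language counts."""
    counted = sum(per_language_counts.values())
    return {
        label: total
        for label, total in stated_totals.items()
        if total != counted
    }


# Java 7 (33) is from the chart; the remainder entry is an assumption that
# simply encodes "the per-language counts sum to 100".
per_language_counts = {
    "Java 7": 33,
    "other five families (combined)": 67,
}

# The two totals quoted in the review comment.
stated_totals = {
    "narrative": 100,
    "chart OVERALL label": 99,
}

mismatches = find_total_mismatches(per_language_counts, stated_totals)
print(mismatches)  # {'chart OVERALL label': 99}
```

With these inputs only the chart's overall label is flagged, which fits the reading that the narrative total is correct and the chart label is off by one. The check alone cannot confirm which figure is the mistake.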

factory-droid · 2026-04-01T18:07:59Z

docs/blog/legacy-bench.mdx

+| Language | % of Benchmark | Domains |
+| --- | --- | --- |
+| **COBOL** | 46% | Financial settlement, payroll processing, insurance claims, telecom billing, VSAM file handling |
+| **Java 7** | 32% | Enterprise middleware, CDR processing, warehouse logistics, binary parsing, EJB patterns |


[P2] Java 7 share in the table doesn’t match the task-count chart

The language table lists Java 7 as 32% (line 21), but the per-language chart shows "Java 7 (33)" tasks. If the benchmark is 100 tasks, that row should read 33%; if the benchmark is 99 tasks, the percentage needs to be recalculated. Please update the table (or the chart) so the share and the counts agree.
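
The two candidate fixes give slightly different numbers, which a quick calculation makes concrete. This is a minimal sketch: `share_percent` is a hypothetical helper, and the only inputs are the chart's Java 7 count (33) and the two totals under discussion (100 and 99).

```python
def share_percent(count: int, total: int) -> float:
    """Share of the benchmark as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


java7_count = 33  # from the per-language chart label "Java 7 (33)"

for total in (100, 99):
    print(f"total={total}: Java 7 share = {share_percent(java7_count, total)}%")
# total=100: Java 7 share = 33.0%
# total=99: Java 7 share = 33.3%
```

Neither total reproduces the 32% in the table, so the table row needs to change in either case unless the chart's count of 33 is itself wrong.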

factory-droid

The post looks good overall, but the benchmark totals and percentages in the narrative and table do not match the charts. Reconciling those figures will prevent reader confusion.

l30t requested review from abhay-factory, enoreyes and francesca-factory April 1, 2026 18:00

l30t closed this Apr 1, 2026

factory-droid bot reviewed Apr 1, 2026

l30t wants to merge 1 commit into main from blog/legacy-bench
