Add Sounio (.sio) as a target language by agourakis82 · Pull Request #191 · nuprl/MultiPL-E

agourakis82 · 2026-05-17T21:51:54Z

What

Adds Sounio (Sounio-lang/sounio, souc 1.0.0-beta.5) to MultiPL-E. Submitted as a draft so maintainers can shape the integration before baseline numbers are added.

dataset_builder/humaneval_to_sounio.py — translator (LanguageTranslator subclass).
dataset_builder/terms.csv — Sounio row.
dataset_builder/sounio_translator_notes.md — type-mapping rationale + edge-case log.
evaluation/src/eval_sounio.py — backend that runs souc and returns the MultiPL-E verdict shape.
evaluation/Dockerfile.sounio — SHA-pinned, single-language reproducible image.
references/{hand,auto}/ + references/validate.py — 20 hand-validated translations.
prompts/humaneval-sio-reworded.jsonl, prompts/mbpp-sio-reworded.jsonl — generated dataset slices (regenerable from dataset_builder/prepare_prompts_for_hfhub.py).
agent_logs/Cx2_acceptance.md, agent_logs/Cx2_convergence.md — acceptance-test log + 3-cycle iterative-convergence log.

Why

Sounio is a typed, effect-tracked systems language with Rust-shaped syntax. Without a MultiPL-E entry it cannot be compared against other languages in cross-language code-generation evaluations, which keeps it out of the standard literature comparisons. The translator follows the existing patterns established by humaneval_to_rs.py so reviewers see a small, familiar surface.

How

Translator

Subclasses LanguageTranslator. Type mapping:

| Python | Sounio |
| --- | --- |
| `int` | `i64` |
| `float` | `f64` |
| `bool` | `bool` |
| `str` | `String` |
| `List[T]` | `Vec<T>` |
| `Dict[K, V]` | `HashMap<K, V>` |
| `Tuple[…]` | `(…)` |
| `Optional[T]` | `Option<T>` |
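For reviewers who want to see the mapping as code, here is a minimal sketch of how the table could be applied to a Python annotation string. It is not the code in humaneval_to_sounio.py: the names `to_sounio_type`, `SCALARS`, and `GENERICS` are hypothetical, and it assumes Python 3.9+ (where `ast.Subscript.slice` is the inner expression itself).

```python
import ast

# Scalar and generic mappings copied from the table above.
SCALARS = {"int": "i64", "float": "f64", "bool": "bool", "str": "String"}
GENERICS = {"List": "Vec", "Dict": "HashMap", "Optional": "Option"}


def to_sounio_type(annotation: str) -> str:
    """Translate a Python type annotation string into a Sounio type string."""
    return _convert(ast.parse(annotation, mode="eval").body)


def _convert(node: ast.expr) -> str:
    if isinstance(node, ast.Name) and node.id in SCALARS:
        return SCALARS[node.id]
    if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
        head = node.value.id
        # Dict[K, V] and Tuple[...] carry a tuple of arguments; List[T] carries one.
        args = node.slice.elts if isinstance(node.slice, ast.Tuple) else [node.slice]
        if head == "Tuple":
            return "(" + ", ".join(_convert(a) for a in args) + ")"
        if head in GENERICS:
            return GENERICS[head] + "<" + ", ".join(_convert(a) for a in args) + ">"
    # Union, Any, and anything else unmapped is rejected so the caller can skip the prompt.
    raise ValueError(f"unsupported annotation: {ast.unparse(node)}")
```

In this sketch, `to_sounio_type("Dict[str, List[int]]")` yields `HashMap<String, Vec<i64>>`, and `Any` or `Union[...]` raises `ValueError`, which a caller could treat as the skip signal.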

Sounio-specific rules baked into the translator:

No unary minus → negative literals become (0 - n) (and similarly for f64).
No semicolons.
Mandatory effect clause: user functions are emitted with `with Mut, Panic, Div`; the test-harness main is emitted with `with IO, Mut, Panic, Div`.
Empty containers require explicit element types — vec![] → Vec::<T>::new(), HashMap::from([]) → HashMap::<K,V>::new() (same approach Rust uses).
Optional-typed call arguments are wrapped in Some(…) at the call site, mirroring humaneval_to_rs.coerce.
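The literal-level rules above can be sketched as a few small rendering helpers. This is an illustration only: the function names are hypothetical, the `(0.0 - x)` form for f64 is an assumption based on "similarly for f64", and non-finite floats are not handled.

```python
import math


def render_int(n: int) -> str:
    """Render an i64 literal without unary minus."""
    return f"(0 - {-n})" if n < 0 else str(n)


def render_float(x: float) -> str:
    """Render a finite f64 literal without unary minus (assumed form)."""
    # copysign also catches -0.0, whose repr would otherwise start with '-'.
    if math.copysign(1.0, x) < 0:
        return f"(0.0 - {repr(-x)})"
    return repr(x)


def render_list(items: list, elem_type: str) -> str:
    """Render already-rendered items; empty lists need an explicit element type."""
    if not items:
        return f"Vec::<{elem_type}>::new()"
    return "vec![" + ", ".join(items) + "]"


def coerce_arg(rendered: str, param_type: str) -> str:
    """Wrap a non-None argument in Some(...) when the parameter is Option-typed."""
    if param_type.startswith("Option<") and rendered != "None":
        return f"Some({rendered})"
    return rendered
```

For example, `render_list([render_int(1), render_int(-2)], "i64")` produces `vec![1, (0 - 2)]`, while `render_list([], "i64")` produces `Vec::<i64>::new()`.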

Union/untyped/Any prompts are skipped exactly as Rust does. Translation rates: 148/161 HumanEval (0.92), 356/400 MBPP-typed (0.89) — parity with Rust.

Evaluation backend

Standard eval_script(path) returning {status, exit_code, stdout, stderr}. Two integration quirks worth flagging:

souc 1.0.0-beta.5's compile -o subcommand path is broken — the raw pass-through souc <src> <out> works. The eval backend uses the raw form; a fix is tracked upstream.
The Sounio binary writer leaves the produced ELF without the +x bit; eval_script sets its mode to 0o755 before invoking it.
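A minimal sketch of a backend with this shape, working around both quirks, is shown below. It is not the contents of eval_sounio.py: the `compiler` and `timeout` parameters, the 15-second default, and the `-1` exit code on timeout are assumptions, and the status strings are the ones listed in the commit message (OK / SyntaxError / Exception / Timeout).

```python
import os
import subprocess
import tempfile
from pathlib import Path


def eval_script(path, compiler="souc", timeout=15):
    """Compile a .sio file, run the binary, and return a verdict dict."""
    with tempfile.TemporaryDirectory() as tmp:
        binary = Path(tmp) / "prog"
        # Raw pass-through form `souc <src> <out>` instead of `compile -o`.
        build = subprocess.run(
            [compiler, str(path), str(binary)], capture_output=True, text=True
        )
        if build.returncode != 0:
            return {
                "status": "SyntaxError",
                "exit_code": build.returncode,
                "stdout": build.stdout,
                "stderr": build.stderr,
            }
        # The binary writer omits the executable bit, so set it explicitly.
        os.chmod(binary, 0o755)
        try:
            run = subprocess.run(
                [str(binary)], capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return {"status": "Timeout", "exit_code": -1, "stdout": "", "stderr": ""}
        return {
            "status": "OK" if run.returncode == 0 else "Exception",
            "exit_code": run.returncode,
            "stdout": run.stdout,
            "stderr": run.stderr,
        }
```

Making the compiler path a parameter keeps the function testable on hosts where souc is not installed.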

Container

evaluation/Dockerfile.sounio is a small single-language image: debian:bookworm-slim + python3 + a pinned souc-linux-x86_64. Reproducibility is enforced via SOUNIO_VERSION and SOUNIO_BIN_SHA256 (sha256sum -c fails the build on drift). This sits alongside the main multi-language Dockerfile rather than perturbing it.
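The same pin check can be reproduced outside Docker with the standard library. The sketch below assumes a locally downloaded souc binary; `sha256_of` and `verify_pin` are hypothetical helper names, not part of this PR.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pin(path, expected_hex):
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"checksum drift: expected {expected_hex}, got {actual}")
```

Calling `verify_pin` with the downloaded binary and the value of SOUNIO_BIN_SHA256 fails loudly on drift, mirroring what `sha256sum -c` does during the image build.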

Baseline numbers

Pending. The translator + harness land in this PR so the contract can be reviewed first. A follow-up commit will add results/sounio_deepseek-coder-6.7b.jsonl with pass@1 / pass@10 against deepseek-ai/deepseek-coder-6.7b-base (temp 0.2, n=10). Expected pass@1 range: 5–15% (Sounio is a low-resource language, so the point is to publish the floor).
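For context on how pass@1 and pass@10 relate to n=10 samples, here is the commonly used unbiased pass@k estimator as a short sketch. This PR does not state which estimator the grading driver uses, so treat the choice as an assumption; `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=10, pass@1 reduces to c/10 per problem, and pass@10 is 1.0 for any problem with at least one passing sample and 0.0 otherwise; the reported figure is the mean over problems.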

Validation

20 hand-translated references in references/hand/ spanning the spec-required categories: 5 trivial, 5 list/iter, 5 control/recursion, 3 dict/set, 2 edge-case.
references/validate.py strips the function body region and compares the translator-produced surface (signature, types, literals, asserts, harness). 20/20 PASS.
Adversarial self-critique of 10 random translations in agent_logs/Cx2_convergence.md — no semantic divergences found.
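The structural comparison can be pictured roughly as follows. This is a sketch under stated assumptions, not the logic of references/validate.py: it assumes the body region starts after the first line ending in `{` and ends at the first following line that is exactly `}`, and the helper names are hypothetical.

```python
def strip_body(source: str, placeholder: str = "<body>") -> str:
    """Replace the first function body with a placeholder, keeping everything else."""
    lines = source.splitlines()
    open_idx = next(
        (i for i, line in enumerate(lines) if line.rstrip().endswith("{")), None
    )
    if open_idx is None:
        raise ValueError("no opening brace found")
    close_idx = next(
        (i for i in range(open_idx + 1, len(lines)) if lines[i] == "}"), None
    )
    if close_idx is None:
        raise ValueError("no closing brace found")
    return "\n".join(lines[: open_idx + 1] + [placeholder] + lines[close_idx:])


def same_surface(hand: str, auto: str) -> bool:
    """True when two files agree on everything except the first function body."""
    return strip_body(hand) == strip_body(auto)
```

Under this scheme a hand reference and a translator output pass when their signature, types, literals, asserts, and harness match, regardless of what the hand-written body contains.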

Reproducibility

souc pin: v1.0.0-beta.5, SHA256 3cbea2b475e79737046f8ccf463c07d22cd5fb678fd479a032ee04bd8e19da93.
Dataset: unmodified datasets/originals/ (161 HumanEval) and datasets/mbpp-typed/ (400 MBPP).
Translator + eval scripts are pure-stdlib Python; no new dependencies added to the project.
Generated JSONL is regenerable from the scripts plus the dataset (the regeneration command lines are in agent_logs/Cx2_acceptance.md).

CI status

references/validate.py is the structural gate the PR commits to; T1 / T1.5 / T3 / T4 / T6 are passing locally. T2 (full Docker build of Dockerfile.sounio) requires the souc-linux-x86_64 asset to be attached to the upstream Sounio release — coordinated outside this PR and documented in agent_logs/Cx2_acceptance.md. Happy to wire whichever CI workflow the maintainers prefer.

Open questions for maintainers

Should this PR add Sounio to the main multi-language evaluation/Dockerfile, or keep Dockerfile.sounio separate? The current approach favors isolation; happy to merge into the main image if you'd rather.
The translator's stop token is ["\n}"] (matches Rust). If Sounio later acquires a function syntax that legitimately closes on } mid-expression, we'd want a tighter stop — not an issue today.
Translation ratio (0.92 HE / 0.89 MBPP) is lower than Python/JS but at parity with Rust/Go because of the same Union/Any skip. Let me know if you'd prefer a different default policy (e.g., silently emitting a panic! for those prompts so the count matches).
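On the stop-token question, the sketch below shows how a `"\n}"` stop behaves when applied to a raw completion. It is illustrative only; `truncate_at_stop` is a hypothetical helper and not necessarily how the harness applies stop sequences.

```python
def truncate_at_stop(completion: str, stop_tokens=("\n}",)) -> str:
    """Cut the completion at the earliest occurrence of any stop token."""
    cut = len(completion)
    for stop in stop_tokens:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```

Because the stop requires `}` at column 0, an indented closing brace of a nested block does not end the completion early; only a function-level `}` on its own line does.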

Disclosure

Authored with Claude Code (Opus 4.7) operating as agent Cx-2 under direct human supervision (operator: @agourakis82). All translator decisions, edge-case handling, and acceptance criteria were reviewed and approved by the operator before push. This disclosure satisfies the ICMJE 2025 / GAIDeT contributor-statement requirement for agent-assisted authorship.

Sounio (https://github.com/Sounio-lang/sounio) is an epistemic systems language with Rust-shaped type annotations. This commit lands the translator + evaluation backend so MultiPL-E can grade code-generation models on .sio prompts.

Translator (dataset_builder/humaneval_to_sounio.py)
- Subclasses LanguageTranslator.
- Maps int/float/bool/str -> i64/f64/bool/String; List[T]/Dict[K,V]/Tuple[..]/Optional[T] -> Vec/HashMap/tuple/Option.
- Expands negative literals to (0 - n) (no unary minus in Sounio).
- Emits a generous `with Mut, Panic, Div` effect set on user fns and `with IO, Mut, Panic, Div` on the test-harness main.
- 148/161 HumanEval and 356/400 MBPP-typed prompts translate cleanly (parity with Rust).

Evaluation backend (evaluation/src/eval_sounio.py)
- `eval_script(path)` returning the standard MultiPL-E verdict shape (OK / SyntaxError / Exception / Timeout).
- Uses souc's raw pass-through (`souc <src> <out>`) because the `compile -o` subcommand is broken in 1.0.0-beta.5; fix tracked upstream.
- chmod +x on the produced ELF (binary writer omits the bit).

Container (evaluation/Dockerfile.sounio)
- debian:bookworm-slim + python3 + a pinned souc-linux-x86_64 (SHA256-verified). SOUNIO_VERSION / SOUNIO_BIN_SHA256 are the reproducibility contract.

Terms / docs
- dataset_builder/terms.csv: Sounio row.
- dataset_builder/sounio_translator_notes.md: type-mapping rationale, six handled edge-cases, known limitations.

Validation
- references/{hand,auto}/ + references/validate.py: 20/20 pairs PASS structurally (body region stripped; hand bodies are the human's contribution, the translator's contract is the prompt header + test harness).
- agent_logs/Cx2_acceptance.md: T1/T1.5/T3/T4/T6 PASS; T2/T5 deferred with explicit operator-action follow-ups.
- agent_logs/Cx2_convergence.md: three iterative-convergence cycles + an adversarial-self-critique table for ten random translations.

Generated with Claude Code (Opus 4.7) as Cx-2 under operator supervision (GAIDeT / ICMJE-2025 agent-assisted authorship disclosure).

Cycle 4 of iterative convergence: ran souc 1.0.0-beta.5 on five trivial hand bodies. Three (gcd, largest_divisor, is_prime) compile and pass all asserts end-to-end — real evidence the translator output is consumable. Two (strlen, triangle_area) failed typecheck because the chosen Sounio surface ops (String.len, 'as f64' cast) aren't yet stable; those bodies are now panic stubs so validate.py PASS stays honest. Log captured in agent_logs/Cx2_convergence.md.

Adds the first published Sounio MultiPL-E baseline. This is the floor of the floor: a 1.3B base model with no Sounio in its training mix cannot produce syntactically valid Sounio (143/148 SyntaxError, 5/148 Exception, 0/148 OK). The point is provenance: every future Sounio number now has this reference.

Spec deviation: spec calls for deepseek-coder-6.7b-base. The 6.7B safetensors mmap (9.97GiB) exceeds the eval host's vmem ulimit (24GiB); fully documented in results/sounio_README.md. scripts/run_baseline.py supports --model so the operator can rerun on a larger host without code changes.

Artifacts:
- results/sounio_deepseek-coder-1.3b.jsonl (per-problem + completions)
- results/sounio_deepseek-coder-1.3b.summary.json (machine-readable)
- results/sounio.csv (upstream lang,problem,verdict format)
- results/sounio_README.md (methodology + reproduction)
- scripts/run_baseline.py (generation + grading driver)

Compute: single NVIDIA L4 (23GiB), CUDA 13.2, wall-clock 785s.

agourakis82 added 3 commits May 17, 2026 21:50

agourakis82 wants to merge 3 commits into nuprl:main from agourakis82:feature/add-sounio
