Add Sounio (.sio) as a target language #191

Draft
agourakis82 wants to merge 3 commits into nuprl:main from agourakis82:feature/add-sounio

@agourakis82

What

Adds Sounio (Sounio-lang/sounio, souc 1.0.0-beta.5) to MultiPL-E. Submitted as a draft so maintainers can shape the integration before baseline numbers are added.

  • dataset_builder/humaneval_to_sounio.py — translator (LanguageTranslator subclass).
  • dataset_builder/terms.csv — Sounio row.
  • dataset_builder/sounio_translator_notes.md — type-mapping rationale + edge-case log.
  • evaluation/src/eval_sounio.py — backend that runs souc and returns the MultiPL-E verdict shape.
  • evaluation/Dockerfile.sounio — SHA-pinned, single-language reproducible image.
  • references/{hand,auto}/ + references/validate.py — 20 hand-validated translations.
  • prompts/humaneval-sio-reworded.jsonl, prompts/mbpp-sio-reworded.jsonl — generated dataset slices (regenerable from dataset_builder/prepare_prompts_for_hfhub.py).
  • agent_logs/Cx2_acceptance.md, agent_logs/Cx2_convergence.md — acceptance-test log + 3-cycle iterative-convergence log.

Why

Sounio is a typed, effect-tracked systems language with Rust-shaped syntax. Without a MultiPL-E entry it cannot be compared against other languages in cross-language code-generation evaluations, which keeps it out of the standard literature comparisons. The translator follows the existing patterns established by humaneval_to_rs.py so reviewers see a small, familiar surface.

How

Translator

Subclasses LanguageTranslator. Type mapping:

| Python | Sounio |
| --- | --- |
| int | i64 |
| float | f64 |
| bool | bool |
| str | String |
| List[T] | Vec<T> |
| Dict[K, V] | HashMap<K, V> |
| Tuple[…] | (…) |
| Optional[T] | Option<T> |
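
The mapping above can be sketched as a small recursive walk over Python annotation nodes. This is an illustration only, not the code in humaneval_to_sounio.py: the names `to_sounio_type` and `annotation` are hypothetical, the sketch assumes Python 3.9+ (where `ast.Subscript.slice` is the expression itself), and raising on unsupported annotations stands in for whatever skip mechanism the translator actually uses.

```python
import ast

# Scalar and generic heads taken from the mapping table.
SCALARS = {"int": "i64", "float": "f64", "bool": "bool", "str": "String"}
GENERICS = {"List": "Vec", "Dict": "HashMap", "Optional": "Option"}


def to_sounio_type(node: ast.expr) -> str:
    """Translate a Python annotation AST node into a Sounio type string."""
    if isinstance(node, ast.Name) and node.id in SCALARS:
        return SCALARS[node.id]
    if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
        head = node.value.id
        # One type argument is a single node; several arrive as an ast.Tuple.
        inner = node.slice
        args = inner.elts if isinstance(inner, ast.Tuple) else [inner]
        parts = ", ".join(to_sounio_type(a) for a in args)
        if head == "Tuple":
            return f"({parts})"
        if head in GENERICS:
            return f"{GENERICS[head]}<{parts}>"
    # Union, Any and untyped annotations are rejected rather than guessed at.
    raise ValueError(f"unsupported annotation: {ast.dump(node)}")


def annotation(src: str) -> str:
    """Convenience wrapper: parse an annotation string and translate it."""
    return to_sounio_type(ast.parse(src, mode="eval").body)
```

For example, `annotation("Dict[str, List[float]]")` yields `HashMap<String, Vec<f64>>`, while `annotation("Any")` raises `ValueError`.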

Sounio-specific rules baked into the translator:

  • No unary minus → negative literals become (0 - n) (and similarly for f64).
  • No semicolons.
  • Mandatory effect clause: user functions are emitted with `with Mut, Panic, Div`; the test-harness main is emitted with `with IO, Mut, Panic, Div`.
  • Empty containers require explicit element types: vec![] → Vec::<T>::new(), HashMap::from([]) → HashMap::<K,V>::new() (same approach Rust uses).
  • Optional-typed call arguments are wrapped in Some(…) at the call site, mirroring humaneval_to_rs.coerce.
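
Two of these rules are easy to show in isolation. The sketch below is a hypothetical rendering helper, not the translator's actual code: the function names are invented, and the `(0.0 - x)` form for f64 is an assumption based on the "similarly for f64" remark. Negative zero, infinities and NaN are deliberately left out.

```python
def sounio_number(value) -> str:
    """Render a Python int or float as a Sounio literal without unary minus."""
    if isinstance(value, bool):
        raise TypeError("bool is not a numeric literal")
    if isinstance(value, int):
        # Negative ints become a subtraction from zero.
        return f"(0 - {-value})" if value < 0 else str(value)
    # Assumed f64 form: subtract from 0.0 so both operands are floats.
    return f"(0.0 - {-value!r})" if value < 0 else repr(value)


def empty_vec(elem_type: str) -> str:
    """Empty list literal with the element type spelled out."""
    return f"Vec::<{elem_type}>::new()"


def empty_map(key_type: str, value_type: str) -> str:
    """Empty dict literal with key and value types spelled out."""
    return f"HashMap::<{key_type},{value_type}>::new()"
```

Here `sounio_number(-5)` gives `(0 - 5)` and `empty_vec("i64")` gives `Vec::<i64>::new()`.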

Union/untyped/Any prompts are skipped exactly as Rust does. Translation rates: 148/161 HumanEval (0.92), 356/400 MBPP-typed (0.89) — parity with Rust.

Evaluation backend

Standard eval_script(path) returning {status, exit_code, stdout, stderr}. Two integration quirks worth flagging:

  1. souc 1.0.0-beta.5's compile -o subcommand path is broken — the raw pass-through souc <src> <out> works. The eval backend uses the raw form; a fix is tracked upstream.
  2. The Sounio binary writer leaves the produced ELF without the +x bit; eval_script sets its mode to 0o755 before invoking it.
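
A minimal sketch of how a backend with both workarounds might look, assuming `souc` is on PATH. It is not the contents of eval_sounio.py: the `classify` helper, the 15-second timeout, and the choice to report every compile failure as SyntaxError are assumptions made for illustration.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def classify(compiled: bool, exit_code: int, timed_out: bool = False) -> str:
    """Map a compile/run outcome onto OK / SyntaxError / Exception / Timeout."""
    if not compiled:
        return "SyntaxError"  # assumption: any compile failure is reported this way
    if timed_out:
        return "Timeout"
    return "OK" if exit_code == 0 else "Exception"


def eval_script(path, compiler="souc", timeout=15):
    """Compile with the raw `souc <src> <out>` form, run the binary, return a verdict."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "program"
        build = subprocess.run([compiler, str(path), str(out)],
                               capture_output=True, text=True)
        if build.returncode != 0 or not out.exists():
            return {"status": classify(False, build.returncode),
                    "exit_code": build.returncode,
                    "stdout": build.stdout, "stderr": build.stderr}
        os.chmod(out, 0o755)  # the binary writer omits the +x bit
        try:
            run = subprocess.run([str(out)], capture_output=True, text=True,
                                 timeout=timeout)
        except subprocess.TimeoutExpired:
            return {"status": classify(True, -1, timed_out=True),
                    "exit_code": -1, "stdout": "", "stderr": ""}
        return {"status": classify(True, run.returncode),
                "exit_code": run.returncode,
                "stdout": run.stdout, "stderr": run.stderr}
```

Keeping the status decision in a pure function makes the verdict mapping testable without a compiler installed.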

Container

evaluation/Dockerfile.sounio is a small single-language image: debian:bookworm-slim + python3 + a pinned souc-linux-x86_64. Reproducibility is enforced via SOUNIO_VERSION and SOUNIO_BIN_SHA256 (sha256sum -c fails the build on drift). This sits alongside the main multi-language Dockerfile rather than perturbing it.
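
For readers who want to check a downloaded compiler binary outside Docker, the same pin can be verified in a few lines of Python. This is a sketch of the equivalent of `sha256sum -c`, not part of the PR; `sha256_of` and `verify_pin` are hypothetical names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pin(path, expected_hex):
    """Raise if the file's SHA-256 differs from the pinned value."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise RuntimeError(f"checksum drift: expected {expected_hex}, got {actual}")
    return actual
```

Calling `verify_pin("souc-linux-x86_64", pinned_sha)` with the value of SOUNIO_BIN_SHA256 fails loudly on any mismatch, which is the behaviour the Dockerfile relies on.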

Baseline numbers

Pending. The translator + harness land in this PR so the contract can be reviewed first. A follow-up commit will add results/sounio_deepseek-coder-6.7b.jsonl with pass@1 / pass@10 against deepseek-ai/deepseek-coder-6.7b-base (temp 0.2, n=10). Expected pass@1 range: 5–15% (Sounio is a low-resource language, so the point is to publish the floor).
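
For reference, pass@k with n samples per problem is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of passing samples. The sketch below assumes that estimator is the one behind the numbers to be reported; `pass_at_k` is an illustrative name, not a function from this PR.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=10, a problem that passes on exactly one sample contributes roughly 0.1 to pass@1 and 1.0 to pass@10; the benchmark score is the mean over problems.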

Validation

  • 20 hand-translated references in references/hand/ spanning the spec-required categories: 5 trivial, 5 list/iter, 5 control/recursion, 3 dict/set, 2 edge-case.
  • references/validate.py strips the function body region and compares the translator-produced surface (signature, types, literals, asserts, harness). 20/20 PASS.
  • Adversarial self-critique of 10 random translations in agent_logs/Cx2_convergence.md — no semantic divergences found.
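
The structural comparison can be pictured as follows. This is a guess at the shape of the check, not the contents of references/validate.py: it assumes the first function's body starts after the first `{` at a line end and ends at the first `\n}`, and `strip_body` and `same_surface` are hypothetical names.

```python
def strip_body(program: str, marker: str = "/* body */") -> str:
    """Replace the first function body with a marker, keeping everything else."""
    start = program.index("{\n") + 1      # position of the newline after '{'
    end = program.index("\n}", start)     # newline before the closing brace
    return program[:start] + "\n" + marker + program[end:]


def same_surface(hand: str, auto: str) -> bool:
    """True when two programs differ only inside the first function body."""
    return strip_body(hand) == strip_body(auto)
```

Under this scheme a hand-written body and a placeholder body compare equal as long as the signature, types, literals, asserts and harness match.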

Reproducibility

  • souc pin: v1.0.0-beta.5, SHA256 3cbea2b475e79737046f8ccf463c07d22cd5fb678fd479a032ee04bd8e19da93.
  • Dataset: unmodified datasets/originals/ (161 HumanEval) and datasets/mbpp-typed/ (400 MBPP).
  • Translator + eval scripts are pure-stdlib Python; no new dependencies added to the project.
  • Generated JSONL is regenerable from the scripts plus the dataset (the regeneration command lines are in agent_logs/Cx2_acceptance.md).

CI status

references/validate.py is the structural gate the PR commits to; T1 / T1.5 / T3 / T4 / T6 are passing locally. T2 (full Docker build of Dockerfile.sounio) requires the souc-linux-x86_64 asset to be attached to the upstream Sounio release — coordinated outside this PR and documented in agent_logs/Cx2_acceptance.md. Happy to wire whichever CI workflow the maintainers prefer.

Open questions for maintainers

  1. Should this PR add Sounio to the main multi-language evaluation/Dockerfile, or keep Dockerfile.sounio separate? The current approach favors isolation; happy to merge into the main image if you'd rather.
  2. The translator's stop token is ["\n}"] (matches Rust). If Sounio later acquires a function syntax that legitimately closes on } mid-expression, we'd want a tighter stop — not an issue today.
  3. Translation ratio (0.92 HE / 0.89 MBPP) is lower than Python/JS but at parity with Rust/Go because of the same Union/Any skip. Let me know if you'd prefer a different default policy (e.g., silently emitting a panic! for those prompts so the count matches).
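
To make the stop-token concern in question 2 concrete, here is a sketch of how a completion is typically cut at the earliest stop sequence. The helper name `truncate_at_stop` is hypothetical; only the stop list `["\n}"]` comes from this PR.

```python
STOP_TOKENS = ["\n}"]


def truncate_at_stop(completion: str, stops=STOP_TOKENS) -> str:
    """Cut a model completion at the earliest occurrence of any stop sequence."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```

Any `}` at the start of a line ends the completion, so a construct that closes a brace in column zero partway through a function body would be truncated early. That is the case a tighter stop would need to handle.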

Disclosure

Authored with Claude Code (Opus 4.7) operating as agent Cx-2 under direct human supervision (operator: @agourakis82). All translator decisions, edge-case handling, and acceptance criteria were reviewed and approved by the operator before push. This disclosure satisfies the ICMJE 2025 / GAIDeT contributor-statement requirement for agent-assisted authorship.

Sounio (https://github.com/Sounio-lang/sounio) is an epistemic systems
language with Rust-shaped type annotations.  This commit lands the
translator + evaluation backend so MultiPL-E can grade code-generation
models on .sio prompts.

Translator (dataset_builder/humaneval_to_sounio.py)
- Subclasses LanguageTranslator.
- Maps int/float/bool/str -> i64/f64/bool/String;
  List[T]/Dict[K,V]/Tuple[..]/Optional[T] -> Vec/HashMap/tuple/Option.
- Expands negative literals to (0 - n) (no unary minus in Sounio).
- Emits a generous `with Mut, Panic, Div` effect set on user fns and
  `with IO, Mut, Panic, Div` on the test-harness main.
- 148/161 HumanEval and 356/400 MBPP-typed prompts translate cleanly
  (parity with Rust).

Evaluation backend (evaluation/src/eval_sounio.py)
- `eval_script(path)` returning the standard MultiPL-E verdict shape
  (OK / SyntaxError / Exception / Timeout).
- Uses souc's raw pass-through (`souc <src> <out>`) because the
  `compile -o` subcommand is broken in 1.0.0-beta.5 — fix tracked
  upstream.
- chmod +x on the produced ELF (binary writer omits the bit).

Container (evaluation/Dockerfile.sounio)
- debian:bookworm-slim + python3 + a pinned souc-linux-x86_64
  (SHA256-verified). SOUNIO_VERSION / SOUNIO_BIN_SHA256 are the
  reproducibility contract.

Terms / docs
- dataset_builder/terms.csv: Sounio row.
- dataset_builder/sounio_translator_notes.md: type-mapping rationale,
  six handled edge-cases, known limitations.

Validation
- references/{hand,auto}/ + references/validate.py: 20/20 pairs PASS
  structurally (body region stripped — hand bodies are the human's
  contribution, the translator's contract is the prompt header + test
  harness).
- agent_logs/Cx2_acceptance.md: T1/T1.5/T3/T4/T6 PASS; T2/T5 deferred
  with explicit operator-action follow-ups.
- agent_logs/Cx2_convergence.md: three iterative-convergence cycles
  + an adversarial-self-critique table for ten random translations.

Generated with Claude Code (Opus 4.7) as Cx-2 under operator
supervision (GAIDeT / ICMJE-2025 agent-assisted authorship disclosure).

Cycle 4 of iterative convergence: ran souc 1.0.0-beta.5 on five trivial
hand bodies.  Three (gcd, largest_divisor, is_prime) compile and pass
all asserts end-to-end — real evidence the translator output is
consumable.  Two (strlen, triangle_area) failed typecheck because the
chosen Sounio surface ops (String.len, 'as f64' cast) aren't yet stable;
those bodies are now panic stubs so validate.py PASS stays honest.

Log captured in agent_logs/Cx2_convergence.md.

Adds the first published Sounio MultiPL-E baseline.  This is the floor
of the floor — a 1.3B base model with no Sounio in its training mix
cannot produce syntactically valid Sounio (143/148 SyntaxError, 5/148
Exception, 0/148 OK).  The point is provenance: every future Sounio
number now has this reference.

Spec deviation: spec calls for deepseek-coder-6.7b-base.  The 6.7B
safetensors mmap (9.97GiB) exceeds the eval host's vmem ulimit (24GiB);
fully documented in results/sounio_README.md.  scripts/run_baseline.py
supports --model so the operator can rerun on a larger host without
code changes.

Artifacts:
  results/sounio_deepseek-coder-1.3b.jsonl        (per-problem + completions)
  results/sounio_deepseek-coder-1.3b.summary.json (machine-readable)
  results/sounio.csv                              (upstream lang,problem,verdict format)
  results/sounio_README.md                        (methodology + reproduction)
  scripts/run_baseline.py                         (generation + grading driver)

Compute: single NVIDIA L4 (23GiB), CUDA 13.2, wall-clock 785s.