This repository was archived by the owner on Apr 18, 2026. It is now read-only.

feat(experiments): construction scheduling tier experiment #89

Open: runyaga wants to merge 11 commits into main from feat/experiment-tiers
runyaga (Owner) commented Mar 6, 2026

Summary

  • 4-tier construction scheduling experiment testing gpt-oss:20b vs 120b capability boundaries
  • ConstructionPlugin (20 host functions) moved from test/ to lib/src/experiments/ for CLI use
  • Critical bridge bug fix: on Exception → on Object in host function dispatch
  • StreamRegistry.select() gracefully handles exhausted handles

Changes

  • soliplex_interpreter_monty: _dispatchToolCall and _resolveFutures now catch Object instead of Exception, which prevents the platform from getting stuck in the Pending state when a host function throws an Error
  • soliplex_scripting: StreamRegistry.select() filters exhausted handles instead of throwing ArgumentError; streamSetup callback on factory; ConstructionPlugin moved to lib/
  • soliplex_cli: Experiment runner at bin/construction_experiment.dart
  • docs: Full experiment writeup with prompts, data, generated Python, and results

Results

| Tier | 20B | 120B |
|---|---|---|
| T1 Prescriptive | PASS (10s) | PASS (15s) |
| T2 Scheduler | PASS (10s, optimal) | PASS (flaky) |
| T3 Dispatcher | PASS (7s) | PASS (9s) |
| T4 Recovery | PARTIAL (2/5) | PASS (32s, 5/5) |

Key finding: 20B capability boundary is between T3 and T4. 20B lacks "executive function" for multi-turn self-correction.

Test plan

  • 105 bridge tests pass (including new Error-catching test)
  • 205 scripting tests pass (stream_registry updated)
  • Full 8-room experiment run completes successfully
  • T3 streams work end-to-end for both model sizes
  • Markdownlint passes on experiment doc

runyaga and others added 11 commits March 6, 2026 08:12
…iptive tests

Tier 1 of the experiment evolution plan: prescriptive baseline proving
the plugin architecture works for domain-specific agentic scheduling.

- Add ConstructionPlugin (18 domain functions) + ConstructionState engine
  with full constraint validation (trade, deps, weather, availability)
- Add _BufferedIterator with peek/consume pattern for data-safe stream racing
- Add StreamRegistry.select() — races multiple subscriptions, consumes only
  winner, losers keep cached futures (no data loss, no blocking)
- Add stream_select host function wired in HostFunctionWiring
- Add homebuilder_disruption_test.dart (original ScriptableBridge harness)
- Add construction_plugin_test.dart with 36 tests covering:
  - Domain model, constraint validation, conflict detection, disruptions
  - Bridge integration, Pattern F (full prescriptive schedule)
  - Stream-driven reactive (Pattern H), stream_select multiplexer
  - Agent integration (spawn_agent, ask_llm, FFI re-entrancy)
  - get_ready_jobs / advance_day scheduling loop
- Add 14 stream_registry_test.dart select tests including no-data-loss test
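The peek/consume racing described above can be sketched in Python with asyncio. This is an illustration of the idea only, not the Dart implementation: `BufferedIterator`, `select`, `ticker`, and `demo` are hypothetical names, and exhaustion handling is left out to keep the sketch short.

```python
import asyncio


class BufferedIterator:
    """Hypothetical stand-in: caches the in-flight 'next item' task until consumed."""

    def __init__(self, source):
        self._it = source.__aiter__()
        self._pending = None  # cached task for the next item, if any

    async def _next(self):
        return await self._it.__anext__()

    def peek(self):
        # Start the fetch once; later peeks reuse the same cached task.
        if self._pending is None:
            self._pending = asyncio.create_task(self._next())
        return self._pending

    def consume(self):
        # Hand out the resolved item and clear the cache for the next fetch.
        task, self._pending = self._pending, None
        return task.result()


async def select(streams):
    """Race all streams; consume only the winner, leave losers cached."""
    tasks = {it.peek(): name for name, it in streams.items()}
    done, _ = await asyncio.wait(list(tasks), return_when=asyncio.FIRST_COMPLETED)
    name = tasks[next(iter(done))]
    return name, streams[name].consume()


async def ticker(events, delay):
    # Toy event source (assumption): emits one event every `delay` seconds.
    for event in events:
        await asyncio.sleep(delay)
        yield event


async def demo():
    streams = {
        "weather": BufferedIterator(ticker(["rain", "sun", "fog"], 0.10)),
        "crew": BufferedIterator(ticker(["sick", "back"], 0.15)),
    }
    return [await select(streams) for _ in range(3)]
```

Because a losing stream keeps its cached task, the next `select` call waits on the same in-flight fetch instead of starting a new one, which is what prevents an event from being dropped between calls.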

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier 2 of experiment evolution: goal-oriented tests simulating what an
LLM generates when given ONLY a scheduling goal (no pseudocode steps).

- LLM-invented scheduling loop: query→check→assign→advance across 4 days
- Dependency awareness: LLM checks deps_met before assigning dependents
- Weather awareness: skip outdoor on rain, schedule indoor work instead
- ask_llm delegation: supervisor uses sub-agent for planning decisions
- Parallelism maximization: workers assigned to different houses concurrently

5 new tests proving the plugin API is sufficient for autonomous planning.
All 41 construction_plugin tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier 3 of experiment evolution: stream-driven disruption handling
with cascading events and multi-stream select racing.

- H: single disruption → unassign → reassign via stream subscription
- H+: cascading disruptions on same day (crew sick + weather change)
- H+: multi-stream select races weather vs crew streams (no data loss)
- H+: drain 6 events across 3 streams via select (weather/crew/material)
- H: stream exhaustion handled gracefully (null + cleanup)

5 new tests proving reactive re-planning through stream primitives.
All 46 construction_plugin tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier 3 fix: Add generic reactive handler loop (LLM-style) that
dispatches on event type dynamically — crew_noshow → unassign worker,
weather_change → update weather + move outdoor jobs. Uses stream_select
to race two event streams, processes events in a switch, and takes
actions based on payload content (not hardcoded sequences).
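A rough Python sketch of such a handler, dispatching on the event type rather than following a fixed sequence. The payload fields (`worker`, `weather`) and the shape of `state` are assumptions made for illustration; in the experiment the events would arrive via stream_select rather than a plain function call.

```python
def handle_event(event, state):
    """Dispatch on event type; the action depends on the payload content."""
    kind = event["type"]
    if kind == "crew_noshow":
        worker = event["worker"]
        # Unassign every job the missing worker was holding.
        freed = [job for job, w in state["assignments"].items() if w == worker]
        for job in freed:
            del state["assignments"][job]
        return {"unassigned": freed}
    if kind == "weather_change":
        state["weather"] = event["weather"]
        moved = []
        if event["weather"] == "rain":
            # Outdoor jobs cannot run in rain; take them off today's schedule.
            moved = [j for j in state["assignments"] if j in state["outdoor_jobs"]]
            for job in moved:
                del state["assignments"][job]
        return {"moved": moved}
    # Unknown event types are reported, not treated as errors.
    return {"ignored": kind}
```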

Tier 4: Error recovery tests proving the LLM can parse {ok: false}
errors from construction_assign and adjust its approach.

- J: trade mismatch error → query workers_for_trade → retry with correct worker
- J: dependency error → complete prerequisite → retry
- J: rain error → check weather → reschedule to sunny day
- J: double-booking → pick different day
- J+: cascading errors (wrong trade → fix → deps unmet → fix → success)
- J+: three consecutive errors before finding valid assignment
- I: infeasibility detection (only 1 concrete_crew for 2 foundation jobs)
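The cascading recovery case (wrong trade, then unmet dependencies, then success) might look like the following in a generated script. This is a sketch under stated assumptions: the stub `construction_assign` and `workers_for_trade` below only mimic the host functions, and the error codes and payload fields (`error`, `required`, `missing`) are invented for illustration; the source only states that failures come back as {ok: false}.

```python
WORKERS = {"alice": "electrician", "bob": "framer"}
JOBS = {
    "framing": {"trade": "framer", "deps": []},
    "wiring": {"trade": "electrician", "deps": ["framing"]},
}
DONE = set()  # completed jobs


def construction_assign(job, worker):
    """Stub host function (assumed shape): returns ok=False with a reason on failure."""
    spec = JOBS[job]
    if WORKERS[worker] != spec["trade"]:
        return {"ok": False, "error": "trade_mismatch", "required": spec["trade"]}
    missing = [d for d in spec["deps"] if d not in DONE]
    if missing:
        return {"ok": False, "error": "deps_unmet", "missing": missing}
    return {"ok": True}


def workers_for_trade(trade):
    """Stub host function (assumed shape): workers qualified for a trade."""
    return [w for w, t in WORKERS.items() if t == trade]


def assign_with_recovery(job, worker, max_attempts=5):
    """Retry an assignment, adjusting the approach based on each error."""
    log = []
    for _ in range(max_attempts):
        result = construction_assign(job, worker)
        if result["ok"]:
            log.append("ok")
            return worker, log
        log.append(result["error"])
        if result["error"] == "trade_mismatch":
            worker = workers_for_trade(result["required"])[0]
        elif result["error"] == "deps_unmet":
            DONE.update(result["missing"])  # stand-in for completing prerequisites
        else:
            break  # unknown error: give up and leave the decision to the caller
    return None, log
```

Calling `assign_with_recovery("wiring", "bob")` walks through both failure modes before succeeding, which mirrors the multi-turn self-correction that T4 measures.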

13 new tests. All 205 tests pass, zero lint issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DefaultMontyBridge._dispatchToolCall and _resolveFutures only caught
Exception, missing Error subclasses like ArgumentError. When a host
function threw an Error, the Monty platform was never resumed —
leaving it permanently stuck in Pending state. All subsequent
execute() calls failed with "Cannot call start() while execution
is active."
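The failure mode can be modeled in a few lines of Python. This is an analogy, not the bridge code: Python has no Exception/Error split like Dart's, so two unrelated hypothetical classes (`DartLikeException`, `DartLikeError`) stand in for it, and `Platform` and `dispatch` are toy names.

```python
class DartLikeException(Exception):
    """Stand-in for Dart's Exception hierarchy (assumption for this sketch)."""


class DartLikeError(Exception):
    """Stand-in for Dart's Error hierarchy, e.g. ArgumentError."""


class Platform:
    """Toy interpreter platform that must be resumed after every host call."""

    def __init__(self):
        self.state = "idle"
        self.last_error = None

    def start(self):
        if self.state != "idle":
            raise RuntimeError("Cannot call start() while execution is active.")
        self.state = "pending"

    def resume_with_error(self, message):
        self.state = "idle"
        self.last_error = message


def dispatch(platform, host_fn, catch):
    """Run a host function; `catch` is the exception class the handler covers."""
    platform.start()
    try:
        result = host_fn()
    except catch as exc:
        # Handler reached: the platform is resumed with the error.
        platform.resume_with_error(str(exc))
        return None
    platform.state = "idle"
    return result
```

With `catch=DartLikeException`, a host function raising `DartLikeError` escapes the handler and leaves the platform in "pending", so the next `start()` fails. With a catch-all (`catch=Exception` here, playing the role of `on Object`), the platform is always resumed.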

Also fixes StreamRegistry.select() to filter exhausted handles
instead of throwing ArgumentError — LLM loop patterns naturally
retry with stale handles after streams exhaust.
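A simplified Python sketch of the filtering behavior, assuming a registry keyed by integer handles. The class below is a hypothetical stand-in: it picks the first live handle synchronously instead of racing streams, and the return shape is invented for illustration.

```python
class StreamRegistry:
    """Toy registry: each handle maps to a list of remaining events."""

    def __init__(self, streams):
        self._queues = {handle: list(events) for handle, events in streams.items()}

    def select(self, handles):
        # Drop exhausted or unknown handles instead of raising.
        live = [h for h in handles if self._queues.get(h)]
        if not live:
            return None  # every requested stream is exhausted
        handle = live[0]  # stand-in for "first stream to produce an event"
        return {"handle": handle, "event": self._queues[handle].pop(0)}


def drain(registry, handles):
    """Loop pattern typical of generated scripts: reuse the same handle list."""
    events = []
    while (item := registry.select(handles)) is not None:
        events.append(item["event"])
    return events
```

The loop in `drain` keeps passing the original handle list even after some streams run dry, which is the pattern that previously triggered the ArgumentError.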

Includes:
- ConstructionPlugin moved from test/ to lib/src/experiments/
- Experiment runner at soliplex_cli/bin/construction_experiment.dart
- streamSetup callback on createMontyScriptEnvironmentFactory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full overview of 4-tier construction scheduling experiment testing
gpt-oss:20b vs 120b capability boundaries. Includes architecture,
test data, all system prompts, user messages, Dart host functions,
generated Python per room, and results matrix.

Key findings:
- T1-T3: 20B sufficient with scaffolded prompts
- T4: requires 120B for multi-turn self-correction
- Stream fix (on Object catch) unblocked T3 for both models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _CodeCapturingExtension that intercepts execute_python tool calls
to record generated Python code. Add --iterations CLI arg for repeated
runs. Output files now include all generated Python per tool call.

5-run eval results in construction-eval-5x-2026-03-06.md:
- T1-T3: 100% pass for both models (T1 120B hit intermittent HTTP 400 errors from Ollama)
- T4 20B: 20% full pass (stops after foundations)
- T4 120B: 60% full pass (deeper planning, can over-think)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _Verdict/_Check validation that checks actual ConstructionState
against expected outcomes per tier, replacing smoke-test-only eval.
Rewrite eval summary with validated 5-run results.

Key findings from validation:
- T4 20B drops from 20% to 0% (false positive: the model self-reported SUCCESS, but the state did not match)
- T4 120B rises from 60% to 100% (false negative: reaching the tool-depth limit did not mean the result was wrong)
- T3 end-state validation has blind spot for pre-seeded state
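The difference between a smoke test and state validation can be sketched as follows. The `Check` dataclass and `validate` function are hypothetical Python stand-ins, not the `_Verdict`/`_Check` code from the PR, and the "INCORRECT" label is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool


def validate(assignments, expected, self_report):
    """Score a run from the actual end state; the self-report is recorded, not trusted."""
    checks = [
        Check(f"{job} -> {worker}", assignments.get(job) == worker)
        for job, worker in expected.items()
    ]
    verdict = "CORRECT" if all(c.passed for c in checks) else "INCORRECT"
    return {"verdict": verdict, "self_report": self_report, "checks": checks}
```

A run that stops after the foundations but reports SUCCESS fails the framing check here, which is the kind of false positive that state validation exposes.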

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change _CodeCapturingExtension to record both the Python code sent
and the result returned for each execute_python call. Output format
now shows [code] and [result] sections per call.

Re-ran 5-iteration eval (v3) with answer capture. New findings:
- T2 20B code-as-text failure (0 tool calls, model outputs code in text)
- T4 20B first-ever CORRECT (run 3, 30s, 4 calls)
- T1 120B tool-depth failure (correct assignments, no advance_day)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Make maxToolDepth configurable on RunOrchestrator and AgentRuntime
  (default raised from 10 to 20)
- Experiment gives 20B 22 calls, 120B 20 calls
- Add T5 verify phase: after each CORRECT generation, 120B works
  backwards through the schedule to prove it correct
- Extract _runRoom helper to share between generation and verification
- Rewrite eval doc with mermaid diagrams per tier, two-phase design,
  and pairing explanation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update experiment doc with v4 generation + verification results
- Fix T5 validator bug: tool call count checked post-hoc instead
  of passing placeholder 0 to _validateVerify
- Add _fullVerifyVerdict for post-hoc T5 scoring with real call
  count and state preservation check
- 30/30 verified answers passed, 100% state preserved
- Note stream_subscribe bug (issue #90) in T3 section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>