feat(experiments): construction scheduling tier experiment by runyaga · Pull Request #89 · runyaga/flutter

runyaga · 2026-03-06T15:57:31Z

Summary

4-tier construction scheduling experiment testing gpt-oss:20b vs 120b capability boundaries
ConstructionPlugin (20 host functions) moved from test/ to lib/src/experiments/ for CLI use
Critical bridge bug fix: on Exception → on Object in host function dispatch
StreamRegistry.select() gracefully handles exhausted handles

Changes

soliplex_interpreter_monty: _dispatchToolCall and _resolveFutures now catch Object instead of Exception, so a host function that throws an Error no longer leaves the platform stuck in the Pending state
soliplex_scripting: StreamRegistry.select() filters out exhausted handles instead of throwing ArgumentError; createMontyScriptEnvironmentFactory gains a streamSetup callback; ConstructionPlugin moved to lib/
soliplex_cli: Experiment runner at bin/construction_experiment.dart
docs: Full experiment writeup with prompts, data, generated Python, and results
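
The StreamRegistry.select() change can be pictured with a small Python sketch. This is not the Dart implementation from the PR: the class below is a hypothetical, synchronous stand-in that polls handles in order rather than racing them, and it only illustrates the "filter exhausted handles, return nothing when none remain" behaviour.

```python
class StreamRegistry:
    """Hypothetical, simplified registry: handle -> iterator of events."""

    def __init__(self):
        self._streams = {}
        self._next_handle = 1

    def subscribe(self, events):
        handle = self._next_handle
        self._next_handle += 1
        self._streams[handle] = iter(events)
        return handle

    def select(self, handles):
        # Stale handles (already exhausted and cleaned up) are ignored
        # instead of raising, so a retry loop can keep passing them in.
        live = [h for h in handles if h in self._streams]
        for h in live:
            try:
                event = next(self._streams[h])
            except StopIteration:
                # The stream just ran dry: clean it up and try the next one.
                del self._streams[h]
                continue
            return {"handle": h, "event": event}
        # Every handle is exhausted: signal "no more events" with None.
        return None
```

A caller can therefore loop on `select([weather, crew])` until it returns `None`, without tracking which handles are still open.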

Results

| Tier | 20B | 120B |
| --- | --- | --- |
| T1 Prescriptive | PASS (10s) | PASS (15s) |
| T2 Scheduler | PASS (10s, optimal) | PASS (flaky) |
| T3 Dispatcher | PASS (7s) | PASS (9s) |
| T4 Recovery | PARTIAL (2/5) | PASS (32s, 5/5) |

Key finding: the 20B capability boundary lies between T3 and T4. The 20B model lacks the "executive function" needed for multi-turn self-correction.

Test plan

105 bridge tests pass (including new Error-catching test)
205 scripting tests pass (stream_registry updated)
Full 8-room experiment run completes successfully
T3 streams work end-to-end for both model sizes
Markdownlint passes on experiment doc

…iptive tests

Tier 1 of the experiment evolution plan: prescriptive baseline proving the plugin architecture works for domain-specific agentic scheduling.

- Add ConstructionPlugin (18 domain functions) + ConstructionState engine with full constraint validation (trade, deps, weather, availability)
- Add _BufferedIterator with peek/consume pattern for data-safe stream racing
- Add StreamRegistry.select(): races multiple subscriptions and consumes only the winner; losers keep cached futures (no data loss, no blocking)
- Add stream_select host function wired in HostFunctionWiring
- Add homebuilder_disruption_test.dart (original ScriptableBridge harness)
- Add construction_plugin_test.dart with 36 tests covering:
  - Domain model, constraint validation, conflict detection, disruptions
  - Bridge integration, Pattern F (full prescriptive schedule)
  - Stream-driven reactive (Pattern H), stream_select multiplexer
  - Agent integration (spawn_agent, ask_llm, FFI re-entrancy)
  - get_ready_jobs / advance_day scheduling loop
- Add 14 stream_registry_test.dart select tests including no-data-loss test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
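
The peek/consume idea behind data-safe stream racing can be sketched in Python with asyncio. This is an illustration under assumptions, not a port of the Dart `_BufferedIterator`: the class, `select`, `ticker`, and `drain` below are hypothetical names. Each iterator caches its in-flight "next" task, the race consumes only the winner, and the losers keep their cached task for the next round.

```python
import asyncio


class BufferedIterator:
    """Wraps an async iterator and caches the in-flight 'next' task."""

    DONE = object()  # sentinel returned once the source is exhausted

    def __init__(self, source):
        self._source = source.__aiter__()
        self._pending = None

    async def _pull(self):
        try:
            return await self._source.__anext__()
        except StopAsyncIteration:
            return BufferedIterator.DONE

    def peek(self):
        # Start at most one pull; repeated peeks reuse the same task.
        if self._pending is None:
            self._pending = asyncio.ensure_future(self._pull())
        return self._pending

    def consume(self):
        # Hand out the finished value and clear the cache.
        task, self._pending = self._pending, None
        return task.result()


async def select(iterators):
    """Race all iterators; consume only one that has finished."""
    tasks = [it.peek() for it in iterators.values()]
    await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for name, it in iterators.items():
        if it.peek().done():
            return name, it.consume()


async def ticker(events, delay):
    # Hypothetical event source used only for the demonstration.
    for event in events:
        await asyncio.sleep(delay)
        yield event


async def drain(iterators):
    live = dict(iterators)
    seen = []
    while live:
        name, value = await select(live)
        if value is BufferedIterator.DONE:
            del live[name]
        else:
            seen.append((name, value))
    return seen
```

Because a losing iterator's task stays cached, an event that arrives while another stream wins is delivered by a later `select` call rather than dropped.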

Tier 2 of experiment evolution: goal-oriented tests simulating what an LLM generates when given ONLY a scheduling goal (no pseudocode steps).

- LLM-invented scheduling loop: query→check→assign→advance across 4 days
- Dependency awareness: LLM checks deps_met before assigning dependents
- Weather awareness: skip outdoor on rain, schedule indoor work instead
- ask_llm delegation: supervisor uses sub-agent for planning decisions
- Parallelism maximization: workers assigned to different houses concurrently

5 new tests proving the plugin API is sufficient for autonomous planning. All 41 construction_plugin tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
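
A query→check→assign→advance loop of the kind described above might look roughly like the following. This is a minimal sketch against a hypothetical in-memory `Site` class; only the names `get_ready_jobs` and `advance_day` come from the commit message, and their real signatures and return shapes are assumptions.

```python
class Site:
    """Hypothetical stand-in for the plugin state; not the real API."""

    def __init__(self, jobs):
        self.jobs = jobs            # job -> list of prerequisite jobs
        self.done = set()
        self.day = 1
        self.assigned_today = []

    def get_ready_jobs(self):
        # A job is ready once all of its prerequisites are complete.
        return [
            job for job, deps in self.jobs.items()
            if job not in self.done
            and job not in self.assigned_today
            and all(dep in self.done for dep in deps)
        ]

    def assign(self, job):
        self.assigned_today.append(job)
        return {"ok": True, "job": job, "day": self.day}

    def advance_day(self):
        # Work assigned today is treated as finished by tomorrow.
        self.done.update(self.assigned_today)
        self.assigned_today = []
        self.day += 1


def run_schedule(site, max_days=4):
    schedule = {}
    while len(site.done) < len(site.jobs) and site.day <= max_days:
        for job in site.get_ready_jobs():      # query + dependency check
            result = site.assign(job)          # assign
            if result["ok"]:
                schedule[job] = result["day"]
        site.advance_day()                     # advance
    return schedule
```

Jobs with the same satisfied prerequisites land on the same day, which is the parallelism the Tier 2 tests look for.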

Tier 3 of experiment evolution: stream-driven disruption handling with cascading events and multi-stream select racing.

- H: single disruption → unassign → reassign via stream subscription
- H+: cascading disruptions on same day (crew sick + weather change)
- H+: multi-stream select races weather vs crew streams (no data loss)
- H+: drain 6 events across 3 streams via select (weather/crew/material)
- H: stream exhaustion handled gracefully (null + cleanup)

5 new tests proving reactive re-planning through stream primitives. All 46 construction_plugin tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tier 3 fix: Add generic reactive handler loop (LLM-style) that dispatches on event type dynamically: crew_noshow → unassign worker, weather_change → update weather + move outdoor jobs. Uses stream_select to race two event streams, processes events in a switch, and takes actions based on payload content (not hardcoded sequences).

Tier 4: Error recovery tests proving the LLM can parse {ok: false} errors from construction_assign and adjust its approach.

- J: trade mismatch error → query workers_for_trade → retry with correct worker
- J: dependency error → complete prerequisite → retry
- J: rain error → check weather → reschedule to sunny day
- J: double-booking → pick different day
- J+: cascading errors (wrong trade → fix → deps unmet → fix → success)
- J+: three consecutive errors before finding valid assignment
- I: infeasibility detection (only 1 concrete_crew for 2 foundation jobs)

13 new tests. All 205 tests pass, zero lint issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
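
The trade-mismatch recovery path (pattern J) could be sketched as follows. The stubs are hypothetical: `construction_assign` and `workers_for_trade` are named in the commit message, but their parameters and the `error` / `required_trade` fields used here are assumptions made for illustration.

```python
# Hypothetical test data; not taken from the experiment.
WORKERS = {"alice": "plumber", "bob": "electrician"}
JOBS = {"wiring": {"trade": "electrician"}}


def construction_assign(job, worker, day):
    """Stub host function: returns {ok: False, ...} instead of raising."""
    required = JOBS[job]["trade"]
    if WORKERS[worker] != required:
        return {"ok": False, "error": "trade_mismatch", "required_trade": required}
    return {"ok": True, "job": job, "worker": worker, "day": day}


def workers_for_trade(trade):
    return [name for name, t in WORKERS.items() if t == trade]


def assign_with_recovery(job, worker, day, max_attempts=3):
    """Try an assignment, read the error payload, and retry with a fix."""
    attempts = []
    for _ in range(max_attempts):
        result = construction_assign(job, worker, day)
        attempts.append(result)
        if result["ok"]:
            break
        if result["error"] == "trade_mismatch":
            candidates = workers_for_trade(result["required_trade"])
            if not candidates:
                break  # nothing to retry with: report infeasible upstream
            worker = candidates[0]
        else:
            break  # unknown error type: stop rather than loop blindly
    return attempts
```

Calling `assign_with_recovery("wiring", "alice", 1)` records one failed attempt followed by a successful assignment to `bob`.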

DefaultMontyBridge._dispatchToolCall and _resolveFutures only caught Exception, missing Error subclasses like ArgumentError. When a host function threw an Error, the Monty platform was never resumed, leaving it permanently stuck in Pending state. All subsequent execute() calls failed with "Cannot call start() while execution is active."

Also fixes StreamRegistry.select() to filter exhausted handles instead of throwing ArgumentError. LLM loop patterns naturally retry with stale handles after streams exhaust.

Includes:

- ConstructionPlugin moved from test/ to lib/src/experiments/
- Experiment runner at soliplex_cli/bin/construction_experiment.dart
- streamSetup callback on createMontyScriptEnvironmentFactory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
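
The failure mode has a close analogue in Python, where `except Exception` does not catch classes that derive directly from `BaseException`. The sketch below is an analogy only, not the Dart bridge code: `Platform`, `HostError`, and both dispatch functions are hypothetical, and `BaseException` stands in for Dart's `Object` catch.

```python
class Platform:
    """Hypothetical stand-in for the interpreter platform's state machine."""

    def __init__(self):
        self.state = "idle"
        self.last_result = None

    def start(self):
        if self.state != "idle":
            raise RuntimeError("Cannot call start() while execution is active.")
        self.state = "pending"

    def resume(self, result):
        self.last_result = result
        self.state = "idle"


class HostError(BaseException):
    """Plays the role of a Dart Error: NOT a subclass of Exception."""


def dispatch_narrow(platform, host_fn):
    platform.start()
    try:
        result = {"ok": True, "value": host_fn()}
    except Exception as exc:  # misses HostError, like `on Exception` misses Error
        result = {"ok": False, "error": str(exc)}
    platform.resume(result)   # never reached when HostError propagates


def dispatch_broad(platform, host_fn):
    platform.start()
    try:
        result = {"ok": True, "value": host_fn()}
    except BaseException as exc:  # analogous to `on Object`
        result = {"ok": False, "error": str(exc)}
    platform.resume(result)   # always reached, so the platform is resumed
```

With the narrow catch, the platform stays in `pending` and the next `start()` fails; with the broad catch, the failure is turned into an error result and the platform returns to `idle`.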

Full overview of 4-tier construction scheduling experiment testing gpt-oss:20b vs 120b capability boundaries. Includes architecture, test data, all system prompts, user messages, Dart host functions, generated Python per room, and results matrix.

Key findings:

- T1-T3: 20B sufficient with scaffolded prompts
- T4: requires 120B for multi-turn self-correction
- Stream fix (on Object catch) unblocked T3 for both models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add _CodeCapturingExtension that intercepts execute_python tool calls to record generated Python code. Add --iterations CLI arg for repeated runs. Output files now include all generated Python per tool call.

5-run eval results in construction-eval-5x-2026-03-06.md:

- T1-T3: 100% pass both models (T1 120B has Ollama flaky 400s)
- T4 20B: 20% full pass (stops after foundations)
- T4 120B: 60% full pass (deeper planning, can over-think)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add _Verdict/_Check validation that checks actual ConstructionState against expected outcomes per tier, replacing smoke-test-only eval. Rewrite eval summary with validated 5-run results.

Key findings from validation:

- T4 20B drops from 20% to 0% (false positive: self-reported SUCCESS)
- T4 120B rises from 60% to 100% (false negative: tool depth != wrong)
- T3 end-state validation has blind spot for pre-seeded state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
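
The difference between self-reported success and state-based validation can be sketched as below. This is not the `_Verdict`/`_Check` code from the PR: the `Check` dataclass, the `verdict` function, the job and worker names, and the `INCORRECT` label are all hypothetical, chosen only to show why scoring from the end state removes the false positive.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    passed: bool


def verdict(state, expected, self_reported):
    """Score a run from the actual end state; the model's claim is only recorded."""
    checks = [
        Check(f"{job} -> {worker}", state.get(job) == worker)
        for job, worker in expected.items()
    ]
    label = "CORRECT" if all(c.passed for c in checks) else "INCORRECT"
    return {"verdict": label, "self_reported": self_reported, "checks": checks}
```

A run that prints SUCCESS but leaves a job unassigned fails its check, so the reported status never influences the verdict.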

Change _CodeCapturingExtension to record both the Python code sent and the result returned for each execute_python call. Output format now shows [code] and [result] sections per call.

Re-ran 5-iteration eval (v3) with answer capture. New findings:

- T2 20B code-as-text failure (0 tool calls, model outputs code in text)
- T4 20B first-ever CORRECT (run 3, 30s, 4 calls)
- T1 120B tool-depth failure (correct assignments, no advance_day)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Make maxToolDepth configurable on RunOrchestrator and AgentRuntime (default raised from 10 to 20)
- Experiment gives 20B 22 calls, 120B 20 calls
- Add T5 verify phase: after each CORRECT generation, 120B works backwards through the schedule to prove it correct
- Extract _runRoom helper to share between generation and verification
- Rewrite eval doc with mermaid diagrams per tier, two-phase design, and pairing explanation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Update experiment doc with v4 generation + verification results
- Fix T5 validator bug: tool call count checked post-hoc instead of passing placeholder 0 to _validateVerify
- Add _fullVerifyVerdict for post-hoc T5 scoring with real call count and state preservation check
- 30/30 verified answers passed, 100% state preserved
- Note stream_subscribe bug (issue #90) in T3 section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

runyaga and others added 11 commits March 6, 2026 08:12

runyaga wants to merge 11 commits into main from feat/experiment-tiers
