This repository was archived by the owner on Apr 18, 2026. It is now read-only.
feat(experiments): construction scheduling tier experiment #89
Open
runyaga wants to merge 11 commits into
…iptive tests
Tier 1 of the experiment evolution plan: prescriptive baseline proving the plugin architecture works for domain-specific agentic scheduling.
- Add ConstructionPlugin (18 domain functions) + ConstructionState engine with full constraint validation (trade, deps, weather, availability)
- Add _BufferedIterator with peek/consume pattern for data-safe stream racing
- Add StreamRegistry.select(): races multiple subscriptions, consumes only the winner, losers keep cached futures (no data loss, no blocking)
- Add stream_select host function wired in HostFunctionWiring
- Add homebuilder_disruption_test.dart (original ScriptableBridge harness)
- Add construction_plugin_test.dart with 36 tests covering:
  - Domain model, constraint validation, conflict detection, disruptions
  - Bridge integration, Pattern F (full prescriptive schedule)
  - Stream-driven reactive (Pattern H), stream_select multiplexer
  - Agent integration (spawn_agent, ask_llm, FFI re-entrancy)
  - get_ready_jobs / advance_day scheduling loop
- Add 14 stream_registry_test.dart select tests including no-data-loss test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
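To make the peek/consume racing described in this commit concrete, here is a minimal Python asyncio sketch. It is not the Dart implementation from this PR: `BufferedIterator`, `select`, and the `_DONE` sentinel are hypothetical names chosen for illustration, and the sketch only assumes the behaviour stated above (only the winner is consumed, losers keep their cached future, exhausted handles are filtered out).

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Optional, Tuple

_DONE = object()  # hypothetical sentinel marking an exhausted stream


async def _next_or_done(source: AsyncIterator[Any]) -> Any:
    # Turn end-of-stream into a value so the race never sees an exception.
    try:
        return await source.__anext__()
    except StopAsyncIteration:
        return _DONE


class BufferedIterator:
    """Async iterator wrapper with a peek/consume pair (illustrative only)."""

    def __init__(self, source: AsyncIterator[Any]) -> None:
        self._source = source
        self._pending: Optional[asyncio.Future] = None
        self.exhausted = False

    def peek(self) -> asyncio.Future:
        # Fetch the next item at most once and cache the future, so a
        # stream that loses a race still holds its item for the next call.
        if self._pending is None:
            self._pending = asyncio.ensure_future(_next_or_done(self._source))
        return self._pending

    def consume(self) -> None:
        # Drop the cached future; the next peek() fetches a fresh item.
        self._pending = None


async def select(
    handles: Dict[str, BufferedIterator],
) -> Optional[Tuple[str, Any]]:
    """Race all live handles and consume only the winner's item."""
    live = {name: it for name, it in handles.items() if not it.exhausted}
    while live:
        futures = {name: it.peek() for name, it in live.items()}
        await asyncio.wait(
            list(futures.values()), return_when=asyncio.FIRST_COMPLETED
        )
        for name, fut in futures.items():
            if not fut.done():
                continue
            item = fut.result()
            live[name].consume()
            if item is _DONE:
                # Filter the exhausted handle and race the remaining ones.
                live[name].exhausted = True
                del live[name]
                break
            return name, item
    return None  # every stream is exhausted
```

Calling `select` in a loop until it returns `None` drains every stream exactly once, because an item that arrived on a losing stream stays in that stream's cached future until a later call picks it.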
Tier 2 of experiment evolution: goal-oriented tests simulating what an LLM generates when given ONLY a scheduling goal (no pseudocode steps).
- LLM-invented scheduling loop: query→check→assign→advance across 4 days
- Dependency awareness: LLM checks deps_met before assigning dependents
- Weather awareness: skip outdoor on rain, schedule indoor work instead
- ask_llm delegation: supervisor uses sub-agent for planning decisions
- Parallelism maximization: workers assigned to different houses concurrently
5 new tests proving the plugin API is sufficient for autonomous planning. All 41 construction_plugin tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
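The query→check→assign→advance loop named in this commit can be sketched against a toy in-memory state. This is an illustration under assumptions, not the plugin's API: `ToyState`, `Job`, and `run_schedule` are hypothetical, and only the method names `get_ready_jobs` and `advance_day` echo functions mentioned in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Job:
    name: str
    trade: str
    deps: List[str] = field(default_factory=list)
    done: bool = False


class ToyState:
    """Hypothetical stand-in for the scheduling state used in the tests."""

    def __init__(self, jobs: List[Job], workers: Dict[str, str]) -> None:
        self.jobs = {j.name: j for j in jobs}
        self.workers = workers  # worker name -> trade
        self.day = 1
        self.assigned: Dict[str, str] = {}  # worker -> job for the current day
        self.log: List[Tuple[int, str, str]] = []  # (day, worker, job)

    def get_ready_jobs(self) -> List[Job]:
        # Query + check: unfinished, unassigned jobs whose dependencies are done.
        busy = set(self.assigned.values())
        return [
            j for j in self.jobs.values()
            if not j.done
            and j.name not in busy
            and all(self.jobs[d].done for d in j.deps)
        ]

    def assign(self, job_name: str, worker: str) -> bool:
        # Reject a trade mismatch or a worker already booked today.
        if self.workers[worker] != self.jobs[job_name].trade:
            return False
        if worker in self.assigned:
            return False
        self.assigned[worker] = job_name
        self.log.append((self.day, worker, job_name))
        return True

    def advance_day(self) -> None:
        # Assume each job takes exactly one day (a simplification).
        for job_name in self.assigned.values():
            self.jobs[job_name].done = True
        self.assigned.clear()
        self.day += 1


def run_schedule(state: ToyState, max_days: int = 10) -> List[Tuple[int, str, str]]:
    while state.day <= max_days and not all(j.done for j in state.jobs.values()):
        for job in state.get_ready_jobs():
            for worker in state.workers:
                if state.assign(job.name, worker):
                    break  # job placed; move on to the next ready job
        state.advance_day()
    return state.log
```

The `max_days` bound keeps the loop from spinning forever when a job can never be placed, for example when no worker has the required trade.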
Tier 3 of experiment evolution: stream-driven disruption handling with cascading events and multi-stream select racing.
- H: single disruption → unassign → reassign via stream subscription
- H+: cascading disruptions on same day (crew sick + weather change)
- H+: multi-stream select races weather vs crew streams (no data loss)
- H+: drain 6 events across 3 streams via select (weather/crew/material)
- H: stream exhaustion handled gracefully (null + cleanup)
5 new tests proving reactive re-planning through stream primitives. All 46 construction_plugin tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tier 3 fix: Add generic reactive handler loop (LLM-style) that
dispatches on event type dynamically — crew_noshow → unassign worker,
weather_change → update weather + move outdoor jobs. Uses stream_select
to race two event streams, processes events in a switch, and takes
actions based on payload content (not hardcoded sequences).
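A dispatch-on-event-type handler of the kind described above might look like the following sketch. The state layout, the handler names, and the event payload fields (`worker`, `weather`) are assumptions made for illustration; only the event type names `crew_noshow` and `weather_change` come from this commit message, and the events are fed from a plain list rather than from `stream_select`.

```python
from typing import Any, Callable, Dict, Iterable, List

State = Dict[str, Any]
Event = Dict[str, Any]


def on_crew_noshow(state: State, event: Event) -> None:
    # Unassign every job held by the absent worker.
    for job, worker in state["assignments"].items():
        if worker == event["worker"]:
            state["assignments"][job] = None


def on_weather_change(state: State, event: Event) -> None:
    # Record the new weather, then pull outdoor jobs off the day if it rains.
    state["weather"] = event["weather"]
    if event["weather"] == "rain":
        for job in state["outdoor_jobs"]:
            if state["assignments"].get(job) is not None:
                state["assignments"][job] = None
                state["deferred"].append(job)


# Dispatch table keyed by event type (plays the role of the switch).
HANDLERS: Dict[str, Callable[[State, Event], None]] = {
    "crew_noshow": on_crew_noshow,
    "weather_change": on_weather_change,
}


def process_events(state: State, events: Iterable[Event]) -> List[str]:
    """Apply each event to the state; return the types nobody handled."""
    unhandled: List[str] = []
    for event in events:
        handler = HANDLERS.get(event.get("type"))
        if handler is None:
            unhandled.append(str(event.get("type")))
            continue
        handler(state, event)
    return unhandled
```

Because the action is chosen from the payload at run time, the same loop copes with events arriving in any order, which is the property the commit contrasts with hardcoded sequences.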
Tier 4: Error recovery tests proving the LLM can parse {ok: false}
errors from construction_assign and adjust its approach.
- J: trade mismatch error → query workers_for_trade → retry with correct worker
- J: dependency error → complete prerequisite → retry
- J: rain error → check weather → reschedule to sunny day
- J: double-booking → pick different day
- J+: cascading errors (wrong trade → fix → deps unmet → fix → success)
- J+: three consecutive errors before finding valid assignment
- I: infeasibility detection (only 1 concrete_crew for 2 foundation jobs)
13 new tests. All 205 tests pass, zero lint issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
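The Tier 4 recovery pattern (read an `{ok: false}` result, fix the cause, retry) can be sketched as below. Everything here is a toy under assumptions: the `assign` function, the error codes (`trade_mismatch`, `deps_unmet`, `rain`, `double_booked`), and the extra fields `needed` and `missing` are invented for illustration and are not the actual return shape of `construction_assign`.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]


def assign(state: State, job: str, worker: str, day: int) -> Dict[str, Any]:
    """Toy assignment call that reports failures as {ok: False, error: ...}."""
    spec = state["jobs"][job]
    if state["workers"][worker] != spec["trade"]:
        return {"ok": False, "error": "trade_mismatch", "needed": spec["trade"]}
    missing = [d for d in spec["deps"] if d not in state["done"]]
    if missing:
        return {"ok": False, "error": "deps_unmet", "missing": missing}
    if spec["outdoor"] and state["weather"].get(day) == "rain":
        return {"ok": False, "error": "rain"}
    if (worker, day) in state["booked"]:
        return {"ok": False, "error": "double_booked"}
    state["booked"].add((worker, day))
    return {"ok": True, "job": job, "worker": worker, "day": day}


def assign_with_recovery(
    state: State, job: str, worker: str, day: int, max_attempts: int = 5
) -> Tuple[Dict[str, Any], List[str]]:
    """Retry the assignment, adjusting one input per reported error."""
    attempts: List[str] = []
    for _ in range(max_attempts):
        result = assign(state, job, worker, day)
        attempts.append(result.get("error", "ok"))
        if result["ok"]:
            return result, attempts
        error = result["error"]
        if error == "trade_mismatch":
            # Look up a worker with the required trade.
            match = next(
                (w for w, t in state["workers"].items() if t == result["needed"]),
                None,
            )
            if match is None:
                break  # infeasible: nobody has the trade
            worker = match
        elif error == "deps_unmet":
            # Toy shortcut: mark prerequisites complete instead of scheduling them.
            state["done"].update(result["missing"])
        elif error in ("rain", "double_booked"):
            day += 1  # try the next day
        else:
            break
    return {"ok": False, "error": "gave_up"}, attempts
```

The attempt log makes cascades visible: a single call can pass through several distinct errors before it succeeds, as in the J+ cases listed above.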
DefaultMontyBridge._dispatchToolCall and _resolveFutures only caught Exception, missing Error subclasses like ArgumentError. When a host function threw an Error, the Monty platform was never resumed, leaving it permanently stuck in Pending state. All subsequent execute() calls failed with "Cannot call start() while execution is active."
Also fixes StreamRegistry.select() to filter exhausted handles instead of throwing ArgumentError: LLM loop patterns naturally retry with stale handles after streams exhaust.
Includes:
- ConstructionPlugin moved from test/ to lib/src/experiments/
- Experiment runner at soliplex_cli/bin/construction_experiment.dart
- streamSetup callback on createMontyScriptEnvironmentFactory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
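The failure mode in this commit has a rough Python analogue, sketched below. It is an analogy only, not the Dart fix: Python's `except Exception` misses classes derived directly from `BaseException` in much the same way that Dart's `on Exception` misses `Error`. `ToyBridge`, `HostError`, and both dispatch methods are hypothetical.

```python
from typing import Any, Callable, Dict


class HostError(BaseException):
    """Stand-in for a Dart Error: deliberately outside the Exception tree."""


class ToyBridge:
    """Hypothetical bridge that must be resumed after every tool call."""

    def __init__(self) -> None:
        self.pending = False

    def start(self) -> None:
        if self.pending:
            raise RuntimeError("Cannot call start() while execution is active.")
        self.pending = True

    def resume(self, value: Dict[str, Any]) -> Dict[str, Any]:
        self.pending = False
        return value

    def dispatch_narrow(self, fn: Callable[[], Any]) -> Dict[str, Any]:
        # Narrow catch: a HostError escapes and resume() is never called.
        self.start()
        try:
            return self.resume({"ok": True, "value": fn()})
        except Exception as exc:
            return self.resume({"ok": False, "error": str(exc)})

    def dispatch_broad(self, fn: Callable[[], Any]) -> Dict[str, Any]:
        # Broad catch: every failure is reported and the bridge is resumed.
        # (Real Python code would usually re-raise KeyboardInterrupt and
        # SystemExit; this sketch keeps the analogy simple.)
        self.start()
        try:
            return self.resume({"ok": True, "value": fn()})
        except BaseException as exc:
            return self.resume({"ok": False, "error": str(exc)})
```

With the narrow catch, one escaped error leaves `pending` set, so every later `start()` fails; with the broad catch, the error comes back as an ordinary `{ok: False}` result and the bridge stays usable.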
Full overview of 4-tier construction scheduling experiment testing gpt-oss:20b vs 120b capability boundaries. Includes architecture, test data, all system prompts, user messages, Dart host functions, generated Python per room, and results matrix.
Key findings:
- T1-T3: 20B sufficient with scaffolded prompts
- T4: requires 120B for multi-turn self-correction
- Stream fix (on Object catch) unblocked T3 for both models
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _CodeCapturingExtension that intercepts execute_python tool calls to record generated Python code. Add --iterations CLI arg for repeated runs. Output files now include all generated Python per tool call.
5-run eval results in construction-eval-5x-2026-03-06.md:
- T1-T3: 100% pass both models (T1 120B has Ollama flaky 400s)
- T4 20B: 20% full pass (stops after foundations)
- T4 120B: 60% full pass (deeper planning, can over-think)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _Verdict/_Check validation that checks actual ConstructionState against expected outcomes per tier, replacing smoke-test-only eval. Rewrite eval summary with validated 5-run results.
Key findings from validation:
- T4 20B drops from 20% to 0% (false positive: self-reported SUCCESS)
- T4 120B rises from 60% to 100% (false negative: tool depth != wrong)
- T3 end-state validation has blind spot for pre-seeded state
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
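Validating the end state rather than the model's own report could be sketched as follows. `Check`, `Verdict`, and `validate` are hypothetical Python stand-ins that mirror the idea of `_Verdict`/`_Check`, not the actual Dart classes; the state is reduced to a job → worker mapping for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    detail: str = ""


@dataclass
class Verdict:
    checks: List[Check] = field(default_factory=list)

    @property
    def correct(self) -> bool:
        # An empty verdict is not a pass: something must have been checked.
        return bool(self.checks) and all(c.passed for c in self.checks)

    def failures(self) -> List[str]:
        return [c.name for c in self.checks if not c.passed]


def validate(
    actual: Dict[str, Optional[str]], expected: Dict[str, str]
) -> Verdict:
    """Compare the actual job -> worker state against the expected outcome."""
    verdict = Verdict()
    for job, worker in expected.items():
        got = actual.get(job)
        verdict.checks.append(
            Check(f"{job} assigned to {worker}", got == worker, f"got {got!r}")
        )
    return verdict
```

A run whose final message says SUCCESS but whose state lacks an expected assignment fails here, which is the kind of false positive the validation findings above describe.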
Change _CodeCapturingExtension to record both the Python code sent and the result returned for each execute_python call. Output format now shows [code] and [result] sections per call.
Re-ran 5-iteration eval (v3) with answer capture. New findings:
- T2 20B code-as-text failure (0 tool calls, model outputs code in text)
- T4 20B first-ever CORRECT (run 3, 30s, 4 calls)
- T1 120B tool-depth failure (correct assignments, no advance_day)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Make maxToolDepth configurable on RunOrchestrator and AgentRuntime (default raised from 10 to 20)
- Experiment gives 20B 22 calls, 120B 20 calls
- Add T5 verify phase: after each CORRECT generation, 120B works backwards through the schedule to prove it correct
- Extract _runRoom helper to share between generation and verification
- Rewrite eval doc with mermaid diagrams per tier, two-phase design, and pairing explanation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update experiment doc with v4 generation + verification results
- Fix T5 validator bug: tool call count checked post-hoc instead of passing placeholder 0 to _validateVerify
- Add _fullVerifyVerdict for post-hoc T5 scoring with real call count and state preservation check
- 30/30 verified answers passed, 100% state preserved
- Note stream_subscribe bug (issue #90) in T3 section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- on Exception → on Object in host function dispatch
Changes
- _dispatchToolCall and _resolveFutures catch Object, not Exception: prevents the platform from getting stuck in Pending state when host functions throw Error
- StreamRegistry.select() filters exhausted handles instead of throwing ArgumentError
- streamSetup callback on factory
- ConstructionPlugin moved to lib/; experiment runner at bin/construction_experiment.dart
Results
Key finding: 20B capability boundary is between T3 and T4. 20B lacks "executive function" for multi-turn self-correction.
Test plan