Claude Code Hackathon Project Feb 2026 (Built with Opus 4.6)
Current AI agents usually improve only inside one chat.
New chat, same mistakes again.
That breaks real productivity.
Project north-star guidance is maintained in CLAUDE.md (and AGENTS.md, which symlinks to it).
Most agent memory systems are either:
- manual (user must maintain notes/skills/docs),
- session-local (wiped between chats),
- or unsafe (retrieves irrelevant memory and causes contamination).
If the user has to manage memory themselves, the product does not scale.
Memory V2: an automatic memory loop that learns from failures and reuses lessons across runs.
Core loop:
- Agent acts.
- Runtime captures failures/progress signals.
- Failure is fingerprinted and tagged.
- Retrieval injects relevant lessons (pre-run + on-error).
- End-of-run outcomes update lesson utility.
- Lessons are promoted/suppressed over time.
This enables learning across sessions without requiring users to manually maintain skill docs.
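The loop above can be sketched in miniature. Everything here is illustrative, not the actual Memory V2 implementation: the `LessonStore` name, the utility increments, and the suppression threshold are assumptions chosen to show the shape of fingerprint → retrieve → update.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Lesson:
    fingerprint: str
    tags: frozenset
    text: str
    utility: float = 0.0   # promoted when positive, suppressed when negative

class LessonStore:
    """Toy sketch of the Memory V2 loop (names and numbers are illustrative)."""

    def __init__(self):
        self.lessons = {}

    @staticmethod
    def fingerprint(domain: str, error: str) -> str:
        # Failure fingerprinting: normalize so the same class of error
        # maps to the same lesson across sessions.
        key = f"{domain}:{error.strip().lower()}"
        return hashlib.sha256(key.encode()).hexdigest()[:12]

    def record_failure(self, domain: str, error: str, lesson_text: str) -> str:
        fp = self.fingerprint(domain, error)
        self.lessons.setdefault(fp, Lesson(fp, frozenset({domain}), lesson_text))
        return fp

    def retrieve(self, domain: str, top_k: int = 3):
        # Injected pre-run and on-error; suppressed (negative-utility)
        # lessons are filtered out so they cannot contaminate a run.
        hits = [l for l in self.lessons.values()
                if domain in l.tags and l.utility >= 0]
        return sorted(hits, key=lambda l: -l.utility)[:top_k]

    def update_outcome(self, fp: str, passed: bool) -> None:
        # End-of-run utility update: reward lessons active during a pass,
        # penalize lessons active during a fail.
        if fp in self.lessons:
            self.lessons[fp].utility += 1.0 if passed else -0.5
```

The asymmetric reward (+1.0 pass, −0.5 fail) is one possible choice for letting occasionally useful lessons survive noise while still suppressing consistently harmful ones.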
The runtime keeps two roles separate:
- executor model: attempts the task.
- referee/judge layer: scores pass/fail and quality.
Evaluator behavior:
- deterministic evaluator first when available,
- LLM judge fallback when deterministic checks are unavailable/insufficient.
So outcomes are explicit and trackable, not based on vague impressions.
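The deterministic-first, judge-fallback behavior can be expressed as a small dispatch sketch. The interfaces here are assumptions, not the real referee layer's API: a deterministic check returns a verdict when applicable and `None` otherwise, and the LLM judge is only consulted when no check applies.

```python
def evaluate(task, output, deterministic_checks, llm_judge):
    """Deterministic evaluator first, LLM judge fallback (illustrative sketch).

    deterministic_checks: callables returning True/False, or None if
    the check does not apply to this task.
    llm_judge: callable returning True/False; only used as fallback.
    """
    for check in deterministic_checks:
        verdict = check(task, output)
        if verdict is not None:  # check was applicable -> trust it
            return {"verdict": verdict, "source": "deterministic"}
    # No deterministic check applied -> fall back to the LLM judge.
    return {"verdict": llm_judge(task, output), "source": "llm_judge"}
```

Tagging each verdict with its `source` keeps outcomes explicit and trackable, which is what makes the end-of-run utility updates trustworthy.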
The hackathon CLI demo runs a mixed protocol:
- gridtool warmup
- fluxtool interference
- shell Excel-style interference
- sqlite interference
- gridtool retention check
You should see:
- early failures in cold start,
- lesson activation in later waves,
- improved pass rate/steps/errors/tokens as memory becomes useful.
```bash
START_SESSION=120001 \
AUTO_TIMELINE=1 AUTO_TOKEN_REPORT=1 \
bash tracks/cli_sqlite/scripts/run_hackathon_demo.sh --pretty
```

Outputs:
- wave summaries: /tmp/memory_mixed_wave*.json
- timelines: /tmp/memory_timeline_wave*.txt
- token report: /tmp/memory_mixed_tokens_*.json
Legacy command compatibility (same behavior, alias wrapper):
```bash
START_SESSION=120001 \
AUTO_TIMELINE=1 AUTO_TOKEN_REPORT=1 \
bash tracks/cli_sqlite/scripts/run_hackathon_demo_legacy.sh --pretty
```

Grid memory curve (expected pattern: fail -> pass -> pass):
```bash
bash tracks/cli_sqlite/scripts/run_tool_three_waves.sh \
  --domain gridtool \
  --task-id multi_step_pipeline \
  --start-session 100651 \
  --max-steps 4
```

Shell memory curve (expected pattern: fail -> pass -> pass):
```bash
bash tracks/cli_sqlite/scripts/run_tool_three_waves.sh \
  --domain shell \
  --task-id shell_excel_multi_summary \
  --start-session 100451 \
  --max-steps 4
```

If this loop is reliable, users stop repeating themselves across chats.
The agent gets faster, cheaper, and less error-prone over time.
That is the path from “token predictor” behavior to persistent, compounding productivity.
This repo proves the architecture in a controlled CLI lab.
Real GUI/computer-use reliability (for harder domains like FL Studio) is still a separate reliability problem, mainly visual grounding and action precision.
Quick dev workflow shortcuts:
```bash
make test
make test-root
make test-cli
```

Script catalog: scripts/README.md
Run one live FL session:
```bash
./scripts/run_fl_live_demo.sh 210001 12
```

Run FL with the subscription-backed claude -p executor (default; no API key required for the executor loop):
```bash
./scripts/run_fl_live_demo.sh 210011 12
```

Note: claude_print runs default to claude-opus-4-6 with --effort high.
Optional overrides: CORTEX_CLAUDE_PRINT_MODEL, CORTEX_CLAUDE_PRINT_EFFORT.
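The override mechanics amount to environment-variable resolution with the documented defaults. A minimal sketch, assuming the behavior described above; the function name `resolve_claude_print_config` is hypothetical, only the variable names and defaults come from this README:

```python
import os

# Defaults documented in this README; the resolution logic is illustrative.
DEFAULT_MODEL = "claude-opus-4-6"
DEFAULT_EFFORT = "high"

def resolve_claude_print_config(env=None):
    """Return the model/effort a claude_print run would use,
    honoring CORTEX_CLAUDE_PRINT_MODEL / CORTEX_CLAUDE_PRINT_EFFORT."""
    env = os.environ if env is None else env
    return {
        "model": env.get("CORTEX_CLAUDE_PRINT_MODEL", DEFAULT_MODEL),
        "effort": env.get("CORTEX_CLAUDE_PRINT_EFFORT", DEFAULT_EFFORT),
    }
```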
claude_print now covers the full FL loop: executor + state extraction + visual judge + post-task learning.
For active benchmark iteration, prefer API backends (openai first). claude_print is usually much slower.
Run repeated FL sessions and produce a benchmark JSON:
```bash
python3 scripts/run_fl_benchmark.py \
  --start-session 210001 \
  --runs 10 \
  --max-steps 12 \
  --llm-backend claude_print \
  --output-json /tmp/fl_benchmark_210001.json
```

Render a compact per-session FL timeline with a deterministic/judge/final verdict line:
```bash
python3 scripts/render_fl_timeline.py --session 210001 --show-output
```

If you want background iteration without blocking your host desktop:
```bash
./scripts/vm/provision_cortex_vm.sh
./scripts/vm/start_cortex_vm.sh
./scripts/vm/status_cortex_vm.sh
open "vnc://127.0.0.1:5905"
```

Stop it:
```bash
./scripts/vm/stop_cortex_vm.sh
```

Details and limitations: docs/VM-RUNNER.md
Docs:
- docs/README.md - docs index
- docs/MEMORY-V2-AGNOSTIC-PLAN.md - requirements/status
- docs/MEMORY-V2-BENCHMARKS.md - benchmark protocol + interpretation
- docs/archive/memory-v2-history/HACKATHON-DEMO-NARRATION.md - archived narration script