Cortex: Persistent Memory For Agents

Claude Code Hackathon Project Feb 2026 (Built with Opus 4.6)

Most AI agents today improve only within a single chat.
Start a new chat and they repeat the same mistakes.

That undermines real productivity.

Project north-star guidance is maintained in CLAUDE.md (and AGENTS.md, which symlinks to it).

The Problem

Most agent memory systems are either:

manual (user must maintain notes/skills/docs),
session-local (wiped between chats),
or unsafe (retrieves irrelevant memory and causes contamination).

If the user has to manage memory themselves, the product does not scale.

The Solution In This Repo

Memory V2: an automatic memory loop that learns from failures and reuses lessons across runs.

Core loop:

1. Agent acts.
2. Runtime captures failure/progress signals.
3. Failure is fingerprinted and tagged.
4. Retrieval injects relevant lessons (pre-run + on-error).
5. End-of-run outcomes update lesson utility.
6. Lessons are promoted/suppressed over time.

This enables learning across sessions without requiring users to manually maintain skill docs.
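
As a rough illustration of the loop above, here is a minimal Python sketch. It is not the code in this repo: Lesson, LessonStore, fingerprint_error, the +1/-1 utility update, and the suppression threshold are all assumptions made for the example.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Lesson:
    fingerprint: str      # hash of the normalized failure the lesson came from
    text: str             # hint injected into the prompt on retrieval
    utility: float = 0.0  # running score; higher means the lesson has helped
    uses: int = 0


def fingerprint_error(tool: str, message: str) -> str:
    """Normalize volatile details so the same kind of failure hashes identically."""
    normalized = re.sub(r"\d+", "<n>", message.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(f"{tool}|{normalized}".encode()).hexdigest()[:12]


class LessonStore:
    """Hypothetical in-memory store; a real one would persist across sessions."""

    def __init__(self, suppress_below: float = -1.5):
        self.lessons: dict[str, list[Lesson]] = {}
        self.suppress_below = suppress_below

    def add(self, lesson: Lesson) -> None:
        self.lessons.setdefault(lesson.fingerprint, []).append(lesson)

    def retrieve(self, fingerprint: str, limit: int = 3) -> list[Lesson]:
        """Return the highest-utility, non-suppressed lessons for this failure."""
        candidates = [
            lesson for lesson in self.lessons.get(fingerprint, [])
            if lesson.utility > self.suppress_below
        ]
        return sorted(candidates, key=lambda l: l.utility, reverse=True)[:limit]

    def record_outcome(self, used: list[Lesson], passed: bool) -> None:
        """End-of-run update: reward lessons used in passing runs, penalize the rest."""
        for lesson in used:
            lesson.uses += 1
            lesson.utility += 1.0 if passed else -1.0


# Usage: two failures that differ only in a line number share one fingerprint.
store = LessonStore()
fp = fingerprint_error("gridtool", "Unknown column 'qty' at line 12")
store.add(Lesson(fp, "List the available columns before referencing one."))
hits = store.retrieve(fingerprint_error("gridtool", "unknown column 'qty' at line 7"))
store.record_outcome(hits, passed=True)
```

With the assumed threshold of -1.5, a lesson that is used in two failing runs and no passing ones drops out of retrieval, which is one simple way to get the "suppressed over time" behavior.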

How Success/Failure Is Measured

We keep two runtime roles separate:

executor model: attempts the task.
referee/judge layer: scores pass/fail and quality.

Evaluator behavior:

deterministic evaluator first when available,
LLM judge fallback when deterministic checks are unavailable/insufficient.

So outcomes are explicit and trackable, not based on vague impressions.
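
The "deterministic first, LLM judge as fallback" ordering might look roughly like this. The names (Verdict, evaluate, row_count_check, stub_judge) and the convention that a deterministic check returns None when it cannot decide are assumptions for the sketch, not this repo's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    source: str  # "deterministic" or "llm_judge"


# Assumed convention: a deterministic check returns None when it cannot decide.
DeterministicCheck = Callable[[str], Optional[bool]]
LLMJudge = Callable[[str], bool]


def evaluate(output: str,
             deterministic: Optional[DeterministicCheck],
             llm_judge: LLMJudge) -> Verdict:
    """Prefer the deterministic check; call the judge only when it abstains."""
    if deterministic is not None:
        result = deterministic(output)
        if result is not None:
            return Verdict(result, "deterministic")
    return Verdict(llm_judge(output), "llm_judge")


def row_count_check(output: str) -> Optional[bool]:
    # Hypothetical check: decides only when the output reports a row count.
    if "rows=" not in output:
        return None
    return "rows=3" in output


def stub_judge(output: str) -> bool:
    # Stand-in for a real LLM call, which would be slower and non-deterministic.
    return "error" not in output.lower()


verdict = evaluate("rows=3", row_count_check, stub_judge)
```

Recording the source alongside the verdict keeps it visible which outcomes rest on a hard check and which on a judge's opinion.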

What The Demo Shows

The hackathon CLI demo runs a mixed protocol:

gridtool warmup,
fluxtool interference,
shell Excel-style interference,
sqlite interference,
gridtool retention check.

You should see:

early failures in cold start,
lesson activation in later waves,
improved pass rate/steps/errors/tokens as memory becomes useful.
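
To make "improved pass rate/steps/errors/tokens" concrete, one way to compare waves is to aggregate per-run records. The RunResult shape and the numbers below are invented for illustration; they are not output from the demo.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    wave: int
    passed: bool
    steps: int
    errors: int
    tokens: int


def summarize_by_wave(results: list[RunResult]) -> dict[int, dict[str, float]]:
    """Aggregate pass rate and mean steps/errors/tokens for each wave."""
    by_wave: dict[int, list[RunResult]] = {}
    for result in results:
        by_wave.setdefault(result.wave, []).append(result)
    return {
        wave: {
            "pass_rate": mean([1.0 if r.passed else 0.0 for r in runs]),
            "steps": mean([r.steps for r in runs]),
            "errors": mean([r.errors for r in runs]),
            "tokens": mean([r.tokens for r in runs]),
        }
        for wave, runs in sorted(by_wave.items())
    }


# Made-up records: a cold-start wave followed by a wave with lessons active.
example = [
    RunResult(wave=1, passed=False, steps=4, errors=3, tokens=1200),
    RunResult(wave=1, passed=False, steps=4, errors=2, tokens=1100),
    RunResult(wave=2, passed=True, steps=2, errors=0, tokens=600),
    RunResult(wave=2, passed=True, steps=3, errors=1, tokens=700),
]
summary = summarize_by_wave(example)
```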

Run The Demo (One Command)

START_SESSION=120001 \
AUTO_TIMELINE=1 AUTO_TOKEN_REPORT=1 \
bash tracks/cli_sqlite/scripts/run_hackathon_demo.sh --pretty

Outputs:

wave summaries: /tmp/memory_mixed_wave*.json
timelines: /tmp/memory_timeline_wave*.txt
token report: /tmp/memory_mixed_tokens_*.json
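
A small Python helper can collect the wave summaries for inspection. The JSON schema is not described here, so this sketch (load_wave_summaries is a hypothetical name) only loads the files and prints their top-level shape rather than assuming any field names.

```python
import glob
import json
from pathlib import Path


def load_wave_summaries(pattern: str = "/tmp/memory_mixed_wave*.json") -> dict[str, object]:
    """Load every JSON file matching the pattern, keyed by file name."""
    summaries: dict[str, object] = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            summaries[Path(path).name] = json.load(handle)
    return summaries


if __name__ == "__main__":
    for name, data in load_wave_summaries().items():
        # Show only top-level keys (or the type), since the schema is unknown here.
        shape = sorted(data) if isinstance(data, dict) else type(data).__name__
        print(name, shape)
```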

Legacy command compatibility (same behavior, alias wrapper):

START_SESSION=120001 \
AUTO_TIMELINE=1 AUTO_TOKEN_REPORT=1 \
bash tracks/cli_sqlite/scripts/run_hackathon_demo_legacy.sh --pretty

Targeted 3-Wave Checks (Fast)

Grid memory curve (expected pattern: fail -> pass -> pass):

bash tracks/cli_sqlite/scripts/run_tool_three_waves.sh \
  --domain gridtool \
  --task-id multi_step_pipeline \
  --start-session 100651 \
  --max-steps 4

Shell memory curve (expected pattern: fail -> pass -> pass):

bash tracks/cli_sqlite/scripts/run_tool_three_waves.sh \
  --domain shell \
  --task-id shell_excel_multi_summary \
  --start-session 100451 \
  --max-steps 4
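
The expected fail -> pass -> pass pattern can be written down as a small check. matches_memory_curve is a hypothetical helper, not part of run_tool_three_waves.sh, and it assumes each wave has already been reduced to a single pass/fail boolean.

```python
def matches_memory_curve(wave_passed: list[bool]) -> bool:
    """True when the first wave fails and every later wave passes."""
    if len(wave_passed) < 2:
        return False  # need a cold-start wave plus at least one follow-up
    return (not wave_passed[0]) and all(wave_passed[1:])


curve_ok = matches_memory_curve([False, True, True])
```

A run that passes on the first wave does not match: it may mean the task was too easy to show any memory effect, or that the session was not actually a cold start.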

Why This Matters

If this loop is reliable, users stop repeating themselves across chats.
The agent gets faster, cheaper, and less error-prone over time.

That is the path from “token predictor” behavior to persistent, compounding productivity.

Status

This repo proves the architecture in a controlled CLI lab.
Reliable GUI/computer use in harder domains such as FL Studio is a separate problem, mainly one of visual grounding and action precision.

FL Studio Bench Commands

Quick dev workflow shortcuts:

make test
make test-root
make test-cli

Script catalog:

scripts/README.md

Run one live FL session:

./scripts/run_fl_live_demo.sh 210001 12

Run FL with the subscription-backed claude -p executor (the default; no API key is required for the executor loop):

./scripts/run_fl_live_demo.sh 210011 12

Note: claude_print runs default to claude-opus-4-6 with --effort high. To override, set CORTEX_CLAUDE_PRINT_MODEL or CORTEX_CLAUDE_PRINT_EFFORT. claude_print now covers the full FL loop: executor, state extraction, visual judge, and posttask learning. It is usually much slower than the API backends, so prefer those (openai first) for active benchmark iteration.

Run repeated FL sessions and produce a benchmark JSON:

python3 scripts/run_fl_benchmark.py \
  --start-session 210001 \
  --runs 10 \
  --max-steps 12 \
  --llm-backend claude_print \
  --output-json /tmp/fl_benchmark_210001.json

Render a compact per-session FL timeline with a deterministic/judge/final verdict line:

python3 scripts/render_fl_timeline.py --session 210001 --show-output

Isolated VM Runner

If you want background iteration without blocking your host desktop:

./scripts/vm/provision_cortex_vm.sh
./scripts/vm/start_cortex_vm.sh
./scripts/vm/status_cortex_vm.sh
open "vnc://127.0.0.1:5905"

Stop it:

./scripts/vm/stop_cortex_vm.sh

Details and limitations: docs/VM-RUNNER.md

Docs

docs/README.md - docs index
docs/MEMORY-V2-AGNOSTIC-PLAN.md - requirements/status
docs/MEMORY-V2-BENCHMARKS.md - benchmark protocol + interpretation
docs/archive/memory-v2-history/HACKATHON-DEMO-NARRATION.md - archived narration script
