This document captures the debugging methodology that worked on the GC corruption bugs (task #12 family, June 2026). The same approach generalises to any bug whose surface symptom is far removed from its root cause -- typically because data flows from the buggy write site, through several allocations and a GC pass or two, and only then gets read back in a context where the corruption becomes visible.
The shape of the work is two phases:
-
Bisect to a minimal trigger. A bug that reproduces only under a 20-minute compile of the entire game is useless for investigation; you need a reproducer that runs in under a few seconds and either crashes or doesn't.
-
Trace back from the surface to the source with write traps. Once the bug is small and deterministic, instrument the heap so that the moment the bad value lands in a slot, the writing instruction's RIP gets logged. From there it's a short walk to the C source line.
This method has handled every Symta-runtime bug we've seen so far except heisenbugs -- failures that happen once every several days, can't be reliably triggered, and may be caused by faulty hardware or OS scheduling rather than the program itself. See the Heisenbugs section at the end for how to recognise them and why this methodology doesn't apply.
Most observed bugs reproduce under a narrow envelope. The art is finding that envelope so each subsequent run takes seconds, not minutes.
The bug is triggered by some input. Start with the largest
input that reproduces (usually bash game/build.sh or symta /path/to/project) and shrink it. Useful shrink operations:
- Replace the source file's contents with a one-liner like
say "hi". Most compile-time bugs fire duringcore_.smacroexpansion before they ever see the user code, so this shrinks the input without changing the failure. - Remove project dependencies (
use foo) one at a time. - For runtime bugs, find the single user-source-file that triggers and shrink that file with binary search.
Some bugs depend on GC running often. SYMTA_GEN0_SIZE=<bytes>
forces a smaller nursery and a higher GC rate. The original
task #12 bug only triggered below ~96 KB. A useful sweep:
for sz in 65536 131072 262144 524288 1048576; do
echo "=== gen0=$sz ==="
SYMTA_GEN0_SIZE=$sz ./symta.exe "$tmp" 2>&1 | grep -E 'ERROR|has no method'
doneThe smallest size that still reproduces is what you want. Below that the runtime clamps to a page; above that the bug stops firing. Pick a size right in the trigger band.
If the bug fires under --jit but not --no-jit, you have a
JIT-correctness regression. This is task #14 territory. The
shape of the fix is different (audit recent JIT translations
and helpers, not heap writes), but the bisect is the same:
shrink the input + flip the flag.
Once you have a minimal input, git bisect works normally as
long as each candidate commit can be built. The Windows build
needs TMPDIR=$LOCALAPPDATA/Temp PATH=/c/soft/w64devkit/bin:... make -f Makefile.w64; bake that into the bisect script. See
the comment in feedback_symta_toolchain.md -- using mingw64
instead of w64devkit produces binaries that won't start, which
will throw bisect off if you don't notice the build is broken.
./game/sbc/ and ./game/cache/ hold previously compiled SBCs
and intermediate state. If a bug only fires "sometimes" check
whether you're inadvertently reusing those. rm -rf cache sbc
before each reproducer run is cheap insurance.
Once the bug is small and deterministic, you know roughly where it fires (the stack trace from the rterr / segfault gives you the Symta function and source line). What you need next is where the bad value was written.
Find one consistent fact about the value:
- Its raw bit pattern (e.g.
0x07,0x000001fe7fe60013). - The slot or register it ends up in (printed by the sink builtin's "has no method" message or by a manual probe).
- A type-mismatch property -- "tag should be T_LIST=9 but is 3", "size field reads back as 2.1 GB", etc.
For task #12 the consistent fact was: the slot holding the
poison always had O_TAG = 3 (_unused0_), a reserved tag.
Whenever you can phrase the bug as "this dyn shouldn't ever
exist", the rest of the hunt is mechanical.
The surface symptom tells you what the value looks like but
not which heap slot is holding it. A scan over the live heap
finds the slot. See gc_types.h GCDEF(gc_list)'s
SYMTA_DBG_LISTSCAN block for the template -- on every list
trace, check whether any slot matches your bad-value signature,
and print (gid, size, slot_index) on the first hit.
Once you have the gid and slot, the slot's heap-byte-offset is
api.heap0 + gid*8 + slot*8. Note that gid is deterministic
across runs (alloc order is a function of the input program),
so you can use the gid from the first run as a fixed key for
the second run.
This is where the runtime's purpose-built infrastructure earns its keep. We have two implementations:
ctx_set_page_watchpoint(void *addr)(inruntime/w64/ctx.c) -- the workhorse.VirtualProtect's the 4 KB page containing the slot toPAGE_READONLY, installs a vectored exception handler, and traps every write to that page. On the slot-match write it logs full GPR state + RIP and re-arms via single-step. Slower than DR0 (every page write traps, not just the watched address) but works without elevated privileges or a debug session.ctx_set_write_watchpoint(void *addr, uint64_t value)-- hardware DR0 via a suspending helper thread. Faster but Windows silently drops the register write if the session doesn't have sufficient integrity (we've seen GetThreadContext read back DR0=0 immediately after SetThreadContext "succeeds"). Enable viaSYMTA_WATCH_HW=1.
Both are wired up from runtime/main.c behind env vars:
| env var | meaning |
|---|---|
SYMTA_WATCH_ADDR |
raw heap address (hex, full 64-bit) to watch |
SYMTA_WATCH_GID |
alternative: gid in heap0; slot = api.heap0[gid] |
SYMTA_WATCH_SLOT |
additional slot offset (default 0) |
SYMTA_WATCH_VALUE |
only log+abort writes that land this value |
SYMTA_WATCH_HW |
use DR0 instead of page protect |
SYMTA_DUMP_SEGV |
dump RIP+GPRs on any access violation |
SYMTA_DBG_LISTSCAN |
gc_list emits a forensic dump for absurd sizes |
SYMTA_SCAN_AST |
recursively scan reader.c outputs for poison |
SYMTA_FATAL_BAD_TAG |
abort immediately when a tag-3..5/7 dyn appears |
Typical session:
# Run 1: find the holder gid via the LISTSCAN diagnostic.
SYMTA_GEN0_SIZE=65536 SYMTA_DBG_LISTSCAN=1 ./symta.exe "$tmp" 2>&1 \
| grep LISTSCAN
# Run 2: trap writes to the slot that contained the poison.
SYMTA_GEN0_SIZE=65536 \
SYMTA_WATCH_GID=33451165 SYMTA_WATCH_SLOT=1 SYMTA_WATCH_VALUE=7 \
./symta.exe "$tmp" 2>&1 | grep -E 'PAGE|WATCH'The watchpoint handler prints the writing instruction's RIP and every GPR.
GetModuleHandleA(NULL) is logged alongside the RIP so you can
compute the file offset: RIP - ImageBase = file VA. Add the
preferred image base (0x140000000 on Windows w64devkit builds)
and grep the disasm:
objdump -d ./symta.exe > /tmp/symta-disasm.txt
awk '/^[0-9a-f]+ <[^>]+>:/ {sym=$0; addr=strtonum("0x"$1)}
addr <= TARGET {last_sym=sym; last_addr=addr}
END {print last_sym; printf "offset = 0x%x\n", TARGET - last_addr}' \
/tmp/symta-disasm.txtThis drops you on a specific instruction inside a named C
function with a byte offset. For task #12 it landed at
reader_parse_strip + 0x155, which the disassembly identified
as the store of the recursive-call return into LGET(out, i).
Once you know the C site the fix is usually one of two patterns:
- A defensive guard at the read site. When the bad value
is structurally invalid -- a reserved tag, a list size in the
gigabyte range, a parsed-cache list with size != 1 -- check
for it before consuming. Task #12's first fix was a one-line
guard in
reader_parse_stripthat falls back totok_value(x)whenever the parsed cache isn't a textbook 1-element list. - Anchoring stale C locals across an allocating call. When
the writer is a C function that holds a heap dyn in a local
across
LIST/OBJECT/mk_token, anchor the local withgc_anchor_push(&x); ...; gc_anchor_pop(). The GC then forwards the local on every collection so the next dereference reads the moved object instead of the freed-then-reused slot.
The anchor API lives in runtime/common.h (declarations) and
runtime/gc.c (the gc_anchors stb_ds stack, walked by
gc_builtins).
The phase-1 bisect assumes the bug reproduces on demand. A heisenbug -- something that fires once every few days, never on the same input twice, sometimes only on one specific machine -- defeats every step:
- You can't shrink the input because each shrink might happen to hide the bug for the wrong reason.
- You can't bisect commits because a "good" commit might just be one where you got lucky.
- You can't set a watchpoint because you don't know which run will be the bad one.
Symptoms that suggest a heisenbug rather than a determinism bug:
- The fault address varies between runs by more than ASLR slide.
- The same binary on the same input fails sometimes and succeeds sometimes.
- The fault disappears under
--no-jit, then comes back, then disappears again on rebuild, with no source changes. - The reproducer involves wall-clock time, sockets, or other external state.
- The fault is only seen on one machine in the wild and never on CI or developer hardware.
When you see those, the next step is not to apply more instrumentation but to exclude environmental causes:
- Run a memory test (MemTest86 or equivalent) overnight. Bad RAM produces bit flips that look exactly like GC corruption.
- Check the kernel log for ECC corrections, disk errors, thermal throttling.
- Check
wmic memorychip(Windows) /dmidecode(Linux) -- a recent module swap can introduce bad sticks. - If the bug is in a JIT-compiled function, check that nothing
is mapping the JIT arena with
PAGE_EXECUTE_READWRITEand letting an unrelated process scribble on it (some antivirus inject DLLs do this). - If the bug only shows in the wild, suspect the user's machine before suspecting the code.
Only after those are ruled out is it worth setting up long-running statistical instrumentation (a continuously-armed watchpoint that logs to disk, a corrupted-heap detector that runs once per GC and crashes with full context on the first mismatch). Even then, plan for the investigation to take days of clock time rather than minutes.
All env-gated, inert when unset:
SYMTA_DUMP_SEGV-- access-violation handler logs full GPRs.SYMTA_WATCH_GID/SYMTA_WATCH_SLOT/SYMTA_WATCH_VALUE/SYMTA_WATCH_HW-- arms the page or DR0 watchpoint at startup.SYMTA_DBG_LISTSCAN-- gc_list dumps source list + neighbours on absurd size.SYMTA_SCAN_AST-- reader.c emits[ASTSCAN]lines at parser function boundaries when poison appears in the AST.SYMTA_DBG_LIST2-- SBC_LIST2 traps on building a list with poison in slot b.SYMTA_FATAL_BAD_TAG-- the gc_custom defensive guard and the SBC_LIST2 watchpoint abort immediately on first detection so the stack trace shows the exact context.
All can be combined. None has measurable cost when not set.
gc_anchor_push(dyn *p)/gc_anchor_pop(void)/gc_anchor_pop_n(int)-- register stack-local dyn slots as GC roots. Seeruntime/common.h.ctx_set_page_watchpoint(void *addr)/ctx_set_write_watchpoint(void *addr, uint64_t value)/ctx_set_watch_target_value(uint64_t v)-- watchpoint primitives. Seeruntime/w64/ctx.h.
tests/runtime/tiny-gen0.sh-- pinned regression for the original tag-3 / 0x07 poison bug. The fix in commit3c7ec19exercises the full methodology: minimal reproducer (say "hi"under tiny gen0), heap scan via LISTSCAN, page watchpoint via WATCH_GID/SLOT/VALUE, RIP-to-source via objdump, defensive guard inreader_parse_stripplus root-cause anchor pattern inparse_term.
This list is kept in sync with the in-flight tracker tasks and should be the first stop when a regression surfaces. Each entry records the minimal trigger, the surface symptom, and what we already know about the source.
-
Trigger.
rm -rf game/sbc; bash game/build.sh(with--jit). Crashed during macroexpansion with a "has no method" error on aNo(tag-14) orint 0object. Surface varied between runs (qlmb_supply_cvars:204calling.nonNo,ssa_markreceivingName = No, etc.) because the bug was allocation-pattern-sensitive. -
Root cause. The JIT trampolines
jit_rt_immeq_impl,jit_rt_immne_impl,jit_rt_fxnlset_impl, andjit_rt_fxnlsetir_implcached frame-slot reads in C locals before callingARGLIST2/ARGLIST3, then passed the stale C locals intoSTARGafter the allocation:dyn aa = ((dyn*)L)[a_sl]; // <-- snapshot dyn bb = ((dyn*)L)[b_sl]; ... ARGLIST2(aa, bb); // GC may run here // STARG(0, aa); STARG(1, bb); // stale pointers leaked
ARGLIST*callsLIST(api.args, n)which can trigger GC. Frame slots are GC roots (scanned viaapi.frame) so the slot value is updated to the new heap address, but the C localaastill holds the old pointer. WhenSTARGwroteaainto the newly-allocated args list, the args list ended up with a dangling pointer to recycled memory -- which read back asNo(or whatever else was now at that address). The first method dispatch through that args list then saw the wrong receiver and crashed. -
Why bisect was so painful. The trigger depends on the binary's link layout because that shifts heap addresses and GC timing. The disabled-opcode bisect under
SYMTA_NO_FXNLSET/SYMTA_NO_OBJECT/ etc. didn't catch it because those flags gated the JIT translator's opcode-emission path, not the trampoline helper bodies the AOT-installed code calls into. The buggy helpers are reached through pre-existing translations (IMMEQ / IMMNE from commit237c89f; FXNLSET / FXNLSETIR from0b2d433). -
Tool that found it. Same methodology as bugs #12/#13: pin a deterministic reproducer (
bash build.shalways failed on the qlmb path), bisect withSYMTA_JIT_FILTER+SYMTA_JIT_MAX_FN/SYMTA_JIT_SKIP_FNto a single function (fnmeta.nameat compiler.sbc fn-index 207), decode that function's bytecode (SYMTA_JIT_PRINT_FNS=1), read the opcode trace -- IMMEQ stood out -- then audit the matching helper for the C-local-before-allocation pattern. -
Fix. Commit
<TBD>. Rewrote the affected helpers so everySTARGreads from the frame slot post-ARGLIST:ARGLIST(2); STARG(0, ((dyn*)L)[a_sl]); // fresh read; post-GC STARG(1, ((dyn*)L)[b_sl]);
Also reload for the
O_TAG-driven mcache key. The pre-ARGLISTC locals are kept for the fast-path tag checks (tag bits are GC-stable), but never re-used after the allocation point. -
Verified.
rm -rf game/sbc; bash game/build.shnow succeeds three runs in a row. Drift PASS 5/5. Runtime tests 34/34 PASS.tests/runtime/tiny-gen0.shstill pinned. Gamebuild.shrestored to--jit(commit1b2c8e4's--no-jitworkaround reverted).
- Trigger. Was
SYMTA_GEN0_SIZE=65536 ./symta.exe ...(clean reproducer; default heap sizes never hit it). - Surface. Was a segfault below heap base,
gc_alloc + 0x79writing the new object's header afterhgp->top - sizeunderflowed. Caller wasgc_list + 0x49invoked viagc_custom's slot iteration, asking for~2.1 GBbecause the source list'sgc_head_tread back as adjacent dyn data instead of a{size, code}pair. - Root cause. Cross-gen write without barrier. Same
source line as bug #12 (
parse_term's parsed-cache write), different failure mode. After my fix to bug #12,tokwas anchored acrossLIST(pp, 1), so the C local stayed valid. But the actual writeLGET(tok, 6) = ppwas a direct store -- nolsetmwrite barrier. Under tiny-gen0tokoften promotes to gen 4 by the time the cache is written (many GCs have happened by then) whileppis a fresh nursery allocation in gen 0. A cross-gen pointer without the dirty-marking write barrier doesn't get scanned during younger-gen collections, so whenpp's gen-0 location is recycled,tok.6is left pointing into the recycled memory. When gen-4 eventually collects,gc_customtracestok, callsgc_liston the staletok.6, and reads a corruptedgc_head_t-> 2.1 GB alloc -> heap underrun -> segfault. - Tool that found it. Exactly the methodology in this
file.
SYMTA_DUMP_SEGV=1caught the absurd gc_alloc size at the source. The gc_list scan reported the holder slot (heap[33519473] in gen 4).SYMTA_WATCH_GID=33519473 SYMTA_WATCH_VALUE=0x000001fe7fe60013armed a page watchpoint that fired on the exactLGET(tok, 6) = ppinstruction (RIP -> objdump ->parse_term + 0x1e2). - Fix. Commit
7f4dd7bfollow-up: changedLGET(tok, 6) = pptoLSET(tok, 6, pp).LSETcallslsetmwhich checksGC_OLDER(tok, pp)and marks the containing page dirty when the write crosses gen boundaries. - Verified.
tests/runtime/tiny-gen0.shno longer surfaces a segfault. Other tiny-gen0 bugs (#14 under--jit, an "undefined variable." symptom under--no-jitthat's clearly downstream) still fire; those are separate stale-pointer sites tracked separately.
- Trigger. None on its own; this is a pending JIT improvement, not a correctness bug.
- Surface. N/A.
- What we know. SBC_CTX_BTLAND currently always falls
through to the interpreter for the setjmp call, costing a
helper-trampoline jump per
tryblock. An inline emission would let JIT'd code stay on the hot path through try / catch. - Blocked by. Was bug #14; now unblocked.
- Bug #15 -- system-wide sweep for "raw store into a heap
container without write barrier" (bug #13's class) found
7 sites across
reader.c,am.h, andbltin.c. Two are empirically real bugs (bracket-literal parsed-cache,.=namesetter combiner -- both in the reader); five are defensive (current operator set is short enough that appended-_lands in fixtext immediate; KW_bracketsL / KW_curlyL are 1-char fixtexts;view_set's copy-on-write needs an aged view which the game compile doesn't produce;amSetNo/amSwaparen't exercised by the game compile butamSwapcarries a self-warning comment inbltin.c:tbl_swap). All seven fixed across commits6a66ec3,9ed1a68,<TBD>-- four reader.c sites use the anchor + LSET recipe; the two am.h sites use AM_ATTRACT / explicit magnets-array registration; bltin.c's view_set uses lsetm directly. Audits inbugs-review.md(first pass, explicit-LGET form) andbugs-review-2.md(second pass, macro-hidden formsAM_VOID(o)=v/VIEW_BASE(o)=v). Both real bugs pinned bytests/runtime/cross-gen-store.sh. - Bug #14 -- the JIT api.args corruption crash under
--jiton game compile. Fixed in commit9f3ae76by reorderingSTARGreads to happen afterARGLISTinjit_rt_immeq_impl/jit_rt_immne_impl/jit_rt_fxnlset_impl/jit_rt_fxnlsetir_impl. Pre-fix the helpers snapshotted L-slot dyns into C locals, then calledARGLIST*(which can GC), then passed the stale C locals toSTARG-- leaking dangling pointers into api.args. Three consecutive clean game compiles confirm the fix. - Bug #13 -- the tiny-gen0 expand_qlmb_r segfault. Same
source line as bug #12 (
parse_term's parsed-cache write) but a different failure mode: cross-gen write without barrier. Fixed in commit449414aby changingLGET(tok, 6) = pptoLSET(tok, 6, pp)solsetmmarks the page dirty forgc_older_gens. Found via the methodology in this file: SYMTA_DUMP_SEGV -> gc_list scan -> page watchpoint -> RIP -> objdump -> source line. - Bug #12 -- the original tag-3 / 0x07 poison crash under
tiny gen0. Fixed in commit
3c7ec19via thereader_parse_stripparsed-cache guard. Pinned regression intests/runtime/tiny-gen0.sh. - Bug #10 -- game's
show_fpsoverlay was ignoringui.cfg. Fixed in commit2db44fb, also addedshow_perffor per-frame ms.