Pre-release testing findings (v0.3.0 sprint) — bugs and playbook fixes #68

@anilmurty

Running list of findings surfaced while walking through tests/manual-pre-release-testing.md against the v0.3.0 sprint work on task-5-docs-sweep. Filed proactively so we can keep updating as the playbook run continues.

Code bugs

1. tj onboard --reconfigure bare-path early-returns

The --reconfigure flag is honored inside _onboard_claude_code and _onboard_codex, but the top-level cmd_onboard early-returns with "Config already exists. Use --force to overwrite." when called without --claude-code / --codex, regardless of --reconfigure.

Fix: Make the bare-onboard path honor --reconfigure (bypass the existing-config check) or explicitly document that --reconfigure only applies to integration-specific flows.

Where: tokenjam/cli/cmd_onboard.py ~ line 39–43
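The first fix option can be sketched as a small guard, assuming the early return depends only on whether a config exists and which flags were passed. The function name and its parameters are hypothetical and are not taken from cmd_onboard.py.

```python
def should_block_on_existing_config(
    config_exists: bool, force: bool, reconfigure: bool
) -> bool:
    """Return True when onboarding should stop with 'Config already exists'."""
    if not config_exists:
        return False
    # Either flag signals that the user intends to touch the existing config,
    # so the bare path should proceed instead of early-returning.
    return not (force or reconfigure)
```

With this shape, the bare path and the integration-specific paths could share one check, so --reconfigure would behave the same everywhere.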


2. SDK silently drops spans on HTTP 401

When the SDK's HttpTransport gets a 401 from a running tj serve (e.g., secret mismatch), it logs a single line — tj serve returned 401 on span export — and the spans are dropped. The user can easily miss the message; no fallback to direct DuckDB write; no counter exposed via tj doctor.

Repro: Have .tj/config.toml and ~/.config/tj/config.toml with different ingest_secret values, then run any example script while tj serve is up. Watch the spans never appear in tj status.

Fix options:

  • Fall back to direct DB write when HTTP push fails (preferable for local-first ergonomics)
  • Raise visibility: count dropped spans, surface in tj doctor
  • At minimum, change the warning to ERROR level and include the secret-fingerprint mismatch hint

Where: tokenjam/sdk/transport.py
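A rough sketch of how the first two fix options could combine: try the HTTP push, fall back to a direct write, and count whatever is still lost. Everything here is an assumption for illustration: FallbackTransport, ExportStats, and the push / write_direct callables are hypothetical names, not the SDK's actual HttpTransport interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExportStats:
    exported: int = 0
    fell_back: int = 0
    dropped: int = 0
    last_error: str = ""


@dataclass
class FallbackTransport:
    # Hypothetical callables: push returns an HTTP status code,
    # write_direct persists spans locally and raises on failure.
    push: Callable[[List[dict]], int]
    write_direct: Callable[[List[dict]], None]
    stats: ExportStats = field(default_factory=ExportStats)

    def export(self, spans: List[dict]) -> None:
        status = self.push(spans)
        if 200 <= status < 300:
            self.stats.exported += len(spans)
            return
        self.stats.last_error = f"tj serve returned {status} on span export"
        try:
            self.write_direct(spans)
            self.stats.fell_back += len(spans)
        except Exception as exc:
            # Nothing left to try: record the loss so it can be surfaced later.
            self.stats.dropped += len(spans)
            self.stats.last_error += f"; fallback failed: {exc}"
```

The counters are what a tj doctor check could read. A direct write may itself fail while the daemon holds the DuckDB lock (see finding 12), which is why the sketch tracks drops separately from fallbacks.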


3. tj drift doesn't surface baselines that demonstrably exist (HTTP API fallback)

After running examples/alerts_and_drift/drift_demo.py (12 baseline + 1 anomalous session) and observing drift_detected fire successfully, tj drift reports "No drift baselines found." The detector clearly built and used a baseline (otherwise the alert wouldn't fire), but the CLI doesn't surface it.

tj doctor correctly identifies other agents as "Collecting baseline: sensitive-demo (0/10), budget-demo (0/10)" but doesn't mention drift-demo — suggesting its baseline IS built but tj drift can't read it.

Likely cause: CLI is using the HTTP API fallback (because tj serve holds the DB lock), and /api/v1/drift either doesn't return baseline records correctly or cmd_drift doesn't call it with the right shape.

Where: tokenjam/cli/cmd_drift.py + tokenjam/api/routes/drift.py


4. tj doctor reports DuckDB writable: ✗ as a failure when daemon holds the lock

Doctor's "DuckDB writable" check attempts a direct DB connection and reports ✗ Could not set lock on file ... Conflicting lock is held in ... PID <daemon> as a failure. But this is expected/healthy state — the daemon is the rightful lock holder.

Doctor already handles this gracefully for another check: i Spans column statistics: Skipped — CLI is running through the HTTP API fallback. The writable check should follow the same pattern.

Fix: Detect the daemon. When it's the lock holder, downgrade to i informational ("DB lock held by daemon — this is the expected operating state").

Where: tokenjam/cli/cmd_doctor.py


UX / config-design issues

5. Project-local vs global config secret divergence is a footgun

When .tj/config.toml exists in cwd, the SDK picks up its ingest_secret. The daemon (started by launchd) reads ~/.config/tj/config.toml. These can drift silently — there's no warning when they differ, and the manifestation is the dropped-spans-on-401 issue above (bug #2). Took several minutes of diagnostic work to trace.

Fix options:

  • Detect divergence at SDK startup; emit a clear warning
  • Prefer the global config when the daemon is running (or vice versa, but consistently)
  • Document the precedence rule explicitly in CLAUDE.md and the README
  • Consider removing .tj/config.toml from git tracking entirely — a config file in a code repo is mostly a misfeature
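The divergence check in the first option might look like the sketch below. It assumes both ingest_secret values have already been read from their config files; secret_fingerprint and divergence_warning are hypothetical helpers, and the 8-character SHA-256 prefix is an arbitrary choice for display.

```python
import hashlib
from typing import Optional


def secret_fingerprint(secret: str) -> str:
    """Short fingerprint so the warning never prints the secret itself."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:8]


def divergence_warning(
    local_secret: Optional[str], global_secret: Optional[str]
) -> Optional[str]:
    """Return a warning when both secrets are set and differ, else None."""
    if local_secret is None or global_secret is None:
        return None
    if local_secret == global_secret:
        return None
    return (
        "ingest_secret differs between .tj/config.toml "
        f"(fingerprint {secret_fingerprint(local_secret)}) and "
        f"~/.config/tj/config.toml (fingerprint {secret_fingerprint(global_secret)}); "
        "span export may be rejected with 401"
    )
```

Printing fingerprints instead of raw values lets the user see which file is out of step without leaking the secret into logs.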

6. Alert/drift demos don't fire alerts without matching tj.toml config

examples/alerts_and_drift/sensitive_actions_demo.py and budget_breach_demo.py write spans for agent IDs sensitive-demo and budget-demo, but if the user's tj.toml doesn't declare those agents with the relevant config (sensitive_actions list, budget thresholds), no alerts fire — even though the demos' output says they should.

Currently the demos contain a # tj.toml needs this config comment block, but no enforcement. New users following the playbook see "No active alerts" and don't realize they missed a prerequisite.

Fix options:

  • Demos self-register the required agent config on first run (best UX)
  • tj doctor checks for "demo agent present without matching config" and warns
  • Playbook explicitly notes the prerequisite
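The tj doctor option reduces to a set difference between agents that have written spans and agents declared in config. A minimal sketch, assuming both ID lists are already available; agents_missing_config is a hypothetical name.

```python
from typing import Iterable, List


def agents_missing_config(
    seen_agent_ids: Iterable[str], configured_agent_ids: Iterable[str]
) -> List[str]:
    """Agent IDs that have spans but no matching config entry, sorted."""
    configured = set(configured_agent_ids)
    return sorted({agent for agent in seen_agent_ids if agent not in configured})
```

For example, seeing spans from sensitive-demo and budget-demo while only budget-demo is configured would yield ["sensitive-demo"], which doctor could turn into a warning.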

Playbook (tests/manual-pre-release-testing.md) fixes

7. Step 3 implies bare tj onboard prompts for plan tier — it doesn't

The plan-tier prompt is only in _onboard_claude_code and _onboard_codex. Plain tj onboard doesn't prompt. Either the playbook needs to route plan-tier verification through --claude-code / --codex, or the bare path should also prompt (then ask which provider).

8. Playbook doesn't tell user to stop pre-existing daemon

Steps 1 through 3 assume a clean shell, but a daemon from a prior install may already be running. The playbook needs an explicit tj stop (or "verify no daemon already running") near the top.

9. max_20x literal in playbook examples should be plan-agnostic

The expected-output checks use Max 20x plan, $200/mo flat — tester should be able to use whichever plan they actually have on the test machine. Reframe checks around format ("Max 5x" / "Max 20x" / etc.) rather than the specific dollar denominator.

10. Ruff baseline note in playbook says "49 errors"; actual run is clean

Local run shows All checks passed! Either the baseline got cleaned up between writing the playbook and now, or the playbook was just wrong. Update to "ruff clean" or "match main baseline."


Open questions / look-into-later

11. tj budget only lists configured agents — not seen-but-unconfigured ones

After step 5, tj budget shows only defaults, budget-demo, and sensitive-demo. The pre-existing claude-code-* agents (which have real spend) aren't listed. Possibly intentional (only show agents with explicit config), but worth confirming.


Status

Playbook progress: step 5 complete; about to start step 6 (cost-optimization analyzers).

Update this issue as we surface more, then turn checked-off boxes into PR commits when we're ready to fix.

🤖 Filed by Claude Code during the v0.3.0 pre-release walkthrough.


More findings from step 6

12. tj optimize fails entirely when tj serve holds the DB lock — CRITICAL

Symptom:

$ tj optimize --finding model-downgrade
Error: Could not open /Users/anilmurty/.tj/telemetry.duckdb read-only: IO Error: 
Could not set lock on file "...": Conflicting lock is held in 
/Library/Frameworks/Python.framework/Versions/3.10/Resources/Python.app/Contents/MacOS/Python 
(PID 89879) by user anilmurty.

Why it matters: The strategy pivot positions cost optimization as the headline product. tj optimize failing whenever the daemon is up (which is the recommended operating mode after onboard auto-installs the daemon) is a launch-blocking regression.

Why CLAUDE.md is aspirational:

tj optimize (cmd_optimize.py) ... Opens the live DB read-only so it works alongside a running tj serve.

cmd_optimize.py does call duckdb.connect(db_path, read_only=True). But DuckDB enforces process-level exclusivity — when one process has the DB open in write mode (the daemon), no other process can attach, even read-only. The docs are explicit about this: https://duckdb.org/docs/stable/connect/concurrency

Fix options (ordered by impact):

  1. Route tj optimize's queries through the HTTP API. This is the largest change: it needs new /api/v1/optimize endpoints that mirror the analyzer surface. Highest ROI: it makes optimize work in the recommended operating mode by default.
  2. Auto-detect the daemon, suggest tj stop in the error. Quick UX fix — at least the user knows what to do. Doesn't actually solve the problem.
  3. Make the daemon yield the write lock for short read transactions. Probably impractical given DuckDB's locking model.

Recommendation: option 1. The analyzer logic already operates on conn.execute(SQL) — it'd port cleanly to an API endpoint that the CLI calls when db.conn is None (i.e., when the API shim is active).

Where: tokenjam/cli/cmd_optimize.py, plus new tokenjam/api/routes/optimize.py.
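Option 1 hinges on the CLI choosing between a direct connection and the API shim. A minimal sketch of that dispatch, assuming the API path can be expressed as a callable that returns rows. run_analyzer and fetch_via_api are hypothetical names, and sqlite3 stands in for the DuckDB connection only so the example runs without extra dependencies.

```python
import sqlite3
from typing import Any, Callable, List, Optional, Tuple

Row = Tuple[Any, ...]


def run_analyzer(
    conn: Optional[Any],
    sql: str,
    fetch_via_api: Callable[[], List[Row]],
) -> List[Row]:
    """Use the direct connection when present, otherwise go through the API."""
    if conn is not None:
        return conn.execute(sql).fetchall()
    # conn is None when the API shim is active (daemon holds the DB lock).
    return fetch_via_api()


# Stand-in demo: sqlite3 plays the role of the direct connection.
demo_conn = sqlite3.connect(":memory:")
direct_rows = run_analyzer(demo_conn, "SELECT 1", fetch_via_api=lambda: [])
api_rows = run_analyzer(None, "SELECT 1", fetch_via_api=lambda: [(1,)])
```

Keeping the analyzer's row-processing logic independent of where the rows came from is what would let the same code serve both the CLI and a new endpoint.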

13. Same bug class likely affects other db.conn-direct callers

tj cost worked because it goes through the StorageBackend protocol via the ApiBackend shim, whereas tj optimize queries db.conn directly. We need to audit other CLI commands that take the db.conn shortcut: probably tj cost --compare (the compute_cost_diff path), and any future analyzers.

14. Per-example cost_usd leaks through unknown-tier dollar suppression

When pricing_mode == "unknown", the top-level downgrade savings line correctly says "savings figures suppressed — plan tier unknown". But the per-example table immediately below still shows the original cost:

Examples:
  82c68dd9..  0 tool calls   —   $8.0590  (claude-opus-4-7)
                              ^^^^^^^^^^^ leaks through

If we're suppressing dollar figures because we don't yet know whether they're real "spend" or implied-API-value, the per-example cost has the same honesty problem.

Fix: In cmd_optimize.py _render_downgrade(), when pricing_mode in {"unknown", "subscription", "local"}, suppress the cost_usd column in examples (or replace with tokens).
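The suppression rule could be isolated in a small cell formatter, sketched below. example_cost_cell is a hypothetical helper, "api" is an assumed name for a mode where dollar figures are real spend, and the token fallback assumes a per-example token count is available.

```python
from typing import Optional

# Modes listed in the proposed fix; in these, the dollar figure is not
# known to be real spend.
_SUPPRESSED_MODES = {"unknown", "subscription", "local"}


def example_cost_cell(
    cost_usd: float, total_tokens: Optional[int], pricing_mode: str
) -> str:
    """Format the per-example cost column, hiding dollars when suppressed."""
    if pricing_mode not in _SUPPRESSED_MODES:
        return f"${cost_usd:.4f}"
    if total_tokens is not None:
        return f"{total_tokens:,} tok"
    return "n/a"
```

Routing both the summary line and the examples table through one rule like this would keep the two from disagreeing again.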

15. LAUNCH-BLOCKING: Wave 2 analyzers produce no CLI output — renderer ignores report.findings

Symptom: Every Wave-2 analyzer (cache-efficacy, cache-recommend, workflow-restructure, prompt-bloat) prints the same generic catch-all message:

$ tj optimize --finding cache-efficacy
No candidates flagged in this window. Either spend is small or all sessions already use a cost-effective model.

That message is the catch-all from _render_report() when both report.downgrade is None and report.budgets is empty. It is not specific to the analyzer that ran.

Root cause: Wave-2 analyzers attach their findings to report.findings[<name>] (the generic dict on OptimizeReport). But cmd_optimize.py::_render_report() only reads the typed slots (report.downgrade, report.budgets) — it never iterates report.findings. So Wave-2 analyzers run successfully, write data, and the renderer drops it on the floor.

Confirmed via JSON path:

$ tj optimize --finding cache-efficacy --json | python3 -m json.tool | head -30
{
    "window": {...},
    "downgrade": null,
    "budgets": [],
    "findings": {
        "cache-efficacy": {
            "rows": [
                {"provider": "anthropic", "model": "claude-opus-4-7", "input_tokens": 25498, "cache_tokens": 1195766597, "efficacy": 1.0, "support": "full", "flagged": false},
                ...6 rows total...
            ]
        }
    }
}

JSON output is correct. CLI text output is not.

Impact: All four Wave-2 analyzers (the bulk of the strategy-pivot sprint) are effectively invisible to anyone running tj optimize interactively. Unit tests pass because they test the analyzer functions directly; the CLI text-rendering path was never wired to read report.findings.

Fix: Extend _render_report() in tokenjam/cli/cmd_optimize.py to iterate report.findings and dispatch to a per-finding renderer. Roughly:

# After rendering downgrade + budgets, before the catch-all:
for name, finding in report.findings.items():
    renderer = _FINDING_RENDERERS.get(name)
    if renderer is not None:
        renderer(finding, pricing_mode=pricing_mode)
        console.print()

Plus per-finding render functions:

  • _render_cache_efficacy() — per-(provider, model) table
  • _render_cache_recommend() — disabled hint OR breakpoint candidates list
  • _render_workflow_restructure() — clusters or "no clusters" message
  • _render_prompt_bloat() — disabled hint OR per-prompt summary table

Update the catch-all condition to also check report.findings so it only fires when truly empty.

Where: tokenjam/cli/cmd_optimize.py
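A self-contained sketch of the dispatch plus the tightened catch-all condition. The OptimizeReport here is a stand-in carrying only the fields named in this issue, the renderer returns lines instead of printing to a console, and the output format is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class OptimizeReport:
    # Stand-in with only the fields mentioned in this issue.
    downgrade: Optional[dict] = None
    budgets: List[dict] = field(default_factory=list)
    findings: Dict[str, Any] = field(default_factory=dict)


def _render_cache_efficacy(finding: dict) -> List[str]:
    # One line per (provider, model) row; the real renderer would build a table.
    return [
        f"{row['provider']}/{row['model']}: efficacy {row['efficacy']:.2f}"
        for row in finding.get("rows", [])
    ]


_FINDING_RENDERERS: Dict[str, Callable[[dict], List[str]]] = {
    "cache-efficacy": _render_cache_efficacy,
}


def render_findings(report: OptimizeReport) -> List[str]:
    lines: List[str] = []
    for name, finding in report.findings.items():
        renderer = _FINDING_RENDERERS.get(name)
        if renderer is not None:
            lines.extend(renderer(finding))
    return lines


def is_empty(report: OptimizeReport) -> bool:
    """True only when no typed slot and no generic finding has content."""
    return report.downgrade is None and not report.budgets and not report.findings
```

One gap in this shape: a finding whose name has no registered renderer makes is_empty false but renders nothing, so a generic fallback renderer may be worth adding to avoid a silent run.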


Pausing playbook run here

Bugs #12 (optimize unusable while daemon up) + #15 (Wave-2 analyzers invisible in CLI) make the rest of the playbook unrunnable in any meaningful way. Resuming after fixes land.
