Skip to content

fix(cli,storage,forms): complete CSV escape + TOCTOU + zip leak + exempt-offerings num (sec)#120

Open
sroussey wants to merge 7 commits into
mainfrom
claude/wonderful-hypatia-gdhUE
Open

fix(cli,storage,forms): complete CSV escape + TOCTOU + zip leak + exempt-offerings num (sec)#120
sroussey wants to merge 7 commits into
mainfrom
claude/wonderful-hypatia-gdhUE

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

@sroussey sroussey commented Jun 2, 2026

Summary

Bundles three independently-rooted correctness fixes — one CRITICAL CLI hardening, one HIGH-severity storage TOCTOU + a HIGH-severity disk leak, and one HIGH-severity data-correctness bug in the exempt-offerings extractors. All seven commits land on claude/wonderful-hypatia-gdhUE; the full suite (bun test) passes 845/845.

Plan F — Complete CSV escape hardening (CRITICAL)

src/cli/output/TableRenderer.ts was vulnerable to CSV / spreadsheet-formula injection via three bypasses on top of PR #119's line-by-line defusing:

  • Tighten LEADING_WS: only strip space-like characters spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula leads, so we no longer strip them before the DANGEROUS_LEAD check; otherwise "\t=cmd" or "\r=cmd" slipped through as non-dangerous after the strip.
  • Bare-CR split: change value.split(/(\r?\n)/)value.split(/(\r\n|\r|\n)/) so a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused.

Tests in src/cli/output/TableRenderer.test.ts now cover all 7 zero-width prefixes + the bare-CR scenarios + a ZWSP-then-benign negative control. 44/44 cases pass.

Plan G — TOCTOU PK race + bootstrap zip leak (HIGH × 2)

  • New src/util/AsyncMutex.ts: trivial promise-queue mutex with rejection-isolated chaining so one failing critical section doesn't poison subsequent calls.
  • PersonObservationRepo / CompanyObservationRepo upsertByNaturalKey: wrap the INSERT branch in a process-wide static AsyncMutex and re-check the natural key inside the critical section. Without the fix, two concurrent inserts on an empty in-memory store both observed size() as 0 and assigned observation_id = 1 to two different natural keys. Update-of-existing rows remain lock-free.
  • BootstrapDownloadTask: wrap the Bun.spawn(unzip) + exit-code check in try { ... } finally { rmSync(zipPath, { force: true }); }. The multi-GB staged zip used to leak whenever Bun.spawn threw or unzip exited non-zero; now it's removed on every exit path.

New tests:

  • PersonObservationRepo.test.ts / CompanyObservationRepo.test.ts: concurrent INSERT-on-empty must yield distinct sequential ids; concurrent UPDATE-existing + parallel INSERT must keep the original id stable.
  • BootstrapDownloadTask.test.ts gains an "execute zip cleanup" describe block: stubs fetch + Bun.spawn + Bun.which, asserts zip removed on (a) spawn-throws, (b) unzip-exits-nonzero, (c) success.

Plan H — Exempt-offerings num() bug (HIGH × 1, broad scope)

The four exempt-offering schemas (Form 1-A, 1-K, 1-Z, C) typed every decimal XML leaf as Type.Number(). When EDGAR shipped an empty or whitespace-only element, Value.Convert on Type.Number() silently coerced it to 0, which then propagated through RegAOfferingHistory, RegAFinancialData, CrowdfundingOfferings, and CrowdfundingReports as a legitimate-looking zero-dollar amount.

  • New src/sec/forms/_valueHelpers.ts with strScalar / numScalar / strWrapped / numWrapped. numScalar treats empty / whitespace-only / NaN / Infinity input as null and rejects thousand-separator strings; legitimate "0" round-trips. Inline copy of the helpers PR fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118 introduced under src/sec/forms/insider-trading/_valueHelpers.ts — when fix(forms,storage): Form 144 num() whitespace + PG bulk-writer duplicate-CIK dedup #118 merges, the duplicate should be reconciled to a single shared module.
  • Schemas: change DECIMAL_TYPE* aliases from Type.Number() to Type.String() in all four schemas (1-A: 4 aliases, 1-K: 2, 1-Z: 2, C: 4). Type.Integer() aliases stay untouched.
  • Storage: wrap every numeric field read with numScalar(...) in Form_1_A.storage.ts (processFinancialData, processForm1A history build), Form_1_K.storage.ts (processOfferingHistory), Form_1_Z.storage.ts (processOfferingSummaries, processCertificationSuspension.approxRecordHolders), and Form_C.storage.ts (processOfferingInfo, processAnnualReportDisclosures). extractor_version bumped 1.0.0 → 1.1.0 in all four with the standard re-extract comment.
  • Tests: parsing assertions updated for the new string contract (decimals are raw XML text, integers stay numbers). New storage regression tests in each *.storage.test.ts cover null / empty / "0" round-trip.

Plan H is intentionally split across 5 commits (parts 1/5–5/5) due to the push-files tool size limit; the diffs taken together are exactly one Plan H.

Findings closed

  • CRITICAL: CSV formula-injection in TableRenderer.escapeCsvValue (3 follow-up failing cases on top of fix(cli): harden CSV formula-injection escape (sec) #119: \t/\r leads, bare-CR multi-line lines, zero-width-prefix bypasses).
  • HIGH: TOCTOU race on synthetic observation_id in PersonObservationRepo and CompanyObservationRepo.
  • HIGH: Bootstrap-download zip leak — staged multi-GB archives leaked on Bun.spawn failure or non-zero unzip exit.
  • HIGH: Empty/whitespace numerics in exempt-offering filings persisted as fabricated 0 in offering history, financial data, and crowdfunding disclosures.

Reconciliation with open PRs

This PR shares scope with #118 (insider-trading _valueHelpers) and #119 (CSV escape line-by-line defusing). Neither is closed by this PR — they should be reconciled at merge time:

Test plan

  • bun test src/cli/output/TableRenderer.test.ts — 44 cases pass.
  • bun test src/storage/observation — Person+Company repo tests pass, including new TOCTOU regression cases.
  • bun test src/task/bootstrap — streamDownloadToFile cases plus new zip-cleanup cases pass.
  • bun test src/sec/forms/_valueHelpers.test.ts — 32/32 helper unit tests pass.
  • bun test src/sec/forms/exempt-offerings — 195/195 parse + storage tests pass, including new null/empty/"0" regression guards per form.
  • bun test whole-suite — 845/845.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc


Generated by Claude Code

sroussey added 7 commits June 2, 2026 01:44
Layers two refinements on top of the line-by-line CSV defusing in #119:

1. Tighten LEADING_WS so it only strips space-like characters that
   spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ,
   ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula
   leads, so we no longer strip them away before the DANGEROUS_LEAD
   check — otherwise "\t=cmd" or "\r=cmd" would slip through as
   non-dangerous after the strip.

2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR
   inside a multi-line cell is also a line boundary; the line after
   the CR is independently defused. Excel re-parses every physical
   line of a quoted cell, including lines separated by lone CR.

Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM
+ "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula
("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left
alone. All 44 cases in TableRenderer.test.ts pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Three related correctness fixes:

1. PersonObservationRepo / CompanyObservationRepo upsertByNaturalKey
   had a TOCTOU window — two concurrent upserts on an empty in-memory
   store would both observe `size()` as 0 and assign observation_id=1
   to two different natural keys. Introduce a tiny process-wide
   AsyncMutex in src/util/AsyncMutex.ts that serializes the INSERT
   branch only (the existing-row update branch is keyed by
   observation_id and remains lock-free). Inside the critical section
   we re-query the natural key before allocating an id so that a
   parallel upsert of the same natural key still collapses to one row
   and one id.

2. BootstrapDownloadTask staged the multi-GB zip into
   SEC_RAW_DATA_FOLDER and only removed it on the success branch. If
   Bun.spawn threw, or unzip exited non-zero, the archive leaked until
   the next bootstrap run, silently consuming disk on operator VMs.
   Wrap the spawn+exit-code block in try/finally and remove the zip
   (with `force: true`) on every exit path.

3. AsyncMutex itself: trivial queue-based serializer. Rejections in
   one critical section don't poison the queue because the chained
   tracker swallows both branches; each caller still observes its own
   rejection through the returned promise.

Tests:
- PersonObservationRepo.test.ts / CompanyObservationRepo.test.ts each
  gain two new cases: concurrent INSERT-on-empty must yield distinct
  sequential ids, and concurrent UPDATE-existing + parallel INSERT
  must keep the original id stable and give the insert the next id.
- BootstrapDownloadTask.test.ts gains an "execute zip cleanup"
  describe block that stubs fetch + Bun.spawn + Bun.which and asserts
  the zip is removed on (a) spawn-throws, (b) unzip-exits-nonzero,
  and (c) the success path.

All 24 affected tests pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 1 of 5 — split across multiple commits due to the
push-files tool size limit; logically one fix. See later commits for
Form_1_A tests, Form_1_K, Form_1_Z, and Form_C.

Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/
strWrapped) plus its unit tests. numScalar treats empty / whitespace-
only / NaN / Infinity input as null and rejects thousand-separator
strings; legitimate "0" round-trips so the regression guard holds.

NOTE: this is an inline copy of the helpers introduced by PR #118
under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges
the duplicate should be removed in favour of a single shared module.

Also switches Form_1_A.schema.ts decimal-type aliases from
Type.Number() to Type.String() (4 aliases) so the storage layer can
make the null-vs-zero decision per cell with numScalar() instead of
Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts
is updated to numScalar() every decimal field in processFinancialData
and the RegAOfferingHistory build, and the extractor_version is bumped
1.0.0 → 1.1.0 to force re-extraction.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 2 of 5 — Form_1_A test updates.

Form_1_A.test.ts: decimal-typed leaves now arrive as raw strings from
the parser; assertions updated accordingly (pricePerSecurity, audit/
legal/etc. fees). New "validate financial data types and ranges" split
into decimal-string vs integer-number checks.

Form_1_A.storage.test.ts: three new regression tests pin the null /
empty / "0" round-trip behaviour through processForm1A.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 3 of 5 — Form_1_K.

Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String().

Storage (processOfferingHistory): price_per_security,
aggregate_offering_price, aggregate_offering_price_holders, and
estimated_net_amount now flow through numScalar(). extractor_version
bumped 1.0.0 → 1.1.0 with the standard re-extract comment.

Tests: parsing assertions updated for the new string contract; new
storage regression tests for null/empty/"0" round-trip.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 4 of 5 — Form_1_Z.

Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String().

Storage: processOfferingSummaries now flows price_per_security,
portionSecuritiesSoldIssuer/Securityholders, and issuerNetProceeds
through numScalar(). processCertificationSuspension's
approxRecordHolders also goes through numScalar() now (with a
null-check instead of an undefined-check so the row is dropped
when the input is empty/whitespace). extractor_version 1.0.0 →
1.1.0.

Tests: parsing assertions updated for string contract; new storage
regression tests for null/empty/"0" round-trip.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Plan H part 5 of 5 — Form_C and final wrap-up.

Schema: 4 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String() — including the previously minimum-0 DECIMAL_TYPE7_2_
NONNEGATIVE. The minimum-0 invariant is no longer expressible at the
schema level; that constraint becomes the extractor's responsibility
(see Form_C.storage.ts numScalar() use).

Storage:
- processOfferingInfo now flows price, offering_amount, and
  maximum_offering_amount through numScalar(); priceDeterminationMethod
  remains a passthrough.
- processAnnualReportDisclosures now flows every disclosure field
  through numScalar() and drops the row when the result is null,
  instead of writing a fabricated 0 disclosure_value.
- extractor_version 1.0.0 → 1.1.0.

Tests: parsing assertions updated for the new string contract on
price/offeringAmount/maximumOfferingAmount/currentEmployees/
totalAssetMostRecentFiscalYear/revenueMostRecentFiscalYear. New
storage regression tests cover null/empty/"0" round-trip for both
offering and disclosure numerics.

Whole-suite verification (bun test) passes 845/845 with the helpers,
schemas, storage, and tests all in place.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant