Fix bulk CIK name writer to respect SEC_DB_TYPE config#111
Merged
Conversation
FetchAllCikNamesTask reached into getDb() directly for its ~1M-row bulk insert fast path. getDb() returns a SQLite handle regardless of SEC_DB_TYPE, so on Postgres backends rows silently landed in a stray SQLite file that nothing else read from — Postgres cik_names stayed empty and every downstream CIK-name lookup returned nothing. - Add a CikNameBulkWriter abstraction that dispatches by SEC_DB_TYPE: SQLite keeps the prepared `INSERT OR REPLACE` + transaction fast path; Postgres uses multi-row parameterised `INSERT ... ON CONFLICT` (capped at 30k rows / 60k bind params per statement, all inside a single transaction); tests + missing-config fall through to the repository's putBulk against the in-memory backend. - Make getDb() throw SecCliConfigurationError when SEC_DB_TYPE isn't sqlite, so the next caller can't repeat the same silent-divergence bug. Gate the canonical-view DDL in setupAllDatabases on SEC_DB_TYPE === "sqlite" for the same reason (Postgres view bootstrap is a follow-up). - Tests pin: the writer falls back to the repository under the in-memory testing config, overwrites by PK on subsequent batches, and handles empty batches; getDb() throws under Postgres. - CLAUDE.md documents the getDb() guard and the bulk-writer pattern. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
In CompanyNormalization.canonicalEndings the regex was built via a
template literal: `\b${regexp}\b`. Inside a template literal `\b` is the
backspace character (U+0008), not a word-boundary escape. The companion
regexes at the top of the file build the same shape via string
concatenation with "\\b" and work correctly.
Net effect: canonicalisations like "Acme L L C" → "Acme LLC" silently
no-op'd. Punctuated forms like "Acme L.L.C." only converged because
normalizeCompanyName strips dots first; space-separated forms never did.
Switch to string concatenation to match the working pattern elsewhere
in the file. Add a regression test that pins space-separated "L L C"
canonicalises and shares a hash id with "LLC".
https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
Every `sec query` subcommand previously called getAll() / records(0) /
listAll() and filtered/sorted/paginated in JS. On the production corpus
(~1M rows in cik_names, millions of filings) every command OOM'd.
Push DB-evaluable predicates down to the storage layer and stream only
when a substring search forces JS-side filtering:
- Add `_streamMatches.ts`: cursored loop over `queryPage(criteria, ...)`
plus a `collectPage(iter, offset, limit)` helper that stops as soon as
it has offset+limit matches.
- Extend `QueryResult<T>` with optional `totalApprox: { atLeast,
exhausted }`. When the iterator was capped before draining, the
displayed total is a lower bound; the table renderer prints "≥ N" and
a hint to narrow the filter. Pure-DB paths still return exact totals.
- FilingQuery: cik/form/single-date-bound pushed down via SearchCriteria;
count() drives the displayed total. Substring on primary_doc_description
and dual date bounds run in the predicate. Default ordering is filing_date DESC.
- CikQuery: empty needle now uses size() + getOffsetPage() — no scan.
Exact/prefix/substring stream with a hard MAX_FUZZY_MATCHES = 1000
cap; surfaces totalApprox when capped. Case-insensitive exact is still
client-side (workglow equality is case-sensitive).
- PersonQuery: cik (source_filing_issuer_cik) pushed down; search +
relationship substring stream.
- EntityQuery: cik → repo.get; sic/state pushed via criteria with
count(); sorted no-filter uses queryPage with orderBy; substring on
name streams.
- CrowdfundingQuery, OfferingQuery, FactsQuery: same shape — cik (and
fy/portal where present) push down, substring fields stream.
- TableRenderer: renders "≥ N (streamed; narrow the filter…)" when the
result carries `totalApprox` with exhausted=false. CLI groups/query.ts
forwards `result.totalApprox` into the renderer.
Workglow's `query({})` rejects empty criteria with
StorageEmptyCriteriaError, so the unfiltered path uses
`getOffsetPage(offset, limit) + size()` instead. Workglow has no LIKE
operator, so any substring filter must remain client-side; the
streaming + cap pattern is the bound.
Tests: existing query tests pass unchanged; new regression test pins
that the CikQuery empty-needle case never sets `totalApprox`.
https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
…res config CSV injection (M2): TableRenderer.escapeCsvValue handled commas, quotes, and newlines but not leading =/+/-/@/TAB/CR. Excel/Sheets/Numbers interpret cells starting with those as formulas, opening WEBSERVICE/HYPERLINK data exfiltration and external command paths. SEC-sourced fields can carry arbitrary text (filer-supplied names, descriptions), so a CSV emitted from `sec query --format csv` was shipping a known attack surface. Prefix flagged cells with a single quote — spreadsheets render it literally and hide the prefix; plain CSV readers see the original text with one apostrophe. Postgres fast-fail (M4): assertSecCliEnvConfigured only validated the sqlite path (SEC_DB_FOLDER + SEC_DB_NAME). With SEC_DB_TYPE=postgres set but no PG env vars, commands ran until pg.Pool surfaced a generic connection error mid-execution, which buried the actual cause. Now require either SEC_PG_URL or SEC_PG_HOST + SEC_PG_DATABASE and throw SecCliConfigurationError up-front with an actionable message. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
SecFetchFileOutputCache built file paths via
path.join(folderPath, inputToFileName(input)) with no validation. The
inputToFileName implementations interpolate SEC-supplied values
(primary_doc, fileName, accession numbers) directly — e.g.
SecFetchAccessionDocTask builds `accessiondocs/{cik}/{accNo}-{fileName}`.
A single malformed `primary_doc` containing `../` segments would
escape SEC_RAW_DATA_FOLDER and overwrite arbitrary files on disk; an
absolute path would write anywhere.
The blast radius is small (SEC data isn't an active attacker), but
BootstrapDownloadTask.ts already validates its target dir the same
way, so this is straightforward defense-in-depth.
Add safeJoinWithinFolder() and use it in both saveOutput() and
getOutput(). Tests pin: ../ escapes throw, absolute paths throw, and
legitimate nested subdirectories still work.
https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
BootstrapSubmissionsTask, BootstrapCompanyFactsTask, UpdateAllSubmissionsTask, and UpdateAllCompanyFactsTask each computed a "what's already processed?" set/map by calling `await processedRepo.getAll()` and iterating. On a production-scale DB (hundreds of thousands to millions of processed rows) this materialises every column for every row into one giant array, only to keep cik (and last_processed for the freshness comparison). Switch to `for await (const row of repo.records(5000))` so peak memory is bounded to one page (5k rows) plus the final set/map. The set/map still grows to N entries, but ~24 bytes/entry is ~25 MB for 1M CIKs vs several hundred MB for the full row dump. Extracted streamProcessedCikSet() into src/storage/processing/ for the two callers (BootstrapSubmissionsTask, BootstrapCompanyFactsTask) that need only the cik column. The UpdateAll* callers inline `records()` because they need (cik, last_processed) pairs in a Map. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
…ering BootstrapDownloadTask used SecFetchJob with response_type: "blob", which materialised the entire response body in JS heap before Bun.write() copied it to disk. SEC's submissions.zip and companyfacts.zip archives are multi-GB; the previous path peaked at roughly the file size in memory and OOM'd on smaller VMs. Extract a `streamDownloadToFile()` helper that uses globalThis.fetch + Bun.file().writer() to pipe the response body straight to disk a chunk at a time. Peak memory is now bounded by the chunk size (Bun's default ~64KB), regardless of archive size. Progress is surfaced via the existing context.updateProgress so multi-GB downloads aren't silent. Tradeoff: bypasses SecFetchJob's queue/retry machinery because workglow's FetchUrlJob buffers the body regardless of response_type. Bulk download is a low-frequency, operator-triggered, single-URL path — SEC's 10 req/sec fair-use limit is never a concern for one download, and a transient failure can be retried by re-running the command. Worth it for the memory ceiling. Tests: streaming writes match the source body across multiple chunks; non-2xx responses throw without writing; missing content-length is handled (totalBytes undefined, no progress percent). https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
StoreSubmissionFilingsTask called entityRepo.saveFiling() once per filing inside a JS loop. For 10-K filers with thousands of historical filings, that's thousands of sequential putBulk-of-one round-trips per company. On Postgres each round-trip is ~1ms, so submission ingest for a single high-volume CIK took multiple seconds where it should take hundreds of milliseconds. Switch to chunked saveFilingsBulk(500). The chunk size keeps the per-batch payload well under PostgreSQL's bind-parameter ceiling (~30k ish given the wide Filing row) while still amortising the round-trip and the workglow `put` event emit across every batch. https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a regression where bulk CIK-name ingestion could silently write to a SQLite file even when PostgreSQL was configured, and it adds several related scalability/security hardening improvements across ingestion, bootstrap download, caching, and CLI query pagination.
Changes:
- Introduces a backend-dispatching CIK-name bulk writer and updates
FetchAllCikNamesTaskto use it (SQLite fast path, Postgres fast path, repo fallback). - Makes SQLite-only access via
getDb()fail loudly whenSEC_DB_TYPEis notsqlite, and updates DB setup logic accordingly. - Improves performance and safety: streamed download-to-disk for large SEC archives, streaming/approx-count query paths for CLI, batch upserts for filings, and cache path traversal protection.
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/util/db.ts | Makes getDb() SQLite-only by throwing when SEC_DB_TYPE is non-sqlite to prevent silent divergence. |
| src/util/db.test.ts | Adds test coverage for the new getDb() configuration guard. |
| src/task/submissions/UpdateAllSubmissionsTask.ts | Streams processed-submission rows via records() to avoid materializing large tables. |
| src/task/submissions/StoreSubmissionFilingsTask.ts | Switches to chunked bulk upserts for filings to reduce DB round-trips. |
| src/task/submissions/BootstrapSubmissionsTask.ts | Uses a streamed CIK set builder to avoid getAll() on large processed tables. |
| src/task/facts/UpdateAllCompanyFactsTask.ts | Streams processed-facts rows via records() to reduce memory usage. |
| src/task/facts/BootstrapCompanyFactsTask.ts | Uses streamed CIK set builder for processed facts instead of getAll(). |
| src/task/ciknames/FetchAllCikNamesTask.ts | Uses the new bulk writer abstraction and ensures writer closure via finally. |
| src/task/ciknames/FetchAllCikNamesTask.test.ts | Adds unit tests pinning bulk-writer fallback and overwrite semantics. |
| src/task/bootstrap/BootstrapDownloadTask.ts | Streams large ZIP downloads directly to disk to avoid buffering multi-GB blobs in memory. |
| src/task/bootstrap/BootstrapDownloadTask.test.ts | Adds tests for streaming download behavior and error handling. |
| src/storage/processing/processedCikSet.ts | Adds helper to stream CIKs into a Set with bounded peak memory. |
| src/storage/entity/EntityRepo.ts | Adds saveFilingsBulk() wrapper over repository putBulk(). |
| src/storage/entity/cikNameBulkWriter.ts | Adds backend-specific fast paths for bulk CIK-name ingestion (SQLite/Postgres/repo fallback). |
| src/storage/company/CompanyNormalization.ts | Fixes regex word-boundary handling so canonical suffix normalization actually runs. |
| src/storage/company/CompanyNormalization.test.ts | Adds regression tests for canonicalizing space-separated suffixes (e.g., “L L C”). |
| src/fetch/SecFetchFileOutputCache.ts | Adds safeJoinWithinFolder to prevent cache path traversal writes/reads. |
| src/fetch/SecFetchFileOutputCache.test.ts | Adds tests verifying traversal/absolute-path rejection and legitimate nested paths. |
| src/config/setupAllDatabases.ts | Guards SQLite-only view DDL creation behind an explicit SEC_DB_TYPE === "sqlite" check. |
| src/config/EnvToDI.ts | Strengthens configuration validation for sqlite vs postgres setups (clearer errors). |
| src/cli/queries/PersonQuery.ts | Reworks query to push down DB-filterable criteria and stream predicate-only filters. |
| src/cli/queries/OfferingQuery.ts | Streams predicate-only filters and uses count/query where possible to avoid full loads. |
| src/cli/queries/FilingQuery.ts | Pushes down equality/range where possible; streams for substring search and dual-ended ranges. |
| src/cli/queries/FactsQuery.ts | Uses DB count/query for exact filters; streams when substring filters are required. |
| src/cli/queries/EntityQuery.ts | Adds streaming substring search path and supports approximate totals; adds sort validation. |
| src/cli/queries/CrowdfundingQuery.ts | Pushes down criteria where possible; streams for substring and dual-ended date filters. |
| src/cli/queries/CikQuery.ts | Adds empty-needle pagination fast path and caps fuzzy match collection to bound memory. |
| src/cli/queries/CikQuery.test.ts | Adds regression test ensuring empty-needle pagination avoids scanning. |
| src/cli/queries/_streamMatches.ts | Adds generic streaming helpers for cursor-pagination + collecting an offset/limit window. |
| src/cli/output/TableRenderer.ts | Adds “≥ N” rendering for approximate totals and defuses CSV formula injection. |
| src/cli/output/TableRenderer.test.ts | Adds tests for CSV formula-injection defusing and benign cases. |
| src/cli/groups/query.ts | Plumbs totalApprox through to the table renderer for CLI output. |
| CLAUDE.md | Documents the new SQLite-only getDb() behavior and the recommended backend-branching pattern. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+25
to
+29
| // `path.relative` returns ".." segments when the candidate escapes the | ||
| // base; the equality check covers the exact-base case. | ||
| const rel = path.relative(base, candidate); | ||
| if (rel.startsWith("..") || path.isAbsolute(rel)) { | ||
| throw new Error( |
Comment on lines
+104
to
+108
| // No criteria but caller wants a sort — cursor-paginate so we get the | ||
| // ordering pushed down. Use queryPage with empty criteria; SQL backends | ||
| // override it with a server-side ORDER BY + LIMIT. | ||
| const page = await repo.queryPage({}, { orderBy, limit }); | ||
| return { rows: [...page.items], total }; |
Comment on lines
+42
to
+47
| * 1. **Empty needle** — `count() + getOffsetPage()`. No scan, no | ||
| * sorting. The "rank" concept doesn't apply because there's no | ||
| * target; rows come back in PK order. | ||
| * 2. **Exact match** — `query({ name }) + count({ name })`. Pushes | ||
| * equality down to the DB. Only the matching rows are loaded. | ||
| * 3. **Prefix / substring** — streams via `records()` because workglow |
| tableEmpty: !anyRowSeen, | ||
| }; | ||
| if (!exhausted) { | ||
| (result as { totalApprox?: { atLeast: number; exhausted: boolean } }).totalApprox = { |
Comment on lines
+41
to
+46
| // SQLite fast path needs SEC_DB_FOLDER to be configured (getDb() requires | ||
| // it). When the test harness leaves SEC_DB_TYPE=sqlite registered as the | ||
| // default but the rest of the production config is absent — e.g. in unit | ||
| // tests that exercise this task without standing up a real DB — fall | ||
| // through to the repository-backed writer instead of crashing in getDb(). | ||
| if (dbType === "sqlite" && globalServiceRegistry.has(SEC_DB_FOLDER)) { |
Critical:
- init wizard pushed only sqlite vars into process.env before EnvToDI;
the new Postgres fast-fail (M4) then threw on first run for any user
configuring Postgres via the wizard, even though .env.local was
written correctly. Push SEC_PG_URL/HOST/PORT/USER/PASSWORD/DATABASE
too. (src/cli/groups/init.ts:141-158)
Important:
- EntityQuery sorted-no-criteria path used queryPage({}, {orderBy,
limit}) which is cursor-based and silently ignored `offset`, so
`sec query entities --sort name --offset 50` always returned the
first page. Switch to getAll({orderBy, limit, offset}), which pushes
ORDER BY/LIMIT/OFFSET to SQL. Regression test added.
(src/cli/queries/EntityQuery.ts:104-110)
- safeJoinWithinFolder rejected legitimate filenames like "..foo.txt"
via a plain startsWith("..") check. Anchor on the path separator (and
exact ".."). Regression test added.
(src/fetch/SecFetchFileOutputCache.ts:28-39)
- cikNameBulkWriter's sqlite branch gated only on SEC_DB_FOLDER but
getDb() requires SEC_DB_NAME too — partial test configs would crash
in getDb() instead of falling back to the in-memory writer. Check
both. (src/storage/entity/cikNameBulkWriter.ts:46-52)
- CikQuery JSDoc described an "Exact match — query({name})" mode that
doesn't exist (the implementation streams for both exact and fuzzy
because workglow equality is case-sensitive). Update the doc.
Removed the unnecessary `result as { totalApprox?: ... }` cast in
the same file by switching to conditional spread.
(src/cli/queries/CikQuery.ts:40-50, 116)
- OfferingQuery `after` predicate worked by accident — null
date_of_first_sale was filtered out via `("" < params.after)` being
truthy. Handle null explicitly on both sides for symmetry with the
`before` branch. (src/cli/queries/OfferingQuery.ts:57-66)
- Six query files always set totalApprox even when the stream drained
(exhausted: true). The renderer's logic was correct, but consumers
reading QueryResult.totalApprox couldn't trust its presence as the
"this is a lower bound" signal. Only emit totalApprox when actually
capped — matches CikQuery's pattern. (Entity/Filing/Person/
Crowdfunding/Offering/FactsQuery.ts)
Minor:
- Removed dead empty `if (after && before)` block in FilingQuery
buildCriteria; lifted comment into the function JSDoc. Also fixed
HTML-entity-encoded comparison operators in the same comment.
- Pulled PG_MAX_ROWS_PER_STATEMENT to module scope in cikNameBulkWriter.
https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu
4 tasks
sroussey
added a commit
that referenced
this pull request
May 27, 2026
bun test shares module state across files in a worker, and FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME / SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry. The registry has no unregister API, so once a sibling runs first those tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its `has(SEC_DB_FOLDER) === false` precondition and (b) drove createCikNameBulkWriter() down the SQLite fast path -> getDb() opened ./sec-db/edgar.sqlite and prepared an INSERT against a non-existent cik_names table, throwing a bun:sqlite prepare error. This was latent since #111 and only surfaced now because adding _streamMatches.test.ts in this PR shifted bun's test-file execution order so the index task tests run before this one in the same worker. Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres value in beforeEach so the writer always routes to the repository (in-memory) writer regardless of any leaked config, and drop the order-dependent has(SEC_DB_FOLDER) assertion.
sroussey
added a commit
that referenced
this pull request
May 27, 2026
) * fix(cli): make streamed query totalApprox a meaningful lower bound collectPage now counts every yielded match (up to a shared soft cap) instead of stopping at offset+limit, so total / totalApprox.atLeast is a real lower bound rather than a constant page-end value. Both collectPage and queryCiks now share the single MAX_FUZZY_MATCHES cap exported from _streamMatches so the two streaming query surfaces report totalApprox with identical semantics. * test(cli): assert streamed total is the full match count; fix stale comments Updates EntityQuery/FilingQuery comments that claimed collectPage stops after offset+limit matches, and adds tests asserting total equals the full match count with totalApprox undefined when the stream drains. * fix(test): isolate cik-name bulk writer test from leaked DB config bun test shares module state across files in a worker, and FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME / SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry. The registry has no unregister API, so once a sibling runs first those tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its `has(SEC_DB_FOLDER) === false` precondition and (b) drove createCikNameBulkWriter() down the SQLite fast path -> getDb() opened ./sec-db/edgar.sqlite and prepared an INSERT against a non-existent cik_names table, throwing a bun:sqlite prepare error. This was latent since #111 and only surfaced now because adding _streamMatches.test.ts in this PR shifted bun's test-file execution order so the index task tests run before this one in the same worker. Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres value in beforeEach so the writer always routes to the repository (in-memory) writer regardless of any leaked config, and drop the order-dependent has(SEC_DB_FOLDER) assertion. * fix(cli): fill window past cap in collectPage; simplify totalApprox; clarify test cast
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a regression where
FetchAllCikNamesTaskunconditionally wrote ~1M CIK name rows to a SQLite file viagetDb(), even whenSEC_DB_TYPE=postgreswas configured. The system would then read from PostgreSQL, causing a silent data divergence.Key Changes
New
cikNameBulkWriter.ts— IntroducescreateCikNameBulkWriter()factory that dispatches to backend-specific fast paths:INSERT OR REPLACEtransaction (via prepared statement)INSERT ... ON CONFLICTwith 30k-row batching (respects 65535 bind parameter limit)Updated
FetchAllCikNamesTask.ts— Replaces directgetDb()call withcreateCikNameBulkWriter(), making flush operations async and ensuringwriter.close()is called in a finally block.Hardened
getDb()indb.ts— Now throwsSecCliConfigurationErrorifSEC_DB_TYPEis not"sqlite", preventing silent data divergence. Callers must usecreateStorage()/ repository tokens orgetPgPool()for non-SQLite backends.Updated
setupAllDatabases.ts— Guards SQLite-specific view DDL creation behind adbType === "sqlite"check, sincegetDb()now throws for other backends.New test coverage —
FetchAllCikNamesTask.test.tsanddb.test.tsverify bulk writer dispatch logic and thegetDb()safety guard.Implementation Details
The bulk writer factory checks
SEC_DB_TYPEandSEC_DB_FOLDERregistration to pick the right path:SEC_DB_TYPE=sqliteandSEC_DB_FOLDERis configured → SQLite fast pathSEC_DB_TYPE=postgres→ PostgreSQL fast pathThis design also serves as a safety net for the
getDb()regression: if a caller mistakenly reaches intogetDb()whenSEC_DB_TYPE=postgres, it will fail loudly rather than silently writing to the wrong database.https://claude.ai/code/session_01KMfZPiw68JWG7KiKyLxGEu