fix(cli): make streamed query totalApprox a meaningful lower bound#112
Merged
Conversation
collectPage now counts every yielded match (up to a shared soft cap) instead of stopping at offset+limit, so total / totalApprox.atLeast is a real lower bound rather than a constant page-end value. Both collectPage and queryCiks now share the single MAX_FUZZY_MATCHES cap exported from _streamMatches so the two streaming query surfaces report totalApprox with identical semantics.
…omments Updates EntityQuery/FilingQuery comments that claimed collectPage stops after offset+limit matches, and adds tests asserting total equals the full match count with totalApprox undefined when the stream drains.
bun test shares module state across files in a worker, and FetchDailyIndexTask.test.ts / FetchQuarterlyIndexTask.test.ts call EnvToDI() at module load, which registers SEC_DB_FOLDER / SEC_DB_NAME / SEC_DB_TYPE=sqlite (from .env.test) into the shared globalServiceRegistry. The registry has no unregister API, so once a sibling runs first those tokens stay set. FetchAllCikNamesTask.test.ts then (a) failed its `has(SEC_DB_FOLDER) === false` precondition and (b) drove createCikNameBulkWriter() down the SQLite fast path -> getDb() opened ./sec-db/edgar.sqlite and prepared an INSERT against a non-existent cik_names table, throwing a bun:sqlite prepare error. This was latent since #111 and only surfaced now because adding _streamMatches.test.ts in this PR shifted bun's test-file execution order so the index task tests run before this one in the same worker. Make the test deterministic: pin SEC_DB_TYPE to a non-sqlite/non-postgres value in beforeEach so the writer always routes to the repository (in-memory) writer regardless of any leaked config, and drop the order-dependent has(SEC_DB_FOLDER) assertion.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes the streaming CLI query pagination/counting semantics so total becomes a meaningful lower bound (rather than a constant offset+limit) and aligns the behavior across streaming query surfaces by sharing a single soft cap constant.
Changes:
- Reworked
collectPage()to count matches up to a shared soft cap while retaining only the requested[offset, offset+limit)window (O(limit) memory), and exportedMAX_FUZZY_MATCHES. - Updated
CikQueryto import/use the shared cap and documented why it must buffer-and-sort the whole capped match set. - Added/updated unit tests to assert streamed searches report full totals when the iterator drains (and omit
totalApproxin that case).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/task/ciknames/FetchAllCikNamesTask.test.ts | Pins SEC_DB_TYPE in tests to avoid leaked registry config affecting bulk-writer selection. |
| src/cli/queries/PersonQuery.test.ts | Adds assertion that drained streaming searches return full total and no totalApprox. |
| src/cli/queries/FilingQuery.ts | Updates streaming-path comment to reflect new collectPage semantics. |
| src/cli/queries/FilingQuery.test.ts | Adds assertion that drained streaming searches return full total and no totalApprox. |
| src/cli/queries/EntityQuery.ts | Clarifies QueryResult.total/totalApprox semantics and updates streaming-path comments. |
| src/cli/queries/EntityQuery.test.ts | Adds regression test that streamed totals are full-count (not offset+limit). |
| src/cli/queries/CikQuery.ts | Imports shared MAX_FUZZY_MATCHES and documents buffering rationale for ranking. |
| src/cli/queries/_streamMatches.ts | Exports shared cap and rewrites collectPage() to count matches up to cap while window-buffering. |
| src/cli/queries/_streamMatches.test.ts | New unit tests covering collectPage() totals, cap behavior, and default-cap behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+91
to
95
| matched++; | ||
| if (matched >= maxScan) { | ||
| // Hit the soft cap: there may be more matches we never counted. | ||
| return { rows: window, total: matched, exhausted: false }; | ||
| } |
| /** The match count we got to before stopping. */ | ||
| /** The match count we got to before stopping (the soft cap). */ | ||
| readonly atLeast: number; | ||
| /** True if the iterator drained — in which case `total` is exact. */ |
| // API, so without this pin a leaked sqlite config would route the | ||
| // writer down the getDb() fast path and prepare an INSERT against a | ||
| // cik_names table that does not exist in the test SQLite file. | ||
| globalServiceRegistry.registerInstance(SEC_DB_TYPE, "memory" as unknown as "sqlite"); |
…clarify test cast
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses two HIGH review findings on the streaming CLI query path added in #111.
collectPageinsrc/cli/queries/_streamMatches.tsstopped at exactlyoffset + limitmatches and returned that astotal, so consumers reportedtotalApprox.atLeast = offset + limit, a constant. The "≥ N" UX (rendered byTableRenderer.ts) was therefore meaningless — it always showed the page end.CikQuery.tsinstead drained up toMAX_FUZZY_MATCHES = 1000and reported a real count, so the two streaming query surfaces exposedtotalApproxwith different meanings, and each declared its own cap.Decisions applied
MAX_FUZZY_MATCHES = 1000is now exported fromsrc/cli/queries/_streamMatches.ts(moved out ofCikQuery.ts). BothcollectPage(as the defaultmaxScan) andqueryCiksreference it, so there is one source of truth and both surfaces reporttotalApproxwith identical semantics.totalApproxstays table-only. No change torenderJson/ CSV output. Limitation: JSON and CSV consumers still receive only the numerictotal(a lower bound when the cap fires) without the "this is approximate" signal; surfacing it there is out of scope for this fix.Changes per file
_streamMatches.ts— RewrotecollectPage(iter, offset, limit, maxScan = MAX_FUZZY_MATCHES)to count every yielded match, retain only rows in[offset, offset + limit)(O(limit) memory), stop at the cap (exhausted: false) or when the iterator drains (exhausted: true), and returntotal = matched. Exported the sharedMAX_FUZZY_MATCHESconstant and updated docstrings.CikQuery.ts— Now imports the sharedMAX_FUZZY_MATCHESinstead of declaring its own. Added a comment explaining why CikQuery buffers all matches up to the cap (ranking reorders the whole set) whilecollectPagebuffers only the window. The empty-needle fast path still never setstotalApprox(unchanged).EntityQuery.ts/FilingQuery.ts— Fixed stale comments that claimed the stream "stops after offset + limit matches".QueryResult.totaldocstring clarified. The six consumers keep theexhausted ? {...} : {..., totalApprox}pattern, which becomes correct now thattotalis real. EntityQuery's post-collect sort behavior is unchanged (still sorts the window only); only its comment was corrected.Facts/Offering/Crowdfunding/PersonQuery — no code change needed; pattern already correct (verified).TableRenderer.ts— no change; confirmed the "≥ N" branch reads correctly now thattotalis meaningful._streamMatches.test.ts(new) — unit tests forcollectPage: full count below cap, cap fires (total === maxScan,exhausted: false), offset past match set, empty iterator, O(limit) window, and default-cap behavior.EntityQuery.test.ts/FilingQuery.test.ts/PersonQuery.test.ts— added assertions that a streamed search reports the FULL match count (notoffset + limit) withtotalApproxundefined when the stream drains. CikQuery's existing "empty needle never sets totalApprox" test is preserved untouched.Test plan
collectPageunit tests pass (full count, cap, offset-past, empty, O(limit), default cap)total+ undefinedtotalApproxGenerated by Claude Code