Status (PR #624)
PR #624 lands the bulk of this work — atomic metadata writes, config fingerprint persistence and compatibility checks, async parquet row-count recovery, allow_resize=True rejection, post-generation terminal handling, and resume documentation. It also unifies the sync and async resume paths so both engines derive num_completed_batches and actual_num_records from parquet-files/batch_*.parquet (metadata stays the source of truth for run configuration; the filesystem is the source of truth for progress).
The "reduce metadata-write overhead in async" item is deferred — current write cost has not shown up as a hot path in profiling.
Task Summary
Harden the resume feature added in #526 before we rely on it for broad retry/orchestration workflows. The initial implementation is useful and will be merged, but follow-up work is needed around async partial-row recovery, sync allow_resize behavior, metadata durability, compatibility metadata, tests, and documentation.
Technical Details & Implementation Plan
Correctness fixes
Async resume should count actual persisted rows, not inferred row-group sizes.
Current async resume reconstructs initial_actual_num_records from completed row-group IDs via _rg_size(rg_id).
This overcounts salvaged partial row groups after early shutdown, where a parquet file may contain fewer rows than the requested row-group size.
Replace _find_completed_row_group_ids() with a scan that returns {row_group_id: parquet_num_rows} using parquet metadata (see the sketch below).
Add a regression test with a real parquet file containing fewer rows than the row-group size.
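A minimal sketch of that scan, assuming pyarrow and a parquet-files/batch_<id>.parquet layout; the function name is illustrative, not the project's actual API:

```python
import re
from pathlib import Path

import pyarrow.parquet as pq


def _scan_completed_row_groups(dataset_dir: Path) -> dict[int, int]:
    """Map each persisted row-group ID to the number of rows actually on disk."""
    counts: dict[int, int] = {}
    for path in (dataset_dir / "parquet-files").glob("batch_*.parquet"):
        match = re.fullmatch(r"batch_(\d+)\.parquet", path.name)
        if match is None:
            continue  # skip temp files or foreign artifacts
        # read_metadata touches only the parquet footer, so this stays cheap;
        # a salvaged partial file reports its true, smaller row count instead
        # of the requested row-group size.
        counts[int(match.group(1))] = pq.read_metadata(path).num_rows
    return counts
```

Resume can then derive initial_actual_num_records as sum(counts.values()) rather than multiplying completed IDs by an assumed row-group size.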
Reject or fully support resume with allow_resize=True.
allow_resize=True mutates batch cardinality and _num_records_list, so the current metadata is not enough to reconstruct the original remaining batch plan.
Small safe fix: fail fast when resume != ResumeMode.NEVER and any column has allow_resize=True (sketched below).
Alternative: persist the exact original requested target and batch plan in metadata, then resume from that plan.
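A sketch of the fail-fast guard, run before any checkpoint state is loaded; ResumeMode and DatasetGenerationError are the project's existing types, while the function name and column attributes are illustrative:

```python
def _reject_resume_with_resize(columns, resume_mode) -> None:
    # allow_resize=True mutates batch cardinality and _num_records_list
    # mid-run, so persisted metadata cannot reconstruct the remaining
    # batch plan. Fail fast instead of resuming into a wrong plan.
    if resume_mode == ResumeMode.NEVER:
        return
    resizable = [col.name for col in columns if getattr(col, "allow_resize", False)]
    if resizable:
        raise DatasetGenerationError(
            "resume is not supported with allow_resize=True "
            f"(columns: {', '.join(resizable)}); use resume=ResumeMode.NEVER "
            "or disable allow_resize."
        )
```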
Make metadata.json writes atomic.
Resume depends on metadata.json as checkpoint state, but current writes truncate/rewrite in place.
Write to a temporary file, flush/fsync, then os.replace() into place, as sketched below.
Add corrupt/truncated metadata tests that raise a clear DatasetGenerationError.
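A standard write-temp-then-rename sketch (the function name is illustrative):

```python
import json
import os
from pathlib import Path


def _write_metadata_atomic(metadata: dict, path: Path) -> None:
    """Write metadata.json so readers never observe a torn or empty file."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(metadata, f)
        f.flush()
        os.fsync(f.fileno())  # push file contents to disk before the rename
    os.replace(tmp, path)  # atomic replacement on POSIX and Windows
```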
Handle post-generation-processed datasets as terminal for resume.
process_after_generation runs on the whole dataset and can re-chunk files, change row counts, or change schema.
Persist post_generation_processed: true after successful run_after_generation.
On resume, return the existing path if the requested target matches; otherwise fail fast with a clear error before generation starts (sketched below).
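A sketch of the terminal check, assuming the post_generation_processed flag and a target_num_records field in metadata; the function name is hypothetical and DatasetGenerationError is the project's error type:

```python
from pathlib import Path


def _check_post_generation_terminal(dataset_dir: Path, metadata: dict,
                                    requested_target: int) -> Path | None:
    # Runs before generation starts. Once run_after_generation has re-chunked
    # files or changed row counts/schema, row groups can no longer be extended.
    if not metadata.get("post_generation_processed"):
        return None  # fall through to normal resume handling
    if metadata.get("target_num_records") == requested_target:
        return dataset_dir  # same target: the dataset is already complete
    raise DatasetGenerationError(
        "dataset was already post-processed and cannot be resumed or "
        "extended; generate into a fresh output path instead."
    )
```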
Metadata and compatibility
Persist config fingerprint in metadata.json.
Store config_hash and config_hash_version alongside checkpoint metadata.
Use those values for resume compatibility checks instead of deserializing builder_config.json and recomputing the fingerprint (see the sketch below).
Keep builder_config.json as the human-readable config record.
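A sketch of fingerprint persistence, assuming the config serializes to JSON; the actual hashing scheme is whatever the builder already uses for config_hash:

```python
import hashlib
import json

CONFIG_HASH_VERSION = 1  # bump whenever the fingerprinting scheme changes


def config_fingerprint(config: dict) -> str:
    # Canonical JSON keeps the hash stable regardless of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stored next to checkpoint state in metadata.json:
#   "config_hash": config_fingerprint(cfg),
#   "config_hash_version": CONFIG_HASH_VERSION,
# On resume, recompute the live config's fingerprint and compare directly
# (only when config_hash_version matches), consulting builder_config.json
# solely as a human-readable record.
```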
Tighten corrupt config handling.
Distinguish missing config, unreadable/corrupt config, and true config mismatch.
Avoid surfacing corrupt state as a misleading "config changed" condition (see the sketch below).
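A sketch of the three-way distinction, reusing the illustrative config_fingerprint helper from above:

```python
import json
from pathlib import Path


def classify_config_state(path: Path, expected_hash: str) -> str:
    # Keep "corrupt" distinct from "mismatch" so a truncated file is never
    # reported to the user as "your config changed".
    if not path.exists():
        return "missing"
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError, UnicodeDecodeError):
        return "corrupt"
    return "match" if config_fingerprint(stored) == expected_hash else "mismatch"
```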
Use required metadata fields consistently.
_load_resume_state() safely retrieves target_num_records but later uses direct dict access.
Treat required resume metadata as required and raise a clear DatasetGenerationError when absent (sketched below).
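A sketch of a required-field accessor; the helper is hypothetical and DatasetGenerationError is the project's error type:

```python
def _require(metadata: dict, key: str):
    # A missing required field means a corrupt or incompatible checkpoint;
    # surface that directly instead of letting a KeyError escape later.
    try:
        return metadata[key]
    except KeyError:
        raise DatasetGenerationError(
            f"resume metadata is missing required field {key!r}; the "
            "checkpoint is corrupt or was written by an incompatible version"
        ) from None


# target_num_records = _require(metadata, "target_num_records")
```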
Durability
Reduce metadata-write overhead in async (deferred — not currently a hot path).
Async currently writes full metadata, including file path scans, after every row group.
Consider writing compact progress metadata every K row groups, with full metadata at the end, as sketched below.
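If this deferred item is picked up later, the shape might look like the following; the hook, its callbacks, and the cadence are all illustrative:

```python
PROGRESS_EVERY_K_ROW_GROUPS = 16  # illustrative cadence, to be tuned


def on_row_group_persisted(n_done: int, is_final: bool,
                           write_progress, write_full_metadata) -> None:
    # write_progress / write_full_metadata are hypothetical builder hooks.
    if is_final:
        write_full_metadata()  # full metadata, including file-path scans
    elif n_done % PROGRESS_EVERY_K_ROW_GROUPS == 0:
        write_progress(n_done)  # compact checkpoint: counts only
```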
Tests
Replace empty .parquet stubs in async resume tests with real parquet fixtures.
Include full row group, partial salvaged row group, and non-aligned final row group cases (see the fixture sketch below).
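A sketch of a real-fixture helper using pyarrow (the helper name is illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq


def write_parquet_fixture(path, num_rows: int) -> None:
    """Write a genuine parquet file with an exact row count.

    Real files let resume tests exercise full row groups, salvaged partial
    row groups, and a non-aligned final row group, which empty stubs cannot.
    """
    pq.write_table(pa.table({"id": list(range(num_rows))}), path)


# Example: requested row-group size 100, but only 37 rows were salvaged.
# write_parquet_fixture(tmp_dir / "batch_0.parquet", 37)
```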
Add direct allow_resize=True + resume tests.
Either assert the combination raises clearly, or verify exact batch-plan resume behavior.
Add config fingerprint round-trip coverage while the builder_config.json fallback exists.
Documentation
Update architecture/dataset-builders.md with resume mechanics and invariants.
Add user docs for ResumeMode.NEVER, ResumeMode.ALWAYS, and ResumeMode.IF_POSSIBLE.
Document resume failure conditions: config drift, buffer-size mismatch, requested records below existing records, post-generation processing already applied, and allow_resize=True rejection.
Document that process_after_generation runs on the whole dataset, not per buffer.
Document that DatasetCreationResults from a resume invocation reflects only that invocation, not cumulative observability across the original run plus the resume.
Investigation / Context
Follow-up to #526 (feat: resume interrupted dataset generation runs). The review found that the core resume implementation is useful, but a few edge cases remain:
Async resume uses filesystem row-group IDs as truth, but salvaged partial row groups need actual parquet row counts.
allow_resize=True changes cardinality.
metadata.json is a crash-recovery checkpoint but is currently rewritten in place.
process_after_generation operates on the full dataset and can make parquet row-group identity invalid for later resume/extension.
Agent Plan / Findings
Recommended order for the follow-up PR:
Add atomic metadata writes and corrupt-metadata error handling.
Change async completed-row-group discovery to read parquet row counts.
Reject resume != ResumeMode.NEVER when allow_resize=True, unless exact batch-plan persistence is implemented.
Add post_generation_processed metadata and fail-fast/no-op handling.
Persist config_hash / config_hash_version in metadata and use it for compatibility checks.
Add focused regression tests for async partial row groups, allow_resize, and truncated metadata.
Update architecture and user docs.
Security note: no obvious malicious code was found in the #526 surface; atomic metadata writes are an integrity hardening item worth addressing promptly.
Priority Level
High
Dependencies
Follow-up to #526.