Persist config identity in metadata, make checkpoints atomic, and reject unsafe resume states so interrupted runs do not mix incompatible or post-processed data.
PR #624 Review

Greptile Summary: This PR hardens resume checkpoint handling by making metadata writes atomic, unifying sync and async progress recovery from parquet files on disk, and adding new guards that reject unsafe resume states (`allow_resize=True` and already post-processed datasets).
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Core resume hardening: adds _post_generation_processed_resume_result guard, allow_resize rejection, unified _recover_progress_from_disk for sync+async, _find_completed_row_groups returning real parquet row counts, and config-fingerprint-first compat check. |
| packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py | Adds atomic metadata write (tmp + fsync + os.replace with finally-guarded cleanup), set_metadata_defaults for fingerprint injection, and _metadata_defaults merged with caller dict on every write_metadata call. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Comprehensive new test coverage: allow_resize guards, post-processed terminal-state paths, corrupt metadata handling, non-contiguous batch ID rejection, metadata-lags-disk crash-window scenario, and partial row-group salvage. |
| packages/data-designer/src/data_designer/interface/data_designer.py | Splits the single try/except into two blocks so DeprecationWarning from builder construction propagates correctly under strict warning filters. |
| packages/data-designer/src/data_designer/interface/results.py | Documentation-only update clarifying resume scope for DatasetCreationResults. |
| packages/data-designer-engine/tests/engine/storage/test_artifact_storage.py | Adds test verifying that set_metadata_defaults fields are merged into every write_metadata call. |
| architecture/dataset-builders.md | New Resume Checkpointing section documenting the unified metadata/filesystem split and new invariants. |
Reviews (6): Last reviewed commit: "docs: explain DeprecationWarning re-rais..."
Let `IF_POSSIBLE` start fresh for resize configs and mark after-generation processing before mutation so interrupted processors cannot be resumed unsafely.
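For readers unfamiliar with the ordering, a minimal sketch of the idea: persist the marker before mutating any rows, so a crash mid-processing leaves the dataset flagged. The `post_generation_processed` key, the function name, and the storage interface here are illustrative assumptions, not the engine's actual API.

```python
def apply_post_generation_processors(processors, storage) -> None:
    # Persist the marker *before* any rows are mutated. If the process dies partway
    # through a processor, the next resume attempt sees the flag and refuses to
    # append generation-phase rows onto a partially post-processed dataset.
    storage.write_metadata({"post_generation_processed": True})
    for processor in processors:
        processor.run(storage.artifact_dir)
```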
Single-user CLI/notebook flows don't race on the artifact directory, and the timestamped-directory fallback already handles the "ran it twice" case. The lock added complexity (re-entrancy, stale cleanup, the cached-property trap where `IF_POSSIBLE`→`NEVER` moves writes to a timestamped directory while the lock stays pinned to the original) for no real protection. Atomic metadata writes still cover the actual hazard (crash mid-write). Also fix a pre-existing test bug in `test_initial_actual_num_records_uses_actual_parquet_rows_for_partial_row_group` where the mocked scheduler hit the partial-completion path with unconfigured `Mock` attributes.
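A minimal sketch of the atomic write and default-merging behaviour summarised in the `artifact_storage.py` row of the table above; the class shape, attribute names, and file layout are assumptions for illustration, not the real implementation.

```python
import json
import os
import tempfile
from pathlib import Path


class ArtifactStorageSketch:
    """Sketch only: tmp + fsync + os.replace, with defaults merged on every write."""

    def __init__(self, artifact_dir: Path) -> None:
        self.artifact_dir = artifact_dir
        self._metadata_defaults: dict = {}

    def set_metadata_defaults(self, **defaults) -> None:
        # e.g. the config fingerprint, injected once and merged on every write
        self._metadata_defaults.update(defaults)

    def write_metadata(self, metadata: dict) -> None:
        target = self.artifact_dir / "metadata.json"
        merged = {**self._metadata_defaults, **metadata}
        fd, tmp_path = tempfile.mkstemp(dir=self.artifact_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as tmp:
                json.dump(merged, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())  # data reaches disk before the rename
            # atomic swap: readers see the old file or the new one, never a half-write
            os.replace(tmp_path, target)
        finally:
            # finally-guarded cleanup: drop the temp file if the replace never happened
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
```

With this pattern a crash at any point leaves either the previous `metadata.json` or the fully written replacement on disk, which is why a directory lock adds no extra safety for the single-writer case.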
* Drop the unreachable `ResumeMode.IF_POSSIBLE` branch in `_post_generation_processed_resume_result`. By the time this helper runs, `build()` has normalised `IF_POSSIBLE` to `ALWAYS` or `NEVER`, so the guard now matches reality. Tighten the docstring to document the three outcomes (no-op return / fall through / raise); a hedged sketch of that shape follows this list.
* Split the post-processed extension/raise into two cases. When `num_records < prior_target` the user just asked for fewer records than already exist; the previous "would mix pre- and post-processor records" message only describes the extension case. Mirror the wording used by `_load_resume_state` and add a regression test.
* Remove the dead `_find_completed_row_group_ids` wrapper now that `_build_async` uses `_find_completed_row_groups` directly. Rename the related test to match.
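One plausible shape for the guard after these changes; `ResumeError`, the exact messages, and the mapping of cases to outcomes are assumptions based on the description above, not code lifted from the PR.

```python
from enum import Enum


class ResumeMode(Enum):
    ALWAYS = "always"
    NEVER = "never"
    # IF_POSSIBLE is normalised to ALWAYS or NEVER in build() before this guard runs.


class ResumeError(RuntimeError):
    """Illustrative stand-in for the engine's resume rejection error."""


def post_generation_processed_resume_result(resume_mode, num_records, prior_target, existing_result):
    """Three outcomes: no-op return of the existing result, fall through (None), or raise."""
    if resume_mode is ResumeMode.NEVER:
        return None  # fall through: the caller starts a fresh build
    if num_records == prior_target:
        return existing_result  # no-op: the post-processed dataset already satisfies the request
    if num_records < prior_target:
        raise ResumeError(
            f"Requested {num_records} records but the post-processed dataset "
            f"already holds {prior_target}."
        )
    raise ResumeError("Extending a post-processed dataset would mix pre- and post-processor records.")
```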
Both engines now derive `num_completed_batches` and `actual_num_records` from `parquet-files/batch_*.parquet` via `_recover_progress_from_disk`. `metadata.json` keeps describing the run *configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), while the filesystem is the source of truth for *progress*. This closes the sync engine's race window between `move_partial_result_to_final_file_path` and the metadata write that follows it, matching the crash-recovery the async engine already had. The sync engine additionally rejects non-contiguous batch IDs (a hole can only mean external mutation or a directory written by an incompatible engine); the async engine continues to tolerate gaps from out-of-order completion via `allow_holes=True`. Existing sync resume tests now seed parquet files alongside metadata, and two new tests cover the unified behaviour: filesystem progress wins when metadata lags, and sync rejects non-contiguous IDs.
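A hedged sketch of the unified recovery described here, combining what `_recover_progress_from_disk` and `_find_completed_row_groups` are said to do. The standalone signature, the `allow_holes` handling, and the assumption that batch IDs start at 0 are illustrative.

```python
import re
from pathlib import Path

import pyarrow.parquet as pq

_BATCH_FILE = re.compile(r"batch_(\d+)\.parquet$")


def recover_progress_from_disk(artifact_dir: Path, allow_holes: bool) -> tuple[int, int]:
    """Derive (num_completed_batches, actual_num_records) from the parquet files on disk."""
    batch_ids: dict[int, Path] = {}
    for path in (artifact_dir / "parquet-files").glob("batch_*.parquet"):
        match = _BATCH_FILE.search(path.name)
        if match:
            batch_ids[int(match.group(1))] = path

    if batch_ids and not allow_holes:
        # Sync engine: IDs must be contiguous (assumed to start at 0); a hole can only
        # mean external mutation or a directory written by an incompatible engine.
        expected = set(range(max(batch_ids) + 1))
        if set(batch_ids) != expected:
            raise RuntimeError(f"Non-contiguous batch IDs on disk: {sorted(batch_ids)}")

    # Real row counts come from the parquet footers, not from metadata.json.
    actual_num_records = sum(pq.read_metadata(p).num_rows for p in batch_ids.values())
    return len(batch_ids), actual_num_records
```

Reading only the parquet footers keeps recovery cheap while still reflecting exactly what survived the crash, which is what makes the filesystem a trustworthy source of progress.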
`load_dataset`, `count_records`, `load_analysis`, `export`, and `push_to_hub` all read from the artifact directory, so they reflect the cumulative dataset (original + resume rows). `task_traces`, model-usage logs, and telemetry events are scoped to the current invocation only because the original run's in-memory state is not persisted. Document this in the class docstring, the architecture note, and the Fern resume guide.
Future readers would be puzzled by the `except DeprecationWarning: raise` short-circuits that appear before the generic generation-error wrappers. Add a comment in `create()` (with a back-reference from `preview()`) to record that strict warning filters (`pytest.warns`, `-W error::DeprecationWarning`) turn the engine's `warnings.warn(..., DeprecationWarning)` calls, most notably the `allow_resize=True` deprecation in `_resolve_async_compatibility`, into raised exceptions, and we want them to surface untouched instead of being swallowed by `DataDesignerGenerationError`.
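A minimal sketch of the two-block structure this comment documents. `DataDesignerGenerationError` is the wrapper named above; the helper names, messages, and exact call sites are assumptions.

```python
class DataDesignerGenerationError(Exception):
    """Stub for this sketch; the real wrapper lives in the interface package."""


def create(make_builder, config):
    # Builder construction gets its own try/except so that strict warning filters
    # (pytest.warns, -W error::DeprecationWarning) can turn warnings.warn(...,
    # DeprecationWarning) calls -- e.g. the allow_resize=True deprecation -- into
    # exceptions that surface untouched.
    try:
        builder = make_builder(config)
    except DeprecationWarning:
        raise  # do not wrap: callers and tests need to see the deprecation itself
    except Exception as exc:
        raise DataDesignerGenerationError("failed to construct the dataset builder") from exc

    # Generation errors, by contrast, are still wrapped for a uniform surface.
    try:
        return builder.build()
    except DeprecationWarning:
        raise  # preview() back-references this same reasoning
    except Exception as exc:
        raise DataDesignerGenerationError("dataset generation failed") from exc
```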
Summary
* Add guards that reject unsafe resume states: `allow_resize=True` and already post-processed datasets.
* Both engines now derive `num_completed_batches` and `actual_num_records` from `parquet-files/batch_*.parquet` via `_recover_progress_from_disk`. `metadata.json` keeps describing the run configuration (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint); the filesystem is the source of truth for progress. The sync engine additionally rejects non-contiguous batch IDs.
* Document the `DatasetCreationResults` observability scope on resume: methods that read the artifact directory (`load_dataset`, `count_records`, `load_analysis`, `export`, `push_to_hub`) reflect the full on-disk dataset, while `task_traces`, model-usage logs, and telemetry events are scoped to the current invocation only.
* Drop the unreachable `IF_POSSIBLE` guard in `_post_generation_processed_resume_result`, split the post-processed raise so `num_records < prior_target` and the extension case get accurate messages, and drop the dead `_find_completed_row_group_ids` helper.

Fixes #623
Test plan
* `uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -k "resume or allow_resize or post_generation or non_contiguous or recovers_progress"`
* `uv run pytest packages/data-designer-engine/tests/engine/storage/test_artifact_storage.py -k "metadata"`
* `uv run pytest packages/data-designer/tests/interface/test_data_designer.py packages/data-designer/tests/interface/test_results.py`
* `uv run pytest packages/data-designer-engine/tests/` (full engine suite)
* `uv run ruff format --check <touched files>`
* `uv run ruff check <touched files>`
* `fern check` (not run: the Fern CLI is not installed in this environment)