
Harden resume checkpointing and edge cases #623

@nabinchha

Description


Priority Level

High

Status (PR #624)

PR #624 lands the bulk of this work — atomic metadata writes, config fingerprint persistence and compatibility checks, async parquet row-count recovery, allow_resize=True rejection, post-generation terminal handling, and resume documentation. It also unifies the sync and async resume paths so both engines derive num_completed_batches and actual_num_records from parquet-files/batch_*.parquet (metadata stays the source of truth for run configuration; the filesystem is the source of truth for progress).

The "reduce metadata-write overhead in async" item is deferred — current write cost has not shown up as a hot path in profiling.

Task Summary

Harden the resume feature added in #526 before we rely on it for broad retry/orchestration workflows. The initial implementation is useful and will be merged, but follow-up work is needed around async partial-row recovery, sync allow_resize behavior, metadata durability, compatibility metadata, tests, and documentation.

Technical Details & Implementation Plan

Correctness fixes

  • Async resume should count actual persisted rows, not inferred row-group sizes.

    • Current async resume reconstructs initial_actual_num_records from completed row-group IDs via _rg_size(rg_id).
    • This overcounts salvaged partial row groups after early shutdown, where a parquet file may contain fewer rows than the requested row-group size.
    • Replace _find_completed_row_group_ids() with a scan that returns {row_group_id: parquet_num_rows} using parquet metadata.
    • Add a regression test with a real parquet file containing fewer rows than the row-group size.
  • Reject or fully support resume with allow_resize=True.

    • allow_resize=True mutates batch cardinality and _num_records_list, so the current metadata is not enough to reconstruct the original remaining batch plan.
    • Small safe fix: fail fast when resume != ResumeMode.NEVER and any column has allow_resize=True.
    • Alternative: persist the exact original requested target and batch plan in metadata, then resume from that plan.
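The small safe fix could be a guard like the sketch below. `ResumeMode` and `DatasetGenerationError` are stand-ins for the project's real types, and the column shape is assumed; only the fail-fast logic is the point.

```python
from enum import Enum


class ResumeMode(Enum):  # stand-in for the real enum
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


def check_resume_compatible(resume: ResumeMode, columns) -> None:
    """Fail fast: a resumable run cannot include resizable columns, since
    metadata alone cannot reconstruct the mutated batch plan."""
    if resume is ResumeMode.NEVER:
        return
    resizable = [c.name for c in columns if getattr(c, "allow_resize", False)]
    if resizable:
        raise DatasetGenerationError(
            f"resume={resume.value!r} is incompatible with allow_resize=True "
            f"on columns {resizable}; rerun with resume=ResumeMode.NEVER"
        )
```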
  • Make metadata.json writes atomic.

    • Resume depends on metadata.json as checkpoint state, but current writes truncate/rewrite in place.
    • Write to a temporary file, flush/fsync, then os.replace() into place.
    • Add corrupt/truncated metadata tests that raise a clear DatasetGenerationError.
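A minimal sketch of the atomic-write pattern (function name illustrative): a crash at any point leaves either the old complete file or the new complete file, never a truncated one.

```python
import json
import os
import tempfile
from pathlib import Path


def write_metadata_atomic(path: Path, metadata: dict) -> None:
    """Write metadata.json via temp file + fsync + os.replace().

    os.replace() is atomic on both POSIX and Windows, so readers never
    observe a partially written checkpoint. (A directory fsync could be
    added for stronger durability on POSIX.)
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```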
  • Handle post-generation-processed datasets as terminal for resume.

    • process_after_generation runs on the whole dataset and can re-chunk files, change row counts, or change schema.
    • Persist post_generation_processed: true after successful run_after_generation.
    • On resume, return the existing path if the requested target matches; otherwise fail fast with a clear error before generation starts.
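The terminal-state check might be sketched as follows. The metadata field names (`post_generation_processed`, `target_num_records`) follow this plan but are assumptions about the eventual schema.

```python
from typing import Optional


class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


def resume_terminal_check(metadata: dict, requested: int,
                          dataset_path: str) -> Optional[str]:
    """If post-generation processing already ran, the dataset is terminal.

    Returns the existing path when the requested target matches, None when
    the dataset is not terminal (normal resume logic applies), and raises
    before any generation starts otherwise.
    """
    if not metadata.get("post_generation_processed", False):
        return None
    if metadata.get("target_num_records") == requested:
        return dataset_path  # nothing to do: hand back the finished dataset
    raise DatasetGenerationError(
        "dataset was already post-processed and cannot be resumed with a "
        f"different target ({requested} != {metadata.get('target_num_records')})"
    )
```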

Metadata and compatibility

  • Persist config fingerprint in metadata.json.

    • Store config_hash and config_hash_version alongside checkpoint metadata.
    • Use those values for resume compatibility checks instead of deserializing builder_config.json and recomputing the fingerprint.
    • Keep builder_config.json as the human-readable config record.
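A stable fingerprint could be computed over a canonical JSON serialization, as in this sketch (hashing scheme and version constant are illustrative choices, not the project's actual algorithm):

```python
import hashlib
import json

CONFIG_HASH_VERSION = 1  # bump whenever the hashing scheme changes


def config_fingerprint(config: dict) -> dict:
    """Return the fields to persist alongside checkpoint metadata.

    Sorting keys and fixing separators makes the digest independent of
    dict ordering and whitespace, so only semantic config changes alter it.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "config_hash_version": CONFIG_HASH_VERSION,
    }
```

On resume, comparing the persisted `config_hash` (at a matching `config_hash_version`) avoids deserializing `builder_config.json` and recomputing the fingerprint on every check.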
  • Tighten corrupt config handling.

    • Distinguish missing config, unreadable/corrupt config, and true config mismatch.
    • Avoid surfacing corrupt state as a misleading "config changed" condition.
  • Use required metadata fields consistently.

    • _load_resume_state() retrieves target_num_records defensively, but later code falls back to direct dict access for other required fields, so a missing key surfaces as a raw KeyError.
    • Treat required resume metadata as required and raise clear DatasetGenerationError when absent.
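A small validation helper along these lines would centralize that check (field list and names are illustrative):

```python
class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


# Illustrative set of fields resume cannot proceed without.
REQUIRED_RESUME_FIELDS = ("target_num_records", "num_completed_batches")


def require_resume_fields(metadata: dict) -> dict:
    """Raise a clear error instead of a raw KeyError on direct dict access."""
    missing = [k for k in REQUIRED_RESUME_FIELDS if k not in metadata]
    if missing:
        raise DatasetGenerationError(
            f"metadata.json is missing required resume fields: {missing}"
        )
    return metadata
```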

Durability

  • Reduce metadata-write overhead in async. (deferred — not currently a hot path)
    • Async currently writes full metadata, including file path scans, after every row group.
    • Consider writing compact progress metadata every K row groups, with full metadata at the end.
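If this item is picked up later, the cadence could be as simple as the sketch below, with the two writer callables standing in for the real compact and full metadata writers:

```python
def write_progress_checkpoint(row_group_id: int, every_k: int, is_final: bool,
                              write_compact, write_full) -> None:
    """Deferred-optimization sketch: cheap progress record after every row
    group, full metadata (including file-path scans) only every `every_k`
    row groups and once at the end of the run."""
    write_compact(row_group_id)
    if is_final or (row_group_id + 1) % every_k == 0:
        write_full()
```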

Tests

  • Replace empty .parquet stubs in async resume tests with real parquet fixtures.

    • Include full row group, partial salvaged row group, and non-aligned final row group cases.
  • Add direct allow_resize=True + resume tests.

    • Either assert the combination raises clearly, or verify exact batch-plan resume behavior.
  • Add config fingerprint round-trip coverage while the builder_config.json fallback exists.

Documentation

  • Update architecture/dataset-builders.md with resume mechanics and invariants.
  • Add user docs for ResumeMode.NEVER, ResumeMode.ALWAYS, and ResumeMode.IF_POSSIBLE.
  • Document resume failure conditions: config drift, buffer-size mismatch, requested records below existing records, post-generation processing already applied, and allow_resize=True rejection.
  • Document that process_after_generation runs on the whole dataset, not per buffer.
  • Document that DatasetCreationResults from a resume invocation reflects that invocation, not cumulative original + resume observability.

Investigation / Context

Follow-up to #526 (feat: resume interrupted dataset generation runs). The review found that the core resume implementation is useful, but a few edge cases remain:

  • Async resume uses filesystem row-group IDs as truth, but salvaged partial row groups need actual parquet row counts.
  • Sync resume cannot reconstruct safe batch boundaries after allow_resize=True changes cardinality.
  • metadata.json is a crash-recovery checkpoint but is currently rewritten in place.
  • process_after_generation operates on the full dataset and can make parquet row-group identity invalid for later resume/extension.

Agent Plan / Findings

Recommended order for the follow-up PR:

  1. Add atomic metadata writes and corrupt-metadata error handling.
  2. Change async completed-row-group discovery to read parquet row counts.
  3. Reject resume != ResumeMode.NEVER when allow_resize=True, unless exact batch-plan persistence is implemented.
  4. Add post_generation_processed metadata and fail-fast/no-op handling.
  5. Persist config_hash / config_hash_version in metadata and use it for compatibility checks.
  6. Add focused regression tests for async partial row groups, allow_resize, and truncated metadata.
  7. Update architecture and user docs.

Security note: no obvious malicious code was found in the #526 surface; atomic metadata writes are an integrity hardening item worth addressing promptly.

Dependencies

Follow-up to #526.
