
Harden resume checkpointing and edge cases #623

@nabinchha

Description


Priority Level

High

Status (PR #624)

PR #624 lands the bulk of this work — atomic metadata writes, config fingerprint persistence and compatibility checks, async parquet row-count recovery, allow_resize=True rejection, post-generation terminal handling, and resume documentation. It also unifies the sync and async resume paths so both engines derive num_completed_batches and actual_num_records from parquet-files/batch_*.parquet (metadata stays the source of truth for run configuration; the filesystem is the source of truth for progress).

The "reduce metadata-write overhead in async" item is deferred — current write cost has not shown up as a hot path in profiling.

Task Summary

Harden the resume feature added in #526 before we rely on it for broad retry/orchestration workflows. The initial implementation is useful and will be merged, but follow-up work is needed around async partial-row recovery, sync allow_resize behavior, metadata durability, compatibility metadata, tests, and documentation.

Technical Details & Implementation Plan

Correctness fixes

  • Async resume should count actual persisted rows, not inferred row-group sizes.

    • Current async resume reconstructs initial_actual_num_records from completed row-group IDs via _rg_size(rg_id).
    • This overcounts salvaged partial row groups after early shutdown, where a parquet file may contain fewer rows than the requested row-group size.
    • Replace _find_completed_row_group_ids() with a scan that returns {row_group_id: parquet_num_rows} using parquet metadata.
    • Add a regression test with a real parquet file containing fewer rows than the row-group size.
  • Reject or fully support resume with allow_resize=True.

    • allow_resize=True mutates batch cardinality and _num_records_list, so the current metadata is not enough to reconstruct the original remaining batch plan.
    • Small safe fix: fail fast when resume != ResumeMode.NEVER and any column has allow_resize=True.
    • Alternative: persist the exact original requested target and batch plan in metadata, then resume from that plan.
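The small safe fix could be a guard like the sketch below. `ResumeMode` and `DatasetGenerationError` are stand-ins for the project's real types, and the column shape is assumed; only the fail-fast logic is the point.

```python
from enum import Enum


class ResumeMode(Enum):  # stand-in for the real enum
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


def check_resume_compatible(resume: ResumeMode, columns) -> None:
    """Fail fast: a resumable run cannot include resizable columns, since
    metadata alone cannot reconstruct the mutated batch plan."""
    if resume is ResumeMode.NEVER:
        return
    resizable = [c.name for c in columns if getattr(c, "allow_resize", False)]
    if resizable:
        raise DatasetGenerationError(
            f"resume={resume.value!r} is incompatible with allow_resize=True "
            f"on columns {resizable}; rerun with resume=ResumeMode.NEVER"
        )
```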
  • Make metadata.json writes atomic.

    • Resume depends on metadata.json as checkpoint state, but current writes truncate/rewrite in place.
    • Write to a temporary file, flush/fsync, then os.replace() into place.
    • Add corrupt/truncated metadata tests that raise a clear DatasetGenerationError.
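A minimal sketch of the atomic-write pattern (function name illustrative): a crash at any point leaves either the old complete file or the new complete file, never a truncated one.

```python
import json
import os
import tempfile
from pathlib import Path


def write_metadata_atomic(path: Path, metadata: dict) -> None:
    """Write metadata.json via temp file + fsync + os.replace().

    os.replace() is atomic on both POSIX and Windows, so readers never
    observe a partially written checkpoint. (A directory fsync could be
    added for stronger durability on POSIX.)
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```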
  • Handle post-generation-processed datasets as terminal for resume.

    • process_after_generation runs on the whole dataset and can re-chunk files, change row counts, or change schema.
    • Persist post_generation_processed: true after successful run_after_generation.
    • On resume, return the existing path if the requested target matches; otherwise fail fast with a clear error before generation starts.
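The terminal-state check might be sketched as follows. The metadata field names (`post_generation_processed`, `target_num_records`) follow this plan but are assumptions about the eventual schema.

```python
from typing import Optional


class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


def resume_terminal_check(metadata: dict, requested: int,
                          dataset_path: str) -> Optional[str]:
    """If post-generation processing already ran, the dataset is terminal.

    Returns the existing path when the requested target matches, None when
    the dataset is not terminal (normal resume logic applies), and raises
    before any generation starts otherwise.
    """
    if not metadata.get("post_generation_processed", False):
        return None
    if metadata.get("target_num_records") == requested:
        return dataset_path  # nothing to do: hand back the finished dataset
    raise DatasetGenerationError(
        "dataset was already post-processed and cannot be resumed with a "
        f"different target ({requested} != {metadata.get('target_num_records')})"
    )
```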

Metadata and compatibility

  • Persist config fingerprint in metadata.json.

    • Store config_hash and config_hash_version alongside checkpoint metadata.
    • Use those values for resume compatibility checks instead of deserializing builder_config.json and recomputing the fingerprint.
    • Keep builder_config.json as the human-readable config record.
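A stable fingerprint could be computed over a canonical JSON serialization, as in this sketch (hashing scheme and version constant are illustrative choices, not the project's actual algorithm):

```python
import hashlib
import json

CONFIG_HASH_VERSION = 1  # bump whenever the hashing scheme changes


def config_fingerprint(config: dict) -> dict:
    """Return the fields to persist alongside checkpoint metadata.

    Sorting keys and fixing separators makes the digest independent of
    dict ordering and whitespace, so only semantic config changes alter it.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return {
        "config_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "config_hash_version": CONFIG_HASH_VERSION,
    }
```

On resume, comparing the persisted `config_hash` (at a matching `config_hash_version`) avoids deserializing `builder_config.json` and recomputing the fingerprint on every check.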
  • Tighten corrupt config handling.

    • Distinguish missing config, unreadable/corrupt config, and true config mismatch.
    • Avoid surfacing corrupt state as a misleading "config changed" condition.
  • Use required metadata fields consistently.

    • _load_resume_state() retrieves target_num_records defensively, but later code falls back to direct dict access for other required fields, so a missing key surfaces as a raw KeyError.
    • Treat required resume metadata as required and raise clear DatasetGenerationError when absent.
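A small validation helper along these lines would centralize that check (field list and names are illustrative):

```python
class DatasetGenerationError(Exception):  # stand-in for the real error type
    pass


# Illustrative set of fields resume cannot proceed without.
REQUIRED_RESUME_FIELDS = ("target_num_records", "num_completed_batches")


def require_resume_fields(metadata: dict) -> dict:
    """Raise a clear error instead of a raw KeyError on direct dict access."""
    missing = [k for k in REQUIRED_RESUME_FIELDS if k not in metadata]
    if missing:
        raise DatasetGenerationError(
            f"metadata.json is missing required resume fields: {missing}"
        )
    return metadata
```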

Durability

  • Reduce metadata-write overhead in async. (deferred — not currently a hot path)
    • Async currently writes full metadata, including file path scans, after every row group.
    • Consider writing compact progress metadata every K row groups, with full metadata at the end.
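If this item is picked up later, the cadence could be as simple as the sketch below, with the two writer callables standing in for the real compact and full metadata writers:

```python
def write_progress_checkpoint(row_group_id: int, every_k: int, is_final: bool,
                              write_compact, write_full) -> None:
    """Deferred-optimization sketch: cheap progress record after every row
    group, full metadata (including file-path scans) only every `every_k`
    row groups and once at the end of the run."""
    write_compact(row_group_id)
    if is_final or (row_group_id + 1) % every_k == 0:
        write_full()
```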

Tests

  • Replace empty .parquet stubs in async resume tests with real parquet fixtures.

    • Include full row group, partial salvaged row group, and non-aligned final row group cases.
  • Add direct allow_resize=True + resume tests.

    • Either assert the combination raises clearly, or verify exact batch-plan resume behavior.
  • Add config fingerprint round-trip coverage while the builder_config.json fallback exists.

Documentation

  • Update architecture/dataset-builders.md with resume mechanics and invariants.
  • Add user docs for ResumeMode.NEVER, ResumeMode.ALWAYS, and ResumeMode.IF_POSSIBLE.
  • Document resume failure conditions: config drift, buffer-size mismatch, requested records below existing records, post-generation processing already applied, and allow_resize=True rejection.
  • Document that process_after_generation runs on the whole dataset, not per buffer.
  • Document that DatasetCreationResults from a resume invocation reflects that invocation, not cumulative original + resume observability.

Investigation / Context

Follow-up to #526 (feat: resume interrupted dataset generation runs). The review found that the core resume implementation is useful, but a few edge cases remain:

  • Async resume uses filesystem row-group IDs as truth, but salvaged partial row groups need actual parquet row counts.
  • Sync resume cannot reconstruct safe batch boundaries after allow_resize=True changes cardinality.
  • metadata.json is a crash-recovery checkpoint but is currently rewritten in place.
  • process_after_generation operates on the full dataset and can make parquet row-group identity invalid for later resume/extension.

Agent Plan / Findings

Recommended order for the follow-up PR:

  1. Add atomic metadata writes and corrupt-metadata error handling.
  2. Change async completed-row-group discovery to read parquet row counts.
  3. Reject resume != ResumeMode.NEVER when allow_resize=True, unless exact batch-plan persistence is implemented.
  4. Add post_generation_processed metadata and fail-fast/no-op handling.
  5. Persist config_hash / config_hash_version in metadata and use it for compatibility checks.
  6. Add focused regression tests for async partial row groups, allow_resize, and truncated metadata.
  7. Update architecture and user docs.

Security note: no obvious malicious code was found in the #526 surface; atomic metadata writes are an integrity hardening item worth addressing promptly.

Dependencies

Follow-up to #526.
