feat(migrator): [1/6] add sync migration engine#549

Closed
nkanu17 wants to merge 51 commits intomainfrom
feat/migrate-core

Conversation

@nkanu17
Collaborator

@nkanu17 nkanu17 commented Mar 31, 2026

Summary

Introduces the core synchronous index migration engine for RedisVL. This enables users to programmatically plan and execute index schema migrations using a drop-recreate strategy with automatic data reindexing.

Overview

The migration engine follows a plan -> execute -> validate workflow:

from redisvl.migration import MigrationExecutor, MigrationPlanner, MigrationValidator

# Plan: compare source index against target schema
# (target_schema is the desired IndexSchema, defined elsewhere)
planner = MigrationPlanner(redis_url="redis://localhost:6379")
plan = planner.plan("my_index", target_schema)

# Execute: drop-recreate with data preservation
executor = MigrationExecutor(redis_url="redis://localhost:6379")
report = executor.execute(plan)

# Validate: confirm schema and data integrity
validator = MigrationValidator(redis_url="redis://localhost:6379")
result = validator.validate(plan)

What is included

Core modules (redisvl/migration/)

| Module | Purpose |
| --- | --- |
| models.py | Pydantic models: MigrationPlan, MigrationReport, SchemaPatch, FieldRename, RenameOperations |
| planner.py | Diff-based planning -- compares live index against target schema, produces a MigrationPlan |
| executor.py | Drop-recreate execution -- enumerates docs, drops index, recreates with new schema, reindexes |
| validation.py | Post-migration validation -- schema comparison, document count, key sampling |
| utils.py | Shared utilities: schema comparison, YAML I/O, key enumeration, index listing |

Supporting changes

  • redisvl/cli/utils.py -- Refactored add_index_parsing_options to extract add_redis_connection_options (backward compatible, needed by migration CLI in later PR)
  • redisvl/redis/connection.py -- Added HNSW parameter parsing (m, ef_construction) in convert_index_info_to_schema
  • .gitignore -- Added migration temp files and dev directories
  • AGENTS.md -- Added project context file

Tests

  • tests/unit/test_migration_planner.py -- Comprehensive planner unit tests (~890 lines)
  • tests/integration/test_migration_v1.py -- End-to-end migration integration tests
  • tests/integration/test_field_modifier_ordering_integration.py -- Field modifier ordering tests

Design decisions

  1. Drop-recreate strategy: Chosen for V1 simplicity. Redis cannot modify or remove existing index fields in place (FT.ALTER only adds fields), so we enumerate all documents, drop the index, recreate it with the new schema, and reindex. Data keys are preserved (only the index metadata is dropped).

  2. Schema diff planning: The planner compares the live FT.INFO output against the target IndexSchema to produce a precise diff (added/removed/updated fields, rename operations). This avoids unnecessary migrations when schemas already match.

  3. Pydantic models throughout: All plans and reports are Pydantic BaseModel instances for validation, serialization (YAML/JSON), and type safety.

  4. Rename support built-in: Field renames are first-class operations in the plan, executed as key-level field copy+delete during reindexing.

  5. Validation as separate concern: MigrationValidator runs independently after execution to verify schema correctness, document counts, and key-level field sampling.
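
The schema diff in design decision 2 can be illustrated with a small sketch. This is not the planner's actual implementation: `diff_fields` and the plain-dict field representation are hypothetical stand-ins for the comparison between FT.INFO output and the target IndexSchema.

```python
from typing import Any, Dict, List


def diff_fields(
    source: Dict[str, Dict[str, Any]],
    target: Dict[str, Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Classify fields as added, removed, or updated (hypothetical helper)."""
    added = sorted(name for name in target if name not in source)
    removed = sorted(name for name in source if name not in target)
    updated = sorted(
        name
        for name in source
        if name in target and source[name] != target[name]
    )
    return {"added": added, "removed": removed, "updated": updated}


# Illustrative field definitions; the attribute names are assumptions.
source_fields = {
    "title": {"type": "text"},
    "embedding": {"type": "vector", "algorithm": "hnsw", "datatype": "float32"},
    "legacy_tag": {"type": "tag"},
}
target_fields = {
    "title": {"type": "text"},
    "embedding": {"type": "vector", "algorithm": "hnsw", "datatype": "float16"},
    "category": {"type": "tag"},
}

diff = diff_fields(source_fields, target_fields)
# An empty diff means the schemas already match and no migration is needed.
needs_migration = any(diff.values())
```

A diff of this shape is what lets a planner skip the drop-recreate cycle entirely when nothing has changed.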

Part of a stack

This is PR 1 of 6 in the index migrator feature:

  1. This PR -- Sync migration core
  2. Async migration support
  3. Batch migration support
  4. Interactive migration wizard
  5. CLI + documentation
  6. Crash-safe quantization & disk space estimation (PR #548)

nkanu17 added 30 commits March 23, 2026 13:20
- Remove unused imports: Union, ClusterPipeline, AsyncClusterPipeline,
  logging, cast, Optional, os, lazy_import, SyncRedisCluster, Mapping,
  Awaitable, warnings
- Fix unused exception variables in index.py exception handlers
- Clean up HybridResult import used only for feature detection
- Add rvl migrate subcommand (helper, list, plan, apply, validate)
- Implement MigrationPlanner for schema diff classification
- Implement MigrationExecutor with drop_recreate mode
- Support vector quantization (float32 <-> float16) during migration
- Add MigrationValidator for post-migration validation
- Show error messages prominently on migration failure
- Add migration temp files to .gitignore
- Add MigrationWizard for guided schema changes
- Support add/update/remove field operations
- Algorithm-specific datatype prompts (SVS-VAMANA vs HNSW/FLAT)
- SVS-VAMANA params: GRAPH_MAX_DEGREE, COMPRESSION
- HNSW params: M, EF_CONSTRUCTION
- Normalize SVS_VAMANA -> SVS-VAMANA input
- Preview patch as YAML before finishing
- Add conceptual guide: how migrations work (Diataxis explanation)
- Add task guide: step-by-step migration walkthrough (Diataxis how-to)
- Expand field-attributes.md with migration support matrix
- Add vector datatypes table with algorithm compatibility
- Update navigation indexes to include new guides
- Normalize SVS-VAMANA naming throughout docs
- Unit tests for MigrationPlanner diff classification
- Unit tests for MigrationWizard (41 tests incl. adversarial inputs)
- Integration test for drop_recreate flow
- Field modifier ordering integration tests (INDEXEMPTY, INDEXMISSING, etc.)
Add async/await execution for index migrations, enabling non-blocking
operation for large quantization jobs and async application integration.

New functionality:
- CLI: --async flag for rvl migrate apply
- Python API: AsyncMigrationPlanner, AsyncMigrationExecutor, AsyncMigrationValidator
- Batched quantization with pipelined HSET operations
- Non-blocking readiness polling with asyncio.sleep()

What becomes async:
- SCAN operations (yields between batches of 500 keys)
- Pipelined HSET writes (100-1000 ops per batch)
- Index readiness polling (asyncio.sleep vs time.sleep)

What stays sync:
- CLI prompts (user interaction)
- YAML file I/O (local filesystem)

Documentation:
- Sync vs async execution guidance in concepts/index-migrations.md
- Async usage examples in how_to_guides/migrate-indexes.md

Tests:
- 4 unit tests for AsyncMigrationPlanner
- 4 unit tests for AsyncMigrationExecutor
- 1 integration test for full async flow
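
The non-blocking readiness polling listed above could look roughly like the following. This is a sketch under assumptions: `wait_until_indexed` and its parameters are hypothetical, and the progress source is a fake coroutine standing in for a real FT.INFO call that reports indexing progress between 0 and 1.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_indexed(
    get_percent_indexed: Callable[[], Awaitable[float]],
    poll_interval: float = 0.5,
    timeout: float = 60.0,
) -> float:
    """Poll until indexing reports 100%, yielding to the event loop between polls."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        percent = await get_percent_indexed()
        if percent >= 1.0:
            return percent
        if loop.time() >= deadline:
            raise TimeoutError(f"index not ready after {timeout}s ({percent:.0%})")
        # asyncio.sleep suspends only this coroutine; time.sleep would block
        # every task sharing the event loop.
        await asyncio.sleep(poll_interval)


async def _demo() -> float:
    # Fake progress values standing in for successive FT.INFO responses.
    progress = iter([0.25, 0.75, 1.0])

    async def fake_percent() -> float:
        return next(progress)

    return await wait_until_indexed(fake_percent, poll_interval=0.01)


result = asyncio.run(_demo())
```
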
Document Enumeration Optimization:
- Use FT.AGGREGATE WITHCURSOR for efficient key enumeration
- Falls back to SCAN only when index has hash_indexing_failures
- Pre-enumerate keys before drop for reliable re-indexing

CLI Simplification:
- Remove redundant --allow-downtime flag from apply/batch-apply
- Plan review is now the safety mechanism

Batch Migration:
- Add BatchMigrationExecutor and BatchMigrationPlanner
- Support for multi-index migration with failure policies
- Resumable batch operations with state persistence
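
A resumable multi-index run with a failure policy might be structured as below. Everything here is illustrative: `run_batch`, the `fail_fast` flag, and the JSON state file are hypothetical (the PR persists state as YAML; JSON is used only to keep the sketch within the standard library).

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_batch(
    indexes: List[str],
    migrate_one: Callable[[str], None],
    state_path: Path,
    fail_fast: bool = False,
) -> Dict[str, str]:
    """Migrate each index once, persisting per-index status after every step."""
    # Resume from an earlier run if a state file already exists.
    state: Dict[str, str] = (
        json.loads(state_path.read_text()) if state_path.is_file() else {}
    )
    for name in indexes:
        if state.get(name) == "succeeded":
            continue  # finished in a previous run
        try:
            migrate_one(name)
            state[name] = "succeeded"
        except Exception as exc:
            state[name] = f"failed: {exc}"
        state_path.write_text(json.dumps(state))
        if fail_fast and state[name] != "succeeded":
            break
    return state


calls: List[str] = []


def flaky(name: str) -> None:
    """Fake migration that fails the first time it sees idx_b."""
    calls.append(name)
    if name == "idx_b" and calls.count("idx_b") == 1:
        raise RuntimeError("boom")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "batch_state.json"
    first = run_batch(["idx_a", "idx_b", "idx_c"], flaky, path, fail_fast=True)
    second = run_batch(["idx_a", "idx_b", "idx_c"], flaky, path)
```

The second call skips `idx_a` because the persisted state already marks it as succeeded, then retries `idx_b` and continues to `idx_c`.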

Bug Fixes:
- Fix mypy type errors in planner, wizard, validation, and CLI

Documentation:
- Update concepts and how-to guides for new workflow
- Remove --allow-downtime references from all docs
- Add FieldRename and RenameOperations models
- Add _extract_rename_operations to detect index/prefix/field renames
- Update classify_diff to support rename detection
- Update tests for prefix change (now supported, not blocked)
- Add _rename_keys for prefix changes via RENAME command
- Add _rename_field_in_hash and _rename_field_in_json for field renames
- Execute renames before drop/recreate for safe enumeration
- Support both HASH and JSON storage types
- Add rename operations (rename index, change prefix, rename field)
- Add vector field removal with [WARNING] indicator
- Add index_empty, ef_runtime, epsilon prompts
- Add phonetic_matcher and withsuffixtrie for text/tag fields
- Update menu to 8 options
- All 40 supported operations now in wizard
- Add UNRELIABLE_*_ATTRS constants for attributes Redis doesn't return
- Add _strip_unreliable_attrs() to normalize schemas before comparison
- Update canonicalize_schema() with strip_unreliable parameter
- Handle NUMERIC+SORTABLE auto-UNF normalization
- Update validation.py and async_validation.py to use strip_unreliable=True
- Remove withsuffixtrie from wizard (parser breaks)
- All 38 comprehensive integration tests pass with strict validation
- Transform key_sample to use new prefix when validating after prefix change
- Sync and async validators detect plan.rename_operations.change_prefix
- Update integration test to use full validation (result['succeeded'])
- All 38 comprehensive tests pass with strict validation
- Add _run_functional_checks() to sync and async validators
- Wildcard search (FT.SEARCH "*") verifies index is operational
- Automatically runs after every migration (no user config needed)
- Verifies doc count matches expected count from source
- All 40 migration integration tests pass
- Add parsing for m and ef_construction from FT.INFO in parse_vector_attrs
- Normalize float weights to int in schema comparison (_strip_unreliable_attrs)
- Fixes false validation failures after HNSW migrations
- Tests algorithm changes (HNSW<->FLAT)
- Tests datatype changes (float32, float16, bfloat16, int8, uint8)
- Tests distance metric changes (cosine, l2, ip)
- Tests HNSW tuning parameters (m, ef_construction, ef_runtime, epsilon)
- Tests combined changes (algorithm + datatype + metric)

Requires Redis 8.0+ for INT8/UINT8 datatype tests
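
For context, a float32 to float16 datatype change amounts to re-encoding each stored vector blob. The sketch below uses only the standard library and assumes little-endian blobs; `float32_to_float16` is a hypothetical helper, not the executor's code.

```python
import struct


def float32_to_float16(blob: bytes) -> bytes:
    """Re-encode a little-endian float32 vector blob as float16."""
    if len(blob) % 4:
        raise ValueError("float32 blob length must be a multiple of 4")
    dims = len(blob) // 4
    values = struct.unpack(f"<{dims}f", blob)
    # 'e' is the IEEE 754 half-precision format; struct raises OverflowError
    # for values outside the float16 range.
    return struct.pack(f"<{dims}e", *values)


# These sample values are exactly representable in float16.
original = struct.pack("<4f", 0.5, -1.0, 2.0, 0.25)
quantized = float32_to_float16(original)
restored = struct.unpack("<4e", quantized)
```

The blob shrinks from 16 bytes to 8, which is where the memory savings of quantization come from; values that are not exactly representable lose precision.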
- Add field rename, prefix change, index rename to supported changes
- Update quantization docs to include bfloat16/int8/uint8 datatypes
- Add Redis 8.0+ requirement notes for INT8/UINT8
- Add Redis 8.2+ and Intel AVX-512 notes for SVS-VAMANA
- Add batch migration CLI commands to CLI reference
- Remove prefix/field rename from blocked changes lists
- Fix wizard algorithm case sensitivity (schema stores lowercase 'hnsw')
- Remove non-existent --skip-count-check flag from docs
- CRITICAL: merge_patch now applies rename_fields to merged schema
- HIGH: BatchState.success_count uses correct 'succeeded' status
- HIGH: CLI helper text shows prefix/rename as supported
- HIGH: Planner docstring updated for current capabilities
- HIGH: batch_plan_path stored in state for resume support
- MEDIUM: Fixed --output to --plan-out in batch migration docs
- MEDIUM: Fixed --indexes to use comma-separated format in docs
- MEDIUM: Added validation to block multi-prefix migrations
- MEDIUM: Updated migration plan YAML example to match model
- MEDIUM: Added skipped_count property and [SKIP] status display
- Add reliability.py: idempotent dtype detection, checkpoint persistence,
  BGSAVE safety net, bounded undo buffer with async rollback
- Add DiskSpaceEstimate/VectorFieldEstimate models and estimate_disk_space()
- Wire reliability into both sync and async executors (_quantize_vectors)
- Add --resume flag to rvl migrate apply for checkpoint-based resume
- Add rvl migrate estimate subcommand for pre-migration cost analysis
- Update progress labels to 6 steps (enumerate, bgsave, drop, quantize,
  create, re-index)
- Planner returns dims in datatype change metadata for idempotent detection
- 39 new unit tests (90 total migration tests passing)
- Fix bfloat16/uint8 idempotent detection using dtype byte-width families
  so float16<->bfloat16 and int8<->uint8 are treated as equivalent
- Validate checkpoint index_name matches source index before resuming
- Force checkpoint_path to the load path, not the stored value
- Record all batch keys in checkpoint (including skipped) to avoid
  re-scanning on resume
- Fix misleading AOF wording when aof_enabled is not set
…uracy

- Fix #2: docs_processed increments by full batch size (including skipped)
  so progress reaches 100% even when vectors are already quantized
- Fix #4: is_already_quantized prevents skipping same-width dtype
  conversions (e.g. float16->bfloat16) since encodings differ
- Fix #5: apply() detects checkpoint on resume and bypasses index
  validation, BGSAVE, field renames, drop, and key renames (all already
  done pre-crash); enumerates keys via SCAN with plan prefix instead
- Add IM-16 (auto-detect AOF) and IM-17 (compact checkpoint) to backlog
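
The idempotent detection rule described in Fix #4 can be sketched from blob length alone. This is an assumption-laden illustration: the contents of `DTYPE_BYTES` and the signature of `is_already_quantized` are guesses, not the PR's actual definitions.

```python
# Assumed byte widths per element for the datatypes named in this PR.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "uint8": 1}


def is_already_quantized(blob: bytes, dims: int, source: str, target: str) -> bool:
    """Guess from blob length whether a vector was already converted."""
    if DTYPE_BYTES[source] == DTYPE_BYTES[target]:
        # Same width (e.g. float16 -> bfloat16): length cannot distinguish
        # the two encodings, so the vector must never be skipped.
        return False
    return len(blob) == dims * DTYPE_BYTES[target]


dims = 4
fp32_blob = bytes(dims * 4)
fp16_blob = bytes(dims * 2)

not_yet = is_already_quantized(fp32_blob, dims, "float32", "float16")
already = is_already_quantized(fp16_blob, dims, "float32", "float16")
same_width = is_already_quantized(fp16_blob, dims, "float16", "bfloat16")
```
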
…rom models

Remove _BYTES_PER_ELEMENT from reliability.py and import the identical
DTYPE_BYTES constant from models.py to maintain a single source of truth.
- record_batch() no longer appends to processed_keys list
- save() excludes processed_keys from serialized YAML
- get_remaining_keys() uses completed_keys offset for compact
  checkpoints, with backward compat for legacy processed_keys
- Add tests for compact checkpoint resume, save exclusion,
  checkpoint-before-HGET ordering, quantize return counts,
  scan pattern builder, key normalization, and AOF detection
- Use is_file() instead of exists() in batch_executor load methods
- Add exclude_none to wizard preview model_dump for consistency
- Fix validate timestamps: capture before/after validation runs
- Use client.info('persistence') instead of full info() in BGSAVE poll
- Remove misleading isdigit() comment in wizard test
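
The compact checkpoint change above (`record_batch`, `get_remaining_keys`) can be pictured with a small sketch. The `Checkpoint` class here is a hypothetical stand-in, and offset-based resume assumes the key enumeration order is the same on every run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the checkpoint model."""

    completed_keys: int = 0
    # Present only in legacy checkpoints that stored every key.
    processed_keys: List[str] = field(default_factory=list)

    def record_batch(self, batch: List[str]) -> None:
        # Compact form: track a count instead of appending each key.
        self.completed_keys += len(batch)

    def get_remaining_keys(self, all_keys: List[str]) -> List[str]:
        if self.processed_keys:
            # Legacy path: filter out keys recorded by name.
            done = set(self.processed_keys)
            return [key for key in all_keys if key not in done]
        # Compact path: skip the first completed_keys entries.
        return all_keys[self.completed_keys:]


keys = [f"doc:{i}" for i in range(5)]

compact = Checkpoint()
compact.record_batch(keys[:2])
remaining = compact.get_remaining_keys(keys)

legacy = Checkpoint(processed_keys=["doc:0", "doc:3"])
legacy_remaining = legacy.get_remaining_keys(keys)
```
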
@nkanu17
Collaborator Author

nkanu17 commented Mar 31, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 281c162eeb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@nkanu17
Collaborator Author

nkanu17 commented Mar 31, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 281c162eeb

… persist

- Use RENAMENX instead of RENAME for prefix migrations (sync+async)
- Unwrap JSONPath list results in _rename_field_in_json (sync+async)
- Remap update_fields through rename_operations in merge_patch and classify_diff
- Validate/normalize prefix type (list -> string) in planner
- Persist state and clear current_index when batch executor skips indexes
- Heuristic rename detection continues even with explicit renames
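
Using RENAMENX rather than RENAME matters because RENAME silently overwrites an existing destination key. A minimal sketch of a prefix migration built on that idea follows; `rename_prefix` is hypothetical, and `FakeRedis` is an in-memory stand-in that mimics only the `renamenx(src, dst)` call of a Redis client.

```python
from typing import Dict, List, Tuple


class FakeRedis:
    """In-memory stand-in exposing only renamenx, for illustration."""

    def __init__(self, data: Dict[str, dict]) -> None:
        self.data = data

    def renamenx(self, src: str, dst: str) -> bool:
        if src not in self.data:
            raise KeyError(src)  # a real server returns a "no such key" error
        if dst in self.data:
            return False  # destination exists: nothing is overwritten
        self.data[dst] = self.data.pop(src)
        return True


def rename_prefix(
    client: FakeRedis, keys: List[str], old_prefix: str, new_prefix: str
) -> Tuple[List[str], List[str]]:
    """Move keys to a new prefix, reporting collisions instead of clobbering."""
    renamed: List[str] = []
    collisions: List[str] = []
    for key in keys:
        if not key.startswith(old_prefix):
            continue
        new_key = new_prefix + key[len(old_prefix):]
        if client.renamenx(key, new_key):
            renamed.append(new_key)
        else:
            collisions.append(key)
    return renamed, collisions


client = FakeRedis({"old:1": {"a": 1}, "old:2": {"a": 2}, "new:2": {"a": 99}})
renamed, collisions = rename_prefix(client, ["old:1", "old:2"], "old:", "new:")
```

Here `old:2` is reported as a collision and the pre-existing `new:2` is left untouched.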
@nkanu17
Collaborator Author

nkanu17 commented Mar 31, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 281c162eeb

…ype enforcement

- Wizard now rebuilds working schema after each action so prompts
  reflect staged renames, removes, and adds
- When switching to SVS-VAMANA with incompatible datatype, force
  selection or default to float32
@nkanu17
Collaborator Author

nkanu17 commented Mar 31, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aabcf4e048

nkanu17 added 2 commits April 1, 2026 11:44
… predictions

- Document the 6-step drop-recreate migration sequence in user guide
- Clarify that source index is dropped BEFORE quantization begins
- Correct memory prediction: peak is baseline FP32 size (~57 GB for 10M),
  not double (80+ GB), because index is already dropped during quantization
- Update 10M predictions: 64-128 GB RAM (not 128+ GB), 50-90 min timeline
- Add FLAT vs HNSW target considerations for large-scale migrations
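
The ~57 GB figure above is consistent with simple arithmetic on raw vector payload. The sketch below assumes 1536-dimensional vectors, a value chosen only because it reproduces the quoted number; the commit message does not state the dimensionality. Key overhead, other fields, and index structures are ignored.

```python
# Assumed byte widths per element.
DTYPE_BYTES = {"float32": 4, "float16": 2}
GIB = 1024 ** 3


def vector_bytes(num_vectors: int, dims: int, dtype: str) -> int:
    """Raw payload size of the vector field across all documents."""
    return num_vectors * dims * DTYPE_BYTES[dtype]


NUM_VECTORS = 10_000_000
DIMS = 1536  # assumption, not taken from the PR

fp32_bytes = vector_bytes(NUM_VECTORS, DIMS, "float32")
fp16_bytes = vector_bytes(NUM_VECTORS, DIMS, "float16")

fp32_gib = fp32_bytes / GIB  # about 57.2 GiB
fp16_gib = fp16_bytes / GIB  # about 28.6 GiB
```

Because the index is dropped before quantization starts, the FP32 payload is the peak rather than FP32 plus a second copy.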
@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6734794854

@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6734794854

@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7eccaabf80

@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3c0ee56d4

@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3c0ee56d4

@nkanu17
Collaborator Author

nkanu17 commented Apr 1, 2026

Closing to recreate with proper stacked diffs. Code is preserved on feat/index-migrator-v0-lc-checkpoint-backup.

@nkanu17 nkanu17 closed this Apr 1, 2026
