feat(migrator): [7/7] Migration performance benchmarks with throughput, latency, memory measurement, and visualization #566
Conversation
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR
Security scan by Jit
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99c262d1de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds benchmark and visualization tooling under tests/benchmarks/ to measure migrator and retrieval performance (throughput/latency/memory) across multiple index configurations and dataset scales, plus scripts to render result charts.
Changes:
- Add an end-to-end migration timing benchmark (sync + async) with index-size scaling and memory reporting.
- Add retrieval benchmarks comparing FLAT vs HNSW and FP32 vs FP16, including overlap/recall-style metrics.
- Add matplotlib-based result visualization for benchmark JSON outputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/benchmarks/visualize_results.py | New charting script to generate memory/precision/QPS/latency/recall plots from benchmark JSON results. |
| tests/benchmarks/retrieval_benchmark.py | New retrieval benchmark runner to build/query multiple index configs and output structured metrics. |
| tests/benchmarks/migration_benchmark.py | New migration benchmark runner to populate synthetic corpora and time migrations (sync/async), including timing breakdowns. |
| tests/benchmarks/index_migrator_real_benchmark.py | New “real dataset” benchmark (AG News + sentence-transformers) to measure migration + query quality/memory on realistic text data. |
@codex review
💡 Codex Review
Reviewed commit: 99c262d1de
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
@codex review
💡 Codex Review
Reviewed commit: 99c262d1de
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
- Fix SentenceTransformer local_files_only fallback for first run
- Remove unused imports (string, struct)
- Fix dead store enum_s (prefix with _)
- Guard ZeroDivisionError in chart_memory_savings
- Fix compute_recall to return 0.0 for empty ground truth
- Wrap logger handler in try/finally to prevent leak on exception
Scoped cleanup to current database only to avoid affecting other databases on shared Redis instances.
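The compute_recall fix mentioned in the commit list above (returning 0.0 for an empty ground-truth set instead of raising ZeroDivisionError) can be sketched like this; the signature and name are illustrative, not the benchmark's actual code:

```python
def compute_recall(retrieved: list, ground_truth: list) -> float:
    """Fraction of ground-truth ids present in the retrieved set.

    Returns 0.0 when the ground truth is empty, rather than dividing by
    zero, matching the fix described in the commit message above.
    """
    if not ground_truth:
        return 0.0
    hits = len(set(retrieved) & set(ground_truth))
    return hits / len(ground_truth)
```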
Force-pushed 751a0e5 to 49ce997 (compare)
- Remove leftover numbered list fragment in migrate-indexes.md (#565 #20)
- Extract duplicated progress_callback into shared _make_progress_callback method in migrate.py (#565 #21)
- Add rvl migrate section to CLI reference docs (#565 #19)
- Add try/except for matplotlib import in visualize_results.py (#566 #14)
Force-pushed 99c262d to 26ce024 (compare)
@codex review
💡 Codex Review
Reviewed commit: 26ce0246e7
…hput, latency, memory measurement, and visualization

Benchmark scripts for measuring migration performance across different index sizes and configurations. Includes real-world AG News dataset benchmarks, synthetic end-to-end timing benchmarks with memory tracking, retrieval quality comparisons (FLAT vs HNSW, FP32 vs FP16), and matplotlib visualization tooling for benchmark results.
- Clean up source index on failure in benchmark finally block
- Include queried_top_k in latency metrics for clarity when effective_top_k differs
Force-pushed 49ce997 to a1d54c9 (compare)
Force-pushed 26ce024 to 63d7c94 (compare)
@codex review
💡 Codex Review
Reviewed commit: 63d7c94e5c
```python
label, title, description = row
text = f"{title}. {description}".strip()
records.append(
    {
        "doc_id": f"ag-news-{idx}",
        "text": text,
        "label": AG_NEWS_LABELS[int(label) - 1],
```
Skip CSV header rows before parsing AG News labels
--dataset-csv is documented as a CSV with label,title,description columns, but this loop parses every row as data and immediately casts label to int. On common AG News CSVs with a header row, the first row (label,title,description) raises ValueError, so the benchmark fails before loading any records. Please detect/skip headers (or use csv.DictReader) before converting labels.
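A header-tolerant loader along the lines the reviewer suggests might look like the sketch below; the function name, record shape, and AG News label ordering are illustrative assumptions, not the PR's actual code:

```python
import csv
import io

# Standard AG News class names, indexed by (label - 1); an assumption here.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def load_records(csv_text: str) -> list:
    """Parse label,title,description rows, tolerating an optional header row.

    The first row is skipped when its label column is not numeric, so a
    header such as "label,title,description" no longer raises ValueError.
    """
    records = []
    for idx, row in enumerate(csv.reader(io.StringIO(csv_text))):
        label, title, description = row
        if idx == 0 and not label.strip().isdigit():
            continue  # header row, not data
        text = f"{title}. {description}".strip()
        records.append(
            {
                "doc_id": f"ag-news-{idx}",
                "text": text,
                "label": AG_NEWS_LABELS[int(label) - 1],
            }
        )
    return records
```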
```python
with tempfile.TemporaryDirectory() as tmpdir:
    target_schema_path = Path(tmpdir) / "target_schema.yaml"
    plan_path = Path(tmpdir) / "migration_plan.yaml"
```
Persist migration plan outside temporary directory
The benchmark report returns plan_path, but it is created under TemporaryDirectory; that directory is deleted when the function exits, so the reported path is always invalid for post-run inspection/debugging. This makes the output misleading and prevents users from reusing the generated plan artifact.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
```python
query_embeddings = all_embeddings[size : size + query_count]
dims = int(all_embeddings.shape[1])

client.flushall()
```
client.flushall() will delete all keys in all DBs on the target Redis instance. For a benchmark script that takes --redis-url, this is risky if someone points it at a shared or non-local Redis. Consider switching to FLUSHDB on a dedicated DB (e.g., default redis://localhost:6379/15), and/or require an explicit --force/--i-understand-this-wipes-data flag (and refuse to run unless the host is localhost unless forced).
```diff
-client.flushall()
+# Clear only the currently selected Redis database for this benchmark run
+client.flushdb()
```
```python
print(f" Size: {size:,} documents")
print(f"{'='*60}")

client.flushdb()
```
This benchmark unconditionally calls client.flushdb() for each size. Since --redis-url is user-provided, this can wipe unrelated data if run against a non-dedicated DB. Add a safety guard (e.g., require --flushdb/--force and default to a dedicated DB index) or at least refuse to run unless the Redis URL points to localhost unless explicitly forced.
```python
cursor, keys = client.scan(cursor, match=f"{INDEX_PREFIX}*", count=5000)
if keys:
    client.delete(*keys)
if cursor == 0:
```
populate_index() deletes keys by scanning INDEX_PREFIX and calling DEL. This can delete unrelated data if INDEX_PREFIX collides with existing keys (especially if someone points --redis-url at a non-dedicated DB). Consider making the prefix configurable via CLI (defaulting to something clearly benchmark-specific and unique per run), and add a --force guard before any destructive deletes.
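A safety guard along the lines these comments suggest could be a small pre-flight check that refuses destructive operations on non-local hosts unless forced. ensure_safe_flush is a hypothetical helper, not code from the PR:

```python
from urllib.parse import urlparse

# Hosts considered safe for destructive benchmark cleanup by default.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def ensure_safe_flush(redis_url: str, force: bool = False) -> None:
    """Refuse a destructive flush unless the URL targets localhost or the
    caller explicitly opted in (e.g. via a --force CLI flag)."""
    host = urlparse(redis_url).hostname or ""
    if host not in LOCAL_HOSTS and not force:
        raise SystemExit(
            f"Refusing to flush non-local Redis host {host!r}; "
            "pass --force to override."
        )
```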
@codex review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
info = index.info()
while time.perf_counter() < deadline:
    info = index.info()
    pct = float(info.get("percent_indexed", 1) or 1)
```
Falsy percent_indexed treated as fully indexed
Medium Severity
The expression info.get("percent_indexed", 1) or 1 defaults a falsy percent_indexed value (e.g., numeric 0 or 0.0) to 1, meaning "100% indexed." This is inverted from the production code in redisvl/migration/utils.py and async_utils.py, which correctly use or 0 to default to "not ready." If percent_indexed is ever numeric zero, the function returns immediately without waiting, causing benchmarks to run against partially-indexed data and produce inaccurate results.
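A corrected wait loop that matches the production "or 0" convention might look like this sketch (wait_for_indexing is illustrative; the PR's helper may differ):

```python
import time


def wait_for_indexing(index, timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll index.info() until percent_indexed reaches 1.0 or the deadline.

    Falsy or missing percent_indexed defaults to 0 ("not ready"), the same
    convention as the production migration utils, instead of 1 ("done").
    """
    deadline = time.perf_counter() + timeout_s
    while time.perf_counter() < deadline:
        info = index.info()
        pct = float(info.get("percent_indexed", 0) or 0)
        if pct >= 1.0:
            return True
        time.sleep(poll_s)
    return False
```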
💡 Codex Review
Reviewed commit: 63d7c94e5c
```python
query_embeddings = all_embeddings[size : size + query_count]
dims = int(all_embeddings.shape[1])

client.flushall()
```
Replace FLUSHALL with scoped cleanup in benchmark setup
benchmark_scale calls client.flushall() before each run, which removes keys from all databases on the Redis server, not just this benchmark's data. If someone points --redis-url at a shared/non-throwaway instance, this benchmark will wipe unrelated datasets; use FLUSHDB for the selected DB or delete only the benchmark key prefixes instead.
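A prefix-scoped cleanup, as the reviewer suggests, can be sketched with SCAN + DEL. delete_by_prefix is a hypothetical helper; it works with any client exposing redis-py style scan/delete methods:

```python
def delete_by_prefix(client, prefix: str, batch: int = 5000) -> int:
    """Delete only keys under a benchmark-specific prefix via SCAN.

    Unlike FLUSHALL (which wipes every database on the server) or FLUSHDB
    (which wipes the whole selected database), this removes only keys the
    benchmark owns. Returns the number of keys deleted.
    """
    deleted = 0
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor, match=f"{prefix}*", count=batch)
        if keys:
            deleted += client.delete(*keys)
        if cursor == 0:
            return deleted
```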
```python
with tempfile.TemporaryDirectory() as tmpdir:
    target_schema_path = Path(tmpdir) / "target_schema.yaml"
    plan_path = Path(tmpdir) / "migration_plan.yaml"
```
Persist plan artifact outside temporary directory
run_quantization_migration records plan_path in the returned benchmark payload, but that path is created under TemporaryDirectory; once the context exits, the file is deleted. This makes the reported artifact path invalid for later inspection/repro and can mislead downstream analysis that expects a readable migration plan file.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
```python
return parser.parse_args()
```
Several metrics calculations assume query_count > 0 (e.g., statistics.mean/min/max on empty lists in run_queries/compute_overlap/compute_recall). Since --query-count currently allows 0, the script will crash with StatisticsError. Add an explicit argument validation (e.g., in parse_args() or main()) to require query_count >= 1 (and similarly validate sizes/top_k).
```diff
-return parser.parse_args()
+args = parser.parse_args()
+if args.query_count < 1:
+    parser.error("--query-count must be >= 1")
+if any(size < 1 for size in args.sizes):
+    parser.error("--sizes must contain positive integers")
+if args.top_k < 1:
+    parser.error("--top-k must be >= 1")
+if args.recall_k_max < 1:
+    parser.error("--recall-k-max must be >= 1")
+return args
```
```python
"plan_duration_seconds": round(duration, 3),
"supported": plan.diff_classification.supported,
"blocked_reasons": plan.diff_classification.blocked_reasons,
"plan_path": str(plan_path),
```
assert_planner_allows_algorithm_change() also returns plan_path within a TemporaryDirectory that is removed on exit, so the returned plan_path is invalid. Persist the plan somewhere durable (e.g., alongside the benchmark --output file) or remove it from the return payload.
```diff
-"plan_path": str(plan_path),
```
```python
query_embeddings = all_embeddings[size : size + query_count]
dims = int(all_embeddings.shape[1])

client.flushall()
```
This benchmark uses client.flushall(), which wipes all databases on the target Redis instance (more destructive than FLUSHDB) and can surprise users if they point at a shared/non-dev server. Consider switching to flushdb(), or add an explicit opt-in flag/confirmation (and/or a dedicated DB selection) with a prominent warning in the CLI help/output.
```diff
-client.flushall()
+client.flushdb()
```


Summary
Benchmark scripts for measuring migration performance across different index sizes and configurations. Includes real-world AG News dataset benchmarks, synthetic end-to-end timing benchmarks with memory tracking, retrieval quality comparisons (FLAT vs HNSW, FP32 vs FP16), and matplotlib visualization tooling for benchmark results.
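The throughput/latency/memory measurement described here can be illustrated with a minimal stdlib-only harness. Names and output keys below are assumptions for illustration; the PR's scripts define their own reporting format, and tracemalloc tracks Python allocations only, not process RSS:

```python
import statistics
import time
import tracemalloc


def benchmark(fn, queries):
    """Run fn over each query, reporting QPS, latency stats, and peak memory.

    Latency is measured per query with time.perf_counter; peak memory is the
    peak Python-allocated bytes reported by tracemalloc during the run.
    """
    tracemalloc.start()
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        fn(q)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    lat_sorted = sorted(latencies)
    return {
        "qps": len(queries) / total,
        "mean_ms": statistics.mean(latencies) * 1000,
        "p95_ms": lat_sorted[int(0.95 * (len(lat_sorted) - 1))] * 1000,
        "peak_mem_bytes": peak,
    }
```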
Files
tests/benchmarks/: migration_benchmark, index_migrator_real_benchmark, retrieval_benchmark, visualize_results
Note
Low Risk
Adds new benchmark/visualization scripts under tests/benchmarks and makes only whitespace/formatting-only tweaks in migration executors/models, so production behavior should be unchanged; the main risk is added optional deps/CI noise if these scripts are accidentally executed in test runs.
Overview
Adds a new tests/benchmarks/ suite to measure migration performance and retrieval quality at scale, including an end-to-end migration timing benchmark (sync/async), a "real data" AG News HNSW/FP32→FLAT/FP16 quantization migration benchmark, and a retrieval benchmark comparing FP32 vs FP16 across HNSW vs FLAT with overlap/recall and latency/QPS metrics. Also adds a matplotlib-based visualize_results.py helper to turn retrieval benchmark JSON output into charts, and includes minor formatting-only refactors in MigrationExecutor, AsyncMigrationExecutor, and RenameOperations.has_operations.
Written by Cursor Bugbot for commit 63d7c94. This will update automatically on new commits.