Optimized vector generation for VDB Benchmark #227
VDB Benchmark - Enhanced Vector Loader
Overview
The load_vdb.py script loads synthetic vectors into a Milvus vector database for benchmarking purposes. This enhanced version introduces CPU and memory optimizations while preserving backward compatibility with the original implementation.
Architecture Changes
High-Level Architecture
Component Details
1. Memory Management Utilities (New)
AdaptiveBatchController
Dynamically adjusts batch sizes based on real-time memory pressure.
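A minimal sketch of how such a controller could behave, not the script's actual implementation. The method names `get_batch_size()` and `force_scale_down()` come from this description. The constructor parameters, the halving policy, the 0.85 threshold, and the injectable `memory_probe` callable are assumptions made for illustration.

```python
class AdaptiveBatchController:
    """Illustrative sketch only; parameter names and policy are assumed."""

    def __init__(self, initial_batch_size, min_batch_size=100,
                 memory_threshold=0.85, scale_factor=0.5,
                 memory_probe=None):
        self.batch_size = initial_batch_size
        self.min_batch_size = min_batch_size
        self.memory_threshold = memory_threshold
        self.scale_factor = scale_factor
        # memory_probe returns the fraction of RAM in use (0.0 to 1.0).
        # The default probe reports no pressure, so the size never changes.
        self.memory_probe = memory_probe or (lambda: 0.0)

    def _shrink(self):
        # Reduce the batch size, but never below the configured minimum.
        self.batch_size = max(self.min_batch_size,
                              int(self.batch_size * self.scale_factor))

    def get_batch_size(self):
        # Shrink only when the probe reports usage above the threshold.
        if self.memory_probe() > self.memory_threshold:
            self._shrink()
        return self.batch_size

    def force_scale_down(self):
        # Unconditional reduction, e.g. after a failed insert.
        self._shrink()
        return self.batch_size
```

Passing the probe in as a callable keeps the sketch testable without a real memory monitor. A production version would likely read memory usage from psutil.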
Key Methods:
get_batch_size(): Returns current batch size, adjusting if memory threshold exceeded
force_scale_down(): Emergency reduction after insertion errors
DiskBackedBuffer
Memory-mapped file buffer for datasets exceeding available RAM.
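The general technique can be sketched with `numpy.memmap`. This is an illustration under stated assumptions, not the `DiskBackedBuffer` class itself: the helper names are hypothetical, and the on-disk format assumed here is raw row-major float32 with no header, which may differ from the script's real file layout.

```python
import numpy as np


def write_vectors_to_memmap(path, num_vectors, dim, chunk_size, seed=0):
    """Generate vectors chunk by chunk and write them to a memory-mapped file."""
    rng = np.random.default_rng(seed)
    buf = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(num_vectors, dim))
    for start in range(0, num_vectors, chunk_size):
        stop = min(start + chunk_size, num_vectors)
        # Only one chunk is generated in RAM at a time.
        buf[start:stop] = rng.random((stop - start, dim), dtype=np.float32)
    buf.flush()
    del buf  # release the mapping


def read_vectors_from_memmap(path, num_vectors, dim, chunk_size):
    """Yield in-memory copies of consecutive chunks from the mapped file."""
    buf = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(num_vectors, dim))
    for start in range(0, num_vectors, chunk_size):
        # np.array() copies the slice so the caller holds ordinary RAM.
        yield np.array(buf[start:start + chunk_size])
```

Because the operating system pages the file in and out on demand, peak resident memory is governed by `chunk_size` rather than by the total dataset size.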
File Layout:
2. Vector Generation Engine (Enhanced)
Original vs Enhanced Comparison
| Original | Enhanced |
| --- | --- |
| np.random.random() | np.random.default_rng() |
| list (via .tolist()) | np.ndarray (float32) |
| float16 | float32 |

Seeded Generation Flow
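As a rough illustration of seeded generation with `np.random.default_rng()`, the sketch below yields float32 batches from a single seeded generator. The function name `generate_vectors` and its parameters are hypothetical and not taken from load_vdb.py.

```python
import numpy as np


def generate_vectors(num_vectors, dim, batch_size, seed=None):
    """Yield float32 batches of random vectors from one seeded generator."""
    rng = np.random.default_rng(seed)
    for start in range(0, num_vectors, batch_size):
        count = min(batch_size, num_vectors - start)
        # Generator.random can emit float32 directly, avoiding a float64
        # intermediate array and a later cast.
        yield rng.random((count, dim), dtype=np.float32)
```

Running the generator twice with the same seed and batch size produces identical vectors, which makes benchmark runs repeatable. Keeping the batches as `np.ndarray` also avoids the per-element Python objects that `.tolist()` would create.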
3. Execution Modes
Mode Selection Logic
Standard Mode (Default)
Preserves original chunked approach with added memory optimizations.
Adaptive Mode
Memory-pressure-aware execution with automatic batch scaling.
Disk-Backed Mode
Two-phase approach for billion-scale datasets on memory-constrained systems.
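The choice among the three modes could be expressed as a small function like the one below. The precedence shown (disk-backed wins over adaptive when both are requested) is an assumption, and `select_mode` is a hypothetical name. The only behavior confirmed by this description is that standard mode is used when no optimization flag is given.

```python
def select_mode(adaptive=False, disk_backed=False):
    """Return the execution mode name for the given flag combination."""
    if disk_backed:
        # Assumed precedence: disk-backed takes priority over adaptive.
        return "disk-backed"
    if adaptive:
        return "adaptive"
    # No optimization flags: fall back to the original chunked behavior.
    return "standard"
```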
4. Memory Optimization Strategies
Garbage Collection Points
Memory Monitoring Integration
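One way to combine memory monitoring with explicit garbage collection is sketched below. It assumes collection is triggered after a batch insert when system memory usage exceeds a threshold. The helper names, the 0.85 threshold, and the placement of the collection point are illustrative assumptions. `psutil` is imported optionally so the loop still runs without it.

```python
import gc

try:
    import psutil  # optional dependency
except ImportError:
    psutil = None


def memory_used_fraction():
    """Return system RAM usage as a fraction (0.0 to 1.0), or None without psutil."""
    if psutil is None:
        return None
    return psutil.virtual_memory().percent / 100.0


def process_batches(batches, insert_fn, threshold=0.85):
    """Insert each batch, collecting garbage when memory usage is high.

    Returns the number of times gc.collect() was triggered.
    """
    collections = 0
    for batch in batches:
        insert_fn(batch)
        used = memory_used_fraction()
        # Without psutil there is no measurement, so monitoring is skipped.
        if used is not None and used > threshold:
            gc.collect()
            collections += 1
    return collections
```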
5. Data Flow Comparison
Original Flow
Enhanced Flow
New CLI Arguments
--seed
--adaptive
--memory-budget (e.g. 4G, 512M)
--disk-backed
--temp-dir
Performance Characteristics
Memory Footprint Estimates
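For a back-of-the-envelope estimate, the raw size of a float32 vector set is the number of vectors times the dimension times 4 bytes. The sketch below pairs that arithmetic with a parser for budget strings such as 4G or 512M. Both helper names are hypothetical, the suffixes are assumed to be binary units (1K = 1024 bytes), and the estimate ignores Python, NumPy, and client-library overhead.

```python
def parse_size(text):
    """Convert a size string such as '4G' or '512M' to bytes (binary units assumed)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)  # plain byte count with no suffix


def estimate_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw payload size of the vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value
```

For example, one million 128-dimensional float32 vectors occupy 1,000,000 × 128 × 4 = 512,000,000 bytes, which fits within a 512M budget (536,870,912 bytes) only if nothing else is counted.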
Backward Compatibility
All original arguments and behaviors are preserved:
The script automatically uses standard mode when no optimization flags are specified.
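The backward-compatible default can be illustrated with an `argparse` sketch in which every new flag is off unless given explicitly. The flag names come from this description. Their types, defaults, and help strings are assumptions, and the script's original arguments are omitted.

```python
import argparse


def build_parser():
    """Parser covering only the new flags; types and defaults are assumed."""
    parser = argparse.ArgumentParser(description="Sketch of the new options")
    parser.add_argument("--seed", type=int, default=None,
                        help="seed for reproducible vector generation")
    parser.add_argument("--adaptive", action="store_true",
                        help="enable memory-pressure-aware batch scaling")
    parser.add_argument("--memory-budget", default=None,
                        help="memory limit, such as 4G or 512M")
    parser.add_argument("--disk-backed", action="store_true",
                        help="stage vectors in a memory-mapped file")
    parser.add_argument("--temp-dir", default=None,
                        help="directory for temporary files")
    return parser


# With no new flags, every optimization stays disabled.
args = build_parser().parse_args([])
```

Because `--adaptive` and `--disk-backed` use `store_true`, an invocation written for the original script leaves both as `False`, which corresponds to standard mode.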
Dependencies
Required
numpy
pymilvus
pyyaml (via config_loader)
Optional
psutil - Enables memory monitoring and adaptive scaling
    # Install optional dependency
    pip install psutil
Error Handling
Summary Report
Enhanced summary output includes: