@idevasena
VDB Benchmark - Enhanced Vector Loader

Overview

The load_vdb.py script loads synthetic vectors into a Milvus vector database for benchmarking purposes. This enhanced version introduces CPU and memory optimizations while preserving backward compatibility with the original implementation.


Architecture Changes

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           load_vdb.py (Enhanced)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌──────────────────────────────────────────────────┐   │
│  │   CLI Args  │───▶│              Mode Selection                      │   │
│  │  + Config   │    │  ┌──────────┬──────────────┬─────────────────┐   │   │
│  └─────────────┘    │  │ Standard │   Adaptive   │   Disk-Backed   │   │   │
│                     │  │  (default)│  (--adaptive)│  (--disk-backed)│   │   │
│                     │  └────┬─────┴──────┬───────┴────────┬────────┘   │   │
│                     └───────┼────────────┼────────────────┼────────────┘   │
│                             ▼            ▼                ▼                │
│                     ┌───────────────────────────────────────────────┐      │
│                     │           Vector Generation Engine            │      │
│                     │  • Seeded RNG (reproducibility)               │      │
│                     │  • NumPy float32 arrays                       │      │
│                     │  • L2 normalization                           │      │
│                     └───────────────────────────────────────────────┘      │
│                                          │                                 │
│                             ┌────────────┼────────────┐                    │
│                             ▼            ▼            ▼                    │
│                     ┌─────────────┐ ┌─────────┐ ┌───────────┐              │
│                     │   Chunked   │ │ Adaptive│ │  Mmap'd   │              │
│                     │   In-Memory │ │ Batching│ │   Disk    │              │
│                     │   Buffer    │ │Controller│ │  Buffer   │              │
│                     └──────┬──────┘ └────┬────┘ └─────┬─────┘              │
│                            └─────────────┼───────────┘                     │
│                                          ▼                                 │
│                     ┌───────────────────────────────────────────────┐      │
│                     │              Milvus Insertion                 │      │
│                     │  • Batch insert with progress tracking        │      │
│                     │  • Memory monitoring (psutil)                 │      │
│                     │  • Periodic garbage collection                │      │
│                     └───────────────────────────────────────────────┘      │
│                                          │                                 │
│                                          ▼                                 │
│                     ┌───────────────────────────────────────────────┐      │
│                     │         Post-Load Operations                  │      │
│                     │  • Flush → Index Build Monitor → Compact      │      │
│                     └───────────────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Component Details

1. Memory Management Utilities (New)

AdaptiveBatchController

Dynamically adjusts batch sizes based on real-time memory pressure.

┌─────────────────────────────────────────────────────────────┐
│                 AdaptiveBatchController                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Memory Threshold: 80%                                      │
│                                                             │
│  ┌──────────────────┬──────────────────────────────────┐   │
│  │  Memory Usage    │  Action                          │   │
│  ├──────────────────┼──────────────────────────────────┤   │
│  │  > 80%           │  Scale down by 50%               │   │
│  │  < 55%           │  Scale up by 25%                 │   │
│  │                  │  (after 10-batch cooldown)       │   │
│  │  55% - 80%       │  Maintain current size           │   │
│  └──────────────────┴──────────────────────────────────┘   │
│                                                             │
│  Bounds: [min_batch_size, max_batch_size]                  │
│  Default: [batch_size/20, batch_size*5]                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Methods:

  • get_batch_size(): Returns current batch size, adjusting if memory threshold exceeded
  • force_scale_down(): Emergency reduction after insertion errors
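
The scaling policy above can be sketched as follows. This is a minimal illustration, not the script's actual implementation; the constructor signature and the injectable `get_memory_percent` callable are assumptions made so the policy can run without psutil.

```python
class AdaptiveBatchController:
    """Sketch of the memory-pressure batch-scaling policy described above."""

    def __init__(self, batch_size, get_memory_percent,
                 high=80.0, low=55.0, cooldown=10):
        self.batch_size = batch_size
        self.min_batch_size = max(1, batch_size // 20)   # default lower bound
        self.max_batch_size = batch_size * 5             # default upper bound
        self.get_memory_percent = get_memory_percent     # e.g. psutil-backed
        self.high, self.low, self.cooldown = high, low, cooldown
        self.batches_since_change = 0

    def get_batch_size(self):
        mem = self.get_memory_percent()
        if mem > self.high:
            # Memory pressure: halve, respecting the lower bound
            self.batch_size = max(self.min_batch_size, self.batch_size // 2)
            self.batches_since_change = 0
        elif mem < self.low and self.batches_since_change >= self.cooldown:
            # Headroom and cooldown elapsed: grow by 25%, capped
            self.batch_size = min(self.max_batch_size,
                                  int(self.batch_size * 1.25))
            self.batches_since_change = 0
        else:
            self.batches_since_change += 1
        return self.batch_size

    def force_scale_down(self):
        # Emergency halving after an insertion error
        self.batch_size = max(self.min_batch_size, self.batch_size // 2)
        self.batches_since_change = 0
```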

DiskBackedBuffer

Memory-mapped file buffer for datasets exceeding available RAM.

┌─────────────────────────────────────────────────────────────┐
│                    DiskBackedBuffer                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: Generate → Disk                                   │
│  ┌──────────┐    ┌──────────────────────────────────┐      │
│  │ generate │───▶│  Memory-Mapped File (.mmap)      │      │
│  │ vectors  │    │  Size: num_vectors × dim × 4 bytes│      │
│  └──────────┘    └──────────────────────────────────┘      │
│                              │                              │
│  Phase 2: Disk → Database    │                              │
│                              ▼                              │
│  ┌──────────────────────────────────────────────┐          │
│  │  read_batch() ──▶ Milvus insert()            │          │
│  │  (streaming, no full dataset in memory)      │          │
│  └──────────────────────────────────────────────┘          │
│                                                             │
│  Cleanup: Auto-delete temp file on exit                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

File Layout:

Offset 0                                      Offset N×D×4
┌─────────┬─────────┬─────────┬─────────┬─────────┐
│Vector 0 │Vector 1 │Vector 2 │   ...   │Vector N │
│ D×4 B   │ D×4 B   │ D×4 B   │         │ D×4 B   │
└─────────┴─────────┴─────────┴─────────┴─────────┘
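
The two-phase buffer can be sketched with `numpy.memmap`, which gives exactly this flat float32 layout. The class and method names mirror the description above but the body is an illustrative assumption, not the script's code.

```python
import numpy as np
import os
import tempfile

class DiskBackedBuffer:
    """Sketch of a memory-mapped float32 vector buffer (names assumed)."""

    def __init__(self, dim, num_vectors, temp_dir=None):
        fd, self.path = tempfile.mkstemp(suffix=".mmap", dir=temp_dir)
        os.close(fd)
        # File size becomes num_vectors × dim × 4 bytes, as in the layout above
        self.mmap = np.memmap(self.path, dtype=np.float32, mode="w+",
                              shape=(num_vectors, dim))

    def write_batch(self, vectors, offset):
        self.mmap[offset:offset + len(vectors)] = vectors

    def read_batch(self, start, count):
        # Copy so the caller holds a plain in-memory array
        return np.array(self.mmap[start:start + count])

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Release the map and delete the temp file; warn-and-continue on failure
        del self.mmap
        try:
            os.remove(self.path)
        except OSError as e:
            print(f"warning: could not remove {self.path}: {e}")
```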

2. Vector Generation Engine (Enhanced)

Original vs Enhanced Comparison

| Aspect            | Original                  | Enhanced                            |
| ----------------- | ------------------------- | ----------------------------------- |
| RNG               | `np.random.random()`      | `np.random.default_rng()`           |
| Reproducibility   | None                      | Seed + batch_index                  |
| Output type       | `list` (via `.tolist()`)  | `np.ndarray` (float32)              |
| Intermediate type | float16                   | float32                             |
| Normalization     | Yes                       | Yes (with zero-division protection) |
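
The zero-division protection in the normalization step could look like the sketch below; the function name is illustrative, and clamping tiny norms before dividing is one common way to guard degenerate all-zero vectors.

```python
import numpy as np

def generate_normalized(rng: np.random.Generator, n: int, dim: int) -> np.ndarray:
    """Generate n random vectors and L2-normalize them, staying in float32."""
    vectors = rng.random((n, dim), dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for degenerate (all-zero) vectors
    np.maximum(norms, np.float32(1e-12), out=norms)
    return vectors / norms
```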

Seeded Generation Flow

┌─────────────────────────────────────────────────────────────┐
│                  Reproducible Generation                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  seed = 42, batch_index = 0                                │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────┐                   │
│  │ rng = default_rng(seed + batch_index)│                   │
│  │     = default_rng(42 + 0)           │                   │
│  │     = default_rng(42)               │                   │
│  └─────────────────────────────────────┘                   │
│       │                                                     │
│       ▼                                                     │
│  Batch 0: rng(42)  → deterministic vectors                 │
│  Batch 1: rng(43)  → different but reproducible            │
│  Batch 2: rng(44)  → different but reproducible            │
│       ...                                                   │
│                                                             │
│  Re-run with same seed → identical dataset                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
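
The per-batch seeding flow reduces to a few lines; `batch_vectors` is a hypothetical helper showing the `seed + batch_index` scheme, not the script's actual function.

```python
import numpy as np

def batch_vectors(seed: int, batch_index: int, batch_size: int, dim: int) -> np.ndarray:
    """Each batch gets its own deterministic stream: seed + batch_index."""
    rng = np.random.default_rng(seed + batch_index)
    return rng.random((batch_size, dim), dtype=np.float32)

# Re-running with the same seed reproduces a batch exactly,
# while different batch indices yield different (but reproducible) data.
a = batch_vectors(42, 0, 8, 16)
b = batch_vectors(42, 0, 8, 16)
assert np.array_equal(a, b)
```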

3. Execution Modes

Mode Selection Logic

if args.disk_backed:
    mode = "disk-backed"      # Lowest memory, two-phase
elif args.adaptive:
    mode = "adaptive"         # Dynamic batch sizing
else:
    mode = "standard"         # Original behavior (chunked)

Standard Mode (Default)

Preserves original chunked approach with added memory optimizations.

┌─────────────────────────────────────────────────────────────┐
│                     Standard Mode                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  if num_vectors > chunk_size:                              │
│      ┌─────────────────────────────────────────┐           │
│      │  for each chunk (default 1M vectors):   │           │
│      │    1. generate_vectors(chunk_size)      │           │
│      │    2. insert_data(chunk_vectors)        │           │
│      │    3. del chunk_vectors                 │           │
│      │    4. gc.collect()                      │           │
│      └─────────────────────────────────────────┘           │
│  else:                                                      │
│      ┌─────────────────────────────────────────┐           │
│      │  insert_data_standard():                │           │
│      │    • Generate + insert per batch        │           │
│      │    • gc.collect() every 50 batches      │           │
│      └─────────────────────────────────────────┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Adaptive Mode

Memory-pressure-aware execution with automatic batch scaling.

┌─────────────────────────────────────────────────────────────┐
│                     Adaptive Mode                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  while vectors_loaded < num_vectors:                │   │
│  │                                                     │   │
│  │    batch_size = controller.get_batch_size()        │   │
│  │         │                                          │   │
│  │         ├── Check psutil.virtual_memory()          │   │
│  │         ├── Scale down if > 80% used               │   │
│  │         └── Scale up if < 55% (after cooldown)     │   │
│  │                                                     │   │
│  │    try:                                            │   │
│  │        generate → insert → update count            │   │
│  │    except Error:                                   │   │
│  │        controller.force_scale_down()               │   │
│  │        continue  # retry with smaller batch        │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Output includes: batch_adjustments, errors count          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Disk-Backed Mode

Two-phase approach for billion-scale datasets on memory-constrained systems.

┌─────────────────────────────────────────────────────────────┐
│                   Disk-Backed Mode                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: Generate to Disk                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  with DiskBackedBuffer(dim, num_vectors) as buf:    │   │
│  │                                                     │   │
│  │    for batch_idx in range(num_batches):            │   │
│  │        vectors = generate_vectors(batch_size)       │   │
│  │        buf.write_batch(vectors, offset)            │   │
│  │        del vectors  # immediate cleanup             │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│  Phase 2: Stream to Database                                │
│  ┌─────────────────────────────────────────────────────┐   │
│  │    for start_id in range(0, num_vectors, batch):   │   │
│  │        vectors = buf.read_batch(start_id, count)    │   │
│  │        collection.insert([ids, vectors.tolist()])   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│  Cleanup: buf.__exit__() deletes temp file                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

4. Memory Optimization Strategies

Garbage Collection Points

# After each batch (all modes)
del vectors, data

# Periodic deep collection (every 50 batches)
if batch_idx % 50 == 0:
    gc.collect()

# After chunk completion (standard mode)
del chunk_vectors
gc.collect()

# After memory pressure detection (adaptive mode)
if mem_percent > threshold:
    gc.collect()

Memory Monitoring Integration

┌─────────────────────────────────────────────────────────────┐
│                  psutil Integration                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Optional dependency (graceful degradation)                │
│                                                             │
│  if PSUTIL_AVAILABLE:                                      │
│      • Log memory % in progress reports                    │
│      • Enable adaptive batch scaling                       │
│      • Report available RAM at startup                     │
│  else:                                                      │
│      • Use default values (50% usage, 8GB available)       │
│      • Adaptive mode works but without dynamic scaling     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
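
The optional-dependency pattern described above is typically a guarded import plus fallback defaults; this sketch assumes the `PSUTIL_AVAILABLE` flag named in the diagram and a hypothetical `memory_percent` helper.

```python
# Guarded import: the script keeps working when psutil is absent
try:
    import psutil
    PSUTIL_AVAILABLE = True
except ImportError:
    PSUTIL_AVAILABLE = False

def memory_percent(default: float = 50.0) -> float:
    """Return system memory usage in percent, or a conservative default."""
    if PSUTIL_AVAILABLE:
        return psutil.virtual_memory().percent
    return default  # assume moderate pressure when we cannot measure
```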

5. Data Flow Comparison

Original Flow

Config → Parse Args → Connect → Create Collection → Create Index
    → Generate ALL vectors (chunk) → Insert (batch loop) → Flush → Monitor

Enhanced Flow

Config → Parse Args → Connect → Create Collection → Create Index
    → Select Mode:
        ├─ Standard:  Chunked generate+insert with gc.collect()
        ├─ Adaptive:  Memory-monitored generate+insert with auto-scaling
        └─ Disk:      Phase1(generate→disk) → Phase2(disk→insert)
    → Flush → Monitor → Summary Report

New CLI Arguments

| Argument          | Type | Default     | Description                             |
| ----------------- | ---- | ----------- | --------------------------------------- |
| `--seed`          | int  | None        | Random seed for reproducible generation |
| `--adaptive`      | flag | False       | Enable memory-aware batch sizing        |
| `--memory-budget` | str  | "0"         | Memory limit (e.g., 4G, 512M)           |
| `--disk-backed`   | flag | False       | Use memory-mapped temp file             |
| `--temp-dir`      | str  | system temp | Directory for disk-backed mode          |
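
A `--memory-budget` string like `4G` or `512M` has to be turned into a byte count somewhere; a plausible parser (the function name and exact unit semantics are assumptions, using binary units) could look like:

```python
def parse_memory_budget(value: str) -> int:
    """Parse strings like '4G' or '512M' into bytes; '0' means no limit."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    value = value.strip().upper()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)  # plain byte count, e.g. "0"
```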

Performance Characteristics

| Mode        | Memory Usage                  | Throughput  | Best For                        |
| ----------- | ----------------------------- | ----------- | ------------------------------- |
| Standard    | High (chunk_size × dim × 4 B) | Highest     | <100M vectors, adequate RAM     |
| Adaptive    | Variable (auto-regulated)     | Medium-High | Variable memory, shared systems |
| Disk-Backed | Low (batch_size × dim × 4 B)  | Medium      | >100M vectors, limited RAM      |

Memory Footprint Estimates

Standard Mode (1M chunk, 1536 dim):
    Chunk: 1,000,000 × 1,536 × 4 bytes ≈ 5.7 GiB

Adaptive Mode (10K batch, 1536 dim):
    Batch: 10,000 × 1,536 × 4 bytes ≈ 58.6 MiB
    Peak: ~2-3× batch size during insert

Disk-Backed Mode (10K batch, 1536 dim):
    Memory: ~58.6 MiB per batch (streaming)
    Disk: num_vectors × 1,536 × 4 bytes (temp file)
    Example: 1B vectors ≈ 5.6 TiB temp file
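
The estimates above all come from one formula; this tiny helper (name hypothetical) checks the arithmetic for the float32 case:

```python
def batch_bytes(num_vectors: int, dim: int, itemsize: int = 4) -> int:
    """Raw footprint of a vector block: num_vectors × dim × itemsize bytes."""
    return num_vectors * dim * itemsize

# 1M × 1536-dim float32 chunk ≈ 5.7 GiB; 10K batch ≈ 58.6 MiB
assert round(batch_bytes(1_000_000, 1536) / 2**30, 2) == 5.72
assert round(batch_bytes(10_000, 1536) / 2**20, 1) == 58.6
```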

Backward Compatibility

All original arguments and behaviors are preserved:

# Original command still works identically
python load_vdb.py --config vdbbench/configs/10m_diskann.yaml \
    --collection-name test \
    --num-vectors 1000000 \
    --batch-size 10000 \
    --force \
    --compact

The script automatically uses standard mode when no optimization flags are specified.


Dependencies

Required

  • numpy
  • pymilvus
  • pyyaml (via config_loader)

Optional

  • psutil - Enables memory monitoring and adaptive scaling

# Install the optional dependency
pip install psutil

Error Handling

| Scenario                    | Behavior                            |
| --------------------------- | ----------------------------------- |
| psutil not installed        | Graceful degradation, uses defaults |
| Memory pressure (adaptive)  | Auto-scale down batch size          |
| Insert error (adaptive)     | Force scale down, retry             |
| Disk buffer cleanup failure | Warning logged, continues           |
| Missing required params     | Parser error with guidance          |

Summary Report

Enhanced summary output includes:

============================================================
Loading Summary
============================================================
Vectors loaded:    10,000,000
Total time:        245.3s
Throughput:        40,766 vectors/sec
Generation time:   45.2s
Insertion time:    198.1s
Batches:           1,000
Batch adjustments: 3          # (adaptive mode only)
Errors:            0          # (adaptive mode only)
============================================================

@idevasena idevasena requested a review from a team January 21, 2026 15:12
@idevasena idevasena requested a review from a team as a code owner January 21, 2026 15:12

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows

I'm wondering if it makes sense to use dgen-py instead of np.random.default_rng()?

Since generation time is a major consideration, being able to run 10× faster could be a big improvement: instead of days, it would be hours, and the change is minimal. Here is the PyPI page: https://pypi.org/project/dgen-py/

Below is the exact Python program I ran on 6 different cloud instance sizes, with the dgen-py results listed above. It compares the performance of 4 data-generation methods: os.urandom (/dev/urandom), np.random, Numba with a custom Xoshiro256 algorithm, and dgen-py.

bench_dgen-vs-numba-numpy.py
