perf: hardware-adaptive GPU optimization (DGX Spark UMA + A100/H100 HBM) #32

## Problem

All GPU code runs as vanilla PyTorch with no hardware-specific optimization, which leaves significant performance on the table across all target platforms.

## Current State (measured on DGX Spark GB10)

| Feature | Available | Utilized |
|---------|-----------|----------|
| TF32 matmul | Yes (Blackwell) | **No** (`allow_tf32 = False`) |
| BF16 | Yes | **No** (all FP32/FP64) |
| 128GB UMA | Yes | **No** (unnecessary `.cpu().numpy()` roundtrips) |
| Tensor Cores (5th gen) | Yes | **No** |
| `torch.compile` | Yes | **No** |

## Target Platforms

| Platform | Memory | Key Optimization |
|----------|--------|-----------------|
| DGX Spark GB10 | 128GB UMA (shared) | Minimize `.cpu()`/`.to(device)` — zero-copy |
| A100 (40/80GB) | Dedicated HBM2e | Maximize batch size, fuse transfers |
| H100 (80GB) | Dedicated HBM3 | FP8 tensor cores, transformer engine |

## Proposed Changes (by effort/impact)

### Quick wins (1-2 lines each)
- `torch.backends.cuda.matmul.allow_tf32 = True`
- `torch.set_float32_matmul_precision('high')` (overlaps with the flag above for CUDA matmuls, so one of the two is likely sufficient)
- `torch.compile(nqs)` for NQS inference
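
A minimal sketch of how the quick wins could be wired into a startup hook. The helper names `enable_fast_matmul` and `maybe_compile` are hypothetical, and the `torch.nn.Linear` in the usage line only stands in for the NQS model, whose actual class is not shown in this issue.

```python
import torch


def enable_fast_matmul() -> None:
    """Allow float32 matmuls to use TF32 on GPUs that support it."""
    # "high" trades a small amount of float32 matmul precision for speed.
    torch.set_float32_matmul_precision("high")


def maybe_compile(model: torch.nn.Module, enabled: bool = True) -> torch.nn.Module:
    """Wrap the model with torch.compile, or return it unchanged if disabled."""
    # The opt-out flag (an assumption of this sketch) keeps an eager path
    # available for debugging or for platforms where compilation fails.
    return torch.compile(model) if enabled else model


enable_fast_matmul()
# Stand-in module; replace with the real NQS model.
nqs = maybe_compile(torch.nn.Linear(16, 2), enabled=False)
```

Keeping compilation behind a flag makes it easy to compare eager and compiled inference on each target platform before turning it on by default.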

### Medium effort (~50 lines)
- `torch.amp.autocast('cuda', dtype=torch.bfloat16)` around NQS training loop
- Audit and remove unnecessary `.cpu().numpy()` → `.to(device)` roundtrips (especially on UMA, where host and device share the same physical memory)
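
The two medium-effort items might look roughly like the sketch below. `train_step` and `top_k_indices` are hypothetical helpers, the squared-error loss is a placeholder for the real NQS objective, and the `np.argsort` pattern mentioned in the comment is an assumed example of a roundtrip, not a quote from the codebase.

```python
import torch


def train_step(model, optimizer, x, y, device_type: str = "cuda") -> torch.Tensor:
    """One optimizer step with the forward pass under BF16 autocast."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        pred = model(x)
    # Compute the loss in float32 for numerical stability (placeholder loss).
    loss = (pred.float() - y).pow(2).mean()
    loss.backward()
    optimizer.step()
    # Return a tensor rather than calling .item() to avoid a host sync per step.
    return loss.detach()


def top_k_indices(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k largest scores without leaving the tensor's device."""
    # Stands in for a pattern like np.argsort(scores.cpu().numpy())[-k:]
    # followed by moving the result back with .to(device).
    return torch.topk(scores, k).indices
```

BF16 keeps the float32 exponent range, so this sketch does not use a gradient scaler. Whether BF16 is accurate enough for the NQS amplitudes would need to be checked against an FP32 baseline.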

### Larger effort (separate PR)
- Hardware-adaptive runtime: detect UMA vs HBM and adjust data movement strategy
- CUDA kernel optimization for `diagonal_elements_batch` and PT2 scoring
- Numba/CUDA acceleration for `get_connections` inner loop (currently pure Python)
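
One possible shape for the UMA-versus-HBM detection is sketched below. It assumes that the `is_integrated` field of `torch.cuda.get_device_properties` is a usable signal for unified memory on the GB10, which has not been verified here. `MemoryStrategy`, `choose_strategy`, and `detect_strategy` are hypothetical names, and the policy itself (skip pinning on UMA, use pinned non-blocking copies on discrete GPUs) is an assumption to be validated by measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryStrategy:
    unified_memory: bool       # host and device share physical memory
    pin_host_memory: bool      # use pinned host buffers for transfers
    non_blocking_copies: bool  # overlap host-to-device copies with compute


def choose_strategy(is_integrated: bool) -> MemoryStrategy:
    """Map the integrated-GPU flag to a data movement policy (assumed policy)."""
    if is_integrated:
        # UMA: transfers are cheap, so focus on removing them entirely.
        return MemoryStrategy(True, False, False)
    # Discrete HBM: make the remaining transfers as cheap as possible.
    return MemoryStrategy(False, True, True)


def detect_strategy(device_index: int = 0) -> MemoryStrategy:
    """Query the active CUDA device and pick a strategy for it."""
    import torch  # imported lazily so choose_strategy stays testable without a GPU

    if not torch.cuda.is_available():
        return MemoryStrategy(False, False, False)  # CPU-only fallback
    props = torch.cuda.get_device_properties(device_index)
    # getattr guards against builds where the field is missing (assumption).
    return choose_strategy(bool(getattr(props, "is_integrated", False)))
```

Separating the pure policy function from the device query lets the policy be unit-tested on machines without any of the target GPUs.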

## Related

- Issue #25 (adaptive sampling — also performance-sensitive)
- ADR-005 (PT2 selection: `compute_pt2_scores` is a CPU-bound Python loop)
