feat: support per-dimension link settings and sync collective dims #33

Open
horser1 wants to merge 1 commit into casys-kaist:main from horser1:feat_cxm

horser1 commented May 23, 2026

Related to #32

## Summary

- allow link_bw and link_latency to be specified as either scalars or per-dimension arrays
- reuse the same inferred topology dimensions for both network.yml and system.json
- keep the existing default collective implementation as ring
- add docs and a new cluster-config example for distinct intra-/inter-dimension settings
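
The scalar-or-array behaviour in the first bullet can be pictured as a small normalization step. This is a minimal sketch, not the code in this PR: `expand_per_dim` is a hypothetical helper name, and the length check and error message are assumptions about how a mismatch might be handled.

```python
from numbers import Real
from typing import Sequence, Union

Setting = Union[Real, Sequence[Real]]


def expand_per_dim(value: Setting, ndims: int, name: str = "link_bw") -> list:
    """Return one entry per topology dimension.

    Hypothetical helper for illustration; not taken from config_builder.py.
    """
    if isinstance(value, Real):
        # A scalar applies uniformly to every dimension.
        return [value] * ndims
    values = list(value)
    if len(values) != ndims:
        # Assumed behaviour: reject arrays that do not match the topology.
        raise ValueError(
            f"{name} has {len(values)} entries but the topology has {ndims} dimensions"
        )
    return values


print(expand_per_dim(100, 2))        # scalar broadcast to both dimensions
print(expand_per_dim([200, 25], 2))  # per-dimension values kept as given
```

With a helper of this shape, existing configs that use scalars would keep working unchanged, while a 2D topology could carry distinct intra- and inter-dimension values.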

## Motivation

ASTRA-Sim supports per-dimension bandwidth/latency when npus_count is multi-dimensional, but LLMServingSim previously only exposed scalar link_bw / link_latency.

While adding array support, I also found that network.yml and system.json did not share the same topology-dimension inference path. In several multi-instance / PD / DP+EP configs, network.yml could become 2D while system.json still emitted only one collective implementation entry per collective.

This change centralizes topology-dimension inference so both files stay aligned.
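
One way to picture the alignment is to derive the dimension count once and size both outputs from it. The sketch below is illustrative only: `network_fields` and `system_fields` are hypothetical names, the `[8, 2]` topology is made up, and the real builder writes many more fields. The collective keys are the ones checked in the Testing section, and `ring` is the default this PR keeps.

```python
COLLECTIVE_KEYS = [
    "all-reduce-implementation",
    "all-gather-implementation",
    "reduce-scatter-implementation",
    "all-to-all-implementation",
]


def network_fields(npus_count):
    """Hypothetical subset of network.yml: one npus_count entry per dimension."""
    return {"npus_count": list(npus_count)}


def system_fields(ndims, impl="ring"):
    """Hypothetical subset of system.json: one implementation entry per dimension."""
    return {key: [impl] * ndims for key in COLLECTIVE_KEYS}


# Infer the dimensions once, then feed the same value to both outputs.
npus_count = [8, 2]  # illustrative 2D topology, not from a bundled config
ndims = len(npus_count)

net = network_fields(npus_count)
syscfg = system_fields(ndims)

# Both files now agree on the number of dimensions by construction.
assert all(len(syscfg[k]) == len(net["npus_count"]) for k in COLLECTIVE_KEYS)
```

Because both dictionaries are sized from the same `ndims`, the mismatch described above (a 2D network.yml next to single-entry collective lists) cannot arise in this sketch.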

## Testing

### Syntax / parsing

python3 -m py_compile serving/core/config_builder.py serving/__main__.py serving/core/trace_generator.py

python3 - <<'PY'
import json
from pathlib import Path
for path in [
    Path('configs/cluster/dual_node_moe_dp_ep_intra_inter_instance.json'),
    Path('configs/cluster/single_node_single_instance.json'),
]:
    with open(path) as f:
        json.load(f)
    print('OK', path.name)
PY

Result:

- passed

### Config-builder regression

python3 - <<'PY'
import os, sys
from pathlib import Path
repo = Path('/app/LLMServingSim')
os.chdir(repo / 'astra-sim')
sys.path.insert(0, str(repo))
from serving.core.config_builder import build_cluster_config
astra = str(repo / 'astra-sim')
root = repo / 'configs' / 'cluster'
errors = []
for path in sorted(root.glob('*.json')):
    try:
        build_cluster_config(astra, str(path.relative_to(repo)))
        print('OK', path.name)
    except Exception as exc:
        errors.append((path.name, type(exc).__name__, str(exc)))
        print('ERR', path.name, type(exc).__name__, str(exc))
print('SUMMARY', len(errors), 'errors')
PY

Result:

- 14 / 14 cluster configs generated successfully
- 0 errors

### Topology / collective-dimension consistency

python3 - <<'PY'
import json, os, sys, yaml
from pathlib import Path
repo = Path('/app/LLMServingSim')
os.chdir(repo / 'astra-sim')
sys.path.insert(0, str(repo))
from serving.core.config_builder import build_cluster_config
astra = str(repo / 'astra-sim')
root = repo / 'configs' / 'cluster'
keys = [
    'all-reduce-implementation',
    'all-gather-implementation',
    'reduce-scatter-implementation',
    'all-to-all-implementation',
]
for path in sorted(root.glob('*.json')):
    build_cluster_config(astra, str(path.relative_to(repo)))
    with open(repo / 'astra-sim/inputs/network/network.yml') as f:
        net = yaml.safe_load(f)
    with open(repo / 'astra-sim/inputs/system/system.json') as f:
        syscfg = json.load(f)
    ndims = len(net['npus_count'])
    lens = {k: len(syscfg[k]) for k in keys}
    assert all(v == ndims for v in lens.values()), (path.name, ndims, lens)
    print('OK', path.name, ndims, lens)
PY

Result:

- 0 mismatches between len(network.yml::npus_count) and the length of every collective implementation array in system.json

### Smoke run

mkdir -p /tmp/llmservingsim-bin
ln -sf "$(command -v python3)" /tmp/llmservingsim-bin/python
PATH="/tmp/llmservingsim-bin:$PATH" \
PYTHONPATH="$(pwd)/astra-sim/extern/graph_frontend:$(pwd)/astra-sim/extern/graph_frontend/chakra" \
python3 -m serving \
  --dtype bfloat16 \
  --cluster-config configs/cluster/single_node_single_instance.json \
  --dataset workloads/example_trace.jsonl \
  --num-reqs 10 \
  --output outputs/pr_smoke_single_10.csv

Result:

- exit code 0
- outputs/pr_smoke_single_10.csv contains 10 data rows
- workload graphs were regenerated under astra-sim/inputs/workload/...
- final summary printed successfully
- observed throughput summary:
    - request throughput: 6.01 req/s
    - average prompt throughput: 72.07 tok/s
    - average generation throughput: 354.94 tok/s

## Notes
- the new example config demonstrates per-dimension intra-/inter-link settings without modifying the existing bundled examples
