Merged
4 changes: 2 additions & 2 deletions fastdeploy/envs.py
@@ -225,10 +225,10 @@ def _validate_split_kv_size(value: int) -> int:
 "FD_WORKER_ALIVE_TIMEOUT": lambda: int(os.getenv("FD_WORKER_ALIVE_TIMEOUT", "30")),
 # File path for file storage backend
 "FILE_BACKEND_STORAGE_DIR": lambda: str(os.getenv("FILE_BACKEND_STORAGE_DIR", "/tmp/fastdeploy")),
-# Custom all-reduce max buffer size in MB (default 64MB).
+# Custom all-reduce max buffer size in MB (default 8MB).
 # Increase this to avoid NCCL fallback for large tensors in deterministic mode.
Comment on lines +228 to 229
Copilot AI Mar 24, 2026


This comment is inconsistent with the actual behavior: when FD_DETERMINISTIC_MODE=1, an input tensor larger than max_size raises a RuntimeError directly (in communication._ensure_deterministic_ready) rather than falling back to NCCL. Consider rewording the comment to say that oversized tensors raise an error and that this value must be increased to satisfy the deterministic all-reduce max_size limit, to avoid misleading readers.

Suggested change
-# Custom all-reduce max buffer size in MB (default 8MB).
-# Increase this to avoid NCCL fallback for large tensors in deterministic mode.
+# Custom deterministic all-reduce max buffer size in MB (default 8MB).
+# When FD_DETERMINISTIC_MODE=1, tensors larger than this limit will raise an error
+# instead of falling back to NCCL. Increase this value to avoid max_size errors.

 # E.g. FD_CUSTOM_AR_MAX_SIZE_MB=128 for 128MB.
-"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "64")),
+"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "8")),
Copilot AI Mar 24, 2026


FD_CUSTOM_AR_MAX_SIZE_MB is read here with a bare int(os.getenv(...)) and no range validation; if it is set to 0 or a negative number, CustomAllreduce will compute an invalid size_in_bytes when allocating the shared buffer and fail in a lower layer, which is hard to debug. Consider adding a validator (similar to _validate_split_kv_size) that requires a positive integer (e.g. >= 1) and raises a clear error for invalid values.

 # Enable deterministic inference mode for chunked prefill alignment
 "FD_DETERMINISTIC_MODE": lambda: bool(int(os.getenv("FD_DETERMINISTIC_MODE", "0"))),
 # Split KV block size for deterministic alignment (must be power of 2 and > 0, default 16)
2 changes: 1 addition & 1 deletion tests/e2e/4cards_cases/test_determinism_long.py
@@ -143,7 +143,7 @@ def _module_env():
 {
     "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3"),
     "FD_DETERMINISTIC_MODE": "1",
-    "FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "57"),
+    "FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "64"),
Copilot AI Mar 24, 2026


This test lets an external environment variable override FD_CUSTOM_AR_MAX_SIZE_MB; if CI or a local environment happens to set a smaller value (such as the new default of 8), the deterministic all-reduce may raise on large tensors and make the test flaky. Consider matching the other determinism e2e tests in this directory and pinning the value to "64" (as in _test_determinism_offline.py) so the result does not depend on the external environment.

Suggested change
-"FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "64"),
+"FD_CUSTOM_AR_MAX_SIZE_MB": "64",

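The pinning pattern the reviewer suggests can be illustrated with `unittest.mock.patch.dict`, which is the standard way to fix environment variables for the duration of a test (this is a generic sketch, not the test's actual `_module_env` helper):

```python
# Illustrative sketch: pin env vars with patch.dict so the test ignores
# whatever CI or the developer's shell happens to export.
import os
from unittest import mock

with mock.patch.dict(
    os.environ,
    {
        "FD_DETERMINISTIC_MODE": "1",
        # Pinned literal: the outer environment can no longer change it.
        "FD_CUSTOM_AR_MAX_SIZE_MB": "64",
    },
):
    assert os.environ["FD_CUSTOM_AR_MAX_SIZE_MB"] == "64"
# On exit, patch.dict restores the original environment automatically.
```

Using a literal instead of `os.environ.get(...)` inside the patched dict is exactly what makes the e2e result reproducible across machines.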
     "FLAGS_max_partition_size": _CHUNK_SIZE_FOR_TEST,
 }
 ):