[Benchmark] Add compute_seq_len_sweep_config_with_probe with linear/quadratic scaling support by shivam2199 · Pull Request #1218 · linkedin/Liger-Kernel

shivam2199 · 2026-05-07T16:45:11Z

Summary

Refs #1200. Addresses non-linear memory scaling in benchmark sweep config inference.

The existing compute_seq_len_sweep_config inverts memory via max_tokens = usable_bytes / kernel_bytes_per_token, which only holds for linear-scaling kernels. For O(L²) kernels (e.g. benchmark_sparse_multi_token_attention.py), this overestimates capacity by orders of magnitude — the existing workaround there divides by probe_L * probe_L, but the downstream sweep math still treats the result as linear bytes-per-token.
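To make the overestimate concrete, here is a minimal sketch of both inversions applied to a toy kernel that allocates B * L * L floats. The function names, the 16 GiB budget, the probe shape, and the float32 element size are all assumptions chosen for illustration; none of this is the code in benchmark_model_configs.py.

```python
import math

BYTES_PER_FLOAT = 4  # assumption: float32 elements


def true_peak_bytes(batch_size: int, seq_len: int) -> int:
    """Memory of a toy O(L^2) kernel that allocates B * L * L floats."""
    return batch_size * seq_len * seq_len * BYTES_PER_FLOAT


def linear_max_seq_len(usable_bytes, peak_bytes, probe_batch, probe_len, batch_size):
    """Linear inversion: treat the probe's peak memory as bytes per token."""
    bytes_per_token = peak_bytes / (probe_batch * probe_len)
    return int(usable_bytes / (batch_size * bytes_per_token))


def quadratic_max_seq_len(usable_bytes, peak_bytes, probe_batch, probe_len, batch_size):
    """Quadratic inversion: treat the probe's peak memory as bytes per B * L^2."""
    bytes_per_bl2 = peak_bytes / (probe_batch * probe_len**2)
    return int(math.sqrt(usable_bytes / (batch_size * bytes_per_bl2)))


usable_bytes = 16 * 1024**3          # hypothetical 16 GiB budget
probe_batch, probe_len = 1, 1024     # hypothetical probe shape
peak_bytes = true_peak_bytes(probe_batch, probe_len)

linear_len = linear_max_seq_len(usable_bytes, peak_bytes, probe_batch, probe_len, batch_size=1)
quadratic_len = quadratic_max_seq_len(usable_bytes, peak_bytes, probe_batch, probe_len, batch_size=1)

# How far over budget the linear prediction would be if actually run.
overshoot = true_peak_bytes(1, linear_len) / usable_bytes
print(linear_len, quadratic_len, overshoot)
```

With these assumed numbers the quadratic inversion lands exactly on the budget, while the sequence length predicted by the linear inversion would need 4096 times the budget.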

Per discussion on the issue (#1200 (comment)), this PR adds a new helper rather than threading scaling_method through the existing function — 16+ benchmark scripts call estimate_kernel_peak_memory today, and a wider signature change would conflict with in-flight benchmark refactors (#1199, #1180). Linear-scaling callers are unchanged; only quadratic-scaling benchmarks opt in.

What changed

benchmark/scripts/benchmark_model_configs.py — adds compute_seq_len_sweep_config_with_probe(model_cfg, probe_fn, probe_seq_len, probe_batch_size=1, scaling_method="linear" | "quadratic", ...). Internalizes the probe call + inversion; reuses estimate_kernel_peak_memory for the measurement.
benchmark/scripts/benchmark_sparse_multi_token_attention.py — switches the token_length sweep mode to the new helper with scaling_method="quadratic", dropping the manual peak_bytes // (probe_L * probe_L) workaround.

estimate_kernel_peak_memory and compute_seq_len_sweep_config are untouched.

Validation

Hardware: A10G 24GB (g5.xlarge).

Synthetic O(L²) probe (B=2, L=2048, allocates B * L * L floats) using LLAMA_3_8B config and max_seq_len=2**20 to bypass the model cap so the raw inversion is visible:

quadratic: SeqLenSweepConfig(batch_size=2, seq_len=8192)
linear:    SeqLenSweepConfig(batch_size=2, seq_len=65536)

The 8× gap (≈17× before snap-to-power-of-2) demonstrates the inversion difference: linear claims a sweep at L=65536 fits, when in reality the B * L * L float buffer at that size would need about 32 GiB (assuming 4-byte floats), more than the card's 24 GB. quadratic lands at a realistic L=8192. This matches the issue's premise: for non-linear-scaling kernels, the existing inversion overestimates capacity and would OOM at the predicted boundary.

Testing Done

Synthetic O(L²) sanity check on A10G — confirms quadratic predicts L=8192 vs linear predicts L=65536 for the same probe (8× separation, scales as expected).
benchmark_sparse_multi_token_attention.py imports + helper resolution verified locally.
Not yet run: full sparse-attention end-to-end sweep on A10G (deferred; the synthetic test already isolates the inversion math from kernel-specific noise).

cc @Tcc0403

…r scaling (linkedin#1200)

Adds a new helper alongside the existing compute_seq_len_sweep_config that internalizes both the probe and the seq-len inversion, with a scaling_method argument supporting "linear" (default) and "quadratic". For O(L^2) kernels, the inversion uses L_max = sqrt(usable / (B * c_per_BL2)) instead of the linear max_tokens / batch_size path.

Migrates benchmark_sparse_multi_token_attention.py to the new helper and drops its manual `peak_bytes // (probe_L * probe_L)` workaround. The existing estimate_kernel_peak_memory and compute_seq_len_sweep_config are unchanged; linear-scaling benchmark callers don't need to migrate.
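The inversion named in this commit message can be written out as a short derivation. The symbols U (usable bytes), B_0 and L_0 (probe batch size and sequence length), and M_probe (measured peak bytes) are notation introduced here for illustration; c_1 and c_2 correspond to c_per_BL and c_per_BL2 in the diff. The power-law memory model is the assumption that the two scaling methods encode, not a measured property of any particular kernel.

```latex
\begin{align*}
M(B, L) &\approx c_p \, B \, L^{p}, \qquad p = 1 \text{ (linear)},\; p = 2 \text{ (quadratic)} \\
c_p &= \frac{M_{\text{probe}}}{B_0 \, L_0^{p}} \\
M(B, L_{\max}) = U \;&\Longrightarrow\; L_{\max} = \left( \frac{U}{B \, c_p} \right)^{1/p} \\
p = 1:\quad L_{\max} &= \frac{U}{B \, c_1}, \qquad
p = 2:\quad L_{\max} = \sqrt{\frac{U}{B \, c_2}}
\end{align*}
```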

shivam2199 · 2026-05-08T12:57:41Z

@Tcc0403 @Mecoli1219 Please take a look

Tcc0403 · 2026-05-08T14:54:19Z

+    batch_size = max(1, min(max_batch_size, probe_batch_size))
+
+    if scaling_method == "linear":
+        c_per_BL = max(1.0, peak_bytes / (probe_batch_size * probe_seq_len))
+        max_seq_len_from_mem = max(1, int(usable_bytes / (batch_size * c_per_BL)))
+    else:
+        c_per_BL2 = max(1.0, peak_bytes / (probe_batch_size * probe_seq_len * probe_seq_len))
+        max_seq_len_from_mem = max(1, int(math.sqrt(usable_bytes / (batch_size * c_per_BL2))))
+
+    seq_len = min(max_seq_len, max_seq_len_from_mem)
+    seq_len = 2 ** int(math.log2(seq_len)) if seq_len >= 1024 else 1024


Is it possible to just plug this part into compute_seq_len_sweep_config?

@Tcc0403 Good call. Pushed 6c204db which extracts two private helpers — _max_seqlen_under_memory (handles both linear and quadratic inversion) and _snap_pow2_seqlen — and collapses both public functions to thin orchestration over them.

compute_seq_len_sweep_config treats kernel_bytes_per_token as a unit-probe (B=L=1, linear) so the inversion math reduces to the existing max_tokens = usable / bpt quantity. No behavior change for the existing callers; the duplicated inversion/snap logic is gone.

Per @Tcc0403 review: instead of two parallel implementations of the inversion + power-of-2 snap, extract `_max_seqlen_under_memory` (handles both linear and quadratic) and `_snap_pow2_seqlen`. Both public APIs become thin orchestration layers over them. `compute_seq_len_sweep_config` now treats `kernel_bytes_per_token` as a unit-probe (B=L=1, scaling=linear) so the math collapses to the existing `max_tokens = usable / bpt` behavior — no behavior change for the 16+ existing callers.

Tcc0403 · 2026-05-09T14:35:31Z

+    return 2 ** int(math.log2(seq_len)) if seq_len >= 1024 else 1024
+
+
+def compute_seq_len_sweep_config_with_probe(


We can replace all occurrences of compute_seq_len_sweep_config with yours, keeping only one helper function.

Per @Tcc0403's review on linkedin#1218: replace all callers of the old compute_seq_len_sweep_config with the probe-aware variant and delete the old function. Single public helper, single way to compute a sweep config.

The unified compute_seq_len_sweep_config takes probe_fn + probe_seq_len directly and runs estimate_kernel_peak_memory internally. The scaling_method="quadratic" path that compute_seq_len_sweep_config_with_probe existed to support is now first-class on the unified function.

Caller pattern simplifies from:

    peak_bytes = estimate_kernel_peak_memory(probe_fn=_probe)
    kernel_bpt = peak_bytes // probe_seq_len
    config = compute_seq_len_sweep_config(model, kernel_bytes_per_token=kernel_bpt)

to:

    config = compute_seq_len_sweep_config(model, probe_fn=_probe, probe_seq_len=probe_seq_len)

Net -154 lines across 33 benchmark scripts. The benchmark_multi_token_attention caller, which previously did a manual peak_bytes // (probe_L * probe_L) quadratic inversion, now uses scaling_method="quadratic" via the unified API.

shivam2199 · 2026-05-09T19:45:59Z

@Tcc0403 Pushed a8a8f40. Collapsed the two public helpers into one — compute_seq_len_sweep_config now takes probe_fn + probe_seq_len directly with optional probe_batch_size and scaling_method. Migrated all 33 callers (net −154 lines).

The benchmark_multi_token_attention script's manual peak_bytes // (probe_L * probe_L) inversion is now first-class via scaling_method="quadratic". make checkstyle passes cleanly.

Tcc0403

Thank you, lgtm

shivam2199 and others added 2 commits May 7, 2026 22:13

Merge branch 'main' into issue-1200-quadratic-probe-scaling

c8ca081

Tcc0403 reviewed May 8, 2026

shivam2199 and others added 2 commits May 8, 2026 20:43

Merge branch 'main' into issue-1200-quadratic-probe-scaling

0966b51

Tcc0403 reviewed May 9, 2026

Merge branch 'main' into issue-1200-quadratic-probe-scaling

0aff987

Tcc0403 approved these changes May 9, 2026

Tcc0403 added this pull request to the merge queue May 9, 2026

Merged via the queue into linkedin:main with commit 97b6fe2 May 9, 2026
5 checks passed

Tcc0403 merged 6 commits into linkedin:main from shivam2199:issue-1200-quadratic-probe-scaling
