Skip to content

perf: cache renderDouble for small integers#763

Open
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:perf/cached-render-double
Open

perf: cache renderDouble for small integers#763
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:perf/cached-render-double

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented Apr 12, 2026

Motivation:

Reduce allocation overhead in common numeric rendering paths.

Key Design Decision:

Keep this PR focused on the renderDouble optimization. The original Builtin1.apply1 / Builtin3.apply3 specialization has already landed in current master via #807, so it is intentionally no longer part of this PR diff after the rebase.

Modification:

  • Materializer delegates number stringification to RenderUtils.renderDouble.
  • RenderUtils.renderDouble reuses a small integer string cache for exact integer doubles in the range 0 to 255.

Benchmark Results:

Rebased onto latest upstream/master at 8b67cb1e.

JMH, JVM harness, lower is better. Normal/noise-only rows are intentionally omitted.

Benchmark master ms/op PR ms/op Delta
realistic2 47.484 41.848 -11.9%

Scala Native hyperfine against latest source-built jrsonnet master 5b43fa8 (jrsonnet 0.5.0-pre98), lower is better:

Benchmark master-native PR-native jrsonnet-source Result
realistic2 82.2 +/- 1.5 ms 85.4 +/- 6.1 ms 143.7 +/- 4.0 ms Native neutral/noisy vs master; PR-native is 1.68x faster than latest jrsonnet

Analysis:

The current-master refresh leaves one clear JVM signal: realistic2 improves by about 12% through the shared numeric rendering path. Native CLI is not a standout improvement over master on this PR, but latest jrsonnet is substantially slower on the same realistic2 case.

Verification:

  • ./mill -i __.checkFormat
  • ./mill -i 'sjsonnet.jvm[3.3.7].test'
  • ./mill -i bench.runRegressions bench/resources/cpp_suite/large_string_join.jsonnet bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/realistic2.jsonnet bench/resources/go_suite/parseInt.jsonnet
  • ./mill -i 'sjsonnet.native[3.3.7].nativeLink'
  • hyperfine --warmup 10 --min-runs 50

References:

  • PR branch: perf/cached-render-double
  • Base: upstream/master at 8b67cb1eeffa764f2a7298658e5473d8402a8da1
  • Head: b21352970b5f19e8f929a259739d88e69df31627
  • Source-built jrsonnet: 5b43fa88b8c43856dd5a2daa9c5c251153c5e14d

Result:

Ready. The PR now contains only the cached renderDouble work; the Builtin1/3 apply override work is already covered by #807 on master.

@He-Pin He-Pin force-pushed the perf/cached-render-double branch from 12b870c to 4c8c5ab Compare April 12, 2026 17:32
@He-Pin He-Pin marked this pull request as ready for review April 12, 2026 17:48
@He-Pin He-Pin marked this pull request as draft April 12, 2026 18:57
@He-Pin He-Pin force-pushed the perf/cached-render-double branch 3 times, most recently from 3ae5a56 to 5e55449 Compare April 26, 2026 10:46
@He-Pin He-Pin marked this pull request as ready for review April 26, 2026 10:48
@stephenamar-db
Copy link
Copy Markdown
Collaborator

I'm questioning the usefulness of this tweak (for numbers) - this seems quite neutral

@He-Pin He-Pin marked this pull request as draft April 28, 2026 21:00
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented Apr 28, 2026

It just a bit optimization around realistic2 where many small nunbers.

stephenamar-db pushed a commit that referenced this pull request Apr 30, 2026
Split out from #763.

Motivation:

Reduce allocation and dispatch overhead when one- and three-argument
builtins are called through the dynamic `Val.Func.apply1` / `apply3`
path.

Key Design Decision:

Keep the optimization local and semantics-preserving. `Builtin2` already
has an exact-arity `apply2` override; this adds matching
`Builtin1.apply1` and `Builtin3.apply3` overrides. Exact positional
calls directly invoke the structured `evalRhs` overload and skip
constructing an intermediate `Array`. Non-exact paths still fall back to
the generic parent application path.

Correctness:

- The direct path matches the existing `Builtin1.apply` /
`Builtin3.apply` exact positional behavior: force the supplied `Eval`
values, then call the typed `evalRhs`.
- Named arguments, missing defaults, too many arguments, and other
non-exact calls still use the generic function application logic.
- Static `Expr.ApplyBuiltin1` / `Expr.ApplyBuiltin3` paths are
unchanged; this only helps dynamic builtin calls such as a builtin
stored in a local or returned from another function.

Modification:

- Add `Builtin1.apply1`.
- Add `Builtin3.apply3`.

Validation:

- `./mill --no-server 'sjsonnet.jvm[3.3.7].compile'`
- `./mill --no-server 'sjsonnet.jvm[3.3.7].test'` (`141/141, SUCCESS`)
- `./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'`
- `./mill --no-server '_.jvm[_].__.test'` (`1104/1104, SUCCESS`)
- Dynamic builtin smoke checks:
  - `local f = std.length; f([1, 2, 3])` -> `3`
  - `local f = std.substr; f("abcdef", 1, 3)` -> `"bcd"`
  - `local f = std.substr; f(str="abcdef", from=1, len=3)` -> `"bcd"`

Hyperfine:

Toolchain:

- `hyperfine 1.20.0`
- `--warmup 3 --min-runs 25` for targeted dynamic builtin benchmarks
- `--warmup 3 --min-runs 20` for `realistic2`
- JVM assemblies built with `./mill --no-server show
'sjsonnet.jvm[3.3.7].assembly'`
- Base: `upstream/master` at `c04fc804`
- Branch: `2067d8b5`

Targeted `Builtin1` dynamic call benchmark:

```jsonnet
local identity(x) = x;
local len = identity(std.length);
std.foldl(
  function(acc, i) acc + len("abcdef"),
  std.range(1, 5000000),
  0
)
```

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `master builtin1_dynamic` | 649.8 +/- 48.3 | 557.4 | 726.8 | 1.00 |
| `branch builtin1_dynamic` | 661.7 +/- 41.0 | 606.1 | 747.3 | 1.02 +/-
0.10 |

Result: statistically neutral in this hyperfine run.

Targeted `Builtin3` dynamic call benchmark:

```jsonnet
local identity(x) = x;
local substr = identity(std.substr);
std.foldl(
  function(acc, i) acc + std.length(substr("abcdef", 1, 3)),
  std.range(1, 3000000),
  0
)
```

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `master builtin3_dynamic` | 742.5 +/- 156.1 | 594.1 | 1254.7 | 1.12
+/- 0.30 |
| `branch builtin3_dynamic` | 660.4 +/- 110.9 | 534.0 | 962.6 | 1.00 |

Result: branch was faster in this run, but variance is high.

End-to-end `realistic2`:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `master realistic2` | 544.9 +/- 95.3 | 414.1 | 706.5 | 1.27 +/- 0.27 |
| `branch realistic2` | 428.4 +/- 54.1 | 378.3 | 565.8 | 1.00 |

Result: branch was faster in this run; due JVM-startup and system noise,
treat this as a non-regression signal rather than a guaranteed 1.27x
speedup.
@He-Pin He-Pin force-pushed the perf/cached-render-double branch from 5e55449 to eb50f7f Compare April 30, 2026 04:34
@He-Pin He-Pin changed the title perf: cached renderDouble for small integers + Builtin1/3 apply overrides perf: cache renderDouble for small integers Apr 30, 2026
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented May 8, 2026

Closing stale low-priority draft. The branch is clean, but its benchmark base predates the current renderer/base64/TOML work and needs fresh attribution before being worth review.

@He-Pin He-Pin closed this May 8, 2026
@He-Pin He-Pin reopened this May 8, 2026
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented May 8, 2026

Reopened. This is not negative; keep as draft pending fresh benchmarking on current master. The previous benchmark base is stale after later renderer/base64/TOML work.

@He-Pin He-Pin marked this pull request as ready for review May 8, 2026 05:25
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented May 8, 2026

Marking ready under the updated review bar: a PR can be ready when it has a clear positive result for at least one benchmark suite, even if it is not the primary docs/jrsonnet-gap case. Keep normal/noise-only benchmark movement out of the description.

Motivation:
Reduce allocation overhead in common numeric rendering paths.

Modification:
1. RenderUtils.renderDouble reuses pre-cached string representations for exact integer doubles in the range 0-255.
2. Materializer.stringify delegates number stringification to RenderUtils.renderDouble, removing its duplicate integer fast path.

Result:
Numeric materialization uses the shared renderDouble fast path. The Builtin1.apply1 / Builtin3.apply3 specialization from the original PR is already present in current master via databricks#807, so it is no longer part of this PR diff.
@He-Pin He-Pin force-pushed the perf/cached-render-double branch from eb50f7f to b213529 Compare May 8, 2026 05:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants