perf: replace Val.Str._asciiSafe flag with AsciiSafeStr subclass #862

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/asciisafe-subclass

Contributor

@He-Pin He-Pin commented May 19, 2026

Motivation

Save memory. A Val.Str instance is allocated for every string literal,
every ConstMember key, every parser-produced fragment, and every concat
result. On a config-generation workload (gen_big_object, realistic1) the
benchmark allocates tens of thousands of Val.Str per run, so per-instance
overhead compounds.

Replace the Val.Str._asciiSafe: Boolean flag with an AsciiSafeStr
subclass. Encoding the invariant in the type instead of a field shrinks every
plain Val.Str, and lets the renderer/format hot paths dispatch on
case _: AsciiSafeStr instead of reading a field — the JIT and Scala-Native
CHA can devirtualize this into a direct branch.

This sits directly on master (commit b252b184) now that #861's C1-C4
incremental wins have landed. Single structural commit.

Modification

  • New AsciiSafeStr subclass marks strings statically known to be ASCII
    JSON-safe; plain Val.Str is the unknown-safety base case.
  • Val.Str.asciiSafe(pos, s) factory constructs AsciiSafeStr directly.
  • getAsciiSafe() short-circuits to true for AsciiSafeStr; for plain
    Val.Str it still does the lazy SWAR scan and caches.
  • _asciiSafe: Boolean field removed from Val.Str.
  • Hot-path dispatch sites (ByteRenderer, Format, StringModule join paths,
    Parser) now match on the type.

Existing string_asciisafe_propagation.jsonnet and full JVM/Native test
suites cover behavior — no semantic change.
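
The marker-subclass pattern listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class bodies, the `render` and `escape` helpers, and the definition of "ASCII JSON-safe" (printable ASCII with no quote or backslash) are all assumptions, and the plain-`Str` path rescans on every call instead of using a cached SWAR scan.

```scala
object AsciiSafeSketch {
  // Hypothetical stand-in for Val.Str: the base class carries no safety flag.
  class Str(val str: String) {
    // Unknown safety, so scan on demand. The PR's version uses a SWAR scan
    // and caches the result; this sketch rescans each time for brevity.
    def getAsciiSafe: Boolean = Str.isAsciiJsonSafe(str)
  }

  // Marker subclass: adds no fields, so its instances are the same size as Str.
  final class AsciiSafeStr(s: String) extends Str(s) {
    override def getAsciiSafe: Boolean = true
  }

  object Str {
    // Factory for strings the caller already knows are safe.
    def asciiSafe(s: String): Str = new AsciiSafeStr(s)

    // Assumed definition of "ASCII JSON-safe": printable ASCII, no '"' or '\'.
    def isAsciiJsonSafe(s: String): Boolean = {
      var i = 0
      var ok = true
      while (ok && i < s.length) {
        val c = s.charAt(i)
        ok = c >= 0x20 && c <= 0x7e && c != '"' && c != '\\'
        i += 1
      }
      ok
    }
  }

  // Dispatch on the type instead of reading a field.
  def render(v: Str): String = v match {
    case _: AsciiSafeStr => "\"" + v.str + "\"" // fast path: copy verbatim
    case _               => "\"" + escape(v.str) + "\""
  }

  // Minimal escaper for the slow path (illustrative, not a full JSON escaper).
  private def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case c if c < 0x20 || c > 0x7e => sb.append("\\u").append("%04x".format(c.toInt))
      case c => sb.append(c)
    }
    sb.toString
  }
}
```

Because `AsciiSafeStr` declares no fields of its own, the invariant costs nothing per instance; the trade-off is that safety can no longer be flipped on an existing object and must be decided at construction time.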

Result

Re-benched on 2026-05-21 against master @ b252b184. Apple Silicon, JDK 21,
Scala 3.3.7.

Memory: Val.Str object layout (compressed oops)

Field (bytes)        master         this PR
header                   12              12
pos (ref)                 4               4
_str (ref)                4               4
_children (ref)           4               4
_asciiSafe (boolean)      1               -
padding                   7               -
                       -----           -----
total                  32 B            24 B    (-8 B / -25%)

Every plain Val.Str instance shrinks from 32 → 24 bytes.
AsciiSafeStr adds no fields, so it's also 24 bytes per instance — class
metadata is one-time, not per-instance.
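
The totals in the table follow from rounding the raw field sum up to the object alignment. A small sketch of that arithmetic, assuming the 12-byte header, 4-byte compressed references, and 8-byte alignment implied by the table (the names `alignUp`, `masterSize`, and `prSize` are illustrative only):

```scala
object LayoutSketch {
  // Round n up to the next multiple of `alignment` (8 bytes, per the padding row).
  def alignUp(n: Int, alignment: Int = 8): Int =
    (n + alignment - 1) / alignment * alignment

  val header = 12 // object header size taken from the table
  val ref = 4     // compressed reference size taken from the table

  // master: pos, _str, _children refs plus the 1-byte boolean flag
  val masterSize: Int = alignUp(header + 3 * ref + 1) // 25 rounds up to 32

  // this PR: the same three refs, no flag
  val prSize: Int = alignUp(header + 3 * ref)         // 24 is already aligned
}
```

The single boolean byte pushes the object over an alignment boundary, which is why removing 1 byte of payload recovers 8 bytes per instance.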

Allocation rate (JMH -prof gc, full bench corpus)

bench                                       master       this PR        Δ
cpp_suite/gen_big_object.jsonnet         3.72 MB/op   3.49 MB/op   -234 KB  (-6.15%)
cpp_suite/realistic1.jsonnet             5.60 MB/op   5.54 MB/op    -63 KB  (-1.09%)
jdk17_suite/split_resolve.jsonnet       586.5 KB/op  562.7 KB/op    -24 KB  (-4.06%)
jdk17_suite/repeat_format.jsonnet       572.3 KB/op  557.5 KB/op    -15 KB  (-2.58%)
bug_suite/assertions.jsonnet            625.4 KB/op  616.6 KB/op   -8.7 KB  (-1.40%)
cpp_suite/realistic2.jsonnet            71.74 MB/op  71.69 MB/op    -50 KB  (-0.07%)
go_suite/manifestYamlDoc.jsonnet        134.0 KB/op  133.4 KB/op   -640 B   (-0.47%)
go_suite/manifestJsonEx.jsonnet         129.7 KB/op  129.5 KB/op   -200 B   (-0.15%)
sjsonnet_suite/lazy_array_compr.        51.66 MB/op  51.66 MB/op    -70 B   (-0.00%)
go_suite/comparison2.jsonnet            52.95 MB/op  52.95 MB/op   -126 B   (-0.00%)

The largest absolute saving is the 234 KB/op drop on gen_big_object — a
config-shape benchmark. At ~30K Val.Str instances/op × 8 bytes each, the
arithmetic checks out. Object-key-heavy programs are the win profile;
benches that reuse already-parsed strings (bench.02, comparison2) see
near-zero deltas because they don't allocate fresh Val.Str at runtime.
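
As a rough cross-check of the gen_big_object figure, the measured delta can be divided by the per-instance saving. This sketch assumes KB means 1024 bytes and that the whole delta comes from Val.Str instances; the object and value names are illustrative.

```scala
object AllocDeltaSketch {
  val savedPerInstance = 8L          // bytes saved per Val.Str, from the layout table
  val deltaBytesPerOp = 234L * 1024L // 234 KB/op, reading KB as 1024 bytes (assumption)

  // Number of Val.Str instances per op that would account for the whole delta.
  val impliedInstances: Long = deltaBytesPerOp / savedPerInstance // 29952
}
```

That is 29,952 instances per op, consistent with the ~30K estimate.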

No bench shows an allocation regression > +200 B/op (a few 24-byte
instances' worth of noise).

Wall-clock

Hyperfine, Scala-Native release binary, full bench corpus (warmup=2,
min-runs=5). Long benches (process-startup variance is small relative to
work):

bench                                    master ms      this PR ms     Δ
cpp_suite/realistic2.jsonnet             89.30 ± 1.91   88.87 ± 2.04   -0.48%
sjsonnet_suite/lazy_array_compr.         80.88 ± 1.51   81.56 ± 1.72   +0.83%
cpp_suite/bench.02.jsonnet               61.25 ± 1.47   60.45 ± 1.41   -1.30%
go_suite/comparison2.jsonnet             40.59 ± 3.02   40.76 ± 1.45   +0.42%
cpp_suite/bench.03.jsonnet               27.15 ± 1.23   27.46 ± 3.70   +1.13%
sjsonnet_suite/lazy_array_sparse_idx     18.43 ± 4.24   18.55 ± 1.17   +0.68%
sjsonnet_suite/lazy_array_reverse_sp.    16.08 ± 3.54   15.48 ± 1.57   -3.76%
go_suite/reverse.jsonnet                 15.02 ± 1.33   15.22 ± 1.20   +1.29%
go_suite/base64DecodeBytes.jsonnet       13.12 ± 1.13   11.80 ± 1.31  -10.06%
sjsonnet_suite/array_copy_views.jsonnet  12.17 ± 1.18   12.56 ± 0.95   +3.20%
cpp_suite/realistic1.jsonnet             11.34 ± 1.21   10.76 ± 2.00   -5.07%
cpp_suite/large_string_template.jsonnet  10.08 ± 1.41   10.87 ± 1.04   +7.88%

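Net wall-clock impact is a wash on Native: nearly every delta is smaller
than the ± reported for either run, and sub-10ms benches are dominated by
process-startup variance (±10–20% run-to-run). The realistic1 and
base64DecodeBytes wins line up with their alloc-rate drops (less GC
pressure → less work), though both are within that run-to-run variance.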
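
One way to sanity-check the "wash" reading is to compare each mean shift against the combined spread of the two runs. The sketch below is a crude check, not a proper significance test; it assumes the ± values are standard deviations, and `Row`, `pctDelta`, and `withinNoise` are names made up for illustration. The numbers are copied from the table.

```scala
object NoiseSketch {
  // One table row: mean and standard deviation (ms) for master and for this PR.
  final case class Row(name: String, mMean: Double, mSd: Double, pMean: Double, pSd: Double) {
    def pctDelta: Double = (pMean - mMean) / mMean * 100.0

    // Crude check: is the mean shift smaller than the combined spread?
    def withinNoise: Boolean =
      math.abs(pMean - mMean) < math.sqrt(mSd * mSd + pSd * pSd)
  }

  val rows = Seq(
    Row("go_suite/base64DecodeBytes.jsonnet", 13.12, 1.13, 11.80, 1.31),
    Row("cpp_suite/realistic1.jsonnet", 11.34, 1.21, 10.76, 2.00),
    Row("cpp_suite/large_string_template.jsonnet", 10.08, 1.41, 10.87, 1.04)
  )

  def main(args: Array[String]): Unit =
    rows.foreach { r =>
      println(f"${r.name}%-42s ${r.pctDelta}%+7.2f%%  withinNoise=${r.withinNoise}")
    }
}
```

Under these assumptions all three rows, including the largest win and the largest regression, fall inside the combined spread.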

JMH (JVM steady-state, single 1-iter run):

RegressionBenchmark.main              master ms/op    this PR ms/op    Δ
cpp_suite/bench.02.jsonnet            28.062          26.696          -4.87%
cpp_suite/bench.03.jsonnet             7.398           6.902          -6.70%
cpp_suite/realistic2.jsonnet          46.844          46.374          -1.00%
sjsonnet_suite/lazy_array_compr.      20.814          20.258          -2.67%
go_suite/comparison2.jsonnet          18.763          19.416          +3.48%
go_suite/reverse.jsonnet               5.429           5.349          -1.47%
cpp_suite/large_string_join.jsonnet    0.281           0.281           0.00%
cpp_suite/large_string_template.       0.697           0.703          +0.86%
jdk17_suite/repeat_format.jsonnet      0.132           0.138          +4.55%
sjsonnet_suite/setUnion.jsonnet        0.487           0.492          +1.03%

Summary

Memory is the headline:

  • Val.Str instance: 32 → 24 bytes (-25%)
  • Real workload alloc rate: up to -6.15% (gen_big_object),
    consistently negative on object-construction benches

Wall-clock is neutral-to-marginally-positive — exactly what you'd expect
for a layout shrink: the JIT was already inlining the field read, so
removing it wins on memory pressure rather than instruction count. The
type-encoded invariant also makes it cheaper for future fast paths to
discriminate on the marker without re-scanning.

Test plan

  • ./mill 'sjsonnet.jvm[3.3.7]'.test — green
  • ./mill 'sjsonnet.native[3.3.7]'.test — green
  • ./mill __.checkFormat — green

@He-Pin He-Pin marked this pull request as draft May 19, 2026 08:46
@He-Pin He-Pin marked this pull request as ready for review May 19, 2026 08:57
@He-Pin He-Pin marked this pull request as draft May 19, 2026 08:57
@He-Pin He-Pin force-pushed the perf/asciisafe-subclass branch from cae7a70 to 8516e7b Compare May 20, 2026 17:40
Motivation:
The boolean field added 1 byte to every Val.Str instance, which JVM
alignment expanded to 8 bytes per object. Val.Str instances number
in the millions on string-heavy workloads (e.g. joinedRepeatedString
results, format outputs, parsed string literals), so the wasted
padding adds up.

Modification:
Drop `_asciiSafe: Boolean` from Val.Str and introduce a sealed
`Val.AsciiSafeStr extends Val.Str` marker subclass. Factory
`Val.Str.asciiSafe(pos, s)` now constructs the subclass directly.
ByteRenderer and propagation sites switch from `vs._asciiSafe` to
`vs.isInstanceOf[Val.AsciiSafeStr]`. Str.concat preserves the
subclass when both operands are ASCII-safe (eager and rope paths).
Parser/Substr write sites that previously mutated the flag now call
the asciiSafe factory directly.

Result:
8 bytes saved per Val.Str instance with no behavioral change. JIT
still devirtualizes `.str` access via CHA (single non-final
implementation in the hierarchy). All JVM tests pass on Scala 3.3.7;
all platforms (JVM/JS/Native/WASM) compile cleanly across Scala
3.3.7/2.13.18/2.12.21.
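
The concat rule from the commit message can be sketched as below. This is a simplified illustration with hypothetical class bodies: it models only the eager path, not the rope path, and it assumes ASCII-safety is a per-character property so that joining two safe strings cannot produce an unsafe one.

```scala
object ConcatSketch {
  class Str(val str: String)
  final class AsciiSafeStr(s: String) extends Str(s)

  // Eager concat only; the rope path mentioned in the commit message is not modeled.
  def concat(a: Str, b: Str): Str =
    if (a.isInstanceOf[AsciiSafeStr] && b.isInstanceOf[AsciiSafeStr])
      new AsciiSafeStr(a.str + b.str) // both operands safe: keep the marker subclass
    else
      new Str(a.str + b.str)          // unknown safety: fall back to the base class
}
```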
@He-Pin He-Pin force-pushed the perf/asciisafe-subclass branch from 8516e7b to f07ef42 Compare May 20, 2026 18:18
@He-Pin He-Pin marked this pull request as ready for review May 20, 2026 18:29
Contributor Author

He-Pin commented May 20, 2026

@stephenamar-db This PR is ready now.
