perf: stack validated kube-prometheus optimizations by He-Pin · Pull Request #836 · databricks/sjsonnet

He-Pin · 2026-05-11T03:19:21Z

Motivation:
kube-prometheus and large_string_template remain the main source-built jrsonnet comparison targets for this stacked exploration branch. The accepted stack keeps only locally positive, output-equal changes and records rejected experiments so the same failed routes are not retried without a new hypothesis.

Key Design Decision:
Keep this PR as a benchmark-gated stacked exploration branch rather than a single small merge candidate. Each retained code change has local correctness smoke tests and benchmark evidence; higher-risk or noisy ideas are reverted and documented in bench/reports/sync-points.md.

Modification:

Rebased He-Pin:perf/stacked-ready-gap-explore onto upstream master@1679892980e63e00945053eba02affca9dceae5f.
Keep the committed kube-prometheus hot-path stack: std.manifestJson* visitor reuse, strict .json import fast path, strict JSON visitor trimming, strict JSON inline object layout, and strict JSON noOffsetPos reuse.
Keep branch-local supporting work: ASCII-safe substr, ScopedExprTransform scope-map trimming, and single-pass static apply argument collection.
Add the latest accepted large-template parser improvement: use String.indexOf('\n') for LF-only bulk ||| text-block line scanning while leaving CRLF/multi-character separators on the prior path.
Update bench/reports/jrsonnet-gap-baseline-2026-05-10.md and bench/reports/sync-points.md with accepted and rejected benchmark evidence.
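The LF-only line scan mentioned in the parser item above can be illustrated with a small sketch. This is not the sjsonnet parser code: `LfLineScan` and `lineSpans` are hypothetical names, and the sketch assumes the caller has already established that the text uses bare `\n` separators (CRLF input would take a different path).

```scala
object LfLineScan {
  /** Returns the (start, end) offsets of every line in `text`.
    * `end` is exclusive and never includes the '\n' itself.
    * Assumption: `text` uses LF-only line separators.
    */
  def lineSpans(text: String): Vector[(Int, Int)] = {
    val spans = Vector.newBuilder[(Int, Int)]
    var start = 0
    // String.indexOf(ch, fromIndex) jumps straight to the next newline
    // instead of inspecting characters one by one in user code.
    var nl = text.indexOf('\n', start)
    while (nl >= 0) {
      spans += ((start, nl))
      start = nl + 1
      nl = text.indexOf('\n', start)
    }
    // Trailing line without a terminating newline, if any.
    if (start < text.length) spans += ((start, text.length))
    spans.result()
  }
}
```

Each span can then be sliced or copied in bulk, which is where a bulk scan could save work compared with a per-character loop.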

Benchmark Results:

| kube-prometheus step | Scala Native result | Delta |
| --- | --- | --- |
| Stacked baseline | `235.971 +/- 12.925 ms` | baseline |
| `std.manifestJson*` visitor reuse | `224.975 +/- 11.550 ms` | `-4.66%` |
| Strict `.json` import fast path | `139.242 +/- 1.204 ms` | `-38.11%` vs prior step |
| Strict JSON visitor trim | `139.088 +/- 1.305 ms` | `-1.72%` same-run A/B |
| Strict JSON inline object layout | `136.301 +/- 1.957 ms` | `-2.60%` same-run A/B |
| Strict JSON `noOffsetPos` reuse | `135.9 +/- 1.2 ms` | `136.7 -> 135.9 ms` on the final step |

| Large string template check | sjsonnet Scala Native | Source-built jrsonnet | Status |
| --- | --- | --- | --- |
| Post-rebase baseline | `12.4 +/- 1.8 ms` | `3.3 +/- 0.6 ms` | `3.77x` jrsonnet time |
| Accepted LF text-block scan, same-run | `10.552 +/- 0.656 ms` | `5.611 +/- 0.826 ms` | `1.88x` jrsonnet time |
| Candidate vs frozen clean sjsonnet | `10.373 ms` | clean `11.191 ms` | `-7.3%` reverse-order Native A/B |

| Focused JMH / guard benchmark | Current result |
| --- | --- |
| `large_string_template` | `0.683 ms/op` |
| `realistic2` | `41.618 ms/op` |
| `gen_big_object` | `0.822 ms/op` |
| `manifestJsonEx` | `0.052 ms/op` |
| `OptimizerBenchmark` | `0.414 ms/op` after static-apply optimization |
| `go_suite/substr` | `0.045-0.047 ms/op` after ASCII-safe substr work |

Analysis:
The strict JSON import stack remains the largest confirmed win, reducing the kube-prometheus run by roughly 42% from the original stacked baseline. After the latest rebase, large_string_template became the clearest string-heavy gap; the LF text-block parser optimization is narrow, semantics-preserving, and measurable, but the remaining source-built jrsonnet gap is still about 1.88x, so further parser/evaluator work is still needed.
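For readers who want to re-derive the percentages and ratios quoted here, a small sketch of the arithmetic follows. The helper names (`BenchDelta`, `pctDelta`, `slowdown`) are made up for illustration; the inputs are the mean times from the benchmark results in this description, and the error bars are deliberately ignored.

```scala
object BenchDelta {
  /** Relative change from `before` to `after` in percent; negative means faster. */
  def pctDelta(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  /** How many times the `theirs` runtime fits into the `ours` runtime. */
  def slowdown(ours: Double, theirs: Double): Double = ours / theirs

  // Mean times in milliseconds, copied from the benchmark results.
  val kubeBaseline = 235.971     // stacked baseline
  val kubeFinal = 135.9          // after noOffsetPos reuse
  val templateSjsonnet = 10.552  // accepted LF text-block scan, same-run
  val templateJrsonnet = 5.611   // source-built jrsonnet, same-run
}
```

With these inputs, `pctDelta(kubeBaseline, kubeFinal)` is about -42.4 and `slowdown(templateSjsonnet, templateJrsonnet)` is about 1.88, which is where the "roughly 42%" and "about 1.88x" figures come from.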

Rejected routes now documented include Native base64 copy trimming, long-string escape-position collection, exact repeated-label format sizing, a specialized %(same_label)s scanner/render path, and lazy stdlib construction.

References:

bench/reports/jrsonnet-gap-baseline-2026-05-10.md
bench/reports/sync-points.md
bench/resources/cpp_suite/large_string_template.jsonnet
jrsonnet/docs/benchmarks.adoc

Result:
Branch head is 8f8ed59a2fe7ed8f2439c26049a60c9bef08b2c5. Local output equality, focused JMH guards, three-model review, and full ./mill --no-server -j 1 __.test passed for the latest accepted parser change.

Motivation: std.join over std.repeat([string], n) currently walks every repeated element and repeatedly appends the same string/separator pattern.
Modification: Teach repeated/constant arrays to expose a constant Eval and let std.join materialize repeated string results directly while preserving null skipping and lazy error behavior. Keep the existing array-join copy paths for non-string separators.
Result: Adds directional golden coverage for repeated string, unicode string, null, array, and zero-length lazy-error join cases.
References: PR: databricks#825
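A rough sketch of what "materialize repeated string results directly" could look like for the plain string case is shown below. `RepeatedJoin.joinRepeated` is a hypothetical helper, not the sjsonnet implementation; it covers neither null skipping nor lazy error behavior, and only shows the pre-sized single-buffer idea.

```scala
object RepeatedJoin {
  /** Joins `n` copies of `elem` with `sep` into one pre-sized buffer.
    * Hypothetical helper: the real change also has to handle nulls,
    * lazy elements and non-string separators.
    */
  def joinRepeated(sep: String, elem: String, n: Int): String = {
    if (n <= 0) return ""
    // The final length is known up front, so the buffer never has to grow.
    val size = elem.length.toLong * n + sep.length.toLong * (n - 1)
    require(size <= Int.MaxValue, "joined string would exceed the maximum String length")
    val sb = new java.lang.StringBuilder(size.toInt)
    sb.append(elem)
    var i = 1
    while (i < n) {
      sb.append(sep).append(elem)
      i += 1
    }
    sb.toString
  }
}
```

The result should match what a generic join over the expanded array produces, for example `Seq.fill(3)("ab").mkString(",")`.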

Motivation: bench.07 builds a deep chain of function(x) f(f(x)) over identity functions. Scala Native overflows the stack on this case with --max-stack 100000, and the JVM path creates tens of thousands of lazy values and function calls.
Modification: Add an apply1 fast path for unary identity functions and recognize the exact non-tailstrict function(x) f(f(x)) shape. The wrapper preserves laziness, keeps explicit tailstrict eager semantics, and checks identity-composition chains iteratively instead of recursively.
Result: bench.07 now passes on Scala Native, reduces the JVM debug counters from lazy_created=32786/function_calls=65550 to lazy_created=19/function_calls=16, and reports 0.036 ms/op in the single-case JMH run.

Motivation: std.manifestTomlEx allocates avoidable intermediate strings, regex matchers, sequence paths, and table partition/sort structures while rendering TOML.
Modification: Render TOML strings and keys directly to the existing writer, replace bare-key regex matching with an ASCII scan, reuse renderer/path buffers during table traversal, and classify sections with cached sorted keys.
Result: Keeps output-compatible TOML rendering and adds directional golden coverage for key escaping, non-section ordering, nested tables, and table arrays.
References: PR: databricks#828
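The bare-key check is a good example of a regex that can be replaced by a plain scan. The sketch below assumes the TOML rule that a bare key is a non-empty run of ASCII letters, digits, underscores, and dashes; `BareKey.isBareKey` is a hypothetical name, and the exact regex previously used in sjsonnet is not shown in this PR.

```scala
object BareKey {
  /** True when `key` can be written without quotes in TOML:
    * non-empty and made only of A-Z, a-z, 0-9, '_' and '-'.
    * No regex Matcher is allocated; the loop exits on the first bad char.
    */
  def isBareKey(key: String): Boolean = {
    if (key.isEmpty) return false
    var i = 0
    while (i < key.length) {
      val c = key.charAt(i)
      val ok =
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
          (c >= '0' && c <= '9') || c == '_' || c == '-'
      if (!ok) return false
      i += 1
    }
    true
  }
}
```

Keys that fail the check, such as `a.b` or the empty string, would still need to go through the quoted-key escaping path.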

Motivation: Short format strings such as %08d or %20s have exactly one dynamic value and no static literal text, so the generic format path was allocating and appending through a StringBuilder only to return that single formatted value.
Modification: Detect the single-spec/no-static-literal case in Format.format and return the computed formatted value directly after preserving all existing arity checks.
Result: The repeat_format regression improves from 0.190 ms/op on upstream master to 0.133 ms/op locally, while large_string_template remains effectively neutral and the full Mill test matrix passes.
References: Source idea: databricks#776

Motivation: std.substr on long ASCII strings repeatedly pays codepoint-offset scans even when parser-time analysis can prove the literal is printable ASCII and JSON-render safe.
Modification: Mark long ASCII JSON-safe literals with the existing _asciiSafe flag using a single platform CharSWAR scan, propagate the flag through string concatenation, and let std.length/std.substr use direct UTF-16 length/substring only for proven-safe values. Add UnicodeHandlingTests coverage for long ASCII length/substr boundaries and concat propagation.
Result: Focused JVM JMH improves go_suite/substr from 0.056 ms/op to 0.046-0.047 ms/op with split_resolve unchanged and realistic2 in the same noise range. Scala Native hyperfine is neutral against master on the same case.
References: Extracted from ideas in databricks#776, especially commit a190a80 (ASCII fast paths and asciiSafe propagation), narrowed to avoid the broader join/parseInt changes.
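The reason an ASCII proof helps is that, for ASCII-only strings, code point offsets and UTF-16 offsets coincide, so no offset scan is needed. The sketch below illustrates that idea only: `AsciiSubstr`, `isAscii`, and the `asciiSafe` parameter are hypothetical stand-ins, the real `_asciiSafe` flag is stricter (printable and JSON-render safe), and the clamping behavior here is an assumption rather than a statement about std.substr.

```scala
object AsciiSubstr {
  /** True when every UTF-16 unit is below 0x80, so each char is one code point. */
  def isAscii(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) >= 0x80) return false
      i += 1
    }
    true
  }

  /** Substring by code point index, clamped to the end of the string. */
  def substr(s: String, from: Int, len: Int, asciiSafe: Boolean): String = {
    require(from >= 0 && len >= 0, "from and len must be non-negative")
    if (asciiSafe) {
      // Fast path: code point index == UTF-16 index, so slice directly.
      val start = math.min(from, s.length)
      val end = math.min(start.toLong + len, s.length.toLong).toInt
      s.substring(start, end)
    } else {
      // General path: translate code point offsets into UTF-16 offsets.
      val total = s.codePointCount(0, s.length)
      val f = math.min(from, total)
      val l = math.min(len, total - f)
      val start = s.offsetByCodePoints(0, f)
      val end = s.offsetByCodePoints(start, l)
      s.substring(start, end)
    }
  }
}
```

Both paths return the same result for ASCII input; only the general path is correct once surrogate pairs are involved, which is why the flag must be proven rather than guessed.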

Motivation: The historical jit branch is useful source material, but a mechanical rebase onto current master immediately conflicts with newer optimizer and runtime work.
Modification: Add a performance sync ledger, record the fresh jit-explore-2026 branch strategy, and link the exploration status report from CLAUDE.md.
Result: Future optimization ports can be tracked as current-master, benchmark-gated atomic changes instead of replaying stale historical rewrites.
References:
- Source branch: hepinssh/jit@9dc20016b0e2d1a061d1c0451ed555dcc46a0a33
- Base: databricks/sjsonnet@0ae7b78

Motivation: The historical jit branch showed that StaticOptimizer scope construction was allocation-sensitive, but its mutable-scope rewrite is too stale to replay on current master.
Modification: Reimplement the low-risk part in ScopedExprTransform by replacing zipWithIndex.map, tuple creation, and intermediate mapping arrays with while-loop HashMap.updated construction while keeping Scope immutable.
Result: OptimizerBenchmark.main improved from 0.432 +/- 0.004 ms/op on master to 0.422 +/- 0.004 ms/op locally, with MainBenchmark.main neutral/slightly positive and full ./mill --no-server -j 1 __.test passing.
References: Source ideas: hepinssh/sjsonnet@bfced4ecac89f7032d620a25a7c28c745f32dbb7, hepinssh/sjsonnet@f5959f27f5c91722a3f058a2a519c0b5be620164
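To make the allocation difference concrete, here is a simplified sketch of the two construction styles. The `ScopeBuild` object and the `String -> Int` binding shape are assumptions for illustration; the actual Scope and ScopedExprTransform types in sjsonnet carry more information than a plain index.

```scala
import scala.collection.immutable.HashMap

object ScopeBuild {
  /** Collection-pipeline style: allocates an array of tuples plus a temporary map. */
  def viaZip(base: HashMap[String, Int], names: Array[String], offset: Int): Map[String, Int] =
    base ++ names.zipWithIndex.map { case (n, i) => (n, offset + i) }.toMap

  /** While-loop style: updates the immutable map one binding at a time,
    * with no tuple array or intermediate collection in between.
    */
  def viaLoop(base: HashMap[String, Int], names: Array[String], offset: Int): HashMap[String, Int] = {
    var m = base
    var i = 0
    while (i < names.length) {
      m = m.updated(names(i), offset + i)
      i += 1
    }
    m
  }
}
```

Both functions return equal maps and leave `base` untouched, so the scope stays immutable from the caller's point of view; only the temporary garbage differs.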

Motivation: Historical jit experiments showed apply specialization allocation overhead in StaticOptimizer, but the full old shortcut is not reliably positive on current master.
Modification: Keep only the stable part: replace tryStaticApply args.forall plus args.map with a single while loop that builds Array[Val] and exits on the first non-static argument.
Result: OptimizerBenchmark.main improved from the prior branch result of 0.422 +/- 0.004 ms/op to 0.414 +/- 0.005 ms/op on the stable split run. MainBenchmark.main stayed neutral, full ./mill --no-server -j 1 __.test passed, and the broader rebindApply shortcut was rejected after noisy repeat JMH.
References: Source idea: hepinssh/sjsonnet@6367df6d585f1f9649aae82e42790a0de9136eab
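The forall-plus-map versus single-loop trade can be sketched with toy types. `Expr`, `Val`, and `Dyn` below are simplified stand-ins defined only for this example, not sjsonnet's real classes, and returning `null` on failure is an assumption about the hot-path style rather than a description of tryStaticApply.

```scala
object StaticArgs {
  sealed trait Expr
  final case class Val(n: Int) extends Expr        // stand-in for an already-static value
  final case class Dyn(name: String) extends Expr  // stand-in for anything non-static

  /** Two passes: one to check every argument, one to copy them. */
  def twoPass(args: Array[Expr]): Array[Val] =
    if (args.forall(_.isInstanceOf[Val])) args.map(_.asInstanceOf[Val]) else null

  /** One pass: copy while checking, and bail out on the first non-static argument. */
  def onePass(args: Array[Expr]): Array[Val] = {
    val out = new Array[Val](args.length)
    var i = 0
    while (i < args.length) {
      args(i) match {
        case v: Val => out(i) = v
        case _      => return null // not all arguments are static
      }
      i += 1
    }
    out
  }
}
```

The single loop touches each argument at most once and allocates no closures, at the cost of allocating the output array even when a later argument turns out to be non-static.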

Motivation: Gap exploration should not duplicate ready performance PR work or rely on stale jrsonnet benchmark notes.
Modification: Add a jrsonnet gap baseline report for the ready-PR stacked branch, update the sync ledger with the stacked baseline, and point CLAUDE.md at the new report.
Result: Future optimization work can start from the same ready-PR stack and prioritize the largest verified jrsonnet gaps before writing code.

Motivation: The jrsonnet benchmark document is stale for foldl/string concatenation, so gap selection must use local source-built measurements.
Modification: Record the latest source-built sjsonnet Scala Native and jrsonnet release hyperfine results for both foldl workloads, including the failed mimalloc build caveat.
Result: Foldl is removed as the next optimization target because the stacked Scala Native binary is faster than source-built jrsonnet locally.

Motivation: After foldl was invalidated as a current gap, large string template needed the same latest-source hyperfine check.
Modification: Record the source-built Scala Native vs jrsonnet large string template result and update the active gap priority.
Result: Large string template remains a local gap at 1.86x, replacing the stale 6.90x documented gap as the next string-heavy target.

Motivation: Kube-prometheus spends visible time in std.manifestJsonEx while building Grafana dashboard ConfigMap strings. MaterializeJsonRenderer allocated a fresh array/object visitor for every rendered container even though those visitors carry no per-container state.
Modification: Reuse one array visitor and one object visitor inside MaterializeJsonRenderer, keeping the per-container initialization in visitArray and visitObject. This follows the existing ByteRenderer reusable visitor shape and preserves newline, indentation, and key-value separator behavior.
Result: Kube-prometheus Scala Native output still matches source-built jrsonnet. Native hyperfine improved from 235.971 +/- 12.925 ms to 224.975 +/- 11.550 ms (-4.66%); full ./mill --no-server -j 1 __.test passed.
References: bench/reports/jrsonnet-gap-baseline-2026-05-10.md

Motivation: Kube-prometheus spends a large share of the remaining source-built gap parsing imported Kubernetes CRD .json files through the full Jsonnet parser, even when those files are strict JSON literals.
Modification: Add a shared strict .json import fast path for CachedResolver and Preloader that parses directly into literal Val trees, while falling back to the normal Jsonnet parser for malformed JSON, duplicate keys, non-finite numbers, parser-depth overflow, and defensive numeric parse failures. Add regression coverage for the semantic fallback cases and update the performance tracking reports.
Result: Kube-prometheus Scala Native improves from 224.975 +/- 11.550 ms to 139.242 +/- 1.204 ms for this step, with output equality against source-built jrsonnet. The remaining source-built jrsonnet gap is 1.58x, and full ./mill --no-server -j 1 __.test passes.

Motivation: After the strict .json import fast path, kube-prometheus still spends measurable time in the JSON visitor while parsing large CRD imports.
Modification: Use HashMap.put's previous-value result for duplicate-key detection in the strict JSON object visitor, avoiding a second lookup while preserving duplicate-key fallback semantics. Avoid CharSequence.toString when ujson already provides a String. Update the gap and sync reports with A/B benchmark evidence and rejected follow-up attempts.
Result: Same-run Scala Native kube-prometheus A/B against frozen clean e4fed2e improves from 141.526 +/- 1.896 ms to 139.088 +/- 1.305 ms, with output equality against source-built jrsonnet. Full ./mill --no-server -j 1 __.test passes.
References: Follow-up to e4fed2e perf: fast-path strict json imports.
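The duplicate-key trick relies on `put` reporting whether a key was already present. The sketch below assumes a `java.util.HashMap`, whose `put` returns the previous value or `null`; the commit message does not say which HashMap class the visitor uses, and `DupKeys.hasDuplicate` is a hypothetical helper rather than the visitor code itself.

```scala
import java.util.{HashMap => JHashMap}

object DupKeys {
  /** True if any key occurs more than once.
    * Uses one hash operation per key: `put` both inserts the key and reports
    * the previous value, so no separate containsKey/get lookup is needed.
    * Values stored here are never null, so a non-null result means "seen before".
    */
  def hasDuplicate(keys: Array[String]): Boolean = {
    val seen = new JHashMap[String, Integer]()
    var i = 0
    while (i < keys.length) {
      if (seen.put(keys(i), Integer.valueOf(i)) != null) return true
      i += 1
    }
    false
  }
}
```

In a parser, a `true` result would be the signal to abandon the fast path and fall back to the general parser. Note that the previous-value check only works when stored values are never `null`.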

Motivation: Kube-prometheus still spends significant time materializing large strict JSON imports after the parser fast path.
Modification: Build strict JSON import objects with inline field arrays so renderer/materializer direct iteration can skip generic visible-key and value lookup work. Disable field caching for these parse-cache-shared literals and add JVM concurrency coverage for shared JSON imports.
Result: Same-run Scala Native kube-prometheus A/B improved from 139.937 +/- 2.294 ms to 136.301 +/- 1.957 ms with output equality. Full ./mill --no-server -j 1 __.test passed.
References:
- bench/reports/jrsonnet-gap-baseline-2026-05-10.md
- bench/reports/sync-points.md

Strict .json imports create many Position objects for each JSON scalar/container. Since strict JSON literals are static and never execute code, all positions can safely share a single noOffsetPos sentinel. This avoids allocating Position for every imported JSON scalar/array/object, reducing GC pressure and materialization time on large JSON CRD imports (kube-prometheus case).
Kube-prometheus Native: 136.7 ± 1.5 ms → 135.9 ± 1.2 ms (repeat 137.5 → 136.1). Output equality with source-built jrsonnet verified.
Validation:
- Formatting: ./mill __.reformat ✓
- Focused JSON tests: JsonImportFastPathTests + JsonImportFastPathJvmTests 8/8 ✓
- Full JVM tests: 503/503 ✓
- Full cross-platform: ./mill __.test 2066/2066 ✓
- Native link: ✓
- JMH guards: manifestJsonEx 0.053, realistic2 39.485, large_string_template 1.102, gen_big_object 0.801 ms/op ✓

He-Pin · 2026-05-11T03:26:39Z

I am using this one as a reference and will extract PRs from it.

He-Pin added 19 commits May 10, 2026 21:59

docs: add realworld source benchmark gaps

89e7958

docs: add json position reuse sync point

798b03f

docs: log json position reuse benchmark results

3090f6f

He-Pin changed the title ~~perf: stack validated jrsonnet gap reductions~~ perf: stack validated kube-prometheus optimizations May 11, 2026

He-Pin closed this May 11, 2026

He-Pin wants to merge 19 commits into databricks:master from He-Pin:perf/stacked-ready-gap-explore
