perf: stack validated kube-prometheus optimizations #836
Closed
He-Pin wants to merge 19 commits into databricks:master
Motivation: std.join over std.repeat([string], n) currently walks every repeated element and repeatedly appends the same string/separator pattern. Modification: Teach repeated/constant arrays to expose a constant Eval and let std.join materialize repeated string results directly while preserving null skipping and lazy error behavior. Keep the existing array-join copy paths for non-string separators. Result: Adds directional golden coverage for repeated string, unicode string, null, array, and zero-length lazy-error join cases. References: PR: databricks#825
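The repeated-string case described above can be sketched as follows. This is a minimal illustration in Java, not the sjsonnet implementation: `RepeatedJoin.joinRepeated` is a hypothetical helper, and it assumes the element and separator are already known to be non-null strings, so null skipping and lazy errors are out of scope here.

```java
final class RepeatedJoin {
    // Hypothetical helper: joins `count` copies of `element` with `separator`
    // without walking a materialized array of repeated values.
    static String joinRepeated(String element, String separator, int count) {
        if (count <= 0) return "";
        // The exact result length is known up front, so the builder never regrows.
        long total = (long) element.length() * count
                + (long) separator.length() * (count - 1);
        if (total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("result too large");
        }
        StringBuilder sb = new StringBuilder((int) total);
        sb.append(element);
        for (int i = 1; i < count; i++) {
            sb.append(separator).append(element);
        }
        return sb.toString();
    }
}
```

Sizing the builder once is the main point of the sketch; the loop itself still appends `count` times.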
Motivation: bench.07 builds a deep chain of function(x) f(f(x)) over identity functions. Scala Native overflows the stack on this case with --max-stack 100000, and the JVM path creates tens of thousands of lazy values and function calls. Modification: Add an apply1 fast path for unary identity functions and recognize the exact non-tailstrict function(x) f(f(x)) shape. The wrapper preserves laziness, keeps explicit tailstrict eager semantics, and checks identity-composition chains iteratively instead of recursively. Result: bench.07 now passes on Scala Native, reduces the JVM debug counters from lazy_created=32786/function_calls=65550 to lazy_created=19/function_calls=16, and reports 0.036 ms/op in the single-case JMH run.
Motivation: std.manifestTomlEx allocates avoidable intermediate strings, regex matchers, sequence paths, and table partition/sort structures while rendering TOML. Modification: Render TOML strings and keys directly to the existing writer, replace bare-key regex matching with an ASCII scan, reuse renderer/path buffers during table traversal, and classify sections with cached sorted keys. Result: Keeps output-compatible TOML rendering and adds directional golden coverage for key escaping, non-section ordering, nested tables, and table arrays. References: PR: databricks#828
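Replacing the bare-key regex with an ASCII scan could look roughly like this. The sketch is in Java and `TomlKeys.isBareKey` is a hypothetical name; it relies only on the TOML rule that a bare key is non-empty and contains only ASCII letters, digits, underscores, and dashes.

```java
final class TomlKeys {
    // Hypothetical helper: true when `key` can be emitted as a TOML bare key,
    // i.e. it is non-empty and every char is one of A-Z, a-z, 0-9, '_' or '-'.
    static boolean isBareKey(String key) {
        if (key.isEmpty()) return false;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_' || c == '-';
            if (!ok) return false; // anything else must be quoted and escaped
        }
        return true;
    }
}
```

A plain scan like this allocates nothing, whereas a regex match creates a matcher object on every call.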
Motivation: Short format strings such as %08d or %20s have exactly one dynamic value and no static literal text, so the generic format path was allocating and appending through a StringBuilder only to return that single formatted value. Modification: Detect the single-spec/no-static-literal case in Format.format and return the computed formatted value directly after preserving all existing arity checks. Result: The repeat_format regression improves from 0.190 ms/op on upstream master to 0.133 ms/op locally, while large_string_template remains effectively neutral and the full Mill test matrix passes. References: Source idea: databricks#776
Motivation: std.substr on long ASCII strings repeatedly pays codepoint-offset scans even when parser-time analysis can prove the literal is printable ASCII and JSON-render safe. Modification: Mark long ASCII JSON-safe literals with the existing _asciiSafe flag using a single platform CharSWAR scan, propagate the flag through string concatenation, and let std.length/std.substr use direct UTF-16 length/substring only for proven-safe values. Add UnicodeHandlingTests coverage for long ASCII length/substr boundaries and concat propagation. Result: Focused JVM JMH improves go_suite/substr from 0.056 ms/op to 0.046-0.047 ms/op with split_resolve unchanged and realistic2 in the same noise range. Scala Native hyperfine is neutral against master on the same case. References: Extracted from ideas in databricks#776, especially commit a190a80 (ASCII fast paths and asciiSafe propagation), narrowed to avoid the broader join/parseInt changes.
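The difference between the code-point path and the proven-ASCII path can be sketched as below. This is a Java illustration under stated assumptions, not the sjsonnet code: the class and method names are hypothetical, the scan is a plain loop rather than the platform CharSWAR scan, "JSON-render safe" is approximated as printable ASCII without `"` or `\`, and `from` and `len` are assumed non-negative.

```java
final class AsciiSubstr {
    // One pass proving every char is printable ASCII and needs no JSON escaping.
    // (Assumed definition of "safe"; the real flag may use different criteria.)
    static boolean isAsciiJsonSafe(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E || c == '"' || c == '\\') return false;
        }
        return true;
    }

    // Substring counted in code points. `asciiSafe` may only be true when
    // isAsciiJsonSafe(s) holds for this string.
    static String substr(String s, int from, int len, boolean asciiSafe) {
        if (asciiSafe) {
            // Code point index equals UTF-16 index, so no offset scan is needed.
            int start = Math.min(from, s.length());
            int end = (int) Math.min((long) start + len, s.length());
            return s.substring(start, end);
        }
        // General path: translate code point offsets to UTF-16 offsets.
        int total = s.codePointCount(0, s.length());
        int startCp = Math.min(from, total);
        int endCp = (int) Math.min((long) startCp + len, total);
        int start = s.offsetByCodePoints(0, startCp);
        int end = s.offsetByCodePoints(start, endCp - startCp);
        return s.substring(start, end);
    }
}
```

For example, `AsciiSubstr.substr("hello world", 6, 5, true)` takes the direct path, while a string containing a surrogate pair must go through `offsetByCodePoints`.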
Motivation: The historical jit branch is useful source material, but a mechanical rebase onto current master immediately conflicts with newer optimizer and runtime work. Modification: Add a performance sync ledger, record the fresh jit-explore-2026 branch strategy, and link the exploration status report from CLAUDE.md. Result: Future optimization ports can be tracked as current-master, benchmark-gated atomic changes instead of replaying stale historical rewrites. References: Source branch: hepinssh/jit@9dc20016b0e2d1a061d1c0451ed555dcc46a0a33 Base: databricks/sjsonnet@0ae7b78
Motivation: The historical jit branch showed that StaticOptimizer scope construction was allocation-sensitive, but its mutable-scope rewrite is too stale to replay on current master. Modification: Reimplement the low-risk part in ScopedExprTransform by replacing zipWithIndex.map, tuple creation, and intermediate mapping arrays with while-loop HashMap.updated construction while keeping Scope immutable. Result: OptimizerBenchmark.main improved from 0.432 +/- 0.004 ms/op on master to 0.422 +/- 0.004 ms/op locally, with MainBenchmark.main neutral/slightly positive and full ./mill --no-server -j 1 __.test passing. References: Source ideas: hepinssh/sjsonnet@bfced4ecac89f7032d620a25a7c28c745f32dbb7, hepinssh/sjsonnet@f5959f27f5c91722a3f058a2a519c0b5be620164
Motivation: Historical jit experiments showed apply specialization allocation overhead in StaticOptimizer, but the full old shortcut is not reliably positive on current master. Modification: Keep only the stable part: replace tryStaticApply args.forall plus args.map with a single while loop that builds Array[Val] and exits on the first non-static argument. Result: OptimizerBenchmark.main improved from the prior branch result of 0.422 +/- 0.004 ms/op to 0.414 +/- 0.005 ms/op on the stable split run. MainBenchmark.main stayed neutral, full ./mill --no-server -j 1 __.test passed, and the broader rebindApply shortcut was rejected after noisy repeat JMH. References: Source idea: hepinssh/sjsonnet@6367df6d585f1f9649aae82e42790a0de9136eab
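The single-pass collection described above can be sketched as follows. This is a Java stand-in for the Scala loop; `StaticArgs`, `Expr`, `Val`, and `Dynamic` are hypothetical types introduced only for the illustration, and the real `tryStaticApply` signature is not shown in this PR text.

```java
final class StaticArgs {
    // Hypothetical expression hierarchy for the sketch.
    interface Expr {}

    // An argument that is already a fully evaluated value.
    static final class Val implements Expr {
        final Object value;
        Val(Object value) { this.value = value; }
    }

    // An argument that still needs evaluation.
    static final class Dynamic implements Expr {
        final String name;
        Dynamic(String name) { this.name = name; }
    }

    // Returns all arguments as values, or null as soon as one is not static.
    // One loop replaces a forall check followed by a separate map.
    static Val[] collectStatic(Expr[] args) {
        Val[] out = new Val[args.length];
        int i = 0;
        while (i < args.length) {
            if (!(args[i] instanceof Val)) return null; // early exit
            out[i] = (Val) args[i];
            i++;
        }
        return out;
    }
}
```

The design trade-off is that the output array is allocated even when the loop later bails out; in exchange, the all-static case touches each argument once.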
Motivation: Gap exploration should not duplicate ready performance PR work or rely on stale jrsonnet benchmark notes. Modification: Add a jrsonnet gap baseline report for the ready-PR stacked branch, update the sync ledger with the stacked baseline, and point CLAUDE.md at the new report. Result: Future optimization work can start from the same ready-PR stack and prioritize the largest verified jrsonnet gaps before writing code.
Motivation: The jrsonnet benchmark document is stale for foldl/string concatenation, so gap selection must use local source-built measurements. Modification: Record the latest source-built sjsonnet Scala Native and jrsonnet release hyperfine results for both foldl workloads, including the failed mimalloc build caveat. Result: Foldl is removed as the next optimization target because the stacked Scala Native binary is faster than source-built jrsonnet locally.
Motivation: After foldl was invalidated as a current gap, large string template needed the same latest-source hyperfine check. Modification: Record the source-built Scala Native vs jrsonnet large string template result and update the active gap priority. Result: Large string template remains a local gap at 1.86x, replacing the stale 6.90x documented gap as the next string-heavy target.
Motivation: Kube-prometheus spends visible time in std.manifestJsonEx while building Grafana dashboard ConfigMap strings. MaterializeJsonRenderer allocated a fresh array/object visitor for every rendered container even though those visitors carry no per-container state. Modification: Reuse one array visitor and one object visitor inside MaterializeJsonRenderer, keeping the per-container initialization in visitArray and visitObject. This follows the existing ByteRenderer reusable visitor shape and preserves newline, indentation, and key-value separator behavior. Result: Kube-prometheus Scala Native output still matches source-built jrsonnet. Native hyperfine improved from 235.971 +/- 12.925 ms to 224.975 +/- 11.550 ms (-4.66%); full ./mill --no-server -j 1 __.test passed. References: bench/reports/jrsonnet-gap-baseline-2026-05-10.md
Motivation: Kube-prometheus spends a large share of the remaining source-built gap parsing imported Kubernetes CRD .json files through the full Jsonnet parser, even when those files are strict JSON literals. Modification: Add a shared strict .json import fast path for CachedResolver and Preloader that parses directly into literal Val trees, while falling back to the normal Jsonnet parser for malformed JSON, duplicate keys, non-finite numbers, parser-depth overflow, and defensive numeric parse failures. Add regression coverage for the semantic fallback cases and update the performance tracking reports. Result: Kube-prometheus Scala Native improves from 224.975 +/- 11.550 ms to 139.242 +/- 1.204 ms for this step, with output equality against source-built jrsonnet. The remaining source-built jrsonnet gap is 1.58x, and full ./mill --no-server -j 1 __.test passes.
Motivation: After the strict .json import fast path, kube-prometheus still spends measurable time in the JSON visitor while parsing large CRD imports. Modification: Use HashMap.put's previous-value result for duplicate-key detection in the strict JSON object visitor, avoiding a second lookup while preserving duplicate-key fallback semantics. Avoid CharSequence.toString when ujson already provides a String. Update the gap and sync reports with A/B benchmark evidence and rejected follow-up attempts. Result: Same-run Scala Native kube-prometheus A/B against frozen clean e4fed2e improves from 141.526 +/- 1.896 ms to 139.088 +/- 1.305 ms, with output equality against source-built jrsonnet. Full ./mill --no-server -j 1 __.test passes. References: Follow-up to e4fed2e perf: fast-path strict json imports.
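Duplicate-key detection via the return value of `HashMap.put` can be sketched like this. `StrictObjectBuilder` is a hypothetical class for illustration, not the visitor in the PR, and it assumes stored values are never Java `null` (a JSON null would need its own sentinel object), since `put` also returns `null` when the previous mapping held a null value.

```java
import java.util.HashMap;
import java.util.Map;

final class StrictObjectBuilder {
    private final Map<String, Object> fields = new HashMap<>();
    private boolean duplicateSeen = false;

    // Adds a field. HashMap.put returns the previous value for the key, so a
    // non-null result means the key was already present: one lookup, not two.
    // Assumption: `value` is never null.
    void add(String key, Object value) {
        Object previous = fields.put(key, value);
        if (previous != null) duplicateSeen = true;
    }

    // A caller would fall back to the slower general parser when this is true.
    boolean hasDuplicate() { return duplicateSeen; }

    int size() { return fields.size(); }
}
```

Compared with `containsKey` followed by `put`, this hashes and probes each key once.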
Motivation: Kube-prometheus still spends significant time materializing large strict JSON imports after the parser fast path. Modification: Build strict JSON import objects with inline field arrays so renderer/materializer direct iteration can skip generic visible-key and value lookup work. Disable field caching for these parse-cache-shared literals and add JVM concurrency coverage for shared JSON imports. Result: Same-run Scala Native kube-prometheus A/B improved from 139.937 +/- 2.294 ms to 136.301 +/- 1.957 ms with output equality. Full ./mill --no-server -j 1 __.test passed. References: bench/reports/jrsonnet-gap-baseline-2026-05-10.md bench/reports/sync-points.md
Strict .json imports create many Position objects for each JSON scalar/container. Since strict JSON literals are static and never execute code, all positions can safely share a single noOffsetPos sentinel. This avoids allocating Position for every imported JSON scalar/array/object, reducing GC pressure and materialization time on large JSON CRD imports (kube-prometheus case).
Kube-prometheus Native: 136.7 ± 1.5 ms → 135.9 ± 1.2 ms (repeat 137.5 → 136.1). Output equality with source-built jrsonnet verified.
Validation:
- Formatting: ./mill __.reformat ✓
- Focused JSON tests: JsonImportFastPathTests + JsonImportFastPathJvmTests 8/8 ✓
- Full JVM tests: 503/503 ✓
- Full cross-platform: ./mill __.test 2066/2066 ✓
- Native link: ✓
- JMH guards: manifestJsonEx 0.053, realistic2 39.485, large_string_template 1.102, gen_big_object 0.801 ms/op ✓
He-Pin (contributor, author) commented:
I am using this PR as a reference and will extract PRs from it.
Motivation:
kube-prometheus and large_string_template remain the main source-built jrsonnet comparison targets for this stacked exploration branch. The accepted stack keeps only locally positive, output-equal changes and records rejected experiments so the same failed routes are not retried without a new hypothesis.

Key Design Decision:
Keep this PR as a benchmark-gated stacked exploration branch rather than a single small merge candidate. Each retained code change has local correctness smoke tests and benchmark evidence; higher-risk or noisy ideas are reverted and documented in bench/reports/sync-points.md.

Modification:
- Rebase He-Pin:perf/stacked-ready-gap-explore onto upstream master@1679892980e63e00945053eba02affca9dceae5f.
- Keep std.manifestJson* visitor reuse, strict .json import fast path, strict JSON visitor trimming, strict JSON inline object layout, and strict JSON noOffsetPos reuse.
- Keep ASCII-safe substr, ScopedExprTransform scope-map trimming, and single-pass static apply argument collection.
- Use String.indexOf('\n') for LF-only bulk ||| text-block line scanning while leaving CRLF/multi-character separators on the prior path.
- Update bench/reports/jrsonnet-gap-baseline-2026-05-10.md and bench/reports/sync-points.md with accepted and rejected benchmark evidence.

Benchmark Results:
kube-prometheus, Scala Native hyperfine:
- Baseline: 235.971 +/- 12.925 ms
- std.manifestJson* visitor reuse: 224.975 +/- 11.550 ms (-4.66%)
- Strict .json import fast path: 139.242 +/- 1.204 ms (-38.11% vs prior step)
- Strict JSON visitor trimming: 139.088 +/- 1.305 ms (-1.72% same-run A/B)
- Strict JSON inline object layout: 136.301 +/- 1.957 ms (-2.60% same-run A/B)
- noOffsetPos reuse: 135.9 +/- 1.2 ms (136.7 -> 135.9 ms on the final step)

Source-built jrsonnet comparisons and Native A/B:
- 12.4 +/- 1.8 ms against 3.3 +/- 0.6 ms (3.77x jrsonnet time)
- 10.552 +/- 0.656 ms against 5.611 +/- 0.826 ms (1.88x jrsonnet time)
- 10.373 ms, 11.191 ms (-7.3% reverse-order Native A/B)

JMH guards:
- large_string_template: 0.683 ms/op
- realistic2: 41.618 ms/op
- gen_big_object: 0.822 ms/op
- manifestJsonEx: 0.052 ms/op
- OptimizerBenchmark: 0.414 ms/op after static-apply optimization
- go_suite/substr: 0.045-0.047 ms/op after ASCII-safe substr work

Analysis:
The strict JSON import stack remains the largest confirmed win, reducing the kube-prometheus run by roughly 42% from the original stacked baseline (235.971 ms to 135.9 ms). After the latest rebase, large_string_template became the clearest string-heavy gap; the LF text-block parser optimization is narrow, semantics-preserving, and measurable, but the remaining source-built jrsonnet gap is still about 1.88x, so further parser/evaluator work is still needed.
Rejected routes now documented include Native base64 copy trimming, long-string escape-position collection, exact repeated-label format sizing, a specialized %(same_label)s scanner/render path, and lazy stdlib construction.

References:
- bench/reports/jrsonnet-gap-baseline-2026-05-10.md
- bench/reports/sync-points.md
- bench/resources/cpp_suite/large_string_template.jsonnet
- jrsonnet/docs/benchmarks.adoc

Result:
Branch head is 8f8ed59a2fe7ed8f2439c26049a60c9bef08b2c5. Local output equality, focused JMH guards, three-model review, and full ./mill --no-server -j 1 __.test passed for the latest accepted parser change.
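
The LF-only line scanning mentioned in the Modification list can be illustrated with a small Java sketch. `LfLines.splitLf` is a hypothetical helper, not the parser code in this branch; it assumes the input uses `\n` as its only line separator, which is the case the fast path is restricted to.

```java
import java.util.ArrayList;
import java.util.List;

final class LfLines {
    // Splits on '\n' only; each returned line excludes its terminator.
    // String.indexOf(char, fromIndex) jumps straight to the next newline
    // instead of inspecting characters one by one in user code.
    static List<String> splitLf(String text) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        int nl;
        while ((nl = text.indexOf('\n', start)) >= 0) {
            lines.add(text.substring(start, nl));
            start = nl + 1;
        }
        // Trailing text without a final newline still counts as a line.
        if (start < text.length()) lines.add(text.substring(start));
        return lines;
    }
}
```

Input containing `\r\n` would keep the `\r` at the end of each line here, which is why such input would stay on the general path.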