Skip to content

perf: use hybrid sort for inline object order#855

Draft
He-Pin wants to merge 2 commits into
databricks:masterfrom
He-Pin:perf/hybrid-inline-sort-order
Draft

perf: use hybrid sort for inline object order#855
He-Pin wants to merge 2 commits into
databricks:masterfrom
He-Pin:perf/hybrid-inline-sort-order

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented May 13, 2026

Motivation:

computeSortedInlineOrder was originally optimized for evaluated inline objects with only a few fields. After strict JSON imports began constructing byte-parsed JSON objects as inline Val.Objs, kube-prometheus exposed a wider shape: imported JSON objects can have many visible fields, and the old insertion sort becomes quadratic.

Sampling a repeated kube-prometheus materialization showed Materializer.computeSortedInlineOrder as a real Scala Native hotspot. This PR keeps the small-object fast path but avoids quadratic sorting for larger inline objects.

Key Design Decision:

Use a hybrid in-place sort:

  • <= 16 visible fields keep insertion sort, preserving the current small-object path.
  • > 16 visible fields use in-place quicksort with insertion-sort cleanup for small partitions.
  • Sorting still uses Util.compareStringsByCodepoint, so Jsonnet key ordering semantics are unchanged.
  • The implementation sorts only a fresh Array[Int] index array; it does not mutate shared parsed object keys or members.

Modification:

  • Refactor Materializer.computeSortedInlineOrder to delegate sorting to sortInlineOrder.
  • Add insertion-sort and quicksort helpers over the index array.
  • Update local performance ledgers with accepted and rejected candidates from this optimization round.

Benchmark Results:

All numbers are local, single benchmark process, no concurrent benchmark agents.

Benchmark Baseline Candidate Result
Native kube forward, vs #854 binary 145.3 +/- 3.6 ms 140.0 +/- 3.2 ms 1.04x faster
Native kube reverse, vs #854 binary 151.6 +/- 10.2 ms 148.9 +/- 3.7 ms 1.02x faster
Native large_string_template guard 6.02 +/- 3.67 ms 4.52 +/- 1.27 ms no regression observed
JVM JMH large_string_template guard candidate 0.770 ms/op, repeat 0.723 ms/op guard only no relevant regression signal
jrsonnet kube reference 93.3 +/- 3.9 ms sjsonnet candidate 146.8 +/- 12.8 ms jrsonnet still faster

Additional profiling:

  • Repeated kube materialization sample before: computeSortedInlineOrder top-stack samples 164.
  • Repeated kube materialization sample after: computeSortedInlineOrder top-stack samples 63; sort-specific samples 75.

Analysis:

The change targets a structural mismatch introduced by wider inline objects from strict JSON imports. Insertion sort is optimal for 2-8 fields, but it is the wrong default for imported JSON objects with larger key sets. The hybrid threshold keeps existing small-object behavior and reduces wider-object ordering from O(n²) toward O(n log n) without boxing.

Correctness checks:

  • Output equality: kube-prometheus and large_string_template outputs are byte-identical to the perf: parse strict JSON imports from bytes #854 binary.
  • Focused tests: RendererTests and JsonImportFastPathTests passed.
  • Formatting: ./mill --no-server --ticker false --color false -j 1 __.checkFormat passed.
  • Full tests: ./mill --no-server --ticker false --color false -j 1 __.test passed (440/440).

References:

Result:

Draft stacked PR. This should be reviewed after, or together with, the strict JSON byte import PR because the main win appears on the wider inline objects enabled by that prior optimization.

He-Pin added 2 commits May 13, 2026 15:43
Motivation:
PR databricks#840 introduced a strict JSON fast path for .json imports but still
forces a full UTF-8 string decode for every cached file before handing
the text to ujson.StringParser. Real-world workloads (e.g. kube-prometheus)
import many .json files; decoding each one twice (once into String for
parsing, again as cache content) is pure overhead.

Key Design Decision:
ujson 4.4.3 ships ByteArrayParser, which parses UTF-8 JSON directly from
a byte array without an intermediate String. Cache small resolved files
as raw bytes (already what we read from disk) and lazily decode text
only when the importstr/parser-input path actually needs it. Preserve
parse-cache content identity by hashing the cached bytes with SHA-256
(length + hex digest) so external ParseCache implementations keep the
same collision resistance as the old full-string key.

Modification:
* Importer.scala: CachedResolver.parseJsonImport now calls
  ujson.ByteArrayParser.transform(content.readRawBytes(), visitor)
  instead of decoding the whole file to String first.
* CachedResolvedFile.scala (JVM/Native): small files are cached as
  Array[Byte]; getParserInput / readString materialize the String
  lazily; readRawBytes returns the cached bytes directly; contentHash
  is length + SHA-256 over the cached bytes; binary imports still use
  StaticBinaryResolvedFile.
* PreloaderTests.scala: tighten the strict-JSON fast-path coverage so
  it fails if the fast path ever falls back to readString().

Result:
* Output equality vs upstream sjsonnet and jrsonnet preserved on
  kube-prometheus and large_string_template.
* Native kube-prometheus hyperfine A/B (forward & reverse):
  clean 139.4 +/- 2.8 ms -> candidate 132.7 +/- 1.9 ms (forward)
  candidate 132.1 +/- 1.9 ms vs clean 140.3 +/- 2.6 ms (reverse)
* Full ./mill __.test green.

References:
Follow-up to databricks#840
Motivation:
Large inline objects produced by strict JSON imports can exceed the small-object shape that computeSortedInlineOrder was originally tuned for. Native sampling on kube-prometheus showed sorted inline-order computation as a materialization hotspot, and insertion sort becomes quadratic on those wider objects.

Modification:
Keep insertion sort for small inline objects, and use an in-place quicksort with insertion-sort cleanup for larger visible field sets. Record the accepted benchmark result and rejected parser/key-render micro-routes in the performance ledgers.

Result:
Kube-prometheus Native A/B improved on top of strict JSON byte imports, with forward mean 145.3ms -> 140.0ms and reverse mean 151.6ms -> 148.9ms. Formatting and the full test suite pass.

References:
Upstream-base: databricks/sjsonnet@cedc083
Prior optimization: 883fca5 perf: parse strict JSON imports from bytes
@He-Pin He-Pin marked this pull request as ready for review May 13, 2026 12:26
@He-Pin He-Pin marked this pull request as draft May 13, 2026 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant