perf: parse strict JSON imports from bytes#854
Open
He-Pin wants to merge 1 commit into
Open
Conversation
Motivation: PR databricks#840 introduced a strict JSON fast path for .json imports but still forces a full UTF-8 string decode for every cached file before handing the text to ujson.StringParser. Real-world workloads (e.g. kube-prometheus) import many .json files; decoding each one twice (once into String for parsing, again as cache content) is pure overhead. Key Design Decision: ujson 4.4.3 ships ByteArrayParser, which parses UTF-8 JSON directly from a byte array without an intermediate String. Cache small resolved files as raw bytes (already what we read from disk) and lazily decode text only when the importstr/parser-input path actually needs it. Preserve parse-cache content identity by hashing the cached bytes with SHA-256 (length + hex digest) so external ParseCache implementations keep the same collision resistance as the old full-string key. Modification: * Importer.scala: CachedResolver.parseJsonImport now calls ujson.ByteArrayParser.transform(content.readRawBytes(), visitor) instead of decoding the whole file to String first. * CachedResolvedFile.scala (JVM/Native): small files are cached as Array[Byte]; getParserInput / readString materialize the String lazily; readRawBytes returns the cached bytes directly; contentHash is length + SHA-256 over the cached bytes; binary imports still use StaticBinaryResolvedFile. * PreloaderTests.scala: tighten the strict-JSON fast-path coverage so it fails if the fast path ever falls back to readString(). Result: * Output equality vs upstream sjsonnet and jrsonnet preserved on kube-prometheus and large_string_template. * Native kube-prometheus hyperfine A/B (forward & reverse): clean 139.4 +/- 2.8 ms -> candidate 132.7 +/- 1.9 ms (forward) candidate 132.1 +/- 1.9 ms vs clean 140.3 +/- 2.6 ms (reverse) * Full ./mill __.test green. References: Follow-up to databricks#840
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation:
PR #840 added a strict JSON import fast path, but cached file imports still decoded every small file to a UTF-8
Stringbefore strict.jsonparsing. Real-world Jsonnet projects such as kube-prometheus import many JSON files, so decoding bytes to text only to parse them as UTF-8 JSON adds avoidable allocation and CPU cost.Key Design Decision:
Use
ujson.ByteArrayParserfor strict.jsonimports and keep small cached resolved files byte-backed until text is explicitly needed byimportstror Fastparse. This keeps the JSON path byte-native while preserving existing text behavior for Jsonnet source parsing.Modification:
CachedResolvedFilenow caches small resolved files asArray[Byte], lazily materializes text forgetParserInput/readString, and returns cached bytes directly forreadRawBytes.CachedResolver.parseJsonImportnow parses strict JSON imports withujson.ByteArrayParser.transform(content.readRawBytes(), visitor).PreloaderTestsnow assert strict JSON preload does not fall back to Fastparse or decode viareadString.Result:
Output equality was preserved against upstream sjsonnet and source-built jrsonnet:
7,506,029bytes, byte-identical.large_string_templateguard output:549,674bytes, byte-identical.Benchmark Results:
Native whole-process kube-prometheus, command run from
jrsonnet/tests/realworldwith-J vendor,hyperfine --shell=none --warmup 5 --runs 50:143.351 +/- 2.556 ms135.806 +/- 3.388 ms91.752 +/- 7.863 ms145.082 +/- 2.988 ms136.392 +/- 2.034 ms91.365 +/- 1.419 msNative guard,
bench/resources/cpp_suite/large_string_template.jsonnet:11.036 +/- 0.807 ms10.762 +/- 0.650 ms5.204 +/- 0.511 msJMH guard,
bench.runRegressions:cpp_suite/large_string_template.jsonnet0.757 ms/op0.691 ms/opgo_suite/manifestJsonEx.jsonnet0.055 ms/op0.054 ms/opAnalysis:
The target workload improves by about 5-6% in both command orders. The remaining gap to jrsonnet is not eliminated, but this change removes one clear extra decode from strict JSON import handling without regressing the non-target renderer guard or JMH guards.
Validation:
./mill --no-server --ticker false --color false -j 1 __.checkFormat./mill --no-server --ticker false --color false -j 1 __.test(Tests: 440, Passed: 440, Failed: 0)References:
Follow-up to #840