chore(audit): audit json expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4470

Open
andygrove wants to merge 1 commit into apache:main from andygrove:worktree-audit-json-funcs

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Continuation of the per-category expression audit. Same pattern as #4469 (struct), #4461 (string), and earlier audits, using the updated audit-comet-expression skill in #4468 (now also covers Spark 4.1.1).

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to get_json_object in docs/source/contributor-guide/spark_expressions_support.md. Spark 3.4.3 and 3.5.8 implement it as a BinaryExpression with CodegenFallback whose eval is inline and Jackson-based. Spark 4.0 extracts the eval into a GetJsonObjectEvaluator helper, mixes in DefaultStringProducingExpression, and widens inputTypes to StringTypeWithCollation(supportsTrimCollation = true). Spark 4.1.1 is identical to 4.0.

Support-level consistency fix (in strings.scala)

  • CometGetJsonObject: extract the duplicate single-quote / control-character incompatibility reason into a shared private val so the doc generator and the EXPLAIN dispatcher cannot drift.
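The shape of that fix can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: `SupportLevel`, `Incompatible`, `GetJsonObjectSketch`, the no-argument method signatures, and the wording of the reason string are all hypothetical stand-ins for the real Comet serde types.

```scala
// Hypothetical, simplified stand-ins for Comet's support-level types.
sealed trait SupportLevel
final case class Incompatible(reason: Option[String]) extends SupportLevel

object GetJsonObjectSketch {
  // Single source of truth for the incompatibility note.
  // The wording here is a placeholder, not the actual Comet message.
  private val incompatReason: String =
    "single-quoted JSON and unescaped control characters are not supported"

  // Read by the doc generator (signature simplified for this sketch).
  def getIncompatibleReasons: Seq[String] = Seq(incompatReason)

  // Read by the EXPLAIN dispatcher (signature simplified for this sketch).
  def getSupportLevel: SupportLevel = Incompatible(Some(incompatReason))
}
```

Because both methods read the same `private val`, a wording change in one place is picked up by the generated docs and the EXPLAIN output together.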

Tracking issues filed for follow-up

None. The known incompatibilities (single-quoted JSON, unescaped control characters) are already declared via getSupportLevel and getIncompatibleReasons. Non-default Spark 4.0 string collations are covered by the umbrella issue #2190 (referenced from the support-doc sub-bullet).

Audit process

Audited directly using the audit-comet-expression skill (4 Spark versions). One backing serde, so no parallel subagents were needed.

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite string/get_json_object" -Dtest=none (2 tests pass; existing get_json_object.sql already covers single-character, nested-field, wildcard, deep-nested, unicode, emoji, mixed-script, escaped-quote, and dictionary-encoded inputs).
  • make core succeeds with the serde change.

…, 4.1.1

Add per-version audit sub-bullets to `get_json_object` in
`docs/source/contributor-guide/spark_expressions_support.md`. Spark
3.4.3 and 3.5.8 use a `BinaryExpression with CodegenFallback` with
inline Jackson-based eval; Spark 4.0 extracts the eval into a
`GetJsonObjectEvaluator` helper and widens `inputTypes` to
`StringTypeWithCollation` (`DefaultStringProducingExpression` trait
added). 4.1 is identical to 4.0.

Apply the one support-level consistency fix surfaced by the audit:

- `CometGetJsonObject`: extract the duplicate single-quote /
  control-character incompatibility reason into a shared `private val`
  so the doc generator and the EXPLAIN dispatcher cannot drift.

No new tracking issues filed. The known incompatibilities (single-
quoted JSON, unescaped control characters) are already declared via
`getSupportLevel` and `getIncompatibleReasons`. Spark 4.0 collation
propagation is covered by the umbrella apache#2190.