Skip to content

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:worktree-audit-string-funcs
Open

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:worktree-audit-string-funcs

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented May 27, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

Following the same pattern as #4436 (any), #4437 (bit_and), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to all 37 supported string SQL function names in docs/source/contributor-guide/spark_expressions_support.md. The notes record:

  • whether the Spark class is identical across 3.4.3 / 3.5.8 / 4.0.1
  • Spark 4.0 changes (collation widening via StringTypeWithCollation; CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family; NullIntolerant -> nullIntolerant: Boolean refactor; BinaryPad rewrite for lpad/rpad)
  • known divergences between Comet and Spark, each linked to the tracking issue listed below

Support-level consistency fixes (in strings.scala)

  • CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override.
  • CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync.
  • CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) / Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text.
  • CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks.
  • CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly; replace the convert-time withInfo with a declared Unsupported reason via a shared private val.

Tracking issues filed for follow-up

Higher-risk findings that need a semantics decision or larger work are filed as separate issues and cross-referenced from the support doc:

The umbrella Spark 4.0 collation issue is the existing #2190, which the lower/upper sub-bullets reference.

The audit skill update (require issues for prose-only findings, add Spark 4.1.1 to the version list) is split out into #4468.

Audit process

The audit was driven by 7 parallel agents using the audit-comet-expression skill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode).

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none (42 tests pass)
  • ./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none (33 tests pass)
  • Updated the affected fallback-reason assertions in left.sql, string_lpad.sql, string_rpad.sql, and CometStringExpressionSuite.scala to match the new shared reason strings.

Add per-version audit sub-bullets to every supported string expression in
spark_expressions_support.md using the audit-comet-expression skill.
Covers 37 SQL function names backed by the serdes in
spark/src/main/scala/org/apache/comet/serde/strings.scala plus the
CometScalarFunction wirings in QueryPlanSerde.scala.

For each function, the sub-bullets record:
- Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1
- Spark 4.0 changes (collation widening via StringTypeWithCollation,
  CollationSupport routing for Upper/Lower/InitCap and the
  startswith/endswith/contains/instr/replace/translate/trim family,
  NullIntolerant -> nullIntolerant field refactor, BinaryPad rewrite
  for lpad/rpad)
- Known divergences between Comet and Spark (initcap hyphen handling,
  replace with empty search, repeat with negative count, decode
  charset/legacy-flag gaps, BinaryType wiring gaps in bit_length /
  octet_length, regexp_replace pos > 1 restriction)

The audit was driven by 7 parallel agents, each handling a related
group (case conversion + initcap; length family; trim; left/right/
substring/substring_index; lpad/rpad; simple scalar funcs; complex
serdes including concat_ws/regexp_replace/repeat/split/decode).

Apply support-level consistency fixes surfaced by the audit:
- CometInitCap: Incompatible(None) -> Incompatible(Some(reason))
  via a shared private val; drop the dead convert override.
- CometLength: extract the BinaryType reason into a shared private val
  so the getSupportLevel branch and getUnsupportedReasons stay in sync.
- CometRegExpReplace: pass the existing reasons into
  Incompatible(Some(...)) and Unsupported(Some(...)) via shared
  private vals so EXPLAIN matches the compatibility-guide text.
- CometStringLPad / CometStringRPad: share the two reason strings via
  a small PadReasons companion so getSupportLevel and
  getUnsupportedReasons cannot drift; align wording with backticks.
- CometSubstring / CometLeft / CometRight: lift the literal-argument
  fallback out of convert and into getSupportLevel so the dispatcher
  handles it uniformly, and replace the convert-time withInfo with a
  declared Unsupported reason via a shared private val.

Update the affected tests for the new reason strings:
- spark/src/test/resources/sql-tests/expressions/string/left.sql
- spark/src/test/resources/sql-tests/expressions/string/string_lpad.sql
- spark/src/test/resources/sql-tests/expressions/string/string_rpad.sql
- spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
… doc

Link the high-priority findings from the string-expressions audit to
the tracking issues filed alongside the PR:

- repeat negative count divergence -> apache#4462
- translate grapheme vs codepoint and U+0000 deletion -> apache#4463
- bit_length/octet_length BinaryType native error -> apache#4464
- decode Spark 4.0 legacy flags ignored -> apache#4465
- decode missing from auto-generated compatibility docs -> apache#4466
- case-conversion gating antipattern -> apache#4467
- generic Spark 4.0 collation gap on lower/upper -> apache#2190 (existing)

Also remove a stray NUL byte in the chr entry that came from a heredoc
roundtrip and replace it with the literal escape \u0000.
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request May 27, 2026
The audit-comet-expression skill previously permitted leaving
semantics-decision findings as prose recommendations in the PR
description, on the assumption the reviewer would pick them up.
In practice that note dies with the PR.

Tighten Step 6 and Step 7 so that every high-priority finding either
becomes an inline fix + test, or a filed GitHub issue + ignored
regression test, before the audit PR is opened. Add a dedicated
'Findings that need follow-up' subsection spelling out the workflow
(search, file, cross-reference from the support-doc sub-bullet and
the PR description).

Surfaced while re-running the string-expressions audit in apache#4461:
several higher-risk findings (CometCaseConversionBase compat gating,
StringRepeat negative-count divergence, translate grapheme semantics,
bit_length/octet_length BinaryType, decode legacy flags) were left as
prose only and had to be filed retroactively as apache#4462-apache#4467.
@andygrove andygrove force-pushed the worktree-audit-string-funcs branch from aa20e8f to d994671 Compare May 27, 2026 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant