chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461
Open
andygrove wants to merge 2 commits into
Open
chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461andygrove wants to merge 2 commits into
andygrove wants to merge 2 commits into
Conversation
Add per-version audit sub-bullets to every supported string expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 37 SQL function names backed by the serdes in spark/src/main/scala/org/apache/comet/serde/strings.scala plus the CometScalarFunction wirings in QueryPlanSerde.scala. For each function, the sub-bullets record: - Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1 - Spark 4.0 changes (collation widening via StringTypeWithCollation, CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family, NullIntolerant -> nullIntolerant field refactor, BinaryPad rewrite for lpad/rpad) - Known divergences between Comet and Spark (initcap hyphen handling, replace with empty search, repeat with negative count, decode charset/legacy-flag gaps, BinaryType wiring gaps in bit_length / octet_length, regexp_replace pos > 1 restriction) The audit was driven by 7 parallel agents, each handling a related group (case conversion + initcap; length family; trim; left/right/ substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode). Apply support-level consistency fixes surfaced by the audit: - CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override. - CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync. - CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) and Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text. - CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks. - CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly, and replace the convert-time withInfo with a declared Unsupported reason via a shared private val. Update the affected tests for the new reason strings: - spark/src/test/resources/sql-tests/expressions/string/left.sql - spark/src/test/resources/sql-tests/expressions/string/string_lpad.sql - spark/src/test/resources/sql-tests/expressions/string/string_rpad.sql - spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
This was referenced May 27, 2026
Open
… doc Link the high-priority findings from the string-expressions audit to the tracking issues filed alongside the PR: - repeat negative count divergence -> apache#4462 - translate grapheme vs codepoint and U+0000 deletion -> apache#4463 - bit_length/octet_length BinaryType native error -> apache#4464 - decode Spark 4.0 legacy flags ignored -> apache#4465 - decode missing from auto-generated compatibility docs -> apache#4466 - case-conversion gating antipattern -> apache#4467 - generic Spark 4.0 collation gap on lower/upper -> apache#2190 (existing) Also remove a stray NUL byte in the chr entry that came from a heredoc roundtrip and replace it with the literal escape \u0000.
andygrove
added a commit
to andygrove/datafusion-comet
that referenced
this pull request
May 27, 2026
The audit-comet-expression skill previously permitted leaving semantics-decision findings as prose recommendations in the PR description, on the assumption the reviewer would pick them up. In practice that note dies with the PR. Tighten Step 6 and Step 7 so that every high-priority finding either becomes an inline fix + test, or a filed GitHub issue + ignored regression test, before the audit PR is opened. Add a dedicated 'Findings that need follow-up' subsection spelling out the workflow (search, file, cross-reference from the support-doc sub-bullet and the PR description). Surfaced while re-running the string-expressions audit in apache#4461: several higher-risk findings (CometCaseConversionBase compat gating, StringRepeat negative-count divergence, translate grapheme semantics, bit_length/octet_length BinaryType, decode legacy flags) were left as prose only and had to be filed retroactively as apache#4462-apache#4467.
aa20e8f to
d994671
Compare
This was referenced May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #.
Rationale for this change
Following the same pattern as #4436 (
any), #4437 (bit_and), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced.What changes are included in this PR?
Support-doc audit notes
Add per-version audit sub-bullets to all 37 supported string SQL function names in
docs/source/contributor-guide/spark_expressions_support.md. The notes record:StringTypeWithCollation;CollationSupportrouting forUpper/Lower/InitCapand thestartswith/endswith/contains/instr/replace/translate/trimfamily;NullIntolerant->nullIntolerant: Booleanrefactor;BinaryPadrewrite forlpad/rpad)Support-level consistency fixes (in
strings.scala)CometInitCap:Incompatible(None)->Incompatible(Some(reason))via a sharedprivate val; drop the deadconvertoverride.CometLength: extract theBinaryTypereason into a sharedprivate valso thegetSupportLevelbranch andgetUnsupportedReasonsstay in sync.CometRegExpReplace: pass the existing reasons intoIncompatible(Some(...))/Unsupported(Some(...))via sharedprivate vals so EXPLAIN matches the compatibility-guide text.CometStringLPad/CometStringRPad: share the two reason strings via a smallPadReasonscompanion sogetSupportLevelandgetUnsupportedReasonscannot drift; align wording with backticks.CometSubstring/CometLeft/CometRight: lift the literal-argument fallback out ofconvertand intogetSupportLevelso the dispatcher handles it uniformly; replace the convert-timewithInfowith a declaredUnsupportedreason via a sharedprivate val.Tracking issues filed for follow-up
Higher-risk findings that need a semantics decision or larger work are filed as separate issues and cross-referenced from the support doc:
repeatthrows on negative count where Spark returns empty stringtranslateuses graphemes vs Spark code points; ignores U+0000 deletionbit_lengthandoctet_lengtherror natively forBinaryTypeinput instead of falling backdecodeignores Spark 4.0legacyCharsetsandlegacyErrorActionflagsdecodedoes not appear in auto-generated compatibility docsCometCaseConversionBasegates compat insideconvert()instead ofgetSupportLevelThe umbrella Spark 4.0 collation issue is the existing #2190, which the
lower/uppersub-bullets reference.The audit skill update (require issues for prose-only findings, add Spark 4.1.1 to the version list) is split out into #4468.
Audit process
The audit was driven by 7 parallel agents using the
audit-comet-expressionskill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode).How are these changes tested?
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none(42 tests pass)./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none(33 tests pass)left.sql,string_lpad.sql,string_rpad.sql, andCometStringExpressionSuite.scalato match the new shared reason strings.