Skip to content

chore(audit): audit math expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4486

Open
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:worktree-audit-math-funcs
Open

chore(audit): audit math expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4486
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:worktree-audit-math-funcs

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Continuation of the per-category expression audit. Same pattern as #4483 (array), #4480 (predicate), #4479 (bitwise), #4478 (map), #4476 (hash), #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated audit-comet-expression skill in #4468.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to all 60 supported math SQL function names (%, *, +, -, /, abs, all trig and inverse-trig, all log/exp/pow/sqrt/cbrt, bin, hex, unhex, greatest, least, ceil/ceiling/floor/round/rint/sign/signum, div/mod/negative/positive/pi, try_add/try_divide/try_multiply/try_subtract, width_bucket). Add 4.1.1 audit line to the already-audited factorial. rand / randn cross-reference the misc audit (#4474); shiftleft cross-references the bitwise audit (#4479).

Highlights from the cross-version review:

  • All trig and exp/log helpers are byte-for-byte stable across the four Spark versions; the only systemic change in 4.0 is the NullIntolerant trait -> nullIntolerant: Boolean field refactor.
  • 4.0 also adds the DefaultStringProducingExpression trait on Bin / Hex / Unhex and widens their StringType inputs to StringTypeWithCollation.
  • UnaryPositive becomes RuntimeReplaceable in 4.0; in 3.4/3.5 +col silently disables Comet for the enclosing projection because there is no CometUnaryPositive serde.
  • CometRemainder rejects EvalMode.TRY, so try_mod on Spark 4.0+ falls back (filed as [Bug] try_mod falls back to Spark because CometRemainder rejects EvalMode.TRY #4484).
  • Round falls back for Float / Double because BigDecimal-via-toString rounding cannot be precisely matched.
  • width_bucket (Spark 3.5+) is wired via per-version CometExprShim rather than a CometExpressionSerde, bypassing the support-level framework (filed as [Bug] width_bucket bypasses CometExpressionSerde framework #4485).
  • hex / unhex do not propagate Spark 4.x collation; covered by the umbrella [Spark 4.0] Add string collation support #2190.

Support-level consistency fixes

None in this PR. The audit surfaced several convert-time withInfo fallbacks in arithmetic.scala (CometAdd / CometSubtract / CometMultiply / CometDivide / CometIntegralDivide / CometRemainder / CometUnaryMinus / CometRound rely on the default Compatible(None) and bail out from convert for several cases). Lifting them into getSupportLevel / getUnsupportedReasons is mechanical but touches an unusually load-bearing file; deliberately deferred to follow-ups to keep this audit PR low-risk.

Tracking issues filed for follow-up

Audit process

Audited using the audit-comet-expression skill (4 Spark versions per #4468), driven by 4 parallel agents covering arithmetic ops, trig + constants, log/exp/power/rounding, and misc (bin/hex/etc).

How are these changes tested?

  • make core succeeds (no code changes; doc only).
  • Existing test coverage in CometExpressionSuite, CometMathExpressionSuite, CometSqlFileTestSuite expressions/math/, and expressions/string/{hex,unhex}.sql remains unchanged.

…, 4.1.1

Add per-version audit sub-bullets to all supported math SQL function
names (around 60). Also add the 4.1.1 audit line to the already-audited
`factorial`. `rand`/`randn` cross-reference the misc audit (apache#4474),
and `shiftleft` cross-references the bitwise audit (apache#4479).

Highlights from the cross-version review:

- All trig and exp/log helpers are byte-for-byte stable across the four
  Spark versions; the only systemic change in 4.0 is the
  `NullIntolerant` trait -> `nullIntolerant: Boolean` field refactor.
- 4.0 also adds the `DefaultStringProducingExpression` trait on
  `Bin` / `Hex` / `Unhex` and widens their `StringType` inputs to
  `StringTypeWithCollation`.
- `UnaryPositive` becomes `RuntimeReplaceable` in 4.0; in 3.4/3.5
  `+col` silently disables Comet for the enclosing projection because
  there is no `CometUnaryPositive` serde.
- All arithmetic ops thread `EvalMode` through, but `CometRemainder`
  rejects `EvalMode.TRY`, so `try_mod` on Spark 4.0+ falls back
  (filed as apache#4484).
- `Round` falls back for Float/Double because BigDecimal-via-toString
  rounding cannot be precisely matched.
- `width_bucket` (Spark 3.5+) is wired via per-version `CometExprShim`
  rather than a `CometExpressionSerde`, bypassing the support-level
  framework (filed as apache#4485).
- `hex` / `unhex` do not propagate Spark 4.x collation; covered by
  the umbrella apache#2190.

The audit was driven by 4 parallel agents covering arithmetic ops,
trig + constants, log/exp/power/rounding, and misc (bin/hex/etc).

Tracking issues filed for the gaps found:

- apache#4484 `try_mod` falls back because `CometRemainder` rejects `EvalMode.TRY`.
- apache#4485 `width_bucket` bypasses `CometExpressionSerde` framework.

No serde changes are included; the support-level antipattern fixes
called out in the audit (mainly lifting convert-time `withInfo`
fallbacks in `arithmetic.scala` into `getSupportLevel`) are
deliberately out of scope for this doc-focused PR and can ride in
follow-ups.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant