
feat: support Spark base64 expression#3571

Draft
n0r0shi wants to merge 2 commits into apache:main from n0r0shi:base64-419-only

Conversation


@n0r0shi n0r0shi commented Feb 23, 2026

Which issue does this PR close?

Closes #419.

Rationale for this change

Spark's base64() expression is currently not supported by Comet, so queries that use it fall back to Spark. It is a commonly used function that maps directly to DataFusion's built-in encode(input, 'base64') function with no Rust changes.

What changes are included in this PR?

Two code paths handle different Spark versions:

  • Spark 3.4: Base64 is a direct expression node. Added CometBase64 handler in strings.scala and registered it in the stringExpressions map.
  • Spark 3.5+: Base64 is RuntimeReplaceable — Spark's optimizer rewrites it into StaticInvoke(Base64.encode, [input, chunkBase64]) before Comet sees the plan. Added CometBase64Encode handler in statics.scala to handle this.

Both paths produce the same DataFusion call: encode(input, 'base64').

Chunked base64 (spark.sql.chunkBase64String.enabled=true, which inserts newlines every 76 chars per RFC 2045) is not supported by DataFusion's encode function, so it falls back to Spark. I can take a look at the DataFusion side of this later.
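For anyone unfamiliar with the chunked mode: it wraps the base64 output at 76 characters while the payload stays identical. A quick illustration using Python's standard library (illustrative only, not Comet or Spark code; note that Python's `encodebytes` wraps with LF, whereas RFC 2045 proper specifies CRLF):

```python
import base64

data = bytes(range(100))  # long enough that the base64 output exceeds 76 chars

plain = base64.b64encode(data).decode()      # single line, no wrapping
chunked = base64.encodebytes(data).decode()  # wrapped every 76 chars (LF line breaks)

lines = chunked.strip().split("\n")
assert all(len(line) <= 76 for line in lines)
assert "".join(lines) == plain  # same payload, different layout
```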

Are these changes tested?

  • Normal base64 encoding: checkSparkAnswerAndOperator verifies correct results and native Comet execution
  • NULL handling: verified via checkSparkAnswerAndOperator
  • Chunked base64 fallback: checkSparkAnswerAndFallbackReason verifies correct results via Spark fallback and checks the expected fallback reason message
  • The Spark 3.4 direct expression handler (CometBase64) is exercised when CI runs the spark-3.4 profile. On Spark 3.5+ it is not reached because Spark replaces Base64 with StaticInvoke during optimization.

Are there any user-facing changes?

Yes. base64() is now executed natively by Comet instead of falling back to Spark, improving performance for queries using this function.

SELECT base64(binary_column) FROM table

Add native support for Spark's base64() expression by mapping it to
DataFusion's built-in encode(input, 'base64') function.

Two code paths are needed because Spark changed the expression type
across versions:
- Spark 3.4: Base64 is a direct expression node, handled in strings.scala
- Spark 3.5+: Base64 is RuntimeReplaceable, rewritten by Spark's
  optimizer into StaticInvoke(Base64.encode), handled in statics.scala

Chunked base64 mode (spark.sql.chunkBase64String.enabled=true, which
inserts newlines every 76 chars per RFC 2045) falls back to Spark
since DataFusion's encode function does not support this mode.

Closes apache#419
@n0r0shi n0r0shi changed the title feat: support Spark base64 expression (#419) feat: support Spark base64 expression Feb 23, 2026
@mbutrovich
Contributor

Thanks @n0r0shi! Kicking off CI.

@n0r0shi
Author

n0r0shi commented Feb 24, 2026

I found that DataFusion's encode(input, 'base64') produces unpadded output, which doesn't match Spark's padded base64. The padded variant ('base64pad') and SparkBase64 in the datafusion-spark crate are available in DataFusion 52.x, but Comet currently pins to 51.0.0.
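For anyone following along, the padding mismatch is easy to reproduce outside of Comet. A minimal Python sketch (illustrative only) of padded base64, which Spark emits, versus the unpadded form described above for DataFusion 51's encode:

```python
import base64

data = b"hi"  # 2 input bytes -> one 4-char base64 group with one '=' pad

padded = base64.b64encode(data).decode()  # "aGk=" -- padded, Spark-style
unpadded = padded.rstrip("=")             # "aGk"  -- unpadded form

assert padded == "aGk="
assert unpadded == "aGk"
```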

Can I keep this PR as draft until Comet upgrades to DataFusion 52.x+?

@mbutrovich
Contributor

> Can I keep this PR as draft until Comet upgrades to DataFusion 52.x+?

Sure thing, we should be wrapping up the DF 52 migration this week.


Development

Successfully merging this pull request may close these issues.

Support spark base64 function
