Discussion: Should Comet add geospatial (ST_*) function support? #4455

## Background

PR #4423 proposes adding 40 native geospatial SQL functions (`ST_Contains`, `ST_Intersects`, `ST_Distance`, etc.) directly into Comet, executed via DataFusion using new Rust dependencies (`geo`, `geoarrow`, `geojson`, `geos`, `wkt`). The functions are wired through `CometSparkSessionExtensions` so that users get them automatically once Comet is enabled — no Sedona dependency required.

This represents a potential shift in scope for Comet, which to date has focused on accelerating Spark's built-in expressions and operators. The discussion on the PR raised several open questions that deserve broader community input, ideally on the dev mailing list as well as here.

## Questions for discussion

1. **Should geospatial support be in scope for Comet at all?** Comet has historically targeted Spark built-ins (math, string, datetime, aggregates, etc.). `ST_*` functions are not Spark built-ins — they come from Apache Sedona or other extensions. Adding them would expand Comet's surface area into a domain that already has dedicated projects.

2. **If yes, what is the right implementation path?**
   - **In-tree, maintained by Comet**: as proposed in #4423. Comet owns the function definitions, tests, and dependencies (including the GEOS C library, statically linked via the `geos` crate).
   - **Wrap SedonaDB**: @paleolimbot noted that SedonaDB has ~100 functions plus join/Parquet IO already implemented, tested, and benchmarked in Rust. Comet could wrap those, limiting Comet's maintenance burden to thin wrapper code.
   - **Defer to Sedona / Wherobots**: users who need geo today already have options (Sedona on Spark; Wherobots offers a Rust-accelerated path). Comet could choose not to enter this space.

3. **Geometry representation.** The PR uses WKT strings. @paleolimbot pointed out that Spark, Parquet, and SedonaDB all use WKB, which is significantly faster ("the equivalent of passing around doubles as strings"). If Comet adopts geo, what is the right representation, and does that depend on broader UDT support?

4. **UDT / Spark geometry type support.** @paleolimbot mentioned that full UDT support would require changing many `DataType` usages to `FieldRef` usages. Spark geometry has a type parameter that is dropped when represented as Utf8. Is this a prerequisite for doing geo "properly," and is it work the project wants to take on?

5. **Build and runtime dependencies.** The proposed approach adds a native dependency on GEOS (statically linked, so end users don't need it at runtime, but build machines do). How does the community feel about adding a C library dependency to the Comet build?

6. **Maintenance burden.** Geo functions are a large surface area (the PR adds 40; SedonaDB has roughly 100). Who maintains them, who reviews changes, and who handles compatibility as Sedona/Spark evolve?
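
To make question 3 concrete, here is a minimal sketch of the same point in WKT and in WKB. It is not taken from the PR or from SedonaDB; the helper names (`point_to_wkb`, `point_from_wkb`, `point_from_wkt`) are hypothetical, and it only handles little-endian 2D points. The intent is just to show why WKT means parsing text for every coordinate while WKB is a fixed binary layout.

```python
import struct
from typing import Tuple


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1 byte for byte order (1 = little-endian),
    # 4 bytes for geometry type (1 = Point), then two 8-byte doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def point_from_wkb(buf: bytes) -> Tuple[float, float]:
    """Decode a little-endian 2D WKB point by reading fixed offsets."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


def point_from_wkt(text: str) -> Tuple[float, float]:
    """Decode 'POINT (x y)' by slicing the string and parsing each number."""
    body = text.strip()
    if not body.upper().startswith("POINT"):
        raise ValueError("this sketch only handles POINT geometries")
    inner = body[body.index("(") + 1 : body.rindex(")")]
    x_str, y_str = inner.split()
    # Each coordinate is converted from decimal text back to a double.
    return float(x_str), float(y_str)


wkt = "POINT (-122.4194 37.7749)"
wkb = point_to_wkb(-122.4194, 37.7749)

print(len(wkb))             # 21 bytes: 1 + 4 + 8 + 8
print(point_from_wkb(wkb))  # doubles are read back without text parsing
print(point_from_wkt(wkt))  # doubles are re-parsed from decimal text
```

For larger geometries (polygons with many vertices) the WKT path repeats the text-to-double conversion for every coordinate, which is the cost the "doubles as strings" remark refers to.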

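For question 4, the following sketch illustrates why a bare data type is not enough to carry a geometry type parameter. The `DataType` and `Field` classes are hypothetical stand-ins, not Comet, DataFusion, or Arrow APIs. The metadata keys follow the Arrow extension-type convention (`ARROW:extension:name`, `ARROW:extension:metadata`); representing the Spark SRID as an `EPSG:<srid>` CRS string under the `geoarrow.wkb` extension name is an assumption made for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DataType:
    """Hypothetical stand-in for a bare storage type such as Utf8 or Binary."""
    name: str


@dataclass
class Field:
    """Hypothetical stand-in for a field: name, storage type, and metadata."""
    name: str
    data_type: DataType
    metadata: Dict[str, str] = field(default_factory=dict)


def geometry_field(name: str, srid: int) -> Field:
    """Build a geometry field whose SRID lives in field metadata (assumed layout)."""
    return Field(
        name,
        DataType("Binary"),
        {
            "ARROW:extension:name": "geoarrow.wkb",
            "ARROW:extension:metadata": json.dumps({"crs": f"EPSG:{srid}"}),
        },
    )


def srid_of(f: Field) -> Optional[int]:
    """Recover the SRID from field metadata, or None if it is absent."""
    raw = f.metadata.get("ARROW:extension:metadata")
    if raw is None:
        return None
    crs = json.loads(raw).get("crs", "")
    return int(crs.split(":")[1]) if crs.startswith("EPSG:") else None


wgs84 = geometry_field("geom", 4326)
web_mercator = geometry_field("geom", 3857)

# Code that only sees the DataType cannot tell these two columns apart.
print(wgs84.data_type == web_mercator.data_type)
# Code that sees the whole field can.
print(srid_of(wgs84), srid_of(web_mercator))
```

Under these assumptions, any code path that passes only the storage type loses the parameter, which is one way to read the `DataType` to `FieldRef` point raised on the PR.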
## References

- PR #4423: https://github.com/apache/datafusion-comet/pull/4423
- SedonaDB: https://github.com/apache/sedona-db
- Wherobots: https://wherobots.com/

## Next steps

@andygrove suggested taking this to the dev@ mailing list, given that it would shift Comet's scope. This issue is intended to collect written input from contributors and users before or alongside that discussion. Please weigh in with your perspective, especially if you have a use case for geo in Comet or experience maintaining geospatial libraries.
