Background
PR #4423 proposes adding 40 native geospatial SQL functions (ST_Contains, ST_Intersects, ST_Distance, etc.) directly into Comet, executed via DataFusion using new Rust dependencies (geo, geoarrow, geojson, geos, wkt). The functions are wired through CometSparkSessionExtensions so that users get them automatically once Comet is enabled, with no Sedona dependency required.
This represents a potential shift in scope for Comet, which to date has focused on accelerating Spark's built-in expressions and operators. The discussion on the PR raised several open questions that deserve broader community input, ideally on the dev mailing list as well as here.
Questions for discussion
Should geospatial support be in scope for Comet at all? Comet has historically targeted Spark built-ins (math, string, datetime, aggregates, etc.). ST_* functions are not Spark built-ins; they come from Apache Sedona or other extensions. Adding them would expand Comet's surface area into a domain that already has dedicated projects.
If yes, what is the right implementation path? Besides the PR's approach (which uses the geos crate with static linking), the alternatives are:
Wrap SedonaDB: @paleolimbot noted that SedonaDB has ~100 functions plus join/Parquet IO already implemented, tested, and benchmarked in Rust. Comet could wrap those, limiting Comet's maintenance burden to thin wrapper code.
Defer to Sedona / Wherobots: users who need geo today already have options (Sedona on Spark; Wherobots offers a Rust-accelerated path). Comet could choose not to enter this space.
Geometry representation. The PR uses WKT strings. @paleolimbot pointed out that Spark, Parquet, and SedonaDB all use WKB, which is significantly faster ("the equivalent of passing around doubles as strings"). If Comet adopts geo, what is the right representation, and does that depend on broader UDT support?
UDT / Spark geometry type support. @paleolimbot mentioned that full UDT support would require changing many DataType usages to FieldRef usages. Spark geometry has a type parameter that is dropped when represented as Utf8. Is this a prerequisite for doing geo "properly," and is it work the project wants to take on?
Build and runtime dependencies. The proposed approach adds a native dependency on GEOS (statically linked, so end users don't need it at runtime, but build machines do). How does the community feel about adding a C library dependency to the Comet build?
Maintenance burden. Geo functions are a large surface area (the PR adds 40; SedonaDB has ~100+). Who maintains them, who reviews changes, and who handles compatibility as Sedona/Spark evolve?
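To make the geometry representation question above more concrete, here is a minimal sketch of the same 2D point encoded as WKT text and as WKB bytes. It uses only the Python standard library and handles points only. The helper names are hypothetical and are not taken from the PR, Comet, or SedonaDB. The WKB layout assumed here is the standard one for a point: a 1-byte byte-order flag, a 4-byte geometry type (1 for Point), then two 8-byte doubles.

```python
import struct
from typing import Tuple


def point_to_wkt(x: float, y: float) -> str:
    # repr() yields the shortest decimal string that round-trips the double.
    return f"POINT ({x!r} {y!r})"


def point_from_wkt(text: str) -> Tuple[float, float]:
    # Hypothetical minimal parser: handles only "POINT (x y)".
    body = text.strip()
    if not body.upper().startswith("POINT"):
        raise ValueError("only POINT is supported in this sketch")
    inner = body[body.index("(") + 1 : body.rindex(")")]
    x_str, y_str = inner.split()
    # Every read pays for decimal-to-binary conversion of each coordinate.
    return float(x_str), float(y_str)


def point_to_wkb(x: float, y: float) -> bytes:
    # "<" = little-endian, no padding; B = byte-order flag (1 = little-endian),
    # I = geometry type (1 = Point), dd = x and y as IEEE 754 doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def point_from_wkb(buf: bytes) -> Tuple[float, float]:
    # The first byte says how the remaining fields are laid out.
    endian = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", buf[1:21])
    if geom_type != 1:
        raise ValueError("only Point (type 1) is supported in this sketch")
    # Coordinates are copied out as-is; no text parsing is involved.
    return x, y


if __name__ == "__main__":
    x, y = -122.4194, 37.7749
    print(point_to_wkt(x, y), len(point_to_wkt(x, y)), "characters")
    print(point_to_wkb(x, y).hex(), len(point_to_wkb(x, y)), "bytes")
```

The WKB form is fixed-width and decodes with a plain byte copy, while the WKT form has to be tokenized and each coordinate converted from decimal text. That is the overhead the "doubles as strings" comparison refers to, and it grows with the number of vertices in a geometry.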
Next steps
@andygrove suggested taking this to the dev@ mailing list given the scope shift implications. This issue is intended to collect written input from contributors and users before/alongside that discussion. Please weigh in with your perspective, especially if you have a use case for geo in Comet or experience maintaining geospatial libraries.