feat(meos): parameterize CROSS_DISTANCE (vidA, vidB) via SQL constant args (stacks on #19)#20
Open
estebanzimanyi wants to merge 6 commits into
Conversation
… on NebulaStream (33 YAMLs, 27/27 cells) Additive scaffold for the BerlinMOD-9 × 3 streaming-form parity contract on MobilityNebula, sibling to the existing SNCB Q-series and matching the MobilityFlink MobilityDB#3 / MobilityKafka MobilityDB#1 streaming-form definitions. All 27 cells covered: Q1 'which vehicles have appeared' — full (continuous + windowed + snapshot) Q2 'where is vehicle X at time T' — full Q3 'vehicles within 5 km of P' — full Q4 'vehicles inside region R (polygon)'— full Q5 'pairs of vehicles meeting near P' — partial (emit per-vehicle trajectories near P; consumer joins) Q6 'cumulative distance per vehicle' — partial (emit TEMPORAL_SEQUENCE; consumer computes length) Q7 'first passage of vehicle through POI' × {POI1, POI2, POI3} — full (per-POI fan-out) Q8 'vehicles within d of LINESTRING' — full (edwithin_tgeo_geo with LINESTRING geometry) Q9 'distance between X and Y at time T'— partial (emit X and Y trajectories; consumer joins) 18 of 27 cells are FULL (the BerlinMOD-Q semantic is computed entirely inside NebulaStream). 9 cells are PARTIAL — NebulaStream emits the per-window inputs (trajectory, candidate vehicles) and a consumer post-processes for the final BerlinMOD-Q answer. The partial pattern is the natural expression of these queries in NebulaStream's current SQL surface; the path to FULL is documented per-Q in docs/berlinmod-streaming-forms.md (a stream-self-join for Q5/Q9, a temporal_length scalar function for Q6). Form mapping to NebulaStream windows: continuous: SLIDING(time_utc, SIZE 1 SEC, ADVANCE BY 1 SEC) windowed: TUMBLING(time_utc, SIZE 10 SEC) snapshot: TUMBLING(time_utc, SIZE 5 SEC) MEOS-side surface consumed (already exposed by PR MobilityDB#14 + follow-ups): edwithin_tgeo_geo — Q3 (POINT predicate), Q4 (POLYGON, d=0.0), Q5 (POINT predicate), Q7 (per-POI POINT), Q8 (LINESTRING predicate) TEMPORAL_SEQUENCE — Q2 / Q5 / Q6 / Q9 (per-window per-vehicle trajectory) No new MEOS PhysicalFunction classes added; no C++ changes; no SNCB Q-series modifications. All 33 YAMLs are additive in a new Queries/berlinmod/ subdirectory. Add (additions): Queries/berlinmod/q1_{continuous,windowed,snapshot}.yaml (3) Queries/berlinmod/q2_{continuous,windowed,snapshot}.yaml (3) Queries/berlinmod/q3_{continuous,windowed,snapshot}.yaml (3) Queries/berlinmod/q4_{continuous,windowed,snapshot}.yaml (3) Queries/berlinmod/q5_{continuous,windowed,snapshot}.yaml (3, partial) Queries/berlinmod/q6_{continuous,windowed,snapshot}.yaml (3, partial) Queries/berlinmod/q7_poi{1,2,3}_{continuous,windowed,snapshot}.yaml (9, full via fan-out) Queries/berlinmod/q8_{continuous,windowed,snapshot}.yaml (3, LINESTRING predicate) Queries/berlinmod/q9_{continuous,windowed,snapshot}.yaml (3, partial) Input/input_berlinmod.csv (sample data: 3 vehicles × 21 events, 14 simulated seconds) docs/berlinmod-streaming-forms.md Validation: every YAML parses cleanly via python3 yaml.safe_load. Runtime verification gated on the NebulaStream test harness. Coverage: 27 of 27 cells (100 %), with 18 FULL and 9 PARTIAL annotated explicitly per Q. Path to FULL for the 9 PARTIAL cells is one MobilityNebula C++ PhysicalFunction class each (or a NebulaStream upstream stream-self-join), documented in docs/berlinmod-streaming-forms.md.
…-form cells to full
Adds the TEMPORAL_LENGTH aggregation across the four levels of the
NebulaStream pipeline (logical / physical / parser / lowering) so the
BerlinMOD-Q6 "cumulative distance per vehicle" streaming-form cells
(continuous + windowed + snapshot) compute the spheroidal trajectory
length entirely inside NebulaStream instead of emitting raw trajectories
for a consumer-side reduction.
Logical: nes-logical-operators/{include,src}/Operators/Windows/Aggregations/Meos/TemporalLengthAggregationLogicalFunction.{hpp,cpp}
mirroring TemporalSequenceAggregationLogicalFunctionV2 but with finalAggregateStampType = FLOAT64.
Registers as "TemporalLength" in the aggregation registry. Serializes through the existing
TemporalAggregationSerde wire shape with the type tag overridden.
Physical: nes-physical-operators/{include,src}/Aggregation/Function/Meos/TemporalLengthAggregationPhysicalFunction.{hpp,cpp}
identical lift / combine / reset / cleanup to TemporalSequenceAggregationPhysicalFunction;
the lower() path builds the same MEOS instant-set trajectory string, parses it via
MEOSWrapper::parseTemporalPoint, and calls MEOS' tpoint_length(Temporal*) to return a single
FLOAT64 result.
Parser: nes-sql-parser/AntlrSQL.g4 adds the TEMPORAL_LENGTH lexer token and includes it in
functionName. AntlrSQLQueryPlanCreator.cpp adds the TEMPORAL_LENGTH dispatch in both the
case-label and string-name paths, parallel to TEMPORAL_SEQUENCE.
Lowering: nes-query-optimizer/src/RewriteRules/LowerToPhysical/LowerToPhysicalWindowedAggregation.cpp
adds the TEMPORAL_LENGTH special-case lowering, parallel to TEMPORAL_SEQUENCE, producing a
TemporalLengthAggregationPhysicalFunction with the same (lon, lat, timestamp) state schema.
YAMLs: Queries/berlinmod/q6_{continuous,windowed,snapshot}.yaml updated to call
TEMPORAL_LENGTH directly; the FLOAT64 output column replaces the VARSIZED trajectory output;
header comments updated to "FULL".
Docs: docs/berlinmod-streaming-forms.md updated to reflect 21 cells full + 6 cells partial
(Q5 + Q9 only); the path-to-full table now lists those two queries only.
YAML safe_load green on all 3 Q6 cells. Build verification gated on the user's NebulaStream
test harness (vcpkg-bootstrapped); the C++ code follows the established TemporalSequence
template exactly, with the lower() path replaced by tpoint_length.
…streaming-form cells to full
Mirrors the TEMPORAL_LENGTH pattern from the parent PR with two new
four-field aggregations that close the last 6 partial cells on the
MobilityNebula BerlinMOD parity matrix:
PAIR_MEETING(lon, lat, ts, vehicle_id) -> VARSIZED
Lift collects per-event tuples. Lower picks each vehicle's latest known
position in the window, enumerates pairs (a < b), calls MEOS' geog_dwithin
with dMeet = 200 m hardcoded for the BerlinMOD scaffold, and emits a
string-encoded list of meeting pairs (vid_a, vid_b, ts, "<=dMeet" tag).
Future PR can parameterize dMeet via a constant input. Closes Q5 × 3 cells.
CROSS_DISTANCE(lon, lat, ts, vehicle_id) -> FLOAT64
Same lift shape. Lower picks the latest known position of each of the two
target vehicles (VID_A = 100, VID_B = 200 hardcoded), drives the MEOS
nad_tgeo_tgeo distance, and returns a FLOAT64 (NaN if either vehicle is
unobserved). Future PR can parameterize (VID_A, VID_B). Closes Q9 × 3 cells.
Wired across the four pipeline layers identically to TEMPORAL_LENGTH:
- nes-physical-operators/{include,src}/Aggregation/Function/Meos/{PairMeeting,CrossDistance}AggregationPhysicalFunction.{hpp,cpp}
- nes-logical-operators/{include,src}/Operators/Windows/Aggregations/Meos/{PairMeeting,CrossDistance}AggregationLogicalFunction.{hpp,cpp}
- nes-physical-operators/src/Aggregation/Function/Meos/CMakeLists.txt + nes-logical-operators/src/Operators/Windows/Aggregations/Meos/CMakeLists.txt plugin entries
- nes-sql-parser/AntlrSQL.g4 lexer + functionName tokens
- nes-sql-parser/src/AntlrSQLQueryPlanCreator.cpp case-label + string-name dispatch
- nes-query-optimizer/src/RewriteRules/LowerToPhysical/LowerToPhysicalWindowedAggregation.cpp special-case lowering with 4-field state schema
YAMLs: Queries/berlinmod/q5_{continuous,windowed,snapshot}.yaml and
q9_{continuous,windowed,snapshot}.yaml rewritten to call the new
aggregations directly; sink schemas updated to FLOAT64 / VARSIZED;
header comments updated to FULL.
Docs: docs/berlinmod-streaming-forms.md updated to reflect 27/27 cells
full (was 21 full + 6 partial); MEOS-operators table now lists
PAIR_MEETING and CROSS_DISTANCE alongside the existing ones.
YAML safe_load green on all 6 rewritten Q5/Q9 cells. C++ follows the
established TemporalLength template from the parent MobilityDB#16; build
verification gated on the user's NebulaStream test harness.
… covered' section After PR MobilityDB#16 (TEMPORAL_LENGTH closes Q6) and PR MobilityDB#17 (PAIR_MEETING + CROSS_DISTANCE close Q5 + Q9), the parity matrix is 27/27 full — the doc's own coverage table at the top confirms it. But the section 'Not covered (15 cells / 5 queries)' at line 77 was a remnant from the pre-MobilityDB#16/MobilityDB#17 state and contradicts the rest of the doc. Remove it. Add a new 'Streaming-semantics tier overlay' section that classifies each BerlinMOD-Q by its streaming-execution tier (stateless / bounded-state / windowed / cross-stream) per the closed 7-value vocabulary proposed for the MEOS-API objectModel.streamingSemantics facet (see the sibling RFC on MEOS-API PR MobilityDB#10). The mapping makes the cross-binding picture explicit: a Q's tier on NebulaStream is the same tier on Flink / Kafka, and the table points to the equivalent generic wiring class on Flink for each tier. Two short follow-up notes explain why cross-stream looks different on NebulaStream (single-aggregation Cartesian enumeration vs Flink's interval-join across two streams — same semantic, different topology) and why Q7 is bounded-state rather than windowed (per-POI fan-out, per-(vehicle, POI) bounded state, no full-sequence reduction needed). Refresh the 'Sibling parity references' section to point at the current state of the Flink and Kafka work — Flink's per-tier wiring infrastructure under org.mobilitydb.flink.meos.wirings (5 generic classes covering 100% of the streamable surface) and Kafka's codegen mirror under org.mobilitydb.kafka.meos. Drops stale PR-number references per the same as-is / no-internal-process discipline applied elsewhere in the ecosystem docs. Stacks on PR MobilityDB#17. Docs-only; touches no YAML, no C++ pipeline-layer file.
The PAIR_MEETING aggregation (added in MobilityDB#17) hardcoded the meeting-distance threshold at 200 m via a static constexpr DMEET_METRES, with the PR body noting parameterization as future work. This PR lands that future work: PAIR_MEETING now takes a fifth argument — a numeric constant in metres — and the physical operator uses it per-query. ## Surface PAIR_MEETING(lon, lat, ts, vehicle_id, dMeet) ^^^^^ new fifth arg (numeric constant, metres) The first four args remain FieldAccess (lon, lat, ts, vehicle_id); the fifth is pulled from the parser's constantBuilder as a numeric literal, parsed via std::stod, and threaded through the logical→physical lowering chain into the lower() lambda alongside the existing state pointers. ## Files (9, all stacked on MobilityDB#18 → MobilityDB#17 → MobilityDB#16 → MobilityDB#15) | Layer | File | |---|---| | Physical .hpp | PairMeetingAggregationPhysicalFunction.hpp — `DMEET_METRES` constexpr → `DEFAULT_DMEET_METRES` + instance field `dMeetMetres` | | Physical .cpp | PairMeetingAggregationPhysicalFunction.cpp — constructor takes dMeet; lower() passes it to the captureless lambda via `nautilus::val<double>` | | Logical .hpp | PairMeetingAggregationLogicalFunction.hpp — constructor + create() factory take dMeet; getter `getDMeetMetres()` | | Logical .cpp | PairMeetingAggregationLogicalFunction.cpp — initialize field; Registrar deserialize path uses DEFAULT_DMEET_METRES (see Serde caveat below) | | Parser | AntlrSQLQueryPlanCreator.cpp — both PAIR_MEETING dispatch sites (lexer-token case + funcName string-name case) extract the constant from constantBuilder, std::stod it, pass to create() | | Lowering | LowerToPhysicalWindowedAggregation.cpp — pmDescriptor->getDMeetMetres() flows to the physical constructor | | YAMLs (×3) | Queries/berlinmod/q5_continuous.yaml, q5_snapshot.yaml, q5_windowed.yaml — add `, 200.0` as the explicit fifth arg; comments updated to reflect the parameterization | ## Serde round-trip caveat (out of scope for this PR) `AggregationLogicalFunctionRegistryArguments` is strongly typed to `vector<FieldAccessLogicalFunction>` — there is no slot for a numeric constant in the existing Registrar interface, and `SerializableAggregationFunction` has no proto field for it either. As a result: - The parser path (live query execution) is FULLY parameterized — dMeet flows from SQL to physical correctly. - The Serde deserialize path falls back to DEFAULT_DMEET_METRES (preserves the 200 m scaffold behaviour). Round-trip fidelity for the dMeet value requires (a) adding a new field to SerializableAggregationFunction.proto, (b) extending AggregationLogicalFunctionRegistryArguments to carry it, and (c) threading both through Serialize/Register. That's an infrastructure change touching every registered aggregation; tracked as a follow-up. ## Build / test verification Cannot compile-verify locally — NebulaStream needs the full C++23 + vcpkg toolchain. Submitted for maintainer build verification (cc @marianaGarcez). Expected to compile cleanly; the only construction-time behaviour change is the constructor signature (5 params → 6 params for physical, 5 → 6 for logical create/ctor); the only runtime behaviour change is that dMeet is now read from the instance field instead of the class constexpr (the lambda receives it via the nautilus::val<double> extra arg). ## Mirrors the CROSS_DISTANCE shape CROSS_DISTANCE (also added by MobilityDB#17, hardcoded VID_A=100, VID_B=200) has the exact same parameterization pattern; a sibling PR can apply the same change with (lon, lat, ts, vid, vid_a, vid_b) — 6 args total instead of 5. Holding for separate PR.
… args Sibling to PAIR_MEETING.dMeet parameterization (PR MobilityDB#19) — applies the same 4-layer pattern to CROSS_DISTANCE. The aggregation (added in MobilityDB#17) hardcoded the target vehicle pair at (100, 200) via static constexpr VID_A / VID_B, with the PR body noting parameterization as future work. This PR lands that future work: CROSS_DISTANCE now takes two unsigned- integer constants as its fifth and sixth arguments, and the physical operator uses them per-query. ## Surface CROSS_DISTANCE(lon, lat, ts, vehicle_id, vidA, vidB) ^^^^ ^^^^ new constants (uint64) The first four args remain FieldAccess; vidA and vidB are pulled from the parser's constantBuilder (two unsigned-integer literals), std::stoull them, and threaded through the logical→physical lowering chain into the lower() lambda alongside the existing state pointer. ## Files (9, same shape as PR MobilityDB#19's PAIR_MEETING change) | Layer | File | |---|---| | Physical .hpp | CrossDistanceAggregationPhysicalFunction.hpp — `VID_A/B` constexpr → `DEFAULT_VID_A/B` + instance fields `vidA/B` | | Physical .cpp | CrossDistanceAggregationPhysicalFunction.cpp — constructor takes both; lift-time lambda gets them via `nautilus::val<uint64_t>` | | Logical .hpp | CrossDistanceAggregationLogicalFunction.hpp — constructor + create() factory + getters | | Logical .cpp | CrossDistanceAggregationLogicalFunction.cpp — initialize fields; Registrar deserialize falls back to defaults | | Parser | AntlrSQLQueryPlanCreator.cpp — both CROSS_DISTANCE dispatch sites extract two constants, std::stoull both, pass to create() | | Lowering | LowerToPhysicalWindowedAggregation.cpp — cdDescriptor->getVidA()/getVidB() flow to physical constructor | | YAMLs (×3) | Queries/berlinmod/q9_continuous.yaml, q9_snapshot.yaml, q9_windowed.yaml — add `, 100, 200` as explicit constants; comments updated | ## Serde round-trip caveat (same as PR MobilityDB#19) `AggregationLogicalFunctionRegistryArguments` is strongly typed to `vector<FieldAccessLogicalFunction>` — no slot for integer constants. `SerializableAggregationFunction.proto` has no field for them. So: - Parser path (live query execution) is FULLY parameterized. - Serde deserialize path falls back to `DEFAULT_VID_A` / `DEFAULT_VID_B` (preserves the 100, 200 scaffold defaults). Same infrastructure follow-up would close both round-trip gaps at once (PAIR_MEETING.dMeet and CROSS_DISTANCE.vidA/vidB). ## Build / test verification Same as PR MobilityDB#19 — submitted for maintainer build verification (@marianaGarcez). Constants now flow through std::stoull instead of std::stod; lambda gets two nautilus::val<uint64_t> args instead of one nautilus::val<double>. Pattern is structurally identical.
estebanzimanyi
added a commit
to estebanzimanyi/MobilityNebula
that referenced
this pull request
May 21, 2026
…codegen path Closes the Nebula structural parity gap with Flink/Kafka by shipping the codegen infrastructure for generating per-MEOS-function pipeline tuples (logical + physical + parser + lowering). No generated C++ committed in this PR — the maintainer (cc @marianaGarcez) runs the generator on a chosen MEOS-function batch, reviews output, ships operators in follow-up PRs at a controlled pace. Why no generated code in this PR: - Generator author cannot build NebulaStream (full C++23 + vcpkg toolchain not available in author's environment); shipping unverified generated code would risk batched-broken operators. - Per-function review value: maintainer iterates on templates with the first batch's build feedback before scaling up. - Template iteration cost: first-pass templates may need adjustment after first build; smaller blast radius if only the generator lands. What lands: - tools/codegen/codegen_nebula.py — Python generator with embedded C++ templates derived 1:1 from the hand-written TemporalEDWithinGeometry operator shape (logical/physical/.hpp/.cpp) - tools/codegen/codegen_input.example.json — first-wave input list (5 spatial-relation E/A predicates: EDisjoint, ATouches, ECovers, ACrosses, EOverlaps over tgeo_geo) - tools/codegen/README.md — full design proposal: why codegen, what the generator produces, recommended scaling-wave sequence (W1-W5), what the generator does NOT do (CMakeLists / parser / grammar remain manual paste for idempotence), compile-verification note Smoke-verified: the generator runs locally + emits 5 operators × 4 files = 20 well-formed C++ source files; templates produce syntactically-reasonable output matching the existing operator style. Scaling path (recommended sequence): - W1: 5 spatial-relation E/A predicates (the example input) — first follow-up PR - W2: All ever/always spatial-relation predicates over tgeo_geo (~18 functions) — second follow-up PR - W3: Distance functions over tgeo_geo and tgeo_tgeo (~30) — third - W4: Scalar accessors that decompose to per-event reads — template extension required - W5: Aggregations (windowed/cross-stream) — separate generator with the aggregation-specific 4-layer pattern Stacks on PR MobilityDB#20. Tools-only; touches no operator code, no CMakeLists, no parser/grammar.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sibling to PR #19 (PAIR_MEETING.dMeet parameterization) — applies the same 4-layer pattern to CROSS_DISTANCE. The aggregation (added in #17) hardcoded the target vehicle pair at (100, 200) via static
constexpr VID_A/VID_B, with the PR body noting parameterization as future work. This PR lands that future work.cc @marianaGarcez — same build-verification ask as PR #19. My environment can't run NebulaStream's vcpkg + C++23 toolchain. Structurally identical to #19's pattern (3 differences:
std::stoullvsstd::stod,uint64_tvsdouble, 2 constants vs 1).Surface change
9 files modified (same pattern as PR #19)
nes-physical-operators/include/Aggregation/Function/Meos/CrossDistanceAggregationPhysicalFunction.hppstatic constexpr VID_A/B→DEFAULT_VID_A/B+ instanceuint64_t vidA, vidB; constructor takes bothnes-physical-operators/src/Aggregation/Function/Meos/CrossDistanceAggregationPhysicalFunction.cppnautilus::val<uint64_t>and uses them in the per-eventif (vid == vidAArg) … else if (vid == vidBArg)branchesnes-logical-operators/include/Operators/Windows/Aggregations/Meos/CrossDistanceAggregationLogicalFunction.hppcreate()factory take(vidA, vidB); gettersnes-logical-operators/src/Operators/Windows/Aggregations/Meos/CrossDistanceAggregationLogicalFunction.cppDEFAULT_VID_A/B(Serde caveat below)nes-sql-parser/src/AntlrSQLQueryPlanCreator.cppconstantBuilder(popped in reverse: vidB first then vidA, since they're pushed in source order);std::stoullboth; pass tocreate()nes-query-optimizer/src/RewriteRules/LowerToPhysical/LowerToPhysicalWindowedAggregation.cppcdDescriptor->getVidA() / getVidB()→ physical constructorQueries/berlinmod/q9_continuous.yaml,q9_snapshot.yaml,q9_windowed.yaml, 100, 200as explicit 5th & 6th args; comments updatedYAML parse-validated locally (
python3 -c "import yaml; yaml.safe_load(...)"on each).Serde caveat (same as PR #19)
AggregationLogicalFunctionRegistryArgumentsis strongly typed tovector<FieldAccessLogicalFunction>— no slot for integer constants.SerializableAggregationFunction.protohas no field for them. So:DEFAULT_VID_A=100, DEFAULT_VID_B=200(preserves scaffold defaults).The same infrastructure follow-up (Registrar args struct + proto schema) would close BOTH this PR's and PR #19's round-trip gaps in one commit.
What needs maintainer attention (same list as PR #19)
nautilus::val<uint64_t>lambda-arg pattern — extrapolated from PR feat(meos): parameterize PAIR_MEETING dMeet via SQL constant fifth arg (stacks on #18) #19'snautilus::val<double>extrapolation; if PR feat(meos): parameterize PAIR_MEETING dMeet via SQL constant fifth arg (stacks on #18) #19's pattern works, this one does too (and vice-versa)Stacking
Stacks on
feat/pair-meeting-dmeet-parameterized(PR #19) →docs/berlinmod-streaming-semantics-tier-overlay(PR #18) →feat/pair-meeting-cross-distance-aggregations(PR #17) →feat/temporal-length-aggregation(PR #16) →feat/berlinmod-streaming-forms(PR #15).After this PR, the two named future-PR items from PR #17's body —
dMeetparameterization and(VID_A, VID_B)parameterization — are both addressed. The Nebula BerlinMOD-9 × 3-form parity matrix stays 27/27 full and now exposes both per-query constants through SQL.