Skip to content

[Feature] Materialized View DDL #18544

Open
hongkunxu wants to merge 3 commits into
apache:masterfrom
hongkunxu:feat/sse_mv_view_creation
Open

[Feature] Materialized View DDL #18544
hongkunxu wants to merge 3 commits into
apache:masterfrom
hongkunxu:feat/sse_mv_view_creation

Conversation

@hongkunxu
Copy link
Copy Markdown
Contributor

@hongkunxu hongkunxu commented May 20, 2026

[Feature] Materialized View DDL

Summary

Adds Pinot-native SQL DDL for managing Materialized Views (MV) end-to-end, served through the existing POST /sql/ddl controller endpoint introduced by the Table DDL PR. Four new statements:

-- Column list is OPTIONAL. When omitted, the compiler infers all columns
-- from the AS <query> projection (recommended). When present, all columns
-- must be declared explicitly. Mixing is not allowed.
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]name
[ ( col TYPE [DIMENSION|METRIC|DATETIME FORMAT '..' GRANULARITY '..']
        [NOT NULL|NULL] [DEFAULT lit], ... ) ]
[REFRESH EVERY '<period>']
PROPERTIES ('timeColumnName' = 't', 'bucketTimePeriod' = '1d', ...)
AS <Pinot SELECT>

SHOW MATERIALIZED VIEWS [FROM db]              -- list MVs in the (database-scoped) cluster
SHOW CREATE MATERIALIZED VIEW [db.]name        -- no TYPE clause (MV is always OFFLINE)
DROP MATERIALIZED VIEW [IF EXISTS] [db.]name   -- no TYPE clause

Before this PR, MVs could only be authored / inspected / torn down via the JSON POST /tables path — verbose, error-prone, and impossible to copy-paste between environments. This PR makes MVs a first-class SQL DDL surface with a deterministic round-trip: CREATE → ZK → SHOW CREATE produces a statement that re-parses to the same TableConfig.

Labels: feature, release-notes (new SQL grammar + new controller error contracts), sql-compliance, materialized-view.

Scope & alignment with prior PRs

This MV DDL is implemented on top of the following two PRs. Its functional scope is intentionally kept aligned with theirs — anything outside that joint baseline is out of scope here and tracked on the respective upstream layer instead.


DDL 1: CREATE MATERIALIZED VIEW

There are two ways to author the column list. AS-driven inference is the recommended path. Use the explicit form only if you need to override a specific column type (e.g. tighten a LONG to INT).

Two ways, same MV

The following two statements produce a byte-identical persisted TableConfig + Schema:

Inferred (recommended) Explicit
CREATE MATERIALIZED VIEW clickstream_daily_event_counts
REFRESH EVERY '15m'
PROPERTIES (
'timeColumnName' = 'event_day',
'bucketTimePeriod' = '1d'
)
AS
SELECT DATETRUNC('DAY', ts * 1000) AS event_day,
event_name,
COUNT(*) AS cnt
FROM clickstreamFunnel
GROUP BY event_day, event_name;
CREATE MATERIALIZED VIEW clickstream_daily_event_counts (
event_day TIMESTAMP NOT NULL,
event_name STRING NOT NULL,
cnt LONG NOT NULL
)
REFRESH EVERY '15m'
PROPERTIES (
'timeColumnName' = 'event_day',
'bucketTimePeriod' = '1d'
)
AS
SELECT DATETRUNC('DAY', ts * 1000) AS event_day,
event_name,
COUNT(*) AS cnt
FROM clickstreamFunnel
GROUP BY event_day, event_name;

No mixing

The column list is all-or-nothing:

  • All inferred — omit the (...) column list entirely; every column is derived from the AS <query> projection.
  • All explicit — declare every projected column; the projection arity, names, and ordering must line up exactly.

A column list that covers only some of the projection columns (or that adds extra columns the projection does not produce) is rejected at compile time with HTTP 400 and a clear pointer to either drop the column list or complete it.

Inferred type & role mapping

When the column list is omitted, each SELECT projection item maps to an MV column as follows:

Projection shape (in AS <query>) Inferred MV type Inferred role Notes
col (non-time identity passthrough) inherits from source inherits (DIMENSION / METRIC / DATETIME) nullability, default, and DATETIME format/granularity also inherited from the source schema
col AS t where t = timeColumnName (time-column passthrough) inherits from source DATETIME format/granularity inherited; emitted as NOT NULL
DATETRUNC('unit', ts[, ...]) AS t where t = timeColumnName TIMESTAMP DATETIME format 1:MILLISECONDS:TIMESTAMP, granularity derived from the truncation unit; NOT NULL
ts * f1 [* f2 …] AS t where t = timeColumnName (arithmetic scaling) TIMESTAMP DATETIME format 1:MILLISECONDS:TIMESTAMP, granularity 1:MILLISECONDS; NOT NULL
SUM(expr) / MIN(expr) / MAX(expr) DOUBLE METRIC NOT NULL
COUNT(expr) / COUNT(*) LONG METRIC NOT NULL
DISTINCTCOUNTRAWHLL(expr) / DISTINCTCOUNTRAWHLLPLUS(expr) / DISTINCTCOUNTRAWTHETASKETCH(expr) BYTES METRIC NOT NULL. Raw sketches are stored as serialized bytes; both engines surface them as hex-encoded STRING to query clients, but the MV column must be BYTES for the rewrite engine to re-aggregate them correctly.

Aggregations not in the table above (e.g. AVG, DISTINCTCOUNT, custom UDAFs) are rejected at compile time with HTTP 400 and a pointer to the supported set. Multi-value source columns are rejected (MV does not currently support multi-value).

REFRESH EVERY '<period>'

Optional. Registers a per-MV Quartz cron under task.MaterializedViewTask.schedule. When omitted, the MV runs under the cluster-wide PinotTaskManager schedule. <period> is a single positive integer followed by a unit char m / h / d:

Period Range Notes
'Nm' 1 – 59 Every N minutes, fires at second 0.
'Nh' 1 – 23 Every N hours, fires at minute 0.
'Nd' 1 – 28 Every N days; cycle resets at day 1 of each month.

Rejected at parse / compile time (HTTP 400):

  • '60m', '24h', '29d' and above — use the next-larger unit.
  • '0d' / '-1m' / '1.5h' — non-positive or non-integer.
  • '1w' / '1y' / '30s' / 'every Monday' — unsupported unit.
  • Hand-written cron expressions — there is no DDL syntax for arbitrary cron. Operators needing irregular schedules use the JSON API; SHOW CREATE MATERIALIZED VIEW on such an MV returns 400 rather than silently emitting an un-round-trippable DDL.

For convenience, EVERY { MINUTE[S] | HOUR[S] | DAY[S] } is accepted as a sugared form (e.g. REFRESH EVERY 15 MINUTES) and normalized to '15m' internally.

PROPERTIES

Single flat string-to-string map. Keys are case-insensitive on input; canonical casing is restored on the wire.

Required

Key Type / format Description
timeColumnName identifier Name of the MV column used for watermark progression. Must match a DATETIME column produced by the projection. The analyzer rejects LONG / INT time columns at create time; in the inferred form, the time column must be either an identity passthrough of a TIMESTAMP source, a DATETRUNC call, or arithmetic-scaling.
bucketTimePeriod period ('1h', '1d', '7d', …) Width of the materialization window. Determines watermark advance step and per-task time-range filter. Must be > 0.

Optional MV task knobs

All can be written either bare (preferred) or with the task.MaterializedViewTask. prefix. SHOW CREATE always emits the bare form.

Key Type Default Description
bufferTimePeriod period empty (no buffer) Lag between wall clock and the watermark when scheduling new APPEND windows. Must be ≥ 0.
maxNumRecordsPerSegment integer 5000000 Max rows per generated MV segment.
maxTasksPerBatch integer 4 (hard cap 1000) How many APPEND windows the scheduler dispatches per generator cycle.
stalenessThresholdMs long ms 0 (SLO disabled) Per-MV freshness SLO. When (now - watermarkMs) > stalenessThresholdMs, the broker excludes this MV from query-rewrite. 0 means no check.
taskMode string scheduler-managed Reserved. Setting it has no effect on scheduling; kept on the allow-list so existing JSON-API tables round-trip.

Optional table-level promoted knobs

Key Type Default Description
replication integer cluster default Number of OFFLINE replicas.
brokerTenant identifier DefaultTenant Broker tenant.
serverTenant identifier DefaultTenant Server tenant.
timeType string inferred from column-level FORMAT '..' Almost never needed. Accepted for parity with CREATE TABLE.

Rejected at compile time (HTTP 400)

Key Why
schedule (bare or task.MaterializedViewTask.schedule) Reserved; use REFRESH EVERY '<period>'.
definedSQL (bare or task.MaterializedViewTask.definedSQL) Reserved; written from the AS <query> clause.
streamType, stream.*, realtime.* MV is always OFFLINE.
tableType, tableName, ifNotExists Reserved DDL-clause names.

Limitations enforced at compile time

Restriction Why
MV cannot share a name with an existing OFFLINE or REALTIME table An MV is an OFFLINE table; collisions would create undefined behavior.
MV source table cannot be REALTIME (or the REALTIME half of a hybrid) MV consistency tracking does not yet observe realtime commits.
AS <query> rejects JOIN, SELECT *, and DML (INSERT / UPDATE / DELETE / MERGE) Not currently supported by the MV task generator and the broker rewrite engine.
Multi-value source columns are not supported in MVs Same reason as above.
Sub-minute REFRESH EVERY and arbitrary cron expressions Sub-minute cadence is operationally hostile; arbitrary cron has no DDL syntax — use the JSON API.

DDL 2: SHOW MATERIALIZED VIEWS [FROM db]

Lists raw MV names (no _OFFLINE suffix) in the database scope. Identification is driven by the canonical TableConfig.isMaterializedView() flag. MVs continue to appear in SHOW TABLES (they are physically OFFLINE tables); SHOW MATERIALIZED VIEWS is the type-narrowed view of the same catalog.

SHOW MATERIALIZED VIEWS;                -- current database (defaults to "default")
SHOW MATERIALIZED VIEWS FROM analytics; -- override the database

DDL 3: SHOW CREATE MATERIALIZED VIEW [db.]name

Returns the canonical, re-parseable DDL for the named MV. Always emits the explicit-column form for round-trip stability — the output is byte-identical regardless of whether the MV was originally created with an inferred or explicit column list.

SHOW CREATE MATERIALIZED VIEW analytics.clickstream_daily_funnel;

Emission rules:

  • Properties emitted in lexicographic order.
  • MV task knobs flattened back to bare form (matches the canonical input shape).
  • REFRESH EVERY '<period>' emitted only when the stored cron round-trips via the inverse mapping; non-standard hand-crafted cron expressions trigger HTTP 400 with a pointer back to the JSON API rather than silently dropping the schedule.
  • task.MaterializedViewTask.definedSQL is preserved as the raw user-typed substring of AS <query> and re-emitted verbatim — no Calcite re-rendering, no dialect drift.
  • No TYPE clause is accepted (a trailing TYPE OFFLINE fails parsing rather than being silently swallowed).

emit → parse → emit is exercised end-to-end.


DDL 4: DROP MATERIALIZED VIEW [IF EXISTS] [db.]name

Tears down the MV, including its definition znode, runtime watermark znode, and Helix resource. No TYPE clause.

IF EXISTS swallows a missing MV (200 no-op). It does not swallow type confusion: if the name resolves to a plain OFFLINE table, the call returns 400 pointing at DROP TABLE.

DROP MATERIALIZED VIEW IF EXISTS analytics.clickstream_daily_funnel;

More examples

Sub-hour refresh + bounded staleness SLO (inferred)

CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.clickstream_minute_dau
REFRESH EVERY '1m'
PROPERTIES (
    'timeColumnName'        = 'minute_ts',
    'bucketTimePeriod'      = '1m',
    'bufferTimePeriod'      = '30s',
    'stalenessThresholdMs'  = '60000'
)
AS
SELECT DATETRUNC('MINUTE', ts * 1000) AS minute_ts,
       COUNT(*)                       AS dau
FROM clickstreamFunnel
GROUP BY minute_ts;

Heavy daily roll-up with aggressive back-fill and segment sizing (inferred)

CREATE MATERIALIZED VIEW analytics.clickstream_daily_funnel
REFRESH EVERY '15m'
PROPERTIES (
    'timeColumnName'          = 'event_day',
    'bucketTimePeriod'        = '1d',
    'bufferTimePeriod'        = '1h',
    'maxNumRecordsPerSegment' = '2000000',
    'maxTasksPerBatch'        = '16',
    'replication'             = '2'
)
AS
SELECT DATETRUNC('DAY', ts * 1000) AS event_day,
       event_name,
       user_id,
       COUNT(*)                    AS event_count,
       SUM(dwell_ms)               AS total_dwell_ms
FROM clickstreamFunnel
GROUP BY event_day, event_name, user_id;

POSTing DDL to the controller

curl -sS -u "$PINOT_USER:$PINOT_PASS" \
     -H 'Content-Type: application/json' \
     -H 'Database: analytics' \
     -d "{\"sql\": \"$(< create_mv.sql sed 's/"/\\"/g' | tr -d '\n')\"}" \
     "$PINOT_CONTROLLER/sql/ddl"

Strict TABLE / MATERIALIZED VIEW partitioning at the REST boundary

The DDL verbs refuse to act on the wrong target type — enforced symmetrically:

Verb on wrong target HTTP Error message
SHOW CREATE TABLE on an MV 400 "… is a materialized view. Use 'SHOW CREATE MATERIALIZED VIEW …'"
SHOW CREATE MATERIALIZED VIEW on a plain table 400 "… is not a materialized view. Use 'SHOW CREATE TABLE …'"
DROP TABLE on an MV 400 "… is a materialized view. Use 'DROP MATERIALIZED VIEW …'"
DROP MATERIALIZED VIEW on a plain table 400 "… is not a materialized view. Use 'DROP TABLE …'"
DROP MATERIALIZED VIEW IF EXISTS on a plain table 400 IF EXISTS is about absence, not type confusion.

REST endpoint contract

All MV DDL operations are sent to the same endpoint as the Table DDL PR:

POST /sql/ddl[?dryRun=true|false]
Content-Type: application/json
Authorization: <Bearer | Basic …>
Database: <optional database name>

{ "sql": "<single DDL statement>" }

Response codes (MV-specific outcomes)

Code Operation Meaning
200 OK DROP / SHOW / IF [NOT] EXISTS no-op / dry-run Operation succeeded.
201 Created CREATE MATERIALIZED VIEW MV was created and persisted to ZK.
400 Bad Request All Parse / semantic error; period out of range; type confusion; REALTIME source; unsupported AS <query> shape; unsupported aggregation; mixed (partial-explicit) column list; non-round-trippable cron on SHOW CREATE.
404 Not Found DROP / SHOW CREATE MV not found (DROP without IF EXISTS, or SHOW CREATE on a missing MV).
409 Conflict CREATE Same-name table/MV already exists without IF NOT EXISTS; race lost to a concurrent writer.
500 Internal Server Error All Helix/ZK inconsistency or controller defect.

The full response shape (operation, tableName, databaseName, schema, tableConfig, warnings, message, …) is the same as the Table DDL PR. SHOW MATERIALIZED VIEWS populates the existing tableNames field.

Dry-run

?dryRun=true compiles + validates and returns the would-be Schema + TableConfig, but persists nothing. Useful for CI / pre-flight checks. Inferred column lists are fully resolved during dry-run, so operators can preview the inferred Schema before committing.


Authorization

MV DDLs reuse the same action constants as their Table DDL counterparts (CREATE_TABLE, GET_TABLE_CONFIG, DELETE_TABLE). Introducing new MV-specific actions would silently lock operators with CREATE_TABLE out of MV creation during rolling upgrade. Authorization happens after database-name translation, against the post-translation resource.


Rolling-upgrade safety

The DDL feature is purely additive. No SPI signature, no enum, no TableConfig / Schema field, and no ZK property-store path is renamed or removed. Persisted artifacts produced by POST /sql/ddl are shape-identical to those produced by the existing POST /tables and POST /schemas endpoints — old brokers, servers, and controllers read DDL-created MVs exactly as they read JSON-API-created MVs.

  • Pre-DDL controllers return 404 on POST /sql/ddl. There is no half-state.
  • New SQL keywords (MATERIALIZED, REFRESH, VIEWS) are added under nonReservedKeywordsToAdd. Existing DQL queries using them as identifiers continue to parse on the new binary.
  • Rollback to a pre-DDL controller continues to serve DDL-created MVs identically to JSON-API MVs.

Test plan

  • Parser tests (PinotDdlParserTest) — CREATE MATERIALIZED VIEW all syntactic variants, including the optional column list (present / empty / mixed-rejection); REFRESH EVERY quoted-period and sugared (N MINUTES / N HOURS / N DAYS) forms; SHOW MATERIALIZED VIEWS [FROM db]; SHOW CREATE MATERIALIZED VIEW; DROP MATERIALIZED VIEW [IF EXISTS]; rejection of TYPE on MV-specific SHOW/DROP.
  • Compiler tests (DdlCompilerMaterializedViewTest, DdlCompilerMaterializedViewInferenceTest) — every PROPERTIES routing rule; period → Quartz cron mapping including boundary rejections; AS <query> raw-substring preservation; rejection of JOIN, SELECT *, DML, undefined-column references; partial-column-list rejection; unsupported aggregation rejection; multi-value source column rejection.
  • Inferer tests (MultiStageMaterializedViewSchemaInfererTest) — every projection shape in the type/role mapping table, including all 7 supported aggregations (sketches assert BYTES), all three supported time-column expression shapes (identity passthrough, DATETRUNC, arithmetic scaling), and the corresponding negative cases.
  • Reverse-DDL tests (CanonicalDdlEmitterMaterializedViewTest, MaterializedViewPropertyExtractorTest) — canonical clause order; lexicographic property ordering; non-round-trippable cron rejection; MV task knob flattening; SHOW CREATE always emits the explicit-column form regardless of how the MV was created.
  • Round-trip tests (RoundTripTest) — emit → parse → compile → emit idempotence across all PROPERTIES routing rules + the REFRESH EVERY clause + both column-list forms.
  • Controller unit tests (PinotDdlRestletResourceMaterializedViewUnitTest) — CREATE / SHOW CREATE / SHOW MATERIALIZED VIEWS / DROP dispatch, response shape, authorization (403 on permission deny), 400 on strict type-partitioning violations.
  • Resource manager tests (PinotHelixResourceManagerMaterializedViewListingTest) — getAllRawMaterializedViewNames filters by isMaterializedView(), skips REALTIME, tolerates corrupted znodes, scopes by database.
  • Integration tests (MaterializedViewDdlAnalyzerIntegrationTest) — DDL creates an MV that real MaterializedViewAnalyzer accepts on a live cluster, end-to-end for both inferred and explicit column-list forms.
  • Manual end-to-end on a local Pinot cluster — full lifecycle (create / show / drop) for both inferred and explicit forms, all 7 supported aggregations, mixed-mode rejection messages.

All tests pass. ./mvnw spotless:apply checkstyle:check license:format license:check clean across pinot-common, pinot-sql-ddl, pinot-controller, and pinot-materialized-view.

@hongkunxu hongkunxu changed the title [feature] MATERIALIZED VIEW DDL — CREATE / SHOW CREATE / DROP [Feature] MATERIALIZED VIEW DDL — CREATE / SHOW CREATE / DROP May 20, 2026
@hongkunxu hongkunxu changed the title [Feature] MATERIALIZED VIEW DDL — CREATE / SHOW CREATE / DROP [Feature] MATERIALIZED VIEW DDL May 20, 2026
@hongkunxu hongkunxu changed the title [Feature] MATERIALIZED VIEW DDL [Feature] Materialized View DDL May 20, 2026
@xiangfu0 xiangfu0 added sql-compliance Related to SQL standard compliance materialized-view labels May 20, 2026
@hongkunxu hongkunxu force-pushed the feat/sse_mv_view_creation branch from 8a64264 to 2d063ee Compare May 20, 2026 16:52
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 20, 2026

Codecov Report

❌ Patch coverage is 72.30769% with 378 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.36%. Comparing base (4688318) to head (93b02a9).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
...troller/api/resources/PinotDdlRestletResource.java 71.53% 58 Missing and 18 partials ⚠️
...sql/ddl/inferer/MaterializedViewSchemaInferer.java 66.66% 35 Missing and 20 partials ⚠️
...ntroller/helix/core/PinotHelixResourceManager.java 62.29% 37 Missing and 9 partials ⚠️
.../org/apache/pinot/sql/ddl/compile/DdlCompiler.java 72.89% 28 Missing and 17 partials ⚠️
...ql/ddl/compile/MaterializedViewPropertyRouter.java 79.71% 16 Missing and 12 partials ⚠️
...timeexpr/MaterializedViewTimeExpressionParser.java 70.11% 12 Missing and 14 partials ⚠️
...parsers/parser/SqlPinotCreateMaterializedView.java 36.58% 26 Missing ⚠️
...ata/MaterializedViewDefinitionMetadataBuilder.java 57.50% 0 Missing and 17 partials ⚠️
...che/pinot/sql/ddl/reverse/CanonicalDdlEmitter.java 85.10% 4 Missing and 10 partials ⚠️
...w/metadata/MaterializedViewDefinitionMetadata.java 65.21% 4 Missing and 4 partials ⚠️
... and 9 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18544      +/-   ##
============================================
+ Coverage     64.26%   64.36%   +0.10%     
- Complexity     1137     1302     +165     
============================================
  Files          3335     3355      +20     
  Lines        205809   207245    +1436     
  Branches      32106    32381     +275     
============================================
+ Hits         132256   133392    +1136     
- Misses        62914    63105     +191     
- Partials      10639    10748     +109     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.36% <72.30%> (+0.10%) ⬆️
temurin 64.36% <72.30%> (+0.10%) ⬆️
unittests 64.36% <72.30%> (+0.10%) ⬆️
unittests1 56.81% <64.01%> (+0.07%) ⬆️
unittests2 37.05% <70.98%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@hongkunxu hongkunxu force-pushed the feat/sse_mv_view_creation branch 5 times, most recently from d66cc70 to 4032804 Compare May 22, 2026 17:34
@hongkunxu hongkunxu marked this pull request as ready for review May 22, 2026 17:34
@xiangfu0
Copy link
Copy Markdown
Contributor

High-signal issue in the source-table validation path: / still accepts a raw source name when both and exist, because the resolver prefers the OFFLINE table and only rejects the source when OFFLINE is absent. That means can still be persisted for a hybrid source even though the stored is replayed later against the raw table name. The downstream broker/scheduler query can then read REALTIME rows while validation, fingerprinting, and STALE-marking checks only covered the OFFLINE half. Since realtime commits still do not notify the MV consistency manager, this can silently drift the MV. Please reject raw names that have both OFFLINE and REALTIME variants, or require an explicitly typed source table before persisting the MV.

@xiangfu0
Copy link
Copy Markdown
Contributor

High-signal issue in the source-table validation path: validateSourceTable(...) / resolveSourceTableWithType(...) still accepts a raw source name when both foo_OFFLINE and foo_REALTIME exist, because the resolver prefers the OFFLINE table and only rejects the source when OFFLINE is absent. That means CREATE MATERIALIZED VIEW ... AS SELECT ... FROM foo can still be persisted for a hybrid source even though the stored definedSQL is replayed later against the raw table name. The downstream broker/scheduler query can then read REALTIME rows while validation, fingerprinting, and STALE-marking checks only covered the OFFLINE half. Since realtime commits still do not notify the MV consistency manager, this can silently drift the MV. Please reject raw names that have both OFFLINE and REALTIME variants, or require an explicitly typed source table before persisting the MV.

@hongkunxu hongkunxu force-pushed the feat/sse_mv_view_creation branch from da3a54f to aaf07a2 Compare May 24, 2026 14:32
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 high-signal issue; see inline comment.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 63 out of 63 changed files in this pull request and generated 3 comments.

@hongkunxu hongkunxu force-pushed the feat/sse_mv_view_creation branch from 0e60edf to d79b906 Compare May 27, 2026 03:04
…SHOW CREATE / DROP

Adds Pinot-native SQL DDL for managing materialized views end to end,
served through the same POST /sql/ddl endpoint as plain-table DDL.

Supported statements:

    CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]name [(<column list>)]
    [REFRESH EVERY 'period']
    PROPERTIES ('timeColumnName' = 't', 'bucketTimePeriod' = '1d', ...)
    AS <Pinot SELECT>

    SHOW MATERIALIZED VIEWS
    SHOW CREATE MATERIALIZED VIEW [db.]name
    DROP MATERIALIZED VIEW [IF EXISTS] [db.]name

CREATE MATERIALIZED VIEW supports two forms:

  - Explicit column list: every column is declared in the DDL.
  - Inferred column list: the column list is omitted and the storage
    schema is derived from the AS <SELECT>.

Both forms produce identical materialized-view storage from the data
plane's perspective. SHOW CREATE MATERIALIZED VIEW always renders the
canonical form with an explicit column list, so the round-trip through
DDL is stable in either case.

Co-authored-by: Xiang Fu <xiangfu.1024@gmail.com>
Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@hongkunxu hongkunxu force-pushed the feat/sse_mv_view_creation branch from d79b906 to d6a3ab2 Compare May 27, 2026 03:09
Polish on top of the MV DDL feature commit. Summary:

Simplifications:
- Delete MaterializedViewSchemaInferer interface + factory; collapse the
  single impl into a final class with a static infer() method.
- Inline MaterializedViewPropertyExtractor.isMaterializedView() to direct
  TableConfig.isMaterializedView() calls; delete the wrapper.
- Collapse three identical IF NOT EXISTS race-recovery checks into
  mvIfNotExistsNoOpResponse(...) helper.
- Replace MaterializedViewPropertyRouter.canonicalTaskKnobKeys() Set lookup
  with a single canonicalKnobName(lowerCasedKnob) helper that does both
  membership and canonical-cased return.

Bug fixes:
- Canonicalize task.MaterializedViewTask.* prefix-form keys in the router
  the same way bare-form keys are canonicalized, eliminating an asymmetry
  where 'BUCKETTIMEPERIOD' bare and 'task.MaterializedViewTask.BUCKETTIMEPERIOD'
  prefixed landed under different on-wire keys.
- executeDdl no longer catches RuntimeException as 400. Programmer errors
  now propagate to JAX-RS as 500 so monitoring fires.
- Wrap IllegalStateException from RequestUtils.extractAliasOrIdentifierName
  as DdlCompilationException in the inferer so unaliased aggregations
  surface as 400 with guidance, not 500.
- Verify isMaterializedView() in IF NOT EXISTS race-catch paths: a concurrent
  plain OFFLINE table create no longer fools MV DDL into returning 200.
- Restore task schedules when deleteTable rejects DROP (e.g. blocked by
  dependent MV), via TaskCleanupRestorer; suppress the misleading
  "schedules need manual restoration" hint after a successful restore.
- Skip orphan MV definition znodes (TableConfig missing or non-MV) in the
  consistency manager's reverse-index rebuild, with WARN log; prevents ghost
  MV resurrection after a partial DROP.
- Reject malformed/negative stalenessThresholdMs in the analyzer at CREATE
  time; explicit 0 (the documented "no SLO" default) is accepted.
- Thread request-header database into DdlCompileContext so the MV inferer
  resolves FROM clauses against the operator's intended DB, not the
  cluster default.

Tests:
- Add regression test for the bare/prefix case canonicalization.
- Add regression test for unaliased-aggregation DdlCompilationException.
- Add regression test for the orphan-znode skip branch.
- Add regression tests for stalenessThresholdMs analyzer validation
  (malformed, negative, explicit 0).
- Add regression test for the race-catch path where a plain OFFLINE
  table races in under IF NOT EXISTS — must surface as 409, not 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xiangfu0
Copy link
Copy Markdown
Contributor

Pushed 93b02a98d5 as a follow-up review/polish commit. Summary:

Simplifications (~410 net lines removed)

  • Delete MaterializedViewSchemaInferer interface + factory; collapse the single impl into a final class with a static infer().
  • Inline MaterializedViewPropertyExtractor.isMaterializedView() to direct TableConfig.isMaterializedView() calls.
  • Collapse 3 identical IF NOT EXISTS race-recovery checks into mvIfNotExistsNoOpResponse(...).
  • Replace MaterializedViewPropertyRouter.canonicalTaskKnobKeys() Set lookup with a single canonicalKnobName(lowerCasedKnob) helper.

Bug fixes

  • Canonicalize task.MaterializedViewTask.* prefix-form keys in the router so bare and prefixed forms land under the same on-wire key.
  • executeDdl no longer catches RuntimeException as 400 — programmer errors now surface as 500 so monitoring fires.
  • Wrap IllegalStateException from RequestUtils.extractAliasOrIdentifierName as DdlCompilationException so unaliased aggregations (e.g. SELECT SUM(x) FROM …) surface as 400 with guidance, not 500.
  • IF NOT EXISTS race-catch paths verify isMaterializedView() — a concurrent plain OFFLINE table create no longer fools MV DDL into returning 200.
  • TaskCleanupRestorer restores task schedules when deleteTable rejects DROP (e.g. blocked by dependent MV); the misleading "schedules need manual restoration" hint is suppressed after a successful restore.
  • MaterializedViewConsistencyManager.rebuildReverseIndex skips orphan definition znodes (TableConfig missing or non-MV) with a WARN log; prevents ghost MV resurrection after a partial DROP.
  • MaterializedViewAnalyzer validates stalenessThresholdMs at CREATE time (rejects malformed/negative; explicit 0 for "no SLO" is accepted).
  • HTTP-header database threaded into DdlCompileContext so the MV inferer resolves FROM against the operator's intended DB instead of the cluster default.

Tests added

  • Prefix-form case canonicalization, unaliased aggregation, orphan-znode skip, stalenessThresholdMs validation (malformed/negative/zero), IF NOT EXISTS race-catch against a plain OFFLINE table.

All affected modules pass mvn test + spotless:apply checkstyle:check license:check locally. Waiting on full CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

materialized-view sql-compliance Related to SQL standard compliance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants