Skip to content

[SPARK-56643][CONNECT][TESTS] Add DSv2 temp view with stored plan tests using Spark Connect mode#55571

Open
longvu-db wants to merge 29 commits into
apache:masterfrom
longvu-db:dsv2-connect-pr1-temp-views-clone
Open

[SPARK-56643][CONNECT][TESTS] Add DSv2 temp view with stored plan tests using Spark Connect mode#55571
longvu-db wants to merge 29 commits into
apache:masterfrom
longvu-db:dsv2-connect-pr1-temp-views-clone

Conversation

@longvu-db
Copy link
Copy Markdown
Contributor

@longvu-db longvu-db commented Apr 27, 2026

What changes were proposed in this pull request?

Extract the 21 DSv2 temp view with stored plan tests from DataSourceV2DataFrameSuite into a shared trait DSv2TempViewWithStoredPlanTests, and add a Connect-mode runner DataSourceV2TempViewConnectSuite so the same tests run in both classic and Connect modes.

The shared trait covers these scenarios:

  1. Session and external writes (scenarios 1.1, 1.2, 1.2+cache): Temp view with filter reflects new data.
  2. Adding columns (scenarios 2.1, 2.2, 2.2+cache): Temp view preserves original schema after ADD COLUMN.
  3. Removing columns (scenarios 3.1, 3.2, 3.2+cache): Temp view detects column removal.
  4. Drop and recreate table (scenarios 4.1, 4.2, 4.2+cache): Temp view resolves to recreated table.
  5. Drop and re-add column with same type (scenarios 5.1, 5.2, 5.2+cache): Schema validation passes, view continues working with null values.
  6. Drop and re-add column with different type (scenarios 6.1, 6.2, 6.2+cache): Temp view detects type change.
  7. Type widening (scenarios 7.1, 7.2, 7.2+cache): Temp view detects INT to BIGINT type change.

Each test creates a DSv2 table, inserts initial data, builds a temp view with a filter (salary < 999) to demonstrate stored plan behavior, and verifies the view after table modifications. External writes use the direct catalog API (loadTable + withData) to simulate writes from outside the session. Caching connector variants use CachingInMemoryTableCatalog to verify stale-cache behavior.

This is the Spark Connect counterpart of #55540.

Refactoring

  • DSv2ExternalMutationTestBase (new): Abstract base trait with session-agnostic hooks (withTestSession, checkRows, getTableCatalog, withTestTableAndViews, externalAppend) so the same test logic runs in both classic and Connect modes.
  • DSv2TempViewWithStoredPlanTests (new): Shared trait containing the 21 test methods, extracted from DataSourceV2DataFrameSuite.
  • DataSourceV2TempViewConnectSuite (new): Connect-mode runner that provides Connect-specific session, catalog access, and result comparison.
  • DataSourceV2DataFrameSuite (modified): Mixes in DSv2TempViewWithStoredPlanTests instead of inlining the tests; adds classic-mode implementations of the abstract hooks.

Why are the changes needed?

The existing temp view tests in DataSourceV2DataFrameSuite use classic-only APIs (checkAnswer, catalog() helper) and cannot run in Connect mode. Extracting them into a shared trait enables Connect parity testing.

Does this PR introduce any user-facing change?

No. This PR only adds tests and refactors test infrastructure.

How was this patch tested?

All 21 shared tests pass in both modes:

build/sbt 'sql/testOnly *DataSourceV2DataFrameSuite'
build/sbt 'connect/testOnly *DataSourceV2TempViewConnectSuite'

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 27, 2026
…-only view to drop+re-add tests

Co-authored-by: Isaac
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 28, 2026
…tions, and S5 assertion fix

- Fix alterTableWithData to physically remove dropped column values so they
  do not survive through ALTER chains (e.g. DROP COLUMN then ADD COLUMN same name)
- Add scenario number comments (1.1, 1.2, 2.1, ..., 7) to each test
- Update Scenario 5 tests to expect null salary after drop+re-add, add v_no_filter
  view to demonstrate null values

Ported from the last 3 commits of PR apache#55571 (Connect counterpart).

Co-authored-by: Isaac
@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch from c9345a9 to a177cec Compare April 28, 2026 12:47
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 28, 2026
…tions, and S5 assertion fix

- Fix alterTableWithData to physically remove dropped column values so they
  do not survive through ALTER chains (e.g. DROP COLUMN then ADD COLUMN same name)
- Add scenario number comments (1.1, 1.2, 2.1, ..., 7) to each test
- Update Scenario 5 tests to expect null salary after drop+re-add, add v_no_filter
  view to demonstrate null values

Ported from the last 3 commits of PR apache#55571 (Connect counterpart).

Co-authored-by: Isaac
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 29, 2026
…tions, and S5 assertion fix

- Fix alterTableWithData to physically remove dropped column values so they
  do not survive through ALTER chains (e.g. DROP COLUMN then ADD COLUMN same name)
- Add scenario number comments (1.1, 1.2, 2.1, ..., 7) to each test
- Update Scenario 5 tests to expect null salary after drop+re-add, add v_no_filter
  view to demonstrate null values

Ported from the last 3 commits of PR apache#55571 (Connect counterpart).

Co-authored-by: Isaac
@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch 2 times, most recently from 613e639 to 854c11f Compare April 29, 2026 20:04
longvu-db added a commit to longvu-db/spark that referenced this pull request Apr 30, 2026
…tions, and S5 assertion fix

- Fix alterTableWithData to physically remove dropped column values so they
  do not survive through ALTER chains (e.g. DROP COLUMN then ADD COLUMN same name)
- Add scenario number comments (1.1, 1.2, 2.1, ..., 7) to each test
- Update Scenario 5 tests to expect null salary after drop+re-add, add v_no_filter
  view to demonstrate null values

Ported from the last 3 commits of PR apache#55571 (Connect counterpart).

Co-authored-by: Isaac
@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch from 0dd8d1d to 5ffcae5 Compare May 1, 2026 09:10
Copy link
Copy Markdown
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @longvu-db. I left a few comments.

@longvu-db longvu-db requested a review from andreaschat-db May 8, 2026 13:46
Copy link
Copy Markdown
Contributor

@andreaschat-db andreaschat-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch from 8bf75d5 to 5813598 Compare May 20, 2026 14:35
@longvu-db
Copy link
Copy Markdown
Contributor Author

longvu-db commented May 20, 2026

Auto-review results

1 critical issue found and fixed:

CachingInMemoryTableCatalog.clearCache() in withTableAndViews (line 87) was called as a companion-object static method, but no companion object exists — only an instance method. This would not compile.

Fix (commit 0b76175): Get the server-side catalog instance and call clearCache() on it, matching the pattern used by the classic DataSourceV2DataFrameSuite:

if (clearCachingCatalog) {
  val serverSession = getServerSession(session)
  serverCatalog[CachingInMemoryTableCatalog](serverSession, "cachingcat").clearCache()
}

Also removed CachingInMemoryTableCatalog.scala and InMemoryTableCatalog.scala from this PR (both already in master).

No other issues found. Test logic, assertions, error conditions, cleanup patterns, imports, and style all look good. ✅

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test-only PR that adds DataSourceV2TempViewConnectSuite with 21 Connect-mode parity tests mirroring the classic DataSourceV2DataFrameSuite "temp view with stored plan" scenarios. Architecture is reasonable for this kind of parity coverage: in-process Connect server via SparkConnectServerTest, client-driven reads/writes through withSession, server-side catalog peeks via getServerSession + serverCatalog[T] for external mutations. Most of @andreaschat-db's earlier nits are addressed; a few helper-design questions remain.

PR description vs. diff

Two pieces of the PR description no longer match what's in the diff:

  1. "Add 11 new tests" — the file actually adds 21 tests (each scenario has session, external, and cached-connector variants). The opportunistic extra coverage is welcome, just worth updating the count.
  2. The "Shared test infrastructure changes" block describes a new CachingInMemoryTableCatalog and a BasicInMemoryTableCatalog.alterTable fix, but per commits 5c64c989f8e ("Remove ... already in master") and 58135982455 ("Restore catalog files to master version") neither is in this PR anymore. Trim that section before merge.

Duplication with the analogue

All 21 tests are line-for-line parity copies of tests already in DataSourceV2DataFrameSuite (lines 3009–3568) — no shared trait, just sql(...)/withTable/checkAnswer/catalog(name) swapped for session.sql(...).collect()/withTableAndViews/assertRows/serverCatalog[T]. Was a shared trait considered? The analogue extending InsertIntoTests makes factoring non-trivial so I understand if the answer is "duplication is the lesser evil," but a one-line acknowledgement in the PR description would help future scenario additions remember to update both files.

Process note

@andreaschat-db's approval predates five commits (the latest being the 2026-05-20 auto-review fix). A re-confirm seems appropriate before merge.

@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch 2 times, most recently from aef7da9 to 401a3e6 Compare May 22, 2026 21:02
longvu-db added a commit to longvu-db/spark that referenced this pull request May 24, 2026
Move the 9 repeated-sql tests from DataSourceV2DataFrameSuite into a
DSv2RepeatedSqlTests trait backed by DSv2ExternalMutationTestBase,
following the same pattern as DSv2TempViewWithStoredPlanTests (PR apache#55571).
This allows a future Connect suite to reuse the same tests by mixing in
the trait and providing Connect-specific implementations.

Co-authored-by: Isaac
@longvu-db longvu-db requested a review from cloud-fan May 24, 2026 20:31
@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch from 3f48931 to 94a65be Compare May 24, 2026 20:54
@longvu-db
Copy link
Copy Markdown
Contributor Author

CI failed with unrelated with unrelated flaky RocksDBStateStoreIntegrationSuiteWithRowChecksum failure

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared-trait refactor since the last round (extracting 21 tests into DSv2TempViewWithStoredPlanTests over a DSv2ExternalMutationTestBase hook surface) addresses the duplication concern from the prior round, and the Connect-side justifications for QueryTest.sameRows (DataSourceV2TempViewConnectSuite.scala:46-53) and dropping clearCache() (:69-72) are accurate — I verified both. Correction to my prior round: I claimed checkAnswer is usable from Connect server tests; that's true only for classic server-side DataFrames, not for the Connect-client DataFrames this suite uses (QueryTest.checkAnswer accesses queryExecution/logicalPlan/materializedRdd, which throw on Connect client DataFrames). Your sameRows switch is correct; ignore the earlier suggestion.

Status against the prior AI review: 3 addressed, 0 remaining, 2 new. The prior commit (0b76175) is no longer an ancestor of HEAD (rebased), so I'm treating new findings uniformly.

Bigger design question — coordination with #55356

@fwc's PR #55356 ("SharedSparkSession provides sql.SparkSession, add SharedClassicSparkSession, ClassicQueryTest") is pursuing the same goal of sharing tests between classic and Connect, but at the foundation layer: it lifts SharedSparkSession.spark to sql.SparkSession, moves classic-only helpers into classic.SharedSparkSession/classic.QueryTest subtraits, and makes QueryTest itself extendable in a Connect-providing suite.

If that direction lands, three of the four hooks in DSv2ExternalMutationTestBase become unnecessary:

  • withTestSession(fn) — because spark itself is the abstract sql.SparkSession.
  • checkRows — if checkAnswer is made Connect-aware in shared infra, instead of every shared trait re-implementing it.
  • withTestTableAndViews — once withTable operates on sql.SparkSession.

Only getTableCatalog would remain, because the catalog-mutation bypass is genuinely Connect-specific.

Merge order doesn't matter, but the direction needs to be one direction, not two. Could you sync with @fwc on whether the right path is:

(a) wait for #55356 to land and rebase this PR onto the lifted SharedSparkSession, dropping the redundant hooks; or

(b) merge this PR first but contribute the generic Connect primitives (checkAnswer dispatch, withTable lift) into shared infrastructure as part of #55356, so future shared traits don't reinvent them.

What would be unfortunate is leaving a private 4-hook abstraction in this trait and a parallel foundational abstraction in #55356 — that's two test-sharing frameworks to reconcile later. The shared-trait pattern in StaticProcedureSuiteBase / sql.SparkSessionBuilderImplementationBindingSuite (plain extends AnyFunSuite, no per-trait routing) is what this should look like once the foundation is in place.

Process note

@andreaschat-db's LGTM (2026-05-08, commit ae21481) predates the entire shared-trait refactor (introduced in b879d2c). A re-confirm seems appropriate before merge.

Two smaller inline findings below.

longvu-db added 6 commits May 25, 2026 13:35
…ts using Spark Connect mode

Rebased on latest master.
Assert schema field names before every checkAnswer/assertRows
with empty results to verify schema is preserved correctly.
Add missing test for Section 1 Scenario 3.1 where a session
ALTER TABLE DROP COLUMN should trigger an analysis exception
when querying a temp view that references the removed column.
Co-authored-by: Isaac
- Fix assertRows to use sortBy instead of toSet (preserves duplicates)
- Replace serverCatalog/serverCachingCatalog with generic serverCatalog[T]
- Add withTableAndView helper with try/finally for cleanup

Co-authored-by: Isaac
longvu-db added 22 commits May 25, 2026 13:35
Co-authored-by: Isaac
…la from this PR (already in master)

Co-authored-by: Isaac
…panion object

CachingInMemoryTableCatalog has no companion object, so the static call
CachingInMemoryTableCatalog.clearCache() does not compile. Fixed by
getting the server-side catalog instance and calling clearCache() on it,
matching the pattern used by the classic DataSourceV2DataFrameSuite.

Co-authored-by: Isaac
- Replace assertRows toString comparison with QueryTest.sameRows for value-based comparison
- Add ClassTag bound to serverCatalog for runtime type checking at helper boundary
- Remove schema param from externalAppend, derive it from table columns via CatalogV2Util
- Remove clearCachingCatalog param from withTableAndViews since sessions are isolated per-test

Co-authored-by: Isaac
Addresses review feedback to share test cases instead of duplicating.

New shared traits in sql/core (accessible from both sql/core and sql/connect):
- DSv2ExternalMutationTestBase: base trait with 5 abstract methods
  (withTestSession, checkRows, getTableCatalog, withTestTableAndViews, testPrefix)
  and shared externalAppend helper
- DSv2TempViewWithStoredPlanTests: 21 temp view tests (7 scenarios x 3 variants)

Classic DataSourceV2DataFrameSuite:
- Mixes in DSv2TempViewWithStoredPlanTests
- Implements abstract methods using classic session, checkAnswer, catalog()
- Removes 578 lines of duplicated inline tests

Connect DataSourceV2TempViewConnectSuite:
- 72 lines (was 739) thin wrapper mixing in the shared trait
- Implements abstract methods for Connect: withSession, QueryTest.sameRows,
  getServerSession for catalog access

Also addresses cloud-fan's earlier review comments:
- Uses QueryTest.sameRows for value-based comparison (not toString)
- Adds ClassTag bound to getTableCatalog for runtime type checking
- Derives schema from table columns via CatalogV2Util (no schema param)
- Removes clearCachingCatalog since sessions are isolated per-test

Co-authored-by: Isaac
…roken Scaladoc

- Remove `with DSv2RepeatedSQLTests` mixin (trait not in this PR)
- Remove unused imports: BufferedRows, CatalogV2Util, InMemoryBaseTable, InternalRow
- Fix Scaladoc to only reference DSv2TempViewWithStoredPlanTests
- Use session.sql() in classic withTestTableAndViews to honor param contract

Co-authored-by: Isaac
…r Connect

QueryTest.checkAnswer cannot be used with Connect client DataFrames because
it accesses logicalPlan, queryExecution, and materializedRdd which throw
ConnectClientUnsupportedErrors. sameRows provides the same value-based,
order-agnostic comparison that checkAnswer uses internally.

Co-authored-by: Isaac
…ents

- DSv2ExternalMutationTestBase: extend QueryTest instead of
  SharedSparkSession so the trait does not pull in a SparkSession of its
  own; concrete suites provide sessions via their own inheritance.
- DSv2TempViewWithStoredPlanTests: document why .collect() is appended to
  every session.sql() call (needed for Connect client DataFrames) and why
  testIdent is catalog-agnostic.
- DataSourceV2TempViewConnectSuite: clarify that checkRows uses sameRows
  because the DataFrame is a Connect *client* DataFrame (not a server-side
  one), and explain why clearCache() is unnecessary (Connect sessions are
  isolated per-test).

Co-authored-by: Isaac
- Rename `cat` variable to `catalog` throughout DSv2TempViewWithStoredPlanTests
- Rename `cat` parameter to `catalog` in DSv2ExternalMutationTestBase.externalAppend
- Simplify fully-qualified [[]] references to short class names
- Remove identifier-is-catalog-agnostic comment
- Change "assertion wiring" to "result comparison" in Connect suite
- Change testPrefix default from "" to "[classic] "
- Simplify verbose method Scaladoc in DSv2ExternalMutationTestBase

Co-authored-by: Isaac
- Set classic testPrefix to "" to preserve existing test name history
  in CI dashboards (only Connect uses "[connect] " prefix)
- Add require() check in classic getTableCatalog to match Connect impl,
  giving a clear error instead of a confusing ClassCastException

Co-authored-by: Isaac
@longvu-db longvu-db force-pushed the dsv2-connect-pr1-temp-views-clone branch from 94a65be to 8fd6b01 Compare May 25, 2026 13:35
Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants