Skip to content

CREATE TABLE ... AS ( WITH RECURSIVE ... ) fails with "Mismatch between schema and batches" - recursive term's batches carry expression names instead of the anchor member #22093

@jccampagne

Description

@jccampagne

Describe the bug

I stumbled a potential issue in DataFusion while testing recursive CTEs in my project:

Here's such a query:

WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
select * from chain

TLDR: the issue is:

  • Anchor member SELECT 'A' AS level has field name level
  • the recursive part SELECT level || '>' ... has field name chain.level || Utf8(\">\")

After a bit of digging and debugging I was able to reproduce the error in the CLI.
I also tested in Postgres for reference (see below).

Exhibit error in CLI with datafusion-cli

The issue is also apparent in the CLI. Using Datafusion 53.1.0.

At first it seems ok in the CLI:

% datafusion-cli
DataFusion CLI v53.1.0
> WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
SELECT * FROM chain
;
+-------+
| level |
+-------+
| A     |
| A>    |
| A>>   |
| A>>>  |
| A>>>> |
+-------+
5 row(s) fetched.
Elapsed 0.005 seconds.

This works fine, maybe the pretty printer is lenient and is silent or ignores the different schemas.
However if you try to create the table using CREATE TABLE ... AS ..., the error is apparent:

> create table test as ( WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
SELECT * FROM chain );
Error during planning: Mismatch between schema and batches

It fails.

Debug in code:

Adding some debugging shows the value of the returned record batches (without as level for the recursive case:

[
RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A", ]], row_count: 1 },
RecordBatch { schema: Schema { fields: [Field { name: "chain.level || Utf8(\">\")", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>", ]], row_count: 1 },
RecordBatch { schema: Schema { fields: [Field { name: "chain.level || Utf8(\">\")", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>", ]], row_count: 1 },
RecordBatch { schema: Schema { fields: [Field { name: "chain.level || Utf8(\">\")", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>>", ]], row_count: 1 },
RecordBatch { schema: Schema { fields: [Field { name: "chain.level || Utf8(\">\")", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>>>", ]], row_count: 1 }
]

As you can see the field names in the schema different in the anchor member and the recursive case.

With the ... as level in the recursive case, the record batches are:

[RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A", ]], row_count: 1 },
 RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>", ]], row_count: 1 },
 RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>", ]], row_count: 1 },
 RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>>", ]], row_count: 1 },
 RecordBatch { schema: Schema { fields: [Field { name: "level", data_type: Utf8 }], metadata: {} }, columns: [StringArray [ "A>>>>",]], row_count: 1 }]

Compatibility check with Postgres

I tested with Postgres to check compatibility, it works fine (with or without the CREATE TABLE):

% psql
psql (14.22 (Homebrew))
Type "help" for help.

jc=# WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
select * from chain;
 level
-------
 A
 A>
 A>>
 A>>>
 A>>>>
(5 rows)

jc=# create table test as ( WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
SELECT * FROM chain );
SELECT 5
jc=# select * from test;
 level
-------
 A
 A>
 A>>
 A>>>
 A>>>>
(5 rows)

To Reproduce

Run this in Datafusion CLI:

create table test as ( WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' FROM chain WHERE length(level) < 5
)
SELECT * FROM chain );

It should fail with Error during planning: Mismatch between schema and batches

Expected behavior

It should behave like:

> create table test as ( WITH RECURSIVE chain AS (
  SELECT 'A' AS level
  UNION ALL
  SELECT level || '>' as level FROM chain WHERE length(level) < 5
)
SELECT * FROM chain );
0 row(s) fetched.
Elapsed 0.003 seconds.

> select * from test;
+-------+
| level |
+-------+
| A     |
| A>    |
| A>>   |
| A>>>  |
| A>>>> |
+-------+
5 row(s) fetched.
Elapsed 0.004 seconds.

note the ... as level ... in SELECT level || '>' as level FROM chain ...

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions