feat: add support for generating JSON formatted substrait plan#1376
feat: add support for generating JSON formatted substrait plan#1376Prathamesh9284 wants to merge 5 commits intoapache:mainfrom
Conversation
timsaucer
left a comment
There was a problem hiding this comment.
I have a few questions/recommendations. My biggest concern is I don't see evidence that this is generating json strings that are actually compatible with the other libraries and described in the original issue. It's entirely possible it's working exactly as needed, but I think that's something we can show in the tests.
python/datafusion/substrait.py
Outdated
| def to_json(self) -> str: | ||
| """Serialize the plan to a JSON string. | ||
|
|
||
| Returns a JSON string following the Protobuf JSON Mapping. |
There was a problem hiding this comment.
I find the statement "Returns a JSON string following the Protobuf JSON Mapping" to be hard to understand. What exactly does this mean? Is this LLM generated?
There was a problem hiding this comment.
I agree the wording isn’t very clear. I’ll rephrase it to make it clearer that this just returns the Plan in its standard JSON form.
python/datafusion/substrait.py
Outdated
|
|
||
| @staticmethod | ||
| def serialize_json(sql: str, ctx: SessionContext, path: str | pathlib.Path) -> None: | ||
| """Serialize a SQL query to a JSON-formatted Substrait plan file. |
There was a problem hiding this comment.
Is the output a json file? I'm trying to understand the "serialize" in this context.
There was a problem hiding this comment.
Yes, in the current implementation it writes a JSON-formatted Substrait plan to a file. I used “serialize” in the same sense as the existing binary serialize, but I agree the naming may not be very clear here.
python/tests/test_substrait.py
Outdated
| # Convert plan to JSON | ||
| json_str = plan.to_json() | ||
| assert isinstance(json_str, str) | ||
| assert "relations" in json_str # basic sanity check |
There was a problem hiding this comment.
If these conversions to json are stable as the issue suggests, then I would think we could compare the entire output, right? What do the other libraries produce?
There was a problem hiding this comment.
Hi @timsaucer,
I was able to generate a JSON-formatted Substrait plan using my DataFusion implementation. Comparing it with Isthmus:
- Both include relations, project expressions, virtualTable values, and correct column mappings.
- Minor differences exist: DataFusion uses i64 instead of i32, column names differ, and optional metadata (extensions, typeAliases) are not included — these do not affect compatibility.
- DuckDB currently does not provide a working JSON inspection method due to extension availability issues, so direct JSON comparison via DuckDB isn’t possible.
Overall, the DataFusion output is compatible and can be used for reproducibility and validation.
python/datafusion/substrait.py
Outdated
| @staticmethod | ||
| def serialize_json(sql: str, ctx: SessionContext, path: str | pathlib.Path) -> None: |
There was a problem hiding this comment.
Is the preferred method here to write to a file or to just produce a string output? Maybe the author of the issue has an opinion on best use case.
There was a problem hiding this comment.
I was originally trying to keep it consistent with Serde.serialize, which writes the binary plan to a file. But I’m also fine with just returning a JSON string from Plan.to_json() and letting users handle writing to a file themselves. Do you have a preference for which approach fits better with the project?
python/datafusion/substrait.py
Outdated
| def deserialize_json(path: str | pathlib.Path) -> Plan: | ||
| """Deserialize a Substrait plan from a JSON file. |
There was a problem hiding this comment.
Similarly, not sure if this needs to be a path to a file that we read or expect to pass in a string. Since it's python it's fairly easy to handle either/both
There was a problem hiding this comment.
Following up on my earlier comment if we keep JSON handling on Plan, would it make sense for from_json() to accept either a JSON string or a file path? That might simplify things instead of having separate helpers on Serde. Curious what you think.
src/errors.rs
Outdated
| PythonError(PyErr), | ||
| EncodeError(EncodeError), | ||
| DecodeError(DecodeError), | ||
| SerdeJsonError(SerdeJsonError), |
There was a problem hiding this comment.
Recommend removing SerdeJsonError and mapping to EncodeError or DecodeError. Alternatively putting into DataFusion execution error.
479bbd9 to
e9df13e
Compare
e9df13e to
babd63a
Compare
|
Hi @timsaucer I have added pytest to validate cross-library compatibility. On test_substrait.py ines 94–99 parse the JSON using the official Substrait protobuf library (plan_pb2) via json_format.Parse, which would fail if the JSON didn’t conform to the Substrait protobuf schema. It then re-serializes to binary and asserts byte-for-byte equality with the original DataFusion output. This ensures any tool that supports protobuf-JSON can consume it correctly. |
… plan. Switch unit test to focus on json parsing and not byte serialization.
|
Thank you for the PR. There was an issue where the internal function name differed from the wrapper, so unit tests were failing. I pushed an update correcting this and making a change to the unit test to focus on the json representation of the logical plan rather than serialization and deserialization. |
|
Thank you for fixing the internal function name mismatch and adjusting the unit test. I did notice that one of the CI builds failed. The failure appears to be related to a Clippy lint (redundant_closure). I’ve pushed a small fix replacing the redundant closure with the function directly (map_err(to_datafusion_err)), which should resolve the lint issue across environments. |
Which issue does this PR close?
Closes #508
Rationale for this change
Other Substrait producers (Isthmus, DuckDB) support generating JSON-formatted Substrait plans, which makes it much easier to inspect and compare plans across engines. The substrait crate already provides serde support via the pbjson feature, so DataFusion Python can leverage this to expose JSON plan generation and parsing with minimal effort.
What changes are included in this PR?
python/datafusion/substrait.pyto_json()instance method andparse_json()static method on thePlanclass for converting plans to JSON strings and parse JSON strings to plans.src/substrait.rsfn to_json(&self) -> PyDataFusionResult<String>#[staticmethod] fn from_json(json: &str) -> PyDataFusionResult<PyPlan>Cargo.tomlserde_jsonas a dependency.pyproject.tomlpython/tests/test_substrait.pyAre there any user-facing changes?
Yes — two new public API methods:
Plan.to_json() -> str— convert a Substrait plan to a JSON stringPlan.from_json(json: str) -> Plan— create a Substrait plan from a JSON string