GH-49129: [R][Parquet] Guard Parquet dataset code with ARROW_R_WITH_PARQUET#49128
GH-49129: [R][Parquet] Guard Parquet dataset code with ARROW_R_WITH_PARQUET#49128IsabelParedes wants to merge 5 commits intoapache:mainfrom
Conversation
Add preprocessor guards around Parquet-related code in dataset.cpp and update the condition in arrowExports.cpp to check for both ARROW_R_WITH_DATASET and ARROW_R_WITH_PARQUET. This ensures the code compiles correctly when Parquet support is disabled.
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
|
Thanks for submitting this PR. I am inclined to accept it, I think the changes are good. But I would love to have a test in our CI that would catch if we (accidentally) add another place we need to guard against. What about adding a new crossbow job like this one: Lines 708 to 714 in 7532327 Something like: How does that sound? |
|
@github-actions crossbow submit test-r-dataset-without-parquet |
|
Revision: 7adc2a0 Submitted crossbow builds: ursacomputing/crossbow @ actions-e48f2edfc9
|
raulcd
left a comment
There was a problem hiding this comment.
Thanks for the PR @IsabelParedes !
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
|
@github-actions crossbow submit test-r-dataset-without-parquet |
|
Revision: 7717fb7 Submitted crossbow builds: ursacomputing/crossbow @ actions-fe4ac87ab0
|
|
There seems to still be a missing symbol somewhere, that's why the job failed: |
|
@github-actions crossbow submit test-r-dataset-without-parquet |
|
Revision: 3ffdc3c Submitted crossbow builds: ursacomputing/crossbow @ actions-e999121048
|
|
@github-actions crossbow submit test-r-dataset-without-parquet |
|
|
@github-actions crossbow submit test-r-dataset-without-parquet |
|
Revision: 990e09b Submitted crossbow builds: ursacomputing/crossbow @ actions-1c0912d23b
|
| } | ||
| #else | ||
| extern "C" SEXP _arrow_dataset___ParquetFragmentScanOptions__Make(SEXP use_buffered_stream_sexp, SEXP buffer_size_sexp, SEXP pre_buffer_sexp, SEXP thrift_string_size_limit_sexp, SEXP thrift_container_size_limit_sexp){ | ||
| Rf_error("Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries. "); |
There was a problem hiding this comment.
The errors seem to be "correct", right? Calling this function should cause an error, as expected?
[5](https://github.com/ursacomputing/crossbow/actions/runs/21668554323/job/62470241060#step:6:26579)
── Error ('test-csv.R:581:3'): write_csv_arrow can write from Dataset objects ──
Error in `dataset___ParquetFragmentScanOptions__Make(use_buffered_stream, buffer_size, pre_buffer, thrift_string_size_limit, thrift_container_size_limit)`: Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.
Backtrace:
▆
1. └─arrow::write_dataset(tbl_no_dates, data_dir, partitioning = "lgl") at test-csv.R:581:3
2. └─FileWriteOptions$create(...)
3. └─FileFormat$create(format)
4. └─ParquetFileFormat$create(...)
5. └─ParquetFragmentScanOptions$create(...)
6. └─arrow:::dataset___ParquetFragmentScanOptions__Make(...)
── Error ('test-dataset-csv.R:295:3'): Error if no format specified and files are not parquet ──
Error in `dataset___ParquetFragmentScanOptions__Make(use_buffered_stream, buffer_size, pre_buffer, thrift_string_size_limit, thrift_container_size_limit)`: Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.
Or should I look into changing the tests?
There was a problem hiding this comment.
cc @jonkeane @thisisnic
Maybe we should skip the test if parquet is not available too not only if dataset is not available?
test_that("write_csv_arrow can write from Dataset objects", {
skip_if_not_available("dataset")There was a problem hiding this comment.
Absolutely; we've not really considered the case of building the Arrow R package without Parquet, so there might be a few more of these to deal with.
There was a problem hiding this comment.
Thanks for submitting this PR @IsabelParedes!
One quick note about a slightly weird bit of the package's inner workings. In the r/data-raw directory there is a file called codegen.R, which generates the code in r/src/arrowExports.cpp. I think that the latter file might have been manually edited in this PR but we need to instead edit the upstream file that generates it.
This might add some complexity; when I took a look at it with Claude yesterday, I got the below output/suggestion, which sounds about right, but I haven't fully explored the different options there. Let me know if you'd rather leave it with you to have a think and propose an approach; otherwise I can take a more in-depth look at what's needed.
🤖 :
The changes to dataset.cpp look correct - adding the ARROW_R_WITH_PARQUET guards around the Parquet-specific includes and functions makes sense.
However, arrowExports.cpp is a generated file (note the comment at the top: "Generated by using data-raw/codegen.R -> do not edit by hand"). The change you made there will be overwritten the next time codegen.R runs.
The fix needs to happen in r/data-raw/codegen.R instead. Currently, codegen.R uses the decoration (e.g., // [[dataset::export]]) to determine which single feature guard to use. For functions like dataset___ParquetFileWriteOptions__update that need both
ARROW_R_WITH_DATASET AND ARROW_R_WITH_PARQUET, there are a couple of options:
- Support multiple decorations on a single function (e.g., // [[dataset::export]] [[parquet::export]])
- Have codegen detect when a function is already inside a #if defined(ARROW_R_WITH_...) block in the source and include that
guard as well - Add a new decoration syntax for specifying additional required features
Rationale for this change
This ensures the code compiles correctly when Parquet support is disabled.
What changes are included in this PR?
Add preprocessor guards around Parquet-related code in dataset.cpp and update the condition in arrowExports.cpp to check for both ARROW_R_WITH_DATASET and ARROW_R_WITH_PARQUET.
Are these changes tested?
Yes. Please see the patch applied here emscripten-forge/recipes#4397
Are there any user-facing changes?
No.
Fixes #49129