Skip to content

GH-49129: [R][Parquet] Guard Parquet dataset code with ARROW_R_WITH_PARQUET#49128

Open
IsabelParedes wants to merge 5 commits intoapache:mainfrom
IsabelParedes:if-parquet
Open

GH-49129: [R][Parquet] Guard Parquet dataset code with ARROW_R_WITH_PARQUET#49128
IsabelParedes wants to merge 5 commits intoapache:mainfrom
IsabelParedes:if-parquet

Conversation

@IsabelParedes
Copy link

@IsabelParedes IsabelParedes commented Feb 3, 2026

Rationale for this change

This ensures the code compiles correctly when Parquet support is disabled.

What changes are included in this PR?

Add preprocessor guards around Parquet-related code in dataset.cpp and update the condition in arrowExports.cpp to check for both ARROW_R_WITH_DATASET and ARROW_R_WITH_PARQUET.

Are these changes tested?

Yes. Please see the patch applied here emscripten-forge/recipes#4397

Are there any user-facing changes?

No.

Fixes #49129

Add preprocessor guards around Parquet-related code in dataset.cpp
and update the condition in arrowExports.cpp to check for both
ARROW_R_WITH_DATASET and ARROW_R_WITH_PARQUET. This ensures the code
compiles correctly when Parquet support is disabled.
@github-actions
Copy link

github-actions bot commented Feb 3, 2026

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@IsabelParedes IsabelParedes changed the title [R] Guard Parquet dataset code with ARROW_R_WITH_PARQUET GH-49129: [R] Guard Parquet dataset code with ARROW_R_WITH_PARQUET Feb 3, 2026
@github-actions
Copy link

github-actions bot commented Feb 3, 2026

⚠️ GitHub issue #49129 has been automatically assigned in GitHub to PR creator.

@IsabelParedes IsabelParedes marked this pull request as ready for review February 3, 2026 13:31
@IsabelParedes IsabelParedes changed the title GH-49129: [R] Guard Parquet dataset code with ARROW_R_WITH_PARQUET GH-49129: [R][Parquet] Guard Parquet dataset code with ARROW_R_WITH_PARQUET Feb 3, 2026
@jonkeane
Copy link
Member

jonkeane commented Feb 3, 2026

Thanks for submitting this PR. I am inclined to accept it, I think the changes are good. But I would love to have a test in our CI that would catch if we (accidentally) add another place we need to guard against.

What about adding a new crossbow job like this one:

arrow/dev/tasks/tasks.yml

Lines 708 to 714 in 7532327

test-r-dev-duckdb:
ci: github
template: docker-tests/github.linux.yml
params:
env:
R_DUCKDB_DEV: TRUE
image: ubuntu-r-only-r

Something like:

  test-r-dataset-without-parquet:
    ci: github
    template: docker-tests/github.linux.yml
    params:
      env:
        ARROW_DATASET: ON
        ARROW_PARQUET: OFF
      image: ubuntu-r-only-r

How does that sound?

@raulcd
Copy link
Member

raulcd commented Feb 3, 2026

@github-actions crossbow submit test-r-dataset-without-parquet

@github-actions
Copy link

github-actions bot commented Feb 3, 2026

Revision: 7adc2a0

Submitted crossbow builds: ursacomputing/crossbow @ actions-e48f2edfc9

Task Status
test-r-dataset-without-parquet GitHub Actions

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @IsabelParedes !

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 3, 2026
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 3, 2026
@raulcd
Copy link
Member

raulcd commented Feb 3, 2026

@github-actions crossbow submit test-r-dataset-without-parquet

@github-actions
Copy link

github-actions bot commented Feb 3, 2026

Revision: 7717fb7

Submitted crossbow builds: ursacomputing/crossbow @ actions-fe4ac87ab0

Task Status
test-r-dataset-without-parquet GitHub Actions

@raulcd
Copy link
Member

raulcd commented Feb 3, 2026

There seems to still be a missing symbol somewhere, that's why the job failed:

** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/arrow.so':
  /arrow/r/check/arrow.Rcheck/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: _Z33dataset___ParquetFileFormat__MakeRKSt10shared_ptrIN5arrow7dataset26ParquetFragmentScanOptionsEEN5cpp118r_vectorINS6_8r_stringEEE
Error: loading failed
Execution halted

@raulcd
Copy link
Member

raulcd commented Feb 3, 2026

@github-actions crossbow submit test-r-dataset-without-parquet

@github-actions
Copy link

github-actions bot commented Feb 3, 2026

Revision: 3ffdc3c

Submitted crossbow builds: ursacomputing/crossbow @ actions-e999121048

Task Status
test-r-dataset-without-parquet GitHub Actions

@IsabelParedes
Copy link
Author

@github-actions crossbow submit test-r-dataset-without-parquet

@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/21666672188

@raulcd
Copy link
Member

raulcd commented Feb 4, 2026

@github-actions crossbow submit test-r-dataset-without-parquet

@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Revision: 990e09b

Submitted crossbow builds: ursacomputing/crossbow @ actions-1c0912d23b

Task Status
test-r-dataset-without-parquet GitHub Actions

}
#else
extern "C" SEXP _arrow_dataset___ParquetFragmentScanOptions__Make(SEXP use_buffered_stream_sexp, SEXP buffer_size_sexp, SEXP pre_buffer_sexp, SEXP thrift_string_size_limit_sexp, SEXP thrift_container_size_limit_sexp){
Rf_error("Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries. ");
Copy link
Author

@IsabelParedes IsabelParedes Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The errors seem to be "correct", right? Calling this function should cause an error, as expected?

[5](https://github.com/ursacomputing/crossbow/actions/runs/21668554323/job/62470241060#step:6:26579)
── Error ('test-csv.R:581:3'): write_csv_arrow can write from Dataset objects ──
Error in `dataset___ParquetFragmentScanOptions__Make(use_buffered_stream, buffer_size, pre_buffer, thrift_string_size_limit, thrift_container_size_limit)`: Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries. 
Backtrace:
    ▆
 1. └─arrow::write_dataset(tbl_no_dates, data_dir, partitioning = "lgl") at test-csv.R:581:3
 2.   └─FileWriteOptions$create(...)
 3.     └─FileFormat$create(format)
 4.       └─ParquetFileFormat$create(...)
 5.         └─ParquetFragmentScanOptions$create(...)
 6.           └─arrow:::dataset___ParquetFragmentScanOptions__Make(...)
── Error ('test-dataset-csv.R:295:3'): Error if no format specified and files are not parquet ──
Error in `dataset___ParquetFragmentScanOptions__Make(use_buffered_stream, buffer_size, pre_buffer, thrift_string_size_limit, thrift_container_size_limit)`: Cannot call dataset___ParquetFragmentScanOptions__Make(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.

Or should I look into changing the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jonkeane @thisisnic
Maybe we should skip the test if parquet is not available too not only if dataset is not available?

test_that("write_csv_arrow can write from Dataset objects", {
  skip_if_not_available("dataset")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely; we've not really considered the case of building the Arrow R package without Parquet, so there might be a few more of these to deal with.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 4, 2026
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR @IsabelParedes!

One quick note about a slightly weird bit of the package's inner workings. In the r/data-raw directory there is a file called codegen.R, which generates the code in r/src/arrowExports.cpp. I think that the latter file might have been manually edited in this PR but we need to instead edit the upstream file that generates it.

This might add some complexity; when I took a look at it with Claude yesterday, I got the below output/suggestion, which sounds about right, but I haven't fully explored the different options there. Let me know if you'd rather leave it with you to have a think and propose an approach; otherwise I can take a more in-depth look at what's needed.


🤖 :

The changes to dataset.cpp look correct - adding the ARROW_R_WITH_PARQUET guards around the Parquet-specific includes and functions makes sense.

However, arrowExports.cpp is a generated file (note the comment at the top: "Generated by using data-raw/codegen.R -> do not edit by hand"). The change you made there will be overwritten the next time codegen.R runs.

The fix needs to happen in r/data-raw/codegen.R instead. Currently, codegen.R uses the decoration (e.g., // [[dataset::export]]) to determine which single feature guard to use. For functions like dataset___ParquetFileWriteOptions__update that need both
ARROW_R_WITH_DATASET AND ARROW_R_WITH_PARQUET, there are a couple of options:

  1. Support multiple decorations on a single function (e.g., // [[dataset::export]] [[parquet::export]])
  2. Have codegen detect when a function is already inside a #if defined(ARROW_R_WITH_...) block in the source and include that
    guard as well
  3. Add a new decoration syntax for specifying additional required features

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R] Compilation fails when Parquet support is disabled

4 participants