Disable s3 multipart download option#12413
Conversation
Clarified the new configuration setting for S3 drivers to specify its effect on multipart downloads and added context regarding `download-redirect` behavior.
qqmyers
left a comment
There was a problem hiding this comment.
The basic change here - to provide the option to disable use of multipart transfer for download - is necessary for DANS to use the code with Ceph, (with download-redirect false). The implementation is straight forward - the code is refactored to have separate read/write client variables, but these still point to one client unless the new setting is true, in which case download uses a separate aws s3 client with multipart disabled.
The other general changes (noted in other comments) all see reasonable as well, though I think #12109 should make the same release.
The two requested changes are just to add Ceph in the list of known compatible instances and (minor) to clean up some of the formatting/reformatting that happened.
For QA, I'll note that DANS is using this already so I think the key thing is to regression test - not sure the core team needs to configure Ceph to test the actual fix.
| @@ -0,0 +1,5 @@ | |||
| A new configuration setting has been introduced for S3 compatible storage drivers that addresses an incompatibility between the AWS S3 library used in Dataverse and certain S3 implementations such as the Ceph Object Gateway: | |||
There was a problem hiding this comment.
@janvanmansum - I made some edits - feel free to revert. I wanted to make it clear this new setting is mainly to address a problem, not something you'd want to set true otherwise. I also guessed that you're talking about the Ceph Object Gateway.
Beyond that, I might suggest adding Ceph to the list at https://github.com/IQSS/dataverse/blob/develop/doc/sphinx-guides/source/installation/config.rst#reported-working-s3-compatible-storage with a note about this setting.
| private static final ConcurrentHashMap<String, S3AsyncClient> driverDownloadClientMap = new ConcurrentHashMap<>(); | ||
| private static final ConcurrentHashMap<String, S3AsyncClient> driverUploadClientMap = new ConcurrentHashMap<>(); | ||
| private static final ConcurrentHashMap<String, S3Presigner> driverPresignerMap = new ConcurrentHashMap<>(); | ||
| private static final ConcurrentHashMap<String, AwsCredentialsProvider> driverCredentialsProviderMap = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Just noting that this PR has some general improvements ~unrelated to the issue, such as the use of ConcurrentHashMaps.
| if (super.getInputStream() == null) { | ||
| throw new IOException("Cannot get InputStream for S3 Object" + key); | ||
| catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
Resetting the interrupt is a recommended practice, not related to the issue.
| return null; | ||
| } finally { | ||
| s3Presigner.close(); | ||
| } |
There was a problem hiding this comment.
Another general fix. FWIW: Not closing the presigners is one of the fixes in #12109 which has been sitting. We probably want the other fixes there too.
| if (driverTMMap.containsKey(driverId)) { | ||
| return driverTMMap.get(driverId); | ||
| } else { | ||
| return driverTMMap.computeIfAbsent(driverId, id -> { |
| Thread.currentThread().interrupt(); | ||
| throw new IOException("S3AccessIO: Failed to get aux objects for listing.", e); | ||
| } | ||
| catch (ExecutionException e) { |
There was a problem hiding this comment.
Minor - this PR seems to be inconsistent in putting the catch on the line with the closing } or the next line and changes whether else is on a new line or not. It would be good to have things consistent (I think we generally have catch and else on the same line as the } )
Cover Message for Pull Request
What this PR does / why we need it:
This PR addresses a
412 Precondition Failederror encountered when downloading certain large objects from S3-compatible storage (specifically Ceph) that were originally uploaded using multipart upload (e.g., via theboto3library with default 8MiB part sizes).The issue arises because the AWS SDK for Java, when multipart support is enabled, automatically attempts a parallel multipart download if it detects an ETag with a part-count suffix (e.g.,
-5). During this process, the client sends anIf-Matchheader containing the ETag of an individual part to ensure consistency. However, some servers like Ceph expect the ETag for the entire object, leading to the412error.To resolve this, the
S3AccessIOclass has been refactored to use two distinct S3 clients:If-Matchheaders that trigger the 412 error.Additionally, a new configuration option
disable-multipart-download-for-indirect-downloadhas been introduced to control this behavior.Which issue(s) this PR closes:
N/a
Special notes for your reviewer:
The fix involves spliting the S3 client into
s3ReadClientands3WriteClient. Thes3ReadClientis configured with.multipartEnabled(false)if the new configuration setting is set totrue. By defaults3ReadClientands3WriteClientrefer to the same client, which has multipart enabled.Suggestions on how to test this:
-5). Confirm it fails with412 Precondition Failedwithout this PR. (JM) Setdataverse.files.<id>.download-redirect=falseso that Dataverse is involved in the download.dataverse.files.<id>.disable-multipart-download-for-indirect-downloadtofalseand verify the412error returns (if testing against a problematic server).Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
https://github.com/IQSS/dataverse/pull/12413/changes#diff-827dccc047f09653dade26fe2dce6be6d20ae6a739448e2b9f2890849b64d2fd
Additional documentation:
No