[FLINK-39118] Add documentation for Native s3 FileSystem #27841
Samrat002 wants to merge 3 commits into apache:master from
Conversation
alpinegizmo left a comment
This is in pretty good shape. Just a couple of points to address.
7b82717 to 7be000d
| - Use *s3p://* scheme for checkpointing (Presto) | ||
| {{< hint info >}} | ||
| The Native S3 implementation does not introduce a new URI scheme. It reuses the existing *s3://* and *s3a://* schemes. To use it alongside the Hadoop implementation, ensure only the Native S3 plugin JAR is in the `plugins` directory (i.e., do not have both `flink-s3-fs-native` and `flink-s3-fs-hadoop` plugins loaded simultaneously for the same scheme). |
"To use it alongside the Hadoop implementation" -- what does this mean? I had assumed that it's not possible to use both the Native S3 and Hadoop implementations together, since they use the same scheme.
Yeah, "To use it alongside the Hadoop implementation" sounds misleading at first.
I've updated the wording to be more explicit and direct. PTAL
alpinegizmo left a comment
One more suggestion, and a question.
Izeren left a comment
Thank you for the PR @Samrat002, I have left a few comments, PTAL.
My general request for changes is to replicate this for Chinese docs (usually we update both): https://github.com/apache/flink/blob/master/docs/content.zh/docs/deployment/filesystems/s3.md
It can be done in English for further translation.
| Flink provides two file systems to talk to Amazon S3, `flink-s3-fs-presto` and `flink-s3-fs-hadoop`. | ||
| Both implementations are self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use them. | ||
| - **Native S3 FileSystem** (`flink-s3-fs-native`): Built directly on AWS SDK v2 with async I/O and parallel transfers, this implementation supports both checkpointing and the FileSystem sink. [Benchmarks](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396) show ~2x higher checkpoint throughput (~200 MB/s vs ~90 MB/s) compared to the Presto implementation at state sizes up to 15 GB. **Experimental** in Flink 2.3; the API and behavior may change in future releases. |
Should we say it is (experimental) in the header of the referenced section?
| You need to configure both `s3.access-key` and `s3.secret-key` in Flink's [configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}): | ||
| ```yaml | ||
| s3.access-key: your-access-key |
Should we mention bucket level configuration overrides for all these?
| ### Native S3 FileSystem | ||
| {{< hint warning >}} | ||
| **Experimental**: The Native S3 FileSystem implementation is experimental in Flink 2.3. While functionally complete, it should not yet be used in production environments. Please use Presto or Hadoop implementations for production deployments. |
"it should not yet be used in production environments"
This may be too strong a statement. It should also explain why you should be cautious about using it in production.
| - **No external dependencies**: Built on AWS SDK v2 with minimal footprint | ||
| - **Drop-in replacement**: Compatible with the same S3 URI schemes (`s3://`) | ||
| - **Encryption support**: Server-side encryption (SSE) and KMS encryption |
This "and" reads as confusing. If we are talking about SSE-KMS, that is server-side encryption as well.
| s3.path-style-access: true | ||
| ``` | ||
| ## S3 FileSystem Implementations |
This heading appears twice at the same level. Can that break ToC links?
| #### Features | ||
| - **FileSystem sink support**: The only S3 implementation with support for the [FileSystem sink]({{< ref "docs/connectors/datastream/filesystem" >}}) |
I thought the FS sink is supported for Native, since we have a RecoverableWriter implementation. Is this an oversight, or do we need more changes for FS sink support?
| s3.retry.max-num-retries: 3 | ||
| # Credentials provider | ||
| fs.s3.aws.credentials.provider: software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider |
The default for this config is noDefaultValue, which may make this example a bit confusing. Should we have it explicitly in the config if this is our intention?
| ```yaml | ||
| s3.path.style.access: true | ||
| s3.path-style-access: true |
Will we have any backwards compatibility problems with the config property name being changed?
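One common way to avoid a compatibility break when a key is renamed is to read the new key first and fall back to the old spelling. The sketch below illustrates that idea in plain Python. The helper `read_path_style_access` is hypothetical, and treating `s3.path.style.access` as the older spelling is an assumption based on the order of the two lines in the diff. Flink's `ConfigOption` offers `withDeprecatedKeys` for this purpose; whether the Native S3 options use it is not stated in this thread.

```python
# Key names come from the diff above; which one is "old" is an assumption.
NEW_KEY = "s3.path-style-access"
OLD_KEY = "s3.path.style.access"


def read_path_style_access(conf, default=False):
    """Return the path-style setting, preferring the new key over the old one."""
    for key in (NEW_KEY, OLD_KEY):
        if key in conf:
            # Accept both booleans and the strings "true"/"false".
            return str(conf[key]).strip().lower() == "true"
    return default
```

With a lookup like this, a configuration that still uses the older spelling keeps working, and the new key wins when both are present.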
| **Caution** : Do not load `flink-s3-fs-native` and `flink-s3-fs-hadoop` plugins simultaneously. |
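The caution above can be checked mechanically before starting a cluster. The following is a minimal sketch, not part of the PR: the helper names are hypothetical, and it assumes plugin jars sit somewhere under the `plugins` directory with file names that start with the plugin name.

```python
from pathlib import Path

# Plugin names are taken from the documentation under review; the jar
# naming pattern (plugin name as file-name prefix) is an assumption.
CONFLICTING = ("flink-s3-fs-native", "flink-s3-fs-hadoop")


def find_s3_plugins(plugins_dir):
    """Return the conflicting plugin names that have a jar under plugins_dir."""
    found = set()
    for jar in Path(plugins_dir).rglob("*.jar"):
        for name in CONFLICTING:
            if jar.name.startswith(name):
                found.add(name)
    return found


def has_conflict(plugins_dir):
    """True when both the Native and the Hadoop S3 plugin jars are present."""
    return find_s3_plugins(plugins_dir) == set(CONFLICTING)
```

A deployment script could call `has_conflict("plugins")` and refuse to start if it returns `True`.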
What is the purpose of the change
Add documentation for the Native S3 FileSystem.
Please note that this patch does not update the Chinese documentation yet. This will be done once the English content has reached consensus.
Brief change log
Add documentation and show how to use the new S3 FileSystem.
Verifying this change
Build the docs locally using Hugo
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation
Does this pull request introduce a new feature? (yes / no) no
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) yes