Skip to content

Conversation

@arturobernalg
Copy link
Member

Implements RFC 6266/5987 filename encoding in multipart/form-data. Adds mode-aware support in FormBodyPartBuilder and HttpRFC7578Multipart, propagates mode via MultipartEntityBuilder, and updates tests. Resolves https://issues.apache.org/jira/browse/HTTPCLIENT-2360.

@arturobernalg arturobernalg requested a review from ok2c February 25, 2025 08:02
Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Looks good. Some minor suggestions.

* @return this builder instance for method chaining
* @since 5.5
*/
public FormBodyPartBuilder setMode(final HttpMultipartMode mode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Would not it be better to pass this parameter at the construction time to #create method?

* @since 5.5
*/
private static boolean canEncodeToISO8859_1(final String input) {
return StandardCharsets.ISO_8859_1.newEncoder().canEncode(input);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Let's make the method non-static re-use the encoder rather than creating a new instance with every invocation

final FormBodyPart p1 = FormBodyPartBuilder.create(
"field1",
new InputStreamBody(new FileInputStream(tmpfile), s1 + ".tmp")).build();
new InputStreamBody(new FileInputStream(tmpfile), s1 + ".tmp")).setMode(HttpMultipartMode.LEGACY).build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Would not it look better as a #create parameter?

@arturobernalg
Copy link
Member Author

@arturobernalg Looks good. Some minor suggestions.

HI @ok2c fair enough. I’ve made the changes as recommended. Please take another pass.

@arturobernalg arturobernalg requested a review from ok2c February 26, 2025 20:00
* Reusable encoder for ISO-8859-1 charset checks, improving performance by avoiding
* repeated instance creation.
*/
private final CharsetEncoder iso8859_1Encoder = StandardCharsets.ISO_8859_1.newEncoder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg I would instantiate this attribute lazily. It does not need to be final. Looks good overall.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ok2c changed.

…r RFC 6266/5987

- Modified FormBodyPartBuilder to support HttpMultipartMode, adding filename* with UTF-8 encoding for non-ISO-8859-1 filenames in STRICT/EXTENDED modes, skipping it in LEGACY mode.
- Updated HttpRFC7578Multipart to use mode for filename encoding: percent-encode in EXTENDED, ISO-8859-1 in STRICT/LEGACY, and always encode filename* per RFC 5987.
- Adjusted MultipartEntityBuilder to propagate mode to FormBodyPartBuilder, ensuring consistent behavior across the pipeline.
- Fixed tests to align with mode-specific expectations, maintaining LEGACY mode’s raw UTF-8 filename behavior.
Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Nicely done.

@arturobernalg arturobernalg merged commit fc3b11e into apache:master Feb 28, 2025
10 checks passed
@ok2c
Copy link
Member

ok2c commented Feb 28, 2025

@arturobernalg While you are at it, do you think you could remove the code that randomly generates boundary values with a fixed boundary and update the javadocs to instruct the users to use generate their own boundary values if they want them to be random / non-predictable / 'secure'?

@arturobernalg
Copy link
Member Author

@arturobernalg While you are at it, do you think you could remove the code that randomly generates boundary values with a fixed boundary and update the javadocs to instruct the users to use generate their own boundary values if they want them to be random / non-predictable / 'secure'?

@ok2c sure. I'll do

@surlemur
Copy link

Question: Wouldn't it be more standard compliant to render the filename* attribute whenever the string is not US-ASCII encodable? E.g., Embedded Tomcat 10.1 does not properly parse the ISO-8859-1 filename content, but correctly handles filename*. It seems to me that the default ISO-8859-1 assumption, although mentioned in some RFCs, is not consistently true in reality.

@arturobernalg
Copy link
Member Author

Question: Wouldn't it be more standard compliant to render the filename* attribute whenever the string is not US-ASCII encodable? E.g., Embedded Tomcat 10.1 does not properly parse the ISO-8859-1 filename content, but correctly handles filename*. It seems to me that the default ISO-8859-1 assumption, although mentioned in some RFCs, is not consistently true in reality.

RFC 7578 §4.2 is crystal clear:

**NOTE: The encoding method described in RFC 5987, which would add a filename* parameter to the Content-Disposition header field, MUST NOT be used.**

HttpClient 5.5 therefore emits filename* only when the filename contains code points that ISO-8859-1 cannot represent. Everything in the 0x00-0xFF range already has a legal encoding, so adding filename* there would break the letter of the spec and has no real-world upside.

If embedded Tomcat 10.1 stumbles over an ISO-8859-1 filename, that is a Tomcat issue, not a protocol requirement. Use HttpMultipartMode.EXTENDED (percent-encoded names) as a workaround or open a bug with Tomcat, but let’s keep HttpClient on the right side of the standards

@surlemur
Copy link

Ok. I must admit it's hard to grasp the combination of all those related RFCs plus the different traditions...

The pragmatic solution seems to be to assume ISO-8859-1 as the default on the server side when parsing the filename property. I am using the Apache Commons FileUpload and realized that I can set the header encoding there.

Thanks!

@arturobernalg
Copy link
Member Author

Ok. I must admit it's hard to grasp the combination of all those related RFCs plus the different traditions...

The pragmatic solution seems to be to assume ISO-8859-1 as the default on the server side when parsing the filename property. I am using the Apache Commons FileUpload and realized that I can set the header encoding there.

Thanks!

@surlemur ,

Correct. RFC 7578 treats the raw octets of a filename parameter as ISO-8859-1 unless they are percent-encoded or a filename*= override is present (see the NOTE in §4.2).

Glad it’s sorted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants