Skip to content

Conversation

@arturobernalg
Copy link
Member

Simplify MultipartEntityBuilder by replacing random boundary with fixed "boundary-blah" default. Update Javadoc to note users should set custom boundaries if needed, and tweak test for the fixed value.

@arturobernalg arturobernalg requested a review from ok2c February 28, 2025 12:22
@ok2c
Copy link
Member

ok2c commented Feb 28, 2025

@arturobernalg Please, please do not take everything I say literally or even seriously. By boundary_blah I did not mean 'boundary_blah', but something that is unique enough and is unlikely to be present in body parts' text. Something like httpclient_bounary_1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ or similar. Just something reasonably non-common.

… in MultipartEntityBuilder. Update Javadoc to clarify users must set their own boundary for uniqueness, and adjust test to verify the new fixed value.
@ok2c ok2c changed the title Fixed "boundary-blah" as a default MultipartEntityBuilder to use a fixed boundary by a default Feb 28, 2025
@ok2c
Copy link
Member

ok2c commented Feb 28, 2025

@arturobernalg Do you think you would have time to do the following?

  1. Add two boundary generators, one that always returns the same boundary and one that generates a random one, probably using UUID#randomUUID or SecureRandom
  2. Spit out a warning when the user does not explicitly sets a boundary?

@arturobernalg
Copy link
Member Author

@arturobernalg Do you think you would have time to do the following?

1. Add two boundary generators, one that always returns the same boundary and one that generates a random one, probably using `UUID#randomUUID` or `SecureRandom`

2. Spit out a warning when the user does not explicitly sets a boundary?

@ok2c
I just implemented this in a PR. It now includes two boundary generators. Please let me know if that is what you requested

@arturobernalg arturobernalg requested a review from ok2c February 28, 2025 22:20
@arturobernalg arturobernalg force-pushed the bounday branch 5 times, most recently from 2837a9e to 8749a1a Compare March 1, 2025 20:26
Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Looks good to me. Please do allow the security folks a few more days to review and comment

@benweissmann
Copy link
Contributor

Is there a reason to default to the fixed boundary rather than the secure-random boundary? I think the only time it's safe to use a fixed boundary is if there's no user-controlled data in the request that could possibly include the fixed boundary. A few examples:

  • Let's say we're working with a file that itself contains network traffic data. For example, imagine a program that uploads a HAR file (a recording of network traffic) to server using MultipartEntity encoding. This would break if the network recording itself contained recording of traffic generated by MultipartEntityBuilder, since the data they're trying to encode would include this boundary.

  • As a more malicious example, imagine a server-side program that uses MultipartEntityBuilder to send a request to an authorization server with a payload along the lines of:

Content-Disposition: form-data; name="userProfile"
Content-Type: text/plain

<the user-provided profile goes here>
--httpclient_boundary_7k9p2m4x8n5j3q6t1r0vwyzabcdefghi
Content-Disposition: form-data; name="userPermissions"
Content-Type: text/plain

Regular-User

With userPermissions being some server-controlled (not user-controlled) value. A user could provide a profile of Foo bar\r\n--httpclient_boundary_7k9p2m4x8n5j3q6t1r0vwyzabcdefghi\r\nContent-Disposition: form-data; name="userPermissions"\r\nContent-Type: text/plain\r\n\r\nAdmin_user to coerce the server into providing admin-level permissions:

Content-Disposition: form-data; name="userProfile"
Content-Type: text/plain

Foo bar
--httpclient_boundary_7k9p2m4x8n5j3q6t1r0vwyzabcdefghi
Content-Disposition: form-data; name="userPermissions"
Content-Type: text/plain

Admin-User
--httpclient_boundary_7k9p2m4x8n5j3q6t1r0vwyzabcdefghi
Content-Disposition: form-data; name="userPermissions"
Content-Type: text/plain

Regular-User

Rather than trying to communicate these risks via Javadoc, why not just default to the always-safe option of a UUID, which uses a cryptographically-secure RNG and doesn't have any of these risks? Most other MultipartEntity libraries use a cryptographically-secure RNG to generate the boundary. And then users could explicitly opt-in to the simpler behavior if they're confident that none of their data could contain the fixed boundary.

@ok2c
Copy link
Member

ok2c commented Mar 9, 2025

@benweissmann Regular users can be expected to sanitize / validate UI input and use TLS for transport security, in which case the boundary value can as well be fixed. Folks concerned about (1) green men from Mars, (2) Evil Russians, (3) witchcraft, etc can opt to use the UUID based generator. We can make the latter default but I personally do not see a good reason.

@ok2c ok2c merged commit 85be78d into apache:master Mar 13, 2025
10 checks passed
@outlandishlizard
Copy link

This PR appears to fundamentally misunderstand the risks; it's not a packet injection or MITM attack, it's an in-band attack within user supplied data. The notion that an application should know that it needs to parse and sanitize out a magic apache-specific boundary token from input is absurd. The changes in this PR are actively dangerous and should be reverted immediately, as they result in a fundamentally unsafe default behavior.

@outlandishlizard
Copy link

Aside from the security implications, this change also violates the guidance in RFC 2048 that:

The use of a media type of "multipart" in a body part within another
"multipart" entity is explicitly allowed. In such cases, for obvious
reasons, care must be taken to ensure that each nested "multipart"
entity uses a different boundary delimiter. See RFC 2049 for an
example of nested "multipart" entities.

With this change, nested multipart encoding will also break unless users explicitly toggle on the randomized boundaries.

I'd also note the following section from the RFC:

NOTE: Because boundary delimiters must not appear in the body parts
being encapsulated, a user agent must exercise care to choose a
unique boundary parameter value. The boundary parameter value in the
example above could have been the result of an algorithm designed to
produce boundary delimiters with a very low probability of already
existing in the data to be encapsulated without having to prescan the
data. Alternate algorithms might result in more "readable" boundary
delimiters for a recipient with an old user agent, but would require
more attention to the possibility that the boundary delimiter might
appear at the beginning of some line in the encapsulated part. The
simplest boundary delimiter line possible is something like "---",
with a closing boundary delimiter line of "-----".

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

@outlandishlizard Thank you for sharing your opinion. Please do note HttpClient does not support nested MIME bodies and never will. For anything other than trivial cases it is recommended to use a full-featured MIME library such as Apache Mime4J.

@michael-o
Copy link
Member

@outlandishlizard Thank you for sharing your opinion. Please do note HttpClient does not support nested MIME bodies and never will. For anything other than trivial cases it is recommended to use a full-featured MIME library such as Apache Mime4J.

This does not mean that our code should be compliant.

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

Exactly compliant with that? The section in question is referring to "a body part within another"multipart" entity". We do not have such cases.

@michael-o
Copy link
Member

Exactly compliant with that? The section in question is referring to "a body part within another"multipart" entity". We do not have such cases.

You mean that the client is responsible for the payload?

@outlandishlizard
Copy link

Circling back, what is the benefit of using a fixed boundary value? It appears to be a change that is less RFC compliant and less secure, and places an unrealistic burden on users to understand particulars of the internals of the library when sanitizing message content.

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

Circling back, what is the benefit of using a fixed boundary value?

@outlandishlizard They are not many: it is less computationally expensive and is simple, the folks would know what to look for when sanitizing input

By the way, what are exactly the benefits of using a random boundary other than giving the users a false notion of not having to worry about the security of their applications and having someone else to blame in case something goes sideways?

@outlandishlizard
Copy link

The RFC names quite a few reasons for using a pseudorandom or random value, which I have quoted.

Additionally, I don't think that it is "providing a false sense of security" to ensure that users don't need to be aware of unique implementation details of a downstream library when parsing input.

I have never before considered whether a message I was sending via cURL, or python requests, or any other number of tools might be mangled because of multipart encoding issues with the underlying library, and I'm a bit confused as to why you think your users, who may be several wrappers downstream, will have any idea that they need to escape these values-- it's not a normal application security consideration at all.

@outlandishlizard
Copy link

Even if we accept your premise that this is a user responsibility to ensure input never contains a boundary separator (something that they literally cannot do in many RFC compliant implementations that use randomized boundaries), that would make this patch a breaking change to current behavior without any user comms.

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

who may be several wrappers downstream, will have any idea that they need to escape these values-- it's not a normal application security consideration at all.

This is a responsibility of people who wrap the library and enforce a security model or provide a UI, not that of the library. Again, do not pin it on us.

@outlandishlizard
Copy link

So, you contend that your users, consumers of an HTTP client library, are responsible for piercing the abstraction layer, seeing that your code handles multipart encoding wrong, and defending themselves from the library?

@outlandishlizard
Copy link

I will also note that you have yet to address my comment that this constitutes a breaking change that users had no way to defend themselves from-- surely you don't contend that they should also have read your mind that you planned to switch the implementation to a static boundary and defensively coded their applications to contain filters for that boundary?

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

So, you contend that your users, consumers of an HTTP client library, are responsible for piercing the abstraction layer

@outlandishlizard Please re-read what I have said.

I will also note that you have yet to address my comment that this constitutes a breaking change

I do not. There is a reasonably well-established process for dealing with potentially breaking changes.

@outlandishlizard
Copy link

outlandishlizard commented Mar 16, 2025

who may be several wrappers downstream, will have any idea that they need to escape these values-- it's not a normal application security consideration at all.

This is a responsibility of people who wrap the library and enforce a security model or provide a UI, not that of the library. Again, do not pin it on us.

I have read what you said several times. The above quote seems to me to say that you believe it is the responsibility of your users to escape the boundary values, piercing the abstraction layer of "send a multipart message please" and requiring a full scan of the request body in order to ensure it complies with the specific quirks of multipart encoding this PR implements.

It seems to me that this is both onerous on the user, and substantially less performant than using a random boundary value from a secure random source (there will be a constant-time versus linear time cost breakpoint for certain-- for small enough multipart requests the linear scan probably wins!).

With regard to your comments about a "false sense of security":

The entire point of using a random value is to make the chance of a collision of message content and boundary content negligible; the sense of security provided is in no way false. The spec allows for up to 70 characters of boundary; these characters are selected from a set of 75 potential options, meaning that we have a total of:

75^70 =
179592599813797960985749775445106096943740867154224363601035145662298829808338299928818803698205019969691420556046068668365478515625

potential valid boundary delimiters, making the chance of a robust random scheme producing a collision negligible, in the cryptographic sense of the word.

So, we are left with three options for providing users a robust path forward:

  1. Add a requirement that users of this library perform a linear scan on their message bodies for a library specific boundary identifier, in exchange for a potential small performance boost for the library itself. Users who fail to implement a linear scan will encounter some messages that are reliably mangled by the library.
  2. Use a random implementation, with a potential very small performance hit. Assuming a trillion requests per second are processed by the library around the world, the sun will have exploded before a collision occurs.
  3. Entirely abdicate the responsibility for boundary generation as a library, and require the boundary be supplied by the user. This is logically consistent with the stance I'm seeing from you that the user should be responsible for boundary encoding, although I personally think it's not a very nice user experience.

@ok2c
Copy link
Member

ok2c commented Mar 16, 2025

The above quote seems to me to say that you believe it is the responsibility of your users to escape the boundary values

@outlandishlizard It is a responsibility of the application layer that enforces domain specific security constraints and handles user input through a UI. HttpClient does neither. It cannot make those decisions. The chance of a boundary value collision is always zero because MultipartEntityBuilder does not support nested MIME content. And I keep on repeating myself, and we keep on going in circles.

@outlandishlizard
Copy link

No, the chance of a boundary value collision is NOT zero, as boundary value collisions can also occur due to the message body containing the boundary value, which is the issue that was originally brought up in this thread. Changing the boundary value to a static value makes it possible for a library to anticipate this and perform a linear scan to prevent the content it sends from being mangled, but in doing so also dramatically increases the chances of a collision occurring in the first place.

@ok2c
Copy link
Member

ok2c commented Mar 17, 2025

@outlandishlizard So, you happily ignored the main point and proceeded to tell me the good old tale if thousands of monkeys bang away at a type writer long enough they will eventually produce Leo Tolstoy's "War and Peace". So, we keep on merrily going in circles.

I can no longer help you here. If someone else submits a pull request making UUID based boundary default and another project committers accepts the PR, I will not stand in their way. My personal preference though is to deprecate the whole damn thing and be done with it and Green Men from Mars.

@TomerAberbach
Copy link

Hey folks, I'm a software engineer at Stainless. We generate Java/Kotlin SDKs and they all use MultipartEntityBuilder from this library.

I was just made aware of this change. I don't agree with a fixed boundary being the default behavior, but putting that aside, does this PR change the default behavior in a non-major version bump? It seems like a clear breaking change to me.

If this will not be associated with a major version bump, then we're going to have a serious problem. The SDKs will be stuck between the following two choices:

  1. Bump httpcomponents-client to the new version so that I can call withRandomBoundary() to guarantee a valid boundary

    • The problem with this is that Maven dependency mediation may still cause an older version to be chosen (e.g. if the end-user of the SDK is using another version of httpcomponents-client directly or indirectly). In that case, the SDK will began crashing with NoSuchMethodError because withRandomBoundary() will not exist in that version. So really, I can't upgrade to this new version until it's been out for quite a while, unless I want to force all end-users to upgrade too, which is arguably a breaking change to my SDK.
  2. Don't bump httpcomponents-client to the latest version

    • The problem with this is that the behavior is changing in this new version and again, because of Maven dependency mediation, I have no control over which version my end-user ends up using. But because I didn't bump the version, I can't call withRandomBoundary(). So I can't guarantee the particular behavior I want for the SDKs.

Technically I could use reflection to look up at runtime whether withRandomBoundary() exists, and call it if it does, but I would really like to avoid those kinds of hacks.


So with all that being said, if you're not going to revert this PR, can you at least release this as a major version bump? I think it would be the responsible thing to do to set expectations around which versions of this library are backwards compatible with each other.

@garydgregory
Copy link
Member

Hello @TomerAberbach

We normally reserve major version bumps for breaking binary compatibility as defined by the JLS, which is not the case here.

@TomerAberbach
Copy link

Okay, I hear you, but did my comment help explain why I think this change would be disruptive if it's not released as a major version bump? Do you have any thoughts on the problem I described?

@michael-o
Copy link
Member

I'd rather make it random and move it. Not worth the pain or reduction of the computational overhead.

@ok2c
Copy link
Member

ok2c commented Mar 18, 2025

@TomerAberbach Maven dependency resolver should pick version 5.5 over 5.4 if both are present as transitive dependencies. (Please @michael-o correct me if I am wrong.) If the user has explicitly pinned HttpClient to a specific version in their POM this is a general dependency management problem that the user needs to address either by pinning the version of your library to match that of HttpClient or by upgrading (or unpinning) HttpClient version.

You may be much better off migrating to Apache Mime4J given the direction this whole thing is going.

@michael-o
Copy link
Member

@TomerAberbach Maven dependency resolver should pick version 5.5 over 5.4 if both are present as transitive dependencies. (Please @michael-o correct me if I am wrong.) If the user has explicitly pinned HttpClient to a specific version in their POM this is a general dependency management problem that the user needs to address either by pinning the version of your library to match that of HttpClient or by upgrading (or unpinning) HttpClient version.

Pinning is always crucial, if you know something is coming.

You may be much better off migrating to Apache Mime4J given the direction this whole thing is going.

I don't know whether this is really necessary. The MIME package here covers a lot of cases, I don't see a reason why it should not continue to do so. I am really inclined to say that it should be random by default and static by choice. I do use MIME code myself and it is fully sufficient for almost all use cases here.

@ok2c
Copy link
Member

ok2c commented Mar 18, 2025

I don't know whether this is really necessary. The MIME package here covers a lot of cases, I don't see a reason why it should not continue to do s

@michael-o And who deals with them "security professional"? Who reads through numerous RFCs to make sure we stay conformant?

@TomerAberbach
Copy link

@TomerAberbach Maven dependency resolver should pick version 5.5 over 5.4 if both are present as transitive dependencies. (Please @michael-o correct me if I am wrong.)

I wish this were true, but unfortunately Maven will pick the version closest to the root of the user's project...

Gradle does the right thing and picks the higher of the two versions, but enough people use Maven that its resolution strategy is a problem for me.

If the user has explicitly pinned HttpClient to a specific version in their POM this is a general dependency management problem that the user needs to address either by pinning the version of your library to match that of HttpClient or by upgrading (or unpinning) HttpClient version.

The problem with this is that the user may be on an old version and it may be onerous for them to upgrade to the latest one just so that my SDK can call this withRandomBoundary().

That's why I consider this PR a breaking change:

  1. The contract of the class has changed. It used to not be the responsibility of the caller to ensure some fixed boundary isn't in the request body, but now it is.
  2. In order to avoid the new contract, I must force all users of my library to upgrade to the new version. If they don't upgrade then their code will break because withRandomBoundary() will not exist at runtime.

Is this not clearly a breaking change? A change to the contract of a class that also ends up requiring breaking changes to users of this library?

You may be much better off migrating to Apache Mime4J given the direction this whole thing is going.

You're one of the maintainers right? I'm curious why you'd prefer users to migrate to a different library instead of using this one?

@ok2c
Copy link
Member

ok2c commented Mar 18, 2025

The problem with this is that the user may be on an old version and it may be onerous for them to upgrade to the latest one just so that my SDK can call this withRandomBoundary().

@TomerAberbach Disregard #withRandomBoundary 1. Generate a boundary value any way you deem appropriate for your application 2. Assign it with #setBoundary 3. Live happily ever after. This is how one should have been using MultipartEntityBuilder from the very beginning.

Apache Mime4J is a full featured MIME library. I used to contribute to it a lot 15 years ago but not any more. When HttpComponents was established as a Top Level Project it was decided we would be using Mime4J instead of developing out own MIME support. Then, Google came and everything had gone to sh*te

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants