forked from rack/rack-attack
# Refactor: less-spiky throttles #1
Open: jamiemccarthy wants to merge 19 commits into `main` from `jm-random-period-offset`
+203 −34
## Conversation
Conflicts: `lib/rack/attack/cache.rb` (resolved; just a code comment)
## Overview
My company's been using Rack::Attack for years and it's an important part of how our Rails application controls traffic. But because every IP's 60-second period for its requests rolls over at the same time, we were seeing a spiky traffic pattern.
At the top of the minute, all the badly-behaving clients' requests would be passed through our application's Rails server. Then over the course of that minute they would gradually get themselves blocked. Near the end of the minute, traffic would be lower. And it would repeat, with Rails getting slammed again at the top of each minute.
It's better for performance purposes to spread that traffic out over the minute as evenly as possible. This is especially important if we want to improve our 99th-percentile performance.
This PR implements that by giving each discriminator (such as an IP number) a pseudo-random offset into the time period. The same duration is used, 60 seconds in our case, but each individual IP's rollover time could be fixed at :00, :59, or anywhere in between.
Badly-behaving clients can still slam Rails when their period rolls over. But they don't all slam it together.
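The core idea fits in a few lines. Here's a minimal sketch (my illustration, not the PR's exact code), using MD5 as discussed later in this description:

```ruby
require "digest/md5"

# Deterministic per-discriminator offset, in whole seconds. The same IP
# always gets the same offset, so its window boundary is stable; different
# IPs get different boundaries, spreading rollovers across the period.
def offset_for(discriminator, period)
  Digest::MD5.hexdigest(discriminator).to_i(16) % period
end

offset_for("203.0.113.7", 60)  # => some fixed value in 0..59
offset_for("203.0.113.8", 60)  # => most likely a different fixed value
```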
The change is implemented inside `Rack::Attack::Cache`. The `count` method now takes an optional argument to enable the period offset, defaulting to the old behavior. `Rack::Attack::Throttle` makes use of it, unless there's an incompatible callback (see below). But it wouldn't make sense for safelist, blocklist, and track, so their behavior is unaffected.
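For orientation, upstream's `Cache#count` keys each counter on `epoch_time / period`. Here's a sketch of how an offset can slot into that shape; the `offset:` keyword is my assumption for illustration, not necessarily the PR's exact signature:

```ruby
# Sketch modeled on upstream Cache#count, with a hypothetical `offset:`
# argument enabling the per-discriminator shift. Not the PR's exact code.
def count(unprefixed_key, period, offset: false)
  epoch_time = Time.now.to_i
  # Shifting the clock by this discriminator's offset shifts the window
  # boundary, so each discriminator rolls over at its own time.
  epoch_time -= offset_for(unprefixed_key, period) if offset
  expires_in = (period - (epoch_time % period) + 1).to_i
  do_count("#{prefix}:#{(epoch_time / period).to_i}:#{unprefixed_key}", expires_in)
end
```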
## The effect

Vox Media switched our main production external-facing web app to this branch, 06787b5 to be precise, earlier this year. It's been running fine, serving hundreds of requests per second, for a month.
Before: Here's what a DDoS attempt from 12 IPs looked like before the offset functionality was enabled. The empty gaps at the top of each minute are where the bot's requests were not being throttled and all 12 attacker IPs were being served actual traffic all at once. This concentrated the attack's load on our app servers into those first few seconds of each minute; the traffic "spikes" show up as the negative space between the 429 bars:
After: And here's a different DDoS attempt, with the offset enabled. Each of the attacker's 2,742 IPs still has the same number of allowed requests per minute, but each IP's counter resets at a different time instead of landing all at once, spreading out the load across the whole minute:
(In that time period, almost all of the attack came from just a handful of IPs, with 2,700+ others contributing only a little traffic each. But despite that extreme concentration, the random offset was still sufficient to spread out the load.)
## Ensuring custom-responder correctness
So as we just saw, this change should work well, and work transparently, for applications that are using Rack::Attack to throttle public non-authenticated traffic.
However, there are some compatibility issues when it's used to return helpful information for what the README calls "well-behaved" traffic, for example authenticated customers hitting an API.
The problem is that this PR makes `Rack::Attack::Cache` use a private method to calculate when the end of the period will come, which is the value we want to emit to friendly clients. But for some years the README has been advising applications to calculate it themselves, and has provided sample code that doesn't return the same result as this PR now does. With this branch's changes, that previous sample code, or any similar code, would calculate the wrong result. The throttling effect will be the same, but applications' lambdas would be emitting incorrect information.
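For context, this is roughly the sample the README has been offering (reconstructed from memory, so treat details as approximate); note that the reset math assumes every window starts at a multiple of the period:

```ruby
# The README's classic suggestion (approximate): compute the reset time
# yourself, assuming windows are aligned to `now % period == 0`.
Rack::Attack.throttled_responder = lambda do |request|
  match_data = request.env["rack.attack.match_data"]
  now = match_data[:epoch_time]

  headers = {
    "RateLimit-Limit"     => match_data[:limit].to_s,
    "RateLimit-Remaining" => "0",
    # Wrong once per-discriminator offsets are in play:
    "RateLimit-Reset"     => (now + (match_data[:period] - now % match_data[:period])).to_s
  }

  [429, headers, ["Throttled\n"]]
end
```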
Now, I have of course updated the `DEFAULT_THROTTLED_RESPONDER` lambda to read that value out of `match_data`. So if an application is just setting `throttled_response_retry_after_header = true` and using that default, it'll work correctly.

But if an application has written its own custom responder, perhaps based on the previous sample code, this PR disables its own functionality, to make sure the correct answer is returned. Oh well! The throttling behavior doesn't change and the gem works as it did before.
When application authors notice this change, they can update their lambda to be compatible, and when they do, I've provided a config option to restore this PR's functionality: `throttled_responder_is_offset_aware = true`. I've updated the README's sample code to reflect this.

Because this ensures backwards compatibility, with functionality automatically remaining the same if necessary, I believe that should allow this PR to be released in a 6.x version.
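An offset-aware responder might then look something like this sketch. I'm assuming the PR exposes the computed period end in `match_data`; the `:period_end` key name below is my placeholder for illustration, not necessarily the PR's actual key:

```ruby
# Sketch of an offset-aware custom responder. The :period_end key name is
# an assumption for illustration; check the PR's match_data for the real key.
Rack::Attack.throttled_responder_is_offset_aware = true

Rack::Attack.throttled_responder = lambda do |request|
  match_data = request.env["rack.attack.match_data"]

  headers = {
    "RateLimit-Limit"     => match_data[:limit].to_s,
    "RateLimit-Remaining" => "0",
    # Read the reset time from match_data instead of computing it:
    "RateLimit-Reset"     => match_data[:period_end].to_s
  }

  [429, headers, ["Throttled\n"]]
end
```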
I've written tests to ensure correctness in all three cases (default responder, custom not-aware, custom aware). And Vox Media has run each of the latter two cases in production.
## Is this too unpredictable for friendlies?
Sometimes apps want to throttle "friendly" traffic, promising paying customers a certain number of requests. They may be using Rack::Attack for this purpose.
So let's consider the case of an app that wants to throttle "well-behaved" customers in a predictable way, promising them x requests per week or month. We might wonder if rolling over that week or month at an unpredictable time would make the customer unhappy.
I consider this unlikely. Currently the period must roll over at `% period == 0`, which for a week is not Sunday midnight but an oddball 7 PM or 8 PM Eastern Time every Wednesday, depending on daylight saving time. And this can't be used for month-by-month traffic: there's no way to specify an exact period for the number of seconds in a month, so nobody's doing that.

If any applications depend on throttling customer traffic per-day with the expected behavior of rolling over at GMT midnight, this PR encapsulates that logic in `Rack::Attack::Cache.offset_for`. A targeted monkey-patch could pick any time of day; maybe their customers would prefer midnight in their local time zone (see the sketch below).

But any such applications are probably using a throttled callback to report this information to customers' clients, and if those clients can't figure out what to do, that seems like a client problem, not an application problem.
If this behavior change seems like a deal-breaker, I'd be happy to make `offset_for` a bit more complex and, for example, always return `0` for `period == 86400`, or `period >= 86400`, or similar. But I don't think this would be necessary, because I think Rack::Attack is already not the right solution for these needs.
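For applications that do care, a targeted patch might look like this; I'm assuming a class-level `offset_for(discriminator, period)` signature for illustration, so check the PR's actual method before copying:

```ruby
require "rack/attack"

# Hypothetical monkey-patch: keep the classic GMT-midnight rollover for
# daily periods, and the PR's pseudo-random offset for everything else.
class Rack::Attack::Cache
  class << self
    alias_method :original_offset_for, :offset_for

    def offset_for(discriminator, period)
      return 0 if period == 86_400 # daily windows reset at GMT midnight
      original_offset_for(discriminator, period)
    end
  end
end
```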
## Tests improved
A number of tests were written assuming that they knew how to generate the key they were checking for. That's an implementation detail that this PR now changes. I changed them to query a class method for that data (via `send`, since it's a private method).

I also added a separate unit test for that private method's key generation, to make sure that code is separately covered.
This also has the advantage of removing all the non-Timecop-frozen `Time.now` references from tests, each of which is an unlikely but potential source of failure. I don't promise to have found every case where Timecop can absolutely rule out tests breaking, but I think I got most of them, and the test suite should be a little more reliable now.
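The resulting test shape looks roughly like this (`key_for` is a stand-in name for the private method, not the PR's actual code):

```ruby
require "rack/attack"
require "timecop"
require "minitest/autorun"

# Illustrative test shape: freeze the clock, then ask the class itself for
# the key rather than hand-building it from implementation details.
class ThrottleKeyTest < Minitest::Test
  def test_count_uses_the_generated_key
    Timecop.freeze(Time.utc(2024, 1, 1, 12, 0, 0)) do
      key = Rack::Attack::Cache.send(:key_for, "1.2.3.4", 60)
      # ...assert the cache was written under `key`...
      refute_nil key
    end
  end
end
```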
## 1-second as choice of resolution

I round offsets to the nearest second, and consider that sufficient. If someone feels it would be helpful to round to a tenth of a second, I'd have no objection. But it would take an extremely high-traffic site to notice the difference, and it'd require rethinking the tests a bit.
## MD5 as choice of hash
This PR works only because the discriminator (IP number or what-have-you) gets hashed to a predictable value. I picked `Digest::MD5` to generate the hash value. It's fast, and MD5's infamous issues for cryptography don't concern us for this purpose.

Its speed seems fine. On my Chromebook, in a simple benchmark on an IP address, MD5 and modulo run in about 10 microseconds. That's 2 orders of magnitude under a typical round-trip time to the cache, and 3-4 under servicing a typical web request. So I think that's good enough.
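A quick way to reproduce that measurement (my sketch, not the PR's benchmark):

```ruby
require "benchmark"
require "digest/md5"

N = 100_000
ip = "203.0.113.42"

# Time N hash-and-modulo operations and report the per-call average.
elapsed = Benchmark.realtime do
  N.times { Digest::MD5.hexdigest(ip).to_i(16) % 60 }
end

puts format("%.2f microseconds per call", elapsed / N * 1_000_000)
```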
(In case anyone's curious, `Zlib.crc32` isn't better, and stripping off a prefix/suffix of the `hexdigest` before calling `String#hex` only makes it slower. The gem doesn't support Ruby < 2.4, so we don't have to worry about Fixnum vs. Bignum: it's all just Integer, and `%` on MD5's 128-bit Integer is no problem for Ruby.)

## One more thing
I kept this PR as on-target as possible, but in this case I did want to correct a documentation error. For the last nine years the README has said that `match_data` annotation happens for:
This is backwards and I changed it to: