@jamiemccarthy commented Dec 30, 2021

Overview

My company's been using Rack::Attack for years and it's an important part of how our Rails application controls traffic. But because every IP's 60-second period for its requests rolls over at the same time, we were seeing a spiky pattern to traffic.

At the top of the minute, all the badly-behaving clients' requests would be passed through our application's Rails server. Then over the course of that minute they would gradually get themselves blocked. Near the end of the minute, traffic would be lower. And it would repeat, with Rails getting slammed again at the top of each minute.

It's better for performance purposes to spread that traffic out over the minute as evenly as possible. This is especially important if we want to improve our 99th-percentile performance.

This PR implements that by giving each discriminator (such as an IP number) a pseudo-random offset into the time period. The duration stays the same, 60 seconds in our case, but each individual IP's rollover time is fixed at :00, :59, or anywhere in between.

Badly-behaving clients can still slam Rails when their period rolls over. But they don't all slam it together.

The change is implemented inside Rack::Attack::Cache. The count method now takes an optional argument to enable the period offset, defaulting to the old behavior. Rack::Attack::Throttle makes use of it, unless there's an incompatible callback (see below). But it wouldn't make sense for safelist, blocklist and track, so their behavior is unaffected.
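To make that concrete, here's a minimal sketch of the idea, not the PR's exact code (the key format in particular is simplified): hash the discriminator to a stable offset, then let that offset decide both which window a request counts in and how long the counter key lives.

```ruby
require 'digest'

# Minimal sketch, not the PR's implementation: a stable per-discriminator
# offset shifts the window boundary, so each discriminator rolls over at
# its own second within the period.
def key_and_expiry(discriminator, period, now = Time.now.to_i)
  offset     = Digest::MD5.hexdigest(discriminator).hex % period
  window     = (now + offset) / period            # which window this request counts in
  expires_in = period - ((now + offset) % period) # seconds until this discriminator's rollover
  ["rack::attack:#{window}:#{discriminator}", expires_in]
end

key_and_expiry('203.0.113.7',  60) # resets at one second within the minute...
key_and_expiry('198.51.100.9', 60) # ...this one at a (very likely) different second
```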

The effect

Vox Media switched our main production external-facing web app to this branch, 06787b5 to be precise, earlier this year. It's been running fine, serving hundreds of requests per second, for a month.

Before: Here's what a DDoS attempt from 12 IPs looked like before the offset functionality was enabled. The empty gaps at the top of each minute are where the bot's requests were not yet being throttled and all 12 attacker IPs were being served actual traffic at once. This concentrated the attack's load on our app servers into the first few seconds of each minute: the traffic "spikes" show up as the negative space between the 429 bars:

[Screenshot, 2022-01-25: 429 bars clustered within each minute, with gaps at the top of each minute]

After: And here's a different DDoS attempt, with the offset enabled. Each of the attacker's 2,742 IPs still has the same number of allowed requests per minute, but each IP's counter resets at a different time instead of all landing at once, spreading the load across the whole minute:

[Screenshot, 2022-03-16: 429 bars spread evenly across each minute]

(In that time period, almost all of the attack came from just a handful of IPs, with 2,700+ others contributing only a little traffic each. But despite that extreme concentration, the random offset was still sufficient to spread out the load.)

Ensuring custom-responder correctness

So as we just saw, this change should work well, and work transparently, for applications that are using Rack::Attack to throttle public non-authenticated traffic.

However, there are some compatibility issues when Rack::Attack is used to return helpful information to what the README calls "well-behaved" traffic, for example authenticated customers hitting an API.

The problem is that this PR makes Rack::Attack::Cache use a private method to calculate when the end of the period will come, which is the value we want to emit to friendly clients. But for some years the README has been advising applications to calculate it themselves, and has provided sample code that doesn't return the same result as this PR now does.

With this branch's changes, that previous sample code, or any similar code, now calculates the wrong result. The throttling effect is the same, but applications' lambdas would be emitting incorrect information.
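For reference, the README's long-standing pattern looks roughly like this (paraphrased, not quoted verbatim); the reset calculation assumes every window ends on a multiple of the period, which is exactly the assumption the offset breaks:

```ruby
Rack::Attack.throttled_responder = lambda do |request|
  match_data = request.env['rack.attack.match_data']
  now = match_data[:epoch_time]

  # Assumes the window ends at the next multiple of the period, which is no
  # longer true once this discriminator's window has been shifted by an offset:
  reset_in = match_data[:period] - (now % match_data[:period])

  [429, { 'Retry-After' => reset_in.to_s }, ["Throttled, retry later\n"]]
end
```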

Now, I have of course updated the DEFAULT_THROTTLED_RESPONDER lambda to read that value out of match_data. So if an application is just setting throttled_response_retry_after_header = true and using that default, it'll work correctly.

But if an application has written its own custom responder, perhaps based on that previous sample code, this PR disables its own new offset functionality to make sure the correct answer is returned. Oh well! The throttling behavior doesn't change and the gem works as it did before.

When application authors notice this change, they can update their lambda to be compatible; once they have, a new config option restores this PR's functionality: throttled_responder_is_offset_aware = true. I've updated the README's sample code to reflect this.
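For an offset-aware responder, the shape is something like the sketch below. Note that the match_data key carrying the offset-aware reset value (:retry_after here) is a placeholder of mine; check the updated README sample in this PR for the real key name.

```ruby
Rack::Attack.throttled_responder_is_offset_aware = true

Rack::Attack.throttled_responder = lambda do |request|
  match_data = request.env['rack.attack.match_data']

  # :retry_after is a placeholder key name for the value this PR computes
  # internally; use the key named in the updated README sample.
  [429,
   { 'RateLimit-Limit'     => match_data[:limit].to_s,
     'RateLimit-Remaining' => '0',
     'Retry-After'         => match_data[:retry_after].to_s },
   ["Throttled, retry later\n"]]
end
```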

Because this preserves backwards compatibility, with behavior automatically remaining the same where necessary, I believe this PR can be released in a 6.x version.

I've written tests to ensure correctness in all three cases (default responder, custom not-aware, custom aware). And Vox Media has run each of the latter two cases in production.

Is this too unpredictable for friendlies?

Sometimes apps want to throttle "friendly" traffic, promising paying customers a certain number of requests. They may be using Rack::Attack for this purpose.

So let's consider the case of an app that wants to throttle "well-behaved" customers in a predictable way, promising them x requests per week or month. We might wonder if rolling over that week or month at an unpredictable time would make the customer unhappy.

I consider this unlikely. Currently the period must roll over when Time.now.to_i % period == 0, which for a one-week period is not Sunday midnight but an oddball 7 or 8 PM Eastern Time every Wednesday, depending on daylight saving time. And this can't be used for month-by-month traffic: months vary in length, so there's no period value that matches a calendar month, and nobody's doing that.
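To spell out the weekly arithmetic: with no offset, a one-week period can only roll over at multiples of 604,800 seconds since the Unix epoch, and the epoch started on a Thursday.

```ruby
period   = 7 * 24 * 3600                          # 604800 seconds
boundary = (Time.now.to_i / period + 1) * period  # next rollover for a weekly period

Time.at(boundary).utc
# => always a Thursday at 00:00:00 UTC (the epoch began on Thursday, 1970-01-01),
#    which is Wednesday 7 or 8 PM US Eastern depending on daylight saving time
```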

If any applications are dependent on throttling customer traffic per-day with the expected behavior of rolling over at GMT midnight, this PR encapsulates that logic in Rack::Attack::Cache.offset_for. A targeted monkey-patch could pick any time of day. Maybe their customers would prefer midnight in their local time zone.

But any such applications are probably using a throttled callback to report this information to customers' clients, and if those clients can't figure out what to do that seems like a client problem, not an application problem.

If this behavior change seems like a deal-breaker, I'd be happy to make offset_for a bit more complex and, for example, always return 0 for period == 86400, or period >= 86400, or similar.
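For illustration, that change or a targeted monkey-patch might look like the sketch below. The signature and body of offset_for here are my guesses (shown as a class method because that's how it's referenced above), so adapt it to the merged code rather than copying it.

```ruby
require 'digest'
require 'rack/attack'

# Hypothetical sketch: offset_for's real signature and body in this PR may differ.
# Periods of a day or longer stay anchored to GMT midnight; shorter periods keep
# their per-discriminator offset.
class Rack::Attack::Cache
  def self.offset_for(period, discriminator)
    return 0 if period >= 86_400
    Digest::MD5.hexdigest(discriminator).hex % period
  end
end
```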

But I don't think this would be necessary, because I think Rack::Attack is already not the right solution for these needs.

Tests improved

A number of tests were written assuming that they knew how to generate the key they were checking for. That's an implementation detail that this PR now changes. I changed them to query a class method for that data (via send since it's a private method).

I also added a separate unit test for that private method's key generation, to make sure that code is separately covered.

This also has the advantage of removing all the non-Timecop-frozen Time.now references from the tests, each of which was an unlikely but possible source of failure. I don't promise to have found every place where Timecop could rule out a timing-related test failure, but I think I got most of them, and the test suite should be a little more reliable now.
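The pattern is roughly the following (minitest-spec style, as in the gem's suite; the private method name key_and_expiry and its return shape come from my reading of the current Cache class and could differ once this PR lands):

```ruby
it "counts under whatever key the cache itself generates" do
  Timecop.freeze do
    # Ask the implementation for its key rather than re-deriving it in the test:
    key, expiry = Rack::Attack.cache.send(:key_and_expiry, "1.2.3.4", 60)

    assert_includes key, "1.2.3.4"
    assert_operator expiry, :>, 0
  end
end
```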

1-second as choice of resolution

I round offsets to the nearest second and consider that sufficient. If someone feels it would be helpful to round to a tenth of a second, I'd have no objection, but it would take an extremely high-traffic site to notice the difference, and it'd require rethinking the tests a bit.

MD5 as choice of hash

This PR works only because the discriminator (IP number or what-have-you) gets hashed to a predictable value. I picked Digest::MD5 to generate the hash value. It's fast, and MD5's infamous issues for cryptography don't concern us for this purpose.

Its speed seems fine. On my Chromebook, in a simple benchmark on an IP address, MD5 and modulo run in about 10 microseconds. That's 2 orders of magnitude under a typical round-trip time to the cache, and 3-4 under servicing a typical web request. So I think that's good enough.

(In case anyone's curious, Zlib.crc32 isn't better, and stripping off a prefix/suffix of the hexdigest before calling String#hex only makes it slower. The gem doesn't support Ruby < 2.4, so we don't have to worry about Fixnum vs. Bignum: it's all just Integer, and % on MD5's 128-bit Integer is no problem for Ruby.)
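If anyone wants to reproduce the timing claim, here's a rough micro-benchmark of the per-request cost (numbers will obviously vary by machine):

```ruby
require 'digest'
require 'benchmark'

ip, period, n = '203.0.113.42', 60, 100_000

per_call = Benchmark.realtime { n.times { Digest::MD5.hexdigest(ip).hex % period } } / n
puts format('%.2f microseconds per offset calculation', per_call * 1_000_000)
```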

One more thing

I kept this PR as on-target as possible, but in this case I did want to correct a documentation error. For the last nine years the README has said that match_data annotation happens for:

responses that did not exceed a throttle limit

This is backwards and I changed it to:

responses that exceeded a throttle limit

@jamiemccarthy force-pushed the jm-random-period-offset branch from a140e53 to 06787b5 on December 30, 2021 at 14:06
@jamiemccarthy changed the title from "Refactor: no more sawtooth throttles" to "Refactor: less-spiky throttles" on Dec 30, 2021
@jamiemccarthy force-pushed the jm-random-period-offset branch from 9dc159e to ca42333 on March 21, 2022 at 12:41
@jamiemccarthy changed the base branch from master to main on March 21, 2022 at 12:44
@jamiemccarthy marked this pull request as ready for review on March 21, 2022 at 13:00