Flow eviction and LFT/UFT lifecycle rework#990
Conversation
42acc3e to
7d0fd8b
Compare
This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries are correctly bound by a single lifetime, and so that a slowpath packet tracks all LFTs which were used in its handling. We then associate this list of LFTs with the flow UFT, and the TCP flow entry if created. This allows us to correctly keep LFTs alive so long as a UFT/TCP entry refers to them. This has been a longstanding bug. The new and really useful feature is that this allows us to remove all related and dependent entries when we deliberately evict an LFT to make space for a new flow. An expiry policy for any table can now also specifiy an "eviction priority" for a given flow. Generally speaking most flows are unevictable so long as they are well-behaved, but the main issue is that we need information from elsewhere to *know* when we can make this determination. How we tackle this is by looking at all descendent flows which rely on a given flow. Most will state no preference on the flow lifecycle, but the TCP flow entry can return a non-zero priority when inactive in a given state for too long. Closes #986. Closes #788. Closes #789.
7d0fd8b to
c8bfdd9
Compare
7711592 to
2b7de6d
Compare
d09ad98 to
b1fb3c2
Compare
f37152d to
1df13fe
Compare
|
From what I can tell this appears to have the same throughput on |
rcgoodfellow
left a comment
There was a problem hiding this comment.
Thanks for putting this together @FelixMcFelix. Questions and comments follow.
On the #986 front, while this PR does not introduce the provisional table we discussed, it looks like it does satisfy properties (1) and (2) from that ticket and therefore would satisfy the intent of the ticket. For (1) we only mark connections as eligible for eviction if the last hit is older than the TTL for the connection state they are in. For (2) the priority based eviction seems to cover that, but I have some questions on how it's implemented below.
| // should be immune to timer-driven expiry. They are *not* immune to | ||
| // explicit eviction. | ||
| // | ||
| // We don't arrange the UFT as a parent of the TCP flow entry because |
There was a problem hiding this comment.
Just food for thought, I wonder if the three way relationship between UFT/LFT/TFT is more trilateral in nature than parent/child. E.g. we just want to have the three way associations in place through pointers to be able to have policy over all three that may evolve over time and make the parent/child relationship less of a thing.
There was a problem hiding this comment.
Possibly, yeah. At least for now I'm trying to keep something DAG-shaped so that we don't end up with any resource leaks or entry n-tuples ending up with an infinite lifetime because something always depends on them.
I think one thing I was maybe getting at but ended up not implementing is that perhaps UFTs are always evictable, but they somehow don't pass that trait onto their parent LFTs. Part of doing that would be that the TCP state isn't invalidated if the UFT disappears, which the current parent/child model allows for. Making that would have probably added yet more complexity for something we really ought to measure (i.e., there's not as much need if all the table sizes are the same).
| Direction::Out => &mut inner.outbound_parents, | ||
| }; | ||
|
|
||
| core::mem::swap(new_parent_slot, &mut parents); |
There was a problem hiding this comment.
What happens if the reaper runs when the parent/child lists are potentially inconsistent between the parent vec swap here and the child addition/removals below? (or other threads that may hit this inconsistent window, consecutive slowpath packets on the same flow?)
There was a problem hiding this comment.
The port itself is wrapped in a KRwLock. Any slowpath hit, or access by the cleanup callback / reaper can only operate on the port and its tables within an exclusive write().
I'll add a comment advertising that we'll need to be more aware of this when we inevitably rethink the locking strategy here at some point.
| // quickly, but leave them in place for the ~120s when we are | ||
| // not under pressure. | ||
| a if a > self.incipient_ttl.as_milliseconds() => { | ||
| Some(EvictionPriority::Evictable(NonZeroU16::MIN)) |
There was a problem hiding this comment.
For flows that are beyond the TTL, should the priority reflect how far beyond the TTL they are, something like the delta in milliseconds so the flows furthest beyond their TTL get evicted first? Same question for line 3182 below.
There was a problem hiding this comment.
Good point. I've made the logic here for misbehaving flows take a linear interpolation across between the evictable time and expiry time to capture this and the below comment:
NonZeroU16::MINis reserved for established flows which are getting long in the tooth. In this caselast_hit - current_timewill always reflect distance from the TTL.[NonZeroU16::MIN + 1, NonZeroU16::MAX - 1]for the other states.NonZeroU16::MAXfor the closed state.
| // Allow established flows to be evicted on the same time scale as | ||
| // UFT/LFT expiry, but if we are not under pressure then we aim to | ||
| // keep the LFTs in place even while the flow is inactive. | ||
| TcpState::Established |
There was a problem hiding this comment.
For the established case, we may want to have this be tunable per OPTE port. Maybe I'm not totally understanding the model, but if we get to a place where a flow exists for a TCP session with a sparse communication pattern that has intervals of silence beyond 60 seconds, the flow will stay in place until the system comes under pressure, then the flow may be evicted in favor of new flows due to high connection churn. In that situation, is the flow dead or does it just encounter a slowpath speed bump? If dead, that seems rather unfortunate and I think we'd want to provide a knob for users to tune FLOW_DEF_TTL for their applications.
There was a problem hiding this comment.
The flow would be dead in that situation, but having protection under low load is strictly better than where we are before this change -- any flow dependent on a firewall exception or SNAT entry will be dead after 60s of inactivity, no matter what. I agree that it should be tunable, but I don't really know what the best way to plumb it would be right now.
We could raise this threshold here ("timeouts, timeouts, always wrong"), but I think the separate priority spaces might do a little better here by taking out misbehaved flows first.
| } | ||
| Some((EvictionKey::Evictable(curr_prio, curr_time), ..)) | ||
| if prio > curr_prio | ||
| || (prio == curr_prio && last_hit < curr_time) => |
There was a problem hiding this comment.
Is last hit what we want here, or do we want amount of time beyond TTL, as the TTL can pretty different depending on what state the flow is in? This is related to the question earlier on how we set the eviction priority when looking at the delta relative to TTL in TcpExpiry::eviction_priority.
There was a problem hiding this comment.
Yes, but I don't think we can always enforce that here in a way that's not confusing (i.e., if we wanted to make UFT entries always evictable then there is no TTL). Similarly you could have policies which are purely protocol state-driven, rather than including timing information. I've tried to capture this for TCP with usng the priority scale.
This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries are correctly bound by a single lifetime, and so that a slowpath packet tracks all LFTs which were used in its handling. We then associate this list of LFTs with the flow UFT, and the TCP flow entry if created.
This allows us to correctly keep LFTs alive so long as a UFT/TCP entry refers to them. This has been a longstanding bug. The new and really useful feature is that this allows us to remove all related and dependent entries when we deliberately evict an LFT to make space for a new flow.
An expiry policy for any table can now also specifiy an "eviction priority" for a given flow. Generally speaking most flows are unevictable so long as they are well-behaved, but the main issue is that we need information from elsewhere to know when we can make this determination. How we tackle this is by looking at all descendent flows which rely on a given flow. Most will state no preference on the flow lifecycle, but the TCP flow entry can return a non-zero priority when inactive in a given state for too long.
Closes #986.
Closes #788.
Closes #789.