Flow eviction and LFT/UFT lifecycle rework by FelixMcFelix · Pull Request #990 · oxidecomputer/opte

FelixMcFelix · 2026-05-18T13:08:04Z

This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries are correctly bound by a single lifetime, and so that a slowpath packet tracks all LFTs which were used in its handling. We then associate this list of LFTs with the flow UFT, and the TCP flow entry if created.

This allows us to correctly keep LFTs alive so long as a UFT/TCP entry refers to them. This has been a longstanding bug. The new and really useful feature is that this allows us to remove all related and dependent entries when we deliberately evict an LFT to make space for a new flow.

An expiry policy for any table can now also specifiy an "eviction priority" for a given flow. Generally speaking most flows are unevictable so long as they are well-behaved, but the main issue is that we need information from elsewhere to know when we can make this determination. How we tackle this is by looking at all descendent flows which rely on a given flow. Most will state no preference on the flow lifecycle, but the TCP flow entry can return a non-zero priority when inactive in a given state for too long.

Closes #986.
Closes #788.
Closes #789.

This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries are correctly bound by a single lifetime, and so that a slowpath packet tracks all LFTs which were used in its handling. We then associate this list of LFTs with the flow UFT, and the TCP flow entry if created. This allows us to correctly keep LFTs alive so long as a UFT/TCP entry refers to them. This has been a longstanding bug. The new and really useful feature is that this allows us to remove all related and dependent entries when we deliberately evict an LFT to make space for a new flow. An expiry policy for any table can now also specifiy an "eviction priority" for a given flow. Generally speaking most flows are unevictable so long as they are well-behaved, but the main issue is that we need information from elsewhere to *know* when we can make this determination. How we tackle this is by looking at all descendent flows which rely on a given flow. Most will state no preference on the flow lifecycle, but the TCP flow entry can return a non-zero priority when inactive in a given state for too long. Closes #986. Closes #788. Closes #789.

FelixMcFelix · 2026-05-21T14:34:02Z

From what I can tell this appears to have the same throughput on glasgow as master. I was mildly concerned that changing the Arc<TcpFlowEntryState> held by a UFT into a Weak<_> in the name of consistent leak prevention would have had a mild impact, due to the calls to Weak::upgrade (similar enough cost to an Arc::clone). Otherwise the extra work we're doing affects only the slowpath and the periodic reaper callback.

rcgoodfellow

Thanks for putting this together @FelixMcFelix. Questions and comments follow.

On the #986 front, while this PR does not introduce the provisional table we discussed, it looks like it does satisfy properties (1) and (2) from that ticket and therefore would satisfy the intent of the ticket. For (1) we only mark connections as eligible for eviction if the last hit is older than the TTL for the connection state they are in. For (2) the priority based eviction seems to cover that, but I have some questions on how it's implemented below.

rcgoodfellow · 2026-05-23T17:34:16Z

+    // should be immune to timer-driven expiry. They are *not* immune to
+    // explicit eviction.
+    //
+    // We don't arrange the UFT as a parent of the TCP flow entry because


Just food for thought, I wonder if the three way relationship between UFT/LFT/TFT is more trilateral in nature than parent/child. E.g. we just want to have the three way associations in place through pointers to be able to have policy over all three that may evolve over time and make the parent/child relationship less of a thing.

Possibly, yeah. At least for now I'm trying to keep something DAG-shaped so that we don't end up with any resource leaks or entry n-tuples ending up with an infinite lifetime because something always depends on them.

I think one thing I was maybe getting at but ended up not implementing is that perhaps UFTs are always evictable, but they somehow don't pass that trait onto their parent LFTs. Part of doing that would be that the TCP state isn't invalidated if the UFT disappears, which the current parent/child model allows for. Making that would have probably added yet more complexity for something we really ought to measure (i.e., there's not as much need if all the table sizes are the same).

rcgoodfellow · 2026-05-23T17:46:46Z

+            Direction::Out => &mut inner.outbound_parents,
+        };
+
+        core::mem::swap(new_parent_slot, &mut parents);


What happens if the reaper runs when the parent/child lists are potentially inconsistent between the parent vec swap here and the child addition/removals below? (or other threads that may hit this inconsistent window, consecutive slowpath packets on the same flow?)

The port itself is wrapped in a KRwLock. Any slowpath hit, or access by the cleanup callback / reaper can only operate on the port and its tables within an exclusive write().

I'll add a comment advertising that we'll need to be more aware of this when we inevitably rethink the locking strategy here at some point.

rcgoodfellow · 2026-05-23T17:58:07Z

+                // quickly, but leave them in place for the ~120s when we are
+                // not under pressure.
+                a if a > self.incipient_ttl.as_milliseconds() => {
+                    Some(EvictionPriority::Evictable(NonZeroU16::MIN))


For flows that are beyond the TTL, should the priority reflect how far beyond the TTL they are, something like the delta in milliseconds so the flows furthest beyond their TTL get evicted first? Same question for line 3182 below.

Good point. I've made the logic here for misbehaving flows take a linear interpolation across between the evictable time and expiry time to capture this and the below comment:

NonZeroU16::MIN is reserved for established flows which are getting long in the tooth. In this case last_hit - current_time will always reflect distance from the TTL.

[NonZeroU16::MIN + 1, NonZeroU16::MAX - 1] for the other states.

NonZeroU16::MAX for the closed state.

rcgoodfellow · 2026-05-23T18:05:24Z

+            // Allow established flows to be evicted on the same time scale as
+            // UFT/LFT expiry, but if we are not under pressure then we aim to
+            // keep the LFTs in place even while the flow is inactive.
+            TcpState::Established


For the established case, we may want to have this be tunable per OPTE port. Maybe I'm not totally understanding the model, but if we get to a place where a flow exists for a TCP session with a sparse communication pattern that has intervals of silence beyond 60 seconds, the flow will stay in place until the system comes under pressure, then the flow may be evicted in favor of new flows due to high connection churn. In that situation, is the flow dead or does it just encounter a slowpath speed bump? If dead, that seems rather unfortunate and I think we'd want to provide a knob for users to tune FLOW_DEF_TTL for their applications.

The flow would be dead in that situation, but having protection under low load is strictly better than where we are before this change -- any flow dependent on a firewall exception or SNAT entry will be dead after 60s of inactivity, no matter what. I agree that it should be tunable, but I don't really know what the best way to plumb it would be right now.

We could raise this threshold here ("timeouts, timeouts, always wrong"), but I think the separate priority spaces might do a little better here by taking out misbehaved flows first.

rcgoodfellow · 2026-05-23T18:47:22Z

+                }
+                Some((EvictionKey::Evictable(curr_prio, curr_time), ..))
+                    if prio > curr_prio
+                        || (prio == curr_prio && last_hit < curr_time) =>


Is last hit what we want here, or do we want amount of time beyond TTL, as the TTL can pretty different depending on what state the flow is in? This is related to the question earlier on how we set the eviction priority when looking at the delta relative to TTL in TcpExpiry::eviction_priority.

Yes, but I don't think we can always enforce that here in a way that's not confusing (i.e., if we wanted to make UFT entries always evictable then there is no TTL). Similarly you could have policies which are purely protocol state-driven, rather than including timing information. I've tried to capture this for TCP with usng the priority scale.

FelixMcFelix self-assigned this May 18, 2026

FelixMcFelix force-pushed the eviction-policy branch from 42acc3e to 7d0fd8b Compare May 18, 2026 13:39

FelixMcFelix added 2 commits May 19, 2026 12:06

Self-review.

c8bfdd9

FelixMcFelix force-pushed the eviction-policy branch from 7d0fd8b to c8bfdd9 Compare May 19, 2026 18:12

More aggressive eviction for quiescent states

2b7de6d

FelixMcFelix force-pushed the eviction-policy branch from 7711592 to 2b7de6d Compare May 19, 2026 18:44

Unit tests now that perf is coming back untouched.

b1fb3c2

FelixMcFelix force-pushed the eviction-policy branch from d09ad98 to b1fb3c2 Compare May 20, 2026 12:56

FelixMcFelix added 2 commits May 20, 2026 16:51

Unit tests flush out bugs, as it turns out!

e6642fd

Comment.

1df13fe

FelixMcFelix force-pushed the eviction-policy branch from f37152d to 1df13fe Compare May 20, 2026 15:58

FelixMcFelix marked this pull request as ready for review May 20, 2026 15:59

rcgoodfellow reviewed May 23, 2026

View reviewed changes

FelixMcFelix mentioned this pull request May 28, 2026

InnerFlowId should be a per-NetworkImpl concept #996

Open

Ry review feedback.

dba5a03

FelixMcFelix mentioned this pull request May 28, 2026

Need to better handle contention around RouteCache #997

Open

Conversation

FelixMcFelix commented May 18, 2026

Uh oh!

FelixMcFelix commented May 21, 2026

Uh oh!

rcgoodfellow left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rcgoodfellow May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rcgoodfellow May 23, 2026 •

edited

Loading