Need to better handle contention around RouteCache #997

@FelixMcFelix

A guest which chooses to spin up many ill-fated TCP connection attempts (#985, #990) runs into another bottleneck: the RouteCache, which sits between OPTE and the underlay network. This table is small (512 entries), and while it does have an eviction policy, what we've seen in practice is that this traffic pattern causes contention and thrashing: all of these packets invariably miss on the table, find it full, and evict a possibly-good entry.

Copying out from https://github.com/oxidecomputer/customer-support/issues/1188#issuecomment-4497160655, we need to:

  • Bump the table size,
  • Disable eviction and just go straight to a raw lookup when table capacity is exceeded,
  • Shrink the write lock scope to cover only insertion, such that readers aren't blocked while one thread is querying illumos itself.

Combined, these should ensure that actually-established flows eventually end up represented in the cache, and that we can sustain a higher rate of active connections per second.
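
As a rough illustration of the three changes listed above, here is a minimal sketch using only the Rust standard library. It is not OPTE's actual `RouteCache`: the `Route` fields, the `Ipv6Addr` key, the `lookup`/`len` method names, and the `slow_lookup` closure (standing in for the query to illumos) are all hypothetical, chosen only to show the shape of the locking.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;
use std::sync::RwLock;

/// Hypothetical stand-in for a resolved underlay route.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Route {
    pub next_hop: Ipv6Addr,
    pub underlay_port: u8,
}

/// Bounded cache with no eviction (assumed layout, for illustration only).
pub struct RouteCache {
    entries: RwLock<HashMap<Ipv6Addr, Route>>,
    capacity: usize,
}

impl RouteCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            entries: RwLock::new(HashMap::with_capacity(capacity)),
            capacity,
        }
    }

    /// Number of cached entries.
    pub fn len(&self) -> usize {
        self.entries.read().unwrap().len()
    }

    /// Return the route for `dst`, calling `slow_lookup` on a miss.
    pub fn lookup<F>(&self, dst: Ipv6Addr, slow_lookup: F) -> Route
    where
        F: FnOnce(Ipv6Addr) -> Route,
    {
        // Fast path: shared read lock only. The guard is a temporary and
        // is released at the end of this statement.
        let cached = self.entries.read().unwrap().get(&dst).copied();
        if let Some(route) = cached {
            return route;
        }

        // Slow path: no lock is held while the raw lookup runs, so
        // readers are not blocked behind it.
        let route = slow_lookup(dst);

        // The write lock covers only the insertion. When the table is
        // full, nothing is evicted; the result is simply not cached.
        let mut entries = self.entries.write().unwrap();
        if entries.len() < self.capacity {
            entries.insert(dst, route);
        }
        route
    }
}
```

One consequence of this shape is that two threads missing on the same destination may both perform the raw lookup, and the later insert overwrites the earlier one. That duplicated work is the price of keeping the write lock short. Once the table is full, further misses always take the slow path, which is the "raw lookup when table capacity is exceeded" behaviour rather than eviction.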

#539 will properly solve this issue and give us a far smaller routing table (because it will not scale with the number of active connections), but will require additional interfaces in the kernel to realise. The path described here is a shorter-term fix while we determine what those interfaces should look like.
