-
Notifications
You must be signed in to change notification settings - Fork 28
LLT-6859 adjust UAPI error handler #1618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
LLT-6859 adjust UAPI error handler #1618
Conversation
f9e65d3 to
4b9c0f6
Compare
4b9c0f6 to
1c17d77
Compare
1c17d77 to
2395810
Compare
2395810 to
8cbe327
Compare
8cbe327 to
f581981
Compare
f581981 to
fa7b39c
Compare
fa7b39c to
5bc731b
Compare
5bc731b to
825b681
Compare
825b681 to
1049dc7
Compare
1049dc7 to
5e76d91
Compare
5e76d91 to
6c9662a
Compare
6c9662a to
45fe6a4
Compare
mathiaspeters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
cde43e3 to
853eb12
Compare
stalowyjez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks rather ok, +1
6906a0e to
9d4063f
Compare
|
@stalowyjez I've addressed your comments |
9d4063f to
3a76c6b
Compare
stalowyjez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, +1
3a76c6b to
6e68b3d
Compare
crates/telio-wg/src/wg.rs
Outdated
| pub struct TUNHealthMonitor { | ||
| /// Minimum duration of a continuous UAPI failure period required to | ||
| /// classify the failure as non-transient | ||
| sustained_failure_threshold: Duration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 This is a nitpick, but maybe it is worth storing here just FeatureInterfaceHealth instead of having a copy of the fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code no longer exists
crates/telio-wg/src/wg.rs
Outdated
| // than expected, it indicates that libtelio(or device) might have been suspended. Since | ||
| // WireguardNT produces transient errors around device suspension(just before and right after), | ||
| // we must filter out those errors or else we can easily exceed sustained_failure_threshold. | ||
| let suspend_detected = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this part. As it seems like it is likely to be flaky.... 🤔 🤔 🤔 🤔 As while our period is roughly 1s, there are no guarantees that the only "interruption" can be suspension.
Although I do not have a better suggestion now 🫤.
Maybe it is worth having a small design/brainstorm session to see if we can come up to something less sensitive to timing (2s by default)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed - I just increased the counter value so this is resolved :)
6e68b3d to
0d8e9e5
Compare
|
@Jauler @stalowyjez @mathiaspeters we discussed a bit with @Jauler about the current approach and decided to go via simpler approach - increase the counter threshold. The benefits of such approach is that it's very deterministic while being very simple. |
0d8e9e5 to
f79344c
Compare
This reduces critical errors caused by transient wireguard-nt UAPI failures at the cost of longer detection time when interface is permanently gone. Signed-off-by: Lukas Pukenis <lukas.pukenis@nordsec.com>
f79344c to
d9db805
Compare
Problem
Libtelio uses counter based mechanism when handling interface errors. The counter is used to ignore transient WireguardNT errors that happen around power-transitions and usually go away.
Current limitations:
These limitations allow libtelio to produce unnecessary critical failures that require libtelio restart.
Solution
Increase the counter threshold value.
☑️ Definition of Done checklist