Enhancement: Add drift detection and automatic reconciliation #668

eshulman2 wants to merge 1 commit into k-orc:main
Conversation
mandre
left a comment
What part of the code needs changing? I expect we should detail how `shouldReconcile` changes.
enhancements/drift-detection.md (Outdated)
> **Mitigation**:
> - Conservative 10-hour default resync period
> - Add random jitter to resync times to avoid thundering herd
What would that look like?
Could you describe what you mean by thundering herd? Don't we already have the problem when restarting ORC?
I mean that if many resources were created at once, they will all try to re-reconcile at T+10h (at the same time), which might flood the OSP API, possibly causing slowness or other issues. The point of adding random jitter is to have resources created together start their drift detection at T+10h plus a random offset, making it safer. Since the suggested implementation only reschedules, adding the jitter should be as simple as adding a random number to the re-queue request. Added a note on that.
Regarding the restart issue, I assume it is the same conceptual issue, but unfortunately the suggested solution won't solve it on startup. We should consider a different approach for the startup case, like the one in ASO: limiting the number of concurrent reconciles by exposing MAX_CONCURRENT_RECONCILES (controller-runtime's MaxConcurrentReconciles).
Another thought on the possible boot issue: we could also add jitter when getting a 429, which would solve the thundering herd at startup without an arbitrary limit on the number of concurrent reconciliations (though we should probably still use such a limit on our side as a ceiling, both for the number of requests and to avoid CPU spikes in the operator). This hybrid approach lets OpenStack services "decide" whether they are free enough to handle the request, and allows us to avoid DDoSing (to some extent) without arbitrarily limiting the number of reconciliations. We can also use a RateLimiter in the controller options to configure the work queue.
Actually, the safest approach would be to use all of them:
- Drift detection with jitter - avoids the thundering herd on periodic drift detection
- `RateLimiter` in the controller options - ensures we are not processing too many items per second, limiting the burst rate and QPS
- `MaxConcurrentReconciles` - ensures we don't run out of RAM/CPU, adding robustness at the operator level so it "doesn't break itself"
- 429 backoff + jitter - avoids DDoSing the OpenStack cluster when it is already too busy
(maybe it would be a good idea to create an issue with this information, to allow planning and implementing it, resolving the startup thundering herd and adding robustness to the operator at all levels)
We don't have to over-specify now. Let's just recognize it might be a problem and highlight potential solutions without going into too much detail. I believe the current wording is fine and doesn't need more details.
Force-pushed 3134799 to a9f9abf
> 1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
> 2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
> 3. The new resource ID is stored in `status.id`
My question wasn't really about the obvious case where you'd delete out of band a resource for which you have a weak dependency, but for where we had hard dependency. An example with Subnet -> Network would be more telling. What happens if a network is re-created? Will the subnet be recreated as well?
Inconsistent states can happen, and will happen. Someone who force-deleted a resource, a bug in OpenStack, an operator who made changes to the database directly... We should explain what we would do whenever we hit that case.
This probably deserves a separate section.
> **Behavior when drift detection is disabled** (`resyncPeriod: 0`): External deletion remains a terminal error (current behavior preserved). Resource recreation only occurs when drift detection is enabled and a periodic resync discovers the missing resource. This maintains backwards compatibility.
>
> For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.
What happens to a resource that was in a terminal error (due to missing openstack resource) prior to enabling drift detection? I believe we should clarify that we won't reconcile resources that are in terminal error.
I believe we should actually reconcile them even if they are in terminal error, as this would solve #667, which is currently unsolvable for users and forces them to delete and re-create the resource for semi-transient issues that are treated as terminal, irrecoverable errors even after they are resolved (like quota space being freed up by another machine being deleted). The periodic resync provides a recovery path without requiring users to manually touch the spec to trigger reconciliation.
By design, terminal errors are... terminal. We should not reconcile resources in this state. The problem raised in #667 is that some errors are not exactly terminal and could be retried, but that's a different discussion.
Let's be explicit here.
I agree that the intent for terminal errors is to be terminal, but I think we should reconsider what we call a terminal error. I'll change it in this proposal, and let's discuss in #667 our approach to what we call a terminal error, how we address it, and how we allow users to recover from it.
Force-pushed a9f9abf to eda8a6f
enhancements/drift-detection.md (Outdated)
> This reuses the existing `Progressing.LastTransitionTime` to track when the resource was last reconciled, avoiding the need for additional status fields
I don't think that would work: progressing.LastTransitionTime is only updated when a condition changes on the resource, so if there are no changes we would reconcile continuously once the first resyncPeriod has elapsed.
Possibly, we need to add an annotation that carries the last resync timestamp.
Another possibility we discussed early on was to re-queue with RequeueAfter(resyncPeriod) at the end of a successful reconcile.
The implementation uses both mechanisms you mentioned:
- RequeueAfter: At the end of successful reconciliation, we schedule the next resync with reconcileStatus.WithRequeue(resyncPeriod) (see the implementation here)
- Time-based check: The shouldReconcile function checks time.Since(progressing.LastTransitionTime.Time) >= resyncPeriod to handle reconciles triggered by other means (controller restart, dependency updates, etc.)
The LastTransitionTime gets updated when the reconcile completes and Progressing condition transitions. The scheduled requeue ensures this happens periodically, and the time check prevents excessive reconciliation if multiple triggers fire close together. I updated the proposal to reflect it and make it clearer.
Here's the problem: we can't rely on progressing.LastTransitionTime in shouldReconcile() because as the resource converges to a stable state, progressing.LastTransitionTime won't get any more updates, meaning that the comparison time.Since(progressing.LastTransitionTime.Time) >= resyncPeriod will always return true and we'll constantly be reconciling resources.
ok got it, so we should set a new status field for the last time it tried to resync, to be able to check it like suggested, correct?
> ok got it, so we should set a new status field for the last time it tried to resync, to be able to check it like suggested, correct?
If we want to go down this route, I believe it should be an annotation (and not a status field). But I also think that it might not be necessary if we requeue after each reconcile.
ok, removed the check completely and only left the re-queue
My preference would definitely be for a versioned, structured field over an annotation. A new status field applied to all objects would be fine. Using the Progressing condition is probably not appropriate for the reasons above, and because it indicates a transition time to the user: "The controller last made a change to this object at time X". If you wanted to go down the Condition route, a new Synchronized condition might be appropriate? I would probably lean towards a new status field, though.
Force-pushed eda8a6f to c610a4f
Force-pushed c610a4f to 0fa5f18
Proposal for drift detection feature.
Force-pushed 0fa5f18 to ab42db9
mdbooth
left a comment
A few comments, but looks very reasonable to me.
> This ensures the controller automatically triggers reconciliation after the configured period. Controller-runtime's work queue handles deduplication of reconcile requests, so no additional time-based checks are needed in `shouldReconcile`.
>
> **Resources in terminal error are not resynced**: When a resource is in a terminal error state (e.g., invalid configuration, unrecoverable OpenStack error), periodic resync is not scheduled. Terminal errors indicate issues that cannot be resolved through automatic retry and require manual intervention to fix the underlying problem. This prevents wasted reconciliation cycles on resources that are known to be in an unrecoverable state.
My memory definitely needs a refresh, but do we sometimes set a terminal state due to an issue that can't be resolved automatically but might be resolved manually by an administrator? e.g. A resource is in ERROR state which is later cleared after admin fixes some underlying issue. If so, it might be worth including terminal states in any resync process.
There was a discussion between me and Martin about exactly this on Thursday, and also a thread here, and I very much agree with allowing resync on terminal-error resources. But it seems like @mandre wants to discuss it further and approach it in the scope of the #667 bug. I believe nearly all terminal errors (now applied for 409 conflicts) can be resolved externally, as I mentioned in the bug.
> 1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
> 2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
> 3. The new resource ID is stored in `status.id`
I suspect this will need careful thought per resource. In the specific case of network and subnet, IIRC a subnet cannot exist without a network, so if the network has been deleted then the subnet has also been deleted. From this POV, 'hard' dependencies may actually be the easy case.
> #### Implementation Details
>
> At the end of a successful reconciliation (when no other reschedule is pending), the controller schedules the next resync:
What do you want to do on controller restart?
As far as I understand the controller design, the controller will try reconciliation on startup for all resources (or at least that's what was claimed in a previous thread). If that's the case, it would set the next resync as a requeue out of the box. This might be problematic for other reasons, like causing a thundering herd issue (which already existed and is unrelated to this proposal), but I also suggested some ways to resolve it in the thread #668 (comment).
> For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.
>
> **Note on dependent resources**: OpenStack enforces referential integrity for most resources (e.g., Networks cannot be deleted while Subnets exist). If resources are deleted through means that bypass these checks (direct database manipulation, OpenStack bugs), drift detection preserves ORC's existing reconciliation behavior:
I think ORC's behaviour is undefined in the presence of bugs and direct DB manipulation, tbh. I'm not sure how you could define it.
I generally agree; this was already discussed in a previous thread and I believe it is out of scope for this proposal, but I was asked by Martin to add a note about it anyway. (#668 (comment))
> **Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten.
>
> **Mitigation**:
> - Document that ORC should be the sole manager of resources it creates
I think we already do this?
I am not sure there is a warning note for this, but if there already is one then that's one less change to make :)
> **Mitigation**:
> - Disabled by default; when enabled, recommend conservative intervals (e.g., 10 hours)
> - Add random jitter to resync times to avoid thundering herd: since reconciliation already uses "requeue after X duration", jitter simply adds a random offset (e.g., ±10%) to the resync period, spreading resyncs over time rather than having them fire simultaneously
Unless it all gets reset when the controller restarts. You might want to store a last synced time in the status? Or some other solution.
I think that's a good idea; I noticed you mentioned it in another thread. I'm completely for it if there are no objections from @mandre.
@mdbooth thanks for the review, I addressed all comments and re-opened the relevant threads for you to comment on in case you want to continue the discussions there.