Enhancement: Add drift detection and automatic reconciliation#668

Open
eshulman2 wants to merge 1 commit into k-orc:main from eshulman2:drift-d-proposal

Conversation

@eshulman2
Contributor

Proposal for drift detection feature.

@github-actions github-actions bot added the semver:patch No API change label Feb 3, 2026
Collaborator

@mandre mandre left a comment


What part of the code needs changing? I expect we detail how shouldReconcile changes.


**Mitigation**:
- Conservative 10-hour default resync period
- Add random jitter to resync times to avoid thundering herd
Collaborator


What would that look like?
Could you describe what you mean by thundering herd? Don't we already have the problem when restarting ORC?

Contributor Author

@eshulman2 eshulman2 Feb 17, 2026


I mean that if many resources were created at once, they will all try to re-reconcile at T+10h (at the same time), which might flood the OSP API and cause slowness or other issues. The point of adding random jitter is to have resources created together start their drift detection at T+10h plus a random offset, making it safer. Since the suggested implementation only reschedules, adding the jitter should be as simple as adding a random number to the re-queue request. I added a note on that.

Regarding the restart issue, I assume it is the same conceptual issue, but unfortunately the suggested solution won't solve it on startup. For the startup case we should consider a different approach, like the one in ASO: limiting the number of concurrent reconciles by exposing MAX_CONCURRENT_RECONCILES (controller-runtime's MaxConcurrentReconciles).

Contributor Author

@eshulman2 eshulman2 Feb 25, 2026


Another thought on the possible boot issue: we could also add jitter when getting a 429, which would address the thundering herd at startup without an arbitrary limit on the number of concurrent reconciliations (though we should probably still use one on our side as a ceiling, both for the number of requests and to avoid CPU spikes in the operator). This hybrid approach lets OpenStack services "decide" whether they are "free enough" to handle the request, and allows us to avoid DDoSing them (to some extent) without arbitrarily limiting the number of reconciliations. We can also use a RateLimiter in the controller options configuring the work queue.

Actually, the safest approach would be to use all of them:

  1. Drift detection with jitter - avoids the thundering herd on periodic drift detection
  2. RateLimiter in the controller options - makes sure we are not processing too many items per second, limiting the burst rate and QPS
  3. MaxConcurrentReconciles - makes sure we are not running out of RAM/CPU, adding robustness at the operator level so it doesn't break itself
  4. 429 backoff + jitter - avoids DDoSing the OpenStack cluster when it is already too busy

Contributor Author


(maybe it would be a good idea to create an issue with this information, to allow planning and implementing it to resolve the startup thundering herd and add robustness to the operator at all levels)

Collaborator


We don't have to over-specify now. Let's just recognize it might be a problem and highlight potential solutions without going into too much detail. I believe the current wording is fine and doesn't need more details.


1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
3. The new resource ID is stored in `status.id`
Collaborator


My question wasn't really about the obvious case where you'd delete out of band a resource for which you have a weak dependency, but for where we had hard dependency. An example with Subnet -> Network would be more telling. What happens if a network is re-created? Will the subnet be recreated as well?

Inconsistent states can happen, and will happen. Someone who force-deleted a resource, a bug in OpenStack, an operator who made changes to the database directly... We should explain what we would do whenever we will have that case.

This probably deserves a separate section.

**Behavior when drift detection is disabled** (`resyncPeriod: 0`): External deletion remains a terminal error (current behavior preserved). Resource recreation only occurs when drift detection is enabled and a periodic resync discovers the missing resource. This maintains backwards compatibility.

For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.

Collaborator


What happens to a resource that was in a terminal error (due to missing openstack resource) prior to enabling drift detection? I believe we should clarify that we won't reconcile resources that are in terminal error.

Contributor Author


I believe we should actually reconcile them even if they are in terminal error, as this would also solve #667, which is unsolvable for users right now and forces them to delete and re-create the resource for already-resolved semi-transient issues currently treated as terminal, irrecoverable errors (like quota space being freed up by another machine being deleted). The periodic resync provides a recovery path without requiring users to manually touch the spec to trigger reconciliation.

Collaborator


By design, terminal errors are... terminal. We should not reconcile resources in this state. The problem raised in #667 is that some errors are not exactly terminal and could be retried, but that's a different discussion.

Let's be explicit here.

Contributor Author


I agree that the intent is for terminal errors to be terminal, but I think we should reconsider what we call a terminal error. I'll change it in this proposal, and let's discuss in #667 our approach to what we call a terminal error, how we address it, and how to allow users to recover from it.

@eshulman2 eshulman2 requested a review from mandre March 11, 2026 14:22

This reuses the existing `Progressing.LastTransitionTime` to track when the resource was last reconciled, avoiding the need for additional status fields.
Collaborator


I don't think that would work: progressing.LastTransitionTime is only updated when a condition changes on the resource, so if there are no changes we would reconcile continuously once the first resyncPeriod has elapsed.

Possibly, we need to add an annotation that carries the last resync timestamp.

Another possibility we discussed early on was to re-queue with RequeueAfter(resyncPeriod) at the end of a successful reconcile.

Contributor Author


The implementation uses both mechanisms you mentioned:

  1. RequeueAfter: At the end of successful reconciliation, we schedule the next resync with reconcileStatus.WithRequeue(resyncPeriod) (see the implementation here)
  2. Time-based check: The shouldReconcile function checks time.Since(progressing.LastTransitionTime.Time) >= resyncPeriod to handle reconciles triggered by other means (controller restart, dependency updates, etc.)

The LastTransitionTime gets updated when the reconcile completes and Progressing condition transitions. The scheduled requeue ensures this happens periodically, and the time check prevents excessive reconciliation if multiple triggers fire close together. I updated the proposal to reflect it and make it clearer.

Collaborator


Here's the problem: we can't rely on progressing.LastTransitionTime in shouldReconcile(), because as the resource converges to a stable state, progressing.LastTransitionTime won't get any more updates, meaning that the comparison time.Since(progressing.LastTransitionTime.Time) >= resyncPeriod will always return true and we'll constantly be reconciling resources.

Contributor Author

@eshulman2 eshulman2 Mar 16, 2026


ok got it, so we should set a new status field for the last time it tried to resync, to be able to check it as suggested, correct?

Collaborator


> ok got it, so we should set a new status field for the last time it tried to resync, to be able to check it as suggested, correct?

If we want to go down this route, I believe it should be an annotation (and not a status field). But I also think that it might not be necessary if we requeue after each reconcile.

Contributor Author


ok, I removed the check completely and only left the re-queue

Contributor


My preference would definitely be for a versioned, structured field over an annotation. A new status field applied to all objects would be fine. Using the Progressing condition is probably not appropriate for the reasons above, and because it indicates a transition time to the user: "The controller last made a change to this object at time X". If you wanted to go down the Condition route, a new Synchronized condition might be appropriate? I would probably lean towards a new status field, though.

@eshulman2 eshulman2 requested a review from mandre March 16, 2026 11:13

@eshulman2 eshulman2 requested a review from mandre March 19, 2026 10:25
Contributor

@mdbooth mdbooth left a comment


A few comments, but looks very reasonable to me.


This ensures the controller automatically triggers reconciliation after the configured period. Controller-runtime's work queue handles deduplication of reconcile requests, so no additional time-based checks are needed in `shouldReconcile`.

**Resources in terminal error are not resynced**: When a resource is in a terminal error state (e.g., invalid configuration, unrecoverable OpenStack error), periodic resync is not scheduled. Terminal errors indicate issues that cannot be resolved through automatic retry and require manual intervention to fix the underlying problem. This prevents wasted reconciliation cycles on resources that are known to be in an unrecoverable state.
Contributor


My memory definitely needs a refresh, but do we sometimes set a terminal state due to an issue that can't be resolved automatically but might be resolved manually by an administrator? e.g. A resource is in ERROR state which is later cleared after admin fixes some underlying issue. If so, it might be worth including terminal states in any resync process.

Contributor Author


There was a discussion between me and Martin about exactly this on Thursday, and also a thread here, and I tend to very much agree with allowing resync on terminal-error resources, but it seems like @mandre wants to further discuss it and approach it in the scope of the #667 bug. I believe nearly all terminal errors (now applied for 409 conflicts) can be resolved externally, as I mentioned in the bug.


1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
3. The new resource ID is stored in `status.id`
Contributor


I suspect this will need careful thought per resource. In the specific case of network and subnet, IIRC a subnet cannot exist without a network, so if the network has been deleted then the subnet has also been deleted. From this POV, 'hard' dependencies may actually be the easy case.


#### Implementation Details

At the end of a successful reconciliation (when no other reschedule is pending), the controller schedules the next resync:
Contributor


What do you want to do on controller restart?

Contributor Author


As far as I understand the current controller design, the controller will try reconciliation on startup for all resources (or at least that's what was claimed in a previous thread). If that's the case, it would set the next resync as a requeue out of the box. This might be problematic for other reasons, like causing a thundering herd issue (which already existed and is unrelated to this proposal), but I also suggested some ways to resolve it in the thread #668 (comment)


For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.

**Note on dependent resources**: OpenStack enforces referential integrity for most resources (e.g., Networks cannot be deleted while Subnets exist). If resources are deleted through means that bypass these checks (direct database manipulation, OpenStack bugs), drift detection preserves ORC's existing reconciliation behavior:
Contributor


I think ORC's behaviour is undefined in the presence of bugs and direct DB manipulation, tbh. I'm not sure how you could define it.

Contributor Author


I generally agree; this was already discussed in a previous thread and I believe it is out of scope for this proposal, but I was asked by Martin to add a note about it anyway. (#668 (comment))

**Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten.

**Mitigation**:
- Document that ORC should be the sole manager of resources it creates
Contributor


I think we already do this?

Contributor Author


I am not sure there is a warning note for this, but if there already is one, then this would be one less change to make :)


**Mitigation**:
- Disabled by default; when enabled, recommend conservative intervals (e.g., 10 hours)
- Add random jitter to resync times to avoid thundering herd: since reconciliation already uses "requeue after X duration", jitter simply adds a random offset (e.g., ±10%) to the resync period, spreading resyncs over time rather than having them fire simultaneously
Contributor


Unless it all gets reset when the controller restarts. You might want to store a last synced time in the status? Or some other solution.

Contributor Author


I think that's a good idea; I noticed you mentioned it in another thread. I'm completely for it if there are no objections from @mandre.

@eshulman2
Contributor Author

@mdbooth thanks for the review, I addressed all comments in the review and re-opened relevant threads for you to comment about in case you want to continue the discussions there.

@eshulman2 eshulman2 requested a review from mdbooth March 22, 2026 12:14