THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16#1138
THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16#1138urbanikb wants to merge 7 commits into3scale:masterfrom
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
/retest |
|
/ok-to-test |
|
Thanks the code looks much better. I have perform the following tests:
During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013THREESCALE-12013 I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process? |
b85cb4e to
d5facb4
Compare
This should not be possible on the upgrade path - the deployment should update only after job with revision+1 has been scheduled. Having jobs run after is possible on the "manual trigger" path - on revision change - although it shouldn't take a minute to trigger, so even for that path it seems to be a little bit off. If you still have this instance, could you check the logs of the "pre" hook to check whether the migration was executed in that run or it was already a no-op?
I built the bundle but couldn't reproduce the state where the migration isn't run from the "image changed" path - the bundle I tested with is this one (contains CSV from 2.15 and "alpha" with quay.io images): |
00a5d4e to
9f53af0
Compare
|
Tested upgrade of the current PR from following index: Tried scaling down+up via the CR and it came up correctly, there were some logs in the 3.15 operator regarding pod being deleted ("won't delete job system-app-post because it's still running"), which means the job was being re-created twice. There were 2 caching-related Reconcile errors in the new version that don't seem to be anything to worry about - the corresponding logic has been re-queued and jobs have been re-created within a minute. |
| return reconcile.Result{}, err | ||
| } | ||
| if trackedRevision == -1 { | ||
| r.Logger().Info("Deleted PreHook job that was missing revision annotation") |
| } | ||
|
|
||
| // Already at correct revision | ||
| if trackedRevision == targetRevision { |
There was a problem hiding this comment.
Can you think of a scenario where we have the same job revision but different image? Perhaps we should check the job images here?
There was a problem hiding this comment.
I've added a test/code for this, I think it's good catch - even if there isn't a "normal" way to get to that state, it fits the function - after reconcile I'd expect the image to match.
| targetRevision, err := component.ParseSystemAppHookRevision(desired) | ||
| if err != nil { | ||
| return reconcile.Result{}, err | ||
| } |
There was a problem hiding this comment.
Small nit: We create the desired job before calling this function with the desired revision, so perhaps we should pass that revision into this function call instead of calling ParseSystemAppHookRevision?
There was a problem hiding this comment.
Sounds good to me - I've moved the targetRevision parameter from the component "factory" function here - could have it in 2 places but then it would be little less clear. Does the new version look OK?
|
Index with 2.15 and the bundle for the PR above: |
|
/retest |
027750a to
fdef331
Compare
4d1efa3 to
7bf7723
Compare
439c039 to
94ba470
Compare
|
The fix to controller where the reconcile can stall due to ignored requeue request had to be reverted because it caused deadlock in the e2e tests. I've opened https://issues.redhat.com/browse/THREESCALE-12448 for that one. |
This commit fixes issue with system deployment reconciler happening on upgrade from 2.15 to 2.16 version. The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version. The basic idea is that the job annotation shoudld always match the "version" of system-app deployment. The 2.15 used generation as the version. 2.16 uses revision. The reconciliation logic would only activate upgrade logic if the revision on the existing job matches exactly the deployment revision (it assumes the upgrade is in-flight otherwise). This condition has been causing issue on upgrade from 2.16 because as part of the upgrade the customers need to scale the deployment down, which would bump the generation - which would cause the upgrade logic to incorrectly assume that migration is in flight and fail to delete the old job. Reconciliation then simply updates metadata including the annotation without recreating the job and the process silently fails to execute migration. This PR also includes all edge cases I could think of... even though we only hit the one described above.
The function needed to deal with cases when job didn't exist or the annotation didn't exist. It could return an error but that would just make it more complicated in the calling code. The new function needAppHookJobRerun returns false if the job didn't exist or revision matches deployment revision. It returns true if the job revision is unknown or deployment revision has changed.
No tests have been changed - this is mainly to change the flow of the Reconcile function to avoid calling ReconcileJob when the revision has not changed.
This PR fixes 3 issues:
1. SystemReconciler now does not re-create PreHook job if deployment controller
is slow to update revision
2. SystemReconciler now does not create PostHook before deployment is ready
3. Fix tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
94ba470 to
463a36b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1138 +/- ##
==========================================
+ Coverage 39.34% 42.48% +3.14%
==========================================
Files 205 202 -3
Lines 23363 20530 -2833
==========================================
- Hits 9193 8723 -470
+ Misses 13184 11027 -2157
+ Partials 986 780 -206
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Tested with the current version (after rebase). Didn't observe any unexpected issues, and the timing has been more consistent. Tested with bundle Scenarios tested:
In all upgrade cases there was a delay of ~30s between pre-hook job pod being created and deployment pod being created. When performing rollout restart on 2.16, the job was created after deployment pod, which is expected. |
|
Using your bundle Upgrade from 2.15 to 2.16
Upgrade from 2.16 to alphaThe bundle was unpacked but no upgrade triggered. |
cc91653 to
0c5d392
Compare


Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com
What
Fix:
This PR fixes issue with system deployment reconciler happening on upgrade from 2.15 to 2.16 version.
The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version.
The basic idea is that the job annotation shoudld always match the "version" of system-app deployment.
The 2.15 used generation as the version. 2.16 uses revision.
The reconciliation logic would only activate upgrade logic if the revision on the existing job matches exactly the deployment revision (it assumes the upgrade is in-flight otherwise).
This condition has been causing issue on upgrade from 2.16 because as part of the upgrade the customers need to scale the deployment down, which would bump the generation - which would cause the upgrade logic to incorrectly assume that migration is in flight and fail to delete the old job.
Reconciliation then simply updates metadata including the annotation without recreating the job and the process silently fails to execute migration.
This PR also includes some more edge cases, renamed annotation and some very minor cleanup of test suite.
Steps to reproduce
The deployment generation and job annotation should match. However the deployment generation should be ahead of the revision - this is needed to reproduce the bug. In my case I have generation 6 and revision 2:
Upgrade operator to 2.16... note the job image hasn't been updated but the deployment has.
The 2.16 operator also has updated the annotation to the current revision of the deployment - that is part of the bug.
Verification
The regression test case is implemented in test with test case name
"EDGE CASE: Both image AND revision changed - old job deleted and new one created".Due to this PR also renaming the annotation, the next upgrade will fall into test case
"UPGRADE: Job exists without annotation during upgrade - deleted and recreated".