
THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16 #1138

Open
urbanikb wants to merge 7 commits into 3scale:master from urbanikb:THREESCALE-12161

Conversation

@urbanikb
Contributor

@urbanikb urbanikb commented Jan 29, 2026

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

What

Fix:

This PR fixes an issue with the system deployment reconciler that happens on upgrade from version 2.15 to 2.16.

The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version.

The basic idea is that the job annotation should always match the "version" of the system-app deployment.

2.15 used the deployment generation as the version; 2.16 uses the revision.

The reconciliation logic only activates the upgrade logic if the revision on the existing job exactly matches the deployment revision (otherwise it assumes an upgrade is in flight).

This condition causes an issue on upgrade to 2.16 because, as part of the upgrade, customers need to scale the deployment down, which bumps the generation - causing the upgrade logic to incorrectly assume that a migration is in flight and fail to delete the old job.

Reconciliation then simply updates metadata, including the annotation, without recreating the job, and the process silently fails to execute the migration.
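
The mismatch described above can be sketched in a few lines (a minimal illustration; the function name and sentinel values are hypothetical, not the actual operator code):

```go
package main

import "fmt"

// shouldActivateUpgradeLogic is a hypothetical sketch of the gating condition
// described above: the upgrade logic only runs when the version recorded in
// the job annotation exactly matches the deployment's current version; any
// mismatch is assumed to mean an upgrade is already in flight.
func shouldActivateUpgradeLogic(jobAnnotationVersion, deploymentRevision int64) bool {
	return jobAnnotationVersion == deploymentRevision
}

func main() {
	// 2.15 stored the *generation* (bumped by any spec change, including a
	// scale down/up), while 2.16 compares against the *revision* (bumped only
	// on pod-template changes). After a pre-upgrade scale down/up, e.g.
	// generation 6 vs revision 2, the values never match, the old job is
	// kept, and the migration silently never runs.
	fmt.Println(shouldActivateUpgradeLogic(6, 2)) // mismatch: assumed in-flight
	fmt.Println(shouldActivateUpgradeLogic(2, 2)) // exact match: upgrade logic runs
}
```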

This PR also covers some additional edge cases, renames the annotation, and includes some very minor cleanup of the test suite.

Steps to reproduce

  1. install version 2.15 and create APIManager deployment
  2. trigger a deployment scale down/up; this should result in a generation change. The 2.15 operator will automatically re-create the job to match the generation:
% oc get job system-app-pre -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations
NAME             NAMESPACE   ANNOTATIONS
system-app-pre   bu-v215     map[system-app-deployment-generation:5]

% oc scale deployment system-app --replicas 1
deployment.apps/system-app scaled

The deployment generation and job annotation should match. However, the deployment generation should be ahead of the revision - this is needed to reproduce the bug. In my case I have generation 6 and revision 2:

% oc get job system-app-pre -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations
NAME             NAMESPACE   ANNOTATIONS
system-app-pre   bu-v215     map[system-app-deployment-generation:6]
% oc get deployment system-app -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,GENERATION:.metadata.generation
NAME         NAMESPACE   ANNOTATIONS                                GENERATION
system-app   bu-v215     map[deployment.kubernetes.io/revision:2]   6

Upgrade the operator to 2.16... note the job image hasn't been updated, but the deployment image has.

The 2.16 operator has also updated the annotation to the current revision of the deployment - that is part of the bug.

% oc get job system-app-pre -o 'custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,IMAGES:.spec.template.spec.containers[*].image'
NAME             NAMESPACE   ANNOTATIONS                               IMAGES
system-app-pre   bu-v215     map[system-app-deployment-generation:3]   registry.redhat.io/3scale-amp2/system-rhel8@sha256:4c72dfeb1919a1de4fbc47f4611d0c830ef89cf9f17255ad55a9b370618e0e8f
% oc get deployment system-app -o 'custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,GENERATION:.metadata.generation,IMAGES:.spec.template.spec.containers[*].image'
NAME         NAMESPACE   ANNOTATIONS                                GENERATION   IMAGES
system-app   bu-v215     map[deployment.kubernetes.io/revision:3]   7            registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949,registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949,registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949

Verification

The regression test case is implemented under the test case name "EDGE CASE: Both image AND revision changed - old job deleted and new one created".

Because this PR also renames the annotation, the next upgrade will fall into the test case "UPGRADE: Job exists without annotation during upgrade - deleted and recreated".

@urbanikb urbanikb requested a review from a team as a code owner January 29, 2026 18:25
@briangallagher
Contributor

briangallagher commented Jan 29, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                Critical   High   Medium   Low   Total
Open Source Security   0          0      0        0     0 issues
Licenses               0          0      0        0     0 issues


@tkan145
Contributor

tkan145 commented Feb 2, 2026

/retest

@tkan145
Contributor

tkan145 commented Feb 2, 2026

/ok-to-test

@tkan145
Contributor

tkan145 commented Feb 4, 2026

Thanks, the code looks much better.

I have performed the following tests:

  • Fresh install
  • Rollout system-app
  • Scaling up/down
  • Upgrade

During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013

 ▲ lib/3scale/3scale-operator oc get pod -o wide | grep system-app                                                                                         
system-app-84486b9f9c-btdzh                                       3/3     Running     0               8m25s   10.217.1.224   crc    <none>           <none>
system-app-post-zsjfw                                             0/1     Completed   0               6m54s   10.217.1.232   crc    <none>           <none>
system-app-pre-s6h25                                              0/1     Completed   0               7m24s   10.217.1.231   crc    <none>           <none>

I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process?

@urbanikb
Contributor Author

urbanikb commented Feb 5, 2026

During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013

 ▲ lib/3scale/3scale-operator oc get pod -o wide | grep system-app                                                                                         
system-app-84486b9f9c-btdzh                                       3/3     Running     0               8m25s   10.217.1.224   crc    <none>           <none>
system-app-post-zsjfw                                             0/1     Completed   0               6m54s   10.217.1.232   crc    <none>           <none>
system-app-pre-s6h25                                              0/1     Completed   0               7m24s   10.217.1.231   crc    <none>           <none>

This should not be possible on the upgrade path - the deployment should update only after the job with revision+1 has been scheduled.

Having jobs run after is possible on the "manual trigger" path - on revision change - although it shouldn't take a minute to trigger, so even for that path it seems a little bit off. If you still have this instance, could you check the logs of the "pre" hook to see whether the migration was executed in that run or was already a no-op?

I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process?

I built the bundle but couldn't reproduce the state where the migration isn't run via the "image changed" path - the bundle I tested with is this one (it contains the CSV from 2.15 and an "alpha" with quay.io images):

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: threescale-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/burbanik/3scale-operator-index:v0.2.0

@urbanikb
Contributor Author

urbanikb commented Feb 9, 2026

Tested the upgrade of the current PR from the following index:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: threescale-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: 'quay.io/burbanik/3scale-operator-index:v0.3.0'
EOF

Tried scaling down and up via the CR and it came up correctly. There were some logs in the 2.15 operator regarding the pod being deleted ("won't delete job system-app-post because it's still running"), which means the job was being re-created twice.

There were two caching-related Reconcile errors in the new version that don't seem to be anything to worry about - the corresponding reconciles were re-queued and the jobs re-created within a minute.

{"level":"error","ts":"2026-02-09T21:18:48Z","msg":"Reconciler error","controller":"apimanager","controllerGroup":"apps.3scale.net","controllerKind":"APIManager","APIManager":{"name":"3scale","namespace":"3scale-upgrade"},"namespace":"3scale-upgrade","name":"3scale","reconcileID":"3a894066-0537-4b76-b270-0e25c2ade5ac","error":"Operation cannot be fulfilled on jobs.batch \"system-app-pre\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/jobs/3scale-upgrade/system-app-pre, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: e1230b60-fd4e-4737-8cec-074c1a180af7, UID in object meta: ","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2026-02-09T21:20:00Z","msg":"Reconciler error","controller":"apimanager","controllerGroup":"apps.3scale.net","controllerKind":"APIManager","APIManager":{"name":"3scale","namespace":"3scale-upgrade"},"namespace":"3scale-upgrade","name":"3scale","reconcileID":"036850c3-3c8a-4784-ad5e-da47e00360dc","error":"Operation cannot be fulfilled on jobs.batch \"system-app-post\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

	return reconcile.Result{}, err
}
if trackedRevision == -1 {
	r.Logger().Info("Deleted PreHook job that was missing revision annotation")
Contributor

Same as above

}

// Already at correct revision
if trackedRevision == targetRevision {
Contributor

Can you think of a scenario where we have the same job revision but different image? Perhaps we should check the job images here?

Contributor Author

I've added a test/code for this - I think it's a good catch. Even if there isn't a "normal" way to get to that state, it fits the function: after reconcile I'd expect the image to match.
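
The extra check could look roughly like this (a sketch of the review suggestion; the helper name and signature are assumptions, not the actual code added in the PR):

```go
package main

import "fmt"

// jobImagesMatch reports whether the existing job's container images match
// the desired ones. Hypothetical helper illustrating the suggestion above:
// even when the tracked revision equals the target revision, a differing
// image should still force the job to be re-created.
func jobImagesMatch(existing, desired []string) bool {
	if len(existing) != len(desired) {
		return false
	}
	for i := range existing {
		if existing[i] != desired[i] {
			return false
		}
	}
	return true
}

func main() {
	oldImgs := []string{"registry.example.com/system-rhel8@sha256:aaa"}
	newImgs := []string{"registry.example.com/system-rhel9@sha256:bbb"}
	fmt.Println(jobImagesMatch(oldImgs, newImgs)) // images differ: recreate job
	fmt.Println(jobImagesMatch(newImgs, newImgs)) // images match: nothing to do
}
```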

Comment on lines +373 to +376
targetRevision, err := component.ParseSystemAppHookRevision(desired)
if err != nil {
return reconcile.Result{}, err
}
Contributor

Small nit: We create the desired job before calling this function with the desired revision, so perhaps we should pass that revision into this function call instead of calling ParseSystemAppHookRevision?

Contributor Author

Sounds good to me - I've moved the targetRevision parameter from the component "factory" function here. It could live in two places, but then it would be a little less clear. Does the new version look OK?

@urbanikb
Contributor Author

urbanikb commented Feb 16, 2026

Index with 2.15 and the bundle for the PR above:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: threescale-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: 'quay.io/burbanik/3scale-operator-index:v0.4.0'
EOF

@tkan145
Contributor

tkan145 commented Feb 17, 2026

/retest

@urbanikb urbanikb force-pushed the THREESCALE-12161 branch 2 times, most recently from 027750a to fdef331 on February 25, 2026 12:46
@tkan145
Contributor

tkan145 commented Feb 27, 2026

I've run the upgrade process several times, but the sequence of jobs is still not right. The pre-hook job still triggers after system-app.

[screenshot: upgrade]

@urbanikb urbanikb force-pushed the THREESCALE-12161 branch 2 times, most recently from 4d1efa3 to 7bf7723 on March 10, 2026 21:46
@urbanikb
Contributor Author

I've run the upgrade process several times, but the sequence of jobs is still not right. The pre-hook job still triggers after system-app.

[screenshot: upgrade]

I haven't been able to reproduce this, but after some testing I found that requeue requests of the system reconciler (or any sub-reconciler) aren't respected by the controller, so I've added handling for that here.

After troubleshooting and some experimentation, I've found a plausible explanation for why both issues on the screenshot happen:

  1. the job would be re-created after it completes once and the deployment is subsequently updated: the deployment controller in k8s reacts to the updated spec, but the revision is only updated once the ReplicaSet has been created, so if the APIManager requeues before that happens, it will re-create the job because it expects revision N+1 (it expects the deployment resource to get a new revision and image atomically, but that is not the case)
  2. the pre-hook job should be waiting for the deployment to become ready, not just updated
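
The two fixes above can be sketched roughly like this (a simplified stand-in for the Deployment fields involved; the function and type names are assumptions, not the operator's actual code):

```go
package main

import "fmt"

// deploymentState is a minimal stand-in for the Deployment fields involved.
type deploymentState struct {
	Generation         int64 // metadata.generation: bumped on every spec change
	ObservedGeneration int64 // status.observedGeneration: set once the controller reacts
	Replicas           int32 // spec.replicas
	ReadyReplicas      int32 // status.readyReplicas
}

// specObserved guards fix 1: don't compare revisions (or re-create the
// pre-hook job) until the deployment controller has observed the new spec,
// since the revision annotation only updates after the new ReplicaSet exists.
func specObserved(d deploymentState) bool {
	return d.ObservedGeneration >= d.Generation
}

// readyForPostHook guards fix 2: the post-hook should wait for the rollout
// to actually produce ready pods, not merely an updated spec.
func readyForPostHook(d deploymentState) bool {
	return specObserved(d) && d.ReadyReplicas == d.Replicas
}

func main() {
	d := deploymentState{Generation: 7, ObservedGeneration: 6, Replicas: 1}
	fmt.Println(specObserved(d), readyForPostHook(d)) // spec not yet observed
	d.ObservedGeneration, d.ReadyReplicas = 7, 1
	fmt.Println(specObserved(d), readyForPostHook(d)) // observed and ready
}
```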

I've tested it with a dev bundle; the order of pod creation is correct (I scaled system-app down and up for the 2.15 upgrade; rollout restart for 2.16):

2.15 -> 2.16:

system-app-5bb64fdf9-649sm                                   3/3     Running            0             89s
system-app-post-5m2dg                                        0/1     Completed          0             26s
system-app-pre-dkkwt                                         0/1     Completed          0             2m12s

2.16 -> 2.16 (different porta image):

system-app-755f898549-pbrpd                                  3/3     Running     0              103s
system-app-post-zt7dx                                        0/1     Completed   0              40s
system-app-pre-l8qm4                                         0/1     Completed   0              102s 

@urbanikb urbanikb force-pushed the THREESCALE-12161 branch 2 times, most recently from 439c039 to 94ba470 on March 11, 2026 10:48
@urbanikb
Contributor Author

The fix to the controller, where the reconcile could stall due to an ignored requeue request, had to be reverted because it caused a deadlock in the e2e tests.

I've opened https://issues.redhat.com/browse/THREESCALE-12448 for that one.

urbanikb and others added 6 commits March 20, 2026 08:27
This commit fixes an issue with the system deployment reconciler that
happens on upgrade from version 2.15 to 2.16.

The "state" of the upgrade process is persisted in an annotation on
the job that records the deployment version.

The basic idea is that the job annotation should always match the "version"
of the system-app deployment.

2.15 used the deployment generation as the version; 2.16 uses the revision.

The reconciliation logic only activates the upgrade logic if the revision
on the existing job exactly matches the deployment revision (otherwise it
assumes an upgrade is in flight).

This condition causes an issue on upgrade to 2.16 because, as part of the
upgrade, customers need to scale the deployment down, which bumps the
generation - causing the upgrade logic to incorrectly assume that a
migration is in flight and fail to delete the old job.

Reconciliation then simply updates metadata, including the annotation, without
recreating the job, and the process silently fails to execute the migration.

This PR also includes all edge cases I could think of... even though we only
hit the one described above.
The function needed to deal with cases when the job didn't exist
or the annotation didn't exist. It could return an error, but
that would just make the calling code more complicated.

The new function needAppHookJobRerun returns false if the job
didn't exist or revision matches deployment revision.

It returns true if the job revision is unknown or deployment
revision has changed.
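
The semantics described above could look roughly like this (a sketch only; the signature and the -1 sentinel for a missing annotation are assumptions, not the committed code):

```go
package main

import "fmt"

// needAppHookJobRerun sketches the semantics described in the commit message:
// false when the job doesn't exist or its tracked revision already matches
// the deployment revision; true when the tracked revision is unknown (the
// annotation was missing, modeled here as -1) or the revision has changed.
func needAppHookJobRerun(jobExists bool, trackedRevision, deploymentRevision int64) bool {
	if !jobExists {
		return false // no existing job: nothing to rerun
	}
	if trackedRevision == -1 {
		return true // revision unknown: rerun to be safe
	}
	return trackedRevision != deploymentRevision // rerun only on revision change
}

func main() {
	fmt.Println(needAppHookJobRerun(false, 0, 3)) // no job
	fmt.Println(needAppHookJobRerun(true, -1, 3)) // annotation missing
	fmt.Println(needAppHookJobRerun(true, 3, 3))  // up to date
	fmt.Println(needAppHookJobRerun(true, 2, 3))  // revision changed
}
```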
No tests have been changed - this is mainly to change the flow
of the Reconcile function to avoid calling ReconcileJob when
the revision has not changed.
This PR fixes 3 issues:

1. SystemReconciler now does not re-create PreHook job if deployment controller
    is slow to update revision
2. SystemReconciler now does not create PostHook before deployment is ready
3. Fix tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 81.17647% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.48%. Comparing base (4e62f84) to head (0c5d392).
⚠️ Report is 19 commits behind head on master.

Files with missing lines                        Patch %   Lines
pkg/3scale/amp/operator/system_reconciler.go    84.61%    5 Missing and 3 partials ⚠️
pkg/3scale/amp/component/system.go              78.94%    2 Missing and 2 partials ⚠️
pkg/helper/job.go                               71.42%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1138      +/-   ##
==========================================
+ Coverage   39.34%   42.48%   +3.14%     
==========================================
  Files         205      202       -3     
  Lines       23363    20530    -2833     
==========================================
- Hits         9193     8723     -470     
+ Misses      13184    11027    -2157     
+ Partials      986      780     -206     
Flag Coverage Δ
unit 42.48% <81.17%> (?)

Flags with carried forward coverage won't be shown.

Components Coverage Δ
apis/apps/v1alpha1 (u) 58.55% <ø> (+29.45%) ⬆️
apis/capabilities/v1alpha1 (u) 3.50% <ø> (+1.99%) ⬆️
apis/capabilities/v1beta1 (u) 20.21% <ø> (-1.16%) ⬇️
controllers (i) 9.32% <0.00%> (-0.01%) ⬇️
pkg (u) 61.69% <79.16%> (+0.70%) ⬆️
Files with missing lines Coverage Δ
pkg/3scale/amp/component/system.go 74.68% <78.94%> (+3.42%) ⬆️
pkg/helper/job.go 58.69% <71.42%> (+3.14%) ⬆️
pkg/3scale/amp/operator/system_reconciler.go 75.06% <84.61%> (+7.66%) ⬆️

... and 4 files with indirect coverage changes


@urbanikb
Contributor Author

Tested with the current version (after rebase). Didn't observe any unexpected issues, and the timing has been more consistent.

Tested with bundle quay.io/burbanik/3scale-operator-index:v0.1002.0, which contains one 2.15 version from the registry and two test 2.16 versions to test the upgrade plus an upgrade of the porta image.

Scenarios tested:

  • Upgrade 2.15 -> 2.16 with scaledown / up prior to upgrade
  • Upgrade 2.15 -> 2.16 with rollout restart prior to upgrade
  • Upgrade 2.16 -> 2.16 with rollout restart

In all upgrade cases there was a delay of ~30s between pre-hook job pod being created and deployment pod being created.

When performing rollout restart on 2.16, the job was created after deployment pod, which is expected.

@tkan145
Contributor

tkan145 commented Mar 24, 2026

Using your bundle

Upgrade from 2.15 to 2.16

system-app-pre job triggered 2 times. system-app-pre -> system-app -> system-app-pre again. This can be confusing when debugging issues with must-gather, as the job's timestamp is very close to the deployment time.

Upgrade from 2.16 to alpha

The bundle was unpacked but no upgrade was triggered.

