fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245
Draft
AleksZimin wants to merge 4 commits into
Draft
fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245AleksZimin wants to merge 4 commits into
AleksZimin wants to merge 4 commits into
Conversation
LVG status.phase can get stuck in NotReady even when every condition is
True because reconcileLVGConditions issued two sequential Status().Update()
calls (one for the Ready condition, one for the phase) and a concurrent
writer could bump resourceVersion between them, making the second call
fail with 409 Conflict. The happy-path branch only logged that error,
so Status.Conditions stayed consistent, reflect.DeepEqual(...) in the
watcher's UpdateFunc swallowed subsequent events, and the phase never
converged.
Fix: derive the desired (phase, Ready status/reason/message) via a pure
decideLVGReadyAndPhase function and apply both mutations with a single
Status().Update() in updateLVGReadyConditionAndPhaseIfNeeded. Any error,
including 409, is returned to the reconciler so controller-runtime can
requeue and re-read a fresh object.
Tests: - table-driven unit tests for decideLVGReadyAndPhase covering missing
conditions, Creating/Terminating short-circuits, non-acceptable
False reasons, acceptable False reasons (Updating/ValidationFailed),
happy path, and the Ready condition being ignored in its own
evaluation.
- integration tests for updateLVGReadyConditionAndPhaseIfNeeded on a
fake client with WithStatusSubresource(LVMVolumeGroup): verify the
helper performs exactly one Status().Update() (observed via
resourceVersion), is a no-op when both fields already match, and
does not duplicate the Ready condition when mutating an existing
one.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Previously in CI we set suite.FailFast=true, which means the first failing
spec aborts the suite and later specs (including potentially unrelated
regressions and the cluster teardown-side paths) never run. That hides
downstream failures and makes diagnosing multi-failure runs expensive.
Changes: * always FailFast=false — every spec runs to completion, even when an
earlier spec has already failed. go test still exits non-zero when any
spec failed, so CI remains honest.
* new ReportAfterSuite writes a JUnit XML (e2e-junit-report.xml) and a
JSON report (e2e-report.json) into ${E2E_REPORT_DIR:-.}, and prints
an explicit "E2E SUITE SUMMARY / FAILED SPEC(S)" block to GinkgoWriter
listing every failed spec (full text + location + first-line message).
This guarantees the failure list is visible in the raw test log even
when artifact upload does not happen.
* build_dev.yml: set E2E_REPORT_DIR=\${{ github.workspace }}/e2e so the
reports land next to e2e-test-output.log and add an Upload test
reports step that publishes both files as a separate artifact.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Local helper that runs the exact same Go lint flow as the
deckhouse/modules-actions/go_linter@v12 action used in go_checks.yaml:
* installs a pinned golangci-lint (v2.9.0 by default, overridable via
GOLANGCI_LINT_VERSION) into hack/.bin so the local version never
drifts from CI;
* iterates every go.mod under images/ and every build tag in
GO_BUILD_TAGS ("ce ee se seplus csepro"), running
`golangci-lint run --allow-parallel-runners --build-tags <edition>`;
* keeps going after individual failures and exits non-zero if any run
failed.
Adds two extras for local use:
* --fix / --fix-only to apply `golangci-lint fmt` (gci, gofmt,
goimports, ...) before running linters;
* --module / --edition to narrow the matrix to a single module/tag
when iterating on a specific file.
hack/.bin/ (cached golangci-lint binary) is added to .gitignore.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
6d58865 to
e6c3a6d
Compare
Contributor
E2E Smoke Tests Results ❌Module: sds-node-configurator
Tests were executed via storage-e2e (TestSdsNodeConfigurator: Common Scheduler Extender, then Sds Node Configurator). Artifacts: Test logs are available in workflow artifacts. |
…bled On RedOS 8 multipathd.service is shipped enabled but inactive by default, so `systemctl is-enabled --quiet multipathd` returned 0 while the unit was not running. The handler then called `systemctl reload multipathd`, which prints "multipathd.service is not active, cannot reload." and exits non-zero. That non-zero status propagated out of the bb-sync-file-changed event handler and bashible retried the whole 100_sds-node-configurator-add-loop-devices-to-blacklist.sh step every 10 seconds indefinitely. Switch the predicate to `systemctl is-active`, which is what reload actually requires. If multipathd is stopped there is nothing to reload — it will pick up the new /etc/multipath/conf.d/loop-blacklist.conf at its next start. Absent, masked and disabled units behave the same as before (is-active returns non-zero, handler does nothing). Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Why do we need it, and what problem does it solve?
What is the expected result?
Checklist