Skip to content

fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245

Draft
AleksZimin wants to merge 4 commits into
mainfrom
fix/lvg-phase-stuck-notready
Draft

fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady#245
AleksZimin wants to merge 4 commits into
mainfrom
fix/lvg-phase-stuck-notready

Conversation

@AleksZimin
Copy link
Copy Markdown
Member

Description

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

@AleksZimin AleksZimin self-assigned this Apr 22, 2026
@AleksZimin AleksZimin changed the title fix(controller): LVMVolumeGroup stuck in NotReady despite all conditions True fix(lvg-conditions-watcher): atomically update Ready condition and phase to prevent stuck NotReady Apr 22, 2026
LVG status.phase can get stuck in NotReady even when every condition is
True because reconcileLVGConditions issued two sequential Status().Update()
calls (one for the Ready condition, one for the phase) and a concurrent
writer could bump resourceVersion between them, making the second call
fail with 409 Conflict. The happy-path branch only logged that error,
so Status.Conditions stayed consistent, reflect.DeepEqual(...) in the
watcher's UpdateFunc swallowed subsequent events, and the phase never
converged.

Fix: derive the desired (phase, Ready status/reason/message) via a pure
decideLVGReadyAndPhase function and apply both mutations with a single
Status().Update() in updateLVGReadyConditionAndPhaseIfNeeded. Any error,
including 409, is returned to the reconciler so controller-runtime can
requeue and re-read a fresh object.

Tests: - table-driven unit tests for decideLVGReadyAndPhase covering missing
    conditions, Creating/Terminating short-circuits, non-acceptable
    False reasons, acceptable False reasons (Updating/ValidationFailed),
    happy path, and the Ready condition being ignored in its own
    evaluation.
  - integration tests for updateLVGReadyConditionAndPhaseIfNeeded on a
    fake client with WithStatusSubresource(LVMVolumeGroup): verify the
    helper performs exactly one Status().Update() (observed via
    resourceVersion), is a no-op when both fields already match, and
    does not duplicate the Ready condition when mutating an existing
    one.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Previously in CI we set suite.FailFast=true, which means the first failing
spec aborts the suite and later specs (including potentially unrelated
regressions and the cluster teardown-side paths) never run. That hides
downstream failures and makes diagnosing multi-failure runs expensive.

Changes: * always FailFast=false — every spec runs to completion, even when an
    earlier spec has already failed. go test still exits non-zero when any
    spec failed, so CI remains honest.
  * new ReportAfterSuite writes a JUnit XML (e2e-junit-report.xml) and a
    JSON report (e2e-report.json) into ${E2E_REPORT_DIR:-.}, and prints
    an explicit "E2E SUITE SUMMARY / FAILED SPEC(S)" block to GinkgoWriter
    listing every failed spec (full text + location + first-line message).
    This guarantees the failure list is visible in the raw test log even
    when artifact upload does not happen.
  * build_dev.yml: set E2E_REPORT_DIR=\${{ github.workspace }}/e2e so the
    reports land next to e2e-test-output.log and add an Upload test
    reports step that publishes both files as a separate artifact.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Local helper that runs the exact same Go lint flow as the
deckhouse/modules-actions/go_linter@v12 action used in go_checks.yaml:

  * installs a pinned golangci-lint (v2.9.0 by default, overridable via
    GOLANGCI_LINT_VERSION) into hack/.bin so the local version never
    drifts from CI;
  * iterates every go.mod under images/ and every build tag in
    GO_BUILD_TAGS ("ce ee se seplus csepro"), running
    `golangci-lint run --allow-parallel-runners --build-tags <edition>`;
  * keeps going after individual failures and exits non-zero if any run
    failed.

Adds two extras for local use:

  * --fix / --fix-only to apply `golangci-lint fmt` (gci, gofmt,
    goimports, ...) before running linters;
  * --module / --edition to narrow the matrix to a single module/tag
    when iterating on a specific file.

hack/.bin/ (cached golangci-lint binary) is added to .gitignore.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the fix/lvg-phase-stuck-notready branch from 6d58865 to e6c3a6d Compare April 22, 2026 08:57
@github-actions
Copy link
Copy Markdown
Contributor

E2E Smoke Tests Results ❌

Module: sds-node-configurator
Framework: storage-e2e
Namespace: e2e-sds-node-configurator-pr245-24769936987
Status: Tests failed

View Details

Tests were executed via storage-e2e (TestSdsNodeConfigurator: Common Scheduler Extender, then Sds Node Configurator).

Artifacts: Test logs are available in workflow artifacts.

…bled

On RedOS 8 multipathd.service is shipped enabled but inactive by default,
so `systemctl is-enabled --quiet multipathd` returned 0 while the unit
was not running. The handler then called `systemctl reload multipathd`,
which prints "multipathd.service is not active, cannot reload." and
exits non-zero. That non-zero status propagated out of the
bb-sync-file-changed event handler and bashible retried the whole
100_sds-node-configurator-add-loop-devices-to-blacklist.sh step every
10 seconds indefinitely.

Switch the predicate to `systemctl is-active`, which is what reload
actually requires. If multipathd is stopped there is nothing to reload —
it will pick up the new /etc/multipath/conf.d/loop-blacklist.conf at its
next start. Absent, masked and disabled units behave the same as before
(is-active returns non-zero, handler does nothing).

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant