Core components install random failure during create-cluster #7780

@DavidePrincipi

Description

The installation of core components may fail during create-cluster if variables such as LOKI_ADDR cannot be found in Redis. This happens when a component reads them before the Loki installation has completed.
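One way to avoid this kind of race is to wait until the variables are published before using them. The following is only a minimal sketch: `fetch_env` is a hypothetical callable that returns the current variables as a dict (for instance, a wrapper around a Redis read), and the function name, timeout, and polling interval are illustrative assumptions, not taken from the Core or Metrics code.

```python
import time


def wait_for_keys(fetch_env, required, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll fetch_env() until every key in `required` is present and non-empty.

    fetch_env is a hypothetical callable returning a dict of variables;
    it stands in for whatever reads the Loki settings from Redis.
    """
    deadline = time.monotonic() + timeout
    while True:
        env = fetch_env()
        missing = [key for key in required if not env.get(key)]
        if not missing:
            return env
        if time.monotonic() >= deadline:
            # Give up with an explicit message instead of a bare KeyError.
            raise TimeoutError("variables still missing: " + ", ".join(missing))
        sleep(interval)
```

A caller would pass the list of variables it needs, for example `["LOKI_ADDR", "LOKI_HTTP_PORT"]`, and only build its configuration once the call returns.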

Steps to reproduce

  • Run the Create New Cluster procedure

Expected behavior

All components and services are correctly installed and started.

Actual behavior

The node_exporter.service unit is not enabled and was not started.

The leader node itself has an offline alert.

Log evidence: LOKI_ADDR is accessed by a concurrently installed service (Metrics) just before the Loki installation completes:

Dec 05 07:04:48 rl1 runagent[35965]: Traceback (most recent call last):
Dec 05 07:04:48 rl1 runagent[35965]:   File "/home/metrics1/.config/bin/provision-prometheus", line 222, in <module>
Dec 05 07:04:48 rl1 runagent[35965]:     generate_prometheus_config(redis_client)
Dec 05 07:04:48 rl1 runagent[35965]:   File "/home/metrics1/.config/bin/provision-prometheus", line 55, in generate_prometheus_config
Dec 05 07:04:48 rl1 runagent[35965]:     logcli["LOKI_ADDR"] = logcli["LOKI_ADDR"] + ':' + logcli["LOKI_HTTP_PORT"]
Dec 05 07:04:48 rl1 runagent[35965]:                           ~~~~~~^^^^^^^^^^^^^
Dec 05 07:04:48 rl1 runagent[35965]: KeyError: 'LOKI_ADDR'
Dec 05 07:04:48 rl1 systemd[34508]: Started libcrun container.
Dec 05 07:04:48 rl1 systemd[34571]: prometheus.service: Control process exited, code=exited, status=1/FAILURE
Dec 05 07:04:48 rl1 podman[36012]: loki
Dec 05 07:04:48 rl1 systemd[34508]: Started Loki pod service.
Dec 05 07:04:48 rl1 agent@loki1[34538]: task/module/loki1/55290a26-9233-43e0-bc40-8414f72b1029: action "create-module" status is "completed" (0) at step 20systemd
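
The KeyError in the traceback comes from indexing `logcli["LOKI_ADDR"]` directly. Below is a hedged sketch of a more defensive variant that reports which variables are missing. `build_loki_addr` and `MissingLokiSettings` are hypothetical names introduced for illustration, and the real `provision-prometheus` script may be structured differently.

```python
class MissingLokiSettings(RuntimeError):
    """Raised when the Loki variables are not available yet (hypothetical)."""


def build_loki_addr(logcli):
    """Return 'address:port' from a dict shaped like `logcli` in the traceback."""
    missing = [k for k in ("LOKI_ADDR", "LOKI_HTTP_PORT") if not logcli.get(k)]
    if missing:
        # Name the missing variables so the log points at the cause.
        raise MissingLokiSettings("Loki settings not found: " + ", ".join(missing))
    return logcli["LOKI_ADDR"] + ":" + logcli["LOKI_HTTP_PORT"]
```

With an explicit exception, the caller can decide whether to retry later or to skip the Loki-related part of the configuration rather than aborting the whole action.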

The Metrics installation failure aborts the create-cluster action, leaving node_exporter.service unconfigured and stopped.

Dec 05 07:04:46 rl1 agent@metrics1[34607]: task/module/metrics1/2b0e231e-7c05-4dad-a5bd-e6ce6af4b1da: action "create-module" status is "aborted" (1) at step 80start_services

Components

  • Core 3.15
  • Metrics 1.2.0
  • Loki 1.4.0

See also

Metadata

  • Assignees: No one assigned
  • Labels: verified (All test cases were verified successfully)
  • Status: Verified