Lifecycle hooks can make the agent unresponsive #337

The bosh-agent already runs with a higher priority than BOSH/monit jobs, so that CPU-intensive workloads do not block agent <-> director communication; see https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/00054bd98693465dd75eda1f12a7326fc5191804 .

However, it seems that lifecycle hooks such as pre-start scripts can have the same negative effect on communication with the director, because they are started by the bosh-agent itself and hence run with the same priority. At least this is my assumption: I wasn't able to find a line of code that lowers that priority, and looking at a VM while it is running a pre-start shows that the pre-start script and all of its sub-processes run with the same priority as the agent.

In our case, cloning a lot of data from the remaining part of a BOSH-managed PostgreSQL cluster triggers this issue intermittently. In extreme situations this extends downtime unnecessarily, because the bosh task fails with an agent timeout and the pre-start has to run again from scratch.

Of course, as a quick mitigation we could, for example, renice our pre-start script ourselves. Still, I would see a benefit in consistency, and hence predictability, if the bosh-agent started external scripts and binaries with a lower priority than itself.
