Fix for config corruption on Linux (Docker) after host crash / reboot #1868

Open
banksy-git wants to merge 1 commit into TechnitiumSoftware:master from banksy-git:fix-corrupt-configs

Conversation

@banksy-git

I experienced regular corruption of auth.config every time one of my cluster nodes rebooted.

The write-to-temp and then rename is a valid pattern but for crash safety the file should be flushed to disk before renaming.
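
As an illustration of the pattern described above, here is a minimal Python sketch of write-to-temp, flush, then rename. It is not the code from this PR (the project itself is written in C#); the function name `atomic_write` and the `.tmp-` prefix are made up for the example.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path via a temp file, flushing to disk before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to the disk
        os.replace(tmp_path, path)  # rename over the old file only after the flush
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The important part is the ordering: `os.fsync` runs before `os.replace`, so the rename is only issued once the new contents have been handed to the disk.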

This is a fix for observed behavior where my docker host was rebooted and then it had corrupted this file. (My temporary fix, if it helps someone, was to fsync on one of the other cluster nodes and copy its auth.config file over.)

I searched for other instances of this pattern; I think I got them.

Love this project btw, thanks for making it!

The write-to-temp and then rename is a valid pattern but for crash
safety you should flush the file before renaming.

This is a fix for observed behavior where my docker host was rebooted
and then it had corrupted this file. (I fsync'd and then copied one
from another cluster node to fix it.)

Searched for other instances - I think I got them.
@ShreyasZare
Member

Thanks for the PR. Will evaluate it soon.

Also, is the issue you mentioned really fixed by these changes? Have you tested them on your cluster nodes with reboot tests?

@banksy-git
Author

banksy-git commented Apr 30, 2026

Reproducing it is more complicated.

The corruption has happened practically every time the host gets an OS update and reboots to activate it (immutable OS). At first I ignored it, hoping a later update might fix it, but it happened again on v15.0.1 so I decided to dig. I'm going to edit the start script now so it zips up the config directory before starting, so I can inspect the files as they were at startup.
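
For anyone who wants to capture the same evidence, a snapshot step like the one described above might look roughly like this in Python. This is only a sketch: the function name `snapshot_config`, the archive naming, and both directory arguments are assumptions, not part of the project or of my actual start script.

```python
import shutil
import time
from pathlib import Path


def snapshot_config(config_dir: str, backup_dir: str) -> str:
    """Zip config_dir into backup_dir and return the archive path."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    base = Path(backup_dir) / f"config-{stamp}"
    # make_archive appends ".zip" and returns the final archive path.
    return shutil.make_archive(str(base), "zip", root_dir=config_dir)
```

Calling this before the server starts keeps a copy of the config exactly as the reboot left it. The backup directory should live outside the config directory so the archive does not end up inside itself.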

I looked at how the config was written. That fsync is required for crash safety, because a rename can hit the fs journal before all the data pages have been flushed, and the ordering of the two is not guaranteed. In my case, though, I don't think it explains the corruption unless the host is somehow rebooting without flushing; nothing in the logs suggests an abnormal shutdown.
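
To spell out the ordering concern, here is a rough Python sketch that forces the order explicitly: file data first, then the rename. It also flushes the containing directory afterwards, which is a common companion step on POSIX systems but goes beyond what this PR does, so treat it as an assumption. Both function names are hypothetical.

```python
import os


def fsync_directory(directory: str) -> None:
    """Flush a directory update (such as a rename) to disk on POSIX systems."""
    if os.name != "posix":
        return  # opening a directory this way is not supported on Windows
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_replace(tmp_path: str, final_path: str) -> None:
    """Flush tmp_path's data, rename it over final_path, then flush the directory."""
    # 1. Flush the new file's data so it is on disk before the rename is issued.
    with open(tmp_path, "rb+") as tmp:
        os.fsync(tmp.fileno())
    # 2. Swap the new file into place.
    os.replace(tmp_path, final_path)
    # 3. Flush the directory so the rename itself is recorded on disk.
    fsync_directory(os.path.dirname(os.path.abspath(final_path)))
```

Without step 1, the filesystem may record the rename before the file's contents, which is the window in which a crash can leave a truncated or empty file under the final name.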

Will continue to dig. :-)

@ShreyasZare
Member

Thanks for the details. If you have deployed this PR, do let me know whether it's working well when the OS reboots. I do not have a Docker production setup, so it would be nice to get feedback.

2 participants