Fix for config corruption on Linux (Docker) after host crash / reboot #1868
banksy-git wants to merge 1 commit into TechnitiumSoftware:master
The write-to-temp and then rename is a valid pattern but for crash safety you should flush the file before renaming. This is a fix for observed behavior where my docker host was rebooted and then it had corrupted this file. (I fsync'd and then copied one from another cluster node to fix it.) Searched for other instances - I think I got them.
Thanks for the PR. Will evaluate it soon. Also, is the issue you mentioned really fixed by these changes? Have you tested them on your cluster nodes with reboot tests?
Reproducing it is more complicated. The corruption has happened practically every time the host gets an OS update and reboots to activate it (immutable OS). At first I ignored it, hoping a later update might fix it, but it happened again on v15.0.1, so I decided to dig. I'm going to edit the start script now so it zips up the config directory before starting, so I can see what state it was in. I looked at how the config is written. That fsync is required for crash safety, because a rename can hit the fs journal before all of the file's pages have been flushed, and the ordering of those two things is not guaranteed. In my case, though, I don't think it explains the corruption unless the host is somehow rebooting without flushing; nothing in the logs suggests an abnormal shutdown. Will continue to dig. :-)
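For readers who want to see the pattern spelled out, here is a minimal sketch of write-to-temp, fsync, then rename. It is not the code from this PR: Python is used only for illustration, and the `atomic_write` helper name and the `.tmp-` prefix are hypothetical. The final directory fsync goes one step beyond what is discussed above; it is a common POSIX practice for making the rename itself durable, included here as an assumption about what a thorough version might do.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to write the pages to stable storage
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption beyond the discussion above: on POSIX, also fsync the directory
    # so that the rename entry itself survives a crash.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Without the `os.fsync` call before `os.replace`, the rename can become durable before the file contents do, which is the window in which a crash could leave a truncated or empty config file under the final name.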
Thanks for the details. If you have deployed this PR, do let me know whether it works well when the OS reboots. I do not have a Docker production setup, so it would be nice to get feedback.
I experienced regular corruption of auth.config every time one of my cluster nodes rebooted.
The write-to-temp and then rename is a valid pattern but for crash safety the file should be flushed to disk before renaming.
This is a fix for observed behavior where my docker host was rebooted and then it had corrupted this file. (My temporary fix, if it helps someone, was to fsync one of the other cluster nodes and copy the auth.config file from it.)
I searched for other instances of this pattern; I think I got them.
Love this project btw, thanks for making it!