Description
What happened?
When you have multiple syndic masters (for example, 7/8 is when I started seeing it) and you target a minion on each of those masters, a race condition sometimes occurs on the master of masters when the returns come in. I see it roughly 1-6 times per 100 commands. When it occurs, we see some combination of these errors in the logs:
2025-11-04 20:05:05,534 [salt.master :1933][ERROR ][3123949] Error in function _syndic_return:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1927, in run_func
ret = getattr(self, func)(load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1774, in _syndic_return
self._return(ret)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1718, in _return
salt.utils.job.store_job(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/job.py", line 128, in store_job
if job_cache == "local_cache" and mminion.returners[getfstr](load.get("jid", "")):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 159, in _call_
ret = self.loader.run(run_func, *args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1245, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1260, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/returners/local_cache.py", line 312, in get_load
all_minions.update(salt.payload.load(rfh))
TypeError: 'NoneType' object is not iterable
2025-11-04 19:34:46,761 [salt.master :1933][ERROR ][3123758] Error in function _syndic_return:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1927, in run_func
ret = getattr(self, func)(load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1774, in _syndic_return
self._return(ret)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1718, in _return
salt.utils.job.store_job(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/job.py", line 128, in store_job
if job_cache == "local_cache" and mminion.returners[getfstr](load.get("jid", "")):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 159, in _call_
ret = self.loader.run(run_func, *args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1245, in run
return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1260, in _run_as
ret = _func_or_method(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/returners/local_cache.py", line 301, in get_load
if exc is not None:
UnboundLocalError: local variable 'exc' referenced before assignment
2025-11-04 19:34:46,507 [salt.payload :100 ][CRITICAL][3123875] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
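For illustration, here is a hypothetical minimal sketch (not Salt code) of how two interleaved writes to the same cache file produce exactly that "extra data" failure:

```python
# Hypothetical repro sketch (not Salt code): if two workers write their
# payloads to the same cache file without coordination, the file can end
# up holding two concatenated msgpack documents instead of one.
import msgpack

CACHE_FILE = "/tmp/minions.p"  # stand-in for a job cache file

# Worker A serializes its minion list; worker B races it with another list.
blob_a = msgpack.packb(["minion1", "minion2"])
blob_b = msgpack.packb(["minion3"])

# Simulate the interleaving: B's bytes land after A's without a truncate.
with open(CACHE_FILE, "wb") as fh:
    fh.write(blob_a)
    fh.write(blob_b)

# A single deserialize now fails the same way the CRITICAL log above does:
# msgpack.exceptions.ExtraData: unpack(b) received extra data.
with open(CACHE_FILE, "rb") as fh:
    msgpack.unpackb(fh.read())
```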
What's happening is that on syndic return we call save_load: https://github.com/saltstack/salt/blob/master/salt/master.py#L1885
which writes the load.p and minions.p files to the job cache. I believe that when returns arrive at the same time, multiple workers write to these files concurrently and corrupt them. This is my current theory, and my initial patch is holding up in local testing on my laptop.
I'm working on a patch and will submit it once I've fully validated it.
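In the meantime, here is a sketch of one possible mitigation, assuming an atomic-replace strategy (illustrative only; atomic_dump is a hypothetical helper, not the pending patch): write to a temp file in the same directory, then rename over the target so readers never observe a half-written file.

```python
# Sketch of a possible mitigation (hypothetical helper, not the pending
# patch): serialize to a temp file in the same directory, then atomically
# rename it over the target so readers never see a partial write.
import os
import tempfile

import msgpack


def atomic_dump(obj, path):
    """Serialize obj with msgpack and atomically install it at path."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(msgpack.packb(obj))
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic within a single filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Note that an atomic rename only protects readers from torn files; if two syndic returns both need to merge into minions.p, the read-modify-write itself still has to be serialized, for example with an fcntl.flock held around the update.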
Type of salt install
Official deb
Major version
3006.x
What supported OS are you seeing the problem on? Can select multiple. (If bug appears on an unsupported OS, please open a GitHub Discussion instead)
ubuntu-22.04
salt --versions-report output
3006.9
I also validated that this behavior occurs on the head of 3006.x.