[Bug]: Syndic Race Condition in job cache on minion return #68508

@Ch3LL

Description

What happened?

When there are multiple syndic masters (I started seeing it at around 7 or 8) and you target a minion on each of those masters, a race condition sometimes occurs on the master of masters when the returns come in. I see it roughly 1 to 6 times per 100 commands. When it occurs, some combination of these errors appears in the logs:

2025-11-04 20:05:05,534 [salt.master      :1933][ERROR   ][3123949] Error in function _syndic_return:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1927, in run_func
    ret = getattr(self, func)(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1774, in _syndic_return
    self._return(ret)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1718, in _return
    salt.utils.job.store_job(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/job.py", line 128, in store_job
    if job_cache == "local_cache" and mminion.returners[getfstr](load.get("jid", "")):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 159, in __call__
    ret = self.loader.run(run_func, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1245, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1260, in _run_as
    ret = _func_or_method(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/returners/local_cache.py", line 312, in get_load
    all_minions.update(salt.payload.load(rfh))
TypeError: 'NoneType' object is not iterable

2025-11-04 19:34:46,761 [salt.master      :1933][ERROR   ][3123758] Error in function _syndic_return:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1927, in run_func
    ret = getattr(self, func)(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1774, in _syndic_return
    self._return(ret)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1718, in _return
    salt.utils.job.store_job(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/job.py", line 128, in store_job
    if job_cache == "local_cache" and mminion.returners[getfstr](load.get("jid", "")):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 159, in __call__
    ret = self.loader.run(run_func, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1245, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1260, in _run_as
    ret = _func_or_method(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/returners/local_cache.py", line 301, in get_load
    if exc is not None:
UnboundLocalError: local variable 'exc' referenced before assignment

2025-11-04 19:34:46,507 [salt.payload     :100 ][CRITICAL][3123875] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
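As a side note on the UnboundLocalError above: one general way Python produces this error is when a name bound by `except ... as exc` is read after the except block has finished, because Python unbinds that name when the block exits. The sketch below is a standalone illustration of that language behavior. The function `read_with_retry` and its retry loop are hypothetical and are not taken from `local_cache.py`, so this may or may not match what happens in `get_load`.

```python
def read_with_retry(loader, tries=3):
    """Hypothetical retry loop; NOT the Salt implementation."""
    exc = None
    for _ in range(tries):
        try:
            return loader()
        except ValueError as exc:
            # Python deletes the name `exc` when this block exits,
            # even though it was assigned None before the loop.
            pass
    # If the except block ran at least once, `exc` is now unbound,
    # so this check raises UnboundLocalError instead of re-raising.
    if exc is not None:
        raise exc
    return None


def always_fails():
    # Stand-in for a reader that keeps hitting a corrupted file.
    raise ValueError("corrupt payload")


try:
    read_with_retry(always_fails)
except UnboundLocalError as err:
    print("UnboundLocalError:", err)
```

If something like this is what is happening, the UnboundLocalError would be a secondary symptom that hides the original deserialization failure rather than a separate bug.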

What's happening is that on syndic return we call save_load: https://github.com/saltstack/salt/blob/master/salt/master.py#L1885

This writes the load.p and minions.p files in the job cache. I believe that when returns arrive at the same time, they sometimes write to these files concurrently and corrupt them. This is my current theory, and my initial patch is working in local testing on my laptop.
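To illustrate the general idea of avoiding corruption from concurrent writers, here is a minimal sketch of the common write-to-a-temp-file-then-rename pattern. This is not the patch I'm working on and not Salt code: `atomic_write` and `demo` are hypothetical names, the payloads are plain bytes rather than msgpack, and it assumes a POSIX filesystem where `os.replace` within one directory is an atomic rename.

```python
import os
import tempfile
import threading


def atomic_write(path, data):
    """Write bytes to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target
    # for os.replace to act as an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def demo(workers=8):
    """Hypothetical stress test: several threads write the same file at once."""
    payloads = [bytes([65 + i]) * 100_000 for i in range(workers)]
    with tempfile.TemporaryDirectory() as job_dir:
        target = os.path.join(job_dir, "load.p")
        threads = [
            threading.Thread(target=atomic_write, args=(target, payload))
            for payload in payloads
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        with open(target, "rb") as fh:
            final = fh.read()
        leftovers = [n for n in os.listdir(job_dir) if n.startswith(".tmp-")]
    # The final file should be exactly one writer's payload, never a mix.
    return final in payloads, leftovers
```

With this pattern the last writer wins and the file always holds one complete payload. It does not protect a read-merge-rewrite sequence from losing updates; that case would additionally need a lock around the whole sequence.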

I'm working on a patch and will submit it once I've fully validated it.

Type of salt install

Official deb

Major version

3006.x

What supported OS are you seeing the problem on? Can select multiple. (If bug appears on an unsupported OS, please open a GitHub Discussion instead)

ubuntu-22.04

salt --versions-report output

3006.9

I also confirmed that this behavior occurs on the head of 3006.x.

Metadata

Labels

bug (broken, incorrect, or confusing behavior), needs-triage