action #116113 (Closed)
salt responses timing out some of the time size:M
Description
Observation
Sometimes salt works fine, other times some or all of the minions time out (when matching multiple machines):
sudo salt -vvv -C 'G@roles:worker' cmd.run 'uptime'
Executing job with jid 20220901080543238391
-------------------------------------------
openqaworker6.suse.de:
10:05:43 up 4 days 6:27, 0 users, load average: 0.83, 1.56, 1.68
QA-Power8-5-kvm.qa.suse.de:
10:05:43 up 4 days 6:30, 0 users, load average: 4.11, 4.20, 6.23
[ERROR ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
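The --async workflow suggested by that timeout message can be sketched as follows. The target expression is the one from this ticket; the exact wording of the jid line that salt prints varies by version, so the small parsing helper here is an assumption to be checked against your output.

```shell
# salt --async prints the job id (jid) and exits immediately instead of
# listening on the (possibly congested) event bus. A helper like this
# could pull the jid out of that output line (the "job ID:" wording is
# an assumption, check your salt version):
extract_jid() {
  sed -n 's/.*[Jj]ob ID: \([0-9]*\).*/\1/p'
}

# jid=$(sudo salt --async -C 'G@roles:worker' cmd.run 'uptime' | extract_jid)
# ...and later fetch the cached results from the job cache:
# sudo salt-run jobs.lookup_jid "$jid"
```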
I considered that we recently saw packages having been uninstalled, which was covered by #115484, but it seems everything is installed and the versions match (mismatched packages could break the connection).
Verbose output unfortunately doesn't seem to add much here.
I also checked grafana but couldn't spot any obvious issues there.
Errors observed in the journal of salt-master
Sep 01 11:30:18 openqa salt-master[14829]: [ERROR ] Event iteration failed with exception: 'str' object has no attribute 'items'
And many occurrences of:
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid'
…
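Those Errno 13 failures suggest job-cache entries owned by a different user than the one salt-master runs as. A minimal sketch to spot such entries (assuming GNU stat, and assuming the master is expected to run as the salt user):

```shell
# List job-cache entries whose owner differs from the expected user.
# $1 = jobs directory, $2 = expected owner (e.g. "salt")
list_wrong_owner() {
  find "$1" -mindepth 1 | while read -r path; do
    [ "$(stat -c %U "$path")" = "$2" ] || echo "$path"
  done
}

# e.g.: list_wrong_owner /var/cache/salt/master/jobs salt
```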
Acceptance criteria
- AC1: No more "Message timed out" errors
Suggestions
- Check the systemd journal for salt-master
- Investigate why we see permission errors; salt-master should run as root, and on OSD it must be invoked as root
- Check why we don't always run the service as the salt user
- Restart services and monitor carefully how it behaves after that
- A manual restart (temporarily) resolved the issue
- There is no "salt" hostname, which salt expects? Maybe this was a side-effect of #115484
- Double-check recent config file changes?
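The checks suggested above could be bundled into a small diagnostic helper; a sketch, assuming a stock systemd salt-master unit (unit name, user name, and the "salt" hostname convention are all assumptions):

```shell
# Collect the configured service user, the live process owners, recent
# journal errors, and whether a "salt" hostname resolves.
check_salt_master() {
  # empty output here means the unit runs as root
  systemctl show salt-master --property=User --value
  # user(s) the live processes actually run under
  ps -o user= -C salt-master | sort -u
  # recent errors from the service journal
  journalctl -u salt-master --since "-1h" | grep ERROR | tail
  # does the conventional "salt" master hostname resolve?
  getent hosts salt || echo "no 'salt' hostname configured"
}
```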
Updated by livdywan about 2 years ago
- Subject changed from salt responses timing out some of the time to salt responses timing out some of the time size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to High
Updated by nicksinger about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
This is what I saw when it "hangs":
openqa:~ # salt -l debug 'backup.qa.suse.de' cmd.run 'uptime'
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] Using cached minion ID from /etc/salt/minion_id: openqa.suse.de
[DEBUG ] Missing configuration file: /root/.saltrc
[DEBUG ] Using importlib_metadata to load entry points
[DEBUG ] Override __grains__: <module 'salt.loaded.int.log_handlers.sentry_mod' from '/usr/lib/python3.6/site-packages/salt/log/handlers/sentry_mod.py'>
[DEBUG ] Configuration file path: /etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] Using cached minion ID from /etc/salt/minion_id: openqa.suse.de
[DEBUG ] Missing configuration file: /root/.saltrc
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Using importlib_metadata to load entry points
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] SaltReqTimeoutError, retrying. (1/3)
^C
Exiting gracefully on Ctrl-c
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Closing IPCMessageSubscriber instance
^C
Exiting gracefully on Ctrl-c
Exception ignored in: <module 'threading' from '/usr/lib64/python3.6/threading.py'>
Traceback (most recent call last):
File "/usr/lib64/python3.6/threading.py", line 1294, in _shutdown
t.join()
File "/usr/lib64/python3.6/threading.py", line 1056, in join
self._wait_for_tstate_lock()
File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
File "/usr/lib/python3.6/site-packages/salt/scripts.py", line 61, in _handle_signals
sys.exit(salt.defaults.exitcodes.EX_OK)
SystemExit: 0
There was a salt process hanging as "[defunct]" in "ps aux", so I just tried to restart the salt-master on OSD, which apparently resolved the situation for now. Afterwards I did a full chown salt:salt /var/cache/salt/master/jobs/ to get rid of the journal entries.
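The manual workaround could be captured for the next occurrence; a sketch, assuming the job cache should belong to the salt user and that "a full chown" means recursive (-R is an assumption):

```shell
# Restart the stuck master, then re-own the job cache so the
# "Permission denied" journal entries stop.
restart_and_fix_cache() {
  systemctl restart salt-master
  chown -R salt:salt /var/cache/salt/master/jobs/
}
```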
Updated by openqa_review about 2 years ago
- Due date set to 2022-09-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 2 years ago
Discussed in the unblock. The localhost connection to the zeromq broker provided by salt is getting stuck. An upstream bug should be reported, since it looks like it could be a bug in salt. For now we've not seen immediate issues and we can work around it by restarting salt.
Updated by nicksinger about 2 years ago
- Status changed from In Progress to Closed
In the weekly unblock we discussed that a bug report might be hard to write without actually being able to reproduce the problem (we don't see it any more). We keep this ticket as a reference in case it happens again in the future.