action #116113
salt responses timing out some of the time size:M
Status: Closed
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-08-31
Due date: 2022-09-16
% Done: 0%
Estimated time:
Description
Observation
Sometimes salt works fine, other times some or all of the minions time out (when matching multiple machines):
sudo salt -vvv -C 'G@roles:worker' cmd.run 'uptime'
Executing job with jid 20220901080543238391
-------------------------------------------
openqaworker6.suse.de:
10:05:43 up 4 days 6:27, 0 users, load average: 0.83, 1.56, 1.68
QA-Power8-5-kvm.qa.suse.de:
10:05:43 up 4 days 6:30, 0 users, load average: 4.11, 4.20, 6.23
[ERROR ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
I considered whether this is related to the packages we recently found uninstalled, which was covered by #115484, but everything appears to be installed and the versions match (mismatching packages could break the connection).
Verbose output unfortunately doesn't seem to add much here.
I also checked grafana but couldn't spot any obvious issues there.
Errors observed in the journal of salt-master
Sep 01 11:30:18 openqa salt-master[14829]: [ERROR ] Event iteration failed with exception: 'str' object has no attribute 'items'
And many occurrences of:
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid'
…
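To see which files the permission errors concentrate on, the journal lines can be tallied. The following is a minimal sketch, assuming the log format matches the excerpt above; the function name `tally_remove_errors` and the regular expression are illustrative and not part of Salt.

```python
import re
from collections import Counter

# Matches the "Unable to remove" lines as they appear in the excerpt above.
# The pattern and its group names are assumptions made for this sketch.
UNABLE_TO_REMOVE = re.compile(
    r"\[ERROR\s*\] Unable to remove (?P<path>\S+): "
    r"\[Errno (?P<errno>\d+)\] (?P<reason>[^:]+): '(?P<name>[^']+)'"
)


def tally_remove_errors(lines):
    """Count 'Unable to remove' errors per (errno, file name) pair."""
    counts = Counter()
    for line in lines:
        match = UNABLE_TO_REMOVE.search(line)
        if match:
            counts[(int(match["errno"]), match["name"])] += 1
    return counts
```

The function accepts any iterable of lines, so it could be fed `sys.stdin` with the output of `journalctl -u salt-master` piped in, assuming the unit is named `salt-master` as the journal excerpt suggests.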
Acceptance criteria
- AC1: No more "Message timed out" errors
Suggestions
- Check the systemd journal for salt-master
- Investigate why we see permission errors: salt-master should run as root, and on osd it must be called via root
- Check why we don't always run the service as the salt user
- Restart services and monitor carefully how it behaves after that
- A manual restart (temporarily) resolved the issue
- There is no "salt" hostname, which salt expects? Maybe this was a side-effect of #115484
- Double-check recent config file changes?
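One way to follow up on the permission errors is to list job cache entries whose owner differs from the user the service runs as. This is a minimal sketch under the assumption that mismatched ownership is behind the `[Errno 13]` messages, which the ticket has not confirmed; `find_foreign_owned` is a hypothetical helper, not a Salt function.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return sorted paths under root whose owner is not expected_uid."""
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports on a symlink itself rather than on its target.
            if os.lstat(path).st_uid != expected_uid:
                mismatches.append(path)
    return sorted(mismatches)
```

On the master this could be called with `/var/cache/salt/master/jobs` (the directory named in the journal errors) and either `0` for root or `pwd.getpwnam("salt").pw_uid` for the salt user, depending on which account the service is meant to run as.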