action #116113
Updated by livdywan over 2 years ago
## Observation

Sometimes salt works fine; other times some or all of the minions time out (when matching multiple machines):

```
sudo salt -vvv -C 'G@roles:worker' cmd.run 'uptime'
Executing job with jid 20220901080543238391
-------------------------------------------
openqaworker6.suse.de:
    10:05:43 up 4 days 6:27, 0 users, load average: 0.83, 1.56, 1.68
QA-Power8-5-kvm.qa.suse.de:
    10:05:43 up 4 days 6:30, 0 users, load average: 4.11, 4.20, 6.23
[ERROR ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
```

I considered whether this is related to the recently uninstalled packages covered by #115484, but everything seems to be installed and the versions match (mismatched packages could break the connection). Verbose output unfortunately doesn't seem to add much here. I also checked grafana but couldn't spot any obvious issues there.
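
To see which minions actually answered before the timeout, the CLI output can be compared against a list of expected workers. This is a minimal sketch, not an existing tool: it assumes salt's default output style (minion name on its own line ending in `:`, result indented below it), and `hypothetical-worker.example` is a made-up entry for illustration only.

```python
from __future__ import annotations

import re

# A minion header is a line without whitespace that ends in a colon,
# e.g. "openqaworker6.suse.de:". Indented result lines do not match.
MINION_HEADER = re.compile(r"^(\S+):\s*$")


def responding_minions(cli_output: str) -> list[str]:
    """Return the minion names that printed a result, in output order."""
    minions = []
    for line in cli_output.splitlines():
        match = MINION_HEADER.match(line)
        if match:
            minions.append(match.group(1))
    return minions


def missing_minions(expected: list[str], cli_output: str) -> list[str]:
    """Return expected minions that never showed up in the output."""
    seen = set(responding_minions(cli_output))
    return sorted(set(expected) - seen)


# Output from this ticket; the indentation is an assumption.
SAMPLE = """Executing job with jid 20220901080543238391
-------------------------------------------
openqaworker6.suse.de:
    10:05:43 up 4 days 6:27, 0 users, load average: 0.83, 1.56, 1.68
QA-Power8-5-kvm.qa.suse.de:
    10:05:43 up 4 days 6:30, 0 users, load average: 4.11, 4.20, 6.23
[ERROR ] Message timed out
"""

if __name__ == "__main__":
    expected = [
        "openqaworker6.suse.de",
        "QA-Power8-5-kvm.qa.suse.de",
        "hypothetical-worker.example",  # hypothetical, not a real host
    ]
    print("answered:", responding_minions(SAMPLE))
    print("missing:", missing_minions(expected, SAMPLE))
```

Running this against the output of several attempts could show whether the same minions are missing each time or whether the timeouts are random.
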
## Errors observed in the journal of salt-master

```
Sep 01 11:30:18 openqa salt-master[14829]: [ERROR ] Event iteration failed with exception: 'str' object has no attribute 'items'
```

And many occurrences of:

```
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p'
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid'
…
```

## Acceptance criteria

* **AC1:** No more message errors

## Suggestions

- Check the systemd journal for salt-master
- Investigate why we see permission errors
- salt-master should run as root, on osd it must be called via root
- Check why we don't always run the service as the `salt` user
- Restart services and monitor carefully how it behaves after that
- A manual restart (temporarily) resolved the issue
- There is no "salt" hostname, which salt expects for troubleshooting? Maybe this was a side effect of #115484
- Double-check recent config file changes?
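
For the permission-error investigation, a small script can summarize which job cache files the master fails to remove. This is only a sketch based on the journal lines quoted in this ticket: the regular expression is derived from those four lines, and the function name is made up for illustration.

```python
from __future__ import annotations

import re
from collections import Counter

# Pattern derived from the journal lines quoted in this ticket.
PERMISSION_ERROR = re.compile(
    r"Unable to remove (?P<path>\S+): "
    r"\[Errno 13\] Permission denied: '(?P<name>[^']+)'"
)


def summarize_permission_errors(journal_lines: list[str]) -> tuple[Counter, set[str]]:
    """Count failures per file name and collect the affected job directories."""
    by_file: Counter = Counter()
    job_dirs: set[str] = set()
    for line in journal_lines:
        match = PERMISSION_ERROR.search(line)
        if match:
            by_file[match.group("name")] += 1
            job_dirs.add(match.group("path"))
    return by_file, job_dirs


# Lines copied from the journal excerpt in this ticket.
SAMPLE_LINES = [
    "Sep 01 11:30:18 openqa salt-master[14829]: [ERROR ] Event iteration failed with exception: 'str' object has no attribute 'items'",
    "Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid'",
    "Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p'",
    "Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p'",
    "Sep 01 11:31:14 openqa salt-master[14815]: [ERROR ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid'",
]

if __name__ == "__main__":
    counts, dirs = summarize_permission_errors(SAMPLE_LINES)
    for name, count in counts.most_common():
        print(f"{name}: {count}")
    print(f"{len(dirs)} job directories affected")
```

The input could come from the salt-master journal (assuming the unit is named `salt-master`). Checking the owner of the listed job directories afterwards should show whether they were created by root or by the `salt` user.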