action #116113

Updated by livdywan over 2 years ago

## Observation 
 Sometimes salt works fine, but at other times some or all of the minions time out (when matching multiple machines): 

 ``` 
 sudo salt -vvv -C 'G@roles:worker' cmd.run 'uptime' 
 Executing job with jid 20220901080543238391 
 ------------------------------------------- 

 openqaworker6.suse.de: 
      10:05:43    up 4 days    6:27,    0 users,    load average: 0.83, 1.56, 1.68 
 QA-Power8-5-kvm.qa.suse.de: 
      10:05:43    up 4 days    6:30,    0 users,    load average: 4.11, 4.20, 6.23 
 [ERROR     ] Message timed out 
 Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later. 
 ``` 

 I considered that packages might have been uninstalled again, as we recently saw in #115484, but everything seems to be installed and the versions match (mismatching packages could break the connection). 

 Verbose output unfortunately doesn't seem to add much here. 

 I also checked grafana but couldn't spot any obvious issues there. 

 ## Errors observed in the journal of salt-master 
 ``` 
 Sep 01 11:30:18 openqa salt-master[14829]: [ERROR     ] Event iteration failed with exception: 'str' object has no attribute 'items' 
 ``` 
 And many occurrences of: 
 ``` 
 Sep 01 11:31:14 openqa salt-master[14815]: [ERROR     ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid'  
 Sep 01 11:31:14 openqa salt-master[14815]: [ERROR     ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p'  
 Sep 01 11:31:14 openqa salt-master[14815]: [ERROR     ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p'  
 Sep 01 11:31:14 openqa salt-master[14815]: [ERROR     ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid' 
 … 
 ``` 
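
To get an overview of how widespread the removal errors are, the journal lines can be grouped by the file name that failed. The following is a minimal sketch: the regular expression is inferred only from the sample lines quoted above, not from salt's source, and `summarize_remove_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the "Unable to remove" samples in this ticket
# (assumption: other salt versions may format the message differently).
PATTERN = re.compile(
    r"Unable to remove (?P<path>\S+): \[Errno (?P<errno>\d+)\] "
    r"(?P<reason>[^:]+): '(?P<name>[^']+)'"
)


def summarize_remove_errors(lines):
    """Count failures per file name and collect the affected job directories."""
    by_name = Counter()
    job_dirs = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            by_name[match["name"]] += 1
            job_dirs.add(match["path"])
    return by_name, job_dirs
```

The input could be, for example, the lines of `journalctl -u salt-master` output saved to a file. The returned set of job directories can then be inspected for ownership and permissions.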

 ## Acceptance criteria 
 * **AC1:** No more "Message timed out" errors 

 ## Suggestions 
 - Check the systemd journal for salt-master 
   - Investigate why we see permission errors - salt-master should run as root, on osd it must be called via root 
   - Check why we don't always run the service as the `salt` user 
      - Restart services and carefully monitor how they behave after that 
        - A manual restart (temporarily) resolved the issue 
      - There is no "salt" hostname, which salt expects? Maybe this was a side effect of #115484 
      - Double-check recent config file changes? 
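
Two of the suggestions above can be checked with a short script: whether entries in the job cache are owned by a user other than the one the service is expected to run as, and whether the "salt" hostname resolves. This is only a sketch under assumptions: ownership mismatch is one plausible cause of the permission errors, not a confirmed one, and `find_foreign_owned` and `hostname_resolves` are hypothetical helper names.

```python
import os
import socket


def find_foreign_owned(cache_dir, expected_uid):
    """Return paths under cache_dir whose owner differs from expected_uid."""
    foreign = []
    for root, dirs, files in os.walk(cache_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != expected_uid:
                foreign.append(path)
    return sorted(foreign)


def hostname_resolves(hostname):
    """Return True if the local resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

On the master this could be called as `find_foreign_owned("/var/cache/salt/master/jobs", pwd.getpwnam("salt").pw_uid)`, assuming the service is meant to run as the `salt` user, and `hostname_resolves("salt")`. Directories are included in the result because removing a file requires write permission on its parent directory.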
