action #116113

closed

salt responses timing out some of the time size:M

Added by livdywan over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-08-31
Due date: 2022-09-16
% Done: 0%
Estimated time:

Description

Observation

Sometimes salt works fine; other times some or all of the minions time out (when matching multiple machines):

sudo salt -vvv -C 'G@roles:worker' cmd.run 'uptime'
Executing job with jid 20220901080543238391
-------------------------------------------

openqaworker6.suse.de:
     10:05:43  up 4 days  6:27,  0 users,  load average: 0.83, 1.56, 1.68
QA-Power8-5-kvm.qa.suse.de:
     10:05:43  up 4 days  6:30,  0 users,  load average: 4.11, 4.20, 6.23
[ERROR   ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
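
The `--async` workaround suggested by the error message could be sketched roughly as below. This is a minimal sketch, not something that was run as part of this ticket: `extract_jid` and `run_async_uptime` are hypothetical helper names, and the "Executed command with job ID: <jid>" line is an assumption about what `salt --async` prints, so adjust the pattern if the output differs.

```shell
#!/bin/sh
# Extract the job id from the output of `salt --async`.
# ASSUMPTION: the CLI prints a line of the form
#   Executed command with job ID: <jid>
extract_jid() {
  sed -n 's/^Executed command with job ID: \([0-9][0-9]*\)$/\1/p'
}

# Hypothetical wrapper: submit the job without listening on the event bus,
# wait a little, then read the results from the job cache.
# $1 = seconds to wait before the lookup (default: 10)
run_async_uptime() {
  jid="$(sudo salt --async -C 'G@roles:worker' cmd.run 'uptime' | extract_jid)"
  [ -n "$jid" ] || return 1
  sleep "${1:-10}"
  sudo salt-run jobs.lookup_jid "$jid"
}
```

After sourcing the snippet on the master, `run_async_uptime 30` would submit the job and look up the cached results 30 seconds later. This only sidesteps the timeout on the CLI side; it does not explain why the master fails to respond.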

I considered whether packages had been uninstalled again, as we recently saw in #115484, but everything seems to be installed and the versions match (mismatching packages could break the connection).

Verbose output unfortunately doesn't seem to add much here.

I also checked Grafana but couldn't spot any obvious issues there.

Errors observed in the journal of salt-master

Sep 01 11:30:18 openqa salt-master[14829]: [ERROR   ] Event iteration failed with exception: 'str' object has no attribute 'items'

And many occurrences of:

Sep 01 11:31:14 openqa salt-master[14815]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/31/32bb5ea181d9b2d49c5a42f08b0e8b8220d53256288b372d63c2891a7ba7df: [Errno 13] Permission denied: 'jid' 
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/31/294d7552c90320e36d749edc5ff9947f9d89faf9a9e1fe8cd0ba40a176f3fb: [Errno 13] Permission denied: 'out.p' 
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/7a/ed0cbedb5f7404fc91f5b1e8e96fb8101551c434584b9a1b8080a32b7b32f5: [Errno 13] Permission denied: 'out.p' 
Sep 01 11:31:14 openqa salt-master[14815]: [ERROR   ] Unable to remove /var/cache/salt/master/jobs/97/3aed50d2e575d40333ac5556637b8c78ebc45521aee86cbfe63307e1d5cd08: [Errno 13] Permission denied: 'jid'
…
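
To get a feel for how widespread the removal failures are, the journal lines could be tallied by the name of the file that could not be removed. This is a rough sketch that only assumes the line format shown above; `count_unremovable` is a hypothetical helper name.

```shell
#!/bin/sh
# Read journal lines on stdin and count "Unable to remove" errors
# per file name reported after "Permission denied:".
# Prints one "<name>: <count>" line per distinct name, sorted by name.
count_unremovable() {
  sed -n "s/.*Unable to remove .*Permission denied: '\([^']*\)'.*/\1/p" \
    | sort \
    | uniq -c \
    | awk '{ print $2 ": " $1 }'
}
```

It could be fed from the journal with `sudo journalctl -u salt-master --since today | count_unremovable`. If only names like `jid` and `out.p` show up, that would point at the job cache cleanup rather than at the event handling.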

Acceptance criteria

  • AC1: No more "Message timed out" errors

Suggestions

  • Check the systemd journal for salt-master
    • Investigate why we see permission errors: salt-master should run as root, and on osd it must be called via root
    • Check why we don't always run the service as the salt user
      • Restart services and monitor carefully how it behaves after that
      • A manual restart (temporarily) resolved the issue
      • There is no "salt" hostname, which salt might expect. Maybe this was a side effect of #115484
      • Double-check recent config file changes
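
The permission-related suggestions above could be checked with something like the following. It is a sketch under assumptions: it relies on GNU `stat` and procps `ps`, the helper names are made up for illustration, and the job cache path is the one from the journal errors.

```shell
#!/bin/sh
# Succeed (exit 0) when PATH ($1) is owned by EXPECTED_USER ($2).
# Uses GNU stat; '%U' prints the owner's user name.
owned_by() {
  [ "$(stat -c '%U' "$1")" = "$2" ]
}

# Print every entry below DIR ($1) that is NOT owned by USER ($2).
# USER must be an existing account, otherwise find reports an error.
list_foreign_owned() {
  find "$1" ! -user "$2" -print
}

# Print the distinct users that running salt-master processes belong to.
master_users() {
  ps -o user= -C salt-master | sort -u
}
```

For example, `master_users` shows whether the service currently runs as root or as the salt user, and `sudo find`-style output from `list_foreign_owned /var/cache/salt/master/jobs salt` would list job cache entries created under a different account. A mix of owners would fit the theory that the master was started as different users at different times.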