action #58956

closed

salt minions on arm workers sometimes do not respond

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2019-10-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

We realized that sometimes our arm workers do not respond (in time) to salt commands.

Problem

I could not see a problem right now with

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*' test.ping || break; done

but later, during a

sudo salt -l error --state-output=changes -C 'G@roles:worker' state.apply,cmd.run ,"systemctl status openqa-worker.target"

the three arm workers timed out.

Actions #1

Updated by okurz over 4 years ago

  • Assignee set to okurz
  • Target version set to Current Sprint

Trying to reproduce with

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,test.ping openqa.ntp, || break; done

which does not reproduce it. Neither does

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,state.apply,test.ping salt.fix,salt.minion, || break; done

but

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' test.ping,state.apply,test.ping || break; done

immediately shows a problem.

On the machine, /var/log/salt/minion shows:

2019-10-31 16:38:42,580 [salt.minion      :1899][WARNING ][43915] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:44,213 [salt.minion      :1899][WARNING ][43920] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:45,760 [salt.minion      :1899][WARNING ][43925] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:47,411 [salt.minion      :1899][WARNING ][43930] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:49,003 [salt.minion      :1899][WARNING ][43935] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:39:06,981 [salt.state       :1011][ERROR   ][43946] Error encountered during module reload. Modules were not reloaded.
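To gauge how often a minion hits this exception, the warnings can be counted straight from the log. A minimal sketch: the function name count_minion_exceptions is made up for illustration, and the default path is simply the one quoted above.

```shell
# Hypothetical helper: count "minion function caused an exception" warnings
# in a salt minion log. The default path is the one mentioned in this ticket;
# pass another file as the first argument.
count_minion_exceptions() {
    log="${1:-/var/log/salt/minion}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'The minion function caused an exception' "$log"
}
```

On a worker this would be called as count_minion_exceptions (reading the log may require root).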

Actions #2

Updated by okurz over 4 years ago

  • Status changed from New to Feedback

I attached strace to the minion process(es) on arm3 and could see that, while the connection from master to minion times out, the salt minion is simply still busy applying the state. Calling salt with the parameter -t 180 for a longer timeout should help here.

https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/217
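A minimal sketch of what a call with a longer timeout could look like, reusing the test.ping,state.apply,test.ping sequence that showed the problem. The helper name salt_apply_with_timeout and the SALT override variable are hypothetical and are not taken from the merge request; only -t (timeout in seconds) is salt's own option.

```shell
# Hypothetical helper: run test.ping,state.apply,test.ping against a target
# with an explicit master-side timeout (default 180 seconds, as suggested above).
# SALT is an assumed override so the command line can be previewed without a
# salt master, e.g. SALT=echo.
salt_apply_with_timeout() {
    target="$1"
    timeout="${2:-180}"
    # Left unquoted on purpose so that "sudo salt" splits into two words.
    ${SALT:-sudo salt} -t "$timeout" -l error --state-output=changes \
        "$target" test.ping,state.apply,test.ping
}
```

For example, salt_apply_with_timeout '*arm*3*' would wait up to 180 seconds for the minions to answer, and salt_apply_with_timeout '*arm*3*' 300 raises that to five minutes.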

Actions #3

Updated by coolo over 4 years ago

Our high state is just very slow to apply because it relies on refreshing remote zypp repos. Perhaps we should stop that and make the refresh an explicit action when deploying?

Actions #4

Updated by coolo over 4 years ago

... and this is a problem as it's applied every hour. So if you're unlucky enough to want something from the arm workers during that time, you're lost.

Actions #5

Updated by okurz over 4 years ago

Refreshing the zypper repos is unfortunately often quite slow; however, I thought this was already fixed with https://gitlab.suse.de/openqa/salt-states-openqa/commit/338cd4b9c4c6c36d35aa849dfe441bf0c2a39886 . It shouldn't really be a problem that an action triggered by salt is slow, only that the communication between minion and server is impacted. What exactly is "applied every hour"?

Actions #6

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved