action #58956

salt minions on arm workers sometimes do not respond

Added by okurz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Start date: 2019-10-31
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Observation

We realized that our arm workers sometimes do not respond (in time) to salt commands.

Problem

I could not see a problem right now with

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*' test.ping || break; done

but later, during a

sudo salt -l error --state-output=changes -C 'G@roles:worker' state.apply,cmd.run ,"systemctl status openqa-worker.target"

the three arm workers timed out.

History

#1 Updated by okurz 8 months ago

  • Assignee set to okurz
  • Target version set to Current Sprint

Trying to reproduce with

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,test.ping openqa.ntp, || break; done

which does not reproduce it. Neither does

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,state.apply,test.ping salt.fix,salt.minion, || break; done

but

for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' test.ping,state.apply,test.ping || break; done

immediately shows a problem.
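The three loops above differ only in the salt call, so the retry part can be factored out. A minimal sketch; the helper name repeat_until_failure is hypothetical and not part of any salt tooling:

```shell
# Hypothetical helper: run a command up to N times and stop at the first failure.
# Usage: repeat_until_failure N command [args...]
repeat_until_failure() {
  local max=$1 i
  shift
  for i in $(seq 1 "$max"); do
    echo "TRY: $i"
    # "$@" is the command to repeat, passed through with its quoting intact.
    "$@" || { echo "failed on try $i" >&2; return 1; }
  done
}
```

The failing case would then read: repeat_until_failure 100 sudo salt -l error --state-output=changes '*arm*3*' test.ping,state.apply,test.ping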

On the machine, /var/log/salt/minion shows:

2019-10-31 16:38:42,580 [salt.minion      :1899][WARNING ][43915] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:44,213 [salt.minion      :1899][WARNING ][43920] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:45,760 [salt.minion      :1899][WARNING ][43925] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:47,411 [salt.minion      :1899][WARNING ][43930] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:49,003 [salt.minion      :1899][WARNING ][43935] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:39:06,981 [salt.state       :1011][ERROR   ][43946] Error encountered during module reload. Modules were not reloaded.
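To see how often each message occurs, the log can be condensed to one line per distinct message. A rough sketch, assuming every entry follows the layout of the lines above (DATE TIME [logger :line][LEVEL ][pid] message); the function name summarize_minion_log is made up for illustration:

```shell
# Hypothetical helper: count minion log entries per level and message.
# $1: log file to read (defaults to the path named in this ticket).
summarize_minion_log() {
  awk '{
      line = $0
      # Strip "DATE TIME [logger :line][" so the line starts at the level;
      # skip lines that do not have this prefix.
      if (sub(/^[^[]*\[[^]]*\]\[/, "", line) == 0) next
      level = line
      sub(/ *\].*/, "", level)                 # "WARNING ][pid] msg" becomes "WARNING"
      msg = line
      sub(/^[^]]*\]\[[0-9]+\] /, "", msg)      # drop "LEVEL ][pid] "
      count[level ": " msg]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }' \
    "${1:-/var/log/salt/minion}" | sort -rn
}
```

Called without arguments it reads /var/log/salt/minion, which may require root; the most frequent message is printed first.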

#2 Updated by okurz 8 months ago

  • Status changed from New to Feedback

I attached strace to the minion process(es) on arm3 and could see that, while the connection from master to minion times out, the salt minion is simply still busy applying the state. Calling salt with the parameter -t 180 for a longer timeout should help here.

https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/217
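For manual runs, the longer timeout could be wrapped like this. This is only a sketch of the -t idea from this comment, not the content of the merge request; the function name apply_with_timeout and the SALT override variable are assumptions made for illustration:

```shell
# Hypothetical wrapper: apply the state with a longer master-side timeout.
# $1: salt target, $2: timeout in seconds (default 180, as suggested above).
# SALT may be set to another command for a dry run; it defaults to "sudo salt".
apply_with_timeout() {
  local target=$1 timeout=${2:-180}
  # ${SALT:-sudo salt} is left unquoted on purpose so it splits into words.
  ${SALT:-sudo salt} -l error --state-output=changes -t "$timeout" "$target" state.apply
}
```

Example: apply_with_timeout '*arm*3*' waits up to 180 seconds for the minion to return, and apply_with_timeout '*arm*3*' 300 raises that to 300 seconds.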

#3 Updated by coolo 8 months ago

Our high state is just very slow to apply because it relies on refreshing remote zypp repos. Perhaps we should stop that and make the refresh an explicit action when deploying?

#4 Updated by coolo 8 months ago

... and this is a problem because it's applied every hour. So if you're unlucky enough to want something from the arm workers during that time, you're lost.

#5 Updated by okurz 8 months ago

Refreshing the zypper repos is unfortunately very often quite slow; however, I thought this was already fixed with https://gitlab.suse.de/openqa/salt-states-openqa/commit/338cd4b9c4c6c36d35aa849dfe441bf0c2a39886 . It shouldn't really be a problem that an action triggered by salt is slow, only that the communication between minion and server is impacted. What exactly is "applied every hour"?

#6 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved
