action #58956
closed
salt minions on arm workers sometimes do not respond
Description
Observation
We noticed that our arm workers sometimes do not respond (in time) to salt commands.
Problem
I could not see a problem right now with
for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*' test.ping || break; done
but later, during
sudo salt -l error --state-output=changes -C 'G@roles:worker' state.apply,cmd.run ,"systemctl status openqa-worker.target"
the three arm workers timed out.
Updated by okurz over 5 years ago
- Assignee set to okurz
- Target version set to Current Sprint
Trying to reproduce with
for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,test.ping openqa.ntp, || break; done
which does not reproduce it. Neither does
for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' state.apply,state.apply,test.ping salt.fix,salt.minion, || break; done
but
for i in {1..100}; do echo "TRY: $i" && sudo salt -l error --state-output=changes '*arm*3*' test.ping,state.apply,test.ping || break; done
immediately shows a problem.
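The reproduction loops above share one pattern: run a salt call up to 100 times and stop at the first failure. A minimal sketch of that pattern as a reusable shell function could look like this. `repeat_until_failure` is a hypothetical name, not part of salt, and the sketch assumes a POSIX-compatible shell.

```shell
# Hypothetical helper (not part of salt): run a command up to N times
# and stop at the first failing attempt.
# Note: uses the global variables "max" and "i" to stay POSIX-compatible.
repeat_until_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "TRY: $i"
        # Stop immediately on failure, like "|| break" in the loops above
        "$@" || return 1
        i=$((i + 1))
    done
    return 0
}
```

With this, the failing case would be started as `repeat_until_failure 100 sudo salt -l error --state-output=changes '*arm*3*' test.ping,state.apply,test.ping`; the last printed "TRY" number is the attempt that failed.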
On the machine, /var/log/salt/minion shows:
2019-10-31 16:38:42,580 [salt.minion :1899][WARNING ][43915] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:44,213 [salt.minion :1899][WARNING ][43920] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:45,760 [salt.minion :1899][WARNING ][43925] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:47,411 [salt.minion :1899][WARNING ][43930] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:38:49,003 [salt.minion :1899][WARNING ][43935] The minion function caused an exception: expected str, bytes or os.PathLike object, not list
2019-10-31 16:39:06,981 [salt.state :1011][ERROR ][43946] Error encountered during module reload. Modules were not reloaded.
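To check how often this warning shows up during a reproduction run, the matching log lines can be counted. This is only a sketch: `count_minion_exceptions` is a hypothetical helper name, and the default path is the log file mentioned above.

```shell
# Hypothetical helper: count the "minion function caused an exception"
# warnings in a salt minion log file.
count_minion_exceptions() {
    # Default to the log path mentioned above; pass another file as $1
    logfile=${1:-/var/log/salt/minion}
    # grep -c prints the number of matching lines
    # (it prints 0 and exits non-zero when there is no match)
    grep -c 'The minion function caused an exception' "$logfile"
}
```

Reading the log usually requires root, so a call on the worker would likely be `sudo grep ...` directly or the function run from a root shell.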
Updated by okurz over 5 years ago
- Status changed from New to Feedback
I attached to the minion process(es) on arm3 with strace and could see that, while the connection from master to minion times out, the salt minion is simply still busy applying the state. Calling salt with the parameter -t 180 for a longer timeout should help here.
https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/217
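As an illustration of the longer timeout (not necessarily what the merge request does), a small wrapper could pass -t to every salt call. `salt_with_timeout`, `SALT_TIMEOUT` and `DRY_RUN` are hypothetical names introduced for this sketch; the other options are the ones used in the commands above.

```shell
# Hypothetical wrapper: call salt with a longer CLI timeout.
# SALT_TIMEOUT (seconds) defaults to the 180 suggested above.
# Set DRY_RUN to any non-empty value to print the command instead of running it.
salt_with_timeout() {
    # -t is how long the salt CLI waits for minion replies, in seconds
    ${DRY_RUN:+echo} sudo salt -t "${SALT_TIMEOUT:-180}" -l error --state-output=changes "$@"
}
```

For example, `salt_with_timeout -C 'G@roles:worker' state.apply` would give slow minions up to three minutes to answer before the CLI reports a timeout.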
Updated by coolo over 5 years ago
Our high state is just very slow to apply because it relies on refreshing remote zypp repos. Perhaps we should stop that and make refresh an explicit action when deploying?
Updated by coolo over 5 years ago
... and this is a problem as it's applied every hour. So if you're unlucky enough to want something from the arm workers during that time, you're lost.
Updated by okurz over 5 years ago
Refreshing the zypper repos is unfortunately very often quite slow. However, I thought this was already fixed with https://gitlab.suse.de/openqa/salt-states-openqa/commit/338cd4b9c4c6c36d35aa849dfe441bf0c2a39886 . It shouldn't really be a problem that an action triggered by salt is slow; the problem is only that the communication between minion and server is impacted. What exactly is "applied every hour"?