action #136325
closed
salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org
Added by okurz over 1 year ago.
Updated over 1 year ago.
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1848768#L9651
ERROR: Minions returned with non-zero exit code
sapworker2.qe.nue2.suse.org:
Minion did not return. [Not connected]
sapworker3.qe.nue2.suse.org:
Minion did not return. [Not connected]
worker-arm2.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker-arm1.oqa.prg2.suse.org:
Minion did not return. [Not connected]
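Before removing any keys, the set of unreachable minions can be double-checked directly on the salt master. A minimal sketch, assuming shell access on OSD:
# list all minions that did not respond to a ping from the master
sudo salt-run manage.down
# or ping everything explicitly
sudo salt \* test.ping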
Rollback actions
- Add back to salt: sapworker2.qe.nue2.suse.org, sapworker3.qe.nue2.suse.org, worker-arm1.oqa.prg2.suse.org, worker-arm2.oqa.prg2.suse.org
for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm1.oqa.prg2.suse.org worker-arm2.oqa.prg2.suse.org ; do sudo salt-key -y -a $i; done && sudo salt \* state.apply
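Before re-applying the high state one can verify that the keys were accepted and the minions reconnected, e.g. with this sketch (same four hostnames assumed):
sudo salt-key -L   # the four hosts should be listed under "Accepted Keys"
sudo salt -L sapworker2.qe.nue2.suse.org,sapworker3.qe.nue2.suse.org,worker-arm1.oqa.prg2.suse.org,worker-arm2.oqa.prg2.suse.org test.ping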
- Tags changed from infra, salt, deploy to infra, salt, deploy, alert
- Subject changed from salt deploy failes due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org to salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to nicksinger
All four machines run salt-minion-3005 and show it as "defunct" in the process table. I suggest we actually go back to salt-minion-3004 on all machines.
for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm2.oqa.prg2.suse.org worker-arm1.oqa.prg2.suse.org; do echo "### $i" && ssh $i "arch=\$(uname -m); sudo zypper rl salt* salt salt-minion salt-bash-completion python3-salt && sudo zypper -n rm salt-bash-completion && sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/\$arch/salt-3004-150400.8.25.1.\$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/\$arch/salt-minion-3004-150400.8.25.1.\$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/\$arch/python3-salt-3004-150400.8.25.1.\$arch.rpm && sudo zypper al --comment \"poo#131249 - potential salt regression, unresponsive salt-minion\" salt salt-minion salt-bash-completion python3-salt"; done
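Whether a host is affected shows up in the process table as a zombie salt-minion process. A minimal per-host check, following the loop style above (a hypothetical helper, not from the ticket):
# defunct (zombie) processes show state "Z" and "<defunct>" in the cmd column
for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm2.oqa.prg2.suse.org worker-arm1.oqa.prg2.suse.org; do echo "### $i" && ssh $i "ps -eo pid,stat,cmd | grep -v grep | grep salt-minion"; done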
And then add back the salt keys of all 4 machines.
- Related to action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M added
- Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
- Status changed from In Progress to Resolved
- Assignee changed from nicksinger to okurz
I added back the salt keys and successfully applied the salt high state multiple times. This should suffice.
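For completeness, re-applying and verifying the high state from the master could look like this (a sketch; the exact invocations used are not recorded in the ticket):
# dry run first to see pending changes without applying them
sudo salt \* state.apply test=True
# then apply for real
sudo salt \* state.apply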