action #136325 (closed)

salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org

Added by okurz 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee: okurz
Category: -
Target version: -
Start date: 2023-09-22
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/1848768#L9651

ERROR: Minions returned with non-zero exit code
sapworker2.qe.nue2.suse.org:
    Minion did not return. [Not connected]
sapworker3.qe.nue2.suse.org:
    Minion did not return. [Not connected]
worker-arm2.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
worker-arm1.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
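
Before touching any keys, the master's own view of which minions are down can be checked with standard Salt commands (a sketch, not part of the original observation):

# list minions the master considers offline, without deleting keys
sudo salt-run manage.down
# ping everything with a short timeout; unreachable minions report "Minion did not return"
sudo salt --timeout=10 '*' test.ping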

Rollback actions

  • Add back to salt: sapworker2.qe.nue2.suse.org, sapworker3.qe.nue2.suse.org, worker-arm1.oqa.prg2.suse.org, worker-arm2.oqa.prg2.suse.org
for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm1.oqa.prg2.suse.org worker-arm2.oqa.prg2.suse.org ; do sudo salt-key -y -a $i; done && sudo salt \* state.apply
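
After re-accepting the keys, it is worth confirming that all four hosts appear under "Accepted Keys" and respond again (a sketch, not part of the original rollback notes):

sudo salt-key -L
sudo salt -L sapworker2.qe.nue2.suse.org,sapworker3.qe.nue2.suse.org,worker-arm1.oqa.prg2.suse.org,worker-arm2.oqa.prg2.suse.org test.ping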

Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M (Resolved, nicksinger, 2023-08-31 – 2023-09-23)

Related to openQA Infrastructure - action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M (Resolved, okurz, 2023-06-22)
Actions #1

Updated by okurz 7 months ago

  • Tags changed from infra, salt, deploy to infra, salt, deploy, alert
  • Subject changed from salt deploy failes due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org to salt deploy fails due to multiple offline workers in qe.nue2.suse.org+prg2.suse.org
Actions #2

Updated by okurz 7 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to nicksinger

Working on this with Nick. Removed workers from salt

for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm1.oqa.prg2.suse.org worker-arm2.oqa.prg2.suse.org ; do sudo salt-key -y -d $i; done

and retriggered osd-deployment https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/811810

EDIT: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/811810 looks good now.
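
Before retriggering, the key removal can be double-checked (a sketch, not recorded in this ticket); none of the four hosts should be listed anymore:

sudo salt-key -L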

Actions #3

Updated by okurz 7 months ago

All four machines run salt-minion-3005 and show it as "defunct" in the process table. I suggest we actually go back to salt-minion-3004 on all machines, using the downgrade command below.
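
The defunct processes can be confirmed per host beforehand (a hypothetical spot check; the ticket only states they appear in the process table):

for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm2.oqa.prg2.suse.org worker-arm1.oqa.prg2.suse.org; do echo "### $i" && ssh $i 'ps -eo pid,stat,cmd | grep "[s]alt-minion"'; done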

for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm2.oqa.prg2.suse.org worker-arm1.oqa.prg2.suse.org; do echo "### $i" && ssh $i "arch=\$(uname -m); sudo zypper rl salt\* salt salt-minion salt-bash-completion python3-salt && sudo zypper -n rm salt-bash-completion && sudo zypper -n in --oldpackage --allow-downgrade http://download.opensuse.org/update/leap/15.4/sle/\$arch/salt-3004-150400.8.25.1.\$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/\$arch/salt-minion-3004-150400.8.25.1.\$arch.rpm http://download.opensuse.org/update/leap/15.4/sle/\$arch/python3-salt-3004-150400.8.25.1.\$arch.rpm && sudo zypper al --comment 'poo#131249 - potential salt regression, unresponsive salt-minion' salt salt-minion salt-bash-completion python3-salt"; done
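
Afterwards, the pinned version and the package lock can be verified on each host (a sketch; zypper ll lists active locks):

for i in sapworker2.qe.nue2.suse.org sapworker3.qe.nue2.suse.org worker-arm2.oqa.prg2.suse.org worker-arm1.oqa.prg2.suse.org; do echo "### $i" && ssh $i "rpm -q salt salt-minion python3-salt && zypper ll"; done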

And then add back the salt keys of all 4 machines.

Actions #4

Updated by okurz 7 months ago

  • Related to action #134906: osd-deployment failed due to openqaworker1 showing "No response" in salt size:M added
Actions #5

Updated by okurz 7 months ago

  • Related to action #131249: [alert][ci][deployment] OSD deployment failed, grenache-1, worker5, worker2 salt-minion does not return, error message "No response" size:M added
Actions #6

Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved
  • Assignee changed from nicksinger to okurz

I added back the salt keys and successfully applied the salt high state multiple times. This should suffice.
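
For future deployments, the high state can also be dry-run first so problems surface before anything is changed (a sketch using Salt's standard test mode, not something recorded in this ticket):

sudo salt \* state.apply test=True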
