action #159303
[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
Description
See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604
ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
Minion did not return. [Not connected]
Suggestions
- DONE Retry the pipeline, as the failure could be temporary. So far it seems to persist for at least a couple of hours
- Ensure that after the machine is up again in #159270 the deployment is retried and works fine
- Wait for #157753 to have complete automation
- This is related to #157753. openqaworker-arm-1 is expected to be unresponsive from time to time, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can prevent osd-deployment from getting stuck because of this one host, which is expected to be broken from time to time. Let's assume that our automatic recovery is good enough and always recovers the machine soon enough
- Teach livdywan that it was just one machine failing, not all the listed ones
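To illustrate the last point: in the output pasted above, every minion is listed, but only the one followed by "Minion did not return" actually failed. A minimal Python sketch for picking out the failing hosts from such output could look like this. The function name `unresponsive_minions` is hypothetical and not part of osd-deployment, and the parsing assumes the output keeps the shape shown in the description (a minion ID ending in a colon, optionally followed by a message line).

```python
def unresponsive_minions(salt_output: str) -> list[str]:
    """Return the minion IDs that are followed by 'Minion did not return'."""
    failed = []
    current = None
    for raw in salt_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A minion header is a single token ending in ':' (assumption based
        # on the output pasted in this ticket).
        if line.endswith(":") and " " not in line:
            current = line[:-1]
        elif "Minion did not return" in line and current is not None:
            failed.append(current)
    return failed


sample = """ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]
"""
print(unresponsive_minions(sample))
```

Minions listed with no message after them are not reported, which matches the reading that only openqaworker-arm-1 was offline.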
Out of scope
- Recovering openqaworker-arm-1: #159270
Updated by okurz 13 days ago
- Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Updated by okurz 13 days ago
- Category set to Regressions/Crashes
This is related to #157753. openqaworker-arm-1 is expected to be unresponsive from time to time, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can prevent osd-deployment from getting stuck because of this one host, which is expected to be broken from time to time.
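One possible direction for the second question, sketched in Python purely as an illustration: treat hosts that are known to be flaky as tolerated, and only block the deployment when some other host is unresponsive. The names `TOLERATED_HOSTS`, `blocking_hosts` and `should_block_deployment` are hypothetical; this is not how osd-deployment currently decides, and whether tolerating a host is acceptable at all is exactly what would need to be discussed.

```python
# Hosts expected to be unresponsive from time to time (assumption for this
# sketch: only openqaworker-arm-1, as described in this ticket).
TOLERATED_HOSTS = frozenset({"openqaworker-arm-1.qe.nue2.suse.org"})


def blocking_hosts(unresponsive, tolerated=TOLERATED_HOSTS):
    """Return the unresponsive hosts that are not on the tolerated list."""
    return sorted(set(unresponsive) - set(tolerated))


def should_block_deployment(unresponsive, tolerated=TOLERATED_HOSTS):
    """Block only if at least one non-tolerated host is unresponsive."""
    return bool(blocking_hosts(unresponsive, tolerated))


# Only the known-flaky host is down: the deployment would proceed.
print(should_block_deployment(["openqaworker-arm-1.qe.nue2.suse.org"]))
# Another host is down as well: the deployment would still be blocked.
print(blocking_hosts(["openqaworker-arm-1.qe.nue2.suse.org",
                      "worker32.oqa.prg2.suse.org"]))
```

Keeping the tolerated list explicit and short means an unexpected outage on any other worker still stops the pipeline.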
Updated by okurz 13 days ago
- Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Updated by okurz 13 days ago
- Subject changed from [alert] osd-deployment pre-deploy pipeline fails Minion did not return with many workers not responding to [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 12 days ago
- Status changed from Workable to Blocked
- Assignee set to nicksinger
The recovery pipeline can be triggered and runs as expected: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513420
In Grafana, the corresponding "Contact point" for "Trigger reboot of openqaworker-arm-1" is still present but labeled as "Unused". The deployment was successful after we recovered arm-1 manually: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1088854
Basically, this has to wait until #157753 is done before we can continue.
Updated by nicksinger 12 days ago
- Priority changed from High to Normal
Also, as the immediate problem was mitigated, I think we can lower the priority.