action #157441
osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz)
Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-18
Due date:
% Done:
0%
Estimated time:
Tags:
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227
Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000
qesapworker-prg5.qa.suse.cz:
2184 Minion did not return. [Not connected]
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1
The worker appeared to have hung: there was no login prompt on the serial tty.
Rebooted via IPMI.
Worker came up, but a systemd service failed: …
It seems the NVMe disk is no longer detected. It may have died, after which the system froze.
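A minimal sketch of how the "disk no longer detected" observation could be checked on the worker. The helper name `nvme_visible` is hypothetical and not part of this ticket; it assumes a Linux host where NVMe namespaces appear as `nvme*` entries under `/sys/block`.

```shell
# Hypothetical helper: report whether any NVMe block device is visible.
# The sysfs directory is a parameter so the check can also be pointed
# at a test directory; /sys/block is the usual location on Linux.
nvme_visible() {
    sysfs_dir="${1:-/sys/block}"
    for dev in "$sysfs_dir"/nvme*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        if [ -e "$dev" ]; then
            echo "found: $(basename "$dev")"
            return 0
        fi
    done
    echo "no NVMe block device under $sysfs_dir"
    return 1
}
```

Run without an argument on the worker, `nvme_visible` checks `/sys/block`; `systemctl --failed` then lists the units that did not start.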
Acceptance criteria
- AC1: osd-deployment passed again
- AC2: qesapworker-prg5.qa.suse.cz back in production again
Suggestions
- DONE Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE Remove qesapworker-prg5.qa.suse.cz from production
ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"
- Retrigger failed osd deployment CI pipeline
- Investigate the specific issue on qesapworker-prg5.qa.suse.cz
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure qesapworker-prg5.qa.suse.cz is back in production
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
hostname=qesapworker-prg5.qa.suse.cz; ssh osd "sudo salt-key -a $hostname && sudo salt --state-output=changes $hostname state.apply"
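One way to avoid running the rollback with an empty hostname is to build the command in a small helper first and inspect it before executing. This is only a sketch: `rollback_cmd` is a hypothetical name, and the helper prints the command rather than running it.

```shell
# Hypothetical dry-run helper: print the rollback command for a worker
# instead of running it, and refuse an empty hostname.
rollback_cmd() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: rollback_cmd <worker-fqdn>" >&2
        return 1
    fi
    # Same commands as the rollback step above, with the hostname
    # substituted before the string is handed to ssh.
    printf 'ssh osd "sudo salt-key -a %s && sudo salt --state-output=changes %s state.apply"\n' "$host" "$host"
}
```

For example, `rollback_cmd qesapworker-prg5.qa.suse.cz` prints the full `ssh osd ...` line, which can be reviewed and then pasted into the shell.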