action #157441
closed
osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz)
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227
Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000
qesapworker-prg5.qa.suse.cz:
2184 Minion did not return. [Not connected]
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1
The worker seems to have hung. There was no login prompt on the serial tty.
Rebooted via IPMI.
Worker came up, but a systemd service failed: …
It seems the NVMe disk is no longer found. Maybe it died and the system subsequently froze.
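To pick the affected minion out of a long pipeline log, the "Minion did not return" marker can be matched and the preceding line printed. This is only a sketch: the helper name `unresponsive_minions` is hypothetical, and it assumes the minion name appears on the line directly before the marker, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: reads salt output on stdin and prints the name of
# every minion whose following line says "Minion did not return".
unresponsive_minions() {
    awk '
        /Minion did not return/ {
            # prev holds the minion line, e.g. "  host.example:".
            gsub(/^[[:space:]]+/, "", prev)   # drop leading indentation
            sub(/:[[:space:]]*$/, "", prev)   # drop the trailing colon
            print prev
        }
        { prev = $0 }
    '
}

# Example with made-up input resembling the pipeline output.
printf 'worker1.example:\n    Minion did not return. [Not connected]\n' | unresponsive_minions
```

The saved job log could be piped into the function in the same way.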
Acceptance criteria
- AC1: osd-deployment passed again
- AC2: qesapworker-prg5.qa.suse.cz back in production again
Suggestions
- DONE Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE Remove qesapworker-prg5.qa.suse.cz from production
ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"
- Retrigger failed osd deployment CI pipeline
- Investigate the specific issue on qesapworker-prg5.qa.suse.cz
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure qesapworker-prg5.qa.suse.cz is back in production
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
hostname=qesapworker-prg5.qa.suse.cz; ssh osd "sudo salt-key -a $hostname && sudo salt --state-output=changes $hostname state.apply"
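The double-quoted command string is expanded by the local shell before ssh runs, so the host name must already be set at that point. One way to make this explicit is a small function that takes the host name as an argument and only prints the remote command. This is a sketch: `rollback_cmd` is a hypothetical name, and the salt commands are copied unchanged from the rollback step above.

```shell
#!/bin/sh
# Hypothetical helper: prints the remote command for re-adding a minion.
# It only builds the string; nothing is executed on OSD.
rollback_cmd() {
    host="$1"
    printf 'sudo salt-key -a %s && sudo salt --state-output=changes %s state.apply\n' \
        "$host" "$host"
}

# Dry run: show what would be passed to "ssh osd".
rollback_cmd qesapworker-prg5.qa.suse.cz
```

Once the printed command looks right, it could be run with ssh osd "$(rollback_cmd qesapworker-prg5.qa.suse.cz)".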
Updated by okurz 8 months ago
- Related to action #157453: [FIRING:1] host_up (qesapworker-prg5: host up alert openQA qesapworker-prg5 host_up_alert_qesapworker-prg5 worker) added
Updated by okurz 8 months ago
# systemctl status openqa_nvme_format.service
. openqa_nvme_format.service - Setup NVMe before mounting it
Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-03-18 11:01:18 CET; 3min 30s ago
Process: 31734 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
Main PID: 31734 (code=exited, status=1/FAILURE)
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │ /boot/grub2/i386-pc
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │ /.snapshots
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │ /
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: └─sda3 8:3 0 1G 0 part [SWAP]
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31748]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Unable to create RAID, mdadm returned with non-zero code
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
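The mdadm error in the log above means that the by-id device path did not exist at that moment. A quick way to confirm this after a reboot is to test the path directly. The following is a minimal sketch: `devices_present` is a hypothetical helper and not part of openqa-establish-nvme-setup.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if every given device path exists.
# Reports the first missing path on stderr.
devices_present() {
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing: $dev" >&2
            return 1
        fi
    done
    return 0
}

# Example call with a path that exists on any Linux system.
devices_present /dev/null && echo "all devices present"
```

On the worker, the function would be called with the /dev/disk/by-id/... path from the log.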
Updated by okurz 8 months ago
- Description updated (diff)
- Status changed from In Progress to Resolved
I removed the machine from OSD salt keys and retriggered the OSD pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545
On qesapworker-prg5 I triggered a reboot. To check whether a complete power cycle helps, I ran ipmitool … power cycle and ipmitool … sol activate and saw a complete, successful reboot. It seems the power cycle helped, as the device is back. Putting the machine back into production. Meanwhile https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545 is good again. The salt state was applied cleanly, the openQA workers are running fine again, and the alerts are back to good as well.
Updated by livdywan about 2 months ago
- Copied to action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M added
Updated by okurz about 2 months ago
- Related to action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S added
Updated by okurz about 2 months ago
- Related to action #164907: [alert][FIRING:1] host_up (qesapworker-prg5: host up alert openQA, qesapworker-prg5-mgmt.qa.suse.cz not reachable, failing osd-deployment added