action #167164
osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M
Status: closed
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125109
qesapworker-prg5.qa.suse.cz:
Minion did not return. [Not connected]
Acceptance criteria
- AC1: osd-deployment passed again
- AC2: qesapworker-prg5.qa.suse.cz responsive and able to process jobs
Suggestions
- Take machine out of production (see the command sketch after this list): https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- Remove qesapworker-prg5.qa.suse.cz from production
ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"
- Retrigger failed osd deployment CI pipeline
- Investigate the specific issue on qesapworker-prg5.qa.suse.cz
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure qesapworker-prg5.qa.suse.cz is usable in production
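A minimal sketch of the suggested removal/re-add flow, assuming the wiki procedure boils down to deleting the minion key and re-accepting it once the machine is healthy again (verify against the wiki before running; the re-accept step assumes the minion resubmits its key after it is back):
# remove the minion key so the deployment no longer waits for the broken worker
ssh osd "sudo salt-key -y -d qesapworker-prg5.qa.suse.cz"
# once the machine is fixed: accept the key again and verify connectivity
ssh osd "sudo salt-key -y -a qesapworker-prg5.qa.suse.cz"
ssh osd "sudo salt qesapworker-prg5.qa.suse.cz test.ping"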
Updated by livdywan 3 months ago
- Copied from action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) added
Updated by ybonatakis 3 months ago
- Status changed from New to In Progress
Steps to bring it up:
- ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <> -P *** power reset
I used the following commands to verify the machine is pingable (see the consolidated sketch after this list):
- iob@openqa:~> sudo salt qesapworker-prg5.qa.suse.cz test.ping
- ping -c2 qesapworker-prg5.qa.suse.cz
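A consolidated sketch of the sequence above (credentials elided as in the original; the power status call is an extra sanity check not in the original steps):
# check the current power state via the management interface, then force a reset
ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <> -P *** power status
ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <> -P *** power reset
# once the host answers ping, confirm the salt minion is reachable from OSD
ping -c2 qesapworker-prg5.qa.suse.cz
sudo salt qesapworker-prg5.qa.suse.cz test.ping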
Journal logs from 22/09
https://gist.github.com/b10n1k/0e584da3b3a6d73e9d0dae42f5eee09b
CI retriggered and passes https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125107
Updated by ybonatakis 3 months ago · Edited
https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1323448 ran and everything looks passing, but the qesapworker-prg5 workers are still offline.
iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl status openqa-worker-auto-restart@*.service'
iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl restart openqa-worker-auto-restart@*.service'
It doesn't say much, but it contains something like [DEBUG ] return event: {'qesapworker-prg5.qa.suse.cz': {'ret': '', 'retcode': 0, 'jid': '20240923120727551566'}}
Digging in further I found:
iob@qesapworker-prg5:~> sudo journalctl -u openqa-worker-auto-restart@1.service --since="today"
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
iob@qesapworker-prg5:~> sudo systemctl list-units --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● openqa_nvme_format.service loaded failed failed Setup NVMe before mounting it
● smartd.service loaded failed failed Self Monitoring and Reporting Technology (SMART) Daemon
iob@qesapworker-prg5:~> sudo systemctl status openqa_nvme_format.service
× openqa_nvme_format.service - Setup NVMe before mounting it
Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-09-23 12:13:36 UTC; 16min ago
Process: 39448 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
Main PID: 39448 (code=exited, status=1/FAILURE)
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │ /home
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │ /.snapshots
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │ /
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: └─sda3 8:3 0 1G 0 part [SWAP]
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39458]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Unable to create RAID, mdadm returned with non-zero code
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
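The mdadm failure above points at a missing block device rather than a RAID problem as such. A short sketch of commands that could confirm whether the PERC-attached data disk is visible at all (paths taken from the log above):
# the data disk should show up next to sda
lsblk
# check whether the by-id path mdadm tried to open exists
ls -l /dev/disk/by-id/ | grep -i 'DELL_PERC'
# inspect the current state of the array, if it exists at all
sudo mdadm --detail /dev/md/openqa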
iob@qesapworker-prg5:~> sudo systemctl status automount-restarter@var-lib-openqa-share.service
× automount-restarter@var-lib-openqa-share.service - Restarts the automount unit var-lib-openqa-share
Loaded: loaded (/etc/systemd/system/automount-restarter@.service; static)
Active: failed (Result: exit-code) since Mon 2024-09-23 09:31:26 UTC; 3h 0min ago
Main PID: 3390 (code=exited, status=1/FAILURE)
Sep 23 09:31:16 qesapworker-prg5 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
iob@qesapworker-prg5:~> sudo journalctl -xeu automount-restarter@var-lib-openqa-share.service
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has begun execution.
░░
░░ The job identifier is 421.
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit automount-restarter@var-lib-openqa-share.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit automount-restarter@var-lib-openqa-share.service has entered the 'failed' state with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
░░ Subject: A start job for unit automount-restarter@var-lib-openqa-share.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has finished with a failure.
░░
░░ The job identifier is 421 and the job result is failed.
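The automount unit itself only fails as a dependency, so a sketch (plain systemd tooling) to walk the chain from the worker unit down to the actual failure could look like this:
# what does worker instance 1 depend on?
systemctl list-dependencies openqa-worker-auto-restart@1.service
# the automount and the mount unit it triggers
systemctl status var-lib-openqa-share.automount var-lib-openqa-share.mount
# which in turn needs the scratch RAID created by openqa_nvme_format.service
systemctl status openqa_nvme_format.service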
Updated by ybonatakis 3 months ago
- Related to action #166169: Failed systemd services on worker31 / osd size:M added
Updated by ybonatakis 3 months ago · Edited
I linked #166169 which, as pointed out by @nicksinger, seems relevant.
Updated by openqa_review 3 months ago
- Due date set to 2024-10-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 3 months ago
livdywan wrote in #note-7:
Did you consider taking qesapworker-prg5.qa.suse.cz out of salt for now? While units are failed pipelines will continue to fail.
I did, but not because of the pipelines; the pipelines at https://gitlab.suse.de/openqa/osd-deployment/-/pipelines don't seem to be failing because of it.
Updated by nicksinger 3 months ago
I removed the machine from production following https://progress.opensuse.org/projects/openqav3/wiki#Take-machines-out-of-salt-controlled-production because it failed deployment: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/3137418#L789
Updated by okurz 3 months ago
- Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-office-day, next-prague-office-visit
I checked lsblk on qesapworker-prg5 and found, as in previous tickets, that the second storage device is missing. As this has now happened multiple times, please, as discussed, physically ensure proper seating of the storage devices, i.e. pull them out and plug them back in (a short verification sketch follows below).
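After reseating, a quick verification sketch (device names depend on how the controller enumerates the disks; restarting openqa_nvme_format.service recreates the scratch RAID, so only do that while the machine is out of production):
# the second storage device should be listed again next to sda
lsblk
# the by-id path used by the NVMe setup service should exist again
ls -l /dev/disk/by-id/ | grep -i 'DELL_PERC'
# re-run the setup service and check that it succeeds this time
sudo systemctl restart openqa_nvme_format.service
sudo systemctl status openqa_nvme_format.service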
Updated by ybonatakis 3 months ago
- Tags changed from infra, reactive work, alert, next-office-day, next-prague-office-visit to infra, reactive work, alert
- File clipboard-202409271831-rfwpi.png clipboard-202409271831-rfwpi.png added
- Status changed from In Progress to Feedback
I have no idea if someone else did something before me.
I followed the suggestion to reboot the iDRAC and then I had to turn the system on.
In the meantime I was trying to find out how to update the firmware (iDRAC), but I didn't have that option. I logged in with credentials from workerconf.sls. While looking for something related to this I navigated to the storage view, where I saw the state in the attached screenshot.
I will leave it at that and check again on Monday.
Updated by okurz 3 months ago
- Related to action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S added
Updated by okurz 3 months ago
- Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added
Updated by ybonatakis 3 months ago
okurz wrote in #note-15:
For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.
I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?
Updated by okurz 3 months ago
ybonatakis wrote in #note-17:
okurz wrote in #note-15:
For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.
I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?
Well, as the same problem seems to have happened multiple times, I still suggest you re-add the tags you removed in #167164-13 and physically ensure proper seating of the storage device connectors.
Updated by ybonatakis 3 months ago
- Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-prague-office-visit
Just re-added the next-prague-office-visit tag, which was incidentally removed on Friday.
Updated by nicksinger 3 months ago
I've updated the following components:
- iDRAC from 7.00.(something) to 7.10.70.00
- SAS RAID Firmware from 52.16.1-4405 to 52.26.0-5179
- BIOS from 1.10.2 to 1.15.2
I also cleared the iDRAC logs so you have a clean starting point to validate if the machine still has problems detecting its disks.
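A short sketch of how the new firmware levels could be confirmed from the OS side (exact output fields vary by platform; the iDRAC web UI remains the authoritative source):
# BIOS version as reported via SMBIOS, should now read 1.15.2
sudo dmidecode -s bios-version
# BMC/iDRAC firmware revision as seen over IPMI
sudo ipmitool mc info | grep -i 'firmware revision'
# and the disks the RAID setup needs
lsblk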
Updated by ybonatakis 3 months ago · Edited
I just turned on the system after @nicksinger's updates. The disk showed up and everything looks normal.
Updated by ybonatakis 3 months ago
The qesapworker-prg5.qa.suse.cz workers are online on OSD. I also checked https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1&var-all_machines=qesapworker-prg5.qa.suse.cz
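A sketch of how the worker state could be double-checked from OSD (the openqa-cli call assumes a configured API client on the host):
# minion reachable again
sudo salt qesapworker-prg5.qa.suse.cz test.ping
# all worker instances on the machine should be active
sudo salt qesapworker-prg5.qa.suse.cz cmd.run 'systemctl list-units openqa-worker-auto-restart@*.service'
# worker registration as seen by the openQA API
openqa-cli api --host https://openqa.suse.de workers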
Updated by ybonatakis 3 months ago
- Status changed from Feedback to Resolved
Oli said that the worker class is disabled on most of the workers, but this is related to another ticket (poo#??). Resolving as per the last actions from Nick and as we have the machine running stably on OSD. Disks are visible and we move to the monitoring stage.