action #166169
closed
Failed systemd services on worker31 / osd size:M
Description
Observation
Failed services:
automount-restarter@var-lib-openqa-share, openqa-reload-worker-auto-restart@1, openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@11, openqa-reload-worker-auto-restart@12, openqa-reload-worker-auto-restart@13, openqa-reload-worker-auto-restart@14, openqa-reload-worker-auto-restart@15, openqa-reload-worker-auto-restart@16, openqa-reload-worker-auto-restart@17, openqa-reload-worker-auto-restart@18, openqa-reload-worker-auto-restart@19, openqa-reload-worker-auto-restart@2, openqa-reload-worker-auto-restart@20, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@24, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@26, openqa-reload-worker-auto-restart@27, openqa-reload-worker-auto-restart@28, openqa-reload-worker-auto-restart@29, openqa-reload-worker-auto-restart@3, openqa-reload-worker-auto-restart@30, openqa-reload-worker-auto-restart@31, openqa-reload-worker-auto-restart@32, openqa-reload-worker-auto-restart@33, openqa-reload-worker-auto-restart@34, openqa-reload-worker-auto-restart@35, openqa-reload-worker-auto-restart@36, openqa-reload-worker-auto-restart@37, openqa-reload-worker-auto-restart@38, openqa-reload-worker-auto-restart@39, openqa-reload-worker-auto-restart@4, openqa-reload-worker-auto-restart@40, openqa-reload-worker-auto-restart@41, openqa-reload-worker-auto-restart@42, openqa-reload-worker-auto-restart@43, openqa-reload-worker-auto-restart@44, openqa-reload-worker-auto-restart@45, openqa-reload-worker-auto-restart@46, openqa-reload-worker-auto-restart@47, openqa-reload-worker-auto-restart@48, openqa-reload-worker-auto-restart@49, openqa-reload-worker-auto-restart@5, openqa-reload-worker-auto-restart@50, openqa-reload-worker-auto-restart@51, openqa-reload-worker-auto-restart@52, openqa-reload-worker-auto-restart@53, openqa-reload-worker-auto-restart@54, openqa-reload-worker-auto-restart@55, 
openqa-reload-worker-auto-restart@56, openqa-reload-worker-auto-restart@57, openqa-reload-worker-auto-restart@58, openqa-reload-worker-auto-restart@59, openqa-reload-worker-auto-restart@6, openqa-reload-worker-auto-restart@60, openqa-reload-worker-auto-restart@61, openqa-reload-worker-auto-restart@62, openqa-reload-worker-auto-restart@63, openqa-reload-worker-auto-restart@7, openqa-reload-worker-auto-restart@8, openqa-reload-worker-auto-restart@003930@bb.com.br
Acceptance Criteria
- AC1: Worker31 boots reliably without any issues mounting disks
Rollback steps
hostname=worker31.oqa.prg2.suse.org
ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
ssh osd "sudo salt 'worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"
Suggestions
- Login and check failed services
- Restart services and/or the worker
- See #162293#note-30 for recent changes to the disk array setup
Updated by livdywan 4 months ago
Machine booted. All worker instances show as offline:
systemctl list-units --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● openqa_nvme_format.service loaded failed failed Setup NVMe before mounting it
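As a minimal sketch, the failed unit names can be pulled out of that listing for scripting. The helper name failed_units is hypothetical and not part of any openQA tooling; it only assumes the column layout shown above (optional bullet, unit name, LOAD, ACTIVE, SUB).

```shell
# Hypothetical helper: print the unit names from `systemctl list-units --failed`
# output read on stdin. Lines may or may not start with the bullet marker.
failed_units() {
    awk '{
        # If the first field has no letters or digits it is the bullet marker,
        # so the unit name is in the second field.
        n = ($1 ~ /[A-Za-z0-9]/) ? 1 : 2
        # The ACTIVE column sits two fields after the unit name.
        if ($(n + 2) == "failed") print $n
    }'
}
```

Usage on the worker would be along the lines of: systemctl list-units --failed | failed_units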
Updated by livdywan 4 months ago
sudo systemctl status openqa_nvme_format
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]: 499975512 blocks super 1.2
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]:
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]: unused devices: <none>
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24700]: INACTIVE-ARRAY /dev/md127 metadata=1.2 name=worker31:openqa UUID=dd441ab6:e96c057f:4e7f302c:b7da4445
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24418]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24701]: mke2fs 1.46.4 (18-Aug-2021)
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24701]: The file /dev/md/openqa does not exist and no size was specified.
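The log excerpt ends with the tail of what looks like /proc/mdstat and an INACTIVE-ARRAY entry for /dev/md127. A rough sketch for listing inactive arrays and their member devices from mdstat-style text follows. The function name inactive_md_arrays is hypothetical, and the exact "mdX : inactive ..." line format is an assumption because the excerpt above is truncated before that line.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "<array> <member> ..." for every array reported as inactive.
inactive_md_arrays() {
    awk '$2 == ":" && $3 == "inactive" {
        line = $1
        for (i = 4; i <= NF; i++) {
            member = $i
            sub(/\[.*/, "", member)   # strip the "[0](S)" role/state suffix
            line = line " " member
        }
        print line
    }'
}
```

On a live system this would be fed with: inactive_md_arrays < /proc/mdstat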
Updated by openqa_review 4 months ago
- Due date set to 2024-09-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 4 months ago
- Related to action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Updated by dheidler 3 months ago
There is no proper RAID config because openqa-establish-nvme-setup fails:
Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1
mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=2 --run /dev/nvme1n1 /dev/nvme2n1
mdadm: cannot open /dev/nvme1n1: Device or resource busy
It seems that a broken mdraid config was present on /dev/nvme1n1 that prevented creating a new raid.
I unloaded the md_mod kernel module, wrote 10MB of zeros to the nvme1n1 device, then loaded md_mod again.
Afterwards openqa-establish-nvme-setup was able to run fine.
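The steps described in this comment could be sketched roughly as below. This is not the exact sequence that was typed on worker31: the function name clear_stale_md and the RUN switch are hypothetical, and the dd size simply mirrors the 10MB mentioned above. By default the function only prints the commands, since running them destroys data on the device and unloading md_mod only works when no md array is in use.

```shell
# Hypothetical helper: print (default) or run the steps to clear stale md
# metadata from one device. Set RUN=1 to execute for real (destructive!).
clear_stale_md() {
    dev=$1
    for cmd in \
        "modprobe -r md_mod" \
        "dd if=/dev/zero of=$dev bs=10M count=1" \
        "modprobe md_mod"
    do
        if [ "${RUN:-0}" = 1 ]; then
            # Word splitting is intended here: each entry is a plain command line.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}
```

Calling clear_stale_md /dev/nvme1n1 without RUN=1 only lists the three commands, which makes it easy to review them before touching a disk.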
Updated by dheidler 3 months ago
- Status changed from In Progress to Resolved
After the reboot, mdadm --examine /dev/nvme2n1 didn't show any traces of a RAID on nvme2n1.
There were also no traces of a RAID name at offset 0x1020 in the output of dd if=/dev/nvme2n1 bs=1k count=10 | hexdump -C.
I stopped the inactive RAID on nvme1n1 via mdadm --stop /dev/md127 and overwrote the start of both devices with zeros:
dd if=/dev/zero of=/dev/nvme1n1 bs=10M count=1
dd if=/dev/zero of=/dev/nvme2n1 bs=10M count=1
Then openqa-establish-nvme-setup worked and the machine even survived a reboot.
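The manual hexdump check for RAID traces can be approximated with a small test for the md superblock magic. This sketch assumes the version 1.2 metadata layout, where the superblock starts 4 KiB into the device and begins with the magic 0xa92b4efc stored little-endian; that would be consistent with the array name showing up at 0x1020. The function name has_md_v12_superblock is hypothetical, and mdadm --examine remains the authoritative check.

```shell
# Hypothetical helper: succeed if the device or image file given as $1 carries
# the md superblock magic at byte offset 4096 (assumed v1.2 metadata layout).
has_md_v12_superblock() {
    # Read 4 bytes at offset 4096 and render them as a plain hex string.
    magic=$(dd if="$1" bs=1 skip=4096 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    # 0xa92b4efc in little-endian byte order
    [ "$magic" = "fc4e2ba9" ]
}
```

For example: has_md_v12_superblock /dev/nvme2n1 && echo "md metadata still present"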
Since SMART reports that both nvme1n1 and nvme2n1 are way above the vendor-supported written bytes (137% and 133%) and smartctl -H /dev/nvme1n1 (likewise for nvme2n1) reports:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
... I would expect this to be working only by chance: the NVMe devices are broken and will be unreliable or fail in the near future until we replace them.
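A minimal sketch for turning the smartctl -H result above into an exit status, for instance in a monitoring check. The function name smart_health_passed is hypothetical; it only looks for the "PASSED" verdict line and treats anything else, including the FAILED! output quoted above, as unhealthy.

```shell
# Hypothetical helper: read `smartctl -H` output on stdin and succeed only if
# the overall-health self-assessment is reported as PASSED.
smart_health_passed() {
    grep -q 'self-assessment test result: PASSED'
}
```

For example: smartctl -H /dev/nvme1n1 | smart_health_passed || echo "nvme1n1 unhealthy"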
Updated by ybonatakis 3 months ago
- Related to action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M added