action #166169

closed

Failed systemd services on worker31 / osd size:M

Added by livdywan 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-07-09
Due date:
2024-09-17
% Done:

0%

Estimated time:

Description

Observation

Failed services:

automount-restarter@var-lib-openqa-share, openqa-reload-worker-auto-restart@1, openqa-reload-worker-auto-restart@10, openqa-reload-worker-auto-restart@11, openqa-reload-worker-auto-restart@12, openqa-reload-worker-auto-restart@13, openqa-reload-worker-auto-restart@14, openqa-reload-worker-auto-restart@15, openqa-reload-worker-auto-restart@16, openqa-reload-worker-auto-restart@17, openqa-reload-worker-auto-restart@18, openqa-reload-worker-auto-restart@19, openqa-reload-worker-auto-restart@2, openqa-reload-worker-auto-restart@20, openqa-reload-worker-auto-restart@21, openqa-reload-worker-auto-restart@22, openqa-reload-worker-auto-restart@23, openqa-reload-worker-auto-restart@24, openqa-reload-worker-auto-restart@25, openqa-reload-worker-auto-restart@26, openqa-reload-worker-auto-restart@27, openqa-reload-worker-auto-restart@28, openqa-reload-worker-auto-restart@29, openqa-reload-worker-auto-restart@3, openqa-reload-worker-auto-restart@30, openqa-reload-worker-auto-restart@31, openqa-reload-worker-auto-restart@32, openqa-reload-worker-auto-restart@33, openqa-reload-worker-auto-restart@34, openqa-reload-worker-auto-restart@35, openqa-reload-worker-auto-restart@36, openqa-reload-worker-auto-restart@37, openqa-reload-worker-auto-restart@38, openqa-reload-worker-auto-restart@39, openqa-reload-worker-auto-restart@4, openqa-reload-worker-auto-restart@40, openqa-reload-worker-auto-restart@41, openqa-reload-worker-auto-restart@42, openqa-reload-worker-auto-restart@43, openqa-reload-worker-auto-restart@44, openqa-reload-worker-auto-restart@45, openqa-reload-worker-auto-restart@46, openqa-reload-worker-auto-restart@47, openqa-reload-worker-auto-restart@48, openqa-reload-worker-auto-restart@49, openqa-reload-worker-auto-restart@5, openqa-reload-worker-auto-restart@50, openqa-reload-worker-auto-restart@51, openqa-reload-worker-auto-restart@52, openqa-reload-worker-auto-restart@53, openqa-reload-worker-auto-restart@54, openqa-reload-worker-auto-restart@55, openqa-reload-worker-auto-restart@56, openqa-reload-worker-auto-restart@57, openqa-reload-worker-auto-restart@58, openqa-reload-worker-auto-restart@59, openqa-reload-worker-auto-restart@6, openqa-reload-worker-auto-restart@60, openqa-reload-worker-auto-restart@61, openqa-reload-worker-auto-restart@62, openqa-reload-worker-auto-restart@63, openqa-reload-worker-auto-restart@7, openqa-reload-worker-auto-restart@8, openqa-reload-worker-auto-restart@9

Acceptance Criteria

  • AC1: Worker31 boots reliably without any issues mounting disks

Rollback steps

  • hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
  • ssh osd "worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Suggestions

  • Log in and check failed services (see the sketch after this list)
  • Restart services and/or the worker
  • See #162293#note-30 for recent changes to the disk array setup
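
A minimal sketch of how these checks could look on the worker itself (hostname and unit name taken from this ticket; the exact commands are an assumption, not a prescribed procedure):

ssh worker31.oqa.prg2.suse.org
# list all failed units
systemctl --failed
# inspect one of them, e.g. the automount restarter
systemctl status automount-restarter@var-lib-openqa-share.service
journalctl -b -u automount-restarter@var-lib-openqa-share.service
# try a plain restart first, reboot only if that is not enough
sudo systemctl restart automount-restarter@var-lib-openqa-share.service
sudo reboot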

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M (Resolved, nicksinger, 2024-06-14)

Related to openQA Infrastructure (public) - action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M (Resolved, ybonatakis)

Actions #1

Updated by livdywan 4 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Let's see if a straightforward restart may do the trick, or if this is more involved.

Actions #2

Updated by livdywan 4 months ago

Machine booted. All worker instances show as offline:

systemctl list-units --failed
  UNIT                                             LOAD   ACTIVE SUB    DESCRIPTION                                     
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● openqa_nvme_format.service                       loaded failed failed Setup NVMe before mounting it
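
For the record, pulling the boot logs of these two units would be the generic next step (standard journalctl usage, not quoted from this ticket):

journalctl -b -u openqa_nvme_format.service
journalctl -b -u automount-restarter@var-lib-openqa-share.service
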
Actions #3

Updated by livdywan 4 months ago

  • Description updated (diff)

I can't seem to log in on Grafana even after triple-checking the login I was using. Same with stats, so it doesn't seem to be the old issue with the other domain.
Hence the alert is not silenced for now.

Actions #4

Updated by livdywan 4 months ago

sudo systemctl status openqa_nvme_format
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]:       499975512 blocks super 1.2
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]:        
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24698]: unused devices: <none>
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24700]: INACTIVE-ARRAY /dev/md127 metadata=1.2 name=worker31:openqa UUID=dd441ab6:e96c057f:4e7f302c:b7da4445
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24418]: Creating ext2 filesystem on RAID0 "/dev/md/openqa"
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24701]: mke2fs 1.46.4 (18-Aug-2021)
Sep 02 19:07:07 worker31 openqa-establish-nvme-setup[24701]: The file /dev/md/openqa does not exist and no size was specified.
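
The mke2fs error suggests the array was never assembled as /dev/md/openqa. A generic way to inspect the MD state at this point would be (standard mdadm tooling, not quoted from the ticket):

cat /proc/mdstat                             # which arrays exist and in what state
mdadm --detail /dev/md127                    # details of the inactive array seen above
mdadm --examine /dev/nvme1n1 /dev/nvme2n1    # per-device superblock metadata
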
Actions #5

Updated by openqa_review 4 months ago

  • Due date set to 2024-09-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by livdywan 4 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

I guess it's not just the race condition, and multiple reboots did not help. Maybe best to estimate first.

Actions #7

Updated by livdywan 4 months ago

  • Status changed from Workable to New

This should be New of course.

Actions #8

Updated by livdywan 4 months ago

  • Related to action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Actions #9

Updated by livdywan 4 months ago

  • Description updated (diff)
Actions #10

Updated by livdywan 3 months ago

  • Subject changed from Failed systemd services on worker31 / osd to Failed systemd services on worker31 / osd size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by dheidler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #12

Updated by dheidler 3 months ago

There is no proper raid config because openqa-establish-nvme-setup fails:

Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1
mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=2 --run /dev/nvme1n1 /dev/nvme2n1
mdadm: cannot open /dev/nvme1n1: Device or resource busy

It seems that a broken mdraid config was present on /dev/nvme1n1 that prevented creating a new raid.
I unloaded the md_mod kernel module and wrote 10MB of zeros to the nvme1n1 device, then loaded md_mod again.

Afterwards openqa-establish-nvme-setup was able to run fine.
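
A rough reconstruction of those steps (the exact commands are not in the ticket; device name and zero-fill size as described above):

# the stale array may have to be stopped before the module can be unloaded
mdadm --stop /dev/md127
modprobe -r md_mod                               # unload the MD driver (RAID personality modules such as raid0 may need to go first)
dd if=/dev/zero of=/dev/nvme1n1 bs=1M count=10   # 10MB of zeros over the leftover mdraid metadata
modprobe md_mod                                  # load the MD driver again
systemctl restart openqa_nvme_format             # re-runs openqa-establish-nvme-setup (see the status output in #4)

wipefs -a /dev/nvme1n1 would be a more targeted way to clear just the signatures.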

Actions #13

Updated by dheidler 3 months ago

But after a reboot the problem came up again.

Actions #14

Updated by dheidler 3 months ago

  • Status changed from In Progress to Resolved

After the reboot mdadm --examine /dev/nvme2n1 didn't show any traces of a raid on nvme2n1.
Also no traces of raid name at offset 0x1020 in dd if=/dev/nvme2n1 bs=1k count=10 | hexdump -C.
Stopped the inactive raid on nvme1n1 via mdadm --stop /dev/md127.

dd if=/dev/zero of=/dev/nvme1n1 bs=10M count=1
dd if=/dev/zero of=/dev/nvme2n1 bs=10M count=1

Then openqa-establish-nvme-setup worked and the machine even survived a reboot.

Given that SMART reports both nvme1n1 and nvme2n1 way above the vendor-supported written bytes (137% and 133%) and smartctl -H /dev/nvme1n1 (likewise for nvme2n1) reports:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

... I would expect this to work only by chance: the NVMes are worn out and may become unreliable or fail in the near future, until we replace them.
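
For reference, the wear figures mentioned above can be read out like this (generic smartmontools/nvme-cli invocations, not quoted from the ticket):

smartctl -H /dev/nvme1n1                                # overall health verdict, as quoted above
smartctl -a /dev/nvme1n1 | grep -i 'percentage used'    # NVMe endurance indicator; above 100% means the rated write endurance is exceeded
nvme smart-log /dev/nvme1n1                             # the same data via nvme-cli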

Actions #15

Updated by dheidler 3 months ago

Executed rollback steps.

Actions #16

Updated by ybonatakis 3 months ago

  • Related to action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M added