action #162293

closed

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

SMART errors on bootup of worker31, worker32 and worker34 size:M

Added by okurz 6 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
2024-06-14
Due date:
% Done:
0%

Estimated time:
Tags:

Description

Observation

While struggling with worker31, worker32 and worker34, which upgraded themselves to Leap 15.6 and then crashed multiple times after booting into kernel 6.4, we observed SMART errors shown early during bootup. These might explain the kernel crashes or might be separate errors. We downgraded the machines to Leap 15.5 for now and took them out of production, but they still run as openQA workers.

Acceptance criteria

  • AC1: w31 boots up fine without SMART errors
  • AC2: w32 boots up fine without SMART errors
  • AC3: w33 boots up fine without SMART errors

Steps to reproduce

  • reboot worker31 and then follow the output on ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate"
  • observe SMART errors very early during firmware initialization

Suggestions

  • Check the content of /var/crash and clean up after investigation
  • Check the status of SMART from the running Linux system and also the messages on bootup (see the sketch after this list)
  • Crosscheck the SMART status on other salt-controlled machines; at least w32 was observed to show the same errors
  • Consider replacing defective hardware
  • Ensure no services fail again
  • Bring the system back into production
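
A minimal sketch for the first two suggestions, assuming smartmontools and journalctl are available on the worker (commands for illustration, not a prescribed procedure):

# list leftover kernel crash dumps
ls -lah /var/crash/
# quick SMART health verdict for every NVMe namespace
for dev in /dev/nvme?n1; do
    echo "== $dev =="
    smartctl -H "$dev"
done
# SMART/NVMe related kernel messages from the current boot
journalctl -b -k | grep -iE 'smart|nvme'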

Rollback steps

  • hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
  • ssh osd "worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure (public) - action #163745: [tools] tests on worker31 time out on yast2 firewall services add zone=EXT service=service:target (Resolved, okurz, 2024-07-11)

Related to openQA Infrastructure (public) - action #166169: Failed systemd services on worker31 / osd size:M (Resolved, dheidler, 2024-07-09, 2024-09-17)

Copied from openQA Infrastructure (public) - action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S (In Progress, ybonatakis, 2024-12-26)

Copied to openQA Project (public) - action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S (In Progress, dheidler, 2024-06-14, 2024-12-26)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Actions #2

Updated by okurz 6 months ago

  • Description updated (diff)
  • Priority changed from Normal to High
Actions #3

Updated by okurz 6 months ago

  • Subject changed from SMART errors on bootup of w31 to SMART errors on bootup of w31+w32, possibly more
  • Description updated (diff)
Actions #4

Updated by okurz 6 months ago

  • Copied to action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Actions #5

Updated by okurz 6 months ago

Also observed on w34

Actions #6

Updated by okurz 6 months ago

  • Priority changed from High to Normal
Actions #7

Updated by okurz 6 months ago

  • Target version changed from Ready to Tools - Next
Actions #8

Updated by livdywan 5 months ago

  • Subject changed from SMART errors on bootup of w31+w32, possibly more to SMART errors on bootup of worker31 worker32, worker34

Mentioned all workers known to be affected

Actions #9

Updated by livdywan 5 months ago

  • Description updated (diff)
Actions #10

Updated by livdywan 5 months ago

  • Subject changed from SMART errors on bootup of worker31 worker32, worker34 to SMART errors on bootup of worker31, worker32 and worker34
  • Description updated (diff)
Actions #11

Updated by okurz 5 months ago

  • Target version changed from Tools - Next to Ready

Priority should be to bring the workers back into salt due to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/872#note_651248, regardless of the SMART errors, assuming they are not critical.

Actions #12

Updated by nicksinger 5 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #13

Updated by nicksinger 5 months ago

starting out with worker31 I can see:

worker31:~ # smartctl -a /dev/nvme0n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZPLJ6T4HALA-00007
Serial Number:                      S55KNC0TA00961
Firmware Version:                   EPK9CB5Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 6,401,252,745,216 [6.40 TB]
Unallocated NVM Capacity:           0
Controller ID:                      65
NVMe Version:                       1.3
Number of Namespaces:               32
Namespace 1 Size/Capacity:          6,401,252,745,216 [6.40 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Jul 26 09:59:26 2024 CEST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x00df):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Vrt_Mngmt
Optional NVM Commands (0x007f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   19.00W       -    0  0  0  0      180     180

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         1
 1 -     512       8         3
 2 -    4096       0         0
 3 -    4096       8         2
 4 -    4096      64         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,093,267 [1.58 TB]
Data Units Written:                 25,192,745 [12.8 TB]
Host Read Commands:                 14,666,659
Host Write Commands:                617,095,582
Controller Busy Time:               17
Power Cycles:                       19
Power On Hours:                     9,577
Unsafe Shutdowns:                   16
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               35 Celsius
Temperature Sensor 2:               33 Celsius
Temperature Sensor 3:               33 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

which looks strange. I looked at the SMART FAQ and found: https://www.smartmontools.org/wiki/FAQ#ATAdriveisfailingself-testsbutSMARThealthstatusisPASSED.Whatsgoingon - having bad blocks because of sudden outages sounds like it applies to our situation as well. I will now try to follow https://www.smartmontools.org/wiki/BadBlockHowto to see if I can bring the device back into a clean state without replacing hardware.

Actions #14

Updated by nicksinger 5 months ago

Indeed the described method of writing the affected block back to the disk resolved the issue. I accomplished that in a brute-force way by executing a full btrfs balance (which rewrites every block to disk again) with btrfs balance start --full-balance /. After this we can also see that SMART is happy again:

worker31:~ # smartctl -x /dev/nvme0n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZPLJ6T4HALA-00007
Serial Number:                      S55KNC0TA00961
Firmware Version:                   EPK9CB5Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 6,401,252,745,216 [6.40 TB]
Unallocated NVM Capacity:           0
Controller ID:                      65
NVMe Version:                       1.3
Number of Namespaces:               32
Namespace 1 Size/Capacity:          6,401,252,745,216 [6.40 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Jul 26 10:44:35 2024 CEST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x00df):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Vrt_Mngmt
Optional NVM Commands (0x007f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   19.00W       -    0  0  0  0      180     180

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         1
 1 -     512       8         3
 2 -    4096       0         0
 3 -    4096       8         2
 4 -    4096      64         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,238,297 [1.65 TB]
Data Units Written:                 25,301,676 [12.9 TB]
Host Read Commands:                 15,249,973
Host Write Commands:                617,549,916
Controller Busy Time:               17
Power Cycles:                       19
Power On Hours:                     9,578
Unsafe Shutdowns:                   16
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               34 Celsius
Temperature Sensor 3:               34 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

I also noticed we have 100% spare blocks left on that NVMe, so I think it is safe to assume we don't see a hardware issue here. I'm now going to research how I can rewrite all blocks for the RAID0 device we have on nvme1 and nvme2.
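
One brute-force way to rewrite every block on the RAID0 members would be a destructive badblocks pass over the raw devices, sketched here under the assumption that the array only holds data which openqa_nvme_format.service recreates on boot anyway (an illustration, not necessarily the approach taken later):

# stop the array so its member devices are no longer busy
mdadm --stop /dev/md/openqa
# destructive write test: rewrites every block with test patterns
badblocks -wsv /dev/nvme1n1
badblocks -wsv /dev/nvme2n1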

Actions #15

Updated by nicksinger 5 months ago

Oh, the situation is different on the other two NVMes; they both report as model "SAMSUNG MZVL2512HCJQ-00B00", which seems to be a "980 Pro". I found bug reports for the kernel: https://bugzilla.kernel.org/show_bug.cgi?id=217445 - they explain that the kernel cannot really do anything, so I was thinking about upgrading the firmware. There were a lot of rumors in the past about the 980 Pros, so it's worth updating anyway. I'm currently figuring out how I can do this. fwupd unfortunately doesn't work, so I have to resort to some strange vendor tools found on https://semiconductor.samsung.com/consumer-storage/support/tools/
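
For reference, the currently reported firmware revision and slot state can be checked with nvme-cli before and after an update; a sketch assuming the controller is /dev/nvme1:

# firmware revision currently reported by the controller
nvme id-ctrl /dev/nvme1 | grep '^fr '
# firmware slot information (which slots are populated and which one is active)
nvme fw-log /dev/nvme1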

Actions #16

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by nicksinger 5 months ago

I was able to flash the newest (?) firmware with:

worker31:~ # nvme fw-activate --slot 0x1 --action 0x1 /dev/nvme1
worker31:~ # nvme fw-download --fw /home/nsinger/GXA7801Q_Noformat.bin /dev/nvme1

The most trustworthy source for that file was https://help.ovhcloud.com/csm/en-dedicated-servers-samsung-nvme-firmware-upgrade?id=kb_article_view&sysparm_article=KB0060093

but no improvement. After a reboot smartctl -a /dev/nvme1n1 still shows:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
[…]
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Currently badblocks (badblocks -wsv /dev/nvme1n1) is running and so far no pattern has shown problems (4 patterns have already finished, I think), but the message from smartctl also does not go away like it did with the first drive after "writing" the bad block(s) again (due to the full btrfs balance I did).

After this finishes I want to issue the NVMe self-test (nvme device-self-test), because up until now it is very hard to argue to our vendor that this drive is actually defective other than the (maybe erroneous) SMART messages we see.
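
A sketch of how such a self-test could be triggered and read back with nvme-cli, assuming /dev/nvme1 (self-test code 2 is the extended test, namespace id 0xffffffff targets all namespaces):

# start an extended device self-test on all namespaces of the controller
nvme device-self-test /dev/nvme1 -n 0xffffffff -s 2
# poll the result; shows progress and the outcome of recent self-tests
nvme self-test-log /dev/nvme1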

Actions #18

Updated by mkittler 5 months ago

  • Subject changed from SMART errors on bootup of worker31, worker32 and worker34 to SMART errors on bootup of worker31, worker32 and worker34 size:M
Actions #19

Updated by nicksinger 5 months ago

Except for one (openqaworker1.qe.nue2.suse.org), we see the problem on many Samsung NVMes:

openqa:~ # salt '*' cmd.run 'for dev in /dev/nvme?; do smartctl -a "${dev}" | grep FAILED && echo ${dev} && smartctl -a "${dev}" | grep Model; done'
s390zl13.oqa.prg2.suse.org:
s390zl12.oqa.prg2.suse.org:
backup-qam.qe.nue2.suse.org:
storage.qe.prg2.suse.org:
unreal6.qe.nue2.suse.org:
osiris-1.qe.nue2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker18.qa.suse.cz:
ada.qe.prg2.suse.org:
sapworker3.qe.nue2.suse.org:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
qesapworker-prg4.qa.suse.cz:
openqaw5-xen.qe.prg2.suse.org:
qesapworker-prg6.qa.suse.cz:
worker40.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
sapworker1.qe.nue2.suse.org:
openqa.suse.de:
openqaworker14.qa.suse.cz:
baremetal-support.qe.nue2.suse.org:
worker33.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
worker-arm2.oqa.prg2.suse.org:
qamaster.qe.nue2.suse.org:
worker34.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
worker35.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
worker29.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
worker-arm1.oqa.prg2.suse.org:
openqaworker1.qe.nue2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       INTEL SSDPEKNW010T8
openqaworker17.qa.suse.cz:
worker30.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
schort-server.qe.nue2.suse.org:
backup-vm.qe.nue2.suse.org:
worker32.oqa.prg2.suse.org:
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme1
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
    SMART overall-health self-assessment test result: FAILED!
    /dev/nvme2
    Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
monitor.qe.nue2.suse.org:
sapworker2.qe.nue2.suse.org:
tumblesle.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
jenkins.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqa-piworker.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
ERROR: Minions returned with non-zero exit code

I've now written a mail to happyware asking for support on this.
While researching some details for my mail, I noticed that these disks apparently are rated for "300TBW". We exceed this with all of the failing disks. I'm starting to think that this might just be normal behavior: the disk reports as failed as soon as this threshold is reached.
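
For comparison against the rated 300 TBW, the total bytes written can be estimated from smartctl: "Data Units Written" counts units of 512,000 bytes. A rough sketch (not the exact command used for the numbers above):

for dev in /dev/nvme?n1; do
    units=$(smartctl -a "$dev" | awk -F: '/Data Units Written/ {gsub(/[ ,]/, "", $2); sub(/\[.*/, "", $2); print $2}')
    # 1 data unit = 512,000 bytes; print terabytes written
    [ -n "$units" ] && printf '%s: %.1f TB written\n' "$dev" "$(echo "$units * 512000 / 10^12" | bc -l)"
done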

Actions #20

Updated by nicksinger 5 months ago

  • Status changed from In Progress to Feedback

To quote myself from Slack: "I think we need to have a talk when we consider to replace NVMe disks in our systems… I just noticed that all of the failing disks have huge amounts of data written to them already (500TB++) while their rated endurance is rated at 300TBW. But apparently our workload is not really bad (?) and all of these disks report 100% spare still available and work perfectly fine." I'll try to drive this discussion in parallel while waiting for feedback from happyware.

Actions #21

Updated by okurz 5 months ago

  • Related to action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized added
Actions #22

Updated by okurz 5 months ago

  • Related to deleted (action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized)
Actions #23

Updated by nicksinger 5 months ago

  • Status changed from Feedback to Resolved

worker3{1,2,3} are back in salt and the highstate applied successfully. We discussed the topic of replacing the NVMes and decided that we don't want to introduce new metrics and would rather wait for jobs to fail before we actually replace the hardware. The warning while booting cannot be disabled in the BIOS, so that means the reboot time increases by a minute or so. As we cannot do anything more, I'm resolving this now despite the ACs not being fulfilled.

Actions #24

Updated by okurz 5 months ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Due date deleted (2024-08-10)
  • Category changed from Regressions/Crashes to Regressions/Crashes
Actions #25

Updated by livdywan 5 months ago

  • Status changed from Resolved to Workable
          ID: /var/lib/openqa/share
    Function: mount.mounted
      Result: False
     Comment: Unable to unmount /var/lib/openqa/share: umount: /var/lib/openqa/share: not mounted..
     Started: 10:14:35.253985
    Duration: 93.574 ms
     Changes:   
              ----------
              umount:
                  Forced unmount and mount because options (ro) changed
Summary for worker31.oqa.prg2.suse.org

I feel like something did not go well here, though?

Actions #26

Updated by nicksinger 5 months ago

yes, investigating. I've disabled worker31 for now again.

Actions #27

Updated by nicksinger 5 months ago

  • Status changed from Workable to In Progress
Actions #28

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #29

Updated by nicksinger 5 months ago

So the issue is that, for some reason, openqa_nvme_format.service does not format the NVMe drives in time and therefore every subsequent service relying on the mountpoint fails. Looking at the logs of the service I can see:

Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: nvme1n1     259:5    0 476.9G  0 disk
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: └─md127       9:127  0     0B  0 md
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: nvme2n1     259:6    0 476.9G  0 disk
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19137]: /dev/nvme0n1p2[/@/.snapshots/870/snapshot]
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19129]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19147]: mdadm: cannot open /dev/nvme1n1: Device or resource busy
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19129]: Waiting 10 seconds before trying again after failing due to busy device.
Aug 06 12:07:19 worker31 openqa-establish-nvme-setup[19129]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Aug 06 12:07:19 worker31 openqa-establish-nvme-setup[19129]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1

So for some reason there is md127, but only on one disk. Not sure yet why this happens and why the script can't handle it (I'm pretty certain it should). Stopping the service, then the existing "raid", and restarting the service works as expected (see the sketch below).
Running the script without unmounting first produces a similar result, so maybe some mount units are mounting the raid before the script has a chance to reformat the devices. I'll look for further clues and differences compared with other workers.
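
The manual recovery described above corresponds roughly to this sketch (assuming the stale array shows up as /dev/md127):

# stop the formatting service so it does not interfere
systemctl stop openqa_nvme_format.service
# stop the incomplete array so its member devices are no longer busy
mdadm --stop /dev/md127
# let the service recreate the RAID0 and mount points from scratch
systemctl start openqa_nvme_format.service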

Actions #30

Updated by nicksinger 5 months ago

  • Status changed from In Progress to Feedback

I boiled down my findings into suggested changes to our current script: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1246

On worker31 we often have /dev/md127 present as an incomplete array while booting, before our script runs. I assume this can happen if our script runs in parallel to udev, because we rely on symlinks in /dev (/dev/nvme?n1) and only one of the two nvme symlinks is present. My approach of ordering our service "After=systemd-udev-settle.service" has its own problems though, see https://www.freedesktop.org/software/systemd/man/latest/systemd-udev-settle.service.html - but I don't really know how to implement the suggested alternative easily. We could also rewrite our script to just run before udev, but I'm not sure how feasible that is.
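
One possible alternative to relying on systemd-udev-settle.service could be to order the service after the concrete device units udev creates for the NVMes. A sketch of a hypothetical drop-in (path and approach are assumptions, not what the MR implements):

# hypothetical drop-in; waits for the block devices instead of settling the whole udev queue
mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/wait-for-nvme.conf <<'EOF'
[Unit]
Wants=dev-nvme1n1.device dev-nvme2n1.device
After=dev-nvme1n1.device dev-nvme2n1.device
EOF
systemctl daemon-reload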

This incomplete array gets initialized by udev while booting but never shows up as /dev/md/openqa because of its incomplete status. I extended our script to check whether the considered NVMe disks are already part of an array (in /proc/mdstat) and to stop it if necessary by using the /dev/md* nodes. This should also help to avoid "device busy" errors.
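
A minimal sketch of that check (illustrative shell, not the exact code from the MR):

# stop any md array that already claims one of the given NVMe devices
stop_stale_arrays() {
    local dev md
    for dev in "$@"; do
        # find the md node (if any) listing this device as a member in /proc/mdstat
        md=$(awk -v d="$(basename "$dev")" '$0 ~ d {print "/dev/" $1; exit}' /proc/mdstat)
        [ -n "$md" ] && mdadm --stop "$md"
    done
}
stop_stale_arrays /dev/nvme1n1 /dev/nvme2n1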

Actions #31

Updated by nicksinger 4 months ago

MR cleaned up and extracted into a function. Also found a small bug which would result in an early exit. The latest revision was tested on worker31.

Actions #32

Updated by livdywan 4 months ago

  • Related to action #163745: [tools] tests on worker31 time out on yast2 firewall services add zone=EXT service=service:target added
Actions #33

Updated by livdywan 4 months ago

#163745 is about a different issue, but since both remove worker31 from salt as a mitigation I'm linking them for visibility

Actions #34

Updated by livdywan 4 months ago

  • Due date changed from 2024-08-20 to 2024-08-23

I'll assume we want to give this a bit more time as we decided to wait on @nicksinger rather than somebody else stepping in.

Actions #35

Updated by livdywan 4 months ago · Edited

I guess [FIRING:1] (Average Ping time (ms) alert Salt Fm02cmf4z) was due to this?

B0=269.0475  B1=309.0335714285713

The following machines were not pingable for several minutes:
  • url=worker32.oqa.prg2.suse.org
  • url=worker35.oqa.prg2.suse.org
Suggested actions:
  • Check if *you* can ping the machine (network connection within the infrastructure might be disrupted)
  • Login over ssh if possible, otherwise use a management interface, e.g. IPMI (machine could be stuck in boot process)
Actions #36

Updated by nicksinger 4 months ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-33:

#163745 is about a different issue, but since both remove worker31 from salt as a mitigation I'm linking them for visibility

As we closed the related one I enabled worker31 again and applied a highstate cleanly. As far as I can tell the issue here is resolved and the machine can boot cleanly again. If it's not strictly about a broken raid setup, please think about opening a new ticket instead of reopening this one just because it is "about worker31".

Actions #37

Updated by livdywan 4 months ago

  • Related to action #166169: Failed systemd services on worker31 / osd size:M added
Actions #38

Updated by okurz 3 months ago

  • Due date deleted (2024-08-23)