action #162293
closed - openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
SMART errors on bootup of worker31, worker32 and worker34 size:M
Description
Observation
While struggling with worker31, worker32 and worker34, which upgraded themselves to Leap 15.6 and then crashed multiple times after booting into kernel 6.4, we observed SMART errors shown early during bootup. These might explain the kernel crashes or might be separate errors. We downgraded the machines to Leap 15.5 for now and took them out of production, but they still run as openQA workers.
Acceptance criteria
- AC1: w31 boots up fine without SMART errors
- AC2: w32 boots up fine without SMART errors
- AC3: w33 boots up fine without SMART errors
Steps to reproduce
- reboot worker31 and then follow the output on
ssh -t jumpy@qe-jumpy.prg2.suse.org "ipmitool -I lanplus -H openqaworker31.qe-ipmi-ur -U … -P … sol activate"
- observe SMART errors very early during firmware initialization
Suggestions
- Check the content of /var/crash and clean up after investigation
- Check the status of SMART from the running Linux system and then also the messages on bootup
- Crosscheck the SMART status on other salt-controlled machines; at least the same was observed on w32
- Consider replacing defective hardware
- Ensure there are no failed services again
- Bring the system back into production
Rollback steps
hostname=worker31.oqa.prg2.suse.org; ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
ssh osd "sudo salt 'worker31.oqa.prg2.suse.org' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"
Updated by okurz 6 months ago
- Copied from action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Updated by okurz 6 months ago
- Copied to action #162296: openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S added
Updated by okurz 5 months ago
- Target version changed from Tools - Next to Ready
Priority should be to bring the workers back into salt due to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/872#note_651248 regardless of the SMART errors, assuming they are not critical.
Updated by nicksinger 5 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 5 months ago
starting out with worker31 I can see:
worker31:~ # smartctl -a /dev/nvme0n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZPLJ6T4HALA-00007
Serial Number: S55KNC0TA00961
Firmware Version: EPK9CB5Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 6,401,252,745,216 [6.40 TB]
Unallocated NVM Capacity: 0
Controller ID: 65
NVMe Version: 1.3
Number of Namespaces: 32
Namespace 1 Size/Capacity: 6,401,252,745,216 [6.40 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Jul 26 09:59:26 2024 CEST
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x00df): Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Vrt_Mngmt
Optional NVM Commands (0x007f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 87 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W 19.00W - 0 0 0 0 180 180
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 1
1 - 512 8 3
2 - 4096 0 0
3 - 4096 8 2
4 - 4096 64 3
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 3,093,267 [1.58 TB]
Data Units Written: 25,192,745 [12.8 TB]
Host Read Commands: 14,666,659
Host Write Commands: 617,095,582
Controller Busy Time: 17
Power Cycles: 19
Power On Hours: 9,577
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 35 Celsius
Temperature Sensor 2: 33 Celsius
Temperature Sensor 3: 33 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
which looks strange. I looked at the SMART FAQ and found https://www.smartmontools.org/wiki/FAQ#ATAdriveisfailingself-testsbutSMARThealthstatusisPASSED.Whatsgoingon - having bad blocks because of sudden outages sounds like it applies to our situation as well. I will now try to follow https://www.smartmontools.org/wiki/BadBlockHowto to see if I can bring the device back into a clean state without replacing hardware.
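A minimal sketch of the filesystem-agnostic repair described in the BadBlockHowto, assuming a hypothetical unreadable LBA 1234567 (the real LBA would have to come from the kernel log or a self-test):
# read-test the suspect 512-byte block; an I/O error here confirms it is unreadable
dd if=/dev/nvme0n1 of=/dev/null bs=512 skip=1234567 count=1 iflag=direct
# overwrite the same block so the controller rewrites it on fresh flash (destroys that block's contents)
dd if=/dev/zero of=/dev/nvme0n1 bs=512 seek=1234567 count=1 oflag=direct conv=notrunc
# re-check the health status afterwards
smartctl -H /dev/nvme0n1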
Updated by nicksinger 5 months ago
Indeed the described method of writing the affected block back to the disk resolved the issue. I accomplished that the brute-force way by executing a full btrfs balance (which rewrites every block to disk again) with btrfs balance start --full-balance /. After this we can also see that SMART is happy again:
worker31:~ # smartctl -x /dev/nvme0n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.68-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZPLJ6T4HALA-00007
Serial Number: S55KNC0TA00961
Firmware Version: EPK9CB5Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 6,401,252,745,216 [6.40 TB]
Unallocated NVM Capacity: 0
Controller ID: 65
NVMe Version: 1.3
Number of Namespaces: 32
Namespace 1 Size/Capacity: 6,401,252,745,216 [6.40 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Jul 26 10:44:35 2024 CEST
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x00df): Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Vrt_Mngmt
Optional NVM Commands (0x007f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 87 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W 19.00W - 0 0 0 0 180 180
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 512 0 1
1 - 512 8 3
2 - 4096 0 0
3 - 4096 8 2
4 - 4096 64 3
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 36 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 3,238,297 [1.65 TB]
Data Units Written: 25,301,676 [12.9 TB]
Host Read Commands: 15,249,973
Host Write Commands: 617,549,916
Controller Busy Time: 17
Power Cycles: 19
Power On Hours: 9,578
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 36 Celsius
Temperature Sensor 2: 34 Celsius
Temperature Sensor 3: 34 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
I also noticed we have 100% spare blocks left on that NVMe, so I think it is safe to assume we don't have a hardware issue here. I'm now going to research how to rewrite all blocks of the RAID0 device we have on nvme1 and nvme2.
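One brute-force way to rewrite every block of the two RAID members, sketched here under the assumption that the RAID0 (/dev/md/openqa) only holds disposable openQA pool data and can be recreated afterwards:
# stop the array so the member devices are no longer busy
mdadm --stop /dev/md/openqa
# destructive write+verify pass rewrites every block of each member
badblocks -wsv /dev/nvme1n1
badblocks -wsv /dev/nvme2n1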
Updated by nicksinger 5 months ago
Oh, the situation is different on the other two NVMes; they both report as model "SAMSUNG MZVL2512HCJQ-00B00" which seems to be a "980 Pro". I found kernel bug reports: https://bugzilla.kernel.org/show_bug.cgi?id=217445 - they explain that the kernel cannot really do anything about it, so I was thinking about upgrading the firmware. There were a lot of rumors about the 980 Pros in the past, so it is worth updating anyway. I'm currently figuring out how to do this. fwupd unfortunately doesn't work, so I have to resort to some strange vendor tools found on https://semiconductor.samsung.com/consumer-storage/support/tools/
Updated by openqa_review 5 months ago
- Due date set to 2024-08-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 5 months ago
I was able to flash the newest (?) firmware with:
worker31:~ # nvme fw-activate --slot 0x1 --action 0x1 /dev/nvme1
worker31:~ # nvme fw-download --fw /home/nsinger/GXA7801Q_Noformat.bin /dev/nvme1
The most trustworthy source for that file was https://help.ovhcloud.com/csm/en-dedicated-servers-samsung-nvme-firmware-upgrade?id=kb_article_view&sysparm_article=KB0060093, but no improvement. After a reboot, smartctl -a /dev/nvme1n1 still shows:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
[…]
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
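A quick way to double-check which firmware revision is actually active after the reboot (a sketch; the device path matches the one above, the exact output format depends on the nvme-cli version):
# firmware revision the controller currently reports
nvme id-ctrl /dev/nvme1 | grep -i '^fr '
# firmware slot log: which slot is active and what each slot contains
nvme fw-log /dev/nvme1
# smartctl shows the firmware version in its information section as well
smartctl -i /dev/nvme1n1 | grep -i firmware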
Currently badblocks (badblocks -wsv /dev/nvme1n1) is running and so far no pattern showed problems (4 of them already finished, I think), but the message from smartctl also does not go away like it did with the first drive after "writing" the bad block(s) again (due to the full btrfs balance I did).
After this finishes I want to issue the NVMe self-test (nvme device-self-test), because up until now it is very hard to argue to our vendor that this drive is actually defective other than through the (maybe erroneous) SMART messages we see.
Updated by nicksinger 5 months ago
With the exception of openqaworker1.qe.nue2.suse.org (which has an Intel NVMe), we see the problem on many Samsung NVMes:
openqa:~ # salt '*' cmd.run 'for dev in /dev/nvme?; do smartctl -a "${dev}" | grep FAILED && echo ${dev} && smartctl -a "${dev}" | grep Model; done'
s390zl13.oqa.prg2.suse.org:
s390zl12.oqa.prg2.suse.org:
backup-qam.qe.nue2.suse.org:
storage.qe.prg2.suse.org:
unreal6.qe.nue2.suse.org:
osiris-1.qe.nue2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker18.qa.suse.cz:
ada.qe.prg2.suse.org:
sapworker3.qe.nue2.suse.org:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
qesapworker-prg4.qa.suse.cz:
openqaw5-xen.qe.prg2.suse.org:
qesapworker-prg6.qa.suse.cz:
worker40.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
sapworker1.qe.nue2.suse.org:
openqa.suse.de:
openqaworker14.qa.suse.cz:
baremetal-support.qe.nue2.suse.org:
worker33.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
worker-arm2.oqa.prg2.suse.org:
qamaster.qe.nue2.suse.org:
worker34.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
worker35.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
worker29.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
worker-arm1.oqa.prg2.suse.org:
openqaworker1.qe.nue2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: INTEL SSDPEKNW010T8
openqaworker17.qa.suse.cz:
worker30.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
schort-server.qe.nue2.suse.org:
backup-vm.qe.nue2.suse.org:
worker32.oqa.prg2.suse.org:
SMART overall-health self-assessment test result: FAILED!
/dev/nvme1
Model Number: SAMSUNG MZVL2512HCJQ-00B00
SMART overall-health self-assessment test result: FAILED!
/dev/nvme2
Model Number: SAMSUNG MZVL2512HCJQ-00B00
monitor.qe.nue2.suse.org:
sapworker2.qe.nue2.suse.org:
tumblesle.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
jenkins.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqa-piworker.qe.nue2.suse.org:
mania.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
ERROR: Minions returned with non-zero exit code
I've now written a mail to happyware asking for support on this.
While researching some details for my mail, I noticed that these disks apparently are rated for "300 TBW". We exceed this with all of the failing disks. I'm starting to think that it might simply be normal behavior for the disk to report itself as failed as soon as this threshold is reached.
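For reference, a small sketch of how the written amount can be derived from smartctl output (per the NVMe spec one "data unit" is 1000 512-byte blocks, i.e. 512,000 bytes; the device path is an example):
# extract "Data Units Written" and convert to TB written
units=$(smartctl -a /dev/nvme1n1 | awk '/Data Units Written/ {gsub(",", "", $4); print $4}')
echo "scale=2; $units * 512000 / 10^12" | bc
For the healthy 6.4 TB drive above this yields roughly 12.9 TB; the failing 980 Pros report several hundred TB written, well past their 300 TBW rating.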
Updated by nicksinger 5 months ago
- Status changed from In Progress to Feedback
To quote myself from Slack: "I think we need to have a talk about when we consider replacing NVMe disks in our systems… I just noticed that all of the failing disks already have huge amounts of data written to them (500TB++) while their rated endurance is 300TBW. But apparently our workload is not really bad (?) and all of these disks still report 100% spare available and work perfectly fine." I'll try to drive this discussion in parallel while waiting for feedback from happyware.
Updated by okurz 5 months ago
- Related to action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized added
Updated by okurz 5 months ago
- Related to deleted (action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized)
Updated by nicksinger 5 months ago
- Status changed from Feedback to Resolved
worker3{1,2,3} are back in salt and the highstate applied successfully. We discussed the topic of replacing the NVMes and decided that we don't want to introduce new metrics and would rather wait for jobs to fail before we actually replace the hardware. The warning while booting cannot be disabled in the BIOS, so it increases the reboot time by a minute or so. As we cannot do anything more I'm resolving this now despite the ACs not being fulfilled.
Updated by livdywan 5 months ago
- Status changed from Resolved to Workable
ID: /var/lib/openqa/share
Function: mount.mounted
Result: False
Comment: Unable to unmount /var/lib/openqa/share: umount: /var/lib/openqa/share: not mounted..
Started: 10:14:35.253985
Duration: 93.574 ms
Changes:
----------
umount:
Forced unmount and mount because options (ro) changed
Summary for worker31.oqa.prg2.suse.org
I feel like something did not go well here, though?
Updated by nicksinger 5 months ago
Yes, I'm investigating. I've disabled worker31 again for now.
Updated by openqa_review 5 months ago
- Due date set to 2024-08-20
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 5 months ago
So the issue is that for some reason openqa_nvme_format.service does not format the NVMe drives in time and therefore every subsequent service relying on the mountpoint fails. Looking at the logs of the service I can see:
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: nvme1n1 259:5 0 476.9G 0 disk
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: └─md127 9:127 0 0B 0 md
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19132]: nvme2n1 259:6 0 476.9G 0 disk
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19137]: /dev/nvme0n1p2[/@/.snapshots/870/snapshot]
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19129]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19147]: mdadm: cannot open /dev/nvme1n1: Device or resource busy
Aug 06 12:07:09 worker31 openqa-establish-nvme-setup[19129]: Waiting 10 seconds before trying again after failing due to busy device.
Aug 06 12:07:19 worker31 openqa-establish-nvme-setup[19129]: Trying RAID0 creation again after timeout (attempt 2 of 10)
Aug 06 12:07:19 worker31 openqa-establish-nvme-setup[19129]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme1n1 /dev/nvme2n1
So for some reason there is md127, but only on one disk. I'm not sure yet why this happens and why the script can't handle it (I'm pretty certain it should). Stopping the service, then the existing "raid", and restarting the service works as expected.
Running the script without unmounting produces a similar result, so maybe some mount units are mounting the raid before the script has a chance to reformat the drives. I'll look for further clues and differences with other workers.
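The manual recovery described above as a sketch (unit, array and device names taken from the logs in this ticket):
systemctl stop openqa_nvme_format.service
# stop the stale, incomplete array that claims one of the NVMe disks
mdadm --stop /dev/md127
systemctl start openqa_nvme_format.service
# verify that /dev/md/openqa got assembled from both NVMe disks
cat /proc/mdstat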
Updated by nicksinger 5 months ago
- Status changed from In Progress to Feedback
I boiled down my findings into suggested changes to our current script: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1246
On worker31 we often have /dev/md127 present as an incomplete array while booting, before our script runs. I assume this can happen if our script runs in parallel to udev, because we rely on symlinks in /dev (/dev/nvme?n1) and only one of the two nvme symlinks is present. My approach of ordering our service "After=systemd-udev-settle.service" has its own problems though, see https://www.freedesktop.org/software/systemd/man/latest/systemd-udev-settle.service.html - but I don't really know how to implement the suggested alternative easily. We could also rewrite our script to just run before udev, but I'm not sure how feasible that is.
This incomplete array gets initialized by udev while booting but never shows up in /dev/md/openqa because of its incomplete status. I extended our script to check whether the considered NVMe disks are already part of an array (in /proc/mdstat) and to stop it if necessary using the /dev/md* nodes. This should also help to avoid "device busy" errors.
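A minimal sketch of such a check, not the exact code from the MR (the helper name and the device list are illustrative):
stop_stale_arrays() {
    local dev md
    for dev in nvme1n1 nvme2n1; do
        # find md arrays in /proc/mdstat that already reference this disk and stop them
        for md in $(awk -v d="$dev" '$0 ~ d {print $1}' /proc/mdstat); do
            echo "Stopping stale array /dev/$md referencing $dev"
            mdadm --stop "/dev/$md"
        done
    done
}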
Updated by nicksinger 4 months ago
MR cleaned up and extracted into a function. Also found a small bug which would result in an early exit. The latest revision was tested on worker31.
Updated by livdywan 4 months ago
- Related to action #163745: [tools] tests on worker31 time out on yast2 firewall services add zone=EXT service=service:target added
Updated by livdywan 4 months ago
- Due date changed from 2024-08-20 to 2024-08-23
I'll assume we want to give this a bit more time as we decided to wait on @nicksinger rather than somebody else stepping in.
Updated by livdywan 4 months ago · Edited
I guess [FIRING:1] (Average Ping time (ms) alert Salt Fm02cmf4z) was due to this?
B0=269.0475 B1=309.0335714285713
The following machines were not pingable for several minutes:
* url=worker32.oqa.prg2.suse.org
* url=worker35.oqa.prg2.suse.org
Suggested actions:
* Check if *you* can ping the machine (network connection within the infrastructure might be disrupted)
* Login over ssh if possible, otherwise use a management interface, e.g. IPMI (machine could be stuck in boot process)
Updated by nicksinger 4 months ago
- Status changed from Feedback to Resolved
livdywan wrote in #note-33:
#163745 is about a different issue, but since both remove worker31 from salt as a mitigation I'm linking them for visibility
As we closed the related ticket I enabled worker31 again and applied a highstate cleanly. As far as I can tell the issue here is resolved and the machine boots cleanly again. If it's not strictly about a broken RAID setup, please consider opening a new ticket instead of reopening this one just because it is "about worker31".
Updated by livdywan 4 months ago
- Related to action #166169: Failed systemd services on worker31 / osd size:M added