action #162485
closed: openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1718765678465&to=1718781562588 shows that openqa-worker-cacheservice failed on worker40. This is critical because, due to #162374, worker40 is currently the only worker running OSD x86_64 multi-machine tests. Details from `journalctl -u openqa-worker-cacheservice`:
Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 31.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at ….
Jun 19 09:01:09 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
…
and /var/lib/openqa/cache is empty apart from a "tmp" directory. I triggered a reboot.
Suggestions
- DONE Mitigate and workaround the problem
- Look into the system journal from that time. Maybe there are more problems reported in the same time frame.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed
Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …
Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_>
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]:
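So the kernel reports ext4 errors and a buffer I/O error on md127 in the same second in which the SQLite commit fails, which points to filesystem corruption.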
- Check health of physical storage device, e.g. SMART
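The journal check suggested above can be sketched as a small filter that counts kernel storage errors per device. This is only an illustration: the function name `storage_errors_by_device` and the regular expression are assumptions built from the two message shapes quoted in this ticket, not an exhaustive list of kernel error formats. The input would be plain journal lines, for example the output of `journalctl` for the relevant time frame.

```python
import re

# Matches the two kernel message shapes quoted in this ticket.
# Other storage-related messages exist; extend the pattern as needed.
STORAGE_ERROR = re.compile(
    r"kernel: (?:EXT4-fs error \(device (?P<ext4>\w+)\)"
    r"|Buffer I/O error on device (?P<buf>\w+))"
)


def storage_errors_by_device(journal_lines):
    """Count kernel storage error lines per block device name."""
    counts = {}
    for line in journal_lines:
        match = STORAGE_ERROR.search(line)
        if match:
            device = match.group("ext4") or match.group("buf")
            counts[device] = counts.get(device, 0) + 1
    return counts


# Lines copied from the journal excerpt in this ticket.
sample = [
    'Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"',
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed",
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …",
    "Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243",
]

print(storage_errors_by_device(sample))
```

The "callbacks suppressed" line names no device, so it is deliberately not counted. A non-empty result for the device backing /var/lib/openqa would support the filesystem or storage explanation rather than a problem in the cache service itself.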
Out of scope
- Change filesystem -> #155764
Updated by okurz 5 months ago
- Status changed from In Progress to Feedback
- Priority changed from Immediate to Low
The reboot recovered the system and the failed systemd service alert vanished. I don't know what else we could do. The timing is interesting: we see this issue right after #162374. I observed that worker40 is nearly continuously busy, so I suspect the problem will happen again. We could consider #155764 to change the filesystem, or there may be a problem in how we use SQLite. Will monitor.
Updated by okurz 5 months ago
- Related to action #155716: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S added
Updated by okurz 5 months ago
- Copied to action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added
Updated by okurz 5 months ago
- Related to action #155764: Consider switching to safer filesystems than ext2 in osd+o3 added
Updated by okurz 5 months ago
- Subject changed from [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" to [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S
- Description updated (diff)
Updated by okurz 5 months ago
- Status changed from Feedback to Resolved
# smartctl -H --all /dev/nvme2n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.65-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVL2512HCJQ-00B00
Serial Number: S675NL0TA45651
Firmware Version: GXA7601Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 6
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 504,673,882,112 [504 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 ba21a39aba
Local Time is: Thu Jun 20 15:01:56 2024 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.37W - - 0 0 0 0 0 0
1 + 8.37W - - 1 1 1 1 0 200
2 + 8.37W - - 2 2 2 2 0 200
3 - 0.0500W - - 3 3 3 3 2000 1200
4 - 0.0050W - - 4 4 4 4 500 9500
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 28 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 165%
Data Units Read: 45,475,161 [23.2 TB]
Data Units Written: 1,123,140,836 [575 TB]
Host Read Commands: 175,666,506
Host Write Commands: 2,066,160,167
Controller Busy Time: 10,788
Power Cycles: 17
Power On Hours: 2,098
Unsafe Shutdowns: 6
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 28 Celsius
Temperature Sensor 2: 36 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
So the result is "SMART overall-health self-assessment test result: FAILED!" but also "No Errors Logged". I assume SMART is just being picky and always finds something to complain about, with no specific problem that we need to act on.
So far probably nothing else to be done.
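For reference, the two health fields in the smartctl output above can be decoded with a short sketch. The bit meanings follow the SMART / Health Information log page (log 0x02) of the NVMe specification. The names `decode_critical_warning` and `summarize` are hypothetical helpers made up for this illustration, and the values passed in are the ones printed above ("Critical Warning: 0x04", "Percentage Used: 165%").

```python
# Critical Warning bit meanings per the NVMe SMART / Health Information log.
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature outside threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the description of every warning bit set in value."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]


def summarize(critical_warning, percentage_used):
    """Collect findings from the two health fields.

    "Percentage Used" is a vendor estimate of consumed endurance;
    values above 100 mean the estimate has been exceeded.
    """
    findings = decode_critical_warning(critical_warning)
    if percentage_used > 100:
        findings.append(
            f"percentage used {percentage_used}% exceeds the rated endurance estimate"
        )
    return findings


# Values taken from the smartctl output in this ticket.
for finding in summarize(0x04, 165):
    print(finding)
```

Under this reading the FAILED verdict comes from the reliability bit together with the endurance estimate, which is consistent with 575 TB written to a 512 GB device, so the empty error log alone may not mean that the device is healthy.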