action #162485

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S

Added by okurz 6 months ago. Updated 6 months ago.

Status: Resolved
Priority: Low
Assignee: -
Category: Regressions/Crashes
Start date: 2024-06-19
Due date: -
% Done: 0%
Estimated time: -
Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1718765678465&to=1718781562588 shows that openqa-worker-cacheservice failed on worker40. This is critical because, due to #162374, worker40 is the only worker running OSD x86_64 multi-machine tests. Details from `journalctl -u openqa-worker-cacheservice`:

Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 31.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at ….
Jun 19 09:01:09 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
…

Also, /var/lib/openqa/cache is empty apart from a "tmp" directory. I triggered a reboot.

Suggestions

  • DONE Mitigate and work around the problem
  • Look into the system journal from that time. Maybe there are more problems reported in the same time frame.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed
Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …
Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_>
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]:

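So this points to filesystem corruption: the kernel reports EXT4 and buffer I/O errors on md127 at the same time as the SQLite commit failure.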

  • Check health of physical storage device, e.g. SMART
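The journal check suggested above could be automated roughly as follows. This is a minimal sketch, not part of openQA: the function names, the pattern list and the `year` default are assumptions made for illustration, and the sample lines are copied from the excerpts in this ticket. It filters syslog-style journal lines for storage-related messages so they can be compared against the time of the service failure.

```python
import re
from datetime import datetime

# Syslog-style prefix as in the excerpts above:
# "Jun 19 09:01:09 worker40 kernel: message"
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) ([^:\[\s]+)(?:\[\d+\])?: (.*)$"
)

# Substrings taken from the messages quoted in this ticket (assumed list).
STORAGE_PATTERNS = ("EXT4-fs error", "Buffer I/O error", "disk I/O error")


def parse_line(line, year=2024):
    """Split one journal line into (timestamp, unit, message), or None.

    The short journal format carries no year, so it is passed in.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, _host, unit, message = match.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, unit, message


def storage_errors(lines):
    """Return parsed entries whose message mentions a storage problem."""
    parsed = (parse_line(line) for line in lines)
    return [
        entry for entry in parsed
        if entry and any(p in entry[2] for p in STORAGE_PATTERNS)
    ]


SAMPLE = [
    "Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.",
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed",
    "Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243",
    "Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.",
]

for when, unit, message in storage_errors(SAMPLE):
    print(when.isoformat(), unit, message)
```

Feeding it the output of `journalctl` for the affected time window (without `-u`, so kernel messages are included) would list the kernel and service messages that mention I/O errors side by side.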

Out of scope


Related issues 3 (1 open, 2 closed)

Related to openQA Project (public) - action #155716: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S (Resolved, mkittler, 2024-02-21, 2024-03-07)

Related to openQA Infrastructure (public) - action #155764: Consider switching to safer filesystems than ext2 in osd+o3 (New)

Copied to openQA Infrastructure (public) - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) (Resolved, livdywan)

Actions #1

Updated by okurz 6 months ago

  • Parent task set to #111929
Actions #2

Updated by okurz 6 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Immediate to Low

A reboot recovered the system and the failed systemd service alert vanished. I don't know what else we could do. It's interesting timing that we see this issue now, right after #162374. I observed that worker40 is nearly continuously busy, so I suspect the problem will happen again. We could consider #155764 to change the filesystem, or there may be a problem in how we use SQLite. Will monitor.
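
To help tell a damaged database file apart from a failing disk, SQLite's built-in `PRAGMA integrity_check` can be run against the cache database. The following is a minimal sketch using Python's standard `sqlite3` module; the helper name `sqlite_health` is hypothetical, and the example path in the comment is an assumption, since the ticket only names the cache directory and not the database file.

```python
import sqlite3


def sqlite_health(path):
    """Return the rows reported by PRAGMA integrity_check.

    A healthy database yields ["ok"]. Any error while opening or
    reading the file is returned as a single "error: ..." entry.
    """
    try:
        # Open read-only so the check itself never writes to the file.
        con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            return [row[0] for row in con.execute("PRAGMA integrity_check")]
        finally:
            con.close()
    except sqlite3.DatabaseError as exc:
        return [f"error: {exc}"]


# Example call; the file name below is an assumption, not taken from this ticket:
# print(sqlite_health("/var/lib/openqa/cache/cache.sqlite"))
```

If the check reports an error while the kernel log shows I/O errors at the same time, the storage layer is the more likely culprit than the way SQLite is used.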

Actions #3

Updated by okurz 6 months ago

  • Related to action #155716: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S added
Actions #4

Updated by livdywan 6 months ago

2024-06-19 18:06:00 worker40    var-lib-openqa-share.automount
Actions #5

Updated by okurz 6 months ago

  • Copied to action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added
Actions #6

Updated by okurz 6 months ago

  • Related to action #155764: Consider switching to safer filesystems than ext2 in osd+o3 added
Actions #7

Updated by okurz 6 months ago

  • Subject changed from [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" to [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S
  • Description updated (diff)
Actions #8

Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved
# smartctl -H --all /dev/nvme2n1
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150500.55.65-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVL2512HCJQ-00B00
Serial Number:                      S675NL0TA45651
Firmware Version:                   GXA7601Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            504,673,882,112 [504 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 ba21a39aba
Local Time is:                      Thu Jun 20 15:01:56 2024 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.37W       -        -    0  0  0  0        0       0
 1 +     8.37W       -        -    1  1  1  1        0     200
 2 +     8.37W       -        -    2  2  2  2        0     200
 3 -   0.0500W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    165%
Data Units Read:                    45,475,161 [23.2 TB]
Data Units Written:                 1,123,140,836 [575 TB]
Host Read Commands:                 175,666,506
Host Write Commands:                2,066,160,167
Controller Busy Time:               10,788
Power Cycles:                       17
Power On Hours:                     2,098
Unsafe Shutdowns:                   6
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               36 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

So the report says "SMART overall-health self-assessment test result: FAILED!" but also "No Errors Logged". I assume SMART is just being picky and always finds something to complain about, with no specific problem that we need to act on.

So far probably nothing else to be done.
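
For reading the two most telling fields of the report above, here is a small sketch that extracts "Critical Warning" and "Percentage Used" from smartctl text output and decodes the warning bits. The bit meanings follow the NVMe SMART/Health log definition and use the wording smartctl prints; the function names are hypothetical. In the NVMe specification, "Percentage Used" is a vendor estimate of consumed endurance and may exceed 100, so the 165% above suggests the drive is past its rated write endurance even though no media errors are logged.

```python
import re

# Critical Warning bits of the NVMe SMART/Health log (Log 0x02),
# described with the wording smartctl uses.
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
}


def decode_critical_warning(value):
    """Return the descriptions of all bits set in the Critical Warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]


def parse_health(report):
    """Extract (critical_warning, percentage_used) from smartctl text output.

    Either element is None if the corresponding line is missing.
    """
    warning = re.search(r"^Critical Warning:\s+(0x[0-9a-fA-F]+)", report, re.M)
    used = re.search(r"^Percentage Used:\s+(\d+)%", report, re.M)
    return (
        int(warning.group(1), 16) if warning else None,
        int(used.group(1)) if used else None,
    )


# Two lines copied from the report in this ticket.
REPORT = (
    "Critical Warning:                   0x04\n"
    "Percentage Used:                    165%\n"
)

warning, used = parse_health(REPORT)
print(decode_critical_warning(warning), used)
```

A check like this could flag drives whose endurance estimate is exhausted before they start returning I/O errors.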
