action #162485

closed

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S

Added by okurz 28 days ago. Updated 27 days ago.

Status: Resolved
Priority: Low
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1718765678465&to=1718781562588 shows that openqa-worker-cacheservice failed on worker40. This is critical because, due to #162374, worker40 is the only worker currently running OSD x86_64 multi-machine tests. Details from `journalctl -u openqa-worker-cacheservice`:

Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 31.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at ….
Jun 19 09:01:09 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
…

and /var/lib/openqa/cache is empty apart from a "tmp" directory. I triggered a reboot.

Suggestions

  • DONE Mitigate and work around the problem
  • Look into the system journal from that time. Maybe there are more problems reported in the same time frame.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed
Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …
Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_>
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]:

So this looks like filesystem corruption.
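
The correlation done by eye above can also be scripted. The following is a minimal sketch, not part of openQA: it scans syslog-style journal lines for kernel storage errors logged close to the first cache service failure. The function names, the 60-second window and the assumed year 2024 (taken from the ticket's start date) are illustrative assumptions, and the pattern list only covers the two kernel message prefixes seen in this ticket.

```python
import re
from datetime import datetime, timedelta

# Kernel message prefixes seen in this ticket; the list is not exhaustive.
STORAGE_ERROR = re.compile(r"kernel: (EXT4-fs error|Buffer I/O error)")

# Syslog-style prefix as printed by journalctl, e.g. "Jun 19 09:01:08 ".
TIMESTAMP = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) ")


def parse_time(line, year=2024):
    """Return the timestamp of a journal line, or None if it has none.

    The short journal format carries no year, so one has to be assumed.
    """
    match = TIMESTAMP.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def storage_errors_near(lines, failure_marker, window_seconds=60):
    """Return kernel storage errors logged within window_seconds of the
    first line that contains failure_marker."""
    failure_time = next(
        (parse_time(line) for line in lines if failure_marker in line), None
    )
    if failure_time is None:
        return []
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp = parse_time(line)
        if stamp is None or not STORAGE_ERROR.search(line):
            continue
        if abs(stamp - failure_time) <= window:
            hits.append(line)
    return hits


# Lines copied from the journal excerpt in this ticket (last one shortened).
SAMPLE = [
    'Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] '
    'Creating cache directory tree for "/var/lib/openqa/cache"',
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed",
    "Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, "
    "logical block 67968243",
    "Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: "
    "DBD::SQLite::db commit failed: disk I/O error",
]

if __name__ == "__main__":
    for hit in storage_errors_near(SAMPLE, "DBD::SQLite::db commit failed"):
        print(hit)
```

On a live worker the input lines could come from something like `journalctl --since "2024-06-19 09:00" --until "2024-06-19 09:05"` instead of the embedded sample.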

  • Check health of physical storage device, e.g. SMART
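
A rough sketch of how the SMART check could be automated, assuming smartmontools 7.0 or newer (for the `-j` JSON output) and root privileges. Since md127 is an MD array, SMART would have to be queried on its member devices (for example as listed by `mdadm --detail /dev/md127`); the ticket does not name them, so the device path in the usage comment is a placeholder. The helper names are made up for illustration.

```python
import json
import subprocess


def smart_passed(report):
    """Return True or False from a smartctl JSON report, or None if the
    report carries no overall health verdict."""
    status = report.get("smart_status")
    if not isinstance(status, dict):
        return None
    return status.get("passed")


def query_smart(device):
    """Run `smartctl -H -j <device>` and return the parsed JSON report.

    smartctl signals findings through a non-zero exit status, so the return
    code is deliberately not treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-H", "-j", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


# Usage on a worker (placeholder device name, not taken from the ticket):
#   print(smart_passed(query_smart("/dev/nvme0n1")))
```

A verdict of `False` or `None` would be a reason to look at the full `smartctl -a` output for the device before trusting it again.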

Out of scope


Related issues 3 (2 open, 1 closed)

Related to openQA Project - action #155716: [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S (Resolved, mkittler, 2024-02-21, 2024-03-07)

Related to openQA Infrastructure - action #155764: Consider switching to safer filesystems than ext2 in osd+o3 (New)

Copied to openQA Infrastructure - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry (Blocked, okurz)
