action #162485
Updated by okurz 6 months ago
## Observation
https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1718765678465&to=1718781562588 shows that openqa-worker-cacheservice failed on worker40. This is critical because, due to #162374, worker40 is the only worker running OSD x86_64 multi-machine tests. Details from `journalctl -u openqa-worker-cacheservice`:
```
Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 31.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'.
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at ….
Jun 19 09:01:09 worker40 systemd[1]: Stopped OpenQA Worker Cache Service.
…
```
and /var/lib/openqa/cache is empty apart from a "tmp" directory. I triggered a reboot.
## Suggestions
* *DONE* Mitigate and work around the problem
* Look into the system journal from that time. Maybe there are more problems reported in the same time frame.
```
Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache"
Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed
Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …
Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243
Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_>
Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 19 09:01:09 worker40 systemd[1]:
```
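so the kernel reports EXT4 and buffer I/O errors on md127 in the same second as the SQLite commit failure, which points to filesystem corruption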
* Check health of physical storage device, e.g. SMART
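
For the journal suggestion above, a minimal sketch of how one might filter a journal excerpt for storage-related messages. The function name `find_storage_errors` and the pattern list are hypothetical; the patterns are only the three message fragments quoted in this ticket, not a complete list of what the kernel or SQLite can report.

```python
import re

# Assumption: these patterns are just the message fragments quoted in this
# ticket. Extend the list for other filesystems or drivers as needed.
STORAGE_ERROR_PATTERNS = [
    r"EXT4-fs error",
    r"Buffer I/O error",
    r"disk I/O error",
]
_STORAGE_ERROR_RE = re.compile("|".join(STORAGE_ERROR_PATTERNS))


def find_storage_errors(journal_lines):
    """Return the journal lines that match one of the storage error patterns."""
    return [
        line.rstrip("\n")
        for line in journal_lines
        if _STORAGE_ERROR_RE.search(line)
    ]


# Sample input copied from the journal excerpt in this ticket.
sample = [
    "Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service.",
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed",
    "Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243",
]
for hit in find_storage_errors(sample):
    print(hit)
```

In practice the input could come from a file saved with `journalctl --since … --until … > journal.txt` and then be passed in as `find_storage_errors(open("journal.txt"))`. The time window has to be chosen around the failure.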
## Out of scope
* Change filesystem -> #155764