action #162485

Updated by okurz 6 months ago

## Observation 
 https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1718765678465&to=1718781562588 shows that openqa-worker-cacheservice failed on worker40. This is critical because, due to #162374, worker40 is currently the only worker running OSD x86_64 multi-machine tests. Details from `journalctl -u openqa-worker-cacheservice`: 

 ``` 
 Jun 19 09:01:08 worker40 systemd[1]: Started OpenQA Worker Cache Service. 
 Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache" 
 Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/SQLite/Transaction.pm line 31. 
 Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED 
 Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Failed with result 'exit-code'. 
 Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Scheduled restart job, restart counter is at …. 
 Jun 19 09:01:09 worker40 systemd[1]: Stopped OpenQA Worker Cache Service. 
 … 
 ``` 

 and `/var/lib/openqa/cache` is empty apart from a "tmp" directory. I triggered a reboot. 

 ## Suggestions 
 * *DONE* Mitigate and work around the problem 
 * Look into the system journal from that time. Maybe more problems were reported in the same time frame. 

 ``` 
 Jun 19 09:01:08 worker40 openqa-workercache-daemon[33982]: [33982] [i] Creating cache directory tree for "/var/lib/openqa/cache" 
 Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed 
 Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm … 
 Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243 
 Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_> 
 Jun 19 09:01:09 worker40 systemd[1]: openqa-worker-cacheservice.service: Main process exited, code=exited, status=5/NOTINSTALLED 
 Jun 19 09:01:09 worker40 systemd[1]: 
 ``` 

 So the kernel log points to filesystem corruption on md127. 
    
 * Check health of physical storage device, e.g. SMART 
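
 To make the journal review above repeatable, the following is a minimal sketch that counts kernel storage errors per block device in journal output. The function name `storage_errors_by_device` and the regular expression are illustrative assumptions derived only from the two kernel message shapes quoted in this ticket; other error formats would need additional patterns.

```python
import re
from collections import Counter

# Assumption: only the two kernel message shapes quoted in this ticket are
# matched. The "callbacks suppressed" summary line names no device and is
# intentionally not counted.
KERNEL_STORAGE_ERROR = re.compile(
    r"kernel: (?:EXT4-fs error \(device (?P<ext4>\w+)\)"
    r"|Buffer I/O error on device (?P<buf>\w+))"
)


def storage_errors_by_device(lines):
    """Count kernel storage error lines per block device name."""
    counts = Counter()
    for line in lines:
        match = KERNEL_STORAGE_ERROR.search(line)
        if match:
            counts[match.group("ext4") or match.group("buf")] += 1
    return counts


# Sample lines copied from the journal excerpt in this ticket (truncated as-is).
SAMPLE = [
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error: 356 callbacks suppressed",
    "Jun 19 09:01:08 worker40 kernel: EXT4-fs error (device md127): ext4_check_bdev_write_error:218: comm …",
    "Jun 19 09:01:09 worker40 kernel: Buffer I/O error on device md127, logical block 67968243",
    "Jun 19 09:01:09 worker40 openqa-workercache-daemon[33982]: DBD::SQLite::db commit failed: disk I/O error at /usr/lib/perl5/vendor_>",
]

if __name__ == "__main__":
    for device, count in storage_errors_by_device(SAMPLE).items():
        print(f"{device}: {count} kernel storage error line(s)")
```

 In practice the lines would come from `journalctl --since … --until …` for the window around the failure instead of the hard-coded sample. Since md127 is a Linux software RAID device name, the SMART check would presumably apply to its member disks and not to md127 itself.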

 ## Out of scope 
 * Change filesystem -> #155764
