action #155716

Updated by livdywan 3 months ago

## Observation 
 From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1 

 `ssh worker29.oqa.prg2.suse.org "journalctl -u openqa-worker-cacheservice"` says 

 ``` 
 Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error at /u> 
 Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Killing processes accessing the database file handles and removing database 
 ``` 

 ## Acceptance criteria 
 * **AC1:** Cache service on worker29 works again 
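
A minimal sketch of how AC1 could be spot-checked, assuming the unit name from the `journalctl` call above. The function name `check_cache_service` is hypothetical. It only confirms that systemd reports the unit as active, not that the cache actually serves assets:

```shell
# Sketch: return 0 if the cache service unit is active on the given worker.
# check_cache_service is a hypothetical helper, not part of openQA.
check_cache_service() {
    host="$1"   # e.g. worker29.oqa.prg2.suse.org
    # "is-active --quiet" prints nothing and reports the state via its exit code
    ssh "$host" "systemctl is-active --quiet openqa-worker-cacheservice"
}
```

Example: `check_cache_service worker29.oqa.prg2.suse.org && echo "cache service active"`.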

 ## Suggestions 
 * *DONE* Add silence(s) 
 * Gather logs that help with debugging, especially before the machine is rebooted 
 * Maybe ext2 is just unreliable -> yes, it is. A reboot of the machine already fixed the problem because we recreate the filesystem automatically 
 * Create a separate ticket for the related fallout of the reboot-triggered problem 
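
Gathering logs before a reboot could look roughly like the sketch below. The function name `gather_logs`, the local output directory, and the default time window are assumptions for illustration. Only the host and unit name come from this ticket. `--no-pager` avoids the line truncation visible in the excerpt above:

```shell
# Sketch: copy cache service and kernel logs from a worker to local files
# before the machine is rebooted. gather_logs is a hypothetical helper.
gather_logs() {
    host="$1"                      # e.g. worker29.oqa.prg2.suse.org
    outdir="${2:-./logs-$host}"    # assumed local target directory
    since="${3:-1 day ago}"        # assumed time window
    mkdir -p "$outdir" || return 1
    # Full, untruncated journal of the cache service unit
    ssh "$host" "journalctl --no-pager -u openqa-worker-cacheservice --since '$since'" \
        > "$outdir/cacheservice.log"
    # Kernel messages, where disk I/O errors would usually be reported
    ssh "$host" "journalctl --no-pager -k --since '$since'" \
        > "$outdir/kernel.log"
}
```

Example: `gather_logs worker29.oqa.prg2.suse.org ./poo155716-logs "2 days ago"`.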

 ## Rollback actions 
 * Remove silence `alertname=Failed systemd services alert` from https://monitor.qa.suse.de/alerting/silences 
 * Remove silence `alertname=Broken workers alert` from https://monitor.qa.suse.de/alerting/silences 

 ## Out of scope 
 * Using another filesystem
