action #116722

Updated by okurz over 1 year ago

## Observation 
 Multiple alerts are firing on https://monitor.qa.suse.de/alerting/list?state=not_ok , and all of them seem to point to openqa.suse.de itself being down. The first alert that sent an email seems to be "Failed systemd services" https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1663476407099&to=1663478162658&viewPanel=2 , starting at 2022-09-18 06:50.
 There seems to be a sudden increase in CPU load, visible in
 https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000 , going to 100, with spotty responses from telegraf and then no data at all soon afterwards. There is also a significant increase in disk I/O requests, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000&viewPanel=119

 ## Impact 
 openqa.suse.de does not respond even to ping and is unreachable for both workers and users, so it is completely unusable.

 ## Suggestions 
 * *DONE* Increase memory of the VM 
 * *DONE* Find the root cause -> soft lockups in kernel since the latest kernel upgrade 
 * *DONE* Report product issue 
 * *DONE* Apply a workaround, e.g. downgrade the kernel version
 * *DONE* Apply kernel.panic = 60 
 * Wait for product issue to be resolved and upgrade kernel again 
 * Optional: Review the changelog diff of the kernel
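
The `kernel.panic = 60` item above could be verified with a small script. The following is only a minimal sketch: the drop-in content in `EXAMPLE_CONF` and the helper names `parse_sysctl` and `current_panic_timeout` are hypothetical illustrations, not what was actually deployed on openqa.suse.de.

```python
from pathlib import Path


def parse_sysctl(text: str) -> dict[str, str]:
    """Parse sysctl.conf-style text into a {key: value} mapping."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def current_panic_timeout(proc_path: str = "/proc/sys/kernel/panic") -> int:
    """Return the live kernel.panic value: seconds until reboot after a panic."""
    return int(Path(proc_path).read_text().strip())


# Hypothetical drop-in content, e.g. for a file under /etc/sysctl.d/
# (assumption: the real file name and location are not given in the ticket).
EXAMPLE_CONF = """\
# Reboot automatically 60 seconds after a kernel panic
kernel.panic = 60
"""

wanted = parse_sysctl(EXAMPLE_CONF)
```

On the affected host, comparing `int(wanted["kernel.panic"])` with `current_panic_timeout()` would show whether the setting is active. Note that `kernel.panic` only sets the reboot delay after a panic; whether a soft lockup escalates to a panic is controlled separately by `kernel.softlockup_panic`, which this ticket does not mention.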
