action #116722
Updated by okurz about 2 years ago
## Observation
Multiple alerts are firing on https://monitor.qa.suse.de/alerting/list?state=not_ok and all seem to point to openqa.suse.de itself being down. The first alert that sent an email seems to be "Failed systemd services" https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1663476407099&to=1663478162658&viewPanel=2 at 2022-09-18 06:50, with further alerts following.
There seems to be a sudden increase in CPU load, visible in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000 , going to 100, with spotty responses from telegraf and no data soon afterwards. There is also a significant increase in disk I/O requests, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000&viewPanel=119
## Impact
openqa.suse.de does not respond even to ping and is not reachable for workers or users, so it is completely unusable.
## Suggestions
* *DONE* Increase memory of the VM
* *DONE* Find the root cause -> soft lockups in kernel since the latest kernel upgrade
* *DONE* Report product issue
* *DONE* Apply workaround, e.g. downgrade the kernel version
* *DONE* Apply kernel.panic = 60
* Wait for product issue to be resolved and upgrade kernel again
* Optional: Review the changelog diff of the kernel
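
For the `kernel.panic = 60` item above: this sysctl makes the kernel reboot 60 seconds after a panic instead of hanging indefinitely. Whether a soft lockup actually escalates to a panic depends on further settings such as `kernel.softlockup_panic`, which this ticket does not mention. The following is only a minimal sketch for checking that the setting is in place. The helper names `parse_sysctl` and `panic_timeout` are hypothetical and not part of any existing tooling for this host.

```python
from pathlib import Path


def parse_sysctl(text: str) -> dict[str, str]:
    """Parse sysctl.conf-style text into a {key: value} mapping."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def panic_timeout(proc_file: str = "/proc/sys/kernel/panic") -> int:
    """Return the live kernel.panic value in seconds (Linux only)."""
    return int(Path(proc_file).read_text().strip())
```

On the affected host one could compare `parse_sysctl(...)["kernel.panic"]` for the deployed configuration text against `panic_timeout()` to confirm that the running kernel has picked up the value.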