action #116722

closed

openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M

Added by okurz over 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2022-09-18
Due date:
% Done: 0%
Estimated time:

Description

Observation

Multiple alerts fired on https://monitor.qa.suse.de/alerting/list?state=not_ok , and all of them seem to point to openqa.suse.de itself being down. The first alert that sent an email seems to be "Failed systemd services" https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1663476407099&to=1663478162658&viewPanel=2 , from 2022-09-18 06:50 onwards.
CPU load increased suddenly, as visible in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000 , going to 100, followed by spotty responses from telegraf and no data at all soon afterwards. Disk I/O requests also increased significantly, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1663473600000&to=1663480800000&viewPanel=119

Impact

openqa.suse.de does not respond to ping and is not reachable for workers or users, so it is completely unusable.

Suggestions

  • DONE Increase memory of the VM
  • DONE Find the root cause -> soft lockups in kernel since the latest kernel upgrade
  • DONE Report product issue
  • DONE Apply workaround, e.g. downgraded kernel version
  • DONE Apply kernel.panic = 60
  • Wait for product issue to be resolved and upgrade kernel again
  • Optional: Review the changelog diff of the kernel
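
For context on the `kernel.panic = 60` item: this sysctl sets how many seconds the kernel waits after a panic before rebooting on its own (0 means it never reboots), so a panicked host comes back without manual intervention. The ticket does not say where the setting was written. The sketch below assumes it lives in a sysctl.conf-style file and only shows how the value could be verified; `parse_sysctl` and `panic_timeout` are hypothetical helper names.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments; sysctl.conf accepts '#' and ';'.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a "key = value" line
        settings[key.strip()] = value.strip()
    return settings


def panic_timeout(text):
    """Return the configured kernel.panic value in seconds, or None if unset."""
    value = parse_sysctl(text).get("kernel.panic")
    return int(value) if value is not None else None
```

A check against a deployed file could then compare `panic_timeout(open(path).read())` with 60, where `path` is whichever sysctl file the setting was actually placed in.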

Files

journal_osd_crash_2022-09-18.log.xz (300 KB): system journal of crash period, multiple OOM situations (okurz, 2022-09-19 07:05)
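
To get a quick overview of the OOM situations in such a journal export, the kill messages can be counted per process. This is a minimal sketch, not something taken from the ticket: the attachment's contents are not shown here, so the regular expression assumes the usual kernel wording "Out of memory: Killed process <pid> (<name>)" (older kernels write "Kill process"), and the function names are hypothetical.

```python
import lzma
import re
from collections import Counter

# Assumed message shape: "Out of memory: Killed process <pid> (<name>)".
# Older kernels log "Kill process" instead of "Killed process".
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def count_oom_kills(lines):
    """Count OOM-killer victims per process name in an iterable of log lines."""
    victims = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            victims[match.group(2)] += 1
    return victims


def count_oom_kills_in_xz(path):
    """Same as count_oom_kills, reading an xz-compressed text log."""
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_oom_kills(handle)
```

Calling `count_oom_kills_in_xz("journal_osd_crash_2022-09-18.log.xz").most_common()` would list the most frequently killed processes first, assuming the file is a plain-text journal export.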

Related issues 9 (0 open, 9 closed)

Related to openQA Infrastructure (public) - action #116740: [alert] openqaworker14: host up alert (Resolved, nicksinger, 2022-09-19)
Related to openQA Infrastructure (public) - action #116743: [alert] QA-Power8-5-kvm: host up alert (Resolved, nicksinger, 2022-09-19 to 2022-10-04)
Related to openQA Infrastructure (public) - action #116746: [alert] openqaworker9: host up alert (Resolved, nicksinger, 2022-09-19)
Related to openQA Infrastructure (public) - action #116752: [alert] powerqaworker-qam-1: host up alert (Resolved, nicksinger, 2022-09-19)
Related to openQA Infrastructure (public) - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend (Resolved, okurz, 2022-06-22)
Related to openQA Infrastructure (public) - action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M (Resolved, livdywan)
Related to openQA Infrastructure (public) - action #116848: Ensure kdump is enabled and working on all OSD machines (Resolved, mkittler, 2022-09-20 to 2022-10-25)
Related to openQA Infrastructure (public) - action #116911: [openQA][needle] Can not commit new needle for test suite on openqa.suse.de (Resolved, mkittler, 2022-09-21)
Related to openQA Infrastructure (public) - action #126212: openqa.suse.de response times very slow. No alert fired size:M (Resolved, mkittler, 2023-03-20)
