action #67219

closed

o3 down, alert about / full

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 2020-05-25
Due date:
% Done: 0%
Estimated time:

Description

Observation

** PROBLEM Service Alert: ariel-opensuse.suse.de/root partition is CRITICAL **

From:   Monitoring User <nagios@suse.de> resent from: OKurz@suse.com
To: okurz@suse.com
Date:   25/05/2020 18.36
Spam Status:    Spamassassin
Notification: PROBLEM
Host:         ariel-opensuse.suse.de
State:        CRITICAL
Date/Time:    Mon May 25 16:36:45 UTC 2020
Info:         DISK CRITICAL - free space: / 0 GB (0% inode=72%):

Users also reported on IRC that o3 is down.
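For reference, the two numbers in the alert line (free space and free inodes) can be reproduced locally. The following is only a minimal sketch, not the actual Nagios check: the function names and the 5% threshold are hypothetical, and it assumes the percentages in the alert refer to free capacity.

```python
import os


def usage_percent_free(path="/"):
    """Return (free_space_pct, free_inode_pct) for the filesystem holding path.

    free_inode_pct is None when the filesystem does not report inode counts.
    """
    st = os.statvfs(path)
    # f_bavail / f_favail count what an unprivileged user can still allocate.
    space_pct = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inode_pct = 100.0 * st.f_favail / st.f_files if st.f_files else None
    return space_pct, inode_pct


def classify(space_pct, inode_pct, critical_below=5.0):
    """Return "CRITICAL" if free space or free inodes fall below the threshold.

    critical_below is a hypothetical threshold, not the one used in production.
    """
    if space_pct < critical_below:
        return "CRITICAL"
    if inode_pct is not None and inode_pct < critical_below:
        return "CRITICAL"
    return "OK"


if __name__ == "__main__":
    space, inodes = usage_percent_free("/")
    print(classify(space, inodes), f"free space: {space:.0f}%", f"inodes: {inodes}")
```

With the values from the alert above (0% free space, 72% free inodes), `classify(0.0, 72.0)` reports `CRITICAL` because of the space figure alone.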

Actions #1

Updated by okurz almost 4 years ago

ariel:/home/okurz # last
mloviska pts/19       download.infra.o Mon May 25 17:09 - 17:13  (00:04)
martchus pts/16       download.infra.o Mon May 25 16:59   still logged in
okurz    pts/1        scar.infra.opens Mon May 25 16:55   still logged in
mloviska pts/14       download.infra.o Mon May 25 16:39 - 16:42  (00:03)
dancerma pts/1        download.infra.o Mon May 25 16:36 - 16:41  (00:05)
mloviska pts/1        download.infra.o Mon May 25 16:18 - 16:18  (00:00)
jpupava  pts/1        download.infra.o Mon May 25 13:52 - 13:52  (00:00)
martchus pts/13       download.infra.o Mon May 25 13:29   still logged in

martchus/mkittler is currently trying to fix it. A reboot has been triggered.

Actions #2

Updated by mkittler almost 4 years ago

While it is restarting, here are my findings so far:

The only process still writing to /var/log/openqa was the web socket server, and it did not log anything interesting (only worker status updates). The machine was generally responsive. I couldn't find any suspicious processes via htop.

In the end the web UI did not come up at all anymore. strace shows that it blocked while trying to lock /var/log/openqa:

sudo sudo -u geekotest strace -y /usr/share/openqa/script/openqa-webui-daemon
[...]
getpid()                                = 12220
getpid()                                = 12220
sendto(5<socket:[681345334]>, "Q\0\0\0\nbegin\0", 11, MSG_NOSIGNAL, NULL, 0) = 11
poll([{fd=5<socket:[681345334]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5<socket:[681345334]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17
sendto(5<socket:[681345334]>, "P\0\0\0\252\0INSERT INTO audit_events ("..., 297, MSG_NOSIGNAL, NULL, 0) = 297
poll([{fd=5<socket:[681345334]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5<socket:[681345334]>, "1\0\0\0\0042\0\0\0\4T\0\0\0\33\0\1id\0\0\0@\30\0\1\0\0\0\27\0\4"..., 16384, 0, NULL, NULL) = 78
getpid()                                = 12220
sendto(5<socket:[681345334]>, "Q\0\0\0\vcommit\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=5<socket:[681345334]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5<socket:[681345334]>, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 18
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}) = 0
flock(3</var/log/openqa>, LOCK_EX^C)      = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
strace: Process 12220 detached
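The trace above ends in a blocking `flock(..., LOCK_EX)` call, which waits until whoever holds the lock releases it. A minimal sketch of how one could probe such a lock without hanging, using the non-blocking variant of the same call; the helper name is hypothetical and this is not part of openQA:

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Try to take an exclusive flock on path without blocking.

    Returns the open file descriptor on success (the caller must close it,
    which also releases the lock), or None if the lock is already held
    through another open file description.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting forever.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Demonstrate on a temporary file rather than the real log file.
    with tempfile.NamedTemporaryFile() as tmp:
        holder = try_exclusive_lock(tmp.name)
        print("first attempt got lock:", holder is not None)
        print("second attempt got lock:", try_exclusive_lock(tmp.name) is not None)
        os.close(holder)
```

Running the probe against the affected log file would only tell whether the lock is currently held, not which process holds it.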

Actions #3

Updated by okurz almost 4 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from mkittler to okurz

rwawrig helped to recover the VM and everything is up and running again. See https://chat.suse.de/channel/suse-it-ama?msg=MKqEGMrPGQuCzeZCq and the following messages for reference. Thanks to rwawrig and mkittler for their help. The VM seems to have actually been powered off, not just restarted, which also made it disappear from the VM management cluster. rwawrig recreated it from the XML definition file. I checked all services, free disk space, https://openqa.opensuse.org/ and previous jobs. All aborted jobs have been marked as incomplete and properly retriggered. All seems good now.
