coordination #102266

[epic] o3 ran out of disk space

Added by cdywan 3 months ago. Updated about 1 month ago.

Status: Blocked
Priority: Normal
Assignee:
Target version:
Start date: 2021-12-21
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)

Description

Observation

We identified follow-up items for #102143

Suggestions

o3-disk-space.png (110 KB), added by tinita, 2021-11-11 13:11

Subtasks

action #104217: Ask eng infra why thruk.suse.de stopped working (Blocked, okurz)


Related issues

Copied from openQA Infrastructure - action #102143: o3 ran out of disk space (Resolved, 2021-11-09)

History

#1 Updated by cdywan 3 months ago

#2 Updated by okurz 2 months ago

  • Status changed from New to Workable

#3 Updated by okurz about 1 month ago

  • Priority changed from Normal to Urgent

Asking EngInfra "why thruk stopped working" will become harder the longer we wait, so this should be handled quickly.

#4 Updated by okurz about 1 month ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

#5 Updated by ghormoon about 1 month ago

I see this in the web interface. Were you testing this manually somehow?

Log File Entries for ariel-opensuse.suse.de - root partition
External Command[2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message
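
For anyone who wants to search such entries in bulk, here is a minimal sketch of how a line like the one above could be split into its fields. It assumes the layout shown in this log view (a bracketed `YYYY-MM-DD HH:MM:SS` timestamp followed by `EXTERNAL COMMAND:`) and the Nagios field order host;service;options;author;comment for `SEND_CUSTOM_SVC_NOTIFICATION`. The function name `parse_custom_svc_notification` is made up for illustration and is not part of Thruk or Nagios.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log entry quoted above:
#   [timestamp] EXTERNAL COMMAND: COMMAND_NAME;arg1;arg2;...
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+EXTERNAL COMMAND:\s+(?P<cmd>[A-Z_]+);(?P<args>.*)$"
)


def parse_custom_svc_notification(line):
    """Return the fields of a SEND_CUSTOM_SVC_NOTIFICATION entry, or None."""
    match = LINE_RE.search(line)
    if match is None or match.group("cmd") != "SEND_CUSTOM_SVC_NOTIFICATION":
        return None
    # The comment is the last field and may itself contain ';',
    # so split at most four times.
    parts = match.group("args").split(";", 4)
    if len(parts) != 5:
        return None
    host, service, options, author, comment = parts
    return {
        "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "host": host,
        "service": service,
        "options": int(options),
        "author": author,
        "comment": comment,
    }
```

Applied to the entry above, this would yield the host `ariel-opensuse.suse.de`, the service `root partition`, and the author `Oliver Kurz`, which is enough to filter test notifications from real alerts.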

#6 Updated by ghormoon about 1 month ago

Ask eng infra why thruk.suse.de stopped working

We haven't made any config changes to thruk for some time. I'll try to figure out why you didn't get the notifications, but the question is: how actively do you still use it? We'd like to decommission it at some point, as we're now using zabbix.

If you're interested, we could make you a group in zabbix and add your hosts (though we may need an opensuse proxy first, unless all hosts also have a leg in the enginfra network).
Do any of your hosts use more than "base" monitoring (disk/cpu/ram, ...) that would need to be redone in zabbix too?

Also, regarding the request "Also it would be great if you could ensure that “o3-admins@suse.de” is part of the recipient list.": I see that notifications are set per user in thruk. In zabbix both variants are possible: either we make you a meta-user with the mailing list email and set notifications there, or you can do it per user individually.

#7 Updated by ghormoon about 1 month ago

As for alert notifications, sadly the last one I see is "2021-11-11 14:18:38", but the disk event seems to have happened on 2021-11-09, so I'm not able to figure out which users (if any) it tried to notify, at least not from the interface.
Maybe something could be found in the logs, if they are kept long enough, but I'll need to arrange access to the opensuse nagios server (192.168.47.7) first, as I don't have it personally.

#8 Updated by ghormoon about 1 month ago

Ah, I do have access; I just didn't realise it's through my user, not root. Sadly, it seems the logs have already been rotated away. We can do some tests with the trigger if you want to.

Is thruk.suse.de the only interface you're using? I.e. if we were to consider zabbix, is there anyone from the community who would have an issue accessing it (compared to the current situation)?

#9 Updated by okurz about 1 month ago

ghormoon wrote:

I see this in the web interface. Were you testing this manually somehow?
Log File Entries for ariel-opensuse.suse.de - root partition
External Command[2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message

Yes, that was me sending a test notification over thruk.suse.de.

ghormoon wrote:

Is thruk.suse.de the only interface you're using? I.e. if we were to consider zabbix, is there anyone from the community who would have an issue accessing it (compared to the current situation)?

Yes, thruk.suse.de would be the only interface. AFAIK no one from the community has access to it, so we wouldn't lose anyone with a solution that is only available SUSE-internally going forward.

I looked through my email archives; the last alert email I received was:

Notification: PROBLEM
Host:         ariel-opensuse.suse.de
State:        UNKNOWN
Date/Time:    Fri Sept 25 10:13:56 UTC 2020
Info:         check_ntp_time: Invalid hostname/address - ntp.infra.opensuse.org

Service:      NTP

Long Output:  Usage:\n check_ntp_time -H host [-4

See Online:   https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=NTP

In the follow-up, bmwiedemann helped us with https://infra.nue.suse.com/SelfService/Display.html?id=175747, where he wrote:

I told it to stop notifying about "all services on this host" (there was only NTP listed)

So maybe that ended up covering more than "only NTP"? I asked bmwiedemann in https://suse.slack.com/archives/C029APBKLGK/p1640119784105700

https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=ariel-opensuse.suse.de shows what looks to me like a sane choice of services, and all notifications appear to be enabled. Maybe someone (you, ghormoon?) enabled them to fix the current problem. https://thruk.suse.de/thruk/cgi-bin/notifications.cgi?host=ariel-opensuse.suse.de states that emails have been sent out to a list of users, but I don't have any such email.
