coordination #102266
openQA Project (public) - coordination #80142 (closed): [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[epic] o3 ran out of disk space
100%
Description
Observation
We identified follow-up items for #102143
Suggestions
- Ask eng infra why thruk.suse.de stopped working
- Take a look at backup.qa.suse.de for relevant changes in /etc
- Install etckeeper (https://wiki.archlinux.org/title/etckeeper) to keep track of changes in /etc
Updated by livdywan about 3 years ago
- Copied from action #102143: o3 ran out of disk space added
Updated by okurz about 3 years ago
- Priority changed from Normal to Urgent
Asking EngInfra "why thruk stopped working" will become harder and harder the longer we wait, so this should be handled quickly.
Updated by okurz almost 3 years ago
- Status changed from Workable to Blocked
- Assignee set to okurz
Updated by ghormoon almost 3 years ago
I see this in the web interface. Were you testing this manually somehow?
Log File Entries for ariel-opensuse.suse.de - root partition
External Command [2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message
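For anyone wanting to search such entries in bulk, here is a minimal sketch of how the log entry above could be split into its fields. It assumes the Nagios external command layout `SEND_CUSTOM_SVC_NOTIFICATION;<host>;<service>;<options>;<author>;<comment>` and the timestamp format shown in the web interface. The names `CustomSvcNotification` and `parse_custom_svc_notification` are made up for illustration and are not part of thruk or Nagios.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Matches "[timestamp] EXTERNAL COMMAND: NAME;arg;arg;..." anywhere in the line.
# Assumption: the timestamp is rendered as in the web interface (YYYY-MM-DD HH:MM:SS).
_ENTRY = re.compile(r"\[(?P<ts>[^\]]+)\]\s+EXTERNAL COMMAND:\s+(?P<body>.*)")


@dataclass
class CustomSvcNotification:
    """Hypothetical container for one parsed log entry."""
    timestamp: datetime
    host: str
    service: str
    options: int
    author: str
    comment: str


def parse_custom_svc_notification(line: str) -> CustomSvcNotification:
    """Split a SEND_CUSTOM_SVC_NOTIFICATION log entry into its fields."""
    match = _ENTRY.search(line)
    if match is None:
        raise ValueError("not an external command log entry")
    # maxsplit=5 keeps any ';' inside the free-text comment intact.
    parts = match.group("body").split(";", 5)
    if len(parts) != 6 or parts[0] != "SEND_CUSTOM_SVC_NOTIFICATION":
        raise ValueError("not a SEND_CUSTOM_SVC_NOTIFICATION command")
    _, host, service, options, author, comment = parts
    return CustomSvcNotification(
        timestamp=datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        host=host,
        service=service,
        options=int(options),
        author=author,
        comment=comment,
    )


entry = (
    "[2021-12-19 20:04:00] EXTERNAL COMMAND: "
    "SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;"
    "Oliver Kurz;This is a test notification"
)
parsed = parse_custom_svc_notification(entry)
print(parsed.host, "/", parsed.service, "by", parsed.author)
```

Filtering on `host` and `service` this way would only help while the logs still exist; it does not recover entries that have already been rotated away.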
Updated by ghormoon almost 3 years ago
Ask eng infra why thruk.suse.de stopped working
We have not made any config changes to thruk for some time. I'll try to figure out why you didn't get the notifications, but the question is: how actively do you still use it? We'd like to decommission it at some point, as we're now using zabbix.
If you are interested, we could create a group for you in zabbix and add your hosts (though we may need an openSUSE proxy first, unless all hosts also have a leg in the EngInfra network).
Do any of your hosts use more than "base" monitoring (disk/cpu/ram, ...) that would need to be redone in zabbix too?
Also, regarding the question "Also it would be great if you could ensure that “o3-admins@suse.de” is part of the recipient list.": I see that notifications are set per user in thruk. In zabbix both variants are possible: either we create a meta-user for you with the mailing list address and set notifications there, or you can do it per user individually.
Updated by ghormoon almost 3 years ago
As for alert notifications, sadly the last one I see is "2021-11-11 14:18:38", but the disk event seems to have happened on 2021-11-09, so I'm not able to figure out which users (if any) it tried to notify, at least not from the interface.
Maybe it would be possible to find out something from the logs, if they are kept long enough, but I'll need to arrange access to the openSUSE nagios server (192.168.47.7) first, as I don't even have it personally.
Updated by ghormoon almost 3 years ago
Ah, I do have access; I just didn't realise it's through my user, not root. Sadly, it seems the logs have already been rotated away. We can do some tests with the trigger if you want to.
Is thruk.suse.de the only interface you're using? I.e., if we were to consider zabbix, is there anyone from the community who would have trouble accessing it (compared to the current situation)?
Updated by okurz almost 3 years ago
ghormoon wrote:
I see this in the web interface. Were you testing this manually somehow?
Log File Entries for ariel-opensuse.suse.de - root partition
External Command [2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message
Yes, that was me sending a test notification over thruk.suse.de.
ghormoon wrote:
Is thruk.suse.de the only interface you're using? I.e., if we were to consider zabbix, is there anyone from the community who would have trouble accessing it (compared to the current situation)?
Yes, thruk.suse.de would be the only interface. And AFAIK no one from the community has access to that, so we wouldn't lose anyone with a solution that is only available SUSE-internally going forward.
I took a look into my email archives and found that the last time I received an email about an alert was:
Notification: PROBLEM
Host: ariel-opensuse.suse.de
State: UNKNOWN
Date/Time: Fri Sept 25 10:13:56 UTC 2020
Info: check_ntp_time: Invalid hostname/address - ntp.infra.opensuse.org
Service: NTP
Long Output: Usage:\n check_ntp_time -H host [-4
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=NTP
In the follow-up, bmwiedemann helped us with https://infra.nue.suse.com/SelfService/Display.html?id=175747 where he wrote:
I told it to stop notifying about "all services on this host" (there was only NTP listed)
So maybe that was in the end more than "only NTP"? I asked bmwiedemann in https://suse.slack.com/archives/C029APBKLGK/p1640119784105700
https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=ariel-opensuse.suse.de shows what looks to me like a sane choice of services, and it looks to me like all notifications are enabled. Maybe someone (you, ghormoon?) enabled them to fix the current problem. https://thruk.suse.de/thruk/cgi-bin/notifications.cgi?host=ariel-opensuse.suse.de states that emails have been sent out to a list of users, but I have not received any such email.
Updated by okurz over 2 years ago
- Status changed from Blocked to Resolved
We have resolved the notification problem. I don't plan any further tasks.