coordination #102266

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] o3 ran out of disk space

Added by livdywan over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-12-21
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

We identified follow-up items for #102143

Suggestions


Files

o3-disk-space.png (110 KB) tinita, 2021-11-11 13:11

Subtasks 1 (0 open, 1 closed)

action #104217: Ask eng infra why thruk.suse.de stopped working (Resolved, okurz, 2021-12-21)

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #102143: o3 ran out of disk space (Resolved, mkittler, 2021-11-09)

Actions #1

Updated by livdywan over 2 years ago

Actions #2

Updated by okurz over 2 years ago

  • Status changed from New to Workable
Actions #3

Updated by okurz over 2 years ago

  • Priority changed from Normal to Urgent

Asking EngInfra "why thruk stopped working" will become harder the longer we wait, so this should be handled quickly.

Actions #4

Updated by okurz over 2 years ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
Actions #5

Updated by ghormoon over 2 years ago

I see this in the web interface. Were you testing this manually somehow?

Log File Entries for ariel-opensuse.suse.de - root partition
External Command [2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message
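
For anyone who wants to search such log entries for test notifications, here is a minimal Python sketch that splits an entry like the one above into its fields. It assumes the semicolon-separated field order of the Nagios external command `SEND_CUSTOM_SVC_NOTIFICATION` (host, service, options, author, comment) and the human-readable timestamp shown in the web interface; raw Nagios log files may use a different timestamp format. The names `CustomSvcNotification` and `parse_custom_svc_notification` are made up for this illustration and are not part of thruk or Nagios.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CustomSvcNotification:
    """Fields of one SEND_CUSTOM_SVC_NOTIFICATION log entry (hypothetical type)."""
    timestamp: datetime
    host: str
    service: str
    options: int
    author: str
    comment: str


# Assumed layout, taken from the entry quoted in this ticket:
# [YYYY-MM-DD HH:MM:SS] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;host;service;options;author;comment
_ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;(?P<rest>.*)"
)


def parse_custom_svc_notification(line: str) -> Optional[CustomSvcNotification]:
    """Return the parsed entry, or None if the line does not match the assumed layout."""
    match = _ENTRY.search(line)
    if match is None:
        return None
    # Split into at most 5 parts so that semicolons inside the comment survive.
    parts = match.group("rest").split(";", 4)
    if len(parts) != 5:
        return None
    host, service, options, author, comment = parts
    try:
        options_value = int(options)
    except ValueError:
        return None
    return CustomSvcNotification(
        timestamp=datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        host=host,
        service=service,
        options=options_value,
        author=author,
        comment=comment,
    )
```

Calling `parse_custom_svc_notification` on the entry above should give an object with `host` set to `ariel-opensuse.suse.de` and `author` set to `Oliver Kurz`; lines that do not match the assumed layout return `None`.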

Actions #6

Updated by ghormoon over 2 years ago

Ask eng infra why thruk.suse.de stopped working

We haven't made any config changes to thruk for some time, so I'll try to figure out why you didn't get the notifications. But the question is: how actively do you still use it? We'd like to decommission it at some point, as we're now using zabbix.

If you're interested, we could create a group for you in zabbix and add your hosts (though we may need an opensuse proxy first, unless all hosts also have a leg in the enginfra network).
Do any of your hosts use more than "base" monitoring (disk/cpu/ram, ...) that would need to be redone in zabbix too?

Also, regarding the request "Also it would be great if you could ensure that “o3-admins@suse.de” is part of the recipient list.": I see that notifications are set per user in thruk. In zabbix both variants are possible: either we create a meta-user for you with the mailing list email and set notifications there, or you can do it per user individually.

Actions #7

Updated by ghormoon over 2 years ago

As for alert notifications, sadly the last one I see is "2021-11-11 14:18:38", but the disk event seems to have happened on 2021-11-09, so I'm not able to figure out which users (if any) it tried to notify, at least not from the interface.
Maybe something could be found in the logs, if they are kept long enough, but I'll first need to arrange access to the opensuse nagios server (192.168.47.7), as I don't have it personally.

Actions #8

Updated by ghormoon over 2 years ago

Ah, I do have access; I just didn't realise it's through my user, not root. Sadly, it seems the logs have already been rotated away. We can run some tests with the trigger if you want.

Is thruk.suse.de the only interface you're using? I.e., if we were to consider zabbix, is there anyone from the community who would have trouble accessing it (compared to the current situation)?

Actions #9

Updated by okurz over 2 years ago

ghormoon wrote:

I see this in the web interface. Were you testing this manually somehow?
Log File Entries for ariel-opensuse.suse.de - root partition
External Command [2021-12-19 20:04:00] EXTERNAL COMMAND: SEND_CUSTOM_SVC_NOTIFICATION;ariel-opensuse.suse.de;root partition;0;Oliver Kurz;This is a test notification, please respond in https://progress.opensuse.org/issues/102266 if you could see this message

Yes, that was me sending a test notification via thruk.suse.de.

ghormoon wrote:

Is thruk.suse.de the only interface you're using? I.e., if we were to consider zabbix, is there anyone from the community who would have trouble accessing it (compared to the current situation)?

Yes, thruk.suse.de would be the only interface. And AFAIK no one from the community has access to it, so we wouldn't lose anyone by moving to a solution that is only available SUSE-internally.

I took a look through my email archives and found that the last time I received an email about an alert was:

Notification: PROBLEM
Host:         ariel-opensuse.suse.de
State:        UNKNOWN
Date/Time:    Fri Sept 25 10:13:56 UTC 2020
Info:         check_ntp_time: Invalid hostname/address - ntp.infra.opensuse.org

Service:      NTP

Long Output:  Usage:\n check_ntp_time -H host [-4

See Online:   https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=ariel-opensuse.suse.de&service=NTP

In the follow-up, bmwiedemann helped us with https://infra.nue.suse.com/SelfService/Display.html?id=175747 where he wrote:

I told it to stop notifying about "all services on this host" (there was only NTP listed)

So maybe that ended up affecting more than "only NTP"? I asked bmwiedemann in https://suse.slack.com/archives/C029APBKLGK/p1640119784105700

https://thruk.suse.de/thruk/cgi-bin/status.cgi?host=ariel-opensuse.suse.de shows what looks to me like a sane choice of services, and it looks like all notifications are enabled. Maybe someone (you, ghormoon?) enabled them to fix the current problem. https://thruk.suse.de/thruk/cgi-bin/notifications.cgi?host=ariel-opensuse.suse.de states that emails have been sent out to a list of users, but I did not receive any such email.

Actions #10

Updated by okurz about 2 years ago

  • Parent task set to #80142
Actions #11

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved

We have resolved the notification problem. I don't plan any further tasks.
