action #76876
openQA Project - coordination #80142: [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Find a better (automated) way to inform infra about hanging (arm) workers
0%
Description
Observation
As max already reported repeatedly that he can't extract info from our automated Grafana alerts, I think it is time to find a better solution. Just setting infra as the receiver for Grafana alerts results in mails like this:
"Dear Colleague, Thank you for your report of: "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert" assigned reference number: "178873" Someone from the designate team will contact you about your request as soon as we can. If you have additional comments or questions, you can follow up to the ticket here at : https://infra.nue.suse.com/Ticket/Display.html?id=178873 Regards, The Engineering Infrastructure Team" infra@suse.de ------------------------------------------------------------------------- The original message: ------------------------------------------------------------------------- [IMAGE] [IMAGE] [IMAGE] [IMAGE] [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert The IPMI management interface for this machine is inaccessible (again). The The IPMI management interface for this machine is inaccessible (again). The Metric name Metric name Value View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-m onitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/) machine itself is also not reachable over ping. Suggested action: Reset the machine itself is also not reachable over ping. Suggested action: Reset the © 2016 Grafana and raintank © 2016 Grafana and raintank The IPMI management interface for this machine is inaccessible (again). 
The machine including the management interface. Similar issues were handled in machine including the management interface. Similar issues were handled in Value Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) [No Data] [openqa] openqaworker-arm-3 online (long-time) alert machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Update.html?id=174650 and machine including the management interface. Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=166330 and The IPMI management interface for this machine is inaccessible (again). The https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same machine including the management interface. 
Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=164419 and machine machine https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same https://infra.nue.suse.com/SelfService/Display.html?id=166330 and machine https://infra.nue.suse.com/SelfService/Display.html?id=164419 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same Metric name machine Value Metric name View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Value Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/) © 2016 Grafana and raintank
Acceptance criteria
- AC1: An EngInfra ticket is created, e.g. over email to infra@suse.de, which is directly "readable", e.g. no HTML message and shorter
Suggestions
- Maybe the mail template can be changed? (best to text only)
- We can use an approach similar to what we already have for automated_actions: Let a custom gitlab-job create the infra ticket
- We can implement our own piece of software which talks to the Grafana webhook API
History
#1
Updated by nicksinger 3 months ago
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/394 now to disable the alerts until we have a better solution
#3
Updated by okurz 3 months ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
As it is very likely that our arm workers will disappear soon again and we need to handle that anyway we should regard this ticket as "Urgent". Added AC1 based on your suggestions
#4
Updated by okurz 2 months ago
Seems the openqaworker-arm-3 BMC is down, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/282784 . So this becomes more urgent, or someone writes a ticket manually. I have done that already around 6 times and hence implemented the automatic ticket reporting. I would really appreciate it if it could be someone else's turn now :)
#5
Updated by okurz 2 months ago
https://github.com/grafana/grafana/issues/11436 describes an open feature request for "Alerts - "Plain Text" (i.e., non-html) Email Option?" and no workaround mentioned.
Honestly, what we should do is enable the old option and explain the situation to EngInfra, including that we like to help and that officially they are responsible for hardware in the server room. And for me, having a responsive management interface falls pretty clearly within that area of responsibility. I think they understand and can live with the current situation. Also we have provided better alternative proposals already, e.g. network controlled power outlets, etc.
What I see as alternatives: Use another gitlab CI action to send a plain email to create the corresponding ticket. Or we try multiple times to reach the machines over IPMI and if that repeatedly fails from within the same gitlab CI action we create the ticket from there.
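The second alternative above (retry the IPMI check a few times and only then create the ticket) could look roughly like the following. This is a minimal sketch, not the code of the actual CI job: the `retry` helper, the attempt count, the delay and the `notify_infra` step in the comment are all hypothetical names introduced for illustration.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 otherwise.
# (Hypothetical helper, not taken from grafana-webhook-actions.)
retry() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only wait if another attempt is still coming.
        if [ "$try" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        try=$((try + 1))
    done
    return 1
}

# Assumed wiring (not run here): the IPMI probe and the notification step
# are placeholders for whatever the CI job really uses.
#   retry 3 30 $IPMITOOL power status || notify_infra "$MACHINE"

# Self-contained demonstration with a command that always fails.
if retry 3 0 false; then
    echo "reachable"
else
    echo "giving up after 3 attempts, would notify infra now"
fi
```

Keeping the retry logic in one small function means the ticket is only created after the same job has seen IPMI fail repeatedly, which should avoid tickets for short glitches.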
#6
Updated by cdywan 2 months ago
- Status changed from Workable to In Progress
- Assignee set to cdywan
Thinking how to translate the suggestions into a pragmatic approach, I suppose grafana-webhook-actions already pings via ipmi and it should be possible to extend it to send a "readable" email. That seems easier than e.g. learning how to contribute a patch to Grafana and might even be more re-usable.
#8
Updated by okurz 2 months ago
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
#10
Updated by cdywan 2 months ago
okurz wrote:
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677
Kinda seems as though the machine was fixed while the pipeline was running...
$ echo rebooting $MACHINE
rebooting openqaworker-arm-3
$ $IPMITOOL chassis bootdev disk
Set Boot Device to disk
$ $IPMITOOL power cycle
Chassis Power Control: Cycle
$ eval $PING || ($IPMITOOL power cycle && eval $PING)
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
[...]
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 291.855/291.855/291.855/0.000 ms
$ timeout -k 5 300 sh -c "until nc -vz -w 1 $MACHINE 22; do :; done"
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script 00:01
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \
echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
/usr/bin/bash: line 105: mail: command not found
Cleaning up file based variables 00:00
Job succeeded
And I guess the mail command didn't work. And I wonder why it was even called, as the ping before that seems to have succeeded, but it's after an || in the script 🤔
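A likely explanation is operator precedence: in POSIX shell `||` and `&&` have equal precedence and are evaluated left to right, so `a || b && c` behaves like `(a || b) && c`. The sketch below reproduces the effect with harmless stand-in commands (`true` in place of the successful `nc` call, `echo` in place of the mail pipeline); the function names are only for illustration.

```shell
#!/bin/sh
# `||` and `&&` have equal precedence and associate left to right,
# so `a || b && c` is evaluated as `(a || b) && c`.

# Ungrouped: the first command succeeds, the fallback echo is skipped,
# but the command after `&&` still runs.
ungrouped() {
    true || echo "fallback" && echo "mail sent"
}

# Grouped: the braces make the whole fallback depend on the first command
# failing, so nothing runs when it succeeds.
grouped() {
    true || { echo "fallback" && echo "mail sent"; }
}

echo "ungrouped: $(ungrouped)"
echo "grouped: $(grouped)"
```

With the ungrouped form the last command runs although the check succeeded, which matches the job log above where `mail` was invoked after a successful `nc`.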
#11
Updated by okurz 2 months ago
cdywan wrote:
okurz wrote:
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677
Kinda seems as though the machine was fixed while the pipeline was running...
wow, that really surprised me now :D But I can confirm that the machine is back up. I will not add it back to salt control so that EngInfra has a chance to use that machine for testing freely.
So it seems the "mail" command is missing. In https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml#L3 we use a "registry.opensuse.org/opensuse/infrastructure/images/opensuse_leap_15.0/images/opensuse-leap-15.0:current" image which has it but is old and outdated. In https://build.opensuse.org/package/show/home:okurz:container/ipmitool-ping-nc-mailx I am now building a container that you could use as a replacement, with the path
registry.opensuse.org/home/okurz/container/containers/tumbleweed:ipmitool-ping-nc-mailx
#12
Updated by okurz about 2 months ago
cdywan this is still urgent as right now jobs fail, e.g. with:
$ nc -vz -w 1 $MACHINE 22 || \
echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
nc: getaddrinfo for host "openqaworker-arm-1" port 22: Temporary failure in name resolution
/usr/bin/bash: line 107: echo: command not found
in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/290949
#13
Updated by okurz about 2 months ago
The last jobs failed due to #80178, and currently openqaworker-arm-1 and openqaworker-arm-2 are ok, so don't be misled by the last failure regarding failed name resolution.
#14
Updated by cdywan about 2 months ago
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/11
Thank you for providing the container. I shouldn't have assumed just because we use mail elsewhere it's actually generally available :-D
Also merged the echos to get the conditionals to work.
#15
Updated by okurz about 2 months ago
I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744
and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script 00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \
echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored, which I assume is the case.
#16
Updated by cdywan about 2 months ago
okurz wrote:
I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744 and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script 00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \
echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
It's unfortunate there is no validation for scripts and the line numbers make no sense. I spotted the mistake in the escape codes now... the \ got moved away from the " by accident.
I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored, which I assume is the case.
- The pipeline fails if the server's unreachable.
- The after_script sends an email if the server really is unreachable, or nothing if it's broken. The pipeline is already in failed state if that didn't happen but should.
Maybe I should just move it to an actual script that we can validate with shellcheck, though.
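Independent of shellcheck, a plain syntax check with `sh -n` (read the commands, execute nothing) would already catch an unbalanced quote like the one above. A minimal sketch: `check_syntax` is a hypothetical helper, and the `fixed` line is only one plausible corrected form, assumed here for illustration.

```shell
#!/bin/sh
# check_syntax SNIPPET: parse SNIPPET with `sh -n`, which reads commands
# without executing them. Returns 0 if it parses, non-zero on a syntax error.
# (Hypothetical helper, not part of grafana-webhook-actions.)
check_syntax() {
    printf '%s\n' "$1" | sh -n 2>/dev/null
}

# Single quotes keep both snippets literal; nothing in them is executed.
# `broken` is the command from the job log, `fixed` is an assumed repair.
broken='echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de'
fixed='echo -e "$EMAIL\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de'

for snippet in "$broken" "$fixed"; do
    if check_syntax "$snippet"; then
        echo "parses: $snippet"
    else
        echo "syntax error: $snippet"
    fi
done
```

A check like this only validates syntax; logic problems such as unexpected `||`/`&&` grouping still need a linter or a test run.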
#17
Updated by cdywan about 2 months ago
cdywan wrote:
Maybe I should just move it to an actual script that we can validate with shellcheck, though.
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/12
#18
Updated by okurz about 2 months ago
cdywan wrote:
- The pipeline fails if the server's unreachable.
ok. Given that we commonly suffer from too many alerts, I think for the next step I prefer if the pipeline would not fail if an action was taken, i.e. if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?
#19
Updated by okurz about 2 months ago
- Estimated time set to 80142.00 h
#20
Updated by okurz about 2 months ago
- Estimated time deleted (80142.00 h)
#21
Updated by okurz about 2 months ago
- Parent task set to #80142
#22
Updated by cdywan about 2 months ago
okurz wrote:
cdywan wrote:
- The pipeline fails if the server's unreachable.
ok. Given that we commonly suffer from too many alerts, I think for the next step I prefer if the pipeline would not fail if an action was taken, i.e. if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?
Right. The script introduced in merge request 12 actually succeeds in both cases, and a failure would indicate a problem with the script. I was worried you might object since failure=alert seems to be the default so far :-D
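The exit-code policy discussed here can be sketched as follows. This is not the merged script: `health_check` and its three hooks (`is_reachable`, `recover`, `notify_infra`) are hypothetical names, and their default bodies only reuse the `nc`, `$IPMITOOL power cycle` and `mail` invocations seen in the job logs above.

```shell
#!/bin/sh
# Hypothetical hooks; a real job would also wait for the host to boot
# after the power cycle before probing it again.
is_reachable() { nc -z -w 1 "$1" 22; }
recover() { ${IPMITOOL:-false} power cycle; }
notify_infra() {
    echo "$1 is not reachable and could not be recovered via IPMI" |
        mail -s "[openqa] $1 not bootable via IPMI" infra@suse.de
}

# health_check MACHINE
# Returns 0 when the host is healthy, when recovery worked, or when infra
# was notified. A non-zero status therefore only means that the
# notification itself failed, i.e. something is wrong with the script.
health_check() {
    machine=$1
    if is_reachable "$machine"; then
        echo "$machine is healthy"
        return 0
    fi
    if recover "$machine" && is_reachable "$machine"; then
        echo "$machine recovered"
        return 0
    fi
    notify_infra "$machine"
}

# Assumed usage inside the CI job (not run here):
#   health_check "$MACHINE"
```

Because the function's status is that of the last step that ran, a pipeline failure points at the tooling rather than at the machine, which keeps hardware outages from producing an additional alert.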
#23
Updated by cdywan about 2 months ago
- Status changed from In Progress to Feedback
The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.
#24
Updated by cdywan about 2 months ago
- Status changed from Feedback to Resolved
cdywan wrote:
The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.
Seems to work reliably now:
$ ./ipmi-health-check
Checking if openqaworker-arm-2 is healthy
PING openqaworker-arm-2.suse.de (10.160.0.227) 56(84) bytes of data.
--- openqaworker-arm-2.suse.de ping statistics ---
[...]
rtt min/avg/max/mdev = 0.201/0.201/0.201/0.000 ms
Connection to openqaworker-arm-2 22 port [tcp/ssh] succeeded!