action #76876
closed: openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Find a better (automated) way to inform infra about hanging (arm) workers
0%
Description
Observation
As max has already reported repeatedly that he can't extract info from our automated Grafana alerts, I think it is time to find a better solution. Just setting infra as the receiver for Grafana alerts results in mails like this:
"Dear Colleague,
Thank you for your report of: "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert"
assigned reference number: "178873"
Someone from the designate team will contact you about
your request as soon as we can.
If you have additional comments or questions, you can
follow up to the ticket here at :
https://infra.nue.suse.com/Ticket/Display.html?id=178873
Regards,
The Engineering Infrastructure Team"
infra@suse.de
-------------------------------------------------------------------------
The original message:
-------------------------------------------------------------------------
[IMAGE] [IMAGE] [IMAGE] [IMAGE] [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert The IPMI management interface for this machine is inaccessible (again). The The IPMI management interface for this machine is inaccessible (again). The Metric name Metric name Value View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-m
onitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)
machine itself is also not reachable over ping. Suggested action: Reset the machine itself is also not reachable over ping. Suggested action: Reset the
© 2016 Grafana and raintank © 2016 Grafana and raintank
The IPMI management interface for this machine is inaccessible (again). The machine including the management interface. Similar issues were handled in machine including the management interface. Similar issues were handled in Value Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
[No Data] [openqa] openqaworker-arm-3 online (long-time) alert machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Update.html?id=174650 and
machine including the management interface. Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=166330 and
The IPMI management interface for this machine is inaccessible (again). The https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and
machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same
machine including the management interface. Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=164419 and machine machine
https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same
https://infra.nue.suse.com/SelfService/Display.html?id=166330 and machine
https://infra.nue.suse.com/SelfService/Display.html?id=164419 and
https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same Metric name
machine
Value
Metric name
View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)
Value
Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)
Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)
© 2016 Grafana and raintank
Acceptance criteria
- AC1: An EngInfra ticket is created, e.g. over email to infra@suse.de, which is directly "readable", e.g. not an HTML message and shorter
Suggestions
- Maybe the mail template can be changed? (best to text only)
- We can use an approach similar to the one we already have for automated_actions: Let a custom gitlab-job create the infra ticket
- We can implement our own piece of software which talks to the Grafana webhook API
Updated by nicksinger about 4 years ago
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/394 now to disable the alerts until we have a better solution
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
As it is very likely that our arm workers will disappear soon again and we need to handle that anyway we should regard this ticket as "Urgent". Added AC1 based on your suggestions
Updated by okurz about 4 years ago
Seems the openqaworker-arm-3 BMC is down, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/282784 . So either this becomes more urgent or someone writes a ticket manually. I have done that around 6 times already and hence implemented the automatic ticket reporting. I would really appreciate it if it could be someone else's turn now :)
Updated by okurz about 4 years ago
https://github.com/grafana/grafana/issues/11436 describes an open feature request for "Alerts - "Plain Text" (i.e., non-html) Email Option?" and no workaround mentioned.
Honestly, what we should do is enable the old option and explain the situation to EngInfra, including that we like to help and that officially they are responsible for hardware in the server room. To me, having a responsive management interface clearly falls within that area of responsibility. I think they understand and can live with the current situation. Also, we have already provided better alternative proposals, e.g. network controlled power outlets, etc.
What I see as alternatives: Use another gitlab CI action to send a plain email to create the corresponding ticket. Or we try multiple times to reach the machines over IPMI and if that repeatedly fails from within the same gitlab CI action we create the ticket from there.
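The second alternative could look roughly like the following sketch. It is not the actual CI job: the function names and the `ATTEMPTS`, `DELAY`, `MAILER` and `RECIPIENT` variables are made up for illustration, `$IPMITOOL` is assumed to hold the full ipmitool invocation as in the existing job, and using `chassis power status` as the reachability probe is an assumption.

```shell
#!/bin/bash
# Hypothetical defaults; the real job defines its own $IPMITOOL and recipient.
IPMITOOL=${IPMITOOL:-ipmitool}
MAILER=${MAILER:-mail}
RECIPIENT=${RECIPIENT:-infra@suse.de}

# Probe the IPMI interface up to $1 times, pausing $2 seconds between attempts.
# $IPMITOOL is left unquoted on purpose so that it may carry extra arguments.
ipmi_reachable() {
    local attempts=$1 delay=$2 i
    for ((i = 1; i <= attempts; i++)); do
        $IPMITOOL chassis power status >/dev/null 2>&1 && return 0
        ((i < attempts)) && sleep "$delay"
    done
    return 1
}

# Send a short plain-text mail; the ticket system is expected to turn it into a ticket.
report_unreachable() {
    local machine=$1
    printf 'The IPMI interface of %s is not reachable after repeated attempts.\n' "$machine" \
        | $MAILER -s "[openqa] $machine not bootable via IPMI" "$RECIPIENT"
}

# Power cycle the machine if IPMI answers, otherwise report it by mail.
recover_or_report() {
    local machine=$1
    if ipmi_reachable "${ATTEMPTS:-3}" "${DELAY:-10}"; then
        $IPMITOOL power cycle
    else
        report_unreachable "$machine"
    fi
}
```

A job step would then only need to call `recover_or_report "$MACHINE"`. Because the mail body is built with `printf`, the message stays plain text and short.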
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Thinking how to translate the suggestions into a pragmatic approach, I suppose grafana-webhook-actions already pings via ipmi and it should be possible to extend it to send a "readable" email. That seems easier than e.g. learning how to contribute a patch to Grafana and might even be more re-usable.
Updated by okurz about 4 years ago
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
Updated by livdywan about 4 years ago
okurz wrote:
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677
Kinda seems as though the machine was fixed while the pipeline was running...
$ echo rebooting $MACHINE
rebooting openqaworker-arm-3
$ $IPMITOOL chassis bootdev disk
Set Boot Device to disk
$ $IPMITOOL power cycle
Chassis Power Control: Cycle
$ eval $PING || ($IPMITOOL power cycle && eval $PING)
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
[...]
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 291.855/291.855/291.855/0.000 ms
$ timeout -k 5 300 sh -c "until nc -vz -w 1 $MACHINE 22; do :; done"
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
/usr/bin/bash: line 105: mail: command not found
Cleaning up file based variables
00:00
Job succeeded
And I guess the mail command didn't work. And I wonder why it was even called, as the ping before that seems to have succeeded, but it's after an || in the script 🤔
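A likely explanation is how the shell groups the operators: `||` and `&&` have equal precedence and are evaluated left to right, so `a || b && c` behaves like `(a || b) && c`. The sketch below reproduces this with stand-in functions (`check_ssh` and `notify` are hypothetical names, and no mail is sent):

```shell
#!/bin/bash
# Stand-ins for the real commands; the names are hypothetical.
check_ssh() { return 0; }          # pretend the SSH port is reachable
notify()    { echo "mail sent"; }  # pretend to send the ticket email

# Same shape as the job line: groups as (check_ssh || echo ...) && notify,
# so notify runs even though check_ssh succeeded.
ungrouped=$(check_ssh || echo "host down" && notify)

# Grouping the fallback with braces keeps both steps behind the ||.
grouped=$(check_ssh || { echo "host down"; notify; })

echo "ungrouped: '$ungrouped'"
echo "grouped:   '$grouped'"
```

With a reachable host the ungrouped form still calls `notify`, while the grouped form does nothing. An explicit `if ! check_ssh; then ...; fi` expresses the same intent and is harder to get wrong.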
Updated by okurz about 4 years ago
cdywan wrote:
okurz wrote:
MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677
Kinda seems as though the machine was fixed while the pipeline was running...
wow, that really surprised me now :D But I can confirm that the machine is back up. I will not add it back to salt control so that EngInfra has a chance to use that machine for testing freely.
So it seems the "mail" command is missing. In https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml#L3 we use the image "registry.opensuse.org/opensuse/infrastructure/images/opensuse_leap_15.0/images/opensuse-leap-15.0:current", which has it but is old and outdated. In https://build.opensuse.org/package/show/home:okurz:container/ipmitool-ping-nc-mailx I am now building a container that you could use as a replacement, with the path
registry.opensuse.org/home/okurz/container/containers/tumbleweed:ipmitool-ping-nc-mailx
Updated by okurz about 4 years ago
@cdywan this is still urgent as right now jobs fail, e.g. with:
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
nc: getaddrinfo for host "openqaworker-arm-1" port 22: Temporary failure in name resolution
/usr/bin/bash: line 107: echo: command not found
in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/290949
Updated by okurz about 4 years ago
The last jobs failed due to #80178, and currently openqaworker-arm-1 and openqaworker-arm-2 are ok, so don't be misled by the last failure regarding failed name resolution.
Updated by livdywan about 4 years ago
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/11
Thank you for providing the container. I shouldn't have assumed just because we use mail elsewhere it's actually generally available :-D
Also merged the echos to get the conditionals to work.
Updated by okurz about 4 years ago
I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744
and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored which I assume is the case.
Updated by livdywan about 4 years ago
okurz wrote:
I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744
and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
It's unfortunate there is no validation for scripts and the line numbers make no sense. I spotted the mistake in the escape codes now... the \ got moved away from the " by accident.
I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored which I assume is the case.
- The pipeline fails if the server's unreachable.
- The after_script sends an email if the server really is unreachable, or nothing if it's broken. The pipeline is already in failed state if that didn't happen but should.
Maybe I should just move it to an actual script that we can validate with shellcheck, though.
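Once the step lives in its own file it can at least be parsed before it runs. The following is only a sketch of that idea and not the script from the merge request: `syntax_ok` is a hypothetical helper, and the two temporary files stand in for a broken and a repaired version of the mail step. `bash -n` only checks syntax; shellcheck would additionally flag questionable constructs.

```shell
#!/bin/bash
# Returns 0 if the given script file parses, non-zero otherwise.
# bash -n reads the commands without executing them.
syntax_ok() {
    bash -n "$1" 2>/dev/null
}

tmpdir=$(mktemp -d)

# Unbalanced double quote, similar in spirit to the broken after_script line.
printf '%s\n' 'echo -e "$EMAIL\n\n | cat' > "$tmpdir/broken.sh"

# The same step as an explicit conditional with balanced quotes.
cat > "$tmpdir/fixed.sh" <<'EOF'
if ! nc -vz -w 1 "$MACHINE" 22; then
    echo -e "$EMAIL\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
fi
EOF

syntax_ok "$tmpdir/broken.sh" && broken=ok || broken=bad
syntax_ok "$tmpdir/fixed.sh" && fixed=ok || fixed=bad
echo "broken.sh: $broken, fixed.sh: $fixed"
rm -rf "$tmpdir"
```

Neither file is executed here, so no mail is sent. Running such a check as an early pipeline step would catch quoting mistakes before the job depends on them.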
Updated by livdywan about 4 years ago
cdywan wrote:
Maybe I should just move it to an actual script that we can validate with shellcheck, though.
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/12
Updated by okurz about 4 years ago
cdywan wrote:
- The pipeline fails if the server's unreachable.
OK. Given that we commonly suffer from too many alerts, for the next step I would prefer that the pipeline does not fail if an action was taken: if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email, as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?
Updated by livdywan about 4 years ago
okurz wrote:
cdywan wrote:
- The pipeline fails if the server's unreachable.
OK. Given that we commonly suffer from too many alerts, for the next step I would prefer that the pipeline does not fail if an action was taken: if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email, as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?
Right. The script introduced in #12 actually succeeds in both cases and a failure would indicate a problem with the script. I was worried you might object since failure=alert seems to be the default so far :-D
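The exit-code policy discussed here can be summarized in a small sketch. This is not the merged script: `health_check` and its two arguments are hypothetical, and the recovery and reporting steps are passed in as command names so the policy can be tried without hardware or a mail setup.

```shell
#!/bin/bash
# health_check RECOVER_CMD REPORT_CMD
# Both arguments are command names (hypothetical hooks, not from the real script).
health_check() {
    local recover=$1 report=$2
    if "$recover"; then
        echo "recovered automatically"
        return 0
    fi
    # Recovery failed: hand over to EngInfra. Only a failing report counts as an error.
    if "$report"; then
        echo "reported to EngInfra"
        return 0
    fi
    echo "could not recover or report" >&2
    return 1
}
```

For example, `health_check true false` and `health_check false true` both return 0, while `health_check false false` returns 1. The job then only fails when neither recovery nor reporting worked, which is the case that needs human attention.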
Updated by livdywan about 4 years ago
- Status changed from In Progress to Feedback
The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.
Updated by livdywan about 4 years ago
- Status changed from Feedback to Resolved
cdywan wrote:
The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.
Seems to work reliably now:
$ ./ipmi-health-check
Checking if openqaworker-arm-2 is healthy
PING openqaworker-arm-2.suse.de (10.160.0.227) 56(84) bytes of data.
--- openqaworker-arm-2.suse.de ping statistics ---
[...]
rtt min/avg/max/mdev = 0.201/0.201/0.201/0.000 ms
Connection to openqaworker-arm-2 22 port [tcp/ssh] succeeded!
Updated by okurz over 3 years ago
- Related to action #92176: [alert] openqaworker-arm-3 offline and CI pipeline unable to send email but stating "passed" added