action #76876

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Find a better (automated) way to inform infra about hanging (arm) workers

Added by nicksinger 9 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2020-11-02
Due date:
% Done:

0%

Estimated time:

Description

Observation

As max has already reported repeatedly that he can't extract info from our automated Grafana alerts, I think it is time to find a better solution. Just setting infra as the receiver for Grafana alerts results in mails like this:

"Dear Colleague,

Thank you for your report of: "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert"
assigned reference number: "178873"

Someone from the designate team will contact you about
your request as soon as we can. 

If you have additional comments or questions, you can
follow up to the ticket here at :

https://infra.nue.suse.com/Ticket/Display.html?id=178873

Regards,
The Engineering Infrastructure Team"
infra@suse.de

-------------------------------------------------------------------------
The original message:
-------------------------------------------------------------------------
                                                                 [IMAGE]                                                                                                 [IMAGE]                                 [IMAGE]  [IMAGE]                                         [No Data] [openqa] openqaworker-arm-3 online (long-time) alert                                      [No Data] [openqa] openqaworker-arm-3 online (long-time) alert  [No Data] [openqa] openqaworker-arm-3 online (long-time) alert The IPMI management interface for this machine is inaccessible (again). The  The IPMI management interface for this machine is inaccessible (again). The Metric name  Metric name  Value     View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)  View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-m
 onitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)  Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                machine itself is also not reachable over ping. Suggested action: Reset the  machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                 © 2016 Grafana and raintank                                                         © 2016 Grafana and raintank                     
                                                                                                                                                                                                                                                                       The IPMI management interface for this machine is inaccessible (again). The                                                                                                                                                              machine including the management interface. Similar issues were handled in   machine including the management interface. Similar issues were handled in  Value                               Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                  

                                        [No Data] [openqa] openqaworker-arm-3 online (long-time) alert                                                                                                                                                                 machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                              https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                  https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                                                                                                                                                                                                                    

                                                                                                                                                                                                                                                                       machine including the management interface. Similar issues were handled in                                                                                                                                                               https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                 https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                                                                                                                                                                                                                   

                                  The IPMI management interface for this machine is inaccessible (again). The                                                                                                                                                               https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                    https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                 https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                                                                                                                                                                                                                   

                                  machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                               https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                   https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same   https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                                                                                                                                                                                               

                                  machine including the management interface. Similar issues were handled in                                                                                                                                                                https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                   machine                                                                                                        machine                                                                                                                                                                                                                                                                                                                                                                                                

                                       https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

                                       https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                                                 machine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

                                       https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              

                                  https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                           Metric name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

                                                                    machine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

                                                                                                                                                                                                                                                                                                          Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

                                                                  Metric name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

                                                                                                                                                                                                                                         View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

                                                                     Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            

                                                                                                                                                                                                                                                                         Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

    View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           



                                    Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           



                                       Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

                                                          © 2016 Grafana and raintank                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

Acceptance criteria

  • AC1: An EngInfra ticket is created, e.g. over email to infra@suse.de, which is directly "readable", e.g. no HTML message and shorter

Suggestions

  • Maybe the mail template can be changed? (ideally to text only)
  • We can use an approach similar to the one we already have for automated_actions: let a custom GitLab job create the infra ticket
  • We can implement our own piece of software which talks to the Grafana webhook API

Related issues

Related to openQA Infrastructure - action #92176: [alert] openqaworker-arm-3 offline and CI pipeline unable to send email but stating "passed" (Resolved, 2021-05-05, 2021-05-21)

History

#1 Updated by nicksinger 9 months ago

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/394 now to disable the alerts until we have a better solution

#2 Updated by cdywan 9 months ago

  • Description updated (diff)

#3 Updated by okurz 9 months ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
  • Target version set to Ready

As it is very likely that our arm workers will disappear again soon, and we need to handle that anyway, we should regard this ticket as "Urgent". Added AC1 based on your suggestions.

#4 Updated by okurz 9 months ago

Seems the openqaworker-arm-3 BMC is down, see https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/282784 . So either this becomes more urgent or someone writes a ticket manually. I have done that around 6 times already and hence implemented the automatic ticket reporting. I would really appreciate it if it could be someone else's turn now :)

#5 Updated by okurz 9 months ago

https://github.com/grafana/grafana/issues/11436 describes an open feature request for "Alerts - "Plain Text" (i.e., non-html) Email Option?" and mentions no workaround.

Honestly, what we should do is enable the old option and explain the situation to EngInfra, including that we would like to help and that officially they are responsible for hardware in the server room. To me, having a responsive management interface clearly falls within that area of responsibility. I think they understand and can live with the current situation. Also, we have already provided better alternative proposals, e.g. network controlled power outlets, etc.

What I see as alternatives: Use another gitlab CI action to send a plain email to create the corresponding ticket. Or we try multiple times to reach the machines over IPMI and if that repeatedly fails from within the same gitlab CI action we create the ticket from there.
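The second alternative above (retry, then file the ticket from the same CI action) could look roughly like the following. This is only a sketch: `retry` and `recover_or_report` are hypothetical helper names, the attempt count of 3 is arbitrary, and the check and report commands are passed in by name so the logic can be tried without ipmitool or mail.

```shell
# retry N CMD...: run CMD up to N times and succeed as soon as one attempt does.
# No delay between attempts here; a real job would likely sleep in between.
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
  done
  return 1
}

# recover_or_report CHECK REPORT: CHECK and REPORT are hypothetical command
# names. In the pipeline CHECK would be the IPMI call and REPORT the step
# that mails infra@suse.de.
recover_or_report() {
  check=$1
  report=$2
  if retry 3 "$check"; then
    echo "recovered"
  else
    "$report"
  fi
}
```

With this shape the ticket is only created after the check has failed on every attempt, which keeps transient IPMI hiccups from producing tickets.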

#6 Updated by cdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

Thinking how to translate the suggestions into a pragmatic approach, I suppose grafana-webhook-actions already pings via ipmi and it should be possible to extend it to send a "readable" email. That seems easier than e.g. learning how to contribute a patch to Grafana and might even be more re-usable.

#8 Updated by okurz 8 months ago

MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well

#10 Updated by cdywan 8 months ago

okurz wrote:

MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677

Kinda seems as though the machine was fixed while the pipeline was running...

$ echo rebooting $MACHINE
rebooting openqaworker-arm-3
$ $IPMITOOL chassis bootdev disk
Set Boot Device to disk
$ $IPMITOOL power cycle
Chassis Power Control: Cycle
$ eval $PING || ($IPMITOOL power cycle && eval $PING)
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
PING openqaworker-arm-3.suse.de (10.160.0.85) 56(84) bytes of data.
From caasp-w6.suse.de (10.160.1.151) icmp_seq=1 Destination Host Unreachable
[...]
--- openqaworker-arm-3.suse.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 291.855/291.855/291.855/0.000 ms
$ timeout -k 5 300 sh -c "until nc -vz -w 1 $MACHINE 22; do :; done"
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
/usr/bin/bash: line 105: mail: command not found
Cleaning up file based variables
00:00
Job succeeded

I guess the mail command didn't work. I also wonder why it was called at all, as the ping before it seems to have succeeded and the mail call comes after an || in the script 🤔
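A likely explanation for mail being called is shell operator grouping: `||` and `&&` have equal precedence and are evaluated left to right, while `|` binds tighter than both, so `a || b && c | d` is read as `(a || b) && (c | d)`. The sketch below reproduces that shape with stand-in functions (`check_ok` and `send_mail` are hypothetical, not part of the pipeline).

```shell
# Stand-ins: check_ok plays the role of a succeeding port check,
# send_mail the role of the mail command.
check_ok() { return 0; }
send_mail() { cat >/dev/null; echo "mail sent"; }

# Same shape as the after_script line: a || b && c | d.
# Grouped by the shell as (a || b) && (c | d), so the pipe into
# send_mail runs whenever the left-hand list succeeds.
as_written() {
  check_ok || echo "body" && echo "blank lines" | send_mail
}

# Braces make the whole message-and-mail pipeline the right-hand side
# of ||, so it only runs when the check fails.
grouped() {
  check_ok || { echo "body"; echo "blank lines"; } | send_mail
}
```

Calling `as_written` prints "mail sent" even though the check succeeded, whereas `grouped` prints nothing.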

#11 Updated by okurz 8 months ago

cdywan wrote:

okurz wrote:

MR merged. You can trigger https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines manually now with "MACHINE" equalling "openqaworker-arm-3". As the machine is actually down and not recoverable as IPMI is down as well this should actually create an EngInfra ticket which would be good. We can tell EngInfra that the alert is correct because the machine is still down but link it to the other ticket that nsinger recorded where mmaher mentioned the machine as well

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/289677

Kinda seems as though the machine was fixed while the pipeline was running...

wow, that really surprised me now :D But I can confirm that the machine is back up. I will not add it back to salt control so that EngInfra has a chance to use that machine for testing freely.

So it seems the "mail" command is missing. In https://gitlab.suse.de/openqa/osd-deployment/-/blob/master/.gitlab-ci.yml#L3 we use the image "registry.opensuse.org/opensuse/infrastructure/images/opensuse_leap_15.0/images/opensuse-leap-15.0:current", which has it but is old and outdated. In https://build.opensuse.org/package/show/home:okurz:container/ipmitool-ping-nc-mailx I am now building a container that you could use as a replacement, with the path
registry.opensuse.org/home/okurz/container/containers/tumbleweed:ipmitool-ping-nc-mailx

#12 Updated by okurz 8 months ago

cdywan this is still urgent as right now jobs fail, e.g. with:

$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\" && echo -e '\n\n' | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de
nc: getaddrinfo for host "openqaworker-arm-1" port 22: Temporary failure in name resolution
/usr/bin/bash: line 107:  echo: command not found

in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/290949
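The " echo: command not found" message in the log above is consistent with general shell quoting rules: a backslash followed by a space quotes the space, so the shell looks for a command whose name begins with a space. The snippet below illustrates that behaviour in isolation; it is not taken from the pipeline definition.

```shell
# "\ echo" is parsed as a single command word " echo" (with a leading space),
# which does not exist, so the shell reports "command not found".
status=0
sh -c '\ echo hello' 2>/dev/null || status=$?
echo "exit status: $status"
```

The shell's conventional exit status for a command that cannot be found is 127.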

#13 Updated by okurz 8 months ago

The last jobs failed due to #80178, and currently openqaworker-arm-1 and openqaworker-arm-2 are ok, so don't be misled by the last failure regarding failed name resolution.

#14 Updated by cdywan 8 months ago

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/11

Thank you for providing the container. I shouldn't have assumed that just because we use mail elsewhere it's actually generally available :-D

Also merged the echos to get the conditionals to work.

#15 Updated by okurz 8 months ago

I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744

and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32

Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de

I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored which I assume is the case.

#16 Updated by cdywan 8 months ago

okurz wrote:

I triggered
https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744

and it showed in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/291744#L32

Connection to openqaworker-arm-3 22 port [tcp/ssh] succeeded!
Running after_script
00:01
/usr/bin/bash: eval: line 105: unexpected EOF while looking for matching `"'
Running after script...
$ nc -vz -w 1 $MACHINE 22 || \ echo -e \"$EMAIL\\n\n" | mail -s "[openqa] $MACHINE not bootable via IPMI" infra@suse.de

It's unfortunate that there is no validation for scripts and that the line numbers make no sense. I spotted the mistake in the escape codes now... the \ got moved away from the " by accident.

I am still not sure about that after_script part. I am concerned we would miss problems in there when the exit code is just ignored which I assume is the case.

  • The pipeline fails if the server's unreachable.
  • The after_script sends an email if the server really is unreachable, or does nothing if the script itself is broken. The pipeline is already in a failed state if the email should have been sent but wasn't.

Maybe I should just move it to an actual script that we can validate with shellcheck, though.
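
As a rough illustration of what such a standalone script could look like (a sketch only, not the script that was eventually merged): using printf for the message body avoids the escaped quotes that broke the inline command. The function names and the MAILER variable are hypothetical and exist only to make the sketch testable; the subject line and recipient are the ones from the job log above.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative and not from the real repository.

# Print the message body followed by a blank line. printf adds the newlines,
# so no backslash-escaped quotes are needed.
notify_body() {
    printf '%s\n\n' "$1"
}

# Pipe the body to the mail command. MAILER is an assumed override hook that
# defaults to "mail"; it lets the function be exercised without sending mail.
send_notification() {
    notify_body "$2" | "${MAILER:-mail}" -s "[openqa] $1 not bootable via IPMI" infra@suse.de
}
```

A script file like this can be checked with shellcheck before it is merged, which the inline after_script command could not be.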

#17 Updated by cdywan 8 months ago

cdywan wrote:

Maybe I should just move it to an actual script that we can validate with shellcheck, though.

https://gitlab.suse.de/openqa/grafana-webhook-actions/-/merge_requests/12

#18 Updated by okurz 8 months ago

cdywan wrote:

  • The pipeline fails if the server's unreachable.

ok. Given that we commonly suffer from too many alerts, I think for the next step I would prefer that the pipeline not fail if an action was taken. I.e. if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email, as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?

#19 Updated by okurz 8 months ago

  • Estimated time set to 80142.00 h

#20 Updated by okurz 8 months ago

  • Estimated time deleted (80142.00 h)

#21 Updated by okurz 8 months ago

  • Parent task set to #80142

#22 Updated by cdywan 8 months ago

okurz wrote:

cdywan wrote:

  • The pipeline fails if the server's unreachable.

ok. Given that we commonly suffer from too many alerts, I think for the next step I would prefer that the pipeline not fail if an action was taken. I.e. if the host can be recovered automatically, good, stop there; if that fails because we can't reach IPMI, tell EngInfra by email, as they asked for that action, and do not fail, as there is nothing more we can do anyway. WDYT?

Right. The script introduced in merge request 12 (see #17) actually succeeds in both cases, and a failure would indicate a problem with the script. I was worried you might object since failure=alert seems to be the default so far :-D
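
The agreed behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the merged ipmi-health-check script: is_reachable, try_recover and notify_infra are hypothetical hooks, and only the nc check, the mail subject and the recipient are taken from the job logs above.

```shell
#!/bin/sh
# Hypothetical sketch; the real ipmi-health-check script may differ.

# SSH port check, as used in the job logs above.
is_reachable() {
    nc -vz -w 1 "$1" 22
}

# Placeholder: the real recovery would power-cycle the machine via IPMI.
try_recover() {
    return 1
}

# Send the notification mail to EngInfra.
notify_infra() {
    printf '%s\n\n' "$1 is not reachable and could not be recovered" |
        mail -s "[openqa] $1 not bootable via IPMI" infra@suse.de
}

health_check() {
    machine=$1
    if is_reachable "$machine"; then
        echo "$machine is healthy"
        return 0
    fi
    if try_recover "$machine"; then
        echo "$machine recovered automatically"
        return 0
    fi
    # Nothing more can be done automatically: hand over to EngInfra.
    # Only a failure to notify counts as a failure of the check itself.
    notify_infra "$machine" || return 1
    echo "$machine reported to EngInfra"
    return 0
}
```

With this shape, a non-zero exit status means the check or the notification itself is broken, not that a machine is down.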

#23 Updated by cdywan 8 months ago

  • Status changed from In Progress to Feedback

The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.

#24 Updated by cdywan 8 months ago

  • Status changed from Feedback to Resolved

cdywan wrote:

The new script got merged, let's see how it fares in practice. Will try and keep an eye on the GitLab jobs.

Seems to work reliably now:

$ ./ipmi-health-check
Checking if openqaworker-arm-2 is healthy
PING openqaworker-arm-2.suse.de (10.160.0.227) 56(84) bytes of data.
--- openqaworker-arm-2.suse.de ping statistics ---
[...]
rtt min/avg/max/mdev = 0.201/0.201/0.201/0.000 ms
Connection to openqaworker-arm-2 22 port [tcp/ssh] succeeded!

#25 Updated by okurz 3 months ago

  • Related to action #92176: [alert] openqaworker-arm-3 offline and CI pipeline unable to send email but stating "passed" added
