action #76876

closed

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Find a better (automated) way to inform infra about hanging (arm) workers

Added by nicksinger over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2020-11-02
Due date:
% Done: 0%
Estimated time:

Description

Observation

As max has already reported repeatedly that he can't extract information from our automated Grafana alerts, I think it is time to find a better solution. Just setting infra as the receiver for Grafana alerts results in mails like this:

"Dear Colleague,

Thank you for your report of: "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert"
assigned reference number: "178873"

Someone from the designate team will contact you about
your request as soon as we can. 

If you have additional comments or questions, you can
follow up to the ticket here at :

https://infra.nue.suse.com/Ticket/Display.html?id=178873

Regards,
The Engineering Infrastructure Team"
infra@suse.de

-------------------------------------------------------------------------
The original message:
-------------------------------------------------------------------------
                                                                 [IMAGE]                                                                                                 [IMAGE]                                 [IMAGE]  [IMAGE]                                         [No Data] [openqa] openqaworker-arm-3 online (long-time) alert                                      [No Data] [openqa] openqaworker-arm-3 online (long-time) alert  [No Data] [openqa] openqaworker-arm-3 online (long-time) alert The IPMI management interface for this machine is inaccessible (again). The  The IPMI management interface for this machine is inaccessible (again). The Metric name  Metric name  Value     View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)  View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-m
 onitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)  Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                machine itself is also not reachable over ping. Suggested action: Reset the  machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                 © 2016 Grafana and raintank                                                         © 2016 Grafana and raintank                     
                                                                                                                                                                                                                                                                       The IPMI management interface for this machine is inaccessible (again). The                                                                                                                                                              machine including the management interface. Similar issues were handled in   machine including the management interface. Similar issues were handled in  Value                               Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                  

                                        [No Data] [openqa] openqaworker-arm-3 online (long-time) alert                                                                                                                                                                 machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                              https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                  https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                                                                                                                                                                                                                    

                                                                                                                                                                                                                                                                       machine including the management interface. Similar issues were handled in                                                                                                                                                               https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                 https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                                                                                                                                                                                                                   

                                  The IPMI management interface for this machine is inaccessible (again). The                                                                                                                                                               https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                    https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                 https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                                                                                                                                                                                                                   

                                  machine itself is also not reachable over ping. Suggested action: Reset the                                                                                                                                                               https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                   https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same   https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                                                                                                                                                                                               

                                  machine including the management interface. Similar issues were handled in                                                                                                                                                                https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                   machine                                                                                                        machine                                                                                                                                                                                                                                                                                                                                                                                                

                                       https://infra.nue.suse.com/SelfService/Update.html?id=174650 and                                                                                                                                                                https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

                                       https://infra.nue.suse.com/SelfService/Display.html?id=166330 and                                                                                                                                                                                                 machine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

                                       https://infra.nue.suse.com/SelfService/Display.html?id=164419 and                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              

                                  https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same                                                                                                                                                                                           Metric name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

                                                                    machine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

                                                                                                                                                                                                                                                                                                          Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

                                                                  Metric name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

                                                                                                                                                                                                                                         View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

                                                                     Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            

                                                                                                                                                                                                                                                                         Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

    View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           



                                    Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           



                                       Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

                                                          © 2016 Grafana and raintank                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

Acceptance criteria

  • AC1: An EngInfra ticket is created, e.g. via email to infra@suse.de, which is directly "readable", e.g. a plain-text message instead of HTML, and shorter

Suggestions

  • Maybe the mail template can be changed? (ideally to text only)
  • We can use an approach similar to the one we already have for automated_actions: let a custom GitLab job create the infra ticket
  • We can implement our own piece of software which speaks the Grafana webhook API
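
As a rough illustration of the last suggestion, here is a minimal sketch of how a webhook payload could be turned into a short, text-only mail. It assumes the payload uses the field names of Grafana's legacy alerting webhook (`title`, `message`, `ruleUrl`, `evalMatches`); this should be checked against the deployed Grafana version. The function names and the sender address are hypothetical, and actually sending the mail (e.g. with `smtplib`) is left out.

```python
from email.message import EmailMessage


def render_plain_text(payload: dict) -> str:
    """Render a webhook payload as a short plain-text mail body."""
    lines = []
    # Assumed field: "message" carries the alert's free-text description.
    message = (payload.get("message") or "").strip()
    if message:
        lines.append(message)
        lines.append("")
    # Assumed field: "evalMatches" is a list of {"metric": ..., "value": ...}.
    for match in payload.get("evalMatches") or []:
        lines.append(f"{match.get('metric', 'unknown metric')}: {match.get('value')}")
    # Assumed field: "ruleUrl" links back to the alert rule.
    rule_url = payload.get("ruleUrl")
    if rule_url:
        lines.append(f"Alert rule: {rule_url}")
    return "\n".join(lines).rstrip() + "\n"


def build_ticket_mail(payload: dict, sender: str, recipient: str) -> EmailMessage:
    """Build a text/plain mail (no HTML part) from a webhook payload."""
    mail = EmailMessage()
    mail["Subject"] = payload.get("title") or payload.get("ruleName") or "Grafana alert"
    mail["From"] = sender
    mail["To"] = recipient
    mail.set_content(render_plain_text(payload))
    return mail


# Example payload built from the alert quoted in this ticket.
example = {
    "title": "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert",
    "message": "The IPMI management interface for this machine is inaccessible (again).",
    "ruleUrl": "http://stats.openqa-monitor.qa.suse.de/alerting",
    "evalMatches": [],
}
# The sender address is a placeholder, not a real mailbox.
mail = build_ticket_mail(example, "openqa-monitor@example.org", "infra@suse.de")
print(mail.get_content())
```

The alert title is reused as the subject so the resulting ticket stays searchable, and the body is built from only a few fields so that each piece of information appears once.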

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #92176: [alert] openqaworker-arm-3 offline and CI pipeline unable to send email but stating "passed" (Status: Resolved, Assignee: mkittler, Start date: 2021-05-05, Due date: 2021-05-21)
