action #76876
closed openQA Project (public) - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Find a better (automated) way to inform infra about hanging (arm) workers
Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2020-11-02
Due date:
% Done:
0%
Estimated time:
Description
Observation
Max has repeatedly reported that he cannot extract information from our automated Grafana alerts, so I think it is time to find a better solution. Simply setting infra as the receiver for Grafana alerts results in mails like this:
"Dear Colleague,
Thank you for your report of: "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert"
assigned reference number: "178873"
Someone from the designate team will contact you about
your request as soon as we can.
If you have additional comments or questions, you can
follow up to the ticket here at :
https://infra.nue.suse.com/Ticket/Display.html?id=178873
Regards,
The Engineering Infrastructure Team"
infra@suse.de
-------------------------------------------------------------------------
The original message:
-------------------------------------------------------------------------
[IMAGE] [IMAGE] [IMAGE] [IMAGE] [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert [No Data] [openqa] openqaworker-arm-3 online (long-time) alert The IPMI management interface for this machine is inaccessible (again). The The IPMI management interface for this machine is inaccessible (again). The Metric name Metric name Value View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) View your Alert rule (http://stats.openqa-m
onitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/) Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)
machine itself is also not reachable over ping. Suggested action: Reset the machine itself is also not reachable over ping. Suggested action: Reset the
© 2016 Grafana and raintank © 2016 Grafana and raintank
The IPMI management interface for this machine is inaccessible (again). The machine including the management interface. Similar issues were handled in machine including the management interface. Similar issues were handled in Value Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
[No Data] [openqa] openqaworker-arm-3 online (long-time) alert machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Update.html?id=174650 and
machine including the management interface. Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=166330 and
The IPMI management interface for this machine is inaccessible (again). The https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and https://infra.nue.suse.com/SelfService/Display.html?id=164419 and
machine itself is also not reachable over ping. Suggested action: Reset the https://infra.nue.suse.com/SelfService/Display.html?id=166330 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same
machine including the management interface. Similar issues were handled in https://infra.nue.suse.com/SelfService/Display.html?id=164419 and machine machine
https://infra.nue.suse.com/SelfService/Update.html?id=174650 and https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same
https://infra.nue.suse.com/SelfService/Display.html?id=166330 and machine
https://infra.nue.suse.com/SelfService/Display.html?id=164419 and
https://infra.nue.suse.com/SelfService/Display.html?id=153124 for the same Metric name
machine
Value
Metric name
View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)
Value
Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
View your Alert rule (http://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?fullscreen&edit&tab=alert&panelId=7&orgId=1)
Go to the Alerts page (http://stats.openqa-monitor.qa.suse.de/alerting)
Sent by Grafana v6.4.3 (http://stats.openqa-monitor.qa.suse.de/)
© 2016 Grafana and raintank
Acceptance criteria
- AC1: An EngInfra ticket is created, e.g. by email to infra@suse.de, which is directly "readable", i.e. no HTML message and shorter
Suggestions
- Maybe the mail template can be changed, ideally to text only?
- We can use an approach similar to the one we already have for automated_actions: let a custom GitLab job create the infra ticket
- We can implement our own piece of software which talks to the Grafana webhook API
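
To illustrate the last suggestion, here is a minimal sketch of the formatting step such a piece of software would need: it turns a Grafana webhook payload into a short text-only mail. The field names (`title`, `message`, `evalMatches`, `ruleUrl`) are assumed from the JSON that Grafana's legacy webhook notifier sends and should be checked against the deployed version. The function names and the sender address are hypothetical, and receiving the webhook and sending the mail are left out.

```python
from email.message import EmailMessage


def format_alert_as_text(payload: dict) -> str:
    """Render a Grafana webhook payload as a short plain-text body.

    Assumption: the payload uses the legacy webhook field names
    "title", "message", "evalMatches" and "ruleUrl".
    """
    lines = [payload.get("title", "(no title)")]
    message = (payload.get("message") or "").strip()
    if message:
        lines += ["", message]
    matches = payload.get("evalMatches") or []
    if matches:
        lines.append("")
        for match in matches:
            lines.append(f"{match.get('metric')}: {match.get('value')}")
    rule_url = payload.get("ruleUrl")
    if rule_url:
        lines += ["", f"Alert rule: {rule_url}"]
    return "\n".join(lines)


def build_ticket_mail(payload: dict, sender: str, recipient: str) -> EmailMessage:
    """Build a text/plain mail (no HTML part) for the ticket system."""
    mail = EmailMessage()
    mail["From"] = sender
    mail["To"] = recipient
    mail["Subject"] = payload.get("title", "Grafana alert")
    mail.set_content(format_alert_as_text(payload))
    return mail


if __name__ == "__main__":
    example = {
        "title": "[No Data] [openqa] openqaworker-arm-3 online (long-time) alert",
        "message": "The IPMI management interface for this machine is inaccessible (again).",
        "ruleUrl": "http://stats.openqa-monitor.qa.suse.de/alerting",
    }
    # "osd-alerts@example.com" is a placeholder sender, not a real address.
    print(build_ticket_mail(example, "osd-alerts@example.com", "infra@suse.de"))
```

Because the mail has a single text/plain part, the ticket system has no HTML to flatten, which should avoid the duplicated and interleaved lines seen in the example above.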