action #77089
[osd][retrospective] multiple unattended alerts, unattended gitlab CI pipeline fails, all osd aarch64 workers offline
Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2020-11-07
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Tags:
Description
Observation
On 2020-11-06 I found multiple unattended alerts, unattended GitLab CI pipeline failures, and all osd aarch64 workers offline. What happened?
What I have seen failing:
- Minion Jobs alert active for more than one day
- Offline alerts for openqaworker-arm-1, openqaworker-arm-2 and openqaworker-arm-3, as well as the long-time alert for all three
- An increased job schedule of 600 aarch64 jobs that is not decreasing, see https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=12&from=1604651865692&to=1604718259586
- Multiple email alerts from failed GitLab CI pipelines, e.g. for grafana-webhook-actions, openqa-review, auto-review
- No message in Rocket.Chat or by email from anyone handling any of the above alerts until Friday, 2020-11-06, 22:00 UTC
Acceptance criteria
- AC1: Alerts handled
- AC2: gitlab CI jobs can find shared runners again
- AC3: issue has been discussed with team, e.g. in retrospective