action #77089

Updated by okurz over 3 years ago

## Observation 

 On 2020-11-06 I found multiple unattended alerts, unattended failures of gitlab CI pipelines, and all osd aarch64 workers offline. What happened? 

 What I have seen failing: 
 * Minion Jobs alert firing for more than one day 
 * Offline alerts for openqaworker-arm-1, openqaworker-arm-2 and openqaworker-arm-3, as well as the long-time alert for all three 
 * An increased number of scheduled aarch64 jobs (600) that is not decreasing, see https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=12&from=1604651865692&to=1604718259586 
 * Multiple email alerts from failed gitlab CI pipelines, e.g. for grafana-webhook-actions, openqa-review, auto-review 
 * No message in Rocket.Chat or by email from anyone handling any of the above alerts until Friday, 2020-11-06, 22:00 UTC 

 ## Acceptance criteria 
 * **AC1:** Alerts handled 
 * **AC2:** gitlab CI jobs can find shared runners again 
 * **AC3:** issue has been discussed with team, e.g. in retrospective
