action #77089
closed [osd][retrospective] multiple unattended alerts, unattended gitlab CI pipeline fails, all osd aarch64 workers offline
100%
Description
Observation
On 2020-11-06 I found multiple unattended alerts, unattended failing gitlab CI pipelines, and all osd aarch64 workers offline. What happened?
What I have seen failing:
- Minion Jobs alert for more than one day
- Offline alerts for openqaworker-arm-1, openqaworker-arm-2 and openqaworker-arm-3, plus the long-time alert for all three
- 600 scheduled aarch64 jobs, and the number is not decreasing, see https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&fullscreen&panelId=12&from=1604651865692&to=1604718259586
- Multiple email alerts from failed gitlab CI pipelines, e.g. for the grafana-webhook-ations, openqa-review, auto-review
- No message in Rocket.Chat or by email from anyone handling any of the above alerts until Friday, 2020-11-06, 22:00 UTC
Acceptance criteria
- AC1: Alerts handled
- AC2: gitlab CI jobs can find shared runners again
- AC3: issue has been discussed with team, e.g. in retrospective
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from High to Urgent
- Target version set to Ready
To have at least one aarch64 worker available, I ran ipmi-openqaworker-arm-1-ipmi power reset just now.
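For reference, a minimal sketch of how such a power reset could be scripted. This assumes that ipmi-openqaworker-arm-1-ipmi is a wrapper around ipmitool, which the ticket does not confirm; the helper names, the BMC hostname and the credentials are hypothetical placeholders.

```python
import subprocess


def ipmi_command(host, user, password, *args):
    """Build an ipmitool command line for a remote BMC (lanplus interface)."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *args]


def power_reset(host, user, password, dry_run=True):
    """Hard-reset the machine behind the BMC.

    With dry_run=True (the default) the command is only built and returned,
    so nothing is sent to the BMC by accident.
    """
    cmd = ipmi_command(host, user, password, "power", "reset")
    if not dry_run:
        # Raises CalledProcessError if ipmitool exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical values, not the real osd BMC address or credentials.
example = power_reset("openqaworker-arm-1-ipmi.example", "ADMIN", "secret")
```

Keeping dry_run as the default makes it possible to check the exact command before resetting a production worker.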
Likely all gitlab CI pipelines have been failing since the shared gitlab CI runners were changed. I have now added a comment on https://gitlab.suse.de/openqa/openqa-trigger-from-ibs-plugin/-/compare/94e15aeec52be49db86969aece69b4efd358d632...c525232ebbb658a21e0b4bb3303f432f520912ed regarding this.
Updated by okurz about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
The missing gitlab CI runners are handled in #77101.
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from In Progress to Feedback
Fixed openqaworker-arm-1 and openqaworker-arm-2. openqaworker-arm-3 once again cannot be controlled over IPMI, see the comment in #76876#note-5. The alerts for openqaworker-arm-1 and openqaworker-arm-2 are back to green.
Reviewed the minion jobs on osd. Most are obs_sync ones for which we already have a ticket, #70768. I commented there and increased its priority.
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
Acceptance criteria
- AC1: Alerts handled
DONE
- AC2: gitlab CI jobs can find shared runners again
DONE, see #77101
- AC3: issue has been discussed with team, e.g. in retrospective
DONE, discussed in https://www.retrospected.com/game/kj6hv59UD, so far at least with cdywan; this led to suggestion #77317