action #116494
closed
Too many Minion job failures alert because needle-pusher is blocked on GitLab
Added by tinita over 1 year ago.
Updated over 1 year ago.
Description
Observation
We have a lot of failed minion jobs.
Example:
https://openqa.suse.de/minion/jobs?id=5268904
result:
error: "<strong>Failed to save yast2_kdump-yast2-kdump-no-restart-info-20220913.</strong><br><pre>Unable
to fetch from origin master: Fetching origin\nremote: \nremote: ========================================================================\nremote:
\nremote: Your account has been blocked.\nremote: \nremote: ========================================================================\nremote:
\nfatal: Could not read from remote repository.\n\nPlease make sure you have the
correct access rights\nand the repository exists.\nerror: could not fetch origin</pre>"
The user has no rights to push the needles to the git repo anymore.
There is already an SD ticket about it: https://sd.suse.com/servicedesk/customer/portal/1/SD-98249
Suggestions
- DONE: Fix blocked account
- Review failed minion jobs and remove the ones that are about this ticket
- Ensure that the number of failed minion jobs is again below the alerting threshold
- There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.
- Ensure that only one alert "web UI: Too many Minion job failures alert" remains
- Cross-check alert state
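The review step in the suggestions above could be sketched roughly as follows. This is only an illustration, not the procedure actually used on OSD: it assumes the failed jobs have already been exported as a list of dicts with `id`, `state` and `result` keys, where `result` is either a plain string or a dict with an `error` key. The function names and the `threshold` parameter are hypothetical; the real alerting threshold is not stated in this ticket.

```python
from __future__ import annotations

from typing import Iterable

# Marker text copied from the error output quoted in the Observation section.
BLOCKED_MARKER = "Your account has been blocked"


def split_blocked_account_failures(jobs: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split failed jobs into (caused by the blocked account, all other failures).

    The job dict layout is an assumption made for this sketch.
    """
    related: list[dict] = []
    other: list[dict] = []
    for job in jobs:
        # Only failed jobs count towards the alert; skip everything else.
        if job.get("state") != "failed":
            continue
        result = job.get("result")
        # Accept both assumed result shapes: {"error": "..."} or a plain string.
        if isinstance(result, dict):
            text = str(result.get("error", ""))
        else:
            text = str(result or "")
        if BLOCKED_MARKER in text:
            related.append(job)
        else:
            other.append(job)
    return related, other


def below_threshold(remaining_failed: int, threshold: int) -> bool:
    """Return True if the remaining failure count is below the (hypothetical) threshold."""
    return remaining_failed < threshold
```

The jobs in the first returned list are candidates for removal; the length of the second list is what would be compared against the alerting threshold before unpausing the alert.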
Rollback steps
- Unpause alert(s) "web UI: Too many Minion job failures alert"
- Project changed from openQA Project to openQA Infrastructure
- Category deleted (Regressions/Crashes)
- Target version set to Ready
- Status changed from New to In Progress
Let's consider this in progress. I'm in touch with Jiri Novak.
For the record, openqa-pusher creates needle commits and was set up by @lnussel before, so ideally we can transfer ownership to osd-admins@suse.de now.
Once that's sorted and the account is unblocked, I'd like to document the setup. There is a README, but GitLab won't show it to me, maybe simply because of the sheer number of needles in that repo. So we may want to document this somewhere else, maybe even in the Tools wiki if we're going to own the account.
- Due date set to 2022-09-28
Setting due date based on mean cycle time of SUSE QE Tools
- Subject changed from [Alerting] web UI: Too many Minion job failures alert to Too many Minion job failures alert because needle-pusher is blocked on GitLab
Just for clarity I'm giving this a more concrete title.
On a related note, oqabot@suse.com also came up in conversation. None of the current or former Tools team members seems to use it. It's planned to be removed next week if no owner can be identified, so be sure to get back to Jiri Novak if you use it or know anything more about it.
- Status changed from In Progress to Feedback
I was able to log in as openqa-pusher and the account is now unblocked. At least one person confirmed that things are working again!
- Due date deleted (2022-09-28)
- Status changed from Feedback to Resolved
- Status changed from Resolved to Feedback
- Assignee changed from livdywan to tinita
- Description updated (diff)
I paused the alerts and updated the ticket description with suggestions and rollback steps.
I removed the Minion jobs that were caused by the GitLab problem, and unpaused the alert.
- Status changed from Feedback to Resolved
I deleted the alerts for WebUI test and WebUI old.
- Status changed from Resolved to Feedback
I now deleted the "WebUI Summary test" and "WebUI Summary old" dashboards, as Oli asked me to do that.
From the ticket description:
"There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them."
That was not really clear, as "them" could also mean "the alerts".
- Status changed from Feedback to Resolved