action #116494
closed
Too many Minion job failures alert because needle-pusher is blocked on GitLab
Description
Observation
We have a lot of failed minion jobs.
Example:
https://openqa.suse.de/minion/jobs?id=5268904
result:
error: "<strong>Failed to save yast2_kdump-yast2-kdump-no-restart-info-20220913.</strong><br><pre>Unable
to fetch from origin master: Fetching origin\nremote: \nremote: ========================================================================\nremote:
\nremote: Your account has been blocked.\nremote: \nremote: ========================================================================\nremote:
\nfatal: Could not read from remote repository.\n\nPlease make sure you have the
correct access rights\nand the repository exists.\nerror: could not fetch origin</pre>"
The user has no rights to push the needles to the git repo anymore.
There is already an SD ticket about it: https://sd.suse.com/servicedesk/customer/portal/1/SD-98249
Suggestions
- DONE: Fix blocked account
- Review failed minion jobs and remove the ones that are about this ticket
- Ensure that the number of failed minion jobs is again below the alerting threshold
- There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.
- Ensure that only one alert "web UI: Too many Minion job failures alert" remains
- Cross-check alert state
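The review step above could be sketched roughly as follows. This is only an illustration: the record layout (`id`, `state`, `result`), the helper names `split_failed_jobs` and `below_threshold`, and the threshold value are assumptions made for the example, not the actual Minion schema or the configured alert threshold. The only detail taken from this ticket is the "Your account has been blocked" text in the error output.

```python
# Minimal sketch: separate failed jobs caused by the blocked GitLab account
# from unrelated failures. Field names and sample data are hypothetical.
BLOCKED_MARKER = "Your account has been blocked"


def split_failed_jobs(jobs, marker=BLOCKED_MARKER):
    """Partition failed jobs into (related, unrelated) by their error text."""
    related, unrelated = [], []
    for job in jobs:
        result = job.get("result")
        # A result may be a dict with an "error" key or a plain string.
        if isinstance(result, dict):
            error = str(result.get("error", ""))
        else:
            error = str(result or "")
        (related if marker in error else unrelated).append(job)
    return related, unrelated


def below_threshold(jobs, threshold):
    """Return True if the number of remaining failed jobs is below threshold."""
    return len(jobs) < threshold


# Hypothetical sample records, not real job data.
sample_jobs = [
    {"id": 1, "state": "failed",
     "result": {"error": "remote: Your account has been blocked."}},
    {"id": 2, "state": "failed",
     "result": {"error": "some unrelated failure"}},
    {"id": 3, "state": "failed", "result": "Job terminated unexpectedly"},
]

related, unrelated = split_failed_jobs(sample_jobs)
print("related to this ticket:", [job["id"] for job in related])
print("still to review:", [job["id"] for job in unrelated])
# The threshold of 3 is an arbitrary placeholder.
print("below threshold:", below_threshold(unrelated, 3))
```

Only the jobs in `related` would be candidates for removal; anything in `unrelated` still needs a manual look before the alert is unpaused.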
Rollback steps
- Unpause alert(s) "web UI: Too many Minion job failures alert"
Updated by tinita about 2 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Category deleted (Regressions/Crashes)
- Target version set to Ready
Updated by livdywan about 2 years ago
- Status changed from New to In Progress
Let's consider this in progress. I'm in touch with Jiri Novak.
For the record, openqa-pusher creates needle commits and was set up by @lnussel before, so ideally we can transfer ownership to osd-admins@suse.de now.
Once that's sorted and the account is unblocked, I'd like to document the setup. There is a README, but GitLab won't show it to me, maybe simply because of the sheer number of needles in that repo. So we may want to document this somewhere else, maybe even in the Tools wiki if we're going to own the account.
Updated by openqa_review about 2 years ago
- Due date set to 2022-09-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 2 years ago
- Subject changed from [Alerting] web UI: Too many Minion job failures alert to Too many Minion job failures alert because needle-pusher is blocked on GitLab
Just for clarity, I'm giving this a more concrete title.
On a related note, oqabot@suse.com also came up in conversation. No current or former Tools team member seems to use it. It's planned to be removed next week if no owner can be identified. Be sure to get back to Jiri Novak if you know anything else or use it.
Updated by livdywan about 2 years ago
- Status changed from In Progress to Feedback
I was able to log in as openqa-pusher and the account is now unblocked. At least one person confirmed that things are working again!
Updated by okurz about 2 years ago
- Due date deleted (2022-09-28)
- Status changed from Feedback to Resolved
Seems to be good. The email should be osd-admins@suse.de as done by SUSE-IT. I clarified the email in the password file https://gitlab.suse.de/openqa/password/-/commit/f3fcbac2ef6a52e2914175e77a8da9d6261fa24e
Updated by okurz about 2 years ago
- Status changed from Resolved to Feedback
- Assignee changed from livdywan to tinita
@tinita you linked alert messages. Please check the actual minion job results and then ensure that the alert disappears from https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting please delete them.
Updated by okurz about 2 years ago
I paused the alerts and updated the ticket description with suggestions and rollback steps.
Updated by tinita about 2 years ago
I removed the minion jobs that were caused by the gitlab problem, and unpaused the alert.
Updated by tinita about 2 years ago
- Status changed from Feedback to Resolved
I deleted the alerts for WebUI test and WebUI old
Updated by okurz about 2 years ago
- Status changed from Resolved to Feedback
I still see the three dashboards, e.g. https://monitor.qa.suse.de/d/Webuiold/webui-summary-old?editPanel=17&tab=alert&orgId=1 and https://monitor.qa.suse.de/d/Webuitest/webui-summary-test?editPanel=17&tab=alert&orgId=1
Updated by tinita about 2 years ago
I now deleted the "WebUI Summary test" and "WebUI Summary old" dashboards, as Oli asked me to do that.
From the ticket description:
There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.
That was not really clear, as "them" could mean "the alerts".