action #116494

closed

Too many Minion job failures alert because needle-pusher is blocked on GitLab

Added by tinita about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-09-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

We have a lot of failed Minion jobs.
Example:
https://openqa.suse.de/minion/jobs?id=5268904

result:
  error: "<strong>Failed to save yast2_kdump-yast2-kdump-no-restart-info-20220913.</strong><br><pre>Unable
    to fetch from origin master: Fetching origin\nremote: \nremote: ========================================================================\nremote:
    \nremote: Your account has been blocked.\nremote: \nremote: ========================================================================\nremote:
    \nfatal: Could not read from remote repository.\n\nPlease make sure you have the
    correct access rights\nand the repository exists.\nerror: could not fetch origin</pre>"

The openqa-pusher GitLab account has been blocked, so it can no longer fetch from or push needles to the Git repository.
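
The result dump above is hard to read because of the HTML tags and escaped newlines. A minimal Python sketch of how such an error string could be turned into plain text and checked for the block notice follows. It assumes the error is available as a string in the shape shown above; the function names and the shortened sample string are hypothetical and not part of openQA or Minion.

```python
import html
import re

# Text taken from the GitLab error message in this ticket.
BLOCKED_MARKER = "Your account has been blocked"


def readable_error(raw: str) -> str:
    """Convert the HTML-ish error string of a job result into plain text."""
    text = raw.replace("\\n", "\n")          # literal backslash-n pairs, as in the dump
    text = re.sub(r"<br\s*/?>", "\n", text)  # HTML line breaks become newlines
    text = re.sub(r"<[^>]+>", "", text)      # drop remaining tags such as <strong>, <pre>
    return html.unescape(text).strip()


def is_blocked_account_error(raw: str) -> bool:
    """True if the error text contains the GitLab block notice."""
    return BLOCKED_MARKER in readable_error(raw)


# Shortened, made-up sample in the same shape as the real error.
sample = (
    "<strong>Failed to save example-needle.</strong><br><pre>Unable to fetch "
    "from origin master: Fetching origin\\nremote: Your account has been blocked.\\n"
    "fatal: Could not read from remote repository.</pre>"
)

print(readable_error(sample))
print(is_blocked_account_error(sample))
```

Matching on the block notice rather than on the needle name keeps the check independent of which needle failed to save.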

There is already an SD ticket about it: https://sd.suse.com/servicedesk/customer/portal/1/SD-98249

Suggestions

  • DONE: Fix blocked account
  • Review the failed Minion jobs and remove the ones caused by this issue
  • Ensure that the number of failed Minion jobs is below the alerting threshold again
  • There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.
  • Ensure that only one alert "web UI: Too many Minion job failures alert" remains
  • Cross-check alert state
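
The review and threshold check suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: the job records are plain dictionaries with an "id" and a "result" in the shape of the example in the observation, the sample records are made up, and ALERT_THRESHOLD is a hypothetical value, not the one configured for the real alert.

```python
BLOCKED_MARKER = "Your account has been blocked"  # text from the GitLab error in this ticket
ALERT_THRESHOLD = 10  # hypothetical; use the threshold configured for the real alert


def error_text(job):
    """Return the error message of a job record, or an empty string."""
    result = job.get("result")
    if isinstance(result, dict):
        return str(result.get("error", ""))
    return str(result or "")


def split_failed_jobs(jobs):
    """Split failed jobs into IDs related and unrelated to the blocked account."""
    related, unrelated = [], []
    for job in jobs:
        target = related if BLOCKED_MARKER in error_text(job) else unrelated
        target.append(job["id"])
    return related, unrelated


def below_threshold(remaining, threshold=ALERT_THRESHOLD):
    """True if the number of remaining failed jobs is below the alert threshold."""
    return remaining < threshold


# Made-up sample records; real ones would come from the Minion job list.
sample_jobs = [
    {"id": 1, "result": {"error": "remote: Your account has been blocked."}},
    {"id": 2, "result": "some other failure"},
    {"id": 3, "result": {"error": "fatal: Your account has been blocked."}},
]

related, unrelated = split_failed_jobs(sample_jobs)
print("candidates for removal:", related)
print("still to review:", unrelated)
print("below threshold after cleanup:", below_threshold(len(unrelated)))
```

Only the jobs in the first list are candidates for removal; the unrelated failures still need a manual look before the alert is unpaused.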

Rollback steps

  • Unpause alert(s) "web UI: Too many Minion job failures alert"
Actions #1

Updated by tinita about 2 years ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Regressions/Crashes)
  • Target version set to Ready
Actions #2

Updated by tinita about 2 years ago

  • Assignee set to livdywan
Actions #3

Updated by livdywan about 2 years ago

  • Status changed from New to In Progress

Let's consider this in progress. I'm in touch with Jiri Novak.

For the record, openqa-pusher creates the needle commits and was set up by @lnussel, so ideally we can now transfer ownership to osd-admins@suse.de.

Once that's sorted and the account is unblocked, I'd like to document the setup. There is a README, but GitLab won't show it to me, maybe simply because of the sheer number of needles in that repo. We may want to document this somewhere else, maybe even in the Tools wiki if we're going to own the account.

Actions #4

Updated by openqa_review about 2 years ago

  • Due date set to 2022-09-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan about 2 years ago

  • Subject changed from [Alerting] web UI: Too many Minion job failures alert to Too many Minion job failures alert because needle-pusher is blocked on GitLab

Just for clarity, I'm giving this a more concrete title.

On a related note, oqabot@suse.com also came up in conversation. None of the current or former Tools team members seems to use it. It is planned to be removed next week if no owner can be identified, so be sure to get back to Jiri Novak if you use it or know anything else about it.

Actions #6

Updated by livdywan about 2 years ago

  • Status changed from In Progress to Feedback

I was able to log in as openqa-pusher and the account is now unblocked. At least one person confirmed that things are working again!

Actions #7

Updated by okurz about 2 years ago

  • Due date deleted (2022-09-28)
  • Status changed from Feedback to Resolved

Seems to be good. The email should be osd-admins@suse.de as done by SUSE-IT. I clarified the email in the password file https://gitlab.suse.de/openqa/password/-/commit/f3fcbac2ef6a52e2914175e77a8da9d6261fa24e

Actions #8

Updated by okurz about 2 years ago

  • Status changed from Resolved to Feedback
  • Assignee changed from livdywan to tinita

@tinita, you linked alert messages. Please check the actual Minion job results and then ensure that the alert disappears from https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting please delete them.

Actions #9

Updated by okurz about 2 years ago

  • Description updated (diff)
Actions #10

Updated by okurz about 2 years ago

I paused the alerts and updated the ticket description with suggestions and rollback steps.

Actions #11

Updated by tinita about 2 years ago

I removed the failed Minion jobs that were caused by the GitLab problem and unpaused the alert.

Actions #12

Updated by tinita about 2 years ago

  • Status changed from Feedback to Resolved

I deleted the alerts for WebUI test and WebUI old.

Actions #14

Updated by tinita about 2 years ago

I have now deleted the "WebUI Summary test" and "WebUI Summary old" dashboards, as Oli asked me to.
The wording in the ticket description

There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.

was not really clear, as "them" could also mean "the alerts".

Actions #15

Updated by tinita about 2 years ago

  • Status changed from Feedback to Resolved