action #116494

Too many Minion job failures alert because needle-pusher is blocked on GitLab

Added by tinita 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2022-09-13
Due date:
% Done:
0%

Estimated time:

Description

Observation

We have a lot of failed Minion jobs.
Example:
https://openqa.suse.de/minion/jobs?id=5268904

result:
  error: "<strong>Failed to save yast2_kdump-yast2-kdump-no-restart-info-20220913.</strong><br><pre>Unable
    to fetch from origin master: Fetching origin\nremote: \nremote: ========================================================================\nremote:
    \nremote: Your account has been blocked.\nremote: \nremote: ========================================================================\nremote:
    \nfatal: Could not read from remote repository.\n\nPlease make sure you have the
    correct access rights\nand the repository exists.\nerror: could not fetch origin</pre>"

The user can no longer push needles to the Git repository because its GitLab account is blocked.

There is already an SD ticket about it: https://sd.suse.com/servicedesk/customer/portal/1/SD-98249

Suggestions

  • DONE: Fix blocked account
  • Review failed Minion jobs and remove the ones that are about this ticket
  • Ensure that the number of failed Minion jobs is again below the alerting threshold
  • There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.
  • Ensure that only one alert "web UI: Too many Minion job failures alert" remains
  • Cross-check alert state
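
The review step above could be sketched roughly as follows. This is only an illustration under stated assumptions: it works on job records that were already exported as plain dictionaries with `id`, `state` and `result` keys, it does not talk to Minion itself, and the names `is_blocked_account_failure`, `split_failed_jobs` and the `threshold` parameter are hypothetical. The real alerting threshold is not stated in this ticket.

```python
from typing import Any

# Marker taken from the error output quoted in the Observation section.
BLOCKED_MARKER = "Your account has been blocked"


def is_blocked_account_failure(job: dict[str, Any]) -> bool:
    """Return True if a failed job's error mentions the blocked account."""
    if job.get("state") != "failed":
        return False
    result = job.get("result")
    # Assumption: the result is either a mapping with an "error" key
    # or a plain string (or missing).
    if isinstance(result, dict):
        error = str(result.get("error", ""))
    else:
        error = str(result or "")
    return BLOCKED_MARKER in error


def split_failed_jobs(
    jobs: list[dict[str, Any]], threshold: int
) -> tuple[list[Any], int, bool]:
    """Return (ids related to this ticket, remaining failures, below threshold)."""
    related = [job["id"] for job in jobs if is_blocked_account_failure(job)]
    failed_total = sum(1 for job in jobs if job.get("state") == "failed")
    remaining = failed_total - len(related)
    # "threshold" is a placeholder for the alerting threshold.
    return related, remaining, remaining < threshold
```

The returned ids are the candidates for removal. The remaining count shows whether other, unrelated failures would still keep the alert firing after the cleanup.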

Rollback steps

  • Unpause alert(s) "web UI: Too many Minion job failures alert"

History

#1 Updated by tinita 3 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (Concrete Bugs)
  • Target version set to Ready

#2 Updated by tinita 3 months ago

  • Assignee set to cdywan

#3 Updated by cdywan 3 months ago

  • Status changed from New to In Progress

Let's consider this in progress. I'm in touch with Jiri Novak.

For the record, openqa-pusher creates needle commits and was set up by lnussel before, so ideally we can transfer ownership to osd-admins@suse.de now.

Once that's sorted and the account is unblocked, I'd like to document the setup. There is a README, but GitLab won't show it to me, maybe simply because of the sheer number of needles in that repo. So we may want to document this somewhere else, maybe even in the Tools wiki if we're going to own the account.

#4 Updated by openqa_review 3 months ago

  • Due date set to 2022-09-28

Setting due date based on mean cycle time of SUSE QE Tools

#5 Updated by cdywan 3 months ago

  • Subject changed from [Alerting] web UI: Too many Minion job failures alert to Too many Minion job failures alert because needle-pusher is blocked on GitLab

Just for clarity I'm giving this a more concrete title.

On a related note, oqabot@suse.com also came up in conversation. None of the current or former Tools team members seems to use it. It's planned to be removed next week if no owner can be identified, so be sure to get back to Jiri Novak if you know anything else or use it.

#6 Updated by cdywan 3 months ago

  • Status changed from In Progress to Feedback

I was able to log in as openqa-pusher and the account is now unblocked. At least one person confirmed that things are working again!

#7 Updated by okurz 3 months ago

  • Due date deleted (2022-09-28)
  • Status changed from Feedback to Resolved

Seems to be good. The email should be osd-admins@suse.de as done by SUSE-IT. I clarified the email in the password file https://gitlab.suse.de/openqa/password/-/commit/f3fcbac2ef6a52e2914175e77a8da9d6261fa24e

#8 Updated by okurz 3 months ago

  • Status changed from Resolved to Feedback
  • Assignee changed from cdywan to tinita

tinita, you linked alert messages. Please check the actual Minion job results and then ensure that the alert disappears from https://stats.openqa-monitor.qa.suse.de/alerting/list?state=not_ok
There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting please delete them.

#9 Updated by okurz 3 months ago

  • Description updated (diff)

#10 Updated by okurz 3 months ago

I paused the alerts and updated the ticket description with suggestions and rollback steps.

#11 Updated by tinita 3 months ago

I removed the Minion jobs that were caused by the GitLab problem, and unpaused the alert.

#12 Updated by tinita 3 months ago

  • Status changed from Feedback to Resolved

I deleted the alerts for WebUI test and WebUI old

#14 Updated by tinita 3 months ago

I have now deleted the "WebUI Summary test" and "WebUI Summary old" dashboards, as Oli asked me to.
This sentence from the ticket description:

There are three alerts now as well due to "webui-old" and "webui-test" dashboards. As decided in the weekly meeting 2022-09-16 please delete them.

was not really clear, as "them" could mean "the alerts".

#15 Updated by tinita 3 months ago

  • Status changed from Feedback to Resolved
