action #128945

closed

[alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k)

Added by okurz 12 months ago. Updated 12 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 2023-05-08
Due date: 2023-05-24
% Done: 0%
Estimated time:

Description


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #128942: [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M | Resolved | nicksinger | 2023-05-08 | 2023-06-23

Actions #1

Updated by okurz 12 months ago

  • Copied from action #128942: [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M added
Actions #2

Updated by livdywan 12 months ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

Looking at the Minion dashboard, it seems these are all obs_rsync_run jobs:

Unexpected local arg: /opt/openqa-trigger-from-ibs/SUSE:SLE-15-SP3:Update:BCI/Media1_SLE-BCI-*.lst
If arg is a remote file/dir, prefix it with a colon (:).
rsync error: syntax or usage error (code 1) at main.c(1528) [Receiver=3.2.3]
Actions #3

Updated by mkittler 12 months ago

So a recent change in https://github.com/os-autoinst/openqa-trigger-from-obs likely caused this, as this public repo is actually what we use on OSD at this point as well:

martchus@openqa:/opt/openqa-trigger-from-ibs> sudo -u geekotest git remote -v
old     https://gitlab.suse.de/openqa/openqa-trigger-from-ibs.git (fetch)
old     https://gitlab.suse.de/openqa/openqa-trigger-from-ibs.git (push)
origin  https://github.com/os-autoinst/openqa-trigger-from-obs (fetch)
origin  https://github.com/os-autoinst/openqa-trigger-from-obs (push)

But strangely the last change there is quite old.

Actions #4

Updated by openqa_review 12 months ago

  • Due date set to 2023-05-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by andriinikitin 12 months ago

This should be fixed now

geekotest@openqa:/opt/openqa-trigger-from-ibs> bash SUSE\:SLE-15-SP3\:Update\:BCI/read_files.sh 
Unexpected local arg: /opt/openqa-trigger-from-ibs/SUSE:SLE-15-SP3:Update:BCI/Media1_SLE-BCI-aarch64.lst
If arg is a remote file/dir, prefix it with a colon (:).
rsync error: syntax or usage error (code 1) at main.c(1528) [Receiver=3.2.3]
geekotest@openqa:/opt/openqa-trigger-from-ibs> rm SUSE\:SLE-15-SP3\:Update\:BCI/*lst
geekotest@openqa:/opt/openqa-trigger-from-ibs> bash SUSE\:SLE-15-SP3\:Update\:BCI/read_files.sh 
geekotest@openqa:/opt/openqa-trigger-from-ibs> 

The problem is a combination of rare things: we should avoid using '*' in file names in the XML, and there were leftover .lst files (because of a reverted commit in openqa-trigger-from-obs).

I will clean the tasks in gru as well.
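
The ticket does not show the contents of read_files.sh, so the following is only a sketch of the general shell behaviour behind this kind of failure, not the actual script. It assumes an unquoted pattern such as Media1_SLE-BCI-*.lst is passed as a command argument; count_args and the temporary directory are hypothetical helpers for illustration. With no matching files bash passes the pattern through literally as one argument, but once leftover .lst files exist the same pattern expands to one argument per file, which changes what a command like rsync sees.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

# Scratch directory standing in for the project directory (assumption).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# No .lst files yet: with bash defaults (nullglob off) the unmatched
# pattern is passed on unchanged as a single argument.
n_none=$(count_args "$workdir"/Media1_SLE-BCI-*.lst)

# Simulate leftover files from an earlier run.
touch "$workdir/Media1_SLE-BCI-aarch64.lst" "$workdir/Media1_SLE-BCI-x86_64.lst"

# The same unquoted pattern now expands to one argument per matching file.
n_leftover=$(count_args "$workdir"/Media1_SLE-BCI-*.lst)

# Quoting the pattern suppresses expansion, so it stays a single argument.
n_quoted=$(count_args "$workdir/Media1_SLE-BCI-*.lst")

echo "no leftovers: $n_none, with leftovers: $n_leftover, quoted: $n_quoted"
```

This is consistent with the session above, where removing the *lst files made the script run cleanly again, though whether read_files.sh builds its rsync arguments exactly this way is an assumption.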

Actions #6

Updated by livdywan 12 months ago

  • Status changed from In Progress to Feedback

andriinikitin wrote:

The problem is a combination of rare things: we should avoid using '*' in file names in the XML, and there were leftover .lst files (because of a reverted commit in openqa-trigger-from-obs).

I will clean the tasks in gru as well.

Thank you for taking care of it!

Actions #7

Updated by okurz 12 months ago

  • Status changed from Feedback to In Progress

We still have a way too high number of failed minion jobs, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=19

Actions #8

Updated by livdywan 12 months ago

  • Status changed from In Progress to Feedback

okurz wrote:

We still have a way too high number of failed minion jobs, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=19

Ack. I hadn't checked the "other ones" yet. There are some really old ones here.

I have now also deleted a bunch of save_needle jobs failing like this:

 error: |-
    <strong>Failed to save installation-autoyast-boot-20230505.</strong><br><pre>Unable to commit via Git: On branch master
    Your branch is up to date with 'origin/master'.

    nothing to commit, working tree clean
    </pre>

and limit_{results_and_logs,screenshots} failing like this:

notes:
  gru_id: 33669590
  signal_handler: Received signal TERM, scheduling retry and releasing locks
Actions #9

Updated by livdywan 12 months ago

  • Status changed from Feedback to Resolved

No new ones over the weekend. I think we're good here.
