action #118969

closed

[alert] web UI: Too many Minion job failures alert

Added by jbaier_cz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version:
Start date: 2022-10-17
Due date:
% Done: 0%
Estimated time:
Tags:

Description

E-mail alert about Minion job failures. The job in question is download_assets, see https://openqa.suse.de/minion/jobs?id=5564865:

result: 'Downloading "http://carwos-runner.qa.suse.de/gitlab/Bogdan.Lezhepekov/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
  failed with: Download of "/var/lib/openqa/share/factory/hdd/carwos-Bogdan.Lezhepekov_branch_mr-1194442.qcow2"
  failed: 521 Connect timeout'

The host in question is indeed unreachable at the moment:

$  ping -c 1 carwos-runner.qa.suse.de
PING carwos-runner.qa.suse.de (10.161.50.2) 56(84) bytes of data.

--- carwos-runner.qa.suse.de ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

I already deleted a bunch of these jobs, but there are more in the inactive queue, so I expect more failures in the near future.


Related issues: 1 (1 open, 0 closed)

Related to openQA Project - coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert (New, 2020-09-01)

Actions #1

Updated by jbaier_cz over 1 year ago

  • Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Actions #2

Updated by okurz over 1 year ago

  • Category changed from Organisational to Feature requests
  • Priority changed from Normal to Urgent

First step: Disable the alert and notify the corresponding machine owner. Second step: Those error messages would be good as "reason" within the corresponding jobs, which should end up incomplete. There should not be an alert notifying us, as the test owners who specified the asset download URL need to handle that.

Actions #3

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress

I left a message in the chat. The "Incomplete jobs (not restarted) of last 24h alert" alert is likely related as well.

Actions #5

Updated by mkittler over 1 year ago

Those error messages would be good as "reason" within the corresponding jobs, which should end up incomplete.

The jobs already end up incomplete. These GRU downloads predate the introduction of the reason field, so the error message is added as a job module result here (e.g. https://openqa.suse.de/tests/9746552). Should I change that to use the reason field instead? That would be more consistent: we would no longer have the rather misleading reason "no test modules scheduled/uploaded", and we could search for those jobs via DB queries.
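To illustrate the point about DB queries, here is a minimal sketch using an in-memory SQLite table. The table layout (`jobs` with `id`, `result`, `reason`), the sample rows, and the helper `find_jobs_by_reason` are assumptions made up for this example; they are not the actual openQA schema or data.

```python
import sqlite3

# Hypothetical, simplified stand-in for a jobs table; not the real openQA schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, reason TEXT)")
conn.executemany(
    "INSERT INTO jobs (id, result, reason) VALUES (?, ?, ?)",
    [
        # Made-up rows: one job with the download error as its reason,
        # one with the generic reason mentioned above, one passing job.
        (1, "incomplete", "asset download failed: 521 Connect timeout"),
        (2, "incomplete", "no test modules scheduled/uploaded"),
        (3, "passed", None),
    ],
)


def find_jobs_by_reason(connection, pattern):
    """Return the IDs of incomplete jobs whose reason matches a SQL LIKE pattern."""
    rows = connection.execute(
        "SELECT id FROM jobs WHERE result = 'incomplete' AND reason LIKE ? ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]


# With the error stored in the reason field, affected jobs can be selected directly.
download_failures = find_jobs_by_reason(conn, "%download failed%")
```

With the error message only stored as a job module result, a query like this would have nothing specific to match on, since every affected job would carry the same generic reason.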

There should not be an alert notifying us, as the test owners who specified the asset download URL need to handle that.

I suppose the easiest way to prevent that is to avoid considering these Minion jobs failed.
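A rough sketch of that idea, written in Python purely for illustration (it is not the openQA or Minion code): download errors caused by the user-specified URL end the task in a non-failed state and carry a note, while unexpected errors still count as failures. The names `DownloadError`, `TaskOutcome`, and `run_download_task`, as well as the state strings, are hypothetical.

```python
from dataclasses import dataclass


class DownloadError(Exception):
    """Hypothetical error raised when an asset download fails."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


@dataclass
class TaskOutcome:
    state: str  # "finished" or "failed"; only "failed" would count towards the alert
    note: str   # message that could be surfaced on the affected test job


def run_download_task(download, url):
    """Call download(url) and map the result to a task outcome."""
    try:
        download(url)
    except DownloadError as err:
        # Assumption: a broken or unreachable source URL is for the test owner
        # to fix, so the task is not reported as failed.
        return TaskOutcome("finished", f'Downloading "{url}" failed: {err.status} {err}')
    except Exception as err:
        # Anything unexpected is still treated as a real failure.
        return TaskOutcome("failed", f"unexpected error: {err}")
    return TaskOutcome("finished", "")


def unreachable(url):
    """Stand-in downloader that simulates the timeout from this ticket."""
    raise DownloadError(521, "Connect timeout")


outcome = run_download_task(unreachable, "http://example.invalid/image.qcow2")
```

Here `outcome.state` is "finished" even though the download did not succeed, so a failure counter would not pick it up, while the note keeps the error text available for the test owner.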

Actions #6

Updated by mkittler over 1 year ago

Actions #7

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved
  • Resumed the alert
  • PR has been merged and deployed