action #176124

closed

coordination #161414: [epic] Improved salt based infrastructure management

OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

See #176013-5

The number of failed Minion jobs shown in Grafana does not match what the other sources report.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&from=2025-01-22T18:45:14.947Z&to=2025-01-23T02:53:20.613Z&var-host_disks=$__all
Grafana shows a peak of 645 failed Minion jobs lasting 4 hours, with 13 before and 14 after.
The current dashboard https://openqa.suse.de/minion says 652 failed jobs.
However, the influx route says 14:
https://openqa.suse.de/admin/influxdb/minion

openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i
openqa_minion_jobs_hook_rc_failed,...
...
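
To compare the route's value against the dashboard, the `failed` field can be read out of the line shown above. A minimal sketch in Python, assuming the simple form of the InfluxDB line protocol seen here (no escaped spaces or commas, integer fields with an `i` suffix); `parse_line` is a hypothetical helper, not part of openQA:

```python
def parse_line(line):
    """Split one line-protocol line into measurement, tags and integer fields."""
    # First token is "measurement,tag=value,..."; second is "field=value,...".
    head, fields = line.split(" ")[:2]
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    values = {}
    for pair in fields.split(","):
        key, raw = pair.split("=", 1)
        values[key] = int(raw.rstrip("i"))  # "14i" -> 14
    return measurement, tags, values


line = "openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i"
measurement, tags, values = parse_line(line)
print(values["failed"])  # 14
```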

So we need to check why the route reports only 14.

The relevant code:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L90


Related issues (2: 1 open, 1 closed)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)
Copied from openQA Infrastructure (public) - action #176013: [alert] web UI: Too many Minion job failures alert size:S (Resolved, ybonatakis, 2025-01-23)

Actions #1

Updated by okurz about 1 month ago

  • Copied from action #176013: [alert] web UI: Too many Minion job failures alert size:S added
Actions #2

Updated by tinita about 1 month ago

  • Description updated (diff)
Actions #3

Updated by tinita about 1 month ago

  • Assignee set to tinita
Actions #4

Updated by tinita about 1 month ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #5

Updated by kraih about 1 month ago

As detailed in https://progress.opensuse.org/issues/176013#note-8, this is not actually a bug, but a slightly misleading feature.

Actions #6

Updated by tinita about 1 month ago

  • Status changed from New to Resolved

This was very likely caused by a corrupted openqa.ini, see #175710

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L92

my $block_list = $self->app->config->{influxdb}->{ignored_failed_minion_jobs} || [];

We normally ignore obs_rsync tasks. For 4 hours the Minion server was running with a corrupted config file that was missing ignored_failed_minion_jobs, so the block list was empty and those failures were counted as well. This lasted until the server was restarted, at which point the config file was ok again.
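
The effect of the missing setting can be illustrated with a small sketch. It is written in Python rather than openQA's Perl, and the function name, task names and counts are made up for illustration; only the fallback to an empty list mirrors the `|| []` in the line quoted above.

```python
def count_failed(failed_tasks, ignored=None):
    """Count failed jobs, skipping every task named in the ignore list."""
    # A missing setting falls back to an empty list, so nothing is ignored.
    block_list = ignored or []
    return sum(1 for task in failed_tasks if task not in block_list)


# Hypothetical sample: 5 failed rsync jobs plus 2 unrelated failures.
failed_tasks = ["obs_rsync_run"] * 5 + ["some_other_task"] * 2

print(count_failed(failed_tasks, ["obs_rsync_run"]))  # 2, setting present
print(count_failed(failed_tasks, None))               # 7, setting missing
```

With the setting in place only the unrelated failures are reported; without it every failed job is counted, which would match the kind of temporary jump described in this ticket.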
