action #176124
coordination #161414 (closed): [epic] Improved salt based infrastructure management
OSD influxdb minion route seemingly returns only a very small number of failed minion jobs, not all
Description
Motivation
See #176013-5
What is weird is that the number of failed Minion jobs in Grafana differs from the number in the Minion dashboard.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&from=2025-01-22T18:45:14.947Z&to=2025-01-23T02:53:20.613Z&var-host_disks=$__all
Grafana shows a peak of 645 failed Minion jobs for 4 hours, with 13 before and 14 after.
The current dashboard https://openqa.suse.de/minion says 652 failed jobs.
However the influx route says 14:
https://openqa.suse.de/admin/influxdb/minion
openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i
openqa_minion_jobs_hook_rc_failed,...
...
So that's something to check: why does the route only report 14?
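To compare the route output with the dashboard number, the `failed` field can be pulled out of the line shown above. The following is a minimal Python sketch, not part of openQA; the function name `parse_influx_line` is hypothetical, and the parser assumes a simplified line protocol with integer fields and no escaped spaces or commas.

```python
def parse_influx_line(line):
    """Split one line-protocol entry into measurement, tags and integer fields.

    Simplified on purpose: assumes no escaped spaces/commas and that all
    field values are integers with the trailing "i" marker.
    """
    parts = line.strip().split(" ")
    head, field_part = parts[0], parts[1]  # an optional timestamp in parts[2] is ignored
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, value = item.split("=", 1)
        fields[key] = int(value.rstrip("i"))
    return measurement, tags, fields


# The line quoted in this ticket
line = "openqa_minion_jobs,url=https://openqa.suse.de active=1i,delayed=2i,failed=14i,inactive=2i"
measurement, tags, fields = parse_influx_line(line)
print(measurement, fields["failed"])
```

For the quoted line this yields a `failed` value of 14, the number that disagrees with the 652 on the dashboard.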
The relevant code:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L90
Updated by okurz about 1 month ago
- Copied from action #176013: [alert] web UI: Too many Minion job failures alert size:S added
Updated by tinita about 1 month ago
- Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by kraih about 1 month ago
As detailed in https://progress.opensuse.org/issues/176013#note-8, this is not actually a bug, but a slightly misleading feature.
Updated by tinita about 1 month ago
- Status changed from New to Resolved
This was very likely caused by a corrupted openqa.ini, see #175710
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/Admin/Influxdb.pm#L92
my $block_list = $self->app->config->{influxdb}->{ignored_failed_minion_jobs} || [];
We normally ignore obs_rsync tasks. For 4 hours the minion server was running with a corrupted config file that was missing ignored_failed_minion_jobs, so the failed obs_rsync jobs were counted as well (until it was restarted, at which point the config file was OK again).
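The effect of a missing block list can be illustrated with a small Python sketch. It is not the openQA implementation (which is Perl, see the link above) and only mirrors the `|| []` fallback from the quoted line. The function name `count_failed`, the task names and the job counts are all made up for illustration.

```python
def count_failed(failed_tasks, ignored_tasks=None):
    """Count failed jobs whose task name is not on the block list.

    A missing setting (None) falls back to an empty block list, like the
    `|| []` in the Perl line, so nothing is ignored in that case.
    """
    block_list = set(ignored_tasks or [])
    return sum(1 for task in failed_tasks if task not in block_list)


# Hypothetical failed jobs: task names and counts are illustrative only
failed = ["obs_rsync_run"] * 5 + ["other_task"] * 2

with_setting = count_failed(failed, ["obs_rsync_run"])  # config intact
without_setting = count_failed(failed, None)            # setting missing from config

print(with_setting, without_setting)
```

With the setting present, only the two non-ignored failures are counted. Without it, all seven are counted, which is the kind of jump seen in Grafana during the 4 hours with the corrupted config.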