action #93612
closed
Several unhandled alerts triggered regarding incompletes and running out of space
0%
Description
These are not necessarily related but to start with I'd like to preserve these alerts because this is a lot of unhandled alerts in succession. It may make sense to split or move some out.
Note: All of these went OK eventually as far as I can tell but I couldn't determine why.
File systems
http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1
[No Data] File systems alert
One of the file systems is too full
Metric name | Value
Job age
[Alerting] Job age (scheduled) (median) alert
Check for overall decrease of "time to start". Possible reasons for regression:
* Not enough resources
* Too many tests scheduled due to misconfiguration
2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision
Related progress issue: https://progress.opensuse.org/issues/65975
Metric name | Value
50% percentile (median) | 566107.500
[Alerting] Job age (scheduled) (max) alert
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Metric name | Value
50% percentile (max) | 525289.000
[Alerting] Job age (scheduled) (max) alert
Jobs not scheduled for 4 days (345600s). Possible reasons:
* There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Metric name | Value
50% percentile (max) | 556986.500
[Alerting] Job age (scheduled) (median) alert
Check for overall decrease of "time to start". Possible reasons for regression:
* Not enough resources
* Too many tests scheduled due to misconfiguration
2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision
Related progress issue: https://progress.opensuse.org/issues/65975
Metric name | Value
50% percentile (median) | 620407.500
http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1
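To put the job age values above in relation to the limits quoted in the alert descriptions (259200s = 3d for the median, 345600s = 4d for the max), here is a minimal Python sketch. It assumes the alerts fire when the reported value exceeds the limit; the actual Grafana rule is not shown in this ticket, and the helper names are made up for illustration.

```python
# Limits as quoted in the alert descriptions above.
MEDIAN_LIMIT_S = 259200  # 3 days
MAX_LIMIT_S = 345600     # 4 days

SECONDS_PER_DAY = 24 * 60 * 60


def to_days(seconds: float) -> float:
    """Convert a job age in seconds to days."""
    return seconds / SECONDS_PER_DAY


def exceeds_limit(value_s: float, limit_s: float) -> bool:
    """Assumption: the alert fires when the value is above the limit."""
    return value_s > limit_s


# (alert kind, reported value in seconds, limit in seconds),
# values copied from the alerts listed in this ticket.
readings = [
    ("median", 566107.5, MEDIAN_LIMIT_S),
    ("max", 525289.0, MAX_LIMIT_S),
    ("max", 556986.5, MAX_LIMIT_S),
    ("median", 620407.5, MEDIAN_LIMIT_S),
]

for kind, value_s, limit_s in readings:
    status = "over limit" if exceeds_limit(value_s, limit_s) else "ok"
    print(f"{kind}: {to_days(value_s):.2f}d (limit {to_days(limit_s):.0f}d) {status}")
```

All four readings correspond to more than 6 days, so they are above both the 3 day and the 4 day limit.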
Incomplete jobs
[No Data] Incomplete jobs (not restarted) of last 24h alert
Metric name | Value
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1
Updated by livdywan over 3 years ago
New alert triggered just now:
[Alerting] Incomplete jobs (not restarted) of last 24h alert
Metric name | Value
Incompletes (not restarted) of last 24h | 362.000
http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1
Updated by okurz over 3 years ago
- Tags set to alert, alerts, osd, grafana, filesystem, space, job age
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by livdywan over 3 years ago
The incompletes seem to manifest themselves as Bareword "load_pynfs_tests" not allowed according to @nicksinger which refers to products/sle/main.pm:746. Normally "bareword" here means the function wasn't defined properly.
Updated by livdywan over 3 years ago
cdywan wrote:
The incompletes seem to manifest themselves as Bareword "load_pynfs_tests" not allowed according to @nicksinger which refers to products/sle/main.pm:746. Normally "bareword" here means the function wasn't defined properly.
I didn't see a relevant change, but also didn't see further occurrences of this. Not a general openQA issue anyway, I assume. Just checking for completeness (no pun intended).
Job age
File systems
Maybe related? I couldn't determine whether anyone actively made changes there, for example whether somebody just deleted files or the clean-up kicked in.
Updated by livdywan over 3 years ago
- Subject changed from Several unhandled alerts triggered to Several unhandled alerts triggered regarding incompletes and running out of space
Updated by okurz over 3 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
For the "job age" alerts I found one older "zkvm" job that was stuck in the scheduled state because all workers with "zkvm" have been removed. I cancelled the job and left an explanatory comment in the job. I hope that no openQA test reviewers just blindly retrigger the tests again :)
I also checked my email inbox and https://monitor.qa.suse.de/alerting/list?state=not_ok and found no further problems. All the filesystems also look OK for now. I also did https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/502 for the assets filesystem usage limit, in coordination with what I provided for the corresponding Nagios check.