action #93612

closed

Several unhandled alerts triggered regarding incompletes and running out of space

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2021-06-08
Due date:
% Done: 0%
Estimated time:
Description

These alerts are not necessarily related, but to start with I'd like to preserve them because this is a lot of unhandled alerts in succession. It may make sense to split this ticket or move some of the alerts out.

Note: All of these went OK eventually as far as I can tell but I couldn't determine why.

File systems

http://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1

[No Data] File systems alert

One of the file systems is too full
Metric name | Value

Job age

[Alerting] Job age (scheduled) (median) alert

Check for overall decrease of "time to start". Possible reasons for regression:

  • Not enough resources
  • Too many tests scheduled due to misconfiguration

2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision.
Related progress issue: https://progress.opensuse.org/issues/65975

Metric name | Value
50% percentile (median) | 566107.500

[Alerting] Job age (scheduled) (max) alert

Jobs not scheduled for 4 days (345600s). Possible reasons:

  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value.

Metric name | Value
50% percentile (max) | 525289.000

[Alerting] Job age (scheduled) (max) alert

Jobs not scheduled for 4 days (345600s). Possible reasons:

  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely

See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value.

Metric name | Value
50% percentile (max) | 556986.500

[Alerting] Job age (scheduled) (median) alert

Check for overall decrease of "time to start". Possible reasons for regression:

  • Not enough resources
  • Too many tests scheduled due to misconfiguration

2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision.
Related progress issue: https://progress.opensuse.org/issues/65975

Metric name | Value
50% percentile (median) | 620407.500

http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1
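For reference, the reported values can be compared against the limits quoted in the alert descriptions. This is only a rough sketch: it assumes the reported values are in seconds, like the limits, and the function and variable names are made up for illustration.

```python
SECONDS_PER_DAY = 86400

# Limits quoted in the alert descriptions above.
MEDIAN_LIMIT_S = 259200  # 3 days
MAX_LIMIT_S = 345600     # 4 days

# Values reported by the four job age alerts, in the order they appear.
# Assumption: the values are in seconds, like the limits.
observed = [
    ("median", 566107.5),
    ("max", 525289.0),
    ("max", 556986.5),
    ("median", 620407.5),
]


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


def exceeds_limit(kind, value_s):
    """Return True if the value is above the limit for its alert kind."""
    limit = MEDIAN_LIMIT_S if kind == "median" else MAX_LIMIT_S
    return value_s > limit


for kind, value_s in observed:
    days = seconds_to_days(value_s)
    print(f"{kind}: {days:.2f} days, over limit: {exceeds_limit(kind, value_s)}")
```

Under that assumption every reported value is above its limit, which matches all four alerts being in the Alerting state.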

Incomplete jobs

[No Data] Incomplete jobs (not restarted) of last 24h alert

Metric name | Value

http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1

Actions #1

Updated by livdywan over 3 years ago

New alert triggered just now:

[Alerting] Incomplete jobs (not restarted) of last 24h alert

Metric name | Value
Incompletes (not restarted) of last 24h | 362.000

http://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=alert&viewPanel=17&orgId=1

Actions #2

Updated by okurz over 3 years ago

  • Tags set to alert, alerts, osd, grafana, filesystem, space, job age
  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
  • Target version set to Ready

Actions #3

Updated by livdywan over 3 years ago

The incompletes seem to manifest themselves as Bareword "load_pynfs_tests" not allowed, according to @nicksinger, which refers to products/sle/main.pm:746. Normally "bareword" here means the function wasn't defined properly.

Actions #4

Updated by livdywan over 3 years ago

cdywan wrote:

The incompletes seem to manifest themselves as Bareword "load_pynfs_tests" not allowed, according to @nicksinger, which refers to products/sle/main.pm:746. Normally "bareword" here means the function wasn't defined properly.

I didn't see a relevant change, but also didn't see further occurrences of this. Not a general openQA issue anyway, I assume. Just checking for completeness (no pun intended).

Job age
File systems

Maybe related? I couldn't determine whether anyone actively made changes there, whether somebody just deleted files, or whether the clean-up kicked in.

Actions #5

Updated by livdywan over 3 years ago

  • Subject changed from Several unhandled alerts triggered to Several unhandled alerts triggered regarding incompletes and running out of space

Actions #6

Updated by okurz over 3 years ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz

For the "job age" alerts I found one older "zkvm" job that was stuck in the scheduled state because all workers with "zkvm" have been removed. I cancelled the job and provided an explanation comment in the job. I hope that no openQA test reviewers just blindly retrigger the tests again :)
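The kind of check that would surface such a job can be sketched as follows. This is a minimal illustration on made-up data, not openQA's API or database schema: the `worker_class` and `classes` keys, the job IDs, and the function name are all hypothetical.

```python
def jobs_without_worker(scheduled_jobs, workers):
    """Return the scheduled jobs whose worker class no worker offers."""
    available = set()
    for worker in workers:
        available.update(worker["classes"])
    return [job for job in scheduled_jobs if job["worker_class"] not in available]


# Hypothetical example data: no remaining worker offers "zkvm".
workers = [
    {"name": "worker1", "classes": ["qemu_x86_64"]},
    {"name": "worker2", "classes": ["qemu_x86_64", "qemu_aarch64"]},
]
scheduled_jobs = [
    {"id": 1, "worker_class": "zkvm"},
    {"id": 2, "worker_class": "qemu_x86_64"},
]

stuck = jobs_without_worker(scheduled_jobs, workers)
print([job["id"] for job in stuck])
```

A job found this way can never be picked up, so its age keeps growing until it is cancelled or a matching worker comes back.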

I also checked my email inbox and https://monitor.qa.suse.de/alerting/list?state=not_ok and found no further problems. Also, all the filesystems look ok for now. I also did https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/502 for the assets filesystem usage limit, in coordination with what I provided for the corresponding nagios check.
