action #95443

closed

Variants of Job age (scheduled) alerts on Grafana on Sunday and Monday size:S

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2021-07-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

I observed several unhandled alerts on Grafana on Sunday and Monday.

[Alerting] Job age (scheduled) (max) alert

Jobs not scheduled for 4 days (345600s). Possible reasons:
  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Metric name: 50% percentile (max)
Value: 501773.500

[Alerting] Job age (scheduled) (median) alert

Check for overall decrease of "time to start". Possible reasons for regression:
  • Not enough resources
  • Too many tests scheduled due to misconfiguration
2020-11-27: Alert limit set to 259200s = 3d, see https://progress.opensuse.org/issues/73174#note-2 about the decision.
Related progress issue: https://progress.opensuse.org/issues/65975
Metric name: 50% percentile (median)
Value: 501113.500

[Alerting] Job age (scheduled) (max) alert

Jobs not scheduled for 4 days (345600s). Possible reasons:
  • There are no online workers for selected scheduled jobs, misconfiguration on the side of tests likely
See https://progress.opensuse.org/issues/73174#note-2 for an explanation of the selection of the specific value
Metric name: 50% percentile (max)
Value: 954811.000

[No Data] Incomplete jobs (not restarted) of last 24h alert
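
To put the reported values next to the alert limits, the arithmetic can be sketched as follows. This is only an illustration: the thresholds are the ones quoted in the alert descriptions above, the values are assumed to be in seconds like the thresholds, and all names in the snippet are made up for this example rather than taken from the Grafana configuration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Alert limits quoted in the alert descriptions (in seconds).
THRESHOLDS = {"max": 345600, "median": 259200}

# Values reported by the three alerts, assumed to be seconds.
OBSERVED = [("max", 501773.5), ("median", 501113.5), ("max", 954811.0)]


def age_in_days(seconds):
    """Convert a job age in seconds to days."""
    return seconds / SECONDS_PER_DAY


def exceeds(kind, seconds):
    """Return True if the value is above the limit for this alert kind."""
    return seconds > THRESHOLDS[kind]


for kind, value in OBSERVED:
    limit = THRESHOLDS[kind]
    print(
        f"{kind}: {age_in_days(value):.2f} days "
        f"(limit {age_in_days(limit):.0f} days, alerting={exceeds(kind, value)})"
    )
```

Under that assumption, all three values are well above their limits: roughly 5.8 days for the first two alerts and about 11 days for the third.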

Acceptance criteria

  • AC1: The cause of the alerts is clear or a follow-up ticket is filed with a feature request to have the necessary details next time

Suggestions

  • Look at the alert history in Grafana
  • Look at all tests and check for cancelled jobs or removed workers
  • Other issues handling these alerts recently: #93612 and #92110 according to a quick search.
Actions #1

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #2

Updated by livdywan over 3 years ago

  • Subject changed from Variants of Job age (scheduled) alerts on Sunday and Monday to Variants of Job age (scheduled) alerts on Grafana on Sunday and Monday size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz over 3 years ago

  • Priority changed from Normal to Urgent

We should be more diligent with our alert handling.

Actions #4

Updated by tinita over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #5

Updated by tinita over 3 years ago

I cancelled this job: https://openqa.suse.de/tests/6353757#comments

The WORKER_CLASS is set to s390-kvm-sle15, but there is no worker for that class. The machine is s390x-kvm-sle12.
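
A mismatch like this can be illustrated with a small check. It is a minimal sketch, assuming a job can only be picked up by a worker that advertises every class named in the job's WORKER_CLASS (comma-separated). The worker names and their class lists below are hypothetical and not taken from openqa.suse.de, and the helper functions are not part of openQA.

```python
def parse_classes(value):
    """Split a comma-separated WORKER_CLASS string into a set of class names."""
    return {c.strip() for c in value.split(",") if c.strip()}


def matching_workers(job_worker_class, workers):
    """Return the names of workers that offer all classes the job requires."""
    required = parse_classes(job_worker_class)
    return [
        name
        for name, classes in workers.items()
        if required <= parse_classes(classes)
    ]


# Hypothetical worker inventory, for illustration only.
workers = {
    "worker-a:1": "s390x-kvm-sle12",
    "worker-b:1": "qemu_x86_64",
}

# No worker offers the class the job asked for, so the job stays scheduled.
print(matching_workers("s390-kvm-sle15", workers))
```

With this inventory the list is empty for s390-kvm-sle15, which is the situation that lets the job age grow until the alert fires.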

Actions #6

Updated by tinita over 3 years ago

Got feedback from coolgw that the job configuration was a mistake and that cancelling was OK.

This view also shows that the age for the mentioned machine type was increasing: https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?viewPanel=4&from=now-30d&to=now

Actions #7

Updated by tinita over 3 years ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by ilausuch over 3 years ago

I checked on openqa.suse.de and the oldest scheduled jobs (there are many) are 9 hours old at the moment.

Actions #9

Updated by tinita over 3 years ago

  • Status changed from Feedback to Resolved