action #111590

closed

[alert] HPC jobs not picked up for multiple days, job age alert triggered

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-05-25
Due date: 2022-06-14
% Done: 0%
Estimated time:

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule (Resolved, nicksinger, 2022-06-01)

Actions #1

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by kraih almost 2 years ago

It looks like the situation has changed a bit since the ticket was opened. Now I see this job as cancelled and the parent job as failed (with what looks like a normal test failure).

Actions #3

Updated by kraih almost 2 years ago

  • Assignee set to kraih
Actions #4

Updated by kraih almost 2 years ago

  • Assignee deleted (kraih)

Putting the ticket back into the queue for estimation, since the expected result here is not quite clear to me.

Actions #5

Updated by kraih almost 2 years ago

Btw., most of the cancellations do not have a "scheduled for more than X days" reason, so they weren't triggered by the max_job_scheduled_time limit in the scheduler.
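A minimal sketch of how that check could be repeated over a list of job records. The dictionary keys (`state`, `reason`) and the exact reason wording are assumptions based on the phrasing above, not a confirmed openQA schema, and `classify_cancellations` is a hypothetical helper:

```python
import re
from collections import Counter

# Assumed wording of the scheduler's reason, taken from the comment above.
SCHEDULED_TOO_LONG = re.compile(r"scheduled for more than \d+ days")


def classify_cancellations(jobs):
    """Count cancelled jobs by the kind of reason they carry.

    `jobs` is an iterable of dicts with hypothetical keys "state" and "reason".
    """
    counts = Counter()
    for job in jobs:
        if job.get("state") != "cancelled":
            continue  # only cancelled jobs are of interest here
        reason = job.get("reason") or ""
        if not reason:
            counts["no reason"] += 1
        elif SCHEDULED_TOO_LONG.search(reason):
            counts["scheduler age limit"] += 1
        else:
            counts["other reason"] += 1
    return counts
```

If "no reason" dominates the result, the cancellations most likely did not come from the scheduler's age limit.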

Actions #6

Updated by livdywan almost 2 years ago

The way I read the description, this is the situation:

  • I see a lot of old, cancelled jobs, e.g. 12 days old, and this is also true for the children of the clones of the parent.
  • The parent failed due to network issues. This shouldn't be related to cancelled children.
  • I'm assuming the jobs were manually cancelled. @okurz suggests this was automatic. Either way it means the jobs were never run.
Actions #7

Updated by kraih almost 2 years ago

I would suggest adding cancellation reasons everywhere, to make similar issues easier to investigate.

Actions #8

Updated by livdywan almost 2 years ago

Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681

And I see why Oli mentioned the prio. This would affect what's now called "cancelled based on job settings via API call" in my PR.

Actions #9

Updated by livdywan almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

I still want to estimate this to make sure we understand the issue, but I'm taking it now since it's Urgent.

Actions #10

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-06-14

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan almost 2 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681

PR got merged.

the culprit seems to be in particular "hpc" test scenarios

Down to 85 scheduled hpc scenarios right now (I don't know how to link a web UI query with the same meaning).

The number has gone down steadily without further action from my end besides monitoring it.
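For lack of a web UI link, the count could be reproduced from an exported list of job records. This is only a sketch: the keys `state`, `test` and `t_created`, the ISO timestamp format, and the helper `scheduled_hpc_jobs` are assumptions for illustration, not something taken from the openQA code:

```python
from datetime import datetime, timedelta


def scheduled_hpc_jobs(jobs, now, min_age=timedelta(0)):
    """Return scheduled jobs whose test name contains "hpc".

    `jobs` is an iterable of dicts with hypothetical keys "state", "test"
    and "t_created" (ISO 8601 timestamp). Only jobs that have been waiting
    for at least `min_age` are returned.
    """
    selected = []
    for job in jobs:
        if job.get("state") != "scheduled":
            continue
        if "hpc" not in job.get("test", "").lower():
            continue
        created = datetime.fromisoformat(job["t_created"])
        if now - created >= min_age:
            selected.append(job)
    return selected
```

Calling it with `min_age=timedelta(days=7)` would list only the hpc jobs that have been waiting for a week or longer, which is the group relevant to the job age alert.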

Actions #12

Updated by livdywan almost 2 years ago

Unpaused. I see no reason for it to be paused currently.

Actions #13

Updated by livdywan almost 2 years ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

No issues with hpc tests currently. I asked in #eng-testing about the 3 🥧️ tests currently stuck because the machine is offline, which is unrelated to this ticket.

Actions #14

Updated by okurz almost 2 years ago

  • Status changed from Resolved to Feedback

The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.

Actions #15

Updated by livdywan almost 2 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.

The alert was fine when I resolved the ticket. See my comment with regard to the jobs that weren't getting picked up.

Actions #16

Updated by okurz almost 2 years ago

  • Related to action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule added