action #111590

[alert] HPC jobs not picked up for multiple days, job age alert triggered

Added by okurz about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2022-05-25
Due date: 2022-06-14
% Done: 0%
Estimated time:
Difficulty:

Related issues

Related to openQA Infrastructure - action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule (Resolved, 2022-06-01)

History

#1 Updated by okurz about 1 month ago

  • Description updated (diff)

#2 Updated by kraih about 1 month ago

It looks like the situation has changed a bit since the ticket was opened. Now I see this job as cancelled, and the parent job as failed (with what looks like a normal test failure).

#3 Updated by kraih about 1 month ago

  • Assignee set to kraih

#4 Updated by kraih about 1 month ago

  • Assignee deleted (kraih)

Putting the ticket back into the queue for estimation, since the expected result here is not quite clear to me.

#5 Updated by kraih about 1 month ago

Btw, most of the cancellations do not have a "scheduled for more than X days" reason, so they weren't triggered by the max_job_scheduled_time limit in the scheduler.
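
To illustrate the kind of check meant here, a minimal sketch of an age-based cancellation that records its reason. This is not openQA's actual scheduler code; the function name and the job fields `id`, `state` and `t_created` are assumptions made for the example, and only the reason wording is taken from the comment above.

```python
from datetime import datetime, timedelta, timezone


def find_stale_scheduled_jobs(jobs, max_days, now=None):
    """Return (job_id, reason) pairs for scheduled jobs older than max_days.

    `jobs` is a list of dicts with the hypothetical keys "id", "state"
    and "t_created" (a timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    limit = timedelta(days=max_days)
    stale = []
    for job in jobs:
        # Only jobs still waiting in the schedule are candidates.
        if job["state"] != "scheduled":
            continue
        if now - job["t_created"] > limit:
            reason = f"scheduled for more than {max_days} days"
            stale.append((job["id"], reason))
    return stale
```

A job cancelled through such a path would carry the "scheduled for more than X days" reason, which is exactly what most of the cancelled jobs here are missing.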

#6 Updated by cdywan about 1 month ago

The way I read the description, this is the situation:

  • I see a lot of old, cancelled jobs, i.e. 12 days old, and this is also true for the children of the clones of the parent
  • The parent failed due to network issues. This shouldn't be related to cancelled children.
  • I'm assuming the jobs were manually cancelled. okurz suggests this was automatic. Either way it means the jobs were never run.

#7 Updated by kraih about 1 month ago

I would suggest adding cancellation reasons everywhere, to make similar issues easier to investigate.

#8 Updated by cdywan about 1 month ago

Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681

And I see why Oli mentioned the prio. This would affect what's now called "cancelled based on job settings via API call" in my PR.

#9 Updated by cdywan about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to cdywan

I still want to estimate this to make sure we understand the issue, but I'm taking it now since it's Urgent.

#10 Updated by openqa_review about 1 month ago

  • Due date set to 2022-06-14

Setting due date based on mean cycle time of SUSE QE Tools

#11 Updated by cdywan about 1 month ago

  • Status changed from In Progress to Feedback

cdywan wrote:

Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681

PR got merged.

the culprit seems to be in particular "hpc" test scenarios

Down to 85 scheduled hpc scenarios right now (I don't know how to link a web UI query with the same meaning).

The number has gone down steadily without further action from my end besides monitoring it.
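
Lacking a linkable web UI query, a count like the one above could be reproduced from a list of job records. This is only a sketch: the field names `state` and `test`, and the idea that hpc scenarios can be recognised by "hpc" in the test name, are assumptions for illustration, not something confirmed in this ticket.

```python
def count_scheduled(jobs, keyword="hpc"):
    """Count scheduled jobs whose test name contains `keyword`.

    `jobs` is a list of dicts with the assumed keys "state" and "test".
    Matching is case-insensitive on the test name.
    """
    return sum(
        1
        for job in jobs
        if job.get("state") == "scheduled"
        and keyword in job.get("test", "").lower()
    )
```

Running this periodically against fresh job data would show whether the number keeps going down.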

#12 Updated by cdywan about 1 month ago

Unpaused. I see no reason for it to be paused currently.

#13 Updated by cdywan about 1 month ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

No issues with hpc tests currently, asked in #eng-testing concerning the 3 🥧️ tests currently stuck because the machine is offline, which is unrelated to this ticket.

#14 Updated by okurz about 1 month ago

  • Status changed from Resolved to Feedback

The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.

#15 Updated by cdywan about 1 month ago

  • Status changed from Feedback to Resolved

okurz wrote:

The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.

The alert was fine when I resolved the ticket. See my comment with regard to the jobs that weren't getting picked up.

#16 Updated by okurz about 1 month ago

  • Related to action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule added
